mysql - Write MapReduce ETL job


I have a table with ~300 GB of data in MySQL and want to add a new column to it. When I alter the table with an ALTER TABLE command it never completes and the process dies. I am therefore planning to write a Sqoop job to pull the data from the table and dump it into HDFS, create a new database in MySQL containing the table with the additional column, and then re-import the data from HDFS.
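For reference, a minimal sketch of the dump step (a Sqoop import from MySQL into HDFS) might look like the following; the target directory, split column, and mapper count are assumptions, and the connection settings mirror the export command shown further below:

sqoop import --connect jdbc:mysql://localhost:3306/graph1 --table nodes \
    --username root --password password \
    --target-dir <dir-path> --fields-terminated-by ',' \
    --split-by id -m 8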

Table structure:

CREATE TABLE `nodes` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `type` int(11) NOT NULL,
  `name` varchar(50) NOT NULL,
  `postcode` varchar(4) DEFAULT NULL,
  `updated` datetime DEFAULT NULL,
  `db_updated` datetime DEFAULT CURRENT_TIMESTAMP,
  `useragent` bigint(20) DEFAULT NULL,
  `last_seen` date DEFAULT NULL, -- newly added column
  PRIMARY KEY (`id`),
  UNIQUE KEY `akaid_index` (`type`,`name`),
  KEY `useragent_idx` (`useragent`),
  KEY `type` (`type`),
  CONSTRAINT `useragentfk` FOREIGN KEY (`useragent`) REFERENCES `useragents` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=1091725696 DEFAULT CHARSET=latin1;

Sqoop command:

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' 

I'm getting the error below:

Error: java.io.IOException: Can't export data, please check failed map task logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: Can't parse input data: 'null'
    at nodes.__loadFromFields(nodes.java:470)
    at nodes.parse(nodes.java:388)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
    ... 10 more
Caused by: java.lang.NumberFormatException: For input string: "null"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.valueOf(Long.java:540)
    at nodes.__loadFromFields(nodes.java:467)
    ... 12 more

The file in HDFS contains the following records:

1289603991,1,fee1cee723bdb0bc499c443765b40e3d,,2016-04-13 10:19:59,2016-04-14 03:44:55,5296252
1289603992,1,edf65c2e7b89388fe9068cc3a898a3fd,,2016-04-13 10:20:00,2016-04-14 03:44:55,5411481
1289603993,1,5760fd1cca92a65ce6f2db43853fc118,,2016-04-13 10:19:59,2016-04-14 03:44:55,4441745
1289603994,1,65dd92c80df5581f55bc60f3e997ec05,,2016-04-13 10:19:59,2016-04-14 03:44:55,5332084
1289603995,1,7654a84428f3064828f5972cfce5f8e6,,2016-04-13 10:20:00,2016-04-14 03:44:55,5202243
1289603996,1,84c270212fe5f3a52cb2bd75403da058,,2016-04-13 10:20:00,2016-04-14 03:44:55,5398729
1289603997,1,a486382c4fc296a5e8d3c0491568c22c,,2016-04-13 10:19:57,2016-04-14 03:44:55,5289170
1289603998,112,2_3nns7yxpms_xv3imjbiw04bqf1snc2tjrtfj5tcx98,,2016-04-13 10:20:00,2016-04-14 03:44:55,null
1289603999,1,a3607df77e025b12c1728f62589857fa,,2016-04-13 10:19:59,2016-04-14 03:44:55,12
1289604000,113,570e1d4e6372cd9c,,2016-04-13 10:19:59,2016-04-14 03:44:55,null
1289604001,113,57023dd016258fbf,,2016-04-13 10:20:00,2016-04-14 03:44:55,null

After adding the --input-null-string and --input-null-non-string arguments, the Sqoop job runs but fails for the following reason:

2016-06-07 18:11:37,750 ERROR [Thread-9] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '1289603991' for key 'PRIMARY'
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
    at com.mysql.jdbc.Util.getInstance(Util.java:387)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3966)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3902)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2526)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2673)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192)
    at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:233)
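For reference, the amended export command with the null-handling arguments would have looked roughly like this (a sketch; the null token is assumed to be the literal string 'null' seen in the HDFS records above):

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes \
    --username root --password password \
    --export-dir <dir-path> --input-fields-terminated-by ',' \
    --input-null-string 'null' --input-null-non-string 'null'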

Since the table has an AUTO_INCREMENT primary key column, I am getting the above error. Is there another way to alter the table that is reasonably fast? According to this thread (Changing large MySQL InnoDB tables), altering large InnoDB tables seems to take a very long time, sometimes days.

Any alternative suggestion would be appreciated.

Your data violates the primary key constraint:

If a table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints.

You need to truncate the table before executing the export command, or update the existing data using the argument:

--update-key id
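In practice, the export in update mode would look roughly like the original command with the update key added (a sketch, reusing the same connection settings and the assumed 'null' token):

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes \
    --username root --password password \
    --export-dir <dir-path> --input-fields-terminated-by ',' \
    --input-null-string 'null' --input-null-non-string 'null' \
    --update-key id

If some rows in HDFS do not yet exist in the target table, adding --update-mode allowinsert turns the export into an upsert; with the MySQL connector this is implemented via INSERT ... ON DUPLICATE KEY UPDATE.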