In order to run a few ML algorithms, I need to create several columns of data. Each of these columns involves intense calculations: keeping moving averages and recording information as I go through each row (and updating it along the way). I've done a mock-up with a simple Python script and it works, and I am now looking to translate it into a Scala Spark script that can run on a larger data set.

The issue is that for this to be highly efficient with Spark SQL, it seems preferable to use the built-in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems like a thought-intensive process, so I'm wondering what the downsides would be if I manually create the new column values by iterating through each row, keeping track of variables, and inserting the column value at the end.
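To make this concrete, the per-row logic I have in mind looks roughly like the following (a minimal sketch with made-up names over a plain Scala collection, not my actual Python mock or the real data):

// A minimal, hypothetical sketch of the per-row logic: walk the rows in
// order, keep running state (sum and count), and emit a "running average"
// value as the new column for each row.
case class Reading(id: Int, value: Double)

def withRunningAvg(rows: Seq[Reading]): Seq[(Reading, Double)] = {
  var sum = 0.0
  var count = 0
  rows.map { r =>
    sum += r.value
    count += 1
    (r, sum / count) // new column value computed from the state seen so far
  }
}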
You can convert an RDD into a DataFrame. Then use map on the data frame and process each row as you wish. If you need to add a single new column, you can use withColumn, but that only allows one column to be added and it happens for the entire DataFrame. If you want more columns added, then inside the map method (a sketch of this map is shown after the steps):
a. you can gather the new values based on the calculations

b. add these new column values to the main RDD as below

val newColumns: Seq[Any] = Seq(newCol1, newCol2)
Row.fromSeq(row.toSeq.init ++ newColumns)

Here row is the reference to the row in the map method.
c. create a new schema as below

val newColumnsStructType = StructType(Seq(
  StructField("NewColName1", IntegerType),
  StructField("NewColName2", IntegerType)))
d. add it to the old schema

val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. create a new DataFrame with the new columns

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
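For completeness, the row processing in steps a and b happens inside a map over the DataFrame's underlying RDD, roughly like this (the calculations are placeholders for your own logic; the resulting newRDD is what gets passed to createDataFrame in step e):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// steps a and b: compute the new values for each row and append them,
// producing the RDD of rows used in step e
val newRDD: RDD[Row] = mainDataFrame.rdd.map { row =>
  val newCol1 = row.getInt(0) * 2   // placeholder calculation
  val newCol2 = row.getInt(0) + 1   // placeholder calculation
  val newColumns: Seq[Any] = Seq(newCol1, newCol2)
  Row.fromSeq(row.toSeq.init ++ newColumns) // .init drops the last original column, as in step b
}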