In order to run a few ML algorithms, I need to create several columns of data. Each of these columns involves intense calculations: keeping moving averages and recording information as I go through each row (and updating it along the way). I've done a mock-up with a simple Python script and it works, and I am now looking to translate it into a Scala Spark script that can run on a larger data set.

The issue is that for this to be highly efficient with Spark SQL, it seems preferable to use the built-in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems like a thought-intensive process, so I'm wondering what the downsides would be if I manually create the new column values by iterating through each row, keeping track of variables, and inserting the column value at the end.
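To make this concrete, the per-row logic I have in mind looks roughly like the following (a minimal sketch with made-up names over a plain Scala collection, not my actual Python mock or the real data):

// A minimal, hypothetical sketch of the per-row logic: walk the rows in
// order, keep running state (sum and count), and emit a "running average"
// value as the new column for each row.
case class Reading(id: Int, value: Double)

def withRunningAvg(rows: Seq[Reading]): Seq[(Reading, Double)] = {
  var sum = 0.0
  var count = 0
  rows.map { r =>
    sum += r.value
    count += 1
    (r, sum / count) // new column value computed from the state seen so far
  }
}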
You can convert an RDD into a DataFrame. Then use map on the data frame and process each row as you wish. If you need to add a single new column, you can use withColumn, but that only allows one column to be added and it happens for the entire DataFrame. If you want more columns added, then inside the map method (a sketch of this map is shown after the steps):
a. you can gather the new values based on the calculations

b. add these new column values to the main RDD as below

val newColumns: Seq[Any] = Seq(newCol1, newCol2)
Row.fromSeq(row.toSeq.init ++ newColumns)

Here row is the reference to the row in the map method.
c. create a new schema as below

val newColumnsStructType = StructType(Seq(
  StructField("NewColName1", IntegerType),
  StructField("NewColName2", IntegerType)))
d. add it to the old schema

val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. create a new DataFrame with the new columns

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
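For completeness, the row processing in steps a and b happens inside a map over the DataFrame's underlying RDD, roughly like this (the calculations are placeholders for your own logic; the resulting newRDD is what gets passed to createDataFrame in step e):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// steps a and b: compute the new values for each row and append them,
// producing the RDD of rows used in step e
val newRDD: RDD[Row] = mainDataFrame.rdd.map { row =>
  val newCol1 = row.getInt(0) * 2   // placeholder calculation
  val newCol2 = row.getInt(0) + 1   // placeholder calculation
  val newColumns: Seq[Any] = Seq(newCol1, newCol2)
  Row.fromSeq(row.toSeq.init ++ newColumns) // .init drops the last original column, as in step b
}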