I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function among the Spark aggregation functions.
For example, Hive has percentile_approx and I can use it in the following way:

hiveContext.sql("select percentile_approx(open_rate, 0.10) from mytable")

But I want to use a Spark DataFrame for performance reasons.
Sample data set:

|user id|open_rate|
|-------|---------|
|a1     |10.3     |
|b1     |4.04     |
|c1     |21.7     |
|d1     |18.6     |

I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want something like this:
df.select($"id",percentile($"open_rate",0.1)).show
Spark SQL and the Scala DataFrame/Dataset APIs are executed by the same engine: equivalent operations generate equivalent execution plans. You can see the execution plans with explain:
sql(...).explain
df.explain

When it comes to your specific question, a common pattern is to intermix Spark SQL and the Scala DSL syntax because, as you have discovered, their capabilities are not yet equivalent. (Another example is the difference between SQL's explode() and the DSL's explode(), the latter being more powerful but also more inefficient due to marshalling.)
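As a minimal sketch of the "same engine, same plans" point, assuming Spark 2.x with a local SparkSession (table and column names here are made up for illustration), the same query can be expressed both ways and explained side by side:

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    // Local session just for illustration.
    val spark = SparkSession.builder()
      .master("local[*]").appName("explain-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(("a1", 10.3), ("b1", 4.04), ("c1", 21.7), ("d1", 18.6))
      .toDF("user_id", "open_rate")
    df.createOrReplaceTempView("mytable")

    // Same logical query, once in SQL and once in the Scala DSL:
    spark.sql("select user_id from mytable where open_rate > 10").explain()
    df.filter($"open_rate" > 10).select($"user_id").explain()
    // Both go through the Catalyst optimizer, so the printed physical
    // plans are equivalent.

    spark.stop()
  }
}
```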
The simple way is as follows:
df.registerTempTable("tmp_tbl")
val newDF = sql(/* tmp_tbl */)
// continue using newDF with the Scala DSL

What you need to keep in mind if you go the simple way is that temporary table names are cluster-global (up to 1.6.x). Therefore, you should use randomized table names if the code may run simultaneously more than once on the same cluster.
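A sketch of this pattern applied to the original percentile question, with a randomized table name to avoid the cluster-global collision issue. This assumes Spark 2.x, where createOrReplaceTempView replaces registerTempTable and percentile_approx is available in SQL without a Hive metastore; the helper name sqlOnDF is made up:

```scala
import java.util.UUID
import org.apache.spark.sql.{DataFrame, SparkSession}

// Runs a SQL statement against df through a uniquely named temp view,
// so concurrent jobs on the same cluster don't collide on the name.
def sqlOnDF(spark: SparkSession, df: DataFrame)(buildSql: String => String): DataFrame = {
  val tmpName = "tmp_tbl_" + UUID.randomUUID().toString.replace("-", "")
  df.createOrReplaceTempView(tmpName) // registerTempTable in 1.6.x
  // spark.sql analyzes the plan eagerly, so the view can be dropped
  // right after the DataFrame is created.
  try spark.sql(buildSql(tmpName))
  finally spark.catalog.dropTempView(tmpName)
}
```

Usage for the 10th percentile of open_rate:

```scala
val p10 = sqlOnDF(spark, df) { t =>
  s"select percentile_approx(open_rate, 0.10) as p10 from $t"
}
```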
On my team this pattern is common enough that we have added a .sql() implicit to DataFrame that automatically registers and unregisters a temp table for the scope of the SQL statement.
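The answer doesn't show that implicit, but a rough sketch could look like this, assuming Spark 2.x; the names SqlOps and the {df} placeholder convention are invented for illustration:

```scala
import java.util.UUID
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameSqlImplicits {
  implicit class SqlOps(val df: DataFrame) extends AnyVal {
    // Registers df under a random temp-view name, substitutes that name
    // for the {df} placeholder in the statement, and unregisters the
    // view once the resulting DataFrame has been analyzed.
    def sql(statement: String)(implicit spark: SparkSession): DataFrame = {
      val name = "tmp_" + UUID.randomUUID().toString.replace("-", "")
      df.createOrReplaceTempView(name)
      try spark.sql(statement.replace("{df}", name))
      finally spark.catalog.dropTempView(name)
    }
  }
}

// Usage (hypothetical):
// import DataFrameSqlImplicits._
// val p10 = df.sql("select percentile_approx(open_rate, 0.10) as p10 from {df}")
```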