scala - How to calculate percentile of a column in a DataFrame in Spark?


I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function in Spark's aggregation functions.

For example, Hive has percentile_approx, which can be used in the following way:

hiveContext.sql("select percentile_approx(open_rate, 0.10) from mytable")

But I want to do it using the Spark DataFrame API, for performance reasons.

Sample data set:

|user id|open_rate|
|-------|---------|
|a1     |10.3     |
|b1     |4.04     |
|c1     |21.7     |
|d1     |18.6     |

I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want something like this:

df.select($"id",percentile($"open_rate",0.1)).show 

Spark SQL and the Scala DataFrame/Dataset APIs are executed by the same engine. Equivalent operations generate equivalent execution plans. You can see the execution plans with explain:

sql(...).explain
df.explain

When it comes to your specific question, a common pattern is to intermix Spark SQL and Scala DSL syntax because, as you have discovered, their capabilities are not yet equivalent. (Another example is the difference between SQL's explode() and the DSL's explode(), the latter being more powerful but also more inefficient due to marshalling.)

The simple way is as follows:

df.registerTempTable("tmp_tbl")
val newDF = sql(/* query over tmp_tbl */)
// continue using newDF with the Scala DSL
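A sketch of this pattern applied to your percentile_approx query follows. It assumes a HiveContext (percentile_approx is a Hive UDAF, so a plain SQLContext won't resolve it); the variable names and the use of array(0.1, 0.2) to request several percentiles at once are illustrative, not part of your original code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[*]", "percentile-example")
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// The sample data set from the question.
val df = sc.parallelize(Seq(
  ("a1", 10.3), ("b1", 4.04), ("c1", 21.7), ("d1", 18.6)
)).toDF("user_id", "open_rate")

df.registerTempTable("tmp_tbl")
// Hive's percentile_approx also accepts an array of percentiles.
val pctDF = hiveContext.sql(
  "select percentile_approx(open_rate, array(0.1, 0.2)) as pct from tmp_tbl")
// pctDF is an ordinary DataFrame; continue with the Scala DSL from here.
```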

What you need to keep in mind if you go this simple way is that temporary table names are cluster-global (up to 1.6.x). Therefore, you should use randomized table names if the code may run simultaneously more than once on the same cluster.
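A minimal sketch of such a randomized name (the helper name is an assumption, not a Spark API):

```scala
import java.util.UUID

// Generate a collision-resistant temp table name; underscores keep it a
// valid SQL identifier. The helper name is illustrative.
def randomTableName(prefix: String = "tmp_tbl"): String =
  s"${prefix}_${UUID.randomUUID().toString.replace("-", "_")}"
```

Usage would then be `df.registerTempTable(randomTableName())` instead of a fixed name.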

On my team the pattern is common enough that we have added a .sql() implicit to DataFrame that automatically registers and unregisters a temp table for the scope of the SQL statement.
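Such an implicit could look roughly like this. This is a hypothetical sketch, not the team's actual code; it assumes Spark 1.x's SQLContext, registerTempTable, and dropTempTable, and relies on sql() resolving the query eagerly so the table can be dropped right after:

```scala
import java.util.UUID
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical sketch: register the DataFrame under a random temp table name,
// run the query against it, and unregister the table afterwards.
implicit class SqlOps(df: DataFrame) {
  def sql(buildQuery: String => String)(implicit ctx: SQLContext): DataFrame = {
    val table = "tmp_" + UUID.randomUUID().toString.replace("-", "_")
    df.registerTempTable(table)
    try ctx.sql(buildQuery(table))
    finally ctx.dropTempTable(table)
  }
}

// Usage (query text is illustrative):
// val newDF = df.sql(t => s"select percentile_approx(open_rate, 0.1) from $t")
```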