I am trying to calculate the percentile of a column in a DataFrame, but I can't find any `percentile_approx` function among Spark's aggregation functions.

For example, Hive has `percentile_approx`, and I can use it in the following way:

```
hiveContext.sql("select percentile_approx(open_rate, 0.10) from mytable")
```

But I want to use a Spark DataFrame for performance reasons.
Sample data set:

```
|user id|open_rate|
|-------|---------|
|a1     |10.3     |
|b1     |4.04     |
|c1     |21.7     |
|d1     |18.6     |
```
I want to find out how many users fall into the 10th percentile, the 20th percentile, and so on. I want something like this:

```
df.select($"id", percentile($"open_rate", 0.1)).show
```
Spark SQL and the Scala DataFrame/Dataset APIs are executed by the same engine: equivalent operations generate equivalent execution plans. You can see the execution plans with `explain`:

```
sql("...").explain
df.explain
```
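For instance, assuming `df` is backed by a registered table named `mytable` (my example names, not from the question), these two should produce equivalent physical plans:

```scala
// Same query expressed in SQL and in the Scala DSL; compare the
// explain() output to see that the engine plans them alike.
// Assumes `import sqlContext.implicits._` for the $"..." syntax.
sqlContext.sql("select open_rate from mytable where open_rate > 10").explain()
df.filter($"open_rate" > 10).select($"open_rate").explain()
```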
When it comes to your specific question, a common pattern is to intermix Spark SQL and Scala DSL syntax because, as you have discovered, their capabilities are not yet equivalent. (Another example is the difference between SQL's `explode()` and the DSL's `explode()`, the latter being more powerful but also more inefficient due to marshalling.)
The simple way is as follows:

```
df.registerTempTable("tmp_tbl")
val newDF = sql(/* your SQL referencing tmp_tbl */)
// continue using newDF with the Scala DSL
```
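Alternatively, you can stay entirely in the DataFrame API by invoking the Hive UDAF by name. This is an untested sketch: it assumes Spark 1.5+ (for `callUDF`) and a `HiveContext`, so that `percentile_approx` is resolvable by name:

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// percentile_approx is a Hive UDAF, so df must originate from a
// HiveContext for the function name to resolve.
val p10 = df.agg(
  callUDF("percentile_approx", col("open_rate"), lit(0.1)).as("p10")
)
p10.show()
```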
What you need to keep in mind if you go the simple way is that temporary table names are cluster-global (up to 1.6.x). Therefore, you should use randomized table names if the code may run simultaneously more than once on the same cluster.
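One way to randomize the name (a sketch; assumes `df` and `sqlContext` are already in scope):

```scala
import java.util.UUID

// Generate a name that is vanishingly unlikely to collide with
// another job's temp table on the same cluster.
val tmpName = "tmp_tbl_" + UUID.randomUUID().toString.replace("-", "_")
df.registerTempTable(tmpName)
val newDF = sql(s"select percentile_approx(open_rate, 0.10) from $tmpName")
sqlContext.dropTempTable(tmpName) // unregister when you are done with it
```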
On my team this pattern is common enough that we have added a `.sql()` implicit to `DataFrame` that automatically registers and unregisters a temp table for the scope of the SQL statement.
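That implicit might look something like the following sketch (the names and the `{table}` placeholder are my invention for illustration, not the actual code):

```scala
import java.util.UUID
import org.apache.spark.sql.DataFrame

object SqlSyntax {
  implicit class DataFrameSqlOps(val df: DataFrame) extends AnyVal {
    // Registers df under a randomized temp table name, substitutes
    // that name for the {table} placeholder in the query, and
    // unregisters the table once the statement has been analyzed.
    def sql(query: String): DataFrame = {
      val name = "tmp_" + UUID.randomUUID().toString.replace("-", "_")
      df.registerTempTable(name)
      try df.sqlContext.sql(query.replace("{table}", name))
      finally df.sqlContext.dropTempTable(name)
    }
  }
}
```

Usage would then be `df.sql("select percentile_approx(open_rate, 0.10) from {table}")`.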