column_aggregate_functions {SparkR} | R Documentation |
Aggregate functions defined for Column.
approxCountDistinct(x, ...)
collect_list(x)
collect_set(x)
countDistinct(x, ...)
grouping_bit(x)
grouping_id(x, ...)
kurtosis(x)
n_distinct(x, ...)
sd(x, na.rm = FALSE)
skewness(x)
stddev(x)
stddev_pop(x)
stddev_samp(x)
sumDistinct(x)
var(x, y = NULL, na.rm = FALSE, use)
variance(x)
var_pop(x)
var_samp(x)

## S4 method for signature 'Column'
approxCountDistinct(x, rsd = 0.05)

## S4 method for signature 'Column'
kurtosis(x)

## S4 method for signature 'Column'
max(x)

## S4 method for signature 'Column'
mean(x)

## S4 method for signature 'Column'
min(x)

## S4 method for signature 'Column'
sd(x)

## S4 method for signature 'Column'
skewness(x)

## S4 method for signature 'Column'
stddev(x)

## S4 method for signature 'Column'
stddev_pop(x)

## S4 method for signature 'Column'
stddev_samp(x)

## S4 method for signature 'Column'
sum(x)

## S4 method for signature 'Column'
sumDistinct(x)

## S4 method for signature 'Column'
var(x)

## S4 method for signature 'Column'
variance(x)

## S4 method for signature 'Column'
var_pop(x)

## S4 method for signature 'Column'
var_samp(x)

## S4 method for signature 'Column'
approxCountDistinct(x, rsd = 0.05)

## S4 method for signature 'Column'
countDistinct(x, ...)

## S4 method for signature 'Column'
n_distinct(x, ...)

## S4 method for signature 'Column'
collect_list(x)

## S4 method for signature 'Column'
collect_set(x)

## S4 method for signature 'Column'
grouping_bit(x)

## S4 method for signature 'Column'
grouping_id(x, ...)
x: Column to compute on.
...: additional argument(s). For example, it could be used to pass additional Columns.
y, na.rm, use: currently not used.
rsd: maximum estimation error allowed (default = 0.05).
approxCountDistinct: Returns the approximate number of distinct items in a group.
kurtosis: Returns the kurtosis of the values in a group.
max: Returns the maximum value of the expression in a group.
mean: Returns the average of the values in a group. Alias for avg.
min: Returns the minimum value of the expression in a group.
sd: Alias for stddev_samp.
skewness: Returns the skewness of the values in a group.
stddev: Alias for stddev_samp.
stddev_pop: Returns the population standard deviation of the expression in a group.
stddev_samp: Returns the unbiased sample standard deviation of the expression in a group.
sum: Returns the sum of all values in the expression.
sumDistinct: Returns the sum of distinct values in the expression.
var: Alias for var_samp.
var_pop: Returns the population variance of the values in a group.
var_samp: Returns the unbiased variance of the values in a group.
countDistinct: Returns the number of distinct items in a group.
n_distinct: Returns the number of distinct items in a group.
collect_list: Creates a list of objects with duplicates.
Note: the function is non-deterministic because the order of collected results depends
on the order of rows, which may be non-deterministic after a shuffle.
collect_set: Creates a list of objects with duplicate elements eliminated.
Note: the function is non-deterministic because the order of collected results depends
on the order of rows, which may be non-deterministic after a shuffle.
grouping_bit: Indicates whether a specified column in a GROUP BY list is aggregated or
not. Returns 1 for aggregated or 0 for not aggregated in the result set. Same as GROUPING
in SQL and the grouping function in Scala.
grouping_id: Returns the level of grouping. Equal to
grouping_bit(c1) * 2^(n - 1) + grouping_bit(c2) * 2^(n - 2) + ... + grouping_bit(cn),
where c1, ..., cn are the n columns passed to grouping_id.
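
The grouping_id formula can be checked by hand with a few lines of base R. This is a minimal sketch, not SparkR code: the helper name grouping_id_from_bits is hypothetical and only reproduces the arithmetic stated above, treating the first column as the most significant bit.

```r
# Hypothetical helper (not part of SparkR): combine per-column grouping bits
# into one id, following bit(c1) * 2^(n - 1) + ... + bit(cn).
grouping_id_from_bits <- function(bits) {
  n <- length(bits)
  sum(bits * 2^((n - 1):0))
}

# Three grouping columns (c1, c2, c3): c1 and c3 aggregated, c2 not.
id <- grouping_id_from_bits(c(1, 0, 1))
print(id)  # 1 * 4 + 0 * 2 + 1 * 1 = 5
```

With three grouping columns, an id of 0 means no column is aggregated and 7 means all three are.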
approxCountDistinct(Column) since 1.4.0
kurtosis since 1.6.0
max since 1.5.0
mean since 1.5.0
min since 1.5.0
sd since 1.6.0
skewness since 1.6.0
stddev since 1.6.0
stddev_pop since 1.6.0
stddev_samp since 1.6.0
sum since 1.5.0
sumDistinct since 1.4.0
var since 1.6.0
variance since 1.6.0
var_pop since 1.5.0
var_samp since 1.6.0
approxCountDistinct(Column, numeric) since 1.4.0
countDistinct since 1.4.0
n_distinct since 1.4.0
collect_list since 2.3.0
collect_set since 2.3.0
grouping_bit since 2.3.0
grouping_id since 2.3.0
Other aggregate functions: avg, corr, count, cov, first, last
## Not run:
##D # Dataframe used throughout this doc
##D df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
## End(Not run)
## Not run:
##D head(select(df, approxCountDistinct(df$gear)))
##D head(select(df, approxCountDistinct(df$gear, 0.02)))
##D head(select(df, countDistinct(df$gear, df$cyl)))
##D head(select(df, n_distinct(df$gear)))
##D head(distinct(select(df, "gear")))
## End(Not run)
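
As a rough local analogue of the distinct-count calls above, the following base R sketch shows the difference between counting distinct values of one column and counting distinct combinations of several columns. It is plain R rather than SparkR, the small data frame cars is made up for illustration, and it assumes that countDistinct with several Columns counts distinct value combinations.

```r
# Made-up sample data standing in for the Spark DataFrame.
cars <- data.frame(
  gear = c(4, 4, 3, 3, 5, 4),
  cyl  = c(6, 4, 8, 8, 4, 6)
)

# Distinct values of a single column (analogue of n_distinct(df$gear)).
n_gear <- length(unique(cars$gear))

# Distinct (gear, cyl) combinations (analogue of countDistinct(df$gear, df$cyl)).
n_gear_cyl <- nrow(unique(cars[, c("gear", "cyl")]))

print(c(gear = n_gear, gear_cyl = n_gear_cyl))  # 3 and 4
```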
## Not run:
##D head(select(df, mean(df$mpg), sd(df$mpg), skewness(df$mpg), kurtosis(df$mpg)))
## End(Not run)
## Not run:
##D head(select(df, avg(df$mpg), mean(df$mpg), sum(df$mpg), min(df$wt), max(df$qsec)))
##D
##D # metrics by num of cylinders
##D tmp <- agg(groupBy(df, "cyl"), avg(df$mpg), avg(df$hp), avg(df$wt), avg(df$qsec))
##D head(orderBy(tmp, "cyl"))
##D
##D # car with the max mpg
##D mpg_max <- as.numeric(collect(agg(df, max(df$mpg))))
##D head(where(df, df$mpg == mpg_max))
## End(Not run)
## Not run:
##D head(select(df, sd(df$mpg), stddev(df$mpg), stddev_pop(df$wt), stddev_samp(df$qsec)))
## End(Not run)
## Not run:
##D head(select(df, sumDistinct(df$gear)))
##D head(distinct(select(df, "gear")))
## End(Not run)
## Not run:
##D head(agg(df, var(df$mpg), variance(df$mpg), var_pop(df$mpg), var_samp(df$mpg)))
## End(Not run)
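
The population and sample variants used in the standard deviation and variance examples differ only in the divisor. The base R sketch below (plain R, not SparkR, on a made-up vector x) spells this out, assuming the usual definitions: divide by n for the population form and by n - 1 for the sample form.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values with mean 5
n <- length(x)
ss <- sum((x - mean(x))^2)      # sum of squared deviations from the mean

var_pop_x  <- ss / n            # analogue of var_pop
var_samp_x <- ss / (n - 1)      # analogue of var_samp (and var, variance)
sd_pop_x   <- sqrt(var_pop_x)   # analogue of stddev_pop
sd_samp_x  <- sqrt(var_samp_x)  # analogue of stddev_samp (and sd, stddev)

# Base R's var() and sd() also use the sample (n - 1) form.
stopifnot(isTRUE(all.equal(var_samp_x, var(x))))
stopifnot(isTRUE(all.equal(sd_samp_x, sd(x))))

print(c(var_pop = var_pop_x, sd_pop = sd_pop_x))  # 4 and 2
```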
## Not run:
##D df2 <- df[df$mpg > 20, ]
##D collect(select(df2, collect_list(df2$gear)))
##D collect(select(df2, collect_set(df2$gear)))
## End(Not run)
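
The difference between the two collectors can be pictured with a base R analogue. This is only a sketch in plain R on a made-up vector, not SparkR code. Because the order of collected elements is not guaranteed in Spark, the results are sorted before being inspected.

```r
gear <- c(4, 4, 3, 5, 4)  # made-up values

with_duplicates <- gear             # analogue of collect_list: every value kept
without_duplicates <- unique(gear)  # analogue of collect_set: duplicates dropped

# Sort first so that comparisons do not depend on element order.
print(sort(with_duplicates))     # 3 4 4 4 5
print(sort(without_duplicates))  # 3 4 5
```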
## Not run:
##D # With cube
##D agg(
##D   cube(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
##D
##D # With rollup
##D agg(
##D   rollup(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_bit(df$cyl), grouping_bit(df$gear), grouping_bit(df$am)
##D )
## End(Not run)
## Not run:
##D # With cube
##D agg(
##D   cube(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
##D
##D # With rollup
##D agg(
##D   rollup(df, "cyl", "gear", "am"),
##D   mean(df$mpg),
##D   grouping_id(df$cyl, df$gear, df$am)
##D )
## End(Not run)