gapply {SparkR}

R Documentation

gapply

Description

Groups the SparkDataFrame using the specified columns and applies the R function to each group.

Usage

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)

gapply(x, ...)

## S4 method for signature 'GroupedData'
gapply(x, func, schema)

Arguments

`x`	A SparkDataFrame or a GroupedData, depending on the method.
`cols`	Grouping columns (SparkDataFrame method only).
`func`	A function applied to each group of the SparkDataFrame, where groups are defined by the grouping columns. 'func' takes two arguments: a key, which holds the grouping column values for the group, and a local R data.frame containing the rows of that group. 'func' must return a local R data.frame.
`schema`	The schema of the resulting SparkDataFrame after the function is applied. The schema must match the output of 'func': it has to define, for each output column, the preferred column name and the corresponding data type.
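
The contract between 'func' and the grouping columns can be illustrated without a Spark session. The following is a minimal base R sketch, not SparkR code: `local_gapply` is a hypothetical helper written for this illustration, and it models the key as a named list, which is an assumption about convenience rather than a statement about how SparkR passes the key.

```r
# Hypothetical local stand-in for gapply(); NOT part of SparkR.
# It splits a local data.frame by the grouping columns, calls
# func(key, x) once per group, and row-binds the results.
local_gapply <- function(df, cols, func) {
  # One piece per distinct combination of the grouping columns.
  groups <- split(df, df[cols], drop = TRUE)
  pieces <- lapply(groups, function(x) {
    # Assumption: the key is modeled as a named list of grouping values.
    key <- as.list(x[1, cols, drop = FALSE])
    func(key, x)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

df <- data.frame(a = c(1L, 1L, 3L), b = c(1, 2, 3), c = c("1", "1", "3"),
                 d = c(0.1, 0.2, 0.3), stringsAsFactors = FALSE)

# 'func' returns one row per group: the key values and the mean of b.
result <- local_gapply(df, c("a", "c"), function(key, x) {
  data.frame(key, avg = mean(x$b), stringsAsFactors = FALSE)
})
```

Here `result` has one row per group and the columns `a` (integer), `c` (character) and `avg` (double). A schema passed to the real gapply would need to describe output columns in that order and with those types.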

Value

a SparkDataFrame

Note

gapply(SparkDataFrame) since 2.0.0

gapply(GroupedData) since 2.0.0

Examples

## Not run: 
##D # Computes the arithmetic mean of the second column by grouping
##D # on the first and third columns. Outputs the grouping values and the average.
##D 
##D df <- createDataFrame(
##D   list(list(1L, 1, "1", 0.1), list(1L, 2, "1", 0.2), list(3L, 3, "3", 0.3)),
##D   c("a", "b", "c", "d"))
##D 
##D # The output contains three columns: the key, which is a combination of two
##D # columns with data types integer and string, and the mean, which is a double.
##D schema <- structType(structField("a", "integer"), structField("c", "string"),
##D   structField("avg", "double"))
##D result <- gapply(
##D   df,
##D   c("a", "c"),
##D   function(key, x) {
##D     # Return one row per group: the key values followed by the mean of b.
##D     data.frame(key, mean(x$b), stringsAsFactors = FALSE)
##D   }, schema)
##D 
##D # We can also group the data first and then call gapply on the GroupedData.
##D # For example:
##D gdf <- group_by(df, "a", "c")
##D result <- gapply(
##D   gdf,
##D   function(key, x) {
##D     data.frame(key, mean(x$b), stringsAsFactors = FALSE)
##D   }, schema)
##D collect(result)
##D 
##D # Result
##D # ------
##D # a c avg
##D # 3 3 3.0
##D # 1 1 1.5
##D 
##D # Fits linear models on the iris dataset by grouping on the 'Species' column,
##D # using 'Sepal_Length' as the target variable and 'Sepal_Width', 'Petal_Length'
##D # and 'Petal_Width' as training features.
##D 
##D df <- createDataFrame(iris)
##D schema <- structType(structField("(Intercept)", "double"),
##D   structField("Sepal_Width", "double"), structField("Petal_Length", "double"),
##D   structField("Petal_Width", "double"))
##D df1 <- gapply(
##D   df,
##D   df$"Species",
##D   function(key, x) {
##D     m <- suppressWarnings(lm(Sepal_Length ~
##D       Sepal_Width + Petal_Length + Petal_Width, x))
##D     data.frame(t(coef(m)))
##D   }, schema)
##D collect(df1)
##D 
##D # Result
##D # ---------
##D # Model  (Intercept)  Sepal_Width  Petal_Length  Petal_Width
##D # 1        0.699883    0.3303370    0.9455356    -0.1697527
##D # 2        1.895540    0.3868576    0.9083370    -0.6792238
##D # 3        2.351890    0.6548350    0.2375602     0.2521257
##D 
## End(Not run)
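
To see why the schema in the iris example declares four double columns, it can help to run the body of 'func' on a single group locally. The sketch below uses only base R and the built-in `iris` data. The underscore renaming is done by hand to mirror the column names used in the example above, and the choice of the "setosa" group is arbitrary.

```r
# Local, Spark-free look at what 'func' returns for ONE group.
# Take a single species as a stand-in for the data.frame passed to 'func'.
x <- subset(iris, Species == "setosa")

# Assumption: rename columns by hand so they match the underscore
# names used in the example (Sepal_Length, Sepal_Width, ...).
names(x) <- gsub(".", "_", names(x), fixed = TRUE)

m <- lm(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width, x)

# coef(m) is a named numeric vector of length 4; transposing it and
# wrapping it in data.frame() yields a single row with four columns.
out <- data.frame(t(coef(m)))

dim(out)            # one row, four columns
sapply(out, class)  # every column is numeric
```

Each group therefore contributes one row of four doubles, which is the shape the schema describes. The column names in the final SparkDataFrame come from the schema, not from the names of the local data.frame returned by 'func'.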

[Package SparkR version 2.0.0 Index]
