Performs a group by and apply similar to ddply.
h2o.group_by( data, by, ..., gb.control = list(na.methods = NULL, col.names = NULL) )
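Argument | Description |
---|---|
data | an H2OFrame object. |
by | a list of column names. |
... | any supported aggregate function. See the details below. |
gb.control | a list of how to handle NA values (na.methods) and how to name output columns (col.names). See the details below. |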
Returns a new H2OFrame object with columns equivalent to the number of groups created.
In the case of na.methods within gb.control, there are three possible settings. "all" will include NAs in the computation of functions. "rm" will completely remove all NA fields. "ignore" will remove NAs from the numerator but keep the rows for computational purposes. If the supplied list is shorter than the number of column groups, it will be padded with "ignore".
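To make the three settings concrete, here is a minimal base R sketch of how they would affect a mean, read literally from the description above. The helper group_mean and the vector vol are hypothetical and are not part of the h2o package; H2O's own implementation may differ in detail.

```r
# Hypothetical helper, not part of h2o: mimics the three na.methods
# settings for a mean, as described in the text above.
group_mean <- function(x, na = c("all", "rm", "ignore")) {
  na <- match.arg(na)
  switch(na,
    all    = sum(x) / length(x),                     # NAs take part in the computation
    rm     = sum(x, na.rm = TRUE) / sum(!is.na(x)),  # NA entries are dropped entirely
    ignore = sum(x, na.rm = TRUE) / length(x)        # NAs left out of the sum, rows still counted
  )
}

vol <- c(10, NA, 20, 30)
group_mean(vol, "all")     # NA
group_mean(vol, "rm")      # 20
group_mean(vol, "ignore")  # 15
```

With "ignore" the denominator is still 4, which is why the result is lower than with "rm".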
Note that to specify a list of column names in the gb.control list, you must add the col.names argument. Similar to na.methods, col.names will pad the list with the default column names if its length is less than the number of column groups supplied.
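As a sketch of how gb.control might be passed, the call below reuses the prostate dataset from this page's own example and sets na.methods to "rm". It assumes a running H2O cluster and network access, so it is wrapped in if (FALSE) and is not run; col.names is left at its default.

```r
if (FALSE) {  # not run: needs a running H2O cluster and network access
  library(h2o)
  h2o.init()
  df <- h2o.importFile(paste("https://s3.amazonaws.com/h2o-public-test-data",
                             "/smalldata/prostate/prostate.csv", sep = ""))
  # Count rows per RACE group; "rm" asks for NA entries to be removed
  # before the aggregate is computed.
  h2o.group_by(data = df, by = "RACE", nrow("VOL"),
               gb.control = list(na.methods = "rm"))
}
```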
Supported functions include nrow. This function is required and accepts a string for the name of the generated column. Other supported aggregate functions accept col and na arguments for specifying columns and the handling of NAs ("all", "ignore", and "rm"). Each of the following is calculated for each column specified in col, for each group of a GroupBy object: max (maximum), mean (mean), min (minimum), mode (mode), sd (standard deviation), ss (sum of squares), sum (sum), and var (variance). If an aggregate is provided without a value (for example, as max in sum(col="X1", na="all").mean(col="X5", na="all").max()), then it is assumed that the aggregation should apply to all columns except the GroupBy columns. However, operations will not be performed on String columns; they will be skipped. Note again that nrow is required and cannot be empty.
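For readers who want to see what these aggregates compute, the following base R sketch applies rough analogues of them to a toy data frame. It is an illustration only, not H2O's implementation: the data, the helper stat_mode, and the tie-breaking rule for the mode are made up, and ss is assumed to mean the sum of squared values.

```r
# Toy data; the column names g and x are made up for illustration.
df <- data.frame(g = c("a", "a", "b", "b", "b"), x = c(1, 3, 2, 2, 5))

# Most frequent value; ties resolve to the first value seen (an assumption).
stat_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

# Base R analogues of the aggregates named above.
aggs <- list(nrow = length, max = max, mean = mean, min = min,
             mode = stat_mode, sd = sd,
             ss = function(v) sum(v^2),  # assumed: sum of squared values
             sum = sum, var = var)

# One row per group, one column per aggregate.
result <- sapply(aggs, function(f) tapply(df$x, df$g, f))
result
```

Here group "a" holds the values 1 and 3, so its ss entry is 10, and group "b" holds 2, 2, 5, so its mode entry is 2.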
if (FALSE) {
  library(h2o)
  h2o.init()
  df <- h2o.importFile(paste("https://s3.amazonaws.com/h2o-public-test-data",
                             "/smalldata/prostate/prostate.csv", sep = ""))
  h2o.group_by(data = df, by = "RACE", nrow("VOL"))
}