Group By¶
The group_by
function allows you to group one or more columns and apply a function to the result. Specifically, the group_by
function performs the following actions on an H2O Frame:
splits the data into groups based on some criteria
applies a function to each group independently
combines the results into an H2OFrame
The result is a new H2OFrame with columns equivalent to the number of groups created. The returned groups are sorted by the natural group-by column sort.
The group_by
function accepts the following parameters:
Python and R
H2O Frame: This specifies the H2OFrame that you want the group by operation to be performed on.
by
: Theby
option can take a list of columns if you want to group by more than one column to compute the summary.
Python Only
na
, which controls treatment of NA values during the calculation. It can be one of:all
(default): any NAs are used in the calculation as-is; which usually results in the final result being NA too.ignore
: NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example,mean([1, 2, 3, nan], na="ignore")
will produce1.5
.rm
: entries are skipped during the calculations, reducing the total effective count of entries. For example,mean([1, 2, 3, nan], na="rm")
will produce2
.
R Only
gb.control
: In R, thegb.control
option specifies how to handle NA values in the dataset as well as how to name output columns. Note that to specify a list of column names in thegb.control
list, you must add thecol.names
argument.
nrow
: Specify the name of the generated column.
na.methods
, which controls treatment of NA values during the calculation. It can be one of:
all
(default): any NAs are used in the calculation as-is; which usually results in the final result being NA too.
ignore
: NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example,mean([1, 2, 3, nan], na="ignore")
will produce1.5
.
rm
: entries are skipped during the calculations, reducing the total effective count of entries. For example,mean([1, 2, 3, nan], na="rm")
will produce2
.Note: If a list smaller than the number of columns groups is supplied, then the list will be padded by
ignore
.
In addition to the above parameters, any number of the following aggregations can be chained together in the group_by
function:
count
: Count the number of rows in each group of a GroupBy object.max
: Calculate the maximum of each column specified incol
for each group of a GroupBy object.mean
: Calculate the mean of each column specified incol
for each group of a GroupBy object.min
: Calculate the minimum of each column specified incol
for each group of a GroupBy object.mode
: Calculate the mode of each column specified incol
for each group of a GroupBy object.sd
: Calculate the standard deviation of each column specified incol
for each group of a GroupBy object.ss
: Calculate the sum of squares of each column specified incol
for each group of a GroupBy object.sum
: Calculate the sum of each column specified incol
for each group of a GroupBy object.var
: Calculate the variance of each column specified incol
for each group of a GroupBy object.
If no arguments are given to the aggregation (e.g.,
max()
ingrouped.sum(col="X1", na="all").mean(col="X5", na="all").max()
), then it is assumed that the aggregation should apply to all columns except the GroupBy columns.
Note that once the aggregation operations are complete, calling the GroupBy object with a new set of aggregations will yield no effect. You must generate a new GroupBy object in order to apply a new aggregation on it. In addition, certain aggregations are only defined for numerical or categorical columns. An error will be thrown for calling aggregation on the wrong data types.
library(h2o)
h2o.init()
# Import the airlines data set and display a summary.
airlines_url <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
airlines <- h2o.importFile(path = airlines_url)
summary(airlines)
# Find number of flights by airport
origin_flights <- h2o.group_by(data = airlines, by = "Origin", nrow("Origin"), gb.control = list(na.methods = "rm"))
origin_flights_df <- as.data.frame(origin_flights)
origin_flights_df
Origin nrow
1 ABE 59
2 ABQ 876
3 ACY 31
...
# Find number of flights per month
flights_by_month <- h2o.group_by(data = airlines,
by = "Month",
nrow("Month"),
gb.control = list(na.methods = "rm"))
flights_by_month_df <- as.data.frame(flights_by_month)
flights_by_month_df
Month nrow
1 1 41979
2 10 1999
# Find the number of flights in a given month based on the origin
cols <- c("Origin","Month")
flights_by_origin_month <- h2o.group_by(data = airlines,
by = cols,
nrow("Month"),
gb.control = list(na.methods = "rm"))
flights_by_origin_month_df <- as.data.frame(flights_by_origin_month)
flights_by_origin_month_df
Origin Month nrow
1 ABE 1 59
2 ABQ 1 846
3 ABQ 10 30
4 ACY 1 31
5 ALB 1 75
...
# Find months with the highest cancellation ratio
which(colnames(airlines)=="Cancelled")
[1] 22
cancellations_by_month <- h2o.group_by(data = airlines,
by = "Month",
sum("Cancelled"),
gb.control=list(na.methods="rm"))
cancellation_rate <- cancellations_by_month$sum_Cancelled/flights_by_month$nrow
rates_table <- h2o.cbind(flights_by_month$Month,cancellation_rate)
rates_table_df <- as.data.frame(rates_table)
rates_table_df
Month sum_Cancelled
1 1 0.025417471
2 10 0.009504752
# Use group_by with multiple columns. Summarize the destination,
# arrival delays, and departure delays for an origin
cols <- c("Dest", "IsArrDelayed", "IsDepDelayed")
origin_flights <- h2o.group_by(data = airlines[c("Origin",cols)],
by = "Origin",
sum(cols),
gb.control = list(na.methods = "ignore", col.names = NULL))
# Note a warning because col.names null
res <- h2o.cbind(lapply(cols, function(x){h2o.group_by(airlines, by = "Origin", sum(x))}))[,c(1,2,4,6)]
res
Origin sum_Dest sum_IsArrDelayed sum_IsDepDelayed
1 ABE 5884 40 30
2 ABQ 84505 545 370
3 ACY 3131 9 7
4 ALB 3646 49 50
5 AMA 317 4 6
6 ANC 100 0 1
import h2o
h2o.init()
# Upload the airlines dataset
air = h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
air.dim
[43978, 31]
# Find number of flights by airport
origin_flights = air.group_by("Origin")
origin_fights.count()
origin_fights.get_frame()
Origin nrow
-------- ------
ABE 59
ABQ 876
ACY 31
...
# Find number of flights per month based on the origin
cols = ["Origin","Month"]
flights_by_origin_month = air.group_by(by=cols).count(na ="all")
flights_by_origin_month.get_frame()
Origin Month nrow
-------- ------- ------
ABE 1 59
ABQ 1 846
ABQ 10 30
...
# Find months with the highest cancellation ratio
cancellation_by_month = air.group_by(by='Month').sum('Cancelled', na="all")
flights_by_month = air.group_by('Month').count(na="all")
cancelled = cancellation_by_month.get_frame()['sum_Cancelled']
flights = flights_by_month.get_frame()['nrow']
month_count = flights_by_month.get_frame()['Month']
ratio = cancelled/flights
month_count.cbind(ratio)
Month sum_Cancelled
------- ---------------
1 0.0254175
10 0.00950475
[2 rows x 2 columns]
# Use group_by with multiple columns. Summarize the destination,
# arrival delays, and departure delays for an origin
cols_1 = ['Origin', 'Dest', 'IsArrDelayed', 'IsDepDelayed']
cols_2 = ["Dest", "IsArrDelayed", "IsDepDelayed"]
air[cols_1].group_by(by='Origin').sum(cols_2, na="ignore").get_frame()
Origin sum_Dest sum_IsDepDelayed sum_IsArrDelayed
-------- ---------- ------------------ ------------------
ABE 5884 30 40
ABQ 84505 370 545
ACY 3131 7 9
ALB 3646 50 49
AMA 317 6 4
ANC 100 1 0
...