Imputing Data

The impute function performs in-place imputation: it fills missing values with an aggregate (mean, median, or mode) computed on the column with its NAs removed (the "na.rm'd" vector). You can also impute within groups defined by other columns of the same dataset. Pass these grouping columns to the by parameter, either by index or by column name. Note that if a factor column is supplied, then the method must be mode.
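The fill-with-an-aggregate idea described above can be sketched in plain Python. This is only an illustration of the semantics, not H2O's implementation; the helper name impute_mean and the sample values are hypothetical.

```python
import math

def impute_mean(values):
    """Replace NaN entries in place with the mean of the non-missing entries.

    Hypothetical helper for illustration; returns the fill value that was used.
    """
    present = [v for v in values if not math.isnan(v)]
    if not present:
        # No observed values to average, so nothing can be imputed.
        return float("nan")
    fill = sum(present) / len(present)
    for i, v in enumerate(values):
        if math.isnan(v):
            values[i] = fill
    return fill

# Toy column with two missing entries (made-up numbers).
dep_time = [930.0, float("nan"), 1200.0, float("nan"), 1500.0]
fill = impute_mean(dep_time)
print(fill)       # 1210.0
print(dep_time)   # [930.0, 1210.0, 1200.0, 1210.0, 1500.0]
```

The list is modified in place, mirroring the in-place behavior described for the impute function.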

The impute function accepts the following arguments:

  • dataset: The dataset containing the column to impute
  • column: The column to impute. The default of 0 imputes the entire frame.
  • method: The type of imputation to perform. mean replaces NAs with the column mean; median replaces NAs with the column median; mode replaces NAs with the most common factor level (for factor columns only).
  • combine_method: If method is median, specifies how to combine quantiles when the sample size is even. This parameter is ignored in all other cases. Available options are interpolate, average, low, and high.
  • by: The columns to group by. The aggregate is then computed within each group rather than over the whole column.
  • groupByFrame (R) or group_by_frame (Python): Impute the column with this pre-computed grouped frame.
  • values: A vector of imputation values, one per column. A NaN entry means that column is skipped.
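To make the combine_method options concrete, here is a minimal plain-Python sketch of a median over an even number of values. It assumes that low and high pick the lower and upper middle value, average takes their mean, and interpolate interpolates linearly between them; these semantics are an assumption for illustration, and median_with_combine is a hypothetical helper, not part of H2O.

```python
import math

def median_with_combine(values, combine_method="interpolate"):
    """Median of the non-missing values (hypothetical helper).

    For an even sample size the two middle values are combined
    according to combine_method (assumed semantics, see text above).
    """
    data = sorted(v for v in values if not math.isnan(v))
    if not data:
        return float("nan")
    n = len(data)
    low = data[(n - 1) // 2]   # lower middle value
    high = data[n // 2]        # upper middle value (equals low when n is odd)
    if combine_method == "low":
        return low
    if combine_method == "high":
        return high
    if combine_method in ("average", "interpolate"):
        # At the 0.5 quantile the interpolation weight is one half,
        # so linear interpolation coincides with the plain average.
        return low + 0.5 * (high - low)
    raise ValueError(f"unknown combine_method: {combine_method!r}")

# Four non-missing values, so the two middle values are 2.0 and 3.0.
sample = [4.0, 1.0, float("nan"), 3.0, 2.0]
for m in ("low", "high", "average", "interpolate"):
    print(m, median_with_combine(sample, m))
```

With an odd number of non-missing values the two middle positions coincide, which is consistent with the parameter only mattering for even sample sizes.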
#Start H2O, then upload the Airlines dataset
> library(h2o)
> h2o.init()
> filePath <- h2o:::.h2o.locate("smalldata/airlines/allyears2k_headers.zip")
> air <- h2o.importFile(filePath, "air")
> print(dim(air))
43978    31

#Show the number of NAs in the DepTime column.
> print(numNAs <- sum(is.na(air$DepTime)))
[1] 1086

> DepTime_mean <- mean(air$DepTime, na.rm = TRUE)
> print(DepTime_mean)
[1] 1345.847

#Mean impute the DepTime column
> h2o.impute(air, "DepTime", method = "mean")
 [1]     NaN      NaN      NaN      NaN 1345.847      NaN      NaN      NaN
 [9]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
[17]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
[25]     NaN      NaN      NaN      NaN      NaN      NaN      NaN

#Revert the imputations
> air <- h2o.importFile(filePath, "air")

#Impute the column using a grouping based on Dest
#If a Dest group contains only NAs, then no imputation will be done for that group (NAs will result).
> h2o.impute(air, "DepTime", method = "mean", by = c("Dest"))
  Dest mean_DepTime
1  ABE     1671.795
2  ABQ     1308.074
3  ACY     1651.095
4  ALB     1405.412
5  AMA     1404.333
6  ANC     2022.000

[134 rows x 2 columns]

#Revert the imputations
> air <- h2o.importFile(filePath, "air")

#Impute a factor column by the most common factor in that column
> h2o.impute(air, "TailNum", method = "mode")
 [1]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 3499  NaN  NaN  NaN  NaN
[16]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
[31]  NaN

#Revert imputations
> air <- h2o.importFile(filePath, "air")

#Impute a factor column using a grouping based on the Month
> h2o.impute(air, "TailNum", method = "mode", by=c("Month"))
  Month mode_TailNum
1     1         3499
2    10         3499
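
The grouped imputation shown in the examples above can be sketched in plain Python as follows. This is an illustration of the idea only, not H2O's implementation: the helper impute_mean_by_group and the toy rows are hypothetical, and the handling of groups that contain only missing values follows the comment in the example (the NAs remain).

```python
import math
from collections import defaultdict

def impute_mean_by_group(rows, column, by):
    """Fill NaNs in `column` with the mean of that column within each `by` group.

    Hypothetical helper; modifies the rows in place and returns the group means.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        v = row[column]
        if not math.isnan(v):
            sums[row[by]] += v
            counts[row[by]] += 1
    means = {g: sums[g] / counts[g] for g in counts}
    for row in rows:
        if math.isnan(row[column]):
            # A group with no observed values has no mean, so the NaN stays.
            row[column] = means.get(row[by], float("nan"))
    return means

nan = float("nan")
# Toy data (made-up numbers), grouped by destination.
flights = [
    {"Dest": "ABE", "DepTime": 1600.0},
    {"Dest": "ABE", "DepTime": nan},
    {"Dest": "ABQ", "DepTime": 1300.0},
    {"Dest": "ABQ", "DepTime": 1320.0},
    {"Dest": "ABQ", "DepTime": nan},
    {"Dest": "ANC", "DepTime": nan},   # group with only missing values
]
means = impute_mean_by_group(flights, "DepTime", "Dest")
print(means)
print([row["DepTime"] for row in flights])
```

Each missing DepTime is replaced by the mean of its own Dest group, while the ANC row stays missing because its group has no observed values.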