Performs in-place imputation by filling missing values with aggregates computed on the "na.rm'd" vector, that is, the column with its missing values removed. Imputation can also be done within groups defined by other columns in data; pass these columns by index or name to the by parameter. If a factor column is supplied, the method must be "mode".
h2o.impute(data, column = 0, method = c("mean", "median", "mode"), combine_method = c("interpolate", "average", "lo", "hi"), by = NULL, groupByFrame = NULL, values = NULL)
data | The dataset containing the column to impute. |
---|---|
column | A specific column to impute, default of 0 means impute the whole frame. |
method | "mean" replaces NAs with the column mean; "median" replaces NAs with the column median; "mode" replaces NAs with the most common factor level (factor columns only). |
combine_method | If method is "median", then choose how to combine quantiles on even sample sizes. This parameter is ignored in all other cases. |
by | Columns to group by, given by index or name. |
groupByFrame | Impute the column using this pre-computed grouped frame. |
values | A vector of imputation values, one per column. NaN means the column is skipped. |
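The combine_method choices only matter when the number of non-missing values is even, because the median then falls between two middle values. The base R sketch below illustrates that idea under stated assumptions: `median_even` is a hypothetical helper, not part of h2o, and it assumes "lo" and "hi" pick the lower and upper middle value while "average" takes their mean. "interpolate" is left out because its exact rule is not described here.

```r
# Hypothetical helper (not the h2o implementation): median of the
# non-missing values, with a choice of how to combine the two middle
# values when the sample size is even.
median_even <- function(x, combine = c("average", "lo", "hi")) {
  combine <- match.arg(combine)
  s <- sort(x)                      # sort() drops NA values by default
  n <- length(s)
  if (n %% 2 == 1) {
    return(s[(n + 1) / 2])          # odd size: a single middle value
  }
  lo <- s[n / 2]                    # lower of the two middle values
  hi <- s[n / 2 + 1]                # upper of the two middle values
  switch(combine,
         average = (lo + hi) / 2,
         lo = lo,
         hi = hi)
}

x <- c(4, NA, 1, 3, 2)              # four non-missing values: 1, 2, 3, 4
median_even(x, "lo")
median_even(x, "hi")
median_even(x, "average")
```

For odd sample sizes all three choices return the same value, which is consistent with the parameter being ignored outside the even-sized median case.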
Returns an H2OFrame with imputed values.
The default method is selected based on the type of the column to impute. If the column is numeric then "mean" is selected; if it is categorical, then "mode" is selected. Other column types (e.g. String, Time, UUID) are not supported.
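As a rough illustration of this type-based default, the base R sketch below picks the mean for numeric vectors and the most frequent level for factors, computing each aggregate on the non-missing values only. `impute_default` is a hypothetical helper written for this example; it does not call h2o and is not how h2o implements the rule.

```r
# Hypothetical helper (not part of h2o): choose a fill value by column type.
impute_default <- function(x) {
  if (is.numeric(x)) {
    fill <- mean(x, na.rm = TRUE)             # aggregate on the NA-removed vector
  } else if (is.factor(x)) {
    counts <- table(x)                        # table() excludes NA by default
    fill <- names(counts)[which.max(counts)]  # most frequent level (first on ties)
  } else {
    stop("unsupported column type")           # e.g. character vectors
  }
  x[is.na(x)] <- fill
  x
}

num <- impute_default(c(1, NA, 3))                    # NA becomes the mean, 2
fac <- impute_default(factor(c("a", "b", NA, "b")))   # NA becomes the mode, "b"
```

Unlike h2o.impute, this sketch returns a modified copy instead of changing the data in place.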
# NOT RUN {
library(h2o)
h2o.init()
fr <- as.h2o(iris, destination_frame = "iris")
fr[sample(nrow(fr), 40), 5] <- NA  # randomly replace 40 values in column 5 (Species) with NA
# impute with a group by
fr <- h2o.impute(fr, "Species", "mode", by = c("Sepal.Length", "Sepal.Width"))
# }
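To make the effect of grouping concrete without an H2O cluster, here is a minimal base R sketch of group-wise imputation. It uses the group mean on a numeric column, not the mode on a factor, and the data frame `df` and helper `fill_mean` are hypothetical names introduced for this example only.

```r
# Toy data: column x has missing values, column g defines the groups.
df <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, NA, 3, 10, NA)
)

# Hypothetical helper: replace NAs with the mean of the non-missing values.
fill_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# ave() applies fill_mean separately within each level of g and
# returns the results in the original row order.
df$x_imputed <- ave(df$x, df$g, FUN = fill_mean)
df
```

Each missing value is filled from its own group only, so the NA in group "a" becomes 2 and the NA in group "b" becomes 10. A group with no observed values would produce NaN in this sketch.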