blending_frame

  • Available in: Stacked Ensembles, AutoML
  • Hyperparameter: no

Description

H2O’s Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking (or Super Learning). The algorithm that learns the optimal combination of the base learners is called the metalearning algorithm, or metalearner.

The optional blending_frame parameter specifies a frame used to compute the predictions that serve as the training frame for the metalearner. If provided, it triggers blending mode. Blending mode is faster than cross-validating the base learners (though these ensembles may not perform as well as the Super Learner ensemble). In addition, a blending frame makes it possible to train stacked ensembles on time-series data, where the holdout data is “future” data relative to the “past” data in the training set.

Example

  • r
library(h2o)
h2o.init()

# import the higgs_train_5k dataset
higgs <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")

# split the dataset into training and blending frames
higgs.splits <- h2o.splitFrame(data = higgs, ratios = .8, seed = 1234)
train <- higgs.splits[[1]]
blend <- higgs.splits[[2]]

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# Convert the response column in the training and blending frames to a factor
train[,y] <- as.factor(train[,y])
blend[,y] <- as.factor(blend[,y])

# Set number of folds for base learners
nfolds <- 3

# Train & Cross-validate a GBM model
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  ntrees = 10,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

# Train & Cross-validate an RF model
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train,
                          ntrees = 10,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)

# Train a stacked ensemble using a blending frame
stack <- h2o.stackedEnsemble(x = x,
                             y = y,
                             base_models = list(my_gbm, my_rf),
                             training_frame = train,
                             blending_frame = blend,
                             seed = 1)
h2o.auc(h2o.performance(stack, blend))
# [1] 0.7576039
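  • python

The same workflow in Python. This is a minimal sketch assuming the standard h2o Python estimator classes (H2OGradientBoostingEstimator, H2ORandomForestEstimator, and H2OStackedEnsembleEstimator, whose train() method accepts a blending_frame argument); the resulting AUC may differ slightly from the R run above.

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

h2o.init()

# import the higgs_train_5k dataset
higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")

# split the dataset into training and blending frames
train, blend = higgs.split_frame(ratios=[.8], seed=1234)

# Identify predictors and response
y = "response"
x = train.columns
x.remove(y)

# Convert the response column in the training and blending frames to a factor
train[y] = train[y].asfactor()
blend[y] = blend[y].asfactor()

# Set number of folds for base learners
nfolds = 3

# Train & cross-validate a GBM model
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      nfolds=nfolds,
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train)

# Train & cross-validate an RF model
my_rf = H2ORandomForestEstimator(ntrees=10,
                                 nfolds=nfolds,
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train)

# Train a stacked ensemble using a blending frame
stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], seed=1)
stack.train(x=x, y=y, training_frame=train, blending_frame=blend)

# Evaluate ensemble performance (AUC) on the blending frame
stack.model_performance(blend).auc()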