metalearner_algorithm

  • Available in: Stacked Ensembles
  • Hyperparameter: no

Description

H2O’s Stacked Ensemble method is supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking (or Super Learning). The algorithm that learns the optimal combination of the base learners is called the metalearning algorithm or metalearner. By default, the Stacked Ensemble metalearner is a default H2O GLM with non-negative weights. The metalearner_algorithm option allows you to specify a different metalearner algorithm. Options include:

  • "AUTO" (GLM with non negative weights, and if validation_frame is present, lambda_search is set to True; may change over time). This is the default.
  • "glm" (Generalized Linear Model, aka GLM)
  • "gbm" (Gradient Boosting Machine, aka GBM)
  • "drf" (Distributed Random Forest)
  • "deeplearning" (Deep Learning)

The metalearner will use default hyperparameter values for the specified algorithm, unless you also specify metalearner_params to customize the hyperparameters values.

The default parameters for the metalearning algorithms may not be the best choice, so it’s a good idea to experiment a bit with different parameters using metalearner_params. In the next release of H2O, there will be an option to easily do a grid search over metalearner parameters using the standard H2O Grid interface, which will make tuning the metalearner a bit easier.

Example

library(h2o)
h2o.init()

# import the higgs_train_5k train and test datasets
train <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# Convert the response column in train and test datasets to a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])


# Set number of folds for base learners
nfolds <- 3

# Train & Cross-validate a GBM model
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  ntrees = 10,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

# Train & Cross-validate an RF model
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train,
                          ntrees = 10,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)


# Next we can train a few different ensembles using different metalearners

# Train a stacked ensemble using the default metalearner algorithm
stack <- h2o.stackedEnsemble(x = x,
                             y = y,
                             training_frame = train,
                             base_models = list(my_gbm, my_rf))
h2o.auc(h2o.performance(stack, test))
# 0.7570171

# Train a stacked ensemble using GBM as the metalearner algorithm
# The metalearner will use GBM default values
stack_gbm <- h2o.stackedEnsemble(x = x,
                                 y = y,
                                 training_frame = train,
                                 base_models = list(my_gbm, my_rf),
                                 metalearner_algorithm = "gbm")
h2o.auc(h2o.performance(stack_gbm, test))
# 0.7511055

# Train a stacked ensemble using RF as the metalearner algorithm
# The metelearner will use RF default values
stack_rf <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf),
                                metalearner_algorithm = "drf")
h2o.auc(h2o.performance(stack_rf, test))
# 0.7232461

# Train a stacked ensemble using Deep Learning as the metalearner algorithm
# The metelearner will use RF default values
stack_dl <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf),
                                metalearner_algorithm = "deeplearning")
h2o.auc(h2o.performance(stack_dl, test))
# 0.7571556
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
h2o.init()

# import the higgs_train_5k train and test datasets
train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# Convert the response column in train and test datasets to a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()


# Set number of folds for base learners
nfolds = 3

# Train and cross-validate a GBM model
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      nfolds=nfolds,
                                      fold_assignment="Modulo",
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train)

# Train and cross-validate an RF model
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 nfolds=nfolds,
                                 fold_assignment="Modulo",
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train)


# Next we can train a few different ensembles using different metalearners

# Train a stacked ensemble using the default metalearner algorithm
stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
stack.train(x=x, y=y, training_frame=train)
stack.model_performance(test).auc()
# 0.7522591310013634

# Train a stacked ensemble with a GBM metalearner algorithm
# The metelearner will use GBM default values
stack_gbm = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                        metalearner_algorithm="gbm")
stack_gbm.train(x=x, y=y, training_frame=train)
stack_gbm.model_performance(test).auc()
# 0.7522591310013634

# Train a stacked ensemble with a RF metalearner algorithm
# The metelearner will use RF default values
stack_rf = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                       metalearner_algorithm="drf")
stack_rf.train(x=x, y=y, training_frame=train)
stack_rf.model_performance(test).auc()
# 0.7016302070136065

# Train a stacked ensemble with a Deep Learning metalearner algorithm
# The metelearner will use Deep Learning default values
stack_dl = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                       metalearner_algorithm="deeplearning")
stack_dl.train(x=x, y=y, training_frame=train)
stack_dl.model_performance(test).auc()
# 0.7634122856763638