max_active_predictors

  • Available in: GLM
  • Hyperparameter: no

Description

This option limits the number of active predictors. The number of non-zero coefficients in the final model is typically slightly lower than this limit. The option is useful when you want a sparse solution, because it avoids the costly computation of models with too many predictors.

When using the \(\ell_1\) penalty with lambda search, this option can stop the search before it completes. Models built at the beginning of the lambda search have higher lambda values, consider fewer predictors, and take less time to compute. Models built at the end of the search have lower lambda values, incorporate more predictors, and take longer to compute. Set the nlambdas parameter to specify the number of models attempted across the search.
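
The interaction between lambda search and the predictor limit can be illustrated with a small, self-contained sketch. This is not H2O's implementation: it assumes an orthonormal design (so each lasso coefficient is a soft-thresholded least-squares coefficient), a geometric lambda grid, and a stopping rule that halts as soon as a model would exceed the limit. The names soft_threshold, lambda_search, ols_coefs, and the lambda_min_ratio argument are hypothetical helpers for this illustration only.

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient, assuming an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lambda_search(ols_coefs, nlambdas, max_active_predictors, lambda_min_ratio=0.01):
    """Walk lambda from high to low and stop once a model exceeds the limit.

    Returns a list of (lambda, n_active) pairs for the models that were kept.
    """
    # At lambda_max every coefficient is shrunk to zero.
    lambda_max = max(abs(z) for z in ols_coefs)
    path = []
    for k in range(nlambdas):
        # Geometric grid from lambda_max down to lambda_max * lambda_min_ratio.
        frac = k / (nlambdas - 1) if nlambdas > 1 else 0.0
        lam = lambda_max * lambda_min_ratio ** frac
        n_active = sum(1 for z in ols_coefs if soft_threshold(z, lam) != 0.0)
        if n_active > max_active_predictors:
            # Assumed stopping rule: abandon the rest of the search.
            break
        path.append((lam, n_active))
    return path


# Toy least-squares coefficients (made-up values for illustration).
coefs = [4.0, -3.0, 2.0, 1.0, 0.5, -0.25]
full = lambda_search(coefs, nlambdas=10, max_active_predictors=len(coefs))
capped = lambda_search(coefs, nlambdas=10, max_active_predictors=2)
print("models kept without a tight limit:", len(full))
print("models kept with a limit of 2:", len(capped))
```

In this sketch the capped search keeps only the first few high-lambda models, which are also the cheapest ones to fit. The low-lambda models with many active predictors are never attempted.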

Default Value

  • If the solver is IRLSM, COORDINATE_DESCENT, or COORDINATE_DESCENT_NAIVE, then max_active_predictors defaults to 5000.
  • If the solver is AUTO and you initially have fewer than 5000 active predictors, then the solver will be IRLSM or COORDINATE_DESCENT (with lambda search), and max_active_predictors defaults to 5000.
  • If you run lambda search with alpha > 0 and the solver is AUTO, then the solver will be COORDINATE_DESCENT, and max_active_predictors defaults to 5000.
  • For all other scenarios, max_active_predictors will default to 100000000.
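
The rules above can be restated as a small decision function. This is only a reading aid that mirrors the list; it is not part of the H2O API, and the function name default_max_active_predictors and its arguments are hypothetical.

```python
DEFAULT_SMALL = 5000
DEFAULT_LARGE = 100000000

# Solvers for which the list above states a default of 5000 outright.
EXPLICIT_SOLVERS = {"IRLSM", "COORDINATE_DESCENT", "COORDINATE_DESCENT_NAIVE"}


def default_max_active_predictors(solver, n_active, lambda_search=False, alpha=0.0):
    """Return the default limit described in the list above (hypothetical helper)."""
    if solver in EXPLICIT_SOLVERS:
        return DEFAULT_SMALL
    if solver == "AUTO" and n_active < 5000:
        return DEFAULT_SMALL
    if solver == "AUTO" and lambda_search and alpha > 0:
        return DEFAULT_SMALL
    # All other scenarios.
    return DEFAULT_LARGE


print(default_max_active_predictors("IRLSM", n_active=20000))
print(default_max_active_predictors("AUTO", n_active=28))
print(default_max_active_predictors("AUTO", n_active=20000, lambda_search=True, alpha=0.5))
print(default_max_active_predictors("L_BFGS", n_active=28))
```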

Example

library(h2o)
h2o.init()
# import the higgs dataset:
# This dataset is used to classify whether or not a signal process produces Higgs bosons.
# The original data can be found at https://archive.ics.uci.edu/ml/datasets/HIGGS
higgs <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/testng/higgs_train_5k.csv")

# set the predictor names and the response column name
predictors <- colnames(higgs)[-1]
response <- "response"

# split into train and validation
higgs.splits <- h2o.splitFrame(data = higgs, ratios = 0.8)
train <- higgs.splits[[1]]
valid <- higgs.splits[[2]]

# try using the `max_active_predictors` parameter:
higgs_glm <- h2o.glm(family = "binomial", x = predictors, y = response,
                     training_frame = train,
                     validation_frame = valid,
                     max_active_predictors = 200)

# print the AUC for the validation data
print(h2o.auc(higgs_glm, valid = TRUE))

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# import the higgs dataset:
# This dataset is used to classify whether or not a signal process produces Higgs bosons.
# The original data can be found at https://archive.ics.uci.edu/ml/datasets/HIGGS
higgs = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/testng/higgs_train_5k.csv")

# set the predictor names and the response column name
predictors = higgs.names
predictors.remove('response')
# The response
response = "response"

# split into train and validation sets
train, valid = higgs.split_frame(ratios=[0.8])

# try using the `max_active_predictors` parameter:
# initialize the estimator then train the model
higgs_glm = H2OGeneralizedLinearEstimator(family='binomial', max_active_predictors=200)
higgs_glm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)

# print the AUC for the validation data
print(higgs_glm.auc(valid=True))