The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. AutoML finds the best model, given a training frame and response, and returns an H2OAutoML object, which contains a leaderboard of all the models that were trained in the process, ranked by a default model performance metric.
h2o.automl( x, y, training_frame, validation_frame = NULL, leaderboard_frame = NULL, blending_frame = NULL, nfolds = -1, fold_column = NULL, weights_column = NULL, balance_classes = FALSE, class_sampling_factors = NULL, max_after_balance_size = 5, max_runtime_secs = NULL, max_runtime_secs_per_model = NULL, max_models = NULL, distribution = c("AUTO", "bernoulli", "ordinal", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom"), stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error"), stopping_tolerance = NULL, stopping_rounds = 3, seed = NULL, project_name = NULL, exclude_algos = NULL, include_algos = NULL, modeling_plan = NULL, preprocessing = NULL, exploitation_ratio = -1, monotone_constraints = NULL, keep_cross_validation_predictions = FALSE, keep_cross_validation_models = FALSE, keep_cross_validation_fold_assignment = FALSE, sort_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "mean_per_class_error"), export_checkpoints_dir = NULL, verbosity = "warn", ... )
x | A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. |
---|---|
y | The name or index of the response variable in the model. For classification, the y column must be a factor, otherwise regression will be performed. Indexes are 1-based in R. |
training_frame | Training frame (H2OFrame or ID). |
validation_frame | Validation frame (H2OFrame or ID); Optional. This argument is ignored unless the user sets nfolds = 0. If cross-validation is turned off, then a validation frame can be specified and used for early stopping of individual models and early stopping of the grid searches. By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored. |
leaderboard_frame | Leaderboard frame (H2OFrame or ID); Optional. If provided, the Leaderboard will be scored using this data frame intead of using cross-validation metrics, which is the default. |
blending_frame | Blending frame (H2OFrame or ID) used to train the the metalearning algorithm in Stacked Ensembles (instead of relying on cross-validated predicted values); Optional.
When provided, it also is recommended to disable cross validation by setting |
nfolds | Number of folds for k-fold cross-validation. Must be >= 2; defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance). |
fold_column | Column with cross-validation fold index assignment per observation; used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run. |
weights_column | Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. |
balance_classes |
|
class_sampling_factors | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will
be automatically computed to obtain class balance during training. Requires |
max_after_balance_size | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires
|
max_runtime_secs | This argument specifies the maximum time that the AutoML process will run for.
If both |
max_runtime_secs_per_model | Maximum runtime in seconds dedicated to each individual model training process. Use 0 to disable. Defaults to 0. Note that models constrained by a time budget are not guaranteed reproducible. |
max_models | Maximum number of models to build in the AutoML process (does not include Stacked Ensembles). Defaults to NULL (no strict limit). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget. |
distribution | Distribution function used by algorithms that support it; other algorithms use their defaults. Possible values: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom", and for parameterized distributions list form is used to specify the parameter, e.g., |
stopping_metric | Metric to use for early stopping ("AUTO" is logloss for classification, deviance for regression). Must be one of "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to "AUTO". |
stopping_tolerance | Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate). |
stopping_rounds | Integer. Early stopping based on convergence of |
seed | Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if |
project_name | Character string to identify an AutoML project. Defaults to NULL, which means a project name will be auto-generated. More models can be trained and added to an existing AutoML project by specifying the same project name in multiple calls to the AutoML function (as long as the same training frame is used in subsequent runs). |
exclude_algos | Vector of character strings naming the algorithms to skip during the model-building phase. An example use is |
include_algos | Vector of character strings naming the algorithms to restrict to during the model-building phase. This can't be used in combination with |
modeling_plan | List. The list of modeling steps to be used by the AutoML engine (they may not all get executed, depending on other constraints). Optional (Expert usage only). |
preprocessing | List. The list of preprocessing steps to run. Only 'target_encoding' is currently supported. |
exploitation_ratio | The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase. By default, this is set to AUTO (exploitation_ratio=-1) as this is still experimental; to activate it, it is recommended to try a ratio around 0.1. Note that the current exploitation phase only tries to fine-tune the best XGBoost and the best GBM found during exploration. |
monotone_constraints | List. A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint. |
keep_cross_validation_predictions |
|
keep_cross_validation_models |
|
keep_cross_validation_fold_assignment |
|
sort_metric | Metric to sort the leaderboard by. For binomial classification choose between "AUC", "AUCPR", "logloss", "mean_per_class_error", "RMSE", "MSE". For regression choose between "mean_residual_deviance", "RMSE", "MSE", "MAE", and "RMSLE". For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", "MSE". Default is "AUTO". If set to "AUTO", then "AUC" will be used for binomial classification, "mean_per_class_error" for multinomial classification, and "mean_residual_deviance" for regression. |
export_checkpoints_dir | (Optional) Path to a directory where every model will be stored in binary form. |
verbosity | Verbosity of the backend messages printed during training; Optional. Must be one of NULL (live log disabled), "debug", "info", "warn", "error". Defaults to "warn". |
... | Additional (experimental) arguments to be passed through; Optional. |
An H2OAutoML object.
AutoML trains several models, cross-validated by default, by using the following available algorithms:
XGBoost
GBM (Gradient Boosting Machine)
GLM (Generalized Linear Model)
DRF (Distributed Random Forest)
XRT (eXtremely Randomized Trees)
DeepLearning (Fully Connected Deep Neural Network)
It also applies HPO on the following algorithms:
XGBoost
GBM
DeepLearning
In some cases, there will not be enough time to complete all the algorithms, so some may be missing from the leaderboard.
Finally, AutoML also trains several Stacked Ensemble models at various stages during the run. Mainly two kinds of Stacked Ensemble models are trained:
one of all available models at time t.
one of only the best models of each kind at time t.
Note that Stacked Ensemble models are trained only if there isn't another stacked ensemble with the same base models.
if (FALSE) { library(h2o) h2o.init() prostate_path <- system.file("extdata", "prostate.csv", package = "h2o") prostate <- h2o.importFile(path = prostate_path, header = TRUE) y <- "CAPSULE" prostate[,y] <- as.factor(prostate[,y]) #convert to factor for classification aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 30) lb <- h2o.get_leaderboard(aml) head(lb) }