The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.
h2o.automl(x, y, training_frame, validation_frame = NULL, leaderboard_frame = NULL, blending_frame = NULL, nfolds = 5, fold_column = NULL, weights_column = NULL, balance_classes = FALSE, class_sampling_factors = NULL, max_after_balance_size = 5, max_runtime_secs = 3600, max_runtime_secs_per_model = NULL, max_models = NULL, stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error"), stopping_tolerance = NULL, stopping_rounds = 3, seed = NULL, project_name = NULL, exclude_algos = NULL, include_algos = NULL, keep_cross_validation_predictions = FALSE, keep_cross_validation_models = FALSE, keep_cross_validation_fold_assignment = FALSE, sort_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "mean_per_class_error"), export_checkpoints_dir = NULL, verbosity = NULL)
x | A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. |
---|---|
y | The name or index of the response variable in the model. For classification, the y column must be a factor, otherwise regression will be performed. Indexes are 1-based in R. |
training_frame | Training frame (H2OFrame or ID). |
validation_frame | Validation frame (H2OFrame or ID); Optional. This argument is ignored unless the user sets nfolds = 0. If cross-validation is turned off, then a validation frame can be specified and used for early stopping of individual models and early stopping of the grid searches. By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored. |
leaderboard_frame | Leaderboard frame (H2OFrame or ID); Optional. If provided, the Leaderboard will be scored using this data frame intead of using cross-validation metrics, which is the default. |
blending_frame | Blending frame (H2OFrame or ID) used to train the the metalearning algorithm in Stacked Ensembles (instead of relying on cross-validated predicted values); Optional. |
nfolds | Number of folds for k-fold cross-validation. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance). |
fold_column | Column with cross-validation fold index assignment per observation; used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run. |
weights_column | Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. |
balance_classes |
|
class_sampling_factors | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. |
max_after_balance_size | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0. |
max_runtime_secs | Maximum allowed runtime in seconds for the entire model training process. Use 0 to disable. Defaults to 3600 secs (1 hour). |
max_runtime_secs_per_model | Maximum runtime in seconds dedicated to each individual model training process. Use 0 to disable. Defaults to 0. |
max_models | Maximum number of models to build in the AutoML process (does not include Stacked Ensembles). Defaults to NULL. |
stopping_metric | Metric to use for early stopping ("AUTO" is logloss for classification, deviance for regression). Must be one of "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO. |
stopping_tolerance | Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate). |
stopping_rounds | Integer. Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k (stopping_rounds) scoring events. Defaults to 3 and must be an non-zero integer. Use 0 to disable early stopping. |
seed | Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models or early stopping is used because max_runtime_secs is resource limited, meaning that if the resources are not the same between runs, AutoML may be able to train more models on one run vs another. |
project_name | Character string to identify an AutoML project. Defaults to NULL, which means a project name will be auto-generated based on the training frame ID. |
exclude_algos | Vector of character strings naming the algorithms to skip during the model-building phase. An example use is exclude_algos = c("GLM", "DeepLearning", "DRF"), and the full list of options is: "DRF" (Random Forest and Extremely-Randomized Trees), "GLM", "XGBoost", "GBM", "DeepLearning" and "StackedEnsemble". |
include_algos | Vector of character strings naming the algorithms to restrict to during the model-building phase. This can't be used in combination with exclude_algos param. |
keep_cross_validation_predictions |
|
keep_cross_validation_models |
|
keep_cross_validation_fold_assignment |
|
sort_metric | Metric to sort the leaderboard by. For binomial classification choose between "AUC", "logloss", "mean_per_class_error", "RMSE", "MSE". For regression choose between "mean_residual_deviance", "RMSE", "MSE", "MAE", and "RMSLE". For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", "MSE". Default is "AUTO". If set to "AUTO", then "AUC" will be used for binomial classification, "mean_per_class_error" for multinomial classification, and "mean_residual_deviance" for regression. |
export_checkpoints_dir | (Optional) Path to a directory where every model will be stored in binary form. |
verbosity | Verbosity of the backend messages printed during training; Optional. Must be one of "debug", "info", "warn". Defaults to NULL (disable live log). |
An H2OAutoML object.
AutoML finds the best model, given a training frame and response, and returns an H2OAutoML object, which contains a leaderboard of all the models that were trained in the process, ranked by a default model performance metric.
# NOT RUN { library(h2o) h2o.init() votes_path <- system.file("extdata", "housevotes.csv", package = "h2o") votes_hf <- h2o.uploadFile(path = votes_path, header = TRUE) aml <- h2o.automl(y = "Class", training_frame = votes_hf, max_runtime_secs = 30) # }