R/automl.R
The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.
    h2o.automl(x, y, training_frame, validation_frame = NULL,
      leaderboard_frame = NULL, nfolds = 5, fold_column = NULL,
      weights_column = NULL, max_runtime_secs = 3600, max_models = NULL,
      stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE",
        "RMSLE", "AUC", "lift_top_group", "misclassification",
        "mean_per_class_error"),
      stopping_tolerance = NULL, stopping_rounds = 3, seed = NULL,
      project_name = NULL, exclude_algos = NULL)
Argument | Description |
---|---|
x | A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. |
y | The name or index of the response variable in the model. For classification, the y column must be a factor, otherwise regression will be performed. Indexes are 1-based in R. |
training_frame | Training frame (H2OFrame or ID). |
validation_frame | Validation frame (H2OFrame or ID); optional. This frame is used for early stopping of individual models and early stopping of the grid searches (unless max_models or max_runtime_secs overrides metric-based early stopping). |
leaderboard_frame | Leaderboard frame (H2OFrame or ID); optional. If provided, the Leaderboard will be scored using this data frame instead of using cross-validation metrics, which is the default. |
nfolds | Number of folds for k-fold cross-validation. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance). |
fold_column | Column with cross-validation fold index assignment per observation; used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run. |
weights_column | Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. |
max_runtime_secs | Maximum allowed runtime in seconds for the entire model training process. Use 0 to disable. Defaults to 3600 secs (1 hour). |
max_models | Maximum number of models to build in the AutoML process (does not include Stacked Ensembles). Defaults to NULL. |
stopping_metric | Metric to use for early stopping (AUTO is logloss for classification, deviance for regression). Must be one of "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO. |
stopping_tolerance | Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate). |
stopping_rounds | Integer. Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k (stopping_rounds) scoring events. Defaults to 3 and must be a non-negative integer. Use 0 to disable early stopping. |
seed | Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models or early stopping is used because max_runtime_secs is resource limited, meaning that if the resources are not the same between runs, AutoML may be able to train more models on one run vs another. |
project_name | Character string to identify an AutoML project. Defaults to NULL, which means a project name will be auto-generated based on the training frame ID. |
exclude_algos | Vector of character strings naming the algorithms to skip during the model-building phase. An example use is exclude_algos = c("GLM", "DeepLearning", "DRF"), and the full list of options is: "GLM", "GBM", "DRF" (Random Forest and Extremely-Randomized Trees), "DeepLearning" and "StackedEnsemble". Defaults to NULL, which means that all appropriate H2O algorithms will be used, if the search stopping criteria allow. Optional. |
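
The defaults described for stopping_tolerance and stopping_rounds can be illustrated in plain R. The sketch below is only a reading of the argument descriptions above, not H2O's internal implementation: the helper names `default_stopping_tolerance` and `should_stop` are hypothetical, the metric is assumed to be one where lower is better (such as logloss), and the exact comparison of moving averages is a simplifying assumption.

```r
# Hypothetical helper (not part of the h2o package): default
# stopping_tolerance as described in the argument table.
# nrows: number of rows; non_na_rate: fraction of non-missing values in (0, 1].
default_stopping_tolerance <- function(nrows, non_na_rate = 1) {
  if (nrows >= 1e6) {
    return(0.001)
  }
  1 / sqrt(nrows * non_na_rate)
}

# Hypothetical helper: simplified moving-average early-stopping check.
# history: metric value at each scoring event (lower is better, all > 0).
# k: stopping_rounds; tolerance: stopping_tolerance.
should_stop <- function(history, k = 3, tolerance = 0.001) {
  stopifnot(all(history > 0))
  n <- length(history)
  # Early stopping is disabled for k = 0; otherwise at least 2 * k
  # scoring events are needed before a comparison is made (assumption).
  if (k == 0 || n < 2 * k) {
    return(FALSE)
  }
  # Simple moving averages of length k over the scoring history.
  sma <- sapply(k:n, function(i) mean(history[(i - k + 1):i]))
  m <- length(sma)
  recent <- sma[m]
  # Best (lowest) of the k moving averages preceding the most recent one.
  reference <- min(sma[(m - k):(m - 1)])
  # Stop when the relative improvement falls below the tolerance.
  improvement <- (reference - recent) / abs(reference)
  improvement < tolerance
}

default_stopping_tolerance(10000)                    # 0.01
should_stop(c(1.0, 0.9, 0.8, 0.7, 0.6, 0.5))         # FALSE, still improving
should_stop(c(1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))    # TRUE, plateaued
```

For a 10,000-row frame with no missing values, the sketch gives a tolerance of 0.01, ten times looser than the 0.001 used for frames of a million rows or more, which is why small datasets tend to stop earlier.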
Returns an H2OAutoML object.
AutoML finds the best model, given a training frame and response, and returns an H2OAutoML object, which contains a leaderboard of all the models that were trained in the process, ranked by a default model performance metric.
    # NOT RUN {
    library(h2o)
    h2o.init()
    votes_path <- system.file("extdata", "housevotes.csv", package = "h2o")
    votes_hf <- h2o.uploadFile(path = votes_path, header = TRUE)
    aml <- h2o.automl(y = "Class", training_frame = votes_hf, max_runtime_secs = 30)
    # }
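
As a further illustration, the sketch below shows one way to combine a held-out leaderboard_frame, max_models, seed, and exclude_algos, and then inspect the result. It assumes a running local H2O cluster and the same bundled housevotes.csv file; the 80/20 split ratio, the seed value, and the choice of excluded algorithm are arbitrary assumptions made for the example.

```r
library(h2o)
h2o.init()

votes_path <- system.file("extdata", "housevotes.csv", package = "h2o")
votes_hf <- h2o.uploadFile(path = votes_path, header = TRUE)

# Hold out 20% of the rows to score the leaderboard (ratio is an assumption).
splits <- h2o.splitFrame(votes_hf, ratios = 0.8, seed = 1)
train_hf <- splits[[1]]
lb_hf <- splits[[2]]

# Limit the run by model count rather than by time, and skip Deep Learning.
aml <- h2o.automl(y = "Class",
                  training_frame = train_hf,
                  leaderboard_frame = lb_hf,
                  max_models = 5,
                  seed = 1,
                  exclude_algos = c("DeepLearning"))

# The leaderboard is an H2OFrame ranking every model that was trained.
lb <- aml@leaderboard
print(lb)

# The leader is the top-ranked model; use it like any other H2O model.
best <- aml@leader
pred <- h2o.predict(best, lb_hf)
head(pred)
```

Limiting the run with max_models instead of relying only on max_runtime_secs follows the note under seed: a time budget depends on available resources, so the number of models trained can differ between runs.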