Cross Validation
Which parameters are used with cross-validation?
nfolds
keep_cross_validation_models
keep_cross_validation_predictions
keep_cross_validation_fold_assignment
fold_assignment
fold_column
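For example, a minimal Python sketch showing where these parameters go (the file path and column choices are placeholders; fold_column is an alternative to nfolds/fold_assignment and is shown commented out):

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder path

    model = H2OGradientBoostingEstimator(
        nfolds=5,                                    # number of cross-validation folds
        fold_assignment="Modulo",                    # "AUTO", "Random", "Modulo", or "Stratified"
        keep_cross_validation_models=True,           # keep the nfolds CV models
        keep_cross_validation_predictions=True,      # keep the combined holdout predictions
        keep_cross_validation_fold_assignment=True,  # keep the per-row fold assignment
        seed=1,
    )
    # Assumes the last column is the response:
    model.train(x=train.columns[:-1], y=train.columns[-1], training_frame=train)
    # Alternatively, supply a custom fold column instead of nfolds/fold_assignment:
    # model.train(x=..., y=..., training_frame=train, fold_column="my_folds")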
If a user activates cross-validation in one of the algorithms (h2o.randomForest(), h2o.gbm(), etc.), will H2O output estimates of model performance on only the holdout sets?
No, H2O will build nfolds+1 models in total: the ‘main’ model on 100% of the training data, plus nfolds ‘cross-validation’ models that use disjoint holdout ‘validation’ sets (obtained from the training data) to estimate the generalization of the main model. The main model contains a cross-validation metrics object computed from the combined holdout predictions (obtained by setting xval to TRUE in h2o.performance), as well as a table containing the statistics of various metrics across all nfolds cross-validation models (e.g., the mean and stddev of the logloss, RMSE, etc.). You can also get the performance of the main model on the training_frame dataset if you specify train = TRUE (R) or train = True (Python) when you ask for a model performance metric. If you provide a validation_frame during cross-validation, then you can get the performance of the main model on it by specifying valid = TRUE (R) or valid = True (Python) when you ask for a model performance metric.
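For example, a minimal Python sketch of retrieving these metrics from a cross-validated model (assuming a model trained with nfolds > 1, as in the sketch above):

    # Cross-validation metrics computed from the combined holdout predictions:
    print(model.model_performance(xval=True))

    # Per-metric mean and stddev across the nfolds cross-validation models:
    print(model.cross_validation_metrics_summary())

    # Performance of the main model on the training_frame:
    print(model.model_performance(train=True))

    # Combined holdout predictions (requires keep_cross_validation_predictions=True):
    preds = model.cross_validation_holdout_predictions()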
Can H2O automatically feed the cross-validation results back into training to improve the algorithm and tune some of the model’s hyperparameters?
Yes, H2O can use cross-validation for parameter tuning if early stopping is enabled (stopping_rounds > 0). In that case, cross-validation is used to automatically tune the optimal number of epochs for Deep Learning or the number of trees for DRF/GBM. The main model will use the mean number of epochs (or trees) across all cross-validation models.
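As a hedged Python sketch (continuing the setup above; the parameter values are illustrative, and the actual_params lookup at the end is just for inspection):

    model = H2OGradientBoostingEstimator(
        nfolds=5,
        ntrees=1000,               # upper bound; early stopping decides the actual count
        stopping_rounds=5,         # stop once the moving average of the metric stops improving
        stopping_metric="logloss",
        stopping_tolerance=1e-3,
        score_tree_interval=10,    # score periodically so stopping can be evaluated
        seed=1,
    )
    model.train(x=train.columns[:-1], y=train.columns[-1], training_frame=train)
    # Inspect the tree count the main model actually ended up with:
    print(model.actual_params["ntrees"])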
If a validation_frame isn’t specified, does supplying the nfolds parameter activate cross-validation scoring on the training_frame dataset’s holdouts?
Yes (if nfolds > 1).
Does the model only train on the training data?
The model only ever trains on the training data, but it can use validation data (if provided) to tune parameters related to early stopping (epochs, number of trees). If no validation data is provided, tuning is based on the training data.
Does supplying the validation_frame parameter activate scoring on the validation_frame dataset instead of the training_frame dataset?
No, the models always score on the training frame (unless scoring on it is explicitly turned off, which is only possible in Deep Learning), but if a validation frame is provided, then the model will score on it as well (and can use it for parameter tuning such as early stopping). It’s always a good idea to provide a validation set. If you don’t want to ‘sacrifice’ data, use cross-validation instead; you can still provide a validation frame, but you don’t have to (in that case it is used only for metrics reporting, not for parameter tuning).
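A minimal Python sketch of supplying a validation frame (the path, split ratio, and column choices are placeholders):

    data = h2o.import_file("data.csv")                         # placeholder path
    predictors, response = data.columns[:-1], data.columns[-1]
    train_frame, valid_frame = data.split_frame(ratios=[0.8], seed=1)

    model = H2OGradientBoostingEstimator(seed=1)
    model.train(x=predictors, y=response,
                training_frame=train_frame,
                validation_frame=valid_frame)   # scored as well, and usable for early stopping

    print(model.model_performance(train=True))  # training metrics (always computed)
    print(model.model_performance(valid=True))  # validation metrics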
If the nfolds parameter is not specified while validation_frame and training_frame are, will cross-validation be activated with some default value for the nfolds parameter?
No. When a training frame and a validation frame are supplied without the nfolds parameter, training is done on the training_frame and validation is done on the validation_frame (cross-validation only ever activates when nfolds > 1).
Is early stopping (stopping_rounds > 0) based on the validation_frame dataset, if provided, and otherwise based on the training_frame dataset?
Yes.
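As a hedged sketch (continuing the previous split; values are illustrative), early stopping monitors the validation frame when one is supplied:

    model = H2OGradientBoostingEstimator(
        ntrees=1000,
        stopping_rounds=3,
        stopping_metric="AUC",
        score_tree_interval=10,
        seed=1,
    )
    # With a validation_frame, the stopping metric is tracked on validation data;
    # without one, it is tracked on the training data.
    model.train(x=predictors, y=response,
                training_frame=train_frame,
                validation_frame=valid_frame)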