Which parameters are used for or with cross validation?
If a user activates cross validation in one of the algorithms (
h2o.gbm(), etc), will H2O output estimates of model performance on only the holdout sets.
No, H2O will build nfolds+1 models in total, the ‘main’ model on 100% of training data and nfolds ‘cross-validation’ models that use disjoint holdout ‘validation’ sets (obtained from the training data) to estimate the generalization of the main model. The main model contains a cross-validation metrics object that is computed from the combined holdout predictions (obtain by setting xval to true in h2o.performance), as well as a table containing the statistics of various metrics across all nfolds cross-validation models (e.g., the mean and stddev of the logloss, rmse, etc.). You can also get the performance of the main model on the
training_framedataset if you specify
train = TRUE(R) or
train = True(python) when you ask for a model performance metric. If you provide a
validation_frameduring cross-validation, then you can get the performance of the main model on that by specifying
valid = TRUE(R) or
valid = True(python) when you ask for a model performance metric.
Can H2O automatically feed back the implications of the cross-validation results to improve the algorithm during training, as well as tune some of the model’s hyperparamters?
Yes, H2O can use cross-validation for parameter tuning if early stopping is enabled (stopping_rounds>0). In that case, cross-validation is used to automatically tune the optimal number of epochs for Deep Learning or the number of trees for DRF/GBM. The main model will use the mean number of epochs across all cross-validation models.
validation_frameisn’t specified, does supplying the
nfoldsparameter activate cross-validation scoring on the
nfolds > 1)
Does the model only train on the training data?
The model only ever trains on training data, but can use validation data (if provided) to tune parameters related to early stopping (epochs, number of trees). If no validation data is provided, we will tune based off training data.
Does supplying the
validation_frameparameter activate scoring on the
validation_framedataset instead of the
No, the models always score on the training frame (unless explicitly turned off - only available in Deep Learning), but if a validation frame is provided, then the model will score on that as well (and can use it for parameter tuning such as early stopping). It’s always a good idea to provide a validation set. If you don’t want to ‘sacrifice’ data, use cross-validation instead. Then, you can still provide a validation frame, but you don’t have to (and it isn’t used for parameter tuning either, just for metrics reporting).
nfoldsparameter is not specified, while
training_frameare , then would cross validation be activate, and some default value for
nfoldsparameter will be applied?
No, when a training frame and validation frame are supplied without the
nfoldsparameter, then training is done on the
training_frameand validation is done on the
validation_frame(CV will only ever activate unless
nfolds > 1)
Is early stopping (
stopping_rounds > 0) based on the
validation_framedataset, if provided, and otherwise based on
the training_framedataset .