validation_frame

  • Available in: GBM, DRF, Deep Learning, GLM, PCA, GLRM, Naïve-Bayes, K-Means, Stacked Ensembles, AutoML, XGBoost
  • Hyperparameter: no

Description

Datasets are commonly split into training, validation, and test sets. The bulk of the data goes into the training set, with smaller portions held out for the validation and test sets.

The training_frame is used to build the model, while the validation_frame is used to score the fitted model and evaluate its accuracy on data that was not used for fitting. Typically, the model is built on training data, which may be sampled, and is then scored against the validation frame’s unsampled data. The recommended process is to train on the training set and stop early based on the validation set (and/or cross-validation). Once you find a good model, score it once on the test set to estimate the generalization error.

Example

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"

# split into train and validation sets
cars.split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
train <- cars.split[[1]]
valid <- cars.split[[2]]

# try using the `validation_frame` parameter:
# train the model, specifying the predictors (`x`), the response column (`y`),
# the training_frame, and the validation_frame
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                    validation_frame = valid, seed = 1234)

# print the AUC of the model on the validation frame
print(h2o.auc(cars_gbm, valid = TRUE))
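
The description recommends training on the training set, stopping early based on the validation set, and scoring once on the test set, while the example above uses only a two-way split. The following is a minimal Python sketch of that three-way workflow on the same cars dataset. The 70/15/15 split ratios, the early-stopping settings (ntrees, stopping_rounds, stopping_metric, stopping_tolerance, score_tree_interval), and the helper name run_validation_workflow are illustrative assumptions, not values taken from this page.

```python
# Sketch of the train / validation / test workflow with H2O's Python client.
# Assumptions (not from the text above): the 70/15/15 split and every
# early-stopping setting below are illustrative values only.

CARS_URL = "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
PREDICTORS = ["displacement", "power", "weight", "acceleration", "year"]
RESPONSE = "economy_20mpg"
SPLIT_RATIOS = [0.70, 0.15]  # train, validation; the remainder becomes the test set


def run_validation_workflow(url=CARS_URL, seed=1234):
    """Train a GBM that stops early on a validation frame, then score the test set once."""
    # Imported inside the function so the constants above can be inspected
    # without h2o installed.
    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()  # connects to a running H2O cluster or starts a local one

    cars = h2o.import_file(url)
    cars[RESPONSE] = cars[RESPONSE].asfactor()  # treat the response as categorical

    # split_frame assigns rows probabilistically, so the sizes are approximate
    train, valid, test = cars.split_frame(ratios=SPLIT_RATIOS, seed=seed)

    gbm = H2OGradientBoostingEstimator(
        ntrees=500,              # upper bound; early stopping may end training sooner
        stopping_rounds=3,       # stop after 3 scoring events without improvement
        stopping_metric="AUC",
        stopping_tolerance=1e-3,
        score_tree_interval=5,   # score every 5 trees
        seed=seed,
    )
    # Because a validation_frame is supplied, early stopping is evaluated on it.
    gbm.train(x=PREDICTORS, y=RESPONSE,
              training_frame=train, validation_frame=valid)

    # The test set is scored only once, after the model has been chosen.
    test_perf = gbm.model_performance(test_data=test)
    return {"validation_auc": gbm.auc(valid=True), "test_auc": test_perf.auc()}
```

Calling run_validation_workflow() requires a reachable H2O cluster (or a local Java installation) and network access to the dataset URL. It returns a dictionary holding the validation AUC used for model selection and the test AUC used as the generalization estimate.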