checkpoint

  • Available in: GBM, DRF, XGBoost, Deep Learning

  • Hyperparameter: no

Description

In real-world scenarios, data can change. For example, you may have a model currently in production that was built using 1 million records. At a later date, you may receive several hundred thousand more records. Rather than building a new model from scratch, you can use checkpointing to create a new model based on the existing model.

The checkpoint option lets you specify the model key of a previously trained model. Training then continues from that model instead of starting over. If checkpoint is not specified, the algorithm trains a new model from scratch.

When you continue building on a previous model, the parameters that control the amount of training, specifically ntrees (in GBM/DRF/XGBoost) or epochs (in Deep Learning), must be set to the total amount of training you would want if you had started from scratch, not to the number of additional trees or epochs. This means the ntrees or epochs value for the continuation model must always be greater than the original value. For example:

  • If the first model builds 1 tree, and you want your new model to build 50 trees, then the continuation model (using checkpointing) would specify ntrees=50. This gives you a total of 50 trees including 49 new ones.

  • If your original model included 20 trees, and you specify ntrees=50 for the continuation model, then the new model will add 30 trees to the model, again giving you a total of 50 trees.

  • If your original model included 20 trees, and you specify ntrees=10 (a lower value), then you will receive an error indicating that the requested ntrees must be higher than 21.
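
The arithmetic in the examples above can be sketched in plain Python. The helper `additional_trees` is hypothetical and not part of the H2O API; it only assumes the rule stated here, that the requested ntrees is a total and must be greater than the number of trees in the checkpointed model.

```python
def additional_trees(checkpoint_ntrees, requested_ntrees):
    """Return how many new trees a continuation model would add.

    Hypothetical helper (not an H2O function): it mirrors the rule that
    `ntrees` is the total number of trees, not the number of extra trees.
    """
    if requested_ntrees <= checkpoint_ntrees:
        raise ValueError(
            "requested ntrees (%d) must be greater than the %d trees "
            "already in the checkpointed model"
            % (requested_ntrees, checkpoint_ntrees)
        )
    return requested_ntrees - checkpoint_ntrees


print(additional_trees(1, 50))   # 49
print(additional_trees(20, 50))  # 30
```

Calling `additional_trees(20, 10)` raises a ValueError, which corresponds to the third case above.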

Notes:

  • The model type and the response type must be the same as for the checkpointed model.

  • The columns of the training data must be the same as for the checkpointed model.

  • Categorical factor levels of the training data must be the same as for the checkpointed model.

  • The total number of predictors of the training data must be the same as for the checkpointed model.
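
One way to check these requirements before continuing training is to compare a simple description of the two datasets. The sketch below is illustrative only: the schema format (a dict mapping each column name to its list of factor levels, or None for a numeric column) and the function name `schema_mismatches` are assumptions, not H2O structures.

```python
def schema_mismatches(checkpoint_schema, new_schema):
    """List the ways `new_schema` differs from `checkpoint_schema`.

    Each schema is a dict: column name -> list of factor levels for a
    categorical column, or None for a numeric column (assumed format).
    """
    problems = []
    # the set of columns must be identical
    if set(checkpoint_schema) != set(new_schema):
        problems.append("columns differ")
    # for shared columns, the type and factor levels must be identical
    for name, levels in checkpoint_schema.items():
        if name in new_schema and new_schema[name] != levels:
            problems.append("type or factor levels differ for column %r" % name)
    return problems


old = {"year": None, "economy_20mpg": ["0", "1"]}
new = {"year": None, "economy_20mpg": ["0", "1", "2"]}
print(schema_mismatches(old, old))  # []
print(schema_mismatches(old, new))
```

An empty list means the described data satisfies the column and factor-level notes above; it says nothing about the model type or response type.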

The following options cannot be modified when rebuilding a model using checkpoint:

GBM/DRF Options

  • build_tree_one_node

  • max_depth

  • min_rows

  • nbins

  • nbins_cats

  • nbins_top_level

  • sample_rate

XGBoost Options

  • booster

  • grow_policy

  • max_depth

  • min_rows

  • sample_rate

  • tree_method

Deep Learning Options

  • activation

  • adaptive_rate

  • autoencoder

  • col_major

  • distribution

  • drop_na20_cols

  • epsilon

  • huber_alpha

  • ignore_const_cols

  • max_categorical_features

  • momentum_ramp

  • momentum_stable

  • momentum_start

  • nesterov_accelerated_gradient

  • nfolds

  • quantile_alpha

  • rate

  • rate_annealing

  • rate_decay

  • rho

  • sparse

  • sparsity_beta

  • standardize

  • tweedie_power

  • use_all_factor_levels

  • y (response_column)
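
A quick way to catch an accidental change to one of these options is to compare the parameters of the original model with the ones planned for the continuation. The sketch below uses the GBM/DRF list above; the function name and the plain-dict representation of parameters are assumptions made for illustration, not part of H2O.

```python
# Options that cannot change when continuing a GBM/DRF model,
# copied from the GBM/DRF list in this section.
GBM_DRF_LOCKED = [
    "build_tree_one_node", "max_depth", "min_rows", "nbins",
    "nbins_cats", "nbins_top_level", "sample_rate",
]


def changed_locked_options(original_params, new_params, locked=GBM_DRF_LOCKED):
    """Return locked option names whose values differ between two dicts.

    An option absent from both dicts counts as unchanged; an option set
    in only one of them is reported as changed.
    """
    return [
        name for name in locked
        if original_params.get(name) != new_params.get(name)
    ]


original = {"ntrees": 20, "max_depth": 5, "sample_rate": 0.8}
planned = {"ntrees": 50, "max_depth": 7, "sample_rate": 0.8}
print(changed_locked_options(original, planned))  # ['max_depth']
```

Here ntrees is allowed to change, so only max_depth is reported.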

Example

R

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"

# split into train and validation sets
cars_split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
train <- cars_split[[1]]
valid <- cars_split[[2]]

# build a GBM with 1 tree (ntrees = 1) for the first model:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                    validation_frame = valid, ntrees = 1, seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm, valid = TRUE))
# [1] 0.9690799

# restart the training process on a saved GBM model using the `checkpoint` argument:
# the checkpoint argument requires the model id of the model on which you want to
# continue building
# get the model's id from the "cars_gbm" model using `cars_gbm@model_id`
# the first model has 1 tree; to continue building the GBM with 49 more trees,
# set ntrees = 50

# to see how many trees the original model built you can look at the `ntrees` attribute
print(paste("Number of trees built for cars_gbm model:", cars_gbm@allparameters$ntrees))
# [1] "Number of trees built for cars_gbm model: 1"

# build and train model with 49 additional trees for a total of 50 trees:
cars_gbm_continued <- h2o.gbm(x = predictors, y = response, training_frame = train,
                              validation_frame = valid,
                              checkpoint = cars_gbm@model_id,
                              ntrees = 50,
                              seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm_continued, valid = TRUE))
# [1] 0.9803922

# to see how many trees the continuation model built you can look at the `ntrees` attribute
print(paste("Number of trees built for cars_gbm_continued model:", cars_gbm_continued@allparameters$ntrees))
# [1] "Number of trees built for cars_gbm_continued model: 50"

# you can also use checkpointing to pass in a new dataset
# (see options above for parameters you cannot change)
# simply change out the training and validation frames with your new dataset

Python

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"

# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)

# build a GBM with 1 tree (ntrees = 1) for the first model:
cars_gbm = H2OGradientBoostingEstimator(ntrees = 1, seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(cars_gbm.auc(valid=True))
# 0.981146304676

# restart the training process on a saved GBM model using the `checkpoint` argument:
# the checkpoint argument requires the model id of the model on which you want to continue building
# get the model's id from the "cars_gbm" model using `cars_gbm.model_id`
# the first model has 1 tree; to continue building the GBM with 49 more trees,
# set ntrees = 50

# to see how many trees the original model built you can look at the `ntrees` attribute
print("Number of trees built for cars_gbm model:", cars_gbm.ntrees)
# Number of trees built for cars_gbm model: 1

# build and train model with 49 additional trees for a total of 50 trees:
cars_gbm_continued = H2OGradientBoostingEstimator(checkpoint = cars_gbm.model_id, ntrees = 50, seed = 1234)
cars_gbm_continued.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(cars_gbm_continued.auc(valid=True))
# 0.9803921568627451

# to see how many trees the continuation model built you can look at the `ntrees` attribute
print("Number of trees built for cars_gbm_continued model:", cars_gbm_continued.ntrees)
# Number of trees built for cars_gbm_continued model: 50

# you can also use checkpointing to pass in a new dataset in addition to increasing
# the number of trees/epochs. (See options above for parameters you cannot change.)
# simply change out the training and validation frames with your new dataset.
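
As a sketch of the new-dataset case described in the comments above, the function below continues a GBM on different frames. The function name and its arguments are hypothetical; it assumes `new_train` and `new_valid` are H2OFrames with the same columns and factor levels as the original data, and it imports h2o lazily so that defining the function does not require a running cluster.

```python
def continue_gbm_on_new_data(prev_model, new_train, new_valid,
                             predictors, response, total_ntrees):
    """Continue `prev_model` on new frames (hypothetical helper).

    `total_ntrees` is the total tree count for the continued model and
    must be greater than the number of trees in `prev_model`.
    """
    # imported here so the function can be defined without h2o installed
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    continued = H2OGradientBoostingEstimator(
        checkpoint=prev_model.model_id,  # model to continue from
        ntrees=total_ntrees,             # total trees, not additional trees
        seed=1234,
    )
    continued.train(x=predictors, y=response,
                    training_frame=new_train, validation_frame=new_valid)
    return continued
```

For example, `continue_gbm_on_new_data(cars_gbm, new_train, new_valid, predictors, response, 50)` would mirror the continuation call above with the frames swapped; `new_train` and `new_valid` are placeholders for your own data.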