`checkpoint`¶

Available in: GBM, DRF, XGBoost, Deep Learning, GLM
Hyperparameter: no

Description¶

In real-world scenarios, data can change. For example, you may have a model currently in production that was built using 1 million records. At a later date, you may receive several hundred thousand more records. Rather than building a new model from scratch, you can use checkpointing to create a new model based on the existing model.

The checkpoint option allows you to specify the model key of a previously trained model. The algorithm then builds a new model as a continuation of that model. If checkpoint is not specified, the algorithm trains a new model from scratch instead of continuing to build on a previous one.

When setting parameters that continue to build on a previous model, specifically ntrees (in GBM/DRF/XGBoost) or epochs (in Deep Learning), specify the total amount of training you would want if you had started from scratch, not the number of additional trees or epochs. This means the ntrees or epochs value for the checkpointed model must always be greater than the original value. For example:

If the first model builds 1 tree, and you want your new model to build 50 trees, then the continuation model (using checkpointing) would specify ntrees=50. This gives you a total of 50 trees including 49 new ones.
If your original model included 20 trees, and you specify ntrees=50 for the continuation model, then the new model will add 30 trees to the model, again giving you a total of 50 trees.
If your original model included 20 trees, and you specify ntrees=10 (a lower value), then you will receive an error indicating that the requested ntrees must be higher than 21.
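
The arithmetic in the cases above can be sketched in a few lines of plain Python. The helper `additional_trees` is hypothetical and is not part of the H2O API; it only mirrors the rule that ntrees is a total rather than an increment, and that the total must exceed the number of trees already built.

```python
def additional_trees(existing_trees: int, requested_ntrees: int) -> int:
    """Return how many new trees a continuation model would add.

    Hypothetical helper, not an H2O function: it mirrors the documented rule
    that ntrees is the total tree count and must exceed the existing count.
    """
    if requested_ntrees <= existing_trees:
        raise ValueError(
            f"requested ntrees ({requested_ntrees}) must be greater than "
            f"the {existing_trees} trees already built"
        )
    return requested_ntrees - existing_trees


print(additional_trees(1, 50))   # 49
print(additional_trees(20, 50))  # 30
```

Calling `additional_trees(20, 10)` raises a `ValueError`, which corresponds to the third case, where H2O rejects the request.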

Notes:

The response type and model type of the training data must be the same as for the checkpointed model.
The columns of the training data must be the same as for the checkpointed model.
Categorical factor levels of the training data must be the same as for the checkpointed model.
The total number of predictors of the training data must be the same as for the checkpointed model.
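
As a rough illustration of these requirements, the sketch below compares two dataset descriptions before a continuation model is attempted. Everything here is an assumption made for illustration: the `Schema` dictionary layout and the `checkpoint_data_problems` function are hypothetical, work on plain Python dictionaries rather than H2O frames, and do not replace the validation that H2O performs itself.

```python
from typing import Dict, List, Optional

# Column name -> factor levels for categorical columns, None for numeric ones.
# This layout is a hypothetical stand-in for a real frame description.
Schema = Dict[str, Optional[List[str]]]


def checkpoint_data_problems(old: Schema, new: Schema) -> List[str]:
    """List the ways `new` violates the notes above relative to `old`."""
    problems = []
    # Same columns implies the same number of predictors.
    if set(old) != set(new):
        problems.append("columns differ")
    # For columns present in both, the type and factor levels must match.
    for name, levels in old.items():
        if name in new and new[name] != levels:
            problems.append(f"type or factor levels differ for column '{name}'")
    return problems


original = {"year": None, "economy_20mpg": ["0", "1"]}
updated = {"year": None, "economy_20mpg": ["0", "1", "2"]}
print(checkpoint_data_problems(original, updated))
```

An empty list means the sketch found no mismatch; it does not guarantee that H2O will accept the data.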

The following options cannot be modified when rebuilding a model using checkpoint:

GBM/DRF Options

build_tree_one_node

max_depth

min_rows

nbins

nbins_cats

nbins_top_level

sample_rate

XGBoost Options

booster

grow_policy

max_depth

min_rows

sample_rate

tree_method

Deep Learning Options

activation

adaptive_rate

autoencoder

col_major

distribution

drop_na20_cols

epsilon

huber_alpha

ignore_const_cols

max_categorical_features

momentum_ramp

momentum_stable

momentum_start

nesterov_accelerated_gradient

nfolds

quantile_alpha

rate

rate_annealing

rate_decay

rho

sparse

sparsity_beta

standardize

tweedie_power

use_all_factor_levels

y (response_column)
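
One way to guard against changing a fixed option by accident is to compare the settings of the two models before training. The sketch below does this for the GBM/DRF list above using plain dictionaries. The function `changed_locked_options` and the dictionary-based representation of model settings are hypothetical and are not part of the H2O API.

```python
# Options the GBM/DRF list above marks as fixed when continuing from a checkpoint.
GBM_DRF_LOCKED = (
    "build_tree_one_node", "max_depth", "min_rows", "nbins",
    "nbins_cats", "nbins_top_level", "sample_rate",
)


def changed_locked_options(original: dict, continuation: dict,
                           locked=GBM_DRF_LOCKED) -> list:
    """Return the locked option names whose values differ between the two dicts.

    Options absent from `continuation` are treated as unchanged. An option set
    in `continuation` but absent from `original` is reported, because this
    sketch has no knowledge of the default values.
    """
    return [
        name for name in locked
        if name in continuation and continuation[name] != original.get(name)
    ]


original = {"ntrees": 1, "max_depth": 5, "seed": 1234}
continuation = {"ntrees": 50, "max_depth": 8, "seed": 1234}
print(changed_locked_options(original, continuation))  # ['max_depth']
```

Here ntrees is allowed to change, so only max_depth is reported.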

Example¶

library(h2o)
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

# set the predictor names and the response column name
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"

# split into train and validation sets
cars_split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
train <- cars_split[[1]]
valid <- cars_split[[2]]

# build a GBM with 1 tree (ntrees = 1) for the first model:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                    validation_frame = valid, ntrees = 1, seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm, valid = TRUE))
[1] 0.9690799

# restart the training process from the previously trained GBM model using the `checkpoint` argument:
# the checkpoint argument requires the model id of the model on which you want to
# continue building
# get the model's id from the "cars_gbm" model using `cars_gbm@model_id`
# the first model has 1 tree; to continue building the GBM with 49 additional
# trees, set ntrees = 50

# to see how many trees the original model built you can look at the `ntrees` attribute
print(paste("Number of trees built for cars_gbm model:", cars_gbm@params$actual$ntrees))
[1] "Number of trees built for cars_gbm model: 1"

# build and train model with 49 additional trees for a total of 50 trees:
cars_gbm_continued <- h2o.gbm(x = predictors, y = response, training_frame = train,
                              validation_frame = valid,
                              checkpoint = cars_gbm@model_id,
                              ntrees = 50,
                              seed = 1234)

# print the auc for the validation data
print(h2o.auc(cars_gbm_continued, valid = TRUE))
[1] 0.9803922

# to see how many trees the continuation model built you can look at the `ntrees` attribute
print(paste("Number of trees built for cars_gbm_continued model:", cars_gbm_continued@allparameters$ntrees))
[1] "Number of trees built for cars_gbm_continued model: 50"

# you can also use checkpointing to pass in a new dataset
# (see options above for parameters you cannot change)
# simply change out the training and validation frames with your new dataset

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"

# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)

# build a GBM with 1 tree (ntrees = 1) for the first model:
cars_gbm = H2OGradientBoostingEstimator(ntrees = 1, seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(cars_gbm.auc(valid=True))
0.981146304676

# restart the training process from the previously trained GBM model using the `checkpoint` argument:
# the checkpoint argument requires the model id of the model on which you wish to continue building
# get the model's id from the "cars_gbm" model using `cars_gbm.model_id`
# the first model has 1 tree; to continue building the GBM with 49 additional trees,
# set ntrees = 50

# to see how many trees the original model built you can look at the `ntrees` attribute
print("Number of trees built for cars_gbm model:", cars_gbm.ntrees)
Number of trees built for cars_gbm model: 1

# build and train model with 49 additional trees for a total of 50 trees:
cars_gbm_continued = H2OGradientBoostingEstimator(checkpoint = cars_gbm.model_id, ntrees = 50, seed = 1234)
cars_gbm_continued.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(cars_gbm_continued.auc(valid=True))
0.9803921568627451

# to see how many trees the continuation model built you can look at the `ntrees` attribute
print("Number of trees built for cars_gbm_continued model:", cars_gbm_continued.ntrees)
Number of trees built for cars_gbm_continued model: 50

# you can also use checkpointing to pass in a new dataset in addition to increasing
# the number of trees/epochs. (See options above for parameters you cannot change.)
# simply change out the training and validation frames with your new dataset.
