col_sample_rate_change_per_level

  • Available in: GBM, DRF
  • Hyperparameter: yes

Description

This option specifies the relative change of the column sampling rate at each level of the tree. It lets the per-split column sampling rate vary as a function of tree depth, as follows:

  • level 0: col_sample_rate
  • level 1: col_sample_rate * factor
  • level 2: col_sample_rate * factor^2
  • level 3: col_sample_rate * factor^3

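where factor is the value of col_sample_rate_change_per_level. A factor below 1.0 samples fewer columns at deeper levels, and a factor above 1.0 samples more.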

As indicated above, this option is multiplicative with col_sample_rate. It also combines with col_sample_rate_per_tree, so the effective sampling rate at a given depth is:

col_sample_rate_per_tree * col_sample_rate * col_sample_rate_change_per_level^depth

This option defaults to 1.0 (no change between levels) and accepts values greater than 0.0 and up to 2.0.
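
As a rough illustration of the formula above, the following plain-Python sketch (no H2O required) computes the effective rate at the first few levels for col_sample_rate = 0.8 and col_sample_rate_change_per_level = 0.9. The function name effective_col_sample_rate is hypothetical, and capping the result at 1.0 is an assumption made here for illustration, not documented H2O behavior.

```python
def effective_col_sample_rate(depth, col_sample_rate, change_per_level,
                              col_sample_rate_per_tree=1.0):
    """Effective column sampling rate at a given tree depth (illustrative only)."""
    rate = col_sample_rate_per_tree * col_sample_rate * change_per_level ** depth
    # Assumption: a rate above 1.0 cannot sample more than all columns, so cap it.
    return min(1.0, rate)


# Print the effective rate for levels 0 through 3.
for depth in range(4):
    rate = effective_col_sample_rate(depth, col_sample_rate=0.8, change_per_level=0.9)
    print(f"level {depth}: {rate:.4f}")
```

With these inputs the rate shrinks from 0.8 at the root to roughly 0.58 at level 3, so deeper splits consider progressively fewer columns.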

Example

library(h2o)
h2o.init()
# import the airlines dataset:
# this dataset is used to classify whether a flight will be delayed ("YES") or not ("NO")
# original data can be found at http://www.transtats.bts.gov/
airlines <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines["FlightNum"] <- as.factor(airlines["FlightNum"])

# set the predictor names and the response column name
predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum")
response <- "IsDepDelayed"

# split into train (80%) and validation (20%) sets
airlines.splits <- h2o.splitFrame(data = airlines, ratios = 0.8, seed = 1234)
train <- airlines.splits[[1]]
valid <- airlines.splits[[2]]

# try using the `col_sample_rate_change_per_level` parameter:
airlines.gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                        validation_frame = valid, col_sample_rate_change_per_level = 0.9,
                        seed = 1234)

# print the AUC for the validation data
print(h2o.auc(airlines.gbm, valid = TRUE))


# example of values to grid over for `col_sample_rate_change_per_level`
hyper_params <- list(col_sample_rate_change_per_level = c(0.3, 0.7, 0.8, 2))

# this example uses a Cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space, use
# random grid search instead: list(strategy = "RandomDiscrete")
# this GBM stops early once the validation AUC fails to improve by at least 0.01%
# over 5 consecutive scoring events
grid <- h2o.grid(x = predictors, y = response, training_frame = train, validation_frame = valid,
                 algorithm = "gbm", grid_id = "air_grid", hyper_params = hyper_params,
                 stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)

# sort the grid models by AUC, best first
sortedGrid <- h2o.getGrid("air_grid", sort_by = "auc", decreasing = TRUE)
sortedGrid