GBM

Introduction

Gradient Boosted Regression and Gradient Boosted Classification are forward learning ensemble methods. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

The current version of GBM is fundamentally the same as in previous versions of H2O (same algorithmic steps, same histogramming techniques), with the exception of the following changes:

  • Improved ability to train on categorical variables (using the nbins_cats parameter)
  • Minor changes in histogramming logic for some corner cases

There was some code cleanup and refactoring to support the following features:

  • Per-row observation weights
  • Per-row offsets
  • N-fold cross-validation
  • Support for more distribution functions (such as Gamma, Poisson, and Tweedie)

Defining a GBM Model

  • model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

  • training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.

  • validation_frame: (Optional) Specify the dataset used to evaluate the accuracy of the model.

  • nfolds: Specify the number of folds for cross-validation.

  • response_column: (Required) Specify the column to use as the dependent variable. The data can be numeric or categorical.

  • ignored_columns: (Optional) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.

  • ignore_const_cols: Specify whether to ignore constant training columns, since no information can be gained from them. This option is enabled by default.

  • ntrees: Specify the number of trees to build.

  • max_depth: Specify the maximum tree depth.

  • min_rows: Specify the minimum number of observations for a leaf (nodesize in R).

  • nbins: (Numerical/real/int only) Specify the number of bins for the histogram to build, then split at the best point.

  • nbins_cats: (Categorical/enums only) Specify the maximum number of bins for the histogram to build, then split at the best point. Higher values can lead to more overfitting. The levels are ordered alphabetically; if there are more levels than bins, adjacent levels share bins. This value has a more significant impact on model fitness than nbins. Larger values may increase runtime, especially for deep trees and large clusters, so tuning may be required to find the optimal value for your configuration.

  • seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

  • learn_rate: Specify the learning rate. The range is 0.0 to 1.0.

  • distribution: Specify the loss function. The options are auto, bernoulli, multinomial, gaussian, poisson, gamma, tweedie, laplace, or quantile.

    • If the distribution is bernoulli, the response column must be 2-class categorical.
    • If the distribution is multinomial, the response column must be categorical.
    • If the distribution is poisson, the response column must be numeric.
    • If the distribution is gamma, the response column must be numeric.
    • If the distribution is tweedie, the response column must be numeric.
    • If the distribution is gaussian, the response column must be numeric.
    • If the distribution is laplace, the response column must be numeric and continuous.
    • If the distribution is quantile, the response column must be numeric and continuous.
  • sample_rate: Specify the row sampling rate. The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (Friedman, 1999).

  • col_sample_rate: Specify the column sampling rate (per split). The range is 0.0 to 1.0. Higher values may improve training accuracy. Test accuracy improves when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (Friedman, 1999).

  • col_sample_rate_change_per_level: This option changes the column sampling rate as a function of the depth in the tree; the specified value is the per-level multiplier (factor). For example:

    level 1: col_sample_rate

    level 2: col_sample_rate * factor

    level 3: col_sample_rate * factor^2

    level 4: col_sample_rate * factor^3

    etc.

  • min_split_improvement: The value of this option specifies the minimum relative improvement in squared error reduction in order for a split to happen. When properly tuned, this option can help reduce overfitting. Optimal values would be in the 1e-10...1e-3 range.

  • histogram_type: Use this option to specify the type of histogram to use for finding optimal split points. By default (AUTO), GBM bins from min to max in steps of (max-min)/N. Random split points or quantile-based split points can be selected as well. RoundRobin can be specified to cycle through all histogram types (one per tree). The available options are:

    • AUTO
    • UniformAdaptive
    • Random
    • QuantilesGlobal
    • RoundRobin
  • score_each_iteration: (Optional) Specify whether to score during each iteration of the model training.

  • fold_assignment: (Applicable only if a value for nfolds is specified and fold_column is not specified) Specify the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, Modulo, or Stratified (which will stratify the folds based on the response variable for classification problems).

  • score_tree_interval: Specify the interval, in number of trees, at which to score the model. Disabled if set to 0.

  • fold_column: Specify the column that contains the cross-validation fold index assignment per observation.

  • offset_column: (Not applicable if the distribution is multinomial) Specify a column to use as the offset.

    Note: Offsets are per-row “bias values” that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column: instead of learning to predict the response directly, the model learns to predict the response minus the per-row offset. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values. If the distribution is Bernoulli, the value must be less than one. (The Python sketch after this parameter list shows where an offset or weights column would be passed at training time.)

  • weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified weights_column must be included in the specified training_frame.

    Python only: To use a weights column when passing an H2OFrame to x instead of a list of column names, the specified training_frame must contain the specified weights_column.

    Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

  • balance_classes: Specify whether to oversample the minority classes to balance the class distribution. This option is not enabled by default and can increase the data frame size. This option is only applicable for classification. Majority classes can be undersampled to satisfy the max_after_balance_size parameter.

  • max_confusion_matrix_size: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.

  • max_hit_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.

  • r2_stopping: Specify a threshold for the coefficient of determination (r^2) metric value. When this threshold is met or exceeded, H2O stops making trees.

  • stopping_rounds: Stops training when the option selected for stopping_metric doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify 0. The metric is computed on the validation data (if provided); otherwise, training data is used. When used with overwrite_with_best_model, the final model is the best model generated for the given stopping_metric option.

    Note: If cross-validation is enabled:

    1. All cross-validation models stop training when the validation metric doesn’t improve.
    2. The main model runs for the mean number of epochs.
    3. N+1 models do not use overwrite_with_best_model
    4. N+1 models may be off by the number specified for stopping_rounds from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).
  • stopping_metric: Specify the metric to use for early stopping. The available options are:

    • AUTO: Logloss for classification, deviance for regression
    • deviance
    • logloss
    • MSE
    • AUC
    • r2
    • misclassification
  • stopping_tolerance: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value.

  • max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.

  • build_tree_one_node: To run on a single node, check this checkbox. This is suitable for small datasets as there is no network overhead but fewer CPUs are used.

  • quantile_alpha: (Only applicable if Quantile is specified for distribution) Specify the quantile to be used for Quantile Regression.

  • tweedie_power: (Only applicable if Tweedie is specified for distribution) Specify the Tweedie power. The range is from 1 to 2. A power of 1 corresponds to a Poisson distribution and a power of 2 to a gamma distribution; values greater than 1 but less than 2 give a compound Poisson-gamma distribution. (A power of 0 would correspond to a normal distribution, which is outside this range.) For more information, refer to Tweedie distribution.

  • checkpoint: Enter a model key associated with a previously-trained model. Use this option to build a new model as a continuation of a previously-generated model.

  • keep_cross_validation_predictions: Enable this option to keep the cross-validation predictions.

  • class_sampling_factors: Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance.

  • max_after_balance_size: Specify the maximum relative size of the training data after balancing class counts (balance_classes must be enabled). The value can be less than 1.0.

  • nbins_top_level: (For numerical/real/int columns only) Specify the minimum number of bins at the root level to use to build the histogram. This number will then be decreased by a factor of two per level.
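
The following is a minimal sketch (not part of the official parameter reference) of how several of the parameters above map onto H2O’s Python API. The file paths and column names are placeholders; adjust them to your own data.

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init()
    train = h2o.import_file("train.csv")                 # placeholder path
    valid = h2o.import_file("valid.csv")                 # placeholder path
    train["response"] = train["response"].asfactor()     # bernoulli needs a 2-class categorical response
    valid["response"] = valid["response"].asfactor()

    gbm = H2OGradientBoostingEstimator(
        model_id="my_gbm",
        distribution="bernoulli",
        ntrees=200,
        max_depth=5,
        min_rows=10,
        learn_rate=0.05,
        sample_rate=0.8,              # row sampling per tree
        col_sample_rate=0.8,          # column sampling per split
        nbins=20,
        nbins_cats=1024,
        histogram_type="AUTO",
        stopping_rounds=3,            # early stopping on the stopping_metric
        stopping_metric="AUC",
        stopping_tolerance=1e-3,
        score_tree_interval=10,
        seed=1234)

    gbm.train(x=["x1", "x2", "x3"],   # predictor columns (placeholders)
              y="response",
              training_frame=train,
              validation_frame=valid,
              weights_column="weight")  # optional per-row weights column, if present
              # an offset column could be passed the same way via offset_column="..."
    print(gbm.model_performance(valid).auc())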

Interpreting a GBM Model

The output for GBM includes the following:

  • Model parameters (hidden)
  • A graph of the scoring history (training MSE vs number of trees)
  • A graph of the variable importances
  • Output (model category, validation metrics, initf)
  • Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
  • Scoring history in tabular format
  • Training metrics (model name, model checksum name, frame name, description, model category, duration in ms, scoring time, predictions, MSE, R2)
  • Variable importances in tabular format

Leaf Node Assignment

Trees cluster observations into leaf nodes, and this information can be useful for feature engineering or model interpretability. Use h2o.predict_leaf_node_assignment(model, frame) to get an H2OFrame with the leaf node assignments, or click the checkbox when making predictions from Flow. Those leaf nodes represent decision rules that can be fed to other models (e.g., GLM with lambda search and strong rules) to obtain a limited set of the most important rules.
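
For example, a brief Python sketch (assuming gbm is a trained H2O GBM model and test is an H2OFrame with the same predictor columns as the training data):

    # One output column per tree; each entry is the path to the assigned leaf node.
    leaf_assignments = gbm.predict_leaf_node_assignment(test)
    leaf_assignments.head()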

FAQ

  • How does the algorithm handle missing values during training?
Missing values affect tree split points. NAs always “go right”, and hence affect the split-finding math (since the corresponding response for the row still matters). If the response is missing, the row does not affect the split-finding math. No new node is created for missing values. Instead, the observation is treated as if it had the maximum feature value of all observations in the node to be split. Note that the missing value might not be separated from the largest value itself. For example, if a node contains feature values of 0,1,2,3,4,5, then a missing value is counted as a 5. No matter what split decision is then made, the value 5 and the missing values won’t be separated; they stay together, even in splits further down the tree.
  • How does the algorithm handle missing values during testing?
During scoring, missing values “always go right” at any decision point in a tree. Due to dynamic binning in GBM, a row with a missing value typically ends up in the “rightmost bin” - with other outliers.
  • What happens if the response has missing values?
No errors will occur, but nothing will be learned from rows with a missing response.
  • What happens when you try to predict on a categorical level not seen during training?
GBM converts a new categorical level to an “undefined” value in the test set, and then splits either left or right during scoring.
  • Does it matter if the data is sorted?
No.
  • Should data be shuffled before training?
No.
  • How does the algorithm handle highly imbalanced data in a response column?
You can specify balance_classes, class_sampling_factors and max_after_balance_size to control over/under-sampling.
  • What if there are a large number of columns?
GBM models are best for datasets with fewer than a few thousand columns.
  • What if there are a large number of categorical factor levels?
Large numbers of categoricals are handled very efficiently - there is never any one-hot encoding.
  • Given the same training set and the same GBM parameters, will GBM produce a different model with two different validation data sets, or the same model?
The same model will be generated.
  • How deterministic is GBM?
The nfolds and balance_classes parameters use the seed directly. Otherwise, GBM is deterministic up to floating point rounding errors (out-of-order atomic addition by multiple threads during histogram building). Any observed AUC values should agree to at least three or four significant digits.
  • When fitting a random number between 0 and 1 as a single feature, the training ROC curve is consistent with ``random`` for low tree numbers and overfits as the number of trees is increased, as expected. However, when a random number is included as part of a set of hundreds of features, as the number of trees increases, the random number increases in feature importance. Why is this?
This is a known behavior of GBM that is similar to its behavior in R. If, for example, it takes 50 trees to learn all there is to learn from a frame without the random features, when you add a random predictor and train 1000 trees, the first 50 trees will be approximately the same. The final 950 trees are used to make sense of the random number, which will take a long time since there’s no structure. The variable importance will reflect the fact that all the splits from the last 950 trees are devoted to the random feature.
  • How is column sampling implemented for GBM?

For an example model using:

  • 100 columns
  • col_sample_rate_per_tree=0.754
  • col_sample_rate=0.8 (refers to available columns after per-tree sampling)

For each tree, the floor is used to determine the number of columns sampled per tree - in this example, floor(0.754 * 100) = 75 of the 100 columns are randomly picked for the tree. The floor is then applied again for each split decision - here, floor(0.8 * 75) = 60 columns are randomly chosen out of those 75, as the sketch below illustrates.
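
A small sketch of that arithmetic (plain Python, not H2O code):

    import math

    n_cols = 100
    col_sample_rate_per_tree = 0.754
    col_sample_rate = 0.8

    cols_per_tree = math.floor(col_sample_rate_per_tree * n_cols)   # 75 columns sampled per tree
    cols_per_split = math.floor(col_sample_rate * cols_per_tree)    # 60 columns sampled per split
    print(cols_per_tree, cols_per_split)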

  • I want to score multiple models on a huge dataset. Is it possible to score these models in parallel?
The best way to score models in parallel is to use the in-H2O binary models. To do this, import the binary (non-POJO, previously exported) model into an H2O cluster; import the datasets into H2O as well; call the predict endpoint from R, Python, Flow, or the REST API directly; then export the predictions to a file or download them from the server.
  • Are there any tutorials for GBM?
You can find tutorials for using GBM with R, Python, and Flow at the following location: https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/tutorials/gbm.

GBM Algorithm

H2O’s Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):

Initialize \(f_{k0} = 0, k=1,2,…,K\)

For \(m=1\) to \(M\):

  1. Set \(p_{k}(x)=\frac{e^{f_{k}(x)}}{\sum_{l=1}^{K}e^{f_{l}(x)}},k=1,2,…,K\)

  2. For \(k=1\) to \(K\):

    1. Compute \(r_{ikm}=y_{ik}-p_{k}(x_{i}),i=1,2,…,N\)
    2. Fit a regression tree to the targets \(r_{ikm},i=1,2,…,N\), giving terminal regions \(R_{jkm},j=1,2,…,J_{m}\)
    3. Compute \(\gamma_{jkm}=\frac{K-1}{K} \frac{\sum_{x_{i} \in R_{jkm}} r_{ikm}}{\sum_{x_{i} \in R_{jkm}}|r_{ikm}|(1-|r_{ikm}|)},j=1,2,…,J_m\).
    4. Update \(f_{km}(x)=f_{k,m-1}(x)+\sum_{j=1}^{J_m}\gamma_{jkm} I(x\in R_{jkm})\).

Output \(\hat{f_{k}}(x)=f_{kM}(x),k=1,2,…,K\)
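
The following is an illustrative single-machine sketch of this K-class loop, using NumPy and scikit-learn’s DecisionTreeRegressor as the base learner. It only mirrors the algebra above; H2O’s distributed, histogram-based implementation differs in many details (binning, sampling, early stopping, and so on).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbm_multiclass(X, y, K, M=50, max_depth=3):
        N = len(y)
        Y = np.eye(K)[y]                      # one-hot targets y_ik
        f = np.zeros((N, K))                  # f_k0 = 0
        ensemble = []
        for m in range(M):
            # p_k(x_i): softmax of the current scores
            p = np.exp(f - f.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            stage = []
            for k in range(K):
                r = Y[:, k] - p[:, k]         # residuals r_ikm
                tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
                leaves = tree.apply(X)        # terminal region R_jkm for each row
                gammas = {}
                for j in np.unique(leaves):
                    idx = leaves == j
                    num = r[idx].sum()
                    den = (np.abs(r[idx]) * (1.0 - np.abs(r[idx]))).sum()
                    gamma = (K - 1.0) / K * num / max(den, 1e-12)
                    f[idx, k] += gamma        # update f_km(x) on the training rows
                    gammas[int(j)] = gamma
                stage.append((tree, gammas))
            ensemble.append(stage)
        return ensemble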

Be aware that the column type affects how the histogram is created and the column type depends on whether rows are excluded or assigned a weight of 0. For example:

val    weight
1      1
0.5    0
5      1
3.5    0

The above vec has a real-valued type if passed as a whole, but if the zero-weighted rows are sliced away first, only the integer values 1 and 5 remain, so the column is treated as an integer type. The resulting histogram is either kept at full nbins resolution or potentially shrunk to the discrete integer range, which affects the split points.
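
A small illustration of this behavior (an assumed example, not from the documentation), parsing the full column versus only the rows that would carry non-zero weight:

    import h2o
    h2o.init()

    # Parsed as a whole, "val" contains fractional values and is typed as real.
    df_all = h2o.H2OFrame({"val": [1, 0.5, 5, 3.5], "weight": [1, 0, 1, 0]})

    # With the zero-weighted rows removed before parsing, only 1 and 5 remain,
    # so "val" is typed as an integer column, which can change the histogram range.
    df_used = h2o.H2OFrame({"val": [1, 5], "weight": [1, 1]})

    print(df_all.types["val"], df_used.types["val"])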

For more information about the GBM algorithm, refer to the Gradient Boosted Machines booklet.

Binning In GBM

Is the binning range-based or percentile-based?

It’s range-based, and re-binned at each tree split. NAs always “go to the left” (smallest) bin. There is a minimum number of observations required per leaf (min_rows, default 10). There has to be at least 1 FP ULP improvement in error to split (all-constant predictors won’t split). The number of bins is at least 1024 at the top level (see nbins_top_level), and is divided by 2 at each level down the tree until the nbins parameter (default: 20) is reached. Categoricals use a separate, more aggressive, binning range.

Re-binning means, for example: suppose your column C1 data is {1,1,2,4,8,16,100,1000}. Then a 20-way binning will use the range from 1 to 1000, binned in units of roughly 50. The first binning will be lumpy: {1,1,2,4,8,16},{100},{17 empty bins},{1000}. Suppose the split peels the {1000} bin off from the rest.

The next layer in the tree for the left split has values from 1 to 100 (not 1000!) and so re-bins in units of 5: {1,1,2,4},{8},{},{16},{lots of empty bins},{100} (the right-hand split has the single value 1000).

And so on: important dense ranges are split essentially logarithmically at each layer.
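
A toy illustration of this range-based re-binning (plain NumPy, not H2O code), using the example column above:

    import numpy as np

    def bin_indices(values, nbins=20):
        lo, hi = values.min(), values.max()
        width = (hi - lo) / nbins
        # clip so the maximum value falls into the last bin
        return np.minimum(((values - lo) // width).astype(int), nbins - 1)

    c1 = np.array([1, 1, 2, 4, 8, 16, 100, 1000], dtype=float)
    print(bin_indices(c1))        # first pass over the full 1..1000 range
    left = c1[c1 <= 100]          # after the {1000} bin is split off
    print(bin_indices(left))      # re-binned over the narrower 1..100 range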

What should I do if my variables have long-tailed distributions or large outliers?

You can try adding a new predictor column that is either pre-binned (e.g., as a categorical with “small”, “medium”, and “large” values) or log-transformed - and keep the original column as well.
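
For example, a brief sketch assuming an H2OFrame named train with a long-tailed numeric column "income" (the column name and cut points are placeholders):

    # Keep the original column; add a log-transformed and a coarsely binned version.
    train["log_income"] = (train["income"] + 1).log()
    train["income_bucket"] = train["income"].cut(
        breaks=[0, 1e4, 1e5, 1e9],
        labels=["small", "medium", "large"])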

References

Dietterich, Thomas G, and Eun Bae Kong. “Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms.” ML-95 255 (1995).

Elith, Jane, John R Leathwick, and Trevor Hastie. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77.4 (2008): 802-813.

Friedman, Jerome H. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics (2001): 1189-1232.

Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of Boosting Papers.” Annals of Statistics 32 (2004): 102-107.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Additive Logistic Regression: A Statistical View of Boosting (With Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28.2 (2000): 337-407.

Hastie, Trevor, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning. Vol. 1. Springer New York, 2001. p. 339.