nbins
- Available in: GBM, DRF
- Hyperparameter: yes
Description
The nbins option specifies the number of bins used to build the histogram for each numerical column. Candidate split points are evaluated at the boundaries of these bins, and the node is then split at the best one.
Bins are of equal width, spanning the observed minimum to maximum of the subset being split (with a larger number of bins enforced at shallow tree depths). As the tree gets deeper, each subset (defined by the tree decisions above it) covers a smaller range, and the bins are spread uniformly over this smaller range. Bin ranges are thus determined at each node, not once per feature.
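The node-level binning described above can be sketched in plain Python. This is a simplified illustration, not H2O's implementation: the helper name node_bin_edges and the toy values are hypothetical, and the sketch ignores the larger bin count that H2O enforces at shallow depths.

```python
def node_bin_edges(values, nbins):
    """Return nbins + 1 equally spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    return [lo + i * width for i in range(nbins + 1)]

# Toy feature values reaching the root node (hypothetical data).
root = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
root_edges = node_bin_edges(root, nbins=4)

# Suppose the root split sends values below 4.0 to the left child.
left = [v for v in root if v < 4.0]
left_edges = node_bin_edges(left, nbins=4)

print(root_edges)  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(left_edges)  # [0.0, 0.75, 1.5, 2.25, 3.0]
```

The child node uses the same number of bins over a narrower range, so its candidate split points are finer than those of its parent.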
This value defaults to 20 bins. If a node contains few observations (but more than 10) and nbins is set to 20 (the default), some bins will be empty because there aren’t enough observations to populate every bin. As nbins goes up, the algorithm more closely approximates evaluating each individual observation as a split point. To make a model more general, decrease nbins_top_level and nbins_cats. To make a model more specific, increase nbins and/or nbins_top_level and nbins_cats. Keep in mind that increasing nbins_cats can have a dramatic effect on the amount of overfitting.
Example
library(h2o)
h2o.init()
# import the EEG dataset:
# All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset.
# The duration of the measurement was 117 seconds. The eye state was detected via a camera during the
# EEG measurement and added later manually to the file after analysing the video frames.
# '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological
# order with the first measured value at the top of the data.
# original dataset can be found at the UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/EEG+Eye+State
eeg <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv")
# convert response column to a factor
eeg['eyeDetection'] <- as.factor(eeg['eyeDetection'])
# set the predictor names and the response column name
predictors <- colnames(eeg)[1:(length(eeg)-1)]
response <- "eyeDetection"
# split into train and validation
eeg.splits <- h2o.splitFrame(data = eeg, ratios = .8, seed = 1234)
train <- eeg.splits[[1]]
valid <- eeg.splits[[2]]
# try a range of nbins:
bin_num <- c(8, 16, 32, 64, 128, 256, 512)
label <- c("8", "16", "32", "64", "128", "256", "512")
lapply(seq_along(bin_num), function(num) {
  eeg.gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, validation_frame = valid,
                     nbins = bin_num[num], nfolds = 5, seed = 1234)
  # print the value used and AUC score for train and valid
  print(paste(label[num], 'training score', h2o.auc(eeg.gbm, train = TRUE)))
  print(paste(label[num], 'validation score', h2o.auc(eeg.gbm, valid = TRUE)))
})
# Example of values to grid over for `nbins`
hyper_params <- list(nbins = c(8, 16, 32, 64, 128, 256, 512))
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: list(strategy = "RandomDiscrete")
# this GBM uses early stopping once the validation AUC doesn't improve by at least 0.01% for
# 5 consecutive scoring events
grid <- h2o.grid(x = predictors, y = response, training_frame = train, validation_frame = valid,
                 algorithm = "gbm", grid_id = "eeg_grid", hyper_params = hyper_params,
                 stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)
## Sort the grid models by AUC
sortedGrid <- h2o.getGrid("eeg_grid", sort_by = "auc", decreasing = TRUE)
sortedGrid