sample_rate_per_class
¶
- Available in: GBM, DRF
- Hyperparameter: yes
Description¶
When building models from imbalanced datasets, this option specifies that each tree in the ensemble should sample (without replacement) from the full training dataset using a per-class-specific sampling rate rather than a global sample factor (as with sample_rate
). The range for this option is 0.0 to 1.0.
Note: If sample_rate_per_class
is specified, then sample_rate
will be ignored.
Example¶
library(h2o)
h2o.init()
# import the covtype dataset:
# this dataset is used to classify the correct forest cover type
# original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
# convert response column to a factor
covtype[,55] <- as.factor(covtype[,55])
# set the predictor names and the response column name
predictors <- colnames(covtype[1:54])
response <- 'C55'
# split into train and validation sets
covtype.splits <- h2o.splitFrame(data = covtype, ratios = .8, seed = 1234)
train <- covtype.splits[[1]]
valid <- covtype.splits[[2]]
# look at the counts per class in the training set:
h2o.table(train[response])
# try using the `sample_rate_per_class` parameter:
# downsample the Class 2, and leave the rest the same
rate_per_class_list = c(1, .4, 1, 1, 1, 1, 1)
cov_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
validation_frame = valid, sample_rate_per_class = rate_per_class_list,
seed = 1234)
# print the logloss
print(h2o.logloss(cov_gbm, valid = TRUE))
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import the covtype dataset:
# this dataset is used to classify the correct forest cover type
# original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
# convert response column to a factor
covtype[54] = covtype[54].asfactor()
# set the predictor names and the response column name
predictors = covtype.columns[0:54]
response = 'C55'
# split into train and validation sets
train, valid = covtype.split_frame(ratios = [.8], seed = 1234)
# look at the counts per class in the training set:
print(train[response].table())
# try using the `sample_rate_per_class` parameter:
# downsample the Class 2, and leave the rest the same
rate_per_class_list = [1, .4, 1, 1, 1, 1, 1]
cov_gbm = H2OGradientBoostingEstimator(sample_rate_per_class = rate_per_class_list, seed = 1234)
cov_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the logloss for the validation data
print('logloss', cov_gbm.logloss(valid = True))