ntrees
- Available in: GBM, DRF, XGBoost
- Hyperparameter: yes
Description
For tree-based algorithms, this option specifies the number of trees to build in the model. In a tree-based model, each internal node tests one feature (column) of the dataset. Every node except the root (top node) has exactly one incoming branch, and every node except the leaves (bottom nodes) has outgoing branches. Each branch represents a possible value, or range of values, of the feature tested at the node it leaves. A leaf holds the predicted value of the response, given the feature values along the chain of branches from the root to that leaf.
This option defaults to 50 trees.
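The tree structure described above can be sketched in a few lines of plain Python. This is an illustrative toy, not H2O's implementation: the `Leaf` and `Node` classes, the helper functions, and the leaf values are all hypothetical, and only the feature names `sex` and `pclass` are borrowed from the Titanic dataset. The sketch averages the trees, which is closest in spirit to DRF; GBM and XGBoost instead add up the contributions of successive trees.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    value: float  # predicted value of the response at this leaf


@dataclass
class Node:
    feature: str                 # feature tested at this node
    branches: Dict[str, "Tree"]  # one outgoing branch per possible feature value


Tree = Union[Leaf, Node]


def predict_tree(tree: Tree, row: Dict[str, str]) -> float:
    """Follow the chain of branches from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[row[tree.feature]]
    return tree.value


def predict_forest(trees: List[Tree], row: Dict[str, str]) -> float:
    """Average the predictions of all trees; len(trees) plays the role of ntrees."""
    return sum(predict_tree(t, row) for t in trees) / len(trees)


# Two hand-written toy trees; the leaf values are made up for illustration.
tree_a = Node("sex", {
    "female": Leaf(0.75),
    "male": Node("pclass", {"1": Leaf(0.5), "2": Leaf(0.25), "3": Leaf(0.25)}),
})
tree_b = Node("pclass", {"1": Leaf(0.5), "2": Leaf(0.5), "3": Leaf(0.25)})

row = {"sex": "female", "pclass": "3"}
print(predict_tree(tree_a, row))              # 0.75
print(predict_tree(tree_b, row))              # 0.25
print(predict_forest([tree_a, tree_b], row))  # 0.5
```

With more trees in the list, each individual tree has less influence on the combined prediction, which is the trade-off that `ntrees` controls in this simplified picture.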
Example
library(h2o)
h2o.init()
# import the titanic dataset:
# the task is to classify whether a passenger survived (1) or not (0)
# original dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html
titanic <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
# convert response column to a factor
titanic['survived'] <- as.factor(titanic['survived'])
# set the predictor names and the response column name
# predictors include all columns except 'name' and the response column ("survived")
predictors <- setdiff(colnames(titanic), colnames(titanic)[2:3])
response <- "survived"
# split into train and validation
titanic.splits <- h2o.splitFrame(data = titanic, ratios = .8, seed = 1234)
train <- titanic.splits[[1]]
valid <- titanic.splits[[2]]
# try a range of ntrees:
ntrees_values <- c(20, 50, 80, 110, 140, 170, 200)
for (n in ntrees_values) {
  titanic.gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                         validation_frame = valid, ntrees = n, nfolds = 5, seed = 1234)
  # print the value used and AUC score for train and valid
  print(paste(n, 'training score', h2o.auc(titanic.gbm, train = TRUE)))
  print(paste(n, 'validation score', h2o.auc(titanic.gbm, valid = TRUE)))
}
# Example of values to grid over for `ntrees`
hyper_params <- list( ntrees = c(20, 50, 80, 110, 140, 170, 200) )
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: list(strategy = "RandomDiscrete")
grid <- h2o.grid(x = predictors, y = response, training_frame = train, validation_frame = valid,
                 algorithm = "gbm", grid_id = "titanic_grid", hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)
## Sort the grid models by AUC
sortedGrid <- h2o.getGrid("titanic_grid", sort_by = "auc", decreasing = TRUE)
sortedGrid
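The difference between the two search strategies mentioned in the comments above can be sketched in plain Python. This is only an illustration of the idea, not how H2O implements grid search: the helper functions are hypothetical, and the `max_depth` values are added as an assumption purely to make the search space two-dimensional. A Cartesian search builds one model per combination, while a random discrete search samples a limited number of combinations.

```python
import itertools
import random

# ntrees values are taken from the example above; max_depth values are
# hypothetical and only serve to give the grid a second dimension.
hyper_params = {
    "ntrees": [20, 50, 80, 110, 140, 170, 200],
    "max_depth": [3, 5, 7],
}


def cartesian_grid(space):
    """Enumerate every combination of hyperparameter values."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*(space[name] for name in names))
    ]


def random_discrete_grid(space, max_models, seed):
    """Sample at most max_models distinct combinations without replacement."""
    rng = random.Random(seed)
    combos = cartesian_grid(space)
    return rng.sample(combos, min(max_models, len(combos)))


all_models = cartesian_grid(hyper_params)
sampled = random_discrete_grid(hyper_params, max_models=5, seed=1234)

print(len(all_models))  # 21 combinations: 7 ntrees values x 3 max_depth values
print(len(sampled))     # 5
```

For a single hyperparameter such as `ntrees` the Cartesian grid is small, so building every model is affordable; the sampled variant becomes attractive once several hyperparameters multiply the number of combinations.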