Splitting
- Does the algo stop splitting when all the possible splits lead to worse error measures?
It does if you use min_split_improvement, which is on by default with a value of 0.00001. When properly tuned, this option can help reduce overfitting.
- When does the algo stop splitting on an internal node?
A single tree stops splitting a node when no remaining split satisfies the minimum rows parameter (min_rows), when the node reaches max_depth, or when no split satisfies the min_split_improvement parameter.
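The three stopping conditions above can be sketched as a single check. This is a minimal illustration, not H2O's implementation: the function name `can_split`, its arguments, and the exact comparison operators are assumptions; only the parameter names and the quoted defaults for min_rows and min_split_improvement come from this FAQ.

```python
def can_split(depth, left_rows, right_rows, improvement, *,
              max_depth, min_rows=10, min_split_improvement=1e-5):
    """Return True if a candidate split passes all three stopping checks.

    Hypothetical helper for illustration only; the real check happens
    inside the tree builder.
    """
    # The node is already at the maximum allowed depth.
    if depth >= max_depth:
        return False
    # Each child must keep at least min_rows observations.
    if left_rows < min_rows or right_rows < min_rows:
        return False
    # The split must improve the error by at least the threshold
    # (assumption: a value equal to the threshold is accepted).
    if improvement < min_split_improvement:
        return False
    return True


print(can_split(2, 50, 50, 0.01, max_depth=5))  # True
print(can_split(5, 50, 50, 0.01, max_depth=5))  # False: depth limit reached
print(can_split(2, 5, 50, 0.01, max_depth=5))   # False: left child too small
print(can_split(2, 50, 50, 0.0, max_depth=5))   # False: no improvement
```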
- How does the minimum rows parameter work?
min_rows specifies the minimum number of observations in a leaf. For example, if a user specifies min_rows = 500 and a node holds 500 TRUEs and 400 FALSEs, GBM will not split it, because each side of the split needs at least 500 observations. The default for min_rows is 10, so this option rarely affects GBM splits because GBMs are typically shallow, but the concept still applies.
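The 500/400 example can be checked with a few lines of code. The helper name `satisfies_min_rows` is hypothetical and only mirrors the rule described above.

```python
def satisfies_min_rows(left_rows, right_rows, min_rows):
    """Both children of a split must hold at least min_rows observations."""
    return left_rows >= min_rows and right_rows >= min_rows


# With min_rows = 500, a 500/400 split is rejected: one side has only 400 rows.
print(satisfies_min_rows(500, 400, min_rows=500))  # False

# With the default of 10, the same split is allowed.
print(satisfies_min_rows(500, 400, min_rows=10))   # True
```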
- How does GBM decide which feature to split on?
It splits on the column and level that give the greatest reduction in the residual sum of squares (RSS) in the subtree at that point. It considers all fields available to the algorithm. Note that any use of column sampling or row sampling means that each decision does not consider all of the data; this is intentional and produces more robust trees. To find the best level, GBM uses histogram binning to quickly compute the potential MSE of each possible split. The number of bins is controlled by nbins_cats for categoricals, and by the pair nbins and nbins_top_level for numerics. nbins is the number of bins for the histogram to build before splitting at the best point. nbins_top_level is the minimum number of bins used to build the histogram at the root level; this number is then decreased by a factor of two per level. For nbins_top_level, a higher value is more precise but potentially more prone to overfitting. A higher value also takes more memory and may take longer to run.
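To make the histogram idea concrete, here is a minimal sketch of choosing a split point on one numeric column by RSS reduction. It assumes equal-width bins and a single pass over the data; the function name and the binning scheme are illustrative assumptions, not H2O's actual implementation.

```python
def best_histogram_split(xs, ys, nbins):
    """Return (rss_reduction, threshold) for the best bin-boundary split.

    Rows with x < threshold go left. Returns (0.0, None) if no split helps.
    """
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return 0.0, None  # constant column: nothing to split on
    width = (hi - lo) / nbins

    # Per-bin count, sum of y, and sum of y squared.
    cnt = [0] * nbins
    s = [0.0] * nbins
    ss = [0.0] * nbins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), nbins - 1)  # clamp the maximum value
        cnt[b] += 1
        s[b] += y
        ss[b] += y * y

    def rss(n, total, total_sq):
        # Sum of squared deviations from the mean, from the running sums.
        return total_sq - total * total / n if n else 0.0

    n_all, s_all, ss_all = sum(cnt), sum(s), sum(ss)
    parent = rss(n_all, s_all, ss_all)

    best_gain, best_threshold = 0.0, None
    n = 0
    t = q = 0.0
    for b in range(nbins - 1):  # candidate boundary after bin b
        n += cnt[b]
        t += s[b]
        q += ss[b]
        if n == 0 or n == n_all:
            continue  # one side would be empty
        gain = parent - rss(n, t, q) - rss(n_all - n, s_all - t, ss_all - q)
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_gain, best_threshold


xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_histogram_split(xs, ys, nbins=4))  # (2.0, 7.0)
```

Only the bin statistics are scanned when evaluating candidates, which is why more bins give finer split points at the cost of more memory and time.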
- What is the difference between nbins and nbins_top_level?
nbins and nbins_top_level both apply to numeric columns (real and integer). nbins_top_level is the number of bins GBM uses at the top of each tree. GBM then divides that number by 2 at each ensuing level. nbins controls when GBM stops dividing by 2.
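One way to read this schedule is that the bin count halves at each level until it would fall below nbins, which then acts as a floor. The sketch below encodes that reading; the function name is hypothetical, and the values 20 and 1024 are illustrative inputs rather than settings taken from this FAQ.

```python
def bins_at_depth(depth, nbins, nbins_top_level):
    """Bin count at a given tree depth, under the halving-with-floor reading."""
    return max(nbins, nbins_top_level // (2 ** depth))


# Halve from 1024 at the root, never going below 20.
print([bins_at_depth(d, nbins=20, nbins_top_level=1024) for d in range(8)])
# [1024, 512, 256, 128, 64, 32, 20, 20]
```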
- Doesn’t GBM do the same thing as RF for col_sample_rate < 1?
Yes. For splitting, there is no difference between RF and GBM; both use the same tree-splitting logic.