- Does the algo stop splitting when all the possible splits lead to worse error measures?
It does if you use min_split_improvement, which is turned on by default (0.00001). When properly tuned, this option can help reduce overfitting.
- When does the algo stop splitting on an internal node?
A single tree stops splitting when it reaches max_depth, when no remaining split satisfies the minimum rows parameter (min_rows), or when no remaining split satisfies the min_split_improvement parameter.
- How does the minimum rows parameter work?
min_rows specifies the minimum number of observations for a leaf. If a user specifies min_rows = 500 and a node still has 500 TRUEs and 400 FALSEs, GBM won’t split it because it needs at least 500 observations on both sides. The default for min_rows is 10, so this option rarely affects GBM splits because GBMs are typically shallow, but the concept still applies.
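The stopping rules from the answers above, including the min_rows example, can be sketched as a single check. This is a simplified illustration and not H2O's implementation: the function name `can_split`, the `max_depth=5` value, and the exact way the improvement is measured and compared are assumptions made for the example.

```python
def can_split(left_rows, right_rows, depth, improvement,
              min_rows=10, max_depth=5, min_split_improvement=1e-5):
    """Return True if a candidate split passes all three stopping checks.

    Hypothetical helper: the parameter names mirror the GBM options, but the
    comparison details are assumptions for illustration only.
    """
    if depth >= max_depth:
        return False  # the tree has already reached max_depth
    if left_rows < min_rows or right_rows < min_rows:
        return False  # one side would hold fewer than min_rows observations
    if improvement < min_split_improvement:
        return False  # the error reduction is too small to justify a split
    return True


# The example from the text: 500 rows on one side, 400 on the other.
print(can_split(500, 400, depth=0, improvement=0.1, min_rows=500))  # False
print(can_split(500, 400, depth=0, improvement=0.1))                # True (min_rows=10)
```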
- How does GBM decide which feature to split on?
It splits on the column and level that result in the greatest reduction in residual sum of squares (RSS) in the subtree at that point. It considers all fields available to the algorithm. Note that column sampling and row sampling cause each decision to consider only a subset of the data; this is intentional and produces more robust trees. To find the best level, GBM uses histogram binning to quickly compute the potential MSE of each possible split. The number of bins is controlled via nbins_cats for categoricals and, for numerics, the pair of nbins (the number of bins for the histogram to build, then split at the best point) and nbins_top_level (the minimum number of bins at the root level to use to build the histogram). The nbins_top_level number is then decreased by a factor of two per level.
For nbins_top_level, higher values are more precise but potentially more prone to overfitting. Higher values also take more memory and possibly longer to run.
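To make the histogram idea concrete, here is a minimal sketch of finding the best split point for one numeric column by scanning bin boundaries. It is an illustration under stated assumptions, not H2O's actual code: the function name `best_histogram_split`, the equal-width binning, and the use of per-side means as predictions are all choices made for this example.

```python
def best_histogram_split(xs, ys, nbins=20):
    """Return (threshold, rss_reduction) for the best bin-boundary split.

    Hypothetical helper: bins xs into equal-width buckets, keeps only a count
    and a sum of ys per bucket, then evaluates every boundary in one pass.
    """
    lo, hi = min(xs), max(xs)
    if lo == hi:
        return None, 0.0  # a constant column cannot be split
    width = (hi - lo) / nbins
    counts = [0] * nbins
    sums = [0.0] * nbins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), nbins - 1)  # clamp the max value into the last bin
        counts[b] += 1
        sums[b] += y

    n, total = len(ys), sum(ys)
    best_threshold, best_gain = None, 0.0
    left_n, left_sum = 0, 0.0
    for b in range(nbins - 1):
        left_n += counts[b]
        left_sum += sums[b]
        right_n = n - left_n
        if left_n == 0 or right_n == 0:
            continue  # skip boundaries that leave one side empty
        right_sum = total - left_sum
        # RSS reduction when each side predicts its own mean:
        # S_left^2 / n_left + S_right^2 / n_right - S_total^2 / n
        gain = left_sum ** 2 / left_n + right_sum ** 2 / right_n - total ** 2 / n
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_histogram_split(xs, ys, nbins=4))
```

Because only per-bin counts and sums are needed, the scan costs one pass over the rows plus one pass over the bins, which is why more bins mean more precision but also more memory and time.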
- What is the difference between nbins and nbins_top_level?
nbins and nbins_top_level are both for numerics (real and integer).
nbins_top_level is the number of bins GBM uses at the top of each tree. It then divides by 2 at each ensuing level to find a new number.
nbins controls when GBM stops dividing by 2.
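The halving rule can be sketched in a few lines. This is an interpretation of the description above rather than H2O's code: the helper name `bins_at_depth` is hypothetical, and the values nbins=20 and nbins_top_level=1024 are assumed defaults that should be checked against your H2O version.

```python
def bins_at_depth(depth, nbins=20, nbins_top_level=1024):
    """Number of histogram bins for numeric columns at a given tree depth.

    Hypothetical helper: start at nbins_top_level, halve once per level,
    and never go below nbins.
    """
    return max(nbins, nbins_top_level // (2 ** depth))


print([bins_at_depth(d) for d in range(8)])
# [1024, 512, 256, 128, 64, 32, 20, 20]
```

With these values the halving stops at depth 6, where 1024 / 64 = 16 would fall below nbins, so 20 bins are used from there on.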
- Doesn’t GBM do the same thing as RF for col_sample_rate < 1?
Yes. For splitting, there is no difference between RF and GBM; both use the same tree-splitting logic.