Splitting
^^^^^^^^^
- **Does the algo stop splitting when all the possible splits lead to worse error measures?**
It does if you use ``min_split_improvement`` (min_split_improvement turned ON by default (0.00001).) When properly tuned, this option can help reduce overfitting.
- **When does the algo stop splitting on an internal node?**
A single tree will stop splitting when there are no more splits that satisfy the minimum rows parameter, if it reaches ``max_depth``, or if there are no splits that satisfy the ``min_split_improvement`` parameter.
- **How does the minimum rows parameter work?**
``min_rows`` specifies the minimum number of observations for a leaf. If a user specifies ``min_rows = 500``, and they still have 500 TRUEs and 400 FALSEs, we won't split because we need 500 on both sides. The default for ``min_rows`` is 10, so this option rarely affects the GBM splits because GBMs are typically shallow, but the concept still applies.
- **How does GBM decide which feature to split on?**
It splits on the column and level that results in the greatest reduction in residual sum of the squares (RSS) in the subtree at that point. It considers all fields available from the algorithm. Note that any use of column sampling and row sampling will cause each decision to not consider all data points, and that this is on purpose to generate more robust trees. To find the best level, the histogram binning process is used to quickly compute the potential MSE of each possible split. The number of bins is controlled via ``nbins_cats`` for categoricals, the pair of ``nbins`` (the number of bins for the histogram to build, then split at the best point), and ``nbins_top_level`` (the minimum number of bins at the root level to use to build the histogram). This number will then be decreased by a factor of two per level.
For ``nbins_top_level``, higher = more precise, but potentially more prone to overfitting. Higher also takes more memory and possibly longer to run.
- **What is the difference between nbins and nbins_top_level?**
``nbins`` and ``nbins_top_level`` are both for numerics (real and integer). ``nbins_top_level`` is the number of bins GBM uses at the top of each tree. It then divides by 2 at each ensuing level to find a new number. ``nbins`` controls when GBM stops dividing by 2.
- **Doesn't GBM do the same thing as RF for col_sample_rate < 1 ?**
Yes for splitting, there is no difference between RF and GBM. They both use the same tree splitting.