AdaBoost
Note: This is a beta version of the algorithm.
Introduction
AdaBoost, short for Adaptive Boosting, is a machine learning ensemble technique that combines multiple weak base learners, typically decision trees with limited depth, into a single strong predictive model. In each iteration, AdaBoost assigns higher weights to misclassified data points so that subsequent weak learners focus on those instances, progressively refining the model's performance. The final model is a weighted sum of the weak learners' predictions, yielding a robust classifier that handles complex datasets and generalizes well. This emphasis on misclassified instances and the iterative learning process make AdaBoost a popular choice for classification tasks across many domains.
H2O's implementation of AdaBoost follows the specification in Rojas, R. (2009), "AdaBoost and the Super Bowl of Classifiers - A Tutorial Introduction to Adaptive Boosting". It can only be used to solve binary classification problems.
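To make the reweighting loop concrete, below is a minimal NumPy sketch of the classic discrete AdaBoost procedure in the spirit of Rojas (2009). It is an illustration, not H2O's internal implementation: the weak_fit helper is hypothetical, and scaling the learner weight alpha by learn_rate is an assumption made for this sketch.

import numpy as np

def adaboost_fit(X, y, weak_fit, n_learners=50, learn_rate=0.5):
    # Classic discrete AdaBoost; y must be a NumPy array with values in {-1, +1}.
    # weak_fit(X, y, w) is a hypothetical helper that returns a classifier h,
    # fitted to the weighted data, with h(X) in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from uniform observation weights
    learners, alphas = [], []
    for _ in range(n_learners):
        h = weak_fit(X, y, w)
        pred = h(X)
        err = w[pred != y].sum()         # weighted misclassification error
        if err == 0 or err >= 0.5:       # stop on a perfect or no-better-than-chance learner
            break
        alpha = learn_rate * 0.5 * np.log((1 - err) / err)  # learner weight (learn_rate scaling is an assumption)
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified points
        w /= w.sum()                     # renormalize to keep a distribution
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # The final model is the sign of the weighted sum of the weak learners' votes
    return np.sign(sum(a * h(X) for h, a in zip(learners, alphas)))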
Defining an AdaBoost Model
Algorithm-specific parameters
learn_rate: Specify the learning rate. The range is 0.0 to 1.0, and the default value is 0.5.
nlearners: Number of AdaBoost weak learners.
weak_learner: Choose a weak learner type. Must be one of: "AUTO", "DRF", "GBM", or "GLM". Defaults to "AUTO" (which means "DRF"). See the example after this list.
- DRF: Trains only one tree in each iteration with the following parameters: (mtries=1, min_rows=1, sample_rate=1, max_depth=1)
- GBM: Trains only one tree in each iteration with the following parameters: (mtries=1, min_rows=1, sample_rate=1, max_depth=1, learn_rate=0.1)
- GLM: Trains a binary classifier with max_iterations=50.
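The following hedged Python sketch trains one AdaBoost model per weak learner type on the prostate dataset used in the Examples section and compares training logloss; the dataset choice and the metric are illustrative, not a recommendation:

import h2o
from h2o.estimators import H2OAdaBoostEstimator

h2o.init()
prostate = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

# Train one AdaBoost model per weak learner type and compare training logloss
for wl in ["DRF", "GBM", "GLM"]:
    model = H2OAdaBoostEstimator(nlearners=50, learn_rate=0.5, weak_learner=wl, seed=42)
    model.train(y="CAPSULE", training_frame=prostate)
    print(wl, model.logloss())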
Common parameters
categorical_encoding: Specify one of the following encoding schemes for handling categorical features. For AdaBoost, only the ordinal nature of the encoding is used for splitting. A usage sketch follows this list.
- auto or AUTO (default): Allow the algorithm to decide. For AdaBoost, the algorithm will automatically perform enum encoding.
- enum or Enum: 1 column per categorical feature.
- enum_limited or EnumLimited: Automatically reduce categorical levels to the most prevalent ones during training and only keep the T (10) most frequent levels.
- one_hot_explicit or OneHotExplicit: N+1 new columns for categorical features with N levels.
- binary or Binary: No more than 32 columns per categorical feature.
- eigen or Eigen: k columns per categorical feature, keeping projections of the one-hot-encoded matrix onto the k-dimensional eigen space only.
- label_encoder or LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.).
- sort_by_response or SortByResponse: Reorders the levels by the mean response (for example, the level with the lowest response -> 0, the level with the second-lowest response -> 1, etc.).
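As referenced above, here is a minimal sketch of selecting an explicit encoding scheme; enum_limited is chosen arbitrarily for illustration, and the prostate dataset is the one used in the Examples section:

import h2o
from h2o.estimators import H2OAdaBoostEstimator

h2o.init()
prostate = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

# Override the AUTO default with an explicit categorical encoding
model = H2OAdaBoostEstimator(nlearners=50,
                             categorical_encoding="enum_limited",
                             seed=42)
model.train(y="CAPSULE", training_frame=prostate)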
ignore_const_cols: Specify whether to ignore constant training columns, since no information can be gained from them. This option defaults to True (enabled).
ignored_columns: (Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.
model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This value defaults to -1 (time-based random number).
training_frame: Required. Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified weights_column must be included in the specified training_frame. By default, the AdaBoost algorithm generates a constant weights column with value 1. See the sketch after this list.
x: Specify a vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y: Required. Specify the column to use as the dependent variable. The data must be categorical with exactly two levels (binary).
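The sketch below ties several common parameters together: a fixed seed for reproducibility, an ignored ID column, and an explicit constant weights column that mirrors the documented default behavior. The column name obs_weight and the model_id value are made up for this example:

import h2o
from h2o.estimators import H2OAdaBoostEstimator

h2o.init()
prostate = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

# An explicit constant weights column; equivalent to the documented default behavior
prostate["obs_weight"] = 1.0

model = H2OAdaBoostEstimator(model_id="adaboost_demo",    # hypothetical custom name
                             nlearners=50,
                             ignored_columns=["ID"],      # drop the row identifier
                             seed=1234)
model.train(y="CAPSULE", training_frame=prostate, weights_column="obs_weight")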
Examples
Below are simple R and Python examples showing how to build an AdaBoost model.
# R
library(h2o)
h2o.init()

# Import the prostate dataset into H2O:
prostate <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
predictors <- c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
response <- "CAPSULE"
prostate[response] <- as.factor(prostate[response])

# Build and train the model:
adaboost_model <- h2o.adaBoost(nlearners = 50,
                               learn_rate = 0.5,
                               weak_learner = "DRF",
                               x = predictors,
                               y = response,
                               training_frame = prostate)

# Generate predictions:
h2o.predict(adaboost_model, prostate)
# Python
import h2o
from h2o.estimators import H2OAdaBoostEstimator
h2o.init()

# Import the prostate dataset into H2O:
prostate = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

# Build and train the model:
adaboost_model = H2OAdaBoostEstimator(nlearners=50,
                                      learn_rate=0.8,
                                      weak_learner="DRF",
                                      seed=0xBEEF)
adaboost_model.train(y="CAPSULE", training_frame=prostate)

# Generate predictions:
pred = adaboost_model.predict(prostate)
pred
References
Rojas, R. (2009). "AdaBoost and the Super Bowl of Classifiers - A Tutorial Introduction to Adaptive Boosting."
Niculescu-Mizil, A. & Caruana, R. (2012). "Obtaining Calibrated Probabilities from Boosting."
Freund, Y. & Schapire, R. E. (1995). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting."