ModelSelection¶
Introduction¶
We implemented the ModelSelection toolbox based on GLM at H2O to help users select the best predictor subsets from their dataset for model building. We have currently implemented four modes to select the predictor subsets:
mode = "allsubsets"
where all possible combinations of predictor subsets are generated for a given subset size. A model is built for each subset and the one with the highest \(R^2\) is returned. The best subsets are also returned for subset size \(1, 2, ..., n\). This mode guarantees to return the predictor subset with the highest \(R^2\) value at the cost of computation complexity.mode = "maxr"
where a sequential replacement method is used to find the best subsets for subset size of \(1, 2, ..., n\). However, the predictor subsets are not guaranteed to have the highest \(R^2\) value.mode = "backward"
where a model is built starting with all predictors. The predictor with the smallest absolute z-value (or z-score) is dropped after each model is built. This process repeats until only one predictor remains or until the number of predictors equal tomin_predictor_number
is reached. The model build can also be stopped usingp_values_threshold
.mode = "maxrsweep"
where the model runs similar tomode = "maxr"
except that instead of calling our GLM toolbox to build models, we use the sweep operator [3] plus our own incremental sweep operation using sweep vectors. This change speeds up the execution of finding the best predictor subset for each subset size and is essential in dropping the build time of the model.
The fastest mode for ModelSelection is mode = "maxrsweep"
with build_glm_model = False
. This skips the GLM model process while still generating the predictor subsets, the coefficients, and the coefficient values.
This model only supports GLM regression families.
Defining a ModelSelection Model¶
Parameters are optional unless specified as required. ModelSelection shares many GLM parameters.
Algorithm-specific parameters¶
build_glm_model: (Applicable for
mode = "maxrsweep"
only) If enabled, this option will return full GLM models with the desired predictor subsets. If disabled, only the predictor subsets and predictor coefficients are returned. Disabling this parameter speeds up the model selection process. You can also choose to build the GLM models themselves by using the returned predictor subsets. This option defaults toTrue
(enabled).max_predictor_number: Maximum number of predictors to be considered when building GLM models. This option defaults to
1
.min_predictor_number: (Applicable for
mode = "backward"
only) Minimum number of predictors to be considered when building GLM models starting with all predictors to be included. This option defaults to1
.mode: Required Specify the model selection algorithm to use. One of:
"maxr"
(default)"allsubsets"
"backward"
"maxrsweep"
nparallelism: Number of models to be built in parallel. This option defaults to
0.0
(which is adaptive to the system’s capabilities).p_values_threshold: (Applicable for
mode = "backward"
only) If specified, this option will stop the model building process when all coefficient p-values drop to or below this threshold. This option defaults to0.0
.score_iteration_interval: Perform scoring for every
score_iteration_interval
iteration. This option defaults to-1
.
Common parameters¶
custom_metric_func: Optionally specify a custom evaluation function.
early_stopping: Specify whether to stop early when there is no more relative improvement on the training or validation set. This option is set to
True
(enabled) by default.fold_assignment: (Applicable only if a value for
nfolds
is specified andfold_column
is not specified) Specify the cross-validation fold assignment scheme. One of:AUTO
(default; usesRandom
)Random
Modulo
(read more about Modulo)Stratified
(which will stratify the folds based on the response variable for classification problems)
fold_column: Specify the column that contains the cross-validation fold index assignment per observation.
ignore_const_cols: Enable this option to ignore constant training columns since no information can be gained from them. This option defaults to
True
(enabled).ignored_columns: (Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.
max_iterations: Specify the number of training iterations. This option defaults to
-1
.max_runtime_secs: Maximum allowed runtime in seconds for model training. This option defaults to
0
(unlimited).missing_values_handling: Specify how to handle missing values (one of:
Skip
,MeanImputation
(default), orPlugValues
).model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
nfolds: Specify the number of folds for cross-validation. The value can be set to
0
(default) to disable or to \(\geq\)2
.offset_column: Specify a column to use as the offset; the value cannot be the same as the value for the
weights_column
. This wll be added to the combination of columns before applying the link function.Note: Offsets are per-row “bias values” that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (
y
) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values.score_each_iteration: Enable this option to score during each iteration of the model training. This option defaults to
False
(disabled).seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This option defaults to
-1
(time-based random number).standardize: Specify whether to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option defaults to
True
(enabled).stopping_metric: Specify the metric to use for early stopping. The available options are:
AUTO
(default): (This defaults tologloss
for classification anddeviance
for regression)deviance
logloss
MSE
RMSE
MAE
RMSLE
AUC
(area under the ROC curve)AUCPR
(area under the Precision-Recall curve)lift_top_group
misclassification
mean_per_class_error
stopping_rounds: Stops training when the option selected for
stopping_metric
doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify0
(default).Note: If cross-validation is enabled:
All cross-validation models stop training when the validation metric doesn’t improve.
The main model runs for the mean number of epochs.
N+1 models may be off by the number specified for
stopping_rounds
from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).
stopping_tolerance: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value. This option defaults to
0.001
.training_frame: Required Specify the dataset used to build the model.
NOTE: In Flow, if you click the Build a model button from the
Parse
cell, the training frame is entered automatically.validation_frame: Specify the dataset used to evaluate the accuracy of the model.
weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified
weights_column
must be included in the specifiedtraining_frame
.Python only: To use a weights column when passing an H2OFrame to
x
instead of a list of column names, the specifiedtraining_frame
must contain the specifiedweights_column
.Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If
x
is missing, then all columns excepty
are used.y: Required Specify the column to use as the dependent variable.
For a regression model only, this column must be numeric (Real or Int).
Understanding ModelSelection mode = allsubsets
¶
Setting the H2O ModelSelection mode = allsubsets
guarantees the return of the model with the best \(R^2\) value.
For each predictor subset size \(x\):
For \(n\) predictors and using \(x\) predictors, first generate all possible combinations of \(x\) predictors out of the \(n\) predictors;
for each element in the combination of \(x\) predictors: generate the training frame, build the model, and look at the \(R^2\) value of the model;
the best \(R^2\) value, the predictor names, and the
model_id
of the best models are stored in arrays as well as H2OFrame;access functions are written in Java/R/Python to extract coefficients associated with the models with the best \(R^2\) values.
The main disadvantage of this mode is the long computation time.
Understanding ModelSelection mode = maxr
¶
The H2O ModelSelection mode = maxr
is implemented using the sequential replacement method [1]. This consists of a forward step and a replacement step. The sequential replacement method goes like this (where the predictors are denoted by A, B, C, …, Z):
Start with the current subset = {} (empty)
Forward step for 1 predictor subset:
add each available predictor (from A to Z) to the current empty subset and build a GLM model with each predictor subset;
save the model with the highest \(R^2\) for all models built with predictor A, B, …, Z;
set the new current subset = {predictor with highest \(R^2\) } (for example, predictor A).
Forward step for 2 predictor subset (starting with current subset = {A} ):
add each available predictor (from B to Z) to the current subset and build a GLM model;
save the model with the highest \(R^2\) for all models with predictor subsets AB, AC, …, AZ;
set the new current subset = {model with highest \(R^2\) } and save the best subset (for example, {AB}).
Replacement for 2 predictor subset from best subset chosen from forward step for 2 predictor subsets (i.e. starting from best subset {AB} from previous step):
fixing the second predictor, choose a different predictor for the first predictor from the remaining predictors C, D, …, Z (skipping predictor A as it was chosen already by forward step; B is taken as the second predictor). Then, build a GLM model for each new subset of (CB, DB, EB, …, ZB). Save the model with the highest \(R^2\) (for example, {DB}) from all models built with predictor subsets (CB, DB, EB, …, ZB);
fixing the first predictor, choose a different second predictor from the remaining predictor subset. Then, build a GLM model for each new subset generated. Save the model with the highest \(R^2\) from all models built;
compare the \(R^2\) value from the models built with forward step, step 4(a), and step 4(b) and choose the subset with the highest \(R^2\). If the best model is built with {AB}, proceed to step 5 because steps 4(a) and 4(b) generated no improvement. If the best model is built with {DB}, repeat steps 4(a), 4(b), and 4(c) until no improvement is found. For the two predictor case, the first 4(b) can be skipped since it is already done in the forward step.
Start with the best \(n\) predictor subset and forward step for \(n\) predictor subsets:
add each predictor available to the \(n\) predictor subset and build a GLM model;
save the model with the highest \(R^2\) for all models built with \(n+1\) predictor subsets;
Replacement for \(n+1\) predictor subsets:
Repeat for predictor in location 0,1,2,…,n:
keep all predictors fixed except in location k (k will be from 0,1,2,…,n) and switch out the predictor at location k with one predictor from the available predictors. If there are m predictors in the available predictor subset, m GLM models will be built and the model with the best \(R^2\) value will be saved;
from all the n best models found from step 6(a), if the best \(R^2\) value has improved from the forward step or the previous 6(a), return to 6(a). If no improvement is found, break and just take the best \(R^2\) model as the one to save.
Again, the best \(R^2\) value, the predictor names, and the model_id
of the best models are stored in arrays as well as H2OFrame. Additionally, coefficients associated with the models built with all the predictor subset sizes are available and accessible as well.
Understanding ModelSelection mode = backward
¶
A model with all predictors is built;
the z-values of all coefficients (except
intercept
) are considered. The coefficient with the smallest z-value magnitude is eliminated;a new model is built with the remaining predictors;
steps 2 and 3 are repeated until
no predictors are left,
min_predictor_number - 1
predictors are left, orp_values_threshold
condition is satisfied.
To increase flexibility in the model building process, you can stop the model building process by specifying a p_values_threshold
. When the p_values
of all predictors (except intercept
) are \(\leq\) p_values_threshold
, the model building process will stop as well.
Interpreting a ModelSelection Model¶
Result Frame¶
To help you understand your model, a result frame is generated at the end of the building process. For maxr
and allsubsets
modes, the result frame will contain:
model_name: string describing how many predictors are used to build the model
model_id: model ID of the GLM model built. You can use this model ID to obtain the original GLM model and perform scoring or anything else you want to do with an H2O model
best_r2_value: the highest \(R^2\) value from the predictor subsets of a fixed size
predictor_names: names of the predictors used to build the model
For backward
mode, the result frame will contain:
model_name: string describing how many predictors are used to build the model
model_id: model ID of the GLM model built. You can use this model ID to obtain the original GLM model and perform scoring or anything else you want to do with an H2O model
z_values: z-values of all coefficients of the GLM model
p_values: p-values of all coefficients of the GLM model
coefficient_names: coefficients (including
intercept
) of the GLM model
Model Coefficients¶
The coefficients of each model built for each predictor size are available. You can see how to access the coefficients in the Examples section.
Cross-Validation¶
ModelSelection supports cross-validation and the use of the validation dataset for mode = "maxr"
and mode = "allsubsets"
. Only family = gaussian
is supported.
For mode = "backward"
, cross-validation is not supported as the model selection process depends on training z-values and p-values. All GLM families are supported except for ordinal
and multinomial
.
Model Scoring¶
The model IDs of all models built for each predictor subset size are stored in the result frame. These IDs can be used to obtain the original models. They can be used for scoring just like any returned H2O models.
Examples¶
library(h2o)
h2o.init()
# Import the prostate dataset:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate.csv")
|======================================================================| 100%
# Set the predictors & response:
predictors <- c("AGE", "RACE", "CAPSULE", "DCAPS", "PSA", "VOL", "DPROS")
response <- "GLEASON"
# Build & train the model:
allsubsetsModel <- h2o.modelSelection(x = predictors,
y = response,
training_frame = prostate,
seed = 12345,
max_predictor_number = 7,
mode = "allsubsets")
|======================================================================| 100%
# Retrieve the results (H2OFrame containing best model_ids, best_r2_value, & predictor subsets):
results <- h2o.result(allsubsetsModel)
print(results)
model_name model_id best_r2_value predictor_names
1 best 1 predictor(s) model GLM_model_1637788524625_26 0.2058868 1 CAPSULE
2 best 2 predictor(s) model GLM_model_1637788524625_37 0.2695678 2 CAPSULE, PSA
3 best 3 predictor(s) model GLM_model_1637788524625_66 0.2862530 3 CAPSULE, DCAPS, PSA
4 best 4 predictor(s) model GLM_model_1637788524625_105 0.2904461 4 CAPSULE, DPROS, DCAPS, PSA
5 best 5 predictor(s) model GLM_model_1637788524625_130 0.2921695 5 CAPSULE, AGE, DPROS, DCAPS, PSA
6 best 6 predictor(s) model GLM_model_1637788524625_145 0.2924758 6 CAPSULE, AGE, RACE, DPROS, DCAPS, PSA
7 best 7 predictor(s) model GLM_model_1637788524625_152 0.2925563 7 CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL
# Retrieve the list of coefficients:
coeff <- h2o.coef(allsubsetsModel)
print(coeff)
[[1]]
Intercept CAPSULE
5.978584 1.007438
[[2]]
Intercept CAPSULE PSA
5.83309940 0.81073054 0.01458179
[[3]]
Intercept CAPSULE DCAPS PSA
5.34902149 0.75750144 0.47979555 0.01289096
[[4]]
Intercept CAPSULE DPROS DCAPS PSA
5.23924958 0.71845861 0.07616614 0.44257893 0.01248512
[[5]]
Intercept CAPSULE AGE DPROS DCAPS PSA
4.78548229 0.72070240 0.00687360 0.07827698 0.43777710 0.01245014
[[6]]
Intercept CAPSULE AGE RACE DPROS DCAPS PSA
4.853286962 0.717393309 0.006790891 -0.060686926 0.079288081 0.438470913 0.012572276
[[7]]
Intercept CAPSULE AGE RACE DPROS DCAPS PSA VOL
4.8526636043 0.7153633278 0.0069487980 -0.0584344031 0.0791810013 0.4353149856 0.0126060611 -0.0005196059
# Retrieve the list of coefficients for a subset size of 3:
coeff3 <- h2o.coef(allsubsetsModel, 3)
print(coeff3)
[[3]]
Intercept CAPSULE DCAPS PSA
5.34902149 0.75750144 0.47979555 0.01289096
# Retrieve the list of standardized coefficients:
coeff_norm <- h2o.coef_norm(allsubsetsModel)
print(coeff_norm)
[[1]]
Intercept CAPSULE
6.3842105 0.4947269
[[2]]
Intercept CAPSULE PSA
6.3842105 0.3981290 0.2916004
[[3]]
Intercept CAPSULE DCAPS PSA
6.3842105 0.3719895 0.1490516 0.2577879
[[4]]
Intercept CAPSULE DPROS DCAPS PSA
6.38421053 0.35281659 0.07617433 0.13749000 0.24967213
[[5]]
Intercept CAPSULE AGE DPROS DCAPS PSA
6.38421053 0.35391845 0.04486448 0.07828541 0.13599828 0.24897265
[[6]]
Intercept CAPSULE AGE RACE DPROS DCAPS PSA
6.38421053 0.35229345 0.04432463 -0.01873850 0.07929661 0.13621382 0.25141500
[[7]]
Intercept CAPSULE AGE RACE DPROS DCAPS PSA VOL
6.384210526 0.351296573 0.045355300 -0.018042981 0.079189523 0.135233408 0.252090622 -0.009533532
# Retrieve the list of standardized coefficients for a subset size of 3:
coeff_norm3 <- h2o.coef_norm(allsubsetsModel)
print(coeff_norm3)
[[3]]
Intercept CAPSULE DCAPS PSA
6.3842105 0.3719895 0.1490516 0.2577879
# Check the variables that were added during this process:
h2o.get_predictors_added_per_step(allsubsetsModel)
[,1]
[1,] "CAPSULE"
[2,] "PSA"
[3,] "DCAPS"
[4,] "DPROS"
[5,] "AGE"
[6,] "RACE"
[7,] "VOL"
# To find out which variables get removed, build a new model with ``mode = "backward``
# using the above training information:
bwModel <- h2o.modelSelection(x = predictors,
y = response,
training_frame = prostate,
seed = 12345,
max_predictor_number = 7,
mode = "backward")
h2o.get_predictors_removed_per_step(bwModel)
[,1]
[1,] "CAPSULE"
[2,] "PSA"
[3,] "DCAPS"
[4,] "DPROS"
[5,] "AGE"
[6,] "RACE"
[7,] "VOL"
# To build the fastest model with ModelSelection, use ``mode = "maxrsweep"``:
sweepModel <- h2o.modelSelection(x = predictors,
y = response,
training_frame = prostate,
mode = "maxrsweep",
build_glm_model = FALSE,
max_predictor_number = 3,
seed = 12345)
|======================================================================| 100%
# Retrieve the results to view the best predictor subsets:
h2o.result(sweepModel)
model_name best_r2_value coefficient_names predictor_names predictors_removed predictors_added
1 best 1 predictors model 0.2058873 CAPSULE, Intercept CAPSULE CAPSULE
2 best 2 predictors model 0.2695684 CAPSULE, PSA, Intercept CAPSULE, PSA PSA
3 best 3 predictors model 0.2862536 CAPSULE, PSA, DCAPS, Intercept CAPSULE, PSA, DCAPS DCAPS
import h2o
from h2o.estimators import H2OModelSelectionEstimator
h2o.init()
# Import the prostate dataset:
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate.csv")
Parse progress: ======================================= (done)| 100%
# Set the predictors & response:
predictors = ["AGE","RACE","CAPSULE","DCAPS","PSA","VOL","DPROS"]
response = "GLEASON"
# Build & train the model:
maxrModel = H2OModelSelectionEstimator(max_predictor_number=7,
seed=12345,
mode="maxr")
maxrModel.train(x=predictors, y=response, training_frame=prostate)
maxr Model Build progress: ======================================= (done)| 100%
# Retrieve the results (H2OFrame containing best model_ids, best_r2_value, & predictor subsets):
results = maxrModel.result()
print(results)
model_name model_id best_r2_value predictor_names
------------------------- --------------------------- --------------- ------------------------------------------
best 1 predictor(s) model GLM_model_1638380984255_2 0.205887 CAPSULE
best 2 predictor(s) model GLM_model_1638380984255_13 0.269568 CAPSULE, PSA
best 3 predictor(s) model GLM_model_1638380984255_42 0.286253 CAPSULE, DCAPS, PSA
best 4 predictor(s) model GLM_model_1638380984255_81 0.290446 CAPSULE, DPROS, DCAPS, PSA
best 5 predictor(s) model GLM_model_1638380984255_106 0.29217 CAPSULE, AGE, DPROS, DCAPS, PSA
best 6 predictor(s) model GLM_model_1638380984255_121 0.292476 CAPSULE, AGE, RACE, DPROS, DCAPS, PSA
best 7 predictor(s) model GLM_model_1638380984255_128 0.292556 CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL
[7 rows x 4 columns]
# Retrieve the list of coefficients:
coeff = maxrModel.coef()
print(coeff)
# [{‘Intercept’: 5.978584176203302, ‘CAPSULE’: 1.0074379937434323},
# {‘Intercept’: 5.83309940166519, ‘CAPSULE’: 0.8107305373380133, ‘PSA’: 0.01458178860012023},
# {‘Intercept’: 5.349021488372978, ‘CAPSULE’: 0.757501440465183, ‘DCAPS’: 0.47979554935185015, ‘PSA’: 0.012890961277678725},
# {‘Intercept’: 5.239249580225221, ‘CAPSULE’: 0.7184586144005665, ‘DPROS’: 0.07616613714619831, ‘DCAPS’: 0.4425789341205361, ‘PSA’: 0.012485121785672872},
# {‘Intercept’: 4.785482292681689, ‘CAPSULE’: 0.7207023955198935, ‘AGE’: 0.006873599969264931, ‘DPROS’: 0.07827698214607832, ‘DCAPS’: 0.4377770966619996, ‘PSA’: 0.012450143759298283},
# {‘Intercept’: 4.853286962151182, ‘CAPSULE’: 0.7173933092205801, ‘AGE’: 0.00679089119920351, ‘RACE’: -0.06068692599374028, ‘DPROS’: 0.07928808123744804, ‘DCAPS’: 0.4384709133624667, ‘PSA’: 0.012572275831333262},
# {‘Intercept’: 4.852663604264297, ‘CAPSULE’: 0.7153633277776693, ‘AGE’: 0.006948797960002643, ‘RACE’: -0.05843440305164041, ‘DPROS’: 0.07918100130777159, ‘DCAPS’: 0.43531498557623927, ‘PSA’: 0.012606061059188276, ‘VOL’: -0.0005196059470357373}]
# Retrieve the list of coefficients for a subset size of 3:
coeff3 = maxrModel.coef(3)
print(coeff3)
# {'Intercept': 5.349021488372978, 'CAPSULE': 0.757501440465183, 'DCAPS': 0.47979554935185015, 'PSA': 0.012890961277678725}
# Retrieve the list of standardized coefficients:
coeff_norm = maxrModel.coef_norm()
print(coeff_norm)
# [{‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.49472694682382257},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.39812896270042736, ‘PSA’: 0.29160037716849074},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.37198951914000183, ‘DCAPS’: 0.1490515817762952, ‘PSA’: 0.25778793491797924},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.3528165891390707, ‘DPROS’: 0.07617433400499243, ‘DCAPS’: 0.13749000023165447, ‘PSA’: 0.24967213018482057},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.353918452469022, ‘AGE’: 0.04486447687517968, ‘DPROS’: 0.07828540617010687, ‘DCAPS’: 0.1359982784564225, ‘PSA’: 0.2489726545605919},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.352293445102015, ‘AGE’: 0.044324630838403115, ‘RACE’: -0.018738499858626197, ‘DPROS’: 0.07929661407409055, ‘DCAPS’: 0.1362138170890904, ‘PSA’: 0.2514149995462732},
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.35129657330683034, ‘AGE’: 0.04535529952002336, ‘RACE’: -0.018042981011017332, ‘DPROS’: 0.07918952262067014, ‘DCAPS’: 0.13523340776861126, ‘PSA’: 0.25209062209542776, ‘VOL’: -0.009533532448945743}]
# Retrieve the list of standardized coefficients for a subset size of 3:
coeff_norm3 = maxrModel.coef_norm(3)
print(coeff_norm3)
# {‘Intercept’: 6.38421052631579, ‘CAPSULE’: 0.37198951914000183, ‘DCAPS’: 0.1490515817762952, ‘PSA’: 0.25778793491797924}
# Check the variables that were added during this process:
maxrModel.get_predictors_added_per_step()
[['CAPSULE'], ['PSA'], ['DCAPS'], ['DPROS'], ['AGE'], ['RACE'], ['VOL']]
# Using the above training information, build a model using ``mode = "backward"``:
bwModel = H2OModelSelectionEstimator(max_predictor_number=3,
seed=12345,
mode="backward")
bwModel.train(x=predictors, y=response, training_frame=prostate)
ModelSelection Model Summary: summary
coefficient_names z_values p_values
----------------- ------------------------------- --------------------------------------------------------------------------- ----------------------------------------------------------------------------------------
with 1 predictors CAPSULE, Intercept 9.899643676508614, 92.43746760936982 1.070331637158796E-20, 1.3321139829486397E-261
with 2 predictors CAPSULE, PSA, Intercept 7.825700947986458, 5.733056921838707, 86.91622746127426 5.144662722557474E-14, 2.023486352710146E-8, 1.7241718600984578E-251
with 3 predictors CAPSULE, DCAPS, PSA, Intercept 7.275417885570092, 2.964750742738588, 4.992785143892783, 30.274880599946904 2.0273323955515335E-12, 0.0032224082063575395, 9.124834372427609E-7, 7.417923313036E-103
# Check the variables that were removed during this process:
bwModel.get_predictors_removed_per_step()
[['CAPSULE'], ['PSA'], ['DCAPS'], ['DPROS'], ['AGE'], ['RACE'], ['VOL']]
# To build the fastest model with ModelSelection, use ``mode="maxrsweep"``:
sweepModel = H2OModelSelectionEstimator(mode="maxrsweep",
build_glm_model=False,
max_predictor_number=3,
seed=12345)
sweepModel.train(x=predictors, y=response, training_frame=prostate)
modelselection Model Build progress: ======================================= (done)| 100%
# Retrieve the results to view the best predictor subsets:
print(sweepModel.results())
model_name best_r2_value coefficient_names predictor_names predictors_removed predictors_added
best 1 predictors model 0.205887 CAPSULE, Intercept CAPSULE CAPSULE
best 2 predictors model 0.269568 CAPSULE, PSA, Intercept CAPSULE, PSA PSA
best 3 predictors model 0.286254 CAPSULE, PSA, DCAPS, Intercept CAPSULE, PSA, DCAPS DCAPS
[3 rows x 6 columns]
References¶
Alan Miller, Subset Selection in Regression, section 3.5, Second Edition, 2002 Chapman & Hall/CRC.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Section 3.3.2, Second Edition, Springer, 2008.
Schatzoff, R. Tsao, S. Fierberg, “Efficient Calculation of All Possible Regressions”, TECHNOMETRICS, Vol. 10, No. 4, NOVEMBER 1968.