Model Categories¶
-
class
h2o.model.
ModelBase
[source]¶ Bases:
h2o.model.model_base.ModelBase
Base class for all models.
-
property
actual_params
¶ Dictionary of actual parameters of the model.
-
aic
(train=False, valid=False, xval=False)[source]¶ Get the AIC (Akaike Information Criterion).
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the AIC value for the training data.
valid (bool) – If
valid=True
, then return the AIC value for the validation data.
xval (bool) – If
xval=True
, then return the AIC value for the cross validation data.
- Returns
The AIC.
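The train/valid/xval convention described above is shared by most metric accessors on this page. The following is a minimal pure-Python sketch of that convention, not H2O's code: `select_metric` and the `aic_values` dictionary are hypothetical, and the single-flag case (return the one requested value) is an assumption that the text does not spell out.

```python
def select_metric(metrics, train=False, valid=False, xval=False):
    """Mimic the documented flag convention on a plain dict of metric values."""
    requested = [name for name, flag in
                 (("train", train), ("valid", valid), ("xval", xval)) if flag]
    if not requested:
        # All flags False: fall back to the training metric.
        return metrics["train"]
    if len(requested) == 1:
        # Assumption: a single flag yields a single value, not a dict.
        return metrics[requested[0]]
    # More than one flag: a dict keyed by the requested data sets.
    return {name: metrics[name] for name in requested}


# Made-up AIC values, for illustration only.
aic_values = {"train": 412.7, "valid": 431.2, "xval": 428.9}

print(select_metric(aic_values))
print(select_metric(aic_values, valid=True))
print(select_metric(aic_values, train=True, xval=True))
```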
-
auc
(train=False, valid=False, xval=False)[source]¶ Get the AUC (Area Under Curve).
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the AUC value for the training data.
valid (bool) – If
valid=True
, then return the AUC value for the validation data.
xval (bool) – If
xval=True
, then return the AUC value for the cross validation data.
- Returns
The AUC.
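As a reminder of what the returned number means, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. It is an illustration on made-up data; H2O may compute AUC differently (for example from binned thresholds), so small numeric differences are possible.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```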
-
aucpr
(train=False, valid=False, xval=False)[source]¶ Get the AUCPR (area under the precision-recall curve).
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the AUCPR value for the training data.
valid (bool) – If
valid=True
, then return the AUCPR value for the validation data.
xval (bool) – If
xval=True
, then return the AUCPR value for the cross validation data.
- Returns
The AUCPR.
-
average_objective
()[source]¶ Retrieve the model's average objective function value from the scoring history, if it exists, for a GLM model. If there is no regularization, the average objective value*obj_reg should equal the neg_log_likelihood value.
- Returns
the average objective function value
-
biases
(vector_id=0)[source]¶ Return the frame for the respective bias vector.
- Parameters
vector_id – an integer, ranging from 0 to number of layers, that specifies the bias vector to return.
- Returns
an H2OFrame which represents the bias vector identified by
vector_id
.
-
calibrate
(calibration_model)[source]¶ Calibrate a trained model with a supplied calibration model.
Only tree-based models can be calibrated.
- Parameters
calibration_model – a GLM model (for Platt Scaling) or Isotonic Regression model trained with the purpose of calibrating output of this model.
- Examples
>>> import h2o
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.isotonicregression import H2OIsotonicRegressionEstimator
>>> h2o.init()
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> df["Angaus"] = df["Angaus"].asfactor()
>>> train, calib = df.split_frame(ratios=[.8], destination_frames=["eco_train", "eco_calib"], seed=42)
>>> model = H2OGradientBoostingEstimator()
>>> model.train(x=list(range(2, train.ncol)), y="Angaus", training_frame=train)
>>> isotonic_train = calib[["Angaus"]]
>>> isotonic_train = isotonic_train.cbind(model.predict(calib)["p1"])
>>> h2o_iso_reg = H2OIsotonicRegressionEstimator(out_of_bounds="clip")
>>> h2o_iso_reg.train(training_frame=isotonic_train, x="p1", y="Angaus")
>>> model.calibrate(h2o_iso_reg)
>>> model.predict(train)
-
coef
()[source]¶ Return the coefficients which can be applied to the non-standardized data.
Note:
standardize=True
by default; when
standardize=False
, then
coef()
will return the coefficients which are fit directly.
-
coef_norm
()[source]¶ Return coefficients fitted on the standardized data (requires
standardize=True
, which is on by default).
These coefficients can be used to evaluate variable importance.
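The relationship between the two sets of coefficients can be illustrated with the usual de-standardization of a linear predictor. The sketch below assumes each predictor was standardized as (x - mean) / sd; the function `destandardize`, the column names, and all numbers are hypothetical, and this page does not confirm that H2O applies exactly this transformation.

```python
def destandardize(coef_norm, intercept_norm, means, sds):
    """Map coefficients fitted on standardized predictors back to the raw scale."""
    coef = {name: beta / sds[name] for name, beta in coef_norm.items()}
    intercept = intercept_norm - sum(
        coef_norm[name] * means[name] / sds[name] for name in coef_norm
    )
    return coef, intercept


# Made-up column statistics and standardized coefficients.
means = {"age": 50.0, "psa": 10.0}
sds = {"age": 10.0, "psa": 5.0}
coef_norm = {"age": 2.0, "psa": -1.0}
intercept_norm = 0.5

coef, intercept = destandardize(coef_norm, intercept_norm, means, sds)

# Both parameterizations give the same linear predictor for any row.
row = {"age": 60.0, "psa": 20.0}
pred_standardized = intercept_norm + sum(
    coef_norm[n] * (row[n] - means[n]) / sds[n] for n in coef_norm
)
pred_raw = intercept + sum(coef[n] * row[n] for n in coef)
print(pred_standardized, pred_raw)
```

Because standardized coefficients share a common scale, their magnitudes are comparable across predictors, which is why they are the ones suggested for judging variable importance.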
-
cross_validation_fold_assignment
()[source]¶ Obtain the cross-validation fold assignment for all rows in the training data.
- Returns
H2OFrame
-
cross_validation_holdout_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.
This is equivalent to summing up all H2OFrames returned by
cross_validation_predictions
.
- Returns
H2OFrame
-
cross_validation_metrics_summary
()[source]¶ Retrieve Cross-Validation Metrics Summary.
- Returns
The cross-validation metrics summary as an H2OTwoDimTable
-
cross_validation_models
()[source]¶ Obtain a list of cross-validation models.
- Returns
list of H2OModel objects.
-
cross_validation_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.
Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.
- Returns
list of H2OFrame objects.
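The zero fill-in explains why summing the per-model frames yields the combined holdout predictions: each row is held out by exactly one cross-validation model, so only one term in each row's sum is non-zero. A small pure-Python sketch with plain lists standing in for H2OFrames (the fold assignment and prediction values are made up):

```python
# Fold index of each training row (hypothetical 3-fold assignment).
fold_assignment = [0, 1, 2, 0, 1, 2]

# Hypothetical predictions each CV model makes on its own holdout rows,
# stored as {fold: {row_index: prediction}}.
holdout_values = {
    0: {0: 0.9, 3: 0.2},
    1: {1: 0.4, 4: 0.7},
    2: {2: 0.1, 5: 0.6},
}

n_rows = len(fold_assignment)

# One full-length column per CV model, with 0 wherever the row was not held out.
per_model = []
for fold in sorted(holdout_values):
    column = [0.0] * n_rows
    for row, value in holdout_values[fold].items():
        column[row] = value
    per_model.append(column)

# Row-wise sum of the zero-filled columns gives the combined holdout predictions.
combined = [sum(values) for values in zip(*per_model)]
print(per_model)
print(combined)
```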
-
deepfeatures
(test_data, layer)[source]¶ Return hidden layer details.
- Parameters
test_data – Data to create a feature space on.
layer – zero-based index of the hidden layer.
-
property
default_params
¶ Dictionary of the default parameters of the model.
-
download_model
(path='', filename=None)[source]¶ Download an H2O Model object to disk.
- Parameters
path – a path to the directory where the model should be saved.
filename – a filename for the saved model.
- Returns
the path of the downloaded model.
-
download_mojo
(path='.', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the model in MOJO format.
- Parameters
path – the path where MOJO file should be saved.
get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder
path
.
genmodel_name – Custom name of the genmodel jar.
- Returns
name of the MOJO file written.
-
download_pojo
(path='', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the POJO for this model to the directory specified by path.
If path is an empty string, then dump the output to screen.
- Parameters
path – An absolute path to the directory where POJO should be saved.
get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder
path
.
genmodel_name – Custom name of the genmodel jar.
- Returns
name of the POJO file written.
-
property
end_time
¶ Timestamp (milliseconds since 1970) when the model training was ended.
-
explain
(frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r', background_frame=None)¶ Generate model explanations on frame data set.
The H2O Explainability Interface is a convenient wrapper to a number of explainability methods and visualizations in H2O. The function can be applied to a single model or group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, parameter
top_n_features
will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with
exclude_explanations
).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
figsize – figure size; passed directly to matplotlib.
render – if
True
, render the model explanations; otherwise model explanations are just returned.
qualitative_colormap – qualitative colormap that is passed to individual plots.
sequential_colormap – sequential colormap that is passed to individual plots.
background_frame – optional frame that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models, but it can be more time and memory consuming.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
-
explain_row
(frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True, background_frame=None)¶ Generate model explanations on frame data set for a given instance.
Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – H2OAutoML object, supervised H2O model, or list of supervised H2O models.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, parameter
top_n_features
will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with
exclude_explanations
).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
qualitative_colormap – a colormap name.
figsize – figure size; passed directly to matplotlib.
render – if
True
, render the model explanations; otherwise model explanations are just returned.
background_frame – optional frame that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models, but it can be more time and memory consuming.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
-
feature_frequencies
(test_data)[source]¶ Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.
- Parameters
test_data (H2OFrame) – Data on which to calculate feature frequencies.
- Returns
A new H2OFrame made of feature frequencies.
- Examples
>>> import h2o
>>> from h2o.estimators import H2OIsolationForestEstimator
>>> h2o.init()
>>> h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")
>>> train, test = h2o_df.split_frame(ratios=[0.75])
>>> model = H2OIsolationForestEstimator(sample_rate=0.1,
...                                     max_depth=20,
...                                     ntrees=50)
>>> model.train(training_frame=train)
>>> model.feature_frequencies(test_data=test)
-
feature_interaction
(max_interaction_depth=100, max_tree_depth=100, max_deepening=-1, path=None)[source]¶ Feature interactions and importance, leaf statistics and split value histograms in a tabular form. Available for XGBoost and GBM.
Metrics:
Gain - Total gain of each feature or feature interaction.
FScore - Number of possible splits taken on a feature or feature interaction.
wFScore - Number of possible splits taken on a feature or feature interaction, weighted by the probability of the splits taking place.
Average wFScore - wFScore divided by FScore.
Average Gain - Gain divided by FScore.
Expected Gain - Total gain of each feature or feature interaction weighed by the probability to gather the gain.
Average Tree Index
Average Tree Depth
- Parameters
max_interaction_depth – Upper bound for extracted feature interactions depth. Defaults to
100
.
max_tree_depth – Upper bound for tree depth. Defaults to
100
.
max_deepening – Upper bound for interaction start deepening (zero deepening => interactions starting at root only). Defaults to
-1
.
path – (Optional) Path where to save the output in .xlsx format (e.g.
/mypath/file.xlsx
). Please note that Pandas and XlsxWriter need to be installed for using this option. Defaults to None.
- Examples
>>> import h2o
>>> from h2o.estimators import H2OXGBoostEstimator
>>> h2o.init()
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(seed=1234)
>>> boston_xgb.train(y=response, x=predictors, training_frame=train)
>>> feature_interactions = boston_xgb.feature_interaction()
-
property
full_parameters
¶ Dictionary of the full specification of all parameters.
-
get_xval_models
(key=None)[source]¶ Return a Model object.
- Parameters
key – If None, return all cross-validated models; otherwise return the model that key points to.
- Returns
A model or list of models.
-
gini
(train=False, valid=False, xval=False)[source]¶ Get the Gini coefficient.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the Gini Coefficient value for the training data.
valid (bool) – If
valid=True
, then return the Gini Coefficient value for the validation data.
xval (bool) – If
xval=True
, then return the Gini Coefficient value for the cross validation data.
- Returns
The Gini Coefficient for this binomial model.
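For binary classifiers the Gini coefficient is conventionally tied to the AUC by Gini = 2 * AUC - 1, so a random ranker (AUC 0.5) scores 0 and a perfect ranker scores 1. The helper below states that standard relation; `gini_from_auc` is a hypothetical name, and this page does not say whether H2O derives its value this way.

```python
def gini_from_auc(auc):
    """Standard relation between the Gini coefficient and ROC AUC for a binary model."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie in [0, 1]")
    return 2.0 * auc - 1.0


print(gini_from_auc(0.5))   # random ranking
print(gini_from_auc(0.75))
print(gini_from_auc(1.0))   # perfect ranking
```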
-
h
(frame, variables)[source]¶ Calculates Friedman and Popescu’s H statistic, in order to test for the presence of an interaction between specified variables in H2O GBM and XGB models. H varies from
0
to
1
. It will have a value of
0
if the model exhibits no interaction between specified variables and a correspondingly larger value for a stronger interaction effect between them.
NaN
is returned if a computation is spoiled by weak main effects and rounding errors.
See Jerome H. Friedman and Bogdan E. Popescu, 2008, “Predictive learning via rule ensembles”, Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.
- Parameters
frame – the frame that current model has been fitted to.
variables – variables of the interest.
- Returns
H statistic of the variables.
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate_train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
>>> prostate_train["CAPSULE"] = prostate_train["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=100, learn_rate=0.1,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="bernoulli")
>>> gbm_h2o.train(x=list(range(1, prostate_train.ncol)), y="CAPSULE", training_frame=prostate_train)
>>> h = gbm_h2o.h(prostate_train, ['DPROS', 'DCAPS'])
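For two variables, the H statistic is usually written in the pairwise form given in the cited Friedman and Popescu paper. The notation below is illustrative: this page does not say how H2O estimates the partial dependence functions or how it generalizes the statistic when more than two variables are passed.

```latex
H_{jk}^{2} \;=\;
\frac{\sum_{i=1}^{N}\left[\hat{F}_{jk}(x_{ij}, x_{ik}) - \hat{F}_{j}(x_{ij}) - \hat{F}_{k}(x_{ik})\right]^{2}}
     {\sum_{i=1}^{N}\hat{F}_{jk}^{2}(x_{ij}, x_{ik})}
```

Here $\hat{F}_{j}$, $\hat{F}_{k}$ and $\hat{F}_{jk}$ are the centered (mean-zero) partial dependence functions of the model on variable $j$, variable $k$, and the pair, evaluated at the $N$ observations. Without an interaction, $\hat{F}_{jk} = \hat{F}_{j} + \hat{F}_{k}$, the numerator vanishes, and $H = 0$.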
-
property
have_mojo
¶ True, if export to MOJO is possible
-
property
have_pojo
¶ True, if export to POJO is possible
-
ice_plot
(frame, column, target=None, max_levels=30, figsize=(16, 9), colormap='plasma', save_plot_path=None, show_pdp=True, binary_response_scale='response', centered=False, grouping_column=None, output_graphing_data=False, nbins=100, show_rug=True, **kwargs)¶ Plot Individual Conditional Expectations (ICE) for each decile.
The individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. The ICE plot is similar to a partial dependence plot (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to a partial dependence plot, the ICE plot can provide more insight, especially when there is a strong feature interaction. The plot also shows the original observation values, marked by a semi-transparent circle on each ICE line. Note that the score of the original observation value may differ from the score value of the underlying ICE line at the original observation point, as the ICE line is drawn as an interpolation of several points.
- Parameters
model – H2OModel.
frame – H2OFrame.
column – string containing column name.
target – (only for multinomial classification) for what target should the plot be done.
max_levels – maximum number of factor levels to show.
figsize – figure size; passed directly to matplotlib.
colormap – colormap name.
save_plot_path – a path to save the plot via using matplotlib function savefig.
show_pdp – option to turn on/off PDP line. Defaults to
True
.
binary_response_scale – option for binary model to display (on the y-axis) the logodds instead of the actual score. Can be one of: “response” (default) or “logodds”.
centered – a bool that determines whether to center curves around 0 at the first valid x value or not.
grouping_column – a feature column name to group the data and provide separate sets of plots by grouping feature values.
output_graphing_data – a bool that determines whether to output final graphing data to a frame.
nbins – Number of bins used.
show_rug – Show rug to visualize the density of the column.
- Returns
object that contains the resulting matplotlib figure (can be accessed using
result.figure()
).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response:
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM:
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the individual conditional expectations plot:
>>> gbm.ice_plot(test, column="alcohol")
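The difference between an ICE line and the PDP can be reproduced without H2O. In the sketch below, `toy_model` is a hypothetical scoring function with an interaction between two made-up columns; each ICE line varies one column for a single row, and the PDP is the point-wise average of those lines. Unlike `ice_plot`, which draws one line per decile, this sketch draws one line per row.

```python
def toy_model(row):
    # Hypothetical model: the effect of "alcohol" depends on "sulphates".
    return 2.0 * row["alcohol"] + row["alcohol"] * row["sulphates"]


def ice_curve(model, row, column, grid):
    """Score one row repeatedly while sweeping `column` over `grid`."""
    return [model({**row, column: value}) for value in grid]


rows = [
    {"alcohol": 9.0, "sulphates": 0.0},
    {"alcohol": 11.0, "sulphates": 1.0},
    {"alcohol": 13.0, "sulphates": 2.0},
]
grid = [9.0, 11.0, 13.0]

# One ICE line per row; the PDP is their point-wise mean.
ice = [ice_curve(toy_model, row, "alcohol", grid) for row in rows]
pdp = [sum(column) / len(column) for column in zip(*ice)]

print(ice)
print(pdp)
```

The three ICE lines have different slopes because of the interaction term, which is exactly the information the averaged PDP line hides.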
-
property
key
¶ - Returns
the unique key representing the object on the backend
-
learning_curve_plot
(metric='AUTO', cv_ribbon=None, cv_lines=None, figsize=(16, 9), colormap=None, save_plot_path=None)¶ Learning curve plot.
Create the learning curve plot for an H2O Model. Learning curves show the error metric dependence on learning progress (e.g. RMSE vs number of trees trained so far in GBM). There can be up to 4 curves showing Training, Validation, Training on CV Models, and Cross-validation error.
- Parameters
model – an H2O model.
metric – a stopping metric.
cv_ribbon – if
True
, plot the CV mean and CV standard deviation as a ribbon around the mean; if None, it will attempt to automatically determine if this is a suitable visualization.
cv_lines – if
True
, plot scoring history for individual CV models; if None, it will attempt to automatically determine if this is a suitable visualization.
figsize – figure size; passed directly to matplotlib.
colormap – colormap to use.
save_plot_path – a path to save the plot via using matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using
result.figure()
).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the learning curve plot
>>> gbm.learning_curve_plot()
-
loglikelihood
(train=False, valid=False, xval=False)[source]¶ Get the log likelihood.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the log likelihood value for the training data.
valid (bool) – If
valid=True
, then return the log likelihood value for the validation data.
xval (bool) – If
xval=True
, then return the log likelihood value for the cross validation data.
- Returns
The log likelihood.
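The log likelihood is the quantity from which the AIC is built. Under the textbook definition, AIC = 2k - 2 ln(L), where k is the number of estimated parameters. The sketch below applies that definition to made-up numbers; `aic_from_loglik` is a hypothetical helper, and how H2O counts parameters is not stated on this page.

```python
def aic_from_loglik(log_likelihood, n_params):
    """Textbook Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Made-up values: log likelihood of -120.5 with 4 estimated parameters.
print(aic_from_loglik(-120.5, 4))  # 2*4 - 2*(-120.5) = 249.0
```

Lower values are better: a higher log likelihood reduces the AIC, while every extra parameter adds a penalty of 2.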
-
logloss
(train=False, valid=False, xval=False)[source]¶ Get the Log Loss.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the log loss value for the training data.
valid (bool) – If
valid=True
, then return the log loss value for the validation data.
xval (bool) – If
xval=True
, then return the log loss value for the cross validation data.
- Returns
The log loss for this model.
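For reference, binary log loss is the mean negative log probability assigned to the true class. The sketch below is the textbook formula on made-up data, with probabilities clipped away from 0 and 1 to keep the logarithm finite; the clipping constant `eps` is an assumption of this example, not a documented H2O setting.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# A model that always predicts 0.5 scores ln(2) regardless of the labels.
print(log_loss([1, 0], [0.5, 0.5]))
# Confident and correct predictions drive the loss toward 0.
print(log_loss([1, 0], [0.99, 0.01]))
```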
-
mae
(train=False, valid=False, xval=False)[source]¶ Get the Mean Absolute Error.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the MAE value for the training data.
valid (bool) – If
valid=True
, then return the MAE value for the validation data.
xval (bool) – If
xval=True
, then return the MAE value for the cross validation data.
- Returns
The MAE for this regression model.
-
mean_residual_deviance
(train=False, valid=False, xval=False)[source]¶ Get the Mean Residual Deviance.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the Mean Residual Deviance value for the training data.
valid (bool) – If
valid=True
, then return the Mean Residual Deviance value for the validation data.
xval (bool) – If
xval=True
, then return the Mean Residual Deviance value for the cross validation data.
- Returns
The Mean Residual Deviance for this regression model.
-
property
model_id
¶ Model identifier.
-
model_performance
(test_data=None, train=False, valid=False, xval=False, auc_type='none', auuc_type=None, custom_auuc_thresholds=None)[source]¶ Generate model metrics for this model on
test_data
.
- Parameters
test_data (H2OFrame) – Data set against which model metrics are computed. The train, valid and xval arguments are all ignored if
test_data
is not
None
.
train (bool) – Report the training metrics for the model. Defaults to False.
valid (bool) – Report the validation metrics for the model. Defaults to False.
xval (bool) – Report the cross-validation metrics for the model. Defaults to False.
auc_type (String) –
Change default AUC type for multinomial classification AUC/AUCPR calculation when
test_data
is not
None
. One of:
"auto"
"none"
(default)
"macro_ovr"
"weighted_ovr"
"macro_ovo"
"weighted_ovo"
If type is
"auto"
or
"none"
, AUC and AUCPR are not calculated.
auuc_type (String) –
Change default AUUC type for uplift binomial classification AUUC calculation when
test_data
is not None. One of:
"AUTO"
(default)
"qini"
"lift"
"gain"
If type is
"AUTO"
(“qini”), AUUC is calculated.
custom_auuc_thresholds (list of float) – List of custom thresholds to calculate AUUC when
test_data
is not None. Defaults to None.
- Returns
An instance of
MetricsBase
or one of its subclasses.
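The `auc_type` names suggest two ways of collapsing per-class one-vs-rest AUCs into a single multinomial number. The sketch below shows the common reading of "macro" (unweighted mean) and "weighted" (mean weighted by class frequency). That reading is an assumption: this page does not define the options, and `average_ovr_auc` with its inputs is hypothetical.

```python
def average_ovr_auc(per_class_auc, class_counts, weighted=False):
    """Combine one-vs-rest AUCs into one score (macro or prevalence-weighted)."""
    classes = list(per_class_auc)
    if weighted:
        total = sum(class_counts[c] for c in classes)
        return sum(per_class_auc[c] * class_counts[c] for c in classes) / total
    return sum(per_class_auc[c] for c in classes) / len(classes)


# Made-up one-vs-rest AUCs and class frequencies for a 3-class problem.
per_class_auc = {"a": 0.9, "b": 0.8, "c": 0.6}
class_counts = {"a": 50, "b": 30, "c": 20}

macro = average_ovr_auc(per_class_auc, class_counts)
weighted = average_ovr_auc(per_class_auc, class_counts, weighted=True)
print(macro, weighted)
```

With these numbers the weighted average is higher than the macro average because the best-separated class is also the most frequent one.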
-
mse
(train=False, valid=False, xval=False)[source]¶ Get the Mean Square Error.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the MSE value for the training data.
valid (bool) – If
valid=True
, then return the MSE value for the validation data.
xval (bool) – If
xval=True
, then return the MSE value for the cross validation data.
- Returns
The MSE for this regression model.
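As a quick reference for the regression error metrics on this page, the sketch below computes MSE and, for contrast, MAE on the same made-up predictions using their textbook definitions. Observation weights, which H2O models may apply, are ignored here.

```python
def mean_squared_error(actual, predicted):
    """Average of squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Residuals are 1, 0, -2, -1: squares sum to 6, absolute values sum to 4.
print(mean_squared_error(actual, predicted))   # 6 / 4 = 1.5
print(mean_absolute_error(actual, predicted))  # 4 / 4 = 1.0
```

MSE penalizes the single residual of size 2 more heavily than MAE does, which is why the two metrics can rank models differently.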
-
negative_log_likelihood
()[source]¶ Retrieve the model's negative log-likelihood function value from the scoring history, if it exists, for a GLM model.
- Returns
the negative log-likelihood function value
-
ntrees_actual
()[source]¶ Returns actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value a user set before building the model.
Type:
float
-
null_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the null degrees of freedom (dof) if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null dof for the training set. If both train and valid are
False
, then train is selected by default.
valid (bool) – Get the null dof for the validation set. If both train and valid are
True
, then train is selected by default.
- Returns
Return the null dof, or None if it is not present.
-
null_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the null deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null deviance for the training set. If both train and valid are
False
, then train is selected by default.
valid (bool) – Get the null deviance for the validation set. If both train and valid are
True
, then train is selected by default.
- Returns
Return the null deviance, or None if it is not present.
-
property
params
¶ Get the parameters and the actual/default values only.
- Returns
A dictionary of parameters used to build this model.
-
partial_plot
(frame, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_plot_path=None, row_index=None, targets=None)[source]¶ Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.
- Parameters
frame (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.
cols – Feature(s) for which partial dependence will be calculated.
destination_key – A key reference to the created partial dependence tables in H2O.
nbins – Number of bins used. For categorical columns, make sure the number of bins exceeds the level count. If you enable
add_missing_NA
, the returned length will be nbins+1.
weight_column – A string denoting which column of data should be used as the weight column.
plot – A boolean specifying whether to plot partial dependence table.
plot_stddev – A boolean specifying whether to add std err to partial dependence plot.
figsize – Dimension/size of the returning plots, adjust to fit your output cells.
server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.
include_na – A boolean specifying whether missing value should be included in the Feature values.
user_splits – A dictionary containing column names as key and user defined split values as value in a list.
col_pairs_2dpdp – List containing pairs of column names for 2D pdp
save_plot_path – Fully qualified name of the image file the resulting plot should be saved to (e.g.
'/home/user/pdpplot.png'
). The ‘png’ suffix may be omitted. If the file already exists, it will be overwritten. The plot is only saved if
plot=True
.
row_index – Row for which partial dependence will be calculated instead of the whole input frame.
targets – Target classes for multiclass model.
- Returns
The list of calculated mean response tables for each requested feature, plus the resulting plot (can be accessed using
result.figure()
).
-
pd_plot
(frame, column, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2', save_plot_path=None, binary_response_scale='response', grouping_column=None, output_graphing_data=False, nbins=100, show_rug=True, **kwargs)¶ Plot partial dependence plot.
The partial dependence plot (PDP) provides a graph of the marginal effect of a variable on the response. The effect of a variable is measured by the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest.
- Parameters
model – H2O Model object.
frame – H2OFrame.
column – string containing column name.
row_index – if None, do partial dependence; if integer, do individual conditional expectation for the row specified by this integer.
target – (only for multinomial classification) for what target should the plot be done.
max_levels – maximum number of factor levels to show.
figsize – figure size; passed directly to matplotlib.
colormap – colormap name; used to get just the first color to keep the api and color scheme similar with
pd_multi_plot
.
save_plot_path – a path to save the plot via using matplotlib function savefig.
binary_response_scale – option for binary model to display (on the y-axis) the logodds instead of the actual score. Can be one of: “response” (default), “logodds”.
grouping_column – A feature column name to group the data and provide separate sets of plots by grouping feature values.
output_graphing_data – a bool that determines whether to output final graphing data to a frame.
nbins – Number of bins used.
show_rug – Show rug to visualize the density of the column
- Returns
object that contains the resulting matplotlib figure (can be accessed using
result.figure()
).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create partial dependence plot
>>> gbm.pd_plot(test, column="alcohol")
-
permutation_importance
(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, use_pandas=False)[source]¶ Get Permutation Variable Importance.
When
n_repeats == 1
, the result is similar to the one from the
varimp()
method (i.e. it contains the following columns: “Relative Importance”, “Scaled Importance”, and “Percentage”).
When
n_repeats > 1
, the individual columns correspond to the permutation variable importance values from individual runs, which correspond to the “Relative Importance” and also to the distance between the original prediction error and the prediction error using a frame with a given feature permuted.
- Parameters
frame – training frame.
metric –
metric to be used. One of:
”AUTO”
”AUC”
”MAE”
”MSE”
”RMSE”
”logloss”
”mean_per_class_error”
”PR_AUC”
Defaults to “AUTO”.
n_samples – number of samples to be evaluated. Use
-1
to use the whole dataset. Defaults to
10 000
.
n_repeats – number of repeated evaluations. Defaults to
1
.
features – features to include in the permutation importance. Use
None
to include all.
seed – seed for the random generator. Use
-1
(default) to pick a random seed.
use_pandas – set to
True
to return a pandas data frame.
- Returns
H2OTwoDimTable or Pandas data frame
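The "distance between the original prediction error and the prediction error with a feature permuted" described above can be sketched in plain Python. This is only an illustration of the idea, not H2O's implementation: the toy `predict` function, the data, and the helper `permutation_importance_sketch` are hypothetical, and MSE is assumed as the metric.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, rows, y, n_features, seed=42):
    """Per feature: error with that feature shuffled minus the original error."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the response
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(y, [predict(r) for r in permuted]) - baseline)
    return importances


# Hypothetical model that depends only on feature 0.
def predict(row):
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
y = [predict(r) for r in rows]
imp = permutation_importance_sketch(predict, rows, y, n_features=2)
print(imp)
```

Shuffling the feature the toy model ignores leaves the error unchanged, so its importance is zero, while shuffling the feature it relies on increases the error.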
-
permutation_importance_plot
(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, num_of_features=10, server=False, save_plot_path=None)[source]¶ Plot Permutation Variable Importance. This method plots either a bar plot or, if
n_repeats > 1
, a box plot and returns the variable importance table.
- Parameters
frame – training frame.
metric –
metric to be used. One of:
”AUTO”
”AUC”
”MAE”
”MSE”
”RMSE”
”logloss”
”mean_per_class_error”
”PR_AUC”
Defaults to “AUTO”.
n_samples – number of samples to be evaluated. Use
-1
to use the whole dataset. Defaults to
10 000
.
n_repeats – number of repeated evaluations. Defaults to
1
.
features – features to include in the permutation importance. Use
None
to include all.
seed – seed for the random generator. Use
-1
(default) to pick a random seed.
num_of_features – number of features to plot. Defaults to
10
.
server – if
True
, set server settings to matplotlib and do not show the plot.
save_plot_path – a path to save the plot to, using the matplotlib function savefig.
- Returns
object that contains H2OTwoDimTable with variable importance and the resulting figure (can be accessed using
result.figure()
)
-
pr_auc
(train=False, valid=False, xval=False)[source]¶ ModelBase.pr_auc
is deprecated; please use ModelBase.aucpr
instead.
-
predict
(test_data, custom_metric=None, custom_metric_func=None)[source]¶ Predict on a dataset.
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
custom_metric – custom evaluation function defined as a class reference; the class gets uploaded into the cluster.
custom_metric_func – custom evaluation function reference (e.g., the result of
upload_custom_metric
).
- Returns
A new H2OFrame of predictions.
-
predict_contributions
(test_data, output_format='Original', top_n=None, bottom_n=None, compare_abs=False, background_frame=None, output_space=False, output_per_reference=False)[source]¶ Predict feature contributions - SHAP values on an H2O Model (only GBM, XGBoost, DRF models and equivalent imported MOJOs).
Returned H2OFrame has shape (#rows, #features + 1). There is a feature contribution column for each input feature, and the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based models is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.
Note: Multinomial classification models are currently not supported.
- Parameters
test_data (H2OFrame) – Data on which to calculate contributions.
output_format (Enum) – Specify how to output feature contributions in XGBoost. By default, XGBoost outputs contributions for 1-hot encoded features; specifying the Compact output format produces a per-feature contribution instead. One of:
"Original"
(default),
"Compact"
.
top_n –
Return only #top_n highest contributions + bias:
If
top_n<0
then sort all SHAP values in descending order
If
top_n<0 && bottom_n<0
then sort all SHAP values in descending order
bottom_n –
Return only #bottom_n lowest contributions + bias:
If top_n and bottom_n are defined together then return array of #top_n + #bottom_n + bias
If
bottom_n<0
then sort all SHAP values in ascending order
If
top_n<0 && bottom_n<0
then sort all SHAP values in descending order
compare_abs – True to compare absolute values of contributions
background_frame – Optional frame, that is used as the source of baselines for the baseline SHAP (when output_per_reference == True) or for the marginal SHAP (when output_per_reference == False).
output_space – If True, linearly scale the contributions so that they sum up to the prediction. NOTE: This will result only in approximate SHAP values even if the model supports exact SHAP calculation. NOTE: This will not have any effect if the estimator doesn’t use a link function.
output_per_reference – If True, return baseline SHAP, i.e., contribution for each data point for each reference from the background_frame. If False, return TreeSHAP if no background_frame is provided, or marginal SHAP if background frame is provided. Can be used only with background_frame.
- Returns
A new H2OFrame made of feature contributions.
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Compute SHAP
>>> m.predict_contributions(fr)
>>> # Compute SHAP and pick the top two highest
>>> m.predict_contributions(fr, top_n=2)
>>> # Compute SHAP and pick the top two lowest
>>> m.predict_contributions(fr, bottom_n=2)
>>> # Compute SHAP and pick the top two highest regardless of the sign
>>> m.predict_contributions(fr, top_n=2, compare_abs=True)
>>> # Compute SHAP and pick top two lowest regardless of the sign
>>> m.predict_contributions(fr, bottom_n=2, compare_abs=True)
>>> # Compute SHAP values and show them all in descending order
>>> m.predict_contributions(fr, top_n=-1)
>>> # Compute SHAP and pick the top two highest and top two lowest
>>> m.predict_contributions(fr, top_n=2, bottom_n=2)
>>> # Compute Marginal SHAP, this enables looking at the contributions against different baselines, e.g., older people in the following example
>>> m.predict_contributions(fr, background_frame=fr[fr["AGE"] > 75, :])
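The additivity property stated above (feature contributions plus bias equal the raw prediction) can be checked by hand for a single row. The sketch below uses invented numbers and assumes a binomial model with a logit link, so the inverse link is the logistic function. The helper names are hypothetical and not part of the H2O API.

```python
import math


def raw_prediction(contributions, bias):
    """Sum of the per-feature contributions and the bias term."""
    return sum(contributions) + bias


def inverse_logit(x):
    """Inverse of the logit link (assumed here for a binomial model)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical contribution values for one row with three features.
contributions = [0.40, -0.15, 0.05]
bias = -0.30

raw = raw_prediction(contributions, bias)
probability = inverse_logit(raw)
print(raw, probability)
```

For a Gaussian distribution no inverse link is applied, which is why the sum is then equal to the model prediction itself.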
-
predict_leaf_node_assignment
(test_data, type='Path')[source]¶ Predict on a dataset and return the leaf node assignment (only for tree-based models).
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node of the tree to the node or by H2O’s internal node id. One of:
"Path"
(default),
"Node_ID"
.
- Returns
A new H2OFrame of predictions.
-
predicted_vs_actual_by_variable
(frame, predicted, variable, use_pandas=False)[source]¶ Calculates per-level mean of predicted value vs actual value for a given variable.
In the basic setting, this function is equivalent to doing group-by on variable and calculating mean on predicted and actual. It also handles NAs in response and weights automatically.
- Parameters
frame – input frame (can be
training/test/...
frame).
predicted – frame of predictions for the given input frame.
variable – variable to inspect.
use_pandas – set to True to return a pandas data frame.
- Returns
H2OTwoDimTable or Pandas data frame
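The group-by described above can be sketched in plain Python. This is an illustration only: the helper `mean_by_level` and the data are hypothetical, weights are ignored, and skipping rows with a missing response is an assumption about what "handles NAs" means.

```python
from collections import defaultdict


def mean_by_level(levels, predicted, actual):
    """Return {level: (mean predicted, mean actual)}, skipping missing responses."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for level, p, a in zip(levels, predicted, actual):
        if a is None:  # assumed NA handling: drop the row
            continue
        acc = sums[level]
        acc[0] += p
        acc[1] += a
        acc[2] += 1
    return {lvl: (sp / n, sa / n) for lvl, (sp, sa, n) in sums.items()}


levels = ["a", "b", "a", "b", "a"]
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
actual = [1.5, 2.5, 2.5, None, 6.0]

table = mean_by_level(levels, predicted, actual)
print(table)
```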
-
r2
(train=False, valid=False, xval=False)[source]¶ Return the R squared for this regression model.
Will return \(R^2\) for GLM Models.
The \(R^2\) value is defined to be \(1 - MSE / var\), where var is computed as \(\sigma * \sigma\).
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the R^2 value for the training data.
valid (bool) – If
valid=True
, then return the R^2 value for the validation data.
xval (bool) – If
xval=True
, then return the R^2 value for the cross validation data.
- Returns
The R squared for this regression model.
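The definition \(1 - MSE / var\) given above can be worked through on a small example. The sketch below is plain Python with made-up data. Using the population variance of the actual values for var is an assumption, and `r2_sketch` is a hypothetical helper rather than an H2O function.

```python
def r2_sketch(y_true, y_pred):
    """R^2 computed as 1 - MSE / var, with var the population variance of y_true."""
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var = sum((t - mean) ** 2 for t in y_true) / n
    return 1.0 - mse / var


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

# MSE = 0.25 and var = 1.25, so R^2 = 1 - 0.2
score = r2_sketch(y_true, y_pred)
print(score)
```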
-
residual_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual degrees of freedom (dof) if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual dof for the training set. If both train and valid are
False
, then train is selected by default.
valid (bool) – Get the residual dof for the validation set. If both train and valid are
True
, then train is selected by default.
- Returns
Return the residual dof, or None if it is not present.
-
residual_deviance
(train=False, valid=False, xval=None)[source]¶ Retrieve the residual deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual deviance for the training set. If both train and valid are
False
, then train is selected by default.
valid (bool) – Get the residual deviance for the validation set. If both train and valid are
True
, then train is selected by default.
- Returns
Return the residual deviance, or None if it is not present.
-
rmse
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Square Error.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the RMSE value for the training data.
valid (bool) – If
valid=True
, then return the RMSE value for the validation data.
xval (bool) – If
xval=True
, then return the RMSE value for the cross validation data.
- Returns
The RMSE for this regression model.
-
rmsle
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Squared Logarithmic Error.
If all are
False
(default), then return the training metric value. If more than one option is set to
True
, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If
train=True
, then return the RMSLE value for the training data.
valid (bool) – If
valid=True
, then return the RMSLE value for the validation data.
xval (bool) – If
xval=True
, then return the RMSLE value for the cross validation data.
- Returns
The RMSLE for this regression model.
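To show how RMSLE differs from RMSE, the sketch below computes both on the same made-up data in plain Python. It assumes the common textbook definition of RMSLE based on log(1 + x), which is not spelled out on this page, and the helper names are hypothetical.

```python
import math


def rmse_sketch(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def rmsle_sketch(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) (assumed definition)."""
    n = len(y_true)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )


y_true = [0.0, 9.0]
y_pred = [0.0, 99.0]

rmse_value = rmse_sketch(y_true, y_pred)
rmsle_value = rmsle_sketch(y_true, y_pred)
print(rmse_value, rmsle_value)
```

Because the errors are taken on a log scale, the single large overestimate dominates RMSE far more than it does RMSLE.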
-
row_to_tree_assignment
(original_training_data)[source]¶ Output row to tree assignment for the model and provided training data.
- Output is frame of size nrow = nrow(original_training_data) and ncol = number_of_trees_in_model+1 in format:
row_id  tree_1  tree_2  tree_3
0       0       1       1
1       1       1       1
2       1       0       0
3       1       1       0
4       0       1       1
5       1       1       1
6       1       0       0
7       0       1       0
8       0       1       1
9       1       0       0
- Parameters
original_training_data (H2OFrame) – Data that was used for model training. Currently there is no validation of the input.
- Returns
A new H2OFrame made of row to tree assignment output.
Note: Multinomial classification generates a tree for each category; each tree uses the same sample of the data.
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234, sample_rate=0.6)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Output row to tree assignment
>>> m.row_to_tree_assignment(fr)
-
property
run_time
¶ Model training time in milliseconds.
-
save_model_details
(path='', force=False, filename=None)[source]¶ Save Model Details of an H2O Model in JSON Format to disk.
- Parameters
path – a path to save the model details at (e.g. hdfs, s3, local).
force – if
True
, overwrite destination directory in case it exists, or throw exception if set to
False
.
filename – a filename for the saved model (file type is always .json).
- Returns str
the path of the saved model details
-
save_mojo
(path='', force=False, filename=None)[source]¶ Save an H2O Model as MOJO (Model Object, Optimized) to disk.
- Parameters
path – a path to save the model at (e.g. hdfs, s3, local).
force – if
True
, overwrite destination directory in case it exists, or throw exception if set to
False
.
filename – a filename for the saved model (file type is always .zip).
- Returns str
the path of the saved model
-
score_history
()[source]¶ DEPRECATED. Use
scoring_history()
instead.
-
scoring_history
()[source]¶ Retrieve Model Score History.
- Returns
The score history as an H2OTwoDimTable or a Pandas DataFrame.
-
shap_explain_row_plot
(frame, row_index, columns=None, top_n_features=10, figsize=(16, 9), plot_type='barplot', contribution_type='both', save_plot_path=None, background_frame=None)¶ SHAP local explanation.
SHAP explanation shows the contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model (i.e. the prediction before applying inverse link function). H2O implements TreeSHAP which, when the features are correlated, can increase the contribution of a feature that had no influence on the prediction.
- Parameters
model – h2o tree model, such as DRF, XRT, GBM, XGBoost.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, the parameter
top_n_features
will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable). When
plot_type="barplot"
, then
top_n_features
will be chosen for each
contribution_type
.
figsize – figure size; passed directly to matplotlib.
plot_type – either “barplot” or “breakdown”.
contribution_type –
One of:
”positive”
”negative”
”both”
Used only for
plot_type="barplot"
.
save_plot_path – a path to save the plot to, using the matplotlib function savefig.
background_frame – optional frame, that is used as the source of baselines for the marginal SHAP.
- Returns
object that contains the resulting matplotlib figure (can be accessed using
result.figure()
).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP row explanation plot
>>> gbm.shap_explain_row_plot(test, row_index=0)
-
shap_summary_plot
(frame, columns=None, top_n_features=20, samples=1000, colorize_factors=True, alpha=1, colormap=None, figsize=(12, 12), jitter=0.35, save_plot_path=None, background_frame=None)¶ SHAP summary plot.
The SHAP summary plot shows the contribution of features for each instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model (i.e. prediction before applying inverse link function).
- Parameters
model – h2o tree model (e.g. DRF, XRT, GBM, XGBoost).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, the parameter
top_n_features
will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
samples – maximum number of observations to use; if lower than number of rows in the frame, take a random sample.
colorize_factors – if
True
, use colors from the colormap to colorize the factors; otherwise all levels will have the same color.
alpha – transparency of the points.
colormap – colormap to use instead of the default blue to red colormap.
figsize – figure size; passed directly to matplotlib.
jitter – amount of jitter used to show the point density.
save_plot_path – a path to save the plot via using matplotlib function savefig.
background_frame – optional frame, that is used as the source of baselines for the marginal SHAP.
- Returns
object that contains the resulting matplotlib figure (can be accessed using
result.figure()
).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP summary plot
>>> gbm.shap_summary_plot(test)
-
show
(verbosity=None, fmt=None)[source]¶ Describes and renders the current object in the given format and verbosity level if supported, by default guessing the best format for the current environment.
- Parameters
verbosity – one of (None, ‘short’, ‘medium’, ‘full’). Defaults to None (object’s default verbosity).
fmt – one of (None, ‘plain’, ‘pretty’, ‘html’). Defaults to None (picks appropriate format depending on platform/context).
-
staged_predict_proba
(test_data)[source]¶ Predict class probabilities at each stage of an H2O Model (only GBM models).
The output structure is analogous to the output of function
predict_leaf_node_assignment
. For each tree t and class c there will be a column Tt.Cc (e.g. T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class, obtained by combining the raw contributions of trees T1.Cc, ..., Tt.Cc. Binomial models build the trees just for the first class, and values in columns Tx.C1 thus correspond to the probability p0.
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
- Returns
A new H2OFrame of staged predictions.
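The phrase "combining the raw contributions of trees" can be pictured as a running sum that is turned into a probability after each tree. The sketch below is a simplified illustration for a binomial model, not H2O's implementation: it assumes a logit link and a hypothetical initial value of 0, and the per-tree contributions are invented.

```python
import math


def staged_probabilities(tree_contributions, init=0.0):
    """Probability after each tree: inverse logit of the cumulative raw sum."""
    total = init
    staged = []
    for contribution in tree_contributions:
        total += contribution
        staged.append(1.0 / (1.0 + math.exp(-total)))
    return staged


# Hypothetical raw contributions of three trees for one row.
stages = staged_probabilities([0.5, 0.25, -0.75])
print(stages)
```

Each entry plays the role of one Tt.C1 column: the value for tree t depends on all trees up to and including t.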
-
property
start_time
¶ Timestamp (milliseconds since 1970) when the model training was started.
-
std_coef_plot
(num_of_features=None, server=False, save_plot_path=None)[source]¶ Plot a model’s standardized coefficient magnitudes.
- Parameters
num_of_features – the number of features shown in the plot.
server – if
True
, set server settings to matplotlib and do not show the graph.
save_plot_path – a path to save the plot to, using the matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using
result.figure()
).
-
property
type
¶ The type of model built. One of:
"classifier"
"regressor"
"unsupervised"
-
update_tree_weights
(frame, weights_column)[source]¶ Re-calculates tree-node weights based on the provided dataset. Modifying node weights will affect how contribution predictions (Shapley values) are calculated. This can be used to explain the model on a curated sub-population of the training dataset.
- Parameters
frame – frame that will be used to re-populate trees with new observations and to collect per-node weights.
weights_column – name of the weight column (can be different from training weights).
-
varimp
(use_pandas=False)[source]¶ Pretty print the variable importances, or return them in a list.
- Parameters
use_pandas (bool) – If
True
, then the variable importances will be returned as a pandas data frame.
- Returns
A list or Pandas DataFrame.
-
varimp_plot
(num_of_features=None, server=False, save_plot_path=None)[source]¶ Plot the variable importance for a trained model.
- Parameters
num_of_features – the number of features shown in the plot (default is
10
or all if less than 10).
server – if
True
, set server settings to matplotlib and do not show the graph.
save_plot_path – a path to save the plot to, using the matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using
result.figure()
).
-
weights
(matrix_id=0)[source]¶ Return the frame for the respective weight matrix.
- Parameters
matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return.
- Returns
an H2OFrame which represents the weight matrix identified by
matrix_id
.
-
property
xvals
¶ Return a list of the cross-validated models.
- Returns
A list of models.
-
class
h2o.model.
MetricsBase
(metric_json, on=None, algo='')[source]¶ Bases:
h2o.model.metrics_base.MetricsBase
A parent class to house common metrics available for the various Metrics types.
The methods here are available across different model categories.
Note
This class and its subclasses are used at runtime as mixins: their methods can (and should) be accessed directly from a metrics object, for example as a result of
model_performance()
.
-
aic
()[source]¶ The AIC for this set of metrics.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.aic()
-
auc
()[source]¶ The AUC for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.auc()
-
aucpr
()[source]¶ The area under the precision recall curve.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.aucpr()
-
gini
()[source]¶ Gini coefficient.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.gini()
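For a binary classifier the Gini coefficient is commonly related to the AUC by Gini = 2 * AUC - 1. That relationship is an assumption here, since this page does not state how the value is computed. The plain-Python sketch below derives both numbers from made-up labels and scores, with AUC computed as the fraction of positive/negative pairs that are ranked correctly.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_pairwise(labels, scores)
gini = 2.0 * auc - 1.0  # assumed relationship between Gini and AUC
print(auc, gini)
```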
-
loglikelihood
()[source]¶ The log likelihood for this set of metrics.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.loglikelihood()
-
logloss
()[source]¶ Log loss.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.logloss()
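As a reminder of what the log loss measures, the sketch below evaluates the standard binary formula on made-up labels and predicted probabilities. It is a plain-Python illustration, `logloss_sketch` is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math


def logloss_sketch(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


uninformed = logloss_sketch([1, 0], [0.5, 0.5])
confident = logloss_sketch([1, 0], [0.9, 0.1])
print(uninformed, confident)
```

Predicting 0.5 for every row gives ln 2, and confident correct predictions give a lower value.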
-
mae
()[source]¶ The MAE for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mae()
-
classmethod
make
(kvs)[source]¶ Factory method to instantiate a MetricsBase object from the list of key-value pairs.
-
mean_per_class_error
()[source]¶ The mean per class error.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.mean_per_class_error()
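The mean per class error averages the error rate of each class, so it differs from the overall error rate when classes are imbalanced. The sketch below shows this on made-up labels. It is plain Python, and the helper `mean_per_class_error_sketch` is hypothetical rather than H2O's implementation.

```python
def mean_per_class_error_sketch(actual, predicted):
    """Average of the per-class misclassification rates."""
    errors = []
    for c in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == c]
        wrong = sum(1 for i in idx if predicted[i] != c)
        errors.append(wrong / len(idx))
    return sum(errors) / len(errors)


actual = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 0, 1]

# Class 0 error is 1/4 and class 1 error is 1/2.
mpce = mean_per_class_error_sketch(actual, predicted)
overall = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(mpce, overall)
```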
-
mean_residual_deviance
()[source]¶ The mean residual deviance for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/AirlinesTest.csv.zip")
>>> air_gbm = H2OGradientBoostingEstimator()
>>> air_gbm.train(x=list(range(9)),
...               y=9,
...               training_frame=airlines,
...               validation_frame=airlines)
>>> air_gbm.mean_residual_deviance(train=True,valid=False,xval=False)
-
mse
()[source]¶ The MSE for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mse()
-
nobs
()[source]¶ The number of observations.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> perf = cars_gbm.model_performance()
>>> perf.nobs()
-
null_degrees_of_freedom
()[source]¶ The null DoF if the model has residual deviance, otherwise None.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_degrees_of_freedom()
-
null_deviance
()[source]¶ The null deviance if the model has residual deviance, otherwise None.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_deviance()
-
r2
()[source]¶ The R squared coefficient.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.r2()
-
residual_degrees_of_freedom
()[source]¶ The residual DoF if the model has residual deviance, otherwise None.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_degrees_of_freedom()
-
residual_deviance
()[source]¶ The residual deviance if the model has it, otherwise None.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_deviance()
-
rmse
()[source]¶ The RMSE for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmse()
-
rmsle
()[source]¶ The RMSLE for this set of metrics.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmsle()
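For reference, the quantities reported by rmse() and rmsle() can be reproduced by hand. The following is a minimal sketch of the textbook definitions on plain Python lists; the helper names are illustrative only and are not part of the h2o API, which computes these values on the cluster.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; all inputs must be greater than -1."""
    n = len(actual)
    squared_log_diffs = (
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    )
    return math.sqrt(sum(squared_log_diffs) / n)


if __name__ == "__main__":
    actual = [4.0, 6.0, 8.0]
    predicted = [4.5, 5.5, 8.0]
    print(rmse(actual, predicted), rmsle(actual, predicted))
```

RMSLE penalizes relative rather than absolute differences, which is why it is typically used for non-negative responses such as counts.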
-
show
(verbosity=None, fmt=None)[source]¶ Describes and renders the current object in the given format and at the given verbosity level, if supported. By default, the best format for the current environment is guessed.
- Parameters
verbosity – one of (None, ‘short’, ‘medium’, ‘full’). Defaults to None (object’s default verbosity).
fmt – one of (None, ‘plain’, ‘pretty’, ‘html’). Defaults to None (picks appropriate format depending on platform/context).
-
-
class
h2o.model.
H2OBinomialModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
F0point5
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F0.5 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F0.5 value for the training data.
valid (bool) – If True, return the F0.5 value for the validation data.
xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.
- Returns
The F0.5 values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> F0point5 = gbm.F0point5() # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True, valid=True, xval=True)
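The train, valid, and xval flags follow the same return convention across the metric accessors of this class. The sketch below illustrates that convention on a plain dictionary. The function name and the `values` store are hypothetical, and the single-flag case (returning the bare value) is an assumption, since the text above only spells out the all-False and more-than-one-True cases.

```python
def select_metric_values(values, train=False, valid=False, xval=False):
    """Pick metric values according to the train/valid/xval flags.

    `values` is a hypothetical dict with the keys "train", "valid", "xval".
    """
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [key for key, flag in flags if flag]
    if not requested:
        # All flags False: fall back to the training metric.
        return values["train"]
    if len(requested) == 1:
        # Assumption: a single requested key yields the bare value.
        return values[requested[0]]
    # More than one flag: return a dictionary keyed by data set.
    return {key: values[key] for key in requested}


if __name__ == "__main__":
    stored = {"train": 0.91, "valid": 0.87, "xval": 0.85}
    print(select_metric_values(stored))
    print(select_metric_values(stored, train=True, valid=True, xval=True))
```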
-
F1
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F1 value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F1 value for the training data.
valid (bool) – If True, return the F1 value for the validation data.
xval (bool) – If True, return the F1 value for each of the cross-validated splits.
- Returns
The F1 values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F1() # <- Default: return training metric value
>>> gbm.F1(train=True, valid=True, xval=True)
-
F2
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F2 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F2 value for the training data.
valid (bool) – If True, return the F2 value for the validation data.
xval (bool) – If True, return the F2 value for each of the cross-validated splits.
- Returns
The F2 values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F2() # <- Default: return training metric value
>>> gbm.F2(train=True, valid=True, xval=True)
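F0.5, F1, and F2 are all instances of the general F-beta score, which weights recall beta times as much as precision. As a rough illustration (the standard formula expressed in confusion-matrix counts, not H2O's internal code; `f_beta` is an illustrative name):

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from true positives, false positives, and false negatives."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Convention used in this sketch: return 0.0 when nothing is positive.
    return (1 + b2) * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    tp, fp, fn = 6, 2, 4  # precision 0.75, recall 0.60
    for beta in (0.5, 1.0, 2.0):
        print(beta, f_beta(tp, fp, fn, beta))
```

With precision higher than recall, as in this usage example, F0.5 comes out above F1 and F2 below it, because a smaller beta favors precision.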
-
accuracy
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the accuracy for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the accuracy value for the training data.
valid (bool) – If True, return the accuracy value for the validation data.
xval (bool) – If True, return the accuracy value for each of the cross-validated splits.
- Returns
The accuracy values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.accuracy() # <- Default: return training metric value
>>> gbm.accuracy(train=True, valid=True, xval=True)
-
confusion_matrix
(metrics=None, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the confusion matrix for the specified metrics/thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
metrics – A string (or list of strings) among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics. Defaults to 'f1'.
thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.
train (bool) – If True, return the confusion matrix for the training data.
valid (bool) – If True, return the confusion matrix for the validation data.
xval (bool) – If True, return the confusion matrix for each of the cross-validated splits.
- Returns
The confusion matrix values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.confusion_matrix() # <- Default: return training metric value
>>> gbm.confusion_matrix(train=True, valid=True, xval=True)
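To make the role of the threshold concrete, the sketch below counts the four confusion-matrix cells for one threshold on plain Python lists. It assumes labels coded as 0/1 and that a row is predicted positive when its score is at or above the threshold; `confusion_counts` is an illustrative helper, not an h2o function.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN for 0/1 labels at a single score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold  # assumed decision rule
        if predicted_positive:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1]
    scores = [0.9, 0.4, 0.6, 0.2, 0.7]
    print(confusion_counts(labels, scores, threshold=0.5))
```

Raising the threshold moves rows from the predicted-positive cells (TP, FP) to the predicted-negative cells (FN, TN), which is why every threshold-based metric in this class depends on the threshold chosen.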
-
error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the error value for the training data.
valid (bool) – If True, return the error value for the validation data.
xval (bool) – If True, return the error value for each of the cross-validated splits.
- Returns
The error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.error() # <- Default: return training metric
>>> gbm.error(train=True, valid=True, xval=True)
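Accuracy and error are complementary: at any given threshold they sum to one. A small sketch of both from confusion-matrix counts, using the usual definitions (the helper names are illustrative and not part of h2o):

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of rows classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def error(tp, fp, tn, fn):
    """Fraction of rows classified incorrectly (1 - accuracy)."""
    return (fp + fn) / (tp + fp + tn + fn)


if __name__ == "__main__":
    counts = dict(tp=2, fp=1, tn=1, fn=1)
    print(accuracy(**counts), error(**counts))
```

This complementarity is why the default threshold for an error metric is the one minimizing it, whereas for accuracy it is the one maximizing it.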
-
fallout
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the fallout for a set of thresholds (aka False Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the fallout value for the training data.
valid (bool) – If True, return the fallout value for the validation data.
xval (bool) – If True, return the fallout value for each of the cross-validated splits.
- Returns
The fallout values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fallout() # <- Default: return training metric
>>> gbm.fallout(train=True, valid=True, xval=True)
-
find_idx_by_threshold
(threshold, train=False, valid=False, xval=False)[source]¶ Retrieve the index in this metric’s threshold list at which the given threshold is located.
If all are False (default), then return the index for the training metrics. If more than one option is set to True, then return a dictionary of indices where the keys are “train”, “valid”, and “xval”.
- Parameters
threshold (float) – Threshold value to search for in the threshold list.
train (bool) – If True, return the threshold index for the training data.
valid (bool) – If True, return the threshold index for the validation data.
xval (bool) – If True, return the threshold index for each of the cross-validated splits.
- Returns
The threshold indices for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438,
...                                           train=True)
>>> idx_threshold
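Conceptually, this lookup maps a threshold value to a row of the model's threshold table. The sketch below shows one plausible way to do it on a plain list: try an approximately exact match first, then fall back to the closest stored threshold. Both the fallback and the helper name `nearest_threshold_index` are assumptions made for illustration and are not guaranteed to mirror H2O's behavior.

```python
import math


def nearest_threshold_index(thresholds, threshold):
    """Return the index of `threshold` in `thresholds`, or of the closest entry."""
    for i, candidate in enumerate(thresholds):
        if math.isclose(candidate, threshold, rel_tol=1e-8):
            return i
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Assumed fallback: pick the stored threshold with the smallest distance.
    distances = [abs(candidate - threshold) for candidate in thresholds]
    return distances.index(min(distances))


if __name__ == "__main__":
    table = [0.9, 0.7, 0.5, 0.3]
    print(nearest_threshold_index(table, 0.5))
    print(nearest_threshold_index(table, 0.39438))
```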
-
find_threshold_by_max_metric
(metric, train=False, valid=False, xval=False)[source]¶ Find the threshold at which the given metric is maximal.
If all are False (default), then return the threshold for the training metrics. If more than one option is set to True, then return a dictionary of thresholds where the keys are “train”, “valid”, and “xval”.
- Parameters
metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.
train (bool) – If True, return the threshold for the training data.
valid (bool) – If True, return the threshold for the validation data.
xval (bool) – If True, return the threshold for each of the cross-validated splits.
- Returns
The thresholds for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> max_metric = gbm.find_threshold_by_max_metric(metric="f2",
...                                               train=True)
>>> max_metric
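The idea behind this lookup is an argmax over the threshold table: given one metric value per candidate threshold, return the threshold whose value is largest. A minimal sketch on plain lists follows; `threshold_maximizing` and its inputs are hypothetical, and ties are resolved here in favor of the first maximum, which is a choice of this sketch.

```python
def threshold_maximizing(thresholds, metric_values):
    """Return the threshold whose metric value is largest (first one on ties)."""
    if len(thresholds) != len(metric_values) or not thresholds:
        raise ValueError("need one metric value per threshold")
    best_index = max(range(len(thresholds)), key=lambda i: metric_values[i])
    return thresholds[best_index]


if __name__ == "__main__":
    thresholds = [0.8, 0.5, 0.2]
    f2_values = [0.60, 0.75, 0.70]
    print(threshold_maximizing(thresholds, f2_values))
```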
-
fnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Negative Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the FNR value for the training data.
valid (bool) – If True, return the FNR value for the validation data.
xval (bool) – If True, return the FNR value for each of the cross-validated splits.
- Returns
The FNR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fnr() # <- Default: return training metric
>>> gbm.fnr(train=True, valid=True, xval=True)
-
fpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Positive Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the FPR value for the training data.
valid (bool) – If True, return the FPR value for the validation data.
xval (bool) – If True, return the FPR value for each of the cross-validated splits.
- Returns
The FPR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fpr() # <- Default: return training metric
>>> gbm.fpr(train=True, valid=True, xval=True)
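The false negative rate (also called miss rate) and the false positive rate (also called fallout) are each computed within one actual class. The following sketch shows the standard definitions from confusion-matrix counts; the function names are illustrative and the zero-denominator convention is a choice made here, not documented H2O behavior.

```python
def false_negative_rate(tp, fn):
    """FN / (FN + TP): share of actual positives that were missed."""
    positives = tp + fn
    return fn / positives if positives else 0.0


def false_positive_rate(fp, tn):
    """FP / (FP + TN): share of actual negatives flagged as positive."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


if __name__ == "__main__":
    print(false_negative_rate(tp=2, fn=1))
    print(false_positive_rate(fp=1, tn=1))
```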
-
gains_lift
(train=False, valid=False, xval=False)[source]¶ Get the Gains/Lift table for the specified metrics.
If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of tables where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the Gains/Lift table for the training data.
valid (bool) – If True, return the Gains/Lift table for the validation data.
xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.
- Returns
The Gains/Lift tables for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.gains_lift() # <- Default: return training metric Gain/Lift table
>>> gbm.gains_lift(train=True, valid=True, xval=True)
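As a rough picture of what a Gains/Lift table contains, the sketch below ranks rows by predicted score, cuts them into equally sized groups, and reports the cumulative capture rate and cumulative lift per group. This is a simplified illustration under stated assumptions (0/1 labels, equal-count groups, column names chosen here); the table H2O produces has more columns, and its binning may differ.

```python
def gains_lift_rows(labels, scores, groups=4):
    """Cumulative capture rate and lift per score-ranked group (simplified)."""
    n = len(labels)
    total_positives = sum(labels)
    if not 0 < groups <= n or total_positives == 0:
        raise ValueError("need 0 < groups <= rows and at least one positive")
    # Rank row indices from highest to lowest predicted score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rows, captured = [], 0
    for g in range(groups):
        start, end = g * n // groups, (g + 1) * n // groups
        captured += sum(labels[i] for i in order[start:end])
        data_fraction = end / n
        capture_rate = captured / total_positives
        rows.append({
            "cumulative_data_fraction": data_fraction,
            "cumulative_capture_rate": capture_rate,
            "cumulative_lift": capture_rate / data_fraction,
        })
    return rows


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    for row in gains_lift_rows(labels, scores):
        print(row)
```

A cumulative lift above 1 in the top groups means the model concentrates positives among its highest-scored rows; the last group always has a cumulative lift of 1 because it covers the whole data set.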
-
gains_lift_plot
(type='both', xval=False, server=False, save_plot_path=None, plot=True)[source]¶ Plot Gains/Lift curves.
- Parameters
type – One of: “both” (default), “gains”, “lift”.
xval – If True, use cross-validation metrics.
server – If True, generate the plot inline using matplotlib’s “Agg” backend.
save_plot_path – Filename to save the plot to.
plot – True to plot the curves, False to get a Gains/Lift table.
- Returns
The Gains/Lift table plus the resulting plot (which can be accessed using result.figure()).
-
kolmogorov_smirnov
()[source]¶ Retrieves the Kolmogorov-Smirnov metric (K-S metric) for a given binomial model. The returned number lies between 0 and 1. The K-S metric represents the degree of separation between the positive (1) and negative (0) cumulative distribution functions. Detailed metrics for each group can be found in the Gains/Lift table.
- Returns
Kolmogorov-Smirnov metric, a number between 0 and 1.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OGradientBoostingEstimator(ntrees=1,
...                                      gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.kolmogorov_smirnov()
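The “degree of separation” described above can be illustrated with the textbook K-S statistic: the largest gap between the cumulative share of positives and the cumulative share of negatives as the score threshold is lowered. The sketch below evaluates that gap at every distinct score. It is a conceptual illustration with a hypothetical helper name; H2O derives its value from the binned Gains/Lift table, so the numbers need not match exactly.

```python
def ks_statistic(labels, scores):
    """Max gap between cumulative positive and negative rates (0/1 labels)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_gap = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Share of each class scored at or above the current threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        best_gap = max(best_gap, abs(tp / positives - fp / negatives))
    return best_gap


if __name__ == "__main__":
    print(ks_statistic([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]))
    print(ks_statistic([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))
```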
-
max_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the max per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the max per class error value for the training data.
valid (bool) – If True, return the max per class error value for the validation data.
xval (bool) – If True, return the max per class error value for each of the cross-validated splits.
- Returns
The max per class error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.max_per_class_error() # <- Default: return training metric value
>>> gbm.max_per_class_error(train=True, valid=True, xval=True)
-
mcc
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the MCC for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the MCC value for the training data.
valid (bool) – If True, return the MCC value for the validation data.
xval (bool) – If True, return the MCC value for each of the cross-validated splits.
- Returns
The MCC values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mcc() # <- Default: return training metric value
>>> gbm.mcc(train=True, valid=True, xval=True)
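MCC here stands for the Matthews correlation coefficient, which summarizes all four confusion-matrix cells in a single value between -1 and 1. A minimal sketch of the standard formula follows; the helper name is illustrative, and returning 0.0 for a zero denominator is a common convention adopted here rather than documented H2O behavior.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    print(matthews_corrcoef(tp=2, fp=1, tn=1, fn=1))
    print(matthews_corrcoef(tp=5, fp=0, tn=5, fn=0))
```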
-
mean_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the mean per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the mean per class error value for the training data.
valid (bool) – If True, return the mean per class error value for the validation data.
xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.
- Returns
The mean per class error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mean_per_class_error() # <- Default: return training metric
>>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
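For a binomial model there are two per-class errors: the share of actual positives misclassified and the share of actual negatives misclassified. The mean and max per class error aggregate these two numbers. The sketch below follows that usual definition; the helper names are illustrative and each class is assumed to have at least one row.

```python
def per_class_errors(tp, fp, tn, fn):
    """Error rate within the positive class and within the negative class."""
    positive_class_error = fn / (tp + fn)
    negative_class_error = fp / (tn + fp)
    return positive_class_error, negative_class_error


def mean_per_class_error(tp, fp, tn, fn):
    errors = per_class_errors(tp, fp, tn, fn)
    return sum(errors) / len(errors)


def max_per_class_error(tp, fp, tn, fn):
    return max(per_class_errors(tp, fp, tn, fn))


if __name__ == "__main__":
    counts = dict(tp=2, fp=1, tn=1, fn=1)
    print(mean_per_class_error(**counts), max_per_class_error(**counts))
```

Unlike the overall error, these aggregates weight both classes equally, which makes them more informative on imbalanced data.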
-
metric
(metric, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the metric value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
metric (str) – Name of the metric to retrieve.
thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).
train (bool) – If True, return the metric value for the training data.
valid (bool) – If True, return the metric value for the validation data.
xval (bool) – If True, return the metric value for each of the cross-validated splits.
- Returns
The metric values for the specified key(s).
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> # thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99])
>>> thresholds = [0.01,0.5,0.99]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
-
missrate
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the miss rate for a set of thresholds (aka False Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the miss rate value for the training data.
valid (bool) – If True, return the miss rate value for the validation data.
xval (bool) – If True, return the miss rate value for each of the cross-validated splits.
- Returns
The miss rate values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.missrate() # <- Default: return training metric
>>> gbm.missrate(train=True, valid=True, xval=True)
-
plot
(timestep='AUTO', metric='AUTO', server=False, save_plot_path=None)[source]¶ Plot training set (and validation set if available) scoring history for an H2OBinomialModel.
The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep (str) – A unit of measurement for the x-axis.
metric (str) – A unit of measurement for the y-axis.
server (bool) – If True, then generate the image inline (using matplotlib’s “Agg” backend).
save_plot_path – Path to which the plot is saved using matplotlib’s savefig function.
- Returns
An object that contains the resulting figure (which can be accessed using result.figure()).
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> response = 3
>>> predictors = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10]
>>> model = H2OGeneralizedLinearEstimator(family="binomial")
>>> model.train(x=predictors, y=response, training_frame=benign)
>>> model.plot(timestep="AUTO", metric="objective", server=False)
-
precision
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the precision for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the precision value for the training data.
valid (bool) – If True, return the precision value for the validation data.
xval (bool) – If True, return the precision value for each of the cross-validated splits.
- Returns
The precision values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.precision() # <- Default: return training metric value
>>> gbm.precision(train=True, valid=True, xval=True)
-
recall
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the recall for a set of thresholds (aka True Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the recall value for the training data.
valid (bool) – If True, return the recall value for the validation data.
xval (bool) – If True, return the recall value for each of the cross-validated splits.
- Returns
The recall values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.recall() # <- Default: return training metric
>>> gbm.recall(train=True, valid=True, xval=True)
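Precision and recall answer two different questions about the same predictions: how many predicted positives are correct, and how many actual positives are found. A short sketch of the standard definitions from confusion-matrix counts (illustrative helper names; the 0.0 result for an empty denominator is a convention chosen for this sketch):

```python
def precision(tp, fp):
    """TP / (TP + FP): share of predicted positives that are correct."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0


def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that are found (TPR)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


if __name__ == "__main__":
    print(precision(tp=6, fp=2), recall(tp=6, fn=4))
```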
-
roc
(train=False, valid=False, xval=False)
Return the coordinates of the ROC curve for a given set of data.
The coordinates are two-tuples containing the false positive rates as a list and true positive rates as a list. If all are False (default), then the coordinates for the training data are returned. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.
- Parameters
train (bool) – If True, return the ROC coordinates for the training data.
valid (bool) – If True, return the ROC coordinates for the validation data.
xval (bool) – If True, return the ROC coordinates for each of the cross-validated splits.
- Returns
The ROC values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.roc() # <- Default: return training data
>>> gbm.roc(train=True, valid=True, xval=True)
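The two-tuple layout described above (one list of false positive rates, one list of true positive rates) can be illustrated without a running cluster. This is a rough sketch only: `roc_points` and the toy data are hypothetical, every distinct score is used as a threshold, and the thresholds H2O actually evaluates may differ.

```python
def roc_points(scores, labels):
    """Return (fprs, tprs), sweeping every distinct score as a threshold.

    Assumes the data contains at least one positive and one negative row.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    fprs, tprs = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tprs.append(tp / pos)
        fprs.append(fp / neg)
    return fprs, tprs


# Hypothetical predicted probabilities and true classes (1 = positive).
fprs, tprs = roc_points([0.9, 0.8, 0.6, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0])
```

Both lists have the same length and are non-decreasing, so they can be passed directly to a plotting function as x and y values.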
-
sensitivity
(thresholds=None, train=False, valid=False, xval=False)
Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the sensitivity value for the training data.
valid (bool) – If True, return the sensitivity value for the validation data.
xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.
- Returns
The sensitivity values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.sensitivity() # <- Default: return training metric
>>> gbm.sensitivity(train=True, valid=True, xval=True)
-
specificity
(thresholds=None, train=False, valid=False, xval=False)
Get the specificity for a set of thresholds (aka True Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the specificity value for the training data.
valid (bool) – If True, return the specificity value for the validation data.
xval (bool) – If True, return the specificity value for each of the cross-validated splits.
- Returns
The specificity values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.specificity() # <- Default: return training metric
>>> gbm.specificity(train=True, valid=True, xval=True)
-
thresholds_and_metric_scores
(train=False, valid=False, xval=False)
Get all thresholds and metric scores in a table.
If all are False (default), then return the training metric table. If more than one option is set to True, then return a dictionary of tables where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the thresholds and metric scores table for the training data.
valid (bool) – If True, return the thresholds and metric scores table for the validation data.
xval (bool) – If True, return the thresholds and metric scores table for each of the cross-validated splits.
- Returns
The thresholds and metric scores tables for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.thresholds_and_metric_scores() # <- Default: return training metric table
>>> gbm.thresholds_and_metric_scores(train=True, valid=True, xval=True)
-
tnr
(thresholds=None, train=False, valid=False, xval=False)
Get the True Negative Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the TNR value for the training data.
valid (bool) – If True, return the TNR value for the validation data.
xval (bool) – If True, return the TNR value for each of the cross-validated splits.
- Returns
The TNR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tnr() # <- Default: return training metric
>>> gbm.tnr(train=True, valid=True, xval=True)
-
tpr
(thresholds=None, train=False, valid=False, xval=False)
Get the True Positive Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the TPR value for the training data.
valid (bool) – If True, return the TPR value for the validation data.
xval (bool) – If True, return the TPR value for each of the cross-validated splits.
- Returns
The TPR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tpr() # <- Default: return training metric
>>> gbm.tpr(train=True, valid=True, xval=True)
-
-
class h2o.model.H2OMultinomialModel
Bases: h2o.model.model_base.ModelBase
-
confusion_matrix
(data)
Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.
- Parameters
data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> confusion_matrix = gbm.confusion_matrix(train)
>>> confusion_matrix
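For readers unfamiliar with the layout of a multiclass confusion matrix, the sketch below builds one from plain lists. It is illustrative only: `confusion_counts` and the toy labels are hypothetical, and placing actual classes on rows and predicted classes on columns is an assumption about the orientation, not something stated above.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted classes."""
    classes = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))
    matrix = [[pairs[(a, p)] for p in classes] for a in classes]
    return classes, matrix


# Hypothetical cylinder classes for seven cars.
actual = ["4", "4", "6", "6", "8", "8", "8"]
predicted = ["4", "6", "6", "6", "8", "8", "4"]

classes, matrix = confusion_counts(actual, predicted)
```

The diagonal holds the correctly classified rows, and everything off the diagonal is an error for the class of that row.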
-
hit_ratio_table
(train=False, valid=False, xval=False)
Retrieve the Hit Ratios.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train – If train is True, then return the hit ratio value for the training data.
valid – If valid is True, then return the hit ratio value for the validation data.
xval – If xval is True, then return the hit ratio value for the cross-validation data.
- Returns
The hit ratio for this multinomial model.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> hit_ratio_table = gbm.hit_ratio_table() # <- Default: return training metrics
>>> hit_ratio_table
>>> hit_ratio_table1 = gbm.hit_ratio_table(train=True,
...                                        valid=True,
...                                        xval=True)
>>> hit_ratio_table1
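A hit ratio is commonly read as the fraction of rows whose actual class is among the k classes with the highest predicted probability. The sketch below works under that assumption and is not taken from H2O's implementation: `hit_ratios` and the toy probabilities are hypothetical.

```python
def hit_ratios(class_probs, actual, max_k):
    """Fraction of rows whose actual class is within the top-k predictions, for k = 1..max_k."""
    ratios = []
    for k in range(1, max_k + 1):
        hits = 0
        for probs, truth in zip(class_probs, actual):
            # Class labels ordered from most to least probable, truncated to k.
            top_k = sorted(probs, key=probs.get, reverse=True)[:k]
            hits += truth in top_k
        ratios.append(hits / len(actual))
    return ratios


# Hypothetical per-row class probabilities and true classes.
class_probs = [
    {"4": 0.7, "6": 0.2, "8": 0.1},
    {"4": 0.5, "6": 0.3, "8": 0.2},
    {"4": 0.1, "6": 0.3, "8": 0.6},
    {"4": 0.4, "6": 0.35, "8": 0.25},
]
actual = ["4", "6", "8", "8"]

ratios = hit_ratios(class_probs, actual, 3)
```

The k = 1 entry equals plain accuracy, and the ratio reaches 1.0 once k equals the number of classes.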
-
mean_per_class_error
(train=False, valid=False, xval=False)
Retrieve the mean per class error across all classes.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the mean_per_class_error value for the training data.
valid (bool) – If True, return the mean_per_class_error value for the validation data.
xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.
- Returns
The mean_per_class_error values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> mean_per_class_error = gbm.mean_per_class_error() # <- Default: return training metric
>>> mean_per_class_error
>>> mean_per_class_error1 = gbm.mean_per_class_error(train=True,
...                                                  valid=True,
...                                                  xval=True)
>>> mean_per_class_error1
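As a worked illustration, the mean per class error can be derived from a confusion matrix by averaging each class's own error rate. The sketch below assumes rows are actual classes and columns are predicted classes; the function name and the toy matrix are hypothetical and not part of the H2O API.

```python
def mean_per_class_error_from(matrix):
    """Average of per-class error rates; assumes every actual class has at least one row."""
    errors = [1 - row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(errors) / len(errors)


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
matrix = [
    [1, 1, 0],  # class A: 1 of 2 correct, error 0.5
    [0, 2, 0],  # class B: 2 of 2 correct, error 0.0
    [1, 0, 2],  # class C: 2 of 3 correct, error 1/3
]

mpce = mean_per_class_error_from(matrix)
```

Because every class contributes equally, this average weights rare classes as heavily as common ones, unlike the overall classification error.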
-
multinomial_auc_table
(train=False, valid=False, xval=False)
Retrieve the multinomial AUC table.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the multinomial_auc_table for the training data.
valid (bool) – If True, return the multinomial_auc_table for the validation data.
xval (bool) – If True, return the multinomial_auc_table for each of the cross-validated splits.
- Returns
The multinomial_auc_table values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_auc_table = gbm.multinomial_auc_table() # <- Default: return training metric
>>> multinomial_auc_table
>>> multinomial_auc_table1 = gbm.multinomial_auc_table(train=True,
...                                                    valid=True,
...                                                    xval=True)
>>> multinomial_auc_table1
-
multinomial_aucpr_table
(train=False, valid=False, xval=False)
Retrieve the multinomial PR AUC table.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the multinomial_aucpr_table for the training data.
valid (bool) – If True, return the multinomial_aucpr_table for the validation data.
xval (bool) – If True, return the multinomial_aucpr_table for each of the cross-validated splits.
- Returns
The multinomial_aucpr_table values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_aucpr_table = gbm.multinomial_aucpr_table() # <- Default: return training metric
>>> multinomial_aucpr_table
>>> multinomial_aucpr_table1 = gbm.multinomial_aucpr_table(train=True,
...                                                        valid=True,
...                                                        xval=True)
>>> multinomial_aucpr_table1
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)
Plots training set (and validation set if available) scoring history for an H2OMultinomialModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep – A unit of measurement for the x-axis. One of: 'AUTO', 'duration', 'number_of_trees'.
metric – A unit of measurement for the y-axis. One of: 'AUTO', 'logloss', 'classification_error', 'rmse'.
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(metric="AUTO", timestep="AUTO")
-
-
class h2o.model.H2ORegressionModel
Bases: h2o.model.model_base.ModelBase
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)
Plots training set (and validation set if available) scoring history for an H2ORegressionModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep – A unit of measurement for the x-axis.
metric – A unit of measurement for the y-axis.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy"
>>> distribution = "gaussian"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(timestep="AUTO", metric="AUTO")
-
residual_analysis_plot
(frame, figsize=(16, 9), save_plot_path=None)
Residual Analysis.
Do Residual Analysis and plot the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection (e.g. using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc.). If you notice “striped” lines of residuals, that is just an indication that your response variable was integer-valued instead of real-valued.
- Parameters
model – H2OModel.
frame – H2OFrame.
figsize – figure size; passed directly to matplotlib.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the residual analysis plot
>>> gbm.residual_analysis_plot(test)
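The points drawn in such a plot are pairs of fitted value and residual. The small sketch below shows how those pairs come about; the toy numbers are hypothetical, and the sign convention (actual minus fitted) is an assumption, since the text above does not state which convention the plot uses.

```python
# Hypothetical observed responses and model predictions for four test rows.
actual = [5.0, 6.0, 7.0, 5.0]
fitted = [5.4, 5.9, 6.2, 5.0]

# Residual for each row, assuming the convention residual = actual - fitted.
residuals = [a - f for a, f in zip(actual, fitted)]

# (x, y) coordinates of a fitted-vs-residual scatter plot.
points = list(zip(fitted, residuals))
```

With an integer-valued response such as the wine quality score, rows sharing the same actual value fall on parallel diagonal lines, which is the "striped" pattern mentioned above.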
-
-
class h2o.model.H2OOrdinalModel
Bases: h2o.model.model_base.ModelBase
-
confusion_matrix
(data)
Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.
- Parameters
data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.
-
hit_ratio_table
(train=False, valid=False, xval=False)
Retrieve the Hit Ratios.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train – If train is True, then return the hit ratio value for the training data.
valid – If valid is True, then return the hit ratio value for the validation data.
xval – If xval is True, then return the hit ratio value for the cross-validation data.
- Returns
The hit ratio for this ordinal model.
-
mean_per_class_error
(train=False, valid=False, xval=False)
Retrieve the mean per class error across all classes.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the mean_per_class_error value for the training data.
valid (bool) – If True, return the mean_per_class_error value for the validation data.
xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.
- Returns
The mean_per_class_error values for the specified key(s).
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)
Plots training set (and validation set if available) scoring history for an H2OOrdinalModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep – A unit of measurement for the x-axis.
metric – A unit of measurement for the y-axis.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
-
-
class h2o.model.H2OClusteringModel
Bases: h2o.model.model_base.ModelBase
-
betweenss
(train=False, valid=False, xval=False)
Get the between cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the between cluster sum of squares value for the training data.
valid (bool) – If True, return the between cluster sum of squares value for the validation data.
xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.
- Returns
The between cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> betweenss = km.betweenss() # <- Default: return training metrics
>>> betweenss
>>> betweenss3 = km.betweenss(train=False,
...                           valid=False,
...                           xval=True)
>>> betweenss3
-
centers
()
The centers for the KMeans model.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers()
-
centers_std
()
The standardized centers for the KMeans model.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers_std()
-
centroid_stats
(train=False, valid=False)
Get the centroid statistics for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the centroid statistic for the training data.
valid (bool) – If True, return the centroid statistic for the validation data.
- Returns
The centroid statistics for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> centroid_stats = km.centroid_stats() # <- Default: return training metrics
>>> centroid_stats
>>> centroid_stats1 = km.centroid_stats(train=True,
...                                     valid=False)
>>> centroid_stats1
-
num_iterations
()
Get the number of iterations it took to converge or reach max iterations.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.num_iterations()
-
size
(train=False, valid=False)
Get the sizes of each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the cluster sizes for the training data.
valid (bool) – If True, return the cluster sizes for the validation data.
- Returns
The cluster sizes for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> size = km.size() # <- Default: return training metrics
>>> size
>>> size1 = km.size(train=False,
...                 valid=False)
>>> size1
-
tot_withinss
(train=False, valid=False, xval=False)
Get the total within cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the total within cluster sum of squares value for the training data.
valid (bool) – If True, return the total within cluster sum of squares value for the validation data.
xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.
- Returns
The total within cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> tot_withinss = km.tot_withinss() # <- Default: return training metrics
>>> tot_withinss
>>> tot_withinss2 = km.tot_withinss(train=True,
...                                 valid=False,
...                                 xval=True)
>>> tot_withinss2
-
totss
(train=False, valid=False, xval=False)
Get the total sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the total sum of squares value for the training data.
valid (bool) – If True, return the total sum of squares value for the validation data.
xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.
- Returns
The total sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> totss = km.totss() # <- Default: return training metrics
>>> totss
-
withinss
(train=False, valid=False)
Get the within cluster sum of squares for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the within cluster sum of squares value for the training data.
valid (bool) – If True, return the within cluster sum of squares value for the validation data.
- Returns
The within cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> withinss = km.withinss() # <- Default: return training metrics
>>> withinss
>>> withinss2 = km.withinss(train=True,
...                         valid=True)
>>> withinss2
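The sum-of-squares metrics of a clustering are linked by the standard identity totss = tot_withinss + betweenss. The sketch below checks it on a hypothetical two-cluster toy dataset in plain Python; all names are illustrative, and because H2O may standardize columns before clustering, values reported by a trained model need not match a computation on raw data.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coord) / len(points) for coord in zip(*points)]


# Hypothetical cluster assignment: two clusters of two 2-D points each.
clusters = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[5.0, 5.0], [7.0, 5.0]],
]
all_points = [p for cluster in clusters for p in cluster]
grand_mean = mean_point(all_points)
centers = [mean_point(cluster) for cluster in clusters]

# Within cluster sum of squares, one value per cluster, and their total.
withinss = [sum(sq_dist(p, c) for p in cluster) for cluster, c in zip(clusters, centers)]
tot_withinss = sum(withinss)

# Between cluster sum of squares: size-weighted distance of each center from the grand mean.
betweenss = sum(len(cluster) * sq_dist(c, grand_mean) for cluster, c in zip(clusters, centers))

# Total sum of squares: distance of every point from the grand mean.
totss = sum(sq_dist(p, grand_mean) for p in all_points)
```

A larger share of betweenss in totss means the clusters are better separated relative to their internal spread.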
-
-
class h2o.model.H2ODimReductionModel
Bases: h2o.model.model_base.ModelBase
Dimension reduction model, such as PCA or GLRM.
-
num_iterations
()
Get the number of iterations that it took to converge or reach max iterations.
-
proj_archetypes
(test_data, reverse_transform=False)
Convert archetypes of the model into original feature space.
- Parameters
test_data (H2OFrame) – The dataset upon which the model was trained.
reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the projected archetypes.
- Returns
model archetypes projected back into the original training data’s feature space.
-
reconstruct
(test_data, reverse_transform=False)
Reconstruct the training data from the model and impute all missing values.
- Parameters
test_data (H2OFrame) – The dataset upon which the model was trained.
reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the reconstructed frame.
- Returns
the approximate reconstruction of the training data.
-
screeplot
(type='barplot', server=False, save_plot_path=None)
Produce the scree plot.
The matplotlib library is required for this function.
- Parameters
type (str) – either "barplot" or "lines".
server (bool) – if True, set server settings to matplotlib and do not show the graph.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting scree plot (can be accessed like result.figure()).
-
-
class h2o.model.H2OAutoEncoderModel
Bases: h2o.model.model_base.ModelBase
-
anomaly
(test_data, per_feature=False)
Obtain the reconstruction error for the input test_data.
- Parameters
test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.
per_feature (bool) – Whether to return the square reconstruction error per feature. Otherwise, return the mean square error.
- Returns
the reconstruction error.
- Examples
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train = train[predictors]
>>> test = test[predictors]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[2],
...                                    l1=1e-5,
...                                    ignore_const_cols=False,
...                                    epochs=1)
>>> ae_model.train(x=predictors,training_frame=train)
>>> test_rec_error = ae_model.anomaly(test)
>>> test_rec_error
>>> test_rec_error_features = ae_model.anomaly(test, per_feature=True)
>>> test_rec_error_features
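The difference between the two return modes can be sketched in plain Python. This is an illustration of the arithmetic only: `reconstruction_error` and the toy rows are hypothetical, and the sketch ignores any scaling H2O applies to the inputs before computing the error.

```python
def reconstruction_error(original, reconstructed, per_feature=False):
    """Per-row mean squared error, or per-feature squared errors when per_feature=True."""
    result = []
    for x, x_hat in zip(original, reconstructed):
        squared = [(a - b) ** 2 for a, b in zip(x, x_hat)]
        result.append(squared if per_feature else sum(squared) / len(squared))
    return result


# Hypothetical input rows and their autoencoder reconstructions.
original = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
reconstructed = [[1.0, 1.0, 5.0], [0.0, 3.0, 0.0]]

mse = reconstruction_error(original, reconstructed)
per_feat = reconstruction_error(original, reconstructed, per_feature=True)
```

Rows with a large mean squared error are the ones the autoencoder reproduces poorly, which is what makes the value usable as an anomaly score; the per-feature form shows which columns drive that error.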
-
-
class h2o.model.H2OBinomialUpliftModel
Bases: h2o.model.model_base.ModelBase
-
atc
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect on the Control
If all are False (default), then return the training ATC metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATC value for the training data.
valid (bool) – If True, return the ATC value for the validation data.
- Returns
the ATC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.atc() # <- Default: return training metric value
>>> uplift_model.atc(train=True)
-
ate
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect
If all are False (default), then return the training ATE metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATE value for the training data.
valid (bool) – If True, return the ATE value for the validation data.
- Returns
the ATE value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.ate() # <- Default: return training metric value
>>> uplift_model.ate(train=True)
-
att
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect on the Treated
If all are False (default), then return the training ATT metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATT value for the training data.
valid (bool) – If True, return the ATT value for the validation data.
- Returns
the ATT value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.att() # <- Default: return training metric value
>>> uplift_model.att(train=True)
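The three treatment-effect metrics (ATE, ATT, ATC) differ only in which rows are averaged. The sketch below assumes each metric is the mean of the per-row predicted uplift over all rows, the treated rows, or the control rows respectively. That assumption, the helper name average_uplift, and the sample values are illustrative and are not taken from the H2O source.

```python
def average_uplift(uplift_predictions, treatment_flags, group=None):
    """Mean predicted uplift over all rows (group=None),
    treated rows (group=1) or control rows (group=0)."""
    selected = [
        u for u, t in zip(uplift_predictions, treatment_flags)
        if group is None or t == group
    ]
    return sum(selected) / len(selected)


# Hypothetical per-row uplift predictions and treatment indicators.
uplift = [0.2, 0.4, -0.1, 0.3]
treated = [1, 1, 0, 0]

ate = average_uplift(uplift, treated)           # all rows
att = average_uplift(uplift, treated, group=1)  # treated rows only
atc = average_uplift(uplift, treated, group=0)  # control rows only
print(ate, att, atc)
```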
-
auuc
(metric=None, train=False, valid=False)[source]¶ Retrieve area under uplift curve (AUUC) value for the specified metrics in model params.
If all are False (default), then return the training metric AUUC value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the AUUC value for the training data.
valid (bool) – If True, return the AUUC value for the validation data.
metric – AUUC metric type. One of:
“qini”
“lift”
“gain”
None (default; metric set in parameters)
- Returns
AUUC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc() # <- Default: return training metric value
>>> uplift_model.auuc(train=True, valid=True)
-
auuc_normalized
(metric=None, train=False, valid=False)[source]¶ Retrieve normalized area under uplift curve (AUUC) value for the specified metrics in model params.
If all are False (default), then return the training metric normalized AUUC value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the normalized AUUC value for the training data.
valid (bool) – If True, return the normalized AUUC value for the validation data.
metric – AUUC metric type. One of:
“qini”
“lift”
“gain”
None (default; metric set in parameters)
- Returns
Normalized AUUC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc_normalized() # <- Default: return training metric value
>>> uplift_model.auuc_normalized(train=True, valid=True)
-
auuc_table
(train=False, valid=False)[source]¶ Retrieve all types of AUUC in a table.
If all are False (default), then return the training metric AUUC table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the AUUC table for the training data.
valid (bool) – If True, return the AUUC table for the validation data.
- Returns
the AUUC table for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc_table() # <- Default: return training metric value
>>> uplift_model.auuc_table(train=True)
-
n
(train=False, valid=False)[source]¶ Retrieve numbers of observations.
If all are False (default), then return the training metric number of observations. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the number of observations for the training data.
valid (bool) – If True, return the number of observations for the validation data.
- Returns
a list of numbers of observations for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.n() # <- Default: return training metric value
>>> uplift_model.n(train=True)
-
qini
(train=False, valid=False)[source]¶ Retrieve Qini value (area between Qini cumulative uplift curve and random curve)
If all are False (default), then return the training Qini value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the Qini value for the training data.
valid (bool) – If True, return the Qini value for the validation data.
- Returns
the Qini value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.qini() # <- Default: return training metric value
>>> uplift_model.qini(train=True)
-
thresholds
(train=False, valid=False)[source]¶ Retrieve prediction thresholds for the specified metrics.
If all are False (default), then return the training metric prediction thresholds. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the prediction thresholds for the training data.
valid (bool) – If True, return the prediction thresholds for the validation data.
- Returns
a list of prediction thresholds for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.thresholds() # <- Default: return training metric value
>>> uplift_model.thresholds(train=True)
-
thresholds_and_metric_scores
(train=False, valid=False)[source]¶ Retrieve thresholds and metric scores table for the specified metrics.
If all are False (default), then return the training metric thresholds and metric scores table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the thresholds and metric scores table for the training data.
valid (bool) – If True, return the thresholds and metric scores table for the validation data.
- Returns
the thresholds and metric scores table for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.thresholds_and_metric_scores() # <- Default: return training metric value
>>> uplift_model.thresholds_and_metric_scores(train=True)
-
uplift
(metric='qini', train=False, valid=False)[source]¶ Retrieve uplift values for the specified metrics.
If all are False (default), then return the training metric uplift values. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the uplift values for the training data.
valid (bool) – If True, return the uplift values for the validation data.
metric – Uplift metric type. One of:
“qini” (default)
“lift”
“gain”
- Returns
a list of uplift values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.uplift() # <- Default: return training metric value
>>> uplift_model.uplift(train=True, metric="gain")
-
uplift_normalized
(metric='qini', train=False, valid=False)[source]¶ Retrieve normalized uplift values for the specified metrics.
If all are False (default), then return the training metric normalized uplift values. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the normalized uplift values for the training data.
valid (bool) – If True, return the normalized uplift values for the validation data.
metric – Uplift metric type. One of:
“qini” (default)
“lift”
“gain”
- Returns
a list of normalized uplift values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.uplift_normalized() # <- Default: return training metric value
>>> uplift_model.uplift_normalized(train=True, metric="gain")
-
-
class
h2o.model.
ConfusionMatrix
(cm, domains=None, table_header=None)[source]¶ Bases:
h2o.display.H2ODisplay
-
ROUND
= 4¶
-
-
class
h2o.model.
H2OSegmentModels
(segment_models_id=None)[source]¶ Bases:
h2o.base.Keyed
Collection of H2O Models built for each input segment.
- Parameters
segment_models_id – identifier of this collection of Segment Models
- Example
>>> segment_models = h2o.model.segment_models.H2OSegmentModels(segment_models_id="my_sm_id") >>> segment_models.as_frame()
-
as_frame
()[source]¶ Converts this collection of models to a tabular representation.
- Returns
An H2OFrame, first columns identify the input segments, rest of the columns describe the built models.
-
property
key
¶ - Returns
the unique key representing the object on the backend
ModelBase
¶
-
class
h2o.model.model_base.
ModelBase
[source]¶ Bases:
h2o.model.model_base.ModelBase
Base class for all models.
-
property
actual_params
¶ Dictionary of actual parameters of the model.
-
aic
(train=False, valid=False, xval=False)[source]¶ Get the AIC (Akaike Information Criterion).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the AIC value for the training data.
valid (bool) – If valid=True, then return the AIC value for the validation data.
xval (bool) – If xval=True, then return the AIC value for the cross-validation data.
- Returns
The AIC.
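For reference, the textbook definition of the criterion is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The helper below is a minimal sketch of that general formula with made-up inputs; it is not H2O's code, and how H2O counts parameters for a given model family is not covered here.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fitted model: log-likelihood of -120.5 with 3 parameters.
value = aic(log_likelihood=-120.5, n_params=3)
print(value)
```

Lower values indicate a better trade-off between fit and model size when comparing models fitted to the same data.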
-
auc
(train=False, valid=False, xval=False)[source]¶ Get the AUC (Area Under Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the AUC value for the training data.
valid (bool) – If valid=True, then return the AUC value for the validation data.
xval (bool) – If xval=True, then return the AUC value for the cross-validation data.
- Returns
The AUC.
-
aucpr
(train=False, valid=False, xval=False)[source]¶ Get the AUCPR (area under the precision-recall curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the AUCPR value for the training data.
valid (bool) – If valid=True, then return the AUCPR value for the validation data.
xval (bool) – If xval=True, then return the AUCPR value for the cross-validation data.
- Returns
The aucpr.
-
average_objective
()[source]¶ Retrieve the model's average objective function value from the scoring history, if it exists, for a GLM model. If there is no regularization, the average objective value multiplied by obj_reg should equal the neg_log_likelihood value.
- Returns
the average objective function value
-
biases
(vector_id=0)[source]¶ Return the frame for the respective bias vector.
- Parameters
vector_id – an integer, ranging from 0 to number of layers, that specifies the bias vector to return.
- Returns
an H2OFrame which represents the bias vector identified by vector_id.
-
calibrate
(calibration_model)[source]¶ Calibrate a trained model with a supplied calibration model.
Only tree-based models can be calibrated.
- Parameters
calibration_model – a GLM model (for Platt Scaling) or Isotonic Regression model trained with the purpose of calibrating output of this model.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.isotonicregression import H2OIsotonicRegressionEstimator
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> df["Angaus"] = df["Angaus"].asfactor()
>>> train, calib = df.split_frame(ratios=[.8], destination_frames=["eco_train", "eco_calib"], seed=42)
>>> model = H2OGradientBoostingEstimator()
>>> model.train(x=list(range(2, train.ncol)), y="Angaus", training_frame=train)
>>> isotonic_train = calib[["Angaus"]]
>>> isotonic_train = isotonic_train.cbind(model.predict(calib)["p1"])
>>> h2o_iso_reg = H2OIsotonicRegressionEstimator(out_of_bounds="clip")
>>> h2o_iso_reg.train(training_frame=isotonic_train, x="p1", y="Angaus")
>>> model.calibrate(h2o_iso_reg)
>>> model.predict(train)
-
coef
()[source]¶ Return the coefficients which can be applied to the non-standardized data.
Note: standardize=True by default; when standardize=False, then coef() will return the coefficients which are fit directly.
-
coef_norm
()[source]¶ Return coefficients fitted on the standardized data (requires standardize=True, which is on by default).
These coefficients can be used to evaluate variable importance.
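The relationship between standardized and non-standardized coefficients can be sketched in plain Python. The example assumes each predictor was standardized as (x - mean) / sd; the helper name destandardize and all numbers are hypothetical, and this is not a description of H2O's internal code.

```python
def destandardize(intercept_norm, coefs_norm, means, sds):
    """Map coefficients fitted on (x - mean) / sd back to the raw scale."""
    coefs = {name: coefs_norm[name] / sds[name] for name in coefs_norm}
    intercept = intercept_norm - sum(coefs[name] * means[name] for name in coefs)
    return intercept, coefs


# Hypothetical standardized fit with two predictors.
intercept, coefs = destandardize(
    intercept_norm=1.0,
    coefs_norm={"age": 2.0, "psa": -1.0},
    means={"age": 60.0, "psa": 10.0},
    sds={"age": 4.0, "psa": 2.0},
)

# Both parameterizations give the same linear predictor for a raw row.
row = {"age": 64.0, "psa": 12.0}
raw_scale = intercept + sum(coefs[k] * row[k] for k in row)
std_scale = 1.0 + 2.0 * (64.0 - 60.0) / 4.0 - 1.0 * (12.0 - 10.0) / 2.0
print(raw_scale, std_scale)
```

Because standardized coefficients share a common scale, their magnitudes are comparable across predictors, which is why they are the ones suited to judging variable importance.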
-
cross_validation_fold_assignment
()[source]¶ Obtain the cross-validation fold assignment for all rows in the training data.
- Returns
H2OFrame
-
cross_validation_holdout_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.
This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.
- Returns
H2OFrame
-
cross_validation_metrics_summary
()[source]¶ Retrieve Cross-Validation Metrics Summary.
- Returns
The cross-validation metrics summary as an H2OTwoDimTable
-
cross_validation_models
()[source]¶ Obtain a list of cross-validation models.
- Returns
list of H2OModel objects.
-
cross_validation_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.
Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.
- Returns
list of H2OFrame objects.
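The zero fill-in is what makes the per-fold frames sum to the combined holdout predictions. The following plain-Python sketch mimics that with lists instead of H2OFrames; the fold assignment and scores are made-up values.

```python
# Hypothetical fold assignment and out-of-fold score for six training rows.
fold_assignment = [0, 1, 2, 0, 1, 2]
holdout_scores = [0.9, 0.1, 0.4, 0.7, 0.3, 0.6]
n_folds = 3

# One full-length vector per fold: the score where the row was held out
# by that fold's model, and 0 everywhere else.
per_fold = [
    [score if fold == k else 0.0
     for score, fold in zip(holdout_scores, fold_assignment)]
    for k in range(n_folds)
]

# Summing the vectors row by row recovers the combined holdout predictions.
combined = [sum(column) for column in zip(*per_fold)]
print(per_fold[0])
print(combined)
```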
-
deepfeatures
(test_data, layer)[source]¶ Return hidden layer details.
- Parameters
test_data – Data to create a feature space on.
layer – zero-based index of the hidden layer.
-
property
default_params
¶ Dictionary of the default parameters of the model.
-
download_model
(path='', filename=None)[source]¶ Download an H2O Model object to disk.
- Parameters
path – a path to the directory where the model should be saved.
filename – a filename for the saved model.
- Returns
the path of the downloaded model.
-
download_mojo
(path='.', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the model in MOJO format.
- Parameters
path – the path where MOJO file should be saved.
get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.
genmodel_name – Custom name of genmodel jar
- Returns
name of the MOJO file written.
-
download_pojo
(path='', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the POJO for this model to the directory specified by path.
If path is an empty string, then dump the output to screen.
- Parameters
path – An absolute path to the directory where POJO should be saved.
get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.
genmodel_name – Custom name of genmodel jar
- Returns
name of the POJO file written.
-
property
end_time
¶ Timestamp (milliseconds since 1970) when the model training ended.
-
explain
(frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r', background_frame=None)¶ Generate model explanations on frame data set.
The H2O Explainability Interface is a convenient wrapper to a number of explainability methods and visualizations in H2O. The function can be applied to a single model or group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
qualitative_colormap – used for setting the qualitative colormap that is passed to individual plots.
sequential_colormap – used for setting the sequential colormap that is passed to individual plots.
background_frame – optional frame that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models, but it can be more time- and memory-consuming.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
-
explain_row
(frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True, background_frame=None)¶ Generate model explanations on frame data set for a given instance.
Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – H2OAutoML object, supervised H2O model, or list of supervised H2O models.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
qualitative_colormap – a colormap name.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
background_frame – optional frame that is used as the source of baselines for the marginal SHAP. Setting it enables calculating SHAP in more models, but it can be more time- and memory-consuming.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
-
feature_frequencies
(test_data)[source]¶ Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.
- Parameters
test_data (H2OFrame) – Data on which to calculate feature frequencies.
- Returns
A new H2OFrame made of feature frequencies.
- Examples
>>> from h2o.estimators import H2OIsolationForestEstimator
>>> h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")
>>> train,test = h2o_df.split_frame(ratios=[0.75])
>>> model = H2OIsolationForestEstimator(sample_rate = 0.1,
...                                     max_depth = 20,
...                                     ntrees = 50)
>>> model.train(training_frame=train)
>>> model.feature_frequencies(test_data = test)
-
feature_interaction
(max_interaction_depth=100, max_tree_depth=100, max_deepening=-1, path=None)[source]¶ Feature interactions and importance, leaf statistics and split value histograms in a tabular form. Available for XGBoost and GBM.
Metrics:
Gain - Total gain of each feature or feature interaction.
FScore - Number of possible splits taken on a feature or feature interaction.
wFScore - Number of possible splits taken on a feature or feature interaction, weighted by the probability of the splits taking place.
Average wFScore - wFScore divided by FScore.
Average Gain - Gain divided by FScore.
Expected Gain - Total gain of each feature or feature interaction, weighted by the probability of gathering the gain.
Average Tree Index
Average Tree Depth
- Parameters
max_interaction_depth – Upper bound for extracted feature interactions depth. Defaults to 100.
max_tree_depth – Upper bound for tree depth. Defaults to 100.
max_deepening – Upper bound for interaction start deepening (zero deepening => interactions starting at root only). Defaults to -1.
path – (Optional) Path where to save the output in .xlsx format (e.g. /mypath/file.xlsx). Please note that Pandas and XlsxWriter need to be installed for using this option. Defaults to None.
- Examples
>>> from h2o.estimators import H2OXGBoostEstimator
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(seed=1234)
>>> boston_xgb.train(y=response, x=predictors, training_frame=train)
>>> feature_interactions = boston_xgb.feature_interaction()
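Two of the metrics listed for feature_interaction are simple ratios of the others. The sketch below shows that arithmetic for one feature; the function name derived_metrics and the input numbers are hypothetical, and it does not read the actual output tables.

```python
def derived_metrics(gain, fscore, wfscore):
    """Average Gain and Average wFScore, both divided by FScore."""
    return {
        "average_gain": gain / fscore,
        "average_wfscore": wfscore / fscore,
    }


# Hypothetical totals for one feature.
stats = derived_metrics(gain=120.0, fscore=8, wfscore=6.0)
print(stats)
```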
-
property
full_parameters
¶ Dictionary of the full specification of all parameters.
-
get_xval_models
(key=None)[source]¶ Return a Model object.
- Parameters
key – If None, return all cross-validated models; otherwise return the model that key points to.
- Returns
A model or list of models.
-
gini
(train=False, valid=False, xval=False)[source]¶ Get the Gini coefficient.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the Gini Coefficient value for the training data.
valid (bool) – If valid=True, then return the Gini Coefficient value for the validation data.
xval (bool) – If xval=True, then return the Gini Coefficient value for the cross validation data.
- Returns
The Gini Coefficient for this binomial model.
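The train/valid/xval flags behave the same way across the metric accessors of this class. The sketch below mimics the documented convention on a plain dictionary. `select_metrics` is a hypothetical helper, not part of the h2o package, and returning a bare value when exactly one flag is set is an assumption (the text only specifies the all-False and more-than-one cases):

```python
def select_metrics(metrics, train=False, valid=False, xval=False):
    """Mimic the documented train/valid/xval convention on a plain dict.

    `metrics` maps "train", "valid" and "xval" to metric values.
    Hypothetical helper for illustration; it is not part of the h2o package.
    """
    flags = {"train": train, "valid": valid, "xval": xval}
    chosen = [name for name, is_set in flags.items() if is_set]
    if not chosen:
        # All flags False (the default): return the training metric value.
        return metrics["train"]
    if len(chosen) == 1:
        # Assumption: a single flag yields that one value rather than a dict.
        return metrics[chosen[0]]
    # More than one flag set: return a dictionary keyed by data set name.
    return {name: metrics[name] for name in chosen}


gini_values = {"train": 0.91, "valid": 0.84, "xval": 0.82}  # made-up numbers
print(select_metrics(gini_values))                         # 0.91
print(select_metrics(gini_values, valid=True))             # 0.84
print(select_metrics(gini_values, train=True, xval=True))  # {'train': 0.91, 'xval': 0.82}
```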
-
h
(frame, variables)[source]¶ Calculates Friedman and Popescu’s H statistic in order to test for the presence of an interaction between the specified variables in H2O GBM and XGBoost models. H varies from 0 to 1. It has a value of 0 if the model exhibits no interaction between the specified variables, and a correspondingly larger value for a stronger interaction effect between them. NaN is returned if the computation is spoiled by weak main effects and rounding errors.
See Jerome H. Friedman and Bogdan E. Popescu, 2008, “Predictive learning via rule ensembles”, Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.
- Parameters
frame – the frame that the current model has been fitted to.
variables – the variables of interest.
- Returns
H statistic of the variables.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> prostate_train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
>>> prostate_train["CAPSULE"] = prostate_train["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=100, learn_rate=0.1,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="bernoulli")
>>> gbm_h2o.train(x=list(range(1, prostate_train.ncol)), y="CAPSULE", training_frame=prostate_train)
>>> h = gbm_h2o.h(prostate_train, ['DPROS', 'DCAPS'])
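To make the meaning of the statistic concrete, here is a brute-force sketch of the pairwise H statistic on toy functions, following the definition in the reference cited for this method: the squared difference between the centered joint partial dependence and the two centered single-variable partial dependences, divided by the squared joint partial dependence. This is an illustration on plain Python callables, not H2O's implementation. All function names are hypothetical.

```python
from itertools import product
from math import sqrt


def centered(values):
    """Shift a list of numbers so that its mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def partial_dependence(model, rows, fixed):
    """Average model output over `rows` with the columns in `fixed` overridden."""
    total = 0.0
    for row in rows:
        x = list(row)
        for index, value in fixed.items():
            x[index] = value
        total += model(x)
    return total / len(rows)


def h_statistic(model, rows, j, k):
    """Pairwise H statistic for columns j and k (brute force, tiny data only)."""
    pd_jk = centered([partial_dependence(model, rows, {j: r[j], k: r[k]}) for r in rows])
    pd_j = centered([partial_dependence(model, rows, {j: r[j]}) for r in rows])
    pd_k = centered([partial_dependence(model, rows, {k: r[k]}) for r in rows])
    numerator = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    denominator = sum(a * a for a in pd_jk)
    # A zero denominator means there is no effect to attribute, so report NaN.
    return sqrt(numerator / denominator) if denominator > 0 else float("nan")


def additive(x):
    return 2 * x[0] + 3 * x[1]   # no interaction between the two columns


def interacting(x):
    return x[0] * x[1]           # pure interaction term


rows = list(product([0, 1, 2], repeat=2))  # toy 3 x 3 grid of two features
h_additive = h_statistic(additive, rows, 0, 1)
h_interacting = h_statistic(interacting, rows, 0, 1)
print(h_additive, h_interacting)
```

The additive function yields a value of zero, while the product term yields a clearly positive value, which matches the interpretation given in the description.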
-
property
have_mojo
¶ True, if export to MOJO is possible
-
property
have_pojo
¶ True, if export to POJO is possible
-
ice_plot
(frame, column, target=None, max_levels=30, figsize=(16, 9), colormap='plasma', save_plot_path=None, show_pdp=True, binary_response_scale='response', centered=False, grouping_column=None, output_graphing_data=False, nbins=100, show_rug=True, **kwargs)¶ Plot Individual Conditional Expectations (ICE) for each decile.
The individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. The ICE plot is similar to a partial dependence plot (PDP), but whereas a PDP shows the average effect of a feature, an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to a partial dependence plot, the ICE plot can provide more insight, especially when there is stronger feature interaction. The plot also shows the original observation values, marked by a semi-transparent circle on each ICE line. Note that the score of the original observation value may differ from the score value of the underlying ICE line at the original observation point, because the ICE line is drawn as an interpolation of several points.
- Parameters
model – H2OModel.
frame – H2OFrame.
column – string containing column name.
target – (only for multinomial classification) for what target should the plot be done.
max_levels – maximum number of factor levels to show.
figsize – figure size; passed directly to matplotlib.
colormap – colormap name.
save_plot_path – a path to save the plot via using matplotlib function savefig.
show_pdp – option to turn on/off PDP line. Defaults to True.
binary_response_scale – option for binary model to display (on the y-axis) the logodds instead of the actual score. Can be one of: “response” (default) or “logodds”.
centered – a bool that determines whether to center curves around 0 at the first valid x value or not.
grouping_column – a feature column name to group the data and provide separate sets of plots by grouping feature values.
output_graphing_data – a bool that determines whether to output final graphing data to a frame.
nbins – Number of bins used.
show_rug – Show rug to visualize the density of the column.
- Returns
object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response:
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM:
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the individual conditional expectations plot:
>>> gbm.ice_plot(test, column="alcohol")
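The relationship between ICE curves and the PDP line can be sketched without H2O: each ICE curve sweeps one column for a single row, and the partial dependence is the pointwise average of those curves. The toy model and all function names below are hypothetical.

```python
def ice_curve(model, row, column, grid):
    """Predictions for one row as `column` sweeps over `grid`, other columns fixed."""
    curve = []
    for value in grid:
        x = list(row)
        x[column] = value
        curve.append(model(x))
    return curve


def pdp_curve(model, rows, column, grid):
    """Partial dependence: the pointwise average of all ICE curves."""
    curves = [ice_curve(model, row, column, grid) for row in rows]
    return [sum(points) / len(points) for points in zip(*curves)]


def toy_model(x):
    # Hypothetical model with an interaction between column 0 and column 1.
    return x[0] * x[1]


rows = [(1.0, -1.0), (1.0, 1.0)]
grid = [1.0, 2.0, 3.0]

ice = [ice_curve(toy_model, row, 0, grid) for row in rows]
pdp = pdp_curve(toy_model, rows, 0, grid)
print(ice)  # [[-1.0, -2.0, -3.0], [1.0, 2.0, 3.0]]
print(pdp)  # [0.0, 0.0, 0.0]
```

In this toy case the averaged curve is flat even though each instance responds strongly to the column, which is the situation where the ICE view adds information over the PDP.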
-
property
key
¶ - Returns
the unique key representing the object on the backend
-
learning_curve_plot
(metric='AUTO', cv_ribbon=None, cv_lines=None, figsize=(16, 9), colormap=None, save_plot_path=None)¶ Learning curve plot.
Create the learning curve plot for an H2O Model. Learning curves show the error metric dependence on learning progress (e.g. RMSE vs number of trees trained so far in GBM). There can be up to 4 curves showing Training, Validation, Training on CV Models, and Cross-validation error.
- Parameters
model – an H2O model.
metric – a stopping metric.
cv_ribbon – if True, plot the CV mean and CV standard deviation as a ribbon around the mean; if None, it will attempt to automatically determine if this is a suitable visualization.
cv_lines – if True, plot scoring history for individual CV models; if None, it will attempt to automatically determine if this is a suitable visualization.
figsize – figure size; passed directly to matplotlib.
colormap – colormap to use.
save_plot_path – a path to save the plot via using matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the learning curve plot
>>> gbm.learning_curve_plot()
-
loglikelihood
(train=False, valid=False, xval=False)[source]¶ Get the log likelihood.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the log likelihood value for the training data.
valid (bool) – If valid=True, then return the log likelihood value for the validation data.
xval (bool) – If xval=True, then return the log likelihood value for the cross validation data.
- Returns
The log likelihood.
-
logloss
(train=False, valid=False, xval=False)[source]¶ Get the Log Loss.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the log loss value for the training data.
valid (bool) – If valid=True, then return the log loss value for the validation data.
xval (bool) – If xval=True, then return the log loss value for the cross validation data.
- Returns
The log loss for this model.
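For reference, the standard binary log loss can be computed by hand as below. This is a sketch of the textbook formula, assuming 0/1 labels and predicted probabilities strictly between 0 and 1. It does not reproduce any clipping or weighting H2O may apply. The function name is hypothetical.

```python
from math import log


def binary_log_loss(actual, predicted_p1):
    """Mean negative log-likelihood of 0/1 labels under predicted P(class 1)."""
    total = 0.0
    for y, p in zip(actual, predicted_p1):
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(actual)


actual = [1, 0, 1, 1]
predicted_p1 = [0.9, 0.2, 0.6, 0.5]  # made-up probabilities
loss = binary_log_loss(actual, predicted_p1)
print(round(loss, 4))  # 0.3831
```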
-
mae
(train=False, valid=False, xval=False)[source]¶ Get the Mean Absolute Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the MAE value for the training data.
valid (bool) – If valid=True, then return the MAE value for the validation data.
xval (bool) – If xval=True, then return the MAE value for the cross validation data.
- Returns
The MAE for this regression model.
-
mean_residual_deviance
(train=False, valid=False, xval=False)[source]¶ Get the Mean Residual Deviance.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the Mean Residual Deviance value for the training data.
valid (bool) – If valid=True, then return the Mean Residual Deviance value for the validation data.
xval (bool) – If xval=True, then return the Mean Residual Deviance value for the cross validation data.
- Returns
The Mean Residual Deviance for this regression model.
-
property
model_id
¶ Model identifier.
-
model_performance
(test_data=None, train=False, valid=False, xval=False, auc_type='none', auuc_type=None, custom_auuc_thresholds=None)[source]¶ Generate model metrics for this model on test_data.
- Parameters
test_data (H2OFrame) – Data set against which model metrics are computed. All three of the train, valid and xval arguments are ignored if test_data is not None.
train (bool) – Report the training metrics for the model. Defaults to False.
valid (bool) – Report the validation metrics for the model. Defaults to False.
xval (bool) – Report the cross-validation metrics for the model. Defaults to False.
auc_type (String) –
Change default AUC type for multinomial classification AUC/AUCPR calculation when test_data is not None. One of:
"auto"
"none" (default)
"macro_ovr"
"weighted_ovr"
"macro_ovo"
"weighted_ovo"
If type is "auto" or "none", AUC and AUCPR are not calculated.
auuc_type (String) –
Change default AUUC type for uplift binomial classification AUUC calculation when test_data is not None. One of:
"AUTO" (default)
"qini"
"lift"
"gain"
If type is "auto" (“qini”), AUUC is calculated.
custom_auuc_thresholds (list of float) – List of custom thresholds to calculate AUUC when test_data is not None. Defaults to None.
- Returns
An instance of MetricsBase or one of its subclasses.
-
mse
(train=False, valid=False, xval=False)[source]¶ Get the Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the MSE value for the training data.
valid (bool) – If valid=True, then return the MSE value for the validation data.
xval (bool) – If xval=True, then return the MSE value for the cross validation data.
- Returns
The MSE for this regression model.
-
negative_log_likelihood
()[source]¶ Retrieve the model’s negative log-likelihood function value from the scoring history, if it exists (GLM models only).
- Returns
the negative log-likelihood function value
-
ntrees_actual
()[source]¶ Returns actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value a user set before building the model.
Type:
float
-
null_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the null degrees of freedom (dof) if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.
- Returns
Return the null dof, or None if it is not present.
-
null_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the null deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.
- Returns
Return the null deviance, or None if it is not present.
-
property
params
¶ Get the parameters and the actual/default values only.
- Returns
A dictionary of parameters used to build this model.
-
partial_plot
(frame, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_plot_path=None, row_index=None, targets=None)[source]¶ Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.
- Parameters
frame (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.
cols – Feature(s) for which partial dependence will be calculated.
destination_key – A key reference to the created partial dependence tables in H2O.
nbins – Number of bins used. For categorical columns, make sure the number of bins exceeds the level count. If you enable add_missing_NA, the returned length will be nbin+1.
weight_column – A string denoting which column of data should be used as the weight column.
plot – A boolean specifying whether to plot partial dependence table.
plot_stddev – A boolean specifying whether to add std err to partial dependence plot.
figsize – Dimension/size of the returning plots, adjust to fit your output cells.
server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.
include_na – A boolean specifying whether missing value should be included in the Feature values.
user_splits – A dictionary containing column names as key and user defined split values as value in a list.
col_pairs_2dpdp – List containing pairs of column names for 2D pdp.
save_plot_path – Fully qualified name to an image file the resulting plot should be saved to (e.g. '/home/user/pdpplot.png'). The ‘png’ postfix might be omitted. If the file already exists, it will be overwritten. Plot is only saved if plot=True.
row_index – Row for which partial dependence will be calculated instead of the whole input frame.
targets – Target classes for multiclass model.
- Returns
Plot and list of calculated mean response tables for each feature requested + the resulting plot (can be accessed using result.figure()).
-
pd_plot
(frame, column, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2', save_plot_path=None, binary_response_scale='response', grouping_column=None, output_graphing_data=False, nbins=100, show_rug=True, **kwargs)¶ Plot partial dependence plot.
The partial dependence plot (PDP) provides a graph of the marginal effect of a variable on the response. The effect of a variable is measured by the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest.
- Parameters
model – H2O Model object.
frame – H2OFrame.
column – string containing column name.
row_index – if None, do partial dependence; if integer, do individual conditional expectation for the row specified by this integer.
target – (only for multinomial classification) for what target should the plot be done.
max_levels – maximum number of factor levels to show.
figsize – figure size; passed directly to matplotlib.
colormap – colormap name; used to get just the first color to keep the api and color scheme similar with pd_multi_plot.
save_plot_path – a path to save the plot via using matplotlib function savefig.
binary_response_scale – option for binary model to display (on the y-axis) the logodds instead of the actual score. Can be one of: “response” (default), “logodds”.
grouping_column – A feature column name to group the data and provide separate sets of plots by grouping feature values.
output_graphing_data – a bool that determines whether to output final graphing data to a frame.
nbins – Number of bins used.
show_rug – Show rug to visualize the density of the column
- Returns
object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create partial dependence plot
>>> gbm.pd_plot(test, column="alcohol")
-
permutation_importance
(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, use_pandas=False)[source]¶ Get Permutation Variable Importance.
When n_repeats == 1, the result is similar to the one from the varimp() method (i.e. it contains the following columns: “Relative Importance”, “Scaled Importance”, and “Percentage”).
When n_repeats > 1, the individual columns correspond to the permutation variable importance values from individual runs, which correspond to the “Relative Importance” and also to the distance between the original prediction error and the prediction error using a frame with a given feature permuted.
- Parameters
frame – training frame.
metric –
metric to be used. One of:
”AUTO”
”AUC”
”MAE”
”MSE”
”RMSE”
”logloss”
”mean_per_class_error”
”PR_AUC”
Defaults to “AUTO”.
n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10000.
n_repeats – number of repeated evaluations. Defaults to 1.
features – features to include in the permutation importance. Use None to include all.
seed – seed for the random generator. Use -1 (default) to pick a random seed.
use_pandas – set to True to return pandas data frame.
- Returns
H2OTwoDimTable or Pandas data frame
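The idea behind permutation importance, the distance between the original prediction error and the error on a frame with one feature permuted, can be sketched in plain Python. This is a generic illustration using MSE on a toy model. It is not H2O's implementation, and the function names and data are hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Error increase after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for feature in range(n_features):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild every row with only this feature's value replaced.
        permuted = [
            r[:feature] + (value,) + r[feature + 1:]
            for r, value in zip(rows, shuffled)
        ]
        importances.append(mse(model, permuted, targets) - baseline)
    return importances


def toy_model(x):
    # Hypothetical model that only uses feature 0 and ignores feature 1.
    return 3.0 * x[0]


rows = [(float(i), float(i % 3)) for i in range(20)]
targets = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, targets, n_features=2)
print(importances)
```

Shuffling the feature the model ignores leaves the error unchanged, so its importance is zero, while shuffling the feature the model depends on increases the error.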
-
permutation_importance_plot
(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, num_of_features=10, server=False, save_plot_path=None)[source]¶ Plot Permutation Variable Importance. This method plots either a bar plot or, if n_repeats > 1, a box plot and returns the variable importance table.
- Parameters
frame – training frame.
metric –
metric to be used. One of:
”AUTO”
”AUC”
”MAE”
”MSE”
”RMSE”
”logloss”
”mean_per_class_error”
”PR_AUC”
Defaults to “AUTO”.
n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10000.
n_repeats – number of repeated evaluations. Defaults to 1.
features – features to include in the permutation importance. Use None to include all.
seed – seed for the random generator. Use -1 (default) to pick a random seed.
num_of_features – number of features to plot. Defaults to 10.
server – if True, set server settings to matplotlib and do not show the plot.
save_plot_path – a path to save the plot via using matplotlib function savefig.
- Returns
object that contains H2OTwoDimTable with variable importance and the resulting figure (can be accessed using result.figure()).
-
pr_auc
(train=False, valid=False, xval=False)[source]¶ ModelBase.pr_auc is deprecated, please use ModelBase.aucpr instead.
-
predict
(test_data, custom_metric=None, custom_metric_func=None)[source]¶ Predict on a dataset.
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
custom_metric – custom evaluation function defined as a class reference; the class gets uploaded into the cluster.
custom_metric_func – custom evaluation function reference (e.g., the result of upload_custom_metric).
- Returns
A new H2OFrame of predictions.
-
predict_contributions
(test_data, output_format='Original', top_n=None, bottom_n=None, compare_abs=False, background_frame=None, output_space=False, output_per_reference=False)[source]¶ Predict feature contributions - SHAP values on an H2O Model (only GBM, XGBoost, DRF models and equivalent imported MOJOs).
Returned H2OFrame has shape (#rows, #features + 1). There is a feature contribution column for each input feature, and the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based models is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.
Note: Multinomial classification models are currently not supported.
- Parameters
test_data (H2OFrame) – Data on which to calculate contributions.
output_format (Enum) – Specify how to output feature contributions in XGBoost. XGBoost by default outputs contributions for 1-hot encoded features; specifying a Compact output format will produce a per-feature contribution. One of: "Original" (default), "Compact".
top_n –
Return only #top_n highest contributions + bias:
If top_n<0 then sort all SHAP values in descending order
If top_n<0 && bottom_n<0 then sort all SHAP values in descending order
bottom_n –
Return only #bottom_n lowest contributions + bias:
If top_n and bottom_n are defined together then return array of #top_n + #bottom_n + bias
If bottom_n<0 then sort all SHAP values in ascending order
If top_n<0 && bottom_n<0 then sort all SHAP values in descending order
compare_abs – True to compare absolute values of contributions
background_frame – Optional frame, that is used as the source of baselines for the baseline SHAP (when output_per_reference == True) or for the marginal SHAP (when output_per_reference == False).
output_space – If True, linearly scale the contributions so that they sum up to the prediction. NOTE: This will result only in approximate SHAP values even if the model supports exact SHAP calculation. NOTE: This will not have any effect if the estimator doesn’t use a link function.
output_per_reference – If True, return baseline SHAP, i.e., contribution for each data point for each reference from the background_frame. If False, return TreeSHAP if no background_frame is provided, or marginal SHAP if background frame is provided. Can be used only with background_frame.
- Returns
A new H2OFrame made of feature contributions.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Compute SHAP
>>> m.predict_contributions(fr)
>>> # Compute SHAP and pick the two highest
>>> m.predict_contributions(fr, top_n=2)
>>> # Compute SHAP and pick the two lowest
>>> m.predict_contributions(fr, bottom_n=2)
>>> # Compute SHAP and pick the two highest regardless of the sign
>>> m.predict_contributions(fr, top_n=2, compare_abs=True)
>>> # Compute SHAP and pick the two lowest regardless of the sign
>>> m.predict_contributions(fr, bottom_n=2, compare_abs=True)
>>> # Compute SHAP values and show them all in descending order
>>> m.predict_contributions(fr, top_n=-1)
>>> # Compute SHAP and pick the two highest and the two lowest
>>> m.predict_contributions(fr, top_n=2, bottom_n=2)
>>> # Compute Marginal SHAP; this enables looking at the contributions
>>> # against different baselines, e.g., older people in the following example
>>> m.predict_contributions(fr, background_frame=fr[fr["AGE"] > 75, :])
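The additivity property and the top_n / bottom_n selection described for predict_contributions can be illustrated on a single row of made-up contributions. `select_contributions` is a hypothetical helper that handles only non-negative counts. It is a sketch of the documented behaviour, not H2O's code, and the column names and values are invented.

```python
def select_contributions(contributions, bias, top_n=0, bottom_n=0, compare_abs=False):
    """Keep the `top_n` highest and `bottom_n` lowest contributions, plus the bias."""
    def rank(item):
        # Rank by absolute value when compare_abs is set, otherwise by signed value.
        return abs(item[1]) if compare_abs else item[1]

    highest = sorted(contributions.items(), key=rank, reverse=True)[:top_n]
    lowest = sorted(contributions.items(), key=rank)[:bottom_n]
    return highest + lowest + [("bias", bias)]


# Hypothetical contributions for one row (not real H2O output).
contributions = {"AGE": 0.5, "PSA": -0.75, "GLEASON": 1.25}
bias = -0.25

# Feature contributions plus the bias term add up to the raw prediction.
raw_prediction = sum(contributions.values()) + bias
print(raw_prediction)  # 0.75

print(select_contributions(contributions, bias, top_n=2))
# [('GLEASON', 1.25), ('AGE', 0.5), ('bias', -0.25)]
print(select_contributions(contributions, bias, top_n=2, compare_abs=True))
# [('GLEASON', 1.25), ('PSA', -0.75), ('bias', -0.25)]
```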
-
predict_leaf_node_assignment
(test_data, type='Path')[source]¶ Predict on a dataset and return the leaf node assignment (only for tree-based models).
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node of the tree to the node or by H2O’s internal node id. One of: "Path" (default), "Node_ID".
- Returns
A new H2OFrame of predictions.
-
predicted_vs_actual_by_variable
(frame, predicted, variable, use_pandas=False)[source]¶ Calculates per-level mean of predicted value vs actual value for a given variable.
In the basic setting, this function is equivalent to doing group-by on variable and calculating mean on predicted and actual. It also handles NAs in response and weights automatically.
- Parameters
frame – input frame (can be training/test/... frame).
predicted – frame of predictions for the given input frame.
variable – variable to inspect.
use_pandas – set true to return pandas data frame.
- Returns
H2OTwoDimTable or Pandas data frame
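The basic setting, a group-by on the variable followed by the mean of predicted and actual values, can be sketched as follows. This is an unweighted illustration in plain Python. Skipping rows whose actual value is missing is an assumption about what handling NAs in the response means, and `mean_by_level` is a hypothetical name.

```python
from collections import defaultdict


def mean_by_level(levels, predicted, actual):
    """Per-level mean of predicted and actual values, skipping rows with a missing actual."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum predicted, sum actual, count]
    for level, p, a in zip(levels, predicted, actual):
        if a is None:
            continue  # assumption: rows with a missing response are ignored
        entry = sums[level]
        entry[0] += p
        entry[1] += a
        entry[2] += 1
    return {level: (s_p / n, s_a / n) for level, (s_p, s_a, n) in sums.items()}


levels = ["a", "b", "a", "b", "a"]
predicted = [1.0, 4.0, 3.0, 6.0, 9.0]
actual = [2.0, 4.0, 2.0, 8.0, None]
result = mean_by_level(levels, predicted, actual)
print(result)  # {'a': (2.0, 2.0), 'b': (5.0, 6.0)}
```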
-
r2
(train=False, valid=False, xval=False)[source]¶ Return the R squared for this regression model.
Will return \(R^2\) for GLM Models.
The \(R^2\) value is defined to be \(1 - MSE / var\), where var is computed as \(\sigma * \sigma\).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the R^2 value for the training data.
valid (bool) – If valid=True, then return the R^2 value for the validation data.
xval (bool) – If xval=True, then return the R^2 value for the cross validation data.
- Returns
The R squared for this regression model.
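The definition \(1 - MSE / var\) can be checked by hand on a few numbers. In this sketch var is taken to be the population variance of the actual values (divide by n). Whether H2O divides by n or n - 1 is not stated here, so treat that as an assumption. `r_squared` is a hypothetical helper.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - MSE / var, with var the population variance of the actual values."""
    n = len(actual)
    mean = sum(actual) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    var = sum((a - mean) ** 2 for a in actual) / n  # assumption: divide by n
    return 1 - mse / var


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
r2_value = r_squared(actual, predicted)
print(r2_value)  # MSE = 0.125, var = 1.25, so R^2 = 0.9
```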
-
residual_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual degrees of freedom (dof) if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.
- Returns
Return the residual dof, or None if it is not present.
-
residual_deviance
(train=False, valid=False, xval=None)[source]¶ Retrieve the residual deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.
- Returns
Return the residual deviance, or None if it is not present.
-
rmse
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the RMSE value for the training data.
valid (bool) – If valid=True, then return the RMSE value for the validation data.
xval (bool) – If xval=True, then return the RMSE value for the cross validation data.
- Returns
The RMSE for this regression model.
-
rmsle
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Squared Logarithmic Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train=True, then return the RMSLE value for the training data.
valid (bool) – If valid=True, then return the RMSLE value for the validation data.
xval (bool) – If xval=True, then return the RMSLE value for the cross validation data.
- Returns
The RMSLE for this regression model.
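The regression error metrics exposed by this class (MAE, MSE, RMSE and RMSLE) follow their standard definitions, sketched here in plain Python for unweighted data. The helper name is hypothetical, and the RMSLE line uses the common log(1 + value) form, which requires values greater than -1.

```python
from math import log1p, sqrt


def regression_metrics(actual, predicted):
    """Standard unweighted MAE, MSE, RMSE and RMSLE for two equal-length lists."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    # RMSLE compares log(1 + value), so all values must be greater than -1.
    rmsle = sqrt(sum((log1p(p) - log1p(a)) ** 2 for a, p in zip(actual, predicted)) / n)
    return {"mae": mae, "mse": mse, "rmse": rmse, "rmsle": rmsle}


metrics = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
print(metrics["mae"], metrics["mse"])  # 1.0 1.5
```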
-
row_to_tree_assignment
(original_training_data)[source]¶ Output row to tree assignment for the model and provided training data.
Output is a frame of size nrow = nrow(original_training_data) and ncol = number_of_trees_in_model+1 in the format:
row_id  tree_1  tree_2  tree_3
0       0       1       1
1       1       1       1
2       1       0       0
3       1       1       0
4       0       1       1
5       1       1       1
6       1       0       0
7       0       1       0
8       0       1       1
9       1       0       0
- Parameters
original_training_data (H2OFrame) – Data that was used for model training. Currently there is no validation of the input.
- Returns
A new H2OFrame made of row to tree assignment output.
Note: Multinomial classification generates a tree for each category; each tree uses the same sample of the data.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234, sample_rate=0.6)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Output row to tree assignment
>>> m.row_to_tree_assignment(fr)
-
property
run_time
¶ Model training time in milliseconds.
-
save_model_details
(path='', force=False, filename=None)[source]¶ Save Model Details of an H2O Model in JSON Format to disk.
- Parameters
path – a path to save the model details at (e.g. hdfs, s3, local).
force – if True, overwrite destination directory in case it exists, or throw exception if set to False.
filename – a filename for the saved model (file type is always .json).
- Returns str
the path of the saved model details
-
save_mojo
(path='', force=False, filename=None)[source]¶ Save an H2O Model as MOJO (Model Object, Optimized) to disk.
- Parameters
path – a path to save the model at (e.g. hdfs, s3, local).
force – if True, overwrite destination directory in case it exists, or throw exception if set to False.
filename – a filename for the saved model (file type is always .zip).
- Returns str
the path of the saved model
-
score_history
()[source]¶ DEPRECATED. Use scoring_history() instead.
-
scoring_history
()[source]¶ Retrieve Model Score History.
- Returns
The score history as an H2OTwoDimTable or a Pandas DataFrame.
-
shap_explain_row_plot
(frame, row_index, columns=None, top_n_features=10, figsize=(16, 9), plot_type='barplot', contribution_type='both', save_plot_path=None, background_frame=None)¶ SHAP local explanation.
SHAP explanation shows the contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model (i.e. the prediction before applying inverse link function). H2O implements TreeSHAP which, when the features are correlated, can increase the contribution of a feature that had no influence on the prediction.
- Parameters
model – h2o tree model, such as DRF, XRT, GBM, XGBoost.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable). When plot_type="barplot", then top_n_features will be chosen for each contribution_type.
figsize – figure size; passed directly to matplotlib.
plot_type – either “barplot” or “breakdown”.
contribution_type –
One of:
”positive”
”negative”
”both”
Used only for plot_type="barplot".
save_plot_path – a path to save the plot via using matplotlib function savefig.
background_frame – optional frame, that is used as the source of baselines for the marginal SHAP.
- Returns
object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP row explanation plot
>>> gbm.shap_explain_row_plot(test, row_index=0)
-
shap_summary_plot
(frame, columns=None, top_n_features=20, samples=1000, colorize_factors=True, alpha=1, colormap=None, figsize=(12, 12), jitter=0.35, save_plot_path=None, background_frame=None)¶ SHAP summary plot.
The SHAP summary plot shows the contribution of features for each instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model (i.e. prediction before applying inverse link function).
- Parameters
model – h2o tree model (e.g. DRF, XRT, GBM, XGBoost).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
samples – maximum number of observations to use; if lower than number of rows in the frame, take a random sample.
colorize_factors – if True, use colors from the colormap to colorize the factors; otherwise all levels will have same color.
alpha – transparency of the points.
colormap – colormap to use instead of the default blue to red colormap.
figsize – figure size; passed directly to matplotlib.
jitter – amount of jitter used to show the point density.
save_plot_path – a path to save the plot using the matplotlib function savefig.
background_frame – optional frame used as the source of baselines for the marginal SHAP.
- Returns
object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP summary plot
>>> gbm.shap_summary_plot(test)
-
show
(verbosity=None, fmt=None)[source]¶ Describes and renders the current object in the given format and verbosity level, if supported. By default, guesses the best format for the current environment.
- Parameters
verbosity – one of (None, ‘short’, ‘medium’, ‘full’). Defaults to None (object’s default verbosity).
fmt – one of (None, ‘plain’, ‘pretty’, ‘html’). Defaults to None (picks appropriate format depending on platform/context).
-
staged_predict_proba
(test_data)[source]¶ Predict class probabilities at each stage of an H2O Model (only GBM models).
The output structure is analogous to the output of function predict_leaf_node_assignment. For each tree t and class c there will be a column Tt.Cc (e.g. T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class, obtained by combining the raw contributions of trees T1.Cc, ..., Tt.Cc. Binomial models build the trees just for the first class, and values in columns Tx.C1 thus correspond to the probability p0.
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
- Returns
A new H2OFrame of staged predictions.
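The Tt.Cc naming scheme can be illustrated without a running cluster. The helper below is hypothetical (it is not part of the H2O API) and only generates the column labels; the tree-major ordering of the columns is an assumption made for the sketch.

```python
def staged_column_names(ntrees, nclasses):
    """Build Tt.Cc labels for trees 1..ntrees and classes 1..nclasses (tree-major order assumed)."""
    return [f"T{t}.C{c}"
            for t in range(1, ntrees + 1)
            for c in range(1, nclasses + 1)]


# Multinomial sketch: 2 trees, 3 classes.
multinomial_cols = staged_column_names(ntrees=2, nclasses=3)

# Binomial sketch: trees are built for the first class only, so only C1 columns appear.
binomial_cols = staged_column_names(ntrees=3, nclasses=1)

print(multinomial_cols)
print(binomial_cols)
```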
-
property
start_time
¶ Timestamp (milliseconds since 1970) when the model training was started.
-
std_coef_plot
(num_of_features=None, server=False, save_plot_path=None)[source]¶ Plot a model’s standardized coefficient magnitudes.
- Parameters
num_of_features – the number of features shown in the plot.
server – if True, set server settings to matplotlib and do not show the graph.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using result.figure()).
-
property
type
¶ The type of model built. One of:
"classifier"
"regressor"
"unsupervised"
-
update_tree_weights
(frame, weights_column)[source]¶ Re-calculates tree-node weights based on the provided dataset. Modifying node weights will affect how contribution predictions (Shapley values) are calculated. This can be used to explain the model on a curated sub-population of the training dataset.
- Parameters
frame – frame that will be used to re-populate trees with new observations and to collect per-node weights.
weights_column – name of the weight column (can be different from training weights).
-
varimp
(use_pandas=False)[source]¶ Pretty print the variable importances, or return them in a list.
- Parameters
use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame.
- Returns
A list or Pandas DataFrame.
-
varimp_plot
(num_of_features=None, server=False, save_plot_path=None)[source]¶ Plot the variable importance for a trained model.
- Parameters
num_of_features – the number of features shown in the plot (default is 10 or all if less than 10).
server – if True, set server settings to matplotlib and do not show the graph.
save_plot_path – a path to save the plot using the matplotlib function savefig.
- Returns
object that contains the resulting figure (can be accessed using result.figure()).
-
weights
(matrix_id=0)[source]¶ Return the frame for the respective weight matrix.
- Parameters
matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return.
- Returns
an H2OFrame which represents the weight matrix identified by matrix_id.
-
property
xvals
¶ Return a list of the cross-validated models.
- Returns
A list of models.
-
property
Binomial Classification
¶
-
class
h2o.model.models.binomial.
H2OBinomialModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
F0point5
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F0.5 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F0.5 value for the training data.
valid (bool) – If True, return the F0.5 value for the validation data.
xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.
- Returns
The F0.5 values for the specified key(s).
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> F0point5 = gbm.F0point5() # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True, valid=True, xval=True)
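For orientation, F0.5, F1, and F2 are all instances of the general F-beta score. The sketch below computes it from confusion-matrix counts using the standard textbook formula; the counts are hypothetical and the function is not part of the H2O API.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score: beta < 1 weights precision more, beta > 1 weights recall more."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


tp, fp, fn = 40, 10, 20  # hypothetical counts at a single threshold
f0point5 = f_beta(tp, fp, fn, beta=0.5)
f1 = f_beta(tp, fp, fn, beta=1.0)
f2 = f_beta(tp, fp, fn, beta=2.0)
print(f0point5, f1, f2)
```

In this example precision (0.8) exceeds recall (about 0.67), so the precision-weighted F0.5 is the largest of the three and the recall-weighted F2 is the smallest.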
-
F1
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F1 value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F1 value for the training data.
valid (bool) – If True, return the F1 value for the validation data.
xval (bool) – If True, return the F1 value for each of the cross-validated splits.
- Returns
The F1 values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F1() # <- Default: return training metric value
>>> gbm.F1(train=True, valid=True, xval=True)
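The train/valid/xval convention shared by these metric accessors can be sketched in plain Python. The helper below is hypothetical and only mimics the documented behavior; the handling of exactly one flag (returning a bare value rather than a dictionary) is an assumption, since the description only covers the zero-flag and multi-flag cases.

```python
def select_metric(values, train=False, valid=False, xval=False):
    """Mimic the documented convention for a dict like {"train": .., "valid": .., "xval": ..}."""
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [key for key, flag in flags if flag]
    if not requested:
        return values["train"]        # default: training metric value
    if len(requested) == 1:
        return values[requested[0]]   # assumption: a single flag yields a bare value
    return {key: values[key] for key in requested}


f1_values = {"train": 0.91, "valid": 0.87, "xval": 0.85}  # hypothetical metric values
print(select_metric(f1_values))
print(select_metric(f1_values, train=True, valid=True, xval=True))
```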
-
F2
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F2 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the F2 value for the training data.
valid (bool) – If True, return the F2 value for the validation data.
xval (bool) – If True, return the F2 value for each of the cross-validated splits.
- Returns
The F2 values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F2() # <- Default: return training metric value
>>> gbm.F2(train=True, valid=True, xval=True)
-
accuracy
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the accuracy for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the accuracy value for the training data.
valid (bool) – If True, return the accuracy value for the validation data.
xval (bool) – If True, return the accuracy value for each of the cross-validated splits.
- Returns
The accuracy values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.accuracy() # <- Default: return training metric value
>>> gbm.accuracy(train=True, valid=True, xval=True)
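The phrase "the threshold maximizing the metric" can be made concrete with a small threshold sweep. This is a sketch on hypothetical scores and labels, assuming a row is predicted as class 1 when its score is at or above the threshold; it does not reproduce how H2O bins thresholds internally.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of rows classified correctly when predicting 1 for score >= threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]  # hypothetical predicted probabilities of class 1
labels = [0, 0, 1, 0, 1, 1]                     # hypothetical actual classes

# Try each distinct score as a candidate threshold; on ties, max() keeps the lowest one.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=lambda t: accuracy_at(t, scores, labels))
best_accuracy = accuracy_at(best_threshold, scores, labels)
print(best_threshold, best_accuracy)
```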
-
confusion_matrix
(metrics=None, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the confusion matrix for the specified metrics/thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
metrics – A string (or list of strings) among metrics listed in H2OBinomialModelMetrics.maximizing_metrics. Defaults to 'f1'.
thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.
train (bool) – If True, return the confusion matrix value for the training data.
valid (bool) – If True, return the confusion matrix value for the validation data.
xval (bool) – If True, return the confusion matrix value for each of the cross-validated splits.
- Returns
The confusion matrix values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.confusion_matrix() # <- Default: return training metric value
>>> gbm.confusion_matrix(train=True, valid=True, xval=True)
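To show what a confusion matrix at a given threshold contains, here is a minimal sketch that tallies the four cells by hand. The data is hypothetical, and the rule "predict 1 when score >= threshold" is an assumption of the sketch rather than a statement about H2O internals.

```python
def confusion_counts(threshold, scores, labels):
    """Return (tn, fp, fn, tp), predicting class 1 when score >= threshold."""
    tn = fp = fn = tp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]  # hypothetical predicted probabilities of class 1
labels = [0, 0, 1, 0, 1, 1]                     # hypothetical actual classes
tn, fp, fn, tp = confusion_counts(0.5, scores, labels)

# Rows: actual 0, actual 1; columns: predicted 0, predicted 1.
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```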
-
error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the error value for the training data.
valid (bool) – If True, return the error value for the validation data.
xval (bool) – If True, return the error value for each of the cross-validated splits.
- Returns
The error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.error() # <- Default: return training metric
>>> gbm.error(train=True, valid=True, xval=True)
-
fallout
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the fallout for a set of thresholds (aka False Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the fallout value for the training data.
valid (bool) – If True, return the fallout value for the validation data.
xval (bool) – If True, return the fallout value for each of the cross-validated splits.
- Returns
The fallout values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fallout() # <- Default: return training metric
>>> gbm.fallout(train=True, valid=True, xval=True)
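Fallout (the False Positive Rate) is the share of actual negatives that were predicted positive, and it is the complement of specificity. A minimal sketch with hypothetical counts, using the standard definitions:

```python
def fallout(fp, tn):
    """False Positive Rate: actual negatives that were predicted positive."""
    return fp / (fp + tn)


def specificity(fp, tn):
    """True Negative Rate: actual negatives that were predicted negative."""
    return tn / (fp + tn)


fp, tn = 10, 90  # hypothetical counts at a single threshold
print(fallout(fp, tn), specificity(fp, tn))
```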
-
find_idx_by_threshold
(threshold, train=False, valid=False, xval=False)[source]¶ Retrieve the index in this metric’s threshold list at which the given threshold is located.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
threshold (float) – Threshold value to search for in the threshold list.
train (bool) – If True, return the threshold index for the training data.
valid (bool) – If True, return the threshold index for the validation data.
xval (bool) – If True, return the threshold index for each of the cross-validated splits.
- Returns
The threshold indices for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438,
...                                           train=True)
>>> idx_threshold
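Conceptually, the lookup maps a threshold value to a position in the metric's list of thresholds. The sketch below uses a hypothetical helper that returns the index of the closest entry; falling back to the nearest threshold when there is no exact match is an assumption of this sketch, and the threshold list is made up.

```python
def closest_threshold_index(thresholds, threshold):
    """Return the index of the entry in `thresholds` closest to `threshold`."""
    return min(range(len(thresholds)),
               key=lambda i: abs(thresholds[i] - threshold))


thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]  # hypothetical threshold list
exact_idx = closest_threshold_index(thresholds, 0.5)
nearest_idx = closest_threshold_index(thresholds, 0.39438)
print(exact_idx, nearest_idx)
```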
-
find_threshold_by_max_metric
(metric, train=False, valid=False, xval=False)[source]¶ If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.
train (bool) – If True, return the threshold maximizing the metric for the training data.
valid (bool) – If True, return the threshold maximizing the metric for the validation data.
xval (bool) – If True, return the threshold maximizing the metric for each of the cross-validated splits.
- Returns
The thresholds maximizing the metric for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> max_metric = gbm.find_threshold_by_max_metric(metric="f2",
...                                               train=True)
>>> max_metric
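Finding a threshold by maximum metric amounts to an argmax over per-threshold metric values. The following sketch uses made-up F2 values aligned with a made-up threshold list; it illustrates the idea only and does not call H2O.

```python
thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]      # hypothetical threshold list
f2_values = [0.40, 0.62, 0.71, 0.78, 0.66]  # hypothetical F2 value at each threshold

# Index of the largest metric value, then the threshold stored at that index.
best_idx = max(range(len(thresholds)), key=lambda i: f2_values[i])
best_threshold = thresholds[best_idx]
print(best_idx, best_threshold)
```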
-
fnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Negative Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the FNR value for the training data.
valid (bool) – If True, return the FNR value for the validation data.
xval (bool) – If True, return the FNR value for each of the cross-validated splits.
- Returns
The FNR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fnr() # <- Default: return training metric
>>> gbm.fnr(train=True, valid=True, xval=True)
-
fpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Positive Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the FPR value for the training data.
valid (bool) – If True, return the FPR value for the validation data.
xval (bool) – If True, return the FPR value for each of the cross-validated splits.
- Returns
The FPR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fpr() # <- Default: return training metric
>>> gbm.fpr(train=True, valid=True, xval=True)
-
gains_lift
(train=False, valid=False, xval=False)[source]¶ Get the Gains/Lift table for the specified metrics.
If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the Gains/Lift table for the training data.
valid (bool) – If True, return the Gains/Lift table for the validation data.
xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.
- Returns
The Gains/Lift tables for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.gains_lift() # <- Default: return training metric Gain/Lift table
>>> gbm.gains_lift(train=True, valid=True, xval=True)
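A Gains/Lift table ranks rows by predicted score, splits them into groups, and reports how many positives each cumulative slice captures. The sketch below builds a simplified table in plain Python from hypothetical data. The dictionary keys are illustrative names chosen for this sketch, the equal-size grouping assumes the row count is divisible by the number of groups, and the real table returned by H2O contains more columns.

```python
def gains_lift_table(scores, labels, groups):
    """Simplified cumulative gains/lift table over equal-size groups of rows ranked by score."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    total_pos = sum(ranked)
    overall_rate = total_pos / len(ranked)
    size = len(ranked) // groups  # assumes len(ranked) is divisible by groups
    table, cum_pos = [], 0
    for g in range(groups):
        cum_pos += sum(ranked[g * size:(g + 1) * size])
        cum_n = (g + 1) * size
        table.append({
            "group": g + 1,
            "cumulative_data_fraction": cum_n / len(ranked),
            "cumulative_capture_rate": cum_pos / total_pos,
            "cumulative_lift": (cum_pos / cum_n) / overall_rate,
        })
    return table


scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]  # hypothetical predicted probabilities
labels = [1, 1, 1, 0, 1, 0, 0, 0]                          # hypothetical actual classes
table = gains_lift_table(scores, labels, groups=4)
for row in table:
    print(row)
```

A cumulative lift above 1 in the first groups indicates that the model ranks positives ahead of negatives better than random ordering would; by construction the last row always has a cumulative lift of 1.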
-
gains_lift_plot
(type='both', xval=False, server=False, save_plot_path=None, plot=True)[source]¶ Plot Gains/Lift curves.
- Parameters
type – One of:
“both” (default)
“gains”
“lift”
xval – if True, use cross-validation metrics.
server – if True, generate plot inline using matplotlib’s “Agg” backend.
save_plot_path – filename to save the plot to.
plot – True to plot the curve, False to get a Gains/Lift table.
- Returns
Gains/Lift table and the resulting plot (can be accessed using result.figure()).
-
kolmogorov_smirnov
()[source]¶ Retrieves the Kolmogorov-Smirnov metric (K-S metric) for a given binomial model. The number returned is in range between 0 and 1. The K-S metric represents the degree of separation between the positive (1) and negative (0) cumulative distribution functions. Detailed metrics per each group are to be found in the gains-lift table.
- Returns
Kolmogorov-Smirnov metric, a number between 0 and 1.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OGradientBoostingEstimator(ntrees=1,
...                                      gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.kolmogorov_smirnov()
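The "degree of separation" between the two cumulative distributions can be illustrated with a row-level computation. This is a generic sketch of the K-S statistic on hypothetical data with no tied scores; H2O reports the metric alongside the grouped Gains/Lift table, so its value on real data may differ from this row-by-row version.

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative fractions of positives and negatives,
    scanning rows from the highest score to the lowest (assumes no tied scores)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    seen_pos = seen_neg = 0
    best = 0.0
    for _, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label == 1:
            seen_pos += 1
        else:
            seen_neg += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
    return best


scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]  # hypothetical predicted probabilities
labels = [1, 1, 1, 0, 1, 0, 0, 0]                          # hypothetical actual classes
ks = ks_statistic(scores, labels)
print(ks)
```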
-
max_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the max per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the max per class error value for the training data.
valid (bool) – If True, return the max per class error value for the validation data.
xval (bool) – If True, return the max per class error value for each of the cross-validated splits.
- Returns
The max per class error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.max_per_class_error() # <- Default: return training metric value
>>> gbm.max_per_class_error(train=True, valid=True, xval=True)
-
mcc
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the MCC for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the MCC value for the training data.
valid (bool) – If True, return the MCC value for the validation data.
xval (bool) – If True, return the MCC value for each of the cross-validated splits.
- Returns
The MCC values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mcc() # <- Default: return training metric value
>>> gbm.mcc(train=True, valid=True, xval=True)
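MCC (Matthews Correlation Coefficient) summarizes all four confusion-matrix cells in a single value between -1 and 1. The sketch below applies the standard formula to hypothetical counts; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as 0
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=50, tn=50, fp=0, fn=0))     # perfect agreement
print(mcc(tp=25, tn=25, fp=25, fn=25))   # no better than chance
print(mcc(tp=0, tn=0, fp=50, fn=50))     # total disagreement
```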
-
mean_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the mean per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold minimizing the error will be used.
train (bool) – If True, return the mean per class error value for the training data.
valid (bool) – If True, return the mean per class error value for the validation data.
xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.
- Returns
The mean per class error values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mean_per_class_error() # <- Default: return training metric
>>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
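Per-class error looks at each actual class separately, which matters when classes are imbalanced. The sketch below derives the mean and the max per class error from hypothetical confusion-matrix counts using the usual definitions; it is an illustration, not H2O code.

```python
def per_class_errors(tn, fp, fn, tp):
    """Error rate within each actual class: (class 0 error, class 1 error)."""
    error_class0 = fp / (tn + fp)  # actual negatives predicted positive
    error_class1 = fn / (fn + tp)  # actual positives predicted negative
    return error_class0, error_class1


errors = per_class_errors(tn=90, fp=10, fn=20, tp=30)  # hypothetical counts
mean_per_class_error = sum(errors) / len(errors)
max_per_class_error = max(errors)
print(mean_per_class_error, max_per_class_error)
```

Here the overall error is 30/150 = 0.2, lower than the mean per class error of 0.25, because the larger class 0 is predicted more accurately than class 1.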
-
metric
(metric, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the metric value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
metric (str) – name of the metric to retrieve.
thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).
train (bool) – If True, return the metric value for the training data.
valid (bool) – If True, return the metric value for the validation data.
xval (bool) – If True, return the metric value for each of the cross-validated splits.
- Returns
The metric values for the specified key(s).
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> # thresholds parameter must be a list (i.e. [0.01, 0.5, 0.99])
>>> thresholds = [0.01,0.5,0.99]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
-
missrate
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the miss rate for a set of thresholds (aka False Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the miss rate value for the training data.
valid (bool) – If True, return the miss rate value for the validation data.
xval (bool) – If True, return the miss rate value for each of the cross-validated splits.
- Returns
The miss rate values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.missrate() # <- Default: return training metric
>>> gbm.missrate(train=True, valid=True, xval=True)
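The miss rate (False Negative Rate) is the share of actual positives that the model failed to flag, so it is the complement of recall. A minimal sketch with hypothetical counts, using the standard definitions:

```python
def missrate(fn, tp):
    """False Negative Rate: actual positives that were predicted negative."""
    return fn / (fn + tp)


def recall(fn, tp):
    """True Positive Rate: actual positives that were predicted positive."""
    return tp / (fn + tp)


fn, tp = 20, 30  # hypothetical counts at a single threshold
print(missrate(fn, tp), recall(fn, tp))
```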
-
plot
(timestep='AUTO', metric='AUTO', server=False, save_plot_path=None)[source]¶ Plot training set (and validation set if available) scoring history for an H2OBinomialModel.
The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep (str) – A unit of measurement for the x-axis.
metric (str) – A unit of measurement for the y-axis.
server (bool) – If True, then generate the image inline (using matplotlib’s “Agg” backend).
save_plot_path – A path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting figure (can be accessed using result.figure()).
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> response = 3 >>> predictors = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10] >>> model = H2OGeneralizedLinearEstimator(family="binomial") >>> model.train(x=predictors, y=response, training_frame=benign) >>> model.plot(timestep="AUTO", metric="objective", server=False)
-
precision
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the precision for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the precision value for the training data.
valid (bool) – If True, return the precision value for the validation data.
xval (bool) – If True, return the precision value for each of the cross-validated splits.
- Returns
The precision values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.precision() # <- Default: return training metric value >>> gbm.precision(train=True, valid=True, xval=True)
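As an illustration of what a per-threshold precision value means, the following sketch computes TP / (TP + FP) in plain Python. The function name is hypothetical, the `>=` comparison is an assumed convention, and returning 0.0 when nothing is predicted positive is a choice made for this sketch rather than documented H2O behavior.

```python
def precision_at(labels, scores, threshold):
    """Precision = TP / (TP + FP) among rows with score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    # Sketch-only convention: no predicted positives yields 0.0.
    return tp / (tp + fp) if (tp + fp) else 0.0


labels = [1, 0, 1, 0, 1]
scores = [0.8, 0.7, 0.6, 0.2, 0.1]
precisions = {t: precision_at(labels, scores, t) for t in (0.5, 0.75, 0.9)}
```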
-
recall
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the recall for a set of thresholds (aka True Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the recall value for the training data.
valid (bool) – If True, return the recall value for the validation data.
xval (bool) – If True, return the recall value for each of the cross-validated splits.
- Returns
The recall values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.recall() # <- Default: return training metric >>> gbm.recall(train=True, valid=True, xval=True)
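The train/valid/xval return convention shared by these accessors can be sketched as a small dispatch function. This is an illustration only: `select_metrics` is a hypothetical helper, not part of H2O, and the single-flag case (exactly one option set to True) is not spelled out in the text above, so returning the bare value there is an assumption.

```python
def select_metrics(metrics_by_key, train=False, valid=False, xval=False):
    """Pick precomputed metrics following the documented convention.

    metrics_by_key maps "train", "valid" and "xval" to metric values.
    """
    requested = [key for key, flag in
                 (("train", train), ("valid", valid), ("xval", xval)) if flag]
    if not requested:
        # All flags False (default): return the training metric.
        return metrics_by_key["train"]
    if len(requested) == 1:
        # Assumption: a single flag returns the bare value, not a dict.
        return metrics_by_key[requested[0]]
    # More than one flag: dictionary keyed by "train"/"valid"/"xval".
    return {key: metrics_by_key[key] for key in requested}


recalls = {"train": 0.91, "valid": 0.87, "xval": 0.85}
default_value = select_metrics(recalls)
both = select_metrics(recalls, train=True, xval=True)
```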
-
roc
(train=False, valid=False, xval=False)[source]¶ Return the coordinates of the ROC curve for a given set of data.
The coordinates are two-tuples containing the false positive rates as a list and true positive rates as a list. If all are False (default), then the training data is returned. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.
- Parameters
train (bool) – If True, return the ROC coordinates for the training data.
valid (bool) – If True, return the ROC coordinates for the validation data.
xval (bool) – If True, return the ROC coordinates for each of the cross-validated splits.
- Returns
The ROC values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.roc() # <- Default: return training data >>> gbm.roc(train=True, valid=True, xval=True)
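The shape of the returned data (a list of false positive rates paired with a list of true positive rates) can be illustrated with a small threshold sweep. This is a sketch under assumptions, not H2O's algorithm: every distinct score is used as a threshold, rows with score >= threshold count as positive, both classes are assumed to be present, and `roc_points` is a hypothetical name.

```python
def roc_points(labels, scores):
    """Return (fprs, tprs), sweeping thresholds from the highest score down."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    fprs, tprs = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        tprs.append(tp / positives)
        fprs.append(fp / negatives)
    return fprs, tprs


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
fprs, tprs = roc_points(labels, scores)
```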
-
sensitivity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the sensitivity value for the training data.
valid (bool) – If True, return the sensitivity value for the validation data.
xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.
- Returns
The sensitivity values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.sensitivity() # <- Default: return training metric >>> gbm.sensitivity(train=True, valid=True, xval=True)
-
specificity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the specificity for a set of thresholds (aka True Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the specificity value for the training data.
valid (bool) – If True, return the specificity value for the validation data.
xval (bool) – If True, return the specificity value for each of the cross-validated splits.
- Returns
The specificity values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.specificity() # <- Default: return training metric >>> gbm.specificity(train=True, valid=True, xval=True)
-
thresholds_and_metric_scores
(train=False, valid=False, xval=False)[source]¶ Get all thresholds and metric scores in a table.
If all are False (default), then return the training metric table. If more than one option is set to True, then return a dictionary of tables where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the thresholds and metric scores table for the training data.
valid (bool) – If True, return the thresholds and metric scores table for the validation data.
xval (bool) – If True, return the thresholds and metric scores table for each of the cross-validated splits.
- Returns
The thresholds and metric scores tables for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.thresholds_and_metric_scores()# <- Default: return training metric table >>> gbm.thresholds_and_metric_scores(train=True, valid=True, xval=True)
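Conceptually, such a table has one row per threshold and one column per metric. The sketch below builds a tiny version in plain Python. The column subset, the `>=` comparison, and the helper name `threshold_table` are all assumptions made for illustration and do not describe the exact table H2O returns.

```python
def threshold_table(labels, scores, thresholds):
    """Build one row of metrics per threshold (illustrative column subset)."""
    rows = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        tn = sum(1 for y, s in zip(labels, scores) if s < t and y == 0)
        rows.append({
            "threshold": t,
            "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
            "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
            "accuracy": (tp + tn) / len(labels),
        })
    return rows


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
table = threshold_table(labels, scores, [0.3, 0.5])
```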
-
tnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Negative Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the TNR value for the training data.
valid (bool) – If True, return the TNR value for the validation data.
xval (bool) – If True, return the TNR value for each of the cross-validated splits.
- Returns
The TNR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.tnr() # <- Default: return training metric >>> gbm.tnr(train=True, valid=True, xval=True)
-
tpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Positive Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
thresholds – If None, then the threshold maximizing the metric will be used.
train (bool) – If True, return the TPR value for the training data.
valid (bool) – If True, return the TPR value for the validation data.
xval (bool) – If True, return the TPR value for each of the cross-validated splits.
- Returns
The TPR values for the specified key(s).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.tpr() # <- Default: return training metric >>> gbm.tpr(train=True, valid=True, xval=True)
-
Multinomial Classification
¶
-
class
h2o.model.models.multinomial.
H2OMultinomialModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
confusion_matrix
(data)[source]¶ Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.
- Parameters
data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> confusion_matrix = gbm.confusion_matrix(train)
>>> confusion_matrix
-
hit_ratio_table
(train=False, valid=False, xval=False)[source]¶ Retrieve the Hit Ratios.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train – If train is True, then return the hit ratio value for the training data.
valid – If valid is True, then return the hit ratio value for the validation data.
xval – If xval is True, then return the hit ratio value for the cross-validation data.
- Returns
The hit ratio table for the specified key(s).
- Example
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> hit_ratio_table = gbm.hit_ratio_table() # <- Default: return training metrics
>>> hit_ratio_table
>>> hit_ratio_table1 = gbm.hit_ratio_table(train=True,
...                                        valid=True,
...                                        xval=True)
>>> hit_ratio_table1
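A hit ratio at rank k is commonly understood as top-k accuracy: the fraction of rows whose actual class is among the k classes with the highest predicted probability. The sketch below computes it in plain Python on made-up probabilities. The `hit_ratios` helper and the data are hypothetical, and this is not a description of how H2O builds its table.

```python
def hit_ratios(actual, class_probs, max_k):
    """Return [top-1 hit ratio, ..., top-max_k hit ratio]."""
    ratios = []
    for k in range(1, max_k + 1):
        hits = 0
        for label, probs in zip(actual, class_probs):
            # Classes ordered by predicted probability, highest first.
            top_k = sorted(probs, key=probs.get, reverse=True)[:k]
            if label in top_k:
                hits += 1
        ratios.append(hits / len(actual))
    return ratios


actual = ["4", "6", "8", "4"]
class_probs = [
    {"4": 0.7, "6": 0.2, "8": 0.1},
    {"4": 0.5, "6": 0.4, "8": 0.1},
    {"4": 0.3, "6": 0.5, "8": 0.2},
    {"4": 0.6, "6": 0.3, "8": 0.1},
]
ratios = hit_ratios(actual, class_probs, max_k=3)
```

The ratios are non-decreasing in k and reach 1.0 once k equals the number of classes.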
-
mean_per_class_error
(train=False, valid=False, xval=False)[source]¶ Retrieve the mean per class error across all classes.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the mean_per_class_error value for the training data.
valid (bool) – If True, return the mean_per_class_error value for the validation data.
xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.
- Returns
The mean_per_class_error values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> mean_per_class_error = gbm.mean_per_class_error() # <- Default: return training metric
>>> mean_per_class_error
>>> mean_per_class_error1 = gbm.mean_per_class_error(train=True,
...                                                  valid=True,
...                                                  xval=True)
>>> mean_per_class_error1
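For intuition, the mean per class error averages the error rate of each class, so small classes weigh as much as large ones. The following plain-Python sketch shows the arithmetic on toy labels. The function here is a standalone illustration and is unrelated to the model method of the same name.

```python
def mean_per_class_error(actual, predicted):
    """Average, over classes, of the fraction of that class predicted wrongly."""
    errors = []
    for cls in sorted(set(actual)):
        rows = [i for i, a in enumerate(actual) if a == cls]
        wrong = sum(1 for i in rows if predicted[i] != cls)
        errors.append(wrong / len(rows))
    return sum(errors) / len(errors)


actual = ["4", "4", "4", "4", "6", "6", "8", "8"]
predicted = ["4", "4", "4", "6", "6", "4", "8", "8"]
# Per-class error: "4" -> 1/4, "6" -> 1/2, "8" -> 0
mpce = mean_per_class_error(actual, predicted)
```

Note that the overall error here is 2/8, which happens to match, but the two quantities differ in general when classes are imbalanced.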
-
multinomial_auc_table
(train=False, valid=False, xval=False)[source]¶ Retrieve the multinomial AUC table.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the multinomial_auc_table for the training data.
valid (bool) – If True, return the multinomial_auc_table for the validation data.
xval (bool) – If True, return the multinomial_auc_table for each of the cross-validated splits.
- Returns
The multinomial_auc_table values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_auc_table = gbm.multinomial_auc_table() # <- Default: return training metric
>>> multinomial_auc_table
>>> multinomial_auc_table1 = gbm.multinomial_auc_table(train=True,
...                                                    valid=True,
...                                                    xval=True)
>>> multinomial_auc_table1
-
multinomial_aucpr_table
(train=False, valid=False, xval=False)[source]¶ Retrieve the multinomial PR AUC table.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the multinomial_aucpr_table for the training data.
valid (bool) – If True, return the multinomial_aucpr_table for the validation data.
xval (bool) – If True, return the multinomial_aucpr_table for each of the cross-validated splits.
- Returns
The multinomial_aucpr_table values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_aucpr_table = gbm.multinomial_aucpr_table() # <- Default: return training metric
>>> multinomial_aucpr_table
>>> multinomial_aucpr_table1 = gbm.multinomial_aucpr_table(train=True,
...                                                        valid=True,
...                                                        xval=True)
>>> multinomial_aucpr_table1
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)[source]¶ Plots training set (and validation set if available) scoring history for an H2OMultinomialModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep –
A unit of measurement for the x-axis. One of:
’AUTO’
’duration’
’number_of_trees’
metric –
A unit of measurement for the y-axis. One of:
’AUTO’
’logloss’
’classification_error’
’rmse’
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["cylinders"] = cars["cylinders"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "cylinders" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> distribution = "multinomial" >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution) >>> gbm.train(x=predictors, ... y=response_col, ... training_frame=train, ... validation_frame=valid) >>> gbm.plot(metric="AUTO", timestep="AUTO")
-
Regression
¶
-
class
h2o.model.models.regression.
H2ORegressionModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)[source]¶ Plots training set (and validation set if available) scoring history for an H2ORegressionModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep – A unit of measurement for the x-axis.
metric – A unit of measurement for the y-axis.
save_plot_path – A path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy"
>>> distribution = "gaussian"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(timestep="AUTO", metric="AUTO")
-
residual_analysis_plot
(frame, figsize=(16, 9), save_plot_path=None)¶ Residual Analysis.
Do Residual Analysis and plot the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection (e.g. using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc.). If you notice “striped” lines of residuals, that is just an indication that your response variable was integer-valued instead of real-valued.
- Parameters
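model – H2OModel. When called as a method (as in the example), this is the model itself and is not passed explicitly.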
frame – H2OFrame.
figsize – figure size; passed directly to matplotlib.
save_plot_path – A path to save the plot using the matplotlib function savefig.
- Returns
Object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o >>> from h2o.estimators import H2OGradientBoostingEstimator >>> >>> h2o.init() >>> >>> # Import the wine dataset into H2O: >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv" >>> df = h2o.import_file(f) >>> >>> # Set the response >>> response = "quality" >>> >>> # Split the dataset into a train and test set: >>> train, test = df.split_frame([0.8]) >>> >>> # Train a GBM >>> gbm = H2OGradientBoostingEstimator() >>> gbm.train(y=response, training_frame=train) >>> >>> # Create the residual analysis plot >>> gbm.residual_analysis_plot(test)
-
-
h2o.model.models.regression.
h2o_explained_variance_score
(y_actual, y_predicted, weights=None)[source]¶ Explained variance regression score function.
- Parameters
y_actual – H2OFrame of actual response.
y_predicted – H2OFrame of predicted response.
weights – (Optional) sample weights.
- Returns
the explained variance score.
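The textbook definition of the explained variance score is 1 - Var(y_actual - y_predicted) / Var(y_actual). The sketch below implements that definition on plain Python lists, without weights. It is an illustration of the formula, not the H2OFrame-based function documented above, and whether the library matches it in every detail is an assumption.

```python
def explained_variance(y_actual, y_predicted):
    """1 - Var(residuals) / Var(y_actual), using population variances."""
    n = len(y_actual)
    residuals = [a - p for a, p in zip(y_actual, y_predicted)]
    mean_res = sum(residuals) / n
    mean_act = sum(y_actual) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_act = sum((a - mean_act) ** 2 for a in y_actual) / n
    return 1.0 - var_res / var_act


y = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance(y, [1.0, 2.0, 3.0, 4.0])
shifted = explained_variance(y, [2.0, 3.0, 4.0, 5.0])   # constant bias of +1
constant = explained_variance(y, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
```

Because only the variance of the residuals matters, a constant bias in the predictions does not lower this score, which is the main way it differs from R-squared.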
-
h2o.model.models.regression.
h2o_mean_absolute_error
(y_actual, y_predicted, weights=None)[source]¶ Mean absolute error regression loss.
- Parameters
y_actual – H2OFrame of actual response.
y_predicted – H2OFrame of predicted response.
weights – (Optional) sample weights.
- Returns
mean absolute error loss (best is 0.0).
-
h2o.model.models.regression.
h2o_mean_squared_error
(y_actual, y_predicted, weights=None)[source]¶ Mean squared error regression loss
- Parameters
y_actual – H2OFrame of actual response.
y_predicted – H2OFrame of predicted response.
weights – (Optional) sample weights.
- Returns
mean squared error loss (best is 0.0).
-
h2o.model.models.regression.
h2o_median_absolute_error
(y_actual, y_predicted)[source]¶ Median absolute error regression loss
- Parameters
y_actual – H2OFrame of actual response.
y_predicted – H2OFrame of predicted response.
- Returns
median absolute error loss (best is 0.0).
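The three losses above (mean absolute error, mean squared error, median absolute error) can be compared side by side on a toy example. This sketch works on plain Python lists without weights and only illustrates the standard formulas. The function names are local to the sketch and are not the H2OFrame-based functions themselves.

```python
from statistics import median


def mean_absolute_error(y_actual, y_predicted):
    """Mean of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_actual, y_predicted)) / len(y_actual)


def mean_squared_error(y_actual, y_predicted):
    """Mean of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted)) / len(y_actual)


def median_absolute_error(y_actual, y_predicted):
    """Median of |actual - predicted|; robust to a few large errors."""
    return median(abs(a - p) for a, p in zip(y_actual, y_predicted))


y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_actual, y_predicted)
mse = mean_squared_error(y_actual, y_predicted)
medae = median_absolute_error(y_actual, y_predicted)
```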
-
h2o.model.models.regression.
h2o_r2_score
(y_actual, y_predicted, weights=1.0)[source]¶ R-squared (coefficient of determination) regression score function
- Parameters
y_actual – H2OFrame of actual response.
y_predicted – H2OFrame of predicted response.
weights – (Optional) sample weights.
- Returns
R-squared (best is 1.0, lower is worse).
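As a reminder of what the score measures, R-squared is conventionally defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below applies that definition to plain Python lists without weights. It illustrates the formula only and is not the H2OFrame-based function.

```python
def r2_score(y_actual, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_act = sum(y_actual) / len(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    ss_tot = sum((a - mean_act) ** 2 for a in y_actual)
    return 1.0 - ss_res / ss_tot


y_actual = [3.0, -0.5, 2.0, 7.0]
r2 = r2_score(y_actual, [2.5, 0.0, 2.0, 8.0])
r2_perfect = r2_score(y_actual, y_actual)
r2_mean = r2_score(y_actual, [2.875] * 4)  # always predicts the mean of y_actual
```

A model that always predicts the mean scores 0.0, and a model worse than that scores below zero.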
Anomaly Detection
¶
-
class
h2o.model.models.anomaly_detection.
H2OAnomalyDetectionModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
varsplits
(use_pandas=False)[source]¶ Retrieve per-variable split information for a given Isolation Forest model. Output will include:
- count
The number of times a variable was used to make a split.
- aggregated_split_ratios
The split ratio is defined as abs(#left_observations - #right_observations) / #before_split. Even splits (#left_observations approximately the same as #right_observations) contribute less to the total aggregated split ratio value for the given feature; highly imbalanced splits (e.g. #left_observations >> #right_observations) contribute more.
- aggregated_split_depths
The sum of all depths of a variable used to make a split. (If a variable is used on level N of a tree, then it contributes with N to the total aggregate.)
- Parameters
use_pandas – If True, then the variable splits will be returned as a Pandas data frame.
- Returns
A list or Pandas DataFrame.
- Examples
>>> from h2o.estimators import H2OIsolationForestEstimator >>> h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv") >>> train,test = h2o_df.split_frame(ratios=[0.75]) >>> model = H2OIsolationForestEstimator(sample_rate = 0.1, ... max_depth = 20, ... ntrees = 50) >>> model.train(training_frame=train) >>> model.varsplits()
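The three aggregates described above can be sketched from a flat list of split records. This is a hedged illustration, not H2O's code: the record layout `(feature, depth, left_count, right_count)`, the helper name `aggregate_varsplits`, and the assumption that #before_split equals left plus right are all made up for the example.

```python
def aggregate_varsplits(splits):
    """Aggregate per-feature count, split ratios and split depths.

    splits: iterable of (feature, depth, left_count, right_count) tuples.
    """
    summary = {}
    for feature, depth, left, right in splits:
        entry = summary.setdefault(
            feature,
            {"count": 0, "aggregated_split_ratios": 0.0, "aggregated_split_depths": 0},
        )
        entry["count"] += 1
        # Assumption: #before_split = left + right.
        entry["aggregated_split_ratios"] += abs(left - right) / (left + right)
        entry["aggregated_split_depths"] += depth
    return summary


splits = [
    ("AGE", 1, 50, 50),  # perfectly even split: ratio 0.0
    ("AGE", 2, 45, 5),   # imbalanced split: ratio 0.8
    ("PSA", 1, 99, 1),   # highly imbalanced split: ratio 0.98
]
summary = aggregate_varsplits(splits)
```

In this toy data, PSA is used only once but its single, highly imbalanced split contributes more to the aggregated ratio than both AGE splits combined.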
-
AutoEncoders
¶
-
class
h2o.model.models.autoencoder.
H2OAutoEncoderModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
anomaly
(test_data, per_feature=False)[source]¶ Obtain the reconstruction error for the input test_data.
- Parameters
test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.
per_feature (bool) – Whether to return the squared reconstruction error per feature. Otherwise, return the mean squared error.
- Returns
the reconstruction error.
- Examples
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz") >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz") >>> predictors = list(range(0,784)) >>> resp = 784 >>> train = train[predictors] >>> test = test[predictors] >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh", ... hidden=[2], ... l1=1e-5, ... ignore_const_cols=False, ... epochs=1) >>> ae_model.train(x=predictors,training_frame=train) >>> test_rec_error = ae_model.anomaly(test) >>> test_rec_error >>> test_rec_error_features = ae_model.anomaly(test, per_feature=True) >>> test_rec_error_features
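The difference between the two per_feature modes can be shown for a single row. The sketch below takes an input row and its reconstruction as plain lists. `reconstruction_error` is a hypothetical helper, and any normalization or standardization the real model applies before computing the error is ignored here.

```python
def reconstruction_error(row, reconstructed, per_feature=False):
    """Squared error per feature, or its mean over features."""
    squared = [(x - r) ** 2 for x, r in zip(row, reconstructed)]
    if per_feature:
        return squared
    return sum(squared) / len(squared)


row = [1.0, 0.0, 2.0]
reconstructed = [0.5, 0.0, 1.0]
per_feature_error = reconstruction_error(row, reconstructed, per_feature=True)
mean_error = reconstruction_error(row, reconstructed)
```

Rows with a large mean error are the ones the autoencoder reconstructs poorly, which is why this value is used as an anomaly score.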
-
Clustering Methods
¶
-
class
h2o.model.models.clustering.
H2OClusteringModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
betweenss
(train=False, valid=False, xval=False)[source]¶ Get the between cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the between cluster sum of squares value for the training data.
valid (bool) – If True, return the between cluster sum of squares value for the validation data.
xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.
- Returns
The between cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> betweenss = km.betweenss() # <- Default: return training metrics >>> betweenss >>> betweenss3 = km.betweenss(train=False, ... valid=False, ... xval=True) >>> betweenss3
-
centers
()[source]¶ The centers for the KMeans model.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers()
-
centers_std
()[source]¶ The standardized centers for the KMeans model.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers_std()
-
centroid_stats
(train=False, valid=False)[source]¶ Get the centroid statistics for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the centroid statistic for the training data.
valid (bool) – If True, return the centroid statistic for the validation data.
- Returns
The centroid statistics for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> centroid_stats = km.centroid_stats() # <- Default: return training metrics >>> centroid_stats >>> centroid_stats1 = km.centroid_stats(train=True, ... valid=False) >>> centroid_stats1
-
num_iterations
()[source]¶ Get the number of iterations it took to converge or reach max iterations.
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.num_iterations()
-
size
(train=False, valid=False)[source]¶ Get the sizes of each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the cluster sizes for the training data.
valid (bool) – If True, return the cluster sizes for the validation data.
- Returns
The cluster sizes for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator >>> >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> size = km.size() # <- Default: return training metrics >>> size >>> size1 = km.size(train=False, ... valid=False) >>> size1
-
tot_withinss
(train=False, valid=False, xval=False)[source]¶ Get the total within cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the total within cluster sum of squares value for the training data.
valid (bool) – If True, return the total within cluster sum of squares value for the validation data.
xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.
- Returns
The total within cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> tot_withinss = km.tot_withinss() # <- Default: return training metrics
>>> tot_withinss
>>> tot_withinss2 = km.tot_withinss(train=True,
...                                 valid=False,
...                                 xval=True)
>>> tot_withinss2
-
totss
(train=False, valid=False, xval=False)[source]¶ Get the total sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the total sum of squares value for the training data.
valid (bool) – If True, return the total sum of squares value for the validation data.
xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.
- Returns
The total sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> totss = km.totss() # <- Default: return training metrics
>>> totss
-
withinss
(train=False, valid=False)[source]¶ Get the within cluster sum of squares for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”. This metric is not available in cross-validation metrics.
- Parameters
train (bool) – If True, return the within cluster sum of squares values for the training data.
valid (bool) – If True, return the within cluster sum of squares values for the validation data.
- Returns
The within cluster sum of squares values for the specified key(s).
- Examples
>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>>
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> withinss = km.withinss() # <- Default: return training metrics
>>> withinss
>>> withinss2 = km.withinss(train=True,
...                         valid=True)
>>> withinss2
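To make the relationship between size, withinss, tot_withinss and totss concrete, the following pure-Python sketch computes them from the textbook definitions for a tiny hand-made data set. The function names and the data are hypothetical; H2O may standardize columns or differ in other details, so its reported values need not match this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans_metrics(points, labels, k):
    """Textbook cluster statistics; assumes every cluster is non-empty."""
    clusters = [[p for p, label in zip(points, labels) if label == c]
                for c in range(k)]
    centroids = [mean_point(cluster) for cluster in clusters]
    withinss = [sum(squared_distance(p, centroid) for p in cluster)
                for cluster, centroid in zip(clusters, centroids)]
    grand_mean = mean_point(points)
    return {
        "size": [len(cluster) for cluster in clusters],
        "withinss": withinss,
        "tot_withinss": sum(withinss),
        "totss": sum(squared_distance(p, grand_mean) for p in points),
    }


# Two well-separated clusters of two points each (illustrative data).
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
metrics = kmeans_metrics(points, labels, k=2)
print(metrics)
```

In this sketch the difference between totss and tot_withinss is the between-cluster sum of squares, which is large here because the two clusters are far apart.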
-
CoxPH
¶
Dimensionality Reduction
¶
-
class
h2o.model.models.dim_reduction.
H2ODimReductionModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
Dimension reduction model, such as PCA or GLRM.
-
num_iterations
()[source]¶ Get the number of iterations that it took to converge or reach max iterations.
-
proj_archetypes
(test_data, reverse_transform=False)[source]¶ Convert archetypes of the model into original feature space.
- Parameters
test_data (H2OFrame) – The dataset upon which the model was trained.
reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the projected archetypes.
- Returns
model archetypes projected back into the original training data’s feature space.
-
reconstruct
(test_data, reverse_transform=False)[source]¶ Reconstruct the training data from the model and impute all missing values.
- Parameters
test_data (H2OFrame) – The dataset upon which the model was trained.
reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the reconstructed frame.
- Returns
the approximate reconstruction of the training data.
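As a rough illustration of what an approximate reconstruction means for a low-rank model such as GLRM, the sketch below multiplies a hypothetical row-factor matrix by a hypothetical archetype matrix. The names `matmul`, `reconstruct_sketch`, `x_factor` and `archetypes` are assumptions made for this example, and the sketch ignores the transformations and loss functions a real model applies.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


def reconstruct_sketch(x_factor, archetypes):
    """Approximate the data as (row factors) x (archetypes).

    Every cell of the result has a value, so cells that were missing
    in the original data are imputed as a side effect.
    """
    return matmul(x_factor, archetypes)


# Original data with one missing cell (None); illustrative values.
data = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [5.0, None, 9.0]]
x_factor = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per observation
archetypes = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # one row per archetype

approx = reconstruct_sketch(x_factor, archetypes)
print(approx[2][1])  # value imputed for the missing cell
```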
-
screeplot
(type='barplot', server=False, save_plot_path=None)[source]¶ Produce the scree plot.
Library matplotlib is required for this function.
- Parameters
type (str) – either "barplot" or "lines".
server (bool) – if True, set server settings to matplotlib and do not show the graph.
save_plot_path – a path where the plot is saved using the matplotlib function savefig.
- Returns
Object that contains the resulting scree plot (can be accessed like result.figure()).
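A scree plot conventionally shows the variance captured by each component. The sketch below computes those values, together with the proportion and cumulative proportion of variance, from a list of component standard deviations. The function `scree_values` and the input numbers are hypothetical, and whether H2O plots exactly these quantities is an assumption.

```python
def scree_values(std_devs):
    """Return variances, proportions and cumulative proportions per component."""
    variances = [s ** 2 for s in std_devs]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for proportion in proportions:
        running += proportion
        cumulative.append(running)
    return variances, proportions, cumulative


# Illustrative standard deviations of three components.
variances, proportions, cumulative = scree_values([2.0, 1.0, 1.0])
print(variances)
print(proportions)
print(cumulative)
```

The variances are what a bar or line scree plot would draw on the y-axis, one value per component.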
-
Ordinal
¶
-
class
h2o.model.models.ordinal.
H2OOrdinalModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
confusion_matrix
(data)[source]¶ Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.
- Parameters
data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.
-
hit_ratio_table
(train=False, valid=False, xval=False)[source]¶ Retrieve the Hit Ratios.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train – If train is True, then return the hit ratio value for the training data.
valid – If valid is True, then return the hit ratio value for the validation data.
xval – If xval is True, then return the hit ratio value for the cross validation data.
- Returns
The hit ratio for this ordinal model.
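A hit ratio at k is usually understood as the fraction of rows whose actual class appears among the k most probable predicted classes. The sketch below computes it in plain Python. The function `hit_ratios` and the sample data are hypothetical, and the layout of the table H2O returns is not shown in the text above.

```python
def hit_ratios(actual, ranked_predictions, max_k):
    """Fraction of rows whose actual label is within the top-k predictions.

    `ranked_predictions[i]` lists the classes for row i, most probable first.
    """
    n = len(actual)
    ratios = []
    for k in range(1, max_k + 1):
        hits = sum(1 for label, ranked in zip(actual, ranked_predictions)
                   if label in ranked[:k])
        ratios.append(hits / n)
    return ratios


actual = ["low", "mid", "high", "mid"]
ranked = [
    ["low", "mid", "high"],
    ["high", "mid", "low"],
    ["mid", "high", "low"],
    ["mid", "low", "high"],
]
ratios = hit_ratios(actual, ranked, max_k=3)
print(ratios)
```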
-
mean_per_class_error
(train=False, valid=False, xval=False)[source]¶ Retrieve the mean per class error across all classes.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If True, return the mean_per_class_error value for the training data.
valid (bool) – If True, return the mean_per_class_error value for the validation data.
xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.
- Returns
The mean_per_class_error values for the specified key(s).
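The mean per class error can be read off a confusion matrix: compute the error rate of each actual class, then average those rates. The sketch below uses plain Python with hypothetical helper names and made-up labels; it illustrates the usual definition and is not H2O's implementation.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def mean_per_class_error(matrix):
    """Average of per-class error rates, skipping classes with no rows."""
    errors = []
    for i, row in enumerate(matrix):
        total = sum(row)
        if total == 0:
            continue
        errors.append(1 - row[i] / total)
    return sum(errors) / len(errors)


classes = ["low", "mid", "high"]
actual = ["low", "low", "mid", "mid", "mid", "mid", "high", "high"]
predicted = ["low", "mid", "mid", "mid", "mid", "low", "high", "high"]

matrix = confusion_matrix(actual, predicted, classes)
error = mean_per_class_error(matrix)
print(matrix, error)
```

Because each class contributes equally, this metric does not let a large class hide poor performance on a small one.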
-
plot
(timestep='AUTO', metric='AUTO', save_plot_path=None, **kwargs)[source]¶ Plots training set (and validation set if available) scoring history for an H2OOrdinalModel. The timestep and metric arguments are restricted to what is available in its scoring history.
- Parameters
timestep – A unit of measurement for the x-axis.
metric – A unit of measurement for the y-axis.
save_plot_path – a path to save the plot via using matplotlib function savefig.
- Returns
Object that contains the resulting scoring history plot (can be accessed using result.figure()).
-
Uplift
¶
-
class
h2o.model.models.uplift.
H2OBinomialUpliftModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
-
atc
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect on the Control
If all are False (default), then return the training ATC metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATC value for the training data.
valid (bool) – If True, return the ATC value for the validation data.
- Returns
the ATC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.atc() # <- Default: return training metric value
>>> uplift_model.atc(train=True)
-
ate
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect
If all are False (default), then return the training ATE metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATE value for the training data.
valid (bool) – If True, return the ATE value for the validation data.
- Returns
the ATE value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.ate() # <- Default: return training metric value
>>> uplift_model.ate(train=True)
-
att
(train=False, valid=False)[source]¶ Retrieve Average Treatment Effect on the Treated
If all are False (default), then return the training ATT metric. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the ATT value for the training data.
valid (bool) – If True, return the ATT value for the validation data.
- Returns
the ATT value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.att() # <- Default: return training metric value
>>> uplift_model.att(train=True)
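The three treatment-effect metrics (ATE, ATT, ATC) differ only in the group of rows they average over. The sketch below assumes they are estimated as means of a per-row predicted uplift, which the text above does not confirm for H2O. The function name and the data are hypothetical.

```python
def average_treatment_effects(uplift, treated):
    """Average predicted uplift over all rows, treated rows and control rows.

    Assumes both groups are non-empty. `treated[i]` is 1 for rows in the
    treatment group and 0 for rows in the control group.
    """
    def mean(values):
        return sum(values) / len(values)

    on_treated = [u for u, t in zip(uplift, treated) if t == 1]
    on_control = [u for u, t in zip(uplift, treated) if t == 0]
    return {
        "ate": mean(uplift),      # all observations
        "att": mean(on_treated),  # treated observations only
        "atc": mean(on_control),  # control observations only
    }


# Illustrative per-row uplift predictions and treatment flags.
uplift = [0.5, 0.25, 0.0, -0.25]
treated = [1, 1, 0, 0]
effects = average_treatment_effects(uplift, treated)
print(effects)
```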
-
auuc
(metric=None, train=False, valid=False)[source]¶ Retrieve area under uplift curve (AUUC) value for the specified metrics in model params.
If all are False (default), then return the training metric AUUC value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the AUUC value for the training data.
valid (bool) – If True, return the AUUC value for the validation data.
metric – AUUC metric type. One of:
“qini”
“lift”
“gain”
None (default; metric set in parameters)
- Returns
AUUC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc() # <- Default: return training metric value
>>> uplift_model.auuc(train=True, valid=True)
-
auuc_normalized
(metric=None, train=False, valid=False)[source]¶ Retrieve normalized area under uplift curve (AUUC) value for the specified metrics in model params.
If all are False (default), then return the training metric normalized AUUC value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the normalized AUUC value for the training data.
valid (bool) – If True, return the normalized AUUC value for the validation data.
metric – AUUC metric type. One of:
“qini”
“lift”
“gain”
None (default; metric set in parameters)
- Returns
Normalized AUUC value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc_normalized() # <- Default: return training metric value
>>> uplift_model.auuc_normalized(train=True, valid=True)
-
auuc_table
(train=False, valid=False)[source]¶ Retrieve all types of AUUC in a table.
If all are False (default), then return the training metric AUUC table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the AUUC table for the training data.
valid (bool) – If True, return the AUUC table for the validation data.
- Returns
the AUUC table for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.auuc_table() # <- Default: return training metric value
>>> uplift_model.auuc_table(train=True)
-
n
(train=False, valid=False)[source]¶ Retrieve numbers of observations.
If all are False (default), then return the training metric number of observations. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the number of observations for the training data.
valid (bool) – If True, return the number of observations for the validation data.
- Returns
a list of the numbers of observations for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.n() # <- Default: return training metric value
>>> uplift_model.n(train=True)
-
qini
(train=False, valid=False)[source]¶ Retrieve Qini value (area between Qini cumulative uplift curve and random curve)
If all are False (default), then return the training metric Qini value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the Qini value for the training data.
valid (bool) – If True, return the Qini value for the validation data.
- Returns
the Qini value for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.qini() # <- Default: return training metric value
>>> uplift_model.qini(train=True)
-
thresholds
(train=False, valid=False)[source]¶ Retrieve prediction thresholds for the specified metrics.
If all are False (default), then return the training metric prediction thresholds. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the prediction thresholds for the training data.
valid (bool) – If True, return the prediction thresholds for the validation data.
- Returns
a list of prediction thresholds for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.thresholds() # <- Default: return training metric value
>>> uplift_model.thresholds(train=True)
-
thresholds_and_metric_scores
(train=False, valid=False)[source]¶ Retrieve thresholds and metric scores table for the specified metrics.
If all are False (default), then return the training metric thresholds and metric scores table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the thresholds and metric scores table for the training data.
valid (bool) – If True, return the thresholds and metric scores table for the validation data.
- Returns
the thresholds and metric scores table for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.thresholds_and_metric_scores() # <- Default: return training metric value
>>> uplift_model.thresholds_and_metric_scores(train=True)
-
uplift
(metric='qini', train=False, valid=False)[source]¶ Retrieve uplift values for the specified metrics.
If all are False (default), then return the training metric uplift values. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the uplift values for the training data.
valid (bool) – If True, return the uplift values for the validation data.
metric – Uplift metric type. One of:
“qini” (default)
“lift”
“gain”
- Returns
a list of uplift values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.uplift() # <- Default: return training metric value
>>> uplift_model.uplift(train=True, metric="gain")
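The three metric types can be illustrated with commonly used cumulative uplift formulas, evaluated after ranking rows by predicted uplift. The sketch below assumes qini = Y_T - Y_C * T / C, lift = Y_T / T - Y_C / C and gain = lift * (T + C), where T and C count treated and control rows seen so far and Y_T and Y_C count their responders. These formulas, the per-row evaluation, the zero placeholder used while a group is still empty, and all names are assumptions of this sketch rather than H2O's implementation, which evaluates at prediction thresholds.

```python
def cumulative_uplift(scores, treated, responded, metric="qini"):
    """Cumulative uplift after each row, rows ranked by descending score."""
    if metric not in ("qini", "lift", "gain"):
        raise ValueError("metric must be 'qini', 'lift' or 'gain'")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    t = c = y_t = y_c = 0
    values = []
    for i in order:
        if treated[i]:
            t += 1
            y_t += responded[i]
        else:
            c += 1
            y_c += responded[i]
        if t == 0 or c == 0:
            # Undefined until both groups are represented; 0.0 is a placeholder.
            values.append(0.0)
            continue
        lift = y_t / t - y_c / c
        if metric == "qini":
            values.append(y_t - y_c * t / c)
        elif metric == "lift":
            values.append(lift)
        else:
            values.append(lift * (t + c))
    return values


# Illustrative data: predicted uplift, treatment flag, observed response.
scores = [0.9, 0.8, 0.7, 0.6]
treated = [1, 0, 1, 0]
responded = [1, 0, 1, 1]
qini_values = cumulative_uplift(scores, treated, responded, "qini")
lift_values = cumulative_uplift(scores, treated, responded, "lift")
gain_values = cumulative_uplift(scores, treated, responded, "gain")
print(qini_values, lift_values, gain_values)
```

An AUUC value summarizes such a curve as an area, so the choice of metric type changes both the curve and the resulting area.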
-
uplift_normalized
(metric='qini', train=False, valid=False)[source]¶ Retrieve normalized uplift values for the specified metrics.
If all are False (default), then return the training metric normalized uplift values. If more than one option is set to True, then return a dictionary of metrics where the keys are “train” and “valid”.
- Parameters
train (bool) – If True, return the normalized uplift values for the training data.
valid (bool) – If True, return the normalized uplift values for the validation data.
metric – Uplift metric type. One of:
“qini” (default)
“lift”
“gain”
- Returns
a list of normalized uplift values for the specified key(s).
- Examples
>>> from h2o.estimators import H2OUpliftRandomForestEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/uplift/criteo_uplift_13k.csv")
>>> treatment_column = "treatment"
>>> response_column = "conversion"
>>> train[treatment_column] = train[treatment_column].asfactor()
>>> train[response_column] = train[response_column].asfactor()
>>> predictors = ["f1", "f2", "f3", "f4", "f5", "f6"]
>>>
>>> uplift_model = H2OUpliftRandomForestEstimator(ntrees=10,
...                                               max_depth=5,
...                                               treatment_column=treatment_column,
...                                               uplift_metric="kl",
...                                               distribution="bernoulli",
...                                               min_rows=10,
...                                               auuc_type="gain")
>>> uplift_model.train(y=response_column, x=predictors, training_frame=train)
>>> uplift_model.uplift_normalized() # <- Default: return training metric value
>>> uplift_model.uplift_normalized(train=True, metric="gain")
-
Word Embedding
¶
-
class
h2o.model.models.word_embedding.
H2OWordEmbeddingModel
[source]¶ Bases:
h2o.model.model_base.ModelBase
Word embedding model.
-
find_synonyms
(word, count=20)[source]¶ Find synonyms using a word2vec model.
- Parameters
word (str) – A single word to find synonyms for.
count (int) – The first “count” synonyms will be returned.
- Returns
the synonyms found for the given word, at most count of them.
- Examples
>>> from h2o.estimators import H2OWord2vecEstimator
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"),
...                              col_names = ["category", "jobtitle"],
...                              col_types = ["string", "string"],
...                              header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs = 10)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", count = 5)
>>> print(synonyms)
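Synonym lookup in a word embedding is commonly done by ranking the other words by cosine similarity to the query word. The sketch below shows that idea in plain Python. It assumes cosine similarity as the score, which the text above does not confirm for H2O; `find_synonyms_sketch` and the tiny two-dimensional embedding table are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_synonyms_sketch(embeddings, word, count=20):
    """Return up to `count` (word, score) pairs, most similar first."""
    target = embeddings[word]
    scored = [(other, cosine(target, vector))
              for other, vector in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


# Made-up 2-dimensional embeddings for illustration.
embeddings = {
    "teacher": [1.0, 0.0],
    "tutor": [0.9, 0.1],
    "instructor": [0.8, 0.3],
    "driver": [0.0, 1.0],
}
top = find_synonyms_sketch(embeddings, "teacher", count=2)
print(top)
```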
-
to_frame
()[source]¶ Converts a given word2vec model into H2OFrame.
- Returns
a frame representing learned word embeddings.
- Examples
>>> from h2o.estimators import H2OWord2vecEstimator
>>> words = h2o.create_frame(rows=1000,cols=1,string_fraction=1.0,missing_fraction=0.0)
>>> embeddings = h2o.create_frame(rows=1000,cols=100,real_fraction=1.0,missing_fraction=0.0)
>>> word_embeddings = words.cbind(embeddings)
>>> w2v_model = H2OWord2vecEstimator(pre_trained=word_embeddings)
>>> w2v_model.train(training_frame=word_embeddings)
>>> w2v_frame = w2v_model.to_frame()
>>> word_embeddings.names = w2v_frame.names
>>> word_embeddings.as_data_frame().equals(w2v_frame.as_data_frame())
-
transform
(words, aggregate_method)[source]¶ Transform words (or sequences of words) to vectors using a word2vec model.
- Parameters
words (H2OFrame) – An H2OFrame made of a single column containing source words.
aggregate_method (str) – Specifies how to aggregate sequences of words. If your method is 'NONE', no aggregation is performed and each input word is mapped to a single word-vector. If your method is 'AVERAGE', input is treated as sequences of words delimited by NA. Each word of a sequence is internally mapped to a vector, and vectors belonging to the same sentence are averaged and returned in the result.
- Returns
a frame of word vectors: one per input word for 'NONE', or one per sequence for 'AVERAGE'.
- Examples
>>> from h2o.estimators import H2OWord2vecEstimator
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"),
...                              col_names = ["category", "jobtitle"],
...                              col_types = ["string", "string"],
...                              header = 1)
>>> STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what",
...               "there","all","we","one","the","a","an","of","or","in","for","by","on",
...               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
...               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so"]
>>> words = job_titles.tokenize(" ")
>>> words = words[(words.isna()) | (~ words.isin(STOP_WORDS)),:]
>>> w2v_model = H2OWord2vecEstimator(epochs = 10)
>>> w2v_model.train(training_frame=words)
>>> job_title_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
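The two aggregation modes can be mimicked on plain Python lists, with None standing in for the NA delimiter between sequences. The helper `transform_sketch`, its extra `embeddings` argument and the sample vectors are hypothetical. Skipping words that have no embedding, and returning None for a sequence with no known words, are assumptions of the sketch and not documented behavior.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def transform_sketch(words, embeddings, aggregate_method="NONE"):
    """Map words to vectors, optionally averaging NA-delimited sequences."""
    if aggregate_method == "NONE":
        # One output per input entry; delimiters and unknown words give None.
        return [embeddings.get(word) for word in words]
    if aggregate_method != "AVERAGE":
        raise ValueError("aggregate_method must be 'NONE' or 'AVERAGE'")
    result, current, started = [], [], False
    # The trailing None flushes a final sequence that lacks a delimiter.
    for word in list(words) + [None]:
        if word is None:
            if started:
                result.append(average(current) if current else None)
            current, started = [], False
        else:
            started = True
            if word in embeddings:
                current.append(embeddings[word])
    return result


embeddings = {
    "senior": [1.0, 0.0],
    "teacher": [0.0, 1.0],
    "driver": [2.0, 2.0],
}
words = ["senior", "teacher", None, "driver", None]
per_word = transform_sketch(words, embeddings, "NONE")
per_sequence = transform_sketch(words, embeddings, "AVERAGE")
print(per_word)
print(per_sequence)
```

With 'AVERAGE', the two words of the first title collapse into a single vector, which is why tokenized titles in the example above end up with one row per job title.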
-