Model Categories

class h2o.model.ModelBase[source]

Bases: h2o.model.model_base.ModelBase

Base class for all models.

property actual_params

Dictionary of actual parameters of the model.

aic(train=False, valid=False, xval=False)[source]

Get the AIC (Akaike Information Criterion).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AIC value for the training data.

  • valid (bool) – If valid is True, then return the AIC value for the validation data.

  • xval (bool) – If xval is True, then return the AIC value for the cross validation data.

Returns

The AIC.
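The train, valid and xval flags follow the same convention in every metric accessor of this class. A minimal sketch of that selection logic is shown below; `select_metric` and the `stored` dictionary are hypothetical stand-ins for illustration and are not part of the h2o API.

```python
def select_metric(metrics, train=False, valid=False, xval=False):
    """Mimic the train/valid/xval convention on a plain dict (hypothetical helper)."""
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [name for name, flag in flags if flag]
    if not requested:
        # No flag set: fall back to the training metric.
        return metrics["train"]
    if len(requested) == 1:
        # Exactly one flag set: return that single value.
        return metrics[requested[0]]
    # Several flags set: return a dictionary keyed by data set name.
    return {name: metrics[name] for name in requested}


# Made-up metric values, used only to exercise the three cases.
stored = {"train": 101.5, "valid": 110.2, "xval": 108.9}
print(select_metric(stored))
print(select_metric(stored, valid=True))
print(select_metric(stored, train=True, xval=True))
```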

auc(train=False, valid=False, xval=False)[source]

Get the AUC (Area Under Curve).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AUC value for the training data.

  • valid (bool) – If valid is True, then return the AUC value for the validation data.

  • xval (bool) – If xval is True, then return the AUC value for the cross validation data.

Returns

The AUC.
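As a reminder of what the returned number measures, AUC can be read as the probability that a randomly chosen positive row is scored higher than a randomly chosen negative row. The sketch below computes it exactly on a tiny, made-up sample; `auc_from_scores` is a hypothetical helper, and the value H2O reports may differ slightly because it is not necessarily computed this way.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(labels, scores)
print(auc)
```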

aucpr(train=False, valid=False, xval=False)[source]

Get the AUCPR (Area Under the Precision-Recall Curve).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the aucpr value for the training data.

  • valid (bool) – If valid is True, then return the aucpr value for the validation data.

  • xval (bool) – If xval is True, then return the aucpr value for the cross validation data.

Returns

The aucpr.

biases(vector_id=0)[source]

Return the frame for the respective bias vector.

Parameters

vector_id (int) – An integer, ranging from 0 to the number of layers, that specifies the bias vector to return.

Returns

an H2OFrame which represents the bias vector identified by vector_id

catoffsets()[source]

Categorical offsets for one-hot encoding.

coef()[source]

Return the coefficients which can be applied to the non-standardized data.

Note: standardize = True by default. If it is set to False, then coef() returns the coefficients that were fit directly.

coef_norm()[source]

Return coefficients fitted on the standardized data (requires standardize = True, which is on by default).

These coefficients can be used to evaluate variable importance.
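The relationship between the two sets of coefficients can be sketched as follows, assuming each predictor was standardized as (x - mean) / sd. The helper `destandardize` and all numbers are hypothetical; the sketch only shows that both forms give the same prediction for a linear predictor.

```python
def destandardize(coef_norm, intercept_norm, means, sds):
    """Convert coefficients fitted on standardized data to the original scale."""
    coef = {name: b / sds[name] for name, b in coef_norm.items()}
    shift = sum(b * means[name] / sds[name] for name, b in coef_norm.items())
    return coef, intercept_norm - shift


# Made-up standardized coefficients and column statistics.
coef_norm = {"age": 2.0, "income": -1.5}
means = {"age": 40.0, "income": 50.0}
sds = {"age": 10.0, "income": 20.0}
coef, intercept = destandardize(coef_norm, 0.5, means, sds)

row = {"age": 55.0, "income": 30.0}
# Linear predictor using standardized inputs and standardized coefficients.
pred_std = 0.5 + sum(coef_norm[k] * (row[k] - means[k]) / sds[k] for k in row)
# Linear predictor using raw inputs and de-standardized coefficients.
pred_raw = intercept + sum(coef[k] * row[k] for k in row)
print(pred_std, pred_raw)
```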

cross_validation_fold_assignment()[source]

Obtain the cross-validation fold assignment for all rows in the training data.

Returns

H2OFrame

cross_validation_holdout_predictions()[source]

Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.

This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.

Returns

H2OFrame

cross_validation_metrics_summary()[source]

Retrieve Cross-Validation Metrics Summary.

Returns

The cross-validation metrics summary as an H2OTwoDimTable

cross_validation_models()[source]

Obtain a list of cross-validation models.

Returns

list of H2OModel objects.

cross_validation_predictions()[source]

Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.

Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.

Returns

list of H2OFrame objects.
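The relationship between the per-fold predictions and the combined holdout predictions can be illustrated with plain lists. In this sketch `expand_fold` is a hypothetical helper and the fold assignment and prediction values are made up; each per-fold list is zero-filled to the full row count, so summing them row by row yields one holdout prediction per training row.

```python
def expand_fold(fold_assignment, fold, holdout_values):
    """Place a fold's holdout predictions at its rows and fill the rest with 0.0."""
    values = iter(holdout_values)
    return [next(values) if f == fold else 0.0 for f in fold_assignment]


# Fold id of each training row (3 folds, 6 rows).
fold_assignment = [0, 1, 2, 0, 1, 2]

# One zero-filled prediction list per cross-validation model.
per_fold = [
    expand_fold(fold_assignment, 0, [0.9, 0.2]),
    expand_fold(fold_assignment, 1, [0.7, 0.4]),
    expand_fold(fold_assignment, 2, [0.1, 0.6]),
]

# Row-wise sum: every row receives exactly one non-zero contribution.
holdout = [sum(column) for column in zip(*per_fold)]
print(holdout)
```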

deepfeatures(test_data, layer)[source]

Return hidden layer details.

Parameters
  • test_data – Data to create a feature space on

  • layer – Zero-based index of the hidden layer

property default_params

Dictionary of the default parameters of the model.

detach()[source]

Detach the Python object from the backend, usually by clearing its key

download_model(path='', filename=None)[source]

Download an H2O Model object to disk.

Parameters
  • path – a path to the directory where the model should be saved.

  • filename – a filename for the saved model

Returns

the path of the downloaded model

download_mojo(path='.', get_genmodel_jar=False, genmodel_name='')[source]

Download the model in MOJO format.

Parameters
  • path – the path where MOJO file should be saved.

  • get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.

  • genmodel_name – Custom name of genmodel jar

Returns

name of the MOJO file written.

download_pojo(path='', get_genmodel_jar=False, genmodel_name='')[source]

Download the POJO for this model to the directory specified by path.

If path is an empty string, then dump the output to screen.

Parameters
  • path – An absolute path to the directory where POJO should be saved.

  • get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.

  • genmodel_name – Custom name of genmodel jar

Returns

name of the POJO file written.

property end_time

Timestamp (milliseconds since 1970) when model training ended.

explain(frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r')

Generate model explanations on frame data set.

The H2O Explainability Interface is a convenient wrapper around a number of explainability methods and visualizations in H2O. The function can be applied to a single model or a group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters
  • models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)

  • frame – H2OFrame

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • include_explanations – if specified, return only the specified model explanations (Mutually exclusive with exclude_explanations)

  • exclude_explanations – exclude specified model explanations

  • plot_overrides – overrides for individual model explanations

  • figsize – figure size; passed directly to matplotlib

  • render – if True, render the model explanations; otherwise model explanations are just returned

Returns

H2OExplanation containing the model explanations including headers and descriptions

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
explain_row(frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True)

Generate model explanations on frame data set for a given instance.

Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters
  • models – H2OAutoML object, supervised H2O model, or list of supervised H2O models

  • frame – H2OFrame

  • row_index – row index of the instance to inspect

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • include_explanations – if specified, return only the specified model explanations (Mutually exclusive with exclude_explanations)

  • exclude_explanations – exclude specified model explanations

  • plot_overrides – overrides for individual model explanations

  • qualitative_colormap – a colormap name

  • figsize – figure size; passed directly to matplotlib

  • render – if True, render the model explanations; otherwise model explanations are just returned

Returns

H2OExplanation containing the model explanations including headers and descriptions

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
feature_frequencies(test_data)[source]

Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.

Parameters

test_data (H2OFrame) – Data on which to calculate feature frequencies.

Returns

A new H2OFrame made of feature frequencies.

feature_interaction(max_interaction_depth=100, max_tree_depth=100, max_deepening=-1, path=None)[source]

Feature interactions and importance, leaf statistics and split value histograms in a tabular form. Available for XGBoost and GBM.

Metrics:
  • Gain – Total gain of each feature or feature interaction.

  • FScore – Amount of possible splits taken on a feature or feature interaction.

  • wFScore – Amount of possible splits taken on a feature or feature interaction weighed by the probability of the splits to take place.

  • Average wFScore – wFScore divided by FScore.

  • Average Gain – Gain divided by FScore.

  • Expected Gain – Total gain of each feature or feature interaction weighed by the probability to gather the gain.

  • Average Tree Index

  • Average Tree Depth

Parameters
  • max_interaction_depth – Upper bound for extracted feature interactions depth. Defaults to 100.

  • max_tree_depth – Upper bound for tree depth. Defaults to 100.

  • max_deepening – Upper bound for interaction start deepening (zero deepening => interactions starting at root only). Defaults to -1.

  • path – (Optional) Path where to save the output in .xlsx format (e.g. /mypath/file.xlsx). Please note that Pandas and XlsxWriter need to be installed for using this option. Defaults to None.

Examples

>>> import h2o
>>> from h2o.estimators import H2OXGBoostEstimator
>>> h2o.init()
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(seed=1234)
>>> boston_xgb.train(y=response, x=predictors, training_frame=train)
>>> feature_interactions = boston_xgb.feature_interaction()
property full_parameters

Dictionary of the full specification of all parameters.

get_xval_models(key=None)[source]

Return a Model object.

Parameters

key – If None, return all cross-validated models; otherwise return the model that key points to.

Returns

A model or list of models.

gini(train=False, valid=False, xval=False)[source]

Get the Gini coefficient.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Gini Coefficient value for the training data.

  • valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.

  • xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.

Returns

The Gini Coefficient for this binomial model.
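For a binary classifier the Gini coefficient is commonly derived from the AUC as 2 * AUC - 1. The sketch below assumes that definition; `gini_from_auc` is a hypothetical helper, not an h2o function.

```python
def gini_from_auc(auc):
    """Gini coefficient under the common definition 2 * AUC - 1 (assumed here)."""
    return 2.0 * auc - 1.0


print(gini_from_auc(0.75))  # a model that ranks 75% of pairs correctly
print(gini_from_auc(0.5))   # a model no better than random ranking
```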

h(frame, variables)[source]

Calculate Friedman and Popescu’s H statistic to test for the presence of an interaction between the specified variables in H2O GBM and XGBoost models. H varies from 0 to 1: it is 0 if the model exhibits no interaction between the specified variables and correspondingly larger for a stronger interaction effect between them. NaN is returned if the computation is spoiled by weak main effects and rounding errors.

See Jerome H. Friedman and Bogdan E. Popescu, 2008, “Predictive learning via rule ensembles”, Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.

Parameters
  • frame – the frame that current model has been fitted to

  • variables – variables of the interest

Returns

H statistic of the variables

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate_train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
>>> prostate_train["CAPSULE"] = prostate_train["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=100, learn_rate=0.1,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="bernoulli")
>>> gbm_h2o.train(x=list(range(1,prostate_train.ncol)),y="CAPSULE", training_frame=prostate_train)
>>> h = gbm_h2o.h(prostate_train, ['DPROS','DCAPS'])
property have_mojo

True, if export to MOJO is possible

property have_pojo

True, if export to POJO is possible

ice_plot(frame, column, target=None, max_levels=30, figsize=(16, 9), colormap='plasma')

Plot Individual Conditional Expectations (ICE) for each decile.

An individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. An ICE plot is similar to a partial dependence plot (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. The plot produced here shows the effect for each decile. In contrast to a partial dependence plot, an ICE plot can provide more insight, especially when there is stronger feature interaction.

Parameters
  • model – H2OModel

  • frame – H2OFrame

  • column – string containing column name

  • target – (only for multinomial classification) for what target should the plot be done

  • max_levels – maximum number of factor levels to show

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap name

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the individual conditional expectations plot
>>> gbm.ice_plot(test, column="alcohol")
is_cross_validated()[source]

Return True if the model was cross-validated.

property key
Returns

the unique key representing the object on the backend

learning_curve_plot(metric='AUTO', cv_ribbon=None, cv_lines=None, figsize=(16, 9), colormap=None)

Learning curve

Create learning curve plot for an H2O Model. Learning curves show error metric dependence on learning progress, e.g., RMSE vs number of trees trained so far in GBM. There can be up to 4 curves showing Training, Validation, Training on CV Models, and Cross-validation error.

Parameters
  • model – an H2O model

  • metric – a stopping metric

  • cv_ribbon – if True, plot the CV mean together with the CV standard deviation as a ribbon around the mean; if None, attempt to determine automatically whether this is a suitable visualisation

  • cv_lines – if True, plot the scoring history for individual CV models; if None, attempt to determine automatically whether this is a suitable visualisation

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap to use

Returns

a matplotlib figure

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the learning curve plot
>>> gbm.learning_curve_plot()
logloss(train=False, valid=False, xval=False)[source]

Get the Log Loss.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the log loss value for the training data.

  • valid (bool) – If valid is True, then return the log loss value for the validation data.

  • xval (bool) – If xval is True, then return the log loss value for the cross validation data.

Returns

The log loss for this model.
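For reference, the binary form of log loss averages the negative log-probability assigned to the true class. The sketch below is a plain-Python illustration on made-up values; `binary_logloss` is a hypothetical helper, and the clipping constant `eps` is an assumption added to avoid log(0), not a documented H2O setting.

```python
import math


def binary_logloss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip to keep the logarithm finite (assumed safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


uninformed = binary_logloss([0, 1], [0.5, 0.5])
confident = binary_logloss([0, 1], [0.1, 0.9])
print(uninformed, confident)
```

Lower values are better: the confident, correct predictions score below the uninformed 0.5 predictions.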

mae(train=False, valid=False, xval=False)[source]

Get the Mean Absolute Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the MAE value for the training data.

  • valid (bool) – If valid is True, then return the MAE value for the validation data.

  • xval (bool) – If xval is True, then return the MAE value for the cross validation data.

Returns

The MAE for this regression model.

mean_residual_deviance(train=False, valid=False, xval=False)[source]

Get the Mean Residual Deviance.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.

  • valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.

  • xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.

Returns

The Mean Residual Deviance for this regression model.

property model_id

Model identifier.

model_performance(test_data=None, train=False, valid=False, xval=False, auc_type='none')[source]

Generate model metrics for this model on test_data.

Parameters
  • test_data (H2OFrame) – Data set against which model metrics are computed. All three of the train, valid and xval arguments are ignored if test_data is not None.

  • train (bool) – Report the training metrics for the model.

  • valid (bool) – Report the validation metrics for the model.

  • xval (bool) – Report the cross-validation metrics for the model. If train and valid are True, then it defaults to True.

  • auc_type (String) – Change the default AUC type for multinomial classification AUC/AUCPR calculation when test_data is not None. One of: "auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo" (default: "none"). If the type is “auto” or “none”, AUC and AUCPR are not calculated.

Returns

An object of class H2OModelMetrics.

mse(train=False, valid=False, xval=False)[source]

Get the Mean Square Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the MSE value for the training data.

  • valid (bool) – If valid is True, then return the MSE value for the validation data.

  • xval (bool) – If xval is True, then return the MSE value for the cross validation data.

Returns

The MSE for this regression model.
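The two regression error metrics returned by mae() and mse() differ only in how each residual is penalized. A minimal plain-Python sketch on made-up values, with hypothetical helper names:

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted) squared; large errors weigh more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 2.0]
predicted = [2.5, 5.0, 4.0]
mae_value = mean_absolute_error(actual, predicted)
mse_value = mean_squared_error(actual, predicted)
print(mae_value, mse_value)
```

The single residual of 2.0 dominates the MSE far more than the MAE, which is why the two metrics can rank models differently.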

normmul()[source]

Normalization/Standardization multipliers for numeric predictors.

normsub()[source]

Normalization/Standardization offsets for numeric predictors.

ntrees_actual()[source]

Return the actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value set by the user before building the model.

Type: float

null_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the null dof, or None if it is not present.

null_deviance(train=False, valid=False, xval=False)[source]

Retrieve the null deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the null deviance, or None if it is not present.

property params

Get the parameters and the actual/default values only.

Returns

A dictionary of parameters used to build this model.

partial_plot(data, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_to_file=None, row_index=None, targets=None)[source]

Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.

Parameters
  • data (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.

  • cols – Feature(s) for which partial dependence will be calculated.

  • destination_key – A key reference to the created partial dependence tables in H2O.

  • nbins – Number of bins used. For categorical columns make sure the number of bins exceeds the level count. If you enable add_missing_NA, the returned length will be nbin+1.

  • weight_column – A string denoting which column of data should be used as the weight column.

  • plot – A boolean specifying whether to plot partial dependence table.

  • plot_stddev – A boolean specifying whether to add std err to partial dependence plot.

  • figsize – Dimension/size of the returning plots, adjust to fit your output cells.

  • server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.

  • include_na – A boolean specifying whether missing values should be included in the feature values.

  • user_splits – a dictionary containing column names as key and user defined split values as value in a list.

  • col_pairs_2dpdp – list containing pairs of column names for 2D pdp

  • save_to_file – Fully qualified name of the image file the resulting plot should be saved to, e.g. ‘/home/user/pdpplot.png’. The ‘png’ postfix may be omitted. If the file already exists, it will be overwritten. The plot is only saved if plot = True.

  • row_index – Row for which partial dependence will be calculated instead of the whole input frame.

  • targets – Target classes for multiclass model.

Returns

Plot and list of calculated mean response tables for each feature requested.
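The idea behind a partial dependence table can be sketched without H2O: for each grid value, overwrite the chosen column in every row, score the rows, and average the predictions. In the sketch below `partial_dependence` and `toy_model` are hypothetical, the rows are made up, and H2O's own binning (nbins), weighting and missing-value handling are not reproduced.

```python
import statistics


def partial_dependence(predict, rows, column, grid):
    """Return (value, mean response, std dev of response) for each grid value."""
    table = []
    for value in grid:
        # Force `column` to the grid value in every row, keep other columns as-is.
        preds = [predict({**row, column: value}) for row in rows]
        table.append((value, statistics.mean(preds), statistics.pstdev(preds)))
    return table


def toy_model(row):
    # Stand-in for a trained model; any callable mapping a row to a number works.
    return 2.0 * row["alcohol"] + row["sulphates"]


rows = [
    {"alcohol": 9.0, "sulphates": 0.5},
    {"alcohol": 11.0, "sulphates": 1.5},
]
table = partial_dependence(toy_model, rows, "alcohol", [8.0, 10.0, 12.0])
for value, mean_response, std_response in table:
    print(value, mean_response, std_response)
```

Keeping a single row instead of averaging over all rows gives the per-row curve that the row_index parameter refers to.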

pd_plot(frame, column, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2')

Plot partial dependence plot.

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the remaining features.

Parameters
  • model – H2O Model object

  • frame – H2OFrame

  • column – string containing column name

  • row_index – if None, do partial dependence, if integer, do individual conditional expectation for the row specified by this integer

  • target – (only for multinomial classification) for what target should the plot be done

  • max_levels – maximum number of factor levels to show

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap name; used to get just the first color to keep the api and color scheme similar with pd_multi_plot

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create partial dependence plot
>>> gbm.pd_plot(test, column="alcohol")
permutation_importance(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, use_pandas=False)[source]

Get Permutation Variable Importance.

When n_repeats == 1, the result is similar to the one from varimp() method, i.e., it contains the following columns “Relative Importance”, “Scaled Importance”, and “Percentage”.

When n_repeats > 1, each column holds the permutation variable importance values from one run. These values correspond to the “Relative Importance”, that is, to the difference between the original prediction error and the prediction error obtained on a frame with the given feature permuted.

Parameters
  • frame – training frame

  • metric – metric to be used. One of “AUTO”, “AUC”, “MAE”, “MSE”, “RMSE”, “logloss”, “mean_per_class_error”, “PR_AUC”. Defaults to “AUTO”.

  • n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10 000.

  • n_repeats – number of repeated evaluations. Defaults to 1.

  • features – features to include in the permutation importance. Use None to include all.

  • seed – seed for the random generator. Use -1 to pick a random seed. Defaults to -1.

  • use_pandas – set true to return pandas data frame.

Returns

H2OTwoDimTable or Pandas data frame
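The quantity described above, the increase in prediction error after shuffling one feature, can be sketched in plain Python. Everything here is an illustrative assumption: the helper names, the toy model, the synthetic rows, and the choice of MSE as the metric. It does not reproduce H2O's sampling (n_samples) or its scaled and percentage columns.

```python
import random


def mse(predict, rows, target):
    """Mean squared error of `predict` on `rows`."""
    return sum((predict(r) - r[target]) ** 2 for r in rows) / len(rows)


def permutation_importance(predict, rows, target, features, n_repeats=1, seed=42):
    """Per feature, the error increase after shuffling that feature, one value per repeat."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, target)
    result = {}
    for feature in features:
        runs = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            runs.append(mse(predict, permuted, target) - baseline)
        result[feature] = runs
    return result


def toy_model(row):
    # Stand-in model that uses x1 and ignores x2.
    return 3.0 * row["x1"]


rows = [{"x1": float(i), "x2": float(i % 3), "y": 3.0 * i} for i in range(50)]
importance = permutation_importance(toy_model, rows, "y", ["x1", "x2"], n_repeats=3)
print(importance)
```

Shuffling x2 leaves the error unchanged because the toy model never reads it, while shuffling x1 raises the error in every repeat.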

permutation_importance_plot(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, num_of_features=10, server=False)[source]

Plot Permutation Variable Importance. This method draws a bar plot, or a box plot if n_repeats > 1, and returns the variable importance table.

Parameters
  • frame – training frame

  • metric – metric to be used. One of “AUTO”, “AUC”, “MAE”, “MSE”, “RMSE”, “logloss”, “mean_per_class_error”, “PR_AUC”. Defaults to “AUTO”.

  • n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10 000.

  • n_repeats – number of repeated evaluations. Defaults to 1.

  • features – features to include in the permutation importance. Use None to include all.

  • seed – seed for the random generator. Use -1 to pick a random seed. Defaults to -1.

  • num_of_features – number of features to plot. Defaults to 10.

  • server – if true set server settings to matplotlib and do not show the plot

Returns

H2OTwoDimTable with variable importance

pprint_coef()[source]

Pretty print the coefficients table (includes normalized coefficients).

pr_auc(train=False, valid=False, xval=False)[source]

ModelBase.pr_auc is deprecated, please use ModelBase.aucpr instead.

predict(test_data, custom_metric=None, custom_metric_func=None)[source]

Predict on a dataset.

Parameters
  • test_data (H2OFrame) – Data on which to make predictions.

  • custom_metric – custom evaluation function defined as a class reference; the class gets uploaded into the cluster

  • custom_metric_func – custom evaluation function reference, e.g., the result of upload_custom_metric

Returns

A new H2OFrame of predictions.

predict_contributions(test_data, output_format='Original', top_n=None, bottom_n=None, compare_abs=False)[source]

Predict feature contributions - SHAP values on an H2O Model (only GBM, XGBoost, DRF models and equivalent imported MOJOs).

Returned H2OFrame has shape (#rows, #features + 1) - there is a feature contribution column for each input feature, the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.

Note: Multinomial classification models are currently not supported.

Parameters
  • test_data (H2OFrame) – Data on which to calculate contributions.

  • output_format (Enum) – Specify how to output feature contributions in XGBoost - XGBoost by default outputs contributions for 1-hot encoded features, specifying a Compact output format will produce a per-feature contribution. One of: "Original", "Compact" (default: "Original").

  • top_n – Return only the #top_n highest contributions + bias. If top_n<0, then sort all SHAP values in descending order. If top_n<0 && bottom_n<0, then sort all SHAP values in descending order.

  • bottom_n – Return only the #bottom_n lowest contributions + bias. If top_n and bottom_n are defined together, then return an array of #top_n + #bottom_n + bias. If bottom_n<0, then sort all SHAP values in ascending order. If top_n<0 && bottom_n<0, then sort all SHAP values in descending order.

  • compare_abs – True to compare absolute values of contributions

Returns

A new H2OFrame made of feature contributions.
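Two points from the description above can be illustrated on a single row with plain Python: the contributions plus the bias add up to the raw prediction, and top_n / bottom_n pick the highest and lowest contributions. The helpers, the contribution values and the logistic inverse link (appropriate for a Bernoulli model) are assumptions for illustration; negative top_n / bottom_n values and the exact layout of the frame H2O returns are not covered.

```python
import math


def raw_prediction(contributions, bias):
    """Sum of per-feature contributions plus the bias term."""
    return sum(contributions.values()) + bias


def bernoulli_probability(raw):
    """Inverse logit link, assumed here for a binary (Bernoulli) model."""
    return 1.0 / (1.0 + math.exp(-raw))


def select_contributions(contributions, top_n=0, bottom_n=0, compare_abs=False):
    """Pick the top_n highest and bottom_n lowest contributions (non-negative n only)."""
    key = (lambda item: abs(item[1])) if compare_abs else (lambda item: item[1])
    ranked = sorted(contributions.items(), key=key, reverse=True)
    return ranked[:top_n] + ranked[::-1][:bottom_n]


# Made-up contributions for one row.
contributions = {"AGE": 0.30, "PSA": -0.50, "GLEASON": 0.90, "VOL": -0.10}
bias = -0.20

raw = raw_prediction(contributions, bias)
print(raw, bernoulli_probability(raw))
print(select_contributions(contributions, top_n=2))
print(select_contributions(contributions, bottom_n=2))
print(select_contributions(contributions, top_n=2, compare_abs=True))
```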

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Compute SHAP
>>> m.predict_contributions(fr)
>>> # Compute SHAP and pick the top two highest
>>> m.predict_contributions(fr, top_n=2)
>>> # Compute SHAP and pick the top two lowest
>>> m.predict_contributions(fr, bottom_n=2)
>>> # Compute SHAP and pick the top two highest regardless of the sign
>>> m.predict_contributions(fr, top_n=2, compare_abs=True)
>>> # Compute SHAP and pick top two lowest regardless of the sign
>>> m.predict_contributions(fr, bottom_n=2, compare_abs=True)
>>> # Compute SHAP values and show them all in descending order
>>> m.predict_contributions(fr, top_n=-1)
>>> # Compute SHAP and pick the top two highest and top two lowest
>>> m.predict_contributions(fr, top_n=2, bottom_n=2)
predict_leaf_node_assignment(test_data, type='Path')[source]

Predict on a dataset and return the leaf node assignment (only for tree-based models).

Parameters
  • test_data (H2OFrame) – Data on which to make predictions.

  • type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node of the tree to the node or by H2O’s internal node id. One of: "Path", "Node_ID" (default: "Path").

Returns

A new H2OFrame of predictions.

r2(train=False, valid=False, xval=False)[source]

Return the R squared for this regression model.

Will return R^2 for GLM Models and will return NaN otherwise.

The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the R^2 value for the training data.

  • valid (bool) – If valid is True, then return the R^2 value for the validation data.

  • xval (bool) – If xval is True, then return the R^2 value for the cross validation data.

Returns

The R squared for this regression model.
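
The definition above (1 - MSE/var) can be checked by hand on a small sample. The following is a minimal sketch in plain Python, not the library's code; the function name is hypothetical, and using the population variance of the actual values for sigma*sigma is an assumption.

```python
from statistics import pvariance


def r_squared(actual, predicted):
    """Return 1 - MSE/var for two equal-length sequences of numbers."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    var = pvariance(actual)  # sigma * sigma (population variance: an assumption)
    return 1 - mse / var


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))     # perfect fit
print(r_squared(actual, [2.5] * 4))  # always predicting the mean
```

A perfect fit gives 1, while always predicting the mean gives 0, because the MSE then equals the variance.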

residual_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the residual dof, or None if it is not present.

residual_deviance(train=False, valid=False, xval=None)[source]

Retrieve the residual deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the residual deviance, or None if it is not present.

respmul()[source]

Normalization/Standardization multipliers for numeric response.

respsub()[source]

Normalization/Standardization offsets for numeric response.

rmse(train=False, valid=False, xval=False)[source]

Get the Root Mean Square Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the RMSE value for the training data.

  • valid (bool) – If valid is True, then return the RMSE value for the validation data.

  • xval (bool) – If xval is True, then return the RMSE value for the cross validation data.

Returns

The RMSE for this regression model.
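
For reference, the Root Mean Square Error is the square root of the average squared difference between actual and predicted values. A minimal plain-Python sketch follows (the function name is hypothetical, and it is independent of how H2O computes the metric on the cluster):

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


print(rmse([0, 0, 0, 0], [2, 2, 2, 2]))
```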

rmsle(train=False, valid=False, xval=False)[source]

Get the Root Mean Squared Logarithmic Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the RMSLE value for the training data.

  • valid (bool) – If valid is True, then return the RMSLE value for the validation data.

  • xval (bool) – If xval is True, then return the RMSLE value for the cross validation data.

Returns

The RMSLE for this regression model.
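
The Root Mean Squared Logarithmic Error is commonly computed like the RMSE, but on log(1 + value) instead of the raw values, so it penalizes relative rather than absolute differences. The sketch below assumes that common definition and non-negative inputs; the function name is hypothetical.

```python
import math


def rmsle(actual, predicted):
    """RMSE computed on log(1 + x); inputs are assumed to be greater than -1."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


print(rmsle([10, 100], [12, 90]))
```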

rotation()[source]

Obtain the rotations (eigenvectors) for a PCA model

Returns

H2OFrame

property run_time

Model training time in milliseconds

save_model_details(path='', force=False, filename=None)[source]

Save Model Details of an H2O Model in JSON Format to disk.

Parameters
  • path – a path to save the model details at (hdfs, s3, local)

  • force – if True overwrite destination directory in case it exists, or throw exception if set to False.

  • filename – a filename for the saved model (file type is always .json)

Returns str

the path of the saved model details

save_mojo(path='', force=False, filename=None)[source]

Save an H2O Model as MOJO (Model Object, Optimized) to disk.

Parameters
  • path – a path to save the model at (hdfs, s3, local)

  • force – if True overwrite destination directory in case it exists, or throw exception if set to False.

  • filename – a filename for the saved model (file type is always .zip)

Returns str

the path of the saved model

score_history()[source]

DEPRECATED. Use scoring_history() instead.

scoring_history()[source]

Retrieve Model Score History.

Returns

The score history as an H2OTwoDimTable or a Pandas DataFrame.

shap_explain_row_plot(frame, row_index, columns=None, top_n_features=10, figsize=(16, 9), plot_type='barplot', contribution_type='both')

SHAP local explanation

SHAP explanation shows the contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., the prediction before applying the inverse link function. H2O implements TreeSHAP, which, when the features are correlated, can increase the contribution of a feature that had no influence on the prediction.

Parameters
  • model – h2o tree model, such as DRF, XRT, GBM, XGBoost

  • frame – H2OFrame

  • row_index – row index of the instance to inspect

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable). When plot_type="barplot", top_n_features will be chosen for each contribution_type.

  • figsize – figure size; passed directly to matplotlib

  • plot_type – either “barplot” or “breakdown”

  • contribution_type – One of “positive”, “negative”, or “both”. Used only for plot_type="barplot".

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP row explanation plot
>>> gbm.shap_explain_row_plot(test, row_index=0)
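
The additivity property stated in the description (feature contributions plus bias equal the raw prediction) can be illustrated without a cluster. The sketch below uses hypothetical contribution values and assumes a logit link, as a binomial model might use; for an identity link the raw prediction is already the final prediction.

```python
import math


def raw_prediction(contributions, bias):
    """Sum per-feature SHAP contributions and the bias term for one row."""
    return sum(contributions.values()) + bias


def inverse_logit(raw):
    """Inverse of the logit link (assumed here for a binomial model)."""
    return 1.0 / (1.0 + math.exp(-raw))


contributions = {"alcohol": 0.8, "sulphates": -0.3}  # hypothetical values
bias = -0.5
raw = raw_prediction(contributions, bias)
print(raw, inverse_logit(raw))
```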
shap_summary_plot(frame, columns=None, top_n_features=20, samples=1000, colorize_factors=True, alpha=1, colormap=None, figsize=(12, 12), jitter=0.35)

SHAP summary plot

The SHAP summary plot shows the contribution of features for each instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., the prediction before applying the inverse link function.

Parameters
  • model – h2o tree model, such as DRF, XRT, GBM, XGBoost

  • frame – H2OFrame

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • samples – maximum number of observations to use; if lower than number of rows in the frame, take a random sample

  • colorize_factors – if True, use colors from the colormap to colorize the factors; otherwise all levels will have same color

  • alpha – transparency of the points

  • colormap – colormap to use instead of the default blue to red colormap

  • figsize – figure size; passed directly to matplotlib

  • jitter – amount of jitter used to show the point density

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP summary plot
>>> gbm.shap_summary_plot(test)
show()[source]

Print innards of model, without regard to type.

staged_predict_proba(test_data)[source]

Predict class probabilities at each stage of an H2O Model (only GBM models).

The output structure is analogous to the output of function predict_leaf_node_assignment. For each tree t and class c there will be a column Tt.Cc (e.g., T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class obtained by combining the raw contributions of trees T1.Cc, ..., Tt.Cc. Binomial models build the trees just for the first class, and values in columns Tx.C1 thus correspond to the probability p0.

Parameters

test_data (H2OFrame) – Data on which to make predictions.

Returns

A new H2OFrame of staged predictions.
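
When post-processing the result, the Tt.Cc column names described above can be split back into a tree index and a class index. This is a small sketch with a hypothetical helper name; it relies only on the naming pattern given in this section.

```python
import re

# Matches names such as "T3.C1" (tree 3, class 1).
_STAGED_COLUMN = re.compile(r"^T(\d+)\.C(\d+)$")


def parse_staged_column(name):
    """Return (tree, class) as integers for a staged-prediction column name."""
    match = _STAGED_COLUMN.match(name)
    if match is None:
        raise ValueError(f"not a staged-prediction column: {name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_staged_column("T3.C1"))
```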

property start_time

Timestamp (milliseconds since 1970) when the model training was started.
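
A millisecond timestamp of this kind can be turned into a readable date with the standard library. The sketch assumes the value counts from the Unix epoch (1970-01-01 UTC); the helper name and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to a timezone-aware datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(millis_to_datetime(1_600_000_000_000).isoformat())
```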

std_coef_plot(num_of_features=None, server=False)[source]

Plot a model’s standardized coefficient magnitudes.

Parameters
  • num_of_features – the number of features shown in the plot.

  • server – if true set server settings to matplotlib and show the graph

Returns

None.

summary()[source]

Print a detailed summary of the model.

training_model_metrics()[source]

Return training model metrics for any model.

property type

The type of model built: "classifier" or "regressor" or "unsupervised"

update_tree_weights(frame, weights_column)[source]

Re-calculates tree-node weights based on provided dataset. Modifying node weights will affect how contribution predictions (Shapley values) are calculated. This can be used to explain the model on a curated sub-population of the training dataset.

Parameters
  • frame – frame that will be used to re-populate trees with new observations and to collect per-node weights

  • weights_column – name of the weight column (can be different from training weights)

varimp(use_pandas=False)[source]

Pretty print the variable importances, or return them in a list.

Parameters

use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame.

Returns

A list or Pandas DataFrame.

varimp_plot(num_of_features=None, server=False)[source]

Plot the variable importance for a trained model.

Parameters
  • num_of_features – the number of features shown in the plot (default is 10 or all if less than 10).

  • server – if true set server settings to matplotlib and do not show the graph

Returns

None.

weights(matrix_id=0)[source]

Return the frame for the respective weight matrix.

Parameters

matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return.

Returns

an H2OFrame which represents the weight matrix identified by matrix_id

xval_keys()[source]

Return model keys for the cross-validated model.

property xvals

Return a list of the cross-validated models.

Returns

A list of models.

class h2o.model.MetricsBase(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

A parent class to house common metrics available for the various Metrics types.

The methods here are available across different model categories.

aic()[source]

The AIC for this set of metrics.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.aic()
auc()[source]

The AUC for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.auc()
aucpr()[source]

The area under the precision recall curve.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.aucpr()
custom_metric_name()[source]

Name of custom metric or None.

custom_metric_value()[source]

Value of custom metric or None.

gini()[source]

Gini coefficient.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.gini()
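
For binary classifiers, the Gini coefficient is usually related to the AUC by Gini = 2 * AUC - 1. The sketch below assumes that this metric follows the usual relationship; the function name is hypothetical.

```python
def gini_from_auc(auc):
    """Map an AUC in [0, 1] to the corresponding Gini coefficient in [-1, 1]."""
    return 2.0 * auc - 1.0


print(gini_from_auc(0.75))
```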
hglm_metric(metric_string)[source]
logloss()[source]

Log loss.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.logloss()
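
As a reminder of what the metric measures, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities. The plain-Python sketch below is not H2O's implementation; the function name and the clipping constant eps are assumptions added to avoid log(0).

```python
import math


def binary_logloss(actual, probabilities, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


print(binary_logloss([1, 0, 1], [0.9, 0.2, 0.6]))
```

Lower values are better; an uninformative prediction of 0.5 for every row yields log(2).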
mae()[source]

The MAE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mae()
classmethod make(kvs)[source]

Factory method to instantiate a MetricsBase object from the list of key-value pairs.

mean_per_class_error()[source]

The mean per class error.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.mean_per_class_error()
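
The mean per class error averages the error rate of each class, so every class counts equally regardless of its size. A minimal sketch on plain lists of labels follows (hypothetical function name and data; predicted labels are assumed to be already thresholded):

```python
from collections import defaultdict


def mean_per_class_error(actual, predicted):
    """Average, over classes, of the fraction of misclassified rows in each class."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a != p:
            errors[a] += 1
    rates = [errors[c] / totals[c] for c in totals]
    return sum(rates) / len(rates)


print(mean_per_class_error([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))
```

In this sample the overall error is 2/6, but the per-class rates are 1/4 and 1/2, so the mean per class error is 0.375.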
mean_residual_deviance()[source]

The mean residual deviance for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/AirlinesTest.csv.zip")
>>> air_gbm = H2OGradientBoostingEstimator()
>>> air_gbm.train(x=list(range(9)),
...               y=9,
...               training_frame=airlines,
...               validation_frame=airlines)
>>> air_gbm.mean_residual_deviance(train=True,valid=False,xval=False)
mse()[source]

The MSE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mse()
nobs()[source]

The number of observations.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> perf = cars_gbm.model_performance()
>>> perf.nobs()
null_degrees_of_freedom()[source]

The null DoF if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_degrees_of_freedom()
null_deviance()[source]

The null deviance if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_deviance()
pr_auc()[source]

MetricsBase.pr_auc is deprecated, please use MetricsBase.aucpr instead.

r2()[source]

The R squared coefficient.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.r2()
residual_degrees_of_freedom()[source]

The residual DoF if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_degrees_of_freedom()
residual_deviance()[source]

The residual deviance if the model has it, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_deviance()
rmse()[source]

The RMSE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmse()
rmsle()[source]

The RMSLE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmsle()
show()[source]

Display a short summary of the metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.show()
class h2o.model.H2OBinomialModel[source]

Bases: h2o.model.model_base.ModelBase

F0point5(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F0.5 for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F0.5 value for the training data.

  • valid (bool) – If True, return the F0.5 value for the validation data.

  • xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.

Returns

The F0.5 values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)         
>>> F0point5 = gbm.F0point5() # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True,  valid=True,  xval=True)
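
F0.5, F1, and F2 are all instances of the general F-beta score, which combines precision and recall and weights recall beta times as much as precision. The sketch below uses the standard formula with a hypothetical function name; it does not show how H2O selects thresholds.

```python
def f_beta(precision, recall, beta):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 1.0, 0.5  # hypothetical values
for beta in (0.5, 1.0, 2.0):
    print(beta, f_beta(precision, recall, beta))
```

With precision higher than recall, as in this sample, F0.5 is the largest of the three and F2 the smallest.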
F1(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F1 value for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F1 value for the training data.

  • valid (bool) – If True, return the F1 value for the validation data.

  • xval (bool) – If True, return the F1 value for each of the cross-validated splits.

Returns

The F1 values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2] 
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F1() # <- Default: return training metric value
>>> gbm.F1(train=True,  valid=True,  xval=True)
F2(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F2 for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F2 value for the training data.

  • valid (bool) – If True, return the F2 value for the validation data.

  • xval (bool) – If True, return the F2 value for each of the cross-validated splits.

Returns

The F2 values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F2() # <- Default: return training metric value
>>> gbm.F2(train=True, valid=True, xval=True)
accuracy(thresholds=None, train=False, valid=False, xval=False)[source]

Get the accuracy for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the accuracy value for the training data.

  • valid (bool) – If True, return the accuracy value for the validation data.

  • xval (bool) – If True, return the accuracy value for each of the cross-validated splits.

Returns

The accuracy values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.accuracy() # <- Default: return training metric value
>>> gbm.accuracy(train=True, valid=True, xval=True)
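
To make "the threshold maximizing the metric" concrete, the sketch below computes accuracy at a given threshold and scans candidate thresholds for the best one. It is an illustration in plain Python with hypothetical names; treating a probability equal to the threshold as a positive prediction and using the observed probabilities as candidates are assumptions.

```python
def accuracy_at(actual, probabilities, threshold):
    """Fraction of rows whose thresholded prediction matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(actual, probabilities))
    return hits / len(actual)


def best_accuracy_threshold(actual, probabilities):
    """Return (threshold, accuracy) for the candidate with the highest accuracy."""
    candidates = sorted(set(probabilities))
    best = max(candidates, key=lambda t: accuracy_at(actual, probabilities, t))
    return best, accuracy_at(actual, probabilities, best)


labels = [1, 1, 0, 0]
probs = [0.9, 0.7, 0.4, 0.2]
print(best_accuracy_threshold(labels, probs))
```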
confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False)[source]

Get the confusion matrix for the specified metrics/thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • metrics – A string (or list of strings) among metrics listed in H2OBinomialModelMetrics.maximizing_metrics. Defaults to ‘f1’.

  • thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.

  • train (bool) – If True, return the confusion matrix value for the training data.

  • valid (bool) – If True, return the confusion matrix value for the validation data.

  • xval (bool) – If True, return the confusion matrix value for each of the cross-validated splits.

Returns

The confusion matrix values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.confusion_matrix() # <- Default: return training metric value
>>> gbm.confusion_matrix(train=True, valid=True, xval=True)
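To make the link between a threshold and a confusion matrix concrete, here is a minimal sketch in plain Python. It does not use H2O; the `confusion_counts` helper, the toy data, and the row/column layout (actual class by row, predicted class by column) are assumptions for illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Return [[TN, FP], [FN, TP]] for a binary problem at one threshold."""
    tn = fp = fn = tp = 0
    for score, actual in zip(scores, labels):
        # Assumption: a score >= threshold is predicted as the positive class.
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

matrix = confusion_counts(scores, labels, threshold=0.5)
print(matrix)
```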
error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the error value for the training data.

  • valid (bool) – If True, return the error value for the validation data.

  • xval (bool) – If True, return the error value for each of the cross-validated splits.

Returns

The error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.error() # <- Default: return training metric
>>> gbm.error(train=True, valid=True, xval=True)
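The train/valid/xval selection rule stated for this method can be sketched in plain Python as follows. The `select_metric` helper and the error values are hypothetical, and the single-flag behaviour (returning the bare value) is an assumption, since the description only covers the default case and the multi-flag case.

```python
def select_metric(values, train=False, valid=False, xval=False):
    """Pick metric values following the train/valid/xval rule."""
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [key for key, flag in flags if flag]
    if not requested:
        # No flag set: fall back to the training metric.
        return values["train"]
    if len(requested) == 1:
        # Assumption: a single flag returns the bare value, not a dictionary.
        return values[requested[0]]
    return {key: values[key] for key in requested}

# Hypothetical error values for the three data splits.
errors = {"train": 0.02, "valid": 0.05, "xval": 0.06}

default_value = select_metric(errors)
both = select_metric(errors, train=True, valid=True)
print(default_value, both)
```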
fallout(thresholds=None, train=False, valid=False, xval=False)[source]

Get the fallout for a set of thresholds (aka False Positive Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the fallout value for the training data.

  • valid (bool) – If True, return the fallout value for the validation data.

  • xval (bool) – If True, return the fallout value for each of the cross-validated splits.

Returns

The fallout values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fallout() # <- Default: return training metric
>>> gbm.fallout(train=True, valid=True, xval=True)
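Fallout is the share of actual negatives that are predicted positive, FP / (FP + TN). The plain-Python sketch below shows how it changes with the threshold; the helper name, the data, and the "score at or above the threshold is positive" convention are assumptions for illustration, not H2O code.

```python
def fallout_at(scores, labels, threshold):
    """False positive rate: FP / (FP + TN) at the given threshold."""
    negative_scores = [s for s, y in zip(scores, labels) if y == 0]
    # Assumption: a score >= threshold is predicted as the positive class.
    false_positives = sum(1 for s in negative_scores if s >= threshold)
    return false_positives / len(negative_scores)

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Raising the threshold can only lower (or keep) the fallout.
fallouts = [fallout_at(scores, labels, t) for t in (0.2, 0.5, 0.7)]
print(fallouts)
```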
find_idx_by_threshold(threshold, train=False, valid=False, xval=False)[source]

Retrieve the index in this metric’s threshold list at which the given threshold is located.

If all are False (default), then return the index for the training metrics. If more than one option is set to True, then return a dictionary of indices where the keys are “train”, “valid”, and “xval”.

Parameters
  • threshold (float) – Threshold value to search for in the threshold list.

  • train (bool) – If True, return the threshold index for the training data.

  • valid (bool) – If True, return the threshold index for the validation data.

  • xval (bool) – If True, return the threshold index for each of the cross-validated splits.

Returns

The index of the given threshold for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438,
...                                           train=True)
>>> idx_threshold
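One way to picture a threshold lookup is a nearest-match search over a list of thresholds, sketched below in plain Python. This is an assumption for illustration only: the text above does not say whether the lookup requires an exact match, uses a tolerance, or picks the closest entry, and `find_idx_nearest` and the threshold list are hypothetical.

```python
def find_idx_nearest(thresholds, target):
    """Index of the threshold closest to the target value."""
    return min(range(len(thresholds)), key=lambda i: abs(thresholds[i] - target))

# Hypothetical threshold list, sorted from high to low.
thresholds = [0.95, 0.80, 0.61, 0.39, 0.12]

idx = find_idx_nearest(thresholds, 0.394)
print(idx, thresholds[idx])
```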
find_threshold_by_max_metric(metric, train=False, valid=False, xval=False)[source]

Get the threshold at which the given metric is maximal.

If all are False (default), then return the value for the training metrics. If more than one option is set to True, then return a dictionary of values where the keys are “train”, “valid”, and “xval”.

Parameters
  • metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.

  • train (bool) – If True, return the threshold for the training data.

  • valid (bool) – If True, return the threshold for the validation data.

  • xval (bool) – If True, return the threshold for each of the cross-validated splits.

Returns

The threshold maximizing the metric for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> max_metric = gbm.find_threshold_by_max_metric(metric="f2",
...                                               train=True)
>>> max_metric
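The idea of picking the threshold that maximizes a metric can be sketched in plain Python with F2 as the metric. This is not H2O's implementation: the helpers, the toy data, the candidate thresholds (each distinct score), and the tie-breaking rule are all assumptions for illustration.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score computed from confusion counts."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0

def best_threshold(scores, labels, beta=2.0):
    """Return (threshold, value) maximizing F-beta over the distinct scores."""
    best = None
    for t in sorted(set(scores)):
        # Assumption: a score >= threshold is predicted as the positive class.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        value = f_beta(tp, fp, fn, beta)
        # Ties keep the lowest threshold because of the strict comparison.
        if best is None or value > best[1]:
            best = (t, value)
    return best

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

best = best_threshold(scores, labels, beta=2.0)
print(best)
```

Because F2 weights recall more heavily than precision, the winning threshold here (0.4) is the highest one that still captures every positive row.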
fnr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the False Negative Rates for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the FNR value for the training data.

  • valid (bool) – If True, return the FNR value for the validation data.

  • xval (bool) – If True, return the FNR value for each of the cross-validated splits.

Returns

The FNR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fnr() # <- Default: return training metric
>>> gbm.fnr(train=True, valid=True, xval=True)
fpr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the False Positive Rates for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the FPR value for the training data.

  • valid (bool) – If True, return the FPR value for the validation data.

  • xval (bool) – If True, return the FPR value for each of the cross-validated splits.

Returns

The FPR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fpr() # <- Default: return training metric
>>> gbm.fpr(train=True, valid=True, xval=True)
gains_lift(train=False, valid=False, xval=False)[source]

Get the Gains/Lift table for the specified metrics.

If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the Gains/Lift table for the training data.

  • valid (bool) – If True, return the Gains/Lift table for the validation data.

  • xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.

Returns

The Gains/Lift table for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.gains_lift() # <- Default: return training metric Gain/Lift table
>>> gbm.gains_lift(train=True, valid=True, xval=True)
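For readers unfamiliar with gains and lift, the sketch below builds a very small table in plain Python. It is a simplified illustration, not the table H2O produces: the equal-sized groups, the column names, and the toy data are assumptions, and the row count is assumed to be divisible by the number of groups.

```python
def gains_lift_table(scores, labels, groups=3):
    """Rank rows by score and summarize each equal-sized group."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    total_positives = sum(ranked)
    overall_rate = total_positives / len(ranked)
    size = len(ranked) // groups  # assumes len(ranked) is divisible by groups
    rows, captured = [], 0
    for g in range(groups):
        chunk = ranked[g * size:(g + 1) * size]
        positives = sum(chunk)
        captured += positives
        response_rate = positives / len(chunk)
        rows.append({
            "group": g + 1,
            "response_rate": response_rate,
            # Lift: how much denser in positives this group is than the data overall.
            "lift": response_rate / overall_rate,
            # Share of all positives found in this group and the ones before it.
            "cumulative_capture_rate": captured / total_positives,
        })
    return rows

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

rows = gains_lift_table(scores, labels, groups=3)
for row in rows:
    print(row)
```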
kolmogorov_smirnov()[source]

Retrieve the Kolmogorov-Smirnov (K-S) metric for a binomial model. The returned value lies between 0 and 1 and represents the degree of separation between the cumulative distribution functions of the positive (1) and negative (0) classes. Detailed per-group metrics can be found in the Gains/Lift table.

Returns

Kolmogorov-Smirnov metric, a number between 0 and 1

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OGradientBoostingEstimator(ntrees=1,
...                                      gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.kolmogorov_smirnov()
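The degree of separation described above can be illustrated with the textbook two-sample form of the statistic: the largest gap between the empirical cumulative distributions of scores for the two classes. The sketch is plain Python on hypothetical data; a value derived from a binned Gains/Lift table may differ from this exact computation.

```python
def ks_statistic(scores, labels):
    """Largest gap between the score CDFs of the positive and negative classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    largest_gap = 0.0
    for t in sorted(set(scores)):
        # Empirical CDF of each class evaluated at score t.
        cdf_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t) / n_pos
        cdf_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= t) / n_neg
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

ks = ks_statistic(scores, labels)
print(ks)
```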
max_per_class_error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the max per class error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the max per class error value for the training data.

  • valid (bool) – If True, return the max per class error value for the validation data.

  • xval (bool) – If True, return the max per class error value for each of the cross-validated splits.

Returns

The max per class error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.max_per_class_error() # <- Default: return training metric value
>>> gbm.max_per_class_error(train=True, valid=True, xval=True)
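As a worked illustration, the per-class errors of a binary classifier can be computed from confusion counts and the larger of the two taken as the max per class error. The counts and the `per_class_errors` helper below are hypothetical; the sketch does not call H2O.

```python
def per_class_errors(tn, fp, fn, tp):
    """Error rate within each actual class of a binary problem."""
    negative_class_error = fp / (fp + tn)  # negatives predicted as positive
    positive_class_error = fn / (fn + tp)  # positives predicted as negative
    return negative_class_error, positive_class_error

# Hypothetical confusion counts at a single threshold.
neg_err, pos_err = per_class_errors(tn=50, fp=10, fn=5, tp=35)
max_error = max(neg_err, pos_err)
print(neg_err, pos_err, max_error)
```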
mcc(thresholds=None, train=False, valid=False, xval=False)[source]

Get the MCC for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the MCC value for the training data.

  • valid (bool) – If True, return the MCC value for the validation data.

  • xval (bool) – If True, return the MCC value for each of the cross-validated splits.

Returns

The MCC values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mcc() # <- Default: return training metric value
>>> gbm.mcc(train=True, valid=True, xval=True)
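MCC (Matthews correlation coefficient) combines all four confusion counts into one value between -1 and 1. The sketch below evaluates the standard formula in plain Python on hypothetical counts; returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in this reference.

```python
import math

def mcc(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical confusion counts at a single threshold.
value = mcc(tn=50, fp=10, fn=5, tp=35)
perfect = mcc(tn=5, fp=0, fn=0, tp=5)
print(value, perfect)
```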
mean_per_class_error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the mean per class error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the mean per class error value for the training data.

  • valid (bool) – If True, return the mean per class error value for the validation data.

  • xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.

Returns

The mean per class error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mean_per_class_error() # <- Default: return training metric
>>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
metric(metric, thresholds=None, train=False, valid=False, xval=False)[source]

Get the metric value for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • metric (str) – Name of the metric to retrieve.

  • thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).

  • train (bool) – If True, return the metric value for the training data.

  • valid (bool) – If True, return the metric value for the validation data.

  • xval (bool) – If True, return the metric value for each of the cross-validated splits.

Returns

The metric values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> # The thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99])
>>> thresholds = [0.01, 0.5, 0.99]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # Allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
missrate(thresholds=None, train=False, valid=False, xval=False)[source]

Get the miss rate for a set of thresholds (aka False Negative Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the miss rate value for the training data.

  • valid (bool) – If True, return the miss rate value for the validation data.

  • xval (bool) – If True, return the miss rate value for each of the cross-validated splits.

Returns

The miss rate values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.missrate() # <- Default: return training metric
>>> gbm.missrate(train=True, valid=True, xval=True)
plot(timestep='AUTO', metric='AUTO', server=False, **kwargs)[source]

Plot training set (and validation set if available) scoring history for an H2OBinomialModel.

The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep (str) – A unit of measurement for the x-axis.

  • metric (str) – A unit of measurement for the y-axis.

  • server (bool) – If True, then generate the image inline (using matplotlib’s “Agg” backend).

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> response = 3
>>> predictors = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10]
>>> model = H2OGeneralizedLinearEstimator(family="binomial")
>>> model.train(x=predictors, y=response, training_frame=benign)
>>> model.plot(timestep="AUTO", metric="objective", server=False)
precision(thresholds=None, train=False, valid=False, xval=False)[source]

Get the precision for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the precision value for the training data.

  • valid (bool) – If True, return the precision value for the validation data.

  • xval (bool) – If True, return the precision value for each of the cross-validated splits.

Returns

The precision values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.precision() # <- Default: return training metric value
>>> gbm.precision(train=True, valid=True, xval=True)
recall(thresholds=None, train=False, valid=False, xval=False)[source]

Get the recall for a set of thresholds (aka True Positive Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the recall value for the training data.

  • valid (bool) – If True, return the recall value for the validation data.

  • xval (bool) – If True, return the recall value for each of the cross-validated splits.

Returns

The recall values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.recall() # <- Default: return training metric
>>> gbm.recall(train=True, valid=True, xval=True)
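Recall is usually read alongside precision, because moving the threshold tends to trade one for the other. The plain-Python sketch below shows this on hypothetical data; the helper, the data, and the "score at or above the threshold is positive" convention are assumptions for illustration and do not use H2O.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) at the given threshold."""
    # Assumption: a score >= threshold is predicted as the positive class.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

low = precision_recall(scores, labels, 0.35)   # permissive threshold
high = precision_recall(scores, labels, 0.75)  # strict threshold
print(low, high)
```

On this data the permissive threshold finds every positive row at the cost of one false positive, while the strict threshold makes no false positives but misses one positive row.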
roc(train=False, valid=False, xval=False)[source]

Return the coordinates of the ROC curve for a given set of data.

The coordinates are returned as a two-tuple containing the false positive rates as a list and the true positive rates as a list. If all are False (default), then return the coordinates for the training data. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.

Parameters
  • train (bool) – If True, return the ROC value for the training data.

  • valid (bool) – If True, return the ROC value for the validation data.

  • xval (bool) – If True, return the ROC value for each of the cross-validated splits.

Returns

The ROC values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.roc() # <- Default: return training data
>>> gbm.roc(train=True, valid=True, xval=True)
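The two-list layout of ROC coordinates can be illustrated in plain Python. The sketch below sweeps the distinct scores of a hypothetical data set from high to low and also integrates the curve with the trapezoidal rule; the helper names, the added (0, 0) starting point, and the data are assumptions, and nothing here calls H2O.

```python
def roc_coordinates(scores, labels):
    """Return (false positive rates, true positive rates) as two lists."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fprs, tprs = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for t in sorted(set(scores), reverse=True):
        # Assumption: a score >= threshold is predicted as the positive class.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fprs.append(fp / n_neg)
        tprs.append(tp / n_pos)
    return fprs, tprs

def trapezoid_auc(fprs, tprs):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fprs)):
        area += (fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2
    return area

# Hypothetical predicted probabilities and true labels (1 = positive class).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

fprs, tprs = roc_coordinates(scores, labels)
auc = trapezoid_auc(fprs, tprs)
print(fprs, tprs, auc)
```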
sensitivity(thresholds=None, train=False, valid=False, xval=False)[source]

Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the sensitivity value for the training data.

  • valid (bool) – If True, return the sensitivity value for the validation data.

  • xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.

Returns

The sensitivity values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.sensitivity() # <- Default: return training metric
>>> gbm.sensitivity(train=True, valid=True, xval=True)
specificity(thresholds=None, train=False, valid=False, xval=False)[source]

Get the specificity for a set of thresholds (aka True Negative Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the specificity value for the training data.

  • valid (bool) – If True, return the specificity value for the validation data.

  • xval (bool) – If True, return the specificity value for each of the cross-validated splits.

Returns

The specificity values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.specificity() # <- Default: return training metric
>>> gbm.specificity(train=True, valid=True, xval=True)
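Specificity and fallout are complements: both are computed over the actual negatives, so they always sum to 1 at a given threshold. A minimal plain-Python check on hypothetical confusion counts (the helper names are illustrative, not H2O functions):

```python
def specificity_from_counts(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def fallout_from_counts(tn, fp):
    """False positive rate: FP / (TN + FP)."""
    return fp / (tn + fp)

# Hypothetical counts of actual negatives at a single threshold.
spec = specificity_from_counts(tn=50, fp=10)
fall = fallout_from_counts(tn=50, fp=10)
print(spec, fall, spec + fall)
```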
tnr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the True Negative Rate for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the TNR value for the training data.

  • valid (bool) – If True, return the TNR value for the validation data.

  • xval (bool) – If True, return the TNR value for each of the cross-validated splits.

Returns

The TNR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tnr() # <- Default: return training metric
>>> gbm.tnr(train=True, valid=True, xval=True)
tpr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the True Positive Rate for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the TPR value for the training data.

  • valid (bool) – If True, return the TPR value for the validation data.

  • xval (bool) – If True, return the TPR value for each of the cross-validated splits.

Returns

The TPR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tpr() # <- Default: return training metric
>>> gbm.tpr(train=True, valid=True, xval=True)
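
As a plain-Python illustration of what the true positive rate and true negative rate measure at a set of thresholds, the sketch below counts confusion-matrix cells directly. It is not H2O's implementation: the helper name rates_at_thresholds, the toy labels and scores, and the convention that a score at or above the threshold counts as a positive prediction are all assumptions made for the example.

```python
def rates_at_thresholds(actual, scores, thresholds):
    """Return (threshold, tpr, tnr) tuples for binary labels (1 = positive).

    Assumed convention: a row is predicted positive when score >= threshold.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= t)
        fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < t)
        tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s < t)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= t)
        # TPR = TP / (TP + FN); TNR = TN / (TN + FP)
        tpr = tp / (tp + fn) if (tp + fn) else float("nan")
        tnr = tn / (tn + fp) if (tn + fp) else float("nan")
        rows.append((t, tpr, tnr))
    return rows


# Toy data (hypothetical): four positives and four negatives.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
rates = rates_at_thresholds(actual, scores, [0.25, 0.5, 0.75])
for threshold, tpr, tnr in rates:
    print(threshold, tpr, tnr)
```

Raising the threshold trades true positives for true negatives, which is why the two rates move in opposite directions across the three rows.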
class h2o.model.H2OMultinomialModel[source]

Bases: h2o.model.model_base.ModelBase

confusion_matrix(data)[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

Parameters

data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> confusion_matrix = gbm.confusion_matrix(train)
>>> confusion_matrix
hit_ratio_table(train=False, valid=False, xval=False)[source]

Retrieve the Hit Ratios.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train – If train is True, then return the hit ratio value for the training data.

  • valid – If valid is True, then return the hit ratio value for the validation data.

  • xval – If xval is True, then return the hit ratio value for the cross validation data.

Returns

The hit ratio table for this multinomial model.

Example

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> hit_ratio_table = gbm.hit_ratio_table() # <- Default: return training metrics
>>> hit_ratio_table
>>> hit_ratio_table1 = gbm.hit_ratio_table(train=True,
...                                        valid=True,
...                                        xval=True)
>>> hit_ratio_table1
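
To make the idea of a hit ratio concrete, here is a minimal sketch that assumes the usual top-k definition: the fraction of rows whose true class is among the k classes with the highest predicted probability. The function hit_ratios and the toy probabilities are hypothetical and only illustrate the metric; they are not how H2O computes its table.

```python
def hit_ratios(actual, class_probs, max_k):
    """Fraction of rows whose true class is among the k most probable classes,
    for k = 1 .. max_k."""
    ratios = []
    for k in range(1, max_k + 1):
        hits = 0
        for label, probs in zip(actual, class_probs):
            # Class labels ordered from most to least probable, cut at k.
            top_k = sorted(probs, key=probs.get, reverse=True)[:k]
            if label in top_k:
                hits += 1
        ratios.append(hits / len(actual))
    return ratios


# Toy predictions (hypothetical) for a three-class "cylinders" response.
actual = ["4", "6", "8", "4"]
class_probs = [
    {"4": 0.7, "6": 0.2, "8": 0.1},
    {"4": 0.5, "6": 0.4, "8": 0.1},
    {"4": 0.2, "6": 0.5, "8": 0.3},
    {"4": 0.1, "6": 0.3, "8": 0.6},
]
table = hit_ratios(actual, class_probs, max_k=3)
print(table)
```

The ratio can only grow with k, and it reaches 1.0 once k equals the number of classes.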
mean_per_class_error(train=False, valid=False, xval=False)[source]

Retrieve the mean per class error across all classes.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the mean_per_class_error value for the training data.

  • valid (bool) – If True, return the mean_per_class_error value for the validation data.

  • xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.

Returns

The mean_per_class_error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> mean_per_class_error = gbm.mean_per_class_error() # <- Default: return training metric
>>> mean_per_class_error
>>> mean_per_class_error1 = gbm.mean_per_class_error(train=True,
...                                                  valid=True,
...                                                  xval=True)
>>> mean_per_class_error1
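
The following sketch shows one way to read "mean per class error": compute the error rate inside each actual class, then average those rates with equal weight. The helper name and the toy labels are assumptions for illustration, not H2O internals.

```python
def mean_per_class_error(actual, predicted):
    """Average of the per-class error rates, each class weighted equally."""
    errors = []
    for c in sorted(set(actual)):
        rows = [i for i, a in enumerate(actual) if a == c]
        wrong = sum(1 for i in rows if predicted[i] != c)
        errors.append(wrong / len(rows))
    return sum(errors) / len(errors)


# Toy, imbalanced data (hypothetical): six rows of class "4", two of class "6".
actual = ["4", "4", "4", "4", "4", "4", "6", "6"]
predicted = ["4", "4", "4", "4", "4", "4", "4", "6"]

mpce = mean_per_class_error(actual, predicted)
overall_error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(mpce, overall_error)
```

Because every class counts equally, a single mistake in the small class moves this metric more than it moves the overall error rate.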
multinomial_auc_table(train=False, valid=False, xval=False)[source]

Retrieve the multinomial AUC table.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the multinomial_auc_table for the training data.

  • valid (bool) – If True, return the multinomial_auc_table for the validation data.

  • xval (bool) – If True, return the multinomial_auc_table for each of the cross-validated splits.

Returns

The multinomial_auc_table values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_auc_table = gbm.multinomial_auc_table() # <- Default: return training metric
>>> multinomial_auc_table
>>> multinomial_auc_table1 = gbm.multinomial_auc_table(train=True,
...                                                    valid=True,
...                                                    xval=True)
>>> multinomial_auc_table1
multinomial_aucpr_table(train=False, valid=False, xval=False)[source]

Retrieve the multinomial PR AUC table.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the multinomial_aucpr_table for the training data.

  • valid (bool) – If True, return the multinomial_aucpr_table for the validation data.

  • xval (bool) – If True, return the multinomial_aucpr_table for each of the cross-validated splits.

Returns

The multinomial_aucpr_table values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_aucpr_table = gbm.multinomial_aucpr_table() # <- Default: return training metric
>>> multinomial_aucpr_table
>>> multinomial_aucpr_table1 = gbm.multinomial_aucpr_table(train=True,
...                                                        valid=True,
...                                                        xval=True)
>>> multinomial_aucpr_table1
plot(timestep='AUTO', metric='AUTO', **kwargs)[source]

Plots training set (and validation set if available) scoring history for an H2OMultinomialModel. The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep – A unit of measurement for the x-axis. This can be AUTO, duration, or number_of_trees.

  • metric – A unit of measurement for the y-axis. This can be AUTO, logloss, classification_error, or rmse.

Returns

A scoring history plot.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(metric="AUTO", timestep="AUTO")
class h2o.model.H2ORegressionModel[source]

Bases: h2o.model.model_base.ModelBase

plot(timestep='AUTO', metric='AUTO', **kwargs)[source]

Plots training set (and validation set if available) scoring history for an H2ORegressionModel. The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep – A unit of measurement for the x-axis.

  • metric – A unit of measurement for the y-axis.

Returns

A scoring history plot.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy"
>>> distribution = "gaussian"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(timestep="AUTO", metric="AUTO")
residual_analysis_plot(frame, figsize=(16, 9))

Residual Analysis

Perform residual analysis and plot the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity, autocorrelation, etc. If you notice “striped” lines of residuals, that is just an indication that your response variable was integer-valued rather than real-valued.

Parameters
  • model – H2OModel

  • frame – H2OFrame

  • figsize – figure size; passed directly to matplotlib

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the residual analysis plot
>>> gbm.residual_analysis_plot(test)
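
For readers who want to see the quantities behind the plot, here is a small sketch that pairs each fitted value with its residual. It assumes the common convention residual = actual - fitted; which sign H2O uses internally is not stated here, and the helper name and numbers are made up.

```python
def fitted_vs_residuals(actual, fitted):
    """Return (fitted, residual) pairs, with residual = actual - fitted
    (an assumed sign convention)."""
    return [(f, a - f) for a, f in zip(actual, fitted)]


# Toy data (hypothetical): an integer-valued response such as wine "quality".
actual = [5, 6, 5, 7, 6]
fitted = [5.2, 5.7, 5.4, 6.5, 6.1]

points = fitted_vs_residuals(actual, fitted)
mean_residual = sum(r for _, r in points) / len(points)
for f, r in points:
    print(f, round(r, 3))
```

With an integer response, every point satisfies residual = k - fitted for some integer k, so the points fall on parallel diagonal lines. That is the "striped" pattern described above.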
class h2o.model.H2OOrdinalModel[source]

Bases: h2o.model.model_base.ModelBase

confusion_matrix(data)[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

Parameters

data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.

hit_ratio_table(train=False, valid=False, xval=False)[source]

Retrieve the Hit Ratios.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train – If train is True, then return the hit ratio value for the training data.

  • valid – If valid is True, then return the hit ratio value for the validation data.

  • xval – If xval is True, then return the hit ratio value for the cross validation data.

Returns

The hit ratio table for this ordinal model.

mean_per_class_error(train=False, valid=False, xval=False)[source]

Retrieve the mean per class error across all classes.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the mean_per_class_error value for the training data.

  • valid (bool) – If True, return the mean_per_class_error value for the validation data.

  • xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.

Returns

The mean_per_class_error values for the specified key(s).

plot(timestep='AUTO', metric='AUTO', **kwargs)[source]

Plots training set (and validation set if available) scoring history for an H2OOrdinalModel. The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep – A unit of measurement for the x-axis.

  • metric – A unit of measurement for the y-axis.

Returns

A scoring history plot.

class h2o.model.H2OClusteringModel[source]

Bases: h2o.model.model_base.ModelBase

The examples below assume the following import: from h2o.estimators.kmeans import H2OKMeansEstimator

betweenss(train=False, valid=False, xval=False)[source]

Get the between cluster sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the between cluster sum of squares value for the training data.

  • valid (bool) – If True, return the between cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.

Returns

The between cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> betweenss = km.betweenss() # <- Default: return training metrics
>>> betweenss
>>> betweenss3 = km.betweenss(train=False,
...                           valid=False,
...                           xval=True)
>>> betweenss3
centers()[source]

The centers for the KMeans model.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers()
centers_std()[source]

The standardized centers for the kmeans model.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers_std()
centroid_stats(train=False, valid=False, xval=False)[source]

Get the centroid statistics for each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the centroid statistic for the training data.

  • valid (bool) – If True, return the centroid statistic for the validation data.

  • xval (bool) – If True, return the centroid statistic for each of the cross-validated splits.

Returns

The centroid statistics for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> centroid_stats = km.centroid_stats() # <- Default: return training metrics
>>> centroid_stats
>>> centroid_stats1 = km.centroid_stats(train=True,
...                                     valid=False,
...                                     xval=False)
>>> centroid_stats1
num_iterations()[source]

Get the number of iterations it took to converge or reach max iterations.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.num_iterations()
size(train=False, valid=False, xval=False)[source]

Get the sizes of each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the cluster sizes for the training data.

  • valid (bool) – If True, return the cluster sizes for the validation data.

  • xval (bool) – If True, return the cluster sizes for each of the cross-validated splits.

Returns

The cluster sizes for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> size = km.size() # <- Default: return training metrics
>>> size
>>> size1 = km.size(train=False,
...                 valid=False,
...                 xval=True)
>>> size1
tot_withinss(train=False, valid=False, xval=False)[source]

Get the total within cluster sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the total within cluster sum of squares value for the training data.

  • valid (bool) – If True, return the total within cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.

Returns

The total within cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> tot_withinss = km.tot_withinss() # <- Default: return training metrics
>>> tot_withinss
>>> tot_withinss2 = km.tot_withinss(train=True,
...                                 valid=False,
...                                 xval=True)
>>> tot_withinss2
totss(train=False, valid=False, xval=False)[source]

Get the total sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the total sum of squares value for the training data.

  • valid (bool) – If True, return the total sum of squares value for the validation data.

  • xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.

Returns

The total sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> totss = km.totss() # <- Default: return training metrics
>>> totss
withinss(train=False, valid=False, xval=False)[source]

Get the within cluster sum of squares for each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the within cluster sum of squares value for the training data.

  • valid (bool) – If True, return the within cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the within cluster sum of squares value for each of the cross-validated splits.

Returns

The within cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> withinss = km.withinss() # <- Default: return training metrics
>>> withinss
>>> withinss2 = km.withinss(train=True,
...                         valid=True,
...                         xval=True)
>>> withinss2
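
The within, between, and total sums of squares reported for a clustering model are related by the standard identity totss = betweenss + tot_withinss. The sketch below checks that identity on a tiny hand-made dataset. The helper names, points, and cluster assignments are hypothetical, and the numbers will not match H2O output, where the data may be standardized before clustering.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sum_of_squares(points, assignments):
    """Return (withinss per cluster, tot_withinss, betweenss, totss)."""
    grand = centroid(points)
    clusters = {}
    for p, c in zip(points, assignments):
        clusters.setdefault(c, []).append(p)
    withinss = {c: sum(sq_dist(p, centroid(m)) for p in m)
                for c, m in clusters.items()}
    tot_withinss = sum(withinss.values())
    betweenss = sum(len(m) * sq_dist(centroid(m), grand)
                    for m in clusters.values())
    totss = sum(sq_dist(p, grand) for p in points)
    return withinss, tot_withinss, betweenss, totss


# Toy data (hypothetical): four 2-D points in two clusters.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
assignments = [0, 0, 1, 1]

withinss, tot_withinss, betweenss, totss = sum_of_squares(points, assignments)
print(withinss, tot_withinss, betweenss, totss)
```

A good clustering keeps tot_withinss small relative to totss, which is the same as saying that betweenss accounts for most of the total.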
class h2o.model.H2ODimReductionModel[source]

Bases: h2o.model.model_base.ModelBase

Dimension reduction model, such as PCA or GLRM.

archetypes()[source]

The archetypes (Y) of the GLRM model.

final_step()[source]

Get the final step size for the model.

num_iterations()[source]

Get the number of iterations that it took to converge or reach max iterations.

objective()[source]

Get the final value of the objective function.

proj_archetypes(test_data, reverse_transform=False)[source]

Convert archetypes of the model into original feature space.

Parameters
  • test_data (H2OFrame) – The dataset upon which the model was trained.

  • reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the projected archetypes.

Returns

model archetypes projected back into the original training data’s feature space.

reconstruct(test_data, reverse_transform=False)[source]

Reconstruct the training data from the model and impute all missing values.

Parameters
  • test_data (H2OFrame) – The dataset upon which the model was trained.

  • reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the reconstructed frame.

Returns

the approximate reconstruction of the training data.

screeplot(type='barplot', server=False)[source]

Produce the scree plot.

Library matplotlib is required for this function.

Parameters
  • type (str) – either "barplot" or "lines".

  • server (bool) – if True, set server settings for matplotlib and do not show the graph.

varimp(use_pandas=False)[source]

Return the importance of components associated with a PCA model.

Parameters

use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame. (Default is False.)
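
As a rough guide to reading component importances, the sketch below derives the proportion of variance and the cumulative proportion from a list of component standard deviations. This is the standard PCA calculation rather than H2O's code; the function name and the input values are made up for illustration.

```python
def component_importance(std_devs):
    """Return (proportion of variance, cumulative proportion) per component."""
    variances = [s ** 2 for s in std_devs]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical standard deviations of three principal components.
std_devs = [2.0, 1.0, 1.0]
proportions, cumulative = component_importance(std_devs)
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print("PC%d" % i, round(p, 4), round(c, 4))
```

A scree plot shows the same per-component information as bars or a line, which helps in choosing how many components to keep.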

class h2o.model.H2OAutoEncoderModel[source]

Bases: h2o.model.model_base.ModelBase

anomaly(test_data, per_feature=False)[source]

Obtain the reconstruction error for the input test_data.

Parameters
  • test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.

  • per_feature (bool) – Whether to return the square reconstruction error per feature. Otherwise, return the mean square error.

Returns

the reconstruction error.

Examples

>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train = train[predictors]
>>> test = test[predictors]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[2],
...                                    l1=1e-5,
...                                    ignore_const_cols=False,
...                                    epochs=1)
>>> ae_model.train(x=predictors,training_frame=train)
>>> test_rec_error = ae_model.anomaly(test)
>>> test_rec_error
>>> test_rec_error_features = ae_model.anomaly(test, per_feature=True)
>>> test_rec_error_features
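
The difference between the two return shapes can be sketched in plain Python for a single row: with per_feature set, one squared error per input feature; otherwise, their mean. This only illustrates the definitions given above. The helper reconstruction_error and the values are hypothetical, and any scaling H2O applies to the inputs is ignored.

```python
def reconstruction_error(row, reconstructed, per_feature=False):
    """Squared reconstruction error for one row.

    per_feature=True  -> list with one squared error per feature
    per_feature=False -> mean of those squared errors (MSE)
    """
    squared = [(x - r) ** 2 for x, r in zip(row, reconstructed)]
    if per_feature:
        return squared
    return sum(squared) / len(squared)


# Toy row and its reconstruction by an autoencoder (hypothetical values).
row = [0.0, 1.0, 0.5, 0.0]
reconstructed = [0.5, 1.0, 0.0, 0.0]

mse = reconstruction_error(row, reconstructed)
per_feature_errors = reconstruction_error(row, reconstructed, per_feature=True)
print(mse, per_feature_errors)
```

Rows with an unusually large mean error are the ones the autoencoder reconstructs poorly, which is what makes this value useful as an anomaly score.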
class h2o.model.ConfusionMatrix(cm, domains=None, table_header=None)[source]

Bases: object

ROUND = 4
static read_cms(cms=None, domains=None)[source]

Read confusion matrices from the list of sources (?).

show()[source]

Print the confusion matrix into the console.

to_list()[source]

Convert this confusion matrix into a 2x2 plain list of values.

class h2o.model.H2OModelFuture(job, x)[source]

Bases: object

A class representing a future H2O model (a model that may, or may not, be in the process of being built).

poll()[source]
class h2o.model.H2OSegmentModels(segment_models_id=None)[source]

Bases: h2o.base.Keyed

Collection of H2O Models built for each input segment.

Parameters

segment_models_id – identifier of this collection of Segment Models

Example

>>> segment_models = h2o.model.segment_models.H2OSegmentModels(segment_models_id="my_sm_id")
>>> segment_models.as_frame()
as_frame()[source]

Converts this collection of models to a tabular representation.

Returns

An H2OFrame; the first columns identify the input segments, and the remaining columns describe the built models.

detach()[source]

Detach the Python object from the backend, usually by clearing its key

property key
Returns

the unique key representing the object on the backend

ModelBase

class h2o.model.model_base.ModelBase[source]

Bases: h2o.model.model_base.ModelBase

Base class for all models.

property actual_params

Dictionary of actual parameters of the model.

aic(train=False, valid=False, xval=False)[source]

Get the AIC (Akaike Information Criterion).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AIC value for the training data.

  • valid (bool) – If valid is True, then return the AIC value for the validation data.

  • xval (bool) – If xval is True, then return the AIC value for the cross-validation data.

Returns

The AIC.
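
For orientation, the textbook definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies that formula to two made-up models. How H2O derives the likelihood for a given model family is not covered here, and the function name and numbers are assumptions.

```python
def aic(log_likelihood, num_params):
    """Textbook AIC: 2k - 2 * ln(L), given the log-likelihood ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# Two hypothetical fitted models: (log-likelihood, number of parameters).
small_model = aic(log_likelihood=-120.5, num_params=3)
large_model = aic(log_likelihood=-119.8, num_params=6)

# Lower AIC is preferred; the penalty 2k discourages extra parameters.
preferred = "small" if small_model < large_model else "large"
print(small_model, large_model, preferred)
```

Here the larger model fits slightly better, but not by enough to pay for its three extra parameters.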

auc(train=False, valid=False, xval=False)[source]

Get the AUC (Area Under Curve).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AUC value for the training data.

  • valid (bool) – If valid is True, then return the AUC value for the validation data.

  • xval (bool) – If xval is True, then return the AUC value for the cross-validation data.

Returns

The AUC.
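
The train/valid/xval flags follow the same pattern across the metric accessors. The sketch below mimics that selection logic with a hypothetical helper, select_metrics, operating on a plain dictionary. The behavior when exactly one flag is set (returning the bare value rather than a one-entry dictionary) is an assumption, since the description above only covers the no-flag and multi-flag cases.

```python
def select_metrics(metrics, train=False, valid=False, xval=False):
    """Pick metric values by flag from a dict keyed 'train', 'valid', 'xval'."""
    flags = {"train": train, "valid": valid, "xval": xval}
    chosen = [name for name, on in flags.items() if on]
    if not chosen:
        # No flag set: fall back to the training metric.
        return metrics["train"]
    if len(chosen) == 1:
        # Assumed: a single flag yields the bare value.
        return metrics[chosen[0]]
    # Several flags: return a dictionary keyed by the chosen names.
    return {name: metrics[name] for name in chosen}


# Hypothetical AUC values for one model.
auc_values = {"train": 0.98, "valid": 0.93, "xval": 0.91}

default_value = select_metrics(auc_values)
several = select_metrics(auc_values, valid=True, xval=True)
print(default_value, several)
```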

aucpr(train=False, valid=False, xval=False)[source]

Get the AUCPR (Area Under the Precision-Recall Curve).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the aucpr value for the training data.

  • valid (bool) – If valid is True, then return the aucpr value for the validation data.

  • xval (bool) – If xval is True, then return the aucpr value for the cross-validation data.

Returns

The aucpr.

biases(vector_id=0)[source]

Return the frame for the respective bias vector.

Parameters

vector_id – an integer, ranging from 0 to number of layers, that specifies the bias vector to return.

Returns

an H2OFrame which represents the bias vector identified by vector_id

catoffsets()[source]

Categorical offsets for one-hot encoding.

coef()[source]

Return the coefficients which can be applied to the non-standardized data.

Note: standardize = True by default. If it is set to False, then coef() returns the coefficients that were fit directly.

coef_norm()[source]

Return coefficients fitted on the standardized data (requires standardize = True, which is on by default).

These coefficients can be used to evaluate variable importance.
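
The relationship between the two sets of coefficients can be sketched with ordinary algebra. If each predictor is standardized as z = (x - mean) / sd, then a linear predictor written in terms of z can be rewritten in terms of x by dividing each coefficient by its sd and adjusting the intercept. The helper destandardize, the "Intercept" key, and all numbers below are assumptions for illustration, not values or code taken from H2O.

```python
def destandardize(coef_norm, means, sds):
    """Convert coefficients fitted on standardized predictors into
    coefficients that apply to the raw predictors."""
    coef = {name: coef_norm[name] / sds[name] for name in means}
    coef["Intercept"] = coef_norm["Intercept"] - sum(
        coef_norm[name] * means[name] / sds[name] for name in means
    )
    return coef


# Hypothetical standardized coefficients and predictor statistics.
coef_norm = {"Intercept": 20.0, "weight": -4.0, "year": 2.0}
means = {"weight": 3000.0, "year": 76.0}
sds = {"weight": 800.0, "year": 4.0}

coef = destandardize(coef_norm, means, sds)

# Both forms give the same linear predictor for any row.
row = {"weight": 3400.0, "year": 78.0}
from_raw = coef["Intercept"] + sum(coef[n] * row[n] for n in row)
from_std = coef_norm["Intercept"] + sum(
    coef_norm[n] * (row[n] - means[n]) / sds[n] for n in row
)
print(coef, from_raw, from_std)
```

Standardized coefficients are comparable across predictors because each one is expressed per standard deviation, which is why they are the ones suited to judging variable importance.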

cross_validation_fold_assignment()[source]

Obtain the cross-validation fold assignment for all rows in the training data.

Returns

H2OFrame

cross_validation_holdout_predictions()[source]

Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.

This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.

Returns

H2OFrame

cross_validation_metrics_summary()[source]

Retrieve Cross-Validation Metrics Summary.

Returns

The cross-validation metrics summary as an H2OTwoDimTable

cross_validation_models()[source]

Obtain a list of cross-validation models.

Returns

list of H2OModel objects.

cross_validation_predictions()[source]

Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.

Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.

Returns

list of H2OFrame objects.
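
The statement that summing the per-fold frames yields the combined holdout predictions follows from the 0 fill-in: each row is held out in exactly one fold, so it is non-zero in at most one of the expanded vectors. The sketch below reproduces that with plain lists; the fold assignment, prediction values, and helper name are hypothetical.

```python
def expand_holdout(fold_assignment, fold, fold_predictions):
    """Expand one fold's holdout predictions to the full number of rows,
    filling rows that belong to other folds with 0."""
    remaining = iter(fold_predictions)
    return [next(remaining) if f == fold else 0.0 for f in fold_assignment]


# Hypothetical 3-fold assignment for six training rows.
fold_assignment = [0, 1, 2, 0, 1, 2]
# Hypothetical holdout predictions of each cross-validation model,
# listed in row order for the rows that belong to that fold.
per_fold = {0: [0.9, 0.2], 1: [0.4, 0.7], 2: [0.1, 0.8]}

expanded = [expand_holdout(fold_assignment, f, per_fold[f])
            for f in sorted(per_fold)]
# Element-wise sum across folds gives one holdout prediction per row.
combined = [sum(col) for col in zip(*expanded)]
print(expanded)
print(combined)
```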

deepfeatures(test_data, layer)[source]

Return hidden layer details.

Parameters
  • test_data – Data to create a feature space on

  • layer – Zero-based index of the hidden layer.

property default_params

Dictionary of the default parameters of the model.

detach()[source]

Detach the Python object from the backend, usually by clearing its key

download_model(path='', filename=None)[source]

Download an H2O Model object to disk.

Parameters
  • path – a path to the directory where the model should be saved.

  • filename – a filename for the saved model

Returns

the path of the downloaded model

download_mojo(path='.', get_genmodel_jar=False, genmodel_name='')[source]

Download the model in MOJO format.

Parameters
  • path – the path where MOJO file should be saved.

  • get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.

  • genmodel_name – Custom name of genmodel jar

Returns

name of the MOJO file written.

download_pojo(path='', get_genmodel_jar=False, genmodel_name='')[source]

Download the POJO for this model to the directory specified by path.

If path is an empty string, then dump the output to screen.

Parameters
  • path – An absolute path to the directory where POJO should be saved.

  • get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.

  • genmodel_name – Custom name of genmodel jar

Returns

name of the POJO file written.

property end_time

Timestamp (milliseconds since 1970) when model training ended.
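
A millisecond timestamp of this kind can be turned into a readable date with the standard library. The sketch below assumes that "since 1970" means the Unix epoch in UTC; the helper name and the example value are hypothetical and stand in for a real end_time.

```python
from datetime import datetime, timezone


def millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Hypothetical value standing in for model.end_time.
example_end_time = 1_600_000_000_000
print(millis_to_datetime(example_end_time).isoformat())
```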

explain(frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r')

Generate model explanations on frame data set.

The H2O Explainability Interface is a convenient wrapper for a number of explainability methods and visualizations in H2O. The function can be applied to a single model or a group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters
  • models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)

  • frame – H2OFrame

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • include_explanations – if specified, return only the specified model explanations (Mutually exclusive with exclude_explanations)

  • exclude_explanations – exclude specified model explanations

  • plot_overrides – overrides for individual model explanations

  • figsize – figure size; passed directly to matplotlib

  • render – if True, render the model explanations; otherwise model explanations are just returned

Returns

H2OExplanation containing the model explanations including headers and descriptions

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
explain_row(frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True)

Generate model explanations on frame data set for a given instance.

Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.

Parameters
  • models – H2OAutoML object, supervised H2O model, or list of supervised H2O models

  • frame – H2OFrame

  • row_index – row index of the instance to inspect

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • include_explanations – if specified, return only the specified model explanations (Mutually exclusive with exclude_explanations)

  • exclude_explanations – exclude specified model explanations

  • plot_overrides – overrides for individual model explanations

  • qualitative_colormap – a colormap name

  • figsize – figure size; passed directly to matplotlib

  • render – if True, render the model explanations; otherwise model explanations are just returned

Returns

H2OExplanation containing the model explanations including headers and descriptions

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
feature_frequencies(test_data)[source]

Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.

Parameters

test_data (H2OFrame) – Data on which to calculate feature frequencies.

Returns

A new H2OFrame of feature frequencies.

feature_interaction(max_interaction_depth=100, max_tree_depth=100, max_deepening=-1, path=None)[source]

Feature interactions and importance, leaf statistics and split value histograms in a tabular form. Available for XGBoost and GBM.

Metrics
  • Gain – Total gain of each feature or feature interaction.

  • FScore – Amount of possible splits taken on a feature or feature interaction.

  • wFScore – Amount of possible splits taken on a feature or feature interaction, weighted by the probability of the splits to take place.

  • Average wFScore – wFScore divided by FScore.

  • Average Gain – Gain divided by FScore.

  • Expected Gain – Total gain of each feature or feature interaction, weighted by the probability to gather the gain.

  • Average Tree Index

  • Average Tree Depth

Parameters
  • max_interaction_depth – Upper bound for extracted feature interactions depth. Defaults to 100.

  • max_tree_depth – Upper bound for tree depth. Defaults to 100.

  • max_deepening – Upper bound for interaction start deepening (zero deepening => interactions starting at root only). Defaults to -1.

  • path – (Optional) Path where to save the output in .xlsx format (e.g. /mypath/file.xlsx). Please note that Pandas and XlsxWriter need to be installed for using this option. Defaults to None.
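
The two averaged metrics described above are plain ratios of other columns. As a small illustration, the sketch below derives them for one hypothetical table row; the function name and the numbers are assumptions and are not taken from H2O output.

```python
def averaged_metrics(gain, fscore, wfscore):
    """Return (average gain, average wFScore) for one feature or interaction."""
    if fscore == 0:
        raise ValueError("FScore must be positive to compute averages")
    return gain / fscore, wfscore / fscore


# Hypothetical row of the interaction table: Gain=120.0, FScore=8, wFScore=6.0
avg_gain, avg_wfscore = averaged_metrics(gain=120.0, fscore=8, wfscore=6.0)
print(avg_gain, avg_wfscore)
```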

Examples

>>> import h2o
>>> from h2o.estimators import H2OXGBoostEstimator
>>> h2o.init()
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(seed=1234)
>>> boston_xgb.train(y=response, x=predictors, training_frame=train)
>>> feature_interactions = boston_xgb.feature_interaction()
property full_parameters

Dictionary of the full specification of all parameters.

get_xval_models(key=None)[source]

Return a Model object.

Parameters

key – If None, return all cross-validated models; otherwise return the model that key points to.

Returns

A model or list of models.

gini(train=False, valid=False, xval=False)[source]

Get the Gini coefficient.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Gini Coefficient value for the training data.

  • valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.

  • xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.

Returns

The Gini Coefficient for this binomial model.
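
For a binary classifier, the Gini coefficient is commonly derived from the AUC as Gini = 2 * AUC - 1. The sketch below uses that relation on a tiny hand-made example; this section does not state whether H2O computes the value in exactly this way, so treat the relation as an assumption. The function names and data are illustrative only.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient under the assumed relation Gini = 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), gini(labels, scores))  # 0.75 0.5
```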

h(frame, variables)[source]

Calculates Friedman and Popescu’s H statistic to test for the presence of an interaction between the specified variables in H2O GBM and XGBoost models. H varies from 0 to 1. It is 0 if the model exhibits no interaction between the specified variables and correspondingly larger for a stronger interaction effect between them. NaN is returned if the computation is spoiled by weak main effects and rounding errors.

See Jerome H. Friedman and Bogdan E. Popescu, 2008, “Predictive learning via rule ensembles”, Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.

Parameters
  • frame – the frame that current model has been fitted to

  • variables – variables of the interest

Returns

H statistic of the variables
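
To make the statistic concrete, here is a minimal pure-Python sketch of the pairwise H statistic as defined in the cited paper. It compares the centered two-variable partial dependence with the sum of the two centered one-variable partial dependences. It is not H2O's implementation: the toy models, the data grid, and the helper names are all assumptions made for illustration.

```python
from math import sqrt


def centered_pd(predict, rows, cols):
    """Partial dependence on `cols`, evaluated at every row and centered to mean 0."""
    values = []
    for target in rows:
        total = 0.0
        for background in rows:
            point = list(background)
            for c in cols:
                point[c] = target[c]  # fix the columns of interest
            total += predict(point)
        values.append(total / len(rows))
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def h_statistic(predict, rows, j, k):
    """Pairwise H statistic for columns j and k (Friedman and Popescu, 2008)."""
    pd_jk = centered_pd(predict, rows, [j, k])
    pd_j = centered_pd(predict, rows, [j])
    pd_k = centered_pd(predict, rows, [k])
    numerator = sum((a - b - c) ** 2 for a, b, c in zip(pd_jk, pd_j, pd_k))
    denominator = sum(a ** 2 for a in pd_jk)
    return sqrt(numerator / denominator)


def additive(row):
    """Toy model without any interaction."""
    return 3.0 * row[0] + row[1]


def interacting(row):
    """Toy model consisting of a pure product term."""
    return row[0] * row[1]


rows = [[x0, x1] for x0 in (0.0, 1.0, 2.0) for x1 in (0.0, 1.0, 2.0)]
print(h_statistic(additive, rows, 0, 1))     # no interaction
print(h_statistic(interacting, rows, 0, 1))  # interaction present
```

The additive model yields 0, while the product model yields a clearly positive value, matching the interpretation given above.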

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate_train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
>>> prostate_train["CAPSULE"] = prostate_train["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=100, learn_rate=0.1,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="bernoulli")
>>> gbm_h2o.train(x=list(range(1,prostate_train.ncol)),y="CAPSULE", training_frame=prostate_train)
>>> # H statistic for the interaction between DPROS and DCAPS
>>> h = gbm_h2o.h(prostate_train, ['DPROS','DCAPS'])
property have_mojo

True, if export to MOJO is possible

property have_pojo

True, if export to POJO is possible

ice_plot(frame, column, target=None, max_levels=30, figsize=(16, 9), colormap='plasma')

Plot Individual Conditional Expectations (ICE) for each decile.

An individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. An ICE plot is similar to a partial dependence plot (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. The following plot shows the effect for each decile. In contrast to a partial dependence plot, an ICE plot can provide more insight, especially when there is a stronger feature interaction.

Parameters
  • model – H2OModel

  • frame – H2OFrame

  • column – string containing column name

  • target – (only for multinomial classification) for what target should the plot be done

  • max_levels – maximum number of factor levels to show

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap name

Returns

a matplotlib figure object
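
The difference between ICE and PDP can be seen without any plotting. The sketch below computes ICE curves for individual rows of a toy model with an interaction and then averages them into a partial dependence curve. The model, data, and helper names are hypothetical. Unlike ice_plot, which draws one curve per decile, this sketch simply uses every row.

```python
def predict(row):
    """Hypothetical model with an interaction between column 0 and column 1."""
    return row[0] * row[1]


rows = [[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]]
grid = [0.0, 1.0, 2.0]  # values substituted into column 0


def ice_curve(row, column, grid):
    """Predictions for one row as `column` sweeps over `grid`."""
    curve = []
    for value in grid:
        modified = list(row)
        modified[column] = value
        curve.append(predict(modified))
    return curve


ice = [ice_curve(row, 0, grid) for row in rows]

# Averaging the ICE curves pointwise gives the partial dependence curve.
pdp = [sum(curve[i] for curve in ice) / len(ice) for i in range(len(grid))]
print(ice)
print(pdp)
```

Here the averaged curve is flat even though the individual curves slope in opposite directions, which is the kind of interaction effect a PDP alone would hide.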

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the individual conditional expectations plot
>>> gbm.ice_plot(test, column="alcohol")
is_cross_validated()[source]

Return True if the model was cross-validated.

property key
Returns

the unique key representing the object on the backend

learning_curve_plot(metric='AUTO', cv_ribbon=None, cv_lines=None, figsize=(16, 9), colormap=None)

Learning curve

Create learning curve plot for an H2O Model. Learning curves show error metric dependence on learning progress, e.g., RMSE vs number of trees trained so far in GBM. There can be up to 4 curves showing Training, Validation, Training on CV Models, and Cross-validation error.

Parameters
  • model – an H2O model

  • metric – a stopping metric

  • cv_ribbon – if True, plot the CV mean as a line and the CV standard deviation as a ribbon around the mean; if None, it will attempt to automatically determine whether this is a suitable visualisation

  • cv_lines – if True, plot scoring history for individual CV models; if None, it will attempt to automatically determine whether this is a suitable visualisation

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap to use

Returns

a matplotlib figure
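
As a rough illustration of what a CV ribbon summarizes, the sketch below computes a mean line and a band of one standard deviation from per-fold scoring histories. The histories are invented, and the use of the population standard deviation is an assumption; this section does not say which estimator the plot uses.

```python
from statistics import mean, pstdev

# Hypothetical RMSE scoring histories of three CV models, one value per scoring step.
cv_histories = [
    [1.00, 0.80, 0.70],
    [1.20, 0.90, 0.75],
    [1.10, 1.00, 0.80],
]

steps = list(zip(*cv_histories))     # regroup the values by scoring step
cv_mean = [mean(s) for s in steps]   # the line
cv_std = [pstdev(s) for s in steps]  # half-width of the ribbon
ribbon = [(m - s, m + s) for m, s in zip(cv_mean, cv_std)]
print(cv_mean)
print(ribbon)
```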

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the learning curve plot
>>> gbm.learning_curve_plot()
logloss(train=False, valid=False, xval=False)[source]

Get the Log Loss.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the log loss value for the training data.

  • valid (bool) – If valid is True, then return the log loss value for the validation data.

  • xval (bool) – If xval is True, then return the log loss value for the cross validation data.

Returns

The log loss for this model.
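
For reference, the standard binary log loss can be computed by hand as follows. This is a generic sketch of the textbook formula, not H2O's code; the clipping constant eps and the function name are choices made for this example.

```python
from math import log


def binary_logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(labels)


print(binary_logloss([1, 0, 1], [0.9, 0.1, 0.8]))
```

A model that always predicts 0.5 scores log(2), and confident correct predictions push the value toward 0.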

mae(train=False, valid=False, xval=False)[source]

Get the Mean Absolute Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the MAE value for the training data.

  • valid (bool) – If valid is True, then return the MAE value for the validation data.

  • xval (bool) – If xval is True, then return the MAE value for the cross validation data.

Returns

The MAE for this regression model.
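
The metric itself is the average absolute difference between actual and predicted values. A minimal sketch with made-up numbers:

```python
def mae(actual, predicted):
    """Mean absolute error of paired actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
print(mae(actual, predicted))  # 0.875
```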

mean_residual_deviance(train=False, valid=False, xval=False)[source]

Get the Mean Residual Deviance.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.

  • valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.

  • xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.

Returns

The Mean Residual Deviance for this regression model.

property model_id

Model identifier.

model_performance(test_data=None, train=False, valid=False, xval=False, auc_type='none')[source]

Generate model metrics for this model on test_data.

Parameters
  • test_data (H2OFrame) – Data set against which model metrics are computed. The train, valid and xval arguments are all ignored if test_data is not None.

  • train (bool) – Report the training metrics for the model.

  • valid (bool) – Report the validation metrics for the model.

  • xval (bool) – Report the cross-validation metrics for the model. If train and valid are True, then it defaults to True.

  • auc_type (String) – Change default AUC type for multinomial classification AUC/AUCPR calculation when test_data is not None. One of: "auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo" (default: "none"). If type is “auto” or “none”, AUC and AUCPR are not calculated.

Returns

An object of class H2OModelMetrics.

mse(train=False, valid=False, xval=False)[source]

Get the Mean Square Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the MSE value for the training data.

  • valid (bool) – If valid is True, then return the MSE value for the validation data.

  • xval (bool) – If xval is True, then return the MSE value for the cross validation data.

Returns

The MSE for this regression model.

normmul()[source]

Normalization/Standardization multipliers for numeric predictors.

normsub()[source]

Normalization/Standardization offsets for numeric predictors.
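
As a rough sketch of how an offset and a multiplier are typically applied, the example below standardizes a column as (x - offset) * multiplier. It assumes the common convention that the offset is the column mean and the multiplier is the reciprocal of the sample standard deviation. This section does not confirm that H2O uses exactly these definitions, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardization_terms(column):
    """Return (offset, multiplier) so that (x - offset) * multiplier is standardized."""
    offset = mean(column)
    multiplier = 1.0 / stdev(column)  # assumption: sample standard deviation
    return offset, multiplier


column = [2.0, 4.0, 6.0, 8.0]
offset, multiplier = standardization_terms(column)
standardized = [(x - offset) * multiplier for x in column]
print(offset, multiplier)
print(standardized)
```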

ntrees_actual()[source]

Returns the actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value the user set before building the model.

Type: float

null_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the null dof, or None if it is not present.

null_deviance(train=False, valid=False, xval=False)[source]

Retrieve the null deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the null deviance, or None if it is not present.

property params

Get the parameters and the actual/default values only.

Returns

A dictionary of parameters used to build this model.

partial_plot(data, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_to_file=None, row_index=None, targets=None)[source]

Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.

Parameters
  • data (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.

  • cols – Feature(s) for which partial dependence will be calculated.

  • destination_key – A key reference to the created partial dependence tables in H2O.

  • nbins – Number of bins used. For categorical columns make sure the number of bins exceeds the level count. If you enable include_na, the returned length will be nbins+1.

  • weight_column – A string denoting which column of data should be used as the weight column.

  • plot – A boolean specifying whether to plot partial dependence table.

  • plot_stddev – A boolean specifying whether to add std err to partial dependence plot.

  • figsize – Dimension/size of the returning plots, adjust to fit your output cells.

  • server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.

  • include_na – A boolean specifying whether missing value should be included in the Feature values.

  • user_splits – a dictionary containing column names as key and user defined split values as value in a list.

  • col_pairs_2dpdp – list containing pairs of column names for 2D pdp

  • save_to_file – Fully qualified name of an image file the resulting plot should be saved to, e.g. ‘/home/user/pdpplot.png’. The ‘png’ postfix may be omitted. If the file already exists, it will be overwritten. Plot is only saved if plot = True.

  • row_index – Row for which partial dependence will be calculated instead of the whole input frame.

  • targets – Target classes for multiclass model.

Returns

Plot and list of calculated mean response tables for each feature requested.
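
The mean response table behind a partial dependence plot can be sketched in a few lines: for each grid value, the column of interest is overwritten in every row and the predictions are averaged. The model, data, and grid below are hypothetical, and the grid is supplied by hand rather than derived from nbins as H2O would do.

```python
def predict(row):
    """Hypothetical fitted model: a simple linear function of two columns."""
    return 2.0 * row[0] + row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 6.0]]


def partial_dependence(rows, column, grid):
    """Mean prediction over all rows with `column` forced to each grid value."""
    table = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[column] = value
            preds.append(predict(modified))
        table.append((value, sum(preds) / len(preds)))
    return table


print(partial_dependence(rows, column=0, grid=[0.0, 1.5, 3.0]))
```

For this linear model the mean response is 2 * value plus the mean of the other column, which is 3.0 here.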

pd_plot(frame, column, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2')

Plot partial dependence plot.

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.

Parameters
  • model – H2O Model object

  • frame – H2OFrame

  • column – string containing column name

  • row_index – if None, do partial dependence; if an integer, do individual conditional expectation for the row specified by this integer

  • target – (only for multinomial classification) for what target should the plot be done

  • max_levels – maximum number of factor levels to show

  • figsize – figure size; passed directly to matplotlib

  • colormap – colormap name; used to get just the first color to keep the api and color scheme similar with pd_multi_plot

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create partial dependence plot
>>> gbm.pd_plot(test, column="alcohol")
permutation_importance(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, use_pandas=False)[source]

Get Permutation Variable Importance.

When n_repeats == 1, the result is similar to the one from the varimp() method, i.e., it contains the following columns: “Relative Importance”, “Scaled Importance”, and “Percentage”.

When n_repeats > 1, the individual columns correspond to the permutation variable importance values from individual runs. Each value corresponds to the “Relative Importance”, i.e., the distance between the original prediction error and the prediction error using a frame with the given feature permuted.

Parameters
  • frame – training frame

  • metric – metric to be used. One of “AUTO”, “AUC”, “MAE”, “MSE”, “RMSE”, “logloss”, “mean_per_class_error”, “PR_AUC”. Defaults to “AUTO”.

  • n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10000.

  • n_repeats – number of repeated evaluations. Defaults to 1.

  • features – features to include in the permutation importance. Use None to include all.

  • seed – seed for the random generator. Use -1 to pick a random seed. Defaults to -1.

  • use_pandas – set true to return pandas data frame.

Returns

H2OTwoDimTable or Pandas data frame
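
The underlying idea, measuring how much the prediction error grows when one feature is shuffled, can be sketched in pure Python. This is a generic illustration under simple assumptions (MSE as the metric, a single repeat, a toy model) and does not reproduce H2O's sampling or scaling; all names are hypothetical.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in MSE after shuffling each feature column, one at a time."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    importances = []
    for j in range(n_features):
        shuffled = [r[j] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shuffled)]
        importances.append(mse(predict, permuted, targets) - baseline)
    return importances


def predict(row):
    """Hypothetical model that uses column 0 and ignores column 1."""
    return 2.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [2.0 * r[0] for r in rows]
importances = permutation_importance(predict, rows, targets, n_features=2)
print(importances)
```

Shuffling the ignored column leaves the error unchanged, so its importance is zero, while shuffling the used column raises the error.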

permutation_importance_plot(frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, num_of_features=10, server=False)[source]

Plot Permutation Variable Importance. This method plots either a bar plot or if n_repeats > 1 a box plot and returns the variable importance table.

Parameters
  • frame – training frame

  • metric – metric to be used. One of “AUTO”, “AUC”, “MAE”, “MSE”, “RMSE”, “logloss”, “mean_per_class_error”, “PR_AUC”. Defaults to “AUTO”.

  • n_samples – number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10000.

  • n_repeats – number of repeated evaluations. Defaults to 1.

  • features – features to include in the permutation importance. Use None to include all.

  • seed – seed for the random generator. Use -1 to pick a random seed. Defaults to -1.

  • num_of_features – number of features to plot. Defaults to 10.

  • server – if true set server settings to matplotlib and do not show the plot

Returns

H2OTwoDimTable with variable importance

pprint_coef()[source]

Pretty print the coefficients table (includes normalized coefficients).

pr_auc(train=False, valid=False, xval=False)[source]

ModelBase.pr_auc is deprecated, please use ModelBase.aucpr instead.

predict(test_data, custom_metric=None, custom_metric_func=None)[source]

Predict on a dataset.

Parameters
  • test_data (H2OFrame) – Data on which to make predictions.

  • custom_metric – custom evaluation function defined as a class reference; the class is uploaded into the cluster

  • custom_metric_func – custom evaluation function reference, e.g., the result of upload_custom_metric

Returns

A new H2OFrame of predictions.

predict_contributions(test_data, output_format='Original', top_n=None, bottom_n=None, compare_abs=False)[source]

Predict feature contributions - SHAP values on an H2O Model (only GBM, XGBoost, DRF models and equivalent imported MOJOs).

Returned H2OFrame has shape (#rows, #features + 1) - there is a feature contribution column for each input feature, the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.

Note: Multinomial classification models are currently not supported.

Parameters
  • test_data (H2OFrame) – Data on which to calculate contributions.

  • output_format (Enum) – Specify how to output feature contributions in XGBoost - XGBoost by default outputs contributions for 1-hot encoded features, specifying a Compact output format will produce a per-feature contribution. One of: "Original", "Compact" (default: "Original").

  • top_n – Return only the #top_n highest contributions + bias. If top_n<0, then sort all SHAP values in descending order. If top_n<0 && bottom_n<0, then sort all SHAP values in descending order.

  • bottom_n – Return only the #bottom_n lowest contributions + bias. If top_n and bottom_n are defined together, then return an array of #top_n + #bottom_n + bias. If bottom_n<0, then sort all SHAP values in ascending order. If top_n<0 && bottom_n<0, then sort all SHAP values in descending order.

  • compare_abs – True to compare absolute values of contributions

Returns

A new H2OFrame made of feature contributions.
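
The additive property described above can be checked by hand. The sketch below sums hypothetical contributions and a bias term to obtain the raw prediction, then applies an inverse logit to get a probability. The numbers are invented, and the logit link for a bernoulli model is an assumption of this example, not something stated in this section.

```python
from math import exp

# Hypothetical contributions for one row of a bernoulli model, plus the bias term.
contributions = {"AGE": 0.30, "PSA": -0.10, "GLEASON": 0.55}
bias = -0.25

# Raw prediction on the link scale: sum of contributions plus bias.
raw_prediction = sum(contributions.values()) + bias


def inverse_logit(z):
    """Map a raw (log-odds) value to a probability."""
    return 1.0 / (1.0 + exp(-z))


probability = inverse_logit(raw_prediction)
print(raw_prediction, probability)
```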

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> h2o.init()
>>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
>>> fr = h2o.import_file(prostate)
>>> predictors = list(range(2, fr.ncol))
>>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234)
>>> m.train(x=predictors, y=1, training_frame=fr)
>>> # Compute SHAP
>>> m.predict_contributions(fr)
>>> # Compute SHAP and pick the top two highest
>>> m.predict_contributions(fr, top_n=2)
>>> # Compute SHAP and pick the top two lowest
>>> m.predict_contributions(fr, bottom_n=2)
>>> # Compute SHAP and pick the top two highest regardless of the sign
>>> m.predict_contributions(fr, top_n=2, compare_abs=True)
>>> # Compute SHAP and pick top two lowest regardless of the sign
>>> m.predict_contributions(fr, bottom_n=2, compare_abs=True)
>>> # Compute SHAP values and show them all in descending order
>>> m.predict_contributions(fr, top_n=-1)
>>> # Compute SHAP and pick the top two highest and top two lowest
>>> m.predict_contributions(fr, top_n=2, bottom_n=2)
predict_leaf_node_assignment(test_data, type='Path')[source]

Predict on a dataset and return the leaf node assignment (only for tree-based models).

Parameters
  • test_data (H2OFrame) – Data on which to make predictions.

  • type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node of the tree to the node or by H2O’s internal node id. One of: "Path", "Node_ID" (default: "Path").

Returns

A new H2OFrame of predictions.

r2(train=False, valid=False, xval=False)[source]

Return the R squared for this regression model.

Will return R^2 for GLM Models and will return NaN otherwise.

The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the R^2 value for the training data.

  • valid (bool) – If valid is True, then return the R^2 value for the validation data.

  • xval (bool) – If xval is True, then return the R^2 value for the cross validation data.

Returns

The R squared for this regression model.
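
The definition 1 - MSE/var given above can be worked through on a small example. The sketch assumes that var is the population variance of the actual response values, since the section only says that it is computed as sigma*sigma. The data and function name are illustrative.

```python
def r2(actual, predicted):
    """R^2 = 1 - MSE / var, with var the population variance of the actual values."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mean_actual = sum(actual) / n
    var = sum((a - mean_actual) ** 2 for a in actual) / n
    return 1.0 - mse / var


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(r2(actual, predicted))
```

Here MSE is 0.25 and var is 1.25, so the result is approximately 0.8.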

residual_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the residual dof, or None if it is not present.

residual_deviance(train=False, valid=False, xval=None)[source]

Retrieve the residual deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.

Returns

Return the residual deviance, or None if it is not present.

respmul()[source]

Normalization/Standardization multipliers for numeric response.

respsub()[source]

Normalization/Standardization offsets for numeric response.

rmse(train=False, valid=False, xval=False)[source]

Get the Root Mean Square Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the RMSE value for the training data.

  • valid (bool) – If valid is True, then return the RMSE value for the validation data.

  • xval (bool) – If xval is True, then return the RMSE value for the cross validation data.

Returns

The RMSE for this regression model.
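
RMSE is the square root of the mean squared error, which puts the error back on the scale of the response. A minimal sketch with made-up values:

```python
from math import sqrt


def mse(actual, predicted):
    """Mean squared error of paired actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(mse(actual, predicted))


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [1.0, 5.0, 4.0, 9.0]
print(mse(actual, predicted), rmse(actual, predicted))
```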

rmsle(train=False, valid=False, xval=False)[source]

Get the Root Mean Squared Logarithmic Error.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the RMSLE value for the training data.

  • valid (bool) – If valid is True, then return the RMSLE value for the validation data.

  • xval (bool) – If xval is True, then return the RMSLE value for the cross validation data.

Returns

The RMSLE for this regression model.
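
RMSLE applies the RMSE formula to log(1 + value), so it penalizes relative rather than absolute differences. The sketch below follows that common definition; treating it as H2O's exact formula is an assumption, and the example values are invented.

```python
from math import log1p, sqrt


def rmsle(actual, predicted):
    """RMSE computed on log(1 + value); inputs must be greater than -1."""
    squared = [(log1p(p) - log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


print(rmsle([10.0, 100.0], [10.0, 100.0]))
print(rmsle([10.0, 1000.0], [20.0, 2000.0]))
```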

rotation()[source]

Obtain the rotations (eigenvectors) for a PCA model

Returns

H2OFrame

property run_time

Model training time in milliseconds

save_model_details(path='', force=False, filename=None)[source]

Save Model Details of an H2O Model in JSON Format to disk.

Parameters
  • path – a path to save the model details at (hdfs, s3, local)

  • force – if True overwrite destination directory in case it exists, or throw exception if set to False.

  • filename – a filename for the saved model (file type is always .json)

Returns str

the path of the saved model details

save_mojo(path='', force=False, filename=None)[source]

Save an H2O Model as MOJO (Model Object, Optimized) to disk.

Parameters
  • path – a path to save the model at (hdfs, s3, local)

  • force – if True overwrite destination directory in case it exists, or throw exception if set to False.

  • filename – a filename for the saved model (file type is always .zip)

Returns str

the path of the saved model

score_history()[source]

DEPRECATED. Use scoring_history() instead.

scoring_history()[source]

Retrieve Model Score History.

Returns

The score history as an H2OTwoDimTable or a Pandas DataFrame.

shap_explain_row_plot(frame, row_index, columns=None, top_n_features=10, figsize=(16, 9), plot_type='barplot', contribution_type='both')

SHAP local explanation

SHAP explanation shows the contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., the prediction before applying the inverse link function. H2O implements TreeSHAP, which, when the features are correlated, can increase the contribution of a feature that had no influence on the prediction.

Parameters
  • model – h2o tree model, such as DRF, XRT, GBM, XGBoost

  • frame – H2OFrame

  • row_index – row index of the instance to inspect

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable). When plot_type=”barplot”, then top_n_features will be chosen for each contribution_type.

  • figsize – figure size; passed directly to matplotlib

  • plot_type – either “barplot” or “breakdown”

  • contribution_type – One of “positive”, “negative”, or “both”. Used only for plot_type=”barplot”.

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP row explanation plot
>>> gbm.shap_explain_row_plot(test, row_index=0)
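
The additive property described above (feature contributions plus the bias term equal the raw prediction) can be illustrated without H2O. The sketch below uses made-up contribution values and assumes a logit link, as a binomial model would have; for a regression model like the one in the example, the link is the identity and the raw value is already the prediction.

```python
import math

def raw_prediction(contributions, bias_term):
    """Sum of per-feature contributions plus the bias term."""
    return sum(contributions.values()) + bias_term

def inverse_logit(raw):
    """Inverse link for a logit-link model (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-raw))

# Hypothetical contributions for a single row.
contributions = {"alcohol": 0.5, "sulphates": -0.25, "pH": 0.25}
bias_term = -0.5

raw = raw_prediction(contributions, bias_term)
probability = inverse_logit(raw)
```
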
shap_summary_plot(frame, columns=None, top_n_features=20, samples=1000, colorize_factors=True, alpha=1, colormap=None, figsize=(12, 12), jitter=0.35)

SHAP summary plot

SHAP summary plot shows contribution of features for each instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.

Parameters
  • model – h2o tree model, such as DRF, XRT, GBM, XGBoost

  • frame – H2OFrame

  • columns – either a list of columns or column indices to show. If specified parameter top_n_features will be ignored.

  • top_n_features – a number of columns to pick using variable importance (where applicable).

  • samples – maximum number of observations to use; if lower than number of rows in the frame, take a random sample

  • colorize_factors – if True, use colors from the colormap to colorize the factors; otherwise all levels will have same color

  • alpha – transparency of the points

  • colormap – colormap to use instead of the default blue to red colormap

  • figsize – figure size; passed directly to matplotlib

  • jitter – amount of jitter used to show the point density

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create SHAP summary plot
>>> gbm.shap_summary_plot(test)
show()[source]

Print innards of model, without regard to type.

staged_predict_proba(test_data)[source]

Predict class probabilities at each stage of an H2O Model (only GBM models).

The output structure is analogous to the output of function predict_leaf_node_assignment. For each tree t and class c there will be a column Tt.Cc (e.g. T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class by combining the raw contributions of trees T1.Cc, ..., Tt.Cc. Binomial models build the trees just for the first class, and values in columns Tx.C1 thus correspond to the probability p0.

Parameters

test_data (H2OFrame) – Data on which to make predictions.

Returns

A new H2OFrame of staged predictions.
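
The column layout and the idea of "combining the raw contributions" can be sketched in plain Python. This is an illustration only: the column order, the logistic link, and both helper names are assumptions made for this sketch, not a description of H2O's internals.

```python
import math

def staged_column_names(ntrees, nclasses):
    """Column names of the form Tt.Cc; binomial models use class 1 only."""
    classes = [1] if nclasses == 2 else list(range(1, nclasses + 1))
    return ["T{}.C{}".format(t, c)
            for t in range(1, ntrees + 1)
            for c in classes]

def staged_binomial_probabilities(initial_raw, tree_contributions):
    """Probability after each tree, assuming a logistic link on a running sum."""
    staged = []
    running = initial_raw
    for contribution in tree_contributions:
        running += contribution
        staged.append(1.0 / (1.0 + math.exp(-running)))
    return staged

names = staged_column_names(ntrees=3, nclasses=2)
stages = staged_binomial_probabilities(0.0, [0.5, 0.25, 0.25])
```

Each entry of stages corresponds to one Tx.C1 column for a single row.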

property start_time

Timestamp (milliseconds since 1970) when the model training was started.

std_coef_plot(num_of_features=None, server=False)[source]

Plot a model’s standardized coefficient magnitudes.

Parameters
  • num_of_features – the number of features shown in the plot.

  • server – if true set server settings to matplotlib and do not show the graph

Returns

None.

summary()[source]

Print a detailed summary of the model.

training_model_metrics()[source]

Return training model metrics for any model.

property type

The type of model built: "classifier" or "regressor" or "unsupervised"

update_tree_weights(frame, weights_column)[source]

Re-calculates tree-node weights based on provided dataset. Modifying node weights will affect how contribution predictions (Shapley values) are calculated. This can be used to explain the model on a curated sub-population of the training dataset.

Parameters
  • frame – frame that will be used to re-populate trees with new observations and to collect per-node weights

  • weights_column – name of the weight column (can be different from training weights)

varimp(use_pandas=False)[source]

Pretty print the variable importances, or return them in a list.

Parameters

use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame.

Returns

A list or Pandas DataFrame.
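
Variable importances are commonly reported as a relative value together with a scaled value and a percentage. The sketch below shows that arithmetic on made-up numbers; the tuple layout (variable, relative, scaled, percentage) and the helper name are assumptions for illustration, so check the actual return value before relying on its shape.

```python
def scale_importances(relative):
    """Turn {variable: relative importance} into sorted rows.

    Each row is (variable, relative, scaled, percentage), where scaled is
    relative / max and percentage is relative / sum.
    """
    largest = max(relative.values())
    total = sum(relative.values())
    if largest <= 0 or total <= 0:
        raise ValueError("at least one importance must be positive")
    ordered = sorted(relative.items(), key=lambda item: item[1], reverse=True)
    return [(name, value, value / largest, value / total)
            for name, value in ordered]

# Hypothetical relative importances.
rows = scale_importances({"pH": 3.0, "alcohol": 6.0, "sulphates": 1.0})
```
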

varimp_plot(num_of_features=None, server=False)[source]

Plot the variable importance for a trained model.

Parameters
  • num_of_features – the number of features shown in the plot (default is 10 or all if less than 10).

  • server – if true set server settings to matplotlib and do not show the graph

Returns

None.

weights(matrix_id=0)[source]

Return the frame for the respective weight matrix.

Parameters

matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return.

Returns

an H2OFrame which represents the weight matrix identified by matrix_id

xval_keys()[source]

Return model keys for the cross-validated model.

property xvals

Return a list of the cross-validated models.

Returns

A list of models.

MetricsBase

Regression model.

copyright

(c) 2016 H2O.ai

license

Apache License Version 2.0 (see LICENSE for details)

class h2o.model.metrics_base.H2OAnomalyDetectionModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

mean_normalized_score()[source]

Mean Normalized Anomaly Score. For Isolation Forest - normalized average path length.

Examples

>>> from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv")
>>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_test.csv")
>>> isofor_model = H2OIsolationForestEstimator(sample_size=5, ntrees=7)
>>> isofor_model.train(training_frame = train)
>>> perf = isofor_model.model_performance()
>>> perf.mean_normalized_score()
mean_score()[source]

Mean Anomaly Score. For Isolation Forest represents the average of all tree-path lengths.

Examples

>>> from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv")
>>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_test.csv")
>>> isofor_model = H2OIsolationForestEstimator(sample_size=5, ntrees=7)
>>> isofor_model.train(training_frame = train)
>>> perf = isofor_model.model_performance()
>>> perf.mean_score()
class h2o.model.metrics_base.H2OAutoEncoderModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

Examples

>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> train_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv")
>>> test_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv")
>>> anomaly_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                         hidden=[50,50,50],
...                                         sparse=True, l1=1e-4,
...                                         epochs=100)
>>> anomaly_model.train(x=train_ecg.names, training_frame=train_ecg)
>>> anomaly_model.mse()
class h2o.model.metrics_base.H2OBinomialModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

This class is essentially an API for the AUC object. This class contains methods for inspecting the AUC for different criteria. To input the different criteria, use the static variable criteria.

F0point5(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The F0.5 for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.F0point5()
F1(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The F1 for the given set of thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.F1()
F2(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The F2 for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.F2()
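
F0.5, F1 and F2 are all instances of the general F-beta score, which weights recall beta times as much as precision. The sketch below computes it from confusion-matrix counts; it is plain Python with made-up counts, and the helper name f_beta is introduced here for illustration only.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from true positives, false positives and false negatives."""
    beta_sq = beta * beta
    denominator = (1 + beta_sq) * tp + beta_sq * fn + fp
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * tp / denominator

# Hypothetical counts: precision 8/10, recall 8/12.
f0point5 = f_beta(tp=8, fp=2, fn=4, beta=0.5)
f1 = f_beta(tp=8, fp=2, fn=4, beta=1.0)
f2 = f_beta(tp=8, fp=2, fn=4, beta=2.0)
```

With precision higher than recall, as in these counts, F0.5 is the largest of the three and F2 the smallest.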
accuracy(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The accuracy for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.accuracy()
confusion_matrix(metrics=None, thresholds=None)[source]

Get the confusion matrix for the specified metric

Parameters
  • metrics – A string (or list of strings) among metrics listed in maximizing_metrics. Defaults to ‘f1’.

  • thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.

Returns

a list of ConfusionMatrix objects (if there are more than one to return), or a single ConfusionMatrix (if there is only one).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y = response,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.confusion_matrix(train)
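
For a binomial model, a confusion matrix at a given threshold amounts to counting how predicted probabilities fall on either side of that threshold. The following is a minimal sketch in plain Python; treating a score equal to the threshold as positive is an assumption of this sketch, and the data are made up.

```python
def confusion_counts(actual, scores, threshold):
    """Count tp, fp, tn, fn for 0/1 labels and predicted probabilities."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(actual, scores):
        predicted_positive = score >= threshold  # assumed tie-breaking rule
        if predicted_positive and label == 1:
            counts["tp"] += 1
        elif predicted_positive:
            counts["fp"] += 1
        elif label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical labels and scores.
counts = confusion_counts([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], threshold=0.5)
```
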
error(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold minimizing the error will be used.

Returns

The error for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.error()
fallout(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The fallout (same as False Positive Rate) for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.fallout()
find_idx_by_threshold(threshold)[source]

Retrieve the index in this metric’s threshold list at which the given threshold is located.

Parameters

threshold – Find the index of this input threshold.

Returns

the index

Raises

ValueError – if no such index can be found.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> local_data = [[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b']]
>>> h2o_data = h2o.H2OFrame(local_data)
>>> h2o_data.set_names(['response', 'predictor'])
>>> h2o_data["response"] = h2o_data["response"].asfactor()
>>> gbm = H2OGradientBoostingEstimator(ntrees=1,
...                                    distribution="bernoulli")
>>> gbm.train(x=list(range(1,h2o_data.ncol)),
...           y="response",
...           training_frame=h2o_data)
>>> perf = gbm.model_performance()
>>> perf.find_idx_by_threshold(0.45)
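
Conceptually, the lookup scans the stored threshold list for the requested value and raises ValueError when nothing matches. The sketch below shows one way to do that with a small tolerance for floating-point comparison; the tolerance and the helper name are assumptions, and H2O's own matching rules may differ.

```python
def find_index_by_threshold(thresholds, target, tolerance=1e-9):
    """Return the index of target in thresholds, allowing a small tolerance."""
    for index, value in enumerate(thresholds):
        if abs(value - target) <= tolerance:
            return index
    raise ValueError(
        "No threshold within {} of {}".format(tolerance, target))

# Hypothetical threshold list, sorted in descending order.
index = find_index_by_threshold([0.9, 0.45, 0.1], 0.45)
```
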
find_threshold_by_max_metric(metric)[source]
Parameters

metric – A string among the metrics listed in maximizing_metrics.

Returns

the threshold at which the given metric is maximal.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> local_data = [[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b']]
>>> h2o_data = h2o.H2OFrame(local_data)
>>> h2o_data.set_names(['response', 'predictor'])
>>> h2o_data["response"] = h2o_data["response"].asfactor()
>>> gbm = H2OGradientBoostingEstimator(ntrees=1,
...                                    distribution="bernoulli")
>>> gbm.train(x=list(range(1,h2o_data.ncol)),
...           y="response",
...           training_frame=h2o_data)
>>> perf = gbm.model_performance()
>>> perf.find_threshold_by_max_metric("f1")
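
The idea behind this lookup is an argmax over a table of thresholds and metric values. A minimal sketch with made-up numbers follows; the helper name is hypothetical, and ties are resolved here in favor of the first maximum.

```python
def threshold_maximizing(thresholds, metric_values):
    """Return the threshold whose metric value is largest."""
    if len(thresholds) != len(metric_values) or not thresholds:
        raise ValueError("inputs must be non-empty and of equal length")
    best_index = max(range(len(thresholds)), key=lambda i: metric_values[i])
    return thresholds[best_index]

# Hypothetical thresholds with their F1 values.
best = threshold_maximizing([0.9, 0.5, 0.1], [0.4, 0.8, 0.6])
```
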
fnr(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The False Negative Rate.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.fnr()
fpr(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The False Positive Rate.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.fpr()
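
The four rates tpr, fnr, fpr and tnr all derive from the same confusion-matrix counts. The sketch below computes them in plain Python from made-up counts; the helper name rates is introduced for illustration only.

```python
def rates(tp, fp, tn, fn):
    """True/false positive and negative rates from confusion-matrix counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0
    return {
        "tpr": ratio(tp, tp + fn),  # also called recall or sensitivity
        "fnr": ratio(fn, tp + fn),  # also called miss rate
        "fpr": ratio(fp, fp + tn),  # also called fallout
        "tnr": ratio(tn, fp + tn),  # also called specificity
    }

# Hypothetical counts: 10 actual positives, 10 actual negatives.
r = rates(tp=8, fp=1, tn=9, fn=2)
```

Note that tpr + fnr and fpr + tnr each sum to 1, since each pair splits the same set of actual positives or negatives.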
property fprs

Return all false positive rates for all threshold values.

Returns

a list of false positive rates.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution, fold_assignment="Random")
>>> gbm.train(y=response_col, x=predictors, validation_frame=valid, training_frame=train)
>>> (fprs, tprs) = gbm.roc(train=True, valid=False, xval=False)
>>> fprs
gains_lift()[source]

Retrieve the Gains/Lift table.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.gains_lift()
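
A gains/lift table ranks observations by predicted score, splits them into groups, and reports how many positives each cumulative slice captures. The following is a simplified sketch with equal-sized groups and made-up binary data; H2O's own grouping and column names may differ, and the names used here are hypothetical.

```python
def gains_lift_table(actual, scores, groups):
    """Cumulative capture rate and lift per group for 0/1 labels."""
    total = len(actual)
    positives = sum(actual)
    if positives == 0 or not 1 <= groups <= total:
        raise ValueError("need at least one positive and 1 <= groups <= rows")
    # Rank labels by descending score.
    ranked = [label for _, label in
              sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)]
    overall_rate = positives / total
    table = []
    captured = 0
    for group in range(1, groups + 1):
        start = (group - 1) * total // groups
        end = group * total // groups
        captured += sum(ranked[start:end])
        table.append({
            "group": group,
            "cumulative_data_fraction": end / total,
            "cumulative_capture_rate": captured / positives,
            "cumulative_lift": (captured / end) / overall_rate,
        })
    return table

# Hypothetical labels and scores.
table = gains_lift_table(
    actual=[1, 1, 1, 0, 1, 0, 0, 0],
    scores=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    groups=4,
)
```
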
max_per_class_error(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold minimizing the error will be used.

Returns

Return 1 - min(per class accuracy).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.max_per_class_error()
maximizing_metrics = ('absolute_mcc', 'accuracy', 'precision', 'f0point5', 'f1', 'f2', 'mean_per_class_accuracy', 'min_per_class_accuracy', 'tns', 'fns', 'fps', 'tps', 'tnr', 'fnr', 'fpr', 'tpr', 'fallout', 'missrate', 'recall', 'sensitivity', 'specificity')

metrics names allowed for confusion matrix

mcc(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The absolute MCC (a value between 0 and 1, 0 being totally dissimilar, 1 being identical).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mcc()
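
The absolute MCC can be computed directly from confusion-matrix counts. This is a minimal sketch of the standard Matthews correlation coefficient formula with the absolute value applied; the helper name and the counts are made up for illustration.

```python
import math

def absolute_mcc(tp, fp, tn, fn):
    """Absolute Matthews correlation coefficient from confusion counts."""
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return abs(tp * tn - fp * fn) / denominator

perfect = absolute_mcc(tp=5, fp=0, tn=5, fn=0)
inverted = absolute_mcc(tp=0, fp=5, tn=0, fn=5)
uninformative = absolute_mcc(tp=5, fp=5, tn=5, fn=5)
```

Because of the absolute value, a classifier that is always wrong scores the same as one that is always right.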
mean_per_class_error(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold minimizing the error will be used.

Returns

mean per class error.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mean_per_class_error()
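
Both per-class summaries can be derived from the error rate of each class taken separately. The sketch below does this for the binomial case in plain Python; the helper names and counts are made up, and the code is not taken from H2O.

```python
def per_class_errors(tp, fp, tn, fn):
    """Error rate of the negative class and of the positive class."""
    negative_error = fp / (tn + fp)
    positive_error = fn / (tp + fn)
    return [negative_error, positive_error]

def mean_per_class_error(tp, fp, tn, fn):
    errors = per_class_errors(tp, fp, tn, fn)
    return sum(errors) / len(errors)

def max_per_class_error(tp, fp, tn, fn):
    # Equivalent to 1 - min(per-class accuracy).
    return max(per_class_errors(tp, fp, tn, fn))

# Hypothetical counts: 10 actual positives, 10 actual negatives.
mean_error = mean_per_class_error(tp=8, fp=1, tn=9, fn=2)
max_error = max_per_class_error(tp=8, fp=1, tn=9, fn=2)
```
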
metric(metric, thresholds=None)[source]
Parameters
  • metric (str) – A metric among maximizing_metrics.

  • thresholds – thresholds parameter must be a number or a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used. If ‘all’, then all stored thresholds are used and returned with the matching metric.

Returns

The set of metrics for the list of thresholds. The returned list has a ‘value’ property holding only the metric value (if no threshold provided or if provided as a number), or all the metric values (if thresholds provided as a list)

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> local_data = [[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],[1, 'a'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],
...               [0, 'b'],[0, 'b'],[0, 'b'],[0, 'b'],[0, 'b']]
>>> h2o_data = h2o.H2OFrame(local_data)
>>> h2o_data.set_names(['response', 'predictor'])
>>> h2o_data["response"] = h2o_data["response"].asfactor()
>>> gbm = H2OGradientBoostingEstimator(ntrees=1,
...                                    distribution="bernoulli")
>>> gbm.train(x=list(range(1,h2o_data.ncol)),
...           y="response",
...           training_frame=h2o_data)
>>> perf = gbm.model_performance()
>>> perf.metric("tps", [perf.find_threshold_by_max_metric("f1")])[0][1]
metrics_aliases = {'fallout': 'fpr', 'missrate': 'fnr', 'recall': 'tpr', 'sensitivity': 'tpr', 'specificity': 'tnr'}
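
The alias table explains why, for example, missrate and fnr report the same quantity. A minimal sketch of how such a table can be applied follows; the helper name canonical_metric is hypothetical and the lookup logic is an illustration, not H2O's code.

```python
# Alias table copied from the attribute above.
METRICS_ALIASES = {
    "fallout": "fpr",
    "missrate": "fnr",
    "recall": "tpr",
    "sensitivity": "tpr",
    "specificity": "tnr",
}

def canonical_metric(name):
    """Map an alias to its canonical metric name; other names pass through."""
    return METRICS_ALIASES.get(name, name)

resolved = [canonical_metric(n) for n in ("recall", "sensitivity", "f1")]
```
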
missrate(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The miss rate (same as False Negative Rate).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.missrate()
plot(type='roc', server=False, save_to_file=None, plot=True)[source]

Produce the desired metric plot.

Parameters
  • type – the type of metric plot (currently, only ROC curve (‘roc’) and Precision Recall curve (‘pr’) are supported).

  • server – if True, generate plot inline using matplotlib’s “Agg” backend.

  • save_to_file – filename to save the plot to.

  • plot – True to plot the curve, False to get a tuple of values at axis x and y of the plot (tprs and fprs for ROC, recall and precision for PR).

Returns

None or values of x and y axis of the plot

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.plot(type="roc")
>>> cars_gbm.plot(type="pr")
precision(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The precision for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.precision()
recall(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

Recall for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.recall()
roc()[source]

Return the coordinates of the ROC curve as a tuple containing the false positive rates as a list and true positive rates as a list.

Returns

The ROC values.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(x=predictors,
...           y=response_col,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.roc(train=True,  valid=False, xval=False)
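
To make the shape of the returned tuple concrete, the sketch below builds ROC coordinates from labels and scores in plain Python and integrates them with the trapezoidal rule. It is an illustration under simple assumptions (one point per distinct score, scores at or above the threshold count as positive), not H2O's implementation, and the data are made up.

```python
def roc_points(actual, scores):
    """Return (fprs, tprs) for 0/1 labels, one point per distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    fprs, tprs = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for label, score in zip(actual, scores)
                 if score >= threshold and label == 1)
        fp = sum(1 for label, score in zip(actual, scores)
                 if score >= threshold and label == 0)
        fprs.append(fp / negatives)
        tprs.append(tp / positives)
    return fprs, tprs

def trapezoid_auc(fprs, tprs):
    """Area under the curve by the trapezoidal rule."""
    return sum((fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2.0
               for i in range(1, len(fprs)))

# Hypothetical, perfectly separated data.
fprs, tprs = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
auc = trapezoid_auc(fprs, tprs)
```
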
sensitivity(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

Sensitivity or True Positive Rate for this set of metrics and thresholds.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.sensitivity()
specificity(thresholds=None)[source]
Parameters

thresholds – thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The specificity (same as True Negative Rate).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.specificity()
tnr(thresholds=None)[source]
Parameters

thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The True Negative Rate.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.tnr()
tpr(thresholds=None)[source]
Parameters

thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, then the threshold maximizing the metric will be used.

Returns

The True Positive Rate.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.tpr()
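
The four rate methods above come in two equivalent pairs: sensitivity is the True Positive Rate, and specificity is the True Negative Rate. The following is a minimal sketch of those definitions from raw confusion-matrix counts. The helper name `rates_from_counts` is hypothetical and not part of the H2O API.

```python
def rates_from_counts(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Hypothetical helper for illustration; not an H2O function.
    """
    # Sensitivity (TPR): share of actual positives predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity (TNR): share of actual negatives predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


sens, spec = rates_from_counts(tp=40, fp=10, tn=30, fn=20)
print(sens, spec)
```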
property tprs

Return all true positive rates for all threshold values.

Returns

A list of true positive rates.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution, fold_assignment="Random")
>>> gbm.train(y=response_col, x=predictors, validation_frame=valid, training_frame=train)
>>> (fprs, tprs) = gbm.roc(train=True, valid=False, xval=False)
>>> tprs
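
To make the shape of the returned lists concrete, here is a pure-Python sketch that computes one false positive rate and one true positive rate per threshold. It assumes a row is predicted positive when its score is greater than or equal to the threshold; the function `roc_points` is hypothetical, and H2O's own threshold selection may differ.

```python
def roc_points(y_true, scores, thresholds):
    """Return (fprs, tprs) with one entry per threshold.

    Assumes labels are 0/1 and both classes are present.
    Hypothetical helper; not an H2O function.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    fprs, tprs = [], []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        tprs.append(tp / pos)
        fprs.append(fp / neg)
    return fprs, tprs


fprs, tprs = roc_points([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], [0.0, 0.5, 1.0])
print(fprs, tprs)
```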
class h2o.model.metrics_base.H2OClusteringModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

betweenss()[source]

The Between Cluster Sum-of-Square Error, or None if not present.

Examples

>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.betweenss()
tot_withinss()[source]

The Total Within Cluster Sum-of-Square Error, or None if not present.

Examples

>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.tot_withinss()
totss()[source]

The Total Sum-of-Square Error to Grand Mean, or None if not present.

Examples

>>> from h2o.estimators.kmeans import H2OKMeansEstimator
>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.totss()
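
The three clustering quantities above are related: when each cluster center is the mean of its members, the total sum of squares equals the within-cluster sum plus the between-cluster sum. The sketch below checks that identity on a toy data set. All helper names are hypothetical, and it ignores any standardization H2O may apply before clustering.

```python
def mean_point(points):
    """Column-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_sums_of_squares(points, labels):
    """Return (totss, tot_withinss, betweenss) for a hard assignment."""
    grand = mean_point(points)
    totss = sum(sq_dist(p, grand) for p in points)
    tot_withinss = 0.0
    betweenss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean_point(members)
        tot_withinss += sum(sq_dist(p, center) for p in members)
        betweenss += len(members) * sq_dist(center, grand)
    return totss, tot_withinss, betweenss


pts = [[0, 0], [0, 2], [10, 0], [10, 2]]
print(cluster_sums_of_squares(pts, [0, 0, 1, 1]))
```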
class h2o.model.metrics_base.H2ODimReductionModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

cat_err()[source]

The Number of Misclassified categories over non-missing categorical entries, or None if not present.

num_err()[source]

Sum of Squared Error over non-missing numeric entries, or None if not present.
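
As a rough illustration of the two error measures, the sketch below compares an original table with a reconstructed one, skipping missing entries. It assumes missing values are represented as `None` and that `cat_err` is a plain count of mismatches; the function `reconstruction_errors` is hypothetical and is not how H2O computes these values internally.

```python
def reconstruction_errors(actual, reconstructed, is_categorical):
    """Return (cat_err, num_err) over non-missing entries.

    actual / reconstructed: lists of rows; is_categorical: one flag per column.
    Hypothetical helper for illustration only.
    """
    cat_err = 0
    num_err = 0.0
    for row_a, row_r in zip(actual, reconstructed):
        for a, r, is_cat in zip(row_a, row_r, is_categorical):
            if a is None:
                continue  # missing entries do not contribute
            if is_cat:
                cat_err += int(a != r)      # misclassified category
            else:
                num_err += (a - r) ** 2     # squared numeric error
    return cat_err, num_err


actual = [[1.0, "a"], [None, "b"], [3.0, None]]
approx = [[1.5, "a"], [2.0, "c"], [2.0, "a"]]
print(reconstruction_errors(actual, approx, [False, True]))
```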

class h2o.model.metrics_base.H2OHGLMModelMetrics(metric_json, on=None, algo='HGLM Gaussian Gaussian')[source]

Bases: h2o.model.metrics_base.MetricsBase

class h2o.model.metrics_base.H2OModelMetricsRegressionCoxPH(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

Examples

>>> from h2o.estimators.coxph import H2OCoxProportionalHazardsEstimator
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                            stop_column="stop",
...                                            ties="breslow")
>>> coxph.train(x="age", y="event", training_frame=heart)
>>> coxph
concordance()[source]

Concordance metric (c-index): the number of concordant pairs divided by the total number of possible evaluation pairs. A value of 1.0 indicates a perfect match; 0.5 indicates random results.

concordant()[source]

Count of concordant pairs.

tied_y()[source]

Count of tied pairs.
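
The pair counting behind these three methods can be sketched in plain Python. This version follows the common Harrell-style convention: a pair is usable when the earlier time is an observed event, and pairs with equal risk scores count as half. Those conventions, the absence of start times, and the function name `concordance_index` are assumptions for illustration; H2O's exact handling of ties and censoring may differ.

```python
def concordance_index(times, events, risks):
    """Return (c_index, concordant, tied, total) for survival data.

    times: observed times; events: 1 if the event occurred, 0 if censored;
    risks: model risk scores (higher means earlier expected event).
    Hypothetical helper; not an H2O function.
    """
    concordant = tied = total = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: subject i had an event strictly before time j.
            if events[i] and times[i] < times[j]:
                total += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    if total == 0:
        return float("nan"), concordant, tied, total
    return (concordant + 0.5 * tied) / total, concordant, tied, total


print(concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.5, 0.5, 0.1]))
```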

class h2o.model.metrics_base.H2OMultinomialModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

confusion_matrix()[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.confusion_matrix(train)
hit_ratio_table()[source]

Retrieve the Hit Ratios.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.hit_ratio_table()
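
A hit ratio at k is the fraction of rows whose true class is among the k classes with the highest predicted probability. The sketch below computes that for k = 1..max_k from per-class probabilities; `hit_ratios` is a hypothetical helper and the table layout H2O returns is not reproduced here.

```python
def hit_ratios(y_true, class_probs, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k].

    y_true: true class indices; class_probs: one probability list per row.
    Hypothetical helper for illustration only.
    """
    hits = [0] * max_k
    for y, probs in zip(y_true, class_probs):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        rank = ranked.index(y)
        # A hit at rank r counts for every k > r.
        for k in range(rank, max_k):
            hits[k] += 1
    n = len(y_true)
    return [h / n for h in hits]


probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.35, 0.25]]
print(hit_ratios([0, 1, 2], probs, max_k=3))
```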
multinomial_auc_table()[source]

Retrieve the multinomial AUC values.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.multinomial_auc_table()
multinomial_aucpr_table()[source]

Retrieve the multinomial PR AUC values.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.multinomial_aucpr_table()
class h2o.model.metrics_base.H2OOrdinalModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

confusion_matrix()[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.confusion_matrix(train)
hit_ratio_table()[source]

Retrieve the Hit Ratios.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution = distribution)
>>> gbm.train(x=predictors,
...           y = response_col,
...           training_frame = train,
...           validation_frame = valid)
>>> gbm.hit_ratio_table()
class h2o.model.metrics_base.H2ORegressionModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

This class provides an API for inspecting the metrics returned by a regression model.

It is possible to retrieve the R^2 (1 - MSE/variance) and MSE.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_glm = H2OGeneralizedLinearEstimator()
>>> cars_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_glm.mse()
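
The relationship R^2 = 1 - MSE/variance stated in the class description can be checked by hand. The sketch below uses the population variance of the actual values; the function names are hypothetical and independent of H2O.

```python
def mean_squared_error(y, y_hat):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def r_squared(y, y_hat):
    """R^2 computed as 1 - MSE / variance of the actual values."""
    mean_y = sum(y) / len(y)
    variance = sum((a - mean_y) ** 2 for a in y) / len(y)
    return 1 - mean_squared_error(y, y_hat) / variance


y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 3.0, 3.5]
print(mean_squared_error(y, y_hat), r_squared(y, y_hat))
```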
class h2o.model.metrics_base.H2OTargetEncoderMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

class h2o.model.metrics_base.H2OWordEmbeddingModelMetrics(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

class h2o.model.metrics_base.List[source]

Bases: list

class h2o.model.metrics_base.MetricsBase(metric_json, on=None, algo='')[source]

Bases: h2o.model.metrics_base.MetricsBase

A parent class to house common metrics available for the various Metrics types.

The methods here are available across different model categories.

aic()[source]

The AIC for this set of metrics.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.aic()
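
For reference, the textbook definition of AIC is 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies that formula to two hypothetical models; it is a generic illustration and does not claim to match how H2O counts parameters.

```python
def akaike_information_criterion(log_likelihood, n_params):
    """Textbook AIC: 2k - 2 ln(L). Lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood


# Two made-up models fitted to the same data.
small = akaike_information_criterion(log_likelihood=-120.0, n_params=3)
large = akaike_information_criterion(log_likelihood=-118.5, n_params=8)
print(small, large)  # the extra parameters do not pay for themselves here
```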
auc()[source]

The AUC for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.auc()
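
One way to build intuition for AUC is its pairwise interpretation: the probability that a randomly chosen positive row scores higher than a randomly chosen negative row, with ties counted as half. The sketch below computes it exactly on a small sample. The function `auc_pairwise` is hypothetical; H2O derives AUC from its threshold table, so values on real data may differ slightly.

```python
def auc_pairwise(y_true, scores):
    """Exact AUC from all positive/negative score pairs (O(n^2)).

    Assumes 0/1 labels with both classes present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(auc_pairwise([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```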
aucpr()[source]

The area under the precision recall curve.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.aucpr()
custom_metric_name()[source]

Name of custom metric or None.

custom_metric_value()[source]

Value of custom metric or None.

gini()[source]

Gini coefficient.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.gini()
hglm_metric(metric_string)[source]
logloss()[source]

Log loss.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.logloss()
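
For a binary response, log loss is the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal sketch follows; the clipping constant `eps` and the function name are assumptions added to keep the logarithm finite, not H2O parameters.

```python
import math


def binary_logloss(y_true, p1, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and P(y=1) predictions."""
    total = 0.0
    for y, p in zip(y_true, p1):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


print(binary_logloss([1, 0, 1], [0.9, 0.2, 0.6]))
```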
mae()[source]

The MAE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mae()
classmethod make(kvs)[source]

Factory method to instantiate a MetricsBase object from the list of key-value pairs.

mean_per_class_error()[source]

The mean per class error.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.mean_per_class_error()
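
Mean per class error averages the error rate of each class, so a rare class weighs as much as a common one. The sketch below computes it from hard predictions; `mean_per_class_error_from_labels` is a hypothetical helper and takes predicted labels rather than an H2O model.

```python
def mean_per_class_error_from_labels(y_true, y_pred):
    """Average of per-class error rates over the classes present in y_true."""
    errors = []
    for c in sorted(set(y_true)):
        rows = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in rows if y_pred[i] != c)
        errors.append(wrong / len(rows))
    return sum(errors) / len(errors)


# Class 0: 1 of 4 wrong; class 1: 1 of 2 wrong.
print(mean_per_class_error_from_labels([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))
```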
mean_residual_deviance()[source]

The mean residual deviance for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/AirlinesTest.csv.zip")
>>> air_gbm = H2OGradientBoostingEstimator()
>>> air_gbm.train(x=list(range(9)),
...               y=9,
...               training_frame=airlines,
...               validation_frame=airlines)
>>> air_gbm.mean_residual_deviance(train=True,valid=False,xval=False)
mse()[source]

The MSE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.mse()
nobs()[source]

The number of observations.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> perf = cars_gbm.model_performance()
>>> perf.nobs()
null_degrees_of_freedom()[source]

The null DoF if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_degrees_of_freedom()
null_deviance()[source]

The null deviance if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.null_deviance()
pr_auc()[source]

MetricsBase.pr_auc is deprecated; use MetricsBase.aucpr instead.

r2()[source]

The R squared coefficient.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.r2()
residual_degrees_of_freedom()[source]

The residual DoF if the model has residual deviance, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_degrees_of_freedom()
residual_deviance()[source]

The residual deviance if the model has it, otherwise None.

Examples

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> train, valid = prostate.split_frame(ratios=[.8],seed=1234)
>>> pros_glm = H2OGeneralizedLinearEstimator(family="binomial")
>>> pros_glm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> pros_glm.residual_deviance()
rmse()[source]

The RMSE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmse()
rmsle()[source]

The RMSLE for this set of metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution = "poisson",
...                                         seed = 1234)
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.rmsle()
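
The regression error measures documented above (MAE, MSE, RMSE, RMSLE) differ only in how each residual is transformed and aggregated. The sketch below computes all four with their standard definitions. The function `regression_errors` is hypothetical, and RMSLE here assumes non-negative actual and predicted values.

```python
import math


def regression_errors(y, y_hat):
    """Return a dict with MAE, MSE, RMSE and RMSLE for two equal-length lists."""
    n = len(y)
    mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / n
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    # RMSLE compares log(1 + value), so it penalizes relative error.
    msle = sum((math.log1p(b) - math.log1p(a)) ** 2 for a, b in zip(y, y_hat)) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "rmsle": math.sqrt(msle)}


print(regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```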
show()[source]

Display a short summary of the metrics.

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234) 
>>> cars_gbm.train(x = predictors,
...                y = response,
...                training_frame = train,
...                validation_frame = valid)
>>> cars_gbm.show()

Binomial Classification

class h2o.model.binomial.H2OBinomialModel[source]

Bases: h2o.model.model_base.ModelBase

F0point5(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F0.5 for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F0.5 value for the training data.

  • valid (bool) – If True, return the F0.5 value for the validation data.

  • xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.

Returns

The F0.5 values for the specified key(s).

Examples

>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)         
>>> F0point5 = gbm.F0point5() # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True,  valid=True,  xval=True)
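
The train/valid/xval flags follow the same rule across the metric methods in this class: no flag returns the training value, and several flags return a dictionary keyed by “train”, “valid”, and “xval”. The sketch below mimics that dispatch with a hypothetical function `select_metrics`; returning a bare value when exactly one flag is set is an assumption, not something the text above states.

```python
def select_metrics(values, train=False, valid=False, xval=False):
    """Pick metric values according to the train/valid/xval flags.

    values: dict with keys "train", "valid", "xval".
    Hypothetical helper that only imitates the documented behavior.
    """
    flags = {"train": train, "valid": valid, "xval": xval}
    chosen = [key for key, on in flags.items() if on]
    if not chosen:
        return values["train"]            # default: training metric
    if len(chosen) == 1:
        return values[chosen[0]]          # assumed: single bare value
    return {key: values[key] for key in chosen}


metrics = {"train": 0.9, "valid": 0.8, "xval": 0.7}
print(select_metrics(metrics))
print(select_metrics(metrics, train=True, xval=True))
```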
F1(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F1 value for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F1 value for the training data.

  • valid (bool) – If True, return the F1 value for the validation data.

  • xval (bool) – If True, return the F1 value for each of the cross-validated splits.

Returns

The F1 values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2] 
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F1() # <- Default: return training metric value
>>> gbm.F1(train=True,  valid=True,  xval=True)
F2(thresholds=None, train=False, valid=False, xval=False)[source]

Get the F2 for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the F2 value for the training data.

  • valid (bool) – If True, return the F2 value for the validation data.

  • xval (bool) – If True, return the F2 value for each of the cross-validated splits.

Returns

The F2 values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F2() # <- Default: return training metric value
>>> gbm.F2(train=True, valid=True, xval=True)
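
F0.5, F1, and F2 are all instances of the general F-beta score, which weights recall beta times as much as precision. The sketch below shows the standard formula; `f_beta` is a hypothetical helper operating on precision and recall values rather than on an H2O model.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


p, r = 0.5, 1.0
# beta < 1 favors precision, beta > 1 favors recall.
print(f_beta(p, r, 0.5), f_beta(p, r, 1), f_beta(p, r, 2))
```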
accuracy(thresholds=None, train=False, valid=False, xval=False)[source]

Get the accuracy for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the accuracy value for the training data.

  • valid (bool) – If True, return the accuracy value for the validation data.

  • xval (bool) – If True, return the accuracy value for each of the cross-validated splits.

Returns

The accuracy values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.accuracy() # <- Default: return training metric value
>>> gbm.accuracy(train=True, valid=True, xval=True)
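
When thresholds is None, the threshold that maximizes the metric is used. The sketch below illustrates that idea for accuracy over a fixed candidate list. Both function names are hypothetical, the candidate thresholds are made up, and a row is assumed to be predicted positive when its score is greater than or equal to the threshold.

```python
def accuracy_at(y_true, scores, threshold):
    """Fraction of rows classified correctly at the given threshold."""
    correct = sum(1 for y, s in zip(y_true, scores)
                  if (1 if s >= threshold else 0) == y)
    return correct / len(y_true)


def best_accuracy_threshold(y_true, scores, candidates):
    """Return the first candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(y_true, scores, t))


y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.6, 0.1]
t = best_accuracy_threshold(y, s, [0.3, 0.5, 0.7])
print(t, accuracy_at(y, s, t))
```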
confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False)[source]

Get the confusion matrix for the specified metrics/thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • metrics – A string (or list of strings) among metrics listed in H2OBinomialModelMetrics.maximizing_metrics. Defaults to ‘f1’.

  • thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.

  • train (bool) – If True, return the confusion matrix value for the training data.

  • valid (bool) – If True, return the confusion matrix value for the validation data.

  • xval (bool) – If True, return the confusion matrix value for each of the cross-validated splits.

Returns

The confusion matrix values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.confusion_matrix() # <- Default: return training metric value
>>> gbm.confusion_matrix(train=True, valid=True, xval=True)
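The four cells of a binomial confusion matrix at one threshold can be illustrated without a running cluster. This is a minimal sketch with a hypothetical helper `confusion_counts` and made-up data; it assumes a row is predicted positive when its score is at or above the threshold and does not reproduce the table object that H2O returns.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives/negatives at one threshold.

    Assumption: a row is predicted positive when score >= threshold.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

cm = confusion_counts(scores, labels, threshold=0.5)
print(cm)
```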
error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the error value for the training data.

  • valid (bool) – If True, return the error value for the validation data.

  • xval (bool) – If True, return the error value for each of the cross-validated splits.

Returns

The error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.error() # <- Default: return training metric
>>> gbm.error(train=True, valid=True, xval=True)
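The return convention shared by these accessors (training value by default, a dictionary when several flags are set) can be sketched as a small dispatch function. The name `select_metrics` and the input dictionary are hypothetical. The text above does not spell out the case of exactly one flag; the sketch assumes the bare value is returned in that case.

```python
def select_metrics(metrics, train=False, valid=False, xval=False):
    """Pick metric values by flag.

    `metrics` is a hypothetical dict with keys "train", "valid", "xval".
    """
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [key for key, flag in flags if flag]
    if not requested:
        # Default: no flag set, return the training metric value.
        return metrics["train"]
    if len(requested) == 1:
        # Assumption of this sketch: a single flag yields the bare value.
        return metrics[requested[0]]
    # Several flags: dictionary keyed by "train", "valid", "xval".
    return {key: metrics[key] for key in requested}


# Made-up error values for illustration only.
example = {"train": 0.05, "valid": 0.08, "xval": 0.09}

print(select_metrics(example))
print(select_metrics(example, train=True, valid=True, xval=True))
```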
fallout(thresholds=None, train=False, valid=False, xval=False)[source]

Get the fallout for a set of thresholds (aka False Positive Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the fallout value for the training data.

  • valid (bool) – If True, return the fallout value for the validation data.

  • xval (bool) – If True, return the fallout value for each of the cross-validated splits.

Returns

The fallout values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fallout() # <- Default: return training metric
>>> gbm.fallout(train=True, valid=True, xval=True)
find_idx_by_threshold(threshold, train=False, valid=False, xval=False)[source]

Retrieve the index in this metric’s threshold list at which the given threshold is located.

If all are False (default), then return the index for the training metrics. If more than one option is set to True, then return a dictionary of indices where the keys are “train”, “valid”, and “xval”.

Parameters
  • threshold (float) – Threshold value to search for in the threshold list.

  • train (bool) – If True, return the threshold index for the training data.

  • valid (bool) – If True, return the threshold index for the validation data.

  • xval (bool) – If True, return the threshold index for each of the cross-validated splits.

Returns

The threshold indices for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438,
...                                           train=True)
>>> idx_threshold
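The lookup itself amounts to searching a list of thresholds for a given value. The sketch below uses a hypothetical helper `index_of_threshold` and a made-up threshold list; falling back to the nearest entry when there is no exact match is an assumption of the sketch, not documented behavior.

```python
def index_of_threshold(thresholds, threshold):
    """Index of the entry closest to `threshold`.

    An exact match has distance zero and therefore always wins.
    Assumption: when no exact match exists, the nearest entry is used.
    """
    return min(range(len(thresholds)),
               key=lambda i: abs(thresholds[i] - threshold))


# Made-up threshold list, sorted from high to low.
thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]

print(index_of_threshold(thresholds, 0.5))   # exact match
print(index_of_threshold(thresholds, 0.62))  # nearest entry is 0.7
```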
find_threshold_by_max_metric(metric, train=False, valid=False, xval=False)[source]

Retrieve the threshold at which the given metric is maximal.

If all are False (default), then return the threshold for the training metrics. If more than one option is set to True, then return a dictionary of thresholds where the keys are “train”, “valid”, and “xval”.

Parameters
  • metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.

  • train (bool) – If True, return the maximizing threshold for the training data.

  • valid (bool) – If True, return the maximizing threshold for the validation data.

  • xval (bool) – If True, return the maximizing threshold for each of the cross-validated splits.

Returns

The thresholds maximizing the metric for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> max_metric = gbm.find_threshold_by_max_metric(metric="f2",
...                                               train=True)
>>> max_metric
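Finding the threshold that maximizes a metric is an argmax over paired lists. The following sketch uses a hypothetical helper `threshold_maximizing` and made-up F2 values; it only illustrates the idea and makes no claim about how H2O stores its per-threshold metrics.

```python
def threshold_maximizing(thresholds, metric_values):
    """Return the threshold whose metric value is largest.

    Ties are resolved in favor of the first occurrence.
    """
    best_idx = max(range(len(metric_values)), key=lambda i: metric_values[i])
    return thresholds[best_idx]


# Made-up thresholds and the F2 score observed at each of them.
thresholds = [0.9, 0.7, 0.5, 0.3]
f2_values = [0.40, 0.62, 0.71, 0.66]

best = threshold_maximizing(thresholds, f2_values)
print(best)
```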
fnr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the False Negative Rates for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the FNR value for the training data.

  • valid (bool) – If True, return the FNR value for the validation data.

  • xval (bool) – If True, return the FNR value for each of the cross-validated splits.

Returns

The FNR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fnr() # <- Default: return training metric
>>> gbm.fnr(train=True, valid=True, xval=True)
fpr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the False Positive Rates for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the FPR value for the training data.

  • valid (bool) – If True, return the FPR value for the validation data.

  • xval (bool) – If True, return the FPR value for each of the cross-validated splits.

Returns

The FPR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fpr() # <- Default: return training metric
>>> gbm.fpr(train=True, valid=True, xval=True)
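The rate metrics in this family (FNR, FPR, TNR, TPR) are all ratios of confusion-matrix cells. As a minimal sketch using the standard definitions, with a hypothetical helper `rates` and made-up counts:

```python
def rates(tp, fp, tn, fn):
    """Standard rate definitions from confusion-matrix counts."""
    positives = tp + fn  # rows whose true label is positive
    negatives = fp + tn  # rows whose true label is negative
    return {
        "tpr": tp / positives,  # true positive rate (recall, sensitivity)
        "fnr": fn / positives,  # false negative rate (miss rate)
        "fpr": fp / negatives,  # false positive rate (fallout)
        "tnr": tn / negatives,  # true negative rate (specificity)
    }


# Made-up counts for illustration only.
r = rates(tp=40, fp=5, tn=45, fn=10)
print(r)
```

Because TPR and FNR share a denominator, as do FPR and TNR, each pair sums to 1.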
gains_lift(train=False, valid=False, xval=False)[source]

Get the Gains/Lift table for the specified metrics.

If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the Gains/Lift table for the training data.

  • valid (bool) – If True, return the Gains/Lift table for the validation data.

  • xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.

Returns

The Gains/Lift tables for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.gains_lift() # <- Default: return training metric Gains/Lift table
>>> gbm.gains_lift(train=True, valid=True, xval=True)
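To show what the columns of a gains/lift table typically mean, the sketch below builds a tiny one by hand. The helper `gains_lift_table`, the equal-sized grouping, and the data are all assumptions made for illustration; H2O's own table has more columns and its binning may differ.

```python
def gains_lift_table(scores, labels, groups=4):
    """Rows of (group, response_rate, lift, cumulative_capture_rate).

    Assumptions: equal-sized groups, len(scores) divisible by `groups`,
    distinct scores, and at least one positive label.
    """
    # Rank rows from the highest score to the lowest.
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    size = len(ranked) // groups
    total_pos = sum(ranked)
    overall_rate = total_pos / len(ranked)
    rows, captured = [], 0
    for g in range(groups):
        chunk = ranked[g * size:(g + 1) * size]
        positives = sum(chunk)
        captured += positives
        response_rate = positives / size
        lift = response_rate / overall_rate
        rows.append((g + 1, response_rate, lift, captured / total_pos))
    return rows


# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

table = gains_lift_table(scores, labels)
for row in table:
    print(row)
```

A lift above 1 in the first groups means the model concentrates positives among its highest-scored rows.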
kolmogorov_smirnov()[source]

Retrieve the Kolmogorov-Smirnov metric for a given binomial model. The returned number is between 0 and 1. The K-S metric represents the degree of separation between the positive (1) and negative (0) cumulative distribution functions. Detailed metrics for each group can be found in the gains-lift table.

Returns

The Kolmogorov-Smirnov metric, a number between 0 and 1.

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OGradientBoostingEstimator(ntrees=1,
...                                      gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.kolmogorov_smirnov()
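The "degree of separation" between the two cumulative distributions can be made concrete with a few lines of plain Python. This is a sketch under stated assumptions: the helper `ks_statistic` is hypothetical, scores are assumed distinct, and the statistic is computed row by row rather than from a binned gains-lift table, so it will not necessarily match H2O's number.

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative share of positives and negatives.

    Assumptions: distinct scores, at least one row of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    # Walk rows from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

ks = ks_statistic(scores, labels)
print(ks)
```

A model that ranks every positive above every negative reaches 1 under this definition.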
max_per_class_error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the max per class error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the max per class error value for the training data.

  • valid (bool) – If True, return the max per class error value for the validation data.

  • xval (bool) – If True, return the max per class error value for each of the cross-validated splits.

Returns

The max per class error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.max_per_class_error() # <- Default: return training metric value
>>> gbm.max_per_class_error(train=True, valid=True, xval=True)
mcc(thresholds=None, train=False, valid=False, xval=False)[source]

Get the MCC for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the MCC value for the training data.

  • valid (bool) – If True, return the MCC value for the validation data.

  • xval (bool) – If True, return the MCC value for each of the cross-validated splits.

Returns

The MCC values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mcc() # <- Default: return training metric value
>>> gbm.mcc(train=True, valid=True, xval=True)
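MCC stands for the Matthews correlation coefficient. As a reference point, the standard formula can be evaluated directly from confusion-matrix counts. The helper `mcc_from_counts` and the counts are hypothetical, and returning 0 for a zero denominator is a convention chosen for this sketch.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention of this sketch: undefined cases are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts for illustration only.
value = mcc_from_counts(tp=40, fp=5, tn=45, fn=10)
print(round(value, 3))
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right).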
mean_per_class_error(thresholds=None, train=False, valid=False, xval=False)[source]

Get the mean per class error for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold minimizing the error will be used.

  • train (bool) – If True, return the mean per class error value for the training data.

  • valid (bool) – If True, return the mean per class error value for the validation data.

  • xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.

Returns

The mean per class error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mean_per_class_error() # <- Default: return training metric
>>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
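Mean and max per class error both start from the error rate of each class taken separately. A minimal sketch for the binomial case, with a hypothetical helper `per_class_errors` and made-up counts:

```python
def per_class_errors(tp, fp, tn, fn):
    """Per-class error rates for a two-class problem, plus mean and max."""
    positive_error = fn / (tp + fn)  # positives predicted as negative
    negative_error = fp / (fp + tn)  # negatives predicted as positive
    return {
        "positive": positive_error,
        "negative": negative_error,
        "mean": (positive_error + negative_error) / 2,
        "max": max(positive_error, negative_error),
    }


# Made-up counts for illustration only.
errors = per_class_errors(tp=40, fp=5, tn=45, fn=10)
print(errors)
```

Unlike the overall error, the mean gives each class equal weight regardless of its size, which matters on imbalanced data.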
metric(metric, thresholds=None, train=False, valid=False, xval=False)[source]

Get the metric value for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • metric (str) – name of the metric to retrieve.

  • thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).

  • train (bool) – If True, return the metric value for the training data.

  • valid (bool) – If True, return the metric value for the validation data.

  • xval (bool) – If True, return the metric value for each of the cross-validated splits.

Returns

The metric values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> # thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99])
>>> thresholds = [0.01, 0.5, 0.99]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
missrate(thresholds=None, train=False, valid=False, xval=False)[source]

Get the miss rate for a set of thresholds (aka False Negative Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the miss rate value for the training data.

  • valid (bool) – If True, return the miss rate value for the validation data.

  • xval (bool) – If True, return the miss rate value for each of the cross-validated splits.

Returns

The miss rate values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.missrate() # <- Default: return training metric
>>> gbm.missrate(train=True, valid=True, xval=True)
plot(timestep='AUTO', metric='AUTO', server=False, **kwargs)[source]

Plot training set (and validation set if available) scoring history for an H2OBinomialModel.

The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep (str) – A unit of measurement for the x-axis.

  • metric (str) – A unit of measurement for the y-axis.

  • server (bool) – If True, then generate the image inline (using matplotlib’s “Agg” backend).

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> response = 3
>>> predictors = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10]
>>> model = H2OGeneralizedLinearEstimator(family="binomial")
>>> model.train(x=predictors, y=response, training_frame=benign)
>>> model.plot(timestep="AUTO", metric="objective", server=False)
precision(thresholds=None, train=False, valid=False, xval=False)[source]

Get the precision for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the precision value for the training data.

  • valid (bool) – If True, return the precision value for the validation data.

  • xval (bool) – If True, return the precision value for each of the cross-validated splits.

Returns

The precision values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.precision() # <- Default: return training metric value
>>> gbm.precision(train=True, valid=True, xval=True)
recall(thresholds=None, train=False, valid=False, xval=False)[source]

Get the recall for a set of thresholds (aka True Positive Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the recall value for the training data.

  • valid (bool) – If True, return the recall value for the validation data.

  • xval (bool) – If True, return the recall value for each of the cross-validated splits.

Returns

The recall values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.recall() # <- Default: return training metric
>>> gbm.recall(train=True, valid=True, xval=True)
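Precision and recall are usually read together, and the F-scores combine them into one number. The sketch below applies the standard formulas to made-up counts. The helper names are hypothetical and nothing here calls H2O.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that are found
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts for illustration only.
p, r = precision_recall(tp=40, fp=5, fn=10)
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
print(round(p, 3), round(r, 3), round(f1, 3), round(f2, 3))
```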
roc(train=False, valid=False, xval=False)[source]

Return the coordinates of the ROC curve for a given set of data.

The coordinates are two-tuples containing the false positive rates as a list and the true positive rates as a list. If all are False (default), then the coordinates for the training data are returned. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.

Parameters
  • train (bool) – If True, return the ROC value for the training data.

  • valid (bool) – If True, return the ROC value for the validation data.

  • xval (bool) – If True, return the ROC value for each of the cross-validated splits.

Returns

The ROC values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.roc() # <- Default: return training data
>>> gbm.roc(train=True, valid=True, xval=True)
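The two-tuple layout described above (a list of false positive rates and a list of true positive rates) can be reproduced on toy data. The helpers `roc_coordinates` and `trapezoid_auc` are hypothetical, scores are assumed distinct, and the points H2O returns depend on its own threshold selection, so this is only a sketch of the shape of the result.

```python
def roc_coordinates(scores, labels):
    """Return (fprs, tprs), one point per row plus the origin.

    Assumptions: distinct scores, at least one row of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fprs, tprs = [0.0], [0.0]
    tp = fp = 0
    # Lower the threshold one row at a time, from the highest score down.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        fprs.append(fp / n_neg)
        tprs.append(tp / n_pos)
    return fprs, tprs


def trapezoid_auc(fprs, tprs):
    """Area under the curve by the trapezoid rule."""
    return sum((fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2
               for i in range(1, len(fprs)))


# Made-up predicted probabilities and true labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

fprs, tprs = roc_coordinates(scores, labels)
area = trapezoid_auc(fprs, tprs)
print(fprs)
print(tprs)
```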
sensitivity(thresholds=None, train=False, valid=False, xval=False)[source]

Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the sensitivity value for the training data.

  • valid (bool) – If True, return the sensitivity value for the validation data.

  • xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.

Returns

The sensitivity values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.sensitivity() # <- Default: return training metric
>>> gbm.sensitivity(train=True, valid=True, xval=True)
specificity(thresholds=None, train=False, valid=False, xval=False)[source]

Get the specificity for a set of thresholds (aka True Negative Rate).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the specificity value for the training data.

  • valid (bool) – If True, return the specificity value for the validation data.

  • xval (bool) – If True, return the specificity value for each of the cross-validated splits.

Returns

The specificity values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <=.2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.specificity() # <- Default: return training metric
>>> gbm.specificity(train=True, valid=True, xval=True)
tnr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the True Negative Rate for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the TNR value for the training data.

  • valid (bool) – If True, return the TNR value for the validation data.

  • xval (bool) – If True, return the TNR value for each of the cross-validated splits.

Returns

The TNR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tnr() # <- Default: return training metric
>>> gbm.tnr(train=True, valid=True, xval=True)
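
The train/valid/xval flag convention described above can be sketched in plain Python. This is only an illustration, not H2O's code: select_metrics and the metric values are hypothetical, and returning a bare value when exactly one flag is set is an assumption, since the text above only covers the "no flags" and "more than one flag" cases.

```python
def select_metrics(values, train=False, valid=False, xval=False):
    """Pick metric values following the train/valid/xval flag convention.

    `values` is a hypothetical dict such as
    {"train": 0.91, "valid": 0.88, "xval": 0.86}.
    """
    # With no flag set, fall back to the training metric.
    if not (train or valid or xval):
        train = True
    flags = {"train": train, "valid": valid, "xval": xval}
    chosen = {key: values.get(key) for key, on in flags.items() if on}
    # Assumption: a single requested key yields the bare value, not a dict.
    if len(chosen) == 1:
        return next(iter(chosen.values()))
    return chosen


# Made-up metric values, for illustration only.
tnr_by_key = {"train": 0.91, "valid": 0.88, "xval": 0.86}
default_value = select_metrics(tnr_by_key)
all_values = select_metrics(tnr_by_key, train=True, valid=True, xval=True)
```

Here default_value is the training entry alone, while all_values is a dictionary keyed by "train", "valid", and "xval".
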
tpr(thresholds=None, train=False, valid=False, xval=False)[source]

Get the True Positive Rate for a set of thresholds.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • thresholds – If None, then the threshold maximizing the metric will be used.

  • train (bool) – If True, return the TPR value for the training data.

  • valid (bool) – If True, return the TPR value for the validation data.

  • xval (bool) – If True, return the TPR value for each of the cross-validated splits.

Returns

The TPR values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tpr() # <- Default: return training metric
>>> gbm.tpr(train=True, valid=True, xval=True)
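
As a rough illustration of what the true positive rate and true negative rate measure at a single threshold, the following plain-Python sketch counts the four outcomes directly. It does not use H2O; rates_at_threshold, the labels, and the scores are made up for the example.

```python
def rates_at_threshold(actual, scores, threshold):
    """Return (tpr, tnr) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(actual, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # true positive rate (sensitivity, recall)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    return tpr, tnr


# Made-up binary labels and predicted probabilities of the positive class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.6, 0.1, 0.7]
tpr, tnr = rates_at_threshold(actual, scores, threshold=0.5)
```

Raising the threshold generally trades TPR for TNR, which is why both methods accept a set of thresholds.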

Multinomial Classification

class h2o.model.multinomial.H2OMultinomialModel[source]

Bases: h2o.model.model_base.ModelBase

confusion_matrix(data)[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

Parameters

data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> confusion_matrix = gbm.confusion_matrix(train)
>>> confusion_matrix
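
For readers unfamiliar with the structure of a multiclass confusion matrix, here is a small plain-Python sketch that tallies one from actual and predicted labels. It is not H2O's implementation; confusion_counts and the labels are hypothetical, and the orientation (rows for actual classes, columns for predicted classes) is an assumption of this sketch.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Return (labels, matrix) with rows = actual class, columns = predicted class."""
    labels = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))
    matrix = [[pairs[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up cylinder classes, for illustration only.
actual = ["4", "4", "6", "8", "8", "6", "4", "8"]
predicted = ["4", "6", "6", "8", "8", "6", "4", "4"]
labels, matrix = confusion_counts(actual, predicted)
```

The diagonal of matrix holds the correctly classified counts; every off-diagonal cell is a specific kind of misclassification.
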
hit_ratio_table(train=False, valid=False, xval=False)[source]

Retrieve the Hit Ratios.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train – If train is True, then return the hit ratio value for the training data.

  • valid – If valid is True, then return the hit ratio value for the validation data.

  • xval – If xval is True, then return the hit ratio value for the cross validation data.

Returns

The hit ratio table for this multinomial model.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> hit_ratio_table = gbm.hit_ratio_table() # <- Default: return training metrics
>>> hit_ratio_table
>>> hit_ratio_table1 = gbm.hit_ratio_table(train=True,
...                                        valid=True,
...                                        xval=True)
>>> hit_ratio_table1
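
A hit ratio at k is the fraction of rows whose true class is among the k classes with the highest predicted probability. The plain-Python sketch below computes it for a toy example; hit_ratios and the probabilities are hypothetical and are not taken from H2O.

```python
def hit_ratios(actual, class_probs, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k]."""
    hits = [0] * max_k
    for label, probs in zip(actual, class_probs):
        # Classes ordered from most to least probable for this row.
        ranked = sorted(probs, key=probs.get, reverse=True)
        for k in range(1, max_k + 1):
            if label in ranked[:k]:
                hits[k - 1] += 1
    return [h / len(actual) for h in hits]


# Made-up true classes and per-class predicted probabilities.
actual = ["4", "6", "8", "4"]
class_probs = [
    {"4": 0.7, "6": 0.2, "8": 0.1},
    {"4": 0.5, "6": 0.4, "8": 0.1},
    {"4": 0.3, "6": 0.6, "8": 0.1},
    {"4": 0.2, "6": 0.3, "8": 0.5},
]
ratios = hit_ratios(actual, class_probs, max_k=3)
```

The ratios never decrease with k, and the last one is 1.0 whenever k equals the number of classes.
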
mean_per_class_error(train=False, valid=False, xval=False)[source]

Retrieve the mean per class error across all classes.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the mean_per_class_error value for the training data.

  • valid (bool) – If True, return the mean_per_class_error value for the validation data.

  • xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.

Returns

The mean_per_class_error values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> mean_per_class_error = gbm.mean_per_class_error() # <- Default: return training metric
>>> mean_per_class_error
>>> mean_per_class_error1 = gbm.mean_per_class_error(train=True,
...                                                  valid=True,
...                                                  xval=True)
>>> mean_per_class_error1
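
The mean per class error averages the error rate of each class, so small classes count as much as large ones. A minimal plain-Python sketch, assuming a square confusion matrix with rows for actual classes and columns for predicted classes (the helper name and the counts are made up):

```python
def mean_per_class_error(matrix):
    """Average the per-class error rates of a square confusion matrix."""
    errors = []
    for i, row in enumerate(matrix):
        total = sum(row)
        # Everything in the row except the diagonal cell was misclassified.
        errors.append((total - row[i]) / total)
    return sum(errors) / len(errors)


# Made-up counts: rows = actual class, columns = predicted class.
matrix = [[8, 2, 0], [1, 3, 0], [0, 5, 5]]
mpce = mean_per_class_error(matrix)
```

In this toy matrix the per-class error rates are 0.2, 0.25, and 0.5, so the mean per class error differs from the overall error rate of 8/24.
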
multinomial_auc_table(train=False, valid=False, xval=False)[source]

Retrieve the multinomial AUC table.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the multinomial_auc_table for the training data.

  • valid (bool) – If True, return the multinomial_auc_table for the validation data.

  • xval (bool) – If True, return the multinomial_auc_table for each of the cross-validated splits.

Returns

The multinomial_auc_table values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_auc_table = gbm.multinomial_auc_table() # <- Default: return training metric
>>> multinomial_auc_table
>>> multinomial_auc_table1 = gbm.multinomial_auc_table(train=True,
...                                                    valid=True,
...                                                    xval=True)
>>> multinomial_auc_table1
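
One common way to extend AUC to several classes is one-vs-rest: each class in turn is treated as the positive class. The sketch below shows that idea in plain Python under stated assumptions. It does not reproduce the layout or the full contents of the table returned by H2O, and binary_auc, one_vs_rest_auc, and the data are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for flag, s in zip(labels, scores) if flag]
    neg = [s for flag, s in zip(labels, scores) if not flag]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(actual, class_probs):
    """Return {class: AUC of that class against all other classes}."""
    classes = sorted(class_probs[0])
    return {c: binary_auc([a == c for a in actual],
                          [p[c] for p in class_probs])
            for c in classes}


# Made-up true classes and per-class predicted probabilities.
actual = ["4", "6", "8", "4"]
class_probs = [
    {"4": 0.7, "6": 0.2, "8": 0.1},
    {"4": 0.5, "6": 0.4, "8": 0.1},
    {"4": 0.3, "6": 0.6, "8": 0.1},
    {"4": 0.2, "6": 0.3, "8": 0.5},
]
auc = one_vs_rest_auc(actual, class_probs)
macro_auc = sum(auc.values()) / len(auc)  # unweighted average over classes
```
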
multinomial_aucpr_table(train=False, valid=False, xval=False)[source]

Retrieve the multinomial PR AUC table.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the multinomial_aucpr_table for the training data.

  • valid (bool) – If True, return the multinomial_aucpr_table for the validation data.

  • xval (bool) – If True, return the multinomial_aucpr_table for each of the cross-validated splits.

Returns

The multinomial_aucpr_table values for the specified key(s).

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> distribution = "multinomial"
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> multinomial_aucpr_table = gbm.multinomial_aucpr_table() # <- Default: return training metric
>>> multinomial_aucpr_table
>>> multinomial_aucpr_table1 = gbm.multinomial_aucpr_table(train=True,
...                                                        valid=True,
...                                                        xval=True)
>>> multinomial_aucpr_table1
plot(timestep='AUTO', metric='AUTO', **kwargs)[source]

Plots training set (and validation set if available) scoring history for an H2OMultinomialModel. The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep – A unit of measurement for the x-axis. This can be AUTO, duration, or number_of_trees.

  • metric – A unit of measurement for the y-axis. This can be AUTO, logloss, classification_error, or rmse.

Returns

A scoring history plot.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> distribution = "multinomial"
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(metric="AUTO", timestep="AUTO")

Regression

class h2o.model.regression.H2ORegressionModel[source]

Bases: h2o.model.model_base.ModelBase

plot(timestep='AUTO', metric='AUTO', **kwargs)[source]

Plots training set (and validation set if available) scoring history for an H2ORegressionModel. The timestep and metric arguments are restricted to what is available in its scoring history.

Parameters
  • timestep – A unit of measurement for the x-axis.

  • metric – A unit of measurement for the y-axis.

Returns

A scoring history plot.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy"
>>> distribution = "gaussian"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> gbm.plot(timestep="AUTO", metric="AUTO")
residual_analysis_plot(frame, figsize=(16, 9))

Residual Analysis

Perform residual analysis and plot the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity, autocorrelation, etc. If you notice “striped” lines of residuals, that is just an indication that your response variable was integer valued instead of real valued.

Parameters
  • model – H2OModel

  • frame – H2OFrame

  • figsize – figure size; passed directly to matplotlib

Returns

a matplotlib figure object

Examples

>>> import h2o
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train a GBM
>>> gbm = H2OGradientBoostingEstimator()
>>> gbm.train(y=response, training_frame=train)
>>>
>>> # Create the residual analysis plot
>>> gbm.residual_analysis_plot(test)
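
The points of such a plot are (fitted value, residual) pairs. The plain-Python sketch below builds them for made-up data, taking the residual as actual minus fitted; that sign convention is an assumption here, and residual_pairs is a hypothetical helper, not part of H2O.

```python
def residual_pairs(actual, fitted):
    """Return (fitted value, residual) pairs, the points of a residual plot."""
    return [(f, a - f) for a, f in zip(actual, fitted)]


# Made-up integer-valued responses and model predictions.
actual = [5.0, 6.0, 5.0, 7.0]
fitted = [5.5, 5.5, 5.0, 6.0]
pairs = residual_pairs(actual, fitted)

# With an integer response, fitted + residual is always an integer, so the
# points fall on parallel lines: the "striped" pattern mentioned above.
on_stripes = all((f + r).is_integer() for f, r in pairs)
```
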
h2o.model.regression.h2o_explained_variance_score(y_actual, y_predicted, weights=None)[source]

Explained variance regression score function.

Parameters
  • y_actual – H2OFrame of actual response.

  • y_predicted – H2OFrame of predicted response.

  • weights – (Optional) sample weights

Returns

the explained variance score.
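
As a worked illustration, the explained variance score is commonly defined as 1 - Var(actual - predicted) / Var(actual). The sketch below computes that on plain Python lists rather than H2OFrames; the helper names and data are made up, and this is not H2O's implementation.

```python
def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def explained_variance(y_actual, y_predicted, weights=None):
    """1 - Var(actual - predicted) / Var(actual), optionally weighted."""
    weights = weights or [1.0] * len(y_actual)
    diff = [a - p for a, p in zip(y_actual, y_predicted)]
    diff_mean = weighted_mean(diff, weights)
    y_mean = weighted_mean(y_actual, weights)
    var_diff = weighted_mean([(d - diff_mean) ** 2 for d in diff], weights)
    var_y = weighted_mean([(a - y_mean) ** 2 for a in y_actual], weights)
    return 1.0 - var_diff / var_y


# Made-up responses and predictions.
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]
score = explained_variance(y_actual, y_predicted)
```

A score of 1.0 means the errors have no variance at all; a constant bias in the predictions does not lower this score.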

h2o.model.regression.h2o_mean_absolute_error(y_actual, y_predicted, weights=None)[source]

Mean absolute error regression loss.

Parameters
  • y_actual – H2OFrame of actual response.

  • y_predicted – H2OFrame of predicted response.

  • weights – (Optional) sample weights

Returns

mean absolute error loss (best is 0.0).

h2o.model.regression.h2o_mean_squared_error(y_actual, y_predicted, weights=None)[source]

Mean squared error regression loss.

Parameters
  • y_actual – H2OFrame of actual response.

  • y_predicted – H2OFrame of predicted response.

  • weights – (Optional) sample weights

Returns

mean squared error loss (best is 0.0).

h2o.model.regression.h2o_median_absolute_error(y_actual, y_predicted)[source]

Median absolute error regression loss.

Parameters
  • y_actual – H2OFrame of actual response.

  • y_predicted – H2OFrame of predicted response.

Returns

median absolute error loss (best is 0.0)
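
The mean absolute, mean squared, and median absolute error losses above can be illustrated on plain Python lists. This is a sketch of the standard definitions, not H2O's code, which operates on H2OFrames; the function names and data are made up.

```python
from statistics import median


def mean_absolute_error(y_actual, y_predicted, weights=None):
    weights = weights or [1.0] * len(y_actual)
    total = sum(w * abs(a - p)
                for a, p, w in zip(y_actual, y_predicted, weights))
    return total / sum(weights)


def mean_squared_error(y_actual, y_predicted, weights=None):
    weights = weights or [1.0] * len(y_actual)
    total = sum(w * (a - p) ** 2
                for a, p, w in zip(y_actual, y_predicted, weights))
    return total / sum(weights)


def median_absolute_error(y_actual, y_predicted):
    return median([abs(a - p) for a, p in zip(y_actual, y_predicted)])


# Made-up responses and predictions.
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_actual, y_predicted)
mse = mean_squared_error(y_actual, y_predicted)
medae = median_absolute_error(y_actual, y_predicted)
```

The squared loss penalizes the single large error (1.0) more heavily than the absolute losses do, while the median ignores it entirely.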

h2o.model.regression.h2o_r2_score(y_actual, y_predicted, weights=1.0)[source]

R-squared (coefficient of determination) regression score function.

Parameters
  • y_actual – H2OFrame of actual response.

  • y_predicted – H2OFrame of predicted response.

  • weights – (Optional) sample weights

Returns

R-squared (best is 1.0, lower is worse).
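
For reference, the unweighted R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The following plain-Python sketch works through that formula on made-up lists; r2_score here is an illustrative helper, not the H2O function.

```python
def r2_score(y_actual, y_predicted):
    """Unweighted R-squared: 1 - SS_res / SS_tot."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    ss_tot = sum((a - y_mean) ** 2 for a in y_actual)
    return 1.0 - ss_res / ss_tot


# Made-up responses and predictions.
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]
r2 = r2_score(y_actual, y_predicted)

# Predicting the mean everywhere gives 0.0; worse predictions go negative.
baseline = r2_score(y_actual, [sum(y_actual) / len(y_actual)] * len(y_actual))
```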

Dimensionality Reduction

class h2o.model.dim_reduction.H2ODimReductionModel[source]

Bases: h2o.model.model_base.ModelBase

Dimension reduction model, such as PCA or GLRM.

archetypes()[source]

The archetypes (Y) of the GLRM model.

final_step()[source]

Get the final step size for the model.

num_iterations()[source]

Get the number of iterations that it took to converge or reach max iterations.

objective()[source]

Get the final value of the objective function.

proj_archetypes(test_data, reverse_transform=False)[source]

Convert archetypes of the model into original feature space.

Parameters
  • test_data (H2OFrame) – The dataset upon which the model was trained.

  • reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the projected archetypes.

Returns

model archetypes projected back into the original training data’s feature space.

reconstruct(test_data, reverse_transform=False)[source]

Reconstruct the training data from the model and impute all missing values.

Parameters
  • test_data (H2OFrame) – The dataset upon which the model was trained.

  • reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the reconstructed frame.

Returns

the approximate reconstruction of the training data.
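
To make the idea of reconstruction concrete: a low-rank model approximates the data matrix by the product of a row representation X and the archetypes Y, and the product has a value in every cell, including cells that were missing in the input. The sketch below assumes purely numeric data and a plain matrix product; the matrices and the matmul helper are made up and this is not H2O's implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


# Hypothetical rank-2 factors: 3 rows x 2 archetypes, 2 archetypes x 3 features.
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = [[2.0, 4.0, 6.0], [0.0, 2.0, 4.0]]

# Made-up training data with two missing cells (None).
data = [[2.0, 4.0, None], [0.0, None, 4.0], [1.0, 3.0, 5.0]]

reconstruction = matmul(x, y)
# Cells that were missing in the input receive the reconstructed value.
imputed = [[reconstruction[i][j] if value is None else value
            for j, value in enumerate(row)]
           for i, row in enumerate(data)]
```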

screeplot(type='barplot', server=False)[source]

Produce the scree plot.

Library matplotlib is required for this function.

Parameters
  • type (str) – either "barplot" or "lines".

  • server (bool) – If True, set server settings for matplotlib and do not show the graph.

varimp(use_pandas=False)[source]

Return the importance of components associated with a PCA model.

Parameters

use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame. (Default is False.)

Clustering Methods

class h2o.model.clustering.H2OClusteringModel[source]

Bases: h2o.model.model_base.ModelBase

The examples below assume the following import: from h2o.estimators.kmeans import H2OKMeansEstimator

betweenss(train=False, valid=False, xval=False)[source]

Get the between cluster sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the between cluster sum of squares value for the training data.

  • valid (bool) – If True, return the between cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.

Returns

The between cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> betweenss = km.betweenss() # <- Default: return training metrics
>>> betweenss
>>> betweenss3 = km.betweenss(train=False,
...                           valid=False,
...                           xval=True)
>>> betweenss3
centers()[source]

The centers for the KMeans model.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers()
centers_std()[source]

The standardized centers for the KMeans model.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.centers_std()
centroid_stats(train=False, valid=False, xval=False)[source]

Get the centroid statistics for each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the centroid statistic for the training data.

  • valid (bool) – If True, return the centroid statistic for the validation data.

  • xval (bool) – If True, return the centroid statistic for each of the cross-validated splits.

Returns

The centroid statistics for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> centroid_stats = km.centroid_stats() # <- Default: return training metrics
>>> centroid_stats
>>> centroid_stats1 = km.centroid_stats(train=True,
...                                     valid=False,
...                                     xval=False)
>>> centroid_stats1
num_iterations()[source]

Get the number of iterations it took to converge or reach max iterations.

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> km.num_iterations()
size(train=False, valid=False, xval=False)[source]

Get the sizes of each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the cluster sizes for the training data.

  • valid (bool) – If True, return the cluster sizes for the validation data.

  • xval (bool) – If True, return the cluster sizes for each of the cross-validated splits.

Returns

The cluster sizes for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> size = km.size() # <- Default: return training metrics
>>> size
>>> size1 = km.size(train=False,
...                 valid=False,
...                 xval=True)
>>> size1
tot_withinss(train=False, valid=False, xval=False)[source]

Get the total within cluster sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the total within cluster sum of squares value for the training data.

  • valid (bool) – If True, return the total within cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.

Returns

The total within cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> tot_withinss = km.tot_withinss() # <- Default: return training metrics
>>> tot_withinss
>>> tot_withinss2 = km.tot_withinss(train=True,
...                                 valid=False,
...                                 xval=True)
>>> tot_withinss2
totss(train=False, valid=False, xval=False)[source]

Get the total sum of squares.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the total sum of squares value for the training data.

  • valid (bool) – If True, return the total sum of squares value for the validation data.

  • xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.

Returns

The total sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> totss = km.totss() # <- Default: return training metrics
>>> totss
withinss(train=False, valid=False, xval=False)[source]

Get the within cluster sum of squares for each cluster.

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If True, return the within cluster sum of squares value for the training data.

  • valid (bool) – If True, return the within cluster sum of squares value for the validation data.

  • xval (bool) – If True, return the within cluster sum of squares value for each of the cross-validated splits.

Returns

The within cluster sum of squares values for the specified key(s).

Examples

>>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> km = H2OKMeansEstimator(k=3, nfolds=3)
>>> km.train(x=list(range(4)), training_frame=iris)
>>> withinss = km.withinss() # <- Default: return training metrics
>>> withinss
>>> withinss2 = km.withinss(train=True,
...                         valid=True,
...                         xval=True)
>>> withinss2
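
The sum-of-squares metrics in this class are related: the total sum of squares equals the between cluster sum of squares plus the total within cluster sum of squares. The plain-Python sketch below checks that identity on made-up 2-D points with a fixed, hypothetical cluster assignment. It works on raw coordinates and is not H2O's implementation.

```python
def mean_point(points):
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Made-up points, already grouped into two clusters.
clusters = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[5.0, 5.0], [7.0, 5.0], [6.0, 8.0]],
]
all_points = [p for cluster in clusters for p in cluster]
grand_mean = mean_point(all_points)
centers = [mean_point(cluster) for cluster in clusters]

# Within cluster sum of squares, one value per cluster.
withinss = [sum(sq_dist(p, center) for p in cluster)
            for cluster, center in zip(clusters, centers)]
tot_withinss = sum(withinss)
# Total sum of squares: distances of all points to the grand mean.
totss = sum(sq_dist(p, grand_mean) for p in all_points)
# Between cluster sum of squares: size-weighted distances of centers to the grand mean.
betweenss = sum(len(cluster) * sq_dist(center, grand_mean)
                for cluster, center in zip(clusters, centers))
```

A clustering that separates the data well has a large betweenss relative to totss.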

AutoEncoders

class h2o.model.autoencoder.H2OAutoEncoderModel[source]

Bases: h2o.model.model_base.ModelBase

anomaly(test_data, per_feature=False)[source]

Obtain the reconstruction error for the input test_data.

Parameters
  • test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.

  • per_feature (bool) – Whether to return the squared reconstruction error per feature. Otherwise, return the mean squared error.

Returns

the reconstruction error.

Examples

>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train = train[predictors]
>>> test = test[predictors]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[2],
...                                    l1=1e-5,
...                                    ignore_const_cols=False,
...                                    epochs=1)
>>> ae_model.train(x=predictors,training_frame=train)
>>> test_rec_error = ae_model.anomaly(test)
>>> test_rec_error
>>> test_rec_error_features = ae_model.anomaly(test, per_feature=True)
>>> test_rec_error_features
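
The difference between the two return modes can be illustrated for a single row in plain Python. This is only a sketch of the arithmetic, with made-up values for the input row and its reconstruction; reconstruction_error is a hypothetical helper, not H2O's code.

```python
def reconstruction_error(row, reconstructed, per_feature=False):
    """Squared error per feature, or its mean over the features of the row."""
    squared = [(a - b) ** 2 for a, b in zip(row, reconstructed)]
    if per_feature:
        return squared
    return sum(squared) / len(squared)


# Made-up input row and the autoencoder's hypothetical output for it.
row = [0.0, 1.0, 0.5, 0.0]
reconstructed = [0.0, 0.5, 0.5, 1.0]
mse = reconstruction_error(row, reconstructed)
by_feature = reconstruction_error(row, reconstructed, per_feature=True)
```

Rows with an unusually high error are the ones the autoencoder reconstructs poorly, which is what makes the error usable as an anomaly score.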

Word Embedding

class h2o.model.word_embedding.H2OWordEmbeddingModel[source]

Bases: h2o.model.model_base.ModelBase

Word embedding model.

find_synonyms(word, count=20)[source]

Find synonyms using a word2vec model.

Parameters
  • word (str) – A single word to find synonyms for.

  • count (int) – The first “count” synonyms will be returned.

Returns

the synonyms found for the given word, together with their similarity scores.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> from h2o.estimators.word2vec import H2OWord2vecEstimator
>>> w2v_model = H2OWord2vecEstimator(epochs = 10)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", count = 5)
>>> print(synonyms)
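
Conceptually, synonyms in an embedding space are the words whose vectors lie closest to the query word's vector. The sketch below ranks words by cosine similarity in plain Python; the choice of cosine similarity, the nearest_words helper, and the 2-D vectors are assumptions made for illustration and are not taken from H2O.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(embeddings, word, count):
    """Return the `count` most similar words as (word, score) pairs."""
    target = embeddings[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


# Made-up 2-D embeddings; real models use many more dimensions.
embeddings = {
    "teacher": [1.0, 0.0],
    "tutor": [0.9, 0.1],
    "instructor": [0.8, 0.3],
    "driver": [0.0, 1.0],
}
top = nearest_words(embeddings, "teacher", count=2)
```
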
to_frame()[source]

Converts a given word2vec model into H2OFrame.

Returns

a frame representing learned word embeddings.

Examples

>>> words = h2o.create_frame(rows=1000,cols=1,string_fraction=1.0,missing_fraction=0.0)
>>> embeddings = h2o.create_frame(rows=1000,cols=100,real_fraction=1.0,missing_fraction=0.0)
>>> word_embeddings = words.cbind(embeddings)
>>> from h2o.estimators.word2vec import H2OWord2vecEstimator
>>> w2v_model = H2OWord2vecEstimator(pre_trained=word_embeddings)
>>> w2v_model.train(training_frame=word_embeddings)
>>> w2v_frame = w2v_model.to_frame()
>>> word_embeddings.names = w2v_frame.names
>>> word_embeddings.as_data_frame().equals(w2v_frame.as_data_frame())
transform(words, aggregate_method)[source]

Transform words (or sequences of words) to vectors using a word2vec model.

Parameters
  • words (H2OFrame) – An H2OFrame made of a single column containing source words.

  • aggregate_method (str) – Specifies how to aggregate sequences of words. If method is ‘NONE’, then no aggregation is performed and each input word is mapped to a single word vector. If method is ‘AVERAGE’, then the input is treated as sequences of words delimited by NA. Each word of a sequence is internally mapped to a vector, and the vectors belonging to the same sequence are averaged and returned in the result.

Returns

an H2OFrame of word vectors: one per input word, or one per sequence when the vectors are averaged.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what",
...               "there","all","we","one","the","a","an","of","or","in","for","by","on",
...               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
...               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so"]
>>> words = job_titles.tokenize(" ")
>>> words = words[(words.isna()) | (~ words.isin(STOP_WORDS)),:] 
>>> from h2o.estimators.word2vec import H2OWord2vecEstimator
>>> w2v_model = H2OWord2vecEstimator(epochs = 10)
>>> w2v_model.train(training_frame=words)
>>> job_title_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
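
The ‘AVERAGE’ aggregation described above can be sketched in plain Python, with None standing in for the NA delimiter. This is an illustration only, not H2O's implementation: the helper names and the 2-D vectors are made up, and skipping words that have no vector is an assumption of this sketch.

```python
def mean_vector(vectors):
    if not vectors:
        return None  # an empty sequence has no average
    return [sum(coord) / len(vectors) for coord in zip(*vectors)]


def average_sequences(words, embeddings):
    """Average word vectors per sequence; None marks the end of a sequence."""
    results, current = [], []
    for word in words:
        if word is None:
            results.append(mean_vector(current))
            current = []
        elif word in embeddings:  # assumption: unknown words are skipped
            current.append(embeddings[word])
    if current:
        results.append(mean_vector(current))
    return results


# Made-up 2-D embeddings and a tokenized column holding two job titles.
embeddings = {
    "math": [1.0, 0.0],
    "teacher": [0.0, 1.0],
    "truck": [2.0, 2.0],
    "driver": [4.0, 0.0],
}
words = ["math", "teacher", None, "truck", "driver", None]
title_vectors = average_sequences(words, embeddings)
```

Each job title ends up as a single vector, which is the shape needed to use the embeddings as features for a downstream model.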