Modeling in H2O

ModelBase

This module implements the base model class. All model classes inherit from this class.

class h2o.model.model_base.ModelBase(dest_key, model_json, metrics_class)[source]

Bases: object

aic(train=False, valid=False)[source]

Get the AIC. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the AIC for training data.
  • valid – Return the AIC for the validation data.
Returns:

Retrieve the AIC for this set of metrics
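
The train/valid selection rule described for aic (and repeated for auc, giniCoef, and logloss) can be sketched in plain Python. The helper name pick_metrics and the dict layout are hypothetical, used only for illustration; they are not part of the h2o package.

```python
def pick_metrics(metrics, train=False, valid=False):
    """Select one metrics object following the documented train/valid rule.

    `metrics` is a hypothetical dict with "train" and "valid" keys.
    """
    # Neither flag set: fall back to the training metrics.
    if not train and not valid:
        return metrics["train"]
    # Only valid set, or both flags set: the validation metrics are used.
    if valid:
        return metrics["valid"]
    # Only train set.
    return metrics["train"]


example = {"train": {"aic": 120.5}, "valid": {"aic": 131.2}}
print(pick_metrics(example)["aic"])                          # training AIC
print(pick_metrics(example, train=True, valid=True)["aic"])  # validation AIC
```

Note that some accessors in this class (for example null_degrees_of_freedom) document a different choice when both flags are True, so this sketch covers only the rule as stated for aic.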

auc(train=False, valid=False)[source]

Get the AUC. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the AUC for training data.
  • valid – Return the AUC for the validation data.
Returns:

Retrieve the AUC for this set of metrics

biases(vector_id=0)[source]

Return the frame for the respective bias vector.

Parameters:
  • vector_id – An integer, ranging from 0 to the number of layers, that specifies the bias vector to return.
Returns:

An H2OFrame which represents the bias vector identified by vector_id.

coef()[source]
Returns: The coefficients for this model.
coef_norm()[source]
Returns: The normalized coefficients.
deepfeatures(test_data, layer)[source]

Return hidden layer details

Parameters:
  • test_data – Data to create a feature space on
  • layer – Zero-based index of the hidden layer
download_pojo(path='')[source]

Download the scoring POJO for this model to the directory specified by path (no trailing slash). If path is “”, the POJO is printed to the screen.

Parameters:
  • path – An absolute path to the directory where the POJO should be saved.
Returns:

None

giniCoef(train=False, valid=False)[source]

Get the Gini. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the Gini for training data.
  • valid – Return the Gini for the validation data.
Returns:

Retrieve the Gini coefficient for this set of metrics
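
As a rough illustration of how the AUC and Gini values relate, the sketch below computes a pairwise AUC and then applies the common definition Gini = 2 * AUC - 1. Both helper names are hypothetical, and the way H2O computes these values internally may differ; treat this as an assumption-laden sketch rather than the library's method.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive scored above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Common definition of the Gini coefficient in terms of the AUC."""
    return 2.0 * auc - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(labels, scores)
print(auc, gini_from_auc(auc))
```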

logloss(train=False, valid=False)[source]

Get the Log Loss. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the log loss for training data.
  • valid – Return the log loss for the validation data.
Returns:

Retrieve the log loss for this set of metrics
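
For reference, the standard binary log loss can be written in a few lines. This is a minimal sketch of the textbook formula; the function name log_loss and the clipping constant eps are choices made here for illustration and are not taken from the h2o package.

```python
import math


def log_loss(actual, predicted, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        # Clip probabilities away from 0 and 1 to keep log() finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(actual)


print(log_loss([1, 0], [0.5, 0.5]))  # uninformative predictions
print(log_loss([1, 0], [0.9, 0.1]))  # confident, correct predictions
```

Lower values indicate predicted probabilities that agree more closely with the observed labels.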

model_performance(test_data=None, train=False, valid=False)[source]

Generate model metrics for this model on test_data.

Parameters:
  • test_data – Data set against which model metrics are computed. Both train and valid arguments are ignored if test_data is not None.
  • train – Report the training metrics for the model. If the test_data is the training data, the training metrics are returned.
  • valid – Report the validation metrics for the model. If train and valid are True, then it defaults to True.
Returns:

An object of class H2OModelMetrics.

mse(train=False, valid=False)[source]
Parameters:
  • train – If train is True, then return the MSE value for the training data. If train and valid are both False, then return the training MSE.
  • valid – If valid is True, then return the MSE value for the validation data. If train and valid are both True, then return the validation MSE.
Returns:

The MSE for this regression model.

null_degrees_of_freedom(train=False, valid=False)[source]

Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.

Parameters:
  • train – Get the null dof for the training set. If both train and valid are False, then train is selected by default.
  • valid – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.
Returns:

Return the null dof, or None if it is not present.

null_deviance(train=False, valid=False)[source]

Retrieve the null deviance if this model has the attribute, or None otherwise.

Parameters:
  • train – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.
  • valid – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns: The null deviance, or None if it is not present.
pprint_coef()[source]

Pretty print the coefficients table (includes normalized coefficients).

Returns: None

predict(test_data)[source]

Predict on a dataset.

Parameters: test_data – Data to be predicted on.
Returns: A new H2OFrame filled with predictions.
r2(train=False, valid=False)[source]

Return the R^2 for this regression model.

The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.

Parameters:
  • train – If train is True, then return the R^2 value for the training data. If train and valid are both False, then return the training R^2.
  • valid – If valid is True, then return the R^2 value for the validation data. If train and valid are both True, then return the validation R^2.
Returns:

The R^2 for this regression model.
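
The definition R^2 = 1 - MSE/var can be worked through on a tiny data set. This is a sketch under one stated assumption: var is taken as the population variance of the actual response (dividing by n). The source only says var is sigma*sigma, so the exact variance estimator used by H2O is not confirmed here. All function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def population_variance(values):
    """Variance with divisor n (an assumption; see the note above)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def r_squared(actual, predicted):
    """R^2 = 1 - MSE / var, following the definition in the text."""
    return 1.0 - mean_squared_error(actual, predicted) / population_variance(actual)


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]
print(mean_squared_error(actual, predicted))  # 0.1875
print(population_variance(actual))            # 5.0
print(r_squared(actual, predicted))
```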

residual_degrees_of_freedom(train=False, valid=False)[source]

Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.

Parameters:
  • train – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.
  • valid – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.
Returns:

Return the residual dof, or None if it is not present.

residual_deviance(train=False, valid=False)[source]

Retrieve the residual deviance if this model has the attribute, or None otherwise.

Parameters:
  • train – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.
  • valid – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns:

Return the residual deviance, or None if it is not present.

score_history()[source]

Retrieve the model score history.

Returns: The score history (H2OTwoDimTable).

show()[source]

Print the innards of the model, regardless of its type.

Returns: None
summary()[source]

Print a detailed summary of the model.

Returns:
varimp(return_list=False)[source]

Pretty print the variable importances, or return them in a list.

Parameters:
  • return_list – If True, return the variable importances in a list (ordered from most important to least important). Each entry in the list is a 4-tuple of (variable, relative_importance, scaled_importance, percentage).
Returns:

None or an ordered list.
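
To make the 4-tuple layout concrete, the sketch below builds such a list from raw relative importances. It assumes that scaled_importance is the relative importance divided by the largest one and that percentage is the relative importance divided by the total; both are assumptions for illustration, not confirmed by this page, and build_varimp is a hypothetical helper.

```python
def build_varimp(relative):
    """Return (variable, relative, scaled, percentage) tuples, most important first."""
    top = max(relative.values())
    total = sum(relative.values())
    rows = [
        (name, rel, rel / top, rel / total)  # scaled and percentage are assumed definitions
        for name, rel in relative.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


rows = build_varimp({"age": 40.0, "income": 100.0, "zip": 60.0})
for variable, rel, scaled, pct in rows:
    print(variable, rel, scaled, pct)
```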

weights(matrix_id=0)[source]

Return the frame for the respective weight matrix.

Parameters:
  • matrix_id – An integer, ranging from 0 to the number of layers, that specifies the weight matrix to return.
Returns:

An H2OFrame which represents the weight matrix identified by matrix_id.

Binomial Classification

Binomial Models

class h2o.model.binomial.H2OBinomialModel(dest_key, model_json)[source]

Bases: h2o.model.model_base.ModelBase

Class for Binomial models.

F0point5(thresholds=None, train=False, valid=False)[source]

Get the F0.5 for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the F0.5 for training data.
  • valid – Return the F0.5 for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The F0.5 for the given set of thresholds.

F1(thresholds=None, train=False, valid=False)[source]

Get the F1 for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the F1 for training data.
  • valid – Return the F1 for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The F1 for the given set of thresholds.

F2(thresholds=None, train=False, valid=False)[source]

Get the F2 for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the F2 for training data.
  • valid – Return the F2 for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The F2 for the given set of thresholds.
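
F0.5, F1, and F2 are all instances of the general F-beta score, which weights recall beta times as heavily as precision. The sketch below uses the standard formula; the function name f_beta is hypothetical and the code is not taken from the h2o package.

```python
def f_beta(precision, recall, beta):
    """General F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention chosen here when both P and R are zero
    return (1.0 + b2) * precision * recall / denom


precision, recall = 0.5, 1.0
print(f_beta(precision, recall, 0.5))  # F0.5 leans toward precision
print(f_beta(precision, recall, 1.0))  # F1 is the harmonic mean
print(f_beta(precision, recall, 2.0))  # F2 leans toward recall
```

With recall higher than precision, as in this example, the scores increase from F0.5 to F2.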

accuracy(thresholds=None, train=False, valid=False)[source]

Get the accuracy for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the accuracy for training data.
  • valid – Return the accuracy for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The accuracy for the given set of thresholds.

confusion_matrix(metrics=None, thresholds=None, train=False, valid=False)[source]

Get the confusion matrix for the specified metrics/thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • metrics – A string (or list of strings) in {“min_per_class_accuracy”, “absolute_MCC”, “tnr”, “fnr”, “fpr”, “tpr”, “precision”, “accuracy”, “f0point5”, “f2”, “f1”}
  • thresholds – A value (or list of values) between 0 and 1
  • train – Return the confusion matrix for training data.
  • valid – Return the confusion matrix for the validation data.
Returns:

A list of ConfusionMatrix objects (if there is more than one to return), or a single ConfusionMatrix (if there is only one).
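
To show what a confusion matrix at a given threshold contains, the sketch below counts the four cells from labels and predicted probabilities. It assumes a row is predicted positive when its probability is greater than or equal to the threshold; that convention, and the helper name confusion_counts, are illustrative assumptions rather than H2O behavior.

```python
def confusion_counts(labels, probs, threshold):
    """Count true/false positives and negatives at one threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0  # assumed decision rule
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


labels = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
print(confusion_counts(labels, probs, 0.5))
```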

error(thresholds=None, train=False, valid=False)[source]

Get the error for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the error for training data.
  • valid – Return the error for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The error for the given set of thresholds.

fallout(thresholds=None, train=False, valid=False)[source]

Get the Fallout (AKA False Positive Rate) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the Fallout for training data.
  • valid – Return the Fallout for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The Fallout for the given set of thresholds.

find_idx_by_threshold(threshold, train=False, valid=False)[source]

Retrieve the index in this metric’s threshold list at which the given threshold is located. If both train and valid are False, the training metrics are used; if both are True, the validation metrics are used.

Parameters:
  • train – Search the threshold list of the training metrics.
  • valid – Search the threshold list of the validation metrics.
  • threshold – Find the index of this input threshold.
Returns:

The index. A ValueError is raised if no such index can be found.
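
The lookup-or-raise behavior described above can be sketched as follows. The helper name index_of_threshold and the tolerance-based comparison are assumptions made for illustration; how H2O actually matches thresholds is not stated on this page.

```python
def index_of_threshold(thresholds, threshold, tol=1e-9):
    """Return the position of `threshold` in `thresholds`, or raise ValueError."""
    for i, t in enumerate(thresholds):
        # Compare with a small tolerance to avoid float round-off surprises.
        if abs(t - threshold) <= tol:
            return i
    raise ValueError("threshold %r not found in the threshold list" % threshold)


thresholds = [0.9, 0.5, 0.1]
print(index_of_threshold(thresholds, 0.5))
try:
    index_of_threshold(thresholds, 0.3)
except ValueError as err:
    print("lookup failed:", err)
```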

find_threshold_by_max_metric(metric, train=False, valid=False)[source]

Get the threshold at which the given metric is maximal. If both train and valid are False, the training metrics are used; if both are True, the validation metrics are used.

Parameters:
  • train – Use the training metrics.
  • valid – Use the validation metrics.
  • metric – A string in {“min_per_class_accuracy”, “absolute_MCC”, “tnr”, “fnr”, “fpr”, “tpr”, “precision”, “error”, “accuracy”, “f0point5”, “f2”, “f1”}
Returns:

The threshold at which the given metric is maximum.
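
Finding the threshold that maximizes a metric amounts to an argmax over parallel lists of thresholds and metric values. The sketch below is a minimal illustration; threshold_at_max and the sample numbers are hypothetical.

```python
def threshold_at_max(thresholds, values):
    """Return the threshold whose metric value is largest (first one on ties)."""
    best = max(range(len(values)), key=lambda i: values[i])
    return thresholds[best]


thresholds = [0.9, 0.7, 0.5, 0.3]
f1_values = [0.40, 0.62, 0.71, 0.55]  # made-up F1 scores, one per threshold
print(threshold_at_max(thresholds, f1_values))
```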

fnr(thresholds=None, train=False, valid=False)[source]

Get the False Negative Rates for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the False Negative Rate for training data.
  • valid – Return the False Negative Rate for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The False Negative Rate for the given set of thresholds.

fpr(thresholds=None, train=False, valid=False)[source]

Get the False Positive Rates for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the False Positive Rate for training data.
  • valid – Return the False Positive Rate for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The False Positive Rate for the given set of thresholds.

max_per_class_error(thresholds=None, train=False, valid=False)[source]

Get the max per class error for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the max per class error for training data.
  • valid – Return the max per class error for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The max per class error for the given set of thresholds.

mcc(thresholds=None, train=False, valid=False)[source]

Get the MCC (Matthews correlation coefficient) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the MCC for training data.
  • valid – Return the MCC for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The MCC for the given set of thresholds.
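
For readers unfamiliar with the metric, the Matthews correlation coefficient can be computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name matthews_cc and the zero-denominator convention are choices made here, not taken from the h2o package.

```python
import math


def matthews_cc(tp, fp, tn, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention chosen here when a row or column is empty
    return (tp * tn - fp * fn) / denom


print(matthews_cc(tp=2, fp=1, tn=2, fn=1))
print(matthews_cc(tp=5, fp=0, tn=5, fn=0))  # perfect classifier
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).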

metric(metric, thresholds=None, train=False, valid=False)[source]

Get the metric value for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • metric – A string in {“min_per_class_accuracy”, “absolute_MCC”, “tnr”, “fnr”, “fpr”, “tpr”, “precision”, “error”, “accuracy”, “f0point5”, “f2”, “f1”}
  • train – Return the metric value for training data.
  • valid – Return the metric value for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The metric value.

missrate(thresholds=None, train=False, valid=False)[source]

Get the miss rate (AKA False Negative Rate) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the miss rate for training data.
  • valid – Return the miss rate for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The miss rate for the given set of thresholds.

plot(type='roc', train=False, valid=False, **kwargs)[source]

Produce the desired metric plot.

Parameters:
  • type – The type of metric plot (currently, only ROC is supported).
  • train – Plot the metric for training data.
  • valid – Plot the metric for the validation data.
  • show – If False, the plot is not shown. The matplotlib show method is blocking.
Returns:

None

precision(thresholds=None, train=False, valid=False)[source]

Get the precision for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the precision for training data.
  • valid – Return the precision for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The precision for the given set of thresholds.

recall(thresholds=None, train=False, valid=False)[source]

Get the Recall (AKA True Positive Rate) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the Recall for training data.
  • valid – Return the Recall for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The Recall for the given set of thresholds.

sensitivity(thresholds=None, train=False, valid=False)[source]

Get the sensitivity (AKA True Positive Rate or Recall) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the Sensitivity for training data.
  • valid – Return the Sensitivity for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The Sensitivity for the given set of thresholds.

specificity(thresholds=None, train=False, valid=False)[source]

Get the specificity (AKA True Negative Rate) for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the specificity for training data.
  • valid – Return the specificity for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The specificity for the given set of thresholds.

tnr(thresholds=None, train=False, valid=False)[source]

Get the True Negative Rate for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the True Negative Rate for training data.
  • valid – Return the True Negative Rate for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The True Negative Rate for the given set of thresholds.

tpr(thresholds=None, train=False, valid=False)[source]

Get the True Positive Rate for a set of thresholds. If both train and valid are False, the training value is returned; if both are True, the validation value is returned.

Parameters:
  • train – Return the True Positive Rate for training data.
  • valid – Return the True Positive Rate for the validation data.
  • thresholds – Must be a list (e.g. [0.01, 0.5, 0.99]). If None, the thresholds in this set of metrics are used.
Returns:

The True Positive Rate for the given set of thresholds.
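
Several of the methods above are aliases for the same quantity: recall and sensitivity are the True Positive Rate, fallout is the False Positive Rate, missrate is the False Negative Rate, and specificity is the True Negative Rate. The sketch below computes these rates from confusion-matrix counts using their standard definitions. The helper name binomial_rates is hypothetical, and reading the max per class error as the larger of the two class error rates is an assumption.

```python
def binomial_rates(tp, fp, tn, fn):
    """Standard rates derived from the four confusion-matrix counts."""
    positives = tp + fn   # rows whose actual class is positive
    negatives = tn + fp   # rows whose actual class is negative
    total = positives + negatives
    tpr = tp / positives  # recall, sensitivity
    fnr = fn / positives  # miss rate
    tnr = tn / negatives  # specificity
    fpr = fp / negatives  # fallout
    accuracy = (tp + tn) / total
    return {
        "tpr": tpr,
        "fnr": fnr,
        "tnr": tnr,
        "fpr": fpr,
        "precision": tp / (tp + fp),
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "max_per_class_error": max(fnr, fpr),  # assumed interpretation
    }


rates = binomial_rates(tp=8, fp=2, tn=6, fn=4)
for name, value in sorted(rates.items()):
    print(name, round(value, 4))
```

The complementary pairs always sum to one: tpr + fnr = 1 and tnr + fpr = 1.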

Multinomial Classification

Multinomial Models

class h2o.model.multinomial.H2OMultinomialModel(dest_key, model_json)[source]

Bases: h2o.model.model_base.ModelBase

confusion_matrix(data)[source]

Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.

hit_ratio_table(train=False, valid=False)[source]

Retrieve the Hit Ratios

Parameters:
  • train – Return the hit ratios for training data.
  • valid – Return the hit ratios for the validation data.
Returns:

The hit ratio table (H2OTwoDimTable).
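
A hit ratio at k is commonly defined as the fraction of rows whose actual class is among the k classes with the highest predicted probability. The sketch below computes such a table under that definition. The helper name hit_ratios and the input layout are hypothetical, and the exact contents of the H2OTwoDimTable returned by H2O are not described on this page.

```python
def hit_ratios(actual, class_probs, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k]."""
    ratios = []
    for k in range(1, max_k + 1):
        hits = 0
        for y, probs in zip(actual, class_probs):
            # Class indices ordered from most to least probable.
            ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
            if y in ranked[:k]:
                hits += 1
        ratios.append(hits / len(actual))
    return ratios


actual = [0, 2, 1]
class_probs = [
    [0.7, 0.2, 0.1],  # actual class 0 is ranked first
    [0.5, 0.3, 0.2],  # actual class 2 is ranked third
    [0.2, 0.3, 0.5],  # actual class 1 is ranked second
]
print(hit_ratios(actual, class_probs, max_k=3))
```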

Regression

Regression Models

class h2o.model.regression.H2ORegressionModel(dest_key, model_json)[source]

Bases: h2o.model.model_base.ModelBase

Class for Regression models.

Clustering Methods

Clustering Models

class h2o.model.clustering.H2OClusteringModel(dest_key, model_json)[source]

Bases: h2o.model.model_base.ModelBase

betweenss(train, valid)[source]

Get the between cluster sum of squares.

Parameters:
  • train – If train is True, then return the average between cluster sum of squares of clusters based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the average between cluster sum of squares of clusters based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

The average between cluster sum of squares for either the training or validation dataset.

centers()[source]
Returns: The centers for the kmeans model.
centers_std()[source]
Returns: The standardized centers for the kmeans model.
centroid_stats(train=False, valid=False)[source]

Get the centroid statistics for each cluster.

Parameters:
  • train – If train is True, then return the centroid statistics based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the centroid statistics based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

The centroid statistics on either the training or validation dataset.

num_iterations(train=False, valid=False)[source]

Get the number of iterations that it took to converge or reach max iterations.

Returns: The number of iterations (integer).
size(train=False, valid=False)[source]

Get the sizes of each cluster.

Parameters:
  • train – If train is True, then return the sizes of clusters based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the sizes of clusters based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

the sizes of clusters for either the training or validation dataset.

tot_withinss(train=False, valid=False)[source]

Get the total within cluster sum of squares.

Parameters:
  • train – If train is True, then return the average within cluster sum of squares of clusters based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the average within cluster sum of squares of clusters based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

The average within cluster sum of squares for either the training or validation dataset.

totss(train=False, valid=False)[source]

Get the total sum of squares to grand mean.

Parameters:
  • train – If train is True, then return the average cluster sum of squares of clusters based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the average cluster sum of squares of clusters based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

The average cluster sum of squares for either the training or validation dataset.

withinss(train=False, valid=False)[source]

Get the within cluster sum of squares for each cluster.

Parameters:
  • train – If train is True, then return the within cluster sum of squares for each cluster based on the training data. If both train and valid are False, then train=True is assumed.
  • valid – If valid is True, then return the within cluster sum of squares for each cluster based on the validation data. If both train and valid are True, then validation data is returned.
Returns:

The within cluster sum of squares for each cluster on either the training or validation dataset.
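
The relationship between these quantities can be checked on a toy data set: the total sum of squares to the grand mean equals the total within cluster sum of squares plus the between cluster sum of squares. The sketch below uses plain, unnormalized sums and hypothetical helper names; whether H2O reports sums or averages is not settled by this page, so treat the scaling as an assumption.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sums_of_squares(points, labels):
    """Compute withinss per cluster, tot_withinss, betweenss, and totss."""
    grand = centroid(points)
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    withinss = {}
    betweenss = 0.0
    for c, members in clusters.items():
        center = centroid(members)
        withinss[c] = sum(sq_dist(p, center) for p in members)
        betweenss += len(members) * sq_dist(center, grand)
    return {
        "withinss": withinss,
        "tot_withinss": sum(withinss.values()),
        "betweenss": betweenss,
        "totss": sum(sq_dist(p, grand) for p in points),
    }


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
print(sums_of_squares(points, labels))
```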

AutoEncoders

AutoEncoder Models

class h2o.model.autoencoder.H2OAutoEncoderModel(dest_key, model_json)[source]

Bases: h2o.model.model_base.ModelBase

Class for AutoEncoder models.

anomaly(test_data)[source]

Obtain the reconstruction error for the input test_data.

Parameters: test_data – The dataset upon which the reconstruction error is computed.
Returns: The reconstruction error.
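
A reconstruction error is often measured as the mean squared difference between each input row and the autoencoder's reconstruction of it. The sketch below shows that calculation on hand-written numbers. The helper name reconstruction_mse is hypothetical, and the per-row MSE definition is an assumption about what anomaly reports rather than something stated on this page.

```python
def reconstruction_mse(rows, reconstructed):
    """Per-row mean squared error between inputs and their reconstructions."""
    errors = []
    for original, rebuilt in zip(rows, reconstructed):
        squared = [(a - b) ** 2 for a, b in zip(original, rebuilt)]
        errors.append(sum(squared) / len(squared))
    return errors


rows = [[1.0, 2.0], [3.0, 4.0]]
reconstructed = [[1.0, 2.0], [2.0, 6.0]]  # made-up autoencoder output
print(reconstruction_mse(rows, reconstructed))
```

Rows with a large error are the ones the model reconstructs poorly, which is why this quantity is used as an anomaly score.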