Model Categories
-
class h2o.model.H2OAutoEncoderModel(*args, **kwargs)
Bases: h2o.model.model_base.ModelBase
-
anomaly(test_data, per_feature=False)
Obtain the reconstruction error for the input test_data.
Parameters: - test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.
- per_feature (bool) – Whether to return the squared reconstruction error per feature. Otherwise, return the mean squared error.
Returns: the reconstruction error.
Examples: >>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz") >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz") >>> predictors = list(range(0,784)) >>> resp = 784 >>> train = train[predictors] >>> test = test[predictors] >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh", ... hidden=[2], ... l1=1e-5, ... ignore_const_cols=False, ... epochs=1) >>> ae_model.train(x=predictors,training_frame=train) >>> test_rec_error = ae_model.anomaly(test) >>> test_rec_error >>> test_rec_error_features = ae_model.anomaly(test, per_feature=True) >>> test_rec_error_features
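The difference between the two return modes of anomaly can be sketched in plain Python. This is only an illustration of the definitions given above (squared error per feature versus mean squared error), not H2O's implementation; the helper name reconstruction_error and the sample values are hypothetical, and any preprocessing the model applies to its inputs is ignored.

```python
def reconstruction_error(original, reconstructed, per_feature=False):
    """Squared reconstruction error for a single row (illustrative helper)."""
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    if per_feature:
        # One squared error per input feature.
        return squared
    # Otherwise collapse to the mean squared error over all features.
    return sum(squared) / len(squared)


# Hypothetical input row and its reconstruction by an autoencoder.
row = [1.0, 0.0, 2.0]
rebuilt = [0.5, 0.0, 1.0]

per_feature_errors = reconstruction_error(row, rebuilt, per_feature=True)
mean_error = reconstruction_error(row, rebuilt)
```

Rows with a large mean error are the ones the autoencoder reconstructs poorly, which is why this value is commonly used as an anomaly score.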
-
-
class h2o.model.H2OBinomialModel(*args, **kwargs)
Bases: h2o.model.model_base.ModelBase
-
F0point5(thresholds=None, train=False, valid=False, xval=False)
Get the F0.5 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F0.5 value for the training data.
- valid (bool) – If True, return the F0.5 value for the validation data.
- xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.
Returns: The F0.5 values for the specified key(s).
Examples:
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> F0point5 = gbm.F0point5() # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True, valid=True, xval=True)
-
F1(thresholds=None, train=False, valid=False, xval=False)
Get the F1 value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F1 value for the training data.
- valid (bool) – If True, return the F1 value for the validation data.
- xval (bool) – If True, return the F1 value for each of the cross-validated splits.
Returns: The F1 values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.F1()# <- Default: return training metric value >>> gbm.F1(train=True, valid=True, xval=True)
-
F2(thresholds=None, train=False, valid=False, xval=False)
Get the F2 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F2 value for the training data.
- valid (bool) – If True, return the F2 value for the validation data.
- xval (bool) – If True, return the F2 value for each of the cross-validated splits.
Returns: The F2 values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.F2() # <- Default: return training metric value >>> gbm.F2(train=True, valid=True, xval=True)
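F0.5, F1, and F2 are all instances of the general F-beta score, which weights recall beta times as heavily as precision. The sketch below shows the standard formula on hypothetical confusion counts; f_beta is an illustrative helper, not part of the H2O API, and it assumes the counts are such that no denominator is zero.

```python
def f_beta(tp, fp, fn, beta):
    """General F-beta score from confusion counts (illustrative helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 0.8, recall = 0.5.
tp, fp, fn = 8, 2, 8

f0point5 = f_beta(tp, fp, fn, beta=0.5)  # favors precision
f1 = f_beta(tp, fp, fn, beta=1.0)        # balances precision and recall
f2 = f_beta(tp, fp, fn, beta=2.0)        # favors recall
```

Because precision exceeds recall in this example, the precision-weighted F0.5 is the largest of the three and the recall-weighted F2 is the smallest.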
-
accuracy(thresholds=None, train=False, valid=False, xval=False)
Get the accuracy for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the accuracy value for the training data.
- valid (bool) – If True, return the accuracy value for the validation data.
- xval (bool) – If True, return the accuracy value for each of the cross-validated splits.
Returns: The accuracy values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.accuracy() # <- Default: return training metric value >>> gbm.accuracy(train=True, valid=True, xval=True)
-
confusion_matrix(metrics=None, thresholds=None, train=False, valid=False, xval=False)
Get the confusion matrix for the specified metrics/thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - metrics – A string (or list of strings) among metrics listed in H2OBinomialModelMetrics.maximizing_metrics. Defaults to ‘f1’.
- thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.
- train (bool) – If True, return the confusion matrix value for the training data.
- valid (bool) – If True, return the confusion matrix value for the validation data.
- xval (bool) – If True, return the confusion matrix value for each of the cross-validated splits.
Returns: The confusion matrix values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.confusion_matrix() # <- Default: return training metric value >>> gbm.confusion_matrix(train=True, valid=True, xval=True)
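How a threshold turns predicted probabilities into a confusion matrix, and how accuracy and error follow from it, can be sketched as below. This is not H2O's implementation: the function confusion_at_threshold, the sample data, and the convention that a probability equal to the threshold counts as positive are all assumptions made for illustration.

```python
def confusion_at_threshold(actual, predicted_prob, threshold):
    """Return [[tn, fp], [fn, tp]] for binary labels (illustrative helper)."""
    tn = fp = fn = tp = 0
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= threshold else 0  # assumed tie-breaking rule
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]


# Hypothetical labels and predicted probabilities of the positive class.
actual = [0, 0, 1, 1, 1, 0]
probs = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

matrix = confusion_at_threshold(actual, probs, threshold=0.5)
(tn, fp), (fn, tp) = matrix
accuracy = (tp + tn) / len(actual)
error = 1 - accuracy
```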
-
error(thresholds=None, train=False, valid=False, xval=False)
Get the error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the error value for the training data.
- valid (bool) – If True, return the error value for the validation data.
- xval (bool) – If True, return the error value for each of the cross-validated splits.
Returns: The error values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.error() # <- Default: return training metric >>> gbm.error(train=True, valid=True, xval=True)
-
fallout(thresholds=None, train=False, valid=False, xval=False)
Get the fallout for a set of thresholds (aka False Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the fallout value for the training data.
- valid (bool) – If True, return the fallout value for the validation data.
- xval (bool) – If True, return the fallout value for each of the cross-validated splits.
Returns: The fallout values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.fallout() # <- Default: return training metric >>> gbm.fallout(train=True, valid=True, xval=True)
-
find_idx_by_threshold(threshold, train=False, valid=False, xval=False)
Retrieve the index in this metric’s threshold list at which the given threshold is located.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - threshold (float) – Threshold value to search for in the threshold list.
- train (bool) – If True, return the threshold index for the training data.
- valid (bool) – If True, return the threshold index for the validation data.
- xval (bool) – If True, return the threshold index for each of the cross-validated splits.
Returns: The threshold index for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", ... "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438, ... train=True) >>> idx_threshold
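A threshold lookup of this kind can be sketched as a search over a list of candidate thresholds. The version below returns the index of the closest entry; whether H2O falls back to the nearest threshold when there is no exact match is an assumption here, and both the helper name and the threshold list are hypothetical.

```python
def closest_threshold_index(thresholds, target):
    """Index of the entry in `thresholds` nearest to `target` (illustrative helper)."""
    diffs = [abs(t - target) for t in thresholds]
    return diffs.index(min(diffs))


# Hypothetical threshold list, ordered from high to low.
thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]

exact_idx = closest_threshold_index(thresholds, 0.5)
nearest_idx = closest_threshold_index(thresholds, 0.39)
```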
-
find_threshold_by_max_metric(metric, train=False, valid=False, xval=False)
Find the threshold at which the given metric is maximized.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.
- train (bool) – If True, return the maximizing threshold for the training data.
- valid (bool) – If True, return the maximizing threshold for the validation data.
- xval (bool) – If True, return the maximizing threshold for each of the cross-validated splits.
Returns: The maximizing threshold for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", ... "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> max_metric = gbm.find_threshold_by_max_metric(metric="f2", ... train=True) >>> max_metric
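Conceptually, finding the threshold for a maximizing metric is an argmax over a table of per-threshold scores. The sketch below uses a small hand-written table; the scores are made up for illustration, and threshold_maximizing is a hypothetical helper rather than an H2O function.

```python
def threshold_maximizing(rows, metric):
    """Return the threshold whose row has the largest value of `metric`."""
    best_row = max(rows, key=lambda row: row[metric])
    return best_row["threshold"]


# Hypothetical per-threshold scores (not produced by a real model).
scores = [
    {"threshold": 0.2, "f1": 0.60, "f2": 0.75},
    {"threshold": 0.5, "f1": 0.72, "f2": 0.70},
    {"threshold": 0.8, "f1": 0.55, "f2": 0.48},
]

best_for_f1 = threshold_maximizing(scores, "f1")
best_for_f2 = threshold_maximizing(scores, "f2")
```

Different metrics generally peak at different thresholds, which is why the metric name is a required argument.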
-
fnr(thresholds=None, train=False, valid=False, xval=False)
Get the False Negative Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the FNR value for the training data.
- valid (bool) – If True, return the FNR value for the validation data.
- xval (bool) – If True, return the FNR value for each of the cross-validated splits.
Returns: The FNR values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.fnr() # <- Default: return training metric >>> gbm.fnr(train=True, valid=True, xval=True)
-
fpr(thresholds=None, train=False, valid=False, xval=False)
Get the False Positive Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the FPR value for the training data.
- valid (bool) – If True, return the FPR value for the validation data.
- xval (bool) – If True, return the FPR value for each of the cross-validated splits.
Returns: The FPR values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.fpr() # <- Default: return training metric >>> gbm.fpr(train=True, valid=True, xval=True)
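The rate metrics on this page (fallout/FPR, miss rate/FNR, recall/sensitivity/TPR) are all ratios of confusion-matrix counts. The following sketch uses the standard definitions on hypothetical counts; the function name rates is illustrative and not part of H2O.

```python
def rates(tp, fp, tn, fn):
    """Standard binary classification rates from confusion counts."""
    return {
        "tpr": tp / (tp + fn),  # recall, sensitivity
        "fnr": fn / (tp + fn),  # miss rate
        "fpr": fp / (fp + tn),  # fallout
        "tnr": tn / (fp + tn),  # specificity
    }


# Hypothetical counts for 10 actual positives and 10 actual negatives.
example = rates(tp=8, fp=3, tn=7, fn=2)
```

Note that TPR and FNR sum to 1, as do FPR and TNR, so each pair carries the same information.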
-
gains_lift(train=False, valid=False, xval=False)
Get the Gains/Lift table for the specified metrics.
If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the Gains/Lift table for the training data.
- valid (bool) – If True, return the Gains/Lift table for the validation data.
- xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.
Returns: The Gains/Lift table for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.gains_lift() # <- Default: return training metric Gain/Lift table >>> gbm.gains_lift(train=True, valid=True, xval=True)
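For readers unfamiliar with Gains/Lift tables, the core idea can be sketched as follows: rank rows by predicted probability, split them into equal-sized groups, and track how many positives have been captured so far. This simplified version is not H2O's implementation; the function name, the column names, and the assumption that the row count divides evenly into groups are all illustrative.

```python
def simple_gains_lift(actual, predicted_prob, groups):
    """Cumulative capture rate and lift per group (illustrative helper)."""
    # Rank the actual labels by descending predicted probability.
    ranked = [y for _, y in sorted(zip(predicted_prob, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    total_positives = sum(ranked)
    overall_rate = total_positives / len(ranked)
    size = len(ranked) // groups  # assumes len(ranked) is divisible by groups
    table = []
    captured = 0
    for g in range(groups):
        captured += sum(ranked[g * size:(g + 1) * size])
        seen = (g + 1) * size
        table.append({
            "group": g + 1,
            "cumulative_capture_rate": captured / total_positives,
            "cumulative_lift": (captured / seen) / overall_rate,
        })
    return table


# Hypothetical labels and predicted probabilities.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

table = simple_gains_lift(actual, probs, groups=4)
```

A cumulative lift above 1 in the first groups means the model concentrates positives at the top of the ranking; by the last group the lift always returns to 1.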
-
max_per_class_error(thresholds=None, train=False, valid=False, xval=False)
Get the max per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the max per class error value for the training data.
- valid (bool) – If True, return the max per class error value for the validation data.
- xval (bool) – If True, return the max per class error value for each of the cross-validated splits.
Returns: The max per class error values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.max_per_class_error() # <- Default: return training metric value >>> gbm.max_per_class_error(train=True, valid=True, xval=True)
-
mcc(thresholds=None, train=False, valid=False, xval=False)
Get the MCC for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the MCC value for the training data.
- valid (bool) – If True, return the MCC value for the validation data.
- xval (bool) – If True, return the MCC value for each of the cross-validated splits.
Returns: The MCC values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.mcc() # <- Default: return training metric value >>> gbm.mcc(train=True, valid=True, xval=True)
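MCC here is the Matthews correlation coefficient, which ranges from -1 to 1 and uses all four cells of the confusion matrix. A minimal sketch of the standard formula on hypothetical counts follows; matthews_cc is an illustrative helper, not an H2O function, and it assumes no row or column of the confusion matrix is empty.

```python
import math


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Hypothetical counts.
partial = matthews_cc(tp=2, fp=1, tn=2, fn=1)
perfect = matthews_cc(tp=3, fp=0, tn=3, fn=0)
```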
-
mean_per_class_error(thresholds=None, train=False, valid=False, xval=False)
Get the mean per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the mean per class error value for the training data.
- valid (bool) – If True, return the mean per class error value for the validation data.
- xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.
Returns: The mean per class error values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.mean_per_class_error() # <- Default: return training metric >>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
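The mean and max per class errors both start from the error rate within each actual class. The sketch below computes them for the binary case from hypothetical confusion counts; per_class_errors is an illustrative helper and not part of the H2O API.

```python
def per_class_errors(tp, fp, tn, fn):
    """Error rate within each actual class of a binary problem."""
    negative_class_error = fp / (fp + tn)  # actual negatives predicted positive
    positive_class_error = fn / (fn + tp)  # actual positives predicted negative
    return [negative_class_error, positive_class_error]


# Hypothetical counts.
errors = per_class_errors(tp=8, fp=3, tn=7, fn=2)

mean_error = sum(errors) / len(errors)
max_error = max(errors)
```

Unlike overall error, these values are not dominated by the majority class, which makes them useful on imbalanced data.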
-
metric(metric, thresholds=None, train=False, valid=False, xval=False)
Get the metric value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - metric (str) – name of the metric to retrieve.
- thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).
- train (bool) – If True, return the metric value for the training data.
- valid (bool) – If True, return the metric value for the validation data.
- xval (bool) – If True, return the metric value for each of the cross-validated splits.
Returns: The metric values for the specified key(s).
Examples:
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> # thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99])
>>> thresholds = [0.01,0.5,0.99]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
-
missrate(thresholds=None, train=False, valid=False, xval=False)
Get the miss rate for a set of thresholds (aka False Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the miss rate value for the training data.
- valid (bool) – If True, return the miss rate value for the validation data.
- xval (bool) – If True, return the miss rate value for each of the cross-validated splits.
Returns: The miss rate values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.missrate() # <- Default: return training metric >>> gbm.missrate(train=True, valid=True, xval=True)
-
plot(timestep='AUTO', metric='AUTO', server=False, **kwargs)
Plot training set (and validation set if available) scoring history for an H2OBinomialModel.
The timestep and metric arguments are restricted to what is available in its scoring history.
Parameters: - timestep (str) – A unit of measurement for the x-axis.
- metric (str) – A unit of measurement for the y-axis.
- server (bool) – If True, then generate the image inline (using matplotlib’s “Agg” backend).
Examples:
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> myX = ["Origin", "Dest", "Distance", "UniqueCarrier",
...        "Month", "DayofMonth", "DayOfWeek"]
>>> myY = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> air_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                        ntrees=100,
...                                        max_depth=3,
...                                        learn_rate=0.01)
>>> air_gbm.train(x=myX,
...               y=myY,
...               training_frame=train,
...               validation_frame=valid)
>>> air_gbm.plot(type="roc", train=True, server=True)
>>> air_gbm.plot(type="roc", valid=True, server=True)
>>> perf = air_gbm.model_performance(valid)
>>> perf.plot(type="roc", server=True)
>>> perf.plot
-
precision(thresholds=None, train=False, valid=False, xval=False)
Get the precision for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the precision value for the training data.
- valid (bool) – If True, return the precision value for the validation data.
- xval (bool) – If True, return the precision value for each of the cross-validated splits.
Returns: The precision values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.precision() # <- Default: return training metric value >>> gbm.precision(train=True, valid=True, xval=True)
-
recall(thresholds=None, train=False, valid=False, xval=False)
Get the recall for a set of thresholds (aka True Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the recall value for the training data.
- valid (bool) – If True, return the recall value for the validation data.
- xval (bool) – If True, return the recall value for each of the cross-validated splits.
Returns: The recall values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.recall() # <- Default: return training metric >>> gbm.recall(train=True, valid=True, xval=True)
-
roc(train=False, valid=False, xval=False)
Return the coordinates of the ROC curve for a given set of data.
The coordinates are two-tuples containing the false positive rates as a list and true positive rates as a list. If all are False (default), then return the ROC coordinates for the training data. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.
Parameters: - train (bool) – If True, return the ROC value for the training data.
- valid (bool) – If True, return the ROC value for the validation data.
- xval (bool) – If True, return the ROC value for each of the cross-validated splits.
Returns: The ROC values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.roc() # <- Default: return training data >>> gbm.roc(train=True, valid=True, xval=True)
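To make the shape of the returned value concrete, here is a minimal plain-Python sketch that builds the same kind of two-tuple (a list of false positive rates and a list of true positive rates) from scores and labels. The helper names `roc_points` and `trapezoid_auc` and the toy data are assumptions for illustration; the thresholds H2O actually evaluates may differ.

```python
def roc_points(scores, labels):
    """Return (fprs, tprs), sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    fprs, tprs = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tprs.append(tp / pos)
        fprs.append(fp / neg)
    return fprs, tprs


def trapezoid_auc(fprs, tprs):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2.0
               for i in range(1, len(fprs)))


# Toy data (hypothetical).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

fprs, tprs = roc_points(scores, labels)
area = trapezoid_auc(fprs, tprs)
```

A two-tuple of this form can be passed directly to a plotting call such as `plt.plot(fprs, tprs)`.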
-
sensitivity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the sensitivity value for the training data.
- valid (bool) – If True, return the sensitivity value for the validation data.
- xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.
Returns: The sensitivity values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.sensitivity() # <- Default: return training metric >>> gbm.sensitivity(train=True, valid=True, xval=True)
-
specificity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the specificity for a set of thresholds (aka True Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the specificity value for the training data.
- valid (bool) – If True, return the specificity value for the validation data.
- xval (bool) – If True, return the specificity value for each of the cross-validated splits.
Returns: The specificity values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.specificity() # <- Default: return training metric >>> gbm.specificity(train=True, valid=True, xval=True)
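Specificity depends on the chosen threshold in the opposite direction from sensitivity. The sketch below, in plain Python with a hypothetical helper `specificity_at` and made-up data, evaluates TN / (TN + FP) at a few thresholds to show the trend; it is an illustration of the definition, not H2O's implementation.

```python
def specificity_at(threshold, scores, labels):
    """Specificity (true negative rate) = TN / (TN + FP) at one threshold."""
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy data (hypothetical): the negatives have scores 0.6, 0.3 and 0.1.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Raising the threshold rejects more rows, so specificity cannot decrease.
spec_by_threshold = {t: specificity_at(t, scores, labels) for t in (0.2, 0.5, 0.7)}
```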
-
tnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Negative Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the TNR value for the training data.
- valid (bool) – If True, return the TNR value for the validation data.
- xval (bool) – If True, return the TNR value for each of the cross-validated splits.
Returns: The TNR values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.tnr() # <- Default: return training metric >>> gbm.tnr(train=True, valid=True, xval=True)
-
tpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Positive Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the TPR value for the training data.
- valid (bool) – If True, return the TPR value for the validation data.
- xval (bool) – If True, return the TPR value for each of the cross-validated splits.
Returns: The TPR values for the specified key(s).
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <=.2] >>> response_col = "economy_20mpg" >>> distribution = "bernoulli" >>> predictors = ["displacement", "power", "weight", "acceleration", "year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(y=response_col, ... x=predictors, ... validation_frame=valid, ... training_frame=train) >>> gbm.tpr() # <- Default: return training metric >>> gbm.tpr(train=True, valid=True, xval=True)
-
-
class
h2o.model.
H2OClusteringModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
For examples: from h2o.estimators.kmeans import H2OKMeansEstimator
-
betweenss
(train=False, valid=False, xval=False)[source]¶ Get the between cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the between cluster sum of squares value for the training data.
- valid (bool) – If True, return the between cluster sum of squares value for the validation data.
- xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.
Returns: The between cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> betweenss = km.betweenss() # <- Default: return training metrics >>> betweenss >>> betweenss3 = km.betweenss(train=False, ... valid=False, ... xval=True) >>> betweenss3
-
centers
()[source]¶ The centers for the KMeans model.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers()
-
centers_std
()[source]¶ The standardized centers for the KMeans model.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers_std()
-
centroid_stats
(train=False, valid=False, xval=False)[source]¶ Get the centroid statistics for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the centroid statistic for the training data.
- valid (bool) – If True, return the centroid statistic for the validation data.
- xval (bool) – If True, return the centroid statistic for each of the cross-validated splits.
Returns: The centroid statistics for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> centroid_stats = km.centroid_stats() # <- Default: return training metrics >>> centroid_stats >>> centroid_stats1 = km.centroid_stats(train=True, ... valid=False, ... xval=False) >>> centroid_stats1
-
num_iterations
()[source]¶ Get the number of iterations it took to converge or reach max iterations.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.num_iterations()
-
size
(train=False, valid=False, xval=False)[source]¶ Get the sizes of each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the cluster sizes for the training data.
- valid (bool) – If True, return the cluster sizes for the validation data.
- xval (bool) – If True, return the cluster sizes for each of the cross-validated splits.
Returns: The cluster sizes for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> size = km.size() # <- Default: return training metrics >>> size >>> size1 = km.size(train=False, ... valid=False, ... xval=True) >>> size1
-
tot_withinss
(train=False, valid=False, xval=False)[source]¶ Get the total within cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the total within cluster sum of squares value for the training data.
- valid (bool) – If True, return the total within cluster sum of squares value for the validation data.
- xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.
Returns: The total within cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> tot_withinss = km.tot_withinss() # <- Default: return training metrics >>> tot_withinss >>> tot_withinss2 = km.tot_withinss(train=True, ... valid=False, ... xval=True) >>> tot_withinss2
-
totss
(train=False, valid=False, xval=False)[source]¶ Get the total sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the total sum of squares value for the training data.
- valid (bool) – If True, return the total sum of squares value for the validation data.
- xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.
Returns: The total sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> totss = km.totss() # <- Default: return training metrics >>> totss
-
withinss
(train=False, valid=False, xval=False)[source]¶ Get the within cluster sum of squares for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the within cluster sum of squares value for the training data.
- valid (bool) – If True, return the within cluster sum of squares value for the validation data.
- xval (bool) – If True, return the within cluster sum of squares value for each of the cross-validated splits.
Returns: The within cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> withinss = km.withinss() # <- Default: return training metrics >>> withinss >>> withinss2 = km.withinss(train=True, ... valid=True, ... xval=True) >>> withinss2
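The sum-of-squares metrics of this class are related by the usual identity: total sum of squares = total within cluster sum of squares + between cluster sum of squares. The plain-Python sketch below checks that identity on a tiny hand-made clustering. The data and helper names are hypothetical, and H2O may compute these quantities on standardized columns, so the numbers are not meant to match any H2O output.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two hypothetical clusters of 2-D points, already assigned.
clusters = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[5.0, 5.0], [7.0, 5.0], [6.0, 8.0]],
]
all_points = [p for c in clusters for p in c]
grand_mean = centroid(all_points)

# Per-cluster within sum of squares, then its total.
withinss = [sum(sq_dist(p, centroid(c)) for p in c) for c in clusters]
tot_withinss = sum(withinss)

# Between cluster sum of squares: size-weighted distance of each center to the grand mean.
betweenss = sum(len(c) * sq_dist(centroid(c), grand_mean) for c in clusters)

# Total sum of squares: distance of every point to the grand mean.
totss = sum(sq_dist(p, grand_mean) for p in all_points)
```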
-
-
class
h2o.model.
ConfusionMatrix
(cm, domains=None, table_header=None)[source]¶ Bases:
object
-
ROUND
= 4¶
-
-
class
h2o.model.
H2ODimReductionModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
Dimension reduction model, such as PCA or GLRM.
-
num_iterations
()[source]¶ Get the number of iterations that it took to converge or reach max iterations.
-
proj_archetypes
(test_data, reverse_transform=False)[source]¶ Convert archetypes of the model into original feature space.
Parameters: - test_data (H2OFrame) – The dataset upon which the model was trained.
- reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the projected archetypes.
Returns: model archetypes projected back into the original training data’s feature space.
-
reconstruct
(test_data, reverse_transform=False)[source]¶ Reconstruct the training data from the model and impute all missing values.
Parameters: - test_data (H2OFrame) – The dataset upon which the model was trained.
- reverse_transform (bool) – Whether the transformation of the training data during model-building should be reversed on the reconstructed frame.
Returns: the approximate reconstruction of the training data.
-
-
class
h2o.model.
MetricsBase
(*args, **kwargs)[source]¶ Bases:
h2o.model.metrics_base.MetricsBase
A parent class to house common metrics available for the various Metrics types.
The methods here are available across different model categories.
-
classmethod
make
(kvs)[source]¶ Factory method to instantiate a MetricsBase object from the list of key-value pairs.
-
-
class
h2o.model.
ModelBase
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
Base class for all models.
-
actual_params
¶ Dictionary of actual parameters of the model.
-
aic
(train=False, valid=False, xval=False)[source]¶ Get the AIC (Akaike Information Criterion).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the AIC value for the training data.
- valid (bool) – If valid is True, then return the AIC value for the validation data.
- xval (bool) – If xval is True, then return the AIC value for the cross validation data.
Returns: The AIC.
-
auc
(train=False, valid=False, xval=False)[source]¶ Get the AUC (Area Under Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the AUC value for the training data.
- valid (bool) – If valid is True, then return the AUC value for the validation data.
- xval (bool) – If xval is True, then return the AUC value for the cross validation data.
Returns: The AUC.
-
aucpr
(train=False, valid=False, xval=False)[source]¶ Get the AUCPR (Area Under the Precision-Recall Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the aucpr value for the training data.
- valid (bool) – If valid is True, then return the aucpr value for the validation data.
- xval (bool) – If xval is True, then return the aucpr value for the cross validation data.
Returns: The aucpr.
-
biases
(vector_id=0)[source]¶ Return the frame for the respective bias vector.
Param: vector_id: an integer, ranging from 0 to number of layers, that specifies the bias vector to return. Returns: an H2OFrame which represents the bias vector identified by vector_id
-
coef
()[source]¶ Return the coefficients which can be applied to the non-standardized data.
Note: standardize = True by default. If it is set to False, then coef() returns the coefficients that are fit directly.
-
coef_norm
()[source]¶ Return coefficients fitted on the standardized data (requires standardize = True, which is on by default).
These coefficients can be used to evaluate variable importance.
-
cross_validation_fold_assignment
()[source]¶ Obtain the cross-validation fold assignment for all rows in the training data.
Returns: H2OFrame
-
cross_validation_holdout_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.
This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.
Returns: H2OFrame
-
cross_validation_metrics_summary
()[source]¶ Retrieve Cross-Validation Metrics Summary.
Returns: The cross-validation metrics summary as an H2OTwoDimTable
-
cross_validation_models
()[source]¶ Obtain a list of cross-validation models.
Returns: list of H2OModel objects.
-
cross_validation_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.
Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.
Returns: list of H2OFrame objects.
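The relationship between the per-fold frames and the combined holdout predictions can be sketched in plain Python. Because each per-fold prediction vector is zero-filled outside its own holdout rows, summing the vectors element-wise recovers one holdout prediction per training row. The fold assignment and prediction values below are made up for illustration, and lists stand in for H2OFrames.

```python
# Hypothetical fold assignment for six training rows and three folds.
fold_assignment = [0, 1, 2, 0, 1, 2]
nfolds = 3

# Hypothetical out-of-sample prediction for each row, made by the model
# that did NOT see that row during training.
holdout_prediction = [0.11, 0.52, 0.93, 0.34, 0.75, 0.16]

# What a list of per-fold predictions looks like: full length, 0 fill-in
# everywhere except the rows held out for that fold.
per_fold = [
    [p if f == k else 0.0 for p, f in zip(holdout_prediction, fold_assignment)]
    for k in range(nfolds)
]

# Summing the per-fold vectors row by row gives the combined holdout predictions.
combined = [sum(column) for column in zip(*per_fold)]
```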
-
deepfeatures
(test_data, layer)[source]¶ Return hidden layer details.
Parameters: - test_data – Data to create a feature space on
- layer – Zero-based index of the hidden layer.
-
default_params
¶ Dictionary of the default parameters of the model.
-
download_model
(path='')[source]¶ Download an H2O Model object to disk.
Parameters: - model – The model object to download.
- path – a path to the directory where the model should be saved.
Returns: the path of the downloaded model
-
download_mojo
(path='.', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the model in MOJO format.
Parameters: - path – the path where MOJO file should be saved.
- get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.
- genmodel_name – Custom name of genmodel jar
Returns: name of the MOJO file written.
-
download_pojo
(path='', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the POJO for this model to the directory specified by path.
If path is an empty string, then dump the output to screen.
Parameters: - path – An absolute path to the directory where POJO should be saved.
- get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder path.
- genmodel_name – Custom name of genmodel jar
Returns: name of the POJO file written.
-
end_time
¶ Timestamp (milliseconds since 1970) when the model training was ended.
-
feature_frequencies
(test_data)[source]¶ Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.
Parameters: test_data (H2OFrame) – Data on which to calculate feature frequencies. Returns: A new H2OFrame made of feature frequencies.
-
full_parameters
¶ Dictionary of the full specification of all parameters.
-
get_xval_models
(key=None)[source]¶ Return a Model object.
Parameters: key – If None, return all cross-validated models; otherwise return the model that key points to. Returns: A model or list of models.
-
gini
(train=False, valid=False, xval=False)[source]¶ Get the Gini coefficient.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the Gini Coefficient value for the training data.
- valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.
- xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.
Returns: The Gini Coefficient for this binomial model.
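For a binary classifier, the Gini coefficient is commonly defined from the AUC as Gini = 2 * AUC - 1. The sketch below assumes that definition (it is not confirmed by this page) and computes the AUC by comparing every positive score with every negative score. The helper `auc_by_pairs` and the toy data are hypothetical.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_pairs(scores, labels)
gini = 2.0 * auc - 1.0  # assumed definition: 0 for a random ranking, 1 for a perfect one
```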
-
have_mojo
¶ True, if export to MOJO is possible
-
have_pojo
¶ True, if export to POJO is possible
-
key
¶
-
logloss
(train=False, valid=False, xval=False)[source]¶ Get the Log Loss.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the log loss value for the training data.
- valid (bool) – If valid is True, then return the log loss value for the validation data.
- xval (bool) – If xval is True, then return the log loss value for the cross validation data.
Returns: The log loss for this classification model.
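As a reminder of what the metric measures, here is a minimal plain-Python sketch of binary log loss, the mean of -(y * log(p) + (1 - y) * log(1 - p)) over all rows. The function name `log_loss`, the clipping constant `eps`, and the sample values are assumptions for illustration rather than H2O's implementation.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probs are predicted P(class == 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# An uninformative model (always 0.5) scores log(2) per row.
baseline = log_loss([1, 0], [0.5, 0.5])

# Confident and correct predictions score closer to zero.
confident = log_loss([1, 0], [0.9, 0.1])
```

Lower values are better, and a confident wrong prediction is penalized much more heavily than a hesitant one.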
-
mae
(train=False, valid=False, xval=False)[source]¶ Get the Mean Absolute Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the MAE value for the training data.
- valid (bool) – If valid is True, then return the MAE value for the validation data.
- xval (bool) – If xval is True, then return the MAE value for the cross validation data.
Returns: The MAE for this regression model.
-
mean_residual_deviance
(train=False, valid=False, xval=False)[source]¶ Get the Mean Residual Deviances.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.
- valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.
- xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.
Returns: The Mean Residual Deviance for this regression model.
-
model_id
¶ Model identifier.
-
model_performance
(test_data=None, train=False, valid=False, xval=False)[source]¶ Generate model metrics for this model on test_data.
Parameters: - test_data (H2OFrame) – Data set against which model metrics are computed. All three of the train, valid and xval arguments are ignored if test_data is not None.
- train (bool) – Report the training metrics for the model.
- valid (bool) – Report the validation metrics for the model.
- xval (bool) – Report the cross-validation metrics for the model. If train and valid are True, then it defaults to True.
Returns: An object of class H2OModelMetrics.
-
mse
(train=False, valid=False, xval=False)[source]¶ Get the Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the MSE value for the training data.
- valid (bool) – If valid is True, then return the MSE value for the validation data.
- xval (bool) – If xval is True, then return the MSE value for the cross validation data.
Returns: The MSE for this regression model.
-
ntrees_actual
()[source]¶ Returns the actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value a user set before building the model.
Type:
float
-
null_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the null dof, or None if it is not present.
-
null_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the null deviance if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the null deviance, or None if it is not present.
-
params
¶ Get the parameters and the actual/default values only.
Returns: A dictionary of parameters used to build this model.
-
partial_plot
(data, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_to_file=None, row_index=None)[source]¶ Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.
Parameters: - data (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.
- cols – Feature(s) for which partial dependence will be calculated.
- destination_key – A key reference to the created partial dependence tables in H2O.
- nbins – Number of bins used. For categorical columns make sure the number of bins exceeds the level count. If you enable include_na, the returned length will be nbins+1.
- weight_column – A string denoting which column of data should be used as the weight column.
- plot – A boolean specifying whether to plot partial dependence table.
- plot_stddev – A boolean specifying whether to add std err to partial dependence plot.
- figsize – Dimension/size of the returning plots, adjust to fit your output cells.
- server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.
- include_na – A boolean specifying whether missing value should be included in the Feature values.
- user_splits – a dictionary containing column names as key and user defined split values as value in a list.
- col_pairs_2dpdp – list containing pairs of column names for 2D pdp
- save_to_file – Fully qualified name of the image file the resulting plot should be saved to, e.g. ‘/home/user/pdpplot.png’. The ‘png’ postfix might be omitted. If the file already exists, it will be overwritten. The plot is only saved if plot = True.
- row_index – Row for which partial dependence will be calculated instead of the whole input frame.
Returns: Plot and list of calculated mean response tables for each feature requested.
-
pr_auc
(train=False, valid=False, xval=False)[source]¶ ModelBase.pr_auc is deprecated, please use ModelBase.aucpr instead.
-
predict
(test_data, custom_metric=None, custom_metric_func=None)[source]¶ Predict on a dataset.
Parameters: - test_data (H2OFrame) – Data on which to make predictions.
- custom_metric – custom evaluation function defined as a class reference; the class gets uploaded into the cluster
- custom_metric_func – custom evaluation function reference, e.g., the result of upload_custom_metric
Returns: A new H2OFrame of predictions.
-
predict_contributions
(test_data)[source]¶ Predict feature contributions - SHAP values on an H2O Model (only GBM and XGBoost models).
The returned H2OFrame has shape (#rows, #features + 1): there is a feature contribution column for each input feature, and the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. The raw prediction of a tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.
Note: Multinomial classification models are currently not supported.
Parameters: test_data (H2OFrame) – Data on which to calculate contributions. Returns: A new H2OFrame made of feature contributions.
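The additivity described above can be sketched with plain Python for a single row. The feature names, contribution values and the bias column name `BiasTerm` are assumptions made for illustration, and the logistic function is used as the inverse link on the assumption of a bernoulli model; none of the numbers come from an actual H2O run.

```python
import math

# One hypothetical row of a contributions frame: one value per input
# feature plus the bias term in the last column.
row = {"displacement": 0.42, "power": -0.13, "weight": -0.31, "BiasTerm": 0.25}

# Raw prediction = sum of the feature contributions and the bias term.
raw_prediction = sum(row.values())

# For a bernoulli model the raw prediction is on the link (log-odds) scale,
# so the inverse link (logistic function) turns it into a probability.
probability = 1.0 / (1.0 + math.exp(-raw_prediction))

# The feature with the largest absolute contribution for this row.
top_feature = max((k for k in row if k != "BiasTerm"), key=lambda k: abs(row[k]))
```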
-
predict_leaf_node_assignment
(test_data, type='Path')[source]¶ Predict on a dataset and return the leaf node assignment (only for tree-based models).
Parameters: - test_data (H2OFrame) – Data on which to make predictions.
- type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node of the tree to the node or by H2O’s internal node id. One of: "Path", "Node_ID" (default: "Path").
Returns: A new H2OFrame of predictions.
-
r2
(train=False, valid=False, xval=False)[source]¶ Return the R squared for this regression model.
Will return R^2 for GLM Models and will return NaN otherwise.
The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the R^2 value for the training data.
- valid (bool) – If valid is True, then return the R^2 value for the validation data.
- xval (bool) – If xval is True, then return the R^2 value for the cross validation data.
Returns: The R squared for this regression model.
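The formula R^2 = 1 - MSE/var can be checked by hand with a few lines of plain Python. The sketch assumes that var is the population variance of the actual response (dividing by n); whether H2O divides by n or n - 1 is not stated here, so treat that as an assumption. The helper name and data are hypothetical.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - MSE / var, with var taken as the population variance."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mean = sum(actual) / n
    var = sum((a - mean) ** 2 for a in actual) / n  # sigma * sigma
    return 1.0 - mse / var


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]  # only the last prediction is off, by 1

r2 = r_squared(actual, predicted)      # MSE = 0.25, var = 1.25
r2_perfect = r_squared(actual, actual)  # a perfect fit gives exactly 1
```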
-
residual_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the residual dof, or None if it is not present.
-
residual_deviance
(train=False, valid=False, xval=None)[source]¶ Retrieve the residual deviance if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the residual deviance, or None if it is not present.
-
rmse
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the RMSE value for the training data.
- valid (bool) – If valid is True, then return the RMSE value for the validation data.
- xval (bool) – If xval is True, then return the RMSE value for the cross validation data.
Returns: The RMSE for this regression model.
-
rmsle
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Squared Logarithmic Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the RMSLE value for the training data.
- valid (bool) – If valid is True, then return the RMSLE value for the validation data.
- xval (bool) – If xval is True, then return the RMSLE value for the cross validation data.
Returns: The RMSLE for this regression model.
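For reference, RMSE and RMSLE can be sketched with their textbook definitions as below. The helper names and inputs are hypothetical, and observation weights or missing values are not modeled, so this is not necessarily how H2O computes the metrics internally.

```python
import math


def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def rmsle(actual, predicted):
    """Like RMSE, but on log(1 + value); values must be greater than -1."""
    n = len(actual)
    squared = ((math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared) / n)
```

RMSLE penalizes relative rather than absolute differences, which is why it is often preferred for targets spanning several orders of magnitude.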
-
run_time
¶ Model training time in milliseconds
-
save_model_details
(path='', force=False)[source]¶ Save Model Details of an H2O Model in JSON Format to disk.
Parameters: - model – The model object to save.
- path – a path to save the model details at (hdfs, s3, local)
- force – if True overwrite destination directory in case it exists, or throw exception if set to False.
Returns str: the path of the saved model details
-
save_mojo
(path='', force=False)[source]¶ Save an H2O Model as MOJO (Model Object, Optimized) to disk.
Parameters: - model – The model object to save.
- path – a path to save the model at (hdfs, s3, local)
- force – if True overwrite destination directory in case it exists, or throw exception if set to False.
Returns str: the path of the saved model
-
score_history
()[source]¶ DEPRECATED. Use
scoring_history()
instead.
-
scoring_history
()[source]¶ Retrieve Model Score History.
Returns: The score history as an H2OTwoDimTable or a Pandas DataFrame.
-
staged_predict_proba
(test_data)[source]¶ Predict class probabilities at each stage of an H2O Model (only GBM models).
The output structure is analogous to the output of function predict_leaf_node_assignment. For each tree t and class c there will be a column Tt.Cc (e.g. T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class by combining the raw contributions of trees T1.Cc,..,Tt.Cc. Binomial models build the trees just for the first class and values in columns Tx.C1 thus correspond to the probability p0.
Parameters: test_data (H2OFrame) – Data on which to make predictions. Returns: A new H2OFrame of staged predictions.
-
start_time
¶ Timestamp (milliseconds since 1970) when the model training was started.
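Because the timestamps are expressed in milliseconds since 1970, they can be converted to Python datetime objects as in this small sketch. The helper name and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def millis_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)


# Hypothetical values standing in for a start timestamp and a duration.
start_ms = 1_500_000_000_000
run_ms = 2_500
started = millis_to_datetime(start_ms)
ended = millis_to_datetime(start_ms + run_ms)
```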
-
std_coef_plot
(num_of_features=None, server=False)[source]¶ Plot a GLM model’s standardized coefficient magnitudes.
Parameters: - num_of_features – the number of features shown in the plot.
- server –
?
Returns: None.
-
type
¶ The type of model built:
"classifier"
or"regressor"
or"unsupervised"
-
varimp
(use_pandas=False)[source]¶ Pretty print the variable importances, or return them in a list.
Parameters: use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame. Returns: A list or Pandas DataFrame.
-
varimp_plot
(num_of_features=None, server=False)[source]¶ Plot the variable importance for a trained model.
Parameters: - num_of_features – the number of features shown in the plot (default is 10 or all if less than 10).
- server –
?
Returns: None.
-
weights
(matrix_id=0)[source]¶ Return the frame for the respective weight matrix.
Parameters: matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return. Returns: an H2OFrame which represents the weight matrix identified by matrix_id
-
xvals
¶ Return a list of the cross-validated models.
Returns: A list of models.
-
-
class
h2o.model.
H2OModelFuture
(job, x)[source]¶ Bases:
object
A class representing a future H2O model (a model that may, or may not, be in the process of being built).
ModelBase
¶
-
class
h2o.model.model_base.
ModelBase
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
Base class for all models.
-
actual_params
¶ Dictionary of actual parameters of the model.
-
aic
(train=False, valid=False, xval=False)[source]¶ Get the AIC (Akaike Information Criterion).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the AIC value for the training data.
- valid (bool) – If valid is True, then return the AIC value for the validation data.
- xval (bool) – If xval is True, then return the AIC value for the cross validation data.
Returns: The AIC.
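The textbook definition of the Akaike Information Criterion is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood. The sketch below applies that formula directly; the function name is hypothetical, and how H2O counts parameters for a given model is not stated here.

```python
def aic(log_likelihood, num_params):
    """Textbook AIC: 2k - 2 ln(L), given ln(L) and the parameter count k."""
    return 2.0 * num_params - 2.0 * log_likelihood
```

Lower values indicate a better trade-off between fit and model complexity.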
-
auc
(train=False, valid=False, xval=False)[source]¶ Get the AUC (Area Under Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the AUC value for the training data.
- valid (bool) – If valid is True, then return the AUC value for the validation data.
- xval (bool) – If xval is True, then return the AUC value for the cross validation data.
Returns: The AUC.
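One way to read the AUC is as the probability that a randomly chosen positive row is scored higher than a randomly chosen negative row. The sketch below computes it by comparing all positive/negative pairs; the function name and data are hypothetical, and H2O may use a different (for example threshold-binned) computation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows, so this is
    meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```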
-
aucpr
(train=False, valid=False, xval=False)[source]¶ Get the AUCPR (Area Under the Precision-Recall Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the aucpr value for the training data.
- valid (bool) – If valid is True, then return the aucpr value for the validation data.
- xval (bool) – If xval is True, then return the aucpr value for the cross validation data.
Returns: The aucpr.
-
biases
(vector_id=0)[source]¶ Return the frame for the respective bias vector.
Parameters: vector_id – an integer, ranging from 0 to number of layers, that specifies the bias vector to return. Returns: an H2OFrame which represents the bias vector identified by vector_id
-
coef
()[source]¶ Return the coefficients which can be applied to the non-standardized data.
Note: standardize = True by default; if it is set to False, then coef() returns the coefficients which are fit directly.
-
coef_norm
()[source]¶ Return coefficients fitted on the standardized data (requires standardize = True, which is on by default).
These coefficients can be used to evaluate variable importance.
-
cross_validation_fold_assignment
()[source]¶ Obtain the cross-validation fold assignment for all rows in the training data.
Returns: H2OFrame
-
cross_validation_holdout_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.
This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.
Returns: H2OFrame
-
cross_validation_metrics_summary
()[source]¶ Retrieve Cross-Validation Metrics Summary.
Returns: The cross-validation metrics summary as an H2OTwoDimTable
-
cross_validation_models
()[source]¶ Obtain a list of cross-validation models.
Returns: list of H2OModel objects.
-
cross_validation_predictions
()[source]¶ Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.
Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.
Returns: list of H2OFrame objects.
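The relationship between the per-fold predictions and the combined holdout predictions can be sketched with plain Python lists standing in for H2OFrames. Each fold's list has one entry per training row, with 0 wherever the row was not in that fold's holdout set, so an element-wise sum recovers one holdout prediction per row. The function name and numbers are hypothetical.

```python
def combine_holdout_predictions(per_fold_predictions):
    """Element-wise sum of zero-filled per-fold prediction lists."""
    n_rows = len(per_fold_predictions[0])
    return [sum(fold[i] for fold in per_fold_predictions)
            for i in range(n_rows)]


# Two hypothetical folds over four training rows; rows 0 and 3 were held
# out in fold 1, rows 1 and 2 in fold 2.
fold_1 = [0.9, 0.0, 0.0, 0.2]
fold_2 = [0.0, 0.4, 0.7, 0.0]
holdout = combine_holdout_predictions([fold_1, fold_2])
```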
-
deepfeatures
(test_data, layer)[source]¶ Return hidden layer details.
Parameters: - test_data – Data to create a feature space on
- layer – 0 index hidden layer
-
default_params
¶ Dictionary of the default parameters of the model.
-
download_model
(path='')[source]¶ Download an H2O Model object to disk.
Parameters: - model – The model object to download.
- path – a path to the directory where the model should be saved.
Returns: the path of the downloaded model
-
download_mojo
(path='.', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the model in MOJO format.
Parameters: - path – the path where MOJO file should be saved.
- get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder
path
. - genmodel_name – Custom name of genmodel jar
Returns: name of the MOJO file written.
-
download_pojo
(path='', get_genmodel_jar=False, genmodel_name='')[source]¶ Download the POJO for this model to the directory specified by path.
If path is an empty string, then dump the output to screen.
Parameters: - path – An absolute path to the directory where POJO should be saved.
- get_genmodel_jar – if True, then also download h2o-genmodel.jar and store it in folder
path
. - genmodel_name – Custom name of genmodel jar
Returns: name of the POJO file written.
-
end_time
¶ Timestamp (milliseconds since 1970) when the model training was ended.
-
feature_frequencies
(test_data)[source]¶ Retrieve the number of occurrences of each feature for given observations on their respective paths in a tree ensemble model. Available for GBM, Random Forest and Isolation Forest models.
Parameters: test_data (H2OFrame) – Data on which to calculate feature frequencies. Returns: A new H2OFrame made of feature frequencies.
-
full_parameters
¶ Dictionary of the full specification of all parameters.
-
get_xval_models
(key=None)[source]¶ Return a Model object.
Parameters: key – If None, return all cross-validated models; otherwise return the model that key points to. Returns: A model or list of models.
-
gini
(train=False, valid=False, xval=False)[source]¶ Get the Gini coefficient.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the Gini Coefficient value for the training data.
- valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.
- xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.
Returns: The Gini Coefficient for this binomial model.
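For binary classifiers, the Gini coefficient is commonly related to the AUC by Gini = 2 * AUC - 1. The one-line sketch below uses that relationship; the function name is hypothetical, and the text above does not state which formula H2O applies.

```python
def gini_from_auc(auc):
    """Common relationship for binary classifiers: Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0
```

Under this relationship a random classifier (AUC 0.5) has a Gini of 0 and a perfect one (AUC 1.0) has a Gini of 1.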
-
have_mojo
¶ True, if export to MOJO is possible
-
have_pojo
¶ True, if export to POJO is possible
-
key
¶
-
logloss
(train=False, valid=False, xval=False)[source]¶ Get the Log Loss.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the log loss value for the training data.
- valid (bool) – If valid is True, then return the log loss value for the validation data.
- xval (bool) – If xval is True, then return the log loss value for the cross validation data.
Returns: The log loss for this model.
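The binary log loss can be sketched with its standard definition, the mean negative log-likelihood of the true labels under the predicted probabilities. The function name and the clipping constant eps are assumptions made for this illustration.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and P(class 1) values.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

A model that always predicts 0.5 scores ln(2), roughly 0.693, regardless of the labels.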
-
mae
(train=False, valid=False, xval=False)[source]¶ Get the Mean Absolute Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the MAE value for the training data.
- valid (bool) – If valid is True, then return the MAE value for the validation data.
- xval (bool) – If xval is True, then return the MAE value for the cross validation data.
Returns: The MAE for this regression model.
-
mean_residual_deviance
(train=False, valid=False, xval=False)[source]¶ Get the Mean Residual Deviance.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.
- valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.
- xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.
Returns: The Mean Residual Deviance for this regression model.
-
model_id
¶ Model identifier.
-
model_performance
(test_data=None, train=False, valid=False, xval=False)[source]¶ Generate model metrics for this model on test_data.
Parameters: - test_data (H2OFrame) – Data set for which model metrics shall be computed against. All three of train, valid and xval arguments are ignored if test_data is not None.
- train (bool) – Report the training metrics for the model.
- valid (bool) – Report the validation metrics for the model.
- xval (bool) – Report the cross-validation metrics for the model. If train and valid are True, then it defaults to True.
Returns: An object of class H2OModelMetrics.
-
mse
(train=False, valid=False, xval=False)[source]¶ Get the Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the MSE value for the training data.
- valid (bool) – If valid is True, then return the MSE value for the validation data.
- xval (bool) – If xval is True, then return the MSE value for the cross validation data.
Returns: The MSE for this regression model.
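The train/valid/xval convention shared by the metric methods can be mimicked with a small helper. This is only a sketch of the rule as described in the text: select_metrics and the sample values are hypothetical, and returning a bare value when exactly one flag is True is an assumption, since the text only specifies the cases of zero flags and more than one flag.

```python
def select_metrics(metrics, train=False, valid=False, xval=False):
    """Pick values from a dict keyed by 'train', 'valid' and 'xval'.

    No flag set: return the training value.
    One flag set: return that single value (assumed behavior).
    Several flags set: return a dict restricted to the requested keys.
    """
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [key for key, flag in flags if flag]
    if not requested:
        return metrics["train"]
    if len(requested) == 1:
        return metrics[requested[0]]
    return {key: metrics[key] for key in requested}


# Hypothetical MSE values for the three data sets.
mse_values = {"train": 0.10, "valid": 0.12, "xval": 0.13}
```

With these values, select_metrics(mse_values) yields the training value and select_metrics(mse_values, train=True, xval=True) yields a two-entry dictionary.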
-
ntrees_actual
()[source]¶ Returns the actual number of trees in a tree model. If early stopping is enabled, GBM can reset the ntrees value. In this case, the actual ntrees value is less than the original ntrees value the user set before building the model.
Type:
float
-
null_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the null dof, or None if it is not present.
-
null_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the null deviance if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the null deviance, or None if it is not present.
-
params
¶ Get the parameters and the actual/default values only.
Returns: A dictionary of parameters used to build this model.
-
partial_plot
(data, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_to_file=None, row_index=None)[source]¶ Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response.
Parameters: - data (H2OFrame) – An H2OFrame object used for scoring and constructing the plot.
- cols – Feature(s) for which partial dependence will be calculated.
- destination_key – A key reference to the created partial dependence tables in H2O.
- nbins – Number of bins used. For categorical columns make sure the number of bins exceeds the level count. If you enable add_missing_NA, the returned length will be nbins+1.
- weight_column – A string denoting which column of data should be used as the weight column.
- plot – A boolean specifying whether to plot partial dependence table.
- plot_stddev – A boolean specifying whether to add std err to partial dependence plot.
- figsize – Dimension/size of the returning plots, adjust to fit your output cells.
- server – Specify whether to activate matplotlib “server” mode. In this case, the plots are saved to a file instead of being rendered.
- include_na – A boolean specifying whether missing value should be included in the Feature values.
- user_splits – a dictionary containing column names as key and user defined split values as value in a list.
- col_pairs_2dpdp – list containing pairs of column names for 2D pdp
- save_to_file – Fully qualified name to an image file the resulting plot should be saved to, e.g. ‘/home/user/pdpplot.png’. The ‘png’ postfix might be omitted. If the file already exists, it will be overridden. Plot is only saved if plot = True.
- row_index – Row for which partial dependence will be calculated instead of the whole input frame.
Returns: Plot and list of calculated mean response tables for each feature requested.
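The idea behind a one-dimensional partial dependence table can be sketched without H2O: for each grid value of the chosen feature, overwrite that feature in every row, score the rows, and average the predictions. Everything in this sketch (partial_dependence, toy_model, the rows, and the grid) is hypothetical, and binning, weights, and missing values are not modeled.

```python
def partial_dependence(predict, rows, col, grid):
    """Return (grid value, mean prediction) pairs for feature `col`."""
    table = []
    for value in grid:
        # Replace the feature with the grid value in a copy of each row.
        preds = [predict({**row, col: value}) for row in rows]
        table.append((value, sum(preds) / len(preds)))
    return table


def toy_model(row):
    """Hypothetical stand-in for a trained model's prediction function."""
    return 2.0 * row["x1"] + row["x2"]


rows = [{"x1": 0.0, "x2": 1.0}, {"x1": 5.0, "x2": 3.0}]
pdp = partial_dependence(toy_model, rows, "x1", [0.0, 1.0, 2.0])
```

For this linear toy model the mean response rises by 2.0 per unit of x1, which matches the coefficient of x1.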
-
pr_auc
(train=False, valid=False, xval=False)[source]¶ ModelBase.pr_auc is deprecated, please use
ModelBase.aucpr
instead.
-
predict
(test_data, custom_metric=None, custom_metric_func=None)[source]¶ Predict on a dataset.
Parameters: - test_data (H2OFrame) – Data on which to make predictions.
- custom_metric – custom evaluation function defined as a class reference; the class gets uploaded into the cluster
- custom_metric_func – custom evaluation function reference, e.g., the result of upload_custom_metric
Returns: A new H2OFrame of predictions.
-
predict_contributions
(test_data)[source]¶ Predict feature contributions - SHAP values on an H2O Model (only GBM and XGBoost models).
The returned H2OFrame has shape (#rows, #features + 1): there is a feature contribution column for each input feature, and the last column is the model bias (same value for each row). The sum of the feature contributions and the bias term is equal to the raw prediction of the model. The raw prediction of a tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution the sum of the contributions is equal to the model prediction.
Note: Multinomial classification models are currently not supported.
Parameters: test_data (H2OFrame) – Data on which to calculate contributions. Returns: A new H2OFrame made of feature contributions.
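The additivity property described above can be checked by hand for a single row. In this sketch the contribution values and the bias are made up, and the logistic function is assumed as the inverse link for a bernoulli model; for a Gaussian model the raw sum would already be the prediction.

```python
import math


def raw_prediction(contributions, bias):
    """Sum of per-feature contributions plus the bias term."""
    return sum(contributions) + bias


def bernoulli_probability(raw):
    """Assumed inverse link for a bernoulli model: the logistic function."""
    return 1.0 / (1.0 + math.exp(-raw))


# Hypothetical contributions for one row with three features.
row_contributions = [0.5, -0.25, 0.25]
bias_term = -0.5
raw = raw_prediction(row_contributions, bias_term)
prob = bernoulli_probability(raw)
```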
-
predict_leaf_node_assignment
(test_data, type='Path')[source]¶ Predict on a dataset and return the leaf node assignment (only for tree-based models).
Parameters: - test_data (H2OFrame) – Data on which to make predictions.
- type (Enum) – How to identify the leaf node. Nodes can be identified either by a path from the root node
of the tree to the node or by H2O’s internal node id. One of:
"Path"
,"Node_ID"
(default:"Path"
).
Returns: A new H2OFrame of predictions.
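To make the notion of a path from the root to a leaf concrete, the sketch below walks a tiny hand-built tree. It assumes a path is written as a string of "L" (left) and "R" (right) steps; that encoding, the nested-dict tree, and the function name are all assumptions made for illustration and are not confirmed by the text above.

```python
def follow_path(tree, path):
    """Walk from the root along 'L'/'R' steps and return the leaf label."""
    node = tree
    for step in path:
        node = node["left"] if step == "L" else node["right"]
    return node["leaf"]


# Hypothetical tree: the root's left child is leaf A, the right child
# splits again into leaves B and C.
toy_tree = {
    "left": {"leaf": "A"},
    "right": {"left": {"leaf": "B"}, "right": {"leaf": "C"}},
}
```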
-
r2
(train=False, valid=False, xval=False)[source]¶ Return the R squared for this regression model.
Will return R^2 for GLM Models and will return NaN otherwise.
The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the R^2 value for the training data.
- valid (bool) – If valid is True, then return the R^2 value for the validation data.
- xval (bool) – If xval is True, then return the R^2 value for the cross validation data.
Returns: The R squared for this regression model.
-
residual_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the residual dof, or None if it is not present.
-
residual_deviance
(train=False, valid=False, xval=None)[source]¶ Retrieve the residual deviance if this model has the attribute, or None otherwise.
Parameters: - train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.
- valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.
Returns: Return the residual deviance, or None if it is not present.
-
rmse
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Square Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the RMSE value for the training data.
- valid (bool) – If valid is True, then return the RMSE value for the validation data.
- xval (bool) – If xval is True, then return the RMSE value for the cross validation data.
Returns: The RMSE for this regression model.
-
rmsle
(train=False, valid=False, xval=False)[source]¶ Get the Root Mean Squared Logarithmic Error.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If train is True, then return the RMSLE value for the training data.
- valid (bool) – If valid is True, then return the RMSLE value for the validation data.
- xval (bool) – If xval is True, then return the RMSLE value for the cross validation data.
Returns: The RMSLE for this regression model.
-
run_time
¶ Model training time in milliseconds
-
save_model_details
(path='', force=False)[source]¶ Save Model Details of an H2O Model in JSON Format to disk.
Parameters: - model – The model object to save.
- path – a path to save the model details at (hdfs, s3, local)
- force – if True overwrite destination directory in case it exists, or throw exception if set to False.
Returns str: the path of the saved model details
-
save_mojo
(path='', force=False)[source]¶ Save an H2O Model as MOJO (Model Object, Optimized) to disk.
Parameters: - model – The model object to save.
- path – a path to save the model at (hdfs, s3, local)
- force – if True overwrite destination directory in case it exists, or throw exception if set to False.
Returns str: the path of the saved model
-
score_history
()[source]¶ DEPRECATED. Use
scoring_history()
instead.
-
scoring_history
()[source]¶ Retrieve Model Score History.
Returns: The score history as an H2OTwoDimTable or a Pandas DataFrame.
-
staged_predict_proba
(test_data)[source]¶ Predict class probabilities at each stage of an H2O Model (only GBM models).
The output structure is analogous to the output of function predict_leaf_node_assignment. For each tree t and class c there will be a column Tt.Cc (e.g. T3.C1 for tree 3 and class 1). The value will be the corresponding predicted probability of this class by combining the raw contributions of trees T1.Cc,..,Tt.Cc. Binomial models build the trees just for the first class and values in columns Tx.C1 thus correspond to the probability p0.
Parameters: test_data (H2OFrame) – Data on which to make predictions. Returns: A new H2OFrame of staged predictions.
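The Tt.Cc column naming and the idea of a staged probability can be sketched as follows. The column ordering (tree-major), the initial value, and the use of the logistic function to turn accumulated raw contributions into a probability are assumptions for a binomial model; the function names and numbers are hypothetical.

```python
import math


def staged_column_names(n_trees, n_classes):
    """Names of the form Tt.Cc, listed tree by tree (assumed ordering)."""
    return ["T{}.C{}".format(t, c)
            for t in range(1, n_trees + 1)
            for c in range(1, n_classes + 1)]


def staged_probabilities(tree_contributions, init=0.0):
    """Probability after each tree, accumulating raw contributions.

    Assumes a logistic link applied to the running sum.
    """
    probs = []
    raw = init
    for contribution in tree_contributions:
        raw += contribution
        probs.append(1.0 / (1.0 + math.exp(-raw)))
    return probs


staged = staged_probabilities([1.0, -1.0])
```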
-
start_time
¶ Timestamp (milliseconds since 1970) when the model training was started.
-
std_coef_plot
(num_of_features=None, server=False)[source]¶ Plot a GLM model’s standardized coefficient magnitudes.
Parameters: - num_of_features – the number of features shown in the plot.
- server –
?
Returns: None.
-
type
¶ The type of model built:
"classifier"
or"regressor"
or"unsupervised"
-
varimp
(use_pandas=False)[source]¶ Pretty print the variable importances, or return them in a list.
Parameters: use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame. Returns: A list or Pandas DataFrame.
-
varimp_plot
(num_of_features=None, server=False)[source]¶ Plot the variable importance for a trained model.
Parameters: - num_of_features – the number of features shown in the plot (default is 10 or all if less than 10).
- server –
?
Returns: None.
-
weights
(matrix_id=0)[source]¶ Return the frame for the respective weight matrix.
Parameters: matrix_id – an integer, ranging from 0 to number of layers, that specifies the weight matrix to return. Returns: an H2OFrame which represents the weight matrix identified by matrix_id
-
xvals
¶ Return a list of the cross-validated models.
Returns: A list of models.
-
Binomial Classification
¶
-
class
h2o.model.binomial.
H2OBinomialModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
-
F0point5
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F0.5 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F0.5 value for the training data.
- valid (bool) – If True, return the F0.5 value for the validation data.
- xval (bool) – If True, return the F0.5 value for each of the cross-validated splits.
Returns: The F0.5 values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> F0point5 = gbm.F0point5()  # <- Default: return training metric value
>>> F0point5 = gbm.F0point5(train=True, valid=True, xval=True)
-
F1
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F1 value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F1 value for the training data.
- valid (bool) – If True, return the F1 value for the validation data.
- xval (bool) – If True, return the F1 value for each of the cross-validated splits.
Returns: The F1 values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F1()  # <- Default: return training metric value
>>> gbm.F1(train=True, valid=True, xval=True)
-
F2
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the F2 for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the F2 value for the training data.
- valid (bool) – If True, return the F2 value for the validation data.
- xval (bool) – If True, return the F2 value for each of the cross-validated splits.
Returns: The F2 values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.F2()  # <- Default: return training metric value
>>> gbm.F2(train=True, valid=True, xval=True)
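The F0.5, F1 and F2 scores are all instances of the general F-beta score, F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). The sketch below computes it from confusion-matrix counts at a single threshold; the function name and the counts are hypothetical, and zero denominators are not handled.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision is 0.8 and recall is 0.5.
f0point5 = f_beta(8, 2, 8, 0.5)
f1 = f_beta(8, 2, 8, 1.0)
f2 = f_beta(8, 2, 8, 2.0)
```

Because beta below 1 weights precision more heavily and beta above 1 weights recall more heavily, these counts give F0.5 > F1 > F2.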
-
accuracy
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the accuracy for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the accuracy value for the training data.
- valid (bool) – If True, return the accuracy value for the validation data.
- xval (bool) – If True, return the accuracy value for each of the cross-validated splits.
Returns: The accuracy values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.accuracy()  # <- Default: return training metric value
>>> gbm.accuracy(train=True, valid=True, xval=True)
-
confusion_matrix
(metrics=None, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the confusion matrix for the specified metrics/thresholds.
If all are False (default), then return the training metric value. If more than one options is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”
Parameters: - metrics – A string (or list of strings) among metrics listed in
H2OBinomialModelMetrics.maximizing_metrics
. Defaults to ‘f1’. - thresholds – A value (or list of values) between 0 and 1. If None, then the thresholds maximizing each provided metric will be used.
- train (bool) – If True, return the confusion matrix value for the training data.
- valid (bool) – If True, return the confusion matrix value for the validation data.
- xval (bool) – If True, return the confusion matrix value for each of the cross-validated splits.
Returns: The confusion matrix values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.confusion_matrix() # <- Default: return training metric value
>>> gbm.confusion_matrix(train=True, valid=True, xval=True)
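To make the role of a threshold concrete, here is a minimal sketch, independent of H2O, of how a binary confusion matrix can be tallied at one threshold. The helper name confusion_counts and the sample data are hypothetical, and the sketch assumes a row is predicted positive when its score is at or above the threshold, which may differ from H2O's internal convention.

```python
def confusion_counts(actual, scores, threshold):
    """Tally (tn, fp, fn, tp) for 0/1 labels at one threshold (illustrative only)."""
    tn = fp = fn = tp = 0
    for y, p in zip(actual, scores):
        # Assumption: scores at or above the threshold count as positive.
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


actual = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.60, 0.80, 0.40, 0.90, 0.30]
tn, fp, fn, tp = confusion_counts(actual, scores, threshold=0.5)
print(tn, fp, fn, tp)
```

Raising the threshold moves rows from the predicted-positive column to the predicted-negative column, which is why each metric in this class is reported per threshold.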
-
error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the error value for the training data.
- valid (bool) – If True, return the error value for the validation data.
- xval (bool) – If True, return the error value for each of the cross-validated splits.
Returns: The error values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.error() # <- Default: return training metric
>>> gbm.error(train=True, valid=True, xval=True)
-
fallout
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the fallout for a set of thresholds (aka False Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the fallout value for the training data.
- valid (bool) – If True, return the fallout value for the validation data.
- xval (bool) – If True, return the fallout value for each of the cross-validated splits.
Returns: The fallout values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fallout() # <- Default: return training metric
>>> gbm.fallout(train=True, valid=True, xval=True)
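Fallout is the false positive rate, so it is the complement of specificity. The following sketch shows that relationship using the textbook definitions; the function names and counts are hypothetical and are not H2O calls.

```python
def fallout_rate(fp, tn):
    """False positive rate: share of actual negatives predicted positive."""
    return fp / (fp + tn)


def specificity_rate(fp, tn):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (fp + tn)


fp, tn = 1, 3  # hypothetical counts taken at a single threshold
print(fallout_rate(fp, tn))      # 0.25
print(specificity_rate(fp, tn))  # 0.75
```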
-
find_idx_by_threshold
(threshold, train=False, valid=False, xval=False)[source]¶ Retrieve the index in this metric’s threshold list at which the given threshold is located.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - threshold (float) – Threshold value to search for in the threshold list.
- train (bool) – If True, return the threshold index for the training data.
- valid (bool) – If True, return the threshold index for the validation data.
- xval (bool) – If True, return the threshold index for each of the cross-validated splits.
Returns: The threshold index for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> idx_threshold = gbm.find_idx_by_threshold(threshold=0.39438,
...                                           train=True)
>>> idx_threshold
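As a rough illustration of looking up a threshold's position in a threshold list, the sketch below returns the index of the closest entry. This section does not say whether H2O requires an exact match or falls back to the nearest threshold, so the nearest-match behavior, the helper name find_idx_nearest, and the sample list are all assumptions.

```python
def find_idx_nearest(thresholds, threshold):
    """Return the index of the entry in `thresholds` closest to `threshold`."""
    return min(range(len(thresholds)),
               key=lambda i: abs(thresholds[i] - threshold))


# Hypothetical threshold list, ordered from highest to lowest.
thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]
idx = find_idx_nearest(thresholds, 0.32)
print(idx, thresholds[idx])
```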
-
find_threshold_by_max_metric
(metric, train=False, valid=False, xval=False)[source]¶ If all are False (default), then return the training metric value.
If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - metric (str) – A metric among the metrics listed in H2OBinomialModelMetrics.maximizing_metrics.
- train (bool) – If True, return the threshold that maximizes the metric for the training data.
- valid (bool) – If True, return the threshold that maximizes the metric for the validation data.
- xval (bool) – If True, return the threshold that maximizes the metric for each of the cross-validated splits.
Returns: The threshold that maximizes the metric for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight",
...               "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> max_metric = gbm.find_threshold_by_max_metric(metric="f2",
...                                               train=True)
>>> max_metric
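Conceptually, finding a threshold by maximum metric is an argmax over per-threshold metric values. The sketch below shows that idea in plain Python; the helper name and the F2 values are made up for illustration and do not come from an H2O model.

```python
def threshold_maximizing(thresholds, metric_values):
    """Return the threshold whose metric value is largest (first one on ties)."""
    best = max(range(len(thresholds)), key=lambda i: metric_values[i])
    return thresholds[best]


# Hypothetical thresholds and the F2 score observed at each of them.
thresholds = [0.2, 0.4, 0.6, 0.8]
f2_values = [0.61, 0.74, 0.70, 0.52]
best_threshold = threshold_maximizing(thresholds, f2_values)
print(best_threshold)
```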
-
fnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Negative Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the FNR value for the training data.
- valid (bool) – If True, return the FNR value for the validation data.
- xval (bool) – If True, return the FNR value for each of the cross-validated splits.
Returns: The FNR values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fnr() # <- Default: return training metric
>>> gbm.fnr(train=True, valid=True, xval=True)
-
fpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the False Positive Rates for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the FPR value for the training data.
- valid (bool) – If True, return the FPR value for the validation data.
- xval (bool) – If True, return the FPR value for each of the cross-validated splits.
Returns: The FPR values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.fpr() # <- Default: return training metric
>>> gbm.fpr(train=True, valid=True, xval=True)
-
gains_lift
(train=False, valid=False, xval=False)[source]¶ Get the Gains/Lift table for the specified metrics.
If all are False (default), then return the training metric Gains/Lift table. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the Gains/Lift table for the training data.
- valid (bool) – If True, return the Gains/Lift table for the validation data.
- xval (bool) – If True, return the Gains/Lift table for each of the cross-validated splits.
Returns: The Gains/Lift table for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.gains_lift() # <- Default: return training metric Gain/Lift table
>>> gbm.gains_lift(train=True, valid=True, xval=True)
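For readers unfamiliar with lift, the sketch below computes it for equal-sized groups of rows ranked by predicted score: lift is a group's positive rate divided by the overall positive rate. This is a simplified illustration with a hypothetical helper and made-up data; the grouping H2O uses to build its Gains/Lift table is not described in this section and may differ.

```python
def lift_by_group(actual, scores, n_groups):
    """Lift per group after ranking rows by descending score (illustrative only)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    overall_rate = sum(actual) / len(actual)
    size = len(actual) // n_groups  # assumes len(actual) is a multiple of n_groups
    lifts = []
    for g in range(n_groups):
        members = order[g * size:(g + 1) * size]
        group_rate = sum(actual[i] for i in members) / len(members)
        lifts.append(group_rate / overall_rate)
    return lifts


actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
lifts = lift_by_group(actual, scores, n_groups=2)
print(lifts)
```

A lift above 1 in the top group means the model concentrates positives among its highest-scored rows.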
-
max_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the max per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the max per class error value for the training data.
- valid (bool) – If True, return the max per class error value for the validation data.
- xval (bool) – If True, return the max per class error value for each of the cross-validated splits.
Returns: The max per class error values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.max_per_class_error() # <- Default: return training metric value
>>> gbm.max_per_class_error(train=True, valid=True, xval=True)
-
mcc
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the MCC for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the MCC value for the training data.
- valid (bool) – If True, return the MCC value for the validation data.
- xval (bool) – If True, return the MCC value for each of the cross-validated splits.
Returns: The MCC values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mcc() # <- Default: return training metric value
>>> gbm.mcc(train=True, valid=True, xval=True)
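MCC is the Matthews correlation coefficient. As a reminder of what the value represents, this sketch evaluates the standard formula from confusion-matrix counts; the function name and counts are hypothetical, and returning 0.0 for an empty row or column is a common convention assumed here rather than documented H2O behavior.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef(tp=2, tn=2, fp=1, fn=1))
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).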
-
mean_per_class_error
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the mean per class error for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold minimizing the error will be used.
- train (bool) – If True, return the mean per class error value for the training data.
- valid (bool) – If True, return the mean per class error value for the validation data.
- xval (bool) – If True, return the mean per class error value for each of the cross-validated splits.
Returns: The mean per class error values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.mean_per_class_error() # <- Default: return training metric
>>> gbm.mean_per_class_error(train=True, valid=True, xval=True)
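The mean and max per class errors both start from the error rate of each class taken separately. The sketch below computes them for a binary problem using textbook definitions; the helper name and counts are hypothetical and this is not H2O's implementation.

```python
def per_class_errors(tn, fp, fn, tp):
    """Return (negative-class error, positive-class error) for a binary model."""
    negative_error = fp / (fp + tn)  # actual negatives that were misclassified
    positive_error = fn / (fn + tp)  # actual positives that were misclassified
    return negative_error, positive_error


errors = per_class_errors(tn=3, fp=1, fn=1, tp=1)
mean_error = sum(errors) / len(errors)
max_error = max(errors)
print(errors, mean_error, max_error)
```

Unlike the overall error, both summaries weight the two classes equally, so they stay informative when one class is rare.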
-
metric
(metric, thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the metric value for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - metric (str) – name of the metric to retrieve.
- thresholds – If None, then the threshold maximizing the metric will be used (or minimizing it if the metric is an error).
- train (bool) – If True, return the metric value for the training data.
- valid (bool) – If True, return the metric value for the validation data.
- xval (bool) – If True, return the metric value for each of the cross-validated splits.
Returns: The metric values for the specified key(s).
Examples:
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> # thresholds parameter must be a list (e.g. [0.01, 0.5, 0.99])
>>> thresholds = [0.01,0.5,0.99]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> # allowable metrics are absolute_mcc, accuracy, precision,
>>> # f0point5, f1, f2, mean_per_class_accuracy, min_per_class_accuracy,
>>> # tns, fns, fps, tps, tnr, fnr, fpr, tpr, recall, sensitivity,
>>> # missrate, fallout, specificity
>>> gbm.metric(metric='tpr', thresholds=thresholds)
-
missrate
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the miss rate for a set of thresholds (aka False Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the miss rate value for the training data.
- valid (bool) – If True, return the miss rate value for the validation data.
- xval (bool) – If True, return the miss rate value for each of the cross-validated splits.
Returns: The miss rate values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.missrate() # <- Default: return training metric
>>> gbm.missrate(train=True, valid=True, xval=True)
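The miss rate is the false negative rate, so it is the complement of recall. The sketch below shows both from the same counts, using textbook definitions; the function names and counts are hypothetical and are not H2O calls.

```python
def miss_rate(fn, tp):
    """False negative rate: share of actual positives predicted negative."""
    return fn / (fn + tp)


def recall_rate(fn, tp):
    """True positive rate: share of actual positives predicted positive."""
    return tp / (fn + tp)


fn, tp = 1, 3  # hypothetical counts taken at a single threshold
print(miss_rate(fn, tp))    # 0.25
print(recall_rate(fn, tp))  # 0.75
```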
-
plot
(timestep='AUTO', metric='AUTO', server=False, **kwargs)[source]¶ Plot training set (and validation set if available) scoring history for an H2OBinomialModel.
The timestep and metric arguments are restricted to what is available in its scoring history.
Parameters: - timestep (str) – A unit of measurement for the x-axis.
- metric (str) – A unit of measurement for the y-axis.
- server (bool) – if True, then generate the image inline (using matplotlib’s “Agg” backend)
Examples:
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> myX = ["Origin", "Dest", "Distance", "UniqueCarrier",
...        "Month", "DayofMonth", "DayOfWeek"]
>>> myY = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> air_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                        ntrees=100,
...                                        max_depth=3,
...                                        learn_rate=0.01)
>>> air_gbm.train(x=myX,
...               y=myY,
...               training_frame=train,
...               validation_frame=valid)
>>> air_gbm.plot(type="roc", train=True, server=True)
>>> air_gbm.plot(type="roc", valid=True, server=True)
>>> perf = air_gbm.model_performance(valid)
>>> perf.plot(type="roc", server=True)
>>> perf.plot
-
precision
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the precision for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the precision value for the training data.
- valid (bool) – If True, return the precision value for the validation data.
- xval (bool) – If True, return the precision value for each of the cross-validated splits.
Returns: The precision values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.precision() # <- Default: return training metric value
>>> gbm.precision(train=True, valid=True, xval=True)
-
recall
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the recall for a set of thresholds (aka True Positive Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the recall value for the training data.
- valid (bool) – If True, return the recall value for the validation data.
- xval (bool) – If True, return the recall value for each of the cross-validated splits.
Returns: The recall values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.recall() # <- Default: return training metric
>>> gbm.recall(train=True, valid=True, xval=True)
-
roc
(train=False, valid=False, xval=False)[source]¶ Return the coordinates of the ROC curve for a given set of data.
The coordinates are two-tuples containing the false positive rates as a list and true positive rates as a list. If all are False (default), then return the ROC coordinates for the training data. If more than one ROC curve is requested, the data is returned as a dictionary of two-tuples.
Parameters: - train (bool) – If True, return the ROC value for the training data.
- valid (bool) – If True, return the ROC value for the validation data.
- xval (bool) – If True, return the ROC value for each of the cross-validated splits.
Returns: The ROC values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.roc() # <- Default: return training data
>>> gbm.roc(train=True, valid=True, xval=True)
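To illustrate the shape of the returned data, the sketch below builds a two-tuple of (false positive rates, true positive rates) by sweeping a threshold over the distinct scores. It is a plain-Python illustration with a hypothetical helper and made-up data, assumes both classes are present, and does not reproduce how H2O chooses its thresholds.

```python
def roc_points(actual, scores):
    """Return (fprs, tprs), sweeping thresholds from the highest score down."""
    positives = sum(actual)
    negatives = len(actual) - positives
    fprs, tprs = [], []
    for t in sorted(set(scores), reverse=True):
        # Assumption: scores at or above the threshold count as positive.
        tp = sum(1 for y, p in zip(actual, scores) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(actual, scores) if p >= t and y == 0)
        tprs.append(tp / positives)
        fprs.append(fp / negatives)
    return fprs, tprs


actual = [1, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2]
fprs, tprs = roc_points(actual, scores)
print(fprs)
print(tprs)
```

Plotting tprs against fprs gives the ROC curve; the two lists are aligned so that each index corresponds to one threshold.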
-
sensitivity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the sensitivity for a set of thresholds (aka True Positive Rate or Recall).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the sensitivity value for the training data.
- valid (bool) – If True, return the sensitivity value for the validation data.
- xval (bool) – If True, return the sensitivity value for each of the cross-validated splits.
Returns: The sensitivity values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.sensitivity() # <- Default: return training metric
>>> gbm.sensitivity(train=True, valid=True, xval=True)
-
specificity
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the specificity for a set of thresholds (aka True Negative Rate).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the specificity value for the training data.
- valid (bool) – If True, return the specificity value for the validation data.
- xval (bool) – If True, return the specificity value for each of the cross-validated splits.
Returns: The specificity values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.specificity() # <- Default: return training metric
>>> gbm.specificity(train=True, valid=True, xval=True)
-
tnr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Negative Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the TNR value for the training data.
- valid (bool) – If True, return the TNR value for the validation data.
- xval (bool) – If True, return the TNR value for each of the cross-validated splits.
Returns: The TNR values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tnr() # <- Default: return training metric
>>> gbm.tnr(train=True, valid=True, xval=True)
-
tpr
(thresholds=None, train=False, valid=False, xval=False)[source]¶ Get the True Positive Rate for a set of thresholds.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - thresholds – If None, then the threshold maximizing the metric will be used.
- train (bool) – If True, return the TPR value for the training data.
- valid (bool) – If True, return the TPR value for the validation data.
- xval (bool) – If True, return the TPR value for each of the cross-validated splits.
Returns: The TPR values for the specified key(s).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "economy_20mpg"
>>> distribution = "bernoulli"
>>> predictors = ["displacement", "power", "weight", "acceleration", "year"]
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution,
...                                    fold_assignment="Random")
>>> gbm.train(y=response_col,
...           x=predictors,
...           validation_frame=valid,
...           training_frame=train)
>>> gbm.tpr() # <- Default: return training metric
>>> gbm.tpr(train=True, valid=True, xval=True)
-
Multinomial Classification
¶
-
class
h2o.model.multinomial.
H2OMultinomialModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
-
confusion_matrix
(data)[source]¶ Returns a confusion matrix based on H2O’s default prediction threshold for a dataset.
Parameters: data (H2OFrame) – the frame with the prediction results for which the confusion matrix should be extracted.
Examples:
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["cylinders"] = cars["cylinders"].asfactor()
>>> r = cars[0].runif()
>>> train = cars[r > .2]
>>> valid = cars[r <= .2]
>>> response_col = "cylinders"
>>> distribution = "multinomial"
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> gbm = H2OGradientBoostingEstimator(nfolds=3,
...                                    distribution=distribution)
>>> gbm.train(x=predictors,
...           y=response_col,
...           training_frame=train,
...           validation_frame=valid)
>>> confusion_matrix = gbm.confusion_matrix(train)
>>> confusion_matrix
-
hit_ratio_table
(train=False, valid=False, xval=False)[source]¶ Retrieve the Hit Ratios.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the hit ratio value for the training data.
- valid (bool) – If True, return the hit ratio value for the validation data.
- xval (bool) – If True, return the hit ratio value for the cross-validation data.
Returns: The hit ratio table for this multinomial model.
Examples: >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["cylinders"] = cars["cylinders"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "cylinders" >>> distribution = "multinomial" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution) >>> gbm.train(x=predictors, ... y=response_col, ... training_frame=train, ... validation_frame=valid) >>> hit_ratio_table = gbm.hit_ratio_table() # <- Default: return training metrics >>> hit_ratio_table >>> hit_ratio_table1 = gbm.hit_ratio_table(train=True, ... valid=True, ... xval=True) >>> hit_ratio_table1
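To make the hit ratio concrete, here is a minimal plain-Python sketch that does not use H2O. It assumes the usual top-k definition: the hit ratio at k is the fraction of rows whose actual class is among the k classes with the highest predicted probability. The function name `hit_ratios` and the toy data are hypothetical, and H2O's own computation may differ in details such as tie handling.

```python
def hit_ratios(probabilities, actual, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k].

    probabilities: list of rows, each a list of per-class probabilities.
    actual: list of true class indices, one per row.
    """
    hits = [0] * max_k
    for row, truth in zip(probabilities, actual):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        position = ranked.index(truth)  # 0 means the top guess was correct
        # A hit at rank `position` counts for every k greater than `position`.
        for k in range(position, max_k):
            hits[k] += 1
    return [h / len(actual) for h in hits]


probs = [
    [0.7, 0.2, 0.1],  # top guess: class 0
    [0.1, 0.3, 0.6],  # top guess: class 2
    [0.4, 0.5, 0.1],  # top guess: class 1
    [0.2, 0.3, 0.5],  # top guess: class 2
]
truth = [0, 1, 0, 0]
ratios = hit_ratios(probs, truth, max_k=3)
print(ratios)
```

The list is non-decreasing in k, and the last entry reaches 1.0 once k equals the number of classes.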
-
mean_per_class_error
(train=False, valid=False, xval=False)[source]¶ Retrieve the mean per class error across all classes.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the mean_per_class_error value for the training data.
- valid (bool) – If True, return the mean_per_class_error value for the validation data.
- xval (bool) – If True, return the mean_per_class_error value for each of the cross-validated splits.
Returns: The mean_per_class_error values for the specified key(s).
Examples: >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["cylinders"] = cars["cylinders"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "cylinders" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> distribution = "multinomial" >>> gbm = H2OGradientBoostingEstimator(nfolds=3, distribution=distribution) >>> gbm.train(x=predictors, ... y=response_col, ... training_frame=train, ... validation_frame=valid) >>> mean_per_class_error = gbm.mean_per_class_error() # <- Default: return training metric >>> mean_per_class_error >>> mean_per_class_error1 = gbm.mean_per_class_error(train=True, ... valid=True, ... xval=True) >>> mean_per_class_error1
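As an illustration of what this metric measures, the sketch below computes a mean per class error from a confusion matrix in plain Python. It assumes the common definition (the unweighted average of each class's misclassification rate); the helper name and the toy matrix are hypothetical, and skipping empty classes is a choice made here, not something this reference specifies.

```python
def mean_per_class_error_from_cm(confusion):
    """confusion[i][j] = number of rows with actual class i predicted as class j."""
    errors = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # class absent from the data; skipped (an assumption)
        errors.append(1.0 - row[i] / total)
    return sum(errors) / len(errors)


cm = [
    [8, 2, 0],  # class 0: 2 of 10 rows misclassified
    [1, 4, 0],  # class 1: 1 of 5 rows misclassified
    [1, 1, 2],  # class 2: 2 of 4 rows misclassified
]
mpce = mean_per_class_error_from_cm(cm)

# Overall error rate for comparison: it weights classes by their size.
wrong = sum(sum(row) - row[i] for i, row in enumerate(cm))
overall_error = wrong / sum(sum(row) for row in cm)
print(mpce, overall_error)
```

Because every class counts equally, the small class 2 pulls the mean per class error above the overall error rate.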
-
plot
(timestep='AUTO', metric='AUTO', **kwargs)[source]¶ Plots training set (and validation set if available) scoring history for an H2OMultinomialModel. The timestep and metric arguments are restricted to what is available in its scoring history.
Parameters: - timestep – A unit of measurement for the x-axis. This can be AUTO, duration, or number_of_trees.
- metric – A unit of measurement for the y-axis. This can be AUTO, logloss, classification_error, or rmse.
Returns: A scoring history plot.
Examples: >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["cylinders"] = cars["cylinders"].asfactor() >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "cylinders" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> distribution = "multinomial" >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution) >>> gbm.train(x=predictors, ... y=response_col, ... training_frame=train, ... validation_frame=valid) >>> gbm.plot(metric="AUTO", timestep="AUTO")
-
Regression
-
class
h2o.model.regression.
H2ORegressionModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
-
plot
(timestep='AUTO', metric='AUTO', **kwargs)[source]¶ Plots training set (and validation set if available) scoring history for an H2ORegressionModel. The timestep and metric arguments are restricted to what is available in its scoring history.
Parameters: - timestep – A unit of measurement for the x-axis.
- metric – A unit of measurement for the y-axis.
Returns: A scoring history plot.
Examples: >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> r = cars[0].runif() >>> train = cars[r > .2] >>> valid = cars[r <= .2] >>> response_col = "economy" >>> distribution = "gaussian" >>> predictors = ["displacement","power","weight","acceleration","year"] >>> gbm = H2OGradientBoostingEstimator(nfolds=3, ... distribution=distribution, ... fold_assignment="Random") >>> gbm.train(x=predictors, ... y=response_col, ... training_frame=train, ... validation_frame=valid) >>> gbm.plot(timestep="AUTO", metric="AUTO")
-
-
h2o.model.regression.
h2o_explained_variance_score
(y_actual, y_predicted, weights=None)[source]¶ Explained variance regression score function.
Parameters: - y_actual – H2OFrame of actual response.
- y_predicted – H2OFrame of predicted response.
- weights – (Optional) sample weights
Returns: the explained variance score.
-
h2o.model.regression.
h2o_mean_absolute_error
(y_actual, y_predicted, weights=None)[source]¶ Mean absolute error regression loss.
Parameters: - y_actual – H2OFrame of actual response.
- y_predicted – H2OFrame of predicted response.
- weights – (Optional) sample weights
Returns: mean absolute error loss (best is 0.0).
-
h2o.model.regression.
h2o_mean_squared_error
(y_actual, y_predicted, weights=None)[source]¶ Mean squared error regression loss.
Parameters: - y_actual – H2OFrame of actual response.
- y_predicted – H2OFrame of predicted response.
- weights – (Optional) sample weights
Returns: mean squared error loss (best is 0.0).
-
h2o.model.regression.
h2o_median_absolute_error
(y_actual, y_predicted)[source]¶ Median absolute error regression loss.
Parameters: - y_actual – H2OFrame of actual response.
- y_predicted – H2OFrame of predicted response.
Returns: median absolute error loss (best is 0.0)
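The three loss functions above can be illustrated on plain Python lists. This is a sketch of the standard definitions, not H2O's implementation: the real functions take H2OFrame arguments, and the helper names below are hypothetical. Normalizing by the sum of the weights is an assumption about how sample weights are applied.

```python
from statistics import median


def mean_absolute_error(y_actual, y_predicted, weights=None):
    weights = weights or [1.0] * len(y_actual)
    total = sum(w * abs(a - p) for a, p, w in zip(y_actual, y_predicted, weights))
    return total / sum(weights)  # weighted average (assumed normalization)


def mean_squared_error(y_actual, y_predicted, weights=None):
    weights = weights or [1.0] * len(y_actual)
    total = sum(w * (a - p) ** 2 for a, p, w in zip(y_actual, y_predicted, weights))
    return total / sum(weights)


def median_absolute_error(y_actual, y_predicted):
    return median(abs(a - p) for a, p in zip(y_actual, y_predicted))


actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]  # absolute errors: 0.5, 0.0, 2.0, 1.0

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
med = median_absolute_error(actual, predicted)
print(mae, mse, med)
```

The single large error (2.0) dominates the squared loss, moves the mean absolute error less, and barely affects the median.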
-
h2o.model.regression.
h2o_r2_score
(y_actual, y_predicted, weights=1.0)[source]¶ R-squared (coefficient of determination) regression score function.
Parameters: - y_actual – H2OFrame of actual response.
- y_predicted – H2OFrame of predicted response.
- weights – (Optional) sample weights
Returns: R-squared (best is 1.0, lower is worse).
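The explained variance score and R-squared are easy to confuse, so the following unweighted plain-Python sketch contrasts them. It assumes the textbook definitions (explained variance = 1 - Var(residuals) / Var(y), R-squared = 1 - SS_res / SS_tot); the helper names are hypothetical and the code does not call H2O.

```python
def _mean(values):
    return sum(values) / len(values)


def explained_variance(y_actual, y_predicted):
    residuals = [a - p for a, p in zip(y_actual, y_predicted)]
    res_mean, y_mean = _mean(residuals), _mean(y_actual)
    var_res = _mean([(r - res_mean) ** 2 for r in residuals])
    var_y = _mean([(a - y_mean) ** 2 for a in y_actual])
    return 1.0 - var_res / var_y


def r2(y_actual, y_predicted):
    y_mean = _mean(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    ss_tot = sum((a - y_mean) ** 2 for a in y_actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly +1

ev_score = explained_variance(actual, shifted)
r2_score = r2(actual, shifted)
print(ev_score, r2_score)
```

A constant offset leaves the explained variance at 1.0 because the residuals do not vary, while R-squared drops to about 0.2 because it also penalizes the bias.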
Clustering Methods
-
class
h2o.model.clustering.
H2OClusteringModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
The examples below assume the following import: from h2o.estimators.kmeans import H2OKMeansEstimator
-
betweenss
(train=False, valid=False, xval=False)[source]¶ Get the between cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the between cluster sum of squares value for the training data.
- valid (bool) – If True, return the between cluster sum of squares value for the validation data.
- xval (bool) – If True, return the between cluster sum of squares value for each of the cross-validated splits.
Returns: The between cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> betweenss = km.betweenss() # <- Default: return training metrics >>> betweenss >>> betweenss3 = km.betweenss(train=False, ... valid=False, ... xval=True) >>> betweenss3
-
centers
()[source]¶ The centers for the KMeans model.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers()
-
centers_std
()[source]¶ The standardized centers for the KMeans model.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.centers_std()
-
centroid_stats
(train=False, valid=False, xval=False)[source]¶ Get the centroid statistics for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the centroid statistic for the training data.
- valid (bool) – If True, return the centroid statistic for the validation data.
- xval (bool) – If True, return the centroid statistic for each of the cross-validated splits.
Returns: The centroid statistics for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> centroid_stats = km.centroid_stats() # <- Default: return training metrics >>> centroid_stats >>> centroid_stats1 = km.centroid_stats(train=True, ... valid=False, ... xval=False) >>> centroid_stats1
-
num_iterations
()[source]¶ Get the number of iterations it took to converge or reach max iterations.
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> km.num_iterations()
-
size
(train=False, valid=False, xval=False)[source]¶ Get the sizes of each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the cluster sizes for the training data.
- valid (bool) – If True, return the cluster sizes for the validation data.
- xval (bool) – If True, return the cluster sizes for each of the cross-validated splits.
Returns: The cluster sizes for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> size = km.size() # <- Default: return training metrics >>> size >>> size1 = km.size(train=False, ... valid=False, ... xval=True) >>> size1
-
tot_withinss
(train=False, valid=False, xval=False)[source]¶ Get the total within cluster sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the total within cluster sum of squares value for the training data.
- valid (bool) – If True, return the total within cluster sum of squares value for the validation data.
- xval (bool) – If True, return the total within cluster sum of squares value for each of the cross-validated splits.
Returns: The total within cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> tot_withinss = km.tot_withinss() # <- Default: return training metrics >>> tot_withinss >>> tot_withinss2 = km.tot_withinss(train=True, ... valid=False, ... xval=True) >>> tot_withinss2
-
totss
(train=False, valid=False, xval=False)[source]¶ Get the total sum of squares.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the total sum of squares value for the training data.
- valid (bool) – If True, return the total sum of squares value for the validation data.
- xval (bool) – If True, return the total sum of squares value for each of the cross-validated splits.
Returns: The total sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> totss = km.totss() # <- Default: return training metrics >>> totss
-
withinss
(train=False, valid=False, xval=False)[source]¶ Get the within cluster sum of squares for each cluster.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
Parameters: - train (bool) – If True, return the within cluster sum of squares values for the training data.
- valid (bool) – If True, return the within cluster sum of squares values for the validation data.
- xval (bool) – If True, return the within cluster sum of squares values for each of the cross-validated splits.
Returns: The within cluster sum of squares values for the specified key(s).
Examples: >>> iris = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> km = H2OKMeansEstimator(k=3, nfolds=3) >>> km.train(x=list(range(4)), training_frame=iris) >>> withinss = km.withinss() # <- Default: return training metrics >>> withinss >>> withinss2 = km.withinss(train=True, ... valid=True, ... xval=True) >>> withinss2
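The sum-of-squares metrics in this class are related: the total sum of squares equals the total within cluster sum of squares plus the between cluster sum of squares. The plain-Python sketch below checks that identity on a hypothetical two-cluster toy data set. It does not call H2O, and because H2O may compute these metrics on standardized data, values from a fitted model need not match a hand calculation on raw values.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


# Toy data, already assigned to two clusters.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],                 # centroid (1, 0)
    [[10.0, 4.0], [10.0, 6.0], [10.0, 8.0]],  # centroid (10, 6)
]
all_points = [p for c in clusters for p in c]
grand_mean = centroid(all_points)

# Within cluster sum of squares, one value per cluster.
withinss = [sum(squared_distance(p, centroid(c)) for p in c) for c in clusters]
tot_withinss = sum(withinss)

# Between cluster sum of squares: size-weighted distance of each centroid
# from the grand mean.
betweenss = sum(len(c) * squared_distance(centroid(c), grand_mean) for c in clusters)

# Total sum of squares: distance of every point from the grand mean.
totss = sum(squared_distance(p, grand_mean) for p in all_points)

print(withinss, tot_withinss, betweenss, totss)
```

A clustering with tight, well-separated clusters has a small `tot_withinss` relative to `totss`.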
-
AutoEncoders
-
class
h2o.model.autoencoder.
H2OAutoEncoderModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
-
anomaly
(test_data, per_feature=False)[source]¶ Obtain the reconstruction error for the input test_data.
Parameters: - test_data (H2OFrame) – The dataset upon which the reconstruction error is computed.
- per_feature (bool) – Whether to return the square reconstruction error per feature. Otherwise, return the mean square error.
Returns: the reconstruction error.
Examples: >>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz") >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz") >>> predictors = list(range(0,784)) >>> resp = 784 >>> train = train[predictors] >>> test = test[predictors] >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh", ... hidden=[2], ... l1=1e-5, ... ignore_const_cols=False, ... epochs=1) >>> ae_model.train(x=predictors,training_frame=train) >>> test_rec_error = ae_model.anomaly(test) >>> test_rec_error >>> test_rec_error_features = ae_model.anomaly(test, per_feature=True) >>> test_rec_error_features
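The difference between the two `per_feature` settings can be sketched without H2O. The hypothetical helper below takes the original rows and their reconstructions as plain lists and returns either the squared error of each feature or the mean squared error of each row; it illustrates the description above and is not the library's code.

```python
def reconstruction_error(original, reconstructed, per_feature=False):
    """Return one entry per row: a list of squared errors or their mean."""
    result = []
    for x_row, r_row in zip(original, reconstructed):
        squared = [(x - r) ** 2 for x, r in zip(x_row, r_row)]
        result.append(squared if per_feature else sum(squared) / len(squared))
    return result


original = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
reconstructed = [[1.0, 0.5, 1.0], [0.0, 1.0, 1.0]]  # second row is reproduced exactly

row_mse = reconstruction_error(original, reconstructed)
feature_se = reconstruction_error(original, reconstructed, per_feature=True)
print(row_mse)
print(feature_se)
```

Rows with a large mean squared error are the ones the autoencoder reconstructs poorly, which is why this value is used as an anomaly score.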
-
Word Embedding
-
class
h2o.model.word_embedding.
H2OWordEmbeddingModel
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
Word embedding model.
-
find_synonyms
(word, count=20)[source]¶ Find synonyms using a word2vec model.
Parameters: - word (str) – A single word to find synonyms for.
- count (int) – The first “count” synonyms will be returned.
Returns: the synonyms found for the word, ordered by decreasing similarity score.
Examples: >>> from h2o.estimators.word2vec import H2OWord2vecEstimator >>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs = 10) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", count = 5) >>> print(synonyms)
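Conceptually, finding synonyms means ranking the other words by how close their vectors are to the query word's vector. The sketch below does this with cosine similarity on a tiny hand-written embedding table. Cosine similarity is a common choice for word2vec vectors but is an assumption here, as are the helper names and the toy vectors; the sketch does not use H2O.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(embeddings, word, count):
    """Return the `count` most similar words as (word, score) pairs."""
    target = embeddings[word]
    scores = {w: cosine(target, v) for w, v in embeddings.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:count]


toy_embeddings = {
    "teacher": [1.0, 0.0],
    "tutor": [0.9, 0.1],
    "instructor": [0.8, 0.3],
    "driver": [0.0, 1.0],
}
top = nearest_words(toy_embeddings, "teacher", count=2)
print(top)
```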
-
to_frame
()[source]¶ Converts a given word2vec model into H2OFrame.
Returns: a frame representing learned word embeddings.
Examples: >>> from h2o.estimators.word2vec import H2OWord2vecEstimator >>> words = h2o.create_frame(rows=1000,cols=1,string_fraction=1.0,missing_fraction=0.0) >>> embeddings = h2o.create_frame(rows=1000,cols=100,real_fraction=1.0,missing_fraction=0.0) >>> word_embeddings = words.cbind(embeddings) >>> w2v_model = H2OWord2vecEstimator(pre_trained=word_embeddings) >>> w2v_model.train(training_frame=word_embeddings) >>> w2v_frame = w2v_model.to_frame() >>> word_embeddings.names = w2v_frame.names >>> word_embeddings.as_data_frame().equals(w2v_frame.as_data_frame())
-
transform
(words, aggregate_method)[source]¶ Transform words (or sequences of words) to vectors using a word2vec model.
Parameters: - words (H2OFrame) – An H2OFrame made of a single column containing source words.
- aggregate_method (str) – Specifies how to aggregate sequences of words. If the method is ‘NONE’, no aggregation is performed and each input word is mapped to a single word-vector. If the method is ‘AVERAGE’, the input is treated as sequences of words delimited by NA. Each word of a sequence is internally mapped to a vector, and the vectors belonging to the same sequence are averaged and returned in the result.
Returns: an H2OFrame of word vectors: one per input word for ‘NONE’, or one averaged vector per sequence for ‘AVERAGE’.
Examples: >>> from h2o.estimators.word2vec import H2OWord2vecEstimator >>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what", ... "there","all","we","one","the","a","an","of","or","in","for","by","on", ... "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have", ... "from","at","my","be","by","not","that","to","from","com","org","like","likes","so"] >>> words = job_titles.tokenize(" ") >>> words = words[(words.isna()) | (~ words.isin(STOP_WORDS)),:] >>> w2v_model = H2OWord2vecEstimator(epochs = 10) >>> w2v_model.train(training_frame=words) >>> job_title_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
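The ‘AVERAGE’ aggregation can be pictured as follows: walk down the single column of words, collect vectors until an NA is reached, then emit the mean of the collected vectors. The plain-Python sketch below uses `None` in place of NA. The helper name and toy embeddings are hypothetical, and skipping unknown words and empty sequences is an assumption of this sketch, not documented H2O behavior.

```python
def average_sequences(words, embeddings):
    """Average word vectors over sequences delimited by None (standing in for NA)."""
    vectors, current = [], []
    for w in list(words) + [None]:  # trailing None flushes the last sequence
        if w is None:
            if current:
                dim = len(current[0])
                vectors.append(
                    [sum(v[i] for v in current) / len(current) for i in range(dim)]
                )
            current = []
        elif w in embeddings:  # unknown words are skipped (an assumption)
            current.append(embeddings[w])
    return vectors


toy_embeddings = {
    "math": [1.0, 0.0],
    "teacher": [0.0, 1.0],
    "driver": [2.0, 2.0],
}
tokens = ["math", "teacher", None, "driver", None]
sequence_vectors = average_sequences(tokens, toy_embeddings)
print(sequence_vectors)
```

With ‘NONE’, by contrast, each word would map to its own vector and no averaging would take place.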
-