Modeling In H2O

Supervised

H2OANOVAGLMEstimator

class h2o.estimators.anovaglm.H2OANOVAGLMEstimator(model_id=None, training_frame=None, seed=-1, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, offset_column=None, weights_column=None, family='auto', tweedie_variance_power=0.0, tweedie_link_power=1.0, theta=0.0, solver='irlsm', missing_values_handling='mean_imputation', plug_values=None, compute_p_values=True, standardize=True, non_negative=False, max_iterations=0, link='family_default', prior=0.0, alpha=None, lambda_=[0.0], lambda_search=False, stopping_rounds=0, stopping_metric='auto', early_stopping=False, stopping_tolerance=0.001, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_runtime_secs=0.0, save_transformed_framekeys=False, highest_interaction_term=0, nparallelism=4, type=0)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

ANOVA for Generalized Linear Model

H2O ANOVAGLM calculates the Type III sum of squares (SS), which is used to evaluate the contributions of individual predictors and their interactions to a model. Predictors or interactions that contribute little to the model have high p-values, while those that contribute more have low p-values.

property Lambda

DEPRECATED. Use self.lambda_ instead.

property alpha

Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.

Type: List[float].
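As a rough illustration of how alpha splits the penalty between L1 and L2, the sketch below uses the common elastic net form lambda * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2). The function name elastic_net_penalty and the coefficient values are hypothetical, and the exact scaling applied inside the solver is an assumption.

```python
def elastic_net_penalty(coefs, alpha, lambda_):
    """Elastic net penalty (assumed form): lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in coefs)      # Lasso part
    l2_sq = sum(b * b for b in coefs)    # Ridge part (squared L2 norm)
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)


coefs = [1.0, -2.0, 0.5]  # hypothetical coefficients, intercept excluded
lasso = elastic_net_penalty(coefs, alpha=1.0, lambda_=0.1)  # pure L1
ridge = elastic_net_penalty(coefs, alpha=0.0, lambda_=0.1)  # pure L2
mixed = elastic_net_penalty(coefs, alpha=0.5, lambda_=0.1)  # equal mix
```

With alpha=1 only the absolute values of the coefficients are penalized, with alpha=0 only their squares, and intermediate values blend the two.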

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].
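The sketch below shows one plausible way to derive per-class sampling factors that bring every class up to the size of the largest one, listed in lexicographic class order as this parameter expects. It is not H2O's internal algorithm; the helper balancing_factors and the labels are hypothetical.

```python
from collections import Counter


def balancing_factors(labels):
    """Return (classes, factors) with classes in lexicographic order.

    Each factor is the over-sampling ratio that would bring the class
    up to the size of the largest class (an assumed balancing target).
    """
    counts = Counter(labels)
    classes = sorted(counts)              # lexicographic order
    target = max(counts.values())
    factors = [target / counts[c] for c in classes]
    return classes, factors


labels = ["no"] * 90 + ["yes"] * 10       # hypothetical imbalanced response
classes, factors = balancing_factors(labels)
```

A list such as factors could then be passed as class_sampling_factors together with balance_classes=True.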

property compute_p_values

Request p-values computation. P-values work only with the IRLSM solver and no regularization.

Type: bool, defaults to True.

property early_stopping

Stop early when there is no more relative improvement on train or validation (if provided).

Type: bool, defaults to False.

property family

Family. Use binomial for classification with logistic regression, others are for regression problems.

Type: Literal["auto", "gaussian", "binomial", "fractionalbinomial", "quasibinomial", "poisson", "gamma", "tweedie", "negativebinomial"], defaults to "auto".

property highest_interaction_term

Limit the number of interaction terms: 2 means interactions between two columns only, 3 means interactions among three columns, and so on. Defaults to 2.

Type: int, defaults to 0.
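To make the effect of this limit concrete, the sketch below enumerates candidate model terms for a set of predictors, assuming that all main effects and all interactions up to the given order are included. The helper model_terms and the predictor names are hypothetical.

```python
from itertools import combinations


def model_terms(predictors, highest_interaction_term):
    """List main effects plus interactions up to the given order (assumed behavior)."""
    terms = []
    for order in range(1, highest_interaction_term + 1):
        terms.extend(combinations(predictors, order))
    return terms


terms = model_terms(["a", "b", "c"], highest_interaction_term=2)
# Three main effects followed by the three pairwise interactions
```

Raising the limit to 3 would add the single three-way term ("a", "b", "c").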

property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property lambda_

Regularization strength

Type: List[float], defaults to [0.0].

property lambda_search

Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.

Type: bool, defaults to False.

property link

Link function.

Type: Literal["family_default", "identity", "logit", "log", "inverse", "tweedie", "ologit"], defaults to "family_default".
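For orientation, the sketch below writes out four of the listed link functions together with their inverses, using the standard GLM definitions (the link maps the mean mu to the linear predictor eta). The LINKS dictionary is a hypothetical helper, not part of the H2O API, and the tweedie and ologit links are omitted.

```python
import math

# name -> (link: mu -> eta, inverse link: eta -> mu)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

link, inverse = LINKS["logit"]
eta = link(0.25)          # linear predictor for a mean of 0.25
mu_back = inverse(eta)    # the inverse link recovers the mean
```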

property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

property max_iterations

Maximum number of iterations

Type: int, defaults to 0.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

property missing_values_handling

Handling of missing values. Either MeanImputation, Skip or PlugValues.

Type: Literal["mean_imputation", "skip", "plug_values"], defaults to "mean_imputation".
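The three strategies can be sketched on a single numeric column as follows. This is a simplified illustration, not H2O code: handle_missing is a hypothetical helper, None stands in for a missing value, and in a real frame the skip mode drops whole rows rather than single entries.

```python
def handle_missing(column, mode="mean_imputation", plug_value=None):
    """Apply one missing-value strategy to a list where None marks a missing entry."""
    observed = [v for v in column if v is not None]
    if mode == "skip":
        return observed                          # drop missing entries
    if mode == "mean_imputation":
        fill = sum(observed) / len(observed)     # mean of the observed values
    elif mode == "plug_values":
        fill = plug_value                        # user-supplied replacement
    else:
        raise ValueError("unknown mode: " + mode)
    return [fill if v is None else v for v in column]


column = [1.0, None, 3.0]
imputed = handle_missing(column)
skipped = handle_missing(column, mode="skip")
plugged = handle_missing(column, mode="plug_values", plug_value=0.0)
```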

property non_negative

Restrict coefficients (not intercept) to be non-negative

Type: bool, defaults to False.

property nparallelism

Number of models to build in parallel. Defaults to 4. Adjust according to your system.

Type: int, defaults to 4.

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

property plug_values

Plug values: a single-row frame containing the values used to impute missing values of the training/validation frame. Use in conjunction with missing_values_handling = PlugValues.

Type: Union[None, str, H2OFrame].

property prior

Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.

Type: float, defaults to 0.0.

property response_column

Response variable column.

Type: str.

result()[source]

Get the result frame that contains information about the model building process, as for modelselection and anovaglm.

Returns: the H2OFrame that contains information about the model building process.

property save_transformed_framekeys

Set to true to save the keys of the transformed predictors and interaction columns.

Type: bool, defaults to False.

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

property solver

AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty; L_BFGS scales better for datasets with many columns.

Type: Literal["auto", "irlsm", "l_bfgs", "coordinate_descent_naive", "coordinate_descent", "gradient_descent_lh", "gradient_descent_sqerr"], defaults to "irlsm".

property standardize

Standardize numeric columns to have zero mean and unit variance

Type: bool, defaults to True.
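Standardization to zero mean and unit variance can be sketched as below. The use of the sample standard deviation (n - 1 in the denominator) is an assumption, and the helper name standardize_column is hypothetical.

```python
import statistics


def standardize_column(values):
    """Rescale a numeric column to zero mean and unit (sample) variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


z = standardize_column([2.0, 4.0, 6.0])
```

After this rescaling, coefficients of different predictors are on a comparable scale, which matters when a penalty is applied to them.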

property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.

property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.
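The interplay of stopping_rounds and stopping_tolerance can be sketched as a simplified rule: compare the best of the most recent k moving averages against an older reference moving average, and stop when the relative improvement falls short of the tolerance. This is an approximation written for illustration, not H2O's exact implementation; should_stop is a hypothetical helper and assumes a positive metric where lower is better (for example logloss).

```python
def should_stop(history, stopping_rounds, stopping_tolerance):
    """Simplified early-stopping check on a list of metric values (lower is better)."""
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False                      # disabled, or not enough scoring events yet
    recent = history[-2 * k:]
    # k + 1 simple moving averages of window length k, oldest first
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]
    best_recent = min(averages[1:])
    # Stop if the best recent average did not improve by the relative tolerance
    return best_recent > reference * (1.0 - stopping_tolerance)


improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
flat = [0.5] * 6
stop_improving = should_stop(improving, stopping_rounds=3, stopping_tolerance=0.001)
stop_flat = should_stop(flat, stopping_rounds=3, stopping_tolerance=0.001)
```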

property theta

Theta

Type: float, defaults to 0.0.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property tweedie_link_power

Tweedie link power

Type: float, defaults to 1.0.

property tweedie_variance_power

Tweedie variance power

Type: float, defaults to 0.0.
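For background, the Tweedie family ties the variance to the mean through a power law, Var(y) = phi * mu^p, where p is the variance power. The sketch below evaluates the unit variance function mu^p for a few values of p; the function name is hypothetical and the dispersion phi is left out.

```python
def tweedie_unit_variance(mu, variance_power):
    """Unit variance function V(mu) = mu ** p of the Tweedie family."""
    return mu ** variance_power


gaussian_like = tweedie_unit_variance(3.0, 0.0)  # p = 0: constant variance
poisson_like = tweedie_unit_variance(3.0, 1.0)   # p = 1: variance equals the mean
gamma_like = tweedie_unit_variance(3.0, 2.0)     # p = 2: variance grows with mu squared
```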

property type

Refers to the SS type: 1, 2, 3, or 4. Only type 3 is currently supported.

Type: int, defaults to 0.

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.
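The equivalence between weights and row repetition described above can be checked on a weighted mean, used here as a stand-in for a weighted loss. The helper weighted_mean and the data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean; negative weights are rejected, as in the description above."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


values = [1.0, 4.0, 10.0]
weights = [2, 1, 0]                 # repeat first row twice, drop the last row
by_weight = weighted_mean(values, weights)

expanded = [1.0, 1.0, 4.0]          # the same data with rows physically repeated/removed
by_repetition = sum(expanded) / len(expanded)
```

Both computations give the same result, while the weighted version leaves the size of the data unchanged.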

H2OCoxProportionalHazardsEstimator

class h2o.estimators.coxph.H2OCoxProportionalHazardsEstimator(model_id=None, training_frame=None, start_column=None, stop_column=None, response_column=None, ignored_columns=None, weights_column=None, offset_column=None, stratify_by=None, ties='efron', init=0.0, lre_min=9.0, max_iterations=20, interactions=None, interaction_pairs=None, interactions_only=None, use_all_factor_levels=False, export_checkpoints_dir=None, single_node_mode=False)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Cox Proportional Hazards

Trains a Cox Proportional Hazards Model (CoxPH) on an H2O dataset.

property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
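>>> from os import listdir
>>> import h2o
>>> from h2o.estimators.coxph import H2OCoxProportionalHazardsEstimator
>>> h2o.init()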
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> checkpoints_dir = tempfile.mkdtemp()
>>> coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                            stop_column="stop",
...                                            export_checkpoints_dir=checkpoints_dir)
>>> coxph.train(x=predictor,
...             y=response,
...             training_frame=heart)
>>> len(listdir(checkpoints_dir))

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property init

Coefficient starting value.

Type: float, defaults to 0.0.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  init=2.9)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property interaction_pairs

A list of pairwise (first order) column interactions.

Type: List[tuple].

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interaction_pairs = [("start","stop")]
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interaction_pairs=interaction_pairs)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property interactions

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

Type: List[str].

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interactions = ['start','stop']
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interactions=interactions)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property interactions_only

A list of columns that should only be used to create interactions but should not themselves participate in model training.

Type: List[str].

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interactions = ['start','stop']
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interactions_only=interactions)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property lre_min

Minimum log-relative error.

Type: float, defaults to 9.0.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  lre_min=5)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property max_iterations

Maximum number of iterations.

Type: int, defaults to 20.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  max_iterations=50)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  offset_column="transplant")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property response_column

Response variable column.

Type: str.

property single_node_mode

Run on a single node to reduce the effect of network overhead (for smaller datasets)

Type: bool, defaults to False.

property start_column

Start Time Column.

Type: str.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()

property stop_column

Stop Time Column.

Type: str.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()

property stratify_by

List of columns to use for stratification.

Type: List[str].

property ties

Method for Handling Ties.

Type: Literal["efron", "breslow"], defaults to "efron".

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  ties="breslow")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()
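The two tie-handling methods differ in the denominator of the partial likelihood at an event time shared by several subjects. The sketch below uses the textbook definitions of the Breslow and Efron approximations; the helper log_denominator and the risk scores are hypothetical, and it is not taken from the H2O implementation.

```python
import math


def log_denominator(risk_scores, tied_scores, ties="efron"):
    """Log of the partial-likelihood denominator at one event time.

    risk_scores: exp(linear predictor) for every subject still at risk,
                 including those with an event at this time.
    tied_scores: exp(linear predictor) for the subjects with an event at this time.
    """
    d = len(tied_scores)
    risk_sum = sum(risk_scores)
    tied_sum = sum(tied_scores)
    if ties == "breslow":
        # Every tied event sees the full risk set
        return d * math.log(risk_sum)
    if ties == "efron":
        # Tied events are progressively removed from the risk set, on average
        return sum(math.log(risk_sum - (l / d) * tied_sum) for l in range(d))
    raise ValueError("ties must be 'efron' or 'breslow'")


risk = [1.0, 1.0, 2.0]   # hypothetical risk scores at one event time
tied = [1.0, 1.0]        # two subjects share that event time
breslow = log_denominator(risk, tied, ties="breslow")
efron = log_denominator(risk, tied, ties="efron")
```

When only one subject has an event at a given time, the two methods coincide; they diverge as the number of ties grows.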

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()

property use_all_factor_levels

(Internal. For development only!) Indicates whether to use all factor levels.

Type: bool, defaults to False.

Examples

>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  use_all_factor_levels=True)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

H2ODeepLearningEstimator

class h2o.estimators.deeplearning.H2ODeepLearningEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, weights_column=None, offset_column=None, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, checkpoint=None, pretrained_autoencoder=None, overwrite_with_best_model=True, use_all_factor_levels=True, standardize=True, activation='rectifier', hidden=[200, 200], epochs=10.0, train_samples_per_iteration=-2, target_ratio_comm_to_comp=0.05, seed=-1, adaptive_rate=True, rho=0.99, epsilon=1e-08, rate=0.005, rate_annealing=1e-06, rate_decay=1.0, momentum_start=0.0, momentum_ramp=1000000.0, momentum_stable=0.0, nesterov_accelerated_gradient=True, input_dropout_ratio=0.0, hidden_dropout_ratios=None, l1=0.0, l2=0.0, max_w2=3.4028235e+38, initial_weight_distribution='uniform_adaptive', initial_weight_scale=1.0, initial_weights=None, initial_biases=None, loss='automatic', distribution='auto', quantile_alpha=0.5, tweedie_power=1.5, huber_alpha=0.9, score_interval=5.0, score_training_samples=10000, score_validation_samples=0, score_duty_cycle=0.1, classification_stop=0.0, regression_stop=1e-06, stopping_rounds=5, stopping_metric='auto', stopping_tolerance=0.0, max_runtime_secs=0.0, score_validation_sampling='uniform', diagnostics=True, fast_mode=True, force_load_balance=True, variable_importances=True, replicate_training_data=True, single_node_mode=False, shuffle_training_data=False, missing_values_handling='mean_imputation', quiet_mode=False, autoencoder=False, sparse=False, col_major=False, average_activation=0.0, sparsity_beta=0.0, max_categorical_features=2147483647, reproducible=False, export_weights_and_biases=False, mini_batch_size=1, categorical_encoding='auto', elastic_averaging=False, elastic_averaging_moving_rate=0.9, elastic_averaging_regularization=0.001, export_checkpoints_dir=None, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Deep Learning

Build a Deep Neural Network model using CPUs. Builds a feed-forward multilayer artificial neural network on an H2OFrame.

Examples

>>> import h2o
>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> h2o.init()
>>> rows = [[1,2,3,4,0], [2,1,2,4,1], [2,1,4,2,1],
...         [0,1,2,34,1], [2,3,4,1,0]] * 50
>>> fr = h2o.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2ODeepLearningEstimator()
>>> model.train(x=list(range(4)), y=4, training_frame=fr)
>>> model.logloss()

property activation

Activation function.

Type: Literal["tanh", "tanh_with_dropout", "rectifier", "rectifier_with_dropout", "maxout", "maxout_with_dropout"], defaults to "rectifier".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(activation="tanh")
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
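For reference, the three base activation functions named in the type above can be written as plain scalar functions. This is a minimal sketch using the standard definitions; the function names are hypothetical, and the number of channels fed to maxout is an assumption.

```python
import math


def tanh_activation(x):
    """Hyperbolic tangent: squashes the input into (-1, 1)."""
    return math.tanh(x)


def rectifier(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def maxout(channels):
    """Maxout: the largest of several linear channel outputs for one neuron."""
    return max(channels)


outputs = [rectifier(v) for v in (-1.5, 0.0, 2.0)]
```

The *_with_dropout variants use the same functions but randomly drop hidden units during training.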

property adaptive_rate

Adaptive learning rate.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(adaptive_rate=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property autoencoder

Auto-Encoder.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(autoencoder=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property average_activation

Average activation for sparse auto-encoder. #Experimental

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(average_activation=1.5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> encoding = "one_hot_internal"
>>> airlines_dl = H2ODeepLearningEstimator(categorical_encoding=encoding,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.mse()

property checkpoint

Model checkpoint to resume training with.

Type: Union[None, str, H2OEstimator].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(activation="tanh",
...                                    autoencoder=True,
...                                    seed=1234,
...                                    model_id="cars_dl")
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
>>> cars_cont = H2ODeepLearningEstimator(checkpoint=cars_dl,
...                                      seed=1234)
>>> cars_cont.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> cars_cont.mse()

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   class_sampling_factors=sample_factors,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()

property classification_stop

Stopping criterion for classification error fraction on training data (-1 to disable).

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(classification_stop=1.5,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()

property col_major

#DEPRECATED Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation.

Type: bool, defaults to False.

property diagnostics

Enable diagnostics for hidden layers.

Type: bool, defaults to True.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(diagnostics=True,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(distribution="poisson",
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property elastic_averaging

Elastic averaging between compute nodes can improve distributed model convergence. #Experimental

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property elastic_averaging_moving_rate

Elastic averaging moving rate (only if elastic averaging is enabled).

Type: float, defaults to 0.9.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_moving_rate=.8,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()

property elastic_averaging_regularization

Elastic averaging regularization strength (only if elastic averaging is enabled).

Type: float, defaults to 0.001.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_regularization=.008,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property epochs

How many times the dataset should be iterated (streamed); can be fractional.

Type: float, defaults to 10.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(epochs=15,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
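
Because epochs can be fractional, it can help to read it as a budget of processed rows. The following is a minimal sketch of that arithmetic in plain Python, independent of H2O. The helper name samples_to_process is hypothetical, and the rounding is an assumption of this sketch rather than documented H2O behavior.

```python
def samples_to_process(epochs, n_rows):
    """Approximate number of training rows streamed for a given epochs value."""
    # One epoch is one full pass over the data, so a fractional value
    # corresponds to a partial pass. Rounding is an illustrative choice.
    return int(round(epochs * n_rows))


print(samples_to_process(15, 1000))   # 15 full passes over 1000 rows
print(samples_to_process(0.5, 1000))  # half a pass over 1000 rows
```
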
property epsilon

Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress).

Type: float, defaults to 1e-08.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(epsilon=1e-6,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> checkpoints_dir = tempfile.mkdtemp()
>>> cars_dl = H2ODeepLearningEstimator(export_checkpoints_dir=checkpoints_dir,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> len(listdir(checkpoints_dir))
property export_weights_and_biases

Whether to export Neural Network weights and biases to H2O Frames.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(export_weights_and_biases=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property fast_mode

Enable fast mode (minor approximation in back-propagation).

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(fast_mode=False,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(fold_assignment="Random",
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
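
The difference between the "modulo" and "random" schemes can be sketched in plain Python. This is an illustration only: the function assign_folds is hypothetical, it does not reproduce H2O's internal random number generator, and the "stratified" scheme is not covered.

```python
import random


def assign_folds(n_rows, nfolds, scheme="modulo", seed=1234):
    """Return a list with one fold index in [0, nfolds) per row."""
    if scheme == "modulo":
        # Row i goes to fold i mod nfolds, so fold sizes differ by at most one.
        return [i % nfolds for i in range(n_rows)]
    if scheme == "random":
        # Each row is assigned independently and uniformly at random.
        rng = random.Random(seed)
        return [rng.randrange(nfolds) for _ in range(n_rows)]
    raise ValueError("unsupported scheme: %r" % scheme)


print(assign_folds(7, 3, "modulo"))  # [0, 1, 2, 0, 1, 2, 0]
print(assign_folds(7, 3, "random"))
```
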
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> print(cars['fold_numbers'])
>>> cars_dl = H2ODeepLearningEstimator(seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars,
...               fold_column="fold_numbers")
>>> cars_dl.mse()
property force_load_balance

Force extra load balancing to increase training speed for small datasets (to keep all cores busy).

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(force_load_balance=False,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property hidden

Hidden layer sizes (e.g. [100, 100]).

Type: List[int], defaults to [200, 200].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(hidden=[100,100],
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property hidden_dropout_ratios

Hidden layer dropout ratios (can improve generalization). Specify one value per hidden layer; each defaults to 0.5.

Type: List[float].

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> valid = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> features = list(range(0,784))
>>> target = 784
>>> train[target] = train[target].asfactor()
>>> valid[target] = valid[target].asfactor()
>>> model = H2ODeepLearningEstimator(epochs=20,
...                                  hidden=[200,200],
...                                  hidden_dropout_ratios=[0.5,0.5],
...                                  seed=1234,
...                                  activation='tanhwithdropout')
>>> model.train(x=features,
...             y=target,
...             training_frame=train,
...             validation_frame=valid)
>>> model.mse()
property huber_alpha

Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

Type: float, defaults to 0.9.

Examples

>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> train, valid = insurance.split_frame(ratios=[.8], seed=1234)
>>> insurance_dl = H2ODeepLearningEstimator(distribution="huber",
...                                         huber_alpha=0.9,
...                                         seed=1234)
>>> insurance_dl.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> insurance_dl.mse()
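
To make "threshold between quadratic and linear loss" concrete, here is a minimal sketch in plain Python. It assumes the threshold is taken as the huber_alpha quantile of the absolute residuals and uses a nearest-rank quantile; both the helper names and the quantile rule are assumptions of this sketch, not a description of H2O internals.

```python
import math


def huber_threshold(abs_residuals, huber_alpha=0.9):
    """Pick the huber_alpha quantile of the absolute residuals (nearest rank)."""
    ordered = sorted(abs_residuals)
    # Nearest-rank quantile; the interpolation rule is an assumption here.
    index = max(0, math.ceil(huber_alpha * len(ordered)) - 1)
    return ordered[index]


def huber_loss(residual, delta):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


residuals = [0.1, -0.4, 0.2, 3.0, -0.3, 0.5, -0.2, 0.1, 0.6, -8.0]
delta = huber_threshold([abs(r) for r in residuals], huber_alpha=0.9)
print(delta)
print([round(huber_loss(r, delta), 3) for r in residuals])
```

With a larger huber_alpha, more residuals fall in the quadratic region, so outliers have more influence on the fit.
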
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(seed=1234,
...                                    ignore_const_cols=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property initial_biases

A list of H2OFrame ids to initialize the bias vectors of this model with.

Type: List[Union[None, str, H2OFrame]].

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
...                                export_weights_and_biases=True)
>>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
>>> p1 = dl1.model_performance(iris).logloss()
>>> ll1 = dl1.predict(iris)
>>> print(p1)
>>> w1 = dl1.weights(0)
>>> w2 = dl1.weights(1)
>>> w3 = dl1.weights(2)
>>> b1 = dl1.biases(0)
>>> b2 = dl1.biases(1)
>>> b3 = dl1.biases(2)
>>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
...                                initial_weights=[w1, w2, w3],
...                                initial_biases=[b1, b2, b3],
...                                epochs=0)
>>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
>>> dl2.initial_biases
property initial_weight_distribution

Initial weight distribution.

Type: Literal["uniform_adaptive", "uniform", "normal"], defaults to "uniform_adaptive".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(initial_weight_distribution="Uniform",
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property initial_weight_scale

Scale of the initial weight distribution. Uniform: weights are drawn from -value…value. Normal: value is the standard deviation.

Type: float, defaults to 1.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(initial_weight_scale=1.5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
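
How the scale value is interpreted for the "uniform" and "normal" distributions can be sketched in plain Python. The function init_weights is hypothetical, the zero mean for the normal case is an assumption, and the "uniform_adaptive" default is not covered.

```python
import random


def init_weights(n, distribution="uniform", scale=1.0, seed=1234):
    """Draw n initial weights using the given distribution and scale."""
    rng = random.Random(seed)
    if distribution == "uniform":
        # Uniform: every weight lies in [-scale, scale].
        return [rng.uniform(-scale, scale) for _ in range(n)]
    if distribution == "normal":
        # Normal: standard deviation equal to scale (zero mean assumed).
        return [rng.gauss(0.0, scale) for _ in range(n)]
    raise ValueError("unsupported distribution: %r" % distribution)


uniform_w = init_weights(5, "uniform", scale=1.5)
normal_w = init_weights(5, "normal", scale=1.5)
print(uniform_w)
print(normal_w)
```
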
property initial_weights

A list of H2OFrame ids to initialize the weight matrices of this model with.

Type: List[Union[None, str, H2OFrame]].

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
...                                export_weights_and_biases=True)
>>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
>>> p1 = dl1.model_performance(iris).logloss()
>>> ll1 = dl1.predict(iris)
>>> print(p1)
>>> w1 = dl1.weights(0)
>>> w2 = dl1.weights(1)
>>> w3 = dl1.weights(2)
>>> b1 = dl1.biases(0)
>>> b2 = dl1.biases(1)
>>> b3 = dl1.biases(2)
>>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
...                                initial_weights=[w1, w2, w3],
...                                initial_biases=[b1, b2, b3],
...                                epochs=0)
>>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
>>> dl2.initial_weights
property input_dropout_ratio

Input layer dropout ratio (can improve generalization, try 0.1 or 0.2).

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(input_dropout_ratio=0.2,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_fold_assignment=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> print(cars_dl.cross_validation_fold_assignment())
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_models=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> print(cars_dl.cross_validation_models())
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_predictions=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> print(cars_dl.cross_validation_predictions())
property l1

L1 regularization (can add stability and improve generalization, causes many weights to become 0).

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property l2

L2 regularization (can add stability and improve generalization, causes many weights to be small).

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l2=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property loss

Loss function.

Type: Literal["automatic", "cross_entropy", "quadratic", "huber", "absolute", "quantile"], defaults to "automatic".

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> max_size = .85
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   max_after_balance_size=max_size,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.logloss()
property max_categorical_features

Max. number of categorical features, enforced via hashing. #Experimental

Type: int, defaults to 2147483647.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   max_categorical_features=2147483647,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.logloss()
property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.

Type: int, defaults to 20.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(max_runtime_secs=10,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property max_w2

Constraint for squared sum of incoming weights per unit (e.g. for Rectifier).

Type: float, defaults to 3.4028235e+38.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                   hidden=[10,10],
...                                   epochs=10,
...                                   input_dropout_ratio=0.2,
...                                   l1=1e-5,
...                                   max_w2=10.5,
...                                   stopping_rounds=0)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
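
A constraint on the squared sum of incoming weights is commonly enforced by rescaling a unit's weight vector whenever the limit is exceeded. The sketch below shows that idea in plain Python; the function apply_max_w2 is hypothetical, and rescaling as the enforcement mechanism is an assumption of this sketch.

```python
import math


def apply_max_w2(incoming_weights, max_w2):
    """Rescale one unit's incoming weights so their squared sum is <= max_w2."""
    squared_sum = sum(w * w for w in incoming_weights)
    if squared_sum <= max_w2:
        # Constraint already satisfied: leave the weights unchanged.
        return list(incoming_weights)
    # Shrink all weights by the same factor so the squared sum equals max_w2.
    scale = math.sqrt(max_w2 / squared_sum)
    return [w * scale for w in incoming_weights]


weights = [3.0, 4.0]  # squared sum is 25
clipped = apply_max_w2(weights, max_w2=10.5)
print(clipped)
print(sum(w * w for w in clipped))
```

The default of 3.4028235e+38 is effectively no constraint, since no realistic squared sum exceeds it.
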
property mini_batch_size

Mini-batch size (smaller leads to better fit, larger can speed up and generalize better).

Type: int, defaults to 1.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                   hidden=[10,10],
...                                   epochs=10,
...                                   input_dropout_ratio=0.2,
...                                   l1=1e-5,
...                                   max_w2=10.5,
...                                   stopping_rounds=0,
...                                   mini_batch_size=35)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property missing_values_handling

Handling of missing values. Either MeanImputation or Skip.

Type: Literal["mean_imputation", "skip"], defaults to "mean_imputation".

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston.insert_missing_values()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_dl = H2ODeepLearningEstimator(missing_values_handling="skip")
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property momentum_ramp

Number of training samples for which momentum increases.

Type: float, defaults to 1000000.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
property momentum_stable

Final momentum after the ramp is over (try 0.99).

Type: float, defaults to 0.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
property momentum_start

Initial momentum at the beginning of training (try 0.5).

Type: float, defaults to 0.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
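
The three momentum parameters together describe a schedule: momentum begins at momentum_start, changes over the first momentum_ramp training samples, and stays at momentum_stable afterwards. A minimal sketch in plain Python, assuming the ramp is linear; the function momentum_at is hypothetical and not part of the H2O API.

```python
def momentum_at(samples_seen, momentum_start=0.0, momentum_stable=0.0,
                momentum_ramp=1e6):
    """Momentum after a given number of training samples (linear ramp assumed)."""
    if samples_seen >= momentum_ramp:
        # Ramp is over: momentum stays at its final value.
        return momentum_stable
    # Inside the ramp: interpolate between the start and stable values.
    fraction = samples_seen / momentum_ramp
    return momentum_start + (momentum_stable - momentum_start) * fraction


# Same settings as the airlines example above.
for n in (0, 5e6, 1e7, 2e7):
    print(n, momentum_at(n, momentum_start=0.9, momentum_stable=0.99,
                         momentum_ramp=1e7))
```
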
property nesterov_accelerated_gradient

Use Nesterov accelerated gradient (recommended).

Type: bool, defaults to True.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.model_performance()
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(nfolds=5, seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(offset_column="offset",
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property overwrite_with_best_model

If enabled, override the final model with the best model found during training.

Type: bool, defaults to True.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(overwrite_with_best_model=True,
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property pretrained_autoencoder

Pretrained autoencoder model to initialize this model with.

Type: Union[None, str, H2OEstimator].

Examples

>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> resp = 784
>>> nfeatures = 20
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> sid = train[0].runif(0)
>>> train_unsupervised = train[sid>=0.5]
>>> train_unsupervised.pop(resp)
>>> train_supervised = train[sid<0.5]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[nfeatures],
...                                    model_id="ae_model",
...                                    epochs=1,
...                                    ignore_const_cols=False,
...                                    reproducible=True,
...                                    seed=1234)
>>> ae_model.train(list(range(resp)), training_frame=train_unsupervised)
>>> ae_model.mse()
>>> pretrained_model = H2ODeepLearningEstimator(activation="Tanh",
...                                             hidden=[nfeatures],
...                                             epochs=1,
...                                             reproducible=True,
...                                             seed=1234,
...                                             ignore_const_cols=False,
...                                             pretrained_autoencoder="ae_model")
>>> pretrained_model.train(list(range(resp)), resp,
...                        training_frame=train_supervised,
...                        validation_frame=test)
>>> pretrained_model.mse()
property quantile_alpha

Desired quantile for Quantile regression, must be between 0 and 1.

Type: float, defaults to 0.5.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(distribution="quantile",
...                                      quantile_alpha=.8,
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
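
As a rough illustration of what quantile_alpha controls, the sketch below implements the standard quantile (pinball) loss in plain Python. It is not taken from the H2O source; the function name pinball_loss and the sample values are assumptions chosen for illustration.

```python
def pinball_loss(actual, predicted, quantile_alpha=0.5):
    """Standard quantile (pinball) loss for a single observation."""
    error = actual - predicted
    # Under-prediction (error >= 0) is weighted by quantile_alpha,
    # over-prediction by (1 - quantile_alpha).
    if error >= 0:
        return quantile_alpha * error
    return (quantile_alpha - 1.0) * error


# With quantile_alpha=0.8, under-predicting by 10 is penalized
# more heavily than over-predicting by 10.
under = pinball_loss(actual=30.0, predicted=20.0, quantile_alpha=0.8)
over = pinball_loss(actual=20.0, predicted=30.0, quantile_alpha=0.8)
print(under > over)
```

With quantile_alpha=0.5 both directions are penalized equally, which corresponds to median regression.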
property quiet_mode

Enable quiet mode for less output to standard output.

Type: bool, defaults to False.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_dl = H2ODeepLearningEstimator(quiet_mode=True,
...                                       seed=1234)
>>> titanic_dl.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> titanic_dl.mse()
property rate

Learning rate (higher => less stable, lower => slower convergence).

Type: float, defaults to 0.005.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors, y=resp, training_frame=train, validation_frame=test)
>>> model.model_performance(valid=True)
property rate_annealing

Learning rate annealing: rate / (1 + rate_annealing * samples).

Type: float, defaults to 1e-06.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.mse()
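
The annealing formula above can be evaluated directly. The following is a minimal sketch in plain Python; the helper name annealed_rate is hypothetical and is not part of the H2O API.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Effective learning rate after `samples` training samples have been processed."""
    return rate / (1.0 + rate_annealing * samples)


# Using the values from the example: rate=0.01, rate_annealing=1e-6.
for samples in (0, 100_000, 1_000_000):
    print(samples, annealed_rate(0.01, 1e-6, samples))
```

With these settings the effective rate falls to roughly half of its initial value after one million samples.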
property rate_decay

Learning rate decay factor between layers (N-th layer: rate * rate_decay ^ (N - 1)).

Type: float, defaults to 1.0.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.model_performance()
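
To make the per-layer formula concrete, here is a small sketch in plain Python that lists the rate each layer would use. The helper name layer_rates and the three-layer network are assumptions for illustration only.

```python
def layer_rates(rate, rate_decay, n_layers):
    """Learning rate of each layer: the N-th layer uses rate * rate_decay ** (N - 1)."""
    return [rate * rate_decay ** (n - 1) for n in range(1, n_layers + 1)]


# Using the values from the example: rate=0.01, rate_decay=0.9.
rates = layer_rates(rate=0.01, rate_decay=0.9, n_layers=3)
print(rates)
```

With rate_decay=1.0 (the default) every layer uses the same rate.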
property regression_stop

Stopping criterion for regression error (MSE) on training data (-1 to disable).

Type: float, defaults to 1e-06.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(regression_stop=1e-6,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property replicate_training_data

Replicate the entire training dataset onto every node for faster training on small datasets.

Type: bool, defaults to True.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(replicate_training_data=False)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=airlines) 
>>> airlines_dl.auc()
property reproducible

Force reproducibility on small data (will be slow - only uses 1 thread).

Type: bool, defaults to False.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(reproducible=True)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property response_column

Response variable column.

Type: str.

property rho

Adaptive learning rate time decay factor (similarity to prior updates).

Type: float, defaults to 0.99.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(rho=0.9,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
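
H2O's adaptive learning rate is based on ADADELTA, in which rho acts as the decay factor of two running averages. The sketch below shows the textbook ADADELTA update for a single scalar weight; it is an illustration under that assumption rather than H2O's internal code, and the function name adadelta_step is hypothetical.

```python
import math


def adadelta_step(gradient, avg_sq_grad, avg_sq_delta, rho=0.99, epsilon=1e-8):
    """One textbook ADADELTA update for a single weight.

    Returns the weight change and the two updated running averages.
    """
    # Decaying average of squared gradients; rho controls how much history is kept.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * gradient ** 2
    # Step size is the ratio of the RMS of past updates to the RMS of gradients.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * gradient
    # Decaying average of squared updates, used by the next step.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    return delta, avg_sq_grad, avg_sq_delta


delta, avg_sq_grad, avg_sq_delta = adadelta_step(gradient=1.0, avg_sq_grad=0.0, avg_sq_delta=0.0)
print(delta, avg_sq_grad, avg_sq_delta)
```

A value of rho closer to 1 makes both averages change more slowly, so each update stays more similar to the previous ones.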
property score_duty_cycle

Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).

Type: float, defaults to 0.1.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_duty_cycle=0.2,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_each_iteration=True,
...                                    seed=1234) 
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_interval

Shortest time interval (in seconds) between model scoring.

Type: float, defaults to 5.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_interval=3,
...                                    seed=1234) 
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_training_samples

Number of training set samples for scoring (0 for all).

Type: int, defaults to 10000.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_training_samples=10000,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_validation_samples

Number of validation set samples for scoring (0 for all).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(score_validation_samples=3,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property score_validation_sampling

Method used to sample validation dataset for scoring.

Type: Literal["uniform", "stratified"], defaults to "uniform".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(score_validation_sampling="uniform",
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property seed

Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded.

Type: int, defaults to -1.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property shuffle_training_data

Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to #nodes x #rows, or if using balance_classes).

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(shuffle_training_data=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property single_node_mode

Run on a single node for fine-tuning of model parameters.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(single_node_mode=True,
...                                    seed=1234) 
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property sparse

Sparse data handling (more efficient for data with lots of 0 values).

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(sparse=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property sparsity_beta

Sparsity regularization. #Experimental

Type: float, defaults to 0.0.

Examples

>>> from h2o.estimators import H2OAutoEncoderEstimator
>>> resp = 784
>>> nfeatures = 20
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> sid = train[0].runif(0)
>>> train_unsupervised = train[sid>=0.5]
>>> train_unsupervised.pop(resp)
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[nfeatures],
...                                    epochs=1,
...                                    ignore_const_cols=False,
...                                    reproducible=True,
...                                    sparsity_beta=0.5,
...                                    seed=1234)
>>> ae_model.train(list(range(resp)),
...                training_frame=train_unsupervised)
>>> ae_model.mse()
property standardize

If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(standardize=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
...                                        stopping_rounds=3,
...                                        stopping_tolerance=1e-2,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable).

Type: int, defaults to 5.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
...                                        stopping_rounds=3,
...                                        stopping_tolerance=1e-2,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
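
The moving-average rule can be sketched in plain Python as follows. This is a simplified illustration, not H2O's implementation: the function name should_stop and the higher_is_better flag are assumptions, metric values are assumed to be positive, and the exact comparison H2O performs may differ in detail.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, higher_is_better=True):
    """Return True when the moving average of the metric has stopped improving."""
    k = stopping_rounds
    # Early stopping is disabled for k == 0 and needs 2k scoring events to compare.
    if k <= 0 or len(history) < 2 * k:
        return False
    recent = history[-2 * k:]
    # k + 1 simple moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, candidates = averages[0], averages[1:]
    if higher_is_better:
        # For metrics such as AUC: stop unless the best recent average beats
        # the reference by more than the relative tolerance.
        return max(candidates) <= reference * (1.0 + stopping_tolerance)
    # For metrics such as logloss or MSE, lower values are better.
    return min(candidates) >= reference * (1.0 - stopping_tolerance)


auc_history = [0.70, 0.75, 0.80, 0.80, 0.80, 0.80]
print(should_stop(auc_history, stopping_rounds=3, stopping_tolerance=1e-2))
plateau = auc_history + [0.80, 0.80]
print(should_stop(plateau, stopping_rounds=3, stopping_tolerance=1e-2))
```

In this sketch the first history is still improving relative to its oldest moving average, while the second has plateaued for long enough to trigger a stop.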
property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).

Type: float, defaults to 0.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
...                                        stopping_rounds=3,
...                                        stopping_tolerance=1e-2,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property target_ratio_comm_to_comp

Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration = -2 (auto-tuning).

Type: float, defaults to 0.05.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(target_ratio_comm_to_comp=0.05,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property train_samples_per_iteration

Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic.

Type: int, defaults to -2.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(train_samples_per_iteration=-1,
...                                        epochs=1,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator()
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property tweedie_power

Tweedie power for Tweedie regression, must be between 1 and 2.

Type: float, defaults to 1.5.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(tweedie_power=1.5,
...                                        seed=1234) 
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property use_all_factor_levels

Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.

Type: bool, defaults to True.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(use_all_factor_levels=True,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.mse()
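
The effect of this option on a categorical column can be pictured with a small one-hot encoding sketch in plain Python. The helper one_hot and the carrier levels are hypothetical, and H2O's internal encoding may differ in detail.

```python
def one_hot(value, levels, use_all_factor_levels=True):
    """Encode a categorical value as 0/1 indicators, optionally omitting the first level."""
    kept = levels if use_all_factor_levels else levels[1:]
    return [1 if value == level else 0 for level in kept]


levels = ["AA", "DL", "UA"]  # hypothetical factor levels
print(one_hot("DL", levels))                               # one indicator per level
print(one_hot("DL", levels, use_all_factor_levels=False))  # first level omitted
print(one_hot("AA", levels, use_all_factor_levels=False))  # omitted level becomes all zeros
```

When the first level is omitted, it is represented implicitly by the all-zero vector, so no information is lost, but that level no longer has an input of its own to attribute importance to.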
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(standardize=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property variable_importances

Compute variable importances for input features (Gedeon method) - can be slow for large networks.

Type: bool, defaults to True.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(variable_importances=True,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.mse()
property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid,
...               weights_column="weight")
>>> cars_dl.auc()

H2OGeneralizedAdditiveEstimator

class h2o.estimators.gam.H2OGeneralizedAdditiveEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, seed=-1, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, offset_column=None, weights_column=None, family='auto', tweedie_variance_power=0.0, tweedie_link_power=0.0, theta=0.0, solver='auto', alpha=None, lambda_=None, lambda_search=False, early_stopping=True, nlambdas=-1, standardize=False, missing_values_handling='mean_imputation', plug_values=None, compute_p_values=False, remove_collinear_columns=False, intercept=True, non_negative=False, max_iterations=-1, objective_epsilon=-1.0, beta_epsilon=0.0001, gradient_epsilon=-1.0, link='family_default', startval=None, prior=-1.0, cold_start=False, lambda_min_ratio=-1.0, beta_constraints=None, max_active_predictors=-1, interactions=None, interaction_pairs=None, obj_reg=-1.0, export_checkpoints_dir=None, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, max_runtime_secs=0.0, custom_metric_func=None, num_knots=None, spline_orders=None, knot_ids=None, gam_columns=None, standardize_tp_gam_cols=False, scale_tp_penalty_mat=False, bs=None, scale=None, keep_gam_cols=False, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Generalized Additive Model

Fits a generalized additive model, specified by a response variable, a set of predictors, and a description of the error distribution.

A subclass of ModelBase is returned. The specific subclass depends on the machine learning task at hand: for binomial classification an H2OBinomialModel is returned, and for regression an H2ORegressionModel is returned. The default print-out of the model is shown, but further GAM-specific information can be queried from the object. Upon completion of the GAM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.

property Lambda

[Deprecated] Use lambda_ instead

property alpha

Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.

Type: List[float].
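
As a rough illustration of how alpha mixes the two penalties, the sketch below evaluates the standard elastic net penalty, lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2). The function name elastic_net_penalty is hypothetical and is not part of the H2O API; it only mirrors the description above.

```python
def elastic_net_penalty(beta, alpha, lambda_):
    """Elastic net penalty for coefficients `beta` (intercept excluded).

    Hypothetical helper for illustration only; not an H2O function.
    """
    l1 = sum(abs(b) for b in beta)   # Lasso term: sum of absolute values
    l2 = sum(b * b for b in beta)    # Ridge term: squared L2 norm
    return lambda_ * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


beta = [0.5, -2.0, 0.0]
lasso = elastic_net_penalty(beta, alpha=1.0, lambda_=0.1)  # pure L1
ridge = elastic_net_penalty(beta, alpha=0.0, lambda_=0.1)  # pure L2
mixed = elastic_net_penalty(beta, alpha=0.5, lambda_=0.1)  # equal mix
```

With alpha=1 only the absolute values contribute, and with alpha=0 only the squared values do, which matches the Lasso and Ridge cases described above.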

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

property beta_constraints

Beta constraints

Type: Union[None, str, H2OFrame].

property beta_epsilon

Converge if beta changes less (using L-infinity norm) than beta_epsilon. ONLY applies to the IRLSM solver.

Type: float, defaults to 0.0001.

property bs

Basis function type for each GAM predictor: 0 for cr (cubic regression splines), 1 for thin plate regression with knots, 2 for monotone splines. If specified, must be the same size as gam_columns.

Type: List[int].

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

property cold_start

Only applicable to multiple alpha/lambda values when calling GLM from GAM. If false, the model for the next set of alpha/lambda values is built starting from the values provided by the current model. If true, the GLM model is started from scratch.

Type: bool, defaults to False.

property compute_p_values

Request p-values computation. P-values work only with the IRLSM solver and no regularization.

Type: bool, defaults to False.

property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property early_stopping

Stop early when there is no more relative improvement on train or validation (if provided)

Type: bool, defaults to True.

property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

property family

Family. Use binomial for classification with logistic regression, others are for regression problems.

Type: Literal["auto", "gaussian", "binomial", "quasibinomial", "ordinal", "multinomial", "poisson", "gamma", "tweedie", "negativebinomial", "fractionalbinomial"], defaults to "auto".

property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".
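
To make the difference between the modulo and random schemes concrete, here is a minimal sketch. It assumes modulo means "row index mod nfolds" and uses Python's own random module, so the random assignments will not match what H2O produces for the same seed. The helper assign_folds is hypothetical, and the stratified scheme is omitted.

```python
import random


def assign_folds(n_rows, nfolds, scheme="modulo", seed=1234):
    """Return one fold index in [0, nfolds) per row (illustrative only)."""
    if scheme == "modulo":
        # Row i goes to fold i mod nfolds: deterministic, near-equal fold sizes.
        return [i % nfolds for i in range(n_rows)]
    if scheme == "random":
        # Each row draws a fold independently; fold sizes depend on the seed.
        rng = random.Random(seed)
        return [rng.randrange(nfolds) for _ in range(n_rows)]
    raise ValueError("unsupported scheme: " + scheme)


modulo_folds = assign_folds(10, 3, "modulo")
random_folds = assign_folds(10, 3, "random")
```

The modulo scheme does not depend on the seed, while the random scheme is reproducible only for a fixed seed.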

property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

property gam_columns

Arrays of predictor column names for gam for smoothers using single or multiple predictors like {{‘c1’},{‘c2’,’c3’},{‘c4’},…}

Type: List[List[str]].

property gradient_epsilon

Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.

Type: float, defaults to -1.0.

property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property interaction_pairs

A list of pairwise (first order) column interactions.

Type: List[tuple].

property interactions

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

Type: List[str].
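
The phrase "all pairwise combinations" can be sketched with the standard library. The helper pairwise_interactions is hypothetical and the column names are only examples; the point is that n listed columns yield n * (n - 1) / 2 pairs, whereas interaction_pairs lets you name individual pairs explicitly.

```python
from itertools import combinations


def pairwise_interactions(columns):
    """List every unordered pair of the given columns (illustrative only)."""
    return list(combinations(columns, 2))


pairs = pairwise_interactions(["Origin", "Dest", "UniqueCarrier"])
```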

property intercept

Include constant term in the model

Type: bool, defaults to True.

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

property keep_gam_cols

Save keys of model matrix

Type: bool, defaults to False.

property knot_ids

Array storing frame keys of knots. One for each gam column set specified in gam_columns

Type: List[str].

property lambda_

Regularization strength

Type: List[float].

property lambda_min_ratio

Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.

Type: float, defaults to -1.0.

property lambda_search

Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.

Type: bool, defaults to False.

property link

Link function.

Type: Literal["family_default", "identity", "logit", "log", "inverse", "tweedie", "ologit"], defaults to "family_default".

property max_active_predictors

Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.

Type: int, defaults to -1.

property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_iterations

Maximum number of iterations

Type: int, defaults to -1.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

property missing_values_handling

Handling of missing values. Either MeanImputation, Skip or PlugValues.

Type: Literal["mean_imputation", "skip", "plug_values"], defaults to "mean_imputation".
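
The difference between the mean_imputation and skip options can be sketched for a numeric column as follows. This is a simplified illustration using None for missing values; the helpers are hypothetical and do not reflect how H2O stores or imputes data internally.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def skip_missing(rows):
    """Drop every row that contains at least one None."""
    return [row for row in rows if None not in row]


imputed = mean_impute([1.0, None, 3.0])
kept = skip_missing([(1.0, 2.0), (None, 5.0), (3.0, 4.0)])
```

Mean imputation keeps every row but fills the gaps, while skip reduces the number of rows used for training.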

property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

property nlambdas

Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.

Type: int, defaults to -1.
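
The defaults described for nlambdas and lambda_min_ratio can be sketched as below, together with a geometrically spaced lambda sequence. The helper names are hypothetical, the geometric spacing is an assumption about how the search grid is laid out, and the case where the number of observations equals the number of variables is not covered by the text (it is treated like the "fewer observations" case here).

```python
def resolve_nlambdas(nlambdas, alpha):
    """Apply the documented default when nlambdas is left at -1."""
    if nlambdas != -1:
        return nlambdas
    return 30 if alpha == 0 else 100


def resolve_lambda_min_ratio(lambda_min_ratio, n_obs, n_vars):
    """Apply the documented default when lambda_min_ratio is left at -1."""
    if lambda_min_ratio != -1.0:
        return lambda_min_ratio
    # Assumption: n_obs == n_vars falls into the 0.01 branch.
    return 1e-4 if n_obs > n_vars else 1e-2


def lambda_path(lambda_max, ratio, n):
    """Geometric sequence from lambda_max down to lambda_max * ratio."""
    if n == 1:
        return [lambda_max]
    return [lambda_max * ratio ** (i / (n - 1)) for i in range(n)]


n = resolve_nlambdas(-1, alpha=0.0)
ratio = resolve_lambda_min_ratio(-1.0, n_obs=1000, n_vars=20)
path = lambda_path(lambda_max=2.0, ratio=ratio, n=n)
```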

property non_negative

Restrict coefficients (not intercept) to be non-negative

Type: bool, defaults to False.

property num_knots

Number of knots for gam predictors

Type: List[int].

property obj_reg

Likelihood divider in objective value computation, default is 1/nobs

Type: float, defaults to -1.0.

property objective_epsilon

Converge if objective value changes less than this. Default indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.

Type: float, defaults to -1.0.
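
The conditional defaults for objective_epsilon (and for gradient_epsilon, described earlier) can be written out as a small sketch. The functions below are hypothetical and simply encode the rules stated in the two descriptions, treating -1.0 as "use the default".

```python
def resolve_objective_epsilon(objective_epsilon, lambda_search, lambda_value):
    """Documented default for objective_epsilon when left at -1.0."""
    if objective_epsilon != -1.0:
        return objective_epsilon
    if lambda_search:
        return 1e-4
    return 1e-6 if lambda_value == 0 else 1e-4


def resolve_gradient_epsilon(gradient_epsilon, lambda_search, lambda_value):
    """Documented default for gradient_epsilon when left at -1.0."""
    if gradient_epsilon != -1.0:
        return gradient_epsilon
    if lambda_search:
        return 1e-8 if lambda_value == 0 else 1e-6
    return 1e-6 if lambda_value == 0 else 1e-4


obj_eps = resolve_objective_epsilon(-1.0, lambda_search=False, lambda_value=0.0)
grad_eps = resolve_gradient_epsilon(-1.0, lambda_search=True, lambda_value=0.0)
```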

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.
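
A common use of an offset is an exposure term in a Poisson model with a log link. The sketch below assumes that setting and shows the offset being added to the linear combination of columns before the mean is computed; poisson_mean is a hypothetical helper, not an H2O function.

```python
import math


def poisson_mean(x, beta, intercept, offset=0.0):
    """Mean under a log link: exp(intercept + x . beta + offset)."""
    eta = intercept + sum(b * v for b, v in zip(beta, x)) + offset
    return math.exp(eta)


base = poisson_mean([1.0, 2.0], beta=[0.3, -0.1], intercept=0.5)
# An offset of log(2) corresponds to doubling the exposure.
doubled = poisson_mean([1.0, 2.0], beta=[0.3, -0.1], intercept=0.5,
                       offset=math.log(2.0))
```

Because the offset enters with a fixed coefficient of 1, doubling the exposure doubles the predicted mean without changing any fitted coefficient.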

property plug_values

Plug values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues).

Type: Union[None, str, H2OFrame].

property prior

Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.

Type: float, defaults to -1.0.

property remove_collinear_columns

In case of linearly dependent columns, remove some of the dependent columns

Type: bool, defaults to False.

property response_column

Response variable column.

Type: str.

property scale

Smoothing parameter for gam predictors. If specified, must be of the same length as gam_columns

Type: List[float].

property scale_tp_penalty_mat

Scale penalty matrix for tp (thin plate) smoothers as in R

Type: bool, defaults to False.

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

scoring_history()[source]

Retrieve Model Score History.

Returns

The score history as an H2OTwoDimTable or a Pandas DataFrame.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

property solver

AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.

Type: Literal["auto", "irlsm", "l_bfgs", "coordinate_descent_naive", "coordinate_descent", "gradient_descent_lh", "gradient_descent_sqerr"], defaults to "auto".

property spline_orders

Order of I-splines used for gam predictors. If specified, must be the same size as gam_columns. Values for bs=0 or 1 will be ignored.

Type: List[int].

property standardize

Standardize numeric columns to have zero mean and unit variance

Type: bool, defaults to False.

property standardize_tp_gam_cols

Standardize tp (thin plate) predictor columns.

Type: bool, defaults to False.

property startval

Array of doubles used to initialize the coefficients for GAM.

Type: List[float].

property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.
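
The moving-average rule can be sketched as follows. This is a simplified illustration, not H2O's implementation: it assumes a positive metric where lower is better (such as logloss), and it compares the latest moving average against the best of the k preceding ones using stopping_tolerance as a relative threshold. The function should_stop is hypothetical.

```python
def should_stop(history, stopping_rounds, stopping_tolerance=0.001):
    """Decide whether to stop, given metric values per scoring event."""
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    n = len(history)
    # Simple moving averages of length k for the last k + 1 windows.
    averages = [sum(history[i - k:i]) / k for i in range(n - k, n + 1)]
    best_earlier = min(averages[:-1])
    latest = averages[-1]
    # Stop when the relative improvement is smaller than the tolerance.
    return latest > best_earlier * (1.0 - stopping_tolerance)


improving = should_stop([1.0, 0.8, 0.6, 0.5, 0.4, 0.3], stopping_rounds=2)
plateaued = should_stop([1.0, 0.5, 0.5, 0.5, 0.5, 0.5], stopping_rounds=2)
```

Averaging over k scoring events smooths out noise in individual scores, so a single bad scoring event does not trigger a stop.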

property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.

summary()[source]

Print a detailed summary of the model.

property theta

Theta

Type: float, defaults to 0.0.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property tweedie_link_power

Tweedie link power

Type: float, defaults to 0.0.

property tweedie_variance_power

Tweedie variance power

Type: float, defaults to 0.0.

property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.
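
The equivalences stated above (a weight of 2 acts like a repeated row, a weight of 0 acts like a removed row) can be checked on a weighted loss. The sketch uses a weighted mean squared error as a stand-in for the training loss; weighted_mse is a hypothetical helper and the numbers are made up for illustration.

```python
def weighted_mse(actual, predicted, weights):
    """Mean squared error where each row counts in proportion to its weight."""
    total = sum(w * (a - p) ** 2
                for a, p, w in zip(actual, predicted, weights))
    return total / sum(weights)


# Three rows with weights 2, 1 and 0 ...
with_weights = weighted_mse([1.0, 2.0, 5.0], [0.5, 2.5, 0.0], [2, 1, 0])
# ... behave like: first row repeated twice, third row removed.
expanded = weighted_mse([1.0, 1.0, 2.0], [0.5, 0.5, 2.5], [1, 1, 1])
```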

H2OGradientBoostingEstimator

class h2o.estimators.gbm.H2OGradientBoostingEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, score_each_iteration=False, score_tree_interval=0, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, offset_column=None, weights_column=None, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, ntrees=50, max_depth=5, min_rows=10.0, nbins=20, nbins_top_level=1024, nbins_cats=1024, r2_stopping=None, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, max_runtime_secs=0.0, seed=-1, build_tree_one_node=False, learn_rate=0.1, learn_rate_annealing=1.0, distribution='auto', quantile_alpha=0.5, tweedie_power=1.5, huber_alpha=0.9, checkpoint=None, sample_rate=1.0, sample_rate_per_class=None, col_sample_rate=1.0, col_sample_rate_change_per_level=1.0, col_sample_rate_per_tree=1.0, min_split_improvement=1e-05, histogram_type='auto', max_abs_leafnode_pred=None, pred_noise_bandwidth=0.0, categorical_encoding='auto', calibrate_model=False, calibration_frame=None, custom_metric_func=None, custom_distribution_func=None, export_checkpoints_dir=None, monotone_constraints=None, check_constant_response=True, gainslift_bins=-1, auc_type='auto', interaction_constraints=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Gradient Boosting Machine

Builds gradient boosted trees on a parsed data set, for regression or classification. The default distribution function will guess the model type based on the response column type. Otherwise, the response column must be an enum for “bernoulli” or “multinomial”, and numeric for all other distributions.

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True,
...                                        seed=1234)
>>> cov_gbm.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cov_gbm.logloss(valid=True)
property build_tree_one_node

Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(build_tree_one_node=True,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property calibrate_model

Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.

Type: bool, defaults to False.

Examples

>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> ecology['Angaus'] = ecology['Angaus'].asfactor()
>>> response = 'Angaus'
>>> train, calib = ecology.split_frame(seed = 12354)
>>> predictors = ecology.columns[3:13]
>>> w = h2o.create_frame(binary_fraction=1,
...                      binary_ones_fraction=0.5,
...                      missing_fraction=0,
...                      rows=744, cols=1)
>>> w.set_names(["weight"])
>>> train = train.cbind(w)
>>> ecology_gbm = H2OGradientBoostingEstimator(ntrees=10,
...                                            max_depth=5,
...                                            min_rows=10,
...                                            learn_rate=0.1,
...                                            distribution="multinomial",
...                                            weights_column="weight",
...                                            calibrate_model=True,
...                                            calibration_frame=calib)
>>> ecology_gbm.train(x=predictors,
...                   y="Angaus",
...                   training_frame=train)
>>> ecology_gbm.auc()
property calibration_frame

Calibration frame for Platt Scaling

Type: Union[None, str, H2OFrame].

Examples

>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> ecology['Angaus'] = ecology['Angaus'].asfactor()
>>> response = 'Angaus'
>>> predictors = ecology.columns[3:13]
>>> train, calib = ecology.split_frame(seed=12354)
>>> w = h2o.create_frame(binary_fraction=1,
...                      binary_ones_fraction=0.5,
...                      missing_fraction=0,
...                      rows=744,cols=1)
>>> w.set_names(["weight"])
>>> train = train.cbind(w)
>>> ecology_gbm = H2OGradientBoostingEstimator(ntrees=10,
...                                            max_depth=5,
...                                            min_rows=10,
...                                            learn_rate=0.1,
...                                            distribution="multinomial",
...                                            calibrate_model=True,
...                                            calibration_frame=calib)
>>> ecology_gbm.train(x=predictors,
...                   y="Angaus",
...                   training_frame=train,
...                   weights_column="weight")
>>> ecology_gbm.auc()
property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(categorical_encoding="label_encoder",
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property check_constant_response

Check if response column is constant. If enabled, then an exception is thrown if the response column is a constant value. If disabled, then the model will train regardless of the response column being a constant value or not.

Type: bool, defaults to True.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> train["constantCol"] = 1
>>> my_gbm = H2OGradientBoostingEstimator(check_constant_response=False)
>>> my_gbm.train(x=list(range(1,5)),
...              y="constantCol",
...              training_frame=train)
property checkpoint

Model checkpoint to resume training with.

Type: Union[None, str, H2OEstimator].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(ntrees=1,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print(cars_gbm.auc(valid=True))
>>> print("Number of trees built for cars_gbm model:", cars_gbm.ntrees)
>>> cars_gbm_continued = H2OGradientBoostingEstimator(checkpoint=cars_gbm.model_id,
...                                                   ntrees=50,
...                                                   seed=1234)
>>> cars_gbm_continued.train(x=predictors,
...                          y=response,
...                          training_frame=train,
...                          validation_frame=valid)
>>> cars_gbm_continued.auc(valid=True)
>>> print("Number of trees built for cars_gbm_continued model:", cars_gbm_continued.ntrees)
property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True,
...                                        class_sampling_factors=sample_factors,
...                                        seed=1234)
>>> cov_gbm.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cov_gbm.logloss(valid=True)
property col_sample_rate

Column sample rate (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate=.7,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate_change_per_level=.9,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
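
As a rough sketch of what a "relative change per level" implies, the snippet below assumes the column sampling rate is multiplied by the factor once for every additional tree level and is capped at 1.0. Both the compounding rule and the cap are assumptions made for illustration, column_rate_by_level is a hypothetical helper, and any interaction with col_sample_rate_per_tree is ignored.

```python
def column_rate_by_level(col_sample_rate, change_per_level, max_depth):
    """Assumed column sampling rate at each tree level, top level first."""
    return [min(1.0, col_sample_rate * change_per_level ** level)
            for level in range(max_depth)]


# A factor below 1 samples fewer columns at deeper levels.
rates = column_rate_by_level(0.8, 0.9, 3)
```

Under these assumptions, a factor above 1 would sample more columns at deeper levels until the cap is reached.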
property col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate_per_tree=.7,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property custom_distribution_func

Reference to custom distribution, format: language:keyName=funcName

Type: str.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(ntrees=3,
...                                             max_depth=5,
...                                             distribution="bernoulli",
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> from h2o.utils.distributions import CustomDistributionBernoulli
>>> custom_distribution_bernoulli = h2o.upload_custom_distribution(CustomDistributionBernoulli,
...                                                                func_name="custom_bernoulli",
...                                                                func_file="custom_bernoulli.py")
>>> airlines_gbm_custom = H2OGradientBoostingEstimator(ntrees=3,
...                                                    max_depth=5,
...                                                    distribution="custom",
...                                                    custom_distribution_func=custom_distribution_bernoulli,
...                                                    seed=1235)
>>> airlines_gbm_custom.train(x=predictors,
...                           y=response,
...                           training_frame=train,
...                           validation_frame=valid)
>>> airlines_gbm_custom.auc()
property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(distribution="poisson",
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.mse(valid=True)
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> from h2o.grid.grid_search import H2OGridSearch
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> response = "IsDepDelayed"
>>> hyper_parameters = {'ntrees': [5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
...                'max_models': 5,
...                'seed': 1234,
...                'stopping_rounds': 3,
...                'stopping_metric': "AUTO",
...                'stopping_tolerance': 1e-2}
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_grid = H2OGridSearch(H2OGradientBoostingEstimator,
...                          hyper_params=hyper_parameters,
...                          search_criteria=search_crit)
>>> air_grid.train(x=predictors,
...                y=response,
...                training_frame=airlines,
...                distribution="bernoulli",
...                learn_rate=0.1,
...                max_depth=3,
...                export_checkpoints_dir=checkpoints_dir)
>>> len(listdir(checkpoints_dir))
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> assignment_type = "Random"
>>> cars_gbm = H2OGradientBoostingEstimator(fold_assignment=assignment_type,
...                                         nfolds=5,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors, y=response, training_frame=cars)
>>> cars_gbm.auc(xval=True)
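
To make the difference between the assignment schemes concrete, the following plain-Python sketch assigns fold indices by row position ("modulo") and by round-robin within each response class (a deterministic stand-in for "stratified"). Both helper functions are hypothetical and only approximate the idea; H2O's own stratified assignment is randomized and is not reproduced here.

```python
from collections import defaultdict


def modulo_folds(n_rows, nfolds):
    """Fold index is the row position modulo the number of folds."""
    return [row % nfolds for row in range(n_rows)]


def stratified_folds(labels, nfolds):
    """Round-robin within each class so every fold sees each class about
    equally often (illustrative simplification, not H2O's algorithm)."""
    folds = [0] * len(labels)
    seen = defaultdict(int)  # rows seen so far, per class
    for row, label in enumerate(labels):
        folds[row] = seen[label] % nfolds
        seen[label] += 1
    return folds


print(modulo_folds(7, 3))
print(stratified_folds(["a", "a", "b", "a", "b", "b"], 2))
```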
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> fold_numbers = cars.kfold_column(n_folds=5,
...                                  seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=cars,
...                fold_column="fold_numbers")
>>> cars_gbm.auc(xval=True)
property gainslift_bins

Number of bins for the Gains/Lift table. 0 disables the table; the default value of -1 selects automatic binning.

Type: int, defaults to -1.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OGradientBoostingEstimator(ntrees=1, gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.gains_lift()
property histogram_type

What type of histogram to use for finding optimal split points

Type: Literal["auto", "uniform_adaptive", "random", "quantiles_global", "round_robin", "uniform_robust"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(histogram_type="UniformAdaptive",
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property huber_alpha

Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

Type: float, defaults to 0.9.

Examples

>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> train, valid = insurance.split_frame(ratios=[.8], seed=1234)
>>> insurance_gbm = H2OGradientBoostingEstimator(distribution="huber",
...                                              huber_alpha=0.9,
...                                              seed=1234)
>>> insurance_gbm.train(x=predictors,
...                     y=response,
...                     training_frame=train,
...                     validation_frame=valid)
>>> insurance_gbm.mse(valid=True)
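
The role of huber_alpha as a threshold between quadratic and linear loss can be sketched in plain Python. The helpers below are hypothetical: the threshold delta is taken as the huber_alpha quantile of the absolute residuals using a simple nearest-rank rule, which is an assumption and may differ from how H2O computes the quantile internally.

```python
import math


def huber_delta(residuals, huber_alpha):
    """Threshold = huber_alpha quantile of |residual| (nearest-rank rule,
    an assumption made for this sketch)."""
    abs_res = sorted(abs(r) for r in residuals)
    rank = max(1, math.ceil(huber_alpha * len(abs_res)))
    return abs_res[rank - 1]


def huber_loss(residual, delta):
    """Quadratic for small residuals, linear beyond the threshold."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


residuals = [1, -2, 3, -4, 5, -6, 7, -8, 9, -100]
delta = huber_delta(residuals, huber_alpha=0.9)
print(delta, huber_loss(3, delta), huber_loss(-100, delta))
```

With huber_alpha=0.9, only the largest 10% of residuals (here the single outlier) fall into the linear region, so the outlier contributes far less than it would under squared error.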
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234,
...                                         ignore_const_cols=True)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property ignored_columns

Names of columns to ignore for training.

Type: List[str].
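
As a rough illustration of the effect, the sketch below derives the predictor list that remains once the response and the ignored columns are excluded. It assumes that, when no explicit predictor list is given, every other column is used for training; effective_predictors is a hypothetical helper, and the column list is an illustrative subset rather than a full dataset schema.

```python
def effective_predictors(columns, response, ignored_columns=()):
    """Columns left for training after dropping the response and any
    ignored columns (hypothetical helper, not part of h2o)."""
    ignored = set(ignored_columns)
    return [c for c in columns if c != response and c not in ignored]


# Illustrative subset of column names.
columns = ["ID", "CAPSULE", "AGE", "RACE", "PSA"]
print(effective_predictors(columns, response="CAPSULE", ignored_columns=["ID"]))
```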

property interaction_constraints

A set of allowed column interactions.

Type: List[List[str]].
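
One way to read a list of allowed interactions is as a check on the columns used along a single tree branch. The sketch below assumes that each inner list is a group of columns that may be split on together, and that a branch using a single column is always allowed; both points are assumptions for illustration, and branch_is_allowed is a hypothetical helper rather than an H2O function.

```python
def branch_is_allowed(branch_columns, interaction_constraints):
    """True if all columns used on one branch belong to a single allowed
    group (assumed semantics, for illustration only)."""
    used = set(branch_columns)
    if len(used) <= 1:
        return True  # a single column does not interact with anything
    return any(used <= set(group) for group in interaction_constraints)


constraints = [["AGE", "PSA"], ["GLEASON", "VOL"]]
print(branch_is_allowed(["AGE", "PSA", "AGE"], constraints))
print(branch_is_allowed(["AGE", "GLEASON"], constraints))
```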

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_fold_assignment=True,
...                                         nfolds=folds,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_models=True,
...                                         nfolds=folds,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc()
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_predictions=True,
...                                         nfolds=folds,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc()
property learn_rate

Learning rate (from 0.0 to 1.0)

Type: float, defaults to 0.1.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_gbm = H2OGradientBoostingEstimator(ntrees=10000,
...                                            learn_rate=0.01,
...                                            stopping_rounds=5,
...                                            stopping_metric="AUC",
...                                            stopping_tolerance=1e-4,
...                                            seed=1234)
>>> titanic_gbm.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_gbm.auc(valid=True)
property learn_rate_annealing

Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999)

Type: float, defaults to 1.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_gbm = H2OGradientBoostingEstimator(ntrees=10000,
...                                            learn_rate=0.05,
...                                            learn_rate_annealing=.9,
...                                            stopping_rounds=5,
...                                            stopping_metric="AUC",
...                                            stopping_tolerance=1e-4,
...                                            seed=1234)
>>> titanic_gbm.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_gbm.auc(valid=True)
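
The effect of annealing is easiest to see as arithmetic. The sketch below lists the learning rate applied to each successive tree, assuming the first tree uses learn_rate unchanged and the rate is multiplied by learn_rate_annealing after every tree; effective_learn_rates is a hypothetical helper, not an H2O function.

```python
def effective_learn_rates(learn_rate, learn_rate_annealing, ntrees):
    """Learning rate used for each tree when the rate is scaled by the
    annealing factor after every tree."""
    rates = []
    rate = learn_rate
    for _ in range(ntrees):
        rates.append(rate)
        rate *= learn_rate_annealing
    return rates


rates = effective_learn_rates(learn_rate=0.05, learn_rate_annealing=0.9, ntrees=5)
print([round(r, 6) for r in rates])
```

A factor as small as 0.9 shrinks the rate quickly (by roughly a third after four trees), which is why values close to 1, such as 0.99 or 0.999, are suggested in the description.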
property max_abs_leafnode_pred

Maximum absolute value of a leaf node prediction

Type: float, defaults to .

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_gbm = H2OGradientBoostingEstimator(max_abs_leafnode_pred=2,
...                                        seed=1234)
>>> cov_gbm.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cov_gbm.logloss(valid=True)
property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> max_size = .85
>>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True,
...                                        max_after_balance_size=max_size,
...                                        seed=1234)
>>> cov_gbm.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cov_gbm.logloss(valid=True)
property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_depth

Maximum tree depth (0 for unlimited).

Type: int, defaults to 5.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(ntrees=100,
...                                         max_depth=2,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(max_runtime_secs=10,
...                                         ntrees=10000,
...                                         max_depth=10,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property min_rows

Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 10.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(min_rows=16,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property min_split_improvement

Minimum relative improvement in squared error reduction for a split to happen

Type: float, defaults to 1e-05.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(min_split_improvement=1e-3,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property monotone_constraints

A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.

Type: dict.

Examples

>>> prostate_hex = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor()
>>> response = "CAPSULE"
>>> seed = 42
>>> monotone_constraints = {"AGE":1}
>>> gbm_model = H2OGradientBoostingEstimator(seed=seed,
...                                          monotone_constraints=monotone_constraints)
>>> gbm_model.train(y=response,
...                 ignored_columns=["ID"],
...                 training_frame=prostate_hex)
>>> gbm_model.scoring_history()
property nbins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point

Type: int, defaults to 20.

Examples

>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv")
>>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor()
>>> predictors = eeg.columns[:-1]
>>> response = 'eyeDetection'
>>> train, valid = eeg.split_frame(ratios=[.8], seed=1234)
>>> bin_num = [16, 32, 64, 128, 256, 512]
>>> label = ["16", "32", "64", "128", "256", "512"]
>>> for key, num in enumerate(bin_num):
...     eeg_gbm = H2OGradientBoostingEstimator(nbins=num, seed=1234)
...     eeg_gbm.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
...     print(label[key], 'training score', eeg_gbm.auc(train=True)) 
...     print(label[key], 'validation score', eeg_gbm.auc(valid=True))
property nbins_cats

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

Type: int, defaults to 1024.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> bin_num = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
>>> label = ["8", "16", "32", "64", "128", "256", "512", "1024", "2048", "4096"]
>>> for key, num in enumerate(bin_num):
...     airlines_gbm = H2OGradientBoostingEstimator(nbins_cats=num, seed=1234)
...     airlines_gbm.train(x=predictors,
...                        y=response,
...                        training_frame=train,
...                        validation_frame=valid)
...     print(label[key], 'training score', airlines_gbm.auc(train=True))
...     print(label[key], 'validation score', airlines_gbm.auc(valid=True))
property nbins_top_level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level

Type: int, defaults to 1024.

Examples

>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv")
>>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor()
>>> predictors = eeg.columns[:-1]
>>> response = 'eyeDetection'
>>> train, valid = eeg.split_frame(ratios=[.8], seed=1234)
>>> bin_num = [32, 64, 128, 256, 512, 1024, 2048, 4096]
>>> label = ["32", "64", "128", "256", "512", "1024", "2048", "4096"]
>>> for key, num in enumerate(bin_num):
...     eeg_gbm = H2OGradientBoostingEstimator(nbins_top_level=num, seed=1234)
...     eeg_gbm.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
...     print(label[key], 'training score', eeg_gbm.auc(train=True)) 
...     print(label[key], 'validation score', eeg_gbm.auc(valid=True))
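
The halving rule can be written out as a small calculation. This sketch assumes that the bin count at depth d is nbins_top_level divided by 2^d and that it never drops below nbins (following the "at least this many bins" wording of nbins); bins_per_level is a hypothetical helper for illustration.

```python
def bins_per_level(nbins_top_level, nbins, max_depth):
    """Histogram bins for numerical columns at each tree depth, halving
    per level with nbins as the assumed lower bound."""
    return [max(nbins, nbins_top_level >> depth) for depth in range(max_depth + 1)]


print(bins_per_level(nbins_top_level=1024, nbins=20, max_depth=6))
```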
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> cars_gbm = H2OGradientBoostingEstimator(nfolds=folds,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=cars)
>>> cars_gbm.auc()
property ntrees

Number of trees.

Type: int, defaults to 50.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> tree_num = [20, 50, 80, 110, 140, 170, 200]
>>> label = ["20", "50", "80", "110", "140", "170", "200"]
>>> for key, num in enumerate(tree_num):
...     titanic_gbm = H2OGradientBoostingEstimator(ntrees=num,
...                                                seed=1234)
...     titanic_gbm.train(x=predictors,
...                       y=response,
...                       training_frame=train,
...                       validation_frame=valid)
...     print(label[key], 'training score', titanic_gbm.auc(train=True))
...     print(label[key], 'validation score', titanic_gbm.auc(valid=True))
property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_gbm = H2OGradientBoostingEstimator(offset_column="offset",
...                                           seed=1234)
>>> boston_gbm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_gbm.mse(valid=True)
property pred_noise_bandwidth

Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions

Type: float, defaults to 0.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_gbm = H2OGradientBoostingEstimator(pred_noise_bandwidth=0.1,
...                                            seed=1234)
>>> titanic_gbm.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_gbm.auc(valid=True)
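
Multiplicative noise of this kind can be sketched with the standard library. The example treats sigma as the standard deviation of the Gaussian, which is an assumption, and noisy_prediction is a hypothetical helper rather than H2O's implementation.

```python
import random


def noisy_prediction(leaf_value, pred_noise_bandwidth, rng):
    """Multiply a leaf prediction by a draw from N(1, sigma); a bandwidth
    of 0 leaves the prediction unchanged."""
    if pred_noise_bandwidth <= 0:
        return leaf_value
    return leaf_value * rng.gauss(1.0, pred_noise_bandwidth)


rng = random.Random(1234)
print(noisy_prediction(0.25, 0.0, rng))   # unchanged
print(noisy_prediction(0.25, 0.1, rng))   # perturbed around 0.25
```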
property quantile_alpha

Desired quantile for Quantile regression, must be between 0 and 1.

Type: float, defaults to 0.5.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_gbm = H2OGradientBoostingEstimator(distribution="quantile",
...                                           quantile_alpha=.8,
...                                           seed=1234)
>>> boston_gbm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_gbm.mse(valid=True)
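
Quantile regression is commonly described in terms of the pinball (quantile) loss, which the following sketch computes in plain Python. pinball_loss is a hypothetical helper shown to illustrate what quantile_alpha controls; it is not taken from the H2O code base.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss: under-predictions are weighted by alpha,
    over-predictions by (1 - alpha)."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


under = pinball_loss([10.0], [8.0], alpha=0.8)   # prediction too low
over = pinball_loss([8.0], [10.0], alpha=0.8)    # prediction too high
print(under, over)
```

With quantile_alpha=0.8, predicting too low costs four times as much as predicting too high by the same amount, which pushes the fitted values toward the 80th percentile of the response.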
property r2_stopping

r2_stopping is no longer supported and is ignored if set; use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O stopped building trees once the R^2 metric equaled or exceeded this value.

Type: float, defaults to .

property response_column

Response variable column.

Type: str.

property sample_rate

Row sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(sample_rate=.7,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property sample_rate_per_class

A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> rate_per_class_list = [1, .4, 1, 1, 1, 1, 1]
>>> cov_gbm = H2OGradientBoostingEstimator(sample_rate_per_class=rate_per_class_list,
...                                        seed=1234)
>>> cov_gbm.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cov_gbm.logloss(valid=True)
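
Per-class row sampling can be pictured as an independent keep-or-drop decision for each row, using the rate of that row's class. The sketch below assumes the rates are listed in sorted class order, which is an assumption for illustration; sample_rows_per_class is a hypothetical helper and does not reproduce H2O's sampler.

```python
import random


def sample_rows_per_class(labels, rates, seed=1234):
    """Return indices of rows kept when each row is retained with the
    sample rate of its class (rates given in sorted class order)."""
    rng = random.Random(seed)
    rate_of = dict(zip(sorted(set(labels)), rates))
    return [i for i, label in enumerate(labels) if rng.random() < rate_of[label]]


labels = ["a"] * 100 + ["b"] * 100
kept = sample_rows_per_class(labels, rates=[1.0, 0.4])
kept_a = sum(1 for i in kept if labels[i] == "a")
kept_b = sum(1 for i in kept if labels[i] == "b")
print(kept_a, kept_b)
```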
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(score_each_iteration=True,
...                                         ntrees=55,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.scoring_history()
property score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(score_tree_interval=5,
...                                         ntrees=55,
...                                         seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.scoring_history()
property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> gbm_w_seed_1 = H2OGradientBoostingEstimator(col_sample_rate=.7,
...                                             seed=1234)
>>> gbm_w_seed_1.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print('auc for the 1st model built with a seed:', gbm_w_seed_1.auc(valid=True))
property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc",
...                                             stopping_rounds=3,
...                                             stopping_tolerance=1e-2,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property stopping_rounds

Early stopping based on convergence of stopping_metric. Training stops if the simple moving average of length k of the stopping_metric does not improve for k scoring events, where k = stopping_rounds (0 to disable).

Type: int, defaults to 0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc",
...                                             stopping_rounds=3,
...                                             stopping_tolerance=1e-2,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
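
The moving-average rule can be sketched in plain Python. This is a simplified illustration and not H2O's exact implementation: it assumes that training stops when the latest moving average of length k fails to beat the best of the k preceding moving averages by a relative margin of stopping_tolerance, and should_stop is a hypothetical helper.

```python
def should_stop(history, k, tolerance=0.0, larger_is_better=False):
    """Decide whether to stop, given the metric value at each scoring event."""
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Moving averages of length k for the last k + 1 windows, newest first.
    averages = []
    for shift in range(k + 1):
        end = len(history) - shift
        averages.append(sum(history[end - k:end]) / k)
    latest, previous = averages[0], averages[1:]
    if larger_is_better:
        return latest <= max(previous) * (1 + tolerance)
    return latest >= min(previous) * (1 - tolerance)


improving = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]      # logloss still falling
plateau = [0.9, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]   # logloss has flattened
print(should_stop(improving, k=3, tolerance=1e-2))
print(should_stop(plateau, k=3, tolerance=1e-2))
```

Averaging over k scoring events makes the rule less sensitive to a single noisy score than comparing consecutive values directly.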
property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc",
...                                             stopping_rounds=3,
...                                             stopping_tolerance=1e-2,
...                                             seed=1234)
>>> airlines_gbm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_gbm.auc(valid=True)
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property tweedie_power

Tweedie power for Tweedie regression, must be between 1 and 2.

Type: float, defaults to 1.5.

Examples

>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> train, valid = insurance.split_frame(ratios=[.8], seed=1234)
>>> insurance_gbm = H2OGradientBoostingEstimator(distribution="tweedie",
...                                              tweedie_power=1.2,
...                                              seed=1234)
>>> insurance_gbm.train(x=predictors,
...                     y=response,
...                     training_frame=train,
...                     validation_frame=valid)
>>> insurance_gbm.mse(valid=True)
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)
property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid,
...                weights_column="weight")
>>> cars_gbm.auc(valid=True)
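
The equivalence between a weight of 2 and a repeated row, and between a weight of 0 and an excluded row, can be checked with a small weighted-error calculation. The sketch below uses a hypothetical helper, weighted_mse, and made-up numbers; it illustrates the weighting idea only and is not how H2O computes its metrics internally.

```python
import math


def weighted_mse(actual, predicted, weights):
    """Mean squared error where each row counts in proportion to its weight."""
    total = sum(w * (a - p) ** 2
                for a, p, w in zip(actual, predicted, weights))
    return total / sum(weights)


actual = [3.0, 5.0, 8.0]
predicted = [2.5, 5.5, 6.0]

# Row 2 has weight 2 (counted twice); row 3 has weight 0 (excluded).
weighted = weighted_mse(actual, predicted, [1, 2, 0])

# Same data with row 2 physically repeated and row 3 removed.
repeated = weighted_mse([3.0, 5.0, 5.0], [2.5, 5.5, 5.5], [1, 1, 1])

print(math.isclose(weighted, repeated))
```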

H2OGeneralizedLinearEstimator

class h2o.estimators.glm.H2OGeneralizedLinearEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, checkpoint=None, export_checkpoints_dir=None, seed=-1, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, random_columns=None, ignore_const_cols=True, score_each_iteration=False, score_iteration_interval=-1, offset_column=None, weights_column=None, family='auto', rand_family=None, tweedie_variance_power=0.0, tweedie_link_power=1.0, theta=1e-10, solver='auto', alpha=None, lambda_=None, lambda_search=False, early_stopping=True, nlambdas=-1, standardize=True, missing_values_handling='mean_imputation', plug_values=None, compute_p_values=False, remove_collinear_columns=False, intercept=True, non_negative=False, max_iterations=-1, objective_epsilon=-1.0, beta_epsilon=0.0001, gradient_epsilon=-1.0, link='family_default', rand_link=None, startval=None, calc_like=False, HGLM=False, prior=-1.0, cold_start=False, lambda_min_ratio=-1.0, beta_constraints=None, max_active_predictors=-1, interactions=None, interaction_pairs=None, obj_reg=-1.0, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, max_runtime_secs=0.0, custom_metric_func=None, generate_scoring_history=False, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Generalized Linear Modeling

Fits a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

A subclass of ModelBase is returned. The specific subclass depends on the machine learning task at hand: for binomial classification an H2OBinomialModel is returned, and for regression an H2ORegressionModel is returned. The default print-out of the models is shown, but further GLM-specific information can be queried from the object. Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.

property HGLM

If set to true, an HGLM model is returned. Otherwise, a normal GLM model is returned.

Type: bool, defaults to False.

property Lambda

[Deprecated] Use lambda_ instead

property alpha

Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.

Type: List[float].

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(alpha=.25)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(boston_glm.mse(valid=True))
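
To make the mixing role of alpha concrete, the sketch below evaluates an elastic net penalty of the form lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared). The function name and the coefficient values are illustrative, and the factor of 1/2 on the ridge term is an assumption: conventions for that constant differ between libraries.

```python
import math


def elastic_net_penalty(beta, alpha, lambda_):
    """Elastic net penalty for coefficients beta (intercept excluded)."""
    l1 = sum(abs(b) for b in beta)    # Lasso part: sum of absolute values
    l2 = sum(b * b for b in beta)     # Ridge part: sum of squares
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


beta = [1.0, -2.0, 0.5]
lasso = elastic_net_penalty(beta, alpha=1.0, lambda_=0.1)    # pure L1
ridge = elastic_net_penalty(beta, alpha=0.0, lambda_=0.1)    # pure L2
mixed = elastic_net_penalty(beta, alpha=0.25, lambda_=0.1)   # 25% L1, 75% L2
print(lasso, ridge, mixed)
```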
property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True,
...                                          seed=1234)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property beta_constraints

Beta constraints

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> n = len(predictors)
>>> constraints = h2o.H2OFrame({'names':predictors,
...                             'lower_bounds': [-1000]*n,
...                             'upper_bounds': [1000]*n,
...                             'beta_given': [1]*n,
...                             'rho': [0.2]*n})
>>> cars_glm = H2OGeneralizedLinearEstimator(standardize=True,
...                                          beta_constraints=constraints)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property beta_epsilon

Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to the IRLSM solver.

Type: float, defaults to 0.0001.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(beta_epsilon=1e-3)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
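
The L-infinity criterion amounts to comparing the largest absolute change in any single coefficient against beta_epsilon. A minimal sketch follows; the helper name beta_converged and the coefficient vectors are hypothetical and stand in for two successive solver iterates.

```python
def beta_converged(beta_old, beta_new, beta_epsilon=1e-4):
    """True when the largest per-coefficient change is below beta_epsilon."""
    max_change = max(abs(new - old) for old, new in zip(beta_old, beta_new))
    return max_change < beta_epsilon


# Largest change is about 5e-5, below the default threshold of 1e-4.
print(beta_converged([1.0, 2.0], [1.00005, 2.00001]))
# Largest change is 0.01, above the looser threshold of 1e-3.
print(beta_converged([1.0, 2.0], [1.0, 2.01], beta_epsilon=1e-3))
```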
property calc_like

If true, will return the likelihood function value for HGLM.

Type: bool, defaults to False.

property checkpoint

Model checkpoint to resume training with.

Type: Union[None, str, H2OEstimator].

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True,
...                                          class_sampling_factors=sample_factors,
...                                          seed=1234)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property cold_start

Only applicable to multiple alpha/lambda values. If false, build the model for the next set of alpha/lambda values starting from the values provided by the current model. If true, start each GLM model from scratch.

Type: bool, defaults to False.

property compute_p_values

Request p-values computation. P-values work only with the IRLSM solver and no regularization.

Type: bool, defaults to False.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8])
>>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                              lambda_=0,
...                                              remove_collinear_columns=True,
...                                              compute_p_values=True)
>>> airlines_glm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_glm.mse()
property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property early_stopping

Stop early when there is no more relative improvement on train or validation (if provided)

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                          early_stopping=True)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.auc(valid=True)
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> checkpoints = tempfile.mkdtemp()
>>> cars_glm = H2OGeneralizedLinearEstimator(export_checkpoints_dir=checkpoints,
...                                          seed=1234)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
>>> len(listdir(checkpoints))
property family

Family. Use binomial for classification with logistic regression, others are for regression problems.

Type: Literal["auto", "gaussian", "binomial", "fractionalbinomial", "quasibinomial", "ordinal", "multinomial", "poisson", "gamma", "tweedie", "negativebinomial"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.auc(valid=True)
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> assignment_type = "Random"
>>> cars_glm = H2OGeneralizedLinearEstimator(fold_assignment=assignment_type,
...                                          nfolds=5,
...                                          family='binomial',
...                                          seed=1234)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=cars)
>>> cars_glm.auc(train=True)
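
The difference between the "modulo" and "random" schemes can be illustrated without a cluster. In the sketch below, assign_folds is a hypothetical helper: the modulo branch assigns row i to fold i % nfolds, while the random branch uses Python's own generator, which is an assumption and will not reproduce the fold numbers H2O draws. The stratified scheme is not covered.

```python
import random


def assign_folds(n_rows, nfolds, scheme="modulo", seed=1234):
    """Return a fold index in [0, nfolds) for each row (illustrative only)."""
    if scheme == "modulo":
        # Deterministic and independent of the seed: folds cycle 0, 1, 2, ...
        return [i % nfolds for i in range(n_rows)]
    if scheme == "random":
        rng = random.Random(seed)     # seeded so the assignment is repeatable
        return [rng.randrange(nfolds) for _ in range(n_rows)]
    raise ValueError("unsupported scheme: " + scheme)


modulo_folds = assign_folds(7, 3, scheme="modulo")
random_folds = assign_folds(7, 3, scheme="random", seed=1234)
print(modulo_folds)
print(random_folds)
```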
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> print(cars['fold_numbers'])
>>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234,
...                                          family="binomial")
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=cars,
...                fold_column="fold_numbers")
>>> cars_glm.auc(xval=True)
property generate_scoring_history

If set to true, will generate scoring history for GLM. This may significantly slow down the algo.

Type: bool, defaults to False.

static getAlphaBest(model)[source]

Extract best alpha value found from glm model.

Parameters

model – source lambda search model

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family = 'binomial',
...                                   lambda_search = True,
...                                   solver = 'COORDINATE_DESCENT')
>>> m.train(training_frame = d,
...         x = [2,3,4,5,6,7,8],
...         y = 1)
>>> bestAlpha = H2OGeneralizedLinearEstimator.getAlphaBest(m)
>>> print("Best alpha found is {0}".format(bestAlpha))
static getGLMRegularizationPath(model)[source]

Extract full regularization path explored during lambda search from glm model.

Parameters

model – source lambda search model

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family = 'binomial',
...                                   lambda_search = True,
...                                   solver = 'COORDINATE_DESCENT')
>>> m.train(training_frame = d,
...         x = [2,3,4,5,6,7,8],
...         y = 1)
>>> r = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(m)
>>> m2 = H2OGeneralizedLinearEstimator.makeGLMModel(model=m,
...                                                 coefs=r['coefficients'][10])
>>> dev1 = r['explained_deviance_train'][10]
>>> p = m2.model_performance(d)
>>> dev2 = 1-p.residual_deviance()/p.null_deviance()
>>> print(dev1, " =?= ", dev2)
static getLambdaBest(model)[source]

Extract best lambda value found from glm model.

Parameters

model – source lambda search model

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family = 'binomial',
...                                   lambda_search = True,
...                                   solver = 'COORDINATE_DESCENT')
>>> m.train(training_frame = d,
...         x = [2,3,4,5,6,7,8],
...         y = 1)
>>> bestLambda = H2OGeneralizedLinearEstimator.getLambdaBest(m)
>>> print("Best lambda found is {0}".format(bestLambda))
static getLambdaMax(model)[source]

Extract the maximum lambda value used during lambda search.

Parameters

model – source lambda search model

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family = 'binomial',
...                                   lambda_search = True,
...                                   solver = 'COORDINATE_DESCENT')
>>> m.train(training_frame = d,
...         x = [2,3,4,5,6,7,8],
...         y = 1)
>>> maxLambda = H2OGeneralizedLinearEstimator.getLambdaMax(m)
>>> print("Maximum lambda found is {0}".format(maxLambda))
static getLambdaMin(model)[source]

Extract the minimum lambda value calculated during lambda search from glm model. Note that due to early stop, this minimum lambda value may not be used in the actual lambda search.

Parameters

model – source lambda search model

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family = 'binomial',
...                                   lambda_search = True,
...                                   solver = 'COORDINATE_DESCENT')
>>> m.train(training_frame = d,
...         x = [2,3,4,5,6,7,8],
...         y = 1)
>>> minLambda = H2OGeneralizedLinearEstimator.getLambdaMin(m)
>>> print("Minimum lambda found is {0}".format(minLambda))
property gradient_epsilon

Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.

Type: float, defaults to -1.0.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(gradient_epsilon=1e-3)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
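
The rule for resolving the default value of -1.0 can be written as a small lookup. The function below is a hypothetical helper that restates the four cases from the description above; it is not part of the H2O API.

```python
def default_gradient_epsilon(lambda_search, lambda_):
    """Effective gradient_epsilon when the parameter is left at -1.0."""
    if lambda_search:
        # With lambda search the thresholds are tighter.
        return 1e-8 if lambda_ == 0 else 1e-6
    return 1e-6 if lambda_ == 0 else 1e-4


print(default_gradient_epsilon(lambda_search=False, lambda_=0))
print(default_gradient_epsilon(lambda_search=False, lambda_=0.5))
print(default_gradient_epsilon(lambda_search=True, lambda_=0))
print(default_gradient_epsilon(lambda_search=True, lambda_=0.5))
```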
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234,
...                                          ignore_const_cols=True,
...                                          family="binomial")
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.auc(valid=True)
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property interaction_pairs

A list of pairwise (first order) column interactions.

Type: List[tuple].

Examples

>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> XY = [df.names[i-1] for i in [1,2,3,4,6,8,9,13,17,18,19,31]]
>>> interactions = [XY[i-1] for i in [5,7,9]]
>>> m = H2OGeneralizedLinearEstimator(lambda_search=True,
...                                   family="binomial",
...                                   interactions=interactions)
>>> m.train(x=XY[:len(XY)], y=XY[-1],training_frame=df)
>>> m._model_json['output']['coefficients_table']
>>> coef_m = m._model_json['output']['coefficients_table']
>>> interaction_pairs = [("CRSDepTime", "UniqueCarrier"),
...                      ("CRSDepTime", "Origin"),
...                      ("UniqueCarrier", "Origin")]
>>> mexp = H2OGeneralizedLinearEstimator(lambda_search=True,
...                                      family="binomial",
...                                      interaction_pairs=interaction_pairs)
>>> mexp.train(x=XY[:len(XY)], y=XY[-1],training_frame=df)
>>> mexp._model_json['output']['coefficients_table']
property interactions

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

Type: List[str].

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> interactions_list = ['crim', 'dis']
>>> boston_glm = H2OGeneralizedLinearEstimator(interactions=interactions_list) 
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
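
"All pairwise combinations" means every unordered pair drawn from the list. The sketch below enumerates those pairs with the standard library; pairwise_interactions is a hypothetical helper, and the third column name, 'rm', is added only to make the expansion visible.

```python
from itertools import combinations


def pairwise_interactions(columns):
    """List every unordered pair of columns, in input order."""
    return list(combinations(columns, 2))


pairs = pairwise_interactions(['crim', 'dis', 'rm'])
print(pairs)
```

A list of n columns yields n * (n - 1) / 2 pairs, so three columns produce three interaction terms. When only some of those pairs are wanted, interaction_pairs accepts an explicit list of tuples instead.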
property intercept

Include constant term in the model

Type: bool, defaults to True.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris['class'] = iris['class'].asfactor()
>>> predictors = iris.columns[:-1]
>>> response = 'class'
>>> train, valid = iris.split_frame(ratios=[.8])
>>> iris_glm = H2OGeneralizedLinearEstimator(family='multinomial',
...                                          intercept=True)
>>> iris_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> iris_glm.logloss(valid=True)
property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_fold_assignment=True,
...                                          nfolds=5,
...                                          seed=1234,
...                                          family="binomial")
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_glm.cross_validation_fold_assignment()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_models=True,
...                                          nfolds=5,
...                                          seed=1234,
...                                          family="binomial")
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_glm_cv_models = cars_glm.cross_validation_models()
>>> print(cars_glm.cross_validation_models())
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_predictions=True,
...                                          nfolds=5,
...                                          seed=1234,
...                                          family="binomial")
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_glm.cross_validation_predictions()
property lambda_

Regularization strength

Type: List[float].

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8])
>>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                              lambda_=.0001)
>>> airlines_glm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_glm.auc(valid=True))
property lambda_min_ratio

Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.

Type: float, defaults to -1.0.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(lambda_min_ratio=.0001)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
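
The relationship between lambda_min_ratio, lambda_max, and the smallest lambda searched can be sketched as follows. Both function names are hypothetical, and the handling of the case where observations equal variables (treated here like the "fewer observations" case) is an assumption, since the description above does not cover it.

```python
import math


def default_lambda_min_ratio(n_obs, n_vars):
    """Default ratio used when lambda_min_ratio is left at -1.0."""
    # Assumption: n_obs == n_vars falls into the 0.01 branch.
    return 1e-4 if n_obs > n_vars else 1e-2


def lambda_min(lambda_max, n_obs, n_vars, lambda_min_ratio=-1.0):
    """Smallest lambda in the search, as a fraction of lambda_max."""
    if lambda_min_ratio < 0:
        lambda_min_ratio = default_lambda_min_ratio(n_obs, n_vars)
    return lambda_min_ratio * lambda_max


print(lambda_min(2.0, n_obs=506, n_vars=13))    # more rows than columns
print(lambda_min(2.0, n_obs=10, n_vars=50))     # fewer rows than columns
print(lambda_min(2.0, n_obs=506, n_vars=13, lambda_min_ratio=0.5))
```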

property lambda_search

Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.

Type: bool, defaults to False.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(lambda_search=True)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(boston_glm.mse(valid=True))

property link

Link function.

Type: Literal["family_default", "identity", "logit", "log", "inverse", "tweedie", "ologit"], defaults to "family_default".

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris['class'] = iris['class'].asfactor()
>>> predictors = iris.columns[:-1]
>>> response = 'class'
>>> train, valid = iris.split_frame(ratios=[.8])
>>> iris_glm = H2OGeneralizedLinearEstimator(family='multinomial',
...                                          link='family_default')
>>> iris_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> iris_glm.logloss()
static makeGLMModel(model, coefs, threshold=0.5)[source]

Create a custom GLM model using the given coefficients.

Needs to be passed source model trained on the dataset to extract the dataset information from.

Parameters
  • model – source model, used for extracting dataset information

  • coefs – dictionary containing model coefficients

  • threshold – (optional, only for binomial) decision threshold used for classification

Examples

>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family='binomial',
...                                   lambda_search=True,
...                                   solver='COORDINATE_DESCENT')
>>> m.train(training_frame=d,
...         x=[2,3,4,5,6,7,8],
...         y=1)
>>> r = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(m)
>>> m2 = H2OGeneralizedLinearEstimator.makeGLMModel(model=m,
...                                                 coefs=r['coefficients'][10])
>>> dev1 = r['explained_deviance_train'][10]
>>> p = m2.model_performance(d)
>>> dev2 = 1-p.residual_deviance()/p.null_deviance()
>>> print(dev1, " =?= ", dev2)
property max_active_predictors

Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. The default (-1) is resolved as follows: if the IRLSM solver is used, max_active_predictors is set to 5000; otherwise it is set to 100000000.

Type: int, defaults to -1.

Examples

>>> higgs = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/testng/higgs_train_5k.csv")
>>> predictors = higgs.names
>>> predictors.remove('response')
>>> response = "response"
>>> train, valid = higgs.split_frame(ratios=[.8])
>>> higgs_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                           max_active_predictors=200)
>>> higgs_glm.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> higgs_glm.auc()
property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","year"]
>>> response = "acceleration"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> max_size = .85
>>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True,
...                                          max_after_balance_size=max_size,
...                                          seed=1234)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_iterations

Maximum number of iterations

Type: int, defaults to -1.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                          max_iterations=50)
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm = H2OGeneralizedLinearEstimator(max_runtime_secs=10,
...                                          seed=1234) 
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.mse()
property missing_values_handling

Handling of missing values. Either MeanImputation, Skip or PlugValues.

Type: Literal["mean_imputation", "skip", "plug_values"], defaults to "mean_imputation".

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston.insert_missing_values()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(missing_values_handling="skip")
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
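
The three options can be pictured on a single column. The sketch below is plain Python, not H2O code; the function name handle_missing and the use of None as the missing marker are assumptions made for illustration.

```python
def handle_missing(column, mode, plug_value=None):
    """Apply one missing-value strategy to a list where None marks a missing entry."""
    observed = [v for v in column if v is not None]
    if mode == "skip":
        # Drop the missing entries; in a multi-column frame the whole row
        # would be skipped (an assumption of this sketch).
        return observed
    if mode == "mean_imputation":
        fill = sum(observed) / len(observed)
    elif mode == "plug_values":
        fill = plug_value
    else:
        raise ValueError("unknown mode: " + mode)
    return [fill if v is None else v for v in column]


column = [1.0, None, 3.0]
mean_filled = handle_missing(column, "mean_imputation")
skipped = handle_missing(column, "skip")
plugged = handle_missing(column, "plug_values", plug_value=0.0)
print(mean_filled, skipped, plugged)
```
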
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> cars_glm = H2OGeneralizedLinearEstimator(nfolds=folds,
...                                          seed=1234,
...                                          family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=cars)
>>> cars_glm.auc(xval=True)
property nlambdas

Number of lambdas to be used in a search. The default (-1) is resolved as follows: if alpha is zero and lambda_search is set to True, nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.

Type: int, defaults to -1.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(lambda_search=True,
...                                            nlambdas=50)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(boston_glm.mse(valid=True))
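
The default rule described above can be written out as a small helper. This is a sketch of the documented rule only; the function name resolve_nlambdas is hypothetical, and alpha is treated as a single number for simplicity.

```python
def resolve_nlambdas(nlambdas, alpha, lambda_search):
    """Resolve the -1 placeholder for nlambdas as described in the text."""
    if nlambdas != -1:
        return nlambdas          # an explicit value is used as given
    if lambda_search and alpha == 0:
        return 30                # ridge regression needs fewer lambdas
    return 100


print(resolve_nlambdas(-1, alpha=0, lambda_search=True))
print(resolve_nlambdas(-1, alpha=0.5, lambda_search=True))
print(resolve_nlambdas(50, alpha=0, lambda_search=True))
```
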
property non_negative

Restrict coefficients (not intercept) to be non-negative

Type: bool, defaults to False.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8])
>>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                              non_negative=True)
>>> airlines_glm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_glm.auc()
property obj_reg

Likelihood divider in objective value computation. The default (-1.0) sets it to 1/nobs.

Type: float, defaults to -1.0.

Examples

>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/glm_ordinal_logit/ordinal_multinomial_training_set.csv")
>>> df["C11"] = df["C11"].asfactor()
>>> ordinal_fit = H2OGeneralizedLinearEstimator(family="ordinal",
...                                             alpha=1.0,
...                                             lambda_=0.000000001,
...                                             obj_reg=0.00001,
...                                             max_iterations=1000,
...                                             beta_epsilon=1e-8,
...                                             objective_epsilon=1e-10)
>>> ordinal_fit.train(x=list(range(0,10)),
...                   y="C11",
...                   training_frame=df)
>>> ordinal_fit.mse()
property objective_epsilon

Converge if the objective value changes by less than this. The default (-1.0) is resolved as follows: if lambda_search is set to True, objective_epsilon is set to .0001. If lambda_search is set to False and lambda is equal to zero, objective_epsilon is set to .000001; for any other value of lambda, it is set to .0001.

Type: float, defaults to -1.0.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(objective_epsilon=1e-3)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
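
The branching in the default rule is easier to follow as code. The helper below is a sketch of the rule as documented here; resolve_objective_epsilon is a hypothetical name, and lambda_ is treated as a single number.

```python
def resolve_objective_epsilon(objective_epsilon, lambda_search, lambda_):
    """Resolve the -1.0 placeholder for objective_epsilon as described in the text."""
    if objective_epsilon != -1.0:
        return objective_epsilon     # an explicit value is used as given
    if lambda_search:
        return 1e-4
    # No lambda search: the tighter tolerance applies only when lambda is zero.
    return 1e-6 if lambda_ == 0 else 1e-4


print(resolve_objective_epsilon(-1.0, lambda_search=False, lambda_=0))
```
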
property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_glm = H2OGeneralizedLinearEstimator(offset_column="offset",
...                                            seed=1234)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse(valid=True)
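
One way to picture an offset is as a term that enters the linear predictor with a fixed coefficient of 1. The sketch below assumes a log link (so the inverse link is exp) and uses hypothetical names; it is not H2O's scoring code.

```python
import math


def predicted_mean(x, beta, intercept, offset=0.0, inverse_link=math.exp):
    """Compute the mean for one row: inverse_link(intercept + x . beta + offset)."""
    eta = intercept + sum(b * xi for b, xi in zip(beta, x)) + offset
    return inverse_link(eta)


base = predicted_mean([2.0], beta=[0.5], intercept=0.1)
# With a log link, an offset of log(10) scales the predicted mean by 10.
scaled = predicted_mean([2.0], beta=[0.5], intercept=0.1, offset=math.log(10.0))
print(scaled / base)
```

This is why, under a log link, the logarithm of an exposure (for example, time at risk) is a typical offset.
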
property plug_values

Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)

Type: Union[None, str, H2OFrame].

Examples

>>> from h2o.expr import ExprNode
>>> from h2o.frame import H2OFrame
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars = cars.drop(0)
>>> means = cars.mean()
>>> means = H2OFrame._expr(ExprNode("mean", cars, True, 0))
>>> glm_means = H2OGeneralizedLinearEstimator(seed=42)
>>> glm_means.train(training_frame=cars, y="cylinders")
>>> glm_plugs1 = H2OGeneralizedLinearEstimator(seed=42,
...                                            missing_values_handling="PlugValues",
...                                            plug_values=means)
>>> glm_plugs1.train(training_frame=cars, y="cylinders")
>>> glm_means.coef() == glm_plugs1.coef()
>>> not_means = 0.1 + (means * 0.5)
>>> glm_plugs2 = H2OGeneralizedLinearEstimator(seed=42,
...                                            missing_values_handling="PlugValues",
...                                            plug_values=not_means)
>>> glm_plugs2.train(training_frame=cars, y="cylinders")
>>> glm_means.coef() != glm_plugs2.coef()
property prior

Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.

Type: float, defaults to -1.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_glm1 = H2OGeneralizedLinearEstimator(family='binomial', prior=0.5)
>>> cars_glm1.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> cars_glm1.mse()
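
When the training data has been sampled so that the mean response no longer matches the real-world rate, a standard remedy for logistic regression is to shift the intercept. The sketch below shows that textbook prior correction; whether H2O applies exactly this formula is an assumption, and the function names are hypothetical.

```python
import math


def corrected_intercept(intercept, sample_mean, prior):
    """Shift a logistic intercept from the sampled base rate to the true prior."""
    adjustment = math.log(((1.0 - prior) / prior) * (sample_mean / (1.0 - sample_mean)))
    return intercept - adjustment


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# A model fitted on balanced data (mean response 0.5) with intercept 0.0,
# while the real-world rate of y == 1 is 10%.
b0 = corrected_intercept(0.0, sample_mean=0.5, prior=0.1)
print(sigmoid(b0))
```

When the prior equals the sample mean, the adjustment is zero and the intercept is unchanged.
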
property rand_family

Random Component Family array. One for each random component. Only gaussian is supported for now.

Type: List[Literal["[gaussian]"]].

property rand_link

Link function array for random component in HGLM.

Type: List[Literal["[identity]", "[family_default]"]].

property random_columns

Random column indices for HGLM.

Type: List[int].

property remove_collinear_columns

In case of linearly dependent columns, remove some of the dependent columns

Type: bool, defaults to False.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8])
>>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                              lambda_=0,
...                                              remove_collinear_columns=True)
>>> airlines_glm.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_glm.auc()
property response_column

Response variable column.

Type: str.

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(score_each_iteration=True,
...                                          seed=1234,
...                                          family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.scoring_history()
property score_iteration_interval

Perform scoring for every score_iteration_interval iterations

Type: int, defaults to -1.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> glm_w_seed = H2OGeneralizedLinearEstimator(family='binomial',
...                                            seed=1234)
>>> glm_w_seed.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(glm_w_seed.auc(valid=True))
property solver

AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty; L_BFGS scales better for datasets with many columns.

Type: Literal["auto", "irlsm", "l_bfgs", "coordinate_descent_naive", "coordinate_descent", "gradient_descent_lh", "gradient_descent_sqerr"], defaults to "auto".

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(solver='irlsm')
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(boston_glm.mse(valid=True))
property standardize

Standardize numeric columns to have zero mean and unit variance

Type: bool, defaults to True.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_glm = H2OGeneralizedLinearEstimator(standardize=True)
>>> boston_glm.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_glm.mse()
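
Standardization maps each numeric column to zero mean and unit variance. A minimal sketch in plain Python follows; the use of the sample standard deviation is an assumption of this example, not a statement about H2O's internals.

```python
import statistics


def standardize(values):
    """Return values shifted to zero mean and scaled to unit variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```
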
property startval

Double array used to initialize fixed and random coefficients for HGLM, or coefficients for GLM.

Type: List[float].

property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.

property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.
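
The interaction between stopping_rounds and stopping_tolerance can be sketched as follows. This is one simplified reading of the rule described above, for a metric where lower is better; the function name should_stop and the exact window comparison are assumptions, not H2O's implementation.

```python
def should_stop(history, stopping_rounds, stopping_tolerance=0.001):
    """Decide whether to stop, given metric values where lower is better.

    history holds one metric value per scoring event, oldest first.
    """
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or too few scoring events to compare
    recent = history[-2 * k:]
    # Simple moving averages of length k; averages[0] is the oldest window.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]
    best_recent = min(averages[1:])
    # Stop when the relative improvement over the reference is below tolerance.
    return (reference - best_recent) < stopping_tolerance * abs(reference)


improving = [0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
flat = [0.30, 0.30, 0.30, 0.30, 0.30, 0.30]
print(should_stop(improving, stopping_rounds=3))
print(should_stop(flat, stopping_rounds=3))
```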

property theta

Theta value for use with the negative binomial family.

Type: float, defaults to 1e-10.

Examples

>>> h2o_df = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/glm_test/Motor_insurance_sweden.txt")
>>> predictors = ["Payment", "Insured", "Kilometres", "Zone", "Bonus", "Make"]
>>> response = "Claims"
>>> negativebinomial_fit = H2OGeneralizedLinearEstimator(family="negativebinomial",
...                                                      link="identity",
...                                                      theta=0.5)
>>> negativebinomial_fit.train(x=predictors,
...                            y=response,
...                            training_frame=h2o_df)
>>> negativebinomial_fit.scoring_history()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234,
...                                          family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.auc(train=True)

property tweedie_link_power

Tweedie link power

Type: float, defaults to 1.0.

Examples

>>> auto = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/auto.csv")
>>> predictors = auto.names
>>> predictors.remove('y')
>>> response = "y"
>>> train, valid = auto.split_frame(ratios=[.8])
>>> auto_glm = H2OGeneralizedLinearEstimator(family='tweedie',
...                                          tweedie_link_power=1)
>>> auto_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print(auto_glm.mse(valid=True))
property tweedie_variance_power

Tweedie variance power

Type: float, defaults to 0.0.

Examples

>>> auto = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/auto.csv")
>>> predictors = auto.names
>>> predictors.remove('y')
>>> response = "y"
>>> train, valid = auto.split_frame(ratios=[.8])
>>> auto_glm = H2OGeneralizedLinearEstimator(family='tweedie',
...                                          tweedie_variance_power=1)
>>> auto_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print(auto_glm.mse(valid=True))
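
The two Tweedie powers play different roles: the variance power controls how the variance grows with the mean, and the link power selects a power link. The sketch below uses the standard textbook definitions; the function names are hypothetical, and H2O's internal parameterization may differ.

```python
import math


def tweedie_variance(mu, variance_power):
    """Variance function V(mu) = mu ** p."""
    return mu ** variance_power


def tweedie_link(mu, link_power):
    """Power link g(mu) = mu ** q, with the log link as the q = 0 case."""
    return math.log(mu) if link_power == 0 else mu ** link_power


# p = 0, 1, 2 give the Gaussian, Poisson and gamma variance functions.
print([tweedie_variance(3.0, p) for p in (0, 1, 2)])
# q = 1 is the identity link; q = 0 is the log link.
print(tweedie_link(4.0, 1), tweedie_link(1.0, 0))
```
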
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234,
...                                          family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_glm.auc(valid=True)
property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234,
...                                          family='binomial')
>>> cars_glm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid,
...                weights_column="weight")
>>> cars_glm.auc(valid=True)
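
The equivalence between weights and repeated rows can be checked on a toy squared-error loss. This is a plain-Python sketch with made-up numbers, not H2O code.

```python
def weighted_squared_error(responses, weights, prediction):
    """Sum of squared errors with a per-row weight as the loss pre-factor."""
    return sum(w * (y - prediction) ** 2 for y, w in zip(responses, weights))


responses = [1.0, 2.0, 4.0]
weights = [1.0, 2.0, 0.0]   # repeat the second row, drop the third

# The same data with the weights "expanded" into physical rows.
expanded = [1.0, 2.0, 2.0]

weighted = weighted_squared_error(responses, weights, prediction=1.5)
unweighted = weighted_squared_error(expanded, [1.0] * len(expanded), prediction=1.5)
print(weighted, unweighted)
```

Both losses agree, which is why a weight of 2 behaves like a duplicated row and a weight of 0 like a removed one.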

H2OInfogram

class h2o.estimators.infogram.H2OInfogram(model_id=None, training_frame=None, validation_frame=None, seed=-1, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, nfolds=0, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, offset_column=None, weights_column=None, standardize=False, distribution='auto', plug_values=None, max_iterations=0, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_runtime_secs=0.0, custom_metric_func=None, auc_type='auto', algorithm='auto', algorithm_params=None, protected_columns=None, total_information_threshold=-1.0, net_information_threshold=-1.0, relevance_index_threshold=-1.0, safety_index_threshold=-1.0, data_fraction=1.0, top_n_features=50)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Information Diagram

The infogram is a graphical information-theoretic interpretability tool which allows the user to quickly spot the core, decision-making variables that uniquely and safely drive the response, in supervised classification problems. The infogram can significantly cut down the number of predictors needed to build a model by identifying only the most valuable, admissible features. When protected variables such as race or gender are present in the data, the admissibility of a variable is determined by a safety and relevancy index, and thus serves as a diagnostic tool for fairness. The safety of each feature can be quantified and variables that are unsafe will be considered inadmissible. Models built using only admissible features will naturally be more interpretable, given the reduced feature set. Admissible models are also less susceptible to overfitting and train faster, while providing similar accuracy as models built using all available features.

property algorithm

Type of machine learning algorithm used to build the infogram. Options include ‘AUTO’ (gbm), ‘deeplearning’ (Deep Learning with default parameters), ‘drf’ (Random Forest with default parameters), ‘gbm’ (GBM with default parameters), ‘glm’ (GLM with default parameters), or ‘xgboost’ (if available, XGBoost with default parameters).

Type: Literal["auto", "deeplearning", "drf", "gbm", "glm", "xgboost"], defaults to "auto".

property algorithm_params

Customized parameters for the machine learning algorithm specified in the algorithm parameter.

Type: dict.

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property data_fraction

The fraction of training frame to use to build the infogram model. Defaults to 1.0, and any value greater than 0 and less than or equal to 1.0 is acceptable.

Type: float, defaults to 1.0.

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

get_admissible_cmi()[source]
Returns

a list of the normalized CMI of admissible attributes

get_admissible_cmi_raw()[source]
Returns

a list of raw cmi of admissible attributes

get_admissible_features()[source]
Returns

a list of predictors that are considered admissible

get_admissible_relevance()[source]
Returns

a list of relevance (variable importance) for admissible attributes

get_admissible_score_frame(valid=False, xval=False)[source]

Retrieve the admissible score frame, which includes relevance and CMI information, as an H2OFrame. By default, the frame for the training dataset is returned.

Parameters
  • valid – return infogram info on validation dataset if True

  • xval – return infogram info on cross-validation hold outs if True

Returns

H2OFrame

get_all_predictor_cmi()[source]

Get normalized CMI of all predictors.

Returns

two tuples: the first holds the predictor names and the second holds the CMI values

get_all_predictor_cmi_raw()[source]

Get raw CMI of all predictors.

Returns

two tuples: the first holds the predictor names and the second holds the CMI values

get_all_predictor_relevance()[source]

Get relevance of all predictors.

Returns

two tuples: the first holds the predictor names and the second holds the relevance values

property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

property max_iterations

Maximum number of iterations.

Type: int, defaults to 0.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

property net_information_threshold

A number between 0 and 1 representing a threshold for net information, defaulting to 0.1. For a specific feature, if the net information is higher than this threshold, and the corresponding total information is also higher than the total_information_threshold, that feature will be considered admissible. The net information is the y-axis of the Core Infogram. Default is -1 which gets set to 0.1.

Type: float, defaults to -1.0.
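
The admissibility rule for the Core Infogram can be sketched as a simple filter. The scores below are invented for illustration, the function names are hypothetical, and treating -1 as 0.1 for total_information_threshold as well is an assumption carried over from the text above.

```python
def resolve_threshold(value, fallback=0.1):
    """Map the -1 placeholder to the documented default of 0.1."""
    return fallback if value == -1.0 else value


def admissible(features, total_information_threshold=-1.0,
               net_information_threshold=-1.0):
    """features maps a name to (total_information, net_information)."""
    total_min = resolve_threshold(total_information_threshold)
    net_min = resolve_threshold(net_information_threshold)
    return [name for name, (total, net) in features.items()
            if total > total_min and net > net_min]


# Hypothetical scores, invented for illustration only.
scores = {"age": (0.9, 0.8), "zip": (0.5, 0.05), "id": (0.02, 0.6)}
print(admissible(scores))
```

A feature must clear both thresholds; the same pattern applies to relevance_index_threshold and safety_index_threshold when protected_columns is set.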

property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

plot(train=True, valid=False, xval=False, figsize=(10, 10), title='Infogram', legend_on=False, server=False)[source]

Plot the infogram. By default, it will plot the infogram calculated from training dataset. Note that the frame rel_cmi_frame contains the following columns:

  • 0: predictor names

  • 1: admissible

  • 2: admissible index

  • 3: relevance-index or total information

  • 4: safety-index or net information, normalized from 0 to 1

  • 5: safety-index or net information not normalized

Parameters
  • train – True if infogram is generated from training dataset

  • valid – True if infogram is generated from validation dataset

  • xval – True if infogram is generated from cross-validation holdout dataset

  • figsize – size of infogram plot

  • title – string to denote title of the plot

  • legend_on – legend text is included if True

  • server – True will not generate plot, False will produce plot

Returns

infogram plot if server=True or None if server=False

property plug_values

Plug Values (a single row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues).

Type: Union[None, str, H2OFrame].

property protected_columns

Columns that contain features that are sensitive and need to be protected (legally, or otherwise), if applicable. These features (e.g. race, gender, etc) should not drive the prediction of the response.

Type: List[str].

property relevance_index_threshold

A number between 0 and 1 representing a threshold for the relevance index, defaulting to 0.1. This is only used when protected_columns is set by the user. For a specific feature, if the relevance index value is higher than this threshold, and the corresponding safety index is also higher than the safety_index_threshold, that feature will be considered admissible. The relevance index is the x-axis of the Fair Infogram. Default is -1 which gets set to 0.1.

Type: float, defaults to -1.0.

property response_column

Response variable column.

Type: str.

property safety_index_threshold

A number between 0 and 1 representing a threshold for the safety index, defaulting to 0.1. This is only used when protected_columns is set by the user. For a specific feature, if the safety index value is higher than this threshold, and the corresponding relevance index is also higher than the relevance_index_threshold, that feature will be considered admissible. The safety index is the y-axis of the Fair Infogram. Default is -1 which gets set to 0.1.

Type: float, defaults to -1.0.
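
The admissibility rule described for relevance_index_threshold and safety_index_threshold can be sketched in plain Python. This is only an illustration of the documented rule, not H2O's implementation; the helper names and the sample rows are hypothetical.

```python
def resolve_threshold(value, default=0.1):
    """Map the sentinel -1 to the documented default of 0.1."""
    return default if value == -1 else value


def is_admissible(relevance, safety, relevance_threshold=-1.0, safety_threshold=-1.0):
    """A feature is admissible only if both indices exceed their thresholds."""
    r_min = resolve_threshold(relevance_threshold)
    s_min = resolve_threshold(safety_threshold)
    return relevance > r_min and safety > s_min


# Hypothetical (feature, relevance index, safety index) rows.
rows = [("age", 0.80, 0.60), ("zip_code", 0.90, 0.05), ("noise", 0.02, 0.70)]
admissible = [name for name, rel, saf in rows if is_admissible(rel, saf)]
print(admissible)
```

A feature that is highly relevant but unsafe (zip_code here) is rejected just like one that is safe but irrelevant.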

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

property seed

Seed for pseudo random number generator (if applicable).

Type: int, defaults to -1.

property standardize

Standardize numeric columns to have zero mean and unit variance.

Type: bool, defaults to False.

property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.

property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.
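
How stopping_rounds and stopping_tolerance interact can be illustrated with a simplified moving-average check. The sketch below assumes a metric where lower is better and compares the best of the last k moving averages against the moving average from k scoring events earlier. It is not H2O's exact scoring logic, and should_stop is a hypothetical name.

```python
from statistics import mean


def should_stop(history, stopping_rounds, stopping_tolerance=0.001):
    """Return True when the k-length moving average has stopped improving.

    history holds one metric value per scoring event (lower is better).
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Moving average of length k ending at each scoring event.
    averages = [mean(history[i - k + 1 : i + 1]) for i in range(k - 1, len(history))]
    reference = averages[-(k + 1)]   # moving average from k events ago
    best_recent = min(averages[-k:])  # best of the last k moving averages
    return (reference - best_recent) < stopping_tolerance * abs(reference)


improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
plateau = [1.0, 0.5, 0.5, 0.5, 0.5, 0.5]
print(should_stop(improving, stopping_rounds=2))
print(should_stop(plateau, stopping_rounds=2))
```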

property top_n_features

An integer specifying the number of columns to evaluate in the infogram. The columns are ranked by variable importance, and the top N are evaluated. Defaults to 50.

Type: int, defaults to 50.

property total_information_threshold

A number between 0 and 1 representing a threshold for total information, defaulting to 0.1. For a specific feature, if the total information is higher than this threshold, and the corresponding net information is also higher than the threshold net_information_threshold, that feature will be considered admissible. The total information is the x-axis of the Core Infogram. Default is -1 which gets set to 0.1.

Type: float, defaults to -1.0.

train(x=None, y=None, training_frame=None, verbose=False, **kwargs)[source]

Train the H2O model.

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

  • max_runtime_secs (float) – Maximum allowed runtime in seconds for model training. Use 0 to disable.

  • verbose (bool) – Print scoring history to stdout. Defaults to False.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.
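
The equivalence between a weight of 2 and a repeated row, and between a weight of 0 and an excluded row, can be checked on a weighted mean. This is a minimal sketch with made-up numbers; weighted_mean is a hypothetical helper, not part of H2O.

```python
def weighted_mean(values, weights):
    """Mean of values where each row counts `weight` times."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


values = [1.0, 2.0, 4.0]
weights = [1, 2, 0]            # second row counted twice, third row excluded
replicated = [1.0, 2.0, 2.0]   # the same data written out row by row

by_weight = weighted_mean(values, weights)
by_replication = sum(replicated) / len(replicated)
print(by_weight, by_replication)
```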

H2OModelSelectionEstimator

class h2o.estimators.model_selection.H2OModelSelectionEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, seed=-1, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, score_iteration_interval=0, offset_column=None, weights_column=None, family='auto', link='family_default', tweedie_variance_power=0.0, tweedie_link_power=0.0, theta=0.0, solver='irlsm', alpha=None, lambda_=None, lambda_search=False, early_stopping=False, nlambdas=0, standardize=True, missing_values_handling='mean_imputation', plug_values=None, compute_p_values=False, remove_collinear_columns=False, intercept=False, non_negative=False, max_iterations=0, objective_epsilon=-1.0, beta_epsilon=0.0001, gradient_epsilon=-1.0, startval=None, prior=0.0, cold_start=False, lambda_min_ratio=0.0, beta_constraints=None, max_active_predictors=-1, obj_reg=-1.0, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, max_runtime_secs=0.0, custom_metric_func=None, nparallelism=0, max_predictor_number=1, min_predictor_number=1, mode='maxr', p_values_threshold=0.0)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Model Selection

H2O ModelSelection is used to build the best model with one predictor, two predictors, … up to max_predictor_number specified in the algorithm parameters when mode=allsubsets. The best model is the one with the highest R2 value. When mode=maxr, the model returned is no longer guaranteed to have the best R2 value.
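
To make the allsubsets idea concrete, the following brute-force sketch fits an ordinary least-squares model for every predictor subset of each size and keeps the one with the highest R2. It assumes plain OLS with an intercept on a tiny made-up dataset; H2O builds GLMs and does this far more efficiently. All function names here are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R2 of a least-squares fit of y on the given columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(data, y, max_predictor_number):
    """For each subset size, return (best R2, predictor names)."""
    best = []
    for size in range(1, max_predictor_number + 1):
        scored = [(r_squared([data[n] for n in names], y), names)
                  for names in combinations(sorted(data), size)]
        best.append(max(scored))
    return best


# Made-up data in which y = 2 * x1 + x3 exactly.
data = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
        "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]}
y = [4.0, 11.0, 7.0, 16.0, 12.0, 20.0]
best = best_subsets(data, y, max_predictor_number=2)
for r2, names in best:
    print(names, round(r2, 4))
```

The number of candidate subsets grows combinatorially with the number of predictors, which is why the exhaustive search is costly compared with a sequential search.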

property alpha

Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.

Type: List[float].
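
The mixing role of alpha can be sketched with the conventional elastic-net penalty, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared), applied to the non-intercept coefficients. The text above does not spell out the formula, so treat this form and the function name as assumptions.

```python
def elastic_net_penalty(coefficients, alpha, lambda_):
    """Conventional elastic-net penalty for a list of non-intercept coefficients."""
    l1 = sum(abs(b) for b in coefficients)       # Lasso term
    l2_squared = sum(b * b for b in coefficients)  # Ridge term
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


coefficients = [1.0, -2.0]
lasso = elastic_net_penalty(coefficients, alpha=1.0, lambda_=0.1)
ridge = elastic_net_penalty(coefficients, alpha=0.0, lambda_=0.1)
mixed = elastic_net_penalty(coefficients, alpha=0.5, lambda_=0.1)
print(lasso, ridge, mixed)
```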

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

property beta_constraints

Beta constraints

Type: Union[None, str, H2OFrame].

property beta_epsilon

Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to the IRLSM solver.

Type: float, defaults to 0.0001.

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

coef(predictor_size=None)[source]

Get the coefficients for all models built with different number of predictors.

Parameters

predictor_size – predictor subset size, will only return model coefficients of that subset size.

Returns

list of Python Dicts of coefficients for all models built with different predictor numbers

coef_norm(predictor_size=None)[source]

Get the normalized coefficients for all models built with different number of predictors.

Parameters

predictor_size – predictor subset size, will only return model coefficients of that subset size.

Returns

list of Python Dicts of coefficients for all models built with different predictor numbers

property cold_start

Only applicable to multiple alpha/lambda values. If false, build the next model for next set of alpha/lambda values starting from the values provided by current model. If true will start GLM model from scratch.

Type: bool, defaults to False.

property compute_p_values

Request p-values computation, p-values work only with IRLSM solver and no regularization

Type: bool, defaults to False.

property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property early_stopping

Stop early when there is no more relative improvement on train or validation (if provided)

Type: bool, defaults to False.

property family

Family. For MaxR, only gaussian. For backward, ordinal and multinomial families are not supported

Type: Literal["auto", "gaussian", "binomial", "fractionalbinomial", "quasibinomial", "poisson", "gamma", "tweedie", "negativebinomial"], defaults to "auto".

property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

get_best_R2_values()[source]

Get list of best R2 values of models with 1 predictor, 2 predictors, …, max_predictor_number of predictors

Returns

a list of best r2 values

get_best_model_predictors()[source]

Get list of best models with 1 predictor, 2 predictors, …, max_predictor_number of predictors that have the highest r2 values

Returns

a list of the predictor names used by the best models

property gradient_epsilon

Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default (of -1.0) indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.

Type: float, defaults to -1.0.
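
One reading of the default rule for gradient_epsilon is written out below: the sentinel -1.0 is replaced by a value that depends on lambda_search and on whether lambda is zero. The function is a hypothetical illustration of the text above, not H2O code.

```python
def resolve_gradient_epsilon(gradient_epsilon, lambda_search, lambda_):
    """Replace the sentinel -1.0 with the documented conditional default."""
    if gradient_epsilon != -1.0:
        return gradient_epsilon  # the user supplied an explicit value
    unregularized = lambda_ == 0
    if lambda_search:
        return 1e-8 if unregularized else 1e-6
    return 1e-6 if unregularized else 1e-4


print(resolve_gradient_epsilon(-1.0, lambda_search=False, lambda_=0.0))
print(resolve_gradient_epsilon(-1.0, lambda_search=True, lambda_=0.5))
```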

property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property intercept

Include constant term in the model

Type: bool, defaults to False.

property lambda_

Regularization strength

Type: List[float].

property lambda_min_ratio

Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.

Type: float, defaults to 0.0.
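
The default rule for lambda_min_ratio can be sketched as follows. Two points are assumptions: that a value of 0.0 means "use the default", and that the tie case (observations equal to variables, which the text does not cover) takes the larger ratio. The lambda_max value is made up.

```python
def resolve_lambda_min_ratio(lambda_min_ratio, n_obs, n_vars):
    """Pick the documented default ratio when none is supplied."""
    if lambda_min_ratio > 0.0:
        return lambda_min_ratio
    # More observations than variables allows a smaller minimum lambda.
    return 1e-4 if n_obs > n_vars else 1e-2


lambda_max = 2.5  # hypothetical value; in practice it is derived from the data
ratio = resolve_lambda_min_ratio(0.0, n_obs=1000, n_vars=20)
lambda_min = lambda_max * ratio
print(ratio, lambda_min)
```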

property lambda_search

Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.

Type: bool, defaults to False.

property link

Link function.

Type: Literal["family_default", "identity", "logit", "log", "inverse", "tweedie", "ologit"], defaults to "family_default".

property max_active_predictors

Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.

Type: int, defaults to -1.

property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_iterations

Maximum number of iterations

Type: int, defaults to 0.

property max_predictor_number

Maximum number of predictors to be considered when building GLM models. Defaults to 1.

Type: int, defaults to 1.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

property min_predictor_number

For mode = ‘backward’ only. Minimum number of predictors to be considered when building GLM models starting with all predictors to be included. Defaults to 1.

Type: int, defaults to 1.

property missing_values_handling

Handling of missing values. Either MeanImputation, Skip or PlugValues.

Type: Literal["mean_imputation", "skip", "plug_values"], defaults to "mean_imputation".

property mode

Mode: Used to choose model selection algorithms to use. Options include ‘allsubsets’ for all subsets, ‘maxr’ for MaxR, ‘backward’ for backward selection

Type: Literal["allsubsets", "maxr", "backward"], defaults to "maxr".

property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

property nlambdas

Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.

Type: int, defaults to 0.

property non_negative

Restrict coefficients (not intercept) to be non-negative

Type: bool, defaults to False.

property nparallelism

Number of models to build in parallel. Defaults to 0, which adapts to the system capability.

Type: int, defaults to 0.

property obj_reg

Likelihood divider in objective value computation, default (of -1.0) will set it to 1/nobs

Type: float, defaults to -1.0.

property objective_epsilon

Converge if objective value changes less than this. Default (of -1.0) indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.

Type: float, defaults to -1.0.

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

property p_values_threshold

For mode=’backward’ only. If specified, will stop the model building process when all coefficients' p-values drop below this threshold.

Type: float, defaults to 0.0.

property plug_values

Plug Values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues).

Type: Union[None, str, H2OFrame].

property prior

Prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.

Type: float, defaults to 0.0.

property remove_collinear_columns

In case of linearly dependent columns, remove some of the dependent columns

Type: bool, defaults to False.

property response_column

Response variable column.

Type: str.

result()[source]

Get the result frame that contains information about the model building process, as produced by ModelSelection and ANOVAGLM.

Returns

the H2OFrame that contains information about the model building process

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

property score_iteration_interval

Perform scoring for every score_iteration_interval iterations

Type: int, defaults to 0.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

property solver

AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.

Type: Literal["auto", "irlsm", "l_bfgs", "coordinate_descent_naive", "coordinate_descent", "gradient_descent_lh", "gradient_descent_sqerr"], defaults to "irlsm".

property standardize

Standardize numeric columns to have zero mean and unit variance

Type: bool, defaults to True.

property startval

double array to initialize fixed and random coefficients for HGLM, coefficients for GLM.

Type: List[float].

property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.

property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.

property theta

Theta

Type: float, defaults to 0.0.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property tweedie_link_power

Tweedie link power

Type: float, defaults to 0.0.

property tweedie_variance_power

Tweedie variance power

Type: float, defaults to 0.0.

property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

H2ONaiveBayesEstimator

class h2o.estimators.naive_bayes.H2ONaiveBayesEstimator(model_id=None, nfolds=0, seed=-1, fold_assignment='auto', fold_column=None, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, training_frame=None, validation_frame=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, laplace=0.0, min_sdev=0.001, eps_sdev=0.0, min_prob=0.001, eps_prob=0.0, compute_metrics=True, max_runtime_secs=0.0, export_checkpoints_dir=None, gainslift_bins=-1, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Naive Bayes

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
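
The description above can be made concrete with a small Gaussian naive Bayes sketch in plain Python. It skips training rows that contain a missing value (None) and omits missing predictors when scoring. It covers numeric predictors only and uses the sample standard deviation, which is an assumption; the function names and data are hypothetical and this is not H2O's implementation.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Estimate class priors and per-feature (mean, sd) from complete rows."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        if any(value is None for value in row):
            continue  # a training row with any missing value is skipped entirely
        grouped[label].append(row)
    total = sum(len(group) for group in grouped.values())
    model = {}
    for label, group in grouped.items():
        stats = []
        for column in zip(*group):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / (len(column) - 1)
            stats.append((mu, math.sqrt(var)))
        model[label] = (len(group) / total, stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior, ignoring missing values."""
    scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for value, (mu, sd) in zip(row, stats):
            if value is None:
                continue  # a missing predictor is left out of the product
            log_p += -math.log(sd * math.sqrt(2 * math.pi)) - (value - mu) ** 2 / (2 * sd ** 2)
        scores[label] = log_p
    return max(scores, key=scores.get)


rows = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0],
        [5.0, 20.0], [5.5, 21.0], [4.5, 19.0],
        [None, 15.0]]
labels = ["a", "a", "a", "b", "b", "b", "a"]
model = fit(rows, labels)
print(predict(model, [1.1, None]), predict(model, [None, 19.5]))
```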

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_nb = H2ONaiveBayesEstimator(balance_classes=False,
...                                  nfolds=3,
...                                  seed=1234)
>>> iris_nb.train(x=list(range(4)),
...               y=4,
...               training_frame=iris)
>>> iris_nb.mse()
property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cov_nb = H2ONaiveBayesEstimator(class_sampling_factors=sample_factors,
...                                 seed=1234)
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> cov_nb.train(x=predictors, y=response, training_frame=covtype)
>>> cov_nb.logloss()
property compute_metrics

Compute metrics on training data

Type: bool, defaults to True.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> response_col = 'CAPSULE'
>>> prostate_nb = H2ONaiveBayesEstimator(laplace=0,
...                                      compute_metrics=False)
>>> prostate_nb.train(x=list(range(3,9)),
...                   y=response_col,
...                   training_frame=prostate)
>>> prostate_nb.show()
property eps_prob

Cutoff below which probability is replaced with min_prob

Type: float, defaults to 0.0.

Examples

>>> import random
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> problem = random.choice(["binomial","multinomial"])
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> if problem == "binomial":
...     response_col = "economy_20mpg"
... else:
...     response_col = "cylinders"
>>> cars[response_col] = cars[response_col].asfactor()
>>> cars_nb = H2ONaiveBayesEstimator(min_prob=0.1,
...                                  eps_prob=0.5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors, y=response_col, training_frame=cars)
>>> cars_nb.mse()
property eps_sdev

Cutoff below which standard deviation is replaced with min_sdev

Type: float, defaults to 0.0.

Examples

>>> import random
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> problem = random.choice(["binomial","multinomial"])
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> if problem == "binomial":
...     response_col = "economy_20mpg"
... else:
...     response_col = "cylinders"
>>> cars[response_col] = cars[response_col].asfactor()
>>> cars_nb = H2ONaiveBayesEstimator(min_sdev=0.1,
...                                  eps_sdev=0.5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors, y=response_col, training_frame=cars)
>>> cars_nb.mse()
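
The interplay of eps_sdev and min_sdev can be sketched as a simple replacement rule; the same pattern applies to eps_prob and min_prob. The text says "below", so this sketch uses a strict comparison, which is an assumption about the boundary case. apply_floor is a hypothetical helper.

```python
def apply_floor(value, cutoff, replacement):
    """Replace a value that falls below the cutoff with the replacement."""
    return replacement if value < cutoff else value


# Settings from the example above: min_sdev=0.1, eps_sdev=0.5.
estimated_sdevs = [0.3, 0.8, 0.05]
adjusted = [apply_floor(sd, cutoff=0.5, replacement=0.1) for sd in estimated_sdevs]
print(adjusted)
```

Note that with these settings a standard deviation of 0.3 is replaced by the smaller value 0.1, because the cutoff is higher than the replacement.
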
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> response = "IsDepDelayed"
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_nb = H2ONaiveBayesEstimator(export_checkpoints_dir=checkpoints_dir)
>>> air_nb.train(x=predictors, y=response, training_frame=airlines)
>>> len(listdir(checkpoints_dir))
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_nb = H2ONaiveBayesEstimator(fold_assignment="Random",
...                                  nfolds=5,
...                                  seed=1234)
>>> response = "economy_20mpg"
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> cars_nb.train(x=predictors, y=response, training_frame=cars)
>>> cars_nb.auc()
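
Two of the fold assignment schemes can be illustrated in a few lines: modulo assigns row i to fold i % nfolds, and random draws a fold per row from a seeded generator. This is a sketch of the idea only. The function names are hypothetical, and the stratified scheme, which also needs the response values, is not shown.

```python
import random


def modulo_folds(n_rows, nfolds):
    """Deterministic assignment: row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


def random_folds(n_rows, nfolds, seed):
    """Seeded random assignment of each row to one of nfolds folds."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


print(modulo_folds(7, 3))
print(random_folds(7, 3, seed=1234))
```
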
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> cars_nb = H2ONaiveBayesEstimator(seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=cars,
...               fold_column="fold_numbers")
>>> cars_nb.auc()
property gainslift_bins

Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.

Type: int, defaults to -1.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2ONaiveBayesEstimator(gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.gains_lift()
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(seed=1234,
...                                  ignore_const_cols=True)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_nb.auc()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_fold_assignment=True,
...                                  nfolds=5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train)
>>> cars_nb.cross_validation_fold_assignment()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_models=True,
...                                  nfolds=5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train)
>>> cars_nb.cross_validation_models()
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_predictions=True,
...                                  nfolds=5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train)
>>> cars_nb.cross_validation_predictions()
property laplace

Laplace smoothing parameter

Type: float, defaults to 0.0.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> response_col = 'CAPSULE'
>>> prostate_nb = H2ONaiveBayesEstimator(laplace=1)
>>> prostate_nb.train(x=list(range(3,9)),
...                   y=response_col,
...                   training_frame=prostate)
>>> prostate_nb.mse()
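
The effect of the laplace parameter on a categorical predictor can be shown with textbook additive smoothing, where the constant is added to every level count before normalizing. The text above does not give H2O's exact formula, so this is an assumption; smoothed_conditional is a hypothetical helper.

```python
from collections import Counter


def smoothed_conditional(values, levels, laplace=0.0):
    """Estimate P(level | class) from one class's observed values."""
    counts = Counter(values)
    total = len(values) + laplace * len(levels)
    return {level: (counts[level] + laplace) / total for level in levels}


observed = ["x", "x", "y"]   # values seen for one class
levels = ["x", "y", "z"]     # "z" never occurs in this class
print(smoothed_conditional(observed, levels, laplace=0.0))
print(smoothed_conditional(observed, levels, laplace=1.0))
```

Without smoothing, the unseen level "z" gets probability zero and wipes out the whole product for that class; with laplace=1 it keeps a small non-zero probability.
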
property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> max_size = .85
>>> cov_nb = H2ONaiveBayesEstimator(max_after_balance_size=max_size,
...                                 seed=1234)
>>> cov_nb.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_nb.logloss()
property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(max_runtime_secs=10,
...                                  seed=1234) 
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_nb.auc()
property min_prob

Min. probability to use for observations with not enough data

Type: float, defaults to 0.001.

Examples

>>> import random
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> problem = random.choice(["binomial","multinomial"])
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> if problem == "binomial":
...     response_col = "economy_20mpg"
... else:
...     response_col = "cylinders"
>>> cars[response_col] = cars[response_col].asfactor()
>>> cars_nb = H2ONaiveBayesEstimator(min_prob=0.1,
...                                  eps_prob=0.5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response_col,
...               training_frame=cars)
>>> cars_nb.show()
property min_sdev

Minimum standard deviation to use for observations with insufficient data.

Type: float, defaults to 0.001.

Examples

>>> import random
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> problem = random.choice(["binomial","multinomial"])
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> if problem == "binomial":
...     response_col = "economy_20mpg"
... else:
...     response_col = "cylinders"
>>> cars[response_col] = cars[response_col].asfactor()
>>> cars_nb = H2ONaiveBayesEstimator(min_sdev=0.1,
...                                  eps_sdev=0.5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response_col,
...               training_frame=cars)
>>> cars_nb.show()
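
The flooring idea behind min_sdev can be illustrated without a running cluster. The sketch below is a standalone approximation, not H2O's internal code: the helper name floor_sdev is hypothetical, and the rule that a value at or below eps_sdev is replaced by min_sdev is an assumption made for illustration.

```python
import statistics


def floor_sdev(values, eps_sdev=0.0, min_sdev=0.001):
    """Return the sample standard deviation of `values`, floored at `min_sdev`.

    Hypothetical helper: when fewer than two values are available, or the
    estimate is at or below `eps_sdev`, `min_sdev` is returned instead.
    """
    if len(values) < 2:
        return min_sdev
    sdev = statistics.stdev(values)
    return min_sdev if sdev <= eps_sdev else sdev


# A single observation carries no spread information, so the floor applies.
print(floor_sdev([3.0]))
# A constant feature has zero spread, which is below the threshold.
print(floor_sdev([2.0, 2.0, 2.0], eps_sdev=0.5, min_sdev=0.1))
# A feature with enough spread keeps its own estimate.
print(floor_sdev([1.0, 2.0, 3.0], eps_sdev=0.5, min_sdev=0.1))
```

A floor of this kind keeps the Gaussian likelihood of a numeric feature from collapsing when a class has a single row or a constant column.
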
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_nb = H2ONaiveBayesEstimator(nfolds=5,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_nb.auc()
property response_column

Response variable column.

Type: str.

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator(score_each_iteration=True,
...                                  seed=1234)
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_nb.auc()
property seed

Seed for pseudo-random number generator (only used for cross-validation and fold_assignment="Random" or "AUTO")

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> nb_w_seed = H2ONaiveBayesEstimator(seed=1234)
>>> nb_w_seed.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> nb_wo_seed = H2ONaiveBayesEstimator()
>>> nb_wo_seed.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> nb_w_seed.auc()
>>> nb_wo_seed.auc()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator()
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_nb.auc()
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_nb = H2ONaiveBayesEstimator()
>>> cars_nb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_nb.auc()

H2OSupportVectorMachineEstimator

class h2o.estimators.psvm.H2OSupportVectorMachineEstimator(model_id=None, training_frame=None, validation_frame=None, response_column=None, ignored_columns=None, ignore_const_cols=True, hyper_param=1.0, kernel_type='gaussian', gamma=-1.0, rank_ratio=-1.0, positive_weight=1.0, negative_weight=1.0, disable_training_metrics=True, sv_threshold=0.0001, fact_threshold=1e-05, feasible_threshold=0.001, surrogate_gap_threshold=0.001, mu_factor=10.0, max_iterations=200, seed=-1)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

PSVM

property disable_training_metrics

Disable calculating training metrics (expensive on large datasets)

Type: bool, defaults to True.

Examples

>>> from h2o.estimators import H2OSupportVectorMachineEstimator
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
property fact_threshold

Convergence threshold of the Incomplete Cholesky Factorization (ICF)

Type: float, defaults to 1e-05.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False,
...                                        fact_threshold=1e-7)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
property feasible_threshold

Convergence threshold for primal-dual residuals in the IPM iteration

Type: float, defaults to 0.001.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False,
...                                        feasible_threshold=1e-7)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
property gamma

Coefficient of the kernel (currently RBF gamma for gaussian kernel, -1 means 1/#features)

Type: float, defaults to -1.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
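
To make the gamma convention concrete, here is a minimal standalone sketch of a Gaussian (RBF) kernel in which gamma=-1 is read as 1/#features, as described above. The function name rbf_kernel is hypothetical and the code is illustrative only; it does not reproduce H2O's implementation.

```python
import math


def rbf_kernel(x, y, gamma=-1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2).

    gamma == -1 is interpreted as 1 / (number of features).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if gamma == -1.0:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points always give a kernel value of 1.0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
# Two features, so the default gamma becomes 1/2.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))
# An explicit gamma, as in the example above.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.01))
```

Smaller gamma values make the kernel decay more slowly with distance, so distant rows still count as similar.
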
property hyper_param

Penalty parameter C of the error term

Type: float, defaults to 1.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        hyper_param=0.01,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        ignore_const_cols=False,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property kernel_type

Type of kernel to use

Type: Literal["gaussian"], defaults to "gaussian".

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        rank_ratio=0.1,
...                                        hyper_param=0.01,
...                                        kernel_type="gaussian",
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice) 
>>> svm.mse()
property max_iterations

Maximum number of iterations of the algorithm

Type: int, defaults to 200.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        rank_ratio=0.1,
...                                        hyper_param=0.01,
...                                        max_iterations=20,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)  
>>> svm.mse()
property mu_factor

Increasing factor mu

Type: float, defaults to 10.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        mu_factor=100.5,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice) 
>>> svm.mse()
property negative_weight

Weight of negative (-1) class of observations

Type: float, defaults to 1.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        rank_ratio=0.1,
...                                        negative_weight=10,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)  
>>> svm.mse()
property positive_weight

Weight of positive (+1) class of observations

Type: float, defaults to 1.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        rank_ratio=0.1,
...                                        positive_weight=0.1,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)   
>>> svm.mse()
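
One common way to think about positive_weight and negative_weight is as per-class multipliers on the misclassification penalty. The sketch below shows that idea with a weighted hinge loss; the function name weighted_hinge_loss is hypothetical, and whether H2O applies the weights in exactly this form is an assumption.

```python
def weighted_hinge_loss(y_true, decision_values,
                        positive_weight=1.0, negative_weight=1.0):
    """Sum of hinge losses, scaled per class.

    Labels are +1 or -1. Rows with label +1 are scaled by `positive_weight`,
    rows with label -1 by `negative_weight`.
    """
    total = 0.0
    for y, f in zip(y_true, decision_values):
        weight = positive_weight if y == 1 else negative_weight
        total += weight * max(0.0, 1.0 - y * f)
    return total


labels = [1, -1]
scores = [0.5, 0.5]  # hypothetical decision values
print(weighted_hinge_loss(labels, scores))
print(weighted_hinge_loss(labels, scores, positive_weight=2.0))
```

Raising the weight of one class makes margin violations on that class more expensive, which is useful when the classes are imbalanced.
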
property rank_ratio

Desired rank of the ICF matrix expressed as a ratio of the number of input rows (-1 means use sqrt(#rows)).

Type: float, defaults to -1.0.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.mse()
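
The mapping from rank_ratio to a target rank can be sketched as below. The helper icf_rank is hypothetical; the rounding rules and the lower bound of 1 are assumptions, and only the two cases (-1 and an explicit ratio) come from the description above.

```python
import math


def icf_rank(n_rows, rank_ratio=-1.0):
    """Target rank of the ICF matrix for a frame with `n_rows` rows.

    -1 selects sqrt(#rows); any other value is treated as a fraction of
    the row count. Rounding and the minimum of 1 are assumptions.
    """
    if rank_ratio == -1.0:
        return max(1, math.isqrt(n_rows))
    return max(1, round(n_rows * rank_ratio))


print(icf_rank(1000))        # default: integer square root of 1000
print(icf_rank(1000, 0.25))  # a quarter of the rows
```

A lower rank makes the factorization cheaper at the cost of a coarser approximation of the kernel matrix.
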
property response_column

Response variable column.

Type: str.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.1,
...                                        rank_ratio=0.1,
...                                        seed=1234,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice)
>>> svm.model_performance()
property surrogate_gap_threshold

Feasibility criterion of the surrogate duality gap (eta)

Type: float, defaults to 0.001.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        surrogate_gap_threshold=0.1,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice) 
>>> svm.mse()
property sv_threshold

Threshold for accepting a candidate observation into the set of support vectors

Type: float, defaults to 0.0001.

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> svm = H2OSupportVectorMachineEstimator(gamma=0.01,
...                                        rank_ratio=0.1,
...                                        sv_threshold=0.01,
...                                        disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=splice) 
>>> svm.mse()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> train, valid = splice.split_frame(ratios=[0.8])
>>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=train)
>>> svm.mse()
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm")
>>> train, valid = splice.split_frame(ratios=[0.8])
>>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False)
>>> svm.train(y="C1", training_frame=train, validation_frame=valid)
>>> svm.mse()

H2ORandomForestEstimator

class h2o.estimators.random_forest.H2ORandomForestEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, score_each_iteration=False, score_tree_interval=0, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, weights_column=None, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, ntrees=50, max_depth=20, min_rows=1.0, nbins=20, nbins_top_level=1024, nbins_cats=1024, r2_stopping=None, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, max_runtime_secs=0.0, seed=-1, build_tree_one_node=False, mtries=-1, sample_rate=0.632, sample_rate_per_class=None, binomial_double_trees=False, checkpoint=None, col_sample_rate_change_per_level=1.0, col_sample_rate_per_tree=1.0, min_split_improvement=1e-05, histogram_type='auto', categorical_encoding='auto', calibrate_model=False, calibration_frame=None, distribution='auto', custom_metric_func=None, export_checkpoints_dir=None, check_constant_response=True, gainslift_bins=-1, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Distributed Random Forest

Builds a Distributed Random Forest (DRF) on a parsed dataset, for regression or classification.

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property balance_classes

Balance training data class counts via over/under-sampling (for imbalanced data).

Type: bool, defaults to False.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_drf = H2ORandomForestEstimator(balance_classes=True,
...                                    seed=1234)
>>> cov_drf.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print('logloss', cov_drf.logloss(valid=True))
property binomial_double_trees

For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(binomial_double_trees=False,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print('without binomial_double_trees:',
...        cars_drf.auc(valid=True))
>>> cars_drf_2 = H2ORandomForestEstimator(binomial_double_trees=True,
...                                       seed=1234)
>>> cars_drf_2.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print('with binomial_double_trees:', cars_drf_2.auc(valid=True))
property build_tree_one_node

Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(build_tree_one_node=True,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.auc(valid=True)
property calibrate_model

Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.

Type: bool, defaults to False.

Examples

>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> ecology['Angaus'] = ecology['Angaus'].asfactor()
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> response = 'Angaus'
>>> predictors = ecology.columns[3:13]
>>> train, calib = ecology.split_frame(seed=12354)
>>> w = h2o.create_frame(binary_fraction=1,
...                      binary_ones_fraction=0.5,
...                      missing_fraction=0,
...                      rows=744, cols=1)
>>> w.set_names(["weight"])
>>> train = train.cbind(w)
>>> ecology_drf = H2ORandomForestEstimator(ntrees=10,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="multinomial",
...                                        weights_column="weight",
...                                        calibrate_model=True,
...                                        calibration_frame=calib)
>>> ecology_drf.train(x=predictors,
...                   y="Angaus",
...                   training_frame=train)
>>> predicted = ecology_drf.predict(calib)
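
Platt scaling fits a sigmoid that maps raw model scores to probabilities, using held-out rows such as the calibration frame. The sketch below is a minimal standalone version fitted by plain gradient descent on log loss. The function names, the toy data, and the optimizer are illustrative assumptions, not H2O's implementation.

```python
import math


def calibrate(score, a, b):
    """Map a raw score to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def fit_platt(scores, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = calibrate(s, a, b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical held-out scores and their true 0/1 labels.
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print([round(calibrate(s, a, b), 3) for s in scores])
```

The sigmoid is fitted on rows the model did not train on so that the calibrated probabilities are not biased by overfitting.
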
property calibration_frame

Calibration frame for Platt Scaling

Type: Union[None, str, H2OFrame].

Examples

>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv")
>>> ecology['Angaus'] = ecology['Angaus'].asfactor()
>>> response = 'Angaus'
>>> predictors = ecology.columns[3:13]
>>> train, calib = ecology.split_frame(seed = 12354)
>>> w = h2o.create_frame(binary_fraction=1,
...                      binary_ones_fraction=0.5,
...                      missing_fraction=0,
...                      rows=744, cols=1)
>>> w.set_names(["weight"])
>>> train = train.cbind(w)
>>> ecology_drf = H2ORandomForestEstimator(ntrees=10,
...                                        max_depth=5,
...                                        min_rows=10,
...                                        distribution="multinomial",
...                                        calibrate_model=True,
...                                        calibration_frame=calib)
>>> ecology_drf.train(x=predictors,
...                   y="Angaus",
...                   training_frame=train,
...                   weights_column="weight")
>>> predicted = ecology_drf.predict(train)
property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") 
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> encoding = "one_hot_explicit"
>>> airlines_drf = H2ORandomForestEstimator(categorical_encoding=encoding,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_drf.auc(valid=True)
property check_constant_response

Check if the response column is constant. If enabled, an exception is thrown when the response column is a constant value. If disabled, the model trains regardless of whether the response column is constant.

Type: bool, defaults to True.

Examples

>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv")
>>> train["constantCol"] = 1
>>> my_drf = H2ORandomForestEstimator(check_constant_response=False)
>>> my_drf.train(x=list(range(1,5)),
...              y="constantCol",
...              training_frame=train)
property checkpoint

Model checkpoint to resume training with.

Type: Union[None, str, H2OEstimator].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_drf = H2ORandomForestEstimator(ntrees=1,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
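>>> print(cars_drf.auc(valid=True))
>>> # Resume from the first model; ntrees must exceed the checkpointed model's ntrees.
>>> cars_drf_continued = H2ORandomForestEstimator(checkpoint=cars_drf.model_id,
...                                               ntrees=5,
...                                               seed=1234)
>>> cars_drf_continued.train(x=predictors,
...                          y=response,
...                          training_frame=train,
...                          validation_frame=valid)
>>> print(cars_drf_continued.auc(valid=True))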
property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> print(covtype[54].table())
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cov_drf = H2ORandomForestEstimator(balance_classes=True,
...                                    class_sampling_factors=sample_factors,
...                                    seed=1234)
>>> cov_drf.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print('logloss', cov_drf.logloss(valid=True))
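
Because the factors are matched to classes in lexicographic order, it helps to check which factor lands on which class before training. The sketch below pairs factors with sorted class labels and reports the row counts that would result. The helper target_class_counts is hypothetical, and rounding to the nearest integer is an assumption.

```python
from collections import Counter


def target_class_counts(labels, class_sampling_factors):
    """Pair each factor with a class in lexicographic order and return
    the row count each class would have after resampling."""
    counts = Counter(labels)
    classes = sorted(counts, key=str)
    if len(classes) != len(class_sampling_factors):
        raise ValueError("need exactly one factor per class")
    return {c: round(counts[c] * f)
            for c, f in zip(classes, class_sampling_factors)}


labels = ["1"] * 6 + ["2"] * 10 + ["3"] * 2
# Keep class "1", halve class "2", double class "3".
print(target_class_counts(labels, [1.0, 0.5, 2.0]))
```

Note that lexicographic order differs from numeric order for string labels: "10" sorts before "2", so the first factor would apply to "10".
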
property col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(col_sample_rate_change_per_level=.9,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_drf.auc(valid=True))
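
A relative change per level means the column sampling rate is scaled again at each deeper level of a tree. The sketch below illustrates this under two assumptions: that the factor is applied once per level starting from the root, and that the resulting rate is capped at 1.0. The helper column_rate_at_depth is hypothetical.

```python
def column_rate_at_depth(base_rate, change_per_level, depth):
    """Column sampling rate at a given tree depth (0 = root).

    The rate is multiplied by `change_per_level` once per level and
    capped at 1.0 (the cap is an assumption).
    """
    if not 0.0 < change_per_level <= 2.0:
        raise ValueError("change_per_level must be > 0.0 and <= 2.0")
    return min(1.0, base_rate * change_per_level ** depth)


# With a factor of 0.9, fewer columns are considered at deeper levels.
for depth in range(4):
    print(depth, round(column_rate_at_depth(1.0, 0.9, depth), 4))
```

Values below 1.0 shrink the sampled column set with depth, while values above 1.0 grow it until every column is used.
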
property col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(col_sample_rate_per_tree=.7,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_drf.auc(valid=True))
property custom_metric_func

Reference to custom evaluation function, format: language:keyName=funcName

Type: str.

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(distribution="poisson",
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.mse(valid=True)
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> from h2o.grid.grid_search import H2OGridSearch
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> response = "IsDepDelayed"
>>> hyper_parameters = {'ntrees': [5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
...                'max_models': 5,
...                'seed': 1234,
...                'stopping_rounds': 3,
...                'stopping_metric': "AUTO",
...                'stopping_tolerance': 1e-2}
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_grid = H2OGridSearch(H2ORandomForestEstimator,
...                          hyper_params=hyper_parameters,
...                          search_criteria=search_crit)
>>> air_grid.train(x=predictors,
...                y=response,
...                training_frame=airlines,
...                distribution="bernoulli",
...                max_depth=3,
...                export_checkpoints_dir=checkpoints_dir)
>>> num_files = len(listdir(checkpoints_dir))
>>> num_files
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> assignment_type = "Random"
>>> cars_drf = H2ORandomForestEstimator(fold_assignment=assignment_type,
...                                     nfolds=5,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=cars)
>>> cars_drf.auc(xval=True)
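
The three explicit schemes differ only in how rows are mapped to fold indices. The sketch below shows one plausible reading of each in plain Python. The function names are hypothetical and the exact assignment rules inside H2O may differ.

```python
import random
from collections import defaultdict


def modulo_folds(n_rows, nfolds):
    """Row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


def random_folds(n_rows, nfolds, seed):
    """Each row gets an independently drawn fold index."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


def stratified_folds(labels, nfolds, seed):
    """Spread the rows of each class evenly across the folds."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for i, label in enumerate(labels):
        rows_by_class[label].append(i)
    folds = [None] * len(labels)
    for rows in rows_by_class.values():
        rng.shuffle(rows)
        for position, row in enumerate(rows):
            folds[row] = position % nfolds
    return folds


labels = [0] * 8 + [1] * 4
print(modulo_folds(len(labels), 4))
print(random_folds(len(labels), 4, seed=1234))
print(stratified_folds(labels, 4, seed=1234))
```

Stratified assignment keeps the class ratio of every fold close to that of the full frame, which matters most when one class is rare.
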
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> print(cars['fold_numbers'])
>>> cars_drf = H2ORandomForestEstimator(seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=cars,
...                fold_column="fold_numbers")
>>> cars_drf.auc(xval=True)
property gainslift_bins

Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2ORandomForestEstimator(ntrees=1, gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.gains_lift()
property histogram_type

What type of histogram to use for finding optimal split points

Type: Literal["auto", "uniform_adaptive", "random", "quantiles_global", "round_robin", "uniform_robust"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(histogram_type="uniform_adaptive",
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_drf.auc(valid=True))
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(seed=1234,
...                                     ignore_const_cols=True)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.auc(valid=True)
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_fold_assignment=True,
...                                     nfolds=5,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_drf.cross_validation_fold_assignment()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_models=True,
...                                     nfolds=5,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_drf.auc()
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_predictions=True,
...                                     nfolds=5,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train)
>>> cars_drf.cross_validation_predictions()
property max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.

Type: float, defaults to 5.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> print(covtype[54].table())
>>> max_size = .85
>>> cov_drf = H2ORandomForestEstimator(balance_classes=True,
...                                    max_after_balance_size=max_size,
...                                    seed=1234)
>>> cov_drf.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print('logloss', cov_drf.logloss(valid=True))
property max_confusion_matrix_size

[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs

Type: int, defaults to 20.

property max_depth

Maximum tree depth (0 for unlimited).

Type: int, defaults to 20.

Examples

>>> df = h2o.import_file(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> response = "survived"
>>> df[response] = df[response].asfactor()
>>> predictors = df.columns
>>> del predictors[1:3]
>>> train, valid, test = df.split_frame(ratios=[0.6,0.2],
...                                     seed=1234,
...                                     destination_frames=
...                                     ['train.hex','valid.hex','test.hex'])
>>> drf = H2ORandomForestEstimator()
>>> drf.train(x=predictors,
...           y=response,
...           training_frame=train)
>>> perf = drf.model_performance(valid)
>>> print(perf.auc())
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(max_runtime_secs=10,
...                                     ntrees=10000,
...                                     max_depth=10,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.auc(valid=True)
property min_rows

Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 1.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(min_rows=16,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print(cars_drf.auc(valid=True))
property min_split_improvement

Minimum relative improvement in squared error reduction for a split to happen

Type: float, defaults to 1e-05.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(min_split_improvement=1e-3,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> print(cars_drf.auc(valid=True))
property mtries

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).

Type: int, defaults to -1.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_drf = H2ORandomForestEstimator(mtries=30, seed=1234)
>>> cov_drf.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print('logloss', cov_drf.logloss(valid=True))
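
The default rule for mtries=-1 described above can be sketched in plain Python. The helper name default_mtries is hypothetical, and rounding down with a floor of 1 is an assumption made for illustration rather than a confirmed detail of H2O.

```python
import math


def default_mtries(num_predictors, classification):
    """Columns sampled per split when mtries=-1 (illustrative sketch).

    Rounding down and the minimum of 1 are assumptions, not confirmed H2O behavior.
    """
    if num_predictors < 1:
        raise ValueError("num_predictors must be at least 1")
    if classification:
        candidates = math.isqrt(num_predictors)  # floor(sqrt(p))
    else:
        candidates = num_predictors // 3         # floor(p / 3)
    return max(1, candidates)


# 54 predictors, as in the covtype example above.
print(default_mtries(54, classification=True))
print(default_mtries(54, classification=False))
```

Under these assumptions, setting mtries=30 as in the example above samples far more columns per split than the classification default would.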
property nbins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point

Type: int, defaults to 20.

Examples

>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv")
>>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor()
>>> predictors = eeg.columns[:-1]
>>> response = 'eyeDetection'
>>> train, valid = eeg.split_frame(ratios=[.8], seed=1234)
>>> bin_num = [16, 32, 64, 128, 256, 512]
>>> label = ["16", "32", "64", "128", "256", "512"]
>>> for key, num in enumerate(bin_num):
...     # 'key' indexes into label; 'num' is the nbins value being tried
...     eeg_drf = H2ORandomForestEstimator(nbins=num, seed=1234)
...     eeg_drf.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
...     print(label[key], 'training score',
...           eeg_drf.auc(train=True))
...     print(label[key], 'validation score',
...           eeg_drf.auc(valid=True))
property nbins_cats

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

Type: int, defaults to 1024.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> bin_num = [8, 16, 32, 64, 128, 256,
...            512, 1024, 2048, 4096]
>>> label = ["8", "16", "32", "64", "128",
...          "256", "512", "1024", "2048", "4096"]
>>> for key, num in enumerate(bin_num):
...     # 'key' indexes into label; 'num' is the nbins_cats value being tried
...     airlines_drf = H2ORandomForestEstimator(nbins_cats=num,
...                                             seed=1234)
...     airlines_drf.train(x=predictors,
...                        y=response,
...                        training_frame=train,
...                        validation_frame=valid)
...     print(label[key], 'training score',
...           airlines_drf.auc(train=True))
...     print(label[key], 'validation score',
...           airlines_drf.auc(valid=True))
property nbins_top_level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level

Type: int, defaults to 1024.

Examples

>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv")
>>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor()
>>> predictors = eeg.columns[:-1]
>>> response = 'eyeDetection'
>>> train, valid = eeg.split_frame(ratios=[.8],
...                                seed=1234)
>>> bin_num = [32, 64, 128, 256, 512,
...            1024, 2048, 4096]
>>> label = ["32", "64", "128", "256",
...          "512", "1024", "2048", "4096"]
>>> for key, num in enumerate(bin_num):
...     # 'key' indexes into label; 'num' is the nbins_top_level value being tried
...     eeg_drf = H2ORandomForestEstimator(nbins_top_level=num,
...                                        seed=1234)
...     eeg_drf.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
...     print(label[key], 'training score',
...           eeg_drf.auc(train=True))
...     print(label[key], 'validation score',
...           eeg_drf.auc(valid=True))
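
The interplay between nbins ("at least this many bins") and nbins_top_level ("at most this many bins at the root level, then decrease by factor of two per level") can be sketched as follows. The function name numeric_bins_at_depth is hypothetical, and combining the two settings with max() is one reading of the descriptions above, not a confirmed implementation detail.

```python
def numeric_bins_at_depth(depth, nbins_top_level=1024, nbins=20):
    """Histogram bins for numeric columns at a given tree depth (illustrative).

    Start at nbins_top_level at the root (depth 0), halve per level,
    and never drop below nbins.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    halved = nbins_top_level >> depth  # integer halving once per level
    return max(nbins, halved)


for depth in range(8):
    print(depth, numeric_bins_at_depth(depth))
```

With the defaults, this reading gives fine-grained split search near the root, where each node sees many rows, and coarser histograms deeper in the tree.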
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> folds = 5
>>> cars_drf = H2ORandomForestEstimator(nfolds=folds,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=cars)
>>> cars_drf.auc(xval=True)
property ntrees

Number of trees.

Type: int, defaults to 50.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> tree_num = [20, 50, 80, 110,
...             140, 170, 200]
>>> label = ["20", "50", "80", "110",
...          "140", "170", "200"]
>>> for key, num in enumerate(tree_num):
...     # 'key' indexes into label; 'num' is the ntrees value being tried
...     titanic_drf = H2ORandomForestEstimator(ntrees=num,
...                                            seed=1234)
...     titanic_drf.train(x=predictors,
...                       y=response,
...                       training_frame=train,
...                       validation_frame=valid)
...     print(label[key], 'training score',
...           titanic_drf.auc(train=True))
...     print(label[key], 'validation score',
...           titanic_drf.auc(valid=True))
property offset_column

[Deprecated] The property was removed and will be ignored.

property r2_stopping

r2_stopping is no longer supported and will be ignored if set. Use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop building trees when the R^2 metric equaled or exceeded this value.

Type: float.

property response_column

Response variable column.

Type: str.

property sample_rate

Row sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 0.632.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(sample_rate=.7,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_drf.auc(valid=True))
property sample_rate_per_class

A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree

Type: List[float].

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8],
...                                    seed=1234)
>>> print(train[response].table())
>>> rate_per_class_list = [1, .4, 1, 1, 1, 1, 1]
>>> cov_drf = H2ORandomForestEstimator(sample_rate_per_class=rate_per_class_list,
...                                    seed=1234)
>>> cov_drf.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print('logloss', cov_drf.logloss(valid=True))
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(score_each_iteration=True,
...                                     ntrees=55,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.scoring_history()
property score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_drf = H2ORandomForestEstimator(score_tree_interval=5,
...                                     seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.scoring_history()
property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> drf_w_seed_1 = H2ORandomForestEstimator(seed=1234)
>>> drf_w_seed_1.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print('auc for the 1st model built with a seed:',
...        drf_w_seed_1.auc(valid=True))
property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc",
...                                         stopping_rounds=3,
...                                         stopping_tolerance=1e-2,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_drf.auc(valid=True)
property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable).

Type: int, defaults to 0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc",
...                                         stopping_rounds=3,
...                                         stopping_tolerance=1e-2,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_drf.auc(valid=True)
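
The moving-average rule described for stopping_rounds, together with stopping_tolerance, can be sketched in plain Python. The names moving_averages and should_stop are hypothetical, and the exact comparison H2O performs may differ; this only illustrates the idea of comparing the most recent k moving averages against the one that preceded them.

```python
def moving_averages(values, k):
    """Simple moving averages of length k over a list of metric values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def should_stop(history, stopping_rounds, stopping_tolerance, larger_is_better=False):
    """Return True if the metric has stopped improving (illustrative sketch).

    history holds one metric value per scoring event, oldest first.
    """
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    averages = moving_averages(history, k)
    reference = averages[-k - 1]   # the average just before the last k averages
    recent = averages[-k:]
    best = max(recent) if larger_is_better else min(recent)
    gain = best - reference if larger_is_better else reference - best
    if reference != 0:
        gain /= abs(reference)     # relative improvement
    return gain < stopping_tolerance


logloss_history = [0.70, 0.60, 0.55, 0.55, 0.55, 0.55, 0.55, 0.55]
print(should_stop(logloss_history, stopping_rounds=3, stopping_tolerance=1e-2))
```

For metrics such as AUC, where larger values are better, pass larger_is_better=True.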
property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc",
...                                         stopping_rounds=3,
...                                         stopping_tolerance=1e-2,
...                                         seed=1234)
>>> airlines_drf.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_drf.auc(valid=True)
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_drf = H2ORandomForestEstimator(seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.auc(valid=True)
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_drf = H2ORandomForestEstimator(seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_drf.auc(valid=True)
property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_drf = H2ORandomForestEstimator(seed=1234)
>>> cars_drf.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid,
...                weights_column="weight")
>>> cars_drf.auc(valid=True)

H2ORuleFitEstimator

class h2o.estimators.rulefit.H2ORuleFitEstimator(model_id=None, training_frame=None, validation_frame=None, seed=-1, response_column=None, ignored_columns=None, algorithm='auto', min_rule_length=3, max_rule_length=3, max_num_rules=-1, model_type='rules_and_linear', weights_column=None, distribution='auto', rule_generation_ntrees=50, auc_type='auto', remove_duplicates=True, lambda_=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

RuleFit

Builds a RuleFit on a parsed dataset, for regression or classification.

property Lambda

[Deprecated] Use lambda_ instead

property algorithm

The algorithm to use to generate rules.

Type: Literal["auto", "drf", "gbm"], defaults to "auto".

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property lambda_

Lambda for LASSO regressor.

Type: List[float].

property max_num_rules

The maximum number of rules to return. Defaults to -1, which means the number of rules is selected by diminishing returns in model deviance.

Type: int, defaults to -1.

property max_rule_length

Maximum length of rules. Defaults to 3.

Type: int, defaults to 3.

property min_rule_length

Minimum length of rules. Defaults to 3.

Type: int, defaults to 3.

property model_type

Specifies type of base learners in the ensemble.

Type: Literal["rules_and_linear", "rules", "linear"], defaults to "rules_and_linear".

predict_rules(frame, rule_ids)[source]

Evaluates validity of the given rules on the given data.

Parameters
  • frame – H2OFrame on which rule validity is to be evaluated

  • rule_ids – string array of rule ids to be evaluated against the frame

Returns

H2OFrame with one column per input rule id; each value is a flag indicating whether the rule applies to the observation.
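
The shape of this result can be illustrated without a running cluster. The sketch below is plain Python, not the H2O API: the rule representation (a list of column, operator, threshold conditions), the rule ids, and the function rule_flags are all hypothetical stand-ins for what predict_rules computes on an H2OFrame.

```python
import operator

# Hypothetical comparison operators a rule condition may use.
OPS = {"<": operator.lt, ">=": operator.ge}


def rule_flags(rows, rules):
    """Return one dict per row mapping rule id -> 1 if every condition holds, else 0."""
    flags = []
    for row in rows:
        flags.append({
            rule_id: int(all(OPS[op](row[col], value) for col, op, value in conditions))
            for rule_id, conditions in rules.items()
        })
    return flags


rows = [{"age": 25.0, "fare": 80.0}, {"age": 60.0, "fare": 10.0}]
rules = {
    "rule_young_high_fare": [("age", "<", 30.0), ("fare", ">=", 50.0)],
    "rule_low_fare": [("fare", "<", 20.0)],
}
for flags in rule_flags(rows, rules):
    print(flags)
```

Each output row has one 0/1 entry per requested rule, mirroring the one-column-per-rule layout described above.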

property remove_duplicates

Whether to remove rules which are identical to an earlier rule. Defaults to true.

Type: bool, defaults to True.

property response_column

Response variable column.

Type: str.

property rule_generation_ntrees

Specifies the number of trees to build in the tree model. Defaults to 50.

Type: int, defaults to 50.

rule_importance()[source]

Retrieve rule importances for a RuleFit model.

Returns

H2OTwoDimTable

property seed

Seed for pseudo random number generator (if applicable).

Type: int, defaults to -1.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

property weights_column

Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.
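
The equivalences stated above (a weight of 2 behaves like a repeated row, a weight of 0 like a removed row) can be checked on a simple weighted statistic. This is a plain-Python sketch with a hypothetical weighted_mean helper; it does not call H2O and only illustrates how per-row weights enter a computation.

```python
import math


def weighted_mean(values, weights):
    """Mean of values where each row counts `weight` times (weights may be fractional)."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total


values = [1.0, 2.0, 4.0]
weights = [1, 2, 0]  # second row counted twice, third row excluded

weighted = weighted_mean(values, weights)
expanded = sum([1.0, 2.0, 2.0]) / 3  # same data with the rows physically repeated/removed
print(weighted, expanded, math.isclose(weighted, expanded))
```

The data frame itself keeps three rows; only the contribution of each row changes.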

H2OStackedEnsembleEstimator

class h2o.estimators.stackedensemble.H2OStackedEnsembleEstimator(model_id=None, training_frame=None, response_column=None, validation_frame=None, blending_frame=None, base_models=[], metalearner_algorithm='auto', metalearner_nfolds=0, metalearner_fold_assignment=None, metalearner_fold_column=None, metalearner_params=None, metalearner_transform='none', max_runtime_secs=0.0, weights_column=None, offset_column=None, seed=-1, score_training_samples=10000, keep_levelone_frame=False, export_checkpoints_dir=None, auc_type='auto')[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Stacked Ensemble

Builds a stacked ensemble (aka “super learner”), a machine learning method that uses two or more H2O learning algorithms to improve predictive performance. It is a loss-based supervised learning method that finds the optimal combination of a collection of prediction algorithms. This method supports regression and binary classification.

Examples

>>> import h2o
>>> h2o.init()
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> col_types = ["numeric", "numeric", "numeric", "enum",
...              "enum", "numeric", "numeric", "numeric", "numeric"]
>>> data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv", col_types=col_types)
>>> train, test = data.split_frame(ratios=[.8], seed=1)
>>> x = ["CAPSULE","GLEASON","RACE","DPROS","DCAPS","PSA","VOL"]
>>> y = "AGE"
>>> nfolds = 5
>>> gbm = H2OGradientBoostingEstimator(nfolds=nfolds,
...                                    fold_assignment="Modulo",
...                                    keep_cross_validation_predictions=True)
>>> gbm.train(x=x, y=y, training_frame=train)
>>> rf = H2ORandomForestEstimator(nfolds=nfolds,
...                               fold_assignment="Modulo",
...                               keep_cross_validation_predictions=True)
>>> rf.train(x=x, y=y, training_frame=train)
>>> stack = H2OStackedEnsembleEstimator(model_id="ensemble",
...                                     training_frame=train,
...                                     validation_frame=test,
...                                     base_models=[gbm.model_id, rf.model_id])
>>> stack.train(x=x, y=y, training_frame=train, validation_frame=test)
>>> stack.model_performance()
property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

property base_models

List of models or grids (or their ids) to ensemble/stack together. Grids are expanded to individual models. If not using blending frame, then models must have been cross-validated using nfolds > 1, and folds must be identical across models.

Type: List[str], defaults to [].
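
The requirement that folds be identical across base models is why the examples for this class train every base model with fold_assignment="Modulo" and the same nfolds. A minimal sketch in plain Python of why that choice lines the folds up, assuming Modulo places row i in fold i % nfolds (the helper modulo_folds is hypothetical and not part of the h2o API):

```python
def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (deterministic, no seed involved)."""
    return [i % nfolds for i in range(n_rows)]

# Two base models trained on the same frame with the same nfolds see the
# same folds, so their cross-validated predictions line up row by row.
folds_for_gbm = modulo_folds(10, 5)
folds_for_rf = modulo_folds(10, 5)
print(folds_for_gbm)
print(folds_for_gbm == folds_for_rf)
```

A random fold assignment can give the same guarantee only if every base model uses the same seed, which is easier to get wrong.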

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> col_types = ["numeric", "numeric", "numeric", "enum",
...              "enum", "numeric", "numeric", "numeric", "numeric"]
>>> data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv", col_types=col_types)
>>> train, test = data.split_frame(ratios=[.8], seed=1)
>>> x = ["CAPSULE","GLEASON","RACE","DPROS","DCAPS","PSA","VOL"]
>>> y = "AGE"
>>> nfolds = 5
>>> gbm = H2OGradientBoostingEstimator(nfolds=nfolds,
...                                    fold_assignment="Modulo",
...                                    keep_cross_validation_predictions=True)
>>> gbm.train(x=x, y=y, training_frame=train)
>>> rf = H2ORandomForestEstimator(nfolds=nfolds,
...                               fold_assignment="Modulo",
...                               keep_cross_validation_predictions=True)
>>> rf.train(x=x, y=y, training_frame=train)
>>> stack = H2OStackedEnsembleEstimator(model_id="ensemble",
...                                     training_frame=train,
...                                     validation_frame=test,
...                                     base_models=[gbm.model_id, rf.model_id])
>>> stack.train(x=x, y=y, training_frame=train, validation_frame=test)
>>> stack.model_performance()
property blending_frame

Frame used to compute the predictions that serve as the training frame for the metalearner (triggers blending mode if provided).

Type: Union[None, str, H2OFrame].
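
As a rough illustration of what "the predictions that serve as the training frame for the metalearner" means, the plain-Python sketch below builds such a frame from a holdout set. The toy models, the helper level_one_frame, and the column layout are all assumptions made for illustration; they do not reproduce H2O internals.

```python
# Toy stand-ins for trained base models: each maps a row to a prediction.
def model_a(row):
    return 0.8 * row["x"]

def model_b(row):
    return row["x"] ** 2

def level_one_frame(base_models, rows, response):
    """One output row per input row: base-model predictions, then the response."""
    return [[m(r) for m in base_models] + [r[response]] for r in rows]

# In blending mode the rows come from a holdout frame the base models never saw.
blend_rows = [{"x": 0.5, "y": 0}, {"x": 1.0, "y": 1}]
frame = level_one_frame([model_a, model_b], blend_rows, "y")
for row in frame:
    print(row)
```

The metalearner is then fit on these rows, with the prediction columns as predictors and the last column as the response.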

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=10,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> import tempfile
>>> from os import listdir
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> checkpoints_dir = tempfile.mkdtemp()
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=10,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           export_checkpoints_dir=checkpoints_dir)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> len(listdir(checkpoints_dir))
property keep_levelone_frame

Keep level one frame used for metalearner training.

Type: bool, defaults to False.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           keep_levelone_frame=True)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
levelone_frame_id()[source]

Fetch the levelone_frame_id for an H2OStackedEnsembleEstimator.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=10,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           keep_levelone_frame=True)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.levelone_frame_id()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

metalearner()[source]

Print the metalearner of an H2OStackedEnsembleEstimator.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=10,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           keep_levelone_frame=True)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.metalearner()
property metalearner_algorithm

Type of algorithm to use as the metalearner. Options include ‘AUTO’ (GLM with non negative weights; if validation_frame is present, a lambda search is performed), ‘deeplearning’ (Deep Learning with default parameters), ‘drf’ (Random Forest with default parameters), ‘gbm’ (GBM with default parameters), ‘glm’ (GLM with default parameters), ‘naivebayes’ (NaiveBayes with default parameters), or ‘xgboost’ (if available, XGBoost with default parameters).

Type: Literal["auto", "deeplearning", "drf", "gbm", "glm", "naivebayes", "xgboost"], defaults to "auto".
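
To give a feel for what a non-negativity constraint on the metalearner weights does, here is a small plain-Python sketch. It is only an illustration under stated assumptions: it uses projected gradient descent on a squared-error loss, which is not how H2O fits its GLM, and the function name fit_nonnegative_weights is hypothetical.

```python
def fit_nonnegative_weights(level_one, y, lr=0.05, steps=200):
    """Least-squares weights for the base-model columns, clipped at zero."""
    n, k = len(level_one), len(level_one[0])
    w = [0.0] * k
    for _ in range(steps):
        # Residuals of the current weighted combination.
        resid = [sum(wj * zj for wj, zj in zip(w, row)) - t
                 for row, t in zip(level_one, y)]
        for j in range(k):
            grad = 2.0 * sum(r * row[j] for r, row in zip(resid, level_one)) / n
            # Gradient step followed by projection onto w_j >= 0.
            w[j] = max(0.0, w[j] - lr * grad)
    return w

# Column 0 tracks the response; column 1 is anti-correlated with it.
level_one = [[1.0, -1.0], [2.0, -2.0], [3.0, -3.0]]
y = [1.0, 2.0, 3.0]
weights = fit_nonnegative_weights(level_one, y)
print(weights)
```

In this toy data the anti-correlated column ends up with a weight of (approximately) zero instead of a negative weight, so it is effectively dropped from the ensemble.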

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_algorithm="gbm")
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
property metalearner_fold_assignment

Cross-validation fold assignment scheme for metalearner cross-validation. Defaults to AUTO (which is currently set to Random). The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"].

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_fold_assignment="Random")
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
property metalearner_fold_column

Column with cross-validation fold index assignment per observation for cross-validation of the metalearner.

Type: str.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv")
>>> fold_column = "fold_id"
>>> train[fold_column] = train.kfold_column(n_folds=3, seed=1)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> x.remove(fold_column)
>>> train[y] = train[y].asfactor()
>>> test[y] = test[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=10,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                     metalearner_fold_column=fold_column,
...                                     metalearner_params=dict(keep_cross_validation_models=True))
>>> stack.train(x=x, y=y, training_frame=train)
>>> stack.model_performance().auc()
property metalearner_nfolds

Number of folds for K-fold cross-validation of the metalearner algorithm (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_nfolds=3)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
property metalearner_params

Parameters for the metalearner algorithm.

Type: dict.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> gbm_params = {"ntrees" : 100, "max_depth" : 6}
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           metalearner_algorithm="gbm",
...                                           metalearner_params=gbm_params)
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
property metalearner_transform

Transformation used for the level one frame.

Type: Literal["none", "logit"], defaults to "none".
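
For reference, the logit function itself can be sketched in a few lines of plain Python. This shows only the mathematical transform; which level one columns H2O applies it to, and how it treats probabilities of exactly 0 or 1, is not stated here, so the eps clipping below is an assumption.

```python
import math

def logit(p, eps=1e-9):
    """Map a probability to the real line; the eps clipping is an assumption."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

for p in (0.1, 0.5, 0.9):
    print(p, logit(p))
```

The transform is symmetric around 0.5 and stretches probabilities near 0 and 1, which can suit a linear metalearner better than raw probabilities.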

property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.

property response_column

Response variable column.

Type: str.

property score_training_samples

Specify the number of training set samples for scoring. The value must be >= 0. To use all training samples, enter 0.

Type: int, defaults to 10000.

property seed

Seed for random numbers; passed through to the metalearner algorithm. Defaults to -1 (time-based random number).

Type: int, defaults to -1.

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> blend[y] = blend[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_fold_assignment="Random")
>>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
>>> stack_blend.model_performance(blend).auc()
train(x=None, y=None, training_frame=None, blending_frame=None, verbose=False, **kwargs)[source]

Train the H2O model.

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

  • max_runtime_secs (float) – Maximum allowed runtime in seconds for model training. Use 0 to disable.

  • verbose (bool) – Print scoring history to stdout. Defaults to False.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, valid = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> valid[y] = valid[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_fold_assignment="Random")
>>> stack_blend.train(x=x, y=y, training_frame=train, validation_frame=valid)
>>> stack_blend.model_performance(valid).auc()
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> from h2o.estimators.random_forest import H2ORandomForestEstimator
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
>>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
>>> train, valid = higgs.split_frame(ratios = [.8], seed = 1234)
>>> x = train.columns
>>> y = "response"
>>> x.remove(y)
>>> train[y] = train[y].asfactor()
>>> valid[y] = valid[y].asfactor()
>>> nfolds = 3
>>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
...                                       ntrees=1,
...                                       nfolds=nfolds,
...                                       fold_assignment="Modulo",
...                                       keep_cross_validation_predictions=True,
...                                       seed=1)
>>> my_gbm.train(x=x, y=y, training_frame=train)
>>> my_rf = H2ORandomForestEstimator(ntrees=50,
...                                  nfolds=nfolds,
...                                  fold_assignment="Modulo",
...                                  keep_cross_validation_predictions=True,
...                                  seed=1)
>>> my_rf.train(x=x, y=y, training_frame=train)
>>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
...                                           seed=1,
...                                           metalearner_fold_assignment="Random")
>>> stack_blend.train(x=x, y=y, training_frame=train, validation_frame=valid)
>>> stack_blend.model_performance(valid).auc()
property weights_column

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.

Type: str.
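
The equivalences described above (weight 2 behaves like a repeated row, weight 0 like a removed row) can be checked on a toy loss. The sketch below uses a weighted mean squared error in plain Python purely as an illustration; the loss H2O actually optimizes depends on the model and distribution, and weighted_mse is a hypothetical helper.

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean squared error where each row's term is scaled by its weight."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)

y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]

# Weight 2 on the last row ...
weighted = weighted_mse(y_true, y_pred, [1, 1, 2])
# ... matches an unweighted loss with that row repeated twice.
repeated = weighted_mse(y_true + [4.0], y_pred + [3.0], [1, 1, 1, 1])

# Weight 0 on the first row matches dropping that row.
zeroed = weighted_mse(y_true, y_pred, [0, 1, 1])
dropped = weighted_mse(y_true[1:], y_pred[1:], [1, 1])

print(weighted == repeated, zeroed == dropped)
```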

H2OTargetEncoderEstimator

class h2o.estimators.targetencoder.H2OTargetEncoderEstimator(model_id=None, training_frame=None, fold_column=None, response_column=None, ignored_columns=None, columns_to_encode=None, keep_original_categorical_columns=True, blending=False, inflection_point=10.0, smoothing=20.0, data_leakage_handling='none', noise=0.01, seed=-1)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

TargetEncoder

property blending

If true, enables blending of posterior probabilities (computed for a given categorical value) with prior probabilities (computed on the entire set). This helps mitigate the effect of categorical values with small cardinality. The blending effect can be tuned using the inflection_point and smoothing parameters.

Type: bool, defaults to False.

Examples

>>> from h2o.estimators.targetencoder import H2OTargetEncoderEstimator
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
property columns_to_encode

List of categorical columns or groups of categorical columns to encode. When groups of columns are specified, each group is encoded as a single column (interactions are created internally).

Type: List[List[str]].
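
The List[List[str]] type means one inner list per encoded output column. The plain-Python sketch below illustrates the idea of treating a group of columns as a single categorical column; the separator and the helper name interaction_levels are assumptions, and H2O's internal representation of interactions may differ.

```python
def interaction_levels(rows, group, sep="_"):
    """Combine the values of the grouped columns into one level per row."""
    return [sep.join(str(row[col]) for col in group) for row in rows]

rows = [
    {"cabin": "C85", "embarked": "C", "home.dest": "Paris"},
    {"cabin": "E46", "embarked": "S", "home.dest": "London"},
]

# Shape of a columns_to_encode value: "cabin" is encoded on its own, while
# "home.dest" and "embarked" are encoded together as one combined column.
columns_to_encode = [["cabin"], ["home.dest", "embarked"]]
for group in columns_to_encode:
    print(group, interaction_levels(rows, group))
```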

property data_leakage_handling

Data leakage handling strategy used to generate the encoding. Supported options are: 1) “none” (default) - no holdout, using the entire training frame. 2) “leave_one_out” - current row’s response value is subtracted from the per-level frequencies pre-calculated on the entire training frame. 3) “k_fold” - encodings for a fold are generated based on out-of-fold data.

Type: Literal["leave_one_out", "k_fold", "none"], defaults to "none".
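
The three strategies can be compared on a toy column. The sketch below is plain Python and only mirrors the descriptions above; it ignores blending and noise, and falling back to the global prior when a level has no other rows is an assumption made here, not documented H2O behavior.

```python
from collections import defaultdict

def level_stats(levels, y):
    """Sum and count of the response for each categorical level."""
    sums, counts = defaultdict(float), defaultdict(int)
    for lv, t in zip(levels, y):
        sums[lv] += t
        counts[lv] += 1
    return sums, counts

def encode_none(levels, y):
    """Per-level mean over the entire training frame (no holdout)."""
    sums, counts = level_stats(levels, y)
    return [sums[lv] / counts[lv] for lv in levels]

def encode_leave_one_out(levels, y, prior):
    """Per-level mean with the current row's response subtracted out."""
    sums, counts = level_stats(levels, y)
    out = []
    for lv, t in zip(levels, y):
        n = counts[lv] - 1
        out.append((sums[lv] - t) / n if n > 0 else prior)
    return out

def encode_k_fold(levels, y, folds, prior):
    """Per-level mean computed only from rows outside the current row's fold."""
    out = []
    for lv, fold in zip(levels, folds):
        vals = [t for l, t, f in zip(levels, y, folds) if l == lv and f != fold]
        out.append(sum(vals) / len(vals) if vals else prior)
    return out

levels = ["a", "a", "a", "b"]
y = [1.0, 0.0, 1.0, 1.0]
folds = [0, 1, 0, 1]
prior = sum(y) / len(y)

print(encode_none(levels, y))
print(encode_leave_one_out(levels, y, prior))
print(encode_k_fold(levels, y, folds, prior))
```

With "none", each row's own response contributes to its encoding, which is the leakage the other two strategies are meant to reduce.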

Examples

>>> from h2o.estimators.targetencoder import H2OTargetEncoderEstimator
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        data_leakage_handling="k_fold",
...                                        fold_column=fold_col,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
property f

[Deprecated] Use smoothing instead

property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> from h2o.estimators.targetencoder import H2OTargetEncoderEstimator
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        data_leakage_handling="k_fold",
...                                        fold_column=fold_col,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property inflection_point

Inflection point of the sigmoid used to blend probabilities (see blending parameter). For a given categorical value, if it appears fewer than inflection_point times in a data sample, then the influence of the posterior probability will be smaller than that of the prior.

Type: float, defaults to 10.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
property k

[Deprecated] Use inflection_point instead

property keep_original_categorical_columns

If true, the original non-encoded categorical features will remain in the result frame.

Type: bool, defaults to True.

property noise

The amount of noise to add to the encoded column. Use 0 to disable noise, and -1 (=AUTO) to let the algorithm determine a reasonable amount of noise.

Type: float, defaults to 0.01.

property noise_level

[Deprecated] Use noise instead

property response_column

Response variable column.

Type: str.

property seed

Seed used to generate the noise. By default, the seed is chosen randomly.

Type: int, defaults to -1.

property smoothing

Smoothing factor corresponds to the inverse of the slope at the inflection point on the sigmoid used to blend probabilities (see blending parameter). If smoothing tends towards 0, then the sigmoid used for blending turns into a Heaviside step function.

Type: float, defaults to 20.0.
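The roles of inflection_point and smoothing can be sketched with the sigmoid commonly used for target-encoding blending. The exact formula below is an assumption, not taken from H2O's source, and the function names are hypothetical; it is chosen so that the weight is 0.5 at the inflection point and the curve flattens as smoothing grows.

```python
import math

def blend_weight(n, inflection_point=10.0, smoothing=20.0):
    """Weight given to the per-level (posterior) mean for a level seen n times."""
    return 1.0 / (1.0 + math.exp(-(n - inflection_point) / smoothing))

def blended_encoding(level_mean, n, prior_mean, inflection_point=10.0, smoothing=20.0):
    """Mix the per-level mean with the global (prior) mean."""
    lam = blend_weight(n, inflection_point, smoothing)
    return lam * level_mean + (1.0 - lam) * prior_mean

w_at_inflection = blend_weight(35, inflection_point=35, smoothing=25)
w_rare = blend_weight(5, inflection_point=35, smoothing=25)
w_common = blend_weight(200, inflection_point=35, smoothing=25)
```

A level seen fewer than inflection_point times gets a weight below 0.5, so its encoding leans towards the prior; a frequent level leans towards its own mean.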

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(inflection_point=35,
...                                        smoothing=25,
...                                        blending=True)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> titanic_te
transform(frame, blending=None, inflection_point=None, smoothing=None, noise=None, as_training=False, **kwargs)[source]

Apply transformation to te_columns based on the encoding maps generated during the train() method call.

Parameters
  • frame (H2OFrame) – the frame on which to apply the target encoding transformations.

  • blending (boolean) – If provided, this overrides the blending parameter on the model.

  • inflection_point (float) – If provided, this overrides the inflection_point parameter on the model.

  • smoothing (float) – If provided, this overrides the smoothing parameter on the model.

  • noise (float) – If provided, this overrides the amount of random noise added to the target encoding defined on the model. Adding noise helps prevent overfitting.

  • as_training (boolean) – Must be set to True when encoding the training frame. Defaults to False.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = ["home.dest", "cabin", "embarked"]
>>> response = "survived"
>>> titanic[response] = titanic[response].asfactor()
>>> fold_col = "kfold_column"
>>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234)
>>> titanic_te = H2OTargetEncoderEstimator(data_leakage_handling="leave_one_out",
...                                        inflection_point=35,
...                                        smoothing=25,
...                                        blending=True,
...                                        seed=1234)
>>> titanic_te.train(x=predictors,
...                  y=response,
...                  training_frame=titanic)
>>> transformed = titanic_te.transform(frame=titanic, as_training=True)

H2OUpliftRandomForestEstimator

class h2o.estimators.uplift_random_forest.H2OUpliftRandomForestEstimator(model_id=None, training_frame=None, validation_frame=None, score_each_iteration=False, score_tree_interval=0, response_column=None, ignored_columns=None, ignore_const_cols=True, ntrees=50, max_depth=20, min_rows=1.0, nbins=20, nbins_top_level=1024, nbins_cats=1024, max_runtime_secs=0.0, seed=-1, mtries=-2, sample_rate=0.632, sample_rate_per_class=None, col_sample_rate_change_per_level=1.0, col_sample_rate_per_tree=1.0, histogram_type='auto', categorical_encoding='auto', distribution='auto', check_constant_response=True, treatment_column='treatment', uplift_metric='auto', auuc_type='auto', auuc_nbins=-1)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Uplift Distributed Random Forest

property auuc_nbins

Number of bins to calculate Area Under Uplift Curve.

Type: int, defaults to -1.

property auuc_type

Metric used to calculate Area Under Uplift Curve.

Type: Literal["auto", "qini", "lift", "gain"], defaults to "auto".
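The AUUC types differ in how cumulative uplift is computed at each cutoff. The sketch below uses definitions that are common in the uplift literature; they are an assumption here, since this page does not spell them out, and the helper name uplift_values is hypothetical. ty1 and cy1 are the numbers of positive responses among the t treated and c control observations up to the cutoff.

```python
def uplift_values(ty1, t, cy1, c):
    """Cumulative uplift at one cutoff under three common definitions."""
    rate_diff = ty1 / t - cy1 / c
    return {
        "qini": ty1 - cy1 * t / c,     # control responders rescaled to treatment size
        "lift": rate_diff,             # difference of response rates
        "gain": rate_diff * (t + c),   # rate difference scaled by group size
    }

vals = uplift_values(ty1=30, t=100, cy1=10, c=50)
```

The area under the curve traced by one of these values over successive cutoffs is what an AUUC summarizes.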

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

property check_constant_response

Check if response column is constant. If enabled, then an exception is thrown if the response column is a constant value. If disabled, then model will train regardless of the response column being a constant value or not.

Type: bool, defaults to True.

property col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)

Type: float, defaults to 1.0.

property col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

property histogram_type

What type of histogram to use for finding optimal split points

Type: Literal["auto", "uniform_adaptive", "random", "quantiles_global", "round_robin", "uniform_robust"], defaults to "auto".

property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property max_depth

Maximum tree depth (0 for unlimited).

Type: int, defaults to 20.

property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

property min_rows

Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 1.0.

property mtries

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).

Type: int, defaults to -2.
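As a quick illustration of the -1 rule, the sketch below computes the number of candidate columns for a given predictor count. The helper name default_mtries is hypothetical, and rounding down while keeping at least one column is an assumption of this sketch rather than documented behavior.

```python
import math

def default_mtries(n_predictors, classification):
    """Candidate columns per split under the -1 rule."""
    if classification:
        value = math.sqrt(n_predictors)
    else:
        value = n_predictors / 3
    # Assumption: round down, but always keep at least one column.
    return max(1, int(value))

mtries_classification = default_mtries(16, classification=True)
mtries_regression = default_mtries(12, classification=False)
```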

property nbins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point

Type: int, defaults to 20.

property nbins_cats

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

Type: int, defaults to 1024.

property nbins_top_level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level

Type: int, defaults to 1024.
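The halving schedule can be sketched as follows. This assumes that nbins acts as the floor once the halved value drops below it, which is how the "at least" and "at most" wording of nbins and nbins_top_level reads together; the helper name bins_at_depth is hypothetical.

```python
def bins_at_depth(depth, nbins=20, nbins_top_level=1024):
    """Histogram bins for a numeric column at a given tree depth (root = 0)."""
    # Halve the top-level bin count once per level, never going below nbins.
    return max(nbins, nbins_top_level >> depth)

schedule = [bins_at_depth(d) for d in range(8)]
```

With the defaults, the schedule starts at 1024 bins at the root and settles at 20 bins from depth 6 onwards.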

property ntrees

Number of trees.

Type: int, defaults to 50.

property response_column

Response variable column.

Type: str.

property sample_rate

Row sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 0.632.

property sample_rate_per_class

A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree

Type: List[float].

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

property score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

Type: int, defaults to 0.

property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

property treatment_column

Define the column which will be used for computing uplift gain to select best split for a tree. The column has to divide the dataset into treatment (value 1) and control (value 0) groups.

Type: str, defaults to "treatment".

property uplift_metric

Divergence metric used to find best split when building an uplift tree.

Type: Literal["auto", "kl", "euclidean", "chi_squared"], defaults to "auto".
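To give a feel for the three metrics, the sketch below computes each divergence between a treatment and a control response distribution. The formulas are the forms used in the uplift-tree literature (with the squared Euclidean distance) and are an assumption here; how H2O combines them into a split gain is not shown. The function name divergences is hypothetical.

```python
import math

def divergences(p_treatment, p_control):
    """Divergence between treatment and control response distributions.

    Both arguments are lists of class probabilities; control probabilities
    are assumed to be nonzero.
    """
    pairs = list(zip(p_treatment, p_control))
    return {
        "kl": sum(p * math.log(p / q) for p, q in pairs if p > 0),
        "euclidean": sum((p - q) ** 2 for p, q in pairs),
        "chi_squared": sum((p - q) ** 2 / q for p, q in pairs),
    }

same = divergences([0.2, 0.8], [0.2, 0.8])
diff = divergences([0.5, 0.5], [0.25, 0.75])
```

All three values are zero when the two groups respond identically and grow as the distributions move apart, which is why a split that maximizes them separates observations by treatment effect.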

property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

H2OXGBoostEstimator

class h2o.estimators.xgboost.H2OXGBoostEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, score_each_iteration=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, offset_column=None, weights_column=None, stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.001, max_runtime_secs=0.0, seed=-1, distribution='auto', tweedie_power=1.5, categorical_encoding='auto', quiet_mode=True, checkpoint=None, export_checkpoints_dir=None, ntrees=50, max_depth=6, min_rows=1.0, min_child_weight=1.0, learn_rate=0.3, eta=0.3, sample_rate=1.0, subsample=1.0, col_sample_rate=1.0, colsample_bylevel=1.0, col_sample_rate_per_tree=1.0, colsample_bytree=1.0, colsample_bynode=1.0, max_abs_leafnode_pred=0.0, max_delta_step=0.0, monotone_constraints=None, interaction_constraints=None, score_tree_interval=0, min_split_improvement=0.0, gamma=0.0, nthread=-1, save_matrix_directory=None, build_tree_one_node=False, calibrate_model=False, calibration_frame=None, max_bins=256, max_leaves=0, sample_type='uniform', normalize_type='tree', rate_drop=0.0, one_drop=False, skip_drop=0.0, tree_method='auto', grow_policy='depthwise', booster='gbtree', reg_lambda=1.0, reg_alpha=0.0, dmatrix_type='auto', backend='auto', gpu_id=None, gainslift_bins=-1, auc_type='auto', scale_pos_weight=1.0)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

XGBoost

Builds an eXtreme Gradient Boosting model using the native XGBoost backend.

property auc_type

Set default multinomial AUC type.

Type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"], defaults to "auto".

static available()[source]

Ask the H2O server whether an XGBoost model can be built (depends on the availability of native backends).

Returns

True if an XGBoost model can be built, or False otherwise.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(seed=1234)
>>> boston_xgb.available()
property backend

Backend. By default (auto), a GPU is used if available.

Type: Literal["auto", "gpu", "cpu"], defaults to "auto".

Examples

>>> pros = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> pros["CAPSULE"] = pros["CAPSULE"].asfactor()
>>> pros_xgb = H2OXGBoostEstimator(tree_method="exact",
...                                seed=123,
...                                backend="cpu")
>>> pros_xgb.train(y="CAPSULE",
...                ignored_columns=["ID"],
...                training_frame=pros)
>>> pros_xgb.auc()
property booster

Booster type

Type: Literal["gbtree", "gblinear", "dart"], defaults to "gbtree".

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(booster='dart',
...                                   normalize_type="tree",
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property build_tree_one_node

Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.

Type: bool, defaults to False.

property calibrate_model

Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.

Type: bool, defaults to False.

property calibration_frame

Calibration frame for Platt Scaling

Type: Union[None, str, H2OFrame].
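Platt scaling fits a sigmoid that maps raw model scores to probabilities, typically on data held out from training (the calibration frame). The following pure-Python sketch shows the idea with plain gradient descent; it is not H2O's implementation, and fit_platt, calibrate, the toy data, and the optimizer settings are all hypothetical.

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy calibration data: raw scores and the observed binary outcomes.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
```

Fitting on a separate frame matters because scores on the training data tend to be overconfident, which would bias the fitted sigmoid.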

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> encoding = "one_hot_explicit"
>>> airlines_xgb = H2OXGBoostEstimator(categorical_encoding=encoding,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.auc(valid=True)
property checkpoint

Model checkpoint to resume training with.

Type: Union[None, str, H2OEstimator].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","year","economy_20mpg"]
>>> response = "acceleration"
>>> from h2o.estimators import H2OXGBoostEstimator
>>> cars_xgb = H2OXGBoostEstimator(seed=1234)
>>> train, valid = cars.split_frame(ratios=[.8])
>>> cars_xgb.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_xgb.mse()
>>> cars_xgb_continued = H2OXGBoostEstimator(checkpoint=cars_xgb.model_id,
...                                          ntrees=51,
...                                          seed=1234)
>>> cars_xgb_continued.train(x=predictors,
...                          y=response,
...                          training_frame=train,
...                          validation_frame=valid)
>>> cars_xgb_continued.mse()
property col_sample_rate

(same as colsample_bylevel) Column sample rate (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property col_sample_rate_per_tree

(same as colsample_bytree) Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate_per_tree=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property colsample_bylevel

(same as col_sample_rate) Column sample rate (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(colsample_bylevel=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property colsample_bynode

Column sample rate per tree node (from 0.0 to 1.0)

Type: float, defaults to 1.0.
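In native XGBoost, the per-tree, per-level, and per-node column sample rates compound. The sketch below shows the arithmetic, on the assumption that the same behavior carries over through the native backend; the helper name columns_per_split and the rounding are hypothetical.

```python
def columns_per_split(n_columns, colsample_bytree=1.0,
                      colsample_bylevel=1.0, colsample_bynode=1.0):
    """Approximate number of candidate columns at a split when rates compound."""
    rate = colsample_bytree * colsample_bylevel * colsample_bynode
    # Assumption: round down, but always keep at least one column.
    return max(1, int(n_columns * rate))

n = columns_per_split(64, 0.5, 0.5, 0.5)
```

With 64 columns and all three rates at 0.5, only 8 columns remain as candidates at each split, so combining the rates shrinks the candidate set quickly.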

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(colsample_bynode=.5,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors, y=response,
...                    training_frame=train, validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property colsample_bytree

(same as col_sample_rate_per_tree) Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate_per_tree=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
convert_H2OXGBoostParams_2_XGBoostParams()[source]

In order to use convert_H2OXGBoostParams_2_XGBoostParams and convert_H2OFrame_2_DMatrix, you must import the following toolboxes: xgboost, pandas, numpy and scipy.sparse.

Given an H2OXGBoost model, this method will generate the corresponding parameters that should be used by native XGBoost in order to give exactly the same result, assuming that the same dataset (derived from h2oFrame) is used to train the native XGBoost model.

Follow the steps below to compare H2OXGBoost and native XGBoost:

  1. Train the H2OXGBoost model with H2OFrame trainFile and generate a prediction:

  • h2oModelD = H2OXGBoostEstimator(**h2oParamsD) # parameters specified as a dict()

  • h2oModelD.train(x=myX, y=y, training_frame=trainFile) # train with H2OFrame trainFile

  • h2oPredict = h2oPredictD = h2oModelD.predict(trainFile)

  2. Derive the DMatrix from H2OFrame:

  • nativeDMatrix = trainFile.convert_H2OFrame_2_DMatrix(myX, y, h2oModelD)

  3. Derive the parameters for native XGBoost:

  • nativeParams = h2oModelD.convert_H2OXGBoostParams_2_XGBoostParams()

  4. Train your native XGBoost model and generate a prediction:

  • nativeModel = xgb.train(params=nativeParams[0], dtrain=nativeDMatrix, num_boost_round=nativeParams[1])

  • nativePredict = nativeModel.predict(data=nativeDMatrix, ntree_limit=nativeParams[1])

  5. Compare the predictions h2oPredict from H2OXGBoost, nativePredict from native XGBoost.

Returns

nativeParams, num_boost_round
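The five steps above can be gathered into one helper. This is a sketch under stated assumptions, not a tested recipe: the function name compare_h2o_and_native and its arguments are hypothetical, a running H2O cluster and the xgboost package are required to call it, and the native prediction uses all trained rounds instead of passing ntree_limit, which newer xgboost releases may not accept.

```python
def compare_h2o_and_native(train_file, predictors, response, h2o_params):
    """Train H2OXGBoost and native XGBoost on the same data; return both predictions."""
    # Imports are deferred so this sketch can be loaded without H2O or XGBoost installed.
    import xgboost as xgb
    from h2o.estimators import H2OXGBoostEstimator

    # Step 1: train the H2O model and predict on the training frame.
    h2o_model = H2OXGBoostEstimator(**h2o_params)
    h2o_model.train(x=predictors, y=response, training_frame=train_file)
    h2o_predict = h2o_model.predict(train_file)

    # Step 2: derive the native DMatrix from the same H2OFrame.
    native_dmatrix = train_file.convert_H2OFrame_2_DMatrix(predictors, response, h2o_model)

    # Step 3: derive native parameters; element 0 is the parameter dict,
    # element 1 is the number of boosting rounds.
    native_params = h2o_model.convert_H2OXGBoostParams_2_XGBoostParams()

    # Step 4: train natively and predict with all trained rounds.
    native_model = xgb.train(params=native_params[0],
                             dtrain=native_dmatrix,
                             num_boost_round=native_params[1])
    native_predict = native_model.predict(native_dmatrix)

    # Step 5: hand both predictions back for comparison.
    return h2o_predict, native_predict
```

The H2O prediction is an H2OFrame while the native prediction is a NumPy array, so the caller still has to bring both into the same form before comparing values.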

property distribution

Distribution function

Type: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"], defaults to "auto".

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8],
...                                 seed=1234)
>>> cars_xgb = H2OXGBoostEstimator(distribution="poisson",
...                                seed=1234)
>>> cars_xgb.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_xgb.mse(valid=True)
property dmatrix_type

Type of DMatrix. For sparse, NAs and 0 are treated equally.

Type: Literal["auto", "dense", "sparse"], defaults to "auto".

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(dmatrix_type="auto",
...                                  seed=1234)
>>> boston_xgb.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_xgb.mse()
property eta

(same as learn_rate) Learning rate (from 0.0 to 1.0)

Type: float, defaults to 0.3.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(ntrees=10000,
...                                   eta=0.01,
...                                   stopping_rounds=5,
...                                   stopping_metric="AUC",
...                                   stopping_tolerance=1e-4,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from os import listdir
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> response = "IsDepDelayed"
>>> hyper_parameters = {'ntrees': [5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
...                'max_models': 5,
...                'seed': 1234,
...                'stopping_rounds': 3,
...                'stopping_metric': "AUTO",
...                'stopping_tolerance': 1e-2}
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_grid = H2OGridSearch(H2OXGBoostEstimator,
...                          hyper_params=hyper_parameters,
...                          search_criteria=search_crit)
>>> air_grid.train(x=predictors,
...                y=response,
...                training_frame=airlines,
...                distribution="bernoulli",
...                learn_rate=0.1,
...                max_depth=3,
...                export_checkpoints_dir=checkpoints_dir)
>>> len(listdir(checkpoints_dir))
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".
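The difference between the "modulo" and "random" schemes can be sketched in plain Python. This illustrates the general idea only: assign_folds is a hypothetical helper, and the row-index-modulo rule and the random generator used here are assumptions, not H2O's exact procedure.

```python
import random

def assign_folds(n_rows, nfolds, scheme="modulo", seed=1234):
    """Return a fold index for each row under a simple assignment scheme."""
    if scheme == "modulo":
        # Deterministic and evenly sized: row index modulo the number of folds.
        return [i % nfolds for i in range(n_rows)]
    if scheme == "random":
        # Seeded, so the same seed reproduces the same assignment.
        rng = random.Random(seed)
        return [rng.randrange(nfolds) for _ in range(n_rows)]
    raise ValueError("unsupported scheme: " + scheme)

modulo_folds = assign_folds(10, 3)
random_folds = assign_folds(10, 3, scheme="random")
```

A stratified scheme would additionally keep the class proportions of the response roughly equal across folds, which neither of the two schemes above guarantees.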

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> assignment_type = "Random"
>>> titanic_xgb = H2OXGBoostEstimator(fold_assignment=assignment_type,
...                                   nfolds=5,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=titanic)
>>> titanic_xgb.auc(xval=True)
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> fold_numbers = titanic.kfold_column(n_folds=5,
...                                     seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> titanic = titanic.cbind(fold_numbers)
>>> print(titanic['fold_numbers'])
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=titanic,
...                   fold_column="fold_numbers")
>>> titanic_xgb.auc(xval=True)
property gainslift_bins

Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning.

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> model = H2OXGBoostEstimator(ntrees=1, gainslift_bins=20)
>>> model.train(x=["Origin", "Distance"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> model.gains_lift()
property gamma

(same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen

Type: float, defaults to 0.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(min_split_improvement=1e-3,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property gpu_id

Which GPU(s) to use.

Type: List[int].

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(gpu_id=0,
...                                  seed=1234)
>>> boston_xgb.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> boston_xgb.mse()
property grow_policy

Tree growth policy. depthwise splits nodes level by level, as in standard GBM; lossguide splits the node with the highest loss reduction first, as in LightGBM.

Type: Literal["depthwise", "lossguide"], defaults to "depthwise".

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> titanic["const_1"] = 6
>>> titanic["const_2"] = 7
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234,
...                                   grow_policy="depthwise")
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_xgb.auc(valid=True)
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> titanic["const_1"] = 6
>>> titanic["const_2"] = 7
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234,
...                                   ignore_const_cols=True)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_xgb.auc(valid=True)
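
As a rough illustration of what "constant column" means here, the following sketch flags columns whose values never vary. It uses a plain dictionary of lists rather than an H2OFrame, and the helper name constant_columns and the toy data are hypothetical, not part of the H2O API.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Hypothetical toy data mirroring the const_1 / const_2 columns used above.
toy = {
    "age": [22, 38, 26],
    "const_1": [6, 6, 6],
    "const_2": [7, 7, 7],
}

# Columns like these carry no information for splitting and can be dropped.
print(constant_columns(toy))
```
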
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property interaction_constraints

A set of allowed column interactions.

Type: List[List[str]].

property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_fold_assignment=True,
...                                   nfolds=5,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train)
>>> titanic_xgb.cross_validation_fold_assignment()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_models=True,
...                                   nfolds=5,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train)
>>> titanic_xgb.cross_validation_models()
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_predictions=True,
...                                   nfolds=5,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train)
>>> titanic_xgb.cross_validation_predictions()
property learn_rate

(same as eta) Learning rate (from 0.0 to 1.0)

Type: float, defaults to 0.3.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(ntrees=10000,
...                                   learn_rate=0.01,
...                                   stopping_rounds=5,
...                                   stopping_metric="AUC",
...                                   stopping_tolerance=1e-4,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
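
The interplay between a small learning rate and a large number of trees can be sketched without H2O. The toy below assumes each boosting round fits the remaining residual exactly and then adds only a learn_rate fraction of it; the function name shrunken_fit is hypothetical and the model is deliberately simplistic.

```python
def shrunken_fit(target, learn_rate, n_rounds):
    """Toy boosting loop: each round adds learn_rate times the residual."""
    prediction = 0.0
    for _ in range(n_rounds):
        residual = target - prediction
        prediction += learn_rate * residual
    return prediction


# With the same number of rounds, a smaller learning rate gets less far,
# which is why it is usually paired with more trees and early stopping.
print(shrunken_fit(10.0, learn_rate=0.3, n_rounds=50))
print(shrunken_fit(10.0, learn_rate=0.01, n_rounds=50))
```
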
property max_abs_leafnode_pred

(same as max_delta_step) Maximum absolute value of a leaf node prediction

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8],
...                                    seed=1234)
>>> cov_xgb = H2OXGBoostEstimator(max_abs_leafnode_pred=float(2),
...                               seed=1234)
>>> cov_xgb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cov_xgb.logloss(valid=True))
property max_bins

For tree_method=hist only: maximum number of bins

Type: int, defaults to 256.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8],
...                                    seed=1234)
>>> cov_xgb = H2OXGBoostEstimator(max_bins=200,
...                               seed=1234)
>>> cov_xgb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cov_xgb.logloss(valid=True))
property max_delta_step

(same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8],
...                                    seed=1234)
>>> cov_xgb = H2OXGBoostEstimator(max_delta_step=float(2),
...                               seed=1234)
>>> cov_xgb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cov_xgb.logloss(valid=True))
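
One way to picture this parameter is as a cap on the magnitude of each leaf's output, with 0 meaning no cap. The sketch below is an assumption-laden illustration of that idea, not XGBoost's internal code; the helper name cap_leaf_value is hypothetical.

```python
def cap_leaf_value(raw_value, max_delta_step):
    """Clip a leaf value to [-max_delta_step, max_delta_step]; 0 disables the cap."""
    if max_delta_step == 0:
        return raw_value
    return max(-max_delta_step, min(max_delta_step, raw_value))


# With max_delta_step=2, as in the example above, large leaf values are clipped.
print(cap_leaf_value(5.0, 2.0))
print(cap_leaf_value(-5.0, 2.0))
print(cap_leaf_value(1.5, 2.0))
```
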
property max_depth

Maximum tree depth (0 for unlimited).

Type: int, defaults to 6.

Examples

>>> df = h2o.import_file(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> response = "survived"
>>> df[response] = df[response].asfactor()
>>> predictors = df.columns
>>> del predictors[1:3]
>>> train, valid, test = df.split_frame(ratios=[0.6,0.2],
...                                     seed=1234,
...                                     destination_frames=
...                                     ['train.hex',
...                                     'valid.hex',
...                                     'test.hex'])
>>> xgb = H2OXGBoostEstimator()
>>> xgb.train(x=predictors,
...           y=response,
...           training_frame=train)
>>> perf = xgb.model_performance(valid)
>>> print(perf.auc())
property max_leaves

For tree_method=hist only: maximum number of leaves

Type: int, defaults to 0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(max_leaves=0, seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8],
...                                    seed=1234)
>>> cov_xgb = H2OXGBoostEstimator(max_runtime_secs=10,
...                               ntrees=10000,
...                               max_depth=10,
...                               seed=1234)
>>> cov_xgb.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cov_xgb.logloss(valid=True))
property min_child_weight

(same as min_rows) Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 1.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(min_child_weight=16,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property min_rows

(same as min_child_weight) Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 1.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(min_rows=16,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property min_split_improvement

(same as gamma) Minimum relative improvement in squared error reduction for a split to happen

Type: float, defaults to 0.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(min_split_improvement=0.55,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
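
To see how a threshold of this kind can prune splits, the sketch below assumes the split-gain formula from the XGBoost paper, where gamma is subtracted from the structure-score improvement and a split is kept only if the result is positive. The function name split_gain and the gradient and Hessian sums are illustrative assumptions.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split given gradient (g) and Hessian (h) sums."""

    def score(g, h):
        # Structure score of a single leaf.
        return g * g / (h + reg_lambda)

    improvement = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return improvement - gamma


# The same candidate split is accepted with gamma=0 and rejected with gamma=5.
print(split_gain(-4.0, 3.0, 4.0, 3.0, gamma=0.0) > 0)
print(split_gain(-4.0, 3.0, 4.0, 3.0, gamma=5.0) > 0)
```
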
property monotone_constraints

A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.

Type: dict.

Examples

>>> prostate_hex = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor()
>>> response = "CAPSULE"
>>> seed=42
>>> monotone_constraints={"AGE":1}
>>> xgb_model = H2OXGBoostEstimator(seed=seed,
...                                 monotone_constraints=monotone_constraints)
>>> xgb_model.train(y=response,
...                 ignored_columns=["ID"],
...                 training_frame=prostate_hex)
>>> xgb_model.scoring_history()
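
A constraint of +1 on a column means the prediction should never decrease as that column increases while the other columns are held fixed. The check below is a minimal sketch of that property on a list of predictions already ordered by the constrained feature; respects_constraint is a hypothetical helper, not an H2O function.

```python
def respects_constraint(predictions, direction):
    """Check predictions (ordered by increasing feature value) against a constraint.

    direction is +1 (non-decreasing), -1 (non-increasing), or 0 (unconstrained).
    """
    pairs = list(zip(predictions, predictions[1:]))
    if direction == 1:
        return all(later >= earlier for earlier, later in pairs)
    if direction == -1:
        return all(later <= earlier for earlier, later in pairs)
    return True


# Illustrative predictions for increasing values of a constrained column.
print(respects_constraint([0.10, 0.15, 0.15, 0.40], 1))
print(respects_constraint([0.10, 0.30, 0.20], 1))
```
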
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> folds = 5
>>> titanic_xgb = H2OXGBoostEstimator(nfolds=folds,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=titanic)
>>> titanic_xgb.auc(xval=True)
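
The idea behind K-fold cross-validation can be sketched in a few lines: every row is held out exactly once, and the model is trained on the remaining rows each time. The modulo assignment below is chosen only for simplicity and is an assumption; it is not a statement about how H2O assigns folds. The helper name kfold_indices is hypothetical.

```python
def kfold_indices(n_rows, nfolds):
    """Return (train_rows, holdout_rows) index lists for each fold."""
    if nfolds == 0:
        return []  # cross-validation disabled
    if nfolds < 2:
        raise ValueError("nfolds must be 0 or >= 2")
    splits = []
    for fold in range(nfolds):
        holdout = [i for i in range(n_rows) if i % nfolds == fold]
        train = [i for i in range(n_rows) if i % nfolds != fold]
        splits.append((train, holdout))
    return splits


for train_rows, holdout_rows in kfold_indices(10, 5):
    print(len(train_rows), holdout_rows)
```
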
property normalize_type

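For booster=dart only: normalization algorithm. With tree, a new tree has the same weight as each dropped tree; with forest, a new tree has the same weight as the sum of the dropped trees.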

Type: Literal["tree", "forest"], defaults to "tree".

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(booster='dart',
...                                   normalize_type="tree",
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
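
The two normalization modes differ in how the new tree and the dropped trees are rescaled. The sketch below follows the weights given in XGBoost's DART documentation, assuming k trees were dropped in the current round; the function name dart_weights is hypothetical.

```python
def dart_weights(n_dropped, learn_rate, normalize_type="tree"):
    """Return (new_tree_weight, dropped_tree_scale) for a DART round."""
    if normalize_type == "tree":
        new_tree = 1.0 / (n_dropped + learn_rate)
        dropped_scale = n_dropped / (n_dropped + learn_rate)
    elif normalize_type == "forest":
        new_tree = 1.0 / (1.0 + learn_rate)
        dropped_scale = 1.0 / (1.0 + learn_rate)
    else:
        raise ValueError("normalize_type must be 'tree' or 'forest'")
    return new_tree, dropped_scale


print(dart_weights(3, 0.5, "tree"))
print(dart_weights(3, 0.5, "forest"))
```
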
property nthread

Number of parallel threads that can be used to run XGBoost. Cannot exceed H2O cluster limits (-nthreads parameter). Defaults to the maximum available.

Type: int, defaults to -1.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> thread = 4
>>> titanic_xgb = H2OXGBoostEstimator(nthread=thread,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=titanic)
>>> print(titanic_xgb.auc(train=True))
property ntrees

(same as n_estimators) Number of trees.

Type: int, defaults to 50.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> tree_num = [20, 50, 80, 110, 140, 170, 200]
>>> label = ["20", "50", "80", "110",
...          "140", "170", "200"]
>>> for key, num in enumerate(tree_num):
...     # 'key' indexes into label; 'num' is the number of trees to build
...     titanic_xgb = H2OXGBoostEstimator(ntrees=num,
...                                       seed=1234)
...     titanic_xgb.train(x=predictors,
...                       y=response,
...                       training_frame=train,
...                       validation_frame=valid)
...     print(label[key], 'training score',
...           titanic_xgb.auc(train=True))
...     print(label[key], 'validation score',
...           titanic_xgb.auc(valid=True))
property offset_column

Offset column. This will be added to the combination of columns before applying the link function.

Type: str.
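
A common use of an offset is to account for exposure in count models. The sketch below assumes a log link (for example, a Poisson-style objective), where the offset is added to the raw model score before the inverse link is applied. The helper name predict_with_offset and the numbers are illustrative only.

```python
import math


def predict_with_offset(raw_score, offset):
    """Apply a log-link inverse (exp) to the raw score plus the offset."""
    return math.exp(raw_score + offset)


# An offset of log(exposure) scales the predicted count by the exposure.
exposure = 2.0
print(predict_with_offset(0.0, math.log(exposure)))
print(predict_with_offset(0.0, 0.0))
```
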

property one_drop

For booster=dart only: when enabled, at least one tree is always dropped during the dropout.

Type: bool, defaults to False.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(booster='dart',
...                                   one_drop=True,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property quiet_mode

Enable quiet mode

Type: bool, defaults to True.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234, quiet_mode=True)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_xgb.mse(valid=True)
property rate_drop

For booster=dart only: dropout rate, the fraction of previous trees to drop during the dropout (0..1)

Type: float, defaults to 0.0.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(booster='dart',
...                                   rate_drop=0.1,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> print(titanic_xgb.auc(valid=True))
property reg_alpha

L1 regularization

Type: float, defaults to 0.0.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_xgb = H2OXGBoostEstimator(reg_alpha=.25)
>>> boston_xgb.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> print(boston_xgb.mse(valid=True))
property reg_lambda

L2 regularization

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8])
>>> airlines_xgb = H2OXGBoostEstimator(reg_lambda=.0001,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
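
The effect of the L1 (reg_alpha) and L2 (reg_lambda) terms on a single leaf can be sketched as follows, assuming the usual XGBoost leaf-weight rule: the gradient sum is soft-thresholded by alpha and the Hessian sum is increased by lambda. The function name leaf_weight and the input values are illustrative assumptions.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Optimal leaf weight with L1 soft-thresholding and L2 shrinkage."""
    if grad_sum > reg_alpha:
        thresholded = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        thresholded = grad_sum + reg_alpha
    else:
        thresholded = 0.0  # L1 pushes small gradients to exactly zero
    return -thresholded / (hess_sum + reg_lambda)


# A larger reg_lambda shrinks the weight; reg_alpha can zero it out entirely.
print(leaf_weight(-4.0, 3.0, reg_alpha=0.0, reg_lambda=1.0))
print(leaf_weight(-4.0, 3.0, reg_alpha=0.0, reg_lambda=5.0))
print(leaf_weight(0.5, 3.0, reg_alpha=1.0, reg_lambda=1.0))
```
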
property response_column

Response variable column.

Type: str.

property sample_rate

(same as subsample) Row sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(sample_rate=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
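
Row sampling per tree can be pictured as drawing a fresh subset of row indices before each tree is built. The sketch below uses Python's random module and samples without replacement; that choice, the helper name sample_rows, and the row counts are assumptions for illustration and do not describe H2O's internal sampler.

```python
import random


def sample_rows(n_rows, sample_rate, seed):
    """Pick a reproducible subset of row indices of size n_rows * sample_rate."""
    rng = random.Random(seed)
    n_selected = int(round(n_rows * sample_rate))
    return sorted(rng.sample(range(n_rows), n_selected))


rows = sample_rows(100, 0.7, seed=1234)
print(len(rows))
# The same seed yields the same subset, which keeps runs repeatable.
print(rows == sample_rows(100, 0.7, seed=1234))
```
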
property sample_type

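For booster=dart only: how trees are sampled for dropout. With uniform, dropped trees are selected uniformly; with weighted, they are selected in proportion to their weight.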

Type: Literal["uniform", "weighted"], defaults to "uniform".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"]= airlines["Year"].asfactor()
>>> airlines["Month"]= airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(booster="dart",
...                                    sample_type="weighted",
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property save_matrix_directory

Directory in which to save the matrices passed to the XGBoost library. Useful for debugging.

Type: str.

property scale_pos_weight

Controls the effect of observations with positive labels in relation to the observations with negative labels on gradient calculation. Useful for imbalanced problems.

Type: float, defaults to 1.0.
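
A commonly cited starting point for this parameter, mentioned in the XGBoost documentation, is the ratio of negative to positive instances. The sketch below computes that ratio for a list of 0/1 labels; the helper name suggested_scale_pos_weight is hypothetical, and the result is a heuristic to tune from, not a recommended final value.

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive labels present")
    return negatives / positives


# Illustrative imbalanced labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
print(suggested_scale_pos_weight(labels))
```
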

property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(score_each_iteration=True,
...                                    ntrees=55,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.scoring_history()
property score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

Type: int, defaults to 0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(score_tree_interval=5,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.scoring_history()
property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> xgb_w_seed_1 = H2OXGBoostEstimator(col_sample_rate=.7,
...                                    seed=1234)
>>> xgb_w_seed_1.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> xgb_w_seed_2 = H2OXGBoostEstimator(col_sample_rate = .7,
...                                    seed = 1234)
>>> xgb_w_seed_2.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print('auc for the 1st model built with a seed:',
...        xgb_w_seed_1.auc(valid=True))
>>> print('auc for the 2nd model built with a seed:',
...        xgb_w_seed_2.auc(valid=True))
property skip_drop

For booster=dart only: probability of skipping the dropout procedure during a boosting iteration (0..1)

Type: float, defaults to 0.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(booster="dart",
...                                    skip_drop=0.5,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train)
>>> airlines_xgb.auc(train=True)
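
How skip_drop, rate_drop, and one_drop interact in a single boosting round can be sketched as below. This is a simplified reading of the DART behavior described in the XGBoost documentation, assuming uniform sampling; select_dropped_trees is a hypothetical helper, not a library function.

```python
import random


def select_dropped_trees(n_trees, rate_drop, skip_drop, one_drop, rng):
    """Return indices of existing trees to drop in this boosting round."""
    # With probability skip_drop the dropout step is skipped entirely.
    if n_trees == 0 or rng.random() < skip_drop:
        return []
    # Otherwise each existing tree is dropped with probability rate_drop.
    dropped = [t for t in range(n_trees) if rng.random() < rate_drop]
    # one_drop guarantees at least one dropped tree when dropout runs.
    if one_drop and not dropped:
        dropped = [rng.randrange(n_trees)]
    return dropped


rng = random.Random(1234)
print(select_dropped_trees(10, rate_drop=0.1, skip_drop=0.5, one_drop=False, rng=rng))
```
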
property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc",
...                                    stopping_rounds=3,
...                                    stopping_tolerance=1e-2,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.auc(valid=True)
property stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)

Type: int, defaults to 0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc",
...                                    stopping_rounds=3,
...                                    stopping_tolerance=1e-2,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.auc(valid=True)
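
The moving-average rule described for stopping_rounds and stopping_tolerance can be sketched in plain Python. The function below is a simplified reading of that description, not H2O's implementation: it compares the best of the most recent moving averages against an earlier reference average and stops when the relative improvement falls below the tolerance. The name should_stop and the metric histories are hypothetical.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, larger_is_better=False):
    """Decide whether to stop, given the metric value at each scoring event."""
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    recent = history[-2 * k:]
    # k + 1 moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, later = averages[0], averages[1:]
    best = max(later) if larger_is_better else min(later)
    if reference == 0:
        return False
    gain = (best - reference) if larger_is_better else (reference - best)
    return gain / abs(reference) < stopping_tolerance


# A logloss history that has flattened out triggers the stop.
print(should_stop([0.9, 0.6, 0.55, 0.55, 0.55, 0.55, 0.55, 0.55], 3, 1e-2))
# A history that is still improving does not.
print(should_stop([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], 3, 1e-2))
```
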
property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.001.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc",
...                                    stopping_rounds=3,
...                                    stopping_tolerance=1e-2,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> airlines_xgb.auc(valid=True)
property subsample

(same as sample_rate) Row sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(subsample=.7,
...                                    seed=1234)
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_xgb.auc(valid=True)
property tree_method

Tree method

Type: Literal["auto", "exact", "approx", "hist"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid= airlines.split_frame(ratios=[.8],
...                                    seed=1234)
>>> airlines_xgb = H2OXGBoostEstimator(seed=1234,
...                                    tree_method="approx")
>>> airlines_xgb.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> print(airlines_xgb.auc(valid=True))
property tweedie_power

Tweedie power for Tweedie regression, must be between 1 and 2.

Type: float, defaults to 1.5.

Examples

>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> train, valid = insurance.split_frame(ratios=[.8],
...                                      seed=1234)
>>> insurance_xgb = H2OXGBoostEstimator(distribution="tweedie",
...                                     tweedie_power=1.2,
...                                     seed=1234)
>>> insurance_xgb.train(x=predictors,
...                     y=response,
...                     training_frame=train,
...                     validation_frame=valid)
>>> print(insurance_xgb.mse(valid=True))
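
To show how the power parameter enters a Tweedie objective, here is a minimal sketch of the textbook unit Tweedie deviance for a single observation. It is independent of H2O; the function name `tweedie_deviance` is hypothetical, and the strict bounds `1 < power < 2` are an assumption needed by this closed form (the description above only says "between 1 and 2").

```python
def tweedie_deviance(y, mu, power):
    """Textbook unit Tweedie deviance, valid for 1 < power < 2."""
    if not 1.0 < power < 2.0:
        raise ValueError("power must lie strictly between 1 and 2")
    if y < 0 or mu <= 0:
        raise ValueError("requires y >= 0 and mu > 0")
    a = y ** (2.0 - power) / ((1.0 - power) * (2.0 - power))
    b = y * mu ** (1.0 - power) / (1.0 - power)
    c = mu ** (2.0 - power) / (2.0 - power)
    return 2.0 * (a - b + c)


# The deviance vanishes for a perfect prediction and grows as mu moves away from y.
print(tweedie_deviance(3.0, 3.0, 1.2))
print(tweedie_deviance(3.0, 1.0, 1.2))
```

A power close to 1 behaves more like a Poisson model and a power close to 2 more like a Gamma model, which is why the value is restricted to that interval.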
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> train, valid = insurance.split_frame(ratios=[.8],
...                                      seed=1234)
>>> insurance_xgb = H2OXGBoostEstimator(seed=1234)
>>> insurance_xgb.train(x=predictors,
...                     y=response,
...                     training_frame=train,
...                     validation_frame=valid)
>>> print(insurance_xgb.mse(valid=True))
property weights_column

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more because of the larger loss function pre-factor. If you set weight = 0 for a row, the prediction returned for that row is zero, which is incorrect. To get accurate predictions, remove all rows with weight == 0.

Type: str.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8],
...                                    seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> titanic_xgb.auc(valid=True)
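
The equivalence between weights and repeated rows described above can be checked on a toy loss. The sketch below is plain Python, not H2O code; `weighted_mse` is a hypothetical helper that uses the weights as pre-factors in a mean squared error.

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean squared error with per-row weights as loss pre-factors."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    total = sum(weights)
    return sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)) / total


y_true = [1.0, 5.0, 2.0]
y_pred = [0.0, 0.0, 4.0]

# Weight 2 on the first row, weight 0 on the second, weight 1 on the third ...
weighted = weighted_mse(y_true, y_pred, [2, 0, 1])
# ... matches repeating the first row twice and dropping the second row.
expanded = weighted_mse([1.0, 1.0, 2.0], [0.0, 0.0, 4.0], [1, 1, 1])
print(weighted, expanded)
```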

Unsupervised

H2OAggregatorEstimator

class h2o.estimators.aggregator.H2OAggregatorEstimator(model_id=None, training_frame=None, response_column=None, ignored_columns=None, ignore_const_cols=True, target_num_exemplars=5000, rel_tol_num_exemplars=0.5, transform='normalize', categorical_encoding='auto', save_mapping_frame=False, num_iteration_without_new_exemplar=500, export_checkpoints_dir=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Aggregator

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen"}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> new_df = agg.aggregated_frame
>>> new_df
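
For readers unfamiliar with the encoding names, the sketch below shows the conventional meaning of label encoding and one-hot encoding on a small list of categorical values. It is plain Python with hypothetical helper names; H2O's own encoders may differ in details such as level ordering or the handling of missing values, which this sketch does not model.

```python
def label_encode(values):
    """Map each level to an integer index (levels sorted lexicographically)."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values], levels


def one_hot_encode(values):
    """Expand each value into one 0/1 indicator per level."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values], levels


colors = ["red", "blue", "red", "green"]
print(label_encode(colors))
print(one_hot_encode(colors))
```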
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> checkpoints_dir = tempfile.mkdtemp()
>>> model = H2OAggregatorEstimator(target_num_exemplars=500, 
...                                rel_tol_num_exemplars=0.3,
...                                export_checkpoints_dir=checkpoints_dir)
>>> model.train(training_frame=df)
>>> new_df = model.aggregated_frame
>>> new_df
>>> len(listdir(checkpoints_dir))
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> params = {"ignore_const_cols": False,
...           "target_num_exemplars": 500,
...           "rel_tol_num_exemplars": 0.3,
...           "transform": "standardize",
...           "categorical_encoding": "eigen"}
>>> model = H2OAggregatorEstimator(**params)
>>> model.train(training_frame=df)
>>> new_df = model.aggregated_frame
>>> new_df
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property num_iteration_without_new_exemplar

The number of iterations to run before aggregator exits if the number of exemplars collected didn’t change

Type: int, defaults to 500.

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen",
...           "num_iteration_without_new_exemplar": 400}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> new_df = agg.aggregated_frame
>>> new_df
property rel_tol_num_exemplars

Relative tolerance for the number of exemplars (e.g., 0.5 is +/- 50 percent)

Type: float, defaults to 0.5.

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen",
...           "num_iteration_without_new_exemplar": 400}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> new_df = agg.aggregated_frame
>>> new_df
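
The "+/- 50 percent" reading of the tolerance can be turned into an explicit range. The sketch below is plain Python with hypothetical helper names; it assumes the tolerance is applied symmetrically around `target_num_exemplars`, as the description suggests.

```python
def exemplar_bounds(target_num_exemplars, rel_tol_num_exemplars):
    """Return the (lower, upper) exemplar counts implied by the tolerance."""
    lower = target_num_exemplars * (1.0 - rel_tol_num_exemplars)
    upper = target_num_exemplars * (1.0 + rel_tol_num_exemplars)
    return lower, upper


def within_tolerance(num_exemplars, target_num_exemplars, rel_tol_num_exemplars):
    """Check whether an exemplar count falls inside the tolerated range."""
    lower, upper = exemplar_bounds(target_num_exemplars, rel_tol_num_exemplars)
    return lower <= num_exemplars <= upper


# With the parameters from the example above: 1000 exemplars, tolerance 0.5.
print(exemplar_bounds(1000, 0.5))
print(within_tolerance(1400, 1000, 0.5))
```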
property response_column

Response variable column.

Type: str.

property save_mapping_frame

Whether to export the mapping of the aggregated frame

Type: bool, defaults to False.

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen",
...           "save_mapping_frame": True}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> mapping_frame = agg.mapping_frame
>>> mapping_frame
property target_num_exemplars

Targeted number of exemplars

Type: int, defaults to 5000.

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen",
...           "num_iteration_without_new_exemplar": 400}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> new_df = agg.aggregated_frame
>>> new_df
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> df = h2o.create_frame(rows=10000,
...                       cols=10,
...                       categorical_fraction=0.6,
...                       integer_fraction=0,
...                       binary_fraction=0,
...                       real_range=100,
...                       integer_range=100,
...                       missing_fraction=0,
...                       factors=100,
...                       seed=1234)
>>> params = {"target_num_exemplars": 1000,
...           "rel_tol_num_exemplars": 0.5,
...           "categorical_encoding": "eigen",
...           "num_iteration_without_new_exemplar": 400}
>>> agg = H2OAggregatorEstimator(**params)
>>> agg.train(training_frame=df)
>>> new_df = agg.aggregated_frame
>>> new_df
property transform

Transformation of training data

Type: Literal["none", "standardize", "normalize", "demean", "descale"], defaults to "normalize".

Examples

>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> params = {"ignore_const_cols": False,
...           "target_num_exemplars": 500,
...           "rel_tol_num_exemplars": 0.3,
...           "transform": "standardize",
...           "categorical_encoding": "eigen"}
>>> model = H2OAggregatorEstimator(**params)
>>> model.train(training_frame=df)
>>> new_df = model.aggregated_frame
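
The sketch below illustrates the conventional meaning of three of the transform names on a single numeric column. It is plain Python and only an approximation of intent: the helper functions are hypothetical, the sample standard deviation is an assumption, and "normalize" is left out because its exact definition is not stated here.

```python
import statistics


def demean(column):
    """Subtract the column mean from every value."""
    mean = statistics.mean(column)
    return [v - mean for v in column]


def descale(column):
    """Divide every value by the column's sample standard deviation."""
    sd = statistics.stdev(column)
    return [v / sd for v in column]


def standardize(column):
    """Demean and descale in one step (zero mean, unit standard deviation)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]


column = [2.0, 4.0, 6.0]
print(demean(column), descale(column), standardize(column))
```

A constant column has zero standard deviation, so `descale` and `standardize` as written would raise `ZeroDivisionError` on it.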

H2OAutoEncoderEstimator

class h2o.estimators.deeplearning.H2OAutoEncoderEstimator(**kwargs)[source]

Bases: h2o.estimators.deeplearning.H2ODeepLearningEstimator

Examples

>>> import h2o as ml
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> ml.init()
>>> rows = [[1,2,3,4,0]*50, [2,1,2,4,1]*50, [2,1,4,2,1]*50, [0,1,2,34,1]*50, [2,3,4,1,0]*50]
>>> fr = ml.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2OAutoEncoderEstimator()
>>> model.train(x=list(range(4)), training_frame=fr)

H2OExtendedIsolationForestEstimator

class h2o.estimators.extended_isolation_forest.H2OExtendedIsolationForestEstimator(model_id=None, training_frame=None, ignored_columns=None, ignore_const_cols=True, categorical_encoding='auto', ntrees=100, sample_size=256, extension_level=0, seed=-1)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Extended Isolation Forest

Builds an Extended Isolation Forest model. Extended Isolation Forest generalizes its predecessor algorithm, Isolation Forest. The original Isolation Forest algorithm suffers from bias due to tree branching. The extension mitigates this bias by adjusting the branching, so that the original algorithm becomes a special case. The “extension_level” attribute controls the degree of generalization: the minimum value, 0, reproduces Isolation Forest’s behavior, and the maximum value, (numCols - 1), gives full extension. The rest of the algorithm is analogous to the Isolation Forest algorithm. Each iteration builds a tree that partitions the space of the sampled observations until it isolates an observation. The length of the path from the root to a leaf node of the resulting tree is used to calculate the anomaly score. Anomalies are easier to isolate, so their average tree path is expected to be shorter than the paths of regular observations. The anomaly score is a number between 0 and 1: a score closer to 0 indicates a normal point, and a score closer to 1 indicates a more anomalous point.
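
The relationship between path length and anomaly score can be sketched with the normalization from the original Isolation Forest paper. This is an illustration in plain Python, not H2O's implementation: the function names are hypothetical and the exact normalization H2O applies is assumed, not confirmed, to match.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """Expected path length of an unsuccessful search in a tree built on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, sample_size):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(sample_size))


# A point isolated after ~3 splits looks more anomalous than one needing ~12.
print(anomaly_score(3.0, 256))
print(anomaly_score(12.0, 256))
```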

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> encoding = "one_hot_explicit"
>>> airlines_eif = H2OExtendedIsolationForestEstimator(categorical_encoding = encoding,
...                                                    seed = 1234)
>>> airlines_eif.train(x = predictors,
...                   training_frame = airlines)
>>> airlines_eif.model_performance()
property extension_level

Maximum is N - 1 (N = numCols). Minimum is 0. Extended Isolation Forest with extension_level = 0 behaves like Isolation Forest.

Type: int, defaults to 0.

Examples

>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/single_blob.csv")
>>> eif_model = H2OExtendedIsolationForestEstimator(extension_level = 1,
...                                                 ntrees=7)
>>> eif_model.train(training_frame = train)
>>> print(eif_model)
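
One way to picture the extension level is through the random split direction used at each tree node. The sketch below follows the construction in the Extended Isolation Forest paper, where `extension_level + 1` coordinates of the split normal are non-zero; it is plain Python with hypothetical helper names and is not taken from H2O's source.

```python
import random


def random_split_normal(num_cols, extension_level, rng):
    """Draw a split direction with (extension_level + 1) non-zero coordinates."""
    if not 0 <= extension_level <= num_cols - 1:
        raise ValueError("extension_level must be between 0 and num_cols - 1")
    normal = [rng.gauss(0.0, 1.0) for _ in range(num_cols)]
    # Zero out the remaining coordinates; level 0 leaves one axis (Isolation Forest).
    num_zeroed = num_cols - 1 - extension_level
    for idx in rng.sample(range(num_cols), num_zeroed):
        normal[idx] = 0.0
    return normal


def goes_left(point, intercept, normal):
    """Branch on which side of the hyperplane through `intercept` the point lies."""
    return sum((x - p) * n for x, p, n in zip(point, intercept, normal)) <= 0.0


rng = random.Random(1234)
print(random_split_normal(num_cols=4, extension_level=0, rng=rng))
print(random_split_normal(num_cols=4, extension_level=3, rng=rng))
```

With `extension_level=0` every split is parallel to a coordinate axis, while the maximum level allows arbitrarily oriented hyperplanes.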
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year","const_1","const_2"]
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)
>>> cars_eif = H2OExtendedIsolationForestEstimator(seed = 1234,
...                                                ignore_const_cols = True)
>>> cars_eif.train(x = predictors,
...               training_frame = cars)
>>> cars_eif.model_performance()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property ntrees

Number of Extended Isolation Forest trees.

Type: int, defaults to 100.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = titanic.columns
>>> tree_num = [20, 50, 80, 110, 140, 170, 200]
>>> label = ["20", "50", "80", "110", "140", "170", "200"]
>>> for key, num in enumerate(tree_num):
...     titanic_eif = H2OExtendedIsolationForestEstimator(ntrees = num,
...                                                       seed = 1234,
...                                                       extension_level = titanic.dim[1] - 1)
...     titanic_eif.train(x = predictors,
...                      training_frame = titanic) 
property sample_size

Number of randomly sampled observations used to train each Extended Isolation Forest tree.

Type: int, defaults to 256.

Examples

>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv")
>>> eif_model = H2OExtendedIsolationForestEstimator(sample_size = 5,
...                                                 ntrees=7)
>>> eif_model.train(training_frame = train)
>>> print(eif_model)
property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> eif_w_seed = H2OExtendedIsolationForestEstimator(seed = 1234) 
>>> eif_w_seed.train(x = predictors,
...                        training_frame = airlines)
>>> eif_wo_seed = H2OExtendedIsolationForestEstimator()
>>> eif_wo_seed.train(x = predictors,
...                         training_frame = airlines)
>>> print(eif_w_seed)
>>> print(eif_wo_seed)
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_eif = H2OExtendedIsolationForestEstimator(seed = 1234, 
...                                                sample_size = 256, 
...                                                extension_level = cars.dim[1] - 1)
>>> cars_eif.train(x = predictors,
...                training_frame = cars)
>>> print(cars_eif)

H2OGenericEstimator

class h2o.estimators.generic.H2OGenericEstimator(model_id=None, model_key=None, path=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Import MOJO Model

static from_file(file=<class 'str'>, model_id=None)[source]

Creates a new Generic model by loading an existing embedded model into the library, e.g. from an H2O MOJO. The imported model must be supported by H2O.

Parameters
  • file – A string containing path to the file to create the model from

  • model_id – Model ID

Returns

H2OGenericEstimator instance representing the generic model

Examples

>>> from h2o.estimators import H2OIsolationForestEstimator, H2OGenericEstimator
>>> import tempfile
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> ifr = H2OIsolationForestEstimator(ntrees=1)
>>> ifr.train(x=["Origin","Dest"], y="Distance", training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = ifr.download_mojo(original_model_filename)
>>> model = H2OGenericEstimator.from_file(original_model_filename)
>>> model.model_performance()
property model_key

Key to the self-contained model archive already uploaded to H2O.

Type: Union[None, str, H2OFrame].

Examples

>>> from h2o.estimators import H2OGenericEstimator, H2OXGBoostEstimator
>>> import tempfile
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> y = "IsDepDelayed"
>>> x = ["fYear","fMonth","Origin","Dest","Distance"]
>>> xgb = H2OXGBoostEstimator(ntrees=1, nfolds=3)
>>> xgb.train(x=x, y=y, training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = xgb.download_mojo(original_model_filename)
>>> key = h2o.lazy_import(original_model_filename)
>>> fr = h2o.get_frame(key[0])
>>> model = H2OGenericEstimator(model_key=fr)
>>> model.train()
>>> model.auc()
property path

Path to file with self-contained model archive.

Type: str.

Examples

>>> from h2o.estimators import H2OIsolationForestEstimator, H2OGenericEstimator
>>> import tempfile
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv")
>>> ifr = H2OIsolationForestEstimator(ntrees=1)
>>> ifr.train(x=["Origin","Dest"], y="Distance", training_frame=airlines)
>>> generic_mojo_filename = tempfile.mkdtemp("zip","genericMojo")
>>> generic_mojo_filename = ifr.download_mojo(path=generic_mojo_filename)
>>> model = H2OGenericEstimator.from_file(generic_mojo_filename)
>>> model.model_performance()

H2OGeneralizedLowRankEstimator

class h2o.estimators.glrm.H2OGeneralizedLowRankEstimator(model_id=None, training_frame=None, validation_frame=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, representation_name=None, loading_name=None, transform='none', k=1, loss='quadratic', loss_by_col=None, loss_by_col_idx=None, multi_loss='categorical', period=1, regularization_x='none', regularization_y='none', gamma_x=0.0, gamma_y=0.0, max_iterations=1000, max_updates=2000, init_step_size=1.0, min_step_size=0.0001, seed=-1, init='plus_plus', svd_method='randomized', user_y=None, user_x=None, expand_user_y=True, impute_original=False, recover_svd=False, max_runtime_secs=0.0, export_checkpoints_dir=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Generalized Low Rank Modeling

Builds a generalized low rank model of an H2O dataset.

property expand_user_y

Expand categorical columns in user-specified initial Y

Type: bool, defaults to True.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> rank = 3
>>> gx = 0.5
>>> gy = 0.5
>>> trans = "standardize"
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank,
...                                            loss="quadratic",
...                                            gamma_x=gx,
...                                            gamma_y=gy,
...                                            transform=trans,
...                                            expand_user_y=False)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> checkpoints_dir = tempfile.mkdtemp()
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            export_checkpoints_dir=checkpoints_dir,
...                                            seed=1234)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> len(listdir(checkpoints_dir))
property gamma_x

Regularization weight on X matrix

Type: float, defaults to 0.0.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> rank = 3
>>> gx = 0.5
>>> gy = 0.5
>>> trans = "standardize"
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank,
...                                            loss="quadratic",
...                                            gamma_x=gx,
...                                            gamma_y=gy,
...                                            transform=trans)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property gamma_y

Regularization weight on Y matrix

Type: float, defaults to 0.0.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> rank = 3
>>> gx = 0.5
>>> gy = 0.5
>>> trans = "standardize"
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank,
...                                            loss="quadratic",
...                                            gamma_x=gx,
...                                            gamma_y=gy,
...                                            transform=trans)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
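
To make the role of the two regularization weights concrete, the sketch below evaluates a low rank objective of the form loss(A, XY) + gamma_x * r(X) + gamma_y * r(Y) on tiny matrices. It is plain Python, not H2O code; quadratic loss and quadratic regularizers are assumptions chosen for the example (the class signature defaults both regularizers to 'none'), and the helper names are hypothetical.

```python
def matmul(x, y):
    """Multiply an n-by-k matrix with a k-by-m matrix (lists of rows)."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def glrm_objective(a, x, y, gamma_x=0.0, gamma_y=0.0):
    """Quadratic loss plus weighted quadratic penalties on X and Y."""
    approx = matmul(x, y)
    loss = sum((a[i][j] - approx[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0])))
    reg_x = sum(v * v for row in x for v in row)
    reg_y = sum(v * v for row in y for v in row)
    return loss + gamma_x * reg_x + gamma_y * reg_y


a = [[1.0, 2.0], [2.0, 4.0]]   # rank-1 data matrix
x = [[1.0], [2.0]]             # 2-by-1 factor
y = [[1.0, 2.0]]               # 1-by-2 factor
print(glrm_objective(a, x, y))
print(glrm_objective(a, x, y, gamma_x=0.5, gamma_y=0.5))
```

With both weights at zero only the reconstruction error counts; raising either weight penalizes large entries in the corresponding factor.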
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            ignore_const_cols=False,
...                                            seed=1234)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property impute_original

Reconstruct original training data by reversing transform

Type: bool, defaults to False.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> rank = 3
>>> gx = 0.5
>>> gy = 0.5
>>> trans = "standardize"
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank,
...                                            loss="quadratic",
...                                            gamma_x=gx,
...                                            gamma_y=gy,
...                                            transform=trans,
...                                            impute_original=True)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property init

Initialization mode

Type: Literal["random", "svd", "plus_plus", "user"], defaults to "plus_plus".

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            init="svd",
...                                            seed=1234) 
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property init_step_size

Initial step size

Type: float, defaults to 1.0.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            init_step_size=2.5,
...                                            seed=1234) 
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
property k

Rank of matrix approximation

Type: int, defaults to 1.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.show()
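
The rank determines the shapes of the two factors that approximate the data. The sketch below is plain Python with a hypothetical helper; it assumes an n-by-m numeric matrix approximated by an n-by-k factor X and a k-by-m factor Y, and simply counts entries.

```python
def factor_sizes(num_rows, num_cols, k):
    """Return the shapes of X and Y and the total number of stored entries."""
    if k < 1:
        raise ValueError("k must be at least 1")
    x_shape = (num_rows, k)
    y_shape = (k, num_cols)
    return x_shape, y_shape, num_rows * k + k * num_cols


# A hypothetical 150-by-4 numeric matrix approximated with rank 3.
print(factor_sizes(150, 4, 3))
```

Smaller values of k give a more compact representation at the cost of a coarser approximation.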
property loading_name

[Deprecated] Use representation_name instead. Frame key to save resulting X.

Type: str.

Examples

>>> # loading_name is deprecated. Use representation_name instead.
>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip")
>>> acs_fill = acs.drop("ZCTA5")
>>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10,
...                                           transform="standardize",
...                                           loss="quadratic",
...                                           regularization_x="quadratic",
...                                           regularization_y="L1",
...                                           gamma_x=0.25,
...                                           gamma_y=0.5,
...                                           max_iterations=1,
...                                           loading_name="acs_full")
>>> acs_glrm.train(x=acs_fill.names, training_frame=acs)
>>> acs_glrm.loading_name
>>> acs_glrm.show()
property loss

Numeric loss function

Type: Literal["quadratic", "absolute", "huber", "poisson", "hinge", "logistic", "periodic"], defaults to "quadratic".

Examples

>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip")
>>> acs_fill = acs.drop("ZCTA5")
>>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10,
...                                           transform="standardize",
...                                           loss="absolute",
...                                           regularization_x="quadratic",
...                                           regularization_y="L1",
...                                           gamma_x=0.25,
...                                           gamma_y=0.5,
...                                           max_iterations=700)
>>> acs_glrm.train(x=acs_fill.names, training_frame=acs)
>>> acs_glrm.show()
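
For orientation, the sketch below gives textbook forms of three of the listed numeric losses, each comparing an approximated value `u` with an observed value `a`. It is plain Python with hypothetical function names; H2O's internal scaling or thresholds may differ, so treat these as the usual definitions, not the library's exact formulas.

```python
def quadratic_loss(u, a):
    """Squared error."""
    return (u - a) ** 2


def absolute_loss(u, a):
    """Absolute error, less sensitive to outliers than squared error."""
    return abs(u - a)


def huber_loss(u, a):
    """Quadratic near zero, linear for residuals larger than 1."""
    r = abs(u - a)
    return 0.5 * r * r if r <= 1.0 else r - 0.5


for loss in (quadratic_loss, absolute_loss, huber_loss):
    print(loss.__name__, loss(3.0, 1.0))
```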
property loss_by_col

Loss function by column (override)

Type: List[Literal["quadratic", "absolute", "huber", "poisson", "hinge", "logistic", "periodic", "categorical", "ordinal"]].

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               loss="quadratic",
...                                               loss_by_col=["absolute","huber"],
...                                               loss_by_col_idx=[0,3],
...                                               regularization_x="quadratic",
...                                               regularization_y="l1")
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property loss_by_col_idx

Loss function by column index (override)

Type: List[int].

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               loss="quadratic",
...                                               loss_by_col=["absolute","huber"],
...                                               loss_by_col_idx=[0,3],
...                                               regularization_x="quadratic",
...                                               regularization_y="l1")
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
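
The way `loss_by_col` and `loss_by_col_idx` pair up can be sketched as a small lookup. The helper below is hypothetical and written in plain Python; it assumes the two lists are matched positionally and that the indices are 0-based, with every other column falling back to the default `loss`.

```python
def resolve_column_losses(num_cols, default_loss,
                          loss_by_col=None, loss_by_col_idx=None):
    """Return the loss name used for each column after applying overrides."""
    loss_by_col = loss_by_col or []
    loss_by_col_idx = loss_by_col_idx or []
    if len(loss_by_col) != len(loss_by_col_idx):
        raise ValueError("loss_by_col and loss_by_col_idx must have equal length")
    losses = [default_loss] * num_cols
    for idx, name in zip(loss_by_col_idx, loss_by_col):
        if not 0 <= idx < num_cols:
            raise IndexError("column index out of range: %d" % idx)
        losses[idx] = name
    return losses


# Same override lists as the example above, applied to four columns.
print(resolve_column_losses(4, "quadratic", ["absolute", "huber"], [0, 3]))
```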
property max_iterations

Maximum number of iterations

Type: int, defaults to 1000.

Examples

>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip")
>>> acs_fill = acs.drop("ZCTA5")
>>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10,
...                                           transform="standardize",
...                                           loss="quadratic",
...                                           regularization_x="quadratic",
...                                           regularization_y="L1",
...                                           gamma_x=0.25,
...                                           gamma_y=0.5,
...                                           max_iterations=700)
>>> acs_glrm.train(x=acs_fill.names, training_frame=acs)
>>> acs_glrm.show()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               max_runtime_secs=15,
...                                               max_iterations=500,
...                                               max_updates=900,
...                                               min_step_size=0.005)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property max_updates

Maximum number of updates, defaults to 2*max_iterations

Type: int, defaults to 2000.

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               max_runtime_secs=15,
...                                               max_iterations=500,
...                                               max_updates=900,
...                                               min_step_size=0.005)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property min_step_size

Minimum step size

Type: float, defaults to 0.0001.

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               max_runtime_secs=15,
...                                               max_iterations=500,
...                                               max_updates=900,
...                                               min_step_size=0.005)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property multi_loss

Categorical loss function

Type: Literal["categorical", "ordinal"], defaults to "categorical".

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               loss="quadratic",
...                                               loss_by_col=["absolute","huber"],
...                                               loss_by_col_idx=[0,3],
...                                               regularization_x="quadratic",
...                                               regularization_y="l1",
...                                               multi_loss="ordinal")
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property period

Length of period (only used with periodic loss function)

Type: int, defaults to 1.

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               max_runtime_secs=15,
...                                               max_iterations=500,
...                                               max_updates=900,
...                                               min_step_size=0.005,
...                                               period=5)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
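
The period setting only matters when the loss is periodic. As a rough illustration of what a periodic loss looks like, the sketch below assumes the form 1 - cos(2*pi*(a - u)/period); the function name and this exact formula are assumptions made for illustration, not taken from the text above.

```python
import math

def periodic_loss(u, a, period=1):
    """Loss between a prediction u and an observed value a.

    Assumed form (illustrative only): 1 - cos(2*pi*(a - u)/period).
    """
    return 1.0 - math.cos((a - u) * 2.0 * math.pi / period)

# A prediction that is off by exactly one period costs (numerically) nothing
print(periodic_loss(u=2.0, a=7.0, period=5))
# A prediction that is off by half a period is the worst case
print(periodic_loss(u=0.0, a=2.5, period=5))
```

With this form, errors that are whole multiples of the period are not penalized, which is the behavior one would want for cyclic quantities such as day of week.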
property recover_svd

Recover singular values and eigenvectors of XY

Type: bool, defaults to False.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> loss_all = ["Hinge", "Quadratic", "Categorical", "Categorical",
...             "Hinge", "Quadratic", "Quadratic", "Quadratic"]
>>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5,
...                                            loss_by_col=loss_all,
...                                            recover_svd=True,
...                                            transform="standardize",
...                                            seed=12345)
>>> pros_glrm.train(x=prostate.names, training_frame=prostate)
>>> pros_glrm.show()
property regularization_x

Regularization function for X matrix

Type: Literal["none", "quadratic", "l2", "l1", "non_negative", "one_sparse", "unit_one_sparse", "simplex"], defaults to "none".

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               loss="quadratic",
...                                               loss_by_col=["absolute","huber"],
...                                               loss_by_col_idx=[0,3],
...                                               regularization_x="quadratic",
...                                               regularization_y="l1")
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
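
To make the regularization options more concrete, here is a minimal sketch of how a few of them could penalize one row of X (or one column of Y). The helper `penalty` and the formulas are assumptions based on the usual low-rank-model definitions, and only a subset of the listed options is covered.

```python
def penalty(row, kind="none", gamma=1.0):
    """Illustrative penalty for one row of X (or one column of Y).

    gamma plays the role of gamma_x / gamma_y. The formulas are assumed,
    standard definitions and are not read from the H2O source.
    """
    if kind == "none":
        return 0.0
    if kind == "quadratic":
        return gamma * sum(v * v for v in row)      # sum of squares
    if kind == "l1":
        return gamma * sum(abs(v) for v in row)     # sum of absolute values
    if kind == "non_negative":
        # Hard constraint: any negative entry is infinitely expensive
        return 0.0 if all(v >= 0 for v in row) else float("inf")
    raise ValueError("option not covered by this sketch: " + kind)

row = [3.0, -4.0]
for kind in ("none", "quadratic", "l1", "non_negative"):
    print(kind, penalty(row, kind))
```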
property regularization_y

Regularization function for Y matrix

Type: Literal["none", "quadratic", "l2", "l1", "non_negative", "one_sparse", "unit_one_sparse", "simplex"], defaults to "none".

Examples

>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv")
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                               loss="quadratic",
...                                               loss_by_col=["absolute","huber"],
...                                               loss_by_col_idx=[0,3],
...                                               regularization_x="quadratic",
...                                               regularization_y="l1")
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property representation_name

Frame key to save resulting X

Type: str.

Examples

>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip")
>>> acs_fill = acs.drop("ZCTA5")
>>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10,
...                                           transform="standardize",
...                                           loss="quadratic",
...                                           regularization_x="quadratic",
...                                           regularization_y="L1",
...                                           gamma_x=0.25,
...                                           gamma_y=0.5,
...                                           max_iterations=1,
...                                           representation_name="acs_full")
>>> acs_glrm.train(x=acs_fill.names, training_frame=acs)
>>> acs_glrm.representation_name
>>> acs_glrm.show()
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> loss_all = ["Hinge", "Quadratic", "Categorical", "Categorical",
...             "Hinge", "Quadratic", "Quadratic", "Quadratic"]
>>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5,
...                                            loss_by_col=loss_all,
...                                            score_each_iteration=True,
...                                            transform="standardize",
...                                            seed=12345)
>>> pros_glrm.train(x=prostate.names, training_frame=prostate)
>>> pros_glrm.show()
property seed

RNG seed for initialization

Type: int, defaults to -1.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> glrm_w_seed = H2OGeneralizedLowRankEstimator(k=5, seed=12345)
>>> glrm_w_seed.train(x=prostate.names, training_frame=prostate)
>>> glrm_wo_seed = H2OGeneralizedLowRankEstimator(k=5)
>>> glrm_wo_seed.train(x=prostate.names, training_frame=prostate)
>>> glrm_w_seed.show()
>>> glrm_wo_seed.show()
property svd_method

Method for computing SVD during initialization (Caution: Randomized is currently experimental and unstable)

Type: Literal["gram_s_v_d", "power", "randomized"], defaults to "randomized".

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5,
...                                            svd_method="power",
...                                            seed=1234)
>>> pros_glrm.train(x=prostate.names, training_frame=prostate)
>>> pros_glrm.show()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5,
...                                            seed=1234)
>>> pros_glrm.train(x=prostate.names, training_frame=prostate)
>>> pros_glrm.show()
property transform

Transformation of training data

Type: Literal["none", "standardize", "normalize", "demean", "descale"], defaults to "none".

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv")
>>> prostate[0] = prostate[0].asnumeric()
>>> prostate[4] = prostate[4].asnumeric()
>>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5,
...                                            score_each_iteration=True,
...                                            transform="standardize",
...                                            seed=12345)
>>> pros_glrm.train(x=prostate.names, training_frame=prostate)
>>> pros_glrm.show()
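
The transform options can be pictured as simple per-column operations. The sketch below is a plain-Python illustration, assuming that "demean" subtracts the column mean, "descale" divides by the standard deviation, and "standardize" does both; "normalize" is left out because its exact definition is not given above. The helper name and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def transform_column(values, how="none"):
    """Apply one of the assumed transformations to a numeric column."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    if how == "none":
        return list(values)
    if how == "demean":
        return [v - mu for v in values]
    if how == "descale":
        return [v / sigma for v in values]
    if how == "standardize":
        return [(v - mu) / sigma for v in values]
    raise ValueError("option not covered by this sketch: " + how)

column = [1.0, 2.0, 3.0, 4.0]
print(transform_column(column, "demean"))
print(transform_column(column, "standardize"))
```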
property user_x

User-specified initial X

Type: Union[None, str, H2OFrame].

Examples

>>> arrestsH2O = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> initial_x = ([[5.412, 65.24, -7.54, -0.032, 2.212, 92.24, -17.54, 23.268, 0.312,
...                123.24, 14.46, 9.768, 1.012, 19.24, -15.54, -1.732, 5.412, 65.24,
...                -7.54, -0.032, 2.212, 92.24, -17.54, 23.268, 0.312, 123.24, 14.46,
...                9.76, 1.012, 19.24, -15.54, -1.732, 5.412, 65.24, -7.54, -0.032,
...                2.212, 92.24, -17.54, 23.268, 0.312, 123.24, 14.46, 9.768, 1.012,
...                19.24, -15.54, -1.732, 5.412, 65.24]]*4)
>>> initial_x_h2o = h2o.H2OFrame(list(zip(*initial_x)))
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=4,
...                                               transform="demean",
...                                               loss="quadratic",
...                                               gamma_x=0.5,
...                                               gamma_y=0.3,
...                                               init="user",
...                                               user_x=initial_x_h2o,
...                                               recover_svd=True)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property user_y

User-specified initial Y

Type: Union[None, str, H2OFrame].

Examples

>>> arrestsH2O = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> initial_y = [[5.412,  65.24,  -7.54, -0.032],
...              [2.212,  92.24, -17.54, 23.268],
...              [0.312, 123.24,  14.46,  9.768],
...              [1.012,  19.24, -15.54, -1.732]]
>>> initial_y_h2o = h2o.H2OFrame(list(zip(*initial_y)))
>>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=4,
...                                               transform="demean",
...                                               loss="quadratic",
...                                               gamma_x=0.5,
...                                               gamma_y=0.3,
...                                               init="user",
...                                               user_y=initial_y_h2o,
...                                               recover_svd=True)
>>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O)
>>> arrests_glrm.show()
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> iris = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            loss="quadratic",
...                                            gamma_x=0.5,
...                                            gamma_y=0.5,
...                                            transform="standardize")
>>> iris_glrm.train(x=iris.names,
...                 training_frame=iris,
...                 validation_frame=iris)
>>> iris_glrm.show()

H2OIsolationForestEstimator

class h2o.estimators.isolation_forest.H2OIsolationForestEstimator(model_id=None, training_frame=None, score_each_iteration=False, score_tree_interval=0, ignored_columns=None, ignore_const_cols=True, ntrees=50, max_depth=8, min_rows=1.0, max_runtime_secs=0.0, seed=-1, build_tree_one_node=False, mtries=-1, sample_size=256, sample_rate=-1.0, col_sample_rate_change_per_level=1.0, col_sample_rate_per_tree=1.0, categorical_encoding='auto', stopping_rounds=0, stopping_metric='auto', stopping_tolerance=0.01, export_checkpoints_dir=None, contamination=-1.0, validation_frame=None, validation_response_column=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Isolation Forest

Builds an Isolation Forest model. The Isolation Forest algorithm samples the training frame and, in each iteration, builds a tree that partitions the space of the sampled observations until it isolates each observation. The length of the path from the root to a leaf node of the resulting tree is used to calculate the anomaly score. Anomalies are easier to isolate, so their average tree path is expected to be shorter than the paths of regular observations.
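
The path-length idea can be sketched in a few lines of plain Python. This is a simplified, one-dimensional illustration and not H2O's implementation: every "tree" uses the full data set rather than a subsample, and the helper names are hypothetical.

```python
import random

def path_length(point, sample, rng, depth=0, max_depth=8):
    """Count the random splits needed to isolate `point` within `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Follow only the side of the split that contains the point
    if point < split:
        side = [v for v in sample if v < split]
    else:
        side = [v for v in sample if v >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, ntrees=50, seed=1234):
    """Average path length of `point` over `ntrees` random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(ntrees)) / ntrees

# Twenty clustered values plus one obvious outlier
data = [0.1 * i for i in range(20)] + [50.0]
print("outlier:", mean_path_length(50.0, data))
print("inlier: ", mean_path_length(data[10], data))
```

The outlier is usually cut off by the very first split, so its average path is much shorter than that of a value inside the cluster.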

property build_tree_one_node

Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(build_tree_one_node=True,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> encoding = "one_hot_explicit"
>>> airlines_if = H2OIsolationForestEstimator(categorical_encoding=encoding,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
property col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(col_sample_rate_change_per_level=.9,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
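
One way to read "relative change of the column sampling rate for every level" is that the rate is multiplied by the factor once per tree level. The sketch below works under that assumption; the function name, the multiplicative compounding, and the cap at 1.0 are all illustrative assumptions rather than documented behavior.

```python
def column_rate_at_depth(base_rate, change_per_level, depth):
    """Assumed column sampling rate at a given tree depth."""
    if not 0.0 < change_per_level <= 2.0:
        raise ValueError("change_per_level must be > 0.0 and <= 2.0")
    # Assumption: the factor compounds per level and the rate never exceeds 1.0
    return min(1.0, base_rate * change_per_level ** depth)

for depth in range(4):
    print(depth, column_rate_at_depth(1.0, 0.9, depth))
```

Under this reading, a factor below 1.0 considers fewer columns deeper in the tree, and a factor above 1.0 considers more.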
property col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0)

Type: float, defaults to 1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(col_sample_rate_per_tree=.7,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
property contamination

Contamination ratio: the proportion of anomalies in the input dataset. If undefined (-1), the predict function will not mark observations as anomalies, and only the anomaly score will be returned. Defaults to -1 (undefined).

Type: float, defaults to -1.0.
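
As a rough sketch of how a contamination ratio can turn anomaly scores into anomaly flags, the example below marks the highest-scoring fraction of observations. The helper `flag_anomalies`, the accepted range, and the thresholding rule are assumptions for illustration; how H2O derives its threshold is not described here.

```python
def flag_anomalies(scores, contamination=-1.0):
    """Return 0/1 anomaly flags, or None when contamination is undefined (-1)."""
    if contamination == -1.0:
        return None  # only the scores are available
    if not 0.0 < contamination < 1.0:
        raise ValueError("this sketch expects a ratio between 0 and 1")
    # Assumption: flag the top `contamination` fraction of scores
    n_flagged = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.31, 0.28, 0.90, 0.35, 0.30, 0.33, 0.85, 0.29, 0.32, 0.34]
print(flag_anomalies(scores))                     # None
print(flag_anomalies(scores, contamination=0.2))  # flags the two highest scores
```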

property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_if = H2OIsolationForestEstimator(max_depth=3,
...                                      seed=1234,
...                                      export_checkpoints_dir=checkpoints_dir)
>>> air_if.train(x=predictors,
...              training_frame=airlines)
>>> len(listdir(checkpoints_dir))
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year","const_1","const_2"]
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_if = H2OIsolationForestEstimator(seed=1234,
...                                       ignore_const_cols=True)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property max_depth

Maximum tree depth (0 for unlimited).

Type: int, defaults to 8.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(max_depth=2,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(max_runtime_secs=10,
...                                       ntrees=10000,
...                                       max_depth=10,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property min_rows

Fewest allowed (weighted) observations in a leaf.

Type: float, defaults to 1.0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(min_rows=16,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property mtries

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to (number of predictors)/3.

Type: int, defaults to -1.

Examples

>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> predictors = covtype.columns[0:54]
>>> cov_if = H2OIsolationForestEstimator(mtries=30, seed=1234)
>>> cov_if.train(x=predictors,
...              training_frame=covtype)
>>> cov_if.model_performance()
property ntrees

Number of trees.

Type: int, defaults to 50.

Examples

>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> predictors = titanic.columns
>>> tree_num = [20, 50, 80, 110, 140, 170, 200]
>>> label = ["20", "50", "80", "110", "140", "170", "200"]
>>> for key, num in enumerate(tree_num):
...     titanic_if = H2OIsolationForestEstimator(ntrees=num,
...                                              seed=1234)
...     titanic_if.train(x=predictors,
...                      training_frame=titanic) 
...     print(label[key], 'training score', titanic_if.mse(train=True))
property sample_rate

Rate of randomly sampled observations used to train each Isolation Forest tree. Needs to be in range from 0.0 to 1.0. If set to -1, sample_rate is disabled and sample_size will be used instead.

Type: float, defaults to -1.0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(sample_rate=.7,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
property sample_size

Number of randomly sampled observations used to train each Isolation Forest tree. Only one of the parameters sample_size and sample_rate should be defined. If sample_rate is defined, sample_size will be ignored.

Type: int, defaults to 256.

Examples

>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv")
>>> isofor_model = H2OIsolationForestEstimator(sample_size=5,
...                                            ntrees=7)
>>> isofor_model.train(training_frame=train)
>>> isofor_model.model_performance()
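
The interaction between sample_size and sample_rate can be summarized as a small decision rule. The sketch below is an illustration only: the helper name is hypothetical, capping sample_size at the number of rows is an assumption, and because sampling is random the returned value should be read as an approximate row count.

```python
def rows_per_tree(n_rows, sample_size=256, sample_rate=-1.0):
    """Approximate number of rows used to train each tree."""
    if sample_rate == -1.0:
        # sample_rate is disabled, so sample_size applies
        # (capping at n_rows is an assumption of this sketch)
        return min(sample_size, n_rows)
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0.0, 1.0], or -1 to disable")
    # sample_rate is defined, so sample_size is ignored
    return round(sample_rate * n_rows)

print(rows_per_tree(1000))                                   # uses sample_size
print(rows_per_tree(1000, sample_size=5, sample_rate=0.5))   # uses sample_rate
```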
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(score_each_iteration=True,
...                                       ntrees=55,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

Type: int, defaults to 0.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(score_tree_interval=5,
...                                       seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property seed

Seed for pseudo random number generator (if applicable)

Type: int, defaults to -1.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> isofor_w_seed = H2OIsolationForestEstimator(seed=1234) 
>>> isofor_w_seed.train(x=predictors,
...                     training_frame=airlines)
>>> isofor_wo_seed = H2OIsolationForestEstimator()
>>> isofor_wo_seed.train(x=predictors,
...                      training_frame=airlines)
>>> isofor_w_seed.model_performance()
>>> isofor_wo_seed.model_performance()
property stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.

Type: Literal["auto", "anomaly_score", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "misclassification", "mean_per_class_error"], defaults to "auto".

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto",
...                                           stopping_rounds=3,
...                                           stopping_tolerance=1e-2,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
property stopping_rounds

Early stopping based on convergence of stopping_metric. Training stops if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable).

Type: int, defaults to 0.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto",
...                                           stopping_rounds=3,
...                                           stopping_tolerance=1e-2,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
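
The moving-average rule can be sketched as follows. This is a simplified reading of the description and not H2O's exact implementation: the function name is hypothetical, the metric is assumed to be "lower is better", and the comparison of the latest moving average against the best of the k preceding ones is an assumption.

```python
def should_stop(history, stopping_rounds, stopping_tolerance=0.01):
    """Decide whether to stop, given one metric value per scoring event."""
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Simple moving averages of length k, one per window position
    averages = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]
    latest = averages[-1]
    reference = min(averages[-(k + 1):-1])  # best of the k averages before it
    if reference == 0:
        return True  # nothing left to improve in relative terms
    improvement = (reference - latest) / abs(reference)
    return improvement < stopping_tolerance

plateau = [1.0, 0.8, 0.7, 0.65, 0.649, 0.648, 0.648, 0.648]
improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
print(should_stop(plateau, stopping_rounds=3))    # metric has flattened out
print(should_stop(improving, stopping_rounds=3))  # metric is still improving
```

Averaging over k scoring events smooths out noise, so a single unlucky score does not end training early.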
property stopping_tolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

Type: float, defaults to 0.01.

Examples

>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto",
...                                           stopping_rounds=3,
...                                           stopping_tolerance=1e-2,
...                                           seed=1234)
>>> airlines_if.train(x=predictors,
...                   training_frame=airlines)
>>> airlines_if.model_performance()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_if = H2OIsolationForestEstimator(seed=1234)
>>> cars_if.train(x=predictors,
...               training_frame=cars)
>>> cars_if.model_performance()
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

property validation_response_column

(experimental) Name of the response column in the validation frame. The response column should be binary and indicate whether each observation is an anomaly.

Type: str.

H2OKMeansEstimator

class h2o.estimators.kmeans.H2OKMeansEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, k=1, estimate_k=False, user_points=None, max_iterations=10, standardize=True, seed=-1, init='furthest', max_runtime_secs=0.0, categorical_encoding='auto', export_checkpoints_dir=None, cluster_size_constraints=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

K-means

Performs k-means clustering on an H2O dataset.

property categorical_encoding

Encoding scheme for categorical features

Type: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited"], defaults to "auto".

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
>>> train, valid = prostate.split_frame(ratios=[.8], seed=1234)
>>> encoding = "one_hot_explicit"
>>> pros_km = H2OKMeansEstimator(categorical_encoding=encoding,
...                              seed=1234)
>>> pros_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> pros_km.scoring_history()
property cluster_size_constraints

An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters.

Type: List[int].

Examples

>>> iris_h2o = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> k=3
>>> start_points = h2o.H2OFrame(
...         [[4.9, 3.0, 1.4, 0.2],
...          [5.6, 2.5, 3.9, 1.1],
...          [6.5, 3.0, 5.2, 2.0]])
>>> kmm = H2OKMeansEstimator(k=k,
...                          user_points=start_points,
...                          standardize=True,
...                          cluster_size_constraints=[2, 5, 8],
...                          score_each_iteration=True)
>>> kmm.train(x=list(range(4)), training_frame=iris_h2o)
>>> kmm.scoring_history()
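
The two requirements on the constraints array (one entry per cluster, each a minimum size) can be checked with a few lines of plain Python. The helper names below are hypothetical, and the check that the minimums do not exceed the number of rows is an assumption added because such constraints could not be met.

```python
from collections import Counter

def check_constraints(constraints, k, n_rows):
    """Validate minimum cluster sizes before training."""
    if len(constraints) != k:
        raise ValueError("need exactly one constraint per cluster")
    if sum(constraints) > n_rows:
        # Assumption: minimum sizes larger than the data cannot be satisfied
        raise ValueError("constraints exceed the number of rows")

def satisfies(assignments, constraints):
    """Check that every cluster received at least its minimum number of points."""
    counts = Counter(assignments)
    return all(counts[c] >= minimum for c, minimum in enumerate(constraints))

check_constraints([2, 5, 8], k=3, n_rows=150)
assignments = [0] * 2 + [1] * 5 + [2] * 8
print(satisfies(assignments, [2, 5, 8]))
```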
property estimate_k

Whether to estimate the number of clusters (<=k) iteratively and deterministically.

Type: bool, defaults to False.

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris['class'] = iris['class'].asfactor()
>>> predictors = iris.columns[:-1]
>>> train, valid = iris.split_frame(ratios=[.8], seed=1234)
>>> iris_kmeans = H2OKMeansEstimator(k=10,
...                                  estimate_k=True,
...                                  standardize=False,
...                                  seed=1234)
>>> iris_kmeans.train(x=predictors,
...                   training_frame=train,
...                   validation_frame=valid)
>>> iris_kmeans.scoring_history()
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex")
>>> predictors = ["DayofMonth", "DayOfWeek"]
>>> checkpoints_dir = tempfile.mkdtemp()
>>> air_km = H2OKMeansEstimator(export_checkpoints_dir=checkpoints_dir,
...                             seed=1234)
>>> air_km.train(x=predictors, training_frame=airlines)
>>> len(listdir(checkpoints_dir))
property fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.

Type: Literal["auto", "random", "modulo", "stratified"], defaults to "auto".

Examples

>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv")
>>> predictors = ["radiation","temperature","wind"]
>>> train, valid = ozone.split_frame(ratios=[.8], seed=1234)
>>> ozone_km = H2OKMeansEstimator(fold_assignment="Random",
...                               nfolds=5,
...                               seed=1234)
>>> ozone_km.train(x=predictors,
...                training_frame=train,
...                validation_frame=valid)
>>> ozone_km.scoring_history()
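
The schemes differ only in how each row is mapped to a fold. The following is a minimal pure-Python sketch of that idea, not H2O's implementation. The helper names `modulo_folds` and `random_folds` are hypothetical, and the modulo rule shown (row index modulo the number of folds) is an assumption about how a deterministic split of this kind typically works.

```python
import random


def modulo_folds(n_rows, nfolds):
    """Deterministic assignment: row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


def random_folds(n_rows, nfolds, seed):
    """Seeded random assignment: each row draws a fold independently."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


# Modulo cycles through the folds in order and does not depend on a seed.
print(modulo_folds(7, 3))
# Random depends on the seed; the same seed gives the same assignment.
print(random_folds(7, 3, seed=1234))
```

A modulo-style split gives evenly sized folds regardless of the seed, while a random split is only repeatable when the seed is fixed.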
property fold_column

Column with cross-validation fold index assignment per observation.

Type: str.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> print(cars['fold_numbers'])
>>> cars_km = H2OKMeansEstimator(seed=1234)
>>> cars_km.train(x=predictors,
...               training_frame=cars,
...               fold_column="fold_numbers")
>>> cars_km.scoring_history()
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_km = H2OKMeansEstimator(ignore_const_cols=True,
...                              seed=1234)
>>> cars_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_km.scoring_history()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property init

Initialization mode

Type: Literal["random", "plus_plus", "furthest", "user"], defaults to "furthest".

Examples

>>> seeds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt")
>>> predictors = seeds.columns[0:7]
>>> train, valid = seeds.split_frame(ratios=[.8], seed=1234)
>>> seeds_km = H2OKMeansEstimator(k=3,
...                               init='Furthest',
...                               seed=1234)
>>> seeds_km.train(x=predictors,
...                training_frame=train,
...                validation_frame=valid)
>>> seeds_km.scoring_history()
property k

The maximum number of clusters. If estimate_k is disabled, the model will find k centroids; otherwise it will find up to k centroids.

Type: int, defaults to 1.

Examples

>>> seeds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt")
>>> predictors = seeds.columns[0:7]
>>> train, valid = seeds.split_frame(ratios=[.8], seed=1234)
>>> seeds_km = H2OKMeansEstimator(k=3, seed=1234)
>>> seeds_km.train(x=predictors,
...                training_frame=train,
...                validation_frame=valid)
>>> seeds_km.scoring_history()
property keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

Type: bool, defaults to False.

Examples

>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv")
>>> predictors = ["radiation","temperature","wind"]
>>> train, valid = ozone.split_frame(ratios=[.8], seed=1234)
>>> ozone_km = H2OKMeansEstimator(keep_cross_validation_fold_assignment=True,
...                               nfolds=5,
...                               seed=1234)
>>> ozone_km.train(x=predictors,
...                training_frame=train)
>>> ozone_km.scoring_history()
property keep_cross_validation_models

Whether to keep the cross-validation models.

Type: bool, defaults to True.

Examples

>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv")
>>> predictors = ["radiation","temperature","wind"]
>>> train, valid = ozone.split_frame(ratios=[.8], seed=1234)
>>> ozone_km = H2OKMeansEstimator(keep_cross_validation_models=True,
...                               nfolds=5,
...                               seed=1234)
>>> ozone_km.train(x=predictors,
...                training_frame=train,
...                validation_frame=valid)
>>> ozone_km.scoring_history()
property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models.

Type: bool, defaults to False.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> predictors = ["AGE", "RACE", "DPROS", "DCAPS",
...               "PSA", "VOL", "GLEASON"]
>>> train, valid = prostate.split_frame(ratios=[.8], seed=1234)
>>> pros_km = H2OKMeansEstimator(keep_cross_validation_predictions=True,
...                              nfolds=5,
...                              seed=1234)
>>> pros_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> pros_km.scoring_history()
property max_iterations

Maximum training iterations (if estimate_k is enabled, this limit applies to each inner Lloyd's iteration).

Type: int, defaults to 10.

Examples

>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK",
...               "AGP1","AGMN","LIV","AGLP"]
>>> train, valid = benign.split_frame(ratios=[.8], seed=1234)
>>> benign_km = H2OKMeansEstimator(max_iterations=50)
>>> benign_km.train(x=predictors,
...                 training_frame=train,
...                 validation_frame=valid)
>>> benign_km.scoring_history()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK",
...               "AGP1","AGMN","LIV","AGLP"]
>>> train, valid = benign.split_frame(ratios=[.8], seed=1234)
>>> benign_km = H2OKMeansEstimator(max_runtime_secs=10,
...                                seed=1234)
>>> benign_km.train(x=predictors,
...                 training_frame=train,
...                 validation_frame=valid)
>>> benign_km.scoring_history()
property nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2).

Type: int, defaults to 0.

Examples

>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK",
...               "AGP1","AGMN","LIV","AGLP"]
>>> train, valid = benign.split_frame(ratios=[.8], seed=1234)
>>> benign_km = H2OKMeansEstimator(nfolds=5, seed=1234)
>>> benign_km.train(x=predictors,
...                 training_frame=train,
...                 validation_frame=valid)
>>> benign_km.scoring_history()
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK",
...               "AGP1","AGMN","LIV","AGLP"]
>>> train, valid = benign.split_frame(ratios=[.8], seed=1234)
>>> benign_km = H2OKMeansEstimator(score_each_iteration=True,
...                                seed=1234)
>>> benign_km.train(x=predictors,
...                 training_frame=train,
...                 validation_frame=valid)
>>> benign_km.scoring_history()
property seed

RNG Seed

Type: int, defaults to -1.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
>>> train, valid = prostate.split_frame(ratios=[.8], seed=1234)
>>> pros_w_seed = H2OKMeansEstimator(seed=1234)
>>> pros_w_seed.train(x=predictors,
...                   training_frame=train,
...                   validation_frame=valid)
>>> pros_wo_seed = H2OKMeansEstimator()
>>> pros_wo_seed.train(x=predictors,
...                    training_frame=train,
...                    validation_frame=valid)
>>> pros_w_seed.scoring_history()
>>> pros_wo_seed.scoring_history()
property standardize

Standardize columns before computing distances

Type: bool, defaults to True.

Examples

>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_km = H2OKMeansEstimator(standardize=True)
>>> boston_km.train(x=predictors,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_km.scoring_history()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> predictors = ["AGE", "RACE", "DPROS", "DCAPS",
...               "PSA", "VOL", "GLEASON"]
>>> train, valid = prostate.split_frame(ratios=[.8], seed=1234)
>>> pros_km = H2OKMeansEstimator(seed=1234)
>>> pros_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> pros_km.scoring_history()
property user_points

This option allows you to specify a dataframe, where each row represents an initial cluster center. The user-specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters.

Type: Union[None, str, H2OFrame].

Examples

>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris['class'] = iris['class'].asfactor()
>>> predictors = iris.columns[:-1]
>>> train, valid = iris.split_frame(ratios=[.8], seed=1234)
>>> point1 = [4.9,3.0,1.4,0.2]
>>> point2 = [5.6,2.5,3.9,1.1]
>>> point3 = [6.5,3.0,5.2,2.0]
>>> points = h2o.H2OFrame([point1, point2, point3])
>>> iris_km = H2OKMeansEstimator(k=3,
...                              user_points=points,
...                              seed=1234)
>>> iris_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> iris_km.tot_withinss(valid=True)
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> predictors = ["AGE", "RACE", "DPROS", "DCAPS",
...               "PSA", "VOL", "GLEASON"]
>>> train, valid = prostate.split_frame(ratios=[.8], seed=1234)
>>> pros_km = H2OKMeansEstimator(seed=1234)
>>> pros_km.train(x=predictors,
...               training_frame=train,
...               validation_frame=valid)
>>> pros_km.scoring_history()

H2OPrincipalComponentAnalysisEstimator

class h2o.estimators.pca.H2OPrincipalComponentAnalysisEstimator(model_id=None, training_frame=None, validation_frame=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, transform='none', pca_method='gram_s_v_d', pca_impl=None, k=1, max_iterations=1000, use_all_factor_levels=False, compute_metrics=True, impute_missing=False, seed=-1, max_runtime_secs=0.0, export_checkpoints_dir=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Principal Components Analysis

property compute_metrics

Whether to compute metrics on the training data

Type: bool, defaults to True.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> pros_pca = H2OPrincipalComponentAnalysisEstimator(compute_metrics=False)
>>> pros_pca.train(x=prostate.names, training_frame=prostate)
>>> pros_pca.show()
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> checkpoints_dir = tempfile.mkdtemp()
>>> pros_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True,
...                                                   export_checkpoints_dir=checkpoints_dir)
>>> pros_pca.train(x=prostate.names, training_frame=prostate)
>>> len(listdir(checkpoints_dir))
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> pros_pca = H2OPrincipalComponentAnalysisEstimator(ignore_const_cols=False)
>>> pros_pca.train(x=prostate.names, training_frame=prostate)
>>> pros_pca.show()
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

property impute_missing

Whether to impute missing entries with the column mean

Type: bool, defaults to False.

Examples

>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
>>> prostate['RACE'] = prostate['RACE'].asfactor()
>>> prostate['DCAPS'] = prostate['DCAPS'].asfactor()
>>> prostate['DPROS'] = prostate['DPROS'].asfactor()
>>> pros_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True)
>>> pros_pca.train(x=prostate.names, training_frame=prostate)
>>> pros_pca.show()
init_for_pipeline()[source]

Returns an H2OPCA object that implements the fit and transform methods, so that it can be used in sklearn.Pipeline. All parameters defined in self.__params should be input parameters of the H2OPCA.__init__ method.

Returns

H2OPCA object

Examples

>>> from sklearn.pipeline import Pipeline
>>> from h2o.transforms.preprocessing import H2OScaler
>>> from h2o.estimators import H2ORandomForestEstimator
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> pipe = Pipeline([("standardize", H2OScaler()),
...                  ("pca", H2OPrincipalComponentAnalysisEstimator(k=2).init_for_pipeline()),
...                  ("rf", H2ORandomForestEstimator(seed=42,ntrees=5))])
>>> pipe.fit(iris[:4], iris[4])
property k

Rank of matrix approximation

Type: int, defaults to 1.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1,
...                                                   transform="standardize",
...                                                   pca_method="power",
...                                                   impute_missing=True,
...                                                   max_iterations=800)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property max_iterations

Maximum training iterations

Type: int, defaults to 1000.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1,
...                                                   transform="standardize",
...                                                   pca_method="power",
...                                                   impute_missing=True,
...                                                   max_iterations=800)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1,
...                                                   transform="standardize",
...                                                   pca_method="power",
...                                                   impute_missing=True,
...                                                   max_iterations=800,
...                                                   max_runtime_secs=15)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property pca_impl

Specify the implementation to use for computing PCA (via SVD or EVD): MTJ_EVD_DENSEMATRIX - eigenvalue decompositions for dense matrix using MTJ; MTJ_EVD_SYMMMATRIX - eigenvalue decompositions for symmetric matrix using MTJ; MTJ_SVD_DENSEMATRIX - singular-value decompositions for dense matrix using MTJ; JAMA - eigenvalue decompositions for dense matrix using JAMA. References: JAMA - http://math.nist.gov/javanumerics/jama/; MTJ - https://github.com/fommil/matrix-toolkits-java/

Type: Literal["mtj_evd_densematrix", "mtj_evd_symmmatrix", "mtj_svd_densematrix", "jama"].

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3,
...                                                   pca_impl="jama",
...                                                   impute_missing=True,
...                                                   max_iterations=1200)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property pca_method

Specify the algorithm to use for computing the principal components: GramSVD - uses a distributed computation of the Gram matrix, followed by a local SVD; Power - computes the SVD using the power iteration method (experimental); Randomized - uses randomized subspace iteration method; GLRM - fits a generalized low-rank model with L2 loss function and no regularization and solves for the SVD using local matrix algebra (experimental)

Type: Literal["gram_s_v_d", "power", "randomized", "glrm"], defaults to "gram_s_v_d".

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1,
...                                                   transform="standardize",
...                                                   pca_method="power",
...                                                   impute_missing=True,
...                                                   max_iterations=800)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
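
For intuition about the "power" option, here is a minimal pure-Python sketch of power iteration, which finds the leading eigenpair of a small symmetric matrix. It illustrates the general technique only and is not H2O's distributed implementation. The function name `power_iteration` and the example matrix are made up for this sketch.

```python
import math


def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue/eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0] * n  # arbitrary non-zero starting vector
    for _ in range(iterations):
        # Multiply by the matrix, then rescale to unit length.
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient v^T A v (v has unit length) estimates the eigenvalue.
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vec[i] * mv[i] for i in range(n))
    return eigenvalue, vec


# A small symmetric matrix standing in for a Gram/covariance matrix.
gram = [[2.0, 1.0], [1.0, 3.0]]
value, vector = power_iteration(gram)
print(value, vector)
```

The returned vector plays the role of the first principal direction. Further components would require deflating the matrix and repeating, which this sketch omits.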
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3,
...                                                   score_each_iteration=True,
...                                                   seed=1234,
...                                                   impute_missing=True)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property seed

RNG seed for initialization

Type: int, defaults to -1.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3,
...                                                   seed=1234,
...                                                   impute_missing=True)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator()
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property transform

Transformation of training data

Type: Literal["none", "standardize", "normalize", "demean", "descale"], defaults to "none".

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1,
...                                                   transform="standardize",
...                                                   pca_method="power",
...                                                   impute_missing=True,
...                                                   max_iterations=800)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
property use_all_factor_levels

Whether first factor level is included in each categorical expansion

Type: bool, defaults to False.

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3,
...                                                   use_all_factor_levels=True,
...                                                   seed=1234)
>>> data_pca.train(x=data.names, training_frame=data)
>>> data_pca.show()
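
To illustrate what including or omitting the first factor level means, the sketch below expands a categorical column into indicator columns in plain Python. It is an assumption-laden illustration rather than H2O's encoder: the helper `expand_categorical` is hypothetical, and levels are assumed to be ordered lexicographically with the first one dropped when the option is off.

```python
def expand_categorical(values, use_all_factor_levels=False):
    """Return (kept_levels, indicator_rows) for a list of category labels."""
    levels = sorted(set(values))
    # With the option off, the first level becomes the implicit baseline.
    kept = levels if use_all_factor_levels else levels[1:]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


colors = ["blue", "green", "red", "green"]
print(expand_categorical(colors))
print(expand_categorical(colors, use_all_factor_levels=True))
```

With the option off, a row of all zeros stands for the baseline level. With it on, every level gets its own column.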
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip")
>>> train, valid = data.split_frame(ratios=[.8], seed=1234)
>>> model_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True)
>>> model_pca.train(x=data.names,
...                training_frame=train,
...                validation_frame=valid)
>>> model_pca.show()

Miscellaneous

automl

H2OAutoML

class h2o.automl.H2OAutoML(nfolds=-1, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_runtime_secs=None, max_runtime_secs_per_model=None, max_models=None, distribution='AUTO', stopping_metric='AUTO', stopping_tolerance=None, stopping_rounds=3, seed=None, project_name=None, exclude_algos=None, include_algos=None, exploitation_ratio=-1, modeling_plan=None, preprocessing=None, monotone_constraints=None, keep_cross_validation_predictions=False, keep_cross_validation_models=False, keep_cross_validation_fold_assignment=False, sort_metric='AUTO', export_checkpoints_dir=None, verbosity='warn', **kwargs)[source]

Bases: h2o.automl._base.H2OAutoMLBaseMixin, h2o.base.Keyed

Automatic Machine Learning

The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. It trains several models, cross-validated by default, by using the following available algorithms:

  • XGBoost

  • GBM (Gradient Boosting Machine)

  • GLM (Generalized Linear Model)

  • DRF (Distributed Random Forest)

  • XRT (eXtremely Randomized Trees)

  • DeepLearning (Fully Connected Deep Neural Network)

It also applies HPO on the following algorithms:

  • XGBoost

  • GBM

  • DeepLearning

In some cases, there will not be enough time to complete all the algorithms, so some may be missing from the leaderboard. Finally, AutoML also trains several Stacked Ensemble models at various stages during the run. Mainly two kinds of Stacked Ensemble models are trained:

  • one of all available models at time t

  • one of only the best models of each kind at time t.

Note that Stacked Ensemble models are trained only if there isn’t another stacked ensemble with the same base models.

Examples

>>> import h2o
>>> from h2o.automl import H2OAutoML
>>> h2o.init()
>>> # Import a sample binary outcome train/test set into H2O
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
>>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
>>> # Identify the response and set of predictors
>>> y = "response"
>>> x = list(train.columns)  # x is optional when it is simply every column except the response
>>> x.remove(y)
>>> # For binary classification, response should be a factor
>>> train[y] = train[y].asfactor()
>>> test[y] = test[y].asfactor()
>>> # Run AutoML for 30 seconds
>>> aml = H2OAutoML(max_runtime_secs = 30)
>>> aml.train(x = x, y = y, training_frame = train)
>>> # Print Leaderboard (ranked by xval metrics)
>>> aml.leaderboard
>>> # (Optional) Evaluate performance on a test set
>>> perf = aml.leader.model_performance(test)
>>> perf.auc()
property balance_classes
Specify whether to oversample the minority classes to balance the class distribution.

This option can increase the data frame size. This option is only applicable for classification. If the oversampled size of the dataset exceeds the maximum size calculated using the max_after_balance_size parameter, then the majority classes will be undersampled to satisfy the size limit. Defaults to False.

Type: bool

property class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes set to True.

detach()[source]

Detach the Python object from the backend, usually by clearing its key

property distribution
Distribution function used by algorithms that support it; other algorithms use their defaults.

Possible values: “AUTO”, “bernoulli”, “multinomial”, “gaussian”, “poisson”, “gamma”, “tweedie”, “laplace”, “quantile”, “huber”, “custom”, and for parameterized distributions dictionary form is used to specify the parameter, e.g., dict(type="tweedie", tweedie_power=1.5). Defaults to AUTO.

Type: Union[str, dict]

property event_log

Retrieve the backend event log from an H2OAutoML object

Returns

an H2OFrame detailing the events that occurred during AutoML training.

property exclude_algos

List the algorithms to skip during the model-building phase. The full list of options is:

  • "DRF" (Random Forest and Extremely-Randomized Trees)

  • "GLM"

  • "XGBoost"

  • "GBM"

  • "DeepLearning"

  • "StackedEnsemble"

Defaults to None, which means that all appropriate H2O algorithms will be used, if the search stopping criteria allow. Optional. Usage example:

exclude_algos = ["GLM", "DeepLearning", "DRF"]
property exploitation_ratio

The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase. By default, the exploitation phase is 0 (disabled) as this is still experimental; to activate it, it is recommended to try a ratio around 0.1. Note that the current exploitation phase only tries to fine-tune the best XGBoost and the best GBM found during exploration.

property export_checkpoints_dir

Path to a directory where every model will be stored in binary form.

property include_algos

List the algorithms to restrict to during the model-building phase. This can’t be used in combination with exclude_algos param. Defaults to None, which means that all appropriate H2O algorithms will be used, if the search stopping criteria allow. Optional. Usage example:

include_algos = ["GLM", "DeepLearning", "DRF"]
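
The interaction between include_algos and exclude_algos can be sketched in plain Python. This is an illustration of the documented rule only, not the AutoML engine's code: `resolve_algos` is a hypothetical helper, and `ALL_ALGOS` simply copies the option list given under exclude_algos.

```python
ALL_ALGOS = ["DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble"]


def resolve_algos(include_algos=None, exclude_algos=None):
    """Return the algorithms that would remain eligible for a run."""
    if include_algos is not None and exclude_algos is not None:
        # The two options cannot be combined.
        raise ValueError("Use either include_algos or exclude_algos, not both.")
    if include_algos is not None:
        return [a for a in ALL_ALGOS if a in include_algos]
    excluded = exclude_algos or []
    return [a for a in ALL_ALGOS if a not in excluded]


print(resolve_algos(include_algos=["GLM", "DeepLearning", "DRF"]))
print(resolve_algos(exclude_algos=["GLM", "DeepLearning", "DRF"]))
print(resolve_algos())
```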
property keep_cross_validation_fold_assignment

Whether to keep fold assignments in the models. Deleting them will save memory in the H2O cluster. Defaults to False.

property keep_cross_validation_models

Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster. Defaults to False.

property keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models. This needs to be set to True if running the same AutoML object for repeated runs because CV predictions are required to build additional Stacked Ensemble models in AutoML. Defaults to False.

property key
Returns

the unique key representing the object on the backend

property leader

Retrieve the top model from an H2OAutoML object

Returns

an H2O model

Examples

>>> # Set up an H2OAutoML object
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> # Launch an AutoML run
>>> aml.train(y=y, training_frame=train)
>>> # Get the best model in the AutoML Leaderboard
>>> aml.leader
>>>
>>> # Get AutoML object by `project_name`
>>> get_aml = h2o.automl.get_automl(aml.project_name)
>>> # Get the best model in the AutoML Leaderboard
>>> get_aml.leader
property leaderboard

Retrieve the leaderboard from an H2OAutoML object

Returns

an H2OFrame with model ids in the first column and evaluation metric in the second column sorted by the evaluation metric

Examples

>>> # Set up an H2OAutoML object
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> # Launch an AutoML run
>>> aml.train(y=y, training_frame=train)
>>> # Get the AutoML Leaderboard
>>> aml.leaderboard
>>>
>>> # Get AutoML object by `project_name`
>>> get_aml = h2o.automl.get_automl(aml.project_name)
>>> # Get the AutoML Leaderboard
>>> get_aml.leaderboard
property max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0).

Requires balance_classes. Defaults to 5.0.

Type: float

property max_models
Specify the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models.

Defaults to None (disabled: no limitation). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.

Type: int

property max_runtime_secs
Specify the maximum time that the AutoML process will run for.

If both max_runtime_secs and max_models are specified, then the AutoML run will stop as soon as it hits either of these limits. If neither max_runtime_secs nor max_models is specified, then max_runtime_secs dynamically defaults to 3600 seconds (1 hour). Otherwise, it defaults to 0 (no limit).

Type: int
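
The defaulting rule described above can be written out as a small function. This is a sketch of the documented behaviour under the assumption that None means "not specified"; `effective_max_runtime_secs` is a hypothetical name, not part of the H2O API.

```python
def effective_max_runtime_secs(max_runtime_secs=None, max_models=None):
    """Resolve the time budget an AutoML run would use, per the rule above."""
    if max_runtime_secs is not None:
        return max_runtime_secs  # an explicit value always wins
    if max_models is None:
        return 3600              # neither limit given: one-hour default
    return 0                     # only max_models given: no time limit


print(effective_max_runtime_secs())
print(effective_max_runtime_secs(max_models=10))
print(effective_max_runtime_secs(max_runtime_secs=30, max_models=10))
```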

property max_runtime_secs_per_model
Controls the max time the AutoML run will dedicate to each individual model.

Defaults to 0 (disabled: no time limit). Note that models constrained by a time budget are not guaranteed reproducible.

Type: int

property modeling_plan

List of modeling steps to be used by the AutoML engine (they may not all get executed, depending on other constraints). Defaults to None (Expert usage only).

property modeling_steps

Expose the modeling steps effectively used by the AutoML run. This executed plan can be directly reinjected as the modeling_plan property of a new AutoML instance to improve reproducibility across AutoML versions.

Returns

a list of dictionaries representing the effective modeling plan.

property monotone_constraints

A mapping that represents monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
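
As a small illustration, such a mapping can be an ordinary dictionary from column name to +1 or -1. The column names below ("age", "price") are hypothetical placeholders, and `describe` is a helper invented for this sketch to spell out what each entry means.

```python
# Hypothetical column names; replace with columns from your own training frame.
monotone_constraints = {"age": 1, "price": -1}


def describe(constraints):
    """Translate +1 / -1 into the direction each constraint enforces."""
    labels = {1: "increasing", -1: "decreasing"}
    return {column: labels[direction] for column, direction in constraints.items()}


print(describe(monotone_constraints))
```

On a running cluster the dictionary would be passed as the monotone_constraints argument of H2OAutoML.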

property nfolds
Number of folds for k-fold cross-validation.

Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance). Defaults to -1.

Type: int

predict(test_data)[source]

Predict on a dataset.

Parameters

test_data (H2OFrame) – Data on which to make predictions.

Returns

A new H2OFrame of predictions.

Examples

>>> # Set up an H2OAutoML object
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> # Launch an H2OAutoML run
>>> aml.train(y=y, training_frame=train)
>>> # Predict with top model from AutoML Leaderboard on a H2OFrame called 'test'
>>> aml.predict(test)
>>>
>>> # Get AutoML object by `project_name`
>>> get_aml = h2o.automl.get_automl(aml.project_name)
>>> # Predict with top model from AutoML Leaderboard on a H2OFrame called 'test'
>>> get_aml.predict(test)
property preprocessing

List of preprocessing steps to run. Only ["target_encoding"] is currently supported. Experimental.

property project_name
Character string to identify an AutoML project.

Defaults to None, which means a project name will be auto-generated based on the training frame ID. More models can be trained on an existing AutoML project by specifying the same project name in multiple calls to the AutoML function (as long as the same training frame, or a sample, is used in subsequent runs).

Type: str

property seed
Set a seed for reproducibility.

AutoML can only guarantee reproducibility if max_models or early stopping is used because max_runtime_secs is resource limited, meaning that if the resources are not the same between runs, AutoML may be able to train more models on one run vs another. In addition, H2O Deep Learning models are not reproducible by default for performance reasons, so exclude_algos must contain DeepLearning. Defaults to None.

Type: int

property sort_metric

Metric to sort the leaderboard by at the end of an AutoML run. For binomial classification, select from the following options:

  • "auc"

  • "aucpr"

  • "logloss"

  • "mean_per_class_error"

  • "rmse"

  • "mse"

For multinomial classification, select from the following options:

  • "mean_per_class_error"

  • "logloss"

  • "rmse"

  • "mse"

For regression, select from the following options:

  • "deviance"

  • "rmse"

  • "mse"

  • "mae"

  • "rmsle"

Defaults to "AUTO" (This translates to "auc" for binomial classification, "mean_per_class_error" for multinomial classification, "deviance" for regression).
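
The "AUTO" translation stated above amounts to a small lookup. The sketch below restates it in plain Python; resolve_sort_metric and the problem-type labels are hypothetical names chosen for illustration, not H2O identifiers.

```python
# Documented "AUTO" defaults, keyed by problem type.
AUTO_SORT_METRIC = {
    "binomial": "auc",
    "multinomial": "mean_per_class_error",
    "regression": "deviance",
}


def resolve_sort_metric(sort_metric, problem_type):
    """Return the metric the leaderboard would be sorted by."""
    if sort_metric.upper() == "AUTO":
        return AUTO_SORT_METRIC[problem_type]
    return sort_metric  # an explicit choice is kept as given


print(resolve_sort_metric("AUTO", "multinomial"))
```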

property stopping_metric
Specifies the metric to use for early stopping.

The available options are:

  • "AUTO" (This defaults to "logloss" for classification, "deviance" for regression)

  • "deviance"

  • "logloss"

  • "mse"

  • "rmse"

  • "mae"

  • "rmsle"

  • "auc"

  • "aucpr"

  • "lift_top_group"

  • "misclassification"

  • "mean_per_class_error"

  • "r2"

Defaults to "AUTO".

Type: str

property stopping_rounds
Stop training new models in the AutoML run when the option selected for stopping_metric does not improve for the specified number of models.

Improvement is judged on a simple moving average. To disable this feature, set it to 0. Defaults to 3 and must be a non-negative integer.

Type: int
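
One plausible reading of a moving-average stopping rule is sketched below for a metric where lower is better (such as logloss). This is not H2O's exact implementation: the window layout, the comparison, and the name should_stop are assumptions made for illustration, and the metric values are assumed to be positive.

```python
from typing import Sequence


def should_stop(history: Sequence[float], stopping_rounds: int,
                stopping_tolerance: float = 0.001) -> bool:
    """Illustrative early-stopping check for a lower-is-better metric."""
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scores yet

    # Simple moving averages over windows of k consecutive scores.
    averages = [sum(history[i:i + k]) / k
                for i in range(len(history) - k + 1)]

    reference = averages[-(k + 1)]    # average just before the last k windows
    best_recent = min(averages[-k:])  # best average among the last k windows

    # Stop when the relative improvement over the reference is below tolerance.
    return best_recent > reference * (1.0 - stopping_tolerance)


print(should_stop([1.0, 0.9, 0.8, 0.7, 0.6, 0.5], stopping_rounds=3))
print(should_stop([1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], stopping_rounds=3))
```

Averaging over a window smooths out noise in individual scores, so a single lucky or unlucky model is less likely to trigger or delay stopping.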

property stopping_tolerance
Specify the relative tolerance for the metric-based stopping criterion used to stop a grid search and the training of individual models within the AutoML run.

Defaults to 0.001 if the dataset has at least 1 million rows; otherwise it defaults to 1/sqrt(nrows * non-NA-rate), a value determined by the size of the dataset and its non-NA rate.

Type: float
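
The default described above can be written out as a short function. This is a sketch of the documented formula only; default_stopping_tolerance is a hypothetical name, and any additional clamping H2O may apply internally is not modeled here.

```python
import math


def default_stopping_tolerance(nrows: int, non_na_rate: float = 1.0) -> float:
    """Documented default: 0.001 for large data, else 1/sqrt(nrows * rate)."""
    if nrows <= 0 or non_na_rate <= 0:
        raise ValueError("nrows and non_na_rate must be positive")
    if nrows >= 1_000_000:
        return 0.001
    return 1.0 / math.sqrt(nrows * non_na_rate)


print(default_stopping_tolerance(10_000))     # small dataset: looser tolerance
print(default_stopping_tolerance(2_000_000))  # large dataset: 0.001
```

Smaller datasets therefore get a looser tolerance, which reflects the higher variance of metrics computed on fewer rows.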

train(x=None, y=None, training_frame=None, fold_column=None, weights_column=None, validation_frame=None, leaderboard_frame=None, blending_frame=None)[source]

Begins an AutoML task, a background task that automatically builds a number of models with various algorithms and tracks their performance in a leaderboard. At any point in the process you may use H2O’s performance or prediction functions on the resulting models.

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • fold_column – The name or index of the column in training_frame that holds per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds per-row weights.

  • training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold_column or weights_column).

  • validation_frame – H2OFrame with validation data. This argument is ignored unless the user sets nfolds = 0. If cross-validation is turned off, then a validation frame can be specified and used for early stopping of individual models and early stopping of the grid searches. By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored.

  • leaderboard_frame – H2OFrame with test data for scoring the leaderboard. This is optional and if this is set to None (the default), then cross-validation metrics will be used to generate the leaderboard rankings instead.

  • blending_frame – H2OFrame used to train the metalearning algorithm in Stacked Ensembles (instead of relying on cross-validated predicted values). This is optional, but when provided, it is also recommended to disable cross validation by setting nfolds=0 and to provide a leaderboard frame for scoring purposes.

Returns

An H2OAutoML object.

Examples

>>> # Set up an H2OAutoML object
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> # Launch an AutoML run
>>> aml.train(y=y, training_frame=train)
property training_info

Expose the name/value columns of event_log as a simple dictionary, for example start_epoch, stop_epoch, … See event_log() to obtain a description of those key/value pairs.

Returns

a dictionary with event_log[‘name’] column as keys and event_log[‘value’] column as values.

H2OEstimator

class h2o.estimators.estimator_base.H2OEstimator[source]

Bases: h2o.model.model_base.ModelBase

Base class for H2O Estimators.

H2O Estimators implement the following methods for model construction:

  • start() - Top-level user-facing API for asynchronous model build

  • join() - Top-level user-facing API for blocking on async model build

  • train() - Top-level user-facing API for model building.

  • fit() - Used by scikit-learn.

Because H2OEstimator instances are instances of ModelBase, these objects can use the H2O model API.

fit(X, y=None, **params)[source]

Fit an H2O model as part of a scikit-learn pipeline or grid search.

A warning will be issued if a caller other than sklearn attempts to use this method.

Parameters
  • X (H2OFrame) – An H2OFrame consisting of the predictor variables.

  • y (H2OFrame) – An H2OFrame consisting of the response variable.

  • params – Extra arguments.

Returns

The current instance of H2OEstimator for method chaining.

get_params(deep=True)[source]

Obtain parameters for this estimator.

Used primarily for sklearn Pipelines and sklearn grid search.

Parameters

deep – If True, return parameters of all sub-objects that are estimators.

Returns

A dict of parameters

join()[source]

Wait until the job completes.

set_params(**parms)[source]

Used by sklearn for updating parameters during grid search.

Parameters

parms – A dictionary of parameters that will be set on this model.

Returns

self, the current estimator object with the parameters all set as desired.
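
The get_params/set_params pair follows the scikit-learn parameter protocol. The sketch below shows that contract on a toy class; ToyEstimator and its parameters are hypothetical stand-ins, not H2O classes.

```python
class ToyEstimator:
    """Hypothetical stand-in that follows the scikit-learn parameter protocol."""

    def __init__(self, ntrees=50, max_depth=5):
        self.ntrees = ntrees
        self.max_depth = max_depth

    def get_params(self, deep=True):
        # Return the constructor parameters as a plain dict.
        return {"ntrees": self.ntrees, "max_depth": self.max_depth}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError("Unknown parameter: %s" % name)
            setattr(self, name, value)
        return self  # returning self allows method chaining


est = ToyEstimator().set_params(ntrees=100)
print(est.get_params())
```

Returning self from set_params is what lets a grid search clone an estimator, update its parameters, and fit it in one chained expression.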

start(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]

Train the model asynchronously (to block for results call join()).

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.
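
The relationship between start() and join() resembles submitting work to a background executor and later waiting on the result. The sketch below is an analogy using only the Python standard library; ToyAsyncTrainer is hypothetical and does not reflect how H2O schedules jobs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncTrainer:
    """Hypothetical trainer: start() returns immediately, join() blocks."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = None
        self.model = None

    def start(self, data):
        # Submit the work and return without waiting for it.
        self._future = self._pool.submit(self._fit, data)

    def join(self):
        # Block until the background work has finished.
        self.model = self._future.result()
        self._pool.shutdown()
        return self.model

    @staticmethod
    def _fit(data):
        time.sleep(0.05)              # stand-in for real training time
        return sum(data) / len(data)  # the toy "model" is just the mean


trainer = ToyAsyncTrainer()
trainer.start([1, 2, 3])   # other work could happen here
result = trainer.join()
print(result)
```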

train(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, model_id=None, verbose=False)[source]

Train the H2O model.

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

  • max_runtime_secs (float) – Maximum allowed runtime in seconds for model training. Use 0 to disable.

  • verbose (bool) – Print scoring history to stdout. Defaults to False.

train_segments(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, segments=None, segment_models_id=None, parallelism=1, verbose=False)[source]

Trains an H2O model for each segment (subpopulation) of the training dataset.

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

  • max_runtime_secs (float) – Maximum allowed runtime in seconds for each model training. Use 0 to disable. Please note that regardless of how this parameter is set, a model will be built for each input segment. This parameter only affects individual model training.

  • segments – A list of columns to segment by. H2O will group the training (and validation) dataset by the segment-by columns and train a separate model for each segment (group of rows). As an alternative to providing a list of columns, users can also supply an explicit enumeration of segments to build the models for. This enumeration needs to be represented as an H2OFrame.

  • segment_models_id – Identifier for the returned collection of Segment Models. If not specified it will be automatically generated.

  • parallelism – Level of parallelism for building the segment models in bulk; this is the maximum number of models each H2O node will build in parallel.

  • verbose (bool) – Enable to print additional information during model building. Defaults to False.
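
Conceptually, segment training groups the rows by the segment-by columns and fits one model per group. The sketch below shows that idea in plain Python with a toy "model" (the mean response); train_per_segment and the sample rows are hypothetical and do not use the H2O API.

```python
from collections import defaultdict
from statistics import mean


def train_per_segment(rows, segments, y):
    """Group rows by the segment columns and fit one toy 'model' per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in segments)
        groups[key].append(row)
    # The toy model is just the mean response within each segment.
    return {key: mean(r[y] for r in group) for key, group in groups.items()}


rows = [
    {"pclass": 1, "survived": 1},
    {"pclass": 1, "survived": 0},
    {"pclass": 3, "survived": 0},
]
models = train_per_segment(rows, segments=["pclass"], y="survived")
print(models)
```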

Examples

>>> response = "survived"
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic[response] = titanic[response].asfactor()
>>> predictors = ["name","sex","age","sibsp","parch","ticket","fare","cabin"]
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> titanic_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> titanic_models = titanic_gbm.train_segments(segments=["pclass"],
...                                             x=predictors,
...                                             y=response,
...                                             training_frame=train,
...                                             validation_frame=valid)
>>> titanic_models.as_frame()

H2OSingularValueDecompositionEstimator

class h2o.estimators.svd.H2OSingularValueDecompositionEstimator(model_id=None, training_frame=None, validation_frame=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, transform='none', svd_method='gram_s_v_d', nv=1, max_iterations=1000, seed=-1, keep_u=True, u_name=None, use_all_factor_levels=True, max_runtime_secs=0.0, export_checkpoints_dir=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Singular Value Decomposition

property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> checkpoints_dir = tempfile.mkdtemp()
>>> fit_h2o = H2OSingularValueDecompositionEstimator(export_checkpoints_dir=checkpoints_dir,
...                                                  seed=-5)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> len(listdir(checkpoints_dir))
property ignore_const_cols

Ignore constant columns.

Type: bool, defaults to True.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(ignore_const_cols=False,
...                                                  nv=4)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property ignored_columns

Names of columns to ignore for training.

Type: List[str].

init_for_pipeline()[source]

Returns an H2OSVD object that implements the fit and transform methods so that it can be used in sklearn.Pipeline. All parameters defined in self.__params should be input parameters of the H2OSVD.__init__ method.

Returns

H2OSVD object

Examples

>>> from h2o.transforms.preprocessing import H2OScaler
>>> from h2o.estimators import H2ORandomForestEstimator
>>> from h2o.estimators import H2OSingularValueDecompositionEstimator
>>> from sklearn.pipeline import Pipeline
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> pipe = Pipeline([("standardize", H2OScaler()),
...                  ("svd", H2OSingularValueDecompositionEstimator(nv=3).init_for_pipeline()),
...                  ("rf", H2ORandomForestEstimator(seed=42,ntrees=50))])
>>> pipe.fit(arrests[1:], arrests[0])
property keep_u

Save left singular vectors?

Type: bool, defaults to True.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(keep_u=False)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property max_iterations

Maximum iterations

Type: int, defaults to 1000.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4,
...                                                  transform="standardize",
...                                                  max_iterations=2000)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4,
...                                                  transform="standardize",
...                                                  max_runtime_secs=25)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property nv

Number of right singular vectors

Type: int, defaults to 1.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4,
...                                                  transform="standardize",
...                                                  max_iterations=2000)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property score_each_iteration

Whether to score during each iteration of model training.

Type: bool, defaults to False.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4,
...                                                  score_each_iteration=True)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property seed

RNG seed for k-means++ initialization

Type: int, defaults to -1.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, seed=-3)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property svd_method

Method for computing SVD (Caution: Randomized is currently experimental and unstable)

Type: Literal["gram_s_v_d", "power", "randomized"], defaults to "gram_s_v_d".
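
For intuition about the "power" option, the sketch below runs a textbook power iteration on A^T A to recover the largest singular value and its right singular vector. It is a generic illustration in plain Python, not H2O's implementation; top_singular_pair is a hypothetical name, the matrix is assumed to be nonzero, and the start vector is assumed not to be orthogonal to the leading singular vector.

```python
import math


def top_singular_pair(matrix, iterations=100):
    """Power iteration on A^T A; returns (singular value, right singular vector)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # deterministic unit start vector
    for _ in range(iterations):
        # w = A^T (A v), computed without forming A^T A explicitly.
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * av[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The singular value is the length of A v for the converged unit vector v.
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    return sigma, v


sigma, v = top_singular_pair([[3.0, 0.0], [0.0, 1.0]])
print(sigma, v)
```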

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(svd_method="power")
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator()
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property transform

Transformation of training data

Type: Literal["none", "standardize", "normalize", "demean", "descale"], defaults to "none".

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4,
...                                                  transform="standardize",
...                                                  max_iterations=2000)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property u_name

Frame key to save left singular vectors

Type: str.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(u_name="fit_h2o")
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o.u_name
>>> fit_h2o
property use_all_factor_levels

Whether first factor level is included in each categorical expansion

Type: bool, defaults to True.

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(use_all_factor_levels=False)
>>> fit_h2o.train(x=list(range(4)), training_frame=arrests)
>>> fit_h2o
property validation_frame

Id of the validation data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> train, valid = arrests.split_frame(ratios=[.8])
>>> fit_h2o = H2OSingularValueDecompositionEstimator()
>>> fit_h2o.train(x=list(range(4)),
...               training_frame=train,
...               validation_frame=valid)
>>> fit_h2o

H2OWord2vecEstimator

class h2o.estimators.word2vec.H2OWord2vecEstimator(model_id=None, training_frame=None, min_word_freq=5, word_model='skip_gram', norm_model='hsm', vec_size=100, window_size=5, sent_sample_rate=0.001, init_learning_rate=0.025, epochs=5, pre_trained=None, max_runtime_secs=0.0, export_checkpoints_dir=None)[source]

Bases: h2o.estimators.estimator_base.H2OEstimator

Word2Vec

property epochs

Number of training iterations to run

Type: int, defaults to 5.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", count = 5)
>>> print(synonyms)
>>>
>>> w2v_model2 = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 1)
>>> w2v_model2.train(training_frame=words)
>>> synonyms2 = w2v_model2.find_synonyms("teacher", 3)
>>> print(synonyms2)
property export_checkpoints_dir

Automatically export generated models to this directory.

Type: str.

Examples

>>> import tempfile
>>> from os import listdir
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> checkpoints_dir = tempfile.mkdtemp()
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=1,
...                                  max_runtime_secs=10,
...                                  export_checkpoints_dir=checkpoints_dir)
>>> w2v_model.train(training_frame=words)
>>> len(listdir(checkpoints_dir))
static from_external(external=<class 'h2o.frame.H2OFrame'>)[source]

Creates new H2OWord2vecEstimator based on an external model.

Parameters

external – H2OFrame with an external model

Returns

H2OWord2vecEstimator instance representing the external model

Examples

>>> words = h2o.create_frame(rows=10, cols=1,
...                          string_fraction=1.0,
...                          missing_fraction=0.0)
>>> embeddings = h2o.create_frame(rows=10, cols=100,
...                               real_fraction=1.0,
...                               missing_fraction=0.0)
>>> word_embeddings = words.cbind(embeddings)
>>> w2v_model = H2OWord2vecEstimator.from_external(external=word_embeddings)
property init_learning_rate

Set the starting learning rate

Type: float, defaults to 0.025.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=3, init_learning_rate=0.05)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("assistant", 3)
>>> print(synonyms)
property max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

Type: float, defaults to 0.0.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=1, max_runtime_secs=10)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("tutor", 3)
>>> print(synonyms)
property min_word_freq

Discard words that appear fewer than <int> times.

Type: int, defaults to 5.
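
The effect of a minimum word frequency can be sketched with a simple count-and-filter step. This is an illustration in plain Python; filter_rare_words and the sample tokens are hypothetical, and the sketch assumes that a word appearing exactly min_word_freq times is kept.

```python
from collections import Counter


def filter_rare_words(tokens, min_word_freq=5):
    """Keep only tokens whose total count reaches min_word_freq."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_word_freq]


tokens = ["teacher"] * 5 + ["tutor"] * 4 + ["aide"]
kept = filter_rare_words(tokens, min_word_freq=5)
print(sorted(set(kept)))
```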

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=1, min_word_freq=4)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", 3)
>>> print(synonyms)
property norm_model

Use Hierarchical Softmax

Type: Literal["hsm"], defaults to "hsm".

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=1, norm_model="hsm")
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", 3)
>>> print(synonyms)
property pre_trained

Id of a data frame that contains a pre-trained (external) word2vec model

Type: Union[None, str, H2OFrame].

Examples

>>> words = h2o.create_frame(rows=1000,cols=1,
...                          string_fraction=1.0,
...                          missing_fraction=0.0)
>>> embeddings = h2o.create_frame(rows=1000,cols=100,
...                               real_fraction=1.0,
...                               missing_fraction=0.0)
>>> word_embeddings = words.cbind(embeddings)
>>> w2v_model = H2OWord2vecEstimator(pre_trained=word_embeddings)
>>> w2v_model.train(training_frame=word_embeddings)
>>> model_id = w2v_model.model_id
>>> model = h2o.get_model(model_id)
property sent_sample_rate
Set threshold for occurrence of words.

Words that appear with higher frequency in the training data will be randomly down-sampled; useful range is (0, 1e-5).

Type: float, defaults to 0.001.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=1, sent_sample_rate=0.01)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", 3)
>>> print(synonyms)
property training_frame

Id of the training data frame.

Type: Union[None, str, H2OFrame].

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator()
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("tutor", 3)
>>> print(synonyms)
property vec_size

Set size of word vectors

Type: int, defaults to 100.

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=3, vec_size=50)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("tutor", 3)
>>> print(synonyms)
property window_size

Set max skip length between words

Type: int, defaults to 5.
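
To make the window concrete, the sketch below lists the (center, context) pairs a skip-gram model would see for a given window size. It uses a fixed window for simplicity, whereas word2vec implementations often sample a smaller window at random; skip_gram_pairs is a hypothetical helper, not an H2O function.

```python
def skip_gram_pairs(tokens, window_size=5):
    """Return (center, context) pairs within window_size positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skip_gram_pairs(["math", "teacher", "needed", "now"], window_size=1)
print(pairs)
```

A larger window pulls in more distant context words, which tends to capture topical similarity at the cost of more training pairs.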

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=3, window_size=2)
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("teacher", 3)
>>> print(synonyms)
property word_model

The word model to use (SkipGram or CBOW)

Type: Literal["skip_gram", "cbow"], defaults to "skip_gram".

Examples

>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), 
...                               col_names = ["category", "jobtitle"], 
...                               col_types = ["string", "string"], 
...                               header = 1)
>>> words = job_titles.tokenize(" ")
>>> w2v_model = H2OWord2vecEstimator(epochs=3, word_model="skip_gram")
>>> w2v_model.train(training_frame=words)
>>> synonyms = w2v_model.find_synonyms("assistant", 3)
>>> print(synonyms)

H2OGridSearch

class h2o.grid.H2OGridSearch(model, hyper_params, grid_id=None, search_criteria=None, export_checkpoints_dir=None, recovery_dir=None, parallelism=1)[source]

Bases: h2o.grid.grid_search.H2OGridSearch

Grid Search of a Hyper-Parameter Space for a Model
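
An exhaustive grid search trains one model per combination of hyperparameter values. The sketch below enumerates those combinations in plain Python, assuming a Cartesian search over every listed value; expand_grid is a hypothetical helper, not part of the H2O API.

```python
from itertools import product


def expand_grid(hyper_params):
    """Enumerate every combination of the listed hyperparameter values."""
    names = list(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


combos = expand_grid({"alpha": [0.01, 0.5], "lambda": [1e-5, 1e-6]})
for combo in combos:
    print(combo)
```

The number of models grows as the product of the list lengths, so two values for each of two parameters yields four models.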

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> hyper_parameters = {'alpha': [0.01,0.5], 'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> training_data = h2o.import_file("smalldata/logreg/benign.csv")
>>> gs.train(x=list(range(3)) + list(range(4, 11)), y=3, training_frame=training_data)
>>> gs.show()
aic(train=False, valid=False, xval=False)[source]

Get the AIC(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AIC value for the training data.

  • valid (bool) – If valid is True, then return the AIC value for the validation data.

  • xval (bool) – If xval is True, then return the AIC value for the cross-validation data.

Returns

The AIC.
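
The train/valid/xval flags shared by the metric accessors follow a simple selection rule, sketched below in plain Python. select_metric and the sample values are hypothetical; the single-flag case (returning a bare value) is an assumption, since the text above only describes the no-flag and multi-flag cases.

```python
def select_metric(metrics, train=False, valid=False, xval=False):
    """Pick metric values according to the train/valid/xval flags."""
    flags = (("train", train), ("valid", valid), ("xval", xval))
    requested = [name for name, flag in flags if flag]
    if not requested:
        return metrics["train"]       # default: the training metric
    if len(requested) == 1:
        return metrics[requested[0]]  # assumed: one flag returns a bare value
    return {name: metrics[name] for name in requested}


metrics = {"train": 0.9, "valid": 0.8, "xval": 0.7}
print(select_metric(metrics))
print(select_metric(metrics, train=True, xval=True))
```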

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate[2] = prostate[2].asfactor()
>>> prostate[4] = prostate[4].asfactor()
>>> prostate[5] = prostate[5].asfactor()
>>> prostate[8] = prostate[8].asfactor()
>>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> response = "CAPSULE"
>>> hyper_params = {'alpha': [0.01,0.5],
...                 'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=predictors, y=response, training_frame=prostate)
>>> gs.aic()
auc(train=False, valid=False, xval=False)[source]

Get the AUC(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the AUC value for the training data.

  • valid (bool) – If valid is True, then return the AUC value for the validation data.

  • xval (bool) – If xval is True, then return the AUC value for the cross-validation data.

Returns

The AUC.

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> data = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
>>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
>>> x = data.columns
>>> y = "response"
>>> x.remove(y)
>>> data[y] = data[y].asfactor()
>>> test[y] = test[y].asfactor()
>>> ss = data.split_frame(seed = 1)
>>> train = ss[0]
>>> valid = ss[1]
>>> gbm_params1 = {'learn_rate': [0.01, 0.1],
...                 'max_depth': [3, 5, 9],
...                 'sample_rate': [0.8, 1.0],
...                 'col_sample_rate': [0.2, 0.5, 1.0]}
>>> gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator,
...                           grid_id='gbm_grid1',
...                           hyper_params=gbm_params1)
>>> gbm_grid1.train(x=x, y=y,
...                 training_frame=train,
...                 validation_frame=valid,
...                 ntrees=100,
...                 seed=1)
>>> gbm_gridperf1 = gbm_grid1.get_grid(sort_by='auc', decreasing=True)
>>> best_gbm1 = gbm_gridperf1.models[0]
>>> best_gbm_perf1 = best_gbm1.model_performance(test)
>>> best_gbm_perf1.auc()
aucpr(train=False, valid=False, xval=False)[source]

Get the AUCPR (area under the precision-recall curve).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the aucpr value for the training data.

  • valid (bool) – If valid is True, then return the aucpr value for the validation data.

  • xval (bool) – If xval is True, then return the aucpr value for the cross-validation data.

Returns

The AUCPR for the models in this grid.

biases(vector_id=0)[source]

Return the frame for the respective bias vector.

Parameters

vector_id – an integer, ranging from 0 to number of layers, that specifies the bias vector to return.

Returns

an H2OFrame which represents the bias vector identified by vector_id

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> hh = H2ODeepLearningEstimator(hidden=[],
...                               loss="CrossEntropy",
...                               export_weights_and_biases=True)
>>> hh.train(x=list(range(4)), y=4, training_frame=iris)
>>> hh.biases(0)
build_model(algo_params)[source]

(internal)

cancel()[source]

Cancel grid execution.

catoffsets()[source]

Categorical offsets for one-hot encoding
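One common way to lay out one-hot encoded categorical columns is to give each column a starting offset equal to the total number of levels in the columns before it. The sketch below illustrates only that idea; `categorical_offsets` and `one_hot_index` are hypothetical helpers, and the layout is an assumption rather than a statement of how H2O stores its offsets.

```python
from itertools import accumulate


def categorical_offsets(cardinalities):
    """Return the cumulative start position of each categorical column.

    The final entry is the total width of the one-hot encoded block.
    """
    return [0] + list(accumulate(cardinalities))


def one_hot_index(offsets, column, level):
    """Position of `level` of categorical `column` in the expanded vector."""
    return offsets[column] + level


# Three categorical columns with 3, 2 and 4 levels respectively.
offsets = categorical_offsets([3, 2, 4])
print(offsets)
print(one_hot_index(offsets, column=2, level=1))
```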

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> hh = H2ODeepLearningEstimator(hidden=[],
...                               loss="CrossEntropy",
...                               export_weights_and_biases=True)
>>> hh.train(x=list(range(4)), y=4, training_frame=iris)
>>> hh.catoffsets()
coef()[source]

Return the coefficients that can be applied to the non-standardized data.

Note: standardize = True by default. If set to False, then coef() returns the coefficients that are fit directly.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.coef()
coef_norm()[source]

Return coefficients fitted on the standardized data (requires standardize = True, which is on by default). These coefficients can be used to evaluate variable importance.
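The relationship between coefficients fitted on standardized data and coefficients that apply to the original data can be sketched as follows. The sketch assumes each predictor was standardized as (x - mean) / sd; `destandardize` is a hypothetical helper and not part of the H2O API.

```python
def destandardize(intercept_norm, coefs_norm, means, sds):
    """Convert standardized-scale coefficients to original-scale ones.

    Assumes each predictor was transformed as (x - mean) / sd before fitting.
    """
    coefs = [b / s for b, s in zip(coefs_norm, sds)]
    shift = sum(b * m / s for b, m, s in zip(coefs_norm, means, sds))
    return intercept_norm - shift, coefs


# Illustrative numbers only.
means, sds = [10.0, 2.0], [5.0, 0.5]
intercept_norm, coefs_norm = 0.2, [1.5, -0.4]
intercept, coefs = destandardize(intercept_norm, coefs_norm, means, sds)

row = [12.0, 3.0]
row_std = [(x - m) / s for x, m, s in zip(row, means, sds)]

# Both parameterizations give the same linear predictor for the same row.
eta_norm = intercept_norm + sum(b * z for b, z in zip(coefs_norm, row_std))
eta_raw = intercept + sum(b * x for b, x in zip(coefs, row))
print(eta_norm, eta_raw)
```

Because the standardized coefficients share a common scale, their magnitudes are comparable across predictors, which is why they are the ones used to judge variable importance.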

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.coef_norm()
deepfeatures(test_data, layer)[source]

Obtain a hidden layer’s details on a dataset.

Parameters
  • test_data – Data to create a feature space on.

  • layer (int) – Index of the hidden layer.

Returns

A dictionary of hidden layer details for each model.

Examples

>>> from h2o.estimators import H2OAutoEncoderEstimator
>>> resp = 784
>>> nfeatures = 20
>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> train[resp] = train[resp].asfactor()
>>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> test[resp] = test[resp].asfactor()
>>> sid = train[0].runif(0)
>>> train_unsup = train[sid >= 0.5]
>>> train_unsup.pop(resp)
>>> train_sup = train[sid < 0.5]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[nfeatures],
...                                    model_id="ae_model",
...                                    epochs=1,
...                                    ignore_const_cols=False,
...                                    reproducible=True,
...                                    seed=1234)
>>> ae_model.train(list(range(resp)), training_frame=train_unsup)
>>> ae_model.deepfeatures(train_sup[0:resp], 0)
detach()[source]

Detach the Python object from the backend, usually by clearing its key

property failed_params

Return a list of failed parameters.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6],
...                     'beta_epsilon': [0.05]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.failed_params
get_grid(sort_by=None, decreasing=None)[source]

Retrieve an H2OGridSearch instance.

Optionally specify a metric by which to sort models and a sort order. Note that if neither cross-validation nor a validation frame is used in the grid search, then the training metrics will display in the “get grid” output. If a validation frame is passed to the grid, and nfolds = 0, then the validation metrics will display. However, if nfolds > 1, then cross-validation metrics will display even if a validation frame is provided.

Parameters
  • sort_by (str) – A metric by which to sort the models in the grid space. Choices are: "logloss", "residual_deviance", "mse", "auc", "r2", "accuracy", "precision", "recall", "f1", etc.

  • decreasing (bool) – Sort the models in decreasing order of metric if true, otherwise sort in increasing order (default).

Returns

A new H2OGridSearch instance optionally sorted on the specified metric.
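The rule for which metrics appear in the sorted grid can be summarized in a small decision function. This is a sketch of the rule as described above, not H2O code; `displayed_metrics` is a hypothetical name.

```python
def displayed_metrics(has_validation_frame, nfolds=0):
    """Return which metric set the grid output would show.

    Mirrors the documented rule: cross-validation metrics take precedence
    when nfolds > 1, then validation metrics, then training metrics.
    """
    if nfolds > 1:
        return "cross-validation"
    if has_validation_frame:
        return "validation"
    return "training"


print(displayed_metrics(has_validation_frame=False))
print(displayed_metrics(has_validation_frame=True))
print(displayed_metrics(has_validation_frame=True, nfolds=5))
```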

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.get_grid(sort_by='F1', decreasing=True)
get_hyperparams(id, display=True)[source]

Get the hyperparameters of a model explored by grid search.

Parameters
  • id (str) – The model id of the model with hyperparameters of interest.

  • display (bool) – Flag to indicate whether to display the hyperparameter names.

Returns

A list of the hyperparameters for the specified model.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> best_model_id = gs.get_grid(sort_by='F1',
...                             decreasing=True).model_ids[0]
>>> gs.get_hyperparams(best_model_id)
get_hyperparams_dict(id, display=True)[source]

Derive and return the model parameters used to train a particular grid search model.

Parameters
  • id (str) – The model id of the model with hyperparameters of interest.

  • display (bool) – Flag to indicate whether to display the hyperparameter names.

Returns

A dict of model parameters derived from the hyper-parameters used to train this particular model.
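Each model in a grid corresponds to one combination of hyperparameter values. Assuming an exhaustive (Cartesian) search, the per-model parameter dictionaries can be enumerated as sketched below; `expand_grid` is a hypothetical helper used only for illustration.

```python
from itertools import product


def expand_grid(hyper_params):
    """List every combination of hyperparameter values as a dict."""
    names = sorted(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


hyper_params = {'alpha': [0.01, 0.3, 0.5],
                'lambda': [1e-5, 1e-6, 1e-7]}
combos = expand_grid(hyper_params)
print(len(combos))
print(combos[0])
```

With three values for each of two hyperparameters, the grid contains nine candidate models.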

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> best_model_id = gs.get_grid(sort_by='F1',
...                             decreasing=True).model_ids[0]
>>> gs.get_hyperparams_dict(best_model_id)
get_xval_models(key=None)[source]

Return a Model object.

Parameters

key (str) – If None, return all cross-validated models; otherwise return the model specified by the key.

Returns

A model or a list of models.

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> fr = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
>>> m = H2OGradientBoostingEstimator(nfolds=10,
...                                  ntrees=10,
...                                  keep_cross_validation_models=True)
>>> m.train(x=list(range(2,fr.ncol)), y=1, training_frame=fr)
>>> m.get_xval_models()
gini(train=False, valid=False, xval=False)[source]

Get the Gini Coefficient(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Gini Coefficient value for the training data.

  • valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.

  • xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.

Returns

The Gini Coefficient for the models in this grid.
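For binary classifiers, the Gini coefficient is commonly derived from the AUC as Gini = 2 * AUC - 1. The sketch below assumes that convention and computes the AUC by comparing every positive/negative pair of scores; `auc` and `gini` here are illustrative stand-alone functions, not the H2O methods of the same name.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. Requires at least one positive
    and one negative label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient under the assumed convention 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), gini(labels, scores))
```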

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.gini()
property grid_id

A key that identifies this grid search object in H2O.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.grid_id
property hyper_names

Return the hyperparameter names.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.hyper_names
is_cross_validated()[source]

Return True if the model was cross-validated.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.is_cross_validated()
join()[source]

Wait until grid finishes computing.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), hyper_params)
>>> gs.start(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.join()
property key
Returns

the unique key representing the object on the backend

logloss(train=False, valid=False, xval=False)[source]

Get the Log Loss(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Log Loss value for the training data.

  • valid (bool) – If valid is True, then return the Log Loss value for the validation data.

  • xval (bool) – If xval is True, then return the Log Loss value for the cross validation data.

Returns

The Log Loss for this binomial model.
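As a reminder of what the metric measures, binomial log loss is the average negative log-likelihood of the true labels under the predicted probabilities. The following is a minimal stand-alone sketch; the function name `log_loss` and the clipping constant `eps` are assumptions for illustration, not H2O internals.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels.

    `probs` holds the predicted probability of class 1. Probabilities are
    clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


print(log_loss([1, 0], [0.9, 0.1]))
```

Lower values are better; a model that assigns probability 1 to every correct class approaches a log loss of 0.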

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.logloss()
mean_residual_deviance(train=False, valid=False, xval=False)[source]

Get the Mean Residual Deviance(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.

  • valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.

  • xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.

Returns

The Mean Residual Deviance for this regression model.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.mean_residual_deviance()
property model_ids

Returns model ids.

Examples

>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)    
>>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data)
>>> gs.model_ids
model_performance(test_data=None, train=False, valid=False, xval=False)[source]

Generate model metrics for this model on test_data.

Parameters
  • test_data – Data set against which model metrics will be computed. The train, valid, and xval arguments are all ignored if test_data is not None.

  • train – Report the training metrics for the model.

  • valid – Report the validation metrics for the model.

  • xval – Report the cross-validation metrics for the model.

Returns

An object of class H2OModelMetrics.

Examples

>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> data = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
>>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
>>> x = data.columns
>>> y = "response"
>>> x.remove(y)
>>> data[y] = data[y].asfactor()
>>> test[y] = test[y].asfactor()
>>> ss = data.split_frame(seed = 1)
>>> train = ss[0]
>>> valid = ss[1]
>>> gbm_params1 = {'learn_rate': [0.01, 0.1],
...                 'max_depth': [3, 5, 9],
...                 'sample_rate': [0.8, 1.0],
...                 'col_sample_rate': [0.2, 0.5, 1.0]}
>>> gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator,
...                           grid_id='gbm_grid1',
...                           hyper_params=gbm_params1)
>>> gbm_grid1.train(x=x, y=y,
...                 training_frame=train,
...                 validation_frame=valid,
...                 ntrees=100,
...                 seed=1)
>>> gbm_gridperf1 = gbm_grid1.get_grid(sort_by='auc', decreasing=True)
>>> best_gbm1 = gbm_gridperf1.models[0]
>>> best_gbm1.model_performance(test)
mse(train=False, valid=False, xval=False)[source]

Get the MSE(s).

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the MSE value for the training data.

  • valid (bool) – If valid is True, then return the MSE value for the validation data.

  • xval (bool) – If xval is True, then return the MSE value for the cross validation data.

Returns

The MSE for this regression model.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.mse()
normmul()[source]

Normalization/Standardization multipliers for numeric predictors.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.normmul()
normsub()[source]

Normalization/Standardization offsets for numeric predictors.
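Standardization multipliers and offsets are typically applied as (x - offset) * multiplier. The sketch below assumes the offset is the column mean and the multiplier is the reciprocal of the sample standard deviation; the helper names are hypothetical, and the exact statistics H2O uses may differ.

```python
from statistics import mean, stdev


def standardization_params(values):
    """Return (offset, multiplier) for one numeric column.

    Assumption: offset = mean, multiplier = 1 / sample standard deviation.
    """
    return mean(values), 1.0 / stdev(values)


def standardize(values):
    """Apply (x - offset) * multiplier to every value in the column."""
    sub, mul = standardization_params(values)
    return [(v - sub) * mul for v in values]


column = [2.0, 4.0, 6.0]
print(standardization_params(column))
print(standardize(column))
```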

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.normsub()
null_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.

  • xval (bool) – Get the null dof for the cross-validated models.

Returns

the null dof, or None if it is not present.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.null_degrees_of_freedom()
null_deviance(train=False, valid=False, xval=False)[source]

Retrieve the null deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.

  • xval (bool) – Get the null deviance for the cross-validated models.

Returns

the null deviance, or None if it is not present.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.null_deviance()
pprint_coef()[source]

Pretty print the coefficients table (includes normalized coefficients).

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.pprint_coef()
pr_auc()[source]

H2OGridSearch.pr_auc is deprecated, please use H2OGridSearch.aucpr instead.

predict(test_data)[source]

Predict on a dataset.

Parameters

test_data (H2OFrame) – Data to be predicted on.

Returns

H2OFrame filled with predictions.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.predict(benign)
r2(train=False, valid=False, xval=False)[source]

Return the R^2 for this regression model.

The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma^2.
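The definition above can be checked by hand with a few lines of Python. This sketch assumes var is the population variance of the actual response values (dividing by n); the stand-alone function `r2` is illustrative and not the H2O method.

```python
def r2(actual, predicted):
    """Compute 1 - MSE / var for paired actual and predicted values.

    Assumption: var is the population variance of `actual` (divide by n).
    """
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mu = sum(actual) / n
    var = sum((a - mu) ** 2 for a in actual) / n
    return 1.0 - mse / var


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r2(actual, predicted))
```

Here the MSE is 0.025 and the variance is 1.25, so the result is 1 - 0.02 = 0.98 up to floating-point rounding.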

If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.

Parameters
  • train (bool) – If train is True, then return the R^2 value for the training data.

  • valid (bool) – If valid is True, then return the R^2 value for the validation data.

  • xval (bool) – If xval is True, then return the R^2 value for the cross validation data.

Returns

The R^2 for this regression model.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.r2()
residual_degrees_of_freedom(train=False, valid=False, xval=False)[source]

Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.

  • xval (bool) – Get the residual dof for the cross-validated models.

Returns

the residual degrees of freedom, or None if they are not present.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.residual_degrees_of_freedom()
residual_deviance(train=False, valid=False, xval=False)[source]

Retrieve the residual deviance if this model has the attribute, or None otherwise.

Parameters
  • train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.

  • valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.

  • xval (bool) – Get the residual deviance for the cross-validated models.

Returns

the residual deviance, or None if it is not present.
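For a binomial model, deviance is conventionally -2 times the log-likelihood. The residual deviance uses the fitted probabilities, while the null deviance uses an intercept-only model that predicts the overall positive rate for every row. The sketch below follows that textbook definition; the function names are hypothetical, and H2O's handling of weights and offsets is not modeled.

```python
import math


def binomial_deviance(labels, probs):
    """-2 * log-likelihood for 0/1 labels; probs must lie strictly in (0, 1)."""
    return -2.0 * sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                      for y, p in zip(labels, probs))


def null_deviance(labels):
    """Deviance of an intercept-only model predicting the mean positive rate.

    Requires at least one 0 and one 1 in `labels`.
    """
    rate = sum(labels) / len(labels)
    return binomial_deviance(labels, [rate] * len(labels))


labels = [1, 0, 1, 1]
fitted = [0.9, 0.2, 0.8, 0.7]
print(null_deviance(labels))
print(binomial_deviance(labels, fitted))
```

A useful model has a residual deviance well below the null deviance.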

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.residual_deviance()
respmul()[source]

Normalization/Standardization multipliers for numeric response.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.respmul()
respsub()[source]

Normalization/Standardization offsets for numeric response.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.respsub()
resume(recovery_dir=None, **kwargs)[source]

Resume previously stopped grid training.

Parameters

recovery_dir – When specified, the grid and all necessary data (frames, models) will be saved to this directory (use HDFS or other distributed file-system). Should the cluster crash during training, the grid can be reloaded from this directory via h2o.load_grid, and training can be resumed.

scoring_history()[source]

Retrieve model scoring history.

Returns

Score history (H2OTwoDimTable)

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.scoring_history()
show()[source]

Print models sorted by metric.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.show()
sort_by(metric, increasing=True)[source]

Deprecated since 2016-12-12; use grid.get_grid() instead.
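One possible way to migrate a sort_by() call, sketched under the assumption that get_grid() accepts sort_by and decreasing arguments. Note that get_grid() returns a grid object rather than a table, so the result is not a drop-in replacement. The wrapper name sorted_grid is hypothetical.

```python
def sorted_grid(grid, metric, increasing=True):
    """Return the grid re-sorted by `metric`, mirroring sort_by(metric, increasing)."""
    # get_grid() takes a `decreasing` flag, so the sort direction is inverted here.
    return grid.get_grid(sort_by=metric, decreasing=not increasing)
```

For a trained grid gs, sorted_grid(gs, "rmse") would request the models ordered from lowest to highest RMSE.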

sorted_metric_table()[source]

Retrieve summary table of an H2O Grid Search.

Returns

The summary table as an H2OTwoDimTable or a Pandas DataFrame.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.sorted_metric_table()
start(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]

Asynchronous model build by specifying the predictor columns, response column, and any additional frame-specific values.

To block for results, call join().

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), hyper_params)
>>> gs.start(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.join()
summary(header=True)[source]

Print a detailed summary of the explored models.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.summary()
train(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]

Train the model synchronously (i.e., do not return until the model finishes training).

To train asynchronously, call start().

Parameters
  • x – A list of column names or indices indicating the predictor columns.

  • y – An index or a column name indicating the response column.

  • training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

  • offset_column – The name or index of the column in training_frame that holds the offsets.

  • fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.

  • weights_column – The name or index of the column in training_frame that holds the per-row weights.

  • validation_frame – H2OFrame with validation data to be scored on while training.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
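With the default Cartesian search strategy, the hyper_params dictionary in the example above yields one model per combination of values, four in this case. The standard-library sketch below only illustrates that expansion. The helper name expand_grid is hypothetical, and this is not how H2O builds the grid internally.

```python
from itertools import product


def expand_grid(hyper_params):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    # product() walks the Cartesian product of all value lists.
    return [dict(zip(names, values)) for values in product(*value_lists)]


hyper_params = {'huber_alpha': [0.2, 0.5],
                'quantile_alpha': [0.2, 0.6]}

for combo in expand_grid(hyper_params):
    print(combo)
```

Each printed dictionary corresponds to the settings of one candidate model, so adding a third value to either list would raise the count from four to six.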
varimp(use_pandas=False)[source]

Pretty print the variable importances, or return them in a list/pandas DataFrame.

Parameters

use_pandas (bool) – If True, the variable importances are returned as a pandas DataFrame.

Returns

A dictionary of lists or Pandas DataFrame instances.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> insurance["offset"] = insurance["Holders"].log()
>>> insurance["Group"] = insurance["Group"].asfactor()
>>> insurance["Age"] = insurance["Age"].asfactor()
>>> insurance["District"] = insurance["District"].asfactor()
>>> hyper_params = {'huber_alpha': [0.2,0.5],
...                 'quantile_alpha': [0.2,0.6]}
>>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5),
...                    hyper_params)
>>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
>>> gs.varimp(use_pandas=True)
weights(matrix_id=0)[source]

Return the frame for the respective weight matrix.

Parameters

matrix_id (int) – An integer, ranging from 0 to the number of layers, that specifies the weight matrix to return.

Returns

An H2OFrame that represents the weight matrix identified by matrix_id.

Examples

>>> from h2o.estimators import H2ODeepLearningEstimator
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> hh = H2ODeepLearningEstimator(hidden=[],
...                               loss="CrossEntropy",
...                               export_weights_and_biases=True)
>>> hh.train(x=list(range(4)), y=4, training_frame=iris)
>>> hh.weights(0)
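As a rough guide to which matrix_id values make sense, a fully connected network has one weight matrix between each pair of consecutive layers. The sketch below lists those pairs for given layer sizes. The helper weight_matrix_layers is hypothetical, and it assumes every input is numeric (categorical columns are typically expanded into several input units), so the sizes are illustrative rather than a prediction of what weights() reports.

```python
def weight_matrix_layers(n_inputs, hidden, n_outputs):
    """Map each matrix index to the (from, to) sizes of the layers it connects."""
    sizes = [n_inputs] + list(hidden) + [n_outputs]
    # One weight matrix sits between every pair of consecutive layers.
    return {i: (sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)}


# The iris example above: 4 numeric inputs, no hidden layers, 3 classes.
print(weight_matrix_layers(4, [], 3))

# A hypothetical network with two hidden layers of 200 units each.
print(weight_matrix_layers(4, [200, 200], 3))
```

With hidden=[] only index 0 exists, which is why the example above requests weights(0). With two hidden layers, indices 0 through 2 would be available.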
xval_keys()[source]

Model keys for the cross-validated model.

Examples

>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x,y=y, training_frame=benign)
>>> gs.xval_keys()
xvals()[source]

Return the list of cross-validated models.