Modeling In H2O
Supervised
H2OCoxProportionalHazardsEstimator
class h2o.estimators.coxph.H2OCoxProportionalHazardsEstimator(**kwargs)
Bases: h2o.estimators.estimator_base.H2OEstimator
Cox Proportional Hazards
Trains a Cox Proportional Hazards Model (CoxPH) on an H2O dataset.
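Examples:
A quick usage sketch (added here, not part of the original reference; it simply mirrors the property examples below):
>>> from h2o.estimators.coxph import H2OCoxProportionalHazardsEstimator
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                            stop_column="stop")
>>> coxph.train(x="age", y="event", training_frame=heart)
>>> coxph.scoring_history()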
property export_checkpoints_dir
Automatically export generated models to this directory.
Type: str.
Examples:
>>> import tempfile
>>> from os import listdir
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> checkpoints_dir = tempfile.mkdtemp()
>>> coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                            stop_column="stop",
...                                            export_checkpoints_dir=checkpoints_dir)
>>> coxph.train(x=predictor,
...             y=response,
...             training_frame=heart)
>>> len(listdir(checkpoints_dir))
property ignored_columns
Names of columns to ignore for training.
Type: List[str].
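Examples:
A minimal sketch (added here, not part of the original reference); it reuses the heart dataset from the other examples and assumes that file contains a "surgery" column. ignored_columns applies when x is not passed to train():
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  ignored_columns=["surgery"])
>>> heart_coxph.train(y="event", training_frame=heart)
>>> heart_coxph.scoring_history()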
property init
Coefficient starting value.
Type: float (default: 0).
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  init=2.9)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property interaction_pairs
A list of pairwise (first order) column interactions.
Type: List[tuple].
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interaction_pairs = [("start","stop")]
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interaction_pairs=interaction_pairs)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property interactions
A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
Type: List[str].
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interactions = ['start','stop']
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interactions=interactions)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property interactions_only
A list of columns that should only be used to create interactions but should not themselves participate in model training.
Type: List[str].
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> interactions = ['start','stop']
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  interactions_only=interactions)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property lre_min
Minimum log-relative error.
Type: float (default: 9).
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  lre_min=5)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property max_iterations
Maximum number of iterations.
Type: int (default: 20).
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  max_iterations=50)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property offset_column
Offset column. This will be added to the combination of columns before applying the link function.
Type: str.
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  offset_column="transplant")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property response_column
Response variable column.
Type: str.
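Examples:
A minimal sketch (added here, not part of the original reference); it sets the property explicitly and passes a matching y to train():
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.response_column = "event"
>>> heart_coxph.train(x="age", y="event", training_frame=heart)
>>> heart_coxph.scoring_history()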
property single_node_mode
Run on a single node to reduce the effect of network overhead (for smaller datasets).
Type: bool (default: False).
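Examples:
A minimal sketch (added here, not part of the original reference), mirroring the other CoxPH examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  single_node_mode=True)
>>> heart_coxph.train(x="age", y="event", training_frame=heart)
>>> heart_coxph.scoring_history()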
property start_column
Start Time Column.
Type: str.
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()
property stop_column
Stop Time Column.
Type: str.
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()
property stratify_by
List of columns to use for stratification.
Type: List[str].
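Examples:
A minimal sketch (added here, not part of the original reference); it assumes the heart file's "transplant" column (used as an offset elsewhere in these docs) can serve as a categorical stratification column once converted to a factor:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> heart["transplant"] = heart["transplant"].asfactor()
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  stratify_by=["transplant"])
>>> heart_coxph.train(x="age", y="event", training_frame=heart)
>>> heart_coxph.scoring_history()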
property ties
Method for Handling Ties.
One of: "efron", "breslow" (default: "efron").
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  ties="breslow")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()
property training_frame
Id of the training data frame.
Type: H2OFrame.
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> train, valid = heart.split_frame(ratios=[.8])
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> heart_coxph.scoring_history()
property use_all_factor_levels
(Internal. For development only!) Indicates whether to use all factor levels.
Type: bool (default: False).
Examples:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> predictor = "age"
>>> response = "event"
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop",
...                                                  use_all_factor_levels=True)
>>> heart_coxph.train(x=predictor,
...                   y=response,
...                   training_frame=heart)
>>> heart_coxph.scoring_history()
property weights_column
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type: str.
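Examples:
A minimal sketch (added here, not part of the original reference); it creates a constant weight column purely for illustration, using the same scalar-assignment idiom as the ignore_const_cols example below:
>>> heart = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
>>> heart["weight"] = 1  # hypothetical per-row weights; all ones leaves the fit unchanged
>>> heart_coxph = H2OCoxProportionalHazardsEstimator(start_column="start",
...                                                  stop_column="stop")
>>> heart_coxph.train(x="age",
...                   y="event",
...                   training_frame=heart,
...                   weights_column="weight")
>>> heart_coxph.scoring_history()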
H2ODeepLearningEstimator
class h2o.estimators.deeplearning.H2ODeepLearningEstimator(**kwargs)
Bases: h2o.estimators.estimator_base.H2OEstimator
Deep Learning
Build a Deep Neural Network model using CPUs. Builds a feed-forward multilayer artificial neural network on an H2OFrame.
Examples:
>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> rows = [[1,2,3,4,0], [2,1,2,4,1], [2,1,4,2,1],
...         [0,1,2,34,1], [2,3,4,1,0]] * 50
>>> fr = h2o.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2ODeepLearningEstimator()
>>> model.train(x=range(4), y=4, training_frame=fr)
>>> model.logloss()
property activation
Activation function.
One of: "tanh", "tanh_with_dropout", "rectifier", "rectifier_with_dropout", "maxout", "maxout_with_dropout" (default: "rectifier").
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(activation="tanh")
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property adaptive_rate
Adaptive learning rate.
Type: bool (default: True).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(adaptive_rate=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property autoencoder
Auto-Encoder.
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(autoencoder=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property average_activation
Average activation for sparse auto-encoder. #Experimental
Type: float (default: 0).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(average_activation=1.5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property balance_classes
Balance training data class counts via over/under-sampling (for imbalanced data).
Type: bool (default: False).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property categorical_encoding
Encoding scheme for categorical features.
One of: "auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder", "sort_by_response", "enum_limited" (default: "auto").
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> encoding = "one_hot_internal"
>>> airlines_dl = H2ODeepLearningEstimator(categorical_encoding=encoding,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.mse()
property checkpoint
Model checkpoint to resume training with.
Type: str.
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(activation="tanh",
...                                    autoencoder=True,
...                                    seed=1234,
...                                    model_id="cars_dl")
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
>>> cars_cont = H2ODeepLearningEstimator(checkpoint=cars_dl,
...                                      seed=1234)
>>> cars_cont.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> cars_cont.mse()
property class_sampling_factors
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type: List[float].
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   class_sampling_factors=sample_factors,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property classification_stop
Stopping criterion for classification error fraction on training data (-1 to disable).
Type: float (default: 0).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(classification_stop=1.5,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property col_major
#DEPRECATED Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation.
Type: bool (default: False).
property diagnostics
Enable diagnostics for hidden layers.
Type: bool (default: True).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(diagnostics=True,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property distribution
Distribution function.
One of: "auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber" (default: "auto").
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(distribution="poisson",
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property elastic_averaging
Elastic averaging between compute nodes can improve distributed model convergence. #Experimental
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property elastic_averaging_moving_rate
Elastic averaging moving rate (only if elastic averaging is enabled).
Type: float (default: 0.9).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_moving_rate=.8,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property elastic_averaging_regularization
Elastic averaging regularization strength (only if elastic averaging is enabled).
Type: float (default: 0.001).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_regularization=.008,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property epochs
How many times the dataset should be iterated (streamed), can be fractional.
Type: float (default: 10).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(epochs=15,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property epsilon
Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress).
Type: float (default: 1e-08).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(epsilon=1e-6,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property export_checkpoints_dir
Automatically export generated models to this directory.
Type: str.
Examples:
>>> import tempfile
>>> from os import listdir
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> checkpoints_dir = tempfile.mkdtemp()
>>> cars_dl = H2ODeepLearningEstimator(export_checkpoints_dir=checkpoints_dir,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> len(listdir(checkpoints_dir))
property export_weights_and_biases
Whether to export Neural Network weights and biases to H2O Frames.
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(export_weights_and_biases=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property fast_mode
Enable fast mode (minor approximation in back-propagation).
Type: bool (default: True).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(fast_mode=False,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property fold_assignment
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems.
One of: "auto", "random", "modulo", "stratified" (default: "auto").
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(fold_assignment="Random",
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property fold_column
Column with cross-validation fold index assignment per observation.
Type: str.
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
>>> fold_numbers.set_names(["fold_numbers"])
>>> cars = cars.cbind(fold_numbers)
>>> print(cars['fold_numbers'])
>>> cars_dl = H2ODeepLearningEstimator(seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars,
...               fold_column="fold_numbers")
>>> cars_dl.mse()
property force_load_balance
Force extra load balancing to increase training speed for small datasets (to keep all cores busy).
Type: bool (default: True).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(force_load_balance=False,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property hidden
Hidden layer sizes (e.g. [100, 100]).
Type: List[int] (default: [200, 200]).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "cylinders"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(hidden=[100,100],
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.mse()
property hidden_dropout_ratios
Hidden layer dropout ratios (can improve generalization), specify one value per hidden layer, defaults to 0.5.
Type: List[float].
Examples:
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> valid = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> features = list(range(0,784))
>>> target = 784
>>> train[target] = train[target].asfactor()
>>> valid[target] = valid[target].asfactor()
>>> model = H2ODeepLearningEstimator(epochs=20,
...                                  hidden=[200,200],
...                                  hidden_dropout_ratios=[0.5,0.5],
...                                  seed=1234,
...                                  activation='tanh_with_dropout')
>>> model.train(x=features,
...             y=target,
...             training_frame=train,
...             validation_frame=valid)
>>> model.mse()
property huber_alpha
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
Type: float (default: 0.9).
Examples:
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> predictors = insurance.columns[0:4]
>>> response = 'Claims'
>>> insurance['Group'] = insurance['Group'].asfactor()
>>> insurance['Age'] = insurance['Age'].asfactor()
>>> train, valid = insurance.split_frame(ratios=[.8], seed=1234)
>>> insurance_dl = H2ODeepLearningEstimator(distribution="huber",
...                                         huber_alpha=0.9,
...                                         seed=1234)
>>> insurance_dl.train(x=predictors,
...                    y=response,
...                    training_frame=train,
...                    validation_frame=valid)
>>> insurance_dl.mse()
property ignore_const_cols
Ignore constant columns.
Type: bool (default: True).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars["const_1"] = 6
>>> cars["const_2"] = 7
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(seed=1234,
...                                    ignore_const_cols=True)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property ignored_columns
Names of columns to ignore for training.
Type: List[str].
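Examples:
A minimal sketch (added here, not part of the original reference); it assumes the cars file contains a "name" column to exclude. ignored_columns applies when x is not passed to train():
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> cars_dl = H2ODeepLearningEstimator(ignored_columns=["name"],
...                                    seed=1234)
>>> cars_dl.train(y="economy_20mpg", training_frame=cars)
>>> cars_dl.auc()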
property initial_biases
A list of H2OFrame ids to initialize the bias vectors of this model with.
Type: List[H2OFrame].
Examples:
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
...                                export_weights_and_biases=True)
>>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
>>> p1 = dl1.model_performance(iris).logloss()
>>> ll1 = dl1.predict(iris)
>>> print(p1)
>>> w1 = dl1.weights(0)
>>> w2 = dl1.weights(1)
>>> w3 = dl1.weights(2)
>>> b1 = dl1.biases(0)
>>> b2 = dl1.biases(1)
>>> b3 = dl1.biases(2)
>>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
...                                initial_weights=[w1, w2, w3],
...                                initial_biases=[b1, b2, b3],
...                                epochs=0)
>>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
>>> dl2.initial_biases
property initial_weight_distribution
Initial weight distribution.
One of: "uniform_adaptive", "uniform", "normal" (default: "uniform_adaptive").
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(initial_weight_distribution="Uniform",
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property initial_weight_scale
Uniform: -value...value, Normal: stddev.
Type: float (default: 1).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(initial_weight_scale=1.5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property initial_weights
A list of H2OFrame ids to initialize the weight matrices of this model with.
Type: List[H2OFrame].
Examples:
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
>>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
...                                export_weights_and_biases=True)
>>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
>>> p1 = dl1.model_performance(iris).logloss()
>>> ll1 = dl1.predict(iris)
>>> print(p1)
>>> w1 = dl1.weights(0)
>>> w2 = dl1.weights(1)
>>> w3 = dl1.weights(2)
>>> b1 = dl1.biases(0)
>>> b2 = dl1.biases(1)
>>> b3 = dl1.biases(2)
>>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
...                                initial_weights=[w1, w2, w3],
...                                initial_biases=[b1, b2, b3],
...                                epochs=0)
>>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
>>> dl2.initial_weights
property input_dropout_ratio
Input layer dropout ratio (can improve generalization, try 0.1 or 0.2).
Type: float (default: 0).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(input_dropout_ratio=0.2,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property keep_cross_validation_fold_assignment
Whether to keep the cross-validation fold assignment.
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_fold_assignment=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cars_dl.cross_validation_fold_assignment())
property keep_cross_validation_models
Whether to keep the cross-validation models.
Type: bool (default: True).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_models=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> print(cars_dl.cross_validation_models())
property keep_cross_validation_predictions
Whether to keep the predictions of the cross-validation models.
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_predictions=True,
...                                    nfolds=5,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train)
>>> print(cars_dl.cross_validation_predictions())
property l1
L1 regularization (can add stability and improve generalization, causes many weights to become 0).
Type: float (default: 0).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)), y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property l2
L2 regularization (can add stability and improve generalization, causes many weights to be small).
Type: float (default: 0).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l2=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)), y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property loss
Loss function.
One of: "automatic", "cross_entropy", "quadratic", "huber", "absolute", "quantile" (default: "automatic").
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
...                                          activation="Rectifier",
...                                          loss="CrossEntropy",
...                                          hidden=[200,200],
...                                          epochs=1,
...                                          balance_classes=False,
...                                          reproducible=True,
...                                          seed=1234)
>>> hh_imbalanced.train(x=list(range(54)), y=54, training_frame=covtype)
>>> hh_imbalanced.mse()
property max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type: float (default: 5).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> max = .85
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   max_after_balance_size=max,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.logloss()
property max_categorical_features
Max. number of categorical features, enforced via hashing. #Experimental
Type: int (default: 2147483647).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
...                                   max_categorical_features=2147483647,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.logloss()
property max_confusion_matrix_size
[Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.
Type: int (default: 20).
property max_hit_ratio_k
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable).
Type: int (default: 0).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(max_hit_ratio_k=3,
...                                   seed=1234)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.show()
property max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type: float (default: 0).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(max_runtime_secs=10,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
property max_w2
Constraint for squared sum of incoming weights per unit (e.g. for Rectifier).
Type: float (default: 3.4028235e+38).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                   hidden=[10,10],
...                                   epochs=10,
...                                   input_dropout_ratio=0.2,
...                                   l1=1e-5,
...                                   max_w2=10.5,
...                                   stopping_rounds=0)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property mini_batch_size
Mini-batch size (smaller leads to better fit, larger can speed up and generalize better).
Type: int (default: 1).
Examples:
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
>>> covtype[54] = covtype[54].asfactor()
>>> predictors = covtype.columns[0:54]
>>> response = 'C55'
>>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
>>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                   hidden=[10,10],
...                                   epochs=10,
...                                   input_dropout_ratio=0.2,
...                                   l1=1e-5,
...                                   max_w2=10.5,
...                                   stopping_rounds=0,
...                                   mini_batch_size=35)
>>> cov_dl.train(x=predictors,
...              y=response,
...              training_frame=train,
...              validation_frame=valid)
>>> cov_dl.mse()
property missing_values_handling
Handling of missing values. Either MeanImputation or Skip.
One of: "mean_imputation", "skip" (default: "mean_imputation").
Examples:
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston.insert_missing_values()
>>> train, valid = boston.split_frame(ratios=[.8])
>>> boston_dl = H2ODeepLearningEstimator(missing_values_handling="skip")
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property momentum_ramp
Number of training samples for which momentum increases.
Type: float (default: 1000000).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
property momentum_stable
Final momentum after the ramp is over (try 0.99).
Type: float (default: 0).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
property momentum_start
Initial momentum at the beginning of training (try 0.5).
Type: float (default: 0).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
...               "CRSArrTime","UniqueCarrier","FlightNum"]
>>> response_col = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
...                                        activation="Rectifier",
...                                        input_dropout_ratio=0.0,
...                                        momentum_start=0.9,
...                                        momentum_stable=0.99,
...                                        momentum_ramp=1e7,
...                                        epochs=100,
...                                        stopping_rounds=4,
...                                        train_samples_per_iteration=30000,
...                                        mini_batch_size=32,
...                                        score_duty_cycle=0.25,
...                                        score_interval=1)
>>> airlines_dl.train(x=predictors,
...                   y=response_col,
...                   training_frame=airlines)
>>> airlines_dl.mse()
property nesterov_accelerated_gradient
Use Nesterov accelerated gradient (recommended).
Type: bool (default: True).
Examples:
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.model_performance()
property nfolds
Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type: int (default: 0).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(nfolds=5, seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property offset_column
Offset column. This will be added to the combination of columns before applying the link function.
Type: str.
Examples:
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(offset_column="offset",
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property overwrite_with_best_model
If enabled, override the final model with the best model found during training.
Type: bool (default: True).
Examples:
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> boston["offset"] = boston["medv"].log()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(overwrite_with_best_model=True,
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property pretrained_autoencoder
Pretrained autoencoder model to initialize this model with.
Type: str.
Examples:
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> resp = 784
>>> nfeatures = 20
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> sid = train[0].runif(0)
>>> train_unsupervised = train[sid>=0.5]
>>> train_unsupervised.pop(resp)
>>> train_supervised = train[sid<0.5]
>>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
...                                    hidden=[nfeatures],
...                                    model_id="ae_model",
...                                    epochs=1,
...                                    ignore_const_cols=False,
...                                    reproducible=True,
...                                    seed=1234)
>>> ae_model.train(list(range(resp)), training_frame=train_unsupervised)
>>> ae_model.mse()
>>> pretrained_model = H2ODeepLearningEstimator(activation="Tanh",
...                                             hidden=[nfeatures],
...                                             epochs=1,
...                                             reproducible=True,
...                                             seed=1234,
...                                             ignore_const_cols=False,
...                                             pretrained_autoencoder="ae_model")
>>> pretrained_model.train(list(range(resp)), resp,
...                        training_frame=train_supervised,
...                        validation_frame=test)
>>> pretrained_model.mse()
property quantile_alpha
Desired quantile for Quantile regression, must be between 0 and 1.
Type: float (default: 0.5).
Examples:
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> predictors = boston.columns[:-1]
>>> response = "medv"
>>> boston['chas'] = boston['chas'].asfactor()
>>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
>>> boston_dl = H2ODeepLearningEstimator(distribution="quantile",
...                                      quantile_alpha=.8,
...                                      seed=1234)
>>> boston_dl.train(x=predictors,
...                 y=response,
...                 training_frame=train,
...                 validation_frame=valid)
>>> boston_dl.mse()
property quiet_mode
Enable quiet mode for less output to standard output.
Type: bool (default: False).
Examples:
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = titanic.columns
>>> del predictors[1:3]
>>> response = 'survived'
>>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_dl = H2ODeepLearningEstimator(quiet_mode=True,
...                                       seed=1234)
>>> titanic_dl.train(x=predictors,
...                  y=response,
...                  training_frame=train,
...                  validation_frame=valid)
>>> titanic_dl.mse()
property rate
Learning rate (higher => less stable, lower => slower convergence).
Type: float (default: 0.005).
Examples:
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors, y=resp, training_frame=train, validation_frame=test)
>>> model.model_performance(valid=True)
property rate_annealing
Learning rate annealing: rate / (1 + rate_annealing * samples).
Type: float (default: 1e-06).
Examples:
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.mse()
property rate_decay
Learning rate decay factor between layers (N-th layer: rate * rate_decay ^ (n - 1)).
Type: float (default: 1).
Examples:
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
>>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
>>> predictors = list(range(0,784))
>>> resp = 784
>>> train[resp] = train[resp].asfactor()
>>> test[resp] = test[resp].asfactor()
>>> nclasses = train[resp].nlevels()[0]
>>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
...                                  adaptive_rate=False,
...                                  rate=0.01,
...                                  rate_decay=0.9,
...                                  rate_annealing=1e-6,
...                                  momentum_start=0.95,
...                                  momentum_ramp=1e5,
...                                  momentum_stable=0.99,
...                                  nesterov_accelerated_gradient=False,
...                                  input_dropout_ratio=0.2,
...                                  train_samples_per_iteration=20000,
...                                  classification_stop=-1,
...                                  l1=1e-5)
>>> model.train(x=predictors,
...             y=resp,
...             training_frame=train,
...             validation_frame=test)
>>> model.model_performance()
property regression_stop
Stopping criterion for regression error (MSE) on training data (-1 to disable).
Type: float (default: 1e-06).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(regression_stop=1e-6,
...                                        seed=1234)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property replicate_training_data
Replicate the entire training dataset onto every node for faster training on small datasets.
Type: bool (default: True).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> airlines_dl = H2ODeepLearningEstimator(replicate_training_data=False)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=airlines)
>>> airlines_dl.auc()
property reproducible
Force reproducibility on small data (will be slow - only uses 1 thread).
Type: bool (default: False).
Examples:
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
>>> airlines_dl = H2ODeepLearningEstimator(reproducible=True)
>>> airlines_dl.train(x=predictors,
...                   y=response,
...                   training_frame=train,
...                   validation_frame=valid)
>>> airlines_dl.auc()
property response_column
Response variable column.
Type: str.
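Examples:
A minimal sketch (added here, not part of the original reference); it sets the property explicitly and passes a matching y to train():
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_dl = H2ODeepLearningEstimator(seed=1234)
>>> cars_dl.response_column = "economy_20mpg"
>>> cars_dl.train(x=predictors, y="economy_20mpg", training_frame=cars)
>>> cars_dl.auc()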
property rho
Adaptive learning rate time decay factor (similarity to prior updates).
Type: float (default: 0.99).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(rho=0.9,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_duty_cycle
Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).
Type: float (default: 0.1).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_duty_cycle=0.2,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_each_iteration
Whether to score during each iteration of model training.
Type: bool (default: False).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_each_iteration=True,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_interval
Shortest time interval (in seconds) between model scoring.
Type: float (default: 5).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_interval=3,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_training_samples
Number of training set samples for scoring (0 for all).
Type: int (default: 10000).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> cars_dl = H2ODeepLearningEstimator(score_training_samples=10000,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=cars)
>>> cars_dl.auc()
property score_validation_samples
Number of validation set samples for scoring (0 for all).
Type: int (default: 0).
Examples:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_dl = H2ODeepLearningEstimator(score_validation_samples=3,
...                                    seed=1234)
>>> cars_dl.train(x=predictors,
...               y=response,
...               training_frame=train,
...               validation_frame=valid)
>>> cars_dl.auc()
-
property
score_validation_sampling
¶ Method used to sample validation dataset for scoring.
One of:
"uniform"
,"stratified"
(default:"uniform"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(score_validation_sampling="uniform", ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
seed
¶ Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded.
Type:
int
(default:-1
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
shuffle_training_data
¶ Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to #nodes x #rows, or if using balance_classes).
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(shuffle_training_data=True, ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
single_node_mode
¶ Run on a single node for fine-tuning of model parameters.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(single_node_mode=True, ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
sparse
¶ Sparse data handling (more efficient for data with lots of 0 values).
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(sparse=True, ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
sparsity_beta
¶ Sparsity regularization (experimental).
Type:
float
(default:0
).- Examples
>>> from h2o.estimators import H2OAutoEncoderEstimator >>> resp = 784 >>> nfeatures = 20 >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz") >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz") >>> train[resp] = train[resp].asfactor() >>> test[resp] = test[resp].asfactor() >>> sid = train[0].runif(0) >>> train_unsupervised = train[sid>=0.5] >>> train_unsupervised.pop(resp) >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh", ... hidden=[nfeatures], ... epochs=1, ... ignore_const_cols=False, ... reproducible=True, ... sparsity_beta=0.5, ... seed=1234) >>> ae_model.train(list(range(resp)), ... training_frame=train_unsupervised) >>> ae_model.mse()
-
property
standardize
¶ If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars_dl = H2ODeepLearningEstimator(standardize=True, ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_dl.auc()
-
property
stopping_metric
¶ Metric to use for early stopping (AUTO: logloss for classification, deviance for regression, and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
One of:
"auto"
,"deviance"
,"logloss"
,"mse"
,"rmse"
,"mae"
,"rmsle"
,"auc"
,"aucpr"
,"lift_top_group"
,"misclassification"
,"mean_per_class_error"
,"custom"
,"custom_increasing"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
stopping_rounds
¶ Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable).
Type:
int
(default:5
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
stopping_tolerance
¶ Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Type:
float
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
target_ratio_comm_to_comp
¶ Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration = -2 (auto-tuning).
Type:
float
(default:0.05
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(target_ratio_comm_to_comp=0.05, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
train_samples_per_iteration
¶ Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic.
Type:
int
(default:-2
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(train_samples_per_iteration=-1, ... epochs=1, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator() >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
tweedie_power
¶ Tweedie power for Tweedie regression, must be between 1 and 2.
Type:
float
(default:1.5
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(tweedie_power=1.5, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.auc()
-
property
use_all_factor_levels
¶ Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
Type:
bool
(default:True
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(use_all_factor_levels=True, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.mse()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(standardize=True, ... seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_dl.auc()
-
property
variable_importances
¶ Compute variable importances for input features (Gedeon method) - can be slow for large networks.
Type:
bool
(default:True
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_dl = H2ODeepLearningEstimator(variable_importances=True, ... seed=1234) >>> airlines_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_dl.mse()
-
property
weights_column
¶ Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_dl = H2ODeepLearningEstimator(seed=1234) >>> cars_dl.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid, ... weights_column="weight") >>> cars_dl.auc()
H2OGeneralizedAdditiveEstimator
¶
-
class
h2o.estimators.gam.
H2OGeneralizedAdditiveEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Generalized Additive Model
Fits a generalized additive model, specified by a response variable, a set of predictors, and a description of the error distribution.
A subclass of
ModelBase
is returned. The specific subclass depends on the machine learning task at hand (if it's binomial classification, an H2OBinomialModel is returned; if it's regression, an H2ORegressionModel is returned). The default print-out of the models is shown, but further GAM-specific information can be queried out of the object. Upon completion of the GAM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
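- Examples
A minimal usage sketch (illustrative only, not from the original reference; it assumes the prostate dataset used elsewhere on this page, and the choice of gam_columns and knot counts is arbitrary):
>>> from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> # smooth (spline) terms for PSA and AGE; VOL enters linearly
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA", "AGE"],
...                                                num_knots=[5, 5])
>>> prostate_gam.train(x=["VOL"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()
-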
property
Lambda
¶ DEPRECATED. Use
self.lambda_
instead.
-
property
alpha
¶ Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
Type:
List[float]
.
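- Examples
An illustrative sketch (not from the original reference), reusing the hypothetical prostate GAM setup shown under the class description:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA"],
...                                                lambda_=[0.01],
...                                                alpha=[0.25])
>>> prostate_gam.train(x=["AGE"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()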
-
property
balance_classes
¶ Balance training data class counts via over/under-sampling (for imbalanced data).
Type:
bool
(default:False
).
-
property
beta_constraints
¶ Beta constraints
Type:
H2OFrame
.
-
property
beta_epsilon
¶ Converge if beta changes less (using L-infinity norm) than beta epsilon. ONLY applies to IRLSM solver.
Type:
float
(default:0.0001
).
-
property
bs
¶ Basis function type for each GAM predictor; 0 for cubic regression spline (cr)
Type:
List[int]
.
-
property
class_sampling_factors
¶ Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type:
List[float]
.
-
property
compute_p_values
¶ Request p-values computation. P-values work only with the IRLSM solver and no regularization
Type:
bool
(default:False
).
-
property
custom_metric_func
¶ Reference to custom evaluation function, format: language:keyName=funcName
Type:
str
.
-
property
early_stopping
¶ Stop early when there is no more relative improvement on train or validation (if provided)
Type:
bool
(default:True
).
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.
-
property
family
¶ Family. Use binomial for classification with logistic regression, others are for regression problems.
One of:
"gaussian"
,"binomial"
,"quasibinomial"
,"ordinal"
,"multinomial"
,"poisson"
,"gamma"
,"tweedie"
,"negativebinomial"
,"fractionalbinomial"
.
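- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example; a gaussian family fits a numeric response:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="gaussian",
...                                                gam_columns=["AGE"])
>>> prostate_gam.train(x=["VOL"],
...                    y="PSA",
...                    training_frame=prostate)
>>> prostate_gam.mse()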
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.
-
property
gam_columns
¶ Predictor column names for GAM
Type:
List[str]
.
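- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example; each listed column receives its own smooth term:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA", "AGE"])
>>> prostate_gam.train(x=["VOL"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()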
-
property
gradient_epsilon
¶ Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
Type:
float
(default:-1
).
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
-
property
interaction_pairs
¶ A list of pairwise (first order) column interactions.
Type:
List[tuple]
.
-
property
interactions
¶ A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
Type:
List[str]
.
-
property
intercept
¶ Include constant term in the model
Type:
bool
(default:True
).
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).
-
property
keep_gam_cols
¶ Save keys of model matrix
Type:
bool
(default:False
).
-
property
knot_ids
¶ String array storing frame keys of knots, one for each GAM column specified in gam_columns
Type:
List[str]
.
-
property
lambda_
¶ Regularization strength
Type:
List[float]
.
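- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA"],
...                                                lambda_=[0.01])
>>> prostate_gam.train(x=["AGE"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()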
-
property
lambda_min_ratio
¶ Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
Type:
float
(default:-1
).
-
property
lambda_search
¶ Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min
Type:
bool
(default:False
).
-
property
link
¶ Link function.
One of:
"family_default"
,"identity"
,"logit"
,"log"
,"inverse"
,"tweedie"
,"ologit"
.
-
property
max_active_predictors
¶ Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.
Type:
int
(default:-1
).
-
property
max_after_balance_size
¶ Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type:
float
(default:5
).
-
property
max_confusion_matrix_size
¶ [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Type:
int
(default:20
).
-
property
max_hit_ratio_k
¶ Maximum number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
Type:
int
(default:0
).
-
property
max_iterations
¶ Maximum number of iterations
Type:
int
(default:-1
).
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).
-
property
missing_values_handling
¶ Handling of missing values. Either MeanImputation, Skip or PlugValues.
One of:
"mean_imputation"
,"skip"
,"plug_values"
(default:"mean_imputation"
).
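- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example; rows with missing values are skipped instead of mean-imputed:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA"],
...                                                missing_values_handling="skip")
>>> prostate_gam.train(x=["AGE"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()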
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).
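- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA"],
...                                                nfolds=5,
...                                                seed=1234)
>>> prostate_gam.train(x=["AGE"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc(xval=True)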
-
property
nlambdas
¶ Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression) otherwise it is set to 100.
Type:
int
(default:-1
).
-
property
non_negative
¶ Restrict coefficients (not intercept) to be non-negative
Type:
bool
(default:False
).
-
property
num_knots
¶ Number of knots for GAM predictors
Type:
List[int]
.
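- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example; one knot count per entry in gam_columns:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA", "AGE"],
...                                                num_knots=[5, 7])
>>> prostate_gam.train(x=["VOL"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()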
-
property
obj_reg
¶ Likelihood divider in objective value computation, default is 1/nobs
Type:
float
(default:-1
).
-
property
objective_epsilon
¶ Converge if objective value changes less than this. Default indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.
Type:
float
(default:-1
).
-
property
offset_column
¶ Offset column. This will be added to the combination of columns before applying the link function.
Type:
str
.
-
property
plug_values
¶ Plug Values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)
Type:
H2OFrame
.
-
property
prior
¶ Prior probability for y==1. To be used only for logistic regression if the data has been sampled and the mean of the response does not reflect reality.
Type:
float
(default:-1
).
-
property
remove_collinear_columns
¶ In case of linearly dependent columns, remove some of the dependent columns
Type:
bool
(default:False
).
-
property
response_column
¶ Response variable column.
Type:
str
.
-
property
scale
¶ Smoothing parameter for GAM predictors
Type:
List[float]
.
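- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example; one smoothing value per entry in gam_columns:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA", "AGE"],
...                                                scale=[0.1, 0.1])
>>> prostate_gam.train(x=["VOL"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()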
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).
-
property
solver
¶ AUTO will set the solver based on given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.
One of:
"auto"
,"irlsm"
,"l_bfgs"
,"coordinate_descent_naive"
,"coordinate_descent"
,"gradient_descent_lh"
,"gradient_descent_sqerr"
(default:"auto"
).
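- Examples
An illustrative sketch (not from the original reference), assuming the prostate frame from the class example:
>>> prostate_gam = H2OGeneralizedAdditiveEstimator(family="binomial",
...                                                gam_columns=["PSA"],
...                                                solver="l_bfgs")
>>> prostate_gam.train(x=["AGE"],
...                    y="CAPSULE",
...                    training_frame=prostate)
>>> prostate_gam.auc()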
-
property
standardize
¶ Standardize numeric columns to have zero mean and unit variance
Type:
bool
(default:False
).
-
property
theta
¶ Theta value (dispersion parameter) used with the negative binomial family
Type:
float
(default:0
).
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.
-
property
tweedie_link_power
¶ Tweedie link power
Type:
float
(default:0
).
-
property
tweedie_variance_power
¶ Tweedie variance power
Type:
float
(default:0
).
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.
-
property
weights_column
¶ Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.
H2OGradientBoostingEstimator
¶
-
class
h2o.estimators.gbm.
H2OGradientBoostingEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Gradient Boosting Machine
Builds gradient boosted trees on a parsed data set, for regression or classification. The default distribution function will guess the model type based on the response column type. Otherwise, the response column must be an enum for “bernoulli” or “multinomial”, and numeric for all other distributions.
-
property
balance_classes
¶ Balance training data class counts via over/under-sampling (for imbalanced data).
Type:
bool
(default:False
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
build_tree_one_node
¶ Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(build_tree_one_node=True, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
calibrate_model
¶ Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
Type:
bool
(default:False
).- Examples
>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv") >>> ecology['Angaus'] = ecology['Angaus'].asfactor() >>> response = 'Angaus' >>> train, calib = ecology.split_frame(seed = 12354) >>> predictors = ecology.columns[3:13] >>> w = h2o.create_frame(binary_fraction=1, ... binary_ones_fraction=0.5, ... missing_fraction=0, ... rows=744, cols=1) >>> w.set_names(["weight"]) >>> train = train.cbind(w) >>> ecology_gbm = H2OGradientBoostingEstimator(ntrees=10, ... max_depth=5, ... min_rows=10, ... learn_rate=0.1, ... distribution="multinomial", ... weights_column="weight", ... calibrate_model=True, ... calibration_frame=calib) >>> ecology_gbm.train(x=predictors, ... y="Angaus", ... training_frame=train) >>> ecology_gbm.auc()
-
property
calibration_frame
¶ Calibration frame for Platt Scaling
Type:
H2OFrame
.- Examples
>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv") >>> ecology['Angaus'] = ecology['Angaus'].asfactor() >>> response = 'Angaus' >>> predictors = ecology.columns[3:13] >>> train, calib = ecology.split_frame(seed=12354) >>> w = h2o.create_frame(binary_fraction=1, ... binary_ones_fraction=0.5, ... missing_fraction=0, ... rows=744,cols=1) >>> w.set_names(["weight"]) >>> train = train.cbind(w) >>> ecology_gbm = H2OGradientBoostingEstimator(ntrees=10, ... max_depth=5, ... min_rows=10, ... learn_rate=0.1, ... distribution="multinomial", ... calibrate_model=True, ... calibration_frame=calib) >>> ecology_gbm.train(x=predictors, ... y="Angaus", ... training_frame=train, ... weights_column="weight") >>> ecology_gbm.auc()
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(categorical_encoding="label_encoder", ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
check_constant_response
¶ Check if the response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant.
Type:
bool
(default:True
).- Examples
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> train["constantCol"] = 1 >>> my_gbm = H2OGradientBoostingEstimator(check_constant_response=False) >>> my_gbm.train(x=list(range(1,5)), ... y="constantCol", ... training_frame=train)
-
property
checkpoint
¶ Model checkpoint to resume training with.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(ntrees=1, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cars_gbm.auc(valid=True)) >>> print("Number of trees built for cars_gbm model:", cars_gbm.ntrees) >>> cars_gbm_continued = H2OGradientBoostingEstimator(checkpoint=cars_gbm.model_id, ... ntrees=50, ... seed=1234) >>> cars_gbm_continued.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm_continued.auc(valid=True) >>> print("Number of trees built for cars_gbm model:",cars_gbm_continued.ntrees)
-
property
class_sampling_factors
¶ Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type:
List[float]
.- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.] >>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True, ... class_sampling_factors=sample_factors, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
col_sample_rate
¶ Column sample rate (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate=.7, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
col_sample_rate_change_per_level
¶ Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate_change_per_level=.9, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
col_sample_rate_per_tree
¶ Column sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate_per_tree=.7, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
custom_distribution_func
¶ Reference to custom distribution, format: language:keyName=funcName
Type:
str
.- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(ntrees=3, ... max_depth=5, ... distribution="bernoulli", ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> from h2o.utils.distributions import CustomDistributionBernoulli >>> custom_distribution_bernoulli = h2o.upload_custom_distribution(CustomDistributionBernoulli, ... func_name="custom_bernoulli", ... func_file="custom_bernoulli.py") >>> airlines_gbm_custom = H2OGradientBoostingEstimator(ntrees=3, ... max_depth=5, ... distribution="custom", ... custom_distribution_func=custom_distribution_bernoulli, ... seed=1235) >>> airlines_gbm_custom.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm_custom.auc()
-
property
custom_metric_func
¶ Reference to custom evaluation function, format: language:keyName=funcName
Type:
str
.
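- Examples
An illustrative sketch (not from the original reference); it assumes h2o.upload_custom_metric and a hypothetical mean-absolute-error metric class following the map/reduce/metric protocol, trained on the BostonHousing data used elsewhere on this page:
>>> # hypothetical custom metric: map accumulates [error, weight], reduce merges, metric finalizes
>>> class CustomMaeFunc:
...     def map(self, pred, act, w, o, model):
...         return [abs(act[0] - pred[0]), 1]
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...     def metric(self, l):
...         return l[0] / l[1]
>>> custom_mae = h2o.upload_custom_metric(CustomMaeFunc,
...                                       func_name="mae",
...                                       func_file="mm_mae.py")
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> boston_gbm = H2OGradientBoostingEstimator(custom_metric_func=custom_mae,
...                                           seed=1234)
>>> boston_gbm.train(x=boston.columns[:-1],
...                  y="medv",
...                  training_frame=boston)
>>> boston_gbm.mse()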
-
property
distribution
¶ Distribution function
One of:
"auto"
,"bernoulli"
,"quasibinomial"
,"multinomial"
,"gaussian"
,"poisson"
,"gamma"
,"tweedie"
,"laplace"
,"quantile"
,"huber"
,"custom"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "cylinders" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(distribution="poisson", ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.mse(valid=True)
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> from h2o.grid.grid_search import H2OGridSearch >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> response = "IsDepDelayed" >>> hyper_parameters = {'ntrees': [5,10]} >>> search_crit = {'strategy': "RandomDiscrete", ... 'max_models': 5, ... 'seed': 1234, ... 'stopping_rounds': 3, ... 'stopping_metric': "AUTO", ... 'stopping_tolerance': 1e-2} >>> checkpoints_dir = tempfile.mkdtemp() >>> air_grid = H2OGridSearch(H2OGradientBoostingEstimator, ... hyper_params=hyper_parameters, ... search_criteria=search_crit) >>> air_grid.train(x=predictors, ... y=response, ... training_frame=airlines, ... distribution="bernoulli", ... learn_rate=0.1, ... max_depth=3, ... export_checkpoints_dir=checkpoints_dir) >>> len(listdir(checkpoints_dir))
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> assignment_type = "Random" >>> cars_gbm = H2OGradientBoostingEstimator(fold_assignment=assignment_type, ... nfolds=5, ... seed=1234) >>> cars_gbm.train(x=predictors, y=response, training_frame=cars) >>> cars_gbm.auc(xval=True)
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> fold_numbers = cars.kfold_column(n_folds=5, ... seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> cars = cars.cbind(fold_numbers) >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=cars, ... fold_column="fold_numbers") >>> cars_gbm.auc(xval=True)
-
property
histogram_type
¶ What type of histogram to use for finding optimal split points
One of:
"auto"
,"uniform_adaptive"
,"random"
,"quantiles_global"
,"round_robin"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(histogram_type="UniformAdaptive", ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
huber_alpha
¶ Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
Type:
float
(default:0.9
).- Examples
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> predictors = insurance.columns[0:4] >>> response = 'Claims' >>> insurance['Group'] = insurance['Group'].asfactor() >>> insurance['Age'] = insurance['Age'].asfactor() >>> train, valid = insurance.split_frame(ratios=[.8], seed=1234) >>> insurance_gbm = H2OGradientBoostingEstimator(distribution="huber", ... huber_alpha=0.9, ... seed=1234) >>> insurance_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> insurance_gbm.mse(valid=True)
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234, ... ignore_const_cols=True) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
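- Examples
An illustrative sketch (not from the original reference), mirroring the prostate setup used for monotone_constraints below; ignored_columns is passed to train() in place of an explicit x list:
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> prostate_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> prostate_gbm.train(y="CAPSULE",
...                    ignored_columns=["ID"],
...                    training_frame=prostate)
>>> prostate_gbm.auc()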
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_models=True, ... nfolds=5, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc()
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc()
-
property
learn_rate
¶ Learning rate (from 0.0 to 1.0)
Type:
float
(default:0.1
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> titanic_gbm = H2OGradientBoostingEstimator(ntrees=10000, ... learn_rate=0.01, ... stopping_rounds=5, ... stopping_metric="AUC", ... stopping_tolerance=1e-4, ... seed=1234) >>> titanic_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_gbm.auc(valid=True)
-
property
learn_rate_annealing
¶ Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999)
Type:
float
(default:1
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> titanic_gbm = H2OGradientBoostingEstimator(ntrees=10000, ... learn_rate=0.05, ... learn_rate_annealing=.9, ... stopping_rounds=5, ... stopping_metric="AUC", ... stopping_tolerance=1e-4, ... seed=1234) >>> titanic_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_gbm.auc(valid=True)
-
property
max_abs_leafnode_pred
¶ Maximum absolute value of a leaf node prediction
Type:
float
(default:1.797693135e+308
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_gbm = H2OGradientBoostingEstimator(max_abs_leafnode_pred=2, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
max_after_balance_size
¶ Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type:
float
(default:5
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> max = .85 >>> cov_gbm = H2OGradientBoostingEstimator(balance_classes=True, ... max_after_balance_size=max, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
max_confusion_matrix_size
¶ [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Type:
int
(default:20
).
-
property
max_depth
¶ Maximum tree depth.
Type:
int
(default:5
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(ntrees=100, ... max_depth=2, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
max_hit_ratio_k
¶ Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
Type:
int
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_gbm = H2OGradientBoostingEstimator(max_hit_ratio_k=3, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(max_runtime_secs=10, ... ntrees=10000, ... max_depth=10, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
min_rows
¶ Fewest allowed (weighted) observations in a leaf.
Type:
float
(default:10
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(min_rows=16, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
min_split_improvement
¶ Minimum relative improvement in squared error reduction for a split to happen
Type:
float
(default:1e-05
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(min_split_improvement=1e-3, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
monotone_constraints
¶ A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
Type:
dict
.- Examples
>>> prostate_hex = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor() >>> response = "CAPSULE" >>> seed = 42 >>> monotone_constraints = {"AGE":1} >>> gbm_model = H2OGradientBoostingEstimator(seed=seed, ... monotone_constraints=monotone_constraints) >>> gbm_model.train(y=response, ... ignored_columns=["ID"], ... training_frame=prostate_hex) >>> gbm_model.scoring_history()
-
property
nbins
¶ For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
Type:
int
(default:20
).- Examples
>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv") >>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor() >>> predictors = eeg.columns[:-1] >>> response = 'eyeDetection' >>> train, valid = eeg.split_frame(ratios=[.8], seed=1234) >>> bin_num = [16, 32, 64, 128, 256, 512] >>> label = ["16", "32", "64", "128", "256", "512"] >>> for key, num in enumerate(bin_num): ... eeg_gbm = H2OGradientBoostingEstimator(nbins=num, seed=1234) ... eeg_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) ... print(label[key], 'training score', eeg_gbm.auc(train=True)) ... print(label[key], 'validation score', eeg_gbm.auc(valid=True))
-
property
nbins_cats
¶ For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
Type:
int
(default:1024
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> bin_num = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096] >>> label = ["8", "16", "32", "64", "128", "256", "512", "1024", "2048", "4096"] >>> for key, num in enumerate(bin_num): ... airlines_gbm = H2OGradientBoostingEstimator(nbins_cats=num, seed=1234) ... airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) ... print(label[key], 'training score', airlines_gbm.auc(train=True)) ... print(label[key], 'validation score', airlines_gbm.auc(valid=True))
-
property
nbins_top_level
¶ For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
Type:
int
(default:1024
).- Examples
>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv") >>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor() >>> predictors = eeg.columns[:-1] >>> response = 'eyeDetection' >>> train, valid = eeg.split_frame(ratios=[.8], seed=1234) >>> bin_num = [32, 64, 128, 256, 512, 1024, 2048, 4096] >>> label = ["32", "64", "128", "256", "512", "1024", "2048", "4096"] >>> for key, num in enumerate(bin_num): ... eeg_gbm = H2OGradientBoostingEstimator(nbins_top_level=num, seed=1234) ... eeg_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) ... print(label[key], 'training score', eeg_gbm.auc(train=True)) ... print(label[key], 'validation score', eeg_gbm.auc(valid=True))
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> cars_gbm = H2OGradientBoostingEstimator(nfolds=folds, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_gbm.auc()
-
property
ntrees
¶ Number of trees.
Type:
int
(default:50
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> tree_num = [20, 50, 80, 110, 140, 170, 200] >>> label = ["20", "50", "80", "110", "140", "170", "200"] >>> for key, num in enumerate(tree_num): ... titanic_gbm = H2OGradientBoostingEstimator(ntrees=num, ... seed=1234) ... titanic_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) ... print(label[key], 'training score', titanic_gbm.auc(train=True)) ... print(label[key], 'validation score', titanic_gbm.auc(valid=True))
-
property
offset_column
¶ Offset column. This will be added to the combination of columns before applying the link function.
Type:
str
.- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> boston["offset"] = boston["medv"].log() >>> train, valid = boston.split_frame(ratios=[.8], seed=1234) >>> boston_gbm = H2OGradientBoostingEstimator(offset_column="offset", ... seed=1234) >>> boston_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_gbm.mse(valid=True)
-
property
pred_noise_bandwidth
¶ Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions
Type:
float
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> titanic_gbm = H2OGradientBoostingEstimator(pred_noise_bandwidth=0.1, ... seed=1234) >>> titanic_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_gbm.auc(valid = True)
-
property
quantile_alpha
¶ Desired quantile for Quantile regression, must be between 0 and 1.
Type:
float
(default:0.5
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8], seed=1234) >>> boston_gbm = H2OGradientBoostingEstimator(distribution="quantile", ... quantile_alpha=.8, ... seed=1234) >>> boston_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_gbm.mse(valid=True)
-
property
r2_stopping
¶ r2_stopping is no longer supported and will be ignored if set - please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
Type:
float
(default:1.797693135e+308
).
-
property
response_column
¶ Response variable column.
Type:
str
.
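- Examples
The response column is usually set implicitly by passing y to train(). A brief sketch (added here; it mirrors the cars examples used throughout this section):
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"
>>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.train(x=predictors,
...                y=response,
...                training_frame=train,
...                validation_frame=valid)
>>> cars_gbm.auc(valid=True)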
-
property
sample_rate
¶ Row sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(sample_rate=.7, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
sample_rate_per_class
¶ A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
Type:
List[float]
.- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> rate_per_class_list = [1, .4, 1, 1, 1, 1, 1] >>> cov_gbm = H2OGradientBoostingEstimator(sample_rate_per_class=rate_per_class_list, ... seed=1234) >>> cov_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_gbm.logloss(valid=True)
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(score_each_iteration=True, ... ntrees=55, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.scoring_history()
-
property
score_tree_interval
¶ Score the model after every so many trees. Disabled if set to 0.
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(score_tree_interval=5, ... ntrees=55, ... seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.scoring_history()
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> gbm_w_seed_1 = H2OGradientBoostingEstimator(col_sample_rate=.7, ... seed=1234) >>> gbm_w_seed_1.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('auc for the 1st model built with a seed:', gbm_w_seed_1.auc(valid=True))
-
property
stopping_metric
¶ Metric to use for early stopping (AUTO: logloss for classification, deviance for regression, and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
One of:
"auto"
,"deviance"
,"logloss"
,"mse"
,"rmse"
,"mae"
,"rmsle"
,"auc"
,"aucpr"
,"lift_top_group"
,"misclassification"
,"mean_per_class_error"
,"custom"
,"custom_increasing"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
stopping_rounds
¶ Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
Type:
int
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
stopping_tolerance
¶ Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Type:
float
(default:0.001
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_gbm = H2OGradientBoostingEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_gbm.auc(valid=True)
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
tweedie_power
¶ Tweedie power for Tweedie regression, must be between 1 and 2.
Type:
float
(default:1.5
).- Examples
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> predictors = insurance.columns[0:4] >>> response = 'Claims' >>> insurance['Group'] = insurance['Group'].asfactor() >>> insurance['Age'] = insurance['Age'].asfactor() >>> train, valid = insurance.split_frame(ratios=[.8], seed=1234) >>> insurance_gbm = H2OGradientBoostingEstimator(distribution="tweedie", ... tweedie_power=1.2, ... seed=1234) >>> insurance_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> insurance_gbm.mse(valid=True)
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_gbm.auc(valid=True)
-
property
weights_column
¶ Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_gbm = H2OGradientBoostingEstimator(seed=1234) >>> cars_gbm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid, ... weights_column="weight") >>> cars_gbm.auc(valid=True)
H2OGeneralizedLinearEstimator
¶
-
class
h2o.estimators.glm.
H2OGeneralizedLinearEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Generalized Linear Modeling
Fits a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.
A subclass of
ModelBase
is returned. The specific subclass depends on the machine learning task at hand (if it's binomial classification, then an H2OBinomialModel is returned; if it's regression, then an H2ORegressionModel is returned). The default print-out of the models is shown, but further GLM-specific information can be queried out of the object. Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
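- Examples
A short sketch (added for illustration; it reuses the prostate dataset from the getGLMRegularizationPath example below) showing how this model information can be queried:
>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> m = H2OGeneralizedLinearEstimator(family='binomial')
>>> m.train(x=[2,3,4,5,6,7,8], y=1, training_frame=d)
>>> m.coef()       # coefficients
>>> m.coef_norm()  # normalized coefficients
>>> p = m.model_performance(d)
>>> p.null_deviance()
>>> p.residual_deviance()
>>> p.aic()
>>> p.auc()
-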
property
HGLM
¶ If set to True, an HGLM model is returned; otherwise, a standard GLM model is returned.
Type:
bool
(default:False
).
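- Examples
A minimal HGLM sketch (added; it also exercises the rand_family, rand_link, random_columns, and calc_like parameters documented below). It assumes the column layout of the public semiconductor dataset: a categorical random-effect column in position 0, response "y", and fixed effects "x1", "x3", "x5", "x6".
>>> h2o_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/semiconductor.csv")
>>> h2o_data[0] = h2o_data[0].asfactor()
>>> hglm_model = H2OGeneralizedLinearEstimator(HGLM=True,
...                                            family="gaussian",
...                                            rand_family=["gaussian"],
...                                            rand_link=["identity"],
...                                            random_columns=[0],
...                                            calc_like=True)
>>> hglm_model.train(x=["x1", "x3", "x5", "x6"],
...                  y="y",
...                  training_frame=h2o_data)
>>> hglm_model.summary()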
-
property
Lambda
¶ DEPRECATED. Use
self.lambda_
instead
-
property
alpha
¶ Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = ‘L-BFGS’; 0.5 otherwise.
Type:
List[float]
.- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(alpha=.25) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(boston_glm.mse(valid=True))
-
property
balance_classes
¶ Balance training data class counts via over/under-sampling (for imbalanced data).
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
beta_constraints
¶ Beta constraints
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> n = len(predictors) >>> constraints = h2o.H2OFrame({'names':predictors, ... 'lower_bounds': [-1000]*n, ... 'upper_bounds': [1000]*n, ... 'beta_given': [1]*n, ... 'rho': [0.2]*n}) >>> cars_glm = H2OGeneralizedLinearEstimator(standardize=True, ... beta_constraints=constraints) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
beta_epsilon
¶ Converge if beta changes less (using L-infinity norm) than beta epsilon, ONLY applies to IRLSM solver
Type:
float
(default:0.0001
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(beta_epsilon=1e-3) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
calc_like
¶ If True, the likelihood function value is returned for HGLM.
Type:
bool
(default:False
).
-
property
class_sampling_factors
¶ Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type:
List[float]
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.] >>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True, ... class_sampling_factors=sample_factors, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
compute_p_values
¶ Request p-values computation, p-values work only with IRLSM solver and no regularization
Type:
bool
(default:False
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8]) >>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial', ... lambda_=0, ... remove_collinear_columns=True, ... compute_p_values=True) >>> airlines_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_glm.mse()
-
property
custom_metric_func
¶ Reference to custom evaluation function, format: language:keyName=funcName
Type:
str
.
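- Examples
A sketch of supplying a custom metric (added; the metric class, its name, and the file name are illustrative). h2o.upload_custom_metric() returns a reference string in the language:keyName=funcName format that this parameter expects:
>>> class CustomRmseFunc:
...     def map(self, pred, act, w, o, model):
...         # per-row squared error plus a row count
...         err = act[0] - pred[0]
...         return [err * err, 1]
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...     def metric(self, l):
...         import math
...         return math.sqrt(l[0] / l[1])
>>> custom_func = h2o.upload_custom_metric(CustomRmseFunc,
...                                        func_name="rmse",
...                                        func_file="custom_rmse.py")
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
>>> boston_glm = H2OGeneralizedLinearEstimator(custom_metric_func=custom_func)
>>> boston_glm.train(x=boston.columns[:-1],
...                  y="medv",
...                  training_frame=boston)
>>> boston_glm.model_performance()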
-
property
early_stopping
¶ Stop early when there is no more relative improvement on train or validation (if provided)
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial', ... early_stopping=True) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.auc(valid=True)
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> checkpoints_dir = tempfile.mkdtemp() >>> cars_glm = H2OGeneralizedLinearEstimator(export_checkpoints_dir=checkpoints_dir, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse() >>> len(listdir(checkpoints_dir))
-
property
family
¶ Family. Use binomial for classification with logistic regression, others are for regression problems.
One of:
"gaussian"
,"binomial"
,"fractionalbinomial"
,"quasibinomial"
,"ordinal"
,"multinomial"
,"poisson"
,"gamma"
,"tweedie"
,"negativebinomial"
(default:"gaussian"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.auc(valid = True)
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> assignment_type = "Random" >>> cars_glm = H2OGeneralizedLinearEstimator(fold_assignment=assignment_type, ... nfolds=5, ... family='binomial', ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_glm.auc(xval=True)
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> cars = cars.cbind(fold_numbers) >>> print(cars['fold_numbers']) >>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234, ... family="binomial") >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=cars, ... fold_column="fold_numbers") >>> cars_glm.auc(xval=True)
-
static
getGLMRegularizationPath
(model)[source]¶ Extract the full regularization path explored during lambda search from a GLM model.
- Parameters
model – source lambda search model
- Examples
>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> m = H2OGeneralizedLinearEstimator(family = 'binomial', ... lambda_search = True, ... solver = 'COORDINATE_DESCENT') >>> m.train(training_frame = d, ... x = [2,3,4,5,6,7,8], ... y = 1) >>> r = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(m) >>> m2 = H2OGeneralizedLinearEstimator.makeGLMModel(model=m, ... coefs=r['coefficients'][10]) >>> dev1 = r['explained_deviance_train'][10] >>> p = m2.model_performance(d) >>> dev2 = 1-p.residual_deviance()/p.null_deviance() >>> print(dev1, " =?= ", dev2)
-
property
gradient_epsilon
¶ Converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solver. Default indicates: If lambda_search is set to False and lambda is equal to zero, the default value of gradient_epsilon is equal to .000001, otherwise the default value is .0001. If lambda_search is set to True, the conditional values above are 1E-8 and 1E-6 respectively.
Type:
float
(default:-1
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(gradient_epsilon=1e-3) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234, ... ignore_const_cols=True, ... family="binomial") >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.auc(valid=True)
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
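- Examples
A sketch (added; the column names assume the cars dataset used elsewhere in this section). When ignored_columns is set, omit x in train(); every column except the response and the ignored ones is then used as a predictor:
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
>>> response = "economy_20mpg"
>>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial',
...                                          ignored_columns=["name", "economy"],
...                                          seed=1234)
>>> cars_glm.train(y=response, training_frame=cars)
>>> cars_glm.auc(train=True)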
-
property
interaction_pairs
¶ A list of pairwise (first order) column interactions.
Type:
List[tuple]
.- Examples
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> XY = [df.names[i-1] for i in [1,2,3,4,6,8,9,13,17,18,19,31]] >>> interactions = [XY[i-1] for i in [5,7,9]] >>> m = H2OGeneralizedLinearEstimator(lambda_search=True, ... family="binomial", ... interactions=interactions) >>> m.train(x=XY[:-1], y=XY[-1], training_frame=df) >>> m._model_json['output']['coefficients_table'] >>> coef_m = m._model_json['output']['coefficients_table'] >>> interaction_pairs = [("CRSDepTime", "UniqueCarrier"), ... ("CRSDepTime", "Origin"), ... ("UniqueCarrier", "Origin")] >>> mexp = H2OGeneralizedLinearEstimator(lambda_search=True, ... family="binomial", ... interaction_pairs=interaction_pairs) >>> mexp.train(x=XY[:-1], y=XY[-1], training_frame=df) >>> mexp._model_json['output']['coefficients_table']
-
property
interactions
¶ A list of predictor column indices to interact. All pairwise combinations will be computed for the list.
Type:
List[str]
.- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> interactions_list = ['crim', 'dis'] >>> boston_glm = H2OGeneralizedLinearEstimator(interactions=interactions_list) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
intercept
¶ Include constant term in the model
Type:
bool
(default:True
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris['class'] = iris['class'].asfactor() >>> predictors = iris.columns[:-1] >>> response = 'class' >>> train, valid = iris.split_frame(ratios=[.8]) >>> iris_glm = H2OGeneralizedLinearEstimator(family='multinomial', ... intercept=True) >>> iris_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> iris_glm.logloss(valid=True)
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234, ... family="binomial") >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_glm.cross_validation_fold_assignment()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_models=True, ... nfolds=5, ... seed=1234, ... family="binomial") >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_glm_cv_models = cars_glm.cross_validation_models() >>> print(cars_glm.cross_validation_models())
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234, ... family="binomial") >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_glm.cross_validation_predictions()
-
property
lambda_
¶ Regularization strength
Type:
List[float]
.- Examples
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8]) >>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial', ... lambda_=.0001) >>> airlines_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_glm.auc(valid=True))
-
property
lambda_min_ratio
¶ Minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default indicates: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
Type:
float
(default:-1
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(lambda_min_ratio=.0001) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
lambda_search
¶ Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min.
Type:
bool
(default:False
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(lambda_search=True) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(boston_glm.mse(valid=True))
-
property
link
¶ Link function.
One of:
"family_default"
,"identity"
,"logit"
,"log"
,"inverse"
,"tweedie"
,"ologit"
(default:"family_default"
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris['class'] = iris['class'].asfactor() >>> predictors = iris.columns[:-1] >>> response = 'class' >>> train, valid = iris.split_frame(ratios=[.8]) >>> iris_glm = H2OGeneralizedLinearEstimator(family='multinomial', ... link='family_default') >>> iris_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> iris_glm.logloss()
-
static
makeGLMModel
(model, coefs, threshold=0.5)[source]¶ Create a custom GLM model using the given coefficients.
A source model trained on the dataset must be passed in so that the dataset information can be extracted from it.
- Parameters
model – source model, used for extracting dataset information
coefs – dictionary containing model coefficients
threshold – (optional, only for binomial) decision threshold used for classification
- Examples
>>> d = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> m = H2OGeneralizedLinearEstimator(family='binomial', ... lambda_search=True, ... solver='COORDINATE_DESCENT') >>> m.train(training_frame=d, ... x=[2,3,4,5,6,7,8], ... y=1) >>> r = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(m) >>> m2 = H2OGeneralizedLinearEstimator.makeGLMModel(model=m, ... coefs=r['coefficients'][10]) >>> dev1 = r['explained_deviance_train'][10] >>> p = m2.model_performance(d) >>> dev2 = 1-p.residual_deviance()/p.null_deviance() >>> print(dev1, " =?= ", dev2)
-
property
max_active_predictors
¶ Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. Default indicates: If the IRLSM solver is used, the value of max_active_predictors is set to 5000 otherwise it is set to 100000000.
Type:
int
(default:-1
).- Examples
>>> higgs= h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/testng/higgs_train_5k.csv") >>> predictors = higgs.names >>> predictors.remove('response') >>> response = "response" >>> train, valid = higgs.split_frame(ratios=[.8]) >>> higgs_glm = H2OGeneralizedLinearEstimator(family='binomial', ... max_active_predictors=200) >>> higgs_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> higgs_glm.auc()
-
property
max_after_balance_size
¶ Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type:
float
(default:5
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> max_size = .85 >>> cars_glm = H2OGeneralizedLinearEstimator(balance_classes=True, ... max_after_balance_size=max_size, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
max_confusion_matrix_size
¶ [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Type:
int
(default:20
).
-
property
max_hit_ratio_k
¶ Maximum number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","year"] >>> response = "acceleration" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(max_hit_ratio_k=3, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
max_iterations
¶ Maximum number of iterations
Type:
int
(default:-1
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(family='binomial', ... max_iterations=50) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm = H2OGeneralizedLinearEstimator(max_runtime_secs=10, ... seed=1234) >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.mse()
-
property
missing_values_handling
¶ Handling of missing values. Either MeanImputation, Skip or PlugValues.
One of:
"mean_imputation"
,"skip"
,"plug_values"
(default:"mean_imputation"
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> boston.insert_missing_values() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(missing_values_handling="skip") >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> cars_glm = H2OGeneralizedLinearEstimator(nfolds=folds, ... seed=1234, ... family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_glm.auc(xval=True)
-
property
nlambdas
¶ Number of lambdas to be used in a search. Default indicates: If alpha is zero, with lambda search set to True, the value of nlambdas is set to 30 (fewer lambdas are needed for ridge regression); otherwise it is set to 100.
Type:
int
(default:-1
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(lambda_search=True, ... nlambdas=50) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(boston_glm.mse(valid=True))
-
property
non_negative
¶ Restrict coefficients (not intercept) to be non-negative
Type:
bool
(default:False
).- Examples
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8]) >>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial', ... non_negative=True) >>> airlines_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_glm.auc()
-
property
obj_reg
¶ Likelihood divider in objective value computation, default is 1/nobs
Type:
float
(default:-1
).- Examples
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/glm_ordinal_logit/ordinal_multinomial_training_set.csv") >>> df["C11"] = df["C11"].asfactor() >>> ordinal_fit = H2OGeneralizedLinearEstimator(family="ordinal", ... alpha=1.0, ... lambda_=0.000000001, ... obj_reg=0.00001, ... max_iterations=1000, ... beta_epsilon=1e-8, ... objective_epsilon=1e-10) >>> ordinal_fit.train(x=list(range(0,10)), ... y="C11", ... training_frame=df) >>> ordinal_fit.mse()
-
property
objective_epsilon
¶ Converge if objective value changes less than this. Default indicates: If lambda_search is set to True the value of objective_epsilon is set to .0001. If the lambda_search is set to False and lambda is equal to zero, the value of objective_epsilon is set to .000001, for any other value of lambda the default value of objective_epsilon is set to .0001.
Type:
float
(default:-1
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(objective_epsilon=1e-3) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
offset_column
¶ Offset column. This will be added to the combination of columns before applying the link function.
Type:
str
.- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> boston["offset"] = boston["medv"].log() >>> train, valid = boston.split_frame(ratios=[.8], seed=1234) >>> boston_glm = H2OGeneralizedLinearEstimator(offset_column="offset", ... seed=1234) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse(valid=True)
-
property
plug_values
¶ Plug Values (a single-row frame containing values that will be used to impute missing values of the training/validation frame; use in conjunction with missing_values_handling = PlugValues)
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars = cars.drop(0) >>> means = cars.mean(return_frame=True) >>> glm_means = H2OGeneralizedLinearEstimator(seed=42) >>> glm_means.train(training_frame=cars, y="cylinders") >>> glm_plugs1 = H2OGeneralizedLinearEstimator(seed=42, ... missing_values_handling="PlugValues", ... plug_values=means) >>> glm_plugs1.train(training_frame=cars, y="cylinders") >>> glm_means.coef() == glm_plugs1.coef() >>> not_means = 0.1 + (means * 0.5) >>> glm_plugs2 = H2OGeneralizedLinearEstimator(seed=42, ... missing_values_handling="PlugValues", ... plug_values=not_means) >>> glm_plugs2.train(training_frame=cars, y="cylinders") >>> glm_means.coef() != glm_plugs2.coef()
-
property
prior
¶ Prior probability for y==1. To be used only for logistic regression, if the data has been sampled and the mean of the response does not reflect reality.
Type:
float
(default:-1
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_glm1 = H2OGeneralizedLinearEstimator(family='binomial', prior=0.5) >>> cars_glm1.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm1.mse()
-
property
rand_family
¶ Random Component Family array, one for each random component. Only gaussian is supported for now.
Type:
List[Enum["[gaussian]"]]
.
-
property
rand_link
¶ Link function array for random component in HGLM.
Type:
List[Enum["[identity]", "[family_default]"]]
.
-
property
random_columns
¶ Indices of the random columns for HGLM.
Type:
List[int]
.
-
property
remove_collinear_columns
¶ In case of linearly dependent columns, remove some of the dependent columns
Type:
bool
(default:False
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8]) >>> airlines_glm = H2OGeneralizedLinearEstimator(family='binomial', ... lambda_=0, ... remove_collinear_columns=True) >>> airlines_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_glm.auc()
-
property
response_column
¶ Response variable column.
Type:
str
.
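- Examples
The response is usually supplied as y in train(), either by name or by column index (as in the getGLMRegularizationPath example above); a brief added sketch:
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> prostate_glm = H2OGeneralizedLinearEstimator(family='binomial')
>>> prostate_glm.train(x=[2,3,4,5,6,7,8],
...                    y=1,
...                    training_frame=prostate)
>>> prostate_glm.auc()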
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(score_each_iteration=True, ... seed=1234, ... family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.scoring_history()
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234) >>> glm_w_seed = H2OGeneralizedLinearEstimator(family='binomial', ... seed=1234) >>> glm_w_seed.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(glm_w_seed.auc(valid=True))
-
property
solver
¶ AUTO will set the solver based on the given data and the other parameters. IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty; L_BFGS scales better for datasets with many columns.
One of:
"auto"
,"irlsm"
,"l_bfgs"
,"coordinate_descent_naive"
,"coordinate_descent"
,"gradient_descent_lh"
,"gradient_descent_sqerr"
(default:"auto"
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(solver='irlsm') >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(boston_glm.mse(valid=True))
-
property
standardize
¶ Standardize numeric columns to have zero mean and unit variance
Type:
bool
(default:True
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_glm = H2OGeneralizedLinearEstimator(standardize=True) >>> boston_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_glm.mse()
-
property
startval
¶ Double array to initialize fixed and random coefficients for HGLM.
Type:
List[float]
.
-
property
theta
¶ Theta
Type:
float
(default:1e-10
).- Examples
>>> h2o_df = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/glm_test/Motor_insurance_sweden.txt") >>> predictors = ["Payment", "Insured", "Kilometres", "Zone", "Bonus", "Make"] >>> response = "Claims" >>> negativebinomial_fit = H2OGeneralizedLinearEstimator(family="negativebinomial", ... link="identity", ... theta=0.5) >>> negativebinomial_fit.train(x=predictors, ... y=response, ... training_frame=h2o_df) >>> negativebinomial_fit.scoring_history()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234, ... family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.auc(train=True)
-
property
tweedie_link_power
¶ Tweedie link power
Type:
float
(default:1
).- Examples
>>> auto = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/auto.csv") >>> predictors = auto.names >>> predictors.remove('y') >>> response = "y" >>> train, valid = auto.split_frame(ratios=[.8]) >>> auto_glm = H2OGeneralizedLinearEstimator(family='tweedie', ... tweedie_link_power=1) >>> auto_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(auto_glm.mse(valid=True))
-
property
tweedie_variance_power
¶ Tweedie variance power
Type:
float
(default:0
).- Examples
>>> auto = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/auto.csv") >>> predictors = auto.names >>> predictors.remove('y') >>> response = "y" >>> train, valid = auto.split_frame(ratios=[.8]) >>> auto_glm = H2OGeneralizedLinearEstimator(family='tweedie', ... tweedie_variance_power=1) >>> auto_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(auto_glm.mse(valid=True))
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234, ... family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_glm.auc(valid=True)
-
property
weights_column
¶ Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_glm = H2OGeneralizedLinearEstimator(seed=1234, ... family='binomial') >>> cars_glm.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid, ... weights_column="weight") >>> cars_glm.auc(valid=True)
H2ONaiveBayesEstimator
¶
-
class
h2o.estimators.naive_bayes.
H2ONaiveBayesEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Naive Bayes
The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
-
property
balance_classes
¶ Balance training data class counts via over/under-sampling (for imbalanced data).
Type:
bool
(default:False
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris_nb = H2ONaiveBayesEstimator(balance_classes=False, ... nfolds=3, ... seed=1234) >>> iris_nb.train(x=list(range(4)), ... y=4, ... training_frame=iris) >>> iris_nb.mse()
-
property
class_sampling_factors
¶ Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type:
List[float]
.- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.] >>> cov_nb = H2ONaiveBayesEstimator(class_sampling_factors=sample_factors, ... seed=1234) >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> cov_nb.train(x=predictors, y=response, training_frame=covtype) >>> cov_nb.logloss()
-
property
compute_metrics
¶ Compute metrics on training data
Type:
bool
(default:True
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> response_col = 'CAPSULE' >>> prostate_nb = H2ONaiveBayesEstimator(laplace=0, ... compute_metrics=False) >>> prostate_nb.train(x=list(range(3,9)), ... y=response_col, ... training_frame=prostate) >>> prostate_nb.show()
-
property
eps_prob
¶ Cutoff below which probability is replaced with min_prob
Type:
float
(default:0
).- Examples
>>> import random >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> problem = random.sample(["binomial","multinomial"],1)[0] >>> predictors = ["displacement","power","weight","acceleration","year"] >>> if problem == "binomial": ... response_col = "economy_20mpg" ... else: ... response_col = "cylinders" >>> cars[response_col] = cars[response_col].asfactor() >>> cars_nb = H2ONaiveBayesEstimator(min_prob=0.1, ... eps_prob=0.5, ... seed=1234) >>> cars_nb.train(x=predictors, y=response_col, training_frame=cars) >>> cars_nb.mse()
-
property
eps_sdev
¶ Cutoff below which standard deviation is replaced with min_sdev
Type:
float
(default:0
).- Examples
>>> import random >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> problem = random.sample(["binomial","multinomial"],1)[0] >>> predictors = ["displacement","power","weight","acceleration","year"] >>> if problem == "binomial": ...     response_col = "economy_20mpg" ... else: ...     response_col = "cylinders" >>> cars[response_col] = cars[response_col].asfactor() >>> cars_nb = H2ONaiveBayesEstimator(min_sdev=0.1, ... eps_sdev=0.5, ... seed=1234) >>> cars_nb.train(x=predictors, y=response_col, training_frame=cars) >>> cars_nb.mse()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> response = "IsDepDelayed" >>> checkpoints_dir = tempfile.mkdtemp() >>> air_nb = H2ONaiveBayesEstimator(export_checkpoints_dir=checkpoints_dir) >>> air_nb.train(x=predictors, y=response, training_frame=airlines) >>> len(listdir(checkpoints_dir))
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars_nb = H2ONaiveBayesEstimator(fold_assignment="Random", ... nfolds=5, ... seed=1234) >>> cars_nb.train(x=predictors, y=response, training_frame=cars) >>> cars_nb.auc(xval=True)
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> cars = cars.cbind(fold_numbers) >>> cars_nb = H2ONaiveBayesEstimator(seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=cars, ... fold_column="fold_numbers") >>> cars_nb.auc()
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(seed=1234, ... ignore_const_cols=True) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_nb.auc()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
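- Examples (no example ships upstream; the following is a hedged sketch — when x is omitted from train(), every column except the response and those listed here is used as a predictor; the "name" and "economy" column names are assumed from the cars dataset used throughout this section)
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> cars_nb = H2ONaiveBayesEstimator(ignored_columns=["name","economy"], ... seed=1234) >>> cars_nb.train(y="economy_20mpg", training_frame=cars) >>> cars_nb.auc()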
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_nb.cross_validation_fold_assignment()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_models=True, ... nfolds=5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_nb.cross_validation_models()
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_nb.cross_validation_predictions()
-
property
laplace
¶ Laplace smoothing parameter
Type:
float
(default:0
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> response_col = 'CAPSULE' >>> prostate_nb = H2ONaiveBayesEstimator(laplace=1) >>> prostate_nb.train(x=list(range(3,9)), ... y=response_col, ... training_frame=prostate) >>> prostate_nb.mse()
-
property
max_after_balance_size
¶ Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type:
float
(default:5
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> max = .85 >>> cov_nb = H2ONaiveBayesEstimator(balance_classes=True, ... max_after_balance_size=max, ... seed=1234) >>> cov_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_nb.logloss(valid=True)
-
property
max_confusion_matrix_size
¶ [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Type:
int
(default:20
).
-
property
max_hit_ratio_k
¶ Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
Type:
int
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_nb = H2ONaiveBayesEstimator(max_hit_ratio_k=3, ... seed=1234) >>> cov_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_nb.mse()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(max_runtime_secs=10, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_nb.auc()
-
property
min_prob
¶ Min. probability to use for observations with not enough data
Type:
float
(default:0.001
).- Examples
>>> import random >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> problem = random.sample(["binomial","multinomial"],1)[0] >>> predictors = ["displacement","power","weight","acceleration","year"] >>> if problem == "binomial": ...     response_col = "economy_20mpg" ... else: ...     response_col = "cylinders" >>> cars[response_col] = cars[response_col].asfactor() >>> cars_nb = H2ONaiveBayesEstimator(min_prob=0.1, ... eps_prob=0.5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response_col, ... training_frame=cars) >>> cars_nb.show()
-
property
min_sdev
¶ Min. standard deviation to use for observations with not enough data
Type:
float
(default:0.001
).- Examples
>>> import random >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> problem = random.sample(["binomial","multinomial"],1)[0] >>> predictors = ["displacement","power","weight","acceleration","year"] >>> if problem == "binomial": ...     response_col = "economy_20mpg" ... else: ...     response_col = "cylinders" >>> cars[response_col] = cars[response_col].asfactor() >>> cars_nb = H2ONaiveBayesEstimator(min_sdev=0.1, ... eps_sdev=0.5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response_col, ... training_frame=cars) >>> cars_nb.show()
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars_nb = H2ONaiveBayesEstimator(nfolds=5, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_nb.auc(xval=True)
-
property
response_column
¶ Response variable column.
Type:
str
.
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator(score_each_iteration=True, ... seed=1234) >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_nb.auc()
-
property
seed
¶ Seed for pseudo random number generator (only used for cross-validation and fold_assignment=”Random” or “AUTO”)
Type:
int
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> nb_w_seed = H2ONaiveBayesEstimator(seed=1234) >>> nb_w_seed.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> nb_wo_seed = H2ONaiveBayesEstimator() >>> nb_wo_seed.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> nb_w_seed.auc() >>> nb_wo_seed.auc()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator() >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_nb.auc()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_nb = H2ONaiveBayesEstimator() >>> cars_nb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_nb.auc()
H2OSupportVectorMachineEstimator
¶
-
class
h2o.estimators.psvm.
H2OSupportVectorMachineEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
PSVM
-
property
disable_training_metrics
¶ Disable calculating training metrics (expensive on large datasets)
Type:
bool
(default:True
).- Examples
>>> from h2o.estimators import H2OSupportVectorMachineEstimator >>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
fact_threshold
¶ Convergence threshold of the Incomplete Cholesky Factorization (ICF)
Type:
float
(default:1e-05
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False, ... fact_threshold=1e-7) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
feasible_threshold
¶ Convergence threshold for primal-dual residuals in the IPM iteration
Type:
float
(default:0.001
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False, ... feasible_threshold=1e-2) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
gamma
¶ Coefficient of the kernel (currently RBF gamma for gaussian kernel, -1 means 1/#features)
Type:
float
(default:-1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
hyper_param
¶ Penalty parameter C of the error term
Type:
float
(default:1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... hyper_param=0.01, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... ignore_const_cols=False, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
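- Examples (no example ships upstream; a hedged sketch using the splice data from the surrounding examples — feature columns of the parsed SVMLight file are assumed to be auto-named C2, C3, ...)
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... ignored_columns=["C2","C3"], ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()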
-
property
kernel_type
¶ Type of kernel to use
One of:
"gaussian"
(default:"gaussian"
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... rank_ratio=0.1, ... hyper_param=0.01, ... kernel_type="gaussian", ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
max_iterations
¶ Maximum number of iterations of the algorithm
Type:
int
(default:200
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... rank_ratio=0.1, ... hyper_param=0.01, ... max_iterations=20, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
mu_factor
¶ Increasing factor mu
Type:
float
(default:10
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... mu_factor=100.5, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
negative_weight
¶ Weight of the negative (-1) class of observations
Type:
float
(default:1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... rank_ratio=0.1, ... negative_weight=10, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
positive_weight
¶ Weight of positive (+1) class of observations
Type:
float
(default:1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... rank_ratio=0.1, ... positive_weight=0.1, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
rank_ratio
¶ Desired rank of the ICF matrix expressed as a ratio of the number of input rows (-1 means use sqrt(#rows)).
Type:
float
(default:-1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
response_column
¶ Response variable column.
Type:
str
.
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.1, ... rank_ratio=0.1, ... seed=1234, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.model_performance()
-
property
surrogate_gap_threshold
¶ Feasibility criterion of the surrogate duality gap (eta)
Type:
float
(default:0.001
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... surrogate_gap_threshold=0.1, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
sv_threshold
¶ Threshold for accepting a candidate observation into the set of support vectors
Type:
float
(default:0.0001
).- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> svm = H2OSupportVectorMachineEstimator(gamma=0.01, ... rank_ratio=0.1, ... sv_threshold=0.01, ... disable_training_metrics=False) >>> svm.train(y="C1", training_frame=splice) >>> svm.mse()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> train, valid = splice.split_frame(ratios=[0.8]) >>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False) >>> svm.train(y="C1", training_frame=train) >>> svm.mse()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> splice = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/splice/splice.svm") >>> train, valid = splice.split_frame(ratios=[0.8]) >>> svm = H2OSupportVectorMachineEstimator(disable_training_metrics=False) >>> svm.train(y="C1", training_frame=train, validation_frame=valid) >>> svm.mse()
H2ORandomForestEstimator
¶
-
class
h2o.estimators.random_forest.
H2ORandomForestEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Distributed Random Forest
Builds a Distributed Random Forest (DRF) on a parsed dataset, for regression or classification.
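A minimal classification sketch in the style of the property examples below (assumes a running h2o session; the hyperparameters are illustrative):
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(ntrees=50, seed=1234) >>> cars_drf.train(x=["displacement","power","weight","acceleration","year"], ... y="economy_20mpg", ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid=True)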
-
property
balance_classes
¶ Balance training data class counts via over/under-sampling (for imbalanced data).
Type:
bool
(default:False
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_drf = H2ORandomForestEstimator(balance_classes=True, ... seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('logloss', cov_drf.logloss(valid=True))
-
property
binomial_double_trees
¶ For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(binomial_double_trees=False, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('without binomial_double_trees:', ... cars_drf.auc(valid=True)) >>> cars_drf_2 = H2ORandomForestEstimator(binomial_double_trees=True, ... seed=1234) >>> cars_drf_2.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('with binomial_double_trees:', cars_drf_2.auc(valid=True))
-
property
build_tree_one_node
¶ Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(build_tree_one_node=True, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid=True)
-
property
calibrate_model
¶ Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities.
Type:
bool
(default:False
).- Examples
>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv") >>> ecology['Angaus'] = ecology['Angaus'].asfactor() >>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> response = 'Angaus' >>> predictors = ecology.columns[3:13] >>> train, calib = ecology.split_frame(seed=12354) >>> w = h2o.create_frame(binary_fraction=1, ... binary_ones_fraction=0.5, ... missing_fraction=0, ... rows=744, cols=1) >>> w.set_names(["weight"]) >>> train = train.cbind(w) >>> ecology_drf = H2ORandomForestEstimator(ntrees=10, ... max_depth=5, ... min_rows=10, ... distribution="multinomial", ... weights_column="weight", ... calibrate_model=True, ... calibration_frame=calib) >>> ecology_drf.train(x=predictors, ... y="Angaus", ... training_frame=train) >>> predicted = ecology_drf.predict(calib)
-
property
calibration_frame
¶ Calibration frame for Platt Scaling
Type:
H2OFrame
.- Examples
>>> ecology = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/ecology_model.csv") >>> ecology['Angaus'] = ecology['Angaus'].asfactor() >>> response = 'Angaus' >>> predictors = ecology.columns[3:13] >>> train, calib = ecology.split_frame(seed=12354) >>> w = h2o.create_frame(binary_fraction=1, ... binary_ones_fraction=0.5, ... missing_fraction=0, ... rows=744, cols=1) >>> w.set_names(["weight"]) >>> train = train.cbind(w) >>> ecology_drf = H2ORandomForestEstimator(ntrees=10, ... max_depth=5, ... min_rows=10, ... distribution="multinomial", ... calibrate_model=True, ... calibration_frame=calib) >>> ecology_drf.train(x=predictors, ... y="Angaus", ... training_frame=train, ... weights_column="weight") >>> predicted = ecology_drf.predict(train)
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> encoding = "one_hot_explicit" >>> airlines_drf = H2ORandomForestEstimator(categorical_encoding=encoding, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_drf.auc(valid=True)
-
property
check_constant_response
¶ Check if the response column is constant. If enabled, an exception is thrown when the response column is a constant value. If disabled, the model trains regardless of whether the response column is constant.
Type:
bool
(default:True
).- Examples
>>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_train.csv") >>> train["constantCol"] = 1 >>> my_drf = H2ORandomForestEstimator(check_constant_response=False) >>> my_drf.train(x=list(range(1,5)), ... y="constantCol", ... training_frame=train)
-
property
checkpoint
¶ Model checkpoint to resume training with.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_drf = H2ORandomForestEstimator(ntrees=1, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cars_drf.auc(valid=True))
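The snippet above only builds the initial one-tree model; a hedged continuation showing how its id would be passed back through checkpoint (for DRF the continued run must request more trees than the original, so ntrees=3 is illustrative):
>>> cars_drf_continued = H2ORandomForestEstimator(checkpoint=cars_drf.model_id, ... ntrees=3, ... seed=1234) >>> cars_drf_continued.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cars_drf_continued.auc(valid=True))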
-
property
class_sampling_factors
¶ Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
Type:
List[float]
.- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> print(covtype[54].table()) >>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.] >>> cov_drf = H2ORandomForestEstimator(balance_classes=True, ... class_sampling_factors=sample_factors, ... seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('logloss', cov_drf.logloss(valid=True))
-
property
col_sample_rate_change_per_level
¶ Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_drf = H2ORandomForestEstimator(col_sample_rate_change_per_level=.9, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_drf.auc(valid=True))
-
property
col_sample_rate_per_tree
¶ Column sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_drf = H2ORandomForestEstimator(col_sample_rate_per_tree=.7, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_drf.auc(valid=True))
-
property
custom_metric_func
¶ Reference to custom evaluation function, format: language:keyName=funcName
Type:
str
.
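- Examples (no example ships upstream; a hedged sketch of the general custom-metric pattern, in which a metric class is uploaded with h2o.upload_custom_metric() and the returned reference is passed here — the class, function, and file names are illustrative)
>>> class CustomMseFunc: ...     def map(self, pred, act, w, o, model): ...         # squared error and a row count of 1 for this observation ...         error = act[0] - pred[0] ...         return [error * error, 1] ...     def reduce(self, l, r): ...         # combine partial [sum_of_squares, count] results ...         return [l[0] + r[0], l[1] + r[1]] ...     def metric(self, l): ...         # final metric: mean squared error ...         return l[0] / l[1] >>> custom_mse = h2o.upload_custom_metric(CustomMseFunc, ... func_name="custom_mse", ... func_file="custom_mse.py") >>> drf = H2ORandomForestEstimator(ntrees=10, ... custom_metric_func=custom_mse, ... seed=1234)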
-
property
distribution
¶ [Deprecated] Distribution function
One of:
"auto"
,"bernoulli"
,"multinomial"
,"gaussian"
,"poisson"
,"gamma"
,"tweedie"
,"laplace"
,"quantile"
,"huber"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "cylinders" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(distribution="poisson", ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.mse(valid=True)
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> from h2o.grid.grid_search import H2OGridSearch >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> response = "IsDepDelayed" >>> hyper_parameters = {'ntrees': [5,10]} >>> search_crit = {'strategy': "RandomDiscrete", ... 'max_models': 5, ... 'seed': 1234, ... 'stopping_rounds': 3, ... 'stopping_metric': "AUTO", ... 'stopping_tolerance': 1e-2} >>> checkpoints_dir = tempfile.mkdtemp() >>> air_grid = H2OGridSearch(H2ORandomForestEstimator, ... hyper_params=hyper_parameters, ... search_criteria=search_crit) >>> air_grid.train(x=predictors, ... y=response, ... training_frame=airlines, ... distribution="bernoulli", ... max_depth=3, ... export_checkpoints_dir=checkpoints_dir) >>> num_files = len(listdir(checkpoints_dir)) >>> num_files
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> assignment_type = "Random" >>> cars_drf = H2ORandomForestEstimator(fold_assignment=assignment_type, ... nfolds=5, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_drf.auc(xval=True)
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> cars = cars.cbind(fold_numbers) >>> print(cars['fold_numbers']) >>> cars_drf = H2ORandomForestEstimator(seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=cars, ... fold_column="fold_numbers") >>> cars_drf.auc(xval=True)
-
property
histogram_type
¶ What type of histogram to use for finding optimal split points
One of:
"auto"
,"uniform_adaptive"
,"random"
,"quantiles_global"
,"round_robin"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_drf = H2ORandomForestEstimator(histogram_type="UniformAdaptive", ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_drf.auc(valid=True))
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(seed=1234, ... ignore_const_cols=True) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid=True)
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
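- Examples (no example ships upstream; a hedged sketch — columns of the parsed covtype file are auto-named C1...C55, as in the examples above)
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> cov_drf = H2ORandomForestEstimator(ignored_columns=["C1","C2"], ... seed=1234) >>> cov_drf.train(y="C55", training_frame=covtype) >>> cov_drf.logloss()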
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_drf.cross_validation_fold_assignment()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_models=True, ... nfolds=5, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_drf.auc()
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train) >>> cars_drf.cross_validation_predictions()
-
property
max_after_balance_size
¶ Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.
Type:
float
(default:5
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> print(covtype[54].table()) >>> max = .85 >>> cov_drf = H2ORandomForestEstimator(balance_classes=True, ... max_after_balance_size=max, ... seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('logloss', cov_drf.logloss(valid=True))
-
property
max_confusion_matrix_size
¶ [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs
Type:
int
(default:20
).
-
property
max_depth
¶ Maximum tree depth.
Type:
int
(default:20
).- Examples
>>> df = h2o.import_file(path="http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> response = "survived" >>> df[response] = df[response].asfactor() >>> predictors = df.columns >>> del predictors[1:3] >>> train, valid, test = df.split_frame(ratios=[0.6,0.2], ... seed=1234, ... destination_frames= ... ['train.hex','valid.hex','test.hex']) >>> drf = H2ORandomForestEstimator(max_depth=5) >>> drf.train(x=predictors, ... y=response, ... training_frame=train) >>> perf = drf.model_performance(valid) >>> print(perf.auc())
-
property
max_hit_ratio_k
¶ Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
Type:
int
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_drf = H2ORandomForestEstimator(max_hit_ratio_k=3, ... seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cov_drf.show()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(max_runtime_secs=10, ... ntrees=10000, ... max_depth=10, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid = True)
-
property
min_rows
¶ Fewest allowed (weighted) observations in a leaf.
Type:
float
(default:1
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(min_rows=16, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cars_drf.auc(valid=True))
-
property
min_split_improvement
¶ Minimum relative improvement in squared error reduction for a split to happen
Type:
float
(default:1e-05
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(min_split_improvement=1e-3, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cars_drf.auc(valid=True))
-
property
mtries
¶ Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).
Type:
int
(default:-1
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234) >>> cov_drf = H2ORandomForestEstimator(mtries=30, seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('logloss', cov_drf.logloss(valid=True))
-
property
nbins
¶ For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point
Type:
int
(default:20
).- Examples
>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv") >>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor() >>> predictors = eeg.columns[:-1] >>> response = 'eyeDetection' >>> train, valid = eeg.split_frame(ratios=[.8], seed=1234) >>> bin_num = [16, 32, 64, 128, 256, 512] >>> label = ["16", "32", "64", "128", "256", "512"] >>> for key, num in enumerate(bin_num): ...     eeg_drf = H2ORandomForestEstimator(nbins=num, seed=1234) ...     eeg_drf.train(x=predictors, ...                   y=response, ...                   training_frame=train, ...                   validation_frame=valid) ...     print(label[key], 'training score', eeg_drf.auc(train=True)) ...     print(label[key], 'validation score', eeg_drf.auc(valid=True))
-
property
nbins_cats
¶ For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
Type:
int
(default:1024
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> bin_num = [8, 16, 32, 64, 128, 256, ... 512, 1024, 2048, 4096] >>> label = ["8", "16", "32", "64", "128", ... "256", "512", "1024", "2048", "4096"] >>> for key, num in enumerate(bin_num): ...     airlines_drf = H2ORandomForestEstimator(nbins_cats=num, ...                                             seed=1234) ...     airlines_drf.train(x=predictors, ...                        y=response, ...                        training_frame=train, ...                        validation_frame=valid) ...     print(label[key], 'training score', airlines_drf.auc(train=True)) ...     print(label[key], 'validation score', airlines_drf.auc(valid=True))
-
property
nbins_top_level
¶ For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level
Type:
int
(default:1024
).- Examples
>>> eeg = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate.csv") >>> eeg['eyeDetection'] = eeg['eyeDetection'].asfactor() >>> predictors = eeg.columns[:-1] >>> response = 'eyeDetection' >>> train, valid = eeg.split_frame(ratios=[.8], ... seed=1234) >>> bin_num = [32, 64, 128, 256, 512, ... 1024, 2048, 4096] >>> label = ["32", "64", "128", "256", ... "512", "1024", "2048", "4096"] >>> for key, num in enumerate(bin_num): ...     eeg_drf = H2ORandomForestEstimator(nbins_top_level=num, ...                                        seed=1234) ...     eeg_drf.train(x=predictors, ...                   y=response, ...                   training_frame=train, ...                   validation_frame=valid) ...     print(label[key], 'training score', eeg_drf.auc(train=True)) ...     print(label[key], 'validation score', eeg_drf.auc(valid=True))
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> folds = 5 >>> cars_drf = H2ORandomForestEstimator(nfolds=folds, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=cars) >>> cars_drf.auc(xval=True)
-
property
ntrees
¶ Number of trees.
Type:
int
(default:50
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> tree_num = [20, 50, 80, 110, ... 140, 170, 200] >>> label = ["20", "50", "80", "110", ... "140", "170", "200"] >>> for key, num in enumerate(tree_num): ...     titanic_drf = H2ORandomForestEstimator(ntrees=num, ...                                            seed=1234) ...     titanic_drf.train(x=predictors, ...                       y=response, ...                       training_frame=train, ...                       validation_frame=valid) ...     print(label[key], 'training score', titanic_drf.auc(train=True)) ...     print(label[key], 'validation score', titanic_drf.auc(valid=True))
-
property
offset_column
¶ [Deprecated] Offset column. This will be added to the combination of columns before applying the link function.
Type:
str
.
-
property
r2_stopping
¶ r2_stopping is no longer supported and will be ignored if set - please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value.
Type:
float
(default:1.797693135e+308
).
-
property
response_column
¶ Response variable column.
Type:
str
.
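- Examples (hedged sketch: as elsewhere, this is populated through the y argument of train())
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> cars_drf = H2ORandomForestEstimator(seed=1234) >>> cars_drf.train(x=["displacement","power","weight","acceleration","year"], ... y="economy_20mpg", ... training_frame=cars) >>> cars_drf.auc()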
-
property
sample_rate
¶ Row sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:0.632
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_drf = H2ORandomForestEstimator(sample_rate=.7, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_drf.auc(valid=True))
-
property
sample_rate_per_class
¶ A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
Type:
List[float]
.- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], ... seed=1234) >>> print(train[response].table()) >>> rate_per_class_list = [1, .4, 1, 1, 1, 1, 1] >>> cov_drf = H2ORandomForestEstimator(sample_rate_per_class=rate_per_class_list, ... seed=1234) >>> cov_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('logloss', cov_drf.logloss(valid=True))
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(score_each_iteration=True, ... ntrees=55, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.scoring_history()
-
property
score_tree_interval
¶ Score the model after every so many trees. Disabled if set to 0.
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_drf = H2ORandomForestEstimator(score_tree_interval=5, ... seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.scoring_history()
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> drf_w_seed_1 = H2ORandomForestEstimator(seed=1234) >>> drf_w_seed_1.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('auc for the 1st model build with a seed:', ... drf_w_seed_1.auc(valid=True))
-
property
stopping_metric
¶ Metric to use for early stopping (AUTO: logloss for classification, deviance for regression, and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
One of:
"auto"
,"deviance"
,"logloss"
,"mse"
,"rmse"
,"mae"
,"rmsle"
,"auc"
,"aucpr"
,"lift_top_group"
,"misclassification"
,"mean_per_class_error"
,"custom"
,"custom_increasing"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_drf.auc(valid=True)
-
property
stopping_rounds
¶ Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
Type:
int
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_drf.auc(valid=True)
-
property
stopping_tolerance
¶ Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Type:
float
(default:0.001
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_drf = H2ORandomForestEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_drf.auc(valid=True)
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_drf = H2ORandomForestEstimator(seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid=True)
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_drf = H2ORandomForestEstimator(seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_drf.auc(valid=True)
-
property
weights_column
¶ Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","acceleration","year"] >>> response = "economy_20mpg" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_drf = H2ORandomForestEstimator(seed=1234) >>> cars_drf.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid, ... weights_column="weight") >>> cars_drf.auc(valid=True)
H2OStackedEnsembleEstimator
¶
-
class
h2o.estimators.stackedensemble.
H2OStackedEnsembleEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Stacked Ensemble
Builds a stacked ensemble (aka “super learner”), a machine learning method that uses two or more H2O learning algorithms to improve predictive performance. It is a loss-based supervised learning method that finds the optimal combination of a collection of prediction algorithms. This method supports regression and binary classification.
- Examples
>>> import h2o >>> h2o.init() >>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> col_types = ["numeric", "numeric", "numeric", "enum", ... "enum", "numeric", "numeric", "numeric", "numeric"] >>> data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv", col_types=col_types) >>> train, test = data.split_frame(ratios=[.8], seed=1) >>> x = ["CAPSULE","GLEASON","RACE","DPROS","DCAPS","PSA","VOL"] >>> y = "AGE" >>> nfolds = 5 >>> gbm = H2OGradientBoostingEstimator(nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True) >>> gbm.train(x=x, y=y, training_frame=train) >>> rf = H2ORandomForestEstimator(nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True) >>> rf.train(x=x, y=y, training_frame=train) >>> stack = H2OStackedEnsembleEstimator(model_id="ensemble", ... training_frame=train, ... validation_frame=test, ... base_models=[gbm.model_id, rf.model_id]) >>> stack.train(x=x, y=y, training_frame=train, validation_frame=test) >>> stack.model_performance()
-
property
base_models
¶ List of models or grids (or their ids) to ensemble/stack together. Grids are expanded to individual models. If a blending frame is not used, the models must have been cross-validated using nfolds > 1, and the folds must be identical across models.
Type:
List[str]
(default:[]
).- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> col_types = ["numeric", "numeric", "numeric", "enum", ... "enum", "numeric", "numeric", "numeric", "numeric"] >>> data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv", col_types=col_types) >>> train, test = data.split_frame(ratios=[.8], seed=1) >>> x = ["CAPSULE","GLEASON","RACE","DPROS","DCAPS","PSA","VOL"] >>> y = "AGE" >>> nfolds = 5 >>> gbm = H2OGradientBoostingEstimator(nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True) >>> gbm.train(x=x, y=y, training_frame=train) >>> rf = H2ORandomForestEstimator(nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True) >>> rf.train(x=x, y=y, training_frame=train) >>> stack = H2OStackedEnsembleEstimator(model_id="ensemble", ... training_frame=train, ... validation_frame=test, ... base_models=[gbm.model_id, rf.model_id]) >>> stack.train(x=x, y=y, training_frame=train, validation_frame=test) >>> stack.model_performance()
-
property
blending_frame
¶ Frame used to compute the predictions that serve as the training frame for the metalearner (triggers blending mode if provided)
Type:
H2OFrame
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=10, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> import tempfile >>> from os import listdir >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> checkpoints_dir = tempfile.mkdtemp() >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=10, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... export_checkpoints_dir=checkpoints_dir) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> len(listdir(checkpoints_dir))
-
property
keep_levelone_frame
¶ Keep level one frame used for metalearner training.
Type:
bool
(default:False
).- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... keep_levelone_frame=True) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
levelone_frame_id
()[source]¶ Fetch the levelone_frame_id for an H2OStackedEnsembleEstimator.
- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=10, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... keep_levelone_frame=True) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.levelone_frame_id()
-
metalearner
()[source]¶ Print the metalearner of an H2OStackedEnsembleEstimator.
- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=10, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... keep_levelone_frame=True) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.metalearner()
-
property
metalearner_algorithm
¶ Type of algorithm to use as the metalearner. Options include ‘AUTO’ (GLM with non-negative weights; if validation_frame is present, a lambda search is performed), ‘deeplearning’ (Deep Learning with default parameters), ‘drf’ (Random Forest with default parameters), ‘gbm’ (GBM with default parameters), ‘glm’ (GLM with default parameters), ‘naivebayes’ (NaiveBayes with default parameters), or ‘xgboost’ (if available, XGBoost with default parameters).
One of:
"auto"
,"deeplearning"
,"drf"
,"gbm"
,"glm"
,"naivebayes"
,"xgboost"
(default:"auto"
).- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_algorithm="gbm") >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
property
metalearner_fold_assignment
¶ Cross-validation fold assignment scheme for metalearner cross-validation. Defaults to AUTO (which is currently set to Random). The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_fold_assignment="Random") >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
property
metalearner_fold_column
¶ Column with cross-validation fold index assignment per observation for cross-validation of the metalearner.
Type:
str
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv") >>> fold_column = "fold_id" >>> train[fold_column] = train.kfold_column(n_folds=3, seed=1) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> x.remove(fold_column) >>> train[y] = train[y].asfactor() >>> test[y] = test[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=10, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... metalearner_fold_column=fold_column, ... metalearner_params=dict(keep_cross_validation_models=True)) >>> stack.train(x=x, y=y, training_frame=train) >>> stack.model_performance().auc()
-
property
metalearner_nfolds
¶ Number of folds for K-fold cross-validation of the metalearner algorithm (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_nfolds=3) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
property
metalearner_params
¶ Parameters for metalearner algorithm
Type:
dict
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> gbm_params = {"ntrees" : 100, "max_depth" : 6} >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... metalearner_algorithm="gbm", ... metalearner_params=gbm_params) >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
property
response_column
¶ Response variable column.
Type:
str
.
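- Examples
A minimal sketch (reusing the gbm, rf, x, y, and train objects from the class example above): the response column is set through the y argument of train() and can be read back from this property.
>>> stack = H2OStackedEnsembleEstimator(base_models=[gbm.model_id, rf.model_id])
>>> stack.train(x=x, y=y, training_frame=train)
>>> stack.response_column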
-
property
seed
¶ Seed for random numbers; passed through to the metalearner algorithm. Defaults to -1 (time-based random number)
Type:
int
(default:-1
).- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, blend = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> blend[y] = blend[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack_blend = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_fold_assignment="Random") >>> stack_blend.train(x=x, y=y, training_frame=train, blending_frame=blend) >>> stack_blend.model_performance(blend).auc()
-
train
(x=None, y=None, training_frame=None, blending_frame=None, verbose=False, **kwargs)[source]¶ Train the H2O model.
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
max_runtime_secs (float) – Maximum allowed runtime in seconds for model training. Use 0 to disable.
verbose (bool) – Print scoring history to stdout. Defaults to False.
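- Examples
A minimal sketch, reusing the gbm and rf base models from the class example above; the max_runtime_secs and verbose values are illustrative only.
>>> stack = H2OStackedEnsembleEstimator(base_models=[gbm.model_id, rf.model_id])
>>> stack.train(x=x, y=y,
...             training_frame=train,
...             max_runtime_secs=30,
...             verbose=True)
>>> stack.model_performance()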
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, valid = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> valid[y] = valid[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_fold_assignment="Random") >>> stack.train(x=x, y=y, training_frame=train, validation_frame=valid) >>> stack.model_performance(valid).auc()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> from h2o.estimators.random_forest import H2ORandomForestEstimator >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator >>> higgs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv") >>> train, valid = higgs.split_frame(ratios = [.8], seed = 1234) >>> x = train.columns >>> y = "response" >>> x.remove(y) >>> train[y] = train[y].asfactor() >>> valid[y] = valid[y].asfactor() >>> nfolds = 3 >>> my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli", ... ntrees=1, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_gbm.train(x=x, y=y, training_frame=train) >>> my_rf = H2ORandomForestEstimator(ntrees=50, ... nfolds=nfolds, ... fold_assignment="Modulo", ... keep_cross_validation_predictions=True, ... seed=1) >>> my_rf.train(x=x, y=y, training_frame=train) >>> stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf], ... seed=1, ... metalearner_fold_assignment="Random") >>> stack.train(x=x, y=y, training_frame=train, validation_frame=valid) >>> stack.model_performance(valid).auc()
H2OTargetEncoderEstimator
¶
-
class
h2o.estimators.targetencoder.
H2OTargetEncoderEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
TargetEncoder
Trains a target-encoding map on an H2O dataset: each categorical level of the encoded columns is replaced with (optionally blended) aggregate statistics of the response.
-
property
blending
¶ Blending enabled/disabled
Type:
bool
(default:False
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
property
data_leakage_handling
¶ Data leakage handling strategy.
One of:
"none"
,"k_fold"
,"leave_one_out"
(default:"none"
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... data_leakage_handling="k_fold", ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
property
f
¶ Smoothing. Used for blending (if enabled). Blending is to be enabled separately using the ‘blending’ parameter.
Type:
float
(default:20
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
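- Examples
A minimal sketch (assuming the titanic dataset used in the other examples): everything except the three encoded predictors and the response is listed in ignored_columns, instead of passing x.
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> keep = ["home.dest", "cabin", "embarked", "survived"]
>>> titanic_te = H2OTargetEncoderEstimator(k=35, f=25)
>>> titanic_te.train(y="survived",
...                  ignored_columns=[c for c in titanic.columns if c not in keep],
...                  training_frame=titanic)
>>> titanic_te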
-
property
k
¶ Inflection point. Used for blending (if enabled). Blending is to be enabled separately using the ‘blending’ parameter.
Type:
float
(default:10
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
property
noise_level
¶ Amount of random noise added to the target-encoded values; helps prevent overfitting.
Type:
float
(default:0.01
).
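- Examples
A minimal sketch (titanic setup as in the examples above); the noise_level value of 0.02 is illustrative.
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> titanic_te = H2OTargetEncoderEstimator(noise_level=0.02,
...                                        seed=1234)
>>> titanic_te.train(x=["home.dest", "cabin", "embarked"],
...                  y="survived",
...                  training_frame=titanic)
>>> titanic_te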
-
property
response_column
¶ Response variable column.
Type:
str
.
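- Examples
The response column is set through the y argument of train(); a minimal sketch reading it back:
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> titanic_te = H2OTargetEncoderEstimator()
>>> titanic_te.train(x=["home.dest", "cabin", "embarked"],
...                  y="survived",
...                  training_frame=titanic)
>>> titanic_te.response_column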
-
property
seed
¶ Seed for the specified noise level
Type:
int
(default:-1
).
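- Examples
A minimal sketch: fixing the seed makes the noise draws reproducible across runs (the values shown are illustrative).
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic["survived"] = titanic["survived"].asfactor()
>>> titanic_te = H2OTargetEncoderEstimator(noise_level=0.01,
...                                        seed=1234)
>>> titanic_te.train(x=["home.dest", "cabin", "embarked"],
...                  y="survived",
...                  training_frame=titanic)
>>> titanic_te.transform(frame=titanic, seed=1234)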
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic["survived"] = titanic["survived"].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_te
-
transform
(frame, data_leakage_handling='None', noise=-1, seed=-1)[source]¶ Apply the transformation to the target-encoded columns (te_columns), based on the encoding maps generated during the train() method call.
- Parameters
frame (H2OFrame) – to which frame we are applying target encoding transformations.
data_leakage_handling (str) – Supported options:
“k_fold” - encodings for a fold are generated based on out-of-fold data.
“leave_one_out” - the current row’s response value is subtracted from the pre-calculated per-level frequencies.
“none” - no holdout; the whole frame is used for training.
noise (float) – the amount of random noise added to the target encoding. This helps prevent overfitting. Defaults to 0.01 * range of y.
seed (int) – a random seed used to generate draws from the uniform distribution for random noise. Defaults to -1.
- Example
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = ["home.dest", "cabin", "embarked"] >>> response = "survived" >>> titanic[response] = titanic[response].asfactor() >>> fold_col = "kfold_column" >>> titanic[fold_col] = titanic.kfold_column(n_folds=5, seed=1234) >>> titanic_te = H2OTargetEncoderEstimator(k=35, ... f=25, ... data_leakage_handling="leave_one_out", ... blending=True) >>> titanic_te.train(x=predictors, ... y=response, ... training_frame=titanic) >>> transformed = titanic_te.transform(frame=titanic, ... data_leakage_handling="leave_one_out", ... seed=1234)
H2OXGBoostEstimator
¶
-
class
h2o.estimators.xgboost.
H2OXGBoostEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
XGBoost
Builds an eXtreme Gradient Boosting model using the native XGBoost backend.
-
static
available
()[source]¶ Ask the H2O server whether an XGBoost model can be built (depends on availability of native backends). :return: True if an XGBoost model can be built, or False otherwise.
- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_xgb = H2OXGBoostEstimator(seed=1234) >>> boston_xgb.available()
-
property
backend
¶ Backend. By default (auto), a GPU is used if available.
One of:
"auto"
,"gpu"
,"cpu"
(default:"auto"
).- Examples
>>> pros = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> pros["CAPSULE"] = pros["CAPSULE"].asfactor() >>> pros_xgb = H2OXGBoostEstimator(tree_method="exact", ... seed=123, ... backend="cpu") >>> pros_xgb.train(y="CAPSULE", ... ignored_columns=["ID"], ... training_frame=pros) >>> pros_xgb.auc()
-
property
booster
¶ Booster type
One of:
"gbtree"
,"gblinear"
,"dart"
(default:"gbtree"
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(booster='dart', ... normalize_type="tree", ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
build_tree_one_node
¶ Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
Type:
bool
(default:False
).
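- Examples
A minimal sketch (the prostate dataset is small, so single-node tree building avoids network overhead):
>>> pros = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> pros["CAPSULE"] = pros["CAPSULE"].asfactor()
>>> pros_xgb = H2OXGBoostEstimator(build_tree_one_node=True,
...                                seed=1234)
>>> pros_xgb.train(y="CAPSULE",
...                ignored_columns=["ID"],
...                training_frame=pros)
>>> pros_xgb.auc()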
-
property
calibrate_model
¶ Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. See the sketch under calibration_frame below, which illustrates both parameters together.
Type:
bool
(default:False
).
-
property
calibration_frame
¶ Calibration frame for Platt Scaling
Type:
H2OFrame
.
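- Examples
A minimal sketch illustrating calibrate_model together with calibration_frame (titanic setup as elsewhere in these docs; the split ratio and predictor list are illustrative):
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> predictors = ["home.dest", "cabin", "embarked", "sex", "age", "fare"]
>>> train, calib = titanic.split_frame(ratios=[.8], seed=1234)
>>> titanic_xgb = H2OXGBoostEstimator(ntrees=10,
...                                   calibrate_model=True,
...                                   calibration_frame=calib,
...                                   seed=1234)
>>> titanic_xgb.train(x=predictors,
...                   y='survived',
...                   training_frame=train)
>>> titanic_xgb.predict(calib)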
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> encoding = "one_hot_explicit" >>> airlines_xgb = H2OXGBoostEstimator(categorical_encoding=encoding, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.auc(valid=True)
-
property
checkpoint
¶ Model checkpoint to resume training with.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor() >>> predictors = ["displacement","power","weight","year","economy_20mpg"] >>> response = "acceleration" >>> from h2o.estimators import H2OXGBoostEstimator >>> cars_xgb = H2OXGBoostEstimator(seed=1234) >>> train, valid = cars.split_frame(ratios=[.8]) >>> cars_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_xgb.mse() >>> cars_xgb_continued = H2OXGBoostEstimator(checkpoint=cars_xgb.model_id, ... ntrees=51, ... seed=1234) >>> cars_xgb_continued.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_xgb_continued.mse()
-
property
col_sample_rate
¶ (same as colsample_bylevel) Column sample rate (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
col_sample_rate_per_tree
¶ (same as colsample_bytree) Column sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate_per_tree=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
colsample_bylevel
¶ (same as col_sample_rate) Column sample rate (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
colsample_bytree
¶ (same as col_sample_rate_per_tree) Column sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(col_sample_rate_per_tree=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
distribution
¶ Distribution function
One of:
"auto"
,"bernoulli"
,"multinomial"
,"gaussian"
,"poisson"
,"gamma"
,"tweedie"
,"laplace"
,"quantile"
,"huber"
(default:"auto"
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> response = "cylinders" >>> train, valid = cars.split_frame(ratios=[.8], ... seed=1234) >>> cars_xgb = H2OXGBoostEstimator(distribution="poisson", ... seed=1234) >>> cars_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> cars_xgb.mse(valid=True)
-
property
dmatrix_type
¶ Type of DMatrix. For sparse, NAs and 0 are treated equally.
One of:
"auto"
,"dense"
,"sparse"
(default:"auto"
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_xgb = H2OXGBoostEstimator(dmatrix_type="auto", ... seed=1234) >>> boston_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_xgb.mse()
-
property
eta
¶ (same as learn_rate) Learning rate (from 0.0 to 1.0)
Type:
float
(default:0.3
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(ntrees=10000, ... learn_rate=0.01, ... stopping_rounds=5, ... stopping_metric="AUC", ... stopping_tolerance=1e-4, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from h2o.grid.grid_search import H2OGridSearch >>> from os import listdir >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> response = "IsDepDelayed" >>> hyper_parameters = {'ntrees': [5,10]} >>> search_crit = {'strategy': "RandomDiscrete", ... 'max_models': 5, ... 'seed': 1234, ... 'stopping_rounds': 3, ... 'stopping_metric': "AUTO", ... 'stopping_tolerance': 1e-2} >>> checkpoints_dir = tempfile.mkdtemp() >>> air_grid = H2OGridSearch(H2OXGBoostEstimator, ... hyper_params=hyper_parameters, ... search_criteria=search_crit) >>> air_grid.train(x=predictors, ... y=response, ... training_frame=airlines, ... distribution="bernoulli", ... learn_rate=0.1, ... max_depth=3, ... export_checkpoints_dir=checkpoints_dir) >>> len(listdir(checkpoints_dir))
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> assignment_type = "Random" >>> titanic_xgb = H2OXGBoostEstimator(fold_assignment=assignment_type, ... nfolds=5, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_xgb.auc(xval=True)
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> fold_numbers = titanic.kfold_column(n_folds=5, ... seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> titanic = titanic.cbind(fold_numbers) >>> print(titanic['fold_numbers']) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=titanic, ... fold_column="fold_numbers") >>> titanic_xgb.auc(xval=True)
-
property
gamma
¶ (same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen
Type:
float
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_split_improvement=1e-3, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
gpu_id
¶ Which GPU to use.
Type:
int
(default:0
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_xgb = H2OXGBoostEstimator(gpu_id=0, ... seed=1234) >>> boston_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> boston_xgb.mse()
-
property
grow_policy
¶ Grow policy - depthwise is standard GBM, lossguide is LightGBM
One of:
"depthwise"
,"lossguide"
(default:"depthwise"
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> titanic["const_1"] = 6 >>> titanic["const_2"] = 7 >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234, ... grow_policy="depthwise") >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> titanic["const_1"] = 6 >>> titanic["const_2"] = 7 >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234, ... ignore_const_cols=True) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
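- Examples
A minimal sketch: rather than listing x, the ID column is excluded via ignored_columns in train().
>>> pros = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> pros["CAPSULE"] = pros["CAPSULE"].asfactor()
>>> pros_xgb = H2OXGBoostEstimator(seed=1234)
>>> pros_xgb.train(y="CAPSULE",
...                ignored_columns=["ID"],
...                training_frame=pros)
>>> pros_xgb.auc()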
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train) >>> titanic_xgb.cross_validation_fold_assignment()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_models=True, ... nfolds=5 , ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train) >>> titanic_xgb.cross_validation_models()
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train) >>> titanic_xgb.cross_validation_predictions()
-
property
learn_rate
¶ (same as eta) Learning rate (from 0.0 to 1.0)
Type:
float
(default:0.3
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(ntrees=10000, ... learn_rate=0.01, ... stopping_rounds=5, ... stopping_metric="AUC", ... stopping_tolerance=1e-4, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
max_abs_leafnode_pred
¶ (same as max_delta_step) Maximum absolute value of a leaf node prediction
Type:
float
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], ... seed=1234) >>> cov_xgb = H2OXGBoostEstimator(max_abs_leafnode_pred=float(2), ... seed=1234) >>> cov_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cov_xgb.logloss(valid=True))
-
property
max_bins
¶ For tree_method=hist only: maximum number of bins
Type:
int
(default:256
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], ... seed=1234) >>> cov_xgb = H2OXGBoostEstimator(max_bins=200, ... seed=1234) >>> cov_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cov_xgb.logloss(valid=True))
-
property
max_delta_step
¶ (same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction
Type:
float
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], ... seed=1234) >>> cov_xgb = H2OXGBoostEstimator(max_delta_step=float(2), ... seed=1234) >>> cov_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cov_xgb.logloss(valid=True))
-
property
max_depth
¶ Maximum tree depth.
Type:
int
(default:6
).- Examples
>>> df = h2o.import_file(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> response = "survived" >>> df[response] = df[response].asfactor() >>> predictors = df.columns >>> del predictors[1:3] >>> train, valid, test = df.split_frame(ratios=[0.6,0.2], ... seed=1234, ... destination_frames= ... ['train.hex', ... 'valid.hex', ... 'test.hex']) >>> xgb = H2OXGBoostEstimator() >>> xgb.train(x=predictors, ... y=response, ... training_frame=train) >>> perf = xgb.model_performance(valid) >>> print(perf.auc())
-
property
max_leaves
¶ For tree_method=hist only: maximum number of leaves
Type:
int
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(max_leaves=0, seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> covtype[54] = covtype[54].asfactor() >>> predictors = covtype.columns[0:54] >>> response = 'C55' >>> train, valid = covtype.split_frame(ratios=[.8], ... seed=1234) >>> cov_xgb = H2OXGBoostEstimator(max_runtime_secs=10, ... ntrees=10000, ... max_depth=10, ... seed=1234) >>> cov_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(cov_xgb.logloss(valid=True))
-
property
min_child_weight
¶ (same as min_rows) Fewest allowed (weighted) observations in a leaf.
Type:
float
(default:1
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_child_weight=16, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
min_data_in_leaf
¶ For tree_method=hist only: the minimum data in a leaf to keep splitting
Type:
float
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_data_in_leaf=0.55, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
-
property
min_rows
¶ (same as min_child_weight) Fewest allowed (weighted) observations in a leaf.
Type:
float
(default:1
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_rows=16, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
min_split_improvement
¶ (same as gamma) Minimum relative improvement in squared error reduction for a split to happen
Type:
float
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_split_improvement=0.55, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
min_sum_hessian_in_leaf
¶ For tree_method=hist only: the minimum sum of the Hessian in a leaf to keep splitting
Type:
float
(default:100
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(min_sum_hessian_in_leaf=90.5, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
-
property
monotone_constraints
¶ A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
Type:
dict
.- Examples
>>> prostate_hex = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor() >>> response = "CAPSULE" >>> seed=42 >>> monotone_constraints={"AGE":1} >>> xgb_model = H2OXGBoostEstimator(seed=seed, ... monotone_constraints=monotone_constraints) >>> xgb_model.train(y=response, ... ignored_columns=["ID"], ... training_frame=prostate_hex) >>> xgb_model.scoring_history()
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> folds = 5 >>> titanic_xgb = H2OXGBoostEstimator(nfolds=folds, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=titanic) >>> titanic_xgb.auc(xval=True)
-
property
normalize_type
¶ For booster=dart only: normalize_type
One of:
"tree"
,"forest"
(default:"tree"
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(booster='dart', ... normalize_type="tree", ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
nthread
¶ Number of parallel threads that can be used to run XGBoost. Cannot exceed the H2O cluster limits (-nthreads parameter). Defaults to the maximum available.
Type:
int
(default:-1
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> thread = 4 >>> titanic_xgb = H2OXGBoostEstimator(nthread=thread, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=titanic) >>> print(titanic_xgb.auc(train=True))
-
property
ntrees
¶ (same as n_estimators) Number of trees.
Type:
int
(default:50
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> tree_num = [20, 50, 80, 110, 140, 170, 200] >>> label = ["20", "50", "80", "110", ... "140", "170", "200"] >>> for key, num in enumerate(tree_num): ...     titanic_xgb = H2OXGBoostEstimator(ntrees=num, ...                                       seed=1234) ...     titanic_xgb.train(x=predictors, ...                       y=response, ...                       training_frame=train, ...                       validation_frame=valid) ...     print(label[key], 'training score', ...           titanic_xgb.auc(train=True)) ...     print(label[key], 'validation score', ...           titanic_xgb.auc(valid=True))
-
property
offset_column
¶ Offset column. This will be added to the combination of columns before applying the link function.
Type:
str
.
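- Examples
A minimal, illustrative sketch (not part of the original reference): it assumes the public insurance dataset used elsewhere in these docs and uses the log of its Holders column as a hypothetical offset.
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
>>> # hypothetical offset column: log of the number of policy holders
>>> insurance["offset"] = insurance["Holders"].log()
>>> predictors = insurance.columns[0:4]
>>> response = "Claims"
>>> insurance_xgb = H2OXGBoostEstimator(offset_column="offset",
...                                     seed=1234)
>>> insurance_xgb.train(x=predictors,
...                     y=response,
...                     training_frame=insurance)
>>> insurance_xgb.mse()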
-
property
one_drop
¶ For booster=dart only: one_drop
Type:
bool
(default:False
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(booster='dart', ... one_drop=True, ... seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
quiet_mode
¶ Enable quiet mode
Type:
bool
(default:True
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234, quiet_mode=True) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.mse(valid=True)
-
property
rate_drop
¶ For booster=dart only: rate_drop (0..1)
Type:
float
(default:0
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(rate_drop=0.1, seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(titanic_xgb.auc(valid=True))
-
property
reg_alpha
¶ L1 regularization
Type:
float
(default:0
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> response = "medv" >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_xgb = H2OXGBoostEstimator(reg_alpha=.25) >>> boston_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(boston_xgb.mse(valid=True))
-
property
reg_lambda
¶ L2 regularization
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8]) >>> airlines_xgb = H2OXGBoostEstimator(reg_lambda=.0001, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
response_column
¶ Response variable column.
Type:
str
.
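- Examples
A minimal, illustrative sketch (not part of the original reference): the response column is normally set implicitly through the y argument of train() rather than via this property.
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> titanic_xgb = H2OXGBoostEstimator(seed=1234)
>>> # y="survived" sets the response_column
>>> titanic_xgb.train(x=["age", "sibsp", "parch", "fare"],
...                   y="survived",
...                   training_frame=titanic)
>>> titanic_xgb.auc(train=True)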
-
property
sample_rate
¶ (same as subsample) Row sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(sample_rate=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
sample_type
¶ For booster=dart only: sample_type
One of:
"uniform"
,"weighted"
(default:"uniform"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"]= airlines["Year"].asfactor() >>> airlines["Month"]= airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(sample_type="weighted", ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
save_matrix_directory
¶ Directory in which to save matrices passed to the XGBoost library. Useful for debugging.
Type:
str
.
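- Examples
A minimal, illustrative sketch (not part of the original reference): it points save_matrix_directory at a temporary directory and then checks that files were written there.
>>> import tempfile
>>> from os import listdir
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> titanic['survived'] = titanic['survived'].asfactor()
>>> matrix_dir = tempfile.mkdtemp()
>>> titanic_xgb = H2OXGBoostEstimator(save_matrix_directory=matrix_dir,
...                                   ntrees=10,
...                                   seed=1234)
>>> titanic_xgb.train(x=["age", "fare"],
...                   y="survived",
...                   training_frame=titanic)
>>> len(listdir(matrix_dir))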
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(score_each_iteration=True, ... ntrees=55, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.scoring_history()
-
property
score_tree_interval
¶ Score the model after every so many trees. Disabled if set to 0.
Type:
int
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(score_tree_interval=5, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.scoring_history()
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> xgb_w_seed_1 = H2OXGBoostEstimator(col_sample_rate=.7, ... seed=1234) >>> xgb_w_seed_1.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> xgb_w_seed_2 = H2OXGBoostEstimator(col_sample_rate = .7, ... seed = 1234) >>> xgb_w_seed_2.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print('auc for the 1st model built with a seed:', ... xgb_w_seed_1.auc(valid=True)) >>> print('auc for the 2nd model built with a seed:', ... xgb_w_seed_2.auc(valid=True))
-
property
skip_drop
¶ For booster=dart only: skip_drop (0..1)
Type:
float
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(skip_drop=0.5, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train) >>> airlines_xgb.auc(train=True)
-
property
stopping_metric
¶ Metric to use for early stopping (AUTO: logloss for classification, deviance for regression, and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
One of:
"auto"
,"deviance"
,"logloss"
,"mse"
,"rmse"
,"mae"
,"rmsle"
,"auc"
,"aucpr"
,"lift_top_group"
,"misclassification"
,"mean_per_class_error"
,"custom"
,"custom_increasing"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.auc(valid=True)
-
property
stopping_rounds
¶ Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
Type:
int
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.auc(valid=True)
-
property
stopping_tolerance
¶ Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Type:
float
(default:0.001
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(stopping_metric="auc", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> airlines_xgb.auc(valid=True)
-
property
subsample
¶ (same as sample_rate) Row sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(sample_rate=.7, ... seed=1234) >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
-
property
tree_method
¶ Tree method
One of:
"auto"
,"exact"
,"approx"
,"hist"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> airlines["Year"] = airlines["Year"].asfactor() >>> airlines["Month"] = airlines["Month"].asfactor() >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor() >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor() >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> response = "IsDepDelayed" >>> train, valid= airlines.split_frame(ratios=[.8], ... seed=1234) >>> airlines_xgb = H2OXGBoostEstimator(seed=1234, ... tree_method="approx") >>> airlines_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(airlines_xgb.auc(valid=True))
-
property
tweedie_power
¶ Tweedie power for Tweedie regression, must be between 1 and 2.
Type:
float
(default:1.5
).- Examples
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> predictors = insurance.columns[0:4] >>> response = 'Claims' >>> insurance['Group'] = insurance['Group'].asfactor() >>> insurance['Age'] = insurance['Age'].asfactor() >>> train, valid = insurance.split_frame(ratios=[.8], ... seed=1234) >>> insurance_xgb = H2OXGBoostEstimator(distribution="tweedie", ... tweedie_power=1.2, ... seed=1234) >>> insurance_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(insurance_xgb.mse(valid=True))
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance['Group'] = insurance['Group'].asfactor() >>> insurance['Age'] = insurance['Age'].asfactor() >>> predictors = insurance.columns[0:4] >>> response = 'Claims' >>> train, valid = insurance.split_frame(ratios=[.8], ... seed=1234) >>> insurance_xgb = H2OXGBoostEstimator(seed=1234) >>> insurance_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> print(insurance_xgb.mse(valid=True))
-
property
weights_column
¶ Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
Type:
str
.- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic['survived'] = titanic['survived'].asfactor() >>> predictors = titanic.columns >>> del predictors[1:3] >>> response = 'survived' >>> train, valid = titanic.split_frame(ratios=[.8], ... seed=1234) >>> titanic_xgb = H2OXGBoostEstimator(seed=1234) >>> titanic_xgb.train(x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_xgb.auc(valid=True)
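The example above trains without an explicit weight column; a minimal sketch that actually exercises the parameter (adding a hypothetical constant column named "weight", so every row counts once) might look like:
>>> titanic["weight"] = 1
>>> titanic_xgb_w = H2OXGBoostEstimator(seed=1234)
>>> titanic_xgb_w.train(x=predictors,
...                     y=response,
...                     training_frame=titanic,
...                     weights_column="weight")
>>> titanic_xgb_w.auc(train=True)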
Unsupervised¶
H2OAggregatorEstimator
¶
-
class
h2o.estimators.aggregator.
H2OAggregatorEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Aggregator
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen"} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> checkpoints_dir = tempfile.mkdtemp() >>> model = H2OAggregatorEstimator(target_num_exemplars=500, ... rel_tol_num_exemplars=0.3, ... export_checkpoints_dir=checkpoints_dir) >>> model.train(training_frame=df) >>> new_df = model.aggregated_frame >>> new_df >>> len(listdir(checkpoints_dir))
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> params = {"ignore_const_cols": False, ... "target_num_exemplars": 500, ... "rel_tol_num_exemplars": 0.3, ... "transform": "standardize", ... "categorical_encoding": "eigen"} >>> model = H2OAggregatorEstimator(**params) >>> model.train(training_frame=df) >>> new_df = model.aggregated_frame >>> new_df
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
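- Examples
A minimal, illustrative sketch (not part of the original reference): columns can be excluded by name through train()'s ignored_columns argument instead of listing predictors in x.
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> agg = H2OAggregatorEstimator(target_num_exemplars=500,
...                              rel_tol_num_exemplars=0.3,
...                              categorical_encoding="eigen")
>>> agg.train(training_frame=df,
...           ignored_columns=["Year", "Month"])
>>> agg.aggregated_frame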
-
property
num_iteration_without_new_exemplar
¶ The number of iterations to run before the aggregator exits if the number of collected exemplars hasn't changed
Type:
int
(default:500
).- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen", ... "num_iteration_without_new_exemplar": 400} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
rel_tol_num_exemplars
¶ Relative tolerance for the number of exemplars (e.g., 0.5 is +/- 50 percent)
Type:
float
(default:0.5
).- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen", ... "num_iteration_without_new_exemplar": 400} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
response_column
¶ Response variable column.
Type:
str
.
-
property
save_mapping_frame
¶ Whether to export the mapping of the aggregated frame
Type:
bool
(default:False
).- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen", ... "save_mapping_frame": True} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
target_num_exemplars
¶ Targeted number of exemplars
Type:
int
(default:5000
).- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen", ... "num_iteration_without_new_exemplar": 400} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> df = h2o.create_frame(rows=10000, ... cols=10, ... categorical_fraction=0.6, ... integer_fraction=0, ... binary_fraction=0, ... real_range=100, ... integer_range=100, ... missing_fraction=0, ... factors=100, ... seed=1234) >>> params = {"target_num_exemplars": 1000, ... "rel_tol_num_exemplars": 0.5, ... "categorical_encoding": "eigen", ... "num_iteration_without_new_exemplar": 400} >>> agg = H2OAggregatorEstimator(**params) >>> agg.train(training_frame=df) >>> new_df = agg.aggregated_frame >>> new_df
-
property
transform
¶ Transformation of training data
One of:
"none"
,"standardize"
,"normalize"
,"demean"
,"descale"
(default:"normalize"
).- Examples
>>> df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> params = {"ignore_const_cols": False, ... "target_num_exemplars": 500, ... "rel_tol_num_exemplars": 0.3, ... "transform": "standardize", ... "categorical_encoding": "eigen"} >>> model = H2OAggregatorEstimator(**params) >>> model.train(training_frame=df) >>> new_df = model.aggregated_frame
H2OAutoEncoderEstimator
¶
-
class
h2o.estimators.deeplearning.
H2OAutoEncoderEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.deeplearning.H2ODeepLearningEstimator
- Examples
>>> import h2o as ml >>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator >>> ml.init() >>> rows = [[1,2,3,4,0]*50, [2,1,2,4,1]*50, [2,1,4,2,1]*50, [0,1,2,34,1]*50, [2,3,4,1,0]*50] >>> fr = ml.H2OFrame(rows) >>> fr[4] = fr[4].asfactor() >>> model = H2OAutoEncoderEstimator() >>> model.train(x=range(4), training_frame=fr)
H2OGenericEstimator
¶
-
class
h2o.estimators.generic.
H2OGenericEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Import MOJO Model
-
static
from_file
(file=<class 'str'>)[source]¶ Creates a new Generic model by loading an existing embedded model into the library, e.g. from an H2O MOJO. The imported model must be supported by H2O.
- Parameters
file – A string containing the path to the file to create the model from
- Returns
H2OGenericEstimator instance representing the generic model
- Examples
>>> from h2o.estimators import H2OIsolationForestEstimator, H2OGenericEstimator >>> import tempfile >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv") >>> ifr = H2OIsolationForestEstimator(ntrees=1) >>> ifr.train(x=["Origin","Dest"], y="Distance", training_frame=airlines) >>> original_model_filename = tempfile.mkdtemp() >>> original_model_filename = ifr.download_mojo(original_model_filename) >>> model = H2OGenericEstimator.from_file(original_model_filename) >>> model.model_performance()
-
property
model_key
¶ Key to the self-contained model archive already uploaded to H2O.
Type:
H2OFrame
.- Examples
>>> from h2o.estimators import H2OGenericEstimator, H2OXGBoostEstimator >>> import tempfile >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv") >>> y = "IsDepDelayed" >>> x = ["fYear","fMonth","Origin","Dest","Distance"] >>> xgb = H2OXGBoostEstimator(ntrees=1, nfolds=3) >>> xgb.train(x=x, y=y, training_frame=airlines) >>> original_model_filename = tempfile.mkdtemp() >>> original_model_filename = xgb.download_mojo(original_model_filename) >>> key = h2o.lazy_import(original_model_filename) >>> fr = h2o.get_frame(key[0]) >>> model = H2OGenericEstimator(model_key=fr) >>> model.train() >>> model.auc()
-
property
path
¶ Path to file with self-contained model archive.
Type:
str
.- Examples
>>> from h2o.estimators import H2OIsolationForestEstimator, H2OGenericEstimator >>> import tempfile >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/airlines_train.csv") >>> ifr = H2OIsolationForestEstimator(ntrees=1) >>> ifr.train(x=["Origin","Dest"], y="Distance", training_frame=airlines) >>> generic_mojo_filename = tempfile.mkdtemp("zip","genericMojo") >>> generic_mojo_filename = ifr.download_mojo(path=generic_mojo_filename) >>> model = H2OGenericEstimator.from_file(generic_mojo_filename) >>> model.model_performance()
H2OGeneralizedLowRankEstimator
¶
-
class
h2o.estimators.glrm.
H2OGeneralizedLowRankEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Generalized Low Rank Modeling
Builds a generalized low rank model of an H2O dataset.
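- Examples
A minimal, illustrative sketch (not part of the original reference): it fits a rank-3 model to the iris frame and inspects the fitted Y archetypes with archetypes().
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3,
...                                            transform="standardize",
...                                            seed=1234)
>>> iris_glrm.train(x=iris.names, training_frame=iris)
>>> iris_glrm.archetypes()  # Y matrix: one archetype per row
>>> iris_glrm.show()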
-
property
expand_user_y
¶ Expand categorical columns in user-specified initial Y
Type:
bool
(default:True
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> rank = 3 >>> gx = 0.5 >>> gy = 0.5 >>> trans = "standardize" >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank, ... loss="Quadratic", ... gamma_x=gx, ... gamma_y=gy, ... transform=trans, ... expand_user_y=False) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> checkpoints_dir = tempfile.mkdtemp() >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, ... export_checkpoints_dir=checkpoints_dir, ... seed=1234) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> len(listdir(checkpoints_dir))
-
property
gamma_x
¶ Regularization weight on X matrix
Type:
float
(default:0
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> rank = 3 >>> gx = 0.5 >>> gy = 0.5 >>> trans = "standardize" >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank, ... loss="Quadratic", ... gamma_x=gx, ... gamma_y=gy, ... transform=trans) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
gamma_y
¶ Regularization weight on Y matrix
Type:
float
(default:0
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> rank = 3 >>> gx = 0.5 >>> gy = 0.5 >>> trans = "standardize" >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank, ... loss="Quadratic", ... gamma_x=gx, ... gamma_y=gy, ... transform=trans) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, ... ignore_const_cols=False, ... seed=1234) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
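- Examples
A minimal, illustrative sketch (not part of the original reference): the class label of the iris frame is excluded by name instead of restricting x.
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, seed=1234)
>>> iris_glrm.train(training_frame=iris,
...                 ignored_columns=["class"])
>>> iris_glrm.show()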
-
property
impute_original
¶ Reconstruct original training data by reversing transform
Type:
bool
(default:False
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> rank = 3 >>> gx = 0.5 >>> gy = 0.5 >>> trans = "standardize" >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=rank, ... loss="Quadratic", ... gamma_x=gx, ... gamma_y=gy, ... transform=trans, ... impute_original=True) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
init
¶ Initialization mode
One of:
"random"
,"svd"
,"plus_plus"
,"user"
(default:"plus_plus"
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, ... init="svd", ... seed=1234) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
init_step_size
¶ Initial step size
Type:
float
(default:1
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, ... init_step_size=2.5, ... seed=1234) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
k
¶ Rank of matrix approximation
Type:
int
(default:1
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3) >>> iris_glrm.train(x=iris.names, training_frame=iris) >>> iris_glrm.show()
-
property
loading_name
¶ Frame key used to save the resulting X matrix
Type:
str
.- Examples
>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip") >>> acs_fill = acs.drop("ZCTA5") >>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10, ... transform="standardize", ... loss="quadratic", ... regularization_x="quadratic", ... regularization_y="L1", ... gamma_x=0.25, ... gamma_y=0.5, ... max_iterations=1, ... loading_name="acs_full") >>> acs_glrm.train(x=acs_fill.names, training_frame=acs) >>> acs_glrm.loading_name >>> acs_glrm.show()
-
property
loss
¶ Numeric loss function
One of:
"quadratic"
,"absolute"
,"huber"
,"poisson"
,"hinge"
,"logistic"
,"periodic"
(default:"quadratic"
).- Examples
>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip") >>> acs_fill = acs.drop("ZCTA5") >>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10, ... transform="standardize", ... loss="absolute", ... regularization_x="quadratic", ... regularization_y="L1", ... gamma_x=0.25, ... gamma_y=0.5, ... max_iterations=700) >>> acs_glrm.train(x=acs_fill.names, training_frame=acs) >>> acs_glrm.show()
-
property
loss_by_col
¶ Loss function by column (override)
Type:
List[Enum["quadratic", "absolute", "huber", "poisson", "hinge", "logistic", "periodic", "categorical", "ordinal"]]
.- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... loss_by_col=["absolute","huber"], ... loss_by_col_idx=[0,3], ... regularization_x="quadratic", ... regularization_y="l1") >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
loss_by_col_idx
¶ Loss function by column index (override)
Type:
List[int]
.- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... loss_by_col=["absolute","huber"], ... loss_by_col_idx=[0,3], ... regularization_x="quadratic", ... regularization_y="l1") >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
max_iterations
¶ Maximum number of iterations
Type:
int
(default:1000
).- Examples
>>> acs = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/census/ACS_13_5YR_DP02_cleaned.zip") >>> acs_fill = acs.drop("ZCTA5") >>> acs_glrm = H2OGeneralizedLowRankEstimator(k=10, ... transform="standardize", ... loss="quadratic", ... regularization_x="quadratic", ... regularization_y="L1", ... gamma_x=0.25, ... gamma_y=0.5, ... max_iterations=700) >>> acs_glrm.train(x=acs_fill.names, training_frame=acs) >>> acs_glrm.show()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... max_runtime_secs=15, ... max_iterations=500, ... max_updates=900, ... min_step_size=0.005) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
max_updates
¶ Maximum number of updates, defaults to 2*max_iterations
Type:
int
(default:2000
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... max_runtime_secs=15, ... max_iterations=500, ... max_updates=900, ... min_step_size=0.005) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
min_step_size
¶ Minimum step size
Type:
float
(default:0.0001
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... max_runtime_secs=15, ... max_iterations=500, ... max_updates=900, ... min_step_size=0.005) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
multi_loss
¶ Categorical loss function
One of:
"categorical"
,"ordinal"
(default:"categorical"
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... loss_by_col=["absolute","huber"], ... loss_by_col_idx=[0,3], ... regularization_x="quadratic", ... regularization_y="l1", ... multi_loss="ordinal") >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
period
¶ Length of period (only used with periodic loss function)
Type:
int
(default:1
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... max_runtime_secs=15, ... max_iterations=500, ... max_updates=900, ... min_step_size=0.005, ... period=5) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
recover_svd
¶ Recover singular values and eigenvectors of XY
Type:
bool
(default:False
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> loss_all = ["Hinge", "Quadratic", "Categorical", "Categorical", ... "Hinge", "Quadratic", "Quadratic", "Quadratic"] >>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5, ... loss_by_col=loss_all, ... recover_svd=True, ... transform="standardize", ... seed=12345) >>> pros_glrm.train(x=prostate.names, training_frame=prostate) >>> pros_glrm.show()
-
property
regularization_x
¶ Regularization function for X matrix
One of:
"none"
,"quadratic"
,"l2"
,"l1"
,"non_negative"
,"one_sparse"
,"unit_one_sparse"
,"simplex"
(default:"none"
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... loss_by_col=["absolute","huber"], ... loss_by_col_idx=[0,3], ... regularization_x="quadratic", ... regularization_y="l1") >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
regularization_y
¶ Regularization function for Y matrix
One of:
"none"
,"quadratic"
,"l2"
,"l1"
,"non_negative"
,"one_sparse"
,"unit_one_sparse"
,"simplex"
(default:"none"
).- Examples
>>> arrestsH2O = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/pca_test/USArrests.csv") >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... loss_by_col=["absolute","huber"], ... loss_by_col_idx=[0,3], ... regularization_x="quadratic", ... regularization_y="l1") >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> loss_all = ["Hinge", "Quadratic", "Categorical", "Categorical", ... "Hinge", "Quadratic", "Quadratic", "Quadratic"] >>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5, ... loss_by_col=loss_all, ... score_each_iteration=True, ... transform="standardize", ... seed=12345) >>> pros_glrm.train(x=prostate.names, training_frame=prostate) >>> pros_glrm.show()
-
property
seed
¶ RNG seed for initialization
Type:
int
(default:-1
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> glrm_w_seed = H2OGeneralizedLowRankEstimator(k=5, seed=12345) >>> glrm_w_seed.train(x=prostate.names, training_frame=prostate) >>> glrm_wo_seed = H2OGeneralizedLowRankEstimator(k=5) >>> glrm_wo_seed.train(x=prostate.names, training_frame=prostate) >>> glrm_w_seed.show() >>> glrm_wo_seed.show()
-
property
svd_method
¶ Method for computing SVD during initialization (Caution: Randomized is currently experimental and unstable)
One of:
"gram_s_v_d"
,"power"
,"randomized"
(default:"randomized"
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5, ... svd_method="power", ... seed=1234) >>> pros_glrm.train(x=prostate.names, training_frame=prostate) >>> pros_glrm.show()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5, ... seed=1234) >>> pros_glrm.train(x=prostate.names, training_frame=prostate) >>> pros_glrm.show()
-
property
transform
¶ Transformation of training data
One of:
"none"
,"standardize"
,"normalize"
,"demean"
,"descale"
(default:"none"
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate_cat.csv") >>> prostate[0] = prostate[0].asnumeric() >>> prostate[4] = prostate[4].asnumeric() >>> pros_glrm = H2OGeneralizedLowRankEstimator(k=5, ... score_each_iteration=True, ... transform="standardize", ... seed=12345) >>> pros_glrm.train(x=prostate.names, training_frame=prostate) >>> pros_glrm.show()
-
property
user_x
¶ User-specified initial X
Type:
H2OFrame
.- Examples
>>> arrestsH2O = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> initial_x = ([[5.412, 65.24, -7.54, -0.032, 2.212, 92.24, -17.54, 23.268, 0.312, ... 123.24, 14.46, 9.768, 1.012, 19.24, -15.54, -1.732, 5.412, 65.24, ... -7.54, -0.032, 2.212, 92.24, -17.54, 23.268, 0.312, 123.24, 14.46, ... 9.76, 1.012, 19.24, -15.54, -1.732, 5.412, 65.24, -7.54, -0.032, ... 2.212, 92.24, -17.54, 23.268, 0.312, 123.24, 14.46, 9.768, 1.012, ... 19.24, -15.54, -1.732, 5.412, 65.24]]*4) >>> initial_x_h2o = h2o.H2OFrame(list(zip(*initial_x))) >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=4, ... transform="demean", ... loss="quadratic", ... gamma_x=0.5, ... gamma_y=0.3, ... init="user", ... user_x=initial_x_h2o, ... recover_svd=True) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
user_y
¶ User-specified initial Y
Type:
H2OFrame
.- Examples
>>> arrestsH2O = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> initial_y = [[5.412, 65.24, -7.54, -0.032], ... [2.212, 92.24, -17.54, 23.268], ... [0.312, 123.24, 14.46, 9.768], ... [1.012, 19.24, -15.54, -1.732]] >>> initial_y_h2o = h2o.H2OFrame(list(zip(*initial_y))) >>> arrests_glrm = H2OGeneralizedLowRankEstimator(k=4, ... transform="demean", ... loss="quadratic", ... gamma_x=0.5, ... gamma_y=0.3, ... init="user", ... user_y=initial_y_h2o, ... recover_svd=True) >>> arrests_glrm.train(x=arrestsH2O.names, training_frame=arrestsH2O) >>> arrests_glrm.show()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> iris = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv") >>> iris_glrm = H2OGeneralizedLowRankEstimator(k=3, ... loss="quadratic", ... gamma_x=0.5, ... gamma_y=0.5, ... transform="standardize") >>> iris_glrm.train(x=iris.names, ... training_frame=iris, ... validation_frame=iris) >>> iris_glrm.show()
H2OIsolationForestEstimator
¶
-
class
h2o.estimators.isolation_forest.
H2OIsolationForestEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Isolation Forest
Builds an Isolation Forest model. The Isolation Forest algorithm samples the training frame and, in each iteration, builds a tree that partitions the space of the sampled observations until it isolates each observation. The length of the path from the root to a leaf node of the resulting tree is used to calculate the anomaly score. Anomalies are easier to isolate, so their average tree path is expected to be shorter than the paths of regular observations.
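- Examples
A minimal, illustrative sketch (not part of the original reference) of how the anomaly score is consumed after training: predict() on an Isolation Forest model returns the normalized anomaly score and the mean path length for each row.
>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv")
>>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_test.csv")
>>> isofor = H2OIsolationForestEstimator(ntrees=100, seed=1234)
>>> isofor.train(training_frame=train)
>>> scores = isofor.predict(test)  # columns: 'predict' (anomaly score), 'mean_length'
>>> scores.head()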
-
property
build_tree_one_node
¶ Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(build_tree_one_node=True, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> encoding = "one_hot_explicit" >>> airlines_if = H2OIsolationForestEstimator(categorical_encoding=encoding, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
col_sample_rate_change_per_level
¶ Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(col_sample_rate_change_per_level=.9, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
col_sample_rate_per_tree
¶ Column sample rate per tree (from 0.0 to 1.0)
Type:
float
(default:1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(col_sample_rate_per_tree=.7, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> checkpoints_dir = tempfile.mkdtemp() >>> air_if = H2OIsolationForestEstimator(max_depth=3, ... seed=1234, ... export_checkpoints_dir=checkpoints_dir) >>> air_if.train(x=predictors, ... training_frame=airlines) >>> len(listdir(checkpoints_dir))
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_if = H2OIsolationForestEstimator(seed=1234, ... ignore_const_cols=True) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
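- Examples
A minimal, illustrative sketch (not part of the original reference): two columns of the cars frame are excluded by name instead of listing predictors in x.
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> cars_if = H2OIsolationForestEstimator(seed=1234)
>>> cars_if.train(training_frame=cars,
...               ignored_columns=["name", "economy_20mpg"])
>>> cars_if.model_performance()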
-
property
max_depth
¶ Maximum tree depth.
Type:
int
(default:8
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(max_depth=2, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(max_runtime_secs=10, ... ntrees=10000, ... max_depth=10, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
min_rows
¶ Fewest allowed (weighted) observations in a leaf.
Type:
float
(default:1
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(min_rows=16, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
mtries
¶ Number of variables randomly sampled as candidates at each split. If set to -1, defaults to (number of predictors)/3.
Type:
int
(default:-1
).- Examples
>>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data") >>> predictors = covtype.columns[0:54] >>> cov_if = H2OIsolationForestEstimator(mtries=30, seed=1234) >>> cov_if.train(x=predictors, ... training_frame=covtype) >>> cov_if.model_performance()
-
property
ntrees
¶ Number of trees.
Type:
int
(default:50
).- Examples
>>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> predictors = titanic.columns >>> tree_num = [20, 50, 80, 110, 140, 170, 200] >>> label = ["20", "50", "80", "110", "140", "170", "200"] >>> for key, num in enumerate(tree_num): ... titanic_if = H2OIsolationForestEstimator(ntrees=num, ... seed=1234) ... titanic_if.train(x=predictors, ... training_frame=titanic) ... print(label[key], 'training score', titanic_if.mse(train=True))
-
property
sample_rate
¶ Rate of randomly sampled observations used to train each Isolation Forest tree. Needs to be in the range 0.0 to 1.0. If set to -1, sample_rate is disabled and sample_size will be used instead.
Type:
float
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(sample_rate=.7, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
sample_size
¶ Number of randomly sampled observations used to train each Isolation Forest tree. Only one of parameters sample_size and sample_rate should be defined. If sample_rate is defined, sample_size will be ignored.
Type:
int
(default:256
).- Examples
>>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_train.csv") >>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/anomaly/ecg_discord_test.csv") >>> isofor_model = H2OIsolationForestEstimator(sample_size=5, ... ntrees=7) >>> isofor_model.train(training_frame=train) >>> isofor_model.model_performance() >>> isofor_model.model_performance(test)
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(score_each_iteration=True, ... ntrees=55, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
score_tree_interval
¶ Score the model after every so many trees. Disabled if set to 0.
Type:
int
(default:0
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(score_tree_interval=5, ... seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
-
property
seed
¶ Seed for pseudo random number generator (if applicable)
Type:
int
(default:-1
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> isofor_w_seed = H2OIsolationForestEstimator(seed=1234) >>> isofor_w_seed.train(x=predictors, ... training_frame=airlines) >>> isofor_wo_seed = H2OIsolationForestEstimator() >>> isofor_wo_seed.train(x=predictors, ... training_frame=airlines) >>> isofor_w_seed.model_performance() >>> isofor_wo_seed.model_performance()
-
property
stopping_metric
¶ Metric to use for early stopping (AUTO: logloss for classification, deviance for regression, and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client.
One of:
"auto"
,"anomaly_score"
(default:"auto"
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
stopping_rounds
¶ Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
Type:
int
(default:0
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
stopping_tolerance
¶ Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Type:
float
(default:0.01
).- Examples
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier", ... "DayOfWeek", "Month", "Distance", "FlightNum"] >>> airlines_if = H2OIsolationForestEstimator(stopping_metric="auto", ... stopping_rounds=3, ... stopping_tolerance=1e-2, ... seed=1234) >>> airlines_if.train(x=predictors, ... training_frame=airlines) >>> airlines_if.model_performance()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars_if = H2OIsolationForestEstimator(seed=1234) >>> cars_if.train(x=predictors, ... training_frame=cars) >>> cars_if.model_performance()
H2OKMeansEstimator
¶
-
class
h2o.estimators.kmeans.
H2OKMeansEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
K-means
Performs k-means clustering on an H2O dataset.
-
property
categorical_encoding
¶ Encoding scheme for categorical features
One of:
"auto"
,"enum"
,"one_hot_internal"
,"one_hot_explicit"
,"binary"
,"eigen"
,"label_encoder"
,"sort_by_response"
,"enum_limited"
(default:"auto"
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"] >>> train, valid = prostate.split_frame(ratios=[.8], seed=1234) >>> encoding = "one_hot_explicit" >>> pros_km = H2OKMeansEstimator(categorical_encoding=encoding, ... seed=1234) >>> pros_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_km.scoring_history()
-
property
cluster_size_constraints
¶ An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters.
Type:
List[int]
.- Examples
>>> iris_h2o = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv") >>> k=3 >>> start_points = h2o.H2OFrame( ... [[4.9, 3.0, 1.4, 0.2], ... [5.6, 2.5, 3.9, 1.1], ... [6.5, 3.0, 5.2, 2.0]]) >>> kmm = H2OKMeansEstimator(k=k, ... user_points=start_points, ... standardize=True, ... cluster_size_constraints=[2, 5, 8], ... score_each_iteration=True) >>> kmm.train(x=list(range(4)), training_frame=iris_h2o) >>> kmm.scoring_history()
-
property
estimate_k
¶ Whether to estimate the number of clusters (<=k) iteratively and deterministically.
Type:
bool
(default:False
).- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris['class'] = iris['class'].asfactor() >>> predictors = iris.columns[:-1] >>> train, valid = iris.split_frame(ratios=[.8], seed=1234) >>> iris_kmeans = H2OKMeansEstimator(k=10, ... estimate_k=True, ... standardize=False, ... seed=1234) >>> iris_kmeans.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> iris_kmeans.scoring_history()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> airlines = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip", destination_frame="air.hex") >>> predictors = ["DayofMonth", "DayOfWeek"] >>> checkpoints_dir = tempfile.mkdtemp() >>> air_km = H2OKMeansEstimator(export_checkpoints_dir=checkpoints_dir, ... seed=1234) >>> air_km.train(x=predictors, training_frame=airlines) >>> len(listdir(checkpoints_dir))
-
property
fold_assignment
¶ Cross-validation fold assignment scheme, if fold_column is not specified. The ‘Stratified’ option will stratify the folds based on the response variable, for classification problems.
One of:
"auto"
,"random"
,"modulo"
,"stratified"
(default:"auto"
).- Examples
>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv") >>> predictors = ["radiation","temperature","wind"] >>> train, valid = ozone.split_frame(ratios=[.8], seed=1234) >>> ozone_km = H2OKMeansEstimator(fold_assignment="Random", ... nfolds=5, ... seed=1234) >>> ozone_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> ozone_km.scoring_history()
-
property
fold_column
¶ Column with cross-validation fold index assignment per observation.
Type:
str
.- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234) >>> fold_numbers.set_names(["fold_numbers"]) >>> cars = cars.cbind(fold_numbers) >>> print(cars['fold_numbers']) >>> cars_km = H2OKMeansEstimator(seed=1234) >>> cars_km.train(x=predictors, ... training_frame=cars, ... fold_column="fold_numbers") >>> cars_km.scoring_history()
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv") >>> predictors = ["displacement","power","weight","acceleration","year"] >>> cars["const_1"] = 6 >>> cars["const_2"] = 7 >>> train, valid = cars.split_frame(ratios=[.8], seed=1234) >>> cars_km = H2OKMeansEstimator(ignore_const_cols=True, ... seed=1234) >>> cars_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> cars_km.scoring_history()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
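- Examples
A minimal sketch (the dataset and the ignored column name are illustrative, chosen to mirror the neighboring examples):
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> pros_km = H2OKMeansEstimator(k=3,
...                              ignored_columns=["ID"],
...                              seed=1234)
>>> pros_km.train(training_frame=prostate)
>>> pros_km.scoring_history()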
-
property
init
¶ Initialization mode
One of:
"random"
,"plus_plus"
,"furthest"
,"user"
(default:"furthest"
).- Examples
>>> seeds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt") >>> predictors = seeds.columns[0:7] >>> train, valid = seeds.split_frame(ratios=[.8], seed=1234) >>> seeds_km = H2OKMeansEstimator(k=3, ... init='Furthest', ... seed=1234) >>> seeds_km.train(x=predictors, ... training_frame=train, ... validation_frame= valid) >>> seeds_km.scoring_history()
-
property
k
¶ The max. number of clusters. If estimate_k is disabled, the model will find k centroids, otherwise it will find up to k centroids.
Type:
int
(default:1
).- Examples
>>> seeds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt") >>> predictors = seeds.columns[0:7] >>> train, valid = seeds.split_frame(ratios=[.8], seed=1234) >>> seeds_km = H2OKMeansEstimator(k=3, seed=1234) >>> seeds_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> seeds_km.scoring_history()
-
property
keep_cross_validation_fold_assignment
¶ Whether to keep the cross-validation fold assignment.
Type:
bool
(default:False
).- Examples
>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv") >>> predictors = ["radiation","temperature","wind"] >>> train, valid = ozone.split_frame(ratios=[.8], seed=1234) >>> ozone_km = H2OKMeansEstimator(keep_cross_validation_fold_assignment=True, ... nfolds=5, ... seed=1234) >>> ozone_km.train(x=predictors, ... training_frame=train) >>> ozone_km.scoring_history()
-
property
keep_cross_validation_models
¶ Whether to keep the cross-validation models.
Type:
bool
(default:True
).- Examples
>>> ozone = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/ozone.csv") >>> predictors = ["radiation","temperature","wind"] >>> train, valid = ozone.split_frame(ratios=[.8], seed=1234) >>> ozone_km = H2OKMeansEstimator(keep_cross_validation_models=True, ... nfolds=5, ... seed=1234) >>> ozone_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> ozone_km.scoring_history()
-
property
keep_cross_validation_predictions
¶ Whether to keep the predictions of the cross-validation models.
Type:
bool
(default:False
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", ... "PSA", "VOL", "GLEASON"] >>> train, valid = prostate.split_frame(ratios=[.8], seed=1234) >>> pros_km = H2OKMeansEstimator(keep_cross_validation_predictions=True, ... nfolds=5, ... seed=1234) >>> pros_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_km.scoring_history()
-
property
max_iterations
¶ Maximum training iterations (if estimate_k is enabled, then this is for each inner Lloyd's iteration)
Type:
int
(default:10
).- Examples
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK", ... "AGP1","AGMN","LIV","AGLP"] >>> train, valid = benign.split_frame(ratios=[.8], seed=1234) >>> benign_km = H2OKMeansEstimator(max_iterations=50) >>> benign_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> benign_km.scoring_history()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK", ... "AGP1","AGMN","LIV","AGLP"] >>> train, valid = benign.split_frame(ratios=[.8], seed=1234) >>> benign_km = H2OKMeansEstimator(max_runtime_secs=10, ... seed=1234) >>> benign_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> benign_km.scoring_history()
-
property
nfolds
¶ Number of folds for K-fold cross-validation (0 to disable or >= 2).
Type:
int
(default:0
).- Examples
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK", ... "AGP1","AGMN","LIV","AGLP"] >>> train, valid = benign.split_frame(ratios=[.8], seed=1234) >>> benign_km = H2OKMeansEstimator(nfolds=5, seed=1234) >>> benign_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> benign_km.scoring_history()
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> predictors = ["AGMT","FNDX","HIGD","DEG","CHK", ... "AGP1","AGMN","LIV","AGLP"] >>> train, valid = benign.split_frame(ratios=[.8], seed=1234) >>> benign_km = H2OKMeansEstimator(score_each_iteration=True, ... seed=1234) >>> benign_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> benign_km.scoring_history()
-
property
seed
¶ RNG Seed
Type:
int
(default:-1
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"] >>> train, valid = prostate.split_frame(ratios=[.8], seed=1234) >>> pros_w_seed = H2OKMeansEstimator(seed=1234) >>> pros_w_seed.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_wo_seed = H2OKMeansEstimator() >>> pros_wo_seed.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_w_seed.scoring_history() >>> pros_wo_seed.scoring_history()
-
property
standardize
¶ Standardize columns before computing distances
Type:
bool
(default:True
).- Examples
>>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv") >>> predictors = boston.columns[:-1] >>> boston['chas'] = boston['chas'].asfactor() >>> train, valid = boston.split_frame(ratios=[.8]) >>> boston_km = H2OKMeansEstimator(standardize=True) >>> boston_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> boston_km.scoring_history()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", ... "PSA", "VOL", "GLEASON"] >>> train, valid = prostate.split_frame(ratios=[.8], seed=1234) >>> pros_km = H2OKMeansEstimator(seed=1234) >>> pros_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_km.scoring_history()
-
property
user_points
¶ This option allows you to specify a dataframe in which each row represents an initial cluster center. The user-specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters.
Type:
H2OFrame
.- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> iris['class'] = iris['class'].asfactor() >>> predictors = iris.columns[:-1] >>> train, valid = iris.split_frame(ratios=[.8], seed=1234) >>> point1 = [4.9,3.0,1.4,0.2] >>> point2 = [5.6,2.5,3.9,1.1] >>> point3 = [6.5,3.0,5.2,2.0] >>> points = h2o.H2OFrame([point1, point2, point3]) >>> iris_km = H2OKMeansEstimator(k=3, ... user_points=points, ... seed=1234) >>> iris_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> iris_km.tot_withinss(valid=True)
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv") >>> predictors = ["AGE", "RACE", "DPROS", "DCAPS", ... "PSA", "VOL", "GLEASON"] >>> train, valid = prostate.split_frame(ratios=[.8], seed=1234) >>> pros_km = H2OKMeansEstimator(seed=1234) >>> pros_km.train(x=predictors, ... training_frame=train, ... validation_frame=valid) >>> pros_km.scoring_history()
H2OPrincipalComponentAnalysisEstimator
¶
-
class
h2o.estimators.pca.
H2OPrincipalComponentAnalysisEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Principal Components Analysis
-
property
compute_metrics
¶ Whether to compute metrics on the training data
Type:
bool
(default:True
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> pros_pca = H2OPrincipalComponentAnalysisEstimator(compute_metrics=False) >>> pros_pca.train(x=prostate.names, training_frame=prostate) >>> pros_pca.show()
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> checkpoints_dir = tempfile.mkdtemp() >>> pros_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True, ... export_checkpoints_dir=checkpoints_dir) >>> pros_pca.train(x=prostate.names, training_frame=prostate) >>> len(listdir(checkpoints_dir))
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> pros_pca = H2OPrincipalComponentAnalysisEstimator(ignore_const_cols=False) >>> pros_pca.train(x=prostate.names, training_frame=prostate) >>> pros_pca.show()
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
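- Examples
A minimal sketch (the dataset and the ignored column name are illustrative):
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> pros_pca = H2OPrincipalComponentAnalysisEstimator(k=3,
...                                                   ignored_columns=["ID"],
...                                                   impute_missing=True)
>>> pros_pca.train(training_frame=prostate)
>>> pros_pca.show()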
-
property
impute_missing
¶ Whether to impute missing entries with the column mean
Type:
bool
(default:False
).- Examples
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate['CAPSULE'] = prostate['CAPSULE'].asfactor() >>> prostate['RACE'] = prostate['RACE'].asfactor() >>> prostate['DCAPS'] = prostate['DCAPS'].asfactor() >>> prostate['DPROS'] = prostate['DPROS'].asfactor() >>> pros_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True) >>> pros_pca.train(x=prostate.names, training_frame=prostate) >>> pros_pca.show()
-
init_for_pipeline
()[source]¶ Returns an H2OPCA object that implements the fit and transform methods needed for proper use in a sklearn.Pipeline. All parameters defined in self.__params should be input parameters of the H2OPCA.__init__ method.
- Returns
H2OPCA object
- Examples
>>> from sklearn.pipeline import Pipeline >>> from h2o.transforms.preprocessing import H2OScaler >>> from h2o.estimators import H2ORandomForestEstimator >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv") >>> pipe = Pipeline([("standardize", H2OScaler()), ... ("pca", H2OPrincipalComponentAnalysisEstimator(k=2).init_for_pipeline()), ... ("rf", H2ORandomForestEstimator(seed=42,ntrees=5))]) >>> pipe.fit(iris[:4], iris[4])
-
property
k
¶ Rank of matrix approximation
Type:
int
(default:1
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1, ... transform="standardize", ... pca_method="power", ... impute_missing=True, ... max_iterations=800) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
max_iterations
¶ Maximum training iterations
Type:
int
(default:1000
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1, ... transform="standardize", ... pca_method="power", ... impute_missing=True, ... max_iterations=800) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1, ... transform="standardize", ... pca_method="power", ... impute_missing=True, ... max_iterations=800, ... max_runtime_secs=15) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
pca_impl
¶ Specify the implementation to use for computing PCA (via SVD or EVD): MTJ_EVD_DENSEMATRIX - eigenvalue decompositions for dense matrix using MTJ; MTJ_EVD_SYMMMATRIX - eigenvalue decompositions for symmetric matrix using MTJ; MTJ_SVD_DENSEMATRIX - singular-value decompositions for dense matrix using MTJ; JAMA - eigenvalue decompositions for dense matrix using JAMA. References: JAMA - http://math.nist.gov/javanumerics/jama/; MTJ - https://github.com/fommil/matrix-toolkits-java/
One of:
"mtj_evd_densematrix"
,"mtj_evd_symmmatrix"
,"mtj_svd_densematrix"
,"jama"
.- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3, ... pca_impl="jama", ... impute_missing=True, ... max_iterations=1200) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
pca_method
¶ Specify the algorithm to use for computing the principal components: GramSVD - uses a distributed computation of the Gram matrix, followed by a local SVD; Power - computes the SVD using the power iteration method (experimental); Randomized - uses randomized subspace iteration method; GLRM - fits a generalized low-rank model with L2 loss function and no regularization and solves for the SVD using local matrix algebra (experimental)
One of:
"gram_s_v_d"
,"power"
,"randomized"
,"glrm"
(default:"gram_s_v_d"
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1, ... transform="standardize", ... pca_method="power", ... impute_missing=True, ... max_iterations=800) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3, ... score_each_iteration=True, ... seed=1234, ... impute_missing=True) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
seed
¶ RNG seed for initialization
Type:
int
(default:-1
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3, ... seed=1234, ... impute_missing=True) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator() >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
transform
¶ Transformation of training data
One of:
"none"
,"standardize"
,"normalize"
,"demean"
,"descale"
(default:"none"
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=-1, ... transform="standardize", ... pca_method="power", ... impute_missing=True, ... max_iterations=800) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
use_all_factor_levels
¶ Whether first factor level is included in each categorical expansion
Type:
bool
(default:False
).- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> data_pca = H2OPrincipalComponentAnalysisEstimator(k=3, ... use_all_factor_levels=True, ... seed=1234) >>> data_pca.train(x=data.names, training_frame=data) >>> data_pca.show()
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/SDSS_quasar.txt.zip") >>> train, valid = data.split_frame(ratios=[.8], seed=1234) >>> model_pca = H2OPrincipalComponentAnalysisEstimator(impute_missing=True) >>> model_pca.train(x=data.names, ... training_frame=train, ... validation_frame=valid) >>> model_pca.show()
Miscellaneous¶
automl
¶
-
h2o.automl.
get_automl
(project_name)[source]¶ Retrieve information about an AutoML instance.
- Parameters
project_name (str) – A string indicating the project_name of the automl instance to retrieve.
- Returns
A dictionary containing the project_name, leader model, leaderboard, event_log.
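- Examples
A minimal sketch (it assumes y and train are defined as in the H2OAutoML class example later in this section):
>>> aml = H2OAutoML(max_runtime_secs=30, project_name="my_aml")
>>> aml.train(y=y, training_frame=train)
>>> get_aml = h2o.automl.get_automl("my_aml")
>>> get_aml.leaderboard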
-
h2o.automl.
get_leaderboard
(aml, extra_columns=None)[source]¶ Retrieve the leaderboard from the AutoML instance. Unlike the default leaderboard attached to the AutoML instance, this one can return columns other than the metrics.
- Parameters
aml (H2OAutoML) – the instance for which to return the leaderboard.
extra_columns – a string or a list of strings specifying which optional columns should be added to the leaderboard. Defaults to None. Currently supported extensions are:
- ‘ALL’: adds all columns below.
- ‘training_time_ms’: column providing the training time of each model in milliseconds (doesn’t include the training of cross-validation models).
- ‘predict_time_per_row_ms’: column providing the average prediction time by the model for a single row.
- Returns
An H2OFrame representing the leaderboard.
- Examples
>>> aml = H2OAutoML(max_runtime_secs=30) >>> aml.train(y=y, training_frame=train) >>> lb_all = h2o.automl.get_leaderboard(aml, 'ALL') >>> lb_custom = h2o.automl.get_leaderboard(aml, ['predict_time_per_row_ms', 'training_time_ms']) >>> lb_custom_sorted = lb_custom.sort(by='predict_time_per_row_ms')
H2OAutoML
¶
-
class
h2o.automl.
H2OAutoML
(nfolds=5, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_runtime_secs=None, max_runtime_secs_per_model=None, max_models=None, stopping_metric='AUTO', stopping_tolerance=None, stopping_rounds=3, seed=None, project_name=None, exclude_algos=None, include_algos=None, exploitation_ratio=0, modeling_plan=None, monotone_constraints=None, algo_parameters=None, keep_cross_validation_predictions=False, keep_cross_validation_models=False, keep_cross_validation_fold_assignment=False, sort_metric='AUTO', export_checkpoints_dir=None, verbosity='warn')[source]¶ Bases:
h2o.automl._base.H2OAutoMLBaseMixin
,h2o.base.Keyed
Automatic Machine Learning
The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest (DRF), an Extremely-Randomized Forest (DRF/XRT), a random grid of Generalized Linear Models (GLM), a random grid of XGBoost models (XGBoost), a random grid of Gradient Boosting Machines (GBM), a random grid of Deep Neural Nets (DeepLearning), and 2 Stacked Ensembles: one of all the models, and one of only the best models of each kind.
- Examples
>>> import h2o >>> from h2o.automl import H2OAutoML >>> h2o.init() >>> # Import a sample binary outcome train/test set into H2O >>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv") >>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") >>> # Identify the response and set of predictors >>> y = "response" >>> x = list(train.columns) #if x is defined as all columns except the response, then x is not required >>> x.remove(y) >>> # For binary classification, response should be a factor >>> train[y] = train[y].asfactor() >>> test[y] = test[y].asfactor() >>> # Run AutoML for 30 seconds >>> aml = H2OAutoML(max_runtime_secs = 30) >>> aml.train(x = x, y = y, training_frame = train) >>> # Print Leaderboard (ranked by xval metrics) >>> aml.leaderboard >>> # (Optional) Evaluate performance on a test set >>> perf = aml.leader.model_performance(test) >>> perf.auc()
-
property
event_log
¶ Retrieve the backend event log from an H2OAutoML object
- Returns
an H2OFrame with the detailed events that occurred during the AutoML training.
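- Examples
A minimal sketch (it assumes y and train are defined as in the H2OAutoML class example above):
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> aml.train(y=y, training_frame=train)
>>> log = aml.event_log
>>> log.head()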
-
property
key
¶ - Returns
the unique key representing the object on the backend
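- Examples
A minimal sketch (it assumes y and train are defined as in the H2OAutoML class example above):
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> aml.train(y=y, training_frame=train)
>>> aml.key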
-
property
leader
¶ Retrieve the top model from an H2OAutoML object
- Returns
an H2O model
- Examples
>>> # Set up an H2OAutoML object >>> aml = H2OAutoML(max_runtime_secs=30) >>> # Launch an AutoML run >>> aml.train(y=y, training_frame=train) >>> # Get the best model in the AutoML Leaderboard >>> aml.leader >>> >>> # Get AutoML object by `project_name` >>> get_aml = h2o.automl.get_automl(aml.project_name) >>> # Get the best model in the AutoML Leaderboard >>> get_aml.leader
-
property
leaderboard
¶ Retrieve the leaderboard from an H2OAutoML object
- Returns
an H2OFrame with model ids in the first column and evaluation metric in the second column sorted by the evaluation metric
- Examples
>>> # Set up an H2OAutoML object >>> aml = H2OAutoML(max_runtime_secs=30) >>> # Launch an AutoML run >>> aml.train(y=y, training_frame=train) >>> # Get the AutoML Leaderboard >>> aml.leaderboard >>> >>> # Get AutoML object by `project_name` >>> get_aml = h2o.automl.get_automl(aml.project_name) >>> # Get the AutoML Leaderboard >>> get_aml.leaderboard
-
property
modeling_steps
¶ Expose the modeling steps effectively used by the AutoML run. This executed plan can be reinjected directly as the modeling_plan property of a new AutoML instance to improve reproducibility across AutoML versions.
- Returns
a list of dictionaries representing the effective modeling plan.
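- Examples
A minimal sketch of the reinjection workflow described above (it assumes y and train are defined as in the H2OAutoML class example above):
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> aml.train(y=y, training_frame=train)
>>> steps = aml.modeling_steps
>>> # Reuse the executed plan in a new run for reproducibility
>>> aml2 = H2OAutoML(max_runtime_secs=30, modeling_plan=steps)
>>> aml2.train(y=y, training_frame=train)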
-
predict
(test_data)[source]¶ Predict on a dataset.
- Parameters
test_data (H2OFrame) – Data on which to make predictions.
- Returns
A new H2OFrame of predictions.
- Examples
>>> # Set up an H2OAutoML object >>> aml = H2OAutoML(max_runtime_secs=30) >>> # Launch an H2OAutoML run >>> aml.train(y=y, training_frame=train) >>> # Predict with top model from AutoML Leaderboard on a H2OFrame called 'test' >>> aml.predict(test) >>> >>> # Get AutoML object by `project_name` >>> get_aml = h2o.automl.get_automl(aml.project_name) >>> # Predict with top model from AutoML Leaderboard on a H2OFrame called 'test' >>> get_aml.predict(test)
-
property
project_name
¶ Retrieve the project_name of this AutoML instance.
- Returns
a string containing the project_name
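- Examples
A minimal sketch (it assumes y and train are defined as in the H2OAutoML class example above):
>>> aml = H2OAutoML(max_runtime_secs=30, project_name="my_aml")
>>> aml.train(y=y, training_frame=train)
>>> aml.project_name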
-
train
(x=None, y=None, training_frame=None, fold_column=None, weights_column=None, validation_frame=None, leaderboard_frame=None, blending_frame=None)[source]¶ Begins an AutoML task, a background task that automatically builds a number of models with various algorithms and tracks their performance in a leaderboard. At any point in the process you may use H2O’s performance or prediction functions on the resulting models.
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
fold_column – The name or index of the column in training_frame that holds per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds per-row weights.
training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold_column or weights_column).
validation_frame – H2OFrame with validation data. This argument is ignored unless the user sets nfolds = 0. If cross-validation is turned off, then a validation frame can be specified and used for early stopping of individual models and early stopping of the grid searches. By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored.
leaderboard_frame – H2OFrame with test data for scoring the leaderboard. This is optional and if this is set to None (the default), then cross-validation metrics will be used to generate the leaderboard rankings instead.
blending_frame – H2OFrame used to train the metalearning algorithm in Stacked Ensembles (instead of relying on cross-validated predicted values). This is optional, but when provided, it is also recommended to disable cross-validation by setting nfolds=0 and to provide a leaderboard frame for scoring purposes.
- Returns
An H2OAutoML object.
- Examples
>>> # Set up an H2OAutoML object >>> aml = H2OAutoML(max_runtime_secs=30) >>> # Launch an AutoML run >>> aml.train(y=y, training_frame=train)
-
property
training_info
¶ Expose the name/value columns of event_log as a simple dictionary, for example start_epoch, stop_epoch, … See
event_log()
to obtain a description of those key/value pairs.- Returns
a dictionary with event_log[‘name’] column as keys and event_log[‘value’] column as values.
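- Examples
A minimal sketch (it assumes y and train are defined as in the H2OAutoML class example above; the start_epoch/stop_epoch keys are those mentioned in the description):
>>> aml = H2OAutoML(max_runtime_secs=30)
>>> aml.train(y=y, training_frame=train)
>>> info = aml.training_info
>>> (info['start_epoch'], info['stop_epoch'])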
H2OEstimator
¶
-
class
h2o.estimators.estimator_base.
H2OEstimator
(*args, **kwargs)[source]¶ Bases:
h2o.model.model_base.ModelBase
Base class for H2O Estimators.
H2O Estimators implement the following methods for model construction:
start()
- Top-level user-facing API for asynchronous model buildjoin()
- Top-level user-facing API for blocking on async model buildtrain()
- Top-level user-facing API for model building.fit()
- Used by scikit-learn.
Because H2OEstimator instances are instances of ModelBase, these objects can use the H2O model API.
-
convert_H2OXGBoostParams_2_XGBoostParams
()[source]¶ In order to use convert_H2OXGBoostParams_2_XGBoostParams and convert_H2OFrame_2_DMatrix, you must import the following toolboxes: xgboost, pandas, numpy and scipy.sparse.
Given an H2OXGBoost model, this method will generate the corresponding parameters that should be used by native XGBoost in order to give exactly the same result, assuming that the same dataset (derived from h2oFrame) is used to train the native XGBoost model.
Follow the steps below to compare H2OXGBoost and native XGBoost:
Train the H2OXGBoost model with H2OFrame trainFile and generate a prediction:
h2oModelD = H2OXGBoostEstimator(**h2oParamsD) # parameters specified as a dict()
h2oModelD.train(x=myX, y=y, training_frame=trainFile) # train with H2OFrame trainFile
h2oPredict = h2oModelD.predict(trainFile)
Derive the DMatrix from H2OFrame:
nativeDMatrix = trainFile.convert_H2OFrame_2_DMatrix(myX, y, h2oModelD)
Derive the parameters for native XGBoost:
nativeParams = h2oModelD.convert_H2OXGBoostParams_2_XGBoostParams()
Train your native XGBoost model and generate a prediction:
nativeModel = xgb.train(params=nativeParams[0], dtrain=nativeDMatrix, num_boost_round=nativeParams[1])
nativePredict = nativeModel.predict(data=nativeDMatrix, ntree_limit=nativeParams[1])
Compare the predictions h2oPredict from H2OXGBoost, nativePredict from native XGBoost.
- Returns
nativeParams, num_boost_round
-
fit
(X, y=None, **params)[source]¶ Fit an H2O model as part of a scikit-learn pipeline or grid search.
A warning will be issued if a caller other than sklearn attempts to use this method.
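- Examples
A minimal sketch mirroring the sklearn.Pipeline examples elsewhere in this document (the scaler, estimator, and dataset choices are illustrative):
>>> from sklearn.pipeline import Pipeline
>>> from h2o.transforms.preprocessing import H2OScaler
>>> from h2o.estimators import H2ORandomForestEstimator
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> pipe = Pipeline([("standardize", H2OScaler()),
...                  ("rf", H2ORandomForestEstimator(seed=42, ntrees=5))])
>>> pipe.fit(iris[:4], iris[4])  # sklearn dispatches to H2OEstimator.fit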
-
get_params
(deep=True)[source]¶ Obtain parameters for this estimator.
Used primarily for sklearn Pipelines and sklearn grid search.
- Parameters
deep – If True, return parameters of all sub-objects that are estimators.
- Returns
A dict of parameters
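- Examples
A minimal sketch (the estimator and its parameters are illustrative; the returned dict is assumed to be keyed by parameter name):
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(ntrees=10, max_depth=3)
>>> params = gbm.get_params()
>>> "ntrees" in params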
-
set_params
(**parms)[source]¶ Used by sklearn for updating parameters during grid search.
- Parameters
parms – A dictionary of parameters that will be set on this model.
- Returns
self, the current estimator object with the parameters all set as desired.
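- Examples
A minimal sketch (the estimator and its parameters are illustrative):
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> gbm = H2OGradientBoostingEstimator(ntrees=10, seed=42)
>>> gbm = gbm.set_params(ntrees=20)  # returns the updated estimator
>>> gbm.get_params()["ntrees"]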
-
start
(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]¶ Train the model asynchronously (to block for results call
join()
).- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
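- Examples
A minimal sketch of an asynchronous build followed by a blocking join() (the dataset and columns mirror other examples in this document):
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> cars_gbm = H2OGradientBoostingEstimator(seed=1234)
>>> cars_gbm.start(x=predictors, y="cylinders", training_frame=cars)
>>> cars_gbm.join()  # block until the background build finishes
>>> cars_gbm.model_performance()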
-
train
(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, model_id=None, verbose=False)[source]¶ Train the H2O model.
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
max_runtime_secs (float) – Maximum allowed runtime in seconds for model training. Use 0 to disable.
verbose (bool) – Print scoring history to stdout. Defaults to False.
-
train_segments
(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, segments=None, segment_models_id=None, parallelism=1, verbose=False)[source]¶ Trains H2O model for each segment (subpopulation) of the training dataset.
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame (H2OFrame) – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
max_runtime_secs (float) – Maximum allowed runtime in seconds for each model training. Use 0 to disable. Please note that regardless of how this parameter is set, a model will be built for each input segment. This parameter only affects individual model training.
segments – A list of columns to segment-by. H2O will group the training (and validation) dataset by the segment-by columns and train a separate model for each segment (group of rows). As an alternative to providing a list of columns, users can also supply an explicit enumeration of segments to build the models for. This enumeration needs to be represented as H2OFrame.
segment_models_id – Identifier for the returned collection of Segment Models. If not specified it will be automatically generated.
parallelism – Level of parallelism of the bulk segment models building, it is the maximum number of models each H2O node will be building in parallel.
verbose (bool) – Enable to print additional information during model building. Defaults to False.
- Examples
>>> response = "survived" >>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv") >>> titanic[response] = titanic[response].asfactor() >>> predictors = ["survived","name","sex","age","sibsp","parch","ticket","fare","cabin"] >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234) >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator >>> titanic_gbm = H2OGradientBoostingEstimator(seed=1234) >>> titanic_models = titanic_gbm.train_segments(segments=["pclass"], ... x=predictors, ... y=response, ... training_frame=train, ... validation_frame=valid) >>> titanic_models.as_frame()
H2OSingularValueDecompositionEstimator
¶
-
class
h2o.estimators.svd.
H2OSingularValueDecompositionEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Singular Value Decomposition
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> checkpoints_dir = tempfile.mkdtemp() >>> fit_h2o = H2OSingularValueDecompositionEstimator(export_checkpoints_dir=checkpoints_dir, ... seed=-5) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> len(listdir(checkpoints_dir))
-
property
ignore_const_cols
¶ Ignore constant columns.
Type:
bool
(default:True
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(ignore_const_cols=False, ... nv=4) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
ignored_columns
¶ Names of columns to ignore for training.
Type:
List[str]
.
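- Examples
A minimal sketch (the ignored column name is illustrative and assumed to exist in the frame):
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=3,
...                                                  ignored_columns=["Murder"])
>>> fit_h2o.train(training_frame=arrests)
>>> fit_h2o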
-
init_for_pipeline
()[source]¶ Returns an H2OSVD object that implements the fit and transform methods needed for proper use in a sklearn.Pipeline. All parameters defined in self.__params should be input parameters of the H2OSVD.__init__ method.
- Returns
H2OSVD object
- Examples
>>> from h2o.transforms.preprocessing import H2OScaler >>> from h2o.estimators import H2ORandomForestEstimator >>> from h2o.estimators import H2OSingularValueDecompositionEstimator >>> from sklearn.pipeline import Pipeline >>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> pipe = Pipeline([("standardize", H2OScaler()), ... ("svd", H2OSingularValueDecompositionEstimator(nv=3).init_for_pipeline()), ... ("rf", H2ORandomForestEstimator(seed=42,ntrees=50))]) >>> pipe.fit(arrests[1:], arrests[0])
-
property
keep_u
¶ Save left singular vectors?
Type:
bool
(default:True
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(keep_u=False) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
max_iterations
¶ Maximum iterations
Type:
int
(default:1000
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, ... transform="standardize", ... max_iterations=2000) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, ... transform="standardize", ... max_runtime_secs=25) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
nv
¶ Number of right singular vectors
Type:
int
(default:1
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, ... transform="standardize", ... max_iterations=2000) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
score_each_iteration
¶ Whether to score during each iteration of model training.
Type:
bool
(default:False
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, ... score_each_iteration=True) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
seed
¶ RNG seed for k-means++ initialization
Type:
int
(default:-1
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, seed=-3) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
svd_method
¶ Method for computing SVD (Caution: Randomized is currently experimental and unstable)
One of:
"gram_s_v_d"
,"power"
,"randomized"
(default:"gram_s_v_d"
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(svd_method="power") >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator() >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
transform
¶ Transformation of training data
One of:
"none"
,"standardize"
,"normalize"
,"demean"
,"descale"
(default:"none"
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(nv=4, ... transform="standardize", ... max_iterations=2000) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
u_name
¶ Frame key to save left singular vectors
Type:
str
.- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(u_name="fit_h2o") >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o.u_name >>> fit_h2o
-
property
use_all_factor_levels
¶ Whether first factor level is included in each categorical expansion
Type:
bool
(default:True
).- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> fit_h2o = H2OSingularValueDecompositionEstimator(use_all_factor_levels=False) >>> fit_h2o.train(x=list(range(4)), training_frame=arrests) >>> fit_h2o
-
property
validation_frame
¶ Id of the validation data frame.
Type:
H2OFrame
.- Examples
>>> arrests = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv") >>> train, valid = arrests.split_frame(ratios=[.8]) >>> fit_h2o = H2OSingularValueDecompositionEstimator() >>> fit_h2o.train(x=list(range(4)), ... training_frame=train, ... validation_frame=valid) >>> fit_h2o
H2OWord2vecEstimator
¶
-
class
h2o.estimators.word2vec.
H2OWord2vecEstimator
(**kwargs)[source]¶ Bases:
h2o.estimators.estimator_base.H2OEstimator
Word2Vec
-
property
epochs
¶ Number of training iterations to run
Type:
int
(default:5
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", count = 5) >>> print(synonyms) >>> >>> w2v_model2 = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 1) >>> w2v_model2.train(training_frame=words) >>> synonyms2 = w2v_model2.find_synonyms("teacher", 3) >>> print(synonyms2)
-
property
export_checkpoints_dir
¶ Automatically export generated models to this directory.
Type:
str
.- Examples
>>> import tempfile >>> from os import listdir >>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> checkpoints_dir = tempfile.mkdtemp() >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=1, ... max_runtime_secs=10, ... export_checkpoints_dir=checkpoints_dir) >>> w2v_model.train(training_frame=words) >>> len(listdir(checkpoints_dir))
-
static
from_external
(external=<class 'h2o.frame.H2OFrame'>)[source]¶ Creates new H2OWord2vecEstimator based on an external model.
- Parameters
external – H2OFrame with an external model
- Returns
H2OWord2vecEstimator instance representing the external model
- Examples
>>> words = h2o.create_frame(rows=10, cols=1, ... string_fraction=1.0, ... missing_fraction=0.0) >>> embeddings = h2o.create_frame(rows=10, cols=100, ... real_fraction=1.0, ... missing_fraction=0.0) >>> word_embeddings = words.cbind(embeddings) >>> w2v_model = H2OWord2vecEstimator.from_external(external=word_embeddings)
-
property
init_learning_rate
¶ Set the starting learning rate
Type:
float
(default:0.025
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=3, init_learning_rate=0.05) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("assistant", 3) >>> print(synonyms)
-
property
max_runtime_secs
¶ Maximum allowed runtime in seconds for model training. Use 0 to disable.
Type:
float
(default:0
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=1, max_runtime_secs=10) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("tutor", 3) >>> print(synonyms)
-
property
min_word_freq
¶ This will discard words that appear less than <int> times
Type:
int
(default:5
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=1, min_word_freq=4) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", 3) >>> print(synonyms)
-
property
norm_model
¶ Use Hierarchical Softmax
One of:
"hsm"
(default:"hsm"
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=1, norm_model="hsm") >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", 3) >>> print(synonyms)
-
property
pre_trained
¶ Id of a data frame that contains a pre-trained (external) word2vec model
Type:
H2OFrame
.- Examples
>>> words = h2o.create_frame(rows=1000,cols=1, ... string_fraction=1.0, ... missing_fraction=0.0) >>> embeddings = h2o.create_frame(rows=1000,cols=100, ... real_fraction=1.0, ... missing_fraction=0.0) >>> word_embeddings = words.cbind(embeddings) >>> w2v_model = H2OWord2vecEstimator(pre_trained=word_embeddings) >>> w2v_model.train(training_frame=word_embeddings) >>> model_id = w2v_model.model_id >>> model = h2o.get_model(model_id)
-
property
sent_sample_rate
¶ Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled; useful range is (0, 1e-5)
Type:
float
(default:0.001
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=1, sent_sample_rate=0.01) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", 3) >>> print(synonyms)
-
property
training_frame
¶ Id of the training data frame.
Type:
H2OFrame
.- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator() >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("tutor", 3) >>> print(synonyms)
-
property
vec_size
¶ Set size of word vectors
Type:
int
(default:100
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=3, vec_size=50) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("tutor", 3) >>> print(synonyms)
-
property
window_size
¶ Set max skip length between words
Type:
int
(default:5
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=3, window_size=2) >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("teacher", 3) >>> print(synonyms)
-
property
word_model
¶ The word model to use (SkipGram or CBOW)
One of:
"skip_gram"
,"cbow"
(default:"skip_gram"
).- Examples
>>> job_titles = h2o.import_file(("https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv"), ... col_names = ["category", "jobtitle"], ... col_types = ["string", "string"], ... header = 1) >>> words = job_titles.tokenize(" ") >>> w2v_model = H2OWord2vecEstimator(epochs=3, word_model="skip_gram") >>> w2v_model.train(training_frame=words) >>> synonyms = w2v_model.find_synonyms("assistant", 3) >>> print(synonyms)
H2OGridSearch
¶
-
class
h2o.grid.
H2OGridSearch
(*args, **kwargs)[source]¶ Bases:
h2o.grid.grid_search.H2OGridSearch
Grid Search of a Hyper-Parameter Space for a Model
- Parameters
model – The type of model to be explored, initialized with optional parameters that will be unchanged across explored models.
hyper_params – A dictionary of string parameters (keys) and a list of values to be explored by grid search (values).
grid_id (str) – The unique id assigned to the resulting grid object. If none is given, an id will automatically be generated.
search_criteria –
The optional dictionary of directives which control the search of the hyperparameter space. The dictionary can include values for:
strategy
,max_models
,max_runtime_secs
,stopping_metric
,stopping_tolerance
,stopping_rounds
andseed
. The default strategy, “Cartesian”, covers the entire space of hyperparameter combinations. If you want to use cartesian grid search, you can leave the search_criteria argument unspecified. Specify the “RandomDiscrete” strategy to get random search of all the combinations of your hyperparameters with three ways of specifying when to stop the search: max number of models, max time, and metric-based early stopping (e.g., stop if MSE hasn’t improved by 0.0001 over the 5 best models). Examples below:>>> criteria = {"strategy": "RandomDiscrete", "max_runtime_secs": 600, ... "max_models": 100, "stopping_metric": "AUTO", ... "stopping_tolerance": 0.00001, "stopping_rounds": 5, ... "seed": 123456} >>> criteria = {"strategy": "RandomDiscrete", "max_models": 42, ... "max_runtime_secs": 28800, "seed": 1234} >>> criteria = {"strategy": "RandomDiscrete", "stopping_metric": "AUTO", ... "stopping_tolerance": 0.001, "stopping_rounds": 10} >>> criteria = {"strategy": "RandomDiscrete", "stopping_rounds": 5, ... "stopping_metric": "misclassification", ... "stopping_tolerance": 0.00001}
parallelism – Level of parallelism during grid model building. 1 = sequential building (default). Use the value of 0 for adaptive parallelism - decided by H2O. Any number > 1 sets the exact number of models built in parallel.
- Returns
a new H2OGridSearch instance
Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator >>> hyper_parameters = {'alpha': [0.01,0.5], 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> training_data = h2o.import_file("smalldata/logreg/benign.csv") >>> gs.train(x=list(range(3)) + list(range(4,11)), y=3, training_frame=training_data) >>> gs.show()
-
aic
(train=False, valid=False, xval=False)[source]¶ Get the AIC(s).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the AIC value for the training data.
valid (bool) – If valid is True, then return the AIC value for the validation data.
xval (bool) – If xval is True, then return the AIC value for the cross-validation data.
- Returns
The AIC.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip") >>> prostate[2] = prostate[2].asfactor() >>> prostate[4] = prostate[4].asfactor() >>> prostate[5] = prostate[5].asfactor() >>> prostate[8] = prostate[8].asfactor() >>> predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"] >>> response = "CAPSULE" >>> hyper_params = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=predictors, y=response, training_frame=prostate) >>> gs.aic()
-
auc
(train=False, valid=False, xval=False)[source]¶ Get the AUC(s).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the AUC value for the training data.
valid (bool) – If valid is True, then return the AUC value for the validation data.
xval (bool) – If xval is True, then return the AUC value for the cross-validation data.
- Returns
The AUC.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> data = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv") >>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") >>> x = data.columns >>> y = "response" >>> x.remove(y) >>> data[y] = data[y].asfactor() >>> test[y] = test[y].asfactor() >>> ss = data.split_frame(seed = 1) >>> train = ss[0] >>> valid = ss[1] >>> gbm_params1 = {'learn_rate': [0.01, 0.1], ... 'max_depth': [3, 5, 9], ... 'sample_rate': [0.8, 1.0], ... 'col_sample_rate': [0.2, 0.5, 1.0]} >>> gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator, ... grid_id='gbm_grid1', ... hyper_params=gbm_params1) >>> gbm_grid1.train(x=x, y=y, ... training_frame=train, ... validation_frame=valid, ... ntrees=100, ... seed=1) >>> gbm_gridperf1 = gbm_grid1.get_grid(sort_by='auc', decreasing=True) >>> best_gbm1 = gbm_gridperf1.models[0] >>> best_gbm_perf1 = best_gbm1.model_performance(test) >>> best_gbm_perf1.auc()
-
aucpr
(train=False, valid=False, xval=False)[source]¶ Get the aucPR (Area Under PRECISION RECALL Curve).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the aucpr value for the training data.
valid (bool) – If valid is True, then return the aucpr value for the validation data.
xval (bool) – If xval is True, then return the aucpr value for the cross-validation data.
- Returns
The AUCPR for the models in this grid.
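- Examples
The source gives no example for this accessor; the following is a minimal sketch mirroring the gini() and logloss() examples later in this section (same dataset, columns, and hyperparameters):
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> y = 3
>>> x = [4,5,6,7,8,9,10,11]
>>> hyper_params = {'alpha': [0.01,0.3,0.5],
...                 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_params)
>>> gs.train(x=x, y=y, training_frame=benign)
>>> gs.aucpr()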
-
biases
(vector_id=0)[source]¶ Return the frame for the respective bias vector.
- Parameters
vector_id – an integer, ranging from 0 to the number of layers, that specifies the bias vector to return.
- Returns
an H2OFrame which represents the bias vector identified by vector_id
- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv") >>> hh = H2ODeepLearningEstimator(hidden=[], ... loss="CrossEntropy", ... export_weights_and_biases=True) >>> hh.train(x=list(range(4)), y=4, training_frame=iris) >>> hh.biases(0)
-
catoffsets
()[source]¶ Categorical offsets for one-hot encoding.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv") >>> hh = H2ODeepLearningEstimator(hidden=[], ... loss="CrossEntropy", ... export_weights_and_biases=True) >>> hh.train(x=list(range(4)), y=4, training_frame=iris) >>> hh.catoffsets()
-
coef
()[source]¶ Return the coefficients that can be applied to the non-standardized data.
Note: standardize = True by default. If set to False, then coef() returns the coefficients that are fit directly.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.coef()
-
coef_norm
()[source]¶ Return coefficients fitted on the standardized data (requires standardize = True, which is on by default). These coefficients can be used to evaluate variable importance.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.coef_norm()
-
deepfeatures
(test_data, layer)[source]¶ Obtain a hidden layer’s details on a dataset.
- Parameters
test_data – Data to create a feature space on.
layer (int) – Index of the hidden layer.
- Returns
A dictionary of hidden layer details for each model.
- Examples
>>> from h2o.estimators import H2OAutoEncoderEstimator >>> resp = 784 >>> nfeatures = 20 >>> train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz") >>> train[resp] = train[resp].asfactor() >>> test = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz") >>> test[resp] = test[resp].asfactor() >>> sid = train[0].runif(0) >>> train_unsup = train[sid >= 0.5] >>> train_unsup.pop(resp) >>> train_sup = train[sid < 0.5] >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh", ... hidden=[nfeatures], ... model_id="ae_model", ... epochs=1, ... ignore_const_cols=False, ... reproducible=True, ... seed=1234) >>> ae_model.train(list(range(resp)), training_frame=train_unsup) >>> ae_model.deepfeatures(train_sup[0:resp], 0)
-
property
failed_params
¶ Return a list of failed parameters.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6], ... 'beta_epsilon': [0.05]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.failed_params
-
get_grid
(sort_by=None, decreasing=None)[source]¶ Retrieve an H2OGridSearch instance.
Optionally specify a metric by which to sort models and a sort order. Note that if neither cross-validation nor a validation frame is used in the grid search, then the training metrics will display in the “get grid” output. If a validation frame is passed to the grid, and nfolds = 0, then the validation metrics will display. However, if nfolds > 1, then cross-validation metrics will display even if a validation frame is provided.
- Parameters
sort_by (str) – A metric by which to sort the models in the grid space. Choices are: "logloss", "residual_deviance", "mse", "auc", "r2", "accuracy", "precision", "recall", "f1", etc.
decreasing (bool) – Sort the models in decreasing order of metric if true, otherwise sort in increasing order (default).
- Returns
A new H2OGridSearch instance optionally sorted on the specified metric.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.get_grid(sort_by='F1', decreasing=True)
-
get_hyperparams
(id, display=True)[source]¶ Get the hyperparameters of a model explored by grid search.
- Parameters
id (str) – The model id of the model with hyperparameters of interest.
display (bool) – Flag to indicate whether to display the hyperparameter names.
- Returns
A list of the hyperparameters for the specified model.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> best_model_id = gs.get_grid(sort_by='F1', ... decreasing=True).model_ids[0] >>> gs.get_hyperparams(best_model_id)
-
get_hyperparams_dict
(id, display=True)[source]¶ Derive and return the model parameters used to train a particular grid search model.
- Parameters
id (str) – The model id of the model with hyperparameters of interest.
display (bool) – Flag to indicate whether to display the hyperparameter names.
- Returns
A dict of model parameters derived from the hyper-parameters used to train this particular model.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> best_model_id = gs.get_grid(sort_by='F1', ... decreasing=True).model_ids[0] >>> gs.get_hyperparams_dict(best_model_id)
-
get_xval_models
(key=None)[source]¶ Return the cross-validated model identified by key, or all cross-validated models.
- Parameters
key (str) – If None, return all cross-validated models; otherwise return the model specified by the key.
- Returns
A model or a list of models.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator >>> fr = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv") >>> m = H2OGradientBoostingEstimator(nfolds=10, ... ntrees=10, ... keep_cross_validation_models=True) >>> m.train(x=list(range(2,fr.ncol)), y=1, training_frame=fr) >>> m.get_xval_models()
-
gini
(train=False, valid=False, xval=False)[source]¶ Get the Gini Coefficient(s).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the Gini Coefficient value for the training data.
valid (bool) – If valid is True, then return the Gini Coefficient value for the validation data.
xval (bool) – If xval is True, then return the Gini Coefficient value for the cross validation data.
- Returns
The Gini Coefficient for the models in this grid.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.gini()
-
property
grid_id
¶ A key that identifies this grid search object in H2O.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.grid_id
-
property
hyper_names
¶ Return the hyperparameter names.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.hyper_names
-
is_cross_validated
()[source]¶ Return True if the model was cross-validated.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.is_cross_validated()
-
join
()[source]¶ Wait until grid finishes computing.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), hyper_params) >>> gs.start(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.join()
-
property
key
¶ - Returns
the unique key representing the object on the backend
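- Examples
No example is given in the source for this property; a minimal sketch, reusing the grid from the grid_id example above:
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5], 'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(x=list(range(3)) + list(range(4,11)), y=3, training_frame=training_data)
>>> gs.key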
-
logloss
(train=False, valid=False, xval=False)[source]¶ Get the Log Loss(es).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the Log Loss value for the training data.
valid (bool) – If valid is True, then return the Log Loss value for the validation data.
xval (bool) – If xval is True, then return the Log Loss value for the cross validation data.
- Returns
The Log Loss for this binomial model.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.logloss()
-
mean_residual_deviance
(train=False, valid=False, xval=False)[source]¶ Get the Mean Residual Deviance(s).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the Mean Residual Deviance value for the training data.
valid (bool) – If valid is True, then return the Mean Residual Deviance value for the validation data.
xval (bool) – If xval is True, then return the Mean Residual Deviance value for the cross validation data.
- Returns
The Mean Residual Deviance for this regression model.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.mean_residual_deviance()
-
property
model_ids
¶ Returns model ids.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch >>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator >>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv") >>> hyper_parameters = {'alpha': [0.01,0.5], ... 'lambda': [1e-5,1e-6]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_parameters) >>> gs.train(x=list(range(3))+list(range(4,11)), y=3, training_frame=training_data) >>> gs.model_ids
-
model_performance
(test_data=None, train=False, valid=False, xval=False)[source]¶ Generate model metrics for this model on test_data.
- Parameters
test_data – Data set against which the model metrics shall be computed. All three of the train, valid, and xval arguments are ignored if test_data is not None.
train – Report the training metrics for the model.
valid – Report the validation metrics for the model.
xval – Report the cross-validation metrics for the model.
- Returns
An object of class H2OModelMetrics.
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> data = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv") >>> test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") >>> x = data.columns >>> y = "response" >>> x.remove(y) >>> data[y] = data[y].asfactor() >>> test[y] = test[y].asfactor() >>> ss = data.split_frame(seed = 1) >>> train = ss[0] >>> valid = ss[1] >>> gbm_params1 = {'learn_rate': [0.01, 0.1], ... 'max_depth': [3, 5, 9], ... 'sample_rate': [0.8, 1.0], ... 'col_sample_rate': [0.2, 0.5, 1.0]} >>> gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator, ... grid_id='gbm_grid1', ... hyper_params=gbm_params1) >>> gbm_grid1.train(x=x, y=y, ... training_frame=train, ... validation_frame=valid, ... ntrees=100, ... seed=1) >>> gbm_gridperf1 = gbm_grid1.get_grid(sort_by='auc', decreasing=True) >>> best_gbm1 = gbm_gridperf1.models[0] >>> best_gbm1.model_performance(test)
-
mse
(train=False, valid=False, xval=False)[source]¶ Get the MSE(s).
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the MSE value for the training data.
valid (bool) – If valid is True, then return the MSE value for the validation data.
xval (bool) – If xval is True, then return the MSE value for the cross validation data.
- Returns
The MSE for this regression model.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.mse()
-
normmul
()[source]¶ Normalization/Standardization multipliers for numeric predictors.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.normmul()
-
normsub
()[source]¶ Normalization/Standardization offsets for numeric predictors.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.normsub()
-
null_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null dof for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the null dof for the validation set. If both train and valid are True, then train is selected by default.
xval (bool) – Get the null dof for the cross-validated models.
- Returns
the null dof, or None if it is not present.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.null_degrees_of_freedom()
-
null_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the null deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the null deviance for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the null deviance for the validation set. If both train and valid are True, then train is selected by default.
xval (bool) – Get the null deviance for the cross-validated models.
- Returns
the null deviance, or None if it is not present.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.null_deviance()
-
pprint_coef
()[source]¶ Pretty print the coefficients table (includes normalized coefficients).
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.pprint_coef()
-
predict
(test_data)[source]¶ Predict on a dataset.
- Parameters
test_data (H2OFrame) – Data to be predicted on.
- Returns
H2OFrame filled with predictions.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.predict(benign)
-
r2
(train=False, valid=False, xval=False)[source]¶ Return the R^2 for this regression model.
The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma^2.
If all are False (default), then return the training metric value. If more than one option is set to True, then return a dictionary of metrics where the keys are “train”, “valid”, and “xval”.
- Parameters
train (bool) – If train is True, then return the R^2 value for the training data.
valid (bool) – If valid is True, then return the R^2 value for the validation data.
xval (bool) – If xval is True, then return the R^2 value for the cross validation data.
- Returns
The R^2 for this regression model.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.r2()
-
residual_degrees_of_freedom
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual dof for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the residual dof for the validation set. If both train and valid are True, then train is selected by default.
xval (bool) – Get the residual dof for the cross-validated models.
- Returns
the residual degrees of freedom, or None if they are not present.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.residual_degrees_of_freedom()
-
residual_deviance
(train=False, valid=False, xval=False)[source]¶ Retrieve the residual deviance if this model has the attribute, or None otherwise.
- Parameters
train (bool) – Get the residual deviance for the training set. If both train and valid are False, then train is selected by default.
valid (bool) – Get the residual deviance for the validation set. If both train and valid are True, then train is selected by default.
xval (bool) – Get the residual deviance for the cross-validated models.
- Returns
the residual deviance, or None if it is not present.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.residual_deviance()
-
respmul
()[source]¶ Normalization/Standardization multipliers for numeric response.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.respmul()
-
respsub
()[source]¶ Normalization/Standardization offsets for numeric response.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.respsub()
-
scoring_history
()[source]¶ Retrieve model scoring history.
- Returns
Score history (H2OTwoDimTable)
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.scoring_history()
-
show
()[source]¶ Print models sorted by metric.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.show()
-
sort_by
(metric, increasing=True)[source]¶ grid.sort_by() is deprecated; use grid.get_grid() instead
Deprecated since 2016-12-12, use grid.get_grid() instead.
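- Examples
A sketch of the replacement call: sort_by(metric, increasing=True) maps onto get_grid(sort_by=metric, decreasing=not increasing). Dataset and hyperparameters are borrowed from the gini() example above:
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> from h2o.grid.grid_search import H2OGridSearch
>>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv")
>>> hyper_params = {'alpha': [0.01,0.3,0.5], 'lambda': [1e-5, 1e-6, 1e-7]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), hyper_params)
>>> gs.train(x=[4,5,6,7,8,9,10,11], y=3, training_frame=benign)
>>> gs.get_grid(sort_by='logloss', decreasing=False)  # was: gs.sort_by('logloss', increasing=True)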
-
sorted_metric_table
()[source]¶ Retrieve summary table of an H2O Grid Search.
- Returns
The summary table as an H2OTwoDimTable or a Pandas DataFrame.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.sorted_metric_table()
-
start
(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]¶ Launch the model build asynchronously, specifying the predictor columns, response column, and any additional frame-specific values.
To block for results, call join().
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), hyper_params) >>> gs.start(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.join()
-
summary
(header=True)[source]¶ Print a detailed summary of the explored models.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.summary()
-
train
(x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]¶ Train the model synchronously (i.e. do not return until the model finishes training).
To train asynchronously, call start().
- Parameters
x – A list of column names or indices indicating the predictor columns.
y – An index or a column name indicating the response column.
training_frame – The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
offset_column – The name or index of the column in training_frame that holds the offsets.
fold_column – The name or index of the column in training_frame that holds the per-row fold assignments.
weights_column – The name or index of the column in training_frame that holds the per-row weights.
validation_frame – H2OFrame with validation data to be scored on while training.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance)
-
varimp
(use_pandas=False)[source]¶ Pretty print the variable importances, or return them in a list/pandas DataFrame.
- Parameters
use_pandas (bool) – If True, then the variable importances will be returned as a pandas data frame.
- Returns
A dictionary of lists or Pandas DataFrame instances.
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> insurance = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv") >>> insurance["offset"] = insurance["Holders"].log() >>> insurance["Group"] = insurance["Group"].asfactor() >>> insurance["Age"] = insurance["Age"].asfactor() >>> insurance["District"] = insurance["District"].asfactor() >>> hyper_params = {'huber_alpha': [0.2,0.5], ... 'quantile_alpha': [0.2,0.6]} >>> gs = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), ... hyper_params) >>> gs.train(x=list(range(3)),y="Claims", training_frame=insurance) >>> gs.varimp(use_pandas=True)
-
weights
(matrix_id=0)[source]¶ Return the frame for the respective weight matrix.
- Parameters
matrix_id – an integer, ranging from 0 to the number of layers, that specifies the weight matrix to return.
- Returns
an H2OFrame which represents the weight matrix identified by matrix_id
- Examples
>>> from h2o.estimators import H2ODeepLearningEstimator >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv") >>> hh = H2ODeepLearningEstimator(hidden=[], ... loss="CrossEntropy", ... export_weights_and_biases=True) >>> hh.train(x=list(range(4)), y=4, training_frame=iris) >>> hh.weights(0)
-
xval_keys
()[source]¶ Return the model keys for the cross-validated models.
- Examples
>>> from h2o.estimators import H2OGeneralizedLinearEstimator >>> from h2o.grid.grid_search import H2OGridSearch >>> benign = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/benign.csv") >>> y = 3 >>> x = [4,5,6,7,8,9,10,11] >>> hyper_params = {'alpha': [0.01,0.3,0.5], ... 'lambda': [1e-5, 1e-6, 1e-7]} >>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), ... hyper_params) >>> gs.train(x=x,y=y, training_frame=benign) >>> gs.xval_keys()