Modeling In H2O

`H2OEstimator`

class h2o.estimators.estimator_base.H2OEstimator

Bases: h2o.model.model_base.ModelBase

H2O Estimators

H2O Estimators implement the following methods for model construction:

start - Top-level user-facing API for asynchronous model build
join - Top-level user-facing API for blocking on async model build
train - Top-level user-facing API for model building.
fit - Used by scikit-learn.

Because H2OEstimator instances are instances of ModelBase, these objects can use the H2O model API.

fit(X, y=None, **params)

Fit an H2O model as part of a scikit-learn pipeline or grid search.

A warning will be issued if a caller other than sklearn attempts to use this method.

Parameters:

X : H2OFrame

An H2OFrame consisting of the predictor variables.

y : H2OFrame, optional

An H2OFrame consisting of the response variable.

params : optional

Extra arguments.

Returns:

The current instance of H2OEstimator for method chaining.

get_params(deep=True)

Useful method for obtaining parameters for this estimator. Used primarily for sklearn Pipelines and sklearn grid search.

Parameters:

deep : bool, optional

If True, return parameters of all sub-objects that are estimators.

Returns:

A dict of parameters

set_params(**parms)

Used by sklearn for updating parameters during grid search.

Parameters:

parms : dict

A dictionary of parameters that will be set on this model.

Returns:

Returns self, the current estimator object with the parameters all set as desired.
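The contract that fit, get_params, and set_params describe can be illustrated without an H2O cluster. The following is a minimal sketch built around a hypothetical `ToyEstimator` class that is not part of H2O. It assumes only the scikit-learn convention stated above: `get_params` returns a dict, while `set_params` and `fit` return the estimator itself so that calls can be chained.

```python
class ToyEstimator:
    """Hypothetical stand-in that mimics the parameter protocol described above."""

    def __init__(self, ntrees=50, max_depth=5):
        self._parms = {"ntrees": ntrees, "max_depth": max_depth}
        self.fitted = False

    def get_params(self, deep=True):
        # Return a copy so callers cannot mutate internal state by accident.
        return dict(self._parms)

    def set_params(self, **parms):
        # Update the stored parameters and return self for method chaining.
        self._parms.update(parms)
        return self

    def fit(self, X, y=None, **params):
        # A real estimator would train here; this sketch only records the call.
        self.fitted = True
        return self


# Chained usage, as a pipeline or grid search would drive it.
est = ToyEstimator().set_params(ntrees=100).fit(X=[[0, 1], [1, 0]], y=[0, 1])
params = est.get_params()
print(params)
```

Returning `self` from both methods is what lets a grid search clone an estimator, update its parameters, and fit it in a single expression.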

start(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)

Asynchronous model build by specifying the predictor columns, response column, and any additional frame-specific values.

To block for results, call join.

Parameters:

x : list

A list of column names or indices indicating the predictor columns.

y : str

An index or a column name indicating the response column.

training_frame : H2OFrame

The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

offset_column : str, optional

The name or index of the column in training_frame that holds the offsets.

fold_column : str, optional

The name or index of the column in training_frame that holds the per-row fold assignments.

weights_column : str, optional

The name or index of the column in training_frame that holds the per-row weights.

validation_frame : H2OFrame, optional

H2OFrame with validation data to be scored on while training.

train(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, **params)

Train the H2O model by specifying the predictor columns, response column, and any additional frame-specific values.

Parameters:

x : list

A list of column names or indices indicating the predictor columns.

y : str

An index or a column name indicating the response column.

training_frame : H2OFrame

The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).

offset_column : str, optional

The name or index of the column in training_frame that holds the offsets.

fold_column : str, optional

The name or index of the column in training_frame that holds the per-row fold assignments.

weights_column : str, optional

The name or index of the column in training_frame that holds the per-row weights.

validation_frame : H2OFrame, optional

H2OFrame with validation data to be scored on while training.

max_runtime_secs : float

Maximum allowed runtime in seconds for model training. Use 0 to disable.

`H2ODeepLearningEstimator`

class h2o.estimators.deeplearning.H2ODeepLearningEstimator(model_id=None, overwrite_with_best_model=None, checkpoint=None, pretrained_autoencoder=None, use_all_factor_levels=None, standardize=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, quantile_alpha=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_validation_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, score_validation_sampling=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, missing_values_handling=None, reproducible=None, export_weights_and_biases=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, initial_weights=None, initial_biases=None)

Bases: h2o.estimators.estimator_base.H2OEstimator

Build a supervised deep neural network model. Builds a feed-forward multilayer artificial neural network on an H2OFrame.

Parameters:

model_id : str, optional

The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

overwrite_with_best_model : bool

If True, overwrite the final model with the best model found during training. Defaults to True.

checkpoint : H2ODeepLearningModel, optional

Model checkpoint (either key or H2ODeepLearningModel) to resume training with.

use_all_factor_levels : bool

Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.

standardize : bool

If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.

activation : str

A string indicating the activation function to use. Must be either “Tanh”, “TanhWithDropout”, “Rectifier”, “RectifierWithDropout”, “Maxout”, or “MaxoutWithDropout”

hidden : list

Hidden layer sizes (e.g. [100,100])

epochs : float

How many times the dataset should be iterated (streamed), can be fractional

train_samples_per_iteration : int

Number of training samples (globally) per MapReduce iteration. Special values are: 0 (one epoch), -1 (all available data, e.g., replicated training data), or -2 (auto-tuning, the default).

seed : int

Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded

adaptive_rate : bool

Adaptive learning rate (ADADELTA)

rho : float

Adaptive learning rate time decay factor (similarity to prior updates)

epsilon : float

Adaptive learning rate parameter, similar to learn rate annealing during initial training phase. Typical values are between 1.0e-10 and 1.0e-4

rate : float

Learning rate (higher => less stable, lower => slower convergence)

rate_annealing : float

Learning rate annealing: rate / (1 + rate_annealing * samples)

rate_decay : float

Learning rate decay factor between layers (N-th layer: rate * alpha^(N-1), where alpha is this decay factor)

momentum_start : float

Initial momentum at the beginning of training (try 0.5)

momentum_ramp : float

Number of training samples for which momentum increases

momentum_stable : float

Final momentum after the ramp is over (try 0.99)

nesterov_accelerated_gradient : bool

Logical. Use Nesterov accelerated gradient (recommended)

input_dropout_ratio : float

A fraction of the features for each training row to be omitted from training in order to improve generalization (dimension sampling).

hidden_dropout_ratios : list

Hidden layer dropout ratios (can improve generalization). Specify one value per hidden layer; defaults to 0.5.

l1 : float

L1 regularization (can add stability and improve generalization, causes many weights to become 0)

l2 : float

L2 regularization (can add stability and improve generalization, causes many weights to be small)

max_w2 : float

Constraint for squared sum of incoming weights per unit (e.g. Rectifier)

initial_weight_distribution : str

Can be “Uniform”, “UniformAdaptive”, or “Normal”

initial_weight_scale : float

Uniform: -value ... value, Normal: stddev

loss : str

Loss function: “Automatic”, “CrossEntropy” (for classification only), “Quadratic”, “Absolute” (experimental) or “Huber” (experimental)

distribution : str

A character string. The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “huber”, “quantile” or “gaussian”

quantile_alpha : float

Quantile (only for Quantile regression, must be between 0 and 1)

tweedie_power : float

Tweedie power (only for Tweedie distribution, must be between 1 and 2)

score_interval : int

Shortest time interval (in secs) between model scoring

score_training_samples : int

Number of training set samples for scoring (0 for all)

score_validation_samples : int

Number of validation set samples for scoring (0 for all)

score_duty_cycle : float

Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring)

classification_stop : float

Stopping criterion for classification error fraction on training data (-1 to disable)

regression_stop : float

Stopping criterion for regression error (MSE) on training data (-1 to disable)

stopping_rounds : int

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.

stopping_metric : str

Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.

stopping_tolerance : float

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

quiet_mode : bool

Enable quiet mode for less output to standard output

max_confusion_matrix_size : int

Max. size (number of classes) for confusion matrices to be shown

max_hit_ratio_k : float

Max number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)

balance_classes : bool

Balance training data class counts via over/under-sampling (for imbalanced data)

class_sampling_factors : list

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

max_after_balance_size : float

Maximum relative size of the training data after balancing class counts (can be less than 1.0)

score_validation_sampling : str

Method used to sample validation dataset for scoring

diagnostics : bool

Enable diagnostics for hidden layers

variable_importances : bool

Compute variable importances for input features (Gedeon method); can be slow for large networks.

fast_mode : bool

Enable fast mode (minor approximations in back-propagation)

ignore_const_cols : bool

Ignore constant columns (no information can be gained anyway)

force_load_balance : bool

Force extra load balancing to increase training speed for small datasets (to keep all cores busy)

replicate_training_data : bool

Replicate the entire training dataset onto every node for faster training

single_node_mode : bool

Run on a single node for fine-tuning of model parameters

shuffle_training_data : bool

Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to numRows * numNodes)

sparse : bool

Sparse data handling (Experimental)

col_major : bool

Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down back propagation (Experimental)

average_activation : float

Average activation for sparse auto-encoder (Experimental)

sparsity_beta : float

Sparsity regularization (Experimental)

max_categorical_features : int

Max. number of categorical features, enforced via hashing (Experimental)

reproducible : bool

Force reproducibility on small data (will be slow - only uses 1 thread)

missing_values_handling : str

Handling of missing values. Either “Skip” or “MeanImputation”.

export_weights_and_biases : bool

Whether to export neural network weights and biases to H2O Frames.

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.
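The rate_annealing and rate_decay descriptions above reduce to two small formulas. The sketch below evaluates them in plain Python. The helper names `annealed_rate` and `layer_rate` are illustrative and not part of the H2O API, and the formulas are taken from the parameter descriptions rather than from the H2O source, so treat the numbers as an illustration only.

```python
def annealed_rate(rate, rate_annealing, samples):
    # Learning rate after `samples` training samples:
    # rate / (1 + rate_annealing * samples)
    return rate / (1.0 + rate_annealing * samples)


def layer_rate(rate, rate_decay, layer):
    # Learning rate of the N-th layer (1-based): rate * rate_decay ** (N - 1)
    return rate * rate_decay ** (layer - 1)


start = annealed_rate(0.005, 1e-6, 0)          # no samples seen yet
later = annealed_rate(0.005, 1e-6, 1_000_000)  # after one million samples
third = layer_rate(0.01, 0.5, 3)               # third layer with decay 0.5

print(start, later, third)
```

With rate_annealing = 1e-6, the rate halves after one million samples, which gives a feel for the scale of values this parameter usually takes.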

Examples

>>> import h2o as ml
>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> ml.init()
>>> rows=[[1,2,3,4,0],[2,1,2,4,1],[2,1,4,2,1],[0,1,2,34,1],[2,3,4,1,0]]*50
>>> fr = ml.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2ODeepLearningEstimator()
>>> model.train(x=range(4), y=4, training_frame=fr)

`H2OAutoEncoderEstimator`

class h2o.estimators.deeplearning.H2OAutoEncoderEstimator(model_id=None, overwrite_with_best_model=None, checkpoint=None, pretrained_autoencoder=None, use_all_factor_levels=None, standardize=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, quantile_alpha=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_validation_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, score_validation_sampling=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, missing_values_handling=None, reproducible=None, export_weights_and_biases=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, initial_weights=None, initial_biases=None)

Bases: h2o.estimators.deeplearning.H2ODeepLearningEstimator

Examples

>>> import h2o as ml
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> ml.init()
>>> rows=[[1,2,3,4,0]*50,[2,1,2,4,1]*50,[2,1,4,2,1]*50,[0,1,2,34,1]*50,[2,3,4,1,0]*50]
>>> fr = ml.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2OAutoEncoderEstimator()
>>> model.train(x=range(4), training_frame=fr)

`H2ORandomForestEstimator`

class h2o.estimators.random_forest.H2ORandomForestEstimator(model_id=None, mtries=None, col_sample_rate_change_per_level=None, sample_rate=None, sample_rate_per_class=None, col_sample_rate_per_tree=None, build_tree_one_node=None, ntrees=None, max_depth=None, min_rows=None, nbins=None, nbins_cats=None, binomial_double_trees=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, seed=None, nfolds=None, fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, score_each_iteration=None, score_tree_interval=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, checkpoint=None, min_split_improvement=None, random_split_points=None)

Bases: h2o.estimators.estimator_base.H2OEstimator

Builds a Random Forest Model on an H2OFrame

Parameters:

model_id : str, optional

The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

mtries : int

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification, and p/3 for regression, where p is the number of predictors.

col_sample_rate_change_per_level : float

Relative change of the column sampling rate for every level (from 0.0 to 2.0)

sample_rate : float

Row sample rate per tree (from 0.0 to 1.0)

sample_rate_per_class : list

Row sample rate per tree per class (one per class, from 0.0 to 1.0)

col_sample_rate_per_tree : float

Column sample rate per tree (from 0.0 to 1.0)

build_tree_one_node : bool

Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.

ntrees : int

A non-negative integer that determines the number of trees to grow.

max_depth : int

Maximum depth to grow the tree.

min_rows : int

Minimum number of rows to assign to terminal nodes.

nbins : int

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.

nbins_top_level : int

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.

nbins_cats : int

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

binomial_double_trees : bool

For binary classification: build 2x as many trees (one per class); can lead to higher accuracy.

balance_classes : bool

Whether to balance training data class counts via over/under-sampling (for imbalanced data).

class_sampling_factors : list

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

max_after_balance_size : float

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.

seed : int

Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.

score_each_iteration : bool

Attempts to score each tree.

score_tree_interval : int

Score the model after every so many trees. Disabled if set to 0.

stopping_rounds : int

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.

stopping_metric : str

Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.

stopping_tolerance : float

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

min_split_improvement : float

Minimum relative improvement in squared error reduction for a split to happen

random_split_points : bool

Whether to use random split points for histograms (to pick the best split from).
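The interaction between stopping_rounds and stopping_tolerance described above is easier to follow in code. The sketch below is one plausible reading of that description for a metric where lower is better. It is not H2O's implementation, and the function name `should_stop` and the choice to compare the earliest moving average against the best later one are assumptions made for illustration.

```python
def should_stop(history, stopping_rounds, stopping_tolerance):
    """Return True when the moving average of a lower-is-better metric has stalled.

    history holds one metric value per scoring event, oldest first.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or fewer than 2k scoring events so far
    recent = history[-2 * k:]
    # k + 1 moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, best_recent = averages[0], min(averages[1:])
    if reference == 0:
        return True  # metric is already zero; no relative improvement is possible
    improvement = (reference - best_recent) / abs(reference)
    return improvement < stopping_tolerance


stalled = should_stop([1.0, 0.5, 0.4, 0.4, 0.4, 0.4], 2, 0.01)
improving = should_stop([1.0, 0.8, 0.6, 0.4], 2, 0.01)
too_early = should_stop([1.0, 0.9, 0.8], 2, 0.01)
print(stalled, improving, too_early)
```

Averaging over k events smooths out noise between scoring events, which is why the check cannot trigger before 2k events have been recorded.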

`H2OGradientBoostingEstimator`

class h2o.estimators.gbm.H2OGradientBoostingEstimator(model_id=None, distribution=None, quantile_alpha=None, tweedie_power=None, ntrees=None, max_depth=None, min_rows=None, learn_rate=None, nbins=None, sample_rate=None, sample_rate_per_class=None, col_sample_rate=None, col_sample_rate_change_per_level=None, col_sample_rate_per_tree=None, nbins_top_level=None, nbins_cats=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, seed=None, build_tree_one_node=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, score_each_iteration=None, score_tree_interval=None, checkpoint=None, min_split_improvement=None, random_split_points=None, max_abs_leafnode_pred=None)

Bases: h2o.estimators.estimator_base.H2OEstimator

Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set. The default distribution function will guess the model type based on the response column type. To run properly, the response column must be numeric for “gaussian” or an enum for “bernoulli” or “multinomial”.

Parameters:

model_id : str, optional

The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

distribution : str

The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “quantile” or “gaussian”

quantile_alpha : float

Quantile (only for Quantile regression, must be between 0 and 1)

tweedie_power : float

Tweedie power (only for Tweedie distribution, must be between 1 and 2)

ntrees : int

A non-negative integer that determines the number of trees to grow.

max_depth : int

Maximum depth to grow the tree.

min_rows : int

Minimum number of rows to assign to terminal nodes.

learn_rate : float

Learning rate (from 0.0 to 1.0)

learn_rate_annealing : float

Multiply the learning rate by this factor after every tree

sample_rate : float

Row sample rate per tree (from 0.0 to 1.0)

sample_rate_per_class : list

Row sample rate per tree per class (one per class, from 0.0 to 1.0)

col_sample_rate : float

Column sample rate per split (from 0.0 to 1.0)

col_sample_rate_change_per_level : float

Relative change of the column sampling rate for every level (from 0.0 to 2.0)

col_sample_rate_per_tree : float

Column sample rate per tree (from 0.0 to 1.0)

nbins : int

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.

nbins_top_level : int

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.

nbins_cats : int

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

balance_classes : bool

Whether to balance training data class counts via over/under-sampling (for imbalanced data).

class_sampling_factors : list

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

max_after_balance_size : float

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.

seed : int

Seed for random numbers (affects sampling when balance_classes=True)

build_tree_one_node : bool

Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.

score_each_iteration : bool

Attempts to score each tree.

score_tree_interval : int

Score the model after every so many trees. Disabled if set to 0.

stopping_rounds : int

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.

stopping_metric : str

Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.

stopping_tolerance : float

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)

min_split_improvement : float

Minimum relative improvement in squared error reduction for a split to happen

random_split_points : bool

Whether to use random split points for histograms (to pick the best split from).

max_abs_leafnode_pred : float

Maximum absolute value of a leaf node prediction.

Returns:

A new H2OGradientBoostingEstimator object.
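The learn_rate_annealing description above ("multiply the learning rate by this factor after every tree") implies a geometric schedule. The following sketch lists the per-tree learning rates under that reading. The helper name `learn_rate_schedule` is hypothetical, and the assumption that the first tree uses the unmodified learn_rate is an illustration rather than a statement about H2O internals.

```python
def learn_rate_schedule(learn_rate, learn_rate_annealing, ntrees):
    # Collect the learning rate applied to each tree, in build order.
    rates = []
    current = learn_rate
    for _ in range(ntrees):
        rates.append(current)
        current *= learn_rate_annealing  # anneal after every tree
    return rates


schedule = learn_rate_schedule(0.1, 0.5, 4)
print(schedule)
```

A factor of 0.5 is used here only to make the decay visible in four trees. A factor close to 1 shrinks the rate slowly across a large number of trees, while a factor of exactly 1 leaves it constant.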

`H2OGeneralizedLinearEstimator`

class h2o.estimators.glm.H2OGeneralizedLinearEstimator(model_id=None, max_iterations=None, beta_epsilon=None, solver=None, standardize=None, family=None, link=None, tweedie_variance_power=None, tweedie_link_power=None, alpha=None, prior=None, lambda_search=None, nlambdas=None, lambda_min_ratio=None, beta_constraints=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, intercept=None, Lambda=None, max_active_predictors=None, checkpoint=None, objective_epsilon=None, gradient_epsilon=None, non_negative=False, compute_p_values=False, remove_collinear_columns=False, missing_values_handling=None, interactions=None)

Bases: h2o.estimators.estimator_base.H2OEstimator

Build a Generalized Linear Model: Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.

Parameters:

model_id : str, optional

The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

max_iterations : int

A non-negative integer specifying the maximum number of iterations.

beta_epsilon : float

A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion.

solver : str

A character string specifying the solver used: IRLSM (supports more features), L_BFGS (scales better for datasets with many columns)

standardize : bool

Indicates whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.

family : str

A character string specifying the distribution of the model: gaussian, binomial, multinomial, poisson, gamma, tweedie.

link : str

A character string specifying the link function. The default is the canonical link for the family. The supported links for each family are:

“gaussian” - “identity”, “log”, “inverse”
“binomial” - “logit”, “log”
“multinomial” - “multinomial”
“poisson” - “log”, “identity”
“gamma” - “inverse”, “log”, “identity”
“tweedie” - “tweedie”

tweedie_variance_power : float

A numeric specifying the power for the variance function when family = “tweedie”.

tweedie_link_power : float

A numeric specifying the power for the link function when family = “tweedie”.

alpha : float

A numeric in [0, 1] specifying the elastic-net mixing parameter. The elastic-net penalty is defined to be $P(\alpha, \beta) = \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 = \sum_j \left[ \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right]$, making alpha = 1 the lasso penalty and alpha = 0 the ridge penalty.

Lambda : float

A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha, \beta)$ in the objective function. When Lambda = 0, no elastic-net penalty is applied and ordinary generalized linear models are fit.

prior : float, optional

A numeric specifying the prior probability of class 1 in the response when family = “binomial”. The default prior is the observational frequency of class 1. Must be from (0,1) exclusive range or None (no prior).

lambda_search : bool

A logical value indicating whether to conduct a search over the space of lambda values, starting from lambda max. When enabled, the given Lambda is interpreted as lambda min.

nlambdas : int

The number of lambda values to use when lambda_search = True.

lambda_min_ratio : float

Smallest value for lambda as a fraction of lambda max. By default, if the number of observations is greater than the number of variables, then lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio = 0.01.

beta_constraints : H2OFrame

An H2OFrame with the columns [“names”, “lower_bounds”, “upper_bounds”, “beta_given”], where each row corresponds to a predictor in the GLM. “names” contains the predictor names, “lower_bounds” and “upper_bounds” are the lower and upper bounds of beta, and “beta_given” holds the supplied starting values.

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.

intercept : bool

Logical, include constant term (intercept) in the model

max_active_predictors : int, optional

Convergence criteria for number of predictors when using L1 penalty.

missing_values_handling : str

A character string specifying how to handle missing values: “MeanImputation” or “Skip”.

interactions : list, optional

A list of column names to interact. All pairwise combinations of columns will be interacted.

max_runtime_secs : int, optional

Maximum allowed runtime in seconds. The model stops training once the limit is reached and returns whatever result it has at that moment.

Returns:

A subclass of ModelBase is returned. The specific subclass depends on the machine learning task at hand: for binomial classification an H2OBinomialModel is returned, and for regression an H2ORegressionModel is returned. The default print-out of the model is shown, but further GLM-specific information can be queried from the object. Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
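The elastic-net penalty controlled by the alpha and Lambda parameters above can be checked numerically. The sketch below computes the penalty for a small coefficient vector in plain Python. The function name `elastic_net_penalty` and the example coefficients are illustrative, and the code makes no claim about how H2O treats the intercept or scales the penalty internally.

```python
def elastic_net_penalty(beta, alpha):
    # P(alpha, beta) = (1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1
    l2_squared = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return (1.0 - alpha) / 2.0 * l2_squared + alpha * l1


beta = [0.5, -2.0, 0.0]
lasso = elastic_net_penalty(beta, alpha=1.0)  # pure L1 penalty
ridge = elastic_net_penalty(beta, alpha=0.0)  # pure (halved) squared L2 penalty
mixed = elastic_net_penalty(beta, alpha=0.5)  # equal mix of the two

# Lambda scales the whole penalty term added to the objective.
Lambda = 0.1
penalty_term = Lambda * mixed
print(lasso, ridge, mixed, penalty_term)
```

Setting alpha = 1 leaves only the sum of absolute values, and alpha = 0 leaves only half the sum of squares, matching the lasso and ridge cases named in the alpha description.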

`H2OGeneralizedLowRankEstimator`

class h2o.estimators.glrm.H2OGeneralizedLowRankEstimator(k=None, max_iterations=None, transform=None, seed=None, ignore_const_cols=None, loss=None, multi_loss=None, loss_by_col=None, loss_by_col_idx=None, regularization_x=None, regularization_y=None, gamma_x=None, gamma_y=None, init_step_size=None, min_step_size=None, init=None, svd_method=None, user_x=None, user_y=None, recover_svd=None)

Bases: h2o.estimators.estimator_base.H2OEstimator

Builds a generalized low rank model of an H2O dataset.

Parameters:

k : int

The rank of the resulting decomposition. This must be between 1 and the number of columns in the training frame inclusive.

max_iterations : int

The maximum number of iterations to run the optimization loop. Each iteration consists of an update of the X matrix, followed by an update of the Y matrix.

transform : str

A character string that indicates how the training data should be transformed before running GLRM. Possible values are “NONE” for no transformation, “DEMEAN” for subtracting the mean of each column, “DESCALE” for dividing by the standard deviation of each column, “STANDARDIZE” for demeaning and descaling, and “NORMALIZE” for demeaning and dividing each column by its range (max - min).

seed : int, optional

Random seed used to initialize the X and Y matrices.

ignore_const_cols : bool, optional

A logical value indicating whether to ignore constant columns in the training frame. A column is constant if all of its non-missing values are the same value.

loss : str

A character string indicating the default loss function for numeric columns. Possible values are “Quadratic” (default), “Absolute”, “Huber”, “Poisson”, “Hinge”, and “Logistic”.

multi_loss : str

A character string indicating the default loss function for enum columns. Possible values are “Categorical” and “Ordinal”.

loss_by_col : list, optional

A list of strings indicating the loss function for specific columns, matched by position to the indices in loss_by_col_idx. Overrides loss for numeric columns and multi_loss for enum columns.

loss_by_col_idx : list, optional

A list of column indices to which the corresponding loss functions in loss_by_col are assigned. Indices are zero-based.

regularization_x : str

A character string indicating the regularization function for the X matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.

regularization_y : str

A character string indicating the regularization function for the Y matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.

gamma_x : float

The weight on the X matrix regularization term.

gamma_y : float

The weight on the Y matrix regularization term.

init_step_size : float

Initial step size. Divided by number of columns in the training frame when calculating the proximal gradient update. The algorithm begins at init_step_size and decreases the step size at each iteration until a termination condition is reached.

min_step_size : float

Minimum step size upon which the algorithm is terminated.

init : str

A character string indicating how to select the initial X and Y matrices. Possible values are “Random” for initialization to a random array from the standard normal distribution, “PlusPlus” for initialization using the clusters from k-means++ initialization, “SVD” for initialization using the first k (approximate) right singular vectors, and “User” for user-specified initial X and Y frames (the user_x and user_y arguments must be set).

svd_method : str

A character string that indicates how the SVD should be calculated during initialization. Possible values are “GramSVD” for distributed computation of the Gram matrix followed by a local SVD using the JAMA package, “Power” for computation of the SVD using the power iteration method, and “Randomized” for approximate SVD by projecting onto a random subspace.

user_x : H2OFrame, optional

An H2OFrame object specifying the initial X matrix. Only used when init = “User”.

user_y : H2OFrame, optional

An H2OFrame object specifying the initial Y matrix. Only used when init = “User”.

recover_svd : bool

A logical value indicating whether the singular values and eigenvectors should be recovered during post-processing of the generalized low rank decomposition.
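The transform options listed above can be illustrated on a single numeric column. The sketch below is standalone Python and does not call h2o. It assumes the sample standard deviation is used for “DESCALE” and “STANDARDIZE”, which the description does not specify, and the helper transform_column is hypothetical.

```python
import statistics


def transform_column(values, how="NONE"):
    """Apply one of the documented transform options to a list of numbers.

    Hypothetical helper for illustration only; not an h2o function.
    Needs at least two values and a non-constant column.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # assumption: sample standard deviation
    span = max(values) - min(values)   # range (max - min)
    if how == "NONE":
        return list(values)
    if how == "DEMEAN":
        return [v - mean for v in values]
    if how == "DESCALE":
        return [v / sd for v in values]
    if how == "STANDARDIZE":
        return [(v - mean) / sd for v in values]
    if how == "NORMALIZE":
        return [(v - mean) / span for v in values]
    raise ValueError("unknown transform: %r" % how)


column = [2.0, 4.0, 6.0]
print(transform_column(column, "STANDARDIZE"))  # [-1.0, 0.0, 1.0]
print(transform_column(column, "NORMALIZE"))    # [-0.5, 0.0, 0.5]
```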

Returns:

A new H2OGeneralizedLowRankEstimator instance.
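How loss, multi_loss, loss_by_col and loss_by_col_idx combine into one loss per column can be sketched as follows. This is a reading of the parameter descriptions above, not H2O's code: the helper resolve_losses and the "numeric"/"enum" type labels are hypothetical.

```python
def resolve_losses(col_types, loss="Quadratic", multi_loss="Categorical",
                   loss_by_col=None, loss_by_col_idx=None):
    """Return the loss name used for each column.

    Hypothetical helper for illustration only; not an h2o function.
    col_types holds "numeric" or "enum" for each column.
    """
    loss_by_col = loss_by_col or []
    loss_by_col_idx = loss_by_col_idx or []
    if len(loss_by_col) != len(loss_by_col_idx):
        raise ValueError("loss_by_col and loss_by_col_idx must have equal length")
    # Start from the defaults: loss for numeric columns, multi_loss for enums.
    resolved = [loss if t == "numeric" else multi_loss for t in col_types]
    # Apply the per-column overrides; indices are zero-based.
    for idx, name in zip(loss_by_col_idx, loss_by_col):
        resolved[idx] = name
    return resolved


print(resolve_losses(["numeric", "enum", "numeric"],
                     loss_by_col=["Huber"], loss_by_col_idx=[2]))
# ['Quadratic', 'Categorical', 'Huber']
```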

`H2OKMeansEstimator`¶

class h2o.estimators.kmeans.H2OKMeansEstimator(model_id=None, k=None, max_iterations=None, standardize=None, init=None, seed=None, nfolds=None, fold_assignment=None, user_points=None, ignored_columns=None, score_each_iteration=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, ignore_const_cols=None, checkpoint=None)[source]¶

Bases: h2o.estimators.estimator_base.H2OEstimator

Performs k-means clustering on an H2O dataset.

Parameters:

model_id : str, optional

The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

k : int

The number of clusters. Must be between 1 and 1e7 inclusive. k may be omitted if the user specifies the initial centers in the init parameter. If k is given in that case, it should equal the number of user-specified centers.

max_iterations : int

The maximum number of iterations allowed. Must be between 0 and 1e6 inclusive.

standardize : bool

Indicates whether the data should be standardized before running k-means.

init : str

A character string that selects the initial set of k cluster centers. Possible values are “Random” for random initialization, “PlusPlus” for k-means++ initialization, or “Furthest” for initialization at the furthest point from each successive center. Additionally, the user may specify the initial centers as a matrix, data.frame, H2OFrame, or list of vectors. For matrices, data.frames, and H2OFrames, each row of the respective structure is an initial center. For lists of vectors, each vector is an initial center.

seed : int, optional

Random seed used to initialize the cluster centroids.

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models.

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.

Returns:

An instance of H2OClusteringModel.
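The “Furthest” option of init can be read as a farthest-point heuristic: after a first center is chosen, each new center is the point whose distance to its nearest already-chosen center is largest. The sketch below implements that textbook heuristic in plain Python. The helper furthest_init and its first argument are hypothetical, and H2O's own initialization may differ, for example in how the first center is picked.

```python
import math


def furthest_init(points, k, first=0):
    """Pick k initial centers with a farthest-point heuristic.

    Hypothetical helper for illustration only; not an h2o function.
    points is a list of equal-length coordinate tuples.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [points[first]]
    while len(centers) < k:
        # For every point, measure the distance to its nearest chosen center,
        # then take the point for which that distance is largest.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


pts = [(0, 0), (1, 0), (10, 0), (10, 10)]
print(furthest_init(pts, 3))  # [(0, 0), (10, 10), (10, 0)]
```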

`H2ONaiveBayesEstimator`¶

class h2o.estimators.naive_bayes.H2ONaiveBayesEstimator(model_id=None, laplace=None, threshold=None, eps=None, compute_metrics=None, balance_classes=None, max_after_balance_size=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, checkpoint=None)[source]¶

Bases: h2o.estimators.estimator_base.H2OEstimator

The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.

Parameters:

laplace : int

A positive number controlling Laplace smoothing. The default zero disables smoothing.

threshold : float

The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.

eps : float

A threshold cutoff to deal with numeric instability, must be positive.

compute_metrics : bool

A logical value indicating whether model metrics should be computed. Set to False to reduce the runtime of the algorithm.

nfolds : int, optional

Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

fold_assignment : str

Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.

keep_cross_validation_predictions : bool

Whether to keep the predictions of the cross-validation models.

keep_cross_validation_fold_assignment : bool

Whether to keep the cross-validation fold assignment.

Returns:

An instance of H2ONaiveBayesEstimator.
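The effect of the laplace parameter can be illustrated with textbook additive smoothing of a categorical predictor's conditional probabilities within one response class. The sketch below is plain Python and assumes the usual formula (count + laplace) / (n + laplace * number of levels); whether H2O uses exactly this form is an assumption, and the helper smoothed_conditional is hypothetical.

```python
from collections import Counter


def smoothed_conditional(values, levels, laplace=0.0):
    """Estimate P(level | class) from the values observed within one class.

    Hypothetical helper for illustration only; not an h2o function.
    values must be non-empty when laplace is 0.
    """
    counts = Counter(values)
    total = len(values) + laplace * len(levels)
    return {lv: (counts[lv] + laplace) / total for lv in levels}


observed = ["red", "red", "blue"]      # predictor values for rows of one class
levels = ["red", "blue", "green"]      # all levels of the predictor

# Without smoothing, the unseen level "green" gets probability 0.
print(smoothed_conditional(observed, levels, laplace=0))
# With laplace=1 the estimates become 3/6, 2/6 and 1/6.
print(smoothed_conditional(observed, levels, laplace=1))
```

With smoothing disabled, a level never seen with a class drives the whole product of conditional probabilities to zero for that class, which is the situation a positive laplace value avoids.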
