Modeling In H2O¶
H2OEstimator¶
- class h2o.estimators.estimator_base.H2OEstimator[source]¶
Bases: h2o.model.model_base.ModelBase
H2O Estimators
- H2O Estimators implement the following methods for model construction:
- start - Top-level user-facing API for asynchronous model build
- join - Top-level user-facing API for blocking on async model build
- train - Top-level user-facing API for model building.
- fit - Used by scikit-learn.
Because H2OEstimator instances are instances of ModelBase, these objects can use the H2O model API.
- fit(X, y=None, **params)[source]¶
Fit an H2O model as part of a scikit-learn pipeline or grid search.
A warning will be issued if a caller other than sklearn attempts to use this method.
Parameters: X : H2OFrame
An H2OFrame consisting of the predictor variables.
- y : H2OFrame, optional
An H2OFrame consisting of the response variable.
- params : optional
Extra arguments.
Returns: The current instance of H2OEstimator for method chaining.
- get_params(deep=True)[source]¶
Useful method for obtaining parameters for this estimator. Used primarily for sklearn Pipelines and sklearn grid search.
Parameters: deep : bool, optional
If True, return parameters of all sub-objects that are estimators.
Returns: A dict of parameters
- set_params(**parms)[source]¶
Used by sklearn for updating parameters during grid search.
Parameters: parms : dict
A dictionary of parameters that will be set on this model.
Returns: self, the current estimator object with the given parameters set.
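The get_params / set_params contract described above can be sketched with a toy class. ToyEstimator and its two parameters are hypothetical stand-ins chosen for illustration; this is not H2O code and it does not contact an H2O cluster. It only shows the shape of the protocol: get_params returns a dict, and set_params returns self so that calls can be chained.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator; not part of H2O."""

    def __init__(self, ntrees=50, max_depth=5):
        self.ntrees = ntrees
        self.max_depth = max_depth

    def get_params(self, deep=True):
        # Return the constructor parameters as a plain dict.
        return {"ntrees": self.ntrees, "max_depth": self.max_depth}

    def set_params(self, **parms):
        # Update only known parameters, then return self for chaining.
        known = self.get_params()
        for name, value in parms.items():
            if name not in known:
                raise ValueError("unknown parameter: %s" % name)
            setattr(self, name, value)
        return self


est = ToyEstimator()
same = est.set_params(ntrees=100)   # returns the same object
params = est.get_params()
```

Returning self is what lets a grid search update an estimator in place and keep using the same object.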
- start(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)[source]¶
Asynchronous model build by specifying the predictor columns, response column, and any additional frame-specific values.
To block for results, call join.
Parameters: x : list
A list of column names or indices indicating the predictor columns.
- y : str
An index or a column name indicating the response column.
- training_frame : H2OFrame
The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
- offset_column : str, optional
The name or index of the column in training_frame that holds the offsets.
- fold_column : str, optional
The name or index of the column in training_frame that holds the per-row fold assignments.
- weights_column : str, optional
The name or index of the column in training_frame that holds the per-row weights.
- validation_frame : H2OFrame, optional
H2OFrame with validation data to be scored on while training.
- train(x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, **params)[source]¶
Train the H2O model by specifying the predictor columns, response column, and any additional frame-specific values.
Parameters: x : list
A list of column names or indices indicating the predictor columns.
- y : str
An index or a column name indicating the response column.
- training_frame : H2OFrame
The H2OFrame having the columns indicated by x and y (as well as any additional columns specified by fold, offset, and weights).
- offset_column : str, optional
The name or index of the column in training_frame that holds the offsets.
- fold_column : str, optional
The name or index of the column in training_frame that holds the per-row fold assignments.
- weights_column : str, optional
The name or index of the column in training_frame that holds the per-row weights.
- validation_frame : H2OFrame, optional
H2OFrame with validation data to be scored on while training.
- max_runtime_secs : float
Maximum allowed runtime in seconds for model training. Use 0 to disable.
H2ODeepLearningEstimator¶
- class h2o.estimators.deeplearning.H2ODeepLearningEstimator(model_id=None, overwrite_with_best_model=None, checkpoint=None, pretrained_autoencoder=None, use_all_factor_levels=None, standardize=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, quantile_alpha=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_validation_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, score_validation_sampling=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, missing_values_handling=None, reproducible=None, export_weights_and_biases=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, initial_weights=None, initial_biases=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
Build a supervised Deep Neural Network model. Builds a feed-forward multilayer artificial neural network on an H2OFrame.
Parameters: model_id : str, optional
The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- overwrite_with_best_model : bool
If True, overwrite the final model with the best model found during training. Defaults to True.
- checkpoint : H2ODeepLearningModel, optional
Model checkpoint (either key or H2ODeepLearningModel) to resume training with.
- use_all_factor_levels : bool
Use all factor levels of categorical variables. Otherwise the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
- standardize : bool
If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.
- activation : str
A string indicating the activation function to use. Must be either “Tanh”, “TanhWithDropout”, “Rectifier”, “RectifierWithDropout”, “Maxout”, or “MaxoutWithDropout”
- hidden : list
Hidden layer sizes (e.g. [100,100])
- epochs : float
How many times the dataset should be iterated (streamed), can be fractional
- train_samples_per_iteration : int
Number of training samples (globally) per MapReduce iteration. Special values are: 0 one epoch; -1 all available data (e.g., replicated training data); or -2 auto-tuning (default)
- seed : int
Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded
- adaptive_rate : bool
Adaptive learning rate (ADADELTA)
- rho : float
Adaptive learning rate time decay factor (similarity to prior updates)
- epsilon : float
Adaptive learning rate parameter, similar to learn rate annealing during initial training phase. Typical values are between 1.0e-10 and 1.0e-4
- rate : float
Learning rate (higher => less stable, lower => slower convergence)
- rate_annealing : float
Learning rate annealing: rate / (1 + rate_annealing * samples)
- rate_decay : float
Learning rate decay factor between layers (N-th layer: rate * alpha^(N-1), where alpha is rate_decay)
- momentum_start : float
Initial momentum at the beginning of training (try 0.5)
- momentum_ramp : float
Number of training samples for which momentum increases
- momentum_stable : float
Final momentum after the ramp is over (try 0.99)
- nesterov_accelerated_gradient : bool
Use Nesterov accelerated gradient (recommended)
- input_dropout_ratio : float
A fraction of the features for each training row to be omitted from training in order to improve generalization (dimension sampling).
- hidden_dropout_ratios : list
Hidden layer dropout ratios (can improve generalization). Specify one value per hidden layer; defaults to 0.5.
- l1 : float
L1 regularization (can add stability and improve generalization, causes many weights to become 0)
- l2 : float
L2 regularization (can add stability and improve generalization, causes many weights to be small)
- max_w2 : float
Constraint for squared sum of incoming weights per unit (e.g. Rectifier)
- initial_weight_distribution : str
Can be “Uniform”, “UniformAdaptive”, or “Normal”
- initial_weight_scale : float
Uniform: -value ... value, Normal: stddev
- loss : str
Loss function: “Automatic”, “CrossEntropy” (for classification only), “Quadratic”, “Absolute” (experimental) or “Huber” (experimental)
- distribution : str
A character string. The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “huber”, “quantile” or “gaussian”
- quantile_alpha : float
Quantile (only for Quantile regression, must be between 0 and 1)
- tweedie_power : float
Tweedie power (only for Tweedie distribution, must be between 1 and 2)
- score_interval : int
Shortest time interval (in secs) between model scoring
- score_training_samples : int
Number of training set samples for scoring (0 for all)
- score_validation_samples : int
Number of validation set samples for scoring (0 for all)
- score_duty_cycle : float
Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring)
- classification_stop : float
Stopping criterion for classification error fraction on training data (-1 to disable)
- regression_stop : float
Stopping criterion for regression error (MSE) on training data (-1 to disable)
- stopping_rounds : int
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
- stopping_metric : str
Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
- stopping_tolerance : float
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
- quiet_mode : bool
Enable quiet mode for less output to standard output
- max_confusion_matrix_size : int
Max. size (number of classes) for confusion matrices to be shown
- max_hit_ratio_k : float
Max number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
- balance_classes : bool
Balance training data class counts via over/under-sampling (for imbalanced data)
- class_sampling_factors : list
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
- max_after_balance_size : float
Maximum relative size of the training data after balancing class counts (can be less than 1.0)
- score_validation_sampling : str
Method used to sample validation dataset for scoring
- diagnostics : bool
Enable diagnostics for hidden layers
- variable_importances : bool
Compute variable importances for input features (Gedeon method); can be slow for large networks.
- fast_mode : bool
Enable fast mode (minor approximations in back-propagation)
- ignore_const_cols : bool
Ignore constant columns (no information can be gained anyway)
- force_load_balance : bool
Force extra load balancing to increase training speed for small datasets (to keep all cores busy)
- replicate_training_data : bool
Replicate the entire training dataset onto every node for faster training
- single_node_mode : bool
Run on a single node for fine-tuning of model parameters
- shuffle_training_data : bool
Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to numRows * numNodes)
- sparse : bool
Sparse data handling (Experimental)
- col_major : bool
Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down back propagation (Experimental)
- average_activation : float
Average activation for sparse auto-encoder (Experimental)
- sparsity_beta : float
Sparsity regularization (Experimental)
- max_categorical_features : int
Max. number of categorical features, enforced via hashing (Experimental)
- reproducible : bool
Force reproducibility on small data (will be slow - only uses 1 thread)
- missing_values_handling : str
Handling of missing values. Either “Skip” or “MeanImputation”.
- export_weights_and_biases : bool
Whether to export Neural Network weights and biases to H2O Frames.
- nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_assignment : str
Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models
- keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
Examples
>>> import h2o as ml
>>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
>>> ml.init()
>>> rows=[[1,2,3,4,0],[2,1,2,4,1],[2,1,4,2,1],[0,1,2,34,1],[2,3,4,1,0]]*50
>>> fr = ml.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2ODeepLearningEstimator()
>>> model.train(x=range(4), y=4, training_frame=fr)
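The learning-rate schedules named under rate_annealing and rate_decay can be sketched in plain Python. This is only an illustration of the two formulas as documented, not H2O's internal code. The function names are made up for this example, and the assumption that layers are numbered from 1 follows the "N-th layer" wording.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Learning rate after `samples` training samples have been processed."""
    return rate / (1.0 + rate_annealing * samples)


def layer_rate(rate, rate_decay, layer):
    """Learning rate for the N-th layer, counting layers from 1 (assumed)."""
    return rate * rate_decay ** (layer - 1)


# With rate_annealing = 1e-6 the rate halves after one million samples.
halved = annealed_rate(0.005, 1e-6, 1000000)

# With rate_decay = 0.5 the third layer trains at a quarter of the base rate.
third_layer = layer_rate(0.01, 0.5, 3)
```

Both schedules only matter when adaptive_rate is disabled and the rate parameter is used directly.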
H2OAutoEncoderEstimator¶
- class h2o.estimators.deeplearning.H2OAutoEncoderEstimator(model_id=None, overwrite_with_best_model=None, checkpoint=None, pretrained_autoencoder=None, use_all_factor_levels=None, standardize=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, quantile_alpha=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_validation_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, score_validation_sampling=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, missing_values_handling=None, reproducible=None, export_weights_and_biases=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, initial_weights=None, initial_biases=None)[source]¶
Bases: h2o.estimators.deeplearning.H2ODeepLearningEstimator
Examples
>>> import h2o as ml
>>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
>>> ml.init()
>>> rows=[[1,2,3,4,0]*50,[2,1,2,4,1]*50,[2,1,4,2,1]*50,[0,1,2,34,1]*50,[2,3,4,1,0]*50]
>>> fr = ml.H2OFrame(rows)
>>> fr[4] = fr[4].asfactor()
>>> model = H2OAutoEncoderEstimator()
>>> model.train(x=range(4), training_frame=fr)
H2ORandomForestEstimator¶
- class h2o.estimators.random_forest.H2ORandomForestEstimator(model_id=None, mtries=None, col_sample_rate_change_per_level=None, sample_rate=None, sample_rate_per_class=None, col_sample_rate_per_tree=None, build_tree_one_node=None, ntrees=None, max_depth=None, min_rows=None, nbins=None, nbins_cats=None, binomial_double_trees=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, seed=None, nfolds=None, fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, score_each_iteration=None, score_tree_interval=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, checkpoint=None, min_split_improvement=None, random_split_points=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
Builds a Random Forest Model on an H2OFrame
Parameters: model_id : str, optional
The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
mtries : int
Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification, and p/3 for regression, where p is the number of predictors.
col_sample_rate_change_per_level : float
Relative change of the column sampling rate for every level (from 0.0 to 2.0)
sample_rate : float
Row sample rate per tree (from 0.0 to 1.0)
sample_rate_per_class : list
Row sample rate per tree per class (one per class, from 0.0 to 1.0)
col_sample_rate_per_tree : float
Column sample rate per tree (from 0.0 to 1.0)
build_tree_one_node : bool
Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
ntrees : int
A non-negative integer that determines the number of trees to grow.
max_depth : int
Maximum depth to grow the tree.
min_rows : int
Minimum number of rows to assign to terminal nodes.
nbins : int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
nbins_top_level : int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.
nbins_cats : int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
binomial_double_trees : bool
For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.
balance_classes : bool
Whether to balance training data class counts via over/under-sampling (for imbalanced data).
class_sampling_factors : list
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
max_after_balance_size : float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.
seed : int
Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded
nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
fold_assignment : str
Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models
keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
score_each_iteration : bool
Attempts to score each tree.
score_tree_interval : int
Score the model after every so many trees. Disabled if set to 0.
stopping_rounds : int
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
stopping_metric : str
Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
stopping_tolerance : float
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
min_split_improvement : float
Minimum relative improvement in squared error reduction for a split to happen
random_split_points : bool
Whether to use random split points for histograms (to pick the best split from).
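The mtries default described above (sqrt(p) for classification, p/3 for regression) can be sketched as follows. The function name is hypothetical. Rounding down and the lower bound of 1 are assumptions made here so the result is a usable integer; the text above does not state how H2O rounds.

```python
import math


def default_mtries(p, classification):
    """Candidate columns per split when mtries is -1 (illustrative only)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if classification:
        # sqrt(p) for classification; rounding down is an assumption.
        return max(1, int(math.floor(math.sqrt(p))))
    # p/3 for regression; integer division and the floor of 1 are assumptions.
    return max(1, p // 3)


classification_mtries = default_mtries(16, classification=True)
regression_mtries = default_mtries(10, classification=False)
```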
H2OGradientBoostingEstimator¶
- class h2o.estimators.gbm.H2OGradientBoostingEstimator(model_id=None, distribution=None, quantile_alpha=None, tweedie_power=None, ntrees=None, max_depth=None, min_rows=None, learn_rate=None, nbins=None, sample_rate=None, sample_rate_per_class=None, col_sample_rate=None, col_sample_rate_change_per_level=None, col_sample_rate_per_tree=None, nbins_top_level=None, nbins_cats=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, seed=None, build_tree_one_node=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, score_each_iteration=None, score_tree_interval=None, checkpoint=None, min_split_improvement=None, random_split_points=None, max_abs_leafnode_pred=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set. The default distribution function will guess the model type based on the response column type. To run properly, the response column must be numeric for “gaussian” or an enum for “bernoulli” or “multinomial”.
Parameters: model_id : str, optional
The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
distribution : str
The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “quantile” or “gaussian”
quantile_alpha : float
Quantile (only for Quantile regression, must be between 0 and 1)
tweedie_power : float
Tweedie power (only for Tweedie distribution, must be between 1 and 2)
ntrees : int
A non-negative integer that determines the number of trees to grow.
max_depth : int
Maximum depth to grow the tree.
min_rows : int
Minimum number of rows to assign to terminal nodes.
learn_rate : float
Learning rate (from 0.0 to 1.0)
learn_rate_annealing : float
Multiply the learning rate by this factor after every tree
sample_rate : float
Row sample rate per tree (from 0.0 to 1.0)
sample_rate_per_class : list
Row sample rate per tree per class (one per class, from 0.0 to 1.0)
col_sample_rate : float
Column sample rate per split (from 0.0 to 1.0)
col_sample_rate_change_per_level : float
Relative change of the column sampling rate for every level (from 0.0 to 2.0)
col_sample_rate_per_tree : float
Column sample rate per tree (from 0.0 to 1.0)
nbins : int
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
nbins_top_level : int
For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.
nbins_cats : int
For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
balance_classes : bool
Whether to balance training data class counts via over/under-sampling (for imbalanced data).
class_sampling_factors : list
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
max_after_balance_size : float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.
seed : int
Seed for random numbers (affects sampling when balance_classes=True)
build_tree_one_node : bool
Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.
nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
fold_assignment : str
Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”
keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models
keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
score_each_iteration : bool
Attempts to score each tree.
score_tree_interval : int
Score the model after every so many trees. Disabled if set to 0.
stopping_rounds : int
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
stopping_metric : str
Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
stopping_tolerance : float
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
min_split_improvement : float
Minimum relative improvement in squared error reduction for a split to happen
random_split_points : bool
Whether to use random split points for histograms (to pick the best split from).
max_abs_leafnode_pred : float
Maximum absolute value of a leaf node prediction.
Returns: A new H2OGradientBoostingEstimator object.
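The early-stopping rule described under stopping_rounds can be read as the following sketch. It is one interpretation of the wording (moving averages of length k, at least 2k scoring events, relative improvement compared with stopping_tolerance), not H2O's actual implementation. The function name, the lower_is_better flag, and the handling of a zero reference value are all assumptions.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Decide whether to stop, given the metric at each scoring event so far."""
    k = stopping_rounds
    # Disabled when k is 0; cannot trigger before 2k scoring events.
    if k <= 0 or len(history) < 2 * k:
        return False
    recent = history[-2 * k:]
    # k + 1 moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / float(k) for i in range(k + 1)]
    reference = averages[0]        # oldest moving average in the window
    later = averages[1:]           # the k most recent moving averages
    best = min(later) if lower_is_better else max(later)
    gain = (reference - best) if lower_is_better else (best - reference)
    scale = abs(reference) or 1.0  # avoid dividing by zero (assumption)
    # Stop when the relative improvement falls short of the tolerance.
    return gain / scale < stopping_tolerance


plateau = [0.9, 0.7, 0.6, 0.55, 0.549, 0.5489, 0.5488, 0.5488]
stop_on_plateau = should_stop(plateau, stopping_rounds=2, stopping_tolerance=0.01)

improving = [0.9, 0.7, 0.5, 0.3]
stop_while_improving = should_stop(improving, stopping_rounds=2, stopping_tolerance=0.01)
```

For a metric such as AUC, where larger values are better, pass lower_is_better=False.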
H2OGeneralizedLinearEstimator¶
- class h2o.estimators.glm.H2OGeneralizedLinearEstimator(model_id=None, max_iterations=None, beta_epsilon=None, solver=None, standardize=None, family=None, link=None, tweedie_variance_power=None, tweedie_link_power=None, alpha=None, prior=None, lambda_search=None, nlambdas=None, lambda_min_ratio=None, beta_constraints=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, intercept=None, Lambda=None, max_active_predictors=None, checkpoint=None, objective_epsilon=None, gradient_epsilon=None, non_negative=False, compute_p_values=False, remove_collinear_columns=False, missing_values_handling=None, interactions=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
- Build a Generalized Linear Model
- Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.
Parameters: model_id : str, optional
The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- max_iterations : int
A non-negative integer specifying the maximum number of iterations.
- beta_epsilon : float
A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion.
- solver : str
A character string specifying the solver used: IRLSM (supports more features), L_BFGS (scales better for datasets with many columns)
- standardize : bool
Indicates whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.
- family : str
A character string specifying the distribution of the model: gaussian, binomial, multinomial, poisson, gamma, tweedie.
- link : str
A character string specifying the link function. The default is the canonical link for the family. The supported links for each family are: “gaussian”: “identity”, “log”, “inverse”; “binomial”: “logit”, “log”; “multinomial”: “multinomial”; “poisson”: “log”, “identity”; “gamma”: “inverse”, “log”, “identity”; “tweedie”: “tweedie”.
- tweedie_variance_power : float
A numeric specifying the power for the variance function when family = “tweedie”.
- tweedie_link_power : float
A numeric specifying the power for the link function when family = “tweedie”.
- alpha : float
A numeric in [0, 1] specifying the elastic-net mixing parameter. The elastic-net penalty is defined to be $P(\alpha,\beta) = \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right]$, making alpha = 1 the lasso penalty and alpha = 0 the ridge penalty.
- Lambda : float
A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha,\beta)$ in the objective function. When Lambda = 0, no elastic-net penalty is applied and ordinary generalized linear models are fit.
- prior : float, optional
A numeric specifying the prior probability of class 1 in the response when family = “binomial”. The default prior is the observational frequency of class 1. Must be from (0,1) exclusive range or None (no prior).
- lambda_search : bool
A logical value indicating whether to conduct a search over the space of lambda values starting from lambda max; the given Lambda is then interpreted as lambda min.
- nlambdas : int
The number of lambda values to use when lambda_search = True.
- lambda_min_ratio : float
Smallest value for lambda as a fraction of lambda max. By default, if the number of observations is greater than the number of variables then lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables then lambda_min_ratio = 0.01.
- beta_constraints : H2OFrame
An H2OFrame with the columns [“names”, “lower_bounds”, “upper_bounds”, “beta_given”], where each row corresponds to a predictor in the GLM. “names” contains the predictor names, “lower_bounds” and “upper_bounds” are the lower and upper bounds of beta, and “beta_given” holds the supplied starting values.
- nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_assignment : str
Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models
- keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
- intercept : bool
Logical, include constant term (intercept) in the model
- max_active_predictors : int, optional
Convergence criteria for number of predictors when using L1 penalty.
- missing_values_handling : str
A character string specifying how to handle missing values: “MeanImputation” or “Skip”.
- interactions : list, optional
A list of column names to interact. All pairwise combinations of columns will be interacted.
- max_runtime_secs : int, optional
Maximum allowed runtime. The model stops training after reaching the limit and returns whatever result it has at that moment.
Returns: A subclass of ModelBase is returned. The specific subclass depends on the machine learning task at hand (if it’s binomial classification, then an H2OBinomialModel is returned; if it’s regression, then an H2ORegressionModel is returned). The default print-out of the models is shown, but further GLM-specific information can be queried out of the object. Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, AIC, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
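The elastic-net penalty given under alpha, and its scaling by Lambda, can be checked numerically with a short sketch. The function names are invented for this example, and nothing here calls H2O. It only evaluates the penalty formula for a coefficient vector beta.

```python
def elastic_net_penalty(beta, alpha):
    """P(alpha, beta) = (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return (1.0 - alpha) / 2.0 * l2_squared + alpha * l1


def penalty_term(lambda_, beta, alpha):
    """The term Lambda * P(alpha, beta) that is added to the objective."""
    if lambda_ < 0:
        raise ValueError("Lambda must be non-negative")
    return lambda_ * elastic_net_penalty(beta, alpha)


beta = [3.0, -4.0]
lasso = elastic_net_penalty(beta, alpha=1.0)   # pure L1: |3| + |-4|
ridge = elastic_net_penalty(beta, alpha=0.0)   # half the squared L2 norm
mixed = elastic_net_penalty(beta, alpha=0.5)
unpenalized = penalty_term(0.0, beta, alpha=0.5)
```

Setting Lambda to 0 makes the penalty term vanish, which corresponds to fitting an ordinary generalized linear model.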
H2OGeneralizedLowRankEstimator¶
- class h2o.estimators.glrm.H2OGeneralizedLowRankEstimator(k=None, max_iterations=None, transform=None, seed=None, ignore_const_cols=None, loss=None, multi_loss=None, loss_by_col=None, loss_by_col_idx=None, regularization_x=None, regularization_y=None, gamma_x=None, gamma_y=None, init_step_size=None, min_step_size=None, init=None, svd_method=None, user_x=None, user_y=None, recover_svd=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
Builds a generalized low rank model of an H2O dataset.
Parameters: k : int
The rank of the resulting decomposition. This must be between 1 and the number of columns in the training frame inclusive.
- max_iterations : int
The maximum number of iterations to run the optimization loop. Each iteration consists of an update of the X matrix, followed by an update of the Y matrix.
- transform : str
A character string that indicates how the training data should be transformed before running GLRM. Possible values are “NONE” for no transformation, “DEMEAN” for subtracting the mean of each column, “DESCALE” for dividing by the standard deviation of each column, “STANDARDIZE” for demeaning and descaling, and “NORMALIZE” for demeaning and dividing each column by its range (max - min).
- seed : int, optional
Random seed used to initialize the X and Y matrices.
- ignore_const_cols : bool, optional
A logical value indicating whether to ignore constant columns in the training frame. A column is constant if all of its non-missing values are the same value.
- loss : str
A character string indicating the default loss function for numeric columns. Possible values are “Quadratic” (default), “Absolute”, “Huber”, “Poisson”, “Hinge”, and “Logistic”.
- multi_loss : str
A character string indicating the default loss function for enum columns. Possible values are “Categorical” and “Ordinal”.
- loss_by_col : list, optional
A list of strings indicating the loss function for specific columns by corresponding index in loss_by_col_idx. Will override loss for numeric columns and multi_loss for enum columns.
- loss_by_col_idx : list, optional
A list of column indices to which the corresponding loss functions in loss_by_col are assigned. Must be zero-indexed.
- regularization_x : str
A character string indicating the regularization function for the X matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.
- regularization_y : str
A character string indicating the regularization function for the Y matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.
- gamma_x : float
The weight on the X matrix regularization term.
- gamma_y : float
The weight on the Y matrix regularization term.
- init_step_size : float
Initial step size. Divided by number of columns in the training frame when calculating the proximal gradient update. The algorithm begins at init_step_size and decreases the step size at each iteration until a termination condition is reached.
- min_step_size : float
Minimum step size upon which the algorithm is terminated.
- init : str
A character string indicating how to select the initial X and Y matrices. Possible values are “Random” for initialization to a random array from the standard normal distribution, “PlusPlus” for initialization using the clusters from k-means++ initialization, “SVD” for initialization using the first k (approximate) right singular vectors, and “User” for user-specified initial X and Y frames (requires the user_x and user_y arguments).
- svd_method : str
A character string that indicates how SVD should be calculated during initialization. Possible values are “GramSVD” for distributed computation of the Gram matrix followed by a local SVD using the JAMA package, “Power” for computation of the SVD using the power iteration method, and “Randomized” for approximate SVD by projecting onto a random subspace.
- user_x : H2OFrame, optional
An H2OFrame object specifying the initial X matrix. Only used when init = “User”.
- user_y : H2OFrame, optional
An H2OFrame object specifying the initial Y matrix. Only used when init = “User”.
- recover_svd : bool
A logical value indicating whether the singular values and eigenvectors should be recovered during post-processing of the generalized low rank decomposition.
Returns: A new H2OGeneralizedLowRankEstimator instance.
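The transform options described above can be illustrated on a single numeric column. This is a minimal sketch in plain Python, not H2O's implementation. The function transform_column is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which convention H2O follows.

```python
from statistics import mean, stdev

def transform_column(values, how):
    """Apply one of the transform options to a single numeric column."""
    mu = mean(values)
    sd = stdev(values)  # assumption: sample standard deviation (n - 1)
    spread = max(values) - min(values)  # the range (max - min)
    if how == "NONE":
        return list(values)
    if how == "DEMEAN":
        return [v - mu for v in values]
    if how == "DESCALE":
        return [v / sd for v in values]
    if how == "STANDARDIZE":
        return [(v - mu) / sd for v in values]
    if how == "NORMALIZE":
        return [(v - mu) / spread for v in values]
    raise ValueError("unknown transform: " + how)

# A constant column would divide by zero here; this sketch does not handle it.
column = [1.0, 2.0, 3.0]
for option in ("NONE", "DEMEAN", "DESCALE", "STANDARDIZE", "NORMALIZE"):
    print(option, transform_column(column, option))
```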
H2OKMeansEstimator¶
- class h2o.estimators.kmeans.H2OKMeansEstimator(model_id=None, k=None, max_iterations=None, standardize=None, init=None, seed=None, nfolds=None, fold_assignment=None, user_points=None, ignored_columns=None, score_each_iteration=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, ignore_const_cols=None, checkpoint=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
Performs k-means clustering on an H2O dataset.
Parameters: model_id : str, optional
The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- k : int
The number of clusters. Must be between 1 and 1e7 inclusive. k may be omitted if the user specifies the initial centers in the init parameter; if k is also given in that case, it should equal the number of user-specified centers.
- max_iterations : int
The maximum number of iterations allowed. Must be between 0 and 1e6 inclusive.
- standardize : bool
Indicates whether the data should be standardized before running k-means.
- init : str
A character string that selects the initial set of k cluster centers. Possible values are “Random” for random initialization, “PlusPlus” for k-means++ initialization, or “Furthest” for initialization at the furthest point from each successive center. Additionally, the user may specify the initial centers as a matrix, data.frame, H2OFrame, or list of vectors. For matrices, data.frames, and H2OFrames, each row of the respective structure is an initial center. For lists of vectors, each vector is an initial center.
- seed : int, optional
Random seed used to initialize the cluster centroids.
- nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_assignment : str
Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random”, or “Modulo”.
- keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models.
- keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
Returns: An instance of H2OClusteringModel.
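One way to read the “Furthest” option of init is sketched below in plain Python: start from one point, then repeatedly add the point that lies furthest from the centers chosen so far. This is an interpretation of the one-line description above, not H2O's implementation. The function furthest_init, its first_index argument, and the sample points are hypothetical.

```python
import math

def furthest_init(points, k, first_index=0):
    """Pick k initial centers by repeatedly taking the point furthest
    from the centers chosen so far."""
    centers = [points[first_index]]
    while len(centers) < k:
        # Distance from each point to its nearest already-chosen center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        centers.append(points[nearest.index(max(nearest))])
    return centers

# Hypothetical 2-D points, used only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
centers = furthest_init(points, k=3)
print(centers)
```

In this sketch the first center is fixed by first_index so the result is deterministic; a real initializer would typically choose it at random using the seed.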
H2ONaiveBayesEstimator¶
- class h2o.estimators.naive_bayes.H2ONaiveBayesEstimator(model_id=None, laplace=None, threshold=None, eps=None, compute_metrics=None, balance_classes=None, max_after_balance_size=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, keep_cross_validation_fold_assignment=None, checkpoint=None)[source]¶
Bases: h2o.estimators.estimator_base.H2OEstimator
The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
Parameters: laplace : int
A non-negative number controlling Laplace smoothing. The default of zero disables smoothing.
- threshold : float
The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.
- eps : float
A threshold cutoff to deal with numeric instability. Must be positive.
- compute_metrics : bool
A logical value indicating whether model metrics should be computed. Set to False to reduce the runtime of the algorithm.
- nfolds : int, optional
Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_assignment : str
Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random”, or “Modulo”.
- keep_cross_validation_predictions : bool
Whether to keep the predictions of the cross-validation models.
- keep_cross_validation_fold_assignment : bool
Whether to keep the cross-validation fold assignment.
Returns: An instance of H2ONaiveBayesEstimator.
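The effect of the laplace parameter can be illustrated with the standard additive smoothing formula, (count + laplace) / (total + laplace * number_of_levels). This is a minimal sketch in plain Python. The function smoothed_probabilities and the sample counts are hypothetical, and the assumption that H2O uses exactly this formula is not confirmed by the text above.

```python
def smoothed_probabilities(level_counts, laplace=0.0):
    """Estimate P(level | class) from per-level counts with additive smoothing."""
    total = sum(level_counts.values()) + laplace * len(level_counts)
    return {level: (count + laplace) / total
            for level, count in level_counts.items()}

# Hypothetical counts of one categorical predictor within a single class.
counts = {"red": 3, "green": 0, "blue": 1}
unsmoothed = smoothed_probabilities(counts)             # laplace = 0 disables smoothing
smoothed = smoothed_probabilities(counts, laplace=1.0)  # no level keeps probability 0
print(unsmoothed)
print(smoothed)
```

Without smoothing, a level never seen with a class gets probability zero, which zeroes out the whole product of conditional probabilities for that class; a positive laplace value avoids this.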