Parameters of H2OAutoML

Affected Classes

ai.h2o.sparkling.ml.algos.H2OAutoML
ai.h2o.sparkling.ml.algos.classification.H2OAutoMLClassifier
ai.h2o.sparkling.ml.algos.regression.H2OAutoMLRegressor

Parameters

Each parameter also has a corresponding getter and setter method (e.g. labelCol -> getLabelCol(), setLabelCol(...)).
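
The naming convention can be sketched in plain Python. The helper accessor_names below is hypothetical and not part of Sparkling Water; it only illustrates how the getter and setter names follow from a parameter name by capitalizing its first letter.

```python
from typing import Tuple


def accessor_names(param_name: str) -> Tuple[str, str]:
    """Derive (getter, setter) method names from a parameter name."""
    if not param_name:
        raise ValueError("parameter name must not be empty")
    # Only the first character is upper-cased; the rest stays untouched.
    capitalized = param_name[0].upper() + param_name[1:]
    return f"get{capitalized}", f"set{capitalized}"


print(accessor_names("labelCol"))  # ('getLabelCol', 'setLabelCol')
```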

blendingDataFrame

This parameter is used for computing the predictions that serve as the training frame for the meta-learner. If provided, this triggers blending mode on the stacked ensemble training stage. Blending mode is faster than cross-validating the base learners (though these ensembles may not perform as well as the Super Learner ensemble). The parameter is not serializable!

Scala default value: null ; Python default value: None

ignoredCols

Names of columns to ignore for training.

Scala default value: null ; Python default value: None

leaderboardDataFrame

This parameter allows the user to specify a particular data frame to use to score and rank models on the leaderboard. This data frame will not be used for anything besides leaderboard scoring.

Scala default value: null ; Python default value: None

monotoneConstraints

A map of monotonicity constraints. Each key must correspond to a feature name, and each value must be 1 (increasing constraint) or -1 (decreasing constraint).

Scala default value: Map() ; Python default value: {}
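
A constraint map can be checked against the rule above before it is passed to the estimator. The following is a minimal sketch in plain Python; the function invalid_monotone_constraints and the column names in the usage line are hypothetical, and the check is not performed this way by the library itself.

```python
from typing import Dict, Iterable, List


def invalid_monotone_constraints(
    constraints: Dict[str, int], feature_names: Iterable[str]
) -> List[str]:
    """Return a list of human-readable problems; empty means the map is valid."""
    known = set(feature_names)
    problems = []
    for name, direction in constraints.items():
        if name not in known:
            # Every key must be the name of a feature column.
            problems.append(f"{name}: not a feature column")
        elif direction not in (1, -1):
            # Every value must be 1 or -1.
            problems.append(f"{name}: value must be 1 or -1, got {direction}")
    return problems


# Hypothetical feature names, for illustration only.
print(invalid_monotone_constraints({"age": 1, "price": -1}, ["age", "price"]))  # []
```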

balanceClasses

Balance training data class counts via over/under-sampling (for imbalanced data).

Scala default value: false ; Python default value: False

classSamplingFactors

Desired over/under-sampling ratios per class (in lexicographic order of the class labels). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balanceClasses to be enabled.

Scala default value: null ; Python default value: None
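
Because the factors are positional, it helps to see which factor lands on which class. The sketch below pairs factors with class labels sorted as plain strings; the helper factors_by_class is hypothetical, and it assumes that the library's lexicographic order matches Python's default string ordering.

```python
from typing import Dict, Iterable, Sequence


def factors_by_class(
    class_labels: Iterable[str], sampling_factors: Sequence[float]
) -> Dict[str, float]:
    """Pair each sampling factor with a class label in lexicographic order."""
    ordered = sorted(set(class_labels))
    if len(ordered) != len(sampling_factors):
        raise ValueError(
            f"expected {len(ordered)} factors, got {len(sampling_factors)}"
        )
    return dict(zip(ordered, sampling_factors))


# Hypothetical labels: the first factor applies to "maybe", not to "yes".
print(factors_by_class(["yes", "no", "maybe"], [1.0, 2.0, 0.5]))
```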

columnsToCategorical

List of columns to convert to categorical before modelling.

Scala default value: Array() ; Python default value: []

convertInvalidNumbersToNa

If set to ‘true’, the model converts invalid numbers to NA when making predictions.

Scala default value: false ; Python default value: False

convertUnknownCategoricalLevelsToNa

If set to ‘true’, the model converts unknown categorical levels to NA when making predictions.

Scala default value: false ; Python default value: False

customDistributionFunc

Reference to custom distribution, format: language:keyName=funcName.

Scala default value: null ; Python default value: None

customMetricFunc

Reference to custom evaluation function, format: language:keyName=funcName.

Scala default value: null ; Python default value: None
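
The reference format language:keyName=funcName shared by customDistributionFunc and customMetricFunc can be taken apart as follows. This is only a sketch of the string format; parse_func_reference and the example reference are hypothetical, and the library may validate references differently.

```python
from typing import NamedTuple


class FuncReference(NamedTuple):
    language: str
    key_name: str
    func_name: str


def parse_func_reference(reference: str) -> FuncReference:
    """Split 'language:keyName=funcName' into its three parts."""
    language, colon, rest = reference.partition(":")
    key_name, equals, func_name = rest.partition("=")
    if not (colon and equals and language and key_name and func_name):
        raise ValueError(
            f"expected 'language:keyName=funcName', got {reference!r}"
        )
    return FuncReference(language, key_name, func_name)


# Hypothetical reference string, for illustration only.
print(parse_func_reference("python:rmse_key=metrics.CustomRmse"))
```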

dataFrameSerializer

A full name of a serializer used for serialization and deserialization of Spark DataFrames to a JSON value within NullableDataFrameParam.

Default value: "ai.h2o.sparkling.utils.JSONDataFrameSerializer"

detailedPredictionCol

Column containing additional prediction details; its content depends on the model type.

Default value: "detailed_prediction"

distribution

Distribution function used by algorithms that support it; other algorithms use their defaults. Possible values are "AUTO", "bernoulli", "quasibinomial", "modified_huber", "multinomial", "ordinal", "gaussian", "poisson", "gamma", "tweedie", "huber", "laplace", "quantile", "fractionalbinomial", "negativebinomial", "custom".

Default value: "AUTO"

excludeAlgos

A list of algorithms to skip during the model-building phase. Possible values are "GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost".

Scala default value: null ; Python default value: None

exploitationRatio

The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase.

Default value: -1.0

exportCheckpointsDir

Path to a directory where every generated model will be stored.

Scala default value: null ; Python default value: None

featuresCols

Names of feature columns.

Scala default value: Array() ; Python default value: []

foldCol

Fold column (contains fold IDs) in the training frame. These assignments are used to create the folds for cross-validation of the models.

Scala default value: null ; Python default value: None

huberAlpha

Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

Default value: 0.9

includeAlgos

A list of algorithms to restrict to during the model-building phase. Possible values are "GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost".

Scala default value: Array("GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost") ; Python default value: ["GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost"]
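
Since includeAlgos and excludeAlgos accept only the listed names, a typo is easy to catch up front. The sketch below checks a list against the possible values documented here; unsupported_algos is a hypothetical helper, and treating the names as case-sensitive is an assumption.

```python
from typing import Iterable, List

# Possible values as listed for includeAlgos and excludeAlgos.
SUPPORTED_ALGOS = ("GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost")


def unsupported_algos(names: Iterable[str]) -> List[str]:
    """Return the names that are not among the documented possible values."""
    # Assumption: names are matched case-sensitively.
    return [name for name in names if name not in SUPPORTED_ALGOS]


print(unsupported_algos(["GBM", "RandomForest"]))  # ['RandomForest']
```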

keepBinaryModels

If set to true, all binary models created during execution of the fit method will be kept in the DKV (distributed key-value store) of the H2O-3 cluster.

Scala default value: false ; Python default value: False

keepCrossValidationFoldAssignment

Whether to keep cross-validation assignments.

Scala default value: false ; Python default value: False

keepCrossValidationModels

Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.

Scala default value: false ; Python default value: False

keepCrossValidationPredictions

Whether to keep the cross-validation predictions. This needs to be set to true if running the same AutoML object for repeated runs, because CV predictions are required to build additional Stacked Ensemble models in AutoML.

Scala default value: false ; Python default value: False

labelCol

Response column.

Default value: "label"

maxAfterBalanceSize

Maximum relative size of the training data after balancing class counts (defaults to 5.0 and can be less than 1.0). Requires balanceClasses to be enabled.

Scala default value: 5.0f ; Python default value: 5.0

maxModels

Maximum number of models to build (optional). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.

Default value: 0

maxRuntimeSecs

This parameter specifies the maximum time, in seconds, that the AutoML process will run for. If both maxRuntimeSecs and maxModels are specified, the AutoML run stops as soon as it hits either of these limits. If neither maxRuntimeSecs nor maxModels is specified, maxRuntimeSecs defaults to 3600 seconds (1 hour).

Default value: 0.0
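
The interplay between the model limit and the runtime limit described above can be sketched as follows. Both functions are hypothetical and only restate the documented rule; the sketch assumes that the default values 0 and 0.0 mean "not specified".

```python
from typing import Optional, Tuple

DEFAULT_RUNTIME_SECS = 3600.0  # fallback of one hour when no limit is given

Limits = Tuple[Optional[int], Optional[float]]


def effective_limits(max_models: int = 0, max_runtime_secs: float = 0.0) -> Limits:
    """Resolve the limits that apply; None means that limit is not active."""
    # Assumption: 0 and 0.0 (the defaults) stand for "not specified".
    models = max_models if max_models > 0 else None
    runtime = max_runtime_secs if max_runtime_secs > 0 else None
    if models is None and runtime is None:
        runtime = DEFAULT_RUNTIME_SECS
    return models, runtime


def budget_exhausted(models_built: int, elapsed_secs: float, limits: Limits) -> bool:
    """True as soon as either active limit has been reached."""
    max_models, max_runtime = limits
    hit_models = max_models is not None and models_built >= max_models
    hit_runtime = max_runtime is not None and elapsed_secs >= max_runtime
    return hit_models or hit_runtime


print(effective_limits())               # (None, 3600.0)
print(effective_limits(max_models=10))  # (10, None)
```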

maxRuntimeSecsPerModel

Maximum time, in seconds, to spend on each individual model (optional). Note that models constrained by a time budget are not guaranteed to be reproducible.

Default value: 0.0

nfolds

Number of folds for k-fold cross-validation (defaults to -1 (AUTO), otherwise it must be >=2 or use 0 to disable). Disabling prevents Stacked Ensembles from being built.

Default value: -1
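
The accepted nfolds values can be summarized in a small check. The function describe_nfolds is a hypothetical illustration of the rule stated above, not the library's own validation.

```python
def describe_nfolds(nfolds: int) -> str:
    """Classify an nfolds value according to the documented rule."""
    if nfolds == -1:
        return "auto"      # number of folds chosen automatically
    if nfolds == 0:
        return "disabled"  # no cross-validation, so no Stacked Ensembles
    if nfolds >= 2:
        return f"{nfolds}-fold"
    raise ValueError(f"nfolds must be -1, 0, or >= 2, got {nfolds}")


print(describe_nfolds(5))  # 5-fold
```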

predictionCol

Prediction column name.

Default value: "prediction"

projectName

Optional project name used to group models from multiple AutoML runs into a single Leaderboard; derived from the training data name if not specified.

Scala default value: null ; Python default value: None

quantileAlpha

Desired quantile for Quantile regression, must be between 0 and 1.

Default value: 0.5

seed

Seed for random number generator; set to a value other than -1 for reproducibility.

Scala default value: -1L ; Python default value: -1

sortMetric

Metric used to sort leaderboard. Possible values are "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "mean_per_class_error".

Default value: "AUTO"

splitRatio

Accepts values in the range [0, 1.0] that determine how large a part of the dataset is used for training and how large a part for validation. For example, 0.8 means 80% for training and 20% for validation. This parameter is ignored when validationDataFrame is set.

Default value: 1.0
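
As a rough illustration of the ratio, the sketch below computes the expected sizes of the two parts for a given row count. The helper expected_split_sizes is hypothetical; the actual split may be randomized, so real sizes can differ slightly from these numbers.

```python
from typing import Tuple


def expected_split_sizes(row_count: int, split_ratio: float) -> Tuple[int, int]:
    """Return the expected (training, validation) row counts."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError(f"split_ratio must be in [0, 1.0], got {split_ratio}")
    training = round(row_count * split_ratio)
    return training, row_count - training


print(expected_split_sizes(1000, 0.8))  # (800, 200)
print(expected_split_sizes(1000, 1.0))  # (1000, 0): default, no validation part
```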

stoppingMetric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Possible values are "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "anomaly_score", "custom", "custom_increasing".

Default value: "AUTO"

stoppingRounds

Early stopping based on convergence of the stopping metric (see stoppingMetric). Training stops if the simple moving average of length k of the stopping metric does not improve for k scoring events, where k = stoppingRounds (0 disables early stopping).

Default value: 3

stoppingTolerance

Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).

Default value: -1.0
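
The way stoppingRounds and stoppingTolerance work together can be sketched with a simplified moving-average check. This is an approximation under stated assumptions, not H2O's implementation: it handles only metrics where lower is better, assumes positive metric values, and takes an explicit positive tolerance rather than the -1.0 default. The function early_stop is hypothetical.

```python
from typing import Sequence


def early_stop(history: Sequence[float], stopping_rounds: int, tolerance: float) -> bool:
    """Return True when a lower-is-better metric has stopped improving."""
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or too few scoring events to compare
    recent = history[-2 * k:]
    # k + 1 simple moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, later = averages[0], averages[1:]
    # Improvement means the best later average beats the reference average
    # by at least the relative tolerance.
    improved = min(later) < reference * (1.0 - tolerance)
    return not improved


# Hypothetical metric histories, one value per scoring event.
print(early_stop([1.0, 0.9, 0.8, 0.7, 0.6, 0.5], stopping_rounds=3, tolerance=0.001))
print(early_stop([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], stopping_rounds=3, tolerance=0.001))
```

The first call describes a metric that is still falling, so training would continue; the second describes a flat metric, so training would stop.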

tweediePower

Tweedie power for Tweedie regression, must be between 1 and 2.

Default value: 1.5

validationDataFrame

A data frame dedicated to validation of the trained model. If the parameter is not set, a validation frame is created via the ‘splitRatio’ parameter. The parameter is not serializable!

Scala default value: null ; Python default value: None

weightCol

Weights column in the training frame, which specifies the row weights used in model training.

Scala default value: null ; Python default value: None

withContributions

Enables or disables generating a sub-column of detailedPredictionCol containing Shapley values of original features.

Scala default value: false ; Python default value: False

withLeafNodeAssignments

Enables or disables computation of leaf node assignments.

Scala default value: false ; Python default value: False

withStageResults

Enables or disables computation of stage results.

Scala default value: false ; Python default value: False