Parameters of H2OAutoML
Affected Classes
ai.h2o.sparkling.ml.algos.H2OAutoML
ai.h2o.sparkling.ml.algos.classification.H2OAutoMLClassifier
ai.h2o.sparkling.ml.algos.regression.H2OAutoMLRegressor
Parameters
Each parameter also has a corresponding getter and setter method (e.g., label -> getLabel(), setLabel(...)).
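The naming convention can be sketched in plain Python. The helper accessor_names below is hypothetical and not part of Sparkling Water; it only shows how the accessor names follow from a parameter name, assuming the first letter is capitalized and the rest is left unchanged.

```python
from typing import Tuple


def accessor_names(param: str) -> Tuple[str, str]:
    """Return the (getter, setter) method names for a parameter name.

    Hypothetical helper: it mirrors the documented naming convention only.
    """
    if not param:
        raise ValueError("parameter name must be non-empty")
    # Capitalize only the first character; keep the rest of the camelCase name.
    capitalized = param[0].upper() + param[1:]
    return "get" + capitalized, "set" + capitalized


print(accessor_names("label"))
print(accessor_names("maxRuntimeSecs"))
```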
- blendingDataFrame
This parameter is used for computing the predictions that serve as the training frame for the meta-learner. If provided, this triggers blending mode on the stacked ensemble training stage. Blending mode is faster than cross-validating the base learners (though these ensembles may not perform as well as the Super Learner ensemble). The parameter is not serializable!
Scala default value: null; Python default value: None
- ignoredCols
Names of columns to ignore for training.
Scala default value: null; Python default value: None
- leaderboardDataFrame
This parameter allows the user to specify a particular data frame to use to score and rank models on the leaderboard. This data frame will not be used for anything besides leaderboard scoring.
Scala default value: null; Python default value: None
- monotoneConstraints
Monotonicity constraints given as a map. Each key must correspond to a feature name, and each value must be 1 (increasing constraint) or -1 (decreasing constraint).
Scala default value: Map(); Python default value: {}
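As an illustration of the rule above, the following plain-Python sketch checks a constraints map before it would be passed to the setter. The function validate_monotone_constraints and the feature names are hypothetical and not part of the Sparkling Water API.

```python
from typing import Dict, Iterable


def validate_monotone_constraints(
    constraints: Dict[str, int], feature_names: Iterable[str]
) -> Dict[str, int]:
    """Check that every key is a known feature and every value is 1 or -1."""
    known = set(feature_names)
    for name, direction in constraints.items():
        if name not in known:
            raise ValueError("unknown feature in constraints: " + repr(name))
        if direction not in (1, -1):
            raise ValueError("constraint for " + repr(name) + " must be 1 or -1")
    return constraints


# Hypothetical feature names, for illustration only.
features = ["age", "price", "size"]
print(validate_monotone_constraints({"age": 1, "price": -1}, features))
```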
- balanceClasses
Balance training data class counts via over/under-sampling (for imbalanced data).
Scala default value: false; Python default value: False
- classSamplingFactors
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balanceClasses to be enabled.
Scala default value: null; Python default value: None
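Because the factors are positional, it can help to see which factor lands on which class. The sketch below pairs factors with class labels sorted lexicographically; factors_by_class is a hypothetical helper, and Python's default string ordering is assumed to be a close enough stand-in for the ordering H2O applies.

```python
from typing import Dict, Iterable, Sequence


def factors_by_class(
    class_labels: Iterable[str], sampling_factors: Sequence[float]
) -> Dict[str, float]:
    """Pair each sampling factor with a class label in lexicographic order."""
    ordered = sorted(set(class_labels))
    if len(ordered) != len(sampling_factors):
        raise ValueError("need exactly one sampling factor per class")
    return dict(zip(ordered, sampling_factors))


# Hypothetical labels: "no" sorts before "yes", so it receives the first factor.
print(factors_by_class(["yes", "no", "yes", "no", "no"], [1.0, 4.0]))
```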
- columnsToCategorical
List of columns to convert to categorical before modelling.
Scala default value: Array(); Python default value: []
- convertInvalidNumbersToNa
If set to true, the model converts invalid numbers to NA when making predictions.
Scala default value: false; Python default value: False
- convertUnknownCategoricalLevelsToNa
If set to true, the model converts unknown categorical levels to NA when making predictions.
Scala default value: false; Python default value: False
- customDistributionFunc
Reference to custom distribution, format: language:keyName=funcName.
Scala default value: null; Python default value: None
- customMetricFunc
Reference to custom evaluation function, format: language:keyName=funcName.
Scala default value: null; Python default value: None
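The reference format language:keyName=funcName used by customDistributionFunc and customMetricFunc can be taken apart as in the sketch below. The parser and the example reference string are hypothetical; they show the shape of the format only and are not how Sparkling Water resolves the reference.

```python
from typing import NamedTuple


class FuncRef(NamedTuple):
    language: str
    key_name: str
    func_name: str


def parse_func_ref(ref: str) -> FuncRef:
    """Split a 'language:keyName=funcName' reference into its three parts."""
    language, colon, rest = ref.partition(":")
    key_name, equals, func_name = rest.partition("=")
    if not (colon and equals and language and key_name and func_name):
        raise ValueError("expected format language:keyName=funcName, got " + repr(ref))
    return FuncRef(language, key_name, func_name)


# Hypothetical reference string, for illustration only.
print(parse_func_ref("python:my_metric_key=metrics.MyMetric"))
```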
- dataFrameSerializer
A full name of a serializer used for serialization and deserialization of Spark DataFrames to a JSON value within NullableDataFrameParam.
Default value: "ai.h2o.sparkling.utils.JSONDataFrameSerializer"
- detailedPredictionCol
Column containing additional prediction details; its content depends on the model type.
Default value: "detailed_prediction"
- distribution
Distribution function used by algorithms that support it; other algorithms use their defaults. Possible values are "AUTO", "bernoulli", "quasibinomial", "modified_huber", "multinomial", "ordinal", "gaussian", "poisson", "gamma", "tweedie", "huber", "laplace", "quantile", "fractionalbinomial", "negativebinomial", "custom".
Default value: "AUTO"
- excludeAlgos
A list of algorithms to skip during the model-building phase. Possible values are "GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost".
Scala default value: null; Python default value: None
- exploitationRatio
The budget ratio (between 0 and 1) dedicated to the exploitation (vs exploration) phase.
Default value: -1.0
- exportCheckpointsDir
Path to a directory where every generated model will be stored.
Scala default value: null; Python default value: None
- featuresCols
Names of the feature columns.
Scala default value: Array(); Python default value: []
- foldCol
Fold column (contains fold IDs) in the training frame. These assignments are used to create the folds for cross-validation of the models.
Scala default value: null; Python default value: None
- huberAlpha
Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
Default value: 0.9
- includeAlgos
A list of algorithms to restrict to during the model-building phase. Possible values are "GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost".
Scala default value: Array("GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost"); Python default value: ["GLM", "DRF", "GBM", "DeepLearning", "StackedEnsemble", "XGBoost"]
- keepBinaryModels
If set to true, all binary models created during execution of the fit method will be kept in the DKV (distributed key-value store) of the H2O-3 cluster.
Scala default value: false; Python default value: False
- keepCrossValidationFoldAssignment
Whether to keep cross-validation assignments.
Scala default value: false; Python default value: False
- keepCrossValidationModels
Whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster.
Scala default value: false; Python default value: False
- keepCrossValidationPredictions
Whether to keep the cross-validation predictions. This needs to be set to true when running the same AutoML object repeatedly, because CV predictions are required to build additional Stacked Ensemble models in AutoML.
Scala default value: false; Python default value: False
- labelCol
Response column.
Default value: "label"
- maxAfterBalanceSize
Maximum relative size of the training data after balancing class counts (defaults to 5.0 and can be less than 1.0). Requires balanceClasses to be enabled.
Scala default value: 5.0f; Python default value: 5.0
- maxModels
Maximum number of models to build (optional). Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.
Default value: 0
- maxRuntimeSecs
Maximum time, in seconds, that the AutoML process will run. If both maxRuntimeSecs and maxModels are specified, the AutoML run stops as soon as it hits either limit. If neither is specified, maxRuntimeSecs defaults to 3600 seconds (1 hour).
Default value: 0.0
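The interplay between the time limit and the model-count limit can be sketched as follows. This is a plain-Python illustration of the rule described above, not H2O's scheduling code; it assumes that the default value 0 means "not specified", and the function names are hypothetical.

```python
def effective_runtime_secs(max_runtime_secs: float = 0.0, max_models: int = 0) -> float:
    """Return the time limit that applies, or 0.0 if there is none.

    Assumption: a value of 0 means the limit was not specified.
    """
    if max_runtime_secs <= 0 and max_models <= 0:
        return 3600.0  # neither limit given: fall back to one hour
    return max(max_runtime_secs, 0.0)


def should_stop(elapsed_secs: float, models_built: int,
                max_runtime_secs: float = 0.0, max_models: int = 0) -> bool:
    """Stop as soon as either the time limit or the model limit is reached."""
    limit = effective_runtime_secs(max_runtime_secs, max_models)
    time_up = limit > 0 and elapsed_secs >= limit
    models_done = max_models > 0 and models_built >= max_models
    return time_up or models_done


print(should_stop(elapsed_secs=120.0, models_built=5, max_runtime_secs=600.0, max_models=5))
```

Under this reading, a run with only the model-count limit set has no time limit at all, which matches the reproducibility note on maxModels.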
- maxRuntimeSecsPerModel
Maximum time to spend on each individual model (optional). Note that models constrained by a time budget are not guaranteed to be reproducible.
Default value: 0.0
- nfolds
Number of folds for k-fold cross-validation (defaults to -1 (AUTO), otherwise it must be >=2 or use 0 to disable). Disabling prevents Stacked Ensembles from being built.
Default value: -1
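The accepted values of nfolds can be summarized with a small check. The function describe_nfolds is a hypothetical plain-Python sketch of the rule stated above (-1 for AUTO, 0 to disable, otherwise at least 2); it does not call into H2O.

```python
def describe_nfolds(nfolds: int) -> str:
    """Classify an nfolds value according to the documented rule."""
    if nfolds == -1:
        return "auto"
    if nfolds == 0:
        # No cross-validation, so Stacked Ensembles cannot be built.
        return "disabled"
    if nfolds >= 2:
        return "k-fold"
    raise ValueError("nfolds must be -1 (AUTO), 0 (disabled), or >= 2")


for value in (-1, 0, 5):
    print(value, describe_nfolds(value))
```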
- predictionCol
Prediction column name
Default value: "prediction"
- projectName
Optional project name used to group models from multiple AutoML runs into a single Leaderboard; derived from the training data name if not specified.
Scala default value: null; Python default value: None
- quantileAlpha
Desired quantile for Quantile regression, must be between 0 and 1.
Default value: 0.5
- seed
Seed for random number generator; set to a value other than -1 for reproducibility.
Scala default value: -1L; Python default value: -1
- sortMetric
Metric used to sort the leaderboard. Possible values are "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "mean_per_class_error".
Default value: "AUTO"
- splitRatio
Accepts values in the range [0, 1.0] that determine what fraction of the dataset is used for training, with the remainder used for validation. For example, 0.8 means 80% training and 20% validation. This parameter is ignored when validationDataFrame is set.
Default value: 1.0
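To make the ratio concrete, the sketch below computes the expected training and validation row counts for a given ratio. It is a plain-Python illustration with a hypothetical helper name; the actual split may be randomized, in which case the real counts would only approximate these numbers.

```python
from typing import Tuple


def split_sizes(n_rows: int, split_ratio: float) -> Tuple[int, int]:
    """Return the expected (training, validation) row counts for a ratio."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split ratio must be in the range [0, 1.0]")
    n_train = round(n_rows * split_ratio)
    return n_train, n_rows - n_train


print(split_sizes(1000, 0.8))  # 80% training, 20% validation
print(split_sizes(1000, 1.0))  # default: everything goes to training
```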
- stoppingMetric
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Possible values are "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "anomaly_score", "AUUC", "ATE", "ATT", "ATC", "qini", "custom", "custom_increasing".
Default value: "AUTO"
- stoppingRounds
Early stopping based on convergence of the stopping metric. Training stops if the simple moving average of length k of the stopping metric does not improve for k scoring events, where k = stoppingRounds (0 disables early stopping).
Default value: 3
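One plausible reading of the moving-average rule is sketched below for a metric where lower is better (such as logloss). This is an illustration under stated assumptions, not H2O's implementation: it compares the best of the last k moving averages against the moving average that precedes them, takes a non-negative relative tolerance as an explicit argument, and assumes positive metric values. The function name is hypothetical.

```python
from typing import Sequence


def should_stop_early(history: Sequence[float], k: int, tolerance: float = 0.0) -> bool:
    """Decide whether to stop, for a metric where lower values are better."""
    if k <= 0:
        return False  # 0 disables early stopping
    if len(history) < 2 * k:
        return False  # not enough scoring events to compare yet
    recent = list(history[-2 * k:])
    # k + 1 simple moving averages of length k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]
    best_recent = min(averages[1:])
    # Count as improvement only if the relative gain exceeds the tolerance.
    improved = best_recent < reference * (1.0 - tolerance)
    return not improved


flat = [1.0, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
print(should_stop_early(flat, k=3))
print(should_stop_early(improving, k=3))
```

Averaging over k events rather than comparing single scores keeps one noisy scoring event from triggering or preventing a stop.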
- stoppingTolerance
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much).
Default value: -1.0
- tweediePower
Tweedie power for Tweedie regression, must be between 1 and 2.
Default value: 1.5
- validationDataFrame
A data frame dedicated to validation of the trained model. If the parameter is not set, a validation frame is created via the splitRatio parameter. The parameter is not serializable!
Scala default value: null; Python default value: None
- weightCol
Weights column in the training frame, which specifies the row weights used in model training.
Scala default value: null; Python default value: None
- withContributions
Enables or disables generating a sub-column of detailedPredictionCol containing Shapley values of original features.
Scala default value: false; Python default value: False
- withLeafNodeAssignments
Enables or disables computation of leaf node assignments.
Scala default value: false; Python default value: False
- withStageResults
Enables or disables computation of stage results.
Scala default value: false; Python default value: False