R/targetencoder.R
h2o.targetencoder.Rd
Transformation of a categorical variable with a mean value of the target variable
h2o.targetencoder(
    x,
    y,
    training_frame,
    model_id = NULL,
    fold_column = NULL,
    columns_to_encode = NULL,
    keep_original_categorical_columns = TRUE,
    blending = FALSE,
    inflection_point = 10,
    smoothing = 20,
    data_leakage_handling = c("leave_one_out", "k_fold", "none", "LeaveOneOut", "KFold", "None"),
    noise = 0.01,
    seed = -1,
    ...
)
Argument | Description |
---|---|
x | (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. |
y | The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model. |
training_frame | Id of the training data frame. |
model_id | Destination id for this model; auto-generated if not specified. |
fold_column | Column with cross-validation fold index assignment per observation. |
columns_to_encode | List of categorical columns or groups of categorical columns to encode. When groups of columns are specified, each group is encoded as a single column (interactions are created internally). |
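keep_original_categorical_columns | Logical. If TRUE, the original non-encoded categorical columns remain in the result frame. Defaults to TRUE. |
blending | Logical. If TRUE, enables blending of posterior probabilities (computed for a given categorical value) with prior probabilities (computed on the entire set). This mitigates the effect of categorical values with small cardinality. The blending effect can be tuned using the `inflection_point` and `smoothing` parameters. Defaults to FALSE. |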
inflection_point | Inflection point of the sigmoid used to blend probabilities (see `blending` parameter). If a given categorical value appears fewer than `inflection_point` times in a data sample, the influence of the posterior probability will be smaller than that of the prior. Defaults to 10. |
smoothing | The smoothing factor corresponds to the inverse of the slope at the inflection point of the sigmoid used to blend probabilities (see `blending` parameter). As smoothing tends towards 0, the sigmoid used for blending turns into a Heaviside step function. Defaults to 20. |
data_leakage_handling | Data leakage handling strategy used to generate the encoding. Supported options are: 1) "none" (default) - no holdout, using the entire training frame. 2) "leave_one_out" - current row's response value is subtracted from the per-level frequencies pre-calculated on the entire training frame. 3) "k_fold" - encodings for a fold are generated based on out-of-fold data. Must be one of: "leave_one_out", "k_fold", "none", "LeaveOneOut", "KFold", "None". Defaults to None. |
noise | The amount of noise to add to the encoded column. Use 0 to disable noise, and -1 (=AUTO) to let the algorithm determine a reasonable amount of noise. Defaults to 0.01. |
seed | Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number). |
... | Mainly used for backwards compatibility, to allow deprecated parameters. |
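
The interplay of `blending`, `inflection_point`, and `smoothing` can be illustrated in plain R. The sketch below is not H2O code: `blend_weight` and `blend_encoding` are hypothetical helpers, and the sigmoid form `1 / (1 + exp((inflection_point - n) / smoothing))` is an assumption chosen to be consistent with the argument descriptions above. H2O's internal computation may differ in detail.

```r
# Hypothetical helpers for illustration only; not part of the h2o package.

# Assumed sigmoid blending weight for a level observed n times.
# The weight is 0.5 when n equals inflection_point.
blend_weight <- function(n, inflection_point = 10, smoothing = 20) {
  1 / (1 + exp((inflection_point - n) / smoothing))
}

# Blended encoding: weighted mix of the per-level target mean (posterior)
# and the global target mean (prior).
blend_encoding <- function(posterior, prior, n,
                           inflection_point = 10, smoothing = 20) {
  lambda <- blend_weight(n, inflection_point, smoothing)
  lambda * posterior + (1 - lambda) * prior
}

# A level seen exactly inflection_point times weighs posterior and prior equally.
w_at_inflection <- blend_weight(10)

# A rare level (n = 2) is pulled towards the prior;
# a frequent level (n = 200) stays close to its own mean.
rare     <- blend_encoding(posterior = 1.0, prior = 0.4, n = 2)
frequent <- blend_encoding(posterior = 1.0, prior = 0.4, n = 200)

# With a very small smoothing value the weight behaves like a step function
# around the inflection point.
step_like <- blend_weight(c(9, 11), smoothing = 1e-3)
```

With the default settings, a level seen only twice receives a weight below 0.5, so its encoding sits closer to the prior than to its own mean, which is the behaviour the `inflection_point` description refers to.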
if (FALSE) {
    library(h2o)
    h2o.init()

    # Import the titanic dataset
    f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
    titanic <- h2o.importFile(f)

    # Set response as a factor
    response <- "survived"
    titanic[response] <- as.factor(titanic[response])

    # Split the dataset into train and test
    splits <- h2o.splitFrame(data = titanic, ratios = .8, seed = 1234)
    train <- splits[[1]]
    test <- splits[[2]]

    # Choose which columns to encode
    encode_columns <- c("home.dest", "cabin", "embarked")

    # Train a TE model
    te_model <- h2o.targetencoder(x = encode_columns,
                                  y = response,
                                  training_frame = train,
                                  fold_column = "pclass",
                                  data_leakage_handling = "KFold")

    # New target encoded train and test sets
    train_te <- h2o.transform(te_model, train)
    test_te <- h2o.transform(te_model, test)
}
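
To make the three `data_leakage_handling` strategies concrete, here is a minimal base R sketch on a toy vector. It does not call H2O: `encode_none`, `encode_loo`, and `encode_kfold` are hypothetical helpers, and falling back to the global mean for levels with no other rows available is an assumption of this sketch, not a documented H2O behaviour. Blending and noise are left out.

```r
# Hypothetical base R helpers for illustration only; not part of the h2o package.

# "none": per-level mean of the target over the whole frame.
encode_none <- function(x, y) {
  ave(y, x, FUN = mean)
}

# "leave_one_out": remove the current row's target value from the
# per-level sum and count before taking the mean.
encode_loo <- function(x, y) {
  level_sum <- ave(y, x, FUN = sum)
  level_n   <- ave(y, x, FUN = length)
  prior     <- mean(y)
  # Assumption: a level seen only once falls back to the global mean.
  ifelse(level_n > 1, (level_sum - y) / (level_n - 1), prior)
}

# "k_fold": rows of a fold are encoded with level means computed
# from the rows of all other folds.
encode_kfold <- function(x, y, fold) {
  prior <- mean(y)
  out <- numeric(length(y))
  for (f in unique(fold)) {
    in_fold   <- fold == f
    oof_means <- tapply(y[!in_fold], x[!in_fold], mean)
    idx <- match(as.character(x[in_fold]), names(oof_means))
    enc <- as.vector(oof_means)[idx]
    # Assumption: levels absent from the out-of-fold rows get the global mean.
    enc[is.na(enc)] <- prior
    out[in_fold] <- enc
  }
  out
}

# Toy data: one categorical column, a binary target, and a fold assignment.
x    <- c("a", "a", "a", "b", "b", "c")
y    <- c(1, 0, 1, 0, 1, 1)
fold <- c(1, 2, 3, 1, 2, 3)

enc_none  <- encode_none(x, y)
enc_loo   <- encode_loo(x, y)
enc_kfold <- encode_kfold(x, y, fold)
```

In `enc_none` every row of a level gets the same value, which includes that row's own target. In `enc_loo` and `enc_kfold` the value for a row never uses that row's target, which is the purpose of the leakage handling options.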