Transformation of a categorical variable with a mean value of the target variable

h2o.targetencoder(
  x,
  y,
  training_frame,
  model_id = NULL,
  fold_column = NULL,
  blending = FALSE,
  k = 10,
  f = 20,
  data_leakage_handling = c("None", "KFold", "LeaveOneOut"),
  noise_level = 0.01,
  seed = -1
)

Arguments

x	(Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y	The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, a regression model is trained; otherwise, a classification model is trained.
training_frame	Id of the training data frame.
model_id	Destination id for this model; auto-generated if not specified.
fold_column	Column with cross-validation fold index assignment per observation.
blending	`Logical`. Whether blending is enabled. Defaults to FALSE.
k	Inflection point. Used for blending only; has no effect unless blending is enabled through the 'blending' parameter. Defaults to 10.
f	Smoothing. Used for blending only; has no effect unless blending is enabled through the 'blending' parameter. Defaults to 20.
data_leakage_handling	Data leakage handling strategy. Must be one of: "None", "KFold", "LeaveOneOut". Defaults to "None".
noise_level	Amount of random noise added to the encoded values. Defaults to 0.01.
seed	Seed for random numbers (affects the stochastic parts of the algorithm, which may or may not be enabled by default). Defaults to -1 (time-based random number).
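
The roles of `k` and `f` are easier to see in a small numeric sketch. The snippet below assumes the sigmoid weighting commonly used for target-encoding blending, where a level's own mean is mixed with the global (prior) mean according to how many rows the level has. The helper `blend_encoding` and its argument names are hypothetical and not part of the h2o package; the exact computation H2O performs internally may differ.

```r
# Hypothetical helper (not an h2o function): blends a level's target mean
# with the global prior mean. Assumes the weight
#   lambda = 1 / (1 + exp(-(n - k) / f))
# where n is the number of rows observed for the level.
blend_encoding <- function(level_mean, level_n, prior_mean, k = 10, f = 20) {
  lambda <- 1 / (1 + exp(-(level_n - k) / f))
  lambda * level_mean + (1 - lambda) * prior_mean
}

# A level seen exactly k times puts equal weight on both means.
at_inflection <- blend_encoding(level_mean = 0.9, level_n = 10, prior_mean = 0.4)
print(at_inflection)  # 0.65

# A rare level is pulled toward the prior mean ...
rare <- blend_encoding(level_mean = 0.9, level_n = 1, prior_mean = 0.4)
# ... while a frequent level stays close to its own mean.
frequent <- blend_encoding(level_mean = 0.9, level_n = 500, prior_mean = 0.4)
print(c(rare = rare, frequent = frequent))
```

Under this assumption, `k` is the row count at which both means are weighted equally, and a larger `f` makes the transition between the prior mean and the level mean more gradual.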

Examples

if (FALSE) {
library(h2o)
h2o.init()
# Import the Titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)

# Set response as a factor
response <- "survived"
titanic[response] <- as.factor(titanic[response])

# Split the dataset into train and test
splits <- h2o.splitFrame(data = titanic, ratios = .8, seed = 1234)
train <- splits[[1]]
test <- splits[[2]]

# Choose which columns to encode
encode_columns <- c("home.dest", "cabin", "embarked")

# Train a TE model
te_model <- h2o.targetencoder(x = encode_columns,
                              y = response, 
                              training_frame = train,
                              fold_column = "pclass", 
                              data_leakage_handling = "KFold")

# Apply the trained encoder to produce target-encoded train and test sets
train_te <- h2o.transform(te_model, train)
test_te <- h2o.transform(te_model, test)
}
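
To illustrate what the three `data_leakage_handling` strategies mean conceptually, here is a minimal base-R sketch on toy data. It does not call h2o, and it omits blending and noise; the data frame and its column names (`cabin`, `target`, `fold`, `te_none`, `te_loo`, `te_kfold`) are made up for the illustration, and H2O's actual implementation may differ in its details.

```r
# Toy data: one categorical column, a binary target, and a fold assignment.
df <- data.frame(
  cabin  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  target = c(1, 0, 1, 1, 0, 0, 1, 0),
  fold   = c(1, 1, 2, 2, 1, 1, 2, 2)
)

# "None": every row of a level receives the level's overall target mean,
# so each row's own target leaks into its encoding.
df$te_none <- ave(df$target, df$cabin, FUN = mean)

# "LeaveOneOut": the row's own target is excluded from its level mean.
# (A level with a single row would divide by zero; not handled in this sketch.)
level_sum <- ave(df$target, df$cabin, FUN = sum)
level_n   <- ave(df$target, df$cabin, FUN = length)
df$te_loo <- (level_sum - df$target) / (level_n - 1)

# "KFold": a row's encoding uses only rows of the same level in other folds.
df$te_kfold <- vapply(seq_len(nrow(df)), function(i) {
  others <- df$cabin == df$cabin[i] & df$fold != df$fold[i]
  mean(df$target[others])
}, numeric(1))

print(df)
```

In the h2o example above, `fold_column = "pclass"` plays the role of the `fold` column in this sketch.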