Target Encoding in Sparkling Water
Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features, based on the mean of the label (target) column values calculated for each categorical level.
An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):
Town           Label  Town_te
Chennai        1      0.8
Prague         0      0.286
Chennai        0      0.8
Mountain View  1      0.714
Chennai        1      0.8
Prague         1      0.286
Mountain View  1      0.714
Chennai        1      0.8
Mountain View  0      0.714
Prague         1      0.286
Prague         0      0.286
Mountain View  1      0.714
Prague         0      0.286
Mountain View  0      0.714
Chennai        1      0.8
Mountain View  1      0.714
Prague         0      0.286
Prague         0      0.286
Mountain View  1      0.714
Target Encoding can help improve the accuracy of machine learning algorithms when high-cardinality columns are used as features during the training phase.
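The values in the Town_te column can be reproduced with a few lines of plain Python. This is a minimal sketch of the underlying idea (a per-category mean of the label, with no holdout, blending, or noise), not the Sparkling Water implementation; the function name `target_means` is hypothetical.

```python
from collections import defaultdict


def target_means(categories, labels):
    """Return {category: mean label}, computed over all rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# The (Town, Label) pairs from the example table.
rows = [
    ("Chennai", 1), ("Prague", 0), ("Chennai", 0), ("Mountain View", 1),
    ("Chennai", 1), ("Prague", 1), ("Mountain View", 1), ("Chennai", 1),
    ("Mountain View", 0), ("Prague", 1), ("Prague", 0), ("Mountain View", 1),
    ("Prague", 0), ("Mountain View", 0), ("Chennai", 1), ("Mountain View", 1),
    ("Prague", 0), ("Prague", 0), ("Mountain View", 1),
]
towns, labels = zip(*rows)

means = target_means(towns, labels)
# Encoded column: each row receives the mean label of its town.
town_te = [round(means[town], 3) for town in towns]
```

Chennai appears in 5 rows, 4 of them with label 1, which gives 0.8; Prague (2 of 7) gives 0.286 and Mountain View (5 of 7) gives 0.714.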
Parameters
- labelCol
- The name of the label column
- inputCols
- Names of the columns that will be transformed by Target Encoding
- outputCols
- Names of columns representing the result of target encoding. If the parameter is not specified by the user, the output column names are automatically derived from inputCols by appending the suffix _te.
- holdoutStrategy
- A strategy deciding which records are excluded when calculating the target average on the training dataset. Options:
  - None
  - All rows are considered for the calculation.
  - LeaveOneOut
  - All rows except the row the calculation is made for.
  - KFold
  - Only out-of-fold data is considered. (This option requires foldCol to be set.)
- foldCol
- The name of a column determining folds when the KFold holdoutStrategy is applied.
- blendedAvgEnabled
- If set, the target average becomes a weighted average of the posterior average for a given categorical level and the prior average of the target. The weight is determined by the size of the given group that the row belongs to. By default, the blended average is disabled.
- blendedAvgInflectionPoint
- A parameter of the blended average. The larger the value, the larger (relative to the overall dataset size) the groups for which the prior average is still a component of the weighted average. The default value is 10.
- blendedAvgSmoothing
- A parameter of the blended average. It controls the rate of transition between the posterior average and the prior average. The default value is 20.
- noise
- The amount of random noise added to the output values of a training dataset to prevent overfitting of the algorithm that consumes the encoded features. The default value is 0.01. Noise addition can be disabled by setting the parameter to 0.0.
- noiseSeed
- A seed of the generator producing the random noise.
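The blending parameters can be illustrated with a small numeric sketch. It assumes the sigmoid weighting commonly used for blended target encoding, weight = 1 / (1 + exp((inflectionPoint - n) / smoothing)), where n is the size of the group. The exact formula used by Sparkling Water is not stated here, so treat the function below (and its hypothetical name `blended_average`) as an assumption rather than the library's implementation.

```python
import math


def blended_average(posterior, prior, group_size,
                    inflection_point=10.0, smoothing=20.0):
    """Blend a group's posterior mean with the global prior mean.

    Assumed weighting: a sigmoid of the group size centred at
    inflection_point; smoothing controls how gradual the transition is.
    """
    weight = 1.0 / (1.0 + math.exp((inflection_point - group_size) / smoothing))
    return weight * posterior + (1.0 - weight) * prior


prior = 11 / 19      # overall label mean in the Town example (11 ones in 19 rows)
posterior = 4 / 5    # label mean for Chennai

small_group = blended_average(posterior, prior, group_size=5)
at_inflection = blended_average(posterior, prior, group_size=10)
large_group = blended_average(posterior, prior, group_size=1000)
```

Under this assumed formula, a group whose size equals the inflection point is weighted equally between its own mean and the prior, smaller groups lean toward the prior, and very large groups are encoded almost entirely by their own mean.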
Using Target Encoder
Sparkling Water exposes the Target Encoder API in Scala and Python. Before using Target Encoder, start and prepare the environment:
Scala
First, let’s start Sparkling Shell (use :paste mode when you try to copy-paste examples):
./bin/sparkling-shell
Start an H2O cluster inside the Spark environment:
import org.apache.spark.h2o._
import java.net.URI
val hc = H2OContext.getOrCreate(spark)
Parse the data using H2O and convert it to a Spark DataFrame:
val frame = new H2OFrame(new URI("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"))
val sparkDF = hc.asDataFrame(frame).withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Python
First, let’s start PySparkling Shell:
./bin/pysparkling
Start an H2O cluster inside the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
Parse the data using H2O and convert it to a Spark DataFrame:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.as_spark_frame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Target Encoder in ML Pipeline
Target Encoder in Sparkling Water is implemented as a regular estimator and can therefore be used as a stage in a Spark ML Pipeline.
Scala
Let’s create an instance of Target Encoder and configure it:
import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setLabelCol("CAPSULE")
Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:
import ai.h2o.sparkling.ml.algos.H2OGBM
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBM()
  .setFeaturesCols(targetEncoder.getOutputCols())
  .setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))
Train the created pipeline:
val pipelineModel = pipeline.fit(trainingDF)
Make predictions. The fitted Target Encoder model is applied as part of the pipeline:
pipelineModel.transform(testingDF).show()
The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:
import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Python
Let’s create an instance of Target Encoder and configure it:
from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
    .setInputCols(["RACE", "DPROS", "DCAPS"])\
    .setLabelCol("CAPSULE")
Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:
from pysparkling.ml import H2OGBM
from pyspark.ml import Pipeline
gbm = H2OGBM()\
    .setFeaturesCols(targetEncoder.getOutputCols())\
    .setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])
Train the created pipeline:
pipelineModel = pipeline.fit(trainingDF)
Make predictions. The fitted Target Encoder model is applied as part of the pipeline:
pipelineModel.transform(testingDF).show()
The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:
from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Standalone Target Encoder
Target Encoder parameters such as noise and holdoutStrategy are relevant only for a training dataset. The transform method of H2OTargetEncoderModel therefore has to treat the training dataset differently from other datasets and ignore these parameters for the latter.
When Target Encoder is part of an ML pipeline, this distinction is made automatically. If you train an algorithm without an ML pipeline, call the transformTrainingDataset method on the Target Encoder model to encode the training dataset and get appropriate results.
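To see why training data needs separate treatment, consider a sketch of the LeaveOneOut strategy in plain Python. It is an illustration under assumptions, not the library code: the helper names are hypothetical, and falling back to the prior for single-row groups is a choice made only for this sketch.

```python
from collections import defaultdict


def fit_stats(categories, labels):
    """Collect per-category [sum, count] and the global prior mean."""
    stats = defaultdict(lambda: [0.0, 0])
    for category, label in zip(categories, labels):
        stats[category][0] += label
        stats[category][1] += 1
    prior = sum(labels) / len(labels)
    return dict(stats), prior


def encode_training_row(category, label, stats, prior):
    """Leave-one-out: exclude the row's own label from its group mean."""
    total, count = stats[category]
    if count <= 1:
        return prior  # assumption: nothing else in the group, use the prior
    return (total - label) / (count - 1)


def encode_other_row(category, stats, prior):
    """Non-training data: use the full group mean (no holdout applied)."""
    if category not in stats:
        return prior
    total, count = stats[category]
    return total / count


# The five Chennai rows from the introductory example.
towns = ["Chennai"] * 5
labels = [1, 0, 1, 1, 1]
stats, prior = fit_stats(towns, labels)

train_encoded = [encode_training_row(t, y, stats, prior)
                 for t, y in zip(towns, labels)]
test_encoded = encode_other_row("Chennai", stats, prior)
```

In this sketch the training rows receive 0.75 or 1.0, depending on their own label, while a Chennai row in a testing dataset receives the full group mean of 0.8. Applying the training-time logic to testing data would be meaningless, because the label is unknown there.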
Limitations and Edge Cases
- The label column can’t contain more than two unique categorical values.
- The label column can’t contain any null values.
- Input columns transformed by Target Encoder can contain null values.
- Novel values in a testing/production data set and null values belong to the same category. In other words, Target Encoder returns a prior average for all novel values in case a given column of the training dataset did not contain any null values. Otherwise, the posterior average of rows having null values in the column is returned.
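The rule for novel and null values can be sketched as a lookup with a fallback. This is an illustration of the behaviour described above, using hypothetical names and made-up averages; Python's `None` stands in for null.

```python
def encode_value(value, posteriors, prior):
    """Look up the encoding for one value of an input column.

    posteriors maps each level seen in training (None for null) to its
    posterior average; prior is the overall label average.
    """
    if value in posteriors:
        return posteriors[value]
    # Novel values share a category with null values: use the null
    # posterior if training data contained nulls, otherwise the prior.
    return posteriors.get(None, prior)


prior = 0.579  # illustrative overall label average
without_nulls = {"Chennai": 0.8, "Prague": 0.286}
with_nulls = {"Chennai": 0.8, "Prague": 0.286, None: 0.5}  # 0.5 is made up

novel_no_nulls = encode_value("Brno", without_nulls, prior)
novel_with_nulls = encode_value("Brno", with_nulls, prior)
null_no_nulls = encode_value(None, without_nulls, prior)
known_value = encode_value("Chennai", with_nulls, prior)
```

Here "Brno" never appeared in training, so it is encoded with the prior when the training column had no nulls, and with the null group's posterior average otherwise.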