Target Encoding in Sparkling Water¶
Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features based on the mean calculated from the values of the label (target) column. See also Parameters of H2OTargetEncoder.
An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):
Town          | Label | Town_te
------------- | ----- | -------
Chennai       | 1     | 0.8
Prague        | 0     | 0.286
Chennai       | 0     | 0.8
Mountain View | 1     | 0.714
Chennai       | 1     | 0.8
Prague        | 1     | 0.286
Mountain View | 1     | 0.714
Chennai       | 1     | 0.8
Mountain View | 0     | 0.714
Prague        | 1     | 0.286
Prague        | 0     | 0.286
Mountain View | 1     | 0.714
Prague        | 0     | 0.286
Mountain View | 0     | 0.714
Chennai       | 1     | 0.8
Mountain View | 1     | 0.714
Prague        | 0     | 0.286
Prague        | 0     | 0.286
Mountain View | 1     | 0.714
Target Encoding can help improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during the training phase.
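To make the computation concrete, here is a minimal Scala sketch in plain Spark (not the H2OTargetEncoder API, which additionally supports blending, noise, and holdout strategies) that derives the encoded value as the per-category mean of the label for a few rows of the table above:

import spark.implicits._
import org.apache.spark.sql.functions.avg

val df = Seq(
  ("Chennai", 1), ("Prague", 0), ("Chennai", 0),
  ("Mountain View", 1), ("Chennai", 1), ("Prague", 1)
).toDF("Town", "Label")

// The encoding of a category is the mean of the label within that category
val encodings = df.groupBy("Town").agg(avg("Label").alias("Town_te"))
df.join(encodings, Seq("Town")).show()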
Using Target Encoder¶
Sparkling Water exposes an API for Target Encoder in Scala and Python. Before we start using Target Encoder, we need to prepare the environment:
Scala
First, let’s start Sparkling Shell (use :paste mode when you try to copy-paste examples):
./bin/sparkling-shell
Start an H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
Download the dataset and parse it into a Spark DataFrame:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Python
First, let’s start PySparkling Shell:
./bin/pysparkling
Start an H2O cluster inside the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to a Spark DataFrame:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Target Encoder in ML Pipeline¶
Target Encoder in Sparkling Water is implemented as a regular estimator and thus can be placed as a stage in a Spark ML pipeline.
Scala
Let’s create an instance of Target Encoder and configure it:
import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
.setInputCols(Array("RACE", "DPROS", "DCAPS"))
.setProblemType("Classification")
.setLabelCol("CAPSULE")
Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:
import ai.h2o.sparkling.ml.algos.classification.H2OGBMClassifier
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBMClassifier()
.setFeaturesCols(targetEncoder.getOutputCols())
.setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))
Train the created pipeline:
val pipelineModel = pipeline.fit(trainingDF)
Make predictions with the pipeline model, which includes the Target Encoder model:
pipelineModel.transform(testingDF).show()
The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:
import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Python
Let’s create an instance of Target Encoder and configure it:
from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
.setInputCols(["RACE", "DPROS", "DCAPS"])\
.setLabelCol("CAPSULE")\
.setProblemType("Classification")
Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:
from pysparkling.ml import H2OGBMClassifier
from pyspark.ml import Pipeline
gbm = H2OGBMClassifier()\
.setFeaturesCols(targetEncoder.getOutputCols())\
.setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])
Train the created pipeline:
pipelineModel = pipeline.fit(trainingDF)
Make predictions with the pipeline model, which includes the Target Encoder model:
pipelineModel.transform(testingDF).show()
The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:
from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Standalone Target Encoder¶
Target Encoder's parameters like noise and holdoutStrategy are relevant only for the training dataset. Thus, the transform method of H2OTargetEncoderModel has to treat the training dataset differently from other datasets and, for the latter, ignore the mentioned parameters. When Target Encoder is inside an ML pipeline, this differentiation happens automatically. But if a user decides to train an algorithm without an ML pipeline, the transformTrainingDataset method should be called on the Target Encoder model to get appropriate results for the training dataset, as the sketch below shows.
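A minimal Scala sketch of this standalone flow, reusing targetEncoder, gbm, trainingDF, and testingDF from the sections above (the holdoutStrategy and noise values below are illustrative assumptions, not recommendations):

// Configure the training-only parameters and fit the encoder
val targetEncoderModel = targetEncoder
  .setHoldoutStrategy("KFold") // applied only when encoding the training data
  .setNoise(0.05)              // applied only when encoding the training data
  .fit(trainingDF)

// The training dataset must be encoded with transformTrainingDataset,
// which honors holdoutStrategy and noise ...
val encodedTrainingDF = targetEncoderModel.transformTrainingDataset(trainingDF)

// ... while any other dataset goes through plain transform, which ignores them
val encodedTestingDF = targetEncoderModel.transform(testingDF)

// Train and score the algorithm on the encoded frames
val gbmModel = gbm.fit(encodedTrainingDF)
gbmModel.transform(encodedTestingDF).show()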
Edge Cases¶
- The label column can't contain any null values.
- Input columns transformed by Target Encoder can contain null values.
- Novel values in a testing/production dataset and null values belong to the same category. In other words, Target Encoder returns the prior average for all novel values if a given column of the training dataset did not contain any null values. Otherwise, the posterior average of the rows having null values in that column is returned.
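As a brief illustration of the novel-value behavior, a sketch reusing targetEncoderModel from the previous section (the RACE value 99 is made up and assumed never to occur in trainingDF; RACE_te follows the output-column naming shown in the table above):

import org.apache.spark.sql.functions.lit

// Replace RACE with a value never seen during training; since RACE contains
// no nulls in trainingDF, each such row receives the prior (global) mean of CAPSULE
val novelDF = testingDF.withColumn("RACE", lit(99))
targetEncoderModel.transform(novelDF).select("RACE", "RACE_te").show(3)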