Target Encoding in Sparkling Water¶
Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features based on the mean calculated from the values of the label (target) column. See also Parameters of H2OTargetEncoder.
An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):
Town           Label  Town_te
-------------  -----  -------
Chennai        1      0.8
Prague         0      0.286
Chennai        0      0.8
Mountain View  1      0.714
Chennai        1      0.8
Prague         1      0.286
Mountain View  1      0.714
Chennai        1      0.8
Mountain View  0      0.714
Prague         1      0.286
Prague         0      0.286
Mountain View  1      0.714
Prague         0      0.286
Mountain View  0      0.714
Chennai        1      0.8
Mountain View  1      0.714
Prague         0      0.286
Prague         0      0.286
Mountain View  1      0.714
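The encoded value for a town is the mean of Label over the training rows with that town, e.g. Chennai: (1 + 0 + 1 + 1 + 1) / 5 = 0.8. A minimal plain-Python sketch, independent of Sparkling Water, that reproduces the Town_te values above:

from statistics import mean

rows = [
    ("Chennai", 1), ("Prague", 0), ("Chennai", 0), ("Mountain View", 1),
    ("Chennai", 1), ("Prague", 1), ("Mountain View", 1), ("Chennai", 1),
    ("Mountain View", 0), ("Prague", 1), ("Prague", 0), ("Mountain View", 1),
    ("Prague", 0), ("Mountain View", 0), ("Chennai", 1), ("Mountain View", 1),
    ("Prague", 0), ("Prague", 0), ("Mountain View", 1),
]

# Average the label per town.
towns = {town for town, _ in rows}
encoding = {
    town: round(mean(label for t, label in rows if t == town), 3)
    for town in towns
}
print(encoding)  # e.g. {'Chennai': 0.8, 'Prague': 0.286, 'Mountain View': 0.714}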
Target Encoding can help improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during the training phase.
Using Target Encoder¶
Sparkling Water exposes an API for Target Encoder in Scala and Python. Before we start using Target Encoder, we need to prepare the environment:
Scala
First, let’s start Sparkling Shell (use :paste mode when copy-pasting the examples):
./bin/sparkling-shell
Start H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
Download the dataset and parse it into a Spark DataFrame:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Python
First, let’s start PySparkling Shell:
./bin/pysparkling
Start H2O cluster inside the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data with H2O and convert it to a Spark DataFrame:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Target Encoder in ML Pipeline¶
Target Encoder in Sparkling Water is implemented as a regular estimator and thus can be placed as a stage in a Spark ML pipeline.
Scala
Let’s create an instance of Target Encoder and configure it:
import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setProblemType("Classification")
  .setLabelCol("CAPSULE")
Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:
import ai.h2o.sparkling.ml.algos.classification.H2OGBMClassifier
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBMClassifier()
    .setFeaturesCols(targetEncoder.getOutputCols())
    .setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))
Train the created pipeline:
val pipelineModel = pipeline.fit(trainingDF)
Make predictions with the trained pipeline, including the Target Encoder model:
pipelineModel.transform(testingDF).show()
The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:
import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Python
Let’s create an instance of Target Encoder and configure it:
from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
  .setInputCols(["RACE", "DPROS", "DCAPS"])\
  .setLabelCol("CAPSULE")\
  .setProblemType("Classification")
Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:
from pysparkling.ml import H2OGBMClassifier
from pyspark.ml import Pipeline
gbm = H2OGBMClassifier()\
    .setFeaturesCols(targetEncoder.getOutputCols())\
    .setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])
Train the created pipeline:
pipelineModel = pipeline.fit(trainingDF)
Make predictions with the trained pipeline, including the Target Encoder model:
pipelineModel.transform(testingDF).show()
The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:
from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()
Standalone Target Encoder¶
Target Encoder’s parameters, such as noise and holdoutStrategy, are relevant only for the training dataset. Thus, the transform method of H2OTargetEncoderModel has to treat the training dataset differently from other datasets and ignore the mentioned parameters when transforming them.
When Target Encoder is inside an ML pipeline, this differentiation is done automatically. But if a user decides to train an algorithm without an ML pipeline, the transformTrainingDataset method should be called on the Target Encoder model to get appropriate results for the training dataset.
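For example, assuming the trainingDF and testingDF prepared in the setup above, a standalone run in Python could look like the following sketch (the encoder configuration mirrors the pipeline example):

from pysparkling.ml import H2OTargetEncoder, H2OGBMClassifier

targetEncoder = H2OTargetEncoder()\
  .setInputCols(["RACE", "DPROS", "DCAPS"])\
  .setLabelCol("CAPSULE")\
  .setProblemType("Classification")
targetEncoderModel = targetEncoder.fit(trainingDF)

# Training data: apply noise and the holdout strategy.
encodedTrainingDF = targetEncoderModel.transformTrainingDataset(trainingDF)
# Any other data: a plain transform ignores noise and holdoutStrategy.
encodedTestingDF = targetEncoderModel.transform(testingDF)

gbm = H2OGBMClassifier()\
  .setFeaturesCols(targetEncoder.getOutputCols())\
  .setLabelCol("CAPSULE")
model = gbm.fit(encodedTrainingDF)
model.transform(encodedTestingDF).show()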
Edge Cases¶
- The label column can’t contain any null values.
- Input columns transformed by Target Encoder can contain null values.
- Novel values in a testing/production dataset and null values belong to the same category. In other words, Target Encoder returns the prior average for all novel values if the given column of the training dataset did not contain any null values. Otherwise, the posterior average of the rows having null values in the column is returned.
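A hypothetical Python sketch illustrating these rules on a made-up dataset (it assumes the SparkSession spark and the running H2OContext from the setup above; the exact encoded values also depend on other Target Encoder parameters):

from pysparkling.ml import H2OTargetEncoder

# Labels contain no nulls; the Town input column may contain nulls.
train = spark.createDataFrame(
    [("Chennai", 1), ("Prague", 0), ("Prague", 1), (None, 1)],
    ["Town", "Label"])
test = spark.createDataFrame(
    [("Brno", 0), (None, 1)],  # "Brno" never appeared during training
    ["Town", "Label"])

encoderModel = H2OTargetEncoder()\
    .setInputCols(["Town"])\
    .setLabelCol("Label")\
    .setProblemType("Classification")\
    .fit(train)

# The novel value "Brno" falls into the same category as null values.
# Because the training Town column contained nulls, both test rows get
# the posterior average of the training rows where Town is null.
encoderModel.transform(test).show()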