Target Encoding in Sparkling Water

Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features based on the mean of the label (target) column values calculated for each category. See also Parameters of H2OTargetEncoder.

An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):

Town           Label  Town_te
-------------  -----  -------
Chennai        1      0.8
Prague         0      0.286
Chennai        0      0.8
Mountain View  1      0.714
Chennai        1      0.8
Prague         1      0.286
Mountain View  1      0.714
Chennai        1      0.8
Mountain View  0      0.714
Prague         1      0.286
Prague         0      0.286
Mountain View  1      0.714
Prague         0      0.286
Mountain View  0      0.714
Chennai        1      0.8
Mountain View  1      0.714
Prague         0      0.286
Prague         0      0.286
Mountain View  1      0.714

In the table above, for example, Chennai appears five times with labels 1, 0, 1, 1, and 1, so its encoding is (1 + 0 + 1 + 1 + 1) / 5 = 0.8. Target Encoding can help improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during the training phase.
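
The same idea can be expressed in a few lines of plain Spark. The following Scala sketch is an illustration only, not the H2O implementation (which additionally supports holdout strategies, blending, and noise, as described later); it assumes a Spark session with implicits in scope, as in sparkling-shell:

import org.apache.spark.sql.functions.avg
import spark.implicits._ // available in sparkling-shell

// A toy subset of the table above.
val towns = Seq(
  ("Chennai", 1), ("Prague", 0), ("Chennai", 0), ("Mountain View", 1), ("Chennai", 1)
).toDF("Town", "Label")

// The encoding of each town is the mean of the Label column within that town.
val encodings = towns.groupBy("Town").agg(avg("Label").as("Town_te"))
towns.join(encodings, Seq("Town")).show()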

Using Target Encoder

Sparkling Water exposes a Target Encoder API in Scala and Python. Before we start using Target Encoder, we need to set up the environment:

Scala

First, let’s start Sparkling Shell (use :paste mode when copy-pasting the examples):

./bin/sparkling-shell

Start an H2O cluster inside the Spark environment:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

Download the data and parse them into a Spark DataFrame:

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
// Cast the label column to string so that it is treated as categorical
val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Python

First, let’s start PySparkling Shell:

./bin/pysparkling

Start an H2O cluster inside the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O and convert it to a Spark DataFrame:

import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
# Cast the label column to string so that it is treated as categorical
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Target Encoder in ML Pipeline

Target Encoder in Sparkling Water is implemented as a regular estimator and thus can be placed as a stage in a Spark ML pipeline.

Scala

Let’s create an instance of Target Encoder and configure it:

import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setLabelCol("CAPSULE")

Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:

import ai.h2o.sparkling.ml.algos.H2OGBM
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBM()
    .setFeaturesCols(targetEncoder.getOutputCols())
    .setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))

Train the created pipeline:

val pipelineModel = pipeline.fit(trainingDF)

Make predictions with the trained pipeline, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:

import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Python

Let’s create an instance of Target Encoder and configure it:

from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
  .setInputCols(["RACE", "DPROS", "DCAPS"])\
  .setLabelCol("CAPSULE")

Also, create an instance of an algorithm consuming the encoded columns and define the pipeline:

from pysparkling.ml import H2OGBM
from pyspark.ml import Pipeline
gbm = H2OGBM()\
    .setFeaturesCols(targetEncoder.getOutputCols())\
    .setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])

Train the created pipeline:

pipelineModel = pipeline.fit(trainingDF)

Make predictions with the trained pipeline, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The Target Encoder model is persistable as a MOJO, so you can save and load the whole pipeline model:

from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Standalone Target Encoder

Target Encoder parameters such as noise and holdoutStrategy are relevant only for the training dataset. Thus, the transform method of H2OTargetEncoderModel has to treat the training dataset differently from other datasets and ignore the mentioned parameters for non-training data.

When Target Encoder is inside an ML pipeline, this differentiation is done automatically. But if a user decides to train an algorithm without an ML pipeline, the transformTrainingDataset method should be called on the Target Encoder model to get appropriate results, as shown in the sketch below.
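
A minimal Scala sketch of this standalone flow, reusing trainingDF and testingDF from the setup above. The parameter values are illustrative, and the setHoldoutStrategy and setNoise setter names are assumed to mirror the holdoutStrategy and noise parameters:

import ai.h2o.sparkling.ml.features.H2OTargetEncoder

val standaloneEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setLabelCol("CAPSULE")
  .setHoldoutStrategy("LeaveOneOut") // assumed setter name; illustrative value
  .setNoise(0.05)                    // assumed setter name; illustrative value

val targetEncoderModel = standaloneEncoder.fit(trainingDF)

// The training dataset goes through transformTrainingDataset so that the
// holdout strategy and noise are applied ...
val encodedTrainingDF = targetEncoderModel.transformTrainingDataset(trainingDF)

// ... while any other dataset uses the plain transform, which ignores them.
val encodedTestingDF = targetEncoderModel.transform(testingDF)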

Limitations and Edge Cases

  • The label column can’t contain more than two unique categorical values.

  • The label column can’t contain any null values.

  • Input columns transformed by Target Encoder can contain null values.

  • Novel values in a testing/production dataset and null values belong to the same category. In other words, Target Encoder returns the prior average for all novel values if the given column of the training dataset did not contain any null values; otherwise, it returns the posterior average of the rows having null values in that column, as sketched below.
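
A hypothetical illustration of the last rule, reusing targetEncoderModel from the standalone sketch above (the RACE value 99 is made up and does not occur in the training data):

import spark.implicits._ // available in sparkling-shell

// RACE = 99 never occurs in prostate.csv, so its encoding falls back to the
// prior average of the CAPSULE label (prostate.csv has no nulls in RACE; if it
// did, the posterior average of the null category would be returned instead).
val novelDF = Seq((99, 1, 1, "1")).toDF("RACE", "DPROS", "DCAPS", "CAPSULE")
targetEncoderModel.transform(novelDF).show()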