Target Encoding in Sparkling Water ================================== Target Encoding in Sparkling Water is a mechanism of converting categorical features to continues features based on the mean calculated from values of the label (target) column. An example of converting a categorical feature to continues with Target Encoder (`Town_te` is a produced column): =============== ======= ========= Town Label Town_te =============== ======= ========= Chennai 1 0.8 Prague 0 0.286 Chennai 0 0.8 Mountain View 1 0.714 Chennai 1 0.8 Prague 1 0.286 Mountain View 1 0.714 Chennai 1 0.8 Mountain View 0 0.714 Prague 1 0.286 Prague 0 0.286 Mountain View 1 0.714 Prague 0 0.286 Mountain View 0 0.714 Chennai 1 0.8 Mountain View 1 0.714 Prague 0 0.286 Prague 0 0.286 Mountain View 1 0.714 =============== ======= ========= Target Encoding can help to improve accuracy of machine learning algorithms when columns with high cardinality are used as features during a training phase. Parameters ---------- labelCol A name of label column inputCols Names of columns that will be transformed into Target Encoding outputCols Names of columns representing the result of target encoding. If the parameter is not specified by the user, the output column names will be automatically derived from ``inputCols`` by appending the suffix `_te`. holdoutStrategy A strategy deciding what records will be excluded when calculating the target average on the training dataset. Options: None All rows are considered for the calculation LeaveOneOut All rows except the row the calculation is made for KFold Only out-of-fold data is considered (The option requires ``foldCol`` to be set.) foldCol A name of a column determining folds when ``KFold`` holdoutStrategy is applied. blendedAvgEnabled If set, the target average becomes a weighted average of the posterior average for a given categorical level and the prior average of the target. The weight is determined by the size of the given group that the row belongs to. By default, the blended average is **disabled**. blendedAvgInflectionPoint A parameter of the blended average. The bigger number is set, the groups relatively bigger to the overall dataset size will consider the prior average as a component in the weighted average. The default value is **10**. blendedAvgSmoothing A parameter of the blended average. It controls the rate of a transition between a posterior average and a prior average. The default value is **20**. noise Amount of random noise added to output values of a training dataset to prevent over-fitting of an algorithm consuming encoded features. The default value is **0.01**. Noise addition can be disabled by setting the parameter to **0.0**. noiseSeed A seed of the generator producing the random noise. Using Target Encoder -------------------- Sparkling Water exposes API for target encoder in Scala and Python. Before we start using Target Encoder, we need to run and prepare the environment: .. content-tabs:: .. tab-container:: Scala :title: Scala First, let's start Sparkling Shell (use *:paste* mode when you try to copy-paste examples): .. code:: shell ./bin/sparkling-shell Start H2O cluster inside the Spark environment: .. code:: scala import org.apache.spark.h2o._ import java.net.URI val hc = H2OContext.getOrCreate() Parse the data using H2O and convert them to Spark Frame: .. code:: scala val frame = new H2OFrame(new URI("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")) val sparkDF = hc.asSparkFrame(frame).withColumn("CAPSULE", $"CAPSULE" cast "string") val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2)) .. tab-container:: Python :title: Python First, let's start PySparkling Shell: .. code:: shell ./bin/pysparkling Start H2O cluster inside the Spark environment: .. code:: python from pysparkling import * hc = H2OContext.getOrCreate() Parse the data using H2O and convert them to Spark Frame: .. code:: python import h2o frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv") sparkDF = hc.asSparkFrame(frame) sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string")) [trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2]) Target Encoder in ML Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Target Encoder in Sparkling Water is implemented as a regular estimator and thus could be placed as a stage to Spark ML Pipeline .. content-tabs:: .. tab-container:: Scala :title: Scala Let's create an instance of Target Encoder and configure it: .. code:: scala import ai.h2o.sparkling.ml.features.H2OTargetEncoder val targetEncoder = new H2OTargetEncoder() .setInputCols(Array("RACE", "DPROS", "DCAPS")) .setLabelCol("CAPSULE") Also, create an instance of an algorithm consuming encoded columns and define pipeline: .. code:: scala import ai.h2o.sparkling.ml.algos.H2OGBM import org.apache.spark.ml.Pipeline val gbm = new H2OGBM() .setFeaturesCols(targetEncoder.getOutputCols()) .setLabelCol("CAPSULE") val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm)) Train the created pipeline .. code:: scala val pipelineModel = pipeline.fit(trainingDF) Make predictions including a model of Target Encoder: .. code:: scala pipelineModel.transform(testingDF).show() The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model: .. code:: scala import org.apache.spark.ml.PipelineModel pipelineModel.write.save("somePathForStoringPipelineModel") val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel") loadedPipelineModel.transform(testingDF).show() .. tab-container:: Python :title: Python Let's create an instance of Target Encoder and configure it: .. code:: python from pysparkling.ml import H2OTargetEncoder targetEncoder = H2OTargetEncoder()\ .setInputCols(["RACE", "DPROS", "DCAPS"])\ .setLabelCol("CAPSULE") Also, create an instance of an algorithm consuming encoded columns and define pipeline: .. code:: python from pysparkling.ml import H2OGBM from pyspark.ml import Pipeline gbm = H2OGBM()\ .setFeaturesCols(targetEncoder.getOutputCols())\ .setLabelCol("CAPSULE") pipeline = Pipeline(stages=[targetEncoder, gbm]) Train the created pipeline .. code:: python pipelineModel = pipeline.fit(trainingDF) Make predictions including a model of Target Encoder: .. code:: python pipelineModel.transform(testingDF).show() The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model: .. code:: python from pyspark.ml import PipelineModel pipelineModel.save("somePathForStoringPipelineModel") loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel") loadedPipelineModel.transform(testingDF).show() Standalone Target Encoder ~~~~~~~~~~~~~~~~~~~~~~~~~ Target Encoder's parameters like ``noise`` and ``holdoutStrategy`` are relevant only for a training dataset. Thus the ``transform`` method of ``H2OTargetEncoderModel`` has to treat training and other data sets differently and eventually, ignore the mentioned parameters. When Target Encoder is inside a ML pipeline, the differentiation is done automatically. But if a user decides to train an algorithm without ML pipeline, the 'transformTrainingDataset' method should be on the model of Target Encoder to get appropriate results. Limitations and Edge Cases ~~~~~~~~~~~~~~~~~~~~~~~~~~ - The label column can't contain more than two unique categorical values. - The label column can't contain any ``null`` values. - Input columns transformed by Target Encoder can contain ``null`` values. - Novel values in a testing/production data set and ``null`` values belong to the same category. In other words, Target Encoder returns a prior average for all novel values in case a given column of the training dataset did not contain any ``null`` values. Otherwise, the posterior average of rows having ``null`` values in the column is returned.