Target Encoding in Sparkling Water
==================================
Target Encoding in Sparkling Water is a mechanism of converting categorical features to continues features based on
the mean calculated from values of the label (target) column. See also :ref:`parameters_H2OTargetEncoder`.

An example of converting a categorical feature to continues with Target Encoder (`Town_te` is a produced column):

 =============== ======= =========
  Town            Label   Town_te  
 =============== ======= ========= 
  Chennai         1       0.8      
  Prague          0       0.286    
  Chennai         0       0.8      
  Mountain View   1       0.714    
  Chennai         1       0.8      
  Prague          1       0.286    
  Mountain View   1       0.714    
  Chennai         1       0.8      
  Mountain View   0       0.714    
  Prague          1       0.286    
  Prague          0       0.286    
  Mountain View   1       0.714    
  Prague          0       0.286    
  Mountain View   0       0.714    
  Chennai         1       0.8
  Mountain View   1       0.714    
  Prague          0       0.286    
  Prague          0       0.286    
  Mountain View   1       0.714    
 =============== ======= =========

Target Encoding can help to improve accuracy of machine learning algorithms when columns with high
cardinality are used as features during a training phase.

Using Target Encoder
--------------------
Sparkling Water exposes API for target encoder in Scala and Python. Before we start using Target Encoder, we need to run
and prepare the environment:

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        First, let's start Sparkling Shell (use *:paste* mode when you try to copy-paste examples):

        .. code:: shell

            ./bin/sparkling-shell

        Start H2O cluster inside the Spark environment:

        .. code:: scala

            import ai.h2o.sparkling._
            import java.net.URI
            val hc = H2OContext.getOrCreate()

        Parse the data using H2O and convert them to Spark Frame:

        .. code:: scala

            import org.apache.spark.SparkFiles
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
            val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
            val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

    .. tab-container:: Python
        :title: Python

        First, let's start PySparkling Shell:

        .. code:: shell

            ./bin/pysparkling

        Start H2O cluster inside the Spark environment:

        .. code:: python

            from pysparkling import *
            hc = H2OContext.getOrCreate()

        Parse the data using H2O and convert them to Spark Frame:

        .. code:: python

            import h2o
            frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
            sparkDF = hc.asSparkFrame(frame)
            [trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])


Target Encoder in ML Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Target Encoder in Sparkling Water is implemented as a regular estimator and thus could be placed as a stage to Spark ML Pipeline

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        Let's create an instance of Target Encoder and configure it:

        .. code:: scala

            import ai.h2o.sparkling.ml.features.H2OTargetEncoder
            val targetEncoder = new H2OTargetEncoder()
              .setInputCols(Array("RACE", "DPROS", "DCAPS"))
              .setProblemType("Classification")
              .setLabelCol("CAPSULE")

        Also, create an instance of an algorithm consuming encoded columns and define pipeline:

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.classification.H2OGBMClassifier
            import org.apache.spark.ml.Pipeline
            val gbm = new H2OGBMClassifier()
                .setFeaturesCols(targetEncoder.getOutputCols())
                .setLabelCol("CAPSULE")
            val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))

        Train the created pipeline

        .. code:: scala

            val pipelineModel = pipeline.fit(trainingDF)

        Make predictions including a model of Target Encoder:

        .. code:: scala

            pipelineModel.transform(testingDF).show()

        The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

        .. code:: scala

            import org.apache.spark.ml.PipelineModel
            pipelineModel.write.save("somePathForStoringPipelineModel")
            val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
            loadedPipelineModel.transform(testingDF).show()

    .. tab-container:: Python
        :title: Python

        Let's create an instance of Target Encoder and configure it:

        .. code:: python

            from pysparkling.ml import H2OTargetEncoder
            targetEncoder = H2OTargetEncoder()\
              .setInputCols(["RACE", "DPROS", "DCAPS"])\
              .setLabelCol("CAPSULE")\
              .setProblemType("Classification")

        Also, create an instance of an algorithm consuming encoded columns and define pipeline:

        .. code:: python

            from pysparkling.ml import H2OGBMClassifier
            from pyspark.ml import Pipeline
            gbm = H2OGBMClassifier()\
                .setFeaturesCols(targetEncoder.getOutputCols())\
                .setLabelCol("CAPSULE")
            pipeline = Pipeline(stages=[targetEncoder, gbm])

        Train the created pipeline

        .. code:: python

            pipelineModel = pipeline.fit(trainingDF)

        Make predictions including a model of Target Encoder:

        .. code:: python

            pipelineModel.transform(testingDF).show()

        The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

        .. code:: python

            from pyspark.ml import PipelineModel
            pipelineModel.save("somePathForStoringPipelineModel")
            loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
            loadedPipelineModel.transform(testingDF).show()


Standalone Target Encoder
~~~~~~~~~~~~~~~~~~~~~~~~~
Target Encoder's parameters like ``noise`` and ``holdoutStrategy`` are relevant only for a training dataset.
Thus the ``transform`` method of ``H2OTargetEncoderModel`` has to treat training and other data sets differently and
eventually, ignore the mentioned parameters.

When Target Encoder is inside a ML pipeline, the differentiation is done automatically. But if a user decides to train
an algorithm without ML pipeline, the 'transformTrainingDataset' method should be on the model of Target Encoder to get
appropriate results.


Edge Cases
~~~~~~~~~~
- The label column can't contain any ``null`` values.
- Input columns transformed by Target Encoder can contain ``null`` values.
- Novel values in a testing/production data set and ``null`` values belong to the same category. In other words,
  Target Encoder returns a prior average for all novel values in case a given column of the training dataset
  did not contain any ``null`` values. Otherwise, the posterior average of rows having ``null`` values in the column is returned.