Cross-Validation
================

N-fold cross-validation is used to validate a model internally, i.e.,
to estimate the model's performance without having to sacrifice a
validation split. It also avoids the statistical issues of a single
validation split (which might be a “lucky” split, especially for
imbalanced data). Good values for N are around 5 to 10. It is always a
good idea to compare the N validation metrics, to check the stability
of the estimate, before “trusting” the main model.

You have to make sure, however, that the holdout sets for each of the N
models are good. For i.i.d. data, the random splitting of the data into
N pieces (default behavior) or modulo-based splitting is fine. For
temporal or otherwise structured data with distinct “events”, you have
to make sure to split the folds based on the events. For example, if you
have observations (e.g., user transactions) from N cities and you want
to build models on users from only N-1 cities and validate them on the
remaining city (if you want to study the generalization to new cities,
for example), you will need to set the ``fold_column`` parameter to the
city column. Otherwise, you will have rows (users) from all N cities
randomly blended into the N folds, and all N cv models will have seen
all N cities, making the validation less useful (or totally wrong,
depending on the distribution of the data). This is known as “data
leakage”: https://youtu.be/NHw_aKO5KUM?t=889
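
Here is a minimal R sketch of this idea, using a small synthetic frame (the ``city``, ``x1``, ``x2``, and ``response`` columns are hypothetical placeholders for illustration):

::

    library(h2o)
    h2o.init()

    # Hypothetical data: three cities, two numeric features, a binary response
    raw <- data.frame(city = as.factor(rep(c("NYC", "SF", "LA"), each = 50)),
                      x1 = rnorm(150),
                      x2 = rnorm(150),
                      response = as.factor(sample(0:1, 150, replace = TRUE)))
    df <- as.h2o(raw)

    # fold_column assigns whole cities to folds, so each cv model is validated
    # on a city it never saw during training (no leakage across folds)
    model_fit <- h2o.gbm(x = c("x1", "x2"),
                         y = "response",
                         training_frame = df,
                         fold_column = "city")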

How Cross-Validation is Calculated
----------------------------------

In general, for all algos that support the ``nfolds`` parameter, H2O’s
cross-validation works as follows:

For example, for ``nfolds=5``, 6 models are built. The first 5 models
(cross-validation models) are built on 80% of the training data, and a
different 20% is held out for each of the 5 models. Then the main model
is built on 100% of the training data. This main model is the model you
get back from H2O in R, Python, and Flow.

This main model contains training metrics and cross-validation metrics
(and optionally, validation metrics if a validation frame was provided).
The main model also contains pointers to the 5 cross-validation models
for further inspection.

All 5 cross-validation models contain training metrics (from the 80%
training data) and validation metrics (from their 20% holdout/validation
data). To compute its individual validation metrics, each of the 5
cross-validation models had to make predictions on its 20% of the rows
of the original training frame and score them against the true labels
of the 20% holdout.

For the main model, the cross-validation metrics are computed as
follows: the 5 holdout predictions are combined into one prediction for
the full training dataset (i.e., predictions for every row of the
training data, where the model making the prediction for a particular
row has not seen that row during training). This “holdout prediction” is
then scored against the true labels, and the overall cross-validation
metrics are computed.
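
If you want to inspect this combined holdout prediction frame yourself, here is a hedged sketch (it assumes the model is rebuilt with ``keep_cross_validation_predictions = TRUE``, and that the ``h2o.cross_validation_holdout_predictions`` accessor is available in your version of the h2o R package):

::

    library(h2o)
    h2o.init()
    df <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
    df$CAPSULE <- as.factor(df$CAPSULE)

    # Keep the cross-validation prediction frames around for inspection
    model_fit <- h2o.gbm(x = 3:8, y = 2, training_frame = df,
                         nfolds = 5, seed = 1234,
                         keep_cross_validation_predictions = TRUE)

    # One prediction per row of the training frame, each made by the cv model
    # that did not see that row during training
    holdout_preds <- h2o.cross_validation_holdout_predictions(model_fit)
    head(holdout_preds)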

This approach has some implications. Scoring the holdout predictions
freshly can result in different metrics than taking the average of the 5
validation metrics of the cross-validation models. For example, if the
sizes of the holdout folds differ a lot (e.g., when a user-given
``fold_column`` is used), then the plain average should probably be
replaced with a weighted average. Also, if the cross-validation models
map to slightly different probability spaces, which can happen for small
DL models that converge to different local minima, then the inconsistent
rank ordering of the combined predictions can lead to a significantly
different AUC than the average.

Example in R
~~~~~~~~~~~~

To gain more insights into the variance of the holdout metrics (e.g.,
AUCs), you can look up the cross-validation models, and inspect their
validation metrics. Here’s an R code example showing the two approaches:

::

    library(h2o)
    h2o.init()
    df <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
    df$CAPSULE <- as.factor(df$CAPSULE)

    # Build a GBM with 5-fold cross-validation
    model_fit <- h2o.gbm(x = 3:8, y = 2, training_frame = df, nfolds = 5, seed = 1234)

    # Default: AUC of the combined holdout predictions
    h2o.auc(model_fit, xval = TRUE)

    # Optional: Average the per-fold holdout AUCs instead
    cv_model_names <- sapply(model_fit@model$cross_validation_models, `[[`, "name")
    cvAUCs <- sapply(cv_model_names, function(x) h2o.auc(h2o.getModel(x), valid = TRUE))
    print(cvAUCs)
    mean(cvAUCs)

Using Cross-Validated Predictions
---------------------------------

With cross-validated model building, H2O builds N+1 models: N
cross-validated models and 1 overarching model trained on all of the
training data.

Each cv-model produces a prediction frame pertaining to its fold. These
frames can be saved and probed from the various clients if the
``keep_cross_validation_predictions`` parameter is set in the model
constructor.
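
For example, a brief sketch in R, assuming a model ``model_fit`` built with ``nfolds = 5`` and ``keep_cross_validation_predictions = TRUE`` (and that the ``h2o.cross_validation_predictions`` accessor is available in your h2o version):

::

    # model_fit is assumed to have been built with keep_cross_validation_predictions = TRUE
    # A list with one H2OFrame per fold; each frame has as many rows as the
    # training frame, with zeros for the rows that were not held out in that fold
    cv_pred_frames <- h2o.cross_validation_predictions(model_fit)
    length(cv_pred_frames)     # 5
    head(cv_pred_frames[[1]])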

These holdout predictions have some interesting properties. First, they
have names like:

::

      prediction_GBM_model_1452035702801_1_cv_1

and they contain, unsurprisingly, predictions for the data held out in
the fold. They also have the same number of rows as the entire input
training frame, with ``0``\ s filled in for all rows that are not in the
holdout.

Let's look at an example.

Here is a snippet of a three-class classification dataset (the
``class`` column is the response), with a 3-fold identification column
(``foldId``) appended at the end:

+--------------+--------------+--------------+--------------+----------+----------+
| sepal\_len   | sepal\_wid   | petal\_len   | petal\_wid   | class    | foldId   |
+==============+==============+==============+==============+==========+==========+
| 5.1          | 3.5          | 1.4          | 0.2          | setosa   | 0        |
+--------------+--------------+--------------+--------------+----------+----------+
| 4.9          | 3.0          | 1.4          | 0.2          | setosa   | 0        |
+--------------+--------------+--------------+--------------+----------+----------+
| 4.7          | 3.2          | 1.3          | 0.2          | setosa   | 2        |
+--------------+--------------+--------------+--------------+----------+----------+
| 4.6          | 3.1          | 1.5          | 0.2          | setosa   | 1        |
+--------------+--------------+--------------+--------------+----------+----------+
| 5.0          | 3.6          | 1.4          | 0.2          | setosa   | 2        |
+--------------+--------------+--------------+--------------+----------+----------+
| 5.4          | 3.9          | 1.7          | 0.4          | setosa   | 1        |
+--------------+--------------+--------------+--------------+----------+----------+
| 4.6          | 3.4          | 1.4          | 0.3          | setosa   | 1        |
+--------------+--------------+--------------+--------------+----------+----------+
| 5.0          | 3.4          | 1.5          | 0.2          | setosa   | 0        |
+--------------+--------------+--------------+--------------+----------+----------+
| 4.4          | 2.9          | 1.4          | 0.4          | setosa   | 1        |
+--------------+--------------+--------------+--------------+----------+----------+

Each cross-validated model produces a prediction frame

::

      prediction_GBM_model_1452035702801_1_cv_1
      prediction_GBM_model_1452035702801_1_cv_2
      prediction_GBM_model_1452035702801_1_cv_3

and each one has the following shape (for example, the first one):

::

      prediction_GBM_model_1452035702801_1_cv_1

+--------------+----------+--------------+-------------+
| prediction   | setosa   | versicolor   | virginica   |
+==============+==========+==============+=============+
| 1            | 0.0232   | 0.7321       | 0.2447      |
+--------------+----------+--------------+-------------+
| 2            | 0.0543   | 0.2343       | 0.7114      |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+
| 0            | 0.8921   | 0.0321       | 0.0758      |
+--------------+----------+--------------+-------------+
| 0            | 0        | 0            | 0           |
+--------------+----------+--------------+-------------+

The rows that were used to train a given cv model (i.e., the rows not
held out in that fold) receive a prediction of ``0`` (more on this
below) as well as ``0`` for all class probabilities. Each of these
holdout predictions has the same number of rows as the input frame.

Combining Holdout Predictions
-----------------------------

The frame of cross-validated predictions is simply the superposition of
the individual predictions. `Here's an example from
R <https://0xdata.atlassian.net/browse/PUBDEV-2236>`__:

::

    library(h2o)
    h2o.init()

    # H2O Cross-validated K-means example
    prosPath <- system.file("extdata", "prostate.csv", package="h2o")
    prostate.hex <- h2o.uploadFile(path = prosPath)
    fit <- h2o.kmeans(training_frame = prostate.hex,
                      k = 10,
                      x = c("AGE", "RACE", "VOL", "GLEASON"),
                      nfolds = 5,  #If you want to specify folds directly, then use "fold_column" arg
                      keep_cross_validation_predictions = TRUE)

    # This is where the cv preds are stored:
    sapply(fit@model$cross_validation_predictions, `[[`, "name")


    # Compress the CV preds into a single H2O Frame:
    # Each fold's preds are stored in an N x 1 column, where the values for rows
    # belonging to the non-active folds are set to zero,
    # so we will compress this into a single 1-col H2O Frame (easier to digest)

    nfolds <- fit@parameters$nfolds
    predlist <- sapply(1:nfolds, function(v) h2o.getFrame(fit@model$cross_validation_predictions[[v]]$name)$predict, simplify = FALSE)
    cvpred_sparse <- h2o.cbind(predlist)  # N x V H2OFrame with all-zero rows, except in the v-th column when that row belongs to fold v
    pred <- apply(cvpred_sparse, 1, sum)  # These are the cross-validated predicted cluster IDs for each of the 1:N observations

This can be extended to other family types as well (multinomial,
binomial, regression):

::

    # helper function
    .compress_to_cvpreds <- function(h2omodel, family) {
      # Returns a 1-col H2OFrame of cross-validated predictions for the given model
      V <- h2omodel@allparameters$nfolds
      if (family %in% c("bernoulli", "binomial")) {
        # column 3 of each prediction frame is the predicted probability of the positive class (p1)
        predlist <- sapply(1:V, function(v) h2o.getFrame(h2omodel@model$cross_validation_predictions[[v]]$name)[,3], simplify = FALSE)
      } else {
        predlist <- sapply(1:V, function(v) h2o.getFrame(h2omodel@model$cross_validation_predictions[[v]]$name)$predict, simplify = FALSE)
      }
      cvpred_sparse <- h2o.cbind(predlist)  # N x V H2OFrame with all-zero rows, except in the v-th column when that row belongs to fold v
      cvpred_col <- apply(cvpred_sparse, 1, sum)
      return(cvpred_col)
    }


    # Extract cross-validated predicted values (in order of original rows)
    h2o.cvpreds <- function(object) {

      # Need to extract family from model object
      if (class(object) == "H2OBinomialModel") family <- "binomial"
      if (class(object) == "H2OMultinomialModel") family <- "multinomial"
      if (class(object) == "H2ORegressionModel") family <- "gaussian"

      cvpreds <- .compress_to_cvpreds(h2omodel = object, family = family)
      return(cvpreds)
    }
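
As a hedged usage sketch (the model must have been built with ``keep_cross_validation_predictions = TRUE`` so that the per-fold prediction frames exist; the GBM call mirrors the earlier prostate example, and ``h2o.cvpreds`` is the helper defined above):

::

    # Binomial example: cross-validated predicted probabilities for CAPSULE
    df <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
    df$CAPSULE <- as.factor(df$CAPSULE)
    gbm_fit <- h2o.gbm(x = 3:8, y = 2, training_frame = df,
                       nfolds = 5, seed = 1234,
                       keep_cross_validation_predictions = TRUE)

    # One cross-validated prediction per row, in the order of the original rows
    cvpreds <- h2o.cvpreds(gbm_fit)
    head(cvpreds)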