Train GAM Model in Sparkling Water

Note: GAM models are currently experimental.

Introduction

A Generalized Additive Model (GAM) is a Generalized Linear Model (GLM) in which the linear predictor depends linearly on predictor variables and on smooth functions of predictor variables. For a more comprehensive description, see the H2O-3 GAM documentation.
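To make the definition concrete, here is a toy sketch in plain Python (not Sparkling Water code) of an additive predictor made of an intercept, a linear term, and one smooth term. The truncated-linear basis, the coefficients, and the function names are illustrative assumptions; H2O-3 uses its own spline bases, described in its documentation.

```python
def hinge_basis(x, knots):
    """Toy basis for a smooth term: one truncated-linear function per knot."""
    return [max(0.0, x - k) for k in knots]


def additive_predictor(intercept, linear_coefs, linear_x, smooth_coefs, smooth_x, knots):
    """eta = intercept + sum(beta_j * x_j) + f(z), where f(z) = sum(gamma_k * basis_k(z))."""
    linear_part = sum(b * x for b, x in zip(linear_coefs, linear_x))
    smooth_part = sum(g * h for g, h in zip(smooth_coefs, hinge_basis(smooth_x, knots)))
    return intercept + linear_part + smooth_part


# Hypothetical values chosen only to keep the arithmetic easy to follow.
eta = additive_predictor(
    intercept=0.5,
    linear_coefs=[2.0],
    linear_x=[1.0],
    smooth_coefs=[1.0, -1.0],
    smooth_x=1.5,
    knots=[0.0, 1.0],
)
```

The linear part contributes 2.0 and the smooth part contributes 1.5 - 0.5 = 1.0, so the predictor evaluates to 3.5. As in a GLM, a link function then maps this value to the scale of the response.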

Example

The following section describes how to train a GAM model in Sparkling Water in Scala and Python, following the same example as the H2O-3 documentation mentioned above. See also Parameters of H2OGAM and Details of H2OGAMMOJOModel.


First, start the Sparkling Shell:

./bin/sparkling-shell

Start an H2O cluster inside the Spark environment

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

Create the knot frames, one per GAM column

val knots1 = Seq(-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290).toDF()
val frameKnots1 = hc.asH2OFrame(knots1)
val knots2 = Seq(-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589).toDF()
val frameKnots2 = hc.asH2OFrame(knots2)
val knots3 = Seq(-1.999675688, -0.979893796, 0.007573327, 1.011437347, 1.999611676).toDF()
val frameKnots3 = hc.asH2OFrame(knots3)
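Before uploading knots, it can be worth checking them locally. The sketch below is plain Python, independent of Sparkling Water. It verifies that each knot list has the expected number of entries and is strictly increasing. The helper name `check_knots` is hypothetical, and the requirement that knots be increasing is an assumption based on how the values above are ordered.

```python
def check_knots(knots, expected_count):
    """Return True if knots has expected_count values in strictly increasing order."""
    if len(knots) != expected_count:
        return False
    return all(a < b for a, b in zip(knots, knots[1:]))


# Same knot values as in the example above.
knots1 = [-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290]
knots2 = [-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589]
knots3 = [-1.999675688, -0.979893796, 0.007573327, 1.011437347, 1.999611676]

results = [check_knots(k, 5) for k in (knots1, knots2, knots3)]
```

All three lists pass, so `results` contains only `True` values. A list with a repeated or out-of-order value, or with the wrong length, would be rejected.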

Import the dataset and split it into training and testing sets

import org.apache.spark.SparkFiles
val datasetUrl = "https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv"
spark.sparkContext.addFile(datasetUrl) // For example purposes only; on a real cluster, it's better to load directly from distributed storage
val sparkDF =
  spark
    .read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(SparkFiles.get("multinomial_10_classes_10_cols_10000_Rows_train.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
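A weighted random split assigns each row independently, so the 80/20 proportions are approximate and vary between runs unless a seed is fixed. The following plain-Python sketch illustrates the idea only and is not Spark's implementation; the function name `random_split` is hypothetical. In Spark, `randomSplit` also accepts a seed as an optional second argument.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. roughly [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # Pick the first part whose upper bound exceeds u; fall back to the last
        # part to guard against floating-point rounding in the final bound.
        index = next((i for i, upper in enumerate(bounds) if u < upper), len(parts) - 1)
        parts[index].append(row)
    return parts


rows = list(range(10000))
training, testing = random_split(rows, [0.8, 0.2], seed=42)
```

Every row lands in exactly one part, the training part holds close to (but usually not exactly) 8000 rows, and reusing the same seed reproduces the same split.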

Set the predictor and response columns, and specify the number of knots for each GAM column

val y = "C11"
val x = Array("C1", "C2")

val numKnots = Array(5, 5, 5)

Train the model. You can configure all available GAM parameters using the provided setters. The label column and the GAM columns are mandatory.

val gam = new H2OGAM()
  .setFeaturesCols(x)
  .setLabelCol(y)
  .setFamily("multinomial")
  .setGamCols(Array("C6", "C7", "C8"))
  .setColumnsToCategorical("C1", "C2", "C11")
  .setScale(Array(1.0, 1.0, 1.0))
  .setNumKnots(numKnots)
  .setKnotIds(Array(frameKnots1.frameId, frameKnots2.frameId, frameKnots3.frameId))

val gamModel = gam.fit(trainingDF)
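In the example above, the scale, knot count, and knot frame settings each contain one entry per GAM column. The plain-Python sketch below checks that kind of consistency before training. It assumes every per-column setting needs exactly one entry per GAM column; the helper name `check_gam_settings` and the knot frame IDs are hypothetical placeholders.

```python
def check_gam_settings(gam_cols, **per_column_settings):
    """Return the names of settings whose length differs from the number of GAM columns."""
    return sorted(
        name
        for name, values in per_column_settings.items()
        if len(values) != len(gam_cols)
    )


mismatched = check_gam_settings(
    ["C6", "C7", "C8"],
    scale=[1.0, 1.0, 1.0],
    num_knots=[5, 5, 5],
    knot_ids=["knots1", "knots2", "knots3"],  # placeholder IDs, not real frame keys
)
```

An empty list means every setting lines up with the GAM columns; otherwise the returned names point at the settings to fix.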

By default, the H2OGAM algorithm distinguishes between a classification and a regression problem based on the type of the label column in the training dataset. If the label column is a string column, a classification model is trained. If the label column is a numeric column, a regression model is trained. If you don't want the result to depend on column data types, you can explicitly specify the problem type by using ai.h2o.sparkling.ml.algos.classification.H2OGAMClassifier or ai.h2o.sparkling.ml.algos.regression.H2OGAMRegressor instead.
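The default rule can be summarized with a small plain-Python sketch. It only mirrors the decision described above and is not Sparkling Water code; the function name `infer_problem_type` is hypothetical.

```python
def infer_problem_type(label_values):
    """Mimic the default rule: string labels mean classification, numeric labels mean regression."""
    if all(isinstance(v, str) for v in label_values):
        return "classification"
    if all(isinstance(v, (int, float)) for v in label_values):
        return "regression"
    raise ValueError("label column must contain only strings or only numbers")


string_labels = infer_problem_type(["0", "3", "9"])
numeric_labels = infer_problem_type([0, 3, 9])
```

The same class labels give a different problem type depending on how they are stored, which is why class labels encoded as numbers call for an explicit choice of problem type.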

Run predictions

val predictions = gamModel.transform(testingDF)
predictions.show(truncate = false)

You can also get model details by calling the methods listed in Details of H2OGAMMOJOModel.

Clean up

frameKnots1.delete()
frameKnots2.delete()
frameKnots3.delete()