Generalized Low Rank Models (GLRM) in Sparkling Water

Generalized Low Rank Models (GLRM) in Sparkling Water is a feature estimator that reduces the number of features in a Spark pipeline. Sparkling Water provides an API for GLRM in both Scala and Python. The following sections describe how to train and use Sparkling Water GLRM in each language. See also Parameters of H2OGLRM and Details of H2OGLRMMOJOModel.

Scala

First, let’s start Sparkling Shell as

./bin/sparkling-shell

Start an H2O cluster inside the Spark environment

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

Download the dataset, read it into a Spark DataFrame, and split it into training and testing parts

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Create H2OGLRM and set the input columns, k (the number of output features), and other parameters (see Parameters of H2OGLRM). An input column can be of any simple type, or it can represent multiple features in the form of the Spark vector type (org.apache.spark.ml.linalg.VectorUDT); a sketch of the vector-typed variant follows the snippet below.

import ai.h2o.sparkling.ml.features.H2OGLRM
val glrm = new H2OGLRM()
glrm.setInputCols(Array("RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"))
glrm.setK(4)
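
As a sketch outside the original example, the same columns could instead be fed to H2OGLRM as a single vector-typed column produced by Spark's VectorAssembler (the column name "features" and the variable names below are illustrative assumptions):

import org.apache.spark.ml.feature.VectorAssembler
// Assemble the numeric columns into one vector column named "features" (illustrative name).
val assembler = new VectorAssembler()
  .setInputCols(Array("RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"))
  .setOutputCol("features")
// GLRM then consumes the single vector column instead of the individual columns.
val glrmOnVector = new H2OGLRM()
glrmOnVector.setInputCols(Array("features"))
glrmOnVector.setK(4)
// In that case, the assembler would be placed in front of GLRM in the pipeline, e.g.
// new Pipeline().setStages(Array(assembler, glrmOnVector, gbm))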

Define other pipeline stages.

import ai.h2o.sparkling.ml.algos.H2OGBM
val gbm = new H2OGBM()
gbm.setFeaturesCols(Array(glrm.getOutputCol()))
gbm.setLabelCol("CAPSULE")

Construct and fit the pipeline.

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(glrm, gbm))
val model = pipeline.fit(trainingDF)

Now, you can score with the pipeline model.

val resultDF = model.transform(testingDF)
resultDF.show(truncate=false)
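
If you only want to inspect the reduced feature representation produced by GLRM (assuming the GLRM output column is retained in the scored DataFrame), you can select it by name:

// Show just the GLRM output column; its name is taken from the estimator itself.
resultDF.select(glrm.getOutputCol()).show(false)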

Python

First, let’s start PySparkling Shell as

./bin/pysparkling

Start an H2O cluster inside the Spark environment

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing parts

import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Create H2OGLRM and set the input columns, k (the number of output features), and other parameters (see Parameters of H2OGLRM). An input column can be of any simple type, or it can represent multiple features in the form of the Spark vector type (pyspark.ml.linalg.VectorUDT); a sketch of the vector-typed variant follows the snippet below.

from pysparkling.ml import H2OGLRM
glrm = H2OGLRM()
glrm.setInputCols(["RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"])
glrm.setK(4)
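
As in the Scala example, and as a sketch outside the original text, the same columns could instead be passed to H2OGLRM as a single vector-typed column created by Spark's VectorAssembler (the column name "features" and the variable names below are illustrative assumptions):

from pyspark.ml.feature import VectorAssembler

# Assemble the numeric columns into one vector column named "features" (illustrative name).
assembler = VectorAssembler(inputCols=["RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"], outputCol="features")
# GLRM then consumes the single vector column instead of the individual columns.
glrmOnVector = H2OGLRM()
glrmOnVector.setInputCols(["features"])
glrmOnVector.setK(4)
# In that case, the assembler would be placed in front of GLRM in the pipeline, e.g.
# Pipeline(stages=[assembler, glrmOnVector, gbm])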

Define other pipeline stages.

from pysparkling.ml import H2OGBM
gbm = H2OGBM()
gbm.setFeaturesCols([glrm.getOutputCol()])
gbm.setLabelCol("CAPSULE")

Construct and fit the pipeline.

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[glrm, gbm])
model = pipeline.fit(trainingDF)

Now, you can score with the pipeline model.

resultDF = model.transform(testingDF)
resultDF.show(truncate=False)
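
If you only want to inspect the reduced feature representation produced by GLRM (assuming the GLRM output column is retained in the scored DataFrame), you can select it by name:

# Show just the GLRM output column; its name is taken from the estimator itself.
resultDF.select(glrm.getOutputCol()).show(truncate=False)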