Train KMeans Model in Sparkling Water

Introduction

K-Means is a clustering algorithm: it partitions observations into k groups so that each observation belongs to the cluster with the nearest centroid. For a more comprehensive description, see the H2O-3 K-Means documentation.
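To make the idea concrete, here is a minimal, dependency-free Python sketch of the classic K-Means iteration (Lloyd's algorithm). It is an illustration only and not how H2O implements KMeans; the function names, the toy data, and the random initialization strategy are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=1234):
    """Illustrative K-Means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical toy data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

The fixed seed plays the same role as a seed setting in a real library: it makes the random initialization, and therefore the result, reproducible.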

Example

The following section describes how to train a KMeans model in Sparkling Water in Scala and Python, following the same example as the H2O-3 documentation mentioned above. See also Parameters of H2OKMeans and Details of H2OKMeansMOJOModel.

Scala

First, start the Sparkling Shell:

./bin/sparkling-shell

Start the H2O cluster inside the Spark environment

import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()

Load the data into a Spark DataFrame and split it into training and testing sets

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("iris_wheader.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Set the predictors

val predictors = Array("sepal_len", "sepal_wid", "petal_len", "petal_wid")

Build and train the model. You can configure all the available KMeans arguments using the provided setters.

import ai.h2o.sparkling.ml.algos.H2OKMeans
val estimator = new H2OKMeans()
   .setEstimateK(true)
   .setSeed(1234)
   .setFeaturesCols(predictors)
val model = estimator.fit(trainingDF)

Evaluate the model performance

val metrics = model.getTrainingMetrics()
println(metrics)
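
The fields returned by getTrainingMetrics() are not listed in this section. As an assumption, clustering quality is commonly summarized with sum-of-squares measures, and the following language-agnostic sketch in plain Python shows how such measures can be computed by hand. The function names and the toy data are hypothetical and are not part of the Sparkling Water API.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_within_ss(points, centroids, assignments):
    """Sum of squared distances from each point to its own centroid."""
    return sum(
        squared_distance(p, centroids[a]) for p, a in zip(points, assignments)
    )


def total_ss(points):
    """Sum of squared distances from each point to the grand mean."""
    mean = tuple(sum(dim) / len(points) for dim in zip(*points))
    return sum(squared_distance(p, mean) for p in points)


# Hypothetical toy clustering: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]  # the mean of each cluster

within = total_within_ss(points, centroids, assignments)
total = total_ss(points)
# This decomposition holds when every centroid is its cluster's mean.
between = total - within
print(within, total, between)
```

A lower within-cluster sum of squares relative to the total indicates tighter, better separated clusters.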

Run predictions

model.transform(testingDF).show(false)

You can also get model details by calling the methods listed in Details of H2OKMeansMOJOModel.

Python

First, start the PySparkling Shell:

./bin/pysparkling