Train KMeans Model in Sparkling Water¶
Sparkling Water provides API for H2O KMeans in Scala and Python. All available parameters of the H2OKmeans model are described at H2O KMeans Parameters.
The following sections describe how to train the KMeans model in Sparkling Water in both languages.
Scala
First, let’s start Sparkling Shell as
./bin/sparkling-shell
Start H2O cluster inside the Spark environment
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to Spark Frame
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("iris_wheader.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Train the model. You can configure all the available KMeans arguments using provided setters.
import ai.h2o.sparkling.ml.algos.H2OKMeans
val estimator = new H2OKMeans().setK(2).setUserPoints(Array(Array(4.9, 3.0, 1.4, 0.2, 0), Array(5.6, 2.5, 3.9, 1.1, 1)))
val model = estimator.fit(trainingDF)
Run Predictions
model.transform(testingDF).show(false)
You can also get model details via calling methods listed in Details of H2OKMeansMOJOModel.
Python
First, let’s start PySparkling Shell as
./bin/pysparkling
Start H2O cluster inside the Spark environment
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to Spark Frame
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Train the model. You can configure all the available KMeans arguments using provided setters or constructor parameters.
from pysparkling.ml import H2OKMeans
estimator = H2OKMeans(k=3, userPoints=[[4.9, 3.0, 1.4, 0.2, 0], [5.6, 2.5, 3.9, 1.1, 1], [6.5, 3.0, 5.2, 2.0, 2]])
model = estimator.fit(trainingDF)
Run Predictions
model.transform(testingDF).show(truncate = False)
You can also get model details via calling methods listed in Details of H2OKMeansMOJOModel.
H2O KMeans Parameters¶
See also Parameters of H2OKMeans.
- maxIterations
- Maximum number of KMeans iterations to find the centroids. 
 
- standardize
- Standardize the numeric columns to have a mean of zero and unit variance. More information about the standardization is available at H2O KMeans standardize param documentation. 
 
- init
- Initialization mode for finding the initial cluster centers. More information about the initialization is available at H2O KMeans Init param documentation. 
 
- userPoints
- This option allows you to specify an array of points, where each point represents coordinates of an initial cluster center. The user-specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters. 
 
- estimateK
- If enabled, the algorithm tries to identify an optimal number of clusters, up to k clusters. 
 
- k
- A number of clusters to generate.