Train KMeans Model in Sparkling Water
Introduction
K-Means falls in the general category of clustering algorithms. For a more comprehensive description, see the H2O-3 K-Means documentation.
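To give a feel for what a clustering algorithm of this kind does, here is a minimal pure-Python sketch of the textbook K-Means iteration (Lloyd's algorithm). It is an illustration only and is not how H2O implements K-Means; the function names, the `initial_centroids` parameter, and the toy data are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iterations=100, seed=1234, initial_centroids=None):
    """Textbook Lloyd's iteration; returns (centroids, assignments)."""
    if initial_centroids is None:
        # Illustrative choice: start from k distinct random data points.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [nearest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

Predicting the cluster of a new point then amounts to calling `nearest_centroid` with the trained centroids.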
Example
The following section describes how to train a KMeans model in Sparkling Water in Scala and Python, following the same example as the H2O-3 documentation mentioned above. See also Parameters of H2OKMeans and Details of H2OKMeansMOJOModel.
Scala
First, start Sparkling Shell:
./bin/sparkling-shell
Start an H2O cluster inside the Spark environment
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Load the data into a Spark DataFrame and split it into training and testing sets
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("iris_wheader.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Set the predictors
val predictors = Array("sepal_len", "sepal_wid", "petal_len", "petal_wid")
Build and train the model. You can configure all the available KMeans arguments using the provided setters.
import ai.h2o.sparkling.ml.algos.H2OKMeans
val estimator = new H2OKMeans()
    .setEstimateK(true)
    .setSeed(1234)
    .setFeaturesCols(predictors)
val model = estimator.fit(trainingDF)
Evaluate the model performance
val metrics = model.getTrainingMetrics()
println(metrics)
Run predictions
model.transform(testingDF).show(false)
You can also get model details by calling the methods listed in Details of H2OKMeansMOJOModel.
Python
First, start PySparkling Shell:
./bin/pysparkling
Start an H2O cluster inside the Spark environment
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Set the predictors
predictors = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
Build and train the model. You can configure all the available KMeans arguments using the provided setters or constructor parameters.
from pysparkling.ml import H2OKMeans
estimator = H2OKMeans(
    estimateK=True,
    seed=1234,
    featuresCols=predictors)
model = estimator.fit(trainingDF)
Evaluate the model performance
metrics = model.getTrainingMetrics()
print(metrics)
Run predictions
model.transform(testingDF).show(truncate = False)
You can also get model details by calling the methods listed in Details of H2OKMeansMOJOModel.