Principal Component Analysis (PCA) in Sparkling Water¶
The Principal Component Analysis (PCA) in Sparkling Water is an feature estimator, which serves to reduce number of features in Spark pipeline. Sparkling Water provides API for PCA in Scala and Python. The following sections describe how to train and use the Sparkling Water PCA in both languages. See also Parameters of H2OPCA and Details of H2OPCAMOJOModel.
- Scala
- Python
First, let’s start Sparkling Shell as
./bin/sparkling-shell
Start H2O cluster inside the Spark environment
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to Spark Frame
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Create H2OPCA
and set input columns, k
representing a number of output features and other parameters
(see Parameters of H2OPCA). An input column could be of any simple type or represent multiple features
in form of the Spark vector type (org.apache.spark.ml.linalg.VectorUDT
).
import ai.h2o.sparkling.ml.features.H2OPCA
val pca = new H2OPCA()
pca.setInputCols(Array("RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"))
pca.setK(4)
Define other pipeline stages.
import ai.h2o.sparkling.ml.algos.H2OGBM
val gbm = new H2OGBM()
gbm.setFeaturesCol(pca.getOutputCol())
gbm.setLabelCol("CAPSULE")
Construct and fit the pipeline.
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(pca, gbm))
val model = pipeline.fit(trainingDF)
Now, you can score with the pipeline model.
val resultDF = model.transform(testingDF)
resultDF.show(truncate=false)