Train Sparkling Water Algorithms with Grid Search
Grid Search finds optimal values for the hyper-parameters of a given H2O/SW algorithm. Grid Search in Sparkling Water can traverse the hyper-space for H2OGBM, H2OXGBoost, H2ODRF, H2OGLM, H2OGAM, H2ODeepLearning, H2OKMeans, and H2OIsolationForest. For more details about the hyper-parameters of a specific algorithm, see the H2O-3 documentation.
Sparkling Water provides a Grid Search API in Scala and Python. The following sections describe how to apply Grid Search to H2ODRF in both languages. See also Parameters of H2OGridSearch.
- Scala
- Python
First, start Sparkling Shell:
./bin/sparkling-shell
Start the H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Load the data into a Spark DataFrame, cast the label column CAPSULE to a string, and split the data into training and testing sets:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Define the algorithm that will be the subject of hyper-parameter tuning:
import ai.h2o.sparkling.ml.algos.H2ODRF
val algo = new H2ODRF().setLabelCol("CAPSULE")
By default, the H2ODRF algorithm distinguishes between a classification and a regression problem based on the type of the label column of the training dataset. If the label column is a string column, a classification model will be trained. If the label column is a numeric column, a regression model will be trained. If you don’t want to worry about column data types, you can explicitly identify the problem by using ai.h2o.sparkling.ml.algos.classification.H2ODRFClassifier or ai.h2o.sparkling.ml.algos.regression.H2ODRFRegressor instead.
Define the hyper-space to be traversed:
import scala.collection.mutable.HashMap
val hyperSpace: HashMap[String, Array[AnyRef]] = HashMap()
hyperSpace += "ntrees" -> Array(1, 10, 30).map(_.asInstanceOf[AnyRef])
hyperSpace += "mtries" -> Array(-1, 5, 10).map(_.asInstanceOf[AnyRef])
Pass the algorithm and the hyper-space to the grid search and set the properties that define how the hyper-space will be traversed.
Sparkling Water supports two strategies for traversing the hyper-space:
Cartesian - (Default) This strategy tries out every possible combination of hyper-parameter values and finishes after the whole space is traversed.
RandomDiscrete - In each iteration, this strategy randomly selects a combination of values from the hyper-space, and the search can be terminated before the whole space is traversed. Termination depends on several criteria (see the parameters maxRuntimeSecs, maxModels, stoppingRounds, stoppingTolerance, and stoppingMetric). For details, see the H2O-3 documentation.
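To make the difference between the two strategies concrete, here is a minimal plain-Scala sketch that does not use Sparkling Water at all. The object and function names (HyperSpaceTraversalSketch, cartesian, randomDiscrete) are hypothetical, and shuffling the fully enumerated space is a simplification chosen for illustration, not a description of how H2O samples internally.

```scala
import scala.util.Random

// Hypothetical illustration only: none of these names belong to Sparkling Water.
object HyperSpaceTraversalSketch {
  type Combination = Map[String, Any]

  // Cartesian: enumerate every combination of hyper-parameter values.
  def cartesian(space: Seq[(String, Seq[Any])]): Seq[Combination] =
    space.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (name, values)) =>
      for (combo <- acc; value <- values) yield combo + (name -> value)
    }

  // Random discrete (simplified): visit combinations in random order and
  // stop once maxModels of them have been selected.
  def randomDiscrete(space: Seq[(String, Seq[Any])], maxModels: Int, seed: Long): Seq[Combination] =
    new Random(seed).shuffle(cartesian(space)).take(maxModels)

  def main(args: Array[String]): Unit = {
    // Same hyper-space as in this tutorial: 3 values x 3 values.
    val space = Seq(
      "ntrees" -> Seq[Any](1, 10, 30),
      "mtries" -> Seq[Any](-1, 5, 10))

    println(s"Cartesian visits ${cartesian(space).size} combinations") // 3 * 3 = 9
    randomDiscrete(space, maxModels = 4, seed = 42L).foreach(println)
  }
}
```

With the tutorial's hyper-space, the Cartesian strategy trains nine models, while a random search capped at four models trains fewer than half of them. With the strategies in mind, the grid search for this tutorial is configured as: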
import ai.h2o.sparkling.ml.algos.H2OGridSearch
val grid = new H2OGridSearch()
  .setHyperParameters(hyperSpace)
  .setAlgo(algo)
  .setStrategy("Cartesian")
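For comparison, a RandomDiscrete search might be configured as in the sketch below. It continues the same shell session, so algo and hyperSpace are the values defined earlier. The setter names setMaxModels and setMaxRuntimeSecs are assumed to mirror the maxModels and maxRuntimeSecs parameters named above, and the limits of 5 models and 60 seconds are arbitrary illustrative values. Check Parameters of H2OGridSearch for the exact names and types in your Sparkling Water version.

```scala
import ai.h2o.sparkling.ml.algos.H2OGridSearch

// Continues the shell session: `algo` and `hyperSpace` are already defined.
// Assumption: each setter mirrors a termination parameter listed above.
val randomGrid = new H2OGridSearch()
  .setHyperParameters(hyperSpace)
  .setAlgo(algo)
  .setStrategy("RandomDiscrete")
  .setMaxModels(5)        // illustrative limit: stop after at most 5 models
  .setMaxRuntimeSecs(60)  // illustrative limit: or stop after 60 seconds
```

Setting at least one termination criterion gives the random search a clear stopping point.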
Fit the grid search to get the best DRF model.
val model = grid.fit(trainingDF)
You can also get the raw model details by calling the getModelDetails() method on the model:
model.getModelDetails()
Run Predictions
model.transform(testingDF).show(false)