Train Isolation Forest Model in Sparkling Water¶
Sparkling Water provides API for H2O Isolation Forest in Scala and Python. The following sections describe how to train the Isolation Forest model in Sparkling Water in both languages. See also Parameters of H2OIsolationForest.
Scala
First, let’s start Sparkling Shell as
./bin/sparkling-shell
Start H2O cluster inside the Spark environment
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to Spark Frame
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_train.csv")
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_test.csv")
val trainingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("ecg_discord_train.csv"))
val testingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("ecg_discord_test.csv"))
Train the model. You can configure all the available Isolation Forest arguments using provided setters.
import ai.h2o.sparkling.ml.algos.H2OIsolationForest
val estimator = new H2OIsolationForest()
val model = estimator.fit(trainingDF)
You can also get raw model details by calling the getModelDetails() method available on the model as:
model.getModelDetails()
Run Predictions
model.transform(testingDF).show(false)
Python
First, let’s start PySparkling Shell as
./bin/pysparkling
Start H2O cluster inside the Spark environment
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O and convert them to Spark Frame
import h2o
trainingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_train.csv")
trainingDF = hc.asSparkFrame(trainingFrame)
testingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_test.csv")
testingDF = hc.asSparkFrame(testingFrame)
Train the model. You can configure all the available Isolation Forest arguments using provided setters or constructor parameters.
from pysparkling.ml import H2OIsolationForest
estimator = H2OIsolationForest()
model = estimator.fit(trainingDF)
You can also get raw model details by calling the getModelDetails() method available on the model as:
model.getModelDetails()
Run Predictions
model.transform(testingDF).show(truncate = False)
Train Isolation Forest with H2OGridSearch¶
If you’re not sure about exact values for hyper-parameters of Isolation Forest, you can plug H2OIsolationForest
to
H2OGridSearch
and define a hyper-parameter space to be walked through. Unlike other Sparkling Water algorithms used in
H2OGridSearch
, you must pass validationDataFrame
to H2OIsolationForest
as a parameter in order to
H2OGridSearch
be able to evaluate produced models. The validation data frame has to have an extra column identifying
whether the row represents an anomaly or not. The column can contain only two string values, where a value for the negative
case, must be alphabetically smaller then the value for the positive case. E.g.: "0"
/"1"
, "no"
/"yes"
,
"false"
/"true"
, etc.
Scala
Let’s load a training and validation dataset at first:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate_anomaly_validation.csv")
val trainingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val validationDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate_anomaly_validation.csv"))
Create an algorithm instance, pass validation data frame, and specify a column identifying an anomaly:
import ai.h2o.sparkling.ml.algos.H2OIsolationForest
val algorithm = new H2OIsolationForest()
algorithm.setValidationDataFrame(validationDF)
algorithm.setValidationLabelCol("isAnomaly")
Define a hyper-parameter space:
import scala.collection.mutable
val hyperParams: mutable.HashMap[String, Array[AnyRef]] = mutable.HashMap()
hyperParams += "ntrees" -> Array(10, 20, 30).map(_.asInstanceOf[AnyRef])
hyperParams += "maxDepth" -> Array(5, 10, 20).map(_.asInstanceOf[AnyRef])
Pass the prepared hyper-parameter space and algorithm to H2OGridSearch
and run it:
import ai.h2o.sparkling.ml.algos.H2OGridSearch
val grid = new H2OGridSearch()
grid.setAlgo(algorithm)
grid.setHyperParameters(hyperParams)
val model = grid.fit(trainingDF)
Logloss
is a default metric for the model comparision produced by grids and can be changed via the method
setSelectBestModelBy
on H2OGridSearch
.
Python
Let’s load a training and validation dataset at first:
import h2o
trainingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
trainingDF = hc.asSparkFrame(trainingFrame)
validationFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate_anomaly_validation.csv")
validationDF = hc.asSparkFrame(validationFrame)
Create an algorithm instance, pass validation data frame, and specify a column identifying an anomaly:
from pysparkling.ml import H2OIsolationForest
algorithm = H2OIsolationForest(validationDataFrame=validationDF, validationLabelCol="isAnomaly")
Define a hyper-parameter space:
hyperSpace = {"ntrees": [10, 20, 30], "maxDepth": [5, 10, 20]}
Pass the prepared hyper-parameter space and algorithm to H2OGridSearch
and run it:
from pysparkling.ml import H2OGridSearch
grid = H2OGridSearch(hyperParameters=hyperSpace, algo=algorithm)
model = grid.fit(trainingDF)
Logloss
is a default metric for the model comparision produced by grids and can be changed via the method
setSelectBestModelBy
on H2OGridSearch
.