Train Extended Isolation Forest Model in Sparkling Water
Introduction
The Extended Isolation Forest algorithm generalizes its predecessor, Isolation Forest. The original Isolation Forest detects anomalies by isolating observations with random splits, but its scores are biased because every split is parallel to one of the feature axes. The extension mitigates this bias by splitting on randomly oriented hyperplanes, and the original algorithm becomes a special case (extension level 0). For a more comprehensive description, see the H2O-3 Extended Isolation Forest documentation.
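To make the role of the extension level more concrete, here is a minimal, standalone Python sketch of how one tree node could draw its split. It is an illustration of the idea from the Extended Isolation Forest paper, not H2O's implementation; the function names random_hyperplane and goes_left are hypothetical, and the intercept (a random point inside the node's data range in the paper) is passed in by hand.

```python
import random


def random_hyperplane(dim, extension_level, rng):
    """Draw the normal vector of a random split (illustrative only)."""
    if not 0 <= extension_level <= dim - 1:
        raise ValueError("extension_level must be in [0, dim - 1]")
    # One Gaussian component per feature.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Zero out (dim - extension_level - 1) randomly chosen components.
    # With extension_level = 0 only one component survives, so the split
    # is axis-parallel, as in the original Isolation Forest.
    for i in rng.sample(range(dim), dim - extension_level - 1):
        normal[i] = 0.0
    return normal


def goes_left(point, normal, intercept):
    """A point goes to the left child when (point - intercept) . normal <= 0."""
    return sum((x - p) * n for x, p, n in zip(point, intercept, normal)) <= 0


rng = random.Random(1234)
axis_parallel = random_hyperplane(7, 0, rng)   # Isolation Forest special case
fully_extended = random_hyperplane(7, 6, rng)  # all 7 features take part
```

With seven predictors, as in the example that follows, an extension level of 6 (the number of features minus one) lets every feature contribute to each split.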
Example
The following sections describe how to train an Extended Isolation Forest model in Sparkling Water in Scala and Python, following the same example as the H2O-3 documentation mentioned above. See also Parameters of H2OExtendedIsolationForest and Details of H2OExtendedIsolationForestMOJOModel.
Scala
First, start the Sparkling Shell:
./bin/sparkling-shell
Start an H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Load the data into a Spark DataFrame and split it into training and testing sets:
import org.apache.spark.SparkFiles
val datasetUrl = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"
spark.sparkContext.addFile(datasetUrl) // For example purposes only; on a real cluster, load directly from distributed storage.
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Train the model. You can configure all the available Extended Isolation Forest parameters using the provided setters. Here the extension level is set to its maximum, the number of features minus one.
import ai.h2o.sparkling.ml.algos.H2OExtendedIsolationForest
val predictors = Array("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
val algo = new H2OExtendedIsolationForest()
    .setSampleSize(256)
    .setNtrees(100)
    .setExtensionLevel(predictors.length - 1)
    .setSeed(1234)
    .setFeaturesCols(predictors)
val model = algo.fit(trainingDF)
Run predictions:
model.transform(testingDF).show(truncate = false)
View the model summary, which contains information such as details of the trained trees:
model.getModelSummary()
You can also get other model details by calling methods listed in Details of H2OExtendedIsolationForestMOJOModel.
Python
First, start the PySparkling Shell:
./bin/pysparkling
Start an H2O cluster inside the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Train the model. You can configure all the available Extended Isolation Forest parameters using the provided setters or constructor parameters.
from pysparkling.ml import H2OExtendedIsolationForest
predictors = ["AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
algo = H2OExtendedIsolationForest(featuresCols=predictors,
                                  sampleSize=256,
                                  ntrees=100,
                                  seed=1234,
                                  extensionLevel=len(predictors) - 1)
model = algo.fit(trainingDF)
Run predictions:
model.transform(testingDF).show(truncate=False)
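For readers who want to interpret the predictions, the following standalone Python sketch shows how an anomaly score is conventionally derived from the mean path length of an observation across the trees, using the normalization from the Isolation Forest paper. It assumes the model's score follows that standard formula; the helper names average_path_length and anomaly_score are hypothetical and are not part of the Sparkling Water API.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_length, sample_size):
    """s = 2 ** (-mean_length / c(sample_size)); values closer to 1 are more anomalous."""
    return 2.0 ** (-mean_length / average_path_length(sample_size))


# sampleSize=256 matches the training example above.
short_path = anomaly_score(2.0, 256)   # isolated quickly, so more anomalous
long_path = anomaly_score(12.0, 256)   # isolated slowly, so more normal
```

Under this formula, an observation whose mean path length equals c(sample size) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.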
View the model summary, which contains information such as details of the trained trees:
model.getModelSummary()
You can also get other model details by calling methods listed in Details of H2OExtendedIsolationForestMOJOModel.