extension_level¶

• Available in: Extended Isolation Forest

• Hyperparameter: yes

Description¶

For Extended Isolation Forest, the extension_level hyperparameter lets you leverage the generalization of Isolation Forest. A value of 0 corresponds to standard Isolation Forest behavior because the split points are not randomized with a slope. For a dataset with $$P$$ features, the maximum extension_level is $$P-1$$ and means full extension. As the extension_level increases, the bias of the standard Isolation Forest is reduced. A lower extension level is suitable for domains where the ranges of the features differ strongly (for example, when one feature is measured in millimeters and another in meters). The following paragraphs give a more detailed explanation.

The branching criterion used by Extended Isolation Forest to split the data at a given data point $$x$$ is:

$$(x - p) \cdot n \le 0$$
where:
• $$x$$, $$p$$, and $$n$$ are vectors with $$P$$ features

• $$p$$ is a random intercept generated from a uniform distribution whose bounds come from the sub-sample of data to be split.

• $$n$$ is a random slope for the branching cut, generated from the $$\mathcal{N}(0,1)$$ distribution.

The function of extension_level is to force randomly chosen components of $$n$$ to be zero. The extension_level hyperparameter takes values between $$0$$ and $$P-1$$. A value of 0 means that all but one component of $$n$$ are zeroed, so every split is parallel to the axes, which corresponds to Isolation Forest's behavior. At extension level $$e$$, $$P-1-e$$ components of $$n$$ are set to zero, so higher values allow the splitting hyperplane to be oblique with respect to more axes. Full extension means extension_level is equal to $$P-1$$: no component is zeroed, so the slope of the branching cut is always fully randomized.
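The mechanism can be sketched in NumPy. This is a hypothetical illustration of the branching rule described above, not H2O's implementation: a random intercept $$p$$ is drawn uniformly within the sub-sample's per-feature bounds, a random slope $$n$$ is drawn from $$\mathcal{N}(0,1)$$, and $$P-1-\texttt{extension\_level}$$ randomly chosen components of $$n$$ are forced to zero before evaluating the criterion.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_split(X, extension_level):
    """Draw one random branching cut for sub-sample X (illustrative sketch)."""
    P = X.shape[1]
    # Random intercept: uniform within the sub-sample's per-feature bounds
    p = rng.uniform(X.min(axis=0), X.max(axis=0))
    # Random slope drawn from the standard normal distribution
    n = rng.standard_normal(P)
    # Force P - 1 - extension_level randomly chosen components of n to zero:
    # extension_level = 0 leaves a single non-zero component (Isolation Forest),
    # extension_level = P - 1 leaves all components non-zero (full extension).
    zero_idx = rng.choice(P, size=P - 1 - extension_level, replace=False)
    n[zero_idx] = 0.0
    return p, n

def goes_left(x, p, n):
    # Branching criterion: (x - p) . n <= 0
    return np.dot(x - p, n) <= 0

X = rng.uniform(0.0, 1.0, size=(256, 7))   # 256 points, P = 7 features
p0, n0 = make_split(X, extension_level=0)  # Isolation Forest behavior
assert np.count_nonzero(n0) == 1           # split parallel to all but one axis
p6, n6 = make_split(X, extension_level=6)  # full extension
assert np.count_nonzero(n6) == 7           # fully randomized slope
```

Each tree in the forest recursively applies such cuts to ever-smaller sub-samples; extension_level only controls how many components of the slope vector survive the zeroing step.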

For full insight into the extension_level hyperparameter, read the section "High Dimensional Data and Extension Levels" in the original paper.

Example¶

library(h2o)
h2o.init()

# Import the prostate dataset
prostate <- h2o.importFile(path = "https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")

# Set the predictors
predictors <- c("AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON")

# Build an Extended Isolation Forest model
model_if <- h2o.extendedIsolationForest(x = predictors,
                                        training_frame = prostate,
                                        model_id = "eif_if.hex",
                                        ntrees = 100,
                                        sample_size = 256,
                                        extension_level = 0)

# Use full-extension
model_eif <- h2o.extendedIsolationForest(x = predictors,
                                         training_frame = prostate,
                                         model_id = "eif.hex",
                                         ntrees = 100,
                                         sample_size = 256,
                                         extension_level = length(predictors) - 1)

# Calculate score
score_if <- h2o.predict(model_if, prostate)
anomaly_score_if <- score_if$anomaly_score

score_eif <- h2o.predict(model_eif, prostate)
anomaly_score_eif <- score_eif$anomaly_score

import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator
h2o.init()

# Import the prostate dataset
h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")

# Set the predictors
predictors = ["AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]

# Simulate Isolation Forest behavior with Extended Isolation Forest algorithm
eif_if = H2OExtendedIsolationForestEstimator(model_id = "eif_if.hex",
                                             ntrees = 100,
                                             extension_level = 0)

# Use full-extension
eif_full = H2OExtendedIsolationForestEstimator(model_id = "eif_full.hex",
                                               ntrees = 100,
                                               extension_level = len(predictors) - 1)

eif_if.train(x = predictors, training_frame = h2o_df)
eif_full.train(x = predictors, training_frame = h2o_df)

# Calculate score
eif_if_result = eif_if.predict(h2o_df)
eif_full_result = eif_full.predict(h2o_df)
print(eif_if_result["anomaly_score"])
print(eif_full_result["anomaly_score"])