library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 hours 14 minutes
## H2O cluster timezone: Europe/Prague
## H2O data parsing timezone: UTC
## H2O cluster version: 3.39.0.99999
## H2O cluster version age: 1 hour and 18 minutes
## H2O cluster name: tomasfryda
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.23 GB
## H2O cluster total cores: 16
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.2.2 (2022-10-31)
# Import HMDA dataset
f <- "https://erin-data.s3.amazonaws.com/admissible/data/hmda_lar_2018_sample.csv"
col_types <- list(by.col.name = c("high_priced"),
                  types = c("factor"))
df <- h2o.importFile(path = f, col.types = col_types)
splits <- h2o.splitFrame(df, ratios = 0.8, seed = 1)
train <- splits[[1]]
test <- splits[[2]]
# Response column and predictor columns
y <- "high_priced"
x <- c("loan_amount",
       "loan_to_value_ratio",
       "loan_term",
       "intro_rate_period",
       "property_value",
       "income",
       "debt_to_income_ratio")
# Fairness related information
protected_columns <- c("derived_race", "derived_sex")
reference <- c("White", "Male")
favorable_class <- "0"
# Infogram
ig <- h2o.infogram(y = y, x = x, training_frame = train, protected_columns = protected_columns)
plot(ig)
# Admissible score frame
asf <- ig@admissible_score
asf
## column admissible admissible_index relevance_index safety_index
## 1 loan_to_value_ratio 1 1.00000000 1.00000000 1.00000000
## 2 property_value 1 0.23240482 0.14543085 0.29474373
## 3 loan_amount 1 0.17695566 0.12252578 0.21820642
## 4 income 0 0.10680164 0.02650036 0.14869739
## 5 intro_rate_period 0 0.06278644 0.05513249 0.06960376
## 6 loan_term 0 0.04623597 0.05506992 0.03525385
## cmi_raw
## 1 0.094140609
## 2 0.027747354
## 3 0.020542085
## 4 0.013998463
## 5 0.006552540
## 6 0.003318819
##
## [7 rows x 6 columns]
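The printed frame can be sanity-checked by hand. The sketch below copies the six displayed rows into a plain data frame and makes two checks. The first is an assumption: a feature is flagged admissible when both its relevance and safety index reach 0.1 (the cutoff is not stated in this document). The second is an observation about these numbers only: the printed `admissible_index` matches the Euclidean norm of the two indices divided by `sqrt(2)`. The names `asf_df`, `admissible_check`, and `index_check` are made up for this illustration.

```r
# Values copied from the first six rows of the printed admissible score frame.
asf_df <- data.frame(
  column = c("loan_to_value_ratio", "property_value", "loan_amount",
             "income", "intro_rate_period", "loan_term"),
  admissible_index = c(1.00000000, 0.23240482, 0.17695566,
                       0.10680164, 0.06278644, 0.04623597),
  relevance_index = c(1.00000000, 0.14543085, 0.12252578,
                      0.02650036, 0.05513249, 0.05506992),
  safety_index = c(1.00000000, 0.29474373, 0.21820642,
                   0.14869739, 0.06960376, 0.03525385)
)

# Assumption: admissible means both indices are at least 0.1.
threshold <- 0.1
asf_df$admissible_check <- as.integer(
  asf_df$relevance_index >= threshold & asf_df$safety_index >= threshold
)

# Observation on these rows: admissible_index looks like the normalized
# distance of (relevance_index, safety_index) from the origin.
asf_df$index_check <-
  sqrt(asf_df$relevance_index^2 + asf_df$safety_index^2) / sqrt(2)

print(asf_df[, c("column", "admissible_check", "index_check")])
```

On these rows the recomputed flags agree with the `admissible` column, which suggests `income` is excluded for low relevance rather than low safety.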
da <- h2o.no_progress(
  h2o.infogram_train_subset_models(ig, h2o.automl, train, test, y,
                                   protected_columns, reference, favorable_class,
                                   max_models = 10, seed = 1)
)
da
pf <- h2o.pareto_front(da, x_metric = "auc", y_metric = "significant_air_min", optimum = "top right", color = "algo")
plot(pf)
pf@pareto_front
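To make the idea of a Pareto front concrete, here is a minimal base R sketch, not H2O's implementation. It assumes that `optimum = "top right"` means both metrics are maximized, and it uses a made-up leaderboard; the `models` data frame, its values, and the helper `is_dominated` are hypothetical.

```r
# Hypothetical leaderboard: these values are invented for illustration only.
models <- data.frame(
  model_id = c("m1", "m2", "m3", "m4", "m5"),
  auc = c(0.80, 0.85, 0.78, 0.79, 0.85),
  significant_air_min = c(0.90, 0.70, 0.95, 0.85, 0.60)
)

# Model i is dominated when some other model is at least as good on both
# metrics and strictly better on at least one (both metrics are maximized).
is_dominated <- function(i, metric_x, metric_y) {
  any(metric_x >= metric_x[i] & metric_y >= metric_y[i] &
        (metric_x > metric_x[i] | metric_y > metric_y[i]))
}

dominated <- vapply(seq_len(nrow(models)), is_dominated, logical(1),
                    metric_x = models$auc,
                    metric_y = models$significant_air_min)
on_front <- !dominated

# The non-dominated models form the Pareto front.
print(models[on_front, ])
```

Models on the front represent the available trade-offs: none of them can be improved on one metric without giving up some of the other.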
potentially_fair_model <- h2o.getModel(da[da$significant_air_min > 0.8, "model_id"][[1]])
h2o.inspect_model_fairness(potentially_fair_model, test, protected_columns, reference, favorable_class)
The following table shows fairness metrics for the intersections defined by the protected_columns. Besides the fairness metrics, it reports a p-value from Fisher’s exact test or a G-test (depending on the size of the intersections) for the hypothesis that being selected (positive response) is independent of membership in the reference group versus a particular protected group. Two kinds of plots follow the table. Plots of the first kind carry the AIR prefix, which stands for Adverse Impact Ratio; they show values relative to the reference group, along with two dashed lines at 0.8 and 1.25 (the four-fifths rule). Plots of the second kind show the absolute value of the given metrics, with the reference group drawn as a differently colored bar.
| | derived_race | derived_sex | auc | aucpr | f1 | p.value | selectedRatio | total | AIR_auc | AIR_aucpr | AIR_f1 | AIR_selectedRatio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0.8571429 | 0.9797694 | 0.8888889 | 0.7218205 | 0.8125 | 16 | 1.021821 | 1.006851 | 0.9671162 | 0.9551502 |
2 | 2 | 1 | 0.7587209 | 0.9484587 | 0.825 | 0.01794821 | 0.7254902 | 51 | 0.9044899 | 0.9746752 | 0.8976047 | 0.8528641 |
3 | 3 | 1 | 0.916327 | 0.9850182 | 0.9608637 | 0.008695329 | 0.8966346 | 416 | 1.092376 | 1.012245 | 1.045425 | 1.054056 |
4 | 4 | 1 | 0.7523934 | 0.8963338 | 0.8166144 | 5.958565e-31 | 0.6866516 | 884 | 0.8969466 | 0.9211095 | 0.8884811 | 0.8072066 |
5 | 5 | 1 | 0.9298246 | 0.9894649 | 0.95 | 0.2362586 | 0.9545455 | 22 | 1.108467 | 1.016815 | 1.033605 | 1.122134 |
6 | 6 | 1 | 0.7880266 | 0.964529 | 0.9204819 | 0.3187179 | 0.8686441 | 472 | 0.9394258 | 0.9911897 | 1.00149 | 1.021151 |
7 | 7 | 1 | 0.8345209 | 0.9715174 | 0.9197698 | 0.4772562 | 0.8551985 | 5290 | 0.9948529 | 0.9983713 | 1.000715 | 1.005345 |
8 | 1 | 2 | 0.5357143 | 0.8967344 | 0.8461538 | 0.2830527 | 0.75 | 16 | 0.6386382 | 0.9215212 | 0.9206202 | 0.8816771 |
9 | 2 | 2 | 0.8155172 | 0.9542658 | 0.9026549 | 0.3090088 | 0.8088235 | 68 | 0.9721982 | 0.9806428 | 0.9820936 | 0.9508282 |
10 | 3 | 2 | 0.9533133 | 0.9971094 | 0.9692875 | 8.738433e-12 | 0.9297235 | 868 | 1.136468 | 1.024671 | 1.05459 | 1.092955 |
11 | 4 | 2 | 0.7505808 | 0.9194667 | 0.8421818 | 2.56071e-14 | 0.7469066 | 889 | 0.8947858 | 0.9448818 | 0.9162986 | 0.8780406 |
12 | 5 | 2 | 0.7419355 | 0.9312377 | 0.9354839 | 1 | 0.8611111 | 36 | 0.8844795 | 0.9569782 | 1.017812 | 1.012296 |
13 | 6 | 2 | 0.8616255 | 0.984328 | 0.9272152 | 0.272596 | 0.8663854 | 711 | 1.027165 | 1.011536 | 1.008815 | 1.018496 |
14 | 7 | 2 | 0.8388385 | 0.9731023 | 0.9191129 | 1 | 0.8506516 | 8825 | 1 | 1 | 1 | 1 |
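The AIR columns can be reproduced from the table by hand. The sketch below uses rows 4 and 14, treating row 14 as the reference group because all of its AIR values equal 1. It also runs Fisher's exact test on selection counts recovered by rounding `selectedRatio * total`, which is an assumption. The resulting p-value need not equal the printed one, since the report may have used a G-test for intersections of this size. All variable names are made up for this illustration.

```r
# Selection ratios and group sizes copied from rows 4 and 14 of the table.
sel_ratio_protected <- 0.6866516
total_protected <- 884
sel_ratio_reference <- 0.8506516
total_reference <- 8825

# Adverse Impact Ratio: protected group's selection ratio over the reference's.
air <- sel_ratio_protected / sel_ratio_reference

# Four-fifths rule: the ratio should fall between 0.8 and 1.25.
passes_four_fifths <- air >= 0.8 && air <= 1.25

# Assumption: selected counts are recovered by rounding ratio * total.
selected_protected <- round(sel_ratio_protected * total_protected)
selected_reference <- round(sel_ratio_reference * total_reference)

contingency <- matrix(
  c(selected_protected, total_protected - selected_protected,
    selected_reference, total_reference - selected_reference),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("protected", "reference"),
                  outcome = c("selected", "not_selected"))
)

# Test whether being selected is independent of group membership.
p_value <- fisher.test(contingency)$p.value

print(c(air = air, p_value = p_value))
```

This intersection sits just above the 0.8 line, yet its selection rate still differs significantly from the reference group, so the AIR and the p-value are worth reading together.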
The following plot shows a Receiver Operating Characteristic (ROC) curve for each intersection. It can be used to select different classifier thresholds so that the model becomes fairer in some sense; this is described in, e.g., HARDT, Moritz, PRICE, Eric and SREBRO, Nathan, 2016. Equality of Opportunity in Supervised Learning. arXiv:1610.02413.
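One way to read the per-intersection ROC curves is as a menu of thresholds for each group. The sketch below illustrates a simple variant of that idea: choosing a separate threshold per group so that the true positive rates roughly match. It runs on synthetic scores, not the HMDA data, and is not the procedure from the cited paper or from H2O; `simulate_group`, `tpr_at`, and the target of 0.8 are all hypothetical.

```r
set.seed(1)

# Synthetic scores for two hypothetical groups; group "b" scores are shifted
# downward, so a shared threshold treats the two groups differently.
simulate_group <- function(n, shift) {
  label <- rbinom(n, 1, 0.5)
  score <- plogis(rnorm(n, mean = 2 * label - 1 + shift))
  data.frame(label = label, score = score)
}
groups <- list(a = simulate_group(2000, 0), b = simulate_group(2000, -0.7))

# True positive rate of one group at a given threshold.
tpr_at <- function(d, threshold) mean(d$score[d$label == 1] >= threshold)

# Per group, take the (1 - target) quantile of the positive-class scores,
# so that about `target_tpr` of the positives score at or above it.
target_tpr <- 0.8
thresholds <- sapply(groups, function(d) {
  unname(quantile(d$score[d$label == 1], probs = 1 - target_tpr))
})

tpr_shared <- sapply(groups, tpr_at, threshold = 0.5)
tpr_per_group <- mapply(tpr_at, groups, thresholds)

print(rbind(shared_threshold = tpr_shared, per_group_threshold = tpr_per_group))
```

Whether group-specific thresholds are appropriate or even permitted depends on the application, so this is only meant to show what the ROC curves make possible.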
The following plot shows a Precision-Recall Curve for each intersection.
Permutation variable importance is obtained by measuring the distance between prediction errors before and after a feature is permuted; only one feature at a time is permuted.
| | Variable | Relative Importance | Scaled Importance | Percentage |
---|---|---|---|---|
1 | loan_to_value_ratio | 0.0371030658080262 | 1 | 0.614613500967895 |
2 | loan_amount | 0.00952262914395156 | 0.256653431099799 | 0.15774266382367 |
3 | property_value | 0.00795377124935635 | 0.214369650489523 | 0.13175448138863 |
4 | intro_rate_period | 0.00337937303680622 | 0.0910806954414851 | 0.0559794250958818 |
5 | income | 0.00240928764094089 | 0.0649350016897986 | 0.0399099287239233 |
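The permutation procedure described above can be sketched in a few lines of base R. This is an illustration on synthetic data with a linear model and mean squared error, not H2O's implementation; the data, the model, and the error metric are all assumptions. The scaled and percentage columns are computed as importance divided by the maximum and by the sum, which matches the numbers in the table above.

```r
set.seed(1)

# Synthetic data: x1 drives the response, x2 is a weak signal, x3 is noise.
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 3 * d$x1 + 0.5 * d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = d)
mse <- function(data) mean((data$y - predict(fit, newdata = data))^2)
baseline <- mse(d)

# Permute one feature at a time and record how much the error grows.
importance <- sapply(c("x1", "x2", "x3"), function(feature) {
  permuted <- d
  permuted[[feature]] <- sample(permuted[[feature]])
  mse(permuted) - baseline
})

# Derived columns, mirroring the table layout.
scaled <- importance / max(importance)
percentage <- importance / sum(importance)

print(data.frame(relative = importance, scaled = scaled, percentage = percentage))
```

A feature the model does not rely on can get an importance near zero or even slightly negative, because permuting it changes the error only by chance.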
The following plots show partial dependence for each intersection separately. These plots can be used to see how membership in a particular intersection influences the dependence on a given feature.
The following plots show SHAP contributions for individual intersections, one feature at a time. These plots can be used to see how membership in a particular intersection influences the dependence on a given feature.