compute_p_values

  • Available in: GLM
  • Hyperparameter: no

Description

Z-scores, standard errors, and p-values are classical statistical measures of how reliably a model's coefficients are estimated. Each p-value comes from a hypothesis test of whether the corresponding coefficient differs from zero. A high p-value means that the data provide little evidence that the coefficient is nonzero (the coefficient is statistically insignificant), while a low p-value suggests that the coefficient is statistically significant. To have GLM compute p-values, enable the compute_p_values option.
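To illustrate how these three quantities relate, here is a minimal sketch of a two-sided Wald test on a single coefficient. It assumes a standard normal reference distribution; the exact test H2O applies may differ (for example, when the dispersion is estimated). The function name `wald_test` and the coefficient values are hypothetical and are not part of the H2O API.

```python
import math

def wald_test(coefficient, std_error):
    """Return (z, p_value) for the two-sided test of H0: coefficient == 0.

    Assumes the z-score follows a standard normal distribution under H0.
    """
    z = coefficient / std_error
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimates and standard errors (not taken from a real model)
examples = {"Distance": (0.50, 0.10), "Month": (0.05, 0.10)}
for name, (beta, se) in examples.items():
    z, p = wald_test(beta, se)
    print(f"{name}: z = {z:.2f}, p = {p:.4f}")
```

With the same standard error, the estimate that is larger in magnitude gets a larger z-score and therefore a smaller p-value.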

Notes:

  • This option is only applicable when regularization is disabled (lambda=0) and when solver=IRLSM.
  • If collinear columns exist in the data, then you must also specify remove_collinear_columns=TRUE. Otherwise, H2O will return an error.
  • This option cannot be used with family=multinomial.
  • GLM auto-standardizes the data by default. This changes the p-value of the constant term (intercept).

Example

library(h2o)
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed ("YES") or not ("NO")
# original data can be found at http://www.transtats.bts.gov/
airlines <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines["FlightNum"] <- as.factor(airlines["FlightNum"])

# set the predictor names and the response column name
predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum")
response <- "IsDepDelayed"

# split into train and validation
airlines.splits <- h2o.splitFrame(data = airlines, ratios = .8)
train <- airlines.splits[[1]]
valid <- airlines.splits[[2]]

# try using the `compute_p_values` parameter:
airlines_glm <- h2o.glm(family = 'binomial', x = predictors, y = response, training_frame = train,
                        validation_frame = valid,
                        lambda = 0,
                        remove_collinear_columns = TRUE,
                        compute_p_values = TRUE)

# print the AUC for the validation data
print(h2o.auc(airlines_glm, valid = TRUE))

# take a look at the coefficients_table to see the p_values
airlines_glm@model$coefficients_table