compute_p_values
- Available in: GLM
- Hyperparameter: no
Description
Z-scores, standard errors, and p-values are classical statistical measures of model quality. Each p-value is the result of a hypothesis test on the corresponding coefficient, with the null hypothesis that the coefficient is zero. A high p-value means that the coefficient is unreliable (statistically insignificant), while a low p-value suggests that the coefficient is statistically significant. To have GLM compute p-values, enable the compute_p_values option.
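To make the relationship between these three quantities concrete, here is a minimal sketch of a two-sided Wald test. It assumes a normal approximation for the test statistic; the function name `wald_p_value` is hypothetical and is not part of H2O, and H2O's own computation may differ in detail (for example, in how dispersion is handled).

```python
import math

def wald_p_value(coefficient, std_error):
    """Return (z, p) for the null hypothesis that the coefficient is zero.

    Assumes the z-score is approximately standard normal under the null.
    """
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# A coefficient 1.96 standard errors away from zero sits near the 5% level.
z_example, p_example = wald_p_value(1.96, 1.0)
print(z_example, p_example)
```

The larger the coefficient is relative to its standard error, the larger the absolute z-score and the smaller the p-value.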
Notes:
- This option is only applicable when regularization is disabled (lambda=0) and when solver=IRLSM.
- If collinear columns exist in the data, then you must also specify remove_collinear_columns=TRUE. Otherwise, H2O will return an error.
- This option cannot be used with family=multinomial or with beta_constraints.
- GLM auto-standardizes the data by default. This changes the p-value of the constant term (intercept).
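The last note can be illustrated outside of H2O. The sketch below fits an ordinary least squares line in plain Python (not H2O's GLM) on made-up toy data, once with the raw predictor and once with the standardized predictor. The helper names `simple_ols_z_scores` and `standardize` are hypothetical. The slope's z-score is unchanged by standardization, while the intercept's z-score (and therefore its p-value) changes because the intercept now describes the response at the predictor's mean rather than at zero.

```python
import math

def simple_ols_z_scores(x, y):
    """Fit y = a + b*x by least squares; return (z_intercept, z_slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return intercept / se_intercept, slope / se_slope

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Made-up toy data, for illustration only.
x = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

z_int_raw, z_slope_raw = simple_ols_z_scores(x, y)
z_int_std, z_slope_std = simple_ols_z_scores(standardize(x), y)

print("slope z-score:     raw", z_slope_raw, "standardized", z_slope_std)
print("intercept z-score: raw", z_int_raw, "standardized", z_int_std)
```

This is only a linear-regression analogy under the stated assumptions; it shows why the intercept's p-value should be interpreted with the standardization setting in mind.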
Example
library(h2o)
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed ('YES') or not ('NO')
# original data can be found at http://www.transtats.bts.gov/
airlines <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines['FlightNum'] <- as.factor(airlines['FlightNum'])
# set the predictor names and the response column name
predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum")
response <- "IsDepDelayed"
# split into train and validation
airlines.splits <- h2o.splitFrame(data = airlines, ratios = .8)
train <- airlines.splits[[1]]
valid <- airlines.splits[[2]]
# try using the `compute_p_values` parameter:
airlines_glm <- h2o.glm(family = 'binomial', x = predictors, y = response, training_frame = train,
                        validation_frame = valid,
                        lambda = 0,
                        remove_collinear_columns = TRUE,
                        compute_p_values = TRUE)
# print the AUC for the validation data
print(h2o.auc(airlines_glm, valid = TRUE))
# take a look at the coefficients_table to see the p_values
airlines_glm@model$coefficients_table
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed ('YES') or not ('NO')
# original data can be found at http://www.transtats.bts.gov/
airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8])
# try using the `compute_p_values` parameter:
# initialize your estimator
airlines_glm = H2OGeneralizedLinearEstimator(family = 'binomial', lambda_ = 0,
                                             remove_collinear_columns = True,
                                             compute_p_values = True)
# then train your model
airlines_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the AUC for the validation data
print(airlines_glm.auc(valid=True))
# take a look at the coefficients_table to see the p_values
coeff_table = airlines_glm._model_json['output']['coefficients_table']
# convert table to a pandas dataframe
coeff_table.as_data_frame()