link
Available in: GLM, GAM
Hyperparameter: no
Description
GLM and GAM problems consist of three main components:
A random component \(f\) for the dependent variable \(y\): The density function \(f(y;\theta,\phi)\) belongs to the exponential family of distributions, parameterized by \(\theta\) and \(\phi\). This relaxes the restriction on the distribution of the error and allows the variance to vary with the mean (non-homogeneous variance).
A systematic component (linear model) \(\eta\): \(\eta = X\beta\), where \(X\) is the matrix of all observation vectors \(x_i\).
A link function \(g\): \(E(y) = \mu = g^{-1}(\eta)\) relates the expected value of the response \(\mu\) to the linear component \(\eta\). The link function can be any monotonic differentiable function. This relaxes the constraints on the additivity of the covariates, and it allows the response to belong to a restricted range of values depending on the chosen transformation \(g\).
Accordingly, in order to specify a GLM or GAM problem, you must choose a family function \(f\), link function \(g\), and any parameters needed to train the model.
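The relationship between the systematic component and the link function can be illustrated with a minimal sketch in plain Python. This is not H2O code: the names `LINKS`, `linear_predictor`, and `expected_response` are hypothetical, the coefficient values are made up, and only the four standard links (identity, log, inverse, logit) are shown; Tweedie and Ologit are omitted.

```python
import math

# Each entry maps a link name to the pair (g, g_inverse).
# Illustrative only; these are textbook definitions, not H2O internals.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def linear_predictor(x, beta):
    """Systematic component for one observation: eta = x . beta."""
    return sum(x_j * b_j for x_j, b_j in zip(x, beta))


def expected_response(x, beta, link):
    """Mean of the response: mu = g^{-1}(eta)."""
    _, g_inverse = LINKS[link]
    return g_inverse(linear_predictor(x, beta))


x = [1.0, 2.0]       # intercept term plus one covariate (made-up values)
beta = [0.5, -0.25]  # made-up coefficients, so eta = 0.5 - 0.5 = 0.0

print(expected_response(x, beta, "identity"))  # mu equals eta
print(expected_response(x, beta, "logit"))     # mu is mapped into (0, 1)
print(expected_response(x, beta, "log"))       # mu is mapped into (0, inf)
```

The same linear predictor yields a different mean under each link, which is how the choice of \(g\) restricts the response to a particular range.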
H2O’s GLM and GAM support the following link functions: Family_Default, Identity, Logit, Log, Inverse, Tweedie, and Ologit.
The following table describes the allowed Family/Link combinations.
| Family / Link Function | Family_Default | Identity | Logit | Log | Inverse | Tweedie | Ologit |
|---|---|---|---|---|---|---|---|
| Binomial | X | | X | | | | |
| Fractional Binomial | X | | X | | | | |
| Quasibinomial | X | | X | | | | |
| Multinomial | X | | | | | | |
| Ordinal | X | | | | | | X |
| Gaussian | X | X | | X | X | | |
| Poisson | X | X | | X | | | |
| Gamma | X | X | | X | X | | |
| Tweedie | X | | | | | X | |
| Negative Binomial | X | X | | X | | | |
Refer to the Links section for more information.
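For quick checks before training, the allowed Family/Link combinations can be written down as a lookup. The sketch below is a hand transcription of the table above, not an H2O API: `ALLOWED_LINKS` and `is_allowed` are hypothetical names, and the keys use the table's display labels rather than the strings H2O expects for the `family` and `link` parameters.

```python
# Allowed links per family, transcribed by hand from the Family/Link table.
# Keys and values are the table's display labels (an assumption made for
# readability), not necessarily the exact strings H2O accepts.
ALLOWED_LINKS = {
    "Binomial": {"Family_Default", "Logit"},
    "Fractional Binomial": {"Family_Default", "Logit"},
    "Quasibinomial": {"Family_Default", "Logit"},
    "Multinomial": {"Family_Default"},
    "Ordinal": {"Family_Default", "Ologit"},
    "Gaussian": {"Family_Default", "Identity", "Log", "Inverse"},
    "Poisson": {"Family_Default", "Identity", "Log"},
    "Gamma": {"Family_Default", "Identity", "Log", "Inverse"},
    "Tweedie": {"Family_Default", "Tweedie"},
    "Negative Binomial": {"Family_Default", "Identity", "Log"},
}


def is_allowed(family, link):
    """Return True if the table lists `link` as valid for `family`."""
    return link in ALLOWED_LINKS.get(family, set())


print(is_allowed("Gaussian", "Log"))       # combination listed in the table
print(is_allowed("Binomial", "Identity"))  # combination not listed
```

Every family accepts Family_Default, so leaving `link` at its default is always a valid choice.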
Example
# R example:
library(h2o)
h2o.init()
# import the iris dataset:
# this dataset is used to classify the type of iris plant
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Iris
iris <- h2o.importFile("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# convert response column to a factor
iris['class'] <- as.factor(iris['class'])
# set the predictor names and the response column name
predictors <- colnames(iris)[-length(iris)]
response <- 'class'
# split into train and validation
iris_splits <- h2o.splitFrame(data = iris, ratios = 0.8)
train <- iris_splits[[1]]
valid <- iris_splits[[2]]
# try using the `link` parameter:
iris_glm <- h2o.glm(x = predictors, y = response, family = 'multinomial', link = 'family_default',
training_frame = train, validation_frame = valid)
# print the logloss for the validation data
print(h2o.logloss(iris_glm, valid = TRUE))

# Python example:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
# import the iris dataset:
# this dataset is used to classify the type of iris plant
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Iris
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# convert response column to a factor
iris['class'] = iris['class'].asfactor()
# set the predictor names and the response column name
predictors = iris.columns[:-1]
response = 'class'
# split into train and validation sets
train, valid = iris.split_frame(ratios = [.8])
# try using the `link` parameter:
# Initialize and train a GLM
iris_glm = H2OGeneralizedLinearEstimator(family = 'multinomial', link = 'family_default')
iris_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the logloss for the validation data
print(iris_glm.logloss(valid = True))