link
- Available in: GLM
- Hyperparameter: no
Description
GLM problems consist of three main components:
- A random component $f$ for the dependent variable $y$: the density function $f(y; \theta, \phi)$ belongs to the exponential family of distributions, parameterized by $\theta$ and $\phi$. This removes the requirement that the errors be normally distributed and allows the variance to depend on the mean.
- A systematic component (linear model) $\eta$: $\eta = X\beta$, where $X$ is the matrix of all observation vectors $x_i$.
- A link function $g$: $E(y) = \mu = g^{-1}(\eta)$ relates the expected value of the response $\mu$ to the linear component $\eta$. The link function can be any monotonic differentiable function. This relaxes the constraints on the additivity of the covariates, and it allows the response to be restricted to a range of values determined by the chosen transformation $g$.
Accordingly, in order to specify a GLM problem, you must choose a family function f, link function g, and any parameters needed to train the model.
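To make the relationship between the linear component and the expected response concrete, here is a minimal standalone sketch in Python. It is not H2O's implementation: the `LINKS` dictionary, the helper names, and the observation and coefficient values are all hypothetical, chosen only for illustration. The Tweedie link is omitted for brevity.

```python
import math

# Each entry maps a link name to the pair (g, g_inverse).
# The names mirror H2O's link options, but this is only an illustration.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}


def linear_predictor(x, beta):
    """Systematic component for one observation: eta = x . beta."""
    return sum(xi * bi for xi, bi in zip(x, beta))


def expected_response(x, beta, link):
    """Expected response: mu = g^{-1}(eta)."""
    _, g_inv = LINKS[link]
    return g_inv(linear_predictor(x, beta))


x = [1.0, 2.0]       # hypothetical observation (first entry is the intercept term)
beta = [0.5, 0.25]   # hypothetical coefficients, so eta = 1.0

for name in LINKS:
    print(name, expected_response(x, beta, name))
```

The same value of the linear component yields a different expected response under each link: the logit link maps it into the interval (0, 1), and the log link maps it to a positive number, which is how the choice of link restricts the range of the response.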
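H2O’s GLM supports the following values for the link option: Family_Default, Identity, Logit, Log, Inverse, and Tweedie. Family_Default is not a link function in itself; it selects the default link for the chosen family.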
The following table describes the allowed Family/Link combinations.
| Family | Family_Default | Identity | Logit | Log | Inverse | Tweedie |
|---|---|---|---|---|---|---|
| Binomial | X | | X | | | |
| Quasibinomial | X | | X | | | |
| Multinomial | X | | | | | |
| Gaussian | X | X | | X | X | |
| Poisson | X | X | | X | | |
| Gamma | X | X | | X | X | |
| Tweedie | X | | | | | X |
Refer to the Links section for more information.
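The allowed combinations can also be written as a lookup for checking a family/link pair before training. The sketch below is a hypothetical helper, not part of the H2O API. The `ALLOWED_LINKS` mapping is one reading of the table above and should be checked against the H2O documentation for your version.

```python
# Hypothetical lookup of allowed links per family (one reading of the table above).
ALLOWED_LINKS = {
    "binomial": {"family_default", "logit"},
    "quasibinomial": {"family_default", "logit"},
    "multinomial": {"family_default"},
    "gaussian": {"family_default", "identity", "log", "inverse"},
    "poisson": {"family_default", "identity", "log"},
    "gamma": {"family_default", "identity", "log", "inverse"},
    "tweedie": {"family_default", "tweedie"},
}


def check_family_link(family, link="family_default"):
    """Return True if the pair is allowed; raise ValueError otherwise."""
    family, link = family.lower(), link.lower()
    if family not in ALLOWED_LINKS:
        raise ValueError(f"unknown family: {family!r}")
    if link not in ALLOWED_LINKS[family]:
        allowed = ", ".join(sorted(ALLOWED_LINKS[family]))
        raise ValueError(
            f"link {link!r} is not allowed for family {family!r}; "
            f"allowed links: {allowed}"
        )
    return True


print(check_family_link("gaussian", "log"))
```

Raising an error early, rather than passing an invalid pair to the model builder, keeps the failure close to the configuration mistake.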
Example
library(h2o)
h2o.init()
# import the iris dataset:
# this dataset is used to classify the type of iris plant
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Iris
iris <- h2o.importFile("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# convert response column to a factor
iris['class'] <- as.factor(iris['class'])
# set the predictor names and the response column name
predictors <- colnames(iris)[-length(iris)]
response <- 'class'
# split into train and validation
iris.splits <- h2o.splitFrame(data = iris, ratios = .8)
train <- iris.splits[[1]]
valid <- iris.splits[[2]]
# try using the `link` parameter:
iris_glm <- h2o.glm(x = predictors, y = response, family = 'multinomial', link = 'family_default',
                    training_frame = train, validation_frame = valid)
# print the logloss for the validation data
print(h2o.logloss(iris_glm, valid = TRUE))