init
- Available in: GLRM, K-means
- Hyperparameter: yes
Description
This option specifies the initialization mode used in K-Means. The options are Random, Furthest, PlusPlus, and User.
- Random: Choose K clusters from the set of N observations at random so that each observation has an equal chance of being chosen.
- Furthest (Default), illustrated in the sketch after this list:
  - Choose one center m_{1} at random.
  - Calculate the distance between m_{1} and each of the remaining N−1 observations x_{i}: d(x_{i}, m_{1}) = ‖x_{i} − m_{1}‖^{2}
  - Choose m_{2} to be the x_{i} that maximizes d(x_{i}, m_{1}).
  - Repeat until K centers have been chosen.
- PlusPlus, also illustrated in the sketch after this list:
  - Choose one center m_{1} at random.
  - Calculate the distance between m_{1} and each of the remaining N−1 observations x_{i}: d(x_{i}, m_{1}) = ‖x_{i} − m_{1}‖^{2}
  - Let P(i) be the probability of choosing x_{i} as m_{2}. Weight P(i) by d(x_{i}, m_{1}) so that those x_{i} furthest from m_{1} have a higher probability of being selected than those x_{i} close to m_{1}.
  - Choose the next center m_{2} by drawing at random according to the weighted probability distribution.
  - Repeat until K centers have been chosen.
- User: allows you to specify a file (using the `user_points` parameter) that includes a vector of initial cluster centers. (A minimal sketch of this option follows the notes below.)
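To make the difference between the two random-start strategies concrete, the following is a minimal R sketch, not H2O's internal implementation, of how Furthest and PlusPlus pick initial centers from a plain numeric matrix. For centers beyond the second it uses each observation's distance to its nearest already-chosen center, a common generalization of the two-center step described above; the matrix `X` and the helper `init_centers` are illustrative names.
# illustrative only: a plain-R sketch of the Furthest and PlusPlus selection logic
set.seed(1234)
X <- matrix(rnorm(100 * 2), ncol = 2)   # 100 observations, 2 features
k <- 3
sq_dist <- function(a, b) sum((a - b)^2)  # squared Euclidean distance
init_centers <- function(X, k, method = c("furthest", "plusplus")) {
  method <- match.arg(method)
  centers <- X[sample(nrow(X), 1), , drop = FALSE]   # first center chosen at random
  while (nrow(centers) < k) {
    # distance from each observation to its nearest already-chosen center
    d <- apply(X, 1, function(x) min(apply(centers, 1, function(ctr) sq_dist(x, ctr))))
    if (method == "furthest") {
      nxt <- which.max(d)                  # Furthest: take the most distant observation
    } else {
      nxt <- sample(nrow(X), 1, prob = d)  # PlusPlus: sample with probability proportional to distance
    }
    centers <- rbind(centers, X[nxt, , drop = FALSE])
  }
  centers
}
init_centers(X, k, "furthest")
init_centers(X, k, "plusplus")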
Notes:
- The user-specified points dataset must have the same number of columns as the training observations.
- This option is ignored when `estimate_k` is enabled. In that case, the algorithm is deterministic.
- If this option is not specified but a user-points file is specified, then this value defaults to `user`.
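For User initialization, a hedged sketch is shown below. It assumes the R interface exposes `init = "User"` together with the `user_points` parameter as described above, and it uses the built-in iris data purely as a stand-in for your own training frame; the column names and center values are illustrative.
library(h2o)
h2o.init()
# training frame: the four numeric iris columns (stand-in data)
iris_hex <- as.h2o(iris[, 1:4])
# one row per desired cluster center, with the same columns as the training data
centers <- as.h2o(data.frame(Sepal.Length = c(5.0, 6.0, 7.0),
                             Sepal.Width  = c(3.4, 2.8, 3.0),
                             Petal.Length = c(1.5, 4.5, 6.0),
                             Petal.Width  = c(0.2, 1.4, 2.0)))
# k must match the number of user-supplied points
user_km <- h2o.kmeans(training_frame = iris_hex, k = 3,
                      init = "User", user_points = centers, seed = 1234)
h2o.centers(user_km)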
Example
library(h2o)
h2o.init()
# import the seeds dataset:
# this dataset looks at three different types of wheat varieties
# the original dataset can be found at http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/flow_examples/seeds_dataset.txt")
# set the predictor names
# ignore the 8th column which has the prior known clusters for this dataset
predictors <- colnames(seeds)[-length(seeds)]
# split into train and validation
seeds_splits <- h2o.splitFrame(data = seeds, ratios = .8, seed = 1234)
train <- seeds_splits[[1]]
valid <- seeds_splits[[2]]
# try using the `init` parameter:
# build the model with three clusters
seeds_kmeans <- h2o.kmeans(x = predictors, k = 3, init='Furthest', training_frame = train, validation_frame = valid, seed = 1234)
# print the total within cluster sum-of-square error for the validation dataset
print(paste0("Total sum-of-square error for valid dataset: ", h2o.tot_withinss(object = seeds_kmeans, valid = T)))
# select the values for `init` to grid over:
# Note: this dataset is too small to see significant differences between these options
# the purpose of the example is to show how to use grid search with `init` if desired
hyper_params <- list( init = c("PlusPlus", "Furthest", "Random") )
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: list(strategy = "RandomDiscrete")
grid <- h2o.grid(x = predictors, k = 3, training_frame = train, validation_frame = valid,
algorithm = "kmeans", grid_id = "seeds_grid", hyper_params = hyper_params,
search_criteria = list(strategy = "Cartesian"), seed = 1234)
# sort the grid models by total within-cluster sum of squares (tot_withinss)
sortedGrid <- h2o.getGrid("seeds_grid", sort_by = "tot_withinss", decreasing = FALSE)
sortedGrid
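To continue working with the winning grid model, it can be retrieved by its ID; this sketch assumes the usual `@model_ids` slot on the returned grid object and the standard `h2o.getModel()` accessor.
# the first model ID in the sorted grid corresponds to the lowest tot_withinss
best_kmeans <- h2o.getModel(sortedGrid@model_ids[[1]])
print(paste0("Best model's validation tot_withinss: ", h2o.tot_withinss(object = best_kmeans, valid = TRUE)))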