init (GLRM, K-Means)
Available in: GLRM, K-means
Hyperparameter: yes
This option specifies the initialization mode used in K-Means and GLRM. The options are Random, Furthest, PlusPlus, and User.
Random: Choose \(K\) of the \(N\) observations at random as the initial cluster centers, so that each observation has an equal chance of being chosen.
Furthest (Default):
Choose one center \(m_{1}\) at random.
Calculate the squared distance between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\): \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)
Choose \(m_{2}\) to be the \(x_{i}\) that maximizes \(d(x_{i}, m_{1})\).
Repeat until \(K\) centers have been chosen.
PlusPlus:
Choose one center \(m_{1}\) at random.
Calculate the squared distance between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\): \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)
Let \(P(i)\) be the probability of choosing \(x_{i}\) as \(m_{2}\). Weight \(P(i)\) by \(d(x_{i}, m_{1})\) so that those \(x_{i}\) furthest from \(m_{1}\) have a higher probability of being selected than those \(x_{i}\) close to \(m_{1}\).
Choose the next center \(m_{2}\) by drawing at random according to the weighted probability distribution.
Repeat until \(K\) centers have been chosen.
User: This initialization mode allows you to specify a file (using the `user_points` parameter) that includes a vector of initial cluster centers.
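The Random, Furthest, and PlusPlus procedures described above can be sketched in plain Python. This is a minimal illustration, not H2O's implementation: the function names (`init_random`, `init_furthest`, `init_plusplus`) are hypothetical, and for the third and later centers the sketch assumes the distance is measured to the nearest center already chosen, which the text leaves open. It also assumes the data contains at least \(K\) distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_center_dist(x, centers):
    """Squared distance from x to the closest center chosen so far (assumption)."""
    return min(sq_dist(x, m) for m in centers)


def init_random(points, k, rng):
    """Random: pick k distinct observations, each equally likely."""
    return rng.sample(points, k)


def init_furthest(points, k, rng):
    """Furthest: start from a random center, then repeatedly take the farthest point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda x: nearest_center_dist(x, centers)))
    return centers


def init_plusplus(points, k, rng):
    """PlusPlus: like Furthest, but draw the next center with probability
    proportional to the squared distance instead of always taking the maximum."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [nearest_center_dist(x, centers) for x in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
    rng = random.Random(1234)
    print("Random:  ", init_random(data, 3, rng))
    print("Furthest:", init_furthest(data, 3, rng))
    print("PlusPlus:", init_plusplus(data, 3, rng))
```

Furthest is deterministic once the first center is fixed, while PlusPlus keeps some randomness in every draw, which makes it less sensitive to a single outlier being picked as a center.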
The user-specified points dataset must have the same number of columns as the training observations.
This option is ignored when `estimate_k` is enabled. In this case, the algorithm is deterministic. If this option is not specified but a user-points file is specified, then this value will default to User.
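The two rules about user-specified points (matching column count, and falling back to User when no mode is given) can be illustrated with a small helper. This is only a sketch of the rules as stated in these notes: `resolve_init` is a hypothetical function, not part of the H2O API, and the user points are assumed to be a plain list of rows rather than an H2O frame.

```python
def resolve_init(init, user_points, n_train_cols):
    """Return the effective initialization mode (hypothetical helper).

    init         -- requested mode, or None if the caller did not set one
    user_points  -- list of rows with initial centers, or None
    n_train_cols -- number of columns in the training observations
    """
    if user_points is not None:
        # Every user-supplied center must have as many columns as the training data.
        for row in user_points:
            if len(row) != n_train_cols:
                raise ValueError(
                    "user points must have %d columns, got %d"
                    % (n_train_cols, len(row))
                )
        # No mode requested but user points given: fall back to User.
        if init is None:
            return "User"
    # Otherwise use the requested mode, or the documented default.
    return init if init is not None else "Furthest"


if __name__ == "__main__":
    print(resolve_init(None, [[1.0, 2.0], [3.0, 4.0]], 2))
    print(resolve_init(None, None, 2))
```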
library(h2o)
h2o.init()
# import the seeds dataset:
# this dataset looks at three different types of wheat varieties
# the original dataset can be found at
seeds <- h2o.importFile("")
# set the predictor names
# ignore the 8th column which has the prior known clusters for this dataset
predictors <- colnames(seeds)[-length(seeds)]
# split into train and validation
seeds_splits <- h2o.splitFrame(data = seeds, ratios = .8, seed = 1234)
train <- seeds_splits[[1]]
valid <- seeds_splits[[2]]
# try using the `init` parameter:
# build the model with three clusters
seeds_kmeans <- h2o.kmeans(x = predictors, k = 3, init='Furthest', training_frame = train, validation_frame = valid, seed = 1234)
# print the total within cluster sum-of-square error for the validation dataset
print(paste0("Total sum-of-square error for valid dataset: ", h2o.tot_withinss(object = seeds_kmeans, valid = TRUE)))
# select the values for `init` to grid over:
# Note: this dataset is too small to see significant differences between these options
# the purpose of the example is to show how to use grid search with `init` if desired
hyper_params <- list( init = c("PlusPlus", "Furthest", "Random") )
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: list(strategy = "RandomDiscrete")
grid <- h2o.grid(x = predictors, k = 3, training_frame = train, validation_frame = valid,
                 algorithm = "kmeans", grid_id = "seeds_grid", hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)
# sort the grid models by total within cluster sum-of-square error (lowest first)
sortedGrid <- h2o.getGrid("seeds_grid", sort_by = "tot_withinss", decreasing = FALSE)
import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator
h2o.init()
# import the seeds dataset:
# this dataset looks at three different types of wheat varieties
# the original dataset can be found at
seeds = h2o.import_file("")
# set the predictor names
# ignore the 8th column which has the prior known clusters for this dataset
predictors = seeds.columns[0:7]
# split into train and validation sets
train, valid = seeds.split_frame(ratios = [.8], seed = 1234)
# try using the `init` parameter:
# initialize the estimator then train the model
seeds_kmeans = H2OKMeansEstimator(k = 3, init='Furthest', seed = 1234)
seeds_kmeans.train(x = predictors, training_frame = train, validation_frame = valid)
# print the total within cluster sum-of-square error for the validation dataset
print("sum-of-square error for valid:", seeds_kmeans.tot_withinss(valid = True))
# grid over `init`
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch
# select the values for `init` to grid over
# Note: this dataset is too small to see significant differences between these options
# the purpose of the example is to show how to use grid search with `init` if desired
hyper_params = {'init': ["PlusPlus", "Furthest", "Random"]}
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the estimator
seeds_kmeans = H2OKMeansEstimator(k = 3, seed = 1234)
# build grid search with the previously created K-Means estimator and hyperparameters
grid = H2OGridSearch(model = seeds_kmeans, hyper_params = hyper_params,
                     search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, training_frame = train, validation_frame = valid)
# sort the grid models by total within cluster sum-of-square error (lowest first)
sorted_grid = grid.get_grid(sort_by = 'tot_withinss', decreasing = False)