K-Means Clustering

Introduction

K-Means falls in the general category of clustering algorithms. Clustering is a form of unsupervised learning that tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group is more similar to another observation in the same group than to another observation in a different group.

For more information, refer to “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining” and “Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values” by Zhexue Huang.

Defining a K-Means Model

  • model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

  • training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.

  • validation_frame: (Optional) Specify the dataset used to evaluate the accuracy of the model.

  • x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If x is missing, then all columns are used.

  • nfolds: Specify the number of folds for cross-validation.

  • keep_cross_validation_predictions: Enable this option to keep the cross-validation predictions.

  • keep_cross_validation_fold_assignment: Enable this option to preserve the cross-validation fold assignment.

  • fold_assignment: (Applicable only if a value for nfolds is specified and fold_column is not specified) Specify the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, Modulo, or Stratified (which will stratify the folds based on the response variable for classification problems).

  • fold_column: Specify the column that contains the cross-validation fold index assignment per observation.

  • ignored_columns: (Optional, Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.

  • ignore_const_cols: (Optional) Specify whether to ignore constant training columns, since no information can be gained from them. This option is enabled by default.

  • score_each_iteration: (Optional) Specify whether to score during each iteration of the model training.

  • k: Specify the number of clusters (groups of data) in a dataset that are similar to one another.

  • estimate_k: Specify whether to estimate the number of clusters (<=k) iteratively (independent of the seed) and deterministically (beginning with k=1,2,3...). If enabled, the estimate for each k runs for up to max_iterations iterations. This option is disabled by default.

  • user_points: Specify a dataframe, where each row represents an initial cluster center.

  • max_iterations: Specify the maximum number of training iterations. The range is 0 to 1e6.

  • standardize: Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.

    Note: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers are reported on both the standardized scale (centers_std) and the de-standardized scale (centers). To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using h2o.scale in R with center = TRUE and scale = TRUE on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.

  • seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

  • init: Specify the initialization mode. The options are Random, Furthest, PlusPlus, or User.

  • Random initialization randomly samples k rows of the training data as cluster centers.
  • PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
  • Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance.
  • User initialization requires the corresponding user_points parameter. Note that the user-specified points dataset must have the same number of columns as the training dataset.

Note: If PlusPlus is specified, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.

  • max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.
  • categorical_encoding: Specify one of the following encoding schemes for handling categorical features:
    • auto or AUTO: Allow the algorithm to decide (default). In K-Means, the algorithm will automatically perform enum encoding.
    • enum or Enum: 1 column per categorical feature
    • one_hot_explicit: N+1 new columns for categorical features with N levels
    • binary or Binary: No more than 32 columns per categorical feature
    • eigen or Eigen: k columns per categorical feature, keeping projections of one-hot-encoded matrix onto k-dim eigen space only
    • label_encoder or LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.)
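The note on the standardize option above describes centering and scaling each numeric column, then mapping the centers back to the original scale. A minimal sketch of that arithmetic in plain Python follows. It does not call H2O; the helper names and the sample column are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a numeric column to mean 0 and scale it to unit standard deviation."""
    mu, sigma = mean(column), stdev(column)  # assumption: sample standard deviation
    return [(v - mu) / sigma for v in column], mu, sigma

def destandardize(value, mu, sigma):
    """Map a standardized coordinate back to the original scale."""
    return value * sigma + mu

# Hypothetical numeric column
incomes = [30000.0, 45000.0, 60000.0, 75000.0]
scaled, mu, sigma = standardize(incomes)

# Illustrative center coordinate on the standardized scale:
# the mean of the first two standardized values.
center_std = mean(scaled[:2])

# Multiply by the original standard deviation and add the original mean.
center = destandardize(center_std, mu, sigma)
```

Because the transformation is linear, the de-standardized coordinate equals the mean of the first two original values.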

Interpreting a K-Means Model

By default, the following output displays:

  • A graph of the scoring history (number of iterations vs. within cluster sum of squares)
  • Output (model category, validation metrics if applicable, and centers std)
  • Model Summary (number of clusters, number of categorical columns, number of iterations, total within sum of squares, total sum of squares, total between sum of squares). Note that Flow also returns the number of rows.
  • Scoring history (duration, number of iterations, number of reassigned observations, within cluster sum of squares)
  • Training metrics (model name, checksum name, frame name, frame checksum name, description if applicable, model category, scoring time, predictions, MSE, RMSE, total within sum of squares, total sum of squares, total between sum of squares)
  • Centroid statistics (centroid number, size, within cluster sum of squares)
  • Cluster means (centroid number, column)

K-Means chooses starting points at random and converges to a local minimum, so the resulting centroids depend on the initialization. The number of clusters is arbitrary and should be thought of as a tuning parameter. The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run because this problem is Non-deterministic Polynomial-time (NP)-hard.
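The sum-of-squares quantities listed in the output above are related by a simple identity: the total sum of squares equals the total within sum of squares plus the between sum of squares. The following sketch computes them for a tiny hand-made dataset. The function names and data are hypothetical, and this is an illustration of the definitions rather than H2O's scoring code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    """Column-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sum_of_squares(points, assignments):
    """Return (within per cluster, total within, total, between) sums of squares."""
    clusters = {}
    for p, c in zip(points, assignments):
        clusters.setdefault(c, []).append(p)
    within = {}
    for c, members in clusters.items():
        center = centroid(members)
        within[c] = sum(sq_dist(p, center) for p in members)
    grand = centroid(points)
    total = sum(sq_dist(p, grand) for p in points)
    total_within = sum(within.values())
    return within, total_within, total, total - total_within

# Hypothetical data: two well-separated clusters of two points each
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
assignments = [0, 0, 1, 1]
within, total_within, total_ss, between_ss = sum_of_squares(points, assignments)
```

A large between sum of squares relative to the total indicates that the clusters explain most of the variation in the data.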

Estimating k in K-Means

The steps below describe the method that K-Means uses in order to estimate k.

  1. Beginning with one cluster, run K-Means to compute the centroid.
  2. Find the variable with the greatest range and split at the mean.
  3. Run K-Means on the two resulting clusters.
  4. Find the variable and cluster with the greatest range, and then split that cluster on the variable’s mean.
  5. Run K-Means again, and so on.
  6. Continue running K-Means until a stopping criterion is met.
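The splitting step in the list above can be sketched as follows. This covers only the split itself (steps 2 and 4), not the K-Means runs in between. The function name is hypothetical, and splitting at the mean of the variable within the chosen cluster is an assumption about how the mean is taken.

```python
def split_widest(clusters):
    """Split the cluster containing the variable with the greatest range
    at that variable's mean (computed within the cluster)."""
    best = None  # (range, cluster index, variable index)
    for ci, members in enumerate(clusters):
        for j, col in enumerate(zip(*members)):
            spread = max(col) - min(col)
            if best is None or spread > best[0]:
                best = (spread, ci, j)
    _, ci, j = best
    members = clusters[ci]
    cut = sum(p[j] for p in members) / len(members)
    low = [p for p in members if p[j] <= cut]
    high = [p for p in members if p[j] > cut]
    # Replace the chosen cluster with its two halves
    return clusters[:ci] + [low, high] + clusters[ci + 1:]

# Hypothetical data: the first variable has by far the greatest range
data = [[1.0, 5.0], [2.0, 5.5], [9.0, 5.2], [10.0, 4.8]]
clusters = split_widest([data])
```

In the procedure described above, the two resulting groups would then seed the next K-Means run.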

H2O uses proportional reduction in error (\(PRE\)) to determine when to stop splitting. The \(PRE\) value is calculated based on the sum of squares within (\(SSW\)).

\(PRE=\frac{(SSW\text{[before split]} - SSW\text{[after split]})} {SSW\text{[before split]}}\)

H2O stops splitting when \(PRE\) falls below a \(threshold\), which is a function of the number of variables and the number of cases as described below:

\(threshold\) is the smaller of 0.8 and an expression that depends on the number of training rows and the number of model features:

\(threshold = \min\big(0.8,\ 0.02 + \frac{10}{number\_of\_training\_rows} + \frac{2.5}{number\_of\_model\_features^{2}}\big)\)
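As a worked illustration of the stopping rule, the sketch below evaluates \(PRE\) and \(threshold\) directly from the formulas above. The function names and the example numbers are hypothetical; only the formulas come from the text.

```python
def pre(ssw_before, ssw_after):
    """Proportional reduction in error from a split."""
    return (ssw_before - ssw_after) / ssw_before

def threshold(n_rows, n_features):
    """Smaller of 0.8 and 0.02 + 10/rows + 2.5/features^2."""
    return min(0.8, 0.02 + 10.0 / n_rows + 2.5 / n_features ** 2)

def should_stop(ssw_before, ssw_after, n_rows, n_features):
    """Stop splitting when PRE falls below the threshold."""
    return pre(ssw_before, ssw_after) < threshold(n_rows, n_features)

# Hypothetical case: 1000 rows, 5 features, threshold is about 0.13
big_gain = should_stop(104.0, 4.0, n_rows=1000, n_features=5)    # PRE is about 0.96
small_gain = should_stop(4.0, 3.8, n_rows=1000, n_features=5)    # PRE is about 0.05
```

With few rows or few features, the second expression exceeds 0.8 and the cap applies, for example with 10 rows and 1 feature.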

FAQ

  • How does the algorithm handle missing values during training?
Missing values are automatically imputed by the column mean. K-Means also handles missing values by assuming that missing feature distance contributions are equal to the average of all other distance term contributions.
  • How does the algorithm handle missing values during testing?
Missing values are automatically imputed by the column mean of the training data.
  • What happens when you try to predict on a categorical level not seen during training?
An unseen categorical level in a row does not contribute to that row’s prediction. This is because the unseen categorical level does not contribute to the distance comparison between clusters, and therefore does not factor in predicting the cluster to which that row belongs.
  • Does it matter if the data is sorted?
No.
  • Should data be shuffled before training?
No.
  • What if there are a large number of columns?
K-Means suffers from the curse of dimensionality: all points are roughly at the same distance from each other in high dimensions, making the algorithm less and less useful.
  • What if there are a large number of categorical factor levels?
This can be problematic, as categoricals are one-hot encoded on the fly, which can lead to the same problem as datasets with a large number of columns.
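The answers above on missing values state that they are imputed by the column mean, and that test data uses the column means of the training data. A minimal sketch of that behavior in plain Python follows. Representing missing values as None and the helper names are assumptions made for illustration.

```python
def column_means(rows):
    """Mean of each column over the non-missing (non-None) training values."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means

def impute(rows, means):
    """Replace each missing value with the corresponding training column mean."""
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

# Hypothetical training and test rows with missing entries
train = [[1.0, 10.0], [3.0, None], [None, 30.0]]
test = [[None, 40.0]]

means = column_means(train)
train_filled = impute(train, means)
test_filled = impute(test, means)  # uses the training means, not the test means
```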

K-Means Algorithm

The number of clusters \(K\) is user-defined and is determined a priori.

  1. Choose \(K\) initial cluster centers \(m_{k}\) according to one of the following:

    • Random: Choose \(K\) clusters from the set of \(N\) observations at random so that each observation has an equal chance of being chosen.
    • Furthest (Default):
      1. Choose one center \(m_{1}\) at random.
      2. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)
      3. Choose \(m_{2}\) to be the \(x_{i}\) that maximizes \(d(x_{i}, m_{1})\).
      4. Repeat until \(K\) centers have been chosen.
    • PlusPlus:
      1. Choose one center \(m_{1}\) at random.
      2. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = \|(x_{i}-m_{1})\|^2\)
      3. Let \(P(i)\) be the probability of choosing \(x_{i}\) as \(m_{2}\). Weight \(P(i)\) by \(d(x_{i}, m_{1})\) so that those \(x_{i}\) furthest from \(m_{1}\) have a higher probability of being selected than those \(x_{i}\) close to \(m_{1}\).
      4. Choose the next center \(m_{2}\) by drawing at random according to the weighted probability distribution.
      5. Repeat until \(K\) centers have been chosen.
    • User initialization allows you to specify a file (using the user_points parameter) that includes a vector of initial cluster centers.
  2. Once \(K\) initial centers have been chosen, calculate the difference between each observation \(x_{i}\) and each of the centers \(m_{1},...,m_{K}\), where difference is the squared Euclidean distance taken over \(p\) parameters.

    \[d(x_{i}, m_{k})=\sum_{j=1}^{p}(x_{ij}-m_{kj})^2=\|(x_{i}-m_{k})\|^2\]
  3. Assign \(x_{i}\) to the cluster \(k\) defined by \(m_{k}\) that minimizes \(d(x_{i}, m_{k})\).

  4. When all observations \(x_{i}\) are assigned to a cluster, calculate the mean of the points in the cluster.

    \[\bar{x}(k)=\{\bar{x}_{i1},\ldots,\bar{x}_{ip}\}\]
  5. Set the \(\bar{x}(k)\) as the new cluster centers \(m_{k}\). Repeat steps 2 through 5 until the specified number of max iterations is reached or cluster assignments of the \(x_{i}\) are stable.
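The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not H2O's implementation: the function names are hypothetical, and for the Furthest initialization the sketch assumes that each subsequent center is the observation whose squared distance to its nearest already-chosen center is largest.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance over all p coordinates."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def furthest_init(points, k, rng):
    """Pick one center at random, then repeatedly add the furthest observation."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Assumption: "furthest" means furthest from the nearest chosen center
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, max_iterations=10, seed=1234):
    rng = random.Random(seed)
    centers = furthest_init(points, k, rng)
    assignments = None
    for _ in range(max_iterations):
        # Steps 2-3: assign each observation to its closest center
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # cluster assignments are stable
        assignments = new_assignments
        # Steps 4-5: move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

# Hypothetical data: two well-separated groups
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers, assignments = kmeans(points, k=2)
```

On this small dataset the loop converges to the same two centers whichever observation is drawn first; only the cluster labels may be swapped.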
