K Means Tutorial

This tutorial walks through a K-Means analysis and describes how to specify, run, and interpret a K-means model in H2O.

If you have never used H2O before, refer to the quick start guide for additional instructions on how to run H2O: Getting Started From a Downloaded Zip File.

Interested users can find details on the math behind K Means at: K-Means.


Quick Start Video


Getting Started

This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds

The data are composed of 210 observations, 7 attributes, and an priori grouping assignment. All data are positively valued and continuous. Before modeling, parse data into H2O:

  1. From the drop-down Data menu, select Upload and use the helper to upload data.
  2. On the “Request Parse” page that appears, check the “header” checkbox if the first row of the data set is a header. No other changes are required.
  3. Click Submit. Parsing data into H2O generates a .hex key of the form “data name.hex”
../_images/KMparse.png

Building a Model

  1. Once data are parsed, a horizontal menu appears at the top of the screen that displays “Build model using ... ”. Select K Means here, or go to the drop-down Model menu and select K-Means.
  2. In the “source” field, enter the .hex key associated with the data set.
  3. Specify a value for “k.” For this dataset, use 3.
  4. Check the “normalize” checkbox to normalize data, but this is not required for this example.
  5. Select an option from the “Initialization” drop-down list.
    • Plus Plus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
    • Furthest initialization chooses one initial center at random, and then chooses the next center to be the point furthest away in terms of Euclidean distance.
    • The default (“None”) results in K initial centers being chosen independently at random.
  6. Enter a “Max Iter” (short for maximum iterations) value to specify the maximum number of iterations the algorithm processes.
  7. Select the columns of attributes that should be used in defining the clusters in the “Cols” section. In this example, all columns except column 7 (the a priori known clusters for this particular set) are selected.
  8. Click Submit.
../_images/KMrequest.png

K-Means Output

The output is a matrix of the cluster assignments and the coordinates of the cluster centers (in terms of the originally selected attributes). Your cluster centers may differ slightly. K-Means randomly chooses starting points and converges on optimal centroids. The cluster number is arbitrary and should be thought of as a factor.

../_images/KMinspect.png

K-means Next Steps

For more information about the model, select K-Means from the drop-down Score menu. Specify the K-Means model key and the .hex key for the original data set.

The output that displays when you click Submit is the number of rows assigned to each cluster and the squared error per cluster.

../_images/KMscore.png

K-means Apply

To generate a prediction (assign the observations in a data set to a cluster), select K-means Apply from the drop-down Score menu. Specify the model and the .hex key for the data, then click Submit.

In the following example, cluster assignments have been generated for the original data. Because the data have been sufficiently well researched, the ideal cluster assignments were known in advance. Comparing known clusters with predicted clusters demonstrated that this K-Means model classifies with a less than 10% error rate.

../_images/KMapply.png