The purpose of this tutorial is to walk through a K-Means analysis from beginning to end. By the end of this tutorial, the user should know how to specify, run, and interpret a K-Means model in H2O.
Those who have never used H2O before should see the quick start guide for additional instructions on how to run H2O.
Interested users can find details on the math behind K-Means at: K-Means.
This tutorial uses the publicly available Seeds data set, which can be found at http://archive.ics.uci.edu/ml/datasets/seeds
The data are composed of 210 observations, 7 attributes, and an a priori grouping assignment. All attribute values are positive and continuous. Before modeling, parse the data into H2O.
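Independently of the parse step in H2O, the layout described above can be checked with a short Python sketch. It is only an illustration and is not part of H2O. It assumes that each row of the file holds the seven attributes followed by the group label, separated by whitespace; the function name parse_seeds and the two sample rows are made up for this example.

```python
def parse_seeds(text):
    """Split whitespace-delimited rows into attribute lists and group labels."""
    attributes, labels = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumed layout: every field but the last is an attribute,
        # and the last field is the a priori group label.
        *values, label = fields
        attributes.append([float(v) for v in values])
        labels.append(int(label))
    return attributes, labels


# Two made-up rows in the assumed layout (not real measurements).
sample = (
    "15.0 14.5 0.87 5.7 3.3 2.2 5.2 1\n"
    "12.0 13.2 0.85 5.2 2.9 4.5 5.0 3\n"
)
attributes, labels = parse_seeds(sample)
print(len(attributes), "rows,", len(attributes[0]), "attributes, labels:", labels)
```

Applied to the full file, the same check should report 210 rows with 7 attributes each if the layout assumption holds.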
The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly, because K-Means chooses its starting points randomly and then converges on centroids that are optimal only for that starting configuration. The cluster number is an arbitrary label and should be thought of as a factor, not as a numeric value.
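To make the role of the random starting points concrete, here is a minimal sketch of the standard K-Means iteration (Lloyd's algorithm) in plain Python. It is not H2O's implementation, which may differ in initialization and stopping rules. The function names, the seed parameter, and the toy rows are assumptions made for this illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, seed=None, max_iterations=100):
    """Return (centers, assignments) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Random starting points: k distinct rows picked from the data.
    centers = [list(row) for row in rng.sample(rows, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_assignments == assignments:
            break  # converged: no row changed cluster
        assignments = new_assignments
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


# Toy data (made up): two loose groups in two dimensions.
rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
centers, assignments = kmeans(rows, k=2, seed=0)
print(centers)
print(assignments)
```

Running the sketch with different seeds can give different cluster numbers, and sometimes different centers, for the same data. This is why the cluster number should be read as a label only.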
For further information on the model, select K-Means from the Score drop-down menu. Specify the K-Means model key and the .hex key for the data set originally used.
Pressing submit returns the number of rows assigned to each cluster and the squared error per cluster.
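The two quantities in this output can be sketched in a few lines of Python. The sketch assumes that the squared error of a cluster is the sum of squared Euclidean distances from its rows to its center; the text does not say whether H2O reports a sum or an average, so treat that as an assumption. The function name cluster_summary and the toy values are made up for illustration.

```python
def cluster_summary(rows, assignments, centers):
    """Count the rows in each cluster and sum their squared distances to the center."""
    sizes = [0] * len(centers)
    squared_error = [0.0] * len(centers)
    for row, c in zip(rows, assignments):
        sizes[c] += 1
        squared_error[c] += sum((x - m) ** 2 for x, m in zip(row, centers[c]))
    return sizes, squared_error


# Toy values (made up): three rows, two clusters.
rows = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
assignments = [0, 0, 1]
centers = [[1.0, 0.0], [10.0, 10.0]]
sizes, squared_error = cluster_summary(rows, assignments, centers)
print(sizes, squared_error)
```

A cluster with a large squared error relative to its size is spread widely around its center, which can be a hint that a different number of clusters fits the data better.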
To generate a prediction (that is, to assign the observations in a data set to a cluster), select K-Means Apply from the Score drop-down menu. Specify the model to be applied and the .hex key for the data you would like to apply it to, and press submit.
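Conceptually, applying a K-Means model means giving each observation the number of its nearest cluster center. The following Python sketch shows that idea under the assumption of Euclidean distance; it does not call H2O, and the function name assign_clusters and the example values are hypothetical.

```python
def assign_clusters(rows, centers):
    """Return, for each row, the index of the nearest center."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centers)), key=lambda c: squared_distance(row, centers[c]))
        for row in rows
    ]


# Made-up centers from a fitted model and three new observations.
centers = [[1.0, 1.0], [8.0, 8.0]]
new_rows = [[0.5, 1.5], [9.0, 7.0], [2.0, 2.0]]
predicted = assign_clusters(new_rows, centers)
print(predicted)
```

The data being scored must contain the same attributes, in the same order, as the data used to build the model.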
Here, cluster assignments have been generated for the original data. Because the data have been well researched, the ideal cluster assignments were known in advance. Comparing the known clusters with the predicted clusters shows that this K-Means model classifies with an error rate of less than 10%.
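Because cluster numbers are arbitrary, the predicted clusters cannot be compared with the known groups by checking for equal numbers. One simple approach, sketched below in Python, is to map each cluster to the known group that is most common among its members and then count the disagreements. This is one possible way to compute an error rate and not necessarily how the figure above was obtained; the function name and the label lists are made up for illustration.

```python
from collections import Counter


def clustering_error_rate(known, predicted):
    """Share of rows whose known group differs from their cluster's majority group."""
    # Tally the known groups that fall into each predicted cluster.
    groups_in_cluster = {}
    for group, cluster in zip(known, predicted):
        groups_in_cluster.setdefault(cluster, Counter())[group] += 1
    # Map every cluster to its most common known group.
    cluster_to_group = {
        cluster: counts.most_common(1)[0][0]
        for cluster, counts in groups_in_cluster.items()
    }
    mistakes = sum(
        1 for group, cluster in zip(known, predicted)
        if cluster_to_group[cluster] != group
    )
    return mistakes / len(known)


# Made-up labels: ten rows, three known groups, three predicted clusters.
known = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
predicted = [0, 0, 2, 2, 2, 2, 1, 1, 1, 1]
rate = clustering_error_rate(known, predicted)
print(rate)
```

Note that the majority mapping lets two clusters share the same known group, so the resulting rate can be optimistic when the clusters do not line up with the groups one to one.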