.. _KM_tutorial: K Means Tutorial ================ This tutorial walks through a K-Means analysis and describes how to specify, run, and interpret a K-means model in H2O. If you have never used H2O before, refer to the quick start guide for additional instructions on how to run H2O: :ref:`GettingStartedFromaZipFile`. Interested users can find details on the math behind K Means at: :ref:`KMmath`. """" Quick Start Video """"""""""""""""" .. raw:: html <object width="420" height="315"><param name="movie" value="http://www.youtube.com/v/KVeKfoMRMyQ?version=3&hl=en_US"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/KVeKfoMRMyQ?version=3&hl=en_US" type="application/x-shockwave-flash" width="420" height="315" allowscriptaccess="always" allowfullscreen="true"></embed></object> """""""""""""" Getting Started """"""""""""""" This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds The data are composed of 210 observations, 7 attributes, and an priori grouping assignment. All data are positively valued and continuous. Before modeling, parse data into H2O: #. From the drop-down **Data** menu, select **Upload** and use the helper to upload data. #. On the "Request Parse" page that appears, check the "header" checkbox if the first row of the data set is a header. No other changes are required. #. Click **Submit**. Parsing data into H2O generates a .hex key of the form "data name.hex" .. image:: KMparse.png :width: 90% """""" Building a Model """""""""""""""" #. Once data are parsed, a horizontal menu appears at the top of the screen that displays "Build model using ... ". Select K Means here, or go to the drop-down **Model** menu and select K-Means. #. In the "source" field, enter the .hex key associated with the data set. #. Specify a value for "k." For this dataset, use 3. #. Check the "normalize" checkbox to normalize data, but this is not required for this example. #. Select an option from the "Initialization" drop-down list. - Plus Plus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen. - Furthest initialization chooses one initial center at random, and then chooses the next center to be the point furthest away in terms of Euclidean distance. - The default ("None") results in K initial centers being chosen independently at random. #. Enter a "Max Iter" (short for maximum iterations) value to specify the maximum number of iterations the algorithm processes. #. Select the columns of attributes that should be used in defining the clusters in the "Cols" section. In this example, all columns except column 7 (the a priori known clusters for this particular set) are selected. #. Click **Submit**. .. image:: KMrequest.png :width: 90% """""" K-Means Output """""""""""""" The output is a matrix of the cluster assignments and the coordinates of the cluster centers (in terms of the originally selected attributes). Your cluster centers may differ slightly. K-Means randomly chooses starting points and converges on optimal centroids. The cluster number is arbitrary and should be thought of as a factor. .. image:: KMinspect.png :width: 100% """""" K-means Next Steps """"""""""""""""""" For more information about the model, select *K-Means* from the drop-down **Score** menu. Specify the K-Means model key and the .hex key for the original data set. The output that displays when you click **Submit** is the number of rows assigned to each cluster and the squared error per cluster. .. image:: KMscore.png :width: 90% """""" K-means Apply """"""""""""" To generate a prediction (assign the observations in a data set to a cluster), select **K-means Apply** from the drop-down **Score** menu. Specify the model and the .hex key for the data, then click **Submit**. In the following example, cluster assignments have been generated for the original data. Because the data have been sufficiently well researched, the ideal cluster assignments were known in advance. Comparing known clusters with predicted clusters demonstrated that this K-Means model classifies with a less than 10% error rate. .. image:: KMapply.png :width: 90% """"""