.. _KMmath:
K-Means
-------
K-Means falls in the general category of clustering algorithms.
When to use K-Means
"""""""""""""""""""
Data are a set of attributes on which the members of the population
likely differ. The objective is classification.
Here are some examples:
"How do competitors differ from one another on critical dimensions?"
"How is a particular market segmented?"
"Which dimensions are most important to differentiating between
members of a population of interest?"
Defining a K-Means model
""""""""""""""""""""""""
**Source key:**
The .hex key associated with the data set for use in clustering.
**K**
The desired number of clusters. There is no set rule or formula
for defining K, it is up to the user and is
often based on heuristics.
**Max iter**
The maximum number of iterations the algorithm is to go
through if no stopping point is reached before then.
**Initialization**
*Plus Plus*
A modification to the k-means algorithm that impacts the assignment
of K initial cluster centroids. Because of the random process
inherent to K-means, it's possible for the algorithm to converge on
centroids that are not the optimal cluster centers purely by chance
in the choice of starting points. To mitigate this risk, Plus Plus
initialization assigns K initial centers by choosing just one center
at random, computing the Euclidean norm between that point and all
other points in the data set, and using the results to define a
weighted probability distribution from which the next center is
chosen at random. The process repeats until all centers have been
chosen, at which point the algorithm proceeds as usual.
*Furthest*
A modification to the k-means algorithm that impacts the assignment
of K initial cluster centroids. Furthest first initialization
attempts to improve K-means results by selecting the first center,
and then calculating the distance from that point to all other
possible points. The second initial center is chosen to the point
furthest from the first center in terms of Euclidean distance.
**Seed:**
A means for specifying the algorithm components
dependent on randomization. Note that the seed stays the same for
each instance of H\ :sub:`2`\ O, allowing the user to create models with the
same starting conditions in alternative configurations.
**Normalize:**
Specifies that each attribute be transformed such that it has a mean
of 0 and standard deviation of 1, and that this transformation be
carried out before the algorithm is applied.
**Cols**
The columns from the data set that contain the attribute data.
Interpreting a Model
""""""""""""""""""""
Output from K-Means is a table with one more column than the
number of attributes used to cluster. The the names of attributes,
and "cluster" appear in the header row. The column cluster gives
an arbitrary number to each cluster built, and the attributes give
the coordinates of the center of that cluster.
+--------+-----------+-----------+
|Clusters|Attribute 1|Attribute 2|
+========+===========+===========+
| 0 | centroid | centroid |
| | value | value |
+--------+-----------+-----------+
References
""""""""""
Xiong, Hui, Junjie Wu, and Jian Chen. "K-means Clustering Versus
Validation Measures: A Data- distribution Perspective." Systems, Man,
and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39.2 (2009): 318-331.
K-Means Algorithm
""""""""""""""""""
The number of clusters :math:`K` is user defined and determined a priori.
1. Choose :math:`K` initial cluster centers :math:`m_{k}` according to one of
the following:
**Randomization**
Choose :math:`K` clusters from the set of :math:`N` observations at random so that
each observation has an equal chance of being chosen.
**Plus Plus**
i. Choose one center :math:`m_{1}` at random.
ii. Calculate the difference between :math:`m_{1}` and each of the
remaining :math:`N-1` observations :math:`x_{i}`.
:math:`d(x_{i}, m_{1})` = :math:`||(x_{i}-m_{1})||^2`
iii. Let :math:`P(i)` be the probability of choosing :math:`x_{i}` as
:math:`m_{2}`. Weight :math:`P(i)` by :math:`d(x_{i}, m_{1})` so that
those :math:`x_{i}` furthest from :math:`m_{2}` have a
higher probability of being selected than those :math:`x_{i}`
close to :math:`m_{1}`.
iv. Choose the next center :math:`m_{2}` by drawing at random
according to the weighted probability distribution.
Repeat until :math:`K` centers have been chosen.
**Furthest**
i. Choose one center :math:`m_{1}` at random.
ii. Calculate the difference between :math:`m_{1}` and each of the
remaining :math:`N-1` observations :math:`x_{i}`.
:math:`d(x_{i}, m_{1})` = :math:`||(x_{i}-m_{1})||^2`
iii. Choose :math:`m_{2}` to be the :math:`x_{i}` that maximizes
:math:`d(x_{i}, m_{1})`.
Repeat until :math:`K` centers have been chosen.
2. Once :math:`K` initial centers have been chosen calculate the difference
between each observation :math:`x_{i}` and each of the centers
:math:`m_{1},...,m_{K}`, where difference is the squared Euclidean
distance taken over :math:`p` parameters.
:math:`d(x_{i}, m_{k})=`
:Math:`\sum_{j=1}^{p}(x_{ij}-m_{k})^2=`
:math:`\lVert(x_{i}-m_{k})\rVert^2`
3. Assign :math:`x_{i}` to the cluster :math:`k` defined by :math:`m_{k}` that
minimizes :math:`d(x_{i}, m_{k})`
4. When all observations :math:`x_{i}` are assigned to a cluster
calculate the mean of the points in the cluster.
:math:`\bar{x}(k)=\lbrace\bar{x_{i1}},…\bar{x_{ip}}\rbrace`
5. Set the :math:`\bar{x}(k)` as the new cluster centers
:math:`m_{k}`. Repeat steps 2 through 5 until the specified number
of max iterations is reached or cluster assignments of the
:math:`x_{i}` are stable.
References
""""""""""
Hastie, Trevor, Robert Tibshirani, and J Jerome H Friedman. The
Elements of Statistical Learning.
Vol.1. N.p., Springer New York, 2001.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf
Xiong, Hui, Junjie Wu, and Jian Chen. "K-means Clustering Versus
Validation Measures: A Data- distribution Perspective." Systems, Man,
and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39.2 (2009): 318-331.