As suggested by the name of K-means clustering, in this method a set of $K$ mean vectors $\mathbf{m}_1,\dots,\mathbf{m}_K$ are used to represent the clusters in the feature space, based on the assumption that there exist $K$ clusters in the dataset.
The K-means clustering algorithm can be formulated as an optimization problem to minimize an objective function, the total squared distance between all samples and the means of the clusters to which they are assigned:

$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_n \in C_k} \|\mathbf{x}_n - \mathbf{m}_k\|^2 \tag{199}$$

Introducing the binary indicator $r_{nk}$, equal to 1 if sample $\mathbf{x}_n$ is assigned to cluster $C_k$ and 0 otherwise, the objective can be written equivalently as

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \mathbf{m}_k\|^2 \tag{200}$$

$J$ can be minimized by alternating two steps. First, holding the means fixed, each sample is assigned to the cluster with the nearest mean:

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \mathbf{m}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \tag{201}$$

Second, holding the assignments fixed, each mean is updated to the average of the samples assigned to its cluster:

$$\mathbf{m}_k = \frac{\sum_{n=1}^{N} r_{nk}\,\mathbf{x}_n}{\sum_{n=1}^{N} r_{nk}}, \qquad k = 1,\dots,K \tag{202}$$
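The update step in Eq. (202) can be verified by setting the gradient of the objective in Eq. (200) with respect to each mean vector to zero:

$$\frac{\partial J}{\partial \mathbf{m}_k} = -2\sum_{n=1}^{N} r_{nk}\,(\mathbf{x}_n - \mathbf{m}_k) = \mathbf{0} \quad\Longrightarrow\quad \mathbf{m}_k = \frac{\sum_{n=1}^{N} r_{nk}\,\mathbf{x}_n}{\sum_{n=1}^{N} r_{nk}}$$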
As $r_{nk}$ depends on $\mathbf{m}_k$, while $\mathbf{m}_k$ in turn also depends on $r_{nk}$, the steps above need to be carried out iteratively, starting with $K$ randomly initialized mean vectors which are revised repeatedly until convergence. Here are the steps of the process:
1. Initialize the $K$ mean vectors $\mathbf{m}_1,\dots,\mathbf{m}_K$, e.g., as $K$ randomly chosen samples from the dataset.
2. Assign each of the $N$ samples to the cluster with the nearest mean:
$$\mathbf{x}_n \in C_j \quad \text{if} \quad \|\mathbf{x}_n - \mathbf{m}_j\| = \min_{1 \le k \le K} \|\mathbf{x}_n - \mathbf{m}_k\| \tag{203}$$
3. Update each mean to the average of the $N_k$ samples assigned to its cluster:
$$\mathbf{m}_k = \frac{1}{N_k} \sum_{\mathbf{x}_n \in C_k} \mathbf{x}_n, \qquad k = 1,\dots,K \tag{204}$$
4. Repeat steps 2 and 3 until the means no longer change:
$$\|\mathbf{m}_k^{new} - \mathbf{m}_k^{old}\| = 0, \qquad k = 1,\dots,K \tag{205}$$
This method is simple and effective, but it has the main drawback that the number of clusters $K$ needs to be estimated based on some prior knowledge, and it stays fixed throughout the clustering process, even if it turns out later that the dataset would be better fitted with more or fewer clusters. One way to resolve this is to carry out the algorithm multiple times with different values of $K$, and then evaluate each result based on the objective function $J$, or some other separability criteria, such as the intra-cluster and inter-cluster distances considered below.
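As an illustration, here is a minimal sketch of this model-selection loop, assuming the function kmeans from Matlab's Statistics and Machine Learning Toolbox (which expects samples as rows); the simulated dataset here is hypothetical:

% Run K-means for several candidate values of K and compare the
% resulting objective J of Eq. (199); the dataset is simulated.
X = [randn(2,100)-2, randn(2,100)+2, [randn(1,100); randn(1,100)+4]];
for K = 2:6
    [idx, M, sumd] = kmeans(X', K);   % sumd: per-cluster sum of squared distances
    fprintf('K = %d: J = %.2f\n', K, sum(sumd));
end

Note that $J$ decreases monotonically as $K$ grows, so one typically looks for the value of $K$ where the decrease levels off, rather than for the minimum of $J$ itself.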
The Matlab code for the iteration loop of the algorithm is listed
below, where Mold
and Mnew
are respectively the mean
vectors before and after each modification. The iteration terminates
when the mean vectors no longer change.
Mnew = X(:, randi(N,1,K));          % use any K random samples as initial means
er = inf;
while er > 0                        % main iteration
    Mold = Mnew;
    Mnew = zeros(d,K);              % initialize new means
    Number = zeros(1,K);            % sample count of each cluster
    for i = 1:N                     % for all N samples
        x = X(:,i);
        dmin = inf;
        for k = 1:K                 % for all K clusters
            dst = norm(x - Mold(:,k));   % distance to the kth mean
            if dst < dmin           % find the nearest mean
                dmin = dst;
                j = k;
            end
        end
        Number(j) = Number(j)+1;    % assign sample to cluster j
        Mnew(:,j) = Mnew(:,j)+x;
    end
    for k = 1:K                     % update each mean as its cluster average
        if Number(k) > 0
            Mnew(:,k) = Mnew(:,k)/Number(k);
        end
    end
    er = norm(Mnew-Mold);           % terminate if means no longer change
end
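For completeness, here is a minimal sketch of how the variables assumed by this listing (X, N, K, d) might be set up with a simulated dataset; the specific values are hypothetical:

d = 3; K = 4; Npc = 100; N = K*Npc;   % dimension, clusters, samples per cluster
C0 = 5*randn(d, K);                   % simulated true cluster centers
X = zeros(d, N);
for k = 1:K                           % one unit-variance Gaussian cloud per center
    X(:, (k-1)*Npc+1 : k*Npc) = C0(:,k) + randn(d, Npc);
end
% ... now run the iteration loop listed above, then inspect Mnew ...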
Example 1
The K-means algorithm is applied to a simulated dataset in 3-D space containing several clusters. The results are shown in the figure below, where the three panels show the results corresponding to three different values of $K$ (left, middle, and right). The initial positions of the $K$ means are marked by the black squares, while their subsequent positions throughout the iteration are marked by smaller dots. The iteration terminates once the means have moved to the centers of the clusters and no longer change positions.
The clustering results corresponding to different values of $K$ can be evaluated by the separability of the resulting clusters, measured in terms of the intra-cluster distance of each cluster

$$d_k = \frac{1}{N_k} \sum_{\mathbf{x}_n \in C_k} \|\mathbf{x}_n - \mathbf{m}_k\| \tag{206}$$

and the inter-cluster distance between every pair of clusters:

$$d_{kl} = \|\mathbf{m}_k - \mathbf{m}_l\|, \qquad k \ne l \tag{207}$$
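A minimal sketch of how these two measures might be computed, assuming X (d-by-N) holds the samples, idx(n) the cluster index of sample n, and M (d-by-K) the converged means (all names hypothetical; vecnorm requires Matlab R2017b or later):

K = size(M, 2);
dintra = zeros(1, K);                 % intra-cluster distances, Eq. (206)
for k = 1:K
    Xk = X(:, idx == k);              % samples assigned to cluster k
    dintra(k) = mean(vecnorm(Xk - M(:,k)));
end
dinter = zeros(K, K);                 % inter-cluster distances, Eq. (207)
for k = 1:K
    for l = k+1:K
        dinter(k,l) = norm(M(:,k) - M(:,l));
    end
end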
While the K-means method is simple and effective, it has the main shortcoming that the number of clusters $K$ needs to be specified, although in practice it is typically unknown ahead of time. In this case, one could try different values of $K$ and compare the corresponding results in terms of the intra-cluster and inter-cluster distances, as well as the separabilities of the resulting clusters, as shown in the example above. Moreover, if the intra-cluster distance of a cluster is too large, indicating that it may contain more than one cluster (e.g., the second (green) cluster in the example), it can be split; on the other hand, if the inter-cluster distance between two clusters is too small (e.g., between the fourth (yellow) and fifth (cyan) clusters in the example), the two may actually belong to the same cluster and need to be merged. Following such merging and/or splitting, a few more iterations can be carried out to make sure the final clustering results are optimal. The figure below shows the clustering results of the same dataset above but modified by merging and splitting: in the left panel the second cluster (green) is split, while in the right panel the fourth (yellow) and fifth (cyan) clusters are merged.
The idea of modifying the clustering results by merging and splitting leads to the Iterative Self-Organizing Data Analysis Technique (ISODATA) algorithm, which allows the number of clusters to be adjusted automatically during the iteration by merging clusters close to each other and splitting clusters with large intra-cluster distances. However, this algorithm is highly heuristic, as the various parameters, such as the threshold values for merging and splitting, need to be specified.
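To make the idea concrete, here is a highly simplified sketch of one merge/split pass over the converged means, reusing dintra and dinter from the sketch above; the thresholds tmerge and tsplit are hypothetical, and the full ISODATA algorithm involves several more parameters (minimum cluster size, maximum number of clusters, etc.):

tmerge = 1.0;  tsplit = 3.0;          % hypothetical thresholds
keep = true(1, K);                    % merge pass: absorb near-duplicate means
for k = 1:K
    for l = k+1:K
        if keep(k) && keep(l) && dinter(k,l) < tmerge
            M(:,k) = (M(:,k) + M(:,l))/2;   % merge cluster l into cluster k
            keep(l) = false;
        end
    end
end
M = M(:, keep);  dintra = dintra(keep);
Msplit = [];                          % split pass: duplicate too-spread means
for k = 1:size(M, 2)
    if dintra(k) > tsplit             % perturb the mean in opposite directions
        u = randn(size(M,1), 1);  u = u/norm(u);
        Msplit = [Msplit, M(:,k)+0.5*dintra(k)*u, M(:,k)-0.5*dintra(k)*u];
    else
        Msplit = [Msplit, M(:,k)];
    end
end
M = Msplit;                           % then re-run a few K-means iterations

ISODATA typically splits a cluster along its direction of maximum variance; a random direction is used here only to keep the sketch short.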
Example 2
The K-means clustering method is applied to the dataset of ten handwritten digits from 0 to 9 used previously. The clustering result is visualized based on the KLT (Karhunen-Loeve transform), which maps the data samples in the original 256-D space into the 3-D space spanned by the three eigenvectors corresponding to the three greatest eigenvalues of the covariance matrix of the dataset. The ground-truth labelings are color coded as shown on the left, while the clustering result is shown on the right. We see that the clustering results match the original data reasonably well.
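A minimal sketch of this visualization, assuming X (256-by-N) holds one digit sample per column and idx the cluster labels from the K-means run (names hypothetical):

m = mean(X, 2);                        % sample mean of the dataset
[V, D] = eig(cov(X'));                 % eigendecomposition of the covariance
[~, order] = sort(diag(D), 'descend'); % sort by decreasing eigenvalue
V3 = V(:, order(1:3));                 % the three leading eigenvectors
Y = V3' * (X - m);                     % KLT: project samples into 3-D
scatter3(Y(1,:), Y(2,:), Y(3,:), 10, idx, 'filled');  % color by cluster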
The clustering result is also shown in the confusion matrix, whose columns and rows represent respectively the clusters found by K-means and the ground-truth labeling (which is not used in the clustering). In other words, the element in the $i$th row and $j$th column is the number of samples labeled as belonging to the $i$th class but assigned to the $j$th cluster:
$$\mathbf{C} = [\,c_{ij}\,]_{10\times 10}, \qquad c_{ij} = \big|\{\mathbf{x}_n : \mathbf{x}_n \text{ labeled as class } i,\ \mathbf{x}_n \text{ assigned to cluster } j\}\big| \tag{208}$$
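A minimal sketch of how such a matrix might be accumulated, assuming label(n) is the ground-truth class of sample n (digit c stored as class c+1) and idx(n) its assigned cluster (names hypothetical):

C = zeros(10, 10);                  % rows: true classes, columns: clusters
for n = 1:N
    C(label(n), idx(n)) = C(label(n), idx(n)) + 1;
end
disp(C)                             % the confusion matrix of Eq. (208)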
All samples are converted from 256-dimensional vectors back to image form, and the averages of the samples in each of the ten clusters are shown in the figure below. We see that each of the ten clusters clearly represents one distinct digit.
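A minimal sketch of this display, assuming each 256-D column of the mean matrix M corresponds to a 16-by-16 grayscale image stored column-wise (both the name M and the image layout are assumptions):

for k = 1:10
    subplot(2, 5, k);
    imagesc(reshape(M(:,k), 16, 16));   % kth cluster mean as an image
    colormap(gray);  axis image off;
    title(sprintf('cluster %d', k));
end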