
Competitive Learning

[Figure: a two-layer competitive learning network (twolayernet.gif)]

Competitive learning is a typical unsupervised learning network, similar to statistical clustering methods such as k-means and Isodata. Its purpose is to discover groups/clusters of similar patterns represented by vectors in the n-D space. The competitive learning network has two layers: the input layer composed of $n$ nodes to which an input pattern ${\bf x}=[x_1,\cdots,x_n]^T$ is presented, and the output layer composed of $m$ nodes $y_i,\;\;\;(i=1,\cdots,m)$. There are different variations in the computation of the output, depending on the specific data and purpose. In the simplest case, the net input (the activation) of the ith output node is just the inner product of the node's weight vector and the current input vector:

\begin{displaymath}net_i=\sum_{j=1}^n w_{ij} x_j={\bf w}_i^T{\bf x} \end{displaymath}

The outputs of the network are determined by a ``winner-take-all'' competition such that only the output node receiving the maximal net input will output 1 while all others output 0:

\begin{displaymath}
y_k=\left\{ \begin{array}{ll} 1 & \mbox{if}\;\;net_k=\max\{net_i,\;\;i=1,\cdots,m\} \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath}
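For illustration, here is a minimal sketch of this computation in NumPy; the weight and input values (and the array names) are made up purely for the example:

import numpy as np

# toy values: n = 3 input nodes, m = 4 output nodes
W = np.array([[0.2, 0.5, 0.3],    # weight vector w_1
              [0.9, 0.1, 0.0],    # weight vector w_2
              [0.1, 0.1, 0.8],    # weight vector w_3
              [0.4, 0.4, 0.2]])   # weight vector w_4
x = np.array([0.8, 0.2, 0.1])     # current input pattern

net = W @ x                        # net_i = w_i^T x for every output node
y = np.zeros(len(net))
y[np.argmax(net)] = 1              # winner-take-all: only the maximal net input outputs 1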

For simplicity, we assume all weight vectors ${\bf w}_i$ and the input vector ${\bf x}$ are normalized to unit length.

The net input $net_i$ can be considered as the inner product of two vectors ${\bf w}_i$ and ${\bf x}$:

\begin{displaymath}net_i=\sum_j w_{ij} x_j={\bf w}_i^T{\bf x}=\vert\vert{\bf w}_i\vert\vert\;\vert\vert{\bf x}\vert\vert\;\cos\theta \end{displaymath}

which is closely related to the difference between the two vectors:

\begin{displaymath}\vert\vert{\bf w}_i-{\bf x}\vert\vert^2=\vert\vert{\bf w}_i\vert\vert^2+\vert\vert{\bf x}\vert\vert^2
-2\vert\vert{\bf w}_i\vert\vert\;\vert\vert{\bf x}\vert\vert\;\cos\theta \end{displaymath}

where $\theta$ is the angle between the two vectors.

We see that the weight vector ${\bf w}_k$ of the winning node is closest to the current input pattern vector ${\bf x}$ in the n-D feature space, with the smallest angle $\theta$ and therefore the smallest Euclidean distance $d_k=\vert\vert{\bf w}_k-{\bf x}\vert\vert$ among all $d_i=\vert\vert{\bf w}_i-{\bf x}\vert\vert$. Therefore the competition can also be carried out in terms of these distances:

\begin{displaymath}
y_i=\left\{ \begin{array}{ll} 1 & \mbox{if } d_i \le d_j \;\;\;(j=1,\cdots,m) \\
0 & \mbox{otherwise} \end{array} \right.
\end{displaymath}
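Under the unit-norm assumption above, the two formulations pick the same winner; a quick numerical check (arbitrary randomly generated vectors, assuming NumPy) might look like:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)        # unit-length weight vectors
x = rng.normal(size=4)
x /= np.linalg.norm(x)                                # unit-length input vector

k_inner = np.argmax(W @ x)                            # winner by maximal net input
k_dist = np.argmin(np.linalg.norm(W - x, axis=1))     # winner by minimal distance
assert k_inner == k_dist                              # same winner for normalized vectors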

The competitive learning law is

\begin{displaymath}{\bf w}_i^{new}={\bf w}_i^{old}+u_i\,\eta\,({\bf x}-{\bf w}_i^{old})
=\left\{ \begin{array}{ll}
(1-\eta)\,{\bf w}_i^{old}+\eta\,{\bf x} & u_i=1 \\
{\bf w}_i^{old} & u_i=0
\end{array} \right. \end{displaymath}

where

\begin{displaymath}\triangle {\bf w}_i=\eta({\bf x}-{\bf w}_i^{old}),
\;\;\;\;\;
u_i=\left\{ \begin{array}{ll} 1 & \mbox{if ith node is winner}\\ 0 & \mbox{else}
\end{array} \right.
\end{displaymath}

and $0 < \eta < 1$ is the learning rate, which is reduced from some initial value (e.g., $0.8$) toward 0 by multiplying it with a scaling factor (e.g., $0.996$) after each iteration.
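A minimal sketch of this update in NumPy (the function name and the schedule values are illustrative, not part of the original formulation):

import numpy as np

def update_winner(W, x, eta):
    """One competitive learning step: move the winning weight vector toward x."""
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # winner: closest weight vector
    W[k] += eta * (x - W[k])                       # w_k^new = w_k^old + eta (x - w_k^old)
    return k

# decaying learning rate (illustrative values): start at 0.8, multiply by 0.996 per step
eta = 0.8
# ... after each presentation:  eta *= 0.996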

[Figure: competitive1.gif]

Note that ${\bf w}_k^{new}$ lies between ${\bf w}_k^{old}$ and ${\bf x}$, i.e., the effect of this learning step is to pull the weight vector ${\bf w}_k$ that is already closest to the current input pattern ${\bf x}$ even closer to it.

These are the main steps of competitive learning (a code sketch follows the list):

  1. Generate a set of output nodes with random weight vectors
  2. Choose a random input pattern from the training set and calculate the activations of all output nodes
  3. Find the winning node and update its weights
  4. Go back to step 2 until the weights no longer change, or a set maximum number of iterations is reached.
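The following is a minimal end-to-end sketch of these steps, assuming NumPy and that the training patterns are the rows of an array X; the initialization and stopping details are illustrative choices, not the only possibilities:

import numpy as np

def competitive_learning(X, m, eta0=0.8, decay=0.996, max_iter=10000, tol=1e-6):
    """Cluster the rows of X with a competitive learning network of m output nodes."""
    rng = np.random.default_rng()
    # step 1: output nodes with initial weights (here simply m randomly chosen data points)
    W = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    eta = eta0
    for _ in range(max_iter):
        # step 2: choose a random input pattern and compute the activations
        x = X[rng.integers(len(X))]
        d = np.linalg.norm(W - x, axis=1)          # distance-based competition
        # step 3: find the winning node and update its weights
        k = np.argmin(d)
        dw = eta * (x - W[k])
        W[k] += dw
        eta *= decay                               # decaying learning rate
        # step 4: stop when the update becomes negligible
        if np.linalg.norm(dw) < tol:
            break
    return W

# usage: W = competitive_learning(X, m=7) returns one weight vector per cluster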

Every time a randomly chosen pattern ${\bf x}$ is presented to the input layer, one of the output nodes becomes the winner and its weight vector is drawn closer to the current input ${\bf x}$. After all input patterns have been repeatedly presented to the input layer, each group of similar patterns, a cluster in the n-D feature space, will draw a certain weight vector to its center, and subsequently the corresponding node will always win and produce a non-zero output whenever any member of the cluster is presented to the input in the future. In other words, after this unsupervised training, the n-D feature space is partitioned into a small set of regions, each corresponding to a cluster of similar input patterns and represented by one of the output nodes, whose weight vector ${\bf w}_k$ lies in the central area of the region.

[Figure: competitive2a.gif]

The initialization of the weight vectors may be of critical importance for the performance of the competitive learning process. It is desirable for the weight vectors to be in the same general regions as the input vectors, so that all weight vectors have some opportunity to win and learn to respond to some of the input patterns. Specifically, the weight vectors can be generated so that they have the same mean and covariance as the data points.
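For example, a simple way to do this, sketched here with NumPy (drawing from a Gaussian fitted to the data is just one possible choice):

import numpy as np

def init_weights_like_data(X, m, seed=None):
    """Draw m initial weight vectors from a Gaussian with the data's mean and covariance."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                 # sample mean of the data points
    cov = np.cov(X, rowvar=False)       # sample covariance of the data points
    return rng.multivariate_normal(mu, cov, size=m)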

It is possible that the input patterns are not clustered into clearly separable groups; sometimes they even form a continuum in the feature space. In such cases, one or a small number of output nodes may become frequent winners of the competition while all other output nodes never win and consequently never get the chance to learn (``dead nodes''). To prevent such a meaningless outcome, a mechanism is needed to ensure that all nodes have some chance to win. Specifically, we could modify the computation of the net input so that it contains an extra bias term:

\begin{displaymath}
net_i=\sum_j w_{ij} x_j+b_i={\bf w}_i^T{\bf x}+b_i
\end{displaymath}

The value of the bias term $b_i$ reduces as the winning frequency of the ith node increases, so that it is less likely to win in the future. This bias term $b_i$ can be proportional to the difference between the ``fair share'' of winning, $1/m$, and the actual winning frequency:

\begin{displaymath}
b_i=c\;\left(\frac{1}{m}-\frac{\mbox{number of winnings of node $i$}}{\mbox{total number of iterations so far}}\right)
\end{displaymath}

where $c$ is a scaling coefficient. The greater $c$ is, the more strongly the competition is balanced.

If a node wins too often, the difference is negative, making it harder for the node to win in the future. On the other hand, if a node rarely wins, the difference is positive, increasing its chance to win in the future. Alternatively, we can let $b_i$ decay by multiplying it by a scaling factor (e.g., $0.99$) each time the ith node wins. In any case, the value of $b_i$ needs to be fine-tuned according to the specific nature of the data being processed.
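A sketch of the frequency-based bias in NumPy (the function name and the value of the scaling coefficient c are illustrative):

import numpy as np

def biased_winner(W, x, wins, t, c=10.0):
    """Winner selection with the bias term b_i that penalizes frequent winners."""
    m = len(W)
    b = c * (1.0 / m - wins / max(t, 1))   # fair share 1/m minus actual winning frequency
    k = np.argmax(W @ x + b)               # compete on net input plus bias
    wins[k] += 1                           # update the winning counts
    return k

# usage: wins = np.zeros(m); at iteration t call biased_winner(W, x, wins, t)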

This competitive learning process is also called vector quantization, by which the continuous vector space is quantized into a set of discrete regions, forming a Voronoi diagram (tessellation).

Vector quantization can be used for data compression. A cluster of similar signals ${\bf x}$ in the n-D vector space can all be approximately represented by the weight vector ${\bf w}$ of one of a small number of output nodes in the neighborhood of ${\bf x}$, so that the data size can be significantly reduced.
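A minimal sketch of this idea, assuming NumPy and using the trained weight vectors as a codebook: each vector is stored as the index of its nearest code vector, and decoded back to that code vector.

import numpy as np

def vq_encode(X, codebook):
    """Replace each row of X by the index of its nearest codebook (weight) vector."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(d, axis=1)            # one small integer index per n-D vector

def vq_decode(indices, codebook):
    """Approximate reconstruction: each index maps back to its code vector."""
    return codebook[indices]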

Example

As shown on the left of the figure, a set of $C=7$ clusters is formed by $7\times 40$ data points (blue circles) in $N=4$ dimensional space. A competitive learning network with $M=C=7$ output nodes is then used to cluster the dataset. The $M$ weight vectors (red dots) are randomly generated in the 4-D space. After competitive learning, they have moved from their initial positions (left) to the centers of the clusters (right), thereby representing the clusters. To visualize the data points and the weight vectors, the 4-D space is projected by the PCA (KLT) transform to 3-D and 2-D spaces:

[Figure: PCA/KLT projection of the data points and weight vectors to 3-D (CompetitiveKLTmapping3.png)]

[Figure: PCA/KLT projection of the data points and weight vectors to 2-D (CompetitiveKLTmapping2.png)]

Ideally, the number of output nodes $M$ should equal the number of clusters in the data set, so that each cluster can be represented by one of the nodes after the competitive learning is complete. However, the number of clusters in the data set is typically unknown ahead of time. In such cases, one can carry out the competitive learning multiple times, each time using a different $M$, and then compare the clustering results to see which $M$ value fits the data best. Specifically, how well the clustering results reflect the intrinsic structure of the data set, and, in general, the extent to which the data points are clustered, can be quantitatively measured in terms of the scatter of the data points belonging to each cluster, in comparison to the total scatter of the entire dataset, as discussed here. Such a quantitative measure can be used as a criterion for comparing the results of different clustering algorithms.
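One simple criterion of this kind, sketched below under the assumption that the within-cluster scatter around the learned weight vectors is compared with the total scatter of the dataset (the exact measure used in the linked discussion may differ):

import numpy as np

def scatter_ratio(X, W):
    """Within-cluster scatter of X around its nearest weight vectors, divided by
    the total scatter of X around the global mean (smaller means tighter clusters)."""
    labels = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    within = np.sum((X - W[labels]) ** 2)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return within / total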


Ruye Wang 2015-08-13