
Perceptron Learning

A perceptron network (F. Rosenblatt, 1957) also has two layers, as in the previous Hebb network, except that the learning law is different. Perceptron learning is a supervised learning algorithm: the weights are adjusted based on prior knowledge in the form of training samples with known class labels.

[Figure twolayernet.gif: a two-layer perceptron network]

We first consider the case in which there is only $m=1$ output node in the network. Presented with a pattern vector ${\bf x}=[x_1,\cdots,x_n]^T$, the output is computed to be

\begin{displaymath}
y'=\left\{ \begin{array}{ll} 1 & \mbox{if}\;\;net={\bf x}^T{\bf w}+b \geq 0 \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath}

where ${\bf w}=[w_1,\cdots,w_n]^T$ is the weight vector and $b$ is the bias. The inequality can also be written as:

\begin{displaymath}
{\bf x}^T{\bf w} \geq -b,\;\;\;\;\mbox{or}\;\;\;\;\;\;
\frac{{\bf x}^T{\bf w}}{\vert\vert{\bf w}\vert\vert}
=Proj_{\bf w}({\bf x})\geq -\frac{b}{\vert\vert{\bf w}\vert\vert}=b'
\end{displaymath}

As shown in the figure, this network is a 2-class classifier. The weight vector ${\bf w}$ represents the normal direction of a hyperplane in the n-D space, which partitions the space into two half-spaces, one for each of the two classes. The inner product

\begin{displaymath}
{\bf x}^T{\bf w}=\frac{{\bf x}^T{\bf w}}{\vert\vert{\bf w}\vert\vert}\,
\vert\vert{\bf w}\vert\vert=\vert\vert{\bf w}\vert\vert\;Proj_{\bf w}({\bf x})
\end{displaymath}

is proportional to the projection $Proj_{\bf w}({\bf x})$ of ${\bf x}$ onto the normal direction ${\bf w}$. This projection is either greater or smaller than the normalized bias $b'=-b/\vert\vert{\bf w}\vert\vert$, the distance from the hyperplane with normal direction ${\bf w}$ to the origin, depending on which of the two classes the pattern vector ${\bf x}$ belongs to.

The binary output $y'$ is either $1$, indicating ${\bf x}$ belongs to class 1 ( ${\bf x}\in\omega_1$), or $0$, indicating ${\bf x}$ belongs to class 2 ( ${\bf x}\in\omega_2$), as illustrated in the figure below:

[Figure: perceptron.png]
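
As a concrete illustration of the formulas above, the following short Python sketch (the function names and example values are illustrative, not from the text) computes the output $y'$ by thresholding $net={\bf x}^T{\bf w}+b$, and checks that the equivalent geometric test, comparing the projection $Proj_{\bf w}({\bf x})$ with $b'=-b/\vert\vert{\bf w}\vert\vert$, gives the same decision.

import numpy as np

def perceptron_output(x, w, b):
    """Threshold unit: y' = 1 if net = x.w + b >= 0, else 0."""
    net = np.dot(x, w) + b
    return 1 if net >= 0 else 0

def projection_test(x, w, b):
    """Equivalent geometric test: compare Proj_w(x) with b' = -b/||w||."""
    proj = np.dot(x, w) / np.linalg.norm(w)   # projection of x onto direction w
    b_prime = -b / np.linalg.norm(w)          # distance of the hyperplane to the origin
    return 1 if proj >= b_prime else 0

# arbitrary example values, for illustration only
x = np.array([0.5, -1.2, 2.0])
w = np.array([1.0, 0.3, -0.7])
b = 0.2
assert perceptron_output(x, w, b) == projection_test(x, w, b)
print(perceptron_output(x, w, b))   # 1 if x is assigned to class 1, 0 for class 2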

The weight vector ${\bf w}$ is obtained in the training process based on a set of $K$ labeled training samples with known classes: $\{ ({\bf x}_k, y_k),\;k=1,\cdots,K \}$, where ${\bf x}_k$ is an n-D vector representing a pattern and the binary label $y_k$ indicates to which of the two classes, $\omega_1$ or $\omega_2$, the input ${\bf x}_k$ belongs.

Specifically, in each step of the training process, one of the $K$ patterns, ${\bf x}$, chosen at random, is presented to the $n$ input nodes for the network to generate an output $y'$. This output is then compared with the desired output $y$ corresponding to this input to obtain their difference $\delta=y-y'$, based on which the weight vector ${\bf w}$ is then updated by the following learning law:

\begin{displaymath}{\bf w}^{new}={\bf w}^{old}+\triangle {\bf w}={\bf w}^{old}
+\eta (y-y'){\bf x} \end{displaymath}

where $\eta$ is the learning rate and

\begin{displaymath}\triangle {\bf w}\stackrel{\triangle}{=}\eta(y-y'){\bf x}=\eta \delta {\bf x} \end{displaymath}

Here $\delta\stackrel{\triangle}{=}y-y'$ represents the error, the difference between the desired output and the actual output, and the learning law is therefore called the $\delta$-rule. How the learning law works can be seen by considering the following three cases:

If $y=y'$, then $\delta=0$ and ${\bf w}$ is not changed, as the pattern ${\bf x}$ is already correctly classified.

If $y=1$ but $y'=0$, then $\delta=1$ and ${\bf w}^{new}={\bf w}^{old}+\eta{\bf x}$ is moved toward ${\bf x}$, so that $net={\bf x}^T{\bf w}+b$ is increased.

If $y=0$ but $y'=1$, then $\delta=-1$ and ${\bf w}^{new}={\bf w}^{old}-\eta{\bf x}$ is moved away from ${\bf x}$, so that $net={\bf x}^T{\bf w}+b$ is decreased.
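
A minimal sketch of this training procedure in Python is given below, assuming the patterns and labels are stored in NumPy arrays; the function name, the random presentation order, and the stopping rule (stop after an error-free pass, or after a fixed number of epochs) are illustrative choices, not part of the text.

import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100, seed=0):
    """Train a single-output perceptron with the delta-rule.
    X: (K, n) array of patterns; y: (K,) array of binary labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    K, n = X.shape
    w = np.zeros(n)                            # weight vector
    b = 0.0                                    # bias
    for _ in range(max_epochs):
        errors = 0
        for k in rng.permutation(K):           # present the K patterns in random order
            y_pred = 1 if X[k] @ w + b >= 0 else 0
            delta = y[k] - y_pred              # error: desired minus actual output
            if delta != 0:
                w = w + eta * delta * X[k]     # delta-rule update of the weights
                b = b + eta * delta            # bias treated as a weight on a constant input 1
                errors += 1
        if errors == 0:                        # every sample classified correctly
            break
    return w, b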

Perceptron convergence theorem:

If $\omega_1$ and $\omega_2$ are two linearly separable clusters of patterns, then a perceptron will develop a weight vector ${\bf w}$ that classifies them correctly in a finite number of training trials.

As the perceptron network is only capable of linear operations, the corresponding classification is limited to classes that are linearly separable by a hyperplane in the n-D space. A typical example of two classes that are not linearly separable is the XOR problem, where the 2-D space is divided into four regions belonging to the two classes 0 and 1, just like the Karnaugh map for the exclusive-OR of two logical variables, $x\oplus y$.
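
The contrast between a linearly separable problem, such as the logical AND, and the XOR problem can be demonstrated with a short experiment; in the sketch below (the fixed epoch budget and learning rate are arbitrary illustrative choices) the same $\delta$-rule drives the errors on AND to zero, while for XOR some of the four patterns remain misclassified no matter how long the training runs.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

def errors_after_training(y, eta=0.5, epochs=200):
    """Run the delta-rule and count the patterns still misclassified."""
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        for k in range(4):
            y_pred = 1 if X[k] @ w + b >= 0 else 0
            delta = y[k] - y_pred
            w += eta * delta * X[k]
            b += eta * delta
    return sum((1 if X[k] @ w + b >= 0 else 0) != y[k] for k in range(4))

print("AND errors:", errors_after_training(y_and))   # 0: a separating line is found
print("XOR errors:", errors_after_training(y_xor))   # nonzero: no line can separate XOR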

When there are $m>1$ output nodes in the network, the network becomes a multiple-class classifier. The weight vectors ${\bf w}_i$ ($i=1,\cdots,m$) for these $m$ output nodes define $m$ hyperplanes that partition the n-D space into multiple regions each corresponding to one of the classes. Specifically, if $m<n$, then the n-D space can be partitioned into as many as $2^m$ regions.
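
A sketch of the forward computation for the $m$-output case is given below; the $m$ weight vectors are stacked into an $m\times n$ matrix so that the $m$ thresholded outputs form a binary code identifying one of up to $2^m$ regions (the function name, variable names, and example values are illustrative only).

import numpy as np

def multi_output(x, W, b):
    """m-output perceptron layer.
    W: (m, n) matrix whose rows are the weight vectors w_i; b: (m,) biases.
    Each output node thresholds its own net input, so the m-bit output
    word identifies one of up to 2**m regions of the n-D input space."""
    net = W @ x + b
    return (net >= 0).astype(int)

# example: m = 2 hyperplanes in an n = 3 dimensional space (arbitrary values)
W = np.array([[1.0, -0.5,  0.2],
              [0.3,  0.8, -1.0]])
b = np.array([0.1, -0.2])
x = np.array([0.4, 1.5, -0.3])
print(multi_output(x, W, b))   # prints the 2-bit region code of x, here [0 1]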

The limitation to linearly separable classes can be overcome by having more than one learning layer in the network. However, the learning law for the single-layer perceptron network no longer applies, and a new training method is needed.


Ruye Wang 2015-08-13