
Perceptron Learning

A perceptron network (F. Rosenblatt, 1957) also has two layers, as in the previous Hebb network, except that the learning law is different. Perceptron learning is a supervised learning algorithm: the weights are adjusted based on prior knowledge in the form of training samples with known class labels.

[Figure twolayernet.gif: a two-layer perceptron network]

We first consider the case in which there is only $m=1$ output node in the network. Presented with a pattern vector ${\bf x}=[x_1,\cdots,x_n]^T$, the output is computed to be

\begin{displaymath}
y'=\left\{ \begin{array}{ll} 1 & \mbox{if}\;\;net={\bf x}^T{\bf w}+b \geq 0 \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath}

where ${\bf w}=[w_1,\cdots,w_n]^T$ is the weight vector and $b$ is the bias. The inequality can also be written as:

\begin{displaymath}
{\bf x}^T{\bf w} \geq -b,\;\;\;\;\mbox{or}\;\;\;\;\;\;
\frac{{\bf x}^T{\bf w}}{\vert\vert{\bf w}\vert\vert}
=Proj_{\bf w}({\bf x})\geq -\frac{b}{\vert\vert{\bf w}\vert\vert}=b'
\end{displaymath}

As shown in the figure, this network is a 2-class classifier. The weight vector ${\bf w}$ represents the normal direction of a hyperplane in the n-D space, which partitions the space into two half-spaces, one for each of the two classes. The inner product

\begin{displaymath}
{\bf x}^T{\bf w}=\frac{{\bf x}^T{\bf w}}{\vert\vert{\bf w}\vert\vert}\,
\vert\vert{\bf w}\vert\vert=\vert\vert{\bf w}\vert\vert\;Proj_{\bf w}({\bf x})
\end{displaymath}

is proportional to the projection $Proj_{\bf w}({\bf x})$ of ${\bf x}$ onto the normal direction ${\bf w}$. This projection is either greater or smaller than the normalized bias $b'=-b/\vert\vert{\bf w}\vert\vert$, the distance from the hyperplane with normal direction ${\bf w}$ to the origin, depending on which of the two classes the pattern vector ${\bf x}$ belongs to.

The binary output $y'$ is either $1$, indicating ${\bf x}$ belongs to class 1 ( ${\bf x}\in\omega_1$), or $0$, indicating ${\bf x}$ belongs to class 2 ( ${\bf x}\in\omega_2$), as illustrated in the figure below:

[Figure: perceptron.png]
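
As a concrete illustration of the formulas above, the following short Python sketch (the function names and example values are illustrative, not from the text) computes the output $y'$ by thresholding $net={\bf x}^T{\bf w}+b$, and checks that the equivalent geometric test, comparing the projection $Proj_{\bf w}({\bf x})$ with $b'=-b/\vert\vert{\bf w}\vert\vert$, gives the same decision.

import numpy as np

def perceptron_output(x, w, b):
    """Threshold unit: y' = 1 if net = x.w + b >= 0, else 0."""
    net = np.dot(x, w) + b
    return 1 if net >= 0 else 0

def projection_test(x, w, b):
    """Equivalent geometric test: compare Proj_w(x) with b' = -b/||w||."""
    proj = np.dot(x, w) / np.linalg.norm(w)   # projection of x onto direction w
    b_prime = -b / np.linalg.norm(w)          # distance of the hyperplane to the origin
    return 1 if proj >= b_prime else 0

# arbitrary example values, for illustration only
x = np.array([0.5, -1.2, 2.0])
w = np.array([1.0, 0.3, -0.7])
b = 0.2
assert perceptron_output(x, w, b) == projection_test(x, w, b)
print(perceptron_output(x, w, b))   # 1 if x is assigned to class 1, 0 for class 2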

The weight vector ${\bf w}$ is obtained in the training process based on a set of $K$ labeled training samples with known classes: $\{ ({\bf x}_k, y_k),\;k=1,\cdots,K \}$, where ${\bf x}_k$ is an n-D vector representing a pattern and the binary label $y_k$ indicates to which of the two classes, $\omega_1$ or $\omega_2$, the input ${\bf x}_k$ belongs.

Specifically, in each step of the training process, one of the $K$ patterns, ${\bf x}$, chosen at random, is presented to the $n$ input nodes for the network to generate an output $y'$. This output is then compared with the desired output $y$ corresponding to this input to obtain their difference $\delta=y-y'$, based on which the weight vector ${\bf w}$ is then updated by the following learning law:

\begin{displaymath}{\bf w}^{new}={\bf w}^{old}+\triangle {\bf w}={\bf w}^{old}
+\eta (y-y'){\bf x} \end{displaymath}

where $\eta$ is the learning rate and

\begin{displaymath}\triangle {\bf w}\stackrel{\triangle}{=}\eta(y-y'){\bf x}=\eta \delta {\bf x} \end{displaymath}

Here $\delta\stackrel{\triangle}{=}y-y'$ represents the error, the difference between the desired output and the actual output, and the learning law is therefore called the $\delta$-rule. How the learning law works can be seen by considering the following three cases:

If $y=y'$, then $\delta=0$ and ${\bf w}$ is not changed, as the pattern ${\bf x}$ is already correctly classified.

If $y=1$ but $y'=0$, then $\delta=1$ and ${\bf w}^{new}={\bf w}^{old}+\eta{\bf x}$ is moved toward ${\bf x}$, so that $net={\bf x}^T{\bf w}+b$ is increased.

If $y=0$ but $y'=1$, then $\delta=-1$ and ${\bf w}^{new}={\bf w}^{old}-\eta{\bf x}$ is moved away from ${\bf x}$, so that $net={\bf x}^T{\bf w}+b$ is decreased.
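
A minimal sketch of this training procedure in Python is given below, assuming the patterns and labels are stored in NumPy arrays; the function name, the random presentation order, and the stopping rule (stop after an error-free pass, or after a fixed number of epochs) are illustrative choices, not part of the text.

import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100, seed=0):
    """Train a single-output perceptron with the delta-rule.
    X: (K, n) array of patterns; y: (K,) array of binary labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    K, n = X.shape
    w = np.zeros(n)                            # weight vector
    b = 0.0                                    # bias
    for _ in range(max_epochs):
        errors = 0
        for k in rng.permutation(K):           # present the K patterns in random order
            y_pred = 1 if X[k] @ w + b >= 0 else 0
            delta = y[k] - y_pred              # error: desired minus actual output
            if delta != 0:
                w = w + eta * delta * X[k]     # delta-rule update of the weights
                b = b + eta * delta            # bias treated as a weight on a constant input 1
                errors += 1
        if errors == 0:                        # every sample classified correctly
            break
    return w, b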

Perceptron convergence theorem:

If $\omega_1$ and $\omega_2$ are two linearly separable clusters of patterns, then a perceptron will develop a weight vector ${\bf w}$ that classifies them correctly in a finite number of training trials.

As the perceptron network is only capable of linear operations, the corresponding classification is limited to classes that are linearly separable by a hyperplane in the n-D space. A typical example of two classes that are not linearly separable is the XOR problem, where the 2-D space is divided into four regions belonging to the two classes 0 and 1, just like the Karnaugh map for the exclusive-OR of two logical variables, $x\oplus y$.
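
The contrast between a linearly separable problem, such as the logical AND, and the XOR problem can be demonstrated with a short experiment; in the sketch below (the fixed epoch budget and learning rate are arbitrary illustrative choices) the same $\delta$-rule drives the errors on AND to zero, while for XOR some of the four patterns remain misclassified no matter how long the training runs.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

def errors_after_training(y, eta=0.5, epochs=200):
    """Run the delta-rule and count the patterns still misclassified."""
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        for k in range(4):
            y_pred = 1 if X[k] @ w + b >= 0 else 0
            delta = y[k] - y_pred
            w += eta * delta * X[k]
            b += eta * delta
    return sum((1 if X[k] @ w + b >= 0 else 0) != y[k] for k in range(4))

print("AND errors:", errors_after_training(y_and))   # 0: a separating line is found
print("XOR errors:", errors_after_training(y_xor))   # nonzero: no line can separate XOR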

When there are $m>1$ output nodes in the network, the network becomes a multiple-class classifier. The weight vectors ${\bf w}_i$ ($i=1,\cdots,m$) for these $m$ output nodes define $m$ hyperplanes that partition the n-D space into multiple regions each corresponding to one of the classes. Specifically, if $m<n$, then the n-D space can be partitioned into as many as $2^m$ regions.
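
A sketch of the forward computation for the $m$-output case is given below; the $m$ weight vectors are stacked into an $m\times n$ matrix so that the $m$ thresholded outputs form a binary code identifying one of up to $2^m$ regions (the function name, variable names, and example values are illustrative only).

import numpy as np

def multi_output(x, W, b):
    """m-output perceptron layer.
    W: (m, n) matrix whose rows are the weight vectors w_i; b: (m,) biases.
    Each output node thresholds its own net input, so the m-bit output
    word identifies one of up to 2**m regions of the n-D input space."""
    net = W @ x + b
    return (net >= 0).astype(int)

# example: m = 2 hyperplanes in an n = 3 dimensional space (arbitrary values)
W = np.array([[1.0, -0.5,  0.2],
              [0.3,  0.8, -1.0]])
b = np.array([0.1, -0.2])
x = np.array([0.4, 1.5, -0.3])
print(multi_output(x, W, b))   # prints the 2-bit region code of x, here [0 1]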

The limitation to linearly separable classes can be overcome by having more than one learning layer in the network. However, the learning law for the single-layer perceptron network no longer applies, and a new training method is needed.


Ruye Wang 2015-08-13