The back propagation network (BP network or BPN) is a supervised learning algorithm that finds a wide variety of applications in practice. In the most general sense, a BPN can be used as an associator to learn the relationship between two sets of patterns represented in vector form. Specifically in classification, similar to the perceptron network, the BPN as a classifier is based on the training set $\{(\mathbf{x}_n,\,\mathbf{y}_n),\;n=1,\cdots,N\}$, of which each pattern $\mathbf{x}_n\in\mathbb{R}^d$, a d-dimensional vector, is associated with the corresponding pattern $\mathbf{y}_n$, an m-dimensional vector in $\mathbb{R}^m$, as its class identity labeling, indicating to which of the $K$ classes pattern $\mathbf{x}_n$ belongs.
Different from the perceptron, in which there is only one level of learning (the weights between the input and output layers), a BPN is a multi-layer (three or more) hierarchical structure composed of the input, hidden, and output layers, in which learning takes place at multiple levels between consecutive layers. Consequently, a BPN can be more flexible and powerful than the two-layer perceptron network. In the following, before we consider the general BPN containing multiple hidden layers, we first derive the back propagation algorithm for the simplest BPN with only one hidden layer between the input and output layers.
We assume the input, hidden and output layers contain respectively $d$, $L$, and $m$ nodes, and each node in the hidden and output layers is fully connected to all nodes in the previous layer. When one of the $N$ training patterns $\mathbf{x}_n$ is presented to the input layer of the BPN, an m-D vector $\hat{\mathbf{y}}_n=f(\mathbf{x}_n,\,\mathbf{W}_h,\,\mathbf{W}_o)$ is produced at the output layer as the corresponding response to the input $\mathbf{x}_n$. Here $\mathbf{W}_h$ and $\mathbf{W}_o$ are the function parameters containing the augmented weight vectors for both the hidden and output layers, to be determined in the training process based on the training set $\{(\mathbf{x}_n,\,\mathbf{y}_n)\}$ so that its output $\hat{\mathbf{y}}_n$ as a function of the current input $\mathbf{x}_n$ matches the desired output, the labeling $\mathbf{y}_n$.
Once the BPN is fully trained, any unlabeled pattern $\mathbf{x}$ can then be classified into one of the $K$ classes, the one whose labeling $\mathbf{y}_k$ gives the minimum distance $\|\hat{\mathbf{y}}-\mathbf{y}_k\|$ to the actual output $\hat{\mathbf{y}}$ of the network. Note that, different from the perceptron network, the outputs of the output nodes are in general not binary, although they can still be binary if either one-hot or binary encoding is used for class labeling.
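As an illustration of the two encoding schemes and the minimum-distance decision mentioned above, the short Matlab sketch below sets up the labelings for a hypothetical case of $K=4$ classes (the specific matrices and the sample output are only illustrative, not part of the original examples):

    % One-hot encoding: m = K = 4 output nodes, one per class
    Yonehot = eye(4);                        % columns [1 0 0 0]', [0 1 0 0]', ...
    % Binary encoding: m = log2(K) = 2 output nodes
    Ybinary = [0 0 1 1;                      % columns 00, 01, 10, 11
               0 1 0 1];
    % Minimum-distance classification of a (hypothetical) network output yhat
    yhat = [0.9; 0.2; 0.1; 0.05];            % output of a trained BPN with one-hot labeling
    [~,k] = min(sum((Yonehot - yhat).^2));   % k = 1, the class of the nearest labeling

With one-hot labeling, the nearest labeling is simply the class of the largest output component.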
Specifically, the training of the BPN is an iteration of the following two-phase process:
Forward pass: A randomly selected sample $\mathbf{x}_n$ labeled by $\mathbf{y}_n$ is presented to the input layer and forwarded through the weighted connections to the hidden layer and then the output layer to produce the output $\hat{\mathbf{y}}_n$.
Backward pass: The squared error $\varepsilon=\|\mathbf{y}_n-\hat{\mathbf{y}}_n\|^2/2$, measuring the difference between the desired output $\mathbf{y}_n$ and the actual output $\hat{\mathbf{y}}_n$, is propagated backward from the output layer through the hidden layer to the input layer, during which the weights of both the output and hidden layers are modified so that $\varepsilon$ will be reduced when the same or a similar pattern is presented in the future.
We now consider the specific computation taking place in both the forward and backward passes.
The squared error $\varepsilon$ can be written as a function of the output layer weights $\mathbf{W}_o=[w^o_{kj}]$ and the hidden layer weights $\mathbf{W}_h=[w^h_{ji}]$:

$$\varepsilon(\mathbf{W}_h,\mathbf{W}_o)=\frac{1}{2}\,\|\mathbf{y}-\hat{\mathbf{y}}\|^2=\frac{1}{2}\sum_{k=1}^{m}\big(y_k-\hat{y}_k\big)^2 \tag{50}$$

where the actual output $\hat{\mathbf{y}}$ depends on both $\mathbf{W}_h$ and $\mathbf{W}_o$ through the forward pass in equations (52) and (53) below.
Similar to the objective function of the ridge regression considered in a previous section, an additional regularization term can be included in the objective function to encourage small weights to prevent overfitting:
$$E(\mathbf{W}_h,\mathbf{W}_o)=\varepsilon(\mathbf{W}_h,\mathbf{W}_o)+\frac{\lambda}{2}\left(\|\mathbf{W}_h\|^2+\|\mathbf{W}_o\|^2\right) \tag{51}$$

where $\lambda\ge 0$ is a hyperparameter controlling how strongly large weights are penalized.
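In terms of the weight updates derived below, the gradient of the regularization term simply adds $\lambda\mathbf{W}$ to the gradient with respect to each weight matrix, which amounts to shrinking ("decaying") the weights at every step. As a minimal sketch, assuming a variable lambda holding $\lambda$ were added to the code listed later (it is not part of that listing), the two update lines would become:

    Wo = Wo + eta*(do*z' - lambda*Wo);   % output layer update with weight decay
    Wh = Wh + eta*(dh*x' - lambda*Wh);   % hidden layer update with weight decay

(Optionally the first, bias column of each weight matrix can be excluded from the decay.)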
The gradient descent method is used to modify first the output layer weights $\mathbf{W}_o$ and then the hidden layer weights $\mathbf{W}_h$ to iteratively reduce the objective function in the following steps:
$$a^h_j=\sum_{i=0}^{d} w^h_{ji}\,x_i, \qquad z_j=g(a^h_j), \qquad j=1,\cdots,L \tag{52}$$

$$a^o_k=\sum_{j=0}^{L} w^o_{kj}\,z_j, \qquad \hat{y}_k=g(a^o_k), \qquad k=1,\cdots,m \tag{53}$$

where $x_0=z_0=1$ are the constant bias inputs. For the output layer weights, the chain rule gives

$$\frac{\partial\varepsilon}{\partial w^o_{kj}}=\frac{\partial\varepsilon}{\partial \hat{y}_k}\,\frac{\partial \hat{y}_k}{\partial a^o_k}\,\frac{\partial a^o_k}{\partial w^o_{kj}}=-(y_k-\hat{y}_k)\,g'(a^o_k)\,z_j=-\delta^o_k\,z_j \tag{54}$$

where

$$\delta^o_k=(y_k-\hat{y}_k)\,g'(a^o_k) \tag{55}$$

so that the gradient descent update of the output layer weights is

$$w^o_{kj} \leftarrow w^o_{kj}-\eta\,\frac{\partial\varepsilon}{\partial w^o_{kj}}=w^o_{kj}+\eta\,\delta^o_k\,z_j \tag{56}$$

For the hidden layer weights, the error is propagated back through all $m$ output nodes:

$$\frac{\partial\varepsilon}{\partial w^h_{ji}}=\sum_{k=1}^{m}\frac{\partial\varepsilon}{\partial a^o_k}\,\frac{\partial a^o_k}{\partial z_j}\,\frac{\partial z_j}{\partial a^h_j}\,\frac{\partial a^h_j}{\partial w^h_{ji}}=-\Big(\sum_{k=1}^{m} w^o_{kj}\,\delta^o_k\Big)\,g'(a^h_j)\,x_i=-\delta^h_j\,x_i \tag{57}$$

where

$$\delta^h_j=g'(a^h_j)\sum_{k=1}^{m} w^o_{kj}\,\delta^o_k \tag{58}$$

and the update of the hidden layer weights is

$$w^h_{ji} \leftarrow w^h_{ji}-\eta\,\frac{\partial\varepsilon}{\partial w^h_{ji}}=w^h_{ji}+\eta\,\delta^h_j\,x_i \tag{59}$$

where $\eta$ is the learning rate.
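Since the listed code uses the logistic sigmoid as the activation function $g$, the derivative $g'$ appearing in the updates above has a convenient closed form:

$$g(a)=\frac{1}{1+e^{-a}}, \qquad g'(a)=\frac{e^{-a}}{(1+e^{-a})^2}=g(a)\,\big(1-g(a)\big)$$

so that, for example, $g'(a^o_k)=\hat{y}_k(1-\hat{y}_k)$ can be computed directly from the already available node outputs, an alternative to the symbolic differentiation used in the code below.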
In summary, here are the steps in each iteration:

1. Randomly select a training sample $\mathbf{x}$ (augmented by $x_0=1$) with labeling $\mathbf{y}$ and present it to the input layer;
2. Find the activation $\mathbf{a}^h=\mathbf{W}_h\,\mathbf{x}$ of the hidden layer;
3. Find the output $\mathbf{z}=[1;\;g(\mathbf{a}^h)]$ of the hidden layer, augmented by $z_0=1$;
4. Find the activation $\mathbf{a}^o=\mathbf{W}_o\,\mathbf{z}$ and the output $\hat{\mathbf{y}}=g(\mathbf{a}^o)$ of the output layer;
5. Find $\boldsymbol{\delta}^o=(\mathbf{y}-\hat{\mathbf{y}})\odot g'(\mathbf{a}^o)$ and update the output layer weights $\mathbf{W}_o \leftarrow \mathbf{W}_o+\eta\,\boldsymbol{\delta}^o\,\mathbf{z}^T$;
6. Find $\boldsymbol{\delta}^h=(\tilde{\mathbf{W}}_o^T\,\boldsymbol{\delta}^o)\odot g'(\mathbf{a}^h)$ for the hidden layer, where $\tilde{\mathbf{W}}_o$ is $\mathbf{W}_o$ without its first (bias) column;
7. Update the hidden layer weights $\mathbf{W}_h \leftarrow \mathbf{W}_h+\eta\,\boldsymbol{\delta}^h\,\mathbf{x}^T$.

Here $\odot$ denotes the element-wise product.
The Matlab code for the essential part of the BPN algorithm is listed below. Array X contains the $N$ training samples of the $K$ classes, array Y contains the corresponding labelings of the $N$ training samples, and arrays Wh and Wo contain respectively the $(d+1)$-dimensional and $(L+1)$-dimensional augmented weight vectors for the $L$ hidden nodes and the $m$ output nodes. Also, L is the number of hidden nodes, eta is the learning rate between 0 and 1, and tol is the tolerance for the termination of the learning iteration (e.g., 0.01); these are passed to the function as arguments.
function [Wh, Wo, g]=backPropagate(X,Y,L,eta,tol)
    syms x
    g=1/(1+exp(-x));              % Sigmoid activation function
    dg=diff(g);                   % its derivative function
    g=matlabFunction(g);
    dg=matlabFunction(dg);
    [d,N]=size(X);                % number of inputs and number of samples
    M=size(Y,1);                  % number of output nodes
    X=[ones(1,N); X];             % augment X by adding a row of x0=1
    Wh=1-2*rand(L,d+1);           % initialize hidden layer weights
    Wo=1-2*rand(M,L+1);           % initialize output layer weights
    er=inf;
    while er > tol
        I=randperm(N);            % random permutation of samples
        er=0;
        for n=1:N                 % all N samples for an epoch
            x=X(:,I(n));          % pick a training sample
            ah=Wh*x;              % activation of hidden layer
            z=[1; g(ah)];         % augment z by adding z0=1
            ao=Wo*z;              % activation of output layer
            yhat=g(ao);           % output of output layer
            delta=Y(:,I(n))-yhat; % delta error
            er=er+norm(delta)/N;  % test error
            do=delta.*dg(ao);     % delta of output layer
            Wo=Wo+eta*do*z';      % update weights for output layer
            dh=(Wo(:,2:L+1)'*do).*dg(ah);  % delta of hidden layer
            Wh=Wh+eta*dh*x';      % update weights for hidden layer
        end
    end
end
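A minimal usage sketch of the function above is given below; the data, class layout, and parameter values are hypothetical and only for illustration (they are not from the original examples), and the signature with L, eta and tol as arguments is the one used in the listing:

    N=200; d=2;                                   % two 2-D Gaussian classes
    X=[randn(d,N/2)-1, randn(d,N/2)+1];           % N training samples
    Y=[repmat([1;0],1,N/2), repmat([0;1],1,N/2)]; % one-hot labelings (m=2)
    [Wh,Wo,g]=backPropagate(X,Y,5,0.5,0.01);      % L=5 hidden nodes, eta=0.5, tol=0.01
    x=[1; 0.8; 0.9];                              % an unlabeled pattern, augmented by x0=1
    yhat=g(Wo*[1; g(Wh*x)]);                      % forward pass through the trained BPN
    [~,k]=min(sum((eye(2)-yhat).^2));             % index of the class with the nearest labeling

Whether the tolerance is reached depends on the data and the learning rate; for harder problems the while loop may need an additional cap on the number of epochs.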
The training process of the BP network can also be considered as a data modeling problem, in which the given data $\{(\mathbf{x}_n,\,\mathbf{y}_n),\;n=1,\cdots,N\}$ are fitted by a function with the weights of both the hidden and output layers as the parameters:

$$\hat{\mathbf{y}}=f(\mathbf{x},\,\mathbf{W}_h,\,\mathbf{W}_o)=g\big(\mathbf{W}_o\,g(\mathbf{W}_h\,\mathbf{x})\big) \tag{60}$$

where $\mathbf{x}$ and the hidden layer output $g(\mathbf{W}_h\,\mathbf{x})$ are understood to be augmented by the constant bias component 1.
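In Matlab, the fitted model in equation (60) can be expressed as a single anonymous function; this is a sketch assuming the sigmoid handle g and the trained weights Wh and Wo returned by the function listed above:

    f = @(x) g( Wo*[1; g( Wh*[1; x] )] );   % yhat = f(x) for a non-augmented input x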
The three-layer BPN containing a single hidden layer discussed above can be easily generalized to a multilayer BPN containing any number of hidden layers (e.g., for a deep learning network), by simply repeating steps 6 and 7 for the single hidden layer in the algorithm list above. We assume that, in addition to the input layer, there are in total $L$ learning layers including all the hidden layers and the output layer, indexed by $l=1,\cdots,L$ (here $L$ denotes the number of learning layers rather than the number of hidden nodes). Then steps 5 through 7 above become:

$$\boldsymbol{\delta}^{(L)}=(\mathbf{y}-\hat{\mathbf{y}})\odot g'(\mathbf{a}^{(L)}) \tag{61}$$

$$\boldsymbol{\delta}^{(l)}=\big(\tilde{\mathbf{W}}^{(l+1)T}\,\boldsymbol{\delta}^{(l+1)}\big)\odot g'(\mathbf{a}^{(l)}), \qquad l=L-1,\cdots,1 \tag{62}$$

$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)}+\eta\,\boldsymbol{\delta}^{(l)}\,\mathbf{z}^{(l)T}, \qquad l=L,\cdots,1 \tag{63}$$

where $\mathbf{z}^{(l)}$ is the augmented input to the $l$th layer, $\mathbf{a}^{(l)}=\mathbf{W}^{(l)}\mathbf{z}^{(l)}$ is its activation, and $\tilde{\mathbf{W}}^{(l+1)}$ is $\mathbf{W}^{(l+1)}$ without its first (bias) column.
The Matlab code segment below is for the BP network with multiple hidden layers, in which L is the total number of learning layers and m(l) is the number of nodes in the $l$th layer ($l=1,\cdots,L$):
W={1-2*rand(m(1),d+1)};                 % initial weights for first layer
for l=2:L
    W{l}=1-2*rand(m(l),m(l-1)+1);       % initial weights for all other layers
end
er=inf;
while er > tol
    I=randperm(N);                      % random permutation of samples
    er=0;
    for n=1:N                           % N samples for an epoch
        z={[1; X(:,I(n))]};             % pick a training sample, augmented by z0=1
        a={W{1}*z{1}};                  % activation of first layer
        for l=2:L                       % the forward pass
            z{l}=[1; g(a{l-1})];        % input to the lth layer
            a{l}=W{l}*z{l};             % activation of the lth layer
        end
        yhat=g(a{L});                   % actual output of last layer
        delta=Y(:,I(n))-yhat;           % delta error
        er=er+norm(delta)/N;            % test error
        D{L}=delta.*dg(a{L});           % delta for output layer
        W{L}=W{L}+eta*D{L}*z{L}';       % update weights for output layer
        for l=L-1:-1:1                  % the backward pass
            D{l}=(W{l+1}(:,2:end)'*D{l+1}).*dg(a{l});  % delta for hidden layers
            W{l}=W{l}+eta*D{l}*z{l}';   % update weights for hidden layers
        end
    end
end
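The segment above assumes that the data arrays and the hyperparameters have already been defined; a hypothetical setup (the names follow the segment, but the specific values are only illustrative) could be:

    [d,N]=size(X);                % input dimension and number of training samples
    M=size(Y,1);                  % number of output nodes
    L=3; m=[10 8 M];              % two hidden layers (10 and 8 nodes) plus the output layer
    eta=0.5; tol=0.01;            % learning rate and stopping tolerance
    g =@(a) 1./(1+exp(-a));       % sigmoid activation function
    dg=@(a) g(a).*(1-g(a));       % its derivative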
Example 1: The classification results of two previously used 2-D data sets are shown below; the corresponding error rates and confusion matrices are:

(64)
Example 2: The back propagation network is applied to the classification of the dataset of handwritten digits from 0 to 9 used previously. Out of the 2240 samples in the dataset, half are used for training while the other half are used for testing. Shown below are the confusion matrices of both the training (left) and testing (right) phases. Out of the 1120 samples for training, 27 are misclassified (an error rate of about 2.4%), and out of the 1120 samples for testing, 74 are misclassified (about 6.6%).

(65)
Hierarchical structure and two pathways of the visual cortex