Back Propagation

The back propagation network (BP network or BPN) is a supervised learning algorithm that finds a wide variety of applications in practice. In the most general sense, a BPN can be used as an associator to learn the relationship between two sets of patterns represented in vector form. Specifically in classification, similar to the perceptron network, the BPN as a classifier is based on the training set ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$, of which each pattern ${\bf x}_n$, a $d$-dimensional vector, is associated with the corresponding $m$-dimensional vector ${\bf y}_n$ in ${\bf Y}=[{\bf y}_1,\cdots,{\bf y}_N]$ as its class identity labeling, indicating to which of the $K$ classes $\{C_1,\cdots,C_K\}$ pattern ${\bf x}_n$ belongs.

Different from the perceptron, in which learning takes place at only one level, between the output and input layers in terms of the weights, a BPN is a multi-layer (three or more) hierarchical structure composed of the input, hidden, and output layers, in which learning takes place at multiple levels between consecutive layers. Consequently, a BPN can be more flexible and powerful than the two-layer perceptron network. In the following, before we consider the general BPN containing multiple hidden layers, we first derive the back propagation algorithm for the simplest BPN with only one hidden layer between the input and output layers.

We assume the input, hidden, and output layers contain respectively $d$, $l$, and $m$ nodes, and each node in the hidden and output layers is fully connected to all nodes in the previous layer. When one of the $N$ training patterns ${\bf x}$ is presented to the input layer of the BPN, an $m$-dimensional vector $\hat{\bf y}=f({\bf x},{\bf W}^h,{\bf W}^o)$ is produced at the output layer as the corresponding response to the input ${\bf x}$. Here ${\bf W}^h=[{\bf w}^h_1,\cdots,{\bf w}^h_l]$ and ${\bf W}^o=[{\bf w}^o_1,\cdots,{\bf w}^o_m]$ are the function parameters containing the augmented weight vectors for both the hidden and output layers, to be determined in the training process based on the training sets ${\bf X}$ and ${\bf Y}$ so that the output $\hat{\bf y}_n$, as a function of the current input ${\bf x}_n$, matches the desired output, the labeling ${\bf y}_n$. Once the BPN is fully trained, any unlabeled pattern ${\bf x}$ can then be classified into the class among the $K$ classes whose labeling ${\bf y}$ yields the minimum $\delta=\vert\vert{\bf y}-\hat{\bf y}\vert\vert$. Note that different from the perceptron network, the outputs of the $m$ output nodes are in general not binary, although they can still be binary if either one-hot or binary encoding is used for class labeling.
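
As a concrete illustration of this classification rule, the minimal Matlab sketch below performs a single forward pass through a trained single-hidden-layer network and picks the class whose labeling is closest to the output. Here Wh, Wo, and the activation function g are assumed to come from the backPropagate function listed later in this section, while xNew (the unlabeled pattern) and C (an $m\times K$ matrix whose $k$th column is the labeling of class $C_k$) are hypothetical names introduced only for this example:

    x=[1; xNew];                        % augment the unlabeled d-dimensional pattern
    z=[1; g(Wh*x)];                     % hidden layer output, augmented by z0=1
    yhat=g(Wo*z);                       % m-dimensional network output
    [~,k]=min(sum((C-yhat).^2,1));      % class k minimizing ||y_k - yhat||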

[Figure: a three-layer back propagation network with input, hidden, and output layers (threelayernet.gif)]

Specifically, the training of the BPN is an iteration of the following two-phase process: in the forward pass, a training pattern presented to the input layer is propagated through the hidden layer to the output layer to produce the actual output $\hat{\bf y}$; in the backward pass, the error ${\bf\delta}={\bf y}-\hat{\bf y}$ between the desired and actual outputs is propagated backward through the network and used to update the weights of both the output and hidden layers.

This two-phase process is iteratively carried out until eventually the error is minimized for all training samples and the BPN is properly trained.

We now consider the specific computation taking place in both the forward and backward passes.

In summary, here are the steps in each iteration:

  1. Input a randomly selected pattern $[x_1,\cdots,x_d]^T$ and construct the $(d+1)$-dimensional augmented vector ${\bf x}=[1,x_1,\cdots,x_d]^T$;

  2. Compute the hidden layer activation ${\bf a}^h={\bf W}^h{\bf x}$ and output ${\bf z}={\bf g}({\bf a}^h)$, and construct the $(l+1)$-dimensional augmented vector ${\bf z} \leftarrow [1,{\bf z}]$;

  3. Compute the output layer activation ${\bf a}^o={\bf W}^o{\bf z}$ and output $\hat{\bf y}={\bf g}({\bf a}^o)$;

  4. Get the elementwise product ${\bf d}^o=({\bf y}-\hat{\bf y})\odot {\bf g}'({\bf a}^o)={\bf\delta}\odot{\bf g}'({\bf a}^o)$;

  5. Update the output layer weights ${\bf W}^o\Leftarrow {\bf W}^o+\eta\;{\bf d}^o{\bf z}^T-\eta\lambda{\bf W}^o$;

  6. Get the elementwise product ${\bf d}^h={\bf W}^{oT}{\bf d}^o \odot {\bf g}'({\bf a}^h)={\bf\delta}^h\odot {\bf g}'({\bf a}^h)$, where the ${\bf W}^o_{m\times l}$ used here is the same as ${\bf W}^o$ but with its first column removed.

  7. Update the hidden layer weights ${\bf W}^h\Leftarrow {\bf W}^h+\eta\;{\bf d}^h{\bf x}^T-\eta\lambda{\bf W}^h$;

  8. Terminate the iteration if the error $\varepsilon$ is acceptably small for all of the training patterns. Otherwise repeat the above with another pattern in the training set.
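
Note that if the logistic sigmoid $g(a)=1/(1+e^{-a})$ is used as the activation function, its derivative needed in steps 4 and 6 has the simple closed form $g'(a)=e^{-a}/(1+e^{-a})^2=g(a)\,(1-g(a))$, so ${\bf g}'({\bf a}^o)$ and ${\bf g}'({\bf a}^h)$ can be obtained directly from the already computed layer outputs $\hat{\bf y}$ and ${\bf z}$.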

The Matlab code for the essential part of the BPN algorithm is listed below. Array $X$ contains the $N$ training samples, each a $d$-dimensional vector, as its columns, and array $Y$ contains the corresponding $m$-dimensional labelings. The returned arrays Wh and Wo contain, as their rows, the $(d+1)$-dimensional weight vectors of the $L$ hidden nodes and the $(L+1)$-dimensional weight vectors of the $m$ output nodes, respectively. Also, $L$ is the number of hidden nodes, $\eta$ is the learning rate between 0 and 1, and $tol$ is the tolerance for the termination of the learning iteration (e.g., 0.01); these are passed to the function as additional arguments. For simplicity, the weight decay terms $-\eta\lambda{\bf W}^o$ and $-\eta\lambda{\bf W}^h$ of steps 5 and 7 are omitted in the code (i.e., $\lambda=0$).

function [Wh, Wo, g]=backPropagate(X,Y,L,eta,tol)
    syms x 
    g=1/(1+exp(-x));              % sigmoid activation function
    dg=diff(g);                   % its derivative function
    g=matlabFunction(g);          % convert to a numerical function handle
    dg=matlabFunction(dg);        % same for the derivative
    [d,N]=size(X);                % input dimension and number of samples
    M=size(Y,1);                  % number of output nodes
    X=[ones(1,N); X];             % augment X by adding a row of x0=1  
    Wh=1-2*rand(L,d+1);           % initialize hidden layer weights in [-1,1]
    Wo=1-2*rand(M,L+1);           % initialize output layer weights in [-1,1]
    er=inf;
    while er > tol
        I=randperm(N);            % random permutation of samples
        er=0;
        for n=1:N                 % for all N samples of an epoch
            x=X(:,I(n));          % pick a training sample      
            ah=Wh*x;              % activation of hidden layer  
            z=[1; g(ah)];         % augment z by adding z0=1
            ao=Wo*z;              % activation of output layer
            yhat=g(ao);           % output of output layer 
            delta=Y(:,I(n))-yhat; % delta error
            er=er+norm(delta)/N;  % accumulate test error
            do=delta.*dg(ao);     % find d of output layer 
            Wo=Wo+eta*do*z';      % update weights of output layer 
            dh=(Wo(:,2:L+1)'*do).*dg(ah);   % d of hidden layer 
            Wh=Wh+eta*dh*x';      % update weights of hidden layer 
        end         
    end
end
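
A possible usage sketch, assuming the training and test samples are stored column-wise in the hypothetical arrays Xtrain and Xtest, with one-hot labelings in Ytrain:

    L=10; eta=0.5; tol=0.01;                          % hypothetical parameter choices
    [Wh,Wo,g]=backPropagate(Xtrain,Ytrain,L,eta,tol); % train the network
    Nt=size(Xtest,2);                                 % number of test samples
    Z=g(Wh*[ones(1,Nt); Xtest]);                      % hidden layer outputs for all test samples
    Yhat=g(Wo*[ones(1,Nt); Z]);                       % network outputs, one column per sample
    [~,pred]=max(Yhat);                               % predicted class indices (one-hot labeling)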

The training process of the BP network can also be considered as a data modeling problem: fitting the given data $\{({\bf x}_n,\,{\bf y}_n),\;(n=1,\cdots,N)\}$ by a function with the weights of both the hidden and output layers as the parameters:

$\displaystyle \hat{\bf y}={\bf f}({\bf x},{\bf W}^h,{\bf W}^o)$ (60)

The goal is to find the optimal parameters ${\bf W}^h$ and ${\bf W}^o$ that minimize the difference ${\bf r}={\bf y}-\hat{\bf y}$ between the desired and the actual outputs. The Levenberg-Marquardt algorithm discussed previously can be used to obtain these parameters, e.g., by the Matlab function trainlm.
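
For instance, a minimal sketch using Matlab's feedforwardnet with the trainlm training function might look as follows, assuming the Deep Learning Toolbox (formerly the Neural Network Toolbox) is available and that Xtrain and Ytrain are hypothetical sample and labeling matrices with one column per sample:

    net=feedforwardnet(10,'trainlm');   % one hidden layer of 10 nodes, Levenberg-Marquardt training
    net=train(net,Xtrain,Ytrain);       % fit the weights to the training data
    Yhat=net(Xtrain);                   % network outputs for the training samples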

The three-layer BPN containing a single hidden layer discussed above can be easily generalized to a multilayer BPN containing any number of hidden layers (e.g., for a deep learning network), by simply repeating steps 6 and 7 above for each of the hidden layers. We assume that, in addition to the input layer, there are in total $L$ learning layers, including all the hidden layers and the output layer, indexed by $l=1,\cdots,L$. Then step 6 above becomes:

$\displaystyle {\bf d}^{(l)}={\bf\delta}^{(l)}\odot{\bf g}'({\bf a}^{(l)})
\;\;\;\;\;\; (l=1,\cdots,L)$ (61)

where ${\bf a}^{(l)}={\bf W}^{(l)}{\bf z}^{(l)}$ is the activation of all nodes in the $l$th layer, and ${\bf z}^{(l)}={\bf g}({\bf a}^{(l-1)})$, the output from the $(l-1)$th layer augmented by a leading 1, is the input to the $l$th layer (with ${\bf z}^{(1)}={\bf x}$), and

$\displaystyle {\bf\delta}^{(l)}=\left\{\begin{array}{ll} {\bf y}-\hat{\bf y} & (l=L)\\
({\bf W}^{(l+1)})^T{\bf d}^{(l+1)} & (1 \le l <L)\end{array}\right.$ (62)

where, as in step 6 above, the first column of ${\bf W}^{(l+1)}$ is removed before taking the transpose.

Then the same step 7 for updating the weights of each layer can be carried out based on the gradient given by the outer product:

$\displaystyle {\bf d}^{(l)} \otimes {\bf z}^{(l)} ={\bf d}^{(l)} ({\bf z}^{(l)})^T$ (63)
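
That is, the same update as in step 7 is applied to each layer, ${\bf W}^{(l)} \Leftarrow {\bf W}^{(l)}+\eta\;{\bf d}^{(l)}({\bf z}^{(l)})^T-\eta\lambda{\bf W}^{(l)}$, carried out backward from $l=L$ to $l=1$.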

The Matlab code segment below is for a BP network with multiple hidden layers, in which $L$ is the total number of learning layers and $m(l)$ is the number of nodes in the $l$th layer ($l=1,\cdots,L$):

    W={1-2*rand(m(1),d+1)};                 % initial weights for the first layer
    for l=2:L
        W{l}=1-2*rand(m(l),m(l-1)+1);       % initial weights for all other layers
    end
    er=inf;
    while er > tol
        I=randperm(N);                      % random permutation of samples
        er=0;
        for n=1:N                           % N samples for an epoch
            z={[1;X(:,I(n))]};              % pick a training sample (augmented by z0=1)
            a={W{1}*z{1}};                  % activation of the first layer
            for l=2:L                       % the forward pass
                z{l}=[1;g(a{l-1})];         % input to the lth layer, augmented
                a{l}=W{l}*z{l};             % activation of the lth layer 
            end
            yhat=g(a{L});                   % actual output of the last layer
            delta=Y(:,I(n))-yhat;           % delta error
            er=er+norm(delta)/N;            % test error
            d{L}=delta.*dg(a{L});           % d for the output layer
            W{L}=W{L}+eta*d{L}*z{L}';       % update weights for the output layer 
            for l=L-1:-1:1                  % the backward pass
                d{l}=(W{l+1}(:,2:end)'*d{l+1}).*dg(a{l});  % d for hidden layers
                W{l}=W{l}+eta*d{l}*z{l}';   % update weights for hidden layers
            end  
        end
    end
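
The segment above assumes that several quantities are defined beforehand in the surrounding code. A minimal setup sketch with hypothetical parameter values is given below, where the sigmoid and its derivative are given directly in closed form instead of by symbolic differentiation:

    g=@(x) 1./(1+exp(-x));                  % sigmoid activation function (elementwise)
    dg=@(x) g(x).*(1-g(x));                 % its derivative, g'(x)=g(x)(1-g(x))
    [d,N]=size(X);                          % input dimension and number of samples
    L=3;                                    % e.g., two hidden layers plus the output layer
    m=[10 10 size(Y,1)];                    % nodes per learning layer; m(L) equals the dimension of y
    eta=0.5; tol=0.01;                      % learning rate and termination tolerance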

Example 1: The classification results of two previously used 2-D data sets are shown below. The error rates are respectively $13\%$ and $11.5\%$, and the confusion matrices are:

$\displaystyle \left[ \begin{array}{rrr}
185 & 14 & 1 \\ 5 & 181 & 14 \\ 11 & 33 & 156 \end{array}\right]
\;\;\;\;\;\;\;
\left[ \begin{array}{rr}176 & 24 \\ 22 & 178 \end{array}\right]$ (64)

[Figures: classification results for the two data sets (BPNexample2.png, BPNexample1.png)]

Example 2: The back propagation network is applied to the classification of the dataset of handwritten digits from 0 to 9 used previously. Out of the $N=2240$ samples in the dataset, half are used for training and the other half for testing. Shown below are the confusion matrices of both the training (left) and testing (right) phases. Out of the 1120 samples used for training, 27 are misclassified ($2.4\%$), and out of the 1120 samples used for testing, 74 are misclassified ($6.6\%$).

[Two $10\times 10$ confusion matrices for the training (left) and testing (right) phases] (65)
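
As a sketch of how such results can be tabulated, the confusion matrix and error rate may be computed from the one-hot labelings and the network outputs as follows, where Ytest and Yhat are hypothetical arrays holding the true labelings and the corresponding network outputs, one column per sample:

    [~,trueClass]=max(Ytest);               % true class index of each sample
    [~,predClass]=max(Yhat);                % predicted class index of each sample
    K=size(Ytest,1);                        % number of classes
    CM=zeros(K);                            % K by K confusion matrix
    for n=1:length(trueClass)
        CM(trueClass(n),predClass(n))=CM(trueClass(n),predClass(n))+1;
    end
    errorRate=1-sum(diag(CM))/sum(CM(:));   % fraction of misclassified samples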
