The back propagation network (BPN) is a typical supervised network composed of three hierarchical layers: the input, hidden, and output layers with $d$, $m$, and $n$ nodes, respectively. Each node in the hidden and output layers is fully connected to all nodes in the previous layer. As the BPN has two levels of learning taking place at both the hidden and output layers, it is a much more powerful algorithm than the perceptron network, which has only a single learning level, in the sense that it can handle more complicated nonlinear classification problems.
For example, in supervised classification, the number of output nodes can be set to be the same as the number of classes $C$, i.e., $n = C$, and the desired output for an input $\mathbf{x}$ belonging to class $c$ is $\mathbf{y} = [y_1, \cdots, y_n]^T$ with $y_i = 1$ if $i = c$ and $y_i = 0$ otherwise, i.e., all output nodes output 0 except the $c$-th one, which outputs 1 (the one-hot method).
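As a small illustration, such a one-hot target vector can be built as in the following sketch (the function name, the 0-based class index $c$, the class count $C$, and the use of NumPy are illustrative assumptions, not part of the original text):

```python
import numpy as np

def one_hot(c, C):
    """Return the desired output vector for a sample of class c (0-based)
    among C classes: all components are 0 except a 1 in the c-th position."""
    y = np.zeros(C)
    y[c] = 1.0
    return y

# e.g. with C = 4 classes, a sample of class c = 2 gets target [0, 0, 1, 0]
print(one_hot(2, 4))
```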
Based on the training set of $P$ pattern pairs $\{(\mathbf{x}_p, \mathbf{y}_p),\; p = 1, \cdots, P\}$, where $\mathbf{x}_p = [x_1, \cdots, x_d]^T$ is the $p$-th input pattern and $\mathbf{y}_p = [y_1, \cdots, y_n]^T$ is the desired response of the network corresponding to that input, the two-level BPN is trained in two phases:
When an input pattern $\mathbf{x} = [x_1, \cdots, x_d]^T$, randomly selected from the training set, is presented to the input layer nodes, the net input to the $j$th hidden node is:
$$ net_j^h = \sum_{k=1}^{d} w_{jk}^h x_k + \theta_j^h, \qquad j = 1, \cdots, m $$
The output of the $j$th hidden node is a sigmoid function of its net input:
$$ z_j = g(net_j^h) = \frac{1}{1 + e^{-net_j^h}} $$
The net input to the $i$th node of the output layer is:
$$ net_i^o = \sum_{j=1}^{m} w_{ij}^o z_j + \theta_i^o, \qquad i = 1, \cdots, n $$
The output of the $i$th output node is a sigmoid function of its net input:
$$ \hat{y}_i = g(net_i^o) = \frac{1}{1 + e^{-net_i^o}} $$
The desired output $\mathbf{y}$ corresponding to the input $\mathbf{x}$ is compared with the actual output of the network, now denoted by $\hat{\mathbf{y}}(\mathbf{w}, \boldsymbol{\theta})$ as a function of all the weights and thresholds involved in the two levels of computation in the forward pass, to define the error function:
$$ E(\mathbf{w}, \boldsymbol{\theta}) = \frac{1}{2}\, \| \mathbf{y} - \hat{\mathbf{y}} \|^2 = \frac{1}{2} \sum_{i=1}^{n} \big( y_i - \hat{y}_i \big)^2 $$
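To make the two-level forward computation concrete, here is a minimal NumPy sketch of one forward pass and the resulting error. The weight matrices `Wh` ($m \times d$), `Wo` ($n \times m$), the threshold vectors `th_h`, `th_o` (added to the weighted sums, as above), and all variable names are illustrative assumptions rather than notation from the original text:

```python
import numpy as np

def sigmoid(net):
    """Sigmoid activation g(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, Wh, th_h, Wo, th_o):
    """One forward pass of the two-level BPN.
    x: input pattern (d,); Wh: hidden weights (m, d); th_h: hidden thresholds (m,);
    Wo: output weights (n, m); th_o: output thresholds (n,)."""
    net_h = Wh @ x + th_h      # net inputs to the hidden nodes
    z = sigmoid(net_h)         # hidden-layer outputs
    net_o = Wo @ z + th_o      # net inputs to the output nodes
    y_hat = sigmoid(net_o)     # network outputs
    return z, y_hat

def error(y, y_hat):
    """Squared-error function E = 1/2 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.sum((y - y_hat) ** 2)
```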
The goal of the training is to minimize this error function for all samples in the training set by the gradient descent method. Specifically, the weights and the thresholds are to be optimized so that the sigmoid functions $g(net_j^h)$ and $g(net_i^o)$ are properly shaped in such a way that the actual output of the BPN best fits the desired output $\mathbf{y}_p$ for the corresponding input $\mathbf{x}_p$, for all $p = 1, \cdots, P$.
The backward propagation is also carried out in two levels:
Find the gradient of the error function $E$ in the output weight space (with respect to the output-layer weights $w_{ij}^o$) by the chain rule:
$$ \frac{\partial E}{\partial w_{ij}^o} = \frac{\partial E}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial net_i^o}\, \frac{\partial net_i^o}{\partial w_{ij}^o} = -(y_i - \hat{y}_i)\, \hat{y}_i (1 - \hat{y}_i)\, z_j = -\delta_i^o\, z_j $$
where $\delta_i^o = (y_i - \hat{y}_i)\, \hat{y}_i (1 - \hat{y}_i)$, using the fact that $g'(net) = g(net)\,[1 - g(net)]$ for the sigmoid function.
Find the gradient of $E$ in the hidden weight space (with respect to the hidden-layer weights $w_{jk}^h$), again by the chain rule, noting that every output node depends on every hidden node:
$$ \frac{\partial E}{\partial w_{jk}^h} = \sum_{i=1}^{n} \frac{\partial E}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial net_i^o}\, \frac{\partial net_i^o}{\partial z_j}\, \frac{\partial z_j}{\partial net_j^h}\, \frac{\partial net_j^h}{\partial w_{jk}^h} = -\Big( \sum_{i=1}^{n} \delta_i^o\, w_{ij}^o \Big)\, z_j (1 - z_j)\, x_k = -\delta_j^h\, x_k $$
where $\delta_j^h = z_j (1 - z_j) \sum_{i=1}^{n} \delta_i^o\, w_{ij}^o$. All the weights and thresholds are then updated in the direction of the negative gradient, e.g., $w_{ij}^o \leftarrow w_{ij}^o + \eta\, \delta_i^o\, z_j$ and $w_{jk}^h \leftarrow w_{jk}^h + \eta\, \delta_j^h\, x_k$, where $\eta > 0$ is the learning rate.
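The two-level backward pass and the resulting gradient-descent update can be sketched as follows, continuing the forward-pass sketch above (it reuses `forward`, `error`, and `np` from that block); the learning rate `eta` and all function and variable names are illustrative assumptions:

```python
def backward(x, y, z, y_hat, Wo):
    """Compute the delta terms of the two-level backward pass.
    Returns delta_o (n,) for the output layer and delta_h (m,) for the hidden layer."""
    delta_o = (y - y_hat) * y_hat * (1.0 - y_hat)   # output-layer deltas
    delta_h = (Wo.T @ delta_o) * z * (1.0 - z)      # hidden-layer deltas via the chain rule
    return delta_o, delta_h

def train_step(x, y, Wh, th_h, Wo, th_o, eta=0.1):
    """One gradient-descent step on a single pattern pair (x, y),
    updating all weights and thresholds in place."""
    z, y_hat = forward(x, Wh, th_h, Wo, th_o)
    delta_o, delta_h = backward(x, y, z, y_hat, Wo)
    # Move every weight and threshold against the gradient of E
    Wo += eta * np.outer(delta_o, z)
    th_o += eta * delta_o
    Wh += eta * np.outer(delta_h, x)
    th_h += eta * delta_h
    return error(y, y_hat)
```

Repeating this step over randomly selected training pairs until the error stops decreasing is one simple way to realize the gradient descent training described above.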