Softmax Regression for Multiclass Classification

In a multiclass classification problem, an unlabeled data point ${\bf x}$ is to be classified into one of $K>2$ classes $\{C_1,\cdots,C_K\}$, based on the training set ${\cal D}=\{({\bf x}_n, y_n),\;n=1,\cdots,N\}$, where $y_n\in\{1,\cdots,K\}$ is an integer indicating ${\bf x}_n\in C_{y_n}$.

Any binary classifier, such as the logistic regression considered above, can be used to solve such a multiclass classification problem in either of the following two ways: in the one-versus-rest approach, $K$ binary classifiers are trained, each separating one class $C_k$ from the remaining $K-1$ classes, and ${\bf x}$ is assigned to the class whose classifier gives the greatest output; in the one-versus-one approach, $K(K-1)/2$ binary classifiers are trained, one for each pair of classes, and ${\bf x}$ is assigned to the class that wins the most pairwise comparisons.

Alternatively, a multiclass problem with $K>2$ can also be solved by multinomial logistic or softmax regression, which can be considered a generalized version of the logistic regression method, based on the softmax function of $K$ variables $z_1,\cdots,z_K$:

$\displaystyle s_k=s(z_k)=\frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}},
\;\;\;\;\;\;(k=1,\cdots,K)$ (232)

with the following properties:

$\displaystyle 0< s_k< 1,\;\;\;\;\;\;\;\;\; \sum_{k=1}^K s_k=1$ (233)

and
$\displaystyle \frac{\partial s(z_i)}{\partial z_j}
=\frac{\delta_{ij}e^{z_i} \sum_{l=1}^K e^{z_l}-e^{z_i}e^{z_j}}
{\left(\sum_{l=1}^K e^{z_l}\right)^2}
=\delta_{ij}\, s(z_i)-s(z_i)\, s(z_j)
=s(z_i)\left[\delta_{ij}-s(z_j)\right]$ (234)

where $\delta_{ij}$ is the Kronecker delta, which is 1 if $i=j$ and 0 otherwise.
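
As a quick numerical sanity check of these properties, the short sketch below (not part of the text's code; the scores in z are arbitrary) evaluates the softmax of Eq. (232) and compares the analytic Jacobian of Eq. (234) with finite differences:

z = [1.0; -0.5; 2.0];                   % arbitrary example scores (K=3)
s = exp(z)/sum(exp(z));                 % softmax, Eq. (232)
disp(sum(s));                           % should print 1, Eq. (233)
Js = diag(s) - s*s';                    % analytic Jacobian, Eq. (234)
h = 1e-6;                               % finite-difference step
Jnum = zeros(3,3);
for j = 1:3
    zp = z;  zp(j) = zp(j) + h;         % perturb the jth score
    sp = exp(zp)/sum(exp(zp));
    Jnum(:,j) = (sp - s)/h;             % numerical partial derivatives
end
disp(max(abs(Js(:) - Jnum(:))));        % agreement up to O(h)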

Similar to how the logistic function is used to model the conditional probability $p(\hat{y}=1\vert{\bf x},{\bf w})$ for a given data point ${\bf x}$ to belong to class $C_+$ based on model parameter ${\bf w}$ in Eq. (212), here the softmax function $s_k$ defined above is used to model the conditional probability for ${\bf x}$ to belong to class $C_k$ based on some model parameter ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

$\displaystyle p(\hat{y}=k\vert{\bf x},{\bf W})=s_k=s({\bf w}_k^T{\bf x})
=\frac{e^{{\bf w}_k^T{\bf x}}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}}},
\;\;\;\;\;\;(k=1,\cdots,K)$ (235)

where ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ is composed of $K$ weight vectors, each associated with one of the $K$ classes $\{C_1,\cdots,C_K\}$, to be determined in the training process based on the training set ${\cal D}$. As in logistic regression, both ${\bf x}$ and each ${\bf w}_k$ here are augmented $(d+1)$-dimensional vectors.
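
For instance, Eq. (235) can be evaluated in a few lines of Matlab; the sketch below assumes a hypothetical weight matrix W (whose columns are the augmented vectors ${\bf w}_1,\cdots,{\bf w}_K$) and a hypothetical sample x:

W = [ 0.1 -0.3  0.2;                % bias components of w_1, w_2, w_3
      1.0  0.5 -0.8;                % weights for feature 1
     -0.4  0.9  0.3];               % weights for feature 2  (d+1=3, K=3)
x  = [2.0; -1.0];                   % raw d-dimensional sample
xa = [1; x];                        % augmented (d+1)-dimensional sample
p  = exp(W'*xa)/sum(exp(W'*xa));    % p(yhat=k|x,W) for k=1,...,K, Eq. (235)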

We note that the inner product ${\bf w}^T{\bf x}=\vert\vert{\bf w}\vert\vert\;\vert\vert{\bf x}\vert\vert\cos\theta$ of vectors ${\bf w}$ and ${\bf x}$ is inversely related to the angle $\theta$ between the two vectors; i.e., if a data sample ${\bf x}$ has a smaller angular distance to ${\bf w}_k$ than to the weight vectors of the other classes, then it has a larger inner product ${\bf w}_k^T{\bf x}$ and thereby a greater probability $p(\hat{y}=k\vert{\bf x})$ of belonging to class $C_k$. We therefore see that the weight vectors ${\bf w}_1,\cdots,{\bf w}_K$, as the parameters of the softmax regression model, actually represent the angular directions of the corresponding classes in the feature space.

In particular, when $K=2$, the above reduces to the logistic functions:

$\displaystyle s_1 = \frac{e^{{\bf w}_1^T{\bf x}}}{e^{{\bf w}_0^T{\bf x}}+e^{{\bf w}_1^T{\bf x}}}
=\frac{1}{1+e^{-({\bf w}_1-{\bf w}_0)^T{\bf x}}}
=\sigma({\bf w}^T{\bf x})$
$\displaystyle s_0 = \frac{e^{{\bf w}_0^T{\bf x}}}{e^{{\bf w}_0^T{\bf x}}+e^{{\bf w}_1^T{\bf x}}}
=\frac{1}{1+e^{({\bf w}_1-{\bf w}_0)^T{\bf x}}}
=\sigma(-{\bf w}^T{\bf x})=1-\sigma({\bf w}^T{\bf x})$ (236)

where we have defined ${\bf w}={\bf w}_1-{\bf w}_0$, the difference between the weight vectors of the two classes $C_+$ and $C_-$, which is the normal direction of the decision plane separating them. Any point ${\bf x}$ of unknown class can therefore be classified into either of the two classes depending on whether its projection ${\bf w}^T{\bf x}$ onto ${\bf w}$ is positive or negative, i.e., on which side of the decision plane it is located.
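
A minimal numerical check of Eq. (236), with hypothetical augmented vectors w0, w1 and a hypothetical augmented sample x:

w0 = [ 0.2; -1.0; 0.5];             % augmented weight vector of class C-
w1 = [-0.1;  0.7; 1.2];             % augmented weight vector of class C+
x  = [1; 0.3; -0.8];                % augmented sample
s1 = exp(w1'*x)/(exp(w0'*x)+exp(w1'*x));    % softmax output for class C+
w  = w1 - w0;                       % normal direction of the decision plane
sigma1 = 1/(1+exp(-w'*x));          % logistic function sigma(w'x)
disp(abs(s1 - sigma1));             % should be ~0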

[Figure: softmax classification (softmaxClassification.png)]

For mathematical convenience, we also label each sample ${\bf x}_n$ in the training set by a binary vector ${\bf y}_n=[y_{n1},\cdots,y_{nK}]^T$, in addition to its integer label $y_n\in\{1,\cdots,K\}$. If $y_n=k$, indicating ${\bf x}_n\in C_k$, then the $k$th component of ${\bf y}_n$ is $y_{nk}=1$, while all other components are zero, $y_{nl}=0$ for all $l\ne k$. Note that the components of ${\bf y}_n$ add up to 1. The $N$ training samples in ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ are also labeled by their corresponding binary label vectors in ${\bf Y}=[{\bf y}_1,\cdots,{\bf y}_N]$.
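
The binary (one-hot) label vectors can be generated from the integer labels as in the sketch below (the labels in y are hypothetical; the same construction appears inside the softmaxRegression function further down):

y = [2 1 3 2];                      % example integer labels y_n in {1,...,K}
K = 3;  N = length(y);
Y = zeros(K, N);                    % columns of Y are the binary vectors y_n
for n = 1:N
    Y(y(n), n) = 1;                 % kth component is 1 if y_n = k
end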

We define the probability for ${\bf x}_n\in C_k$ as the softmax function based on the linear function ${\bf w}_k^T{\bf x}_n$:

$\displaystyle s_{nk}=p(\hat{y}=k\vert{\bf x}_n)
=\frac{e^{{\bf w}_k^T{\bf x}_n}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}},
\;\;\;\;\;\;\;(k=1,\cdots,K,\;\;n=1,\cdots,N)$ (237)

and the probability for ${\bf x}_n$ to be correctly classified into class $C_{y_n}$ can be written as the following product of $K$ factors (of which $K-1$ are equal to 1):

$\displaystyle p(\hat{y}=y_n\vert{\bf x}_n,{\bf W})
=\prod_{k=1}^K p(\hat{y}=k\vert{\bf x}_n,{\bf W})^{y_{nk}}
=\prod_{k=1}^K s_{nk}^{y_{nk}}$ (238)

Our goal is to find the weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$, the parameters of the softmax model, that optimally fit the $N$ i.i.d. data points in the training set, so that the following likelihood is maximized:

$\displaystyle L({\bf W}\vert{\cal D})$ $\displaystyle \propto$ $\displaystyle p({\cal D}\vert{\bf W})
=\prod_{n=1}^N p(\hat{y}=y_n\vert{\bf x}_n,{\bf W})
=\prod_{n=1}^N \left( \prod_{k=1}^K s_{nk} ^{y_{nk}} \right)$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N \prod_{k=1}^K \left( \frac{ e^{{\bf w}_k^T{\bf x}_n}}
{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}} \right)^{y_{nk}}$ (239)

Equivalently we can also minimize the negative log likelihood as the objective function:
$\displaystyle J({\bf W})=-l({\bf W}\vert{\cal D})
=-\log L({\bf W}\vert{\cal D})
=-\sum_{n=1}^N\sum_{k=1}^K y_{nk}\log\left(\frac{e^{{\bf w}_k^T{\bf x}_n}}
{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}}\right)$
$\displaystyle =-\sum_{n=1}^N\sum_{k=1}^K y_{nk}\left({\bf w}_k^T{\bf x}_n
-\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk}{\bf w}_k^T{\bf x}_n
-\sum_{k=1}^K y_{nk}\left(\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)\right]$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk}{\bf w}_k^T{\bf x}_n
-\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right]$ (240)

The last equality holds because the logarithmic term is independent of $k$ and $\sum_{k=1}^K y_{nk}=1$. Also, as in the case of $K=2$, to avoid overfitting we could further include in this objective function the regularization terms $\lambda\vert\vert{\bf w}_k\vert\vert^2/2,\;(k=1,\cdots,K)$ to penalize large weights for smoother thresholding:

$\displaystyle J({\bf W})=-l({\bf W}\vert{\cal D})+\frac{\lambda}{2}\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2$ (241)
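The regularized objective of Eqs. (240) and (241) can be evaluated in a few lines; the sketch below is a hypothetical helper (not part of the text's code, saved in its own .m file), assuming X is the $(d+1)\times N$ matrix of augmented samples, Y the $K\times N$ binary label matrix, and W the $(d+1)\times K$ weight matrix:

function J = softmaxObjective(W, X, Y, lambda)  % hypothetical helper
    Z = W'*X;                       % K x N scores, Z(k,n) = w_k'*x_n
    logsum = log(sum(exp(Z), 1));   % 1 x N log-sum-exp terms of Eq. (240)
    J = -sum(sum(Y.*Z)) + sum(logsum) + lambda/2*sum(sum(W.^2));
end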

To find the optimal ${\bf W}$ that minimizes this objective function $J({\bf W})$, we find its gradient vector with respect to each of its $K$ columns ${\bf w}_i,\;(i=1,\cdots,K)$:

$\displaystyle {\bf g}_i({\bf W}) = \frac{d}{d{\bf w}_i} J({\bf W})
=\frac{d}{d{\bf w}_i}\left[-l({\bf W}\vert{\cal D})+\frac{\lambda}{2}
\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2\right]$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk} \frac{d}{d{\bf w}_i}
\left({\bf w}_k^T{\bf x}_n\right)
-\frac{d}{d{\bf w}_i}\left(\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)\right]
+\lambda{\bf w}_i$
$\displaystyle =-\sum_{n=1}^N\left[ \sum_{k=1}^K y_{nk}\, \delta_{ik}{\bf x}_n
-\frac{e^{{\bf w}_i^T{\bf x}_n}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}}\,{\bf x}_n\right]
+\lambda{\bf w}_i
=\sum_{n=1}^N \left( s_{in}-y_{in}\right){\bf x}_n
+\lambda{\bf w}_i$
$\displaystyle =[{\bf x}_1,\cdots,{\bf x}_N]\left[\begin{array}{c}s_{i1}-y_{i1}\\ \vdots\\ s_{iN}-y_{iN}\end{array}\right]+\lambda{\bf w}_i
=[{\bf x}_1,\cdots,{\bf x}_N]\left[\begin{array}{c}r_1\\ \vdots\\ r_N
\end{array}\right]+\lambda{\bf w}_i$
$\displaystyle ={\bf Xr}_i+\lambda{\bf w}_i$ (242)

where $r_n=s_{in}-y_{in}$ is the residual, the difference between the model output $s_{in}$ (written $s_{ni}$ in Eq. (237)) and the binary label $y_{in}$, the $i$th component of the label vector ${\bf y}_n$ of ${\bf x}_n$.

As $s_{in}$ is a function of all $K$ weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$, so is $r_{in}=s_{in}-y_{in}$, and ${\bf r}_i$ can be written explicitly as ${\bf r}_i({\bf W})$. We note that Eq. (242) takes the same form as Eq. (221) for $K=2$; in the special case $K=2$ with $k\in\{0,\,1\}$, the two equations become identical. We therefore see that logistic regression is actually a special case of softmax regression.
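
In matrix form, all $K$ gradient vectors of Eq. (242) can be computed at once. The sketch below is a hypothetical helper (same X, Y, W, lambda conventions as the softmaxObjective sketch above) that returns them as the columns of a $(d+1)\times K$ matrix; it can be checked against finite differences of softmaxObjective:

function G = softmaxGradient(W, X, Y, lambda)   % hypothetical helper
    Z = W'*X;                                   % K x N scores
    S = bsxfun(@rdivide, exp(Z), sum(exp(Z),1));% K x N softmax outputs s_in
    G = X*(S - Y)' + lambda*W;                  % column i is g_i of Eq. (242)
end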

We stack all $K$ such $d+1$ dimensional gradient vectors ${\bf g}_1,\cdots,{\bf g}_K$ together, and get a $(d+1)K$ dimensional gradient vector of the objective function $J({\bf W})$ with respect to all $K$ parameter vectors ${\bf w}_1,\cdots,{\bf w}_K$:

$\displaystyle {\bf g}_J=\left[\begin{array}{c}{\bf g}_1\\
\vdots\\ {\bf g}_K\end{array}\right]$ (243)

based on which the optimal parameters in ${\bf W}$ that minimize the objective function $J({\bf W})$ can be found by the gradient descent method.
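
A minimal gradient-descent training sketch based on these equations (not the author's implementation, which uses Newton's method below) is given here; the step size eta and iteration count nIter are hypothetical, and the softmaxGradient helper above is reused:

function W = softmaxRegressionGD(X, y, lambda, eta, nIter)  % hypothetical
    [d, N] = size(X);
    K = length(unique(y));          % number of classes
    X = [ones(1,N); X];             % augmented data points
    Y = zeros(K, N);
    for n = 1:N
        Y(y(n), n) = 1;             % binary labeling
    end
    W = zeros(d+1, K);              % initial guess of the K weight vectors
    for it = 1:nIter
        G = softmaxGradient(W, X, Y, lambda);   % gradient of Eq. (242)
        W = W - eta*G;              % gradient-descent update
    end
end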

We can also find the optimal ${\bf W}$ by Newton's method, if we can further find the Hessian matrix ${\bf H}_J$ as the second derivative of $J({\bf W})$. Taking the derivative of the $i$th gradient vector ${\bf g}_i({\bf W})\;(i=1,\cdots,K)$ with respect to the $j$th weight vector ${\bf w}_j\;(j=1,\cdots,K)$, we get the second order derivative of $J({\bf W})$ with respect to both ${\bf w}_i$ and ${\bf w}_j\;(i,j=1,\cdots,K)$:

$\displaystyle {\bf H}_{ij} = \frac{d^2}{d{\bf w}_j\, d{\bf w}_i}J({\bf W})
=\frac{d}{d{\bf w}_j}{\bf g}_i
=\frac{d}{d{\bf w}_j}\left[{\bf Xr}_i+\lambda{\bf w}_i\right]
={\bf X}\frac{d{\bf r}_i}{d{\bf w}_j}+\delta_{ij}\lambda{\bf I}
={\bf X}{\bf J}_{ij}+\delta_{ij}\lambda{\bf I}$ (244)

where ${\bf J}_{ij}=d{\bf r}_i/d{\bf w}_j$ is the $N\times(d+1)$ Jacobian matrix of ${\bf r}_i$ with respect to ${\bf w}_j$, of which the nth row is the transpose of the following vector:
$\displaystyle \frac{d}{d{\bf w}_j}r_{in} = \frac{d}{d{\bf w}_j}(s_{in}-y_{in})
=\frac{d}{d{\bf w}_j}\, s_{in} =\frac{d}{d{\bf w}_j}\, s({\bf w}_i^T{\bf x}_n)
= s_{in}\left(\delta_{ij}-s_{jn}\right)
\frac{d}{d{\bf w}_j}\left({\bf w}_j^T{\bf x}_n\right)=z_n{\bf x}_n
\;\;\;\;\;\;\;\;\;(n=1,\cdots,N)$ (245)

where we have used Eq. (234) and defined $z_n=s_{in}(\delta_{ij}-s_{jn})$. Now the Jacobian can be written as

$\displaystyle {\bf J}_{ij}
=\left[\begin{array}{c}
(dr_{i1}/d{\bf w}_j)^T\\ \vdots\\ (dr_{iN}/d{\bf w}_j)^T\end{array}\right]
=\left[\begin{array}{c}z_1{\bf x}_1^T\\ \vdots\\ z_N{\bf x}_N^T\end{array}\right]
={\bf Z}\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]
={\bf Z}{\bf X}^T$ (246)

where ${\bf Z}=\mbox{diag}(z_1,\cdots,z_N)$ is an $N\times N$ diagonal matrix. Substituting ${\bf J}_{ij}={\bf Z}{\bf X}^T$ back into the expression for ${\bf H}_{ij}$ in Eq. (244), we get:

$\displaystyle {\bf H}_{ij}={\bf X}{\bf J}_{ij}+\delta_{ij}\lambda{\bf I}
={\bf XZX}^T+\delta_{ij}\lambda{\bf I}
=\sum_{n=1}^N s_{in}(\delta_{ij}-s_{jn})\,{\bf x}_n {\bf x}_n^T
+\delta_{ij}\lambda{\bf I}$ (247)

Here ${\bf H}_{ij}$ is a matrix of dimension $(d+1)\times(d+1)$, corresponding to ${\bf w}_i$ and ${\bf w}_j\;(i,j=1,\cdots,K)$. We further stack all $K\times K$ such matrices ${\bf H}_{ij}$ together to get the $K(d+1)\times K(d+1)$ dimensional full Hessian matrix ${\bf H}_J$ of $J({\bf W})$ with respect to all $K$ vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

$\displaystyle {\bf H}_J=\left[\begin{array}{ccc}
{\bf H}_{11} & \cdots &{\bf H}_{1K}\\
\vdots & \ddots & \vdots\\
{\bf H}_{K1} & \cdots &{\bf H}_{KK}
\end{array}\right]$ (248)

Now we can find the optimal parameter ${\bf W}$ iteratively by the Newton-Raphson method based on both ${\bf H}_J$ and ${\bf g}_J$:

$\displaystyle \left[\begin{array}{c}{\bf w}_1\\ \vdots\\ {\bf w}_K\end{array}\right]_{n+1}
=\left[\begin{array}{c}{\bf w}_1\\ \vdots\\ {\bf w}_K\end{array}\right]_n
-\left[{\bf H}_J\right]^{-1}_n
\left[\begin{array}{c}{\bf g}_1\\ \vdots\\ {\bf g}_K\end{array}\right]_n$ (249)

Once ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ is available, any unlabeled sample ${\bf x}$ of unknown class can be classified into one of the $K$ classes with the maximum conditional probability given in Eq. (235):

$\displaystyle \mbox{if}\;\;\;
p(\hat{y}=k\vert{\bf x}) = \max_l \; p(\hat{y}=l\vert{\bf x}),\;\;\;\;\;
\mbox{then}\;\;\;\;\;{\bf x}\in C_k$ (250)
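
Since the denominator of Eq. (235) is shared by all $K$ classes, the class of maximum probability is also the class of maximum score ${\bf w}_k^T{\bf x}$, so the rule above reduces to a single argmax over the scores. A one-line sketch, assuming X already contains the augmented samples as its columns:

[~, yhat] = max(W'*X, [], 1);       % yhat(n) = arg max_k  w_k'*x_n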

Whether we should use softmax regression or $K$ separate logistic regressions for a problem of $K$ classes $C_1,\cdots,C_K$ depends on the nature of the classes. Softmax regression is suitable if the $K$ classes are mutually exclusive, as assumed by the method. Otherwise, $K$ binary logistic regression classifiers, one per class, are more suitable.

Below is a Matlab function for estimating the $K$ weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ based on the training set ${\cal D}=\{{\bf X},{\bf y}\}$ and the hyperparameter $\lambda$:

function W=softmaxRegression(X,y,lambda)
    [d N]=size(X);
    K=length(unique(y));        % number of classes
    X=[ones(1,N); X];           % augmented data points
    d=d+1;
    Y=zeros(K,N);
    for n=1:N
        Y(y(n),n)=1;            % generate binary labeling Y
    end
    W=zeros(d*K,1);            	% initial guess of K weight vectors
    I=eye(K);
    s=zeros(K,N);
    phi=zeros(K,N);             % softmax functions
    gi=zeros(d,1);              % ith gradient
    Hij=zeros(d,d);             % ij-th block of the Hessian
    er=9;
    tol=10^(-6);
    it=0;
    while er>tol
        it=it+1;
        W2=reshape(W,d,K);      % weight vectors in d x K 2-D array
        g=[];                   % total gradient vector
        for i=1:K
            for n=1:N
                xn=X(:,n);      % get the nth sample
                t=0;
                for k=1:K        
                    wk=W2(:,k);  % get the kth weight vector
                    s(k,n)=exp(wk'*xn);     
                    t=t+s(k,n);
                end
                phi(i,n)=s(i,n)/t;      % softmax function
            end
            gi=X*(phi(i,:)-Y(i,:))'+lambda*W2(:,i);     % ith gradient
            g=[g; gi];          % stack all gradients into a long gradient vector
        end
        H=[];                   % total Hessian
        z=zeros(N,1);
        for i=1:K               % for the ith block row
            Hi=[];
            for j=1:K           % for the jth block in ith row
                for n=1:N
                   z(n)=phi(i,n)*((i==j)-phi(j,n));
                end
                Hij=X*diag(z)*X';
                Hi=[Hi Hij];    % append jth block
            end
            H=[H; Hi];          % append ith block row
        end
        H=H+lambda*eye(d*K);    % include regularization term
        W=W-inv(H)*g;           % update W by Newton's method
        er=norm(g);
    end
    W=reshape(W,d,K);           % reshape weight vector into d x K array
end

Here is the function for the classification of the unlabeled data samples in matrix ${\bf X}$ based on the weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

function yhat=softmaxClassify(W,X)
    [d N]=size(X);              % dataset to be classified
    [d K]=size(W);              % model parameters; d is now the augmented dimension d+1
    X=[ones(1,N); X];           % augmented data points
    yhat=zeros(N,1);
    for n=1:N                   % for each of the N samples
        xn=X(:,n);              % nth sample point
        t=0;
        for k=1:K        
            wk=W(:,k);          % kth weight vector
            s(k)=exp(wk'*xn);     
            t=t+s(k);
        end
        pmax=0;
        for k=1:K
            p=s(k)/t;           % probability based on softmax function
            if p>pmax
                kmax=k; pmax=p;
            end
        end
        yhat(n)=kmax;           % predicted class labeling
    end    
end
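
A hypothetical end-to-end usage sketch of the two functions above, with synthetic two-dimensional Gaussian data for three classes (the means, sample count, and lambda below are arbitrary choices):

N = 300;  K = 3;
mu = [0  4 -4;                      % class means (columns), chosen arbitrarily
      0  3  3];
X = zeros(2, N);  y = zeros(N, 1);
for n = 1:N
    k = mod(n-1, K) + 1;            % assign classes in round-robin order
    X(:,n) = mu(:,k) + randn(2,1);  % unit-variance Gaussian sample
    y(n) = k;
end
W = softmaxRegression(X, y, 0.1);   % train with lambda = 0.1
yhat = softmaxClassify(W, X);       % classify the training samples
accuracy = mean(yhat == y)          % fraction correctly classified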