Softmax Regression for Multiclass Classification

In a multiclass classification problem, an unlabeled data point ${\bf x}$ is to be classified into one of $K>2$ classes $\{C_1,\cdots,C_K\}$, based on the training set ${\cal D}=\{({\bf x}_n, y_n),\;n=1,\cdots,N\}$, where $y_n\in\{1,\cdots,K\}$ is an integer indicating ${\bf x}_n\in C_{y_n}$.

Any binary classifier, such as the logistic regression considered above, can be used to solve such a multiclass classification problem in either of the following two ways: in the one-versus-rest approach, $K$ binary classifiers are trained, each separating one class $C_k$ from the remaining $K-1$ classes, and ${\bf x}$ is assigned to the class whose classifier gives the greatest output; in the one-versus-one approach, $K(K-1)/2$ binary classifiers are trained, one for each pair of classes, and ${\bf x}$ is assigned to the class that wins the most pairwise comparisons.

Alternatively, a multiclass problem with $K>2$ can also be solved by multinomial logistic or softmax regression, which can be considered a generalized version of the logistic regression method, based on the softmax function of $K$ variables $z_1,\cdots,z_K$:

$\displaystyle s_k=s(z_k)=\frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}},
\;\;\;\;\;\;(k=1,\cdots,K)$ (232)

with the following properties:

$\displaystyle 0< s_k< 1,\;\;\;\;\;\;\;\;\; \sum_{k=1}^K s_k=1$ (233)

and
$\displaystyle \frac{\partial s(z_i)}{\partial z_j}
=\frac{\delta_{ij}e^{z_i} \sum_{l=1}^K e^{z_l}-e^{z_i}e^{z_j}}
{\left(\sum_{l=1}^K e^{z_l}\right)^2}
=\delta_{ij}\, s(z_i)-s(z_i)\, s(z_j)
=s(z_i)\left[\delta_{ij}-s(z_j)\right]$ (234)

where $\delta_{ij}$ is the Kronecker delta, which is 1 if $i=j$ and 0 otherwise.
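
As a quick numerical sanity check of these properties, the short sketch below (not part of the text's code; the scores in z are arbitrary) evaluates the softmax of Eq. (232) and compares the analytic Jacobian of Eq. (234) with finite differences:

z = [1.0; -0.5; 2.0];                   % arbitrary example scores (K=3)
s = exp(z)/sum(exp(z));                 % softmax, Eq. (232)
disp(sum(s));                           % should print 1, Eq. (233)
Js = diag(s) - s*s';                    % analytic Jacobian, Eq. (234)
h = 1e-6;                               % finite-difference step
Jnum = zeros(3,3);
for j = 1:3
    zp = z;  zp(j) = zp(j) + h;         % perturb the jth score
    sp = exp(zp)/sum(exp(zp));
    Jnum(:,j) = (sp - s)/h;             % numerical partial derivatives
end
disp(max(abs(Js(:) - Jnum(:))));        % agreement up to O(h)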

Similar to how the logistic function is used to model the conditional probability $p(\hat{y}=1\vert{\bf x},{\bf w})$ for a given data point ${\bf x}$ to belong to class $C_+$ based on model parameter ${\bf w}$ in Eq. (212), here the softmax function $s_k$ defined above is used to model the conditional probability for ${\bf x}$ to belong to class $C_k$ based on some model parameter ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

$\displaystyle p(\hat{y}=k\vert{\bf x},{\bf W})=s_k=s({\bf w}_k^T{\bf x})
=\frac{e^{{\bf w}_k^T{\bf x}}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}}},
\;\;\;\;\;\;(k=1,\cdots,K)$ (235)

where ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ is composed of $K$ weight vectors, each associated with one of the $K$ classes $\{C_1,\cdots,C_K\}$, to be determined in the training process based on the training set ${\cal D}$. As in logistic regression, both ${\bf x}$ and each ${\bf w}_k$ here are augmented $(d+1)$-dimensional vectors.
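
For instance, Eq. (235) can be evaluated in a few lines of Matlab; the sketch below assumes a hypothetical weight matrix W (whose columns are the augmented vectors ${\bf w}_1,\cdots,{\bf w}_K$) and a hypothetical sample x:

W = [ 0.1 -0.3  0.2;                % bias components of w_1, w_2, w_3
      1.0  0.5 -0.8;                % weights for feature 1
     -0.4  0.9  0.3];               % weights for feature 2  (d+1=3, K=3)
x  = [2.0; -1.0];                   % raw d-dimensional sample
xa = [1; x];                        % augmented (d+1)-dimensional sample
p  = exp(W'*xa)/sum(exp(W'*xa));    % p(yhat=k|x,W) for k=1,...,K, Eq. (235)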

We note that the inner product ${\bf w}^T{\bf x}=\vert\vert{\bf w}\vert\vert\;\vert\vert{\bf x}\vert\vert\cos\theta$ of vectors ${\bf w}$ and ${\bf x}$ is inversely related to the angle $\theta$ between the two vectors; i.e., if a data sample ${\bf x}$ has a smaller angular distance to ${\bf w}_k$ than to the weight vectors of the other classes, then it has a larger inner product ${\bf w}_k^T{\bf x}$ and thereby a greater probability $p(\hat{y}=k\vert{\bf x})$ of belonging to class $C_k$. We therefore see that the weight vectors ${\bf w}_1,\cdots,{\bf w}_K$, as the parameters of the softmax regression model, actually represent the angular directions of the corresponding classes in the feature space.

In particular, when $K=2$, the above reduces to the logistic functions:

$\displaystyle s_1 = \frac{e^{{\bf w}_1^T{\bf x}}}{e^{{\bf w}_0^T{\bf x}}+e^{{\bf w}_1^T{\bf x}}}
=\frac{1}{1+e^{-({\bf w}_1-{\bf w}_0)^T{\bf x}}}
=\sigma({\bf w}^T{\bf x})$
$\displaystyle s_0 = \frac{e^{{\bf w}_0^T{\bf x}}}{e^{{\bf w}_0^T{\bf x}}+e^{{\bf w}_1^T{\bf x}}}
=\frac{1}{1+e^{({\bf w}_1-{\bf w}_0)^T{\bf x}}}
=\sigma(-{\bf w}^T{\bf x})=1-\sigma({\bf w}^T{\bf x})$ (236)

where we have defined ${\bf w}={\bf w}_1-{\bf w}_0$, the difference between the weight vectors of the two classes $C_+$ and $C_-$, which is the normal direction of the decision plane separating them. Any point ${\bf x}$ of unknown class can therefore be classified into either of the two classes depending on whether its projection ${\bf w}^T{\bf x}$ onto ${\bf w}$ is positive or negative, i.e., on which side of the decision plane it is located.
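
A minimal numerical check of Eq. (236), with hypothetical augmented vectors w0, w1 and a hypothetical augmented sample x:

w0 = [ 0.2; -1.0; 0.5];             % augmented weight vector of class C-
w1 = [-0.1;  0.7; 1.2];             % augmented weight vector of class C+
x  = [1; 0.3; -0.8];                % augmented sample
s1 = exp(w1'*x)/(exp(w0'*x)+exp(w1'*x));    % softmax output for class C+
w  = w1 - w0;                       % normal direction of the decision plane
sigma1 = 1/(1+exp(-w'*x));          % logistic function sigma(w'x)
disp(abs(s1 - sigma1));             % should be ~0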

[Figure: softmax classification (softmaxClassification.png)]

For mathematical convenience, we also label each sample ${\bf x}_n$ in the training set by a binary vector ${\bf y}_n=[y_{n1},\cdots,y_{nK}]^T$, in addition to its integer label $y_n\in\{1,\cdots,K\}$. If $y_n=k$, indicating ${\bf x}_n\in C_k$, then the $k$th component of ${\bf y}_n$ is $y_{nk}=1$, while all other components are zero, $y_{nl}=0$ for all $l\ne k$. Note that the components of ${\bf y}_n$ add up to 1. The $N$ training samples in ${\bf X}=[{\bf x}_1,\cdots,{\bf x}_N]$ are also labeled by their corresponding binary label vectors in ${\bf Y}=[{\bf y}_1,\cdots,{\bf y}_N]$.
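
The binary (one-hot) label vectors can be generated from the integer labels as in the sketch below (the labels in y are hypothetical; the same construction appears inside the softmaxRegression function further down):

y = [2 1 3 2];                      % example integer labels y_n in {1,...,K}
K = 3;  N = length(y);
Y = zeros(K, N);                    % columns of Y are the binary vectors y_n
for n = 1:N
    Y(y(n), n) = 1;                 % kth component is 1 if y_n = k
end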

We define the probability for ${\bf x}_n\in C_k$ as the softmax function based on the linear function ${\bf w}_k^T{\bf x}_n$:

$\displaystyle s_{nk}=p(\hat{y}=k\vert{\bf x}_n)
=\frac{e^{{\bf w}_k^T{\bf x}_n}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}},
\;\;\;\;\;\;\;(k=1,\cdots,K,\;\;n=1,\cdots,N)$ (237)

and the probability for ${\bf x}_n$ to be correctly classified into class $C_{y_n}$ can be written as the following product of $K$ factors (of which $K-1$ are equal to 1):

$\displaystyle p(\hat{y}=y_n\vert{\bf x}_n,{\bf W})
=\prod_{k=1}^K p(\hat{y}=k\vert{\bf x}_n,{\bf W})^{y_{nk}}
=\prod_{k=1}^K s_{nk}^{y_{nk}}$ (238)

Our goal is to find the weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$, the parameters of the softmax model, that optimally fit the $N$ i.i.d. data points in the training set, so that the following likelihood is maximized:

$\displaystyle L({\bf W}\vert{\cal D})$ $\displaystyle \propto$ $\displaystyle p({\cal D}\vert{\bf W})
=\prod_{n=1}^N p(\hat{y}=y_n\vert{\bf x}_n,{\bf W})
=\prod_{n=1}^N \left( \prod_{k=1}^K s_{nk} ^{y_{nk}} \right)$  
  $\displaystyle =$ $\displaystyle \prod_{n=1}^N \prod_{k=1}^K \left( \frac{ e^{{\bf w}_k^T{\bf x}_n}}
{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}} \right)^{y_{nk}}$ (239)

Equivalently we can also minimize the negative log likelihood as the objective function:
$\displaystyle J({\bf W})=-l({\bf W}\vert{\cal D})
=-\log L({\bf W}\vert{\cal D})
=-\sum_{n=1}^N\sum_{k=1}^K y_{nk}\log\left(\frac{e^{{\bf w}_k^T{\bf x}_n}}
{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}}\right)$
$\displaystyle =-\sum_{n=1}^N\sum_{k=1}^K y_{nk}\left({\bf w}_k^T{\bf x}_n
-\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk}{\bf w}_k^T{\bf x}_n
-\sum_{k=1}^K y_{nk}\left(\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)\right]$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk}{\bf w}_k^T{\bf x}_n
-\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right]$ (240)

The last equality holds because the logarithmic term is independent of $k$ and $\sum_{k=1}^K y_{nk}=1$. Also, as in the case of $K=2$, to avoid overfitting we could further include in this objective function the regularization terms $\lambda\vert\vert{\bf w}_k\vert\vert^2/2,\;(k=1,\cdots,K)$ to penalize large weights for smoother thresholding:

$\displaystyle J({\bf W})=-l({\bf W}\vert{\cal D})+\frac{\lambda}{2}\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2$ (241)
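The regularized objective of Eqs. (240) and (241) can be evaluated in a few lines; the sketch below is a hypothetical helper (not part of the text's code, saved in its own .m file), assuming X is the $(d+1)\times N$ matrix of augmented samples, Y the $K\times N$ binary label matrix, and W the $(d+1)\times K$ weight matrix:

function J = softmaxObjective(W, X, Y, lambda)  % hypothetical helper
    Z = W'*X;                       % K x N scores, Z(k,n) = w_k'*x_n
    logsum = log(sum(exp(Z), 1));   % 1 x N log-sum-exp terms of Eq. (240)
    J = -sum(sum(Y.*Z)) + sum(logsum) + lambda/2*sum(sum(W.^2));
end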

To find the optimal ${\bf W}$ that minimizes this objective function $J({\bf W})$, we find its gradient vector with respect to each of its $K$ columns ${\bf w}_i,\;(i=1,\cdots,K)$:

$\displaystyle {\bf g}_i({\bf W}) = \frac{d}{d{\bf w}_i} J({\bf W})
=\frac{d}{d{\bf w}_i}\left[-l({\bf W}\vert{\cal D})+\frac{\lambda}{2}
\sum_{k=1}^K\vert\vert{\bf w}_k\vert\vert^2\right]$
$\displaystyle =-\sum_{n=1}^N \left[\sum_{k=1}^K y_{nk} \frac{d}{d{\bf w}_i}
\left({\bf w}_k^T{\bf x}_n\right)
-\frac{d}{d{\bf w}_i}\left(\log\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}\right)\right]
+\lambda{\bf w}_i$
$\displaystyle =-\sum_{n=1}^N\left[ \sum_{k=1}^K y_{nk}\, \delta_{ik}{\bf x}_n
-\frac{e^{{\bf w}_i^T{\bf x}_n}}{\sum_{l=1}^K e^{{\bf w}_l^T{\bf x}_n}}\,{\bf x}_n\right]
+\lambda{\bf w}_i
=\sum_{n=1}^N \left( s_{in}-y_{in}\right){\bf x}_n
+\lambda{\bf w}_i$
$\displaystyle =[{\bf x}_1,\cdots,{\bf x}_N]\left[\begin{array}{c}s_{i1}-y_{i1}\\ \vdots\\ s_{iN}-y_{iN}\end{array}\right]+\lambda{\bf w}_i
=[{\bf x}_1,\cdots,{\bf x}_N]\left[\begin{array}{c}r_1\\ \vdots\\ r_N
\end{array}\right]+\lambda{\bf w}_i$
$\displaystyle ={\bf Xr}_i+\lambda{\bf w}_i$ (242)

where $r_n=s_{in}-y_{in}$ is the residual, the difference between the model output $s_{in}$ (written $s_{ni}$ in Eq. (237)) and the binary label $y_{in}$, the $i$th component of the label vector ${\bf y}_n$ of ${\bf x}_n$.

As $s_{in}$ is a function of all $K$ weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$, so is $r_{in}=s_{in}-y_{in}$, and ${\bf r}_i$ can be written explicitly as ${\bf r}_i({\bf W})$. We note that Eq. (242) takes the same form as Eq. (221) for $K=2$; in the special case $K=2$ with $k\in\{0,\,1\}$, the two equations become identical. We therefore see that logistic regression is actually a special case of softmax regression.
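
In matrix form, all $K$ gradient vectors of Eq. (242) can be computed at once. The sketch below is a hypothetical helper (same X, Y, W, lambda conventions as the softmaxObjective sketch above) that returns them as the columns of a $(d+1)\times K$ matrix; it can be checked against finite differences of softmaxObjective:

function G = softmaxGradient(W, X, Y, lambda)   % hypothetical helper
    Z = W'*X;                                   % K x N scores
    S = bsxfun(@rdivide, exp(Z), sum(exp(Z),1));% K x N softmax outputs s_in
    G = X*(S - Y)' + lambda*W;                  % column i is g_i of Eq. (242)
end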

We stack all $K$ such $d+1$ dimensional gradient vectors ${\bf g}_1,\cdots,{\bf g}_K$ together, and get a $(d+1)K$ dimensional gradient vector of the objective function $J({\bf W})$ with respect to all $K$ parameter vectors ${\bf w}_1,\cdots,{\bf w}_K$:

$\displaystyle {\bf g}_J=\left[\begin{array}{c}{\bf g}_1\\
\vdots\\ {\bf g}_K\end{array}\right]$ (243)

based on which the optimal parameters in ${\bf W}$ that minimize the objective function $J({\bf W})$ can be found by the gradient descent method.
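
A minimal gradient-descent training sketch based on these equations (not the author's implementation, which uses Newton's method below) is given here; the step size eta and iteration count nIter are hypothetical, and the softmaxGradient helper above is reused:

function W = softmaxRegressionGD(X, y, lambda, eta, nIter)  % hypothetical
    [d, N] = size(X);
    K = length(unique(y));          % number of classes
    X = [ones(1,N); X];             % augmented data points
    Y = zeros(K, N);
    for n = 1:N
        Y(y(n), n) = 1;             % binary labeling
    end
    W = zeros(d+1, K);              % initial guess of the K weight vectors
    for it = 1:nIter
        G = softmaxGradient(W, X, Y, lambda);   % gradient of Eq. (242)
        W = W - eta*G;              % gradient-descent update
    end
end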

We can also find the optimal ${\bf W}$ by Newton's method, if we can further find the Hessian matrix ${\bf H}_J$ as the second derivative of $J({\bf W})$. Taking the derivative of the $i$th gradient vector ${\bf g}_i({\bf W})\;(i=1,\cdots,K)$ with respect to the $j$th weight vector ${\bf w}_j\;(j=1,\cdots,K)$, we get the second order derivative of $J({\bf W})$ with respect to both ${\bf w}_i$ and ${\bf w}_j\;(i,j=1,\cdots,K)$:

$\displaystyle {\bf H}_{ij} = \frac{d^2}{d{\bf w}_j\, d{\bf w}_i}J({\bf W})
=\frac{d}{d{\bf w}_j}{\bf g}_i
=\frac{d}{d{\bf w}_j}\left[{\bf Xr}_i+\lambda{\bf w}_i\right]
={\bf X}\frac{d{\bf r}_i}{d{\bf w}_j}+\delta_{ij}\lambda{\bf I}
={\bf X}{\bf J}_{ij}+\delta_{ij}\lambda{\bf I}$ (244)

where ${\bf J}_{ij}=d{\bf r}_i/d{\bf w}_j$ is the $N\times(d+1)$ Jacobian matrix of ${\bf r}_i$ with respect to ${\bf w}_j$, of which the nth row is the transpose of the following vector:
$\displaystyle \frac{d}{d{\bf w}_j}r_{in} = \frac{d}{d{\bf w}_j}(s_{in}-y_{in})
=\frac{d}{d{\bf w}_j}\, s_{in} =\frac{d}{d{\bf w}_j}\, s({\bf w}_i^T{\bf x}_n)
= s_{in}\left(\delta_{ij}-s_{jn}\right)
\frac{d}{d{\bf w}_j}\left({\bf w}_j^T{\bf x}_n\right)=z_n{\bf x}_n
\;\;\;\;\;\;\;\;\;(n=1,\cdots,N)$ (245)

where we have used Eq. (234) and defined $z_n=s_{in}(\delta_{ij}-s_{jn})$. Now the Jacobian can be written as

$\displaystyle {\bf J}_{ij}
=\left[\begin{array}{c}
(dr_{i1}/d{\bf w}_j)^T\\ \vdots\\ (dr_{iN}/d{\bf w}_j)^T\end{array}\right]
=\left[\begin{array}{c}z_1{\bf x}_1^T\\ \vdots\\ z_N{\bf x}_N^T\end{array}\right]
={\bf Z}\left[\begin{array}{c}{\bf x}_1^T\\ \vdots\\ {\bf x}_N^T\end{array}\right]
={\bf Z}{\bf X}^T$ (246)

where ${\bf Z}=\mbox{diag}(z_1,\cdots,z_N)$ is an $N\times N$ diagonal matrix. Substituting ${\bf J}_{ij}={\bf Z}{\bf X}^T$ back into the expression for ${\bf H}_{ij}$ in Eq. (244), we get:

$\displaystyle {\bf H}_{ij}={\bf X}{\bf J}_{ij}+\delta_{ij}\lambda{\bf I}
={\bf XZX}^T+\delta_{ij}\lambda{\bf I}
=\sum_{n=1}^N s_{in}(\delta_{ij}-s_{jn})\,{\bf x}_n {\bf x}_n^T
+\delta_{ij}\lambda{\bf I}$ (247)

Here ${\bf H}_{ij}$ is a matrix of dimension $(d+1)\times(d+1)$, corresponding to ${\bf w}_i$ and ${\bf w}_j\;(i,j=1,\cdots,K)$. We further stack all $K\times K$ such matrices ${\bf H}_{ij}$ together to get the $K(d+1)\times K(d+1)$ dimensional full Hessian matrix ${\bf H}_J$ of $J({\bf W})$ with respect to all $K$ vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

$\displaystyle {\bf H}_J=\left[\begin{array}{ccc}
{\bf H}_{11} & \cdots &{\bf H}_{1K}\\
\vdots & \ddots & \vdots\\
{\bf H}_{K1} & \cdots &{\bf H}_{KK}
\end{array}\right]$ (248)

Now we can find the optimal parameter ${\bf W}$ iteratively by the Newton-Raphson method based on both ${\bf H}_J$ and ${\bf g}_J$:

$\displaystyle \left[\begin{array}{c}{\bf w}_1\\ \vdots\\ {\bf w}_K\end{array}\right]_{n+1}
=\left[\begin{array}{c}{\bf w}_1\\ \vdots\\ {\bf w}_K\end{array}\right]_n
-\left[{\bf H}_J\right]^{-1}_n
\left[\begin{array}{c}{\bf g}_1\\ \vdots\\ {\bf g}_K\end{array}\right]_n$ (249)

Once ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ is available, any unlabeled sample ${\bf x}$ of unknown class can be classified into one of the $K$ classes with the maximum conditional probability given in Eq. (235):

$\displaystyle \mbox{if}\;\;\;
p(\hat{y}=k\vert{\bf x}) = \max_l \; p(\hat{y}=l\vert{\bf x}),\;\;\;\;\;
\mbox{then}\;\;\;\;\;{\bf x}\in C_k$ (250)
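
Since the denominator of Eq. (235) is shared by all $K$ classes, the class of maximum probability is also the class of maximum score ${\bf w}_k^T{\bf x}$, so the rule above reduces to a single argmax over the scores. A one-line sketch, assuming X already contains the augmented samples as its columns:

[~, yhat] = max(W'*X, [], 1);       % yhat(n) = arg max_k  w_k'*x_n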

Whether we should use softmax regression or $K$ separate logistic regressions for a problem of $K$ classes $C_1,\cdots,C_K$ depends on the nature of the classes. Softmax regression is suitable if the $K$ classes are mutually exclusive, as assumed by the method. Otherwise, $K$ binary logistic regression classifiers, one per class, are more suitable.

Below is a Matlab function for estimating the $K$ weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$ based on the training set ${\cal D}=\{{\bf X},{\bf y}\}$ and the hyperparameter $\lambda$:

function W=softmaxRegression(X,y,lambda)
    [d N]=size(X);
    K=length(unique(y));        % number of classes
    X=[ones(1,N); X];           % augmented data points
    d=d+1;
    Y=zeros(K,N);
    for n=1:N
        Y(y(n),n)=1;            % generate binary labeling Y
    end
    W=zeros(d*K,1);            	% initial guess of K weight vectors
    I=eye(K);
    s=zeros(K,N);
    phi=zeros(K,N);             % softmax functions
    gi=zeros(d,1);              % ith gradient
    Hij=zeros(d,d);             % ij-th block of the Hessian
    er=9;
    tol=10^(-6);
    it=0;
    while er>tol
        it=it+1;
        W2=reshape(W,d,K);      % weight vectors in d x K 2-D array
        g=[];                   % total gradient vector
        for i=1:K
            for n=1:N
                xn=X(:,n);      % get the nth sample
                t=0;
                for k=1:K        
                    wk=W2(:,k);  % get the kth weight vector
                    s(k,n)=exp(wk'*xn);     
                    t=t+s(k,n);
                end
                phi(i,n)=s(i,n)/t;      % softmax function
            end
            gi=X*(phi(i,:)-Y(i,:))'+lambda*W2(:,i);     % ith gradient
            g=[g; gi];          % stack all gradients into a long gradient vector
        end
        H=[];                   % total Hessian
        z=zeros(N,1);
        for i=1:K               % for the ith block row
            Hi=[];
            for j=1:K           % for the jth block in ith row
                for n=1:N
                   z(n)=phi(i,n)*((i==j)-phi(j,n));
                end
                Hij=X*diag(z)*X';
                Hi=[Hi Hij];    % append jth block
            end
            H=[H; Hi];          % append ith block row
        end
        H=H+lambda*eye(d*K);    % include regularization term
        W=W-inv(H)*g;           % update W by Newton's method
        er=norm(g);
    end
    W=reshape(W,d,K);           % reshape weight vector into d x K array
end

Here is the function for the classification of the unlabeled data samples in matrix ${\bf X}$ based on the weight vectors in ${\bf W}=[{\bf w}_1,\cdots,{\bf w}_K]$:

function yhat=softmaxClassify(W,X)
    [d N]=size(X);              % dataset to be classified
    [d K]=size(W);              % model parameters; d is now the augmented dimension d+1
    X=[ones(1,N); X];           % augmented data points
    yhat=zeros(N,1);
    for n=1:N                   % for each of the N samples
        xn=X(:,n);              % nth sample point
        t=0;
        for k=1:K        
            wk=W(:,k);          % kth weight vector
            s(k)=exp(wk'*xn);     
            t=t+s(k);
        end
        pmax=0;
        for k=1:K
            p=s(k)/t;           % probability based on softmax function
            if p>pmax
                kmax=k; pmax=p;
            end
        end
        yhat(n)=kmax;           % predicted class labeling
    end    
end
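
A hypothetical end-to-end usage sketch of the two functions above, with synthetic two-dimensional Gaussian data for three classes (the means, sample count, and lambda below are arbitrary choices):

N = 300;  K = 3;
mu = [0  4 -4;                      % class means (columns), chosen arbitrarily
      0  3  3];
X = zeros(2, N);  y = zeros(N, 1);
for n = 1:N
    k = mod(n-1, K) + 1;            % assign classes in round-robin order
    X(:,n) = mu(:,k) + randn(2,1);  % unit-variance Gaussian sample
    y(n) = k;
end
W = softmaxRegression(X, y, 0.1);   % train with lambda = 0.1
yhat = softmaxClassify(W, X);       % classify the training samples
accuracy = mean(yhat == y)          % fraction correctly classified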