In a multiclass classification problem, an unlabeled data point $\mathbf{x}$ is to be classified into one of $K$ classes $\{C_1,\dots,C_K\}$, based on the training set $\mathcal{D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\dots,N\}$, where $y_n\in\{1,\dots,K\}$ is an integer indicating the class to which $\mathbf{x}_n$ belongs.
Any binary classifier, such as the logistic regression considered above, can be used to solve such a multiclass classification problem in either of the following two ways:

1. One-vs-rest: $K$ binary classifiers are trained, the $k$th of which separates class $C_k$ from the remaining $K-1$ classes and produces a score $p_k(\mathbf{x})$ for a point $\mathbf{x}$ to belong to $C_k$. The point is then assigned to the class with the highest score:

$$\text{if}\;\;k=\arg\max_{1\le j\le K}\,p_j(\mathbf{x}),\;\;\text{then}\;\;\mathbf{x}\in C_k \tag{230}$$

2. One-vs-one: one binary classifier is trained for each of the $K(K-1)/2$ pairs of classes, and the point $\mathbf{x}$ is assigned to the class receiving the most votes from these pairwise classifiers:

$$\text{if}\;\;k=\arg\max_{1\le j\le K}\,v_j(\mathbf{x}),\;\;\text{then}\;\;\mathbf{x}\in C_k \tag{231}$$

where $v_j(\mathbf{x})$ denotes the number of pairwise classifiers that vote for class $C_j$.
Alternatively, a multiclass problem with $K>2$ classes can also be solved by multinomial logistic or softmax regression, which can be considered as a generalized version of the logistic regression method, based on the softmax function of $K$ variables $z_1,\dots,z_K$:

$$\mathrm{softmax}_k(z_1,\dots,z_K)=\frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}},\qquad k=1,\dots,K \tag{232}$$

Each of these $K$ functions takes a value between 0 and 1, and together they sum to unity,

$$\sum_{k=1}^K \mathrm{softmax}_k(z_1,\dots,z_K)=1, \tag{233}$$

so they can be interpreted as a probability distribution over the $K$ classes.
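As a quick illustration of Eqs. (232) and (233), here is a minimal Matlab sketch of the softmax function; the name softmaxFun and the max-subtraction step for numerical stability are my own additions and are not part of the functions given later in this section:

function p=softmaxFun(z)          % hypothetical helper, not used by the code below
    z=z(:)-max(z(:));             % subtracting max(z) avoids overflow; softmax is unchanged
    p=exp(z)/sum(exp(z));         % Eq. (232): p(k)=exp(z_k)/sum_j exp(z_j)
end                               % components of p lie in (0,1) and sum to 1, Eq. (233)

For example, softmaxFun([1 2 3]) returns approximately [0.0900 0.2447 0.6652], which indeed sums to 1.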
Similar to how the logistic function is used to model the conditional probability $p(C_1|\mathbf{x})$ for a given data point $\mathbf{x}$ to belong to class $C_1$ based on the model parameter $\mathbf{w}$ in Eq. (212), here the softmax function defined above is used to model the conditional probability for $\mathbf{x}$ to belong to class $C_k$ based on some model parameter $\mathbf{W}$:

$$p(C_k|\mathbf{x})=\mathrm{softmax}_k\big(\mathbf{w}_1^T\mathbf{x},\dots,\mathbf{w}_K^T\mathbf{x}\big)=\frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T\mathbf{x}}},\qquad k=1,\dots,K \tag{235}$$

where $\mathbf{W}=[\mathbf{w}_1,\dots,\mathbf{w}_K]$ is composed of $K$ weight vectors, each associated with one of the $K$ classes, to be determined in the training process based on the training set $\mathcal{D}$. Same as in logistic regression, here both $\mathbf{x}$ and each $\mathbf{w}_k$ are augmented $(d+1)$-dimensional vectors. We note that the inner product of vectors $\mathbf{w}_k$ and $\mathbf{x}$ is inversely related to the angle between the two vectors, i.e., if, compared with the other classes, a data sample $\mathbf{x}$ has a smaller angular distance to $\mathbf{w}_k$, then it has a larger inner product $\mathbf{w}_k^T\mathbf{x}$ and thereby a greater probability to belong to class $C_k$. We therefore see that the weight vectors $\mathbf{w}_1,\dots,\mathbf{w}_K$, as the parameters of the softmax regression model, actually represent the angular directions of the corresponding classes in the feature space.
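To spell out this angular interpretation, recall the standard identity relating the inner product to the cosine of the angle $\theta_k$ between the two vectors:

$$\mathbf{w}_k^T\mathbf{x}=\|\mathbf{w}_k\|\,\|\mathbf{x}\|\cos\theta_k$$

so for a fixed $\|\mathbf{x}\|$ and weight vectors of comparable norms, a smaller angle $\theta_k$ between $\mathbf{x}$ and $\mathbf{w}_k$ yields a larger $\mathbf{w}_k^T\mathbf{x}$, and hence a larger $p(C_k|\mathbf{x})$ in Eq. (235).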
In particular, when $K=2$, the above becomes a pair of logistic functions:

$$p(C_1|\mathbf{x})=\frac{e^{\mathbf{w}_1^T\mathbf{x}}}{e^{\mathbf{w}_1^T\mathbf{x}}+e^{\mathbf{w}_2^T\mathbf{x}}}=\frac{1}{1+e^{-(\mathbf{w}_1-\mathbf{w}_2)^T\mathbf{x}}},\qquad p(C_2|\mathbf{x})=\frac{1}{1+e^{(\mathbf{w}_1-\mathbf{w}_2)^T\mathbf{x}}}=1-p(C_1|\mathbf{x}) \tag{236}$$
For mathematical convenience, we also label each sample $\mathbf{x}_n$ in the training set by a binary vector $\mathbf{y}_n=[y_{1n},\dots,y_{Kn}]^T$, in addition to its integer labeling $y_n\in\{1,\dots,K\}$. If $y_n=k$, indicating $\mathbf{x}_n\in C_k$, then the $k$th component of $\mathbf{y}_n$ is $y_{kn}=1$, while all other components are zero, $y_{jn}=0$ for all $j\ne k$. Note that all $K$ components of $\mathbf{y}_n$ add up to 1. The $N$ training samples in $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_N]$ are also labeled by their corresponding binary labelings in $\mathbf{Y}=[\mathbf{y}_1,\dots,\mathbf{y}_N]$.
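For instance, this relabeling can be carried out by the following Matlab fragment (a made-up example with $K=3$ classes and $N=4$ samples; the loop is the same one used inside the training function further below):

y=[2 1 3 2];                      % integer labels of N=4 samples, K=3 classes (made-up example)
K=3; N=length(y);
Y=zeros(K,N);                     % binary (one-hot) labeling matrix
for n=1:N
    Y(y(n),n)=1;                  % kth component of column n set to 1, where k=y(n)
end
% Y = [0 1 0 0; 1 0 0 1; 0 0 1 0], and each column sums to 1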
We define the probability $\phi_k(\mathbf{x}_n)$ for $\mathbf{x}_n\in C_k$ as the softmax function based on the linear function $z_k=\mathbf{w}_k^T\mathbf{x}_n$:

$$\phi_k(\mathbf{x}_n)=p(C_k|\mathbf{x}_n)=\frac{e^{\mathbf{w}_k^T\mathbf{x}_n}}{\sum_{j=1}^K e^{\mathbf{w}_j^T\mathbf{x}_n}},\qquad k=1,\dots,K \tag{237}$$

and the probability for $\mathbf{x}_n$ to be correctly classified into class $C_{y_n}$ can be written as the following product of $K$ factors (of which $K-1$ are equal to 1):

$$p(y_n|\mathbf{x}_n)=\prod_{k=1}^K p(C_k|\mathbf{x}_n)^{y_{kn}}=\prod_{k=1}^K \phi_k(\mathbf{x}_n)^{y_{kn}} \tag{238}$$
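As a concrete check of Eq. (238) (a made-up example with $K=3$ and a sample belonging to the second class):

$$y_n=2,\quad\mathbf{y}_n=[0,\,1,\,0]^T\;\;\Longrightarrow\;\;p(y_n|\mathbf{x}_n)=\phi_1(\mathbf{x}_n)^0\,\phi_2(\mathbf{x}_n)^1\,\phi_3(\mathbf{x}_n)^0=\phi_2(\mathbf{x}_n)$$

i.e., only the factor belonging to the true class survives, as stated above.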
Our goal is to find these $K$ weight vectors in $\mathbf{W}=[\mathbf{w}_1,\dots,\mathbf{w}_K]$ as the parameters of the softmax model for it to optimally fit the $N$ i.i.d. data points in the training set, so that the following likelihood is maximized:

$$L(\mathbf{W})=\prod_{n=1}^N p(y_n|\mathbf{x}_n)=\prod_{n=1}^N\prod_{k=1}^K \phi_k(\mathbf{x}_n)^{y_{kn}} \tag{240}$$

Equivalently, we can minimize the negative log-likelihood, together with an L2 regularization term weighted by a hyperparameter $\lambda$, as the objective function:

$$J(\mathbf{W})=-\ln L(\mathbf{W})+\frac{\lambda}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2=-\sum_{n=1}^N\sum_{k=1}^K y_{kn}\ln\phi_k(\mathbf{x}_n)+\frac{\lambda}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2 \tag{241}$$
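For monitoring convergence during training, the objective in Eq. (241) can be evaluated directly. The following minimal Matlab sketch does this under the same variable conventions as the training code further below (phi holds the K-by-N softmax values, Y the one-hot labels, W2 the (d+1)-by-K weight array); the function name objectiveJ is my own and is not part of the original functions:

function J=objectiveJ(phi,Y,W2,lambda)
    % Eq. (241): negative log-likelihood (cross-entropy) plus L2 regularization
    J=-sum(sum(Y.*log(phi)))+0.5*lambda*sum(sum(W2.^2));
end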
To find the optimal $\mathbf{W}$ that minimizes this objective function $J(\mathbf{W})$, we find its gradient vector with respect to each of its columns $\mathbf{w}_i$:

$$\mathbf{g}_i=\frac{\partial J(\mathbf{W})}{\partial\mathbf{w}_i}=\sum_{n=1}^N\big(\phi_i(\mathbf{x}_n)-y_{in}\big)\,\mathbf{x}_n+\lambda\mathbf{w}_i,\qquad i=1,\dots,K \tag{242}$$
As $\phi_i(\mathbf{x}_n)$ is a function of all $K$ weight vectors in $\mathbf{W}$, so is $\mathbf{g}_i$, which can be explicitly expressed as $\mathbf{g}_i(\mathbf{W})$. We note that Eq. (242) takes the same form as Eq. (221), the gradient in logistic regression; in particular, when $K=2$, the two equations become the same. We therefore see that logistic regression is actually a special case of softmax regression.
We stack all $K$ such $(d+1)$-dimensional gradient vectors together, and get a $K(d+1)$-dimensional gradient vector of the objective function with respect to all $K$ parameter vectors in $\mathbf{W}$:

$$\mathbf{g}=\frac{\partial J(\mathbf{W})}{\partial\mathbf{W}}=\begin{bmatrix}\mathbf{g}_1\\ \vdots\\ \mathbf{g}_K\end{bmatrix} \tag{243}$$
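As an aside, the gradient in Eqs. (242) and (243) can also be computed without explicit loops over samples and classes. The sketch below is my own vectorized restatement (the function name softmaxGradient is hypothetical; W2 is the (d+1)-by-K weight array, X the augmented data matrix, and Y the one-hot labels, matching the conventions of the training code further below):

function g=softmaxGradient(W2,X,Y,lambda)    % hypothetical helper, not part of the original code
    K=size(W2,2);                            % number of classes
    S=exp(W2'*X);                            % K x N matrix of exp(w_k'*x_n)
    phi=S./repmat(sum(S,1),K,1);             % softmax values, Eq. (237); each column sums to 1
    G=X*(phi-Y)'+lambda*W2;                  % (d+1) x K array whose ith column is g_i of Eq. (242)
    g=G(:);                                  % stack the K columns into the K(d+1)-dim gradient, Eq. (243)
end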
We can also find the optimal $\mathbf{W}$ by Newton's method, if we can further find the Hessian matrix as the second derivative of $J(\mathbf{W})$, by taking the derivative of the $i$th gradient vector $\mathbf{g}_i$ with respect to the $j$th weight vector $\mathbf{w}_j$, to get the second order derivative of $J(\mathbf{W})$ with respect to both $\mathbf{w}_i$ and $\mathbf{w}_j$. Using the derivative of the softmax function

$$\frac{\partial\phi_i(\mathbf{x}_n)}{\partial\mathbf{w}_j}=\phi_i(\mathbf{x}_n)\big(\delta_{ij}-\phi_j(\mathbf{x}_n)\big)\,\mathbf{x}_n, \tag{244}$$

where $\delta_{ij}$ is the Kronecker delta (equal to 1 if $i=j$ and 0 otherwise), we get

$$\mathbf{H}_{ij}=\frac{\partial\mathbf{g}_i}{\partial\mathbf{w}_j}=\frac{\partial}{\partial\mathbf{w}_j}\left[\sum_{n=1}^N\big(\phi_i(\mathbf{x}_n)-y_{in}\big)\mathbf{x}_n+\lambda\mathbf{w}_i\right] \tag{245}$$

$$=\sum_{n=1}^N\mathbf{x}_n\left(\frac{\partial\phi_i(\mathbf{x}_n)}{\partial\mathbf{w}_j}\right)^T+\lambda\,\delta_{ij}\mathbf{I} \tag{246}$$

$$=\sum_{n=1}^N\phi_i(\mathbf{x}_n)\big(\delta_{ij}-\phi_j(\mathbf{x}_n)\big)\,\mathbf{x}_n\mathbf{x}_n^T+\lambda\,\delta_{ij}\mathbf{I} \tag{247}$$
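In Matlab, each such Hessian block can be formed in a single matrix product; the following hypothetical helper merely restates Eq. (247) in matrix form, with the (i==j) term supplying the Kronecker delta (variable conventions as in the training code below):

function Hij=hessianBlock(i,j,phi,X)         % hypothetical helper restating Eq. (247)
    z=phi(i,:).*((i==j)-phi(j,:));           % 1 x N vector of phi_i(x_n)*(delta_ij - phi_j(x_n))
    Hij=X*diag(z)*X';                        % (d+1) x (d+1) block; lambda*delta_ij*I is added to the full Hessian later
end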
Here $\mathbf{H}_{ij}$ of Eq. (247) is a matrix of dimension $(d+1)\times(d+1)$, corresponding to $\mathbf{w}_i$ and $\mathbf{w}_j$. We further stack all $K\times K$ such matrices together to get the $K(d+1)\times K(d+1)$ dimensional full Hessian matrix of $J(\mathbf{W})$ with respect to all $K$ vectors in $\mathbf{W}$:

$$\mathbf{H}=\begin{bmatrix}\mathbf{H}_{11}&\cdots&\mathbf{H}_{1K}\\ \vdots&\ddots&\vdots\\ \mathbf{H}_{K1}&\cdots&\mathbf{H}_{KK}\end{bmatrix} \tag{248}$$

Given the gradient $\mathbf{g}$ and the Hessian $\mathbf{H}$, the stacked vector $\mathbf{w}=[\mathbf{w}_1^T,\dots,\mathbf{w}_K^T]^T$ of all $K$ weight vectors can be iteratively updated by Newton's method until convergence:

$$\mathbf{w}_{new}=\mathbf{w}_{old}-\mathbf{H}^{-1}\mathbf{g} \tag{249}$$
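The Newton step in Eq. (249) requires forming and inverting the $K(d+1)\times K(d+1)$ Hessian, which can be costly when $K$ or $d$ is large. As an alternative sketch (not part of the method implemented below), the same regularized objective could be minimized by plain gradient descent using only Eq. (243); the step size eta, the iteration limit maxIter, and the use of the hypothetical softmaxGradient helper sketched after Eq. (243) are all illustrative choices:

eta=0.1;                                     % illustrative fixed step size; needs tuning in practice
maxIter=1000; tol=10^(-6);
for it=1:maxIter
    g=softmaxGradient(W2,X,Y,lambda);        % stacked gradient of Eq. (243), hypothetical helper above
    W2=W2-eta*reshape(g,size(W2));           % gradient-descent update in place of the Newton step (249)
    if norm(g)<tol, break; end               % stop when the gradient is small enough
end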
Once $\mathbf{W}=[\mathbf{w}_1,\dots,\mathbf{w}_K]$ is available, any unlabeled sample $\mathbf{x}$ of unknown class can be classified into the one of the $K$ classes with the maximum conditional probability given in Eq. (235):

$$\text{if}\;\;k=\arg\max_{1\le j\le K}\,p(C_j|\mathbf{x})=\arg\max_{1\le j\le K}\frac{e^{\mathbf{w}_j^T\mathbf{x}}}{\sum_{l=1}^K e^{\mathbf{w}_l^T\mathbf{x}}},\;\;\text{then}\;\;\mathbf{x}\in C_k \tag{250}$$
Whether we should use softmax regression or $K$ binary logistic regression classifiers for a problem of $K$ classes depends on the nature of the classes. The method of softmax regression is suitable if the classes are mutually exclusive, as assumed by the method. Otherwise, $K$ independent logistic regression binary classifiers are more suitable.
Below is a Matlab function for estimating the $K$ weight vectors in $\mathbf{W}$ based on the training set $(\mathbf{X},\,\mathbf{y})$ and the regularization hyperparameter $\lambda$:
function W=softmaxRegression(X,y,lambda)
    [d N]=size(X);
    K=length(unique(y));                % number of classes
    X=[ones(1,N); X];                   % augmented data points
    d=d+1;
    Y=zeros(K,N);
    for n=1:N
        Y(y(n),n)=1;                    % generate binary labeling Y
    end
    W=zeros(d*K,1);                     % initial guess of K weight vectors
    I=eye(K);
    s=zeros(K,N);
    phi=zeros(K,N);                     % softmax functions
    gi=zeros(d,1);                      % ith gradient
    Hij=zeros(d,d);                     % ij-th block of the Hessian
    er=9; tol=10^(-6); it=0;
    while er>tol
        it=it+1;
        W2=reshape(W,d,K);              % weight vectors in d x K 2-D array
        g=[];                           % total gradient vector
        for i=1:K
            for n=1:N
                xn=X(:,n);              % get the nth sample
                t=0;
                for k=1:K
                    wk=W2(:,k);         % get the kth weight vector
                    s(k,n)=exp(wk'*xn);
                    t=t+s(k,n);
                end
                phi(i,n)=s(i,n)/t;      % softmax function, Eq. (237)
            end
            gi=X*(phi(i,:)-Y(i,:))'+lambda*W2(:,i);   % ith gradient, Eq. (242)
            g=[g; gi];                  % stack all gradients into a long gradient vector
        end
        H=[];                           % total Hessian
        z=zeros(N,1);
        for i=1:K                       % for the ith block row
            Hi=[];
            for j=1:K                   % for the jth block in ith row
                for n=1:N
                    z(n)=phi(i,n)*((i==j)-phi(j,n));
                end
                Hij=X*diag(z)*X';       % Hessian block, Eq. (247)
                Hi=[Hi Hij];            % append jth block
            end
            H=[H; Hi];                  % append ith block row
        end
        H=H+lambda*eye(d*K);            % include regularization term
        W=W-inv(H)*g;                   % update W by Newton's method, Eq. (249)
        er=norm(g);
    end
    W=reshape(W,d,K);                   % reshape weight vector into d x K array
end
Here is the function for the classification of the $N$ unlabeled data samples in matrix $\mathbf{X}$ based on the $K$ weight vectors in $\mathbf{W}$:
function yhat=softmaxClassify(W,X)
    [d N]=size(X);                      % dataset to be classified
    [d K]=size(W);                      % model parameters
    X=[ones(1,N); X];                   % augmented data points
    yhat=zeros(N,1);
    for n=1:N                           % for each of the N samples
        xn=X(:,n);                      % nth sample point
        t=0;
        for k=1:K
            wk=W(:,k);                  % kth weight vector
            s(k)=exp(wk'*xn);
            t=t+s(k);
        end
        pmax=0;
        for k=1:K
            p=s(k)/t;                   % probability based on softmax function, Eq. (235)
            if p>pmax
                kmax=k;
                pmax=p;
            end
        end
        yhat(n)=kmax;                   % predicted class labeling, Eq. (250)
    end
end
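A minimal usage sketch of the two functions on made-up synthetic data (the Gaussian class means, sample size, and the value lambda=0.1 are arbitrary illustrative choices, not taken from the text):

% generate N=300 two-dimensional samples from K=3 made-up Gaussian classes
N=300; K=3;
mu=[0 0; 4 0; 2 4]';                    % illustrative class means, one column per class
X=zeros(2,N); y=zeros(N,1);
for n=1:N
    y(n)=mod(n,K)+1;                    % integer class label in {1,2,3}
    X(:,n)=mu(:,y(n))+randn(2,1);       % class mean plus unit-variance Gaussian noise
end
W=softmaxRegression(X,y,0.1);           % train with illustrative lambda=0.1
yhat=softmaxClassify(W,X);              % classify the (training) samples
accuracy=mean(yhat==y)                  % fraction of correctly classified samples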