In a multiclass classification problem, an unlabeled data point $\mathbf{x}$ is to be classified into one of $K$ classes $\{C_1,\cdots,C_K\}$, based on the training set ${\cal D}=\{(\mathbf{x}_n,\,y_n),\;n=1,\cdots,N\}$, where $y_n\in\{1,\cdots,K\}$ is an integer indicating the class to which $\mathbf{x}_n$ belongs.
Any binary classifier, such as the logistic regression considered above, can be used to solve such a multiclass classification problem in either of the following two ways:

One-versus-rest: $K$ binary classifiers are trained, the $k$th of which separates class $C_k$ from the remaining $K-1$ classes; an unlabeled point $\mathbf{x}$ is then assigned to the class whose classifier produces the greatest confidence (e.g., the greatest posterior probability).

One-versus-one: $K(K-1)/2$ binary classifiers are trained, one for each pair of classes; $\mathbf{x}$ is assigned to the class that wins the most of these pairwise decisions.
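For concreteness, here is a minimal Matlab sketch of the one-versus-rest scheme. It is only an illustration under stated assumptions: the function name oneVsRestLogistic, the plain gradient-descent training of each binary logistic classifier, and the step size and iteration count are choices made here, not part of the notes; any binary classifier trained by any method could be substituted.

function yhat=oneVsRestLogistic(X,y,Xtest)
    [d,N]=size(X); K=length(unique(y));
    X=[ones(1,N); X]; d=d+1;                 % augmented training data
    W=zeros(d,K);                            % one weight vector per class
    eta=0.1;                                 % step size (illustrative choice)
    for k=1:K
        t=double(y(:)'==k);                  % binary target: class k vs the rest
        w=zeros(d,1);
        for it=1:500                         % fixed number of gradient-descent steps
            p=1./(1+exp(-w'*X));             % logistic outputs for all N samples
            w=w-eta*X*(p-t)'/N;              % gradient of the cross-entropy loss
        end
        W(:,k)=w;
    end
    M=size(Xtest,2);
    Xtest=[ones(1,M); Xtest];                % augmented test data
    scores=1./(1+exp(-W'*Xtest));            % K x M matrix of class confidences
    [pmax,yhat]=max(scores,[],1);            % assign each sample to the most confident class
    yhat=yhat(:);
end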
Alternatively, a multiclass problem with $K>2$ classes can also be solved by multinomial logistic or softmax regression, which can be considered as a generalized version of the logistic regression method, based on the softmax function of $K$ variables $z_1,\cdots,z_K$:

\phi_k(z_1,\cdots,z_K)=\frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}},\qquad k=1,\cdots,K    (232)

the $K$ values of which are positive and add up to 1:

0<\phi_k(z_1,\cdots,z_K)<1,\qquad \sum_{k=1}^K\phi_k(z_1,\cdots,z_K)=1    (233)
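As a quick illustration of Eq. (232), the softmax function can be computed by a few lines of Matlab; this sketch is an addition here (the function name softmaxFunc and the subtraction of the maximum, a common numerical safeguard, are not part of the notes):

function phi=softmaxFunc(z)
    % softmax of a vector z of K values, Eq. (232)
    z=z-max(z);        % shift by the maximum for numerical stability (result unchanged)
    s=exp(z);
    phi=s/sum(s);      % components are positive and add up to 1, Eq. (233)
end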
Similar to how the logistic function is used to model the conditional probability $p(C_1|\mathbf{x})$ for a given data point $\mathbf{x}$ to belong to class $C_1$ based on the model parameter $\mathbf{w}$ in Eq. (212), here the softmax function defined above is used to model the conditional probability for $\mathbf{x}$ to belong to class $C_k$ based on some model parameters $\mathbf{w}_1,\cdots,\mathbf{w}_K$:

p(C_k|\mathbf{x})=\phi_k(\mathbf{w}_1^T\mathbf{x},\cdots,\mathbf{w}_K^T\mathbf{x})    (234)
=\frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{l=1}^K e^{\mathbf{w}_l^T\mathbf{x}}},\qquad k=1,\cdots,K    (235)
We note that the inner product $\mathbf{w}_k^T\mathbf{x}$ of vectors $\mathbf{w}_k$ and $\mathbf{x}$ is inversely related to the angle between the two vectors; i.e., if, compared with the other classes, a data sample $\mathbf{x}$ has a smaller angular distance to $\mathbf{w}_k$, then it has a larger inner product $\mathbf{w}_k^T\mathbf{x}$ and thereby a greater probability $p(C_k|\mathbf{x})$ to belong to class $C_k$. We therefore see that the weight vectors $\mathbf{w}_1,\cdots,\mathbf{w}_K$, as the parameters of the softmax regression model, actually represent the angular directions of the corresponding classes in the feature space.
Specifically, when $K=2$, the above becomes the logistic functions:

p(C_1|\mathbf{x})=\frac{e^{\mathbf{w}_1^T\mathbf{x}}}{e^{\mathbf{w}_1^T\mathbf{x}}+e^{\mathbf{w}_2^T\mathbf{x}}}=\frac{1}{1+e^{-(\mathbf{w}_1-\mathbf{w}_2)^T\mathbf{x}}},\qquad p(C_2|\mathbf{x})=1-p(C_1|\mathbf{x})    (236)
For mathematical convenience, we also label each sample $\mathbf{x}_n$ in the training set by a binary vector $\mathbf{y}_n=[y_{n1},\cdots,y_{nK}]^T$, in addition to its integer labeling $y_n\in\{1,\cdots,K\}$. If $y_n=k$, indicating $\mathbf{x}_n\in C_k$, then the $k$th component of $\mathbf{y}_n$ is $y_{nk}=1$, while all other components are zero, $y_{nl}=0$ for all $l\ne k$. Note that all $K$ components of $\mathbf{y}_n$ add up to 1. The $N$ training samples in ${\bf X}=[\mathbf{x}_1,\cdots,\mathbf{x}_N]$ are also labeled by their corresponding binary labelings in ${\bf Y}=[\mathbf{y}_1,\cdots,\mathbf{y}_N]$.
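The following few lines of Matlab (a small sketch mirroring the labeling loop inside the softmaxRegression function below; the example labels are made up) build the binary label matrix ${\bf Y}$ from the integer labels:

y=[2 1 3 2];             % example integer labels for N=4 samples and K=3 classes
N=length(y); K=max(y);
Y=zeros(K,N);
for n=1:N
    Y(y(n),n)=1;         % kth component of the nth column is 1 when y(n)=k
end                      % each column of Y adds up to 1, as noted above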
We define the probability for $\mathbf{x}_n\in C_k$ as the softmax function based on the linear functions $\mathbf{w}_1^T\mathbf{x}_n,\cdots,\mathbf{w}_K^T\mathbf{x}_n$:

p(C_k|\mathbf{x}_n)=\phi_k(\mathbf{x}_n)=\frac{e^{\mathbf{w}_k^T\mathbf{x}_n}}{\sum_{l=1}^K e^{\mathbf{w}_l^T\mathbf{x}_n}},\qquad k=1,\cdots,K
Our goal is to find these weight vectors in ${\bf W}=[\mathbf{w}_1,\cdots,\mathbf{w}_K]$ as the parameters of the softmax model for it to optimally fit the $N$ i.i.d. data points in the training set, so that the following likelihood is maximized:

L({\bf W})=\prod_{n=1}^N\prod_{k=1}^K p(C_k|\mathbf{x}_n)^{y_{nk}}=\prod_{n=1}^N\prod_{k=1}^K \phi_k(\mathbf{x}_n)^{y_{nk}}    (240)

Equivalently, we minimize the regularized negative log-likelihood, i.e., the cross-entropy objective function

J({\bf W})=-\ln L({\bf W})+\frac{\lambda}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2 =-\sum_{n=1}^N\sum_{k=1}^K y_{nk}\ln\phi_k(\mathbf{x}_n)+\frac{\lambda}{2}\sum_{k=1}^K\|\mathbf{w}_k\|^2    (241)

where $\lambda$ is a regularization hyperparameter.
To find the optimal ${\bf W}$ that minimizes this objective function $J({\bf W})$, we find its gradient vector with respect to each of its columns $\mathbf{w}_k$:

\mathbf{g}_k=\frac{\partial J({\bf W})}{\partial\mathbf{w}_k} =\sum_{n=1}^N\left(\phi_k(\mathbf{x}_n)-y_{nk}\right)\mathbf{x}_n+\lambda\mathbf{w}_k,\qquad k=1,\cdots,K    (242)

As $\phi_k(\mathbf{x}_n)$ is a function of all $K$ weight vectors in ${\bf W}$, so is $\mathbf{g}_k$, which can be explicitly expressed as $\mathbf{g}_k({\bf W})=\mathbf{g}_k(\mathbf{w}_1,\cdots,\mathbf{w}_K)$. We note that Eq. (242) takes the same form as Eq. (221) for logistic regression. Specifically, when $K=2$, the two equations become the same. We therefore see that logistic regression is actually a special case of softmax regression.
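Assuming the data matrix is augmented as in the code below, the objective of Eq. (241) and all $K$ gradient vectors of Eq. (242) can be evaluated in a few vectorized Matlab lines. This sketch, with made-up toy sizes, is only a cross-check of the loops in the softmaxRegression function further below:

d=3; N=5; K=3; lambda=0.1;                  % toy sizes and regularization (assumptions)
X=[ones(1,N); randn(d,N)];                  % augmented data points, (d+1) x N
y=randi(K,1,N); Y=zeros(K,N);
for n=1:N, Y(y(n),n)=1; end                 % binary labeling vectors as columns of Y
W=zeros(d+1,K);                             % current weight vectors as columns of W
S=exp(W'*X);                                % K x N matrix with entries e^{w_k' x_n}
Phi=S./(ones(K,1)*sum(S,1));                % softmax outputs phi_k(x_n) of Eq. (232)
J=-sum(sum(Y.*log(Phi)))+lambda/2*sum(sum(W.^2))  % objective J(W) of Eq. (241)
G=X*(Phi-Y)'+lambda*W                       % kth column of G is the gradient g_k of Eq. (242)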
We stack all $K$ such $d$-dimensional gradient vectors together, and get a $Kd$-dimensional gradient vector of the objective function $J({\bf W})$ with respect to all $K$ parameter vectors in ${\bf W}$:

\mathbf{g}=\left[\begin{array}{c}\mathbf{g}_1\\ \vdots\\ \mathbf{g}_K\end{array}\right]    (243)
We can also find the optimal ${\bf W}$ by Newton's method, if we can further find the Hessian matrix as the second derivative of $J({\bf W})$, by taking the derivative of the $i$th gradient vector $\mathbf{g}_i$ with respect to the $j$th weight vector $\mathbf{w}_j$ to get the second order derivative of $J({\bf W})$ with respect to both $\mathbf{w}_i$ and $\mathbf{w}_j$:

{\bf H}_{ij}=\frac{\partial\mathbf{g}_i}{\partial\mathbf{w}_j} =\frac{\partial^2 J({\bf W})}{\partial\mathbf{w}_j\,\partial\mathbf{w}_i}    (244)
=\frac{\partial}{\partial\mathbf{w}_j}\left[\sum_{n=1}^N\left(\phi_i(\mathbf{x}_n)-y_{ni}\right)\mathbf{x}_n+\lambda\mathbf{w}_i\right]    (245)
=\sum_{n=1}^N\phi_i(\mathbf{x}_n)\left(\delta_{ij}-\phi_j(\mathbf{x}_n)\right)\mathbf{x}_n\mathbf{x}_n^T+\delta_{ij}\,\lambda\,{\bf I}    (246)
={\bf X}\,{\bf\Lambda}_{ij}\,{\bf X}^T+\delta_{ij}\,\lambda\,{\bf I}    (247)

Here ${\bf H}_{ij}$ is a matrix of dimension $d\times d$, $\delta_{ij}$ is the Kronecker delta, which equals 1 if $i=j$ and 0 if $i\ne j$, and ${\bf\Lambda}_{ij}$ is the $N\times N$ diagonal matrix whose $n$th diagonal element is $\phi_i(\mathbf{x}_n)\left(\delta_{ij}-\phi_j(\mathbf{x}_n)\right)$.
We further stack all $K\times K$ such matrices ${\bf H}_{ij}$ together to get the $Kd\times Kd$ dimensional full Hessian matrix ${\bf H}$ of $J({\bf W})$ with respect to all $K$ vectors in ${\bf W}$:

{\bf H}=\left[\begin{array}{ccc}{\bf H}_{11}&\cdots&{\bf H}_{1K}\\ \vdots&\ddots&\vdots\\ {\bf H}_{K1}&\cdots&{\bf H}_{KK}\end{array}\right]    (248)

and iteratively update the stacked weight vector $\mathbf{w}=[\mathbf{w}_1^T,\cdots,\mathbf{w}_K^T]^T$ by Newton's method:

\mathbf{w}\leftarrow\mathbf{w}-{\bf H}^{-1}\mathbf{g}    (249)
Once ${\bf W}=[\mathbf{w}_1,\cdots,\mathbf{w}_K]$ is available, any unlabeled sample $\mathbf{x}$ of unknown class can be classified into one of the $K$ classes with the maximum conditional probability given in Eq. (235):

\mathbf{x}\in C_k\qquad\mbox{if}\qquad k=\arg\max_{l}\;p(C_l|\mathbf{x}) =\arg\max_{l}\;\frac{e^{\mathbf{w}_l^T\mathbf{x}}}{\sum_{m=1}^K e^{\mathbf{w}_m^T\mathbf{x}}}    (250)
Whether we should use softmax regression or $K$ individual logistic regressions for a problem of $K$ classes depends on the nature of the classes. The method of softmax regression is suitable if the $K$ classes are mutually exclusive and independent, as assumed by the method. Otherwise, $K$ logistic regression binary classifiers are more suitable.
Below is a Matlab function for estimating the weight vectors in ${\bf W}=[\mathbf{w}_1,\cdots,\mathbf{w}_K]$ based on the training set $\{{\bf X},\,\mathbf{y}\}$ and the regularization hyperparameter $\lambda$:
function W=softmaxRegression(X,y,lambda)
    [d N]=size(X);
    K=length(unique(y));            % number of classes
    X=[ones(1,N); X];               % augmented data points
    d=d+1;
    Y=zeros(K,N);
    for n=1:N
        Y(y(n),n)=1;                % generate binary labeling Y
    end
    W=zeros(d*K,1);                 % initial guess of K weight vectors
    I=eye(K);
    s=zeros(K,N);
    phi=zeros(K,N);                 % softmax functions
    gi=zeros(d,1);                  % ith gradient
    Hij=zeros(d,d);                 % ij-th Hessian block
    er=9; tol=10^(-6); it=0;
    while er>tol
        it=it+1;
        W2=reshape(W,d,K);          % weight vectors in d x K 2-D array
        g=[];                       % total gradient vector
        for i=1:K
            for n=1:N
                xn=X(:,n);          % get the nth sample
                t=0;
                for k=1:K
                    wk=W2(:,k);     % get the kth weight vector
                    s(k,n)=exp(wk'*xn);
                    t=t+s(k,n);
                end
                phi(i,n)=s(i,n)/t;  % softmax function
            end
            gi=X*(phi(i,:)-Y(i,:))'+lambda*W2(:,i);  % ith gradient
            g=[g; gi];              % stack all gradients into a long gradient vector
        end
        H=[];                       % total Hessian
        z=zeros(N,1);
        for i=1:K                   % for the ith block row
            Hi=[];
            for j=1:K               % for the jth block in ith row
                for n=1:N
                    z(n)=phi(i,n)*((i==j)-phi(j,n));
                end
                Hij=X*diag(z)*X';
                Hi=[Hi Hij];        % append jth block
            end
            H=[H; Hi];              % append ith block row
        end
        H=H+lambda*eye(d*K);        % include regularization term
        W=W-inv(H)*g;               % update W by Newton's method
        er=norm(g);
    end
    W=reshape(W,d,K);               % reshape weight vector into d x K array
end
Here is the function for the classification of the unlabeled data samples in matrix ${\bf X}$ based on the weight vectors in ${\bf W}$:
function yhat=softmaxClassify(W,X)
    [d N]=size(X);                  % dataset to be classified
    [d K]=size(W);                  % model parameters
    X=[ones(1,N); X];               % augmented data points
    yhat=zeros(N,1);
    for n=1:N                       % for each of the N samples
        xn=X(:,n);                  % nth sample point
        t=0;
        for k=1:K
            wk=W(:,k);              % kth weight vector
            s(k)=exp(wk'*xn);
            t=t+s(k);
        end
        pmax=0;
        for k=1:K
            p=s(k)/t;               % probability based on softmax function
            if p>pmax
                kmax=k;
                pmax=p;
            end
        end
        yhat(n)=kmax;               % predicted class labeling
    end
end
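As a usage sketch, the two functions above can be combined as follows; the toy Gaussian data, the class means, and the value $\lambda=0.1$ are assumptions made here for illustration only:

% generate a small toy training set of K=3 Gaussian classes (illustrative data)
N=150; K=3;
mu=[0 0; 4 0; 2 4]';                         % assumed class means as columns
X=[]; y=[];
for k=1:K
    X=[X, mu(:,k)*ones(1,N/K)+randn(2,N/K)]; % 50 two-dimensional samples per class
    y=[y, k*ones(1,N/K)];                    % integer class labels
end
W=softmaxRegression(X,y,0.1);                % train with lambda=0.1
yhat=softmaxClassify(W,X);                   % classify the training samples
accuracy=sum(yhat(:)==y(:))/N                % fraction of correctly labeled samples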