It is impossible to reduce the dimensionality of a dataset that is intrinsically high-dimensional (high-D) while preserving all pairwise distances in the resulting low-dimensional (low-D) space; a compromise has to be made, sacrificing certain aspects of the dataset when its dimensionality is reduced.
As a linear dimensionality reduction algorithm, the PCA method is unable to capture complex nonlinear relationships between features. Moreover, as a distance-preserving method, PCA emphasizes the dimensions with large variances but neglects small distance variations. It places distant, dissimilar data points far apart in the low-D space, while ignoring the fact that points close together in the high-D space also need to be placed close together in the low-D space. Therefore the PCA method tends to preserve the large-scale global structure of the dataset, while ignoring the small-scale local structure.
But in some cases the local structures are of equal or even greater importance than the global ones, as in the typical example of the "Swiss roll" dataset.
Also, when the dimensionality is greatly reduced from a high-D space to a low-D space, the crowding problem occurs. In a high-D space, a data point may have a large number of similar neighbors at roughly equal distances; however, in a low-D space, the number of equal-distance neighbors is significantly reduced (8 in a 3-D space, 4 in a 2-D space, 2 in a 1-D space), so the space available for the neighbors of a point is much reduced. As a result, when a large number of neighbors in the high-D space are mapped to a low-D space, they are forced to spread out; some end up closer to points that are farther away in the original space, thereby reducing the gaps between potential clusters of similar points.
In contrast, the t-SNE method is a nonlinear method based on probability distributions of the data points being neighbors. It attempts to preserve the structure at all scales, but places more emphasis on the small-scale structures, by mapping nearby points in the high-D space to nearby points in the low-D space.
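As a quick illustration of this contrast, the following sketch embeds the Swiss roll into 2-D with both methods using scikit-learn; the sample size and perplexity below are arbitrary choices, not values prescribed by this text:

```python
# Minimal sketch: PCA vs. t-SNE on the "Swiss roll" dataset,
# assuming scikit-learn and matplotlib are available.
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, color = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D points

X_pca = PCA(n_components=2).fit_transform(X)           # linear projection
X_tsne = TSNE(n_components=2, perplexity=30.0,
              random_state=0).fit_transform(X)         # nonlinear embedding

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
ax1.set_title("PCA: global structure, roll flattened")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=color, s=5)
ax2.set_title("t-SNE: local neighborhoods preserved")
plt.show()
```

In the PCA plot the roll is simply squashed, so points from different layers overlap; the t-SNE plot tends to keep points of similar color (nearby on the manifold) together.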
The similarity between any two distinct points $\mathbf{x}_i$ and $\mathbf{x}_j$ in the high-D space is defined as a probability based on the Gaussian kernel:

$$ p_{j|i} = \frac{\exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\right)} \tag{185} $$

If $\mathbf{x}_j$ is close to $\mathbf{x}_i$, $\|\mathbf{x}_i - \mathbf{x}_j\|$ is small and $p_{j|i}$ is large, indicating $\mathbf{x}_j$ is similar to $\mathbf{x}_i$; but if $\mathbf{x}_j$ is far away from $\mathbf{x}_i$, $\|\mathbf{x}_i - \mathbf{x}_j\|$ is large and $p_{j|i}$ is small, indicating $\mathbf{x}_j$ is dissimilar to $\mathbf{x}_i$. Also, as $\sum_{j \ne i} p_{j|i} = 1$, $p_{j|i}$ can be considered as the probability that any two points $\mathbf{x}_i$ and $\mathbf{x}_j$ are neighbors in the feature space and are therefore similar to each other.
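As a concrete rendering of Eq. (185), here is a minimal NumPy sketch; the function name conditional_p is mine, and it assumes the kernel width sigma_i is already known (its selection is discussed below):

```python
# Sketch of Eq. (185): conditional probabilities p_{j|i} for one point x_i,
# given an assumed-known kernel width sigma_i.
import numpy as np

def conditional_p(X, i, sigma_i):
    """Return p_{j|i} for all j, given data X of shape (n, d); entry i is 0."""
    d2 = np.sum((X - X[i])**2, axis=1)     # squared distances ||x_i - x_j||^2
    logits = -d2 / (2.0 * sigma_i**2)
    logits[i] = -np.inf                    # exclude j = i from the kernel
    w = np.exp(logits - logits.max())      # subtract max for numerical stability
    return w / w.sum()                     # normalize so sum_j p_{j|i} = 1
```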
Similarly, we could define the similarity $q_{ij}$ in the low-D space also based on the Gaussian kernel. However, for reasons explained below, we prefer to define the similarity $q_{ij}$ in the low-D space based on the Student's t-distribution with one degree of freedom, instead of the Gaussian distribution:

$$ q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}} \tag{186} $$
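A matching NumPy sketch of Eq. (186), again with a function name of my own choosing:

```python
# Sketch of Eq. (186): joint probabilities q_{ij} in the low-D space,
# based on the Student's t-distribution with one degree of freedom.
import numpy as np

def low_dim_q(Y):
    """Return the (n, n) matrix q_{ij} for map points Y of shape (n, m)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)  # ||y_i - y_j||^2
    w = 1.0 / (1.0 + d2)                                      # t-kernel weights
    np.fill_diagonal(w, 0.0)                                  # exclude i = j
    return w / w.sum()                                        # normalize over all pairs
```

Note that, unlike Eq. (185), the normalization here runs over all pairs, so $Q = \{q_{ij}\}$ is a single joint distribution rather than one distribution per point.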
How well the points in the low-D space represent the similarity relationships among the points in the original high-D space can be measured by the discrepancy between the two sets of probabilities, quantified by the cost function defined as the Kullback-Leibler (KL) divergence:

$$ C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{187} $$
The KL divergence is not symmetric, as $KL(P\,\|\,Q) \ne KL(Q\,\|\,P)$; therefore it cannot be viewed as a distance between the two sets of probabilities. This asymmetry is actually a desired feature. Consider the following two cases:

- If $p_{ij}$ is large but $q_{ij}$ is small, i.e., two points that are close in the high-D space are mapped far apart in the low-D space, the term $p_{ij} \log(p_{ij}/q_{ij})$ is large, and this mistake is heavily penalized.
- If $p_{ij}$ is small but $q_{ij}$ is large, i.e., two points that are far apart in the high-D space are mapped close together in the low-D space, the term $p_{ij} \log(p_{ij}/q_{ij})$ is small in magnitude, and this mistake is only lightly penalized.

The cost function therefore emphasizes preserving the local neighborhood structure of the high-D data.
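The sketch below evaluates the cost of Eq. (187) and illustrates the asymmetric penalty; the pair probabilities in the two print statements are made-up numbers chosen purely for illustration:

```python
# Sketch of Eq. (187) and the asymmetric penalty of the KL divergence.
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """C = sum_ij p_ij * log(p_ij / q_ij); eps guards against log(0)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * np.log((P + eps) / (Q + eps)))

# Large p modeled by small q (close points mapped far apart): large cost.
print(0.4 * np.log(0.4 / 0.01))   # ~ 1.48
# Small p modeled by large q (far points mapped close): tiny cost magnitude.
print(0.01 * np.log(0.01 / 0.4))  # ~ -0.037
```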
Before proceeding to consider the minimization of the cost function $C$, we still need to resolve two additional issues.

First, the conditional probability in Eq. (185) is generally not symmetric, $p_{j|i} \ne p_{i|j}$. A symmetric joint probability is obtained by averaging the two conditionals over the $n$ data points:

$$ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \tag{188} $$

Second, the kernel width $\sigma_i$ in Eq. (185) needs to be determined for each point $\mathbf{x}_i$: it should be small in dense regions and large in sparse regions. This is done by requiring that the perplexity of the distribution $P_i = \{p_{j|i}\}$, defined through its Shannon entropy

$$ H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i} \tag{189} $$

as

$$ Perp(P_i) = 2^{H(P_i)} \tag{190} $$

equal a value specified by the user (typically between 5 and 50); the corresponding $\sigma_i$ is then found by a binary search.
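A minimal sketch of that binary search, reusing the conditional_p helper sketched earlier; the bracketing interval and iteration count are arbitrary choices of mine:

```python
# Sketch: choose sigma_i so that the perplexity of P_i (Eqs. 189-190)
# matches a user-specified target value.
import numpy as np

def sigma_for_perplexity(X, i, target=30.0, tol=1e-4, n_iter=50):
    lo, hi = 1e-10, 1e10                   # bracket for the binary search
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_p(X, i, sigma)
        p_nz = p[p > 0]
        H = -np.sum(p_nz * np.log2(p_nz))  # Shannon entropy, Eq. (189)
        perp = 2.0 ** H                    # perplexity, Eq. (190)
        if abs(perp - target) < tol:
            break
        if perp > target:                  # kernel too wide -> shrink sigma
            hi = sigma
        else:                              # kernel too narrow -> grow sigma
            lo = sigma
    return sigma
```

The search works because perplexity increases monotonically with $\sigma_i$: a wider kernel spreads probability over more neighbors.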
We can now find the optimal map points that minimize the cost function $C$ by the gradient descent method. To find the gradient of $C$, we first introduce a set of intermediate variables

$$ d_{ij} = \|\mathbf{y}_i - \mathbf{y}_j\| \tag{191} $$

$$ w_{ij} = \left(1 + d_{ij}^2\right)^{-1} \tag{192} $$

$$ Z = \sum_{k \ne l} w_{kl} \tag{193} $$

so that Eq. (186) becomes

$$ q_{ij} = \frac{w_{ij}}{Z} \tag{194} $$

Using $\sum_{i \ne j} p_{ij} = 1$, the cost function can be rewritten as

$$ C = \sum_{i \ne j} p_{ij} \log p_{ij} - \sum_{i \ne j} p_{ij} \log w_{ij} + \log Z \tag{195} $$

Noting that

$$ \frac{\partial w_{ij}}{\partial d_{ij}} = -2\, d_{ij}\, w_{ij}^2 \tag{196} $$

we get

$$ \frac{\partial C}{\partial d_{ij}} = -\frac{p_{ij}}{w_{ij}} \frac{\partial w_{ij}}{\partial d_{ij}} + \frac{1}{Z} \frac{\partial w_{ij}}{\partial d_{ij}} \tag{197} $$

$$ = 2\, p_{ij}\, d_{ij}\, w_{ij} - 2\, d_{ij} \frac{w_{ij}^2}{Z} = 2\, (p_{ij} - q_{ij})\, d_{ij}\, w_{ij} \tag{198} $$

As $\partial d_{ij} / \partial \mathbf{y}_i = (\mathbf{y}_i - \mathbf{y}_j)/d_{ij}$, and each $\mathbf{y}_i$ appears in both $d_{ij}$ and $d_{ji}$, the chain rule gives

$$ \frac{\partial C}{\partial \mathbf{y}_i} = 2 \sum_j \frac{\partial C}{\partial d_{ij}} \frac{\mathbf{y}_i - \mathbf{y}_j}{d_{ij}} \tag{199} $$

$$ = 4 \sum_j (p_{ij} - q_{ij}) \frac{\mathbf{y}_i - \mathbf{y}_j}{1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2} \tag{200} $$
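The gradient of Eq. (200) vectorizes naturally in NumPy; the sketch below computes it for all map points at once, with a function name of my own choosing:

```python
# Sketch of Eq. (200):
# dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)
import numpy as np

def tsne_grad(P, Y):
    """Gradient of C w.r.t. map points Y (n, m), given joint P (n, n)."""
    diff = Y[:, None, :] - Y[None, :, :]   # y_i - y_j, shape (n, n, m)
    d2 = np.sum(diff**2, axis=-1)
    w = 1.0 / (1.0 + d2)                   # t-kernel weights w_ij, Eq. (192)
    np.fill_diagonal(w, 0.0)
    Q = w / w.sum()                        # q_ij, Eq. (194)
    return 4.0 * np.sum(((P - Q) * w)[:, :, None] * diff, axis=1)
```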
Given the gradient vector, we can iteratively approach the optimal data points in the low-D space that minimize the discrepancy $KL(P\,\|\,Q)$:

$$ \mathbf{y}_i^{(t+1)} = \mathbf{y}_i^{(t)} - \eta\, \frac{\partial C}{\partial \mathbf{y}_i} \tag{201} $$

where $\eta > 0$ is the learning rate and $t$ is the iteration index.
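A minimal end-to-end sketch of the iteration in Eq. (201), reusing the tsne_grad helper above; the initialization scale, learning rate, and iteration count are arbitrary assumptions:

```python
# Sketch of the gradient descent of Eq. (201) on the map points Y.
import numpy as np

def tsne_minimize(P, m=2, eta=100.0, n_iter=500, seed=0):
    """P: symmetric joint matrix from Eq. (188); returns map points Y (n, m)."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], m))  # small random init
    for _ in range(n_iter):
        Y = Y - eta * tsne_grad(P, Y)                # update, Eq. (201)
    return Y
```

In practice the original t-SNE paper augments this plain descent with a momentum term and an "early exaggeration" phase (temporarily scaling up $p_{ij}$) to escape poor local minima; those refinements are omitted from this sketch.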