High-dimensional data visualization is an important tool for making sense of data. Currently, the state-of-the-art methods are t-SNE (t-SNE, BH-SNE) and UMAP (UMAP), which share a similar principle of nonlinear dimension reduction. They use neighborhood probability distributions to connect the high-dimensional data points to low-dimensional map points, trying to keep the local relative neighborhood relations unchanged while ignoring changes in the macro structure of the data. As a consequence, the low-dimensional map points may represent the high-dimensional structure unfaithfully. In the process of preserving and patching neighborhoods in the low-dimensional space, t-SNE sometimes breaks neighborhood relations that hold in the high-dimensional structure. We add a macro loss term to the t-SNE loss that keeps the relative k-means centroid structure consistent between the high- and low-dimensional spaces, which essentially keeps the macro structure unchanged in the low-dimensional map.
We now derive the loss function of the global t-distributed stochastic neighborhood embedding (GTSNE). Suppose that there are $n$ points in the high-dimensional space, $X = \{x_1, \dots, x_n\}$, where $x_i \in \mathbb{R}^D$. We want to obtain their low-dimensional map points, $Y = \{y_1, \dots, y_n\}$, where $y_i \in \mathbb{R}^d$, with $d = 2$ or $3$.
2.1 The Loss of t-SNE
Recall that the loss function of t-SNE is the Kullback–Leibler divergence

$$C_{\mathrm{tsne}} = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where $p_{ij}$ is the probability that high-dimensional point $x_i$ connects to $x_j$, and $q_{ij}$ is the probability that low-dimensional point $y_i$ connects to $y_j$. The probabilities $p_{ij}$ and $q_{ij}$ characterize the neighborhood relations of the points: two close points have a higher connection probability than two far-separated points. The key is to construct the probability distributions in the high-dimensional and low-dimensional spaces such that they match each other; i.e., when $P = Q$ we obtain the best layout in the low-dimensional space.
t-SNE uses a Gaussian kernel to model the neighborhood relations in the high-dimensional space, i.e.,

$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},$$

where $\sigma_i$ is chosen such that it satisfies the perplexity equation

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}.$$

Intuitively, this means that the distribution $P_i$ can effectively distinguish about $\mathrm{Perp}(P_i)$ neighbors of $x_i$.
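In practice the bandwidth $\sigma_i$ is calibrated by bisection on the perplexity equation. A minimal sketch, assuming the squared distances from $x_i$ to the other points are precomputed (the search bounds and iteration count are illustrative):

```python
import numpy as np

def cond_probs(dists_i, sigma):
    """Conditional probabilities p_{j|i} for one point, given its squared
    distances to the other points."""
    p = np.exp(-dists_i / (2.0 * sigma**2))
    return p / p.sum()

def find_sigma(dists_i, perplexity, n_iter=50):
    """Bisection search for the sigma_i whose distribution P_i has the
    target perplexity 2^{H(P_i)} (perplexity is increasing in sigma)."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = (lo + hi) / 2.0
        p = cond_probs(dists_i, sigma)
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))
        if perp > perplexity:   # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma
```

The same search is performed independently for every point $x_i$, so each point gets its own bandwidth adapted to the local density.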
To alleviate the crowding problem (when the distance relations are preserved exactly, moderately separated points in the high-dimensional space are forced to cluster together in the low-dimensional space), t-SNE uses the heavy-tailed Student t-distribution to model the low-dimensional neighborhood relations:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

for $i \neq j$, and $q_{ii} = 0$.
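A minimal sketch of this low-dimensional affinity computation (dense pairwise form, for illustration only; real implementations use Barnes–Hut or similar approximations):

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) joint probabilities q_{ij} over the
    map points Y of shape (n, d), with q_{ii} = 0 by convention."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    inv = 1.0 / (1.0 + sq)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()
```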
In the above formulation, t-SNE captures only the local neighborhood relations in the low-dimensional embedding. In our numerical experiments, we find that the t-SNE map points cannot faithfully represent the high-dimensional data points. There are two problems. The first is that t-SNE cannot fully preserve the local neighborhood relations: when two neighboring points become separated by a line of map points in the low-dimensional layout, t-SNE keeps them on opposite sides of the line and pushes them away from it. This problem arises because the t-SNE loss is non-convex and hard to optimize; once a line of points lies between two nearby points, it is hard to push the line away from them. The second problem is that t-SNE cannot preserve the macro structure of the data; e.g., it projects a three-dimensional sphere into 2D without a circular boundary. To overcome these two problems, we propose the GTSNE loss, which considers both the local neighborhood structure and the macro structure of the data points.
2.2 Global t-Distributed Stochastic Neighbor Embedding
To characterize the global structure of the high-dimensional points, we use the k-means clustering centroids in the high-dimensional space, their neighborhood relations, and the point-to-centroid membership relations to represent the macro structure. To do this, we first run PCA on $X$ to get the PCA embedding $Z = \{z_1, \dots, z_n\}$, where $z_i \in \mathbb{R}^m$. Then we run the k-means clustering algorithm on $Z$ to get the k-means centroids $\{\mu_1, \dots, \mu_K\}$, where $\mu_c \in \mathbb{R}^m$. The k-means centroids capture the global structure of the data points. For each point $z_i$, we compute the probability $\pi_{ic}$ that point $i$ belongs to cluster $c$ with a t-distributed kernel,

$$\pi_{ic} = \frac{\left(1 + \alpha \|z_i - \mu_c\|^2\right)^{-1}}{\sum_{c'=1}^{K} \left(1 + \alpha \|z_i - \mu_{c'}\|^2\right)^{-1}}.$$

Note the scaling factor $\alpha$ on the squared distance: it controls how tightly $\pi_{ic}$ ties each data point to its cluster centroid when the membership relations are transferred to the low-dimensional space.
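A sketch of this step with sklearn; the t-kernel membership form and the scaling factor `alpha` follow the description above, while the cluster count, PC count, and default `alpha=1.0` are illustrative choices rather than the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_memberships(X, n_clusters=30, n_pcs=50, alpha=1.0, seed=0):
    """PCA embedding -> k-means centroids -> Student-t membership
    probabilities of each point in each cluster (rows sum to 1)."""
    Z = PCA(n_components=min(n_pcs, X.shape[1])).fit_transform(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Z)
    sq = ((Z[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    w = 1.0 / (1.0 + alpha * sq)
    return w / w.sum(axis=1, keepdims=True), km.cluster_centers_
```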
To transfer the global structure information to the low-dimensional map points, we use the t-distributed kernel to characterize the centroid relations in the high-dimensional space,

$$\hat{p}_{cc'} = \frac{\left(1 + \|\mu_c - \mu_{c'}\|^2\right)^{-1}}{\sum_{a \neq b} \left(1 + \|\mu_a - \mu_b\|^2\right)^{-1}}.$$

To characterize the low-dimensional macro structure, we define the low-dimensional centroids $\tilde{\mu}_c$ as membership-weighted means of the map points,

$$\tilde{\mu}_c = \frac{\sum_{i=1}^{n} \pi_{ic}\, y_i}{\sum_{i=1}^{n} \pi_{ic}},$$

and define the corresponding low-dimensional t-distributed macro neighborhood relations by

$$\hat{q}_{cc'} = \frac{\left(1 + \|\tilde{\mu}_c - \tilde{\mu}_{c'}\|^2\right)^{-1}}{\sum_{a \neq b} \left(1 + \|\tilde{\mu}_a - \tilde{\mu}_b\|^2\right)^{-1}}$$

for $c \neq c'$, and $\hat{q}_{cc} = 0$.
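The low-dimensional centroids and their t-distributed affinities can be sketched as follows; the membership-weighted-mean form of the centroids is an assumption consistent with the description above:

```python
import numpy as np

def low_dim_centroids(Y, Pi):
    """Map-space centroids: Pi[i, c] is the high-dimensional membership
    probability of point i in cluster c; each centroid is the
    Pi-weighted mean of the map points."""
    w = Pi / Pi.sum(axis=0, keepdims=True)   # normalize weights per cluster
    return w.T @ Y                           # (K, d) centroids

def macro_affinities(M):
    """Student-t pairwise affinities between the K centroids, diagonal zero."""
    sq = ((M[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    inv = 1.0 / (1.0 + sq)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()
```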
Now we can write down the GTSNE loss function,

$$C_{\mathrm{gtsne}} = C_{\mathrm{tsne}} + \lambda_1 C_{\mathrm{macro}} + \lambda_2 C_{\mathrm{kmeans}},$$

where $\lambda_1, \lambda_2$ are weight parameters. The loss is composed of three parts. The first part is the t-SNE loss $C_{\mathrm{tsne}} = \mathrm{KL}(P \,\|\, Q)$, which penalizes the mismatch between $p_{ij}$ and $q_{ij}$ so that the map maintains the local neighborhood relations. The second part is the macro loss $C_{\mathrm{macro}} = \mathrm{KL}(\hat{P} \,\|\, \hat{Q})$, which makes the low-dimensional centroid relations match the high-dimensional centroid relations. The third part is the k-means loss $C_{\mathrm{kmeans}}$, which makes the map points satisfy the centroid membership relations $\pi_{ic}$.
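A sketch of the three-part loss, assuming each part is a KL divergence between matched high- and low-dimensional distributions (the argument names, including the low-dimensional membership matrix `Rho`, are illustrative):

```python
import numpy as np

def kl(P, Q, eps=1e-12):
    """KL divergence summed over all entries; zeros of P contribute nothing."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

def gtsne_loss(P, Q, P_hat, Q_hat, Pi, Rho, lam1=1.0, lam2=1.0):
    """Local t-SNE KL + weighted macro (centroid) KL + weighted
    point-to-centroid membership KL."""
    return kl(P, Q) + lam1 * kl(P_hat, Q_hat) + lam2 * kl(Pi, Rho)
```

When each low-dimensional distribution exactly matches its high-dimensional counterpart, every KL term vanishes and the loss is zero, its global minimum.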
After some calculation, we obtain the gradient of the loss, $\partial C_{\mathrm{gtsne}} / \partial y_i$; the derivation is given in Appendix B.
We use the gradient descent method to optimize the loss function. The adaptive learning rate scheme described by Jacobs (Jacobs1988) is used, which gradually increases the learning rate in directions in which the gradient is stable.
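This scheme can be sketched as the "gains" update found in standard t-SNE implementations; the constants 0.2, 0.8, and the minimum gain are the values commonly used there, assumed rather than taken from this paper:

```python
import numpy as np

def gains_update(gains, grad, velocity, min_gain=0.01):
    """Jacobs-style per-parameter rates: grow a gain by 0.2 where the gradient
    sign disagrees with the current velocity (a stable descent direction, since
    the velocity accumulates negative gradients), shrink it by a factor of 0.8
    where the signs agree, and clip from below at min_gain."""
    same = np.sign(grad) == np.sign(velocity)
    return np.maximum(np.where(same, gains * 0.8, gains + 0.2), min_gain)
```

The map update then typically reads `velocity = momentum * velocity - eta * gains * grad; Y += velocity`.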
We now present the GTSNE algorithm (Algorithm 1), which spells out the implementation details.
Implementation details. We use the Barnes–Hut quadtree approximation (BH-SNE) to compute the t-SNE part of the gradient.
To evaluate the performance of GTSNE, we compare it with the PCA, t-SNE, and UMAP algorithms on both simulated data and real data.
The parameters are set to their default values for each algorithm.
PCA from sklearn.decomposition.PCA.
GTSNE: , , , .
t-SNE from sklearn.manifold.TSNE: .
UMAP from umap.UMAP: "n_neighbors"= 30, "min_dist"= 0.3.
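As a concrete illustration, the sklearn baselines above can be run as follows. The iris data is used here as a stand-in dataset, and the t-SNE settings shown are sklearn defaults rather than necessarily the exact values used in the comparison; the UMAP call (with the parameters listed above) is shown in a comment since the package may not be installed:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_iris().data

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30.0, init="pca",
                  random_state=0).fit_transform(X),
    # UMAP, if installed:
    # "UMAP": umap.UMAP(n_neighbors=30, min_dist=0.3).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)   # each embedding is (n_samples, 2)
```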
3.1 Simulation Data
We first run the algorithms on simulated data to verify the effectiveness of GTSNE. The simulated data are three continuous lines in the high-dimensional space, generated as follows:

1. Generate a velocity $v \in \mathbb{R}^D$ by random sampling from the normal distribution, i.e., $v \sim \mathcal{N}(0, I_D)$.
2. Choose three starting points $s_1, s_2, s_3 \in \mathbb{R}^D$, constructed from the zero vector and the ones vector of length $D$.
3. Generate the data points by starting from each of the three starting points and moving along the velocity step by step, so that each line consists of $s_k, s_k + v, s_k + 2v, \dots$.
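The generation procedure above can be sketched as follows; the dimension, number of points per line, step size, and the particular choice of starting points are illustrative, not the paper's exact values:

```python
import numpy as np

def make_three_lines(n_per_line=200, dim=10, step=0.05, seed=0):
    """Three lines traced by moving each starting point along one shared
    random velocity, step by step."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)                               # shared random velocity
    starts = [np.zeros(dim), np.ones(dim), -np.ones(dim)]  # assumed start points
    lines = [np.array([s + t * step * v for t in range(n_per_line)])
             for s in starts]
    return np.vstack(lines)   # (3 * n_per_line, dim)
```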
After generating the data and running the dimension-reduction methods, we obtain the results shown in Fig. 1. The figure shows that t-SNE breaks the lines, while GTSNE and UMAP keep the lines continuous, which demonstrates the effectiveness of GTSNE.
Why does t-SNE break the continuous line? From the figure, we see that two horizontally neighboring points are separated by the vertical line. Once t-SNE runs into this state, the gradient at one of the neighboring points is driven by two forces. The attractive force comes from its neighbors, pulling the point toward them. The repulsive force comes from the points on the middle line, pushing the point away from them. When the two forces balance, i.e., cancel to zero, the point no longer moves as the algorithm runs. Thus t-SNE falls into a local minimum of the loss and cannot escape it by gradient descent. In the GTSNE loss, the macro term strengthens the attractive force between the two neighboring points: if they are not close to each other, the low-dimensional centroid probabilities do not match their high-dimensional counterparts, so the macro term restores the continuity of the lines.
However, the figure also shows that GTSNE twists the lines in the low-dimensional map. This needs to be improved, which we leave to future work.
3.2 Real Data
3.2.1 Five toy datasets
Now we run the algorithms on five well-known toy datasets. Their information is summarized in Table 1.
| Dataset | Description |
| --- | --- |
| Blobs | Isotropic Gaussian blobs |
| Iris | The iris dataset |
| Wine | The wine dataset |
| Swiss Roll | The Swiss roll dataset |
| Sphere | The sphere dataset |
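Plausible constructions of these five datasets with sklearn (the sample counts are illustrative, and the sphere construction, normalizing Gaussian draws onto the unit sphere, is an assumption since the paper does not specify it):

```python
import numpy as np
from sklearn import datasets

blobs, _ = datasets.make_blobs(n_samples=500, centers=5, random_state=0)
iris = datasets.load_iris().data
wine = datasets.load_wine().data
swiss, _ = datasets.make_swiss_roll(n_samples=1000, random_state=0)
# Unit sphere: normalize standard-normal draws to length 1.
g = np.random.default_rng(0).normal(size=(1000, 3))
sphere = g / np.linalg.norm(g, axis=1, keepdims=True)
```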
After running the algorithms, we obtain the results shown in Fig. 2.
From the results, we see that GTSNE works well on these datasets. On the Swiss Roll dataset, GTSNE preserves the continuous circular structure, while t-SNE and UMAP recover only half of the circle. On the Sphere dataset, GTSNE preserves the spherical shape better than t-SNE and UMAP do.
3.2.2 MNIST dataset
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 pixel image, so the whole dataset contains 70,000 examples.
From the result, we see that GTSNE produces a representation comparable to those of t-SNE and UMAP.
3.2.3 Pancreas dataset
We now run the algorithms on the Pancreas dataset, which was used in dynamical-RNA-velocity. It is a single-cell RNA-seq dataset. After selecting the velocity genes, we obtain the final dataset of cells by selected velocity genes. After running the algorithms, we obtain the results shown in Fig. 4.
From the results, we see that GTSNE works similarly to t-SNE and UMAP. GTSNE generates a continuous map similar to that of UMAP, while the t-SNE result has some breaks in the continuous structure.
GTSNE uses the k-means method on the PCA embedding of the high-dimensional data to grasp the macro structure, and tries to preserve the relative relations of the centroids, via probabilities, in the low-dimensional space. However, it has some limitations.
The first problem of GTSNE is that it runs slowly on large datasets: on the MNIST dataset, it takes about one and a half hours. A more efficient implementation is needed.
The second, more essential question is how to define the macro structure. In this paper, we use the k-means centroids to characterize it, but this is only an initial attempt. Whether more principled definitions exist remains an open question.
In this paper, we proposed a visualization method, GTSNE, a modified version of t-SNE. It includes the macro structure in the loss function so that the low-dimensional map preserves the macro structure. We hope this method will help data visualization tasks that need to preserve macro structure.
Thanks to my family (especially my mother, Qixia Chen, and my father, Wenjiang Shi) for providing me a suitable environment for this work. Thanks to all the teachers who taught me and guided me toward the road of truth.
Appendix A. Code availability
GTSNE is available as a Python package at https://github.com/songtingstone/gtsne. Scripts to reproduce the results of the primary analyses will be made available at https://github.com/songtingstone/gtsne2021. The code learns from and adapts the C implementation of BH-SNE (BH-SNE) at https://github.com/danielfrg/tsne; special thanks to Laurens van der Maaten and Daniel Rodriguez.
Appendix B. Derivation of the GTSNE gradient.
The derivation of the GTSNE gradient is similar to that of the t-SNE gradient. We now give the details. The loss function of GTSNE consists of three parts,

$$C_{\mathrm{gtsne}} = C_{\mathrm{tsne}} + \lambda_1 C_{\mathrm{macro}} + \lambda_2 C_{\mathrm{kmeans}},$$

and so does its gradient,

$$\frac{\partial C_{\mathrm{gtsne}}}{\partial y_i} = \frac{\partial C_{\mathrm{tsne}}}{\partial y_i} + \lambda_1 \frac{\partial C_{\mathrm{macro}}}{\partial y_i} + \lambda_2 \frac{\partial C_{\mathrm{kmeans}}}{\partial y_i}.$$

The first part is the standard t-SNE gradient,

$$\frac{\partial C_{\mathrm{tsne}}}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij}) \left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j).$$
The second part is
The third part is
The derivation is finished.