Visualizing Data using GTSNE

08/03/2021 ∙ by Songting Shi, et al.

We present a new method, GTSNE, to visualize high-dimensional data points in a two-dimensional map. The technique is a variation of t-SNE that produces better visualizations by capturing both the local neighborhood structure and the macro structure in the data. This is particularly important for high-dimensional data that lie on continuous low-dimensional manifolds. We illustrate the performance of GTSNE on a wide variety of datasets and compare it with state-of-the-art methods, including t-SNE and UMAP. On almost all of the datasets, the visualizations produced by GTSNE preserve the macro structure better than those produced by the other techniques.




1 Introduction

High-dimensional data visualization is an important problem because it lets humans make sense of the data. Currently, the state-of-the-art methods are t-SNE and UMAP, which follow similar principles for nonlinear dimension reduction. They use neighborhood probability distributions to connect the high-dimensional data points to low-dimensional map points, trying to keep the local relative neighborhood relations unchanged while ignoring changes in the macro structure of the data. As a result, the low-dimensional map points may represent the high-dimensional structure unfaithfully. In the process of keeping and patching together low-dimensional neighborhoods, t-SNE sometimes breaks neighborhood relations of the high-dimensional structure in the low-dimensional space. We add a macro loss term to the t-SNE loss so that the relative structure of the k-means centroids is kept between the high- and low-dimensional spaces, which essentially keeps the macro structure unchanged in the low-dimensional space.

2 Methods

We now derive the loss function of the global t-distributed stochastic neighbor embedding (GTSNE). Suppose that there are $n$ points in the high-dimensional space, $x_1, \dots, x_n$, where $x_i \in \mathbb{R}^D$. We want to get their low-dimensional embedding map points, $y_1, \dots, y_n$, where $y_i \in \mathbb{R}^d$ and $d$ is small (here $d = 2$, a two-dimensional map).

2.1 The Loss of t-SNE

Recall that the loss function of t-SNE is given by

$$KL(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{1}$$

where $p_{ij}$ is the probability of high-dimensional point $x_i$ connecting to $x_j$, and $q_{ij}$ is the probability of low-dimensional point $y_i$ connecting to $y_j$. The probabilities $p_{ij}$ and $q_{ij}$ characterize the neighborhood relation of points $i$ and $j$. Two close points have a higher probability than two far-separated points. The key is to seek probability distributions in both the high-dimensional space and the low-dimensional space such that they match each other, i.e., when $q_{ij} \approx p_{ij}$ we have the best layout in the low-dimensional space.

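t-SNE uses a Gaussian kernel to model the neighborhood relations in the high-dimensional space, i.e.

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{i|i} = 0, \tag{2}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \tag{3}$$

where the bandwidth $\sigma_i$ is chosen such that it satisfies the perplexity equation

$$\mathrm{Perplexity} = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}. \tag{4}$$

Intuitively, this means that the conditional distribution $p_{\cdot|i}$ effectively distinguishes about Perplexity neighbors of $x_i$.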
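As a minimal sketch of how the bandwidth can be calibrated, the following NumPy code searches by bisection for the precision `beta = 1 / (2 sigma^2)` at which the Gaussian conditional distribution of one point reaches a target perplexity. The function names, the tolerance, and the iteration cap are illustrative assumptions and are not taken from the GTSNE code.

```python
import numpy as np

def conditional_probs(sq_dists, beta):
    """Gaussian conditional probabilities for one point, given squared
    distances to the other points and beta = 1 / (2 sigma^2)."""
    logits = -beta * sq_dists
    logits = logits - logits.max()   # stabilise the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity 2^H(p), with H the Shannon entropy in bits."""
    nz = p[p > 0]
    return 2.0 ** (-(nz * np.log2(nz)).sum())

def calibrate_beta(sq_dists, target, tol=1e-5, max_iter=200):
    """Bisection on beta: larger beta -> narrower kernel -> lower perplexity."""
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        perp = perplexity(conditional_probs(sq_dists, beta))
        if abs(perp - target) < tol:
            break
        if perp > target:            # kernel too wide: increase beta
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else 0.5 * (lo + hi)
        else:                        # kernel too narrow: decrease beta
            hi = beta
            beta = 0.5 * (lo + hi)
    return beta

# Toy usage: calibrate the kernel of point 0 against the other 49 points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
sq = ((X[0] - X[1:]) ** 2).sum(axis=1)
beta0 = calibrate_beta(sq, target=10.0)
p0 = conditional_probs(sq, beta0)
```

Searching over `beta` rather than the bandwidth itself is a common convenience, because the perplexity decreases monotonically as `beta` grows.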

To solve the crowding problem, i.e., that preserving the distance relations in the low-dimensional space forces moderately separated points of the high-dimensional space to cluster together in the low-dimensional space, t-SNE uses the heavy-tailed t-distribution to model the low-dimensional neighborhood relations,

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \tag{5}$$

for $i \neq j$, and $q_{ii} = 0$.
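The low-dimensional affinities and the Kullback-Leibler loss of t-SNE can be sketched in a few lines of NumPy. This is a dense illustration for small inputs, assuming a Student-t kernel with one degree of freedom; the function names and the `eps` guard are illustrative choices.

```python
import numpy as np

def student_t_affinities(Y):
    """Affinities q_ij from a Student-t kernel with one degree of freedom,
    normalised over all ordered pairs, with q_ii = 0."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    w = 1.0 / (1.0 + sq)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the entries where P is positive."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))).sum())

# Toy usage: compare a uniform P with the affinities of a random 2D layout.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))
Q = student_t_affinities(Y)
P = np.ones((6, 6))
np.fill_diagonal(P, 0.0)
P /= P.sum()
loss = kl_divergence(P, Q)
```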

In the above formulation, t-SNE only captures the local neighborhood relations in the low-dimensional embedding. In our numerical experiments, we find that the t-SNE map points cannot always faithfully represent the high-dimensional data points. There are two problems. The first is that t-SNE cannot fully preserve the local neighborhood relations. This occurs when a line of other map points ends up between two neighboring points in the low-dimensional layout: t-SNE then keeps the two points on opposite sides of the line and pushes them away from it. This problem arises because the t-SNE loss is non-convex and therefore hard to optimize. Once a line lies between two nearby points, it is hard to move the line out from between them. The second problem is that t-SNE cannot preserve the macro structure of the data; e.g., when it projects a three-dimensional sphere into 2D space, the result does not have a circular boundary. To overcome these two problems, we propose the following GTSNE loss, which considers both the local neighborhood structure and the macro structure of the data points.

2.2 Global t-Distributed Stochastic Neighbor Embedding

To characterize the global structure of the high-dimensional points, we use the k-means clustering centroids in the high-dimensional space: the neighborhood relations among the centroids, together with the relations between the data points and the centroids, represent the macro structure. To do this, we first run PCA on the data to get a PCA embedding. Then we run the k-means clustering algorithm on the PCA embedding to get the k-means centroids. The k-means centroids capture the global structure of the data points. For each point, we calculate the probability that the point belongs to each cluster using the t-distribution.


Note that we apply a scaling factor to the distance, since this probability will be used to represent the membership of each data point in its cluster centroid in the low-dimensional space.
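The PCA, k-means, and soft-assignment steps described above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions rather than the GTSNE implementation: the Student-t kernel with one degree of freedom, the row normalization, the hypothetical `scale` parameter standing in for the scaling factor, and the default values of `n_components` and `n_clusters` are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def macro_structure(X, n_components=5, n_clusters=4, scale=1.0, seed=0):
    """Return the PCA embedding, the k-means centroids, and a soft cluster
    assignment matrix built from a Student-t kernel (assumed form)."""
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Z)
    C = km.cluster_centers_
    # Squared distance from every point to every centroid, shape (n, K).
    sq = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    W = 1.0 / (1.0 + sq / scale)          # heavy-tailed kernel, assumed form
    A = W / W.sum(axis=1, keepdims=True)  # each row sums to one
    return Z, C, A

# Toy usage: four well-separated Gaussian groups in 20 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 20)) for c in (0.0, 5.0, 10.0, 15.0)])
Z, C, A = macro_structure(X)
```

Each row of `A` is a probability vector over clusters, and its largest entry corresponds to the nearest centroid because the kernel decreases with distance.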

To transfer the global structure information to the low-dimensional map points, we use the t-distribution to characterize the relations among these centroids.


To characterize the low-dimensional macro structure, we define low-dimensional centroids from the map points.


We then define the corresponding low-dimensional t-distributed macro neighborhood relations among the low-dimensional centroids.

Now we can write the GTSNE loss function as a sum of three parts, with weight parameters on the macro loss and the k-means loss. The first part is the t-SNE loss, which penalizes the mismatch between $p_{ij}$ and $q_{ij}$ so that the map points maintain the local neighborhood relations. The second part is the macro loss, which tries to make the relations among the low-dimensional centroids match those among the high-dimensional centroids. The third part is the k-means loss, which tries to make the map points satisfy the cluster membership relations of the high-dimensional points.

After some calculation (see Appendix B), we obtain the gradient of the loss with respect to the map points.


We use gradient descent to optimize the loss function, with the adaptive learning rate scheme described by Jacobs (Jacobs1988), which gradually increases the learning rate in directions in which the gradient is stable.
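A minimal sketch of gradient descent with momentum and per-coordinate adaptive gains in the spirit of the Jacobs scheme is given below. The gain increment of 0.2, the decay factor of 0.8, and the gain floor of 0.01 are values commonly seen in t-SNE implementations and are assumptions here, as are the function and parameter names; the toy quadratic objective is only for illustration.

```python
import numpy as np

def adaptive_gd(grad_fn, y0, lr=0.01, momentum=0.5, n_iter=500,
                gain_up=0.2, gain_down=0.8, min_gain=0.01):
    """Gradient descent with momentum and per-coordinate adaptive gains."""
    y = np.array(y0, dtype=float)
    update = np.zeros_like(y)
    gains = np.ones_like(y)
    for _ in range(n_iter):
        g = grad_fn(y)
        # The gradient is treated as stable in a coordinate when its sign
        # differs from the sign of the previous update (the previous step
        # already moved against this gradient direction).
        stable = np.sign(g) != np.sign(update)
        gains = np.where(stable, gains + gain_up, gains * gain_down)
        gains = np.maximum(gains, min_gain)
        update = momentum * update - lr * gains * g
        y = y + update
    return y

# Toy usage: minimise 0.5 * sum(a * y^2), whose gradient is a * y.
a = np.array([1.0, 10.0])
y_final = adaptive_gd(lambda y: a * y, y0=[1.0, 1.0])
```

The gains grow additively while the descent direction is consistent and shrink multiplicatively after an overshoot, which lets flat and steep coordinates use different effective step sizes.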

We now give the GTSNE algorithm (Algorithm 1) to guide the details of the implementation.

function GTSNE(dataset, perplexity, further parameters as listed below)
    Input: dataset.
    Cost function parameters: perplexity, weight parameter of the macro loss, weight parameter of the k-means loss.
    Optimization parameters: learning rate, momentum scalar.
    Result: low-dimensional data representation.
    Sample the initial solution at random.
    Initialize the momentum-accumulated gradient.
    Compute the PCA embedding of the dataset via the SVD. Run the k-means algorithm on the PCA embedding with the given number of clusters to get the k-means centroids. Compute the cluster assignment probability matrix by equation (6). Compute the macro probability matrix by equation (7).
    Search the K nearest neighbors of each data point under the Euclidean distance, using the vantage-point tree algorithm (vptree).
    Compute the probability matrix by equation (3).
    repeat
        Compute the gradient of the loss with respect to the map points, as given in equation (11).
        Update the gains of the gradient.
        Update the momentum-accumulated gradient.
        Update the map points.
    until convergence
    return the low-dimensional data representation
Algorithm 1 GTSNE: Global t-distributed Stochastic Neighbor Embedding

Implementation details. We use a quadtree (the Barnes-Hut approximation, BH-SNE) to compute the t-SNE part of the gradient approximately.

3 Experiments

To evaluate the performance of GTSNE, we compare it with PCA, t-SNE, and UMAP on both simulated and real data.

The parameters of each algorithm are set to their default values, except where listed below.

  • PCA from sklearn.decomposition.PCA.

  • GTSNE: , , , .

  • t-SNE from sklearn.manifold.TSNE: .

  • UMAP from umap.UMAP: n_neighbors = 30, min_dist = 0.3.
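The baseline configuration listed above can be sketched as follows. The GTSNE and t-SNE settings are not recoverable from the list, so t-SNE is shown with scikit-learn defaults; the helper names, the fixed `random_state`, and the use of the iris data as a stand-in input are assumptions for illustration. The UMAP helper requires the separate umap-learn package and is therefore only defined, not called.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def build_sklearn_baselines(random_state=0):
    """PCA and t-SNE baselines, both mapping to two dimensions."""
    return {
        "PCA": PCA(n_components=2),
        "t-SNE": TSNE(n_components=2, random_state=random_state),
    }

def build_umap(random_state=0):
    """UMAP baseline with the settings listed above (needs umap-learn)."""
    import umap  # imported lazily so the rest runs without umap-learn
    return umap.UMAP(n_neighbors=30, min_dist=0.3, random_state=random_state)

# Toy usage on a small dataset.
X = load_iris().data
embeddings = {name: model.fit_transform(X)
              for name, model in build_sklearn_baselines().items()}
```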

3.1 Simulation Data

We first run the algorithms on simulated data to verify the effectiveness of GTSNE. The simulated data are three continuous lines in a high-dimensional space, generated as follows:

  1. Generate the velocity vector by random sampling from the normal distribution.

  2. Choose three starting points for the lines, built from the zeros vector and the ones vector, each with length equal to the data dimension.

  3. Generate the data points from the three starting points by moving along the velocity one step at a time.

After generating the data and running the dimension reduction methods, we get the results shown in Fig 1. The figure shows that t-SNE breaks the lines, while GTSNE and UMAP keep the lines continuous, which demonstrates the effectiveness of GTSNE.
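The generation steps above can be sketched as follows. The number of points per line, the dimension, the start-point offsets (multiples of the ones vector, with offset zero giving the zeros vector), the unit step size, and the single velocity shared by all three lines are illustrative assumptions, since the exact values are not given here.

```python
import numpy as np

def three_lines(n_per_line=100, dim=50, offsets=(0.0, 1.0, 2.0), seed=0):
    """Three parallel lines in a dim-dimensional space (assumed setup)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)                        # velocity from N(0, I)
    steps = np.arange(1, n_per_line + 1)[:, None]   # step index 1, ..., n
    points, labels = [], []
    for k, c in enumerate(offsets):
        start = c * np.ones(dim)                    # c = 0 is the zeros vector
        points.append(start + steps * v)            # move along the velocity
        labels.append(np.full(n_per_line, k))
    return np.vstack(points), np.concatenate(labels), v

X_lines, line_id, velocity = three_lines()
```

Within each line, consecutive points differ exactly by the velocity vector, so a faithful embedding should show three unbroken curves.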

Why does t-SNE break the continuous lines? From the figure, we see that two horizontally neighboring points are separated by a vertical line. Once t-SNE enters this state, the gradient at one of the neighboring points is driven by two forces. The attractive force comes from its neighbors and pulls the point toward them. The repulsive force comes from the points on the line in the middle and pushes the point away from them. When the two forces balance each other, i.e., cancel to zero, the point stops moving as the algorithm runs. Thus t-SNE falls into a local minimum of the loss and cannot escape from it by gradient descent. In the GTSNE loss, the macro loss strengthens the attractive force between the two neighboring points: if they are not close to each other, the centroid probabilities do not match their high-dimensional counterparts. This keeps the lines continuous.

However, the figure also shows that GTSNE twists the lines in the low-dimensional map. This needs to be improved and is left to future work.

Figure 1: The three lines dataset.

3.2 Real Data

3.2.1 Five toy datasets

We now run the algorithms on five well-known toy datasets, which are summarized in Table 1.

Dataset      Dimension   Description
Blobs                    Isotropic Gaussian blobs
Iris                     The iris dataset
Wine                     The wine dataset
Swiss Roll               Swiss roll dataset
Sphere                   The sphere dataset
Table 1: The toy datasets, with the number of classes where the dataset has class labels.
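For readers who want comparable inputs, the following sketch builds stand-ins for the five toy datasets with scikit-learn and NumPy. The sample sizes, the number of blob centers and features, and the sphere sampler are assumptions for illustration; the exact datasets used for Table 1 may differ.

```python
import numpy as np
from sklearn.datasets import load_iris, load_wine, make_blobs, make_swiss_roll

def sample_sphere(n_samples=1000, seed=0):
    """Points on the unit sphere in 3D, from normalised Gaussian samples."""
    rng = np.random.default_rng(seed)
    pts = rng.normal(size=(n_samples, 3))
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def load_toy_datasets(n_samples=1000, seed=0):
    """Return a dict mapping dataset name to (data, labels or None)."""
    blobs_X, blobs_y = make_blobs(n_samples=n_samples, centers=3,
                                  n_features=10, random_state=seed)
    iris = load_iris()
    wine = load_wine()
    # The second output of make_swiss_roll is the position along the roll.
    roll_X, roll_t = make_swiss_roll(n_samples=n_samples, random_state=seed)
    return {
        "Blobs": (blobs_X, blobs_y),
        "Iris": (iris.data, iris.target),
        "Wine": (wine.data, wine.target),
        "Swiss Roll": (roll_X, roll_t),
        "Sphere": (sample_sphere(n_samples, seed), None),
    }

toy = load_toy_datasets()
```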

After running the algorithms, we get the results shown in Fig 2.

Figure 2: Five toy datasets.

From the results, we see that GTSNE works well on these datasets. On the Swiss Roll dataset, GTSNE preserves the continuous circular structure, while t-SNE and UMAP recover only half of the circle. On the Sphere dataset, GTSNE preserves the spherical shape better than t-SNE and UMAP do.

3.2.2 MNIST dataset

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Each example is an image of 28 × 28 pixels. The whole dataset contains 70,000 examples.

Figure 3: MNIST dataset with dimensions

From the results, we see that GTSNE produces a representation comparable to those of t-SNE and UMAP.

3.2.3 Pancreas dataset

We now run the algorithms on the Pancreas dataset, a single-cell RNA-seq dataset that was used in dynamical-RNA-velocity. After selecting the velocity genes, we obtain the final dataset, whose rows are cells and whose columns are the selected velocity genes. After running the algorithms, we get the results shown in Fig 4.

Figure 4: The Pancreas dataset.

From the results, we see that GTSNE performs similarly to t-SNE and UMAP. GTSNE generates a continuous map similar to that of UMAP, while the t-SNE result has some breaks in the continuous structure.

4 Discussion

GTSNE uses the k-means method on the PCA embedding of the high-dimensional data to capture the macro structure, and tries to preserve the relative relations of the centroids, expressed as probabilities, in the low-dimensional space. It has some limitations, however.

The first problem is that GTSNE runs slowly on large datasets: for the MNIST dataset, it takes about one and a half hours. A more efficient implementation is needed.

The second, more fundamental question is how to define the macro structure. In this paper, we use the k-means centroids to characterize it, which is only a first attempt. Can more reasonable solutions be found? We leave this question open to the reader.

5 Conclusion

In this paper, we proposed a visualization method, GTSNE, which is a modified version of t-SNE. It includes the macro structure in the loss function so that the low-dimensional map preserves the macro structure. We hope that this method will help in data visualization tasks that need to preserve macro structure.


Thanks to my family (especially my mother, Qixia Chen, and my father, Wenjiang Shi) for providing me with a suitable environment for this work. Thanks to all the teachers who taught me and guided me onto the road of truth.

Appendix A. Code availability

GTSNE is available as a Python package. Scripts to reproduce the results of the primary analyses will be made available. The code is adapted from the C implementation of BH-SNE (BH-SNE); special thanks to Laurens van der Maaten and Daniel Rodriguez.

Appendix B. Derivation of the GTSNE gradient.

The derivation of the GTSNE gradient is similar to that of the t-SNE gradient. We now give the details. The loss function of GTSNE consists of three parts,

and so does its gradient,


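The first part is the gradient of the t-SNE loss with respect to the map point $y_i$,

$$\frac{\partial\, KL(P \,\|\, Q)}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij}) \left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j). \tag{15}$$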

The second part is


The third part is


Substituting the three parts (15), (18), and (19) into equation (12), we get the gradient of GTSNE,


This completes the derivation.
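The t-SNE part of the gradient can be checked numerically. The sketch below compares the standard analytic t-SNE gradient, 4 Σ_j (p_ij − q_ij)(1 + ‖y_i − y_j‖²)⁻¹ (y_i − y_j), with central finite differences of the KL loss on a small random problem. It covers only the first of the three parts; the function names, the problem size, and the step `eps` are illustrative assumptions.

```python
import numpy as np

def kernel_and_q(Y):
    """Student-t kernel weights and normalised affinities q_ij."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    w = 1.0 / (1.0 + sq)
    np.fill_diagonal(w, 0.0)
    return w, w / w.sum()

def tsne_loss(P, Y):
    """KL(P || Q) over all ordered pairs with p_ij > 0."""
    _, Q = kernel_and_q(Y)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

def tsne_grad(P, Y):
    """Analytic gradient: 4 * sum_j (p_ij - q_ij) w_ij (y_i - y_j)."""
    w, Q = kernel_and_q(Y)
    M = (P - Q) * w
    return 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)

def numerical_grad(f, Y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += eps
        Ym[idx] -= eps
        G[idx] = (f(Yp) - f(Ym)) / (2.0 * eps)
    return G

# Toy problem: a random symmetric P with zero diagonal that sums to one.
rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(size=(n, 2))

analytic = tsne_grad(P, Y)
numeric = numerical_grad(lambda Z: tsne_loss(P, Z), Y)
max_err = float(np.abs(analytic - numeric).max())
```

The same finite-difference check could be applied to the macro and k-means parts once their loss terms are implemented.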