Cluster analysis of high dimensional datasets is a core component of unsupervised learning and a common task in many academic settings and practical applications. A widely used practice to more easily visualize the cluster structure present in the data is to first reduce the data to a lower dimension, typically two. A popular tool to perform this dimensionality reduction is the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, introduced by van der Maaten and Hinton [van der Maaten and Hinton(2008)]. t-SNE has been empirically shown to preserve the local neighborhood structure of points in many applications, and the popularity of this method can be seen in the number of citations of the original paper and its implementations in popular packages such as scikit-learn.
The main drawback of t-SNE, however, is its large computational cost, a big part of which can be written as $O(\#\text{ of distances computed} \cdot d)$, where $d$ is the dimension of the input points. This drawback severely limits its potential use in unsupervised learning problems. To combat the large computational cost of t-SNE, many practical developments have been made using techniques such as parallel computation, GPU, and multicore implementations, with a large chunk of work on reducing the number of pairwise distances computed between the input points [Chan et al.(2019)Chan, Rao, Huang, and Canny, Pezzotti et al.(2019)Pezzotti, Thijssen, Mordvintsev, Höllt, Van Lew, Lelieveldt, Eisemann, and Vilanova, van der Maaten(2014)]. We take a different approach by focusing instead on reducing the value of $d$ in the runtime cost.
In this work, we show that the distance computation can be optimized by first projecting the input data into a much smaller random subspace before running t-SNE. Our motivation for this approach comes from the field of metric embeddings where it is known that random projections preserve many geometric structures of the input data. For more information, see Section 1.3. More formally, our contributions are the following:
We empirically show that projecting to a very small dimension (even a small fraction of the original dimension in some cases) and then performing t-SNE preserves local cluster information just as well as t-SNE with no prior dimensionality reduction step.
We empirically show that performing dimensionality reduction first and then running t-SNE is significantly faster than just running t-SNE.
We note that our last result is implementation agnostic, since every implementation must compute distances among some of the input points. To show the benefits of our approach and simplify our experiments, we use two implementations of t-SNE: a widely available but slower scikit-learn implementation, and a highly optimized Python implementation called ‘openTSNE’ [Poličar et al.(2019)Poličar, Stražar, and Zupan]. For more details about our experimental setup, see Section 2.1.
1.1 Measuring Quality of an Embedding
We devise an accuracy score to measure the quality of the cluster structure of an embedding. First we note that ideal clusters are ones in which every element has the same label as the others in its cluster. If one has labels for only a few datapoints, such a clustering would allow much of the dataset to be correctly labeled. Thus ideally, such a clustering would be created without using labels (note that neither random projections nor t-SNE use labels to determine embeddings). A datapoint is said to be “correctly clustered” in the low-dimensional embedding if its label matches that of its nearest neighbor. The accuracy score of an embedding is the fraction of datapoints that are correctly clustered. This score could be refined by instead considering the modal label of the $k$ nearest neighbors. The accuracy score rewards the case where clusters in the high dimension stay together in the lower dimension, and penalizes the case where distinct clusters in the higher dimension merge in the lower dimension, since then we expect the nearest neighbors of many points to have a differing label. In Section 2.2, we compare the accuracy scores of performing dimensionality reduction followed by t-SNE versus performing only t-SNE on various labeled datasets.
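The accuracy score above can be sketched in a few lines of NumPy; the function name and the brute-force distance matrix are illustrative choices for exposition, not taken from our experimental code.

```python
import numpy as np

def accuracy_score(embedding, labels):
    """Fraction of points whose nearest neighbor in the embedding
    shares their label (the 1-NN accuracy score described above)."""
    # pairwise squared distances in the low-dimensional embedding
    d2 = ((embedding[:, None, :] - embedding[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)   # a point is not its own neighbor
    nearest = d2.argmin(axis=1)    # index of each point's 1-nearest neighbor
    return np.mean(labels[nearest] == labels)
```

A refinement to the modal label of the $k$ nearest neighbors would replace `argmin` with a partial sort of each row and a majority vote.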
1.2 Overview of t-SNE
Given a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$, t-SNE first computes the similarity score between $x_i$ and $x_j$ defined as $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, where for a fixed $i$,
$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
for some parameter $\sigma_i$. Intuitively, the value $p_{ij}$ measures the ‘similarity’ between points $x_i$ and $x_j$. t-SNE then aims to learn the lower dimensional points $y_1, \ldots, y_n$ such that if
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq \ell} \left(1 + \|y_k - y_\ell\|^2\right)^{-1}},$$
then the embedding minimizes the Kullback–Leibler divergence $\sum_{i \neq j} p_{ij} \log(p_{ij}/q_{ij})$ of the distribution $Q = (q_{ij})$ from the distribution $P = (p_{ij})$. For a more detailed explanation of the t-SNE algorithm, see [van der Maaten and Hinton(2008)].
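As an illustration of the similarity scores above, here is a minimal NumPy sketch; for simplicity it uses a single fixed $\sigma$ rather than tuning each $\sigma_i$ to a target perplexity, as real t-SNE implementations do.

```python
import numpy as np

def similarity_scores(X, sigma=1.0):
    """Symmetrized t-SNE input similarities p_ij with a fixed sigma
    (real implementations tune sigma_i per point via perplexity)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                  # p_{i|i} is defined to be 0
    P /= P.sum(axis=1, keepdims=True)         # conditional p_{j|i}, rows sum to 1
    return (P + P.T) / (2 * n)                # symmetrized p_{ij}, sums to 1
```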
1.3 Motivation for Using Random Projections
Like t-SNE, a random projection is a dimensionality reduction tool, and random projections have been shown to preserve many geometric properties of the input data, such as pairwise distances, nearest neighbors, and cluster structures [Dasgupta and Gupta(2003), Indyk and Naor(2007), Makarychev et al.(2019)Makarychev, Makarychev, and Razenshteyn]. One of the key results of this field is the well-known Johnson–Lindenstrauss (JL) Lemma, which roughly states that given any set of $n$ points, a random projection of these points into dimension $O(\log n / \varepsilon^2)$ preserves all pairwise distances up to a $(1 \pm \varepsilon)$ multiplicative error [Dasgupta and Gupta(2003)]. However, it is not possible to use a random projection to project down to a very small subspace, such as the two dimensions used in most applications, while still maintaining the desired geometric properties. This is in contrast to t-SNE, which has been empirically shown to preserve many local properties of the input data even in very low dimensions (such as two). However, we can leverage the advantages of both of these approaches through the counterintuitive idea of using dimensionality reduction before dimensionality reduction. More specifically, we can hope that by first randomly projecting the input data into a smaller dimension, we preserve enough local information for t-SNE to meaningfully project the data into two dimensions while still maintaining the inherent cluster structure.
Indeed, if we consider the similarity scores computed by t-SNE, we see that if a random projection preserves pairwise distances, then the values of the similarity scores $p_{ij}$ remain (approximately) the same. Furthermore, if we only require the weaker condition that the random projection preserves nearest neighbor relations, then the values $p_{ij}$ maintain their relative order. More precisely, if $\|x_i - x_j\|$ is small, the similarity score between points $x_i$ and $x_j$ will be large, and thus t-SNE will try to place these points close together. Using a random projection from $\mathbb{R}^d$ to $\mathbb{R}^k$ with $k \ll d$, we immediately reduce the runtime from $O(\#\text{ of distances computed} \cdot d)$ to $O(\#\text{ of distances computed} \cdot k)$. We show the success of our method in Section 2.
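A Gaussian random projection of the kind discussed above is only a few lines of NumPy; the $1/\sqrt{k}$ scaling of the entries, which makes squared norms correct in expectation, is the standard JL convention rather than a detail stated in this section.

```python
import numpy as np

def random_project(X, k, seed=0):
    """Project the rows of X from d to k dimensions using a k x d
    matrix of i.i.d. N(0, 1/k) Gaussian entries."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
    # E[||R x||^2] = ||x||^2, so pairwise distances are preserved
    # in expectation, with concentration improving as k grows.
    return X @ R.T
```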
1.4 Related Works
There have been many works on improving t-SNE through better implementations, such as GPU acceleration or parallel implementations [Chan et al.(2019)Chan, Rao, Huang, and Canny, Pezzotti et al.(2019)Pezzotti, Thijssen, Mordvintsev, Höllt, Van Lew, Lelieveldt, Eisemann, and Vilanova, Ulyanov(2016)]. There has also been work on reducing the number of distances that need to be computed using tree-based methods; see [van der Maaten(2014)] for more details. To our knowledge, our work is the first to suggest using random projections in conjunction with t-SNE. We have verified this claim by going through the papers that cite the original t-SNE paper of [van der Maaten and Hinton(2008)] and searching Google Scholar for papers with the phrase ‘random projection’.
2 Experimental Results
2.1 Experimental Setup and Justification
We construct embeddings by first applying a random projection into $k$ dimensions, then using t-SNE to arrive at a two-dimensional embedding. The random projections are performed using a $k \times d$ random matrix with i.i.d. Gaussian entries, where $d$ is the original dimension of the input data. We repeat this process for several exponentially spaced values of $k$.
We use the following four datasets in our experiments: MNIST, Kuzushiji-MNIST (KMNIST, a dataset of Japanese cursive characters in the same format as MNIST) [Clanuwat et al.(2018)Clanuwat, Bober-Irizar, Kitamoto, Lamb, Yamamoto, and Ha], Fashion-MNIST (a dataset of fashion images in the same format as MNIST) [Xiao et al.(2017)Xiao, Rasul, and Vollgraf], and the Small NORB dataset (a dataset of images of toys) [LeCun et al.(2004)LeCun, Huang, and Bottou]. All of our experiments were run in Python on one core of a MacBook Pro with an Intel Core i processor. We only report the time taken to perform t-SNE, since the t-SNE runtime is many orders of magnitude larger than that of all the other steps. Our code can be accessed from the following GitHub repository: https://github.com/ssilwa/optml.
We first compare the time taken to perform t-SNE and the accuracy scores achieved when a dimensionality reduction step is used before t-SNE versus when no dimensionality reduction is used. We do so by plotting the ratios of the runtimes and of the accuracy scores. For a given dimension, a higher accuracy score ratio is better and signifies that projecting down to that dimension using a random projection does not deteriorate the performance of t-SNE. Likewise, a lower time ratio is better. In Figure 1 we plot the ratio of the accuracy scores and of the time taken when we use the openTSNE implementation. The $x$-axis is the dimension $k$ of the random projection (ranging all the way up to the actual dimension of the input data). In all four datasets, we see that as the dimension increases, both the accuracy ratio and the time ratio approach $1$. However, we also observe that the accuracy score ratios approach $1$ much faster, indicating a “sweet spot” where the dimension is high enough for the random projection to preserve geometric structure, yet still low enough for t-SNE to run relatively quickly.
Likewise, we show the results of the same experiments using the scikit-learn implementation. Since the scikit-learn implementation is significantly slower than the openTSNE implementation, we subsample the values of $k$ used and do not test the Small NORB dataset. However, even in this different implementation, we observe the same trends as with openTSNE.
In practice, SVD-based methods such as principal component analysis (PCA) are more widely used than random projections for the general task of dimensionality reduction. We show that for our four datasets, PCA indeed outperforms random projections. That is, we empirically observe that using PCA rather than a random projection before t-SNE allows us to project to a much smaller dimension while still retaining a high accuracy score ratio (compared to the case where no dimensionality reduction is used). This is shown in Figure 3, where regardless of the dataset, a projection to a small dimension is sufficient to get an accuracy score ratio close to $1$. An advantage of random projections, however, is that they are data oblivious (the dimensionality reduction does not depend on the data) and come with provable guarantees in many cases, such as the JL lemma.
- [Chan et al.(2019)Chan, Rao, Huang, and Canny] David M Chan, Roshan Rao, Forrest Huang, and John F Canny. GPU accelerated t-distributed stochastic neighbor embedding. Journal of Parallel and Distributed Computing, 131:1–13, 2019.
- [Clanuwat et al.(2018)Clanuwat, Bober-Irizar, Kitamoto, Lamb, Yamamoto, and Ha] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature, 2018.
- [Dasgupta and Gupta(2003)] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, January 2003. ISSN 1042-9832. doi: 10.1002/rsa.10073. URL http://dx.doi.org/10.1002/rsa.10073.
- [Indyk and Naor(2007)] Piotr Indyk and Assaf Naor. Nearest-neighbor-preserving embeddings. ACM Trans. Algorithms, 3(3), August 2007. ISSN 1549-6325. doi: 10.1145/1273340.1273347. URL http://doi.acm.org/10.1145/1273340.1273347.
- [LeCun et al.(2004)LeCun, Huang, and Bottou] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004. URL http://dl.acm.org/citation.cfm?id=1896300.1896315.
- [Makarychev et al.(2019)Makarychev, Makarychev, and Razenshteyn] Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson–Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 1027–1038, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6705-9. doi: 10.1145/3313276.3316350. URL http://doi.acm.org/10.1145/3313276.3316350.
- [Pezzotti et al.(2019)Pezzotti, Thijssen, Mordvintsev, Höllt, Van Lew, Lelieveldt, Eisemann, and Vilanova] N. Pezzotti, J. Thijssen, A. Mordvintsev, T. Höllt, B. Van Lew, B. P. F. Lelieveldt, E. Eisemann, and A. Vilanova. GPGPU linear complexity t-SNE optimization. IEEE Transactions on Visualization and Computer Graphics, pages 1–1, 2019. doi: 10.1109/TVCG.2019.2934307.
- [Poličar et al.(2019)Poličar, Stražar, and Zupan] Pavlin G. Poličar, Martin Stražar, and Blaž Zupan. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. bioRxiv, 2019. doi: 10.1101/731877. URL https://www.biorxiv.org/content/early/2019/08/13/731877.
- [Ulyanov(2016)] Dmitry Ulyanov. Multicore-TSNE. https://github.com/DmitryUlyanov/Multicore-TSNE, 2016.
- [van der Maaten(2014)] Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014. URL http://jmlr.org/papers/v15/vandermaaten14a.html.
- [van der Maaten and Hinton(2008)] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.
- [Xiao et al.(2017)Xiao, Rasul, and Vollgraf] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.