An Analysis of the t-SNE Algorithm for Data Visualization

03/05/2018
by   Sanjeev Arora, et al.

A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization: finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the "ground-truth" clusters in the underlying data, similar to conditions assumed in earlier analyses of clustering. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data, such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in partially recovering cluster structure even when the above deterministic condition is not met.
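To make the non-convex objective mentioned above concrete, here is a minimal sketch of the standard t-SNE cost: the KL divergence between Gaussian affinities in the input space and Student-t affinities in the 2-dimensional embedding. It is an illustration under stated assumptions, not the paper's analysis. It uses a single fixed bandwidth `sigma` where real t-SNE calibrates a per-point bandwidth from a perplexity target, it performs no gradient descent, and the toy points and all function names are hypothetical. It only compares the cost of an embedding that keeps two well-separated clusters apart with one that mixes them.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def input_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij in the input space.
    Assumption: one fixed bandwidth sigma for every point (a simplification;
    t-SNE normally tunes sigma_i per point to match a target perplexity)."""
    n = len(X)
    cond = []
    for i in range(n):
        w = [0.0 if j == i else math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])  # conditional p_{j|i}
    # p_ij = (p_{j|i} + p_{i|j}) / (2n), which sums to 1 over all ordered pairs
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def embedding_affinities(Y):
    """Student-t (one degree of freedom) affinities q_ij in the embedding."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    z = sum(map(sum, w))
    return [[v / z for v in row] for row in w]

def kl_cost(P, Q):
    """t-SNE objective KL(P || Q) = sum over i != j of p_ij * log(p_ij / q_ij)."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)

# Hypothetical toy data: two tight, well-separated clusters in 3 dimensions.
X = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
     (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0)]

# Two candidate 2-D embeddings using the same six locations:
# one keeps each cluster together, the other swaps points across clusters.
separated = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
mixed = [(0, 0), (10, 0), (0, 1), (1, 0), (11, 0), (10, 1)]

P = input_affinities(X)
cost_separated = kl_cost(P, embedding_affinities(separated))
cost_mixed = kl_cost(P, embedding_affinities(mixed))
print("cost (clusters kept apart):", cost_separated)
print("cost (clusters mixed):", cost_mixed)
```

On this toy input the embedding that keeps the clusters apart has the lower cost, which is the behavior a visualization objective should reward. Whether gradient descent on this non-convex cost actually reaches such an embedding is the question the paper analyzes.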


