High-dimensional data is created at unprecedented rates by scientific fields as diverse as information technology, bioinformatics, and astronomy[buhlmann:2011:stats_for_high_dim]. As a result, there is a growing need for visualization and interaction methods for high-dimensional data. A common choice is to project the high-dimensional data to two dimensions using methods such as t-SNE[maaten:2008:tsne], PCA[pearson:1901:pca], LLE[roweis:2000:lle], or UMAP[mcinnes:2018:umap], among the many other existing options[espadoto19].
Projections allow better insight into the overall structure of data and can be enriched by interactions that allow users to reason about the corresponding high-dimensional data by selecting, brushing, and querying the 2D scatterplots they create. For example, t-SNE has allowed computational biologists to investigate human genetic data, revealing otherwise obfuscated population stratification[Li:2017:application_of_tsne]. However, any projection technique will create errors when mapping complex and high-dimensional datasets to a low number of dimensions[martins14, nonato18]. Moreover, projections are often complex algorithms, so the way they map the high-dimensional data to low-dimensional space can be difficult for users to fully interpret. As such, additional mechanisms need to complement projections to empower users to better explore the high-dimensional data.
Recently attention has turned to inverse-projection, a process that allows one to compute the inverse mapping from the projection space back to the original high-dimensional data space[amorim:2012:ilamp]. Also called back-projections, these methods help users to explore projections by allowing a user to interactively query the projection space to find high-dimensional data points. These points correspond to specific locations in the low-dimensional projection. Inverse-projections are also instrumental in explaining the decision boundaries of machine learning classifiers[rodrigues:2019:classifier_boundaries] and data augmentation scenarios[amorim:2012:ilamp]. In contrast to the many existing projection techniques[espadoto19], only a handful of inverse projection algorithms exist, including iLAMP[amorim:2012:ilamp] and its extension that uses radial basis functions (RBFs)[amorim:2015:rbf]. Algorithms like iLAMP and RBF are quite slow, and have multiple free parameters, making it hard to use them in interactive data exploration scenarios[rodrigues:2019:classifier_boundaries].
In this paper, we present NNInv, a technique for computing the inverse of any projection using a deep learning approach. Our idea is inspired by the recent work of Espadoto et al.[espadoto:2020:innp] that demonstrates that deep learning can learn to imitate the style of any projection technique, and is parametric and stable to data changes (and thereby offers out-of-sample capabilities). Following their approach, we show that NNInv is a scalable, robust, high-quality inverse projection method which supports multiple applications.
Using NNInv, we introduce three use cases across a number of well-known datasets to illustrate how the use of inverse-projection can improve the user’s interaction, exploration, and understanding of high-dimensional data in a 2D visualization. Additionally, we provide an evaluation of iLAMP, RBF, and NNInv in terms of scalability and accuracy. To this end, we provide a novel visualization for evaluating the joint quality of a pair of inverse-projection and direct-projection methods. We make a point of studying the NNInv inverse-projection method on two synthetic datasets with well-known topology (i.e., a 3D sphere dataset and a 3D swissroll dataset), allowing us to illustrate the behaviors of the learned inverse function.
Applications for this work are numerous. First, we use inverse-projections to explore the “empty” spaces in a 2D projection of high-dimensional data. While the user interactively brushes such spaces, high-dimensional instances corresponding to the visited 2D points are synthesized and displayed, thus allowing one to form a better mental map of how the 2D image represents the entire high-dimensional data space, beyond how a 2D scatterplot represents a high-dimensional dataset.
Second, we present a visualization of the decision boundary of an ensemble classifier. Visualizing cluster boundaries can help users see patterns within the data and the behaviors of the classifiers (see the classifier comparison by Scikit-Learn [scikitboundary]). We show that the use of an inverse projection method such as NNInv makes it possible to visualize this important information with high-dimensional data (see Fig. 2). Finally, we introduce a gradient map visualization to help users find projection artifacts. This method highlights regions where the projection shrinks and expands the relationships between points by visualizing the rate of change between the learned 2D embedding and the original high-dimensional space. We show that NNInv provides an alternative approach to helping the user “see” the high-dimensional space in 2D.
In summary, the main contributions of this paper are:
A deep learning approach to inverse-projection, which is fast enough to be used at scale.
A comparison to existing inverse-projection methods.
An exploration of the behavior of NNInv on datasets with well-known topology.
Two novel visualizations for evaluating inverse-projection methods.
A showcase of visual exploration techniques enabled by inverse-projection.
2 Related Work
2.1 Visualization of High-Dimensional Data
We first list the notation used for the remainder of the paper. Let $\mathbf{x} = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}$, be an $n$-dimensional data point, also called a sample or an observation. Let $D = \{\mathbf{x}_i\}$, $1 \leq i \leq N$, be a dataset of $N$ such samples. The need to examine, interpret, and explore high-dimensional datasets is not new. As early as the 1970s, Andrews[andrews:1972:plot_greater_than_2d] recognized the need to visualize data whose dimension exceeded the limits of what can conventionally be drawn on a 2D plane. Geng[geng:2013:3d_display] presented several techniques and systems, from stereoscopic imaging to volumetric displays, that allow visualizing three-dimensional data. Yet, as the dimensionality grows beyond three or four, it becomes clear that increasing the dimensionality of display technology is not a solution.
Also called Dimensionality Reduction (DR) methods, projections are techniques that aim to go beyond the aforementioned limitations of high-dimensional visualization techniques[gorban:2008:principal_manifold_techniques, van:2009:dim_reduction_survey, joia:2011:lamp, silva:2012:user_centered_projection, sorzano:2014:dim_reduction_survey, sacha:2016:dim_reduction_interaction_survey, jeong:2009:ipca]. Formally put, a projection technique is a function $P: \mathbb{R}^n \rightarrow \mathbb{R}^q$ that maps every point $\mathbf{x}$ of a high-dimensional dataset $D$ to a low-dimensional counterpart $P(\mathbf{x})$. Typically $q \in \{2, 3\}$, which allows directly depicting the projection as a 2D or 3D scatterplot, respectively.
Projection techniques aim to preserve the so-called data structure between the original dataset $D$ and its low-dimensional counterpart $P(D)$. Structure is captured in terms of inter-point distances[roweis:2000:lle, paulovich:2008:least, joia:2011:lamp], point neighborhoods[maaten:2008:tsne, mcinnes:2018:umap], or clusters[paulovich2006text]. Projections can be further classified as linear[cunningham:2015:linear_dim_reduction_survey] or nonlinear[yin07_survey, van:2009:dim_reduction_survey]. Linear techniques, such as PCA, are simple and fast to compute, have an intuitive geometric interpretation, and have a robust association with statistical analysis. Nonlinear techniques, such as UMAP, are generally more computationally expensive, but strive to represent local neighborhood information with minimal distortion. There are also a number of projection techniques that generally fit under the RadViz family[hoffman:1997:radviz, angelini:2019:enhancing_radviz, pagliosa:2019:radviz++]. RadViz visualizes multidimensional data in 2D by anchoring each feature around the perimeter of a circle and using spring forces from those anchors to assign each instance a location inside the circle. Projection techniques are further classified, analyzed, and compared both theoretically and practically in a number of surveys[hoffman02, bunte11, sorzano:2014:dim_reduction_survey, maljovec15, nonato18, espadoto19], to which we refer the reader.
All projection techniques transform data between the original space $\mathbb{R}^n$ and the projection space $\mathbb{R}^q$. Several techniques aim to show errors in this process, i.e., areas in $P(D)$ that may miss or not reflect actual structures in $D$. For example, Stress Maps[seifert:2010:stress_maps] is a visual analysis tool that displays the local stress values, or how local distance relationships have changed, under a projection algorithm. Other error metrics and subsequent visualization mechanisms include trustworthiness and continuity[venna10], false and missing neighbors[martins14, martins15], and false neighborhoods and tears[aupetit:2007:visualizing_distortions_in_projections, lespinats:2011:checkviz]. Surveys of such metrics are given in[nonato18, espadoto19]. In order to demonstrate potential changes of $P(D)$ caused by a hypothetical perturbation of the data in $D$, DimReader[faust:2018:dim_reader] utilized a filled contour plot in the background. t-viSNE[kerren:2020:tvisne] focuses on helping users understand t-SNE projections, such as how hyperparameters affect the properties of the final projection. Probing Projections[stahnke:2015:probing_projections] allows users to display the value of any attribute through a background heat map, and also enables users to correct distance errors in the projection by moving individual points in $P(D)$ on the 2D projection space. A similar technique is proposed by LAMP[joia:2011:lamp]. In contrast, Dis-Function[brown:2012:dis] updates the mapping from the user’s dragging of data points to generate new, and hopefully better, projections. Sirius[dowling:2019:sirius] allows practitioners to investigate both the observations and attributes of a dataset through symmetric projections.
Choosing a good projection (one which yields a low projection error on a given family of datasets, is simple to use in terms of parameter setting, is robust to small changes in the data $D$, and is computationally scalable to large dimensions and sample counts) is challenging. Recently, Neural Network Projection (NNP)[nnp] was proposed as a method to achieve these goals by leveraging deep learning: Given a dataset $D$, a small subset $D_s \subset D$ is chosen and projected by any user-chosen technique $P$. After a suitable projection $P(D_s)$ is obtained by tuning $P$’s parameters, a fully-connected feed-forward neural network is trained to infer $P(D_s)$ from $D_s$. The trained network is then used to project any data drawn from a similar distribution as $D_s$. NNP has shown remarkable ability in producing projections that mimic a wide range of techniques on many types of datasets, with little or no parameter tuning[espadoto:2020:innp]. Moreover, NNP is parametric, making it robust to small-scale data changes in $D$ while also providing an out-of-sample capability. That is, NNP learns a continuous function from $\mathbb{R}^n$ to $\mathbb{R}^q$ rather than a discrete mapping formed by a non-parametric projection. NNP is important as a basis for the discussion of inverse-projections in the next section.
2.1.2 Inverse Projection
Inverse-projection can be seen as a function $P^{-1}: \mathbb{R}^q \rightarrow \mathbb{R}^n$, which should ideally be the mathematical inverse of a given projection $P$, i.e., $P^{-1}(P(\mathbf{x})) = \mathbf{x}$. A crucial component of inverse-projections is that they should have an out-of-sample ability that can be expressed as a continuous mapping, which is generally not the case for direct (discrete) projections. Thus, $P^{-1}$ can be used to invert points that fall between the points of the scatterplot $P(D)$, helping the user to understand what kind of data samples could project at a particular location in $\mathbb{R}^q$. This ability further supports applications such as data augmentation and classifier exploration[rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries].
Inverse-projection is inherently harder than direct projection due to the need for an out-of-sample ability and the fact that $P^{-1}$ needs to synthesize a high number of dimensions $n$ from a lower dimension $q$. Early on, autoencoders were proposed to jointly infer both $P$ and $P^{-1}$ by deep learning to minimize the projection error from $D$ to $P^{-1}(P(D))$[hinton2006reducing]. While autoencoders are parametric, the resulting mappings are not always intuitive[van:2009:dim_reduction_survey] and autoencoders can be difficult to train[vernier20]. Amorim et al. approach inverse-projection in iLAMP[amorim:2012:ilamp] by using local affine transformations, following the earlier idea of the LAMP direct projection technique[joia:2011:lamp]. Mamani et al. also use local affine transformations in their inverse-projections as a part of their work on user-driven feature space transformation[mamani:2013:user_driven_feature_space_transformation]. iLAMP was later extended by leveraging radial basis functions (RBFs) to provide a smoother inverse mapping $P^{-1}$, which was shown to be useful for data augmentation[amorim:2015:rbf]. Kriegeskorte and Mur[kriegeskorte:2012:inverse_mds] proposed inverse MDS, which infers pairwise dissimilarities from multiple 2D arrangements of items. Cavallo et al.[cavallo:2018:praxis] used inverse-projection in Praxis, an interactive exploratory analysis tool for high-dimensional data. The authors leveraged the analytical inverse of PCA in addition to an autoencoder to both project and inverse-project data. Similarly, Zhao et al.[zhao:2020:chartseer] used a Grammar Variational Autoencoder (GVAE)[kusner2017grammar] to project and inverse-project data charts for steering exploratory visual analysis.
2.2 Latent Spaces with Neural Networks
Recent developments in machine learning and AI have shown that deep learning approaches are both accurate and flexible when used and trained properly[goodfellow19]. In general, neural network encoders work by learning a mapping from the input data space to a lower dimensional representation called the latent space. This mapping, conceptually similar to our projection $P$, is often difficult to interpret as the latent dimensions are abstract. Moreover, the neural network’s operation is harder to understand than the equivalent operation of a typical projection function $P$.
2.2.1 Interpreting the Latent Space
Interactive visualization tools have been developed to help with analysis tasks that give a better understanding of latent spaces. In particular, when a neural network model has a generative component (e.g., autoencoders and Generative Adversarial Networks), its latent space can be explained by bringing data points back to the original space via its generative component. Liu et al.[liu:2019:latent_space_catography] presented a latent space cartography (LSC) visual analysis system for vector space embeddings. The LSC system was created to address common interpretation tasks for latent spaces. It provides a means to both quantify attribute vector uncertainty and compare multiple attribute vectors. Spinner et al.[spinner:2018:towards_interpretable_latent_space] also used latent spaces to visually compare autoencoders with variational autoencoders. A number of techniques have been developed in order to try to disentangle the latent features of autoencoders[higgins:2017:beta_vae, kim:2019:disentangling_factorising, chen:2019:isolating_disentanglement_vae]. A recent work by Gou et al. moved these advances forward within a full visual analytics system for traffic light detection[Gou:2020:valtd]. Additional visualizations making use of, and explaining, latent spaces are discussed in a recent survey[garcia18].
3 Learning the Inverse Projection
Figure 1 shows the operation of NNInv. Given a dataset $D \subset \mathbb{R}^n$ of $N$ points, let $P(D) \subset \mathbb{R}^q$ be its projection by any user-chosen projection method $P$. In practice, $P(D)$ is a two-dimensional scatterplot, so $q = 2$. NNInv constructs an approximation of the inverse $P^{-1}$ of $P$ by using deep learning. Let $\mathbf{x}' = f_\theta(\mathbf{p})$ be an $n$-dimensional point inferred by the neural network from a 2D point $\mathbf{p}$. Here, $\theta$ are the learned parameters of the function $f$ (i.e., the weights of the network). To train the model, we minimize the loss between each predicted $\mathbf{x}'$ and true $\mathbf{x}$ within the training set ($D_s$, $P(D_s)$) using some loss function.
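The training step described above can be illustrated with a minimal sketch. It is not the architecture used in this paper: it assumes a single hidden layer, plain gradient descent instead of Adam, and synthetic stand-in pairs instead of a real projection. All names (`P2d`, `X`, `forward`, `mae`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training pairs (assumption): in practice P2d = P(D_s) would come from
# a projection such as t-SNE and X = D_s would be the data normalized to [0, 1].
N, n, hidden = 512, 5, 32
P2d = rng.uniform(0.0, 1.0, size=(N, 2))
freqs = np.linspace(1.0, 3.0, n)
X = 0.5 + 0.4 * np.sin(np.outer(P2d[:, 0], freqs) + P2d[:, [1]])

# Parameters theta of a one-hidden-layer network f_theta: R^2 -> R^n.
W1 = rng.normal(0.0, 0.5, size=(2, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.5, size=(hidden, n))
b2 = np.zeros(n)

def forward(p):
    """Return hidden activations and the predicted n-dimensional points."""
    h = np.maximum(0.0, p @ W1 + b1)            # ReLU hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output in (0, 1)
    return h, out

def mae(pred, target):
    """Mean absolute error over all points and dimensions."""
    return float(np.mean(np.abs(pred - target)))

initial_loss = mae(forward(P2d)[1], X)
lr = 0.5
for _ in range(2000):
    h, out = forward(P2d)
    # Subgradient of the MAE w.r.t. the output, then backprop through the sigmoid.
    d_out = np.sign(out - X) / X.size
    d_z2 = d_out * out * (1.0 - out)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * (h > 0)
    # In-place updates so that forward() sees the new parameters.
    W2 -= lr * (h.T @ d_z2)
    b2 -= lr * d_z2.sum(axis=0)
    W1 -= lr * (P2d.T @ d_z1)
    b1 -= lr * d_z1.sum(axis=0)
final_loss = mae(forward(P2d)[1], X)
```

Once trained, `forward(p)[1]` can be evaluated at any 2D location, which is what provides the out-of-sample behavior that inverse-projection relies on.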
We used five different datasets across our evaluation and proposed applications.
MNIST: This dataset[lecun:2010:mnist] has grayscale images of hand-drawn digits, zero through nine. Each image is at a resolution of $28 \times 28$ pixels. The images have been translated so that the center of mass of the pixels is at the center of the image. The MNIST dataset is commonly used to illustrate and measure the quality of projection techniques[maaten:2008:tsne, van:2009:dim_reduction_survey, espadoto19, nnp, espadoto:2019:nn_inv].
Fashion-MNIST: This dataset[xiao:2017:fashion_mnist] is constructed in the same manner as the original MNIST dataset, but contains pictures of different items of clothing. It was designed as a slightly more difficult replacement for the MNIST dataset.
Blobs: This synthetic dataset has points sampled from a Gaussian distribution with 5 different centers (clusters) in dimensions.
Sphere: This dataset consists of points uniformly sampled from a 3D unit sphere. It allows us to clearly demonstrate the behavior of the projection techniques included and, more importantly, offers a simple illustration of our gradient map visualization.
Swiss Roll: This dataset consists of points sampled from a densely-sampled 2D patch which was smoothly mapped to a “roll” in 3D. It is commonly used to gauge the capability of projections to “unroll” the data back to its 2D configuration[amorim:2012:ilamp, joia:2011:lamp, balasubramanian2002isomap].
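The two synthetic 3D datasets can be generated along the following lines. This is a sketch only: the sample counts and the Swiss roll constants are assumptions (a commonly used parameterization), not the exact settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(num_points):
    """Uniform samples on the unit sphere, obtained by normalizing 3D Gaussian vectors."""
    v = rng.normal(size=(num_points, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_swiss_roll(num_points):
    """Smoothly map a 2D patch (t, height) onto a roll in 3D."""
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=num_points))
    height = 21.0 * rng.uniform(size=num_points)
    return np.column_stack((t * np.cos(t), height, t * np.sin(t)))

sphere = sample_sphere(1000)     # assumed sample count
roll = sample_swiss_roll(1000)   # assumed sample count
```

Because both datasets have a known 2D parameterization, the output of an inverse projection can be checked against the true surface.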
We next describe the design and tuning of the neural network used to learn the inverse projection. Following [elsken:2019:nas_survey], and also the method used to tune NNP [espadoto:2020:innp], we used grid search to explore different architecture configurations: total number of neurons, neurons per layer, and dropout values.
We ran the grid search across the four datasets introduced in Sec. 3.1. As the direct projection $P$, we used t-SNE, which was earlier shown to be the hardest projection from a set of nine different projections to mimic via deep learning[nnp]. Hence, we believe that t-SNE is also a hard challenge to invert via NNInv. We varied the training-set size between 5250, 10500, 21000, and 42000 samples. To account for variation in random initialization of the neural network weights, we ran each configuration three times and averaged the results into a single error score. We measure quality via mean absolute error (MAE), and also provide its standard deviation across the three runs. Training is stopped automatically on convergence, defined as the moment when the validation loss stops decreasing. We next discuss the hyperparameters investigated.
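As a small illustration of this scoring scheme, the sketch below computes the MAE of three hypothetical runs and summarizes them by mean and standard deviation. The data values are made up, and the use of the sample standard deviation is an assumption.

```python
import statistics

def mean_absolute_error(predicted, actual):
    """MAE between two equally sized lists of n-dimensional points."""
    total, count = 0.0, 0
    for x_pred, x_true in zip(predicted, actual):
        for a, b in zip(x_pred, x_true):
            total += abs(a - b)
            count += 1
    return total / count

# Hypothetical ground truth and the reconstructions from three training runs.
actual = [[0.0, 0.5], [1.0, 0.25]]
runs = [
    [[0.1, 0.5], [0.9, 0.25]],
    [[0.0, 0.4], [1.0, 0.45]],
    [[0.2, 0.5], [1.0, 0.25]],
]

scores = [mean_absolute_error(run, actual) for run in runs]
mean_score = statistics.mean(scores)   # single error score per configuration
std_score = statistics.stdev(scores)   # spread across the three runs
```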
Network Architectures: We restricted ourselves to fully-connected layers and used four hidden layers in each configuration. We varied the network shape and number of neurons in each layer. The total number of neurons in each network varied between 240, 480, 960, 1920, 3840, 7680, and 15360. We experimented with four network shapes (see Table II).
Activation Functions: Since the data is normalized such that each of the dimensions ranges over $[0, 1]$, we used a sigmoid activation function on the output layer.
Regularization: We used both early stopping and dropout, with several dropout probabilities. Experiments showed dropout was not generally effective. We believe that this is due to the fact that overfitting is unlikely given that we used smaller networks and early stopping.
Loss Function: For the loss function we used MAE.
Optimizer: We used the Adam optimizer[kingma_adam_2014], given its good performance with NNP[nnp, espadoto:2020:innp].
Table I shows the MAE and standard deviation (STD) results for the five tested configurations with the lowest error. The full results, including all configurations tested, are available in the supplemental material. The best architectures for each dataset either had the same number of neurons in each layer, or used a widening architecture which doubles or quadruples the number of neurons in each successive layer.
Our results suggest that architectures smaller than the original one from Espadoto et al.[espadoto:2019:nn_inv] can be used. The only dataset that performs better with more than 1920 neurons is the Fashion-MNIST dataset. For this dataset, we obtain a slightly lower MAE when using 7680 neurons. However, as in most cases observed, the error decrease is negligible compared to the increase in complexity (network size). Summarizing our findings from the tested datasets, we offer suggestions for future experimentation: (1) networks should follow a straight or fan-out style shape as described in Table II, and (2) even relatively small networks can perform exceptionally well at this task as seen in Table I.
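To make the straight and fan-out shapes concrete, the following sketch splits a total neuron budget over four hidden layers. The helper `layer_sizes` and its rounding rule are hypothetical; the exact shapes of Table II are not reproduced here.

```python
def layer_sizes(total_neurons, growth, num_layers=4):
    """Split a neuron budget over hidden layers that grow by `growth` per layer.

    growth=1 gives a straight shape; growth=2 or 4 gives a fan-out shape that
    doubles or quadruples the layer width. Sizes are rounded, so for some
    budgets the sum may differ slightly from total_neurons.
    """
    weights = [growth ** i for i in range(num_layers)]
    unit = total_neurons / sum(weights)
    return [max(1, round(unit * w)) for w in weights]

straight = layer_sizes(480, growth=1)     # equal width in every layer
doubling = layer_sizes(480, growth=2)     # each layer twice as wide as the last
quadrupling = layer_sizes(480, growth=4)  # each layer about four times as wide
```

A grid search in this style would loop over the total neuron counts and growth factors, train one network per combination, and record its error.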
4 Applications of Inverse Projection in Visual Analytics
Traditional error calculations may tell something of the overall loss, i.e., going from high-dimensional space to 2D and back to the original space. However, a robust analysis of an inverse-projection technique must include more than just this type of error. This section focuses on a qualitative evaluation of inverse projections using applications that are of interest to the visualization community. In particular, we explore three use cases of inverse-projection in visual analytics: (1) direct interpolation of high-dimensional data using the 2D screen, (2) leveraging the generation of high-dimensional data across the screen to color each pixel by classifier agreement, and (3) using the generated high-dimensional data to illustrate high gradient areas of the projection.
4.1 Case Study 1: Dynamic Imputation
One shortcoming of the current use of projection methods is that the projections are “one-way streets.” From a user interaction and exploration standpoint, the most that a user can do using such techniques is to select a data point in the (2D) visualization and look up the original values of that point in high-dimensional space. Due to this limitation, the user’s exploration of the data is restricted. For example, the user would have no easy way of knowing why two data points appear close to each other in the 2D space, or what other data points, if they existed, would appear near or between these points.
4.1.1 Example with MNIST
In this case study, we demonstrate the use of inverse-projection to perform “dynamic imputation.” The inspiration for this case study comes from recent works by Cavallo et al.[cavallo:2018:praxis], who explore inverse-projection with PCA and autoencoders, and Kwon et al.[kwon:2020:deep_generative_graphs], who generate graph layouts from the user’s interactions in a 2D latent space.
Consider Figure 2: The user can select projected data points in a 2D visualization (of the MNIST dataset) and see their original values (see Figure 2A and Figure 2E), similar to traditional visual analytics systems. However, with the use of inverse-projection, the user can also select an “empty” space between these data points (see the three inner images). The inverse-projection function implicitly performs imputation (i.e., generating a new data point) when performing inference over the 2D pixel location to find its position in high-dimensional space.
Since the inference step of a trained neural network is fast, this computation can be done in a web browser and be made fully interactive using mouse hovering. In this example, the computation time of inverse-projecting a point in TensorFlow.js on an Intel i7-8650U CPU is below 10 milliseconds. With this high degree of interactivity, the user can quickly explore both the high-dimensional dataset as well as the high-dimensional space (between data points) itself. Figure 3 further showcases the ability of using inverse-projection to interpolate between clusters. Here, the images furthest to the left and right represent two visually distinct objects (i.e., a pair of pants and a dress, and the digits 6 and 5). The images in between are interpolations generated by the inverse-projection algorithm.
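The interpolation between two projected points can be sketched as follows. The function `toy_inverse` is a hypothetical stand-in for a trained inverse-projection model and is used only so the example runs on its own.

```python
def interpolate_2d(p_start, p_end, steps):
    """Evenly spaced 2D points from p_start to p_end, endpoints included."""
    return [
        tuple(a + (b - a) * i / (steps - 1) for a, b in zip(p_start, p_end))
        for i in range(steps)
    ]

def toy_inverse(p):
    """Hypothetical stand-in for a trained inverse projection: R^2 -> R^3."""
    x, y = p
    return (x, y, x * y)

# Walk across the "empty" space between two 2D locations and synthesize a
# high-dimensional point at every step along the way.
path = interpolate_2d((0.0, 0.0), (1.0, 2.0), steps=5)
synthesized = [toy_inverse(p) for p in path]
```

With an image dataset, each synthesized point would be reshaped and displayed as an image, giving a sequence like the ones between the two endpoints in Figure 3.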
4.1.2 Evaluation of the Inverse Projection
Using this framework, we can also visually evaluate the quality of the inverse-projection algorithm. Specifically, instead of selecting an “empty” pixel, a user can select the 2D position of an existing data point. We can then compare the original values of the data point with the values generated by the inverse-projection algorithm. For example, the two images on the upper left side of Figure 4 are two different styles of pants from the original Fashion-MNIST dataset. The two images directly below, on the lower left side, are those generated by the inverse-projection. Similarly, images on the upper right side of Figure 4 are from the original MNIST dataset, and images on the lower right side of Figure 4 are generated. In both cases, the generated images are “blurrier” than the originals. However, the comparison shows that the inverse-projection function has learned the important visual features of these images and can reproduce them with high fidelity.
4.1.3 Implications to Visual Analytics
Although we used two relatively simple image datasets in this case study (MNIST and Fashion-MNIST) for illustrative purposes, the use of inverse-projection for dynamic imputation should be extendable to visual analysis of other high-dimensional datasets, including temporal data, geographic data, and tabular data. As such, having an accurate inverse-projection function in a visual analytics system can allow the system designer and the user to explore high-dimensional data in ways that have not been possible. For example, in the context of business analysis, the use of inverse-projection for data imputation can serve as a “hypothesis generator” (e.g., Figure 2). With inverse-projection, the user can interpolate between the 0 and the 6 from the original data, and use inverse-projection to generate hybrid examples between them. While the generated data points are estimates of the inverse-projection function, they may nonetheless serve as potential hypotheses for an analyst to further explore.
4.2 Case Study 2: Model Agreements
Previous work in defining and interpreting back projections has shown that creating dense pixel maps in the 2D projection space can provide additional insight into the behavior of classification type tasks[espadoto:2019:nn_inv, rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries]. Figure 5 shows how this concept is extended to highlight regions of lower classification agreement.
4.2.1 Example with MNIST and Fashion-MNIST
We demonstrate ensemble classification confidence by creating dense pixel maps to show the classifier agreement of two of the ten digits in the MNIST dataset (digits 1 and 7) and two of the objects in the Fashion-MNIST dataset (handbags and shirts). While there is nothing preventing the technique from being extended to multiclass classification, as in previous work[espadoto:2019:nn_inv, rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries], we limit ourselves to binary classification. In both cases, we begin by inverse-projecting each screen pixel to learn its position in the high-dimensional space. This high-dimensional point is then put through some number (greater than one) of classification methods. Since our dataset only contains two classes, each of the classifiers will simply assign a data point to one class or the other. We then color the pixel based on the number of classifiers that predicted each class. As shown in Figure 5, we color a pixel bright blue if the majority of classifiers predicted class one, and bright red if the majority of classifiers predicted class two. In between these two extremes, pixels are colored by decreasing the amount of saturation such that complete disagreement between the models results in a white pixel. That is, half of the classifiers say the data point inverse-projected from the respective pixel belongs to class 1, while the other half say the point belongs to class 2.
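The per-pixel coloring described above can be sketched as a small function from vote counts to a color. The linear saturation ramp and the threshold "classifiers" below are assumptions made for illustration; they are not the ensemble or the exact color scale used for Figure 5.

```python
def count_votes(classifiers, x):
    """Number of classifiers that assign point x to class one (label 1)."""
    return sum(1 for clf in classifiers if clf(x) == 1)

def agreement_color(votes_class_one, num_classifiers):
    """RGB in [0, 1]: blue for class one, red for class two, white for a tie."""
    share = votes_class_one / num_classifiers   # 1.0 means all vote class one
    saturation = abs(2.0 * share - 1.0)         # 0 at a tie, 1 if unanimous
    fade = 1.0 - saturation
    if share >= 0.5:
        return (fade, fade, 1.0)                # white fading into blue
    return (1.0, fade, fade)                    # white fading into red

# Hypothetical ensemble: four threshold rules on the first coordinate.
classifiers = [lambda x, t=t: 1 if x[0] > t else 2 for t in (0.2, 0.4, 0.6, 0.8)]

# In the full pipeline, x would be the inverse projection of a screen pixel.
x = (0.5, 0.0)
pixel_color = agreement_color(count_votes(classifiers, x), len(classifiers))
```

In this example two of the four rules vote for each class, so the pixel is white, which marks a region of complete disagreement.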
The ensemble is formed by ten classifiers, namely Logistic Regression, Linear SVM, SVM with radial basis function, K-Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, AdaBoost, Gaussian Naive Bayes, and Quadratic Discriminant Analysis. These classifiers represent a diverse set of classification algorithms, including linear and non-linear methods. The output from these classifiers is used to generate the images in Figure 5, where we can see not only the class membership of each point but also the shape of the decision boundaries.
For example, we can combine the use of inverse-projection for visualizing decision boundaries with its use for dynamic imputation, resulting in an interactive visual exploration system for understanding the uncertainty of the classifiers.
As an illustrative example, consider the differences between t-SNE and UMAP in Figure 5. When only considering the separability of the two clusters, one would likely assume that UMAP outperforms t-SNE, especially for the MNIST dataset (top row of Figure 5). However, when inspecting the decision boundaries, it becomes less clear that the separability affects the classifiers’ abilities to distinguish data points from one class to another. Specifically, in the t-SNE example, although the separation between the two clusters in the MNIST data is small, the boundaries are sharp and clean. Conversely, while UMAP produces high separation between the two clusters, there are disagreements between the classifiers in that space.
4.2.2 Implications to Visual Analytics
While there have been a number of proposed methods for illustrating the decision boundaries for classifiers of high-dimensional data[migut:2015:decision_boundaries, Hamel:2006:decision_boundaries_of_svm, rodrigues:2018:classifier_boundaries, schulz:2015:decision_boundaries, rodrigues:2019:classifier_boundaries], our proposed use of inverse-projection offers an alternative that can be more flexible for visual analytics systems. As illustrated in Figure 6, the user can hover over areas with low model agreement (e.g., pixels that are white or near white), and see what characteristics of the data might cause the classifier models to disagree.
In the context of designing new visual analytics techniques, the use of inverse-projection to help users better understand the behaviors of machine learning models can prove to be invaluable. In the area colloquially referred to as Explainable AI (or XAI), visualization researchers have been active in developing novel visualization and interaction techniques that can help a user understand, debug, and improve a complex machine learning model. While the space of XAI is large, we posit that the inverse-projection technique can contribute to this broad space of research.
4.3 Case Study 3: Gradient Map Visualization
Related to Explainable AI (XAI), one of the primary use cases for multidimensional projection is the visualization of, and interaction with, data that exists in high-dimensional spaces that humans have difficulty interpreting. Unfortunately, a side effect of projection is the loss of information. To help mitigate the consequences of information loss imposed by projection, most techniques strive to maintain local relationships. In other words, they seek to preserve, in the two-dimensional projection, the relative distances between neighboring data points in the high-dimensional space. Of course, keeping these relationships intact after projecting is not always possible.
Using the concepts of data imputation, we can present a more holistic view of how a projection represents the spatial relationships between data points. The ability to determine high-dimensional coordinates from a projected point in 2D enables a more complete investigation of the consequences of selecting a given projection technique by inspecting its gradient image (see Fig. 9). This image is a 2D scalar field representing a pseudo total derivative of the inverse projection function, computed using central differences at each point in the 2D projection space with steps of one pixel’s width and height.
In summary, regions of a projection with large gradient values illustrate where the high-dimensional distance is changing most rapidly with respect to the low-dimensional distance. Figure 8 demonstrates how values on either side of large gradient values map to larger distances in the original data space, compared to values on either side of small gradient values. While the above method uses simple finite differences, any method for computing the gradient magnitude of the inverse projection is appropriate.
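As a minimal sketch of the gradient image described above, the following Python code evaluates a central-difference gradient magnitude at the center of every pixel. The names `Inverse`, `gradient_magnitude`, and `gradient_map` are hypothetical, and the way the horizontal and vertical differences are combined (the root of the summed squared directional derivatives) is an assumption rather than necessarily the exact formula used for the figures.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical type for any trained inverse projection: 2D point -> nD point.
Inverse = Callable[[float, float], Sequence[float]]


def gradient_magnitude(inv: Inverse, x: float, y: float, w: float, h: float) -> float:
    """Central-difference estimate of how fast `inv` changes around (x, y)."""
    right, left = inv(x + w, y), inv(x - w, y)
    up, down = inv(x, y + h), inv(x, y - h)
    dx = [(a - b) / (2.0 * w) for a, b in zip(right, left)]
    dy = [(a - b) / (2.0 * h) for a, b in zip(up, down)]
    # Assumed combination: root of the summed squared directional derivatives.
    return math.sqrt(sum(v * v for v in dx) + sum(v * v for v in dy))


def gradient_map(inv: Inverse, x_min: float, x_max: float,
                 y_min: float, y_max: float,
                 cols: int, rows: int) -> List[List[float]]:
    """Evaluate the gradient magnitude at the center of every pixel."""
    w = (x_max - x_min) / cols  # pixel width
    h = (y_max - y_min) / rows  # pixel height
    return [
        [gradient_magnitude(inv, x_min + (c + 0.5) * w, y_min + (r + 0.5) * h, w, h)
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy inverse that embeds the plane in 3D without any distortion.
flat_map = gradient_map(lambda x, y: (x, y, 0.0), 0.0, 1.0, 0.0, 1.0, cols=4, rows=3)
```

In practice `inv` would wrap the trained inverse projection; the toy inverse here only serves to show that an undistorted mapping yields a constant gradient image.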
4.3.1 Example with Sphere Data
Figure 7 shows how, under a standard projection for parameterization, even a simple three-dimensional sphere is transformed into a stretched and squeezed plane. Here, two equal-length line segments are placed on the parameterized plane at different locations. When each line segment is inverse-projected to recover the coordinates on the sphere, it is clear that the relative lengths of the segments have changed considerably. The degree of change is more completely understood when it is observed in the context of the gradient map overlaid onto this plane. In this case, areas towards the poles of the globe intuitively have a gradient that approaches zero, while the equator will have the highest gradient. Thus, a line segment positioned near a pole will necessarily shrink when back-projected from the two-dimensional plane, whereas a similar line segment positioned near the equator will grow in length.
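The effect on the two segments can be reproduced with a small sketch. We assume here a longitude/latitude (equirectangular) parameterization of the unit sphere, which may differ from the parameterization used in Figure 7; `sphere_inverse` and `back_projected_length` are hypothetical helper names. Two horizontal segments of equal length in the plane are inverse-projected and their lengths on the sphere are compared.

```python
import math


def sphere_inverse(lon: float, lat: float):
    """Assumed inverse parameterization: (longitude, latitude) -> unit sphere."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def back_projected_length(start, end, inverse, samples: int = 200) -> float:
    """Approximate the length of the inverse-projected segment by a polyline."""
    total = 0.0
    prev = inverse(*start)
    for i in range(1, samples + 1):
        t = i / samples
        point = inverse(start[0] + t * (end[0] - start[0]),
                        start[1] + t * (end[1] - start[1]))
        total += math.dist(prev, point)
        prev = point
    return total


# Two segments of identical length (0.5) in the 2D plane.
equator_len = back_projected_length((0.0, 0.0), (0.5, 0.0), sphere_inverse)
polar_len = back_projected_length((0.0, 1.4), (0.5, 1.4), sphere_inverse)
print(f"near equator: {equator_len:.4f}, near pole: {polar_len:.4f}")
```

Under this assumed parameterization the segment near the pole comes back much shorter than the one at the equator, which is the relative change in length that the gradient map makes visible.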
In the top row of Figure 9, a uniformly sampled three-dimensional sphere was projected to a two-dimensional plane using t-SNE, PCA, LLE, and UMAP. In these images, the sample points on the sphere are represented by blue dots, while the background is colored with the gradient image. In the cases of t-SNE, LLE, and UMAP, the projections maintain similar gradient characteristics with respect to neighboring points. However, there are some points that project to regions of high gradient. These regions are inevitable, as tears in the three-dimensional sphere are required in order to represent it on a plane. Conversely, PCA is a linear projection method that does not seek to preserve neighborhood information between data points. As a result, the gradient map under the data points is constant and reflects the planar nature of the projection space.
4.3.2 Implications to Visual Analytics
The gradient maps shown in Figure 9 illustrate the use of inverse-projection to help users see the quality of the projection. It is relevant to note that the gradient maps do not show the topology of an embedding space created using a projection function, which is the goal of works like Stress Maps[seifert:2010:stress_maps]. Instead, these gradient maps represent the embedding space as reconstructed by the inverse-projection function. In some cases the inverse-projection function does not perfectly recover the original embedding space. For example, the top row of the PCA column in Figure 9 shows the reconstruction of a plane created by PCA in the 3D sphere dataset. Notice that the reconstructed surface is not perfectly linear, as should be the case for PCA projections.
As such, we consider the gradient map as a debugging mechanism similar to tools in the XAI community for debugging machine learning models. In particular, the gradient map can help data scientists and visual analytics researchers to better understand the effect of projection and inverse-projection when visualizing high-dimensional data. For example, the top left image in Figure 9 shows the projection of a 3D sphere using t-SNE. The intense colors denote sharp discontinuities between the parts of the 3D sphere separated by t-SNE. This information has illustrative value and can be used to help a user better understand the behavior of a projection function such as t-SNE.
We present an empirical evaluation of the inverse projection function described in the previous section. In the following, we split each dataset into a training set and a test set, and train NNInv on the training set paired with its 2D projection. Within Sections 5.1, 5.3, and 5.4 we restrict the projection to t-SNE, but also explore PCA, LLE, and UMAP in Section 5.2. We evaluate the quality of NNInv using various error metrics computed on the test set and its projection. We discuss our method in terms of quantitative assessment of quality (Sec. 5.1), qualitative assessment of quality (Sec. 5.2), our novel inverse-projection error map (Sec. 5.3), and scalability (Sec. 5.4).
5.1 Quantitative Assessment of Quality
Besides being fast, we want an inverse-projection to be accurate. That is, given a ground-truth pair of a high-dimensional point and its 2D projection, unseen during training, we want the inverse projection of the 2D point to be as close as possible to the high-dimensional point. This follows the same idea as the normalized stress metrics used to gauge the quality of projections in the literature[sorzano:2014:dim_reduction_survey, van:2009:dim_reduction_survey, espadoto19] and the classical validation of inference models in machine learning. We measure quality in our case by computing the average inverse-projection mean squared error (MSE) over the test set. The closer the MSE is to zero, the better the inverse projection is. While we minimized MAE in our loss function, we report MSE here to enable easier comparison to earlier papers[amorim:2012:ilamp, amorim:2015:rbf].
Figure 10 shows the MSE for our three datasets, two projections (t-SNE and UMAP), and three tested inverse projections (iLAMP, RBF, and NNInv). We also consider several training-set sizes to show how the MSE depends on the amount of training data. For Blobs, a relatively easy-to-project synthetic dataset, all methods except RBF have essentially zero error. MNIST and FashionMNIST show similar behavior: our method consistently achieves one of the lowest errors. Errors are larger for these complex real-world datasets than for the synthetic Blobs, which is expected.
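For concreteness, the error measure can be sketched as follows. This is an illustration under an assumption: we take the squared Euclidean distance between each test point and its reconstruction and average over the test set, without dividing by the number of dimensions; the exact normalization used for Figure 10 may differ. The function name `inverse_projection_mse` is hypothetical.

```python
from typing import Sequence


def inverse_projection_mse(originals: Sequence[Sequence[float]],
                           reconstructions: Sequence[Sequence[float]]) -> float:
    """Average squared Euclidean distance between points and reconstructions."""
    if not originals or len(originals) != len(reconstructions):
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for x, x_rec in zip(originals, reconstructions):
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total / len(originals)


# Toy test set: the first point is recovered exactly, the second is off by 2.
test_points = [(0.0, 0.0), (1.0, 1.0)]
recovered = [(0.0, 0.0), (1.0, 3.0)]
print(inverse_projection_mse(test_points, recovered))
```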
5.2 Qualitative Exploration
We explore NNInv’s performance on two well-understood synthetic datasets, Sphere and Swiss Roll (Sec. 3.1). Simple datasets where the projections are well understood give us greater ability to reason about the inverse projection. In particular it is easier to understand how error is distributed across a dataset, as well as which projections will incur higher error. To illustrate this, we once again split the datasets into training () and test sets (), this time having and percent of the total data, respectively. We then plot the projections of test portion () of these two datasets in Fig. 11 and color the points by the root-mean-squared error between the inverse-projections () and the true high-dimensional data (). Error colormaps are normalized within each image, so that we can better see error variations within a given projection. Hence, colors cannot be compared across rows or columns of Fig. 11.
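The coloring scheme just described can be sketched in two steps: a per-point root-mean-squared error, followed by a normalization that is applied separately to each image. The min-max normalization below is an assumption about how "normalized within each image" is realized, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def per_point_rmse(originals: Sequence[Sequence[float]],
                   reconstructions: Sequence[Sequence[float]]) -> List[float]:
    """Root-mean-squared error over the coordinates of each point."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x))
        for x, x_rec in zip(originals, reconstructions)
    ]


def normalize_within_image(errors: Sequence[float]) -> List[float]:
    """Min-max scale errors to [0, 1] so variation inside one plot is visible."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return [0.0 for _ in errors]
    return [(e - lo) / (hi - lo) for e in errors]


# Values in [0, 1] can then be fed to any colormap, one image at a time.
colors = normalize_within_image([1.0, 2.0, 3.0])
```

Because each image is scaled by its own minimum and maximum, the same color can stand for very different absolute errors in two different panels, which is why colors are not comparable across images.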
When analyzing inverse projection results, we must remember that the concept of error encompasses inaccuracies and faults in both the projection and the inverse-projection methods. For example, linear techniques like PCA will have a substantially different error profile compared to non-linear techniques such as t-SNE, LLE, or UMAP.
For the sphere, t-SNE and UMAP are able to peel away the surface, and error seems to congregate along the edges of the structures that make up the peel. In contrast, PCA and LLE end up with a slice out of the sphere, causing the largest error in the center of their slices. For the Swiss Roll dataset, t-SNE, LLE, and UMAP are able to remove the swirl, with UMAP and t-SNE making similar ribboned shapes and LLE unraveling to a rectangle with perspective. In contrast, PCA keeps the general shape of the spiral, causing a speckling of high error throughout the whole structure.
Projecting high-dimensional space down to 2D is inherently lossy, and each method will project the data to 2D differently. This difference is not only visual – each projection method emphasizes certain aspects of the data. As a result, different techniques throw away different portions and amounts of the high-dimensional data when performing the projection. This means that certain projection techniques will be easier to inverse-project than others.
For example, PCA does not aim to prevent overdrawing or projecting different points to the same two-dimensional location. As such, several data points can be projected to the same position in 2D space, making it impossible to correctly learn an inverse. In contrast, t-SNE and other non-linear techniques work to maintain local neighborhood relationships; when projecting a set of points, they try to preserve the relative distances in the projection that exist in the original space. In cases where there is poor preservation of inter-cluster distances, NNInv remains a valuable tool. If an area in the projection is shrunk or expanded relative to the high-dimensional space, the rate of change between inverse-projections will either increase or decrease respectively. When the interpolation moves very quickly, NNInv may be less useful for tasks like dynamic imputation (Sec. 4.1), but NNInv can help identify these spots with gradient maps (Sec. 4.3). The properties of each projection technique inform and define the types of errors exhibited during the inverse-projection process.
As Fig. 11 shows, one consequence of PCA projecting multiple distant points to a small region on a 2D plane is that the inverse-projected points will likely be erroneous. In this case, the error increases as the distance in the high-dimensional space increases between points co-located on the 2D projection. Conversely, for t-SNE and UMAP, the non-linear projections distort the input geometry, often into shapes that no longer resemble the topology of the original data. In return, the inverse-projected data points from 2D back to the high-dimensional space are much closer to their original positions, resulting in significantly smaller total error. In other words, better grouping by similarity as well as better separation of points will make inverse-projection easier.
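The effect of co-located points can be illustrated with a toy example. We use a linear projection that simply drops the third coordinate as a stand-in for PCA; this is an assumption made for illustration, not the PCA fit used in Fig. 11. Because an inverse returns a single answer per 2D location, the lowest mean squared error it can reach for two colliding points is attained at their midpoint, and that error grows with their separation. All names below are hypothetical.

```python
def drop_z(point):
    """Toy linear projection standing in for PCA: discard the third coordinate."""
    return (point[0], point[1])


def mean_point(points):
    """Coordinate-wise mean, the answer that minimizes mean squared error."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def mse_to(points, answer):
    """Mean squared distance from each point to one shared answer."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, answer)) for p in points
    ) / len(points)


def best_achievable_mse(d: float) -> float:
    """Lowest error any inverse can reach for two points separated by 2*d."""
    colliding = [(0.0, 0.0, d), (0.0, 0.0, -d)]
    # Both points land on the same 2D location.
    assert drop_z(colliding[0]) == drop_z(colliding[1])
    return mse_to(colliding, mean_point(colliding))


for d in (1.0, 3.0):
    print(d, best_achievable_mse(d))
```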
5.3 Dense map of inverse projection error
Evaluation of inverse-projection methods often uses error metrics defined for direct projections, such as stress or reprojection error[amorim:2012:ilamp, amorim:2015:rbf]. However, these metrics only gauge the error at the locations of the projected data points. The same is actually the case for all error metrics for direct projections we are aware of: they only gauge how good a (direct) projection is at the locations of the scatterplot points. As explained earlier in Sec. 2.1.2, the key use case of inverse projections is the out-of-sample one, where one inversely projects points other than the scatterplot points.
We next propose a validation approach that considers the out-of-sample case, i.e., evaluates the quality of the inverse projection at all points of the 2D projection space. We proceed as follows. Given a dataset, we construct its projection as usual with a user-chosen projection technique, and use the dataset and its projection to train our inverse projection. Next, we discretize the projection space using a pixel grid with a given resolution. Then, for every pixel, we compute the pixel given by the “round trip” of back-projecting it to the high-dimensional space and then projecting it again to 2D. To perform this, we must assume that the projection is parametric. Ideally, the round-trip pixel coincides with the original pixel for all pixels. This way, we can also assess the inverse-projection error for points in the projection space which do not correspond to projections of points in our given dataset.
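The round-trip evaluation can be sketched as follows. We assume that both the parametric projection and the inverse projection are available as plain Python callables; `round_trip_errors`, `project`, and `inverse` are hypothetical names, and the Euclidean distance between a pixel center and its round-trip position is used as the per-pixel error, which is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point2D = Tuple[float, float]


def round_trip_errors(project: Callable[[Sequence[float]], Point2D],
                      inverse: Callable[[Point2D], Sequence[float]],
                      x_min: float, x_max: float,
                      y_min: float, y_max: float,
                      cols: int, rows: int) -> List[List[float]]:
    """Distance between each pixel center p and project(inverse(p))."""
    w = (x_max - x_min) / cols
    h = (y_max - y_min) / rows
    errors = []
    for r in range(rows):
        row = []
        for c in range(cols):
            p = (x_min + (c + 0.5) * w, y_min + (r + 0.5) * h)
            p_round = project(inverse(p))  # 2D -> nD -> 2D
            row.append(math.dist(p, p_round))
        errors.append(row)
    return errors


# Toy parametric projection (drop z) with an exact and a biased inverse.
toy_project = lambda q: (q[0], q[1])
exact_inverse = lambda p: (p[0], p[1], 0.0)
biased_inverse = lambda p: (p[0] + 0.1, p[1], 0.0)

exact_map = round_trip_errors(toy_project, exact_inverse, 0.0, 1.0, 0.0, 1.0, 4, 4)
biased_map = round_trip_errors(toy_project, biased_inverse, 0.0, 1.0, 0.0, 1.0, 4, 4)
```

The grid covers the whole projection space, so the resulting error values are available at locations where no data point was projected.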
We visualize the round-trip errors as a dense map as follows. We create a hue image by bilinear interpolation of four different hues (Fig. 12a). Next, we color every pixel by the hue of its round-trip pixel and set its luminance according to the distance between the pixel and its round-trip pixel. Dense map areas which show the same color gradient as Fig. 12a thus have low inverse-projection errors. Bright areas and/or hue differences from this gradient indicate large inverse-projection errors. Scatterplot points are colored in the same way, but use a slightly lower brightness value to avoid confusion with the map pixels. Figures 12b-d show the error maps for iLAMP, RBF, and NNInv for the inverse projection of the MNIST dataset projected by t-SNE. We see that NNInv creates a color gradient which is close to the reference one, has minimal discontinuities, and has few bright spots. Hence, NNInv can inverse-project the entire 2D space without introducing large amounts of error.
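A possible way to build such a coloring is sketched below. The four corner hues, the linear mapping from round-trip error to lightness, and the function names are all assumptions made for illustration; they are not taken from the implementation behind Fig. 12. Positions are given in normalized image coordinates in [0, 1].

```python
import colorsys

# Hypothetical corner hues in [0, 1]:
# bottom-left, bottom-right, top-left, top-right.
CORNER_HUES = (0.0, 0.25, 0.5, 0.75)


def bilinear_hue(u: float, v: float, corners=CORNER_HUES) -> float:
    """Bilinearly interpolate the four corner hues at position (u, v)."""
    bl, br, tl, tr = corners
    bottom = bl + (br - bl) * u
    top = tl + (tr - tl) * u
    return bottom + (top - bottom) * v


def dense_map_color(u_round: float, v_round: float,
                    error: float, max_error: float):
    """RGB color: hue from the round-trip position, lightness from the error."""
    hue = bilinear_hue(u_round, v_round)
    brightness = min(error / max_error, 1.0) if max_error > 0 else 0.0
    # Assumed mapping: zero error keeps the pure hue, large error tends to white.
    lightness = 0.5 + 0.5 * brightness
    return colorsys.hls_to_rgb(hue, lightness, 1.0)


reference = dense_map_color(0.0, 0.0, error=0.0, max_error=1.0)
worst_case = dense_map_color(0.0, 0.0, error=1.0, max_error=1.0)
```

With this mapping, a pixel whose round trip returns to itself with zero error reproduces the reference hue image, while erroneous pixels appear with a shifted hue, a brighter color, or both.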
5.4 Scalability in Training and Inference
Scalability concerns the effort required to train our method and, separately, the effort needed for inference as a function of the size of the dataset to inversely project. Concerning training, Table III shows the number of training epochs needed to obtain convergence (defined as in Sec. 3.2) as a function of the training-set size, for all three considered datasets. Columns 2..4 indicate averages over multiple runs created by randomly sampling from the entire dataset. Overall, we obtain convergence in roughly 150 epochs for all datasets and training-set sizes.
Table III: Average number of training epochs for each dataset, per training-set size.
Figure 13 shows the inference speed for all three datasets. Speed does not depend on the projection method: once NNInv is trained, its inference time is linear in the number of inversely-projected samples. When computing inference speed, we inversely project arbitrary points in the 2D projection space and not just the projections of dataset points. Indeed, for assessing speed, we do not need ground-truth information. Moreover, in real use cases, one would inversely project unseen data, for which such ground-truth information is not available. We see that both RBF and iLAMP exhibit superlinear behavior, while NNInv (our method) is essentially linear. NNInv is roughly one order of magnitude faster than RBF and nearly two orders of magnitude faster than iLAMP for 40K samples or more. This speed-up is crucial for applications that need to inversely project hundreds of thousands of samples (or more), as in the construction of dense maps[rodrigues:2018:classifier_boundaries, espadoto:2019:nn_inv] and the maps in Sec. 4.2 and 4.3. NNInv constructs such maps in seconds, while iLAMP and RBF need (tens of) minutes, making human-in-the-loop usage of those methods impossible in visual analytics scenarios, even though such usage is one of the key reasons why dense maps are built in the first place. This scalability is one of the most important advantages of NNInv.
NNInv is scalable, accurate, and relatively smooth, as shown in Sec. 5. Yet, using a neural network does have its disadvantages[bengio:2013:deep]. A neural network (1) requires a sufficient number of good-quality training examples, (2) can be computationally expensive to train, and (3) can be hard to interpret. In Sec. 5.1 we show acceptable mean squared error with as few as 500 training examples, and caution that below that threshold, our technique will not perform as well. In all of the examples in this paper, the projections NNInv is trained on are good-quality projections, obtained by choosing reasonable values for the projection’s hyperparameters. Good-quality projections are generally more likely to have the properties (as described in Sec. 5.2) required for accurate inverse-projection. While NNInv is useful in helping to interpret projections (e.g., Fig. 2), it can be difficult to reason about NNInv itself, since neural networks are hard to interpret in general. That is, our metrics show that NNInv performs better, as it can approximate nonlinear patterns, but it is not obvious how NNInv does this. We leave the explainability of NNInv’s improved performance to future work.
7 Discussion and Future Work
Future inverse-projection research can take several interesting directions. Of particular relevance is the discussion in Sec. 5.2 regarding the properties of projection techniques and their inverses.
As Sec. 5.2 shows, when discussing the invertibility of projection functions, we find that not all projection methods are equally suitable for the inverse-projection method: PCA is worse than t-SNE or UMAP because multiple data points can be projected into the same 2D pixel.
Interestingly, a type of projection that is designed specifically for its invertibility is the encoder portion of an autoencoder. When trained together with the decoder, the entire process optimizes for the recoverability of data points from input to output. Yet, a user would have a hard time understanding the embedding of a regular encoder because there is no intentionally designed structure in the embedding space created with an encoder. Also, there are no guarantees about neighborhood preservation or relative distance preservation.
The tradeoff between the understandability of the latent space created by a projection and the suitability of the projection for learning its inverse is interesting. On the one hand, a projection technique may sacrifice some information to create a more insightful, or more spatially intuitive, visualization. On the other hand, the use of inverse-projection can lead to novel visualization and interaction techniques that better help the user explore and understand a high-dimensional space. Further steps should be taken to find a happy medium between these two extremes, whether that be autoencoders with some cost for occlusion or for spacing items too far apart, or a projection technique with a greater loss for discarding information.
Accessible and fast inverse projections will have far-reaching impacts on visual analytics (VA) systems that use projections.
We believe that a deep learning approach to inverse projection is especially accessible given today’s robust ecosystem for neural network development[chollet:2015:keras, google:2019:kerastuner, abadi:2015:tensorflow, paszke:2019:pytorch]. We hope that future work along this line of research continues to leverage approachable methods and libraries that ease adoption for tool builders. The most promising use case is hypothesis generation made possible by dynamic imputation (Sec. 4.1), but several other augmentations exist, e.g., adding extra information showing how models understand the data space in the same vein as classifier agreement maps (Sec. 4.2), or helping projection users to better understand the underlying structure as in gradient maps (Sec. 4.3). We are particularly interested in how combinations of these techniques, such as hypothesis generation paired with gradient-map-style backgrounds, can help users who are less familiar with projection techniques make sense of overview projections in VA applications.
Lastly, we believe there are several applications of this technique that should be explored further. Projections and inverse-projections can be used to explore the space of different 2D charts that have themselves been projected to 2D (in a manner similar to ChartSeer[zhao:2020:chartseer]), and data that is often modeled on graphs, such as molecular data.
In this paper, we present NNInv, a deep learning approach to learning the inverse of projection functions. Similar to existing works such as iLAMP and RBF, NNInv is agnostic of the projection used, i.e., it can learn to invert any projection algorithm (such as PCA, t-SNE, UMAP, or LLE). NNInv uses a trained neural network to learn the approximate mapping from a given 2D scatterplot produced by a projection algorithm to the corresponding high-dimensional data. We find that NNInv can be more accurate than iLAMP and RBF on both synthetic and real-world datasets, and is more scalable to large datasets: once trained, NNInv can perform inference in less than 10 milliseconds when running in a browser on a laptop, which makes NNInv a more suitable technique than iLAMP and RBF for interactive visualizations. Lastly, we show the potential of NNInv for analysis tasks such as hypothesis generation, classifier agreement, and gradient visualization. These three areas are important to the field of visual analytics and serve as evidence of the potentially broad applicability of NNInv in high-dimensional data exploration and analysis.
This work was supported by National Science Foundation grants IIS1452977, OAC-1940175, OAC-1939945, DGE-1855886, DARPA grant FA8750-17-2-0107, and DOD grant HQ0860-20-C-7137. We would also like to thank the reviewers for their helpful feedback.