1 Introduction
High-dimensional data is created at unprecedented rates by scientific fields as diverse as information technology, bioinformatics, and astronomy[buhlmann:2011:stats_for_high_dim]. As a result, there is a growing need for visualization and interaction methods for high-dimensional data. A common choice is to project the high-dimensional data to two dimensions using methods such as t-SNE[maaten:2008:tsne], PCA[pearson:1901:pca], LLE[roweis:2000:lle], or UMAP[mcinnes:2018:umap], among the many other existing options[espadoto19].
Projections allow better insight into the overall structure of data and can be enriched by interactions that allow users to reason about the corresponding high-dimensional data by selecting, brushing, and querying the 2D scatterplots they create. For example, t-SNE has allowed computational biologists to investigate human genetic data, revealing otherwise obfuscated population stratification[Li:2017:application_of_tsne]. However, any projection technique will create errors when mapping complex and high-dimensional datasets to a low number of dimensions[martins14, nonato18]. Moreover, projections are often complex algorithms, so the way they map the high-dimensional data to low-dimensional space can be difficult for users to fully interpret. As such, additional mechanisms need to complement projections to empower users to better explore the high-dimensional data.
Recently, attention has turned to inverse-projection, a process that allows one to compute the inverse mapping from the projection space back to the original high-dimensional data space[amorim:2012:ilamp]. Also called back-projections, these methods help users to explore projections by allowing a user to interactively query the projection space to find high-dimensional data points. These points correspond to specific locations in the low-dimensional projection. Inverse-projections are also instrumental in explaining the decision boundaries of machine learning classifiers[rodrigues:2019:classifier_boundaries] and data augmentation scenarios[amorim:2012:ilamp]. In contrast to the many existing projection techniques[espadoto19], only a handful of inverse projection algorithms exist, including iLAMP[amorim:2012:ilamp] and its extension that uses radial basis functions (RBFs)[amorim:2015:rbf]. Algorithms like iLAMP and RBF are quite slow, and have multiple free parameters, making it hard to use them in interactive data exploration scenarios[rodrigues:2019:classifier_boundaries].
In this paper, we present NNInv, a technique for computing the inverse of any projection using a deep learning approach. Our idea is inspired by the recent work of Espadoto et al.[espadoto:2020:innp] that demonstrates that deep learning can learn to imitate the style of any projection technique, and is parametric and stable to data changes (and thereby offers out-of-sample capabilities). Following their approach, we show that NNInv is a scalable, robust, high-quality inverse projection method which supports multiple applications.
Using NNInv, we introduce three use cases across a number of wellknown datasets to illustrate how the use of inverseprojection can improve the user’s interaction, exploration, and understanding of highdimensional data in a 2D visualization. Additionally, we provide an evaluation of iLAMP, RBF, and NNInv in terms of scalability and accuracy. To this end, we provide a novel visualization for evaluating the joint quality of a pair of inverseprojection and directprojection methods. We make a point of studying the NNInv inverseprojection method on two synthetic datasets with wellknown topology (i.e., a 3D sphere dataset and a 3D swissroll dataset), allowing us to illustrate the behaviors of the learned inverse function.
Applications for this work are numerous. First, we use inverseprojections to explore the “empty” spaces in a 2D projection of highdimensional data. While the user interactively brushes such spaces, highdimensional instances corresponding to the visited 2D points are synthesized and displayed, thus allowing one to form a better mental map of how the 2D image represents the entire highdimensional data space, beyond how a 2D scatterplot represents a highdimensional dataset.
Second, we present a visualization of the decision boundary of an ensemble classifier. Visualizing cluster boundaries can help users see patterns within the data and the behaviors of the classifiers (see the classifier comparison by ScikitLearn [scikitboundary]). We show that the use of an inverse projection method such as NNInv makes it possible to visualize this important information with highdimensional data (see Fig. 2). Finally, we introduce a gradient map visualization to help users find projection artifacts. This method highlights regions where the projection shrinks and expands the relationships between points by visualizing the rate of change between the learned 2D embedding and the original highdimensional space. We show that NNInv provides an alternative approach to helping the user “see” the highdimensional space in 2D.
In summary, the main contributions of this paper are:

A deep learning approach to inverse-projection, which is fast enough to be used at scale.

A comparison to existing inverse-projection methods.

An exploration of the behavior of NNInv on datasets with well-known topology.

Two novel visualizations for evaluating inverse-projection methods.

A showcase of visual exploration techniques enabled by inverse-projection.
2 Related Work
We divide our discussion of related work into two main topics – visualization of highdimensional data (Sec. 2.1) and deep learning latent spaces (Sec. 2.2).
2.1 Visualization of High-Dimensional Data
We first list the notation used for the remainder of the paper. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be an $n$-dimensional data point, also called a sample or an observation. Let $D = \{\mathbf{x}_i\}$, $1 \le i \le N$, be a dataset of $N$ such samples. The need to examine, interpret, and explore high-dimensional datasets is not new. As early as the 1970s, Andrews[andrews:1972:plot_greater_than_2d] recognized the need to visualize data whose dimension exceeded the limits of what can conventionally be drawn on a 2D plane. Geng[geng:2013:3d_display] presented several techniques and systems, from stereoscopic imaging to volumetric displays, that allow visualizing three-dimensional data. Yet, as the dimensionality grows beyond three or four, it becomes clear that increasing the dimensionality of display technology is not a solution.
2.1.1 Projections
Also called Dimensionality Reduction (DR) methods, projections are techniques that aim to go beyond the aforementioned limitations of high-dimensional visualization techniques[gorban:2008:principal_manifold_techniques, van:2009:dim_reduction_survey, joia:2011:lamp, silva:2012:user_centered_projection, sorzano:2014:dim_reduction_survey, sacha:2016:dim_reduction_interaction_survey, jeong:2009:ipca]. Formally put, a projection technique is a function $P: \mathbb{R}^n \to \mathbb{R}^q$ that maps every point $\mathbf{x}$ of a high-dimensional dataset $D$ to a low-dimensional counterpart $\mathbf{p} = P(\mathbf{x})$. Typically $q \in \{2, 3\}$, which allows directly depicting the projection as a 2D or 3D scatterplot, respectively.
Projection techniques aim to preserve the so-called data structure between the original dataset $D$ and its low-dimensional counterpart $P(D)$. Structure is captured in terms of inter-point distances[roweis:2000:lle, paulovich:2008:least, joia:2011:lamp], point neighborhoods[maaten:2008:tsne, mcinnes:2018:umap], or clusters[paulovich2006text]. Projections can be further classified as linear[cunningham:2015:linear_dim_reduction_survey] or non-linear[yin07_survey, van:2009:dim_reduction_survey]. Linear techniques, such as PCA, are simple and fast to compute, have an intuitive geometric interpretation, and have a robust association with statistical analysis. Non-linear techniques, such as UMAP, are generally more computationally expensive, but strive to represent local neighborhood information with minimal distortion. There are also a number of projection techniques that generally fit under the RadViz family[hoffman:1997:radviz, angelini:2019:enhancing_radviz, pagliosa:2019:radviz++]. RadViz visualizes multidimensional data in 2D by anchoring each feature around the perimeter of a circle, and leverages spring forces from those anchor points to assign each instance a location inside the circle. Projection techniques are further classified, analyzed, and compared both theoretically and practically in a number of surveys[hoffman02, bunte11, sorzano:2014:dim_reduction_survey, maljovec15, nonato18, espadoto19], to which we refer the reader.
All projection techniques transform data between the original space $\mathbb{R}^n$ and the projection space $\mathbb{R}^q$. Several techniques aim to show errors in this process, i.e., areas in $P(D)$ that may miss or not reflect actual structures in $D$. For example, Stress Maps[seifert:2010:stress_maps] is a visual analysis tool that displays the local stress values, or how local distance relationships have changed, under a projection algorithm. Other error metrics and subsequent visualization mechanisms include trustworthiness and continuity[venna10], false and missing neighbors[martins14, martins15], and false neighborhoods and tears[aupetit:2007:visualizing_distortions_in_projections, lespinats:2011:checkviz]. Surveys of such metrics are given in[nonato18, espadoto19]. In order to demonstrate potential changes of $P(D)$ caused by a hypothetical perturbation of the data in $D$, DimReader[faust:2018:dim_reader] utilized a filled contour plot in the background. t-viSNE[kerren:2020:tvisne] focuses on helping users understand t-SNE projections, such as how hyperparameters affect the properties of the final projection. Probing Projections[stahnke:2015:probing_projections] allows users to display the value of any attribute through a background heat map, and also enables users to correct distance errors in the projection by moving individual points in the 2D projection space. A similar technique is proposed by LAMP[joia:2011:lamp]. In contrast, Dis-Function[brown:2012:dis] updates the mapping from the user's dragging of data points to generate new, and hopefully better, projections. Sirius[dowling:2019:sirius] allows practitioners to investigate both the observations and attributes of a dataset through symmetric projections.
Choosing a good projection is challenging: such a projection should yield a low projection error on a given family of datasets, be simple to use in terms of parameter setting, be robust to small changes in the data $D$, and be computationally scalable to large dimensions and sample counts. Recently, Neural Network Projection (NNP)[nnp] was proposed as a method to achieve these goals by leveraging deep learning: Given a dataset $D$, a small subset $D_s \subset D$ is chosen and projected by any user-chosen technique $P$. After a suitable projection is obtained by tuning $P$'s parameters, a fully-connected feed-forward neural network is trained to infer $P(D_s)$ from $D_s$. The trained network is then used to project any data drawn from a similar distribution as $D$. NNP has shown remarkable ability in producing projections that mimic a wide range of techniques on many types of datasets, with little or no parameter tuning[espadoto:2020:innp]. Moreover, NNP is parametric, making it robust to small-scale data changes in $D$ while also providing an out-of-sample capability; that is, NNP learns a continuous function rather than a discrete mapping formed by a non-parametric projection. NNP is important as a basis for the discussion of inverse-projections in the next section.
2.1.2 Inverse Projection
Inverse-projection can be seen as a function $P^{-1}: \mathbb{R}^q \to \mathbb{R}^n$, which should ideally be the mathematical inverse of a given projection $P$, i.e., $P^{-1}(P(\mathbf{x})) = \mathbf{x}$. A crucial component of inverse-projections is that they should have an out-of-sample ability that can be expressed as a continuous mapping, which is generally not the case for direct (discrete) projections. Thus, $P^{-1}$ can be used to invert points that fall between the points of the scatterplot $P(D)$, helping the user to understand what kind of data samples could project at a particular location in $\mathbb{R}^q$. This ability further supports applications such as data augmentation and classifier exploration[rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries].
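To make the round-trip property concrete, the following minimal sketch uses PCA, whose inverse is available analytically, on toy data that lies exactly on a 2D plane. It is an illustration only, not NNInv; the function names and the toy data are assumptions introduced here.

```python
import numpy as np

def fit_pca_2d(data):
    """Return (mean, components) of a 2-component PCA fitted to `data` (N x n)."""
    mean = data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:2]

def project(x, mean, components):
    """Direct projection: n-D points -> 2-D points."""
    return (x - mean) @ components.T

def inverse_project(p, mean, components):
    """Analytical inverse of the PCA projection: 2-D points -> n-D points."""
    return p @ components + mean

rng = np.random.default_rng(0)
# Toy data lying exactly on a 2-D plane embedded in 5-D, so the inverse is exact.
basis = rng.normal(size=(2, 5))
data = rng.normal(size=(200, 2)) @ basis + 3.0

mean, components = fit_pca_2d(data)
points_2d = project(data, mean, components)
reconstructed = inverse_project(points_2d, mean, components)
round_trip_mae = np.abs(reconstructed - data).mean()

# Out-of-sample query: a 2-D location between two projected points.
midpoint = 0.5 * (points_2d[0] + points_2d[1])
synthesized = inverse_project(midpoint, mean, components)
```

For non-linear projections such as t-SNE no such closed-form inverse exists, which is why the inverse has to be approximated.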
Inverse-projection is inherently harder than direct projection due to the need for an out-of-sample ability and the fact that $P^{-1}$ needs to synthesize a high number of dimensions $n$ from a lower dimension $q$. Early on, autoencoders were proposed to jointly infer both $P$ and $P^{-1}$ by deep learning to minimize the projection error from $D$ to $P^{-1}(P(D))$[hinton2006reducing]. While autoencoders are parametric, the resulting mappings are not always intuitive[van:2009:dim_reduction_survey] and autoencoders can be difficult to train[vernier20]. Amorim et al. approach inverse-projection in iLAMP[amorim:2012:ilamp] by using local affine transformations, following the earlier idea of the LAMP direct projection technique[joia:2011:lamp]. Mamani et al. also use local affine transformations in their inverse-projections as a part of their work on user-driven feature space transformation[mamani:2013:user_driven_feature_space_transformation]. iLAMP was later extended by leveraging radial basis functions (RBFs) to provide a smoother inverse mapping $P^{-1}$, which was shown to be useful for data augmentation[amorim:2015:rbf]. Kriegeskorte and Mur[kriegeskorte:2012:inverse_mds] proposed inverse MDS, which infers pairwise dissimilarities from multiple 2D arrangements of items. Cavallo et al.[cavallo:2018:praxis] used inverse-projection in Praxis, an interactive exploratory analysis tool for high-dimensional data. The authors leveraged the analytical inverse of PCA in addition to an autoencoder to both project and inverse-project data. Similarly, Zhao et al.[zhao:2020:chartseer] used a Grammar Variational Autoencoder (GVAE)[kusner2017grammar] to project and inverse-project data charts for steering exploratory visual analysis.
2.2 Latent Spaces with Neural Networks
Recent developments in machine learning and AI have shown that deep learning approaches are both accurate and flexible when used and trained properly[goodfellow19]. In general, neural network encoders work by learning a mapping from the input data space to a lower dimensional representation called the latent space. This mapping, conceptually similar to our projection $P$, is often difficult to interpret as the latent dimensions are abstract. Moreover, the neural network's operation is harder to understand than the equivalent operation of a typical projection function $P$.
2.2.1 Interpreting the Latent Space
Interactive visualization tools have been developed to help with analysis tasks that give a better understanding of latent spaces. In particular, when a neural network model has a generative component (e.g., autoencoders and Generative Adversarial Networks), its latent space can be explained by bringing data points back to the original space via its generative component. Liu et al.[liu:2019:latent_space_catography] presented a latent space cartography (LSC) visual analysis system for vector space embeddings. The LSC system was created to address common interpretation tasks for latent spaces. It provides a means to both quantify attribute vector uncertainty and compare multiple attribute vectors. Spinner et al.[spinner:2018:towards_interpretable_latent_space] also used latent spaces to visually compare autoencoders with variational autoencoders. A number of techniques have been developed in order to try to disentangle the latent features of autoencoders[higgins:2017:beta_vae, kim:2019:disentangling_factorising, chen:2019:isolating_disentanglement_vae]. A recent work by Gou et al. moved these advances forward within a full visual analytics system for traffic light detection[Gou:2020:valtd]. Additional visualizations making use of, and explaining, latent spaces are discussed in a recent survey[garcia18].
3 Learning the Inverse Projection
Figure 1 shows the operation of NNInv. Given a dataset $D = \{\mathbf{x}_i\}$, $1 \le i \le N$, of $n$-dimensional points, let $P(D) = \{\mathbf{p}_i\}$, $\mathbf{p}_i = P(\mathbf{x}_i)$, be its projection by any user-chosen projection method $P$. In practice, $P(D)$ is a two-dimensional scatterplot, so $q = 2$. NNInv constructs an approximation of the inverse of $P$ by using deep learning. Let
$$\hat{\mathbf{x}} = P^{-1}_{\theta}(\mathbf{p}) \tag{1}$$
be an $n$-dimensional point inferred by the neural network from a 2D point $\mathbf{p}$. Here, $\theta$ are the learned parameters of the function (i.e., the weights of the network). To train the model, we minimize the loss between each predicted $\hat{\mathbf{x}}_i$ and true $\mathbf{x}_i$ within the training set ($P(D)$, $D$) using some loss function.
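As a worked form of this training objective, and assuming the mean absolute error reported as the loss in Sec. 3.2, the minimization can be written as below. The symbols are notation assumed here for illustration: $P^{-1}_{\theta}$ is the network with weights $\theta$, $\mathbf{p}_i$ the 2D projection of sample $\mathbf{x}_i$, $N$ the number of training samples, and $n$ the data dimensionality.

```latex
\theta^{*} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n}
  \left\lVert P^{-1}_{\theta}(\mathbf{p}_i) - \mathbf{x}_i \right\rVert_{1}
```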
3.1 Data
We used five different datasets across our evaluation and proposed applications.
MNIST: This dataset[lecun:2010:mnist] consists of grayscale images of hand-drawn digits, zero through nine. Each image has a resolution of $28 \times 28$ pixels. The images have been translated so that the center of mass of the pixels is at the center of the image. The MNIST dataset is commonly used to illustrate and measure the quality of projection techniques[maaten:2008:tsne, van:2009:dim_reduction_survey, espadoto19, nnp, espadoto:2019:nn_inv].
FashionMNIST: This dataset[xiao:2017:fashion_mnist] is constructed in the same manner as the original MNIST dataset, but contains pictures of different items of clothing. It was designed as a slightly more difficult replacement for the MNIST dataset.
Blobs: This synthetic dataset consists of points sampled from a Gaussian distribution with 5 different centers (clusters) in a multi-dimensional space.
Sphere: This dataset consists of points uniformly sampled from a 3D unit sphere. It allows us to clearly demonstrate the behaviour of the projection techniques included and, more importantly, offers a simple illustration of our gradient map visualization.
Swiss Roll: This dataset consists of points sampled from a densely-sampled 2D patch which was smoothly mapped to a “roll” in 3D. It is commonly used to gauge the capability of projections to “unroll” the data back to its 2D configuration[amorim:2012:ilamp, joia:2011:lamp, balasubramanian2002isomap].
3.2 Implementation
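| Dataset      | Layer 1 | Layer 2 | Layer 3 | Layer 4 | MAE      | STD      |
|--------------|---------|---------|---------|---------|----------|----------|
| Blobs        | 64      | 128     | 256     | 512     | 0.036941 | 3e-06    |
| Blobs        | 128     | 256     | 512     | 1024    | 0.036944 | 2.5e-05  |
| Blobs        | 256     | 512     | 1024    | 2048    | 0.036945 | 2.1e-05  |
| Blobs        | 640     | 1280    | 1280    | 640     | 0.03695  | 3.3e-05  |
| Blobs        | 240     | 240     | 240     | 240     | 0.036961 | 3.1e-05  |
| MNIST        | 128     | 256     | 512     | 1024    | 0.06241  | 0.000425 |
| MNIST        | 640     | 320     | 320     | 640     | 0.062606 | 0.000113 |
| MNIST        | 640     | 320     | 320     | 640     | 0.062787 | 0.000551 |
| MNIST        | 480     | 480     | 480     | 480     | 0.06303  | 5e-05    |
| MNIST        | 480     | 480     | 480     | 480     | 0.063168 | 0.000258 |
| FashionMNIST | 1024    | 2048    | 4096    | 8192    | 0.072804 | 0.000411 |
| FashionMNIST | 1280    | 2560    | 2560    | 1280    | 0.072873 | 0.000136 |
| FashionMNIST | 512     | 1024    | 2048    | 4096    | 0.073108 | 0.000268 |
| FashionMNIST | 1280    | 640     | 640     | 1280    | 0.073209 | 6.5e-05  |
| FashionMNIST | 256     | 512     | 1024    | 2048    | 0.073214 | 0.00064  |
| Swiss        | 64      | 128     | 256     | 512     | 0.011698 | 0.000489 |
| Swiss        | 256     | 512     | 1024    | 2048    | 0.012288 | 0.001286 |
| Swiss        | 640     | 1280    | 1280    | 640     | 0.013136 | 0.000805 |
| Swiss        | 160     | 320     | 320     | 160     | 0.013209 | 0.001    |
| Swiss        | 640     | 320     | 320     | 640     | 0.013929 | 0.001523 |

Table I: Columns Layer 1 to Layer 4 show the number of neurons used in the respective hidden layers.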
We next describe the design and tuning of the neural network used to learn the inverse projection. Following [elsken:2019:nas_survey], and also the method used to tune NNP [espadoto:2020:innp], we used grid search to explore different architecture configurations: total number of neurons, neurons per layer, and dropout values.
We ran the grid search across the four datasets introduced in Sec. 3.1. As the direct projection $P$, we used t-SNE, which was earlier shown to be the hardest projection from a set of nine different projections to mimic via deep learning[nnp]. Hence, we believe that t-SNE is also a hard challenge to invert via NNInv. We varied the training-set size between 5250, 10500, 21000, and 42000 samples. To account for variation in random initialization of the neural network weights, we ran each configuration three times and averaged the results into a single error score. We measure quality via mean absolute error (MAE), and also provide its standard deviation across the three runs. Training is stopped automatically on convergence, defined as the moment when the validation loss stops decreasing. We next discuss the hyperparameters investigated.
Network Architectures: We restricted ourselves to fully-connected layers and used four hidden layers in each configuration. We varied the network shape and number of neurons in each layer. The total number of neurons in each network varied between 240, 480, 960, 1920, 3840, 7680, and 15360. We experimented with four network shapes (see Table II).
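| Shape      | Example hidden-layer sizes (taken from Table I) |
|------------|--------------------------------------------------|
| straight   | 240, 240, 240, 240                               |
| wide       | 640, 1280, 1280, 640                             |
| bottleneck | 640, 320, 320, 640                               |
| fanout     | 64, 128, 256, 512                                |

Table II: The four network shapes.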
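The relationship between a shape, a total neuron budget, and the four hidden-layer sizes can be sketched as follows. The width ratios are an assumption inferred from the configurations listed in Table I (for example 64, 128, 256, 512 for fanout); `SHAPE_RATIOS` and `layer_sizes` are hypothetical names introduced for this illustration.

```python
# Relative layer widths per shape; these ratios are inferred from the
# configurations listed in Table I and are an assumption of this sketch.
SHAPE_RATIOS = {
    "straight": (1, 1, 1, 1),
    "wide": (1, 2, 2, 1),
    "bottleneck": (2, 1, 1, 2),
    "fanout": (1, 2, 4, 8),
}

def layer_sizes(shape, total_neurons):
    """Split `total_neurons` over four hidden layers according to `shape`."""
    ratios = SHAPE_RATIOS[shape]
    unit, remainder = divmod(total_neurons, sum(ratios))
    if remainder:
        raise ValueError(
            f"{total_neurons} neurons do not divide evenly for '{shape}'"
        )
    return [unit * ratio for ratio in ratios]

# Reproduce four configurations that appear in Table I.
straight_960 = layer_sizes("straight", 960)
wide_3840 = layer_sizes("wide", 3840)
bottleneck_1920 = layer_sizes("bottleneck", 1920)
fanout_960 = layer_sizes("fanout", 960)
```

Under these assumed ratios, every total in the grid (240 up to 15360) divides evenly for all four shapes.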
Activation Functions: We used a ReLU activation function for all hidden layers. Since the input data is normalized such that each of the dimensions ranges over $[0, 1]$, we used a sigmoid activation function on the output layer.
Regularization: We used both early stopping and dropout, testing three different dropout probabilities. Experiments showed that dropout was not generally effective. We believe that this is because overfitting is unlikely given that we used smaller networks and early stopping.
Loss Function: For the loss function we used MAE.
Optimizer: We used the Adam optimizer[kingma_adam_2014], given its good performance with NNP[nnp, espadoto:2020:innp].
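The design choices above (four fully-connected hidden layers with ReLU, a sigmoid output, and an MAE loss) can be illustrated with a minimal NumPy forward pass. This is a sketch under stated assumptions and not the implementation used for the experiments: the weight initialization, the random placeholder batches, and all function names are assumptions, and training with Adam and early stopping is omitted.

```python
import numpy as np

def init_params(layer_dims, rng):
    """Create (weights, bias) pairs for consecutive layer dimensions."""
    params = []
    for fan_in, fan_out in zip(layer_dims[:-1], layer_dims[1:]):
        # He-style scaling is an assumption; the text does not specify it.
        weights = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((weights, np.zeros(fan_out)))
    return params

def forward(points_2d, params):
    """Map 2-D points to n-D points: ReLU hidden layers, sigmoid output."""
    activations = points_2d
    for weights, bias in params[:-1]:
        activations = np.maximum(activations @ weights + bias, 0.0)  # ReLU
    weights, bias = params[-1]
    logits = activations @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid bounds outputs to [0, 1]

def mean_absolute_error(predicted, target):
    """MAE averaged over all samples and dimensions."""
    return np.abs(predicted - target).mean()

rng = np.random.default_rng(0)
n_dims = 784                     # e.g., a flattened 28 x 28 image
hidden = [128, 256, 512, 1024]   # one fanout configuration from Table I
params = init_params([2] + hidden + [n_dims], rng)

batch_2d = rng.uniform(size=(16, 2))        # placeholder projected points
batch_nd = rng.uniform(size=(16, n_dims))   # placeholder normalized targets
predicted = forward(batch_2d, params)
loss = mean_absolute_error(predicted, batch_nd)
```

Because the network is untrained here, the loss value itself carries no meaning; the sketch only shows how the pieces fit together.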
Table I shows the MAE and standard deviation (STD) results for the top-five configurations among those tested, i.e., the ones obtaining the lowest error. The full results, including all configurations tested, are available in the supplemental material. The best architectures for each dataset either had the same number of neurons in each layer, or used a widening architecture which doubles or quadruples the number of neurons in each successive layer.
Our results suggest that architectures smaller than the original one from Espadoto et al.[espadoto:2019:nn_inv] can be used. The only dataset that performs better with more than 1920 neurons is the FashionMNIST dataset. For this dataset, we obtain a slightly lower MAE when using 7680 neurons. However, as in most cases observed, the error decrease is negligible compared to the increase in complexity (network size). Summarizing our findings from the tested datasets, we offer suggestions for future experimentation: (1) networks should follow a straight or fanout style shape as described in Table II, and (2) even relatively small networks can perform exceptionally well at this task, as seen in Table I.
4 Applications of Inverse Projection in Visual Analytics
Traditional error calculations may tell something of the overall loss, i.e., going from high-dimensional space to 2D and back to the original space. However, a robust analysis of an inverse-projection technique must include more than just this type of error. This section focuses on a qualitative evaluation of inverse projections using applications that are of interest to the visualization community. In particular, we explore three use cases of inverse-projection in visual analytics: (1) direct interpolation of high-dimensional data using the 2D screen, (2) leveraging the generation of high-dimensional data across the screen to color each pixel by classifier agreement, and (3) using the generated high-dimensional data to illustrate high gradient areas of the projection.
4.1 Case Study 1: Dynamic Imputation
One shortcoming of the current use of projection methods is that the projections are “oneway streets.” From a user interaction and exploration standpoint, the most that a user can do using such techniques is to select a data point in the (2D) visualization and look up the original values of that point in highdimensional space. Due to this limitation, the user’s exploration of the data is restricted. For example, the user would have no easy way of knowing why two data points appear close to each other in the 2D space, or what other data points, if they existed, would appear near or between these points.
4.1.1 Example with MNIST
In this case study, we demonstrate the use of inverse-projection to perform “dynamic imputation.” The inspiration for this case study comes from recent works by Cavallo et al.[cavallo:2018:praxis], who explore inverse-projection with PCA and autoencoders, and Kwon et al.[kwon:2020:deep_generative_graphs], who generate graph layouts from the user's interactions in a 2D latent space.
Consider Figure 2: The user can select projected data points in a 2D visualization (of the MNIST dataset) and see their original values (see Figure 2A and Figure 2E), similar to traditional visual analytics systems. However, with the use of inverse-projection, the user can also select an “empty” space between these data points (see the three inner images). The inverse-projection function implicitly performs imputation (i.e., generating a new data point) when performing inference over the 2D pixel location to find its position in high-dimensional space.
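The interaction described above amounts to evaluating the inverse projection at 2D locations between existing points. A minimal sketch follows; the linear `toy_inverse` is a stand-in assumption for a trained inverse projection, and the function names are hypothetical.

```python
import numpy as np

def interpolate_and_invert(start_2d, end_2d, inverse_projection, steps=5):
    """Inverse-project `steps` evenly spaced 2-D points from start to end."""
    weights = np.linspace(0.0, 1.0, steps)[:, None]
    path_2d = (1.0 - weights) * start_2d + weights * end_2d
    return path_2d, inverse_projection(path_2d)

# Stand-in for a trained inverse projection: a fixed linear map from 2-D to 4-D.
# A real system would call the trained network here instead.
LIFTING = np.array([[1.0, 0.0, 2.0, 0.0],
                    [0.0, 1.0, 0.0, 3.0]])

def toy_inverse(points):
    return points @ LIFTING

path_2d, path_nd = interpolate_and_invert(
    np.array([0.0, 0.0]), np.array([1.0, 2.0]), toy_inverse, steps=5
)
```

Each row of `path_nd` is a synthesized high-dimensional sample for the corresponding 2D location in `path_2d`; with image data, each row would be reshaped and displayed as an image.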
Since the inference step of a trained neural network is fast, this computation can be done in a web browser and be made fully interactive using mouse hovering. In this example, the computation time of inverse-projecting a point in tensorflow.js on an Intel i7-8650U CPU is below 10 milliseconds. With this high degree of interactivity, the user can quickly explore both the high-dimensional dataset as well as the high-dimensional space (between data points) itself. Figure 3 further showcases the ability of using inverse-projection to interpolate between clusters. Here, the images furthest to the left and right represent two visually distinct objects (i.e., a pair of pants and a dress, and the digits 6 and 5). The images in between are interpolations generated by the inverse-projection algorithm.
4.1.2 Evaluation of the Inverse Projection
Using this framework, we can also visually evaluate the quality of the inverseprojection algorithm. Specifically, instead of selecting an “empty” pixel, a user can select the 2D position of an existing data point. We can then compare the original values of the data point with the values generated by the inverseprojection algorithm. For example, the two images on the upper left side of Figure 4 are two different styles of pants from the original FashionMNIST dataset. The two images directly below, on the lower left side, are those generated by the inverseprojection. Similarly, images on the upper right side of Figure 4 are from the original MNIST dataset, and images on the lower right side of Figure 4 are generated. In both cases, the generated images are “blurrier” than the originals. However, it is shown that the inverseprojection function has successfully learned the important visual features of these images and can reproduce them with high fidelity.
4.1.3 Implications to Visual Analytics
Although we used two relatively simple image datasets in this case study (MNIST and FashionMNIST) for illustrative purposes, the use of inverse-projection for dynamic imputation should be extendable to visual analysis of other high-dimensional datasets, including temporal data, geographic data, and tabular data. As such, having an accurate inverse-projection function in a visual analytics system can allow the system designer and the user to explore high-dimensional data in ways that have not been possible. For example, in the context of business analysis, the use of inverse-projection for data imputation can serve as a “hypothesis generator” (e.g., Figure 2). With inverse-projection, the user can interpolate between the 0 and the 6 from the original data, and use inverse-projection to generate hybrid examples in between. While the generated data points are estimates of the inverse-projection function, they may nonetheless serve as potential hypotheses for an analyst to further explore.
4.2 Case Study 2: Model Agreements
Previous work in defining and interpreting back projections has shown that creating dense pixel maps in the 2D projection space can provide additional insight into the behavior of classification type tasks[espadoto:2019:nn_inv, rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries]. Figure 5 shows how this concept is extended to highlight regions of lower classification agreement.
4.2.1 Example with MNIST and FashionMNIST
We demonstrate ensemble classification confidence by creating dense pixel maps to show the classifier agreement of two of the ten digits in the MNIST dataset (digits 1 and 7) and two of the objects in the FashionMNIST dataset (handbags and shirts). While there is nothing preventing the technique from being extended to multi-class classification, as in previous work[espadoto:2019:nn_inv, rodrigues:2018:classifier_boundaries, rodrigues:2019:classifier_boundaries], we limit ourselves to binary classification. In both cases, we begin by inverse-projecting each screen pixel to learn its position in the high-dimensional space. This high-dimensional point is then put through some number (greater than one) of classification methods. Since our dataset only contains two classes, each of the classifiers will simply assign a data point to one class or the other. We then color the pixel based on the number of classifiers that predicted each class. As shown in Figure 5, we color a pixel bright blue if the majority of classifiers predicted class one, and bright red if the majority of classifiers predicted class two. In between these two extremes, pixels are colored by decreasing the amount of saturation such that complete disagreement between the models results in a white pixel; that is, half of the classifiers say the data point inverse-projected from the respective pixel belongs to class 1, while the other half say the point belongs to class 2.
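The per-pixel coloring rule can be sketched as a small function that maps the classifiers' binary votes to an RGB color. The linear blend toward white and the exact RGB values are assumptions made for this illustration; the text only specifies that saturation decreases with disagreement.

```python
import numpy as np

BLUE = np.array([0.0, 0.0, 1.0])   # all classifiers vote for class one
RED = np.array([1.0, 0.0, 0.0])    # all classifiers vote for class two
WHITE = np.array([1.0, 1.0, 1.0])  # classifiers are evenly split

def agreement_color(votes):
    """Map binary votes (0 = class one, 1 = class two) to an RGB color."""
    votes = np.asarray(votes)
    fraction_two = votes.mean()
    base = RED if fraction_two > 0.5 else BLUE
    # Saturation is 1 for unanimous votes and 0 for an even split.
    saturation = abs(2.0 * fraction_two - 1.0)
    return saturation * base + (1.0 - saturation) * WHITE

unanimous_one = agreement_color([0] * 10)
unanimous_two = agreement_color([1] * 10)
even_split = agreement_color([0] * 5 + [1] * 5)
mostly_one = agreement_color([0] * 8 + [1] * 2)
```

In a full pipeline, `votes` for a pixel would come from applying each trained classifier to the high-dimensional point obtained by inverse-projecting that pixel.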
The ensemble is formed by ten classifiers, namely Logistic Regression, Linear SVM, SVM with radial basis function, K-Nearest Neighbors, Gaussian Process, Decision Tree, Random Forest, AdaBoost, Gaussian Naive Bayes, and Quadratic Discriminant Analysis. These classifiers represent a diverse set of classification algorithms, including linear and non-linear methods. The output from the ten classifiers is used to generate the images in Figure 5, where not only can we see the class memberships of each point, we can also see the shape of the decision boundaries.
For example, we can combine the use of inverse-projection for visualizing decision boundaries with its use for dynamic imputation, resulting in an interactive visual exploration system for understanding the uncertainty of the classifiers.
As an illustrative example, consider the differences between t-SNE and UMAP in Figure 5. When only considering the separability of the two clusters, one would likely assume that UMAP outperforms t-SNE, especially for the MNIST dataset (top row of Figure 5). However, when inspecting the decision boundaries, it becomes less clear that the separability affects the classifiers' abilities to distinguish data points from one class to another. Specifically, in the t-SNE example, although the separation between the two clusters in the MNIST data is small, the boundaries are sharp and clean. Conversely, while UMAP produces high separation between the two clusters, there are disagreements between the classifiers in that space.
4.2.2 Implications to Visual Analytics
While there have been a number of proposed methods for illustrating the decision boundaries for classifiers of high-dimensional data[migut:2015:decision_boundaries, Hamel:2006:decision_boundaries_of_svm, rodrigues:2018:classifier_boundaries, schulz:2015:decision_boundaries, rodrigues:2019:classifier_boundaries], our proposed use of inverse-projection offers an alternative that can be more flexible for visual analytics systems. As illustrated in Figure 6, the user can hover over areas with low model agreement (e.g., pixels that are white or near white), and see what characteristics of the data might cause the classifier models to disagree.
In the context of designing new visual analytics techniques, the use of inverse-projection to help users better understand the behaviors of machine learning models can prove to be invaluable. In the area colloquially referred to as Explainable AI (or XAI), visualization researchers have been active in developing novel visualization and interaction techniques that can help a user understand, debug, and improve a complex machine learning model. While the space of XAI is large, we posit that the inverse-projection technique can contribute to this broad space of research.
4.3 Case Study 3: Gradient Map Visualization
Related to Explainable AI (XAI), one of the primary use cases for multidimensional projection is the visualization and interaction of data that exists in highdimensional spaces that humans have difficulty interpreting. Unfortunately, a side effect of projection is the loss of information. To help mitigate the consequences of information loss imposed by projection, most techniques strive to maintain local relationships. In other words, they seek to preserve the relative distances between neighboring data points in the highdimensional space in the twodimensional projection. Of course, keeping these relationships intact after projecting is not always possible.
Using the concepts of data imputation, a more holistic view of how a projection represents the spatial relationships between data points is presented. The ability to determine highdimensional coordinates from a projected point in 2D enables a more complete investigation of the consequences of selecting a given projection technique by inspecting its gradient image (see Fig. 9). This image is a 2D scalar field representing a pseudo total derivative of inverse projection function computed using central differences as
where is a point in the 2D projection space and and are a pixel’s width and height, respectively. In summary, regions of a projection with large gradient values illustrate where the highdimensional distance is changing most rapidly with respect to the lowdimensional distance. Figure 8 demonstrates how values on either side of large gradient values map to larger distances in the original data space, compared to values on either side of small gradient values. While the above method uses simple finite differences, any method for computing the gradient magnitude of is appropriate.
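A minimal sketch of a central-difference gradient map is given below. It assumes the inverse projection is available as a callable (here the hypothetical `inverse_project`), that the projection space is the unit square, and that the two partial differences are combined into one scalar by taking the square root of the sum of their squared norms. That combination rule and all names are assumptions for illustration, not necessarily the exact formulation used in the paper.

```python
import math


def gradient_map(inverse_project, width, height):
    """Central-difference gradient magnitude of inverse_project over a pixel grid."""
    w, h = 1.0 / width, 1.0 / height  # pixel width and height in a unit square
    grid = []
    for row in range(height):
        values = []
        for col in range(width):
            px, py = (col + 0.5) * w, (row + 0.5) * h  # pixel center
            right = inverse_project((px + w, py))
            left = inverse_project((px - w, py))
            up = inverse_project((px, py + h))
            down = inverse_project((px, py - h))
            # Squared norms of the horizontal and vertical central differences.
            dx2 = sum(((a - b) / (2 * w)) ** 2 for a, b in zip(right, left))
            dy2 = sum(((a - b) / (2 * h)) ** 2 for a, b in zip(up, down))
            values.append(math.sqrt(dx2 + dy2))
        grid.append(values)
    return grid


# Hypothetical linear inverse projection: its gradient magnitude is constant.
def linear_inverse(p):
    return (2.0 * p[0], 3.0 * p[1], 0.0)


flat = gradient_map(linear_inverse, width=4, height=4)
```

For a linear inverse mapping such as `linear_inverse`, every pixel gets the same value, which mirrors the constant gradient map one expects under a linear projection.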
4.3.1 Example with Sphere Data
Figure 7 shows how, under a standard projection for parameterization, even a simple threedimensional sphere is transformed into a stretched and squeezed plane. Here, two equal length lines are placed on the parameterized plane at different locations. When each line segment is inverseprojected to recover the coordinates on the sphere, it is clear that the relative lengths of each segment have changed considerably. The degree of change is more completely understood when it is observed in the context of the gradient map overlaid onto this plane. In this case, areas towards the poles of the globe intuitively have a gradient that approaches zero, while the equator will have the highest gradient. Thus, a line segment near the poles that is back projected from the twodimensional plane will necessarily shrink; however, a similar line segment positioned near the equator will grow in length.
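The effect can be checked numerically with a small sketch. It assumes the parameterization is the usual longitude/latitude one (the text does not specify this), so the inverse mapping is known in closed form. The function names and the chosen segment positions are illustrative only.

```python
import math


def sphere_inverse(p):
    """Map (longitude, latitude) in radians to a point on the unit sphere."""
    lon, lat = p
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def inverse_projected_length(start, end, inverse_project, samples=200):
    """Approximate the length of an inverse-projected 2D segment as a polyline."""
    points = []
    for i in range(samples + 1):
        t = i / samples
        p = (start[0] + t * (end[0] - start[0]),
             start[1] + t * (end[1] - start[1]))
        points.append(inverse_project(p))
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Two horizontal segments of equal 2D length (0.5), at different latitudes.
near_equator = inverse_projected_length((0.0, 0.0), (0.5, 0.0), sphere_inverse)
near_pole = inverse_projected_length((0.0, 1.4), (0.5, 1.4), sphere_inverse)
```

Under this assumed parameterization, the segment at latitude 1.4 rad maps to a much shorter arc on the sphere than the same segment placed on the equator, in line with the gradient approaching zero towards the poles.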
In the top row of Figure 9, a uniformly sampled threedimensional sphere was projected to a twodimensional plane using tSNE, PCA, LLE, and UMAP. In these images, the sample points on the sphere are represented by blue dots, while the background is colored with the gradient image. In the cases of tSNE, LLE, and UMAP, the projections maintain similar gradient characteristics with respect to neighboring points. However, there are some points that project to regions of high gradient. These regions are inevitable, as tears in the threedimensional sphere are required in order to represent it on a plane. Conversely, PCA is a linear projection method that does not seek to preserve neighborhood information between data points. As a result, the gradient map under the data points is constant and reflects the planar nature of the projection space.
4.3.2 Implications to Visual Analytics
The gradient maps shown in Figure 9 illustrate the use of inverseprojection to help users see the quality of the projection. It is relevant to note that the gradient maps do not show the topology of an embedding space created using a projection function, which is the goal of works like Stress Maps[seifert:2010:stress_maps]. Instead, these gradient maps represent the reconstructed embedding space by the inverseprojection function. In some cases the inverseprojection function does not perfectly recover the original embedding space. For example, the top row of the PCA column in Figure 9 shows the reconstruction of a plane created by PCA in the 3D sphere dataset. Notice that the reconstructed surface is not perfectly linear as should be the case of PCA projections.
As such, we consider the gradient map as a debugging mechanism similar to tools in the XAI community for debugging machine learning models. In particular, the gradient map can help data scientists and visual analytics researchers to better understand the effect of projection and inverseprojection when visualizing highdimensional data. For example, the top left image in Figure 9 shows the projection of a 3D sphere using tSNE. The intense colors denote sharp discontinuities between the parts of the 3D sphere separated by tSNE. This information has illustrative value and can be used to help a user better understand the behaviors of a projection function such as tSNE.
5 Evaluation
We present an empirical evaluation of the inverse projection function described in the previous section. In the following, we split each dataset into a training set and a test set . We train NNInv using the pair . Within Sections 5.1, 5.3, and 5.4 we restrict to tSNE, but also explore PCA, LLE, and UMAP in Section 5.2. We next evaluate the quality of NNInv using various error metrics computed using and . We next discuss our method in terms of scalability (Sec. 5.4), quantitative assessment of quality (Sec. 5.1), qualitative assessment of quality (Sec. 5.2), and our novel inverseprojection error map (Sec. 5.3).
5.1 Quantitative Assessment of Quality
Besides being fast, we want an inverseprojection to be accurate. That is, given some ground truth pair , unseen during training, we want to be as close as possible to . This follows the same idea as normalized stress metrics used to gauge the quality of projections in the literature[sorzano:2014:dim_reduction_survey, van:2009:dim_reduction_survey, espadoto19] and also classical validation of inference models in machine learning. We measure quality in our case by computing the average inverseprojection mean square error over the test set . The closer MSE is to zero, the better is. While we minimized MAE in our loss function, we report MSE here to enable easier comparison to earlier papers[amorim:2012:ilamp, amorim:2015:rbf].
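The measurement described above can be sketched as follows. The sketch assumes the test set is given as pairs of a 2D point and its ground-truth highdimensional point, and that the squared error is averaged over both dimensions and samples; the exact normalization used in the paper is not stated here, so treat that choice, along with the names `inverse_projection_mse` and `toy_inverse`, as assumptions.

```python
def inverse_projection_mse(inverse_project, test_pairs):
    """Mean squared error between inverse-projected and ground-truth points."""
    total = 0.0
    for y, x in test_pairs:
        x_hat = inverse_project(y)
        # Per-sample squared error, averaged over the data dimensions.
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
    return total / len(test_pairs)


# Hypothetical inverse projection that always predicts zero for the third feature.
def toy_inverse(y):
    return (y[0], y[1], 0.0)


held_out = [
    ((1.0, 2.0), (1.0, 2.0, 0.0)),  # recovered exactly
    ((0.0, 0.0), (0.0, 0.0, 3.0)),  # third feature is off by 3
]
mse = inverse_projection_mse(toy_inverse, held_out)
```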
Figure 10 shows the MSE for our three datasets, two projections (tSNE and UMAP), and three tested inverse projections (iLAMP, RBF, and NNInv). We also consider several trainingset sizes to show how MSE depends on the amount of training data. For Blobs, a relatively easytoproject synthetic dataset, all methods have essentially zero error except RBF. MNIST and FashionMNIST show similar behavior: our method consistently achieves one of the lowest errors. Errors are larger for these realworld complex datasets than for the synthetic Blobs, which is expected.
5.2 Qualitative Exploration
We explore NNInv’s performance on two wellunderstood synthetic datasets, Sphere and Swiss Roll (Sec. 3.1). Simple datasets where the projections are well understood give us greater ability to reason about the inverse projection. In particular it is easier to understand how error is distributed across a dataset, as well as which projections will incur higher error. To illustrate this, we once again split the datasets into training () and test sets (), this time having and percent of the total data, respectively. We then plot the projections of test portion () of these two datasets in Fig. 11 and color the points by the rootmeansquared error between the inverseprojections () and the true highdimensional data (). Error colormaps are normalized within each image, so that we can better see error variations within a given projection. Hence, colors cannot be compared across rows or columns of Fig. 11.
When analyzing inverse projection results, we must remember that the concept of error encompasses inaccuracies and faults in both the projection and the inverseprojection methods. For example, linear techniques like PCA will have a substantially different error profile compared to nonlinear techniques such as tSNE, LLE, or UMAP.
For the sphere, tSNE and UMAP are able to peel away the surface, and error seems to congregate along the edges of the structures that make up the peel. In contrast, PCA and LLE end up with a slice out of the sphere, causing the largest error in the center of their slices. For the Swiss Roll dataset, tSNE, LLE, and UMAP are able to remove the swirl, with UMAP and tSNE making similar ribboned shapes and LLE unraveling to a rectangle with perspective. In contrast, PCA keeps the general shape of the spiral, causing a speckling of high error throughout the whole structure.
Projecting highdimensional space down to 2D is inherently lossy, and each method will project the data to 2D differently. This difference is not only visual – each projection method emphasizes certain aspects of the data. As a result, different techniques throw away different portions and amounts of the highdimensional data when performing the projection. This means that certain projection techniques will be easier to inverseproject than others.
For example, PCA does not aim to prevent overdrawing or projecting different points to the same twodimensional location. As such, several data points can be projected to the same position in 2D space, making it impossible to correctly learn an inverse. In contrast, tSNE and other nonlinear techniques work to maintain local neighborhood relationships; when projecting a set of points, they try to preserve the relative distances in the projection that exist in the original space. In cases where there is poor preservation of intercluster distances, NNInv remains a valuable tool. If an area in the projection is shrunk or expanded relative to the highdimensional space, the rate of change between inverseprojections will either increase or decrease respectively. When the interpolation moves very quickly, NNInv may be less useful for tasks like dynamic imputation (Sec. 4.1), but NNInv can help identify these spots with gradient maps (Sec. 4.3). The properties of each projection technique inform and define the types of errors exhibited during the inverseprojection process.
As Fig. 11 shows, one consequence of PCA projecting multiple distant points to a small region on a 2D plane is that the inverseprojected points will likely be erroneous. In this case, the error increases as the distance in the highdimensional space increases between points colocated on the 2D projection. Conversely, for tSNE and UMAP, the nonlinear projections distort the input geometry, often into shapes that no longer resemble the topology of the original data. In return, the inverseprojected data points from 2D back to the highdimensional space are much closer to their original positions, resulting in significantly smaller total error. In other words, better grouping by similarity as well as better separation of points will make inverseprojection easier.
5.3 Dense map of inverse projection error
Evaluation of inverseprojection methods often uses error metrics defined for direct projections such as stress or reprojection error[amorim:2012:ilamp, amorim:2015:rbf]. However, the above metrics only gauge the error at the locations of projection points . The same is actually the case for all errors for direct projections we are aware of – they only gauge how good a (direct) projection is at the locations of the scatterplot points. As explained earlier in Sec. 2.1.2, the key usecase of inverse projections is the outofsample one, where one inversely projects different points than .
We next propose a validation approach that considers the outofsample case, i.e., evaluates the quality of at all points in . We proceed as follows. Given a dataset , we construct as usual given a userchosen projection technique , and use to train our inverse projection . Next, we discretize the projection space using a pixel grid with a given resolution , in our case . Then, for every pixel , we compute the pixel given by the “round trip” of back projecting it to and next projecting it again to . To perform this, we must assume that is parametric. Then, ideally, for all pixels . This way, we can assess an inverse projection error also for points in which do not correspond to projections of points in our given dataset .
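The round trip can be sketched as below, assuming a parametric projection and an inverse projection are both available as callables and that the projection space is the unit square. The toy functions are hypothetical stand-ins: one inverse that the projection undoes exactly, and one that is deliberately shifted so that every pixel shows the same round-trip error.

```python
import math


def round_trip_error_map(project, inverse_project, resolution):
    """Distance between each pixel center and its projected inverse projection."""
    grid = []
    for row in range(resolution):
        errors = []
        for col in range(resolution):
            p = ((col + 0.5) / resolution, (row + 0.5) / resolution)
            p_round_trip = project(inverse_project(p))  # 2D -> nD -> 2D
            errors.append(math.dist(p, p_round_trip))
        grid.append(errors)
    return grid


# Hypothetical parametric projection: drop the third coordinate.
def toy_project(x):
    return (x[0], x[1])


# Hypothetical inverse that the projection undoes exactly.
def exact_inverse(p):
    return (p[0], p[1], p[0] * p[1])


# Hypothetical inverse with a constant horizontal offset of 0.1.
def shifted_inverse(p):
    return (p[0] + 0.1, p[1], 0.0)


exact_errors = round_trip_error_map(toy_project, exact_inverse, resolution=8)
shifted_errors = round_trip_error_map(toy_project, shifted_inverse, resolution=8)
```

Because the error is evaluated at pixel centers rather than at projected data points, the map also covers regions of the projection space where no data point lands.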
We visualize the roundtrip errors as a dense map as follows. We create a hue image by bilinear interpolation of four different hues (Fig. 12a). Next, we color every pixel by the hue of the roundtrip pixel and set its luminance to . Dense map areas which show the same color gradient as Fig. 12a have, thus, low inverseprojection errors. Bright areas and/or hue differences from this gradient show large projection errors. Scatterplot points are colored in the same way, but use a slightly lower brightness value to avoid confusion with the map pixels. Figures 12bd show the error maps for iLAMP, RBF, and NNInv for the inverse projection of the MNIST dataset projected by tSNE. We see that NNInv creates a color gradient which is close to the reference one, has minimal discontinuities, and has few bright spots. Hence, NNInv can inverseproject the entire 2D space without introducing large amounts of error.
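A small sketch of this coloring scheme follows. The four corner hues (given in degrees), the linear interpolation of hue values, and the rule that maps error to brightness are all assumptions chosen for illustration; the names `bilinear_hue`, `color_pixel`, and `CORNER_HUES` are hypothetical.

```python
def bilinear_hue(u, v, corner_hues):
    """Bilinearly interpolate four corner hues at (u, v) in the unit square."""
    top_left, top_right, bottom_left, bottom_right = corner_hues
    top = (1 - u) * top_left + u * top_right
    bottom = (1 - u) * bottom_left + u * bottom_right
    return (1 - v) * top + v * bottom


def color_pixel(p, p_round_trip, corner_hues, max_error):
    """Return (hue, brightness) for pixel p given its round-trip position."""
    error = ((p[0] - p_round_trip[0]) ** 2 + (p[1] - p_round_trip[1]) ** 2) ** 0.5
    # Assumed rule: brightness grows linearly with the error, capped at 1.
    brightness = min(error / max_error, 1.0)
    hue = bilinear_hue(p_round_trip[0], p_round_trip[1], corner_hues)
    return hue, brightness


# Hypothetical corner hues in degrees (top-left, top-right, bottom-left, bottom-right).
CORNER_HUES = (0.0, 90.0, 180.0, 270.0)

perfect = color_pixel((0.5, 0.5), (0.5, 0.5), CORNER_HUES, max_error=1.0)
displaced = color_pixel((0.0, 0.0), (1.0, 0.0), CORNER_HUES, max_error=1.0)
```

A pixel whose round trip returns to itself keeps the reference hue with zero added brightness, while a displaced pixel takes the hue of its destination and becomes brighter.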
5.4 Scalability in Training and Inference
Scalability implies the effort required to train our method and, separately, the effort needed to infer as function of the size of the dataset to inversely project. Concerning training, Table III shows the number of training epochs needed to obtain convergence (defined as in Sec. 3.2) as function of the training set size , for all three considered datasets and . Columns 2..4 indicate averages for multiple runs created by randomly sampling from the entire dataset . Overall, we obtain convergence for roughly 150 epochs for all datasets and trainingset sizes.
Table III: Average # epochs for each dataset
Training set size | Blobs | FashionMNIST | MNIST | Row averages
500 | 268.0 | 214.0 | 213.5 | 192.5
1000 | 190.5 | 129.0 | 147.5 | 149.0
2000 | 153.0 | 112.0 | 111.0 | 112.5
5000 | 103.0 | 120.5 | 138.0 | 127.5
7000 | 127.0 | 118.5 | 151.0 | 144.0
10000 | 82.0 | 124.5 | 142.5 | 146.5
column avg | 153.9 | 136.4 | 150.6 | 145.3
Figure 13 shows the inference speed for all three datasets. Speed does not depend on the projection method – once NNInv is trained, its performance is linear in the number of inverselyprojected samples. When computing inference speed, we inversely project any point in and not just points in . Indeed, for assessing speed, we do not need groundtruth information. Moreover, in real use cases, one would inversely project unseen data, for which such groundtruth information is not available. We see that both RBF and iLAMP have a superlinear behavior, while NNInv (our method) is basically linear. NNInv is roughly one order of magnitude faster than RBF and nearly two orders of magnitude faster than iLAMP for 40K samples or more. This speedup is crucial for applications that need to inversely project hundreds of thousands of samples (or more), as in the construction of dense maps [rodrigues:2018:classifier_boundaries, espadoto:2019:nn_inv] and the maps in Sec. 4.2 and 4.3. NNInv constructs such maps in seconds, while iLAMP and RBF need (tens of) minutes, making humanintheloop usage of such methods impossible in visual analytics scenarios – one of the key reasons why dense maps are built in the first place. This scalability is one of the most important advantages of NNInv.
6 Limitations
NNInv is scalable, accurate, and relatively smooth, as shown in Sec. 5. Yet, using a neural network does have its disadvantages[bengio:2013:deep]. A neural network (1) requires a particular threshold of good quality training examples, (2) can be computationally expensive to train, and (3) can be generally hard to interpret. In Sec. 5.1 we show acceptable mean squared error with as few as 500 training examples, and caution that below that threshold, our technique will not perform as successfully. In all of the examples in this paper, the projections NNInv are trained on are good quality projections, obtained by choosing reasonable values for the projection’s hyperparameters. Good quality projections are generally more likely to have the qualities (as described in Sec. 5.2) required for accurate inverseprojection. While NNInv is useful in helping to interpret projections (e.g., Fig. 2), it can be difficult to reason about NNInv itself, since neural networks are hard to interpret in general. That is, our metrics show that NNInv performs better as it can approximate nonlinear patterns, but it is not obvious how NNInv does this. We leave the explainability of NNInv’s improved performance to future work.
7 Discussion and Future Work
Future inverseprojection research can take several interesting directions. Of particular relevance is the discussion in Sec. 5.2 regarding the properties of projection techniques and their inverses.
As Sec. 5.2 shows, when discussing the invertibility of projection functions, we find that not all projection methods are equally suitable for the inverseprojection method: PCA is worse than tSNE or UMAP because multiple data points can be projected into the same 2D pixel.
Interestingly, a type of projection that is designed specifically for its invertibility is the encoder portion of an autoencoder. When trained together with the decoder, the entire process optimizes for the recoverability of data points from input to output. Yet, a user would have a hard time understanding the embedding of a regular encoder because there is no intentionally designed structure in the embedding space created with an encoder. Also, there are no guarantees about neighborhood preservation or relative distance preservation.
The tradeoff between understandability of the latent space created by a projection and the appropriateness of the projection for learning its inverse is interesting. On one hand, a projection technique may sacrifice some information to create a more insightful, or more spatially intuitive, visualization. Yet, the use of inverseprojection can lead to novel visualization and interaction techniques that can better help the user explore and understand a highdimensional space. Further steps should be taken to find a happy medium between these two extremes, whether that be autoencoders with some cost for occlusion, or spacing items too far apart, or a projection technique with a greater loss for discarding information.
Accessible and fast inverse projections will have farreaching impacts on visual analytics (VA) systems that use projections. We believe that a deep learning approach to inverse projection is especially accessible given today’s robust ecosystem for neural network development [chollet:2015:keras, google:2019:kerastuner, abadi:2015:tensorflow, paszke:2019:pytorch]. We hope that future works along this line of research continue to leverage approachable methods and libraries that ease adoption for tool builders. The most promising usecase is hypothesis generation made possible by dynamic imputation (Sec. 4.1), but several different augmentations exist, e.g., adding extra information showing how models understand the data space in the same vein as classifier agreement maps (Sec. 4.2), or helping projection users to better understand the underlying structure as in gradient maps (Sec. 4.3). We are particularly interested in how combinations of these techniques, such as hypothesis generation paired with gradient map style backgrounds, can help users who are less familiar with projection techniques make sense of overview projections in VA applications.
Lastly, we believe there are several applications of this technique that should be explored further. Projections and inverseprojections can be used to explore the space of different 2D charts that have themselves been projected to 2D (in a manner similar to ChartSeer[zhao:2020:chartseer]), and data that is often modeled on graphs, such as molecular data.
8 Conclusion
In this paper, we present NNInv, a deep learning approach to learning the inverse of projection functions. Similar to existing works such as iLAMP and RBF, NNInv is agnostic of the projection used, i.e., it can learn to invert any projection algorithm (such as PCA, tSNE, UMAP, LLE, etc.). NNInv uses a trained neural network to learn the approximate mapping from a given 2D scatterplot produced by a projection algorithm to the corresponding highdimensional data. We find that NNInv can be more accurate than iLAMP and RBF on both synthetic and realworld datasets, and is more scalable to large datasets: once trained, NNInv can perform inference in less than 10 milliseconds when running in a browser on a laptop, which makes NNInv a more suitable technique than iLAMP and RBF for interactive visualizations. Lastly, we show the potential of NNInv for analysis tasks such as hypothesis generation, classifier agreement, and gradient visualization. These three areas are important to the field of visual analytics and serve as evidence of the potentially broad applicability of NNInv in highdimensional data exploration and analysis.
Acknowledgments
This work was supported by National Science Foundation grants IIS1452977, OAC1940175, OAC1939945, DGE1855886, DARPA grant FA87501720107, and DOD grant HQ086020C7137. We would also like to thank the reviewers for their helpful feedback.