1 Introduction
The high dimensionality of modern data introduces significant challenges in descriptive and exploratory data analysis. These challenges gave rise to extensive work on dimensionality reduction aiming to provide low dimensional representations that preserve or uncover intrinsic patterns and structures in processed data. A common assumption in such work is that high dimensional measurements are a result of (often nonlinear) functions applied to a small set of latent variables that control the observed phenomena of interest, and thus one can expect an appropriate embedding in low dimensions to indirectly recover a faithful latent data representation. While classic approaches, such as principal component analysis (PCA)
pearson1901 and classical multidimensional scaling (MDS) cox2008multidimensional, construct linear embeddings, more recent attempts mostly focus on nonlinear dimensionality reduction. These approaches can be roughly split into manifold learning kernel methods and deep learning autoencoder methods, each with their own benefits and deficiencies.
Kernel methods for manifold learning cover some of the most popular nonlinear dimensionality reduction methods, dating back to the introduction of Isomap tenenbaum2000global and Locally Linear Embedding (LLE) roweis2000nonlinear
. These two methods introduced the notion of data manifolds as a model for intrinsic low dimensional geometry in high dimensional data. In both cases, the manifold is approximated by a local neighborhood graph, which is then leveraged to form a low-dimensional representation that preserves either pairwise geodesic distances (in the case of Isomap) or the local linearity of neighborhoods (in LLE). The construction of neighborhood graphs to approximate manifold structures was further advanced by Laplacian eigenmaps
belkin2002laplacian and diffusion maps coifman2006diffusion, together with a theoretical framework for relating the captured geometry to Riemannian manifolds via Laplace-Beltrami operators and heat kernels. These approaches, which until recently dominated manifold learning, can collectively be considered spectral methods, since the embedding they provide is based on the spectral decomposition of a suitable kernel matrix that encodes (potentially multiscale) neighborhood structure from the data. They are also known as kernel PCA methods, as they conceptually extend the spectral decomposition of covariance matrices used in PCA, or that of a Gram (inner product) matrix used in classic MDS.

Recent work in dimensionality reduction has focused on visualization for data exploration. Kernel PCA methods are generally unsuitable for such tasks because, while their learned representation has lower dimension than the original data, they tend to embed data geometry in more dimensions than can be conveniently visualized (i.e., significantly more than 2D or 3D). This is typically due to the orthogonality constraint and linearity of spectral decompositions with respect to the initial dimensionality expansion of kernel constructions moon2019visualizing ; haghverdi2016diffusion. This led to the invention of methods like t-SNE (t-distributed Stochastic Neighbor Embedding) maaten2008visualizing, UMAP (Uniform Manifold Approximation and Projection) UMAP2018, and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) moon2019visualizing. These methods embed the data by preserving pairwise relationships between points and can thus be viewed as generalizations of metric and nonmetric MDS.
These methods and their extensions have been used in applications such as single cell genomics moon2019visualizing ; amir2013visne ; sgier2016flow ; macosko2015highly ; becht2019dimensionality ; zhao2020single ; karthaus2020regenerative ; baccin2019combined , visualizing time series duque2019 , visualizing music for a recommendation system van2013deep
, and analyzing the internal representations in neural networks
2019Gigante ; horoi2020low. However, these and the previously mentioned spectral methods typically provide fixed latent coordinates for the input data. Thus, they do not come with a natural embedding function to perform out-of-sample extension (OOSE). This shortcoming is usually tackled by employing geometric harmonics coifman2006geometric, Nyström extension williams2001using, or a landmark approach vladymyrov2013locally.

In contrast, autoencoders (AEs) can be viewed as a different paradigm for nonlinear dimensionality reduction. First introduced in rumelhart1986learning, this nonconvex and parametric approach has gained more attention in recent years, especially due to computational and mathematical advances in the field allowing neural networks to be implemented and trained more efficiently. Denoising AEs vincent2008extracting have become widely used to find good latent representations and perform feature extraction, exploiting the flexibility provided by neural networks (e.g., supratak2014feature , liu2014feature ). In contrast to kernel methods, AEs learn a parametric function and are thus equipped with a natural way to perform OOSE, as well as an inverse mapping from the latent space to the input space. Despite these nice properties, AEs usually fail to accurately recover the geometry present in the data. This not only restricts their desirability for exploratory data analysis, e.g. via low-dimensional visualization, but can also lead to bad reconstructions over certain regions of the data, as we show in this work.

Motivated by the complementary advantages of AEs and kernel methods, we introduce geometry-regularized autoencoders (GRAE), a general framework which splices the well-established machinery of kernel methods for recovering a sensible geometry into the parametric structure of AEs. Thus we gain the benefits of both approaches, furnishing kernel methods with efficient OOSE and an inverse mapping, and providing the autoencoder with a geometrically driven representation. To achieve this, GRAE regularizes the latent representation of the autoencoder, guiding it towards a representation previously learned by a kernel method. In this paper we present GRAE using PHATE moon2019visualizing as our preferred method for finding a sensible geometry. Nevertheless, our general formulation can also easily be implemented with other methods such as UMAP (see the supplement).
2 Geometry-regularized autoencoder
2.1 Learning embedding functions instead of embedding coordinates
Manifold learning methods for dimensionality reduction typically assume the data lie on a low dimensional manifold $\mathcal{M}$ immersed in a high dimensional ambient space $\mathbb{R}^D$. Therefore they aim to map points from $\mathcal{M}$ to a low dimensional Euclidean space that encodes or reveals its intrinsic geometry. However, in practice, such methods only consider a finite set of data points $x_1, \ldots, x_n \in \mathbb{R}^D$ (for a $D$ dimensional ambient space), assumed to be sampled from $\mathcal{M}$, and optimize a fixed set of low dimensional points $z_1, \ldots, z_n \in \mathbb{R}^d$ (for $d \ll D$) such that Euclidean relations between pairs $z_i, z_j$ will reflect intrinsic nonlinear relations between the corresponding $x_i, x_j$. While the realization of this optimization and its objective vary between methods, a common property of most recent approaches is that the resulting “manifold learning” only provides coordinates for a fixed and finite set of data points. Therefore, they do not provide a general embedding function that operates on, or provides a representation of, the entire manifold $\mathcal{M}$. Thus these methods are not applicable to arbitrary input points in $\mathbb{R}^D$, which we would ideally want to project onto the learned manifold.
In this work, we aim to learn the data manifold geometry in order to find an appropriate embedding function $f : \mathbb{R}^D \to \mathbb{R}^d$, rather than just a fixed point-cloud embedding. This contrast can be seen, for example, by comparing the classic PCA and MDS methods. While both methods can be shown analytically to extract equivalent (or even the same) linear embeddings from data, MDS only assigns coordinates to fixed input points (similar to the manifold learning methods described above), while PCA provides an embedding function (albeit linear) defined by a projection operator on the principal components. Here, we aim to establish a similar equivalence in nonlinear settings by providing an alternative to popular manifold learning approaches that constructs an embedding function (as a nonlinear successor of PCA) while also yielding representations that capture an intrinsic geometry similar to that of established kernel methods (seen as successors of MDS).
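This distinction can be made concrete with a small numpy sketch (the toy data and dimensions are illustrative, not from the paper): PCA yields an embedding *function* that extends to unseen points for free, whereas an MDS-style method would only assign coordinates to the points it was fit on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in D=5 ambient dimensions, lying near a d=2 subspace.
n, D, d = 200, 5, 2
latent = rng.normal(size=(n, d))
mixing = rng.normal(size=(d, D))
X = latent @ mixing + 0.01 * rng.normal(size=(n, D))

# Fit PCA on a training split: the columns of W are the top-d principal components.
X_train, X_new = X[:150], X[150:]
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
W = Vt[:d].T                      # D x d matrix with orthonormal columns

def embed(x):
    """Embedding function z = W^T (x - mu); defined for ANY point in R^D."""
    return (x - mu) @ W

def inverse(z):
    """Approximate inverse x ~ W z + mu (projection onto the PC hyperplane)."""
    return z @ W.T + mu

# Out-of-sample extension is free: simply apply the function to unseen points.
Z_new = embed(X_new)
recon_err = np.mean((inverse(Z_new) - X_new) ** 2)
```

Note that `W @ W.T` is the projection operator discussed in Sec. 2.2: applying it twice is the same as applying it once.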
2.2 Extendable and invertible embedding with autoencoders
As mentioned before, PCA naturally provides an extendable and (approximately) invertible embedding function, but falls short in capturing nonlinear mappings. To see this, we recall that its embedding function is constructed using a matrix $W \in \mathbb{R}^{D \times d}$ whose orthogonal columns consist of $d$ principal components (PCs), such that $W W^T$ is a projection operator, thus acting as an identity on the hyperplane spanned by the PCs. Then, the PCA embedding is given by the linear function $z = W^T x$, and its inverse (on the embedded space) is given by $x \approx W z$, where the approximation quality (quantified as reconstruction error) serves as the optimization target for computing the PCs in $W$. To generalize this construction to compute nonlinear embedding functions over a data manifold $\mathcal{M}$, autoencoders (AEs) replace $W^T$ by an encoder function $E$ and $W$ by a decoder function $D$, which aims to serve as an approximate inverse of $E$. Both functions are parametrized by a neural network and trained via a reconstruction loss that aims to ensure the composite function $D \circ E$ acts as an identity on data sampled from $\mathcal{M}$. Considering datasets in matrix notation (i.e., a matrix $X$ with rows as datapoints), the AE optimization can generally be formulated as

$$\min_{E, D} \; \mathcal{L}\big(X, D(E(X))\big) \qquad (1)$$
where $E$ and $D$ are understood to be applied separately to each row of their input matrix (yielding corresponding output data points organized in matrix form), and $\mathcal{L}$ denotes a loss function that measures the discrepancy between the original and reconstructed data points (commonly MSE). It is common to select $d < D$, forcing the autoencoder to find a representation in latent codes of dimension $d$ while retaining as much information for reconstruction as possible. In this case the autoencoder is usually referred to as undercomplete. Under this formulation, instead of learning new coordinates for the input data, we learn an embedding function $E$ and an inverse function $D$. If $E$ is a linear function, the network will project onto the same subspace spanned by the principal components in PCA baldi1989neural.

The previous formulation departs from manifold learning approaches in that it lacks an explicit condition to recover geometric interactions between observations. To fill that gap, we propose a general framework called GRAE (geometry-regularized autoencoders) which explicitly penalizes misguided representations in the latent space from a geometric perspective. Thus, we add a soft constraint on the bottleneck of the autoencoder as follows:
$$\min_{E, D} \; \mathcal{L}\big(X, D(E(X))\big) + \lambda \, \big\| E(X) - \mathcal{E} \big\|_F^2 \qquad (2)$$
The term $\lambda \| E(X) - \mathcal{E} \|_F^2$ in Eq. (2) is the geometric loss, penalizing the discrepancy between the latent representation $E(X)$ and the embedding $\mathcal{E}$ previously learned by a manifold learning algorithm. The parameter $\lambda$ determines how strongly the latent space of the AE should match the embedding $\mathcal{E}$. The network is thus implicitly forced to preserve in its latent space the relationships learned by the manifold learning technique, resulting in a nonlinear embedding function and an inverse that are consistent with sensible geometric properties. We note that introducing this regularization deteriorates the global optimum of the training reconstruction error. However, in practice, given the difficulty of finding the global optimum of a neural network, it often leads to better reconstructions, as we show in Sec. 3. This regularization can be applied with any manifold learning approach, whether it be Isomap, t-SNE, etc. The resulting latent space will then inherit the corresponding strengths and weaknesses of the selected approach.
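As a minimal sketch (not the paper's implementation), the objective of Eq. (2) can be written as a plain function of the network outputs; the mean-squared form of both terms is an illustrative assumption:

```python
import numpy as np

def grae_loss(X, X_hat, Z, E_target, lam):
    """Sketch of Eq. (2): reconstruction loss plus a geometric penalty pulling
    the latent codes Z = E(X) toward a precomputed manifold embedding E_target
    (e.g. from PHATE). Mean-squared errors are assumed here for both terms."""
    recon = np.mean((X - X_hat) ** 2)       # L(X, D(E(X)))
    geo = np.mean((Z - E_target) ** 2)      # discrepancy to the target embedding
    return recon + lam * geo

# Setting lam = 0 recovers the plain AE objective of Eq. (1).
```

In a training loop, `X_hat` and `Z` would be the decoder and bottleneck outputs of the network, while `E_target` stays fixed throughout training.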
2.3 Geometric regularization with diffusion-based manifold learning
To generate $\mathcal{E}$, we use PHATE moon2019visualizing, as it has proven to preserve long-term relationships (global structure) in a low-dimensional representation beyond the capabilities of spectral methods such as Laplacian eigenmaps, diffusion maps, LLE, and Isomap, especially when the dimension is required to be 2 or 3 for visualization. PHATE (Alg. 1) is built upon diffusion geometry coifman2006diffusion ; nadler2006diffusion. PHATE first computes the $\alpha$-decay kernel with an adaptive bandwidth, which captures local geometry while remaining robust to density changes. The kernel matrix is normalized to obtain a probability transition matrix $P$ (the diffusion operator) between every pair of points. Various scales of the geometry can then be uncovered by computing a $t$-step random walk over $P$, with a higher $t$ implying more diffusion, pushing transition probabilities to a more global scale.

The parameter $t$ is automatically chosen by studying the entropy of a distribution over the eigenvalues of $P^t$ (known as the von Neumann entropy) as $t$ increases. Typically, the first few steps lead to a sharp drop in the entropy, which is interpreted in moon2019visualizing as denoising the transition probabilities, whereas later steps reduce the entropy at a lower rate, thus slowly losing meaningful information in the structure. Subsequently, PHATE computes the so-called potential distances, which have proven to be adequate distances between the transition probabilities encoded in $P^t$. Finally, MDS and metric MDS are applied to the potential distances to optimally preserve them in a low-dimensional representation.

3 Empirical results
We experimentally compare GRAE with a standard AE, Diffusion Nets mishne2019diffusion, topological autoencoders (TAE) moor2019topological, and UMAP UMAP2018 . Diffusion Nets and TAE impose regularizations that enforce a diffusion maps or topological structure on the embedding coordinates. A detailed discussion of these and other related methods, such as Embedding with Autoencoder Regularization (ERAE) yu2013embedding and Laplacian Autoencoders (LAE) jia2015laplacian, is provided in the supplement. UMAP was selected as a baseline to compare our approach to a “pure” manifold learning algorithm. This choice is motivated by the presence of a native extension to new points in UMAP’s implementation, as well as an inverse transform from the latent space back to the input space, thus offering a natural comparison to autoencoders over all the relevant metrics. UMAP is also similar to t-SNE from an algorithmic standpoint and provides qualitatively similar embeddings at a lower computational cost. To emphasize GRAE’s robustness with respect to the regularization parameter $\lambda$, we do not tune it, but rather show how similar performances can be obtained using $\lambda$ values of 10, 50, and 100, thus serving as an ablation study for the proposed geometric regularization. Full details about the architectures (e.g., for all autoencoder methods) and parameter tuning appear in the supplement.
We perform these comparisons on three benchmark datasets. The first is the classic manifold problem known as the “Swiss Roll”, where data points lie on a two dimensional “roll” in a three dimensional ambient space. Classical approaches such as PCA or MDS typically fail to recover the nonlinear geometry of the data, as they rely on pairwise Euclidean distances instead of the true geodesic distances along the curvature of the roll. The second dataset consists of 684 images of a face model rotated horizontally and vertically, as introduced in tenenbaum2000global and shown in Fig. 1. An adequate two dimensional embedding of the images is expected to recover both rotation axes in a meaningful way. The final dataset is derived from the MNIST dataset lecun2010mnist : three digits are picked randomly and rotated 360 times at one-degree intervals, for a total of 1080 images.
All results reported in this section were computed on a test split to assess the ability of the various methods to learn a function from the training data with some generalization capacity. On the Swiss Roll, we remove a thin middle slice and use it for testing (leaving essentially two disconnected Swiss Rolls for training). We use a typical 80/20 split for the other two datasets.
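The Swiss Roll split described above can be sketched as follows (the generating formula is the standard one; the sample size and slice thresholds are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Intrinsic coordinates: angle t along the roll and height y.
t = 1.5 * np.pi * (1 + 2 * rng.uniform(size=n))
y = 21 * rng.uniform(size=n)
X = np.stack([t * np.cos(t), y, t * np.sin(t)], axis=1)   # 3D Swiss Roll

# Hold out a thin middle slice (along the roll) as the test set, leaving two
# disconnected pieces for training -- the 45%/55% quantiles are illustrative.
lo, hi = np.quantile(t, [0.45, 0.55])
test_mask = (t > lo) & (t < hi)
X_train, X_test = X[~test_mask], X[test_mask]
```

Because the slice is taken along the intrinsic coordinate `t` rather than an ambient coordinate, the held-out points lie on the manifold but in a region the model never sees during training.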
3.1 Qualitative evaluation
We first qualitatively evaluate GRAE and the other methods by visualizing the embedding layer after training. The visualizations for all three datasets are presented in Figure 2. We first notice that only GRAE and UMAP are able to uncoil the training points of the Swiss Roll dataset. AE and TAE both produce embeddings reminiscent of what is expected from a linear dimensionality reduction method such as PCA, where the roll is flattened in two dimensions without any unrolling. Diffusion Nets achieve a certain degree of uncoiling, but do not recover the intrinsically rectangular nature of the roll. UMAP forces the test points in between the two Swiss Roll components to be close to the training embedding, thus tearing the overall structure of the manifold. Only GRAE manages to learn a continuous embedding bridging the two rolls of the training set. This demonstrates that GRAE can be used to generate data points in previously unsampled regions of the manifold.
The results on the Faces and Rotated Digits datasets show that progressively increasing the regularization factor of GRAE adds structure to the AE embedding. The horizontal rotation axis of the Faces is made clearer by GRAE with $\lambda$ values of 50 and 100. We further quantify this recovery in Section 3.2. GRAE improves upon the AE embedding by disentangling the rings representing the rotations of the three rotated digits. Going from GRAE (10) to GRAE (50) denoises the rings. Only UMAP achieves an embedding of similar quality, whereas the other methods show overlapping curves, which do not faithfully represent the rotations of three different images.
3.2 Quantitative evaluation
To quantitatively evaluate the performance of GRAE, we compare it to the other methods using a variety of metrics. These metrics focus on the preservation of various properties of the ambient space in an unsupervised fashion, and include trustworthiness and continuity (kaski2003trustworthiness , venna2006visualizing ) and the mean relative rank error (MRRE) lee2007nonlinear , all of which measure the quality of the latent representations in terms of the preservation of the nearest neighbors of each point in both the ambient and latent spaces. We refer the reader to the supplement for further discussion of these metrics. The reconstruction error (MSE) is also reported to study the impact of the GRAE regularization on the traditional AE objective. We report the average over ten runs for all performance metrics on the test data.
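As an illustration of this family of metrics, trustworthiness can be computed directly from neighbour ranks in the two spaces; the following is a simplified O(n²) numpy sketch of the standard rank-based definition, not the evaluation code used here:

```python
import numpy as np

def trustworthiness(X, Z, k=5):
    """Trustworthiness (Venna & Kaski): penalizes points that are among the
    k nearest neighbours of a point in the embedding Z but not in the
    ambient space X. Returns a score <= 1; 1 means no intruder neighbours."""
    n = len(X)
    # Pairwise squared distances in each space; exclude self-neighbours.
    dX = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dZ = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dX, np.inf)
    np.fill_diagonal(dZ, np.inf)
    rank_X = np.argsort(np.argsort(dX, axis=1), axis=1)  # 0 = nearest in X
    nn_Z = np.argsort(dZ, axis=1)[:, :k]                 # k-NN in the embedding
    penalty = 0.0
    for i in range(n):
        for j in nn_Z[i]:
            if rank_X[i, j] >= k:            # j is an "intruder" neighbour
                penalty += rank_X[i, j] - k + 1
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

Continuity is the symmetric counterpart, obtained by swapping the roles of `X` and `Z` (penalizing ambient neighbours missing from the embedding).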
Table 1: Test-set results; rank among the seven methods in parentheses.

| Dataset | Model | Continuity k=5 | Continuity k=10 | Trustworthiness k=5 | Trustworthiness k=10 | MRRE k=5 | MRRE k=10 | MSE |
|---|---|---|---|---|---|---|---|---|
| Swiss Roll | Autoencoder | 0.973 (5) | 0.962 (6) | 0.848 (6) | 0.855 (6) | 0.155 (6) | 0.154 (6) | 0.014 (4) |
| | GRAE (10) | 0.988 (2) | 0.981 (2) | 0.988 (2) | 0.982 (1) | 0.013 (2) | 0.018 (2) | 0.005 (1) |
| | GRAE (50) | 0.975 (4) | 0.966 (4) | 0.953 (4) | 0.942 (3) | 0.044 (4) | 0.052 (4) | 0.010 (2) |
| | GRAE (100) | 0.973 (5) | 0.964 (5) | 0.942 (5) | 0.933 (5) | 0.056 (5) | 0.063 (5) | 0.011 (3) |
| | TAE (1000) | 0.979 (3) | 0.967 (3) | 0.812 (7) | 0.819 (7) | 0.192 (7) | 0.191 (7) | 0.019 (5) |
| | Diffusion Nets | 0.991 (1) | 0.991 (1) | 0.964 (3) | 0.934 (4) | 0.033 (3) | 0.049 (3) | 0.584 (7) |
| | UMAP | 0.905 (7) | 0.861 (7) | 0.992 (1) | 0.982 (1) | 0.010 (1) | 0.016 (1) | 0.019 (5) |
| Faces | Autoencoder | 0.934 (7) | 0.939 (3) | 0.940 (4) | 0.946 (2) | 0.065 (4) | 0.069 (4) | 0.019 (2) |
| | GRAE (10) | 0.939 (4) | 0.935 (5) | 0.949 (2) | 0.946 (2) | 0.056 (2) | 0.062 (2) | 0.017 (1) |
| | GRAE (50) | 0.936 (5) | 0.922 (6) | 0.916 (5) | 0.893 (5) | 0.086 (5) | 0.101 (5) | 0.019 (2) |
| | GRAE (100) | 0.935 (6) | 0.917 (7) | 0.890 (7) | 0.868 (7) | 0.112 (6) | 0.126 (7) | 0.020 (5) |
| | TAE (1000) | 0.942 (2) | 0.948 (1) | 0.945 (3) | 0.953 (1) | 0.064 (3) | 0.066 (3) | 0.019 (2) |
| | Diffusion Nets | 0.952 (1) | 0.945 (2) | 0.892 (6) | 0.891 (6) | 0.115 (7) | 0.120 (6) | 0.024 (6) |
| | UMAP | 0.940 (3) | 0.936 (4) | 0.956 (1) | 0.937 (4) | 0.044 (1) | 0.057 (1) | 0.205 (7) |
| Rotated Digits | Autoencoder | 0.949 (7) | 0.905 (3) | 0.949 (7) | 0.922 (7) | 0.045 (7) | 0.059 (7) | 0.020 (5) |
| | GRAE (10) | 0.965 (3) | 0.842 (6) | 0.992 (1) | 0.951 (1) | 0.005 (1) | 0.023 (1) | 0.016 (1) |
| | GRAE (50) | 0.965 (3) | 0.842 (6) | 0.992 (1) | 0.950 (2) | 0.005 (1) | 0.023 (1) | 0.017 (2) |
| | GRAE (100) | 0.965 (3) | 0.843 (5) | 0.991 (4) | 0.950 (2) | 0.005 (1) | 0.023 (1) | 0.017 (2) |
| | TAE (1000) | 0.961 (6) | 0.935 (2) | 0.960 (6) | 0.937 (6) | 0.036 (6) | 0.049 (6) | 0.019 (4) |
| | Diffusion Nets | 0.986 (1) | 0.952 (1) | 0.984 (5) | 0.950 (2) | 0.011 (5) | 0.028 (5) | 0.026 (6) |
| | UMAP | 0.971 (2) | 0.856 (4) | 0.992 (1) | 0.949 (5) | 0.005 (1) | 0.024 (4) | 0.056 (7) |
Table 1 shows the results using these metrics. Echoing its convincing visualizations, GRAE quantitatively preserves the geometry of the Swiss Roll, being first or runner-up on all metrics, and further displays competitive results on the other two problems. Generally speaking, models with manifold learning or topological components offer better performance than the vanilla AE over the considered datasets for metrics relating to embedding quality. However, we note that while other methods do show promising results in some cases (sometimes outperforming GRAE), these same methods also underperform in others, and thus may not provide a reliable embedding in truly exploratory unsupervised settings. For example, TAE shows good results on Faces, but significantly underperforms other methods on the Swiss Roll. For its part, while generally competitive in terms of continuity, Diffusion Nets lack consistency with respect to the other metrics. UMAP shows similar instability, albeit to a somewhat lesser extent. On the other hand, across the three configurations presented here, GRAE proves to be noticeably more consistent than competing methods, thus indicating a more reliable overall embedding of the structure.
Furthermore, with respect to the goal of providing invertible embeddings, other methods typically fail to match the reconstruction error of the vanilla AE, which is expected given that they do not specifically account for this metric when inverting latent coordinates (UMAP), or that they are driven by other objectives that may dominate reconstruction terms during training (TAE and Diffusion Nets). GRAE, on the other hand, surprisingly improves reconstruction on all three benchmarks, most notably on the Swiss Roll and the Rotated Digits. We study this property further in the following section. Overall, our results indicate that geometric regularization based on the PHATE embedding brings tangible structural benefits to the bottleneck of the autoencoder.
Table 2: Agreement between embeddings and ground-truth intrinsic coordinates (rank in parentheses).

| Dataset | Model | Distance Corr. | Pearson Naive | Pearson ICA | Pearson Section | Spearman Naive | Spearman ICA | Spearman Section | MI Naive | MI ICA | MI Section |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Swiss Roll | Autoencoder | 0.093 (7) | 0.260 (7) | 0.608 (6) | 0.445 (7) | 0.270 (7) | 0.611 (6) | 0.452 (6) | 1.324 (6) | 2.153 (2) | 0.657 (6) |
| | GRAE (10) | 0.962 (1) | 0.980 (1) | 0.980 (1) | 0.993 (1) | 0.987 (1) | 0.983 (1) | 0.997 (1) | 2.597 (1) | 2.044 (5) | 1.643 (1) |
| | GRAE (50) | 0.955 (2) | 0.979 (2) | 0.977 (2) | 0.991 (2) | 0.985 (2) | 0.980 (2) | 0.993 (2) | 2.463 (2) | 2.074 (4) | 1.625 (2) |
| | GRAE (100) | 0.953 (3) | 0.977 (3) | 0.976 (3) | 0.991 (2) | 0.982 (3) | 0.979 (3) | 0.992 (3) | 2.434 (3) | 2.110 (3) | 1.619 (3) |
| | TAE (1000) | 0.155 (6) | 0.278 (6) | 0.627 (5) | 0.500 (5) | 0.287 (6) | 0.633 (5) | 0.502 (5) | 1.230 (7) | 2.206 (1) | 0.688 (5) |
| | Diffusion Nets | 0.336 (5) | 0.394 (5) | 0.443 (7) | 0.455 (6) | 0.309 (5) | 0.388 (7) | 0.378 (7) | 1.492 (5) | 1.431 (7) | 0.535 (7) |
| | UMAP | 0.791 (4) | 0.912 (4) | 0.916 (4) | 0.926 (4) | 0.873 (4) | 0.890 (4) | 0.934 (4) | 1.644 (4) | 1.586 (6) | 1.099 (4) |
| Faces | Autoencoder | 0.040 (7) | 0.253 (7) | 0.295 (7) | 0.370 (6) | 0.233 (7) | 0.300 (7) | 0.358 (7) | 0.364 (6) | 0.298 (7) | 0.224 (6) |
| | GRAE (10) | 0.691 (5) | 0.642 (4) | 0.642 (4) | 0.737 (4) | 0.629 (4) | 0.628 (5) | 0.723 (4) | 0.731 (5) | 0.720 (5) | 0.524 (5) |
| | GRAE (50) | 0.725 (2) | 0.729 (2) | 0.735 (2) | 0.829 (2) | 0.732 (2) | 0.740 (2) | 0.817 (2) | 0.860 (3) | 0.885 (3) | 0.659 (2) |
| | GRAE (100) | 0.718 (4) | 0.758 (1) | 0.763 (1) | 0.855 (1) | 0.763 (1) | 0.769 (1) | 0.847 (1) | 0.872 (2) | 0.901 (2) | 0.665 (1) |
| | TAE (1000) | 0.067 (6) | 0.269 (6) | 0.304 (6) | 0.362 (7) | 0.244 (6) | 0.304 (6) | 0.365 (6) | 0.353 (7) | 0.309 (6) | 0.216 (7) |
| | Diffusion Nets | 0.719 (3) | 0.684 (3) | 0.686 (3) | 0.767 (3) | 0.697 (3) | 0.703 (3) | 0.748 (3) | 1.027 (1) | 1.079 (1) | 0.569 (3) |
| | UMAP | 0.854 (1) | 0.633 (5) | 0.636 (5) | 0.706 (5) | 0.628 (5) | 0.649 (4) | 0.711 (5) | 0.794 (4) | 0.739 (4) | 0.536 (4) |
Beyond the unsupervised metrics provided in Table 1, we also recall that an important motivation and goal for using manifold learning methods in data exploration is to reveal underlying latent variables that explain variation and global structure in the data. To quantify the ability of GRAE and other methods to extract such information, we note that both the Swiss Roll and the Faces dataset provide two dimensional ground truth representations for their latent space, consistently capturing a global intrinsic coordinate system for their manifold structure (Rotated Digits is not considered here, as it lacks global structure, consisting of 3 disconnected manifolds). Namely, in the Swiss Roll we have access to the intrinsic manifold coordinates given by the length along the roll and the rolled axis, while in Faces we have access to physical latent variables given by the two rotation axes of the head in the captured images. We therefore quantify the quality of the embedded representations in these cases by comparing them with the true intrinsic coordinates, using the Pearson correlation between the Euclidean distance matrices of the embeddings and of the true intrinsic space. We further report the Pearson correlation, the Spearman correlation, and the mutual information between the embedding coordinates and the intrinsic coordinates. For these three measures, we first find the optimal matching between the embedding axes and the intrinsic axes (for Pearson and Spearman, the sign is ignored), and the metrics are then computed over i) the raw embeddings, ii) the embeddings after an ICA preprocessing (to be more flexible with respect to rotations), and iii) by averaging the applicable metric over 10 equal-width slices of the intrinsic space with respect to each intrinsic variable (hence 20 slices in total). Further details on the calculation of these metrics appear in the supplement.
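The "naive" variant of this axis-matching evaluation can be sketched as follows for a 2D embedding (an illustrative simplification; the exact matching procedure is described in the supplement):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_axis_match_score(Z, Y):
    """Match each embedding axis of Z to an intrinsic axis of Y (2D case),
    ignoring sign, and return the mean absolute Pearson correlation of the
    better of the two possible axis pairings."""
    pairings = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
    scores = [np.mean([abs(pearson(Z[:, i], Y[:, j])) for i, j in pairing])
              for pairing in pairings]
    return max(scores)
```

The ICA variant would simply apply an ICA rotation to `Z` before calling `best_axis_match_score`, and the per-section variant would average the score over slices of `Y`.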
The results of the described evaluation comparing embedded representations to the true “manifold” coordinates are shown in Table 2. Our results show that GRAE outperforms the other methods, achieving the best results on almost all metrics on the two datasets (with only two exceptions, which do not indicate a single stable competitor to our method), suggesting it provides a more faithful representation with superior ability to reveal latent structure and variables in data. Further, we note that over the three configurations considered here, GRAE often also achieves the second and third best performances, indicating robustness to the choice of its regularization parameter $\lambda$.
3.3 Impact of geometric regularization on reconstruction quality
Based on the GRAE reconstruction errors (Table 1), we observe that GRAE improves the AE latent representation, guiding the network out of a local optimum region towards a more accurate geometry. Additionally, the regularization yields a more stable reconstruction over the whole dataset, especially in those regions where the autoencoder produces inaccurate geometry in the latent space. To support these claims, we conduct an experiment on two rotated MNIST digits (Figure 3), generating a full rotation for each of the digits and removing in-between observations as the test set. After training GRAE on the remaining observations, we interpolate over consecutive observations, i.e., consecutive angle rotations in the training set. We then compute the reconstruction error between the points generated via interpolation and the previously removed in-between observations.

This experiment shows that learning an accurate geometry from the data can be useful to generate new points in the latent space via interpolation over adjacent observations. Such points can then be fed to GRAE’s decoder, generating a more faithfully reconstructed geodesic trajectory between points on the manifold in the ambient space than with the AE.
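The latent interpolation step of this experiment amounts to the following sketch (linear interpolation between latent codes is an assumption; no trained encoder or decoder is included):

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=10):
    """Linearly interpolate between two latent codes. The intermediate codes
    can then be passed through a decoder to generate points along the
    (approximate) geodesic between the corresponding observations."""
    w = np.linspace(0.0, 1.0, steps)[:, None]   # interpolation weights
    return (1 - w) * z_a + w * z_b

# With a trained decoder D (e.g. GRAE's), the generated ambient-space
# trajectory would be D(interpolate_latent(E(x_a), E(x_b))); only the
# latent-space step is sketched here.
```

The quality of the decoded trajectory then depends entirely on how well the latent space reflects the manifold geometry, which is the property the experiment measures.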
Figure 3: [...] Red lines represent the 1st and 3rd quartiles, and dashed black lines represent the median. We notice that the AE is more unstable than GRAE, having a more spread-out distribution. Despite achieving the smallest reconstruction errors for some observations, it also fails heavily for others, while GRAE typically presents lighter tails. B) A typical embedding produced for different values of $\lambda$. Blue points represent a subsample of the training data (subsampling done for visualization purposes only). Black points are the points generated in the latent space via interpolation. Red markers in the AE embedding represent the 20 interpolated points with the highest reconstruction error. We observe that bad reconstruction typically occurs in sparse regions or at crossing lines, i.e., in regions with poorly learned geometry.

4 Conclusion
We proposed the geometry-regularized autoencoder (GRAE), a general parametric framework that enhances the latent representation of autoencoders by taking advantage of well-established manifold learning methods. By imposing a geometric soft constraint on the bottleneck of the autoencoder, we demonstrated empirically how GRAE can achieve good visualizations and good latent representations on several performance metrics compared to AE and other geometrically motivated methods. Furthermore, GRAE is equipped with an inverse mapping that may produce better reconstructions than AE. While the primary focus of this work is on using PHATE embeddings to regularize the bottleneck, our approach is general, and we show preliminary results in the supplementary material using UMAP as the embedding target for GRAE. We leave to future work the study of other manifold learning algorithms as constraints for learning AE representations with better geometry, and the benefits they bring in terms of visualization, reconstruction, and data generation.
Acknowledgments
This research was partially funded by an IVADO (l’institut de valorisation des données) excellence scholarship [S.M.]; IVADO Professor research funds, and NIH grant R01GM135929 [G.W.]. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
References
 (1) Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

 (2) Michael A.A. Cox and Trevor F. Cox. Multidimensional scaling. In Handbook of Data Visualization, pages 315–347. Springer, 2008.
 (3) Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 (4) Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 (5) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
 (6) Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
 (7) Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in highdimensional biological data. Nature Biotechnology, 37(12):1482–1492, 2019.
 (8) Laleh Haghverdi, Maren Buettner, F. Alexander Wolf, Florian Buettner, and Fabian J. Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature methods, 13(10):845, 2016.

 (9) Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 (10) Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, page arXiv:1802.03426, February 2018.
 (11) Elad David Amir, Kara L. Davis, Michelle D. Tadmor, Erin F. Simonds, Jacob H. Levine, Sean C. Bendall, Daniel K. Shenfeld, Smita Krishnaswamy, Garry P. Nolan, and Dana Pe’er. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnology, 31(6):545, 2013.
 (12) Linn Sgier, Remo Freimann, Anze Zupanic, and Alexandra Kroll. Flow cytometry combined with viSNE for the analysis of microbial biofilms and detection of microplastics. Nature communications, 7(1):1–10, 2016.
 (13) Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
 (14) Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel W.H. Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38, 2019.
 (15) Yujiao Zhao, Matthew Amodio, Brent Vander Wyk, Bram Gerritsen, Mahesh M. Kumar, David van Dijk, Kevin Moon, Xiaomei Wang, Anna Malawista, Monique M. Richards, Megan Cahill, Anita Desai, Jayasree Sivadasan, Manjunatha Venkataswamy, Vasanthapuram Ravi, Erol Fikrig, Priti Kumar, Steven Kleinstein, Smita Krishnaswamy, and Ruth Montgomery. Single cell immune profiling of dengue virus patients reveals intact immune responses to Zika virus with enrichment of innate immune signatures. PLoS Neglected Tropical Diseases, 14(3):e0008112, 2020.
 (16) Wouter R. Karthaus, Matan Hofree, Danielle Choi, Eliot L. Linton, Mesruh Turkekul, Alborz Bejnood, Brett Carver, Anuradha Gopalan, Wassim Abida, Vincent Laudone, Moshe Biton, Ojasvi Chaudhary, Tianhao Xu, Ignas Masilionis, Katia Manova, Linas Mazutis, Dana Pe’er, Aviv Regev, and Charles Sawyers. Regenerative potential of prostate luminal cells revealed by single-cell analysis. Science, 368(6490):497–505, 2020.
 (17) Chiara Baccin, Jude Al-Sabah, Lars Velten, Patrick M. Helbling, Florian Grünschläger, Pablo Hernández-Malmierca, César Nombela-Arrieta, Lars M. Steinmetz, Andreas Trumpp, and Simon Haas. Combined single-cell and spatial transcriptomics reveal the molecular, cellular and spatial bone marrow niche organization. Nature Cell Biology, pages 1–11, 2019.
 (18) Andrés F. Duque, Guy Wolf, and Kevin R. Moon. Visualizing high dimensional dynamical processes. In IEEE International Workshop on Machine Learning for Signal Processing, 2019.
 (19) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643–2651, 2013.
 (20) Scott Gigante, Adam S. Charles, Smita Krishnaswamy, and Gal Mishne. Visualizing the PHATE of neural networks. In Advances in Neural Information Processing Systems, pages 1840–1851, 2019.

 (21) Stefan Horoi, Victor Geadah, Guy Wolf, and Guillaume Lajoie. Low-dimensional dynamics of encoding and learning in recurrent neural networks. In Canadian Conference on Artificial Intelligence, pages 276–282. Springer, 2020.
 (22) Ronald R. Coifman and Stéphane Lafon. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis, 21(1):31–52, 2006.
 (23) Christopher K.I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
 (24) Max Vladymyrov and Miguel Á. Carreira-Perpinán. Locally linear landmarks for large-scale manifold learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 256–271. Springer, 2013.
 (25) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

 (26) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
 (27) Akara Supratak, Ling Li, and Yike Guo. Feature extraction with stacked autoencoders for epileptic seizure detection. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 4184–4187. IEEE, 2014.

 (28) Hailong Liu and Tadahiro Taniguchi. Feature extraction and pattern recognition for human motion by a deep sparse autoencoder. In 2014 IEEE International Conference on Computer and Information Technology, pages 173–181. IEEE, 2014.
 (29) Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

 (30) Boaz Nadler, Stéphane Lafon, Ioannis Kevrekidis, and Ronald R. Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in neural information processing systems, pages 955–962, 2006.
 (31) Gal Mishne, Uri Shaham, Alexander Cloninger, and Israel Cohen. Diffusion nets. Applied and Computational Harmonic Analysis, 47(2):259–285, 2019.
 (32) Michael Moor, Max Horn, Bastian Rieck, and Karsten Borgwardt. Topological autoencoders. arXiv preprint arXiv:1906.00722, 2019.
 (33) Wenchao Yu, Guangxiang Zeng, Ping Luo, Fuzhen Zhuang, Qing He, and Zhongzhi Shi. Embedding with autoencoder regularization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 208–223. Springer, 2013.
 (34) Kui Jia, Lin Sun, Shenghua Gao, Zhan Song, and Bertram E Shi. Laplacian autoencoders: An explicit learning of nonlinear data manifold. Neurocomputing, 160:250–260, 2015.
 (35) Yann LeCun, Corinna Cortes, and C.J. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 (36) Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and metrics in visualizing similarity of gene expression. BMC bioinformatics, 4(1):48, 2003.
 (37) Jarkko Venna and Samuel Kaski. Visualizing gene interaction graphs with local multidimensional scaling. In ESANN, volume 6, pages 557–562. Citeseer, 2006.
 (38) John A. Lee and Michel Verleysen. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.
 (39) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 (40) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 833–840, 2011.
 (41) Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
 (42) Herbert Edelsbrunner and John Harer. Persistent homology: a survey. Contemporary Mathematics, 453:257–282, 2008.
 (43) Sergey Barannikov. The framed morse complex and its invariants. Singularities and Bifurcations, 1994.
 (44) Yoshua Bengio, JeanFrançois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and Marie Ouimet. Outofsample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in neural information processing systems, pages 177–184, 2004.
 (45) Vin de Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in neural information processing systems, pages 721–728, 2003.
 (46) Geoffrey E. Hinton and Sam Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833–840, 2002.
 (47) Wei Dong, Charikar Moses, and Kai Li. Efficient knearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586, 2011.
 (48) Dmitry Kobak and George C. Linderman. UMAP does not preserve global structure any better than t-SNE when using the same initialization. bioRxiv, 2019.

(49) John A. Lee and Michel Verleysen. Quality assessment of nonlinear dimensionality reduction based on K-ary neighborhoods. In New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pages 21–35, 2008.
 (50) L.F. Kozachenko and Nikolai N. Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
 (51) Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
Supplement
In this supplement, we provide further details about GRAE. Section A discusses related work and the methods we compare with GRAE. Section B presents results using UMAP in the GRAE regularization instead of PHATE. Section C discusses technical details of the experiments.
Appendix A Detailed discussion of compared methods and related work
Regularized autoencoders
The vanilla AE formulation in Eq. (1) of the main paper has been extended for many purposes by adding some kind of regularization. Extensions that produce generative models, such as variational autoencoders (VAE) kingma2013auto, regularize the latent representation to match a tractable probability distribution (e.g., an isotropic multivariate Gaussian), from which it is possible to sample over a continuous domain to generate new points. Contractive autoencoders (CAE) rifai2011contractive penalize the Frobenius norm of the Jacobian of the encoder in addition to the reconstruction loss, encouraging a representation that is more robust to small perturbations of the training data. When dealing with a high dimensional latent space (e.g., the overcomplete case), sparse autoencoders ng2011sparse are particularly useful, introducing a sparsity constraint that forces the network to learn significant features in the data.

More closely related to our work, several geometrically driven regularizations have been proposed over the past decade. A relatively recent approach called Diffusion Nets mishne2019diffusion enforces the AE embedding coordinates to match Diffusion Maps (DM) coifman2006diffusion coordinates, combining an MSE loss with a so-called eigenvector constraint to learn the diffusion geometry. This approach inherits some of the inherent issues of Diffusion Maps. Perhaps most importantly, it inherits DM's inability to ensure a significant representation of the data in a fixed lower dimension, due to the natural orthogonality imposed among the diffusion coordinates moon2019visualizing. Therefore, in general, effective utilization of this approach might require the network architecture (i.e., the latent dimension set by the number of neurons in the bottleneck layer) to be determined from the numerical rank of the diffusion affinity kernel used in DM. This, in turn, would limit the capabilities of the approach in data exploration (e.g., visualization), whereas PHATE (and UMAP) allow an a priori fixed dimension (e.g., 2D or 3D) determined by the GRAE architecture. Moreover, as a spectral kernel method, DM itself tends to be more computationally expensive than PHATE or UMAP moon2019visualizing; UMAP2018. (We note that the DM runtime relative to UMAP is equivalent to that of Laplacian eigenmaps reported in UMAP2018, as the algorithmic difference between these spectral methods is negligible coifman2006diffusion.) As a result, we observe that Diffusion Nets are also relatively slow to train, although some sampling or landmarking techniques would likely help with this aspect. Given the other deficiencies described above, however, we deem the consideration of such computational improvements out of scope here.

The formulation of Diffusion Nets can also be interpreted as a generalization of Laplacian autoencoders (LAE) jia2015laplacian and Embedding with Autoencoder Regularization (EAER) yu2013embedding. Both of these methods include a regularization term that penalizes inaccurate preservation of neighborhood relationships in the original space:
$$\mathcal{L} = \sum_i \|x_i - \mathrm{dec}(\mathrm{enc}(x_i))\|^2 + \lambda \sum_{i,j} M(z_i, z_j), \qquad (3)$$

where $z_i = \mathrm{enc}(x_i)$ and $M$ penalizes latent codes that fail to preserve neighborhood relationships. For instance, in EAER the second term can have the form $\sum_{i,j} w_{ij}\|z_i - z_j\|^2$, where the weights $w_{ij}$ are computed from the weighted edges of an adjacency graph; this recovers the objective function of Laplacian eigenmaps. LAE goes further by adding a second order term involving the Hessian of the encoder. We note that, to the best of our knowledge, no standard code or implementation is available for these methods. Furthermore, while the results reported in jia2015laplacian; yu2013embedding show improvements in embedded-space classification accuracy over benchmark datasets compared to CAE (in LAE jia2015laplacian) and a "pure" manifold learning counterpart (in EAER yu2013embedding), their embeddings do not show a significant advantage over more recent popular methods. It is therefore unlikely that these methods would outperform Diffusion Nets, which does provide a standard implementation, or the other methods considered in this work. We therefore do not compare our method directly to EAER and LAE, instead relying on the comparison with Diffusion Nets and the discussed similarities to them (as well as the similarities between Laplacian eigenmaps and Diffusion Maps) to establish a representative assessment of the improvement provided by GRAE over this family of previous approaches.
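As a concrete illustration, the generic neighborhood-preserving objective of Eq. (3) with the EAER-style Laplacian penalty can be sketched as follows. This is our own minimal NumPy sketch, not the authors' implementation; the function name and the trade-off weight `lam` are hypothetical.

```python
import numpy as np

def eaer_style_loss(x, x_hat, z, w, lam=0.1):
    """Sketch of the generic objective of Eq. (3): reconstruction MSE plus
    a geometric penalty. Here M(z_i, z_j) is the EAER-style Laplacian
    eigenmaps term w_ij * ||z_i - z_j||^2 (illustrative only)."""
    recon = np.mean((x - x_hat) ** 2)
    # pairwise squared distances between latent codes z_i
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return recon + lam * (w * d2).sum()
```

Setting `lam=0` recovers the vanilla AE loss, which makes the regularization easy to ablate.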
We depart from previous methods by providing a more general approach. In many applications, there may be no strong reason to impose on the geometric loss a particular relationship resembling the loss function of a specific kernel method. Any approach employed to compute the embedding target, whether it be PHATE, Isomap, t-SNE, LLE, etc., already performs an optimization of its own loss function, imposing the preservation of its desired geometric relationships in the data; by matching that target, the network implicitly enforces the same relationships.
Recently, topological autoencoders (TAE) moor2019topological were proposed, which include a regularization based on a topological signature of the input data. The topology of a manifold can be characterized by homology groups which, depending on their dimension, represent various topological features, such as the number of connected components or the number of cycles on the manifold that cannot be deformed into one another. When working with data points sampled from a manifold, such topological features can be approximately derived from a ball neighborhood graph of the data. Persistent homology edelsbrunner2008persistent; barannikov1994framed was introduced as a means to identify the topological signature of manifolds based on how long topological features persist as the radius of the neighborhood graph is progressively increased, up to the point where the graph is fully connected (topological features with a short lifespan are attributed to noise). TAE thus penalizes discrepancies between the topological signatures of the input space and the latent space. However, we note that the computation of topological signatures is slow, limiting the ability of TAE to scale to large datasets.
Out-of-sample extension
Manifold learning algorithms are typically based on either the eigendecomposition of a kernel matrix (Diffusion Maps) or a stochastic optimization of the latent coordinates (metric MDS, UMAP). In contrast to neural networks, neither approach extends naturally to new points, as they merely learn an embedding of the training points rather than a function mapping the input space to the latent space.
A well-known solution to this problem is the Nyström extension bengio2004out and its improvements, such as geometric harmonics coifman2006geometric, which approximate an empirical function over new points using linear combinations of the eigenvectors of a kernel matrix computed on the training set. Let $X = \{x_1, \ldots, x_n\}$ be the training set used to compute the initial kernel matrix $\mathbf{K}$, with $\mathbf{K}_{il} = k(x_i, x_l)$ for a kernel function $k$. Then the eigenvectors $\psi_j$ of $\mathbf{K}$, with eigenvalues $\lambda_j$, can be extended to a new point $x$ as follows:

$$\hat{\psi}_j(x) = \frac{1}{\lambda_j} \sum_{i=1}^{n} k(x, x_i)\, \psi_j(x_i).$$
One can thus project a function on the eigenvectors of $\mathbf{K}$ and then extend it to new points using the approximated eigenvectors. However, as discussed in mishne2019diffusion, this approach has several drawbacks. Given $n$ training points (so that $\mathbf{K}$ is $n \times n$), extending a function to $m$ new points requires computing $m$ new kernel rows, leading to a time complexity of $\mathcal{O}(mn)$. Furthermore, the empirical function must be within the interpolation range of the kernel, which requires bandwidth tuning.
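The Nyström extension above reduces to a single matrix product. The following NumPy sketch (our own, with hypothetical names) illustrates it; note that on the training points themselves the extension reproduces the original eigenvectors, since $\mathbf{K}\psi_j = \lambda_j \psi_j$.

```python
import numpy as np

def nystrom_extend(K_new, eigvecs, eigvals):
    """Nystrom out-of-sample extension (sketch).

    K_new   : (m, n) kernel evaluations k(x_new, x_train)
    eigvecs : (n, p) eigenvectors psi_j of the training kernel matrix K
    eigvals : (p,)   corresponding eigenvalues lambda_j

    Returns the (m, p) approximate eigenvector values at the new points:
    psi_j(x) = (1 / lambda_j) * sum_i k(x, x_i) * psi_j(x_i).
    """
    return K_new @ eigvecs / eigvals
```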
Other methods, such as those discussed in vladymyrov2013locally, perform OOSE through a linear combination of training points (the "landmarks") close to the new points in the input space, as in PHATE moon2019visualizing and landmark MDS silva2003global. UMAP takes a similar approach, initializing the latent coordinates of new points based on their affinities with training points in the input space; the new layout is then optimized by gradient descent, following the same optimization scheme as the main algorithm and using the embeddings of the training points as references.
All of the aforementioned approaches require the training points, or a subset of them, to be stored in memory along with their embeddings as a proxy for the target function, which can quickly become inconvenient for large datasets or degrade embedding quality because of subsampling or the use of landmarks. Moreover, they do not provide a straightforward approximation of the inverse function, which is critical for assessing how well information is preserved in the embedded space in a truly unsupervised setting. As such, they are not directly comparable to GRAE and other AE-based models, which offer a native approximation of the inverse and only need to store the weights and biases of the network to perform OOSE, making their memory requirements independent of the size of the training set.
UMAP
The key assumption of manifold learning is that data points lie on or near a low-dimensional manifold in the ambient space. Nonlinear dimensionality reduction techniques can leverage the geometry of this manifold to find a low-dimensional embedding that preserves the relevant signals in the data. Recent manifold learning kernel methods typically follow the framework introduced by SNE snepaper and further extended by t-SNE maaten2008visualizing, itself a generalization of the metric MDS algorithm, whereby the coordinates in the latent space are optimized by gradient descent to recreate the pairwise similarities (as defined by a kernel) of the input space. Intuitively, a kernel that assigns high similarities to close neighbors captures the curvature of the underlying manifold in the ambient space. t-SNE, for instance, uses normalized Gaussian similarities in the input space and t-distributed similarities in the latent space; the embedding is optimized to minimize the Kullback-Leibler divergence between the two distributions.
UMAP UMAP2018 was introduced as an improvement over this framework, with claims of better preservation of global features and better run times. Specifically, the cost function of t-SNE is replaced by cross-entropy, and similarities between objects in the input space are computed from the smooth nearest neighbor distances, that is:

$$v_{j|i} = \exp\left(\frac{-(d(x_i, x_j) - \rho_i)}{\sigma_i}\right),$$

where $\rho_i$ is the distance between $x_i$ and its nearest neighbor, $\sigma_i$ is the bandwidth, and $d$ is a distance, not necessarily Euclidean. In contrast with t-SNE, UMAP does not normalize similarities and relies on an approximation of the neighborhoods using the Nearest-Neighbor-Descent algorithm of dong2011efficient. The UMAP implementation further distinguishes itself from t-SNE by not restricting the embedded space to two or three dimensions.
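The smooth nearest neighbor similarity above can be sketched in a few lines of NumPy (our own illustration of the formula, not the UMAP library code; the clipping of $d - \rho_i$ at zero guarantees the nearest neighbor receives similarity 1).

```python
import numpy as np

def smooth_knn_similarity(dists, rho, sigma):
    """Directed UMAP-style input-space similarity of a point to its
    neighbors: v = exp(-(d - rho) / sigma), with d - rho clipped at zero
    so the nearest neighbor always gets similarity 1 (sketch)."""
    return np.exp(-np.maximum(dists - rho, 0.0) / sigma)
```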
Recently, the claim that UMAP improves over t-SNE in preserving global structure has been challenged in kobak2019umap. The authors attribute the better global structure commonly obtained by UMAP to differences in the initialization procedure: t-SNE typically uses a random initialization, whereas UMAP starts from Laplacian eigenmaps. They showed that nearly identical results can be obtained by also initializing t-SNE with Laplacian eigenmaps.
Appendix B Generalization of geometric regularization to additional manifold learning approaches
We implemented GRAE using PHATE for the experiments in the main document, mainly motivated by that method's demonstrated ability to preserve global structure, as empirically verified by the supervised metrics in Table 2 of the main paper and in moon2019visualizing. However, the extension of GRAE to other manifold learning algorithms is straightforward, requiring only the computation of the embedding target with the desired algorithm. Here, we repeat some of the experiments of the main document with PHATE replaced by UMAP or t-SNE, both of which have become popular visualization methods in recent years.
Visualizations for GRAE UMAP and GRAE t-SNE are provided in Figure 4. Most of the comments in Section 3.1 of the main paper apply here as well. We further observe that GRAE UMAP successfully recovers the corners of the Swiss Roll and appears to have a regularizing effect on both the training and test data, filling some of the holes present in the original UMAP embedding. This could be the result of GRAE (or AE in general) being more robust to fluctuations in density than UMAP. We note, however, that the Faces embeddings of GRAE UMAP and GRAE t-SNE do not match those offered by GRAE in Figure 2 of the main paper. Moreover, GRAE t-SNE fails to return a sensible geometry on the Swiss Roll.
Preliminary metrics for GRAE UMAP and GRAE t-SNE appear to support the observation in the main paper that a good latent representation geometry benefits reconstruction error. For example, embeddings from GRAE UMAP typically result in reconstruction errors on the order of 0.004 on the Swiss Roll, 0.017 on Faces, and 0.017 on Rotated Digits, marking an improvement over both AE and UMAP (see Table 1 in the main paper).
Appendix C Technical details
Architecture and Tuning:
We use the same AE architecture for all autoencoder-based models: 3 fully-connected hidden layers in the encoder, a bottleneck of dimension 2, and 3 fully-connected hidden layers in the decoder, producing the following sequence of neurons: 800-400-200-2-200-400-800. We choose a bottleneck dimension of 2 for visualization purposes, although higher dimensions could also be used. We apply ReLU activations on all layers except the bottleneck and the output of the decoder. To prevent overfitting, we include an L2 penalty with a weight-decay parameter equal to 1. We use a learning rate of 0.0001 and a batch size of 200 for all models and problems, with MSE as the reconstruction error.

We use the default parameters of UMAP and PHATE (as the embedding target of GRAE) for most problems, the only exception being the Swiss Roll, where setting the neighborhood parameter to 20 led to better visuals for both UMAP and PHATE; presumably this enables the algorithms to better bridge the gap left by the removal of the test slice. For TAE, we used a regularization factor of 1000, since lower values led to embeddings indistinguishable from those of the vanilla AE. Diffusion Nets is run using the provided base code, changing only the number of neurons in the neural network.
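The described 800-400-200-2-200-400-800 architecture can be sketched in PyTorch as follows. This is our own reconstruction from the text (function name and factory style are ours), not the authors' code.

```python
import torch.nn as nn

def build_grae_autoencoder(input_dim, latent_dim=2):
    """Fully-connected autoencoder matching the described architecture:
    hidden layers of 800, 400 and 200 units on each side of the
    bottleneck, ReLU everywhere except the bottleneck and the decoder
    output (illustrative sketch)."""
    encoder = nn.Sequential(
        nn.Linear(input_dim, 800), nn.ReLU(),
        nn.Linear(800, 400), nn.ReLU(),
        nn.Linear(400, 200), nn.ReLU(),
        nn.Linear(200, latent_dim),      # no activation at the bottleneck
    )
    decoder = nn.Sequential(
        nn.Linear(latent_dim, 200), nn.ReLU(),
        nn.Linear(200, 400), nn.ReLU(),
        nn.Linear(400, 800), nn.ReLU(),
        nn.Linear(800, input_dim),       # no activation at the output
    )
    return encoder, decoder
```

The L2 penalty would then be applied through the optimizer's weight-decay argument rather than inside the network.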
Splits and datasets:
All results were computed on a test split to benchmark how well the considered methods generalize to new data points. We generated 10,000 points on the Swiss Roll using the scikit-learn library. The manifold is then stretched and rotated before removing a thin middle slice of 250 points for testing. We selected this approach over uniform sampling of the points to study how the various methods behave when required to generalize to out-of-distribution data.
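The Swiss Roll split can be sketched as follows, assuming the slice is taken along the unrolled coordinate `t`; the seed, slice rule, and function name are our assumptions, and the stretching/rotation step is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

def swiss_roll_slice_split(n=10000, slice_size=250, seed=42):
    """Generate a Swiss Roll and hold out a thin middle slice as an
    out-of-distribution test set (sketch of the described protocol)."""
    X, t = make_swiss_roll(n_samples=n, random_state=seed)
    order = np.argsort(t)                      # rank points along the roll
    mid = n // 2
    test_idx = order[mid - slice_size // 2: mid + slice_size // 2]
    mask = np.ones(n, dtype=bool)
    mask[test_idx] = False                     # interior band becomes the test set
    return X[mask], t[mask], X[~mask], t[~mask]
```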
The Rotated Digits are generated as described in the main paper and the Faces images are used as-is. On both problems, we set aside 20% of the data for testing.
We report the average of 10 runs for all experiments and use the same splits for all runs, as we do not expect significant variance resulting from different splits. The objective was rather to show the robustness of our results against the inherently stochastic nature of training neural networks and fitting manifold learning algorithms (both PHATE and UMAP, for instance, rely on randomness and can return slightly different results depending on the random state). We did not include an analysis of the standard deviation of the metrics in the main paper, given that the rankings were generally stable from run to run (and metric to metric, especially in Table 2 of the main paper). Still, we provide box plots for the reconstruction error in Figure 5 to add more context to the findings of the main paper.

Figure 5: Box plots of the reconstruction error on the Swiss Roll, Faces, and Rotated Digits datasets over 10 runs. The edges of a given box mark the first and third quartile values of the data, with an additional green line indicating the median. The whiskers are set to 1.5 times the interquartile range, and outliers are marked as circles. Red arrows indicate models with significantly higher reconstruction error that are omitted from the visualizations to focus on competitive models.
Unsupervised evaluation metrics
Trustworthiness:
This unsupervised performance metric kaski2003trustworthiness measures how faithfully the neighborhood of each observation in the latent space represents its neighborhood in the original space. More formally, let $r(i, j)$ be the rank of $x_j$ with respect to its distance from $x_i$ in the high dimensional space, and let $U_k(i)$ be the set of observations among the $k$ nearest neighbors of observation $i$ in the latent space but not in the original space. Then trustworthiness is defined as:

$$T(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in U_k(i)} \left(r(i, j) - k\right).$$

Trustworthiness is low if observations embedded among the nearest neighbors of an observation in the latent space are not among its neighbors in the original space.
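A direct NumPy implementation of the trustworthiness formula above can be sketched as follows (our own reference implementation; a perfect embedding, for which every latent neighborhood matches the input neighborhood, scores exactly 1).

```python
import numpy as np

def trustworthiness_score(X, Z, k=5):
    """Trustworthiness T(k) as defined above (sketch). Ranks r(i, j) are
    computed in the input space X; U_k(i) collects points that are among
    the k nearest neighbors of i in the latent space Z but not in X."""
    n = len(X)
    dX = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    dZ = np.sqrt(((Z[:, None] - Z[None]) ** 2).sum(-1))
    rank_X = np.argsort(np.argsort(dX, axis=1), axis=1)  # self has rank 0
    nn_X = np.argsort(dX, axis=1)[:, 1:k + 1]            # k-NN in input space
    nn_Z = np.argsort(dZ, axis=1)[:, 1:k + 1]            # k-NN in latent space
    total = 0.0
    for i in range(n):
        for j in set(nn_Z[i]) - set(nn_X[i]):            # U_k(i)
            total += rank_X[i, j] - k
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * total
```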
Continuity:
Continuity kaski2003trustworthiness measures the discrepancy in the other direction. Let $\hat{r}(i, j)$ be the rank of $x_j$ with respect to its distance from $x_i$ in the low-dimensional space, and let $V_k(i)$ be the set of observations among the $k$ nearest neighbors of observation $i$ in the original space but not in the latent space. Then:

$$C(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in V_k(i)} \left(\hat{r}(i, j) - k\right).$$
Mean relative rank error (MRRE):
Using the same ranks as continuity and trustworthiness, MRRE lee2008quality penalizes the relative difference between the ranks within the neighborhood of each observation. Let $N_k(i)$ be the set of the $k$ nearest neighbors of observation $i$ in the latent space. Then:

$$\mathrm{MRRE}(k) = \frac{1}{C} \sum_{i=1}^{N} \sum_{j \in N_k(i)} \frac{|r(i, j) - \hat{r}(i, j)|}{\hat{r}(i, j)},$$

where $C$ is a normalizing constant. MRRE thus penalizes relative changes of the ranks within the neighborhood of each observation.
Supervised metrics
Distance Correlation:
We first compute the pairwise Euclidean distance matrix of the model embedding and then the same matrix using the ground truth coordinates. The reported metric is the Pearson correlation between the flattened lower triangular halves of both matrices.
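The distance correlation metric described above reduces to a few NumPy operations; the sketch below is our own (hypothetical function name), following the stated recipe: pairwise Euclidean distances, lower-triangular flattening, then Pearson correlation.

```python
import numpy as np

def distance_correlation(E, G):
    """Pearson correlation between the flattened lower-triangular pairwise
    Euclidean distance matrices of an embedding E and ground-truth
    coordinates G (sketch of the described metric)."""
    def lower_tri_dists(A):
        D = np.sqrt(((A[:, None] - A[None]) ** 2).sum(-1))
        return D[np.tril_indices(len(A), k=-1)]  # strictly below the diagonal
    return np.corrcoef(lower_tri_dists(E), lower_tri_dists(G))[0, 1]
```

Since Pearson correlation is invariant to positive scaling and translation, any similarity transform of the ground truth scores exactly 1.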
Mutual Information (MI):
Mutual information quantifies the dependence between two random variables. Let $X$ and $Y$ be two continuous random variables, with marginal densities $p(x)$ and $p(y)$ and joint density $p(x, y)$. Then mutual information is defined as:

$$I(X; Y) = \int_Y \int_X p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right) dx\, dy.$$
On real data, the densities of continuous random variables must be estimated. We rely on a scikit-learn utility to do so, itself based on entropy estimation from k-nearest neighbor distances, as originally proposed in kozachenko1987sample.
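The scikit-learn estimator in question is `mutual_info_regression`, whose k-NN entropy estimation traces back to the Kozachenko-Leonenko approach. The sketch below (synthetic data and variable names are ours) shows that a strongly dependent coordinate pair receives a much higher MI estimate than an independent one.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Sketch: estimate MI between a ground-truth coordinate and two candidate
# embedding axes, one dependent and one independent.
rng = np.random.default_rng(0)
truth = rng.normal(size=500)
embedding_axis = truth + 0.1 * rng.normal(size=500)  # strongly dependent
noise_axis = rng.normal(size=500)                    # independent

mi_dep = mutual_info_regression(truth.reshape(-1, 1), embedding_axis,
                                random_state=0)[0]
mi_ind = mutual_info_regression(truth.reshape(-1, 1), noise_axis,
                                random_state=0)[0]
```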
Methodology for Pearson correlation, Spearman correlation and Mutual information:
All three measures are first reported on the raw embeddings of the models. Let $x$ and $y$ be the coordinate vectors of the manifold ground truth and $e_1$ and $e_2$ the coordinate vectors of an embedding. Given a similarity measure $S$ (Pearson, Spearman, or MI), the reported result is:

$$\bar{S} = \frac{\max\left(|S(x, e_1)|, |S(x, e_2)|\right) + \max\left(|S(y, e_1)|, |S(y, e_2)|\right)}{2}.$$

The absolute value and the max allow us to find the optimal matching of the axes and further ensure that we accurately score identical reflections of the same embedding; averaging over both axes further summarizes the results.
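The axis-matching score above can be sketched directly (our own helper, with any correlation-like measure passed in as `S`); an embedding that merely swaps and reflects the ground-truth axes still scores 1.

```python
import numpy as np

def axis_matched_score(x, y, e1, e2, S):
    """Reported similarity (sketch): for each ground-truth axis, take the
    best absolute match over the two embedding axes, then average."""
    sx = max(abs(S(x, e1)), abs(S(x, e2)))
    sy = max(abs(S(y, e1)), abs(S(y, e2)))
    return (sx + sy) / 2.0
```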
The procedure described above is still arguably unfair towards embeddings that do not recover orthogonal components with the right rotation (e.g., a very good embedding rotated by 45° could score lower). The second reported result for the similarity measures aims to mitigate this by applying the Independent Component Analysis (ICA) algorithm to the embeddings before computing the score. Where PCA finds components with maximum variance, ICA seeks components maximizing statistical independence, which is more robust to rotations. Specifically, we use the FastICA hyvarinen2000independent implementation of scikit-learn, which relies on non-Gaussianity of the components as a proxy for statistical independence. We can observe in Table 2 of the main paper that AE benefits from this preprocessing, especially on the Swiss Roll.
The final variant ('Sections') of the similarity measures partitions the ground truth into 10 sections based on equal-width intervals over the range of the first ground-truth coordinate. The corresponding sections of the first embedding coordinate (or the second, depending on the global optimal matching found by the axis-matching procedure above) are then compared with the second ground-truth coordinate, which should be more or less parallel to the longest 'side' of the sections. The procedure is repeated with 10 more sections over the range of the second ground-truth coordinate, and the 20 results are averaged. This assesses the quality of the embeddings and the behavior of the metrics over longer trajectories orthogonal to each axis. A visualization of the partitions for the Faces dataset is provided in Figure 6.
Hardware & software environment:
Experiments were executed using Python 3.6.9 and Torch 1.5.0 for deep learning components. We used the author implementations of UMAP (0.4.3) and PHATE (1.0.4). We reused the original TAE source code for the topological soft constraint and adapted it to our AE architecture. For Diffusion Nets, we ran their base code with TensorFlow (1.14.0). Other major utilities include scikit-learn 0.22.2 and numpy 1.18.4. We ran everything in a Google Colab environment with an Intel(R) Xeon(R) CPU @ 2.30GHz, 13 GB of available RAM, and an NVIDIA Tesla P100 GPU.