Extendable and invertible manifold learning with geometry regularized autoencoders

07/14/2020 ∙ by Andrés F. Duque, et al. ∙ Université de Montréal Utah State University 0

A fundamental task in data exploration is to extract simplified low dimensional representations that capture intrinsic geometry in data, especially for the purpose of faithfully visualizing data in two or three dimensions. Common approaches to this task use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend to new data points. On the other hand, autoencoders have recently become widely popular for representation learning, but while they naturally compute feature extractors that are both extendable to new data and invertible (i.e., reconstructing original features from latent representation), they provide limited capabilities to follow global intrinsic geometry compared to kernel-based manifold learning. Here, we present a new method for integrating both approaches by incorporating a geometric regularization term in the bottleneck of the autoencoder. Our regularization, based on the diffusion potential distances from the recently-proposed PHATE visualization method, encourages the learned latent representation to follow intrinsic data geometry, similar to manifold learning algorithms, while still enabling faithful extension to new data and reconstruction of data in the original feature space from latent coordinates. We compare our approach with leading kernel methods and autoencoder models for manifold learning to provide qualitative and quantitative evidence of our advantages in preserving intrinsic structure, out of sample extension, and reconstruction.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The high dimensionality of modern data introduces significant challenges in descriptive and exploratory data analysis. These challenges gave rise to extensive work on dimensionality reduction aiming to provide low dimensional representations that preserve or uncover intrinsic patterns and structures in processed data. A common assumption in such work is that high dimensional meausrements are a result of (often nonlinear) functions applied to a small set of latent variables that control the observed phenomena of interest, and thus one can expect an appropriate embedding in low dimensions to indirectly recover a faithful latent data representation. While classic approaches, such as principal component analysis (PCA) 

pearson1901 and classical multidimensional scaling (MDS) cox2008multidimensional

, construct linear embeddings, more recent attempts mostly focus on nonlinear dimensionality reduction. These approaches can be roughly split into manifold learning kernel methods and deep learning autoencoder methods, each with their own benefits and deficiencies.

Kernel methods for manifold learning cover some of the most popular nonlinear dimensionality reduction methods, dating back to the introduction of Isomap tenenbaum2000global and Locally Linear Embedding (LLE) roweis2000nonlinear

. These two methods proposed the notion of data manifolds as a model for intrinsic low dimensional geometry in high dimensional data. The manifold construction in both cases is approximated by a local neighborhood graph, which is then leveraged to form a low-dimensional representation that preserves either pairwise geodesic distances (in the case of Isomap) or local linearity of neighborhoods (in LLE). The construction of neighborhood graphs to approximate manifold structures was further advanced by Laplacian eigenmaps 

belkin2002laplacian and diffusion maps coifman2006diffusion , together with a theoretical framework for relating the captured geometry to Riemannian manifolds via the Laplace-Beltrami operators and heat kernels. These approaches that, until recently, dominated manifold learning can collectively be considered as spectral methods, since the embedding provided by them is based on the spectral decomposition of a suitable kernel matrix that encodes (potentially multiscale) neighborhood structure from the data. They are also known as kernel PCA methods, as they conceptually extend the spectral decomposition of covariance matrices used in PCA, or that of a Gram (inner product) matrix used in classic MDS.

Recent work in dimensionality reduction has focused on visualization for data exploration. Kernel PCA methods are generally unsuitable for such tasks because, while their learned representation has lower dimension than the original data, they tend to embed data geometry in more dimensions than can be conveniently visualized (i.e., significantly more than 2D or 3D). This is typically due to the orthogonality constraint and linearity of spectral decompositions with respect to the initial dimensionality expansion of kernel constructions moon2019visualizing ; haghverdi2016diffusion . This led to the invention of methods like t-SNE (t-Distributed Stochastic Neighbor Embedding) maaten2008visualizing , UMAP (Uniform Manifold Approximation and Projection) UMAP2018 , and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) moon2019visualizing . These methods embed the data by preserving pairwise relationships between points and can thus be viewed as generalizations of metric and non-metric MDS. These methods and their extensions have been used in applications such as single cell genomics moon2019visualizing ; amir2013visne ; sgier2016flow ; macosko2015highly ; becht2019dimensionality ; zhao2020single ; karthaus2020regenerative ; baccin2019combined , visualizing time series duque2019 , visualizing music for a recommendation system van2013deep

, and analyzing the internal representations in neural networks 

2019Gigante ; horoi2020low . However, these and the previously mentioned spectral methods typically provide fixed latent coordinates for the input data. Thus, they do not come with a natural embedding function to perform out-of-sample extension (OOSE). This shortcoming is usually tackled by employing geometric harmonics coifman2006geometric , Nyström extension williams2001using , or a landmark approach vladymyrov2013locally .

In contrast, Autoencoders (AEs) can be viewed as a different paradigm for non-linear dimensionality reduction. First introduced in rumelhart1986learning , this non-convex and parametric approach has gained more attention in recent years, especially due to the computational and mathematical advances in the field allowing the implementation and training of neural networks in a more efficient way. Denoising AEs vincent2008extracting

have become widely used to find good latent representations and perform feature extraction exploiting the flexibility provided by neural networks (e.g.

supratak2014feature , liu2014feature ). In contrast to kernel methods, AEs learn a parametric function and are thus equipped with a natural way to perform OOSE, as well as an inverse mapping from the latent to the input space. Despite these nice properties, AEs usually fail to accurately recover the geometry present in the data. This not only restricts their desirability to perform exploratory data analysis, e.g. via low-dimensional visualization, but can also lead to bad reconstructions over certain regions of the data, as we show in this work.

Motivated by the complementary advantages provided by AE and kernel methods, we introduce geometry-regularized autoencoders (GRAE), a general framework which splices the well-established machinery from kernel methods to recover a sensible geometry with the parametric structure of AEs. Thus we gain the benefits of both methods, furnishing kernel methods with efficient OOSE and inverse mapping, and providing the autoencoder with a geometrically driven representation. To achieve this, GRAE introduces a regularization on the latent representation of the autoencoder, leading it towards a representation previously learned by a kernel method. In this paper we focus our presentation using PHATE moon2019visualizing as our preferred method for finding a sensible geometry. Nevertheless, our general formulation can also be easily implemented with other methods such as UMAP (see the supplement).

2 Geometry-regularized autoencoder

2.1 Learning embedding functions instead of embedding coordinates

Manifold learning methods for dimensionality reduction typically assume data lie on a low dimensional manifold immersed in the high dimensional ambient space. Therefore they aim to map points from to a low dimensional Euclidean space that encodes or reveals its intrinsic geometry. However, in practice, such methods only consider a finite set of data points (for dimensional ambient space), assumed to be sampled from , and optimize a fixed set of low dimensional points (for ) such that the Euclidean relations between pairs will reflect intrinsic nonlinear relations between the corresponding . While the realization of this optimization and its objective vary between methods, a common property of most recent approaches is that the resulting “manifold learning” only provides coordinates for a fixed and finite set of data points. Therefore, they do not provide a general embedding function that operates on, or provides a representation of, the entire manifold . Thus these methods are not applicable to arbitrary input points in , which we would ideally want to project onto the learned manifold.

In this work, we aim to learn a data manifold geometry to find an appropriate embedding function , rather than just a fixed point-cloud embedding. This contrast can be seen, for example, by considering the classic PCA and MDS methods. While both of these methods can be shown analytically to extract equivalent (or even the same) linear embeddings from data, MDS only assigns coordinates to fixed input points (similar to the described manifold learning methods), while PCA provides an embedding function (albeit linear) defined by a projection operator on principal components. Here, we aim to establish a similar equivalence in nonlinear settings by providing an alternative to popular manifold learning approaches that constructs an embedding function (as a nonlinear successor of PCA) and also yields representations that capture an intrinsic geometry similar to that of established kernel methods (seen as successors of MDS).

2.2 Extendable and invertible embedding with autoencoders

As mentioned before, PCA naturally provides an extendable and (approximately) invertible embedding function, but falls short in capturing non-linear mappings. To see this, we recall that its embedding function is constructed using a matrix with orthogonal columns consisting of principal components (PCs) such that

is a projection operator, thus acting as an identity on a hyperplane spanned by the PCs. Then, the PCA embedding is given by the linear function

and its inverse (on the embedded space) is given by , where the approximation quality (quantified as reconstruction error) serves as the optimization target for computing the PCs in . To generalize this construction to compute nonlinear embedding functions over a data manifold , autoencoders (AEs) replace by an encoder function and by a decoder function , which is aimed to serve as an approximate inverse of . Both functions are then both parametrized by a neural network and trained via a reconstruction loss that aims to ensure the composite function acts as an identity on data sampled from . By considering datasets in matrix notations (i.e., with rows as datapoints), the AE optimization can be generally formulated as


where are understood to be applied separately to each row in their input matrix (yielding corresponding output data points organized in matrix form), and

denotes a loss function that measures the discrepancy between the original and reconstructed data points (commonly MSE). It is common to select

forcing the autoencoder to find a representation in latent codes of dimension while retaining as much information for reconstruction as possible. In this case the autoencoder is usually referred to as undercomplete. Under this formulation, instead of learning new coordinates for the input data, we learn an embedding function and an inverse function . If is a linear function, the network will project onto the same subspace spanned by the principal components in PCA baldi1989neural .

The previous formulation departs away from manifold learning approaches as it lacks an explicit condition to recover geometrical interactions between observations. To fill that gap, we propose a general framework called GRAE (Geometry Regularized autoencoders) which explicitly penalizes misguided representations in the latent space from a geometric perspective. Thus, we add a soft constraint in the bottleneck of the autoencoder as follows:


The term in Eq. (2) is the geometric loss, penalizing the discrepancy between the latent representation and the embedding previously learned by a manifold learning algorithm. The parameter determines how strongly the latent space of the AE should match the embedding . Thus the network will implicitly force the latent space of the autoencoder to preserve the relationships learned by the manifold learning technique, resulting in a non-linear embedding function and its inverse that are consistent with sensible geometric properties. We note that by introducing this regularization, the global optimum of the training reconstruction error deteriorates. However, in practice, given the complexity of finding the global optimum of a neural network, it often leads to better reconstructions as we show in Sec. 3. This regularization can be applied with any manifold learning approach, whether it be Isomap, t-SNE, etc. The resulting latent space will then inherit the corresponding strengths and weaknesses of the selected approach.

2.3 Geometric regularization with diffusion-based manifold learning

To generate , we use PHATE moon2019visualizing as it has proven to preserve long-term relationships (global structure) in a low-dimensional representation beyond the capabilities of spectral methods such as Laplacian eigenmaps, Diffusion Maps, LLE, and Isomap, especially when the dimension is required to be 2 or 3 for visualization. PHATE (Alg. 1) is built upon diffusion geometry coifman2006diffusion ; nadler2006diffusion . PHATE first computes the

decay kernel with an adaptive bandwidth, which captures local geometry while remaining robust to density changes. The kernel matrix is normalized to obtain a probability transition matrix

(diffusion operator) between every pair of points. Various scales of the geometry can then be uncovered by computing a -step random walk over , with a higher implying more diffusion, pushing transition probabilities to a more global scale.

0:  Data matrix , neighborhood size , desired embedding dimension , alpha decay
0:  PHATE embedding
1:   compute pairwise distance matrix from

compute local affinity matrix from

3:   normalize to form a Markov transition matrix
4:   compute time scale via Von Neumann Entropy
5:  Diffuse for time steps to obtain
6:  Compute potential representations:
7:   compute potential distances matrix from .
8:   apply classical MDS to
9:   apply metric MDS to with as an initialization
Algorithm 1 PHATE algorithm moon2019visualizing

The parameter

is automatically chosen by studying the entropy of a distribution over the eigenvalues of

(known as the von Neumann Entropy) as increases. Typically, the first few -steps lead to a sharp drop in the entropy, which is thought in moon2019visualizing to be the process of denoising the transition probabilities, whereas later steps will reduce entropy at a lower rate, thus, slowly losing meaningful information in the structure. Subsequently PHATE computes the so-called potential distances , which have proven to be adequate distances between transition probabilities encoded in . Finally, MDS and metric MDS are applied over to optimally preserve the potential distances in a low-dimensional representation.

Figure 1: Overview of GRAE on the Faces dataset tenenbaum2000global . Geometric regularization is applied to enforce similarity of GRAE and PHATE embeddings. The vanilla AE embedding (top right) is added for reference.

Given the PHATE embedding of training points , we define our geometric loss as which is added to the AE reconstruction loss via Eq. (2). Figure 1 shows a graphical overview of GRAE using the PHATE embedding.

3 Empirical results

We experimentally compare GRAE with a standard AE, Diffusion nets mishne2019diffusion , topological autoencoders (TAE) moor2019topological , and UMAP UMAP2018 . Diffusion nets and TAE impose regularizations that enforce a diffusion maps or topological structure on the embedding coordinates. A detailed discussion of these and other related methods such as Embedding with Autoencoder Regularization (ERAE) yu2013embedding and Laplacian Autoencoders (LAE) jia2015laplacian is contained in the supplement. UMAP was selected as a baseline to compare our approach to a “pure” manifold learning algorithm. This choice is motivated by the presence of a native extension to new points in UMAP’s implementation as well as an inverse transform from the latent space back to the input space, thus offering a natural comparison to autoencoders over all the relevant metrics. UMAP is also similar to t-SNE from an algorithmic standpoint and provides qualitatively similar embeddings at a lower computational cost. To emphasize GRAE’s robustness with respect to the regularization parameter , we do not tune it but rather show how similar performances can be obtained using values of 10, 50 and 100, thus serving as an ablation study for the proposed geometric regularization. Full details about the architectures (e.g., for all autoencoder methods) and parameter tuning appear in the supplement.

We perform these comparisons on three benchmark datasets. The first one is the classic manifold problem known as “Swiss Roll” where data points are lying on a two dimensional “roll” in a three dimensional ambient space. Classical approaches such as PCA or MDS typically fail to recover the non linear geometry of the data, as they rely on pairwise Euclidean distances instead of the true geodesic distances along the curvature of the roll. The second dataset consists of 684 images of a face model rotated horizontally and vertically, as introduced in tenenbaum2000global and shown in Fig. 1. An adequate two dimensional embedding of the images is expected to recover both rotation axes in a meaningful way. The final dataset is derived from the MNIST dataset lecun2010mnist , where three digits are picked randomly and rotated 360 times at one-degree intervals, for a total of 1080 images.

All the reported results in this section were computed on a test split to assess the ability of various methods to learn a function from the training data with some generalization capacity. On the Swiss Roll, we remove a thin middle slice and use it for testing (leaving essentially two disconnected swiss rolls for training). We use a typical 80-20 split for the other two datasets.

3.1 Qualitative evaluation

Figure 2: Latent representations from all considered methods on Swiss Roll (top), Faces (middle), and Rotated Digits (bottom). Three different GRAE embeddings with , and are shown. Training points are grey scale and testing points are colored. Faces dataset is colored according to the horizontal rotation axis of the head. Only GRAE recovers a sensible geometry for all three problems.

We first qualitatively evaluate GRAE and the other methods by visualizing the embedding layer after training. The visualizations for all three datasets are presented in Figure 2. We first notice that only GRAE and UMAP are able to uncoil the training points of the Swiss Roll dataset. AE and TAE both produce embeddings reminiscent of what is expected from a linear dimensionality reduction method such as PCA, where the roll is flattened in two dimensions without any unrolling. Diffusion nets achieve a certain degree of uncoiling, but do not recover the intrinsically rectangular nature of the roll. UMAP forces the test points in between the two swiss roll components to be close to the training embedding, thus tearing the overall structure of the manifold. Only GRAE manages to learn a continuous embedding bridging the two rolls of the training set. This demonstrates that GRAE can be used to generate data points in previously unsampled regions of the manifold.

The results on the Faces and Rotated Digits show that progressively increasing the regularization factor of GRAE adds some structure to the AE embedding. The horizontal rotation axis of the Faces is made clearer by GRAE with lambda values of 50 and 100. We further quantify this recovery in Section 3.2. GRAE improves upon the AE embedding by disentangling the rings representing the rotations of the three rotated digits. Going from GRAE (10) to GRAE (50) denoises the rings. Only UMAP achieves an embedding of similar quality, whereas other methods show overlapping curves, which do not faithfully represent the rotations of three different images.

3.2 Quantitative evaluation

To quantitatively evaluate the performance of GRAE, we compare it to the other methods using a variety of metrics. These metrics focus on the preservation of various properties of the ambient space in an unsupervised fashion and include the trustworthiness, continuity (kaski2003trustworthiness , venna2006visualizing ), and mean relative rank error lee2007nonlinear , all of which measure the quality of the latent representations in terms of the preservation of the nearest neighbors of each point in both the ambient and the latent spaces. We refer the reader to the supplement for further discussion of these metrics. Reconstruction error (MSE) is also reported to study the impact of the GRAE regularization on the traditional AE objective. We report the average out of ten runs for all of the performance metrics on the test data.

Continuity (Rank) Trustworthiness (Rank) MRRE (Rank) MSE (Rank)
Dataset Model k = 5 k = 10 k = 5 k = 10 k = 5 k = 10
Swiss Roll Autoencoder 0.973 (5) 0.962 (6) 0.848 (6) 0.855 (6) 0.155 (6) 0.154 (6) 0.014 (4)
GRAE (10) 0.988 (2) 0.981 (2) 0.988 (2) 0.982 (1) 0.013 (2) 0.018 (2) 0.005 (1)
GRAE (50) 0.975 (4) 0.966 (4) 0.953 (4) 0.942 (3) 0.044 (4) 0.052 (4) 0.010 (2)
GRAE (100) 0.973 (5) 0.964 (5) 0.942 (5) 0.933 (5) 0.056 (5) 0.063 (5) 0.011 (3)
TAE (1000) 0.979 (3) 0.967 (3) 0.812 (7) 0.819 (7) 0.192 (7) 0.191 (7) 0.019 (5)
Diffusion Nets 0.991 (1) 0.991 (1) 0.964 (3) 0.934 (4) 0.033 (3) 0.049 (3) 0.584 (7)
UMAP 0.905 (7) 0.861 (7) 0.992 (1) 0.982 (1) 0.010 (1) 0.016 (1) 0.019 (5)
Faces Autoencoder 0.934 (7) 0.939 (3) 0.940 (4) 0.946 (2) 0.065 (4) 0.069 (4) 0.019 (2)
GRAE (10) 0.939 (4) 0.935 (5) 0.949 (2) 0.946 (2) 0.056 (2) 0.062 (2) 0.017 (1)
GRAE (50) 0.936 (5) 0.922 (6) 0.916 (5) 0.893 (5) 0.086 (5) 0.101 (5) 0.019 (2)
GRAE (100) 0.935 (6) 0.917 (7) 0.890 (7) 0.868 (7) 0.112 (6) 0.126 (7) 0.020 (5)
TAE (1000) 0.942 (2) 0.948 (1) 0.945 (3) 0.953 (1) 0.064 (3) 0.066 (3) 0.019 (2)
Diffusion Nets 0.952 (1) 0.945 (2) 0.892 (6) 0.891 (6) 0.115 (7) 0.120 (6) 0.024 (6)
UMAP 0.940 (3) 0.936 (4) 0.956 (1) 0.937 (4) 0.044 (1) 0.057 (1) 0.205 (7)
Rotated Digits Autoencoder 0.949 (7) 0.905 (3) 0.949 (7) 0.922 (7) 0.045 (7) 0.059 (7) 0.020 (5)
GRAE (10) 0.965 (3) 0.842 (6) 0.992 (1) 0.951 (1) 0.005 (1) 0.023 (1) 0.016 (1)
GRAE (50) 0.965 (3) 0.842 (6) 0.992 (1) 0.950 (2) 0.005 (1) 0.023 (1) 0.017 (2)
GRAE (100) 0.965 (3) 0.843 (5) 0.991 (4) 0.950 (2) 0.005 (1) 0.023 (1) 0.017 (2)
TAE (1000) 0.961 (6) 0.935 (2) 0.960 (6) 0.937 (6) 0.036 (6) 0.049 (6) 0.019 (4)
Diffusion Nets 0.986 (1) 0.952 (1) 0.984 (5) 0.950 (2) 0.011 (5) 0.028 (5) 0.026 (6)
UMAP 0.971 (2) 0.856 (4) 0.992 (1) 0.949 (5) 0.005 (1) 0.024 (4) 0.056 (7)
Table 1: Embedding quality of the latent representations based on various unsupervised metrics. GRAE shows competitive results in terms of neighborhood preservation and surprisingly improves the reconstruction error on all datasets. Results are colored by performance rank.

Table 1 shows the results using these metrics. Echoing its convincing visualizations, GRAE quantitatively preserves the geometry of the Swiss Roll, being first or runner-up on all metrics, and further displays competitive results on the other two problems. Generally speaking, models with manifold learning or topological components offer better performance than the vanilla AE over the considered datasets for metrics relating to embedding quality. However, we note that while other methods do show promising results in some cases (sometimes outperforming GRAE), these same methods also underperform in others, and thus may not provide a reliable embedding in truly exploratory unsupervised settings. For example, TAE shows good results on Faces, but significantly underperforms other methods on Swiss Roll. For its part, while generally competitive in terms of continuity, Diffusion Net lacks consistency with respect to other metrics. UMAP shows similar instability, albeit to a somewhat lesser extend. On the other hand, across three configurations presented here, GRAE proves to be noticeably more consistent than competing methods, thus indicating a more reliable overall structure embedding.

Furthermore, with respect to the goal of providing invertible embeddings, other methods typically fail to match the reconstruction error of vanilla AE, which is expected given that they do not specifically account for this metric when inverting latent coordinates (UMAP), or that they are driven by other objectives that may dominate their training over reconstruction terms (TAE and Diffusion Nets). GRAE, on the other hand, surprisingly improves reconstruction on all three benchmarks; most notably on the Swiss Roll and the Rotated Digits. We provide further study of this property in the following section. Overall, our results indicate that geometric regularization based on the PHATE embedding brings tangible benefits to the bottleneck of the auto encoder in terms of structure.

Distance Pearson (Rank) Spearman (Rank) Mutual Information (Rank)
Dataset Model Corr. (Rank) Naive ICA Section Naive ICA Section Naive ICA Section
Swiss Roll Autoencoder 0.093 (7) 0.260 (7) 0.608 (6) 0.445 (7) 0.270 (7) 0.611 (6) 0.452 (6) 1.324 (6) 2.153 (2) 0.657 (6)
GRAE (10) 0.962 (1) 0.980 (1) 0.980 (1) 0.993 (1) 0.987 (1) 0.983 (1) 0.997 (1) 2.597 (1) 2.044 (5) 1.643 (1)
GRAE (50) 0.955 (2) 0.979 (2) 0.977 (2) 0.991 (2) 0.985 (2) 0.980 (2) 0.993 (2) 2.463 (2) 2.074 (4) 1.625 (2)
GRAE (100) 0.953 (3) 0.977 (3) 0.976 (3) 0.991 (2) 0.982 (3) 0.979 (3) 0.992 (3) 2.434 (3) 2.110 (3) 1.619 (3)
TAE (1000) 0.155 (6) 0.278 (6) 0.627 (5) 0.500 (5) 0.287 (6) 0.633 (5) 0.502 (5) 1.230 (7) 2.206 (1) 0.688 (5)
Diffusion Nets 0.336 (5) 0.394 (5) 0.443 (7) 0.455 (6) 0.309 (5) 0.388 (7) 0.378 (7) 1.492 (5) 1.431 (7) 0.535 (7)
UMAP 0.791 (4) 0.912 (4) 0.916 (4) 0.926 (4) 0.873 (4) 0.890 (4) 0.934 (4) 1.644 (4) 1.586 (6) 1.099 (4)
Faces Autoencoder 0.040 (7) 0.253 (7) 0.295 (7) 0.370 (6) 0.233 (7) 0.300 (7) 0.358 (7) 0.364 (6) 0.298 (7) 0.224 (6)
GRAE (10) 0.691 (5) 0.642 (4) 0.642 (4) 0.737 (4) 0.629 (4) 0.628 (5) 0.723 (4) 0.731 (5) 0.720 (5) 0.524 (5)
GRAE (50) 0.725 (2) 0.729 (2) 0.735 (2) 0.829 (2) 0.732 (2) 0.740 (2) 0.817 (2) 0.860 (3) 0.885 (3) 0.659 (2)
GRAE (100) 0.718 (4) 0.758 (1) 0.763 (1) 0.855 (1) 0.763 (1) 0.769 (1) 0.847 (1) 0.872 (2) 0.901 (2) 0.665 (1)
TAE (1000) 0.067 (6) 0.269 (6) 0.304 (6) 0.362 (7) 0.244 (6) 0.304 (6) 0.365 (6) 0.353 (7) 0.309 (6) 0.216 (7)
Diffusion Nets 0.719 (3) 0.684 (3) 0.686 (3) 0.767 (3) 0.697 (3) 0.703 (3) 0.748 (3) 1.027 (1) 1.079 (1) 0.569 (3)
UMAP 0.854 (1) 0.633 (5) 0.636 (5) 0.706 (5) 0.628 (5) 0.649 (4) 0.711 (5) 0.794 (4) 0.739 (4) 0.536 (4)
Table 2: Embedding quality with respect to the ground truth manifolds for the Swiss Roll and the Faces datasets. Results are colored by performance rank. GRAE comes first on the majority of metrics.

Beyond the unsupervised metrics provided in Table 1, we also recall that an important motivation and goal for using manifold learning methods in data exploration is to reveal underlying latent variables in the data, explaining variation and global structure in them. To quantify the ability of GRAE and other methods to extract such information we note that both the Swiss Roll and Faces dataset provide two dimensional ground truth representations for their latent space, consistently capturing a global intrinsic coordinate system for their manifold structure111Rotated Digits not considered here as it lacks global structure, since it consists of 3 disconnected manifolds.. Namely, in Swiss Roll we have access to the intrinsic manifold coordinates given by length of the roll and the rolled axis, while in Faces we have access to physical latent variables given by the two rotation axes of the head in the captured images. We therefore quantify the quality of the embedded representations in these cases by comparing them with the true intrinsic coordinates, using metrics such as the Pearson correlation to correlate the Euclidean distance matrices of the embeddings and the true intrinsic space. We further report the Pearson correlation, the Spearman correlation, and the mutual information between the embedding coordinates and the intrinsic coordinates. For these three measures, we first find the optimal matching between the embedding axes and the intrinsic axes (in the case of Pearson and Spearman, the sign is ignored) and the metrics are then computed over i) the raw embeddings, ii) the embeddings after an ICA preprocessing (to be more flexible with respect to rotations) and iii) by averaging the applicable metric over 10 equal-width slices of the intrinsic space with respect to both intrinsic variables (hence 20 slices total). Further details on the calculation of these metrics appear in the supplement.

The results of the described evaluation of embedded representations compared to the true “manifold” coordinates are shown in Table 2. Our results show that GRAE outperforms other methods, achieving the best results in almost all metrics on the two datasets (with only two exceptions, which do not indicate a single stable competitor to our method), suggesting it provides a more faithful representation with superior ability to reveal latent structure and variables in data. Further, we note that over the three configurations considered here, GRAE often achieves also the second and third best performances, indicating robustness to the choice of its regularization parameter .

3.3 Impact of geometric regularization on reconstruction quality

Based on the GRAE reconstruction errors (Table 1), we observe that GRAE improves the AE latent representation, guiding the network outside a local optimum region towards a more accurate geometry. Additionally, the regularization generates a more stable reconstruction over the whole dataset, especially in those regions where the autoencoder produces inaccurate geometry in the latent space. To support these claims, we conduct an experiment on two rotated MNIST digits (Figure 3

), generating a full rotation for each of the digits and removing in-between observations as the test set. After training GRAE on the remaining observations, we interpolate over consecutive observations i.e., consecutive angle rotations in the training set. Then we compute the reconstruction error between the generated points via interpolation with the previously removed in-between observations.

This experiment shows that learning accurate geometry from the data, can be useful to generate new points on the latent space via interpolation over adjacent observations. Such points can then be fed to GRAE’s decoder, generating a more faithfully reconstructed geodesic trajectory between points on the manifold in the ambient space in comparison to AE.

Figure 3: A) Distributions of errors averaged over ten runs for different values of

. Red lines represent the 1st and 3rd quartiles, and dashed black lines represent the median. We notice that AE is more unstable than GRAE, having a more spread distribution. Despite achieving the smallest reconstruction errors for some observations, it also fails heavily for others, while GRAE typically presents lighter tails.

B) A typical embedding produced for different values of . Blue points represent a subsample of the training data (subsampling only done for visualization purposes). Black points are the generated points on the latent space via interpolation. Red markers in the AE embedding represent the 20 interpolated points with the highest reconstruction error. We observe that bad reconstruction typically occurs in sparse regions or crossing lines, i.e., in regions with poorly learned geometry.

4 Conclusion

We proposed geometry regularized autoencoder (GRAE), a general parametric framework to enhance autoencoders’ latent representation by taking advantage of well established manifold learning methods. By imposing a geometrical soft constraint on the bottleneck of the autoencoder, we demonstrated empirically how GRAE can achieve good visualizations and good latent representations on several performance metrics compared to AE and other methods motivated by geometry. Furthermore, GRAE is equipped with an inverse mapping that may produce better reconstruction than AE. While the primary focus of this work is on using PHATE embeddings to regularize the bottleneck, our approach is general and we show in the supplementary material preliminary results using UMAP as the embedding target for GRAE. We leave to future work the study of other manifold learning algorithms as constraints for learning AE representations with better geometry and the benefits they bring in terms of visualizations, reconstruction, and generation of data.


This research was partially funded by an IVADO (l’institut de valorisation des données) excellence scholarship [S.M.]; IVADO Professor research funds, and NIH grant R01GM135929 [G.W]. The content provided here is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.


  • (1) Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • (2) Michael A.A. Cox and Trevor F. Cox. Multidimensional scaling. In

    Handbook of data visualization

    , pages 315–347. Springer, 2008.
  • (3) Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • (4) Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
  • (5) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
  • (6) Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
  • (7) Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, 2019.
  • (8) Laleh Haghverdi, Maren Buettner, F. Alexander Wolf, Florian Buettner, and Fabian J. Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature methods, 13(10):845, 2016.
  • (9) Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE.

    Journal of machine learning research

    , 9(Nov):2579–2605, 2008.
  • (10) Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, page arXiv:1802.03426, February 2018.
  • (11) El-ad David Amir, Kara L. Davis, Michelle D. Tadmor, Erin F. Simonds, Jacob H. Levine, Sean C. Bendall, Daniel K. Shenfeld, Smita Krishnaswamy, Garry P. Nolan, and Dana Pe’er. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature biotechnology, 31(6):545, 2013.
  • (12) Linn Sgier, Remo Freimann, Anze Zupanic, and Alexandra Kroll. Flow cytometry combined with viSNE for the analysis of microbial biofilms and detection of microplastics. Nature communications, 7(1):1–10, 2016.
  • (13) Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
  • (14) Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel W.H. Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature biotechnology, 37(1):38, 2019.
  • (15) Yujiao Zhao, Matthew Amodio, Brent Vander Wyk, Bram Gerritsen, Mahesh M. Kumar, David van Dijk, Kevin Moon, Xiaomei Wang, Anna Malawista, Monique M. Richards, Megan Cahill, Anita Desai, Jayasree Sivadasan, Manjunatha Venkataswamy, Vasanthapuram Ravi, Erol Fikrig, Priti Kumar, Steven Kleinstein, Smita Krishnaswamy, and Ruth Montgomery. Single cell immune profiling of dengue virus patients reveals intact immune responses to Zika virus with enrichment of innate immune signatures. PLoS Neglected Tropical Diseases, 14(3):e0008112, 2020.
  • (16) Wouter R. Karthaus, Matan Hofree, Danielle Choi, Eliot L. Linton, Mesruh Turkekul, Alborz Bejnood, Brett Carver, Anuradha Gopalan, Wassim Abida, Vincent Laudone, Moshe Biton, Ojasvi Chaudhary, Tianhao Xu, Ignas Masilionis, Katia Manova, Linas Mazutis, Dana Pe’er, Aviv Regev, and Charles Sawyers. Regenerative potential of prostate luminal cells revealed by single-cell analysis. Science, 368(6490):497–505, 2020.
  • (17) Chiara Baccin, Jude Al-Sabah, Lars Velten, Patrick M. Helbling, Florian Grünschläger, Pablo Hernández-Malmierca, César Nombela-Arrieta, Lars M. Steinmetz, Andreas Trumpp, and Simon Haas. Combined single-cell and spatial transcriptomics reveal the molecular, cellular and spatial bone marrow niche organization. Nature Cell Biology, pages 1–11, 2019.
  • (18) Andrés F. Duque, Guy Wolf, and Kevin R. Moon. Visualizing high dimensional dynamical processes. In IEEE International Workshop on Machine Learning for Signal Processing, 2019.
  • (19) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643–2651, 2013.
  • (20) Scott Gigante, Adam S. Charles, Smita Krishnaswamy, and Gal Mishne. Visualizing the phate of neural networks. In Advances in Neural Information Processing Systems, pages 1840–1851, 2019.
  • (21) Stefan Horoi, Victor Geadah, Guy Wolf, and Guillaume Lajoie.

    Low-dimensional dynamics of encoding and learning in recurrent neural networks.


    Canadian Conference on Artificial Intelligence

    , pages 276–282. Springer, 2020.
  • (22) Ronald R Coifman and Stéphane Lafon. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis, 21(1):31–52, 2006.
  • (23) Christopher K.I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
  • (24) Max Vladymyrov and Miguel Á Carreira-Perpinán. Locally linear landmarks for large-scale manifold learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 256–271. Springer, 2013.
  • (25) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • (26) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol.

    Extracting and composing robust features with denoising autoencoders.

    In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
  • (27) Akara Supratak, Ling Li, and Yike Guo. Feature extraction with stacked autoencoders for epileptic seizure detection. In 2014 36th Annual international conference of the IEEE engineering in medicine and biology society, pages 4184–4187. IEEE, 2014.
  • (28) Hailong Liu and Tadahiro Taniguchi.

    Feature extraction and pattern recognition for human motion by a deep sparse autoencoder.

    In 2014 IEEE International Conference on Computer and Information Technology, pages 173–181. IEEE, 2014.
  • (29) Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
  • (30) Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R. Coifman.

    Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators.

    In Advances in neural information processing systems, pages 955–962, 2006.
  • (31) Gal Mishne, Uri Shaham, Alexander Cloninger, and Israel Cohen. Diffusion nets. Applied and Computational Harmonic Analysis, 47(2):259–285, 2019.
  • (32) Michael Moor, Max Horn, Bastian Rieck, and Karsten Borgwardt. Topological autoencoders. arXiv preprint arXiv:1906.00722, 2019.
  • (33) Wenchao Yu, Guangxiang Zeng, Ping Luo, Fuzhen Zhuang, Qing He, and Zhongzhi Shi. Embedding with autoencoder regularization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 208–223. Springer, 2013.
  • (34) Kui Jia, Lin Sun, Shenghua Gao, Zhan Song, and Bertram E Shi. Laplacian auto-encoders: An explicit learning of nonlinear data manifold. Neurocomputing, 160:250–260, 2015.
  • (35) Yann LeCun, Corinna Cortes, and C.J. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • (36) Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and metrics in visualizing similarity of gene expression. BMC bioinformatics, 4(1):48, 2003.
  • (37) Jarkko Venna and Samuel Kaski. Visualizing gene interaction graphs with local multidimensional scaling. In ESANN, volume 6, pages 557–562. Citeseer, 2006.
  • (38) John A. Lee and Michel Verleysen. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.
  • (39) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • (40) Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 833–840, 2011.
  • (41) Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
  • (42) Herbert Edelsbrunner and John Harer. Persistent homology-a Survey. Contemporary mathematics, 453:257–282, 2008.
  • (43) Sergey Barannikov. The framed morse complex and its invariants. Singularities and Bifurcations, 1994.
  • (44) Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and Marie Ouimet. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in neural information processing systems, pages 177–184, 2004.
  • (45) Vin de Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in neural information processing systems, pages 721–728, 2003.
  • (46) Geoffrey E. Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15(Nov):833–840, 2008.
  • (47) Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586, 2011.
  • (48) Dmitry Kobak and George C. Linderman. UMAP does not preserve global structure any better than t-SNE when using the same initialization. bioRxiv, 2019.
  • (49) John Lee and Michel Verleysen. Quality assessment of nonlinear dimensionality reduction based on K-ary neighborhoods. In

    New Challenges for Feature Selection in Data Mining and Knowledge Discovery

    , pages 21–35, 2008.
  • (50) L.F. Kozachenko and Nikolai N. Leonenko.

    Sample estimate of the entropy of a random vector.

    Problemy Peredachi Informatsii, 23(2):9–16, 1987.
  • (51) Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.


In this supplement, we provide further details about GRAE. In Section A, we provide further details about related work and the methods we compare with GRAE. In Section B, we present some results using UMAP in the GRAE regularization instead of PHATE. We then discuss technical details of the experiments in Section C.

Appendix A Detailed discussion of compared methods and related work

Regularized autoencoders

The vanilla AE formulation in Eq. (1) of the main paper has been extended for many purposes by adding some kind of regularization. Extensions to produce generative models, such as variational autoencoders (VAE) kingma2013auto

, regularize the latent representation to match a tractable probability distribution (e.g., isotropic multivariate Gaussian), from which it is possible to sample over a continuous domain to generate new points. Contractive autoencoders (CAE)

rifai2011contractive penalize the Frobenius norm of the Jacobian of the encoder in addition to the reconstruction loss, encouraging a more robust representation around small perturbations from the training data. When dealing with a high dimensional latent space (e.g., the overcomplete case), sparse autoencoders ng2011sparse are particularly useful, introducing a sparsity constraint that forces the network to learn significant features in the data.

More closely related to our work, some attempts to impose geometrically driven regularizations have been proposed over the past decade. A relatively new implementation called Diffusion Nets mishne2019diffusion enforces the AE embedding coordinates to match with Diffusion Maps (DM) coifman2006diffusion

coordinates. They combine an MSE loss with the so-called eigenvector constraint to learn the diffusion geometry. Their approach inherits some of the inherent issues of Diffusion Maps. Perhaps most importantly, they inherit its inability to ensure significant representation of the data on a fixed lower dimension, due to the natural orthogonality imposed among the diffusion coordinates 


. Therefore, in general, effective utilization of their approach might require the network architecture (i.e., latent dimension set by the number of neurons in the bottleneck layer) to be determined from the numerical rank of the diffusion affinity kernel used in DM. This, in turn, would limit the capabilities of this approach in data exploration (e.g., visualization), while in contrast PHATE (and UMAP) allow an a priori set dimension (e.g., 2D or 3D) determined by the GRAE architecture. Moreover, as a spectral kernel method, DM itself

222We note that the DM runtime (relative to UMAP) is equivalent to that of Laplacian eigenmaps reported in UMAP2018 , as the algorithmic difference between these spectral methods is negligible coifman2006diffusion . tends to be more computationally expensive than PHATE or UMAP moon2019visualizing ; UMAP2018 . As a result, we observe that Diffusion Nets are also relatively slow to train, although it is likely that some sampling or landmarking techniques may help with this aspect. However, given the other described deficiencies, we deem the consideration of such computational improvements out of scope here.

The formulation of Diffusion Nets can also be interpreted as a generalization of Laplacian autoencoders (LAE) jia2015laplacian and Embedding with regularized autoencoders (EAER) yu2013embedding . Both of these methods include a regularization term that penalizes inaccurate preservation of neighborhood relationships in the original space:


For instance, in ERAE the second term can have the form , where are computed using the weighted edges in an adjacency graph. This gives the objective function of Laplacian eigenmaps. LAE goes further by adding a second order term including the Hessian of . We note that, to the best of our knowledge, no standard code or implementation is available for these methods. Furthermore, while reported results in jia2015laplacian ; yu2013embedding show improvement in embedded-space classification accuracy over benchmark datasets compared to CAE (in LAE jia2015laplacian ) and a “pure” manifold learning counterpart (in EAER yu2013embedding ), their embeddings do not show a significant advantage over more recent popular methods. It is therefore unlikely these methods would outperform Diffusion Nets, which does provide a standard implementation, as well as other methods considered in this work. Therefore, we do not compare our method directly to ERAE and LAE in this work, instead relying on the comparison with Diffusion Nets and the discussed similarities to them (as well as the similarities between Laplacian eigenmaps and Diffusion maps) to establish a representative assessment of the improvement provided by GRAE over this family of previous approaches.

We depart from previous methods by providing a more general approach. In many applications, there may not be strong reasons for imposing a particular relationship in the geometric loss that resembles a loss function from a specific kernel method. Any approach employed to find , whether it be PHATE, Isomap, t-SNE, LLE, etc., is already performing an optimization to its particular loss function, imposing the preservation of its desired geometrical relationships in the data. Thus, the network will implicitly enforce such a relationship.

Recently, topological autoencoders (TAE) moor2019topological were proposed which include a regularization based on a topological signature of the input data. The topology of a manifold can be characterized by homology groups, which, depending on their dimension, represent various topological features such as the number of disconnected components or the number of cycles on the manifold that cannot be deformed into each another. When working on data points sampled from a manifold, such topological features can be approximately derived from a -ball neighborhood graph of the data. Persistent homology edelsbrunner2008persistent ; barannikov1994framed was introduced as a means to identify the topological signature of manifolds based on how long topological features persist when progressively increasing the -ball of the neighborhood graph up to a point where the graph is fully-connected (topological features with a short lifespan are attributed to noise). TAE thus penalizes discrepancies between the topological signatures of the input space and the latent space. However, we note that the calculation of the topological signatures is computationally slow, thus limiting the ability of TAE to scale to large datasets.

Out of sample extension

Manifold learning algorithms are typically based on the eigendecomposition of a kernel matrix (Diffusion Maps) or a stochastic optimization of the latent coordinates (metric MDS, UMAP). Both approaches, in contrast to neural networks, do not extend naturally to new points, as they merely learn an embedding of the training points instead of a function mapping the input space to the latent space.

A well-known solution to this problem is the Nyström extension bengio2004out and its improvements, such as geometric harmonics coifman2006geometric , which approximate an empirical function over new points using linear combinations of the eigenvectors of a kernel matrix computed on the training set. Let be the training set used to compute the initial kernel matrix with kernel function . Then a new point can be extended to the learned latent space using the eigenvectors of with eigenvalues as follows:

One can thus project a function on the eigenvectors of and then extend to new points using the new approximated eigenvectors. However, as discussed in mishne2019diffusion , this approach has several drawbacks. Given training points (resulting in being ), extending a function to points requires us to compute new kernel rows leading to a time complexity of . Furthermore, the empirical function must be within the interpolation range of the kernel, which requires bandwidth tuning.

Other methods, such as those discussed in vladymyrov2013locally , perform OOSE by a linear combination of training points (the “landmarks”) close to the new points in the input space, as in PHATE moon2019visualizing and landmark MDS silva2003global . UMAP takes a similar approach to OOSE by initializing latent coordinates of new points in the latent space based on their affinities with training points in the input space. The new layout is then optimized by gradient descent following the same optimization scheme as the main algorithm, using the embeddings of training points as reference.

All of the aforementioned approaches require the training points, or a subset, to be stored in memory with their embeddings as a proxy for the target function, which can quickly become inconvenient given a large dataset or lead to a loss in embedding quality because of necessary subsampling or the use of landmarks. Moreover, they do not provide a straightforward approximation of the inverse function, which is critical in assessing how well the information is preserved in the embedded space in a truly unsupervised setting. As such, they are not directly comparable to GRAE and other AE based models, which present a native approximation of the inverse and only need to store the weights and biases of the network to perform OOSE, thus having memory requirements independant from the size of the training set.


The key assumption of manifold learning is that data points actually lie on or near a low-dimensional manifold in the ambient space. Non-linear dimensionality reduction techniques can leverage the geometry of this manifold to find a low-dimensional embedding which preserves the relevant signals in the data. Recent manifold learning kernel methods typically follow the framework introduced in snepaper and further extended by t-SNE maaten2008visualizing

, which are themselves generalization of the metric MDS algorithm, whereby the coordinates in the latent space are optimized by gradient descent to recreate the pairwise similarities (as defined by a kernel) in the input space. Intuitively, the use of a kernel which outputs high similarities for close neighbors enables the capture of the curvature of the underlying manifold in the ambient space. t-SNE, for instance, uses normalized Gaussian similarities in the input space and t-distributed similarities in the latent space. The embedding is optimized so as to minimize the Kullback-Leibler divergence between both distributions.

UMAP UMAP2018 was introduced as an improvement of this framework, with claims of improved preservation of global features and better run times. Specifically, the cost function of t-SNE is replaced by cross-entropy and similarities between objects in the input space is computed based on the smooth nearest neighbor distances, that is:

where is the distance between and its nearest neighbor, is the bandwidth and is a distance, not necessarily Euclidean. In contrast with t-SNE, UMAP does not normalize similarities and relies on an approximation of the neighborhoods using the Nearest-Neighbor-Descent algorithm of dong2011efficient . The UMAP implementation further distinguishes itself from t-SNE by not restricting the embedded space to two or three dimensions.

Recently, the claim that UMAP improves over t-SNE in preserving global structure has been challenged in kobak2019umap . The authors attribute the better global structure commonly obtained by UMAP to the differences between both methods in the initialization procedure. Typically t-SNE uses a random initialization, whereas UMAP uses Laplacian eigenmaps as its starting point. They showed that nearly identical results can be obtained by also initializing t-SNE with Laplacian eigenmaps.

Appendix B Generalization of geometric regularization to additional manifold learning approaches

We implemented GRAE using PHATE for our experiments in the main document, mainly motivated by the capabilities exhibited by this method in preserve global structure as empirically verified by the supervised metrics in Table 2 (in the main paper) and in moon2019visualizing . However, the extension of GRAE to other manifold learning algorithms is straightforward, only requiring the computation of the embedding target with the desired algorithm. Now, we show some of the same experiments performed in the main document with PHATE replaced with UMAP or t-SNE, since both have become popular visualization methods in recent years.

Figure 4: Latent representations from GRAE using UMAP and t-SNE as embedding reference for the bottleneck with , and on the Swiss Roll (top), Faces (middle), and Rotated Digits (bottom). Training points are grey scale and testing points are colored. The Faces dataset is colored according to the horizontal rotation axis of the head. The UMAP embeddings are reproduced from Figure 2 (in the main paper) for reference.

Visualizations for GRAE UMAP and GRAE t-SNE are provided in Figure 4. Most of the comments in Section 3.1 of the main paper are applicable here. We further observe that GRAE UMAP successfully recovers the corners of Swiss Roll and seems to have a regularization effect on both the training and test data by filling some of the holes present in the original UMAP embedding. This could be the result of GRAE (or AE in general) being more robust to fluctuations in density than UMAP. We do note however that the Faces embeddings of GRAE UMAP and GRAE t-SNE do not match the ones offered by GRAE in Figure 2 in the main paper. Moreover, GRAE t-SNE fails to return a sensible geometry for the Swiss Roll.

Preliminary metrics for GRAE UMAP and GRAE t-SNE seem to support the observations in the main paper that a good latent representation geometry is beneficial to reconstruction error. For example, embeddings from GRAE UMAP with typically result in reconstruction errors on the order of 0.004 on Swiss Roll, 0.017 on Faces, and 0.017 on Rotated Digits, which marks an improvement over both AE and UMAP (see Table 1 in the main paper).

Appendix C Technical details

Architecture and Tuning:

We use the same AE architecture for all autoencoder-based models, which consists of 3 fully-connected hidden layers in the encoder, a bottleneck of dimension 2, and 3 fully-connected hidden layers in the decoder, producing the following sequence of neurons: 800-400-200-2-200-400-800. We choose a bottleneck dimension of 2 for visualization purposes, although higher dimensions could also be chosen. We apply ReLU activations on all of the layers except in the bottleneck and the output of the decoder. To prevent over-fitting, we include an L2 penalization with a

weight-decay parameter equal to 1. We use a learning rate of .0001 and a batch size of 200 for all models and problems. MSE is used for the reconstruction error.

We use the default parameters for UMAP and PHATE (as the embedding target of GRAE) for most problems, the only exception being the Swiss Roll where setting the neighborhood parameter to 20 led to better visuals for both UMAP and PHATE. Presumably this enables the algorithms to better bridge the gap left by the removal of the test slice. For TAE, we used a regularization factor of 1000, since lower values led to embeddings indistinguishable from those of vanilla AE. Diffusion Nets is run on the provided base code, only changing the number of neurons for the neural network.

Splits and datasets:

All results were computed on a test split to benchmark how well the considered methods generalize to new data points. We generated 10,000 points on the Swiss Roll using the scikit-learn library. The manifold is then stretched and rotated, before removing a thin middle slice of 250 points for testing. We selected this approach over a uniform sampling of the points to study how various methods would behave when required to generalize to out-of-distribution data.

The Rotated Digits are generated as described in the main paper and the Faces images are used as is. On both problems, we set aside 20 % of the data for testing.

We report the average of 10 runs for all experiments and use the same splits for all runs, as we do not expect significant variance resulting from different splits. The objective was rather to show the robustness of our results against the inherently stochastic nature of training neural networks and fitting manifold learning algorithms (both PHATE and UMAP, for instance, rely on some randomness that can return slightly different results depending on the random state). We did not include an analysis of the standard deviation of the metrics in the main paper, given that the rankings were generally stable from run to run (and metric to metric, especially in Table 2 in the main paper). Still, we provide box plots for the reconstruction error in Figure 

5 to add more context to the findings of the main paper.

Figure 5:

Box plots of the reconstruction error on the Swiss Roll, Faces and Rotated Digits data sets over 10 runs. The edges of a given box mark the first and third quartile values of the data, with an additional green line indicating the median. The whiskers are set to 1.5 * interquartile range and outliers are marked as circles. Red arrows indicate models with significantly higher reconstruction error that are omitted from the visualizations to focus on competitive models.

Unsupervised evaluation metrics


This unsupervised performance metric (kaski2003trustworthiness ) measures how well the neighborhood of each observation in the latent space is faithfully representing the neighborhood in the original space. More formally, let be the rank of with respect to the ordering in terms of distance from in the high dimensional space. Let be the set of observations that belong to the neighbors of in the latent space but not in the original space. Then trustworthiness is defined as:

Trustworthiness is low if an observation is embedded as one of the neighbors of an observation in the latent space, but is not one of its neighbors in the original space.


Continuity (kaski2003trustworthiness ) measures discrepancy in the other direction. Let be the rank of with respect to in terms of distance in the low-dimensional space. And the set of observations that belong to the neighbors of in the original space but not in the latent space.

Mean relative rank error (MRRE):

Using the same ranks as continuity and trustworthiness, the MRRE (lee2008quality ) penalizes the relative difference between the ranks within the neighborhood of each observation. Let be the set of the neighbors of in the latent space.

Thus MRRE penalizes relative changes of the ranks in the neighbors of each observation.

Supervised metrics

Distance Correlation:

We first compute the pairwise Euclidean distance matrix of the model embedding and then the same matrix using the ground truth coordinates. The reported metric is the Pearson correlation between the flattened lower triangular halves of both matrices.

Mutual Information (MI):

Mutual information quantifies the dependence between two random variables. Let


be two continuous random variables,

and their respective marginal distributions and

the joint distribution. Then mutual information is defined as:

On real data, distributions of a continuous random variable need to be estimated. We rely on a scikit-learn utility to do so, itself based on entropy estimation from k-nearest neighbors distances, originally proposed in kozachenko1987sample .

Methodology for Pearson correlation, Spearman correlation and Mutual information:

All three measures are first reported on the raw embeddings of the models. Let and be the coordinate vectors of the manifold ground truth and and the coordinate vectors of an embedding. Given similarity measure (Pearson, Spearman, or MI), the reported result is:

Both the absolute and max functions allow us to find the optimal matching of the axes and further ensure that we accurately score identical reflections of the same embedding. We average both axes to further summarize the results.

The above described procedure is still arguably unfair towards embeddings that do not recover orthogonal components with the right rotation (e.g. a very good embedding with a 45 rotation could lower ). The second reported result for similarity measures aims to mitigate this by applying the Independant Component Analysis (ICA) algorithm to the embeddings before computing . Where PCA finds components with maximum variance, ICA instead seeks components maximizing statistical independence, which is more robust to rotations. Specifically, we use the Fast-ICAhyvarinen2000independent implementation of scikit-learn, which relies on non-Gaussianity of the components as a proxy for statistical independence. We can observe in Table 2 (in the main paper) that AE benefits from this preprocessing, especially on the Swiss Roll.

Figure 6: Illustration of the ‘Sections’ variant of the metrics on Faces where the manifold ground truth is partitioned (left) into two sets of 10 sections each. We then use the Pearson correlation, the Spearman correlation, and MI to benchmark the corresponding sections of an embedding (right). The reported result is the average over all sections.

The final variant (‘Sections’) of the similarity measures partitions the ground truth into 10 sections, which are based on equal width intervals over the range of . The corresponding sections of (or depending on the global optimal matching found using ) are then compared with , which should be more or less parallel to the longest ’side’ of the sections. The procedure is repeated with 10 more sections over the range of and the 20 results are averaged. This assesses the quality of the embeddings and the behavior of the metrics over longer trajectories orthogonal to both axes. A visualization of the partitions for the Faces data set is provided in Figure 6.

Hardware & software environment:

Experiments were executed using Python 3.6.9 and Torch 1.5.0 for deep learning components. We used author implementations for UMAP (0.4.3) and PHATE (1.0.4). We reused the original TAE source code for the topological soft constraint and adapted it to our AE architecture. For Diffusion nets we run their base code with Tensorflow (1.14.0). Other major utilities include scikit-learn 0.22.2 and numpy 1.18.4. We ran everything in a Google Colab environment with an Intel(R) Xeon(R) CPU @ 2.30GHz, 13 GB of available RAM, and an NVIDIA Tesla P-100 GPU.