
Topologically Regularized Data Embeddings

Unsupervised feature learning often finds low-dimensional embeddings that capture the structure of complex data. For tasks where expert prior topological knowledge is available, incorporating this knowledge into the learned representation may lead to higher-quality embeddings. For example, it may help to embed the data into a given number of clusters, or to accommodate noise that prevents one from deriving the distribution of the data over the model directly, so that this distribution can be learned more effectively. However, a general tool for integrating different prior topological knowledge into embeddings is lacking. Although differentiable topology layers have recently been developed that can (re)shape embeddings into prespecified topological models, they have two important limitations for representation learning, which we address in this paper. First, the currently suggested topological losses fail to represent simple models such as clusters and flares in a natural manner. Second, these losses neglect all original structural (such as neighborhood) information in the data that is useful for learning. We overcome these limitations by introducing a new set of topological losses, and by proposing their use for topologically regularizing data embeddings so that they naturally represent a prespecified model. We include thorough experiments on synthetic and real data that highlight the usefulness and versatility of this approach, with applications ranging from modeling high-dimensional single cell data to graph embedding.


1 Introduction


Modern data often arrives in complex forms that complicate its analysis. For example, high-dimensional data cannot be visualized directly, whereas relational data such as graphs lack the natural vectorized structure required by various machine learning models (Bhagat et al., 2011; Kazemi and Poole, 2018; Goyal and Ferrara, 2018). Representation learning aims to derive mathematically and computationally convenient representations to process and learn from such data. However, obtaining an effective representation is often challenging, for example, due to the accumulation of noise in high-dimensional biological expression data (Vandaele et al., 2021). In other examples, such as community detection in social networks, graph embeddings struggle to clearly separate communities due to the few interconnections between them. In such cases, expert prior knowledge of the topological model may improve learning from, visualizing, and interpreting the data. Unfortunately, a general tool for incorporating prior topological knowledge in representation learning is lacking.

In this paper, we introduce such a tool under the name of topological regularization. Here, we build on the recently developed differentiation frameworks for optimizing data to capture topological properties of interest (Gabrielsson et al., 2020; Solomon et al., 2021; Carriere et al., 2021). Unfortunately, such topological optimization has been poorly studied within the context of representation learning. For example, the topological losses used so far are indifferent to any structure other than topological structure, such as neighborhood information, which may be useful for learning. Therefore, topological optimization often destroys natural and informative properties of the data in favor of the topological loss.

Our proposed method of topological regularization effectively resolves this by learning an embedding representation that incorporates the topological prior. As we will see in this paper, these priors can be directly postulated through topological loss functions. For example, if the prior is that the data lies on a circular model, we design a loss function that is lower whenever a more prominent cycle is present in the embedding. By extending the previously suggested topological losses to fit a wider set of models, we show that topological regularization effectively embeds data according to a variety of topological priors, ranging from clusters, cycles, and flares, to any combination of these.

Related Work

Certain methods that incorporate topological information into representation learning have already been developed. For example, Deep Embedded Clustering (Xie et al., 2016) simultaneously learns feature representations and cluster assignments using deep neural networks. Constrained embeddings of Euclidean data on spheres have also been studied by Bai et al. (2015). However, such methods often require extensive development for one particular kind of input data and topological model. In contrast, incorporating topological optimization into representation learning provides a simple yet versatile approach to combining data embedding with topological priors, which generalizes well to any input data as long as the output is a point cloud embedding.

Topological autoencoders (Moor et al., 2020) already combine topological optimization with a data embedding procedure. The main difference here is that the topological information used for optimization is obtained from the original high-dimensional data, and not passed as a prior. While this may sound like a major advantage (and it certainly can be, as shown by Moor et al. (2020)), obtaining such topological information heavily relies on distances between observations, which are often meaningless and unstable in high dimensions (Aggarwal et al., 2001). Furthermore, certain constructions such as the $\alpha$-filtration obtained from the Delaunay triangulation, which we will use extensively and which is further discussed in Appendix A, are expensive to obtain from high-dimensional data (Cignoni et al., 1998), and are best computed from the low-dimensional embedding.


We include sufficient background on persistent homology (the main tool behind topological optimization) in Appendix A; all of its concepts that are important for this paper are summarized in Figure 1. We summarize the existing idea behind topological optimization of point clouds (Section 2.1). We also introduce a new set of losses to represent a wider variety of models in a natural manner (Section 2.2). These losses can be used to topologically regularize embeddings for which the result, but not necessarily the input, is a point cloud (Section 2). We include experiments on synthetic and real data that show the usefulness and versatility of topological regularization, and that provide additional insights into the performance of data embedding methods (Section 3). We discuss open problems in topological representation learning and conclude in Section 4.

2 Methods

The main purpose of this paper is to present a method to incorporate prior topological knowledge in a point cloud embedding $E$ (dimensionality reduction, graph embedding, …) of a data set $X$. As will become clear below, these topological priors can be directly postulated through topological loss functions $\mathcal{L}_{\mathrm{top}}$. The goal is then to find an embedding that minimizes a total loss

$$\mathcal{L}_{\mathrm{tot}}(E, X) := \mathcal{L}_{\mathrm{emb}}(E, X) + \lambda_{\mathrm{top}} \mathcal{L}_{\mathrm{top}}(E), \tag{1}$$

where $\mathcal{L}_{\mathrm{emb}}$ is a loss that aims to preserve structural attributes of the original data, and $\lambda_{\mathrm{top}} > 0$ controls the strength of topological regularization. Note that $X$ itself is not required to be a point cloud, or to reside in the same space as $E$, which is especially useful for representation learning.

In this section, we mainly focus on topological optimization of point clouds, that is, on the topological loss $\mathcal{L}_{\mathrm{top}}$. The basic idea behind this recently introduced method, as presented by Gabrielsson et al. (2020), is illustrated in Section 2.1. We also show that direct topological optimization may neglect important structural information such as neighborhoods, which can effectively be resolved through (1). Hence, as we will also see in Section 3, representation learning may benefit from topological losses for incorporating prior topological knowledge, and topological optimization itself may in turn benefit from other structural losses so as to represent the topological prior more faithfully. Nevertheless, some topological models remain difficult to represent in a natural manner through topological optimization. Therefore, we introduce a new set of topological losses, and provide an overview of how different topological models can be postulated through them in Section 2.2. Experiments with, and comparisons to, topological regularization of embeddings through (1) are presented in Section 3.
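As a rough illustration of the total loss (1), the following sketch runs plain subgradient descent on an embedding loss plus a weighted topological loss. Everything in it is an assumption made for illustration: the embedding loss is a toy squared displacement from the initial points, the topological loss is the total length of a Euclidean minimum spanning tree (which equals the total finite 0-dimensional persistence under a Vietoris-Rips convention, not the filtration used in this paper), and the names `mst_edges`, `total_loss_and_grad`, and `regularize` are hypothetical.

```python
import numpy as np

def mst_edges(points):
    """Edges (u, v) of a Euclidean minimum spanning tree, via Prim's algorithm."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()              # cheapest known link from each point to the tree
    parent = np.zeros(n, dtype=int)    # tree endpoint of that cheapest link
    edges = []
    for _ in range(n - 1):
        masked = np.where(in_tree, np.inf, best)
        nxt = int(np.argmin(masked))
        edges.append((int(parent[nxt]), nxt))
        in_tree[nxt] = True
        closer = dist[nxt] < best
        parent[closer] = nxt
        best = np.minimum(best, dist[nxt])
    return edges

def total_loss_and_grad(E, E0, lam_top):
    """Toy total loss: squared displacement from E0 plus lam_top * total MST length."""
    emb_loss = float(((E - E0) ** 2).sum())
    grad = 2.0 * (E - E0)
    top_loss = 0.0
    for u, v in mst_edges(E):
        diff = E[u] - E[v]
        length = float(np.linalg.norm(diff))
        top_loss += length
        direction = diff / max(length, 1e-12)   # subgradient of the edge length
        grad[u] += lam_top * direction
        grad[v] -= lam_top * direction
    return emb_loss + lam_top * top_loss, top_loss, grad

def regularize(E0, lam_top=0.1, lr=0.02, steps=300):
    """Plain (sub)gradient descent on the toy total loss, starting from E0."""
    E = E0.copy()
    for _ in range(steps):
        _, _, grad = total_loss_and_grad(E, E0, lam_top)
        E -= lr * grad
    return E

rng = np.random.default_rng(0)
E0 = rng.normal(size=(30, 2))
E = regularize(E0)
print(total_loss_and_grad(E0, E0, 0.1)[:2])   # (total loss, topological term) before
print(total_loss_and_grad(E, E0, 0.1)[:2])    # and after optimization
```

The embedding term keeps each point close to its starting position, so the topological term can only contract the point cloud moderately. Setting `lam_top` very large approximates topological optimization without an embedding loss.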

2.1 Background on Topological Optimization of Point Clouds

Topological optimization is performed through a topological loss function evaluated on the persistence diagram(s) of the data (Carlsson, 2009). These diagrams, obtained through a method termed persistent homology and further discussed in Appendix A, summarize all topological holes (connected components, cycles, voids, …) in the data, from the finest to the coarsest, as illustrated in Figure 1.

Figure 1: The two basic concepts from persistent homology important for our method. (a) Persistent homology quantifies topological changes in a filtration, i.e., a changing sequence of simplicial complexes ordered by inclusion and parameterized by a time parameter $\alpha$, which is constructed from a point cloud. Various topological holes are either born or die during this filtration. Here the filtration starts off with one connected component per data point (0-dimensional holes), which can only merge together (resulting in the death of such components) when additional edges are included. For larger values of $\alpha$, we observe the birth of cycles (1-dimensional holes), which are subsequently filled in (and thus die) when $\alpha$ increases even further. Eventually, one single connected component persists indefinitely. (b) The results from persistent homology are commonly visualized through a persistence diagram. Here, a tuple $(b, d)$ marks a topological hole, in this case a connected component (H0) or a cycle (H1), that is born at time $b$ and that dies at (possibly infinite) time $d$ in a filtration.

While methods that learn from persistent homology are now both well developed and diverse (Pun et al., 2018), optimizing the data representation for its persistent homology has only recently gained attention (Gabrielsson et al., 2020; Solomon et al., 2021; Carriere et al., 2021). Persistent homology has a rather abstract mathematical foundation within the field of algebraic topology (Hatcher, 2002), and its computation is inherently combinatorial (Zomorodian and Carlsson, 2005). This complicates working with ordinary derivatives for optimization. To accommodate this, topological optimization makes use of Clarke subderivatives (Clarke, 1990), whose applicability to persistence builds on arguments from o-minimal geometry (van den Dries, 1998; Carriere et al., 2021). Fortunately, thanks to the recent work of Gabrielsson et al. (2020) and Carriere et al. (2021), powerful tools for topological optimization have been developed for software libraries such as PyTorch and TensorFlow, allowing their application without deeper knowledge of these mathematical subjects.

Mathematically, topological optimization optimizes the data representation with respect to the topological information summarized by its persistence diagram(s) $\mathcal{D}$. We follow the approach of Gabrielsson et al. (2020), where all (birth, death) tuples $(b_k, d_k)$ in $\mathcal{D}$ are first ordered by decreasing persistence $d_k - b_k$. The points $(b_k, \infty)$ (these are usually plotted at the top of the diagram, as in Figure 1(b)) form the essential part of $\mathcal{D}$. The points with finite coordinates form the regular part of $\mathcal{D}$. For indices $i \leq j$ and real-valued functions $g$ and $h$, we can now define a topological loss function

$$\mathcal{L}_{\mathrm{top}}(\mathcal{D}) := \sum_{k=i,\; d_k < \infty}^{j} g(b_k, d_k) + \sum_{d_k = \infty} h(b_k). \tag{2}$$

It turns out that for many useful definitions of $g$ and $h$, $\mathcal{L}_{\mathrm{top}}$ has a well-defined Clarke subdifferential with respect to the parameters defining the filtration from which the persistence diagram is obtained. In this paper, we consistently use the $\alpha$-filtration as shown in Figure 1(a) (see Appendix A for its formal definition), and these parameters are entire point clouds of size $n$ in $d$-dimensional Euclidean space. $\mathcal{L}_{\mathrm{top}}$ can then be optimized with respect to these parameters through standard stochastic subgradient algorithms (Carriere et al., 2021).

Throughout this paper, we only use the regular part of the diagram (this coincides with letting $h \equiv 0$), and let $g$ be proportional to the persistence function $(b, d) \mapsto d - b$. Because the points are ordered by persistence, $\mathcal{L}_{\mathrm{top}}$ is a function of persistence on $\mathcal{D}$, i.e., it is invariant to permutations of the points in $\mathcal{D}$ (Carriere et al., 2021). The factor of proportionality $\mu \in \{1, -1\}$ indicates whether we want to minimize ($\mu = 1$) or maximize ($\mu = -1$) persistence, i.e., the prominence of the topological hole, and thus how well clusters, cycles, …, are (not) represented. The topological loss function in (2) then reduces to

$$\mathcal{L}_{\mathrm{top}}(E) := \mathcal{L}_{\mathrm{top}}(\mathcal{D}) = \mu \sum_{k=i,\; d_k < \infty}^{j} (d_k - b_k). \tag{3}$$

Here, the data matrix $E$ (in this paper the embedding) defines the diagram $\mathcal{D}$ through persistent homology of the $\alpha$-filtration of $E$, together with a persistence (topological hole) dimension to optimize for.

For example, consider (3) with the sum taken over all points with finite death time and the factor of proportionality chosen to minimize persistence, restricted to 0-dimensional persistence (measuring the prominence of connected components) of the $\alpha$-filtration. Figure 2 shows the data from Figure 1 optimized for this loss function for various numbers of epochs. The optimized point cloud quickly resembles a single connected component for smaller numbers of epochs. This is the sole goal of the loss (3), which neglects all other structural properties of the data such as its underlying cycles (e.g., the circular hole in the ‘R’) or local neighborhoods. Larger numbers of epochs mainly affect the scale of the data. While this scale has an absolute effect on the total persistence, the point cloud visually represents a single connected topological component equally well. We also observe that while local neighborhoods are preserved well during the first epochs, simply by nature of the topological optimization procedure, they are increasingly distorted for larger numbers of epochs.
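To make the role of the ordering and of the summation range in a loss like (3) concrete, here is a minimal sketch restricted to 0-dimensional persistence. It assumes a Vietoris-Rips style convention in which every component is born at 0 and dies at the length of the minimum spanning tree edge that merges it; under the $\alpha$-filtration used in this paper the values would differ by a convention-dependent rescaling. The helper names `h0_persistence` and `topological_loss`, and the 1-indexed arguments `i`, `j`, `mu`, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def h0_persistence(points):
    """0-dimensional persistence values in decreasing order (births are all 0)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()
    deaths = []
    for _ in range(n - 1):                       # Prim's algorithm
        masked = np.where(in_tree, np.inf, best)
        nxt = int(np.argmin(masked))
        deaths.append(float(masked[nxt]))        # a component dies at this length
        in_tree[nxt] = True
        best = np.minimum(best, dist[nxt])
    return [np.inf] + sorted(deaths, reverse=True)   # one component never dies

def topological_loss(persistence, i, j, mu):
    """mu times the summed finite persistence of points i..j (1-indexed, inclusive)."""
    selected = persistence[i - 1:j]
    return mu * sum(p for p in selected if np.isfinite(p))

two_pairs = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
pers = h0_persistence(two_pairs)
print(pers)
print(topological_loss(pers, 1, len(pers), 1))   # total finite persistence
print(topological_loss(pers, 2, 2, -1))          # minus the gap between the two pairs
```

Minimizing the first loss pulls all points into one component, whereas minimizing the second enlarges the single largest gap. Rescaling the point cloud rescales both values by the same factor, which matches the observation that later epochs mainly change the scale of the data.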

2.2 Newly Proposed Topological Loss Functions

Figure 2: The data set in Figure 1(a), optimized to have a low total 0-dimensional persistence. Points are colored according to their initial grouping along one of the four letters in the ‘ICLR’ acronym.

In this paper, the prior topological knowledge incorporated into the point cloud embedding $E$ is directly postulated through a topological loss function. For example, letting $\mathcal{D}$ be the 0-dimensional persistence diagram of $E$, and choosing (3) to minimize the total finite persistence, corresponds to the criterion that $E$ should represent one closely connected component, as illustrated in Figure 2. Therefore, we often regard a topological loss as a topological prior, and vice versa.

Unfortunately, although persistent homology effectively measures the prominence of topological holes, topological optimization is often ineffective for representing such holes in a natural manner. Clusters are an extreme example of this, despite the fact that they are captured through the simplest form of persistence, i.e., 0-dimensional persistence. This is shown in Figure 3, where we sampled data from two Gaussian distributions centered at different means in $\mathbb{R}^2$ (Figure 3(a)). Optimizing the point cloud for (at least) two clusters can be done by defining the loss as in (3) on the 0-dimensional persistence diagram of the point cloud, restricting the sum to the second most persistent point (the most persistent component never dies) and maximizing its persistence, i.e., $i = j = 2$ and $\mu = -1$. However, we observe that topological optimization simply displaces one single point away from all other points (Figure 3(b)). Note that, purely topologically, this is indeed a correct representation of two clusters.

To encourage more natural holes, we propose to conduct the topological optimization for the loss

$$\widetilde{\mathcal{L}}_{\mathrm{top}}(E) := \mathbb{E}\left[\mathcal{L}_{\mathrm{top}}\left(\{x \in S\}\right)\right], \quad S \text{ a random sample of } E \text{ with sampling fraction } f_S, \tag{4}$$

where $\mathcal{L}_{\mathrm{top}}$ is defined as in (3). In practice, during each optimization iteration, $\widetilde{\mathcal{L}}_{\mathrm{top}}$ is approximated by the mean of $\mathcal{L}_{\mathrm{top}}$ evaluated over $n_S$ random samples of $E$. The idea behind this approach is that a topological model that is naturally present in the data should be represented well by many subsets of the data. Figure 3 shows the result for a fixed sampling fraction $f_S$ and number of samples $n_S$. The new data representation already visualizes the clusters well, and far more naturally. An added benefit of the new loss (4) is that topological optimization can be conducted significantly faster for reasonably low $f_S$, as the $\alpha$-filtration and persistent homology are evaluated on smaller samples.

(a) Initial point cloud.
(b) Ordinary top. optimization.
(c) Optimization with (4).
Figure 3: A point cloud sampled from two ground truth clusters (labeled by color) topologically optimized without and with sampling. The optimization with sampling results in more natural clusters.
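The effect of sampling can be sketched numerically. The example below is an illustration under stated assumptions, not the authors' code: the two-cluster loss is taken to be minus the longest edge of a Euclidean minimum spanning tree (the largest finite 0-dimensional death time under a Vietoris-Rips convention), and `sampled_loss` is a hypothetical Monte Carlo estimate of the expected loss over random subsets.

```python
import numpy as np

def longest_mst_edge(points):
    """Longest edge of a Euclidean MST: the largest finite 0-dimensional death time."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()
    longest = 0.0
    for _ in range(n - 1):                   # Prim's algorithm
        masked = np.where(in_tree, np.inf, best)
        nxt = int(np.argmin(masked))
        longest = max(longest, float(masked[nxt]))
        in_tree[nxt] = True
        best = np.minimum(best, dist[nxt])
    return longest

def two_cluster_loss(points):
    """Negative persistence of the second most persistent component."""
    return -longest_mst_edge(points)

def sampled_loss(points, loss_fn, fraction, n_samples, rng):
    """Mean of loss_fn over n_samples random subsets of the given fraction."""
    size = max(2, int(round(fraction * len(points))))
    values = []
    for _ in range(n_samples):
        idx = rng.choice(len(points), size=size, replace=False)
        values.append(loss_fn(points[idx]))
    return float(np.mean(values))

rng = np.random.default_rng(0)
blob = rng.normal(size=(200, 2))
with_outlier = np.vstack([blob, [[25.0, 0.0]]])    # one displaced point

print(two_cluster_loss(with_outlier))              # dominated by the single outlier
print(sampled_loss(with_outlier, two_cluster_loss, 0.1, 50, rng))
```

A single displaced point fully determines the unsampled loss, but it is absent from most random subsets, so it contributes far less to the sampled estimate. A genuine second cluster, by contrast, would be present in almost every subset.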

In summary, various topological priors can now be formulated through topological losses as follows.

$p$-dimensional holes

Optimizing for $p$-dimensional holes ($p = 0$ for clusters) can generally be done through (3) or (4), by letting $\mathcal{D}$ be the corresponding $p$-dimensional persistence diagram. The summation bounds $i$ and $j$ express how many holes one wants exactly, at least, or at most. Finally, $\mu$ can be chosen to either decrease ($\mu = 1$) or increase ($\mu = -1$) persistence.


Persistent homology is invariant to certain topological changes. For example, both a linear ‘I’-structured model and a bifurcating ‘Y’-structured model consist of one connected component, and no higher-dimensional holes. These models are indistinguishable based on the (persistent) homology thereof, even though they are topologically different in terms of their singular points.

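Capturing such additional topological phenomena is possible through a refinement of persistent homology known as functional persistence, which is also discussed and illustrated by Carlsson (2014). The idea is that instead of evaluating persistent homology on a data matrix $X$, we evaluate it on a subset $\{x \in X : f(x) \leq \tau\}$ for a well-chosen function $f : X \to \mathbb{R}$ and hyperparameter $\tau \in \mathbb{R}$.

Inspired by this approach, for a diagram $\mathcal{D}$ of a point cloud $E$, we propose the topological loss

$$\widetilde{\mathcal{L}}_{\mathrm{top}}(E, f_E, \tau) := \mathcal{L}_{\mathrm{top}}\left(\{x \in E : f_E(x) \leq \tau\}\right), \tag{5}$$

where $f_E$ is a real-valued function on $E$ that possibly depends on $E$ itself (which changes during optimization), $\tau$ is a hyperparameter, and $\mathcal{L}_{\mathrm{top}}$ is an ordinary topological loss as defined by (3). In particular, we focus on the case where $f_E$ equals a scaled centrality measure $\mathcal{E}_E$ on $E$:

$$\mathcal{E}_E(x) := 1 - \frac{\lVert x - \bar{E} \rVert}{\max_{y \in E} \lVert y - \bar{E} \rVert}, \qquad \bar{E} := \frac{1}{|E|} \sum_{y \in E} y. \tag{6}$$

For $\tau \geq 1$, $\widetilde{\mathcal{L}}_{\mathrm{top}}(E, \mathcal{E}_E, \tau) = \mathcal{L}_{\mathrm{top}}(E)$. For sufficiently small $\tau$, $\widetilde{\mathcal{L}}_{\mathrm{top}}$ evaluates $\mathcal{L}_{\mathrm{top}}$ on the points ‘far away’ from the center of $E$. As we will see in the experiments below, this is especially useful in conjunction with 0-dimensional persistence to optimize for flares in the point cloud representation.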
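The thresholding step behind this loss can be sketched as follows. The code assumes one particular scaled centrality, namely one minus the distance to the mean of the point cloud divided by the largest such distance; this choice, and the function names `scaled_centrality` and `far_from_center`, are assumptions made for illustration.

```python
import numpy as np

def scaled_centrality(points):
    """1 at the mean of the point cloud, 0 at the point farthest from it.

    Assumes the points are not all identical.
    """
    center = points.mean(axis=0)
    radii = np.linalg.norm(points - center, axis=1)
    return 1.0 - radii / radii.max()

def far_from_center(points, tau):
    """Points whose scaled centrality is at most tau."""
    return points[scaled_centrality(points) <= tau]

# Three arms of a 'Y' meeting at the origin.
t = np.linspace(0.0, 1.0, 21)[:, None]
angles = np.deg2rad([90.0, 210.0, 330.0])
arms = [t * np.array([np.cos(a), np.sin(a)]) for a in angles]
points = np.vstack(arms)

tips = far_from_center(points, tau=0.22)
print(len(points), len(tips))
```

With a threshold of 1 every point is kept, so the ordinary loss is recovered. With a small threshold only the three arm tips remain, and they form three well-separated groups, which is exactly what a 0-dimensional loss on the thresholded points can then encourage.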


Naturally, through linear combination of loss functions, different topological priors can be combined, e.g., if we want the represented model to both be connected and include a cycle.

3 Experiments

In this section, we show how our proposed topological regularization of data embeddings (1) leads to a powerful and versatile approach for representation learning. In particular, we show that

  • embeddings benefit from prior topological knowledge through topological regularization;

  • conversely, topological optimization may also benefit from incorporating structural information as captured through embedding losses, leading to higher-quality representations;

  • subsequent learning tasks may benefit from expert prior topological knowledge.

In Section 3.1, we show how topological regularization improves standard PCA dimensionality reduction and allows better understanding of its performance when noise is accumulated over many dimensions. In Section 3.2, we present applications to high-dimensional single cell trajectory data and graph embedding. Quantitative results are discussed in Section 3.3.

Topological optimization was performed in PyTorch, using code adapted from Gabrielsson et al. (2020). Appendix B discusses a supplementary graph embedding experiment where we embed the Harry Potter network according to a circular prior. Data sizes, hyperparameters, losses, and optimization times are summarized in Tables 1 & 2. All code for this project is available on

3.1 Synthetic Data

We sampled points uniformly from the unit circle in $\mathbb{R}^2$. We then added 500-dimensional noise to the resulting data matrix $X$, where the noise in each dimension is sampled uniformly from a fixed interval. Since the additional noisy features are irrelevant to the topological (circular) model, an ideal projection embedding of $X$ is its restriction to its first two data coordinates (Figure 4(a)).

(a) First two data coordinates.
(b) Ordinary PCA embedding.
(c) Top. optimized projection.
(d) Top. regularized embedding.
Figure 4: Various representations of the 500-dimensional synthetic data $X$. The coloring represents the positioning of points without noise.
Figure 5: Feature importance densities of the 498 irrelevant features in the PCA embedding (blue), the top. optimized PCA embedding (orange), and the top. regularized PCA embedding (green).

However, it is probabilistically unlikely that the irrelevant features will have zero contribution to a PCA embedding of the data (Figure 4(b)). Measuring the importance of each feature as the sum of its two absolute contributions (the loadings) to the projection, we observe that most of the 498 irrelevant features have a small nonzero effect on the PCA embedding (Figure 5). Intuitively, each added feature slightly shifts the projection plane away from the plane spanned by the first two coordinates. As a result, the circular hole is less prominent in the PCA embedding of the data.
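The feature importance measure described above can be sketched with a plain SVD-based PCA. The number of points (50) and the noise range (±0.25) below are assumptions chosen for illustration, since the exact values are not stated here, and `pca_loadings` and `feature_importance` are hypothetical helper names.

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """Top principal directions of X as a (features x components) matrix."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_components].T

def feature_importance(W):
    """Sum of the absolute loadings of each feature over the embedding axes."""
    return np.abs(W).sum(axis=1)

rng = np.random.default_rng(0)
n_points, n_features = 50, 500                 # assumed sample size; 500 features
theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
X = np.zeros((n_points, n_features))
X[:, 0], X[:, 1] = np.cos(theta), np.sin(theta)
X += rng.uniform(-0.25, 0.25, size=X.shape)    # assumed noise range

W = pca_loadings(X)
importance = feature_importance(W)
print(importance[:2])                # the two circle coordinates
print(np.median(importance[2:]))     # typical importance of an irrelevant feature
```

In an ideal projection the last 498 importances would all be zero. Here each is small but nonzero, and together they tilt the projection plane away from the first two coordinates.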

We can regularize this embedding using a topological loss function that measures the persistence of the most prominent 1-dimensional hole in the embedding (that is, only the most persistent point is used in (3)). For a simple PyTorch-compatible implementation, we used an embedding loss that minimizes the reconstruction error between $X$ and its linear projection obtained through a projection matrix $W$. To this, we added a loss term that encourages orthonormality of the matrix $W$ to be optimized, which is initialized with the PCA loadings. The resulting embedding is shown in Figure 4(d), and better captures the circular hole. Furthermore, we see that irrelevant features now more often contribute less to the embedding according to $W$ (Figure 5).

For comparison, Figure 4(c) shows the result of optimizing the topological loss without accounting for the reconstruction loss (the orthonormality term is still included). From this, and also from Figure 5, we observe that the projection struggles more to converge to the correct one, resulting in a slightly less prominent hole.

3.2 Real Data

Circular Cell Trajectory Data

(a) Ordinary PCA embedding.
(b) Top. optimized projection.
(c) Top. regularized embedding.
Figure 6: Various representations of the cyclic cell data. Colors represent the cell grouping.
(a) Ordinary UMAP embedding.
(b) Top. optimized embedding.
(c) Top. regularized embedding.
Figure 7: Various representations of the bifurcating cell data. Colors represent the cell grouping.

We considered a single cell trajectory data set of 264 cells in a 6812-dimensional gene expression space (Cannoodt et al., 2018; Saelens et al., 2019). The ground truth model—which can be considered a snapshot of the cells at a fixed time—is a circular model connecting three distinct cell groups through cell differentiation. It has been shown by Vandaele (2020) that real single cell data with such models are difficult to embed in a circular manner.

To explore this, we repeated the experiment with the same losses as in Section 3.1 on this data, where the (expected) topological loss is now modified through (4) with a sampling fraction of 0.25 and a single sample per iteration (Table 2). From Figure 6(a), we see that while the ordinary PCA embedding does somewhat respect the positioning of the cell groups (marked by their color), it indeed struggles to embed the data in a manner that visualizes the circular hole. However, as shown in Figure 6(c), by topologically regularizing the embedding we are able to embed the data in a much more circular manner.

Figure 6(b) shows the result of optimizing the topological loss without the reconstruction loss. The embedding is similar to the one in Figure 6(c), with the pink colored cell group slightly more dispersed.

Bifurcating Cell Trajectory Data

We considered a second cell trajectory data set of 154 cells in a 1770-dimensional expression space (Cannoodt et al., 2018). The ground truth here is a bifurcating model connecting four different cell groups through cell differentiation. This time, however, we used the UMAP loss for the embeddings. We used a topological loss composed of two terms, $\mathcal{L}_{\mathrm{conn}}$ and $\mathcal{L}_{\mathrm{flare}}$. Here, $\mathcal{L}_{\mathrm{conn}}$ measures the total (sum of) finite 0-dimensional persistence in the embedding to encourage connectedness of the representation, and $\mathcal{L}_{\mathrm{flare}}$ is as in (5), measuring the persistence of the third most prominent 0-dimensional hole among the points whose scaled centrality (6) is at most a threshold $\tau$. Thus, $\mathcal{L}_{\mathrm{flare}}$ is used to optimize for a ‘flare’ with (at least) three clusters away from the embedding mean. We observe that while the ordinary UMAP embedding is more ‘blobby’ (Figure 7(a)), the topologically regularized embedding is more constrained towards a connected bifurcating shape (Figure 7(c)).

For comparison, we conducted topological optimization for the same topological loss on the initialized UMAP embedding, without the UMAP embedding loss. The resulting embedding is now more fragmented (Figure 7(b)). We thus see that topological optimization may also benefit from the embedding loss.

Graph Embedding

The topological loss in (1) can be evaluated on any embedding, and does not require a point cloud as original input. We can thus use topological regularization for embedding a graph $G$, to learn a representation of the nodes of $G$ in Euclidean space that respects properties of $G$ well.

To explore this, we considered the Karate network (Zachary, 1977), a well-known and well-studied network within graph mining that consists of two different communities. The communities are represented by two key figures (John A. and Mr. Hi), as shown in Figure 8(a). To embed the graph, we used a DeepWalk variant adapted from Dagar et al. (2020). While the ordinary DeepWalk embedding (Figure 8(b)) respects the ordering of points according to their communities well, the two communities remain close to each other. We thus regularized this embedding using the topological loss as defined by (4), where the underlying loss measures the persistence of the second most prominent 0-dimensional hole, with a sampling fraction of 0.25 and 10 samples per iteration (Table 2). The resulting embedding (Figure 8(d)) now nearly perfectly separates the two ground truth communities present in the graph.

Topological optimization of the initialized DeepWalk embedding with the same topological loss but without the DeepWalk loss creates some natural community structure, but also results in a few outliers (Figure 8(c)). Thus, although our introduced loss (4) enables more natural topological modeling to some extent, we again observe that using it in conjunction with embedding losses, i.e., our proposed method of topological regularization, leads to the best qualitative results.

(a) The Karate network.
(b) Ordinary DeepWalk embedding.
(c) Top. optimized embedding.
(d) Top. regularized embedding.
Figure 8: The Karate network and several of its embeddings.
Table 1: Summary of the data, hyperparameters, and optimization times.
data | size | method | lr | epochs | $\lambda_{\mathrm{top}}$ | time w/o top. | time with top.
Synthetic Cycle | | PCA | 1e-1 | 500 | 1e1 | 1s | 5s
Cell Cycle | 264 × 6812 | PCA | 5e-4 | 1000 | 1e2 | 1s | 35s
Cell Bifurcating | 154 × 1770 | UMAP | 1e-1 | 100 | 1e1 | 1s | 8s
Karate | | DeepWalk | 1e-2 | 50 | 5e1 | 29s | 29s
Harry Potter | | InnerProd | 1e-1 | 100 | 1e-1 | 36s | 34s

Table 2: Summary of the topological losses computed from persistence diagrams with points $(b_k, d_k)$ ordered by persistence $d_k - b_k$. Note that for 0-dimensional homology diagrams $d_1 = \infty$.
data | top. loss function | dimension of hole | $f_S$ | $n_S$
Synthetic Cycle | | 1 | N/A | N/A
Cell Cycle | | 1 | 0.25 | 1
Cell Bifurcating | | 0 - 0 | N/A | N/A
Karate | | 0 | 0.25 | 10
Harry Potter | | 1 | N/A | N/A

3.3 Quantitative Evaluation

Table 3 summarizes the embedding and topological losses we obtained for the ordinary embeddings, the topologically optimized embeddings (initialized with the ordinary embeddings, but not using the embedding loss), and the topologically regularized embeddings. As one would expect, topological regularization yields embedding losses that lie between those of the ordinary and the topologically optimized embeddings. More interestingly, topological regularization may actually result in a lower topological loss than topological optimization alone, here in particular for the synthetic cycle data and the Harry Potter graph. This suggests that combining topological information with other structural information may facilitate convergence to the correct embedding model, as we also qualitatively confirmed for these data sets (see also Appendix B). We also observe larger differences in the obtained topological losses than in the embedding losses with and without regularization. This suggests that the optimum region for the embedding loss may be somewhat flat with respect to the corresponding region for the topological loss. Thus, slight shifts in the local embedding optimum, e.g., as caused by noise, may result in much worse topological embedding models, which can be resolved through topological regularization.

Table 3: Embedding/reconstruction and topological losses of the final embeddings. Lowest in bold.
data | embedding loss (ord. emb., top. opt., top. reg.) | topological loss (ord. emb., top. opt., top. reg.)
Synthetic Cycle
Cell Cycle
Cell Bifurcating
Karate N/A
Harry Potter

Table 4: Embedding performance evaluations for label prediction. Highest in bold.
data | metric | ord. emb. | top. opt. | top. reg.
Synthetic Cycle | $R^2$
Cell Cycle | accuracy
Cell Bifurcating | accuracy
Karate | accuracy

We also evaluated the quality of the embedding visualizations presented in this section, by assessing how informative they are for predicting the ground truth labels. For the Synthetic Cycle data, these labels are the 2D coordinates of the noise-free data on the unit circle in $\mathbb{R}^2$, and we used a multi-output support vector regressor model. For the cell trajectory data and the Karate network, we used the ground truth cell groupings and community assignments, respectively, and a support vector machine model. All points in the 2D embeddings were then split into 90% for training and 10% for testing. Subsequently, we used 5-fold cross-validation on the training data to tune the regularization hyperparameter $C$. All other settings were the scikit-learn defaults. The performance of the final tuned and trained model was then evaluated on the test data, through the $R^2$ coefficient of determination for the regression problem, and the accuracy for all classification problems. Finally, we repeated this entire experiment 100 times. The averaged test performance metrics and their standard deviations are summarized in Table 4. From this, we observe that topological regularization consistently leads to more informative visualization embeddings.

4 Discussion and Conclusion

We proposed a new approach for representation learning under the name of topological regularization, which builds on the recently developed differentiation frameworks for topological optimization. This led to a versatile and effective way for embedding data according to expert prior topological knowledge, directly postulated through (some newly introduced) topological loss functions.

A clear limitation of topological regularization is that expert prior topological knowledge is not always available. How to select the best out of a list of topological priors is thus open to further research. Furthermore, designing topological loss functions currently requires some understanding of persistent homology, and it may be useful to study how to facilitate that design process for lay users. From a foundational perspective, our work provides new research opportunities into extending the developed theory for topological optimization (Carriere et al., 2021) to our newly introduced losses and their integration into data embeddings. Finally, topological optimization based on combinatorial structures other than the $\alpha$-complex may be of both theoretical and practical interest. For example, point cloud optimization based on graph-approximations such as the minimum spanning tree (Vandaele et al., 2021), or varying the functional threshold in the loss (5) alongside the filtration time (Chazal et al., 2009), may lead to new topological loss functions with fewer hyperparameters.

Nevertheless, through our approach, we already provided new and important insights into the performance of embedding methods, such as their potential inability to converge to the correct topological model due to the flatness of the embedding loss near its (local) optimum with respect to the topological loss. Furthermore, we quantitatively showed that including prior topological knowledge provides a promising way to improve subsequent learning tasks, even non-topological ones. In conclusion, topological regularization enables both improving and better understanding representation learning methods, for which we provided and thoroughly illustrated the first directions in this paper.


  • C. C. Aggarwal, A. Hinneburg, and D. A. Keim (2001) On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory, pp. 420–434. Cited by: §1.
  • S. Bai, H. Qi, and N. Xiu (2015) Constrained best euclidean distance embedding on a sphere: a matrix optimization approach. SIAM Journal on Optimization 25 (1), pp. 439–467. Cited by: §1.
  • S. Bhagat, G. Cormode, and S. Muthukrishnan (2011) Node classification in social networks. In Social network data analytics, pp. 115–148. Cited by: §1.
  • R. Cannoodt, W. Saelens, H. Todorov, and Y. Saeys (2018) Single-cell -omics datasets containing a trajectory. Zenodo. Cited by: §3.2, §3.2.
  • G. Carlsson (2009) Topology and data. Bulletin of the American Mathematical Society 46 (2), pp. 255–308. External Links: ISSN 0273-0979 Cited by: §2.1.
  • G. Carlsson (2014) Topological pattern recognition for point cloud data. Acta Numerica 23, pp. 289–368. Cited by: §2.2.
  • M. Carriere, F. Chazal, M. Glisse, Y. Ike, H. Kannan, and Y. Umeda (2021) Optimizing persistent homology based functions. In International Conference on Machine Learning, pp. 1294–1303. Cited by: §1, §2.1, §2.1, §2.1, §4.
  • F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Mémoli, and S. Y. Oudot (2009) Gromov-hausdorff stable signatures for shapes using persistence. In Computer Graphics Forum, Vol. 28, pp. 1393–1403. Cited by: §4.
  • P. Cignoni, C. Montani, and R. Scopigno (1998) DeWall: a fast divide and conquer Delaunay triangulation algorithm in $E^d$. Computer-Aided Design 30 (5), pp. 333–341. Cited by: §1.
  • F. H. Clarke (1990) Optimization and nonsmooth analysis. SIAM. Cited by: §2.1.
  • A. Dagar, A. Pant, S. Gupta, and S. Chandel (2020) graph_nets. GitHub. Cited by: §3.2.
  • R. B. Gabrielsson, B. J. Nelson, A. Dwaraknath, and P. Skraba (2020) A topology layer for machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 1553–1563. Cited by: §1, §2.1, §2.1, §2, §3.
  • P. Goyal and E. Ferrara (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94. Cited by: §1.
  • A. Hatcher (2002) Algebraic topology. Cambridge University Press. External Links: ISBN 0521795400 Cited by: §2.1.
  • S. M. Kazemi and D. Poole (2018) Simple embedding for link prediction in knowledge graphs. In Advances in neural information processing systems, pp. 4284–4295. Cited by: §1.
  • M. Moor, M. Horn, B. Rieck, and K. Borgwardt (2020) Topological autoencoders. In International conference on machine learning, pp. 7045–7054. Cited by: §1.
  • N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington (2017) A roadmap for the computation of persistent homology. EPJ Data Science 6 (1), pp. 17. External Links: ISSN 2193-1127 Cited by: Appendix A.
  • C. S. Pun, K. Xia, and S. X. Lee (2018) Persistent-homology-based machine learning and its applications–a survey. arXiv preprint arXiv:1811.00252. Cited by: §2.1.
  • S. Rendle, W. Krichene, L. Zhang, and J. Anderson (2020) Neural collaborative filtering vs. matrix factorization revisited. In Fourteenth ACM Conference on Recommender Systems, pp. 240–248. Cited by: Appendix B.
  • W. Saelens, R. Cannoodt, H. Todorov, and Y. Saeys (2019) A comparison of single-cell trajectory inference methods. Nature Biotechnology 37, pp. 1. Cited by: §3.2.
  • Y. Solomon, A. Wagner, and P. Bendich (2021) A fast and robust method for global topological functional optimization. In International Conference on Artificial Intelligence and Statistics, pp. 109–117. Cited by: §1, §2.1.
  • The GUDHI Project (2021) GUDHI user and reference manual. 3.4.1 edition, GUDHI Editorial Board. Cited by: Appendix A.
  • L. van den Dries (1998) Tame topology and o-minimal structures. Vol. 248, Cambridge university press. Cited by: §2.1.
  • R. Vandaele, B. Rieck, Y. Saeys, and T. De Bie (2021) Stable topological signatures for metric trees through graph approximations. Pattern Recognition Letters 147, pp. 85–92. Cited by: §1, §4.
  • R. Vandaele, Y. Saeys, and T. D. Bie (2020) Mining topological structure in graphs through forest representations. Journal of Machine Learning Research 21 (215), pp. 1–68. Cited by: Appendix B, Appendix B, Appendix B.
  • R. Vandaele (2020) Topological data analysis of metric graphs for evaluating cell trajectory data representations. Master’s Thesis, Universiteit Gent. Faculteit Wetenschappen, Ghent University. Cited by: §3.2.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §1.
  • W.W. Zachary (1977) An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33, pp. 452–473. Cited by: §3.2.
  • A. Zomorodian and G. Carlsson (2005) Computing Persistent Homology. Discrete & Computational Geometry 33 (2), pp. 249–274. External Links: ISSN 0179-5376 Cited by: §2.1.

Appendix A Introduction to Persistent Homology

Persistent homology quantifies the change in topological holes (connected components, loops, voids, …) across a filtration, which is an ordered sequence of simplicial complexes $K_0 \subseteq K_1 \subseteq \ldots \subseteq K_n = K$, all subcomplexes of an initial complex $K$. A simplicial complex can be seen as a generalization of a graph that, apart from nodes (0-simplices) and edges (1-simplices), also includes higher-dimensional simplices such as triangles (2-simplices), tetrahedra (3-simplices), …, with the added constraint that if $K$ contains a simplex $\sigma$, every simplex $\tau \subseteq \sigma$ must also be contained in $K$. A simplex $\sigma$ is commonly written as the set of its included vertices, $\sigma = \{v_0, \ldots, v_k\}$, and its dimension is by definition $k$.

An example filtration is shown in Figure 1(a) in the main paper. Here, the initial complex $K$ is the Delaunay triangulation of a point cloud data set $X \subset \mathbb{R}^d$ (here $d = 2$). This triangulation, i.e., simplicial complex, is a subdivision of the convex hull of $X$ into simplices such that any two simplices intersect in a common face of both, or not at all, such that the set of vertices of the simplices is contained in $X$, and such that no point in $X$ is inside the circum(hyper)sphere of any $d$-simplex. Note that this complex is also shown in Figure 1(a) in the main paper (at time $\alpha = \infty$).

The filtration constructed from the initial complex $K$ in Figure 1(a) is the $\alpha$-filtration. Here, every simplex $\sigma$ in $K$ is assigned a filtration value. This value equals the square of the circumradius of $\sigma$ if its circumsphere contains no other vertices than those in $\sigma$, in which case $\sigma$ is said to be Gabriel. Otherwise, it equals the minimum of the filtration values of the $(\dim(\sigma) + 1)$-simplices containing $\sigma$ that make it not Gabriel. At time $\alpha$, the complex $K_\alpha$ in the $\alpha$-filtration includes all simplices with filtration value at most $\alpha$. Although not required to understand the basic ideas presented in the main paper, for a good overview of how the $\alpha$-filtration is constructed, we refer the interested reader to The GUDHI Project (2021).

What is most important is that the $\alpha$-filtration constructed from a point cloud $X$ is well able to capture topological properties of the underlying model of $X$. For example, in Figure 1(a), we see that at some time in the filtration, the simplicial complex includes four connected components, one for each of the letters ‘I’, ‘C’, ‘L’, and ‘R’. We also see that at some time, the complex captures the cycle in the letter ‘R’, and later, it captures the larger cycle composed by the letters ‘C’ and ‘L’. These correspond to topological holes in the underlying model of $X$. A 0-dimensional hole is a gap between components, a 1-dimensional hole is a cycle or a loop, a 2-dimensional hole is a void, and in general, an $n$-dimensional hole can be regarded as the inside of an $n$-sphere. Here, true topological holes, i.e., those of the underlying model, tend to persist longer in the $\alpha$-filtration.

Persistent homology tracks and quantifies these topological holes, and the results are commonly visualized by means of a persistence diagram. A persistence diagram contains a tuple $(b_k, d_k)$ for each topological hole $k$ of a fixed dimension that is born at time $b_k$ and dies at time $d_k$ in a filtration. Persistence diagrams for different dimensions of holes in the same data are usually plotted on top of each other, as in Figure 1(b) in the main paper. Holes that persist longer correspond to more elevated points in the diagram, and capture more prominent topological properties of the underlying model. Tuples for which $d_k = \infty$, which occur when a hole never dies in the filtration (e.g., from some point on, the complex in the $\alpha$-filtration will always remain connected), are usually plotted on top of the diagram.
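To make the birth and death tuples concrete, the following minimal sketch computes the 0-dimensional persistence diagram (connected components) of a small point cloud. For simplicity it uses a Vietoris-Rips filtration rather than the $\alpha$-filtration, under the assumed convention that an edge enters at a time equal to its length. The function name is hypothetical. All points are born at time 0, and a component dies when it merges into another one.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return (birth, death) tuples of connected components in a Vietoris-Rips filtration."""
    n = len(points)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumed convention: an edge enters the filtration at a time equal to its length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            diagram.append((0.0, length))  # one component dies at this merge
    diagram.append((0.0, math.inf))  # the last component never dies
    return diagram
```

For two well-separated pairs of points, such as `[(0, 0), (0, 1), (10, 0), (10, 1)]`, the diagram contains two short-lived components that die at time 1, one prominent component that dies at time 10, and one tuple with infinite death time.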

Note that topological optimization, that is, optimizing the data representation with respect to its persistence diagram(s), is one of the main tools for our proposed method of topological regularization. It is especially effective when conducted through the $\alpha$-filtration constructed from a low-dimensional data embedding matrix $E$. In particular when the embedding dimension is $d = 2$, e.g., for data visualization applications (which was also the focus in the experiments section in the main paper), the $\alpha$-filtration can be rapidly constructed from $E$, whereas its computational cost increases exponentially for larger dimensions $d$. A potential solution to this is to use Vietoris-Rips filtrations instead. These are filtrations constructed from the Vietoris-Rips complex of the data $E$, which includes a simplex for every possible subset of points in $E$. In practice, the dimensions of these simplices are limited to the homology dimension of interest plus one. While Vietoris-Rips complexes, and thus the filtrations thereof, can be constructed more rapidly in higher dimensions, they tend to include far more simplices than the $\alpha$-filtration, which inherently complicates the subsequent computation of persistent homology. Thus, optimizing the loss for topological regularization (equation (1) in the main paper) is most efficient through $\alpha$-filtrations for low-dimensional data embedding matrices $E$. For more details on the computational cost of persistent homology as well as the associated filtrations, we refer to Otter et al. (2017).
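The difference in size between the two complexes can be illustrated with a simple count. The sketch below, with hypothetical function names, compares the number of simplices up to a given dimension in the full Vietoris-Rips complex with the standard upper bound for a triangulation of points in the plane (at most $3n - 6$ edges and $2n - 5$ triangles for $n \geq 3$ points).

```python
from math import comb


def rips_simplex_count(n_points, max_dim):
    """Simplices of dimension <= max_dim in the full Vietoris-Rips complex on n points."""
    # Every subset of k + 1 points spans a k-simplex.
    return sum(comb(n_points, k + 1) for k in range(max_dim + 1))


def planar_delaunay_simplex_bound(n_points):
    """Upper bound on vertices + edges + triangles of a planar triangulation (n >= 3)."""
    return n_points + (3 * n_points - 6) + (2 * n_points - 5)
```

For 1000 points and 1-dimensional homology (simplices up to dimension 2), `rips_simplex_count(1000, 2)` gives 166,667,500 simplices, whereas `planar_delaunay_simplex_bound(1000)` gives at most 5,989.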

Appendix B Supplementary Experiments

We considered an additional experiment on the Harry Potter graph obtained from This graph is composed of characters from the Harry Potter novel (the nodes in the graph), and edges marking friendly relationships between them (Figure 9). Only the largest connected component is used. This graph has previously been analyzed by Vandaele et al. (2020), who identified a circular model therein that transitions between the ‘good’ and ‘evil’ characters from the novel.

To embed the Harry Potter graph, we used a simple graph embedding model where the sigmoid of the inner product between embedded nodes captures the (Bernoulli) probability of an edge occurrence (Rendle et al., 2020). Thus, this probability will be high for nodes close to each other in the embedding, and low for more distant nodes. These probabilities are then optimized to match the binary edge indicator vector. Figure 10(a) shows the result of this embedding, along with the circular model presented by Vandaele et al. (2020). For clarity, character labels are only annotated for a subset of the nodes (the same as by Vandaele et al. (2020)).
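A minimal sketch of such a model is given below. The edge probability follows the description above. The choice of the mean binary cross-entropy as the quantity to minimize is an assumption, as are all function names.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def edge_probability(z_u, z_v):
    """Bernoulli edge probability: sigmoid of the inner product of two embedded nodes."""
    return sigmoid(sum(a * b for a, b in zip(z_u, z_v)))


def graph_embedding_loss(embedding, edges, eps=1e-12):
    """Mean binary cross-entropy between edge probabilities and edge indicators (assumed loss)."""
    edge_set = {frozenset(e) for e in edges}
    n = len(embedding)
    total, n_pairs = 0.0, 0
    for u in range(n):
        for v in range(u + 1, n):
            p = edge_probability(embedding[u], embedding[v])
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            y = 1.0 if frozenset((u, v)) in edge_set else 0.0
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            n_pairs += 1
    return total / n_pairs
```

With all nodes embedded at the origin, every edge probability is 0.5 and the loss equals $\log 2$. An embedding whose inner products are large and positive for connected nodes and negative for unconnected nodes attains a lower loss.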

We furthermore regularized this embedding using a topological loss function that measures the persistence of the most prominent 1-dimensional hole in the embedding (see also Table 2 in the main paper), the result of which is shown in Figure 10(c). Interestingly, the topologically regularized embedding now better captures the circularity of the model identified by Vandaele et al. (2020), and focuses more on distributing the characters along it. Note that although this model is included in the visualizations, it is not used to derive the embeddings, nor is it derived from them.

For comparison, Figure 10(b) shows the result of optimizing the initialized ordinary graph embedding for the same topological loss, but without the graph embedding loss. We observe that this results in a sparse enlarged cycle. Most characters are positioned poorly along the circular model, and concentrate near a small region. Interestingly, even though we only optimized for the topological loss here, it is actually less optimal, i.e., higher, than when we applied topological regularization (see also Table 3 in the main paper). This is also a result of the sparsity of the circle, which contributes to a larger birth time, and thus a lower persistence, of the corresponding hole.
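The effect of the birth time on this loss can be illustrated on given persistence diagrams. The sketch below assumes that the topological loss is the negative persistence of the most persistent finite tuple, so that lower is better, and that regularization adds it to the embedding loss with a hypothetical weight. The function names and the example diagrams are illustrative and not taken from the experiments.

```python
import math


def most_prominent_hole_loss(diagram):
    """Negative persistence of the most persistent finite (birth, death) tuple (assumed sign)."""
    finite = [(b, d) for b, d in diagram if math.isfinite(d)]
    if not finite:
        return 0.0  # no finite hole to reward
    return -max(d - b for b, d in finite)


def regularized_loss(embedding_loss, topological_loss, weight=1.0):
    """Embedding loss plus a weighted topological loss (weight is a hypothetical parameter)."""
    return embedding_loss + weight * topological_loss


# Illustrative 1-dimensional diagrams: a densely sampled cycle is born early,
# a sparsely sampled cycle of the same size is born late.
dense_cycle = [(0.5, 3.0), (1.0, 1.5)]
sparse_cycle = [(2.0, 3.0)]
```

Here `most_prominent_hole_loss(dense_cycle)` is -2.5 and `most_prominent_hole_loss(sparse_cycle)` is -1.0, so the sparse cycle has the higher, i.e., less optimal, topological loss even though both cycles die at the same time.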

Figure 9: The major connected component in the Harry Potter graph. Edges mark friendly relationships between characters.
(a) Ordinary graph embedding.
(b) Topologically optimized embedding (initialized with the ordinary graph embedding).
(c) Topologically regularized embedding.
Figure 10: Various embeddings of the Harry Potter graph and the circular model therein.