Graph Filtration Learning

05/27/2019 · Christoph Hofer, et al. · University of North Carolina at Chapel Hill · Universitätsbibliothek Salzburg

We propose an approach to learning with graph-structured data in the problem domain of graph classification. In particular, we present a novel type of readout operation to aggregate node features into a graph-level representation. To this end, we leverage persistent homology computed via a real-valued, learnable, filter function. We establish the theoretical foundation for differentiating through the persistent homology computation. Empirically, we show that this type of readout operation compares favorably to previous techniques, especially when the graph connectivity structure is informative for the learning problem.

1 Introduction

We consider the problem of learning a function from the space of (finite) undirected graphs, , to a (discrete/continuous) target domain . Additionally, graphs might have discrete or continuous attributes attached to each node. Prominent examples of this class of learning problem appear in the context of classifying molecule structures, chemical compounds, or social networks.

A substantial amount of research has been devoted to developing techniques for supervised learning with graph-structured data, ranging from kernel-based methods Shervashidze09a ; Shervashidze11a ; Feragen13a ; Kriege16a , to more recent approaches based on graph neural networks (GNNs) Scarselli09a ; Hamilton17b ; Zhang18a ; Morris19a ; Xu19a ; Ying18a . Most of the latter works use an iterative message passing scheme Gilmer17a to learn node representations, followed by a graph-level pooling operation that aggregates node-level features. This aggregation step is typically referred to as a readout operation. While research has mostly focused on variants of the message passing function, the readout step may have a significant impact, as it aims to capture properties of the entire graph. Importantly, both simple and more refined readout operations, such as summation, differentiable pooling Ying18a , or sort pooling Zhang18b , are inherently coupled to the amount of information carried over via multiple rounds of message passing. Hence, architectural GNN choices are typically guided by dataset characteristics, e.g., requiring the number of message passing rounds to be tuned to the expected size of graphs.

Contribution. We propose a homological readout operation that captures the full global structure of a graph while relying only on node representations that are learned (end-to-end) from immediate neighbors. This not only alleviates the aforementioned design challenge, but potentially also offers additional discriminative information.

The main idea is to consider a graph, , as a simplicial complex, , i.e., the main structure in simplicial homology. While this view would allow us to study, e.g., the ranks of homology groups, revealing the number of connected components or loops, the information is quite coarse. Alternatively, we can construct , one part at a time, and keep track of the induced homological changes. To do this, we need an ordering on the parts of which can be realized by defining a suitable function . The concept of successively constructing extends homology to persistent homology Edelsbrunner2010 , which offers a concise summary representation of the induced homological changes that occur during this process.
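To make the construction concrete, the following minimal sketch (plain Python with toy data; the names and values are our own illustrative assumptions) builds the simplicial complex of a small graph and orders its simplices by a vertex function lifted via maximum aggregation, i.e., the kind of filtration used throughout this paper.

```python
# Toy graph, viewed as a simplicial complex: the simplices are its vertices and edges.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
f = {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.6}   # a real-valued function on the vertices

simplices = [(v,) for v in vertices] + [tuple(sorted(e)) for e in edges]
# lift f to simplices by maximum aggregation over their vertices
filtration_value = {s: max(f[v] for v in s) for s in simplices}

# constructing the complex one simplex at a time, in order of increasing filtration
# value (vertices before edges on ties), yields the filtration whose homological
# changes persistent homology summarizes
filtration = sorted(simplices, key=lambda s: (filtration_value[s], len(s)))
print(filtration)
```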

Figure 1: Conceptual overview. For a graph, , interpreted as a simplicial complex , a real-valued time is first computed via for each node, e.g., implemented by a GNN with one level of message passing. We then compute persistence barcodes, , which are fed through a vectorization scheme and passed through a classifier (e.g., an MLP). Our approach allows passing a learning signal through the persistent homology computation, effectively allowing us to optimize for the specific classification task.

2 Related work

Graph neural networks. Most previous work on neural network based approaches to learning with graph-structured data focuses on learning informative node embeddings to solve tasks such as link prediction Schuett17a , node classification Hamilton17b , or classifying entire graphs. Many of these approaches, including Scarselli09a ; Duvenaud15a ; Li16a ; Battaglia16a ; Kearns16a ; Morris19a , can be formulated as a message passing scheme Gilmer17a ; Xu18a where features of graph nodes are passed to immediate neighbors via a differentiable message passing function; this operation proceeds over multiple iterations with each iteration parametrized by a neural network. Aspects distinguishing these approaches include (1) the particular realization of the message passing function, (2) the way information at a node is eventually aggregated, and (3) whether additional edge features are included. Due to the algorithmic similarity of iterative message passing and aggregation to the Weisfeiler-Lehman (WL) graph isomorphism test Weisfeiler68a , several works Xu19a ; Morris19a have recently studied this connection and established a theoretical underpinning for analyzing properties / limitations of GNN variants in the context of the WL test.

Readout operations. With few exceptions, surprisingly little effort has been devoted to the so-called readout operation, i.e., a function that aggregates node features into a global graph representation and allows making predictions for an entire graph. Common strategies include summation Duvenaud15a , averaging, or passing node features through a network operating on sets Li16a ; Gilmer17a . As pointed out in Ying18a , this effectively ignores the often complex global hierarchical structure of graphs. To mitigate this issue, Ying18a proposed a differentiable pooling operation that interleaves each message passing iteration and successively coarsens the graph. A different pooling scheme is proposed in Zhang18b , which relies on appropriately sorting node features obtained from each message passing iteration. Both pooling approaches are generic and show substantial improvements on multiple benchmarks. Yet, the gains inherently depend on multiple rounds of message passing, as the global structure is only successively captured during this process. In our alternative approach, global structural information is captured from the start and hinges only on a scalar attached to each node, learnable via a GNN with a single level of message passing.

Persistent homology & graphs. Notably, analyzing graphs via persistent homology is not new, with several works showing promising results on graph classification Hofer17a ; Carriere19a . So far, however, persistent homology has been used in a passive manner, meaning that the function mapping simplices to is fixed and not informed by the learning task. Essentially, this degrades persistent homology to a feature extraction step, where the obtained topological summaries are fed through a suitable vectorization scheme and passed to a classifier. Further, the success of Hofer17a ; Carriere19a inherently hinges on the choice of the a-priori defined function , e.g., the node degree function in Hofer17a and a heat kernel function in Carriere19a . The difference from our approach is that backpropagation of the learning signal stops at the persistent homology computation; in our case, the signal is passed through, allowing us to adjust the filter function during learning.

3 Background

We provide a concise introduction to the necessary concepts of persistent homology and refer the interested reader to Hatcher2002 or Edelsbrunner2010 for further details.

Homology. The key concept of homology theory is to study the properties of some object by means of (commutative) algebra. In particular, we assign to a sequence of groups/modules which are connected by homomorphisms such that . A structure of this form is called a chain complex and by studying its homology groups

(1)

we can derive (homological) properties of . The original motivation for homology is to analyze topological spaces. In that case, the ranks of homology groups yield directly interpretable properties, e.g., reflects the number of connected components and the number of loops.

A prominent example of a homology theory is simplicial homology. A simplicial complex, , over the vertex domain is a set of non-empty (finite) subsets of that is closed under the operation of taking non-empty subsets. Formally, this means with and . We call a simplex iff ; correspondingly . We set . Further, let be the module generated by over (simplicial homology is not specific to this choice of coefficients, but it is a typical one, since it allows us to interpret -chains as sets of -simplices) and we define

i.e., is mapped to the formal sum of its -dimensional faces. The linear extension to of this mapping defines the -th boundary operator of binary simplicial homology, i.e.,

Using we obtain the corresponding homology group of dimension as in Eq. (1).
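Written out in conventional notation (a reconstruction on our part, since the original symbols did not survive extraction), the boundary operator over $\mathbb{Z}/2\mathbb{Z}$ coefficients and the homology groups of Eq. (1) read:

$$\partial_k(\sigma) \;=\; \sum_{v \in \sigma} \big(\sigma \setminus \{v\}\big) \quad \text{for a $k$-simplex } \sigma, \qquad H_k \;=\; \ker \partial_k \,/\, \operatorname{im} \partial_{k+1}.$$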

Persistent homology. Let be a simplicial complex and a sequence of simplicial complexes such that . Then, is called a filtration of . If we use the extra information provided by the filtration of , we obtain a sequence of chain complexes

where and denotes the inclusion. This then leads to the concept of persistent homology groups, defined by

The ranks, , of these homology groups (i.e., the -th persistent Betti numbers) capture the number of homological features of dimensionality (e.g., connected components for , loops for , etc.) that persist from to (at least) . In fact, according to the Fundamental Lemma of Persistent Homology (Edelsbrunner2010 ), the quantities

(2)

encode all the information about the persistent Betti numbers of dimension .
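In the standard notation of Edelsbrunner2010 (again a reconstruction, as the original symbols are missing), the quantities of Eq. (2) are the multiplicities

$$\mu_k^{i,j} \;=\; \big(\beta_k^{i,j-1} - \beta_k^{i,j}\big) \;-\; \big(\beta_k^{i-1,j-1} - \beta_k^{i-1,j}\big), \qquad i < j,$$

where $\beta_k^{i,j}$ denotes the $k$-th persistent Betti number from filtration step $i$ to step $j$.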

Persistence barcodes. One way to obtain a filtration of is to define a vertex filter function and consider

(3)

where is the sorted sequence of filtration values, i.e., . Intuitively, this means that the filter function is defined on the vertices of and lifted to simplices in by maximum aggregation. This enables us to consider a sublevel set filtration of . Then, for a given filtration of and , we can construct a multiset by inserting the point , , with multiplicity . This effectively encodes the -dimensional persistent homology of w.r.t. the given filtration. This representation is called a persistence barcode, . For a given complex of dimension and a function (of the discussed form), we can interpret the -th persistent homology as a mapping of simplicial complexes, defined by

(4)
Remark 1.

By setting

we extend Eq. (2) to features which never disappear, also referred to as essential. If we use this extension, Eq. (4) yields an additional barcode, denoted as , per dimension. For practical reasons, the points in this barcode are just the birth-times, as all death-times are infinite and are thus omitted.
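For graphs, the 0-dimensional part of this construction can be computed with a simple union-find procedure. The sketch below (plain Python; a toy illustration, not the authors' GPU implementation mentioned in §5) processes edges in order of their max-aggregated filtration value and records (birth, death) pairs as well as essential birth-times, following the convention of Remark 1.

```python
def zero_dim_barcode(num_vertices, edges, f):
    """0-dimensional sublevel set persistence of a graph with vertex filter values f."""
    parent = list(range(num_vertices))

    def find(x):                                # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # process edges in order of their (max-aggregated) filtration value
    pairs = []
    for u, v in sorted(edges, key=lambda e: max(f[e[0]], f[e[1]])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                            # edge closes a loop, no 0-dim change
        # the root of each component is its oldest vertex (smallest filter value);
        # by the elder rule, the younger component dies when the two merge
        young, old = (ru, rv) if f[ru] > f[rv] else (rv, ru)
        pairs.append((f[young], max(f[u], f[v])))   # (birth, death)
        parent[young] = old
    essential = [f[r] for r in {find(x) for x in range(num_vertices)}]  # birth-times only
    return pairs, essential

print(zero_dim_barcode(4, [(0, 1), (1, 2), (2, 3), (3, 0)],
                       {0: 0.2, 1: 0.9, 2: 0.4, 3: 0.6}))
```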

Learning from persistence barcodes. Using persistence barcodes as input to machine learning algorithms has recently found several applications Bendich2016 ; Kwitt15a ; Hofer17a ; Adams17a ; Carriere19a . Technically, the multiset nature of barcodes prevents the direct application of standard methods, such as SVMs or neural networks. There are two dominant strategies to mitigate this problem: vectorization and kernel-based learning. Of special interest to this work is the learnable vectorization scheme of Hofer17a , as it directly integrates barcodes into a neural network framework. In the current state-of-the-art, with the exception of Chen19a (where the authors regularize decision boundaries of classifiers via persistent homology), the gradient signal ends at the persistent homology computation. In the next section, we address this limitation.

4 Filtration learning

First, note that the computation of sublevel set persistent homology, cf. Eq. (3), depends on two arguments: (1) the complex and (2) the filter function which determines the order of the simplices in the filtration of . As the complex is inherently discrete, it is clear that it cannot be subject to gradient-based optimization. However, assume that the filter has a differentiable dependence on a real-valued parameter . Then the persistent homology of the sublevel set filtration also depends on this parameter. This immediately raises the following question: Is the mapping differentiable in ?
Notation. We adhere to the following convention: If any symbol introduced in the context of persistent homology is dependent on , we interpret it as a function of and attach “” to this symbol. If the dependence on is irrelevant to the current context, we omit it for brevity. For example, we write for the multiplicity of barcode points. Next, we concretize the idea of a learnable filter.

Definition 1 (Learnable vertex filter).

Let be a vertex domain, the set of possible simplicial complexes over and let

be differentiable in for . Then, we call a learnable vertex filter with parameter .

Remark 2.

The assumption that depends on a single parameter serves solely to simplify the following theoretical part. The derived results immediately generalize to the case of multiple parameters, and we will drop this assumption later on.

By Eq. (4), is a mapping to (or if essential barcodes are included), i.e., the space of persistence barcodes. As has no natural linear structure, we consider differentiability in combination with a coordinatization strategy . This allows studying the mapping for which differentiability is defined.

Definition 2 (Barcode coordinate function).

Let be a differentiable function. Then

is called barcode coordinate function.

In fact, the (vectorization) input layer presented in Hofer17a falls within the family of mappings defined in Definition 2, whereas the deep sets approach of Zaheer2017a yields a more general formulation, not specifically tailored to persistence barcodes. Upon using barcode coordinate functions, we can effectively map a barcode into and feed this representation through any differentiable layer downstream, e.g., implementing a classifier. We will discuss our particular choice of in §5.
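For orientation, one natural family of barcode coordinate functions in the sense of Definition 2 has the following deep-sets-like form (the notation below is our own and only illustrates the structure used later in the proof of Lemma 1):

$$\mathcal{V}_s(\mathcal{B}) \;=\; \sum_{p \in \mathcal{B}} s(p), \qquad s : \mathbb{R}^2 \to \mathbb{R}^n \text{ differentiable},$$

so that differentiating $\mathcal{V}_s$ reduces to differentiating $s$ at the individual barcode points.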

Next, we show that, under certain conditions, in combination with a suitable barcode coordinate function preserves differentiability.

Lemma 1.

Let be a finite simplicial complex with vertex set , be a learnable vertex filter as in Definition 1 and a barcode coordinate function as in Definition 2. If, for , it holds that the pairwise vertex filter values are distinct, i.e.,

then the mapping

(5)

is differentiable in .

For brevity, we only sketch the proof; the full version can be found in the supplementary material.

Sketch of proof.

The pairwise vertex filter values, , are distinct, which implies that they are (P1) strictly ordered by “” and (P2) . Let be the index permutation which sorts . Now consider . Since is assumed to be differentiable, and therefore continuous, also satisfies (P1) and (P2) and sorts , for sufficiently small. Importantly, this also implies

(6)

As a consequence, this allows deriving the following equality:

(7)

This concludes the proof, since the derivative within the summation on the right exists by assumption. ∎

A crucial assumption in Lemma 1 is that the filtration values are pairwise distinct. If this is not the case, the sorting permutation is not uniquely defined and Eq. (6) does not hold. Although the gradient w.r.t.  is still defined in this situation, it depends on the particular implementation of the persistent homology algorithm. The reason for this is that the latter depends on a strict total ordering of the simplices such that whenever is a face of , formally . Specifically, the sublevel set filtration only yields a partial strict ordering by

(8)
(9)

However, in the case , we can neither infer nor from Eq. (8) and Eq. (9). Hence, those “ties” need to be settled in the implementation. If we are only interested in the barcodes, the tie settling strategy is irrelevant and therefore the particular barcodes do not depend on the implementation. To see this, consider a point in the barcode . Now let and . Then,

are all valid representations of in , but the representation of the values resp. depends on the selection and . The selection, in turn, is determined by the tie settling strategy. Hence, although different implementations yield the same barcodes, the gradients may differ; see Fig. 2 for an example of a problematic configuration.
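To illustrate how a gradient can nevertheless flow through the computation, recall that every barcode coordinate equals one of the (learnable) filtration values. In an autograd framework, the persistence algorithm therefore only has to return which filtration value realizes each birth and death; selecting those values by index keeps them differentiable. The snippet below is a minimal PyTorch sketch with made-up indices, not the authors' implementation.

```python
import torch

f = torch.tensor([0.1, 0.7, 0.4], requires_grad=True)  # vertex filtration values f(theta)

# Suppose the (non-differentiable) persistence bookkeeping determined that one
# 0-dimensional point is born at vertex 0 and dies at the edge whose maximal
# vertex is vertex 1, i.e., the point is (f[0], f[1]).
birth_idx, death_idx = 0, 1
barcode_point = torch.stack([f[birth_idx], f[death_idx]])  # differentiable selection

loss = barcode_point.sum()     # stand-in for a barcode coordinate function + classifier
loss.backward()
print(f.grad)                  # tensor([1., 1., 0.]): the signal reaches the filter values
```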

Figure 2: Illustration of a problematic scenario that can occur when trying to backpropagate through the persistent homology computation. The left-hand side shows three steps in a filtration sequence with filtration values , including the two valid choices of the -dimensional barcode and (essential). Numerically, the barcodes for both choices are equal, however, depending on the tie settling strategy, the gradients may differ.
Remark 3.

An alternative, but computationally more expensive, strategy would be as follows: for a particular filtration value , let . Now consider a point in the barcode where, say, is “problematic”, i.e., implementation dependent. Setting , i.e., the (mean) aggregation of all possible representations of , results in a gradient which is independent of the actual tie settling strategy. Yet, this is far more expensive to compute, due to the index set construction of , especially if is large. While it can be argued that this strategy is more “natural”, it would lead to a more involved proof of differentiability, which we leave for future work.

Remark 4.

The difficulty of assigning/defining a proper gradient if the filtration values are not pairwise distinct is not unique to our approach. In fact, other set operations frequently used in neural networks face a similar problem. For example, consider the popular maxpool operator , which is a well-defined mapping. However, in the case of, say, , it is unclear if or is used to represent its value . In the situation of equal values but differing gradients (w.r.t. ), this could be problematic. In fact, the gradient w.r.t.  would depend on the particular decision of how to represent , i.e., on the implementation of the maxpool operator.
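A two-line PyTorch illustration of this analogy (again just a sketch): with tied inputs, the value of the max is unambiguous, but which element receives the gradient is an implementation detail.

```python
import torch

x = torch.tensor([1.0, 1.0], requires_grad=True)
x.max().backward()
# The value is 1.0 either way, but the printed gradient (e.g., [1., 0.] or [0.5, 0.5])
# depends on how the max implementation settles the tie.
print(x.grad)
```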

4.1 Graph filtration learning (GFL)

As mentioned in §1, graphs are simplicial complexes, although notationally represented in a slightly different way. For a graph we can directly define its simplicial complex by . We ignore this notational nuance and use and interchangeably. In fact, learning filtrations on graph-structured data integrates seamlessly into the presented framework. Specifically, the learnable vertex filter, generically introduced in Definition 1, can easily be implemented by a neural network. If local node neighborhood information is to be taken into account, this can be realized via a graph neural network (GNN) operating on an initial node representation . The learnable vertex filter is then a mapping of the form .
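A minimal sketch of such a learnable vertex filter with a single round of message passing is given below (a simplified GIN-style update on a dense adjacency matrix; the layer sizes and the exact GNN variant are assumptions and differ from the 1-GIN configuration detailed in §5 and the supplementary material).

```python
import torch
import torch.nn as nn

class OneHopVertexFilter(nn.Module):
    """Maps node features to one filtration value per node using one message passing step."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))          # learnable epsilon, GIN-style
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x, adj):
        # x: (num_nodes, in_dim) node features, adj: (num_nodes, num_nodes) adjacency
        neighbor_sum = adj @ x                           # aggregate immediate neighbors
        h = (1.0 + self.eps) * x + neighbor_sum          # one GIN-style update
        return self.mlp(h).squeeze(-1)                   # filtration value in (0, 1) per node
```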

Selection of . In the spirit of Hofer17a , we use a vectorization based on a local weighting function, , of points in a given barcode, . Our choice for is the mapping

(10)

where the occurring parameters are learnable. The reason we prefer this form over the original exponential variant of Hofer17a is that it yields a rational instead of an exponential dependency w.r.t. the parameters, resulting in tamer gradient behavior during optimization. This is even more critical as, in contrast to Hofer17a , we optimize over parts of the model which appear before the vectorization and are thus directly dependent on its gradient.
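The following PyTorch sketch illustrates a rational local weighting of barcode points in the spirit described above. The exact expression of Eq. (10) did not survive extraction, so the functional form below is an assumption chosen only to exhibit the rational dependence on the learnable parameters.

```python
import torch
import torch.nn as nn

class RationalBarcodeVectorization(nn.Module):
    """Sums a rational, learnable point weighting over a barcode (a (num_points, 2) tensor)."""
    def __init__(self, out_dim=100):
        super().__init__()
        self.centers = nn.Parameter(torch.rand(out_dim, 2))   # one center per output dim
        self.radii = nn.Parameter(torch.ones(out_dim))

    def forward(self, barcode):
        d = torch.cdist(barcode, self.centers)                    # (num_points, out_dim)
        w = 1.0 / (1.0 + d / self.radii.abs().clamp_min(1e-6))    # rational in d and radii
        return w.sum(dim=0)                                       # permutation-invariant sum
```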

Table 1: Graph classification accuracies (with std. dev.), averaged over ten cross-validation folds, on social network datasets. All GIN variants (including 5-GIN (Sum)) were evaluated on exactly the same folds. The bottom part of the table lists results obtained by approaches from the literature. Operations in parentheses refer to the readout variant. Only PH-only always uses the node degree.

Method                          REDDIT-BINARY   REDDIT-MULTI-5K   IMDB-BINARY   IMDB-MULTI
Initial node features:          uninformative   uninformative
1-GIN (PH-only)
1-GIN (GFL)
1-GIN (SumPool) Xu19a
1-GIN (SortPool) Zhang18b
Baseline Zaheer2017a
State-of-the-Art (NN)
DCNN Wang18a                    n/a             n/a
PatchySAN Niepert16a
DGCNN Zhang18b                  n/a             n/a
1-2-3-GNN Morris19a             n/a             n/a
5-GIN (sum) Xu19a

5 Experiments

We evaluate the utility of the proposed homological readout operation with respect to different aspects. First, as argued in §1, we want to avoid the challenge of tuning how much information is aggregated locally via message passing. To this end, when using our readout operation, referred to as GFL, we learn the filter function from just one round of message passing, i.e., the most local, non-trivial variant. Second, we aim to address the question of whether learning a filter is actually beneficial, compared to defining the filter a-priori, which we refer to as PH-only.

Importantly, to clearly assess the power of a readout operation across various datasets, we first need to investigate to what extent the discriminative power of a representation is already contained in the initial node features. In fact, if a baseline approach using only node features performs on par with approaches that leverage the graph structure, this would indicate that detailed connectivity information is of little relevance for the task. In this case, we cannot expect a readout which strongly depends on the latter to be beneficial.

Datasets. We use two common benchmark datasets for graphs with discrete node attributes, i.e., PROTEINS and NCI1, as well as four social network datasets (IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-MULTI-5K) which do not contain any node attributes; see the supplementary material for dataset characteristics.

Implementation. Source code is publicly available at Anonymous-URL. We provide a high-level description of the implementation (cf. Fig. 1), but refer the reader to the supplementary material for technical details. First, to implement the filter function , we use a single GIN-ε layer of Xu19a for one level of message passing (i.e., 1-GIN) with hidden dimensionality of . The obtained latent node representation is then passed through a two-layer MLP, mapping from , with a sigmoid activation at the last layer. Node degrees and (if available) discrete node attributes are encoded via embedding layers with 64 dimensions. In case of REDDIT-* graphs, initial node features are set to be uninformative, i.e., vectors of all ones (as in Xu19a ).

Using the output of the vertex filter, persistent homology is computed via a parallel GPU variant of the original reduction algorithm from Edelsbrunner02a , implemented in PyTorch. In particular, we first compute - and -dimensional barcodes for and , which correspond to sub- and superlevel set filtrations. For each filtration, i.e., and , we obtain barcodes for - and -dim. essential features, as well as -dim. non-essential features. Non-essential points in barcodes w.r.t.  are mapped into by mirroring along the main diagonals, and essential points (i.e., birth-times) are mapped to by mirroring around . We then take the union of the barcodes corresponding to sub- and superlevel set filtrations, which reduces the number of processed barcodes from six to three (see Fig. 1).

Finally, each barcode is passed through a vectorization layer Hofer17a , implementing , with output dimensions each. Upon concatenation, this results in a -dim. representation of a graph (i.e., the output of our readout) that is passed to a final MLP, implementing a classifier.
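Putting the pieces together, the readout can be sketched as follows (PyTorch; the stand-in vectorization module, the number of classes, and the tensor shapes are illustrative assumptions, not the exact configuration used in the paper).

```python
import torch
import torch.nn as nn

class TinyVectorization(nn.Module):
    """Stand-in for a learnable barcode vectorization layer with 100 outputs."""
    def __init__(self, out_dim=100):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, barcode):                      # barcode: (num_points, 2) tensor
        return self.point_mlp(barcode).sum(dim=0)    # sum over points -> (out_dim,)

num_classes = 2                                      # assumption for illustration
vectorize = nn.ModuleList([TinyVectorization() for _ in range(3)])   # one per barcode
classifier = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, num_classes))

def readout(barcodes):
    """barcodes: list of three (n_i, 2) tensors -> class scores for the graph."""
    features = torch.cat([v(b) for v, b in zip(vectorize, barcodes)])  # 300-dim graph vector
    return classifier(features)
```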

Baseline and readout operations. The previously mentioned Baseline, agnostic to the graph structure, is implemented using the deep sets approach of Zaheer2017a , which is similar to the GIN architecture used in all experiments, but without any message passing. In terms of readout functions, we compare against the prevalent sum aggregation, as well as sort pooling from Zhang18b . We do not explicitly compare against differentiable pooling from Ying18a , since we allow just one level of message passing, rendering differentiable pooling equivalent to 1-GIN (Sum).

Training and evaluation. We train for epochs using ADAM with an initial learning rate of (halved every -th epoch) and a weight decay of . No hyperparameter tuning or early stopping criterion is used. In terms of evaluation, we follow previous work (e.g., Morris19a ; Zhang18b ) and report cross-validation accuracy, averaged over ten folds, of the model obtained in the final training epoch.

Results. Table 1 lists the results for the social networks. On both IMDB datasets, the Baseline performs on par with the state-of-the-art, as well as with all readout strategies. However, this is not surprising, as the graphs are very densely connected and substantial information is already encoded in the node degree. Consequently, using the degree as an initial node feature, combined with GNNs and various forms of readout, does not lead to noticeable improvements. REDDIT-* graphs, on the other hand, are far from fully connected and the global structure is more hierarchical. Here, the information captured by persistent homology is highly discriminative, with GFL outperforming Sum and SortPool by a large margin. Notably, setting the filter a-priori to the degree function performs equally well. We argue that this might already be an optimal choice on REDDIT; on other datasets (e.g., IMDB) this is not the case. Importantly, although the initial features on REDDIT graphs are set to be uninformative for all readout variants with 1-GIN, GFL can learn a filter function that is equally informative.

Table 2: Results on graphs with node attributes.

Method                          PROTEINS   NCI1
Initial node features:
1-GIN (PH-only)
1-GIN (GFL)
1-GIN (Sum) Xu19a
1-GIN (SortPool) Zhang18b
Baseline Zaheer2017a
Initial node features:
1-GIN (PH-only)                 n/a        n/a
1-GIN (GFL)
1-GIN (Sum) Xu19a
1-GIN (SortPool) Zhang18b
Baseline Zaheer2017a
State-of-the-Art (NN)
DCNN Wang18a
PatchySAN Niepert16a
DGCNN Zhang18b
1-2-3-GNN Morris19a
5-GIN (sum) Xu19a

Table 2 lists the results on NCI1 and PROTEINS. On the latter, we observe that the Baseline, agnostic to the connectivity information, is already competitive with the state-of-the-art. GNNs with different readout strategies, including ours, only marginally improve performance. It is therefore challenging to assess the utility of different readout variants on this dataset. On NCI1, the situation is different. Relying on node degrees only, our GFL readout clearly outperforms the other readout strategies. This indicates that GFL can successfully capture the underlying discriminative graph structure, relying only on minimal information gathered at the node level. Including label information leads to results competitive with the state-of-the-art without explicitly tuning the architecture to this dataset. While all other readout strategies benefit equally from additional node attributes, the fact that 5-GIN (Sum) performs worse than 1-GIN (Sum) highlights our argument that message passing approaches need careful architectural design.

6 Discussion

In this work, we introduced an approach to actively integrate persistent homology into the realm of graph neural networks, offering GFL as a novel type of readout operation. As demonstrated throughout all experiments, GFL, which is based on the idea of filtration learning, achieves results competitive with the state-of-the-art on various datasets. Most importantly, this is achieved with a single architecture that relies only on a simple one-level message passing scheme. This differs from previous works, where the amount of information that is iteratively aggregated via message passing can be crucial. We also highlight that GFL could easily be extended to incorporate edge-level information, or be used directly on graphs with continuous node attributes. From a theoretical perspective, we established how to backpropagate a (gradient-based) learning signal through the persistent homology computation in combination with a differentiable vectorization scheme. For future work, it will be interesting to study additional filtration techniques (e.g., based on edges or cliques) and whether they are beneficial.

7 Supplementary Material

This supplementary material contains the full proof of Lemma 1 omitted in the main work and additional information about the datasets used. It further contains details on the implementation of the models used in the experiments. For readability, all necessary definitions and results are restated, and the numbering matches the original numbering.

7.1 Dataset details

The following table contains a selection of statistics relevant to the datasets used in our experiments.

Datasets REDDIT-BINARY REDDIT-MULTI-5K IMDB-BINARY IMDB-MULTI PROTEINS NCI1
# graphs
# classes
nodes
edges
# labels n/a n/a n/a n/a

7.2 Proof of Lemma 1

Definition 1 (Learnable vertex filter).

Let be a vertex domain, the set of possible simplicial complexes over and let

be differentiable in for . Then, we call a learnable vertex filter with parameter .

Definition 2 (Barcode coordinate function).

Let be a differentiable function. Then

is called barcode coordinate function.

Lemma 1.

Let be a finite simplicial complex with vertex set , be a learnable vertex filter as in Definition 1 and a barcode coordinate function as in Definition 2. If, for , it holds that the pairwise vertex filter values are distinct, i.e.,

then the mapping

(11)

is differentiable in .

Proof.

For notational convenience, let . Also, let be the sorting permutation of , i.e., . By assumption, the pairwise filter values are distinct; thus there is a neighborhood around such that the ordering of the filtration values is not modified by changes of within this neighborhood, i.e.,

(12)

and

(13)

This implies that a sufficiently small change, , of does not change the induced filtrations. Formally,

(14)

Importantly, this means that

(15)

We next show that the derivative of Eq. (11) with respect to exists. By assumption, is differentiable and thus is differentiable. Now consider

( by Eq. (15))

This concludes the proof, since the derivative within the summation exists by assumption. ∎

7.3 Architectural details

As mentioned in Section 5 (Experiments), we implement the learnable filter using a single GIN-ε layer from Xu19a (with ε set as a learnable parameter). The internal architecture is as follows:

Embedding[n,64]-FC[64,64]-BatchNorm-LeakyReLU-FC[64,64].

Here, n denotes the dimension of the node attributes. For example, if initial node features are based on the degree function and the maximum degree over all graphs is 200, then n=200+1. In other words, n is the number of embedding vectors used to represent node degrees.

The multi-layer perceptron (MLP) mapping the output of the GIN layer to a real-valued node filtration value is parametrized as:

Embedding[64,64]-FC[64,64]-BatchNorm-LeakyReLU-FC[64,1]-Sigmoid.

As classifier, we use a simple MLP of the form FC[300,64]-ReLU-FC[64,#classes].

Here, the input dimensionality is 300, as each barcode is represented by a 100-dimensional vector.
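For convenience, a rough PyTorch transcription of the architecture strings above could look as follows (a sketch under the stated layer sizes; the embedding lookup, batching, and the GIN message passing step itself are omitted or simplified, and the Embedding[64,64] of the filtration MLP is rendered as a linear layer, which is an assumption on our part).

```python
import torch.nn as nn

n = 201   # e.g., maximum node degree 200 plus one (see above)

# GIN-internal network: Embedding[n,64]-FC[64,64]-BatchNorm-LeakyReLU-FC[64,64]
node_embedding = nn.Embedding(n, 64)
gin_mlp = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64),
                        nn.LeakyReLU(), nn.Linear(64, 64))

# filtration MLP: Embedding[64,64] (rendered here as a linear layer) - FC[64,64] -
# BatchNorm - LeakyReLU - FC[64,1] - Sigmoid
filtration_mlp = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64), nn.BatchNorm1d(64),
                               nn.LeakyReLU(), nn.Linear(64, 1), nn.Sigmoid())

# classifier: FC[300,64]-ReLU-FC[64,#classes]
num_classes = 2   # assumption for illustration
classifier = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, num_classes))
```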

References

  • (1) H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: A stable vector representation of persistent homology. JMLR, 18(8):1–35, 2017.
  • (2) P. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.
  • (3) P. Bendich, J.S. Marron, E. Miller, A. Pieloch, and S. Skwerer. Persistent homology analysis of brain artery trees. Ann. Appl. Stat., 10(2), 2016.
  • (4) M. Carrière, F. Chazal, Y. Ike, T. Lacombe, M. Royer, and Y. Umeda. A general neural network architecture for persistence diagrams and graph classification. arXiv, 2019. https://arxiv.org/abs/1904.09378.
  • (5) C. Chen, X. Ni, Q. Bai, and Y. Wang. A topological regularizer for classifiers via persistent homology. In AISTATS, 2019.
  • (6) D.K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R.P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
  • (7) H. Edelsbrunner and J. L. Harer. Computational Topology: An Introduction. American Mathematical Society, 2010.
  • (8) H. Edelsbrunner, D. Letcher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28(4):511–533, 2002.
  • (9) A. Feragen, N. Kasenburg, J. Petersen, M.D. Bruijne, and K.M. Borgwardt. Scalable kernels for graphs with continuous attributes. In NIPS, 2013.
  • (10) J. Gilmer, S.S. Schoenholz, P.F. Riley, O. Vinyals, and G.E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
  • (11) W.L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
  • (12) A. Hatcher. Algebraic Topology. Cambridge University Press, Cambridge, 2002.
  • (13) C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In NIPS, 2017.
  • (14) S. Kearns, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: Moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
  • (15) N.M. Kriege, P.-L. Giscard, and R. Wilson. On valid optimal assignment kernels and applications to graph classification. In NIPS, 2016.
  • (16) R. Kwitt, S. Huber, M. Niethammer, W. Lin, and U. Bauer. Statistical topological data analysis - a kernel perspective. In NIPS, 2015.
  • (17) Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
  • (18) C. Morris, M. Ritzert, M. Fey, and W.L. Hamilton. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.
  • (19) M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
  • (20) F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • (21) K. Schütt, P. J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K. R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In NIPS, 2017.
  • (22) N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539–2561, 2011.
  • (23) N. Shervashidze, S.V.N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, pages 488–495, 2009.
  • (24) Y. Wang, Y. Sun, Z. Liu, S.E. Sarma, M.M. Bronstein, and J.M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv, 2018. https://arxiv.org/abs/1801.07829.
  • (25) B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia - Seriya 2, (9):12–16, 1968.
  • (26) K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • (27) K. Xu, C. Li, Y. Tian, T. Sonobe, K.I. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
  • (28) R. Ying, J. You, C. Morris, X. Ren, W.L. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In NIPS, 2018.
  • (29) M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In NIPS, 2017.
  • (30) M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.
  • (31) Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai. RetGK: Graph kernels based on return probabilities of random walks. In NIPS, 2018.