Higher-Order Label Homogeneity and Spreading in Graphs

Do higher-order network structures aid graph semi-supervised learning? Given a graph and a few labeled vertices, labeling the remaining vertices is a high-impact problem with applications in several tasks, such as recommender systems, fraud detection and protein identification. However, traditional methods rely on edges for spreading labels, which is limiting because not all edges are equal. Vertices with stronger connections participate in higher-order structures in graphs, which calls for methods that can leverage these structures in semi-supervised learning tasks. To this end, we propose Higher-Order Label Spreading (HOLS) to spread labels using higher-order structures. HOLS has strong theoretical guarantees and reduces to standard label spreading in the base case. Via extensive experiments, we show that higher-order label spreading using triangles in addition to edges is up to 4.7% better than label spreading using edges alone. Compared to prior traditional and state-of-the-art methods, the proposed method leads to statistically significant accuracy gains in all-but-one cases, while remaining fast and scalable to large graphs.

1. Introduction

Figure 1. Graph SSL approaches which take only edges into account incorrectly classify the unlabeled central vertex ‘Alice’ as blue. By leveraging higher-order network structures, the proposed HOLS correctly labels Alice as red.

Given an undirected unweighted graph and a few labeled vertices, the graph transductive learning or semi-supervised learning (SSL) aims to infer the labels for the remaining unlabeled vertices (Zhu et al., 2003; Zhou et al., 2003; Talukdar and Crammer, 2009; Belkin et al., 2004; Yedidia et al., 2003; Kipf and Welling, 2017; Yang et al., 2016; Abu-El-Haija et al., 2019). Graph SSL finds applications in a number of settings: in a social network, we can infer a particular characteristic (e.g. political leaning) of a user based on the information of her friends to produce tailored recommendations; in a user-product bipartite rating network, based on a few manually identified fraudulent user accounts, SSL is useful to spot other fraudulent accounts (Akoglu et al., 2013; Eswaran et al., 2017; Kumar et al., 2018; Kumar et al., 2019); SSL can identify protein functions from networks of their physical interaction using just a few labels (Vazquez et al., 2003).

Traditional graph SSL algorithms leverage a key property of real-world networks: the homophily of vertices (Albert and Barabási, 2002; McPherson et al., 2001), i.e., nearby vertices in a graph are likely to have the same label. However, these methods tend to be limited by the fact that not all neighbors of a vertex are equal. Consider your own friendship network, where you have many acquaintances but only a few close friends. In fact, prior research has shown that vertices with a strong connection participate in several higher-order structures, such as dense subgraphs and cliques (Jackson, 2010; Jackson et al., 2012; Sizemore et al., 2017; Hanneman and Riddle, 2005). Thus, leveraging the higher-order structure between vertices is crucial to accurately label the vertices.

Let us elaborate on this using a small friendship network example, shown in Figure 1. The central vertex, Alice, participates in a closely-knit community with three friends B, C, and D, all of whom know each other. In addition, she has four acquaintances P, Q, R, and S from different walks of her life. Let the vertices be labeled by their ideological beliefs: vertices B, C, and D share the red label, while P, Q, R, and S have the blue label. Even though Alice has more blue connections than red, the connection between Alice, B, C, and D is stronger, as Alice participates in three 3-cliques and one 4-clique with them. In contrast, Alice has no 3- or 4-cliques with P, Q, R, and S. Owing to the stronger connection with the red nodes, Alice should be labeled red as well. However, traditional graph SSL techniques that rely on edges alone label Alice as blue (Zhu et al., 2003; Zhou et al., 2003). This calls for graph SSL methods that look beyond edges to leverage the signal present in higher-order structures to label vertices.
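The clique counts in this example can be checked with a few lines of Python. This is a minimal sketch: the edge list is a hypothetical encoding of Figure 1 (vertex `A` stands for Alice), and it assumes the four acquaintances are connected only to Alice.

```python
from itertools import combinations

# Hypothetical edge list for the Figure 1 example: Alice (A) and her close
# friends B, C, D all know each other; acquaintances P, Q, R, S are assumed
# to know only Alice.
edges = {frozenset(e) for e in [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "P"), ("A", "Q"), ("A", "R"), ("A", "S"),
]}
vertices = sorted({v for e in edges for v in e})


def is_clique(group):
    """True if every pair of vertices in `group` is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(group, 2))


def cliques_containing(vertex, k):
    """All k-cliques of the toy graph that include `vertex`."""
    others = [v for v in vertices if v != vertex]
    return [set(rest) | {vertex}
            for rest in combinations(others, k - 1)
            if is_clique((vertex,) + rest)]


triangles = cliques_containing("A", 3)
four_cliques = cliques_containing("A", 4)
print(len(triangles), len(four_cliques))
```

Every clique found this way lies inside the group {A, B, C, D}, which is the sense in which Alice's tie to her close friends is stronger than her ties to the acquaintances.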

Our present work focuses on three key research questions:

  • RQ1. How do the data reveal that higher-order network structures are homogeneous in labels?

  • RQ2. How can we leverage higher-order network structures for graph SSL in a principled manner?

  • RQ3. Do higher-order structures help improve graph SSL?

Accordingly, our contributions can be summarized as follows:
(i) Analysis: Through an empirical analysis of four diverse real-world networks, we demonstrate the phenomenon of higher-order label homogeneity, i.e., the tendency of vertices participating in a higher-order structure (e.g. triangle) to share the same label.
(ii) Algorithm: We develop Higher-Order Label Spreading (HOLS) to leverage higher-order structures during graph semi-supervised learning. HOLS works for any user-inputted higher-order structure and in the base case, is equivalent to edge-based label spreading (Zhou et al., 2003).
(iii) Effectiveness: We show that label spreading via higher-order structures strictly outperforms label spreading via edges, by a statistically significant margin of up to 4.7%. Notably, HOLS is competitive with recent deep learning based methods, while running 15× faster.

For reproducibility, all the code and datasets are available at https://github.com/dhivyaeswaran/hols.

2. Related Work

Desiderata | LP (Zhu et al., 2003) | LS (Zhou et al., 2003) | BP (Yedidia et al., 2003) | Planetoid (Yang et al., 2016) | GCN (Kipf and Welling, 2017) | MixHop (Abu-El-Haija et al., 2019) | HOLS-3
Higher-order structures ? ? ?
Theoretical guarantees ?
Fast algorithm
Table 1. Qualitative comparison of HOLS-3 (using edges and triangles) with traditional and recent graph SSL approaches

Traditional graph SSL approaches: By far, the most widely used graph SSL techniques are label propagation (Zhu et al., 2003) and label spreading (Zhou et al., 2003). Label propagation (LP) clamps labeled vertices to their provided values and uses a graph Laplacian regularization, while label spreading (LS) uses a squared Euclidean penalty as supervised loss and a normalized graph Laplacian regularization, which is known to be better-behaved and more robust to noise (Von Luxburg et al., 2008). Both techniques admit closed-form solutions and are extremely fast in practice, scaling well to billion-scale graphs. Consequently, a number of techniques build on top of these approaches, for example, to allow inductive generalization (Belkin et al., 2004; Weston et al., 2008) and to incorporate certainty (Talukdar and Crammer, 2009). When the graph is viewed as a pairwise Markov random field, belief propagation (BP) (Yedidia et al., 2003) may be used to recover the exact marginals on the vertices. BP can handle network effects beyond just homophily; however, it has well-known convergence problems from a practitioner’s point of view (Sen et al., 2008). While traditional techniques, in general, show many desirable theoretical properties such as closed-form solutions, convergence guarantees, and connections to spectral graph theory (Zhu et al., 2003) and statistical physics (Yedidia et al., 2003), they do not account for higher-order network structures.

Recent graph SSL approaches differ from traditional SSL methods in training embeddings of vertices to jointly predict labels as well as the neighborhood context in the graph. Specifically, Planetoid (Yang et al., 2016) uses skipgrams, while GCN (Kipf and Welling, 2017) uses approximate spectral convolutions to incorporate neighborhood information. MixHop (Abu-El-Haija et al., 2019) can learn a general class of neighborhood mixing functions for graph SSL. As such, these do not incorporate specific higher-order structures provided by the user. Further, their performance in practice tends to be limited by the availability of ‘good’ vertex features for initializing the optimization procedure.

Hybrid approaches for graph SSL: Another way to tackle the graph SSL problem is a hybrid approach to first extract vertex embeddings using an unsupervised approach such as node2vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014) or LINE (Tang et al., 2015) and then use the available labels to learn a transductive classifier such as an SVM (Joachims, 1999). Such methods, however, neither have well-understood theoretical properties nor do they optimize for a single objective in an end-to-end manner.

Comparison: We compare the best performing HOLS algorithm (HOLS-3 which uses triangles in addition to edges) qualitatively to prominent SSL approaches in Table 1 and quantitatively via experiments to representative methods from the above categories: LP and LS (traditional), GCN (recent) and node2vec + TSVM (hybrid).

3. Higher-Order Label Homogeneity

Recent work has shown that graphs from diverse domains have many striking higher-order network structures (Benson et al., 2016) which can be leveraged to improve tasks such as clustering (Yin et al., 2018), link prediction (Benson et al., 2018; AbuOda et al., 2019) and ranking (Rossi et al., 2019). In this section, we motivate the need to consider such structures for semi-supervised learning through empirical analysis of four diverse large real-world networks. We define and quantify higher-order label homogeneity–i.e., the tendency of vertices participating in higher-order structures to share the same label. We will show that the higher-order label homogeneity is remarkably more common than expected in real-world graphs.

Figure 2. Prevalence of various label configurations in real-world graphs (orange) relative to a random baseline (green) which shuffles vertex labels fixing the graph structure: We note that more homogeneous label configurations (towards left) are strikingly more prevalent than expected, while less homogeneous label configurations (towards right) are unusually rare.

3.1. Dataset Description

Dataset Domain Vertices Edges Classes
EuEmail (Leskovec et al., 2007) Email communication 1005 16.0K 42
PolBlogs (Adamic and Glance, 2005) Blog hyperlinks 1224 16.7K 2
Cora (Subelj and Bajec, 2013) Article citations 23.1K 89.1K 10
Pokec (Takac and Zabovsky, 2012) Friendship 1.6M 22.3M 10
Table 2. Statistics of datasets used

We use four network datasets for our empirical analysis and experiments. Table 2 summarizes some important dataset statistics.

  • EuEmail (Leskovec et al., 2007) is an e-mail communication network from a large European research institution. Vertices indicate members of the institution and an edge between a pair of members indicates that they exchanged at least one email. Vertex labels indicate membership to one of the 42 departments.

  • PolBlogs (Adamic and Glance, 2005) is a network of hyperlinks between blogs about US politics during the period preceding the 2004 presidential election. Blogs are labeled as right-leaning or left-leaning.

  • Cora (Subelj and Bajec, 2013) is a citation network among papers published at computer science conferences. Vertex labels indicate one of 10 areas (e.g. Artificial Intelligence, Databases, Networking) that the paper belongs to based on its venue of publication.

  • Pokec (Takac and Zabovsky, 2012) is the most popular online social network in Slovakia. Vertices indicate users and edges indicate friendships. From the furnished user profile information, we extract the locality or ‘kraj’ that users belong to and use them as labels.

These datasets exhibit homophily (Albert and Barabási, 2002; McPherson et al., 2001): people typically e-mail others within the same department; blogs tend to link to others having the same political leaning; papers mostly cite those from the same area; people belonging to the same locality are more likely to meet and become friends. In all cases, we omit self-loops and take the edges as undirected and unweighted.

3.2. Empirical Evidence

We now examine the label homogeneity of higher-order $k$-cliques, as they form the essential building blocks of many networks (Jackson, 2010; Jackson et al., 2012; Sizemore et al., 2017; Hanneman and Riddle, 2005) and moreover, can be counted and enumerated efficiently (Jain and Seshadhri, 2017; Danisch et al., 2018). We will stick to $k \in \{2, 3, 4, 5\}$ for computational reasons.

Methodology. We quantify label homogeneity of a given higher-order structure by measuring the distribution over what we term as its label configurations. Simply put, label configuration captures the extent to which participating vertices share the same label and is a function of vertex-label assignments that is invariant under the permutation of vertices and labels. A 2-clique has two label configurations: ‘2’ where both incident vertices have the same label and ‘1-1’ where they have different labels. A 3-clique has three label configurations: ‘3’ where all three vertices have the same label, ‘2-1’ where two of them share the same label and third vertex has a different label and ‘1-1-1’ where each vertex has a different label. Similarly, a 4-clique has 5 label configurations (4, 3-1, 2-2, 2-1-1, 1-1-1-1) and a 5-clique has 7 label configurations (5, 4-1, 3-2, 3-1-1, 2-2-1, 2-1-1-1, 1-1-1-1-1). Note that not all label configurations may be possible (e.g., 1-1-1 is impossible for a triangle in a 2-class problem) and still fewer may actually occur in practice.
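The notion of a label configuration can be made concrete with a short sketch. The function names below are illustrative, not from the paper: the first maps the labels of a clique's vertices to a configuration string, and the second counts how many configurations a $k$-clique can have, which is the number of integer partitions of $k$.

```python
from collections import Counter


def label_configuration(labels):
    """Map the labels of a clique's vertices to a configuration string.

    The configuration is the multiset of label counts, sorted in decreasing
    order, so it is invariant to permuting vertices or renaming labels.
    """
    counts = sorted(Counter(labels).values(), reverse=True)
    return "-".join(str(c) for c in counts)


def count_configurations(k, largest=None):
    """Number of label configurations of a k-clique (integer partitions of k)."""
    if largest is None:
        largest = k
    if k == 0:
        return 1
    # Choose the next part (at most `largest`), then partition the remainder
    # into parts no bigger than the one just chosen.
    return sum(count_configurations(k - part, part)
               for part in range(1, min(k, largest) + 1))


print(label_configuration(["red", "red", "blue"]))
print([count_configurations(k) for k in (2, 3, 4, 5)])
```

The partition counts agree with the 2, 3, 5 and 7 configurations listed above for 2-, 3-, 4- and 5-cliques.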

We will now compare the observed distribution over label configurations to its commensurate distribution under a random baseline or null model, which shuffles vertex labels while fixing the graph structure and the marginal distribution of labels. A priori, there is no reason to expect the observed distribution to be any different from random. But suppose that the observed probability mass for homogeneous label configurations (e.g. ‘$k$’ for $k$-cliques) exceeds that of random and vice versa for non-homogeneous label configurations (e.g. ‘1-$\cdots$-1’); this would suggest higher-order label homogeneity. Similarly, if the opposite occurs, we may conclude that vertices mix disassortatively (Newman, 2003) to form higher-order structures.
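The comparison against the shuffled-label baseline might be sketched as follows for triangles. The toy graph, the number of shuffles and the seed are assumptions chosen for illustration; the paper's own analysis runs on the real datasets and is not reproduced here.

```python
import random
from collections import Counter
from itertools import combinations


def triangles(adj):
    """Enumerate each triangle once as a sorted vertex triple."""
    for u in adj:
        for v, w in combinations(sorted(n for n in adj[u] if n > u), 2):
            if w in adj[v]:
                yield (u, v, w)


def config_distribution(tris, labels):
    """Empirical distribution over triangle label configurations."""
    counts = Counter(
        "-".join(map(str, sorted(Counter(labels[v] for v in t).values(),
                                 reverse=True)))
        for t in tris
    )
    total = sum(counts.values())
    return {cfg: n / total for cfg, n in counts.items()}


def null_distribution(tris, labels, trials=200, seed=0):
    """Average distribution after shuffling labels over the fixed graph."""
    rng = random.Random(seed)
    nodes, values = list(labels), list(labels.values())
    acc = Counter()
    for _ in range(trials):
        rng.shuffle(values)  # keeps the marginal label distribution intact
        shuffled = dict(zip(nodes, values))
        for cfg, p in config_distribution(tris, shuffled).items():
            acc[cfg] += p / trials
    return dict(acc)


# Toy graph (an assumption for illustration): two 4-cliques joined by one edge,
# with one label per clique.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
labels = {v: "red" if v < 4 else "blue" for v in adj}
tris = list(triangles(adj))
observed = config_distribution(tris, labels)
baseline = null_distribution(tris, labels)
print(observed, baseline)
```

In this toy graph every triangle is label-homogeneous, whereas under shuffling most of the mass moves to the ‘2-1’ configuration, which is the kind of gap the analysis looks for.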

Observations. Figure 2 plots the observed distribution over $k$-clique label configurations (orange), comparing it to random (green). Under the baseline, most of the probability mass is concentrated on less homogeneous label configurations displayed towards the right, which is expected since vertex labels are assigned at random.

In sharp contrast, the observed distribution is heavily concentrated towards the label configurations on the left. Notably, the most homogeneous label configuration (i.e. ‘$k$’ for a $k$-clique, where all participating vertices have the same label) is 1.8-5.9×, 3.7-60×, 7.5-464×, and 15-3416× more common than expected for $k = 2, 3, 4, 5$ respectively. On the other hand, the least homogeneous label configuration (‘1-$\cdots$-1’, where each participating vertex has a different label) is 1.4-5.3×, 1.8-15× and 22.2-90× rarer than expected when possible for $k = 2, 3, 4$ respectively. For the Cora dataset in particular, the ‘1-1-1-1-1’ label configuration is expected about once in eight or nine 5-clique occurrences, but does not occur even once among its over twenty-two thousand 5-cliques.

Overall, our observations establish concrete evidence for the phenomenon of higher-order label homogeneity: vertices participating in real-world $k$-cliques indeed share the same label to a greater extent than can be explained by random chance.

4. Higher-Order Label Spreading

We now derive our higher-order label spreading (HOLS) algorithm and show its theoretical properties including the connection to label spreading (Zhou et al., 2003), closed-form solution and convergence guarantee.

4.1. Notation

Let $G = (V, E)$ be a graph with vertex set $V$ and edge set $E$. Edges are undirected with $w_{ij}$ representing the edge weight between vertices $i, j \in V$. Each vertex $i$ has a unique label $\ell_i \in \{1, 2, \ldots, C\}$.

Let $\mathcal{K}$ be the set of network structures or motifs (e.g., edges, triangles) which we want to leverage for graph semi-supervised learning. For a given motif $\kappa \in \mathcal{K}$, let $|\kappa|$ denote its size, which is its number of participating vertices. For example, when $\kappa$ is a triangle, $|\kappa| = 3$. Further, suppose $Q_\kappa$ is the set of occurrences of a motif $\kappa$ in graph $G$ and each such occurrence $q \in Q_\kappa$ has a weight $w_q$ (e.g. computed as the product of incident edge weights). Use $\mathbb{1}[\cdot]$ to denote the indicator function, which evaluates to 1 when the enclosed expression is true. For example, $\mathbb{1}[i \in q]$ is one if the vertex $i$ is part of the subgraph $q$.

4.2. Generalized Loss Function

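Let $y_i \in \{0, 1\}^C$ be the one-hot vector of the provided label for a labeled vertex $i$, such that $y_{ic} = 1$ if vertex $i$ has the label $c$ and is zero otherwise ($y_i$ is the zero vector for an unlabeled vertex). We propose to infer the final labels $\ell_i$ (where $\ell_i = \arg\max_c x_{ic}$ for the inferred label vector $x_i \in \mathbb{R}^C$) by minimizing the following loss function:

$$\mathcal{L} = \eta \, \mathcal{L}_s + (1 - \eta) \sum_{\kappa \in \mathcal{K}} \alpha_\kappa \, \mathcal{L}_{g,\kappa} \tag{1}$$

where $\mathcal{L}_s = \frac{1}{2} \sum_{i \in V} \|x_i - y_i\|^2$ is the supervised loss, which penalizes the deviation of inferred labels from their provided values, while $\mathcal{L}_{g,\kappa}$ is the graph loss with respect to motif $\kappa$, which enforces the inferred labels to be ‘smooth’ over all occurrences of $\kappa$ as:

$$\mathcal{L}_{g,\kappa} = \frac{1}{2} \sum_{q \in Q_\kappa} w_q \sum_{i, j \in q,\; i < j} \|x_i - x_j\|^2 \tag{2}$$

A parameter $\eta \in (0, 1)$ trades off supervised and graph losses, while $\alpha_\kappa \in [0, 1]$ captures the importance weight of $\kappa$ in semi-supervised learning. Note $\sum_{\kappa \in \mathcal{K}} \alpha_\kappa = 1$. Now, define the $\kappa$-participation matrix as $W^{(\kappa)} = [w^{(\kappa)}_{ij}]$ where each entry $w^{(\kappa)}_{ij}$ denotes the total weight of $\kappa$-motifs that vertices $i$ and $j$ participate in. We have, for $i \neq j$:

$$w^{(\kappa)}_{ij} = \sum_{q \in Q_\kappa} w_q \, \mathbb{1}[i \in q \wedge j \in q] \tag{3}$$

Observe that each pairwise loss term $\|x_i - x_j\|^2$ in Equation (1) appears with a total weight given by $w'_{ij} = \sum_{\kappa \in \mathcal{K}} \alpha_\kappa w^{(\kappa)}_{ij}$, using which we may simplify the graph loss as:

$$\mathcal{L}_g = \sum_{\kappa \in \mathcal{K}} \alpha_\kappa \, \mathcal{L}_{g,\kappa} = \frac{1}{2} \sum_{i, j \in V,\; i < j} w'_{ij} \, \|x_i - x_j\|^2 \tag{4}$$

Thus, Equation (4) establishes that the graph loss from Equation (1) is equivalent to that of label propagation (Zhu et al., 2003) on a modified graph with adjacency matrix $W' = \sum_{\kappa \in \mathcal{K}} \alpha_\kappa W^{(\kappa)}$, where each edge of the original graph has been re-weighted according to the total weight of $\kappa$-motifs it participates in, scaled by its importance $\alpha_\kappa$, and finally summed over all motifs of interest. We will use this connection to derive a closed-form solution to HOLS.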
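For an unweighted graph with edges and triangles as motifs, the modified adjacency can be sketched in a few lines of NumPy. This assumes every motif occurrence has weight 1, so the triangle participation of an edge is its number of common neighbours; the function names and the toy graph are illustrative.

```python
import numpy as np


def triangle_participation(A):
    """Triangle participation matrix of an unweighted, undirected graph.

    Entry (i, j) counts the triangles containing both i and j: it equals the
    number of common neighbours of i and j when (i, j) is an edge, else 0.
    """
    A = np.asarray(A, dtype=float)
    return (A @ A) * A


def modified_adjacency(A, alpha_triangle):
    """Edge and triangle participation combined with weights summing to 1."""
    A = np.asarray(A, dtype=float)
    return (1.0 - alpha_triangle) * A + alpha_triangle * triangle_participation(A)


# Toy graph (hypothetical): a 4-clique on vertices 0..3 plus a pendant vertex 4.
A = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

W_tri = triangle_participation(A)
W_mod = modified_adjacency(A, alpha_triangle=0.5)
print(W_tri[0, 1], W_tri[3, 4], W_mod[0, 1], W_mod[3, 4])
```

Edges inside the 4-clique are up-weighted because each lies in two triangles, while the pendant edge keeps only its edge contribution. The re-weighted matrix has the same sparsity pattern as the original adjacency.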

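Input: graph $G = (V, E)$, number of classes $C$, set of labeled vertices $V_\ell \subset V$ and their labels (at least one labeled vertex per class)
Parameters: motif set $\mathcal{K}$, motif weights $\{\alpha_\kappa\}$ such that $\sum_{\kappa \in \mathcal{K}} \alpha_\kappa = 1$, weight $\eta$ for supervised loss
Output: final label assignments for all vertices
procedure HigherOrderLabelSpreading($G$, $C$, $V_\ell$, labels)
    // Construct higher-order normalized graph Laplacian for regularization
    for $\kappa \in \mathcal{K}$ do
        Construct $\kappa$-participation matrix $W^{(\kappa)}$:
        $w^{(\kappa)}_{ij} \leftarrow$ total weight of $\kappa$-motifs where $i$ and $j$ appear together
    $W' \leftarrow \sum_{\kappa \in \mathcal{K}} \alpha_\kappa W^{(\kappa)}$
    $D' \leftarrow \mathrm{diag}(d'_{11}, \ldots, d'_{NN})$ where $d'_{ii} = \sum_j w'_{ij}$
    $\tilde{L}' \leftarrow I - D'^{-1/2} W' D'^{-1/2}$
    // Construct label matrices $Y$ (prior) and $X$ (inferred)
    $Y \leftarrow 0_{N \times C}$
    $y_{ic} \leftarrow 1$ for every labeled vertex $i \in V_\ell$ with label $c$
    $X \leftarrow Y$
    // Label inference using HOLS
    while not converged do
        $X \leftarrow (1 - \eta) \, D'^{-1/2} W' D'^{-1/2} X + \eta Y$    // Equation (8)
    $\ell_i \leftarrow \arg\max_c x_{ic}$ for every vertex $i \in V$
    return $\{\ell_i\}$
Algorithm 1 Higher-Order Label Spreading (HOLS)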

4.3. Closed-Form and Iterative Solutions

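Let $Y = [y_1, \ldots, y_N]^\top$ and $X = [x_1, \ldots, x_N]^\top$ be the $N \times C$ matrices of prior and inferred labels, where $N = |V|$ is the total number of vertices. Let $D'$ be the diagonal degree matrix for the modified graph adjacency $W'$. Thus, $d'_{ii} = \sum_j w'_{ij}$ and $d'_{ij} = 0$ if $i \neq j$. Let $L' = D' - W'$ be the Laplacian matrix for the modified graph. Using Equation (4), the loss in Equation (1) can be re-written in matrix format as:

$$\mathcal{L} = \frac{\eta}{2} \, \|X - Y\|_F^2 + \frac{1 - \eta}{2} \, \mathrm{tr}\left(X^\top L' X\right) \tag{5}$$

We also consider a version of the loss function which uses the normalized Laplacian $\tilde{L}' = I - D'^{-1/2} W' D'^{-1/2}$ for regularization:

$$\mathcal{L}_{HOLS} = \frac{\eta}{2} \, \|X - Y\|_F^2 + \frac{1 - \eta}{2} \, \mathrm{tr}\left(X^\top \tilde{L}' X\right) \tag{6}$$

Using $\tilde{L}'$ in place of $L'$ performs as well if not better in practice, and moreover provides certain theoretical guarantees (see Proposition 4.2, and also (Von Luxburg et al., 2008)). Therefore, we will use Equation (6) as the loss function for our higher-order label spreading and refer to it as $\mathcal{L}_{HOLS}$. A closed-form solution can now be obtained by differentiating $\mathcal{L}_{HOLS}$ with respect to $X$ and setting it to zero. Thus:

$$X = \eta \left(I - (1 - \eta) \, D'^{-1/2} W' D'^{-1/2}\right)^{-1} Y \tag{7}$$

Thus, using Equation (7), we are able to compute the optimal solution to HOLS, as long as the inverse of $I - (1 - \eta) D'^{-1/2} W' D'^{-1/2}$ exists. Due to the use of normalized Laplacian regularization, the following holds: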
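The differentiation step can be written out explicitly. The derivation below is a sketch under stated assumptions: the loss has the label-spreading form $\frac{\eta}{2}\|X - Y\|_F^2 + \frac{1-\eta}{2}\mathrm{tr}(X^\top (I - S) X)$, with $\eta$ the supervised-loss weight and $S$ shorthand for the symmetrically normalized modified adjacency. The symbol $S$ is introduced here for brevity only.

```latex
\begin{align*}
S &:= D'^{-1/2} W' D'^{-1/2} \\
\nabla_X \mathcal{L}_{HOLS}
  &= \eta \, (X - Y) + (1 - \eta)(I - S) X = 0 \\
\Longrightarrow \quad
  \bigl(I - (1 - \eta) S\bigr) X &= \eta \, Y \\
\Longrightarrow \quad
  X &= \eta \, \bigl(I - (1 - \eta) S\bigr)^{-1} Y
\end{align*}
```

The second line uses the symmetry of $S$, so that the gradient of $\frac{1}{2}\mathrm{tr}(X^\top (I - S) X)$ is $(I - S) X$.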

Proposition 4.1 (Generalized Label Spreading).

The proposed HOLS algorithm reduces to traditional label spreading (Zhou et al., 2003) for the base case of using only edge motifs, i.e., $\mathcal{K} = \{\text{edge}\}$.

This generalization grants HOLS its name. In practice, matrix inversion is computationally intensive and tends to be numerically unstable. Hence, we resort to an iterative approach to solve Equation (7) by first initializing $X$ to an arbitrary value and then repeatedly applying the following update:

$$X \leftarrow (1 - \eta) \, D'^{-1/2} W' D'^{-1/2} X + \eta \, Y \tag{8}$$

Proposition 4.2 describes the theoretical properties of this approach.

Proposition 4.2 (Convergence Guarantee for HOLS).

The iterative update in Equation (8) always converges to the unique fixed point given in Equation (7) for any choice of initial $X$.

This can be proved using the theory of sparse linear systems (Saad, 2003).
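One way to see why such an update converges is to unroll it. This is a sketch of the standard argument for label-spreading iterations, not the paper's proof. It assumes the update has the form $X^{(t+1)} = (1-\eta) S X^{(t)} + \eta Y$ with $\eta \in (0, 1)$ and $S$ the symmetrically normalized modified adjacency, a symbol introduced here for brevity.

```latex
\begin{align*}
X^{(t)} &= \bigl((1-\eta) S\bigr)^{t} X^{(0)}
          + \eta \sum_{j=0}^{t-1} \bigl((1-\eta) S\bigr)^{j} \, Y \\
\rho\bigl((1-\eta) S\bigr) &\le 1 - \eta < 1
  \quad \text{since the eigenvalues of } S \text{ lie in } [-1, 1] \\
\lim_{t \to \infty} X^{(t)} &= \eta \, \bigl(I - (1-\eta) S\bigr)^{-1} Y
\end{align*}
```

The first term vanishes because the spectral radius is below one, and the geometric series of matrices converges to the inverse. The limit therefore does not depend on the initial value $X^{(0)}$.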

The overall algorithm of HOLS is summarized in Algorithm 1. First, for each motif $\kappa \in \mathcal{K}$, construct its $\kappa$-participation matrix $W^{(\kappa)}$ by enumerating all its occurrences. Note that the enumerated occurrences are processed one by one on the fly to update the participation matrix and then discarded (no need for storage). Moreover, the enumeration for different motifs can be done in parallel. The participation matrices are combined into a single modified graph adjacency $W'$; applying the iterative updates from Equation (8) finally results in labels for the unlabeled vertices. In practice, the iterative updates are applied until entries in $X$ do not change up to a precision $\epsilon$ or until a maximum number of iterations $T$ is reached.
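The pipeline described above might look as follows in NumPy for the edges-plus-triangles case. This is a minimal dense-matrix sketch, not the authors' MATLAB implementation. The update rule is the standard label-spreading form with `eta` as the supervised-loss weight, and the default values of `alpha_triangle`, `eta`, `tol` and `max_iter` are assumptions. The toy graph mirrors the Alice example, with all of her neighbours assumed labeled.

```python
import numpy as np


def hols(A, labeled, num_classes, alpha_triangle=0.5, eta=0.5,
         tol=1e-9, max_iter=1000):
    """Sketch of higher-order label spreading with edges and triangles.

    A: dense symmetric 0/1 adjacency matrix without self-loops.
    labeled: dict mapping vertex index -> class index.
    alpha_triangle, eta, tol, max_iter: illustrative values (assumptions).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Modified adjacency: edges plus triangle participation counts.
    W = (1.0 - alpha_triangle) * A + alpha_triangle * ((A @ A) * A)
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # One-hot prior labels; rows of unlabeled vertices stay zero.
    Y = np.zeros((n, num_classes))
    for v, c in labeled.items():
        Y[v, c] = 1.0
    X = Y.copy()
    for _ in range(max_iter):
        X_new = (1.0 - eta) * (S @ X) + eta * Y
        converged = np.abs(X_new - X).max() < tol
        X = X_new
        if converged:
            break
    return X.argmax(axis=1), X, S, Y


# Toy graph: vertex 0 (Alice) has three mutually connected close friends
# (1, 2, 3; class 0) and four acquaintances (4..7; class 1).
A = np.zeros((8, 8))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
             (0, 4), (0, 5), (0, 6), (0, 7)]:
    A[i, j] = A[j, i] = 1.0
labeled = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

pred_edges, _, _, _ = hols(A, labeled, 2, alpha_triangle=0.0)  # edges only
pred_hols, X, S, Y = hols(A, labeled, 2, alpha_triangle=0.5)
print(pred_edges[0], pred_hols[0])
```

Under these assumed labels and weights, edge-only spreading assigns Alice the acquaintances' class, while the triangle-aware variant assigns her the class of her close friends. A sparse-matrix version would be needed for graphs of realistic size.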

4.4. Complexity Analysis

When only cliques are used as motifs for semi-supervised learning, the following space and time complexity bounds hold:

Proposition 4.3 (Space Complexity of HOLS).

The space complexity of HOLS for a graph with vertices, edges and classes is independent of motif size and number of motifs used, provided all motifs are cliques.

Proposition 4.4 (Time Complexity of HOLS).

The time complexity of HOLS over a graph with $M$ edges, $C$ classes and a degeneracy (core number) of $c$ using $k$-cliques as motifs is given by $O(k M (c/2)^{k-2})$ for the construction of the $k$-clique participation matrices plus $O(MC)$ per iterative update using Equation (8).

The proofs rely on the sparsity structure of the modified adjacency and also Theorem 5.7 of (Danisch et al., 2018). Despite the exponential complexity in $k$, we are able to enumerate cliques quickly using the sequential kClist algorithm (Danisch et al., 2018). For example, our largest Pokec dataset has 21M edges, 32M triangles, 43M 4-cliques and 53M 5-cliques; and the enumeration of each took at most 20 seconds on a stock laptop. Thus, HOLS remains fast and scalable when reasonably small cliques are used. Further, as we show in experiments, using triangles ($k = 3$) in addition to edges typically suffices to achieve the best classification performance across a wide range of datasets.
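To illustrate how ordering-based clique listing avoids reporting a clique more than once, here is a small sketch. It is not the kClist implementation: it ranks vertices by degree rather than by core ordering, and the function name is illustrative.

```python
def list_k_cliques(adj, k):
    """List every k-clique once, as a tuple of vertices.

    adj: dict vertex -> set of neighbours (undirected, no self-loops).
    Vertices are ranked by (degree, id); each clique is grown only towards
    higher-ranked common neighbours, so no clique is reported twice.
    """
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: r for r, v in enumerate(order)}
    forward = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}

    def extend(clique, candidates):
        if len(clique) == k:
            yield tuple(clique)
            return
        for v in sorted(candidates, key=rank.get):
            # Keep only candidates adjacent to v and ranked above it.
            yield from extend(clique + [v], candidates & forward[v])

    yield from extend([], set(adj))


# Complete graph on 5 vertices: C(5, k) cliques of each size k.
K5 = {i: set(range(5)) - {i} for i in range(5)}
print([len(list(list_k_cliques(K5, k))) for k in (2, 3, 4, 5)])
```

Because each occurrence is yielded one at a time, a caller can update the participation matrix on the fly and discard the occurrence, as described for HOLS.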

5. Experiments

We empirically evaluate the proposed algorithm on the four real-world network datasets described in Section 3.1.

5.1. Experimental Setup

We implement HOLS in MATLAB and run the experiments on MacOS with 2.7 GHz Intel Core i5 processor and 16 GB main memory.

Baselines. We compare HOLS to the following baselines: (1) Label Propagation (LP) (Zhu et al., 2003) which uses Laplacian regularization. (2) Label Spreading (LS) (Zhou et al., 2003) which uses normalized Laplacian regularization. (3) node2vec+TSVM which generates unsupervised vertex embeddings using node2vec (Grover and Leskovec, 2016) and learns decision boundaries in the embedding space using one-vs-rest transductive SVMs (Joachims, 1999). (4) Graph Convolutional Network (GCN) (Kipf and Welling, 2017) which is an end-to-end semi-supervised learner using neural networks. We implement LP and LS in MATLAB, and use open-sourced code for the rest.

Parameters. By default, we use a weight of for supervised loss and motifs (edges and triangles) for HOLS. The importance weight for triangles is tuned in for each dataset and results are reported on the best performing value. We use for LS as well. LP, LS and HOLS are run until labels converge to a precision of or until iterations are completed, whichever occurs sooner. We set and . We use the default hyperparameters for GCN, node2vec and TSVM. We supply 100, 20, 100 and 1000 labels for the EuEmail, PolBlogs, Cora and Pokec datasets respectively, where the vertices to label are chosen by stratified sampling based on class. These correspond to label fractions of 10%, 1.6%, 0.4% and 0.06% and, on average, 2.4, 10, 10 and 100 labeled vertices per class respectively.

Evaluation metrics. We quantify the success of semi-supervised learning using accuracy, which is the fraction of unlabeled vertices which are correctly classified. We also note down the end-to-end running time for all computation including any pre-processing such as clique enumeration, but excluding I/O operations.
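The accuracy metric as defined here excludes the vertices whose labels were supplied. A minimal sketch follows; the function name and the dictionary-based inputs are illustrative choices.

```python
def accuracy_on_unlabeled(predicted, truth, labeled):
    """Fraction of unlabeled vertices whose predicted label is correct.

    predicted, truth: dicts mapping vertex -> label.
    labeled: collection of vertices whose labels were given as input.
    """
    unlabeled = [v for v in truth if v not in labeled]
    if not unlabeled:
        raise ValueError("no unlabeled vertices to evaluate")
    correct = sum(predicted[v] == truth[v] for v in unlabeled)
    return correct / len(unlabeled)


truth = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b"}
predicted = {0: "a", 1: "b", 2: "b", 3: "b", 4: "a"}
print(accuracy_on_unlabeled(predicted, truth, labeled={0}))
```

Counting the supplied labels as correct predictions would inflate the score, most visibly on datasets where the labeled fraction is large.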

Method Accuracy Running time (seconds)
EuEmail PolBlogs Cora Pokec EuEmail PolBlogs Cora Pokec
Label Propagation (LP) (Zhu et al., 2003) 0.2905 0.5814 0.2765 0.1994 0.11 0.070 2.1 1320
Label Spreading (LS) (Zhou et al., 2003) 0.5228 0.9361 0.4921 0.5514 0.040 0.036 0.21 93
node2vec+TSVM (Grover and Leskovec, 2016; Joachims, 1999) 0.4563 0.9481 0.4233 T.L.E. 46 29 3060 >1 day
Graph Convolution Networks (GCN) (Kipf and Welling, 2017) 0.5251 0.9470 0.4673 0.5290 1.8 1.3 6.4 2880
HOLS (proposed) 0.5473 0.9476 0.4953 0.5593 0.089 0.083 0.41 117
Table 3. Accuracy and Running Time (averaged over five runs): In each column, the best value is bold and underlined, and the second best is underlined. Asterisk (*) denotes statistically significant differences compared to the second best.

5.2. Results

We present our experimental results. All reported values are averaged over five runs, each run differing in the set of labeled vertices.

Accuracy. Accuracies of all methods are summarized in Table 3 (left). The values for node2vec+TSVM on Pokec dataset are missing as the method did not terminate within 24 hours (‘T.L.E.’).

First, we observe in Table 3 (left) that HOLS consistently leads to statistically significant improvements over LS, showing that using higher-order structures for label spreading helps. In addition, HOLS outperforms all baselines in three out of four datasets. The improvements over the best baseline are statistically significant according to a two-sided micro-sign test (Yang and Liu, 1999) in at least three out of five runs. Importantly, for the smaller datasets (EuEmail and PolBlogs), while GCN outperforms LS, GCN loses to HOLS when triangles are used. node2vec+TSVM performs slightly better than HOLS on PolBlogs; however, the increase over HOLS is not statistically significant. For the larger datasets with labeled vertices, HOLS performs the best and LS follows closely.
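A sign test of this kind can be sketched with the standard library alone. This is a generic exact two-sided sign test on paired per-vertex correctness. How the paper pairs decisions and which threshold it applies are not stated in this section, so those details are assumptions, and the function name is illustrative.

```python
from math import comb


def sign_test_two_sided(correct_a, correct_b):
    """Exact two-sided sign test on paired correctness indicators.

    correct_a, correct_b: sequences of booleans, one entry per test vertex,
    telling whether method A (resp. B) classified that vertex correctly.
    Under the null hypothesis each disagreement favours A or B with
    probability 1/2.
    """
    wins_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    wins_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = wins_a + wins_b  # vertices on which the two methods disagree
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_value = sign_test_two_sided([True] * 10, [False] * 10)
print(p_value)
```

Only the vertices on which the two methods disagree carry information for this test. Vertices that both methods get right or both get wrong are ignored.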

Running Time. The running time of HOLS and all the baselines is summarized in Table 3 (right). Notably, we see that HOLS runs in less than 2 minutes for graphs with over 21 million edges (the Pokec graph), demonstrating its real-world practical scalability.

We observe that LS is the fastest of all methods and HOLS comes a close second for three out of four datasets. The small difference in running time predominantly stems from the construction of the triangle participation matrix. Furthermore, HOLS is over 15× faster than the recent GCN and node2vec+TSVM baselines, for comparable and often better values of accuracy.
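The headline figures can be recomputed directly from Table 3. The snippet below copies the LS, GCN and HOLS rows and derives the relative accuracy gain of HOLS over LS and the running-time ratio of GCN to HOLS. The dictionary layout is an illustrative choice.

```python
# Values copied from Table 3 (accuracy, and running time in seconds).
datasets = ["EuEmail", "PolBlogs", "Cora", "Pokec"]
accuracy = {
    "LS":   dict(zip(datasets, [0.5228, 0.9361, 0.4921, 0.5514])),
    "HOLS": dict(zip(datasets, [0.5473, 0.9476, 0.4953, 0.5593])),
}
seconds = {
    "GCN":  dict(zip(datasets, [1.8, 1.3, 6.4, 2880])),
    "HOLS": dict(zip(datasets, [0.089, 0.083, 0.41, 117])),
}

# Relative accuracy improvement of HOLS over LS, in percent.
relative_gain = {
    d: 100.0 * (accuracy["HOLS"][d] - accuracy["LS"][d]) / accuracy["LS"][d]
    for d in datasets
}
# How many times faster HOLS is than GCN.
speedup_vs_gcn = {d: seconds["GCN"][d] / seconds["HOLS"][d] for d in datasets}

for d in datasets:
    print(f"{d}: +{relative_gain[d]:.2f}% accuracy, {speedup_vs_gcn[d]:.1f}x faster than GCN")
```

The largest relative gain is on EuEmail, and the smallest speedup over GCN is on Cora.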

Figure 3. Variation of accuracy with maximum clique size (left), and importance weight to triangles (right).
Figure 4. Case studies from PolBlogs dataset showing extended ego-networks of vertices v702 (left) and v1153 (right) which are both incorrectly classified by LS but correctly classified when triangles are taken into account using HOLS.

Accuracy vs. Maximum Clique Size. Fixing the motif set as all cliques of size up to $k$, we vary $k$ to study the marginal benefit of including higher-order cliques in graph SSL. The motif weights are tuned in , ensuring that edges are given a weight for a connected graph, and further, all motif weights sum to 1. The best performing motif weights were used to generate Figure 3(a), which plots the relative improvement in accuracy over LS that uses edges only. We note that label spreading via higher-order structures strictly outperforms label spreading via edges. The gain is largest when using 3-cliques (triangles). Subsequent higher-order cliques did not lead to additional performance gain in most cases, presumably because their occurrences tend to be concentrated around a few high-degree vertices.

Accuracy vs. Importance Weight To Triangles. Fixing the motif set to edges and triangles, we vary the importance weight to triangles in to understand its effect on accuracy. Figure 3(b) shows that the accuracy gain of HOLS over LS increases with an increase in triangle weight for most graphs. The only exception is Cora, where the accuracy gain grows until before decreasing and eventually turning negative. Overall, triangles consistently help over a large range of motif weights.

Case Studies. In Figure 4, we look at real examples from the PolBlogs dataset to dig deeper into when HOLS improves over LS. Question marks denote the central vertices v702 and v1153 of interest, with ground truth labels ‘blue’ and ‘red’ respectively. The direct neighbors of both vertices are unlabeled and a few second-hop neighbors are labeled with one of two labels: ‘blue’ or ‘red’.

In both cases, both LS and HOLS classify the unlabeled 1-hop neighbors correctly. However, LS, relying only on the edge-level information (roughly the ratio of blue to red labeled 2-hop neighbors in this case), incorrectly labels both v702 and v1153. The proposed HOLS, on the other hand, accurately recognizes that v702 (v1153) is more tightly connected with its blue (red) neighbors via the higher-order triangle structures and thus leads to the correct classification.

6. Conclusion

In this paper, we demonstrated that label homogeneity–the tendency of vertices participating in a higher-order structure to share the same label–is a prevalent characteristic in real-world graphs. We created an algorithm to exploit the signal present in higher-order structures for more accurate semi-supervised learning over graphs. Experiments on real-world data showed that using triangles along with edges for label spreading leads to statistically significant accuracy gains compared to the use of edges alone.

This work opens the avenue for several exciting future research directions. First, we need principled measures quantifying label homogeneity to aid comparison across diverse graphs and higher-order structures. Next, having seen the improvements in real-world graphs, it becomes fundamental to understand the benefits in random graph models. Finally, it is crucial to develop algorithms exploiting higher-order structures which can cater to the increasingly heterogeneous and dynamic nature of real-world graphs at scale.

Acknowledgments. This material is based upon work supported by the National Science Foundation under Grants No. CNS-1314632 and IIS-1408924, by Adobe Inc., and by Pacific Northwest National Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In ICML (Proceedings of Machine Learning Research), Vol. 97. PMLR, 21–29.
  • AbuOda et al. (2019) Ghadeer AbuOda, Gianmarco De Francisci Morales, and Ashraf Aboulnaga. 2019. Link Prediction via Higher-Order Motif Features. CoRR abs/1902.06679 (2019).
  • Adamic and Glance (2005) Lada A. Adamic and Natalie S. Glance. 2005. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD. ACM, 36–43.
  • Akoglu et al. (2013) Leman Akoglu, Rishi Chandy, and Christos Faloutsos. 2013. Opinion Fraud Detection in Online Reviews by Network Effects. In ICWSM. The AAAI Press.
  • Albert and Barabási (2002) Réka Albert and Albert-László Barabási. 2002. Statistical mechanics of complex networks. Reviews of modern physics 74, 1 (2002), 47.
  • Belkin et al. (2004) Mikhail Belkin, Irina Matveeva, and Partha Niyogi. 2004. Regularization and Semi-supervised Learning on Large Graphs. In COLT (Lecture Notes in Computer Science), Vol. 3120. Springer, 624–638.
  • Benson et al. (2018) Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon M. Kleinberg. 2018. Simplicial closure and higher-order link prediction. Proc. Natl. Acad. Sci. U.S.A. 115, 48 (2018), E11221–E11230.
  • Benson et al. (2016) Austin R Benson, David F Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science 353, 6295 (2016), 163–166.
  • Danisch et al. (2018) Maximilien Danisch, Oana Denisa Balalau, and Mauro Sozio. 2018. Listing k-cliques in Sparse Real-World Graphs. In WWW. ACM, 589–598.
  • Eswaran et al. (2017) Dhivya Eswaran, Stephan Günnemann, Christos Faloutsos, Disha Makhija, and Mohit Kumar. 2017. ZooBP: Belief Propagation for Heterogeneous Networks. PVLDB 10, 5 (2017), 625–636.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In KDD. ACM, 855–864.
  • Hanneman and Riddle (2005) Robert A Hanneman and Mark Riddle. 2005. Introduction to social network methods.
  • Jackson (2010) Matthew O Jackson. 2010. Social and economic networks. Princeton University Press.
  • Jackson et al. (2012) Matthew O Jackson, Tomas Rodriguez-Barraquer, and Xu Tan. 2012. Social capital and social quilts: Network patterns of favor exchange. American Economic Review 102, 5 (2012), 1857–97.
  • Jain and Seshadhri (2017) Shweta Jain and C. Seshadhri. 2017. A Fast and Provable Method for Estimating Clique Counts Using Turán’s Theorem. In WWW. ACM, 441–449.
  • Joachims (1999) Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In ICML. Morgan Kaufmann, 200–209.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR (Poster). OpenReview.net.
  • Kumar et al. (2018) Srijan Kumar, Bryan Hooi, Disha Makhija, Mohit Kumar, Christos Faloutsos, and VS Subrahmanian. 2018. Rev2: Fraudulent user prediction in rating platforms. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 333–341.
  • Kumar et al. (2019) Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1269–1278.
  • Leskovec et al. (2007) Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. TKDD 1, 1 (2007), 2.
  • McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual review of sociology 27, 1 (2001), 415–444.
  • Newman (2003) Mark EJ Newman. 2003. Mixing patterns in networks. Physical Review E 67, 2 (2003), 026126.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning of social representations. In KDD. ACM, 701–710.
  • Rossi et al. (2019) Ryan A. Rossi, Anup Rao, Sungchul Kim, Eunyee Koh, Nesreen K. Ahmed, and Gang Wu. 2019. Higher-Order Ranking and Link Prediction: From Closing Triangles to Closing Higher-Order Motifs. CoRR abs/1906.05059 (2019).
  • Saad (2003) Yousef Saad. 2003. Iterative methods for sparse linear systems. Vol. 82. SIAM.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI Magazine 29, 3 (2008), 93–106.
  • Sizemore et al. (2017) Ann Sizemore, Chad Giusti, and Danielle S Bassett. 2017. Classification of weighted networks through mesoscale homological features. Journal of Complex Networks 5, 2 (2017), 245–273.
  • Subelj and Bajec (2013) Lovro Subelj and Marko Bajec. 2013. Model of complex networks based on citation dynamics. In WWW (Companion Volume). International World Wide Web Conferences Steering Committee / ACM, 527–530.
  • Takac and Zabovsky (2012) Lubos Takac and Michal Zabovsky. 2012. Data analysis in public social networks. In International Scientific Conference & International Workshop Present Day Trends of Innovations.
  • Talukdar and Crammer (2009) Partha Pratim Talukdar and Koby Crammer. 2009. New Regularized Algorithms for Transductive Learning. In ECML/PKDD (2) (Lecture Notes in Computer Science), Vol. 5782. Springer, 442–457.
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale Information Network Embedding. CoRR abs/1503.03578 (2015).
  • Vazquez et al. (2003) Alexei Vazquez, Alessandro Flammini, Amos Maritan, and Alessandro Vespignani. 2003. Global protein function prediction from protein-protein interaction networks. Nature biotechnology 21, 6 (2003), 697.
  • Von Luxburg et al. (2008) Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. 2008. Consistency of spectral clustering. The Annals of Statistics (2008), 555–586.
  • Weston et al. (2008) Jason Weston, Frédéric Ratle, and Ronan Collobert. 2008. Deep learning via semi-supervised embedding. In ICML (ACM International Conference Proceeding Series), Vol. 307. ACM, 1168–1175.
  • Yang and Liu (1999) Yiming Yang and Xin Liu. 1999. A Re-Examination of Text Categorization Methods. In SIGIR. ACM, 42–49.
  • Yang et al. (2016) Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In ICML (JMLR Workshop and Conference Proceedings), Vol. 48. JMLR.org, 40–48.
  • Yedidia et al. (2003) Jonathan S Yedidia, William T Freeman, and Yair Weiss. 2003. Understanding belief propagation and its generalizations. In Exploring artificial intelligence in the new millennium. Morgan Kaufmann Publishers Inc., 239–269.
  • Yin et al. (2018) Hao Yin, Austin R Benson, and Jure Leskovec. 2018. Higher-order clustering in networks. Physical Review E 97, 5 (2018), 052306.
  • Zhou et al. (2003) Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with Local and Global Consistency. In NIPS. MIT Press, 321–328.
  • Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. 2003. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In ICML. AAAI Press, 912–919.