Through carefully examining our collected graph-structured data, we find that they generally come from various sources such as social networks, citation networks, and communication networks, where a tremendous amount of both content and linkage information exists. For instance, data on many social platforms like Twitter, Flickr, and Facebook include features of users (e.g., basic personal details, texts, images, IP) and their relations (e.g., buying the same item, being friends). These rich content data are sufficient to support subsequent mining tasks without additional guidance: if two entities exhibit extremely similar features, there is a high probability of a link between them (link prediction), and they are likely to belong to the same category (classification); if two entities both link to the same entity, they probably have similar characteristics (recommendation). In this sense, preserving and extracting as much information as possible from information networks into the embedding space facilitates learning high-quality, expressive representations that perform well in mining tasks without any form of supervision. Unsupervised graph representation learning is a more favorable choice in many cases due to its freedom from labels, particularly when we intend to benefit from large-scale unlabeled data in the wild.
To fully inherit the rich information in graphs, in this paper, we perform graph embedding based on Mutual Information (MI) maximization, inspired by the empirical success of the Deep InfoMax method (Hjelm et al., 2018), which operates on images. To discover useful representations, Deep InfoMax trains the encoder to maximize MI between its inputs (i.e., the images) and outputs (i.e., the hidden vectors). When considering Deep InfoMax in the graph domain, the first obstacle is how to define MI between graphs and hidden vectors, since the topology of graphs is more complicated than that of images (see Figure 1). One of the challenges is to ensure that the MI function between each node’s hidden representation and its neighborhood input features obeys the symmetric property, or equivalently, is invariant to permutations of the neighborhood. As one recent work considering MI, Deep Graph Infomax (DGI) (Veličković et al., 2018) first embeds an input graph and a corresponding corrupted graph, then summarizes the input graph as a vector via a readout function, and finally maximizes MI between this summary vector and the hidden representations by discriminating the input graph (positive sample) from the corrupted graph (negative sample). Figure 1 gives an overview of DGI. Maximizing this kind of MI is proved to be equivalent to maximizing the MI between the input node features and hidden vectors, but this equivalence holds only under several preset conditions, e.g., the readout function should be injective, which seem overly restrictive in real cases. Even if we can guarantee the existence of an injective readout function by design, e.g., the one used in DeepSets (Zaheer et al., 2017), its injectivity is also affected by how its parameters are trained. That is to say, an originally injective function still risks becoming non-injective if it is trained without any external supervision. And if the readout function is not injective, the input graph information contained in a summary vector will diminish as the size of the graph increases. Moreover, DGI stays at a coarse graph/patch-level MI maximization. Hence in DGI, there is no guarantee that the encoder can distill sufficient information from the input data, as it never explicitly correlates hidden representations with their original inputs.
In this paper, we put forward a more straightforward way to consider MI in terms of graphical structures without using any readout function and corruption function. We directly derive MI by comparing the input (i.e., the sub-graph consisting of the input neighborhood) and the output (i.e., the hidden representation of each node) of the encoder. And interestingly, our theoretical derivations demonstrate that the directly-formulated MI can be decomposed into a weighted sum of local MIs between each neighborhood feature and the hidden vector. In this way, we have decomposed the input features and made the MI computation tractable. Moreover, this form of MI can easily satisfy the symmetric property if we adjust the values of weights. We defer more details to § 3.1. As the above MI is mainly measured at the level of node features, we term it as Feature Mutual Information (FMI).
Two issues about FMI remain: 1. the combining weights are still unknown, and 2. it does not take the topology (i.e., the edge features) into account. To address these two issues, we define our Graphical Mutual Information (GMI) measurement based on FMI. In particular, GMI applies an intuitive value assignment by setting the weights in FMI equal to the proximity between each neighbor and the target node in the representation space. To retain the topology information, GMI further correlates these weights with the input edge features via an additional mutual information term. The resulting GMI is topologically invariant and also calculable with Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018). The main contributions of our work are as follows:
Concepts: We generalize the conventional MI estimation to the graph domain and propose a new concept of Graphical Mutual Information (GMI) accordingly. GMI is free from the potential risk caused by the readout function since it considers MI between input graphs and high-level embeddings in a straightforward pattern.
Algorithms: Through our theoretical analysis, we give a tractable and calculable form of GMI which decomposes the entire GMI into a weighted sum of local MIs. With the help of the MINE method, GMI maximization can be easily achieved at the node level.
Experimental Findings: We verify the effectiveness of GMI on several popular node classification and link prediction tasks, including both transductive and inductive ones. The experiments demonstrate that our method delivers promising performance on a variety of benchmarks and sometimes even outperforms its supervised counterparts.
2. Related Work
In line with the focus of our work, we briefly review the previous work in the two following areas: 1. mutual information estimation, and 2. neural networks for learning representation over graphs.
Mutual information estimation. As the InfoMax principle (Bell and Sejnowski, 1995) advocates maximizing MI between the inputs and outputs of neural networks, many methods such as ICA algorithms (Hyvärinen and Pajunen, 1999; Almeida, 2003) attempt to employ the idea of MI in unsupervised feature learning. Nonetheless, these methods cannot be generalized to deep neural networks easily due to the difficulty in calculating MI between high-dimensional continuous variables. Fortunately, Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018) makes the estimation of MI on deep neural networks feasible by training a statistics network as a classifier to distinguish samples coming from the joint distribution from those coming from the product of marginals of two random variables. Specifically, MINE uses the exact KL-based formulation of MI, while a non-KL alternative, the Jensen-Shannon divergence (JSD) (Nowozin et al., 2016), can be used when the precise value of MI is not of concern.
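For reference, the two estimators mentioned above can be written out as follows. This is a sketch in generic notation that is an assumption of this note rather than the notation of the paper: $T_\theta$ stands for the statistics network, $P_{XZ}$ for the joint distribution, $P_X \otimes P_Z$ for the product of marginals, and $\mathrm{sp}(a) = \log(1 + e^{a})$ for the softplus.

```latex
% Donsker-Varadhan lower bound on the KL-based MI, as used by MINE
I(X; Z) \;\ge\; \sup_{\theta}\;
  \mathbb{E}_{P_{XZ}}\!\left[ T_\theta(x, z) \right]
  - \log \mathbb{E}_{P_X \otimes P_Z}\!\left[ e^{T_\theta(x, z)} \right]

% Jensen-Shannon-based objective: useful for maximization,
% but its value is not an estimate of the exact MI
\hat{I}^{\mathrm{JSD}}_{\theta}(X; Z) \;=\;
  \mathbb{E}_{P_{XZ}}\!\left[ -\mathrm{sp}\!\left( -T_\theta(x, z) \right) \right]
  - \mathbb{E}_{P_X \otimes P_Z}\!\left[ \mathrm{sp}\!\left( T_\theta(x, z) \right) \right]
```

Both objectives reward a statistics network that scores joint samples high and marginal samples low; they differ in whether the resulting number can be read as an MI value.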
Neural networks for graph representation learning. With the rapid development of graph neural networks (GNNs), a large number of GNN-based graph representation learning algorithms have been proposed in recent years, which exhibit stronger performance than traditional random walk-based and factorization-based embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Cao et al., 2015; Grover and Leskovec, 2016; Qiu et al., 2018). Typically, these methods can be divided into supervised and unsupervised categories. Among them, there is a rich literature on supervised representation learning over graphs (Kipf and Welling, 2016a; Veličković et al., 2017; Chen et al., 2018; Zhang et al., 2018; Ding et al., 2018). Despite their variety in network architecture, they achieve empirical success with the help of labels that are often not accessible in realistic scenarios. In this case, unsupervised graph learning methods (Hamilton et al., 2017; Duran and Niepert, 2017; Veličković et al., 2018) have broader application potential. One well-known method is GraphSAGE (Hamilton et al., 2017), an inductive framework that trains GNNs with a random-walk based objective in its unsupervised setting. More recently, DGI (Veličković et al., 2018) applies the idea of MI maximization to the graph domain and obtains strong performance in an unsupervised manner. However, DGI implements a coarse-grained maximization (i.e., maximizing MI at the graph/patch level), which makes it difficult to preserve the delicate information in the input graph. Besides, the condition imposed on the readout function used in DGI seems overly restrictive in real cases. By contrast, we focus on removing the restriction of the readout function and arriving at graphical mutual information maximization at the node level by directly maximizing MI between inputs and outputs of the encoder. Representations derived by our method retain more of the input graph information, which ensures their potential for downstream graph mining tasks, e.g., node classification, link prediction, and recommendation.
3. Graphical Mutual Information: Definition and Maximization
Prior to going further, we first provide the preliminary concepts used in this paper. Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote a graph with $N$ nodes $v_i \in \mathcal{V}$ and edges $e_{ij} = (v_i, v_j) \in \mathcal{E}$. The node features, with assumed empirical probability distribution $\mathbb{P}$, are given by $X = \{x_1, \dots, x_N\} \in \mathbb{R}^{N \times D}$, where $x_i \in \mathbb{R}^{D}$ denotes the feature for node $v_i$. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ represents edge connections, where $A_{ij}$ associated to edge $e_{ij}$ could be a real number or a multi-dimensional vector. (Our method is generally applicable to graphs with edge features, although we only consider edges with real weights in our experiments.)
The goal of graph representation learning is to learn an encoder $f: \mathbb{R}^{N \times D} \times \mathbb{R}^{N \times N} \rightarrow \mathbb{R}^{N \times D'}$, such that the hidden vectors $H = \{h_1, \dots, h_N\} = f(X, A)$ indicate high-level representations for all nodes. The encoding process can be rewritten in a node-wise form. To show this, we define $X_i$ and $A_i$ for node $i$ as, respectively, the features of its neighbors and the corresponding adjacency matrix restricted to those neighbors. Particularly, $X_i$ consists of all $k$-hop neighbors of $v_i$ with $k \le l$ when the encoder $f$ is an $l$-layer GNN, and it contains the node $i$ itself if we further add self-loops in the adjacency matrix. Here, we call the sub-graph expanded by $X_i$ and $A_i$ the support graph for node $i$, denoted by $\mathcal{G}_i$. With the definition of support graph, the encoding for each node becomes $h_i = f(\mathcal{G}_i) = f(X_i, A_i)$.
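The node set of a support graph can be illustrated with a short breadth-first search. This is a minimal sketch under stated assumptions: the function name `support_nodes`, the adjacency-list input, and the reading of "k-hop neighbor" as shortest-path distance are choices made here for illustration and are not taken from the paper.

```python
from collections import deque


def support_nodes(adj_list, node, num_layers, self_loops=True):
    """Return the node set of the support graph of `node`.

    adj_list   : dict mapping each node to an iterable of its neighbours
    num_layers : number of encoder layers l; neighbours up to l hops are kept
    self_loops : if True, the node itself is part of its own support graph
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == num_layers:
            continue  # do not expand beyond l hops
        for v in adj_list.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    members = {v for v, d in dist.items() if 1 <= d <= num_layers}
    if self_loops:
        members.add(node)
    return members


# A path graph 0 - 1 - 2 - 3 used as a toy example.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
two_hop_support = support_nodes(path, 0, 2)
```

With a two-layer encoder, node 0 of the path graph sees itself and nodes 1 and 2, while node 3 lies outside its support graph.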
Difficulties in defining graphical mutual information. In Deep InfoMax (Hjelm et al., 2018), the training objective of the encoder is to maximize MI between its inputs and outputs. The MI is estimated by employing a statistics network as a discriminator to classify samples coming from the joint distribution and the ones drawn from the product of marginals. Naturally, when adapting the idea of Deep InfoMax to graphs, we should maximize MI between the representation $h_i$ and the support graph $\mathcal{G}_i$ for each node. We denote such graphical MI as $I(h_i; \mathcal{G}_i)$. However, defining $I(h_i; \mathcal{G}_i)$ is not straightforward. The difficulties are:
The graphical MI should be invariant with respect to the node index. In other words, we should have $I(h_i; \mathcal{G}_i) = I(h_i; \mathcal{G}'_i)$ if $\mathcal{G}_i$ and $\mathcal{G}'_i$ are isomorphic to each other.
If we adopt the MINE method for MI calculation, the discriminator in MINE only accepts inputs of a fixed size. This is infeasible for $\mathcal{G}_i$, as different $\mathcal{G}_i$ usually include different numbers of nodes and thus are of distinct sizes.
To get around the issue of defining graphical mutual information, this section begins with introducing the concept of Feature Mutual Information (FMI) that only relies on node features. Upon the inspiration from the decomposition of FMI, we then define Graphical Mutual Information (GMI), which takes both the node features and graph topology into consideration.
3.1. Feature Mutual Information
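We denote the empirical probability distribution of node features $X_i$ as $p(X_i)$, the probability of $h_i$ as $p(h_i)$, and the joint distribution by $p(h_i, X_i)$. According to information theory, the MI between $h_i$ and $X_i$ is defined as
$$I(h_i; X_i) = \int_{\mathcal{H}} \int_{\mathcal{X}} p(h_i, X_i) \log \frac{p(h_i, X_i)}{p(h_i)\, p(X_i)} \, dh_i \, dX_i. \tag{1}$$
Interestingly, we have the following mutual information decomposition theorem for computing $I(h_i; X_i)$.
Theorem 1 (Mutual Information Decomposition).
If the conditional probability $p(h_i \mid X_i)$ is multiplicative (see the definition of multiplicative in (Renner and Maurer, 2002)), the global mutual information $I(h_i; X_i)$ defined in Eq. (1) can be decomposed as a weighted sum of local MIs, namely,
$$I(h_i; X_i) = \sum_{j=1}^{i_n} w_{ij}\, I(h_i; x_j), \tag{2}$$
where $x_j$ is the $j$-th neighbor of node $i$, $i_n$ is the number of all elements in $X_i$, and the weight $w_{ij}$ satisfies $\frac{1}{i_n} \le w_{ij} \le 1$ for each $j$.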
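To prove the above theorem, we first introduce two lemmas and a definition.
Lemma 2.
For any random variables $X$, $Y$, and $Z$, we have
$$I(X, Y; Z) \ge I(X; Z). \tag{3}$$
Proof. By the chain rule of mutual information, $I(X, Y; Z) = I(X; Z) + I(Y; Z \mid X)$, and the conditional MI $I(Y; Z \mid X)$ is non-negative. Thus we achieve $I(X, Y; Z) \ge I(X; Z)$. ∎
Definition 3.
The conditional probability $p(h \mid X_1, \dots, X_n)$ is called multiplicative if it can be written as a product
$$p(h \mid X_1, \dots, X_n) = r_1(h, X_1) \cdots r_n(h, X_n) \tag{4}$$
with appropriate functions $r_1, \dots, r_n$.
Lemma 4.
If $p(Z \mid X, Y)$ is multiplicative, then we have
$$I(X; Z) + I(Y; Z) \ge I(X, Y; Z). \tag{5}$$
Proof. See (Renner and Maurer, 2002) for detailed proof. ∎
Now all the necessities for proving Theorem 1 are in place.
Proof of Theorem 1. According to Lemma 2, for any $j$ we have
$$I(h_i; X_i) = I(h_i; x_1, \dots, x_{i_n}) \ge I(h_i; x_j). \tag{6}$$
Averaging this inequality over $j$ gives
$$I(h_i; X_i) \ge \frac{1}{i_n} \sum_{j=1}^{i_n} I(h_i; x_j). \tag{7}$$
On the other hand, based on Lemma 4, we get
$$I(h_i; X_i) \le \sum_{j=1}^{i_n} I(h_i; x_j). \tag{8}$$
Then the above two formulas could deduce the following
$$\frac{1}{i_n} \sum_{j=1}^{i_n} I(h_i; x_j) \le I(h_i; X_i) \le \sum_{j=1}^{i_n} I(h_i; x_j). \tag{9}$$
Since every $I(h_i; x_j)$ is non-negative, there exist weights $\frac{1}{i_n} \le w_{ij} \le 1$ for which Eq. (2) holds. ∎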
With the decomposition in Theorem 1, we can calculate the right side of Eq. (2) via MINE, as the inputs of the discriminator now become the pairs $(h_i, x_j)$, whose size always stays the same (i.e., $D'$-by-$D$). Besides, we can adjust the weights to reflect the isomorphic transformation of input graphs. For instance, if $X_i$ only contains one-hop neighbors of node $i$, setting all weights to be identical will lead to the same MI for the input nodes in different orders.
Despite some benefits of the decomposition, it is hard to characterize the exact values of the weights since they are related to the values of $I(h_i; x_j)$ and their underlying probability distributions. A trivial way is to set all weights to $\frac{1}{i_n}$; then maximizing the right side of Eq. (2) is equivalent to maximizing a lower bound of $I(h_i; X_i)$, by which the true FMI is also maximized to some extent. Besides this method, we additionally provide an enhanced solution that treats the weights as trainable attentions, which is the topic of the next subsection.
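The sandwich implied by the decomposition can be checked numerically. The sketch below assumes that local MI estimates are already available as plain numbers; the function names `fmi_bounds` and `weighted_fmi` and the sample values are hypothetical and only serve to illustrate the uniform-weight choice.

```python
def fmi_bounds(local_mis):
    """Lower and upper bounds on the global MI implied by Theorem 1.

    The lower bound is the mean of the local MIs (all weights equal to
    1 / i_n) and the upper bound is their plain sum (all weights equal to 1).
    """
    if not local_mis:
        raise ValueError("need at least one neighbour")
    if any(v < 0 for v in local_mis):
        raise ValueError("mutual information is non-negative")
    total = sum(local_mis)
    return total / len(local_mis), total


def weighted_fmi(local_mis, weights):
    """Weighted sum of local MIs with weights restricted to [1 / i_n, 1]."""
    n = len(local_mis)
    if len(weights) != n:
        raise ValueError("one weight per neighbour is required")
    tol = 1e-12
    if any(w < 1.0 / n - tol or w > 1.0 + tol for w in weights):
        raise ValueError("weights must lie in [1 / i_n, 1]")
    return sum(w * v for w, v in zip(weights, local_mis))


local = [0.2, 0.4, 0.6]            # hypothetical local MI estimates
lower, upper = fmi_bounds(local)
uniform = weighted_fmi(local, [1.0 / 3] * 3)
```

With uniform weights the weighted sum coincides with the lower bound, which is why maximizing it pushes up a quantity that never exceeds the true FMI.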
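Table 1 (fragment): Dataset statistics.
|Task||Dataset||Type||#Nodes||#Edges||#Features||#Classes|
|Link prediction||Cora||Citation network||2,708||5,429||1,433||7|
Note: The task on PPI is a multilabel classification problem.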
3.2. Topology-Aware Mutual Information
Inspired by the decomposition in Theorem 1, we attempt to construct trainable weights from the other aspect of graphs (i.e., the topological view) so that the values of $w_{ij}$ can be more flexible and capture the inherent property of graphs. Ultimately we derive the definition of Graphical Mutual Information (GMI).
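Definition 5 (Graphical Mutual Information).
The MI between the hidden vector $h_i$ and its support graph $\mathcal{G}_i = (X_i, A_i)$ is defined as
$$I(h_i; \mathcal{G}_i) := \sum_{j=1}^{i_n} w_{ij}\, I(h_i; x_j) + I(w_{ij}; a_{ij}), \quad \text{with } w_{ij} = \sigma(h_i^{\top} h_j), \tag{10}$$
where $x_j$ and $i_n$ are defined as in Theorem 1, $a_{ij}$ is the edge weight/feature in the adjacency matrix $A$, and $\sigma(\cdot)$ is the sigmoid function.
Intuitively, the weight $w_{ij}$ in the first term of Eq. (10) measures the contribution of a local MI to the global one. We implement the contribution of $I(h_i; x_j)$ by the similarity between the representations $h_i$ and $h_j$ (i.e., $w_{ij} = \sigma(h_i^{\top} h_j)$). Meanwhile, the term $I(w_{ij}; a_{ij})$ maximizes MI between $w_{ij}$ and the edge weight/feature of the input graph (i.e., $a_{ij}$) to enforce $w_{ij}$ to conform to topological relations. In this sense, the degree of the contribution is consistent with the proximity in topological structure, in line with the common view that $w_{ij}$ should be larger if node $j$ is “closer” to node $i$ and smaller otherwise. This strategy compensates for the flaw that FMI only focuses on node features and makes local MIs contribute to the global one adaptively. For the idea of attention behind this strategy, see the attention-based GCN (Veličković et al., 2017).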
Note that the definition in Eq. (10) is applicable to general cases. For certain specific situations, we can slightly modify Eq. (10) for efficiency. For example, when dealing with unweighted graphs (namely, the edge value is 1 if connected and 0 otherwise), we could replace the second MI term with a negative cross-entropy loss. Minimizing the cross-entropy also contributes to MI maximization, and it is cheaper to compute. We defer more details to the next section.
The definition in Eq. (10) has several benefits. First, this kind of MI is invariant to the isomorphic transformation of input graphs. Second, it is computationally feasible, as each component on the right side can be estimated by MINE. More importantly, GMI is more powerful than DGI in capturing original input information due to its explicit correlation between hidden vectors and input features of both nodes and edges at a fine-grained node level.
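The invariance claim can be illustrated with a toy computation of the weighted first term. This is a sketch under assumptions: the weights are taken to be the sigmoid of the inner product between hidden vectors, the local MI values are hypothetical numbers standing in for estimator outputs, and all function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adaptive_weights(h_i, neighbour_hs):
    """Proximity-based weights: one sigmoid(h_i . h_j) per neighbour j."""
    return [sigmoid(dot(h_i, h_j)) for h_j in neighbour_hs]


def weighted_local_mi(h_i, neighbour_hs, local_mis):
    """Weighted sum of local MIs, i.e. the first term of the GMI definition."""
    weights = adaptive_weights(h_i, neighbour_hs)
    return sum(w * mi for w, mi in zip(weights, local_mis))


h_i = [0.5, -1.0]
neighbours = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
local_mis = [0.3, 0.1, 0.2]  # hypothetical local MI estimates

value = weighted_local_mi(h_i, neighbours, local_mis)

# Re-index the neighbours: the weighted sum must not change.
order = [2, 0, 1]
permuted = weighted_local_mi(
    h_i,
    [neighbours[k] for k in order],
    [local_mis[k] for k in order],
)
```

Because each weight depends only on the pair of hidden vectors it belongs to, relabeling the neighbors reorders the summands without changing the sum.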
3.3. Maximization of GMI
Now we directly maximize the right side of Eq. (10) with the help of MINE. Note that MINE estimates a lower bound of MI with the Donsker-Varadhan (DV) (Donsker and Varadhan, 1983) representation of the KL-divergence between the joint distribution and the product of the marginals. As we focus more on maximizing MI than on obtaining its specific value, other non-KL alternatives such as the Jensen-Shannon MI estimator (JSD) (Nowozin et al., 2016) and the Noise-Contrastive estimator (infoNCE) (Oord et al., 2018) could be employed to replace it. Based on the experimental findings and analysis in (Hjelm et al., 2018), we resort to the JSD estimator in this paper for the sake of effectiveness and efficiency, since the infoNCE estimator is sensitive to negative sampling strategies (the number of negative samples) and thus may become a bottleneck for large-scale datasets with a fixed available memory. On the contrary, the insensitivity of the JSD estimator to negative sampling strategies and its respectable performance on many tasks make it more suitable for our task. In particular, we calculate $I(h_i; x_j)$ in the first term of Eq. (10) by
$$I(h_i; x_j) = -\mathrm{sp}\big(-\mathcal{D}_w(h_i, x_j)\big) - \mathbb{E}_{\tilde{\mathbb{P}}}\big[\mathrm{sp}\big(\mathcal{D}_w(h_i, x'_j)\big)\big], \tag{11}$$
where $\mathcal{D}_w$ is a discriminator constructed by a neural network with parameter $w$, $x'_j$ is a negative sample drawn from $\tilde{\mathbb{P}} = \mathbb{P}$, and $\mathrm{sp}(x) = \log(1 + e^{x})$ denotes the soft-plus function.
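The JSD-based estimate for one positive pair and a handful of negatives can be sketched in a few lines. The sketch assumes the discriminator scores are already computed and passed in as plain floats; the function names are illustrative and the expectation over negatives is approximated by a sample mean.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(pos_score, neg_scores):
    """JSD-style local MI estimate from discriminator scores.

    pos_score  : discriminator score of the true pair
    neg_scores : scores of pairs built from negative samples
    """
    if not neg_scores:
        raise ValueError("at least one negative sample is required")
    positive_term = -softplus(-pos_score)
    negative_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return positive_term - negative_term


uninformed = jsd_mi_estimate(0.0, [0.0] * 5)     # discriminator cannot tell pairs apart
confident = jsd_mi_estimate(10.0, [-10.0] * 5)   # discriminator separates them well
```

The estimate is always negative and approaches zero from below as the discriminator separates positive from negative pairs, so only its relative value is meaningful during training.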
As mentioned in § 3.2, we maximize $I(w_{ij}; a_{ij})$ via calculating its cross-entropy instead of using the JSD estimator, since the graphs we deal with in experiments are unweighted. Formally, we compute
$$I(w_{ij}; a_{ij}) = a_{ij} \log w_{ij} + (1 - a_{ij}) \log (1 - w_{ij}). \tag{12}$$
By maximizing $I(h_i; \mathcal{G}_i)$ with the sum of Eq. (11) and Eq. (12) over all hidden vectors $H$, we arrive at our complete objective function for GMI optimization. Besides, we can further add trade-off parameters to balance Eq. (11) and Eq. (12) for more flexibility.
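The edge term for unweighted graphs and its combination with the feature term can be sketched as follows. The trade-off names `alpha` and `beta`, the clipping constant `eps`, and the function names are assumptions made for this illustration; the source only states that trade-off parameters may be added.

```python
import math


def edge_term(w_ij, a_ij, eps=1e-12):
    """Negative binary cross-entropy between a weight in (0, 1) and a 0/1 edge."""
    w = min(max(w_ij, eps), 1.0 - eps)  # clip to keep the logarithms finite
    return a_ij * math.log(w) + (1.0 - a_ij) * math.log(1.0 - w)


def gmi_objective(feature_terms, edge_terms, alpha=1.0, beta=1.0):
    """Quantity to maximise: weighted feature terms plus weighted edge terms.

    feature_terms : already-weighted local MI estimates (first term of Eq. (10))
    edge_terms    : values of edge_term for the corresponding node pairs
    alpha, beta   : hypothetical trade-off parameters
    """
    return alpha * sum(feature_terms) + beta * sum(edge_terms)


agrees = edge_term(0.9, 1.0)      # high weight on an existing edge
disagrees = edge_term(0.1, 1.0)   # low weight on an existing edge
total = gmi_objective([-1.0, -0.5], [-0.2], alpha=0.8, beta=1.0)
```

A gradient-based optimizer would minimize the negative of `gmi_objective`; the edge term is largest when weights are close to 1 on connected pairs and close to 0 on unconnected ones.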
4. Experiments
In this section, we empirically evaluate the performance of GMI on two common tasks: node classification (transductive and inductive) and link prediction. An additional, relatively fair comparison between GMI and two other unsupervised algorithms (EP-B and DGI) further exhibits its effectiveness. We also provide t-SNE visualizations and analyze the influence of model depth.
4.1. Datasets
To assess the quality of our approach in each task, we adopt 4 or 5 benchmark datasets commonly used in previous work (Kipf and Welling, 2016a; Hamilton et al., 2017; Veličković et al., 2018). Detailed statistics are given in Table 1.
In the classification task, Cora, Citeseer, and PubMed (Sen et al., 2008) (available at https://github.com/tkipf/gcn) are citation networks where nodes correspond to documents and edges represent citations. Each document is associated with a bag-of-words representation vector and belongs to one of the predefined classes. Following the transductive setup in (Kipf and Welling, 2016a; Veličković et al., 2018), training is conducted on all nodes, and 1000 test nodes are used for evaluation. Reddit (available at http://snap.stanford.edu/graphsage/) is a large social network consisting of numerous interconnected Reddit posts created during September 2014 (Hamilton et al., 2017). Posts are treated as nodes, and an edge connects two posts if the same user comments on both. The class label is the community, and our objective is to predict which community each post belongs to. PPI (available from the same source as Reddit) is a protein-protein interaction dataset that contains multiple graphs related to different human tissues (Zitnik and Leskovec, 2017). The positional gene sets, motif gene sets, and immunological signatures are viewed as node features, and each node has a total of 121 labels given by gene ontology sets. The goal is to classify protein functions across different PPI graphs. Following the inductive setup in (Hamilton et al., 2017), on Reddit, we feed posts made in the first 20 days into the model for training, while the remaining posts are used for testing (with 30% used for validation); on PPI, there are 20 graphs for training, 2 for validation, and 2 for testing. It should be emphasized that, for Reddit and PPI, testing is carried out on unseen (untrained) nodes and graphs, while the first three datasets are used for transductive learning.
In the link prediction task, BlogCatalog (available at http://dmml.asu.edu/users/xufei/datasets.html) is a social blogging website where bloggers follow each other and register their blogs under 6 predefined categories. The tags of blogs are taken as node features. Flickr (available from the same source as BlogCatalog) is an image sharing website where users interact with others and form a social network. Users upload photos under 9 predefined classes and select attached tags to reflect their interests, which provide attribute information. The description of Cora and PPI is omitted for brevity. Following the experimental settings and evaluation metrics in (Grover and Leskovec, 2016), given a graph with a certain portion of edges removed, we aim to predict these missing links. For Cora, BlogCatalog, and Flickr, we randomly delete 20%, 50%, and 70% of edges while ensuring that the network obtained after the edge removal remains connected, and use the damaged network for training. For PPI, we directly treat part of the edges not seen during training as prediction targets instead of deleting edges manually.
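One way to delete a fraction of edges while keeping the network connected is a single greedy pass over a shuffled edge list. This is an illustrative sketch, not the procedure used in the paper: the function names, the undirected edge-tuple representation, and the rounding of the target count are assumptions, and the connectivity check is quadratic, so it only suits small graphs.

```python
import random


def is_connected(nodes, edges):
    """Check connectivity of an undirected graph with a depth-first search."""
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)


def remove_edges(nodes, edges, fraction, seed=0):
    """Remove up to `fraction` of the edges without disconnecting the graph.

    Returns (kept_edges, removed_edges). An edge is skipped when removing it
    would disconnect the graph, so fewer edges may be removed than requested.
    """
    rng = random.Random(seed)
    target = int(round(fraction * len(edges)))
    order = list(edges)
    rng.shuffle(order)
    kept = list(edges)
    removed = []
    for edge in order:
        if len(removed) >= target:
            break
        trial = [e for e in kept if e != edge]
        if is_connected(nodes, trial):
            kept = trial
            removed.append(edge)
    return kept, removed


# Toy example: the complete graph on four nodes, half of the edges removed.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
kept, removed = remove_edges(nodes, edges, 0.5, seed=1)
```

The removed edges then serve as positive test examples, and the kept edges form the damaged training network.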
4.2. Experimental Settings
Encoder design. We resort to a standard Graph Convolutional Network (GCN) model with the following layer-wise propagation rule as the encoder for both classification and link prediction tasks:
$$H^{(l+1)} = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big),$$
where $\hat{A} = A + I_N$, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, $H^{(l)}$ and $H^{(l+1)}$ are the input and output matrices of the $l$-th layer, and $W^{(l)}$ is a layer-specific trainable weight matrix. Here the nonlinear transformation $\sigma$ we applied is the PReLU function (parametric ReLU) (He et al., 2015). It should be recognized that for node $i$, the neighborhood $X_i$ in its support graph contains node $i$ itself, as self-loops are inserted through $\hat{A}$.
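A single GCN layer of this kind can be sketched in plain Python on small dense matrices. The sketch assumes the standard symmetric normalization with self-loops and uses a fixed PReLU slope of 0.25 for illustration, whereas in a trained model the slope and the weight matrix are learned; all function names are illustrative.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def prelu(x, slope=0.25):
    """PReLU with a fixed slope (an assumption; normally a trainable parameter)."""
    return x if x >= 0 else slope * x


def gcn_layer(adj, h, w, slope=0.25):
    """One propagation step: normalized adjacency times features times weights."""
    s = normalize_adjacency(adj)
    z = matmul(matmul(s, h), w)
    return [[prelu(v, slope) for v in row] for row in z]


# Two connected nodes with scalar features 1 and 3.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
out_pos = gcn_layer(adj, features, [[1.0]])
out_neg = gcn_layer(adj, features, [[-1.0]])
```

In the two-node example each node averages its own feature with its neighbor's, so both rows of the output are equal; a negative weight shows the PReLU slope at work.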
To be more specific, the encoder we employed on Citeseer and PubMed is a one-layer GCN with the output dimension as . And on Cora, Reddit, BlogCatalog, Flickr, and PPI, we utilize a two-layer GCN as our encoder. Here, we have hidden dimensions as in each GCN layer. Note that utilizing the similar GCN encoder for both transductive and inductive classification task makes our proposed method easier to follow and scale to large networks than DGI, since DGI has to design varying encoders to adapt to distinct learning tasks, especially the encoders used in inductive tasks are too intricate and complicated, which are not friendly to practical applications.
The discriminator uses a trainable scoring matrix, and the activation function we employed is the sigmoid, which converts scores into probabilities of being a positive example.
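A discriminator built from a scoring matrix and a sigmoid can be sketched as below. The bilinear form (hidden vector, scoring matrix, input feature) is an assumption made for illustration, since the surrounding text only mentions a trainable scoring matrix and a sigmoid activation; the names `bilinear_discriminator` and `scoring_matrix` are hypothetical.

```python
import math


def bilinear_discriminator(h, scoring_matrix, x):
    """Score a (hidden vector, input feature) pair with sigmoid(h^T B x).

    h              : hidden vector of length d_out
    scoring_matrix : d_out x d_in matrix (trainable in a real model)
    x              : input feature vector of length d_in
    """
    if len(scoring_matrix) != len(h):
        raise ValueError("scoring matrix must have one row per entry of h")
    if any(len(row) != len(x) for row in scoring_matrix):
        raise ValueError("scoring matrix must have one column per entry of x")
    raw = sum(h[r] * scoring_matrix[r][c] * x[c]
              for r in range(len(h))
              for c in range(len(x)))
    return 1.0 / (1.0 + math.exp(-raw))


neutral = bilinear_discriminator([1.0, 0.0], [[0.0] * 3, [0.0] * 3], [3.0, 4.0, 5.0])
aligned = bilinear_discriminator([1.0, 2.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [3.0, 4.0, 5.0])
```

Because the matrix maps between the hidden and input dimensions, the discriminator input has a fixed size regardless of how many neighbors a node has.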
Implementation details. For the weight $w_{ij}$ of the first term in Eq. (10), we have two ways to set its value in experiments. The first is to keep $w_{ij} = \sigma(h_i^{\top} h_j)$, which makes local MIs contribute to the global one adaptively, and we term this variant GMI-adaptive. The other is to let $w_{ij} = \frac{1}{i_n}$, i.e., the left endpoint of the interval to which $w_{ij}$ belongs (refer to Theorem 1), which means the contribution of each local MI is equal, and we term this variant GMI-mean. Both GMI-mean and GMI-adaptive are included in the comparison with baselines.
All experiments are implemented in PyTorch (Ketkar, 2017) with Glorot initialization (Glorot and Bengio, 2010) and conducted on a single Tesla P40 GPU. In preprocessing, we perform row normalization on Cora, Citeseer, PubMed, BlogCatalog, and Flickr following (Kipf and Welling, 2016a), and apply the processing strategy in (Hamilton et al., 2017) on Reddit and PPI. During training, we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 on all seven datasets. As suggested by (Veličković et al., 2018), we adopt an early stopping strategy with a window size of 20 on Cora, Citeseer, and PubMed, while training the model for a fixed number of epochs on the inductive datasets (20 on Reddit, 50 on PPI). The number of negative samples is set to 5. Due to the large scale of Reddit and PPI, we use the subsampling technique introduced in (Hamilton et al., 2017) to make them fit into GPU memory. In detail, a minibatch of 256 nodes is first selected, and then for each selected node, we uniformly sample 8 and 5 neighbors at its first- and second-level neighborhoods, respectively. We adopt the one-hop neighborhood to construct the support graph in experiments and utilize a compressed input feature to calculate FMI, since using the original input feature causes GPU memory overflow. The trade-off parameters are tuned in the range of [0,1] to balance Eq. (11) and Eq. (12). The Batch Normalization strategy (Ioffe and Szegedy, 2015) is employed to train our model on Reddit and PPI.
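The two-level neighbor subsampling can be sketched with the standard library. This is a simplified illustration under assumptions: neighbors are sampled without replacement and all of them are kept when a node has fewer than the requested number, which may differ from the reference GraphSAGE implementation; the function names and the adjacency-list input are hypothetical.

```python
import random


def sample_neighbors(adj_list, node, k, rng):
    """Uniformly sample at most k neighbours of `node` without replacement."""
    nbrs = list(adj_list.get(node, ()))
    if len(nbrs) <= k:
        return nbrs
    return rng.sample(nbrs, k)


def sample_support(adj_list, batch, fanouts=(8, 5), seed=0):
    """Sample a layered neighbourhood for a minibatch of nodes.

    Returns a list of layers: layers[0] is the batch itself, layers[1] the
    sampled first-level neighbours, layers[2] the second-level ones, and so on.
    """
    rng = random.Random(seed)
    layers = [list(batch)]
    frontier = list(batch)
    for k in fanouts:
        nxt = []
        for v in frontier:
            nxt.extend(sample_neighbors(adj_list, v, k, rng))
        layers.append(nxt)
        frontier = nxt
    return layers


# Toy star graph: node 0 is connected to 20 leaves.
star = {0: list(range(1, 21))}
star.update({leaf: [0] for leaf in range(1, 21)})
layers = sample_support(star, [0], fanouts=(8, 5), seed=3)
```

Capping the fan-out at each level bounds the size of every support graph, which is what keeps a minibatch of 256 nodes within a fixed memory budget.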
For the classification task, we feed the learned embeddings of the training set to a logistic regression classifier and report the results on the test nodes (Kipf and Welling, 2016a; Hamilton et al., 2017). Specifically, in transductive learning, we use the mean classification accuracy over 50 runs to evaluate the performance, while the micro-averaged F1 score averaged over 50 runs is used in inductive learning. For PPI, as suggested by (Veličković et al., 2018), we standardize the learned embeddings before feeding them into the logistic regression classifier. For the link prediction task, the criterion we adopt is AUC, the area under the ROC curve. The negative samples involved in the calculation of AUC are generated by randomly selecting an equal number of node pairs with no connections in the original graph. The closer the AUC score is to 1, the better the algorithm performs. We report the AUC score averaged over 10 runs.
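The AUC criterion and the negative-pair sampling can be sketched as follows. This is an illustration under assumptions: the pairwise-comparison form of AUC is used for clarity rather than speed, negative pairs are drawn by enumerating all non-edges (feasible only for small graphs), and the function names are hypothetical.

```python
import random


def auc_score(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sample_negative_pairs(num_nodes, edges, count, seed=0):
    """Draw `count` distinct unordered node pairs that are not edges of the graph."""
    existing = {frozenset(e) for e in edges}
    candidates = [(u, v)
                  for u in range(num_nodes)
                  for v in range(u + 1, num_nodes)
                  if frozenset((u, v)) not in existing]
    if count > len(candidates):
        raise ValueError("not enough unconnected pairs")
    return random.Random(seed).sample(candidates, count)


perfect = auc_score([0.9, 0.8], [0.1, 0.2])
mixed = auc_score([0.9, 0.3], [0.5, 0.1])
negatives = sample_negative_pairs(4, [(0, 1), (1, 2), (2, 3)], 3)
```

Sampling as many negatives as there are held-out edges keeps the two classes balanced, so an uninformative scorer lands at an AUC of 0.5.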
Transductive tasks:
|Category||Algorithm||Training data||Cora||Citeseer||PubMed|
|Unsupervised||Raw features||X||56.6 ± 0.4||57.8 ± 0.2||69.1 ± 0.2|
|Unsupervised||EP-B||X, A||78.1 ± 1.5||71.0 ± 1.4||79.6 ± 2.1|
|Unsupervised||DGI||X, A||82.3 ± 0.6||71.8 ± 0.7||76.8 ± 0.6|
|Unsupervised||GMI-mean (ours)||X, A||82.7 ± 0.2||73.0 ± 0.3||80.1 ± 0.2|
|Unsupervised||GMI-adaptive (ours)||X, A||83.0 ± 0.3||72.4 ± 0.1||79.9 ± 0.2|
|Supervised||Planetoid-T||X, A, Y||75.7||62.9||75.7|
|Supervised||GCN||X, A, Y||81.5||70.3||79.0|
|Supervised||GAT||X, A, Y||83.0 ± 0.7||72.5 ± 0.7||79.0 ± 0.3|
|Supervised||GWNN||X, A, Y||82.8||71.7||79.1|
Inductive tasks:
|Category||Algorithm||Training data||Reddit||PPI|
|Unsupervised||DGI||X, A||94.0 ± 0.10||63.8 ± 0.20|
|Unsupervised||GMI-mean (ours)||X, A||95.0 ± 0.02||65.0 ± 0.02|
|Unsupervised||GMI-adaptive (ours)||X, A||94.9 ± 0.02||64.6 ± 0.03|
|Supervised||GAT||X, A, Y||-||97.3 ± 0.20|
|Supervised||FastGCN||X, A, Y||93.7||-|
|Supervised||GaAN||X, A, Y||96.4 ± 0.03||98.7 ± 0.02|
Table 2: Classification accuracies (with standard deviation) in percent on transductive tasks and micro-averaged F1 scores on inductive tasks. The third column lists the data used by each algorithm in the training phase, where X, A, and Y denote features, adjacency matrix, and labels, respectively.
Transductive learning. Table 2 reports the mean classification accuracy of our method and other baselines on transductive tasks. Here, results for EP-B (Duran and Niepert, 2017), DGI (Veličković et al., 2018), Planetoid-T (Yang et al., 2016), GAT (Veličković et al., 2017), as well as GWNN (Xu et al., 2019) are taken from their original papers, and results for DeepWalk (Perozzi et al., 2014), LP (Label Propagation) (Zhu et al., 2003), and GCN (Kipf and Welling, 2016a) are copied from Kipf and Welling (2016a). As for raw features, we feed them into a logistic regression classifier for training and give the results on the test features. (Strictly speaking, this experiment belongs to inductive learning, as testing is conducted on unseen features; for comparison, we put it in this part.) Although we provide experimental results of both supervised and unsupervised methods, in this paper we focus more on comparing against unsupervised ones, which are consistent with our setup.
As can be observed, our proposed GMI-mean and GMI-adaptive achieve the best classification accuracy among unsupervised methods across all three datasets. We attribute this strong performance to directly maximizing graphical MI between input and output pairs of the encoder at a fine-grained node level. As a result, the encoded representation preserves as much of the node feature and topology information in the input graph as possible, which contributes to classification. By contrast, EP-B ignores the underlying information between input data and learned representations, and DGI stays at a graph/patch-level MI maximization, which restricts their capability of preserving and extracting the original input information into the embedding space and leads to slightly weaker performance on classification tasks. Besides, without the guidance of labels, our method exhibits results comparable to some supervised models like GCN and GAT, and even better results than them on Citeseer and PubMed. We believe that representations learned via GMI maximization between inputs and outputs inherit the rich information in the graph, which is enough for classification. More notable is that many available labels are themselves assigned based on the information in the input graph. So keeping as much information as possible from the input can compensate for the information provided by the labels to some extent, which sustains the performance of GMI in downstream graph mining tasks. It could be argued that learning from original inputs without labels has the potential to yield higher-quality representations than the supervised setting, since extremely sparse training labels may cause overfitting and the given labels might not be reliable.
|Objective||Cora||Citeseer||PubMed||Reddit||PPI|
|EP-B loss||79.4 ± 0.1||69.3 ± 0.2||78.6 ± 0.2||93.8 ± 0.03||61.8 ± 0.04|
|DGI loss||82.2 ± 0.2||72.2 ± 0.2||78.9 ± 0.3||94.3 ± 0.02||62.3 ± 0.02|
|FMI (ours)||78.3 ± 0.1||72.0 ± 0.2||79.1 ± 0.3||94.7 ± 0.03||64.8 ± 0.03|
|GMI-mean (ours)||82.7 ± 0.1||73.0 ± 0.3||80.1 ± 0.2||95.0 ± 0.02||65.0 ± 0.02|
|GMI-adaptive (ours)||83.0 ± 0.3||72.4 ± 0.1||79.9 ± 0.2||94.9 ± 0.02||64.6 ± 0.03|
Inductive learning. Table 2 also summarizes the micro-averaged F1 scores of GMI and other baselines on Reddit and PPI. We cite the results of DGI, GAT, FastGCN (Chen et al., 2018), and GaAN (Zhang et al., 2018) from their original papers, while results for the remaining seven compared methods are taken from Hamilton et al. (2017) (here we reuse the unsupervised GraphSAGE results to match our setup). As before, the comparison with unsupervised algorithms is the emphasis of our work.
GMI-mean and GMI-adaptive outperform all other competing unsupervised algorithms on Reddit and PPI, which substantiates the effectiveness of GMI maximization in the inductive classification domain (generalization to unseen nodes). Interestingly, the result of our method on Reddit is competitive with some advanced supervised models, but the situation on PPI is quite different. After conducting further analysis, we note that 42% of nodes have zero feature values in PPI, which means the feature matrix is very sparse (Hamilton et al., 2017). In this case, relying solely on the input graph limits the performance of unsupervised approaches, including DGI and our method, whereas learning in a supervised fashion exhibits much better performance due to the auxiliary information brought by additional labels.
Evaluation on two variants of GMI. According to Table 2, the two variants of GMI (GMI-mean and GMI-adaptive), which use different strategies to measure the contribution of each local MI (details in § 4.2), achieve competitive results with each other, but GMI-adaptive exhibits slightly weaker performance than GMI-mean on most datasets. We conjecture that this might be due to the training difficulties brought by the nature of adaptive learning, and that the performance of GMI-adaptive could be improved with a more advanced training strategy. In this sense, GMI-mean is more practical and can be regarded as the representative variant in practice.
4.4. Effectiveness of Objective Function
To further clarify the effectiveness of maximizing graphical MI in unsupervised graph representation learning and to provide a relatively fair comparison with DGI and EP-B (two unsupervised algorithms), we replace our objective function with each of their loss functions in turn, while keeping all other experimental settings unchanged. Table 3 lists the results under the transductive and inductive setups. As can be observed, GMI (GMI-mean and GMI-adaptive) achieves stronger performance across all five datasets, which suggests that DGI and EP-B overlook some considerations in the graph representation learning task. Specifically, the EP-B loss only imposes constraints on each node and its neighbors at the output level (embedding space); it ignores the interaction between the input and output pairs of the encoder, which results in a poor ability to retain the valid information in the input. As for DGI, although it implicitly correlates hidden representations with their original input features, it considers MI at the graph/patch level, which is somewhat coarse. Interestingly, the improvement of our FMI (which uses no topology information) over DGI grows as the graph size increases. We attribute this to the degradation of the readout function, which causes DGI to lose information useful for node classification as graphs grow, although DGI performs well on small graphs such as Cora and Citeseer. When the topology of the input graph is also taken into account, our GMI outperforms all other losses on all datasets. Furthermore, note that the whole training process of GMI is similar to the training of discriminators in generative models (Goodfellow et al., 2014; Nowozin et al., 2016), and GMI empirically exhibits a training speed comparable to that of EP-B and DGI on the largest dataset, Reddit, which demonstrates its good scalability.
Table 5. t-SNE plots of the learned embeddings (columns: Raw features, FMI (ours), GMI (ours), DGI).
4.5. Link Prediction
Based on the above experimental results, we find that DGI is a strong competitor to GMI among unsupervised algorithms. Therefore, in this section, we further investigate the performance of DGI and GMI on another mining task: link prediction. Here we choose FMI and GMI-mean to compare with DGI. Table 4 reports their AUC scores on four different datasets. Under different edge removal rates, GMI and FMI both remarkably outperform DGI (except FMI at 70.0% on BlogCatalog), showing that measuring graphical MI between the input graph and the output representations in a fine-grained manner is capable of capturing rich information in the inputs and delivering good generalization ability. As for DGI, on the one hand, its graph/patch-level MI maximization is relatively coarse, which limits its performance in such a fine-grained link prediction task; on the other hand, an inappropriate corruption function weakens the ability of DGI to learn accurate representations for predicting missing links. Recall that the negative sample for the discriminator in DGI is generated by corrupting the original input graph, and that a well-designed corruption function is indispensable and requires some skillful strategies (Veličković et al., 2018). In this task, we still adopt the feature shuffling function, which shows the best results in the classification task, to build negative samples. However, when the input graph is incomplete in terms of topological links, the guidance provided by the corrupted graph as a negative label in the discriminator becomes unreliable because the input graph itself is inaccurate, leading to poor performance. Therefore, the need for a task-oriented corruption function is a weakness of DGI. In contrast, our GMI is free from this issue, as it eliminates the corruption function and directly maximizes graphical MI between the inputs and outputs of the encoder. Furthermore, it can be observed that FMI is competitive with GMI in most cases, and on Flickr FMI is even superior to GMI. We attribute this to the benefits of direct and fine-grained feature mutual information maximization at the node level. Based on the homophily hypothesis (McPherson et al., 2001) (i.e., entities in a network with similar features are likely to interconnect), the input feature information preserved in the learned embeddings gives FMI a good capability for inferring missing links.
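As an illustration of how an AUC score can be obtained from learned embeddings, the sketch below scores a candidate edge by the inner product of its endpoint embeddings and computes AUC as the fraction of (removed edge, non-edge) pairs that are ranked correctly. The inner-product scorer, the function names, and the toy embeddings are assumptions made for this example; the text above does not specify the scoring function used for Table 4.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def edge_score(emb: Dict[int, Sequence[float]], edge: Edge) -> float:
    """Assumed scorer: inner product of the two endpoint embeddings."""
    u, v = edge
    return sum(a * b for a, b in zip(emb[u], emb[v]))


def auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a positive outranks a negative; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical 2-d embeddings for four nodes.
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
removed_edges = [(0, 1), (2, 3)]  # held-out true links (positives)
non_edges = [(0, 2), (1, 3)]      # sampled node pairs without a link (negatives)

pos = [edge_score(emb, e) for e in removed_edges]
neg = [edge_score(emb, e) for e in non_edges]
link_auc = auc(pos, neg)
```

The pairwise loop is quadratic and only meant to make the definition explicit; on real graphs a rank-based computation would be preferable.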
4.6. Visualization
For an intuitive illustration, Table 5 displays t-SNE (Maaten and Hinton, 2008) plots of the learned embeddings on Cora and Citeseer. Qualitatively, the distributions of the embeddings learned by FMI and DGI appear similar, and the embeddings generated by GMI exhibit more discernible clusters than raw features, FMI, and DGI. On Cora in particular, the clusters, which correspond to the seven topic categories, are clearly compact and well separated. For quantitative analysis, we measure clustering quality with the Silhouette Coefficient score (Rousseeuw, 1987). Specifically, we employ the silhouette_score function from the scikit-learn Python package (Pedregosa et al., 2011) with all default settings and follow the user guide to perform the evaluation. The clustering of embeddings learned via GMI obtains a Silhouette Coefficient score of 0.425 on Cora, 0.402 on Citeseer, and 0.385 on PubMed, while DGI obtains 0.417, 0.391, and 0.373 and EP-B obtains 0.384, 0.385, and 0.379 on the three datasets, respectively. Both qualitatively and quantitatively, these results support the rationality and effectiveness of graphical mutual information maximization in unsupervised graph representation learning.
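For reference, the Silhouette Coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster; the reported score is the mean over all samples. The plain-Python sketch below follows that definition with Euclidean distance, which is also the default metric of scikit-learn's `silhouette_score(X, labels)`. The helper name `mean_silhouette` and the toy points are illustrative assumptions, and the choice of cluster labels (for example, ground-truth classes) is not specified in the text above.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean Silhouette Coefficient with Euclidean distance."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical toy data: two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the geometry
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the geometry
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate that samples sit closer to another cluster than to their own.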
4.7. Influence of Model Depth
In this part, we adjust the number of convolutional layers in the encoder to investigate the influence of model depth on classification accuracy. Considering the potential difficulty of training deep neural networks, as noted by He et al. (2016), we also experiment with a residual counterpart of the standard GMI model, which adds identity shortcut connections between every two hidden layers to ease the training of deep networks. Here, we keep the same number of features for each hidden layer and apply identity shortcuts starting from the second layer, since the input and output of the first layer do not have the same dimension. Moreover, in contrast to the standard GMI model, which maximizes GMI between the final representation and the original input graph, we consider another variant, called dense GMI, which maximizes GMI between each hidden layer and the input graph. Figure 2 gives a detailed illustration of the architectures. The hyperparameters involved remain unchanged, except that we train for a fixed number of epochs (600 on Cora and Citeseer) without early stopping. Results are plotted in Figure 3.
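The forward pass of such an encoder can be sketched as follows. This is a simplified illustration in plain Python, assuming a layer of the form ReLU(Â H W) with a pre-normalized adjacency matrix Â and a shortcut that adds a layer's input to its output; the function names, the toy graph, and the weights are hypothetical, and the actual architecture is the one shown in Figure 2.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, x) for x in row] for row in m]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode(adj_norm: Matrix, feats: Matrix, weights: List[Matrix], residual: bool) -> List[Matrix]:
    """Return the output of every hidden layer; the last entry is the final embedding."""
    hidden: List[Matrix] = []
    h = feats
    for layer, w in enumerate(weights):
        out = relu(matmul(matmul(adj_norm, h), w))
        # The first layer changes the feature dimension, so shortcuts start at the second layer.
        if residual and layer >= 1:
            out = add(out, h)
        hidden.append(out)
        h = out
    return hidden


# Hypothetical toy input: 2 nodes, 3 input features, 2 hidden features per layer.
adj_norm = [[0.5, 0.5], [0.5, 0.5]]
feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # layer 1: 3 -> 2 features
    [[1.0, 0.0], [0.0, 1.0]],              # layer 2: 2 -> 2 features
]
plain = encode(adj_norm, feats, weights, residual=False)
res = encode(adj_norm, feats, weights, residual=True)
```

Under this sketch, the standard model would use only the last entry of the returned list in the GMI objective, whereas the dense variant would attach one GMI term to every entry.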
On the one hand, increasing the model depth significantly widens the performance gap between models with and without shortcut connections. The best result on Cora is obtained with a two-layer GCN encoder, while the best result on Citeseer is achieved with a one-layer GCN encoder. Besides the fact that greater depth makes training without shortcut connections difficult, we also conjecture that the information from farther neighborhoods brought in by multiple convolutional layers may act as noise for learning a node’s own representation. Specifically, different proximities between neighbors imply different degrees of similarity: if two nodes are far enough apart, they are likely to be completely different. Therefore, in the standard GMI model, the information aggregated from farther neighborhoods might contain much noise that is dissimilar to the characteristics of the node itself, which degrades the quality of the learned embeddings and the subsequent classification performance. In contrast, the additional identity shortcuts enable the model to carry over the information of the previous layer’s input, which can be regarded as passing the information of similar, nearby neighborhoods from shallower layers to deeper layers; thus the residual version is relatively less vulnerable to model depth. On the other hand, we observe that the dense GMI variant can also alleviate the performance deterioration to some extent, although MI tends to decay with depth by the data processing inequality (Cover and Thomas, 2012). This is thanks to maximizing graphical MI between the output of each layer and the input graph, which imposes a direct constraint on each hidden layer to preserve the input information as intact as possible. Based on this observation, enforcing an MI maximization constraint on hidden layers to reduce the loss of information could be a good practice when training deep neural networks.
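For completeness, the data processing inequality referred to above can be stated in its generic form. The symbols $X$, $Y$, $Z$ and $h^{(l)}$ are notation assumed for this illustration, and the chain over layers holds under the idealized assumption that each layer depends on the input only through the output of the previous layer:

```latex
% Generic form: further processing of Y cannot add information about X.
X \rightarrow Y \rightarrow Z
\quad\Longrightarrow\quad
I(X; Z) \;\le\; I(X; Y).

% Applied to a stack of L layers with outputs h^{(1)}, \dots, h^{(L)}:
X \rightarrow h^{(1)} \rightarrow \cdots \rightarrow h^{(L)}
\quad\Longrightarrow\quad
I\big(X; h^{(1)}\big) \;\ge\; I\big(X; h^{(2)}\big) \;\ge\; \cdots \;\ge\; I\big(X; h^{(L)}\big).
```

The inequality only bounds what deeper layers can retain; it does not by itself guarantee that a given layer retains much, which is why an explicit per-layer MI term can still help.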
5. Conclusion
To overcome the dilemma of lacking available supervision and to avoid the potential risk brought by unreliable labels, we introduce a novel concept, graphical mutual information (GMI), to carry out graph representation learning in an unsupervised manner. Its core lies in directly maximizing the mutual information between the input and output of a graph neural encoder in terms of node features and topological structure. Through theoretical analysis, we give a definition of GMI and decompose it into a weighted sum that can be easily computed with the existing mutual information estimation method MINE. Accordingly, we develop an unsupervised model and evaluate it on two common graph mining tasks. The results show that GMI outperforms state-of-the-art unsupervised baselines on both classification (transductive and inductive) and link prediction tasks, and is sometimes even competitive with supervised algorithms. Future work will concentrate on task-oriented representation learning or on adapting the idea of GMI maximization to other types of graphs, such as heterogeneous graphs and hypergraphs.
Acknowledgements. This work was supported by the National Key Research and Development Program of China (No. 2018AAA0101400), the National Natural Science Foundation of China (No. 61872287 and No. 61532015), the Innovative Research Group of the National Natural Science Foundation of China (No. 61721002), the Innovation Research Team of the Ministry of Education (IRT_17R86), and the Project of China Knowledge Center for Engineering Science and Technology. In addition, this research was funded by the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2018AAA0102900).
References
- MISEP–linear and nonlinear ICA based on mutual information. Journal of machine learning research 4 (Dec), pp. 1297–1318.
- MINE: mutual information neural estimation. In ICML.
- An information-maximization approach to blind separation and blind deconvolution. Neural computation 7 (6), pp. 1129–1159.
- Molecular generation with recurrent neural networks (RNNs). arXiv preprint arXiv:1705.04612.
- A two-step graph convolutional decoder for molecule generation. arXiv preprint arXiv:1906.03412.
- GraRep: learning graph representations with global structural information. In CIKM.
- FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
- Elements of information theory. John Wiley & Sons.
- Semi-supervised learning on graphs with generative adversarial nets. In CIKM.
- Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on pure and applied mathematics 36 (2), pp. 183–212.
- Learning graph representations with embedding propagation. In NeurIPS.
- Graph alignment networks with node matching scores. In NeurIPS.
- Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
- Generative adversarial nets. In NeurIPS.
- node2vec: scalable feature learning for networks. In KDD.
- Inductive representation learning on large graphs. In NeurIPS.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV.
- Deep residual learning for image recognition. In CVPR.
- REGAL: representation learning-based graph alignment. In CIKM.
- Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- Nonlinear independent component analysis: existence and uniqueness results. Neural networks 12 (3), pp. 429–439.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Introduction to PyTorch. In Deep learning with Python, pp. 195–208.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
- The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (7), pp. 1019–1031.
- Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605.
- Birds of a feather: homophily in social networks. Annual review of sociology 27 (1), pp. 415–444.
- f-GAN: training generative neural samplers using variational divergence minimization. In NeurIPS.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Scikit-learn: machine learning in Python. Journal of machine learning research 12 (Oct), pp. 2825–2830.
- DeepWalk: online learning of social representations. In KDD.
- Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In WSDM.
- GMNN: graph Markov neural networks. In ICML.
- About the mutual (conditional) information. In ISIT.
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, pp. 53–65.
- Collective classification in network data. AI magazine 29 (3), pp. 93–93.
- LINE: large-scale information network embedding. In WWW.
- Graph attention networks. arXiv preprint arXiv:1710.10903.
- Deep graph infomax. arXiv preprint arXiv:1809.10341.
- Net: degree-specific graph neural networks for node and graph classification. arXiv preprint arXiv:1906.02319.
- Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI.
- Graph wavelet neural network. arXiv preprint arXiv:1904.07785.
- Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.
- Graph convolutional policy network for goal-directed molecular graph generation. In NeurIPS.
- Deep sets. In NeurIPS, pp. 3391–3401.
- GaAN: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.
- Link prediction based on graph neural networks. In NeurIPS.
- Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI.
- Semi-supervised learning using Gaussian fields and harmonic functions. In ICML.
- Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198.