1. Introduction
To achieve reliable quantitative analysis of diverse information networks, high-quality representation learning for graph-structured data has become an active research topic. Recent methods towards this goal, commonly categorized as Graph Neural Networks (GNNs), have made remarkable advances in many learning tasks, such as node classification (Zhang et al., 2019; Wu et al., 2019a; Qu et al., 2019), link prediction (Liben-Nowell and Kleinberg, 2007; Kipf and Welling, 2016b; Zhang and Chen, 2018), graph alignment (Heimann et al., 2018; Faerman et al., 2019; Wu et al., 2019b), and molecular generation (Bjerrum and Threlfall, 2017; You et al., 2018; Bresson and Laurent, 2019), to name a few. Despite this progress, training GNNs in existing approaches usually requires some form of supervision. Labeling information is expensive to acquire, whether through manual annotation or paid access, and is sometimes impossible to obtain because of privacy policies. Moreover, the reliability of given labels is sometimes questionable. Hence, achieving high-quality graph representations without supervision becomes a necessity in many practical cases, which motivates the study of this paper.
By carefully examining our collected graph-structured data, we find that they generally come from sources such as social networks, citation networks, and communication networks, where a tremendous amount of both content and linkage information exists. For instance, data on many social platforms like Twitter, Flickr, and Facebook include features of users (e.g., basic personal details, texts, images, IP) and their relations (e.g., buying the same item, being friends). These rich content data are sufficient to support subsequent mining tasks without additional guidance: if two entities exhibit extreme similarity in features, there is a high probability of a link between them (link prediction), and they are likely to belong to the same category (classification); if two entities both link to the same entity, they probably have similar characteristics (recommendation). In this sense, preserving and extracting as much information as possible from information networks into the embedding space facilitates learning high-quality, expressive representations that perform well in mining tasks without any form of supervision. Unsupervised graph representation learning is a more favorable choice in many cases due to its freedom from labels, particularly when we intend to benefit from large-scale unlabeled data in the wild.
To fully inherit the rich information in graphs, in this paper we perform graph embedding based on Mutual Information (MI) maximization, inspired by the empirical success of the Deep InfoMax method (Hjelm et al., 2018), which operates on images. To discover useful representations, Deep InfoMax trains the encoder to maximize MI between its inputs (i.e., the images) and outputs (i.e., the hidden vectors). When considering Deep InfoMax in the graph domain, the first hurdle is how to define MI between graphs and hidden vectors, since the topology of graphs is more complicated than that of images (see Figure 1). One of the challenges is to ensure that the MI function between each node's hidden representation and its neighborhood input features obeys the symmetric property, or equivalently, is invariant to permutations of the neighborhood. As one recent work considering MI, Deep Graph Infomax (DGI) (Veličković et al., 2018) first embeds an input graph and a corresponding corrupted graph, then summarizes the input graph as a vector via a readout function, and finally maximizes MI between this summary vector and the hidden representations by discriminating the input graph (positive sample) from the corrupted graph (negative sample). Figure 1 gives an overview of DGI. Maximizing this kind of MI is proved to be equivalent to maximizing the MI between the input node features and hidden vectors, but this equivalence holds only under several preset conditions, e.g., the readout function should be injective, which seem overly restrictive in real cases. Even if we can guarantee the existence of an injective readout function by design, e.g., the one used in DeepSets (Zaheer et al., 2017), the injectivity of the readout function is also affected by how its parameters are trained. That is to say, an originally injective function still risks becoming non-injective if it is trained without any external supervision.
If the readout function is not injective, the input graph information contained in a summary vector diminishes as the size of the graph increases. Moreover, DGI stays at a coarse graph/patch-level MI maximization. Hence in DGI, there is no guarantee that the encoder can distill sufficient information from the input data, as it never explicitly correlates hidden representations with their original inputs.
In this paper, we put forward a more straightforward way to consider MI in terms of graphical structures, without using any readout function or corruption function. We directly derive MI by comparing the input (i.e., the subgraph consisting of the input neighborhood) and the output (i.e., the hidden representation of each node) of the encoder. Interestingly, our theoretical derivations demonstrate that the directly formulated MI can be decomposed into a weighted sum of local MIs between each neighborhood feature and the hidden vector. In this way, we decompose the input features and make the MI computation tractable. Moreover, this form of MI can easily satisfy the symmetric property if we adjust the values of the weights. We defer more details to § 3.1. As the above MI is mainly measured at the level of node features, we term it Feature Mutual Information (FMI).
Two issues remain with FMI: 1. the combining weights are still unknown, and 2. it does not take the topology (i.e., the edge features) into account. To address these two issues, we define our Graphical Mutual Information (GMI) measurement based on FMI. In particular, GMI applies an intuitive value assignment by setting the weights in FMI equal to the proximity between each neighbor and the target node in the representation space. To retain the topology information, GMI further correlates these weights with the input edge features via an additional mutual information term. The resulting GMI is topologically invariant and also calculable with Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018). The main contributions of our work are as follows:

Concepts: We generalize conventional MI estimation to the graph domain and accordingly propose the new concept of Graphical Mutual Information (GMI). GMI is free from the potential risk caused by the readout function since it considers MI between input graphs and high-level embeddings in a straightforward manner.

Algorithms: Through our theoretical analysis, we give a tractable and calculable form of GMI which decomposes the entire GMI into a weighted sum of local MIs. With the help of the MINE method, GMI maximization can be easily achieved at the node level.

Experimental Findings: We verify the effectiveness of GMI on several popular node classification and link prediction tasks including both transductive and inductive ones. The experiments demonstrate that our method delivers promising performance on a variety of benchmarks and it even sometimes outperforms the supervised counterparts.
2. Related Work
In line with the focus of our work, we briefly review the previous work in the two following areas: 1. mutual information estimation, and 2. neural networks for learning representation over graphs.
Mutual information estimation. As the InfoMax principle (Bell and Sejnowski, 1995) advocates maximizing MI between the inputs and outputs of neural networks, many methods such as ICA algorithms (Hyvärinen and Pajunen, 1999; Almeida, 2003) attempt to employ the idea of MI in unsupervised feature learning. Nonetheless, these methods cannot be generalized to deep neural networks easily due to the difficulty of calculating MI between high-dimensional continuous variables. Fortunately, Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018) makes the estimation of MI on deep neural networks feasible by training a statistics network as a classifier to distinguish samples coming from the joint distribution and the product of marginals of two random variables. Specifically, MINE uses the exact KL-based formulation of MI, while a non-KL alternative, the Jensen-Shannon divergence (JSD) (Nowozin et al., 2016), can be used when the precise value of MI is not a concern.
Neural networks for graph representation learning. With the rapid development of graph neural networks (GNNs), a large number of graph representation learning algorithms based on GNNs have been proposed in recent years, which exhibit stronger performance than traditional random walk-based and factorization-based embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Cao et al., 2015; Grover and Leskovec, 2016; Qiu et al., 2018). Typically, these methods can be divided into supervised and unsupervised categories. Among them, there is a rich literature on supervised representation learning over graphs (Kipf and Welling, 2016a; Veličković et al., 2017; Chen et al., 2018; Zhang et al., 2018; Ding et al., 2018). Despite their variety in network architecture, they achieve empirical success with the help of labels that are often not accessible in realistic scenarios. In this case, unsupervised graph learning methods (Hamilton et al., 2017; Duran and Niepert, 2017; Veličković et al., 2018) have broader application potential. A well-known method is GraphSAGE (Hamilton et al., 2017), an inductive framework that trains GNNs with a random walk-based objective in its unsupervised setting. Recently, DGI (Veličković et al., 2018) applied the idea of MI maximization to the graph domain and obtained strong performance in an unsupervised manner. However, DGI implements a coarse-grained maximization (i.e., maximizing MI at the graph/patch level), which makes it difficult to preserve the delicate information in the input graph. Besides, the condition imposed on the readout function used in DGI seems overly restrictive in real cases. By contrast, we focus on removing the restriction of the readout function and arriving at graphical mutual information maximization at the node level by directly maximizing MI between inputs and outputs of the encoder. Representations derived by our method are better at keeping input graph information, which ensures their potential for downstream graph mining tasks, e.g., node classification, link prediction, and recommendation.
3. Graphical Mutual Information: Definition and Maximization
Prior to going further, we first provide the preliminary concepts used in this paper. Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ denote a graph with $N$ nodes $v_i \in \mathcal{V}$ and edges $e_{ij} = (v_i, v_j) \in \mathcal{E}$. The node features, with assumed empirical probability distribution $\mathbb{P}$, are given by $X = \{x_1, \dots, x_N\}$, where $x_i$ denotes the feature for node $v_i$. The adjacency matrix $A$ represents edge connections, where $A_{ij}$ associated with edge $e_{ij}$ could be a real number or a multi-dimensional vector (our method is generally applicable to graphs with edge features, although we only consider edges with real weights in our experiments).
The goal of graph representation learning is to learn an encoder $f$ such that the hidden vectors $H = \{h_1, \dots, h_N\} = f(X, A)$ indicate high-level representations for all nodes. The encoding process can be rewritten in a node-wise form. To show this, we define $X_i$ and $A_i$ for node $i$ as, respectively, the features of its neighbors and the corresponding adjacency matrix restricted to those neighbors. In particular, $X_i$ consists of all $k$-hop neighbors of $v_i$ with $k \le l$ when the encoder $f$ is an $l$-layer GNN, and it contains node $i$ itself if we further add self-loops in the adjacency matrix. We call the subgraph expanded by $X_i$ and $A_i$ the support graph for node $i$, denoted by $\mathcal{G}_i$. With the definition of the support graph, the encoding for each node becomes $h_i = f(\mathcal{G}_i) = f(X_i, A_i)$.
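As a rough illustration of the support graph, the node set of a k-hop neighborhood can be collected with a breadth-first search. The sketch below is only an assumption about how one might do this; the function name `support_nodes`, the adjacency-list layout, and the toy path graph are all hypothetical and not taken from the paper.

```python
from collections import deque


def support_nodes(adj, i, k, self_loop=True):
    """Return the sorted node indices within k hops of node i (hypothetical helper).

    adj is an adjacency list {node: [neighbors]}; self_loop controls whether
    node i itself is kept, mirroring the self-loop remark in the text.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand beyond k hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    nodes = set(dist)
    if not self_loop:
        nodes.discard(i)
    return sorted(nodes)


# Toy path graph 0-1-2-3-4 (illustrative data only).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_hop = support_nodes(adj, 2, 1)                    # node 2 plus its direct neighbors
two_hop = support_nodes(adj, 2, 2, self_loop=False)   # up to two hops, node 2 excluded
```

The features of the returned nodes would play the role of the neighborhood features, and the adjacency restricted to them the role of the local adjacency matrix.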
Difficulties in defining graphical mutual information. In Deep InfoMax (Hjelm et al., 2018), the training objective of the encoder is to maximize MI between its inputs and outputs. The MI is estimated by employing a statistics network as a discriminator to classify samples coming from the joint distribution and the ones drawn from the product of marginals. Naturally, when adapting the idea of Deep InfoMax to graphs, we should maximize MI between the representation $h_i$ and the support graph $\mathcal{G}_i$ for each node. We denote such graphical MI as $I(h_i; \mathcal{G}_i)$. However, it is not straightforward to define $I(h_i; \mathcal{G}_i)$. The difficulties are:

The graphical MI should be invariant with respect to the node index. In other words, we should have $I(h_i; \mathcal{G}_i) = I(h_{i'}; \mathcal{G}_{i'})$ if $\mathcal{G}_i$ and $\mathcal{G}_{i'}$ are isomorphic to each other.

If we adopt the MINE method for MI calculation, the discriminator in MINE only accepts inputs of a fixed size. This is infeasible for $\mathcal{G}_i$, as different $\mathcal{G}_i$ usually include different numbers of nodes and thus are of distinct sizes.
To get around the issue of defining graphical mutual information, this section begins with introducing the concept of Feature Mutual Information (FMI) that only relies on node features. Upon the inspiration from the decomposition of FMI, we then define Graphical Mutual Information (GMI), which takes both the node features and graph topology into consideration.
3.1. Feature Mutual Information
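We denote the empirical probability distribution of node features $X_i$ as $p(X_i)$, the probability of $h_i$ as $p(h_i)$, and the joint distribution by $p(h_i, X_i)$. According to information theory, the MI between $h_i$ and $X_i$ is defined as
$$I(h_i; X_i) = \int_{\mathcal{H}} \int_{\mathcal{X}} p(h_i, X_i) \log \frac{p(h_i, X_i)}{p(h_i)\, p(X_i)} \, dh_i \, dX_i. \tag{1}$$
Interestingly, we have the following mutual information decomposition theorem for computing $I(h_i; X_i)$.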
Theorem 1 (Mutual Information Decomposition).
If the conditional probability $p(h_i \mid X_i)$ is multiplicative (see the definition of multiplicative in (Renner and Maurer, 2002)), the global mutual information $I(h_i; X_i)$ defined in Eq. (1) can be decomposed as a weighted sum of local MIs, namely,
$$I(h_i; X_i) = \sum_{j=1}^{i_n} w_{ij}\, I(h_i; x_j), \tag{2}$$
where $x_j$ is the $j$-th neighbor of node $i$, $i_n$ is the number of all elements in $X_i$, and the weight $w_{ij}$ satisfies $\frac{1}{i_n} \le w_{ij} \le 1$ for each $j$.
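To prove the above theorem, we first introduce two lemmas and a definition.
Lemma 2.
For any random variables $X$, $Y$, and $Z$, we have
$$I(X, Y; Z) \ge I(X; Z). \tag{3}$$
Proof.
By the chain rule of mutual information, $I(X, Y; Z) - I(X; Z) = I(Y; Z \mid X) \ge 0$, since conditional mutual information is non-negative. Thus we achieve $I(X, Y; Z) \ge I(X; Z)$. ∎
Definition 3.
The conditional probability $p(h \mid x_1, \dots, x_n)$ is called multiplicative if it can be written as a product
$$p(h \mid x_1, \dots, x_n) = r_1(h, x_1) \cdots r_n(h, x_n) \tag{4}$$
with appropriate functions $r_1, \dots, r_n$.
Lemma 4.
If $p(h \mid x_1, \dots, x_n)$ is multiplicative, then we have
$$\sum_{j=1}^{n} I(x_j; h) \ge I(x_1, \dots, x_n; h). \tag{5}$$
Proof.
See (Renner and Maurer, 2002) for a detailed proof. ∎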
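Now all the necessities for proving Theorem 1 are in place.
Proof.
According to Lemma 2, for any $j$ we have
$$I(h_i; X_i) = I(h_i; x_1, \dots, x_{i_n}) \ge I(h_i; x_j). \tag{6}$$
It means
$$I(h_i; X_i) = \frac{1}{i_n} \sum_{j=1}^{i_n} I(h_i; X_i) \ge \frac{1}{i_n} \sum_{j=1}^{i_n} I(h_i; x_j). \tag{7}$$
On the other hand, based on Lemma 4, we get
$$I(h_i; X_i) \le \sum_{j=1}^{i_n} I(h_i; x_j). \tag{8}$$
Then the above two formulas together give
$$\frac{1}{i_n} \sum_{j=1}^{i_n} I(h_i; x_j) \le I(h_i; X_i) \le \sum_{j=1}^{i_n} I(h_i; x_j). \tag{9}$$
As all $I(h_i; x_j) \ge 0$, there must exist weights $\frac{1}{i_n} \le w_{ij} \le 1$. When setting $w_{ij} = I(h_i; X_i) / \sum_{j} I(h_i; x_j)$, we achieve Eq. (2) while ensuring $\frac{1}{i_n} \le w_{ij} \le 1$ by Eq. (9); thus Theorem 1 has been proved. ∎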
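The lower bound used in the proof can be checked numerically on a small discrete example. The following is a minimal sketch, assuming binary variables and a randomly generated joint distribution; the helper `mutual_information` and all data are hypothetical. Note that only the lower bound holds for an arbitrary joint; the upper bound on the weight additionally needs the multiplicative condition, so the sketch does not assert it.

```python
import math
import random
from itertools import product


def mutual_information(joint, a_idx, b_idx):
    """MI in nats between two groups of coordinates of a discrete joint.

    joint maps outcome tuples to probabilities; a_idx and b_idx select
    which coordinates form each variable group.
    """
    pa, pb, pab = {}, {}, {}
    for outcome, p in joint.items():
        a = tuple(outcome[k] for k in a_idx)
        b = tuple(outcome[k] for k in b_idx)
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
        pab[(a, b)] = pab.get((a, b), 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)


random.seed(0)
# Random joint over (h, x1, x2, x3), all binary (illustrative data only).
outcomes = list(product([0, 1], repeat=4))
raw = [random.random() for _ in outcomes]
joint = {o: r / sum(raw) for o, r in zip(outcomes, raw)}

n = 3
global_mi = mutual_information(joint, [0], [1, 2, 3])            # I(h; X)
local_mis = [mutual_information(joint, [0], [j]) for j in (1, 2, 3)]
lower_bound = sum(local_mis) / n                                   # right side of the averaged bound
weight = global_mi / sum(local_mis)                                # weight chosen in the proof
```

With a single shared weight, the weighted sum of local MIs reproduces the global MI by construction, and the weight is at least one over the number of neighbors.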
With the decomposition in Theorem 1, we can calculate the right side of Eq. (2) via MINE, as the inputs of the discriminator now become pairs $(h_i, x_j)$ whose size always stays the same (i.e., the dimension of $h_i$ by the dimension of $x_j$). Besides, we can adjust the weights to reflect the isomorphic transformation of input graphs. For instance, if $X_i$ only contains one-hop neighbors of node $i$, setting all weights to be identical leads to the same MI for the input nodes in different orders.
Despite these benefits of the decomposition, it is hard to characterize the exact values of the weights since they are related to the values of $I(h_i; x_j)$ and their underlying probability distributions. A trivial way is to set all weights to $\frac{1}{i_n}$; maximizing the right side of Eq. (2) is then equivalent to maximizing a lower bound of $I(h_i; X_i)$, by which the true FMI is also maximized to some extent. Besides this method, we additionally provide an enhanced solution by considering the weights as trainable attentions, which is the topic of the next subsection.
Table 1. Statistics of the datasets.

| Task | Setting | Dataset | Type | Nodes | Edges | Features | Classes |
|---|---|---|---|---|---|---|---|
| Classification | Transductive | Cora | Citation network | 2,708 | 5,429 | 1,433 | 7 |
| | | Citeseer | Citation network | 3,327 | 4,732 | 3,703 | 6 |
| | | PubMed | Citation network | 19,717 | 44,338 | 500 | 3 |
| | Inductive | Reddit | Social network | 232,965 | 11,606,919 | 602 | 41 |
| | | PPI | Protein network | 56,944 | 806,174 | 50 | 121 |
| Link prediction | | Cora | Citation network | 2,708 | 5,429 | 1,433 | 7 |
| | | BlogCatalog | Social network | 5,196 | 171,743 | 8,189 | 6 |
| | | Flickr | Social network | 7,575 | 239,738 | 12,047 | 9 |
| | | PPI | Protein network | 56,944 | 806,174 | 50 | 121 |

The task on PPI is a multi-label classification problem.
3.2. Topology-Aware Mutual Information
Inspired by the decomposition in Theorem 1, we attempt to construct trainable weights from the other aspect of graphs (i.e., the topological view) so that the values of $w_{ij}$ can be more flexible and capture the inherent properties of graphs. Ultimately we derive the definition of Graphical Mutual Information (GMI).
Definition 5 (Graphical Mutual Information).
The MI between the hidden vector $h_i$ and its support graph $\mathcal{G}_i = (X_i, A_i)$ is defined as
$$I(h_i; \mathcal{G}_i) := \sum_{j=1}^{i_n} w_{ij}\, I(h_i; x_j) + I(w_{ij}; a_{ij}), \quad \text{with } w_{ij} = \sigma(h_i^{\top} h_j), \tag{10}$$
where the definitions of both $x_j$ and $w_{ij}$ are the same as in Theorem 1, $a_{ij}$ is the edge weight/feature in the adjacency matrix $A$, and $\sigma(\cdot)$ is a sigmoid function.
Intuitively, the weight $w_{ij}$ in the first term of Eq. (10) measures the contribution of a local MI to the global one. We implement the contribution of $I(h_i; x_j)$ by the similarity between representations $h_i$ and $h_j$ (i.e., $w_{ij} = \sigma(h_i^{\top} h_j)$). Meanwhile, the term $I(w_{ij}; a_{ij})$ maximizes MI between $w_{ij}$ and the edge weight/feature of the input graph (i.e., $a_{ij}$) to enforce $w_{ij}$ to conform to topological relations. In this sense, the degree of the contribution is consistent with the proximity in topological structure: it is commonly accepted that $w_{ij}$ should be larger if node $j$ is "closer" to node $i$ and smaller otherwise. This strategy compensates for the flaw that FMI only focuses on node features and makes local MIs contribute to the global one adaptively. For more on the idea of attention behind this strategy, see the attention-based GCN (Veličković et al., 2017).
Note that the definition in Eq. (10) is applicable to general cases. For certain specific situations, we can slightly modify Eq. (10) for efficiency. For example, when dealing with unweighted graphs (namely, the edge value is 1 if connected and 0 otherwise), we could replace the second MI term with a negative cross-entropy loss. Minimizing the cross-entropy also contributes to MI maximization, and it is cheaper to compute. We defer more details to the next section.
There are several benefits to the definition in Eq. (10). First, this kind of MI is invariant to the isomorphic transformation of input graphs. Second, it is computationally feasible, as each component on the right side can be estimated by MINE. More importantly, GMI is more powerful than DGI in capturing original input information due to its explicit correlation between hidden vectors and input features of both nodes and edges at a fine-grained node level.
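The invariance claim can be illustrated with a few lines of code. The sketch below assumes the weights take the sigmoid-of-inner-product form described above and uses made-up embeddings and local MI values; the function names are hypothetical. Because each weight depends only on the pair of representations, reordering the neighbors leaves the weighted sum unchanged.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_local_mi(h_i, neighbor_h, local_mi):
    """Weighted sum of local MIs with weights sigmoid(h_i . h_j)."""
    return sum(sigmoid(dot(h_i, h_j)) * mi
               for h_j, mi in zip(neighbor_h, local_mi))


# Made-up embeddings and local MI estimates (illustrative data only).
h_i = [0.5, -1.0]
neighbor_h = [[1.0, 0.0], [0.2, 0.3], [-0.4, 0.9]]
local_mi = [0.30, 0.10, 0.25]

value = weighted_local_mi(h_i, neighbor_h, local_mi)

# Re-index the neighbors and recompute: the result should not change.
perm = [2, 0, 1]
value_perm = weighted_local_mi(h_i,
                               [neighbor_h[p] for p in perm],
                               [local_mi[p] for p in perm])
```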
3.3. Maximization of GMI
Now we directly maximize the right side of Eq. (10) with the help of MINE. Note that MINE estimates a lower bound of MI with the Donsker-Varadhan (DV) (Donsker and Varadhan, 1983) representation of the KL-divergence between the joint distribution and the product of the marginals. As we focus more on maximizing MI than on obtaining its specific value, other non-KL alternatives such as the Jensen-Shannon MI estimator (JSD) (Nowozin et al., 2016) and the Noise-Contrastive estimator (infoNCE) (Oord et al., 2018) could be employed to replace it. Based on the experimental findings and analysis in (Hjelm et al., 2018), we resort to the JSD estimator in this paper for the sake of effectiveness and efficiency, since the infoNCE estimator is sensitive to negative sampling strategies (the number of negative samples) and thus may become a bottleneck for large-scale datasets with a fixed available memory. By contrast, the insensitivity of the JSD estimator to negative sampling strategies and its respectable performance on many tasks make it more suitable for our task. In particular, we calculate $I(h_i; x_j)$ in the first term of Eq. (10) by
$$I(h_i; x_j) = -\mathrm{sp}\big(-\mathcal{D}_w(h_i, x_j)\big) - \mathbb{E}_{\tilde{\mathbb{P}}}\big[\mathrm{sp}\big(\mathcal{D}_w(h_i, x'_j)\big)\big], \tag{11}$$
where $\mathcal{D}_w$ is a discriminator constructed by a neural network with parameter $w$, $x'_j$ is a negative sample drawn from $\tilde{\mathbb{P}} = \mathbb{P}$, and $\mathrm{sp}(x) = \log(1 + e^{x})$ denotes the softplus function.
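A Jensen-Shannon style estimate of this kind is straightforward to compute once discriminator scores are available. The sketch below assumes the common form "minus softplus of the negated positive score, minus the mean softplus of the negative scores"; the function names and the scores are hypothetical, and the five negatives simply mirror the sample count mentioned later in the experimental settings.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_estimate(pos_score, neg_scores):
    """JSD-style MI estimate from one positive and several negative scores."""
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return -softplus(-pos_score) - neg_term


# A discriminator that separates positives from negatives well ...
good = jsd_mi_estimate(4.0, [-4.0, -3.0, -5.0, -4.5, -3.5])
# ... versus one that outputs the same score for everything.
bad = jsd_mi_estimate(0.0, [0.0] * 5)
```

The estimate is never positive and increases as the discriminator scores true pairs higher and negative pairs lower, which is why maximizing it with respect to both encoder and discriminator pushes the representation to retain information about the input feature.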
As mentioned in § 3.2, we maximize $I(w_{ij}; a_{ij})$ by calculating its cross-entropy instead of using the JSD estimator, since the graphs we deal with in experiments are unweighted. Formally, we compute
$$I(w_{ij}; a_{ij}) = a_{ij} \log w_{ij} + (1 - a_{ij}) \log (1 - w_{ij}). \tag{12}$$
By maximizing $I(h_i; \mathcal{G}_i)$ with the sum of Eq. (11) and Eq. (12) over all hidden vectors $H$, we arrive at our complete objective function for GMI optimization. Besides, we can further add trade-off parameters to balance Eq. (11) and Eq. (12) for more flexibility.
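For unweighted graphs, the topology term reduces to a negative binary cross-entropy between the edge indicator and the sigmoid similarity of the two embeddings. The sketch below is an assumption-laden illustration: `edge_term`, `gmi_objective`, the trade-off names `alpha` and `beta`, and the small `eps` guard are all hypothetical choices, not the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def edge_term(h_i, h_j, a_ij, eps=1e-12):
    """Negative cross-entropy between edge indicator a_ij and sigmoid(h_i . h_j)."""
    w = sigmoid(sum(a * b for a, b in zip(h_i, h_j)))
    return a_ij * math.log(w + eps) + (1 - a_ij) * math.log(1 - w + eps)


def gmi_objective(feature_terms, edge_terms, alpha=1.0, beta=1.0):
    """Weighted sum of feature-MI and topology terms (to be maximized)."""
    return alpha * sum(feature_terms) + beta * sum(edge_terms)


h_i = [1.0, 1.0]
similar, dissimilar = [1.0, 1.0], [-1.0, -1.0]

linked_similar = edge_term(h_i, similar, 1)        # connected and similar: close to 0
linked_dissimilar = edge_term(h_i, dissimilar, 1)  # connected but dissimilar: strongly negative
unlinked_dissimilar = edge_term(h_i, dissimilar, 0)

total = gmi_objective([-0.05, -0.10], [linked_similar, unlinked_dissimilar],
                      alpha=1.0, beta=0.5)
```

Maximizing the edge term pulls the embeddings of connected nodes together and pushes unconnected ones apart, which is how the weights are tied to the topology.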
4. Experiments
In this section, we empirically evaluate the performance of GMI on two common tasks: node classification (transductive and inductive) and link prediction. An additional, relatively fair comparison between GMI and two other unsupervised algorithms (EPB and DGI) further demonstrates its effectiveness. We also provide t-SNE visualizations and analyze the influence of model depth.
4.1. Datasets
To assess the quality of our approach in each task, we adopt 4 or 5 benchmark datasets commonly used in previous work (Kipf and Welling, 2016a; Hamilton et al., 2017; Veličković et al., 2018). Detailed statistics are given in Table 1.
In the classification task, Cora, Citeseer, and PubMed (Sen et al., 2008; available at https://github.com/tkipf/gcn) are citation networks where nodes correspond to documents and edges represent citations. Each document is associated with a bag-of-words representation vector and belongs to one of the predefined classes. Following the transductive setup in (Kipf and Welling, 2016a; Veličković et al., 2018), training is conducted on all nodes, and 1000 test nodes are used for evaluation. Reddit (available at http://snap.stanford.edu/graphsage/) is a large social network consisting of numerous interconnected Reddit posts created during September 2014 (Hamilton et al., 2017). Posts are treated as nodes, and an edge connects two posts if the same user comments on both. The class label is the community, and our objective is to predict which community each post belongs to. PPI (available from the same source as Reddit) is a protein-protein interaction dataset that contains multiple graphs related to different human tissues (Zitnik and Leskovec, 2017). The positional gene sets, motif gene sets, and immunological signatures are viewed as node features, and each node has a total of 121 labels given by gene ontology sets. The goal is to classify protein functions across different PPI graphs. Following the inductive setup in (Hamilton et al., 2017), on Reddit, we feed posts made in the first 20 days into the model for training, while the remaining posts are used for testing (with 30% used for validation); on PPI, there are 20 graphs for training, 2 for validation, and 2 for testing. It should be emphasized that, for Reddit and PPI, testing is carried out on unseen (untrained) nodes and graphs, while the first three datasets are used for transductive learning.
In the link prediction task, BlogCatalog (available at http://dmml.asu.edu/users/xufei/datasets.html) is a social blogging website where bloggers follow each other and register their blogs under 6 predefined categories. The tags of blogs are taken as node features. Flickr (available from the same source as BlogCatalog) is an image sharing website where users interact with others and form a social network. Users upload photos under 9 predefined classes and select attached tags to reflect their interests, which provide attribute information. The description of Cora and PPI is omitted for brevity. Following the experimental settings and evaluation metrics in (Grover and Leskovec, 2016), given a graph with a certain portion of edges removed, we aim to predict these missing links. For Cora, BlogCatalog, and Flickr, we randomly delete 20%, 50%, and 70% of the edges while ensuring that the network obtained after the edge removal remains connected, and use the damaged network for training. For PPI, we directly treat part of the edges not seen during training as prediction targets instead of manually deleting edges.
4.2. Experimental Settings
Encoder design. We resort to a standard Graph Convolutional Network (GCN) model with the following layer-wise propagation rule as the encoder for both classification and link prediction tasks:
$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big), \tag{13}$$
where $\hat{A} = \hat{D}^{-1/2} (A + I_N) \hat{D}^{-1/2}$, $\hat{D}_{ii} = \sum_j (A + I_N)_{ij}$, $H^{(l)}$ and $H^{(l+1)}$ are the input and output matrices of the $l$-th layer, and $W^{(l)}$ is a layer-specific trainable weight matrix. Here the nonlinear transformation $\sigma$ we apply is the PReLU function (parametric ReLU) (He et al., 2015). Note that for node $i$, the neighborhood in its support graph contains node $i$ itself, as self-loops are inserted through $A + I_N$.
To be more specific, the encoder we employ on Citeseer and PubMed is a one-layer GCN with output dimension $D'$ (the embedding size). On Cora, Reddit, BlogCatalog, Flickr, and PPI, we utilize a two-layer GCN as our encoder, with hidden dimension $D'$ in each GCN layer. Note that utilizing a similar GCN encoder for both transductive and inductive classification tasks makes our proposed method easier to follow and to scale to large networks than DGI. DGI has to design varying encoders to adapt to distinct learning tasks, and the encoders it uses in inductive tasks are especially intricate, which is unfriendly to practical applications.
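The propagation rule of a GCN layer can be written out explicitly on a tiny graph. The following is a minimal dense sketch in plain Python, assuming the standard symmetric normalization with self-loops; the helper names, the toy graph, the weight matrix, and the fixed PReLU slope of 0.25 are illustrative assumptions (in practice the slope is learned).

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense list-of-lists adjacency."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def prelu(m, slope=0.25):
    """PReLU with a fixed slope (assumption: slope is not trained here)."""
    return [[v if v >= 0 else slope * v for v in row] for row in m]


def gcn_layer(a_hat, h, w, slope=0.25):
    """One propagation step: PReLU(A_hat @ H @ W)."""
    return prelu(matmul(a_hat, matmul(h, w)), slope)


# Toy path graph 0-1-2 with 2-dimensional features (illustrative data only).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[1.0, -1.0], [0.5, 0.5]]

a_hat = normalize_adjacency(adj)
h1 = gcn_layer(a_hat, x, w0)
```

Because the self-loop is added before normalization, each node's own feature always contributes to its hidden vector, matching the remark that the support graph contains the node itself.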
Discriminator design. The discriminator in Eq. (11) scores the input-output feature pairs through a simple bilinear function, which is similar to the discriminator used in (Oord et al., 2018):
$$\mathcal{D}_w(h_i, x_j) = \sigma\big(h_i^{\top} B\, x_j\big), \tag{14}$$
where $B$ represents a trainable scoring matrix and the activation function $\sigma$ we employ is the sigmoid, which converts scores into probabilities of being a positive example.
Implementation details. For the weight $w_{ij}$ of the first term in Eq. (10), we have two ways to obtain its value in experiments. The first is to keep $w_{ij} = \sigma(h_i^{\top} h_j)$, which makes local MIs contribute to the global one adaptively; we term this variant GMI-adaptive. The other is to let $w_{ij} = \frac{1}{i_n}$, i.e., the left endpoint of the interval to which $w_{ij}$ belongs (refer to Theorem 1), which means the contribution of each local MI is equal; we term this variant GMI-mean. Both GMI-mean and GMI-adaptive are included in the comparison with baselines.
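A bilinear discriminator of this kind only ever sees one hidden vector and one feature vector at a time, so its input size is fixed regardless of how many neighbors a node has. The sketch below assumes a sigmoid applied to the bilinear form; `bilinear_score` and the numbers are hypothetical, and the scoring matrix would be trained in practice rather than hand-set.

```python
import math


def bilinear_score(h, B, x):
    """sigmoid(h^T B x), with B of shape (len(h), len(x))."""
    bx = [sum(b * xv for b, xv in zip(row, x)) for row in B]   # B x
    z = sum(hv * v for hv, v in zip(h, bx))                    # h^T (B x)
    return 1.0 / (1.0 + math.exp(-z))


# Hand-set example: 2-dimensional hidden vector, 3-dimensional feature.
h = [1.0, -1.0]
x = [0.5, 2.0, 1.0]
B = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

score = bilinear_score(h, B, x)

# The same B scores every neighbor of a node, whatever their number.
neighbors = [[0.5, 2.0, 1.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
scores = [bilinear_score(h, B, xj) for xj in neighbors]
```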
All experiments are implemented in PyTorch (Ketkar, 2017) with Glorot initialization (Glorot and Bengio, 2010) and conducted on a single Tesla P40 GPU. In preprocessing, we perform row normalization on Cora, Citeseer, PubMed, BlogCatalog, and Flickr following (Kipf and Welling, 2016a), and apply the processing strategy in (Hamilton et al., 2017) on Reddit and PPI. During training, we use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 on all seven datasets. As suggested by (Veličković et al., 2018), we adopt an early stopping strategy with a window size of 20 on Cora, Citeseer, and PubMed, while training the model for a fixed number of epochs on the inductive datasets (20 on Reddit, 50 on PPI). The number of negative samples is set to 5. Due to the large scale of Reddit and PPI, we use the subsampling technique introduced in (Hamilton et al., 2017) to make them fit into GPU memory. In detail, a minibatch of 256 nodes is first selected, and then for each selected node, we uniformly sample 8 and 5 neighbors at its first- and second-level neighborhoods, respectively. We adopt the one-hop neighborhood to construct the support graph in experiments and utilize a compressed input feature to calculate FMI, since using the original input feature causes GPU memory overflow. The trade-off parameters are tuned in the range of [0, 1] to balance Eq. (11) and Eq. (12). Batch Normalization (Ioffe and Szegedy, 2015) is employed to train our model on Reddit and PPI.
Evaluation metrics.
For the classification task, we provide the learned embeddings across the training set to a logistic regression classifier and report the results on the test nodes (Kipf and Welling, 2016a; Hamilton et al., 2017). Specifically, in transductive learning, we adopt the mean classification accuracy over 50 runs to evaluate the performance, while the micro-averaged F1 score averaged over 50 runs is used in inductive learning. For PPI, as suggested by (Veličković et al., 2018), we standardize the learned embeddings before feeding them into the logistic regression classifier. For the link prediction task, the criterion we adopt is AUC, the area under the ROC curve. The negative samples involved in the calculation of AUC are generated by randomly selecting an equal number of node pairs with no connections in the original graph. The closer the AUC score is to 1, the better the performance of the algorithm. Similarly, we report the AUC score averaged over 10 runs.

Transductive tasks:

| Setting | Algorithm | X | A | Y | Cora | Citeseer | PubMed |
|---|---|---|---|---|---|---|---|
| Unsupervised | Raw features | ✓ | | | 56.6 ± 0.4 | 57.8 ± 0.2 | 69.1 ± 0.2 |
| | DeepWalk | | ✓ | | 67.2 | 43.2 | 65.3 |
| | EPB | ✓ | ✓ | | 78.1 ± 1.5 | 71.0 ± 1.4 | 79.6 ± 2.1 |
| | DGI | ✓ | ✓ | | 82.3 ± 0.6 | 71.8 ± 0.7 | 76.8 ± 0.6 |
| | GMI-mean (ours) | ✓ | ✓ | | 82.7 ± 0.2 | 73.0 ± 0.3 | 80.1 ± 0.2 |
| | GMI-adaptive (ours) | ✓ | ✓ | | 83.0 ± 0.3 | 72.4 ± 0.1 | 79.9 ± 0.2 |
| Supervised | LP | | ✓ | ✓ | 68.0 | 45.3 | 63.0 |
| | Planetoid-T | ✓ | ✓ | ✓ | 75.7 | 62.9 | 75.7 |
| | GCN | ✓ | ✓ | ✓ | 81.5 | 70.3 | 79.0 |
| | GAT | ✓ | ✓ | ✓ | 83.0 ± 0.7 | 72.5 ± 0.7 | 79.0 ± 0.3 |
| | GWNN | ✓ | ✓ | ✓ | 82.8 | 71.7 | 79.1 |

Inductive tasks:

| Setting | Algorithm | X | A | Y | Reddit | PPI |
|---|---|---|---|---|---|---|
| Unsupervised | Raw features | ✓ | | | 58.5 | 42.2 |
| | DeepWalk | | ✓ | | 32.4 | - |
| | DeepWalk+features | ✓ | ✓ | | 69.1 | - |
| | GraphSAGE-GCN | ✓ | ✓ | | 90.8 | 46.5 |
| | GraphSAGE-mean | ✓ | ✓ | | 89.7 | 48.6 |
| | GraphSAGE-LSTM | ✓ | ✓ | | 90.7 | 48.2 |
| | GraphSAGE-pool | ✓ | ✓ | | 89.2 | 50.2 |
| | DGI | ✓ | ✓ | | 94.0 ± 0.10 | 63.8 ± 0.20 |
| | GMI-mean (ours) | ✓ | ✓ | | 95.0 ± 0.02 | 65.0 ± 0.02 |
| | GMI-adaptive (ours) | ✓ | ✓ | | 94.9 ± 0.02 | 64.6 ± 0.03 |
| Supervised | GAT | ✓ | ✓ | ✓ | - | 97.3 ± 0.20 |
| | FastGCN | ✓ | ✓ | ✓ | 93.7 | - |
| | GaAN | ✓ | ✓ | ✓ | 96.4 ± 0.03 | 98.7 ± 0.02 |

Table 2. Classification accuracies (with standard deviation) in percent on transductive tasks and micro-averaged F1 scores on inductive tasks. The training-data columns indicate the data used by each algorithm in the training phase, where X, A, and Y denote features, adjacency matrix, and labels, respectively.
4.3. Classification
Transductive learning. Table 2 reports the mean classification accuracy of our method and other baselines on transductive tasks. Here, results for EPB (Duran and Niepert, 2017), DGI (Veličković et al., 2018), Planetoid-T (Yang et al., 2016), GAT (Veličković et al., 2017), and GWNN (Xu et al., 2019) are taken from their original papers, and results for DeepWalk (Perozzi et al., 2014), LP (Label Propagation) (Zhu et al., 2003), and GCN (Kipf and Welling, 2016a) are copied from Kipf and Welling (2016a). As for raw features, we feed them into a logistic regression classifier for training and report the results on the test features (strictly speaking, this experiment belongs to inductive learning, as testing is conducted on unseen features, but we place it here for comparison). Although we provide experimental results of both supervised and unsupervised methods, in this paper we focus more on comparing against unsupervised ones, which are consistent with our setup.
As can be observed, our proposed GMI-mean and GMI-adaptive achieve the best classification accuracy among unsupervised methods across all three datasets. We attribute this strong performance to directly maximizing graphical MI between input and output pairs of the encoder at a fine-grained node level. As a result, the encoded representation maximally preserves the information of node features and topology in the input graph, which contributes to classification. By contrast, EPB ignores the underlying information between input data and learned representations, and DGI stays at a graph/patch-level MI maximization, which restricts their capability of preserving and extracting the original input information into the embedding space and leads to slightly weaker performance on classification tasks. Besides, without the guidance of labels, our method exhibits results comparable to some supervised models like GCN and GAT, and even better results than them on Citeseer and PubMed. We believe that representations learned via GMI maximization between inputs and outputs inherit the rich information in the graph, which is enough for classification. More notably, many available labels are themselves assigned based on the information in the input graph. Keeping as much information as possible from the input can therefore compensate, to some extent, for the information provided by labels, which sustains the performance of GMI in downstream graph mining tasks. Learning from original inputs without labels may even promise higher-quality representations than supervised learning, since extremely sparse training labels carry a risk of overfitting and the given labels might not be reliable.
Table 3: Results of different objective functions under the transductive and inductive setups.
                     Transductive                              Inductive
Algorithm            Cora          Citeseer      PubMed        Reddit         PPI
EPB loss             79.4 ± 0.1    69.3 ± 0.2    78.6 ± 0.2    93.8 ± 0.03    61.8 ± 0.04
DGI loss             82.2 ± 0.2    72.2 ± 0.2    78.9 ± 0.3    94.3 ± 0.02    62.3 ± 0.02
FMI (ours)           78.3 ± 0.1    72.0 ± 0.2    79.1 ± 0.3    94.7 ± 0.03    64.8 ± 0.03
GMImean (ours)       82.7 ± 0.1    73.0 ± 0.3    80.1 ± 0.2    95.0 ± 0.02    65.0 ± 0.02
GMIadaptive (ours)   83.0 ± 0.3    72.4 ± 0.1    79.9 ± 0.2    94.9 ± 0.02    64.6 ± 0.03
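Table 4: AUC scores for link prediction. The second header row gives the edge removal rate (%) for each dataset.
              Cora                                    BlogCatalog                             Flickr                                  PPI
Algorithm     20.0          50.0          70.0          20.0          50.0          70.0          20.0          50.0          70.0          22.7
DGI           95.6 ± 0.3    94.6 ± 0.4    94.4 ± 0.2    77.2 ± 0.4    76.4 ± 0.4    75.5 ± 0.3    90.3 ± 0.3    89.0 ± 0.4    74.1 ± 0.7    77.4 ± 0.1
FMI (ours)    97.2 ± 0.2    95.2 ± 0.1    95.0 ± 0.1    81.2 ± 0.2    79.5 ± 0.4    75.1 ± 0.2    92.7 ± 0.3    92.2 ± 0.3    90.6 ± 0.4    79.8 ± 0.2
GMI (ours)    97.9 ± 0.3    96.4 ± 0.2    96.3 ± 0.1    84.1 ± 0.3    83.6 ± 0.2    82.5 ± 0.1    92.0 ± 0.2    90.1 ± 0.3    88.5 ± 0.2    80.0 ± 0.2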
Inductive learning. Table 2 also summarizes the micro-averaged F1 scores of GMI and the baselines on Reddit and PPI. We cite the results of DGI, GAT, FastGCN (Chen et al., 2018), and GaAN (Zhang et al., 2018) from their original papers, while results for the remaining seven compared methods are taken from Hamilton et al. (2017) (here we reuse the unsupervised GraphSAGE results to match our setup). As before, the comparison with unsupervised algorithms is the emphasis of our work.
GMImean and GMIadaptive outperform all other competing unsupervised algorithms on Reddit and PPI, which substantiates the effectiveness of GMI maximization in the inductive classification domain (generalization to unseen nodes). Interestingly, the result of our method on Reddit is competitive with some advanced supervised models, but the situation on PPI is quite different. On further analysis, we note that 42% of nodes in PPI have zero feature values, which means the feature matrix is very sparse (Hamilton et al., 2017). In this case, relying solely on the input graph limits the performance of unsupervised approaches, including DGI and our method, whereas learning in a supervised fashion performs much better thanks to the auxiliary information brought by additional labels.
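To make the two quantities in this discussion concrete, the following minimal sketch computes a micro-averaged F1 score for multi-label predictions and the fraction of nodes whose feature vector is entirely zero. The helper names `micro_f1` and `zero_feature_fraction` and the toy arrays are illustrative assumptions, not part of our released code or of the PPI data.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary indicator matrices (n_samples, n_labels).

    True positives, false positives and false negatives are pooled over
    all labels before the F1 score is computed.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def zero_feature_fraction(features):
    """Fraction of rows (nodes) whose feature vector is all zeros."""
    features = np.asarray(features)
    return float(np.mean(~features.any(axis=1)))

# Toy multi-label example (hypothetical data): 3 nodes, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])
f1 = micro_f1(y_true, y_pred)

# Toy feature matrix (hypothetical data): 2 of 4 nodes have no features.
toy_features = np.array([[0, 0], [1, 0], [0, 0], [0, 2]])
frac = zero_feature_fraction(toy_features)
```

Pooling counts over all labels, rather than averaging per-label scores, is what distinguishes the micro average from the macro average on multi-label datasets such as PPI.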
Evaluation on two variants of GMI. According to Table 2, the two variants of GMI (GMImean and GMIadaptive), which use different strategies to measure the contribution of each local MI (details in § 4.2), achieve competitive results with each other, but GMIadaptive performs slightly worse than GMImean. We conjecture that this is due to the training difficulties inherent in adaptive learning, and that the performance of GMIadaptive could be improved with a more advanced training strategy. In this sense, GMImean is the more practical choice and can be regarded as the representative variant in practice.
4.4. Effectiveness of Objective Function
To further clarify the effectiveness of maximizing graphical MI in unsupervised graph representation learning and to provide a relatively fair comparison with DGI and EPB (two unsupervised algorithms), we replace our objective function with their loss functions, respectively, while keeping other experimental settings unchanged. Table 3 lists the results under the transductive and inductive setups. As can be observed, GMI (GMImean and GMIadaptive) achieves stronger performance across all five datasets, which suggests that the DGI and EPB objectives overlook some aspects of the graph representation learning task. Specifically, the EPB loss only imposes constraints on each node and its neighbors at the output level (embedding space); it ignores the interaction between input and output pairs of the encoder, which results in a poor ability to retain the valid information in the input graph. DGI correlates hidden representations with their original input features implicitly, but it considers MI at the graph/patch level, which is somewhat coarse. Interestingly, compared with DGI, our FMI (without topology information) gains larger improvements as the graph size increases. We attribute this to the degradation of the readout function, which makes DGI lose certain information useful for node classification as the graph grows, although it performs well on small graphs such as Cora and Citeseer. When the topology of the input graph is taken into account, our GMI outperforms all other losses on all datasets. Furthermore, note that the whole training process of GMI is similar to the training of discriminators in generative models (Goodfellow et al., 2014; Nowozin et al., 2016), and GMI empirically exhibits a training speed comparable to EPB and DGI on the largest dataset, Reddit, which demonstrates its good scalability.
[Table 5: t-SNE plots of the learned embeddings. Columns: Raw features, FMI (ours), GMI (ours), DGI. Rows: Cora, Citeseer.]
4.5. Link Prediction
Based on the above experimental results, we find that DGI is a strong competitor to GMI among unsupervised algorithms. Therefore, in this section, we further investigate the performance of DGI and GMI in another mining task: link prediction. Here we choose FMI and GMImean to compare with DGI. Table 4 reports their AUC scores on four different datasets. Under different edge removal rates, GMI and FMI both clearly outperform DGI (except FMI on BlogCatalog at the 70.0% removal rate), showing that measuring graphical MI between the input graph and output representations in a fine-grained manner captures rich information in the inputs and delivers good generalization ability. As for DGI, on the one hand, its graph/patch-level MI maximization is relatively coarse, which limits its performance in such a fine-grained link prediction task; on the other hand, an inappropriate corruption function weakens the ability of DGI to learn representations accurate enough to predict missing links. Recall that the negative sample for the discriminator in DGI is generated by corrupting the original input graph, and a well-designed corruption function is indispensable and requires some skillful strategies (Veličković et al., 2018). In this task, we still adopt the feature shuffling function, which shows the best results in the classification task, to build negative samples. But when an input graph is incomplete in terms of topological links, the guidance provided by this corrupted graph as a negative label in the discriminator becomes unreliable due to the inaccuracy of the input graph, leading to poor performance. The need for a task-oriented corruption function is therefore a weakness of DGI. In contrast, our GMI is free from this issue because it eliminates the corruption function and directly maximizes graphical MI between the inputs and outputs of the encoder. Furthermore, it can be observed that FMI is competitive with GMI in most cases, and on Flickr FMI is even superior to GMI. We attribute this to the benefits of direct and fine-grained feature mutual information maximization at the node level. Based on the homophily hypothesis (McPherson et al., 2001) (i.e., entities in a network with similar features are likely to interconnect), the input feature information preserved in the learned embeddings gives FMI a good capability of inferring missing links.
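As an illustration of how an AUC score for link prediction can be obtained from node embeddings, the sketch below scores candidate node pairs and compares held-out edges against sampled non-edges. The inner-product decoder with a sigmoid, the function names `edge_scores` and `auc`, and the toy embeddings are assumptions made for this example; they are not claimed to be the exact evaluation protocol behind Table 4.

```python
import numpy as np

def edge_scores(emb, pairs):
    """Score node pairs by the sigmoid of their embedding inner product.

    This decoder is an assumption of the sketch, not the paper's stated choice.
    """
    pairs = np.asarray(pairs)
    logits = np.sum(emb[pairs[:, 0]] * emb[pairs[:, 1]], axis=1)
    return 1.0 / (1.0 + np.exp(-logits))

def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive pair outranks a
    random negative pair; ties count as one half."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Toy embeddings for five nodes (hypothetical values).
emb = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.5, 0.5],
])
held_out_edges = [(0, 1), (1, 2)]   # edges removed before training
sampled_non_edges = [(0, 3), (0, 4)]  # pairs with no edge in the graph

result = auc(edge_scores(emb, held_out_edges),
             edge_scores(emb, sampled_non_edges))
```

In this toy case one non-edge, (0, 4), scores higher than one held-out edge, (1, 2), so three of the four positive-negative comparisons are ranked correctly.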
4.6. Visualization
For an intuitive illustration, Table 5 displays t-SNE (Maaten and Hinton, 2008) plots of the learned embeddings on Cora and Citeseer. From a qualitative perspective, the distributions of the embeddings learned by FMI and DGI appear similar, and the embeddings generated by GMI exhibit more discernible clusters than raw features, FMI, and DGI. On Cora in particular, the clusters, which correspond to the seven topic categories, are clearly compact and well separated. For quantitative analysis, we measure clustering quality by calculating the Silhouette Coefficient score (Rousseeuw, 1987). Specifically, we employ the silhouette_score function from the scikit-learn Python package (Pedregosa et al., 2011) with all default settings and follow the user guide to perform the evaluation. The clustering of embeddings learned via GMI obtains a Silhouette Coefficient score of 0.425 on Cora, 0.402 on Citeseer, and 0.385 on PubMed, while DGI obtains 0.417, 0.391, and 0.373 and EPB obtains 0.384, 0.385, and 0.379 on the three datasets, respectively. Both the qualitative and the quantitative results favor GMI, which supports the rationality and effectiveness of graphical mutual information maximization in unsupervised graph representation learning.
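The quantitative step above can be sketched as follows, calling silhouette_score with its default settings. The synthetic embeddings and the use of known class labels as cluster assignments are assumptions of this example; the scores reported above come from the learned embeddings, not from this toy data.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups of 8-dimensional "embeddings"
# (hypothetical stand-ins for learned node representations).
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 8)),
])
# Cluster assignment of each row; here we assume class labels are used.
labels = np.array([0] * 20 + [1] * 20)

# Default settings: Euclidean distance, no subsampling.
score = silhouette_score(embeddings, labels)
```

The score lies in [-1, 1]; values near 1 indicate compact, well-separated clusters, which is the property the t-SNE plots show qualitatively.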
4.7. Influence of Model Depth
In this part, we adjust the number of convolutional layers in the encoder to investigate the influence of model depth on classification accuracy. Considering the potential difficulty of training deep neural networks, as suggested by He et al. (2016), we also experiment with a residual counterpart of the standard GMI model, which adds identity shortcut connections between every two hidden layers to improve the training of deep networks. Here, we keep the same number of features for each hidden layer and start applying identity shortcuts from the second layer, as the input and output of the first layer do not have the same dimension. Moreover, in contrast to the standard GMI model, which maximizes GMI between the final representation and the original input graph, we consider another variant, called dense GMI, which maximizes GMI between each hidden layer and the input graph. Figure 2 gives a detailed illustration of the architecture. The hyperparameters involved remain unchanged, except that we train for a fixed number of epochs (600 on Cora and Citeseer) without early stopping. Results are plotted in Figure 3.
First, increasing the model depth significantly widens the performance gap between models with and without shortcut connections. The best result for Cora is obtained with a two-layer GCN encoder, while the best result for Citeseer is achieved with a one-layer GCN encoder. Besides the fact that greater depth makes training without shortcut connections difficult, we also conjecture that the information from farther neighborhoods brought in by multiple convolutional layers may act as noise for self-representation learning. Specifically, different proximity between neighbors implies different degrees of similarity: if two arbitrary nodes are a certain distance apart, they are likely to be completely different. Therefore, in the standard GMI model, the information aggregated from farther neighborhoods might contain much noise that is dissimilar to the characteristics of the node itself, which degrades the quality of the learned embeddings and the subsequent classification performance. In contrast, additional identity shortcuts enable the model to carry over the information of the previous layer's input, which can be regarded as passing similar neighborhood information from shallower layers to deeper layers; the residual version is thus less vulnerable to model depth. Second, we observe that the dense GMI variant can also alleviate the performance deterioration to some extent, although MI tends to decay with depth by the data processing inequality (Cover and Thomas, 2012). This is because maximizing graphical MI between the output of each layer and the input graph imposes a direct constraint on each hidden layer to preserve the input information as intact as possible. Based on this observation, enforcing an MI maximization constraint on hidden layers to reduce the loss of information when training deep neural networks could be a good practice.
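The residual variant described in this section can be sketched with plain NumPy as a stack of GCN-style layers with identity shortcuts starting from the second layer. This is a minimal sketch under stated assumptions: the ReLU activation, the per-layer shortcut (one reading of "between every two hidden layers"), the helper names `normalize_adj` and `encode`, and the random weights are all illustrative, and the MI objective itself is not shown.

```python
import numpy as np

def normalize_adj(adj):
    """Symmetrically normalized adjacency with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, as used by GCN layers."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(x, adj, weights, residual=True):
    """Run a stack of GCN-style layers and return every hidden layer.

    With residual=True, an identity shortcut is added from the second
    layer on, since the first layer changes the feature dimension.
    """
    a_norm = normalize_adj(adj)
    h, hidden = x, []
    for i, w in enumerate(weights):
        out = np.maximum(a_norm @ h @ w, 0.0)  # ReLU (assumed activation)
        if residual and i > 0:
            out = out + h  # identity shortcut; shapes match from layer 2
        h = out
        hidden.append(h)
    return hidden

rng = np.random.default_rng(0)
# Toy path graph on four nodes with 3-dimensional input features.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
x = rng.normal(size=(4, 3))
# Three layers; all hidden layers share the same width (5 here).
weights = [rng.normal(size=(3, 5)),
           rng.normal(size=(5, 5)),
           rng.normal(size=(5, 5))]

hidden = encode(x, adj, weights, residual=True)
```

Returning every hidden layer rather than only the last one reflects the dense GMI idea: an MI objective could be applied between each element of `hidden` and the input graph, whereas the standard model would use only `hidden[-1]`.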
5. Conclusion
To overcome the lack of available supervision and avoid the potential risk brought by unreliable labels, we introduce the novel concept of graphical mutual information (GMI) to carry out graph representation learning in an unsupervised manner. Its core lies in directly maximizing the mutual information between the input and output of a graph neural encoder in terms of node features and topological structure. Through our theoretical analysis, we give a definition of GMI and decompose it into a weighted sum that can be computed easily with the existing mutual information estimation method MINE. Accordingly, we develop an unsupervised model and evaluate it on two common graph mining tasks. The results show that GMI outperforms state-of-the-art unsupervised baselines across both classification tasks (transductive and inductive) and link prediction tasks, and is sometimes even competitive with supervised algorithms. Future work will concentrate on task-oriented representation learning or on adapting the idea of GMI maximization to other types of graphs, such as heterogeneous graphs and hypergraphs.
Acknowledgements.
This work was supported by National Key Research and Development Program of China (No. 2018AAA0101400), National Nature Science Foundation of China (No. 61872287 and No. 61532015), Innovative Research Group of the National Natural Science Foundation of China (No. 61721002), Innovation Research Team of Ministry of Education (IRT_17R86), and Project of China Knowledge Center for Engineering Science and Technology. Besides, this research was funded by National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2018AAA0102900).
References

MISEP–linear and nonlinear ica based on mutual information. Journal of machine learning research 4 (Dec), pp. 1297–1318. Cited by: §2.
Mine: mutual information neural estimation. In ICML. Cited by: §1, §2.
An information-maximization approach to blind separation and blind deconvolution. Neural computation 7 (6), pp. 1129–1159. Cited by: §2.
Molecular generation with recurrent neural networks (rnns). arXiv preprint arXiv:1705.04612. Cited by: §1.
A two-step graph convolutional decoder for molecule generation. arXiv preprint arXiv:1906.03412. Cited by: §1.
Grarep: learning graph representations with global structural information. In CIKM. Cited by: §2.
Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247. Cited by: §2, §4.3.
Elements of information theory. John Wiley & Sons. Cited by: §4.7.
Semi-supervised learning on graphs with generative adversarial nets. In CIKM. Cited by: §2.
Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on pure and applied mathematics 36 (2), pp. 183–212. Cited by: §3.3.
Learning graph representations with embedding propagation. In NeurIPS. Cited by: §2, §4.3.
Graph alignment networks with node matching scores. In NeurIPS. Cited by: §1.
Understanding the difficulty of training deep feedforward neural networks. In AISTATS. Cited by: §4.2.
Generative adversarial nets. In NeurIPS. Cited by: §4.4.
Node2vec: scalable feature learning for networks. In KDD. Cited by: §2, §4.1.
Inductive representation learning on large graphs. In NeurIPS. Cited by: §2, §4.1, §4.1, §4.2, §4.2, §4.3, §4.3.
Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV. Cited by: §4.2.
Deep residual learning for image recognition. In CVPR. Cited by: §4.7.
Regal: representation learning-based graph alignment. In CIKM. Cited by: §1.
Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §1, §3.3, §3.
Nonlinear independent component analysis: existence and uniqueness results. Neural networks 12 (3), pp. 429–439. Cited by: §2.
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.2.
Introduction to pytorch. In Deep learning with python, pp. 195–208. Cited by: §4.2.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2, §4.1, §4.1, §4.2, §4.2, §4.3.
Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §1.
The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (7), pp. 1019–1031. Cited by: §1.
Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.6.
Birds of a feather: homophily in social networks. Annual review of sociology 27 (1), pp. 415–444. Cited by: §4.5.
f-GAN: training generative neural samplers using variational divergence minimization. In NeurIPS. Cited by: §2, §3.3, §4.4.
Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.3, §4.2.
Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §4.6.
Deepwalk: online learning of social representations. In KDD. Cited by: §2, §4.3.
Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM. Cited by: §2.
GMNN: graph markov neural networks. In ICML. Cited by: §1.
About the mutual (conditional) information. In ISIT. Cited by: §3.1, Theorem 1.
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, pp. 53–65. Cited by: §4.6.
Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
Line: large-scale information network embedding. In WWW. Cited by: §2.
Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2, §3.2, §4.3.
Deep graph infomax. arXiv preprint arXiv:1809.10341. Cited by: §1, §2, §4.1, §4.1, §4.2, §4.2, §4.3, §4.5.
DEMO-Net: degree-specific graph neural networks for node and graph classification. arXiv preprint arXiv:1906.02319. Cited by: §1.
Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI. Cited by: §1.
Graph wavelet neural network. arXiv preprint arXiv:1904.07785. Cited by: §4.3.
Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861. Cited by: §4.3.
Graph convolutional policy network for goal-directed molecular graph generation. In NeurIPS. Cited by: §1.
Deep sets. In NeurIPS, pp. 3391–3401. Cited by: §1.
Gaan: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294. Cited by: §2, §4.3.
Link prediction based on graph neural networks. In NeurIPS. Cited by: §1.
Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI. Cited by: §1.
Semi-supervised learning using gaussian fields and harmonic functions. In ICML. Cited by: §4.3.
Predicting multicellular function through multilayer tissue networks. Bioinformatics 33 (14), pp. i190–i198. Cited by: §4.1.