Given a graph representing known relationships between a set of nodes, the goal of link prediction is to learn from the graph and infer novel or previously unknown relationships (Liben-Nowell:2003:LPP:956863.956972). For instance, in a social network we may use link prediction to power a friendship recommendation system (Aiello:2012:FPH:2180861.2180866), while in the case of biological network data we might use link prediction to infer possible relationships between drugs, proteins, and diseases (zitnik2017predicting). However, despite its popularity, previous work on link prediction generally focuses on a single problem setting: it assumes that link prediction is performed on one large graph and that this graph is relatively complete, i.e., that at least 50% of the true edges are observed during training (e.g., see grover2016node2vec; kipf2016variational; Liben-Nowell:2003:LPP:956863.956972; lu2011link).
In this work, we consider the more challenging setting of few-shot link prediction, where the goal is to perform link prediction on multiple graphs that each contain only a small fraction of their true, underlying edges. This task is inspired by applications where we have access to multiple graphs from a single domain but where each individual graph is only sparsely observed. For example, in the biological setting, high-throughput interactomics offers the possibility of estimating thousands of biological interaction networks from different tissues, cell types, and organisms (barrios2005high); however, these estimated relationships can be noisy and sparse, and we need learning algorithms that can leverage information across multiple graphs to overcome this sparsity. Similarly, in the e-commerce and social network settings, link prediction can often have a large impact in cases where we must quickly make predictions on sparsely-estimated graphs, such as when a service has recently been deployed to a new locale. In other words, link prediction for a new sparse graph can benefit from transferring knowledge from other, possibly denser, graphs, assuming there is exploitable shared structure.
We term this problem of link prediction from sparsely-estimated multi-graph data few-shot link prediction, by analogy to the popular few-shot classification setting (miller2000learning; lake2011one; koch2015siamese). The goal of few-shot link prediction is to observe many example graphs from a particular domain and leverage this experience to enable fast adaptation and higher accuracy when predicting edges on a new, sparsely-estimated graph from the same domain—a task that can also be viewed as a form of meta learning, or learning to learn (bengio1990learning; bengio1992optimization; thrun2012learning; schmidhuber1987evolutionary), in the context of link prediction. This few-shot link prediction setting is particularly challenging, as current link prediction methods are generally ill-equipped to transfer knowledge between graphs in a multi-graph setting and are also unable to effectively learn from very sparse data.
Present work. We introduce a new framework called Meta-Graph for few-shot link prediction, and we also introduce a series of benchmarks for this task. We adapt the classical gradient-based meta-learning formulation for few-shot classification (miller2000learning; lake2011one; koch2015siamese) to the graph domain. Specifically, we treat a distribution over graphs as the distribution over tasks from which a global set of parameters is learned, and we deploy this strategy to train graph neural networks (GNNs) that are capable of few-shot link prediction. To further bootstrap fast adaptation to new graphs, we also introduce a graph signature function, which learns how to map the structure of an input graph to an effective initialization point for a GNN link prediction model. We experimentally validate our approach on three link prediction benchmarks. We find that our Meta-Graph approach not only achieves fast adaptation but also converges to a better overall solution in many experimental settings, with an average improvement of in AUC at convergence over non-meta-learning baselines.
2 Preliminaries and Problem Definition
The basic setup for few-shot link prediction is as follows. We assume that we have a distribution $p(\mathcal{G})$ over graphs, from which we can sample training graphs $\mathcal{G}_i \sim p(\mathcal{G})$, where each $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i, X_i)$ is defined by a set of nodes $\mathcal{V}_i$, edges $\mathcal{E}_i$, and a matrix of real-valued node attributes $X_i \in \mathbb{R}^{|\mathcal{V}_i| \times d}$. When convenient, we will also equivalently represent a graph as $\mathcal{G}_i = (A_i, X_i)$, where $A_i$ is an adjacency matrix representation of the edges in $\mathcal{E}_i$. We assume that each sampled graph $\mathcal{G}_i$ is a simple graph (i.e., contains a single type of relation and no self-loops) and that every node $v \in \mathcal{V}_i$ in the graph is associated with a real-valued attribute vector from a common vector space. We further assume that for each graph we have access to only a sparse subset $\mathcal{E}_i^{\text{train}} \subset \mathcal{E}_i$ of the true edges (with $|\mathcal{E}_i^{\text{train}}| \ll |\mathcal{E}_i|$) during training. In terms of distributional assumptions, we assume that $p(\mathcal{G})$ is defined over a set of related graphs (e.g., graphs drawn from a common domain or application setting).
Our goal is to learn a global or meta link prediction model from a set of sampled training graphs $\mathcal{G}_1, \dots, \mathcal{G}_n \sim p(\mathcal{G})$, such that we can use this meta model to quickly learn an effective link prediction model on a newly sampled graph $\mathcal{G}^* \sim p(\mathcal{G})$. More specifically, we wish to optimize a global set of parameters $\theta$, as well as a graph signature function $\psi$, which can be used together to generate an effective parameter initialization, $\phi_i$, for a local link prediction model on graph $\mathcal{G}_i$.
Relationship to standard link prediction. Few shot link prediction differs from standard link prediction in three important ways:
[leftmargin=*, itemsep=2pt, topsep=0pt, parsep=0pt]
Rather than learning from a single graph $\mathcal{G}$, we are learning from multiple graphs $\mathcal{G}_1, \dots, \mathcal{G}_n$ sampled from a common distribution or domain.
We presume access to only a very sparse sample of true edges. Concretely, we focus on settings where at most 30% of the edges in $\mathcal{E}_i$ are observed during training, i.e., where $|\mathcal{E}_i^{\text{train}}| \leq 0.3 \times |\mathcal{E}_i|$. (By "true edges" we mean the full set of ground-truth edges available in a particular dataset.)
We distinguish between the global parameters $\theta$, which are used to encode knowledge about the underlying distribution of graphs, and the local parameters $\phi_i$, which are optimized to perform link prediction on a specific graph $\mathcal{G}_i$. This distinction allows us to leverage information from multiple graphs, while still allowing for individually-tuned link prediction models on each specific graph.
Relationship to traditional meta learning. Traditional meta learning for few-shot classification generally assumes a distribution over classification tasks, with the goal of learning global parameters that can facilitate fast adaptation to a newly sampled task with few examples. We instead consider a distribution over graphs, with the goal of performing link prediction on a newly sampled graph. An important complication of this graph setting is that the individual predictions for each graph (i.e., the training edges) are not i.i.d. Furthermore, for few-shot link prediction the training samples are a sparse subset of the true edges, representing only a small percentage of all edges in a graph. Note that for very small percentages of training edges we effectively remove all graph structure and recover the supervised setting for few-shot classification.
3 Proposed Approach
We now outline our proposed approach, Meta-Graph, to the few-shot link prediction problem. We first describe how we define the local link prediction models, which are used to perform link prediction on each specific graph $\mathcal{G}_i$. Next, we discuss our novel gradient-based meta learning approach to define a global model that can learn from multiple graphs to generate effective parameter initializations for the local models. The key idea behind Meta-Graph is that we use gradient-based meta learning to optimize a shared parameter initialization for the local models, while also learning a parametric encoding of each graph that can be used to modulate this parameter initialization in a graph-specific way (Figure 1).
3.1 Local Link Prediction Model
In principle, our framework can be combined with a wide variety of GNN-based link prediction approaches, but here we focus on variational graph autoencoders (VGAEs) (kipf2016variational) as our base link prediction framework. Formally, given a graph $\mathcal{G} = (A, X)$, the VGAE learns an inference model, $q_\phi$, that defines a distribution over node embeddings $Z \in \mathbb{R}^{|\mathcal{V}| \times d}$, where each row $z_v$ of $Z$ is a node embedding that can be used to score the likelihood of an edge existing between pairs of nodes. The parameters $\phi$ of the inference model are shared across all the nodes in $\mathcal{G}$, defining the approximate posterior
$q_\phi(Z \mid A, X) = \prod_{v \in \mathcal{V}} q_\phi(z_v \mid A, X)$ with $q_\phi(z_v \mid A, X) = \mathcal{N}(z_v \mid \mu_v, \mathrm{diag}(\sigma_v^2))$, where the parameters of the normal distribution are learned via GNNs:
$$\mu = \mathrm{GNN}_\mu(A, X), \qquad \log \sigma = \mathrm{GNN}_\sigma(A, X).$$
The generative component of the VGAE is then defined as
$$p(A_{u,v} = 1 \mid z_u, z_v) = \sigma(z_u^\top z_v),$$
i.e., the likelihood of an edge existing between two nodes, $u$ and $v$, is proportional to the dot product of their node embeddings. Given the above components, the inference GNNs can be trained to maximize the variational lower bound on the training data:
$$\mathcal{L} = \mathbb{E}_{q_\phi(Z \mid A, X)}\big[\log p(A \mid Z)\big] - \mathrm{KL}\big(q_\phi(Z \mid A, X)\,\|\,p(Z)\big),$$
where a Gaussian prior $p(Z) = \prod_{v} \mathcal{N}(z_v \mid 0, I)$ is used for $Z$.
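As a concrete illustration of these components, the following sketch implements a minimal VGAE-style forward pass in NumPy: a GCN encoder producing means and log standard deviations, a reparameterized sample of node embeddings, dot-product edge probabilities, and the negative of the variational lower bound. The toy graph, layer sizes, and random weights are purely illustrative and are not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_adj(A):
    # Symmetrically normalized adjacency with self-loops: D^{-1/2}(A+I)D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One GCN layer with ReLU activation
    return np.maximum(A_norm @ H @ W, 0.0)

def vgae_forward(A, X, params):
    A_norm = normalize_adj(A)
    H = gcn_layer(A_norm, X, params["W0"])       # shared first layer
    mu = A_norm @ H @ params["W_mu"]             # GNN_mu
    log_sigma = A_norm @ H @ params["W_sigma"]   # GNN_sigma
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(log_sigma) * eps             # reparameterization trick
    logits = Z @ Z.T                             # dot-product decoder
    probs = 1.0 / (1.0 + np.exp(-logits))        # edge probabilities
    # Negative ELBO: Bernoulli reconstruction of A plus KL(q || N(0, I))
    recon = -np.mean(A * np.log(probs + 1e-9) + (1 - A) * np.log(1 - probs + 1e-9))
    kl = -0.5 * np.mean(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma))
    return probs, recon + kl

# Toy graph: a 4-cycle with 3-dimensional node features.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 3))
params = {"W0": rng.standard_normal((3, 8)) * 0.1,
          "W_mu": rng.standard_normal((8, 2)) * 0.1,
          "W_sigma": rng.standard_normal((8, 2)) * 0.1}
probs, loss = vgae_forward(A, X, params)
```

In practice the encoder would be trained by gradient ascent on the lower bound; the sketch only shows the forward computation.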
We build upon VGAEs due to their strong performance on standard link prediction benchmarks (kipf2016variational), as well as the fact that they have a well-defined probabilistic interpretation that generalizes many embedding-based approaches to link prediction (e.g., node2vec (grover2016node2vec)). We describe the specific GNN implementations we deploy for the inference model in Section 3.3.
3.2 Overview of Meta-Graph
The key idea behind Meta-Graph is that we use gradient-based meta learning to optimize a shared parameter initialization for the inference models of a VGAE, while also learning a parametric encoding that modulates this parameter initialization in a graph-specific way. Specifically, given a sampled training graph $\mathcal{G}_i \sim p(\mathcal{G})$, we initialize the inference model for a VGAE link prediction model using a combination of two learned components:
[leftmargin=*, itemsep=2pt, topsep=0pt, parsep=0pt]
A global initialization, $\theta$, that is used to initialize all the parameters of the GNNs in the inference model. The global parameters $\theta$ are optimized via second-order gradient descent to provide an effective initialization point for any graph sampled from the distribution $p(\mathcal{G})$.
A graph signature $s_{\mathcal{G}_i} = \psi(\mathcal{G}_i)$ that is used to modulate the parameters of the inference model based on the history of observed training graphs. In particular, we assume that the inference model for each graph can be conditioned on the graph signature. That is, we augment the inference model to $q(Z \mid A_i, X_i, s_{\mathcal{G}_i})$, where we also include the graph signature as a conditioning input. We use a $k$-layer graph convolutional network (GCN) (kipf2016semi), with sum pooling, to compute the signature:
$$s_{\mathcal{G}} = \mathrm{MLP}\Big(\sum_{v \in \mathcal{V}} \mathrm{GCN}(A, X)_v\Big),$$
where GCN denotes a $k$-layer GCN (as defined in kipf2016semi), MLP denotes a densely-connected neural network, and we sum over the node embeddings output by the GCN. As with the global parameters $\theta$, the graph signature model $\psi$ is optimized via second-order gradient descent.
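The signature computation above can be sketched as follows: a stack of GCN layers, permutation-invariant sum pooling over nodes, and an MLP. The layer sizes and random weights below are illustrative placeholders, not the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize_adj(A):
    # Symmetrically normalized adjacency with self-loops
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def graph_signature(A, X, gcn_weights, mlp_weights):
    """s_G = MLP(sum_v GCN(A, X)_v): a fixed-size vector per graph."""
    A_norm = normalize_adj(A)
    H = X
    for W in gcn_weights:              # k-layer GCN
        H = np.maximum(A_norm @ H @ W, 0.0)
    s = H.sum(axis=0)                  # sum pooling over nodes
    for W in mlp_weights:              # densely-connected MLP
        s = np.maximum(s @ W, 0.0)
    return s

# Toy 3-node star graph with 4-dimensional features.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.standard_normal((3, 4))
gcn_ws = [rng.standard_normal((4, 8)) * 0.1, rng.standard_normal((8, 8)) * 0.1]
mlp_ws = [rng.standard_normal((8, 6)) * 0.1]
s = graph_signature(A, X, gcn_ws, mlp_ws)
```

Because sum pooling is permutation invariant, relabeling the nodes of the input graph leaves the signature unchanged, which is the property that lets the signature act as a graph-level encoding.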
The overall Meta-Graph architecture is detailed in Figure 1 and the core learning algorithm is summarized in the algorithm block below.
The basic idea behind the algorithm is that we (i) sample a batch of training graphs, (ii) initialize VGAE link prediction models for these training graphs using our global parameters and signature function, (iii) run $k$ steps of gradient descent to optimize each of these VGAE models, and (iv) use second-order gradient descent to update the global parameters and signature function based on a held-out validation set of edges. As depicted in Figure 1, this corresponds to updating the GCN-based encoder for the local link prediction parameters $\phi_i$ and the global parameters $\theta$, along with the graph signature function $\psi$, using second-order gradients. Note that since we are running $k$ steps of gradient descent within the inner loop of Algorithm 1, we are also "meta" optimizing for fast adaptation: $\theta$ and $\psi$ are trained via second-order gradient descent to optimize the local model performance after $k$ gradient updates, where generally $k$ is small.
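The inner/outer loop structure can be sketched with a deliberately simplified, hypothetical example. For readability, this sketch uses a first-order approximation (plain gradients through the adapted parameters, as in first-order MAML) rather than the true second-order update, a toy quadratic loss standing in for the VGAE objective, and omits the graph signature entirely; none of these simplifications reflect the actual Meta-Graph training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def task_loss(phi, target):
    # Toy per-graph objective standing in for the VGAE loss: ||phi - target||^2
    return float(np.sum((phi - target) ** 2))

def task_loss_grad(phi, target):
    return 2.0 * (phi - target)

def meta_train(theta, targets, inner_steps=5, inner_lr=0.1, outer_lr=0.05, epochs=50):
    for _ in range(epochs):
        outer_grad = np.zeros_like(theta)
        for target in targets:            # (i) sample a batch of "training graphs"
            phi = theta.copy()            # (ii) initialize the local model from theta
            for _ in range(inner_steps):  # (iii) k steps of inner-loop adaptation
                phi -= inner_lr * task_loss_grad(phi, target)
            # (iv) first-order outer update: gradient of the post-adaptation
            # loss, evaluated at the adapted parameters
            outer_grad += task_loss_grad(phi, target)
        theta -= outer_lr * outer_grad / len(targets)
    return theta

# Four toy "graphs", each with its own optimum.
targets = [rng.standard_normal(3) for _ in range(4)]
theta0 = np.zeros(3)
theta = meta_train(theta0.copy(), targets)
before = np.mean([task_loss(theta0, t) for t in targets])
after = np.mean([task_loss(theta, t) for t in targets])
```

After meta-training, the shared initialization sits closer (on average) to each task's optimum than the naive starting point, which is exactly the property the outer loop optimizes for.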
3.3 Variants of Meta-Graph
We consider several concrete instantiations of the Meta-Graph framework, which differ in terms of how the output of the graph signature function is used to modulate the parameters of the VGAE inference models. For all the Meta-Graph variants, we build upon the standard GCN propagation rule (kipf2016semi) to construct the VGAE inference models. In particular, we assume that all the inference GNNs (Equation 1) are defined by stacking neural message passing layers of the form:
$$h_v^{(l)} = \sigma\Big(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{m(s_{\mathcal{G}}) \odot \big(W^{(l)} h_u^{(l-1)}\big)}{\sqrt{|\mathcal{N}(u)|\,|\mathcal{N}(v)|}}\Big),$$
where $h_v^{(l)}$ denotes the embedding of node $v$ at layer $l$ of the model, $\mathcal{N}(v)$ denotes the nodes in the graph neighborhood of $v$, and $W^{(l)}$ is a trainable weight matrix for layer $l$. The key difference between Equation 5 and the standard GCN propagation rule is that we add the modulation function $m$, which is used to modulate the message passing based on the graph signature $s_{\mathcal{G}}$.
We describe different variations of this modulation below. In all cases, the intuition behind this modulation is that we want to compute a structural signature from the input graphs that can be used to condition the initialization of the local link prediction models. We expect this graph signature to encode structural properties of sampled graphs in order to modulate the parameters of the local VGAE link prediction models and adapt them to the current graph.
GS-Modulation. Inspired by brockschmidt2019gnn, we experiment with basic feature-wise linear modulation (strub2018visual) to define the modulation function $m$:
$$m(s_{\mathcal{G}}) \odot \big(W^{(l)} h_u^{(l-1)}\big) = \gamma^{(l)} \odot \big(W^{(l)} h_u^{(l-1)}\big) + \beta^{(l)}, \qquad \gamma^{(l)}, \beta^{(l)} = \psi(\mathcal{G}).$$
Here, we restrict the modulation terms $\gamma^{(l)}$ and $\beta^{(l)}$ output by the signature function to lie in $[0, 1]$ by applying a non-linearity after Equation 4.
GS-Gating. Feature-wise linear modulation of the GCN parameters (Equation 3.3) is an intuitive and simple choice that provides flexible modulation while still being relatively constrained. However, one drawback of basic linear modulation is that it is "always on", and there may be instances where the modulation could actually be counter-productive to learning. To allow the model to adaptively learn when to apply modulation, we extend the feature-wise linear modulation using a sigmoid gating term, $g^{(l)}$ (with entries in $[0,1]$), that gates the influence of $\gamma^{(l)}$ and $\beta^{(l)}$:
$$\tilde{\gamma}^{(l)} = g^{(l)} \odot \gamma^{(l)} + \big(1 - g^{(l)}\big), \qquad \tilde{\beta}^{(l)} = g^{(l)} \odot \beta^{(l)},$$
so that as the gate closes ($g^{(l)} \to 0$) the layer recovers the unmodulated GCN update.
GS-Weights. In the final variant of Meta-Graph, we extend the gating and modulation idea by separately aggregating graph neighborhood information with and without modulation and then merging these two signals via a convex combination:
$$h_v^{(l)} = \lambda^{(l)} \odot \tilde{h}_v^{(l)} + \big(1 - \lambda^{(l)}\big) \odot \bar{h}_v^{(l)},$$
where $\tilde{h}_v^{(l)}$ denotes the aggregation with modulation, $\bar{h}_v^{(l)}$ the aggregation without it, and $\lambda^{(l)} \in [0,1]$ is output by the signature function. We use the basic linear modulation (Equation 3.3) to define $\tilde{h}_v^{(l)}$.
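The three modulation schemes can be sketched side by side. This is a hypothetical NumPy sketch: the function names, the tiny identity "graph", and the scalar modulation terms are illustrative stand-ins for the learned, per-layer vector outputs of the signature function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_message(A_norm, H, W):
    # Pre-activation neighborhood aggregation of a GCN layer
    return A_norm @ H @ W

def gs_modulation(msg, gamma, beta):
    # GS-Modulation: feature-wise linear (FiLM-style) modulation,
    # with gamma and beta produced by the graph signature function
    return gamma * msg + beta

def gs_gating(msg, gamma, beta, gate_logits):
    # GS-Gating: a sigmoid gate interpolates between the modulated and
    # unmodulated messages, so modulation can be adaptively switched off
    g = sigmoid(gate_logits)
    return g * gs_modulation(msg, gamma, beta) + (1.0 - g) * msg

def gs_weights(A_norm, H, W_mod, W_plain, gamma, beta, lam_logits):
    # GS-Weights: aggregate separately with and without modulation,
    # then merge the two signals via a convex combination
    modulated = gs_modulation(gcn_message(A_norm, H, W_mod), gamma, beta)
    plain = gcn_message(A_norm, H, W_plain)
    lam = sigmoid(lam_logits)
    return lam * modulated + (1.0 - lam) * plain

# Tiny worked example: identity "adjacency" so messages equal the features.
A_norm = np.eye(2)
H = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.eye(2)
msg = gcn_message(A_norm, H, W)
```

Note the design difference: GS-Gating shares one weight matrix and gates the modulation of its output, whereas GS-Weights maintains separate aggregation pathways and mixes their results.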
3.4 MAML for link prediction as a special case
Note that a simplification of Meta-Graph, in which the graph signature function is removed, can be viewed as an adaptation of model-agnostic meta learning (MAML) (finn2017model) to the few-shot link prediction setting. As discussed in Section 2, there are important differences in the setup for few-shot link prediction compared to traditional few-shot classification. Nonetheless, the core idea of leveraging an inner and an outer loop of training in Algorithm 1—as well as using second-order gradients to optimize the global parameters—can be viewed as an adaptation of MAML to the graph setting, and we provide comparisons to this simplified MAML approach in the experiments below. We formalize the key differences by depicting the graphical model of MAML, as first presented in (grant2018recasting), and contrasting it with the graphical model for Meta-Graph in Figure 1. MAML, when reinterpreted for a distribution over graphs, maximizes the likelihood over all edges in the distribution. Meta-Graph, when recast in a hierarchical Bayesian framework, adds a graph signature function $\psi$ that influences $\phi_i$ to produce the modulated parameters from sampled edges. This explicit influence of $\psi$ is captured by the term $p(\phi_i \mid s_{\mathcal{G}_i}, \theta)$ in Equation 7 below:
$$\max_{\theta, \psi} \prod_{i=1}^{n} p\big(\mathcal{E}_i \mid \theta, \psi\big) = \prod_{i=1}^{n} \int p\big(\mathcal{E}_i \mid \phi_i\big)\, p\big(\phi_i \mid s_{\mathcal{G}_i}, \theta\big)\, d\phi_i \qquad (7)$$
For computational tractability, we take the likelihood of the modulated parameters to be a point estimate, i.e., $p(\phi_i \mid s_{\mathcal{G}_i}, \theta) = \delta\big(\phi_i - \hat{\phi}_i\big)$, where $\hat{\phi}_i$ denotes the parameters obtained by inner-loop adaptation.
4 Experiments
We design three novel benchmarks for the few-shot link prediction task. All of these benchmarks contain a set of graphs drawn from a common domain. In all settings, we use 80% of these graphs for training and 10% as validation graphs, where these training and validation graphs are used to optimize the global model parameters (for Meta-Graph) or to pre-train weights (for various baseline approaches). We then provide the remaining 10% of the graphs as test graphs, and our goal is to fine-tune or train a model on these test graphs to achieve high link prediction accuracy. Note that in this few-shot link prediction setting, there are train/val/test splits at both the level of graphs and the level of edges: for every individual graph, we are optimizing a model using the training edges to predict the likelihood of the test edges, but we are also training on multiple graphs with the goal of facilitating fast adaptation to new graphs via the global model parameters.
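The two-level split described above can be sketched as follows. This is a hypothetical helper (the function name and the default edge-level fractions are illustrative assumptions; the paper's exact validation fraction for held-out edges is not specified here).

```python
import random

def split_benchmark(graphs, edge_train_frac=0.1, edge_val_frac=0.1, seed=0):
    """Two-level split: 80/10/10 over graphs, then a sparse train/val/test
    split over each graph's edges. `graphs` maps a graph id to its edge list."""
    rng = random.Random(seed)
    ids = sorted(graphs)
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    graph_split = {"train": ids[:n_train],
                   "val": ids[n_train:n_train + n_val],
                   "test": ids[n_train + n_val:]}
    edge_splits = {}
    for gid, edges in graphs.items():
        edges = list(edges)
        rng.shuffle(edges)
        k = max(1, int(edge_train_frac * len(edges)))   # sparse training edges
        v = max(1, int(edge_val_frac * len(edges)))     # held-out validation edges
        edge_splits[gid] = {"train": edges[:k],
                            "val": edges[k:k + v],
                            "test": edges[k + v:]}
    return graph_split, edge_splits
```

Every graph, including the training graphs, receives its own edge-level split, since the outer loop evaluates post-adaptation performance on held-out edges.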
Our goal is to use our benchmarks to investigate four key empirical questions:
[leftmargin=18pt, topsep=0pt, parsep=0pt, itemsep=2pt]
How does the overall performance of Meta-Graph compare to various baselines, including (i) a simple adaptation of MAML (finn2017model) (i.e., an ablation of Meta-Graph where the graph signature function is removed), (ii) standard pre-training approaches, where we pre-train the VGAE model on the training graphs before fine-tuning on the test graphs, and (iii) naive baselines that do not leverage multi-graph information (i.e., a basic VGAE without pre-training, the Adamic-Adar heuristic (adamic2003friends), and DeepWalk (perozzi2014deepwalk))?
How well does Meta-Graph perform in terms of fast adaptation? Is Meta-Graph able to achieve strong performance after only a small number of gradient steps on the test graphs?
How necessary is the graph signature function for strong performance, and how do the different variants of the Meta-Graph signature function compare across the various benchmark settings?
What is learned by the graph signature function? For example, do the learned graph signatures correlate with the structural properties of the input graphs, or are they more sensitive to node feature information?
Table 1: Dataset statistics: #Graphs, Avg. Nodes, Avg. Edges, and #Node Feats per dataset.
Datasets. Two of our benchmarks are derived from standard multi-graph datasets from protein-protein interaction (PPI) networks (zitnik2017predicting) and 3D point cloud data (FirstMM-DB) (neumann2013graph). These benchmarks are traditionally used for node and graph classification, respectively, but we adapt them for link prediction. We also create a novel multi-graph dataset based upon the AMINER citation data (tang2008arnetminer), where each node corresponds to a paper and links represent citations. We construct individual graphs from AMINER data by sampling ego networks around nodes and create node features using embeddings of the paper abstracts (see Appendix for details). We preprocess all graphs in each domain such that each graph contains a minimum of nodes and up to a maximum of nodes. For all datasets, we perform link prediction by training on a small subset (i.e., a percentage) of the edges and then attempting to predict the unseen edges (with of the held-out edges used for validation). Key dataset statistics are summarized in Table 1.
Baseline details. Several baselines correspond to modifications or ablations of Meta-Graph, including the straightforward adaptation of MAML (termed MAML in the results) and a fine-tuning baseline where we pre-train a VGAE on the training graphs, observed in sequential order, and then fine-tune on the test graphs (termed Finetune). We also consider a VGAE trained individually on each test graph (termed No Finetune). For Meta-Graph and all of these baselines, we employ Bayesian optimization with Thompson sampling (kandasamy2018parallelised) to perform hyperparameter selection using the validation sets. We use the recommended default hyperparameters for DeepWalk, and the Adamic-Adar baseline is hyperparameter-free. (Code is included with our submission and will be made public after the review process.)
Q1: Overall Performance. Table 2 shows the link prediction AUC for Meta-Graph and the baseline models when trained to convergence using 10%, 20%, or 30% of the graph edges. In this setting, we adapt the link prediction models on the test graphs until learning converges, as determined by performance on the validation set of edges, and we report the average link prediction AUC over the test edges of the test graphs. Overall, we find that Meta-Graph achieves the highest average AUC in all but one setting, with an average relative improvement of in AUC compared to the MAML approach and an improvement of compared to the Finetune baseline. Notably, Meta-Graph is able to maintain especially strong performance when using only of the graph edges for training, highlighting how our framework can learn from very sparse samples of edges. Interestingly, on the Ego-AMINER dataset, unlike PPI and FirstMM DB, we observe that the relative difference in performance between Meta-Graph and MAML increases with the density of the training set. We hypothesize that this is due to the fickle nature of optimization with higher-order gradients in MAML (antoniou2018train), which is somewhat alleviated in GS-Gating due to the gating mechanism. With respect to computational complexity, we observe only a slight overhead when comparing Meta-Graph to MAML, which is explained by the fact that the graph signature function is not updated in the inner loop but only in the outer loop. In the Appendix, we provide additional results when using larger sets of training edges, and, as expected, we find that the relative gains of Meta-Graph decrease as more and more training edges are available.
Q2: Fast Adaptation. Table 3 highlights the average AUCs achieved by Meta-Graph and the baselines after performing only 5 gradient updates on the batch of training edges. Note that in this setting we only compare to the MAML, Finetune, and No Finetune baselines, as fast adaptation is not well defined for the DeepWalk and Adamic-Adar baselines. In terms of fast adaptation, we again find that Meta-Graph is able to outperform all the baselines in all but one setting, with an average relative improvement of compared to MAML and compared to the Finetune baseline—highlighting that Meta-Graph can not only learn from sparse samples of edges but is also able to quickly learn on new data using only a small number of gradient steps. We also observe poor performance for MAML on the Ego-AMINER dataset, which we hypothesize is due to the extremely low learning rates needed for any learning to occur; the addition of a graph signature alleviates this problem. Figure 2 shows the learning curves for the various models on the PPI and FirstMM DB datasets, where we can see that Meta-Graph learns very quickly but can also begin to overfit after only a small number of gradient updates, making early stopping essential.
Q3: Choice of Meta-Graph Architecture. We study the impact of the graph signature function and its GS-Gating and GS-Weights variants by performing an ablation study on the FirstMM DB dataset. Figure 3 shows the performance of the different model variants and baselines as training progresses. In addition to models that utilize different signature functions, we report a random baseline in which parameters are initialized but never updated, allowing us to assess the inherent power of the VGAE model for few-shot link prediction. To better understand the utility of using a GCN-based inference network, we also report, as a baseline, a VGAE model that uses a simple MLP on the node features and is trained analogously to Meta-Graph. As shown in Figure 3, many versions of the signature function start at a better initialization point or quickly achieve higher AUC scores in comparison to MAML and the other baselines, but simple modulation and GS-Gating are superior to GS-Weights after a few gradient steps.
Q4: What is learned by the graph signature? To gain further insight into what knowledge is transferable among graphs, we use the FirstMM DB and Ego-AMINER datasets to probe and compare the output of the signature function with various graph heuristics. In particular, we treat the output of $\psi(\mathcal{G})$ as a vector and compute the cosine similarity between all pairs of graphs in the training set (i.e., we compute the pairwise cosine similarities between graph signatures). We similarly compute three pairwise graph statistics—namely, the cosine similarity between average node features in the graphs, the difference in number of nodes, and the difference in number of edges—and we compute the Pearson correlation between the pairwise graph-signature similarities and these other pairwise statistics. As shown in Table 4, we find a strong positive Pearson correlation between node features and the output of the signature function for both datasets, indicating that the graph signature function is highly sensitive to feature information. This observation is not entirely surprising given that we use such sparse samples of edges—meaning that many structural graph properties are likely lost, which makes the meta-learning heavily reliant on node feature information. We also observe a moderate negative correlation with respect to the average difference in nodes and edges between pairs of graphs for the FirstMM DB dataset. For Ego-AMINER, we observe a small positive correlation for the difference in nodes and edges.
Table 4 (excerpt): Pearson correlation between pairwise graph-signature similarity and pairwise graph statistics, for FirstMM DB (first three columns) and Ego-AMINER (last three columns):
Diff Num. Nodes: -0.093 / -0.196 / -0.286 | 0.095 / 0.086 / 0.085
Diff Num. Edges: -0.093 / -0.195 / -0.281 | 0.093 / 0.072 / 0.075
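The probing analysis above can be sketched as follows: compute pairwise cosine similarities between signature vectors, flatten the upper triangle into a vector over graph pairs, and correlate it with a pairwise statistic. The random "signatures" and node counts below are synthetic placeholders for the learned quantities.

```python
import numpy as np

def pairwise_cosine(S):
    # S: one signature vector per row
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    return Sn @ Sn.T

def upper_triangle(M):
    # Flatten all unordered graph pairs (i < j) into a vector
    i, j = np.triu_indices(M.shape[0], k=1)
    return M[i, j]

def pearson(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return float(np.sum(x * y) / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(3)
signatures = rng.standard_normal((6, 8))       # stand-in for psi(G) per graph
num_nodes = rng.integers(50, 150, size=6)      # stand-in for graph sizes

sig_sims = upper_triangle(pairwise_cosine(signatures))
node_diffs = upper_triangle(np.abs(num_nodes[:, None] - num_nodes[None, :])).astype(float)
r = pearson(sig_sims, node_diffs)
```

The same recipe applies to any pairwise statistic (average-feature cosine similarity, edge-count differences), yielding one correlation coefficient per statistic as in Table 4.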
5 Related Work
We now briefly highlight related work on link prediction, meta-learning, few-shot classification, and few-shot learning in knowledge graphs. Link prediction considers the problem of predicting missing or future edges between pairs of nodes in a graph (Liben-Nowell:2003:LPP:956863.956972). Successful applications of link prediction include friend and content recommendation (Aiello:2012:FPH:2180861.2180866), shopping and movie recommendation (Huang:2005:LPA:1065385.1065415), knowledge graph completion (nickel2015review), and even important social applications such as identifying criminals based on past activities (Hasan06linkprediction). Historically, link prediction methods have utilized topological graph features (e.g., neighborhood overlap), yielding strong baselines such as the Adamic/Adar measure (adamic2003friends). Other approaches include matrix factorization methods (Menon:2011:LPV:2034117.2034146) and, more recently, deep learning and graph neural network based approaches (grover2016node2vec; wang2015link; zhang2018link). A commonality among all the above approaches is that the link prediction problem is defined over a single dense graph, where the objective is to predict unknown or future links within that same graph. Unlike these previous approaches, our approach considers link prediction tasks over multiple sparse graphs drawn from a distribution over graphs, akin to real-world scenarios such as protein-protein interaction graphs, 3D point cloud data, and citation graphs from different communities.
In meta-learning, or learning to learn (bengio1990learning; bengio1992optimization; thrun2012learning; schmidhuber1987evolutionary), the objective is to learn from prior experiences to form inductive biases that enable fast adaptation to unseen tasks. Meta-learning has been particularly effective in few-shot learning tasks, with notable approaches broadly classified into metric-based approaches (vinyals2016matching; snell2017prototypical; koch2015siamese), augmented-memory approaches (santoro2016meta; kaiser2017learning; mishra2017simple), and optimization-based approaches (finn2017model; lee2018gradient). Recently, several works lie at the intersection of meta-learning for few-shot classification and graph-based learning. In Latent Embedding Optimization, rusu2018meta learn a graph between tasks in embedding space, while liu2019propagate introduce a message propagation rule between class prototypes. However, both of these methods are restricted to the image domain and do not consider meta-learning over a distribution of graphs as done here.
Another related line of work considers the task of few-shot relation prediction in knowledge graphs. xiong2018one developed the first method for this task, which leverages a learned matching metric based on both learned embeddings and one-hop graph structures. More recently, chen2019meta introduced the Meta Relational Learning framework (MetaR), which seeks to transfer relation-specific meta information to new relation types in the knowledge graph. A key distinction between the few-shot relation setting and the one we consider in this work is that we assume a distribution over graphs, while in the knowledge graph setting there is only a single graph and the challenge is generalizing to new types of relations within this graph.
6 Discussion and Conclusion
We introduce the problem of few-shot link prediction—where the goal is to learn from multiple graph datasets to perform link prediction using small samples of graph data—and we develop the Meta-Graph framework to address this task. Our framework adapts gradient-based meta learning to optimize a shared parameter initialization for local link prediction models, while also learning a parametric encoding, or signature, of each graph, which can be used to modulate this parameter initialization in a graph-specific way. Empirically, we observed substantial gains using Meta-Graph compared to strong baselines on three distinct few-shot link prediction benchmarks. In terms of limitations and directions for future work, one key limitation is that our graph signature function is limited to modulating the local link prediction model through an encoding of the current graph, which does not explicitly capture the pairwise similarity between graphs in the dataset. Extending Meta-Graph by learning a similarity metric or kernel between graphs—which could then be used to condition meta-learning—is a natural direction for future work. Another interesting direction for future work is extending the Meta-Graph approach to multi-relational data, and exploiting similarities between relation types through a suitable graph signature function.
The authors would like to thank Thang Bui, Maxime Wabartha, Nadeem Ward, Sebastien Lachapelle, and Zhaocheng Zhu for helpful feedback on earlier drafts of this work. In addition, the authors would like to thank the Uber AI team including other interns that helped shape earlier versions of this idea. Joey Bose is supported by the IVADO PhD fellowship and this work was done as part of his internship at Uber AI.
7.1 A: Ego-AMINER Dataset Construction
To construct the Ego-AMINER dataset, we first create citation graphs from different fields of study. We then select the top graphs in terms of number of nodes for further pre-processing. Specifically, we take the $k$-core of each graph, ensuring that each node has a minimum of $k$ edges. We then construct ego networks by randomly sampling a node from the $k$-core graph and taking its two-hop neighborhood. Finally, we remove graphs that are too small or too large in terms of node count, which leads to the total number of graphs reported in Table 1.
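The two preprocessing steps, $k$-core extraction and two-hop ego network sampling, can be sketched in pure Python (no graph library assumed; the toy citation graph and function names are illustrative).

```python
from collections import deque

def k_core(adj, k):
    """Iteratively remove nodes of degree < k. adj: {node: set(neighbors)}."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj[u]:
                    adj[v].discard(u)   # detach u from its neighbors
                del adj[u]
                changed = True
    return adj

def two_hop_ego(adj, center):
    """Induced subgraph on nodes within two hops of `center` (BFS)."""
    dist = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        if dist[u] == 2:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    keep = set(dist)
    return {u: adj[u] & keep for u in keep}

# Toy citation graph: a path 0-1-2-3-4 plus a triangle 2-5-6.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6), (6, 2)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

core2 = k_core(adj, 2)          # peels away the path, leaving the triangle
ego = two_hop_ego(adj, 0)       # nodes within two hops of node 0
```

On the toy graph, the 2-core is exactly the triangle {2, 5, 6}, since the path nodes are iteratively peeled away once their degree drops below 2.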
7.2 B: Additional Results
We list complete results when using larger sets of training edges for the PPI, FirstMM DB, and Ego-AMINER datasets, reporting the average AUC across all test graphs. As expected, we find that the relative gains of Meta-Graph decrease as more training edges become available.