# Relational Representation Learning for Dynamic (Knowledge) Graphs: A Survey

Graphs arise naturally in many real-world applications including social networks, recommender systems, ontologies, biology, and computational finance. Traditionally, machine learning models for graphs have been mostly designed for static graphs. However, many applications involve evolving graphs. This introduces important challenges for learning and inference since nodes, attributes, and edges change over time. In this survey, we review the recent advances in representation learning for dynamic graphs, including dynamic knowledge graphs. We describe existing models from an encoder-decoder perspective, categorize these encoders and decoders based on the techniques they employ, and analyze the approaches in each category. We also review several prominent applications and widely used datasets, and highlight directions for future research.


## 1 Introduction

In the era of big data, a challenge is to leverage data as effectively as possible to extract patterns, make predictions, and more generally unlock value. In many situations, the data does not consist only of vectors of features, but also relations that form graphs among entities. Graphs naturally arise in social networks (users with friendship relations, emails, text messages), recommender systems (users and products with transactions and rating relations), ontologies (concepts with relations), computational biology (protein-protein interactions), computational finance (web of companies with competitor, customer, subsidiary relations, supply chain graph, graph of customer-merchant transactions), etc. While it is often possible to ignore relations and use traditional machine learning techniques based on vectors of features, relations provide additional valuable information that permits inference among nodes. Hence, graph-based techniques have emerged as leading approaches in the industry for application domains with relational information.

Traditionally, research has been done mostly on static graphs where nodes and edges are fixed and do not change over time. Many applications, however, involve dynamic graphs. For instance, in social media, communication events such as emails and text messages are streaming while friendship relations evolve over time. In recommender systems, new products, new users and new ratings appear every day. In computational finance, transactions are streaming and supply chain relations are continuously evolving. As a result, the last few years have seen a surge of works on dynamic graphs. This survey focuses precisely on dynamic graphs. Note that there are already many good surveys on static graphs Hamilton et al. (2017b); Zhang et al. (2018a); Cai et al. (2018); Cui et al. (2018); Nickel et al. (2016a); Wang et al. (2017a). There are also several surveys on techniques for dynamic graphs Bilgin and Yener (2006); Zhang (2010); Spiliopoulou (2011); Aggarwal and Subbian (2014); Al Hasan and Zaki (2011), but they do not review recent advances in neural representation learning.

We present a survey that focuses on recent representation learning techniques for dynamic graphs. More precisely, we focus on reviewing techniques that either produce time-dependent embeddings that capture the essence of the nodes and edges of evolving graphs or use embeddings to answer various questions such as node classification, event prediction/interpolation, and link prediction. Accordingly, we use an encoder-decoder framework to categorize and analyze techniques that encode various aspects of graphs into embeddings and other techniques that decode embeddings into predictions. We survey techniques that deal with discrete- and/or continuous-time events.

The survey is structured as follows. Section 2 introduces the notation and provides some background about static/dynamic graphs, inference tasks, and learning techniques. Section 3 provides an overview of representation learning techniques for static graphs. This section is not meant to be a survey, but rather to introduce important concepts that will be extended for dynamic graphs. Section 4 categorizes decoders for dynamic graphs into time-predicting and time-conditioned decoders, and surveys the decoders in each category. Section 5 describes encoding techniques that aggregate temporal observations and static features, use time as a regularizer, perform decompositions, traverse dynamic networks with random walks, and model observation sequences with various types of processes (e.g., recurrent neural networks). Section 6 briefly describes other lines of work that do not conform to the encoder-decoder framework, such as statistical relational learning, and topics related to dynamic (knowledge) graphs such as spatiotemporal graphs and the construction of dynamic knowledge graphs from text. Section 7 reviews important applications of dynamic graphs with representative tasks. A list of static and temporal datasets is also provided with a brief summary of their properties. Section 8 concludes the survey with a discussion of several open problems and possible research directions.

## 2 Background and Notation

In this section, we define our notation and provide the necessary background for readers to follow the rest of the survey. A summary of the main notation and abbreviations can be found in Table 1.

We use lower-case letters to denote scalars, bold lower-case letters to denote vectors, and bold upper-case letters to denote matrices. For a vector $\mathbf{z}$, we represent the $i^{th}$ element of the vector as $\mathbf{z}[i]$. For a matrix $\mathbf{A}$, we represent the $i^{th}$ row of $\mathbf{A}$ as $\mathbf{A}[i]$, and the element at the $i^{th}$ row and $j^{th}$ column as $\mathbf{A}[i][j]$. $||\mathbf{z}||_i$ represents norm $i$ of a vector $\mathbf{z}$ and $||\mathbf{A}||_F$ represents the Frobenius norm of a matrix $\mathbf{A}$. For two vectors $\mathbf{z}_1 \in \mathbb{R}^{d_1}$ and $\mathbf{z}_2 \in \mathbb{R}^{d_2}$, we use $[\mathbf{z}_1; \mathbf{z}_2] \in \mathbb{R}^{d_1 + d_2}$ to represent the concatenation of the two vectors. When $d_1 = d_2 = d$, we use $[\mathbf{z}_1\ \mathbf{z}_2] \in \mathbb{R}^{d \times 2}$ to represent a matrix whose two columns correspond to $\mathbf{z}_1$ and $\mathbf{z}_2$ respectively. We use $\odot$ to represent element-wise (Hadamard) multiplication. We represent by $\mathbf{I}_d$ the identity matrix of size $d \times d$. $\mathrm{vec}(\mathbf{A})$ vectorizes $\mathbf{A} \in \mathbb{R}^{d_1 \times d_2}$ into a vector of size $d_1 d_2$. $\mathrm{diag}(\mathbf{z})$ turns $\mathbf{z} \in \mathbb{R}^{d}$ into a diagonal matrix of size $d \times d$ that has the values of $\mathbf{z}$ on its main diagonal. We denote the transpose of a matrix $\mathbf{A}$ as $\mathbf{A}'$.

### 2.1 Static Graphs

A (static) graph is represented as $G = (V, E)$ where $V = \{v_1, v_2, \dots, v_{|V|}\}$ is the set of vertices and $E \subseteq V \times V$ is the set of edges. Vertices are also called nodes and we use the two terms interchangeably. Edges are also called links and we use the two terms interchangeably.

Several matrices can be associated with a graph. An adjacency matrix $\mathbf{A} \in \mathbb{R}^{|V| \times |V|}$ is a matrix where $\mathbf{A}[i][j] = 0$ if $(v_i, v_j) \notin E$; otherwise $\mathbf{A}[i][j] \in \mathbb{R}^{+}$ represents the weight of the edge. For unweighted graphs, all non-zero $\mathbf{A}[i][j]$s are $1$. A degree matrix $\mathbf{D} \in \mathbb{R}^{|V| \times |V|}$ is a diagonal matrix where $\mathbf{D}[i][i]$ represents the degree of $v_i$. A graph Laplacian is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$.

A graph is undirected if the order of the nodes in the edges is not important. For an undirected graph, the adjacency matrix is symmetric, i.e. $\mathbf{A}[i][j] = \mathbf{A}[j][i]$ for all $i$ and $j$. A graph is directed if the order of the nodes in the edges is important. Directed graphs are also called digraphs. For an edge $(v_i, v_j)$ in a digraph, we call $v_i$ the source and $v_j$ the target of the edge. A graph is bipartite if the nodes can be split into two groups where there is no edge between any pair of nodes in the same group. A multigraph is a graph where multiple edges can exist between two nodes. A graph is attributed if each node is associated with a number of properties representing its characteristics. For a node $v$ in an attributed graph, we let $\mathbf{x}_v$ represent the attribute values of $v$. When all nodes have the same attributes, we represent all attribute values of the nodes by a matrix $\mathbf{X}$ whose $i^{th}$ row corresponds to the attribute values of $v_i$.

A knowledge graph (KG) is a multi-digraph with labeled edges Kazemi (2018), where the label represents the type of the relationship. Let $R$ be a set of relation types. Then $E \subseteq V \times R \times V$. A KG can be attributed, in which case each node $v \in V$ is associated with a vector $\mathbf{x}_v$ of attribute values. A digraph is a special case of a KG with only one relation type. An undirected graph is a special case of a KG with only one symmetric relation type.

###### Example 1.

Figure 1(a) represents an undirected graph with three nodes $v_1$, $v_2$ and $v_3$ and three edges $(v_1, v_2)$, $(v_1, v_3)$ and $(v_2, v_3)$. Figure 1(b) represents a graph with four nodes and four edges. The adjacency, degree, and Laplacian matrices for the graph in Figure 1(b) are as follows:

$$\mathbf{A} = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad \mathbf{L} = \begin{bmatrix} 2 & -1 & -1 & 0 \\ -1 & 3 & -1 & -1 \\ -1 & -1 & 2 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix}$$
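The construction $\mathbf{L} = \mathbf{D} - \mathbf{A}$ can be checked numerically with a quick numpy sketch (not part of the original survey):

```python
import numpy as np

# Adjacency matrix of the undirected graph in Figure 1(b)
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
])

# Degree matrix: a diagonal matrix of row sums (node degrees)
D = np.diag(A.sum(axis=1))

# Graph Laplacian: L = D - A
L = D - A
```

A handy property visible here is that every row of the Laplacian sums to zero, since each node's degree equals the sum of its adjacency-row entries.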

where the $i^{th}$ row (and the $i^{th}$ column) corresponds to $v_i$. Since the graph is undirected, $\mathbf{A}$ is symmetric. Figure 1(c) represents a KG with four nodes $v_1$, $v_2$, $v_3$ and $v_4$, three relation types $r_1$, $r_2$ and $r_3$, and five labeled edges as follows:

$$(v_1, r_1, v_2) \qquad (v_1, r_1, v_3) \qquad (v_1, r_2, v_3) \qquad (v_2, r_3, v_4) \qquad (v_4, r_3, v_2)$$

The KG in Figure 1(c) is directed and is a multigraph as there are, e.g., two edges (with the same direction) between $v_1$ and $v_3$.

### 2.2 Dynamic Graphs

We represent a continuous-time dynamic graph (CTDG) as a pair $(G_0, O)$ where $G_0$ is a static graph representing an initial state of a dynamic graph at time $t_0$ and $O$ is a set of observations/events where each observation is a tuple of the form (event type, event, timestamp). An event type can be an edge addition, edge deletion, node addition, node deletion, node splitting, node merging, etc. At any point $t \ge t_0$ in time, a snapshot $G_t$ (corresponding to a static graph) can be obtained from a CTDG by updating $G_0$ sequentially according to the observations that occurred before (or at) time $t$ (sometimes, the update may require aggregation to handle multiple edges between two nodes).

A discrete-time dynamic graph (DTDG) is a sequence of snapshots from a dynamic graph sampled at regularly-spaced times. Formally, $DG = \{G_1, G_2, \dots, G_T\}$ where $G_t = (V_t, E_t)$ is the snapshot at time $t$. We use the term dynamic graph to refer to both DTDGs and CTDGs. Compared to a CTDG, a DTDG may lose information by looking only at some snapshots of the graph over time, but developing models for DTDGs may be generally easier. In particular, a model developed for CTDGs may be used for DTDGs, but the reverse is not necessarily true.
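The relationship between a CTDG event stream and DTDG snapshots can be made concrete with a small sketch (the event encoding, function name, and values below are illustrative choices, not notation from the survey): a snapshot at time $t$ is obtained by replaying all events with timestamps up to $t$.

```python
def snapshot(initial_edges, observations, t):
    """Replay edge-addition/deletion events up to (and including) time t.

    observations: list of (event_type, (source, target), timestamp) tuples,
    sorted by timestamp, with event_type in {"add", "delete"}.
    """
    edges = set(initial_edges)
    for event_type, edge, ts in observations:
        if ts > t:
            break  # events are sorted; everything later is ignored
        if event_type == "add":
            edges.add(edge)
        elif event_type == "delete":
            edges.discard(edge)
    return edges

# A DTDG is then just a list of regularly-spaced snapshots of the CTDG:
events = [("add", (1, 2), 1.0), ("add", (2, 3), 2.5), ("delete", (1, 2), 3.2)]
dtdg = [snapshot(set(), events, t) for t in (1.0, 2.0, 3.0, 4.0)]
```

Note how the deletion at time 3.2 is invisible to the snapshot at $t = 3.0$ but reflected at $t = 4.0$: this is exactly the information loss a DTDG can suffer relative to the underlying CTDG.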

An undirected dynamic graph is a dynamic graph where at any time $t$, $G_t$ is an undirected graph. A directed dynamic graph is a dynamic graph where at any time $t$, $G_t$ is a digraph. A bipartite dynamic graph is a dynamic graph where at any time $t$, $G_t$ is a bipartite graph. A dynamic KG is a dynamic graph where at any time $t$, $G_t$ is a KG.

###### Example 2.

Let $(G_0, O)$ be a CTDG where $G_0$ is a graph with five nodes $v_1$, $v_2$, $v_3$, $v_4$ and $v_5$ and with no edges between any pair of nodes, and $O$ is a set of edge-addition observations. This CTDG may be represented graphically as in Figure 1(d). The only type of observation in this dynamic graph is the addition of new edges. The second element of each observation corresponding to an edge addition represents the source and the target nodes of the new edge. The third element of each observation represents the timestamp at which the observation was made.

###### Example 3.

Consider an undirected CTDG whose initial state $G_0$ is as in Figure 1(a), and whose observations $O$ evolve the graph over time. Now consider a DTDG that takes two snapshots from this CTDG, one snapshot at time $t_1$ and one snapshot at a later time $t_2$. The two snapshots of this DTDG look like the graphs in Figure 1(a) and Figure 1(b) respectively.

### 2.3 Prediction problems

In this survey, we mainly study three general problems for dynamic graphs: node classification, link prediction, and graph classification. Node classification is the problem of classifying each node into one class from a set of predefined classes. Link prediction is the problem of predicting new links between the nodes. Graph classification is the problem of classifying a whole graph into one class from a set of predefined classes. A high-level description of some other prediction problems can be found in Section 7.1.

Node classification and link prediction can be deployed under two settings: interpolation and extrapolation. Consider a dynamic graph that has incomplete information from the time interval $[t_0, t_T]$. The interpolation problem is to make predictions at some time $t$ such that $t_0 \le t \le t_T$. The interpolation problem is also known as the completion problem and is mainly used for completing (dynamic) KGs Jiang et al. (2016); Leblay and Chekol (2018); García-Durán et al. (2018); Dasgupta et al. (2018). The extrapolation problem is to make predictions at some time $t$ such that $t > t_T$, i.e., predicting the future based on the past. Extrapolation is usually a more challenging problem than interpolation.

##### Streaming scenario:

In the streaming scenario, new observations are being streamed to the model at a fast rate and the model needs to update itself based on these observations in real-time so it can make informed predictions immediately after each observation arrives. For this scenario, a model may not have enough time to retrain completely or in part when new observations arrive. Streaming scenarios are often best handled by CTDGs and often give rise to extrapolation problems.

### 2.4 The Encoder-Decoder Framework

Following Hamilton et al. (2017b), to deal with the large notational and methodological diversity of the existing approaches and to put the various methods on an equal notational and conceptual footing, we develop an encoder-decoder framework for dynamic graphs. Before describing the encoder-decoder framework, we define the main component of this architecture, known as an embedding.

###### Definition 1.

An embedding is a function that maps every node $v \in V$ of a graph, and every relation type $r \in R$ in case of a KG, to a hidden representation, where the hidden representation is typically a tuple of one or more scalars, vectors, and/or matrices of numbers. The vectors and matrices in the tuple are supposed to contain the necessary information about the nodes and relations to enable making predictions about them.

For each node $v$ and relation $r$, we refer to the hidden representations of $v$ and $r$ as the embedding of $v$ and the embedding of $r$ respectively. When the main goal is link prediction, some works define the embedding function as mapping each pair of nodes to a hidden representation. In these cases, we refer to the hidden representation of a pair $(v, u)$ of nodes as the embedding of the pair.

Having the above definition, we can now formally define an encoder and a decoder.

###### Definition 2.

An encoder takes as input a dynamic graph and outputs an embedding function that maps nodes, and relations in case of a KG, to hidden representations.

###### Definition 3.

A decoder takes as input an embedding function and makes predictions (such as node classification, edge prediction, etc.) based on the embedding function.

In many cases (e.g., Kipf and Welling (2017); Hamilton et al. (2017a); Yang et al. (2015); Bordes et al. (2013); Nickel et al. (2016b); Dong et al. (2014)), the embedding function maps each node, and each relation in the case of a KG, to a tuple containing a single vector; that is, $EMB(v) = (\mathbf{z}_v)$ where $\mathbf{z}_v \in \mathbb{R}^{d_1}$ and $EMB(r) = (\mathbf{z}_r)$ where $\mathbf{z}_r \in \mathbb{R}^{d_2}$. Other works consider different representations. For instance, Kazemi and Poole (2018c) define $EMB(v) = (\mathbf{z}_v, \overline{\mathbf{z}}_v)$ and $EMB(r) = (\mathbf{z}_r, \overline{\mathbf{z}}_r)$, i.e. mapping each node and each relation to two vectors where each vector has a different usage. Nguyen et al. (2016) define $EMB(v) = (\mathbf{z}_v)$ and $EMB(r) = (\mathbf{z}_r, \mathbf{P}_r, \mathbf{Q}_r)$, i.e. mapping each node to a single vector but mapping each relation to a vector and two matrices. We will describe these approaches (and many others) in the upcoming sections.

A model corresponds to an encoder-decoder pair. One of the benefits of describing models in an encoder-decoder framework is that it allows for creating new models by combining the encoder from one model with the decoder from another model when the hidden representations produced by the encoder conform to the hidden representations consumed by the decoder.

#### 2.4.1 Training

For many choices of an encoder-decoder pair, it is possible to train the two components end-to-end. In such cases, the parameters of the encoder and the decoder are typically initialized randomly. Then, until some criterion is met, several epochs of stochastic gradient descent are performed where in each epoch, the embedding function is produced by the encoder, predictions are made based on the embedding function by the decoder, the error in predictions is computed with respect to a loss function, and the parameters of the model are updated based on the loss.

For node classification and graph classification, the loss function can be any classification loss (e.g., cross entropy loss). For link prediction, typically one only has access to positive examples corresponding to the links already in the graph. A common approach in such cases is to generate a set of negative samples where negative samples correspond to edges that are believed to have a low probability of being in the graph. Then, having a set of positive and a set of negative samples, the training of a link predictor turns into a classification problem and any classification loss can be used. The choice of the loss function depends on the application.
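The negative-sampling step described above can be sketched as follows (a minimal illustration, not a scheme the survey prescribes; the function name and the choice of corrupting the target endpoint are assumptions):

```python
import random

def negative_samples(edges, nodes, k=1, seed=0):
    """For each positive edge, sample k corrupted edges believed absent.

    Corrupts the target endpoint with a uniformly random node; resamples
    if the corrupted pair is a self-loop or an observed (positive) edge.
    """
    rng = random.Random(seed)
    edge_set = set(edges)
    negatives = []
    for (u, v) in edges:
        for _ in range(k):
            while True:
                w = rng.choice(nodes)
                if w != u and (u, w) not in edge_set:
                    negatives.append((u, w))
                    break
    return negatives

positives = [(0, 1), (1, 2), (2, 3)]
negatives = negative_samples(positives, nodes=[0, 1, 2, 3, 4], k=2)
# Training data: positives labeled 1, negatives labeled 0; any
# classification loss (e.g., cross entropy) can then be applied.
```

Note that the sampled pairs are only *believed* to be negative: an unobserved edge may simply be missing from the graph, which is precisely why the text hedges with "low probability of being in the graph".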

### 2.5 Expressivity

The expressivity of the models for (dynamic) graphs can be thought of as the diversity of the graphs they can represent. Depending on the problem at hand (e.g., node classification, link prediction, graph classification, etc.), the expressivity can be defined differently. We first provide some intuition on the importance of expressivity using the following example.

###### Example 4.

Consider a simple encoder for a KG that maps every node to a tuple containing a single scalar representing the number of incoming edges to the node (regardless of the labels of the edges). For the KG in Figure 1(c), this encoder will output an embedding function as:

$$EMB(v_1) = (0) \qquad EMB(v_2) = (2) \qquad EMB(v_3) = (2) \qquad EMB(v_4) = (1)$$

No matter what decoder we use, since $EMB(v_2)$ and $EMB(v_3)$ are identical, the two nodes will be assigned the same class. Therefore, this model is not expressive enough to represent ground truths where $v_2$ and $v_3$ belong to different classes.
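The scalar in-degree encoder of Example 4 is simple enough to state in code (an illustrative sketch; the function name and edge encoding are not from the survey). The tuple-of-one-scalar output mirrors the hidden representations of Definition 1:

```python
def in_degree_embedding(edges, nodes):
    """Map each node to a 1-tuple holding its number of incoming edges,
    ignoring relation labels (the simple encoder of Example 4)."""
    counts = {v: 0 for v in nodes}
    for (source, relation, target) in edges:
        counts[target] += 1
    return {v: (c,) for v, c in counts.items()}

# The five labeled edges of the KG in Figure 1(c)
kg_edges = [("v1", "r1", "v2"), ("v1", "r1", "v3"), ("v1", "r2", "v3"),
            ("v2", "r3", "v4"), ("v4", "r3", "v2")]
emb = in_degree_embedding(kg_edges, ["v1", "v2", "v3", "v4"])
```

Running this reproduces the embedding function of Example 4, including the collision between the two nodes with in-degree 2 that causes the expressivity failure.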

From Example 4, we can see why the expressivity of a model may be important. In this regard, one may favor models that are fully expressive, where we define full expressivity for node classification as follows (a model in the following definitions corresponds to an encoder-decoder pair):

###### Definition 4.

A model $\mathcal{M}$ with parameters $\theta$ is fully expressive with respect to node classification if, given any graph $G$ and any ground truth $\Psi$ of class assignments for all nodes in the graph, there exists an instantiation of $\theta$ such that $\mathcal{M}$ classifies the nodes of $G$ according to $\Psi$.

A similar definition can be given for full expressivity of a model with respect to link prediction and graph classification.

###### Definition 5.

A model $\mathcal{M}$ with parameters $\theta$ is fully expressive with respect to link prediction if, given any graph $G$ and any ground truth $\Psi$ indicating the existence or non-existence of a (labeled) edge for all node-pairs in the graph, there exists an instantiation of $\theta$ such that $\mathcal{M}$ classifies the node-pairs of $G$ according to $\Psi$.

###### Definition 6.

A model $\mathcal{M}$ with parameters $\theta$ is fully expressive with respect to graph classification if, given any set $\mathcal{G}$ of non-isomorphic graphs and any ground truth $\Psi$ of class assignments for all graphs in the set, there exists an instantiation of $\theta$ such that $\mathcal{M}$ classifies the graphs according to $\Psi$.

### 2.6 Sequence Models

In dynamic environments, data often consists of sequences of observations of varying length. There is a long history of models that handle sequential data without any fixed length. This includes auto-regressive models Akaike (1969) that predict the next observations based on a window of past observations. Alternatively, since it is not always clear how long the window of past observations should be, hidden Markov models Rabiner and Juang (1986); Welch et al. (1995), dynamic Bayesian networks Murphy and Russell (2002), and dynamic conditional random fields Sutton et al. (2007) use hidden states to capture relevant information that might be arbitrarily far in the past. Today, those models can be seen as special cases of recurrent neural networks, which allow rich and complex hidden dynamics.

Recurrent neural networks (RNNs) Elman (1990); Cho et al. (2014) have achieved impressive results on a range of sequence modeling problems such as language modeling and speech recognition. The core principle of the RNN is that its output is a function of the current data point as well as a representation of the previous inputs. A simple RNN model can be formulated as follows:

$$\mathbf{h}^{(t)} = \phi(\mathbf{W}^{i} \mathbf{x}^{(t)} + \mathbf{W}^{h} \mathbf{h}^{(t-1)} + \mathbf{b}^{i}) \tag{1}$$

where $\mathbf{x}^{(t)}$ is the input at position $t$ in the sequence, $\mathbf{h}^{(t-1)}$ is a hidden representation containing information about the sequence of inputs until time $t-1$, $\mathbf{W}^{i}$ and $\mathbf{W}^{h}$ are weight matrices, $\mathbf{b}^{i}$ represents the vector of biases, $\phi$ is an activation function, and $\mathbf{h}^{(t)}$ is an updated hidden representation containing information about the sequence of inputs until time $t$. With some abuse of notation, we use $\mathbf{h}^{(t)} = \mathrm{RNN}(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)})$ to represent the output of an RNN operation on a previous state $\mathbf{h}^{(t-1)}$ and a new input $\mathbf{x}^{(t)}$.
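Equation (1) translates directly into code. The sketch below (the tanh activation, dimensions, and random initialization are illustrative choices, not prescriptions from the survey) unrolls the update over a short sequence:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_i, W_h, b_i):
    """One step of the simple RNN in Equation (1), with a tanh activation."""
    return np.tanh(W_i @ x_t + W_h @ h_prev + b_i)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_i = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
b_i = np.zeros(d_h)

# Unroll over a sequence: h summarizes inputs x_1 .. x_t after step t
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = rnn_step(x, h, W_i, W_h, b_i)
```

The key point is that the same weights are applied at every step, so the model handles sequences of any length with a fixed number of parameters.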

Long short term memory (LSTM) Hochreiter and Schmidhuber (1997) is considered one of the most successful RNN architectures. The original LSTM model can be neatly defined with the following equations:

$$\begin{aligned}
\mathbf{i}^{(t)} &= \sigma(\mathbf{W}^{ii} \mathbf{x}^{(t)} + \mathbf{W}^{ih} \mathbf{h}^{(t-1)} + \mathbf{b}^{i}) && (2) \\
\mathbf{f}^{(t)} &= \sigma(\mathbf{W}^{fi} \mathbf{x}^{(t)} + \mathbf{W}^{fh} \mathbf{h}^{(t-1)} + \mathbf{b}^{f}) && (3) \\
\mathbf{c}^{(t)} &= \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)} + \mathbf{i}^{(t)} \odot \mathrm{Tanh}(\mathbf{W}^{ci} \mathbf{x}^{(t)} + \mathbf{W}^{ch} \mathbf{h}^{(t-1)} + \mathbf{b}^{c}) && (4) \\
\mathbf{o}^{(t)} &= \sigma(\mathbf{W}^{oi} \mathbf{x}^{(t)} + \mathbf{W}^{oh} \mathbf{h}^{(t-1)} + \mathbf{b}^{o}) && (5) \\
\mathbf{h}^{(t)} &= \mathbf{o}^{(t)} \odot \mathrm{Tanh}(\mathbf{c}^{(t)}) && (6)
\end{aligned}$$

Here $\mathbf{i}^{(t)}$, $\mathbf{f}^{(t)}$, and $\mathbf{o}^{(t)}$ represent the input, forget and output gates respectively, while $\mathbf{c}^{(t)}$ is the memory cell and $\mathbf{h}^{(t)}$ is the hidden state. $\sigma$ and $\mathrm{Tanh}$ represent the sigmoid and hyperbolic tangent activation functions respectively. Gated recurrent units (GRUs) Cho et al. (2014) are another successful RNN architecture.
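Equations (2)-(6) translate almost line-by-line into code. Below is a minimal sketch (the per-gate dictionary parameterization, dimensions, and initialization are illustrative assumptions, not the survey's formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step following Equations (2)-(6).

    W holds the eight weight matrices keyed "ii","ih","fi","fh","ci","ch",
    "oi","oh"; b holds the four bias vectors keyed "i","f","c","o".
    """
    i = sigmoid(W["ii"] @ x + W["ih"] @ h + b["i"])            # input gate, Eq. (2)
    f = sigmoid(W["fi"] @ x + W["fh"] @ h + b["f"])            # forget gate, Eq. (3)
    c_new = f * c + i * np.tanh(W["ci"] @ x + W["ch"] @ h + b["c"])  # memory cell, Eq. (4)
    o = sigmoid(W["oi"] @ x + W["oh"] @ h + b["o"])            # output gate, Eq. (5)
    h_new = o * np.tanh(c_new)                                 # hidden state, Eq. (6)
    return h_new, c_new

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
W, b = {}, {}
for g in "ifco":
    W[g + "i"] = rng.normal(size=(d_h, d_in))  # input-side weights
    W[g + "h"] = rng.normal(size=(d_h, d_h))   # hidden-side weights
    b[g] = np.zeros(d_h)

h = np.zeros(d_h)
c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, b)
```

The forget gate in Equation (4) is what lets the cell carry information over long spans: when $\mathbf{f}^{(t)}$ is near one, $\mathbf{c}^{(t-1)}$ passes through nearly unchanged.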

Fully attentive models have recently demonstrated on-par or superior performance compared to RNN variants for a variety of tasks (see, e.g., Vaswani et al. (2017); Dehghani et al. (2018); Krantz and Kalita (2018); Shaw et al. (2018)). These models rely only on (self-)attention and abstain from using recurrence. Vaswani et al. (2017) characterize a self-attention mechanism as a function from query, key, and value vectors to a vector that is a weighted sum of the value vectors. Their mechanism is presented in Equation (7).

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}'}{\sqrt{d_k}}\right)\mathbf{V} \quad \text{where}\quad \mathbf{Q} = \mathbf{X}\mathbf{W}^{Q},\ \ \mathbf{K} = \mathbf{X}\mathbf{W}^{K},\ \ \mathbf{V} = \mathbf{X}\mathbf{W}^{V} \tag{7}$$

where $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ are called the query, key and value matrices, $\mathbf{K}'$ is the transpose of $\mathbf{K}$, $\mathbf{X}$ is the input sequence, $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$ and $\mathbf{W}^{V}$ are weight matrices, $d_k$ is the dimension of the key vectors, and $\mathrm{softmax}$ performs a row-wise normalization of the input matrix. A mask is added to Equation (7) to make sure that at time $t$, the mechanism only allows a sequence model to attend to the points before time $t$. Vaswani et al. (2017) also define a multi-head self-attention mechanism by considering multiple self-attention blocks (as defined in Equation (7)), each having different weight matrices, and then concatenating the results.
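A minimal numpy sketch of Equation (7) with the causal mask described above (dimensions, weights, and function names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilized row-wise softmax
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention (Equation (7)) with a causal mask
    so that position t only attends to positions <= t."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # -inf above the diagonal zeroes those attention weights after softmax
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)
    weights = softmax(scores + mask)
    return weights @ V, weights

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))                        # 5 positions, model dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = masked_self_attention(X, W_q, W_k, W_v)
```

Each output row is a weighted sum of the value vectors of past positions, with weights that sum to one; multi-head attention would simply run several such blocks with different weight matrices and concatenate the outputs.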

### 2.7 Temporal Point Processes

Temporal Point Processes (TPP) Cox and Lewis (1972) are stochastic processes used for modeling sequential asynchronous discrete events occurring in continuous time. Asynchronous in this context means that the time between consecutive events may not be the same. TPPs have been applied in applications such as e-commerce Xu et al. (2014), finance Bacry et al. (2015), etc. A typical realization of a TPP is a sequence of discrete events occurring at time points $t_1, t_2, t_3, \dots$ for $t_i \le T$, where the sequence has been generated by some stochastic process and $T$ represents the time horizon of the process. A TPP model uses a conditional density function $f(t \mid \mathcal{H}_{t_n})$ indicating the density of the occurrence of the next event at some time point $t$ given the history $\mathcal{H}_{t_n}$ of the process till time $t_n$ (including $t_n$). The cumulative density function till time $t$ given the history is defined as follows:

$$F(t \mid \mathcal{H}_{t_n}) = \int_{\tau = t_n}^{t} f(\tau \mid \mathcal{H}_{t_n})\, d\tau \tag{8}$$

Equation (8) also corresponds to the probability that the next event will happen between $t_n$ and $t$. The survival function of a process Aalen et al. (2008) indicates the probability that no event will occur until time $t$ given the history and is computed as $S(t \mid \mathcal{H}_{t_n}) = 1 - F(t \mid \mathcal{H}_{t_n})$. Having the density function, the time for the next event can be predicted by taking an expectation over $f$ as:

$$\hat{t} = \mathbb{E}_{t \sim f(t \mid \mathcal{H}_{t_n})}[t] = \int_{\tau = t_n}^{T} \tau\, f(\tau \mid \mathcal{H}_{t_n})\, d\tau \tag{9}$$

The parameters of a TPP can be learned from data by maximizing the joint density of the entire process defined as follows:

$$f(t_1, \dots, t_n) = \prod_{i=1}^{n} f(t_i \mid \mathcal{H}_{t_{i-1}}) \tag{10}$$

Another way of characterizing a TPP is through a conditional intensity function (a.k.a. hazard function) $\lambda(t \mid \mathcal{H}_{t^-})$ such that $\lambda(t \mid \mathcal{H}_{t^-})\,dt$ represents the probability of the occurrence of an event in the interval $[t, t+dt]$ given that no event has occurred until time $t$. $\mathcal{H}_{t^-}$ represents the history of the process until, but not including, time $t$. The intensity and density functions can be derived from each other as follows:

$$\begin{aligned}
\lambda(t \mid \mathcal{H}_{t^-})\,dt &= \mathrm{Prob}(t_{n+1} \in [t, t+dt] \mid \mathcal{H}_{t^-}) \\
&= \mathrm{Prob}(t_{n+1} \in [t, t+dt] \mid \mathcal{H}_{t_n},\, t_{n+1} \notin (t_n, t)) \\
&= \frac{\mathrm{Prob}(t_{n+1} \in [t, t+dt],\ t_{n+1} \notin (t_n, t) \mid \mathcal{H}_{t_n})}{\mathrm{Prob}(t_{n+1} \notin (t_n, t) \mid \mathcal{H}_{t_n})} \\
&= \frac{\mathrm{Prob}(t_{n+1} \in [t, t+dt] \mid \mathcal{H}_{t_n})}{\mathrm{Prob}(t_{n+1} \notin (t_n, t) \mid \mathcal{H}_{t_n})} \\
&= \frac{f(t \mid \mathcal{H}_{t_n})\,dt}{S(t \mid \mathcal{H}_{t_n})}
\end{aligned} \tag{11}$$

The intensity function can be designed according to the application. The function usually contains learnable parameters Du et al. (2016) that can be learned from the data.

###### Example 5.

Consider the problem of predicting when the next earthquake will occur in a region based on the times of previous earthquakes in that region. Typically, an earthquake is followed by a series of other earthquakes as aftershocks. Thus, upon observing an earthquake, a model should increase the probability of another earthquake in the near future and gradually decay this probability.

Let $t_1, t_2, t_3, \dots$ be the times at which an earthquake occurred in the region. Equation (12) gives one possible conditional intensity function for modeling this process.

$$\lambda^{*}(t) = \mu + \alpha \sum_{t_i \le t} \exp(-(t - t_i)) \tag{12}$$

where $\mu$ and $\alpha$ are parameters that are constrained to be positive and are generally learned from the data. The sum is over all the timestamps $t_i \le t$ at which an earthquake occurred. In this function, $\mu$ can be considered as the base intensity of an earthquake in the region. The occurrence of an earthquake increases the intensity of another earthquake in the near future (as it makes the value of the sum increase), and this increase decays exponentially back to the base intensity. The amount of increase is controlled by $\alpha$. Note that the conditional intensity function is always positive as $\mu$, $\alpha$, and $\exp$ are always positive. From Equation (11), the density function for the random variable $t_{n+1}$ is $f(t \mid \mathcal{H}_{t_n}) = \lambda^{*}(t)\, S(t \mid \mathcal{H}_{t_n})$. We can estimate the time for the occurrence of the next earthquake ($\hat{t}$) by taking an expectation over the random variable as in Equation (9).

Equation (12) is a special case of the well-known self-exciting Hawkes process Hawkes (1971); Mei and Eisner (2017). Other well-studied TPPs include Poisson processes Kingman (2005), self-correcting processes Isham and Westcott (1979), and autoregressive conditional duration processes Engle and Russell (1998). Depending on the application, one may use one of these intensity functions or even potentially design new ones. Recently, there has been growing interest in learning the intensity function entirely from the data Du et al. (2016).
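Equation (12) is simple to evaluate directly. The sketch below (the values of $\mu$, $\alpha$, and the event history are illustrative) shows the characteristic behavior of the self-exciting intensity: it sits at the base rate before any event, jumps after recent events, and decays back toward the base rate:

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8):
    """Conditional intensity of Equation (12): base rate mu plus an
    exponentially decaying excitation term for every past event."""
    past = np.array([ti for ti in history if ti <= t])
    return mu + alpha * np.exp(-(t - past)).sum()

history = [1.0, 1.5, 5.0]
before = hawkes_intensity(0.9, history)       # no events yet: base rate only
just_after = hawkes_intensity(1.6, history)   # two recent events excite
much_later = hawkes_intensity(20.0, history)  # excitation has decayed away
```

With the density $f(t \mid \mathcal{H}_{t_n}) = \lambda^{*}(t)\, S(t \mid \mathcal{H}_{t_n})$ from Equation (11), one could estimate the next event time by numerically integrating Equation (9); that step is omitted here.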

## 3 Representation Learning for Static Graphs

In this section, we provide an overview of representation learning approaches for static graphs. The main aim of this section is to provide enough information for the descriptions and discussions in the next sections on dynamic graphs. Readers interested in learning more about representation learning on static graphs can refer to several existing surveys specifically written on this topic (e.g., see Hamilton et al. (2017b); Zhang et al. (2018a); Cai et al. (2018); Cui et al. (2018) for graphs and Nickel et al. (2016a); Wang et al. (2017a) for KGs).

### 3.1 Decoders

Assuming an encoder has provided the embedding function, the decoder aims at using the node and relation embeddings for node classification, edge prediction, graph classification, or other prediction purposes. We divide the discussion on decoders for static graphs into those used for graphs and those used for KGs.

#### 3.1.1 Decoders for Static Graphs

For static graphs, the embedding function usually maps each node to a single vector; that is, $EMB(v) = (\mathbf{z}_v)$ where $\mathbf{z}_v \in \mathbb{R}^{d}$ for any $v \in V$. To classify a node $v$, a decoder can be any classifier on $\mathbf{z}_v$ (e.g., logistic regression or random forest).

To predict a link between two nodes $v$ and $u$, for undirected (and bipartite) graphs, the most common decoder is based on the dot-product of the vectors of the two nodes, i.e., $\mathbf{z}'_v \mathbf{z}_u$. The dot-product gives a score that can then be fed into a sigmoid function whose output can be considered as the probability of a link existing between $v$ and $u$. Grover and Leskovec (2016) propose several other decoders for link prediction in undirected graphs. Their decoders are based on defining a function that combines the two vectors $\mathbf{z}_v$ and $\mathbf{z}_u$ into a single vector. The resulting vector is then considered as the edge features that can be fed into a classifier to predict if an edge exists between $v$ and $u$ or not. These combining functions include:

• The average of the two vectors: $\frac{\mathbf{z}_v + \mathbf{z}_u}{2}$,

• The element-wise (Hadamard) multiplication of the two vectors: $\mathbf{z}_v \odot \mathbf{z}_u$,

• The element-wise absolute value of the difference of the two vectors: $|\mathbf{z}_v - \mathbf{z}_u|$,

• The element-wise squared value of the difference of the two vectors: $(\mathbf{z}_v - \mathbf{z}_u) \odot (\mathbf{z}_v - \mathbf{z}_u)$.

Instead of computing the distance between $\mathbf{z}_v$ and $\mathbf{z}_u$ in the Euclidean space, the distance can be computed in other spaces such as the hyperbolic space Chamberlain et al. (2017). Different spaces offer different properties. Note that all these four combination functions are symmetric, i.e., $f(\mathbf{z}_v, \mathbf{z}_u) = f(\mathbf{z}_u, \mathbf{z}_v)$ where $f$ is any of the above functions. This is an important property when the graph is undirected.
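The four combining functions of Grover and Leskovec (2016) can be written compactly, and their symmetry checked directly (a sketch; the dictionary keys are informal labels, not names from the survey):

```python
import numpy as np

# The four symmetric edge-feature operators listed above
combiners = {
    "average":  lambda zv, zu: (zv + zu) / 2.0,
    "hadamard": lambda zv, zu: zv * zu,
    "abs_diff": lambda zv, zu: np.abs(zv - zu),
    "sq_diff":  lambda zv, zu: (zv - zu) ** 2,
}

rng = np.random.default_rng(3)
zv, zu = rng.normal(size=4), rng.normal(size=4)

# All four satisfy f(zv, zu) == f(zu, zv), as required for undirected graphs
for name, f in combiners.items():
    assert np.allclose(f(zv, zu), f(zu, zv)), name
```

Each operator produces a vector of the same dimension as the node embeddings, which is then treated as the feature vector of the candidate edge.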

For link prediction in directed graphs, it is important to treat the source and target of the edge differently. Towards this goal, one approach is to concatenate the two vectors as $[\mathbf{z}_v; \mathbf{z}_u]$ and feed the concatenation into a classifier (see, e.g., Pareja et al. (2019)). Another approach used in Ma et al. (2018b) is to project the source and target vectors to another space as $\hat{\mathbf{z}}_v = \mathbf{P} \mathbf{z}_v$ and $\hat{\mathbf{z}}_u = \mathbf{Q} \mathbf{z}_u$, where $\mathbf{P}$ and $\mathbf{Q}$ are matrices with learnable parameters, and then take the dot-product in the new space (i.e., $\hat{\mathbf{z}}'_v \hat{\mathbf{z}}_u$). A third approach is to take the vector representation of a node and send it through a feed-forward neural network with $|V|$ outputs, where each output gives the score for whether the node has a link with one of the nodes in the graph or not. This approach is used mainly in graph autoencoders (see, e.g., Wang et al. (2016); Cao et al. (2016); Tran (2018); Goyal et al. (2017); Chen et al. (2018a)) and is used for both directed and undirected graphs.

The decoder for a graph classification task needs to compress node representations into a single representation which can then be fed into a classifier to perform graph classification. Duvenaud et al. (2015) simply average all the node representations into a single vector. Gilmer et al. (2017) consider the node representations of the graph as a set and use the DeepSet aggregation Zaheer et al. (2017) to get a single representation. Li et al. (2015) add a virtual node to the graph which is connected to all the nodes and use the representation of the virtual node as the representation of the graph. Several approaches perform a deterministic hierarchical graph clustering step and combine the node representations in each cluster to learn hierarchical representations Defferrard et al. (2016); Fey et al. (2018); Simonovsky and Komodakis (2017). Instead of performing a deterministic clustering and then running a graph classification model, Ying et al. (2018b) learn the hierarchical structure jointly with the classifier in an end-to-end fashion.

#### 3.1.2 Decoders for Link Prediction in Static KGs

There are several classes of decoders for link prediction in static KGs. Here, we provide an overview of the translational, bilinear, and deep learning classes. When we discuss the expressivity of the decoders in this subsection, we assume the decoder is combined with a flexible encoder.

##### Translational decoders

usually assume the encoder provides an embedding function such that $EMB(v) = z_v$ for every $v \in V$, where $z_v \in \mathbb{R}^d$, and $EMB(r) = (z_r, P_r, Q_r)$ for every $r \in R$, where $z_r \in \mathbb{R}^d$, $P_r \in \mathbb{R}^{d \times d}$, and $Q_r \in \mathbb{R}^{d \times d}$. That is, the embedding for a node contains a single vector whereas the embedding for a relation contains a vector and two matrices. For an edge $(v, r, u)$, these models use:

  $||P_r z_v + z_r - Q_r z_u||_i$   (13)

as the dissimilarity score for the edge, where $||\cdot||_i$ represents the norm-$i$ of a vector; $i$ is usually either $1$ or $2$. Translational decoders differ in the restrictions they impose on $P_r$ and $Q_r$. TransE Bordes et al. (2013) constrains $P_r = Q_r = I_d$, the identity matrix. So the dissimilarity function for TransE can be simplified to:

  $||z_v + z_r - z_u||_i$   (14)

In TransR Lin et al. (2015), $P_r = Q_r$. In STransE Nguyen et al. (2016), no restrictions are imposed on the matrices. Kazemi and Poole (2018c) proved that TransE, TransR, STransE, and many other variants of translational approaches are not fully expressive with respect to link prediction (regardless of the encoder) and identified severe restrictions on the type of relations that can be modeled using these approaches.
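These dissimilarity functions can be made concrete in a few lines of numpy (the random embeddings below merely stand in for whatever an encoder would provide):

```python
import numpy as np

def stranse_score(z_v, z_r, z_u, P_r, Q_r, i=2):
    """Generic translational dissimilarity ||P_r z_v + z_r - Q_r z_u||_i
    (Eq. 13); lower means the edge (v, r, u) is more plausible."""
    return np.linalg.norm(P_r @ z_v + z_r - Q_r @ z_u, ord=i)

def transe_score(z_v, z_r, z_u, i=2):
    """TransE: P_r = Q_r = I, so Eq. 13 reduces to ||z_v + z_r - z_u||_i (Eq. 14)."""
    return np.linalg.norm(z_v + z_r - z_u, ord=i)

d = 4
rng = np.random.default_rng(1)
z_v, z_r, z_u = rng.normal(size=(3, d))   # placeholder encoder output
I = np.eye(d)

# With identity projection matrices, STransE reduces to TransE:
assert np.isclose(stranse_score(z_v, z_r, z_u, I, I), transe_score(z_v, z_r, z_u))
# A perfectly fitting triple has dissimilarity 0:
assert np.isclose(transe_score(z_v, z_r, z_v + z_r), 0.0)
```

The second assertion shows the geometric intuition behind TransE: a relation is modeled as a translation that should carry the source embedding onto the target embedding.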

##### Bilinear decoders

usually assume the encoder provides an embedding function such that $EMB(v) = z_v$ for every $v \in V$, where $z_v \in \mathbb{R}^d$, and $EMB(r) = P_r$ for every $r \in R$, where $P_r \in \mathbb{R}^{d \times d}$. For an edge $(v, r, u)$, these models use:

  $z_v' P_r z_u$   (15)

as the similarity score for the edge. Bilinear decoders differ in the restrictions they impose on the $P_r$ matrices Wang et al. (2018). In RESCAL Nickel et al. (2011), no restrictions are imposed on the matrices. RESCAL is fully expressive with respect to link prediction, but the large number of parameters per relation makes RESCAL prone to overfitting. To reduce the number of parameters in RESCAL, DistMult Yang et al. (2015) constrains the $P_r$ matrices to be diagonal. This reduction in the number of parameters, however, comes at a cost: DistMult loses expressivity and is only able to model symmetric relations, because its score function does not distinguish between the source and target vectors.

ComplEx Trouillon et al. (2016), CP Hitchcock (1927), and SimplE Kazemi and Poole (2018c) reduce the number of parameters in RESCAL without sacrificing expressivity. ComplEx extends DistMult by assuming the embeddings are complex (instead of real) valued, i.e. $z_v \in \mathbb{C}^d$ and $z_r \in \mathbb{C}^d$ for every $v \in V$ and $r \in R$. Then, it slightly changes the score function to $\mathrm{Re}(\langle z_v, z_r, \overline{z_u} \rangle)$ (where $\langle a, b, c \rangle = \sum_i a_i b_i c_i$), $\mathrm{Re}$ returns the real part of a complex number, and $\overline{z_u}$ takes an element-wise conjugate of the vector elements. By taking the conjugate of the target vector, ComplEx differentiates between source and target nodes and does not suffer from the symmetry issue of DistMult. CP defines $EMB(v) = (z_v^{(s)}, z_v^{(t)})$, i.e. the embedding of a node consists of two vectors, where $z_v^{(s)}$ captures $v$'s behaviour when it is the source of an edge and $z_v^{(t)}$ captures $v$'s behaviour when it is the target of an edge. For relations, CP defines $EMB(r) = z_r$. The similarity function of CP for an edge $(v, r, u)$ is then defined as $\langle z_v^{(s)}, z_r, z_u^{(t)} \rangle$. Realizing that information may not flow well between the two vectors of a node, SimplE adds another vector to the relation embeddings as $EMB(r) = (z_r, z_{r^{-1}})$, where $z_{r^{-1}}$ models the behaviour of the inverse of the relation. Then, it changes the score function to be the average of $\langle z_v^{(s)}, z_r, z_u^{(t)} \rangle$ and $\langle z_u^{(s)}, z_{r^{-1}}, z_v^{(t)} \rangle$.

For ComplEx, CP, and SimplE, it is possible to view the embedding for each node as a single vector in $\mathbb{R}^{2d}$ by concatenating the two vectors (in the case of ComplEx, the two vectors correspond to the real and imaginary parts of the embedding vector). Then, the $P_r$ matrices can be viewed as being restricted according to Figure 2 (taken from Kazemi and Poole (2018c)).

Other bilinear approaches include HolE Nickel et al. (2016), whose equivalence to ComplEx has been established Hayashi and Shimbo (2017), and Analogy Liu et al. (2017), where the $P_r$ matrices are constrained to be block-diagonal.
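The bilinear score functions above can be sketched as follows (numpy; the random embeddings are placeholders for an encoder's output):

```python
import numpy as np

def rescal_score(z_v, P_r, z_u):
    """Bilinear score z_v' P_r z_u (Eq. 15); P_r is an unrestricted d x d matrix."""
    return z_v @ P_r @ z_u

def distmult_score(z_v, z_r, z_u):
    """DistMult: P_r restricted to diag(z_r); symmetric in v and u."""
    return np.sum(z_v * z_r * z_u)

def complex_score(z_v, z_r, z_u):
    """ComplEx: complex embeddings, score Re(<z_v, z_r, conj(z_u)>)."""
    return np.real(np.sum(z_v * z_r * np.conj(z_u)))

d = 4
rng = np.random.default_rng(2)
z_v, z_r, z_u = rng.normal(size=(3, d))
# DistMult with diagonal P_r matches the general bilinear form:
assert np.isclose(distmult_score(z_v, z_r, z_u), rescal_score(z_v, np.diag(z_r), z_u))
# DistMult cannot distinguish source from target (the symmetry issue) ...
assert np.isclose(distmult_score(z_v, z_r, z_u), distmult_score(z_u, z_r, z_v))
# ... while ComplEx generally can, thanks to the conjugate on the target:
c_v, c_r, c_u = rng.normal(size=(3, d)) + 1j * rng.normal(size=(3, d))
assert not np.isclose(complex_score(c_v, c_r, c_u), complex_score(c_u, c_r, c_v))
```

The assertions make the expressivity discussion tangible: restricting $P_r$ to a real diagonal forces symmetry, while the complex conjugate restores the asymmetry needed for directed relations.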

##### Deep learning-based decoders:

Deep learning approaches typically use feed-forward or convolutional neural networks for scoring edges in a KG.

Dong et al. (2014) and Santoro et al. (2017) consider $EMB(v) = z_v$ for every node $v$ such that $z_v \in \mathbb{R}^{d_1}$ and $EMB(r) = z_r$ for every relation $r$ such that $z_r \in \mathbb{R}^{d_2}$. Then for an edge $(v, r, u)$, they feed $[z_v; z_r; z_u]$ (i.e., the concatenation of the three vector representations) into a feed-forward neural network that outputs a score for this edge. Dettmers et al. (2018) develop a score function based on convolutions. They consider $EMB(v) = Z_v$ for each node $v$ such that $Z_v \in \mathbb{R}^{d_1 \times d_2}$ and $EMB(r) = Z_r$ for each relation $r$ such that $Z_r \in \mathbb{R}^{d_1 \times d_2}$ (alternatively, the matrices can be viewed as vectors of size $d_1 d_2$). For an edge $(v, r, u)$, first they combine $Z_v$ and $Z_r$ into a matrix $C_{vr}$, either by concatenating the two matrices on the rows or by adding the rows of each matrix in turn. Then 2D convolutions with learnable filters are applied on $C_{vr}$, generating multiple matrices, and the matrices are vectorized into a vector $c_{vr}$, where the size of the vector depends on the number of convolution filters. Then the score for the edge is computed as:

  $(c_{vr}' W)\,\mathrm{vec}(Z_u)$   (16)

where $W$ is a weight matrix. Other deep learning approaches include Balazevic et al. (2018), which proposes another convolution-based score function, and Socher et al. (2013), whose score function contains feed-forward components as well as several bilinear components.

### 3.2 Encoders

In the previous section, we discussed how an embedding function can be used by a decoder to make predictions. In this section, we describe different approaches for creating encoders that provide the embedding function to be consumed by the decoder.

#### 3.2.1 High-Order Proximity Matrices

While the adjacency matrix of a graph only represents local proximities, one can also define high-order proximity matrices Ou et al. (2016), also known as similarity metrics da Silva Soares and Prudêncio (2012). Let $S$ be a high-order proximity matrix. A simple approach for creating an encoder is to let $z_v$ be the row (or the column) of $S$ corresponding to $v$. Encoders based on high-order proximity matrices are typically parameter-free and do not require learning (although some of them have hyper-parameters that need to be tuned). In what follows, we describe several of these matrices.

• Common neighbours matrix is defined as $S_{CN} = A A$. $S_{CN}[u][v]$ corresponds to the number of nodes that are connected to both $u$ and $v$. For a directed graph, $S_{CN}[u][v]$ counts how many nodes are simultaneously the target of an edge starting at $u$ and the source of an edge ending at $v$.

• Jaccard’s coefficient is a slight modification of $S_{CN}$ where one divides the number of common neighbours of $u$ and $v$ by the total number of distinct nodes that are the targets of edges starting at $u$ or the sources of edges ending at $v$. Formally, Jaccard’s coefficient is defined as $S_{JC}[u][v] = \frac{S_{CN}[u][v]}{d_{out}(u) + d_{in}(v) - S_{CN}[u][v]}$.

• Adamic-Adar is defined as $S_{AA} = A \hat{D} A$, where $\hat{D}$ is a diagonal matrix with $\hat{D}[v][v] = \frac{1}{\log(d_{in}(v) + d_{out}(v))}$. $S_{AA}[u][v]$ computes the weighted sum of common neighbours, where the weight is inversely proportional to the (log-)degree of the neighbour.

• Katz index is defined as $S_{Katz} = \sum_{i=1}^{\infty} \beta^i A^i$, which computes a weighted sum of all the paths between two nodes $u$ and $v$. $\beta$ controls the depth of the connections: the closer $\beta$ is to $1$, the longer the paths one wants to consider. One can rewrite the formula recursively as $S_{Katz} = \beta A S_{Katz} + \beta A$ and, as a corollary, obtain $S_{Katz} = (I_N - \beta A)^{-1} \beta A$.

• Preferential Attachment is simply a product of the out- and in-degrees of the nodes: $S_{PA}[u][v] = d_{out}(u)\, d_{in}(v)$.
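Two of these matrices can be computed directly on a toy directed graph (the graph and the value of $\beta$ below are illustrative choices):

```python
import numpy as np

# Toy directed graph: A[u][v] = 1 iff there is an edge u -> v.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

# Common neighbours: S_CN = A @ A; S_CN[u][v] counts nodes w with u -> w -> v.
S_cn = A @ A
assert S_cn[0][3] == 2.0   # 0 -> 1 -> 3 and 0 -> 2 -> 3

# Katz index in closed form: S_Katz = (I - beta*A)^{-1} (beta*A),
# valid when beta < 1 / spectral_radius(A).
beta = 0.1
N = A.shape[0]
S_katz = np.linalg.inv(np.eye(N) - beta * A) @ (beta * A)

# The closed form matches the (truncated) series sum_{i>=1} beta^i A^i:
S_series = sum(beta**i * np.linalg.matrix_power(A, i) for i in range(1, 50))
assert np.allclose(S_katz, S_series)

# A row of a proximity matrix can directly serve as a parameter-free embedding:
z_0 = S_katz[0]
```

The Katz check also illustrates why the corollary holds: multiplying the series by $(I_N - \beta A)$ telescopes all but the first term away.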

#### 3.2.2 Shallow Encoders

Shallow encoders first decide on the number and the shape of the vectors and matrices for node and relation embeddings. Then, they consider each element in these vectors and matrices as a parameter to be directly learned from the data. As an example, consider the problem of link prediction in a KG. Let the encoder be a shallow encoder with $EMB(v) = z_v \in \mathbb{R}^d$ for each node $v$ in the KG and $EMB(r) = P_r \in \mathbb{R}^{d \times d}$ for each relation $r$ in the KG, and the decoder be the RESCAL function. The $z_v$s and $P_r$s are initialized randomly and then their values are optimized such that $z_v' P_r z_u$ becomes a large positive number if $(v, r, u)$ is in the positive samples and a large negative number if $(v, r, u)$ is in the negative samples.
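The following is a minimal sketch of this setup, assuming a single relation, two hand-picked positive/negative triples, and plain gradient descent with manually derived gradients of a logistic loss (real implementations use mini-batches, negative sampling, and automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_nodes = 4, 3
# Shallow encoder: every entry of the z_v vectors and the P_r matrix is
# itself a learnable parameter (no network producing them).
Z = rng.normal(scale=0.1, size=(num_nodes, d))   # one vector per node
P = rng.normal(scale=0.1, size=(d, d))           # one matrix for a single relation

def score(v, u):
    return Z[v] @ P @ Z[u]                       # RESCAL decoder z_v' P_r z_u

# Hypothetical training triples (source, target, label) for one relation.
triples = [(0, 1, 1.0), (0, 2, -1.0)]

lr = 0.2
for _ in range(500):
    for v, u, y in triples:
        g = -y / (1.0 + np.exp(y * score(v, u)))  # d softplus(-y*s) / ds
        Z[v], Z[u] = Z[v] - lr * g * (P @ Z[u]), Z[u] - lr * g * (P.T @ Z[v])
        P -= lr * g * np.outer(Z[v], Z[u])

# Positive triples now get positive scores, negative ones negative scores:
assert score(0, 1) > 0 > score(0, 2)
```

Note that nothing here looks at the graph structure beyond the sampled triples: in a shallow encoder, the structure enters only through the loss.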

#### 3.2.3 Decomposition Approaches

Decomposition methods are among the earliest attempts at developing encoders for graphs. They learn node embeddings similar to shallow encoders but in an unsupervised way: the node embeddings are learned such that connected nodes are close to each other in the embedded space. Once the embeddings are learned, they can be used for purposes other than reconstructing the edges (e.g., for clustering). Formally, for an undirected graph $G = (V, E)$, learning node embeddings $z_1, \dots, z_N$, where $z_i \in \mathbb{R}^d$, such that connected nodes are close in the embedded space can be done through solving the following optimization problem:

  $\min_{z_1, \dots, z_N} \sum_{(v_i, v_j) \in E} ||z_i - z_j||^2$   (17)

This loss ensures that connected nodes are close to each other in the embedded space. One needs to impose some constraints to get rid of a scaling factor and to eliminate the trivial solution where all nodes are mapped to a single vector. For that, let us consider a new matrix $Z \in \mathbb{R}^{N \times d}$ whose rows give the embeddings: $Z[i] = z_i'$. Then one can add the constraint $Z' D Z = I_d$ to the optimization problem (17), where $D$ is the diagonal matrix of degrees as defined in Subsection 2.1. As was proved in Belkin and Niyogi (2001), this constrained optimization is equivalent to solving a generalized eigenvalue decomposition:

  $L y = \lambda D y$,   (18)

where $L = D - A$ is the graph Laplacian; the matrix $Z$ can be obtained by stacking, as columns, the $d$ generalized eigenvectors corresponding to the smallest non-zero eigenvalues: $Z = [y_1, \dots, y_d]$.
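This derivation can be reproduced numerically. The sketch below solves the generalized eigenproblem for a small path graph by symmetrizing it with $D^{-1/2}$ (a standard equivalence, not anything specific to the cited work):

```python
import numpy as np

# Toy undirected path graph 0 - 1 - 2 - 3: adjacency, degrees, Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

# Solve L y = lambda D y by symmetrizing: with u = D^{1/2} y it becomes
# the standard eigenproblem (D^{-1/2} L D^{-1/2}) u = lambda u.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
lam, U = np.linalg.eigh(D_inv_sqrt @ L @ D_inv_sqrt)   # ascending eigenvalues
Y = D_inv_sqrt @ U                                     # generalized eigenvectors

# Each column indeed satisfies L y = lambda D y:
assert np.allclose(L @ Y, (D @ Y) * lam)

# Skip the trivial constant eigenvector (lambda = 0); embed in d = 2 dims.
Z = Y[:, 1:3]
# Adjacent nodes land closer together than the two endpoints of the path:
assert np.linalg.norm(Z[0] - Z[1]) < np.linalg.norm(Z[0] - Z[3])
```

The smallest eigenvalue is always $0$ with a constant eigenvector, which is exactly the trivial solution the constraint is meant to exclude.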

Sussman et al. (2012) suggested using a slightly different embedding based on the eigenvalue decomposition of the adjacency matrix $A = U \Lambda U'$ (this matrix is symmetric for an undirected graph). Then one can choose the top $d$ eigenvalues and the corresponding eigenvectors and construct a new matrix

  $Z = U_{|d} \Lambda_{|d}^{1/2}$,   (19)

where $U_{|d} \in \mathbb{R}^{N \times d}$ and $\Lambda_{|d} \in \mathbb{R}^{d \times d}$. Rows of this matrix can be used as node embeddings: $z_i = Z[i]'$. This is the so-called adjacency spectral embedding, see also Levin et al. (2018).

For directed graphs, because of their asymmetric nature, keeping track of the $k$-order neighbours for $k > 1$ becomes difficult. For this reason, working with a high-order proximity matrix $S$ is preferable. Furthermore, for directed graphs, it may be preferable to learn two vector representations per node, one to be used when the node is the source and the other when the node is the target of an edge. One may learn embeddings for directed graphs by solving the following:

  $\min_{Z^s, Z^t} ||S - Z^s (Z^t)'||_F^2$,   (20)

where $||\cdot||_F$ is the Frobenius norm and $Z^s, Z^t \in \mathbb{R}^{N \times d}$. Given the solution, one can define the “source” features of a node as $z_i^s = Z^s[i]'$ and the “target” features as $z_i^t = Z^t[i]'$. A single-vector embedding of a node can be defined as a concatenation of these features. The Eckart–Young–Mirsky theorem Eckart and Young (1936) from linear algebra indicates that the solution is equivalent to finding the singular value decomposition of $S$:

  $S = U^s \Sigma (U^t)'$,   (21)

where $\Sigma$ is a diagonal matrix of singular values, and $U^s$ and $U^t$ are matrices of left and right singular vectors respectively (stacked as columns). Then, using the top $d$ singular values and vectors, one gets the solution of the optimization problem in (20):

  $Z^s = U^s_{|d} \Sigma_{|d}^{1/2}, \quad Z^t = U^t_{|d} \Sigma_{|d}^{1/2}$
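A numpy sketch of this SVD-based construction, using the Katz matrix of a toy 3-cycle as the proximity matrix $S$ (the graph and $\beta$ are illustrative choices):

```python
import numpy as np

# High-order proximity matrix of a toy directed 3-cycle (here: Katz index).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
beta = 0.3
S = np.linalg.inv(np.eye(3) - beta * A) @ (beta * A)

# SVD: S = U_s Sigma U_t'; keeping the top-d singular values gives the
# best rank-d solution of Eq. (20) by Eckart-Young-Mirsky.
U_s, sigma, U_t_T = np.linalg.svd(S)
d = 2
Zs = U_s[:, :d] * np.sqrt(sigma[:d])       # "source" embeddings
Zt = U_t_T.T[:, :d] * np.sqrt(sigma[:d])   # "target" embeddings

# The rank-d reconstruction error equals the energy of the dropped
# singular values, which no rank-d factorization can beat:
err = np.linalg.norm(S - Zs @ Zt.T, "fro") ** 2
assert np.isclose(err, np.sum(sigma[d:] ** 2))
```

Splitting $\Sigma^{1/2}$ evenly between the two factors is a convention; any split $\Sigma^{a}, \Sigma^{1-a}$ yields the same product $Z^s (Z^t)'$.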

#### 3.2.4 Random Walk Approaches

One of the popular classes of approaches for learning an embedding function for graphs is the class of random walk approaches. Similar to decomposition approaches, encoders based on random walks also learn embeddings in an unsupervised way. However, compared to decomposition approaches, these embeddings may capture longer term dependencies. To describe the encoders in this category, first we define what a random walk is and then describe the encoders that leverage random walks to learn an embedding function.

###### Definition 7.

A random walk for a graph $G = (V, E)$ is a sequence of nodes $v_1, v_2, \dots, v_l$ where $v_i \in V$ for all $1 \leq i \leq l$ and $(v_i, v_{i+1}) \in E$ for all $1 \leq i < l$. $l$ is called the length of the walk.

A random walk of length $l$ can be generated by starting at a node $v_1$ in the graph, then transitioning to a neighbor $v_2$ of $v_1$ (i.e., $(v_1, v_2) \in E$), then transitioning to a neighbor $v_3$ of $v_2$, and continuing this process for $l - 1$ steps. The selection of the first node and of the node to transition to in each step can be uniformly at random or based on some distribution/strategy.

###### Example 6.

Consider the graph in Figure 1(b). The following are three examples of random walks on this graph with length $4$:

  1) v1, v3, v2, v3   2) v2, v1, v2, v4   3) v4, v2, v4, v2

In the first walk, the initial node has been selected to be $v_1$. Then a transition has been made to $v_3$, which is a neighbor of $v_1$. Then a transition has been made to $v_2$, which is a neighbor of $v_3$, and then a transition back to $v_3$, which is a neighbor of $v_2$. The following are two examples of invalid random walks:

  1) v1, v4, v2, v3   2) v1, v3, v4, v2

The first one is not a valid random walk since a transition has been made from $v_1$ to $v_4$ when there is no edge between $v_1$ and $v_4$. The second one is not valid because a transition has been made from $v_3$ to $v_4$ when there is no edge between $v_3$ and $v_4$.

Random walk encoders perform multiple random walks of length $l$ on a graph and consider each walk as a sentence, where the nodes are treated as the words of these sentences. Then they use techniques from natural language processing for learning word embeddings (e.g., Mikolov et al. (2013); Pennington et al. (2014)) to learn a vector representation for each node in the graph. One such approach is to create a matrix $M$ from these random walks such that $M[i][j]$ corresponds to the number of times $v_i$ and $v_j$ co-occurred in random walks, and then factorize the matrix (see Section 3.2.3) to get vector representations for the nodes.

Random walk encoders typically differ in the way they perform the walk, the distribution they use for selecting the initial node, and the transition distribution they use. For instance, DeepWalk Perozzi et al. (2014) selects both the initial node and the node to transition to uniformly at random. Perozzi et al. (2016) extends DeepWalk by allowing random walks to skip over multiple nodes at each transition. Node2Vec Grover and Leskovec (2016) selects the node to transition to based on a combination of breadth-first search (to capture local information) and depth-first search (to capture global information).
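A uniform random walk generator in the style of DeepWalk can be sketched as follows (the toy graph, the walk count, and the consecutive-pair co-occurrence counting are illustrative simplifications; real implementations count co-occurrences within a context window):

```python
import random
from collections import defaultdict

# Toy undirected graph as an adjacency list (hypothetical example).
neighbors = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}

def random_walk(start, length, rng):
    """DeepWalk-style uniform walk: each transition is drawn uniformly
    at random from the current node's neighbors."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(neighbors[walk[-1]]))
    return walk

rng = random.Random(0)
cooccur = defaultdict(int)
for _ in range(200):
    walk = random_walk(rng.choice(list(neighbors)), 4, rng)
    for a, b in zip(walk, walk[1:]):
        assert b in neighbors[a]   # every transition follows an edge
        cooccur[a, b] += 1         # consecutive co-occurrence counts

# Non-adjacent nodes (here 1 and 4) never co-occur consecutively:
assert cooccur[1, 4] == 0 and cooccur[4, 1] == 0
```

The resulting `cooccur` counts play the role of the matrix $M$ above: feeding the walks to a word-embedding model, or factorizing $M$, yields the node embeddings.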

#### 3.2.5 Autoencoder Approaches

Another class of models for learning an embedding function for static graphs is by using autoencoders. Similar to the decomposition approaches, these approaches are also unsupervised. However, instead of learning shallow embeddings that reconstruct the edges of a graph, the models in this category create a deep encoder that compresses a node’s neighbourhood to a vector representation, which can be then used to reconstruct the node’s neighbourhood. The model used for compression and reconstruction is referred to as an autoencoder. Similar to the decomposition approaches, once the node embeddings are learned, they may be used for purposes other than predicting a node’s neighbourhood.

In its simplest form, an autoencoder Hinton and Salakhutdinov (2006) contains two components called the encoder and the decoder, each of which is a feed-forward neural network. To avoid confusion with graph encoders and decoders, we refer to these two components as the first and second component. The first component takes as input a vector $x \in \mathbb{R}^{d_1}$ (e.g., corresponding to numerical features of an object) and passes it through several feed-forward layers producing another vector $z \in \mathbb{R}^{d_2}$ such that $d_2 \ll d_1$. The second component receives $z$ as input and passes it through several feed-forward layers aiming at reconstructing $x$. That is, assuming the output of the second component is $\hat{x}$, the two components are trained such that $||x - \hat{x}||$ is minimized. $z$ can be considered a compression of $x$.
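To make the idea concrete, here is a deliberately simplified sketch in which both components are linear (a single layer, no activation) and trained by plain gradient descent; real autoencoders use several nonlinear feed-forward layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 20, 8, 3                 # compress 8-dim inputs x to 3-dim z
X = rng.normal(size=(n, d1)) @ rng.normal(size=(d1, d1)) * 0.1
W1 = rng.normal(scale=0.1, size=(d1, d2))   # first component:  z = x W1
W2 = rng.normal(scale=0.1, size=(d2, d1))   # second component: x_hat = z W2

def loss(W1, W2):
    return np.mean((X - (X @ W1) @ W2) ** 2)  # reconstruction error

before = loss(W1, W2)
lr = 0.05
for _ in range(300):                 # plain full-batch gradient descent
    Z = X @ W1                       # compressed codes
    E = Z @ W2 - X                   # reconstruction residual
    gW2 = Z.T @ E * (2 / (n * d1))   # manual gradients of the mean
    gW1 = X.T @ (E @ W2.T) * (2 / (n * d1))   # squared error
    W1 -= lr * gW1
    W2 -= lr * gW2

assert loss(W1, W2) < before         # reconstruction improved
```

Since $d_2 < d_1$, the codes $z$ cannot copy the input; the network must keep only the directions that matter for reconstruction, which is exactly the compression the text describes.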

Let $G = (V, E)$ be a graph with adjacency matrix $A$. For a node $v_i$, let $A[i]$ represent the row of the adjacency matrix corresponding to the neighbors of $v_i$. To use autoencoders for generating node embeddings, Wang et al. (2016) train an autoencoder (named SDNE) that takes the vectors $A[i]$ as input, compresses each to $z_i$ in its first component, and then reconstructs it in its second component. After training, the vectors $z_i$ corresponding to the output of the first component of the autoencoder can be considered as embeddings for the nodes $v_i$. $z_i$ and $z_j$ may further be constrained to be close in Euclidean space if $v_i$ and $v_j$ are connected. For the case of attributed graphs, Tran (2018) concatenates the attribute values of node $v_i$ to $A[i]$ and feeds the concatenation into an autoencoder. Cao et al. (2016) propose an autoencoder approach (named DNGR) that is similar to SDNE, but they first compute a similarity matrix $S$ based on two nodes co-occurring on random walks (any other matrix from Section 3.2.1 may also be used) showing the pairwise similarity of each pair of nodes, and then feed the rows $S[i]$ into the autoencoder.

#### 3.2.6 Graph Convolutional Network Approaches

Yet another class of models for learning node embeddings in a graph are graph convolutional networks (GCNs). As the name suggests, graph convolutions generalize convolutions to arbitrary graphs. Graph convolutions have spatial (see, e.g., Hamilton et al. (2017a, b); Schlichtkrull et al. (2018); Gilmer et al. (2017)) and spectral constructions (see, e.g., Liao et al. (2019); Kipf and Welling (2017); Defferrard et al. (2016); Levie et al. (2017)). Here, we describe the spatial (or message passing) view and refer the reader to Bronstein et al. (2017) for the spectral view.

A GCN consists of multiple layers, where each layer takes node representations (a vector per node) as input and outputs transformed representations. Let $z_{v,l}$ be the representation for a node $v$ after passing it through the $l^{th}$ layer. A very generic forward pass through a GCN layer transforms the representation of each node as follows:

 zv,l+1=transfo