Efficiently Embedding Dynamic Knowledge Graphs

10/15/2019
by   Tianxing Wu, et al.

Knowledge graph (KG) embedding encodes the entities and relations from a KG into low-dimensional vector spaces to support various applications such as KG completion, question answering, and recommender systems. In the real world, knowledge graphs (KGs) are dynamic and evolve over time with the addition or deletion of triples. However, most existing models focus on embedding static KGs while neglecting dynamics. To adapt to the changes in a KG, these models need to be re-trained on the whole KG at a high time cost. In this paper, to tackle this problem, we propose a new context-aware Dynamic Knowledge Graph Embedding (DKGE) method which supports embedding learning in an online fashion. DKGE introduces two different representations (i.e., knowledge embedding and contextual element embedding) for each entity and each relation, in the joint modeling of entities and relations as well as their contexts, by employing two attentive graph convolutional networks, a gate strategy, and translation operations. This effectively limits the impact of a KG update to certain regions rather than the entire graph, so that DKGE can rapidly acquire the updated KG embedding with a proposed online learning algorithm. Furthermore, DKGE can also learn KG embedding from scratch. Experiments on the tasks of link prediction and question answering in a dynamic environment demonstrate the effectiveness and efficiency of DKGE.


I Introduction

Knowledge graphs (KGs) such as DBpedia [26], YAGO [30], and Freebase [2] have been built to benefit many intelligent applications, e.g., semantic search, question answering, and recommender systems. These KGs are multi-relational graphs describing entities and their relations in the form of triples. A triple is often denoted as (head entity, relation, tail entity) (i.e., $(h, r, t)$) to indicate that two entities are connected by a specific relation, e.g., (Barack Obama, Party, Democratic Party). Recently, techniques of knowledge graph (KG) embedding [38] have received considerable attention, as they can learn the representations (i.e., embeddings) of entities and relations in low-dimensional vector spaces, and these embeddings can be used as features to support link prediction, entity classification, and question answering, among many others.

In the real world, KGs are dynamic and always changing over time. For example, DBpedia extracts the update stream of Wikipedia each day to keep the KG up-to-date [16]. The Amazon product KG needs to be updated quite frequently because a large number of new products appear every day [10]. However, most existing KG embedding models [31, 3, 39, 41, 36, 12, 9] focus on embedding static KGs while neglecting dynamic updates. To adapt to the changes in a KG, these models need to be re-trained on the whole KG at a high time cost, which is unacceptable when the KG has a high update frequency (e.g., once per day). Thus, how to embed dynamic KGs in an online manner is an important problem to solve.

Fig. 1: (a) A KG does not have the relation between entities and at time step , and we add a triple at time step . (b) An illustration of using puTransE [33] on .

Although many methods on dynamic graph embedding [14, 29, 45, 27, 11, 35] have emerged that support the online learning of node embeddings, these methods cannot be applied to dynamic KG embedding. This is because they only learn node embeddings based on structural proximities without considering relation semantics on edges, whereas KG embedding needs to learn not only node (entity) embeddings but also relation embeddings, and to preserve relational constraints between entities. Besides, some models [7, 22, 34] on temporal KG embedding also work on dynamic KGs, but their target is to mine evolving knowledge from multiple given snapshots of a KG to better perform link prediction and time prediction. In other words, they only conduct offline embedding learning, and when faced with KG updates, they also need to be re-trained on the whole KG, so they cannot embed dynamic KGs with high efficiency.

The main reason why most KG embedding models lack the capability of online embedding learning is that, when a KG is updated with the addition and deletion of triples, revising the representations of some entities and relations to adapt to the updated KG may spread such revisions to the entire graph through the correlations among entities and relations. For example, suppose we embed a KG (shown in Figure 1(a)) using TransE [3], which constrains $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ (bold characters denote vectors) on each triple $(h, r, t)$. After adding a new triple whose head entity, tail entity, and relation all already exist in the earlier version of the KG, we need to optimize the same constraint on this new triple. No matter which element of the new triple we choose to revise, the revision will break the constraint for other triples containing that element, so it may cause a chain reaction of revisions on the embeddings of entities and relations in the entire graph.

The only existing work which supports online KG embedding learning is puTransE [33]. As illustrated in Figure 1(b), puTransE first splits the KG into different small sets of triples (a triple may exist in multiple sets), each of which is utilized to train an embedding space, and then only selects the maximum energy score (i.e., where is the or norm) of each triple across these embedding spaces for link prediction. When facing a KG update, to support online learning, puTransE directly trains new embedding spaces with small sets of triples containing newly added triples, and deletes existing spaces containing deleted triples. However, puTransE has two major problems which lower the quality of generated embeddings as follows:

  • Problem 1. puTransE learns embeddings of entities and relations from local parts of a KG, so it avoids re-training on the entire graph when the KG has an update, but this cannot preserve the global structure information of the KG in the learnt embeddings.

  • Problem 2. puTransE leverages the scoring function of TransE [3] to compute energy scores of each triple, which cannot work well to model 1-to-N, N-to-1, and N-to-N relations. Take a 1-to-N relation as an example, if we use puTransE to learn embeddings of the entities and relations in triples and (see Figure 1(a)) in a space, this will cause .

Fig. 2: The architecture of learning embeddings in DKGE

In this paper, we study how to efficiently learn high-quality embeddings of entities and relations in dynamic KGs. Based on the above analyses, we find that it is non-trivial and cannot be well solved by existing KG embedding models. This motivates us to propose a new method which can learn KG embedding from scratch, and support online embedding learning, as well as address the problems of puTransE. To this aim, we devise a novel context-aware Dynamic Knowledge Graph Embedding method, called DKGE, which can embed dynamic KGs with high effectiveness and efficiency.

For each triple $(h, r, t)$, unlike puTransE, which only uses an individual representation for each entity or relation in the scoring function, DKGE incorporates contextual information into a joint embedding of each entity (denoted as $\mathbf{h}^*$ and $\mathbf{t}^*$) or relation (denoted as $\mathbf{r}^*$) for the translation operation, i.e., $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$. The context of an entity consists of itself and its neighbor entities. The context of a relation is composed of itself and the relation paths connecting the same entity pairs. These contexts are represented as neighborhood subgraphs. As shown in Figure 2, the joint embedding of each entity ($\mathbf{h}^*$ or $\mathbf{t}^*$) or relation ($\mathbf{r}^*$) is formed by combining the embedding of itself (called knowledge embedding, i.e., $\mathbf{h}^k$, $\mathbf{t}^k$, or $\mathbf{r}^k$) and the embedding of its context (called contextual subgraph embedding, i.e., $\mathbf{sg}(h)$, $\mathbf{sg}(t)$, or $\mathbf{sg}(r)$) through a gate strategy [40]. Contextual subgraph embeddings of entities and relations are computed by two separate neural networks, called attentive graph convolutional networks (AGCNs). The above techniques enable DKGE to learn KG embedding from scratch and to model 1-to-N, N-to-1, and N-to-N relations well, which solves Problem 2 of puTransE. For example, when modeling triples and in Figure 1(a), as long as their contextual subgraph embeddings are different.

To support online learning, DKGE assigns two different representations to each entity or relation. When an entity (or a relation) denotes itself, we use a representation called knowledge embedding; when it denotes a part of the context of other entities (or relations), we use another representation called contextual element embedding. Contextual element embeddings are combined to form contextual subgraph embeddings using an attentive graph convolutional network (AGCN). Under this setting, we propose an online learning algorithm to incrementally learn KG embedding. In this algorithm, based on the idea of inductive learning, we keep all learnt parameters in the AGCNs and the gate strategy unchanged, and also keep the contextual element embeddings of existing entities and relations unchanged. After a KG update, there will exist many triples in which the contexts of all entities and relations are unchanged, so their contextual subgraph embeddings are unchanged. Thus, with the existing knowledge embeddings of such entities and relations, these triples already hold $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$, so we also keep the knowledge embeddings of existing entities and relations unchanged as long as their contexts are unchanged. In this way, we only need to learn the knowledge embeddings and contextual element embeddings of emerging entities and relations, as well as the knowledge embeddings of existing entities and relations with changed contexts. This greatly reduces the number of triples which need to be re-trained while preserving $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$ on the whole KG. Thus, our algorithm can effectively perform online learning with high efficiency and solve Problem 1 of puTransE.

In experiments, we first evaluate DKGE on link prediction in a dynamic environment. Compared with state-of-the-art static KG embedding methods, DKGE has comparable effectiveness in different evaluation metrics, and much better efficiency in online learning since the baselines need to be re-trained on the whole KG. When comparing with the dynamic KG embedding baseline, i.e., puTransE, DKGE significantly outperforms it in both effectiveness and efficiency. We also conduct case studies on question answering in a dynamic environment to show that DKGE can help get accurate answers without writing structured queries in query languages.

Contributions. The main contributions of this paper are summarized as follows:

  • We define the problem of embedding dynamic KGs, which is divided into two sub-problems: learning from scratch and online learning (Section II).

  • We propose a new context-aware dynamic KG embedding method DKGE, which can not only learn KG embedding from scratch (Section III), but also incrementally learn KG embedding by an online learning algorithm with high efficiency (Section IV). DKGE solves the problems of puTransE, which is the only existing model supporting online KG embedding learning.

  • We present a unified solution to encode contexts of entities and relations based on an AGCN model, which can select the most important information from the context of the given entity or relation (Section III).

  • We conduct comprehensive experiments on real-world data management applications in a dynamic environment, including link prediction and question answering (QA). QA with KG embedding techniques can query the triples which are not in the KG, while classical strategies using structured queries in query languages cannot return any result. Evaluation results show the effectiveness and efficiency of our method DKGE (Section V).

II Problem Definition

In this section, we define the problem of embedding dynamic KGs as two sub-problems, i.e., learning from scratch and online learning. Let a KG $G^{\tau} = \{(h, r, t)\} \subseteq E^{\tau} \times R^{\tau} \times E^{\tau}$, where $\{(h, r, t)\}$ represents a set of triples, $h \in E^{\tau}$ means a head entity, $r \in R^{\tau}$ is a relation, $t \in E^{\tau}$ is a tail entity, $E^{\tau}$ and $R^{\tau}$ are the sets of all entities and relations in $G^{\tau}$, respectively, and $\tau$ is the current time step. We define learning from scratch as follows:

Definition 1

Learning from Scratch. Given the KG $G^{\tau}$ at time step $\tau$, learning from scratch uses a KG embedding method to learn embeddings of all entities and relations.

At time step $\tau + 1$, $G^{\tau}$ becomes $G^{\tau+1}$ with an update including the addition and deletion of triples. The update is not limited to existing entities and relations, and may introduce emerging ones. Here, we define online learning as follows:

Definition 2

Online Learning. Given the KGs $G^{\tau}$ and $G^{\tau+1}$ at time step $\tau + 1$ as well as the intermediate embedding results at time step $\tau$, online learning efficiently learns new embeddings of entities and relations without re-training on the whole updated KG $G^{\tau+1}$.

Fig. 3: The workflow of DKGE

Figure 3 illustrates the workflow of our proposed dynamic KG embedding method DKGE, including learning from scratch and online learning.
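As a rough illustration of the input to online learning, the sketch below compares two KG snapshots and reports the added and deleted triples together with the emerging and removed entities and relations. The names `KGUpdate` and `diff_snapshots` and the toy identifiers are hypothetical and chosen only for this example; they do not come from the paper.

```python
from typing import NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class KGUpdate(NamedTuple):
    added: Set[Triple]
    deleted: Set[Triple]
    emerging_entities: Set[str]
    emerging_relations: Set[str]
    removed_entities: Set[str]
    removed_relations: Set[str]


def entities_of(triples: Set[Triple]) -> Set[str]:
    """All entities that occur as a head or a tail."""
    return {e for h, _, t in triples for e in (h, t)}


def relations_of(triples: Set[Triple]) -> Set[str]:
    """All relations that occur in at least one triple."""
    return {r for _, r, _ in triples}


def diff_snapshots(old: Set[Triple], new: Set[Triple]) -> KGUpdate:
    """Compare two snapshots of a KG taken at consecutive time steps."""
    old_e, new_e = entities_of(old), entities_of(new)
    old_r, new_r = relations_of(old), relations_of(new)
    return KGUpdate(
        added=new - old,
        deleted=old - new,
        emerging_entities=new_e - old_e,
        emerging_relations=new_r - old_r,
        removed_entities=old_e - new_e,
        removed_relations=old_r - new_r,
    )


# Toy snapshots (illustrative identifiers only).
old_kg = {("e1", "r1", "e2"), ("e2", "r2", "e3")}
new_kg = {("e1", "r1", "e2"), ("e2", "r3", "e4")}
update = diff_snapshots(old_kg, new_kg)
print(update.added, update.emerging_entities, update.emerging_relations)
```

Representing a snapshot as a set of triples keeps the comparison to plain set differences, which is enough to show that an update can both delete existing objects and introduce new ones.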

III Learning from Scratch in DKGE

In this section, we present the details of learning from scratch in DKGE. The key idea behind DKGE is to preserve $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$ on each triple $(h, r, t)$ in the given KG, where $\mathbf{h}^*$, $\mathbf{r}^*$, and $\mathbf{t}^*$ are the joint embeddings of the head entity, relation, and tail entity incorporating their respective contextual information. Such contextual information can provide rich structural features and help model 1-to-N, N-to-1, and N-to-N relations well (discussed in Section I), which enables DKGE to generate high-quality KG embedding. Thus, we first introduce a unified solution to encode the contexts of entities and relations as vector representations. Then, we describe our strategy to integrate the knowledge embeddings of entities and relations with the vector representations of their corresponding contexts. Finally, we define a scoring function and a loss function based on translation operations for parameter training.

III-A Context Encoding

For entities, the most intuitive context is their neighbor entities. To preserve the structural information among the given entity and its neighbor entities, we define the context of each entity as an undirected subgraph consisting of its neighbor entities and itself. To effectively limit the complexity of DKGE, the number of neighbor entities should not be too large, e.g., only one-hop neighbor entities are considered, but more distant neighbor entities may bring useful information for DKGE, so there is a trade-off between effectiveness and efficiency here. Actually, in our experiments, after using more distant neighbor entities besides one-hop ones, it will take much more time for model training, but DKGE’s accuracy in link prediction will not be significantly improved (details will be discussed in Section V-D). This is mainly because the further away neighbor entities are from the given entity, the less relevance they have [24, 23], and less relevant neighbor entities may introduce not only useful information but also noise in DKGE. Therefore, we finally only choose one-hop neighbor entities to build the context of each given entity.

Fig. 4: (a) The context of entity in the KG at time step (shown in Figure 1(a)). (b) The context of relation in at time step (also shown in Figure 1(a)). is a relation path composed of relations and , and is composed of relations and .
Example 1

Figure 4(a) shows the context of entity in the KG at time step given in Figure 1(a). The subgraph contains the one-hop neighbor entities and itself. preserves not only the edges between and its one-hop neighbor entities, but also the edges between such neighbor entities.
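The construction of an entity context described above can be sketched as follows. This is a minimal illustration, assuming the KG is held as a set of triples; the function name `entity_context` and the toy entities are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def entity_context(triples: Set[Triple], entity: str):
    """Return (vertices, undirected edges) of the one-hop context of `entity`."""
    vertices = {entity}
    for h, _, t in triples:
        if h == entity:
            vertices.add(t)
        elif t == entity:
            vertices.add(h)
    # Keep edges between the entity and its neighbors as well as edges
    # between the neighbors themselves; relation labels are dropped
    # because the context is an undirected, unlabeled subgraph.
    edges: Set[FrozenSet[str]] = set()
    for h, _, t in triples:
        if h in vertices and t in vertices and h != t:
            edges.add(frozenset((h, t)))
    return vertices, edges


# Toy KG: "d" is two hops away from "a" and is therefore excluded.
kg = {("a", "r1", "b"), ("a", "r2", "c"), ("b", "r3", "c"), ("c", "r1", "d")}
ctx_vertices, ctx_edges = entity_context(kg, "a")
print(sorted(ctx_vertices))
```

The edge between the two neighbors "b" and "c" is retained, matching the statement that the context preserves edges among neighbor entities and not only edges incident to the given entity.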

Different from entities, each relation occurs many times in a KG, so it is hard to choose reasonable neighbor entities or relations as a part of the context of each relation. Here, we choose to use the relation paths connecting the same entity pairs (in the same direction) with each given relation as a part of its context. Then, to capture the structural associations of such relations and relation paths, we transform each relation and its corresponding relation paths connecting the same entity pairs as vertices, and add undirected edges between two vertices if their corresponding relations or relation paths connect the same entity pairs. As a result, we also construct an undirected subgraph as the context of each relation. Similar to the selection of the neighbor entities for each entity’s context, to maintain the efficiency of DKGE, we hope that the number of relevant relation paths of a given relation is not too large, so we choose to constrain the length of each relation path. In our experiments, if we consider the relation paths with the length greater than two, DKGE’s accuracy in link prediction will also not be significantly improved, but it will cause much more training time (details will be given in Section V-D). Hence, the length of each relation path is constrained as one or two.

Example 2

In Figure 1(a), relation and relation path are used to link entity to entity . Relation and relation path are used to link entity to entity . Thus, and are neighbor vertexes of in the subgraph (see Figure 4(b)), i.e., the context of .
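The relation context can be sketched in the same spirit. The code below is an illustrative reading of the construction, assuming relation paths of length one or two and treating each relation or path as a tuple of relation names; `relation_context` and the toy triples are hypothetical, and the exclusion of paths that revisit the head or tail entity is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def relation_context(triples: Set[Triple], relation: str):
    """Return (vertices, undirected edges) of the context of `relation`.

    Vertices are relations, written (r,), or two-hop relation paths,
    written (r1, r2), that connect the same entity pair in the same
    direction as `relation`.
    """
    out_edges = defaultdict(set)   # head -> {(relation, tail)}
    connectors = defaultdict(set)  # (head, tail) -> relations and paths
    for h, r, t in triples:
        out_edges[h].add((r, t))
        connectors[(h, t)].add((r,))
    for h, r1, m in triples:
        for r2, t in out_edges.get(m, ()):
            if m != h and m != t and h != t:
                connectors[(h, t)].add((r1, r2))

    vertices, edges = {(relation,)}, set()
    for conns in connectors.values():
        if (relation,) in conns:
            vertices |= conns
            # Link every two connectors that share this entity pair.
            edges |= {frozenset(pair) for pair in combinations(conns, 2)}
    return vertices, edges


kg = {
    ("e1", "r1", "e2"), ("e1", "r2", "e3"), ("e3", "r3", "e2"),
    ("e4", "r1", "e5"), ("e4", "r4", "e5"),
}
rel_vertices, rel_edges = relation_context(kg, "r1")
print(sorted(rel_vertices))
```

In the toy KG, "r1" shares the pair (e1, e2) with the path (r2, r3) and the pair (e4, e5) with the relation "r4", so both become neighbor vertices of "r1".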

Most existing models only assign one representation to each entity and each relation, which is insufficient for online embedding learning after a KG update with the addition and deletion of triples, because a revision of the representations of a few entities or relations may spread to the entire graph due to the correlations among entities and relations defined in the scoring function. Different from them, each entity or relation in DKGE corresponds to two different representations, i.e., knowledge embedding and contextual element embedding, which are defined as follows:

Definition 3

Knowledge Embedding. When we use a vector representation to denote the given entity $e$ (or the relation $r$) itself in DKGE, this vector representation is the knowledge embedding $\mathbf{e}^k$ (or $\mathbf{r}^k$).

Definition 4

Contextual Element Embedding. When an entity $e$ (or a relation $r$) denotes a part of the context of other entities or relations in DKGE, its corresponding representation for this role is the contextual element embedding $\mathbf{e}^c$ (or $\mathbf{r}^c$).

Such a setting enables DKGE to perform online learning without re-training the whole KG, which will be introduced in detail in Section IV. Note that the vector representation for the context of an entity or a relation is called contextual subgraph embedding, which is defined as follows:

Definition 5

Contextual Subgraph Embedding. The context of each entity $e$ (or relation $r$) is represented as a subgraph $sg(e)$ (or $sg(r)$), and its vector representation is the contextual subgraph embedding $\mathbf{sg}(e)$ (or $\mathbf{sg}(r)$), which is formed by combining the contextual element embeddings of the entities (or relations) in the subgraph.

Fig. 5: The AGCN model. The input is initial vertex features and adjacency information of the given subgraph. Hidden layers conduct convolutional operations to generate new vertex features. The attention layer computes the weight of each vertex. The output contextual subgraph embedding is the weighted sum of all vertices’ features.

Why do we use an attentive GCN? Since the contexts of entities and relations are all represented as subgraphs, the problem of context encoding is converted to subgraph encoding. Recently, different graph convolutional networks [4, 17, 8, 25] have been proposed for feature extraction on arbitrary graphs for machine learning, and have achieved very promising results. The input of a graph convolutional network (GCN) is the initial feature vectors of vertices and the graph structure (i.e., the adjacency matrix). The GCN learns a function of features on the input graph and outputs trained feature vectors of vertices by incorporating neighborhood information, which can capture rich structural information in the input graph. Since our target is to encode a subgraph as a vector, we can use a GCN to learn the vectors of all vertices in the input subgraph, and combine them to acquire the vector representation of the subgraph, i.e., the contextual subgraph embedding. However, in our scenario, a subgraph is the context of some object (i.e., an entity or a relation), so some vertices may be important to this object and some may be useless. Thus, we propose a new attentive GCN model which can assign a weight to each vertex for the final combination.

The Attentive GCN (AGCN) Model. Figure 5 shows the framework of the AGCN model. Given an object $o$ (an entity or a relation) and its context, i.e., a subgraph $sg(o)$ with $n$ vertices $\{v_i\}_{i=1}^{n}$, we first build the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and initialize the vertex feature matrix $H^{(0)} \in \mathbb{R}^{n \times d^{(0)}}$ with the strategy introduced in Section III-C ($d^{(0)}$ is the number of initialized features for each vertex). Each row in $H^{(0)}$ is denoted as $\mathbf{v}_i^{(0)}$. If $o$ is an entity, then $v_i$ is an entity and $\mathbf{v}_i^{(0)}$ is its contextual element embedding. When $o$ is a relation, if $v_i$ is a relation, then $\mathbf{v}_i^{(0)}$ denotes its contextual element embedding; if $v_i$ is a relation path consisting of two relations, then $\mathbf{v}_i^{(0)}$ is the sum of the contextual element embeddings of these two relations.

Then, we input $A$ and $H^{(0)}$ to the hidden layers to generate new vertex features incorporating neighborhood information. We apply the propagation rule proposed in [25] to compute the vertex feature matrix $H^{(l)} \in \mathbb{R}^{n \times d^{(l)}}$ ($d^{(l)}$ is the number of features output by the $l$th hidden layer) output by the $l$th hidden layer with a convolution operation as:

$$H^{(l)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\right) \tag{1}$$

where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I$, $I$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, and $W^{(l)}$ is the weight matrix of the $l$th hidden layer in the AGCN model. The number of hidden layers $L$ means that the AGCN performs $L$ propagation steps during the forward pass and convolves the information from all neighbor vertices up to $L$ hops away. Each entity or relation only has one-hop neighbor vertices in its context, but the neighbor vertices themselves may have two-hop neighbors (see Figure 4(a)). Hence, the AGCN used in our scenario contains two hidden layers at most, i.e., $L \le 2$. Besides, since each in may be taken as the input of the AGCN in online learning, we simply set the size of the weight matrix in each hidden layer as , and let .
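The propagation rule of [25] used by the hidden layers can be sketched in NumPy as below: add self-loops to the adjacency matrix, normalize it symmetrically by the degrees, and apply a weight matrix followed by an activation. This is a minimal sketch, not the authors' implementation; the ReLU activation, the random weights, and the function name `gcn_layer` are assumptions made for illustration.

```python
import numpy as np


def relu(x):
    """Assumed activation; the text only says 'an activation function'."""
    return np.maximum(x, 0.0)


def gcn_layer(adj, feats, weight, activation=relu):
    """One hidden layer: activation(D^-1/2 (A + I) D^-1/2 H W)."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # degrees of A + I, always >= 1
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^-1/2
    return activation(d_inv_sqrt @ a_tilde @ d_inv_sqrt @ feats @ weight)


# A context with three mutually connected vertices and d = 4 features.
rng = np.random.default_rng(0)
n, d = 3, 4
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
h0 = rng.normal(size=(n, d))                  # initial vertex features
w1, w2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h2 = gcn_layer(adj, gcn_layer(adj, h0, w1), w2)  # two hidden layers
print(h2.shape)  # (3, 4)
```

Stacking two calls corresponds to the at-most-two hidden layers discussed above, so each output row mixes information from vertices up to two hops away inside the context subgraph.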

The output of the last hidden layer (i.e., ) is the input of an attention layer, which computes the weight of each vertex for the object based on the attention mechanism [1] as follows:

(2)
(3)

where is a row in , denotes the knowledge embedding of (initialized by the strategy introduced in Section III-C), is a parameter vector for the attention layer, means element-wise multiplication, and measures the relevance between and .

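Finally, we compute the contextual subgraph embedding $\mathbf{sg}(o)$ of the subgraph $sg(o)$ by a weighted sum of the vectors of all vertices as follows:

$$\mathbf{sg}(o) = \sum_{i=1}^{n} \alpha_i(o)\, \mathbf{v}_i \tag{4}$$

Here, $\alpha_i(o)$ is the attention weight of vertex $v_i$ for the object $o$ computed by the attention layer, and $\mathbf{v}_i$ is the feature vector of $v_i$ output by the last hidden layer.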

To sum up, our unified solution to context encoding extracts the contexts of entities and relations as subgraphs, and uses an AGCN model to acquire contextual subgraph embeddings. Unlike existing GCNs that operate on a whole big graph, we leverage the small subgraphs of entities to train an AGCN, and the small graphs of relations to train another AGCN.
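The attention-weighted combination of vertex features can be illustrated as follows. The text names the ingredients of the relevance measure (the vertex vector, the knowledge embedding of the object, a parameter vector, and element-wise multiplication) but its exact form is not reproduced here, so the scoring function in this sketch, a dot product between the parameter vector and the element-wise product, is an assumption; `attention_pool` is a hypothetical name.

```python
import numpy as np


def attention_pool(vertex_feats, obj_knowledge, u):
    """Softmax-weighted sum of vertex features for one object.

    vertex_feats:  (n, d) output of the last hidden layer
    obj_knowledge: (d,)   knowledge embedding of the object
    u:             (d,)   attention parameter vector
    """
    # ASSUMED relevance score: u . (v_i * o_k) for every vertex v_i.
    scores = (vertex_feats * obj_knowledge) @ u
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, weights @ vertex_feats    # (n,), (d,)


rng = np.random.default_rng(1)
feats = rng.normal(size=(3, 4))
weights, subgraph_emb = attention_pool(feats, rng.normal(size=4), rng.normal(size=4))
print(weights.sum(), subgraph_emb.shape)
```

Whatever relevance function is used, the softmax makes the weights positive and sum to one, so the contextual subgraph embedding stays in the span of the vertex features and has the same dimension as each of them.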

III-B Representation Integration

After obtaining the contextual subgraph embeddings of entities and relations, we integrate them with the knowledge embeddings of entities and relations to build the joint representation of each object $o$ in the KG. The simplest way is the mean operation, which directly averages the knowledge embedding and contextual subgraph embedding of $o$ to get its joint embedding $\mathbf{o}^*$. The benefit is that no parameter needs to be trained, which makes DKGE efficient, but forcing the knowledge embedding and contextual subgraph embedding to share the same weight is unreasonable. Another option is the weighting operation, which assigns different weights to the knowledge embedding and contextual subgraph embedding of $o$, but a fixed weight on all dimensions is also inappropriate. Thus, we apply a gate strategy [40] to representation integration, which can assign different weights to different dimensions of a vector as follows:

$$\mathbf{o}^* = \sigma(\mathbf{g}) \odot \mathbf{o}^k + \left(1 - \sigma(\mathbf{g})\right) \odot \mathbf{sg}(o) \tag{5}$$

where $o$ is an entity or a relation, $\mathbf{o}^k$ is its knowledge embedding, $\mathbf{sg}(o)$ is its contextual subgraph embedding, $\sigma(\cdot)$ constrains the value of each element in the gate vector to $[0, 1]$, $\odot$ denotes element-wise multiplication, and $\mathbf{g}$ is a parameter vector. Note that all entities share one $\mathbf{g}$, denoted as $\mathbf{g}_e$, and all relations share another, denoted as $\mathbf{g}_r$.
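A per-dimension gate of this kind can be sketched in a few lines. The sketch assumes a sigmoid as the function that squashes the gate vector into the unit interval; `joint_embedding` and the toy vectors are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def joint_embedding(knowledge, context, gate):
    """Mix knowledge and contextual subgraph embeddings dimension by dimension."""
    g = sigmoid(gate)                      # each gate value lies in (0, 1)
    return g * knowledge + (1.0 - g) * context


knowledge = np.array([1.0, 2.0])
context = np.array([3.0, 6.0])

# A zero gate vector gives weight 0.5 everywhere, i.e. the mean operation.
mean_like = joint_embedding(knowledge, context, np.zeros(2))

# Saturated gates pick the knowledge embedding in the first dimension
# and the contextual subgraph embedding in the second.
mixed = joint_embedding(knowledge, context, np.array([50.0, -50.0]))
print(mean_like, mixed)
```

The two calls show why the gate generalizes the simpler options discussed above: the mean operation and a fixed scalar weight are both special cases in which every dimension of the gate vector holds the same value.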

III-C Parameter Training

Since we aim to preserve $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$ on each triple $(h, r, t)$, we define a scoring function as follows:

$$f(h, r, t) = \left\lVert \mathbf{h}^* + \mathbf{r}^* - \mathbf{t}^* \right\rVert_{L_1/L_2} \tag{6}$$

where $\mathbf{h}^*$, $\mathbf{r}^*$, and $\mathbf{t}^*$ are computed by Eq. (5), and $\lVert \cdot \rVert_{L_1/L_2}$ denotes the $L_1$ or $L_2$ norm. As discussed earlier in Section I, Figure 2 shows the architecture of learning embeddings in DKGE. In learning from scratch, we need to train two AGCNs, two gate vectors, and the knowledge embeddings as well as contextual element embeddings of all entities and relations. Before training, we first initialize the knowledge embeddings and contextual element embeddings of all entities and relations following the uniform distribution $U\left(-\frac{6}{\sqrt{d}}, \frac{6}{\sqrt{d}}\right)$ (also used in TransE [3]), where $d$ is the embedding size. The initialized contextual element embeddings form each input initial vertex feature matrix (i.e., $H^{(0)}$ in Eq. (1)) in our AGCNs.
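The initialization and the translation-based score can be illustrated together. The sketch below assumes the TransE-style uniform range with bound 6 divided by the square root of the embedding size and takes already-computed joint embeddings as input; `init_embeddings` and `score` are hypothetical names.

```python
import numpy as np


def init_embeddings(num_objects, d, rng):
    """Draw embeddings uniformly from [-6/sqrt(d), 6/sqrt(d)]."""
    bound = 6.0 / np.sqrt(d)
    return rng.uniform(-bound, bound, size=(num_objects, d))


def score(h_star, r_star, t_star, norm=1):
    """Distance between the translated head and the tail (lower is better)."""
    return float(np.linalg.norm(h_star + r_star - t_star, ord=norm))


rng = np.random.default_rng(0)
emb = init_embeddings(num_objects=5, d=16, rng=rng)

# A triple that satisfies the translation exactly has score 0.
perfect = score(np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0]))
l1 = score(np.zeros(2), np.zeros(2), np.array([3.0, 4.0]), norm=1)
l2 = score(np.zeros(2), np.zeros(2), np.array([3.0, 4.0]), norm=2)
print(perfect, l1, l2)  # 0.0 7.0 5.0
```

The score is a distance, so correct triples should receive small values and corrupted triples large ones, which is what the margin-based loss enforces during training.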

For training, a margin-based loss function is defined as:

$$\mathcal{L} = \sum_{(h, r, t) \in S} \sum_{(h', r, t') \in S'} \max\left(0,\; f(h, r, t) + \gamma - f(h', r, t')\right) \tag{7}$$

where $\gamma$ is the margin, $S$ is the set of correct triples, and $S'$ is the set of incorrect triples. Since a KG only contains correct triples, we corrupt them by replacing head entities or tail entities to build $S'$. The replacement relies on negative sampling techniques. Although there exist some complex negative sampling methods [5, 37, 42] which can effectively improve the quality of KG embedding, we apply a basic negative sampling strategy called Bernoulli sampling [39], which is the most widely used in KG embedding models. We generate one incorrect triple for each correct triple. During training, all parameters including embeddings are updated using stochastic gradient descent (SGD) in each minibatch.
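The negative sampling and the margin-based loss can be sketched as follows. Bernoulli sampling replaces the head of a triple with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail for the relation. Rejecting corrupted triples that already exist in the KG, and all function names, are assumptions of this sketch.

```python
import random
from collections import defaultdict


def head_replace_prob(triples):
    """Per relation: probability of corrupting the head, tph / (tph + hpt)."""
    tails, heads = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        tails[(r, h)].add(t)
        heads[(r, t)].add(h)
    prob = {}
    for r in {r for _, r, _ in triples}:
        t_counts = [len(v) for (rel, _), v in tails.items() if rel == r]
        h_counts = [len(v) for (rel, _), v in heads.items() if rel == r]
        tph = sum(t_counts) / len(t_counts)   # average tails per head
        hpt = sum(h_counts) / len(h_counts)   # average heads per tail
        prob[r] = tph / (tph + hpt)
    return prob


def corrupt(triple, entities, prob, known, rng):
    """Sample one incorrect triple by replacing the head or the tail."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < prob[r] else (h, r, e)
        if cand not in known:                 # assumed filtering step
            return cand


def margin_loss(pos_scores, neg_scores, gamma):
    """Sum of max(0, f(correct) + gamma - f(incorrect)) over triple pairs."""
    return sum(max(0.0, p + gamma - n) for p, n in zip(pos_scores, neg_scores))


kg = {("a", "r", "b"), ("a", "r", "c"), ("a", "r", "d")}  # a 1-to-N relation
prob = head_replace_prob(kg)
neg = corrupt(("a", "r", "b"), ["a", "b", "c", "d", "e"], prob, kg, random.Random(0))
loss = margin_loss([1.0, 3.0], [2.5, 3.5], gamma=1.0)
print(prob["r"], loss)  # 0.75 0.5
```

For the 1-to-N relation in the toy KG the head is replaced three times out of four on average, which lowers the chance of generating a corrupted tail that is in fact a correct answer.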

IV Online Learning in DKGE

In this section, we first introduce our online learning algorithm, and then conduct complexity analysis.

IV-A Online Learning Algorithm

Knowledge is not static and always evolves over time, so KGs should be updated very frequently with the addition and deletion of triples. To adapt to such changes, KG embedding should also be dynamically updated in a short time. This poses a challenge for existing models, as they have to be re-trained on the whole KG at a high time cost. Thus, it is important to build an online embedding learning algorithm which can efficiently generate new high-quality KG embedding based on the results of existing KG embedding.

When the KG has an update, a good online learning algorithm should not only rapidly learn the embeddings of emerging entities and relations, but also consider the impacts on the embeddings of existing entities and relations. Such impacts should be limited in certain regions, not in the entire graph. Based on these principles, we apply the idea of inductive learning so that:

  • parameters in two learnt AGCNs are kept unchanged;

  • two learnt gate vectors are kept unchanged;

  • contextual element embeddings of existing entities and relations are kept unchanged.

After a KG update, in many triples, the contexts of all entities and relations are unchanged. With unchanged contextual element embeddings and unchanged parameters in the learnt AGCNs, the contextual subgraph embeddings of such entities and relations are unchanged. Based on this, with unchanged gate vectors and their existing knowledge embeddings, these triples already satisfy $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$, so we also constrain that:

  • knowledge embeddings of existing entities and relations are kept unchanged as long as their contexts are unchanged.

Thus, we only need to learn the knowledge embeddings and contextual element embeddings of emerging entities and relations, as well as the knowledge embeddings of existing entities and relations with changed contexts. This greatly reduces the number of triples which need to be re-trained while preserving $\mathbf{h}^* + \mathbf{r}^* \approx \mathbf{t}^*$ on the whole KG.

Example 3

In Figure 6, after adding the triples and into the KG , we have an emerging entity , an emerging relation , one existing relation with changed context , and two existing entities with changed context and . Based on the above idea of online learning, we only need to re-train six triples containing , , , , and (i.e., , , , , , and ), instead of all ten triples in .

For online learning at time step $\tau + 1$, the embedding initialization is different from that of learning from scratch. We randomly initialize the knowledge embeddings and contextual element embeddings of emerging entities (relations) following the uniform distribution $U\left(-\frac{6}{\sqrt{d}}, \frac{6}{\sqrt{d}}\right)$. The knowledge embeddings and contextual element embeddings of existing entities (relations) use the embedding results at time step $\tau$.

Fig. 6: The KG at time step ( at time step is shown in Figure 1(a)) with addition of the triples and .

Algorithm 1 shows the whole process of our online learning. Here, we use a 3-tuple to record knowledge embeddings and contextual element embeddings of entities and relations, where and are the set of entities and relations at time step respectively, is a set of knowledge embeddings, is a set of contextual element embeddings, and each entity or relation corresponds to a knowledge embedding and a contextual element embedding. Given KGs and , respectively, at time step and , we first remove the deleted objects (i.e., entities and relations) and their embeddings in (line 3-4). Then, we add emerging objects and their initialized embeddings into , and collect all triples containing emerging objects (line 5-8). Besides, we collect the triples, each of which has at least one object with changed context (line 9-13). After that, we use SGD on the collected triples with the loss function defined in Eq. (7) to only update knowledge embeddings and contextual element embeddings of emerging entities and relations, as well as knowledge embeddings of existing entities and relations with changed contexts (line 14-27). The algorithm will stop based on the performance on a validation set composed of accurate triples. These triples are randomly selected from the given KG, and do not belong to the input of this algorithm. All entities and relations in these triples should occur in other triples used for embedding learning. Finally, our algorithm outputs the updated (line 28).

Input: KG , entity set , relation set , embedding tuple at time step ; KG , entity set , relation set at time step ; size of minibatch , learning rate , dimension of embeddings .
Output: Updated at time step .
1 , ;
2 , ;
3 foreach object  do
4        Remove , its knowledge embedding and contextual element embedding in the embedding tuple ;
5 ; initialize a triple set
6 foreach object  do
7        Add , its knowledge embedding and contextual element embedding into , and initialize and following the uniform distribution ;
8        Add all triples in containing into ;
9 ; initialize an object set
10 foreach object  do
11        if  then
12               ;
13               Add all triples in containing into ;
14 loop
15        ; sample a minibatch: size b
16        ; initialize a set of pairs of triples
17        foreach triple  do
18               Sample a corrupt triple ;
19               ;
20        foreach object in  do
21               if  then
22                      ; : total loss on
23                      ;
24                      Update and in ;
25               else if  then
26                      ;
27                      Update in ;
28 return ;
Algorithm 1 Online Learning
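The bookkeeping part of the online learning procedure (finding deleted, emerging, and context-changed objects, and collecting the triples to re-train) can be sketched as below. This is only a sketch under stated assumptions: entity and relation ids are assumed to be disjoint, and `one_hop_context` is a hypothetical stand-in for the context construction, not the context definition of DKGE.

```python
def objects_of(kg):
    """All entity and relation ids appearing in a set of (h, r, t) triples."""
    return {x for triple in kg for x in triple}

def one_hop_context(kg, obj):
    """Hypothetical context: the set of triples that mention obj."""
    return frozenset(t for t in kg if obj in t)

def collect_for_online_learning(kg_prev, kg_now, context=one_hop_context):
    """Return (deleted, emerging, changed, triples_to_train)."""
    prev_objs, now_objs = objects_of(kg_prev), objects_of(kg_now)
    deleted = prev_objs - now_objs      # embeddings to remove
    emerging = now_objs - prev_objs     # embeddings to initialize
    # Existing objects whose context differs between the two KGs.
    changed = {o for o in prev_objs & now_objs
               if context(kg_prev, o) != context(kg_now, o)}
    touched = emerging | changed
    # Only triples containing a touched object are re-trained.
    triples = {t for t in kg_now if touched.intersection(t)}
    return deleted, emerging, changed, triples
```

With a richer context function, more existing objects may count as changed, which enlarges the set of triples to re-train.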

IV-B Complexity Analysis

In DKGE, online learning and learning from scratch actually follow the same architecture (shown in Figure 2), and the difference is that online learning has much fewer triples to train and fewer parameters to update. We analyse the space complexity and time complexity of DKGE in this subsection.

Space Complexity. Given a KG consisting of entities and relations, we define the size of the adjacency matrix in the AGCN for entities as and that in the AGCN for relations as . Since the contexts (i.e., subgraphs) of entities (or relations) have different numbers of vertices, to capture all adjacency information of the contexts for entities (or relations), (or ) should be at least equal to the maximum number of vertices (or ) among these contexts. However, the KG is dynamic, so the number of vertices in each context may increase, and (or ) should be larger than (or ). In our experiments, the maximum number of vertices among the contexts of more than 95% of the entities and relations in our datasets is , and we apply zero padding to keep the size of the adjacency matrix of each entity (or relation) as (or ). (To limit computational resources, similar to [13], we randomly sample 35 vertices for the remaining 5% of entities and relations to build the contexts.) The maximum value of (or ) is set as in our experiments. In total, we have adjacency matrices for entities and adjacency matrices for relations, which have the space complexity .

Suppose the AGCNs for entities and relations have and hidden layers, respectively (, which are analysed in Section III-A). Since the size of the weight matrix in each hidden layer is set as (also discussed in Section III-A), we have weight matrices in total (each hidden layer corresponds to a weight matrix), requiring space. Besides, the AGCNs for entities and relations each have a -dimensional parameter vector (i.e., in Eq. (2)) in the attention layer, which requires space. In the gate strategy, all entities (or relations) also correspond to one -dimensional parameter vector (i.e., in Eq. (5)), so this part needs . In addition, each entity and each relation has two vector representations, i.e., a knowledge embedding and a contextual element embedding, so we have -dimensional vectors in total to represent entities and relations. In summary, online learning and learning from scratch in DKGE share the same space complexity .

Time Complexity. For learning from scratch and online learning, we analyse the time complexities of updating parameters. In learning from scratch, given a KG with triples and the size of a minibatch , we have minibatches. Suppose each minibatch has entities and relations on average; then updating their knowledge embeddings requires time, where is the dimension of the embedding space. Suppose there are entities and relations on average composing the contexts of all entities and relations in each minibatch; then updating their contextual element embeddings requires time. Besides, we need to update the parameters in the two AGCNs and the gate strategy. In the AGCN for entities, there are weight matrices, where is the number of hidden layers, and a -dimensional parameter vector in the attention layer, so updating them in a minibatch needs time. Similarly, in the AGCN for relations, updating parameters in a minibatch requires time, where is the number of hidden layers. For the gate strategy, updating two -dimensional gate vectors in a minibatch requires . Thus, for learning from scratch, the total time complexity of updating parameters is , where is the number of epochs (one epoch means working through all triples once) when learning from scratch converges.

In online learning, all parameters in the two AGCNs and the gate strategy are unchanged, and we only update the knowledge embeddings and contextual element embeddings of emerging entities and relations, as well as the knowledge embeddings of existing entities and relations with changed contexts. Suppose only triples, each of which contains at least one emerging object (i.e., entity or relation) or existing object with changed context, need to be re-trained, and the size of a minibatch is also , so we have minibatches. In each minibatch, on average, suppose there are existing entities with changed contexts, existing relations with changed contexts, emerging entities, and emerging relations; then updating their knowledge embeddings requires time, where is the dimension of the embedding space. Suppose there are emerging entities and emerging relations on average composing the contexts of all entities and relations in each minibatch; then updating their contextual element embeddings requires time. Hence, for online learning, the total time complexity of updating parameters is , where is the number of epochs when online learning converges.
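Which parameters online learning is allowed to touch can be made explicit with a small sketch. The names `update_plan` and `sgd_step`, the nested dictionary layout, and the plain gradient step are assumptions made for illustration; the AGCN and gate parameters are left out because they stay frozen.

```python
def update_plan(emerging, changed):
    """Map each object to the embeddings that online learning may update."""
    # Emerging objects: both representations are trained.
    plan = {o: ("knowledge", "contextual") for o in emerging}
    # Existing objects with changed contexts: knowledge embedding only.
    plan.update({o: ("knowledge",) for o in changed if o not in emerging})
    return plan

def sgd_step(emb, grads, plan, lr):
    """Apply one gradient step, restricted to the vectors named in plan.

    emb, grads: {obj: {"knowledge": [...], "contextual": [...]}}
    """
    for obj, kinds in plan.items():
        for kind in kinds:
            emb[obj][kind] = [w - lr * g
                              for w, g in zip(emb[obj][kind], grads[obj][kind])]
```

Objects that do not appear in the plan keep both of their embeddings unchanged, which is what keeps the number of updated parameters small.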

For the time complexities of learning from scratch and online learning on the same KG, we can find that , , and online learning does not require the time cost of updating parameters in AGCNs and the gate strategy . In a minibatch of size ( on all datasets in our experiments after tuning hyper-parameters), there is not much difference between and . Since , with the same learning rate, online learning should have a much faster convergence speed than learning from scratch and usually . In our experiments, is at least twice when testing DKGE on different datasets. These factors explain why our online learning is highly efficient.

Remarks. Online learning has far fewer parameters to train than learning from scratch in DKGE, so it has a smaller model capacity and tends to accumulate underfitting errors [6]. We perform an extensive analysis in Section V-B to understand this effect, and we shall investigate it more theoretically in future work.

| Datasets | #Entities (Avg.) | #Edges (Avg.) | #Relations (Avg.) | #Add Triples (Avg.) | #Del Triples (Avg.) | #Train (Avg.) | #Valid | #Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YAGO-3SP | 27,009 | 130,757 | 37 | 950 | 150 | 124,757 | 3,000 | 3,000 |
| IMDB-30SP | 243,148 | 627,096 | 14 | 9,379 | 2,395 | 621,096 | 3,000 | 3,000 |
| IMDB-13-3SP | 3,244,455 | 7,923,773 | 14 | 17,472 | 18,405 | 7,913,773 | 10,000 | - |
| DBpedia-3SP | 66,967 | 106,211 | 968 | 1,005 | 103 | 103,211 | 3,000 | - |

TABLE I: Details of our datasets

V Experimental Results

In this section, we present experiments to show the effectiveness and efficiency (especially of the online learning) of DKGE on the tasks of link prediction and question answering (QA) in a dynamic environment. The main difference between link prediction and QA is that link prediction aims to predict correct triples which do not exist in the KG, whereas QA with KG embedding techniques expects to use existing triples in the KG to answer questions. We also analyze the robustness of repeated online learning, investigate the sensitivity of the hyper-parameters of DKGE, and test the scalability of our online learning on a large-scale dataset. The code of DKGE and the baselines is implemented in Python on the deep learning platform PyTorch. All experiments were executed on an NVIDIA TITAN Xp GPU card (12GB) of a 64GB, 2.10GHz Xeon server. We release the code of DKGE and all datasets at:

https://github.com/lienwc/DKGE/.

V-A Experimental Setup

Datasets. Since there is no publicly available benchmark dataset on link prediction and QA on dynamic KGs, we built four new datasets (two are for link prediction, one is for QA, and one is for scalability testing) from real-world KGs. Each dataset contains multiple snapshots, the differences between which are real changes between different versions of a KG.

(1) YAGO-3SP. YAGO [30] is a large-scale KG constructed from Wikipedia, WordNet, and GeoNames. Different versions of YAGO (http://yago-knowledge.org/) were published at different times. We extracted subsets of YAGO2.5, YAGO3, and YAGO3.1 as the three snapshots of our dataset YAGO-3SP, respectively. YAGO-3SP was designed for link prediction, and we split each snapshot into a training set, a validation set, and a test set. The three snapshots share the same validation set and test set, whose triples are unchanged across these snapshots.

(2) IMDB-30SP. The Internet Movie Database (IMDB) is a KG consisting of the entities of movies, TV series, actors, directors, among others, as well as their relationships. IMDB provides daily dumps (https://datasets.imdbws.com/), and we downloaded them each day from January 22 to February 20 in 2019. We extracted 30 snapshots from such dumps to compose our dataset IMDB-30SP. Similar to YAGO-3SP, IMDB-30SP was also designed for link prediction, and we split each snapshot into a training set, a validation set, and a test set. All snapshots share the same validation set and test set.

(3) IMDB-13-3SP. Unlike IMDB-30SP, each snapshot in IMDB-13-3SP is much larger. We kept all the triples about the movies and TV series released after 2013 in the IMDB datasets from January 22 to 24, 2019. With these triples, we built three snapshots. Since IMDB-13-3SP was only utilized to test the scalability of our online learning, we only split each snapshot into a training set and a validation set.

(4) DBpedia-3SP. DBpedia [26] is a KG built from Wikipedia, different versions of which (https://wiki.dbpedia.org/develop/datasets/) were also published at different times. We extracted subsets from DBpedia3.9 and two subsequent versions as the three snapshots of our dataset DBpedia-3SP, respectively. DBpedia-3SP was used for case studies on QA, so we only split each snapshot into a training set and a validation set.

Table I shows the details of the above datasets. For each dataset, we recorded: 1) the average numbers of entities (#Entities (Avg.)), edges (#Edges (Avg.)), and relations (#Relations (Avg.)) in different snapshots, respectively; 2) the average numbers of added triples (#Add Triples (Avg.)) and deleted triples (#Del Triples (Avg.)) between snapshots, respectively; 3) the average number of triples in the training sets (#Train (Avg.)) of different snapshots, and the number of triples in the validation set (#Valid) and test set (#Test). Compared with IMDB-13-3SP, each snapshot in YAGO-3SP, IMDB-30SP, and DBpedia-3SP is much smaller, but similar in size to widely used benchmark datasets [42, 3, 9, 32, 28, 39] for static KG embedding.
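The "#Add Triples (Avg.)" and "#Del Triples (Avg.)" statistics can be obtained from consecutive snapshots with plain set differences. The sketch below assumes each snapshot is a Python set of (head, relation, tail) tuples and that at least two snapshots are given; the function names are illustrative.

```python
def snapshot_delta(prev, now):
    """Return (added, deleted) triples between two snapshots given as sets."""
    return now - prev, prev - now

def average_changes(snapshots):
    """Average numbers of added and deleted triples over consecutive snapshots."""
    deltas = [snapshot_delta(a, b) for a, b in zip(snapshots, snapshots[1:])]
    n = len(deltas)
    avg_added = sum(len(added) for added, _ in deltas) / n
    avg_deleted = sum(len(deleted) for _, deleted in deltas) / n
    return avg_added, avg_deleted
```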

Baselines. We compared our method DKGE with the following baselines in link prediction on YAGO-3SP and IMDB-30SP. (1) puTransE [33]: the only existing model supporting online KG embedding learning for dynamic KGs. (2) ConvE [9]: in the research of static KG embedding using deep learning, ConvE is the state-of-the-art model. (3) ComplEx [36]: in the research of static KG embedding by matching compositions of head-tail entity pairs with their relations, ComplEx is one of the best models in both effectiveness and efficiency. (4) TransE [3]: the classic static KG embedding model using translation operations on entities and relations. (5) GAKE [12]: similar to DKGE, the static KG embedding model GAKE simultaneously models triples themselves and graph structural contexts in embedding learning.

We used publicly available code (implemented in Python on PyTorch) for ConvE, ComplEx, and TransE from [9], [36], and [15], respectively. Since the code of GAKE (published by the authors) was implemented in C++ and the source code of puTransE has not been released, we re-implemented both in Python on PyTorch. For training, we adopted early stopping based on the Hits@10 (introduced in Section V-B) on the validation set, and also set the maximum number of epochs as .
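Early stopping on a validation metric with a cap on the number of epochs can be sketched as follows. The `patience` parameter and the two callables are assumptions made for illustration; the exact stopping rule used in the experiments is not spelled out here.

```python
def train_with_early_stopping(train_epoch, evaluate, max_epochs, patience):
    """Train until the validation score stops improving.

    train_epoch: callable running one pass over the training triples.
    evaluate: callable returning the validation score (e.g., Hits@10).
    patience: hypothetical number of non-improving epochs tolerated.
    Returns (best_score, best_epoch).
    """
    best, best_epoch, wait = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_epoch()
        score = evaluate()
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best, best_epoch
```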

V-B Link Prediction

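| Snapshot | Model | YAGO-3SP MR | YAGO-3SP MRR | YAGO-3SP Hits@10 | YAGO-3SP Hits@3 | YAGO-3SP Hits@1 | IMDB-30SP MR | IMDB-30SP MRR | IMDB-30SP Hits@10 | IMDB-30SP Hits@3 | IMDB-30SP Hits@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Snapshot 1 | GAKE | 2,984 | 0.150 | 0.237 | 0.155 | 0.098 | 5,798 | 0.116 | 0.213 | 0.119 | 0.081 |
| Snapshot 1 | puTransE | 938 | 0.180 | 0.262 | 0.188 | 0.130 | 3,518 | 0.122 | 0.188 | 0.132 | 0.096 |
| Snapshot 1 | TransE | 666 | 0.348 | 0.508 | 0.385 | 0.263 | 2,443 | 0.330 | 0.499 | 0.368 | 0.242 |
| Snapshot 1 | ComplEx | 1,155 | 0.412 | 0.532 | 0.451 | 0.342 | 5,671 | 0.285 | 0.454 | 0.315 | 0.200 |
| Snapshot 1 | ConvE | 1,614 | 0.450 | 0.525 | 0.473 | 0.402 | 6,713 | 0.271 | 0.412 | 0.317 | 0.208 |
| Snapshot 1 | DKGE-LFS | 643 | 0.460 | 0.545 | 0.479 | 0.411 | 2,390 | 0.381 | 0.569 | 0.431 | 0.283 |
| Snapshot 2 | GAKE | 3,012 | 0.141 | 0.218 | 0.151 | 0.095 | 5,542 | 0.116 | 0.218 | 0.118 | 0.079 |
| Snapshot 2 | puTransE | 897 | 0.186 | 0.259 | 0.195 | 0.133 | 3,506 | 0.119 | 0.182 | 0.134 | 0.092 |
| Snapshot 2 | TransE | 975 | 0.300 | 0.460 | 0.340 | 0.226 | 2,415 | 0.323 | 0.492 | 0.363 | 0.235 |
| Snapshot 2 | ComplEx | 995 | 0.380 | 0.521 | 0.420 | 0.303 | 6,037 | 0.274 | 0.453 | 0.314 | 0.184 |
| Snapshot 2 | ConvE | 1,319 | 0.450 | 0.538 | 0.473 | 0.406 | 7,011 | 0.265 | 0.418 | 0.301 | 0.203 |
| Snapshot 2 | DKGE-LFS | 723 | 0.440 | 0.545 | 0.475 | 0.393 | 2,347 | 0.378 | 0.570 | 0.425 | 0.280 |
| Snapshot 2 | DKGE-OL | 749 | 0.440 | 0.539 | 0.473 | 0.393 | 2,841 | 0.380 | 0.567 | 0.428 | 0.282 |
| Snapshot 3 | GAKE | 2,873 | 0.140 | 0.220 | 0.156 | 0.087 | 5,623 | 0.116 | 0.219 | 0.116 | 0.081 |
| Snapshot 3 | puTransE | 1,082 | 0.173 | 0.247 | 0.180 | 0.130 | 3,522 | 0.123 | 0.187 | 0.134 | 0.095 |
| Snapshot 3 | TransE | 959 | 0.304 | 0.460 | 0.335 | 0.226 | 2,560 | 0.326 | 0.494 | 0.360 | 0.242 |
| Snapshot 3 | ComplEx | 974 | 0.392 | 0.524 | 0.426 | 0.325 | 5,824 | 0.267 | 0.461 | 0.306 | 0.172 |
| Snapshot 3 | ConvE | 1,531 | 0.447 | 0.531 | 0.470 | 0.404 | 7,129 | 0.260 | 0.422 | 0.292 | 0.190 |
| Snapshot 3 | DKGE-LFS | 747 | 0.445 | 0.542 | 0.476 | 0.397 | 2,368 | 0.383 | 0.571 | 0.435 | 0.285 |
| Snapshot 3 | DKGE-OL | 809 | 0.442 | 0.542 | 0.473 | 0.395 | 2,976 | 0.377 | 0.561 | 0.427 | 0.281 |

TABLE II: The comparison results on effectiveness (our methods: DKGE-LFS (learning from scratch) and DKGE-OL (online learning))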

Link prediction [38] in a KG is typically defined as the task of predicting an entity that has a specific relation with another given entity, i.e., predicting the head entity given the relation and tail entity (denoted as ), or predicting the tail entity given the head entity and relation (denoted as ). Rather than requiring one best result, this task usually ranks a set of candidate entities from the KG.

Evaluation Metrics. In the test phase, for each triple in the test set, we replaced the head entity (or tail entity ) with each entity in the snapshot to construct a triple (or ), and ranked all based on the score calculated by the scoring function (e.g., Eq. (6) for DKGE). If a constructed triple occurs in the training set, then the corresponding entity will not participate in the ranking process, as training data cannot be used in testing. Based on such ranking results, we can get the rank of the original correct entity in each test triple, and we followed the same evaluation metrics of effectiveness used in ConvE [9] as follows. (1) Mean Rank (MR): the average rank of all head entities and tail entities in test triples. (2) Mean Reciprocal Rank (MRR): the average multiplicative inverse of the ranks for all head entities and tail entities in test triples. (3) Hits@: the proportion of the ranks not larger than for all head entities and tail entities in test triples. Besides, in order to evaluate the efficiency of DKGE and baselines, we recorded their training time.
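The filtered ranking and the three effectiveness metrics can be sketched in a few lines. This is an illustrative sketch: it assumes a higher score means a more plausible triple (for distance-based scoring functions the comparison would be reversed), and the function names are not taken from the released code.

```python
def filtered_rank(scores, target, known):
    """Rank of the correct entity among the candidates.

    scores: {entity: score} for every candidate replacement.
    target: the original correct entity of the test triple.
    known: candidates whose constructed triple occurs in the training
           set; they are excluded from the ranking.
    """
    target_score = scores[target]
    better = sum(1 for e, s in scores.items()
                 if e != target and e not in known and s > target_score)
    return better + 1

def summarize(ranks, k=10):
    """Mean Rank, Mean Reciprocal Rank, and Hits@k over a list of ranks."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        f"Hits@{k}": sum(1 for r in ranks if r <= k) / n,
    }
```

In a full evaluation, `ranks` would hold two entries per test triple, one for head replacement and one for tail replacement.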

Hyper-Parameters. In link prediction on dynamic datasets, we selected optimal hyper-parameters for DKGE and baselines on the first snapshot of each dataset. Each model directly uses such optimal hyper-parameters on subsequent snapshots. In DKGE, the hyper-parameters include embedding size, initial learning rate, size of minibatch, margin, the number of hidden layers of the AGCN for entities, and the number of hidden layers of the AGCN for relations. Given the ranges of each hyper-parameter, we chose the optimal hyper-parameters via grid search according to the Hits@10 on the validation set (details introduced in Section V-D). For ConvE, ComplEx, TransE, and GAKE, we applied the same strategy to select the optimal hyper-parameters. puTransE is a non-parametric model that does not require hyper-parameter tuning, so we randomly selected one group of hyper-parameters for testing given the ranges of hyper-parameters. The details of hyper-parameter tuning for baselines are also given in Section V-D.

(a) YAGO-3SP
(b) IMDB-30SP
Fig. 7: The comparison results on efficiency

Effectiveness and Efficiency. We tested DKGE and baselines on all snapshots of YAGO-3SP and the first three snapshots of IMDB-30SP. Given the first snapshot of each dataset, DKGE and baselines train embeddings from scratch on all triples in the training set. When faced with subsequent snapshots, dynamic KG embedding models DKGE and puTransE can use online learning to acquire new embeddings, but other static KG embedding baselines can only be re-trained on all triples in the training set. The comparison results on effectiveness and efficiency between DKGE and baselines are shown in Table II and Figure 7, respectively. On the second and third snapshots in two datasets, note that we tested both of the learning from scratch and online learning in DKGE, but for puTransE, we only tested its online learning.

In Table II, we can find that the learning from scratch (DKGE-LFS) and online learning (DKGE-OL) in DKGE outperform baselines on both datasets in most evaluation metrics, which reflects the superiority of our model. Only one static KG embedding model, ConvE, is comparable to DKGE-LFS and DKGE-OL on YAGO-3SP. This is because all static KG embedding baselines except GAKE only model triples but neglect structural contexts, which can bring useful information in embedding learning, while GAKE models structural contexts but neglects relational constraints between entities. Compared with the dynamic KG embedding model puTransE, DKGE-LFS and DKGE-OL have much better performance, as DKGE solves two major problems (introduced in Section I) of puTransE. DKGE-OL and DKGE-LFS have close performance, which also shows the effectiveness of our online learning.

In Figure 7, we can see that DKGE-LFS does not have the best efficiency on the first snapshot of each dataset, but when we used DKGE-OL on the second and third snapshots, the training time is much less. Compared with static KG embedding models, the training time of DKGE-OL on YAGO-3SP and IMDB-30SP is at least and times faster, respectively. Compared with the online learning of puTransE, the training time of DKGE-OL on YAGO-3SP and IMDB-30SP is at least and times faster, respectively. This demonstrates the high efficiency of our model.

Fig. 8: Robustness analysis for repeated online learning

Robustness w.r.t. Repeated Updates. DKGE-OL is the online version of DKGE-LFS. The quality of the learnt embeddings may degrade after online learning is conducted repeatedly. Thus, we performed robustness analysis on IMDB-30SP for DKGE-OL. On the first snapshot, we applied DKGE-LFS with the optimal hyper-parameters. Starting from the second snapshot, we applied DKGE-LFS and DKGE-OL, and recorded their MRR difference, which grows as more snapshots are tested. When the MRR difference is larger than a threshold on the th snapshot, the embeddings generated by DKGE-LFS will be taken as the input of the DKGE-OL used on the th snapshot. As a result (see Figure 8), if we set the threshold as (or , or ), we should perform DKGE-LFS after continuously using DKGE-OL for (or , or ) days (the IMDB dataset is updated once per day). The MRR difference between DKGE-LFS and DKGE-OL does not grow significantly within a short time period, which indicates the good robustness of our online learning.
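The robustness protocol can be written as a short loop. The sketch below mirrors the experiment (both variants are trained on every snapshot so that their MRR can be compared); `train_lfs`, `train_ol`, and `mrr` are hypothetical callables supplied by the caller, not functions of the released code.

```python
def run_with_fallback(snapshots, train_lfs, train_ol, mrr, threshold):
    """Return the snapshot indices at which online learning is reset.

    train_lfs(snapshot) -> embeddings learned from scratch.
    train_ol(prev_embeddings, snapshot) -> embeddings from online learning.
    mrr(embeddings, snapshot) -> validation or test MRR.
    """
    emb = train_lfs(snapshots[0])           # first snapshot: from scratch
    resets = []
    for i in range(1, len(snapshots)):
        ol = train_ol(emb, snapshots[i])
        lfs = train_lfs(snapshots[i])
        if mrr(lfs, snapshots[i]) - mrr(ol, snapshots[i]) > threshold:
            emb = lfs                       # gap too large: restart from LFS
            resets.append(i)
        else:
            emb = ol                        # keep updating online
    return resets
```

In a deployment, learning from scratch would not be run on every snapshot; a fixed reset interval read off from such an experiment would be used instead.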

We argue that the main reason for the degradation of DKGE-OL is that DKGE-OL has far fewer parameters to train than DKGE-LFS, so it has a smaller model capacity and tends to accumulate underfitting errors [6]. We also find that the loss of DKGE-OL is higher on average than that of DKGE-LFS (for triples update on average) on the test set in the above robustness evaluation. To further validate our argument, we aggregated the daily updates of IMDB-30SP once every 3 (or 5, or 10) days, performed DKGE-LFS and DKGE-OL, and recorded their MRR difference. Figure 9 shows that aggregating more KG updates for online learning (i.e., more parameters to train) can lower the MRR difference between DKGE-LFS and DKGE-OL. However, training more parameters in DKGE-OL costs more time, e.g., the time of training DKGE-OL once every 3 days is at least 5 times more than that of training DKGE-OL once per day. Thus, whether to aggregate more KG updates for online learning should be decided according to users’ own needs.
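Aggregating daily updates into one update every k days amounts to merging the per-day added and deleted triple sets. The following sketch is one possible way to do this, assuming each day is given as a pair of disjoint sets and that the merged update is later applied as "remove the deleted triples, then add the added ones"; it is not taken from the released code.

```python
def aggregate_updates(daily, k):
    """Merge daily (added, deleted) triple sets into batches of k days.

    daily: list of (added, deleted) pairs of sets, oldest day first.
    Returns a list of (added, deleted) pairs, one per batch.
    """
    batches = []
    for start in range(0, len(daily), k):
        added, deleted = set(), set()
        for add, rem in daily[start:start + k]:
            # A later deletion cancels an earlier addition, and vice versa.
            added = (added - rem) | add
            deleted = (deleted - add) | rem
        batches.append((added, deleted))
    return batches
```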

Fig. 9: Update aggregation analysis for repeated online learning

V-C Question Answering

| Number | Question | Answer in Snapshot 1 | Answer in Snapshot 2 | Answer in Snapshot 3 |
| --- | --- | --- | --- | --- |
| ① | Which team drafts Kobe Bryant? | New_Orleans_Hornets | Charlotte_Hornets | Charlotte_Hornets |
| ② | Who is the chief of China’s Central Military Commission? | Hu_Jintao | Xi_Jinping | Xi_Jinping |
| ③ | Which team does Dwight Howard play for? | Los_Angeles_Lakers | Houston_Rockets | Houston_Rockets |
| ④ | Who is the coach of Golden State Warriors? | Mark_Jackson_(basketball) | Steve_Kerr | Steve_Kerr |
| ⑤ | Who has the most caps of Portugal national football team? | Luís_Figo | Cristiano_Ronaldo | Cristiano_Ronaldo |
| ⑥ | Who is the top scorer of Argentina national football team? | Gabriel_Batistuta | Gabriel_Batistuta | Lionel_Messi |
| ⑦ | Which team does Luke Walton coach? | Golden_State_Warriors | Golden_State_Warriors | Los_Angeles_Lakers |
| ⑧ | Which team does Byron Scott coach? | Cleveland_Cavaliers | Los_Angeles_Lakers | Los_Angeles_Lakers |
| ⑨ | Which team does Kevin Garnett play for? | Boston_Celtics | Brooklyn_Nets | Brooklyn_Nets |
| ⑩ | Who is the wife of Martin Fowler in EastEnders? | Sonia_Fowler | Sonia_Fowler | Stacey_Slater |

TABLE III: The prepared questions and their answers in DBpedia-3SP

In this subsection, we conducted case studies on QA in a dynamic environment, where DKGE can help find correct answers without writing structured queries in query languages (e.g., SPARQL). We prepared ten questions (see Table III), and the answer to each question exists in each snapshot of DBpedia-3SP. The answers to the same question in different snapshots may differ because knowledge is always changing. Here, each question is a simple question which can be denoted in the form of a triple , and the head entity and relation existing in DBpedia-3SP are implied in the question. Thus, similar to the idea of the state-of-the-art KG embedding based QA system [18], we identified the head entity and relation expressed by each question. This step was done manually, and it can also be solved by automatic strategies, such as the template-based method [43] or the learning-based model [18]. For example, we denoted question ① in Table III as (Kobe_Bryant, draftTeam, ?). For embedding learning, given the first snapshot of DBpedia-3SP, we utilized DKGE-LFS to train KG embedding. The process of choosing hyper-parameters will be introduced in Section V-D. Given the second and third snapshots, we applied DKGE-OL with the selected hyper-parameters to generate new KG embedding. Finally, based on the identified head entity and relation of each question, as well as all parameters in DKGE including their embeddings, we inferred the tail entity as the answer by ranking all entities in DBpedia-3SP using the score calculated by the scoring function (i.e., Eq. (6)) in DKGE.
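Answering a question of the form (head, relation, ?) by ranking all entities can be sketched as below. Note the substitution: the sketch uses a plain TransE-style translation score as a stand-in for the scoring function of DKGE (Eq. (6)), and the function names and data layout are assumptions.

```python
import math

def translation_score(h, r, t):
    """TransE-style plausibility: negative L2 distance between h + r and t.

    Stand-in for the DKGE scoring function, which is not reproduced here.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def answer(head, relation, ent_emb, rel_emb, top_k=10):
    """Rank every entity as a candidate tail and return the top_k ids."""
    h, r = ent_emb[head], rel_emb[relation]
    ranked = sorted(ent_emb,
                    key=lambda e: translation_score(h, r, ent_emb[e]),
                    reverse=True)
    return ranked[:top_k]
```

For question ①, a call would look like `answer("Kobe_Bryant", "draftTeam", ent_emb, rel_emb)`, where `ent_emb` and `rel_emb` are dictionaries of learned vectors for the chosen snapshot.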

Evaluation Metrics. To evaluate the effectiveness of DKGE for QA on the dynamic dataset DBpedia-3SP, we used: (1) Mean Rank (MR): the average rank of all correct answers in each snapshot; (2) Mean Reciprocal Rank (MRR): the average multiplicative inverse of ranks for all correct answers in each snapshot; (3) P@: the average proportion of the correct answers (in each snapshot) in top- ranks. Besides, we recorded the training time of DKGE on each snapshot.

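| Questions (DBpedia-3SP) | Snapshot | MR | MRR | P@1 |
| --- | --- | --- | --- | --- |
| All Questions | Snapshot 1 | 5 | 0.497 | 0.400 |
| All Questions | Snapshot 2 | 5 | 0.487 | 0.400 |
| All Questions | Snapshot 3 | 5 | 0.484 | 0.400 |
| Question ①-⑦ | Snapshot 1 | 4 | 0.648 | 0.571 |
| Question ①-⑦ | Snapshot 2 | 4 | 0.638 | 0.571 |
| Question ①-⑦ | Snapshot 3 | 4 | 0.640 | 0.571 |
| Question ⑧-⑩ | Snapshot 1 | 7 | 0.145 | 0 |
| Question ⑧-⑩ | Snapshot 2 | 8 | 0.126 | 0 |
| Question ⑧-⑩ | Snapshot 3 | 8 | 0.120 | 0 |

TABLE IV: Evaluation results of QA using DKGE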

Results Analysis. DKGE-LFS takes 5,221 seconds on the first snapshot of DBpedia-3SP to train embeddings. For DKGE-OL on the second and third snapshots, it only takes 595 seconds and 672 seconds, respectively. With the embedding results and identified head entities and relations of questions, we constructed a triple for each question, to solve the QA problem. For example, we constructed a triple (Kobe_Bryant, draftTeam, New_Orleans_Hornets) for question ① (in Table III) given the first snapshot. Note that not all constructed triples exist in the KG, i.e., the training set of each snapshot in DBpedia-3SP. We found that all constructed triples of question ①-⑦ exist in the training sets, but the constructed triples of question ⑧-⑩ do not. Table IV shows the evaluation results of QA on DBpedia-3SP using DKGE. For all questions, given different snapshots, the performance on the same evaluation metric is good and close, which reflects that DKGE is effective for QA in a dynamic environment. The QA performance on question ①-⑦ is much better than that of question ⑧-⑩. The reason is that the embeddings of entities and relations for the constructed triples of question ①-⑦ have been optimized to constrain during training, but for the constructed triples of question ⑧-⑩, they do not have such optimizations. From another perspective, users cannot query the triples which do not exist in the KG by writing structured queries in query languages, but QA with KG embedding techniques can provide help by embedding calculations, e.g., all correct answers of question ⑧-⑩ on different snapshots occur in the top-10 ranks.

V-D Parameter Sensitivity

(a) Effects of the embedding size (i.e., dimensionality)
(b) Effects of the initial learning rate
(c) Effects of the size of minibatch
(d) Effects of the margin
Fig. 10: Effects of the embedding size, initial learning rate, size of minibatch, and margin in DKGE for link prediction

This subsection first gives the details of selecting the optimal hyper-parameters (for DKGE and baselines) on the first snapshot of YAGO-3SP and IMDB-30SP via grid search according to the Hits@10 on the validation set. With the optimal hyper-parameters of DKGE, we only varied one hyper-parameter each time to test its effects in link prediction.

For hyper-parameter tuning, we first set the ranges of the shared hyper-parameters of DKGE and baselines as follows: embedding size (i.e., dimensionality): , initial learning rate: , and the size of minibatch: . Then, we set the ranges of specific hyper-parameters belonging to each model based on [9, 36, 12, 3, 33] as follows:

  • DKGE: margin: , the number of hidden layers of the AGCN for entities: , and the number of hidden layers of the AGCN for relations: ;

  • ConvE: embedding dropout: , label smoothing: , feature map dropout: , and projection layer dropout: ;

  • ComplEx: regularization parameter: , and the number of negative samples per positive sample: ;

  • TransE: margin: , and norm: ;

  • GAKE: prestiges of neighbor context: , path context: , and edge context: ;

  • puTransE: margin: , and the number of embedding spaces: .

For YAGO-3SP, the optimal hyper-parameters are as follows: embedding size: , initial learning rate: , size of minibatch: , margin: , the number of hidden layers in the AGCN for entities: , and the number of hidden layers in the AGCN for relations: . For IMDB-30SP, all optimal hyper-parameters are the same as the ones used on YAGO-3SP except the size of minibatch, which is .
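The grid search over the hyper-parameter ranges can be sketched as follows. The `validate` callable (train with a configuration and return Hits@10 on the validation set) and the grid keys shown in the usage note are hypothetical; the actual ranges are not reproduced here.

```python
from itertools import product

def grid_search(grid, validate):
    """Try every combination in grid and keep the best validation score.

    grid: {hyper-parameter name: list of candidate values}.
    validate: callable taking a configuration dict and returning a score
              where higher is better (e.g., Hits@10 on the validation set).
    """
    keys = list(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = validate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

A call might look like `grid_search({"dim": [...], "lr": [...], "batch": [...], "margin": [...]}, validate)`, with the candidate lists filled in from the ranges of each model.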

Figure 10 shows the effects of different hyper-parameters on YAGO-3SP and IMDB-30SP in link prediction. As the embedding size increases, DKGE achieves a better MRR, but the training time also increases. Similarly, a larger margin yields a better MRR, but it does not significantly affect the training time. To ensure effectiveness, the initial learning rate should be neither too large nor too small; the larger the initial learning rate, the shorter the training time of DKGE-LFS. The size of the minibatch does not significantly affect the effectiveness, but the training time increases with it.

Table V shows the effects of the maximum hop of neighbor entities and the number of hidden layers of the AGCN for entities. We can see that if using more distant neighbor entities to build the contexts of entities, the MRR of DKGE in link prediction will not be significantly improved (may even decrease), but this will cost much more training time, especially for online learning. Adding hidden layers will also not help much on the effectiveness. This is why we only consider one-hop neighbor entities in DKGE.

Table VI shows the effects of the maximum length of relation paths and the number of hidden layers of the AGCN for relations. If relation paths with a length greater than two are used to build the contexts of relations, the MRR is not significantly improved, but the training time increases considerably in online learning. Since the vertices in the contexts of relations only have one-hop or two-hop neighbors (introduced in Section III-A), we tested the number of hidden layers in , and one hidden layer always achieves a better MRR.

Based on the above analysis, to ensure both effectiveness and efficiency, we applied the optimal hyper-parameters used on YAGO-3SP when training on DBpedia-3SP for QA and on IMDB-13-3SP for scalability testing.

| (max. hop, hidden layers) | YAGO-3SP Snapshot 1 MRR | YAGO-3SP Snapshot 1 Time (s) | YAGO-3SP Snapshot 2 Time (s) | IMDB-30SP Snapshot 1 MRR | IMDB-30SP Snapshot 1 Time (s) | IMDB-30SP Snapshot 2 Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 1) | 0.460 | 6,861 | 232 | 0.381 | 54,135 | 560 |
| (1, 2) | 0.455 | 7,348 | 312 | 0.370 | 55,231 | 699 |
| (2, 1) | 0.460 | 7,836 | 651 | 0.375 | 57,910 | 1,251 |
| (2, 2) | 0.465 | 8,032 | 704 | 0.380 | 58,523 | 1,642 |
| (2, 3) | 0.453 | 8,288 | 718 | 0.368 | 59,127 | 1,889 |
| (3, 2) | 0.460 | 9,018 | 1,610 | 0.361 | 60,907 | 2,843 |
| (3, 3) | 0.450 | 9,229 | 1,856 | 0.343 | 62,378 | 3,044 |
| (4, 2) | 0.448 | 10,123 | 4,501 | 0.340 | 66,303 | 7,630 |

TABLE V: Effects of the max. hop of neighbor entities and number of hidden layers of the AGCN for entities (snapshot 1: learning from scratch, snapshot 2: online learning)

| (max. path length, hidden layers) | YAGO-3SP Snapshot 1 MRR | YAGO-3SP Snapshot 1 Time (s) | YAGO-3SP Snapshot 2 Time (s) | IMDB-30SP Snapshot 1 MRR | IMDB-30SP Snapshot 1 Time (s) | IMDB-30SP Snapshot 2 Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 1) | 0.412 | 6,784 | 176 | 0.341 | 52,936 | 307 |
| (1, 2) | 0.412 | 7,043 | 196 | 0.332 | 54,770 | 323 |
| (2, 1) | 0.460 | 6,861 | 232 | 0.381 | 54,135 | 560 |
| (2, 2) | 0.453 | 7,007 | 287 | 0.370 | 55,002 | 677 |
| (3, 1) | 0.460 | 6,988 | 840 | 0.385 | 53,895 | 1,601 |
| (3, 2) | 0.442 | 7,285 | 885 | 0.372 | 55,883 | 1,548 |
| (4, 1) | 0.450 | 6,936 | 1,904 | 0.381 | 55,779 | 3,066 |

TABLE VI: Effects of the max. length of relation paths and number of hidden layers of the AGCN for relations (snapshot 1: learning from scratch, snapshot 2: online learning)

V-E Scalability

We tested the scalability of DKGE on the large-scale dataset IMDB-13-3SP. We applied DKGE-LFS to snapshot 1 and DKGE-OL to snapshots 2 and 3. As shown in Figure 11, although DKGE-LFS takes around a week to train KG embedding, DKGE-OL only needs about two hours to finish training, which means that the online learning in DKGE scales well to a real-world large-scale dynamic KG.

Fig. 11: The training time of DKGE on IMDB-13-3SP

VI Related Work

Static KG Embedding. Almost all existing KG embedding models (for a survey, see [38]) represent entities and relations in the KG in low-dimensional vector spaces, and define a scoring function on each triple to measure its plausibility. This scoring function captures correlations among entities and relations. By maximizing the total plausibility of all triples in the KG, we obtain embeddings of entities and relations.

One line of research uses translation operations to model correlations among entities and relations. The most typical work is TransE [3], which treats the relation between two entities as a translation between their embeddings. TransH [39] improves TransE by projecting the embedding of each entity onto a relation-specific hyperplane and performing the same translation operation as TransE on this hyperplane. TransR [28] follows a similar idea to TransH; the only difference is that it replaces relation-specific hyperplanes with relation-specific spaces. TransR also has several extensions, such as TransD [19] and TranSparse [20].
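The translation idea can be illustrated with a minimal sketch. It assumes the L2 norm as the dissimilarity measure and uses hand-picked toy vectors rather than learned embeddings; the function names are hypothetical and do not come from any of the cited implementations.

```python
import math


def l2_norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def transe_score(h, r, t):
    """TransE-style dissimilarity ||h + r - t||_2 (lower = more plausible)."""
    return l2_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def project_to_hyperplane(e, w):
    """Project e onto the hyperplane whose unit normal is w: e - (w . e) w."""
    dot = sum(wi * ei for wi, ei in zip(w, e))
    return [ei - dot * wi for ei, wi in zip(e, w)]


def transh_score(h, d_r, t, w_r):
    """TransH-style dissimilarity: translate by d_r after projecting h and t
    onto the relation-specific hyperplane with normal w_r."""
    n = l2_norm(w_r)
    w = [x / n for x in w_r]  # normalise the hyperplane normal
    return transe_score(project_to_hyperplane(h, w), d_r,
                        project_to_hyperplane(t, w))


# Toy vectors (assumed for illustration, not learned embeddings).
h = [1.0, 0.0]
r = [0.0, 1.0]
t_good = [1.0, 1.0]   # exactly h + r
t_bad = [-1.0, 0.0]

print(transe_score(h, r, t_good), transe_score(h, r, t_bad))
# With normal [1, 0], TransH ignores the first coordinate of h and t.
print(transh_score([1.0, 0.0], [0.0, 1.0], [3.0, 1.0], [1.0, 0.0]))
```

In the last call, the two entities differ along the hyperplane normal, so TransE would penalise the triple while the TransH-style score does not; this is the extra freedom the projection provides.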

Another direction of research is to match compositions of head-tail entity pairs with their relations. The earliest work is RESCAL [31], which represents each triple as a tensor. Each relation is denoted as a matrix modeling pairwise interactions between entity vectors by a bilinear function. DistMult [41] simplifies RESCAL by restricting relation matrices to diagonal matrices to reduce the number of parameters, but its scoring function is symmetric in the head and tail entities, so it cannot handle asymmetric relations. To solve this problem, ComplEx [36] models KG embedding in a complex space and takes the conjugate of the embedding of each tail entity before calculating the bilinear map.
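The effect of the conjugate can be checked numerically. The sketch below assumes the standard diagonal bilinear form for DistMult and its complex-valued counterpart for ComplEx; the vectors are arbitrary toy values and the function names are hypothetical.

```python
def distmult_score(h, r, t):
    """Diagonal bilinear score: sum_i h_i * r_i * t_i (real vectors)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx-style score: Re(sum_i h_i * r_i * conj(t_i)) (complex vectors)."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


# Real-valued toy vectors: swapping head and tail cannot change the score.
h_real, r_real, t_real = [1.0, 2.0], [0.5, -1.0], [3.0, 0.5]
print(distmult_score(h_real, r_real, t_real),
      distmult_score(t_real, r_real, h_real))

# Complex-valued toy vectors: the conjugate on the tail breaks the symmetry
# whenever the relation has a non-zero imaginary part.
h_c = [1 + 2j, 0.5 - 1j]
r_c = [0.3 + 0.7j, -0.2 + 0.4j]
t_c = [2 - 1j, 1 + 1j]
print(complex_score(h_c, r_c, t_c), complex_score(t_c, r_c, h_c))
```

If the relation vector is purely real, the ComplEx-style score becomes symmetric again, so a single model of this form can represent both symmetric and asymmetric relations.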

Recently, neural networks have been employed to produce high-quality KG embedding. R-GCN [32] is a relational graph convolutional network model which applies convolutional operators to the semantic information in local graph structures to generate KG embedding. ConvE [9] is a multi-layer convolutional neural network model, which learns KG embedding through its deep structure and 2D convolutions.

All of the above models only consider triples themselves in embedding learning while neglecting graph structural features, such as neighbor information. To address this issue, GAKE [12] was proposed to embed KGs using co-occurrence probabilities of entities, relations and structural contexts.

Unlike our DKGE, all of the aforementioned models embed only static KGs and cannot support online embedding learning.

Dynamic KG Embedding. The most relevant model to our paper is puTransE [33], which creates multiple parallel embedding spaces from local parts of the given KG, and selects the global highest energy score for link prediction across the embedding spaces. When facing a KG update, puTransE can train new embedding spaces (for triple addition) and delete existing spaces (for triple deletion) for online learning. As discussed in Section I, our method DKGE can solve the two major problems of puTransE to generate high-quality embeddings. iTransA [21] supports the online optimization of entity-specific and relation-specific margins, but for embedding learning, it needs to re-train all triples in the KG.
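The multi-space scheme described for puTransE can be sketched as follows. This is an illustration of the general idea only, not the puTransE implementation: it assumes each space scores a triple by the negative TransE distance (so a higher score means more plausible), it takes already-trained vectors as given, and the names `EmbeddingSpace`, `global_score`, `delete_triple`, and `add_space` are hypothetical.

```python
import math


class EmbeddingSpace:
    """One local embedding space: the triples it covers plus its vectors."""

    def __init__(self, triples, entity_vecs, relation_vecs):
        self.triples = set(triples)
        self.entity_vecs = entity_vecs
        self.relation_vecs = relation_vecs

    def score(self, h, r, t):
        """Negative TransE distance, or None if h, r, or t is unknown here."""
        try:
            hv = self.entity_vecs[h]
            rv = self.relation_vecs[r]
            tv = self.entity_vecs[t]
        except KeyError:
            return None
        return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(hv, rv, tv)))


def global_score(spaces, h, r, t):
    """Highest score across all spaces that can score the triple."""
    scores = [s for s in (sp.score(h, r, t) for sp in spaces) if s is not None]
    return max(scores) if scores else None


def delete_triple(spaces, triple):
    """Triple deletion: drop every space that was built from the triple."""
    return [sp for sp in spaces if triple not in sp.triples]


def add_space(spaces, new_space):
    """Triple addition: append a space trained (elsewhere) on the new triples."""
    return spaces + [new_space]


# Toy spaces with hand-picked vectors (assumed, not learned).
space_a = EmbeddingSpace({("a", "r", "b")},
                         {"a": [0.0, 0.0], "b": [1.0, 0.0]},
                         {"r": [1.0, 0.0]})
space_b = EmbeddingSpace({("a", "r", "c")},
                         {"a": [0.0, 0.0], "b": [0.0, 3.0], "c": [1.0, 0.0]},
                         {"r": [1.0, 0.0]})
spaces = add_space([space_a], space_b)
print(global_score(spaces, "a", "r", "b"))
spaces = delete_triple(spaces, ("a", "r", "b"))
print(global_score(spaces, "a", "r", "b"))
```

Because an update only adds or removes whole spaces, no existing space is re-trained, which is what makes this style of online learning cheap.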

There also exist some models [22, 34, 7] for temporal KG embedding, which aim to incorporate the temporal information of triples into embedding learning in order to better perform link prediction and time prediction. They cannot update KG embedding in an online manner.

Dynamic Graph Embedding. Unlike KG embedding, graph embedding usually learns only vertex embeddings based on structural proximities, without considering relational semantics on edges. Recently, some graph embedding models have focused on dealing with dynamics to acquire high-quality evolving embeddings of vertices. DynamicTriad [44] and DyRep [35] preserve the structural information and evolving patterns of a graph to learn vertex embeddings, which are used for vertex classification, link prediction, etc. at the next time step. However, DynamicTriad can only be applied when the vertices are fixed, and DyRep does not support the online updating of existing vertex embeddings. GraphSAGE [14] is an inductive model that uses neighbor attributes to generate embeddings for previously unseen data, but it cannot update the embeddings of existing vertices when the graph changes. DepthLGP [29] leverages a Laplacian Gaussian process and deep learning to learn vertex embeddings, and it only infers the embeddings of new vertices when facing a graph update. Both DHPE [45] and DANE [27] use matrix decomposition to learn vertex embeddings of a static graph, and matrix perturbation to incrementally update the vertex embeddings as the graph changes. DNE [11] extends skip-gram based graph embedding methods to the dynamic scenario. It decomposes the skip-gram objective function to support learning the embedding of each vertex separately, so it can calculate the embeddings of new vertices. It also measures the influence of graph changes on the original vertices to update their embeddings.

Although DHPE, DANE, and DNE can incrementally compute the embeddings of new vertices and update the embeddings of existing vertices after a graph update, they cannot be applied to dynamic KG embedding, which also requires learning edge (i.e., relation) embeddings and considering the various semantic correlations among vertices and edges.

VII Conclusions

In this paper, we presented a context-aware dynamic knowledge graph (KG) embedding method DKGE, which can not only learn embeddings from scratch, but also support online embedding learning. Compared with state-of-the-art static and dynamic KG embedding models on dynamic datasets, DKGE has comparable effectiveness and much better efficiency in online learning. Experimental results also show the value of DKGE for link prediction and question answering in a dynamic environment, and the good robustness and scalability of the online learning in DKGE.

References

  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, Cited by: §III-A.
  • [2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In SIGMOD, pp. 1247–1250. Cited by: §I.
  • [3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating Embeddings for Modeling Multi-Relational Data. In NIPS, pp. 2787–2795. Cited by: 2nd item, §I, §I, §III-C, §V-A, §V-A, §V-D, §VI.
  • [4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral Networks and Locally Connected Networks on Graphs. In ICLR, Cited by: §III-A.
  • [5] L. Cai and W. Y. Wang (2018) KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In NAACL, Vol. 1, pp. 1470–1480. Cited by: §III-C.
  • [6] R. Caruana, S. Lawrence, and C. L. Giles (2001) Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping. In NIPS, pp. 402–408. Cited by: §IV-B, §V-B.
  • [7] S. S. Dasgupta, S. N. Ray, and P. Talukdar (2018) HyTE: Hyperplane-based Temporally aware Knowledge Graph Embedding. In EMNLP, pp. 2001–2011. Cited by: §I, §VI.
  • [8] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS, pp. 3844–3852. Cited by: §III-A.
  • [9] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2D Knowledge Graph Embeddings. In AAAI, pp. 1811–1818. Cited by: §I, §V-A, §V-A, §V-A, §V-B, §V-D, §VI.
  • [10] X. L. Dong (2018) Challenges and Innovations in Building a Product Knowledge Graph. In SIGKDD, pp. 2869. Cited by: §I.
  • [11] L. Du, Y. Wang, G. Song, Z. Lu, and J. Wang (2018) Dynamic Network Embedding: An Extended Approach for Skip-gram based Network Embedding. In IJCAI, pp. 2086–2092. Cited by: §I, §VI.
  • [12] J. Feng, M. Huang, Y. Yang, et al. (2016) GAKE: Graph Aware Knowledge Embedding. In COLING, pp. 641–651. Cited by: §I, §V-A, §V-D, §VI.
  • [13] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto (2017) Knowledge Transfer for Out-of-Knowledge-Base Entities: A Graph Neural Network Approach. In IJCAI, pp. 1802–1808. Cited by: footnote 1.
  • [14] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive Representation Learning on Large Graphs. In NIPS, pp. 1024–1034. Cited by: §I, §VI.
  • [15] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li (2018) OpenKE: An Open Toolkit for Knowledge Embedding. In EMNLP, pp. 139–144. Cited by: §V-A.
  • [16] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer (2009) DBpedia Live Extraction. In OTM Conferences, pp. 1209–1223. Cited by: §I.
  • [17] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep Convolutional Networks on Graph-Structured Data. arXiv preprint arXiv:1506.05163. Cited by: §III-A.
  • [18] X. Huang, J. Zhang, D. Li, and P. Li (2019) Knowledge Graph Embedding Based Question Answering. In WSDM, pp. 105–113. Cited by: §V-C.
  • [19] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge Graph Embedding via Dynamic Mapping Matrix. In ACL, Vol. 1, pp. 687–696. Cited by: §VI.
  • [20] G. Ji, K. Liu, S. He, and J. Zhao (2016) Knowledge Graph Completion with Adaptive Sparse Transfer Matrix. In AAAI, pp. 985–991. Cited by: §VI.
  • [21] Y. Jia, Y. Wang, X. Jin, H. Lin, and X. Cheng (2018) Knowledge Graph Embedding: A Locally and Temporally Adaptive Translation-Based Approach. ACM Transactions on the Web 12 (2), pp. 8. Cited by: §VI.
  • [22] T. Jiang, T. Liu, T. Ge, L. Sha, B. Chang, S. Li, and Z. Sui (2016) Towards Time-Aware Knowledge Graph Completion. In COLING, pp. 1715–1724. Cited by: §I, §VI.
  • [23] J. Jin, J. Luo, S. Khemmarat, and L. Gao (2017) Querying Web-Scale Knowledge Graphs Through Effective Pruning of Search Space. IEEE Transactions on Parallel and Distributed Systems 28 (8), pp. 2342–2356. Cited by: §III-A.
  • [24] A. Khan, Y. Wu, C. C. Aggarwal, and X. Yan (2013) NeMa: Fast Graph Search with Label Similarity. PVLDB 6 (3), pp. 181–192. Cited by: §III-A.
  • [25] T. N. Kipf and M. Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks. ICLR. Cited by: §III-A, §III-A.
  • [26] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al. (2015) DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: §I, §V-A.
  • [27] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu (2017) Attributed Network Embedding for Learning in a Dynamic Environment. In CIKM, pp. 387–396. Cited by: §I, §VI.
  • [28] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015) Learning Entity and Relation Embeddings for Knowledge Graph Completion. In AAAI, Vol. 15, pp. 2181–2187. Cited by: §V-A, §VI.
  • [29] J. Ma, P. Cui, and W. Zhu (2018) DepthLGP: Learning Embeddings of Out-of-Sample Nodes in Dynamic Networks. In AAAI, Cited by: §I, §VI.
  • [30] F. Mahdisoltani, J. Biega, and F. M. Suchanek (2015) YAGO3: A Knowledge Base from Multilingual Wikipedias. In CIDR, Cited by: §I, §V-A.
  • [31] M. Nickel, V. Tresp, and H. Kriegel (2011) A Three-Way Model for Collective Learning on Multi-Relational Data. In ICML, Vol. 11, pp. 809–816. Cited by: §I, §VI.
  • [32] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling Relational Data with Graph Convolutional Networks. In ESWC, pp. 593–607. Cited by: §V-A, §VI.
  • [33] Y. Tay, A. T. Luu, and S. C. Hui (2017) Non-Parametric Estimation of Multiple Embeddings for Link Prediction on Dynamic Knowledge Graphs. In AAAI, pp. 1243–1249. Cited by: Fig. 1, §I, §V-A, §V-D, §VI.
  • [34] R. Trivedi, H. Dai, Y. Wang, and L. Song (2017) Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs. In ICML, pp. 3462–3471. Cited by: §I, §VI.
  • [35] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha (2019) DyRep: Learning Representations over Dynamic Graphs. In ICLR, Cited by: §I, §VI.
  • [36] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex Embeddings for Simple Link Prediction. In ICML, pp. 2071–2080. Cited by: §I, §V-A, §V-A, §V-D, §VI.
  • [37] P. Wang, S. Li, and R. Pan (2018) Incorporating GAN for Negative Sampling in Knowledge Representation Learning. In AAAI, pp. 2005–2012. Cited by: §III-C.
  • [38] Q. Wang, Z. Mao, B. Wang, and L. Guo (2017) Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2724–2743. Cited by: §I, §V-B, §VI.
  • [39] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge Graph Embedding by Translating on Hyperplanes. In AAAI, Vol. 14, pp. 1112–1119. Cited by: §I, §III-C, §V-A, §VI.
  • [40] J. Xu, X. Qiu, K. Chen, and X. Huang (2017) Knowledge Graph Representation with Jointly Structural and Textual Encoding. In IJCAI, pp. 1318–1324. Cited by: §I, §III-B.
  • [41] B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In ICLR, Cited by: §I, §VI.
  • [42] Y. Zhang, Q. Yao, Y. Shao, and L. Chen (2019) NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding. In ICDE, pp. 614–625. Cited by: §III-C, §V-A.
  • [43] W. Zheng, J. X. Yu, L. Zou, and H. Cheng (2018) Question Answering over Knowledge Graphs: Question Understanding via Template Decomposition. PVLDB 11 (11), pp. 1373–1386. Cited by: §V-C.
  • [44] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang (2018) Dynamic Network Embedding by Modeling Triadic Closure Process. In AAAI, Cited by: §VI.
  • [45] D. Zhu, P. Cui, Z. Zhang, J. Pei, and W. Zhu (2018) High-Order Proximity Preserved Embedding for Dynamic Networks. IEEE Transactions on Knowledge and Data Engineering 30 (11), pp. 2134–2144. Cited by: §I, §VI.