Answering Complex Queries in Knowledge Graphs with Bidirectional Sequence Encoders

04/06/2020 ∙ by Bhushan Kotnis, et al. ∙ NEC Corp.

Representation learning for knowledge graphs (KGs) has focused on the problem of answering simple link prediction queries. In this work we address the more ambitious challenge of predicting the answers of conjunctive queries with multiple missing entities. We propose Bi-Directional Query Embedding (BiQE), a method that embeds conjunctive queries with models based on bi-directional attention mechanisms. In contrast to prior work, bidirectional self-attention can capture interactions among all the elements of a query graph. We introduce a new dataset for predicting the answers of conjunctive queries and conduct experiments showing that BiQE significantly outperforms state-of-the-art baselines.




1 Introduction

Knowledge graphs represent real-world entities along with their types, attributes, and relationships. Most existing work on machine learning for knowledge graphs has focused on simple link prediction problems where the query asks for a single missing entity (or relation type) in a single triple. A major benefit of knowledge graph systems, however, is their support of a wide variety of logical queries. For instance, SPARQL, a typical query language for RDF-based knowledge graphs, supports a variety of query types. However, it can query only for facts that exist in the database; it cannot infer missing knowledge. To address this shortcoming, we are interested in the problem of computing probabilistic answers to conjunctive queries (see for example Figure 1) that can be mapped to subgraph matching problems and which form a subset of SPARQL. Every query can be represented with a graph pattern, which we refer to as the query graph, with some of its entities missing.

Prior work answered simple path queries by composing scoring functions for triples Guu et al. (2015) or addressed queries composed of several paths intersecting at a missing entity Hamilton et al. (2018). The latter approach uses a feed forward neural network to combine the individual path embeddings into a vector representation. This vector is then used to compute the most probable missing entity. The approach only aggregates along the given paths and supports exactly one missing entity (see Fig. 2 (left)). With this work we explicitly model more complex dependencies between the various parts of the queries. Moreover, we want to be able to answer queries that are not restricted to a single missing entity.

To address the challenge of answering novel query types and the shortcomings of existing approaches, we propose BiQE, a Bi-directional Query Encoder, that uses a bidirectional transformer to incorporate the entire query context. Transformer models Vaswani et al. (2017) are primarily sequence models, and there is no obvious way to feed a query graph to a transformer: sequences carry positional information, whereas the various parts of a query graph are permutation invariant. In this paper, we propose a novel positional encoding scheme that allows a transformer to answer conjunctive graph queries.

We make the following contributions:

  • A method for predicting the answers of queries with more than one missing entity.

  • An approach to encoding query DAGs into bidirectional transformer models.

  • A new benchmark dataset for complex query answering in KGs with queries containing multiple missing entities.

  • Empirical results demonstrating that our method outperforms the state of the art.

2 Problem Statement

Figure 1: Using BiQE to represent a conjunctive query. The two anchor nodes (“G” and “F”) each traverse a path to the target node (“?”) via the quantifier node.

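A knowledge graph (KG) consists of a set of entities $\mathcal{E}$, a set of relation types $\mathcal{R}$, and a set of triples of the form $(e, r, e') \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. Let us denote $E_1, \dots, E_m$ as existentially quantified variables, $E^?_1, \dots, E^?_k$ as free variables, and $e_a$ as some arbitrary entity. The free variables represent the entities to be predicted. We specifically focus on conjunctive queries of the form

$$E^?_1, \dots, E^?_k \,.\, \exists E_1, \dots, E_m : c_1 \wedge c_2 \wedge \dots \wedge c_n \qquad (1)$$

where $c_\ell$ is one of the following

  1. $(e_i, r, E^?_j)$ or $(E^?_j, r, e_i)$;

  2. $(e_i, r, E_j)$ or $(E_j, r, e_i)$;

  3. $(E_i, r, E_j)$ or $(E^?_i, r, E^?_j)$; or

  4. $(E_i, r, E^?_j)$ or $(E^?_i, r, E_j)$.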

The query graph for a conjunctive query is the graph consisting of the triples of equation (1), each of which has one of the types (1)-(4). We constrain the set of conjunctive queries to those for which the query graph is a connected directed acyclic graph (DAG). The depth of a vertex in a DAG is the maximal length of a path from a root node to this vertex. We require that any two missing entity nodes have different depths. This is similar to the definition of conjunctive queries in previous work Hamilton et al. (2018), with the exception that we can have more than one free variable in the query and that free variables can be at various positions in the DAG. A typical query graph is illustrated in the top left half of Figure 1.
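The depth definition and the distinct-depth requirement can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes the query graph is given as a list of `(head, relation, tail)` triples, and the function names and node labels are hypothetical.

```python
from functools import lru_cache


def vertex_depths(edges):
    """Map each vertex to the length of the longest path from any root.

    Assumes `edges` describes a DAG; a cycle would cause infinite recursion.
    """
    parents = {}
    vertices = set()
    for head, _relation, tail in edges:
        vertices.update((head, tail))
        parents.setdefault(tail, []).append(head)

    @lru_cache(maxsize=None)
    def depth(vertex):
        # Root nodes have no incoming edge and therefore depth 0.
        if vertex not in parents:
            return 0
        return 1 + max(depth(parent) for parent in parents[vertex])

    return {vertex: depth(vertex) for vertex in vertices}


def targets_have_distinct_depths(edges, targets):
    """Check the constraint that no two missing entities share a depth."""
    depths = vertex_depths(edges)
    target_depths = [depths[target] for target in targets]
    return len(target_depths) == len(set(target_depths))


# Hypothetical query graph: two anchors (G, F), one quantified node (E1),
# and two target nodes (?1, ?2).
example_edges = [
    ("G", "r1", "E1"),
    ("E1", "r2", "?1"),
    ("F", "r3", "?1"),
    ("?1", "r4", "?2"),
]
print(vertex_depths(example_edges))
print(targets_have_distinct_depths(example_edges, ["?1", "?2"]))
```

Note that `?1` is reachable from `F` in one hop and from `G` in two hops; the maximal path length is what counts as its depth.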

3 The Bidirectional Query Encoder

We use a bi-directional transformer encoder (Devlin et al., 2019) to encode conjunctive graph queries. The crucial feature of the transformer is its self-attention module where every token given to the model can simultaneously attend to every other token. It has been noted before that this mechanism acts like a graph neural network that induces a weighted graph on the input tokens which allows the proposed model to induce a latent dependency structure in addition to the observed one.

The query DAG corresponding to a conjunctive query under consideration can have multiple nodes representing free variables (target nodes; limited to one per depth), bound variables (quantifier nodes), and entities (anchor nodes). Since the transformer expects sequential input, the query graph has to be mapped to a sequence representation. We address this challenge by exploiting the partial ordering that the DAG structure imposes on the nodes: we decompose the query DAG into a set of unique path queries, each originating from a root node and ending in a leaf node. A DAG query graph with $m$ root nodes and $n$ leaf nodes is decomposed into $mn$ paths.
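The decomposition into root-to-leaf paths can be illustrated with a short depth-first traversal. This is a sketch under assumptions: the query graph is a list of `(head, relation, tail)` triples, and `root_to_leaf_paths` is a hypothetical helper name rather than anything defined in the paper.

```python
def root_to_leaf_paths(edges):
    """Enumerate every path from a root node to a leaf node of a DAG.

    Each path is returned as an alternating list [node, relation, node, ...].
    """
    children = {}
    has_parent = set()
    vertices = []  # kept as a list to preserve first-seen order
    for head, relation, tail in edges:
        children.setdefault(head, []).append((relation, tail))
        has_parent.add(tail)
        for vertex in (head, tail):
            if vertex not in vertices:
                vertices.append(vertex)

    roots = [vertex for vertex in vertices if vertex not in has_parent]
    paths = []

    def walk(node, prefix):
        if node not in children:  # no outgoing edge: this is a leaf
            paths.append(prefix)
            return
        for relation, successor in children[node]:
            walk(successor, prefix + [relation, successor])

    for root in roots:
        walk(root, [root])
    return paths


# Hypothetical query: anchors G and F both lead to the target node "?".
query_edges = [("G", "r1", "E1"), ("E1", "r2", "?"), ("F", "r3", "?")]
for path in root_to_leaf_paths(query_edges):
    print(path)
```

With two roots and one leaf, the example yields two path queries, one per anchor.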

Since there is an order within each path but no ordering between paths, we use positional encodings to represent the order of the tokens within a path and reset the position to zero at every path boundary. Because the self-attention layers are position invariant and the positional information lies solely in the position embeddings, the positional encodings of the tokens in a branch do not change even if the order of the branches is changed. This allows us to feed a set of path queries to the transformer in any arbitrary order and allows the model to generalize even when the order of path queries is changed. This is depicted in Figure 2.
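A small sketch may make the position reset concrete. It assumes each path has already been turned into a list of tokens; the `[MASK]` placeholder and the function name `encode_paths` are illustrative assumptions, not names taken from the paper.

```python
def encode_paths(paths):
    """Concatenate token paths and assign position ids that restart at 0 per path."""
    tokens, positions = [], []
    for path in paths:
        for offset, token in enumerate(path):
            tokens.append(token)
            positions.append(offset)  # position counts from the start of this path
    return tokens, positions


# Two hypothetical path queries sharing a target; "[MASK]" is an assumed placeholder.
path_a = ["G", "r1", "r2", "[MASK]"]
path_b = ["F", "r3", "[MASK]"]

tokens_ab, positions_ab = encode_paths([path_a, path_b])
tokens_ba, positions_ba = encode_paths([path_b, path_a])

print(list(zip(tokens_ab, positions_ab)))
print(list(zip(tokens_ba, positions_ba)))

# Swapping the two paths changes only the order of the (token, position) pairs,
# not the pairs themselves.
print(sorted(zip(tokens_ab, positions_ab)) == sorted(zip(tokens_ba, positions_ba)))
```

Because every token keeps the same position id regardless of where its path appears in the input, an encoder whose only source of order is the position embedding sees the same set of inputs for any ordering of the paths.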

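We map a single path query to a token sequence by representing free variables (values to be predicted) with [MASK] tokens and dropping existentially quantified variables. For instance, the path query

$$E^?_1, E^?_2 \,.\, \exists E_1 : (e, r_1, E_1), (E_1, r_2, E^?_1), (E^?_1, r_3, E^?_2)$$

is mapped to the sequence

$$e \rightarrow r_1 \rightarrow r_2 \rightarrow \text{[MASK]} \rightarrow r_3 \rightarrow \text{[MASK]}$$

We train the model to predict the entity from the set of all entities at the location of the [MASK] tokens using a categorical cross-entropy loss. Entities and relations are represented as separate tokens in the form of unique identifiers.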
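The mapping from a path query to a token sequence can be sketched as follows. The tagged-tuple input format, the `[MASK]` placeholder string, and the function name are assumptions made for illustration; the paper does not prescribe a data structure.

```python
MASK = "[MASK]"  # assumed placeholder token for a free variable


def path_to_tokens(path):
    """Turn a path of (kind, value) items into the transformer input tokens.

    kind is one of "entity", "relation", "exists" (quantified variable),
    or "free" (entity to be predicted).
    """
    tokens = []
    for kind, value in path:
        if kind in ("entity", "relation"):
            tokens.append(value)   # keep the unique identifier as a token
        elif kind == "free":
            tokens.append(MASK)    # the model predicts an entity here
        elif kind == "exists":
            continue               # quantified variables are dropped
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    return tokens


# Hypothetical three-hop path with one quantified and two free variables.
example_path = [
    ("entity", "e"),
    ("relation", "r1"),
    ("exists", "E1"),
    ("relation", "r2"),
    ("free", "E1?"),
    ("relation", "r3"),
    ("free", "E2?"),
]
print(path_to_tokens(example_path))
```

The positions of the placeholder tokens are where a categorical cross-entropy loss over the entity vocabulary would be applied during training.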

3.1 Baseline Models

We use the Graph Query Embedding (GQE) model (Hamilton et al., 2018) as a baseline model. The GQE model consists of two parts, the projection function and the intersection function. The projection function computes the path query embedding, and the intersection function, using a feed forward neural network, computes the query intersection in embedding space. The projection function is a composable triple scoring function such as TransE Bordes et al. (2013) or DistMult Yang et al. (2014).

Guu et al. (2015) introduce a path compositional KG embedding model for answering path queries. This is done by applying the scoring function recursively until the target entity is reached. Hamilton et al. (2018) extend the path compositional model to DAGs using a neural network for performing intersection. For paths, the GQE model is identical to the path compositional model.
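As an illustration of recursive composition, the sketch below uses a TransE-style projection, where traversing a relation adds its vector to the current query embedding and candidates are scored by negative squared distance. The toy two-dimensional embeddings and all function names are made up for the example; a DistMult-style projection would multiply element-wise instead of adding.

```python
def project(query, relation):
    """TransE-style projection: translate the query vector by the relation vector."""
    return [q + r for q, r in zip(query, relation)]


def embed_path(anchor, relations):
    """Apply the projection once per hop, starting from the anchor embedding."""
    query = list(anchor)
    for relation in relations:
        query = project(query, relation)
    return query


def score(query, candidate):
    """Higher is better: negative squared Euclidean distance."""
    return -sum((q - c) ** 2 for q, c in zip(query, candidate))


def best_entity(anchor, relations, entity_embeddings):
    """Return the entity whose embedding is closest to the composed path embedding."""
    query = embed_path(anchor, relations)
    return max(entity_embeddings, key=lambda name: score(query, entity_embeddings[name]))


# Toy embeddings (illustrative values only).
entities = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}
relations = {"r1": [1.0, 0.0], "r2": [0.0, 1.0]}

print(best_entity(entities["a"], [relations["r1"], relations["r2"]], entities))
```

In this scheme the embedding of a prefix of the path never depends on the hops that follow it, which is the one-directional behaviour contrasted with BiQE later in the paper.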

4 Experiments

Figure 2: Query embedding in GQE (left bottom) vs. BiQE (right half). When computing the intersection of e or t, only the previous query context is considered and not the future. In contrast, for BiQE, every element can attend to every other element of the query. This joint modeling of elements leads to a higher accuracy.

4.1 Datasets

Prior work introduced two datasets for evaluating conjunctive queries Hamilton et al. (2018). Of those we use the Bio dataset, which is the one that is publicly available. It consists of seven types of conjunctive queries: three are path queries (one, two and three hops) and the rest are DAG queries. The dataset, however, only considers conjunctive queries with exactly one missing entity. Guu et al. (2015) introduce a path query dataset, but it also does not provide intermediate entities; furthermore, it is constructed from FB15K, which is known to suffer from problems due to inverse edges Toutanova and Chen (2015). We therefore created a new dataset, based on the popular benchmark dataset Freebase FB15k-237 Toutanova et al. (2015), that includes path and DAG queries with multiple missing entities. Following Guu et al. (2015), who mine paths using random walks, we sample paths by performing one random walk per node with depth chosen uniformly at random. We now describe the process for creating the FB15k-237-CQ dataset. First we mine paths by performing random walks starting from every node in the FB15k-237 knowledge graph. For each node the following is done:

  1. Randomly sample the path depth, i.e. a number between 2 and 6 (inclusive)

  2. Select a neighbor at random and obtain the relation linking the node and neighbor.

  3. Keep doing step (2) until the chosen depth is reached.
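The three steps above can be sketched as a simple random walk. This is an illustrative reading, not the authors' mining script: the adjacency-list format, the function names, and the choice to stop early at a node without outgoing edges are all assumptions.

```python
import random


def mine_path(graph, start, rng, min_depth=2, max_depth=6):
    """Mine one path [node, relation, node, ...] by a random walk from `start`.

    `graph` maps a node to a list of (relation, neighbor) pairs.
    """
    depth = rng.randint(min_depth, max_depth)  # step 1: inclusive bounds
    path = [start]
    node = start
    for _ in range(depth):                     # step 3: repeat until depth reached
        neighbors = graph.get(node, [])
        if not neighbors:
            break  # assumption: stop early at a dead end
        relation, node = rng.choice(neighbors)  # step 2: random neighbor
        path.extend([relation, node])
    return path


def mine_all_paths(graph, seed=0):
    """Perform one random walk per node, as described in the text."""
    rng = random.Random(seed)
    return {node: mine_path(graph, node, rng) for node in graph}


# Tiny hypothetical graph in which every node has an outgoing edge.
toy_graph = {
    "a": [("r1", "b")],
    "b": [("r2", "c"), ("r3", "a")],
    "c": [("r4", "a")],
}
for start, path in mine_all_paths(toy_graph).items():
    print(start, path)
```

Passing a seeded `random.Random` instance keeps the mined paths reproducible across runs.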

We obtain DAGs by intersecting the random walks at intermediate entities only if the intermediate entity appears only once in the path. In other words, we filter out multiple intersections. The intersecting entity is considered to be the target leaf node. We decompose the DAGs and paths from the test split to obtain triples and remove all paths and DAGs from the train dataset that contain any of those triples. We call this dataset FB15k-237-CQ, where CQ stands for Conjunctive Queries.

We created this dataset for evaluating conjunctive queries with multiple missing entities. To this end, for the test and development set we mask (or remove) all entities except the anchor entity from the mined paths and DAGs. The anchor (source) entity is considered to be the starting or first entity of the path, or in case of DAGs multiple anchors for each intersecting branch. For the path case, we consider all other entities as labels, while for the DAG case we impose an additional constraint, namely the entities to be predicted must have different depth counted from the intersecting target.

It is important that the BiQE model learns contextual representations of all entities, especially for paths. Therefore, during training we do not remove all intermediate entities for paths; instead, we mask intermediate entities at random. Additionally, we decompose all paths and DAGs in the training set into triples and add them to the training data. For a fair comparison we also used these training triples for learning the baseline models.

4.2 Training Details

We re-implemented the GQE model for answering path queries with multiple missing entities on the FB15k-237-CQ dataset. It is not easy to implement batched training of the GQE intersection function on arbitrary DAG queries with multiple missing entities; therefore, for the DAG setting we only use the projection function. For the Bio dataset, however, we used the scores reported in the paper as baselines. We tuned the regularizer and the Adam optimizer step size for our implementation of the GQE model on paths, while keeping the embedding dimension fixed. We used random negative sampling with 100 negative samples fed to a categorical cross-entropy loss function. This setting gave better results than the max-margin training in Hamilton et al. (2018).

For BiQE we used the transformer encoder configuration and hyperparameters used in Devlin et al. (2019). For the Bio dataset we train BiQE by learning to predict the missing entity. We use the training, development and test data publicly released in the GQE paper.

4.3 Evaluation Metrics

For the FB15k-237-CQ path dataset, we evaluate entity prediction using two ranking metrics, namely Mean Reciprocal Rank (MRR) and HITS@10. Each missing intermediate entity and the final target is ranked against the entire entity set, filtering out positive entities (Bordes et al., 2013). For the Bio dataset we follow the evaluation of Hamilton et al. (2018) for direct comparison, i.e., Area Under the Curve (AUC) and Average Percentile Rank (APR).
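For readers unfamiliar with filtered ranking, the following sketch shows how MRR and HITS@10 are commonly computed. It assumes model scores are available as a dictionary from entity to score (higher is better); the function names and toy numbers are illustrative and not taken from the paper.

```python
def filtered_rank(scores, target, known_positives):
    """Rank of `target` among all entities, ignoring other known correct answers."""
    target_score = scores[target]
    rank = 1
    for entity, entity_score in scores.items():
        if entity == target or entity in known_positives:
            continue  # filtered setting: other positives do not count against the target
        if entity_score > target_score:
            rank += 1
    return rank


def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and the fraction of ranks that are at most k."""
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    hits = sum(1 for rank in ranks if rank <= k) / len(ranks)
    return mrr, hits


# Toy example: "a" is another known correct answer, so it is filtered out.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(filtered_rank(toy_scores, target="c", known_positives={"a"}))
print(mrr_and_hits([1, 2, 20], k=10))
```

Without the filter, the target `c` in the toy example would be penalised for being ranked below `a`, even though `a` is itself a correct answer.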

4.4 Results

We present the results on the FB15k-237-CQ dataset in Table 1 and on the Bio dataset in Table 2. We compare BiQE to the GQE model with the best performing embedding function. On both datasets, BiQE significantly outperforms the baseline model.

We believe that the self-attention mechanism is the key factor behind the large improvement. As illustrated in Figure 2, the GQE model answers queries by traversing the query graph in embedding space in the direction from the source to the targets. In contrast, the self-attention mechanism allows BiQE to reach missing targets from any arbitrary path or element in the query in the embedding space. The attention matrix and its effect in embedding space illustrate this mechanism. To investigate whether the simultaneous prediction and the information flow between all tokens are the crucial features of BiQE, we conducted two additional experiments.

For queries involving multiple targets, the BiQE model predicts all the missing entities in one shot. However it is also possible to predict iteratively where the prediction of the first missing entity is added to the query. We experiment with starting from predicting missing entities closest to the source and ending at the target (we term this as Iterative prediction). If this setup is worse, then our model benefits from jointly predicting all missing entities. As shown in Table 3, we indeed see a significant drop in performance when employing iterative prediction. This confirms our intuition that it is advantageous to jointly predict the missing entities. Next, we would like to confirm that the model benefits from attending to the future query context. To this end we removed self-attention weights from the path elements occurring to the right (or future) of the current entity that is being predicted Lawrence et al. (2019). This setup is akin to the GQE model where the future parts of the query are unused. Indeed, with these connections removed, we see a further drop in performance, arriving at a score similar to GQE. This confirms that the benefits of BiQE stem from the information flow between all tokens and the joint prediction.
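One way to picture the "no future context" ablation is as a mask over the attention scores of a single path, so that a token receives no attention weight from tokens to its right. The sketch below is an assumption about how such a mask could look, not the authors' implementation; it uses plain Python lists rather than a deep learning library.

```python
import math


def no_future_mask(length):
    """mask[i][j] is True when token i may attend to token j (j not to the right of i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over raw attention scores, with disallowed entries set to 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)  # never zero: each token may always attend to itself
        weights.append([e / total for e in exps])
    return weights


# Three tokens with identical raw scores: the mask alone shapes the weights.
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in masked_attention_weights(raw_scores, no_future_mask(3)):
    print([round(w, 3) for w in row])
```

With the mask removed (all entries `True`), every row would spread its weight over all three tokens, which corresponds to the fully bidirectional setting.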

              GQE-DistMult        BiQE
              MRR      HITS@10    MRR      HITS@10
Path          0.427    0.624      0.789    0.859
DAG           0.485    0.679      0.761    0.842
Micro Avg.    0.438    0.635      0.783    0.856
Table 1: Comparison of BiQE with the best performing version of GQE on the FB15k-237-CQ dataset.

                AUC      APR
GQE-Bilinear    91.0     91.5
BiQE            96.91    96.69
Table 2: Comparing BiQE with GQE on the Bio dataset. GQE results were obtained from the paper.

                               MRR      HITS@10
BiQE                           0.789    0.859
BiQE (Iterative prediction)    0.651    0.724
BiQE (No future context)       0.589    0.665
Table 3: Predicting one entity at a time leads to worse results than predicting them jointly, and not attending to future context further hurts performance. Experiments were conducted on paths with multiple free variables.

5 Related Work

Existing work related to KG reasoning is primarily focused on completing a knowledge graph by inferring missing edges using embedding based models Nickel et al. (2016). Apart from learning to embed edges, other structures such as relational paths Luo et al. (2015); Das et al. (2018) and neighborhoods Schlichtkrull et al. (2018); Bansal et al. (2019) have been used to improve the learned graph embeddings. Similar to these methods, we too learn graph embeddings that learn from paths and DAGs. However, unlike these methods, we aim to answer more complex graph queries.

Our work builds on the path query task introduced in Guu et al. (2015) and the DAG query task in Hamilton et al. (2018). Mai et al. (2019) modify the intersection function in Hamilton et al. (2018) by adding a gating mechanism (we were unable to compare our results with theirs because they used an unreleased, modified Bio dataset). To our knowledge, only two recent papers have used a transformer encoder for KG completion. Petroni et al. (2019) investigate the use of relational knowledge present in pre-trained BERT for link prediction on open domain KGs. Finally, Yao et al. (2019) use a pre-trained BERT for KG completion using text.

6 Conclusion

We introduced a transformer-based bidirectional attention model for answering conjunctive queries in knowledge graphs, particularly DAG and path queries with multiple missing entities. This introduced the challenge of encoding DAG queries, because inputting a DAG to a sequential transformer model is not straightforward. We solved this problem by using a novel positional encoding scheme applied to paths obtained by decomposing the DAG queries. Experimentally we showed that our approach improves upon existing baselines by a large margin. We showed that the increase in accuracy is due to the bi-directional self-attention mechanism capturing interactions among all elements of a query graph. Furthermore, we introduced a new dataset constructed from FB15k-237 for studying the problem of conjunctive query answering with multiple missing entities. We limited this work to path and DAG queries, but we speculate that BiQE can work well on all kinds of graph queries. This is something we plan to explore in the future.