Knowledge graphs represent real-world entities along with their types, attributes, and relationships. Most existing work on machine learning for knowledge graphs has focused on simple link prediction problems, where the query asks for a single missing entity (or relation type) in a single triple. A major benefit of knowledge graph systems, however, is their support for a wide variety of logical queries. For instance, SPARQL, a typical query language for RDF-based knowledge graphs, supports a variety of query types. However, it can only query for facts that exist in the database; it cannot infer missing knowledge. To address this shortcoming, we are interested in the problem of computing probabilistic answers to conjunctive queries (see, for example, Figure 1), which can be mapped to subgraph matching problems and form a subset of SPARQL. Every query can be represented by a graph pattern, which we refer to as the query graph, with some of its entities missing.
Prior work answered simple path queries by composing scoring functions for triples Guu et al. (2015) or addressed queries composed of several paths intersecting at a missing entity Hamilton et al. (2018). The latter approach uses a feed-forward neural network to combine the individual path embeddings into a vector representation, which is then used to compute the most probable missing entity. The approach only aggregates along the given paths and supports exactly one missing entity (see Fig. 2 (left)). With this work we explicitly model more complex dependencies between the various parts of a query. Moreover, we want to be able to answer queries that are not restricted to a single missing entity.
To address the challenge of answering novel query types and the shortcomings of existing approaches, we propose BiQE, a Bi-directional Query Encoder, which uses a bidirectional transformer to incorporate the entire query context. Transformer models Vaswani et al. (2017) are primarily sequence models, and there is no obvious way to feed a query graph to a transformer: sequences carry positional information, while the various parts of a query graph are permutation invariant. In this paper, we propose a novel positional encoding scheme that allows a transformer to answer conjunctive graph queries.
We make the following contributions:
- A method for predicting the answers of queries with more than one missing entity.
- An approach to encoding query DAGs into bidirectional transformer models.
- A new benchmark dataset for complex query answering in KGs with queries containing multiple missing entities.
- Empirical results demonstrating that our method outperforms the state of the art.
2 Problem Statement
A knowledge graph (KG) consists of a set of entities $\mathcal{E}$, a set of relation types $\mathcal{R}$, and a set of triples of the form $(e_1, r, e_2) \in \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. Let us denote $v_1, \dots, v_m$ as existentially quantified variables, $t_1, \dots, t_n$ as free variables, and $e \in \mathcal{E}$ as some arbitrary entity. The free variables represent the entities to be predicted. We specifically focus on conjunctive queries of the form
$$q = (t_1, \dots, t_n).\ \exists\, v_1, \dots, v_m : c_1 \wedge c_2 \wedge \dots \wedge c_k, \qquad (1)$$
where each $c_i$ is a triple of one of the following types:
(1) $(e, r, v_j)$ or (2) $(v_j, r, v_{j'})$; or (3) $(e, r, t_l)$ or (4) $(v_j, r, t_l)$.
The query graph for a conjunctive query is the graph consisting of triples of equation (1) and of the types (1)-(4). We constrain the set of conjunctive queries to those for which the query graph is a connected directed acyclic graph (DAG). The depth of a vertex in a DAG is the maximal length of a path from a root node to this vertex. We require that any two missing entity nodes have different depth. This is similar to the definition of conjunctive queries in previous work Hamilton et al. (2018) with the exception that we can have more than one free variable in the query and that free variables can be at various positions in the DAG. A typical query graph is illustrated in the left top half of Figure 1.
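To make the depth constraint concrete, here is a small sketch (our own illustration, not the paper's code) that computes vertex depths in a query DAG, where depth is the maximal length of a path from a root node:

```python
from collections import defaultdict

def node_depths(edges):
    """Depth of each vertex of a query DAG: the maximal length of a
    path from a root node to that vertex. Illustrative sketch used to
    check that missing entities lie at distinct depths."""
    children = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for h, _, t in edges:
        children[h].append(t)
        indeg[t] += 1
        nodes |= {h, t}
    depth = {n: 0 for n in nodes if indeg[n] == 0}   # roots have depth 0
    queue = list(depth)
    while queue:                 # monotone relaxation; terminates on a DAG
        n = queue.pop()
        for c in children[n]:
            if depth.get(c, -1) < depth[n] + 1:
                depth[c] = depth[n] + 1
                queue.append(c)
    return depth

# e1 --r1--> v --r2--> ?t  and  e2 --r3--> ?t  (hypothetical names)
edges = [("e1", "r1", "v"), ("v", "r2", "?t"), ("e2", "r3", "?t")]
print(sorted(node_depths(edges).items()))
# [('?t', 2), ('e1', 0), ('e2', 0), ('v', 1)]
```

Note that `?t` gets depth 2, the maximum over its two incoming paths, which is what allows two free variables to coexist as long as their depths differ.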
3 The Bidirectional Query Encoder
We use a bidirectional transformer encoder (Devlin et al., 2019) to encode conjunctive graph queries. The crucial feature of the transformer is its self-attention module, in which every token given to the model can simultaneously attend to every other token. It has been noted before that this mechanism acts like a graph neural network that induces a weighted graph on the input tokens, which allows the proposed model to learn a latent dependency structure in addition to the observed one.
The query DAG corresponding to a conjunctive query can have multiple nodes representing free variables (target nodes; limited to one per depth), bound variables (quantifier nodes), and entities (anchor nodes). Since the transformer expects sequential input, the query graph has to be mapped to a sequence representation. We address this challenge by decomposing the query DAG into the set of unique query paths that originate at root nodes and end in leaf nodes; the partial ordering imposed by the DAG structure makes this decomposition well defined. A DAG query graph with $n$ root and $m$ leaf nodes is decomposed into at most $n \times m$ paths.
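The decomposition described above can be sketched as follows (a minimal illustration; the helper names and the dict-based graph encoding are our own, not the authors' code):

```python
from collections import defaultdict

def decompose_dag(edges):
    """Decompose a query DAG, given as (head, relation, tail) triples,
    into the set of root-to-leaf query paths."""
    children = defaultdict(list)
    heads, tails = set(), set()
    for h, r, t in edges:
        children[h].append((r, t))
        heads.add(h)
        tails.add(t)
    roots = heads - tails            # nodes with no incoming edge
    paths = []

    def walk(node, path):
        if not children[node]:       # leaf reached: emit one complete path
            paths.append(path)
            return
        for r, t in children[node]:
            walk(t, path + [r, t])

    for root in sorted(roots):
        walk(root, [root])
    return paths

# A DAG with two roots intersecting at target '?t' (hypothetical names):
query = [("e1", "r1", "v"), ("v", "r2", "?t"), ("e2", "r3", "?t")]
print(decompose_dag(query))
# [['e1', 'r1', 'v', 'r2', '?t'], ['e2', 'r3', '?t']]
```

Here the two roots and one leaf yield two paths, both ending at the shared target node.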
Since there is an order within each path but no ordering between paths, we use positional encodings to represent the order within a path but reset the position to zero at every path boundary. Because the self-attention layers are position invariant and the positional information lies solely in the position embeddings, the positional encoding of the tokens in a branch does not change even if the order of the branches is changed. This allows us to feed the set of path queries to the transformer in arbitrary order and lets the model generalize even when the order of the path queries changes. This is depicted in Figure 2.
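The positional scheme just described amounts to restarting position ids at each path boundary; a minimal sketch (token names are made up for illustration):

```python
def position_ids(paths):
    """Assign position ids that restart at 0 at every path boundary,
    so token positions are invariant to the order of the paths.
    Illustrative sketch of the encoding scheme described in the text."""
    tokens, positions = [], []
    for path in paths:
        for i, tok in enumerate(path):
            tokens.append(tok)
            positions.append(i)      # reset to 0 for each new path
    return tokens, positions

paths = [["e1", "r1", "r2", "[MASK]"], ["e2", "r3", "[MASK]"]]
toks, pos = position_ids(paths)
print(pos)  # [0, 1, 2, 3, 0, 1, 2]
```

Reversing the order of the two paths permutes the token sequence but leaves each token's position id unchanged, which is exactly the permutation invariance the scheme is designed to provide.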
We map a single path query to a token sequence by representing free variables (the values to be predicted) with [MASK] tokens and dropping the existentially quantified variables. For instance, the path query
$$(e, r_1, v_1) \wedge (v_1, r_2, t_1)$$
is mapped to the sequence
$$[e,\ r_1,\ r_2,\ \texttt{[MASK]}].$$
We train the model to predict the correct entity from the set of all entities at the locations of the [MASK] tokens using a categorical cross-entropy loss. Entities and relations are separate tokens in the form of unique identifiers.
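The training objective can be sketched as follows, assuming per-position logits over the entity vocabulary (a toy numpy stand-in for the transformer's output head; shapes and names are our own):

```python
import numpy as np

def masked_cross_entropy(logits, mask_positions, target_ids):
    """Categorical cross-entropy over the full entity vocabulary,
    evaluated only at the [MASK] positions. Illustrative sketch; the
    real model produces the logits with a transformer encoder."""
    losses = []
    for pos, target in zip(mask_positions, target_ids):
        z = logits[pos] - logits[pos].max()          # numerically stable softmax
        log_probs = z - np.log(np.exp(z).sum())
        losses.append(-log_probs[target])            # NLL of the gold entity
    return float(np.mean(losses))

# 5 sequence positions, toy entity vocabulary of size 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
loss = masked_cross_entropy(logits, mask_positions=[1, 3], target_ids=[2, 0])
print(round(loss, 4))
```

Because every [MASK] position contributes a term to the same loss, all missing entities of a query are predicted jointly in one forward pass.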
3.1 Baseline Models
We use the Graph Query Embedding (GQE) model (Hamilton et al., 2018) as a baseline. The GQE model consists of two parts: a projection function and an intersection function. The projection function computes the path query embedding, and the intersection function, a feed-forward neural network, computes the query intersection in embedding space. The projection function is a composable triple scoring function such as TransE Bordes et al. (2013) or DistMult Yang et al. (2014).
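As a rough sketch of the two GQE components, with a TransE-style projection (the one-layer intersection network and all dimensions are our simplifications for illustration, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim))          # toy intersection weights

def project(anchor, relations):
    """TransE-style projection: follow a path in embedding space by
    adding relation embeddings to the anchor embedding."""
    return anchor + sum(relations)

def intersect(branch_embeddings):
    """Feed-forward intersection: combine branch embeddings into one
    query embedding (a one-layer toy stand-in for GQE's network)."""
    stacked = np.stack(branch_embeddings)
    return np.tanh(stacked @ W.T).mean(axis=0)

# two branches of a DAG query meeting at the missing entity
b1 = project(rng.normal(size=dim), [rng.normal(size=dim)] * 2)  # 2-hop branch
b2 = project(rng.normal(size=dim), [rng.normal(size=dim)])      # 1-hop branch
q = intersect([b1, b2])
print(q.shape)  # (8,)
```

The resulting query embedding `q` would then be scored against all candidate entity embeddings, e.g. by dot product, to rank answers.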
Guu et al. (2015) introduce a path-compositional KG embedding model for answering path queries by applying a composable scoring function recursively until the target entity is reached. Hamilton et al. (2018) extend the path-compositional model to DAGs using a neural network for performing intersections. For paths, the GQE model is identical to the path-compositional model.
Prior work introduced two datasets for evaluating conjunctive queries Hamilton et al. (2018). Of those we use the Bio dataset, which is the one that is publicly available. It consists of seven types of conjunctive queries: three are path queries (one, two, and three hops) and the remaining four are DAG queries. The dataset, however, only considers conjunctive queries with exactly one missing entity. Guu et al. (2015) introduce a path query dataset, but they also do not provide intermediate entities; furthermore, their dataset is constructed from FB15K, which is known to suffer from problems due to inverse edges Toutanova and Chen (2015). We therefore created a new dataset, based on the popular benchmark FB15k-237 Toutanova et al. (2015), that includes path and DAG queries with multiple missing entities. Following Guu et al. (2015), who mine paths using random walks, we sample paths by performing one random walk per node in the FB15K-237 knowledge graph, with the depth chosen uniformly at random. For each node the following is done:
1. Randomly sample the path depth, i.e. a number between 2 and 6 (inclusive).
2. Select a neighbor at random and obtain the relation linking the node and the neighbor.
3. Repeat step (2) until the chosen depth is reached.
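The walk-mining steps above can be sketched as follows (our own illustration; the adjacency-dict graph encoding and all names are assumptions):

```python
import random

def mine_path(graph, start, rng):
    """One random walk from 'start' following the three steps above:
    sample a depth in [2, 6], then repeatedly hop to a random neighbor,
    recording the linking relation. 'graph' maps a node to its outgoing
    (relation, neighbor) pairs. Illustrative sketch only."""
    depth = rng.randint(2, 6)                # step 1 (inclusive bounds)
    path, node = [start], start
    for _ in range(depth):
        if not graph.get(node):
            break                            # dead end: stop the walk early
        rel, node = rng.choice(graph[node])  # steps 2-3
        path += [rel, node]
    return path

rng = random.Random(42)
graph = {"a": [("r1", "b")], "b": [("r2", "c")], "c": [("r3", "a")]}
walk = mine_path(graph, "a", rng)
print(walk)
```

Each mined walk alternates entities and relations, so a depth-$d$ walk has $2d+1$ tokens.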
We obtain DAGs by intersecting the random walks at intermediate entities, but only if the intermediate entity appears exactly once in each path; in other words, we filter out multiple intersections. The intersecting entity is considered the target leaf node. To prevent test leakage, we decompose the paths and DAGs of the test split into triples and remove all paths and DAGs from the training set that contain any of those triples. We term this dataset FB15k-237-CQ, where CQ stands for Conjunctive Queries.
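The intersection-and-filtering step might look like the following sketch (entirely illustrative; the real pipeline operates on the mined FB15K-237 walks and our helper names are invented):

```python
def intersect_walks(path_a, path_b):
    """Build a DAG query by intersecting two walks at a shared entity,
    keeping it only if it occurs exactly once in each walk (multiple
    intersections are filtered out). Paths alternate entity, relation,
    entity, ... so entities sit at even indices."""
    ents_a, ents_b = path_a[::2], path_b[::2]
    shared = [e for e in ents_a
              if e in ents_b and ents_a.count(e) == 1 and ents_b.count(e) == 1]
    if len(shared) != 1:
        return None                          # zero or multiple intersections
    target = shared[0]
    # truncate both walks at the intersecting entity, the target leaf
    return (path_a[:path_a.index(target) + 1],
            path_b[:path_b.index(target) + 1], target)

a = ["e1", "r1", "v", "r2", "x", "r4", "y"]
b = ["e2", "r3", "x"]
print(intersect_walks(a, b))
# (['e1', 'r1', 'v', 'r2', 'x'], ['e2', 'r3', 'x'], 'x')
```

The two truncated branches plus the shared target form exactly the kind of DAG query shown in Figure 1.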
We created this dataset for evaluating conjunctive queries with multiple missing entities. To this end, for the test and development sets we mask (or remove) all entities except the anchor entity from the mined paths and DAGs. The anchor (source) entity is the first entity of a path; in the case of DAGs there is one anchor per intersecting branch. For the path case, we treat all other entities as labels, while for the DAG case we impose an additional constraint: the entities to be predicted must have different depths counted from the intersecting target.
It is important that the BiQE model learns contextual representations of all entities, especially for paths. Therefore, during training we do not remove all intermediate entities of a path; instead, we mask intermediate entities at random. Additionally, we decompose all paths and DAGs in the training set into triples and add them to the training data. For a fair comparison, we also used these training triples for learning the baseline models.
4.2 Training Details
We re-implemented the GQE model for answering path queries with multiple missing entities on the FB15K-237-CQ dataset. It is not easy to implement batched training of the GQE intersection function on arbitrary DAG queries with multiple missing entities; therefore, for the DAG setting we only use the projection function. For the Bio dataset, we used the scores reported in the original paper as baselines. We tuned the regularizer and the Adam optimizer step size for our implementation of the GQE model on paths, while keeping the embedding dimension fixed. We used random negative sampling with 100 negative samples fed to a categorical cross-entropy loss function. This setting gave better results than the max-margin training in Hamilton et al. (2018).
4.3 Evaluation Metrics
For the FB15k-237-CQ path dataset, we evaluate entity prediction using two ranking metrics, namely Mean Reciprocal Rank (MRR) and HITS@10. Each missing intermediate entity and the final target is ranked against the entire entity set, filtering out positive entities (Bordes et al., 2013). For the Bio dataset we follow the evaluation of Hamilton et al. (2018) for direct comparison, i.e., Area Under the Curve (AUC) and Average Percentile Rank (APR).
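The filtered ranking protocol can be sketched as follows (an illustrative toy example, not the actual evaluation code):

```python
import numpy as np

def filtered_rank(scores, target, positives):
    """Rank 'target' against all entities under the filtered setting
    (Bordes et al., 2013): other known positive answers are excluded
    from the ranking so they cannot push the target down."""
    s = scores.copy()
    for p in positives:
        if p != target:
            s[p] = -np.inf           # filter out other correct answers
    # rank = 1 + number of entities scored strictly higher than target
    return 1 + int((s > s[target]).sum())

scores = np.array([0.1, 0.9, 0.4, 0.8])   # toy scores over 4 entities
rank = filtered_rank(scores, target=2, positives=[1, 2])
print(rank)  # entity 1 is filtered, only entity 3 beats entity 2 -> 2
mrr, hits10 = 1.0 / rank, float(rank <= 10)
```

MRR averages the reciprocal rank `1/rank` over all predictions, and HITS@10 is the fraction of predictions with `rank <= 10`.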
We present the results on the FB15k-237-CQ dataset in Table 1 and on the Bio dataset in Table 2. We compare BiQE to the GQE model with the best performing embedding function. On both datasets, BiQE significantly outperforms the baseline model.
We believe that the self-attention mechanism is the key factor behind the large improvement. As illustrated in Figure 2, the GQE model answers queries by traversing the query graph in embedding space from the source toward the targets. In contrast, the self-attention mechanism allows BiQE to reach missing targets from any path or element of the query in embedding space. The attention matrix and its effect in embedding space illustrate this mechanism. To investigate whether the simultaneous prediction and the information flow between all tokens are the crucial features of BiQE, we conducted two additional experiments.
For queries involving multiple targets, the BiQE model predicts all the missing entities in one shot. However, it is also possible to predict iteratively, where the prediction of the first missing entity is added back to the query. We experiment with predicting missing entities starting from those closest to the source and ending at the target (we term this Iterative prediction). If this setup performs worse, then our model benefits from jointly predicting all missing entities. As shown in Table 3, we indeed see a significant drop in performance when employing iterative prediction, confirming our intuition that it is advantageous to jointly predict the missing entities. Next, we would like to confirm that the model benefits from attending to the future query context. To this end, we removed the self-attention weights to the path elements occurring to the right (the future) of the entity currently being predicted Lawrence et al. (2019). This setup is akin to the GQE model, where the future parts of the query are unused. Indeed, with these connections removed, we see a further drop in performance, arriving at a score similar to GQE. This confirms that the benefits of BiQE stem from the information flow between all tokens and the joint prediction.
| Model | MRR | HITS@10 |
|---|---|---|
| BiQE (Iterative prediction) | 0.651 | 0.724 |
| BiQE (No future context) | 0.589 | 0.665 |
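The "No future context" ablation can be illustrated by the kind of attention mask it implies (a sketch of the idea only; the actual implementation follows Lawrence et al. (2019)):

```python
import numpy as np

def no_future_mask(seq_len, predict_pos):
    """Attention mask for the 'No future context' ablation: the token
    being predicted may not attend to positions to its right. True
    means attention is allowed."""
    mask = np.ones((seq_len, seq_len), dtype=bool)   # full bidirectional
    mask[predict_pos, predict_pos + 1:] = False      # cut future context
    return mask

m = no_future_mask(seq_len=4, predict_pos=1)
print(m[1])  # [ True  True False False]
```

All other rows keep full bidirectional attention; only the row of the currently predicted entity is restricted, mimicking GQE's source-to-target information flow.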
5 Related Work
Existing work related to KG reasoning is primarily focused on completing a knowledge graph by inferring missing edges using embedding-based models Nickel et al. (2016). Apart from learning to embed edges, other structures such as relational paths Luo et al. (2015); Das et al. (2018) and neighborhoods Schlichtkrull et al. (2018); Bansal et al. (2019) have been used to improve the learned graph embeddings. Similar to these methods, we learn graph embeddings from paths and DAGs. However, unlike these methods, we aim to answer more complex graph queries.
Our work builds on the path query task introduced in Guu et al. (2015) and the DAG query task in Hamilton et al. (2018). Mai et al. (2019) modify the intersection function of Hamilton et al. (2018) by adding a gating mechanism (we were unable to compare our results with theirs because they used an unreleased, modified Bio dataset). To our knowledge, only two recent papers have used a transformer encoder for KG completion: Petroni et al. (2019) investigate the use of relational knowledge present in pre-trained BERT for link prediction on open-domain KGs, and Yao et al. (2019) use a pre-trained BERT for KG completion using text.
We introduced a transformer-based bidirectional attention model for answering conjunctive queries in knowledge graphs, in particular DAG and path queries with multiple missing entities. This posed the challenge of encoding DAG queries, since inputting a DAG to a sequential transformer model is not straightforward. We solved this problem with a novel positional encoding scheme applied to the paths obtained by decomposing the DAG queries. Experimentally, we showed that our approach improves upon existing baselines by a large margin, and that the increase in accuracy is due to the bidirectional self-attention mechanism capturing interactions among all elements of a query graph. Furthermore, we introduced a new dataset constructed from FB15k-237 for studying the problem of conjunctive query answering with multiple missing entities. We limited this work to path and DAG queries, but we speculate that BiQE can work well on all kinds of graph queries. This is something we plan to explore in the future.
- Bansal et al. (2019) Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. 2019. A2N: Attending to neighbors for knowledge graph inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26 (NIPS).
- Das et al. (2018) Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In International Conference on Learning Representations (ICLR).
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (NAACL).
- Guu et al. (2015) Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Hamilton et al. (2018) William L. Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. 2018. Embedding logical queries on knowledge graphs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS).
- Lawrence et al. (2019) Carolin Lawrence, Bhushan Kotnis, and Mathias Niepert. 2019. Attending to future tokens for bidirectional sequence generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Luo et al. (2015) Yuanfei Luo, Quan Wang, Bin Wang, and Li Guo. 2015. Context-dependent knowledge graph embedding. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Mai et al. (2019) Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. 2019. Contextual graph attention for answering logical queries over incomplete knowledge graphs. In Proceedings of the 10th International Conference on Knowledge Capture (K-CAP). ACM.
- Nickel et al. (2016) M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web, pages 593–607. Springer International Publishing.
- Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality.
- Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS).
- Yang et al. (2014) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations (ICLR).
- Yao et al. (2019) Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint, abs/1909.03193.