1. Introduction
Knowledge graphs represent statements in the form of graphs in which nodes represent entities and directed labeled edges indicate different types of relations between these entities (Mai et al., 2018). In the past decade, the Semantic Web community has published and interlinked vast amounts of data on the Web using the machine-readable and machine-reasonable Resource Description Framework (RDF) in order to create smart data (Janowicz et al., 2015). By following open W3C standards or related proprietary technology stacks, several large-scale knowledge graphs have been constructed (e.g., DBpedia, Wikidata, NELL, Google’s Knowledge Graph, and Microsoft’s Satori) to support applications such as information retrieval and question answering (Berant et al., 2013; Liang et al., 2017).
Despite their size, knowledge graphs often suffer from incompleteness, sparsity, and noise, as most KGs are constructed collaboratively and semi-automatically (Xu et al., 2016). Recent work has studied different ways of applying graph learning methods to large-scale knowledge graphs to support completion via so-called knowledge graph embedding techniques such as RESCAL (Nickel et al., 2012), TransE (Bordes et al., 2013), NTN (Socher et al., 2013), DistMult (Yang et al., 2015), TransR (Lin et al., 2015), and HOLE (Nickel et al., 2016b). These approaches aim at embedding KG components, including entities and relations, into continuous vector spaces while preserving the inherent structure of the original KG (Wang et al., 2017). Although these models show promising results in link prediction and entity classification tasks, they all treat each statement (often called a triple) independently, thereby ignoring the correlation between statements. In addition, since such a model needs to rank all entities for a given triple in the link prediction task, its complexity is linear with respect to the total number of entities in the KG, which makes it impractical for more complicated query answering tasks.
Recent work (Hamilton et al., 2018; Wang et al., 2018; Mai et al., 2019) has explored ways to utilize knowledge graph embedding models for answering logical queries over incomplete KGs. The task is to predict the correct answer to a query based on KG embedding models, even if this query cannot be answered directly because of one or multiple missing triples in the original graph. For example, Listing 1 shows an example SPARQL query over DBpedia which asks for the cause of death of a person whose alma mater was UCLA and who was a guest of Escape Clause. Executing this query via the DBpedia SPARQL endpoint (https://dbpedia.org/sparql) yields one answer, dbr:Cardiovascular_disease, and the corresponding person is dbr:Virginia_Christine. However, if the triple (dbr:Virginia_Christine dbo:deathCause dbr:Cardiovascular_disease) is missing, this query becomes an unanswerable one (Mai et al., 2019), as shown in Figure 1. The general idea of query answering via KG embedding is to predict the embedding of the root variable ?Disease by utilizing the embeddings of known entities (e.g., UCLA and EscapeClause) and relations (deathCause, almaMater, and guest) in the query. Ideally, a nearest neighbor search in the entity embedding space using the predicted variable’s embedding yields the approximated answer.
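The final step described above, a nearest neighbor search around the predicted variable embedding, can be sketched as follows. This is only a minimal illustration: the three-dimensional vectors, the entity dbr:Pneumonia, and the use of cosine similarity as the ranking score are assumptions made for the example, not values or choices taken from this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_entities(query_vec, entity_vecs, k=1):
    """Return the k entity names whose embeddings are most similar to query_vec."""
    ranked = sorted(entity_vecs,
                    key=lambda name: cosine(query_vec, entity_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy embeddings (assumed for illustration only).
entity_vecs = {
    "dbr:Cardiovascular_disease": [0.9, 0.1, 0.0],
    "dbr:Pneumonia": [0.1, 0.9, 0.0],
    "dbr:UCLA": [0.0, 0.1, 0.9],
}
predicted_disease = [0.8, 0.2, 0.1]  # stands in for the predicted ?Disease embedding
top = nearest_entities(predicted_disease, entity_vecs, k=2)
```

Note that this exhaustive ranking is linear in the number of entities; an approximate nearest neighbor index would be the usual replacement at scale.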
Hamilton et al. (2018) and Wang et al. (2018) proposed different approaches for predicting variable embeddings. However, an unavoidable step for both is to integrate the predicted embeddings for the same variable (in this query, ?Person) from different paths (the almaMater triple and the guest triple in Fig. 1), each obtained by translating from the corresponding entity node via the respective relation embedding. In Figure 1, these two triples produce two different embeddings for the variable ?Person, which need to be integrated into one single embedding for ?Person. An intuitive integration method is an element-wise mean operation over the two embeddings. This implies that both triples are assumed to have equal prediction abilities for the embedding of ?Person, which is not necessarily true. In fact, the almaMater triple matches 450 triples in DBpedia while the guest triple matches only 5. This indicates that the embedding predicted from the guest triple will be more similar to the real embedding of ?Person because that triple has more discriminative power.
Wang et al. (2018) acknowledged this unequal contribution from different paths and obtained the final embedding as a weighted average of the predicted embeddings, where each weight is proportional to the inverse of the number of triples matched by the corresponding triple pattern. However, this deterministic weighting approach lacks flexibility and will produce suboptimal results. Moreover, they separated the knowledge graph embedding training and query answering steps. As a result, the KG embedding model is not directly optimized on the query answering objective, which further impacts the model’s performance.
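To make the inverse-count weighting concrete, the sketch below applies it to the match counts mentioned above (450 and 5). The two-dimensional vectors are hypothetical placeholders; only the counts come from the text.

```python
def inverse_count_combine(vectors, match_counts):
    """Weighted average with weights proportional to 1 / (number of matched triples)."""
    inverse = [1.0 / count for count in match_counts]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    dim = len(vectors[0])
    combined = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
    return combined, weights

# Hypothetical embeddings predicted for ?Person from the two paths.
from_alma_mater = [1.0, 0.0]  # path matching 450 triples
from_guest = [0.0, 1.0]       # path matching 5 triples
combined, weights = inverse_count_combine([from_alma_mater, from_guest], [450, 5])
# weights are 1/91 and 90/91, so the more selective path dominates the average.
```

The weights depend only on the match counts and never on the embeddings themselves, which is the inflexibility a learned attention mechanism is meant to address.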
In contrast, Hamilton et al. (2018) presented an end-to-end model for KG embedding model training and logical query answering. However, they utilized a simple permutation-invariant neural network (Zaheer et al., 2017) to integrate the predicted embeddings, which treats each embedding equally. Furthermore, in order to train the end-to-end logical query answering model, they sampled logical query-answer pairs from the KG as training datasets while ignoring the original KG structure, which previous research has shown to be important for embedding model training (Kipf and Welling, 2017).
Based on these observations, we hypothesize that a graph attention network similar to the one proposed by Veličković et al. (2018) can handle these unequal contribution cases. However, Veličković et al. (2018) assume that the center node embedding (the variable embedding of ?Person in Fig. 1), known as the query embedding (Vaswani et al., 2017), is known beforehand for attention score computation, whereas it is unknown in this case. This prevents us from using the normal attention method. Therefore, we propose an end-to-end attention-based logical query answering model over knowledge graphs in which the unequal contribution of different paths to an entity embedding is handled by a new attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017; Veličković et al., 2018) where the center variable embedding is no longer a prerequisite. Additionally, the model is jointly trained on both sampled logical query-answer pairs and the original KG structure information. The contributions of our work are as follows:

We propose an end-to-end attention-based logic query answering model over knowledge graphs in which an attention mechanism is used to handle the unequal contribution of neighboring entity embeddings to the center entity embedding. To the best of our knowledge, this is the first attention method applicable to logic query answering.

We show that the proposed model can be trained jointly on the original KG structure and the sampled logical QA pairs.

We introduce two datasets, DB18 and WikiGeo19, which have substantially more relation types (170+) compared to the Bio dataset (Hamilton et al., 2018).
The rest of this paper is structured as follows. We first introduce some basic notions in Section 2 and present our attention-based query answering model in Section 3. In Section 4, we discuss the datasets we used to evaluate our model and present the evaluation results. We conclude our work in Section 5.
2. Basic Concepts
Before introducing our end-to-end attention-based logical query answering model, we outline some basic notions relevant to Conjunctive Graph Query models.
2.1. Conjunctive Graph Queries (CGQ)
In this work, a knowledge graph (KG) is a directed and labeled multi-relational graph $\mathcal{G} = (\mathcal{V}, \mathcal{R})$ where $\mathcal{V}$ is a set of entities (nodes) and $\mathcal{R}$ is the set of relations (predicates, edges); furthermore, let $\mathcal{T}$ be a set of triples. A triple $T_i = (h_i, r_i, t_i)$ or $T_i = (s_i, p_i, o_i)$ in this sense consists of a head entity $h_i$ and a tail entity $t_i$ connected by some relation $r_i$ (predicate). (Note that in many knowledge graphs, a triple can include a datatype property as the relation where the tail is a literal. In line with related work (Wang et al., 2017; Nickel et al., 2016a), we do not consider this kind of triple here.) We will use head (h), relation (r), and tail (t) when discussing embeddings and subject (s), predicate (p), object (o) when discussing Semantic Web knowledge graphs to stay in line with the literature from both fields.
Definition 2.1 (Conjunctive Graph Query (CGQ)).
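A query $q$ that can be written as follows:
$$q = V_? \,.\, \exists V_1, V_2, \dots, V_m : b_1 \wedge b_2 \wedge \dots \wedge b_n,$$
$$\text{where } b_i = r(e_k, V_l),\; V_l \in \{V_?, V_1, \dots, V_m\},\; e_k \in \mathcal{V},\; r \in \mathcal{R},$$
$$\text{or } b_i = r(V_k, V_l),\; V_k, V_l \in \{V_?, V_1, \dots, V_m\},\; k \neq l,\; r \in \mathcal{R}.$$
Here $V_?$ denotes the target variable of the query, which will be replaced with the answer entity, while $V_1, V_2, \dots, V_m$ are existentially quantified bound variables. Each $b_i$ is a basic graph pattern in this CGQ. To ensure $q$ is a valid CGQ, the dependence graph of $q$ must be a directed acyclic graph (DAG) (Hamilton et al., 2018) in which the entities (anchor nodes) $e_k$ in $q$ are the source nodes and the target variable $V_?$ is the unique sink node.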
Figure 1 shows an example CGQ which is equivalent to the SPARQL query in Listing 1, where ?Person is an existentially quantified bound variable and ?Disease is the target variable. Note that a graph pattern $r(V_k, e_l)$, in which the subject is a variable and the object is an entity, can be converted into the form $r^{-1}(e_l, V_k)$ by using the inverse relation $r^{-1}$ of the predicate $r$. For example, in Figure 1, we use $almaMater^{-1}(UCLA, ?Person)$ to represent the graph pattern $almaMater(?Person, UCLA)$. The benefit of this inverse relation conversion is that we can construct a CGQ whose dependence graph is a directed acyclic graph (DAG), as shown in Figure 1.
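The inverse-relation conversion and the DAG condition of Definition 2.1 can be checked mechanically. The sketch below is one possible way to do so, assuming that variables are written with a leading `?` and that inverse relations are marked with a `^-1` suffix; both conventions, and the function names, are illustrative choices rather than part of the definition.

```python
from collections import defaultdict, deque

def normalize(pattern):
    """Rewrite (variable, r, entity) as (entity, r^-1, variable)."""
    s, p, o = pattern
    if s.startswith("?") and not o.startswith("?"):
        return (o, p + "^-1", s)
    return pattern

def is_valid_cgq(patterns, target):
    """Check: dependence graph is a DAG, sources are entities, target is the only sink."""
    edges = [normalize(p) for p in patterns]
    nodes = {n for s, _, o in edges for n in (s, o)}
    in_deg, out_deg, succ = defaultdict(int), defaultdict(int), defaultdict(list)
    for s, _, o in edges:
        out_deg[s] += 1
        in_deg[o] += 1
        succ[s].append(o)
    # Kahn's algorithm: the graph is acyclic iff every node can be removed.
    remaining = {n: in_deg[n] for n in nodes}
    queue = deque(n for n in nodes if remaining[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in succ[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                queue.append(nxt)
    acyclic = visited == len(nodes)
    sources_are_entities = all(not n.startswith("?") for n in nodes if in_deg[n] == 0)
    sinks = [n for n in nodes if out_deg[n] == 0]
    return acyclic and sources_are_entities and sinks == [target]

example = [
    ("?Person", "deathCause", "?Disease"),
    ("?Person", "almaMater", "UCLA"),
    ("EscapeClause", "guest", "?Person"),
]
valid = is_valid_cgq(example, "?Disease")
```

Without the conversion, UCLA would be a sink alongside ?Disease, so the example query would fail the unique-sink condition.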
Comparing Definition 2.1 with SPARQL, we can see several differences:

Predicates in CGQs are assumed to be fixed while predicates in a SPARQL 1.1 basic graph pattern can also be variables (Mai et al., 2019).

CGQs only consider the conjunction of graph patterns while SPARQL 1.1 also contains other operations (UNION, OPTIONAL, FILTER, LIMIT, etc.).

CGQs require one variable as the answer denotation, which is in alignment with most of the literature on question answering over knowledge graphs (Berant et al., 2013; Liang et al., 2017). In contrast, SPARQL 1.1 allows multiple variables as the returning variables. The unique answer variable property makes it easier to evaluate the performance of different deep learning models on CGQs.
2.2. Geometric Operators in Embedding Space
Here we describe two geometric operators in the entity embedding space, the projection operator and the intersection operator, which were first introduced by Hamilton et al. (2018).
Definition 2.2 (Geometric Projection Operator).
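Given an embedding $\mathbf{e}_i \in \mathbb{R}^d$ in the entity embedding space, which can be either an embedding of a real entity $e_i$ or a computed embedding for an existentially quantified bound variable $V_i$ in a conjunctive query $q$, and a relation $r$, the projection operator $\mathcal{P}$ produces a new embedding $\mathbf{e}'_i = \mathcal{P}(\mathbf{e}_i, r)$ where $\mathbf{e}'_i \in \mathbb{R}^d$. The projection operator is defined as follows:
$$\mathcal{P}(\mathbf{e}_i, r) = \mathbf{R}_r \mathbf{e}_i \qquad (1)$$
where $\mathbf{R}_r \in \mathbb{R}^{d \times d}$ is a trainable and relation-specific matrix for relation type $r$. The embedding $\mathbf{e}'_i$ denotes all entities that connect with entity $e_i$ or variable $V_i$ through relation $r$. If $\mathbf{e}_i$ denotes entity $e_i$, then $\mathbf{e}'_i$ denotes the set of entities $e_k$ for which $r(e_i, e_k)$ holds in the KG. If $\mathbf{e}_i$ denotes variable $V_i$, then $\mathbf{e}'_i$ denotes the set of entities $e_k$ for which some entity $e_j$ bound to $V_i$ satisfies $r(e_j, e_k)$.
In short, $\mathbf{e}'_i$ denotes the embedding of the relation-specific neighboring set of entities. Different KG embedding models have different ways to represent the relation $r$. We can also use TransE’s version ($\mathbf{e}_i + \mathbf{r}$) or a diagonal matrix version ($\mathrm{diag}(\mathbf{r})\,\mathbf{e}_i$, where $\mathrm{diag}(\mathbf{r})$ is a diagonal matrix parameterized by the vector $\mathbf{r}$ on its diagonal). The bilinear version shown in Equation 1 has the best performance in logic query answering because it is more flexible in capturing different characteristics of relation $r$ (Hamilton et al., 2018).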
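The three projection variants mentioned above differ only in how the relation is parameterized. The following plain-Python sketch contrasts them on toy values; the function names and the numbers are assumptions for illustration, and a real implementation would use trainable tensors rather than lists.

```python
def project_bilinear(e, R):
    """Bilinear projection: multiply the embedding by a relation-specific d x d matrix."""
    return [sum(r_ij * e_j for r_ij, e_j in zip(row, e)) for row in R]

def project_transe(e, r):
    """TransE-style projection: translate the embedding by a relation vector."""
    return [e_j + r_j for e_j, r_j in zip(e, r)]

def project_diagonal(e, r):
    """Diagonal projection: scale each dimension by the relation vector."""
    return [e_j * r_j for e_j, r_j in zip(e, r)]

e = [1.0, 2.0]
R = [[0.0, 1.0],
     [1.0, 0.0]]          # toy matrix that swaps the two dimensions
r = [0.5, -1.0]

bilinear = project_bilinear(e, R)   # [2.0, 1.0]
transe = project_transe(e, r)       # [1.5, 1.0]
diagonal = project_diagonal(e, r)   # [0.5, -2.0]
```

The bilinear form has d x d parameters per relation and can mix dimensions, whereas the other two use d parameters and act on each dimension independently.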
As for the intersection operator, we first present the original version from Graph Query Embedding (GQE) (Hamilton et al., 2018), which will act as the baseline for our model.
Definition 2.3 (Geometric Intersection Operator).
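Assume we are given a set of $n$ different input embeddings $\mathbf{e}'_1, \mathbf{e}'_2, \dots, \mathbf{e}'_i, \dots, \mathbf{e}'_n$ as the outputs from different geometric projection operations by following $n$ different relation paths. We require all $\mathbf{e}'_i$ to have the same entity type. The geometric intersection operator outputs one embedding $\mathbf{e}''$ based on this set of embeddings, which denotes the intersection of these different relation paths:
$$\mathbf{e}'' = \mathcal{I}_{GQE}(\{\mathbf{e}'_1, \dots, \mathbf{e}'_n\}) = \mathbf{W}_{\gamma_2}\, \Psi\big(\mathrm{ReLU}(\mathbf{W}_{\gamma_1} \mathbf{e}'_i),\ \forall i \in \{1, \dots, n\}\big) \qquad (2)$$
where $\mathbf{W}_{\gamma_1}, \mathbf{W}_{\gamma_2} \in \mathbb{R}^{d \times d}$ are trainable matrices specific to entity type $\gamma$. $\Psi(\cdot)$ is a symmetric vector function (e.g., an element-wise mean or minimum of a set of vectors) which is permutation invariant on the order of its inputs (Zaheer et al., 2017). As $\mathbf{e}'_1, \mathbf{e}'_2, \dots, \mathbf{e}'_n$ represent the embeddings of the neighboring sets of entities, $\mathbf{e}''$ is interpreted as the intersection of these sets.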
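A small sketch of a permutation-invariant intersection of this kind is given below. It assumes a two-matrix form (a type-specific matrix and ReLU applied to each input, an element-wise symmetric pooling, then a second type-specific matrix); this structure and the identity matrices used in the example are assumptions for illustration and should be checked against the GQE reference implementation.

```python
def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def intersect(inputs, W1, W2, pool=min):
    """Apply W1 + ReLU per input, pool element-wise, then apply W2."""
    hidden = [relu(matvec(W1, e)) for e in inputs]
    pooled = [pool(column) for column in zip(*hidden)]
    return matvec(W2, pooled)

def mean(column):
    return sum(column) / len(column)

identity = [[1.0, 0.0], [0.0, 1.0]]
a, b = [1.0, -2.0], [0.5, 3.0]

by_min = intersect([a, b], identity, identity, pool=min)    # [0.5, 0.0]
by_mean = intersect([a, b], identity, identity, pool=mean)  # [0.75, 1.5]
swapped = intersect([b, a], identity, identity, pool=min)   # same as by_min
```

Because the pooling function sees each dimension as an unordered collection, every input contributes with the same weight regardless of which relation path produced it.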
2.3. Entity Embedding Initialization
Generally speaking, any (knowledge) graph embedding model can be used to initialize entity embeddings. In this work, we adopt the simple “bag-of-features” approach. We assume each entity $e_i$ has an entity type $\gamma$, e.g., Place or Agent. The entity embedding lookup is shown below:
$$\mathbf{e}_i = \frac{\mathbf{Z}_\gamma \mathbf{x}_i}{\lVert \mathbf{Z}_\gamma \mathbf{x}_i \rVert_{L2}} \qquad (3)$$
$\mathbf{Z}_\gamma$ is the type-specific embedding matrix for all entities with type $\gamma$, which can be initialized using a normal embedding matrix normalization method. $\mathbf{x}_i$ is a binary feature vector, such as a one-hot vector, which uniquely identifies entity $e_i$ among all entities with the same entity type $\gamma$. $\lVert \cdot \rVert_{L2}$ indicates the $L2$ norm. The reason why we use type-specific embedding matrices rather than one embedding matrix for all entities, as previous work did (Nickel et al., 2012; Bordes et al., 2013; Socher et al., 2013; Yang et al., 2015; Lin et al., 2015; Ji et al., 2015; Nickel et al., 2016b), is that recent node embedding work (Hamilton et al., 2017, 2018) shows that most of the information contained in the node embeddings is type-specific information. Using type-specific entity embedding matrices explicitly handles this information. Note that in many KGs such as DBpedia, one entity may have multiple types. We handle this by computing the common super class of these types (see Sec. 4).
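With a one-hot feature vector, the lookup amounts to selecting one column of the type-specific matrix and normalizing it to unit length. The sketch below illustrates this; the matrix values and the name `Z_place` are hypothetical.

```python
import math

def lookup_embedding(Z, index):
    """Select the column of Z picked out by a one-hot vector, then L2-normalize it."""
    column = [row[index] for row in Z]          # equals Z @ x for one-hot x
    norm = math.sqrt(sum(v * v for v in column))
    return [v / norm for v in column]

# Hypothetical 2 x 2 embedding matrix for entities of type Place (d = 2, two entities).
Z_place = [[3.0, 1.0],
           [4.0, 0.0]]
first_place = lookup_embedding(Z_place, 0)   # approximately [0.6, 0.8]
```

Keeping one matrix per entity type means that an index is only meaningful together with its type.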
3. Method
Next, we discuss the difference between our model and GQE (Hamilton et al., 2018). Our geometric operators (1) use an attention mechanism to account for the fact that different paths have different embedding prediction abilities with respect to the center entity embedding and (2) can be applied to two training phases: training on the original KG and training with sampled logic query-answer pairs.
3.1. Attention-based Geometric Intersection Operator
Since the permutation-invariant function directly operates on the set of projected input embeddings, Equation 2 assumes that each input embedding (relation path) has an equal contribution to the final intersection embedding. This is not necessarily the case in real settings, as we have discussed in Section 1. Graph Attention Network (GAT) (Veličković et al., 2018) has shown that using an attention mechanism on graph-structured data to capture the unequal contribution of the neighboring nodes to the center node yields better results than a simple element-wise mean or minimum approach. Following the attention idea of GAT, we propose an attention-based geometric intersection operator.
Assume we are given the same input as in Definition 2.3, a set of different input embeddings. The geometric intersection operator contains two layers: a multi-head attention layer and a feed forward neural network layer.
3.1.1. The multi-head attention layer
The initial intersection embedding is computed as:
(4)
Then the attention coefficient for each input embedding in each attention head is
(5)
where the equation uses transposition, vector concatenation, and a type-specific trainable attention vector for each attention head. Following the advice on avoiding spurious weights (Veličković et al., 2018), we use LeakyReLU here.
The attention-weighted embedding is computed as the weighted average of the different input embeddings, with the weights learned automatically by the multi-head attention mechanism. Equation 6 involves the sigmoid activation function and the number of attention heads.
(6)
Furthermore, we add a residual connection (He et al., 2016), followed by layer normalization (Ba et al., 2016) (Add & Norm).
(7)
3.1.2. The second layer
The second layer is a standard feed forward neural network layer followed by “Add & Norm”, as shown in Equation 8.
(8)
It uses a trainable entity-type-specific weight matrix and bias vector in the feed forward neural network.
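The two layers described in this section can be sketched in plain Python as follows. Since Equations 4 to 8 are not reproduced here, every formula in the sketch is an assumption that merely follows the prose: the initial embedding is taken as the element-wise minimum of the inputs, each head scores an input with LeakyReLU of an attention vector applied to the concatenation of the initial embedding and that input, scores are normalized with a softmax, the head outputs are averaged and passed through a sigmoid, and both layers end with a residual connection plus layer normalization without learned scale or shift. The function names and the LeakyReLU slope of 0.2 are likewise illustrative.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def attention_weights(e_init, inputs, a):
    """One head: score each input against e_init using attention vector a (length 2d)."""
    scores = [leaky_relu(sum(w * x for w, x in zip(a, e_init + e))) for e in inputs]
    return softmax(scores)

def cga_intersect(inputs, heads, W, b):
    d = len(inputs[0])
    e_init = [min(col) for col in zip(*inputs)]           # assumed form of Eq. 4
    summed = [0.0] * d
    for a in heads:                                        # assumed form of Eq. 5
        for weight, e in zip(attention_weights(e_init, inputs, a), inputs):
            for j in range(d):
                summed[j] += weight * e[j]
    e_attn = [1.0 / (1.0 + math.exp(-s / len(heads))) for s in summed]  # assumed Eq. 6
    h = layer_norm([x + y for x, y in zip(e_attn, e_init)])             # assumed Eq. 7
    ff = [x + y for x, y in zip(matvec(W, h), b)]
    return layer_norm([x + y for x, y in zip(ff, h)])                   # assumed Eq. 8

x, y = [1.0, 0.0, 2.0], [0.5, 1.5, -1.0]
heads = [[0.1] * 6, [0.0, 0.2, -0.1, 0.3, 0.0, 0.1]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]
out = cga_intersect([x, y], heads, W, b)
```

The key point the sketch tries to capture is that the attention scores are computed against an embedding derived from the inputs themselves, so no center node embedding has to be known in advance.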
Figure 2 illustrates the model architecture of our attention-based geometric intersection operator. The light green boxes at the bottom indicate the embeddings that are projected by the geometric projection operators. The output embeddings of the projection operators are the input embeddings of our intersection operator. The initial intersection embedding is computed based on these input embeddings, as shown in Equation 4. Next, the initial intersection embedding and the input embeddings are fed into the multi-head attention layer, followed by the feed forward neural network layer. This two-layer architecture is inspired by the Transformer (Vaswani et al., 2017).
The multi-head attention mechanism shown in Equations 4, 5, and 6 is similar to the one used in Graph Attention Network (GAT) (Veličković et al., 2018). The major difference is the way we compute the initial intersection embedding in Equation 4. In the graph neural network context, the attention function can be interpreted as mapping the center node embedding and a set of neighboring node embeddings to an output embedding. In GAT, the model directly operates on the local graph structure by applying one or multiple convolution operations over the 1-degree neighboring nodes of the center node. In order to compute the attention scores, each neighboring node embedding is compared with the embedding of the center node. Here, the center node embedding is known in advance.
However, in our case, since we want to train our model directly on the logical query-answer pairs (query-answer pair training phase), the final intersection embedding might denote a variable in a conjunctive graph query whose embedding is unknown. For example, in Figure 1, we can obtain two embeddings for the variable ?Person by following two different triple paths. In this case, these two embeddings are the input embeddings for our intersection operator. The center node embedding here is the true embedding of the variable ?Person, which is unknown. Equation 4 is used to compute an initial embedding for the center node, the variable ?Person, in order to compute the attention score for each input embedding.
Note that the two intersection operators in Definition 2.3 and Section 3.1 can also be directly applied to the local knowledge graph structure, as R-GCN (Schlichtkrull et al., 2018) does (original KG training phase). The output embedding can be used as the new embedding for the center entity, which is computed by a convolution operation over its 1-degree neighboring entity-relation pairs. In this KG training phase, although the center node embedding is known in advance, we still use the initial intersection embedding idea in order to make our model applicable to both training phases. Note that the initial intersection embedding computing step (see Equation 4) solves the problem of the previous attention mechanism, where the center node embedding is a prerequisite for attention score computation. This makes our graph attention mechanism applicable to both logic query answering and KG embedding training. As far as we know, it is the first graph attention mechanism applied to both tasks.
3.2. Model Training
The projection operator and the intersection operator constitute our attention-based logical query answering model. Model training has two phases: the original KG training phase and the query-answer pair training phase.
3.2.1. Original KG Training Phase
In the original KG training phase, we train the two geometric operators based on the local KG structure. Given a KG, for every entity, we use the geometric projection and intersection operators to compute a new embedding for that entity given its 1-degree neighborhood, which is a sampled set of neighboring entity-relation pairs of a given size. Here, the intersection operator is either the GQE operator of Definition 2.3 (baseline model) or the attention-based operator of Section 3.1 (proposed model).
(9)
The loss uses the true entity embedding of each entity and the embedding of a negative sample drawn from the negative sample set of that entity. The loss function for this KG training phase is a max-margin loss:
(10) 
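A generic max-margin loss of the kind named above can be sketched as follows. The scoring function (cosine similarity), the margin of 1.0, and the summation over negatives are assumptions for illustration, since Equation 10 is not reproduced here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_margin_loss(predicted, positive, negatives, margin=1.0, score=cosine):
    """Penalize every negative that scores within `margin` of the positive."""
    pos_score = score(predicted, positive)
    return sum(max(0.0, margin - pos_score + score(predicted, neg)) for neg in negatives)

predicted = [1.0, 0.0]   # embedding computed from the sampled neighborhood (toy values)
well_separated = max_margin_loss(predicted, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
badly_ranked = max_margin_loss(predicted, [0.0, 1.0], [[1.0, 0.0]])
```

The loss is zero once the true embedding outscores every negative by at least the margin, so training effort concentrates on the negatives that are still confusable.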
3.2.2. Logical Query-Answer Pair Training Phase
In this training phase, we first sample different conjunctive graph query (logical query)-answer pairs from the original KG by sampling entities at each node in the conjunctive query structure according to the topological order (see Hamilton et al., 2018). Then, for each conjunctive graph query with one or multiple anchor nodes, we compute the embedding of its target variable node based on the two proposed geometric operators (see Algorithm 1 in Hamilton et al. (2018) for a detailed explanation). The loss uses the embedding of the correct answer entity and the embeddings of negative answers drawn from a negative answer set. The loss function for this query-answer pair training phase is:
(12) 
3.2.3. Negative Sampling
We adopt two negative sampling methods: 1) negative sampling, in which the negative set is a fixed-size set of entities that have the same entity type as the correct entity, excluding the correct entity itself; 2) hard negative sampling, in which the negative set is a fixed-size set of entities that satisfy some, but not all, of the required entity-relation pairs.
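The two strategies can be sketched as set operations. In the sketch below, each required entity-relation pair is represented by the set of entities that satisfy it; that representation, the function names, and the toy entity names are assumptions for illustration.

```python
import random

def sample_negatives(answer, same_type_entities, k, rng):
    """Plain negatives: entities of the same type as the answer, excluding the answer."""
    pool = [e for e in same_type_entities if e != answer]
    return rng.sample(pool, min(k, len(pool)))

def sample_hard_negatives(satisfying_sets, k, rng):
    """Hard negatives: entities satisfying at least one required pair but not all."""
    union = set().union(*satisfying_sets)
    intersection = set.intersection(*satisfying_sets)
    pool = sorted(union - intersection)
    return rng.sample(pool, min(k, len(pool)))

rng = random.Random(0)
people = ["p1", "p2", "p3", "p4"]
plain = sample_negatives("p2", people, k=2, rng=rng)

# Hypothetical satisfying sets for two required entity-relation pairs.
alumni = {"p1", "p2", "p3"}
guests = {"p2", "p3", "p4"}
hard = sample_hard_negatives([alumni, guests], k=2, rng=rng)
```

Hard negatives are more demanding because they already match part of the query, so a model cannot reject them on the basis of a single path alone.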
3.2.4. Full Model Training
The loss function for the whole model training is the combination of these two training phases:
(13) 
4. Experiment
We carried out an empirical study following the experiment protocol of Hamilton et al. (2018). To properly test the models’ ability to reason over larger knowledge graphs with many relations, we constructed two datasets from the publicly available DBpedia and Wikidata.
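Table 1. Statistics for the Bio, DB18, and WikiGeo19 datasets.
| | Bio Training | Bio Validation | Bio Testing | DB18 Training | DB18 Validation | DB18 Testing | WikiGeo19 Training | WikiGeo19 Validation | WikiGeo19 Testing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # of Triples | 3,258,473 | 20,114 | 181,028 | 122,243 | 1,358 | 12,224 | 170,409 | 1,893 | 17,041 |
| # of Entities | 162,622 | - | - | 21,953 | - | - | 18,782 | - | - |
| # of Relations | 46 | - | - | 175 | - | - | 192 | - | - |
| # of Sampled 2-edge QA Pairs | 1M | 1k/QT | 10k/QT | 1M | 1k/QT | 10k/QT | 1M | 1k/QT | 10k/QT |
| # of Sampled 3-edge QA Pairs | 1M | 1k/QT | 10k/QT | 1M | 1k/QT | 10k/QT | 1M | 1k/QT | 10k/QT |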
4.1. Datasets
Hamilton et al. (2018) conducted their logic query answering evaluation with a biological interaction dataset and a Reddit video game dataset (https://github.com/williamleif/graphqembed). However, the Reddit dataset is not publicly available. The Bio interaction dataset has an issue in its logic query generation process: Hamilton et al. (2018) sample the training queries from the whole KG rather than the training KG, which makes all the triples in the KG known to the model and makes the tasks simpler than realistic test situations. Therefore, we regenerate the train/valid/test queries from the Bio KG. Furthermore, the Bio interaction dataset has only 46 relation types, which is very simple compared to many widely used knowledge graphs such as DBpedia and Wikidata. Therefore, we construct two more datasets (DB18 and WikiGeo19) with larger graphs and more relations based on DBpedia and Wikidata. The code and both datasets are available at https://github.com/gengchenmai/Attention_GraphQA.
Both datasets are constructed in a manner similar to Hamilton et al. (2018):

First collect a set of seed entities;

Use these seed entities to get their 1-degree and 2-degree object property triples;

Delete the entities whose node degree is less than a threshold, together with their associated triples;

Split the triple set into training, validation, and testing sets and make sure that every entity and relation in the validation and testing datasets also appears in the training dataset. The training/validation/testing split ratio is 90%/1%/9%;

Sample the training queries from the training KG. (We modify the query generation code provided by Hamilton et al. (2018).)
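Steps 3 and 4 above can be sketched as follows. This is a simplified illustration under stated assumptions: the degree filter is applied in a single pass rather than repeatedly, only one held-out set is produced instead of separate validation and testing sets, and the greedy strategy for keeping every held-out entity and relation covered by the training set is one possible choice, not necessarily the one used to build DB18 and WikiGeo19.

```python
import random
from collections import Counter

def filter_by_degree(triples, threshold):
    """Keep triples whose head and tail both have node degree >= threshold."""
    degree = Counter()
    for h, _, t in triples:
        degree[h] += 1
        degree[t] += 1
    return [(h, r, t) for h, r, t in triples
            if degree[h] >= threshold and degree[t] >= threshold]

def split_with_coverage(triples, held_out_fraction, rng):
    """Hold out triples only if their entities and relation stay present in training."""
    ent, rel = Counter(), Counter()
    for h, r, t in triples:
        ent[h] += 1
        ent[t] += 1
        rel[r] += 1
    order = list(triples)
    rng.shuffle(order)
    target = int(len(triples) * held_out_fraction)
    train, held_out = [], []
    for h, r, t in order:
        removable = (ent[h] > 1 and ent[t] > 1 and rel[r] > 1
                     and (h != t or ent[h] > 2))   # a self-loop uses its entity twice
        if len(held_out) < target and removable:
            held_out.append((h, r, t))
            ent[h] -= 1
            ent[t] -= 1
            rel[r] -= 1
        else:
            train.append((h, r, t))
    return train, held_out

# Toy graph: 5 entities, 20 triples, 2 relation types.
triples = [(f"e{i}", f"r{(i + j) % 2}", f"e{j}")
           for i in range(5) for j in range(5) if i != j]
train, held_out = split_with_coverage(filter_by_degree(triples, 3), 0.1, random.Random(0))
```

The coverage check matters for transductive embedding models, which cannot score an entity or relation that was never seen during training.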
For DB18, the seed entities are all geographic entities directly linked to dbr:California via dbo:isPartOf with type (rdf:type) dbo:City. There are 462 seed entities in total. In Step 2, we filter out triples with no dbo:-prefixed properties. The threshold is set to 10. For WikiGeo19, the seed entities are the largest cities in each state of the United States (https://www.infoplease.com/us/states/statecapitalsandlargestcities). The threshold is 20, which is a relatively small value compared to 100 for the widely used FB15K and WN18 datasets. Statistics for these 3 datasets are shown in Table 1. Given that the widely used KG completion datasets FB15K and WN18 have 15K and 41K triples, DB18 and WikiGeo19 are rather large in size (120K and 170K triples). Note that for each triple in the training/validation/testing dataset, we also add its inverse relation to the corresponding dataset, and the geometric projection operator will learn two separate projection matrices for each relation. The training triples constitute the training KG. Note that both GQE and CGA require knowing a unique type for each entity. However, entities in DBpedia and Wikidata have multiple types (rdf:type). For DB18, we utilize the level-1 classes in the DBpedia ontology and classify each entity into these level-1 classes based on the rdfs:subClassOf relationships. For WikiGeo19, we simply annotate each entity with the class Entity.
4.2. Training Details
As discussed in Section 3.2, we train our CGA model in two training phases. In the original KG training phase, we adopt a mini-batch training strategy. In order to speed up the model training process, we sample the neighborhood for each entity with different neighborhood sample sizes in the training KG beforehand. We split these sampled node-neighborhood pairs by their neighborhood sample size in order to do mini-batch training.
As for the logical query-answer pair training phase, we adopt the same query-answer pair sampling strategy as Hamilton et al. (2018). We consider 7 different conjunctive graph query structures shown in Figure 2(c). For the 4 query structures with an intersection pattern, we apply hard negative sampling (see Section 3.2.3) and treat them as 4 separate query types. In total, we have 11 query types. All training (validation/testing) triples are utilized as 1-edge conjunctive graph queries for model training (evaluation). As for 2-edge and 3-edge queries, the numbers of sampled queries for training/validation/testing are shown in Table 1. Note that all training queries are sampled from the training KG. All validation and testing queries are sampled from the whole KG, and we make sure these queries cannot be directly answered based on the training KG (unanswerable queries (Mai et al., 2019)). To ensure these queries are truly unanswerable, the matched triple patterns of these queries should contain at least one triple in the testing/validation triple set.
4.3. Baselines
We use 6 different models as baselines: two models with the bilinear projection operator and the element-wise mean or min as the simple intersection operator: Billinear[mean_simple], Billinear[min_simple]; two models with the TransE-based projection operator and the GQE version of the geometric intersection operator: TransE[mean], TransE[min]; and two GQE models (Hamilton et al., 2018): GQE[mean], GQE[min]. Since the symmetric function $\Psi$ can be an element-wise mean or min, we differentiate them using [mean] and [min]. Note that all of these 6 baseline models only use the logical query-answer pair training phase (see Section 3.2.2) to train the model. For models with the bilinear projection operator, based on multiple experiments, we find that the model with element-wise min consistently outperforms the model with element-wise mean. Hence, for our model, we use element-wise min for $\Psi$.
4.4. Results
We first test the effect of the original KG training phase on model performance without the attention mechanism; this model is called GQE+KG[min] here. Then we test models with different numbers of attention heads and the added original KG training phase, which are indicated as CGA+KG+x[min], where x represents the number of attention heads (1, 4, or 8).
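Table 2. Evaluation results on the test queries.
| Model | Bio AUC All | Bio AUC HNeg | Bio APR All | Bio APR HNeg | DB18 AUC All | DB18 AUC HNeg | DB18 APR All | DB18 APR HNeg | WikiGeo19 AUC All | WikiGeo19 AUC HNeg | WikiGeo19 APR All | WikiGeo19 APR HNeg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Billinear[mean_simple] | 81.65 | 67.26 | 82.39 | 70.07 | 82.85 | 64.44 | 85.57 | 71.72 | 81.82 | 60.64 | 82.35 | 64.22 |
| Billinear[min_simple] | 82.52 | 69.06 | 83.65 | 72.7 | 82.96 | 64.66 | 86.22 | 73.19 | 82.08 | 61.25 | 82.84 | 64.99 |
| TransE[mean] | 80.64 | 73.75 | 81.37 | 76.09 | 82.76 | 65.74 | 85.45 | 72.11 | 80.56 | 65.21 | 81.98 | 68.12 |
| TransE[min] | 80.26 | 72.71 | 80.97 | 75.03 | 81.77 | 63.95 | 84.42 | 70.06 | 80.22 | 64.57 | 81.51 | 67.14 |
| GQE[mean] | 83.4 | 71.76 | 83.82 | 73.41 | 83.38 | 65.82 | 85.63 | 71.77 | 83.1 | 63.51 | 83.81 | 66.98 |
| GQE[min] | 83.12 | 70.88 | 83.59 | 73.38 | 83.47 | 66.25 | 86.09 | 73.19 | 83.26 | 63.8 | 84.3 | 67.95 |
| GQE+KG[min] | 83.69 | 72.23 | 84.07 | 74.3 | 84.23 | 68.06 | 86.32 | 73.49 | 83.66 | 64.48 | 84.73 | 68.51 |
| CGA+KG+1[min] | 84.57 | 74.87 | 85.18 | 77.11 | 84.31 | 67.72 | 87.06 | 74.94 | 83.91 | 64.83 | 85.03 | 69 |
| CGA+KG+4[min] | 85.13 | 76.12 | 85.46 | 77.8 | 84.46 | 67.88 | 87.05 | 74.66 | 83.96 | 64.96 | 85.36 | 69.64 |
| CGA+KG+8[min] | 85.04 | 76.05 | 85.5 | 77.76 | 84.67 | 68.56 | 87.29 | 75.23 | 84.15 | 65.23 | 85.69 | 70.28 |
| Relative over GQE (%) | 2.31 | 7.29 | 2.28 | 5.97 | 1.44 | 3.49 | 1.39 | 2.79 | 1.07 | 2.24 | 1.65 | 3.43 |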
Table 2 shows the evaluation results of the baseline models as well as different variations of our models on the test queries. We use the ROC AUC score and average percentile rank (APR) as two evaluation metrics. All evaluation results are macro-averaged across queries with different DAG structures (Figure 2(c)).
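For readers unfamiliar with the two metrics, the sketch below shows generic versions of them together with the macro-average over query types. The exact evaluation protocol follows Hamilton et al. (2018) and may differ in details such as tie handling; the scores and query type names in the example are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count as half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def percentile_rank(pos_score, neg_scores):
    """Percentage of negatives that the true answer outscores."""
    below = sum(1 for n in neg_scores if n < pos_score)
    return 100.0 * below / len(neg_scores)

def macro_average(per_type_scores):
    """Unweighted mean over query types, regardless of how many queries each has."""
    return sum(per_type_scores.values()) / len(per_type_scores)

auc = roc_auc([0.9, 0.4], [0.5, 0.1])                   # 0.75
rank = percentile_rank(0.9, [0.5, 0.1, 0.95, 0.2])      # 75.0
overall = macro_average({"1-chain": 80.0, "3-inter": 90.0})  # 85.0
```

Because the macro-average weights every query type equally, a large gain on a few intersection-heavy types is diluted by types on which all models perform similarly.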
All 3 variations of CGA consistently outperform baseline models with fair margins which indicates the effectiveness of contextual attention. The advantage is more obvious in query types with hard negative queries.

Comparing GQE+KG[min] with other baseline models we can see that adding the original KG training phase in the model training process improves the model performance. This shows that the structure information of the original KG is very critical for knowledge graph embedding model training even if the task is not link prediction.

Adding the attention mechanism further improves the model performance. This indicates the importance of considering the unequal contribution of the neighboring nodes to the center node embedding prediction.

Multi-head attention models outperform single-head models, which is consistent with the results from GAT (Veličković et al., 2018).

Theoretically, has learnable parameters while has parameters where is the total number of entity types in a KG. Since usually , our model has fewer parameters than GQE while achieves better performance.

CGA shows strong advantages over baseline models, especially on query types with hard negative sampling (e.g., a 7.3% relative AUC improvement over GQE on the Bio dataset). (Note that since we regenerate queries for the Bio dataset, the GQE performance is lower than the performance reported in Hamilton et al. (2018), which is understandable.)
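The relative improvements quoted above can be reproduced from Table 2. The small check below assumes that the "Relative over GQE" row compares CGA+KG+8[min] with GQE[min]; this pairing is inferred from the numbers rather than stated explicitly.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Bio dataset, AUC columns of Table 2: CGA+KG+8[min] versus GQE[min].
bio_auc_all = relative_improvement(85.04, 83.12)
bio_auc_hneg = relative_improvement(76.05, 70.88)
# DB18 dataset, AUC over all query types.
db18_auc_all = relative_improvement(84.67, 83.47)
```

Rounded to two decimals these give 2.31, 7.29, and 1.44, matching the corresponding entries of the table and the 7.3% figure quoted for hard negatives on Bio.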
All models shown in Table 2
are implemented in PyTorch based on the official code
^{9}^{9}9https://github.com/williamleif/graphqembed of Hamilton et al. (Hamilton et al., 2018). The hyperparameters for the baseline models GQE are tuned using grid search and the best ones are selected. Then we follow the practice of Hamilton et al. (Hamilton et al., 2018) and used the same hyperparameter settings for our CGA models: 128 for embedding dimension , 0.001 for learning rate, 512 for batch size. We use Adam optimizer for model optimization.The overall delta of CGA over GQE reported in Tab. 2 is similar in magnitude to the delta over baseline reported in Hamilton et al. (Hamilton et al., 2018). This is because CGA will significantly outperform GQE in query types with intersection structures, e.g., the 9th query type in Fig. 2(c), but perform on par in query types which do not contain intersection, e.g. the 1st query type in Fig. 2(c). Macroaverage computation over all query types makes the improvement less obvious. In order to compare the performance of different models on different query structures (different query types), we show the individual AUC and APR scores on each query type in three datasets for all models (See Figure 2(a), 2(b), 2(c), 2(d), 2(e), and 2(f)). To highlight the difference, we subtract the minimum score from the other scores in each figure. We can see that our model consistently outperforms the baseline models in almost all query types on all datasets except for the sixth and tenth query type (see Figure 3) which correspond to the same query structure 3inter_chain. In both these two query types, GQE+KG[min] has the best performance. The advantage of our attentionbased models is more obvious for query types with hard negative sampling strategy. For example, as for the 9th query type (Hard3inter) in Fig. 2(d), CGA+KG+8[min] has 5.8% and 6.5% relative APR improvement (5.9% and 5.1% relative AUC improvement) over GQE[min] on DB18 and WikiGeo19. 
Note that this query type has the largest number of neighboring nodes (3 nodes), which shows that our attention mechanism becomes more effective when a query type contains more neighboring nodes in an intersection structure. This indicates that both the attention mechanism and the original KG training phase are effective in discriminating the correct answer from misleading answers.
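The macro-averaging effect described above can be illustrated with a small worked example. The sketch below uses made-up per-query-type AUC scores (they are not taken from Table 2) and hypothetical query-type labels; it only shows why a sizeable relative gain on an intersection-heavy query type shrinks once all query types are averaged, and how subtracting the minimum score makes small per-type differences easier to see.

```python
from statistics import mean


def relative_improvement(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def subtract_min(scores):
    """Shift scores so the smallest becomes 0, as done for the per-type plots."""
    lowest = min(scores)
    return [s - lowest for s in scores]


# Hypothetical per-query-type AUC values, for illustration only.
baseline = {"1-chain": 0.90, "2-chain": 0.85, "2-inter": 0.88, "hard-3-inter": 0.70}
attention = {"1-chain": 0.90, "2-chain": 0.85, "2-inter": 0.90, "hard-3-inter": 0.75}

# Per-type gains: zero on chain queries, larger on intersection queries.
per_type = {q: relative_improvement(attention[q], baseline[q]) for q in baseline}

# Macro-average first, then compare: the gain is diluted by the unchanged types.
macro_gain = relative_improvement(
    mean(list(attention.values())), mean(list(baseline.values()))
)

for query_type, gain in per_type.items():
    print(f"{query_type}: {gain:.1f}%")
print(f"macro-average: {macro_gain:.1f}%")
```

With these illustrative numbers, the gain on the hardest intersection type is several times larger than the macro-averaged gain, which mirrors the qualitative pattern discussed in the text.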
5. Conclusion
In this work, we propose an end-to-end attention-based logical query answering model called the contextual graph attention model (CGA), which can answer complex conjunctive graph queries based on two geometric operators: the projection operator and the intersection operator. We utilize a multi-head attention mechanism in the geometric intersection operator to automatically learn different weights for different query paths. The original knowledge graph structure and the sampled query-answer pairs are used jointly for model training. We use three datasets (Bio, DB18, and WikiGeo19) to evaluate the performance of the proposed model against the baseline. The results show that our attention-based models (which are additionally trained on the KG structure) outperform the baseline models, particularly on the hard negatives, despite using fewer parameters. The current model is used in a transductive setup. In the future, we want to explore ways to use our model in an inductive learning setup. Additionally, conjunctive graph queries are a subset of SPARQL queries that allow neither disjunction, negation, nor filters. They also require the predicates in all query patterns to be known. In the future, we plan to investigate models that can relax these restrictions.
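As a rough illustration of what "learning different weights for different query paths" means, the following sketch contrasts a fixed, unweighted aggregation (element-wise minimum, chosen here only for illustration) with a softmax-weighted combination of path embeddings. It is a simplified single-head sketch and not the formulation used in this paper: the `context` vector, the dot-product scoring, and all function names are assumptions made for the example, and a trained model would learn these quantities rather than fix them by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elementwise_min(paths):
    """Unweighted aggregation: every path contributes in the same fixed way."""
    return [min(dims) for dims in zip(*paths)]


def attention_intersection(paths, context):
    """Weight each path embedding by its dot-product score against `context`."""
    scores = [sum(p_i * c_i for p_i, c_i in zip(p, context)) for p in paths]
    weights = softmax(scores)
    dim = len(context)
    combined = [sum(w * p[i] for w, p in zip(weights, paths)) for i in range(dim)]
    return combined, weights


# Three hypothetical 2-d path embeddings meeting at one intersection node.
paths = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [2.0, 0.0]  # hypothetical stand-in for a learned context/query vector

combined, weights = attention_intersection(paths, context)
print("attention weights:", [round(w, 3) for w in weights])
print("weighted embedding:", [round(x, 3) for x in combined])
print("element-wise min:", elementwise_min(paths))
```

In this toy setting, the paths that align with the context vector receive most of the weight, whereas the element-wise minimum treats all paths identically regardless of context.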
References
 Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.1.1.
 Neural machine translation by jointly learning to align and translate. In ICLR 2015, Cited by: §1.
 Semantic parsing on freebase from question-answer pairs. In EMNLP, pp. 1533–1544. Cited by: §1, item 3.
 Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §1, §2.3.
 Embedding logical queries on knowledge graphs. In Advances in Neural Information Processing Systems, pp. 2030–2041. Cited by: Contextual Graph Attention for Answering Logical Queries over Incomplete Knowledge Graphs, item 3, §1, §1, §1, §2.1, §2.2, §2.2, §2.2, §2.3, §3.2.2, §3.2.4, §3, §4.1, §4.1, §4.2, §4.3, §4.4, §4.4, §4, footnote 4, footnote 6, footnote 8.
 Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584. Cited by: §2.3.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.1.
 Knowledge graph embedding via dynamic mapping matrix. In ACL, Vol. 1, pp. 687–696. Cited by: §2.3.
 Semi-supervised classification with graph convolutional networks. In ICLR 2017, Cited by: §1.
 Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In ACL, Vol. 1, pp. 23–33. Cited by: §1, item 3.
 Learning entity and relation embeddings for knowledge graph completion. In AAAI, Vol. 15, pp. 2181–2187. Cited by: §1, §2.3.
 Support and centrality: learning weights for knowledge graph embedding models. In EKAW, pp. 212–227. Cited by: §1.
 Relaxing unanswerable geographic questions using a spatially explicit knowledge graph embedding model. In Proceedings of 22nd AGILE International Conference on Geographic Information Science, Cited by: §1, item 1, §4.2.
 A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: footnote 2.
 Holographic embeddings of knowledge graphs. In AAAI, pp. 1955–1961. Cited by: §1, §2.3.
 Factorizing yago: scalable machine learning for linked data. In WWW, pp. 271–280. Cited by: §1, §2.3.
 Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §3.1.2.
 Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pp. 926–934. Cited by: §1, §2.3.
 Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §3.1.2.
 Graph attention networks. In ICLR 2018, Cited by: Contextual Graph Attention for Answering Logical Queries over Incomplete Knowledge Graphs, §1, §3.1.1, §3.1.2, §3.1, item 4.
 Towards empty answers in sparql: approximating querying with RDF embedding. In International Semantic Web Conference, pp. 513–529. Cited by: §1, §1, §1.
 Knowledge graph embedding: a survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2724–2743. Cited by: §1, footnote 2.
 Knowledge graph representation with jointly structural and textual encoding. arXiv preprint arXiv:1611.08661. Cited by: §1.
 Embedding entities and relations for learning and inference in knowledge bases. In ICLR, Cited by: §1, §2.3.
 Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §1, Definition 2.3.