Dynamic Integration of Background Knowledge in Neural NLU Systems

by Dirk Weissenborn, et al.

Common-sense or background knowledge is required to understand natural language, but in most neural natural language understanding (NLU) systems, the requisite background knowledge is indirectly acquired from static corpora. We develop a new reading architecture for the dynamic integration of explicit background knowledge in NLU models. A new task-agnostic reading module provides refined word representations to a task-specific NLU architecture by processing background knowledge in the form of free-text statements, together with the task-specific inputs. Strong performance on the tasks of document question answering (DQA) and recognizing textual entailment (RTE) demonstrates the effectiveness and flexibility of our approach. Analysis shows that our models learn to exploit knowledge selectively and in a semantically appropriate way.





1 Introduction

Understanding natural language depends crucially on common-sense and background knowledge, for example, knowledge about what concepts are expressed by the words being read (lexical knowledge), and what relations hold between these concepts (relational knowledge). As a simple illustration, if an agent needs to understand that the statement “King Farouk signed his abdication” is entailed by “King Farouk was exiled to France in 1952, after signing his resignation”, it must know (among other things) that abdication means resignation of a king.

In most neural natural language understanding (NLU) systems, the requisite background knowledge is implicitly encoded in the models’ parameters. That is, what background knowledge is present has been learned from task supervision and also by pre-training word embeddings (where distributional properties correlate with certain kinds of useful background knowledge, such as semantic relatedness). However, acquisition of background knowledge from static training corpora is limiting for two reasons. First, it is unreasonable to expect that all background knowledge that could be important for solving an NLU task can be extracted from a limited amount of training data. Second, as the world changes, the facts that may influence how a text is understood will likewise change. In short: building suitably large corpora to capture all relevant information, and keeping the corpus and derived models up to date with changes to the world would be impractical.

In this paper, we develop a new architecture for dynamically incorporating external background knowledge in NLU models. Rather than relying only on static knowledge implicitly present in the training data, supplementary knowledge is retrieved from external knowledge sources (in this paper, ConceptNet and Wikipedia) to assist with understanding text inputs. Since NLU systems must already read and understand text inputs, we assume that background knowledge will likewise be provided in text form (§2). The retrieved supplementary texts are read together with the task inputs by an initial reading module whose outputs are contextually refined word embeddings (§3). These refined embeddings are then used as input to a task-specific NLU architecture (any architecture that reads text as a sequence of word embeddings can be used here). The initial reading module and the task module are learnt jointly, end-to-end.

We experiment with several different datasets on the tasks of document question answering (DQA) and recognizing textual entailment (RTE), evaluating the impact of our proposed solution with both basic task architectures and a sophisticated task architecture for RTE (§4). We find that our embedding refinement strategy is effective (§5). On four competitive benchmarks, we show that refinement helps. First, simply refining the embeddings using only the context (and no additional background information) can improve performance significantly, and adding background knowledge helps further. Our results are competitive with the best systems, achieving a new state of the art on the recent TriviaQA benchmarks. Our success on this task is especially noteworthy because the task-specific architecture is a simple reading architecture, in particular a single-layer BiLSTM with a feed-forward neural network for span prediction. Finally, we provide an analysis demonstrating that our systems are able to exploit background knowledge in a semantically appropriate manner (§5.3). It includes, for instance, an experiment showing that our system is capable of making appropriate counterfactual inferences when provided with “alternative facts”.

2 External Knowledge as Supplementary Text Inputs

Knowledge resources make information that could potentially be useful for improving NLU available in a variety of different formats, such as natural language text, (subject, predicate, object)-triples, relational databases, and other structured formats. Rather than tailoring our solution to a particular structured representation, we assume that all supplementary information either already exists in natural language statements (e.g., encyclopedias) or can easily be recoded as natural language. Furthermore, while mapping from unstructured to structured representations is hard, the inverse problem is easy. For example, given a triple such as (abdication, is-a, resignation) we can construct the free-text assertion “Abdication is a resignation.” using simple rules. Finally, the free-text format means that knowledge that exists only in unstructured text form, such as encyclopedic knowledge (e.g., Wikipedia), is usable by our system.
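The "simple rules" mentioned above can be sketched as a small template lookup. This is only an illustration: the predicate names and template wordings below are hypothetical and are not taken from our system or from any particular knowledge base.

```python
# Hypothetical templates: predicate names and wordings are illustrative only.
TEMPLATES = {
    "IsA": "{s} is a {o}.",
    "Synonym": "{s} is a synonym of {o}.",
    "Antonym": "{s} is an antonym of {o}.",
}


def verbalize(subject: str, predicate: str, obj: str) -> str:
    """Render a (subject, predicate, object) triple as a free-text assertion."""
    # Unknown predicates fall back to a generic "<subject> <predicate> <object>." rendering.
    template = TEMPLATES.get(predicate, "{s} " + predicate + " {o}.")
    sentence = template.format(s=subject, o=obj)
    return sentence[0].upper() + sentence[1:]


print(verbalize("abdication", "IsA", "resignation"))  # Abdication is a resignation.
```

Because the output is ordinary text, the same reader can consume rendered triples and encyclopedic sentences without any format-specific component.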

An important question that remains to be answered is: given some text that is to be understood, what supplementary knowledge should be incorporated? The retrieval and preparation of contextually relevant information from knowledge sources is a complex research topic by itself, and there are several statistical (Manning et al., 2008) and, more recently, neural approaches (Mitra and Craswell, 2017), as well as approaches based on reinforcement learning (Nogueira and Cho, 2017). Rather than learning both how to incorporate relevant information and which information is relevant, we use a heuristic retrieval mechanism (§4) and focus on the integration model.

In the next section, we turn to the question of how to leverage the retrieved supplementary knowledge (encoded as text) in an NLU system.

3 Refining Word Embeddings by Reading

Figure 1: Illustration of our context-dependent refinement strategy for word representations on an example from the SNLI dataset comprising the premise ($\mathcal{X}^1$), hypothesis ($\mathcal{X}^2$) and additional external information in the form of free-text assertions from ConceptNet ($\mathcal{X}^3$). Note that for the QA task there would be another stage that additionally integrates Wikipedia abstracts of answer candidates ($\mathcal{X}^4$, see §4). The reading architecture refines word representations incrementally (conceptually represented as columns in a series of embedding matrices $E^\ell$) by reading the input text and textual renderings of relevant background knowledge before computing the representations used by the task model (in this figure, RTE).

Virtually every NLU task—from document classification to translation to question answering—should in theory be able to benefit from supplementary knowledge. While one could develop custom architectures for each task so as to read supplementary inputs, we would like ours to augment any existing NLU task architectures with the ability to read relevant information with minimal effort. To realize this goal, we adopt the strategy of refining word embeddings; that is, we replace static word embeddings with embeddings that are functions of the task inputs and any supplementary inputs. Word embeddings can be considered a simple form of key-value memory stores that, in our case, not only contain general-purpose knowledge (as in typical neural NLU systems) but also contextual information (including background knowledge). The use of word-embeddings as memory has the advantage that it is transparent to the task-architecture which kinds of embeddings (refined or unrefined) are used.

Our incremental refinement process encodes input texts followed by updates on the word embedding matrix in multiple reading steps. Words are first represented non-contextually (i.e., standard word embeddings), which can be conceived of as the columns in an embedding matrix $E^0$. At each progressive reading step $\ell \geq 1$, a new embedding matrix $E^\ell$ is constructed by refining the embeddings from the previous step $E^{\ell-1}$ using (user-specified) contextual information $\mathcal{X}^\ell$ for reading step $\ell$, which is a set of natural language sequences (i.e., texts). An illustration of our incremental refinement strategy can be found in Figure 1.

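In the following, we define this procedure formally. We denote the hidden dimensionality of our model by $n$ and a fully-connected layer by $\mathrm{FC}(\mathbf{z}) = \mathbf{W}\mathbf{z} + \mathbf{b}$, with $\mathbf{W} \in \mathbb{R}^{n \times m}$, $\mathbf{b} \in \mathbb{R}^{n}$ and $\mathbf{z} \in \mathbb{R}^{m}$.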

3.1 Unrefined Word Embeddings ($E^0$)

The first representation level consists of non-contextual word representations, that is, word representations that do not depend on any input; these can be conceived of as an embedding matrix $E^0$ whose columns are indexed by the words of the vocabulary. The non-contextual word representation $e_w^0$ for a single word $w$ is computed by using a gated combination of fixed, pre-trained word vectors $e_w^p$ with learned character-based embeddings $e_w^{char}$. We compute $e_w^{char}$ using a single-layer convolutional neural network with fixed-width convolutional filters followed by a pooling operation over time (Seo et al., 2017; Weissenborn et al., 2017). The formal definition of this combination is given in Eq. 1.
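The gated combination described above can be written out as follows. This is a reconstruction from the prose rather than a verbatim copy of Eq. 1: the symbols $e_w^{p}$ (pre-trained vector), $e_w^{char}$ (character-based embedding) and $g_w$ (gate), as well as the ReLU projection of the pre-trained vector to the hidden dimensionality, are assumptions made for illustration.

```latex
\begin{aligned}
\mathbf{e}_w^{p'} &= \mathrm{ReLU}\!\left(\mathrm{FC}(\mathbf{e}_w^{p})\right), \qquad
\mathbf{g}_w = \sigma\!\left(\mathrm{FC}\!\left([\mathbf{e}_w^{p'}; \mathbf{e}_w^{char}]\right)\right) \\
\mathbf{e}_w^{0} &= \mathbf{g}_w \odot \mathbf{e}_w^{p'} + (1 - \mathbf{g}_w) \odot \mathbf{e}_w^{char}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication, so each dimension of $\mathbf{e}_w^{0}$ is a convex mixture of the word-level and character-level views of $w$.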


3.2 Refined Word Embeddings ($E^\ell$)

In order to compute contextually refined word embeddings $E^\ell$ given prior representations $E^{\ell-1}$, we assume a given set of texts $\mathcal{X}^\ell = \{x_1^\ell, x_2^\ell, \ldots\}$ that are to be read at refinement iteration $\ell$. Each text $x_i^\ell$ is a sequence of word tokens. We embed all tokens of every $x_i^\ell$ using the embedding matrix from the previous layer, $E^{\ell-1}$. To each word, we concatenate a one-hot vector of length $L$, the number of refinement levels, with position $\ell$ set to 1, indicating which layer is currently being processed (this one-hot feature lets the refinement model learn to update embeddings differently at different levels). Stacking the vectors into a matrix, we obtain a matrix $X_i^\ell$ with one column per token. This matrix is processed by a bidirectional recurrent neural network, a BiLSTM (Hochreiter and Schmidhuber, 1997) in this work. The resulting output is further projected to $\hat{X}_i^\ell$ by a fully-connected layer with a non-linear activation (Eq. 2).


To finally update the previous embedding $e_w^{\ell-1}$ of word $w$, we first pool all representations of occurrences matching the lemma of $w$ in every $x \in \mathcal{X}^\ell$, resulting in $\hat{e}_w^\ell$ (Eq. 3). Finally, we combine the previous representation $e_w^{\ell-1}$ with $\hat{e}_w^\ell$ to form an updated representation $e_w^\ell$ via a gated addition. This lets the model determine how much to revise the previous embedding with the newly read information (Eq. 5).
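The encoding, pooling and gated-addition steps that the text refers to as Eqs. 2 to 5 can be sketched as follows. This is a reconstruction from the prose under stated assumptions: the ReLU activation, the element-wise max as the pooling operation, and the symbols $\hat{\mathbf{X}}_i^{\ell}$, $\hat{\mathbf{e}}_w^{\ell}$ and $\mathbf{u}_w^{\ell}$ are illustrative choices, not confirmed notation.

```latex
\begin{aligned}
\hat{\mathbf{X}}_i^{\ell} &= \mathrm{ReLU}\!\left(\mathrm{FC}\!\left(\mathrm{BiLSTM}(\mathbf{X}_i^{\ell})\right)\right) \\
\hat{\mathbf{e}}_w^{\ell} &= \max\left\{ \hat{\mathbf{x}}_k^{\ell} \;\middle|\; x \in \mathcal{X}^{\ell},\ \mathrm{lemma}(x_k) = \mathrm{lemma}(w) \right\} \\
\mathbf{u}_w^{\ell} &= \sigma\!\left(\mathrm{FC}\!\left([\mathbf{e}_w^{\ell-1}; \hat{\mathbf{e}}_w^{\ell}]\right)\right) \\
\mathbf{e}_w^{\ell} &= \mathbf{u}_w^{\ell} \odot \mathbf{e}_w^{\ell-1} + (1 - \mathbf{u}_w^{\ell}) \odot \hat{\mathbf{e}}_w^{\ell}
\end{aligned}
```

Here $\hat{\mathbf{x}}_k^{\ell}$ denotes the column of the projected encoder output that corresponds to token $x_k$ of text $x$, and the max is taken element-wise over all matching occurrences.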


Note that we soften the matching condition for $w$ using lemmatization (we use spaCy, https://spacy.io) during the pooling operation of Eq. 3, because contextual information about certain words is usually independent of the word form in which they appear. As a consequence, this minor linguistic pre-processing step allows for additional interaction between tokens of the same lemma.

Pooling over lemma-occurrences effectively connects different text passages (even across texts) that are otherwise disconnected, mitigating the problems arising from long-distance dependencies. This is reminiscent of the (soft) attention mechanism used in reading comprehension models (e.g., Cheng et al., 2016; Wang et al., 2017). However, our setup is more general, as it allows for the connection of multiple passages (via pooling) at once and is able to deal with multiple inputs, which is necessary to make use of additional input texts such as relevant background knowledge.
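To make the pooling and update steps concrete, here is a minimal pure-Python sketch. It assumes element-wise max pooling and a sigmoid gate computed by a single linear layer; the function names, the toy vectors and the zero-initialized gate weights are hypothetical and only serve the illustration, and the encoder that would produce the token states is omitted.

```python
import math


def pool_by_lemma(token_states, token_lemmas):
    """Element-wise max over the states of all tokens that share a lemma.

    token_states: list of equal-length float lists, one per token, collected
                  over all texts read at the current refinement step.
    token_lemmas: list of lemma strings aligned with token_states.
    """
    pooled = {}
    for state, lemma in zip(token_states, token_lemmas):
        if lemma in pooled:
            pooled[lemma] = [max(a, b) for a, b in zip(pooled[lemma], state)]
        else:
            pooled[lemma] = list(state)
    return pooled


def gated_update(prev, pooled, weights, bias):
    """Mix the previous embedding with the pooled state through a learned gate.

    weights holds one row per output dimension, each of length 2 * len(prev).
    """
    features = list(prev) + list(pooled)  # concatenation [prev; pooled]
    gate = [
        1.0 / (1.0 + math.exp(-(sum(w * f for w, f in zip(row, features)) + b)))
        for row, b in zip(weights, bias)
    ]
    return [g * p + (1.0 - g) * c for g, p, c in zip(gate, prev, pooled)]


# Toy example: two occurrences of the lemma "abdication" are pooled together.
lemmas = ["king", "sign", "abdication", "abdication"]
states = [[0.1, -0.2], [0.5, 0.0], [0.3, 0.9], [0.7, -0.4]]
pooled = pool_by_lemma(states, lemmas)

# Zero weights give a gate of 0.5, i.e. a plain average of old and new.
zero_weights = [[0.0] * 4, [0.0] * 4]
updated = gated_update([0.1, 0.5], pooled["abdication"], zero_weights, [0.0, 0.0])
```

Keying the pooled states by lemma rather than by surface form is what lets an occurrence of "signing" in one text inform the representation of "signed" in another.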

4 Experimental Setup

We run experiments on four benchmarks for two standard NLU tasks: recognizing textual entailment (RTE) and document question answering (DQA). In the following we describe our experimental setup.

Task-specific Models

Since we wish to assess the value of the proposed embedding refinement strategy, we focus on relatively simple task architectures. We use single-layer bidirectional LSTMs (BiLSTMs) as encoders of the inputs represented by the refined or unrefined embeddings, with a task-specific feed-forward network for the final prediction. Such models are general reading architectures (Bowman et al., 2015; Rocktäschel et al., 2015; Weissenborn et al., 2017). To demonstrate that our reading module can be integrated into arbitrary task architectures, we also add our refinement module to a reimplementation of a state-of-the-art architecture for RTE called ESIM (Chen et al., 2017). We refer the interested reader to the ESIM paper for details of the model.

All models are trained end-to-end jointly with the refinement module. We use the same hidden dimensionality $n$ for all experiments except TriviaQA, for which we had to reduce $n$ due to memory constraints. All baselines operate on the unrefined word embeddings described in §3.1. For the DQA baseline system we add the lemma-in-question feature (liq) suggested in Weissenborn et al. (2017). Implementation details for the BiLSTM task architectures, as well as training details, are available in Appendix A.

Question Answering

We use two recent DQA benchmark training and evaluation datasets, SQuAD (Rajpurkar et al., 2016) and TriviaQA (Joshi et al., 2017). The task is to predict an answer span within a provided document $p$ given a question $q$. Both datasets are large-scale, containing on the order of 100k examples; however, TriviaQA is more complex in that the supporting documents are much larger than those for SQuAD. Because TriviaQA is collected via distant supervision, the test set is divided into a large but noisy distant supervision part and a much smaller (on the order of hundreds) human-verified part. We report results on both. See Appendix A.1 for implementation details of the DQA system.

Recognizing Textual Entailment

We test on both the SNLI dataset (Bowman et al., 2015), a collection of 570k sentence pairs, and the more recent MultiNLI dataset (433k sentence pairs) (Williams et al., 2017). Given two sentences, a premise $p$ and a hypothesis $q$, the task is to determine whether $p$ entails, contradicts or is neutral to $q$. See Appendix A.2 for implementation details of the RTE system.

Supplementary Knowledge Sources

We use ConceptNet (http://conceptnet.io/; Speer and Havasi, 2012), a freely available, multilingual semantic network that originated from the Open Mind Common Sense project and incorporates selected knowledge from various other knowledge sources, such as Wiktionary, Open Multilingual WordNet, OpenCyc and DBpedia. It presents information in the form of relational triples. To reduce noise, we exclude ConceptNet 4 assertions created by only one contributor and assertions from Verbosity. Additionally, we exploit Wikipedia abstracts in our DQA experiments as described below.

ConceptNet Integration

Here we describe the heuristic we use to obtain plausibly relevant supplementary knowledge for understanding a text pair $(p, q)$ from ConceptNet. Our hypothesis is that relations that link words and phrases across $p$ and $q$ are likely to be most valuable. Because assertions in ConceptNet come in the form of (subject, predicate, object)-triples $(s, r, o)$, we retrieve all assertions for which $s$ appears in $q$ and $o$ appears in $p$, or vice versa. Because too many such assertions might still be retrieved for an instance, we rank all retrievals based on their respective subject and object. The ranking score we use is the inverse product of appearances of the subject and the object in the KB, that is, $\mathrm{score}(s, r, o) = \left(\sum_{(s', r', o')} \mathbb{I}(s' = s) \cdot \sum_{(s', r', o')} \mathbb{I}(o' = o)\right)^{-1}$, where the sums range over all assertions in the KB and $\mathbb{I}$ denotes the indicator function. During training and evaluation we retain the top-$k$ assertions, using $k = 50$ for DQA and $k = 20$ for RTE. Note that fewer or even no assertions might be retrieved for a particular instance during training and testing.
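The retrieval heuristic can be sketched in a few lines. This is a simplified illustration, not our implementation: it assumes that the knowledge base fits in memory as a list of triples, that both texts are given as lists of single-word lemmas (multi-word phrases are ignored), and the function name and toy triples are hypothetical.

```python
from collections import Counter


def retrieve_assertions(triples, p_lemmas, q_lemmas, k):
    """Return up to k triples that link the two texts, rarest subject/object first."""
    # How often each subject / object occurs in the whole knowledge base.
    subject_counts = Counter(s for s, _, _ in triples)
    object_counts = Counter(o for _, _, o in triples)
    p_set, q_set = set(p_lemmas), set(q_lemmas)

    # Keep assertions whose subject is in one text and whose object is in the other.
    linking = [
        (s, r, o)
        for s, r, o in triples
        if (s in p_set and o in q_set) or (s in q_set and o in p_set)
    ]

    # Inverse product of frequencies: assertions about rare concepts rank higher.
    def score(triple):
        s, _, o = triple
        return 1.0 / (subject_counts[s] * object_counts[o])

    return sorted(linking, key=score, reverse=True)[:k]


kb = [
    ("abdication", "Synonym", "resignation"),
    ("sign", "RelatedTo", "write"),
    ("sign", "RelatedTo", "resignation"),
    ("king", "RelatedTo", "crown"),
]
premise = ["king", "farouk", "exile", "sign", "resignation"]
hypothesis = ["king", "farouk", "sign", "abdication"]
top = retrieve_assertions(kb, premise, hypothesis, k=1)
```

In this toy knowledge base the subject "sign" occurs twice while "abdication" occurs once, so the assertion about "abdication" receives the higher score and survives the cut at k = 1.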

Wikipedia Integration

Here we describe the heuristic we use to obtain plausibly relevant supplementary knowledge from Wikipedia. We wish to use Wikipedia abstracts (downloaded from http://wiki.dbpedia.org/downloads-2016-10) as an additional knowledge source to gather more information about the top answer predictions of our DQA model. To this end, we let the system first predict the top-16 answer spans without any information from Wikipedia. For each answer candidate string, we collect abstracts for its 3 most frequently linked Wikipedia entries (statistics were extracted from the DBpedia Anchor Text dataset, http://downloads.dbpedia.org/2016-10/core-i18n/en/anchor_text_en.ttl.bz2). Using more than just the most frequently linked Wikipedia entry for a given answer string lets us mitigate problems arising from polysemous entity names, although it does mean the refinement model needs to be selective in extracting relevant information. The refinement module additionally reads the initial 50 tokens of each retrieved Wikipedia abstract and computes the final predictions.
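The candidate-driven lookup can be sketched as follows. The two dictionaries stand in for the anchor-text statistics and the abstract dump; their layout, the function name, the "Corfu (city)" entry and the whitespace tokenization are assumptions made for this illustration only.

```python
def collect_abstracts(candidates, link_counts, abstracts,
                      max_candidates=16, entries_per_candidate=3, max_tokens=50):
    """Gather truncated abstracts for the top answer candidates.

    candidates:  answer strings ordered from best to worst prediction.
    link_counts: hypothetical mapping answer string -> {entry title: link frequency}.
    abstracts:   hypothetical mapping entry title -> abstract text.
    """
    texts = []
    for answer in candidates[:max_candidates]:
        counts = link_counts.get(answer, {})
        # Most frequently linked entries first, to cope with ambiguous names.
        top_entries = sorted(counts, key=counts.get, reverse=True)[:entries_per_candidate]
        for title in top_entries:
            if title in abstracts:
                tokens = abstracts[title].split()  # whitespace tokenization is a simplification
                texts.append(" ".join(tokens[:max_tokens]))
    return texts


candidates = ["Denmark", "Corfu"]
link_counts = {
    "Denmark": {"Denmark": 100},
    "Corfu": {"Corfu": 90, "Corfu (city)": 10},
}
abstracts = {
    "Denmark": "Denmark is a Scandinavian country",
    "Corfu": "Corfu is a Greek island in the Ionian Sea",
}
supplementary_texts = collect_abstracts(candidates, link_counts, abstracts, max_tokens=5)
```

The returned texts would then form one more set of inputs for a final refinement step, after which the task model predicts again.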

Refinement Order

When employing our embedding-refinement strategy, we first read the document ($p$) followed by the question ($q$) in the case of DQA, and the premise ($p$) followed by the hypothesis ($q$) for RTE, that is, $\mathcal{X}^1 = \{p\}$ and $\mathcal{X}^2 = \{q\}$. Additional knowledge in the form of a set of assertions ($A$) is integrated after reading the task-specific input for both DQA and RTE, that is, $\mathcal{X}^3 = A$. Finally, for DQA we additionally add Wikipedia abstracts ($W$) as background knowledge as described previously, that is, $\mathcal{X}^4 = W$. In preliminary experiments we found that the final performance is not significantly sensitive to the order of presentation, so we decided to fix our order as defined above.

5 Results

This section presents results. We provide ablations for a total of 7 task-dataset-model combinations and compare our final results to other works on the most recent benchmark datasets for each task (TriviaQA and MultiNLI), demonstrating that our results are competitive, and in some cases, state of the art, even without sophisticated task architectures.

5.1 Question Answering

Model                                SQuAD   T-Wiki   T-Web
BiLSTM ($E^0$) + liq                 75.9    62.1     65.0
+ $p$ + $q$ ($E^2$)                  78.6    65.5     68.7
+ $p$ + $q$ + $A$ ($E^3$)            79.7    67.1     70.3
+ $p$ + $q$ + $A$ + $W$ ($E^4$)      79.7    69.5     72.7
Table 1: Ablation on the SQuAD and TriviaQA (T-Wiki and T-Web) development sets for the F1 metric. Information used for embedding refinement: $p$ + $q$ - refinement on task input (i.e., document and question); $A$ - top-50 retrieved ConceptNet assertions; $W$ - Wikipedia abstracts for the top-16 answer candidates. The liq-feature (lemma-in-question) is only used in the baseline.

Dataset         Model                      Exact         F1
TriviaQA Wiki   Clark and Gardner (2017)   64.0 / 68.0   68.9 / 72.9
                Ours                       64.6 / 72.8   69.9 / 77.4
TriviaQA Web    Clark and Gardner (2017)   66.4 / 80.0   71.3 / 83.7
                Ours                       67.5 / 77.6   72.8 / 82.0
SQuAD Dev       Clark and Gardner (2017)   71.6          80.8
                Ours                       69.5          79.7
Table 2: Test set results of our full model (BiLSTM + $p$ + $q$ + $A$ + $W$, i.e., using $E^4$ as embeddings). Results for TriviaQA are divided into distant supervision results (left) and human-verified results (right, comprising only several hundred examples). We compare against the concurrent work on TriviaQA of Clark and Gardner (2017).

Table 1 presents our results on two question answering benchmarks. The results demonstrate that the introduction of the refinement module helps consistently, and further improvements come from using common-sense knowledge from ConceptNet ($A$). Wikipedia ($W$) yields further, significant improvements on TriviaQA, slightly outperforming the current state-of-the-art model (Table 2). This is especially noteworthy given the simplicity of our QA architecture (i.e., a single-layer BiLSTM) compared to the previous SotA attained by Clark and Gardner (2017). The development results on SQuAD show the same pattern of improvement (we do not report test set results for SQuAD due to restrictions on code sharing), but here the results are slightly worse than the model of Clark and Gardner (2017), and they are well below the current best-known results (currently at 87% F1; see https://rajpurkar.github.io/SQuAD-explorer/). However, our intention with these experiments is to show the value that external knowledge and our refinement process can bring, not to compete with highly tuned task architectures on a single dataset.

Controlling for computation.

One potential explanation for the improvement obtained using the refinement module is that we are enabling more computation over the information present in the inputs, that is, we are effectively using a deeper architecture. To test whether this might be the case, we also ran an experiment with a 2-layer BiLSTM (+liq). This setup exhibits similar computational complexity and number of parameters to BiLSTM + $p$ + $q$. We found that the second layer did not improve performance, suggesting that pooling over word/lemma occurrences in a given context between layers is a powerful yet simple technique.

5.2 Recognizing Textual Entailment

Model                          SNLI   MNLI-Matched   MNLI-Mismatched
Development sets
BiLSTM ($E^0$)                 84.4   70.0           70.2
+ $p$ + $q$ ($E^2$)            86.1   75.3           76.3
+ $p$ + $q$ + $A$ ($E^3$)      86.5   76.8           77.5
ESIM ($E^0$)                   88.2   76.8           77.3
+ $p$ + $q$ ($E^2$)            88.0   77.8           78.4
+ $p$ + $q$ + $A$ ($E^3$)      87.8   78.8           78.8
Test sets
ESIM + $p$ + $q$ + $A$         87.3   78.2           77.0
SotA                           88.6   78.8           77.8
Table 3: Ablation on the SNLI and MultiNLI-Matched and -Mismatched development sets and final results on the corresponding test sets. $p$ + $q$ - refinement on task input (i.e., premise and hypothesis); $A$ - top-20 retrieved ConceptNet assertions. SotA results are taken from Chen et al. (2017) and Gong et al. (2017).

Table 3 shows the results of our RTE experiments. In general, the introduction of our refinement strategy almost always helps, both with and without external knowledge. When providing additional background knowledge from ConceptNet, our BiLSTM-based models improve substantially, while the ESIM-based models improve only on the more difficult MultiNLI dataset. Compared to previously published state-of-the-art systems, our models acquit themselves quite well on the MultiNLI benchmark, and competitively on the SNLI benchmark. In parallel to this work, Gong et al. (2017) developed a novel task-specific architecture for RTE that achieves slightly better performance on MultiNLI than our ESIM + $p$ + $q$ + $A$ based models (our refinement architecture can of course be used with this new model, but we report ESIM results since that was the best model when this work was carried out). It is worth noting that when using our knowledge-enhanced embedding module on MultiNLI, the basic BiLSTM task model outperforms the task-specific ESIM model, which is architecturally much more complex and designed specifically for the RTE task. We do find that there is little impact of using external knowledge on the RTE task with ESIM, although the refinement strategy helps using just $p$ + $q$. A more detailed set of experiments reported in Appendix B shows that by impoverishing the amount of training data and information present in the GloVe embeddings, the positive impact of supplemental information becomes much more pronounced. These results suggest that ESIM is able to learn important background information from the large-scale datasets and from pretrained embeddings, but this can be supplemented when necessary. Nevertheless, both ESIM and our BiLSTM models, when trained with knowledge from ConceptNet, are sensitive to the semantics of the provided assertions, as demonstrated in our analysis in §5.3. We argue that this is a desirable side effect because it makes the predictions of our model more interpretable than those not trained with knowledge. Furthermore, increasing the coverage of assertions in ConceptNet would most likely yield improved performance even without retraining our models.

Finally, we remark that despite careful tuning, our re-implementation of ESIM falls short of the 88% reported in Chen et al. (2017) by 0.8%; however, on MultiNLI, we find that our implementation of ESIM performs considerably better (by approximately 5%). The instability of the results, as well as the failure of a custom RTE architecture to consistently perform well, suggests that current SotA RTE models may be overfit to the SNLI dataset.

5.3 Qualitative Analysis

Example 1
p: His off-the-cuff style seems amateurish […]
h: He didn’t look like an amateur
a: look like synonym seem → contradiction
ã: look like antonym seem → entailment

Example 2
p: the net cost of operations.
h: The gross cost.
a: gross antonym net → contradiction
ã: gross synonym net → entailment

Example 3
p: but uh these guys […] file their uh their final exams […]
h: These men filed their midterm exams […]
a: midterm antonym final → contradiction
ã: midterm synonym final → entailment

Table 4: Three examples for the antonym-synonym swapping experiment on MultiNLI. p: premise, h: hypothesis, a: assertion, ã: swapped assertion. The label after each assertion is the resulting prediction.

Although our empirical results show our knowledge-incorporation approach improves performance, in this section we attempt to assess whether we are learning to use the provided knowledge in a semantically appropriate way.


To test our models’ sensitivity to the semantics of the assertions for recognizing textual entailment, we run an experiment in which we swap the synonym with the antonym predicate in the provided assertions at test time. We hypothesize that in many cases these two predicates are very important for predicting either contradiction or entailment. Indeed, there is a strong performance drop of about 10% on MultiNLI examples for which either a synonym or an antonym assertion is present, for both the BiLSTM and the ESIM model. This very large drop clearly shows that our models are sensitive to the semantics of the provided knowledge. Examples of prediction changes are presented in Table 4. They demonstrate that the system has learned to trust the presented assertions to the point that it will make appropriate counterfactual inferences; that is, the change in knowledge has caused the change in prediction. For the interested reader, we provide additional RTE analysis results in Appendix C.
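The test-time perturbation used in this experiment amounts to a relabeling of predicates. A minimal sketch, assuming assertions are held as (subject, predicate, object) tuples with lower-case predicate names; the function name and the third, unaffected triple are hypothetical:

```python
SWAP = {"synonym": "antonym", "antonym": "synonym"}


def swap_predicates(assertions):
    """Exchange synonym and antonym predicates; all other triples stay unchanged."""
    return [(s, SWAP.get(r, r), o) for s, r, o in assertions]


original = [
    ("look like", "synonym", "seem"),
    ("gross", "antonym", "net"),
    ("king", "is a", "person"),
]
swapped = swap_predicates(original)
```

Only the assertions change; the premise, hypothesis and trained model parameters are left untouched, so any change in prediction can be attributed to the altered knowledge.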


The following is an example question from the TriviaQA dataset:

Prince Philip […] was born on which island?

Answer candidates with corresponding abstracts:

  • Denmark is a Scandinavian country with territory in Europe and North America […]

  • Corfu is a Greek island in the Ionian Sea […]

  • Greece, officially the Hellenic Republic, […] is a transcontinental country […]

  • Vanuatu is a Pacific island nation located in the South Pacific Ocean […]

Answer candidates (i.e., Denmark, Corfu, Greece, Vanuatu) were obtained from the top predicted answer spans computed by our model excluding Wikipedia (i.e., BiLSTM + $p$ + $q$ + $A$). Their corresponding abstracts were retrieved from Wikipedia and then given to our model in a second pass (i.e., BiLSTM + $p$ + $q$ + $A$ + $W$). In this example, the final best prediction of the model changes from Denmark to Corfu after integrating the abstracts (here, the abstract clearly states that Corfu is an island). We studied a total of 25 similar answer changes, 14 of which went from incorrect to correct, and 11 of which went from correct to incorrect. In 11 of the 14 corrections, obvious information was present in the Wikipedia abstracts that reinforced the correct answer. Where the system was confused by the abstracts (i.e., when they switched the prediction from correct to incorrect), no obvious information was present in 8 of the 11 cases, suggesting that the model had difficulty coping with unrelated background information. In the remaining 3 of the 11 cases, plausibly relevant information was present in the abstract of the correct answer, yet the model still made the incorrect answer change.

The existence of counterfactual inferences in RTE and the tendency to use reinforcing information about candidate answers in DQA suggest that our knowledge incorporating strategy is exploiting heterogeneous knowledge sources in semantically sensible ways.

6 Related Work

The role of background knowledge in natural language understanding has long been remarked on, especially in the context of classical models of AI (Schank and Abelson, 1977; Minsky, 2000); however, it has only recently begun to play a role in neural network models of NLU (Ahn et al., 2016; Xu et al., 2016; Long et al., 2017; Dhingra et al., 2017). Previous efforts have focused on specific tasks or certain kinds of knowledge, whereas we take a step towards a more general-purpose solution for the integration of heterogeneous knowledge for NLU systems by providing a simple, general-purpose reading architecture that can read background knowledge encoded in simple natural language statements, e.g., “abdication is a type of resignation”.

In the area of visual question answering, Wu et al. (2016) utilize external knowledge in the form of DBpedia comments (short abstracts/definitions) to improve the answering ability of a model. Marino et al. (2017) explicitly incorporate knowledge graphs into an image classification model.

Xu et al. (2016) built a recall mechanism into a standard LSTM cell that retrieves pieces of external knowledge encoded by a single representation for a conversation model. Concurrently, Dhingra et al. (2017) exploit linguistic knowledge using MAGE-GRUs, an adaptation of GRUs to handle graphs; however, external knowledge has to be present in the form of triples. Ahn et al. (2016) exploit knowledge base facts about mentioned entities for neural language models. Bahdanau et al. (2017) and Long et al. (2017) create word embeddings on the fly by reading word definitions prior to processing the task at hand. Similarly, Pilehvar et al. (2017) incorporate information about word senses into their representations before solving the downstream NLU task. We go one step further by seamlessly integrating all kinds of fine-grained assertions about concepts that might be relevant for the task at hand.

Another important aspect of our approach is the notion of dynamically updating word representations with contextual information. Tracking and updating concepts, entities or sentences with dynamic memories is a very active research direction (Kumar et al., 2016; Henaff et al., 2017; Ji et al., 2017; Kobayashi et al., 2017). However, those works typically focus on particular tasks, whereas our approach is task-agnostic and, most importantly, allows for the easy integration of external background knowledge. Important progress has also been made in creating pre-trained, contextualized token representations (Peters et al., 2017; McCann et al., 2017).

7 Conclusion

We have presented a novel reading architecture that allows for the dynamic integration of background knowledge into neural NLU models. Our solution, which is based on the incremental refinement of word representations by reading supplementary inputs, is flexible and can be used with virtually any existing NLU architecture that relies on word embeddings as input. Our results show that embedding refinement using both the system’s text inputs and supplementary text from external background knowledge can yield large improvements. In particular, we have shown that relatively simple task architectures (e.g., based on simple BiLSTM readers) can become competitive with state-of-the-art, task-specific architectures when augmented with our reading architecture. Our analysis demonstrates that our model learns to exploit provided background knowledge in a semantically appropriate way.


  • Ahn et al. (2016) Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv .
  • Bahdanau et al. (2017) Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. arXiv .
  • Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2017. Enhancing and Combining Sequential and Tree LSTM for Natural Language Inference. ACL .
  • Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In EMNLP.
  • Clark and Gardner (2017) Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv .
  • Dhingra et al. (2017) Bhuwan Dhingra, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2017. Linguistic Knowledge as Memory for Recurrent Neural Networks. arXiv .
  • Gong et al. (2017) Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv .
  • Henaff et al. (2017) Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the world state with recurrent entity networks. In ICLR.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation .
  • Ji et al. (2017) Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A Smith. 2017. Dynamic entity representations in neural language models. In EMNLP.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • Kobayashi et al. (2017) Sosuke Kobayashi, Naoaki Okazaki, and Kentaro Inui. 2017. A neural language model for dynamically representing the meanings of unknown words and entities in a discourse. arXiv preprint arXiv:1709.01679 .
  • Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML. pages 1378–1387.
  • Long et al. (2017) Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, and Doina Precup. 2017. World knowledge for reading comprehension: Rare entity prediction with hierarchical lstms using external descriptions. In EMNLP.
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
  • Marino et al. (2017) Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The more you know: Using knowledge graphs for image classification. CVPR .
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
  • Minsky (2000) Marvin Minsky. 2000. Commonsense-based interfaces. Communications of the ACM .
  • Mitra and Craswell (2017) Bhaskar Mitra and Nick Craswell. 2017. Neural models for information retrieval. arXiv preprint arXiv:1705.01509 .
  • Nogueira and Cho (2017) Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. In EMNLP.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
  • Peters et al. (2017) Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
  • Pilehvar et al. (2017) Mohammad Taher Pilehvar, Jose Camacho-Collados, Roberto Navigli, and Nigel Collier. 2017. Towards a seamless integration of word senses into downstream nlp applications. In ACL.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
  • Rocktäschel et al. (2015) Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. ICLR .
  • Schank and Abelson (1977) Roger Schank and Robert Abelson. 1977. Scripts, Plans, Goals, and Understanding. Psychology Press.
  • Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bi-Directional Attention Flow for Machine Comprehension. In ICLR.
  • Speer and Havasi (2012) Robert Speer and Catherine Havasi. 2012. Representing General Relational Knowledge in ConceptNet 5. In LREC.
  • Wang et al. (2017) Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL.
  • Weissenborn et al. (2017) Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making Neural QA as Simple as Possible but not Simpler. In CoNLL.
  • Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv .
  • Wu et al. (2016) Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2016. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources. CVPR .
  • Xu et al. (2016) Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, and Xiaolong Wang. 2016. Incorporating Loose-Structured Knowledge into LSTM with Recall Gate for Conversation Modeling. arXiv .

Appendix A Implementation Details

All our models were trained with 3 different random seeds and the top performance is reported (result variations were small, that is, within less than a percentage point in all experiments). An overview of the hyper-parameters used in our experiments can be found in Table 5. In the following we explain the detailed implementation of our two task-specific baseline models.

We assume to have computed the contextually (un-)refined word representations depending on the setup and embedded our input sequences and to and , respectively. The word representation update gate in Eq. 4 is initialized with a bias of to refine representations only slightly in the beginning of training. In the following as before, we denote the hidden dimensionality of our model by and a fully-connected layer by , .

A.1 Question Answering


In the DQA task, the two input sequences are the question and the supporting text. For our baseline (i.e., BiLSTM + liq) we additionally concatenate a binary feature to the embedded inputs indicating whether the corresponding token lemma appeared in the question. However, it is omitted in the following for the sake of brevity. We first process both sequences by identical BiLSTMs in parallel (Eq. 6), followed by a linear projection and a non-linearity (Eq. 7).


is initialized by where

is the identity matrix.


Our prediction (or answer) layer is similar to that of Weissenborn et al. (2017). We first compute a weighted, -dimensional representation of the processed question (Eq. 8).


The probability distributions for the start and end locations of the answer are computed by a 2-layer MLP with a ReLU-activated hidden layer as follows:


The model is trained to maximize the log-likelihood of the correct answer spans by computing the sum of the correct span probabilities for span under our model (Eq. 9). During evaluation we extract the span with the best score and maximum token length for SQuAD and for TriviaQA.
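The training objective and the length-constrained span extraction described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a span score factorizes into the product of independent start and end probabilities; the function names `best_span` and `span_nll` and the toy probabilities are hypothetical and not taken from our implementation.

```python
import math

def best_span(start_probs, end_probs, max_len):
    """Return (start, end, score) of the best span with at most max_len tokens.

    Assumption: score(s, e) = start_probs[s] * end_probs[e].
    """
    best = (0, 0, -math.inf)
    for s, p_s in enumerate(start_probs):
        # A span (s, e) is inclusive, so it covers e - s + 1 tokens.
        for e in range(s, min(s + max_len, len(end_probs))):
            score = p_s * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best

def span_nll(start_probs, end_probs, gold_spans):
    """Negative log of the summed probability of all correct spans."""
    total = sum(start_probs[s] * end_probs[e] for s, e in gold_spans)
    return -math.log(total)

# Toy distributions over three tokens.
start = [0.1, 0.7, 0.2]
end = [0.2, 0.1, 0.7]
span_one = best_span(start, end, max_len=1)   # only single-token spans allowed
span_two = best_span(start, end, max_len=2)   # spans of up to two tokens
loss = span_nll(start, end, [(1, 2), (2, 2)])
```

Summing the probabilities of all correct spans before taking the logarithm means that the model is not penalized for preferring one correct occurrence of the answer over another.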


Properly training a QA system on TriviaQA is much more challenging than on SQuAD because of the large document sizes and the use of multiple paragraphs. Therefore, we adopt the approach of Clark and Gardner (2017), who were the first to properly train neural QA models on TriviaQA. It relies on splitting documents and merging paragraphs up to a certain maximum token length ( per paragraph in our experiments), and only retaining the top- paragraphs ( in our case) for prediction. Paragraphs are ranked using the tf-idf cosine similarity between question and paragraph. To speed up training only paragraphs out of the top / for the / datasets were sampled. The only architectural difference for this multi-paragraph setup is that we encode multiple for each question and the of Eq. 9 is taken over all tokens of all paragraphs instead of only a single paragraph. For further details, we refer the interested reader to Clark and Gardner (2017).
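The paragraph ranking step can be sketched as follows. This is a minimal, self-contained illustration of tf-idf cosine ranking, not the implementation of Clark and Gardner (2017): the smoothed idf formula, the choice to compute document frequencies over the question together with the paragraphs, and the names `tfidf_vectors`, `cosine` and `top_k_paragraphs` are all assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized texts."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed idf (an assumption): frequent terms keep a small positive weight.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
            for toks in token_lists]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_paragraphs(question, paragraphs, k):
    """Return the indices of the k paragraphs most similar to the question."""
    vecs = tfidf_vectors([question] + paragraphs)
    q_vec, p_vecs = vecs[0], vecs[1:]
    order = sorted(range(len(paragraphs)),
                   key=lambda i: cosine(q_vec, p_vecs[i]), reverse=True)
    return order[:k]

question = ["who", "wrote", "hamlet"]
paragraphs = [
    ["the", "weather", "is", "mild"],
    ["shakespeare", "wrote", "hamlet"],
    ["hamlet", "is", "a", "play"],
]
ranked = top_k_paragraphs(question, paragraphs, k=2)
```

In this toy example the paragraph sharing two terms with the question is ranked first, and the paragraph without any overlap is dropped.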

Dataset Dim Dropout Batch-size Ckpt-interval
SQuAD 300 0.2 32 1000
TriviaQA 150 0.2 16 2000
*NLI 300 0.2 64 2000
Table 5: Training hyper-parameters for our models. For optimization we employed ADAM with a learning rate of which was halved when performance dropped between checkpoint (ckpt) intervals. We use -dimensional word-embeddings from GloVe (Pennington et al., 2014) as pre-trained word embeddings in all experiments. For regularization we make use of dropout on the computed non-contextual word representations defined in §3.1 with the same dropout mask for all words in a batch. For QA we additionally applied dropout on the projections computed in Eq. 7.
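The learning-rate schedule mentioned in the caption of Table 5 can be sketched as below. The sketch assumes that a "drop" means the development score fell below the score of the previous checkpoint (comparing against the best score so far would be an equally plausible reading); the function name `halve_on_drop` and the initial rate of 1.0 are placeholders, not the values used in our experiments.

```python
def halve_on_drop(initial_lr, checkpoint_scores):
    """Return the learning rate in effect after each checkpoint evaluation."""
    lr = initial_lr
    previous = None
    rates = []
    for score in checkpoint_scores:
        # Halve the rate whenever performance dropped since the last checkpoint.
        if previous is not None and score < previous:
            lr *= 0.5
        rates.append(lr)
        previous = score
    return rates

# Placeholder initial rate; the scores are made-up development results.
rates = halve_on_drop(1.0, [0.50, 0.60, 0.55, 0.58, 0.57])
```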

A.2 Recognizing Textual Entailment


Analogous to DQA, we encode our input sequences by BiLSTMs; for RTE, however, we use conditional encoding (Rocktäschel et al., 2015) instead. We therefore first process the embedded hypothesis by a BiLSTM and use the respective end states of the forward and backward LSTM as initial states for the forward and backward LSTM that processes the embedded premise.
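The state hand-off of conditional encoding can be illustrated with a deliberately simplified sketch. A toy element-wise tanh cell stands in for the LSTM, and the same cell is reused for both directions; both are simplifications for illustration only. The names `rnn_step`, `run` and `conditional_encode` are hypothetical.

```python
import math

def rnn_step(state, x, w=0.5, u=1.0):
    """Toy element-wise recurrent cell standing in for an LSTM step."""
    return [math.tanh(w * s + u * xi) for s, xi in zip(state, x)]

def run(seq, init_state):
    """Run the cell over seq; return all outputs and the final state."""
    state, outputs = init_state, []
    for x in seq:
        state = rnn_step(state, x)
        outputs.append(state)
    return outputs, state

def conditional_encode(hypothesis, premise, dim):
    """Encode the premise conditioned on the end states of the hypothesis readers."""
    zeros = [0.0] * dim
    # Read the hypothesis in both directions, starting from a zero state.
    _, h_fwd_end = run(hypothesis, zeros)
    _, h_bwd_end = run(hypothesis[::-1], zeros)
    # The premise readers start from the end states of the hypothesis readers.
    p_fwd, _ = run(premise, h_fwd_end)
    p_bwd_reversed, _ = run(premise[::-1], h_bwd_end)
    p_bwd = p_bwd_reversed[::-1]
    # Concatenate forward and backward outputs for every premise token.
    return [f + b for f, b in zip(p_fwd, p_bwd)]

premise = [[0.0, 1.0], [1.0, 1.0]]
encoded = conditional_encode([[1.0, 0.0]], premise, dim=2)
encoded_other = conditional_encode([[0.0, 0.0]], premise, dim=2)
```

The two calls differ only in the hypothesis, yet they produce different premise encodings, which is the point of conditioning the premise reader on the hypothesis.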


We concatenate the outputs of the forward and backward LSTMs processing the premise , i.e., and run each of the resulting outputs through a fully-connected layer with activation () followed by a -pooling operation over time resulting in a hidden state . Finally, is used to predict the RTE label as follows:


The probability of choosing category {entailment, contradiction, neutral} is defined in Eq. 10. Finally, the model is trained to maximize the log-likelihood of the correct category label given probability distribution .

Appendix B Reducing Training Data & Dimensionality of Pre-trained Word Embeddings

We find that external knowledge has only little impact on the RTE task when a more sophisticated task model such as ESIM is used. We hypothesize that the attention mechanisms within ESIM, together with powerful pre-trained word representations, allow for the recovery of some important lexical relations when trained on a large dataset. It follows that by reducing the amount of training data and impoverishing pre-trained word representations, the impact of using external knowledge should become larger.

To test this hypothesis, we gradually impoverish pre-trained word embeddings by reducing their dimensionality with PCA while reducing the number of training instances at the same time. (Although reducing either embedding dimensionality or data individually exhibits similar, but less pronounced, results, we only report the joint reduction results here.) Our joint data and dimensionality reduction results are presented in Table 6. They show that there is indeed a slightly larger benefit when employing background knowledge from ConceptNet in the more impoverished settings, with the largest improvements when using around 10k examples and a reduced dimensionality of 10. However, we observe that the biggest overall impact over the baseline ESIM model stems from our contextual refinement strategy (i.e., reading only the premise and hypothesis), which is especially pronounced for the 1k and 3k experiments. This highlights once more the usefulness of our refinement strategy even without the use of additional knowledge.

Dim/Data 1/1k 3/3k 10/10k 30/30k 100/100k 300/Full
ESIM 44.3 50.0 55.5 61.9 68.1 76.9
+premise+hypothesis 51.8(+7.5) 55.8(+5.8) 60.1(+4.6) 65.0(+3.1) 70.7(+2.6) 78.1(+1.2)
+premise+hypothesis+ConceptNet 52.4(+0.6) 57.9(+2.1) 62.4(+2.3) 66.6(+1.6) 71.3(+0.6) 78.8(+0.7)
Table 6: Development set results for MultiNLI (Matched + Mismatched) when reducing training data and embedding dimensionality with PCA. In parentheses we report the difference to the respective result directly above.
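The parenthesized values in Table 6 can be recomputed from the rows with a few lines of code. The variable names are hypothetical, and reading the second row as contextual refinement and the third row as refinement plus ConceptNet knowledge is an interpretation based on the accompanying text.

```python
settings = ["1/1k", "3/3k", "10/10k", "30/30k", "100/100k", "300/Full"]
esim = [44.3, 50.0, 55.5, 61.9, 68.1, 76.9]        # baseline row
refined = [51.8, 55.8, 60.1, 65.0, 70.7, 78.1]     # second row (refinement)
knowledge = [52.4, 57.9, 62.4, 66.6, 71.3, 78.8]   # third row (plus knowledge)

def gains(row_above, row):
    """Differences of a row to the row directly above, rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(row_above, row)]

refinement_gain = gains(esim, refined)
knowledge_gain = gains(refined, knowledge)

# Settings in which each kind of gain is largest.
best_refinement = settings[refinement_gain.index(max(refinement_gain))]
best_knowledge = settings[knowledge_gain.index(max(knowledge_gain))]
```

The gain from refinement is largest in the smallest setting and shrinks monotonically, whereas the additional gain from knowledge peaks at the 10/10k setting.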

Appendix C Further Analysis of Knowledge Utilization in RTE

(a) BiLSTM on MultiNLI.
(b) ESIM on MultiNLI.
(c) BiLSTM on SQuAD.
Figure 2: Performance differences when ignoring certain types of knowledge, i.e., relation predicates during evaluation. Normalized performance differences are measured on the subset of examples for which an assertion of the respective relation predicate occurs.

Is additional knowledge used?

To verify whether and how our models make use of additional knowledge, we conducted several experiments. First, we evaluated models trained with knowledge on our tasks while not providing any knowledge at test time. This ablation drops performance by 3.7–3.9% accuracy on MultiNLI, and by 4% F1 on SQuAD. This indicates the model is refining the representations using the provided assertions in a useful way.

What knowledge is used?

After establishing that our models are somehow sensitive to semantics, we wanted to find out which type of knowledge is important for which task. For this analysis we individually exclude assertions involving the most prominent predicates in our knowledge base when evaluating our models. The results are presented in Figure 2. They demonstrate that the biggest performance drop in total (blue bars) stems from "related to" assertions. This very prominent predicate appears much more frequently than other assertions and helps connect related parts of the two input sequences with each other. We believe that "related to" assertions offer benefits mainly from a modeling perspective by strongly connecting the input sequences with each other and thus bridging long-range dependencies, similar to attention. Looking at the relative drops obtained by normalizing the performance differences on the actually affected examples (green), we find that our models depend highly on the presence of "antonym" and "synonym" assertions for all tasks, as well as partially on "is a" and "derived from" assertions. This is an interesting finding which shows that the sensitivity of our models is selective with respect to the type of knowledge and task. The fact that the largest relative impact stems from antonyms is particularly interesting because it is known that such information is hard to capture with the distributional semantics contained in pre-trained word embeddings.
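The difference between the total and the normalized performance drop can be made concrete with a small sketch. It assumes per-example correctness flags for the full model and for the model evaluated without one relation predicate, plus a flag marking the examples for which an assertion with that predicate occurs; the function names and the toy data are hypothetical.

```python
def accuracy(flags):
    """Fraction of correct examples (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0

def ablation_drops(correct_full, correct_ablated, has_predicate):
    """Return (total drop, normalized drop) for one ablated relation predicate."""
    # Total drop: measured over all examples.
    total = accuracy(correct_full) - accuracy(correct_ablated)
    # Normalized drop: measured only on examples that contain the predicate.
    affected = [i for i, flag in enumerate(has_predicate) if flag]
    full_subset = [correct_full[i] for i in affected]
    ablated_subset = [correct_ablated[i] for i in affected]
    normalized = accuracy(full_subset) - accuracy(ablated_subset)
    return total, normalized

# Eight toy examples; only examples 1 and 2 contain the ablated predicate.
correct_full = [1, 1, 1, 0, 1, 1, 1, 1]
correct_ablated = [1, 0, 1, 0, 1, 1, 1, 1]
has_predicate = [0, 1, 1, 0, 0, 0, 0, 0]
total_drop, normalized_drop = ablation_drops(correct_full, correct_ablated, has_predicate)
```

A rare predicate can thus have a small total drop but a large normalized drop, which is why both views are reported in Figure 2.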