Infusing Knowledge into the Textual Entailment Task Using Graph Convolutional Networks

by Pavan Kapanipathi, et al.

Textual entailment is a fundamental task in natural language processing. Most approaches for solving the problem use only the textual content present in training data. A few approaches have shown that information from external knowledge sources like knowledge graphs (KGs) can add value, in addition to the textual content, by providing background knowledge that may be critical for a task. However, the proposed models do not fully exploit the information in the usually large and noisy KGs, and it is not clear how this information can be effectively encoded to be useful for entailment. We present an approach that complements text-based entailment models with information from KGs by (1) using Personalized PageRank to generate contextual subgraphs with reduced noise and (2) encoding these subgraphs using graph convolutional networks to capture KG structure. Our technique extends the capability of text models by exploiting structural and semantic information found in KGs. We evaluate our approach on multiple textual entailment datasets and show that the use of external knowledge helps improve prediction accuracy. This is particularly evident in the challenging BreakingNLI dataset, where we see an absolute improvement of 5-20% over four text-based models.


1 Introduction

Given two natural language sentences, a premise P and a hypothesis H, the textual entailment task – also known as natural language inference (NLI) – consists of determining whether the premise entails, contradicts, or is neutral with respect to the given hypothesis [18]. In practice, this means that textual entailment is characterized as either a three-class (entails/neutral/contradicts) or a two-class (entails/neutral) classification problem [11, 1].

Performance on the textual entailment task can be an indicator of whether a system, and the models it uses, are able to reason over text. This has tremendous value for modeling the complexities of human-level natural language understanding, and in aiding systems tuned for downstream tasks such as question answering [8].

Figure 1: A premise and hypothesis pair along with a relevant subgraph from ConceptNet. Blue concepts occur in the premise, green in the hypothesis, and purple connect them.

Most existing textual entailment models focus only on the text of the two sentences to improve classification accuracy [26, 41, 16]. A recent and promising line of work has turned towards extracting and harnessing more contextually relevant semantic information from knowledge graphs (KGs) for each textual entailment pair [2, 33]. These approaches map terms in the premise and hypothesis text to concepts in a KG, such as WordNet [19] or ConceptNet [29]. Figure 1 shows an example of such a mapping, where select terms from the premise and hypothesis sentences are mapped to concepts from a knowledge graph (blue and green nodes, respectively). However, these models suffer from one or more of the following drawbacks: (1) they cannot select and harness the semantic and structural information that connects the premise and hypothesis entities; in Figure 1, for example, encoding the paths between blue and green nodes via purple nodes would provide better context and help the system judge entailment more accurately; (2) they are not easily integrated with existing (and possibly complementary) NLI models; and (3) they are not flexible with respect to the type of KG that is used.

Contributions: We present an approach to the NLI problem that can augment any existing text-based entailment model with external knowledge. We specifically address the aforementioned challenges by: (1) introducing a neighbor-based expansion strategy in combination with subgraph filtering using Personalized PageRank (PPR) [10]. This approach reduces noise and yields contextually relevant subgraphs for premise and hypothesis texts that are extracted from larger external knowledge sources; (2) encoding subgraphs using graph convolutional networks (GCNs) [12], which are initialized with knowledge graph embeddings that capture structural and semantic information. This general approach to graph encoding allows us to use any external knowledge source that can be represented as a graph, such as WordNet, ConceptNet, or DBpedia [14]. We show that the additional knowledge can improve textual entailment performance using four standard benchmarks: SciTail, SNLI, MultiNLI, and BreakingNLI. Our technique also proves robust: in our BreakingNLI experiments, we see an absolute improvement of 5-20% over four text-based models.

2 Related Work

We categorize the related approaches for NLI into: (1) approaches that take only the premise and hypothesis text as input, and (2) approaches that utilize external knowledge.

Neural models focusing solely on the textual information [32, 40, 38] explore the sentence representations of premise and hypothesis and sometimes cross-sentence correlations of the representations. Hierarchical BiLSTM Max Pooling (HBMP) [30] belongs to the former category; it extends the general BiLSTM-based sentence encoding with a hierarchical structure and max pooling layers. Match-LSTM [32] and Decomposable Attention [26] learn cross-sentence correlations using attention mechanisms: the former uses an asymmetric network structure to learn a premise-attended representation of the hypothesis, and the latter uses symmetric attention to decompose the problem into sub-problems. Recent developments in NLI use transformer architectures such as BERT [5] and RoBERTa [17]. Models that use these architectures perform exceedingly well on many NLI leaderboards [41, 16]. While training these architectures is data intensive, their pre-trained embeddings can be used as input for any of the text-based models. In this work, we use pre-trained BERT embeddings as input for one of the text-based entailment models and show that external knowledge does augment the performance of that model.

Utilizing external knowledge has shown improvement in performance on many natural language processing (NLP) tasks [9, 20]. Recently, for NLI, Li et al. (2019) have argued that harnessing external knowledge is necessary, showing that the features from pre-trained language models and external knowledge complement each other. However, very few approaches utilize external knowledge for NLI [33, 2]. In particular, the best model of Wang et al. [33] combines rudimentary node information – in the form of concepts mentioned in premise and hypothesis text (blue and green nodes in Figure 1) – with the text information. However, this approach misses the rich subgraph structure that connects premise and hypothesis entities. In Figure 1, for example, this structural information consists of the paths between blue and green nodes via purple nodes, which provide better context and help the system judge entailment more accurately. Chen et al. [2] developed a model with WordNet-based co-attention that uses five engineered features from WordNet for each pair of words from premise and hypothesis. Because this model is tightly integrated with WordNet, it has the following drawbacks: (1) it cannot readily be used with other external knowledge sources such as ConceptNet or DBpedia, and (2) it is non-trivial to integrate with other state-of-the-art text-based entailment systems. Our work addresses the drawbacks of both approaches while achieving competitive performance on many NLI datasets.

The availability of large-scale datasets [1, 35, 11] has fostered the advancement of neural NLI models in recent years. However, it is important to discuss the characteristics of these datasets to understand what they intend to evaluate [6]. In particular, datasets such as [1, 11, 36] were created in such a way that they contain language artifacts that serve as significant cues for text-based neural models. These artifacts bias the models and make it harder to evaluate the impact of external knowledge [2, 33]. In order to evaluate approaches that are more robust and not susceptible to such biases, Glockner et al. [6] created BreakingNLI – an adversarial test set on which most of the common text-based approaches show a significant drop in performance. It is important to note that this test set is generated using a subset of relationships from online resources for English learning, making it more suitable for models exploiting KGs with a lexical focus, such as WordNet. Nevertheless, BreakingNLI represents a first and important step in the evaluation of models that utilize external knowledge sources.

One of the core contributions of this work is the application of Graph Convolutional Networks for encoding knowledge graphs. While (graph-)structured knowledge represents a significant challenge for classical machine learning models, graph convolutional networks [12] offer an effective framework for representation learning of graphs. They have been used successfully in multiple NLP tasks including question answering [3] and text classification [39]. In parallel, relational GCNs (R-GCNs) [28] have been designed to accommodate the highly multi-relational data characteristics of large KGs. Inspired by these works, we explore the use of R-GCNs for infusing KG knowledge into NLI.

3 KG-Augmented Entailment Model

In this section, we describe the central contribution of this paper – the KG-augmented Entailment System (KES). As shown in Figure 2, KES consists of two main components. The first component is a standard text encoder that creates a fixed-size representation of the premise and hypothesis texts. The second component selects contextual subgraphs for the premise and the hypothesis from a given KG, and encodes them using a graph convolutional network (GCN). Finally, these two embeddings are input to a standard feedforward layer for classification. We opted for a combined graph plus text approach because the noise and incompleteness of KGs render a purely graph-based approach insufficient as a standalone solution. However, we show that the KG-augmented model provides valuable context and additional knowledge that may be missing in text-only representations.

Figure 2: Primary components of KES: text embedding model, GCN-based graph embedder, and final feedforward classifier.

Figure 3: KES selects and encodes one knowledge subgraph for each premise and hypothesis pair by (1) linking terms in the premise and hypothesis to concepts in ConceptNet and including their neighbors (and all edges between them), (2) using a Personalized PageRank threshold to filter less relevant nodes, (3) adding two supernodes to reduce contextual subgraph disconnectivity, and (4) encoding the contextual subgraph with an R-GCN. Lastly, aggregated node embeddings are combined with text representations and fed into the final feedforward classifier. The symbols shown in the figure refer to the quantities defined in Equations (4) and (7), respectively.

3.1 A Standard Text Model

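Given the premise $p$ and hypothesis $h$, let $P = (p_1, \ldots, p_n)$ and $H = (h_1, \ldots, h_m)$ be the embeddings of words occurring in sequence in the premise and hypothesis texts. These embeddings are input to a neural network $f_{\text{text}}$ that outputs a fixed-size representation $x_{\text{text}}$:

$$x_{\text{text}} = f_{\text{text}}(P, H) \qquad (1)$$

where $f_{\text{text}}$ can be any of the existing state-of-the-art text-based NLI models [32, 30, 16].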

3.2 Contextual Subgraphs and their Representation using GCNs

The second component of our system is entirely graph based, as detailed in Figure 3. It uses an external KG to obtain a subgraph that is relevant with respect to the premise and hypothesis, and then applies a GCN to encode this subgraph into a fixed-size vector $x_{\text{graph}}$. In this work, we use ConceptNet as the external KG.

Initial Subgraph Extraction:

An initial subgraph is extracted from the KG by first extracting relevant concepts, and then performing a one-hop concept expansion. Specifically, we perform a max-substring match to extract concepts from premise and hypothesis texts against the KG. We then retrieve all edges between the concepts (nodes) thus extracted. For example, given the premise and hypothesis in Figure 1, we extract the concepts in text and map them to the KG (shown in blue and green). This initial set of nodes is then expanded to include (one-hop) neighbor nodes, and all the edges between those nodes (initial set and their neighbors) from the KG. In the example in Figure 1, we extract the purple nodes because they are directly connected to green and/or blue nodes.
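The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the paper: the toy edge list `TOY_KG`, the function names `match_concepts` and `one_hop_subgraph`, and the greedy longest-match over whitespace tokens (standing in for the max-substring match against ConceptNet) are all assumptions made for the example.

```python
from typing import Iterable

Edge = tuple[str, str, str]  # (head, relation, tail)

# Hypothetical toy KG standing in for ConceptNet.
TOY_KG: list[Edge] = [
    ("girl", "IsA", "person"),
    ("child", "IsA", "person"),
    ("ice_cream", "IsA", "dessert"),
    ("dessert", "RelatedTo", "food"),
    ("person", "Desires", "food"),
    ("car", "IsA", "vehicle"),
]

def match_concepts(text: str, vocabulary: set[str], max_len: int = 3) -> set[str]:
    """Greedy longest-match of token n-grams against KG concept names."""
    tokens = text.lower().split()
    found: set[str] = set()
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in vocabulary:
                found.add(candidate)
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return found

def one_hop_subgraph(seeds: set[str], kg: Iterable[Edge]) -> tuple[set[str], list[Edge]]:
    """Add the one-hop neighbors of the seeds, then keep every KG edge
    whose two endpoints are both in the expanded node set."""
    kg = list(kg)
    nodes = set(seeds)
    for head, _, tail in kg:
        if head in seeds:
            nodes.add(tail)
        if tail in seeds:
            nodes.add(head)
    edges = [e for e in kg if e[0] in nodes and e[2] in nodes]
    return nodes, edges

vocab = {n for head, _, tail in TOY_KG for n in (head, tail)}
seeds = match_concepts("a girl eats ice cream", vocab) | match_concepts("a child eats dessert", vocab)
nodes, edges = one_hop_subgraph(seeds, TOY_KG)
```

In this toy setting the edge between `person` and `food` is retained even though neither node is mentioned in the text, which mirrors the statement that all edges between the initial set and their neighbors are included.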

Personalized PageRank (PPR) to Filter Context:

The initial subgraph includes all one-hop neighbors. However, KGs are typically very large, and concept expansion by just one hop can introduce a significant amount of noise [33, 13]. For example, the concept girl is directly connected to over 1000 other nodes in ConceptNet. For this reason, we create a contextual subgraph by filtering the initial graph.

To enforce this notion of context and obtain the most relevant neighbor nodes given the text, we use Personalized PageRank (PPR) [25], which is chosen based on its application in recommendation systems and information retrieval [22, 24, 21]. PPR adds a bias to the PageRank algorithm by scoring the nodes conditioned on a subset of specific initial nodes in the graph (e.g., web pages visited or movies watched by a user). The bias is introduced by changing the uniformly distributed jump probability vector $\mathbf{p}$ of PageRank to a non-uniform distribution with respect to the specific initial set of nodes. For our initial graph, this set of specific nodes, denoted $C$, consists of the concepts mentioned in the premise and hypothesis. $\mathbf{p}$ is defined as:

$$p_i = \begin{cases} \frac{1}{|C|} & \text{if } i \in C \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

PPR-scores are then computed as follows:

$$\mathbf{r} = \alpha \, \mathbf{A} \, \mathbf{r} + (1 - \alpha) \, \mathbf{p} \qquad (3)$$

where $\mathbf{r}$ is a vector with scores for each node (post convergence); $\mathbf{A}$ is a normalized adjacency matrix (transition probability matrix); and $\alpha$ is the damping factor.

We normalize the PPR-scores based on the maximum score (of any node) in the initial graph. We then choose a filtering threshold $\theta$, and exclude all the nodes that are not in $C$ (i.e., the neighbor nodes) and that have a PPR-score below $\theta$; we also exclude the edges connected to those nodes. The remaining nodes and edges make up the contextual subgraph for the premise-hypothesis pair under consideration.
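The filtering procedure can be illustrated with a small power-iteration sketch. Several choices here are assumptions rather than details taken from the paper: the graph is treated as undirected, the jump vector is uniform over the seed concepts, the damping value of 0.85 is the conventional PageRank default, and the function names and toy data are hypothetical.

```python
def personalized_pagerank(nodes, edges, seeds, damping=0.85, iters=100):
    """Power iteration for PPR on an undirected view of the subgraph."""
    nodes = sorted(nodes)
    neighbors = {n: set() for n in nodes}
    for head, _, tail in edges:
        if head != tail:
            neighbors[head].add(tail)
            neighbors[tail].add(head)
    # Jump vector: uniform over the seed concepts, zero elsewhere.
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(jump)
    for _ in range(iters):
        new = {n: (1.0 - damping) * jump[n] for n in nodes}
        for n in nodes:
            if neighbors[n]:
                share = damping * scores[n] / len(neighbors[n])
                for m in neighbors[n]:
                    new[m] += share
            else:
                # Isolated node: hand its mass back to the seeds.
                for m in nodes:
                    new[m] += damping * scores[n] * jump[m]
        scores = new
    return scores

def filter_subgraph(nodes, edges, seeds, threshold, **ppr_kwargs):
    """Keep all seed concepts plus neighbors whose max-normalized PPR
    score reaches the threshold; keep only edges between kept nodes."""
    scores = personalized_pagerank(nodes, edges, seeds, **ppr_kwargs)
    top = max(scores.values())
    normalized = {n: s / top for n, s in scores.items()}
    kept = {n for n in nodes if n in seeds or normalized[n] >= threshold}
    kept_edges = [e for e in edges if e[0] in kept and e[2] in kept]
    return kept, kept_edges, normalized

# Hypothetical toy subgraph: three seed concepts and three neighbors.
nodes = {"girl", "child", "person", "dessert", "food", "sugar"}
edges = [
    ("girl", "IsA", "person"),
    ("child", "IsA", "person"),
    ("dessert", "RelatedTo", "food"),
    ("dessert", "RelatedTo", "sugar"),
    ("person", "Desires", "food"),
]
seeds = {"girl", "child", "dessert"}
kept, kept_edges, scores = filter_subgraph(nodes, edges, seeds, threshold=0.6)
```

Because scores are divided by the maximum, a threshold of 0.0 keeps the whole one-hop graph, and raising the threshold can only shrink the set of retained neighbor nodes.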

Encoding Contextual Subgraphs:

The contextual subgraph for premise and hypothesis is encoded using a relational graph convolutional network (R-GCN) [28, 23]. GCNs compute node embeddings by iteratively aggregating the embeddings of neighbor nodes. R-GCNs extend standard GCNs  [4, 12] to deal with the multi-relational data of KGs. They learn different weight matrices for each type of relation occurring in the graph. We use an R-GCN to compute node embeddings, and then aggregate these embeddings to obtain a fixed-size representation for the contextual subgraph.

We first extend the contextual subgraph by adding a self-loop edge for each node; this is to retain node embeddings during convolution. In order to retain information on concepts (nodes) that occur in the premise text and hypothesis text, we further extend the contextual graph by adding a premise supernode $s_p$ and a hypothesis supernode $s_h$. We then connect them to the concepts mentioned in the premise and hypothesis text, respectively, with bi-directional edges.

We then apply the algorithm suggested in [23] – which uses a simple sum as the aggregation function – but we include a normalization factor and disregard bias (similar to [28]):

$$h_i^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \right) \qquad (4)$$

Here, $h_i^{(l)}$ is the embedding of node $i$ at layer $l$; $R$ is the set of edge types; $N_i^r$ is the set of neighbors connected to node $i$ through the edge type $r$; $c_{i,r}$ is a normalization constant; $W_r^{(l)}$ are the learnable weight matrices, one per edge type $r$; and $\sigma$ is a non-linear activation function. We use the (symmetric) normalized Laplacian as a normalization constant.

The final node embeddings are aggregated using a summation-based graph-level readout function [37]:

$$h_G = \sum_{i \in V} \sigma\left( W_o \, h_i \right) \qquad (5)$$

where $h_i$ is the final embedding of node $i$, $V$ is the set of nodes in our contextual graph, $W_o$ is a learnable weight matrix, and $\sigma$ is an activation function. This summation-based readout function allows the encoder to learn representations that encode the structure of the graph, whereas max or mean pooling capture salient elements or distributions of elements [37], respectively.

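The final encoding of the contextual subgraph is the vector $x_{\text{graph}}$ obtained by concatenating $h_G$ – the aggregated embeddings of all the nodes – with the embeddings $h_{s_p}$ and $h_{s_h}$ of the premise and hypothesis supernodes as follows:

$$x_{\text{graph}} = [\, h_G \,;\, h_{s_p} \,;\, h_{s_h} \,] \qquad (6)$$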
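To make the encoding pipeline concrete, the following pure-Python sketch runs one relational convolution, a sum readout, and the final concatenation on a toy graph. It is an illustration under stated assumptions, not the paper's code: the normalization is simplified to the number of incoming neighbors per edge type, KG edges are treated as undirected, supernode edges share the KG edge type, supernodes start from zero vectors, the readout sums over the KG nodes only, and all names (`rgcn_layer`, `encode_subgraph`, `<P>`, `<H>`) are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def rgcn_layer(h, edges, weights):
    """One relational graph convolution with sum aggregation and no bias.

    h: {node: vector}; edges: (src, rel, dst), messages flow src -> dst;
    weights: {rel: matrix}. Dividing by the number of incoming neighbors
    per relation is an assumed, simplified normalization.
    """
    dim_out = len(next(iter(weights.values())))
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)
    out = {n: [0.0] * dim_out for n in h}
    for (dst, rel), srcs in incoming.items():
        for src in srcs:
            msg = matvec(weights[rel], h[src])
            out[dst] = [o + m / len(srcs) for o, m in zip(out[dst], msg)]
    return {n: relu(v) for n, v in out.items()}

def encode_subgraph(h0, kg_edges, premise_nodes, hypothesis_nodes, weights, W_out):
    """Add supernodes and self-loops, convolve once, read out, concatenate."""
    dim = len(next(iter(h0.values())))
    h = dict(h0)
    h["<P>"], h["<H>"] = [0.0] * dim, [0.0] * dim  # supernodes (zero init assumed)
    edges = []
    for s, _, d in kg_edges:                       # all KG relations share one type
        edges += [(s, "kg", d), (d, "kg", s)]
    for sup, members in (("<P>", premise_nodes), ("<H>", hypothesis_nodes)):
        for n in members:                          # bi-directional supernode edges
            edges += [(sup, "kg", n), (n, "kg", sup)]
    edges += [(n, "self", n) for n in h]           # self-loops retain each node's embedding
    h = rgcn_layer(h, edges, weights)
    h_graph = [0.0] * len(W_out)
    for n in h0:                                   # sum readout over the KG nodes
        h_graph = [a + b for a, b in zip(h_graph, relu(matvec(W_out, h[n])))]
    return h_graph + h["<P>"] + h["<H>"]           # list concatenation

rng = random.Random(0)
dim = 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

h0 = {n: [rng.uniform(-1, 1) for _ in range(dim)] for n in ("girl", "person", "child")}
kg_edges = [("girl", "IsA", "person"), ("child", "IsA", "person")]
weights = {"kg": rand_matrix(dim, dim), "self": rand_matrix(dim, dim)}
x_graph = encode_subgraph(h0, kg_edges, {"girl"}, {"child"}, weights, rand_matrix(dim, dim))
```

The resulting vector has three times the embedding dimension. With the 300-dimensional embeddings reported in the experimental setup this would give 900 entries, which is consistent with the GCN feature size mentioned in the Discussion.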
3.3 Final Classifier

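The final feedforward classifier $f_{\text{cls}}$ takes as input the text encoding $x_{\text{text}}$ from Equation (1) and the graph encoding $x_{\text{graph}}$ from Equation (6) to classify the premise and hypothesis as entailment/contradiction/neutral:

$$\hat{y} = f_{\text{cls}}\left([\, x_{\text{text}} \,;\, x_{\text{graph}} \,]\right) \qquad (7)$$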
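A classifier of this kind can be sketched as a concatenation followed by a small feedforward network. The single hidden ReLU layer, the softmax output, the label order, and the hand-picked toy weights below are assumptions for illustration; the paper does not specify these details in this section.

```python
import math

LABELS = ("entailment", "contradiction", "neutral")  # label order is an assumption

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x_text, x_graph, W1, b1, W2, b2):
    """Concatenate both encodings, apply one hidden ReLU layer, then softmax."""
    x = list(x_text) + list(x_graph)
    hidden = [max(0.0, v) for v in linear(W1, b1, x)]
    probs = softmax(linear(W2, b2, hidden))
    return LABELS[probs.index(max(probs))], probs

# Toy encodings and weights, chosen by hand purely for illustration.
x_text = [0.2, -0.1]
x_graph = [0.5, 0.0, 0.3]
W1 = [[0.1, 0.2, 0.3, 0.4, 0.5], [-0.5, 0.4, -0.3, 0.2, -0.1]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]
label, probs = classify(x_text, x_graph, W1, b1, W2, b2)
```

Because the text and graph encodings are simply concatenated, any text model that produces a fixed-size vector can be swapped in without changing the graph side.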
4 Experiments & Results

In this section, we describe the experiments that we performed to evaluate our approach; the setup, including datasets, models, and implementations; and the results.

4.1 Datasets

We considered the most popular NLI datasets: SNLI [1], SciTail [11], BreakingNLI [6], and MultiNLI [35]. While SNLI and MultiNLI are prominent datasets covering a wide range of topics, SciTail offers an in-depth focus on science domain questions. Since this difference is also reflected in linguistic variations, these datasets allow us to evaluate very different settings. As mentioned in Section 2, these datasets carry linguistic cues that are easily captured by neural text-based models. Hence, to show the impact of knowledge graphs, we primarily evaluate our approach on the BreakingNLI dataset.

4.2 Knowledge Graphs

Prior work on NLI has shown that ConceptNet contains information more useful to this problem compared to DBpedia and WordNet [33]. Furthermore, Speer et al. [29] showed that, when ConceptNet is combined with embeddings acquired from distributional semantics, it provides applications with a richer understanding than narrower resources such as DBpedia and WordNet. We therefore focus on ConceptNet for now, leaving experiments with other KGs as future work.

4.3 Models for Text Representations

We experimented with four different text-based models to obtain numerical representations of premise and hypothesis text (Equation (1)). Our selection criteria were: (1) performance on leaderboards, (2) relevance for NLP in general, and (3) ease of implementation and availability. Our goal is to augment each of these models with external knowledge and hence test the generalizability of KES, which also shows the benefits of its modularity. We used the AllenNLP library to implement the models described below (see also Section 2); AllenNLP already includes the DecompAttn model.

Decomposable Attention Model (DecompAttn). One of the earliest and most common baseline models used for NLI [26, 33, 6, 2]. Since it is an early baseline, our hypothesis is that KES can add more value to it and yield a larger delta in performance.

match-LSTM. An NLI model that not only performs well on multiple NLI leaderboards such as SciTail and SNLI but is also applicable to other NLP tasks such as question answering [33].

BERT + match-LSTM. A version of match-LSTM that uses BERT embeddings instead of the GloVe embeddings used in the former. We opted for this model to take advantage of the improvements BERT embeddings have generated for numerous NLP tasks.

Hierarchical BiLSTM Max Pooling (HBMP). Shows superior performance on multiple NLI benchmarks including SciTail, SNLI, and MultiNLI.

4.4 Models using External Knowledge

There are two other models exploiting external knowledge for NLI. We compare them to KES:

KIM  [2] uses five different features for every pair of terms from premise and hypothesis. The features are extracted from WordNet and they are infused in the model as knowledge-based co-attention mechanism.

ConSeqNet  [33] takes the concepts mentioned in premise and hypothesis as input to a match-LSTM model (with a GRU encoding). It is important to note that the match-LSTM model better suits text than graph structure because it uses a seq2seq encoder to account for the inherent sequential nature of text, which is not present in graphs.

Models            SciTail                SNLI                   MultiNLI               BreakingNLI
                  Text    Text+Graph     Text    Text+Graph     Text    Text+Graph     Text    Text+Graph
match-LSTM        82.54   82.22 (0.6)    83.60   83.94 (0.6)    71.32   71.67 (0.8)    65.11   78.72
BERT+match-LSTM   89.13   90.68 (0.2)    85.78   85.97 (0.6)    77.96   76.73 (0.6)    59.42   77.59
HBMP              81.37   83.49 (0.2)    84.61   83.84 (0.2)    69.27   68.42 (0.6)    60.31   63.60
DecompAttn        76.57   72.43 (0.8)    79.28   85.56 (0.6)    64.89   71.93 (0.6)    51.3*   59.83
KIM [2]           -       NE             -       88.6*          -       76.4*          -       83.1*
ConSeqNet [33]    84.2*   85.2*          83.60   83.34          71.32   70.9           65.11   61.12

Table 1: Entailment accuracy results of various text models, original (Text) vs. augmented with KG (Text+Graph). PPR $\theta$-values in parentheses. * Reported values from related works.

4.5 Experimental Setup and Implementation

To evaluate the impact of KES on NLI in general and its compatibility with various existing models, we compared all text-based models described above (Section 4.3) to a combined text+graph model. We used the datasets described in Section 4.1. Because the BreakingNLI test set is derived from the SNLI training set, all models trained on SNLI were evaluated on both the SNLI and BreakingNLI test sets.

Text Model Parameters.

We chose hyperparameters as reported in related works. For match-LSTM and BERT+match-LSTM, we refer to [33]. For HBMP and DecompAttn, we used the parameters from [30] and [26], respectively.

KES Setup and Training.

As initial graph embeddings, we considered TransH [34] and ComplEx [31]. For each model (i.e., each combination of a text-only model with the graph model), we experimented with both embedding approaches and selected the one that performed best on the validation sets. All GCNs were configured as follows: two edge types (one for edges in ConceptNet and one for the self-loops); 300 dimensions for each embedding across all layers; one convolutional layer; one additional linear layer (after the convolution); and ReLU for all activations. These parameters yielded the best average accuracy on the validation sets, so we chose them uniformly for all models for consistency across our approaches.

The Personalized PageRank threshold $\theta$ for filtering the subgraphs was also tuned as a hyperparameter. We experimented with values of 0.2, 0.4, 0.6, and 0.8. We did not experiment with whole one-hop graphs ($\theta$ = 0.0), as they have been shown to increase in size very rapidly over single hops in ConceptNet [33].

Training (of the combined models) consisted of 140 epochs with a patience of 20 epochs. Batch size and learning rate over all the experiments remained 64 and 0.0001 to make the models comparable to each other.
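The training schedule above corresponds to a standard early-stopping loop, sketched below. The callables `train_epoch` and `validate` are hypothetical placeholders, and using validation accuracy as the monitored quantity is an assumption; only the epoch budget, patience, batch size, and learning rate come from the text.

```python
# Values reported in the text; the dictionary layout itself is illustrative.
TRAINING_CONFIG = {
    "max_epochs": 140,
    "patience": 20,
    "batch_size": 64,
    "learning_rate": 1e-4,
}

def train_with_early_stopping(train_epoch, validate, max_epochs, patience):
    """Train for up to max_epochs, stopping once the validation score has
    not improved for `patience` consecutive epochs."""
    best_score, best_epoch, epochs_run = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_epoch(epoch)          # one pass over the training data
        score = validate(epoch)     # e.g. validation accuracy (assumed metric)
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, epochs_run

# Simulated validation curve that stops improving after epoch 9.
best_epoch, best_score, epochs_run = train_with_early_stopping(
    train_epoch=lambda epoch: None,
    validate=lambda epoch: min(epoch, 9) / 10,
    max_epochs=TRAINING_CONFIG["max_epochs"],
    patience=TRAINING_CONFIG["patience"],
)
```

Keeping batch size and learning rate fixed across all text models, as the text describes, means that differences in accuracy are attributable to the models rather than to the optimization settings.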

4.6 Results

Table 1 gives an overview of our results. They demonstrate that KES, and thus external knowledge, has the biggest impact on the BreakingNLI test set. The accuracy of every text-only model improves: by 18 percentage points for the BERT-based match-LSTM model, 13 for match-LSTM, 3 for HBMP, and 8 for DecompAttn. Notably, the most dramatic impact of KES is on the BERT-based match-LSTM model, which is generally the strongest text-only model on the other datasets.

Despite their competitive performance on SNLI, all text-only models perform significantly worse on the BreakingNLI test set when compared to the SNLI test set, which is consistent with observations from the paper introducing BreakingNLI originally. On average, the accuracies of text-only models drop 24 percentage points between SNLI and BreakingNLI. In contrast, KES shows only modest decreases in performance between SNLI and BreakingNLI when a GloVe- or BERT-based match-LSTM text model is used, with accuracy decreasing at most 8 percentage points. However, there is a significant decrease in performance between SNLI and BreakingNLI when KES uses HBMP or DecompAttn as its text model (average decrease of 23 percentage points), suggesting a potentially complex interaction between text and external knowledge features.
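The percentage-point figures quoted in this subsection can be recomputed directly from the SNLI and BreakingNLI columns of Table 1. The short script below does so; the dictionary layout and variable names are illustrative, while the numbers are copied from the table.

```python
# (text-only, text+graph) accuracies from Table 1.
table1 = {
    "match-LSTM":      {"snli": (83.60, 83.94), "breaking": (65.11, 78.72)},
    "BERT+match-LSTM": {"snli": (85.78, 85.97), "breaking": (59.42, 77.59)},
    "HBMP":            {"snli": (84.61, 83.84), "breaking": (60.31, 63.60)},
    "DecompAttn":      {"snli": (79.28, 85.56), "breaking": (51.3, 59.83)},
}

# Gain from adding the graph component on BreakingNLI.
gain = {m: r["breaking"][1] - r["breaking"][0] for m, r in table1.items()}

# Drop from SNLI to BreakingNLI, for text-only models and for KES.
text_drop = {m: r["snli"][0] - r["breaking"][0] for m, r in table1.items()}
kes_drop = {m: r["snli"][1] - r["breaking"][1] for m, r in table1.items()}

avg_text_drop = sum(text_drop.values()) / len(text_drop)
avg_kes_drop_weak = (kes_drop["HBMP"] + kes_drop["DecompAttn"]) / 2
max_kes_drop_match_lstm = max(kes_drop["match-LSTM"], kes_drop["BERT+match-LSTM"])

for model in table1:
    print(f"{model:16s} gain={gain[model]:.2f} "
          f"text drop={text_drop[model]:.2f} KES drop={kes_drop[model]:.2f}")
print(f"average text-only drop: {avg_text_drop:.2f}")
print(f"average KES drop (HBMP, DecompAttn): {avg_kes_drop_weak:.2f}")
```

The average text-only drop comes out at roughly 24 points and the average KES drop for HBMP and DecompAttn at roughly 23 points, matching the text; the per-model gains in the text are these differences with the fractional part dropped.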

These results support three important claims. First, they demonstrate that KES is modular in that it can be combined with existing text models with different architectures. Second, the KES approach effectively infuses external knowledge into existing entailment models to improve performance on the challenging BreakingNLI dataset. Third, KES is robust to dataset changes that dramatically decrease the performance of other NLI models.

Comparison to Other KG-based Models.

Table 1 also contains the results for the graph-based models KIM and ConSeqNet. Both show comparable performance to match-LSTM KES, with KIM performing best on every dataset for which its results are reported. We discuss important differences between KES, KIM, and ConSeqNet below.

KIM introduces an external knowledge-based co-attention mechanism, using five manually engineered features from WordNet for every pair of words in premise and hypothesis. These features are specific to WordNet relations, which means that the model can only be used with WordNet or comparable KGs with the same set of relations. One can argue that, because KIM depends on WordNet, it is especially suited to BreakingNLI, as WordNet contains exactly the type of lexical information that is targeted by BreakingNLI. Another difference between KIM and KES is that it is not clear how to adapt KIM’s five engineered features to a different textual entailment system. In contrast, KES is not tied to any particular KG, KG vocabulary, or existing entailment system. One of the practical goals of KES is to develop an approach that is easily adaptable to different datasets, knowledge graphs, and existing entailment models. Although we tune only the PPR threshold as a hyperparameter, our knowledge-augmented approaches perform almost on par with KIM on SNLI and MultiNLI; the exception is the BreakingNLI dataset, where our best model trails KIM by 4.4 percentage points.

ConSeqNet, similar to our model and unlike KIM, does provide an architecture to plug in any text-based entailment model. However, there are two primary differences between our work and ConSeqNet. First, we are the first to encode the graph structure of the knowledge graph, whereas ConSeqNet uses only the concepts mentioned in the text, encoding them using RNNs. Second, in comparison to ConSeqNet, our approach performs better with different entailment models over all the datasets. In particular, on the BreakingNLI dataset, our implementation of ConSeqNet shows a drop in performance in comparison to its text-based method. This is surprising and may need further investigation.

In summary, in addition to the performance goals of KIM and ConSeqNet, KES seeks to infuse entailment models with knowledge in a way that is modular and sensitive to graph structure, independent of a specific KG.

Harnessing External Knowledge.

Table 2 shows the average number of concepts (nodes) and relations (edges) in contextual subgraphs generated by KES, ConSeqNet, and KIM, excluding those that were explicitly mentioned in the premise and hypothesis texts. Unlike ConSeqNet and KIM, KES is able to use a great amount of external knowledge that is related to the premise and hypothesis but not explicitly mentioned. As observed in prior work [33], expanding subgraphs by even one hop results in very large graphs, making PPR filtering very important.

PPR threshold   SciTail (17.74*)       SNLI (11.5*)           MultiNLI (17.5*)
                Edges    Nodes         Edges    Nodes         Edges    Nodes
0.2             42.65    10.14         80.29    19.83         76.27    16.15
0.4             26.72    7.48          25.70    8.15          33.82    6.48
0.6             15.53    4.35          14.08    4.65          23.97    3.44
0.8             11.67    3.04          9.98     3.18          20.27    2.05
ConSeqNet       0        0 (17.74*)    0        0 (11.5*)     0        0 (17.5*)
                Concepts that occur in text. No edges or new concepts are added from ConceptNet.
KIM             Features based on fixed WordNet relations. No new concepts are added.

Table 2: Analysis of the graphs generated for each PPR threshold. Values are averages over the training data after combining premise and hypothesis in a single graph. * Denotes the average number of concepts mentioned in text.

5 Discussion

Negative Results: In Table 1, we observe two results that did not confirm our hypotheses: (1) the reduced text+graph improvement on BreakingNLI for HBMP, and (2) lower text+graph performance for DecompAttn on SciTail (> 2 percentage points). We are investigating these issues, but one possible explanation for the reduced improvement on HBMP is that it is one of the few text-based models with a large final hidden layer (a feature vector of approximately 14K dimensions) in comparison to the features from the GCN model (900), which possibly biases the final classifier towards the text-based features.

Personalized PageRank Threshold: Our initial plan for using PPR thresholds was to make it a preprocessing step and fix one threshold for a dataset on a base model. However, as shown in Table 1, tuning the PPR threshold as a hyperparameter for each trained model showed better performance. Also, the PPR threshold $\theta$ = 0.8, in particular, retains very few concepts that are not mentioned in the premise and hypothesis text, whereas contextual subgraphs from $\theta$ = 0.2 can contain about as many concepts from external knowledge as are mentioned in the text (Table 2). PPR filtering is just one possible method for reducing noise that results from neighborhood-based expansion techniques. In future work we intend to investigate a different filtering approach where only those paths that connect premise and hypothesis are included.

Dataset characteristics. We evaluated our KES approach on NLI datasets that are widely used in the literature. However, there has been criticism regarding the way these datasets are created and the resulting biases that can be exploited by learning algorithms [6, 15, 7, 27]. Even in our work, Table 1 shows that the DecompAttn model is consistently improved by KES on SNLI, MultiNLI, and BreakingNLI, whereas we see the opposite effect on SciTail. A qualitative analysis of the SciTail dataset suggested that the use of a KG can negatively impact performance because of the high overlap between premise and hypothesis terms.

Text-based models trained on SNLI perform significantly worse on the BreakingNLI test set, consistent with the results reported above. Notably, the estimated human performance on the BreakingNLI test set is higher than that on the original SNLI test set, providing further evidence that models that perform well on SNLI but poorly on BreakingNLI are poor approximations for human inference. On the other hand, NLI models that generalize well to BreakingNLI are more likely to be better approximations for human-like inference. The complexity of the BreakingNLI test set and its characteristics make it the most interesting evaluation set.

Complexity of Knowledge Graphs and their usage: As mentioned above, the current state of the art for BreakingNLI is the KIM model, which achieves an accuracy of 83%, while our best performing KES model (KES with the match-LSTM text model) achieves an accuracy of 79%. This difference can be attributed to aspects of the KIM model that make it particularly well suited to the BreakingNLI dataset at the expense of model flexibility and generality. KIM relies on WordNet, which has lexical information that aligns very closely with the challenging aspects of BreakingNLI. This focus clearly benefits performance on the task. However, WordNet is relatively small (117k triples, i.e., edges) compared to ConceptNet (3.15M triples) and has a very specific scope that is unlikely to cover the broad classes of entailment that occur in natural language. For example, recognizing textual entailment may depend on world knowledge that is not lexical in nature. In such cases it would be necessary to invoke a model that is not primarily focused on lexical knowledge. This is one of the motivations behind the KES approach: to support very large KGs (e.g., ConceptNet) and to avoid dependencies on any single KG or domain area. An important topic for future work will be to understand the shortcomings of various knowledge sources, to determine how to choose the appropriate knowledge sources for a given task, and to continue exploring graph filtering and selection methods that leverage large-scale KGs while minimizing noise. KIM mitigates the noise issue by using a restricted set of relations to provide greater focus and minimize intrusion of potentially irrelevant knowledge. Again, this is a characteristic of KIM that will not necessarily generalize well to other NLI datasets, such as SciTail, which may depend less on hypernym and hyponym relations, and more on knowledge about everyday physical objects and processes.

6 Conclusion

In this paper, we presented a systematic approach for infusing external knowledge into the textual entailment task using contextually relevant subgraphs extracted from a KG and encoded with graph convolutional networks. These graph representations are combined with standard text-based representations into a KG-augmented entailment system, which yields a significant improvement on the challenging BreakingNLI dataset. Additionally, the KES approach is modular, can be used with any knowledge graph, and generalizes to multiple datasets. In future work, we plan to consider other KGs and to investigate alternative graph representations. Furthermore, it would be interesting to see how KES performs on popular question answering datasets.


  • [1] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proc. of EMNLP, Cited by: §1, §2, §4.1.
  • [2] Q. Chen, X. Zhu, Z. Ling, D. Inkpen, and S. Wei (2018) Neural natural language inference models enhanced with external knowledge. In Proc. of ACL, Volume 1, pp. 2406–2417. Cited by: §1, §2, §2, §4.3, §4.4, Table 1.
  • [3] N. De Cao, W. Aziz, and I. Titov (2019) Question answering by reasoning across documents with graph convolutional networks. In Proc. of NAACL-HLT, Volume 1, pp. 2306–2317. Cited by: §2.
  • [4] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. of NIPS, pp. 3844–3852. Cited by: §3.2.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Volume 1, pp. 4171–4186. Cited by: §2.
  • [6] M. Glockner, V. Shwartz, and Y. Goldberg (2018) Breaking NLI systems with sentences that require simple lexical inferences. In Proc. of ACL, Volume 2, pp. 650–655. Cited by: §2, §4.1, §4.3, §5.
  • [7] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. Cited by: §5.
  • [8] S. Harabagiu and A. Hickl (2006) Methods for using textual entailment in open-domain question answering. In Proc. of CICLing and ACL, pp. 905–912. Cited by: §1.
  • [9] X. Huang, J. Zhang, D. Li, and P. Li (2019) Knowledge graph embedding based question answering. In Proc. of ACM WSDM, Cited by: §2.
  • [10] G. Jeh and J. Widom (2003) Scaling personalized web search. In Proc. of WWW, pp. 271–279. Cited by: §1.
  • [11] T. Khot, A. Sabharwal, and P. Clark (2018) SciTail: a textual entailment dataset from science question answering. In Proc. of AAAI, Cited by: §1, §2, §4.1.
  • [12] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. of ICLR, Cited by: §1, §2, §3.2, §3.2.
  • [13] S. Lalithsena, S. Perera, P. Kapanipathi, and A. Sheth (2017) Domain-specific hierarchical subgraph extraction: a recommendation use case. In Proc. of Big Data, pp. 666–675. Cited by: §3.2.
  • [14] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al. (2015) DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: §1.
  • [15] T. Li, X. Zhu, Q. Liu, Q. Chen, Z. Chen, and S. Wei (2019) Several experiments on investigating pretraining and knowledge-enhanced models for natural language inference. arXiv preprint arXiv:1904.12104. Cited by: §5.
  • [16] X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504. Cited by: §1, §2, §3.1.
  • [17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • [18] B. MacCartney and C. D. Manning (2009) Natural language inference. Stanford University Stanford. Cited by: §1.
  • [19] G. A. Miller (1995) WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
  • [20] S. Moon, P. Shah, A. Kumar, and R. Subba (2019) OpenDialKG: explainable conversational reasoning with attention-based walks over knowledge graphs. In Proc. of ACL, pp. 845–854. Cited by: §2.
  • [21] C. Musto, G. Semeraro, M. de Gemmis, and P. Lops (2017) Tuning personalized pagerank for semantics-aware recommendations based on linked open data. In Proc. of ESWC, pp. 169–183. Cited by: §3.2.
  • [22] P. Nguyen, P. Tomeo, T. Di Noia, and E. Di Sciascio (2015) An evaluation of SimRank and personalized PageRank to build a recommender system for the web of data. In Proc. of WWW, Cited by: §3.2.
  • [23] T. H. Nguyen and R. Grishman (2018) Graph convolutional networks with argument-aware pooling for event detection. In Proc. of AAAI, Cited by: §3.2.
  • [24] A. Otegi, X. Arregi, O. Ansa, and E. Agirre (2015) Using knowledge-based relatedness for information retrieval. Knowledge and Information Systems 44 (3), pp. 689–718. Cited by: §3.2.
  • [25] L. Page, S. Brin, R. Motwani, and T. Winograd (1999) The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab. Cited by: §3.2.
  • [26] A. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In Proc. of EMNLP, pp. 2249–2255. Cited by: §1, §2, §4.3, §4.5.
  • [27] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042. Cited by: §5.
  • [28] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In Proc. of ESWC, pp. 593–607. Cited by: §2, §3.2.
  • [29] R. Speer, J. Chin, and C. Havasi (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. In Proc. of AAAI, Cited by: §1.
  • [30] A. Talman, A. Yli-Jyrä, and J. Tiedemann (2019) Sentence embeddings in NLI with iterative refinement encoders. Natural Language Engineering 25 (4), pp. 467–482. Cited by: §2, §3.1, §4.5.
  • [31] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080. Cited by: §4.5.
  • [32] S. Wang and J. Jiang (2016) Learning natural language inference with LSTM. In Proc. of NAACL-HLT, pp. 1442–1451. Cited by: §2, §3.1.
  • [33] X. Wang, P. Kapanipathi, R. Musa, M. Yu, K. Talamadupula, I. Abdelaziz, M. Chang, A. Fokoue, B. Makni, N. Mattei, and M. Witbrock (2019) Improving natural language inference using external knowledge in the science questions domain. In Proc. of AAAI, Cited by: §1, §2, §2, §3.2, §4.2, §4.3, §4.3, §4.4, §4.5, §4.5, §4.6, Table 1.
  • [34] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In Proc. of AAAI, Cited by: §4.5.
  • [35] A. Williams, N. Nangia, and S. R. Bowman (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §2, §4.1.
  • [36] A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of NAACL-HLT, Volume 1, pp. 1112–1122. Cited by: §2.
  • [37] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks? In Proc. of ICLR, Cited by: §3.2.
  • [38] X. Yang, X. Zhu, H. Zhao, Q. Zhang, and Y. Feng (2019) Enhancing unsupervised pretraining with external knowledge for natural language inference. In Proc. of Canadian AI, pp. 413–419. Cited by: §2.
  • [39] L. Yao, C. Mao, and Y. Luo (2019) Graph convolutional networks for text classification. In Proc. of AAAI, Vol. 33, pp. 7370–7377. Cited by: §2.
  • [40] D. Yoon, D. Lee, and S. Lee (2018) Dynamic self-attention: computing attention over words dynamically for sentence embedding. arXiv preprint arXiv:1808.07383. Cited by: §2.
  • [41] Z. Zhang, Y. Wu, Z. Li, S. He, H. Zhao, X. Zhou, and X. Zhou (2018) I know what you want: semantic learning for text comprehension. arXiv preprint arXiv:1809.02794. Cited by: §1, §2.