## 1 Introduction

Knowledge Graphs (KGs) contain a vast amount of knowledge about the world and the phenomena within it. Such knowledge can be very useful in natural language processing (NLP) tasks such as question answering and textual entailment – tasks that can benefit from a large amount of specialized, domain-specific knowledge. However, recent approaches that have tried to use KGs as sources of external knowledge for the textual entailment problem [wang2019improving] have found that bringing in external knowledge from KGs comes with a significant downside: the noise that the external knowledge brings with it. This noise arises mainly because KGs are very large graphs that often contain wrong, repeated, and incomplete information. Retrieving a sub-graph of a given KG that is relevant to a given problem instance is thus a non-trivial task, and continues to be a topic of much research.

In this paper, we focus on this problem from the perspective of search in the space of graphs. Specifically, we consider the problem of extracting the sub-graph of a given (large) graph that is most relevant to a given context or problem setting – we call this the knowledge graph contextualization problem. There are many ways of extracting such a sub-graph, and they must all be tied in some way to the overall metric: that is, performance on the problem setting in question. For the purposes of this study, we fix the problem setting as the textual entailment or natural language inference (NLI) problem, following wang2019improving (wang2019improving). The textual entailment problem is usually cast as a classification problem, where a given instance consists of a premise p and a hypothesis h, and the label indicates the relationship between h and p.

The problem with bringing in external knowledge from a knowledge graph is one of scale: for any given entity (node) in the knowledge graph, a large number of nodes are retrieved within a few hops. Many of these nodes are completely irrelevant to the task at hand, and are not influenced in any way by the context of the problem being solved. Figure 1 shows an example of an NLI problem instance, along with a sub-graph for that instance. The key problem to be solved is one of ranking and filtering the retrieved nodes according to some context-sensitive measure. In this paper, we use the entities in the premise p and hypothesis h – as well as the paths that connect them in an external KG – to do this filtering. In brief, our method is as follows: first, we generate the Cartesian product of the premise and hypothesis entities; then, for each pair in that product, we compute the shortest path between the premise entity and the hypothesis entity. The computation of the shortest path is done over a copy of the ConceptNet graph – however, we evaluate various cost functions to predict the closeness of entities (nodes) in the ConceptNet graph. Each heuristic gives rise to a different, cost-customized copy of the graph, in the following manner: we keep the structure of the graph unchanged, but add a weight to each edge that is computed using a specific cost function. In this way, we invert the traditional notion of the heuristic as used in A* search [hart1968formal] – instead of assigning a cost to each node in the graph, we transfer that cost on to each outgoing edge of the node. We evaluate various cost functions that change the nature of the shortest path between two given entities in a KG, and test the knowledge thus retrieved for any given pair via performance on the textual entailment problem.

## 2 Related Work

### 2.1 Natural Language Inference

Early work on the NLI problem was limited by the availability of only small datasets, and mostly relied on hand-crafted features [androutsopoulos2010survey]. To address this problem, bowman2015large (bowman2015large) introduced the large-scale SNLI dataset for NLI, and proposed an LSTM-based neural network model that was the first generic neural model without any hand-crafted features. bowman2015large use their LSTM model to encode the premise and hypothesis sentences, whose concatenation is then fed to a perceptron classifier. In addition to LSTM-based models, several other neural network models have been used for sentence encoding, such as GRUs [vendrov2015order], tree-based CNNs [mou2015natural], self-attention networks [shen2018reinforced], and BiLSTMs [liu2016learning]. “Matching aggregation” approaches, on the other hand, exploit various matching methods to obtain an interactive premise and hypothesis space. For example, wang2015learning (wang2015learning) perform a word-by-word matching of the hypothesis with the premise using match-LSTM (mLSTM). rocktaschel2015reasoning (rocktaschel2015reasoning) use a weighted attention mechanism to get an embedding of the hypothesis conditioned on the premise. parikh2016decomposable (parikh2016decomposable) decompose the entailment problem into sub-problems through an intra-sentence attention mechanism, and are thus able to parallelize the training process. ghaeini2018dr (ghaeini2018dr) encode both the premise and the hypothesis conditioned on each other, using BiLSTMs followed by a soft-attention mechanism over those encodings.

### 2.2 Knowledge Graphs and NLI

Although there have been extensive studies on the NLI task, the potential for exploiting external knowledge encoded in knowledge graphs (KGs) has not been explored in enough detail. Among the few existing approaches, chen2018 chen2018 use WordNet [miller1995wordnet] as the external knowledge source for NLI. They generate features based on WordNet using the relationships in it. However, WordNet, being a lexical database, possesses very few linguistic relationships among entities, and thus its richness as an external knowledge source is limited. There are other KGs such as DBpedia, Yago [yago], Freebase [freebase] etc. that have become popular due to their expressiveness and the richer information contained in them. One issue with expressive KGs such as DBpedia and ConceptNet [liu2004conceptnet, speer2017conceptnet] is that they are quite massive in terms of the nodes and edges contained in them, which makes it hard to extract relevant information useful for the entailment task. However, in spirit, the closest approach to our current work is that of wang2019improving wang2019improving.

## 3 Methodology

We first describe the novel methodology that we propose in this paper. The core of our approach is motivated by the understanding that knowledge graphs (KGs) like ConceptNet are essentially directed graphs with labeled edges – the labels denote the relations between the two nodes connected by the edge, while the nodes themselves denote entities. We posit that one of the keys to correctly classifying instances of the textual entailment task is the set of relationships between the various entities involved in the two propositions. Identifying these relationships using only the text content of the entailment instance amounts to an approximate reconstruction of the underlying relationships. While embedding-based methods (see Section 2) situate the sentences in some implicit knowledge-enhanced context, we seek to situate them in a much more explicit graphical context.

In brief, we do this as follows: first, we create different versions of the ConceptNet knowledge graph that feature customized costs as the weights on the relation-edges – we call these customized cost graphs. Following this, for each labeled premise and hypothesis pair in the dev partition of the SciTail dataset, we extract the entities from each respective sentence. We then take the Cartesian product of the premise and hypothesis entities (respectively) to create ordered premise-hypothesis entity pairs. We then find the shortest path between each of these entity pairs in the customized cost graphs. For each premise-hypothesis sentence pair (that is, a textual entailment problem instance), the collection of shortest paths thus found is then associated with the corresponding label for purposes of learning how to predict the entailment accurately (described in more detail in Section 4). In the rest of this section, we provide the details of the process that we have just described.

### 3.1 Customized Cost Graphs

The first step towards constructing that explicit graphical context is to pick the external knowledge repository. In this paper, we pick the ConceptNet [speer2017conceptnet] graph, which contains crowdsourced and expert-created knowledge in the form of entities (represented by nodes in the graph) and relations (represented by edges). Typically, the relations (edges) in ConceptNet carry labels that denote the semantic meaning of that edge. These edges are also accompanied by a weight. The central contribution of our work is to redefine these weights along the edges to take into account the structure of the graph. More specifically, we create copies of the ConceptNet knowledge graph and replace the default weights with customized weights on the relation-edges.

### 3.2 Cost Heuristics

Our quest to retrieve the right knowledge for classifying a textual entailment instance is grounded in a simple insight: not all relations between entities are equal. Put another way, the ConceptNet graph – which is made up of entities and the relation edges that connect them – needs to be re-weighted to reflect this fact. This re-weighting happens by rewriting the weights on the edges of the graph, and treating those weights as a cost that is incurred any time that specific edge is traversed. In the following, we detail four different heuristics that we use to generate these edge-costs. We call each copy of ConceptNet thus produced a cost graph, and demonstrate the use of these various cost graphs in Section 3.4.
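The cost-graph construction just described can be sketched as a small helper that annotates every edge of the graph with a heuristic cost. The triple representation, the function names, and the unit cost used for the Default Cost heuristic below are our illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of cost-graph construction (assumed representation):
# the KG is a list of (subject, relation, object) triples, and a cost
# graph is a dict mapping each triple to its traversal cost.

def build_cost_graph(triples, cost_fn):
    """Return a cost-annotated copy of the graph; the structure is
    unchanged, only the per-edge weights differ by heuristic."""
    return {(s, r, o): cost_fn(s, r, o, triples) for (s, r, o) in triples}

def default_cost(s, r, o, triples):
    # Default Cost (DC): a uniform cost per edge (unit cost assumed),
    # reducing shortest-path search to fewest-hops search.
    return 1.0

triples = [("wave", "RelatedTo", "ocean"), ("wave", "IsA", "motion")]
dc_graph = build_cost_graph(triples, default_cost)
```

Each of the heuristics that follow can then be plugged in as a different `cost_fn`, yielding one cost-customized copy of ConceptNet per heuristic.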

#### Default Cost (DC)

This is the simplest case we consider, where we assign every single edge in our target graph (ConceptNet) a cost of . This essentially turns the path-finding problem between two given nodes on the graph into a problem of minimizing the number of hops: the fewest hops give the most efficient path.

#### Relevant Relations (RR)

The next obvious step in defining costs is to consider the case where some relations are different from others: that is, some relations are more important to the task at hand than others. Specifically, in this work, we look at relations that we consider relevant to the textual entailment task. This is a manually filtered subset of the total list of relations present in ConceptNet; some examples of relations included in this subset are RelatedTo, IsA, SimilarTo, and DerivedFrom. For each of these relations, the edge cost of any instance of that relation in the graph is reduced, thus lowering the cost of taking such an edge and encouraging a shortest-path search algorithm to consider these edges first.
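A minimal sketch of the RR heuristic, assuming a unit base cost and an arbitrary discount factor (the paper does not specify the reduced value):

```python
# Relevant Relations (RR): edges carrying a manually chosen "relevant"
# relation get a reduced cost, so shortest-path search prefers them.
# The discount factor of 0.5 is an illustrative assumption.
RELEVANT_RELATIONS = {"RelatedTo", "IsA", "SimilarTo", "DerivedFrom"}

def rr_cost(relation, base_cost=1.0, discount=0.5):
    if relation in RELEVANT_RELATIONS:
        return base_cost * discount
    return base_cost
```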

#### Relation Frequency (RF)

The two prior heuristics feature values that are manually decided and set: that is, we determine on our own what the weight on an edge should be. The next step up in complexity is to automate the computation of that weight, and to base that computation on some feature of the graph itself. The first such heuristic simply counts the frequency of relations as they occur at an entity. We implement this heuristic as the normalized count of the number of outgoing edges bearing the same relation name from a given node. That is, given a node that represents an entity in the graph, and the set of outgoing edges from , we represent the cost for an edge as . For example, consider a node that has three outgoing edges: . Using the above formula, the weights of the edges would be set to , while the edge would have a cost of . This ensures that the edge that is “rarer” is given a lower cost, and is favored by a shortest-path algorithm in case there is more than one way to travel from a node to a neighboring node.
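The RF heuristic can be sketched as follows; the helper name is ours, but the computation is the normalized outgoing-relation count described above:

```python
from collections import Counter

def rf_costs(outgoing_relations):
    """Relation Frequency (RF) costs for one node.

    outgoing_relations: relation labels on the node's outgoing edges.
    Returns {relation: cost}, cost = count(relation) / total edges,
    so the rarer a relation is at this node, the cheaper its edges.
    """
    counts = Counter(outgoing_relations)
    total = len(outgoing_relations)
    return {rel: count / total for rel, count in counts.items()}

# A node with three outgoing edges, two of which share a label:
costs = rf_costs(["RelatedTo", "RelatedTo", "IsA"])
# The two RelatedTo edges cost 2/3 each; the rarer IsA edge costs 1/3.
```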

#### Global Relation Frequency (GRF)

The final heuristic that we consider builds on top of the relation frequency metric by addressing a significant issue: the presence of common relations that occur throughout the knowledge graph, but may occur relatively fewer times at any one individual node. An example of such a relation is Is-A; while this relation is likely to occur relatively fewer times at any given node, it is clear that it occurs throughout the graph. We want to ensure that a truly rare relation that participates in an entailment instance is thus given more importance (and subsequently less cost) than one which occurs throughout the graph. To do this, we follow the inspiration of TF-IDF [salton1988term], which is often used to address similar issues in text corpora.

We first compute the Inverse Node Frequency (INF) (the analog of IDF) for every relation in the graph. Given a graph with node-set , let the quantity be the number of times relation appears in the nodes in as an outgoing edge. The INF for edges with the relation label can then be calculated as . Next, we compute the normalized Relation Frequency (RF) as in the previous section. Thus, given a node with a set of outgoing edges , the RF for an edge with relation can be calculated as . Since we are interested in promoting “rarer” relations by associating lower costs with them, we invert INF during the calculation of the final cost metric, giving us the cost as .
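A sketch of the GRF cost under stated assumptions: the exact formulas are elided here, so the logarithmic INF and the simple RF/INF division below are our reading of the TF-IDF analogy (ubiquitous relations, for which INF approaches zero, would need smoothing in practice):

```python
import math
from collections import Counter

def inverse_node_frequency(relation, node_relations):
    """INF, the analog of IDF. node_relations maps each node to the set
    of relation labels on its outgoing edges. The log form is assumed."""
    n_nodes = len(node_relations)
    n_with_rel = sum(1 for rels in node_relations.values() if relation in rels)
    return math.log(n_nodes / n_with_rel)

def grf_cost(relation, outgoing_relations, node_relations):
    """Final GRF cost: node-local RF divided by global INF, so that
    globally rare relations receive LOWER cost."""
    rf = Counter(outgoing_relations)[relation] / len(outgoing_relations)
    return rf / inverse_node_frequency(relation, node_relations)

# IsA appears at half the nodes; Antonym at only one, so its edges
# end up cheaper even though both have the same node-local RF.
node_relations = {"a": {"IsA", "Antonym"}, "b": {"IsA"},
                  "c": {"RelatedTo"}, "d": {"RelatedTo"}}
```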

### 3.3 Ordered Premise & Hypothesis Pairs

Once we generate the various cost graphs as described above, we use those respective graphs to obtain the relationships between the two sentences in a given textual entailment instance. As before, let us assume that this instance is denoted , where is the premise sentence and is the hypothesis sentence. The first step we take is to represent each sentence using its respective entities: that is, we collapse the representation of a sentence into an ordered set of those entities from the sentence that also appear in ConceptNet.¹ Let us denote these ordered sets as p and h respectively. Since we do not know which entities in the premise and which in the hypothesis contribute directly to the classification of the entailment relationship, we take the Cartesian product of the two ordered sets p and h to generate the set of all possible ordered pairs between and . This set is then used as the input for the shortest-path generation step.

¹ We perform stemming to turn words into their normative forms before doing a look-up in the list of ConceptNet entities.

### 3.4 Shortest Paths

Once we have the sets of premise-hypothesis entity pairs from Section 3.3, we find all shortest paths between the first and second entity of each pair, for every cost graph outlined previously. We employ NetworkX’s [hagberg2008exploring] implementation of Dijkstra’s shortest-path algorithm. Since ConceptNet has about 1 million nodes and well over 3 million edges, finding all shortest paths is an extremely expensive process. Additionally, after an analysis of entity pairs from ConceptNet that feature more than one direct edge between them (multi-edges), we find that the most common relationship (RelatedTo) occurs about of the time. The second most common relationship (FormOf) occurs in about of cases. Further, these two relations co-occur around of the time, and of those cases, about of the time they are the only two relations connecting that entity pair. All of this supports our hypothesis that selecting at random between paths that contain either of these relationships will not have a significant impact on the NLI classification problem. We therefore reduce the problem of finding all shortest paths between premise-hypothesis entity pairs to one of finding a single shortest path.
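The pipeline of Sections 3.3 and 3.4 can be sketched end to end. The paper uses NetworkX's Dijkstra over ConceptNet; the dependency-free toy graph, edge weights, and helper below are illustrative assumptions:

```python
import heapq
from itertools import product

def shortest_path(edges, source, target):
    """Dijkstra over a weighted, directed multi-edge graph.

    edges: list of (subject, relation, object, cost) tuples.
    Returns (total_cost, path) with path alternating entities and
    relations, or None if the target is unreachable."""
    adjacency = {}
    for u, rel, v, cost in edges:
        adjacency.setdefault(u, []).append((v, rel, cost))
    frontier = [(0.0, source, [source])]
    settled = {}
    while frontier:
        dist, node, path = heapq.heappop(frontier)
        if node == target:
            return dist, path
        if settled.get(node, float("inf")) <= dist:
            continue
        settled[node] = dist
        for neighbor, rel, cost in adjacency.get(node, []):
            heapq.heappush(frontier, (dist + cost, neighbor, path + [rel, neighbor]))
    return None

edges = [("wave", "RelatedTo", "ocean", 0.9),
         ("wave", "IsA", "motion", 0.3),
         ("motion", "RelatedTo", "ocean", 0.3)]
premise_entities, hypothesis_entities = ["wave"], ["ocean"]
# Cartesian product of entities, one shortest path per ordered pair:
paths = {(p, h): shortest_path(edges, p, h)
         for p, h in product(premise_entities, hypothesis_entities)}
# The two-hop route via "motion" (cost 0.6) beats the direct 0.9 edge.
```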

## 4 Using KG Information via Shortest Paths

Once the pairwise shortest paths are generated, we need to use them in a way that enables us to train on labeled textual entailment instances, in order to make predictions on new instances. Here, we focus particularly on the path component of the shortest paths – that is, we are interested in the relations used to connect a given premise and hypothesis pair from a textual entailment instance. This harks back to our hypothesis in Section 3 that the relationships between entities in the textual entailment instance are key to identifying the overall entailment relationship. In this section, we detail two specific ways in which we use the shortest paths: first, by accounting for the number of times relations appear in those paths; and second, by the sequence order in which they appear. These two approaches are in contrast to the work of wang2019improving (wang2019improving), which considers only entity-level information and completely ignores relationships.

### 4.1 Text Model: mLSTM

Most models for the NLI problem use only the premise and hypothesis sentences as input; for this reason, we use match-LSTM (mLSTM) [wang2016learning] as our text-based model. The specific implementation of mLSTM that we use encodes both premise and hypothesis with Bi-GRUs (as against Bi-LSTMs), and outputs a fixed, premise-attended representation of the hypothesis. This asymmetry in the modeling of the premise-hypothesis relationship has led to improved performance of mLSTM on various leaderboards.

### 4.2 Relationship Frequency Vectors

In order to enhance the text models used by prior work, we incorporate external knowledge in the form of the frequency distribution of relations present along the shortest paths between premise-hypothesis entity pairs. The size of the vector representing the paths is the same as the number of distinct relationships in the knowledge graph; in our case, ConceptNet has 47 distinct relationships. Each relationship is assigned a fixed positional index in this vector.

We calculate the frequencies of relations present in the paths across all premise-hypothesis entity pairs in a single NLI instance. For example, consider two premise-hypothesis entity pairs with shortest paths RelatedTo IsA RelatedTo and RelatedTo Synonym FormOf respectively. The frequency counts would then be RelatedTo: 3, IsA: 1, Synonym: 1, FormOf: 1; and 0 everywhere else. The non-zero frequency values are set at their respective relation position indices. The relation frequency vector thus formed is concatenated with the final hidden state from the text model, and the combination is then forwarded to a fully-connected feed-forward network.
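A sketch of the relation-frequency vector construction; the 4-relation index below is a tiny stand-in for ConceptNet's 47 relations:

```python
from collections import Counter

# Fixed positional index per relation (stand-in vocabulary).
RELATION_INDEX = {"RelatedTo": 0, "IsA": 1, "Synonym": 2, "FormOf": 3}

def relation_frequency_vector(paths):
    """paths: one relation sequence per premise-hypothesis entity pair
    of a single NLI instance. Returns the frequency vector whose slots
    follow RELATION_INDEX; unseen relations stay 0."""
    vector = [0] * len(RELATION_INDEX)
    for relation, count in Counter(r for path in paths for r in path).items():
        vector[RELATION_INDEX[relation]] = count
    return vector

# The two example paths from the text:
paths = [["RelatedTo", "IsA", "RelatedTo"],
         ["RelatedTo", "Synonym", "FormOf"]]
vector = relation_frequency_vector(paths)  # [3, 1, 1, 1]
```

This vector is what gets concatenated with the text model's final hidden state before the feed-forward classifier.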

We experimented with scaling the relation frequency vector to higher dimensions via linear layers; these results are reported in Table 3. The use of this simple frequency-based model makes it possible for us to analyze the learned weights, and subsequently intuit the importance and contribution of each relation in the classification task accuracy.

### 4.3 Recurrent Neural Networks

After modeling the shortest paths as the frequency counts of the relations along those paths, the next obvious step is to also use the sequentiality inherent in a shortest path. Recent work on Graph Convolutional Recurrent Networks (GCRN) [seo2018structured] has explored representing sequential graphical structures as fixed representations. A major difference between that approach and ours is the degree or level of sequentiality. In our current problem, we are faced with two levels of sequential information: one at the level of ordered premise-hypothesis entity pairs, and the other at the level of the path – represented as a sequence of relations, entities, or both – per premise-hypothesis entity pair.

We first describe how we process the shortest paths to capture the bi-level sequentiality inherent in them. As before, we assume each textual-entailment instance consists of premise () and hypothesis (), which together constitute a sentence pair. After processing each as outlined in Sections 3.3 and 3.4, we obtain an ordered set of shortest paths. Each of these shortest paths can be represented by either the entities along that path (alone), the relations along that path (alone), or a combination of the entities and relations both. Our work follows various hierarchical architectures that have been proposed for different learning-centric tasks [sordoni2015hierarchical, li2015hierarchical, yang2016hierarchical, serban2016building]

. The hierarchical assumption formulates a sequence at two levels: (1) a sequence of tokens for each pair; and (2) a sequence of pairs. We model this hierarchy as two recurrent neural networks.

Figure 2 shows the architecture of our Graph Recurrent Network (GRN). We describe the functioning of the GRN via a simplified working example. Consider the two sentences: Waves are caused by wind (premise); and Winds causes most ocean waves (hypothesis). As described in Section 3.3, we first find all possible premise-hypothesis entity pairs. This particular example gives us 12 such pairs: 3 premise entities (waves, caused, wind) times 4 hypothesis entities (winds, causes, ocean, waves). We further simplify for the sake of exposition and focus on three entity pairs: (waves, ocean), (wind, winds), and (wind, ocean). As explained in Section 3.4, we identify shortest paths for each of these pairs. For example, for the pair (waves, ocean), the shortest path looks like: waves →CausesDesire→ surf →IsA→ wave →PartOf→ ocean, where waves, surf, wave, and ocean are entities along the path; and CausesDesire, IsA, and PartOf are the relationships connecting them in sequential order.

The GRN model can take relations, entities, or relations plus entities as its input. In Figure 2, we show an instance where relations are fed as input to the token-representation layer. At this point, the tokens – which are relations in this case – are transformed into vector representations using an embedding matrix. The transformed representations are then fed to a bidirectional Recurrent Neural Network (RNN) in the sequence order captured by the shortest path. The final hidden states from the bidirectional RNN are concatenated to form a representation for the whole path. Thus, after passing through the path representation layers, we have vector representations for each of the entity pairs. These representations are then fed into a second bidirectional RNN in the order prescribed by the ordered set of entity pairs. Once the final hidden states of the pair-level encoder are concatenated, a feed-forward network with rectified linear units (ReLU) and a linear activation with softmax layer is used as the final prediction layer.

| | GRN (DC) | GRN (RR) | GRN (RF) | GRN (GRF) | GRN + mLSTM (DC) | GRN + mLSTM (RR) | GRN + mLSTM (RF) | GRN + mLSTM (GRF) | mLSTM |
|---|---|---|---|---|---|---|---|---|---|
| Relations Only | 59.27 | 59.43 | 60.10 | 60.90 | 87.88 | 87.70 | 88.26 | 88.42 | 88.42 |
| Entities Only | 67.65 | 63.57 | 63.88 | 64.95 | 87.19 | 86.73 | 86.80 | 87.42 | 88.42 |
| Relations + Entities | 64.11 | 65.72 | 64.18 | 64.26 | 87.26 | 87.65 | 86.42 | 88.65 | 88.42 |

| | GRN (DC) | GRN (RR) | GRN (RF) | GRN (GRF) | GRN + mLSTM (DC) | GRN + mLSTM (RR) | GRN + mLSTM (RF) | GRN + mLSTM (GRF) | mLSTM |
|---|---|---|---|---|---|---|---|---|---|
| Relations Only | 60.65 | 57.44 | 59.58 | 63.34 | 87.58 | 88.26 | 87.42 | 87.34 | 88.42 |
| Entities Only | 65.87 | 63.65 | 65.03 | 63.80 | 86.04 | 86.58 | 85.19 | 86.27 | 88.42 |
| Relations + Entities | 63.04 | 65.57 | 64.57 | 59.97 | 86.57 | 86.66 | 86.50 | 87.65 | 88.42 |

#### Token-level Encodings

Each pair consists of a sequence of tokens, which are embedded using an embedding matrix as . Then the bidirectional token-level RNN – a GRU [cho2014learning] in our case – is used to form a fixed-length representation by concatenating the final states from the forward () and backward () passes of the GRU. This yields . Note that we use ComplEx [trouillon2016complex] and TransH [han2018openke] knowledge graph embeddings for the token-level embeddings. These embeddings are trained on ConceptNet using OpenKE.²

² https://github.com/thunlp/OpenKE

#### Pair-level Encodings

The input to the pair-level encoder is a sequence of token-level representations . Then, just as above, a bidirectional GRU computes the fixed-length representation as: ; ; and .

## 5 Experimental Setup

In this section, we describe our experimental setup, which includes the dataset and knowledge graph used; the implementation of that knowledge graph; the compute used for our experiments; and the various initializations and hyperparameters. We list all of these to bolster the reproducibility of our work.

### 5.1 Dataset & Knowledge Graph

In order to evaluate our approach, we use SciTail [khot2018scitail], which is a science domain entailment dataset. The SciTail dataset was created from a corpus of science domain multiple choice questions for and grade. It has approximately premise-hypothesis pairs, which are divided into train, dev, and test sets. The main motivation behind using SciTail is the ability to use it for additional downstream NLP tasks like question answering.

There are multiple open knowledge sources available, such as DBpedia [auer2007dbpedia], WordNet [wordnet], and ConceptNet [speer2017conceptnet]. Each knowledge source contains different kinds of information. For example, DBpedia is fact-based and comprises information relating to Wikipedia entities; WordNet is a linguistic knowledge base; and ConceptNet contains common-sense information gathered by crowdsourcing. Thus, selecting the right knowledge source for a task or dataset is non-trivial. The work of wang2019improving (wang2019improving) provides some guidance on this task by evaluating the relevance of each of these knowledge bases to the SciTail dataset; their conclusion is that ConceptNet is the best KG for the SciTail dataset.

### 5.2 Graph Implementation

ConceptNet [speer2017conceptnet] consists of a total of entries, capturing concepts and relations spanning more than languages. For this work, we focused only on the English-language entries. We transformed these entries into the commonly used graph format of (subject, predicate, object) tuples. This reformatted ConceptNet was represented as a MultiDiGraph – due to the existence of multiple relations (edges) between some entities – using the NetworkX [hagberg2008exploring] Python library.

### 5.3 Compute Power

The ConceptNet filtering, cost graph customization, and shortest-path generation were performed on a core Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz machine. The RNN and GRN models were trained and evaluated using Tesla P100 Nvidia GPUs with GB of memory.

### 5.4 Initializations & Hyperparameters

In the Graph Recurrent Network (GRN) model, we used ComplEx [trouillon2016complex] and TransH [han2018openke] knowledge graph embeddings for the token-level encoder, with the embedding dimension set to . The token-level and pair-level encoders used single-layered bidirectional GRUs with a hidden size of . Parameters were not shared between the token-level and pair-level encoders. A two-layered fully-connected feed-forward neural network with ReLU and linear activations, and dropout of and respectively, was used for the prediction layer. The size of the hidden layer was set to .

Our models were implemented with AllenNLP, a popular NLP library. We tuned the hyperparameters for the models using the validation set. We used a sigmoid function and minimized cross-entropy loss for training and updating the model. The training cycle involved a epoch run, with a epoch patience cutoff. The batch size was set to , and gradients were clipped at . The trainer was configured to use the Adam [kingma2014adam] optimizer with a learning rate of .

## 6 Results

In this section, we outline our results. The section is split into three parts: we first look at various graph-related statistics across the various customized cost graphs to externalize what the cost customization does to the retrieval of context-relevant knowledge. We then look at the performance of our classification methods on the NLI problem, both quantitatively and qualitatively.

### 6.1 Graph Statistics

We start by looking at the knowledge extracted by the different heuristic cost functions for the same premise-hypothesis samples. A multi-edge directed graph is constructed by combining entity and relation information along the shortest paths for all entity pairs in every premise-hypothesis sample. The numbers shown in Table 4 are averaged over the graphs for all premise-hypothesis samples. As we employ more informed cost heuristic functions, there is a steady increase in the number of nodes and edges per multi-edge directed graph. The higher in-degree and out-degree for these graphs indicate increased variation in the set of nodes that can be reached from a given node, and the set of nodes from which a given node can be reached, respectively. This trend validates the informativeness, in terms of diversity, of the RF and GRF cost functions as compared to RR and DC.

| | DC | RR | RF | GRF |
|---|---|---|---|---|
| Frequency Only | 87.73 | 87.57 | 87.88 | 87.88 |
| Transformation | 89.03 | 87.35 | 87.73 | 87.27 |

The reduction in premise-hypothesis samples that feature pairs with more than one shortest path between them might seem out of the ordinary at first glance, but there is a very good reason for it: the discrete versus continuous nature of the cost functions. We know from Section 3.2 that DC and RR assign fixed-value costs by counting edges, whereas RF and GRF compute floating-point costs. This makes it very unlikely that the latter cost functions will produce the exact same (summed) cost across two different paths.

| | DC | RR | RF | GRF |
|---|---|---|---|---|
| Nodes | 70 | 72 | 87 | 92 |
| Edges | 190 | 196 | 380 | 415 |
| In-degree | 2.78 | 2.77 | 4.24 | 4.34 |
| Out-degree | 2.7 | 2.69 | 4.12 | 4.34 |
| > 1 path | 23501 | 23504 | 234 | 39 |

| | 1st Position | 2nd Position | Last Position |
|---|---|---|---|
| Rank 1 | RelatedTo | RelatedTo | RelatedTo |
| Rank 2 | FormOf | FormOf | Synonym |
| Rank 3 | Synonym | Synonym | IsA |
| Rank 4 | IsA | IsA | SimilarTo |
| Rank 5 | SimilarTo | HasContext | UsedFor |

### 6.2 Quantitative Results

Table 1 shows the performance of the GRN and mLSTM + GRN models across the different heuristic cost functions (DC, RR, RF, GRF) and external information types (relations, entities, relations + entities). We observe that the GRN model by itself cannot match the performance of a text model (mLSTM); at the same time, the GRN + Text model only marginally improved accuracy, with the GRF heuristic and external knowledge consisting of both relations & entities. In the case of the GRN models, even though the overall accuracy is not comparable with the mLSTM text model, we see a consistent gain in performance in most of the cases where an informed heuristic like GRF is used instead of RR or DC. This trend is also observed in the GRN + Text models. This indicates that graph-structure-based heuristics like RF and GRF are better at capturing relevant external knowledge. We also explored TransH [han2018openke] graph embeddings, but these models did not perform very well; this is reflected in Table 2. We present these results in the spirit of full disclosure, in the hope that they will assist future research.

With respect to the knowledge graph itself, we noticed some peculiar characteristics in ConceptNet. The number of relationships in ConceptNet () is extremely small when compared to the number of unique entities ( million). This causes certain sets of relations to be repeated quite frequently, thus losing all uniqueness. As Figure 3 shows, the top 6 relations – RelatedTo, FormOf, IsA, Synonym, HasContext, DerivedFrom – are repeated more than times, with the most frequent relation occurring more than million times. In contrast, the remaining relations appear relatively infrequently, as shown in Figure 4. This skewed distribution leads to the top- most frequent relations dominating the shortest paths for the majority of premise-hypothesis entity pairs. An additional piece of evidence in support of this argument is presented in Table 5, which shows the most popular (top-ranked) relations for the first, second, and last³ positions across all paths of length to . These positions are mostly dominated by relations from Figure 3. This frequency-based relation domination thus overwhelms any real signal that might be coming from paths that contain more infrequent and unpopular relations. This is an extremely interesting avenue for future work.

³ We choose the first and last positions because of the bidirectional nature of the encoding approach that we pursue.

### 6.3 Qualitative Results

In this section, we highlight the promise of our approach by looking at examples that are classified correctly by either the mLSTM approach (here called text) or the mLSTM + GRN approach (here called graph). We compared predictions from the text model against the graph model. Overall, we noticed that the graph model was able to handle sentences with higher numbers of entity pairs, resulting in graphs with more nodes and edges. Out of the instances that graph gets right (but not text), contained over entities; while of the instances that text got right (but not graph), only had that many entities. If we drop the threshold to entities, these numbers flip to for text and for graph. This aligns well with a common problem faced by text-only models, which fail over long sequences of text. It also shows the value of our graph approach, which is able to explicitly incorporate external knowledge and scale to instances that are more complex.

## 7 Conclusion

In this paper, we presented the notion of contextualizing a knowledge graph by customizing the edge-weights in that graph with costs produced by various heuristic functions. We used these cost-customized graphs to find different shortest paths for different instances of the NLI problem, and trained two different classifiers using the sequence information from the shortest paths. Our results show some clear avenues for immediate future work, including: (1) testing on other KGs and NLI datasets; (2) experimenting with other, more complex cost functions; and (3) automating the construction of the cost function via reinforcement learning.