L2R2: Leveraging Ranking for Abductive Reasoning

The abductive natural language inference task (αNLI) is proposed to evaluate the abductive reasoning ability of a learning system. In the αNLI task, two observations are given, and the most plausible hypothesis must be picked out from the candidates. Existing methods simply formulate it as a classification problem and therefore train with a cross-entropy log-loss objective. However, discriminating true from false does not measure the plausibility of a hypothesis: all hypotheses have a chance of happening, only with different probabilities. To fill this gap, we switch to a ranking perspective that sorts the hypotheses in order of their plausibility. Under this new perspective, a novel L2R^2 approach is proposed within the learning-to-rank framework. First, training samples are reorganized into a ranking form, where the two observations and their hypotheses are treated as the query and a set of candidate documents respectively. Then, an ESIM model or a pre-trained language model, e.g. BERT or RoBERTa, is employed as the scoring function. Finally, the loss function for the ranking task can be either pair-wise or list-wise. The experimental results on the ART dataset reach the state of the art on the public leaderboard.




1. Introduction

Abduction is considered to be the only logical operation that can introduce new ideas (Peirce, 1965). It contrasts with other types of inference such as entailment, the focus of the well-known natural language inference (NLI) tasks, which infers only information that is already provided in the premise. Abductive reasoning is therefore an important inference type that deserves to be explored. A new reasoning task, namely the abductive natural language inference task (αNLI), is proposed to test the abductive reasoning capability of an AI system (Bhagavatula et al., 2020). Different from traditional NLI tasks, αNLI first provides two pieces of narrative text treated as a start observation and an end observation. The most plausible explanation must then be picked out from the candidate hypotheses.

Figure 1. An example of the αNLI task. Given the observations, multiple hypotheses are plausible, each with its own probability.

Many models have been successfully developed for NLI tasks and directly adopted for the newly proposed αNLI task. These methods treat the entailment between two sentences from a classification perspective and regard the αNLI task as a double-choice question answering problem that selects the more plausible of two hypotheses. However, discriminating true from false does not measure the plausibility of a hypothesis in an abductive reasoning task, where all the hypotheses have a chance of happening with their own probabilities, even if some of those probabilities are close to zero. As shown in Figure 1, from a tidy room (observation $O_1$) to a messy room (observation $O_2$), we do not know what has happened. Four hypotheses are proposed, where 'a thief broke into the room' is the most likely, and 'a cat slipped into the room' is also a potential answer. Even the 'earthquake' hypothesis is reasonable, just with a very small probability. It is hard to draw a line that separates the true hypothesis from the others.

Based on these insights, we argue that αNLI is better treated as a ranking problem. From the ranking perspective, the double-choice question answering setting of the recent αNLI task is just an incomplete pair-wise ranking scenario that considers only a partial plausibility order over the given hypotheses. In order to fully model the plausibility of the hypotheses, we switch to a complete ranking perspective and propose a novel learning-to-rank for reasoning (L2R^2) approach for the αNLI task. The source code is available at https://github.com/zycdev/L2R2. L2R^2 adopts the mature learning-to-rank framework, which first reorganizes training instances into a ranking form. Specifically, the two observations $O_1$ and $O_2$ can be viewed as a query, and the candidate hypotheses can be viewed as a set of candidate documents. The relevance degree between the query and each document represents the plausibility of each hypothesis given the observations. Then, the two parts of the learning-to-rank framework, the scoring function and the loss function, are designed for the αNLI task. Two types of scoring functions are considered in this paper: the matching model ESIM (Chen et al., 2017) and pre-trained language models, e.g. BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Both pair-wise and list-wise loss functions are applied to train the ranking task. The experimental results show that our approach achieves a new state-of-the-art accuracy on the blind test set of the ART dataset. Further analyses illustrate that the benefit of the ranking perspective is to assign a proper plausibility to each hypothesis, instead of either 0 or 1.

2. Task Formalization

The αNLI task involves two major concepts, observation and hypothesis. An observation describes the state of a scene, while a hypothesis is the imagined cause that transforms one observation into another. Piaget's famous theory of cognitive development tells us that our world is a dynamic system of continuous change involving transformations and states. Therefore, predicting the transformation is the core of the αNLI task.

In detail, two observations $O_1, O_2 \in \mathcal{O}$ are given, where $\mathcal{O}$ is the space of all possible observations. The goal of the αNLI task is to predict the most plausible hypothesis $h^* \in \mathcal{H}$, where $\mathcal{H}$ is the space of all hypotheses. Note that observation $O_1$ happens earlier than $O_2$. In the traditional NLI task, the hypothesis is regarded as directly entailed from the premise. The relation between the hypothesis and the two observations in the αNLI task is, however, of a totally different form:

$$P(h \mid O_1, O_2) \propto P(h \mid O_1)\, P(O_2 \mid O_1, h),$$

where the hypothesis $h$ depends on the first observation $O_1$, and the last observation $O_2$ depends on both $O_1$ and $h$. The best hypothesis $h^*$ is the one that maximizes the product of these two parts. This can be modeled by a scoring function $f$ that takes $O_1$, $h$ and $O_2$ as input and outputs a real value $s$:

$$s = f(O_1, h, O_2).$$
For easy model adaptation, αNLI in the ART dataset is originally defined as a double-choice question answering problem, whose goal is to choose the more plausible hypothesis from two candidates $h^1$ and $h^2$. From the classification perspective, it can be formalized as a discriminative task that predicts the label of the pair $(h^1, h^2)$: a positive label indicates that $h^1$ is more plausible than $h^2$, while a negative label indicates the opposite. We argue that, in a ranking view, this is an incomplete pair-wise approach that considers only a small portion of the order in a ranking list and yields poor performance.

Therefore, we reformulate this task from the ranking perspective and adopt the learning-to-rank framework. In this framework, the observations $O_1$ and $O_2$ can be regarded as a query, and their candidate hypotheses $h_1, \dots, h_n$ can be viewed as the corresponding candidate document set, labeled with plausibility scores $y_1, \dots, y_n$, where $n$ is the number of candidate hypotheses. The loss function is a key part of the learning-to-rank framework, and point-wise, pair-wise and list-wise losses are the three commonly used types. In this paper, we consider only pair-wise and list-wise loss functions, because a point-wise loss is just a classification loss that does not take the order of the hypotheses into account. Given the plausibility scores, we can form all pairs of hypotheses whose plausibility scores differ in order to train with a pair-wise loss function. We can also use a list-wise loss function that treats the candidate hypotheses as an ordered list and measures the error over the whole ranking list.

3. Our Approach

Under the ranking formalization, we propose our learning-to-rank for reasoning (L2R^2) approach, an implementation of the learning-to-rank framework for αNLI. The learning-to-rank framework typically consists of two main components: a scoring function used to generate a real-valued score for a query-document pair, and a loss function used to measure how accurately the ground-truth rankings are predicted.

3.1. Scoring Function

The scoring function can be implemented in different forms; for example, deep text matching models and pre-trained language models can both be employed as scoring functions.

ESIM is a strong NLI model that uses a Bi-LSTM to encode the tokens within each sentence and performs cross-attention over the encoded token representations; its performance on entailment NLI is close to the state of the art. It is thus a good choice for a scoring function. ESIM takes two sentences, a premise and a hypothesis, as input. For the αNLI task, the concatenation of $O_1$ and $O_2$ is treated as the premise and $h$ is treated as the hypothesis, and ESIM outputs a scalar score indicating the relevance between them.

Scoring functions can also be based on pre-trained language models such as BERT or RoBERTa. For the αNLI task, the observations and the hypothesis are first concatenated into a narrative story $(O_1, h, O_2)$ with delimiter and sentinel tokens, which is fed into the pre-trained language model to obtain a contextual embedding for each token. Mean pooling is then applied to obtain the feature vector of the observations-hypothesis pair. Finally, a dense layer is stacked on top to produce the plausibility score $s$.
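As a rough sketch of this pooling-and-scoring head (not the authors' code; the encoder itself is elided, and the embeddings are passed in precomputed):

```python
def plausibility_score(token_embeddings, weights, bias):
    """Mean-pool the contextual embeddings of the concatenated
    (O1, h, O2) story, then apply a dense layer to get a scalar score.

    token_embeddings: per-token vectors, as produced by an encoder
    such as BERT/RoBERTa (the encoder is elided in this sketch).
    """
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    # mean pooling over the token axis
    pooled = [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]
    # dense layer: w . pooled + b
    return sum(w * x for w, x in zip(weights, pooled)) + bias
```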

3.2. Loss Function and Inference

Though the implementations differ, all scoring functions are optimized by minimizing the empirical risk:

$$\mathcal{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(\mathbf{y}^{(i)}, \mathbf{s}^{(i)}\big),$$

where $N$ is the number of queries, $\mathbf{y}^{(i)}$ and $\mathbf{s}^{(i)}$ are the ground-truth plausibility scores and the predicted scores for the $i$-th query, and $\ell$ is the loss function used to evaluate the prediction scores for a single query. Since point-wise loss functions are suited only for absolute judgments, we explore only pair-wise and list-wise loss functions in this work.

Pair-wise loss functions are defined on pairs of hypotheses whose labels differ, reducing ranking to a classification over hypothesis pairs. Here, the pair-wise loss functions of Ranking SVM (Herbrich, 2000), RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007) are used.

The hinge loss used in Ranking SVM and the logistic (cross-entropy) loss used in RankNet both have the following form:

$$\ell(\mathbf{y}, \mathbf{s}) = \sum_{y_i > y_j} \phi(s_i - s_j),$$

where $\phi$ is the hinge function $\phi(z) = \max(0, 1 - z)$ or the logistic function $\phi(z) = \log(1 + e^{-z})$ respectively, and $y_i > y_j$ means that $h_i$ ranks higher (is more plausible) than $h_j$ with regard to the query.
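A minimal pure-Python sketch of this pair-wise loss, with the hinge and logistic variants passed in as `phi` (the function names are ours, not the paper's):

```python
import math

def pairwise_loss(scores, labels, phi):
    """Sum phi(s_i - s_j) over all hypothesis pairs with y_i > y_j."""
    return sum(phi(si - sj)
               for si, yi in zip(scores, labels)
               for sj, yj in zip(scores, labels)
               if yi > yj)

def hinge(z):          # Ranking SVM variant
    return max(0.0, 1.0 - z)

def logistic(z):       # RankNet variant
    return math.log(1.0 + math.exp(-z))
```

With two hypotheses, a correctly ordered score pair with a margin incurs zero hinge loss, while a mis-ordered pair is penalized.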

Building upon RankNet, LambdaRank uses the logistic loss and adapts it by reweighing each hypothesis pair:

$$\ell(\mathbf{y}, \mathbf{s}) = \sum_{y_i > y_j} |\Delta \mathrm{NDCG}_{ij}|\, \log\big(1 + e^{-(s_i - s_j)}\big),$$

where $|\Delta \mathrm{NDCG}_{ij}|$ is the absolute difference between the NDCG values when the ranking positions of $h_i$ and $h_j$ are swapped,

$$|\Delta \mathrm{NDCG}_{ij}| = \frac{|G(y_i) - G(y_j)| \cdot |D(\pi_i) - D(\pi_j)|}{\mathrm{maxDCG}},$$

where $G$ and $D$ are the gain and discount functions respectively, $\pi_i$ is the ranking position of $h_i$, and $\mathrm{maxDCG}$ is a per-query normalization factor.
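A sketch of the reweighed loss, assuming the common LambdaRank choices of gain $G(y) = 2^y - 1$ and discount $D(r) = 1/\log_2(1 + r)$; the function name is illustrative:

```python
import math

def lambda_rank_loss(scores, labels):
    """Logistic pairwise loss reweighed by |DeltaNDCG| of swapping each pair."""
    gain = lambda y: 2.0 ** y - 1.0
    disc = lambda r: 1.0 / math.log2(1.0 + r)
    # 1-based rank of each item under the current scores
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {k: r + 1 for r, k in enumerate(order)}
    # ideal DCG as the per-query normalizer (maxDCG)
    ideal = sorted(labels, reverse=True)
    max_dcg = sum(gain(y) * disc(r + 1) for r, y in enumerate(ideal))
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                dn = abs((gain(labels[i]) - gain(labels[j]))
                         * (disc(rank[i]) - disc(rank[j]))) / max_dcg
                total += dn * math.log(1.0 + math.exp(-(scores[i] - scores[j])))
    return total
```

A correctly ordered list yields a smaller loss than the reversed ordering, since the logistic term shrinks while the |DeltaNDCG| weight is the same.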

Note that the double-choice classification baselines for αNLI can be viewed as special cases of pair-wise ranking methods when $n = 2$.

List-wise loss functions are defined on lists of hypotheses. In this paper, the loss functions of ListNet (Cao et al., 2007), ListMLE (Xia et al., 2008) and ApproxNDCG (Qin et al., 2010) are employed.

In the ListNet approach, the K-L divergence between the permutation probability distribution induced by the scoring function and that induced by the ground truth is used as the loss function, where $\pi$ denotes a permutation of the hypotheses. Due to the huge size of the permutation space, ListNet reduces the training complexity by using the marginal distribution of the first position, and the K-L divergence loss then becomes

$$\ell(\mathbf{y}, \mathbf{s}) = -\sum_{i=1}^{n} \frac{e^{y_i}}{\sum_{k=1}^{n} e^{y_k}} \log \frac{e^{s_i}}{\sum_{k=1}^{n} e^{s_k}}.$$
Different from ListNet, ListMLE uses the negative log-likelihood of the ground-truth permutation as the loss function,

$$\ell(\mathbf{y}, \mathbf{s}) = -\log P(\pi_y \mid \mathbf{s}) = -\sum_{j=1}^{n} \log \frac{e^{s_{\pi_y(j)}}}{\sum_{k=j}^{n} e^{s_{\pi_y(k)}}},$$

where $\pi_y$ is the ground-truth permutation.
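ListMLE can be sketched as follows, with a log-sum-exp for numerical stability; the ground-truth permutation is taken here as the labels sorted in decreasing order:

```python
import math

def listmle_loss(scores, labels):
    """Negative log-likelihood of the ground-truth permutation under the
    Plackett-Luce model induced by the predicted scores."""
    # ground-truth permutation: indices sorted by decreasing label
    perm = sorted(range(len(labels)), key=lambda k: -labels[k])
    loss = 0.0
    for pos in range(len(perm)):
        remaining = perm[pos:]
        # log of the partition function over the remaining items
        m = max(scores[k] for k in remaining)
        log_z = m + math.log(sum(math.exp(scores[k] - m) for k in remaining))
        loss += log_z - scores[perm[pos]]
    return loss
```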

ApproxNDCG optimizes an approximation of NDCG directly, and its loss function is defined as follows:

$$\ell(\mathbf{y}, \mathbf{s}) = 1 - \frac{1}{\mathrm{maxDCG}} \sum_{i=1}^{n} \frac{G(y_i)}{\log_2\big(1 + \hat{\pi}(i)\big)},$$

where $\hat{\pi}(i)$ is a smooth approximation of the position of $h_i$ in the ranking list.
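A sketch of this approximate-NDCG loss, using a sigmoid-based position approximation $\hat{\pi}(i) = 1 + \sum_{j \neq i} \sigma((s_j - s_i)/\tau)$ with an illustrative temperature; the exact smoothing in Qin et al. (2010) differs in details:

```python
import math

def approx_ndcg_loss(scores, labels, temp=0.1):
    """1 - NDCG, with each item's rank replaced by the smooth approximation
    pi_hat(i) = 1 + sum_{j != i} sigmoid((s_j - s_i) / temp)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    gain = lambda y: 2.0 ** y - 1.0
    ideal = sorted(labels, reverse=True)
    max_dcg = sum(gain(y) / math.log2(2.0 + r) for r, y in enumerate(ideal))
    dcg = 0.0
    for i, si in enumerate(scores):
        pos = 1.0 + sum(sigmoid((sj - si) / temp)
                        for j, sj in enumerate(scores) if j != i)
        dcg += gain(labels[i]) / math.log2(1.0 + pos)
    return 1.0 - dcg / max_dcg
```

When the scores already order the hypotheses like the labels, the loss is near zero; reversing the scores raises it.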

At the inference stage, since the original αNLI task is to pick the more plausible of two hypotheses, we simply choose the hypothesis with the higher score as the prediction.

Model     Double-Choice     Pairwise Ranking                 Listwise Ranking
          Classification    Logistic  Hinge   LambdaRank     KLD     Likelihood  ApproxNDCG
ESIM      57.98             58.86     58.88   58.61          59.08   59.10       58.81
BERT      67.75             72.26     72.45   71.34          72.26   73.30       71.80
RoBERTa   85.64             87.86     87.92   87.79          88.45   88.12       86.55
Table 1. Performance of L2R^2 and the baselines on the development set of the ART dataset.
Model                      Accuracy
Random                     50.41
BERT                       63.62
BERT                       66.75
RoBERTa                    83.91
McQueen + RoBERTa          84.18
HighOrderGN + RoBERTa      82.04
egel (Dynamics)            85.95
L2R^2 (RoBERTa + KLD)      86.81
Human                      92.90
Table 2. Performance of L2R^2 and the baselines on the test set of the ART dataset. All results come from the leaderboard (see https://leaderboard.allenai.org/anli/submissions/public). Due to the limited number of submissions, only one of our models was evaluated; it achieved the best result as of 2020-02-23.

4. Experiments

In this section, experimental results on a public dataset are presented to evaluate our proposed approach.

4.1. Experimental Settings

We conduct our experiments on the ART dataset (Bhagavatula et al., 2020), the first large-scale benchmark dataset for abductive reasoning in narrative texts. It consists of 20K pairs of observations with over 200K explanatory hypotheses, where the observations are drawn from a collection of manually curated stories and the hypotheses are collected by crowd-sourcing. The candidate hypotheses for each narrative context in the test set are selected by an adversarial filtering algorithm that uses BERT as the adversary.

For our approach, the data needs to be reorganized into a ranking form. Concretely, we merge the original instances sharing the same observation pair into a new instance $\langle (O_1, O_2), H \rangle$, where $H$ is the set of candidate hypotheses for the given observation pair. In the ART training set, there are on average 13.41 hypotheses per observation pair, of which 4.05 are plausible. We further employ a heuristic labeling strategy to construct the ground-truth plausibility scores for $H$: considering the $i$-th hypothesis $h_i$ for $(O_1, O_2)$, its ground-truth plausibility score $y_i$ is assigned heuristically.
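A possible sketch of this reorganization, assuming the raw instances arrive as (O1, O2, hypothesis, plausible-flag) tuples; counting plausible votes as the score here is purely illustrative and is not the paper's exact labeling rule:

```python
from collections import defaultdict

def to_ranking_form(instances):
    """Merge double-choice instances that share an observation pair into
    one ranking instance: ((o1, o2), hypotheses, scores), with hypotheses
    sorted by decreasing (illustrative) plausibility score."""
    groups = defaultdict(lambda: defaultdict(int))
    for o1, o2, hyp, plausible in instances:
        # illustrative score: number of times hyp was marked plausible
        groups[(o1, o2)][hyp] += int(plausible)
    ranking_data = []
    for query, hyp_counts in groups.items():
        ordered = sorted(hyp_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        hyps, scores = zip(*ordered)
        ranking_data.append((query, list(hyps), list(scores)))
    return ranking_data
```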

To demonstrate the effectiveness of our approach, we develop 18 models based on three scoring functions, i.e. ESIM, BERT and RoBERTa, combined with six ranking loss functions: Logistic for the loss used in RankNet, Hinge for the loss used in Ranking SVM, LambdaRank for the loss used in LambdaRank, KLD for the loss used in ListNet, Likelihood for the loss used in ListMLE, and ApproxNDCG for the loss used in ApproxNDCG.

Three double-choice classification models are selected as our baselines. They share the structures of the aforementioned scoring functions; the only difference is that they are trained on the original data with the cross-entropy loss function.

For the implementation details, we employ Adam as the optimizer and use early stopping to select the best model. The ESIM-based models use an LSTM hidden size of 300 and are trained for at most 64 epochs with a batch size of 32 and a learning rate of 4e-4. The models based on pre-trained language models are fine-tuned for at most 10 epochs with a batch size of 4 and a learning rate of 5e-6. The evaluation metric is the accuracy defined in (Bhagavatula et al., 2020).

4.2. Experimental Results

Table 1 shows the experimental results of the L2R^2 models and the baselines on the development set. Our best model was evaluated officially on the test set and achieved state-of-the-art accuracy (Table 2).

We summarize our observations as follows. (1) All 18 versions of our approach improve performance on the abductive reasoning task, which shows that the ranking perspective is better than classification. (2) Pair-wise models perform better than classification models, and most list-wise models perform better than pair-wise models. The former boost can be attributed to the full pair-wise training, whereas the latter boost from pair-wise to list-wise is due to global reasoning over the entire candidate set. (3) The BERT-based ranking models have the largest gains, about 8.2% improvement over the corresponding baseline. Because BERT was used as the adversary during dataset construction, this substantial improvement illustrates that our approach is more robust to adversarial inputs. (4) The loss functions that optimize the NDCG metric, i.e. LambdaRank and ApproxNDCG, perform worse than the others, mainly due to the gap between the NDCG metric used during training and the accuracy metric used during testing.

4.3. Detailed Analyses

To further illustrate the rationality of our approach, Figure 2 shows two normalized score distributions over the more plausible hypotheses in the development set candidate pairs, predicted by two models that use BERT as the scoring function: one trained with the classification loss and the other with the list-wise likelihood loss. The area under the curves in the right part (probability > 0.5) can be viewed as the accuracy. As shown in the figure, the classification model separates the pairs of candidate hypotheses with a great disparity, pushing probabilities close to either 0 or 1, whereas the L2R^2 model is able to judge borderline instances whose two candidates are competitive with each other. Consider the sampled borderline instance at the bottom of Figure 2, where both hypotheses are likely to happen but $h^1$ is slightly more plausible. The L2R^2 model makes the right choice and outputs two competitive probabilities for $h^1$ and $h^2$, 0.5891 vs. 0.4109, whereas the classification model not only fails to identify the better hypothesis but also outputs probabilities of 0.0024 and 0.9976, a significantly large gap. That is to say, the ranking view of the L2R^2 approach is a more reasonable way to model the abductive reasoning task.

Figure 2. Normalized probability distributions over the more plausible hypotheses in the development set candidate pairs, predicted by an L2R^2 model and a classification model. Probabilities larger than 0.5 denote correctly picked instances, while those near 0.5 denote borderline instances.

5. Conclusion

In the αNLI task, every hypothesis has its own chance of happening, so the task is naturally treated as a ranking problem. From this ranking perspective, L2R^2 is proposed for the αNLI task under the learning-to-rank framework, which contains a scoring function and a loss function. The experiments on the ART dataset show that reformulating the αNLI task as ranking yields improvements and reaches state-of-the-art performance on the public leaderboard.


Acknowledgements

This work was supported by the Beijing Academy of Artificial Intelligence (BAAI) under Grant BAAI2020ZJ0303, the National Natural Science Foundation of China (NSFC) under Grants No. 61773362 and 61906180, the Youth Innovation Promotion Association CAS under Grant No. 2016102, the National Key R&D Program of China under Grant No. 2016QY02D0405, and the Tencent AI Lab Rhino-Bird Focused Research Program (No. JR202033).


References

  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. Yih, and Y. Choi (2020) Abductive commonsense reasoning. In ICLR. Cited by: §1, §4.1.
  • C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005) Learning to rank using gradient descent. In ICML, pp. 89–96. Cited by: §3.2.
  • C. J. Burges, R. Ragno, and Q. V. Le (2007) Learning to rank with nonsmooth cost functions. In NIPS, pp. 193–200. Cited by: §3.2.
  • Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In ICML, pp. 129–136. Cited by: §3.2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In ACL, pp. 1657–1668. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1.
  • R. Herbrich (2000) Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pp. 115–132. Cited by: §3.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • C. S. Peirce (1965) Principles of philosophy and elements of logic. Vol. 1, Belknap Press of Harvard University Press. Cited by: §1.
  • T. Qin, T. Liu, and H. Li (2010) A general approximation framework for direct optimization of information retrieval measures. Information retrieval 13 (4), pp. 375–397. Cited by: §3.2.
  • F. Xia, T. Liu, J. Wang, W. Zhang, and H. Li (2008) Listwise approach to learning to rank: theory and algorithm. In ICML, pp. 1192–1199. Cited by: §3.2.