1. Introduction
Abduction is considered to be the only logical operation that can introduce new ideas (Peirce, 1965). It contrasts with other types of inference such as entailment, the focus of the well-known natural language inference (NLI) tasks, which infers only information that is already provided in the premise. Abductive reasoning is therefore an important inference type that deserves to be explored. A new reasoning task, the abductive natural language inference task (αNLI), has been proposed to test the abductive reasoning capability of an AI system (Bhagavatula et al., 2020). Unlike traditional NLI tasks, αNLI first provides two pieces of narrative text, treated as a start observation and an end observation. The system is then asked to pick the most plausible explanation from the candidate hypotheses.
Many models have been successfully developed for NLI tasks and directly adopted for the newly proposed αNLI task. These methods treat the entailment between two sentences from a classification perspective, and likewise treat the αNLI task as a double-choice question answering problem that selects one plausible hypothesis from two. However, discriminating true from false does not measure the plausibility of a hypothesis in abductive reasoning, where every hypothesis has some probability of happening, even though some of these probabilities are close to zero. As shown in Figure 1, going from a tidy room (observation $O_1$) to a messy room (observation $O_2$), we do not know what has happened. Thus, four hypotheses are proposed, of which ‘thief broke into the room’ is the most likely, and ‘cat slipped into the room’ is also a potential answer. Even the hypothesis ‘earthquake’ is reasonable, just with a very small probability. It is hard to draw a line that separates the true hypothesis from the others.
Based on these insights, we argue that αNLI is better treated as a ranking problem. From the ranking perspective, the double-choice question answering setting of the current αNLI task is just an incomplete pairwise ranking scenario that considers only a partial plausibility order of the given hypotheses. To fully model the plausibility of the hypotheses, we switch to a complete ranking perspective and propose a novel learning to rank for reasoning (L2R²) approach for the αNLI task (the source code is available at https://github.com/zycdev/L2R2). L2R² adopts the mature learning-to-rank framework, which first reorganizes training instances into a ranking form. Specifically, the two observations $O_1$ and $O_2$ can be viewed as a query, and the candidate hypotheses can be viewed as a set of candidate documents. The relevance degree between a query and each document represents the plausibility of each hypothesis given the observations. Then the two parts of the learning-to-rank framework, the scoring function and the loss function, are designed for the αNLI task. Two types of scoring functions are chosen in this paper: the matching model ESIM (Chen et al., 2017) and pretrained language models, e.g. BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Pairwise and listwise loss functions are applied to train the ranking task. The experimental results show that our approach achieves a new state-of-the-art accuracy on the blind test set of the ART dataset. Further analyses illustrate that the benefit of the ranking perspective is that it assigns a proper plausibility to each hypothesis, instead of either 0 or 1.
2. Task Formalization
The αNLI task contains two major concepts, observation and hypothesis. The observation describes the state of the scene, while the hypothesis is the imagined cause that transforms one observation into another. Piaget’s well-known theory of cognitive development tells us that our world is a dynamic system of continuous change, which involves transformations and states. Predicting the transformation is therefore the core of the αNLI task.
In detail, two observations $O_1, O_2 \in \mathcal{O}$ are given, where $\mathcal{O}$ is the space of all possible observations. The goal of the αNLI task is to predict the most plausible hypothesis $H^* \in \mathcal{H}$, where $\mathcal{H}$ is the space of all hypotheses. Note that observation $O_1$ happens earlier than $O_2$. In the traditional NLI task, the hypothesis is regarded as directly entailed from the premise. The relation between the hypothesis and the two observations in the αNLI task is entirely different:
$$H^* = \arg\max_{H_i} P(H_i \mid O_1, O_2) = \arg\max_{H_i} P(O_2 \mid O_1, H_i)\, P(H_i \mid O_1),$$
where hypothesis $H_i$ depends on the first observation $O_1$, and the last observation $O_2$ depends on $O_1$ and $H_i$. The best hypothesis $H^*$ is the one that maximizes the score of these two parts. It can be modeled by a scoring function $f$ that takes $O_1$, $O_2$ and $H_i$ as input and outputs a real value $s_i$:
$$s_i = f(O_1, O_2, H_i).$$
For easy model adaptation, αNLI in the ART dataset is originally defined as a double-choice question answering problem, whose goal is to choose the more plausible hypothesis from two candidates $H_1$ and $H_2$. From the classification perspective, it can be formalized as a discriminative task that distinguishes the category of $s_1 - s_2$: a positive value indicates that $H_1$ is more plausible than $H_2$, while a negative value indicates the opposite. We argue that, from a ranking view, this is an incomplete pairwise approach that considers only a small portion of the order in a ranking list and yields poor performance.
Therefore, we reformulate this task from the ranking perspective and adopt the learning-to-rank framework. In this framework, the observations $O_1$ and $O_2$ can be regarded as a query, and their candidate hypotheses $\{H_1, \dots, H_N\}$ can be viewed as the corresponding candidate document set, labeled with plausibility scores $\mathbf{y} = \{y_1, \dots, y_N\}$, where $N$ is the number of candidate hypotheses. The loss function is a key part of the learning-to-rank framework, and pointwise, pairwise and listwise losses are the three commonly used types. In this paper, we consider only pairwise and listwise loss functions, because a pointwise loss is just a classification loss that does not take the order of the hypotheses into consideration. Given the plausibility scores, we can form every pair of hypotheses whose plausibility scores differ and train on a pairwise loss function. We also use listwise loss functions, which treat the candidate hypotheses as an ordered list and measure the error over the whole ranking list.
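The pair construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name `make_pairs` and the example labels are hypothetical, and the only rule taken from the text is that a pair is formed when two plausibility scores differ.

```python
from itertools import combinations


def make_pairs(labels):
    """Return (i, j) index pairs such that hypothesis i is labeled more plausible than j.

    Hypothetical helper: pairs with equal labels express no preference and are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs


# Assumed plausibility labels for four candidate hypotheses of one query.
y = [1.0, 0.5, 0.5, 0.0]
pairs = make_pairs(y)
print(pairs)  # [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

With only two candidates, the list collapses to a single pair, which is the double-choice setting.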
3. Our Approach
Under the ranking formalization, we propose our learning to rank for reasoning (L2R²) approach, an implementation of the learning-to-rank framework for αNLI. The learning-to-rank framework typically consists of two main components: a scoring function that generates a real-valued score for a query-document pair, and a loss function that measures how well the predicted scores agree with the ground-truth rankings.
3.1. Scoring Function
The scoring function can be implemented in different forms. For example, both deep text matching models and pretrained language models can be employed as scoring functions.
ESIM is a strong NLI model that uses a BiLSTM to encode the tokens within each sentence and performs cross-attention on these encoded token representations; its performance on entailment NLI is close to state-of-the-art, which makes it a good choice of scoring function. ESIM takes two sentences, a premise and a hypothesis, as input. For the αNLI task, the concatenation of $O_1$ and $O_2$ is treated as the premise, and $H_i$ is treated as the hypothesis. ESIM outputs a scalar score $s_i$ indicating the relevance between them.
Scoring functions can also be based on pretrained language models such as BERT or RoBERTa. For the αNLI task, the observations and the hypothesis are first concatenated into a narrative story with a delimiter token and a sentinel token. The story is then fed into the pretrained language model to get a contextual embedding for each token. After that, mean pooling is applied to obtain the feature vector of the observations-hypothesis pair. Finally, a dense layer is stacked on this feature vector to produce the plausibility score $s_i$.
3.2. Loss Function and Inference
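Though the implementations differ, all scoring functions are optimized by minimizing the empirical risk:
$$\mathcal{L}(f) = \frac{1}{|Q|} \sum_{q \in Q} \ell(\mathbf{s}_q, \mathbf{y}_q) \qquad (1)$$
where $Q$ is the set of training queries, $\mathbf{s}_q$ and $\mathbf{y}_q$ are the predicted and ground-truth plausibility scores of the candidate hypotheses of query $q$, and $\ell$ is the loss function used to evaluate the prediction scores for a single query. Since pointwise loss functions are only suited to absolute judgments, we explore only pairwise and listwise loss functions in this work.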
Pairwise loss functions are defined on pairs of hypotheses whose labels differ, so that ranking is reduced to classification on hypothesis pairs. Here, the pairwise loss functions of Ranking SVM (Herbrich, 2000), RankNet (Burges et al., 2005) and LambdaRank (Burges et al., 2007) are used.
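The hinge loss used in Ranking SVM and the logistic (cross-entropy) loss used in RankNet both have the following form:
$$\ell(\mathbf{s}, \mathbf{y}) = \sum_{H_i \succ H_j} \phi(s_i - s_j) \qquad (2)$$
where $s_i$ is the predicted score of $H_i$; the function $\phi$ is the hinge function ($\phi(z) = \max(0, 1 - z)$) or the logistic function ($\phi(z) = \log(1 + e^{-z})$), respectively; and $H_i \succ H_j$ means that $H_i$ ranks higher (is more plausible) than $H_j$ with regard to the query $q$.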
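As an illustration of the pairwise loss (Eq. 2), the following sketch sums a surrogate function over all pairs whose labels differ. It assumes the standard hinge form max(0, 1 - z) with margin 1 and the standard logistic form log(1 + exp(-z)); the function names and the toy scores are hypothetical.

```python
import math


def hinge(z):
    """Hinge surrogate (Ranking SVM style), assuming a margin of 1."""
    return max(0.0, 1.0 - z)


def logistic(z):
    """Logistic surrogate (RankNet style)."""
    return math.log1p(math.exp(-z))


def pairwise_loss(scores, labels, phi):
    """Sum phi(s_i - s_j) over all pairs where hypothesis i is labeled above j."""
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += phi(scores[i] - scores[j])
    return total


# Toy query with three candidates; the first one is the plausible hypothesis.
scores = [2.0, 0.5, 1.0]
labels = [1, 0, 0]
print(pairwise_loss(scores, labels, hinge))
print(pairwise_loss(scores, labels, logistic))
```

The hinge variant is zero once every preferred hypothesis leads by at least the margin, while the logistic variant keeps a small positive penalty for every pair.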
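Building upon RankNet, LambdaRank uses the logistic loss and adapts it by reweighting each hypothesis pair:
$$\ell(\mathbf{s}, \mathbf{y}) = \sum_{H_i \succ H_j} |\Delta\mathrm{NDCG}_{ij}| \, \log\left(1 + e^{-(s_i - s_j)}\right) \qquad (3)$$
where $|\Delta\mathrm{NDCG}_{ij}|$ is the absolute difference between the NDCG values when the ranking positions of $H_i$ and $H_j$ are swapped,
$$|\Delta\mathrm{NDCG}_{ij}| = \frac{|G(y_i) - G(y_j)| \, |D(r_i) - D(r_j)|}{\mathrm{maxDCG}},$$
$r_i$ is the position of $H_i$ in the list ranked by the predicted scores, $G(\cdot)$ and $D(\cdot)$ are gain and discount functions respectively (commonly $G(y) = 2^{y} - 1$ and $D(r) = 1 / \log_2(1 + r)$), and maxDCG is a normalization factor per query.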
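The reweighting in LambdaRank (Eq. 3) can be made concrete with a small sketch. It assumes the common gain 2^y - 1 and discount 1 / log2(1 + rank), which may differ from the exact functions used by the authors; all function names are hypothetical.

```python
import math


def gain(y):
    """Assumed gain function: 2^y - 1."""
    return 2.0 ** y - 1.0


def discount(rank):
    """Assumed discount function for a 1-based rank."""
    return 1.0 / math.log2(1.0 + rank)


def ranks_from_scores(scores):
    """1-based rank of each item when sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def max_dcg(labels):
    """DCG of the ideal ordering, used as the per-query normalizer."""
    ideal = sorted(labels, reverse=True)
    return sum(gain(y) * discount(r) for r, y in enumerate(ideal, start=1))


def lambdarank_loss(scores, labels):
    """Logistic pairwise loss, each pair weighted by |delta NDCG| of swapping it."""
    ranks = ranks_from_scores(scores)
    norm = max_dcg(labels)
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                delta = (abs(gain(labels[i]) - gain(labels[j]))
                         * abs(discount(ranks[i]) - discount(ranks[j])) / norm)
                total += delta * math.log1p(math.exp(-(scores[i] - scores[j])))
    return total


# Toy query where the plausible hypothesis is currently ranked second.
print(lambdarank_loss([0.0, 1.0], [1, 0]))
```

Pairs whose swap would change NDCG the most, typically those near the top of the list, receive the largest weights.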
Note that the double-choice classification baselines for αNLI can be viewed as special cases of pairwise ranking methods when $N = 2$.
Listwise loss functions are defined on the basis of lists of hypotheses. In this paper, the loss functions of ListNet (Cao et al., 2007), ListMLE (Xia et al., 2008) and ApproxNDCG (Qin et al., 2010) are employed.
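In the ListNet approach, the KL divergence between the permutation probability distribution $P_{\mathbf{s}}(\pi)$ for the scoring function and the distribution $P_{\mathbf{y}}(\pi)$ for the ground truth is used as the loss function, where $\pi$ denotes a permutation. Due to the huge size of the permutation space ($N!$ permutations), ListNet reduces the training complexity by using the marginal distribution of the first position, and the KL divergence loss then becomes
$$\ell(\mathbf{s}, \mathbf{y}) = \sum_{i=1}^{N} P_{\mathbf{y}}(i) \log \frac{P_{\mathbf{y}}(i)}{P_{\mathbf{s}}(i)}, \qquad P_{\mathbf{s}}(i) = \frac{\exp(s_i)}{\sum_{k=1}^{N} \exp(s_k)} \qquad (4)$$
where $P_{\mathbf{s}}(i)$ is the probability that $H_i$ is ranked first under the predicted scores, and $P_{\mathbf{y}}(i)$ is the corresponding probability under the ground truth.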
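A minimal sketch of the top-one KL divergence loss (Eq. 4) follows. It assumes that both distributions are obtained by a softmax, over the predicted scores and over the ground-truth plausibility scores respectively, as in the original ListNet formulation; whether the authors normalize the labels in exactly this way is an assumption.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def listnet_kl_loss(scores, labels):
    """KL divergence between top-one distributions of labels and predicted scores."""
    p_true = softmax(labels)   # assumed: softmax over ground-truth plausibility
    p_pred = softmax(scores)
    return sum(pt * math.log(pt / pp) for pt, pp in zip(p_true, p_pred) if pt > 0.0)


labels = [1.0, 0.5, 0.0]
print(listnet_kl_loss([1.0, 0.5, 0.0], labels))  # scores match the labels
print(listnet_kl_loss([0.0, 0.0, 1.0], labels))  # scores favor the wrong hypothesis
```

Because the loss compares whole distributions, it is sensitive to how much more plausible one hypothesis is than another, not only to their order.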
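Different from ListNet, ListMLE uses the negative log likelihood of the ground truth permutation as the loss function,
$$\ell(\mathbf{s}, \mathbf{y}) = -\log P(\pi_{\mathbf{y}} \mid \mathbf{s}) = -\sum_{i=1}^{N} \log \frac{\exp\left(s_{\pi_{\mathbf{y}}(i)}\right)}{\sum_{k=i}^{N} \exp\left(s_{\pi_{\mathbf{y}}(k)}\right)} \qquad (5)$$
where $\pi_{\mathbf{y}}$ is the ground truth permutation and $\pi_{\mathbf{y}}(i)$ is the index of the hypothesis placed at position $i$.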
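The likelihood loss (Eq. 5) can be sketched under the Plackett-Luce model that ListMLE is built on. The function name is hypothetical, and the treatment of ties is an assumption: hypotheses with equal labels are ordered by their original index, since the text does not say how ties are broken.

```python
import math


def listmle_loss(scores, labels):
    """Negative log likelihood of the label-induced permutation (Plackett-Luce)."""
    # Ground-truth permutation: indices sorted by descending label (stable for ties).
    order = sorted(range(len(labels)), key=lambda i: -labels[i])
    ordered = [scores[i] for i in order]
    loss = 0.0
    for i in range(len(ordered)):
        tail = ordered[i:]
        m = max(tail)
        # log-sum-exp over the hypotheses not yet placed
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - ordered[i]
    return loss


print(listmle_loss([2.0, 0.0], [1, 0]))  # plausible hypothesis scored higher
print(listmle_loss([0.0, 2.0], [1, 0]))  # plausible hypothesis scored lower
```

With two candidates the loss reduces to the logistic pairwise loss on the score difference.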
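ApproxNDCG optimizes an approximation of NDCG directly, and its loss function is defined as follows:
$$\ell(\mathbf{s}, \mathbf{y}) = -\frac{1}{\mathrm{maxDCG}} \sum_{i=1}^{N} \frac{2^{y_i} - 1}{\log_2\left(1 + \hat{\pi}_i\right)}, \qquad \hat{\pi}_i = 1 + \sum_{j \neq i} \frac{1}{1 + \exp\left(-\alpha (s_j - s_i)\right)} \qquad (6)$$
where $\hat{\pi}_i$ is the approximation for $\pi_i$, which indicates the position of $H_i$ in the ranking list $\pi$, and $\alpha > 0$ is a scaling constant that controls the sharpness of the approximation.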
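The idea behind ApproxNDCG (Eq. 6) is to replace the non-differentiable rank of each hypothesis with a smooth surrogate built from sigmoids of score differences. The sketch below follows the formulation of Qin et al. (2010) with an assumed sharpness parameter `alpha` and the common gain and discount functions; the names and default values are illustrative only.

```python
import math


def approx_ranks(scores, alpha=1.0):
    """Smooth rank: 1 + sum of sigmoid(alpha * (s_j - s_i)) over the other items."""
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                r += 1.0 / (1.0 + math.exp(-alpha * (s_j - s_i)))
        ranks.append(r)
    return ranks


def approx_ndcg_loss(scores, labels, alpha=1.0):
    """Negative approximate NDCG with gain 2^y - 1 and discount 1 / log2(1 + rank)."""
    ideal = sorted(labels, reverse=True)
    max_dcg = sum((2.0 ** y - 1.0) / math.log2(1.0 + r)
                  for r, y in enumerate(ideal, start=1))
    if max_dcg == 0.0:
        return 0.0  # no plausible hypothesis, nothing to rank
    dcg = sum((2.0 ** y - 1.0) / math.log2(1.0 + r)
              for y, r in zip(labels, approx_ranks(scores, alpha)))
    return -dcg / max_dcg


labels = [1.0, 0.0, 0.5]
print(approx_ranks([3.0, 1.0, 2.0], alpha=50.0))
print(approx_ndcg_loss([3.0, 1.0, 2.0], labels, alpha=50.0))
print(approx_ndcg_loss([1.0, 3.0, 2.0], labels, alpha=50.0))
```

A larger `alpha` makes the smooth ranks closer to the true ranks but also makes the gradients sharper, which is the usual trade-off of this approximation.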
In the inference stage, since the original αNLI task is to pick the more plausible of two hypotheses, we choose the hypothesis with the highest score as the prediction.
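The inference rule and the accuracy metric amount to an argmax over candidate scores. The sketch below uses hypothetical helper names and made-up scores purely for illustration.

```python
def predict(scores):
    """Index of the candidate hypothesis with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores, gold):
    """Fraction of instances whose top-scored hypothesis is the gold one."""
    correct = sum(1 for s, g in zip(all_scores, gold) if predict(s) == g)
    return correct / len(gold)


# Three double-choice instances with made-up scores and gold indices.
all_scores = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4]]
gold = [1, 0, 0]
print(accuracy(all_scores, gold))
```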
Table 1. Accuracy (%) of our models and the baselines on the development set.
Model   | Double-Choice  | Pairwise Ranking              | Listwise Ranking
        | Classification | Logistic | Hinge | LambdaRank | KLD   | Likelihood | ApproxNDCG
ESIM    | 57.98          | 58.86    | 58.88 | 58.61      | 59.08 | 59.10      | 58.81
BERT    | 67.75          | 72.26    | 72.45 | 71.34      | 72.26 | 73.30      | 71.80
RoBERTa | 85.64          | 87.86    | 87.92 | 87.79      | 88.45 | 88.12      | 86.55

Table 2. Accuracy (%) on the blind test set.
Model                 | Accuracy
Random                | 50.41
BERT                  | 63.62
BERT                  | 66.75
RoBERTa               | 83.91
McQueen + RoBERTa     | 84.18
HighOrderGN + RoBERTa | 82.04
egel (Dynamics)       | 85.95
L2R² (RoBERTa + KLD)  | 86.81
Human                 | 92.90
4. Experiments
In this section, we report experimental results on a public dataset to evaluate our proposed approach.
4.1. Experimental Settings
We conduct our experiments on the ART dataset (Bhagavatula et al., 2020). ART is the first large-scale benchmark dataset for abductive reasoning in narrative texts. It consists of 20K pairs of observations with over 200K explanatory hypotheses, where the observations are drawn from a collection of manually curated stories and the hypotheses are collected by crowdsourcing. In addition, the candidate hypotheses for each narrative context in the test sets are selected through an adversarial filtering algorithm that uses BERT as the adversary.
For our approach, the data need to be reorganized into a ranking form. Concretely, we merge the original instances sharing the same observation pair $(O_1, O_2)$ into a new instance $(O_1, O_2, \mathbf{H})$, where $\mathbf{H}$ is the set of candidate hypotheses for the given observation pair. In the ART training set, there are on average 13.41 hypotheses for each observation pair, of which 4.05 are plausible. We further employ a heuristic labeling strategy to construct the ground-truth plausibility scores $\mathbf{y}$ for $\mathbf{H}$, so that each hypothesis $H_j$ receives a plausibility score $y_j$.
To demonstrate the effectiveness of our approach, we develop 18 models based on three scoring functions, i.e. ESIM, BERT and RoBERTa, combined with six ranking loss functions: Logistic, the loss (Eq. 2) used in RankNet; Hinge, the loss (Eq. 2) used in Ranking SVM; LambdaRank, the loss (Eq. 3) used in LambdaRank; KLD, the loss (Eq. 4) used in ListNet; Likelihood, the loss (Eq. 5) used in ListMLE; and ApproxNDCG, the loss (Eq. 6) used in ApproxNDCG.
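The regrouping of double-choice instances into ranking form, described at the start of this subsection, can be sketched as follows. The tuple layout `(o1, o2, h1, h2, label)` and the function name are assumptions made for illustration, and the heuristic that assigns plausibility scores is deliberately not shown because its details are not reproduced here.

```python
from collections import defaultdict


def group_by_observations(instances):
    """Merge double-choice instances that share an observation pair.

    Each instance is assumed to be a tuple (o1, o2, h1, h2, label); the result
    maps (o1, o2) to the list of distinct candidate hypotheses seen for it.
    """
    grouped = defaultdict(list)
    for o1, o2, h1, h2, _label in instances:
        for hypothesis in (h1, h2):
            if hypothesis not in grouped[(o1, o2)]:
                grouped[(o1, o2)].append(hypothesis)
    return dict(grouped)


# Made-up instances: two of them share the same observation pair.
instances = [
    ("room was tidy", "room was a mess", "a thief broke in", "nothing happened", 1),
    ("room was tidy", "room was a mess", "a cat slipped in", "nothing happened", 1),
    ("sky was clear", "streets were wet", "it rained", "the sun came out", 1),
]
queries = group_by_observations(instances)
print(len(queries))
```

Each key of the resulting dictionary plays the role of a query, and its list of hypotheses plays the role of the candidate document set.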
Three double-choice classification models are selected as our baselines. They have the same structures as the aforementioned scoring functions; the only difference is that they are trained on the original data with the cross-entropy loss function.
For implementation details, we employ Adam as the optimizer and use early stopping to select the best model. The ESIM-based models use an LSTM hidden size of 300 and are trained for at most 64 epochs with a batch size of 32 and a learning rate of 4e-4. The models based on pretrained language models are fine-tuned for at most 10 epochs with a batch size of 4 and a learning rate of 5e-6. The evaluation metric is accuracy, as defined in (Bhagavatula et al., 2020).
4.2. Experimental Results
Table 1 shows the experimental results of our models and the baselines on the development set. Our best model was evaluated officially on the test set, where it achieved state-of-the-art accuracy (Table 2).
We summarize our observations as follows. (1) All 18 versions of our approach improve the performance on the abductive reasoning task over the corresponding classification baselines, which indicates that the ranking perspective is better than classification. (2) Pairwise models perform better than classification models, and most listwise models perform better than pairwise models. The former gain can be attributed to the full version of pairwise training, whereas the latter gain from pairwise to listwise is due to global reasoning over the entire candidate set. (3) BERT-based ranking models have the largest gains, up to 5.55 points (about 8.2% relative improvement) over the corresponding baseline. Because BERT was used as the adversary in dataset construction, this substantial improvement suggests that our approach is more robust to adversarial inputs. (4) The loss functions that optimize the NDCG metric, i.e. LambdaRank and ApproxNDCG, perform worse than the others, mainly due to the gap between the NDCG metric used during training and the accuracy metric used during testing.
4.3. Detailed Analyses
To further illustrate the rationality of our approach, Figure 2 shows two normalized score distributions on the more plausible hypotheses in the development set candidate pairs, where the scores are predicted by two models that both use BERT as the scoring function, one trained with the classification loss and the other with the listwise likelihood loss. The area under each curve in the right part (probability > 0.5) can be viewed as the accuracy value. As shown in the figure, the classification model separates the pairs of candidate hypotheses with a great disparity, with probabilities close to either 0 or 1, whereas the L2R² model is able to judge the borderline instances whose two candidates are competitive with each other. Consider the sampled borderline instance at the bottom of Figure 2, where both hypotheses are likely to happen but one is slightly more plausible. The L2R² model makes the right choice and outputs two competitive probabilities for the two hypotheses, 0.5891 vs 0.4109, whereas the classification model not only fails to identify the better one but also outputs probabilities of 0.0024 and 0.9976, separated by a very large gap. That is to say, the ranking view in the L2R² approach is a more reasonable way to model the abductive reasoning task.
5. Conclusion
In the αNLI task, every hypothesis has some chance of happening, so the task is naturally treated as a ranking problem. From the ranking perspective, L2R² is proposed for the αNLI task under the learning-to-rank framework, which contains a scoring function and a loss function. The experiments on the ART dataset show that reformulating the αNLI task as ranking brings improvements and reaches state-of-the-art performance on the public leaderboard.
Acknowledgements.
This work was supported by Beijing Academy of Artificial Intelligence (BAAI) under Grants BAAI2020ZJ0303, the National Natural Science Foundation of China (NSFC) under Grants No. 61773362, and 61906180, the Youth Innovation Promotion Association CAS under Grants No. 2016102, the National Key R&D Program of China under Grants No. 2016QY02D0405, the Tencent AI Lab RhinoBird Focused Research Program (No. JR202033).
References
Bhagavatula et al. (2020). Abductive commonsense reasoning. In ICLR.
Burges et al. (2005). Learning to rank using gradient descent. In ICML, pp. 89–96.
Burges et al. (2007). Learning to rank with nonsmooth cost functions. In NIPS, pp. 193–200.
Cao et al. (2007). Learning to rank: from pairwise approach to listwise approach. In ICML, pp. 129–136.
Chen et al. (2017). Enhanced LSTM for natural language inference. In ACL, pp. 1657–1668.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
Herbrich (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pp. 115–132.
Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Peirce (1965). Principles of philosophy and elements of logic. Vol. 1, Belknap Press of Harvard University Press.
Qin et al. (2010). A general approximation framework for direct optimization of information retrieval measures. Information Retrieval 13 (4), pp. 375–397.
Xia et al. (2008). Listwise approach to learning to rank: theory and algorithm. In ICML, pp. 1192–1199.