Technical report on Conversational Question Answering

09/24/2019 ∙ by Ying Ju, et al. ∙ Microsoft Zhuiyi Shenzhen Chaoyi Technology

Conversational Question Answering is a challenging task since it requires understanding of conversational history. In this project, we propose a new system, RoBERTa + AT + KD, which involves rationale tagging multi-task learning, adversarial training, knowledge distillation and a linguistic post-processing strategy. Our single model achieves an F1 of 90.4 on the CoQA test set without data augmentation, outperforming the current state-of-the-art single model by 2.6 F1.




1 Introduction

Conversational question answering (CQA) is a QA task involving comprehension of both the passage and the historical QAs. It was proposed to mimic the way humans seek information in conversations. Recently, several CQA datasets have been proposed, such as CoQA Reddy et al. (2018), QuAC Choi et al. (2018) and QBLink Elgohary et al. (2018).

In this paper, we propose RoBERTa + AT + KD, a system featuring Adversarial Training (AT) and Knowledge Distillation (KD) for CQA tasks. Empirical results on the CoQA dataset show the effectiveness of our system, which outperforms the previous best model by a 2.6% absolute improvement in F1 score.

The contributions of our paper are as follows:

  • We propose a general solution to fine-tuning pre-trained models for CQA tasks by (1) rationale tagging multi-task learning, which utilizes the valuable information in the answer’s rationale; (2) adversarial training Miyato et al. (2016), which increases the model’s robustness to perturbations; and (3) knowledge distillation Furlanello et al. (2018), which makes use of additional training signals from well-trained models.

  • We also analyze the limitation of extractive models including our system. To figure out the headroom of extractive models for improvement, we compute the proportion of QAs with free-form answers and estimate the upper bound of extractive models in F1 score.

  • Our system achieves a new state-of-the-art result on the CoQA dataset without data augmentation.

2 Related Work

Machine Reading Comprehension

Machine Reading Comprehension (MRC) has become an important topic in natural language processing. Existing datasets can be classified as single-turn or multi-turn according to whether each question depends on the conversation history. Many MRC models have been proposed to tackle single-turn QA on the SQuAD dataset, such as BiDAF Seo et al. (2016), DrQA Chen et al. (2017), R-Net Wang et al. (2017) and QANet Yu et al. (2018). For multi-turn QA, existing models include FlowQA Huang et al. (2018) and SDNet Zhu et al. (2018). FlowQA proposed an alternating parallel processing structure and incorporated intermediate representations generated during the process of answering previous questions. SDNet introduced an innovative contextualized attention-based deep neural network to fuse context into traditional MRC models.

Pretrained Model

Pretrained language models, such as GPT Radford et al. (2018), BERT Devlin et al. (2018), XLNet Yang et al. (2019) and RoBERTa Liu et al. (2019), have brought significant performance gains to many NLP tasks, including machine reading comprehension. BERT is pretrained on “masked language model” and “next sentence prediction” tasks. With some modifications to the BERT pretraining procedure, RoBERTa can match or exceed the performance of all post-BERT methods. Many recent models targeting CoQA are based on BERT, such as Google SQuAD 2.0 + MMFT, ConvBERT and BERT + MMFT + ADA, but so far no model has been based on RoBERTa.

3 Methods

In this section, we first introduce a baseline model based on RoBERTa (a Robustly optimized BERT pretraining Approach) for the CoQA dataset, and then adopt several methods to improve its performance.

3.1 Baseline: RoBERTa Adaptation to CoQA

Unlike other QA datasets, questions in CoQA are conversational. Since every question after the first one depends on the conversation history, each question is appended to its history QAs, similar to Reddy et al. (2018). If the question in the $k$-th turn is $Q_k$ and $A_j$ is the answer to $Q_j$, the reformulated question is defined as:

$$Q_k^{*} = \{Q_1, A_1, \ldots, Q_{k-1}, A_{k-1}, Q_k\}$$

A special symbol is added before each question and another before each answer. To make sure the length of $Q_k^{*}$ is within the limit on the number of question tokens, we truncate it from the end. In our experiment, the pre-trained model RoBERTa is employed as our baseline model, which takes a concatenation of two segments as input. Specifically, given a context $C$, the input for RoBERTa is [CLS] $Q_k^{*}$ [SEP] $C$ [SEP].
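
The input construction described above can be sketched as follows. This is only an illustration under stated assumptions: the marker strings `<q>` and `<a>` are hypothetical placeholders (the actual symbols are not shown in this text), whitespace splitting stands in for BPE tokenization, and truncation is assumed to keep the most recent tokens so that the current question always survives.

```python
from typing import List, Tuple

# Hypothetical marker strings; the report's actual symbols are not shown here.
Q_MARK, A_MARK = "<q>", "<a>"


def build_question(history: List[Tuple[str, str]], question: str,
                   max_tokens: int) -> List[str]:
    """Prepend the history QAs to the current question, then truncate."""
    tokens: List[str] = []
    for past_q, past_a in history:
        tokens += [Q_MARK] + past_q.split() + [A_MARK] + past_a.split()
    tokens += [Q_MARK] + question.split()
    # Assumption: keep the last max_tokens tokens (drop the oldest history).
    return tokens[-max_tokens:]


def build_input(question_tokens: List[str], context: str) -> List[str]:
    """Lay out the two segments as [CLS] question [SEP] context [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + context.split() + ["[SEP]"]


history = [("Who had a cat ?", "Jessica")]
q_tokens = build_question(history, "What color was it ?", max_tokens=8)
model_input = build_input(q_tokens, "Jessica had a white cat .")
print(model_input)
```

With a budget of 8 question tokens, the oldest part of the history is dropped while the current question is kept whole.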

The answers in the CoQA dataset can be free-form text, Yes, No or Unknown. Except for the Unknown answers, each answer has a rationale, an extractive span in the passage. Considering that, we adopt an extractive approach with Yes/No/Unk classification to build the output layer on top of RoBERTa. First, the text span with the highest F1 score in the rationale is selected as the gold answer for training. Then we use an FC layer to get the start logits $l^s$ and end logits $l^e$. For Yes/No/Unk classification, an FC layer is applied to the RoBERTa pooled output $h^{pool}$ to obtain three logits $l^y$, $l^n$ and $l^u$. The objective function for our baseline model is defined as:

$$p^s = \mathrm{softmax}([l^s, l^y, l^n, l^u]), \qquad p^e = \mathrm{softmax}([l^e, l^y, l^n, l^u])$$

$$L_{Base} = -\frac{1}{2N} \sum_{i=1}^{N} \left( \log p^s_{y_i^s} + \log p^e_{y_i^e} \right)$$

where $y_i^s$ and $y_i^e$ are the starting position and ending position in example $i$ respectively, and the total number of examples is $N$.
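
The gold-span selection step (choosing the rationale span with the highest F1 against the free-form answer) can be sketched as below. This is a minimal illustration, assuming whitespace tokens and a plain word-overlap F1; the function names are hypothetical, and the brute-force search is quadratic in the rationale length.

```python
from collections import Counter
from typing import List, Tuple


def f1(pred: List[str], gold: List[str]) -> float:
    """Word-overlap F1 between two token lists."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_span(rationale: List[str], answer: List[str]) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best rationale span, end inclusive."""
    best = (0, 0, -1.0)
    for start in range(len(rationale)):
        for end in range(start, len(rationale)):
            score = f1(rationale[start:end + 1], answer)
            if score > best[2]:  # strict '>' keeps the earliest span on ties
                best = (start, end, score)
    return best


rationale = "while driving off in the cab".split()
answer = "drove off in the cab".split()
start, end, score = best_span(rationale, answer)
print(rationale[start:end + 1], round(score, 3))
```

Here the selected span is "off in the cab": it shares four of the five answer tokens with perfect precision, which beats the longer span that also includes "driving".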

3.2 Rationale Tagging Multi Task

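As mentioned above, every answer except Unknown has its rationale. To utilize this valuable information, we add a new task, Rationale Tagging, in which the model predicts whether each token of the paragraph should be included in the rationale. In other words, tokens in the rationale are labeled 1 and all others are labeled 0. For Unk questions, all tokens are labeled 0.

Therefore, besides predicting the boundary and classification, we also add an extra FC layer to compute the probability for rationale tagging:

$$p_t^r = \mathrm{sigmoid}\left( w_2 \, \mathrm{ReLU}(W_1 h_t) \right)$$

where $h_t \in \mathbb{R}^{d}$ is RoBERTa’s output vector for the $t$-th token, $W_1 \in \mathbb{R}^{d \times d}$ and $w_2 \in \mathbb{R}^{d}$. The rationale loss is defined as the averaged cross entropy:

$$L_{RT_i} = -\frac{1}{T} \sum_{t=1}^{T} \left( y_t^r \log p_t^r + (1 - y_t^r) \log (1 - p_t^r) \right), \qquad L_{RT} = \frac{1}{N} \sum_{i=1}^{N} L_{RT_i}$$

Here $T$ is the number of tokens, $y_t^r$ is the rationale label for the $t$-th token, $L_{RT_i}$ is the rationale loss for the $i$-th example and $N$ is the number of examples. The model is trained by jointly minimizing the two objective functions:

$$L = L_{Base} + \beta L_{RT}$$

where $L_{Base}$ is the baseline objective of Section 3.1 and $\beta$ is a hyper-parameter weighting the rationale loss.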
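
The labeling rule and the averaged cross entropy for one example can be sketched as follows. This is a minimal sketch in plain Python: the function names are hypothetical, `start < 0` is an assumed convention for Unknown questions, and the probabilities are toy values rather than model outputs.

```python
import math
from typing import List


def rationale_labels(num_tokens: int, start: int, end: int) -> List[int]:
    """1 for tokens inside the rationale [start, end], 0 elsewhere.

    Assumed convention: start < 0 marks an Unknown question (all zeros).
    """
    if start < 0:
        return [0] * num_tokens
    return [1 if start <= t <= end else 0 for t in range(num_tokens)]


def rationale_loss(probs: List[float], labels: List[int],
                   eps: float = 1e-12) -> float:
    """Binary cross entropy averaged over the tokens of one example."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def joint_loss(base_loss: float, rt_loss: float, beta: float) -> float:
    """Weighted sum of the baseline loss and the rationale loss."""
    return base_loss + beta * rt_loss


labels = rationale_labels(num_tokens=5, start=1, end=2)
loss = rationale_loss([0.1, 0.9, 0.8, 0.2, 0.1], labels)
print(labels, round(loss, 4))
```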

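In addition, rationales can be used to assist the classification of Yes/No/Unk questions. The process works as follows. We first compute the rationale sequence output $h_t^r$ by multiplying $p_t^r$ and $h_t$, the rationale probability and the RoBERTa output vector of the $t$-th token. Then an attention layer over $h_t^r$ is used to obtain the rationale representation $h^r$:

$$a_t = \mathrm{softmax}_t \left( w_{a2} \, \mathrm{ReLU}(W_{a1} h_t^r) \right), \qquad h^r = \sum_{t=1}^{T} a_t h_t^r$$

where $W_{a1}$, $w_{a2}$ are learned parameters. Finally, when producing the three logits $l^y$, $l^n$ and $l^u$ for Yes/No/Unk, we replace the pooled output $h^{pool}$ with $[h^{pool}, h^r]$, which is the concatenation of RoBERTa’s pooled output and the rationale representation $h^r$.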

3.3 Adversarial and Virtual Adversarial Training

Adversarial training Goodfellow et al. (2014) is a regularization method for neural networks that makes the model more robust to adversarial noise and thereby improves its generalization ability.

Adversarial Training (AT)

In this work, adversarial examples are created by making small perturbations to the embedding layer. Assuming that $v_w$ is the embedding vector of word $w$ and $\hat{\theta}$ represents the current parameters of the model, the adversarial embedding vector $v_w^{*}$ Miyato et al. (2016) is:

$$g_w = -\nabla_{v_w} \log p(y \mid v_w; \hat{\theta}), \qquad v_w^{*} = v_w + \epsilon \, \frac{g_w}{\lVert g_w \rVert_2}$$

where $y$ is the gold label and $\epsilon$ is a hyperparameter scaling the magnitude of the perturbation. Then, we compute the adversarial loss as:

$$L_{AT}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid V_i^{*}; \theta)$$

where $V^{*}$ is the adversarial embedding matrix.
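
To make the perturbation step concrete, here is a toy sketch that moves an embedding by a fixed length along the normalized gradient of the loss. It is not the report's implementation: a hand-written logistic classifier with an analytic gradient stands in for RoBERTa, and all names and values are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def nll(v: List[float], w: List[float], y: int) -> float:
    """Negative log-likelihood of label y under p(y=1|v) = sigmoid(w . v)."""
    p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)))
    return -math.log(p if y == 1 else 1.0 - p)


def loss_gradient(v: List[float], w: List[float], y: int) -> List[float]:
    """Analytic gradient of the NLL with respect to the embedding v."""
    p = sigmoid(sum(wi * vi for wi, vi in zip(w, v)))
    return [(p - y) * wi for wi in w]


def adversarial_embedding(v: List[float], w: List[float], y: int,
                          epsilon: float) -> List[float]:
    """Shift v by epsilon along the L2-normalized loss gradient."""
    g = loss_gradient(v, w, y)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(v)
    return [vi + epsilon * gi / norm for vi, gi in zip(v, g)]


v, w, y = [0.5, -1.0, 2.0], [1.0, 0.5, -0.25], 1
v_adv = adversarial_embedding(v, w, y, epsilon=0.1)
clean_loss, adv_loss = nll(v, w, y), nll(v_adv, w, y)
print(round(clean_loss, 4), round(adv_loss, 4))
```

The perturbed embedding stays within distance epsilon of the original, yet yields a higher loss; training on that loss is what encourages robustness.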

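Virtual Adversarial Training (VAT)

Virtual adversarial training is similar to AT, but it uses unsupervised adversarial perturbations. To obtain the virtual adversarial perturbation, we first add Gaussian noise to the embedding vector $v_w$ of word $w$:

$$v_w' = v_w + \xi d_w$$

where $\xi$ is a hyperparameter and $d_w \sim \mathcal{N}(0, I)$. Then the gradient with respect to the KL divergence between $p(\cdot \mid v_w; \hat{\theta})$ and $p(\cdot \mid v_w'; \hat{\theta})$, with $\hat{\theta}$ the current model parameters, is estimated as:

$$g_w = \nabla_{v_w'} D_{KL}\left( p(\cdot \mid v_w; \hat{\theta}) \,\Vert\, p(\cdot \mid v_w'; \hat{\theta}) \right)$$

Next, similar to adversarial training, the adversarial perturbation, scaled by the hyperparameter $\epsilon$, is added to the word embedding vector:

$$v_w^{*} = v_w + \epsilon \, \frac{g_w}{\lVert g_w \rVert_2}$$

Lastly, the virtual adversarial loss is computed as:

$$L_{VAT}(\theta) = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\left( p(\cdot \mid V_i; \hat{\theta}) \,\Vert\, p(\cdot \mid V_i^{*}; \theta) \right)$$

where $V$ is the original embedding matrix, $V^{*}$ is the adversarial embedding matrix and $N$ is the number of examples.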

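Loss Function

In this work, the total loss is simply the sum of all the losses:

$$L_{Total} = L + L_{AT} + L_{VAT}$$

where $L$ is the joint multi-task objective of Section 3.2, and $L_{AT}$ and $L_{VAT}$ are the adversarial and virtual adversarial losses.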

3.4 Knowledge Distillation

Knowledge Distillation (KD) Furlanello et al. (2018) transfers “knowledge” from one machine learning model (the teacher) to another (the student) by using the teacher’s output as the student’s training objective. Even when the teacher and student share the same architecture and size, the student can still outperform its teacher. KD can be used for both single-task and multi-task models Clark et al. (2019).

Teacher Model

In this work, the teacher model is trained using the methods mentioned above; its objective function is defined in Equation 19.

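Student Model

Knowledge distillation uses the teacher’s output probability as an extra supervised label to train student models. In our work, we employ several teacher models with different random seeds to compute the teacher label $\bar{y}$ for an input $x$:

$$\bar{y} = \frac{1}{M} \sum_{m=1}^{M} p(\cdot \mid x; \theta_m^{T})$$

where $\theta_m^{T}$ denotes the parameters of the $m$-th teacher model and $M$ is the total number of teacher models. The KD loss is defined as the cross-entropy between $\bar{y}$ and the student prediction $p(\cdot \mid x; \theta^{S})$:

$$L_{KD} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c} \bar{y}_{i,c} \log p(c \mid x_i; \theta^{S})$$

where $\theta^{S}$ denotes the student model parameters. The total loss for the student model is defined as:

$$L_{Student} = L_{Total} + L_{KD}$$

where $L_{Total}$ is the objective used to train the teacher (Equation 19).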
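
The teacher-label averaging and the cross-entropy KD loss for a single example can be sketched as follows. This is a toy sketch in plain Python: the probabilities and logits are made-up values and the function names are hypothetical.

```python
import math
from typing import List


def average_teacher_probs(teacher_probs: List[List[float]]) -> List[float]:
    """Average the output distributions of several teacher models."""
    count = len(teacher_probs)
    return [sum(column) / count for column in zip(*teacher_probs)]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def kd_loss(teacher_label: List[float], student_logits: List[float]) -> float:
    """Cross entropy between the soft teacher label and the student output."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_label, student_probs))


# Two hypothetical teachers trained with different random seeds.
teachers = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher_label = average_teacher_probs(teachers)
loss = kd_loss(teacher_label, [2.0, 1.0, 0.0])
print([round(p, 3) for p in teacher_label], round(loss, 4))
```

Unlike a one-hot gold label, the averaged teacher label keeps probability mass on the non-argmax classes, which is the extra training signal the student receives.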

3.5 Post-Processing

Since our model is extractive, it cannot solve multiple-choice questions even when it can extract the correct span. There are also multiple-choice questions whose options are the same as the word that our model finds but in a different word form. For instance, the options are ’walk’ and ’ride’ while the span that our model extracts is ’walked’.

A post-processing procedure based on word similarity is applied to alleviate this problem. First, legal options are extracted from questions using linguistic rules. Second, word embeddings of the options and the answer tokens are prepared. Third, we compute the cosine similarity between each option and each answer token. Finally, the option with the highest similarity is chosen as the final answer:

$$\hat{o} = \arg\max_{o \in O} \max_{a \in A} \cos(o, a)$$

where $O$ and $A$ are the sets of word embeddings of the options and the answer tokens, and $\cos(o, a)$ represents their cosine similarity.
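
The option-matching step can be sketched as below. The embedding vectors are hand-made toy values (a real system would load pretrained word embeddings), the rule-based option extraction is not shown, and taking the best option/token pair is an assumed reading of "highest similarity".

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_option(options: List[str], answer_tokens: List[str],
                  emb: Dict[str, List[float]]) -> str:
    """Return the option most similar to any token of the extracted span."""
    best_option, best_sim = options[0], -2.0  # cosine is never below -1
    for option in options:
        for token in answer_tokens:
            if option in emb and token in emb:
                sim = cosine(emb[option], emb[token])
                if sim > best_sim:
                    best_option, best_sim = option, sim
    return best_option


# Toy vectors, purely illustrative.
emb = {
    "walk": [1.0, 0.1, 0.0],
    "walked": [0.9, 0.2, 0.1],
    "ride": [0.0, 1.0, 0.2],
    "home": [0.1, 0.1, 1.0],
}
final_answer = choose_option(["walk", "ride"], ["walked", "home"], emb)
print(final_answer)
```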

3.6 Ensemble

The ensemble output is generated according to the ensemble logits $\bar{l}$, which are the average of the output logits $l_k$ from the models selected for the ensemble:

$$\bar{l} = \frac{1}{K} \sum_{k=1}^{K} l_k$$

where $K$ is the ensemble size, i.e. the number of models used for the ensemble.
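
Logit averaging itself is a one-liner; the sketch below uses made-up logits from three hypothetical models over three candidate positions.

```python
from typing import List


def ensemble_logits(model_logits: List[List[float]]) -> List[float]:
    """Element-wise average of the logits produced by each model."""
    size = len(model_logits)
    return [sum(column) / size for column in zip(*model_logits)]


logits = [
    [1.0, 3.0, 0.5],
    [2.0, 1.0, 0.5],
    [0.0, 2.5, 1.0],
]
avg = ensemble_logits(logits)
best_index = max(range(len(avg)), key=avg.__getitem__)
print(avg, best_index)
```

Note that the second model alone would pick position 0, while the averaged logits pick position 1.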

The performance of our ensemble strategy relies heavily on the proper selection of models, which is challenging. Given constraints on computational resources, the ensemble size is also limited. A genetic algorithm (GA), a kind of stochastic search algorithm that does not require gradient information, is used to search for a combination of models that maximizes performance while obeying the constraint on ensemble size. GAs are inherently parallel and tend to approximate the globally optimal solution. Therefore, they are widely used in combinatorial optimization and operations research problems Deb (1998); Mühlenbein et al. (1988).

In our experiments, the genetic algorithm converges within 200 generations. Our best ensemble contains only 9 models and reaches an F1 of 91.5 without post-processing, while simply averaging all 68 candidate models results in a lower F1 of 91.2.
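
A size-constrained subset search with a genetic algorithm might look like the sketch below. The report does not describe its operators, so everything here is an assumption: bit-mask individuals, truncation selection, one-point crossover, single-bit mutation and a repair step that enforces the size limit. The fitness function is a toy stand-in; in practice it would be the dev-set F1 of the averaged logits of the selected models.

```python
import random
from typing import Callable, List


def repair(mask: List[int], max_size: int, rng: random.Random) -> List[int]:
    """Enforce 1 <= number of selected models <= max_size (in place)."""
    selected = [i for i, bit in enumerate(mask) if bit]
    while len(selected) > max_size:
        mask[selected.pop(rng.randrange(len(selected)))] = 0
    if not selected:
        mask[rng.randrange(len(mask))] = 1
    return mask


def genetic_search(n_models: int, max_size: int,
                   fitness: Callable[[List[int]], float],
                   generations: int = 50, pop_size: int = 30,
                   seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    pop = [repair([rng.randint(0, 1) for _ in range(n_models)], max_size, rng)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection (elitist)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_models)   # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_models)] ^= 1  # single-bit mutation
            children.append(repair(child, max_size, rng))
        pop = parents + children
    return max(pop, key=fitness)


# Hypothetical per-model scores; the sum is only a toy fitness.
DEV_SCORES = [0.2, 0.9, 0.4, 0.8, 0.1, 0.7, 0.3, 0.5]


def toy_fitness(mask: List[int]) -> float:
    return sum(score for score, bit in zip(DEV_SCORES, mask) if bit)


best = genetic_search(len(DEV_SCORES), max_size=3, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next.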


Figure 1: F1 score of the best and worst individual of each generation.

Model                               | In-domain | Out-of-domain | Overall
Bert-Large Baseline                 | 82.6      | 78.4          | 81.4
BERT with History Augmented Query   | 82.7      | 78.6          | 81.5
Bert + Answer Verification          | 83.8      | 81.9          | 82.8
BERT + MMFT + ADA                   | 86.4      | 81.9          | 85.0
ConvBERT                            | 87.7      | 85.4          | 86.8
Google SQuAD 2.0 + MMFT             | 88.5      | 86.0          | 87.8
Our model                           | 90.9      | 89.2          | 90.4
Google SQuAD 2.0 + MMFT (Ensemble)  | 89.9      | 88.0          | 89.4
Our model (Ensemble)                | 91.4      | 89.2          | 90.7
Human                               | 89.4      | 87.4          | 88.8
Table 1: CoQA test set results, which are scored by the CoQA evaluation server. All the results are obtained from

4 Experiments

4.1 Evaluation Metrics

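Similar to SQuAD, CoQA uses the macro-average F1 score of word overlap as its evaluation metric. In evaluation, to compute the performance of a predicted answer $a_p$, CoQA provides $n$ human answers as reference answers ($a_1, \ldots, a_n$) and the final F1 score is defined as:

$$F1_{final} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} F1(a_p, a_j)$$

That is, each reference is left out in turn, and the best F1 against the remaining references is averaged.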
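
The multi-reference scoring of a single turn can be sketched as follows. This is a simplified illustration of the leave-one-out averaging used by the CoQA evaluation; the official script additionally normalizes answers (for example punctuation and articles), which is omitted here, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def token_f1(pred: str, gold: str) -> float:
    """Word-overlap F1 between two lower-cased, whitespace-split strings."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def multi_reference_f1(pred: str, references: List[str]) -> float:
    """Leave each reference out in turn; average the best remaining F1."""
    if len(references) == 1:
        return token_f1(pred, references[0])
    scores = []
    for i in range(len(references)):
        others = references[:i] + references[i + 1:]
        scores.append(max(token_f1(pred, ref) for ref in others))
    return sum(scores) / len(scores)


score = multi_reference_f1("white cat", ["white", "white cat"])
print(round(score, 4))
```

With two references, the prediction scores 1.0 when "white" is left out and 2/3 when "white cat" is left out, giving 5/6 on average.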

4.2 Implementation Details

The implementation of our model is based on the PyTorch implementation of RoBERTa. We use AdamW Loshchilov and Hutter (2017) as our optimizer with a learning rate of 5e-5 and a batch size of 48. The number of epochs is set to 2. A linear warmup over the first 6% of steps followed by a linear decay to 0 is used. To avoid the exploding gradient problem, the gradient norm is clipped to 1. All texts are tokenized using Byte-Pair Encoding (BPE) Sennrich et al. (2015) and are truncated to sequences no longer than 512 tokens. Layer-wise decay rate is 0.9 and loss weight .
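
The learning-rate schedule and the layer-wise decay can be written out as plain functions. This is a sketch under assumptions: the function names are hypothetical, and layer-wise decay is assumed to mean that the top layer uses the base learning rate while each layer below it is scaled by a further factor of 0.9.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float = 5e-5,
               warmup_ratio: float = 0.06) -> float:
    """Linear warmup over the first warmup_ratio of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def layer_lr(layer_index: int, num_layers: int, base_lr: float = 5e-5,
             decay: float = 0.9) -> float:
    """Assumed scheme: the top layer gets base_lr, lower layers decay geometrically."""
    return base_lr * decay ** (num_layers - 1 - layer_index)


total = 1000
print([lr_at_step(s, total) for s in (0, 30, 60, 530, 1000)])
print(layer_lr(23, 24), layer_lr(0, 24))
```

With 1000 total steps, the rate climbs to 5e-5 at step 60 and then falls linearly to 0 at the final step.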

4.3 Ablation

To assess the impact of each method we apply, we perform a series of analyses and ablations. The methods are added one by one to the baseline model. As shown in Table 2, the Rationale Tagging multi-task and Adversarial Training bring relatively significant improvements (+0.5 and +0.7 F1 respectively). With all the methods in Table 2, the F1 score of our single model on the dev set reaches 91.3.

Model                       | In-domain
Baseline Model              | 89.5
+ Rationale Tagging Task    | 90.0
+ Adversarial Training      | 90.7
+ Knowledge Distillation    | 91.0
+ Post-Processing           | 91.3
Ensemble                    | 91.8
Table 2: Ablation study on the CoQA dev-set

4.4 Results

We submit our system to the public CoQA leaderboard and compare its performance with others on the CoQA test set in Table 1. Our system outperforms the current state-of-the-art single model by a 2.6% absolute improvement in F1 score (90.4 vs. 87.8) and scores 90.7 points after ensembling. In addition, while most of the top systems rely on additional supervised data, our system does not use any extra training data.

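Model                                                        | F1 bound
Our system                                                   | 91.8
Extractive upper bound (first human answer as reference)     | 93.0 (+1.2)
Extractive upper bound (all 4 human answers as references)   | 95.1 (+3.3)
Table 3: Upper bound analysis on the CoQA dev-set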
Question | Rationale | Ground truth | Our Answer | Error Type
How many factors contribute to endemism? | Physical, climatic, and biological factors | Three | Physical, climatic, and biological factors | Counting
Who told her? | Your mother told me that you had a part-time job | Sandy’s mother | mother | Pronoun Paraphrasing
What did she do after waving thanks? | while driving off in the cab. | Drove off in the cab | driving off in cab | Tense Paraphrasing
When? | Not till I spoke to him | When he spoke to him | spoke to him | Conjunction Paraphrasing
Table 4: Some typical types of bad cases.

5 Analysis

In this section, we analyze the limitations of extractive models, including our system. To figure out the headroom of extractive models for improvement, we first compute the proportion of free-form answers and then estimate the upper bound of extractive models in F1 score. Finally, we analyze some typical bad cases of our system.

5.1 CoQA

CoQA Reddy et al. (2018) is a typical conversational question answering dataset which contains 127k questions with answers obtained from 8k conversations and text passages from seven diverse domains.

An important feature of the CoQA dataset is that the answers to some of its questions can be free-form. As a result, around 33.2% of the answers do not have an exact-match subsequence in the given passage. Among these answers, Yes and No constitute 78.8%. The next largest group, 14.3%, consists of edits to a text span to improve fluency. The rest includes 5.1% for counting and 1.8% for selecting a choice from the question.
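
As a worked example, the percentages above can be converted into shares of all answers. The numbers come from the paragraph; the variable names are only illustrative.

```python
free_form_share = 0.332  # answers with no exact-match span in the passage

# Breakdown within the free-form answers, as stated in the text.
breakdown = {
    "yes_no": 0.788,
    "edited_span": 0.143,
    "counting": 0.051,
    "choice": 0.018,
}

# Share of each group among all answers.
overall = {name: free_form_share * share for name, share in breakdown.items()}
for name, value in overall.items():
    print(f"{name}: {value:.1%}")

# Free-form answers that are not Yes/No, as a share of all answers.
non_yes_no = free_form_share * (1 - breakdown["yes_no"])
print(f"non yes/no free-form: {non_yes_no:.1%}")
```

So Yes/No answers make up about 26% of all answers, while the remaining free-form answers (edits, counting and choices) account for only about 7% of the data.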

5.2 Upper Bound of Extractive Models

Considering that the answers can be free-form text, we estimate the upper bound of the performance of extractive methods. For each question, the subsequence of the passage with the highest F1 score is regarded as the best possible answer for extractive models. As shown in Table 3, with the first human answer as reference, the upper bound of F1 is 93.0. With all 4 human answers considered, the F1 can reach 95.1, indicating that the headroom for generative models is only 4.9% (100 − 95.1). This is also the reason why we use an extractive model.

5.3 Error Analysis

As shown in Table 4, there are two typical types of bad cases in our system: Counting and Paraphrasing. To solve these problems, the model must be able to paraphrase the answer according to the format of its question. However, since cases of these two types account for only a small proportion of the entire dataset, introducing a naive generative mechanism may have a negative effect, especially on questions whose answers can be found in the passage.

6 Future directions

While our model achieves state-of-the-art performance on the CoQA dataset, there are still some possible directions to work on. One is to combine the extractive model and a generative model in a more effective way to alleviate the free-form answer problem mentioned above. Another is to incorporate syntax and knowledge base information into our system. Furthermore, proper data augmentation may be useful to boost model performance.