Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering

04/07/2020 · Changmao Li, et al. · Emory University

We introduce a novel approach to transformers that learns hierarchical representations in multiparty dialogue. First, three language modeling tasks are used to pre-train the transformers, token- and utterance-level language modeling and utterance order prediction, which learn both token and utterance embeddings for a better understanding of dialogue contexts. Then, multi-task learning between the utterance prediction and the token span prediction is applied to fine-tune for span-based question answering (QA). Our approach is evaluated on the FriendsQA dataset and shows improvements of 3.8% and 1.4% over the two state-of-the-art transformer models, BERT and RoBERTa, respectively.


1 Introduction

Transformer-based contextualized embedding approaches such as BERT Devlin et al. (2019), XLM Conneau and Lample (2019), XLNet Yang et al. (2019), RoBERTa Liu et al. (2019), and ALBERT Lan et al. (2019) have re-established the state-of-the-art for practically all question answering (QA) tasks, not only on general domain datasets such as SQuAD Rajpurkar et al. (2016, 2018), MS MARCO Nguyen et al. (2016), TriviaQA Joshi et al. (2017), NewsQA Trischler et al. (2017), and NarrativeQA Kočiský et al. (2018), but also on multi-turn question datasets such as SQA Iyyer et al. (2017), QuAC Choi et al. (2018), CoQA Reddy et al. (2019), and CQA Talmor and Berant (2018). However, for span-based QA where the evidence documents are in the form of multiparty dialogue, performance is still poor even with the latest transformer models Sun et al. (2019); Yang and Choi (2019) due to the challenges in representing utterances composed by heterogeneous speakers.

Several limitations can be expected when language models trained on general domains are used to process dialogue. First, most of these models are pre-trained on formal writing, which is notably different from the colloquial writing found in dialogue; thus, fine-tuning on the end tasks alone is often not sufficient to build robust dialogue models. Second, unlike sentences in a wiki or news article written by one author on a coherent topic, utterances in a dialogue come from multiple speakers who may talk about different topics in distinct manners, so they should not be represented by simple concatenation, but rather as sub-documents interconnected with one another.

This paper presents a novel approach to the latest transformers that learns hierarchical embeddings for tokens and utterances for a better understanding of dialogue contexts. While fine-tuning for span-based QA, every utterance as well as the question is separately encoded, and multi-head attention and additional transformer layers are built on top of the token and utterance embeddings, respectively, to provide the QA model with a more comprehensive view of the dialogue. As a result, our model achieves a new state-of-the-art result on a span-based QA task where the evidence documents are multiparty dialogue. All our resources, including the source code and the dataset with the experiment split, are available at https://github.com/emorynlp/friendsqa. The contributions of this paper are:


  • New pre-training tasks are introduced to improve the quality of both token-level and utterance-level embeddings generated by the transformers, making them better suited to handling dialogue contexts (§2.1).

  • A new multi-task learning approach is proposed to fine-tune the language model for span-based QA that takes full advantage of the hierarchical embeddings created from the pre-training (§2.2).

  • Our approach significantly outperforms the previous state-of-the-art models using BERT and RoBERTa on a span-based QA task using dialogues as evidence documents (§3).

(a) Token-level MLM (§2.1.1)
(b) Utterance-level MLM (§2.1.2)
(c) Utterance order prediction (§2.1.3)
Figure 1: The overview of our models for the three pre-training tasks (Section 2.1).

2 Transformers for Learning Dialogue

This section introduces a novel approach to pre-training (Section 2.1) and fine-tuning (Section 2.2) transformers to effectively learn dialogue contexts. Our approach has been evaluated with two kinds of transformers, BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), and has shown significant improvements on a question answering (QA) task on multiparty dialogue (Section 3).

2.1 Pre-training Language Models

Pre-training involves three tasks in sequence: the token-level masked language modeling (MLM; §2.1.1), the utterance-level MLM (§2.1.2), and the utterance order prediction (§2.1.3), where the trained weights from each task are transferred to the next task. Note that the weights of publicly available transformer encoders are adapted to train the token-level MLM, which allows our QA model to handle the language of both the dialogues, used as evidence documents, and the questions written in formal language. Transformers from BERT and RoBERTa are trained with static and dynamic MLM, respectively, as described by Devlin et al. (2019) and Liu et al. (2019).

2.1.1 Token-level Masked LM

Figure 1(a) illustrates the token-level MLM model. Let D = {U_1, .., U_n} be a dialogue where U_i is the i'th utterance in D, and let U_i = {s_i, w_i1, .., w_im} where s_i is the speaker of U_i and w_ij is the j'th token in U_i (m: the maximum number of words in every utterance; n: the maximum number of utterances in every dialogue). All speakers and tokens in D are appended in order with the special token CLS, representing the entire dialogue, which creates the input string sequence I_D = {CLS, s_1, w_11, .., w_1m, .., s_n, w_n1, .., w_nm}. For every w_ij in I_D, let I_D^ij be the sequence obtained by substituting the masked token MASK_ij in place of w_ij. I_D^ij is then fed into the transformer encoder (TE), which generates a sequence of embeddings E_D^ij = {e_cls, e_1^s, E_1, .., e_n^s, E_n}, where E_i is the embedding list for {w_i1, .., w_im}, and e_cls and e_i^s are the embeddings of CLS and s_i, respectively. Finally, the embedding of MASK_ij is fed into a softmax layer that generates the output vector o_ij over V to predict w_ij, where V is the set of all vocabulary in the dataset.
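To make the input construction concrete, below is a minimal sketch (not the authors' released code) of how a dialogue could be flattened into the token-level MLM input described above: speakers and tokens are concatenated behind a single CLS token, and one randomly chosen token is replaced by the mask token before being fed to a transformer encoder. The model name, the use of HuggingFace components, and the toy dialogue are illustrative assumptions.

```python
import random

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Toy dialogue: a list of (speaker, utterance) pairs.
dialogue = [
    ("Joey", "we're having a big party tomorrow night ."),
    ("Rachel", "are you planning on inviting us ?"),
]

# Flatten the dialogue: every utterance contributes its speaker followed by its tokens.
flat = " ".join(f"{speaker} {utterance}" for speaker, utterance in dialogue)
encoding = tokenizer(flat, return_tensors="pt")   # CLS is prepended automatically

# Mask one token (position 0 is CLS, the last position is SEP) and predict it.
input_ids = encoding["input_ids"].clone()
position = random.randrange(1, input_ids.size(1) - 1)
labels = torch.full_like(input_ids, -100)         # -100 = ignored by the MLM loss
labels[0, position] = input_ids[0, position]
input_ids[0, position] = tokenizer.mask_token_id

loss = model(input_ids=input_ids,
             attention_mask=encoding["attention_mask"],
             labels=labels).loss
loss.backward()
```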

Figure 2: The overview of our fine-tuning model exploiting multi-task learning (Section 2.2).

2.1.2 Utterance-level Masked LM

The token-level MLM (t-MLM) learns attention among all tokens in D regardless of the utterance boundaries, allowing the model to compare every token to a broad context; however, it fails to capture unique aspects of individual utterances that can be important in dialogue. To learn an embedding for each utterance, the utterance-level MLM model is trained (Figure 1(b)). Utterance embeddings can be used independently and/or in sequence to match contexts between the question and the dialogue beyond the token level, showing an advantage in finding utterances with the correct answer spans (§2.2.1).

For every utterance U_i, the masked input sequence I_i^j = {CLS, s_i, w_i1, .., MASK_ij, .., w_im} is generated. Note that CLS now represents U_i instead of D, and I_i^j is much shorter than the input sequence used for t-MLM. I_i^j is fed into TE, already trained by t-MLM, and the embedding sequence E_i^j = {e_i^u, e_i^s, e_i1, .., e_im} is generated. Finally, e_i^u, the embedding of CLS, instead of the embedding of MASK_ij, is fed into a softmax layer that generates o_ij to predict w_ij. The intuition behind the utterance-level MLM is that once e_i^u learns enough content to accurately predict any token in U_i, it contains the most essential features of the utterance; thus, e_i^u can be used as the embedding of U_i.
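The intuition can be sketched as follows, under the assumption (ours, not a detail confirmed above) that off-the-shelf BERT weights stand in for the TE already trained with t-MLM; the CLS embedding of a single "speaker + tokens" sequence is taken as the utterance embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # stand-in for TE

def utterance_embedding(speaker: str, utterance: str) -> torch.Tensor:
    """Encode one 'speaker + tokens' sequence and return its CLS embedding."""
    enc = tokenizer(f"{speaker} {utterance}", return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden[:, 0]                             # CLS position = utterance embedding

emb = utterance_embedding("Joey", "it's Ross' bachelor party .")
print(emb.shape)   # torch.Size([1, 768])
```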

2.1.3 Utterance Order Prediction

The embedding e_i^u from the utterance-level MLM (u-MLM) learns the contents within U_i, but not across other utterances. In dialogue, it is often the case that a context is completed by multiple utterances; thus, learning attention among the utterances is necessary. To create embeddings that contain cross-utterance features, the utterance order prediction model is trained (Figure 1(c)). Let D be split into D_1 and D_2, comprising the first and the second halves of the utterances in D, respectively. Also, let D' be the dialogue obtained by replacing D_2 with D_2', which contains the same set of utterances as D_2 although the ordering may be different. The task is to predict whether or not D_2' preserves the same order of utterances as D_2.

For each utterance in D', the corresponding input sequence is created and fed into TE, already trained by u-MLM, to create the utterance embeddings {e_1^u, .., e_n^u}. This sequence is fed into two transformer layers, TL1 and TL2, that generate the new utterance embedding list. Finally, the resulting representation is fed into a softmax layer that generates a two-dimensional output to predict whether or not D_2' is in order.
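A hedged sketch of such an order prediction head is given below; the exact pooling is not spelled out above, so classifying from the first position of the attended sequence is an assumption, as are the module names.

```python
import torch
import torch.nn as nn

class UtteranceOrderPredictor(nn.Module):
    """Binary classifier over a sequence of utterance embeddings (ordered vs. shuffled)."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)
        self.utterance_layers = nn.TransformerEncoder(layer, num_layers=2)   # TL1, TL2
        self.classifier = nn.Linear(hidden_size, 2)                          # {in order, shuffled}

    def forward(self, utterance_embs: torch.Tensor) -> torch.Tensor:
        # utterance_embs: (batch, n_utterances, hidden_size)
        attended = self.utterance_layers(utterance_embs)
        return self.classifier(attended[:, 0])   # logits from the first attended position

logits = UtteranceOrderPredictor()(torch.randn(2, 12, 768))
print(logits.shape)   # torch.Size([2, 2])
```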

2.2 Fine-tuning for QA on Dialogue

Fine-tuning exploits multi-task learning between the utterance ID prediction (§2.2.1) and the token span prediction (§2.2.2), which allows the model to train both the utterance- and token-level attentions. The transformer encoder (TE) trained by the utterance order prediction (UOP) is used for both tasks. Given the question Q = {q_1, .., q_k}, where q_j is the j'th token in Q, and the dialogue D, both Q and every U_i in D are fed into TE, which generates the token and utterance embeddings for Q and for every U_i, respectively.

2.2.1 Utterance ID Prediction

The utterance embedding list is fed into TL1 and TL2 from UOP, which generate the new utterance embedding list. This list is then fed into a softmax layer that generates an output over n+1 labels to predict the ID of the utterance containing the answer span for Q if it exists; otherwise, the (n+1)'th label is predicted, implying that the answer span for Q does not exist in D.
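A minimal sketch of such an utterance ID head is shown below. How the question is injected and how the extra "no answer" label is scored are not fully specified above, so prepending the question embedding and scoring it as the (n+1)'th class are our assumptions.

```python
import torch
import torch.nn as nn

class UtteranceIDPredictor(nn.Module):
    """Scores each utterance as the answer holder, plus one extra 'no answer' class."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)
        self.utterance_layers = nn.TransformerEncoder(layer, num_layers=2)   # TL1, TL2 from UOP
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, question_emb: torch.Tensor, utterance_embs: torch.Tensor) -> torch.Tensor:
        # question_emb: (batch, 1, hidden); utterance_embs: (batch, n_utterances, hidden)
        attended = self.utterance_layers(torch.cat([question_emb, utterance_embs], dim=1))
        utterance_scores = self.scorer(attended[:, 1:]).squeeze(-1)   # one score per utterance
        no_answer_score = self.scorer(attended[:, :1]).squeeze(-1)    # "span not in dialogue"
        return torch.cat([utterance_scores, no_answer_score], dim=1)  # (batch, n_utterances + 1)

logits = UtteranceIDPredictor()(torch.randn(2, 1, 768), torch.randn(2, 12, 768))
print(logits.shape)   # torch.Size([2, 13])
```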

2.2.2 Token Span Prediction

For every U_i, the pair of the question token embeddings and the utterance token embeddings is fed into the multi-head attention layer, MHA. MHA Vaswani et al. (2017) then generates the attended embedding sequences A_1, .., A_n, where A_i is the question-attended embedding sequence of U_i. Finally, each A_i is fed into two softmax layers, SL and SR, that predict the leftmost and the rightmost tokens in U_i, respectively, which yield the answer span for Q. It is possible that answer spans are predicted in multiple utterances, in which case the span from the utterance with the highest score for the utterance ID prediction is selected, which is more efficient than the typical dynamic programming approach.
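The span prediction step could be sketched as below (shapes and module names are illustrative assumptions): each utterance's token embeddings attend to the question through multi-head attention, two linear heads score every token as a span start or end, and the final span is read off from the single utterance ranked highest by the utterance ID predictor.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Question-attended start/end scoring of every token in every utterance."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        self.mha = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.start_head = nn.Linear(hidden_size, 1)   # SL: leftmost-token scorer
        self.end_head = nn.Linear(hidden_size, 1)     # SR: rightmost-token scorer

    def forward(self, utterance_tokens: torch.Tensor, question_tokens: torch.Tensor):
        # utterance_tokens: (n_utterances, utt_len, hidden); question_tokens: (n_utterances, q_len, hidden)
        attended, _ = self.mha(utterance_tokens, question_tokens, question_tokens)
        return (self.start_head(attended).squeeze(-1),   # (n_utterances, utt_len) start logits
                self.end_head(attended).squeeze(-1))     # (n_utterances, utt_len) end logits

predictor = SpanPredictor()
start_logits, end_logits = predictor(torch.randn(12, 40, 768), torch.randn(12, 16, 768))

# At inference time, the span is taken from the single utterance that the
# utterance ID predictor scores highest, rather than searching all utterances.
best_utterance = 3   # hypothetical index returned by the utterance ID predictor
span = (start_logits[best_utterance].argmax().item(),
        end_logits[best_utterance].argmax().item())
```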

3 Experiments

3.1 Corpus

Despite all the great work in QA, only two datasets are publicly available for machine comprehension that take dialogues as evidence documents. One is DREAM, comprising dialogues for language exams with multiple-choice questions Sun et al. (2019). The other is FriendsQA, containing transcripts from the TV show Friends with annotation for span-based question answering Yang and Choi (2019). Since DREAM is for a reading comprehension task that does not need to find the answer contents from the evidence documents, it is not suitable for our approach; thus, FriendsQA is chosen.

Each scene is treated as an independent dialogue in FriendsQA. Yang and Choi (2019) randomly split the corpus to generate training, development, and evaluation sets such that scenes from the same episode can be distributed across those three sets, causing inflated accuracy scores. Thus, we re-split them by episodes to prevent such inflation. For fine-tuning (§2.2), episodes from the first four seasons are used as described in Table 1. For pre-training (§2.1), all transcripts from Seasons 5-10 are used as an additional training set.

Set D Q A E
Training 973 9,791 16,352 1 - 20
Development 113 1,189 2,065 21 - 22
Evaluation 136 1,172 1,920 23 - *
Table 1: New data split for FriendsQA. D/Q/A: # of dialogues/questions/answers, E: episode IDs.

3.2 Models

The weights from the BERTbase and RoBERTabase models Devlin et al. (2019); Liu et al. (2019) are transferred to all models in our experiments. Four baseline models, BERT, BERTpre, RoBERTa, and RoBERTapre, are built, where all models are fine-tuned on the datasets in Table 1 and the *pre models are pre-trained on the same datasets with the additional training set from Seasons 5-10 (§3.1). The baseline models are compared to BERTour and RoBERTaour, which are trained by our approach (detailed experimental setups are provided in the Appendices).

3.3 Results

Table 2 shows the results achieved by all the models. Following Yang and Choi (2019), exact matching (EM), span matching (SM), and utterance matching (UM) are used as the evaluation metrics. Each model is developed three times, and the average scores as well as the standard deviations are reported. The performance of the RoBERTa* models is generally higher than that of the BERT* models; note, however, that RoBERTabase is pre-trained on larger datasets, including CC-News Nagel (2016), OpenWebText Gokaslan and Cohen (2019), and Stories Trinh and Le (2018), than BERTbase, so the results from these two types of transformers cannot be directly compared.

Model EM SM UM
BERT 43.3(0.8) 59.3(0.6) 70.2(0.4)
BERTpre 45.6(0.9) 61.2(0.7) 71.3(0.6)
BERTour 46.8(1.3) 63.1(1.1) 73.3(0.7)
RoBERTa 52.6(0.7) 68.2(0.3) 80.9(0.8)
RoBERTapre 52.6(0.7) 68.6(0.6) 81.7(0.7)
RoBERTaour 53.5(0.7) 69.6(0.8) 82.7(0.5)
Table 2: Accuracies (± standard deviations) achieved by the BERT and RoBERTa models.

The *pre models show marginal improvements over their base models, implying that pre-training the language models on FriendsQA with the original transformers does not have much impact on this QA task. The models using our approach perform noticeably better than the baseline models, showing 3.8% and 1.4% improvements on SM over BERT and RoBERTa, respectively.

Type Dist. EM SM UM
Where 18.16 66.1(0.5) 79.9(0.7) 89.8(0.7)
When 13.57 63.3(1.3) 76.4(0.6) 88.9(1.2)
What 18.48 56.4(1.7) 74.0(0.5) 87.7(2.1)
Who 18.82 55.9(0.8) 66.0(1.7) 79.9(1.1)
How 15.32 43.2(2.3) 63.2(2.5) 79.4(0.7)
Why 15.65 33.3(2.0) 57.3(0.8) 69.8(1.8)
Table 3: Results from the RoBERTaour model by different question types.

Table 3 shows the results achieved by RoBERTaour with respect to question types. UM drops significantly for Why questions, whose answers often span longer sequences and require deeper inference to answer correctly than the others. Compared to the baseline models, our models show more well-rounded performance regardless of the question types (question type results for all models are provided in the Appendices).

3.4 Ablation Studies

Table 4 shows the results of the ablation studies conducted to analyze the impact of the individual approaches. BERTpre and RoBERTapre are the same as in Table 2, that is, the transformer models pre-trained by the token-level masked LM (§2.1.1) and fine-tuned by the token span prediction (§2.2.2). BERTuid and RoBERTauid are the models that are pre-trained by the token-level masked LM and jointly fine-tuned by the token span prediction as well as the utterance ID prediction (UID; §2.2.1). Given these two types of transformer models, the utterance-level masked LM (ULM; §2.1.2) and the utterance order prediction (UOP; §2.1.3) are evaluated separately.

Model EM SM UM
BERTpre 45.6(0.9) 61.2(0.7) 71.3(0.6)
 +ULM 45.7(0.9) 61.8(0.9) 71.8(0.5)
 +ULM+UOP 45.6(0.9) 61.7(0.7) 71.7(0.6)
BERTuid 45.7(0.8) 61.1(0.8) 71.5(0.5)
 +ULM 46.2(1.1) 62.4(1.2) 72.5(0.8)
 +ULM+UOP 46.8(1.3) 63.1(1.1) 73.3(0.7)
RoBERTapre 52.6(0.7) 68.6(0.6) 81.7(0.7)
 +ULM 52.9(0.8) 68.7(1.1) 81.7(0.6)
 +ULM+UOP 52.5(0.8) 68.8(0.5) 81.9(0.7)
RoBERTauid 52.8(0.9) 68.7(0.8) 81.9(0.5)
 +ULM 53.2(0.6) 69.2(0.7) 82.4(0.5)
 +ULM+UOP 53.5(0.7) 69.6(0.8) 82.7(0.5)
Table 4: Results of the ablation studies. Note that the *uid+ULM+UOP models are equivalent to the *our models in Table 2, respectively.

These two dialogue-specific LM approaches, ULM and UOP, give only marginal improvements over the baseline models, which is rather surprising. However, they show good improvements when combined with UID, implying that pre-training language models may not be enough to enhance performance by itself, but can be effective when coupled with an appropriate fine-tuning approach. Since both ULM and UOP are designed to improve the quality of the utterance embeddings, they are expected to improve the accuracy of UID as well. The improvement on UM is indeed encouraging, giving 2% and 1% boosts to BERTpre and RoBERTapre, respectively, and consequently improving the other two metrics.

3.5 Error Analysis

As shown in Table 3, the major errors come from three types of questions, who, how, and why; thus, we select 100 dialogues associated with those question types for which our best model, RoBERTaour, incorrectly predicts the answer spans. Specific examples are provided in Tables 12, 13, and 14 (§A.3). Following Yang et al. (2019), errors are grouped into six categories: entity resolution, paraphrase and partial match, cross-utterance reasoning, question bias, noise in annotation, and miscellaneous.

Table 5 shows the error types and their ratios with respect to the question types. The two main error types are entity resolution and cross-utterance reasoning. The entity resolution error happens when many of the same entities are mentioned in multiple utterances; it also occurs when the QA system is asked about a specific person but predicts the wrong one because many people appear across multiple utterances. The cross-utterance reasoning error often happens with the why and how questions, where the model mostly relies on pattern matching and predicts a span in the utterance following the matched pattern.

Error Types Who How Why
Entity Resolution 34% 23% 20%
Paraphrase and Partial Match 14% 14% 13%
Cross-Utterance Reasoning 25% 28% 27%
Question Bias 11% 13% 17%
Noise in Annotation 4% 7% 9%
Miscellaneous 12% 15% 14%
Table 5: Error types and their ratios with respect to the three most challenging question types.

4 Conclusion

This paper introduces a novel transformer approach that effectively interprets hierarchical contexts in multiparty dialogue by learning utterance embeddings. Two language modeling approaches are proposed, the utterance-level masked LM and the utterance order prediction. Coupled with the joint inference between the token span prediction and the utterance ID prediction, these two language models significantly outperform two state-of-the-art transformer approaches, BERT and RoBERTa, on the span-based QA task FriendsQA. We will evaluate our approach on other machine comprehension tasks that use dialogues as evidence documents to further verify the generalizability of this work.

Acknowledgments

We gratefully acknowledge the support of the AWS Machine Learning Research Awards (MLRA). Any contents in this material are those of the authors and do not necessarily reflect the views of the sponsor.

References

  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • A. Conneau and G. Lample (2019) Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems 32, pp. 7057–7067.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL'19, pp. 4171–4186.
  • A. Gokaslan and V. Cohen (2019) OpenWebText Corpus.
  • M. Iyyer, W. Yih, and M. Chang (2017) Search-based Neural Structured Learning for Sequential Question Answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1821–1831.
  • M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  • S. Nagel (2016) News Dataset Available.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A Human Generated Machine Reading Comprehension Dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (NIPS 2016), Barcelona, Spain.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266.
  • K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, and C. Cardie (2019) DREAM: A Challenge Data Set and Models for Dialogue-Based Reading Comprehension. Transactions of the Association for Computational Linguistics 7, pp. 217–231.
  • A. Talmor and J. Berant (2018) The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
  • T. H. Trinh and Q. V. Le (2018) A Simple Method for Commonsense Reasoning. arXiv:1806.02847.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010.
  • Z. Yang and J. D. Choi (2019) FriendsQA: Open-Domain Question Answering on TV Show Transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 188–197.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, pp. 5754–5764.

Appendix A Appendices

A.1 Experimental Setup

The BERT model and the RoBERTa model use the same configuration. Both models have 12 hidden transformer layers and 12 attention heads. The hidden size of the model is 768, the intermediate size in the transformer layers is 3,072, and the activation function in the transformer layers is gelu.

Pre-training

The batch size of 32 sequences is used for pre-training. Adam is used with L2 weight decay, the learning rate warm-up over the first 10% of steps, and the linear decay of the learning rate. A dropout probability is applied to all layers. The cross-entropy is used as the training loss of each task. For the masked language modeling tasks, the model is trained until the perplexity stops decreasing on the development set. For the other pre-training tasks, the model is trained until both the loss and the accuracy stop improving on the development set.
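As a rough illustration of the schedule described above (warm-up over the first 10% of steps followed by linear decay), the snippet below uses placeholder hyperparameter values, since the exact learning rate, Adam betas, and weight decay are not shown in this copy of the text.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)   # stand-in for the transformer encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)   # placeholder values

total_steps = 100_000                      # e.g., num_epochs * batches_per_epoch
warmup_steps = int(0.10 * total_steps)     # warm-up over the first 10% of steps
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

# Inside the training loop, call optimizer.step() followed by scheduler.step().
```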

Fine-tuning

For fine-tuning, the batch size and the optimization approach are the same as in pre-training, and the dropout probability is kept fixed. The training loss is the sum of the cross-entropies of the two fine-tuning tasks in §2.2.
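A minimal sketch of that joint loss, with dummy tensors standing in for the model outputs and gold labels:

```python
import torch
import torch.nn.functional as F

uid_logits = torch.randn(2, 13, requires_grad=True)    # (batch, n_utterances + 1)
start_logits = torch.randn(2, 40, requires_grad=True)  # (batch, utterance_length)
end_logits = torch.randn(2, 40, requires_grad=True)

uid_gold = torch.tensor([3, 12])     # 12 = the extra "no answer" label in this toy setup
start_gold = torch.tensor([5, 0])
end_gold = torch.tensor([9, 0])

# The training loss is the sum of the cross-entropies of the two fine-tuning tasks.
loss = (F.cross_entropy(uid_logits, uid_gold)
        + F.cross_entropy(start_logits, start_gold)
        + F.cross_entropy(end_logits, end_gold))
loss.backward()
```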

A.2 Question Type Analysis

Tables in this section show the results with respect to the question types for all models (Section 3.2), listed in order of performance.

Type Dist. EM SM UM
Where 18.16 68.3(1.3) 78.8(1.2) 89.2(1.5)
When 13.57 63.8(1.6) 75.2(0.9) 86.0(1.6)
What 18.48 54.1(0.8) 72.5(1.5) 84.0(0.9)
Who 18.82 56.0(1.3) 66.1(1.3) 79.4(1.2)
How 15.32 38.1(0.7) 59.2(1.6) 77.5(0.7)
Why 15.65 32.0(1.1) 56.0(1.7) 68.5(0.8)
Table 6: Results from RoBERTa by question types.
Type Dist. EM SM UM
Where 18.16 67.1(1.2) 78.9(0.6) 89.0(1.1)
When 13.57 62.3(0.7) 76.3(1.3) 88.7(0.9)
What 18.48 55.1(0.8) 73.1(0.8) 86.7(0.8)
Who 18.82 56.2(1.4) 64.0(1.7) 77.1(1.3)
How 15.32 41.2(1.1) 61.2(1.5) 79.8(0.7)
Why 15.65 32.4(0.7) 57.4(0.8) 69.1(1.4)
Table 7: Results from RoBERTapre by question types.
Type Dist. EM SM UM
Where 18.16 66.1(0.5) 79.9(0.7) 89.8(0.7)
When 13.57 63.3(1.3) 76.4(0.6) 88.9(1.2)
What 18.48 56.4(1.7) 74.0(0.5) 87.7(2.1)
Who 18.82 55.9(0.8) 66.0(1.7) 79.9(1.1)
How 15.32 43.2(2.3) 63.2(2.5) 79.4(0.7)
Why 15.65 33.3(2.0) 57.3(0.8) 69.8(1.8)
Table 8: Results from RoBERTaour by question types.
Type Dist. EM SM UM
Where 18.16 57.3(0.5) 70.2(1.3) 79.4(0.9)
When 13.57 56.1(1.1) 69.7(1.6) 78.6(1.7)
What 18.48 45.0(1.4) 64.4(0.7) 77.0(1.0)
Who 18.82 46.9(1.1) 56.2(1.4) 67.6(1.4)
How 15.32 29.3(0.8) 48.4(1.2) 60.9(0.7)
Why 15.65 23.4(1.6) 46.1(0.9) 56.4(1.3)
Table 9: Results from BERT by question types.
Type Dist. EM SM UM
Where 18.16 62.8(1.8) 72.3(0.8) 82.1(0.7)
When 13.57 60.7(1.5) 70.7(1.8) 80.4(1.1)
What 18.48 43.2(1.3) 64.3(1.7) 75.6(1.8)
Who 18.82 47.8(1.1) 56.9(1.9) 69.7(0.7)
How 15.32 33.2(1.3) 48.3(0.6) 59.8(1.1)
Why 15.65 22.9(1.6) 46.6(0.7) 54.9(0.9)
Table 10: Results from BERTpre by question types.
Type Dist. EM SM UM
Where 18.16 63.3(1.2) 72.9(1.7) 77.0(1.2)
When 13.57 48.4(1.9) 66.5(0.8) 79.5(1.5)
What 18.48 52.1(0.7) 69.2(1.1) 81.3(0.7)
Who 18.82 51.3(1.1) 61.9(0.9) 67.5(0.9)
How 15.32 30.9(0.9) 52.1(0.7) 65.4(1.1)
Why 15.65 29.2(1.6) 53.2(1.3) 65.7(0.8)
Table 11: Results from BERTour by question types.

A.3 Error Examples

Each table in this section gives an error example along with an excerpt from the dialogue. The gold answers are indicated by solid underlines, whereas the predicted answers are indicated by wavy underlines.

Q Why is Joey planning a big party?
J Oh, we’re having a big party tomorrow night. Later!
R Whoa! Hey-hey, you planning on inviting us?
J Nooo, later.
P Hey!! Get your ass back here, Tribbiani!!
R Hormones!
M What Phoebe meant to say was umm, how come
you’re having a party and we’re not invited?
J Oh, it’s Ross’ bachelor party.
M Sooo?
Table 12: An error example for the why question (Q).
J: Joey, R: Rachel, P: Phoebe, M: Monica.
Q Who opened the vent?
R Ok, got the vent open.
P Hi, I’m Ben. I’m hospital worker Ben.
It’s Ben… to the rescue!
R Ben, you ready? All right, gimme your foot.
Ok, on three, Ben. One, two, three. Ok, That’s it, Ben.
- (Ross and Susan lift Phoebe up into the vent.)
S What do you see?
P Well, Susan, I see what appears to be a dark vent.
Wait. Yes, it is in fact a dark vent.
- (A janitor opens the closet door from the outside.)
Table 13: An error example for the who question (Q).
P: Phoebe, R: Ross, S: Susan.
Q How does Joey try to convince the girl
to hang out with him?
J Oh yeah-yeah. And I got the duck totally trained.
Watch this. Stare at the wall. Hardly move. Be white.
G You are really good at that.
So uh, I had fun tonight, you throw one hell of a party.
J Oh thanks. Thanks. It was great meetin’ ya. And listen
if any of my friends gets married, or have a birthday, …
G Yeah, that would be great. So I guess umm, good night.
J Oh unless you uh, you wanna hang around.
G Yeah?
J Yeah. I’ll let you play with my duck.
Table 14: An error example for the how question (Q).
J: Joey, G: The Girl.