Transformer-based contextualized embedding approaches such as BERT Devlin et al. (2019), XLM Conneau and Lample (2019), XLNet Yang et al. (2019), RoBERTa Liu et al. (2019), and ALBERT Lan et al. (2019) have re-established the state of the art for practically all question answering (QA) tasks, not only on general-domain datasets such as SQuAD Rajpurkar et al. (2016, 2018), MS MARCO Nguyen et al. (2016), TriviaQA Joshi et al. (2017), NewsQA Trischler et al. (2017), and NarrativeQA Kočiský et al. (2018), but also on multi-turn question datasets such as SQA Iyyer et al. (2017), QuAC Choi et al. (2018), CoQA Reddy et al. (2019), and CQA Talmor and Berant (2018). However, for span-based QA where the evidence documents are in the form of multiparty dialogue, performance remains poor even with the latest transformer models Sun et al. (2019); Yang and Choi (2019), due to the challenges in representing utterances composed by heterogeneous speakers.
Several limitations can be expected when language models trained on general domains are used to process dialogue. First, most of these models are pre-trained on formal writing, which is notably different from the colloquial writing found in dialogue; thus, fine-tuning on the end tasks alone is often not sufficient to build robust dialogue models. Second, unlike sentences in a wiki or news article, which are written by one author on a coherent topic, utterances in a dialogue come from multiple speakers who may talk about different topics in distinct manners; they should therefore not be represented by simple concatenation, but rather as sub-documents interconnected with one another.
This paper presents a novel approach that extends the latest transformers to learn hierarchical embeddings for tokens and utterances, yielding a better understanding of dialogue contexts.
While fine-tuning for span-based QA, every utterance, as well as the question, is separately encoded, and multi-head attentions and additional transformers are built on the token and utterance embeddings, respectively, to provide a more comprehensive view of the dialogue to the QA model.
As a result, our model achieves a new state-of-the-art result on a span-based QA task where the evidence documents are multiparty dialogue.
The contributions of this paper are:¹
¹ All our resources, including the source code and the dataset with the experiment split, are available at
New pre-training tasks are introduced to improve the quality of both the token-level and utterance-level embeddings generated by the transformers, making them better suited to dialogue contexts (§2.1).
A new multi-task learning approach is proposed to fine-tune the language model for span-based QA that takes full advantage of the hierarchical embeddings created from the pre-training (§2.2).
Our approach significantly outperforms the previous state-of-the-art models using BERT and RoBERTa on a span-based QA task using dialogues as evidence documents (§3).
2 Transformers for Learning Dialogue
This section introduces a novel approach for pre-training (Section 2.1) and fine-tuning (Section 2.2) transformers to effectively learn dialogue contexts. Our approach has been evaluated with two kinds of transformers, BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), and shows significant improvements on a question answering (QA) task on multiparty dialogue (Section 3).
2.1 Pre-training Language Models
Pre-training involves three tasks in sequence: the token-level masked language modeling (MLM; §2.1.1), the utterance-level MLM (§2.1.2), and the utterance order prediction (§2.1.3), where the trained weights from each task are transferred to the next. Note that the weights of publicly available transformer encoders are adapted to train the token-level MLM, which allows our QA model to handle the language of both the dialogues, used as evidence documents, and the questions, written in formal style. Transformers from BERT and RoBERTa are trained with static and dynamic MLM, respectively, as described by Devlin et al. (2019) and Liu et al. (2019).
2.1.1 Token-level Masked LM
Figure 1(a) illustrates the token-level MLM model. Let D = {U_1, …, U_n} be a dialogue, where U_i is the i'th utterance in D, s_i is the speaker of U_i, and w_{i,j} is the j'th token in U_i. All speakers and tokens in D are appended in order with the special token CLS, representing the entire dialogue, which creates the input sequence I = [CLS, s_1, w_{1,1}, …, w_{1,m}, …, s_n, w_{n,1}, …, w_{n,m}]. For every w_{i,j}, let I_{i,j} be the copy of I where MASK, the masked token, is substituted in place of w_{i,j}. I_{i,j} is then fed into the transformer encoder (TE), which generates a sequence of embeddings E = [e_cls, e_{s_1}, E_1, …, e_{s_n}, E_n], where E_i is the embedding list for the tokens of U_i, and e_cls and e_{s_i} are the embeddings of CLS and s_i, respectively. Finally, the embedding of MASK is fed into a softmax layer that generates the output o ∈ R^{|V|} to predict w_{i,j}, where V is the set of all vocabulary in the dataset.²
² m: the maximum number of words in every utterance; n: the maximum number of utterances in every dialogue.
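The input construction above can be sketched in Python as follows; the flattening scheme and special tokens follow the description, while the function names and the list-of-tokens input format are assumptions for illustration:

```python
# Hypothetical sketch of the token-level MLM input construction.
CLS, MASK = "[CLS]", "[MASK]"

def flatten_dialogue(dialogue):
    """Append all speakers and tokens in order after CLS.

    `dialogue` is a list of (speaker, tokens) pairs, one per utterance.
    """
    seq = [CLS]
    for speaker, tokens in dialogue:
        seq.append(speaker)
        seq.extend(tokens)
    return seq

def masked_inputs(dialogue):
    """Create one copy of the flattened sequence per token, with that token
    replaced by MASK, paired with the original token as the prediction target.
    Speaker slots are never masked."""
    seq = flatten_dialogue(dialogue)
    pairs, idx = [], 1
    for _speaker, tokens in dialogue:
        idx += 1  # skip the speaker slot
        for tok in tokens:
            masked = list(seq)
            masked[idx] = MASK
            pairs.append((masked, tok))
            idx += 1
    return pairs
```

One masked copy is generated per token, so a dialogue with T tokens yields T training instances for t-MLM.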
2.1.2 Utterance-level Masked LM
The token-level MLM (t-MLM) learns attentions among all tokens in D regardless of the utterance boundaries, allowing the model to compare every token against a broad context; however, it fails to capture unique aspects of individual utterances that can be important in dialogue. To learn an embedding for each utterance, the utterance-level MLM model is trained (Figure 1(b)). Utterance embeddings can be used independently and/or in sequence to match contexts between the question and the dialogue beyond the token level, showing an advantage in finding utterances with the correct answer spans (§2.2.1).
For every utterance U_i, the masked input sequence I_i = [CLS, s_i, w_{i,1}, …, MASK, …, w_{i,m}] is generated. Note that CLS now represents U_i instead of D, and I_i is much shorter than the input used for t-MLM. I_i is fed into TE, already trained by t-MLM, and the embedding sequence E_i = [e_cls, e_{s_i}, e_{i,1}, …, e_{i,m}] is generated. Finally, e_cls, instead of the embedding of MASK, is fed into a softmax layer that generates o ∈ R^{|V|} to predict the masked token. The intuition behind the utterance-level MLM is that once e_cls learns enough content to accurately predict any token in U_i, it comprises the most essential features of the utterance; thus, e_cls can be used as the embedding of U_i.
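The per-utterance masked inputs can be sketched as follows, assuming the same tokenized format as the token-level sketch (function name and input format are hypothetical):

```python
# Hypothetical sketch of the utterance-level MLM input construction.
CLS, MASK = "[CLS]", "[MASK]"

def utterance_masked_inputs(speaker, tokens):
    """For a single utterance, create one input per token with that token
    masked. Here CLS stands for the utterance itself, so the sequence is
    much shorter than the dialogue-level one used for t-MLM."""
    base = [CLS, speaker] + list(tokens)
    pairs = []
    for j, tok in enumerate(tokens):
        masked = list(base)
        masked[2 + j] = MASK  # offset 2 skips the CLS and speaker slots
        pairs.append((masked, tok))
    return pairs
```

During training, the CLS embedding (not the MASK embedding) is the one fed to the softmax, which is what forces CLS to absorb the utterance's content.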
2.1.3 Utterance Order Prediction
The embedding e_cls from the utterance-level MLM (u-MLM) learns content within U_i, but not across other utterances. In dialogue, it is often the case that a context is completed by multiple utterances; thus, learning attentions among the utterances is necessary. To create embeddings that contain cross-utterance features, the utterance order prediction model is trained (Figure 1(c)). Let D = D_1 ⊕ D_2, where D_1 and D_2 comprise the first and the second halves of the utterances in D, respectively. Also, let D' = D_1 ⊕ D_2', where D_2' contains the same set of utterances as D_2, although the ordering may be different. The task is to predict whether or not D_2' preserves the same order of utterances as D_2.
For each U_i ∈ D', the input I_i is created and fed into TE, already trained by u-MLM, to create the utterance embeddings [e_1, …, e_n], where e_i is the CLS embedding of U_i. This sequence is fed into two transformer layers, TL1 and TL2, that generate the new utterance embedding list [t_1, …, t_n]. Finally, the first embedding t_1 is fed into a softmax layer that generates o ∈ R^2 to predict whether or not D' is in order.
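The construction of D' can be sketched as follows; the 50/50 shuffling policy is an assumption for illustration, since the text does not specify how often the second half is permuted during training:

```python
import random

def make_uop_example(utterances, shuffle_prob=0.5, rng=random):
    """Split D into halves D1 and D2; with probability `shuffle_prob`,
    permute D2 to form D2'. The label is 1 if the utterance order is
    preserved, else 0 (the binary target of utterance order prediction)."""
    half = len(utterances) // 2
    d1, d2 = utterances[:half], utterances[half:]
    d2_prime = list(d2)
    in_order = 1
    if rng.random() < shuffle_prob:
        rng.shuffle(d2_prime)
        # a random shuffle can occasionally leave the order unchanged
        in_order = int(d2_prime == d2)
    return d1 + d2_prime, in_order
```

Passing a seeded `random.Random` instance as `rng` makes example generation reproducible across runs.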
2.2 Fine-tuning for QA on Dialogue
Fine-tuning exploits multi-task learning between the utterance ID prediction (§2.2.1) and the token span prediction (§2.2.2), which allows the model to train both the utterance- and token-level attentions. The transformer encoder (TE) trained by the utterance order prediction (UOP) is used for both tasks. Given the question Q = {q_1, …, q_k}, where q_i is the i'th token in Q, and the dialogue D, both Q and every U_i ∈ D are fed into TE, which generates the embedding lists E_q and E_i for Q and every U_i, respectively.
2.2.1 Utterance ID Prediction
The utterance embedding list [e_1, …, e_n] is fed into TL1 and TL2 from UOP, which generate [t_1, …, t_n]. This list is then fed into a softmax layer that generates o ∈ R^{n+1} to predict the ID of the utterance containing the answer span for Q, if it exists; otherwise, the (n+1)'th label is predicted, implying that the answer span for Q does not exist in D.
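The prediction step can be illustrated with a plain softmax over the n+1 scores; the helper below is a hypothetical sketch of the decision rule, not the actual implementation:

```python
import math

def predict_utterance_id(logits):
    """Softmax over n+1 scores: ids 0..n-1 index utterances in D, and the
    extra id n is the 'no answer in D' label described above. Returns the
    predicted utterance id (or None for no-answer) and the probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    n = len(logits) - 1
    return (best if best < n else None), probs
```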
2.2.2 Token Span Prediction
For every U_i, the pair (E_q, E_i) is fed into the multi-head attention layer, MHA, where E_q serves as the query and E_i as the key and value. MHA Vaswani et al. (2017) then generates the attended embedding sequence A_i = [a_{i,1}, …, a_{i,m}], where a_{i,j} is the attended embedding of w_{i,j}. Finally, each A_i is fed into two softmax layers, SL and SR, that generate o^l and o^r to predict the leftmost and the rightmost tokens in U_i, respectively, which yield the answer span for Q. It is possible that answer spans are predicted in multiple utterances, in which case the span from the utterance with the highest score for the utterance ID prediction is selected, which is more efficient than the typical dynamic programming approach.
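The span selection described above can be sketched as follows; the additive scoring of the left/right boundaries is an assumption borrowed from common span-based QA practice, and the function names are hypothetical:

```python
def best_span(left_scores, right_scores):
    """Pick the (l, r) pair maximizing left_scores[l] + right_scores[r]
    subject to l <= r, i.e. a valid leftmost/rightmost token pair."""
    best, best_score = (0, 0), float("-inf")
    for l, ls in enumerate(left_scores):
        for r in range(l, len(right_scores)):
            s = ls + right_scores[r]
            if s > best_score:
                best, best_score = (l, r), s
    return best

def select_answer(uid_scores, spans):
    """When spans are predicted in multiple utterances, keep the span from
    the utterance with the highest utterance-ID score, as described above.
    `spans` maps utterance id -> predicted (l, r) span."""
    i = max(spans, key=lambda u: uid_scores[u])
    return i, spans[i]
```

This tie-breaking by the utterance-ID score avoids a dynamic-programming search over all cross-utterance span combinations.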
3 Experiments

3.1 Corpus

Despite all the great work in QA, only two datasets are publicly available for machine comprehension that take dialogues as evidence documents. One is DREAM, comprising dialogues for language exams with multiple-choice questions Sun et al. (2019). The other is FriendsQA, containing transcripts from the TV show Friends with annotations for span-based question answering Yang and Choi (2019). Since DREAM targets a reading comprehension task that does not need to find the answer contents from the evidence documents, it is not suitable for our approach; thus, FriendsQA is chosen.
Each scene is treated as an independent dialogue in FriendsQA. Yang and Choi (2019) randomly split the corpus to generate training, development, and evaluation sets such that scenes from the same episode can be distributed across those three sets, causing inflated accuracy scores. Thus, we re-split them by episodes to prevent such inflation. For fine-tuning (§2.2), episodes from the first four seasons are used as described in Table 1. For pre-training (§2.1), all transcripts from Seasons 5-10 are used as an additional training set.
| Split | Dialogues | Questions | Answers | Episodes |
| Training | 973 | 9,791 | 16,352 | 1 - 20 |
| Development | 113 | 1,189 | 2,065 | 21 - 22 |
| Evaluation | 136 | 1,172 | 1,920 | 23 - * |
3.2 Models

The weights from the BERTbase and RoBERTabase models Devlin et al. (2019); Liu et al. (2019) are transferred to all models in our experiments. Four baseline models, BERT, BERTpre, RoBERTa, and RoBERTapre, are built, where all models are fine-tuned on the datasets in Table 1, and the *pre models are pre-trained on the same datasets plus the additional training set from Seasons 5-10 (§3.1). The baseline models are compared to BERTour and RoBERTaour, which are trained with our approach.³
³ The detailed experimental setup is provided in the Appendix.
3.3 Results

Three metrics, exact matching (EM), span matching (SM), and utterance matching (UM), are used for evaluation. Each model is developed three times, and the average score as well as the standard deviation are reported. The performance of the RoBERTa* models is generally higher than that of the BERT* models, although RoBERTabase is pre-trained on larger datasets than BERTbase, including CC-News Nagel (2016), OpenWebText Gokaslan and Cohen (2019), and Stories Trinh and Le (2018), such that results from those two types of transformers cannot be directly compared.
The *pre models show marginal improvement over their base models, implying that pre-training the language models on FriendsQA with the original transformers does not have much impact on this QA task. The models using our approach perform noticeably better than the baseline models, showing 3.8% and 1.4% improvements in SM over BERT and RoBERTa, respectively.
Table 3 shows the results achieved by RoBERTaour w.r.t. question types. UM drops significantly for Why questions, which often span longer sequences and require deeper inference to answer correctly than the others. Compared to the baseline models, our models show more well-rounded performance regardless of the question type.⁴
⁴ Question type results for all models are provided in the Appendix.
3.4 Ablation Studies
Table 4 shows the results from ablation studies analyzing the impact of the individual approaches. BERTpre and RoBERTapre are the same as in Table 2: the transformer models pre-trained by the token-level masked LM (§2.1.1) and fine-tuned by the token span prediction (§2.2.2). BERTuid and RoBERTauid are the models pre-trained by the token-level masked LM and jointly fine-tuned by the token span prediction as well as the utterance ID prediction (UID; §2.2.1). Given these two types of transformer models, the utterance-level masked LM (ULM; §2.1.2) and the utterance order prediction (UOP; §2.1.3) are evaluated separately.
These two dialogue-specific LM approaches, ULM and UOP, give only marginal improvements over the baseline models, which is rather surprising. However, they show good improvements when combined with UID, implying that pre-training language models may not be enough to enhance performance by itself but can be effective when coupled with an appropriate fine-tuning approach. Since both ULM and UOP are designed to improve the quality of utterance embeddings, they are expected to improve the accuracy of UID as well. The improvement in UM is indeed encouraging, giving 2% and 1% boosts to BERTpre and RoBERTapre, respectively, and consequently improving the other two metrics.
3.5 Error Analysis
As shown in Table 3, the major errors come from three types of questions: who, how, and why; thus, we select 100 dialogues associated with those question types for which our best model, RoBERTaour, incorrectly predicts the answer spans. Specific examples are provided in Tables 12, 13, and 14 (§A.3). Following Yang et al. (2019), errors are grouped into six categories: entity resolution, paraphrase and partial match, cross-utterance reasoning, question bias, noise in annotation, and miscellaneous.
Table 5 shows the error types and their ratios with respect to the question types. The two main error types are entity resolution and cross-utterance reasoning. The entity resolution error happens when many of the same entities are mentioned in multiple utterances. This error also occurs when the QA system is asked about a specific person but predicts the wrong person because many people appear across multiple utterances. The cross-utterance reasoning error often happens with the why and how questions, where the model mostly relies on pattern matching and predicts a span in the utterance following the matched pattern.
| Error Type | Who | How | Why |
| Paraphrase and Partial Match | 14% | 14% | 13% |
| Noise in Annotation | 4% | 7% | 9% |
4 Conclusion

This paper introduces a novel transformer approach that effectively interprets hierarchical contexts in multiparty dialogue by learning utterance embeddings. Two language modeling approaches are proposed: the utterance-level masked LM and the utterance order prediction. Coupled with the joint inference between token span prediction and utterance ID prediction, these two language models significantly outperform two state-of-the-art transformer approaches, BERT and RoBERTa, on a span-based QA task, FriendsQA. We will evaluate our approach on other machine comprehension tasks that use dialogues as evidence documents to further verify the generalizability of this work.
Acknowledgments

We gratefully acknowledge the support of the AWS Machine Learning Research Awards (MLRA). Any contents in this material are those of the authors and do not necessarily reflect the views of the funding agency.
References

- Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer (2018). QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Alexis Conneau and Guillaume Lample (2019). Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems 32, pp. 7057–7067.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL'19, pp. 4171–4186.
- Aaron Gokaslan and Vanya Cohen (2019). OpenWebText Corpus.
- Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang (2017). Search-based Neural Structured Learning for Sequential Question Answering. In Proceedings of ACL 2017 (Volume 1: Long Papers), pp. 1821–1831.
- Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of ACL 2017 (Volume 1: Long Papers).
- Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette (2018). The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 1907.11692.
- Sebastian Nagel (2016). CC-News Dataset Available.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng (2016). MS MARCO: A Human Generated Machine Reading Comprehension Dataset. In Proceedings of the Workshop on Cognitive Computation (CoCo@NIPS 2016), Barcelona, Spain.
- Pranav Rajpurkar, Robin Jia, and Percy Liang (2018). Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of ACL 2018 (Volume 2: Short Papers).
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of EMNLP 2016.
- Siva Reddy, Danqi Chen, and Christopher D. Manning (2019). CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266.
- Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie (2019). DREAM: A Challenge Data Set and Models for Dialogue-Based Reading Comprehension. Transactions of the Association for Computational Linguistics 7, pp. 217–231.
- Alon Talmor and Jonathan Berant (2018). The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of NAACL 2018 (Volume 1: Long Papers).
- Trieu H. Trinh and Quoc V. Le (2018). A Simple Method for Commonsense Reasoning. arXiv 1806.02847.
- Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman (2017). NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). Attention Is All You Need. In Proceedings of NIPS'17, pp. 6000–6010.
- Zhengzhe Yang and Jinho D. Choi (2019). FriendsQA: Open-Domain Question Answering on TV Show Transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 188–197.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, pp. 5754–5764.
Appendix A Appendices
A.1 Experimental Setup

The BERT model and the RoBERTa model use the same configuration. Both have 12 hidden transformer layers and 12 attention heads. The hidden size is 768, and the intermediate size in the transformer layers is 3,072. The activation function in the transformer layers is GELU.
A batch size of 32 sequences is used for pre-training. The Adam optimizer is used with L2 weight decay, learning rate warm-up over the first 10% of the steps, and linear decay of the learning rate. Dropout is applied to all layers. The cross-entropy is used as the training loss of each task. For the masked language modeling tasks, the model is trained until the perplexity stops decreasing on the development set. For the other pre-training tasks, the model is trained until both the loss and the accuracy stop improving on the development set.
For fine-tuning, the batch size and the optimization approach are the same as in pre-training. The same dropout probability is kept throughout. The training loss is the sum of the cross-entropies of the two fine-tuning tasks described in §2.2.
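The combined fine-tuning loss can be sketched as follows; the function names are hypothetical, and the probabilities are assumed to come from the softmax layers described in §2.2:

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label under a probability list."""
    return -math.log(probs[gold])

def multitask_loss(uid_probs, uid_gold,
                   left_probs, left_gold,
                   right_probs, right_gold):
    """Sum of the cross-entropies of the two fine-tuning tasks: the
    utterance ID prediction and the token span prediction (whose loss is
    itself the sum over the left and right boundary softmaxes)."""
    span_loss = (cross_entropy(left_probs, left_gold)
                 + cross_entropy(right_probs, right_gold))
    return cross_entropy(uid_probs, uid_gold) + span_loss
```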
A.2 Question Type Analysis
The tables in this section show the results with respect to the question types for all models (Section 3.2), in order of performance.
A.3 Error Examples
Each table in this section gives an error example from the excerpt. The gold answers are indicated by the solid underlines whereas the predicted answers are indicated by the wavy underlines.
|Q||Why is Joey planning a big party?|
|J||Oh, we’re having a big party tomorrow night. Later!|
|R||Whoa! Hey-hey, you planning on inviting us?|
|P||Hey!! Get your ass back here, Tribbiani!!|
|M||What Phoebe meant to say was umm, how come you’re having a party and we’re not invited?|
|J||Oh, it’s Ross’ bachelor party.|
J: Joey, R: Rachel, P: Phoebe, M: Monica.
|Q||Who opened the vent?|
|R||Ok, got the vent open.|
|P||Hi, I’m Ben. I’m hospital worker Ben. It’s Ben… to the rescue!|
|R||Ben, you ready? All right, gimme your foot. Ok, on three, Ben. One, two, three. Ok, That’s it, Ben.|
|-||(Ross and Susan lift Phoebe up into the vent.)|
|S||What do you see?|
|P||Well, Susan, I see what appears to be a dark vent. Wait. Yes, it is in fact a dark vent.|
|-||(A janitor opens the closet door from the outside.)|
P: Phoebe, R: Ross, S: Susan.
|Q||How does Joey try to convince the girl to hang out with him?|
|J||Oh yeah-yeah. And I got the duck totally trained. Watch this. Stare at the wall. Hardly move. Be white.|
|G||You are really good at that. So uh, I had fun tonight, you throw one hell of a party.|
|J||Oh thanks. Thanks. It was great meetin’ ya. And listen if any of my friends gets married, or have a birthday, …|
|G||Yeah, that would be great. So I guess umm, good night.|
|J||Oh unless you uh, you wanna hang around.|
|J||Yeah. I’ll let you play with my duck.|
J: Joey, G: The Girl.