Large repositories of textual communications (e.g. forum and microblog posts) have gained recent popularity as proxies for dialog Galley et al. (2019); Ritter et al. (2010); Lowe et al. (2015). However, conversations in these settings differ from natural dialog: turns may be sparsely scattered over a large temporal span, contain distinct syntax and vocabulary Maity et al. (2016), and differ greatly in formality and focus Li et al. (2017). In this paper, we investigate how appropriate such data is for modeling natural dialog, and introduce Interview, a new high-quality large-scale open-domain conversational dataset grounded in interview settings with annotations for specific speaker roles.
| Dataset | Spoken | # Dialogs | # Turns | # Words |
|---|---|---|---|---|
| Reddit Zhang et al. (2019) | ✗ | 147 M | – | 1,800.0 M |
| Interview 2P | ✓ | 23,714 | 454,739 | 21.7 M |
We compare the performance of state-of-the-art language models fine-tuned on Interview and other popular conversational datasets, demonstrating that Interview
contains more complex dialog and better models the characteristics of natural spoken conversations. Our dataset is an order of magnitude larger than existing high-quality natural dialog datasets and contains speaker role annotations for each turn, facilitating the development of conversational agents and assistive systems for settings involving specific speaker roles, such as doctor-patient interviews or hosted talk shows.
In particular, we explore the tasks of role modeling in media dialog and role change detection on Interview and find that leveraging role information can enable more nuanced, on-topic and natural dialog generation, as well as improve role change classification performance.
In summary, we present Interview, the first large-scale open-domain media dialog dataset. We explore two tasks for which it serves as a promising benchmark dataset: speaker role modeling and speaker change detection. We build simple yet strong models to show quantitatively that role labels from Interview improve performance on such tasks. Interview’s scale, spoken origins, role diversity, and complex utterances make it a better source for grounded open-domain conversations.
2 Related Work
Broadly speaking, dialog and conversation datasets can be classified as constrained (goal-oriented) or open-domain, written or spoken, and scripted or spontaneous Serban et al. (2018). In the realm of written dialog, the closest proxy to natural dialog comes in the form of role-play-style Bernsen et al. (1998) conversations, featuring two agents instructed to participate in a constrained conversation. This setup has seen recent usage to construct goal-oriented Byrne et al. (2019); Budzianowski et al. (2018) and grounded conversations Dinan et al. (2019); Gopalakrishnan et al. (2019). These datasets are expensive to collect at scale and are heavily constrained or guided by the instructions given to participants. Several initiatives have recorded and manually transcribed natural conversations occurring in the course of normal life, resulting in small, high-quality natural dialog datasets Canavan et al. (1997); Godfrey et al. (1992); Renals et al. (2007); Morgan et al. (2001). We explore an alternative venue for collecting a large-scale dataset of natural dialog: conversations and interviews on public radio.
The US Defense Advanced Research Projects Agency (DARPA) has undertaken efforts to collect broadcast and informal conversation from public and private sources including messaging boards, SMS DARPA (2011), and broadcast newswire content Strassel (2004); Cohen (2007). However, it proves difficult to adopt these datasets as widely available benchmarks on dialog modeling tasks, as they come with a substantial cost ($100-$1000 per dataset/year, covering up to a hundred hours of transcribed conversation). In this vein, we contribute an open-access large-scale corpus of cleanly annotated broadcast media dialog.
Weizman (2008) explores the patterns and discourse within media dialog and contrasts the associated speaker role dynamics with spontaneous natural conversation. The author manually annotates and investigates 24 hours of Israeli news television programs. Our dataset presents an opportunity to investigate speaker dynamics and the significance of speaker roles at scale.
Dialog modeling of open-domain chit-chat predicts one turn of dialog from one or many context turn(s). Structured approaches for dialog modeling build on hierarchical RNNs Sordoni et al. (2015a); Serban et al. (2016); Sankar and Ravi (2019), with recent work employing a simple concatenation of dialog history in a transformer-based architecture Zhang et al. (2019). We draw inspiration from recent works in dialog generation that model speakers via persistent ‘personas,’ whose representations are learned from a set of grounding facts Zhang et al. (2018) or other non-conversational metadata Luan et al. (2017). Our approach eschews external grounding and learns speaker embeddings via dialog modeling, similar to Li et al. (2016). We, however, propose to learn speaker embeddings for different roles and capture role-dependent lexical profiles in conversation.
3 Interview Dataset (https://www.kaggle.com/shuyangli94/interview-npr-media-dialog-transcripts)
We collect a novel dataset of 105K multi-party interview transcripts for 7 programs on National Public Radio (NPR; https://www.npr.org/) spanning 20 years (1999–2019) and totaling 10K hours. These transcripts contain a total of 3M turns comprising 7.5M sentences (127M words) from 184K speakers, of which 287 are hosts. To investigate role-play in media dialog, we curate a subset, Interview 2P, with two roles: a host and a guest, comprising 23K two-party conversations encompassing 455K turns, with 1.24M sentences and 21.7M words.
In these two-party conversations, each speaker takes an average of nine turns per dialog. Guests tend to speak longer on their turns, with 1.6x as many sentences spoken and 2x as many words per turn, and also use a more diverse vocabulary (1.6x size). Meanwhile, hosts ask five times as many questions as guests, with 40% of their dialog turns containing questions. When asking questions, hosts and guests use interrogative forms See et al. (2019) at the same rate (65%). We note that the host and guest roles have differing discourse patterns, which support the notion of role modeling.
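The per-role comparisons above (words per turn, vocabulary size, question rate) can be computed with simple counting over (role, utterance) turn pairs. The sketch below is illustrative only; the function name, data layout, and the sample turns are our own assumptions, not from the dataset.

```python
from collections import defaultdict

def role_stats(turns):
    """turns: list of (role, text) pairs; role is e.g. 'host' or 'guest'."""
    stats = defaultdict(lambda: {"turns": 0, "words": 0, "questions": 0, "vocab": set()})
    for role, text in turns:
        s = stats[role]
        s["turns"] += 1
        tokens = text.lower().split()
        s["words"] += len(tokens)
        s["vocab"].update(tokens)
        s["questions"] += "?" in text  # crude question heuristic
    return {
        role: {
            "avg_words_per_turn": s["words"] / s["turns"],
            "question_rate": s["questions"] / s["turns"],
            "vocab_size": len(s["vocab"]),
        }
        for role, s in stats.items()
    }

# Hypothetical two-party exchange for illustration.
turns = [
    ("host", "What brought you to Idlib?"),
    ("guest", "We were documenting the humanitarian crisis on the ground there."),
    ("host", "And what did you find?"),
    ("guest", "Families displaced many times over, with nowhere left to flee."),
]
print(role_stats(turns)["host"]["question_rate"])  # 1.0
```

In the full dataset, these counts would be aggregated per conversation before averaging, which is how the "nine turns per dialog" figure is typically derived.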
Comparison with Other Datasets
To assess how well Interview represents open-domain dialog, we look to two datasets in widespread usage: DailyDialog Li et al. (2017), 13K short dialogs written to simulate simple conversations from daily life; and CALLHOME Canavan et al. (1997), transcriptions from 120 half-hour casual telephone conversations. We measure the language modeling performance of a pre-trained transformer model—117M-parameter GPT2 Radford et al. (2019)—both in its original form and versions fine-tuned (FT) on the training splits for Interview, DailyDialog, and CALLHOME. We evaluate the zero-shot performance of these models on the test splits of these datasets, with perplexities shown in Table 2.
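The perplexities compared in Table 2 follow the standard definition: the exponential of the mean per-token negative log-likelihood under the model. A minimal sketch (the NLL values below are illustrative, not from the paper):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats).

    Lower is better; a uniform model over V tokens has perplexity V.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. an average NLL of 2 nats per token gives perplexity e^2 ≈ 7.39
print(round(perplexity([2.0, 2.0, 2.0]), 2))  # 7.39
```

In practice the per-token NLLs come from the language model's cross-entropy loss over the held-out test split, with the same tokenization across datasets so the numbers are comparable.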
While models fine-tuned on the training set performed best on each dataset as expected, we observe that 1) models trained on other datasets obtain relatively poor zero-shot performance on Interview; and 2) the model trained on Interview achieved the best out-of-domain performance on DailyDialog and CALLHOME by large margins. This suggests that language models trained on Interview can learn patterns characteristic of natural open-domain dialog in both simple daily conversation and informal long spoken exchanges. We also investigate DialoGPT, a model pre-trained on 147M Reddit threads as a proxy for dialog Zhang et al. (2019). Our results show that while Reddit threads can be used to emulate conversation, they may not resemble natural speech; DialoGPT posts by far the worst zero-shot modeling performance across all test datasets (500 perplexity)—inferior to zero-shot GPT2. These experiments confirm that Interview, a dataset of real, complex conversations, is useful for modeling patterns in natural spoken dialog. We show statistics for Interview compared to other dialog datasets in Table 1.
4 Tasks and Experiments
We additionally explore two tasks that are facilitated by speaker role annotations in Interview: 1) generating appropriate responses for a specific role given a conversation history (speaker role modeling); and 2) predicting whether a new speaker will interject on the next sentence of a conversation. These tasks are crucial components to building fluent and role-specific dialog systems, for settings such as healthcare and customer service.
4.1 Task 1: Role Modeling
We generate a response conditioned on the host speaker role, to specifically model how an interview host speaks and inquires, contrary to speaker-agnostic dialog settings Sordoni et al. (2015b); Shang et al. (2015). Individual guests appear sparsely and their utterances heavily rely on external world knowledge. Thus, we model host responses, which are generally aimed towards moderating the conversation via follow-up questions and acknowledgements. Role-specific generation like this can benefit the development of assistive technologies and role-dependent dialog systems.
We approach speaker role modeling as a conditional language modeling task: generating the next host response with the highest likelihood given the trace of prior utterances. We use a transformer decoder to generate tokens from the concatenated inputs, but calculate loss only across the target sequence (the gold host response). We mimic the input schema of DialoGPT, concatenating all historical turns with separator tokens and appending the host's target response.
Conditioning on Speakers
To condition on a speaker role, we prepend each utterance in the dialog history with a role-specific speaker ID. Hosts each have their own ID, while guests share a single ID, allowing us to model idiosyncrasies and interviewing patterns for individual hosts.
These role-specific speaker IDs are modeled by a speaker embedding layer of the same dimensions as the transformer hidden state, injected into the transformer input layer. We fine-tune GPT2 (Speaker GPT2) and DialoGPT (Speaker DialoGPT) on our dataset with speaker embeddings. We also fine-tune (FT) DialoGPT and GPT2 on Interview without speaker information as strong speaker-agnostic baselines for host response generation.
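The injection described above amounts to summing a per-token speaker embedding with the token embedding before the first transformer layer. A minimal NumPy sketch, with assumed (not actual) vocabulary, speaker-table, and hidden sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_speakers, d_model = 100, 5, 16  # assumed toy sizes

token_emb = rng.normal(size=(vocab_size, d_model))
# One row per individual host ID, plus a single shared guest ID (row 0 here).
speaker_emb = rng.normal(size=(n_speakers, d_model))

token_ids = np.array([3, 17, 42, 8])   # tokens of a short dialog snippet
speaker_ids = np.array([1, 1, 0, 0])   # host ID 1 speaks, then the shared guest ID

# Each token's input representation = token embedding + its speaker's embedding.
inputs = token_emb[token_ids] + speaker_emb[speaker_ids]
print(inputs.shape)  # (4, 16)
```

Because the speaker embedding matches the hidden size, no architectural change to the pre-trained transformer is needed; the speaker table is simply learned during fine-tuning.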
| Model | BPE PPL | BLEU-1 | BLEU-4 | # Words | Q Rate (%) | NPO (%) | HMA | MRR |
|---|---|---|---|---|---|---|---|---|
| Turn | Text |
|---|---|
| Context | SCOTT SIMON: Airstrikes by Syria and the Russian allies have been exacting a high cost in civilian lives […] Panos Moumtzis is the U.N.’s humanitarian coordinator for Syria […] GUEST: […] within Idlib itself, it’s […] the area where many people had fled from […] GUEST: What we are calling for is […] the protection of the children, the women, the families, the civilians who are living in this area. SCOTT SIMON: |
| Gold | And there have also been strikes from the rebel side. |
| FT GPT2 | What do you see in the future for the U.N.? |
| Speaker DialoGPT | Mr. Moumtzis, […] we have seen reports that there is a large number of people who live in the area, and they have been killed or wounded in the fighting, and that there are people who live in areas that are being targeted. What do you see as the biggest threat, the biggest danger, of the people that live in Idlib? |
For training and evaluation, we provide our model with up to 512 tokens of non-truncated historical turns. We use an 80-10-10 train/dev/test split with unique conversations in each split.
We use GPT2-small (a Transformer with 12 layers, 768 hidden size, 12 heads, and 117M parameters) as the base architecture for all of our models. We perform BPE tokenization with the GPT2Tokenizer (https://huggingface.co/transformers/model_doc/gpt2.html). We use the RAdam optimizer Liu et al. (2019) with linear learning-rate scaling for multi-GPU training. Our models are trained to convergence on 8 NVIDIA Tesla V100 GPUs, with a batch size of 5 per GPU. We use teacher-forcing to calculate perplexity for all train/dev/test splits. We avoid modeling salutations and sign-offs (which tend to be formulaic, speaker-independent, and specific to the radio station) by restricting the target turns to those with at least three prior turns and two following turns of conversation, resulting in a target training set of 87K host-only turns and 11K host-only turns each for dev and test.
Speaker-conditioned models generate utterances closer to gold length than speaker-agnostic baselines, with significantly lower perplexity and higher BLEU scores. This indicates that including speaker information promotes the generation of higher fidelity responses. Our speaker models, especially Speaker GPT2, produce the most inquisitive responses (59.4% question-asking rate).
In an interview setting, it is also important for host utterances to be related to the conversation at hand. We evaluate the content similarity between generated responses and the dialog history. We show that our speaker-conditioned models generate responses with the most noun phrases (detected via https://spacy.io/), i.e. topical references. These also overlap the most with topics in the dialog history, indicating topical relatedness. We note that gold responses include more noun phrases with lower historical overlap, possibly due to hosts bringing up new topics.
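The noun-phrase overlap (NPO) measure above can be sketched as simple set overlap between the noun phrases of a generated response and those of the dialog history. In practice the phrases would come from spaCy's `noun_chunks`; here they are given directly, and the function name and sample phrases are our own illustration:

```python
def np_overlap(response_nps, history_nps):
    """Fraction of the response's noun phrases that also occur in the history."""
    response_nps, history_nps = set(response_nps), set(history_nps)
    if not response_nps:
        return 0.0
    return len(response_nps & history_nps) / len(response_nps)

history = {"the airstrikes", "idlib", "the civilians", "the u.n."}
response = {"idlib", "the civilians", "the biggest threat"}
print(round(np_overlap(response, history), 2))  # 0.67
```

A high overlap signals on-topic follow-ups; gold host turns scoring lower on this measure is consistent with hosts steering the conversation toward new topics.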
Speaker Role Ranking
To measure the conditioning effect of speaker role profiles on host response generation, we generate a dialog turn with the gold host profile and a dialog history. We then compute the likelihood of generating that response conditioned on the same context but with the gold and nine randomly sampled hosts. As in Majumder et al. (2019), we rank the likelihoods for each host and report the host matching accuracy (HMA)—the proportion of cases where the gold host is highest ranked—and Mean Reciprocal Rank (MRR) Radev et al. (2002) of the gold host. Our speaker-conditioned models achieve much higher HMA and MRR compared to strong speaker-agnostic baselines, indicating significant conditioning on host profiles.
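Given the 1-indexed rank of the gold host among the ten candidates for each test example, HMA and MRR reduce to two averages. A minimal sketch (the example ranks are illustrative, not results from the paper):

```python
def hma_mrr(gold_ranks):
    """gold_ranks: 1-indexed rank of the gold host for each example.

    HMA = fraction of examples where the gold host ranks first;
    MRR = mean of 1/rank over all examples.
    """
    hma = sum(r == 1 for r in gold_ranks) / len(gold_ranks)
    mrr = sum(1.0 / r for r in gold_ranks) / len(gold_ranks)
    return hma, mrr

hma, mrr = hma_mrr([1, 2, 1, 5])
print(hma, mrr)  # 0.5 0.675
```

With ten candidate hosts, a model that conditions on the speaker not at all would score HMA ≈ 0.1 in expectation, which is the natural chance baseline for this metric.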
Our models additionally exhibit several qualitative properties of high-quality and fluent conversation. We present a sample generation in Table 4 (additional samples in the Appendix) that is indicative of broad trends across the test set. None of the models are able to introduce novel information (like Gold), but our speaker-conditioned models produce markedly better inquisitive responses. While GPT2 generates a natural-sounding short question with little relevance to the topic at hand, our Speaker DialoGPT model paraphrases previous turns and refers to existing entities to ask a substantial and coherent question. We further performed a human evaluation on a Likert scale to assess subjective dialog quality, with human raters preferring speaker model responses to speaker-agnostic models 62.5% of the time across 150 pairwise comparisons.
4.2 Task 2: Role Change Detection
We also investigate role change detection as a binary classification task for two-party dialogs. As a single turn of dialog may consist of multiple sentences, we aim to use a series of historical sentences and their speakers to classify whether a role change will occur in the next sentence of dialog. In contrast to previous textual speaker change detection tasks Meng et al. (2017), we do not provide the target sentence for which we are predicting the role change. This setting is more realistic for a real-time assistive dialog system and online prediction in general.
We fine-tune BERT Devlin et al. (2019) to encode the dialog history, classifying speaker changes with a linear layer over the [CLS] representation. To understand the role of contextual speaker information in this task, we investigate representing the dialog history with and without speaker labels for each turn. This is a difficult task on our dataset, as BERT obtains a 63.2 F1 score without speaker information, struggling to predict role changes substantially better than random. While the task remains difficult, the classifier benefits from the inclusion of speaker labels, learning speaker embeddings and achieving a 66.1 F1 score. We see the potential for further research toward learning speaker representations to predict role changes and infer the structure of dialogs.
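Before encoding with BERT, the dialog history must be serialized into a single token sequence, with or without per-turn speaker labels. The sketch below shows one plausible serialization; the `[HOST]`/`[GUEST]` marker strings and the function name are our own assumptions (only the `[CLS]`/`[SEP]` conventions are standard BERT usage):

```python
def serialize(history, with_speakers=True):
    """history: list of (speaker, sentence) pairs -> single BERT input string."""
    parts = []
    for speaker, sentence in history:
        prefix = f"[{speaker}] " if with_speakers else ""
        parts.append(prefix + sentence)
    # [CLS] heads the sequence; its final-layer representation feeds the
    # linear role-change classifier. [SEP] delimits sentences.
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"

history = [("HOST", "Thanks for joining us."), ("GUEST", "Glad to be here.")]
print(serialize(history, with_speakers=False))
# [CLS] Thanks for joining us. [SEP] Glad to be here. [SEP]
```

The with/without-speakers comparison in the experiment corresponds exactly to toggling `with_speakers` here (with marker embeddings learned during fine-tuning rather than raw strings).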
We contribute a large-scale media dialog dataset that can act as a benchmark for complex open-domain, role-dependent grounded dialog. We present baseline models for role-conditioned dialog generation and show that they benefit from added speaker information. In future work, we aim to perform temporal analyses of trends and biases within Interview and take advantage of the news setting to investigate external knowledge grounding in long natural conversations. These directions could potentially lead to more coherent free-form and assistive dialog systems.
- Bernsen et al. (1998) Niels Ole Bernsen, Hans Dybkjær, and Laila Dybkjær. 1998. Designing interactive speech systems - from first ideas to user testing. Springer.
- Budzianowski et al. (2018) Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP.
- Byrne et al. (2019) Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In EMNLP.
- Canavan et al. (1997) Alexandra Canavan, David Graff, and George Zipperlen. 1997. CALLHOME American English speech. Linguistic Data Consortium.
- Cohen (2007) Jordan Cohen. 2007. The GALE project: A description and an update. In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 237–237. IEEE.
- DARPA (2011) DARPA. 2011. Broad Agency Announcement: I2O Broad Operational Language Translation (BOLT). DARPA-BAA-11-40.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Dinan et al. (2019) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In ICLR.
- Galley et al. (2019) Michel Galley, Chris Brockett, Xiang Gao, Jianfeng Gao, and Bill Dolan. 2019. Grounded response generation task at DSTC7.
- Godfrey et al. (1992) John J Godfrey, Edward C Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP, volume 1. IEEE.
- Gopalakrishnan et al. (2019) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Interspeech.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A persona-based neural conversation model. In ACL.
- Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In IJCNLP.
- Liu et al. (2019) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019. On the variance of the adaptive learning rate and beyond. CoRR, abs/1908.03265.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
- Luan et al. (2017) Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, and Michel Galley. 2017. Multi-task learning for speaker-role adaptation in neural conversation models. In IJCNLP.
- Maity et al. (2016) Suman Kalyan Maity, Anshit E. Chaudhary, Shraman Kumar, Animesh Mukherjee, Chaitanya Sarda, Abhijeet Patil, and Akash Mondal. 2016. Wassup? LOL : Characterizing out-of-vocabulary words in twitter. In CSCW.
- Majumder et al. (2019) Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian McAuley. 2019. Generating personalized recipes from historical user preferences. In EMNLP, pages 5975–5981.
- Meng et al. (2017) Zhao Meng, Lili Mou, and Zhi Jin. 2017. Hierarchical RNN with static sentence-level attention for text-based speaker change detection. In CIKM.
- Morgan et al. (2001) Nelson Morgan, Don Baron, Jane Edwards, Daniel P. W. Ellis, David Gelbart, Adam Janin, Thilo Pfau, Elizabeth Shriberg, and Andreas Stolcke. 2001. The meeting project at ICSI. In HLT.
- Radev et al. (2002) Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. 2002. Evaluating web-based question answering systems. In LREC.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Renals et al. (2007) Steve Renals, Thomas Hain, and Hervé Bourlard. 2007. Recognition and understanding of meetings the ami and amida projects. In ASRU, pages 238–247. IEEE.
- Ritter et al. (2010) Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of twitter conversations. In HLT.
- Sankar and Ravi (2019) Chinnadhurai Sankar and Sujith Ravi. 2019. Deep reinforcement learning for modeling chit-chat dialog with discrete attributes. CoRR, abs/1907.02848.
- See et al. (2019) Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. In NAACL-HLT.
- Serban et al. (2018) Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. D&D, 9(1).
- Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL.
- Sordoni et al. (2015a) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In CIKM.
- Sordoni et al. (2015b) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT.
- Strassel (2004) Stephanie M Strassel. 2004. Linguistic resources for effective, affordable, reusable speech-to-text. In LREC.
- Weizman (2008) Elda Weizman. 2008. Positioning in media dialogue: Negotiating roles in the news interview, volume 3. John Benjamins Publishing.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL.
- Zhang et al. (2019) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. CoRR, abs/1911.00536.
Appendix A Generated Examples
See the following tables for sample dialog histories and generated host responses from each of our baseline and speaker-conditioned dialog models.