Interview: A Large-Scale Open-Source Corpus of Media Dialog

Bodhisattwa Prasad Majumder et al. · University of California, San Diego · April 7, 2020

Existing conversational datasets consist either of written proxies for dialog or small-scale transcriptions of natural speech. We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts. Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating its usefulness in modeling real-world conversations. 'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems. Experiments on two dialog tasks show that leveraging such labels improves performance over strong speaker-agnostic baselines and enables models to generate more specific and inquisitive responses in interview-style conversations.


1 Introduction

Large repositories of textual communications (e.g. forum and microblog posts) have gained recent popularity as proxies for dialog Galley et al. (2019); Ritter et al. (2010); Lowe et al. (2015). However, conversations in these settings differ from natural dialog: turns may be sparsely scattered over a large temporal span, contain distinct syntax and vocabulary Maity et al. (2016), and differ greatly in formality and focus Li et al. (2017). In this paper, we investigate how appropriate such data is for modeling natural dialog, and introduce Interview, a new high-quality large-scale open-domain conversational dataset grounded in interview settings with annotations for specific speaker roles.

Dataset        Spoken   # Dialogs   # Turns     # Words
Reddit         ✗        147 M       –           1,800.0 M
DailyDialog    ✗        13,118      102,979     1.4 M
CALLHOME       ✓        120         22,214      0.3 M
Interview 2P   ✓        23,714      454,739     21.7 M
Interview      ✓        105,848     3,199,856   126.7 M

Table 1: Comparative dialog dataset statistics, including the two-party (2P) subset and the full Interview dataset.

We compare the performance of state-of-the-art language models fine-tuned on Interview and other popular conversational datasets, demonstrating that Interview contains more complex dialog and better models the characteristics of natural spoken conversations. Our dataset is an order of magnitude larger than existing high-quality natural dialog datasets and contains speaker role annotations for each turn, facilitating the development of conversational agents and assistive systems for settings involving specific speaker roles, such as doctor-patient interviews or hosted talk shows.

In particular, we explore the tasks of role modeling in media dialog and role change detection on Interview and find that leveraging role information can enable more nuanced, on-topic and natural dialog generation, as well as improve role change classification performance.

In summary, we present Interview, the first large-scale open-domain media dialog dataset. We explore two tasks for which it serves as a promising benchmark dataset: speaker role modeling and speaker change detection. We build simple yet strong models to show quantitatively that role labels from Interview improve performance on such tasks. Interview’s scale, spoken origins, role diversity, and complex utterances make it a better source for grounded open-domain conversations.

2 Related Work

Broadly speaking, dialog and conversation datasets can be classified as constrained (goal-oriented) or open-domain, written or spoken, and scripted or spontaneous Serban et al. (2018). In the realm of written dialog, the closest proxy to natural dialog comes in the form of role-play-style conversations Bernsen et al. (1998), featuring two agents instructed to participate in a constrained conversation. This setup has seen recent use in constructing goal-oriented Byrne et al. (2019); Budzianowski et al. (2018) and grounded conversations Dinan et al. (2019); Gopalakrishnan et al. (2019). These datasets are expensive to collect at scale and are heavily constrained or guided by the instructions given to participants. Several initiatives have recorded and manually transcribed natural conversations occurring in the course of normal life, resulting in small, high-quality natural dialog datasets Canavan et al. (1997); Godfrey et al. (1992); Renals et al. (2007); Morgan et al. (2001). We explore an alternative venue for collecting a large-scale dataset of natural dialog: conversations and interviews on public radio.

The US Defense Advanced Research Projects Agency (DARPA) has undertaken efforts to collect broadcast and informal conversation from public and private sources, including message boards, SMS DARPA (2011), and broadcast newswire content Strassel (2004); Cohen (2007). However, these datasets are difficult to adopt as widely available benchmarks for dialog modeling, as they come at substantial cost ($100-$1,000 per dataset per year, covering up to a hundred hours of transcribed conversation). In this vein, we contribute an open-access, large-scale corpus of cleanly annotated broadcast media dialog.

Weizman (2008) explores the patterns and discourse within media dialog and contrasts the associated speaker role dynamics with spontaneous natural conversation. The author manually annotates and investigates 24 hours of Israeli news television programs. Our dataset creates an opportunity to investigate speaker dynamics and the significance of speaker roles at scale.

Dialog modeling of open-domain chit-chat predicts one turn of dialog from one or many context turn(s). Structured approaches for dialog modeling build on hierarchical RNNs Sordoni et al. (2015a); Serban et al. (2016); Sankar and Ravi (2019), with recent work employing a simple concatenation of dialog history in a transformer-based architecture Zhang et al. (2019). We draw inspiration from recent works in dialog generation that model speakers via persistent ‘personas,’ whose representations are learned from a set of grounding facts Zhang et al. (2018) or other non-conversational metadata Luan et al. (2017). Our approach eschews external grounding and learns speaker embeddings via dialog modeling, similar to Li et al. (2016). We, however, propose to learn speaker embeddings for different roles and capture role-dependent lexical profiles in conversation.

3 Interview Dataset (https://www.kaggle.com/shuyangli94/interview-npr-media-dialog-transcripts)

We collect a novel dataset of 105K multi-party interview transcripts for 7 programs on National Public Radio (NPR, https://www.npr.org/) spanning 20 years (1999–2019) and totaling 10K hours of conversation. These transcripts contain 3M turns comprising 7.5M sentences (127M words) from 184K speakers, of which 287 are hosts. To investigate role-play in media dialog, we curate a subset, Interview 2P, restricted to two roles, a host and a guest, comprising 23K two-party conversations encompassing 455K turns, 1.24M sentences, and 21.7M words.

In these two-party conversations, each speaker takes an average of nine turns per dialog. Guests tend to speak longer on their turns, with 1.6x as many sentences and 2x as many words per turn, and they also use a more diverse vocabulary (1.6x the size). Meanwhile, hosts ask five times as many questions as guests, with 40% of their dialog turns containing questions. When asking questions, hosts and guests use interrogative forms See et al. (2019) at the same rate (65%). These differing discourse patterns between the host and guest roles support the notion of role modeling.
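To make the per-role statistics above concrete, the following is a minimal sketch of how question rates per role could be computed; the (role, text) pair representation of a transcript and the use of a question mark as a question indicator are simplifying assumptions, not the paper's exact methodology (which also considers interrogative forms).

```python
from collections import Counter

def question_rates(turns):
    """Fraction of each role's turns that contain at least one question mark.

    `turns` is assumed to be a list of (role, text) pairs parsed from a transcript.
    """
    total, questions = Counter(), Counter()
    for role, text in turns:
        total[role] += 1
        if "?" in text:
            questions[role] += 1
    return {role: questions[role] / total[role] for role in total}

# Example usage on a toy transcript
turns = [
    ("host", "Welcome back. How did the negotiations end?"),
    ("guest", "They ended late last night."),
    ("host", "And what happens next?"),
    ("guest", "We wait."),
]
print(question_rates(turns))  # {'host': 1.0, 'guest': 0.0}
```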

Model            Interview   DailyDialog   CALLHOME
GPT2             35.20       57.19         137.21
FT-Interview     17.77       32.85         51.40
FT-DailyDialog   50.05       11.63         82.67
FT-CALLHOME      32.10       33.30         28.19

Table 2: Zero-shot BPE perplexity for GPT2-based models. Bold denotes best out-of-domain performance.

Comparison with Other Datasets

To assess how well Interview represents open-domain dialog, we look to two datasets in widespread use: DailyDialog Li et al. (2017), 13K short dialogs written to simulate simple conversations from daily life; and CALLHOME Canavan et al. (1997), transcriptions of 120 half-hour casual telephone conversations. We measure the language modeling performance of a pre-trained transformer model, the 117M-parameter GPT2 Radford et al. (2019), both in its original form and in versions fine-tuned (FT) on the training splits of Interview, DailyDialog, and CALLHOME. We evaluate the zero-shot performance of these models on the test splits of these datasets, with perplexities shown in Table 2.

While models fine-tuned on a dataset's training set perform best on that dataset, as expected, we observe that 1) models trained on other datasets obtain relatively poor zero-shot performance on Interview; and 2) the model trained on Interview achieves the best out-of-domain performance on DailyDialog and CALLHOME by large margins. This suggests that language models trained on Interview learn patterns characteristic of natural open-domain dialog, both in simple daily conversation and in informal, lengthy spoken exchanges. We also investigate DialoGPT, a model pre-trained on 147M Reddit threads as a proxy for dialog Zhang et al. (2019). Our results show that while Reddit threads can be used to emulate conversation, they may not resemble natural speech; DialoGPT posts by far the worst zero-shot modeling performance across all test datasets (500 perplexity), inferior even to zero-shot GPT2. These experiments confirm that Interview, a dataset of real, complex conversations, is useful for modeling patterns in natural spoken dialog. Statistics for Interview and the other dialog datasets are shown in Table 1.
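The zero-shot comparison above can be sketched with the Hugging Face transformers library as below; the checkpoint name, the dataset loading, and the treatment of each transcript as a single text are assumptions for illustration, not the paper's exact evaluation code.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # 117M-parameter GPT2-small
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # or a fine-tuned checkpoint

def corpus_perplexity(texts):
    """Approximate BPE perplexity of a list of dialog transcripts under the model."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=1024).input_ids
            # labels=input_ids -> mean cross-entropy over the predicted positions
            loss = model(ids, labels=ids).loss
            n_pred = ids.size(1) - 1
            total_nll += loss.item() * n_pred
            total_tokens += n_pred
    return math.exp(total_nll / total_tokens)

print(corpus_perplexity(["HOST: Thanks for joining us. GUEST: Glad to be here."]))
```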

4 Tasks and Experiments

We additionally explore two tasks facilitated by the speaker role annotations in Interview: 1) generating an appropriate response for a specific role given a conversation history (speaker role modeling); and 2) predicting whether a new speaker will interject on the next sentence of a conversation (role change detection). These tasks are crucial components of fluent, role-specific dialog systems in settings such as healthcare and customer service.

4.1 Task 1: Role Modeling

We generate a response conditioned on the host speaker role, specifically modeling how an interview host speaks and inquires, in contrast to speaker-agnostic dialog settings Sordoni et al. (2015b); Shang et al. (2015). Individual guests appear sparsely, and their utterances rely heavily on external world knowledge. We therefore model host responses, which are generally aimed at moderating the conversation via follow-up questions and acknowledgements. Role-specific generation of this kind can benefit the development of assistive technologies and role-dependent dialog systems.

We approach speaker role modeling as a conditional language modeling task: generating the host's next response with the highest likelihood given the trace of prior utterances and their speakers. We use a transformer decoder to generate tokens from the concatenated inputs, but calculate loss only across the target sequence (the gold host response). We mimic the input schema of DialoGPT, concatenating all historical turns with separator tokens and appending the target host response.
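A minimal sketch of this input construction, assuming the DialoGPT convention of using the end-of-text token as the turn separator and the standard Hugging Face convention of masking context positions with -100 so that loss is computed only on the target host response:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
SEP = tokenizer.eos_token  # assumed turn separator, as in DialoGPT

def build_example(history, host_response):
    """Concatenate dialog history and target; loss is computed on the target only."""
    context = SEP.join(history) + SEP
    context_ids = tokenizer(context).input_ids
    target_ids = tokenizer(host_response + SEP).input_ids
    input_ids = context_ids + target_ids
    # -100 masks context positions out of the LM loss (standard HF convention)
    labels = [-100] * len(context_ids) + target_ids
    return input_ids, labels

history = ["GUEST: We are calling for the protection of civilians in the area."]
input_ids, labels = build_example(history, "HOST: What would that protection look like?")
```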

Conditioning on Speakers

To condition on a speaker role, we prepend each utterance in the dialog history with a role-specific speaker ID. Hosts each have their own ID, while guests share a single ID, allowing us to model idiosyncrasies and interviewing patterns for individual hosts.

These role-specific speaker IDs are modeled by a speaker embedding layer of the same dimension as the transformer hidden state, injected into the transformer input layer. We fine-tune GPT2 (Speaker GPT2) and DialoGPT (Speaker DialoGPT) on our dataset with speaker embeddings. We also fine-tune (FT) DialoGPT and GPT2 on Interview without speaker information as strong speaker-agnostic baselines for host response generation.
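One plausible realization of such a speaker embedding layer is sketched below: an embedding table over speaker IDs whose output is added to GPT-2's token embeddings at the input layer. The class name, the additive injection, and the ID count (287 hosts plus one shared guest ID) are assumptions for illustration rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class SpeakerGPT2(nn.Module):
    """GPT-2 with an additive speaker-role embedding at the input layer (a sketch)."""

    def __init__(self, num_speakers, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        hidden = self.gpt2.config.n_embd  # 768 for GPT2-small
        self.speaker_emb = nn.Embedding(num_speakers, hidden)

    def forward(self, input_ids, speaker_ids, labels=None):
        # token embeddings + per-token speaker-role embeddings; position
        # embeddings are added inside GPT-2 when inputs_embeds is passed
        tok = self.gpt2.transformer.wte(input_ids)
        spk = self.speaker_emb(speaker_ids)
        return self.gpt2(inputs_embeds=tok + spk, labels=labels)

model = SpeakerGPT2(num_speakers=288)  # hypothetical: 287 hosts + 1 shared guest ID
input_ids = torch.tensor([[50256, 318, 428]])
speaker_ids = torch.zeros_like(input_ids)
out = model(input_ids, speaker_ids, labels=input_ids)
```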

Model              BPE PPL   BLEU-1   BLEU-4   # Words   Q Rate (%)   NPO (%)   HMA      MRR
Gold               –         –        –        29.3      55.1         37.17     –        –
FT DialoGPT        20.4      11.79    0.29     19.9      53.1         45.28     0.1000   0.2930
FT GPT2            17.4      11.00    0.29     18.2      52.9         45.54     0.1000   0.2930
Speaker DialoGPT   15.3      12.23    0.42     20.8      56.4         47.02     0.3174   0.4662
Speaker GPT2       17.3      12.04    0.30     21.3      59.4         47.37     0.3344   0.4810

Table 3: Metrics on generated host responses on the test set. NPO = noun-phrase overlap with dialog history, HMA = host matching accuracy, MRR = mean reciprocal rank.
Context: SCOTT SIMON: Airstrikes by Syria and the Russian allies have been exacting a high cost in civilian lives […] Panos Moumtzis is the U.N.’s humanitarian coordinator for Syria […] GUEST: […] within Idlib itself, it’s […] the area where many people had fled from […] GUEST: What we are calling for is […] the protection of the children, the women, the families, the civilians who are living in this area. SCOTT SIMON:
Gold: And there have also been strikes from the rebel side.
FT GPT2: What do you see in the future for the U.N.?
Speaker DialoGPT: Mr. Moumtzis, […] we have seen reports that there is a large number of people who live in the area, and they have been killed or wounded in the fighting, and that there are people who live in areas that are being targeted. What do you see as the biggest threat, the biggest danger, of the people that live in Idlib?
Table 4: Sample generated responses. Bold emphasizes specificity and topicality.

For training and evaluation, we provide our model with up to 512 tokens of non-truncated historical turns. We use an 80-10-10 train/dev/test split with unique conversations in each split.

We use GPT2-small (a Transformer with 12 layers, hidden size 768, 12 attention heads, and 117M parameters) as the base architecture for all of our models. We perform BPE tokenization with the GPT2Tokenizer (https://huggingface.co/transformers/model_doc/gpt2.html). We use the RAdam optimizer Liu et al. (2019) with a learning rate set to utilize linear scaling in multi-GPU training. Our models are trained to convergence on 8 NVIDIA Tesla V100 GPUs with a batch size of 5 per GPU. We use teacher forcing to calculate perplexity on all train/dev/test splits. We avoid modeling salutations and sign-offs (which tend to be formulaic, speaker-independent, and specific to the radio station) by restricting target turns to those with at least three prior turns and two following turns of conversation, resulting in a target set of 87K host-only turns for training and 11K host-only turns for dev and test.
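The target-turn restriction can be expressed as a simple filter; the (role, text) representation below is a hypothetical data format used only for illustration.

```python
def select_target_turns(turns):
    """Indices of host turns with at least 3 prior and 2 following turns,
    approximating the rule used to skip formulaic salutations and sign-offs.

    `turns` is assumed to be a list of (role, text) pairs in conversation order.
    """
    return [
        i for i, (role, _) in enumerate(turns)
        if role == "host" and i >= 3 and i <= len(turns) - 3
    ]
```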

We decode the host response via top-k sampling Radford et al. (2019). Results for all models on the test set are shown in Table 3.
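A hedged sketch of the decoding step with the Hugging Face generate API; the value of k and the generation length are placeholders, since the paper's exact settings are not reproduced here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # or a fine-tuned checkpoint

history = ("GUEST: Many people had fled to Idlib from other parts of the country."
           + tokenizer.eos_token)
input_ids = tokenizer(history, return_tensors="pt").input_ids
output = model.generate(
    input_ids,
    do_sample=True,
    top_k=10,                  # placeholder value for k
    max_new_tokens=40,         # placeholder generation length
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0, input_ids.size(1):], skip_special_tokens=True))
```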

Performance

Speaker-conditioned models generate utterances closer to gold length than speaker-agnostic baselines, with significantly lower perplexity and higher BLEU scores. This indicates that including speaker information promotes the generation of higher fidelity responses. Our speaker models, especially Speaker GPT2, produce the most inquisitive responses (59.4% question-asking rate).

In an interview setting, it is also important for host utterances to relate to the conversation at hand, so we evaluate the content similarity between generated responses and the dialog history. Our speaker-conditioned models generate responses with the most noun phrases (detected via https://spacy.io/), i.e., topical references, and these overlap the most with topics in the dialog history, indicating topical relatedness. We note that gold responses include more noun phrases but have lower historical overlap, possibly because hosts bring up new topics.
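The noun-phrase overlap metric might be computed along the following lines with spaCy; the exact matching and normalization rules are assumptions, as the paper does not specify them here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def noun_phrase_overlap(response, history):
    """Fraction of noun phrases in the generated response that also appear
    in the dialog history (a sketch of the topical-overlap metric)."""
    resp_nps = {chunk.text.lower() for chunk in nlp(response).noun_chunks}
    hist_nps = {chunk.text.lower() for chunk in nlp(history).noun_chunks}
    if not resp_nps:
        return 0.0
    return len(resp_nps & hist_nps) / len(resp_nps)

print(noun_phrase_overlap(
    "What do you see as the biggest danger for the people in Idlib?",
    "Many people had fled to Idlib from other parts of the country.",
))
```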

Speaker Role Ranking

To measure the conditioning effect of speaker role profiles on host response generation, we generate a dialog turn given the gold host profile and a dialog history. We then compute the likelihood of generating that response conditioned on the same context but with the gold host and nine randomly sampled hosts. As in Majumder et al. (2019), we rank these likelihoods for each example and report host matching accuracy (HMA), the proportion of examples where the gold host is ranked highest, and the mean reciprocal rank (MRR) Radev et al. (2002) of the gold host. Our speaker-conditioned models achieve much higher HMA and MRR than the strong speaker-agnostic baselines, indicating significant conditioning on host profiles.
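A small sketch of how HMA and MRR could be computed from per-candidate likelihoods; the dictionary-based scoring format is a hypothetical convenience, not the authors' evaluation code.

```python
def hma_and_mrr(scores_per_example):
    """Host matching accuracy and mean reciprocal rank.

    Each element maps candidate host IDs to the likelihood of the generated
    response under that host; the key 'gold' marks the true host.
    """
    hits, rr = 0, 0.0
    for scores in scores_per_example:
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank = ranked.index("gold") + 1
        hits += rank == 1
        rr += 1.0 / rank
    n = len(scores_per_example)
    return hits / n, rr / n

examples = [
    {"gold": -12.3, "host_a": -14.1, "host_b": -13.7},  # gold ranked first
    {"gold": -15.0, "host_a": -14.2, "host_b": -16.8},  # gold ranked second
]
print(hma_and_mrr(examples))  # (0.5, 0.75)
```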

Qualitative Analysis

Our models additionally exhibit several qualitative properties of high-quality, fluent conversation. We present a sample generation in Table 4 (additional samples in the Appendix) that is indicative of broad trends across the test set. None of the models introduces novel information (as the gold response does), but our speaker-conditioned models produce markedly better inquisitive responses. While GPT2 generates a natural-sounding short question with little relevance to the topic at hand, our Speaker DialoGPT model paraphrases previous turns and refers to existing entities to ask a substantive and coherent question. We further perform a human evaluation of subjective dialog quality on a Likert scale, with human raters preferring speaker-model responses over speaker-agnostic responses 62.5% of the time across 150 pairwise comparisons.

4.2 Task 2: Role Change Detection

We also investigate role change detection as a binary classification task for two-party dialogs. As a single turn of dialog may consist of multiple sentences, we aim to use a series of historical sentences and their speakers to classify whether a role change will occur in the next sentence of dialog. In contrast to previous textual speaker change detection tasks Meng et al. (2017), we do not provide the target sentence for which we are predicting the role change. This setting is more realistic for a real-time assistive dialog system and online prediction in general.

We fine-tune BERT Devlin et al. (2019) to encode the dialog history, classifying speaker changes with a linear layer over the [CLS] representation. To understand the role of contextual speaker information in this task, we investigate representing the dialog history with and without speaker labels for each turn. This is a difficult task on our dataset, as BERT obtains a 63.2 F1 score without speaker information, struggling to predict role changes substantially better than random. While the task remains difficult, the classifier benefits from the inclusion of speaker labels, learning speaker embeddings and achieving a 66.1 F1 score. We see the potential for further research toward learning speaker representations to predict role changes and infer the structure of dialogs.
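A minimal sketch of this classifier, assuming BERT-base and a simple text-level way of exposing speaker labels by prepending them to each sentence; the paper instead learns dedicated speaker embeddings, so this is an approximation for illustration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def role_change_logits(history_sentences, with_speakers=True):
    """Score whether a new speaker interjects on the next sentence.

    `history_sentences` is assumed to be a list of (speaker_label, sentence) pairs.
    """
    if with_speakers:
        text = " ".join(f"{spk}: {sent}" for spk, sent in history_sentences)
    else:
        text = " ".join(sent for _, sent in history_sentences)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # classification head over the pooled [CLS] representation
        return model(**inputs).logits

history = [("HOST", "What did you see when you arrived?"),
           ("GUEST", "The streets were empty.")]
print(role_change_logits(history))
```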

5 Conclusion

We contribute a large-scale media dialog dataset that can act as a benchmark for complex, open-domain, role-dependent grounded dialog. We present baseline models for role-conditioned dialog generation and role change detection and show that they benefit from the addition of speaker information. In future work, we aim to perform temporal analyses of trends and biases within Interview and to take advantage of the news setting to investigate external knowledge grounding in long natural conversations. These directions could lead to more coherent free-form and assistive dialog systems.

References

Appendix A Generated Examples

See the following tables for sample dialog histories and generated host responses from each of our baseline and speaker-conditioned dialog models.