Voice-based “personal assistants” such as Apple’s Siri, Microsoft’s Cortana, Amazon Alexa, and the Google Assistant have finally entered the mainstream. This development is generally attributed to major breakthroughs in speech recognition and text-to-speech (TTS) technologies, aided by recent progress in deep learning Lecun et al. (2015), exponential gains in compute power Steinkrau et al. (2005); Jouppi et al. (2017), and the ubiquity of powerful mobile devices. The accuracy of machine-learned speech recognizers Hinton et al. (2012) and speech synthesizers van den Oord et al. (2016) is good enough for deployment in real-world products, and this progress has been driven by publicly available labeled datasets. However, conspicuously absent from this list is equal progress in machine-learned conversational natural language understanding (NLU) and generation (NLG). The NLU and NLG components of dialog systems, from the earliest research Weizenbaum (1966) to the present commercially available personal assistants, still rely largely on rule-based systems. These NLU and NLG systems are often carefully programmed for very narrow and specific cases Google (2019); Amazon (2019). General understanding of natural spoken behaviors across multiple dialog turns, even in single task-oriented situations, is by most accounts still a long way off. Most of these products are therefore very much hand-crafted, with inherent constraints on what users can say, how the system responds, and the order in which the various subtasks can be completed. They are high precision but relatively low coverage. Not only are such systems unscalable, they also lack the flexibility to engage in truly natural conversation.
Yet none of this is surprising. Natural language is heavily context dependent and often ambiguous, especially in multi-turn conversations across multiple topics. It is full of subtle discourse cues and pragmatic signals whose patterns have yet to be thoroughly understood. Enabling an automated system to hold a coherent task-based conversation with a human remains one of computer science’s most complex and intriguing unsolved problems Weizenbaum (1966). In contrast to more traditional NLP efforts, interest in statistical approaches to dialog understanding and generation aided by machine learning has grown considerably in the last couple of years Rojas-Barahona et al. (2017); Bordes et al. (2017); Henderson et al. (2013). However, the dearth of high quality, goal-oriented dialog data is considered a major hindrance to more significant progress in this area Bordes et al. (2017); Lowe et al. (2015).
To help solve the data problem we present Taskmaster-1, a dataset consisting of 13,215 dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations. For the spoken dialogs, we created a “Wizard of Oz” (WOz) system Kelley (1984) to collect two-person, spoken conversations. Crowdsourced workers playing the “user” interacted with human operators playing the “digital assistant” using a web-based interface. In this way, users were led to believe they were interacting with an automated system while it was in fact a human, allowing them to express their turns in natural ways but in the context of an automated interface. We refer to this spoken dialog type as “two-person dialogs”. For the written dialogs, we engaged crowdsourced workers to write the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant. We refer to this written dialog type as “self-dialogs”. In a departure from traditional annotation techniques Henderson et al. (2013); Rojas-Barahona et al. (2017); Budzianowski et al. (2018), dialogs are labeled with simple API calls and arguments. This technique is much easier for annotators to learn and simpler to apply. As such it is more cost effective and, in addition, the same model can be used for multiple service providers.
Our analysis (Section 4) shows that Taskmaster-1 has more unique words and is more difficult for language models to fit. We also find that Taskmaster-1 is more realistic than MultiWOZ. Specifically, the two-person dialogs in Taskmaster-1 involve more real-world entities than those in MultiWOZ, since we do not restrict conversations to a small knowledge base. Beyond the corpus and the methodologies used to create it, we present several baseline models, including state-of-the-art neural seq2seq architectures, together with perplexity and BLEU scores. We also provide qualitative human performance evaluations for these models and find that automatic evaluation metrics correlate well with human judgments. We will publicly release our corpus containing the conversations, API call and argument annotations, and the human judgments.
Table 1 (excerpt): Taskmaster-1 self-dialogs vs. MultiWOZ
|||Taskmaster-1||MultiWOZ|
|# unique words||21,894||19,175|
|# unique named entities||8,218||1,338|
2 Related work
2.1 Human-machine vs. human-human dialog
Serban et al. (2017) discuss the major features and differences among the existing offerings in an exhaustive and detailed survey of available corpora for data driven learning of dialog systems. One important distinction covered is that of human-human vs. human-machine dialog data, each having its advantages and disadvantages. Many of the existing task-based datasets have been generated from deployed dialog systems such as the Let’s Go Bus Information System Raux et al. (2003) and the various Dialog State Tracking Challenges (DSTCs) Williams et al. (2016). However, it is doubtful that new data-driven systems built with this type of corpus would show much improvement since they would be biased by the existing system and likely mimic its limitations Williams and Young (2007). Since the ultimate goal is to be able to handle complex human language behaviors, it would seem that human-human conversational data is the better choice for spoken dialog system development Budzianowski et al. (2018). However, learning from purely human-human based corpora presents challenges of its own. In particular, human conversation has a different distribution of understanding errors and exhibits turn-taking idiosyncrasies which may not be well suited for interaction with a dialog system Williams and Young (2007); Serban et al. (2017).
2.2 The Wizard of Oz (WOz) Approach and MultiWOZ
The WOz framework, first introduced by Kelley (1984) as a methodology for iterative design of natural language interfaces, presents a more effective approach to human-human dialog collection. In this setup, users are led to believe they are interacting with an automated assistant but in fact it is a human behind the scenes that controls the system responses. Given the human-level natural language understanding, users quickly realize they can comfortably and naturally express their intent rather than having to modify behaviors as is normally the case with a fully automated assistant. At the same time, the machine-oriented context of the interaction, i.e. the use of TTS and slower turn taking cadence, prevents the conversation from becoming fully fledged, overly complex human discourse. This creates an idealized spoken environment, revealing how users would openly and candidly express themselves with an automated assistant that provided superior natural language understanding.
Perhaps the most relevant work to consider here is the recently released MultiWOZ dataset Budzianowski et al. (2018), since it is similar in size, content and collection methodologies. MultiWOZ has roughly 10,000 dialogs which feature several domains and topics. The dialogs are annotated with both dialog states and dialog acts. MultiWOZ is an entirely written corpus and uses crowdsourced workers for both assistant and user roles. In contrast, Taskmaster-1 has roughly 13,000 dialogs spanning six domains and annotated with API arguments. The two-person spoken dialogs in Taskmaster-1 use crowdsourcing for the user role but trained agents for the assistant role. The assistant’s speech is played to the user via TTS. The remaining 7,708 conversations in Taskmaster-1 are self-dialogs, in which crowdsourced workers write the entire conversation themselves. As Krause et al. (2017); Moghe et al. (2018) show, self-dialogs are surprisingly rich in content.
|ASSISTANT:||How can I help you?|
|USER:||Hi, could you help me with booking movie tickets for tonight?|
|ASSISTANT:||What movie are you interested in?|
|ASSISTANT:||Did you have a theater in mind?|
|USER:||Could you check if the Regal Neshaminy… No, AMC Neshaminy in Neshaminy, PA is playing it?|
|ASSISTANT:||Could you spell that?|
|USER:||Sure, n e s h a m i n y.|
|ASSISTANT:||I have a showtime at 7:30 and at 10:30, is that okay?|
|USER:||Yes, could you get two tickets for the 7:30?|
|ASSISTANT:||One moment. Okay so that’s 2 tickets for 7:30 at the AMC Neshaminy 24?|
|ASSISTANT:||It’ll be twenty-four ninety-nine for your tickets.|
|USER:||That sounds great.|
|ASSISTANT:||I’ve confirmed your tickets, they’ll arrive via text shortly. Did you need any other information?|
|USER:||No, that was it. Thank you so much for your help.|
|ASSISTANT:||Great, no problem. I hope you have fun.|
|USER:||I hope so, too. Thank you so much.|
3 The Taskmaster Corpus
There are several key attributes that make Taskmaster-1 both unique and effective for data-driven approaches to building dialog systems and for other research.
Spoken and written dialogs: While the spoken sources more closely reflect conversational language Chafe and Tannen (1987), written dialogs are significantly cheaper and easier to gather. This allows for a significant increase in the size of the corpus and in speaker diversity.
Goal-oriented dialogs: All dialogs are based on one of six tasks: ordering pizza, creating auto repair appointments, setting up rides for hire, ordering movie tickets, ordering coffee drinks and making restaurant reservations.
Two collection methods: The two-person dialogs and self-dialogs each have pros and cons, revealing interesting contrasts.
Multiple turns: The average number of utterances per dialog is about 23, which ensures context-rich language behaviors.
API-based annotation: The dataset uses a simple annotation schema providing sufficient grounding for the data while making it easy for workers to apply labels consistently.
Size: The total of 13,215 dialogs in this corpus is on par with similar, recently released datasets such as MultiWOZ Budzianowski et al. (2018).
3.2 Two-person, spoken dataset
In order to replicate a two-participant, automated digital assistant experience, we built a WOz platform that pairs agents playing the digital assistant with crowdsourced workers playing the user in task-based conversational scenarios. An example dialog from this dataset is given in Figure 1.
3.2.1 WOz platform and data pipeline
While it is beyond the scope of this work to describe the entire system in detail, there are several platform features that help illustrate how the process works.
Modality: The agents playing the assistant type their input which is in turn played to the user via text-to-speech (TTS) while the crowdsourced workers playing the user speak aloud to the assistant using their laptop and microphone. We use WebRTC to establish the audio channel. This setup creates a digital assistant-like communication style.
Conversation and user quality control: Once the task is completed, the agents tag each conversation as either successful or problematic depending on whether the session had technical glitches or user behavioral issues. We are also then able to root out problematic users based on this logging.
Agent quality control: Agents are required to log in to the system, which allows us to monitor performance, including the number and length of each session as well as their averages.
User queuing: When there are more users trying to connect to the system than available agents, a queuing mechanism indicates their place in line and connects them automatically once they move to the front of the queue.
Transcription: Once complete, the user’s audio-only portion of the dialog is transcribed by a second set of workers and then merged with the assistant’s typed input to create a full text version of the dialog. Finally, these conversations are checked for transcription errors and typos and then annotated, as described in Section 3.4.
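The merge step above can be sketched as follows; the `(timestamp, text)` turn format and role labels here are illustrative assumptions, not the actual pipeline’s data structures:

```python
def merge_dialog(assistant_turns, user_turns):
    """Interleave the assistant's typed turns with the transcribed
    user turns by timestamp to produce one full-text dialog.
    Each turn is a (timestamp, text) tuple (hypothetical format)."""
    merged = [(ts, "ASSISTANT", text) for ts, text in assistant_turns]
    merged += [(ts, "USER", text) for ts, text in user_turns]
    merged.sort(key=lambda turn: turn[0])  # restore conversational order
    return [f"{role}: {text}" for _, role, text in merged]

# Toy example: two typed assistant turns and one transcribed user turn.
dialog = merge_dialog(
    assistant_turns=[(0.0, "How can I help you?"), (9.2, "What movie?")],
    user_turns=[(4.1, "I need movie tickets.")],
)
```

Sorting on a shared clock is one simple way to reconcile the two modalities before the typo-checking and annotation passes.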
3.2.2 Agents, workers and training
Both agents and crowdsourced workers are given written instructions prior to the session. Examples of each are given in Figure 2 and Figure 3. The instructions continue to be displayed on screen to the crowdsourced workers while they interact with the assistant. Instructions are modified at times (for either participant or both) to ensure broader coverage of dialog scenarios that are likely to occur in actual user-assistant interactions. For example, in one case users were asked to change their mind after ordering their first item and in another agents were instructed to tell users that a given item was not available. Finally, in their instructions, crowdsourced workers playing the user are told they will be engaging in conversation with “a digital assistant”. However, it is plausible that some suspect human intervention due to the advanced level of natural language understanding from the assistant side.
Agents playing the assistant role were hired from a pool of dialog analysts and given two hours of training on the system interface as well as on how to handle specific scenarios such as uncooperative users and technical glitches. Uncooperative users typically involve those who either ignored agent input or who rushed through the conversation with short phrases. Technical issues involved dropped sessions (e.g. WebRTC connections failed) or cases in which the user could not hear the agent or vice-versa. In addition, weekly meetings were held with the agents to answer questions and gather feedback on their experiences. Agents typically work four hours per day with dialog types changing every hour. Crowdsourced workers playing the user were recruited via Amazon Mechanical Turk. Payment for a completed dialog session lasting roughly five to seven minutes was typically in the range of to . Problematic users are detected either by the agent involved in the specific dialog or by post-session assessment and removed from future requests.
|USER:||Hi I would like to buy 2 tickets for Shazam!|
|ASSISTANT:||What city would you like to see this movie?|
|ASSISTANT:||Ok, I’ll check that location for you.|
|USER:||I would prefer the Edwards Ontario Mountain Village, since it’s closest to me and my guest.|
|ASSISTANT:||What time is best for you?|
|USER:||Either 4 or 6 pm.|
|ASSISTANT:||I’m sorry, but it looks like the 4:10 and the 6:10 pm showings are sold out.|
|USER:||That’s too bad. I really wanted to see that movie.|
|ASSISTANT:||I’m sorry. Is there another movie you would like to see?|
|USER:||How about Captain Marvel at the Edwards Ontario Mountain theater.|
|ASSISTANT:||Show times are 3:45, 7:10 and 10:10 pm. Which would you like?|
|USER:||I am interested in the 7:10 showing.|
|ASSISTANT:||I’m sorry, it looks like the 7:10 showing is also sold out.|
|USER:||Wow, that’s too bad.|
|ASSISTANT:||I’m sorry. Is there another movie you would like me to look up?|
|USER:||No, I think I’ll pass on the movies tonight since those were the two I really wanted to see.|
|ASSISTANT:||If you want, I can check another theater.|
|USER:||No, that’s fine. Thank you for your help.|
3.3 Self-dialogs (one-person written dataset)
While the two-person approach to data collection creates a realistic scenario for robust, spoken dialog data collection, this technique is time consuming, complex and expensive, requiring considerable technical implementation as well as administrative procedures to train and manage agents and crowdsourced workers. In order to extend the Taskmaster dataset at minimal cost, we use an alternative self-dialog approach in which crowdsourced workers write the full dialogs themselves (i.e. interpreting the roles of both user and assistant).
3.3.1 Task scenarios and instructions
Targeting the same six tasks used for the two-person dialogs, we again engaged the Amazon Mechanical Turk worker pool to create self-dialogs, this time as a written exercise. In this case, users are asked to pretend they have a personal assistant who can help them take care of various tasks in real time. They are told to imagine a scenario in which they are speaking to their assistant on the phone while the assistant accesses the services for one of the given tasks. They then write down the entire conversation. Figure 4 shows a sample set of instructions.
3.3.2 Pros and cons of self-dialogs
The self-dialog technique renders quality data and avoids some of the challenges seen with the two-person approach. To begin, since the same person is writing both sides of the conversation, we never see misunderstandings that lead to frustration as is sometimes experienced between interlocutors in the two-person approach. In addition, all the self-dialogs follow a reasonable path even when the user is constructing conversations that include understanding errors or other types of dialog glitches such as when a particular choice is not available. As it turns out, crowdsourced workers are quite effective at recreating various types of interactions, both error-free and those containing various forms of linguistic repair. The sample dialog in Figure 5 shows the result of a self-dialog exercise in which workers were told to write a conversation with various ticket availability issues that is ultimately unsuccessful.
Two more benefits of the self-dialog approach are its efficiency and cost effectiveness. We were able to gather thousands of dialogs in just days without transcription or trained agents, and spent roughly six times less per dialog. Despite these advantages, the self-dialog written technique cannot recreate the disfluencies and other more complex error patterns that occur in the two-person spoken dialogs which are important for model accuracy and coverage.
3.4 Annotation
We chose a highly simplified annotation approach for Taskmaster-1, as compared to traditional, detailed strategies that require robust agreement among workers and usually include dialog state and slot information, among other possible labels. Instead, we focus solely on the API arguments for each type of conversation, i.e. just the variables required to execute the transaction. For example, in dialogs about setting up UBER rides, we label the “to” and “from” locations along with the car type (UberX, XL, Pool, etc.). For movie tickets, we label the movie name, theater, time, number of tickets, and sometimes the screening type (e.g. 3D vs. standard). A complete list of labels is included with the corpus release.
As discussed in Section 3.2.2, to encourage diversity, at times we explicitly ask users to change their mind in the middle of the conversation, and the agents to tell the user that the requested item is not available. This results in conversations having multiple instances of the same argument type. To handle this ambiguity, in addition to the labels mentioned above, the convention of either “accept” or “reject” was added to all labels used to execute the transaction, depending on whether or not that transaction was successful.
|USER:||Finally, I need the table to be for three people and 8pm.|
|ASSISTANT:||One moment….OK, I have your table for three (num.guests.accept) at 8pm (time.reservation.accept) reserved.|
In Figure 6, both the number of people and the time variables in the assistant utterance would have the “.accept” label indicating the transaction was completed successfully. If the utterance describing a transaction does not include the variables by name, the whole sentence is marked with the dialog type. For example, a statement such as The table has been booked for you would be labeled as reservation.accept.
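To make the labeling scheme concrete, here is a hypothetical representation of such an annotated utterance; the field names and record layout are illustrative only, not the released schema:

```python
# Hypothetical record for the span-based API-argument labels described
# above: each label names an argument type plus an accept/reject suffix,
# and each labeled value is a span occurring in the utterance.
utterance = "One moment. OK, I have your table for three at 8pm reserved."

annotations = [
    {"label": "num.guests.accept", "span": "three"},
    {"label": "time.reservation.accept", "span": "8pm"},
]

# Sanity check: arguments are annotated as spans, so every labeled
# value must literally occur in the utterance text.
for ann in annotations:
    assert ann["span"] in utterance
```

A whole-sentence label such as `reservation.accept` would instead attach to the full utterance rather than a span, as described above.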
4 Dataset Analysis
Table 2 (excerpt): self-dialogs vs. two-person dialogs (5k conversations each)
|||Self-dialogs||Two-person|
|# unique words||17,275||13,490|
4.1 Self-dialogs vs MultiWOZ
We quantitatively compare our self-dialogs (Section 3.3) with the MultiWOZ dataset in Table 1. Unlike MultiWOZ, we do not ask users and assistants to stick to detailed scripts, and we do not restrict them to conversations surrounding a small knowledge base. Table 1 shows that our dataset has more unique words and almost twice the number of utterances per dialog compared to the MultiWOZ corpus. When trained with the Transformer Vaswani et al. (2017) model, we observe significantly higher perplexities and lower BLEU scores for our dataset than for MultiWOZ, suggesting that our conversations are more difficult to model. Finally, Table 1 also shows that our dataset contains close to 10 times more real-world named entities than MultiWOZ and thus could potentially serve as a more realistic baseline when designing goal-oriented dialog systems. MultiWOZ has only 1,338 unique named entities and only 4,510 unique values (including date, time, etc.) in its dataset.
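Statistics like the unique-word and utterances-per-dialog counts compared above can be computed in a single pass over the corpus. A minimal sketch, assuming a simplified list-of-utterances dialog format and whitespace tokenization (the released corpus format may differ):

```python
def corpus_stats(dialogs):
    """Compute vocabulary size and average utterances per dialog.
    `dialogs` is a list of dialogs, each a list of utterance strings."""
    vocab = set()
    total_utts = 0
    for dialog in dialogs:
        total_utts += len(dialog)
        for utt in dialog:
            # Lowercase + whitespace split is a crude stand-in for
            # whatever tokenizer was actually used.
            vocab.update(utt.lower().split())
    return {"unique_words": len(vocab),
            "avg_utterances": total_utts / len(dialogs)}

# Toy two-dialog corpus with pre-tokenized punctuation.
stats = corpus_stats([
    ["Hi , I need two tickets .", "What movie ?"],
    ["Book a table for three .", "What time ?", "At 8pm ."],
])
# stats["avg_utterances"] is 2.5 for this toy corpus.
```

The same pass extends naturally to counting named entities once an entity tagger or gazetteer is chosen.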
4.2 Self-dialogs vs Two-person
In this section, we quantitatively compare 5k conversations each of the self-dialogs (Section 3.3) and the two-person dialogs (Section 3.2). From Table 2, we find that self-dialogs exhibit almost three times higher perplexity than the two-person conversations, suggesting that self-dialogs are more diverse and contain more non-conventional conversational flows, in line with the observations in Section 3.3.2. While the number of unique words is higher for the self-dialogs, conversations are longer in the two-person dialogs. We also report metrics from training a single model on both datasets together.
4.3 Baseline Experiments: Response Generation
We evaluate various seq2seq architectures Sutskever et al. (2014) on our self-dialog corpus using both automatic evaluation metrics and human judgments. Following the recent line of work on generative dialog systems Vinyals and Le (2015), we treat the problem of response generation given the dialog history as a conditional language modeling problem. Specifically, we want to learn a conditional probability distribution $P_\theta(T_i \mid T_1, \ldots, T_{i-1})$, where $T_i$ is the next response given the dialog history $T_1, \ldots, T_{i-1}$. Each utterance $T_i$ is itself comprised of a sequence of words $w_{i_1}, \ldots, w_{i_m}$. The overall conditional probability is factorized autoregressively as
$$P_\theta(T_i \mid T_1, \ldots, T_{i-1}) = \prod_{j=1}^{m} P_\theta(w_{i_j} \mid w_{i_1}, \ldots, w_{i_{j-1}}, T_1, \ldots, T_{i-1}).$$
$P_\theta$, in this work, is parameterized by a recurrent, convolutional, or Transformer-based seq2seq model.
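The factorization above can be sketched directly in code: the log-probability of a response is the sum of per-word conditional log-probabilities. The uniform toy model below is a stand-in for any of the seq2seq parameterizations considered here:

```python
import math

def response_log_prob(history, response, word_log_prob):
    """Log-probability of a response under the autoregressive
    factorization: sum over words j of log P(w_j | w_<j, history).
    `word_log_prob(word, prefix, history)` is any conditional model."""
    total, prefix = 0.0, []
    for word in response:
        total += word_log_prob(word, prefix, history)
        prefix.append(word)  # the generated prefix conditions the next word
    return total

# Toy stand-in model: uniform over a 10-word vocabulary, so each
# word contributes log(1/10) regardless of context.
uniform = lambda word, prefix, history: math.log(1 / 10)

lp = response_log_prob(["hi"], ["how", "can", "I", "help"], uniform)
# Four words at probability 1/10 each: lp == 4 * log(0.1).
```

Replacing `word_log_prob` with a trained network’s per-token log-softmax recovers the quantity the perplexity metric is computed from.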
n-gram: We consider 3-gram and 4-gram conditional language model baselines with interpolation. We use random grid search to find the best coefficients for the interpolated model.
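A minimal sketch of an interpolated n-gram model and a random search over its mixture weights; the held-out scoring and the normalization of weights to sum to one are illustrative assumptions, not the exact training setup:

```python
import random

def interpolated_prob(unigram, bigram, trigram, lambdas):
    """Interpolated n-gram probability: a weighted mixture of
    unigram, bigram, and trigram estimates (weights sum to 1)."""
    l1, l2, l3 = lambdas
    return l1 * unigram + l2 * bigram + l3 * trigram

def random_search(held_out, trials=100, seed=0):
    """Random search over interpolation weights: sample weight
    vectors on the simplex and keep the one maximizing the average
    interpolated probability on held-out (uni, bi, tri) estimates."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        raw = [rng.random() for _ in range(3)]
        lambdas = [x / sum(raw) for x in raw]  # normalize to sum to 1
        score = sum(interpolated_prob(u, b, t, lambdas)
                    for u, b, t in held_out) / len(held_out)
        if score > best_score:
            best, best_score = lambdas, score
    return best

# Toy held-out estimates where the trigram model is most predictive,
# so the search should put most weight on the trigram component.
lambdas = random_search([(0.1, 0.3, 0.6), (0.2, 0.1, 0.5)])
```

In practice the search would maximize held-out log-likelihood over 3-gram and 4-gram components, but the mechanics are the same.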
Convolution: We use the fconv architecture (Gehring et al., 2017) and default hyperparameters from the fairseq (Ott et al., 2019) framework (https://github.com/pytorch/fairseq). We train the network with the ADAM optimizer Kingma and Ba (2015), with a learning rate of 0.25 and dropout probability set to 0.2.
LSTM: We consider LSTM models (Hochreiter and Schmidhuber, 1997), with and without attention (Bahdanau et al., 2015), and use the tensor2tensor framework for the LSTM baselines. We use a two-layer LSTM network for both the encoder and the decoder, with 128-dimensional hidden vectors.
Transformer: As with the LSTMs, we use the tensor2tensor framework for the Transformer model. Our Transformer (Vaswani et al., 2017) model uses 256 dimensions for both the input embedding and hidden state, 2 layers and 4 attention heads. For both the LSTMs and the Transformer, we train the model with the ADAM optimizer and dropout probability set to 0.2.
GPT-2: Apart from supervised seq2seq models, we also include results from the pre-trained GPT-2 Radford et al. (2019) model containing 117M parameters.
We evaluate all the models with perplexity and BLEU scores (Table 3). Additionally, we perform two kinds of human evaluation, ranking and rating (Likert scale), for the top-3 performing models: Convolution, LSTM-attention and Transformer. For the ranking task, we randomly show 500 partial dialogs and the generated responses of the top-3 models from the test set to three different crowdsourced workers and ask them to rank the responses by their relevance to the dialog history. For the rating task, we show the model responses individually to three different crowdsourced workers and ask them to rate the responses on a 1-5 Likert scale based on their appropriateness to the dialog history. From Table 4, we see that inter-annotator reliability scores (Krippendorff’s alpha) are higher for the ranking task than for the rating task. From Table 3, we see that the Transformer is the best-performing model on automatic evaluation metrics. It is interesting to note that there is a strong correlation between BLEU score and human ranking judgments.
Table 4 (excerpt): inter-annotator reliability (Krippendorff’s alpha)
|Rating (1-5 Likert)||0.21|
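The perplexity numbers reported in Table 3 follow the standard definition, the exponentiated average negative log-likelihood per token; a minimal sketch:

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token log-probabilities (natural log):
    exp of the average negative log-likelihood."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning each of four tokens probability 1/8 has
# perplexity exactly 8: the model is as uncertain as a uniform
# choice over 8 alternatives at every step.
pp = perplexity([math.log(1 / 8)] * 4)
```

Lower perplexity means the model spreads less probability mass away from the observed responses, which is why the harder Taskmaster-1 conversations yield higher values.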
4.4 Baseline Experiments: Argument Prediction
Next, we discuss a set of baseline experiments for the task of argument prediction. API arguments are annotated as spans in the dialog (Section 3.4). We formulate this problem as mapping the text conversation to a sequence of output arguments. Apart from the seq2seq Transformer baseline, we consider an additional model: an enhanced Transformer seq2seq model in which the decoder can choose to copy from the input or generate from the vocabulary Merity et al. (2017); Gu et al. (2016). Since all the API arguments are input spans, the copy model, having the correct inductive bias, achieves the best performance.
|Model||Micro F1 (%)|
|Transformer + copy||51.79|
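Micro F1 in this setting pools true positives, false positives, and false negatives across all dialogs before computing a single score. A minimal sketch, assuming predicted and gold arguments are represented as sets of (label, value) pairs, which is a simplification of the span-based annotation:

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-dialog argument sets: accumulate
    counts globally, then compute one precision/recall/F1."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # arguments predicted correctly
        fp += len(pred - ref)   # spurious predictions
        fn += len(ref - pred)   # missed gold arguments
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with hypothetical argument labels: the second dialog
# misses one gold argument, so recall (and F1) drops below 1.
score = micro_f1(
    predicted=[{("movie", "Shazam"), ("tickets", "2")},
               {("time", "7:10")}],
    gold=[{("movie", "Shazam"), ("tickets", "2")},
          {("time", "7:10"), ("theater", "AMC")}],
)
```

Micro-averaging weights each argument equally, so dialogs with many arguments influence the score more than short ones.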
5 Conclusion
To address the lack of quality corpora for data-driven dialog system research and development, this paper introduces Taskmaster-1, a dataset that provides richer and more diverse language than current benchmarks, since it is based on unrestricted, task-oriented conversations involving more real-world entities. In addition, we present two data collection methodologies, both spoken and written, that ensure both speaker diversity and conversational accuracy. Our straightforward, API-oriented annotation technique is much easier for annotators to learn and simpler to apply. We give several baseline models, including state-of-the-art neural seq2seq architectures, provide qualitative human performance evaluations for these models, and find that automatic evaluation metrics correlate well with human judgments.
References
- Alexa skills. Cited by: §1.
- Neural machine translation by jointly learning to align and translate. ICLR. Cited by: §4.3.
- Learning end-to-end goal-oriented dialog. ICLR. Cited by: §1.
- MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. EMNLP. Cited by: §1, §1, §2.1, §2.2, §3.1.
- The relation between written and spoken language. Annual Review of Anthropology. Cited by: §3.1.
- Convolutional Sequence to Sequence Learning. Arxiv. Cited by: §4.3.
- Actions on Google. Cited by: §1.
- Incorporating copying mechanism in sequence-to-sequence learning. ACL. Cited by: §4.4.
- Deep neural network approach for the dialog state tracking challenge. SIGDIAL. Cited by: §1, §1.
- Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine. Cited by: §1.
- Long short-term memory. Neural Computation. Cited by: §4.3.
- In-datacenter performance analysis of a tensor processing unit. ISCA. Cited by: §1.
- An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS). Cited by: §1, §2.2.
- Adam: A method for stochastic optimization. ICLR. Cited by: §4.3.
- Edina: building an open domain socialbot with self-dialogues. Arxiv. Cited by: §2.2.
- Deep learning. Nature. Cited by: §1.
- The Ubuntu Dialogue Corpus: a large dataset for research in unstructured multi-turn dialogue systems. SIGDIAL. Cited by: §1.
- Pointer sentinel mixture models. ICLR. Cited by: §4.4.
- Towards exploiting background knowledge for building conversation systems. EMNLP. Cited by: §2.2.
- Fairseq: a fast, extensible toolkit for sequence modeling. NAACL Demonstrations. Cited by: §4.3.
- Language models are unsupervised multitask learners. Arxiv. Cited by: §4.3.
- LET’s go: improving spoken dialog systems for the elderly and non-natives. Eurospeech. Cited by: §2.1.
- A network-based end-to-end trainable task-oriented dialogue system. EACL. Cited by: §1, §1.
- A survey of available corpora for building data-driven dialogue systems. Cited by: §2.1.
- Using GPUs for machine learning algorithms. ICDAR. Cited by: §1.
- Sequence to sequence learning with neural networks. NeurIPS. Cited by: §4.3, Table 3.
- WaveNet: a generative model for raw audio. Arxiv. Cited by: §1.
- Tensor2Tensor for neural machine translation. CoRR. Cited by: §4.3.
- Attention is all you need. NeurIPS. Cited by: §4.1, §4.3.
- A neural conversational model. Arxiv. Cited by: §4.3.
- ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM. Cited by: §1, §1.
- Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language. Cited by: §2.1.
- The dialog state tracking challenge series: a review. Dialog and Discourse. Cited by: §2.1.