Building dialog systems typically requires a large collection of conversation logs that a model can use as training data. Crowd-sourcing is a popular method for generating such datasets and, depending on the aspect of dialog modeling being studied, crowd-sourced workers may be asked to annotate existing chat logs for intents and dialog acts Yu and Yu (2019); Rastogi et al. (2020), create dialog summaries Gliwa et al. (2019), converse with each other based on a script Zhang et al. (2018), or converse to accomplish tasks or goals El Asri et al. (2017); Budzianowski et al. (2018); Byrne et al. (2019). For instance, to create datasets for task-oriented dialogs, crowd-sourced workers may be provided with a goal (instruction) that describes the task that needs to be accomplished; workers then play the roles of a user and an agent to generate conversations Budzianowski et al. (2018). The user worker begins the conversation by stating their requirement and the agent worker provides information to the user by querying a knowledge base (KB), if required. Together, the two workers interact with each other via natural language to generate conversations that can involve booking restaurant tables, making train reservations, calling a taxi, etc. However, creating large crowd-sourced datasets can be time-consuming and expensive.
To reduce the cost associated with generating such dialog datasets, recent works have explored methods to automatically create larger datasets from small samples. Such methods include generating paraphrased utterances of existing conversations using sequence-to-sequence generative approaches Hou et al. (2018); Anaby-Tavor et al. (2020) and generating annotations including intent-slots and dialog acts Yoo et al. (2019, 2020b). While it is reasonably straightforward to generate paraphrases for utterances, generating complete conversations directly from goals is significantly more challenging. This is because, unlike generating a paraphrase for an utterance, generating complete conversations requires systems to model the interaction between utterances over multiple conversation turns. Further, these interactions may also require the use of information present in external knowledge bases.
In this paper, we address this challenging problem of generating complete conversations using a goal that describes the task to be accomplished, by simulating the crowd-sourced data creation process. Thus, instead of creating conversations by having crowd-sourced workers chat with each other, we create conversation data by using two bots that simulate a user and an agent crowd-sourced worker.
Similar to the crowd-sourced data collection setup, the user bot has access to the goal while the agent bot has access to a knowledge base (KB). The agent bot maps the dialog history to a belief state (query) that can be executed over the KB to retrieve a set of results, if required. Thus, the two bots are trained to interact with each other to generate conversations conditioned on the goals and the KB. An example of a generated conversation is shown in Figure 1. We train these bots using 5-30% of real crowd-sourced worker conversations and demonstrate how our simulated chats can be used as an effective data augmentation strategy. At the core of our model we utilize GPT-2 Radford et al. (2018) - a transformer-based language model trained on a large number of documents crawled from the web. To the best of our knowledge, we are the first to present a model for generating entire conversations by simulating the crowd-sourced data collection process. Our experiments show that by using a small number of existing conversations, we are able to train meaningful user and agent bots that in turn generate new conversations.
Contributions: (1) We present a novel dialog-generation framework that mimics the roles played by crowd-sourced workers to generate complete conversations. (2) We demonstrate that training the simulators with just 5-10% data can serve as an effective method to generate new conversations. We find that using simulation-augmented data to train end-task dialog models in low data settings leads to a performance improvement of 18-25%. (3) We include a qualitative study to demonstrate how changes to the goal result in completely new conversations reflective of the new task.
Due to the costs associated with generating large dialog datasets, recent work has explored a variety of methods to artificially generate additional training data. For example, generating paraphrases is a widely used strategy for augmenting training data for dialog models. Paraphrases have been used to improve intent classification Anaby-Tavor et al. (2020), generate alternative conversation turns Gao et al. (2020), improve response ranking Du and Black (2018) etc. Methods to generate paraphrases can vary - these include the use of syntactic parse trees to generate alternatives Du and Black (2018)
, or generative models based on variational autoencoders Malandrakis et al. (2019) and sequence-to-sequence models Gao et al. (2020). Some methods developed for data augmentation exploit dialog-task specific features; for instance, in tasks where dialog-act labels are available, work that uses these labels to alter conversational flow to generate responses has also been explored Zhang et al. (2020). Further, methods that generate new data to improve dialog act classification Yoo et al. (2020a) or even inject noise to improve robustness in dialog act prediction for ASR data Wang et al. (2020) have also been developed.
Large-scale pretrained language models Brown et al. (2020) have achieved extensive generalization in natural language understanding and generation across a plethora of tasks, including question-answering, text summarization and machine translation. In contrast to existing methods that modify existing conversations to generate additional data Zhang et al. (2020); Gao et al. (2020), we propose a new augmentation framework that harnesses the strength of such large-scale language models to simulate the crowd-sourced data collection process and generate entirely new conversations.
Constrained Dialog Generation
We assume that the dialog $D$ comprises a sequence of utterances between a user and an agent, that is, $D = \{U_1, A_1, U_2, A_2, \ldots, U_T, A_T\}$, where $U_t$ is a user utterance while $A_t$ is an agent utterance. At any given turn $t$, the sequence of utterances prior to the turn, that is, $H_t = \{U_1, A_1, \ldots, U_{t-1}, A_{t-1}\}$, is referred to as the dialog context or dialog history. Apart from the dialog $D$, we have access to a set of goals $G$ and a knowledge base $K$. The aim is to learn a model that can generate the dialog conditioned on the goals and the knowledge base. That is, we wish to model $P(D \mid G, K)$.
The dialog generation framework mimics the human-to-human data collection approach used in MultiWOZ Budzianowski et al. (2018). The dialog is generated in a sequence of turns alternating between the user bot and the agent bot. The user bot has access to the goals $G$ while the agent bot can query the knowledge base $K$. Thus, the joint distribution of the dialog decomposes as follows:

$$P(D \mid G, K) = \prod_{t=1}^{T} P(U_t \mid H_t, G) \, P(A_t \mid H_t, U_t, G, K)$$

The dialog history for the first turn, $H_1$, is an empty set. The first factor in the product corresponds to the user bot, which conditions on the goals as well as the dialog history to output the user utterance. The second factor models the distribution of the agent bot over the responses, conditioned on the dialog history, knowledge base and the goals. A pictorial representation of the interaction between the two bots is shown in Figure 2. We discuss the various modules in the two bots in further detail below. Note that all the modules in Figure 2 (shown in green) also receive the dialog history as input, which has not been shown in the figure for ease of presentation.
The user bot generates utterances conditioned on the dialog history and the goals, that is, it models $P(U_t \mid H_t, G)$. For the sake of readability, we will remove the turn index $t$ from the distribution. As shown in Figure 2, this distribution is modeled in two steps. First, the dialog history and the goals are fed to a response generator module, which outputs a pool of candidate responses $\mathcal{R} = \{R_1, \ldots, R_n\}$. A response selector module then assigns a score $s(R_i)$ to each response in the pool. Based on these scores, we define the distribution as follows:

$$P(U = R_i \mid H, G) = \frac{\exp(s(R_i))}{\sum_{j=1}^{n} \exp(s(R_j))}$$

The candidate response with the highest probability is selected as the next user utterance and sent to the agent bot.
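The score-then-select step can be illustrated with a small sketch; `select_response` is a hypothetical helper and the scores below are illustrative, not outputs of the actual selector network:

```python
import math

def select_response(candidates, scores):
    """Turn selector scores into a softmax distribution over the
    candidate pool and pick the highest-probability response."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs

candidates = ["i need a cheap hotel in the north .",
              "thank you , goodbye .",
              "i am looking for a hotel ."]
# the first candidate has the highest score, so it is selected
best, probs = select_response(candidates, [2.1, -0.5, 1.3])
```

Because softmax is monotonic, selecting the highest-probability candidate is equivalent to selecting the highest-scoring one; the distribution form matters when sampling rather than taking the argmax.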
Next, we discuss the various modules in the user bot and how they are trained. The input and output formats for the various networks of these modules are shown in Figure 3.
The aim of the response generator module is to output a pool of candidate user utterances for the given dialog history and goals. To achieve this, an autoregressive distribution over the tokens of the utterance is defined. That is, if $U = (w_1, w_2, \ldots, w_n)$, we define a distribution as follows:

$$P(U \mid H, G) = \prod_{k=1}^{n} P(w_k \mid w_{<k}, H, G)$$

where $H$ is the dialog history and $w_{<k}$ refers to all the tokens in $U$ before $w_k$. We finetune the pretrained GPT-2 network to model the above distribution by maximum likelihood. Specifically, given the tokens in the goals and the dialog history, the GPT-2 network is trained to output the tokens of the user utterance.
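The flattening of goal, history and target utterance into a single training sequence can be sketched as follows; the separator markers (`<goal>`, `<history>`, etc.) are illustrative assumptions, not the paper's exact special-token vocabulary:

```python
def build_user_training_sequence(goal, history, target_utterance):
    """Flatten goal, dialog history and target user utterance into a
    single token sequence for language-model finetuning. During
    training, the loss is computed on the target utterance tokens."""
    parts = ["<goal>", goal, "<history>"]
    for speaker, utterance in history:
        parts += [f"<{speaker}>", utterance]
    parts += ["<user>", target_utterance, "<eos>"]
    return " ".join(parts)

seq = build_user_training_sequence(
    goal="book a cheap hotel in the north",
    history=[("user", "hello"), ("agent", "how can i help ?")],
    target_utterance="i need a cheap hotel in the north .",
)
```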
While it is possible to sample an utterance from the GPT-2 network via greedy sampling or beam search, this poses several issues. First, autoregressive distributions tend to assign high probability to short utterances. Second, utterances that occur frequently in the corpus tend to have higher probability than less frequent but more informative responses. We observed this behavior with the user and agent bots when the greedy response was selected as the final response for each bot. We therefore sample a pool of candidate responses via nucleus sampling Holtzman et al. (2020).
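One common remedy for these sampling issues is nucleus (top-p) sampling Holtzman et al. (2020), which draws candidate pools from the most probable tokens only; a minimal sketch, not the paper's implementation:

```python
import math

def nucleus_filter(logits, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds top_p, then renormalise. Returns (token_ids, probs) from
    which the next token is sampled."""
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return kept, [probs[i] / mass for i in kept]

# the very unlikely fourth token falls outside the nucleus
kept, probs = nucleus_filter([2.0, 1.0, 0.1, -3.0], top_p=0.9)
```

Sampling a whole pool simply repeats this token-level procedure for each candidate utterance, yielding diverse responses that the selector can then rank.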
The aim of the response selector module is to assign a score to each candidate response in the pool based on its relevance to the dialog history. We achieve this by feeding the tokens of the dialog history and the response (clubbed with special tokens such as [CLS] and [SEP] as shown in Figure 3) to a Longformer network architecture Beltagy et al. (2020). The network outputs a contextualized embedding for each token. We feed the embedding of the [CLS] token through a linear layer followed by a sigmoid unit. The output of the network corresponds to the score assigned to the response for the given dialog history.
The network is trained to assign high scores to the positive (or ground-truth) responses while assigning low scores to the negatively sampled responses. For each gold context-response pair, we provide a total of 10 negative response samples. These samples contain 5 random responses, 2 responses which are already part of the context (in order to stop the response selector from picking such responses) and 3 responses formed by concatenating 2 random responses (to discourage the response selector from picking overly long candidate responses).
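The 5+2+3 negative-sampling scheme can be sketched as below; `build_negatives` is a hypothetical helper and the exact sampling details are assumptions:

```python
import random

def build_negatives(context_utterances, corpus_responses, rng):
    """Assemble the 10 negatives for one gold context-response pair:
    5 random corpus responses, 2 utterances already in the context,
    and 3 concatenations of two random responses."""
    negatives = rng.sample(corpus_responses, 5)
    negatives += rng.sample(context_utterances, 2)
    for _ in range(3):
        a, b = rng.sample(corpus_responses, 2)
        negatives.append(a + " " + b)
    return negatives

rng = random.Random(0)
corpus = ["response %d" % i for i in range(20)]
context = ["hi", "hello", "how can i help ?"]
negatives = build_negatives(context, corpus, rng)
```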
The network is trained via the triplet loss Chechik et al. (2010); Hoffer and Ailon (2015). Specifically, given the dialog history $H$, the ground-truth response $R^{+}$ and a negatively sampled response $R^{-}$, the triplet loss is defined as follows:

$$\mathcal{L}(H, R^{+}, R^{-}) = \max\left(0,\; s(H, R^{-}) - s(H, R^{+}) + \alpha\right)$$

where $s(H, R)$ is the score assigned by the network to the response $R$ for the given dialog history $H$, and $\alpha$ is a fixed margin used in our experiments.
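The triplet loss amounts to a hinge on the score gap; a minimal sketch, with an illustrative margin value since the paper's exact margin is not reproduced here:

```python
def triplet_loss(score_pos, score_neg, margin=0.5):
    """Margin-based triplet loss on selector scores: pushes the
    ground-truth response's score above the negative's by at least
    `margin`; zero loss once the gap is wide enough."""
    return max(0.0, score_neg - score_pos + margin)

# already separated by more than the margin -> no loss
assert triplet_loss(0.9, 0.1) == 0.0
```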
The agent bot models the distribution of the agent response conditioned on the dialog history $H$, the user utterance $U$ and the knowledge base $K$, that is, $P(A \mid H, U, K)$. This distribution is modeled in four steps, as shown in Figure 2. First, the agent bot feeds the dialog history and the last user utterance to the belief state generator module, which outputs a belief state $B$ of slot-value pairs (also referred to as a query). Next, the query is executed over the knowledge base and a set of entities $E$, whose attributes match the values in the query, is returned. The entities, the belief state, the dialog history and the user utterance are fed to the response generator, which outputs a pool of candidate responses $\mathcal{R} = \{R_1, \ldots, R_n\}$. Finally, the responses in the pool are scored by the response selector. Based on these scores, we define the distribution of the agent response as follows:

$$P(A = R_i \mid H, U, K) = \frac{\exp(s(R_i))}{\sum_{j=1}^{n} \exp(s(R_j))}$$

where $s(R_i)$ is the score of the candidate response. The candidate response with the highest probability is selected and sent to the user bot to generate the next turn. This interaction between the user and agent bots is repeated until the user bot outputs the end-of-dialogue token.
Next, we discuss in detail about the modules in the agent bot and how these modules are trained. Note that these modules do not share weights with the corresponding modules of the user bot. The input and output formats for the various networks of these modules are shown in Figure 3.
Belief State (query) Generator
The aim of the belief state generator is to generate a belief state for the given dialog history and last user utterance. Here, a belief state is a sequence of pairs of the form attribute_name=attribute_value. To achieve this, we define a distribution over the belief states that can be executed over the knowledge base. The belief state generator treats the belief state as a sequence of tokens and trains a GPT-2 network to model the distribution of the belief state tokens given the tokens of the dialog history and user utterance. Once the belief state generator has been trained, a belief state is sampled by greedy sampling and executed over the knowledge base.
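Executing a generated belief state over the KB can be sketched as attribute matching; the `;`-separated string format and the toy KB below are simplifying assumptions:

```python
def execute_belief_state(belief_state, kb):
    """Parse a belief state string of the form
    'attr1=val1 ; attr2=val2' and return the KB entities whose
    attributes match every constraint."""
    constraints = {}
    for pair in belief_state.split(";"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            constraints[name.strip()] = value.strip()
    return [entity for entity in kb
            if all(entity.get(k) == v for k, v in constraints.items())]

kb = [{"name": "archway house", "area": "south", "pricerange": "cheap"},
      {"name": "bridge guest house", "area": "north", "pricerange": "cheap"}]
results = execute_belief_state("area=north ; pricerange=cheap", kb)
# only "bridge guest house" satisfies both constraints
```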
This module mimics the response generator of the user bot with the exception that the input to the GPT-2 network comprises the dialog history, the last user utterance, the belief state and the KB results. The GPT-2 network is used to define an autoregressive distribution over the tokens of the agent response and is trained using maximum likelihood. Once the module is trained, a pool of candidate responses is sampled via nucleus sampling.
This module outputs the score of each agent response in the candidate pool. To achieve this, the dialog history, the last user utterance and the agent response are fed to the Longformer network architecture (clubbed with [CLS] and [SEP] tokens). The contextualized embedding of the [CLS] token is fed to a linear layer followed by a sigmoid unit. The training of this network as well as the selection of negative samples mimics the training of the response selector for the user bot. Once the model has been trained, it outputs a score for each agent response in the candidate pool.
The user and the agent bot continue to interact with each other until the end-of-dialogue token is output by the user bot. All the user and agent utterances created till this juncture as well as the belief states and KB results comprise the generated dialog.
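The overall self-play loop can be sketched as below; the two bot callables stand in for the trained generator-plus-selector pipelines, and the `<end_of_dialogue>` token string is an illustrative placeholder:

```python
def simulate_dialog(user_bot, agent_bot, goal, kb, max_turns=10):
    """Alternate user and agent turns until the user bot emits the
    end-of-dialogue token (or a turn cap is hit as a safeguard)."""
    history = []
    for _ in range(max_turns):
        user_utt = user_bot(goal, history)
        history.append(("user", user_utt))
        if user_utt == "<end_of_dialogue>":
            break
        belief, agent_utt = agent_bot(history, kb)
        history.append(("agent", agent_utt))
    return history

# toy bots: the user asks once then ends; the agent always answers
turns = iter(["i need a cheap hotel .", "<end_of_dialogue>"])
user_bot = lambda goal, history: next(turns)
agent_bot = lambda history, kb: ("pricerange=cheap",
                                 "how about archway house ?")
dialog = simulate_dialog(user_bot, agent_bot,
                         goal="find a cheap hotel", kb=[])
```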
In this section, we experiment with our data generation framework. We study the following research questions: (1) Are the simulated chats generated by our user and agent bots useful? (2) Does the query generator in the agent bot generate meaningful queries? (3) Can the simulated conversations be used to augment the training data in low (5% of training data), medium (10% of training data) and full data (100% of training data) settings? (4) Can our simulators adapt to changes in input goals and reflect them in the generated dialog?
We use the MultiWOZ 2.1 dataset Budzianowski et al. (2018) to study our simulators. MultiWOZ is a large-scale multi-domain dialogue dataset consisting of 10438 conversations distributed across seven domains: Attraction, Train, Police, Hotel, Hospital, Restaurant and Taxi. Each conversation is associated with a goal that was used by the crowd-sourced workers to generate the conversations. The dataset is divided into a training set (8430 conversations), a validation set (1000 conversations) and a test set (1000 conversations). 30% of the dataset consists of conversations with a single goal, i.e., they require accomplishing just one task. The rest are multi-goal dialogues, i.e., conversations that accomplish more than one task, e.g., booking a train followed by making a restaurant reservation.
End-task dialog model
The dialogs in the training data are augmented with the generated dialogs and used for training an end-task dialog model. The end-task is to generate a response for a given dialog history on the MultiWOZ dataset. We could use any existing model developed for the MultiWOZ task as our end-task model. In contrast to recent state-of-the-art models such as DAMD Zhang et al. (2020), SimpleTOD Hosseini-Asl et al. (2020) and PARG Gao et al. (2020), our simulators do not generate dialog acts, which are heavily used by these models. Thus, we choose to implement a simple end-task model based on GPT2 which takes in the current context, belief state (query) and KB results as input, to generate final responses using greedy sampling. The agent model generates delexicalised responses using the format followed by MultiWOZ Budzianowski et al. (2018). For example, 'archway house is located in south' after delexicalisation becomes '[hotel_name] is located in [value_area]'. The end-task model uses the same architecture as the agent bot but it does not use response selectors and instead directly generates responses using greedy sampling.
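The delexicalisation step can be sketched as a string substitution; the slot map below is a hand-built example rather than the full MultiWOZ ontology:

```python
def delexicalise(utterance, entity_slots):
    """Replace entity values with slot placeholders, following the
    MultiWOZ delexicalisation convention."""
    for value, placeholder in entity_slots.items():
        utterance = utterance.replace(value, placeholder)
    return utterance

slots = {"archway house": "[hotel_name]", "south": "[value_area]"}
out = delexicalise("archway house is located in south", slots)
# → "[hotel_name] is located in [value_area]"
```

At generation time the placeholders are later filled back in with values from the KB query results.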
Data Generation using Simulators
As mentioned previously, our simulator allows the generation of new conversations based on a goal. In our experiments, we operate our simulators using 5%, 10% and 30% of the original training data. In each setting, we generate an equal number of conversations using the single-goal data. In addition, to generate multi-goal conversations, we concatenate single-goal generated conversations from different domains. We generate twice as many multi-goal conversations as single-goal conversations to mimic the distribution of the full MultiWOZ dataset. Thus, we augment the $x\%$ of conversations from the original training data with three times as many generated conversations, to obtain a total augmented size of $4x\%$.
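The augmentation arithmetic works out as follows; a small sketch of the ratios described above:

```python
def augmented_counts(n_real):
    """Given n_real original conversations, generate an equal number
    of single-goal dialogs plus twice as many multi-goal dialogs,
    yielding 3x generated data and a 4x total training set."""
    single_goal = n_real
    multi_goal = 2 * n_real
    return {"real": n_real,
            "generated": single_goal + multi_goal,
            "total": n_real + single_goal + multi_goal}

counts = augmented_counts(100)
# → {"real": 100, "generated": 300, "total": 400}
```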
Recall that each conversation requires KB queries by the agent. Our agent simulator generates queries as described earlier and thus, while training the end-task dialog models using the simulated data, we use these generated values as the oracle belief state. Similar to existing work on this dataset, we use delexicalised agent utterances using the format followed by MultiWOZ Budzianowski et al. (2018) which are later updated with KB values based on the results of the query.
In order to generate reasonable conversations from small amounts of training data, we train separate models for each domain (restaurant, train, hotel, etc.) using single-goal dialogues from the training dataset. For each domain, we create separate user bots and agent bots along with their constituent modules, consisting of query models (for tracking belief state), response generators and response selectors. We use GPT2-small (12 layers, 768 hidden size, 117M parameters) from the 'Transformers' library by Huggingface Wolf et al. (2019) for the response generators. For response selectors, we use Longformers (12 layers, 1024 hidden size, 149M parameters) Beltagy et al. (2020) for both user and agent models. We train on 5%, 10% and 30% of the training data with a learning rate of 1e-5. The Adam optimizer with default settings is used for all the models.
We evaluate the usefulness of our generated data by using it to train a dialog model for the end-task. We therefore use BLEU, Inform and Success rates as originally defined by Budzianowski et al., along with the combined score Mehri et al. (2019) given by $\text{Combined} = 0.5 \times (\text{Inform} + \text{Success}) + \text{BLEU}$. While BLEU is used to evaluate the fluency of the generated response, Inform and Success measure the relevance of the agent utterances. Specifically, the Inform rate measures the correctness of the entity provided by the agent at a particular conversation turn, while the Success rate measures how often the agent was able to provide correct attributes when requested by the user.
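The combined metric is a one-line computation; the example values are taken from DAMD's oracle row in Table 2:

```python
def combined_score(bleu, inform, success):
    """Combined MultiWOZ metric: 0.5 * (Inform + Success) + BLEU,
    as used by Mehri et al. (2019)."""
    return 0.5 * (inform + success) + bleu

score = combined_score(17.3, 80.3, 65.1)
# ≈ 90.0, matching the table's reported Combined score
```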
We compare the performance of the GPT2 based end-task dialog model by training it using 5%, 10%, 30% of the MultiWOZ training data as well as by additionally including data generated using our simulators.
Data Augmentation in Low Data Settings
As can be seen in Table 1, the additional use of data generated by our simulators results in a significant improvement on the Combined metric. For instance, when using the oracle belief states in the end-task model, the use of our simulated data results in an 18-25% improvement. The improvements in performance suggest that the conversations generated by the simulators are meaningful. Further, recall that the end-task model is trained to generate queries (belief states).
The original training data includes the queries (belief states) created by crowd-sourced workers, while in the case of the simulated data, these are created by the agent bot using the query generator module. Does the end-task model learn how to generate queries using this simulated data? As can be seen from the lower half of Table 1, when the end-task model itself generates queries, the performance gains continue to be significant even though it is trained on simulated data. This suggests our simulator is also able to generate meaningful belief states via the query generator. It is interesting to note that when using generated belief states, the use of simulated data in low data settings (5%) results in a performance improvement of 146% (Combined metric).
| Model | Belief State | BLEU | Inform | Success | Combined |
|---|---|---|---|---|---|
| DAMD Zhang et al. (2020) | Oracle | 17.3 | 80.3 | 65.1 | 90 |
| MogNet* Pei et al. (2020) | Oracle | 19.03 | 73.4 | 63.4 | 87.43 |
| SimpleTOD* Hosseini-Asl et al. (2020) | Oracle | 16.01 | 79.3 | 65.4 | 88.36 |
| GPT2 with Simulated Chats | Oracle | 15.06 | 80.4 | 62.2 | 86.36 |
| DAMD Zhang et al. (2020) | Generated | 18 | 72.4 | 57.7 | 83.05 |
| SimpleTOD* Hosseini-Asl et al. (2020) | Generated | 14.99 | 83.4 | 67.1 | 90.24 |
| GPT2 with Simulated Chats | Generated | 14.62 | 72.5 | 53.7 | 77.72 |
Data Augmentation on Full Data
Since our simulated data helps improve the performance of the end-task dialog model in low data settings, we also study whether it can help improve the performance of the dialog model when it is used to augment the full MultiWOZ training data.
We include additional simulated data from the 30% setting described previously along with the full MultiWOZ dataset to train our GPT2-based end-task model. As can be seen in Table 2, the additional use of simulated data on top of the full training data results in a 1-3% gain in performance. For comparison, we also include the performance of recent state-of-the-art methods on MultiWOZ 2.1. We find that the performance of our simple GPT2-based end-task model trained using our simulated conversations is comparable to recent state-of-the-art models such as SimpleTOD Hosseini-Asl et al. (2020) and MogNet Pei et al. (2020) when they use oracle belief states. However, when using generated belief states, we notice the performance drop in our end-task model is larger as compared to other models. We hypothesize that this may be because all other models also use dialog acts in their input, which are useful features for generating responses. Further, due to the dependence of these models on dialog acts, we were unable to demonstrate their performance using our simulated data for augmentation. We note, however, that in future work our simulators could also be extended to generate dialog acts, similar to our belief-state generators.
Qualitative Study - Response Selector
Figure 4 shows an incorrect response generated by greedy decoding. While the user was asking for information about a particular hotel named Bridge Guest House, the greedy response failed to provide the correct information. The response selector however, is able to choose from a wider set of responses generated via nucleus sampling to return the correct response.
Qualitative Study - Goal Perturbation
We now present a qualitative study demonstrating how our simulator is able to accommodate changes to a goal and reflect them in a conversation. Figure 5 shows the dialogs generated from an original goal in MultiWOZ and from a goal created by perturbing the original. The generated dialogs demonstrate the robustness of our generator model, which is able to produce new and meaningful conversations using new entities from the perturbed goal. Further, the generated dialogues are very different from each other, which shows the wide variety of conversations the simulators are capable of producing when provided with similar goals.
In this paper, we demonstrated a dialog generation framework that mimics the data creation process employed by crowd-sourced workers. We find that our method is able to generate meaningful conversations that aid the training of end-task dialog models in both low-resource and full data settings. The use of additional simulated data to train end-task dialog models results in a performance improvement of 18-25% in low-resource settings, and when combined with the full training data, we find that the performance of a simple GPT2-based end-task model becomes comparable to current state-of-the-art models. The simulation framework does not make strict assumptions about the domain or dataset, and it would be interesting to explore its use in other dialogue tasks such as Persona-Chat Zhang et al. (2018) in future work.
- Do not have enough data? Deep learning to the rescue! In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, pp. 7383-7390.
- Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- Language models are few-shot learners.
- MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5016-5026.
- Taskmaster-1: toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Large scale online learning of image similarity through ranking. Journal of Machine Learning Research.
- Data augmentation for neural online chats response selection. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, Brussels, Belgium, pp. 52-58.
- Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 207-219.
- Paraphrase augmented task-oriented dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- SAMSum corpus: a human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization.
- Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84-92.
- The curious case of neural text degeneration. In International Conference on Learning Representations.
- A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.
- Sequence-to-sequence data augmentation for dialogue language understanding. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1234-1245.
- Controlled text generation for data augmentation in intelligent artificial agents. In Proceedings of the 3rd Workshop on Neural Generation and Translation.
- Structured fusion networks for dialog. CoRR abs/1907.10016.
- Retrospective and prospective mixture-of-generators for task-oriented dialogue response generation. In 24th European Conference on Artificial Intelligence.
- Language models are unsupervised multitask learners.
- Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence 34(05), pp. 8689-8696.
- Data augmentation for training dialog models robust to speech recognition errors. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI.
- HuggingFace's transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
- Variational hierarchical dialog autoencoder for dialog state tracking data augmentation.
- Data augmentation for spoken language understanding via joint variational generation. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 7402-7409.
- MIDAS: a dialog act annotation scheme for open domain human machine spoken conversations.
- Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204-2213.
- Personalizing dialogue agents: I have a dog, do you have pets too? CoRR abs/1801.07243.
- Task-oriented dialog systems that consider multiple appropriate responses under the same context. Proceedings of the AAAI Conference on Artificial Intelligence 34(05), pp. 9604-9611.