However, the major data challenge for both formalisms remains the same: the lack of annotated dialogue datasets for specific tasks or domains. Various slots and values in dialogue utterances need to be manually labeled for use in supervised learning. As the process of manual annotation is time-consuming and expensive, publicly available task-oriented dialogue datasets normally contain only a few thousand dialogues. For data-driven dialogue systems, especially neural dialogue systems which are more data-hungry, insufficient training data substantially limits their power to learn from data, resulting in poor robustness and performance.
In this paper, we are interested in handling this data scarcity problem via automatic and cheap data augmentation methods. We propose four different data augmentation approaches: synonym substitution and stop-word deletion at the word level, and translation and paraphrasing at the sentence level. We apply these approaches only to rephrase user utterances in the training data, keeping machine utterances intact. For user utterances, we leave slots and their corresponding values unchanged and reword the remaining parts, keeping the meanings of user utterances as close to the originals as possible. In doing so, we hope to diversify user utterances so that our dialogue system can learn to deal with language variability in a robust way.
We use TSCP, a recently proposed end-to-end dialogue system, to validate the efficacy of our methods. We conduct experiments on two public datasets, CamRest676 and KVRET. The combination of the four data augmentation methods collectively outperforms the basic TSCP model by 4.5 points in terms of F1 score, and the TSCP model with reinforcement learning (RL) by 2.5 points, on the CamRest676 dataset. Higher improvements are achieved on the KVRET dataset: 7.8 points and 4.1 points in F1 over the basic TSCP model and TSCP+RL, respectively.
The contributions of the paper are threefold:
First, we present and empirically investigate four different approaches to data augmentation for end-to-end task-oriented dialogue, which, to the best of our knowledge, is the first attempt at automatic data augmentation for task-oriented dialogue.
Second, we achieve the state-of-the-art performance on the two datasets with the proposed methods.
Third, our analyses further show that data augmentation on user utterances is better than augmentation on machine utterances. Details on how the proposed methods improve performance are also provided.
II Background: End-to-End Task-Oriented Dialogue
Task-oriented dialogue systems that can be trained end-to-end have been studied in recent years as alternatives to traditional pipeline-style dialogue systems. Without loss of generality, we use Sequicity as our baseline system to evaluate our data augmentation methods. It significantly outperforms state-of-the-art pipeline-based methods and obtains a satisfactory entity match rate on out-of-vocabulary (OOV) cases where pipeline-designed competitors totally fail. Sequicity handles both task completion and response generation in a single seq2seq model which can be further optimized with reinforcement learning. It provides a theoretically and aesthetically appealing framework, achieving true end-to-end trainability with one single seq2seq model. The key concept introduced in Sequicity is the belief span (bspan), a text span that tracks the dialogue belief state at each turn.
Based on this concept, Sequicity decomposes the task-oriented dialogue problem into the generation of bspans and machine responses in a seq2seq framework. Specifically, it decodes in two stages. In the first stage, it generates a bspan to facilitate knowledge base (KB) retrieval. It then generates a machine utterance in the second stage, conditioned on the KB search result and the bspan from the first stage. Our work is based on an implementation of Sequicity as a two-stage CopyNet (TSCP). In this implementation, CopyNet is used to instantiate Sequicity, allowing keywords from previous utterances to recur in bspans and generated machine responses.
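The two-stage decoding described above can be sketched as a single turn-level loop. This is a minimal illustration, not the TSCP implementation; the model and KB interfaces (`encode`, `decode_bspan`, `decode_response`, `search`) are hypothetical names standing in for the actual CopyNet-based components.

```python
# Minimal sketch of Sequicity-style two-stage decoding for one turn.
# Component names are hypothetical stand-ins for the real TSCP modules.

def dialogue_turn(model, kb, prev_bspan, prev_response, user_utterance):
    """One dialogue turn: decode a belief span, query the KB, then decode
    the machine response conditioned on both."""
    # Encode the previous bspan, previous machine response and the new
    # user utterance into a shared context representation.
    context = model.encode(prev_bspan, prev_response, user_utterance)

    # Stage 1: decode the new belief span, which accumulates the slot
    # values mentioned so far and is used as the KB query.
    bspan = model.decode_bspan(context)
    kb_result = kb.search(bspan)

    # Stage 2: decode the machine response, conditioned on the context,
    # the new bspan, and the KB search result.
    response = model.decode_response(context, bspan, kb_result)
    return bspan, response
```

The returned bspan and response are fed back in as `prev_bspan` and `prev_response` on the next turn, which is how the belief state is carried through the dialogue.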
III Data Augmentation Approaches
In this section, we elaborate on the four data augmentation approaches at both the word and sentence level.
III-A Word-Level Data Augmentation
We substitute words with their synonyms and delete stop words so as to produce diversity in user utterances at the word level.
In synonym substitution, we first utilize the NLTK toolkit and WordNet [17, 3] to conduct part-of-speech tagging and synonym retrieval respectively. In order to ensure that the meaning of user utterances does not change, we only allow some specific words to be replaced by their synonyms. Proper nouns (e.g., Africa, America), qualifiers (e.g., the, a, some, most, every, no), personal pronouns (e.g., hers, herself, him, himself), and modal verbs (e.g., can, cannot, could, couldn't) are not replaced, as substituting them can easily result in inconsistent statements or even semantic changes. For notional verbs (e.g., want, like, tell, find), adjectives (e.g., cheap, great, delicious) and nouns (e.g., food, restaurant, area, south), we look up their synonyms in WordNet and select the candidate synonyms whose part-of-speech tags are consistent with the corresponding words. For each user utterance, we randomly sample one word that satisfies our substitution rules and randomly select a synonym candidate to replace it. In this way, multiple user utterances can be randomly generated for each original utterance in the training data. These generated utterances are added to the training data to increase diversity at the word level.
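The substitution procedure above can be sketched as follows. This is an illustrative sketch, not the paper's code: the POS tagger and synonym lookup are injected as callables (in the paper these roles are played by NLTK's tagger and WordNet), and the tag set shown is an assumed subset of Penn Treebank tags for notional verbs, adjectives and nouns.

```python
import random

# Penn-Treebank-style tags assumed eligible for substitution
# (notional verbs, adjectives, nouns).
CONTENT_TAGS = {"VB", "VBP", "VBZ", "JJ", "NN", "NNS"}

def substitute_synonym(tokens, slot_values, tag, synonyms, rng=random):
    """Return a copy of `tokens` with one eligible word replaced by a
    randomly chosen POS-consistent synonym.  `tag` maps a token list to
    (word, pos) pairs; `synonyms(word, pos)` returns candidate synonyms
    with the same POS.  Slot values are never touched."""
    tagged = tag(tokens)
    candidates = [
        (i, word, pos) for i, (word, pos) in enumerate(tagged)
        if pos in CONTENT_TAGS          # only content words
        and word not in slot_values     # keep slot values intact
        and synonyms(word, pos)         # must have a POS-consistent synonym
    ]
    if not candidates:
        return list(tokens)  # nothing replaceable in this utterance
    i, word, pos = rng.choice(candidates)
    new_tokens = list(tokens)
    new_tokens[i] = rng.choice(synonyms(word, pos))
    return new_tokens
```

Calling this several times per utterance with different random seeds yields the multiple word-level variants described above.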
Similarly, we can obtain variety by deleting stop words in user utterances without changing their meaning. It is common for users to omit stop words such as articles, prepositions, adverbs and conjunctions. In order to improve the robustness of the task-oriented dialogue system, and to enable it to pay more attention to the key semantic information in user utterances, we propose to discard these high-frequency stop words in user utterances.
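Stop-word deletion is the simplest of the four methods and can be sketched in a few lines. The stop-word set shown is a small illustrative stand-in for the NLTK stop-word list the paper uses, and slot values are again preserved.

```python
# Illustrative subset of a stop-word list (the paper uses NLTK's list).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "please", "so"}

def delete_stop_words(tokens, slot_values):
    """Drop stop words from a user utterance while keeping slot values."""
    return [t for t in tokens if t.lower() not in STOP_WORDS or t in slot_values]
```

Each utterance yields exactly one deletion variant, which is why this method contributes only one extra copy of the training data (see the implementation details below).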
III-B Sentence-Level Data Augmentation
For data augmentation at the sentence level, we investigate two different approaches: translation and paraphrasing. These two methods increase sentence-level variation, not limited to the presence/absence or choice of specific words.
We use neural machine translation (NMT) models to translate user utterances into other languages and then use reversed NMT systems to translate the generated translations from other languages back to the original language. In this paper, we use Google online translation engine as our NMT translation system.
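The round-trip translation can be sketched as below. To avoid tying the sketch to a particular API client, `translate` is any callable `translate(text, src, tgt) -> text`; in the paper this role is played by the Google online translation engine. The pivot-language codes mirror the four languages listed in the implementation details.

```python
# Pivot languages used in the paper: Chinese, Japanese, French, German.
PIVOT_LANGUAGES = ["zh", "ja", "fr", "de"]

def back_translate(utterance, translate, pivots=PIVOT_LANGUAGES, src="en"):
    """Return one paraphrase of `utterance` per pivot language by
    translating out of and back into the source language."""
    variants = []
    for tgt in pivots:
        pivoted = translate(utterance, src, tgt)   # en -> pivot
        variants.append(translate(pivoted, tgt, src))  # pivot -> en
    return variants
```

With four pivot languages, each user utterance produces four back-translated variants, i.e. the four "sets of data expressed in different styles" described below.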
For sentence-level paraphrasing, we use a seq2seq paraphrase model which contains a bidirectional LSTM encoder and an LSTM decoder together with an attention network (https://github.com/vsuthichai/paraphraser). The model is trained on a mixed dataset consisting of paraphrases from ParaNMT-50M, Quora question pairs, SNLI and SemEval [7, 24]. In decoding, we can either use greedy search to generate a single paraphrase for each user utterance, or generate many different paraphrases by sampling from the output distribution.
III-C Implementation Details for the Four Data Augmentation Approaches
Synonym substitution: we created four different utterances for each user utterance by randomly replacing words with their synonyms. The created data was combined with the original training data, so the augmented data was 5 times as large as the original training data.
Stop-word deletion: for this augmentation, we utilized the dictionary of stop words from the NLTK toolkit, created one copy for each user utterance, and combined the additional copy with the original data.
Translation: user utterances in original English version data were translated into Chinese, Japanese, French, German via Google Translate, and then translated back to English, thus forming four sets of data expressed in different styles.
Paraphrasing: we generated four sets of dialogue data with the seq2seq-based paraphrase generator.
Assembled Augmentation: we combined all data generated by the four methods above. Together, the size of the assembly augmented data is 14 times as large as that of the original data.
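The 14x figure follows directly from adding up the copies listed above: the original data plus 4 synonym-substitution copies, 1 stop-word-deletion copy, 4 back-translation copies and 4 paraphrase copies. A tiny sketch of the bookkeeping:

```python
# Number of copies of the training dialogues contributed by each source,
# per the implementation details above.
COPIES = {
    "original": 1,
    "synonym_substitution": 4,
    "stop_word_deletion": 1,
    "translation": 4,
    "paraphrasing": 4,
}

def assembled_size(n_dialogues):
    """Total number of dialogues after assembling all augmented copies."""
    return n_dialogues * sum(COPIES.values())
```

For CamRest676's 676 dialogues, the assembled training set thus contains 676 x 14 = 9464 dialogues.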
The sizes of mini-batch and vocabulary for each data augmentation approach on the two datasets are shown in Table I; they were chosen according to performance on the development set.
| Model | CamRest676 | KVRET |
|---|---|---|
| *Results from prior work* | | |
| TSCP + RL | 0.854 | 0.811 |
| TSCP + RL | 0.858 | 0.831 |
| *Results obtained by data augmentation* | | |
| Machine Utterance Augmentation (synonym substitution) | 0.775 | - |
| User + Machine Utterance Augmentation (translation) | 0.822 | - |
IV Experiments and Analyses
We conducted extensive experiments and analyses on the two datasets to validate the effectiveness of the proposed methods.
IV-A Datasets and Settings
We used two datasets: CamRest676 [22, 21, 23] and KVRET, both of which were manually created by crowd-sourcing workers on the Amazon Mechanical Turk platform using a Wizard-of-Oz method. CamRest676 contains 676 dialogues in the restaurant search domain while KVRET covers three domains: calendar scheduling, weather information and point of interest (POI) navigation.
For TSCP, the dimensionality of both hidden states and word embeddings was set to 50. The vocabulary size was 800 for CamRest676 and 1400 for KVRET. The mini-batch size for both datasets was set to 32. The model was trained with the Adam optimizer, with a learning rate of 0.003 and a decay parameter λ of 0.5. We used a learning rate of 0.0001 and a decay of 0.8 for the subsequent reinforcement learning process. We used a beam search strategy with a beam size of 10 on CamRest676 and a greedy search strategy on KVRET. Early stopping was also performed to improve training efficiency.
We used the Success F1 score as the automatic metric for dialogue evaluation. The Success F1 score measures the precision and recall of requested slots correctly answered in machine responses [16].
Table II shows the experiment results on the two datasets, from which we have three findings. First, the results demonstrate that all the proposed data augmentation methods contribute to significant improvements in F1 over the basic TSCP model. Except for the stop-word deletion method, all other methods perform better than the RL-enhanced TSCP. Second, the sentence-level augmentation methods are better than the word-level methods in most cases, as the former provide more variation in user utterances. Third, the assembled augmentation, which combines all data generated by the four data augmentation methods, achieves new state-of-the-art performance on the two datasets, more than 2 points higher than the RL-enhanced TSCP model in terms of F1 score.
IV-C Effect of Augmentation on Machine Utterances
At each turn in a dialogue from the two datasets, a user utterance triggers some special requests and a machine response utterance provides answers to these requests. In our previous experiments, we performed data augmentation only on user utterances. In order to study the effect of data augmentation on machine utterances, we further carried out two experiments. One is to generate both user and machine utterances with the translation augmentation method. The other is to create copies only for machine utterances with synonym substitution. Both experiments were carried out on the CamRest676 dataset.
Results are displayed at the bottom of Table II. Clearly, machine utterance augmentation seriously deteriorates performance. The reason may be that data augmentation introduces both variance and noise. Variance and noise in user utterances can prevent the system from over-sensitivity, thus making the system more robust. However, variance and noise in machine utterances distract the system. This resonates with back-translation, widely used for seq2seq-based neural machine translation, which pairs real target sentences with translated source sentences.
IV-D Analysis on Success F1
We took a deep look into the data to investigate how the proposed data augmentation methods improve the Success F1 score, which computes both the precision and recall of requested slots being correctly answered.
The precision and recall in F1 can be formulated as follows:

precision = TP / (TP + FP),  recall = TP / (TP + FN),

where TP denotes the number of requested slots that are correctly predicted and do exist in real machine responses, FN the number of slots that exist in real responses but are not answered at all, and FP the number of slots that are predicted but not present in real responses.
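Treating the predicted and gold requested slots as sets, the metric can be computed directly from these definitions. This is a sketch from the formulas above, not the paper's evaluation script:

```python
def success_f1(predicted_slots, gold_slots):
    """Success F1 over requested slots, with TP/FP/FN as defined above.
    Both arguments are sets of slot names."""
    tp = len(predicted_slots & gold_slots)   # predicted and present in real responses
    fp = len(predicted_slots - gold_slots)   # predicted but not present
    fn = len(gold_slots - predicted_slots)   # present but not answered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {phone, address} when the real responses answer {phone, postcode} gives TP = 1, FP = 1, FN = 1, so precision = recall = F1 = 0.5.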
We provide the values of precision, recall, TP, FN and FP in Table III for the assembled augmentation. Our method significantly improves the recall by nearly 9 points while keeping the precision basically the same as the baseline. The reason behind the improvement in recall is that the proposed methods substantially increase TP and decrease FN. This is because the diversity in user utterances created by data augmentation helps the dialogue system recognize more requested slots and further allows the decoder to answer these slots in machine responses. Without data augmentation, some slots are simply not detected at all by the baseline (hence its higher FN).
IV-E Dialogue Samples
Table IV shows some dialogue examples generated by the model with and without data augmentation. The dialogues on the left side of the table are generated by the baseline model, while those on the right side are generated by the model with assembled data augmentation. Clearly, the model trained with our data augmentation understands user utterances more robustly and produces more appropriate machine responses.
V Related Work
Data augmentation has achieved great success in various tasks including computer vision, speech recognition and text classification, but has been explored only in a very limited way for the natural language understanding (NLU) module of traditional pipeline systems for task-oriented dialogue. One study proposes to augment data for the NLU module by adding noise to a single user utterance without considering its relation with other utterances. Another introduces a technique to expand the limited in-domain data for a new spoken language understanding task. A third proposes a data-augmentation framework to model relations between utterances of the same semantic frame in the training data. Other researchers present methods for gathering dialogue data through crowd-sourcing, e.g., via self-dialogues or MultiWOZ. Different from our methods, these approaches either focus solely on the NLU module or rely on expensive human effort.
VI Conclusion and Future Work
In this paper, we have presented four effective data augmentation methods for end-to-end task-oriented dialogue systems, at both the word and sentence level. Empirical study on two public datasets, CamRest676 and KVRET, shows that data augmentation can prevent the dialogue system from omitting key information in user utterances and significantly improve the F1 score by effectively alleviating the problem of data scarcity.
In the future, we intend to apply our data augmentation methods on more datasets and to explore some other efficient ways to increase the diversity of machine responses as well.
The present research was supported by the National Natural Science Foundation of China (Grant No. 61622209 and 61861130364). We would like to thank the three anonymous reviewers for their insightful comments.
- (2004) NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, pp. 31.
- (2018) MultiWOZ: a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026.
- (2015) A robust spoken Q&A system with scarce in-domain resources. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 47–53.
- (2017) Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 37–49.
- (2017) A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 468–473.
- (2018) Talking to myself: self-dialogues as data for conversational agents. arXiv preprint arXiv:1809.06641.
- (2017) Learning paraphrastic sentence embeddings from back-translated bitext. In Proceedings of Empirical Methods in Natural Language Processing.
- (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1631–1640.
- (2014) Deep Speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- (2018) Sequence-to-sequence data augmentation for dialogue language understanding. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1234–1245.
- (2018) Automatic data expansion for customer-care spoken language understanding. arXiv preprint arXiv:1810.00670.
- (1984) An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2(1), pp. 26–41.
- (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2016) Labeled data generation with encoder-decoder LSTM for semantic slot filling. In INTERSPEECH, pp. 725–729.
- (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1437–1447.
- (1990) Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3(4), pp. 235–244.
- (2018) Adversarial over-sensitivity and over-stability strategies for dialogue models. arXiv preprint arXiv:1809.02079.
- (1999) Creating natural dialogs in the Carnegie Mellon Communicator system. In Sixth European Conference on Speech Communication and Technology.
- (2016) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 86–96.
- (2016) Conditional generation and snapshot learning in neural dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2153–2162.
- (2017) Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3732–3741.
- (2016) A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 438–449.
- (2018) ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pp. 451–462.
- (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.
- (2000) JUPITER: a telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing 8(1), pp. 85–96.
- (2000) Conversational interfaces: advances and challenges. Proceedings of the IEEE 88(8), pp. 1166–1180.