Dialog State Tracking with Reinforced Data Augmentation

by   Yichun Yin, et al.
HUAWEI Technologies Co., Ltd.

Neural dialog state trackers are generally limited by the lack of quantity and diversity of annotated training data. In this paper, we address this difficulty by proposing a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the neural state tracker. Specifically, we introduce a novel contextual bandit generator to learn fine-grained augmentation policies that can generate new effective instances by choosing suitable replacements for the specific context. Moreover, by alternately learning between the generator and the state tracker, we can keep refining the generative policies to generate more high-quality training data for the neural state tracker. Experimental results on the WoZ and MultiWoZ (restaurant) datasets demonstrate that the proposed framework significantly improves the performance over the state-of-the-art models, especially with limited training data.




1 Introduction

With the increasing popularity of intelligent assistants such as Alexa, Siri and Google Duplex, research on spoken dialog systems has gained a great deal of attention in recent years gao2018neural. Dialog state tracking (DST) williams2013dialog is an essential component of most spoken dialog systems, aiming to track the user’s goal at each step in a dialog. Based on that, the dialog agent decides how to converse with the user. In a slot-based dialog system, the dialog states are typically formulated as a set of slot-value pairs, and one concrete example is as follows:

User: Grandma wants Italian, any suggestions?
State: inform(food=Italian)
Agent: Would you prefer south or center?
User: It doesn’t matter. Whichever is less expensive.
State: inform(food=Italian, price=cheap, area=don’t care)
Figure 1: An overview of our framework. Given a dataset, we induce new instances using the RL-based Generator to improve the DST Tracker. The Generator is trained with the rewards from the Tracker. The learning process is performed in an alternate manner.

The state-of-the-art models for DST are based on neural networks henderson2014word; mrkvsic2017neural; zhong2018global; D18-1299; sharma2019improving. They typically predict the probabilities of the candidate slot-value pairs with the user utterance, previous system actions or other external information as inputs, and then determine the final value of each slot based on the probabilities. Although the neural network based methods are promising with advanced deep learning techniques such as gating and self-attention mechanisms lin2017structured; vaswani2017attention, their data-hungry nature makes it difficult for them to generalize well to scenarios with limited or sparse training data.

To alleviate the data sparsity in DST, we propose a reinforced data augmentation (RDA) framework to increase both the amount and diversity of the training data. The RDA learns to generate high-quality labeled instances, which are used to re-train the neural state trackers to achieve better performances. As shown in Figure 1, the RDA consists of two primary modules: Generator and Tracker. The two learnable modules alternately learn from each other during the training process. On one hand, the Generator module is responsible for generating new instances based on a parameterized generative policy, which is trained with the rewards from the Tracker module. The Tracker, on the other hand, is refined via the newly generated instances from the Generator.

Data augmentation perturbs the original dataset without actually collecting new data. It has been widely used in computer vision krizhevsky2012imagenet; cubuk2018autoaugment and speech recognition ko2015audio, but its use in natural language processing is relatively limited kobayashi2018contextual. The reason is that, in contrast to image augmentation (e.g., rotating or flipping images), it is significantly more difficult to augment text because it requires preserving the semantics and fluency of newly augmented data. In this paper, to derive a more general and effective policy for text data augmentation, we adopt a coarse-to-fine strategy to model the generation process. Specifically, we initially use some coarse-grained methods to get candidates (such as cost effective, affordable and too expensive in Figure 1), some of which are inevitably noisy or unreliable for the specific sentence context. We then adopt RL to learn the policies for selecting high-quality candidates to generate new instances, where the total rewards are obtained from the Tracker. After learning the Generator, we use it to induce more training data to re-train the Tracker. Accordingly, the Tracker will further provide more reliable rewards to the Generator. With alternate learning, we can progressively improve the generative policies for data augmentation and at the same time learn a better Tracker with the augmented data.

To demonstrate the effectiveness of the proposed RDA framework in DST, we conduct extensive experiments with the WoZ wen2017network and MultiWoZ (restaurant) budzianowski2018multiwoz datasets. The results show that our model consistently outperforms the strong baselines and achieves new state-of-the-art results. In addition, the effects of the hyper-parameter choice on performance are analyzed and case studies on the policy network are performed.

The main contributions of this paper include:

  • We propose a novel framework of data augmentation for dialog state tracking, which can generate high-quality labeled data to improve neural state trackers.

  • We use RL for the Generator to produce effective text augmentation.

  • We demonstrate the effectiveness of the proposed framework on two datasets, showing that the RDA can consistently boost the state-tracking performance and obtain new state-of-the-art results.

2 Reinforced Data Augmentation

We elaborate on our framework in three parts: the Tracker module, the Generator module, and the alternate learning algorithm.

2.1 Tracker Module

The dialog state tracker aims to track the user’s goal during the dialog process. At each turn, given the user utterance and the system action/response (if the system actions do not exist in the dataset, we use the system response as the input), the trackers first estimate the probabilities of the candidate slot-value pairs (for each slot, a none value is added as one candidate slot-value pair), and then the pair with the maximum probability for each slot is chosen as the final prediction. To obtain the dialog state of the current turn, trackers typically use the newly predicted slot-values to update the corresponding values in the state of the previous turn. One concrete example of the Tracker module is illustrated in Figure 2.
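The per-turn prediction and state update described above can be illustrated with a minimal Python sketch. It is not the paper's tracker: the function names, the `"none"` and `"dontcare"` value strings, the probabilities, and the rule that a predicted none value leaves the previous slot value unchanged are all assumptions made for illustration.

```python
from typing import Dict

NONE_VALUE = "none"  # assumed name for the extra "no value" candidate of each slot


def predict_turn(slot_probs: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each slot, pick the candidate value with the highest probability."""
    return {slot: max(values, key=values.get) for slot, values in slot_probs.items()}


def update_state(prev_state: Dict[str, str], turn_pred: Dict[str, str]) -> Dict[str, str]:
    """Overwrite previous-turn values with the new predictions.

    Assumption: a predicted "none" means the slot was not mentioned this turn,
    so the previous value is kept.
    """
    new_state = dict(prev_state)
    for slot, value in turn_pred.items():
        if value != NONE_VALUE:
            new_state[slot] = value
    return new_state


# Hypothetical probabilities for the second user turn of the example dialog.
prev = {"food": "italian"}
probs = {
    "food": {"italian": 0.10, "cuban": 0.05, NONE_VALUE: 0.85},
    "price": {"cheap": 0.90, "expensive": 0.04, NONE_VALUE: 0.06},
    "area": {"south": 0.10, "dontcare": 0.80, NONE_VALUE: 0.10},
}
state = update_state(prev, predict_turn(probs))
print(state)
```

In this toy turn the food slot keeps its previous value because none is the most probable candidate, while price and area are filled from the new predictions.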

Our RDA framework is generic and can be applied to different types of tracker models. To demonstrate its effectiveness, we experiment with two different trackers: a state-of-the-art GLAD-based model and the classical NBT-CNN (Neural Belief Tracking - Convolutional Neural Networks) model mrkvsic2017neural. The GLAD-based model is built on the recently proposed GLAD (Global-Locally Self-Attentive Dialogue State Tracker) zhong2018global by modifying the parameter sharing and attention mechanisms. Due to limited space, we detail the model in the Supplementary material. We use the Tracker to refer to either the GLAD-based model or NBT-CNN in the following sections.

2.2 Generator Module

We formulate data augmentation as an optimal text span replacement problem in a labeled sentence. Specifically, given the tuple of a sentence $x$, its label $y$, and the text span $p$ of the sentence, the Generator aims to generate a new training instance $(x', y')$ by substituting $p$ in $x$ with an optimal candidate $p'$ from a set of candidates for $p$, which we denote as $C_p$.

In the span-based data augmentation, we can replace the text span with its paraphrases derived either from existing paraphrase databases or neural paraphrase generation models e.g. zhao2009application; D18-1421. However, directly applying the coarse-grained approach can introduce ineffective or noisy instances for training, and eventually hurt the performance of trackers. Therefore, we train the Generator to learn fine-grained generation policies to further improve the quality of the augmented data.

Generation Process. The problem of high-quality data generation is modeled as a contextual bandit (or one-step reinforcement learning) dudik2011efficient. Formally, at each trial of a contextual bandit, the context, including the sentence $x$ and its text span $p$, is sampled and shown to the agent; the agent then selects a candidate $p'$ from the candidate set $C_p$ to generate a new instance $x'$ by replacing $p$ with $p'$.
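A single trial of this replacement process can be sketched in Python as follows. The helper names `find_span` and `generate_instance`, the tokenization by whitespace, and the example label are hypothetical; the sketch only covers paraphrase-style replacements, where the label is assumed to stay unchanged.

```python
from typing import Dict, List, Tuple


def find_span(tokens: List[str], span: List[str]) -> int:
    """Return the index of the first occurrence of `span` in `tokens`, or -1."""
    n = len(span)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == span:
            return i
    return -1


def generate_instance(
    tokens: List[str],
    label: Dict[str, str],
    span: List[str],
    candidate: List[str],
) -> Tuple[List[str], Dict[str, str]]:
    """Build a new instance by replacing `span` with `candidate`.

    Assumption: the candidate is a paraphrase, so the label is copied as-is.
    """
    start = find_span(tokens, span)
    if start < 0:
        raise ValueError("text span not found in sentence")
    new_tokens = tokens[:start] + candidate + tokens[start + len(span):]
    return new_tokens, dict(label)


x = "what restaurants are on the east side that are not overpriced ?".split()
y = {"area": "east"}  # hypothetical label, for illustration only
new_x, new_y = generate_instance(x, y, ["overpriced"], ["too", "expensive"])
print(" ".join(new_x))
```

Which candidate to pass in is exactly the decision the bandit agent has to make; here it is fixed by hand.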

Figure 2: The Tracker module. (1) System action or response, and user utterance as input; (2) The tracker predicts the probabilities of all possible slot-value pairs; (3) The prediction and state of previous turn are used to update the state of the current turn.

Policy Learning. The policy $\pi_\theta(s, p')$ represents a probability distribution over the valid actions at the current trial, where the state vector $s$ is extracted from the sentence $x$, the text span $p$ and the candidate $p'$. The candidate set $C_p$ forms the action space of the agent given the state $s$, and the reward $R(s, p')$ is a scalar value function. The policy is learned to maximize the expected rewards:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ R(s, p') \right] \tag{1}$$

where the expectation is taken over state $s$ and action $p'$.

Figure 3: The algorithm flow of the reinforced data augmentation framework. The left is the Generator learning and the right is the Tracker learning. The two learning processes are performed in an alternate manner.

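The policy $\pi_\theta$ decides which $p' \in C_p$ to take based on the state $s$, which is formulated as:

$$s = [h_p;\, e_p;\, e_{p'}] \tag{2}$$

where $h_p$ is the contextual representation of $p$, which is derived from the hidden states in the encoder of the Tracker, and $e_p$ and $e_{p'}$ are the word embeddings of $p$ and $p'$ respectively. For multi-word phrases, we use the average of the word representations as the phrase representation. We use a two-layer fully connected network followed by a sigmoid to compute the score function $f(s)$ of $p$ being replaced by $p'$. As each $p$ has multiple choices of replacement in $C_p$, we normalize the scores and obtain the final probabilities for the alternative phrases:

$$\pi_\theta(s, p') = \frac{f(s)}{\sum_{\tilde{p} \in C_p} f(\tilde{s})} \tag{3}$$

where $\tilde{s}$ denotes the state built from $x$, $p$ and the candidate $\tilde{p}$.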
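The normalization of sigmoid scores into replacement probabilities can be sketched as below. This is a simplified illustration: the raw scores stand in for the output of the two-layer network (which is not modeled here), and the function names and numbers are assumptions.

```python
import math
import random
from typing import Dict


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def policy_probs(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Squash each raw score with a sigmoid, then normalize over all candidates."""
    squashed = {cand: sigmoid(z) for cand, z in raw_scores.items()}
    total = sum(squashed.values())
    return {cand: v / total for cand, v in squashed.items()}


def sample_candidate(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one replacement according to the policy probabilities."""
    candidates = list(probs)
    weights = [probs[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical pre-sigmoid network outputs for the span "overpriced".
raw = {"overpriced": 0.0, "too expensive": 2.0, "cheap enough": -2.0}
probs = policy_probs(raw)
choice = sample_candidate(probs, random.Random(0))
print(probs, choice)
```

Unlike a softmax, this normalization bounds each unnormalized score in (0, 1) before dividing, so no single candidate can dominate the distribution through an extreme raw score alone.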


The sampling-based policy gradient is used to approximate the gradient of the expected reward. To obtain more feedback and make the policy learning more stable, as illustrated in Figure 3, we propose a two-step sampling method: first, sample a bag of sentences $B = \{(x_i, y_i, p_i)\}_{i=1}^{T}$, then iteratively sample a candidate $p'_i$ for each instance in $B$ according to the current policy, obtaining a new bag of instances $B' = \{(x'_i, y'_i)\}_{i=1}^{T}$. After running the bag-level sampling $N$ times, the gradient of the objective function can be estimated as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{T} \nabla_\theta \log \pi_\theta(s_{ij}, p'_{ij})\, R_{ij} \tag{4}$$

where $s_{ij}$ and $p'_{ij}$ denote the state and action of the $i$-th instance-level sampling from the $j$-th bag-level sampling, respectively, and $R_{ij}$ is the corresponding reward.

Reward Design. One key problem is assigning suitable rewards to the various actions $p'$ given state $s$. We design two kinds of rewards in reinforcement learning: a bag-level reward and an instance-level reward. The bag-level reward feng2018reinforcement; qin2018robust indicates whether the newly sampled bag helps to improve the Tracker, and the instances in the same bag receive the same reward value. The instance-level reward, in contrast, assigns different reward values to the instances in the same sampled bag by checking whether an instance causes the Tracker to make an incorrect prediction kang18acl; ribeiro-etal-2018-semantically. We sum the two kinds of rewards as the final reward, $R_{ij} = R^B_j + R^I_{ij}$, for more reliable policy learning.

Bag-level reward $R^B_j$: we re-train the Tracker with each sampled bag and use its performance (e.g., joint goal accuracy henderson2014second) on the validation set to indicate its reward. Suppose the performance of the $j$-th bag is denoted as $U'_j$; the bag-level rewards are formulated as:

$$R^B_j = \frac{2\,\left(U'_j - \min(U')\right)}{\max(U') - \min(U')} - 1 \tag{5}$$

where $U'$ refers to the set $\{U'_j\}_{j=1}^{N}$. Here we scale the value to be bounded in the range of [-1, 1] to alleviate the instability in RL training. (Footnote 3: In this work, the original text span $p$ is also used as one candidate in $C_p$, which acts as an implicit baseline sutton2018reinforcement in RL training.)
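The scaling of bag performances to the range [-1, 1] can be sketched as a min-max normalization. This is an assumed reading of the scaling step; the function name `bag_rewards`, the example accuracies, and the handling of the degenerate case where all bags perform equally are illustrative choices, not details from the paper.

```python
from typing import List


def bag_rewards(performances: List[float]) -> List[float]:
    """Min-max scale validation performances of sampled bags to [-1, 1]."""
    lo, hi = min(performances), max(performances)
    if hi == lo:
        # Assumption: identical performances carry no signal, so reward is 0.
        return [0.0] * len(performances)
    return [2.0 * (u - lo) / (hi - lo) - 1.0 for u in performances]


# Hypothetical joint goal accuracies after re-training on three sampled bags.
rewards = bag_rewards([0.80, 0.85, 0.90])
print(rewards)
```

Under this scaling the worst bag always receives -1 and the best bag +1, so with only two sampled bags the rewards reduce to a pairwise comparison.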

Instance-level reward $R^I_{ij}$: we evaluate each generated instance in the bag and denote an instance that causes the Tracker to make a wrong prediction as a large-loss instance (LI) han2018co. Compared to the non-LIs, the LIs are more informative and can induce larger loss for training the Tracker. Thus, in the design of instance-level rewards, an LI is encouraged more when its corresponding bag reward is positive, and punished more when its bag reward is negative. Specifically, the instance-level reward is defined in terms of $\mathbb{I}(x'_{ij})$, an indicator function of $x'_{ij}$ being an LI, and a hyper-parameter $c$. We obtain the value of $\mathbb{I}(x'_{ij})$ by checking whether the pre-trained Tracker can correctly predict the label on the generated example. $c$ is set to 0.5 by running a grid search over the validation set.

Input: Pre-trained Tracker with parameters $\phi$; the randomly initialized Generator with parameters $\theta$
Output: Re-trained Tracker
for each alternate-learning epoch $k = 1, \dots, K$ do
     Re-initialize the Generator with random $\theta$
     for each Generator-learning epoch $m = 1, \dots, M$ do
         Re-initialize the Tracker with $\phi$
         Sample a bag $B$
         for $j = 1, \dots, N$ do
              Sample a new bag $B'_j$
         end for
         Compute bag reward $R^B_j$ with Eq. 5
         Compute instance reward $R^I_{ij}$ with Eq. 6
         Update $\theta$ by the gradients in Eq. 4
     end for
     Obtain new data $D'$ by the Generator
     Re-train the Tracker on $D \cup D'$, update $\phi$
end for
Save the Tracker with the $\phi$ which performs best on the validation set among the $K$ epochs
Algorithm 1 The Reinforced Data Augmentation

2.3 Alternate Learning

In the framework of RDA, the learning of Generator and Tracker is conducted in an alternate manner, which is detailed in Algorithm 1 and Figure 3.

The text spans to be replaced are unevenly distributed in the training set. To make learning more efficient, we first sample one text span $p$, then sample one sentence $x$ from the sentences containing $p$. This process is repeated to obtain a bag $B$. To learn the Generator, we generate bags of instances by running the policy, compute their rewards and update the policy network via the policy gradient method. To learn the Tracker, we augment the training data with the updated policy. In particular, for each $(x, y, p)$, we generate a new instance $(x', y')$ by sampling based on the learned policies. To further reduce the effect of noisy augmented instances, we remove the new instance if its $p'$ has the minimum probability among $C_p$. We randomly initialize the policy at each epoch so that the Generator adaptively learns which policy is best for the current Tracker. The alternate learning is performed for multiple rounds and the Tracker with the best performance on the validation set is saved.
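The span-first sampling and the minimum-probability filter described above can be sketched as follows. The data layout (a dictionary from span to the sentences containing it), the uniform choice over spans, and the function names are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def sample_bag(
    span_to_sentences: Dict[str, List[str]],
    bag_size: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Sample a text span first, then one sentence that contains it.

    Assumption: spans are drawn uniformly, so frequent spans do not dominate.
    """
    spans = sorted(span_to_sentences)
    bag = []
    for _ in range(bag_size):
        span = rng.choice(spans)
        sentence = rng.choice(span_to_sentences[span])
        bag.append((sentence, span))
    return bag


def keep_instance(chosen: str, probs: Dict[str, float]) -> bool:
    """Drop a generated instance whose candidate has the minimum policy probability."""
    return probs[chosen] > min(probs.values())


index = {
    "overpriced": ["what restaurants are not overpriced ?"],
    "south side part": [
        "what is a affordable restaurant in the south side part of town ?",
        "anything in the south side part ?",
    ],
}
bag = sample_bag(index, bag_size=5, rng=random.Random(0))
policy = {"south end": 0.6, "south side part": 0.3, "southern countries": 0.1}
print(bag[0], keep_instance("southern countries", policy))
```

With a strict comparison, a candidate set whose probabilities are all equal would have every instance filtered out; a real implementation would need to decide how to treat such ties.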

3 Experiment

In this section, we show the experimental results to demonstrate the effectiveness of our framework.

3.1 Dataset and Evaluation

We use WoZ wen2017network and MultiWoZ budzianowski2018multiwoz to evaluate the proposed framework on the task of dialog state tracking. (The DSTC2 mrkvsic2017neural dataset is not used because its clean version, http://mi.eng.cam.ac.uk/~nm480/dstc2-clean.zip, is no longer available.) Following budzianowski2018multiwoz, we extract the restaurant domain of MultiWoZ as the evaluation dataset, denoted as MultiWoZ (restaurant). Both WoZ and MultiWoZ (restaurant) are in the restaurant domain. In the experiment, we use the widely used joint goal accuracy henderson2014second as the evaluation metric, which measures whether all slot values of the updated dialog state exactly match the ground truth values at every turn.
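As a concrete reading of this metric, the sketch below counts a turn as correct only when the whole predicted state equals the gold state. The function name and the toy states are illustrative assumptions; details such as value normalization in the official evaluation scripts are not modeled.

```python
from typing import Dict, List


def joint_goal_accuracy(
    predicted: List[Dict[str, str]],
    gold: List[Dict[str, str]],
) -> float:
    """Fraction of turns whose predicted state exactly matches the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        raise ValueError("no turns to evaluate")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


gold_states = [
    {"food": "italian"},
    {"food": "italian", "price": "cheap", "area": "dontcare"},
    {"food": "italian", "price": "cheap", "area": "dontcare"},
]
pred_states = [
    {"food": "italian"},
    {"food": "italian", "price": "cheap"},  # one missing slot makes the turn wrong
    {"food": "italian", "price": "cheap", "area": "dontcare"},
]
acc = joint_goal_accuracy(pred_states, gold_states)
print(round(acc, 3))
```

Because a single wrong or missing slot invalidates the whole turn, this metric is stricter than per-slot accuracy.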

3.2 Implementation Details

We implement the proposed model using PyTorch (https://pytorch.org/). All hyper-parameters of our model are tuned based on the validation set. To demonstrate the robustness of our model, we use similar hyper-parameter settings for both datasets. Following the previous work D18-1299; zhong2018global; nouri2018toward, we concatenate the pre-trained GloVe embeddings pennington2014glove and the character embeddings D17-1206 as the final word embeddings and keep them fixed during training. The epoch number of the alternate learning $K$, the epoch number of the Generator learning $M$ and the sampling times $N$ for each bag are set to 5, 200 and 2 respectively. We set the dimensions of all hidden states to 200 in both the Tracker and the Generator, and set the head number of multi-head self-attention to 4 in the Tracker. All learnable parameters are optimized by the Adam optimizer kingma2015adam with a learning rate of 1e-3. The batch size is set to 16 in the Tracker learning, and the bag size in the Generator learning is set to 25.

To avoid over-fitting, we apply dropout to the layer of word embeddings with a rate of 0.2. We also assign rewards based on subsampled validation set with a ratio of 0.3 to avoid over-fitting the policy network on the validation set.

In our experiments, the newly augmented dataset is $n$ times the size of the original training data, where $n$ is set separately for WoZ and MultiWoZ. At each iteration, we randomly sample a subset of the augmented data to train the Tracker. The sampling ratios are 0.4 for WoZ and 0.3 for MultiWoZ.

For the coarse-grained data augmentation method, we tried a current neural paraphrase generation model. The preliminary experiment indicates that almost all generated sentences are not helpful for the task of DST. The reason is that most neural paraphrase generation models require an additional labeled paraphrase corpus, which may not always be available ray2018robust. In this work, we extract unigrams, bigrams and trigrams in the training data as the text spans in the generation process. After that, we retrieve the paraphrases for each text span from the PPDB database (http://paraphrase.org/) as the candidates. We also use the golden slot value in the sentence as a text span and the other values of the same slot as its candidates; in this case, the label is changed accordingly.
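The span extraction step can be sketched as below. The `paraphrase_table` dictionary is a hypothetical stand-in for a paraphrase lookup and does not reflect the actual PPDB file format or any PPDB API; the function names are also assumptions.

```python
from typing import Dict, List, Tuple

Span = Tuple[str, ...]


def extract_spans(tokens: List[str], max_n: int = 3) -> List[Span]:
    """Enumerate all unigrams, bigrams and trigrams of a tokenized sentence."""
    spans = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            spans.append(tuple(tokens[i:i + n]))
    return spans


def candidates_for(span: Span, paraphrase_table: Dict[Span, List[str]]) -> List[str]:
    """Look up coarse-grained replacement candidates; empty if none are known."""
    return paraphrase_table.get(span, [])


# Hypothetical paraphrase entries, for illustration only.
paraphrase_table = {
    ("overpriced",): ["too expensive", "cheap enough"],
    ("south", "side", "part"): ["south end", "southern countries"],
}

tokens = "restaurants that are not overpriced".split()
spans = extract_spans(tokens)
usable = {s: candidates_for(s, paraphrase_table) for s in spans if candidates_for(s, paraphrase_table)}
print(len(spans), usable)
```

A sentence of length L yields at most 3L - 3 spans this way (for L of at least 3), and only the spans with at least one retrieved candidate are useful for generation.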

Model WoZ Multi
Delexicalised Model 70.8 71.2
NBT-DNN 84.4 80.3
NBTKS 85.5 80.9
StateNet 88.9 82.4
GLAD 88.1 82.7
GCE 88.5 83.5
+ DA
+ DA
Table 1: Comparison of our model and other baselines. DA refers to the coarse-grained data augmentation without the reinforced framework, and Multi refers to the MultiWoZ (restaurant) dataset. A t-test is conducted on our proposed models, and the original trackers (NBT-CNN and the GLAD-based tracker) are used as the comparison baselines. † and ‡: significant over the baseline trackers at the 0.05/0.01 level. The mean and the standard deviation are also reported.

3.3 Baseline Methods

We compare our model with some baselines. Delexicalised Model uses generic tags to replace the slot values and employs a CNN for turn-level feature extraction and a Jordan RNN for state updates henderson2014word; wen2017network. NBT-DNN and NBT-CNN respectively use the summation and convolution filters to learn the representations for the user utterance, candidate slot-value pair and the system actions mrkvsic2017neural. Then, they fuse these representations by a gating mechanism for the final prediction. NBTKS has a similar structure to NBT-DNN and NBT-CNN, but with a more complicated gating mechanism ramadan2018large. StateNet learns a representation from the dialog history, and then compares the distances between the learned representation and the vectors of the candidate slot-value pairs for the final prediction D18-1299. GLAD is a global-locally self-attentive state tracker, which learns representations of the user utterance and previous system actions with global-local modules zhong2018global. GCE is developed based on GLAD by using global recurrent networks rather than the global-local modules nouri2018toward.

We also use the coarse-grained data augmentation (DA) without the reinforced framework as a baseline, which generates new instances by randomly choosing one of the candidates.

Dataset   Model              10%    20%    50%
WoZ       Original tracker   50.1   72.5   81.7
          + RDA              66.8   81.5   86.9
Multi     Original tracker   60.0   72.6   77.6
          + RDA              71.5   81.2   85.2
Table 2: The results with different sub-sampling ratios on WoZ and MultiWoZ (restaurant).

Setting              WoZ    Multi
RDA                  90.7   86.7
- Bag Reward         89.1   84.3
- Instance Reward    89.8   85.4
DA                   88.0   82.7
Table 3: Ablation study of performances on the test set of WoZ and MultiWoZ.

3.4 Results and Analyses

We compare our model with baselines and the joint goal accuracy is used as the evaluation metric. The results are shown in Table 1.

From the table, we observe that the proposed GLAD-based tracker achieves performance (88.3% and 83.6%) comparable with other state-of-the-art models on both datasets. The RDA framework can further boost the performance of this competitive tracker by margins of 2.4% and 3.1% on the two datasets respectively, achieving new state-of-the-art results (90.7% and 86.7%). Compared with the GLAD-based tracker, the classical NBT-CNN with the RDA framework obtains larger improvements: 3.9% and 3.6%. We also conduct a significance test (t-test), and the results show that the proposed RDA achieves significant improvements over the baseline models on both WoZ and MultiWoZ (restaurant).

The table also shows that directly using coarse-grained data augmentation methods without the RDA is less effective, and can even degrade the performance, as it may generate noisy instances. Using the RDA, the GLAD-based tracker improves from 88.0% to 90.7% on WoZ and from 82.7% to 86.7% on MultiWoZ. The NBT-CNN improves from 84.2% to 87.9% and from 79.7% to 83.4% respectively. Overall, the results indicate that the RDA framework offers an effective mechanism to improve the quality of augmented data.

To further verify the effectiveness of the RDA when the training data is scarce, we conduct sub-sampling experiments with the tracker trained on different ratios [10%, 20%, 50%] of the training set. The results on both datasets are shown in Table 2. We find that our proposed RDA methods consistently improve the original tracker performance. Notably, with the 10% and 20% subsets we obtain gains of roughly 9 to 17 points in joint goal accuracy on both WoZ and MultiWoZ (restaurant), which indicates that the RDA framework is particularly useful when the training data is limited.

To evaluate the contribution of the two levels of rewards, we perform an ablation study with the GLAD-based tracker on both the WoZ and MultiWoZ datasets. The results are shown in Table 3. From the table we can see that each reward alone provides improvements of about 1% to 2% over DA, and the bag-level reward achieves larger gains than the instance-level reward. Compared with the DA setting, RDA obtains improvements of about 3% to 4% on the datasets by combining both rewards, which indicates that the summed reward is more reliable for policy learning than either individual reward.

3.5 Effects of Hyper-parameters

Figure 4: Results of different hyper-parameters. Top: different multiples of the size of the original data; Middle: different epochs of alternate learning; Bottom: different epochs of the Generator learning. The solid circles in the figure refer to the model of coarse-grained data augmentation (DA).

In this subsection, we investigate the effects of the number of newly augmented data in the Tracker learning, the epoch number of the alternate learning and the epoch number of the Generator learning on performance. We conduct experiments with the Tracker, evaluated on the validation set of WoZ, and the joint goal accuracy is used as the evaluation metric.

Number of newly augmented data: we use 0 to 5 times the size of the original data in the Tracker learning. The performance is shown in Figure 4 (top). The model continues to improve while the number of newly added examples is less than 2 times the original data. When we add more than twice the amount of original data, the improvement is not significant.
Epoch number of the alternate learning: we vary the number of epochs from 0 to 10 and the performance is shown in Figure 4 (middle). We can see that, with alternate learning, the model continues to improve over the early epochs and then becomes stable with no further improvement.
Epoch number of the Generator learning: we vary the number of epochs from 0 to 350, and the performance is shown in Figure 4 (bottom). We find that the performance increases dramatically over the early epochs and shows no improvement afterwards. This shows that the Generator needs a large number of epochs to ensure a good policy.

3.6 Case Study for Policy Network

We sample four sentences from WoZ to demonstrate the effectiveness of the Generator policy in the case study. Due to limited space, we present the candidate phrases with maximum and minimum probabilities derived from the policy network and the details are shown in Table 4.

We observe that both high-quality and low-quality replacements exist in the candidate set. The high-quality replacements will generate reliable instances, which can potentially improve the generalization ability of the Tracker. The low-quality ones will induce noisy instances and can reduce the performance of the Tracker. From the results of the policy network, we find that our Generator can automatically infer the quality of candidate replacements, assigning higher probabilities to the high-quality candidates and lower probabilities to the low-quality candidates.

Sentence and text span: Thanks , [could you give] me the phone number for the restaurant?
    Candidates: i was wonder if you could provide
                are you able to
Sentence and text span: What restaurants are on the east side that are not [overpriced] ?
    Candidates: too expensive
                cheap enough
Sentence and text span: What is a affordable restaurant in the [south side part] of town?
    Candidates: south end
                southern countries
Sentence and text span: I want Cuban food and i [do n’t care] about the price range.
    Candidates: do n’t worry
                do n’t give a danm
Table 4: Case study for the Generator policy. For each sentence, the candidate phrase with the maximum policy value is listed on the first line of Candidates and the one with the minimum value on the second line.

4 Related Work

Dialog State Tracking. DST is studied extensively in the literature williams2016dialog. The methods can be classified into three categories: rule-based zue2000juplter, generative devault2007managing; williams2008exploiting, and discriminative metallinou2013discriminative methods. The discriminative methods metallinou2013discriminative treat dialog state tracking as a classification problem, designing a large number of features and optimizing the model parameters on the annotated data. Recently, neural network based models with different architectures have been applied in DST henderson2014word; zhong2018global. These models first employ a CNN wen2017network, an RNN ramadan2018large or self-attention nouri2018toward to learn the representations for the user utterance and the system actions/response; various gating mechanisms ramadan2018large are then used to fuse the learned representations for prediction. Another difference among these neural models is the way of parameter sharing: most of them use one shared global encoder for representation learning, while the work zhong2018global pairs each slot with a local encoder in addition to one shared global encoder. Although these neural network based trackers obtain state-of-the-art results, they are still limited by the insufficient amount and diversity of annotated data. To address this difficulty, we propose a method of data augmentation to improve neural state trackers by adding high-quality generated instances as new training data.

Data Augmentation. Data augmentation aims to generate new training data by conducting transformations (e.g. rotating or flipping images, audio perturbation, etc.) on existing data. It has been widely used in computer vision krizhevsky2012imagenet; cubuk2018autoaugment and speech recognition ko2015audio. In contrast to image or speech transformations, it is difficult to obtain effective transformation rules for text that preserve the fluency and coherence of newly generated text and are useful for specific tasks. There is prior work on data augmentation in NLP zhang2015character; kang18acl; kobayashi2018contextual; hou2018sequence; ray2018robust; yoo2018data. These approaches do not include dedicated mechanisms to filter out low-quality generated instances. In contrast, we propose a coarse-to-fine strategy for data augmentation, where the fine-grained generative policies learned by RL are used to automatically reduce the noisy instances and retain the effective ones.

Reinforcement Learning in NLP. RL is a general purpose framework for decision making and has been applied in many NLP tasks such as information extraction narasimhan2016improving, relational reasoning xiong2017deeppath, sequence learning ranzato2015sequence; D18-1421; celikyilmaz2018deep, summarization paulus2017deep; dong2018banditsum, text classification wu2018reinforced; feng2018reinforcement and dialog singh2000reinforcement; D16-1127. Previous works by feng2018reinforcement and P18-1046 design RL algorithms to learn how to filter out noisy instances. Our work differs significantly from these works, especially in the problem settings and model frameworks. The previous work assumes that many distantly supervised sentences are available. In our work, however, we only know the possible replacements, and our RL algorithm must learn how to choose optimal replacements to “generate” new high-quality sentences. Moreover, the action space and reward design are different.

5 Conclusion and Future Work

We have proposed a reinforced data augmentation (RDA) method for dialog state tracking in order to improve its performance by generating high-quality training data. The Generator and the Tracker are learned in an alternate manner, i.e. the Generator is learned based on rewards from the Tracker while the Tracker is re-trained and boosted with the new high-quality data augmented by the Generator. We conducted extensive experiments on the WoZ and MultiWoZ (restaurant) datasets; the results demonstrate the effectiveness of our framework. In future work, we plan to conduct experiments on more NLP tasks and to introduce neural network based paraphrasing methods into the RDA framework.