Adversarial Bootstrapping for Dialogue Model Training

by Oluwatobi Olabiyi, et al.
Capital One

Open domain neural dialogue models, despite their successes, are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties as well as their interplay with training datasets and model architectures. Toward addressing these problems, this paper proposes bootstrapping a dialogue response generator with an adversarially trained discriminator as an effective solution. The proposed method involves training a neural generator in both auto-regressive and traditional teacher-forcing modes, with the maximum likelihood loss of the auto-regressive outputs weighted by the score from a metric-based discriminator model. The discriminator input is a mixture of ground truth labels, the teacher-forcing outputs of the generator, and distractors sampled from the dataset, thereby allowing for richer feedback on the autoregressive outputs of the generator. To improve the calibration of the discriminator output, we also bootstrap the discriminator with the matching of the intermediate features of the ground truth and the generator's autoregressive output. We explore different sampling and adversarial policy optimization strategies during training in order to understand how to encourage response diversity without sacrificing relevance. Our experiments show that adversarial bootstrapping is effective at addressing exposure bias, leading to improvements in response relevance and coherence. The improvement is demonstrated with state-of-the-art results on the Movie and Ubuntu dialogue datasets with respect to human evaluations and BLEU, ROUGE, and DISTINCT scores.






1 Introduction

End-to-end neural dialogue models have demonstrated the ability to generate reasonable responses to human interlocutors. However, a significant gap remains between these state-of-the-art dialogue models and human-level discourse. The fundamental problem with neural dialogue modeling is exemplified by their generic responses, such as I don’t know, I’m not sure, or how are you, when conditioned on broad ranges of dialogue contexts. In addition to the limited contextual information in single-turn Seq2Seq models [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Li et al.2016a], which has motivated the need for hierarchical recurrent encoder decoder (HRED) multi-turn models [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Olabiyi et al.2018, Olabiyi et al.2019], previous work points to three underlying reasons why neural models fail at dialogue response generation.

i) Exposure Bias: Similar to language and machine translation models, traditional conversation models are trained with the model input taken from the ground truth rather than a previous output (a method known as teacher forcing [Williams and Zipser1989]). During inference, however, the model conditions on its own past outputs, i.e., it runs autoregressively. Interestingly, training with teacher forcing does not present a significant problem in the machine translation setting, since the conditional distribution of the target given the source is well constrained. It is problematic in the dialogue setting, however, since the learning task is unconstrained [Lowe et al.2015]: there are several suitable target responses per dialogue context and vice versa. This discrepancy between training and inference is known as exposure bias [Williams and Zipser1989, Lamb et al.2016] and significantly limits the informativeness of the responses, as decoding errors compound rapidly during inference. Training methods that incorporate autoregressive sampling into model training have been explored to address this [Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Zhang et al.2018b].
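The gap between the two decoding modes can be made concrete with a toy next-token predictor (a deliberately imperfect, hypothetical model, not the paper's architecture): under teacher forcing a single mistake stays local, while under autoregressive decoding the same mistake corrupts every subsequent step.

```python
def toy_next_token(prev):
    # A deliberately imperfect bigram "model": it wrongly maps 'b' -> 'x'.
    table = {'<s>': 'a', 'a': 'b', 'b': 'x', 'x': 'x', 'c': 'd'}
    return table.get(prev, '<unk>')

ground_truth = ['a', 'b', 'c', 'd']

# Teacher forcing: inputs always come from the ground truth,
# so the single 'b' -> 'x' error does not influence later steps.
tf_outputs = [toy_next_token(prev) for prev in ['<s>'] + ground_truth[:-1]]

# Autoregressive decoding: each input is the model's own previous output,
# so the error compounds for the rest of the sequence.
ar_outputs, prev = [], '<s>'
for _ in range(len(ground_truth)):
    prev = toy_next_token(prev)
    ar_outputs.append(prev)
```

Teacher forcing recovers after the single error (`['a', 'b', 'x', 'd']`), whereas the autoregressive rollout never does (`['a', 'b', 'x', 'x']`), which is exactly the compounding-error behavior described above.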

Figure 1: Positional Entropy of Movie and Ubuntu datasets - Applying a greedy training objective to the datasets can achieve low overall entropy just by overfitting to low entropy regions, resulting in short and generic responses.

ii) Training data: The inherent problem with dialogue training data, although identified, has not been particularly addressed in the literature [Sharath, Tandon, and Bauer2017]. Human conversations contain a large number of generic, uninformative responses with little or no semantic information, giving rise to a classic class-imbalance problem. This problem also exists at the word and turn level; human dialogue [Banchs2012, Serban et al.2017b] contains non-uniform sequence entropy that is concave with respect to the token position, with the tokens at the beginning and end of a sequence having lower entropy than those in the middle (see Fig. 1). This initial positive entropy gradient can create learning barriers for recurrent models, and is a primary contributing factor to their short, generic outputs.
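The positional-entropy statistic behind Fig. 1 is straightforward to compute; the following sketch (with a made-up three-sentence corpus) reproduces the concave profile on toy data: near-deterministic openers and closers, varied middles.

```python
import math
from collections import Counter

def positional_entropy(corpus):
    """Entropy (in bits) of the token distribution at each position across a corpus."""
    max_len = max(len(seq) for seq in corpus)
    entropies = []
    for pos in range(max_len):
        counts = Counter(seq[pos] for seq in corpus if len(seq) > pos)
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counts.values()))
    return entropies

# Illustrative corpus: every sentence opens with "i" and closes with ".",
# while the middle tokens vary, mimicking the concave profile of Fig. 1.
corpus = [["i", "love", "cats", "."],
          ["i", "hate", "rain", "."],
          ["i", "like", "tea", "."]]
H = positional_entropy(corpus)
```

Here `H` is zero at the first and last positions and maximal in the middle, so a model that greedily minimizes overall entropy can do well by fitting only the low-entropy edges, producing short generic responses.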

iii) Training Objective: Most existing dialogue models are trained using maximum likelihood estimation (MLE) [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Serban et al.2016, Xing et al.2017] with teacher forcing because autoregressive sampling leads to unstable training. Unfortunately, the use of MLE is incongruent with the redundant nature of dialogue datasets, exacerbates the exposure bias problem in dialogue datasets, and is the primary factor leading to uninteresting and generic responses. Alternative training frameworks that complement MLE with other constraints, such as generative adversarial networks, reinforcement learning, and variational auto-encoders that specifically encourage diversity, have been explored to overcome the limitations of the MLE objective alone [Li et al.2016a, Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Serban et al.2017b, Zhang et al.2018b, Olabiyi et al.2018, Olabiyi et al.2019].

In this paper, we propose an adversarial bootstrapping framework for training dialogue models. This framework tackles the class imbalance caused by the redundancy in dialogue training data, and addresses the problem of exposure bias in dialogue models. Bootstrapping has been proposed in the literature as a way to handle data with noisy, subjective, and incomplete labels by combining cross-entropy losses from both the ground truth (i.e., teacher forcing) and model outputs (i.e., autoregression) [Reed et al.2015, Grandvalet and Bengio2005, Grandvalet and Bengio2006]. Here, we first extend its use to dialogue model training to encourage the generation of high-variance response sequences for a given ground truth target [Reed et al.2015]. This should reduce the tendency of dialogue models to reproduce the generic and uninteresting target responses present in the training data. We achieve this by training a discriminator adversarially and using its feedback to weight the cross-entropy loss on the model-predicted target. The gradient from the discriminator's feedback encourages the dialogue model to generate a wide range of structured outputs. Second, we bootstrap the discriminator to improve the calibration of its output: we use the similarity between the representations of the generator's autoregressive output and the ground truth, taken from an intermediate layer of the discriminator, as an additional target for the discriminator. This further improves the diversity of the generator's output without sacrificing relevance. We apply adversarial bootstrapping to multi-turn dialogue models. Architecturally, we employ an HRED generator and an HRED discriminator with a shared hierarchical recurrent encoder, as depicted in Fig. 2. In our experiments, the proposed adversarial bootstrapping demonstrates state-of-the-art performance on the Movie and Ubuntu datasets, as measured by both automatic metrics (BLEU, ROUGE, and distinct n-gram scores) and human evaluations.

2 Related Work

The literature on dialogue modeling, even in the multi-turn scenario, is vast (see [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019, Li et al.2016b]), so in this section we focus on the most relevant previous papers. The proposed adversarial bootstrapping is closely related to the use of reinforcement learning for dialogue response generation with an adversarially trained discriminator serving as a reward function [Li et al.2017]. First, we employ a different discriminator training strategy from Li et al. [2017]. The negative samples of our discriminator consist of (i) the generator's deterministic teacher-forcing output and (ii) distractors sampled from the training set. This makes the discriminator's task more challenging and improves the quality of the feedback to the generator by discouraging the generation of high-frequency generic responses. Also, while Li et al. [2017] sample over all the possible outputs of the generator, we take samples from the generator's top_k outputs, or the MAP output with Gaussian noise as an additional input. This allows our model to explore mostly plausible trajectories during training, whereas in Li et al. [2017] the discriminator mostly scores the generated samples very low. The top_k sampling strategy also mitigates the gradient variance problem found in the traditional policy optimization employed by Li et al. [2017]. Finally, we bootstrap our discriminator with the similarity between the intermediate representations of the generator's autoregressive output and the ground truth to improve the calibration of the discriminator output.

3 Model

Let $\mathbf{x}_i$ denote the context or conversation history up to turn $i$ and let $x_{i+1}$ denote the associated target response. Provided input-target samples $(\mathbf{x}_i, x_{i+1})$, we aim to learn a generative model $P_{\theta_G}(y \mid \mathbf{x}_i)$ which scores representative hypotheses $y$ given arbitrary dialogue contexts such that responses that are indistinguishable from informative and diverse target responses are favored with high scores and otherwise given low scores. Notationally, we write the collection of possible responses at turn $i$ as the set $\mathcal{Y}_i$ containing elements $y^j = (y^j_1, \ldots, y^j_{T_j})$, where $T_j$ is the length of the $j$-th candidate response and $y^j_t$ is the $t$-th word of that response.

3.1 Generator Bootstrapping

To achieve the goal outlined above, we propose an Adversarial Bootstrapping (AB) approach to training multi-turn dialogue models such as the one depicted in Fig. 2. The adversarial bootstrapping for the generator can be expressed according to the objective

$$\mathcal{L}_{AB}(\theta_G) = -\sum_{y \in \mathcal{Y}_i} t_G(y)\, \log P_{\theta_G}(y \mid \mathbf{x}_i) \qquad (1)$$

where $t_G(y)$ is the target variable that controls the generator training. Indeed, hard bootstrapping [Reed et al.2015] is one such special case of (1) wherein $t_G(y) = \beta$ for $y = x_{i+1}$, $t_G(y) = 1 - \beta$ for $y = \arg\max_{y'} P_{\theta_G}(y' \mid \mathbf{x}_i)$, and $t_G(y) = 0$ otherwise, where $\beta$ is a hyperparameter. Similarly, MLE is another special case in which $t_G(y) = 1$ for $y = x_{i+1}$ and $t_G(y) = 0$ otherwise. It is reasonable to assume from these formulations that bootstrapping will outperform MLE since it does not assume all negative outputs are equally wrong.
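The different training schemes differ only in how they weight each candidate response's log-likelihood. A minimal numeric sketch (the candidate names, probabilities, and the values of the weights below are illustrative, not taken from the paper):

```python
import math

def ab_generator_loss(log_probs, targets):
    """Weighted cross-entropy -sum_y t(y) * log P(y | x), where each
    candidate response y carries a weight t(y) set by the training scheme."""
    return -sum(targets[y] * log_probs[y] for y in log_probs)

# Two candidate responses with illustrative model log-likelihoods.
log_probs = {"ground_truth": math.log(0.5), "model_output": math.log(0.2)}

# MLE: all weight on the ground truth.
mle_loss = ab_generator_loss(log_probs,
                             {"ground_truth": 1.0, "model_output": 0.0})

# Hard bootstrapping: weight beta on the ground truth,
# 1 - beta on the model's own most likely output.
beta = 0.9
hard_loss = ab_generator_loss(log_probs,
                              {"ground_truth": beta, "model_output": 1.0 - beta})

# Adversarial bootstrapping (sketch): the model's autoregressive output is
# weighted by alpha * Q_D, the discriminator's score for that output.
alpha, q_d = 0.01, 0.8
ab_loss = ab_generator_loss(log_probs,
                            {"ground_truth": 1.0, "model_output": alpha * q_d})
```

Unlike MLE, both bootstrapped variants assign non-zero credit to the model's own output, with the adversarial variant letting the discriminator decide how much.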

Figure 2: A multi-turn recurrent architecture with adversarial bootstrapping: The generator and discriminator share the same encoder (through the context state) and the same word embeddings. The generator also uses the word embeddings as the output projection weights. The encoder and the discriminator RNNs are bidirectional while the context and generator RNNs are unidirectional.

Interestingly, Li et al. [2017] make use of the MLE setting but additionally rely on the sampling stochasticity to obtain non-zero credit assignment information from the discriminator for the generator policy updates. To avoid this inconsistency, we instead modify the generator target to

$$t_G(y) = \begin{cases} 1, & y = x_{i+1} \\ \alpha\, Q_D(\mathbf{x}_i, y), & y \sim P_{\theta_G}(\cdot \mid \mathbf{x}_i) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\alpha$ is a hyperparameter and $Q_D(\mathbf{x}_i, y)$ is the bootstrapped target obtained from a neural network discriminator $D$ with parameters $\theta_D$. The first two assignments in (2) are also used in training the discriminator, in addition to the human-generated distractors, denoted $\bar{x}_{i+1}$, from the dataset. In detail, we make use of the term

$$t_D(y) = \begin{cases} 1, & y = x_{i+1} \\ 0, & y \sim P_{\theta_G}(\cdot \mid \mathbf{x}_i) \\ 0, & y = \bar{x}_{i+1} \end{cases} \qquad (3)$$

within the context of the objective function. Namely, the discriminator objective is the cross-entropy between the output and the target of the discriminator given by

$$\mathcal{L}_{AB}(\theta_D) = -\sum_{y} \Big[ t_D(y)\, \log Q_D(\mathbf{x}_i, y) + \big(1 - t_D(y)\big)\, \log\big(1 - Q_D(\mathbf{x}_i, y)\big) \Big] \qquad (4)$$
The inclusion of human-generated negative samples encourages the discriminator to assign low scores to high frequency, generic target responses in the dataset, thereby discouraging the generator from reproducing them.
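The discriminator's cross-entropy over a (ground truth, teacher-forcing output, distractor) triple can be sketched as plain binary cross-entropy; the scores and the three-way batch composition below are illustrative.

```python
import math

def discriminator_loss(scores, labels):
    """Mean binary cross-entropy over (context, response) pairs: label 1 for
    the ground-truth response, 0 for the generator's teacher-forcing output
    and for distractor responses sampled from the dataset."""
    eps = 1e-12  # guard against log(0)
    return -sum(l * math.log(s + eps) + (1 - l) * math.log(1 - s + eps)
                for s, l in zip(scores, labels)) / len(scores)

# One training triple: (ground truth, teacher-forcing output, distractor).
labels = [1.0, 0.0, 0.0]
confident = discriminator_loss([0.99, 0.01, 0.01], labels)  # near-correct scores
uncertain = discriminator_loss([0.50, 0.50, 0.50], labels)  # uninformative scores
```

Because distractors are real human responses, the discriminator cannot score a response highly merely for looking fluent; it must score it in relation to the context.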

3.2 Discriminator Bootstrapping

In addition to the generator bootstrapping with the discriminator, we can also bootstrap the discriminator using a similarity measure, $S(\cdot, \cdot)$, between latent representations $h_D(\cdot)$ of the sampled generator outputs, $y_s \sim P_{\theta_G}(\cdot \mid \mathbf{x}_i)$, and of the ground truth, as encoded by the discriminator, i.e.,

$$t_D(y_s) = S\big(h_D(y_s),\, h_D(x_{i+1})\big) \qquad (5)$$

In our experiments, we chose the cosine similarity metric for $S$ and the output of the discriminator before the logit projection for $h_D$. This helps to better calibrate the discriminator's judgment of the generator's outputs.
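The similarity target itself is just a cosine between two feature vectors; a minimal sketch (the feature vectors here are illustrative stand-ins for the discriminator's pre-logit representations):

```python
import math

def cosine_bootstrap_target(feat_sample, feat_truth):
    """Bootstrapped discriminator target for an autoregressive sample:
    cosine similarity between intermediate discriminator features of the
    sample and of the ground truth response."""
    dot = sum(a * b for a, b in zip(feat_sample, feat_truth))
    norm_s = math.sqrt(sum(a * a for a in feat_sample))
    norm_t = math.sqrt(sum(b * b for b in feat_truth))
    return dot / (norm_s * norm_t)
```

A sample whose features align with the ground truth gets a target near 1 rather than a hard 0, so the discriminator learns a graded, better-calibrated score instead of a binary one.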

3.3 Sampling Strategy

To backpropagate the learning signal for the case where $y \sim P_{\theta_G}(\cdot \mid \mathbf{x}_i)$ in (2), we explore both stochastic and deterministic policy gradient methods. For stochastic policies, we approximate the gradient of (1) w.r.t. $\theta_G$ by Monte Carlo samples using the REINFORCE policy gradient method [Li et al.2017, Glynn1990, Williams1992]:

$$\nabla_{\theta_G} \mathcal{L}_{AB} \approx -\,\alpha\, Q_D(\mathbf{x}_i, y_s)\, \nabla_{\theta_G} \log P_{\theta_G}(y_s \mid \mathbf{x}_i), \qquad y_s \sim P_{\theta_G}(\cdot \mid \mathbf{x}_i) \qquad (6)$$

For deterministic policies, we approximate the gradient according to [Silver et al.2014, Zhang et al.2018b]

$$\nabla_{\theta_G} \mathcal{L}_{AB} \approx -\,\alpha\, \nabla_{y_s} Q_D(\mathbf{x}_i, y_s)\, \nabla_{\theta_G}\, y_s \qquad (7)$$

where $y_s = G_{\theta_G}(\mathbf{x}_i, z)$ and $z$ is the source of randomness. We denote the model trained with (7) as aBoots_gau. To reduce the variance of (6), we propose a novel approach of sampling from the top_k generator outputs using (i) a categorical distribution based on the output logits (aBoots_cat), similar to the treatment of Radford et al. [2019], and (ii) a uniform distribution (aBoots_uni), where top_k is a hyperparameter. This is especially useful for dialogue modeling with large vocabulary sizes.
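The two stochastic sampling strategies can be sketched as follows (a minimal illustration over raw logits; the logit values and function name are ours, not the paper's):

```python
import math
import random

def top_k_sample(logits, k, mode="cat", rng=random):
    """Sample a token index from the k highest-scoring logits.
    mode="cat": categorical over the renormalized top-k probabilities;
    mode="uni": uniform over the top-k candidates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if mode == "uni":
        return rng.choice(top)
    # Softmax over the top-k logits only (shifted for numerical stability),
    # then inverse-CDF sampling.
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    r, acc = rng.random() * sum(weights), 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]

logits = [0.1, 2.0, -1.0, 0.5]  # toy 4-word vocabulary
```

Restricting sampling to the top-k candidates keeps the policy's exploration on plausible trajectories, which is the variance-reduction effect described above.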

3.4 Encoder

Referring to the network architecture in Fig. 2, the generator and discriminator share the same encoder. The encoder uses two RNNs to handle multi-turn representations, similar to the approach of Serban et al. [2016]. First, during turn $i$, a bidirectional encoder RNN, $eRNN(\cdot)$, with initial state $h^e_{i,0}$, maps the conversation context comprising the sequence of input symbols $x_i = (x_{i,1}, \ldots, x_{i,M_i})$, where $M_i$ is the sequence length, into a sequence of hidden state vectors $(h^e_{i,1}, \ldots, h^e_{i,M_i})$ according to

$$h^e_{i,m} = eRNN\big(E(x_{i,m}),\, h^e_{i,m-1}\big), \quad m = 1, \ldots, M_i \qquad (8)$$

where $E(\cdot)$ is the embedding lookup and $E \in \mathbb{R}^{d \times V}$ is the embedding matrix with embedding dimension $d$ and vocabulary size $V$. The vector representation of the input sequence, $\bar{h}^e_i$, is the pooling over the encoded sequence [Serban et al.2016]. In addition, we use the output sequence as an attention memory to the generator as depicted in Fig. 2. This is done to improve the relevance of the generated response.

To capture the multi-turn context, we use a unidirectional context RNN, $cRNN(\cdot)$, to combine the past dialogue context $h^c_{i-1}$ with the pooling of the encoded sequence as

$$h^c_i = cRNN\big(\bar{h}^e_i,\, h^c_{i-1}\big) \qquad (9)$$
3.5 Generator

The generator, denoted $G$, is a unidirectional decoder RNN with an attention mechanism [Bahdanau, Cho, and Bengio2015, Luong et al.2015]. Similar to Serban et al. [2016], the decoder RNN is initialized with the last state of the context RNN. The generator outputs a hidden state representation $h^d_{i,j}$ for each previous token $y_{j-1}$ according to

$$h^d_{i,j} = dRNN\big(E(y_{j-1}),\, h^d_{i,j-1},\, a_{i,j}\big) \qquad (10)$$

where $a_{i,j}$ is the attention over the encoded sequence $(h^e_{i,1}, \ldots, h^e_{i,M_i})$. When the generator is run in teacher-forcing mode, as is typically done during training, the previous token from the ground truth is used, i.e., $y_{j-1} = x_{i+1,j-1}$. During inference (autoregressive mode), the generator's previous decoded output is used, i.e., $y_{j-1} = \hat{y}_{j-1}$.

The decoder hidden state, $h^d_{i,j}$, is mapped to a probability distribution typically through a logistic layer, yielding

$$P_{\theta_G}(y_j \mid \mathbf{x}_i, y_{1:j-1}) = \mathrm{softmax}\big(E^\top h^d_{i,j} / \tau + b_l\big) \qquad (11)$$

where $\tau$ is a temperature hyperparameter, $E$ is the shared embedding matrix (also used as the output projection weights), and $b_l$ is the logit bias. The generative model can then be derived as:

$$P_{\theta_G}(y \mid \mathbf{x}_i) = \prod_{j=1}^{T} P_{\theta_G}(y_j \mid \mathbf{x}_i, y_{1:j-1}) \qquad (12)$$
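The tied-embedding output projection with temperature can be sketched numerically (the 4-word vocabulary, embeddings, and state vector below are illustrative):

```python
import math

def output_distribution(h, E, b, tau=1.0):
    """softmax(E h / tau + b): project decoder state h to a vocabulary
    distribution, reusing the (tied) embedding matrix E as output weights;
    tau is the temperature hyperparameter."""
    logits = [sum(w * x for w, x in zip(row, h)) / tau + bias
              for row, bias in zip(E, b)]
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative 4-word vocabulary with 3-dimensional embeddings.
E = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.1], [0.0, 0.4, 0.2]]
h = [1.0, -0.5, 0.3]
b = [0.0, 0.0, 0.0, 0.0]
p = output_distribution(h, E, b)
sharp = output_distribution(h, E, b, tau=0.5)
```

Lowering the temperature sharpens the distribution (the top candidate's probability can only grow), while tying the output weights to the embedding matrix keeps the parameter count down for large vocabularies.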
3.6 Discriminator

The discriminator $D$ is a binary classifier that takes as input a response sequence $y$ and a dialogue context $\mathbf{x}_i$, and is trained with the output labels provided in (3) and (5). The discriminator, as shown in Fig. 2, is an RNN, $dRNN_D(\cdot)$, that shares the hierarchical encoder and the word embeddings with the generator, with its initial state being the final state of the context RNN. The last layer of the discriminator RNN is fed to a logistic layer and a sigmoid function to produce the normalized $Q_D$ (action-value function) value for a pair of dialogue context (state) and response (action).

We explore two options for estimating the $Q_D$ value, i.e., at the word or utterance level. At the utterance level, we use (4) in conjunction with a unidirectional discriminator RNN. The $Q_D$ value is calculated using the last output of $dRNN_D(\cdot)$, i.e.,

$$Q_D(\mathbf{x}_i, y) = \sigma\big(W_D\, h^D_{i,T} + b_D\big) \qquad (13)$$

where $h^D_{i,T}$ is the last output of the discriminator RNN, and $W_D$ and $b_D$ are the logit projection and bias respectively. At the word level, the discriminator RNN (we use a bidirectional RNN in our implementation) produces a word-level evaluation. The normalized $Q_D$ value and the adversarial bootstrapping objective function are then respectively given by

$$Q_D(\mathbf{x}_i, y_j) = \sigma\big(W_D\, h^D_{i,j} + b_D\big) \qquad (14)$$

$$\mathcal{L}_{AB}(\theta_G) = -\sum_{y \in \mathcal{Y}_i} \sum_{j=1}^{T} t_G(y_j)\, \log P_{\theta_G}(y_j \mid \mathbf{x}_i, y_{1:j-1}) \qquad (15)$$

where, for sampled responses, $t_G(y_j) = \alpha\, Q_D(\mathbf{x}_i, y_j)$ is the per-word analogue of (2).
4 Training

We train both the generator and discriminator simultaneously, with two samples for the generator and three for the discriminator. In all our experiments, we use the generator's teacher-forcing outputs to train the discriminator (i.e., the corresponding cases of (2) and (3)). The encoder parameters are included with the generator, i.e., we did not update the encoder during discriminator updates. Each RNN is a 3-layer GRU cell with a hidden state size of 512. The word embedding size is the same as the hidden state size, and the vocabulary size is 50,000. Rather than training separate models with multiple top_k values, we trained with a single top_k value and searched for the optimum top_k (between 1 and 20) on the validation set using the BLEU score; we used the obtained optimum values for inference. Other training parameters are as follows: the learning rate decay factor is applied when the generator loss has increased over two iterations, we use a batch size of 64, and we clip gradients. All parameters are initialized with Xavier uniform random initialization [Glorot and Bengio2010]. Due to the large vocabulary size, we use a sampled softmax loss [Jean et al.2015] for the generator to limit the GPU memory requirements and expedite the training process. However, we use the full softmax for evaluation. The model is trained end-to-end using the stochastic gradient descent algorithm.

Model Movie Ubuntu
Relevance (BLEU, ROUGE) Diversity (DIST-1/DIST-2, NASL) Relevance (BLEU, ROUGE) Diversity (DIST-1/DIST-2, NASL)
HRED 0.0474 0.0384 0.0026/0.0056 0.535 0.0177 0.0483 0.0203/0.0466 0.892
VHRED 0.0606 0.1181 0.0048/0.0163 0.831 0.0171 0.0855 0.0297/0.0890 0.873
hredGAN_u 0.0493 0.2416 0.0167/0.1306 0.884 0.0137 0.0716 0.0260/0.0847 1.379
hredGAN_w 0.0613 0.3244 0.0179/0.1720 1.540 0.0216 0.1168 0.0516/0.1821 1.098
DAIM 0.0155 0.0077 0.0005/0.0006 0.721 0.0015 0.0131 0.0013/0.0048 1.626
Transformer 0.0360 0.0760 0.0107/0.0243 1.602 0.0030 0.0384 0.0465/0.0949 0.566
aBoots_u_gau 0.0642 0.3326 0.0526/0.2475 0.764 0.0115 0.2064 0.1151/0.4188 0.819
aBoots_w_gau 0.0749 0.3755 0.0621/0.3051 0.874 0.0107 0.1712 0.1695/0.7661 1.235
aBoots_u_uni 0.0910 0.4015 0.0660/0.3677 0.975 0.0156 0.1851 0.0989/0.4181 0.970
aBoots_w_uni 0.0902 0.4048 0.0672/0.3653 0.972 0.0143 0.1984 0.1214/0.5443 1.176
aBoots_u_cat 0.0880 0.4063 0.0624/0.3417 0.918 0.0210 0.1491 0.0523/0.1795 1.040
aBoots_w_cat 0.0940 0.3973 0.0613/0.3476 1.016 0.0233 0.2292 0.1288/0.5190 1.208

Table 1: Automatic evaluation of generator performance
Model Pair Movie Ubuntu
aBoots_w_cat – DAIM 0.957 – 0.043 0.960 – 0.040
aBoots_w_cat – HRED 0.645 – 0.355 0.770 – 0.230
aBoots_w_cat – VHRED 0.610 – 0.390 0.746 – 0.254
aBoots_w_cat – hredGAN_w 0.550 – 0.450 0.556 – 0.444

Table 2: Human evaluation of generator performance based on response informativeness
Model Movie Ubuntu
Relevance (BLEU, ROUGE) Diversity (DIST-1/DIST-2, NASL) Relevance (BLEU, ROUGE) Diversity (DIST-1/DIST-2, NASL)
aBoots_g_u_gau 0.0638 0.3193 0.0498/0.2286 0.778 0.0150 0.1298 0.0480/0.1985 0.960
aBoots_g_w_gau 0.0729 0.3678 0.0562/0.3049 1.060 0.0123 0.1370 0.0646/0.1820 0.841
aBoots_g_u_uni 0.0801 0.3972 0.0655/0.3414 0.869 0.0124 0.1424 0.0636/0.1853 0.870
aBoots_g_w_uni 0.0860 0.4046 0.0671/0.3514 0.838 0.0170 0.2049 0.1074/0.4646 1.349
aBoots_g_u_cat 0.0836 0.3887 0.0597/0.3276 0.917 0.0131 0.1214 0.0597/0.3276 1.060
aBoots_g_w_cat 0.0928 0.4029 0.0613/0.3358 0.976 0.0202 0.2343 0.1254/0.4805 0.873

Table 3: Automatic evaluation of models with the generator bootstrapping only

5 Experiments

5.1 Setup

We evaluated the proposed adversarial bootstrapping (aBoots) with both generator and discriminator bootstrapping, on the Movie Triples and Ubuntu Dialogue corpora randomly split into training, validation, and test sets, using 90%, 5%, and 5% proportions. We performed minimal preprocessing of the datasets by replacing all words except the top 50,000 most frequent words by an UNK symbol. The Movie dataset [Serban et al.2016] spans a wide range of topics with few spelling mistakes and contains about 240,000 dialogue triples, which makes it suitable for studying the relevance vs. diversity tradeoff in multi-turn conversations. The Ubuntu dataset, extracted from the Ubuntu Relay Chat Channel [Serban et al.2017b], contains about 1.85 million conversations with an average of 5 utterances per conversation. This dataset is ideal for training dialogue models that can provide expert knowledge/recommendation in domain-specific conversations.

We explore different variants of aBoots along the choice of discrimination (either word (_w) or utterance (_u) level) and sampling strategy (either uniform (_uni), categorical (_cat), or with Gaussian noise (_gau)). We compare their performance with existing state-of-the-art dialogue models including (V)HRED [Serban et al.2016, Serban et al.2017b] and DAIM [Zhang et al.2018b], using implementations obtained from the respective authors. For completeness, we also include results from a transformer-based Seq2Seq model [Vaswani et al.2017].

We compare the performance of the models based on the informativeness (a combination of relevance and diversity metrics) of the generated responses. For relevance, we adopted the BLEU-2 [Papineni et al.2002] and ROUGE-2 [Lin2004] scores. For diversity, we adopted distinct unigram (DIST-1) and bigram (DIST-2) [Li et al.2016a] as well as normalized average sequence length (NASL) scores [Olabiyi et al.2018].

For human evaluation, we follow a setup similar to that of Li et al. [2016a], employing crowd-sourced judges to evaluate a random selection of 200 samples. We present both the multi-turn context and the generated responses from the models to 3 judges and ask them to rank the response quality in terms of informativeness. Ties are not allowed. The informativeness measure captures the temporal appropriateness, i.e., the degree to which the generated response is temporally and semantically appropriate for the dialogue context, as well as other factors such as the length of the response and repetitions. For analysis, we pair the models and compute the average number of times each model is ranked higher than the other.

6 Results and Discussion

6.1 Quantitative Evaluation

The quantitative measures reported in Table 1 show that adversarial bootstrapping gives the best overall relevance and diversity performance in comparison to (V)HRED, hredGAN, DAIM, and the Transformer, on both the Movie and Ubuntu datasets. We believe that the combination of improved discriminator training and the policy-based objective is responsible for the observed performance improvement. On the other hand, the multi-turn models (V)HRED and hredGAN suffer a performance loss due to exposure bias, since autoregressive sampling is not included in their training. Although DAIM uses autoregressive sampling, its poor performance shows the limitation of the single-turn architecture and GAN objective compared to the multi-turn architecture and policy-based objective in aBoots. The transformer Seq2Seq model, which performs better than RNNs on the machine translation task, also suffers from exposure bias and overfits very quickly to the low-entropy regions in the data, which leads to poor inference performance. Also, the results from the aBoots models indicate that word-level discrimination performs better than utterance-level discrimination, consistent with the results reported by Olabiyi et al. [2018] for the hredGAN model. While it is difficult to identify why some models generate very long responses, we observe that models with Gaussian noise inputs (e.g., hredGAN and aBoots_gau) may be using the latent Gaussian distribution to better encode response length information; indeed, this is an area of ongoing work. Within the variants of aBoots, we observe that models trained with a stochastic policy, aBoots_cat and aBoots_uni, outperform those trained with a deterministic policy, aBoots_gau. Notably, we find that for the stochastic policy there is a tradeoff in relevance and diversity between top_k categorical and uniform sampling. Categorical sampling tends to perform better on relevance but worse on diversity. We believe this is because top_k categorical sampling causes the generator to exploit high-likelihood outputs (i.e., ones more likely to be encountered during inference) more than uniform sampling over the top candidates does, while still allowing the policy to explore. This however comes with some loss of diversity, although not a significant one. Overall, the automatic evaluation indicates that adversarial bootstrapping trained with a stochastic policy using the top_k categorical sampling strategy gives the best performance.

6.2 Qualitative Evaluation

As part of our evaluation we also consider scores from human judges. Specifically, we have each evaluator compare responses from five models: aBoots_w_cat, hredGAN_w, HRED, VHRED, and DAIM. The pairwise human preferences are reported in Table 2. These data indicate a significant preference for responses generated by aBoots_w_cat as compared to both (V)HRED and DAIM. We observe that aBoots_w_cat is preferred over hredGAN_w on average, although not by a significant margin. We note that this score was computed from only 200 evaluation samples, which is likely too small to demonstrate a strong preference for aBoots_w_cat. It is also worth noting that the hredGAN_w model represents a strong baseline, based on previous human evaluations [Olabiyi et al.2018], against which to compare our adversarially trained models. It is interesting that although automatic evaluation scores hredGAN_w much lower than aBoots_w_cat on relevance, the long responses from hredGAN_w, which indicate strong diversity, have a considerable impact on how human evaluators judge the informativeness of the responses it generates. Table 4 shows example responses from the models.

Model Response
Context_0 we ’ re grateful , person . it ’ s not that . those craftsman tools . they ’ re a dream come true .
Context_1 your dad and i were raised with a certain work ethic , that ’ s all .
HRED person .
VHRED person , mr . person . person , i ’ m not going to tell you , person . i ’ m sorry . i ’ m sorry , person . i ’ m sorry
hredGAN_w person , i know you ’ re see . continued_utterance person , please .
aBoots_w_cat person , i ’ m a little worried about my future . i ’ m not here to marry you .
Context_0 where would they be going ?
Context_1 we ’ ve got two choices . one of them is wrong .
HRED i ’ m sorry , person .
VHRED i know .
hredGAN_w person , i know , it ’ s a real hair . with you a little little little , little little little little somethin
aBoots_w_cat i don ’ t care where they are , i don ’ t know anyone will be outside .
Context_0 DJones: update manager won’t detect 12.04.1 as a new version if you are already running 12.04, because 12.04.1 = 12.04 + lots of package updates
Context_1 DJones I am running 10.04.
Context_2 what does ’gksu update-manager c’ report?
HRED I don’t know what the problem is
VHRED you mean ?
hredGAN_w yes , I get the new one , I just want to get the new one
aBoots_w_cat it says that command not found
Context_0 libgnome2-dev is the package that has that pkgconfig file
Context_1 After I installed libgnome2-dev I got a different output. sorry it is so long.
Context_2 I thought that was what was going on thanks. :) I ran the command ./configure && sudo make && sudo make install would that have worked.
HRED I don’t know what the problem is
VHRED what do you mean ? __eou__ what are you trying ?
hredGAN_w I have no idea , I can just copy the files and copy files to the other one
aBoots_w_cat yes I did . I just did sudo apt-get update and it worked
Table 4: Sample responses of DAIM, (V)HRED, hredGAN_w, and aBoots_w_cat.

6.3 Ablation Studies

In this section, we examine the effect of partial bootstrapping on model performance. Here, the target in (5) is excluded from the discriminator. The automatic evaluation results for all the variants of aBoots with generator-only bootstrapping (aBoots_g) are reported in Table 3. The table shows that generator models bootstrapped by a discriminator that is not itself bootstrapped generally perform worse than ones with a bootstrapped discriminator. This improvement is particularly evident in the best performing variant, aBoots_w_cat. We attribute this performance improvement to the better calibration of the discriminator obtained from bootstrapping the discriminator output with the similarity measure between the generator's autoregressive output and the ground truth during training.

7 Conclusion

We have proposed a novel training technique, adversarial bootstrapping, which is useful for dialogue modeling. The method addresses the issues of data-induced redundancy and exposure bias in dialogue models trained with maximum likelihood. This is achieved by bootstrapping the teacher-forcing MLE objective with feedback on autoregressive outputs from an adversarially trained discriminator. This feedback discourages the generator from producing bland and generic responses that are characteristic of MLE training. Experimental results indicate that a doubly bootstrapped system produces better performance than a system where only the generator is bootstrapped. Also, the model variant characterized by choosing top_k categorical sampling, stochastic policy optimization, and word-level discrimination gives the best performance. The results demonstrate that the proposed method leads to models generating more relevant and diverse responses in comparison to existing methods.