Learning a Simple and Effective Model for Multi-turn Response Generation with Auxiliary Tasks

04/04/2020 ∙ by Yufan Zhao, et al. ∙ Microsoft

We study multi-turn response generation for open-domain dialogues. The existing state-of-the-art addresses the problem with deep neural architectures. While these models improved response quality, their complexity also hinders the application of the models in real systems. In this work, we pursue a model that has a simple structure yet can effectively leverage conversation contexts for response generation. To this end, we propose four auxiliary tasks including word order recovery, utterance order recovery, masked word recovery, and masked utterance recovery, and optimize the objectives of these tasks together with maximizing the likelihood of generation. By this means, the auxiliary tasks that relate to context understanding can guide the learning of the generation model to achieve a better local optimum. Empirical studies with three benchmarks indicate that our model can significantly outperform state-of-the-art generation models in terms of response quality on both automatic evaluation and human judgment, and at the same time enjoys a much faster decoding process.




1 Introduction

As an important topic in conversational AI, open-domain human-machine conversation is gaining increasing attention from both academia and industry. A common approach to building such a system is to learn a response generation model within an encoder-decoder framework using neural sequence architectures Sutskever et al. (2014); Vaswani et al. (2017). While the encoder-decoder framework has been successfully applied in various text generation tasks such as machine translation Vaswani et al. (2017), summarization Rush et al. (2015), paraphrase generation Dong et al. (2017), etc., it has to deal with a unique challenge in the task of response generation: modeling conversation contexts. A conversation context often exhibits a hierarchical structure with dependencies at both the word level and the utterance level. Moreover, as indicated in Xing et al. (2018); Zhang et al. (2019), information in a context is rather redundant for responding: commonly only a few words and utterances are useful for response generation, and the positions of the relevant words and utterances vary from case to case. To model the hierarchy of conversation contexts, hierarchical recurrent encoder-decoder (HRED) Serban et al. (2016) extends the vanilla sequence-to-sequence model with a word-level encoder and an utterance-level encoder. Later on, a hierarchical recurrent attention network (HRAN) Xing et al. (2018) equips the decoder of the HRED model with word-level attention and utterance-level attention to dynamically highlight the effect of relevant words and utterances in response synthesis. Very recently, ReCoSa Zhang et al. (2019) further exploits multi-layer multi-head self-attention to model long-term dependency among utterances and responses. (Footnote 1: The fact that both the encoder and the decoder of ReCoSa contain multiple layers is not highlighted in the paper, but is revealed by the source code released by the authors at https://github.com/zhanghainan/ReCoSa.) From HRED to HRAN, and then to ReCoSa, the performance of the models in terms of response quality becomes better and better Zhang et al. (2019), but the models also grow more and more complicated. For example, the number of parameters in ReCoSa is more than twice that in HRED. Thus, while we enjoy the improved performance from the increased complexity, the complexity may also impede the application of the models in some scenarios (e.g., in a mobile scenario).

In this work, we study multi-turn response generation and target a model that has a simple structure yet can make use of conversation contexts as well as the existing deep models. The key idea is to transfer the burden of context understanding from modeling to learning by designing several auxiliary tasks, and to leverage the auxiliary tasks as regularization in model estimation. Specifically, the model we use for response generation concatenates utterances in a conversation context as a long sequence, and only exploits one-layer self-attention in encoding and one-layer context attention in decoding. In such a frugal setting, the representation capability of the model shrinks a lot compared with deep transformers. As a remedy, we augment the maximum likelihood estimation (MLE) in learning with objectives from four auxiliary tasks: word order recovery, utterance order recovery, masked word recovery, and masked utterance recovery. In the first two tasks, we predict the correct order of words and utterances from a random shuffle of words in an utterance and a random shuffle of utterances in a context respectively. The goal of the two tasks is to enhance understanding of the sequential dependency among words and utterances within a context. The other two tasks are inspired by the recent breakthrough from BERT Devlin et al. (2019): we randomly mask a word in an utterance and an utterance in a context respectively, and predict the masked word and the masked utterance using the remaining words and utterances. The two tasks may encourage the learning process to pay more attention to semantics of words and utterances in their contexts, and help the learning process find better representations of words and utterances for the generation model. The auxiliary tasks and the MLE task share the encoder of the generation model. Through learning with multiple tasks, optimization for response generation and optimization for context understanding are performed jointly. The context understanding related tasks can guide the MLE to achieve a better local optimum, and thus realize superior performance in response generation with a simple neural structure.

We test the proposed approach with three benchmarks including the Ubuntu Dialogue Corpus Lowe et al. (2015), DailyDialog Li et al. (2017), and PERSONA-CHAT Zhang et al. (2018). Evaluation results on all three datasets indicate that our model can significantly outperform state-of-the-art generation models in terms of both automatic evaluation and human judgment. Moreover, with a parameter set even smaller than HRED, our model is 2x faster than ReCoSa in response decoding.

Our contributions in the paper are three-fold: (1) proposal of balancing model complexity and model capability in multi-turn response generation; (2) proposal of four auxiliary learning tasks that transfer context understanding from modeling to learning; and (3) empirical verification of the effectiveness and the efficiency of the proposed model on three benchmarks.

2 Related Work

End-to-end open-domain dialogue generation is built upon the encoder-decoder architecture Shang et al. (2015); Vinyals and Le (2015), and the vanilla sequence-to-sequence structure has been widely extended to address challenges such as generic responses Li et al. (2015); Xing et al. (2017), context modeling Serban et al. (2016, 2017); Xing et al. (2018); Zhang et al. (2019), and grounding by persona/emotion/knowledge Li et al. (2016); Zhang et al. (2018); Zhou et al. (2018); Dinan et al. (2018). In this work, we study how to leverage conversation context for multi-turn response generation, which represents a fundamental problem in dialogue generation. Different from the existing work that enhances the representation capability of models through neural architecture engineering, we turn to an orthogonal direction: we keep the generation model simple, and optimize the simple structure by learning with auxiliary tasks that encode context understanding. As a result, our model can provide high-quality responses at a low cost. Prior to our work, there have been a few studies on learning a primary task with auxiliary ones Rei and Yannakoudakis (2017); Yu and Jiang (2016); Ding et al. (2017); Trinh et al. (2018); Mehri et al. (2019); Wu et al. (2019). Our work is unique in that, through extensive empirical studies, we verify that a simple structure learned with auxiliary tasks can work as well as deep architectures in dialogue generation.

3 Approach

We first formalize the problem in question, and then detail the model and the learning tasks.

3.1 Problem Formalization

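Suppose that we have a dataset $\mathcal{D} = \{(U_i, Y_i)\}_{i=1}^{N}$, where $U_i = (u_{i,1}, \ldots, u_{i,n})$ denotes a context with $u_{i,j}$ the $j$-th utterance, and $Y_i$ is a response to $U_i$. The goal is to estimate a generation probability distribution $P(Y \mid U)$ from $\mathcal{D}$, and thus, given a new context $U$, one can generate a response $Y$ for $U$ following $P(Y \mid U)$. A common practice is to learn $P(Y \mid U)$ by maximizing the log-likelihood of $\mathcal{D}$ (i.e., MLE), which can be formulated as

$$\mathcal{L}_{MLE} = \sum_{i=1}^{N} \log P(Y_i \mid U_i). \tag{1}$$
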
When the generation model has a simple structure, learning with MLE alone could be insufficient to obtain a model that can well capture the syntax and the semantics of contexts. Evidence for this is that simple architectures like HRED are much worse than complicated architectures like ReCoSa in terms of response quality, as reported by the existing work Zhang et al. (2019). Since a simple structure is still favored, we consider aiding the objective given by Equation (1) with extra ones that can reinforce context understanding in the learning process.

3.2 Generation Model

Figure 1 illustrates the architecture of the generation model. In a nutshell, the model is in a transformer-based structure Vaswani et al. (2017) with one attentive layer (in the transformer layer) in the encoder and one attentive layer in the decoder. The auxiliary tasks, which will be presented later, share the encoder with the generation model. We prefer a transformer-based structure instead of a recurrent structure, because the former is easier to parallelize than the latter, and thus can further enhance efficiency of the model in an online system.


we unfold all words in into , where is the number of words in context , and is the number of words in response . , is represented by a summation of word embedding, position embedding, and segment embedding:


where represents the word embedding of initialized using GloVe Pennington et al. (2014), is the position embedding of which is defined by , where

is a one-hot vector with the only non-zero entry indicating the position of

in , and is a randomly initialized matrix with an upper bound of the number of words in a dialogue. is the segment embedding of defined similarly with the one-hot vector indicating the position of the utterance that contains . The embedding matrix is then fed to a transformer layer, which can be formulated as



is a feed-forward neural network and

is a multi-head attention function with a query, a key, and a value. To control the receptive field of self-attention in different tasks, we add a mask matrix Dong et al. (2019) in attention computation, and let determine whether a pair of words can attend to each other according to the learning tasks. Thus, is defined by


where refers to a concatenation operation, and is given by



suppose that are words generated until step , then the next word is predicted according to:


where is defined by with the output of the encoder, and is a trainable parameter.
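The role of the mask matrix in the attention computation can be illustrated with a minimal sketch. It assumes the usual additive-mask convention, where an entry of 0 lets a pair of positions attend to each other and negative infinity blocks them, which is consistent with the all-zero mask meaning full visibility. The function names and the single-head, pure-Python form are illustrative choices, not the authors' implementation.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    # Assumes at least one entry per row is finite (each position sees itself).
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with an additive mask.

    mask[i][j] == 0.0 lets position i attend to position j;
    mask[i][j] == NEG_INF blocks it.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + mask[i][j]
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Toy input: three positions with two-dimensional representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Position 0 may only attend to itself; the other positions see everything.
blocked = [[0.0, NEG_INF, NEG_INF],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
out = masked_attention(x, x, x, blocked)
```

Because position 0 is blocked from every other position, its output equals its own value vector, while the unblocked positions receive a weighted mixture of all three.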

Figure 1: Architecture of the generation model.
Figure 2: Auxiliary tasks.

3.3 Auxiliary Tasks

To learn a simple structure that can effectively make use of contexts for response generation, we design two kinds of auxiliary tasks: order recovery and masked content recovery. The order recovery tasks aim to enhance the capability of the self-attention module to capture the sequential relationship among words and utterances, while the masked content recovery tasks optimize the self-attention module to strengthen semantic connections among words and utterances.

Order recovery: a recent study Sankar et al. (2019) indicates that transformer-based models are insensitive to ordering of words and utterances, which means that the information they learn could be just bag-of-words representations. Thus, we consider recovering the correct order from random shuffling on both a word level and an utterance level to force self-attention to be aware of relative positions of words and utterances in the context.

Word order recovery: Figure 2 (a) illustrates the task. Given a randomly sampled utterance from a context , we randomly shuffle the words in and obtain a disordered utterance . Then, we replace in with and form a corrupt context . The goal of the task is to predict from . The loss of the task can be formulated as


where is obtained from which is the representation of given by the encoder of the generation model, is shared with Equation (6).
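The construction of a training instance for word order recovery can be sketched as follows. This is a minimal illustration of the corruption step only (sample one utterance, shuffle its words, and substitute it back into the context); the function name and the list-of-token-lists data layout are assumptions made for the example.

```python
import random

def corrupt_word_order(context, rng):
    """Build a word-order-recovery instance.

    context: list of utterances, each a list of tokens.
    Returns the corrupted context, the index of the sampled utterance,
    and the original (correctly ordered) utterance as the prediction target.
    """
    idx = rng.randrange(len(context))
    original = context[idx]
    shuffled = original[:]
    rng.shuffle(shuffled)
    corrupted = [utt[:] for utt in context]  # copy so the input is untouched
    corrupted[idx] = shuffled
    return corrupted, idx, original

rng = random.Random(0)
context = [
    ["how", "do", "i", "install", "the", "driver"],
    ["which", "version", "are", "you", "on"],
    ["the", "latest", "one"],
]
corrupted, idx, target = corrupt_word_order(context, rng)
```

Only the sampled utterance is disturbed; the remaining utterances stay in their original order so that they can serve as evidence for recovering the target.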

For this task, the mask matrix in Equation (4) is defined by:


Utterance order recovery: Figure 2 (d) illustrates the task. Given context = , we randomly shuffle the utterances and obtain a disordered context = . The goal is to predict the correct positions for utterances in . The prediction model falls in a read-process-write framework Vinyals et al. (2015). In the reading module, the model first represents as via the encoder of the generation model, where is the -th word in utterance (words within an utterance are ordered), and then obtains the representation of utterance through


where is the number of words in . forms a sentence memory that is accessible by the processing module. The processing module exploits multi-head self-attention and GRU to guarantee the property that vectors retrieved from memory will not change if the memory is randomly shuffled. Formally, the processing module is defined by


where the last hidden state is permutation invariant regarding to input. The writing module is another GRU that decodes one by one. At step , the hidden state is defined by


where is the hidden state at step with , is the embedding of (i.e., the embedding of the ground-truth position of in ), and is a context vector which is defined via attention over :


where , , , and are parameters. The prediction model is finally formulated as


The loss function of the task is defined by


For this task and the following ones, the mask matrix in Equation (4) is defined as a zero matrix, meaning that every pair of words can attend to each other in the context.
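The input and target of utterance order recovery can be sketched in a few lines. The sketch covers only the data side (shuffling utterances and recording the ground-truth position of each one), not the read-process-write prediction model; the function name and data layout are illustrative assumptions.

```python
import random

def shuffle_utterances(context, rng):
    """Build an utterance-order-recovery instance.

    Returns the disordered context and, for each utterance in it,
    its ground-truth position in the original context.
    """
    positions = list(range(len(context)))
    rng.shuffle(positions)
    disordered = [context[p] for p in positions]
    return disordered, positions

rng = random.Random(0)
context = [["hi"], ["hello", "there"], ["how", "are", "you"], ["fine"]]
disordered, positions = shuffle_utterances(context, rng)

# Placing every utterance back at its ground-truth position
# recovers the original context.
restored = [None] * len(context)
for utt, pos in zip(disordered, positions):
    restored[pos] = utt
```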

Masked content recovery: a major challenge in context understanding is the information omission problem (e.g., coreferences) that widely exists in utterances Su et al. (2019). The challenge requires a model to connect semantically related words and utterances. Thus, we design masked content recovery tasks on both a word level and an utterance level to enhance the self-attention module in terms of awareness of the semantic connections.

  • Word level: for each utterance in a context, we randomly replace 15% words with a special token [MASK].

  • Utterance level: we randomly pick an utterance from a context, and replace all words in the utterance with a special token [MASK].

Figure 2 (b) and Figure 2 (c) illustrate the task of masked word recovery (mwr) and the task of masked utterance recovery (mur) respectively. Since the only difference of the two tasks is the input, we present them in a uniform way. Given a context , suppose that the masked context is , where if is masked, otherwise , then, the loss of the tasks can be formulated as


where is the representation of obtained by passing through the encoder of the generation model, indexes the two tasks, is an indicator function, and is shared with Equation (6).
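The two masking schemes can be sketched as below. How the 15% ratio is rounded for short utterances is not stated in the text, so the sketch assumes at least one masked word per non-empty utterance; the function names and the choice of a dictionary for the targets are likewise illustrative.

```python
import random

MASK = "[MASK]"

def mask_words(context, rng, ratio=0.15):
    """Word level: replace about `ratio` of the words in each utterance.

    Assumption: at least one word is masked in every non-empty utterance.
    Returns the masked context and a dict {(utt_idx, word_idx): original word}.
    """
    masked, targets = [], {}
    for u, utt in enumerate(context):
        n = min(len(utt), max(1, round(len(utt) * ratio)))
        chosen = set(rng.sample(range(len(utt)), n))
        new_utt = []
        for w, word in enumerate(utt):
            if w in chosen:
                targets[(u, w)] = word
                new_utt.append(MASK)
            else:
                new_utt.append(word)
        masked.append(new_utt)
    return masked, targets

def mask_utterance(context, rng):
    """Utterance level: replace every word of one random utterance."""
    idx = rng.randrange(len(context))
    masked = [utt[:] for utt in context]
    masked[idx] = [MASK] * len(context[idx])
    return masked, idx

rng = random.Random(0)
context = [
    ["i", "upgraded", "to", "the", "new", "release", "yesterday"],
    ["did", "it", "work"],
]
word_masked, word_targets = mask_words(context, rng)
utt_masked, utt_idx = mask_utterance(context, rng)
```

In both cases the loss would be computed only at the masked positions, which is what the indicator function in the loss above selects.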

3.4 Learning Objective

The full loss function is finally defined by:


where is a hyper-parameter as a trade-off between MLE and the objectives of the auxiliary tasks. The learning algorithm is summarized in Algorithm 1, where refers to a set of parameters including both the parameters of the generation model and the parameters of the auxiliary objectives.
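As a rough illustration of how a single trade-off hyper-parameter can combine the objectives, the sketch below assumes the common form of a weighted sum: the MLE loss (negative log-likelihood) plus the trade-off weight times the sum of the four auxiliary losses. The exact form of the paper's objective and any schedule for the weight are not recoverable from the text here, so both the formula and the names `full_loss` and `alpha` are assumptions.

```python
def full_loss(mle_loss, aux_losses, alpha):
    """Assumed form: MLE loss plus alpha times the summed auxiliary losses.

    mle_loss:   negative log-likelihood of the response (to be minimized).
    aux_losses: dict with the four auxiliary task losses.
    alpha:      trade-off weight between MLE and the auxiliary objectives.
    """
    return mle_loss + alpha * sum(aux_losses.values())

aux = {
    "word_order": 1.0,
    "utterance_order": 1.0,
    "masked_word": 0.5,
    "masked_utterance": 0.5,
}
loss = full_loss(2.0, aux, alpha=0.5)
```

Setting `alpha` to zero reduces the objective to plain MLE, which corresponds to the "- all tasks" setting of the ablation study.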

Input: Training data , GlobalMaxStep , AuxTrainEpoch , InitialRate , BatchNumPerEpoch
4 while   do
5        Randomly sample a mini-batch from .
6        if  0 then
7               Compute .
8        Compute MLE.
9        Update the parameters of the model with respect to using Adagrad.
Algorithm 1 Optimization Algorithm

4 Experiments

We conduct experiments on DailyDialog Li et al. (2017), PERSONA-CHAT Zhang et al. (2018), and the Ubuntu Dialogue Corpus (UDC) Lowe et al. (2015), and compare our model with state-of-the-art baselines in terms of response quality, parameter size, and decoding speed.

4.1 Datasets

Both DailyDialog and PERSONA-CHAT are open domain datasets. Dialogues in DailyDialog cover a wide range of topics in daily scenarios and resemble human communications in their daily life; while PERSONA-CHAT contains multi-turn chit-chat conversations between turkers according to their assigned profiles. Since the focus of the work is how to leverage conversation history for response generation, we just append the profiles (the original ones) to the corresponding dialogues as an extension of contexts. To control the length of the dialogues and increase the number of instances, we slide a window on the training/validation/test dialogues in both datasets, and split a dialogue longer than utterances to multiple instances (i.e., the window size is ). Moreover, we also truncate long utterances with the first words kept. Vocabularies are formed with all words appearing in the entire data and are shared by contexts and responses. The vocabulary size of DailyDialog is and the vocabulary size of PERSONA-CHAT is . The UDC data are collected from Ubuntu chat logs with two-person multi-turn conversations about Ubuntu-related problems. Here we use the same data as in Zhang et al. (2019). Table 1 reports some statistics of the three datasets.

                             DailyDialog  PERSONA-CHAT  Ubuntu
# dialogues for training     44,050       95,682        3,980,000
# dialogues for validation   4,176        11,602        10,000
# dialogues for test         3,864        11,152        10,000
avg. # utter. per dialogue   7.0          9.4           4.3
avg. utter. length           13.6         14.5          16.6
Table 1: Statistics of the datasets.
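The sliding-window split described in Section 4.1 can be sketched as follows. The window size and the truncation length are placeholders (`window` and `max_len` are hypothetical parameters), and the sketch assumes a stride of one with the last utterance in each window serving as the response.

```python
def sliding_instances(dialogue, window, max_len):
    """Split one dialogue into (context, response) instances.

    dialogue: list of utterances, each a list of tokens (at least two).
    window:   maximum number of utterances per instance (hypothetical value).
    max_len:  utterances longer than this keep only their first max_len words.
    Assumptions: stride of one; the last utterance of a window is the response.
    """
    truncated = [utt[:max_len] for utt in dialogue]
    if len(truncated) <= window:
        return [(truncated[:-1], truncated[-1])]
    instances = []
    for start in range(len(truncated) - window + 1):
        chunk = truncated[start:start + window]
        instances.append((chunk[:-1], chunk[-1]))
    return instances

dialogue = [["a"], ["b"], ["c"], ["d"], ["e", "f", "g"]]
instances = sliding_instances(dialogue, window=3, max_len=2)
```

With five utterances and a window of three, the dialogue yields three instances, and the last response is truncated to its first two words.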

4.2 Baselines

We select several multi-turn response generation models as baselines: (1) HRED (https://github.com/hsgodhia/hred): hierarchical encoder-decoder proposed in Serban et al. (2016); (2) VHRED (https://github.com/julianser/hed-dlg-truncated): an extension of HRED that factorizes response generation with latent variables Serban et al. (2017); (3) HRAN (https://github.com/LynetteXing1991/HRAN): hierarchical encoder-decoder equipped with a hierarchical attention mechanism Xing et al. (2018); (4) ReCoSa (https://github.com/zhanghainan/ReCoSa): a hierarchical transformer-based model that exhibits state-of-the-art performance on benchmarks Zhang et al. (2019); and (5) SSN: a very recent study on enhancing dialogue generation learning with self-supervision signals extracted from utterance order Wu et al. (2019).

4.3 Implementation Details

We train the baselines and our model on RTX 2080, and initialize word embedding with GloVe vectors Pennington et al. (2014). In our model, the dimension of all vectors is set as . The number of heads in multi-head attention is set as . We adopt the Adagrad algorithm Duchi et al. (2011) in optimization with a learning rate and a batch size // in DailyDialog/PERSONA-CHAT/Ubuntu. All models are tuned on the validation sets according to perplexity. We stop training if the perplexity does not drop in three consecutive epochs. The GlobalMaxStep is set as 50k. The AuxTrainEpoch is set as 30. The BatchNumPerEpoch N is // for DailyDialog/PERSONA-CHAT/Ubuntu.

Dataset       Model      PPL     BLEU   Distinct-1  Distinct-2  Average  Greedy  Extrema  Parameter size  Decoding speed
DailyDialog   HRED       56.22   0.535  1.553       3.569       81.393   65.546  48.109   34.5M           14.79ms
DailyDialog   HRAN       47.23   0.447  1.953       7.400       83.460   67.239  49.599   38.2M           17.15ms
DailyDialog   VHRED      44.79   0.997  1.299       6.113       83.866   67.186  48.570   34.8M           15.67ms
DailyDialog   SSN        44.28   1.250  2.309       7.266       72.796   73.069  44.260   20.0M           12.69ms
DailyDialog   ReCoSa     42.34   1.121  1.987       10.180      84.763   67.557  48.957   73.8M           40.89ms
DailyDialog   Our Model  38.60   1.658  3.457       14.954      85.224   69.518  49.069   20.3M/14.4M     12.15ms
PERSONA-CHAT  HRED       46.04   1.279  0.164       0.450       83.329   64.486  47.132   28.3M           13.14ms
PERSONA-CHAT  HRAN       41.94   1.997  0.235       0.771       82.850   65.556  47.882   33.1M           18.43ms
PERSONA-CHAT  VHRED      42.07   2.181  0.312       1.915       82.995   65.578  46.810   28.8M           20.27ms
PERSONA-CHAT  SSN        47.90   2.288  0.637       2.623       85.002   66.752  47.461   15.2M           15.82ms
PERSONA-CHAT  ReCoSa     34.19   2.258  0.915       4.217       83.963   66.498  48.163   68.7M           39.38ms
PERSONA-CHAT  Our Model  33.23   2.434  1.279       5.816       83.632   66.778  48.552   18.4M/12.5M     13.89ms
Ubuntu        HRED       117.53  0.624  0.951       3.007       74.671   61.725  39.587   24.1M           25.09ms
Ubuntu        HRAN       107.43  0.646  1.210       2.136       75.214   59.963  40.985   29.5M           31.07ms
Ubuntu        VHRED      177.12  0.579  1.898       3.178       78.755   63.978  40.072   24.7M           30.47ms
Ubuntu        SSN        101.09  2.044  1.304       3.984       78.039   60.556  37.595   12.3M           21.11ms
Ubuntu        ReCoSa     95.33   1.753  1.526       3.536       77.751   62.013  40.607   60.6M           45.34ms
Ubuntu        Our Model  78.83   1.926  1.931       5.573       79.358   63.524  38.812   14.4M/8.5M      22.98ms
Table 2: Evaluation results on automatic metrics. Numbers in bold indicate the best performing model on the corresponding metrics.

4.4 Evaluation Metrics

We evaluate the performance of the models in terms of response quality with both automatic metrics and human judgment. In automatic evaluation, besides BLEU-4 Papineni et al. (2002) and perplexity Sutskever et al. (2014), we follow Serban et al. (2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. We also follow Li et al. (2015) and measure the informativeness of responses with distinct-1 and distinct-2 that are calculated as the ratios of distinct unigrams and bigrams. In human evaluation, we randomly sample dialogues from each of the three test sets, and recruit native speakers as human annotators. For each context in the dialogues, each annotator compares a response from our model and a response from a baseline model. The two responses are top one results from greedy search, and are randomly shuffled to hide their sources. The annotators judge which response is better based on informativeness, consistency, and fluency of the responses. If an annotator cannot tell which response is better, he/she is required to label a “tie”. Each annotator individually judges pairs for all combinations of our model and baseline models. In total, each one labels pairs for one dataset. Fleiss’ kappa Fleiss and Cohen (1973) is employed to measure agreement among the annotators.
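The distinct-1 and distinct-2 metrics mentioned above are ratios of distinct n-grams to all generated n-grams. A minimal sketch is given below; it assumes the ratio is computed over the whole set of generated responses rather than per response, and the function name is illustrative.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over a set of responses.

    responses: list of tokenized responses (lists of tokens).
    Assumption: counts are pooled over all responses (corpus level).
    """
    total = 0
    unique = set()
    for tokens in responses:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

responses = [["i", "am", "fine"], ["i", "am", "here"]]
d1 = distinct_n(responses, 1)  # 4 distinct unigrams out of 6
d2 = distinct_n(responses, 2)  # 3 distinct bigrams out of 4
```

A model that repeats generic responses produces few distinct n-grams relative to the total, so higher values indicate more diverse output.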

In addition to response quality, we also compare our model with baselines on decoding speed. We calculate the average prediction time per word in response generation using all dialogues in the test sets. The efficiency comparison is conducted on a GPU environment with a single RTX 2080.

4.5 Evaluation Results

Table 2 reports evaluation results on automatic metrics. Our model outperforms all baseline methods on most of the metrics on all three datasets. The last two columns of the table compare different models in terms of parameter size and decoding speed. Note that in training, the auxiliary tasks contain parameters outside the generation model. Therefore, in the column of parameter size, we report two numbers for our model: the one before “/” is the parameter size in training, and the one after “/” is the parameter size of the generation model. It is remarkable that the parameter size of our model, even in training, is smaller than that of HRED. In spite of this, the model still outperforms ReCoSa with only // parameters on the DailyDialog/PERSONA-CHAT/Ubuntu data. This is because (1) the auxiliary tasks can effectively aid the learning of the generation model in our method; and (2) ReCoSa, although in a deep structure, is still inadequate in terms of context modeling due to the RNN-based encoder and the only utterance-level attention. Besides the superior performance on response quality, our model also enjoys a fast decoding process, thanks to the small model size. In terms of decoding speed, our model is comparable with HRED, and 2x faster than ReCoSa. The generation model of SSN is just a simple RNN sequence-to-sequence model with a one-layer encoder and a one-layer decoder. Therefore, our model is comparable with SSN in terms of complexity and speed. However, SSN is worse than our model on response quality because (1) the RNN-based seq2seq model in SSN is worse than a transformer-based structure on the benchmarks used in this work, as indicated by Sankar et al. (2019); and (2) SSN only considers utterance order, while we also leverage word order, word content, and utterance content in learning. In fact, we find that the proposed auxiliary tasks can improve a 2-layer (one for encoder and one for decoder) RNN-based seq2seq model as well, as reported in Supplementary Material. On most metrics, RNN with full auxiliary tasks is better than SSN but worse than the proposed model.

Table 3 summarizes human evaluation results. We can see that our model outperforms all baseline models, and most of the kappa values exceed indicating substantial agreement among the annotators. Based on the annotation results, we find that our model tends to generate diverse and context consistent responses, indicating the effect of the auxiliary tasks.

4.6 Discussions

To further understand the merit of the auxiliary tasks, we analyze the following questions: Q1 how does the simple architecture learned with the auxiliary tasks compare with a deep architecture; Q2 can learning with the auxiliary tasks also improve deep architectures; and Q3 how do different auxiliary tasks affect the performance of the model.

Figure 3: Performance of deep architectures. (a) DailyDialog; (b) PERSONA-CHAT; (c) Ubuntu

Answer to Q1: we aim to move one step further to understand how the auxiliary tasks enhance the capability of the simple generation model on context understanding. While this is not trivial for neural models, we assume that one can let a transformer-based model capture more semantics in contexts by stacking more layers in the encoder, and examine to what extent the simple model learned with the auxiliary tasks is equivalent to a deep architecture. Figure 3 compares our model with deep architectures in terms of perplexity on the three datasets, in which we get the deep architectures by stacking transformer layers in the encoder of our model. The dotted lines represent our model learned with the auxiliary tasks, and the solid lines represent the deep architectures learned with MLE. Approximately, our model is equivalent to a deep model with a 4-layer encoder on the DailyDialog data, a 6-layer encoder on the PERSONA-CHAT data, and a 3-layer encoder on the Ubuntu data.

models                 win (%)  loss (%)  tie (%)  kappa
Our Model vs. HRED     42.57    13.16     44.27    0.675
Our Model vs. VHRED    38.14    19.38     42.48    0.634
Our Model vs. HRAN     31.69    16.29     52.02    0.587
Our Model vs. SSN      35.43    22.85     41.72    0.638
Our Model vs. ReCoSa   34.60    22.15     43.25    0.733

models                 win (%)  loss (%)  tie (%)  kappa
Our Model vs. HRED     45.73    15.99     38.28    0.867
Our Model vs. VHRED    39.13    20.25     40.62    0.650
Our Model vs. HRAN     36.49    23.17     40.34    0.621
Our Model vs. SSN      49.45    12.79     37.76    0.695
Our Model vs. ReCoSa   38.06    28.95     32.99    0.566

models                 win (%)  loss (%)  tie (%)  kappa
Our Model vs. HRED     47.16    12.99     39.85    0.792
Our Model vs. VHRED    46.29    11.62     42.09    0.603
Our Model vs. HRAN     42.04    10.60     47.36    0.579
Our Model vs. SSN      44.24    17.38     38.38    0.527
Our Model vs. ReCoSa   38.52    29.54     31.94    0.644
Table 3: Human evaluation results. The ratios are calculated by combining annotations from three judges together.

Answer to Q2: since the auxiliary tasks are useful for the simple model, it is also interesting to check if they work as well for deep architectures. Figure 3 shows the results, in which the dash-dotted lines represent the deep architectures learned with the full auxiliary tasks. First of all, we can conclude that the auxiliary tasks are also useful for deep architectures, since there is a clear PPL drop between the same models learned without (i.e., the solid lines) and with the auxiliary tasks. Secondly, the auxiliary tasks are more useful for simple structures, since the gap between the same models learned with and without the tasks becomes smaller and smaller as the number of encoding layers increases. The results indicate that after stacking enough layers, the effect of the auxiliary tasks is overwhelmed by the model itself. Therefore, the merit of the auxiliary tasks is to allow us to learn a generation model that enjoys both efficacy and efficiency, which is exactly the goal of the work. Improvement with respect to the number of layers of the encoder on UDC is steadier than that on DailyDialog and PERSONA-CHAT. This is because the training set of UDC is much larger than those of the other two datasets.

dataset       model variant                PPL     BLEU   distinct-1  distinct-2  Average  Greedy  Extrema
DailyDialog   full tasks                   38.60   1.658  3.457       14.954      85.224   69.518  49.069
DailyDialog   - masked word recovery       38.37   1.365  2.629       11.135      85.270   69.901  49.495
DailyDialog   - masked utterance recovery  39.06   1.407  2.980       12.544      85.143   69.667  49.791
DailyDialog   - word order recovery        41.53   1.082  2.769       11.166      85.020   69.417  49.567
DailyDialog   - utterance order recovery   38.69   1.215  2.551       9.764       85.253   69.678  49.644
DailyDialog   - all tasks                  46.58   0.903  1.775       7.136       84.042   69.017  48.467
PERSONA-CHAT  full tasks                   33.23   2.434  1.279       5.816       83.632   66.778  48.552
PERSONA-CHAT  - masked word recovery       34.74   2.429  1.018       4.764       82.841   66.177  48.610
PERSONA-CHAT  - masked utterance recovery  33.49   2.638  1.045       5.412       83.402   66.862  48.810
PERSONA-CHAT  - word order recovery        35.06   2.355  1.028       4.698       82.503   66.011  48.350
PERSONA-CHAT  - utterance order recovery   33.24   2.484  1.054       5.011       82.652   66.025  47.927
PERSONA-CHAT  - all tasks                  37.16   1.928  0.938       4.141       82.104   65.899  47.162
Ubuntu        full tasks                   78.83   1.926  1.931       5.573       79.35    63.52   38.81
Ubuntu        - masked word recovery       80.04   1.792  1.334       4.684       79.62    63.16   38.78
Ubuntu        - masked utterance recovery  86.56   1.492  1.226       4.427       78.53    63.67   38.78
Ubuntu        - word order recovery        88.93   1.594  1.068       4.468       78.49    62.42   38.27
Ubuntu        - utterance order recovery   81.92   1.484  1.341       5.029       78.11    63.36   38.55
Ubuntu        - all tasks                  105.47  1.074  0.753       2.473       78.62    62.58   37.46
Table 4: Results of ablation study.

Answer to Q3: we keep the architecture of the generation model and remove the objectives of the auxiliary tasks one at a time from the full learning objective given by Equation (16). Table 4 reports the ablation results. First of all, all auxiliary tasks are useful as removing any of them will cause a performance drop. When all auxiliary tasks are removed, the approach degenerates to learning a 2-layer transformer architecture through MLE. Without any optimization on context understanding, the simple structure is worse than ReCoSa. Secondly, on DailyDialog and UDC, order recovery tasks are more crucial than content recovery tasks due to the order insensitive nature of self-attention. Finally, on PERSONA-CHAT, word-level recovery tasks matter more than utterance-level recovery tasks. This might stem from the fact that in PERSONA-CHAT, dialogues highly depend on the profiles used as contexts. In many cases, utterances are just formed by copying a proportion of words from the profiles. Thus, recognizing the semantic connections and the relationship among words in contexts is more critical for the data.

5 Conclusions

We propose a simple generation model with order recovery and masked content recovery as auxiliary tasks. Evaluation results on three benchmarks indicate that our model can significantly outperform state-of-the-art deep generation models in terms of both response quality and decoding speed.