Diversifying Topic-Coherent Response Generation for Natural Multi-turn Conversations

10/24/2019 ∙ by Fei Hu, et al. ∙ 17

Although response generation (RG) diversification for single-turn dialogs has been well developed, it is less investigated for natural multi-turn conversations. Moreover, past work focused on diversifying responses without considering topic coherence with the context, producing uninformative replies. In this paper, we propose the Topic-coherent Hierarchical Recurrent Encoder-Decoder model (THRED) to diversify the generated responses without deviating from the contextual topics of multi-turn conversations. Overall, we build a sequence-to-sequence network (Seq2Seq) to model multi-turn conversations. We then resort to the latent Variable Hierarchical Recurrent Encoder-Decoder model (VHRED) to learn the global contextual distribution of dialogs. In addition, we construct a dense topic matrix that captures word-level correlations in the conversation corpora. The topic matrix is used to learn the local topic distribution of the contextual utterances. By incorporating both the global contextual distribution and the local topic distribution, THRED produces replies that are both diversified and topic-coherent. We also propose an explicit metric (TopicDiv) to measure the topic divergence between the post and the generated response, and an overall metric that combines the diversification metric (Distinct) with TopicDiv. We evaluate our model against three baselines (Seq2Seq, HRED and VHRED) on two real-world corpora and demonstrate its strong performance in both diversification and topic coherence.




I Introduction

Response generation (RG) has been playing an increasingly important role in Natural Language Generation (NLG) as it moves closer to industrial application and daily life. Neural models built upon encoder-decoder learning [44, 4] have proven effective in RG and have achieved considerable success [35, 26, 54, 45]. However, these models suffer from the safe reply problem [21, 52]: they prefer generic, safe replies such as “thank you” and “I am sorry”, and high-frequency function words such as “the” and “no”, because these patterns and words are frequent in the training data. Although such generic responses help accuracy-based results, they are less informative and even meaningless with respect to the post. In addition, accurate replies are not necessarily good answers, because we would like to respond based on contextual semantics and the conversational environment rather than on an accurate-reply handbook. Diversifying the responses will make conversations more informative, more interesting, and more like human interaction.

The safe reply problem is a major challenge in RG. Encoder-decoder models follow the functional principle of [34], making both the source sentence and the target sentence subject to the same latent variables, as Machine Translation (MT) does. This principle neglects an intrinsic difference between MT and RG: MT deals with sentence pairs that share the same meaning, whereas RG has to go beyond the post to produce a richer response [31]. Besides, encoder-decoder models handle RG with post-response sentence pairs, which narrows the distribution of predicted responses and gives highly frequent words and patterns a higher chance of appearing in the final generation [31, 17]. Many works have been proposed to mitigate the safe reply problem and produce informative and interesting replies [21, 32, 7, 8, 11, 51, 28, 15, 36, 48, 1, 6, 47, 9, 53, 2, 50, 19]. These works help to some extent: [7, 8, 11] learn from several target references for each post to broaden the generation distribution, [36, 48, 1, 6, 47] produce a set of diversified candidate replies, and [53, 2, 50] leverage the topic distribution to bias final responses. However, these methods only work on single-turn dialog tasks.

In recent years, the Sequence-to-Sequence model (Seq2Seq) [25] has proven effective in modeling multi-turn dialogs [49]. It is based on the encoder-decoder framework, which encodes the sequence of tokens recurrently. Serban et al. extended Seq2Seq to model the relationship between utterances, proposing the Hierarchical Recurrent Encoder-Decoder model (HRED), which made the final responses more comprehensive and informative [37]. After that, many works were proposed to model correlations between utterances and produce diverse responses [55, 38, 41, 39]. Unfortunately, these diverse responses are not topically related to the context because the models lack topic information.

In this paper, we build a Seq2Seq framework [25] and extend its functional principles with both variational methods [39, 9] and topic methods [53, 2, 42], proposing global and local strategies that inject global contextual information and local topic information into the response for multi-turn conversations. The key idea of this paper is to model both the global contextual distribution and the local topic distribution, and to train them jointly. This mirrors real-world conversations, in which people generalize the context from the previous turns and replace response patterns with semantically (topically) similar ones to make the conversation more informative and interesting.

The global contextual distribution captures linguistic rules. We resort to the latent Variable Hierarchical Recurrent Encoder-Decoder model (VHRED) [39] to learn it, where the Conditional Variational Auto-Encoder (CVAE) [42, 29] is used to acquire knowledge of speaking skills by capturing correlations between utterances. We leverage this discourse-level knowledge to help produce more comprehensive responses.

Local topic information is explicitly sampled from a topic distribution in which words have correlation probabilities over a series of topics. Specifically, we first build a sparse matrix that distributes topics over all non-functional words in the vocabulary, i.e., each topic is denoted by a word. We then generalize this word-level topic distribution into higher-level topics, obtaining a dense topic matrix via Non-negative Matrix Factorization (NMF) [20], i.e., each topic becomes a high-level pattern. Thus, topic values sampled from this dense topic matrix can enrich word-level expression in the final generation.

Global contextual information and local topic information are both dynamic because they are conditioned on the changing context of each dialog. As a result, patterns in responses are diversified without losing informativeness with respect to the context.

We study RG for open-domain, multi-turn conversation systems because they match real-world scenarios and are more challenging than task-oriented [33] and single-turn [31] conversation systems. In daily life, people talk to each other in more than one utterance, and previous utterances contain contextual information that can be used to support and sustain the following conversation. In open-domain, multi-turn RG, the safe reply problem is even more of an issue because long and redundant contextual utterances bring more functional patterns. Therefore, traditional encoder-decoder models learn multi-turn utterances less effectively than single-turn, short conversations.

In summary, our contributions are as follows:

  1. We use bias factors from two separate distributions (the global contextual distribution and the local topic distribution) to bias the model away from dull responses, producing diversified yet topic-related replies.

  2. We diversify RG at the dialog level and at the word level, respectively.

  3. We advocate an explicit metric (TopicDiv) to measure the topic divergence between the post and the corresponding response. In addition, we combine the diversification metric (Distinct) and TopicDiv into an overall metric (F score) that jointly evaluates diversification and topic coherence.

In this paper, we use two multi-turn dialog datasets, Daily Dialogs [22] and Ubuntu Dialogs [24], to evaluate our model. Daily Dialogs is less noisy: its dialogues are well organized and carefully selected from human-written communications, reflecting the way we communicate daily and covering various topics of daily life. Ubuntu Dialogs is about two orders of magnitude larger than Daily Dialogs, containing almost one million two-person conversations extracted from the Ubuntu chat logs, in which users receive technical support for various Ubuntu-related problems. We compare against three state-of-the-art models: SEQ2SEQ [49], HRED [37] and VHRED [39]. Experimental results show that our model significantly outperforms the other three models in generating diversified and topic-coherent responses.

II Related work

Diversifying RG has been attracting a growing number of researchers, who no longer insist on matching the target reference replies and instead seek less usual results. Traditional RG diversification methods are roughly divided into two categories: task-oriented (or data-driven) methods and open-domain methods. While task-oriented methods only work with elaborate corpora [7, 8, 11] and carefully selected supplementary data [33], open-domain methods are flexible in real-world environments; they include mutual information methods [21], beam search methods [36, 48, 1, 6, 47], topic bias methods [2, 50] and variational methods [9, 19]. These methods only work on single-turn dialog tasks, and multi-turn conversations were not well studied until the Seq2Seq model [25].

Seq2Seq is a recurrent encoder-decoder model. It leverages recurrent networks to encode the context into a fixed-size vector, which is then used to decode the output response. Vinyals and Le broke the logjam in modeling multi-turn conversations by using the Seq2Seq model. They used Long Short-Term Memory (LSTM) networks as the Encoder and the Decoder, encoding multiple previous utterances into a compressed vector and decoding it to produce the output response. However, Seq2Seq cannot learn lengthy dialogs effectively because recurrent models (including LSTM) suffer from vanishing memory when encoding information from the distant past [30, 14]. Moreover, vanishing long-term memory confines the model to a short range of the most recent tokens, which hampers learning the multi-mode distributions of language that might exist in far-previous contextual segments.

In order to learn language patterns effectively and comprehensively, Serban et al. extended Seq2Seq and proposed HRED [37], which incorporates an additional recurrent network to model correlations between utterances. In this way, long-term language patterns are encoded in a compressed vector. This compressed vector carries dialog-level contextual information and helps diversify the generated responses. Specifically, HRED generalizes contextual information and uses it to bias the model away from safe replies. The generalized contextual information compensates for the lack of long-term contextual information.

Given the success of variational methods in modeling natural language [5], CVAE has been used to improve the modeling of multi-turn conversations [39, 55, 38, 41] and has proven effective in diversifying generated responses [55, 38, 41]. Serban et al. extended HRED and proposed VHRED [39], which incorporates a latent distribution (instead of the compressed vector in HRED) to model correlations between utterances. The latent distribution is learned with CVAE, which uses diverse contexts as conditional factors to dynamically model the correlational knowledge between utterances.

Traditional diverse RG systems for natural multi-turn conversations improve the encoder-decoder model by letting the final response deviate from the target reference replies. However, they either fail to capture the multi-mode distribution of language given syntactically and semantically diverse contexts, or lack topic information related to the context.

CVAE aims to encode the knowledge between utterances into a high-level data distribution. Conditioned on diverse contexts, this distribution becomes dynamic [10]. We extend VHRED [39] (which uses CVAE) to learn a dynamic distribution at the discourse level. Meanwhile, a pre-trained topic matrix provides a word-level dynamic distribution given the conditioning words in the context. Both the discourse-level (global) and the word-level (local) information help the system produce interesting and informative responses.

III Methodology

III-A Overview

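In multi-turn conversational systems, a dialogue can be considered as a sequence of utterances, and each utterance contains a variable number of tokens. Formally, we have $D = \{U_1, \ldots, U_M\}$ and $U_m = \{w_{m,1}, \ldots, w_{m,N_m}\}$, where $D$ is a dialogue, $U_m$ is the $m$-th utterance of $D$, and $w_{m,n}$ is the token at position $n$ of $U_m$. The RG task is to predict $U_m$ given the previous contextual utterances $U_{<m} = \{U_1, \ldots, U_{m-1}\}$. The prediction process is formulated as follows:

$$P(D) = \prod_{m=1}^{M} P(U_m \mid U_{<m}) = \prod_{m=1}^{M} \prod_{n=1}^{N_m} P(w_{m,n} \mid w_{m,<n}, U_{<m}) \tag{1}$$

As this formulation shows, the RG prediction relies on two parts: $P(U_m \mid U_{<m})$ and $P(w_{m,n} \mid w_{m,<n}, U_{<m})$. That is, the RG system models the prediction with a two-level hierarchy: a sequence of utterances, and the tokens in the current utterance [37].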

Overall, our work is a Seq2Seq model [25], also known as the recurrent encoder-decoder model [40]. As a prevalent neural machine translation approach, Seq2Seq has been successfully applied to RG [43, 49]. In particular, Seq2Seq is used to learn embeddings of the context of the previous utterances in order to generate tokens in the current utterance. Seq2Seq improves RG in terms of accuracy, producing standard replies that adhere to the reference replies, but it fails to address the safe reply problem.

In order to mitigate the safe reply problem, we leverage both the global contextual and the local topic offsets to bias the generic replies. Specifically, we resort to VHRED [39] to learn the global contextual offset and leverage NMF [20] to learn the local topic offset, proposing the Topic-coherent Hierarchical Recurrent Encoder-Decoder model (THRED) to produce not only diversified but also topic-coherent replies.

VHRED has demonstrated an ability to improve the diversification of RG [55, 38, 23, 41]. In this paper, we resort to VHRED to learn the global contextual information of dialogs: it uses CVAE to learn contextual structures and correlations between utterances within each dialog, capturing common linguistic rules. This global linguistic knowledge is encoded in a global contextual distribution. Then, conditioned on the contextual utterances in the dialog, a latent variable z is sampled from the distribution. It encodes the global linguistic knowledge that involves the context of the current dialog, helping the decoder produce a more comprehensive reply.

Besides, we use NMF to learn the local topic information conditioned on the words in the current dialog. The global context information reflects general knowledge of the dialog, while the local topic information captures the topics of all words in the dialog. The two offsets do not simply change patterns in the generated response; they help the response preserve speaking skills and linguistic rules and follow the topics of the context.

Fig. 1 shows the proposed framework on the left, where z is the latent variable of the global context distribution and two topic vectors are sampled from the local topic distribution, conditioned on the tokens of the context and the tokens of the predicted response, respectively. The subscripts represent the time step of utterances in the dialog. The proposed framework has four layers: Projection, Encoder (depicted at the bottom right), Context, and Decoder (depicted at the top right). The Projection is a fully connected neural network that encodes tokens into dense embeddings with the same dimensionality as the following layers. The Encoder is a recurrent network that sequentially encodes the token embeddings of an utterance, learning utterance-level information. The Context is also a recurrent network; it encodes the sequence of utterances in the dialog, learning dialog-level information. The Decoder takes both the context embedding and the latent variable z and produces the token sequence of the response. Meanwhile, a divergence function measures the distance between the two topic vectors, serving as an optimization constraint that biases the model toward learning topic correlations between the replies and the context.

Fig. 1: The structure of the Topic-coherent Hierarchical Recurrent Encoder-Decoder model (THRED).

The Encoder is a bidirectional LSTM network [13]. The Context and the Decoder are unidirectional LSTM networks. In the following subsections, we explain how the global contextual distribution and the local topic distribution (for both the context and the predicted response) are learned.

III-B Learning global contextual distribution

The global contextual distribution captures the discourse-level knowledge of conversations. To this end, we resort to VHRED [39], which uses CVAE [42, 29] to model a discourse-level distribution. We obtain the discourse-level knowledge by sampling from this distribution when predicting the response.

CVAE improves the Variational Auto-Encoder (VAE) model [18] by introducing a conditional factor. Vanilla VAE encodes all data into a single-mode distribution regardless of the different patterns in the data, while CVAE encodes data with different conditional factors into respective distributions. The conditional factor $c$ is prior knowledge, and a posterior factor $x$ is introduced, which can be taken as the label of the corresponding sample. Thus, optimizing CVAE can be thought of as supervised learning that expects the target $x$ conditioned on $c$. After learning the variational distribution, we can sample dynamic patterns from it under different conditional factors. We formulate the $c$-conditioned objective function as:

$$\log P(x \mid c) \ge \mathbb{E}_{z \sim Q(z \mid x, c)}\big[\log P(x \mid z, c)\big] - \mathrm{KL}\big[Q(z \mid x, c) \,\|\, P(z \mid c)\big] \tag{2}$$

where $z$ is the latent variable sampled from $Q(z \mid x, c)$, $\mathbb{E}$ takes the expectation over all samples from the distribution $Q(z \mid x, c)$, $P(z \mid c)$ is the prior distribution that approximates the posterior distribution $Q(z \mid x, c)$, and $\mathrm{KL}$ is the Kullback-Leibler divergence, which measures how one probability distribution differs from another. Optimization is performed by maximizing the lower bound on the right-hand side of this objective (equivalently, minimizing its negative), given that the divergence is always non-negative. The $\mathrm{KL}$ term ensures that $P(z \mid c)$ approximates $Q(z \mid x, c)$, since $x$ is not available in the inference step and $z$ must be sampled from $P(z \mid c)$ instead of $Q(z \mid x, c)$.

The latent variable $z$ (in Eq. 2) follows the discourse-level distribution. It is learned in the training step and sampled in the testing step as follows. In the training step, the posterior $Q(z \mid U_{<m}, U_m)$ encodes both the previous utterances $U_{<m}$ and the expected next utterance $U_m$ to model $z$. Since $Q$ considers the whole dialog (the previous utterances plus the expected reply), it learns accurate discourse-level knowledge. The expected utterance $U_m$, as a posterior factor, is not available in the testing step; thus another distribution $P(z \mid U_{<m})$ is introduced to take the place of $Q$. $P$ models a prior distribution that only considers the previous utterances $U_{<m}$. By using the $\mathrm{KL}$ divergence as a regularization term (see Eq. 2), $P$ approximates $Q$. In the testing step, the prior distribution $P$ is used instead of $Q$ to fill the gap between the expected response and the discourse-level knowledge in dialogs.

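In this paper, we encode $z$ in a fixed-length vector. Both the posterior $Q$ and the prior $P$ are Gaussian distributions, $Q \sim \mathcal{N}(\mu_{\text{post}}, \Sigma_{\text{post}})$ and $P \sim \mathcal{N}(\mu_{\text{prior}}, \Sigma_{\text{prior}})$, where each mean $\mu$ and covariance $\Sigma$ is encoded in a vector of the same length as $z$.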
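The regularization term between the two Gaussians has a closed form. The following is a minimal sketch, assuming diagonal covariances (suggested by the covariance being stored as a vector) and the standard reparameterized draw; the function names `kl_diag_gaussians` and `sample_latent` are illustrative and not taken from the paper.

```python
import math
import random


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions.

    Assumption: both covariances are diagonal and given as variance vectors.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def sample_latent(mu, var, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


# Training would draw z from the posterior Q; testing would draw z from the prior P.
rng = random.Random(0)
z = sample_latent([0.0, 0.5], [1.0, 0.25], rng)
```

The divergence is zero when the two distributions coincide and grows as the prior drifts away from the posterior, which is what lets the prior stand in for the posterior at test time.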

III-C Learning local topic distribution

III-C1 Building the topic matrix

The local topic distribution explicitly encodes the topics of all non-functional words in a topic matrix. In particular, we use Positive Pointwise Mutual Information (PPMI) [46] to build a sparse word-topic matrix in which each topic is a word in the vocabulary. We then use NMF [20] to factorize it and obtain a dense word-topic matrix in which each topic becomes a high-level pattern.

Using PPMI, we construct a high-dimensional matrix whose rows denote words and whose columns represent contextual features. Both rows and columns range over the non-functional words in the vocabulary. The value of each matrix cell is the PPMI value, which indicates the association between the word $w$ and the contextual feature $f$ and can be estimated by:

$$\mathrm{PPMI}(w, f) = \max\left(0, \log \frac{P(w, f)}{P(w)\,P(f)}\right) \tag{3}$$

where the $\max$ function ensures that only positive correlations of word-feature pairs are retained, while negative correlations are ignored by setting them to zero.
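As an illustration of how such a matrix could be estimated, here is a minimal sketch that computes PPMI from raw co-occurrence counts. The count-based probability estimates, the base-2 logarithm, and the name `ppmi_matrix` are assumptions made for the example rather than details given in the paper.

```python
import math


def ppmi_matrix(cooc):
    """Turn a co-occurrence count matrix into a PPMI matrix.

    cooc[i][j] is the (hypothetical) number of times word i occurs with
    contextual feature j. Probabilities are estimated from the counts.
    """
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    col_sums = [sum(col) for col in zip(*cooc)]
    ppmi = []
    for i, row in enumerate(cooc):
        out = []
        for j, count in enumerate(row):
            if count == 0:
                out.append(0.0)  # unseen pairs carry no positive association
                continue
            pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
            out.append(max(pmi, 0.0))  # keep only positive correlations
        ppmi.append(out)
    return ppmi


# Two words that only ever co-occur with "their own" feature.
toy = ppmi_matrix([[2, 0], [0, 2]])
```

Because most word-feature pairs never co-occur, the resulting matrix is mostly zeros, which is the sparsity the next paragraph addresses.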

The sparse PPMI matrix raises two issues: 1) the topic representation (i.e., the word-level representation of topics) is too specific to support learning a stable topic distribution; 2) the sparsity leads to both excessive memory consumption and high time complexity when training the model. To mitigate these sparsity problems, we resort to NMF to cluster sparse topics into dense topic patterns. NMF factorizes the sparse PPMI matrix $X$ into two dense matrices $W$ and $H$, i.e., $X \approx WH$. The approximation is achieved by minimizing the objective function $\|X - WH\|_F^2$. $W$ is a $V \times K$ matrix and $H$ is a $K \times V$ matrix, where $K$ can be significantly smaller than $V$ (in this paper, we set ). Unlike the Singular Value Decomposition (SVD), which might generate negative values in the final dense matrix, the $W$ produced by NMF has only non-negative elements, i.e., correlated topic patterns are retained while uncorrelated patterns are ignored. This non-negativity guarantees that the dense word-topic distribution conforms to the sparse word-topic distribution, preserving the positive relationship between words and topic features.
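To make the factorization step concrete, the sketch below implements NMF with the classic multiplicative update rules for the squared Frobenius error. The paper does not state which solver or objective variant was used, so the update rule, the random initialization and the helper names (`matmul`, `frob_sq`, `nmf`) are assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_sq(x, w, h):
    """Squared Frobenius reconstruction error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum((xi - yi) ** 2 for xr, yr in zip(x, wh) for xi, yi in zip(xr, yr))


def nmf(x, k, steps=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix x (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


# Toy matrix of rank 2 (third row = first row + second row).
x = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [1.0, 3.0, 3.0]]
w, h = nmf(x, k=2)
```

The updates only multiply non-negative quantities, so both factors stay non-negative throughout, which is the property the text contrasts with SVD. The rows of `w` would play the role of the dense word-topic vectors.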

From the training logs, we randomly selected ten topic divergence values at ten successive training epochs with PPMI and with NMF, and list them in Table I, where the topic divergence value is the KL divergence between the context and the corresponding predicted response. The two ranges cover the same number of training epochs. We calculate the variance of the PPMI divergence values and of the NMF divergence values. As the table shows, NMF has a much smaller variance, i.e., the dense topic patterns stabilize learning better than the sparse word-level topics.

Method Divergence (ten successive training epochs) Variance
PPMI 6.8266e-06 0.0916 0.0739 0.0047 0.1112 0.0896 0.0256 0.0008 0.0881 0.1065 0.001897
NMF 0.0133 0.0106 0.0102 0.0035 0.0093 0.0088 0.0082 0.0010 0.0092 0.0107 0.000012
TABLE I: Ten randomly selected topic divergence values at ten successive training epochs with PPMI and NMF, respectively, each calculated by comparing the context and the predicted response. The variance values are calculated from the ten topic divergence values of PPMI and NMF, respectively.
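The variance column of Table I can be checked directly from the listed values. The sketch below assumes the population variance (dividing by the number of values); under that assumption the result agrees with the reported figures up to rounding of the inputs.

```python
from statistics import pvariance

# Topic divergence values copied from Table I.
ppmi_div = [6.8266e-06, 0.0916, 0.0739, 0.0047, 0.1112,
            0.0896, 0.0256, 0.0008, 0.0881, 0.1065]
nmf_div = [0.0133, 0.0106, 0.0102, 0.0035, 0.0093,
           0.0088, 0.0082, 0.0010, 0.0092, 0.0107]

var_ppmi = pvariance(ppmi_div)  # population variance, i.e. divide by N
var_nmf = pvariance(nmf_div)

print(f"PPMI variance: {var_ppmi:.6f}")
print(f"NMF variance:  {var_nmf:.6f}")
print(f"ratio:         {var_ppmi / var_nmf:.0f}x")
```

The gap of roughly two orders of magnitude between the two variances is what supports the claim that the dense topic patterns give a more stable training signal.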

III-C2 Learning local topics

The local topic distribution is encoded in a dense topic matrix. Topic information is sampled from the dense topic matrix conditioned on the tokens of the contextual utterances. In particular, we match each word in the context with its row in the topic matrix and sum all words' topic values along the topic dimension (the columns). The result is scaled by the number of tokens to avoid favoring long sentences. We thus obtain a $K$-length vector, where $K$ is the number of columns in the topic matrix. This vector encodes the topics of the context in the current dialog. The topic information is dynamic because the conditioning context changes across dialogs. In the meantime, a topic vector of the predicted response is computed likewise. We use the KL divergence to measure the difference between the two topic vectors. Denoting the topic vector of the context by $t^{c}$ and that of the predicted response by $t^{r}$, it is formulated as follows:

$$\mathrm{KL}\big(t^{c} \,\|\, t^{r}\big) = \sum_{k=1}^{K} t^{c}_{k} \log \frac{t^{c}_{k}}{t^{r}_{k}} \tag{4}$$

By minimizing this objective function, the model is encouraged to learn the topic distribution, bringing the final generation and the context closer together in terms of topics. Thus, the generation is not merely diversified, but also informative and topic-related.
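The lookup-and-compare procedure described above can be sketched as follows. The toy topic matrix, the handling of words without a row (skipped, as function words would be), the normalization of each topic vector to sum to one, the small smoothing constant, and the direction of the divergence (context relative to response) are all assumptions made for illustration.

```python
import math


def topic_vector(tokens, topic_matrix):
    """Sum the topic rows of the tokens and scale by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    k = len(next(iter(topic_matrix.values())))
    summed = [0.0] * k
    for tok in tokens:
        row = topic_matrix.get(tok)  # words without a row contribute nothing
        if row is not None:
            summed = [s + r for s, r in zip(summed, row)]
    return [s / len(tokens) for s in summed]


def normalize(vec, eps=1e-12):
    """Smooth and rescale so the vector can be read as a distribution."""
    total = sum(vec) + eps * len(vec)
    return [(v + eps) / total for v in vec]


def topic_kl(context_vec, response_vec):
    """KL divergence of the context topics relative to the response topics."""
    p, q = normalize(context_vec), normalize(response_vec)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical dense topic matrix with K = 2 topic patterns.
topics = {"linux": [0.9, 0.1], "kernel": [0.8, 0.2], "coffee": [0.1, 0.9]}
ctx = topic_vector(["the", "linux", "kernel"], topics)
on_topic = topic_kl(ctx, topic_vector(["kernel"], topics))
off_topic = topic_kl(ctx, topic_vector(["coffee"], topics))
```

In this toy setting the on-topic reply yields a much smaller divergence than the off-topic one, which is the signal the training constraint exploits.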

IV Experimental settings

IV-A Datasets

We conduct experiments on two multi-turn dialog datasets with different styles: Daily Dialogs [22] and Ubuntu Dialogs [24]. The Daily Dialog corpus contains 13118 high-quality dialogs, which are human-written and less noisy. The Ubuntu Dialog corpus has been widely used in multi-turn dialog tasks [37, 39, 38, 41]. It consists of almost one million conversations from the Ubuntu chat logs, in which users receive technical support for various Ubuntu-related problems. These conversations are arbitrary and lack syntactic regularity. We preprocessed the two datasets and split each of them into Train, Validation and Test sets. Table II provides descriptive statistics about the two datasets.

Corpus #Train #Validation #Test #Avg. Utterances #Avg. Words #Vocab size
Daily Dialogs 11118 1000 1000 8.9 114.7 26987
Ubuntu Dialogs 448833 19584 18920 7.48 102.21 268487
TABLE II: Dataset statistics including number of dialogues in training, validation and test sets, average number of utterances, average number of words per dialogue, and vocabulary size.

IV-B Baselines

In the experiments, we evaluate the performance of the proposed model (THRED) against three state-of-the-art neural dialog models, SEQ2SEQ [49], HRED [37] and VHRED [39], which have been discussed in Section II.

IV-C Metrics of TopicDiv and F Score

We evaluate the above four models (including the proposed model) from three aspects: producing accurate replies, diversifying the generated responses and generating topic-related responses. The three aspects are demonstrated by the metrics of Perplexity [27], Distinct [21] and TopicDiv, respectively. Taken together, these metrics demonstrate how well the model predicts diversified, informative and topic-coherent responses.

Perplexity shows how well a probability model predicts a sample. A lower Perplexity indicates that the model is expected to predict a more accurate reply.
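For reference, perplexity is commonly computed as the exponential of the average negative log-likelihood per token. The sketch below follows that standard definition; the paper does not spell out its exact computation, so the per-token averaging and the name `perplexity` are assumptions.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every reference token
# is as uncertain as a uniform choice among four tokens.
ppl = perplexity([math.log(0.25)] * 5)
```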

Distinct reports the degree of diversity of the generated responses. A higher Distinct value indicates that the model predicts more diversified responses. It has two indicators, Distinct1 and Distinct2, which count the number of distinct unigrams and bigrams of the generated response, respectively, and scale the count by the length of the sequence. In this paper, unigram Distinct is denoted as Dist1, and bigram Distinct is denoted as Dist2.
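A minimal sketch of the Distinct computation follows. It divides the number of distinct n-grams by the number of tokens, as described above; whitespace tokenization and the name `distinct_n` are assumptions for the example.

```python
def distinct_n(tokens, n):
    """Number of distinct n-grams divided by the number of tokens."""
    if not tokens:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)


repetitive = "i am sorry i am sorry".split()
varied = "thank you very much".split()

dist1_rep, dist2_rep = distinct_n(repetitive, 1), distinct_n(repetitive, 2)
dist1_var, dist2_var = distinct_n(varied, 1), distinct_n(varied, 2)
```

The repetitive reply reuses the same three words and the same three bigrams, so both of its scores are lower than those of the reply without repetition.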

Besides, we propose a topic-related metric that measures the difference between the context and the generated response in a dialog with respect to topic information. This metric, called TopicDiv, reflects topic coherence in the conversation. It is calculated by Eq. 4. The lower the TopicDiv, the better the topic coherence of post-response pairs.

In this paper, we aim to generate replies that are both diversified and topic-coherent. We therefore need a comprehensive metric that combines the two factors (i.e., Distinct and TopicDiv) to evaluate models. Specifically, we introduce the F score for this comprehensive evaluation, which is formulated as follows:

$$F^{\beta}_{n} = \frac{(1 + \beta^{2}) \cdot \mathrm{Dist}_{n} \cdot (1 - \mathrm{TopicDiv})}{\beta^{2} \cdot \mathrm{Dist}_{n} + (1 - \mathrm{TopicDiv})} \tag{5}$$

where $\beta$ is a pre-defined real number greater than zero, and the subscript $n$ refers to the unigram ($n = 1$) or bigram ($n = 2$) version of the metric Distinct. When $\beta = 1$, Distinct and TopicDiv contribute equally to this synthetic metric; when $\beta < 1$, Distinct contributes more and TopicDiv contributes less; and when $\beta > 1$, Distinct contributes less and TopicDiv contributes more. In this paper, we evaluate models with $\beta = 1$, $\beta = 0.5$ and $\beta = 1.5$, respectively. The higher the F score, the better both diversification and topic coherence.
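The combined metric can be sketched as a weighted harmonic mean in the style of the usual F-measure, with Distinct in one role and 1 - TopicDiv in the other. This form and the weights 1, 0.5 and 1.5 are assumptions inferred from the reported numbers: under them, the sketch reproduces the F scores of Tables III and IV to within rounding. The name `f_score` is illustrative.

```python
def f_score(distinct, topic_div, beta=1.0):
    """Weighted harmonic mean of Distinct and topic coherence (1 - TopicDiv).

    beta < 1 weights Distinct more; beta > 1 weights topic coherence more.
    """
    if beta <= 0:
        raise ValueError("beta must be greater than zero")
    coherence = 1.0 - topic_div
    b2 = beta * beta
    return (1.0 + b2) * distinct * coherence / (b2 * distinct + coherence)


# SEQ2SEQ on Ubuntu Dialogs (Table III): Dist1 = 0.7870, Dist2 = 0.9564, TopicDiv = 0.2723.
f1_equal = f_score(0.7870, 0.2723, beta=1.0)
f2_equal = f_score(0.9564, 0.2723, beta=1.0)
f1_div_biased = f_score(0.7870, 0.2723, beta=0.5)
f1_topic_biased = f_score(0.7870, 0.2723, beta=1.5)
```

Using a harmonic mean means a model cannot score well by excelling on one factor alone: a very diverse but off-topic model is pulled down by its low coherence term, and vice versa.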

IV-D Training settings

The four models including the proposed model (THRED) are all encoder-decoder models. We use the bidirectional LSTM as the encoder part and the unidirectional LSTM as the decoder part. All models have the dimensional size of 500 in the hidden layers. The size of the latent variable is . The size of the dense topic features in the (NMF) dense topic matrix is . For each dataset, we pick top 20000 frequent tokens to make the vocabulary. We train the models with the learning rate of 0.0002. The best validated networks are saved in 400000 training epochs. We also improve the results using Beam Search [36] which samples best-first candidate tokens at each inference step. And we set the Beam number as 5.

V Experimental results

Evaluation results on datasets of Ubuntu Dialogs and Daily Dialogs are listed in Table III and Table IV, respectively. We also illustrate the results of the comprehensive metric of F scores in Fig. 2, which depicts four sub-figures for the two datasets with unigram diversification (Dist1) and bigram diversification (Dist2), respectively.

On both Ubuntu Dialogs and Daily Dialogs, VHRED and THRED perform fairly poorly in terms of accuracy, with higher Perplexity scores. We conjecture that this is caused by diversification. When the generated replies are diversified, i.e., tokens and patterns of the expected references are replaced with semantically similar ones, accuracy decreases because the generated replies contain fewer tokens of the reference replies. In other words, higher Perplexity scores reflect better diversification to some extent.

THRED achieves strong Dist1 and Dist2 scores. Although SEQ2SEQ achieves the highest Dist2 score on Daily Dialogs, it performs remarkably worse than the other three models in terms of Dist1. On Ubuntu Dialogs, SEQ2SEQ performs worse than VHRED and THRED in both Dist1 and Dist2. Compared with HRED and VHRED, THRED significantly outperforms HRED on both datasets, performs much better than VHRED on Daily Dialogs, and obtains diversification scores roughly equivalent to those of VHRED on Ubuntu Dialogs. In general, THRED has stable diversification performance with competitive diversification scores.

Diversifying NLG tends to reduce the topic coherence of generated replies. As the tables show, VHRED performs poorly, with the highest TopicDiv scores, as it generates much more diverse replies. THRED, however, performs well in terms of TopicDiv while still successfully diversifying NLG, even obtaining the best score on Daily Dialogs.

Diversifying NLG does not simply mean seeking substitutes for tokens of the expected reference replies; it means replacing them with topic-coherent ones. It is hard to analyze the diversification effect with Distinct (Dist1 and Dist2) and TopicDiv separately, as the two indicators pull in opposite directions. In this paper, we advocate the F score, which combines Distinct and TopicDiv, to evaluate models. As shown in Table III and Table IV, THRED performs better overall, with higher F scores. In particular, for unigram diversification, THRED achieves the highest F scores on both datasets and under all Diversification-Topic offsets ($\beta = 1$, $\beta = 0.5$ and $\beta = 1.5$). For bigram diversification, THRED achieves the highest F scores under Diversification-Topic equivalence ($\beta = 1$) and Diversification offset ($\beta = 0.5$) on Ubuntu Dialogs; it outperforms VHRED under all offsets on both datasets, and HRED under all offsets except $\beta = 1.5$ on Ubuntu Dialogs. Fig. 2 depicts the F scores under different Diversification-Topic offsets. As the figure shows, THRED performs best with unigram diversification and fairly well with bigram diversification.

Overall, THRED performs stably, with competitive Distinct scores and better TopicDiv scores. Compared with the state-of-the-art diversification model VHRED, THRED achieves higher F scores, in particular improving topic coherence without spoiling the diversification effect.

Model Perplexity Dist1 Dist2 TopicDiv F1(β=1) F2(β=1) F1(β=0.5) F2(β=0.5) F1(β=1.5) F2(β=1.5)
SEQ2SEQ 38.1559 0.7870 0.9564 0.2723 0.7562 0.8265 0.7744 0.8998 0.7450 0.7855
HRED 39.5231 0.7093 0.9025 0.2382 0.7346 0.8262 0.7192 0.8704 0.7448 0.8002
VHRED 40.6731 0.8018 0.9702 0.2908 0.7527 0.8194 0.7814 0.9037 0.7353 0.7732
THRED (ours) 40.7888 0.8008 0.9712 0.2750 0.7610 0.8302 0.7844 0.9094 0.7467 0.7863
TABLE III: Results in terms of accuracy (Perplexity), diversification (Dist1 and Dist2), topic divergence (TopicDiv) and F scores (F1 and F2, computed with Dist1 and Dist2, respectively) on Ubuntu Dialogs corpus.
Model | Accuracy | Dist1 | Dist2 | TopicDiv | F scores (six columns)
SEQ2SEQ | 36.8831 | 0.6044 | 0.9699 | 0.3276 | 0.6366 | 0.7942 | 0.6169 | 0.8911 | 0.6499 | 0.7425
HRED | 39.3921 | 0.6349 | 0.9229 | 0.3334 | 0.6504 | 0.7741 | 0.6410 | 0.8570 | 0.6565 | 0.7289
VHRED | 41.4817 | 0.6310 | 0.9165 | 0.3351 | 0.6475 | 0.7707 | 0.6375 | 0.8520 | 0.6541 | 0.7262
THRED (ours) | 43.4796 | 0.6604 | 0.9273 | 0.3101 | 0.6748 | 0.7912 | 0.6661 | 0.8676 | 0.6805 | 0.7489
TABLE IV: Results in terms of accuracy, diversification (Dist1 and Dist2), topic divergence (TopicDiv) and F scores on Daily Dialogs corpus.
(a) F scores on Ubuntu Dialogs with unigram diversification.
(b) F scores on Daily Dialogs with unigram diversification.
(c) F scores on Ubuntu Dialogs with bigram diversification.
(d) F scores on Daily Dialogs with bigram diversification.
Fig. 2: F scores on respective datasets of Ubuntu Dialogs and Daily Dialogs with unigram or bigram diversification. The vertical coordinates illustrate F scores according to respective datasets and uni-(or bi-)gram diversification. The horizontal coordinates denote Diversification-Topic offsets where the lower values represent Diversification-biased F scores and the higher values represent Topic-biased F scores.

VI Discussing Diversity

Safe replies have long been a troubling issue in NLG, and they also stunt the development of RG.

Natural language presents a multi-mode distribution. For simplicity, we illustrate it with three modes in Figure 3 (a). In practice, the system tends to learn a single-mode distribution [16]. The reason, we conjecture, lies in the gradient-based optimization mechanism of neural networks. Neural networks predict the next token by distributing probabilities over all tokens in the vocabulary, and learning proceeds by driving these probabilities towards the expected tokens. It is formulated as follows:


Fig. 3: Illustration of multi-mode distribution of natural language. (a) A three-mode distribution represents natural language. (b) The coarsely trained system has learned all modes of natural language, yet covering a large white space that denotes patterns outside natural language distribution. (c) The finely trained system has learned a uni-mode distribution of natural language. (d) The well trained system has learned a multi-mode distribution of natural language without deviating too much.

The objective is the cross-entropy function, which has been widely used to optimize neural networks [10]; it compares the expected token with the predicted token, and optimizing it pushes the prediction close to the expectation. The drawback of this process is that the system is trained to produce high-frequency tokens and patterns, because they appear more often during optimization. When the model is finely trained, the frequent patterns force it to cover a single mode (see Fig 3 (c)), producing accurate responses. In contrast, when the model is coarsely trained, it might cover all modes (see Fig 3 (b)), but it produces meaningless or even ungrammatical sentences because the learned distribution also occupies the white space outside natural language [3].
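The optimization behaviour described above can be illustrated with a toy example. The sketch below is not the paper's training code: it assumes a three-token vocabulary, a single shared logit vector, plain gradient descent, and a made-up training stream in which token 0 is the frequent one; all names are hypothetical. It uses the standard fact that the gradient of the cross-entropy with respect to the logits equals the softmax probabilities minus the one-hot target.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the expected token."""
    return -math.log(softmax(logits)[target])


def sgd_step(logits, target, lr=0.5):
    """One gradient step; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Toy stream (assumption): token 0 is frequent, tokens 1 and 2 are rare.
stream = [0, 0, 0, 1, 0, 0, 2, 0]

logits = [0.0, 0.0, 0.0]
for expected_token in stream:
    logits = sgd_step(logits, expected_token)

probs = softmax(logits)
print([round(p, 3) for p in probs])
```

After these updates most of the probability mass sits on the frequent token, which mirrors, in miniature, the pull towards a single dominant mode.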

To learn multiple modes of natural language (see Fig 3 (d)), we need a dynamic learning strategy. Prior work introduced a tradeoff among different learning objectives to obtain this dynamic quality [21, 16, 12]. In particular, the conditional mechanism was adopted to learn the distribution diversity w.r.t. each learning objective, greatly improving flexibility and robustness in modeling dynamic language [39, 38].

Diversifying RG for multi-turn conversations is not only about dispersing the language patterns but, most importantly, about generating informative and topic-related responses. Introducing both the dialog-level contextual distribution and the word-level topic distribution influences the learned safe and commonplace patterns, effectively diversifying the final responses.

VII Discussing Semantic Invalidity of NLG

VIII Conclusion

In this paper, we leverage both dialog-level contextual information and word-level topic coherence to propose THRED, a model that generates not only diversified but also topic-coherent replies for multi-turn conversations. We also propose an explicit metric (TopicDiv) to measure the topic divergence between the post and the corresponding replies. In addition, we combine Distinct and TopicDiv into an overall metric that captures both diversification and topic coherence. We evaluate THRED against three baselines (Seq2Seq, HRED and VHRED) on two real-world corpora. The results demonstrate that our model performs better for both diversified and topic-coherent response generation.

[Analysis of Generated Replies] We select ten generated replies for ten respective contexts and list them in Table V. We group these replies (w.r.t. THRED) into two classes: 1) good (topic-coherent and interesting) replies, listed in Items 1 to 7, and 2) bad (semantically invalid) replies, listed in Items 8 to 10. In this section, we analyze these replies in more detail.

In the first three items (Item 1, Item 2 and Item 3), THRED produces diversified alternatives which are not only topic-coherent but, more importantly, propose specific solutions. For example, in Item 1, “handbrake” is an open-source application for video transcoding, and the context raises the issues of how to install it and how to make it work. The other four replies (including GrT) are generic or semantically invalid, while THRED tries to address the issue in a specific way: “libdvdcss2” is a supporting library which could be used to solve certain problems of “handbrake”. (Footnote 1: The “handbrake” app is available at https://handbrake.fr/. When “handbrake” does not work, throwing errors such as “Could not read DVD. This may be because the DVD is encrypted and a DVD decryption library is not installed.”, “libdvdcss2” could be an appropriate solution.)

In Item 4 and Item 5, THRED provides contradictory but valid answers, diversifying the responses without spoiling their semantic consistency with the context. For example, in Item 4, THRED answers by expressing a negative attitude, while the other models (and the ground truth) express gratitude.

In Item 6 and Item 7, THRED generates semantically equivalent but diversified replies. In Item 7, both GrT and THRED propose specific solutions with concrete steps.

On the other hand, THRED also produces various bad replies. In Item 8, THRED produces a semantically invalid reply because “usb flash drive” (of which “usb” and “drive” appear in the context) is a frequent pattern in the training data. (Footnote 2: In the training data, “usb flash drive” appears 246 times.) In Item 9, THRED produces a generic answer in response to “Thank you” in the context. In Item 10, the context raises a “font” problem; however, THRED mistakes it for a “screen resolution” issue due to the keywords “monitor” and “HD 1920x1080” in the context.

Item Context Replies
1 i can find handbrake …shows “ Conflicts : handbrake ” …
…did you have handbrake installed already …try sudo apt-get install handbrake in a terminal
did you try the terminal .
the actual name is handbrake-gtk and that installs via terminal
GrT: cool . :)
THRED: do you have libdvdcss2 installed ?
HRED: I’m sorry to see you .
SEQ2SEQ: yes
VHRED: I’m sorry , sir .
2 Are you looking at your computer ’s CPU temperature ?
where do i go to see its temp rypervenche ?
GrT: That depends on your hardware and kernel . The Ubuntu guys can help you with that .
THRED: Look at the “ sensors ” package .
HRED: I don’t know , I don’t know much about lmsensors , but I don’t know much about it .
SEQ2SEQ: **unknown**
VHRED: It should be under System Administration Power management .
3 anyone ? having trouble opening an executable file , it doesnt do anything …
no . Be more specific .
GrT: download that game , unzipped , clicked the executable , nothing happens
THRED: when i try to open a file , it says “ no such file or directory
HRED: i just want to run a program from the terminal , and i want to run it from the terminal , but i don’t know how to do that
SEQ2SEQ: what do you mean ?
VHRED: trying to open a file , it says it can’t find the file
4 Does anyone have any experience setting up Ubuntu with MAAS ? I’m having some weird time issues when trying to add nodes .
The folks in #ubuntu-server might be of some help
GrT: Thanks much .
THRED: I’ve tried that , but I can’t seem to find a solution .
HRED: Thanks .
SEQ2SEQ: thanks .
VHRED: thanks !
5 how can I remove a launchpad keyring , …
apt-key del 1378B444
GrT: thanks :)
THRED: I tried that , but it didn’t work
HRED: thanks
SEQ2SEQ: thanks
VHRED: I tried that , but it still doesn’t work
6 Hi , does anyone know of any file shredding software for Ubuntu ?
there is shred , but it doesn’t really work on journaling filesystems .
GrT: i’m on ext4 would that work ?
THRED: do you know of any software that can recover deleted files ?
HRED: ok , thanks
SEQ2SEQ: **unknown** , **unknown** , **unknown** , **unknown** , **unknown**
VHRED: **unknown** , that ’s what I was looking for , thanks
7 hi ! … i have changed the default zoom level and now i cant go back to the default or ‘ 66% ’ … can any one help me here plz ?
GrT: hm . Hold down Control key while scrolling the mousewheel ? That ’s a guess , not an answer .
THRED: ok , open a terminal and type : gconf-editor
HRED: are you using compiz ?
SEQ2SEQ: what do you mean ?
VHRED: I’m not sure what you mean by that . I don’t know how to fix it though .

8 HEy all I don’t have any dvds at the moment Can I load an external hdd with ubuntu then use that to install it ? …is that possible to do with this hardrive ? a hard drive is hard drive isn’t it ? …
no usb thumb ?
GrT: http://ubuntuforums.org/showthread.php?t=1549847 no problem you can boot the iso from grub
THRED: it ’s a usb flash drive
HRED: do you have a usb flash drive ?
SEQ2SEQ: it ’s a usb drive , it ’s a usb drive
VHRED: then you should be able to boot from usb
9 …more than 2GB you might see advantages using the 64-bit installation . …
…we can install the 32-bit libraries to support 32-bit applications on 64-bit without any silly kludges we had before
excellent . That ’s good to know - reassuring . Thank you . :)
GrT: check the specifications of the manufacturer
THRED: You’re welcome .
HRED: no problem , glad to help
SEQ2SEQ: it ’s a bit of a pain in the ass , but it ’s a bit of a **unknown** .
VHRED: you’re welcome
10 …When I log in with Awesome as my WM , the fonts in the Terminal , …, have an EXTREMELY small font …and the fonts when using Gnome are perfect .
…use the key board
yes there is Alt+f2 or ctrl+alt+f1 does the keyboard work ? …
I don’t know if this has anything to do with it but my monitor is full HD 1920x1080 , 47 ” …
GrT: plugged to the pc via HDMI
THRED: I am trying to figure out how to change the resolution of my monitor
HRED: I think it ’s a font problem , but I don’t know how to do that .
SEQ2SEQ: what do you mean ?
VHRED: I have a laptop , and it ’s a laptop .
TABLE V: Contexts and their diversified replies. For each context, there is a ground-truth reply (GrT) and four replies produced by the four models (THRED, HRED, SEQ2SEQ and VHRED), respectively. THRED generates good (topic-coherent and interesting) replies in Items 1 to 7. On the other hand, THRED also generates bad (semantically invalid) replies in Items 8 to 10.


The authors would like to thank…

Conflict of interest

The authors declare that they have no conflict of interest.


  • [1] N. Asghar, P. Poupart, X. Jiang, and H. Li (2016) Deep active learning for dialogue generation. arXiv preprint arXiv:1612.03929. Cited by: §I, §II.
  • [2] A. Baheti, A. Ritter, J. Li, and B. Dolan (2018) Generating more interesting responses in neural conversation models with distributional constraints. arXiv preprint arXiv:1809.01215. Cited by: §I, §I, §II.
  • [3] M. Caccia, L. Caccia, W. Fedus, H. Larochelle, J. Pineau, and L. Charlin (2018) Language gans falling short. arXiv preprint arXiv:1811.02549. Cited by: §VI.
  • [4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §I.
  • [5] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988. Cited by: §II.
  • [6] A. Cibils, C. Musat, A. Hossman, and M. Baeriswyl (2018) Diverse beam search for increased novelty in abstractive summarization. arXiv preprint arXiv:1802.01457. Cited by: §I, §II.
  • [7] J. M. Deriu and M. Cieliebak (2017) End-to-end trainable system for enhancing diversity in natural language generation. In End-to-End Natural Language Generation Challenge (E2E NLG), 2017, Cited by: §I, §II.
  • [8] J. M. Deriu and M. Cieliebak (2018) Syntactic manipulation for generating more diverse and interesting texts. In 11th International Conference on Natural Language Generation (INLG 2018), Tilburg, The Netherlands, 05-08 November 2018, pp. 22–34. Cited by: §I, §II.
  • [9] J. Du, W. Li, Y. He, R. Xu, L. Bing, and X. Wang (2018) Variational autoregressive decoder for neural response generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3154–3163. Cited by: §I, §I, §II.
  • [10] I. Dykeman Conditional variational autoencoders. Note: http://ijdykeman.github.io/ml/2016/12/21/cvae.html (accessed on April 4, 2018). Cited by: §II, §VI.
  • [11] H. Elder, S. Gehrmann, A. O’Connor, and Q. Liu (2018) E2E nlg challenge submission: towards controllable generation of diverse natural language. In Proceedings of the 11th International Conference on Natural Language Generation, pp. 457–462. Cited by: §I, §II.
  • [12] W. Fedus, I. Goodfellow, and A. M. Dai (2018) MaskGAN: better text generation via filling in the_. arXiv preprint arXiv:1801.07736. Cited by: §VI.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §III-A.
  • [14] S. Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §II.
  • [15] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596. Cited by: §I.
  • [16] F. Huszár (2015) How (not) to train your generative model: scheduled sampling, likelihood, adversary?. arXiv preprint arXiv:1511.05101. Cited by: §VI, §VI.
  • [17] V. Kassarnig (2016) Political speech generation. arXiv preprint arXiv:1601.03313. Cited by: §I.
  • [18] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §III-B.
  • [19] H. Le, T. Tran, T. Nguyen, and S. Venkatesh (2018) Variational memory encoder-decoder. In Advances in Neural Information Processing Systems, pp. 1508–1518. Cited by: §I, §II.
  • [20] D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §I, §III-A, §III-C1.
  • [21] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2015) A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055. Cited by: §I, §I, §II, §IV-C, §VI.
  • [22] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) Dailydialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957. Cited by: §I, §IV-A.
  • [23] R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, and J. Pineau (2017) Towards an automatic turing test: learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149. Cited by: §III-A.
  • [24] R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909. Cited by: §I, §IV-A.
  • [25] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §I, §I, §II, §III-A.
  • [26] L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin (2016) Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970. Cited by: §I.
  • [27] H. Ney, U. Essen, and R. Kneser (1994) On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language 8 (1), pp. 1–38. Cited by: §IV-C.
  • [28] J. Novikova, O. Dušek, and V. Rieser (2017) The e2e dataset: new challenges for end-to-end generation. arXiv preprint arXiv:1706.09254. Cited by: §I.
  • [29] G. Pandey and A. Dukkipati (2016) Variational methods for conditional multimodal learning: generating human faces from attributes. arXiv preprint arXiv:1603.01801. Cited by: §I, §III-B.
  • [30] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §II.
  • [31] J. Pei and C. Li (2018) S2spmn: a simple and effective framework for response generation with relevant information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 745–750. Cited by: §I, §I.
  • [32] D. Pérez and E. Alfonseca (2005) Application of the bleu algorithm for recognising textual entailments. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, pp. 9–12. Cited by: §I.
  • [33] R. Reddy, D. Contractor, D. Raghu, and S. Joshi (2018) Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647. Cited by: §I, §II.
  • [34] A. Ritter, C. Cherry, and W. B. Dolan (2011) Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing, pp. 583–593. Cited by: §I.
  • [35] A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §I.
  • [36] C. Sammut (2010) Beam search. Encyclopedia of Machine Learning, pp. 93–93. Cited by: §I, §II, §IV-D.
  • [37] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, pp. 3776–3783. Cited by: §I, §I, §II, §III-A, §IV-A, §IV-B.
  • [38] I. V. Serban, R. Lowe, L. Charlin, and J. Pineau (2016) Generative deep neural networks for dialogue: a short review. arXiv preprint arXiv:1611.06216. Cited by: §I, §II, §III-A, §IV-A, §VI.
  • [39] I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, pp. 3295–3301. Cited by: §I, §I, §I, §I, §II, §II, §III-A, §III-B, §IV-A, §IV-B, §VI.
  • [40] L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364. Cited by: §III-A.
  • [41] X. Shen, H. Su, Y. Li, W. Li, S. Niu, Y. Zhao, A. Aizawa, and G. Long (2017) A conditional variational framework for dialog generation. arXiv preprint arXiv:1705.00316. Cited by: §I, §II, §III-A, §IV-A.
  • [42] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §I, §I, §III-B.
  • [43] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714. Cited by: §III-A.
  • [44] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §I.
  • [45] Z. Tian, R. Yan, L. Mou, Y. Song, Y. Feng, and D. Zhao (2017) How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 231–236. Cited by: §I.
  • [46] P. D. Turney and P. Pantel (2010) From frequency to meaning: vector space models of semantics. Journal of artificial intelligence research 37, pp. 141–188. Cited by: §III-C1.
  • [47] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra (2018) Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 7371–7379. Cited by: §I, §II.
  • [48] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra (2016) Diverse beam search: decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424. Cited by: §I, §II.
  • [49] O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §I, §I, §II, §III-A, §IV-B.
  • [50] Y. Wang, C. Liu, M. Huang, and L. Nie (2018) Learning to ask questions in open-domain conversational systems with typed decoders. arXiv preprint arXiv:1805.04843. Cited by: §I, §II.
  • [51] T. Wen, M. Gasic, N. Mrksic, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745. Cited by: §I.
  • [52] Y. Wu, W. Wu, D. Yang, C. Xu, and Z. Li (2018) Neural response generation with dynamic vocabularies. In Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5594–5601. Cited by: §I.
  • [53] C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W. Ma (2017) Topic aware neural response generation. In Thirty-First AAAI Conference on Artificial Intelligence, pp. 3351–3357. Cited by: §I, §I.
  • [54] R. Yan, Y. Song, X. Zhou, and H. Wu (2016) Shall i be your chat companion?: towards an online human-computer conversation system. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 649–658. Cited by: §I.
  • [55] T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960. Cited by: §I, §II, §III-A.