
Why Do Neural Dialog Systems Generate Short and Meaningless Replies? A Comparison between Dialog and Translation

by   Bolin Wei, et al.
University of Waterloo
ByteDance Inc.

This paper addresses the question: Why do neural dialog systems generate short and meaningless replies? We conjecture that, in a dialog system, an utterance may have multiple equally plausible replies, causing the deficiency of neural networks in the dialog application. We propose a systematic way to mimic the dialog scenario in a machine translation system, and manage to reproduce the phenomenon of generating short and less meaningful sentences in the translation setting, showing evidence of our conjecture.



1 Introduction

Open-domain human-computer dialog systems are attracting increasing attention in the NLP community. With the development of deep learning, sequence-to-sequence (Seq2Seq) neural networks, or more generally encoder-decoder frameworks, are among the most popular models for utterance generation in dialog systems Shang et al. (2015); Li et al. (2016); Mou et al. (2016); Serban et al. (2017).

Historically, Seq2Seq-like models were first designed for machine translation Sutskever et al. (2014); Bahdanau et al. (2015) and later widely applied to image captioning Vinyals et al. (2015), text summarization Rush et al. (2015), etc. When adapted to open-domain dialog systems, however, Seq2Seq models are less satisfactory. A severe problem is that the Seq2Seq model tends to generate short and meaningless replies, e.g., “I don’t know” Li et al. (2016) and “Me too” Mou et al. (2016). Such replies are universally relevant to most utterances (they are called universal replies in seq2BF) and hence are less desirable in real-world conversation systems.

In previous studies, researchers have proposed a variety of approaches to address the problem of universal replies, ranging from heuristically modified training objectives Li et al. (2016) and diversified decoding algorithms Vijayakumar et al. (2016) to content-introducing approaches Mou et al. (2016); Xing et al. (2016).

Although universal replies have been alleviated to some extent, an empirical explanation is still lacking for a curious question: Why does the same Seq2Seq model tend to generate shorter and less meaningful sentences in a dialog system than in a machine translation system?

Considering the difference between dialog and translation data, our intuition is that the dialog system encounters a severe unaligned problem: an utterance may have multiple equally plausible replies, which may have different meanings. By contrast, translation datasets typically have a precise semantic matching between the source and target sides. This conjecture was casually expressed in our previous work Mou et al. (2016), but was not supported by experiments.

In this paper, we propose a method to verify the conjecture by mimicking the unaligned scenario in machine translation datasets. We propose to shuffle the source and target sides of the translation pairs to artificially build a conditional distribution of target sentences with multiple plausible data points. By doing so, we manage to shorten the length and lower the “information” of generated sentences in a Seq2Seq machine translation system. This shows evidence that the unaligned problem could be one reason that causes short and meaningless replies in neural dialog systems.

To summarize, this paper systematically compares Seq2Seq dialog and translation systems, and provides an explanation to the question: Why do neural dialog systems tend to generate short and meaningless replies? Our study also sheds light on the future development of neural dialog systems as well as the application scenarios where Seq2Seq models are appropriate.

In the rest of this paper, we first describe our conjecture in Section 2. Then we design the experimental protocol in Section 3 and present results in Section 4. Finally, we conclude with discussion in Section 5.

2 Conjecture

We hypothesize that given a source sequence, the conditional distribution of the target sequence having multiple plausible points is one cause of the deficiency of Seq2Seq models in dialog systems.

Let us denote the source sequence by $\mathbf{s}$ and the target sequence by $\mathbf{t}$. Both (orthodox) training and prediction objectives are to maximize $p_\theta(\mathbf{t} \mid \mathbf{s})$, where the conditional probability $p_\theta(\cdot \mid \cdot)$ is modeled by a Seq2Seq neural network with parameters $\theta$.

In a machine translation system, the source and target information generally aligns well, although some meanings could have different expressions. Figure 1a shows a continuous analog of $p(\mathbf{t} \mid \mathbf{s})$.

In an open-domain dialog system, however, an utterance can have a variety of replies that are (nearly) equally plausible. For example, given a user-issued utterance “What are you going to do?” there could be multiple replies like “having lunch,” “watching movies,” and “sleeping,” shown in Figure 1b with an analog of continuous random variables. There is no particular reason why one reply should be favored over another without further context. Even with context, this problem could not be fully solved because of the true randomness of dialog.

| System | Setting | BLEU | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
| --- | --- | --- | --- | --- | --- | --- |
| Dialog | Seq2Seq | 1.84 | 15.1 | 2.40 | 1.02 | 0.66 |
| Translation | Seq2Seq | 27.2 | 60.2 | 33.4 | 20.9 | 13.6 |
| Translation | +shuffle 25% | 24.4 | 56.2 | 30.3 | 18.8 | 12.0 |
| Translation | +shuffle 50% | 21.1 | 52.8 | 26.8 | 16.0 | 10.0 |
| Translation | +shuffle 75% | 17.2 | 48.2 | 23.2 | 13.4 | 8.10 |
| Translation | +shuffle 100% | 0.024 | 12.5 | 0.189 | 0.00 | 0.00 |

Table 1: BLEU scores of dialog and translation systems.

Figure 1: The conditional distribution $p(\mathbf{t} \mid \mathbf{s})$ in (a) machine translation and (b) dialog systems, where we consider an analog of continuous random variables. More rigorously speaking, $p(\mathbf{t} \mid \mathbf{s})$ is peaked at one or a few similar sentence(s) in machine translation because source and target information generally aligns, whereas an utterance can have multiple plausible replies in dialog systems.

The above is, perhaps, the most salient difference between dialog and translation datasets. While it is tempting to explain Seq2Seq’s performance in this way Mou et al. (2016), no practical approach has so far been available to verify the conjecture.

3 Experimental Protocol

3.1 Mimicking a “Dialog Scenario” in the Machine Translation

We propose to mimic the “unaligned” property in a translation dataset by shuffling the source and target pairs. This ensures that the resulting conditional distribution has multiple plausible data points, while all other settings of translation remain unchanged, which makes for a rigorously controlled experiment.

Formally speaking, let $\mathcal{D} = \{(\mathbf{s}^{(n)}, \mathbf{t}^{(n)})\}_{n=1}^{N}$ be the training dataset in a translation setting, where $(\mathbf{s}^{(n)}, \mathbf{t}^{(n)})$ is a particular data point containing a source and target sentence pair; in total we have $N$ data points.

The shuffled dataset is $\tilde{\mathcal{D}} = \{(\mathbf{s}^{(n)}, \tilde{\mathbf{t}}^{(n)})\}_{n=1}^{N}$, where $\tilde{\mathbf{t}}^{(n)} = \mathbf{t}^{(\tau(n))}$ and $\tau$ is a random permutation of $1, \dots, N$. In this way, we artificially construct a conditional target distribution that allows multiple plausible sentences conditioned on a particular source sentence.

Notice that, for the sake of constructing a distribution where the target sentences can have multiple plausible data points, there is no need to generate multiple random target sentences for a particular source sentence. In fact, it is preferable not to, so that the experiment is more controlled. In the case where we generate a single target sentence $\tilde{\mathbf{t}}^{(n)}$ for a source sentence $\mathbf{s}^{(n)}$, the targets $\tilde{\mathbf{t}}^{(n)}$ can still be viewed as samples from the marginal (unconditioned) distribution $p(\mathbf{t})$, and thus the desired “unaligned” property is in place.

It is straightforward to shuffle only a subset of the translation dataset, so details are not repeated here. Varying the shuffled fraction helps to analyze how Seq2Seq models behave as the “unaligned” problem becomes more severe.
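As an illustration of partial shuffling, the following is a minimal sketch rather than the authors' actual preprocessing script. The function name `shuffle_targets`, the `rate` and `seed` parameters, and the choice to permute targets only among a randomly selected subset of pairs are all assumptions made for this example. Note that a random permutation can leave a few targets in their original position, so the fraction of truly mismatched pairs may be slightly below `rate`.

```python
import random


def shuffle_targets(pairs, rate, seed=0):
    """Return a copy of `pairs` in which a fraction `rate` of targets is permuted.

    pairs: list of (source, target) tuples.
    rate:  fraction in [0, 1] of pairs whose targets take part in the permutation.
    (Hypothetical helper; not taken from the paper's code.)
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    n = len(pairs)
    # Choose which positions take part in the shuffle.
    chosen = rng.sample(range(n), k=round(rate * n))
    # Permute the targets among the chosen positions only.
    permuted = chosen[:]
    rng.shuffle(permuted)
    targets = [t for _, t in pairs]
    for dst, src in zip(chosen, permuted):
        targets[dst] = pairs[src][1]
    # Sources keep their order; only the targets move.
    return [(s, t) for (s, _), t in zip(pairs, targets)]
```

With `rate=1.0` every target takes part in the permutation, which corresponds to the fully shuffled setting; with `rate=0.0` the dataset is returned unchanged.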

It should also be mentioned that the shuffling trick was previously used in shuffle to compare the robustness of Seq2Seq models and phrase-based statistical machine translation in terms of BLEU scores. Our paper contributes the novel insight that shuffling datasets mimics the unaligned property in dialog datasets, which facilitates the comparison between Seq2Seq dialog and translation systems.

3.2 The Seq2Seq Model and Datasets

We adopted a modern Seq2Seq model (with an attention mechanism) as the neural network for both dialog and translation systems. The encoder is a bidirectional recurrent neural network with gated recurrent units (GRUs), whereas the decoder comprises two GRU state transition blocks with an attention mechanism in between Sennrich et al. (2017).

For the dialog system, we used the Cornell Movie-Dialogs Corpus, containing 221k samples. For machine translation, we used the WMT-2017 dataset and focused on English-to-German translation; it contains 5.8M samples.

We first tried a normal machine translation setting and achieved comparable results to a baseline in baseline; thus our replication of the machine translation system is fair. In all settings, we used the same model and hyperparameters so that our comparison is also fair.

Appendix A provides detailed model description and experimental setup.

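| System | Setting | Length (# words) | Length (% of Ref) | Neg. log-prob. | Entropy |
| --- | --- | --- | --- | --- | --- |
| Dialog | References | 14.40 | 100 | 8.79 | 8.91 |
| Dialog | Seq2Seq | 11.70 | 81.3 | 8.08 | 7.92 |
| Translation | References | 21.47 | 100 | 11.4 | 10.2 |
| Translation | Seq2Seq | 21.24 | 98.9 | 11.1 | 9.98 |
| Translation | shuffle 25% | 21.02 | 97.9 | 10.9 | 9.81 |
| Translation | shuffle 50% | 20.73 | 96.6 | 10.8 | 9.66 |
| Translation | shuffle 75% | 19.89 | 92.6 | 10.6 | 9.39 |
| Translation | shuffle 100% | 15.88 | 74.0 | 9.34 | 4.46 |

Table 2: Average length (along with the percentage of references), negative log-probability, and entropy of dialog and translation systems.

| System | Setting | Encoder | Decoder |
| --- | --- | --- | --- |
| Dialog | Seq2Seq | .5095 | .1706 |
| Translation | Seq2Seq | .9673 | .8734 |
| Translation | +shuffle 25% | .9257 | .7241 |
| Translation | +shuffle 50% | .9374 | .6221 |
| Translation | +shuffle 75% | .8622 | .6574 |
| Translation | +shuffle 100% | .9928 | .8521 |

Table 3: Correlation obtained by fitting a linear regression that predicts the encoding/decoding time step from the hidden states.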

4 Results

Overall Performance. Table 1 presents the BLEU scores of the dialog and machine translation systems. In open-domain dialog, BLEU-2 exhibits some (not large) correlation with human satisfaction, although BLEU scores are generally low. For machine translation, we achieved 27.2 BLEU for the normal setting, which is comparable to 28.4 achieved by a baseline method in baseline.

If we begin to shuffle the translation dataset, we see that BLEU drops gradually and finally reaches near zero if the training set is completely random (100% shuffled). The results are not surprising and are also reported in shuffle. This provides a quick understanding of how the Seq2Seq model is influenced by shuffled data.

Length, Negative Log-Probability, and Entropy. We now compare the length, negative log-probability, and entropy of dialog and translation systems, as well as the shuffling settings (Table 2). The length metric counts the number of words in a generated reply. (In some cases, an RNN fails to terminate and keeps repeating the same word; here, we assume that the same word can be repeated at most four times.) The negative log-probability is computed as $-\frac{1}{|R|}\sum_{w \in R} \log p_{\text{train}}(w)$, where $R$ denotes all replies, $|R|$ is their total number of words, and $p_{\text{train}}(\cdot)$ is the unigram distribution of words in the training set. Entropy is defined as $-\sum_{w} p_{\text{gen}}(w) \log p_{\text{gen}}(w)$, where $p_{\text{gen}}(\cdot)$ is the unigram distribution in generated replies. Intuitively, both negative log-probability and entropy evaluate how much “content” is contained in the replies. These metrics were used in previous work Serban et al. (2017); Mou et al. (2016) and are the most relevant to our research question. (In our previous work Mou et al. (2016), the negative log-probability was mistakenly referred to as entropy; this was identified through email correspondence with another researcher.)
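The three statistics described above can be sketched in a few lines. This is an illustrative implementation, not the authors' evaluation code: the helper names are hypothetical, the logarithm base (2) and the probability floor for words unseen in training are assumptions, and the repetition cap is interpreted here as truncating any run of consecutive identical words to four copies.

```python
import math
from collections import Counter


def cap_repeats(words, max_run=4):
    """Truncate any run of one repeated word to at most `max_run` copies."""
    out, run = [], 0
    for w in words:
        run = run + 1 if out and w == out[-1] else 1
        if run <= max_run:
            out.append(w)
    return out


def unigram_dist(sentences):
    """Relative word frequencies over a list of tokenized sentences."""
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def avg_length(replies):
    """Average number of words per reply."""
    return sum(len(r) for r in replies) / len(replies)


def neg_log_prob(replies, p_train, floor=1e-12):
    """Per-word negative log-probability under the training unigram distribution.

    `floor` (an assumption) avoids log(0) for words unseen in training.
    """
    words = [w for r in replies for w in r]
    return -sum(math.log2(p_train.get(w, floor)) for w in words) / len(words)


def entropy(replies):
    """Entropy of the unigram distribution of the generated replies."""
    p_gen = unigram_dist(replies)
    return -sum(p * math.log2(p) for p in p_gen.values())
```

A typical use would be to tokenize the training targets once, build `p_train = unigram_dist(train_targets)`, and then evaluate each system's generated replies after applying `cap_repeats` to every reply.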

We first compare the dialog system with machine translation, both in a normal setting (no shuffling). We observe that the dialog system does generate short and meaningless replies, with lower length, negative log-probability, and entropy than the references, as opposed to machine translation, where Seq2Seq’s generated sentences are comparable to the references in terms of these statistics. Quantitatively, the dialog replies are about 20% shorter than the references (81.3% of the reference length). The negative log-probability and entropy decrease by 0.71 and 0.99, respectively; a decrease of 1 in these metrics is large because they are logarithmic. With a well-engineered Seq2Seq model (with attention, beam search, etc.), the phenomenon is less severe than with the vanilla Seq2Seq in seq2BF, but it is still perceivable and worth investigating.

We then applied the shuffling setting to the translation system. As the shuffling rate increases, the Seq2Seq translation model exhibits the same phenomenon as a dialog system: the length decreases, the negative log-probability decreases, and the entropy decreases. In particular, the decreasing negative log-probability implies that the generated words are ones that appear more frequently in the training set, whereas the decreasing entropy implies that the distribution of generated words spreads less across the vocabulary. In other words, artificially constructing an unaligned property in translation datasets, with all other settings unchanged, allows us to reproduce the phenomenon observed in a dialog system. This shows evidence that the unaligned property could be one reason for the problem of short and meaningless replies in a dialog system.

Correlation between Time Step and Hidden States. MTlength conduct an empirical study analyzing “Why Neural Translations are the Right Length.” They observe that, even if the semantics of a translation is poor, the length of the generated sentence is likely to be correct. They further find that some dimensions in RNN states are responsible for memorizing the current length in the process of sequence generation; this result was also reported previously in visualizeRNN. MTlength apply linear regression to predict the time step during sequence modeling based on hidden states, and compute the correlation as a quantitative measure.
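The regression analysis described above can be sketched as follows. This is a hedged illustration, not the procedure of MTlength or of this paper: the function name `time_step_r2` is hypothetical, and reporting the coefficient of determination (R²) of an ordinary least-squares fit as the "correlation" is an assumption made for this example.

```python
import numpy as np


def time_step_r2(hidden_states, steps):
    """Fit steps ~ hidden_states @ w + b by least squares and return R^2.

    hidden_states: array of shape (num_positions, hidden_dim), one row per
                   encoder or decoder position collected over many sentences.
    steps:         array of shape (num_positions,), the time step of each row.
    """
    H = np.asarray(hidden_states, dtype=float)
    y = np.asarray(steps, dtype=float)
    X = np.hstack([H, np.ones((H.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ w
    ss_res = float(residual @ residual)          # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())  # total variation
    return 1.0 - ss_res / ss_tot
```

A value near 1 would indicate that the hidden states linearly encode the current position (for example, through a counter-like dimension), whereas a value near 0 would indicate that they carry little linear information about it.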

Since a dialog system usually generates short replies (and thus not of the right length), we are curious what the correlation would be in a dialog system as well as in the shuffled translation settings. The results are shown in Table 3. We find that the dialog system exhibits low correlation, and that the correlation also decreases in machine translation if data are shuffled (but not as severely as in dialog systems). One inconsistent result, however, is that for the 100% shuffled dataset, the correlation on the encoder side becomes 99%, while the decoder correlation also increases to 85%. We currently do not have a good explanation for this.

5 Conclusion and Discussion

In this paper, we addressed the question of why dialog systems generate short and meaningless replies. We managed to reproduce this phenomenon in a well-behaved translation system by shuffling training data, artificially mimicking the scenario in which a source sentence can have multiple equally plausible target sentences.

Admittedly, it is impossible to construct exactly the same scenario as dialog by using translation datasets (otherwise the translation just becomes dialog). However, the unaligned property is a salient difference, and by controlling this, we observe the desired phenomenon. Therefore it could be one cause of short and meaningless replies in dialog systems.

Our findings also explain why referring to additional information (including dialog context Tian et al. (2017), keywords Mou et al. (2016), and knowledge bases Vougiouklis et al. (2016)) helps dialog systems: the number of plausible target sentences decreases if the generation is conditioned on more information. This intuition is helpful for the future development of Seq2Seq dialog systems. Moreover, our experiments suggest that Seq2Seq models are more suitable for applications where the source and target information is aligned.


We would like to thank Daqi Zheng and Yiping Song for helpful discussion.


Appendix A Experimental Setup

A.1 Neural Network

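We use the neural network in toolkit as our model. The encoder is a bidirectional recurrent neural network with gated recurrent units (GRUs). Let us consider one direction $\overrightarrow{h}_i = \mathrm{GRU}(x_i, \overrightarrow{h}_{i-1})$, where $x_i$ is the input embedding at the time step $i$ and $\overrightarrow{h}_i$ is the hidden state. The computation of one step is given by

$$r_i = \sigma(W_r x_i + U_r \overrightarrow{h}_{i-1})$$
$$z_i = \sigma(W_z x_i + U_z \overrightarrow{h}_{i-1})$$
$$\tilde{h}_i = \tanh\big(W x_i + r_i \odot (U \overrightarrow{h}_{i-1})\big)$$
$$\overrightarrow{h}_i = (1 - z_i) \odot \tilde{h}_i + z_i \odot \overrightarrow{h}_{i-1}$$

where $W$’s and $U$’s are weights; $\sigma$ is the sigmoid function and $\odot$ is element-wise product.

Applying GRU-RNN to both directions and concatenating the resulting hidden states, we obtain the representation of the $i$th word in the source as

$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$$

The decoder is an RNN with two blocks of GRUs and an attention mechanism sandwiched in between. The first block of GRU computes an intermediate representation for the $j$th word in the target as $s'_j = \mathrm{GRU}_1(y_{j-1}, s_{j-1})$, where $y_{j-1}$ is the embedding of the last word $t_{j-1}$.

$s'_j$ is used to compute the attention vector as

$$e_{ij} = v_a^\top \tanh(W_a s'_j + U_a h_i), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{i'} \exp(e_{i'j})}$$

A context vector is computed as

$$c_j = \sum_i \alpha_{ij} h_i$$

Then $c_j$ is fed to the second block of GRU as

$$s_j = \mathrm{GRU}_2(c_j, s'_j)$$

Finally, $s_j$, $y_{j-1}$, and $c_j$ are fed to a fully connected layer and a softmax layer for prediction of the word $t_j$ at the time step $j$ in the decoder.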
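To make the building blocks described above concrete, the following is a minimal NumPy sketch of one GRU step and one additive attention step. It is not the toolkit's implementation: the function names, the parameter dictionary keys, the omission of bias terms, and the exact gate formulation (a standard GRU variant) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step; `p` maps hypothetical names to weight matrices (no biases)."""
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)           # reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)           # update gate
    h_tilde = np.tanh(p["W"] @ x + r * (p["U"] @ h_prev))   # candidate state
    return (1.0 - z) * h_tilde + z * h_prev                 # interpolate old and new


def attention(s, H, W_a, U_a, v_a):
    """Additive attention of decoder state `s` over encoder states `H`.

    H has one row per source position. Returns (context vector, weights).
    """
    scores = np.tanh(H @ U_a.T + W_a @ s) @ v_a   # one score per source position
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ H, weights                   # weighted sum of encoder states
```

In a decoder with two GRU blocks, one would call `gru_step` on the previous word embedding to obtain the intermediate state, pass that state to `attention`, and call `gru_step` again with the context vector as input.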

A.2 Hyperparameter Settings

In all our experiments, word embeddings were 512d. We used Adam to optimize all parameters, with an initial learning rate of 0.0001. The dropout rate was set to 0.2. We set the mini-batch size to 60 to fit in GPU memory. In machine translation, the RNN was 1024d and the vocabulary size was 30k in each language, whereas in the dialog model, the RNN was 1000d and the vocabulary size was 50k. For prediction, beam search (beam size 12) was adopted to generate a translation or a reply.