Relevance-Promoting Language Model for Short-Text Conversation

11/26/2019 ∙ by Xin Li, et al. ∙ Tencent ∙ The Chinese University of Hong Kong

Despite the effectiveness of the sequence-to-sequence framework on the task of Short-Text Conversation (STC), the issue of under-exploitation of training data (i.e., the supervision signals from the query text are ignored) still remains unresolved. Also, the adopted maximization-based decoding strategies, which are inclined to generate generic responses or responses with repetition, are unsuited to the STC task. In this paper, we propose to formulate the STC task as a language modeling problem and tailor-make a training strategy to adapt a language model for response generation. To enhance generation performance, we design a relevance-promoting transformer language model, which performs additional supervised source attention after the self-attention to increase the importance of informative query tokens in calculating the token-level representation. The model further refines the query representation with relevance clues inferred from its multiple references during training. In testing, we adopt a randomization-over-maximization strategy to reduce the generation of generic responses. Experimental results on a large Chinese STC dataset demonstrate the superiority of the proposed model on relevance metrics and diversity metrics.





Short Text Conversation (STC) [40], also known as single-turn chit-chat conversation, is a popular research topic in the field of natural language processing. It is usually formulated as a sequence translation problem [38, 40], and the sequence-to-sequence encoder-decoder (Seq2Seq) framework [9, 41, 2] is applied to solve it. The decoder generates the response token by token, conditioned on the compressed query representations from the encoder. Following this paradigm, many attempts have been made to refine the quality of the generated responses [22, 45, 14, 42].

Despite the effectiveness of these efforts, some intrinsic issues of Seq2Seq-based models still hinder further improvement of generation performance. Under the Seq2Seq formulation, the auto-regressive decoder is only trained on the gold-standard response text while the query text is ignored, leading to under-exploitation of the training data. Besides, the maximization-based decoding strategies adopted in existing models, such as beam search and greedy search, restrict the search space to the most frequent phrases, and thus they tend to generate generic or repetitive responses with unnaturally high likelihood, degrading the conversational experience.

GPT-2 [37], a recently proposed Transformer-based language model, provides an alternative solution for language generation. One advantage of GPT-2 is that the transformer language model can not only capture the context of arbitrary length but also make full use of the textual supervision signals because the generator is actually the language model itself. Moreover, GPT-2 adopts top-k sampling [15] to diversify the generated texts while preserving the relevance. Obviously, these characteristics are attractive and meaningful for solving the STC task, whose aim is to generate informative and diverse human-like responses given the user queries.

However, due to the essence of language modeling, directly applying GPT-2 to the STC task, a conditional language generation task, may be insufficient because the language model is unable to discriminate between the source (query) sentence and the target (response) sentence. The original experimental results of GPT-2 on the abstractive summarization task [32] also verify this claim. Another potential issue of adapting a language model to the STC task comes from the recency bias [20] and explanation-away effects [48, 18], where the language model tends to rely overly on the immediate context and explain away the long-term context (long-term context in a language model is roughly equivalent to the source information in the Seq2Seq framework), yielding fluent but topically irrelevant responses.

Figure 1: Representations of the example input.

With the motivation of inheriting the merits of the transformer language model while alleviating the potential issues under the language model formulation, we carefully design a training strategy to adapt the auto-regressive transformer-based language model (without explicit specification, the language model in our paper refers to the "auto-regressive" language model, which is different from the "auto-encoding" language models [12, 13]) for conditional response generation. First of all, it is observed that a dialog conversation is actually a process of text continuation, in other words, giving the response right after the query. Based on this observation, we can regard the STC task as a language modeling problem on the concatenated sequence of query and response. To discriminate between the generation of query tokens and that of response tokens, we inject a special token between query and response, acting as the trigger of response generation. With this formulation, the language-model-based training objective can make use of the textual data from the query, alleviating the under-exploitation issue mentioned above.
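Concretely, the concatenation described above can be sketched as follows (a minimal illustration; the token strings [EOQ] and [EOS] follow the paper, while the example tokens are hypothetical):

```python
def build_sequence(query_tokens, response_tokens, eoq="[EOQ]", eos="[EOS]"):
    # [EOQ] separates the query from the response and acts as the trigger
    # of response generation; [EOS] marks the end of the whole sequence.
    return query_tokens + [eoq] + response_tokens + [eos]

seq = build_sequence(["what", "is", "your", "favorite", "fruit", "?"],
                     ["my", "favorite", "fruit", "is", "pineapple"])
# seq = ["what", ..., "?", "[EOQ]", "my", ..., "pineapple", "[EOS]"]
```

The resulting single sequence is what the language model is trained on, so the query tokens contribute supervision just like the response tokens.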

Since the transformer-based language model tends to focus on the short-term context and ignore the long-term context, namely the explanation-away issue, we propose to empower the self-attention with an encoder-decoder-style attention, which forces the model to pay additional attention to the query, especially the query tokens of user interest, and guides the model to rely on informative query tokens to make good predictions. It is also observed that some response tokens not mentioned in the query are still closely related to the topic discussed in the conversation. In order to exploit such relevance clues hidden behind the responses, we propose a topic inference component to learn a compact source (query) representation encoding the information relevant to the query, and feed this query representation into each generation step, encouraging the language model to consider the generation of topic words potentially related to the query.

As for the decoding strategy, different from the existing STC models, we propose to decode with a randomization-over-maximization method, namely top-k sampling, from the transformer language model to generate relevant responses with high originality.

In summary, our contributions are as follows:

We tailor-make a training strategy to adapt the transformer-based language model for the Short Text Conversation (STC) task.
We propose two components, namely the Supervised Source Attention (SSA) component and the Topic Inference (TI) component, to promote relevance modeling in the language-model-based response generator.
To the best of our knowledge, we are the first to introduce top-k sampling, a randomization-over-maximization strategy, for diverse response generation. (We notice that some concurrent works [4, 33, 49] adopted a strategy similar to ours after the submission.)

Figure 2: Overall architecture. The Topic Inference (TI) component on top of the transformer layers and the Supervised Source Attention (SSA) component inside the transformer layers are the proposed relevance-promoting components. Training losses are calculated on top of the obtained representation vectors.




Overview

In our language model formulation, each training query-response pair and the special tokens are concatenated into a single sequence $s = (s_1, \dots, s_T)$ of length $T$. $s_{1:m}$ corresponds to the query token sequence of length $m$ and $s_{m+1}$ is the special token [EOQ], denoting the end of the query. $s_{m+2:T-1}$ corresponds to the response and $s_T$ is [EOS], the end symbol of the whole sequence. The training objective of our model is to maximize the unconditional likelihood $p(s)$, similar to the existing language models [3, 29].

The architecture of our model is depicted in Fig 2, where $L$ decoder-only transformer layers [43] are involved (for the technical details of the transformer, we refer the reader to [43]). Different from the original transformer layer, which solely contains the self-attention component, the transformer layer in our model is further empowered with the proposed supervised source attention (SSA) component. The outputs of the $l$-th transformer layer are the contextualized token representations of size $d$, denoted as $H^l = (h^l_1, \dots, h^l_T)$. When predicting the tokens, a Topic Inference (TI) component is introduced to provide refined query representations encoding the topic information inferred from the references.

Language Model as Response Generator

To achieve the goal of adapting the language model to the STC task, we should carefully design a training strategy different from that in the Seq2Seq framework. Based on the observation that human conversation can be regarded as a process of text continuation (i.e., giving the response/answer right after the query/question), we concatenate the query token sequence and the response token sequence into a single sequence $s$ and formulate the STC task as a contextual text continuation problem. One input example of our model is illustrated in Fig 1. The training goal of the model is to minimize the joint negative log-likelihood over the whole sequence:

$$\mathcal{L}_{lm} = -\sum_{t=1}^{T} \log p(s_t \mid s_{<t}) \qquad (1)$$
Obviously, it is easy to bridge the gap between the task-specific training and the auto-regressive pre-training [35, 36, 37] because the formulations of their objectives are almost the same. Another advantage of this language model formulation is that it takes the likelihood of query tokens into consideration, which is ignored in existing works [40, 45]. Intuitively, the text generated by the language model is more fluent than that generated by the Seq2Seq framework because the generator of the language model (the language model itself) is trained not only on the response sentence but also on the query sentence.
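The objective can be illustrated as a plain negative log-likelihood accumulated over every position of the concatenated sequence, query tokens included (toy per-token probabilities stand in for a real model's outputs):

```python
import math

def joint_nll(per_token_probs):
    # Sum of -log p(s_t | s_<t) over the WHOLE concatenated sequence,
    # so the supervision signal from the query text is not discarded.
    return -sum(math.log(p) for p in per_token_probs)

loss = joint_nll([0.5, 0.25, 0.5])  # three toy model probabilities
```

In practice these probabilities come from the transformer's softmax at each step; the point here is only that no position is excluded from the loss.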

Relevance Modeling Component

The vanilla transformer decoder is equipped with self-attention [8, 24] and can theoretically capture context of arbitrary length. Given the input $s$, the contextualized representation $\tilde{h}^l_t$ ($1 \le l \le L$, $1 \le t \le T$) at the $t$-th time step is built as follows:

$$\tilde{h}^l_t = \text{Slf-Att}(Q^l_t, K^l_{\le t}, V^l_{\le t})$$

where Slf-Att is the self-attention layer (the symbols for the feed-forward layer and residual connections are not shown) and $\tilde{h}^l_t$ is the calculated attention vector. $Q$, $K$, $V$ respectively denote the query (here, the "query" refers to a real-valued vector, while the "query" in the STC task is a sentence), key and value in the self-attention layer. $K_{\le t}$ indicates the leftward elements and the same for $V_{\le t}$. Despite its capability of learning global dependency, the transformer-based language model still tends to rely overly on the short-term context and ignore the long-term context when predicting the next word, dubbed the explanation-away problem [18]. This problem is catastrophic for the STC task because the query acts as the long-term context in our language model formulation, and leaving out the query information is prone to generating content irrelevant to the query. Therefore, explicitly modeling the relevance and emphasizing the importance of the query are essential. In this paper, we propose two components, namely Supervised Source Attention (SSA) and Topic Inference (TI), to handle the explanation-away problem.
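The leftward-only attention pattern described above can be sketched in NumPy as follows (a single head and a single layer; the feed-forward sublayer and residual connections are omitted, as in the text):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    # Position t attends only to positions <= t, i.e. the leftward context.
    T, d = Q.shape
    scores = (Q @ K.T) / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                     # mask out future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(0)
out, attn = causal_self_attention(rng.standard_normal((3, 4)),
                                  rng.standard_normal((3, 4)),
                                  rng.standard_normal((3, 4)))
# attn is lower-triangular: each step sees only itself and the past
```

The upper-triangular mask is exactly what makes the model auto-regressive; the explanation-away tendency arises because nothing in this mechanism forces the model to keep attending to the distant (query) positions.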

Supervised Source Attention

In the existing Seq2Seq-based frameworks, incorporating the query/source information is achieved by applying encoder-decoder attention solely over the encoder hidden representations. Similarly, attending only to the long-term context of the language model is presumably beneficial for improving relevance. Therefore, we propose to introduce another source attention layer on top of the self-attention layer. The computational formula of the $t$-th ($t > m$) query-enhanced hidden representation is below:

$$h^l_t = \text{Src-Att}(\tilde{h}^l_t, K^l_{1:m}, V^l_{1:m})$$

Src-Att refers to our source attention layer on top of the self-attention layer. $a^l_t \in \mathbb{R}^m$ denotes the attention scores over the corresponding hidden representations of the query tokens. $\tilde{h}^l_t$ is the output of the Slf-Att layer, and $\tilde{h}^l_t$, $K^l_{1:m}$, $V^l_{1:m}$ are the corresponding query, key and value in the source attention. Note that we only additionally apply source attention when the current token is not a query token, i.e., $t > m$, and do nothing in the preceding steps. Learning word alignment from data is possible but may be inaccurate without any supervision or external knowledge [25, 30]; therefore, we employ keywords as the knowledge and enforce the source attention component to concentrate on the important query tokens. First of all, we perform max-over-time pooling over the attention vectors ($a^l_{m+1}, \dots, a^l_T$) and induce the vector $\hat{a} \in \mathbb{R}^m$ reflecting the salience scores of the query/source tokens:

$$\hat{a} = \text{maxpool}(a^l_{m+1}, \dots, a^l_T)$$

Then, given the query keyword indicator vector $k \in \{0,1\}^m$, we introduce an additional source attention loss into Eq (1):

$$\mathcal{L}_{ssa} = \lVert \hat{a} - k \rVert_2^2 \qquad (5)$$

Ideally, the generation process will rely on more important query tokens if the salience score $\hat{a}$ is closer to the keyword vector $k$.
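The max-over-time pooling step, and one plausible reading of the closeness-based supervision (the squared-error form is our assumption, not necessarily the paper's exact loss), can be sketched as:

```python
import numpy as np

def salience_scores(source_attn):
    # source_attn: (num_response_steps, num_query_tokens); each row is the
    # source-attention distribution over the query tokens at one decoding
    # step. Max-over-time pooling keeps, per query token, its highest
    # attention weight across all steps.
    return source_attn.max(axis=0)

attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.2, 0.1]])
sal = salience_scores(attn)                  # -> [0.7, 0.6, 0.2, 0.1]
keywords = np.array([1.0, 1.0, 0.0, 0.0])    # hypothetical keyword indicator
ssa_loss = float(((sal - keywords) ** 2).mean())
```

Minimizing such a loss pushes the pooled attention mass toward the annotated keyword positions, which is the stated intent of the SSA supervision.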

Topic Inference

The SSA component attempts to improve relevance by highlighting the important query tokens/words in the attention process. However, the range of words topically related to the query is far larger than that of the keywords explicitly mentioned in the query. Considering the query "what is your favorite fruit?" and two valid responses "I like the watermelon very much" and "My favorite fruit is pineapple", "fruit" should be emphasized during generation, but the words used to discuss fruit, such as "watermelon" and "pineapple", are also very meaningful for building a response. Inspired by this, we collect the multiple references of each query in the training set and gather all of the keywords extracted from such responses ([45] extend the keyword set using an external corpus; here, we focus on improving relevance rather than enriching the topical words in the response, thus we only utilize the training data to explore more keywords). To exploit the latent topic information, we introduce the Topic Inference (TI) component to estimate the global topical word distribution $\theta$ based on the query representation $f(s_{1:m})$ as follows:

$$\theta = \text{Softmax}(W_\theta f(s_{1:m}) + b_\theta) \qquad (6)$$

where $f(\cdot)$ denotes the function mapping the input query tokens to a low-dimensional query representation. Specifically, we feed the last query hidden representation in the transformer, namely $h^L_m$, into a linear layer with tanh activation and regard the output as the query representation, for simplicity of the modeling part. To encode the topic information into the query representation, we employ the global keyword indicator vector $\hat{k} \in \{0,1\}^{|V|}$ as supervision signals and enforce the components corresponding to keywords/important tokens in the query-based global topic distribution to be up-weighted. The computational formula is as follows:

$$\mathcal{L}_{ti} = -\sum_{i=1}^{|V|} \hat{k}_i \log \theta_i$$

where the subscript $i$ denotes the $i$-th component of a vector and $|V|$ is the vocabulary size. Note that we attempted to replace the Softmax in Eq 6 with the component-wise Sigmoid, typically used in multi-label classification problems, but the empirical results became worse. Thus, we keep the Softmax probability function unchanged in the experiments. Similar to Eq 5, $\mathcal{L}_{ti}$ will be added to the training loss.
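The TI projection of Eq 6 can be sketched as follows (weight names and sizes are illustrative):

```python
import numpy as np

def topic_distribution(query_repr, W, b):
    # Softmax over the vocabulary: a global topical word distribution
    # predicted from the compact query representation.
    logits = W @ query_repr + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = topic_distribution(rng.standard_normal(8),          # toy query repr
                           rng.standard_normal((100, 8)),   # |V| = 100 (toy)
                           rng.standard_normal(100))
```

Supervising this distribution with the keyword indicator up-weights vocabulary entries that appeared as keywords in the references, even when they never occur in the query itself.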

Different from [47] and [16], which regard a concrete topic/keyword as the trigger of generation, we introduce the query representation encoding the global topic information as a supplement to each token-level representation to encourage the generation of relevant topical words. The representation vector $z_t$ for predicting the output is calculated below:

$$z_t = g_t \odot h^L_t + (1 - g_t) \odot f(s_{1:m}), \quad g_t = \sigma\big(W_1 h^L_t + W_2 f(s_{1:m})\big) \qquad (7)$$

where $g_t$ is the gate value and $W_1$, $W_2$ are parameter matrices in the TI component.
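The gating step can be sketched with a standard element-wise gate (the exact parameterization is our assumption):

```python
import numpy as np

def gated_fusion(h_t, e_q, W1, W2):
    # g in (0, 1) decides, per dimension, how much of the token-level
    # hidden state vs. the topic-encoding query representation to keep.
    g = 1.0 / (1.0 + np.exp(-(W1 @ h_t + W2 @ e_q)))
    return g * h_t + (1.0 - g) * e_q

# zero weights -> gate is 0.5 everywhere -> an even blend of both inputs
z = gated_fusion(np.ones(4), np.zeros(4), np.zeros((4, 4)), np.zeros((4, 4)))
```

The gate lets the model fall back on the topic-encoding query representation exactly at the steps where the token-level context alone is uninformative.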

Model Training

The proposed SSA component and the TI component are jointly trained with the transformer-based language model. Based on Eq 1, Eq 5 and Eq 7, the overall training objective of the proposed model is as follows:

$$\mathcal{L} = \mathcal{L}_{lm} + \lambda_{ssa} \mathcal{L}_{ssa} + \lambda_{ti} \mathcal{L}_{ti}$$

Here, $\lambda_{ssa}$ and $\lambda_{ti}$ are the coefficients controlling the proportions of $\mathcal{L}_{ssa}$ and $\mathcal{L}_{ti}$ involved in the training, respectively.
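The combination is a simple weighted sum (the coefficient values below are placeholders):

```python
def overall_loss(lm_loss, ssa_loss, ti_loss, lam_ssa=0.5, lam_ti=0.5):
    # Language-model loss plus the two relevance-promoting auxiliary
    # losses, each scaled by its own coefficient.
    return lm_loss + lam_ssa * ssa_loss + lam_ti * ti_loss

total = overall_loss(1.0, 2.0, 4.0)  # -> 4.0 with the placeholder coefficients
```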


Decoding

Due to the limited search space, it is difficult for beam search or greedy search to find interesting and diverse responses. Therefore, we do not adopt them but a "randomization-over-maximization" strategy (also known as "top-k sampling") to perform the decoding, as done in [15, 37]. [18] and [19] explore the usage of other advanced decoding strategies in language generation tasks. Since our aim in this paper is not to compare performance across different decoding strategies, we consistently use top-k sampling.
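Top-k sampling itself is compact: truncate the next-token distribution to its k most probable entries, renormalize, and sample. A minimal sketch:

```python
import numpy as np

def top_k_sample(probs, k, rng):
    # Keep only the k highest-probability tokens, renormalize, sample.
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
token = top_k_sample(probs, k=2, rng=rng)   # always index 0 or 1 here
```

Unlike argmax-style search, repeated calls yield different tokens, which is the randomization-over-maximization property exploited for diversity.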


Experiment Setup

We utilize the benchmark STC dataset [26] to evaluate the effectiveness of the proposed relevance-promoting transformer language model. This dataset is built from real conversations on Weibo and contains about 7M high-quality query-response pairs. We split the dataset such that #train:#dev:#test is 7,024,156:2,000:800. Training details are provided in the appendix.

To avoid word segmentation errors and out-of-vocabulary issue, the tokens in our model and the baseline models are Chinese characters and the vocabulary size is about 12,000.

Evaluation Metrics

We introduce the following metrics to evaluate the model’s capability of generating relevant and diverse responses:

Relevance Metrics We employ Bleu-2, Bleu-3 & Bleu-4 [34] to estimate the relevance of the generated responses. Moreover, we also design two more metrics, namely Hit-Q and Hit-R, to calculate the hit rates of the topical words in the query and the response respectively. Firstly, we build a high-precision-low-recall keyword set for each query/response sentence based on a keyword extraction toolkit and filter out some noisy words based on additional hand-crafted rules. Then, we calculate the Hit-Q and Hit-R for the $i$-th prediction as follows:

$$\text{Hit-Q}_i = \frac{|K^q_i \cap K^{\hat{r}}_i|}{|K^q_i|}, \qquad \text{Hit-R}_i = \frac{|K^r_i \cap K^{\hat{r}}_i|}{|K^r_i|}$$

where $K^q_i$, $K^{\hat{r}}_i$ and $K^r_i$ respectively denote the topical word sets for the $i$-th query, predicted response and gold-standard response. Then we obtain the corpus-level Hit-Q and Hit-R by averaging over the $N$ test instances:

$$\text{Hit-Q} = \frac{1}{N}\sum_{i=1}^{N} \text{Hit-Q}_i, \qquad \text{Hit-R} = \frac{1}{N}\sum_{i=1}^{N} \text{Hit-R}_i$$
Diversity Metrics Following [22], we employ Dist-1 and Dist-2 to calculate the ratios of the distinct uni-grams and bi-grams in the generated responses.
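Both metric families can be sketched in a few lines (the overlap-ratio reading of Hit is our interpretation; the topical word sets would come from the keyword extraction step described above):

```python
def hit(reference_keywords, predicted_tokens):
    # Fraction of reference topical words appearing in the prediction
    # (one plausible reading of Hit-Q / Hit-R; Hit-Q uses the query's
    # topical word set, Hit-R the gold response's).
    if not reference_keywords:
        return 0.0
    return len(reference_keywords & set(predicted_tokens)) / len(reference_keywords)

def dist_n(responses, n):
    # Dist-n: number of distinct n-grams divided by the total number of
    # n-grams across all generated responses.
    ngrams = [tuple(toks[i:i + n])
              for toks in responses
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

h = hit({"fruit", "matcha"}, ["i", "like", "matcha"])               # -> 0.5
d2 = dist_n([["i", "like", "tea"], ["i", "like", "coffee"]], 2)     # -> 0.75
```

Corpus-level Hit is then the average of the per-instance values over the test set.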

Human Evaluations We also conduct human evaluations. Specifically, we randomly sample 100 queries and recruit five helpers to judge the Relevance (4-scale rating, 0-3), Fluency (3-scale rating, 0-2) and Acceptance (0 or 1) of the generated responses from our model and the baselines. Details of the rating criteria are stated in the appendix.

Comparison Models


  • LSTM-LM [27]: LSTM-based auto-regressive language model armed with incremental self-attention. We train LSTM-LM using the same strategy mentioned in this paper.

  • LSTM-S2S: Attention-based LSTM Sequence-to-Sequence model.

  • TFM-S2S: Transformer Sequence-to-Sequence model where the network components are identical to those in [43].

  • TFM-LM: Transformer-based auto-regressive language model. We train TFM-LM using the same strategy mentioned in this paper.

  • MMI [22]: LSTM-S2S with Maximum Mutual Information objective in decoding. In this paper, we set the number of responses for re-ranking as 50.

  • CVAE [50]: Conditional Variational Auto-Encoder for response generation. We replace the dialogue acts used in the original model with the keywords extracted from the references.

  • MMPMS [7]: The model with the state-of-the-art performance on the STC task. We re-run the officially released code to obtain the results on our dataset.


Model Relevance Diversity
Bleu-2 Bleu-3 Bleu-4 Hit-Q Hit-R Dist-1 Dist-2
LSTM-LM 3.8 0.9 0.3 0.084 0.066 0.028 0.094
LSTM-S2S 5.6 2.8 1.8 0.293 0.145 0.039 0.137
TFM-LM 6.9 3.2 2.1 0.295 0.144 0.058 0.259
TFM-S2S 7.3 3.5 2.3 0.369 0.172 0.078 0.290
MMI 7.9 2.5 1.0 0.197 0.145 0.093 0.349
CVAE 5.8 1.5 0.4 0.211 0.135 0.060 0.211
MMPMS 6.7 3.0 1.8 0.151 0.102 0.057 0.220
OURS-tk w/o SSA & TI 4.9 1.0 0.3 0.119 0.076 0.086 0.441
OURS-tk w/o SSA 5.5 2.1 1.5 0.150 0.146 0.102 0.521
OURS-tk w/o TI 5.1 2.1 1.4 0.171 0.132 0.090 0.445
OURS-bm 10.3 5.3 3.4 0.510 0.193 0.102 0.398
OURS-tk 6.0 3.6 2.5 0.191 0.152 0.107 0.544


Table 1: Experimental results on the automatic metrics. The best results are in bold.


Model Evaluation Metrics
Relevance Fluency Acceptance
LSTM-LM 1.206 1.297 0.26
LSTM-S2S 1.386 1.285 0.37
TFM-LM 1.412 1.328 0.39
TFM-S2S 1.475 1.306 0.43
MMI 1.432 1.301 0.34
CVAE 1.316 1.274 0.33
MMPMS 1.528 1.396 0.42
OURS-tk w/o SSA & TI 1.273 1.368 0.28
OURS-tk w/o SSA 1.485 1.407 0.39
OURS-tk w/o TI 1.503 1.303 0.36
OURS-bm 1.515 1.359 0.38
OURS-tk 1.606 1.346 0.44


Table 2: Human evaluation results with the best ones in bold.

Main Results

Tables 1 and 2 list the automatic evaluation results and the human evaluation results respectively. In terms of Bleu, the proposed model with beam search decoding, namely OURS-bm, consistently achieves the best scores. Besides, OURS-bm outperforms all compared models on the keyword-overlapping-based Hit metrics, suggesting that our model, armed with the Supervised Source Attention (SSA) component and the Topic Inference (TI) component, is beneficial for generating informative topical words related to the query. Surprisingly, OURS-bm also obtains better Dist metrics than the baseline models. After replacing beam search with top-k sampling, our model (OURS-tk) is further enhanced in diversity modeling, reaching 0.107 and 0.544 on Dist-1 and Dist-2 respectively.

Regarding the more reliable human evaluations, both OURS-bm and OURS-tk are top-ranked models. Specifically, despite its unsatisfactory results on the automatic Bleu and Hit metrics, OURS-tk performs the best on the manually annotated Relevance metric, with a 5% improvement over the current state-of-the-art MMPMS model. Meanwhile, OURS-bm, the best model on the automatic relevance metrics, still yields competitive results on Relevance. This is reasonable because some words not appearing in the query/references, especially infrequently used ones, are still related to the topic discussed in the conversation. At the same time, such inconsistency between automatic and human evaluations demonstrates the effectiveness of top-k sampling, a randomization-over-maximization decoding strategy, in discovering infrequent but meaningful patterns for the STC task.

We now turn to the performance of the other compared methods. Inheriting the powerful modeling capability of the Transformer, TFM-S2S obtains the best automatic relevance scores as well as the second-best Relevance among the baselines. TFM-LM, another Transformer-based baseline following the language model formulation in our paper, does not perform as well as TFM-S2S on all of the metrics except Fluency, verifying the postulation that the explanation-away issue of the language model tends to produce fluent but topically irrelevant responses. Despite this, TFM-LM outperforms LSTM-LM and LSTM-S2S, proving the superiority of the Transformer over the LSTM in response generation. Owing to the re-ranking mechanism, the MMI model is the strongest baseline on diversity modeling, but OURS-bm/OURS-tk still achieves approximately 14%/55% improvement on Dist-2.

Ablation Study

In order to track the source of the performance gains, we also conduct an ablation study on OURS-tk. The corresponding automatic and human evaluation results are shown in the second group of Table 1 and Table 2. As expected, the model without the relevance-promoting design, i.e., OURS-tk w/o SSA & TI, is the worst on the relevance metrics. OURS-tk w/o SSA and OURS-tk w/o TI, the variants incorporating either TI or SSA for relevance modeling, boost the Relevance score by 17% and 18% respectively. Although they are comparable on the relevance metrics, the former achieves higher diversity scores (Dist-2: 0.521 vs. 0.441). We attribute this phenomenon to the TI component, which exploits more of the related topical words mentioned in the multiple references. With the help of both the SSA component and the TI component, OURS-tk becomes the best model on the Relevance and Dist metrics, demonstrating the necessity of relevance modeling for the transformer language model. Another interesting finding is that the SSA component decreases the Fluency score (see the results of OURS-tk w/o TI), which indicates that fighting the explanation-away issue by incorporating additional query context may come at the cost of corrupting the language model.

Figure 3: Examples of response generation. We translate Chinese samples to English.

Case Study

Figure 3 shows example responses generated by our model and the most competitive baseline models. OURS-tk, which explicitly incorporates the query context and exploits the tokens potentially related to the query, consistently produces meaningful and informative responses. Taking Queries #1 & #2 as examples, the generated responses accurately respond to the query because they mention "flower ladder"/"matcha" and "cream", which are exactly the topics discussed in the conversations. The response for Query #3 can easily engage the user in the conversation and thus is also a meaningful prediction. The outputs of TFM-LM are generally fluent. However, due to the explanation-away issue, TFM-LM tends to generate irrelevant responses (Case #1) or responses with phrase repetition (Case #2). Under the sequence-to-sequence formulation, TFM-S2S obtains responses moderately related to the corresponding queries, although the third output, directly copying part of the source text (i.e., the query), is still unsatisfactory. MMPMS and MMI, the models aiming to promote diversity, sometimes yield irrelevant responses.

Further Discussions on Top-k Sampling

We further investigate the impact of top-k sampling on the STC models. Firstly, we conduct additional automatic and human evaluations on the baseline models, with results shown in Table 3. As can be seen, top-k sampling consistently improves the Dist-2 score by a large margin on all models, but the Relevance scores of LSTM-S2S, TFM-LM and TFM-S2S decrease after top-k sampling is applied. The variation trends of Fluency across the evaluated models are also inconsistent. These observations suggest that top-k sampling is simple yet effective for achieving diverse response generation, but it should be utilized carefully because of its uncertain effect on relevance and fluency.

As discussed in the Case Study, the transformer-based models adopting beam search tend to generate responses with repetition and responses directly copying the query. We here investigate whether top-k sampling can help solve these issues. Figure 4 depicts the ratios of responses in the test set falling into phrase repetition and query copy. Top-k sampling greatly reduces the query copy rate (by about 72% on average) and almost eliminates the phrase repetition phenomenon in the Transformer-based models. However, note that Table 3 shows both TFM-LM and TFM-S2S perform worse on Relevance after using top-k sampling. We consider these results consistent with human perception because enriching the surface forms via a sampling-based decoding strategy inevitably introduces irrelevant information, leading to a degraded relevance score. Notably, the proposed model (i.e., OURS) is not affected on relevance modeling due to its capability of filtering out some topically irrelevant candidates before the sampling process.
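The two heuristics from Figure 4 (a longest common sub-string longer than 4, and a word looped more than 3 times) can be sketched as follows (character-level matching and a single-token repetition check are simplifying assumptions):

```python
def is_query_copy(response, query, threshold=4):
    # True if the longest common sub-string of response and query is
    # longer than `threshold` characters.
    best = 0
    for i in range(len(response)):
        for j in range(len(query)):
            k = 0
            while (i + k < len(response) and j + k < len(query)
                   and response[i + k] == query[j + k]):
                k += 1
            best = max(best, k)
    return best > threshold

def has_repetition(tokens, max_loops=3):
    # True if any token repeats more than `max_loops` times in a row.
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if prev == cur else 1
        if run > max_loops:
            return True
    return False

copied = is_query_copy("i love the flower ladder", "the flower ladder")  # -> True
```

Counting how many test-set responses trigger each predicate yields the ratios plotted in Figure 4.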

Models Relevance (Δ) Fluency (Δ) Dist-2 (Δ)
LSTM-LM-tk 1.111 (-0.09) 1.270 (-0.03) 0.383 (+0.29)
LSTM-S2S-tk 1.439 (+0.05) 1.265 (-0.20) 0.490 (+0.35)
TFM-LM-tk 1.273 (-0.14) 1.368 (+0.04) 0.441 (+0.18)
TFM-S2S-tk 1.270 (-0.15) 1.321 (+0.15) 0.507 (+0.22)
OURS-tk 1.606 (+0.10) 1.346 (-0.13) 0.544 (+0.20)
Table 3: Experimental results of the models adopting top-k sampling. Δ refers to the improvement over the original model adopting beam search. The best results are in bold.
Figure 4: Comparison results of beam search and top-k sampling. Specifically, if the length of the longest common sub-string between response and query is larger than 4, the response is regarded as a "copy" of the query. If a response contains a word/phrase looped more than 3 times, it is regarded as a response with repetition.

Related Work

Short Text Conversation Short Text Conversation (STC) is usually formulated as a conditional text generation task [40, 39]. The sequence-to-sequence (Seq2Seq) encoder-decoder framework [9, 41, 2] and its variants have been studied extensively for solving this task. Li et al. [22] introduce diversity-promoting decoding strategies into the Seq2Seq model. Some works [31, 45, 47, 51, 16] attempt to guide the Seq2Seq model to generate keyword/topic-aware responses, while others [44, 5, 6] try to control the response generation with additional retrieved data. Advanced techniques such as RL, GAN and VAE have also been considered for improving the conversational experience [23, 46, 14, 17].

Transformer-based Language Model Deep transformer-based architectures [43] have led to significant performance gains on the language modeling task [1, 10, 37] compared to the existing CNN/RNN-based architectures [11, 29, 28]. Meanwhile, GPT-2 [37] and UniLM [13] are pioneering works in adapting the transformer language model to conditional text generation tasks.


Conclusion

In this paper, we present a language-model-based solution instead of the traditional Seq2Seq paradigm for handling Short-Text Conversation (STC). We first tailor-make a training strategy to adapt the language model to the STC task. Then, we propose a relevance-promoting transformer language model to distill the relevance clues from the query as well as the topics inferred from the references, and incorporate them into the generation. Moreover, we explore the usage of top-k sampling for the STC task to further improve response diversity. Experimental results on a large-scale STC dataset validate that our model is superior to the compared models on both relevance and diversity, in both automatic and human evaluations.


  • [1] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones (2019) Character-level language modeling with deeper self-attention. In AAAI, pp. 3159–3166.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
  • [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. JMLR 3 (Feb), pp. 1137–1155.
  • [4] P. Budzianowski and I. Vulić (2019) Hello, it's GPT-2 – how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.
  • [5] D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, W. Lam, and S. Shi (2019) Skeleton-to-response: dialogue generation guided by retrieval memory. In NAACL, pp. 1219–1228.
  • [6] D. Cai, Y. Wang, W. Bi, Z. Tu, X. Liu, and S. Shi (2019) Retrieval-guided dialogue response generation via a matching-to-generation framework. In EMNLP, pp. 1866–1875.
  • [7] C. Chen, J. Peng, F. Wang, J. Xu, and H. Wu (2019) Generating multiple diverse responses with multi-mapping and posterior mapping selection. arXiv preprint arXiv:1906.01781.
  • [8] J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. In EMNLP, pp. 551–561.
  • [9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734.
  • [10] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In ACL.
  • [11] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In ICML.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
  • [13] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • [14] J. Du, W. Li, Y. He, R. Xu, L. Bing, and X. Wang (2018) Variational autoregressive decoder for neural response generation. In EMNLP, pp. 3154–3163.
  • [15] A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, pp. 889–898.
  • [16] J. Gao, W. Bi, X. Liu, J. Li, and S. Shi (2019) Generating multiple diverse responses for short-text conversation. In AAAI.
  • [17] J. Gao, W. Bi, X. Liu, J. Li, G. Zhou, and S. Shi (2019) A discrete CVAE for response generation on short-text conversation. In EMNLP, pp. 1898–1908.
  • [18] A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • [19] D. Ippolito, R. Kriz, J. Sedoc, M. Kustikova, and C. Callison-Burch (2019) Comparison of diverse decoding methods from conditional language models. In ACL, pp. 3752–3762.
  • [20] U. Khandelwal, H. He, P. Qi, and D. Jurafsky (2018) Sharp nearby, fuzzy far away: how neural language models use context. In ACL, pp. 284–294.
  • [21] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • [22] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In NAACL, pp. 110–119.
  • [23] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In EMNLP, pp. 1192–1202. Cited by: Related Work.
  • [24] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In ICLR, Cited by: Relevance Modeling Component.
  • [25] L. Liu, M. Utiyama, A. Finch, and E. Sumita (2016) Neural machine translation with supervised attention. In COLING, Cited by: Supervised Source Attention.
  • [26] Y. Liu, W. Bi, J. Gao, X. Liu, J. Yao, and S. Shi (2018) Towards less generic responses in neural conversation models: a statistical re-weighting method. In EMNLP, pp. 2769–2774. Cited by: Experiment Setup, Human Evaluations.
  • [27] H. Mei, M. Bansal, and M. R. Walter (2017) Coherent dialogue with attention-based language models. In AAAI, Cited by: 1st item.
  • [28] G. Melis, C. Dyer, and P. Blunsom (2018) On the state of the art of evaluation in neural language models. In ICLR, Cited by: Related Work.
  • [29] S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing lstm language models. In ICLR, Cited by: Overview, Related Work.
  • [30] H. Mi, Z. Wang, and A. Ittycheriah (2016) Supervised attentions for neural machine translation. In EMNLP, pp. 2283–2288. Cited by: Supervised Source Attention.
  • [31] L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin (2016) Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. In COLING, Cited by: Related Work.
  • [32] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gu̇lçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL, pp. 280–290. Cited by: Introduction.
  • [33] O. Olabiyi and E. T. Mueller (2019) DLGNet: a transformer-based model for dialogue response generation. arXiv preprint arXiv:1908.01841. Cited by: footnote 4.
  • [34] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: Evaluation Metrics.
  • [35] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, pp. 2227–2237. Cited by: Language Model as Response Generator.
  • [36] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. External Links: Link Cited by: Language Model as Response Generator.
  • [37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: Introduction, Language Model as Response Generator, Decoding, Related Work.
  • [38] A. Ritter, C. Cherry, and W. B. Dolan (2011) Data-driven response generation in social media. In EMNLP, pp. 583–593. Cited by: Introduction.
  • [39] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Cited by: Related Work.
  • [40] L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In ACL, pp. 1577–1586. Cited by: Introduction, Language Model as Response Generator, Related Work.
  • [41] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS, pp. 3104–3112. Cited by: Introduction, Related Work.
  • [42] Z. Tian, W. Bi, X. Li, and N. L. Zhang (2019) Learning to abstract for memory-augmented conversational response generation. In ACL, pp. 3816–3825. Cited by: Introduction.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: Overview, 3rd item, Related Work, footnote 5.
  • [44] Y. Wu, F. Wei, S. Huang, Y. Wang, Z. Li, and M. Zhou (2019) Response generation by context-aware prototype editing. In AAAI, Cited by: Related Work.
  • [45] C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W. Ma (2017) Topic aware neural response generation. In AAAI, Cited by: Introduction, Language Model as Response Generator, Related Work, footnote 8.
  • [46] Z. Xu, B. Liu, B. Wang, C. Sun, X. Wang, Z. Wang, and C. Qi (2017) Neural response generation via GAN with an approximate embedding layer. In EMNLP, pp. 617–626. Cited by: Related Work.
  • [47] L. Yao, Y. Zhang, Y. Feng, D. Zhao, and R. Yan (2017) Towards implicit content-introducing for generative short-text conversation systems. In EMNLP, pp. 2190–2199. Cited by: Topic Inference, Related Work.
  • [48] L. Yu, P. Blunsom, C. Dyer, E. Grefenstette, and T. Kocisky (2017) The neural noisy channel. In ICLR, Cited by: Introduction.
  • [49] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2019) DialoGPT: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: footnote 4.
  • [50] T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, pp. 654–664. Cited by: 6th item.
  • [51] G. Zhou, P. Luo, R. Cao, F. Lin, B. Chen, and Q. He (2017) Mechanism-aware neural machine for dialogue response generation. In AAAI, Cited by: Related Work.


Training Details

Our model consists of 6 decoder-only transformer layers with masked self-attention, where the hidden size, number of attention heads, and feed-forward size are 512, 8, and 1024, respectively. The weights of the two auxiliary loss terms are set to 1.0 and 0.2. We do not introduce pre-trained word/character embeddings but randomly initialize the parameters of the token embedding layer. We employ Adam [21]

as the optimizer with an initial learning rate of 1e-4. We apply linear warm-up over the first 10,000 training steps. The batch size is 32 and we train the model for up to 20 epochs. We evaluate the model every 30,000 steps and select the checkpoint that performs best on the validation set for producing the final results.
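The learning-rate schedule above can be sketched as follows. The post-warm-up behavior is an assumption: here the rate simply stays at the base value, since no decay rule is specified.

```python
def linear_warmup_lr(step, base_lr=1e-4, warmup_steps=10000):
    """Linear warm-up: ramp the learning rate from 0 to `base_lr`
    over the first `warmup_steps` updates, then hold it constant
    (whether it decays afterwards is not specified in the text)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```

In practice such a schedule would be plugged into the optimizer via a per-step callback (e.g., a `LambdaLR`-style scheduler in PyTorch).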

Human Evaluations

Apart from automatic evaluations, we also conduct human evaluations. Specifically, we randomly sample 100 queries and recruit five helpers to judge the Relevance, Fluency and Acceptance of the responses generated by our model and the baselines. The rating criteria, identical to those in [26], are as follows:

Relevance: +3: relevant as well as interesting; +2: relevant, including the generic responses; +1: relevant at a distant level; 0: not relevant at all.
Fluency: +2: fluent; +1: readable but with some grammar mistakes; 0: unreadable.
Acceptance: the ratio of acceptable responses. Specifically, an acceptable response refers to a response whose Relevance and Fluency scores both reach the corresponding acceptable thresholds.

Obtaining Informative Query Words

Building the supervision signals in Eq 5 relies on the informative words of each query. The basic idea is that a query word having a strong semantic relation with the corresponding response should be regarded as an informative word. The procedure is as follows:

  1. Use a keyword extractor (here, the jieba keyword extraction toolkit) to obtain the keywords for each response in the training set.

  2. Define the semantic relation score between a query word and the response as the maximal point-wise mutual information (PMI) between the query word and the response keywords.

  3. Select the top-ranking query words in terms of the calculated semantic relation scores as the informative words.
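The steps above can be sketched as follows, assuming the co-occurrence statistics (`pair_count`, `word_count`, `total_pairs`) have already been collected from the training corpus; `top_n` is an illustrative cutoff, not the paper's setting:

```python
import math

def informative_query_words(query_words, response_keywords, pair_count,
                            word_count, total_pairs, top_n=3):
    """Rank query words by their maximal PMI with any response keyword.

    `pair_count[(q, r)]` counts query-word/keyword co-occurrences,
    `word_count[w]` counts individual word occurrences (both assumed
    pre-collected); words never co-occurring get a score of -inf.
    """
    def pmi(q, r):
        joint = pair_count.get((q, r), 0) / total_pairs
        if joint == 0:
            return float("-inf")
        p_q = word_count[q] / total_pairs
        p_r = word_count[r] / total_pairs
        return math.log(joint / (p_q * p_r))

    # Score each query word by its best-matching response keyword (step 2),
    # then keep the top-ranking words as the informative words (step 3).
    scored = [(max(pmi(q, r) for r in response_keywords), q)
              for q in query_words]
    scored.sort(reverse=True)
    return [q for _, q in scored[:top_n]]
```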

Obtaining Response Keywords

The proposed Topic Inference (TI) component aims to refine the query representation with the knowledge inferred from response keywords. First of all, we employ the jieba keyword extraction toolkit to collect the response keywords. Since one query may correspond to multiple references (i.e., the one-to-many phenomenon), we aggregate the keyword sets of the multiple responses corresponding to the same query. Then, we randomly sample 80% of the keywords in the aggregated set and regard them as the relevant response keywords (in Eq 7) associated with each training instance.
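The aggregate-then-sample step can be sketched as follows, with `responses_keywords` holding one keyword set per reference of the same query:

```python
import random

def sample_relevant_keywords(responses_keywords, ratio=0.8, rng=None):
    """Aggregate the keyword sets of all references for one query,
    then randomly keep `ratio` of the aggregated keywords as the
    relevant response keywords for a training instance."""
    rng = rng or random.Random()
    pool = sorted(set().union(*responses_keywords))  # aggregate + dedupe
    k = max(1, int(len(pool) * ratio))
    return set(rng.sample(pool, k))
```

Re-sampling per instance (rather than fixing one subset) acts as a light regularizer on the inferred topic knowledge.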

Obtaining Keywords for Evaluation

As mentioned in the Experiment part, calculating the Hit-Q and Hit-R metrics requires building a high-precision, low-recall keyword set for each query/response sentence. We first employ the jieba keyword extraction toolkit to obtain an initial keyword set for each query/response. Then, we apply the following rules to guarantee the precision of the obtained query/response keywords:

Remove the stop words in the initial keyword set.
Filter out a keyword if its Part-of-Speech tag does not belong to {N, NS, VN, V, F}.
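The two rules above can be sketched as follows, assuming `pos_tags` maps each keyword to a pre-computed POS tag (e.g., produced by jieba's POS tagger):

```python
def filter_keywords(keywords, stop_words, pos_tags,
                    allowed=frozenset({"N", "NS", "VN", "V", "F"})):
    """Apply the two precision rules: drop stop words, and drop any
    keyword whose POS tag falls outside the allowed tag set."""
    return [w for w in keywords
            if w not in stop_words and pos_tags.get(w) in allowed]
```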

Additional Details of Experiment

For the automatic evaluation results in Table 1, Bleu and Dist are character-level metrics while Hit scores are calculated using the word-based overlapping statistics.
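For illustration, a character-level Dist-n score of the kind mentioned above can be computed as the ratio of unique character n-grams to all character n-grams over the generated responses (a common definition; the paper's exact implementation may differ):

```python
def distinct_n(texts, n=1):
    """Character-level Dist-n: unique character n-grams divided by the
    total number of character n-grams across all generated responses."""
    ngrams = [t[i:i + n] for t in texts for i in range(len(t) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Higher values indicate more diverse (less repetitive) generations.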