Open-domain dialogue systems play an important role in communication between humans and computers. It has always been a big challenge to build intelligent agents that can carry out fluent open-domain conversations with people. In the early decades, people built open-domain chatbots with large numbers of human-designed rules. Recently, with the accumulation of data and advances in neural network technology, neural open-domain dialogue systems have attracted increasing attention and achieved promising results [12, 14].
The sequence-to-sequence (Seq2Seq) architecture has been empirically proven to be quite effective in building open-domain dialogue systems, as it directly learns a mapping function between the input and output utterances in a pure end-to-end manner. However, Seq2Seq models tend to generate generic and less informative sentences such as I'm not sure and I don't know. Many methods have been proposed to alleviate the problem, such as improving the training objective function, leveraging latent variables in the decoding procedure, and using boosting to improve response diversity. However, the existing methods generate dialogue responses in one step: the decoder predicts the main idea of the response and organizes it into natural sentences at the same time, which makes it hard for the model to generate coherent and fluent dialogue responses.
Intuitively, when someone prepares to say something in a conversation, he usually first conceives an outline or some keywords of what he wants to say, and then expands them into grammatical sentences. As Table 1 shows, the person wants to refuse the invitation to the party, so he first prepares the reason in his mind, which is represented by the keywords, and then organizes the keywords into fluent and natural sentences. If a dialogue system explicitly models these two steps of human dialogue, the generated responses will be more specific and informative than the responses of traditional models.
Table 1 (example): Keywords: sorry; want; but; finish; paper; weekend;
In this paper, we propose a novel Keywords-guided Sequence-to-Sequence model (KW-Seq2Seq), which uses keywords information as guidance to generate more meaningful and informative dialogue responses. Besides the standard encoder and decoder components of conventional Seq2Seq models, KW-Seq2Seq has an additional pair of components to deal with keywords information: the keywords decoder and the keywords encoder. After the dialogue context is mapped to its hidden representation, the keywords decoder first predicts some keywords from it, and the keywords encoder re-encodes the generated keywords to obtain the keywords hidden representation. The hidden representations of the dialogue context and the keywords are concatenated to decode the final response.
In order to obtain the training keywords of each response, we calculate the TF-IDF value of each token in the response utterances. The tokens with high TF-IDF values are chosen as the keywords of the response. We apply an additional keywords loss to the output of the keywords decoder so that the generated keywords capture the main idea of the response to be generated. Moreover, we use a cosine annealing mechanism to help the response decoder learn to leverage the keywords information: the inputs of the keywords encoder are switched gradually from the ground truth to the generated keywords, so the response decoder learns to incorporate keywords information into responses and keeps this ability at test time. We conduct experiments on a popular open-domain dialogue dataset. The results of both automatic evaluation and human judgment show that KW-Seq2Seq can generate appropriate keywords that capture the main idea of the responses and leverage the keywords well to generate more informative, coherent, and topic-aware response sentences.
2 Related Work
There have been many methods proposed to alleviate the generic response problem of sequence-to-sequence dialogue models. Li et al. [mmi] use Maximum Mutual Information (MMI) as the training objective to strengthen the relevance between the dialogue post and response. Li et al. [li-beam] propose a beam search decoding algorithm that encourages the model to choose hypotheses from diverse parents in the search tree and penalizes tokens sharing the same parent node. There is also research utilizing latent variables to improve the diversity of responses. Shen et al. [shen-vae] build a conditional variational dialogue model that generates specific dialogue responses based on the dialogue context and a stochastic latent variable. Zhao et al. [zhao-vae] capture dialogue discourse-level diversity by using latent variables to learn a distribution over potential conversational intents, as well as integrating linguistic prior knowledge into the model.
Other research tries to leverage keywords to improve the quality of responses in generative dialogue systems. Xing et al. [xing-topic] propose a topic-aware sequence-to-sequence (TA-Seq2Seq) model, which uses an extra pre-trained LDA topic model to generate the topic keywords of the input messages and decodes responses with a joint attention mechanism over the input messages and topic keywords. Recently, Tang et al. [tang-target] proposed using keywords to guide the direction of the dialogue: for each dialogue turn, the model predicts one word as the keyword and uses it to form a whole sentence. Unlike the models mentioned above, our KW-Seq2Seq model can predict any number of keywords to capture the main idea of the response sentences and can be trained in an end-to-end manner without any outside auxiliary model, making better use of keywords information to produce responses of higher quality.
3 Sequence-to-Sequence Model
We use the Transformer as the encoder and decoder of the baseline sequence-to-sequence model and name them the context encoder and the response decoder. The context encoder transforms the dialogue context into its hidden representation, and the response decoder generates the response utterance conditioned on it.
3.1 Context Encoder
All the utterances in the dialogue context are concatenated and fed into the context encoder. The context encoder consists of $N$ layers of residual multi-head self-attention with feed-forward connections. The $l$-th layer of the context encoder obtains its hidden states $H^l$ by the following operations:

$$\bar{H}^l = \mathrm{LN}\big(H^{l-1} + \mathrm{Attn}_l(H^{l-1})\big)$$
$$H^l = \mathrm{LN}\big(\bar{H}^l + \mathrm{FFN}_l(\bar{H}^l)\big)$$

where $\mathrm{LN}$ is the layer normalization, and $\mathrm{Attn}_l$ and $\mathrm{FFN}_l$ are the self-attention and fully connected sub-layers in encoder layer $l$.
The self-attention sub-layer consists of $h$ attention heads that perform the multi-head self-attention operation. For each attention head $i$, the hidden states $H^{l-1}$ from the last layer are projected to the query, key, and value matrices $Q_i$, $K_i$, $V_i$ separately. They have the same size of $n \times d_k$, where $n$ is the number of tokens in the input sequence and $d_k$ is the per-head hidden dimension. We multiply $Q_i$ and $K_i^\top$ to get the attention weight matrix and then scale each weight element by dividing by the square root of the hidden states dimension $d_k$. Finally, we normalize the weights with the softmax function and multiply by $V_i$ to get the self-attended token representation $A_i$:

$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$
The outputs of all attention heads are concatenated, and a linear transformation is applied to get the result of the self-attention sub-layer.
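As a concrete illustration, the multi-head self-attention computation above can be sketched in NumPy. The projection matrices, dimensions, and head count here are illustrative placeholders, not the model's actual parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention over hidden states H of shape (n, d)."""
    n, d = H.shape
    dk = d // num_heads
    Q, K, V = H @ Wq, H @ Wk, H @ Wv              # each (n, d)
    outputs = []
    for i in range(num_heads):
        q = Q[:, i * dk:(i + 1) * dk]             # (n, dk) slice for head i
        k = K[:, i * dk:(i + 1) * dk]
        v = V[:, i * dk:(i + 1) * dk]
        # Scaled dot-product attention: softmax(Q K^T / sqrt(dk)) V
        outputs.append(softmax(q @ k.T / np.sqrt(dk)) @ v)
    # Concatenate heads and apply the output projection.
    return np.concatenate(outputs, axis=-1) @ Wo
```

The output has the same shape as the input hidden states, so the layer can be stacked with the residual and layer-normalization steps described above.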
3.2 Response Decoder
The architecture of the response decoder is similar to that of the encoder, with two differences: 1) a triangle mask is added to the self-attention sub-layer, and 2) an additional cross-attention sub-layer is appended right after each self-attention sub-layer. We represent the self-attention sub-layer with triangle mask as $\mathrm{MaskAttn}_l$, the cross-attention sub-layer as $\mathrm{CrossAttn}_l$, and the hidden states from the last layer of the context encoder as $H^{enc}$. The operations in decoder layer $l$ are as follows:

$$\bar{S}^l = \mathrm{LN}\big(S^{l-1} + \mathrm{MaskAttn}_l(S^{l-1})\big)$$
$$\tilde{S}^l = \mathrm{LN}\big(\bar{S}^l + \mathrm{CrossAttn}_l(\bar{S}^l, H^{enc})\big)$$
$$S^l = \mathrm{LN}\big(\tilde{S}^l + \mathrm{FFN}_l(\tilde{S}^l)\big)$$
During the training process, we must ensure that the $t$-th decoding token can only attend to the first $t$ tokens of the output sequence. Therefore, in the self-attention sub-layer of the response decoder, we add a triangular mask matrix $M$ to the attention weights before the softmax operation. $M$ has value $0$ in all elements on and below its diagonal and $-\infty$ in all elements above the diagonal, so all the attention weights above the diagonal become $0$ after the softmax operation. The masked self-attention operation is as follows:

$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}} + M\right) V_i$$
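A minimal NumPy sketch of the triangular mask and its effect on the attention weights (the scores here are random stand-ins for the scaled dot products):

```python
import numpy as np

def causal_mask(n):
    # 0 on and below the diagonal, -inf strictly above it.
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores):
    # Add the mask, then row-wise softmax; future positions get weight 0.
    masked = scores + causal_mask(scores.shape[0])
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

After the softmax, every row is a valid probability distribution and all entries above the diagonal are exactly zero, so token $t$ can only aggregate information from tokens $1, \dots, t$.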
In the cross-attention sub-layer, we use the hidden states $H^{enc}$ of the input sequence to produce the key and value matrices $K_i$ and $V_i$, so the information of the input sequence is aggregated into the decoding procedure.
With the masked self-attention and cross-attention, the response decoder generates each token of the output sequence conditioned on the input sequence and the previously generated output tokens:

$$P(Y \mid X) = \prod_{t=1}^{|Y|} P(y_t \mid y_{<t}, X)$$
4 Keywords-Guided Sequence-to-Sequence Model
The Keywords-guided Sequence-to-Sequence (KW-Seq2Seq) model adds a keywords decoder and a keywords encoder to the sequence-to-sequence framework. The overall architecture of KW-Seq2Seq is shown in Figure 1. The keywords decoder generates keywords from the dialogue context hidden states, and the keywords encoder maps the generated keywords to their hidden representation to guide the generation of the final response. We also propose a cosine annealing mechanism to help the model learn to better leverage keywords when generating responses.
4.1 Keywords Decoder and Keywords Encoder
The architectures of the keywords decoder and keywords encoder are the same as those of the response decoder and context encoder. With the dialogue context hidden states $H^{ctx}$ as input, the keywords decoder generates the keywords $K = (k_1, \dots, k_m)$ of the response utterance:

$$P(K \mid X) = \prod_{t=1}^{m} P(k_t \mid k_{<t}, H^{ctx})$$
We calculate the cross entropy between the ground truth and generated keywords, which equals the negative log-likelihood:

$$\mathcal{L}_{kw} = -\sum_{t=1}^{m} \log P(k_t \mid k_{<t}, H^{ctx})$$

The ground truth keywords are selected from the response utterance in advance to represent the response's main idea, so the keywords loss guides the keywords decoder to learn to predict the words that represent the key idea of the response to be generated.
In order to sample the predicted keywords while maintaining differentiability in the training procedure, we resort to Gumbel-Softmax, a differentiable surrogate for the argmax function. The probability distribution of the $t$-th keyword is:

$$p_t = \mathrm{softmax}\left(\frac{\log \pi_t + g}{\tau}\right)$$

where $\pi_t$ represents the probabilities of the original categorical distribution, $g$ are i.i.d. samples drawn from the Gumbel$(0, 1)$ distribution (if $u \sim \mathrm{Uniform}(0, 1)$, then $-\log(-\log u) \sim \mathrm{Gumbel}(0, 1)$), and $\tau$ is a constant temperature that controls the smoothness of the distribution. When $\tau \to 0$, Gumbel-Softmax behaves like argmax, while when $\tau \to \infty$, it approaches a uniform distribution.
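The sampling step can be sketched in NumPy for a single token position; the real model applies it to the keywords decoder's output logits over the vocabulary:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a differentiable (soft) sample from a categorical distribution."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))           # Gumbel(0, 1) noise
    y = (logits + g) / tau            # temperature-scaled perturbed logits
    e = np.exp(y - y.max())           # stable softmax
    return e / e.sum()
```

With a small temperature the output is nearly one-hot (argmax-like); with a large temperature it flattens toward uniform, matching the limiting behavior described above.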
The generated keywords are then encoded by the keywords encoder to obtain their hidden representation $H^{kw}$. The context hidden states and keywords hidden states are concatenated as $[H^{ctx}; H^{kw}]$ and fed into the response decoder to produce the final dialogue response $Y$.
Finally, we calculate the negative log-likelihood (cross entropy) loss $\mathcal{L}_{resp}$ of the responses and sum the keywords loss and response loss weighted by $\alpha$ and $\beta$ to obtain the final training loss:

$$\mathcal{L} = \alpha \mathcal{L}_{kw} + \beta \mathcal{L}_{resp}$$
4.2 The Cosine Annealing Mechanism
Although we feed the hidden states of the generated keywords to the response decoder, we cannot guarantee that it makes good use of the keywords information and generates responses related to the keywords. To tackle this problem, we propose the cosine annealing mechanism, which guides the response decoder to better leverage the keywords information.
In the training stage, we feed the ground truth keywords to the keywords encoder with probability $p$ and the generated keywords with probability $1 - p$. The initial value of $p$ is $1$, and as training progresses we gradually decrease $p$ to $0$ with a cosine function. Formally, letting $s$ denote the current training epoch and $[s_0, s_1]$ the annealing interval, the relation between the probability $p$ and the training progress is:

$$p = \begin{cases} 1 & s < s_0 \\ \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{s - s_0}{s_1 - s_0}\right)\right) & s_0 \le s \le s_1 \\ 0 & s > s_1 \end{cases}$$
At the beginning of the training procedure ($p = 1$), the performance of the keywords decoder is quite low, so we only feed the ground truth keywords to the keywords encoder, so that the response decoder learns to pay attention to the keywords when decoding the response sentence. As training progresses ($0 < p < 1$), we gradually decrease the probability $p$, so the keywords encoder and response decoder have more opportunities to access the generated keywords; the keywords decoder is then trained with supervision signals from both the keywords loss and the final response loss. At last ($p = 0$), we only use the generated keywords to train the model, so the model learns to perform well at test time, when no ground truth keywords are available.
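This schedule can be sketched as follows. The epoch boundaries 50 and 200 follow the settings reported in Section 5.1; the exact functional form of the decay is one natural choice of cosine annealing, assumed here for illustration:

```python
import math

def ground_truth_prob(epoch, start=50, end=200):
    """Probability of feeding ground-truth keywords, cosine-annealed from 1 to 0."""
    if epoch <= start:
        return 1.0                      # warm-up: always use ground truth
    if epoch >= end:
        return 0.0                      # final phase: always use generated keywords
    progress = (epoch - start) / (end - start)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

During training, a Bernoulli draw with this probability would decide, for each batch, whether the keywords encoder sees the ground truth or the generated keywords.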
Table 2: Automatic evaluation results (overlap metrics, embedding metrics, and keywords metrics).

| Model | BLEU | Rouge | Meteor | Average | Greedy | Extrema | KW-F1 | KW-Recall |
|---|---|---|---|---|---|---|---|---|
| Seq2Seq-6 w/o BERT | 8.76 | 0.205 | 0.098 | 0.864 | 0.689 | 0.473 | - | - |
| Seq2Seq-12 w/o BERT | 12.24 | 0.240 | 0.115 | 0.877 | 0.708 | 0.495 | - | - |
| KW-Seq2Seq w/o BERT | 26.66 | 0.348 | 0.187 | 0.896 | 0.755 | 0.574 | 0.264 | 0.876 |
| KW-Seq2Seq + GT Keywords | 43.95 | 0.700 | 0.355 | 0.961 | 0.897 | 0.815 | - | 0.903 |
4.3 Keywords Acquisition
In order to obtain the ground truth keywords of each response utterance, we use the TF-IDF value to indicate the importance of each word. Specifically, we calculate the TF-IDF value of each token in all the response utterances and choose the tokens in each response with the highest TF-IDF values as its keywords. We also try different keywords ratios to find the value that produces the best dialogue responses. The experiment details are described in Section 5.
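A simplified sketch of this selection, using a standard TF-IDF formulation over tokenized responses (the paper's exact weighting scheme and tokenization may differ):

```python
import math
from collections import Counter

def keywords_by_tfidf(responses, ratio=0.3):
    """Pick the top `ratio` fraction of tokens in each response by TF-IDF.

    `responses` is a list of token lists; returns one keyword list per response.
    """
    n_docs = len(responses)
    df = Counter()                              # document frequency per token
    for resp in responses:
        df.update(set(resp))
    keywords = []
    for resp in responses:
        tf = Counter(resp)
        # TF-IDF: term frequency times log inverse document frequency.
        scores = {w: (tf[w] / len(resp)) * math.log(n_docs / df[w]) for w in tf}
        k = max(1, round(ratio * len(resp)))
        top = set(sorted(scores, key=scores.get, reverse=True)[:k])
        keywords.append([w for w in resp if w in top])  # keep original order
    return keywords
```

Tokens that appear in nearly every response (stop-word-like words) get an IDF near zero, so the selected keywords concentrate on content words.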
4.4 Input Representations
The model takes the dialogue context as input, which consists of a sequence of utterances from two interlocutors. To obtain the input representations, we follow the processing of BERT: the embedding of each token is the sum of the word embedding, type embedding, and position embedding. The difference is that we concatenate all the context utterances into one sequence rather than just one or two sentences as in BERT, as shown in Figure 2. We add BERT's classification token [CLS] at the beginning of the sequence and the separation token [SEP] at the end of each utterance. We use two type embeddings for the utterances of the two dialogue interlocutors, and the position embeddings are added to each token in turn.
5.1 Experiments Setting
We use 6-layer Transformer encoders and decoders for all the components in the model. For hyper-parameters, we mostly follow the settings of BERT. We use a vocabulary of 30522 tokens and set the dimensions of both word embeddings and hidden states to 768. We use 12 attention heads in each layer of the encoders and decoders. The dropout probability is set to 0.1 in all dropout layers. The Gumbel-Softmax sampling temperature $\tau$ is fixed during training. We use Adam to optimize the model parameters. The weighting factors $\alpha$ and $\beta$ of the two loss terms are set to the same value. For the cosine annealing mechanism, we begin to decrease the probability $p$ at the 50-th epoch, and after the 200-th epoch $p$ becomes $0$. We do not use a fixed batch size but instead cap the number of tokens in a batch, which greatly improves training efficiency. We use the parameters of the first 6 layers of the pretrained BERT model to initialize all the components in the model. Since there is no cross-attention component in BERT, we copy the parameters of the self-attention component in the same layer to the corresponding cross-attention component. We implement KW-Seq2Seq in PyTorch and use the pretrained BERT from the transformers library (https://github.com/huggingface/transformers). The code of our model is available at http://anonymous.
5.2 Dataset

We train our model on DailyDialog, a popular open-domain multi-turn dialogue dataset consisting of 13K multi-turn conversations crawled from English practice websites. Each conversation is between exactly two English learners, and the content is mainly about daily life. To prepare the data for training and testing, we use a sliding window of size 6 to crop the conversations: the first 5 utterances in the window are used as the dialogue context and the last one as the response.
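The cropping step can be sketched as:

```python
def make_examples(conversation, window=6):
    """Crop a conversation with a sliding window of `window` utterances.

    The first window-1 utterances form the dialogue context and the last
    one is the response; returns (context, response) pairs.
    """
    examples = []
    for i in range(len(conversation) - window + 1):
        chunk = conversation[i:i + window]
        examples.append((chunk[:-1], chunk[-1]))
    return examples
```

A conversation of $n$ utterances thus yields $n - 5$ training examples when $n \ge 6$, and conversations shorter than the window yield none.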
5.3 Automatic Evaluation
We train the KW-Seq2Seq model and two baseline Seq2Seq models: Seq2Seq-6 and Seq2Seq-12. Seq2Seq-6 has a 6-layer encoder and decoder, the same as the context encoder and response decoder in KW-Seq2Seq. Seq2Seq-12 has a 12-layer encoder and decoder, so it has the same number of parameters as KW-Seq2Seq. We train all models both with and without BERT initialization.
Overlap and Embedding Metrics
We use three overlap-based metrics to evaluate the generated dialogue responses: BLEU, Rouge, and Meteor. They score two sentences by the number of co-occurring words or n-grams between them. Meanwhile, many papers point out that overlap-based metrics cannot reflect the real quality of responses in the dialogue task, so we also conduct three embedding-based evaluations: Average, Greedy, and Extrema, which map sentences into an embedding space and compute their cosine similarity. The embedding-based metrics measure semantic similarity and test the ability to generate a response sharing a similar topic with the golden answer. From Table 2 we can see that KW-Seq2Seq achieves higher scores than the Seq2Seq baselines on all overlap-based and embedding-based metrics, which indicates that the keywords in KW-Seq2Seq help generate more accurate and semantically relevant dialogue responses. It is worth noting that when we train the models without BERT's pretrained parameters, KW-Seq2Seq still achieves good results, while the performance of Seq2Seq drops sharply.
In order to check the performance of the keywords decoder and keywords encoder, we use two keywords-related metrics. First, we calculate the F1 score between the keywords generated by the keywords decoder and the ground truth keywords (KW-F1), which indicates the ability of the keywords decoder to capture the key idea of the response. Second, we count the number of generated keywords that appear in the final response sentence and calculate the keywords recall score (KW-Recall), which reflects how well the keywords encoder and response decoder capture the meaning of the keywords and leverage them in the response sentence. As Table 2 shows, KW-Seq2Seq obtains about 30% KW-F1 and nearly 90% KW-Recall, which verifies that the keywords decoder predicts keywords with reasonable accuracy and that the keywords encoder and response decoder effectively leverage the keywords information to guide the generation of the dialogue responses.
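Sketches of the two keywords metrics under natural set-overlap definitions (the paper's exact formulation, e.g. its handling of repeated tokens, may differ):

```python
def kw_f1(pred, gold):
    """F1 overlap between predicted and ground-truth keyword sets (KW-F1)."""
    pred, gold = set(pred), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def kw_recall(keywords, response_tokens):
    """Fraction of generated keywords that appear in the response (KW-Recall)."""
    kws = set(keywords)
    if not kws:
        return 0.0
    resp = set(response_tokens)
    return sum(1 for k in kws if k in resp) / len(kws)
```

Corpus-level scores would average these per-example values over the test set.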
Evaluation with Ground Truth Keywords
We also evaluate KW-Seq2Seq with the ground truth keywords as input to find the performance upper bound of the model. As the last row in Table 2 shows, the scores on all three types of metrics improve greatly, which further illustrates the importance of keywords guidance in KW-Seq2Seq. In situations where ground truth keywords are available in advance, KW-Seq2Seq can generate more controllable responses and better meet people's needs.
Table 3 (excerpts):
Response: Do you know when you can get up?
Keywords: command; expect; at; 7; fifteen;
Custom keywords (1): go; park; flowers; beautiful;
Custom keywords (2): ok; no; problem; on; time; good; night;
5.4 Human Evaluation
Accurate automatic evaluation of dialogue generation is still a big challenge, so we also conduct a human evaluation of KW-Seq2Seq and the baseline Seq2Seq model. We randomly sampled 300 dialogues from the evaluation results of the KW-Seq2Seq and Seq2Seq-6 models respectively and mixed them together. We hired 3 undergraduate students majoring in English to score the dialogue responses. They were asked to give each response a score from 1 to 5 points according to the grammar, fluency, coherence, and informativeness of the sentences. Seq2Seq received an average score of 2.92 and KW-Seq2Seq 3.16. The ratio of each score is shown in Figure 3: more responses generated by KW-Seq2Seq receive high scores than those of Seq2Seq, which verifies that KW-Seq2Seq generates dialogue responses of higher quality and informativeness.
5.5 The Keywords Ratio
To observe the effect of the keywords ratio on the quality of the generated responses, we choose the top 10%-50% of words with the largest TF-IDF values as keywords and train the KW-Seq2Seq model separately for each ratio. The results are shown in Table 4. The model trained with 30% keywords achieves the best scores on almost all metrics, while models trained with more or fewer keywords cannot outperform it. With fewer keywords, the keywords decoder does not receive enough supervision to learn the main idea of the responses; with too many keywords, the extra noise confuses the keywords decoder when identifying the key points of the dialogue. Therefore, we choose a keywords ratio of 30% to train the model, which yields responses of the best quality.
5.6 The Cosine Annealing Mechanism
The cosine annealing mechanism helps the model learn to leverage keywords information in generating dialogue responses. It guides the response decoder to give more attention to keywords early in training, and then teaches the model to leverage generated keywords by gradually decreasing the probability of feeding ground truth keywords to the keywords encoder with a cosine function. For comparison, we also train KW-Seq2Seq in settings where the keywords encoder is fed only ground truth keywords or only generated keywords. The results are shown in Table 5. Although the model trained with only ground truth keywords (All GT) gets a high KW-Recall score, the model trained with the cosine annealing mechanism (Cosine) gets the best results on all the other metrics, which indicates its important role.
5.7 Case Study
Table 3 shows some examples from KW-Seq2Seq and the baseline Seq2Seq model. From the table, we can see that the predicted keywords not only capture the topic of the dialogue but also bring new concepts into the response, such as "forest" and "river" in the first example. We also feed some custom keywords to KW-Seq2Seq (the last two rows in Table 3). It generates responses formed from the new keywords, which indicates that KW-Seq2Seq not only generates meaningful and informative sentences but also gives people the opportunity to control the content and direction of the dialogue.
6 Conclusion

We propose a Keywords-guided Sequence-to-Sequence (KW-Seq2Seq) model, which predicts keywords from the dialogue context hidden states and uses the keywords as guidance to generate the final dialogue response. Empirical experiments demonstrate that KW-Seq2Seq produces more informative, coherent, and fluent responses, yielding substantial gains in both automatic and human evaluations.
References

- Ba, Kiros, and Hinton (2016) Layer normalization. CoRR abs/1607.06450. Cited by: §3.1.
- Devlin, Chang, Lee, and Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, pp. 4171–4186. Cited by: §4.4, §5.1.
- Du and Black (2019) Boosting dialog response generation. In ACL (1), pp. 38–43. Cited by: §1.
- Jang, Gu, and Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR 2017, Toulon, France. Cited by: §4.1.
- Kingma and Ba (2015) Adam: a method for stochastic optimization. In ICLR 2015, San Diego, CA, USA. Cited by: §5.1.
- Li, Galley, Brockett, Gao, and Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT 2016, San Diego, California, pp. 110–119. Cited by: §1.
- Li, Su, Shen, Li, Cao, and Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of IJCNLP 2017, Taipei, Taiwan, pp. 986–995. Cited by: §5.2.
- Liu, Lowe, Serban, Noseworthy, Charlin, and Pineau (2016) How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP 2016, Austin, Texas, USA, pp. 2122–2132. Cited by: §5.3, §5.4.
- Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier, and Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations. Cited by: §5.1.
- Salton and Buckley (1988) Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), pp. 513–523. Cited by: §1, §4.3.
- Sordoni, Bengio, Vahabi, Lioma, Simonsen, and Nie (2015) A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In CIKM, pp. 553–562. Cited by: §1.
- Sutskever, Vinyals, and Le (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §1.
- Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, pp. 5998–6008. Cited by: §3.
- Vinyals and Le (2015) A neural conversational model. CoRR abs/1506.05869. Cited by: §1.
- Weizenbaum (1966) ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9(1), pp. 36–45. Cited by: §1.
- Zhao, Zhao, and Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of ACL 2017, Vancouver, Canada, pp. 654–664. Cited by: §1.