Recently, dialogue systems have attracted increasing attention in both academia and industry because of their potential applications and commercial value. Sequence-to-sequence (Seq2Seq) models form the cornerstone of popular response generation models (Serban et al., 2016; Sordoni et al., 2015; Shang et al., 2015). However, neural dialogue systems based on Seq2Seq models tend to repeatedly generate universal and boring responses such as “I don’t know.” or “Thank you.”. Although widely applied, conventional Maximum Likelihood Estimation (MLE) training can cause this low-diversity problem (Li et al., 2016b): since high-frequency words make up a large proportion of the training set, MLE encourages the model to excessively generate high-frequency words.
Moreover, in conventional Seq2Seq training, we iteratively maximize the log-likelihood of each true token in the target sequence given the previously decoded tokens. The model therefore sees only past information during learning and cannot grasp the holistic information of the target sequence when decoding tokens. This also leads to the loss of the complete semantic relationship between target sequences and source sequences.
As discussed, we argue that the current learning strategy heavily limits the ability of Seq2Seq models to generate highly diverse responses, and that the holistic semantic information of the target response, as well as the global semantic relationship between responses and dialogue histories, is missing during the generation process. Most previous solutions simply rely on external information or post-processing models to mask the deficiencies of the Seq2Seq model, while the problem of the Seq2Seq model itself remains unaddressed. Therefore, in this paper, we aim to fully exploit the learning potential of Seq2Seq models without any external information, improving diversity while simultaneously better constraining the semantic relevance of generated responses. We propose a Holistic Semantic Constraint Joint Network (HSCJN) that predicts the subsequent word set in each target utterance to provide direct supervision during decoding, which introduces more linguistic information from target utterances to increase diversity. More specifically, in the HSCJN, we require each hidden state in the decoder to predict the words in the target utterance that remain ungenerated, and the initial state of the decoder is required to predict all the words in the target utterance. Since the HSCJN enables the decoder network to see all words in the target utterance at every time step, our model is also more likely to capture direct semantic information, such as keywords in target utterances, to enhance relevance.
In this way, the relationship between the representation spaces of source and target sequences, and the transition between different decoder states, can be better constrained. In addition, we observe that the entropy of the output distribution is low when the model is over-confident about high-frequency words. Penalizing low-entropy output distributions helps regularize the model, optimize the predicted output distribution, and alleviate the over-estimation of high-frequency words. Therefore, we devise a maximum-entropy-based regularizer. Our learning framework can be used as a general joint training method with Seq2Seq models and requires no additional data or annotation. In general, our contributions are summarized as follows:
We devise a joint training network to introduce future information into the decoding stage of open-domain dialogue generation, which can be applied to any Seq2Seq neural model. Our network introduces more linguistic information from target utterances to increase diversity, and directly captures key semantic information, such as keywords in target utterances, to enhance relevance.
We regularize the model by penalizing low-entropy output distributions at each time step in the decoder to alleviate the over-estimation of high-frequency words, which also enables the loss function to consider every word in the vocabulary, improving diversity.
The experimental results on multi-turn dialogue datasets show the effectiveness of our method in terms of both diversity and relevance of generated responses.
2 Related Work
The diversity of generated responses is an important issue of common concern. Serban et al. (2017) and Zhao et al. (2017) proposed introducing variational auto-encoders (VAEs) into Seq2Seq models to bring informativeness by increasing variability. Several studies proposed beam-search-based approaches (Li et al., 2017; Song et al., 2018; Vijayakumar et al., 2016). However, these methods merely provide a criterion for reweighing response candidates, rather than producing more diverse responses in the first place. Other previous works introduce additional information or knowledge, such as the context (Serban et al., 2016; Tian et al., 2017; Yao et al., 2017), keywords (Serban et al., 2016; Xing et al., 2017; Yao et al., 2017), or knowledge bases (Young et al., 2018; Ghazvininejad et al., 2018), into the response generation process to produce informative content. Although effective, these approaches actually bypass the low-diversity problem by introducing the randomness of stochastic latent variables or additional information; the underlying Seq2Seq model remains sub-optimal in terms of diversity. Li et al. (2016a) proposed a Maximum Mutual Information (MMI) objective to maximize the mutual information between messages and responses, but the MMI objective is used only at test time and relies on many extra modules, such as reverse models and beam search. Zhang et al. (2018) proposed the Adversarial Information Maximization (AIM) model, which explicitly maximizes mutual information during training to generate informative responses, but it still needs to train an extra backward model generating the source from the target, and is implemented with a complicated adversarial training strategy.
In other tasks, the word prediction technique has been applied to neural machine translation (Weng et al., 2017; L’Hostis et al., 2016). Lin et al. (2019) add entropy to the loss function in the VQA task to make the sparse distribution concentrate on a small set of video segments. Unlike our method, which considers entropy over the entire vocabulary, their model only considers the entropy of audio, video, and words in the current sentence over video segments.
3 HSCJN Model
3.1 Task Definition
Given a dialogue as a sequence of utterances $U = (u_1, u_2, \dots, u_n)$, and each utterance $u_i$ as a sequence of tokens $u_i = (w_{i,1}, w_{i,2}, \dots, w_{i,m_i})$, where $w_{i,j}$ represents the token at position $j$ in utterance $u_i$ from the vocabulary $V$, our task is to generate a response $u_{n+1}$ that is not only fluent and grammatical but also not repetitive or trivial in content. Essentially, the goal is to estimate the conditional probability:
$$P(u_{n+1} \mid u_1, \dots, u_n) = \prod_{t=1}^{m_{n+1}} P(w_{n+1,t} \mid w_{n+1,<t}, u_1, \dots, u_n)$$
3.2 An Overview of HSCJN
Figure 1 demonstrates the architecture of our model, in which we join a prediction network with the decoder network at each hidden state of the decoder. The encoder takes the word embedding sequences of the context utterances as inputs and obtains the hidden representations of the context. The decoder starts the generation of the target sequence from the initial state. Since the initial state is responsible for the generation of the whole target sequence, we optimize the initial state by making it predict all the target words, so that it contains comprehensive target information. Similarly, at each time step in the decoder, we introduce the prediction network to predict the word set of the target subsequence that has not yet been generated. While the response is generated by the decoder, HSCJN applies a constraint network to each hidden state in the decoder to introduce more direct linguistic information into the Seq2Seq model, with a specific objective function for the HSCJN. In addition, we optimize the output distribution at each decoding step by adding a maximum-entropy-based regularizer to the final objective function.
3.3 Holistic Semantic Constraint Joint Network Design
In HSCJN, we require the hidden state at each time step in the decoder to predict the word set containing the target words that remain ungenerated in the target utterance, where the order of words is not considered and target words are assumed to be independent of each other. In this way, at each time step, the decoder generates words not only conditioned on the previously generated subsequence within the original decoder network, but also in consideration of the future words not yet seen in the target sequence through our HSCJN. That is, our joint network HSCJN can introduce and utilize the global sentence information in target utterances for every token generation, which benefits both diversity and relevance. Specifically, for each time step $t$ in the decoder, the hidden state $s_t$ is required to predict the word collection of the remaining target subsequence $(y_{t+1}, \dots, y_T)$. The conditional probability of the prediction task in the HSCJN at hidden state $s_t$ is defined as follows:
$$P(E_t \mid s_t) = \prod_{y_k \in E_t} p(y_k \mid s_t), \qquad p(y_k \mid s_t) = \sigma\big(f([s_t; c_t])^\top e(y_k)\big)$$
where $T$ is the length of the target response, and the set $E_t = \{y_{t+1}, \dots, y_T\}$ is the word set of the future subsequence in the target response at time step $t$. $f(\cdot)$ is a multi-layer perceptron with two hidden layers using $\tanh$ as the activation function, followed by one output layer with the sigmoid function $\sigma(\cdot)$ acting on each neuron; that is, we predict the target word set in a multi-label classification manner. $e(y_k)$ is the embedding of the word $y_k$, and $c_t$ is the context vector from the attention mechanism (Luong et al., 2015):
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i, \qquad \alpha_{t,i} = \frac{\exp(s_{t-1}^\top W h_i)}{\sum_{j=1}^{n} \exp(s_{t-1}^\top W h_j)}$$
where $W$ is a weight matrix, $n$ is the input sequence length, $h_i$ is the hidden state of the encoder RNN at time step $i$, and $s_{t-1}$ is the hidden state of the decoder RNN at the previous time step.
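As a concrete illustration (a hypothetical sketch, not the authors' code), the attention computation above can be written in a few lines of NumPy; the dimensions and the random toy inputs are assumptions for the example:

```python
import numpy as np

def attention_context(h_enc, s_prev, W):
    """Luong-style bilinear attention.

    h_enc:  (n, d_enc) encoder hidden states h_1..h_n
    s_prev: (d_dec,)   decoder hidden state s_{t-1}
    W:      (d_dec, d_enc) bilinear weight matrix
    Returns the context vector c_t = sum_i alpha_{t,i} * h_i.
    """
    scores = h_enc @ (W.T @ s_prev)      # e_{t,i} = s_{t-1}^T W h_i
    scores = scores - scores.max()       # softmax numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ h_enc

# toy example: 3 encoder states, all vectors of dimension 4
rng = np.random.default_rng(0)
c = attention_context(rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=(4, 4)))
```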
Specifically, for the initial state $s_0$, HSCJN requires it to predict the word set containing all target words, so as to compress the overall information of the target sequence into the initial state. The decoder can therefore see the entire target sequence at the initial time step through the HSCJN. The conditional probability of the HSCJN at the initial state is defined as follows:
$$P(E_0 \mid s_0) = \prod_{y_k \in E_0} \sigma\big(f([s_0; c_0])^\top e(y_k)\big)$$
where $E_0 = \{y_1, \dots, y_T\}$ is the word set containing all target words in the target response, and $c_0$ is the context vector from the attention mechanism for the initial state $s_0$.
To optimize the HSCJN network, we add an extra likelihood term to the training procedure:
$$\mathcal{L}_{pre} = -\sum_{t=0}^{T-1} \frac{1}{|E_t|} \sum_{y_k \in E_t} \log p(y_k \mid s_t)$$
where $E_t$ and $s_t$ are as previously defined, and the coefficient $\frac{1}{|E_t|}$ of the logarithm is used to calculate the average probability of each prediction. This loss function guides HSCJN to accurately introduce the expected target semantic information.
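To make the objective concrete, the following hypothetical NumPy sketch computes this prediction loss for one response, assuming the sigmoid word probabilities for every decoding step have already been collected into a matrix:

```python
import numpy as np

def hscjn_prediction_loss(probs, future_sets):
    """Negative log-likelihood of predicting each remaining target word,
    averaged within each step by the 1/|E_t| coefficient.

    probs:       (T, |V|) array; probs[t, w] = p(word w | s_t) from the
                 per-word sigmoid outputs.
    future_sets: list of length T; future_sets[t] is the set of vocabulary
                 ids of target words not yet generated at step t.
    """
    loss = 0.0
    for t, words in enumerate(future_sets):
        if words:
            loss -= np.mean([np.log(probs[t, w]) for w in words])
    return loss

# toy example: vocabulary of 5 words, target token ids [2, 4, 1]
T, V = 3, 5
probs = np.full((T, V), 0.5)
target = [2, 4, 1]
future = [set(target[t:]) for t in range(T)]  # E_t: words still to generate
loss = hscjn_prediction_loss(probs, future)
```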
3.4 Output Distribution Regularizer
When a dialogue model generates universal responses, its predictions of high-frequency words are too confident; that is, the output probability distribution is concentrated on high-frequency words, and as a result the entropy of the output distribution is low. We consider that maximizing the entropy of the output distribution at each decoding step can help regularize the model and produce more diverse responses. By this means, the token-level distribution is better constrained to relieve the over-estimation of high-frequency words. Therefore, we add a negative entropy term to the negative log-likelihood loss function during training; minimizing the overall loss then encourages the maximization of entropy. Specifically, the loss is expressed as:
$$\mathcal{L}_{ent} = -\sum_{t=1}^{T} H(p_t), \qquad H(p_t) = -\sum_{k=1}^{|V|} p_t(v_k) \log p_t(v_k)$$
where $H(p_t)$ is the entropy of the output distribution $p_t$ at decoding step $t$, $|V|$ is the length of the vocabulary $V$, and $v_k$ represents a word in the vocabulary.
This loss function not only penalizes low-entropy output distributions when predicting each token, but also considers the entropy over the entire vocabulary, encouraging the model to take more words into account and thereby increase diversity.
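As a small sketch of the regularizer (hypothetical code, not the released implementation), the penalty for one decoding step is simply the negative entropy of the output distribution:

```python
import numpy as np

def entropy_penalty(p):
    """Negative entropy -H(p) of one output distribution over the vocabulary.

    Adding this term to the loss penalizes low-entropy (over-confident)
    distributions: minimizing -H(p) maximizes the entropy H(p).
    """
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))  # = -H(p)

# a distribution peaked on one high-frequency word incurs a larger
# penalty (a less negative value) than a flat one
peaked = np.array([0.97, 0.01, 0.01, 0.01])
flat = np.full(4, 0.25)
```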
3.5 Loss Function
We add $\mathcal{L}_{pre}$ and $\mathcal{L}_{ent}$ to the original negative log-likelihood loss function. The final loss function for model training is as follows:
$$\mathcal{L} = \mathcal{L}_{NLL} + \lambda_1 \mathcal{L}_{pre} + \lambda_2 \mathcal{L}_{ent}$$
where $\lambda_1$ and $\lambda_2$ are weight coefficients that control the strength of the joint prediction task and the output distribution regularizer, respectively. Our HSCJN builds a sentence-level training objective instead of the traditional token-level transition, considering the complete linguistic information in target utterances for every token generation.
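Putting the pieces together, a hypothetical sketch of the final training objective (using the lambda settings reported in Section 4.3 as defaults) is:

```python
def hscjn_total_loss(nll, l_pre, l_ent, lam1=1.0, lam2=0.13):
    """Weighted sum of the NLL term, the HSCJN prediction loss, and the
    negative-entropy regularizer; lam1 and lam2 correspond to lambda_1
    and lambda_2 in the paper."""
    return nll + lam1 * l_pre + lam2 * l_ent
```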
4.1 Data Preparation
DailyDialog: A high-quality, less noisy dataset containing 13,118 multi-turn dialogues, separated into training/validation/test sets with 11,118/1,000/1,000 conversations. For computational efficiency, we remove the dialogues with more than 300 tokens, which make up only a small proportion of the whole dataset; the final training/validation/test sets of the DailyDialog dataset contain 10,712/976/960 conversations respectively.
OpenSubtitles: A collection of movie subtitles. Following previous work (Xu et al., 2018), we treat each turn in the dataset as the target text and the two previous sentences as the source text. We randomly sample 200,000/50,000/10,000 dialogues for training, validation, and testing, respectively. Similarly, we also remove dialogues with more than 300 tokens; the final training/validation/test sets of the OpenSubtitles dataset contain 199,992/49,995/9,984 dialogues respectively.
4.2 Baselines
AttnSeq2Seq: A vanilla Seq2Seq model with attention mechanism (Bahdanau et al., 2014). The encoder and decoder are both recurrent neural networks (RNNs) with LSTM as the basic cell, and the encoder RNN is bidirectional.
HRED: HRED (Serban et al., 2016) considers dialogue history in multi-turn dialogue generation at two levels: a sequence of words for each utterance and a sequence of utterances, and models this hierarchy of conversations accordingly.
VHRED: VHRED (Serban et al., 2017) augments the HRED model with a stochastic latent variable at the decoder, trained by maximizing a variational lower-bound on the log-likelihood. The latent variable helps facilitate the generation of long utterances with more information content.
4.3 Model Settings
Our proposed method is generic, since it can be combined with any Seq2Seq model. In our experiments, we use HRED as the basis of our learning framework. We initialize the recurrent parameter matrices as orthogonal matrices, while all the bias vectors are set to zero. Other parameters are initialized by sampling from a Gaussian distribution. The vocabularies are limited to the most frequent 25K and 30K words for the DailyDialog and OpenSubtitles datasets respectively. We apply a GRU with 500 hidden units and a GRU with 1000 hidden units to the word-level and utterance-level encoders respectively, and an LSTM with 500 hidden units to the decoder. The dimension of the word embeddings is set to 300. We use the Adam optimizer (Kingma and Ba, 2014) to update the parameters, with a batch size of 8. The learning rate is 0.0002 and the dropout rate is 0.75. Meanwhile, we set $\lambda_1$ to 1 and $\lambda_2$ to 0.13. For decoding at test time, we simply decode until the end-of-utterance symbol ⟨eou⟩ occurs, using beam search with a beam width of 5. All baseline models are implemented with the same settings.
4.4 Automatic Evaluation
We adopt BLEU (Chen and Cherry, 2014; Papineni et al., 2002) and Distinct-1, Distinct-2, and Distinct-3 (Li et al., 2016a) to evaluate the models in terms of quality and diversity. Higher BLEU values indicate that the responses are closer to the ground truth. Distinct-1, Distinct-2, and Distinct-3 are the proportion and number of distinct unigrams, bigrams, and trigrams among all the generated tokens, respectively; higher Distinct values indicate better overall diversity.
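A common way to compute the Distinct-n proportion (a small sketch under the assumption that responses are already tokenized) is:

```python
def distinct_n(responses, n):
    """Proportion of distinct n-grams among all n-grams in the generated
    responses (Li et al., 2016a); each response is a list of tokens."""
    ngrams = [tuple(r[i:i + n]) for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# a repeated dull response yields a low Distinct-1; varied responses a high one
dull = [["i", "do", "not", "know"], ["i", "do", "not", "know"]]
varied = [["i", "do", "not", "know"], ["let", "me", "check", "later"]]
```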
Table 1 shows the experimental results of 1-turn response generation on the DailyDialog and OpenSubtitles corpora. Our HSCJN generates remarkably more distinct unigrams, bigrams, and trigrams than all the baselines on both datasets. Besides, our model achieves the highest BLEU-2/3/4 values on both datasets compared with all the baseline models. Our experiments confirm that our model achieves excellent performance in terms of both quality and diversity, regardless of the scale of the dataset.
Furthermore, we conduct experiments on 2-turns dialogue generation: given dialogue histories as input, we require models to generate the next two consecutive utterances. Since dialogues in the OpenSubtitles dataset contain only three turns, we conduct 2-turns response generation on the DailyDialog corpus only. The results of 2-turns dialogue generation by our model and the other two multi-turn dialogue generation models are shown in Table 2. Our model exceeds all baseline models in diversity in multi-turn dialogue generation, and even achieves higher BLEU-3 and BLEU-4.
To verify the effectiveness of our model in optimizing the output distribution, we perform word segmentation and word frequency statistics on the generated responses. Figure 2 plots the distribution of the ten most frequent words in responses generated by the corresponding models on the OpenSubtitles dataset, excluding punctuation. “Natural” represents the natural distribution of the ground truth. The horizontal axis represents the rank of word frequency, and the vertical axis shows word frequency values. The curves above the columns fit the distribution trends of the word frequencies for the different models, and their colors are consistent with the columns. The figure shows that the frequencies of words generated by our model are not as high as those of the baseline models, and the frequency distribution is flatter. Moreover, the distribution trends of our model and VHRED are basically consistent with the natural distribution, and our model is closer to the natural distribution in word frequency values than VHRED.
4.5 Ablation Study
We conduct an ablation study to examine the effectiveness of each mechanism; the results are shown in Table 1. HSCJN (w/o ME) and HSCJN (w/o PN) denote the models trained without the maximum entropy regularizer and without the prediction sub-network, respectively. On the DailyDialog dataset, HSCJN (w/o ME) generates the most distinct unigrams, bigrams, and trigrams among all compared models, and it also surpasses all the baseline models on the Distinct metrics on the OpenSubtitles dataset, indicating that our joint network generates clearly more diverse responses. HSCJN (w/o ME) also achieves higher BLEU scores, which indicates that our joint network can directly capture semantic information to enhance relevance.
Since HSCJN (w/o PN) only adds a regularization term to the loss function in the training process of HRED, a comparison with the HRED model is sufficient to verify the performance of the maximum entropy regularizer. HSCJN (w/o PN) achieves an obvious improvement in both quality and diversity compared with HRED, demonstrating the effectiveness of the maximum entropy regularization. Both mechanisms contribute to the improvement of diversity and quality in response generation. Removing the prediction sub-network has the greatest impact on the HSCJN model, indicating the importance of incorporating holistic semantic information.
4.6 Manual Evaluation
Since automatic metrics for open-domain generative models may not be consistent with human perceptions, quality scores from human annotation are more reliable. Therefore, we further recruit human annotators to evaluate the quality of the generated responses. We randomly select 100 test dialogues with responses generated by the different models for each dataset, for both 1-turn and 2-turns generation. Responses generated by different models are randomly shuffled for each annotator. Five annotators with linguistics experience are recruited to read the test dialogue histories and judge the quality of the responses of all compared models according to the following criteria:
0: The response cannot be used as a response to the conversation context. It is semantically unrelated or disfluent.
+1: The response can be used as a reply to the message, but it is too universal like “Yes, I see.”, “Thank you.” and “I don’t know.”.
+2: The response is not only grammatical and relevant, but also informative and interesting.
Manual evaluation results for 1-turn response generation are presented in Table 3, which lists the percentage of each score and an overall average score. Among the three baselines, AttnSeq2Seq performs the worst and VHRED the best. Our model obtains the lowest proportion of 0 scores and the highest proportion of +2 scores among all compared models on both datasets, indicating that it generates fewer low-quality responses as well as more semantically relevant and informative responses. The highest average score achieved by our model also confirms that it outperforms the baselines.
Table 4 shows the manual evaluation results for 2-turns dialogue generation on the DailyDialog dataset. In multi-turn generation, VHRED performs poorly: responses generated by VHRED may be informative, but most of them are irrelevant to the context. Our model again outperforms the baselines in terms of relevance and informativeness, as well as in the overall average score.
5 Case Study
Case 1: 1-turn response generation
Speaker A: Good morning. Are you Mr. Liu?
Speaker B: My name is Liu Lichi. How do you do?
Speaker A: Have you had any working experience?
Speaker B: Well, I worked at a supermarket during last summer holidays.
Speaker A: How are your English and computer skills?
Speaker B: I have passed the CET-4 and 6. As far as computer is concerned, I can use the computer for
AttnSeq2Seq: A great idea.
HRED: What do you do?
VHRED: I think so.
HSCJN: That sounds great. How long have you been interested in the job?
Human: Okay. Mr. Liu, we’ll inform you of the results within a week.

Case 2: 2-turns dialogue generation
Speaker A: Can I borrow this magazine from you? It’s really interesting and I can’t put it down.
Speaker B: I am sorry, but I can’t lend it to you now, for I haven’t finished reading it. If you don’t mind, I can lend you some back numbers to you.
HRED: Speaker A: Thank you very much. / Speaker B: You’re welcome.
VHRED: Speaker A: Thank you very much. / Speaker B: Do you have any questions?
HSCJN: Speaker A: That’s great. / Speaker B: It’s too good as you like it.
Human: Speaker A: That would be very kind of you. By the way, is it a monthly magazine? / Speaker B: No, it is a fortnightly. So, you see, I can get the new one quite soon.
Table 5 presents examples of 1-turn responses and 2-turns dialogues generated by the different models, given the multi-turn contexts between two speakers as inputs. “Human” lists the reference response for the given context in the dataset. Case 1 is an interview between an interviewer and a candidate. Our model captures that this is a job interview and produces a question matching the interview situation; in contrast, the responses generated by the baseline models are generic and irrelevant. In Case 2, our model captures the emotional information that speaker A likes the magazine, gives a more specific and informative response, and generates two consecutive turns matching the different speakers’ roles, while the results generated by the baselines are universal and monotonous. Our model generates clearly better responses with more specific details and higher diversity, and its results are more relevant to the dialogue scenario.
In this paper, we investigate the low-diversity issue in the dialogue generation task. We propose a Holistic Semantic Constraint Joint Network to introduce future information into the decoding stage. In addition, we devise a maximum entropy regularizer in our loss function to penalize the over-estimation of high-frequency words. In this way, the model can see the entire target sequence and consider all the words in the vocabulary during learning. It is worth mentioning that our model introduces the linguistic information in the target sequences from the Seq2Seq model itself for diverse response generation; it does not depend on any external information or variables, and is also beneficial for capturing holistic dialogue semantics to promote relevance. Moreover, our joint learning framework can be generalized to any end-to-end model. Extensive experiments show that our model produces more informative and relevant responses than several competitive baselines.
- Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §4.2.
- A systematic comparison of smoothing techniques for sentence-level BLEU. In WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, pp. 362–367. Cited by: §4.4.
- A knowledge-grounded neural conversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 5110–5117. Cited by: §2.
- Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.3.
- Vocabulary selection strategies for neural machine translation. CoRR abs/1610.00072. Cited by: §2.
- A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, San Diego California, USA, June 12-17, 2016, pp. 110–119. Cited by: §2, §4.4.
- A simple, fast diverse decoding algorithm for neural generation. CoRR abs/1611.08562. Cited by: §1.
- DailyDialog: A manually labelled multi-turn dialogue dataset. In IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers, pp. 986–995. Cited by: §2, §4.1.
- Entropy-enhanced multimodal attention model for scene-aware dialogue generation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, (AAAI-19), Hilton Hawaiian Village, Honolulu, Hawaii, USA, January 27-February 1, 2019. Cited by: §2.
- Effective approaches to attention-based neural machine translation. In EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1412–1421. Cited by: §3.3.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pp. 311–318. Cited by: §4.4.
- Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pp. 3776–3784. Cited by: §1, §2, §4.2.
- A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pp. 3295–3301. Cited by: §2, §4.2.
- Neural responding machine for short-text conversation. In ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1577–1586. Cited by: §1.
- Towards a neural conversation model with diversity net using determinantal point processes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 5932–5939. Cited by: §2.
- A neural network approach to context-sensitive generation of conversational responses. In NAACL HLT 2015, Denver, Colorado, USA, May 31 - June 5, 2015, pp. 196–205. Cited by: §1.
- How to make context more useful? an empirical study on context-aware neural conversational models. In ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pp. 231–236. Cited by: §2.
- Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pp. 2214–2218. Cited by: §4.1.
- Diverse beam search: decoding diverse solutions from neural sequence models. CoRR abs/1610.02424. Cited by: §2.
- Neural machine translation with word predictions. In EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 136–145. Cited by: §2.
- Topic aware neural response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pp. 3351–3357. Cited by: §2.
- DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text. CoRR abs/1802.01345. Cited by: §4.1.
- Towards implicit content-introducing for generative short-text conversation systems. In EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2190–2199. Cited by: §2.
- Augmenting end-to-end dialogue systems with commonsense knowledge. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 4970–4977. Cited by: §2.
- Generating informative and diverse conversational responses via adversarial information maximization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 1815–1825. Cited by: §2.
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 654–664. Cited by: §2.