The popularization of Social Networking Services (SNS) has reduced the burden of building large-scale open datasets. Consequently, recent work on dialogue systems has focused on end-to-end dialogue systems using neural networks [Vinyals and Le2015, Serban et al.2016, Serban et al.2017b]. The end-to-end approach has the potential to generate tailored and coherent responses to user input. However, two problems remain: the “safe response” phenomenon, in which the system produces responses that fit any utterance, such as “I’m sorry” and “I think so,” and the generation of words that contradict real-world facts. These problems arise because neural networks generally infer responses using only a collection of conversational transcriptions.
To tackle these problems, researchers have taken various approaches. [Ghazvininejad et al.2018] proposed a knowledge-grounded dialogue system conditioned on the context and on facts extracted from online resources, such as SNS posts with location information. This makes it easy and quick to handle topics that do not appear in the training data and to adapt to a new domain. In another approach, dialogue systems that combine multiple dialogue models produce more diverse responses than a single model, so that they can treat user input from various viewpoints [Serban et al.2017a, Song et al.2018]. We believe that combining these approaches is crucial for generating meaningful responses.
In this study, we propose an ensemble dialogue system conditioned on the previous context and external facts. The system consists of three modules: generation, retrieval, and reranking. First, the generation and retrieval modules produce candidate responses from the context and from facts extracted from information websites such as Wikipedia. To generate candidates, we extend Diverse Beam Search (DBS) [Vijayakumar et al.2018] by boosting the probability of words in the facts data, so that low-frequency words in the external data, such as proper nouns, are treated adequately. Second, the reranking module sorts these candidates according to several features that reflect appropriateness and informativeness, and returns the highest-ranked candidate as the final response. The contributions of this paper are twofold: (1) we propose a model that combines multiple hypotheses and injects external facts, and (2) we develop a decoding method that produces diverse and informative words.
We evaluate the system on DSTC7-Task2 [Yoshino et al.2018], which is devoted to building dialogue systems that generate responses based on real-world facts, and report our experimental results.
The system outputs a response using the context of recent turns and facts relevant to the context, where each fact is a sentence, possibly containing HTML tags, extracted from information websites. Each utterance is composed of words.
Here, we categorize the facts as subject facts and description facts using an HTML tag rule. A subject fact is a sequence enclosed by an <h> tag or a <title> tag, and a description fact is a sequence enclosed by a <p> tag or not enclosed by any tag.
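The tag rule above can be illustrated with a short Python sketch. The function name `categorize_facts`, the regular expressions, and the decision to treat numbered heading tags such as `<h2>` like `<h>` are assumptions made for this example; the exact rule used in the system may differ.

```python
import re

# Assumed rule: a sentence wrapped in <h>, <h1>..<h9>, or <title> is a subject fact;
# a sentence wrapped in <p> or carrying no enclosing tag is a description fact.
_SUBJECT_TAG = re.compile(
    r"^\s*<(h\d?|title)\b[^>]*>(.*?)</\1>\s*$", re.IGNORECASE | re.DOTALL
)
_ANY_TAG = re.compile(r"</?[^>]+>")


def categorize_facts(sentences):
    """Split raw fact sentences into (subject_facts, description_facts)."""
    subject, description = [], []
    for sentence in sentences:
        is_subject = _SUBJECT_TAG.match(sentence) is not None
        text = _ANY_TAG.sub("", sentence).strip()  # drop the markup itself
        if not text:
            continue  # skip sentences that contain only markup
        (subject if is_subject else description).append(text)
    return subject, description


facts = [
    "<title>Lake Baikal</title>",
    "<h2>Geography</h2>",
    "<p>Lake Baikal is a lake in southern Russia.</p>",
    "It is the deepest lake in the world.",
]
subject_facts, description_facts = categorize_facts(facts)
```

In this toy input, the title and the heading become subject facts, while the paragraph and the untagged sentence become description facts.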
Ensemble Dialogue System for Facts-Based Sentence Generation
We propose an ensemble dialogue system using external facts and context. As shown in Figure 1, it consists of the Memory-augmented Hierarchical Recurrent Encoder-Decoder (MHRED), the sentence selection module with facts retrieval (FR), and the Reranker. The system has two processes: the generate-retrieval process and the reranking process. In the generate-retrieval process, the MHRED generates responses using the context and external facts, and the FR retrieves responses from a database using important words extracted from the facts. In the reranking process, a binary classifier with various dialogue features receives all the candidates from the MHRED and the FR and selects the final response. Each module of the proposed system is introduced in detail in this section.
Memory-augmented Hierarchical Recurrent Encoder-Decoder
To inject facts into responses, we propose a novel encoder-decoder model that incorporates the end-to-end memory network (MemN2N) [Sukhbaatar et al.2015] architecture into the hierarchical recurrent encoder-decoder (HRED) [Serban et al.2016]. We call this model the Memory-augmented HRED (MHRED). An overview of the MHRED is shown in Figure 2.
Hierarchical Recurrent Encoder
To encode the context, a Hierarchical Recurrent Encoder (HRE) is applied. Previous work has shown that hierarchical Recurrent Neural Networks (RNNs) express the dialogue context better than non-hierarchical RNNs [Tian et al.2017]. The HRE consists of two levels of encoders, one at the utterance level and the other at the context level, each implemented with a Gated Recurrent Unit (GRU) [Chung et al.2014]. The utterance encoder converts each utterance into an utterance vector by applying the GRU to the word embeddings of the utterance, one word at a time; the utterance vector is the hidden state obtained after encoding the last word of the utterance.
After each utterance is processed, a context encoder reads the utterance vectors in order and outputs the context vector, which is a summary of the past utterances.
A facts encoder is introduced to select the facts that need to be injected into responses and to map them to a continuous representation, following the concept of the MemN2N architecture. The subject facts consist of many sentences written as headlines and titles concerning facts, whereas the description facts mostly consist of sentences explaining those headlines and titles. Accessing the description facts through the subject facts is an efficient way to extract detailed facts about the headlines and titles, since these tend to carry the vital information of a fact. Therefore, we extend the facts encoder proposed by [Ghazvininejad et al.2018] to store the subject facts in the first memory (first hop) and the description facts in the last memory (second hop).
First, the subject facts and the description facts are converted into memory vectors by summing the word embeddings of each sentence. Then, the context vector, which is the last hidden state of the HRE, is fed into the first memory of the facts encoder, and the subject fact is obtained. The embedding matrices of this memory, whose size depends on the vocabulary size, are trainable parameters.
Moreover, the context vector and the subject fact are passed to the second memory, and the description fact is obtained. The parameters of the second memory are likewise trainable, and part of the weights is shared between the two memories. Finally, the subject fact and the description fact are concatenated to obtain the facts vector.
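The two-hop reading described above can be sketched in plain Python as follows. This is a minimal illustration in the spirit of MemN2N, not the exact formulation of the facts encoder: it assumes that each memory uses the same vectors for addressing and reading, that the second-hop query is the sum of the context vector and the first-hop read, and that all vectors share one dimension. The function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_hop(query, memory):
    """One hop: attend over the memory vectors and return their weighted sum."""
    attention = softmax([dot(query, m) for m in memory])
    read = [sum(p * m[k] for p, m in zip(attention, memory)) for k in range(len(query))]
    return attention, read


def encode_facts(context, subject_memory, description_memory):
    # First hop: the context vector addresses the subject facts.
    att_s, subject_fact = memory_hop(context, subject_memory)
    # Second hop (assumed update rule): context plus subject fact addresses the descriptions.
    query = [c + s for c, s in zip(context, subject_fact)]
    att_d, description_fact = memory_hop(query, description_memory)
    # The facts vector is the concatenation of both reads.
    return att_s, att_d, subject_fact + description_fact
```

The attention weights returned by each hop are the kind of values one would inspect to see which subject and description facts the encoder focuses on.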
A decoder reads the context vector and the facts vector and predicts the next utterance. Starting from its initial hidden state, the decoder updates its hidden state at each time step with a GRU.
When the decoder generates conversational phrases such as “I think” and “I know,” it does not need to use facts relevant to the context at every time step. Hence, the decoder should be able to switch between using facts and using other information. We use Maxout Networks (MN) [Goodfellow et al.2013] to generate responses that inject facts. An MN outputs, for each unit, the maximum over several values, each of which is computed by a linear transformation of the input features. The resulting vector represents the most important of the features and thus enables the decoder to switch depending on whether facts are required. The probability of generating the next word is calculated from this vector using trainable parameters.
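For readers unfamiliar with maxout, the following sketch shows a single maxout layer in plain Python: several linear transformations are applied to the same input, and the element-wise maximum is kept. The function name, the list-based matrix layout, and the choice of input features are assumptions for illustration; in the decoder the input would combine quantities such as the hidden state, the context vector, and the facts vector.

```python
def maxout(features, weights, biases):
    """Element-wise maximum over several linear transformations of `features`.

    weights[j] is the weight matrix (a list of rows) of the j-th linear piece,
    and biases[j] is the bias vector of that piece.
    """
    pieces = []
    for matrix, bias in zip(weights, biases):
        piece = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(matrix, bias)
        ]
        pieces.append(piece)
    # For each output unit, keep the largest value among the linear pieces.
    return [max(values) for values in zip(*pieces)]


identity = [[1.0, 0.0], [0.0, 1.0]]
negation = [[-1.0, 0.0], [0.0, -1.0]]
zero_bias = [0.0, 0.0]
output = maxout([1.0, -2.0], [identity, negation], [zero_bias, zero_bias])
```

With an identity piece and a negation piece, as in this usage example, the layer reduces to an element-wise absolute value, which shows how the maximum selects whichever piece responds most strongly.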
Diverse Sentence Generation with Facts
Most neural dialogue systems apply Beam Search (BS) to generate the optimal response [Vinyals and Le2015, Serban et al.2016, Serban et al.2017b]. However, BS does not guarantee diversity in the final response because the word sequences within the beam closely resemble each other. In addition, words such as proper nouns, which often appear in the facts data, tend to be selected less often than general words appearing in the dialogue data.
Previous work extended BS to alleviate the diversity problem. [Vijayakumar et al.2018] proposed Diverse Beam Search (DBS), which generates diverse word sequences as an alternative to BS. Given a beam width, a number of groups, and a beam width per group, the beam set at each time step is divided into one subset per group. DBS selects words for these subsets in group order, using a score that combines the log probability of a word with a penalty weighted by a hyper-parameter. The penalty is the Hamming distance between the candidate word and the words already selected in the other groups. Note that no penalty applies to the first group, since no other group has selected words before it.
Furthermore, we extend the DBS by adding a penalty based on facts. To raise the probability of generating word sequences that contain words in the facts data, we introduce a penalty that uses the similarity between the facts and the sequence of candidate words, weighted by a second hyper-parameter. This penalty term is added to the DBS score (Equation 15) when a word is selected. The similarity is the cosine similarity between word embeddings of the facts and of the candidate words, computed by Word2Vec [Mikolov et al.2013].
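One way to picture the extended scoring is the per-step rescoring sketch below. It is a simplified illustration under several assumptions: the diversity penalty is counted as the number of earlier groups that already chose the same word, the facts term is the cosine similarity between a single candidate word and one averaged facts embedding, and the weights `lam` and `gamma` as well as the sign conventions are placeholders rather than the values or exact formula used in the system.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def rescore(log_probs, chosen_by_other_groups, embeddings, fact_vector, lam, gamma):
    """Augmented score of each candidate word for one group at one time step.

    log_probs: {word: log-probability of the word given the current prefix}
    chosen_by_other_groups: words already selected at this step by earlier groups
    """
    scores = {}
    for word, log_p in log_probs.items():
        # Diversity term: how many earlier groups already picked this word.
        overlap = chosen_by_other_groups.count(word)
        # Facts term: similarity between the word and the facts embedding.
        fact_sim = cosine(embeddings[word], fact_vector)
        scores[word] = log_p - lam * overlap + gamma * fact_sim
    return scores


scores = rescore(
    log_probs={"the": -0.5, "baikal": -1.2},
    chosen_by_other_groups=["the"],
    embeddings={"the": [1.0, 0.0], "baikal": [0.0, 1.0]},
    fact_vector=[0.0, 1.0],
    lam=0.5,
    gamma=0.5,
)
best_word = max(scores, key=scores.get)
```

In this toy step the proper noun overtakes the more probable function word, because the function word is penalized for repetition across groups while the proper noun is rewarded for matching the facts.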
Table 1: Features used by the Reranker.
| Category | Feature | Description |
|---|---|---|
| Candidate | Length | Number of characters and words |
| | Fluency | n-gram language model |
| | POS | Number of nouns, verbs, adjectives, and adverbs |
| | Fact | Frequency of words appearing in the subject and description facts / number of words |
| Pair | Word sim | Cosine similarity between one-hot vectors of words |
| | n-gram sim | Cosine similarity between n-grams |
| | Length sim | Similarity of the number of characters and words, with the lengths of the previous utterance and the candidate sentence each normalized to the range 0 to 1 |
| | Embedding sim | Cosine similarity between vectors computed as the averaged Word2Vec |
| | Sentimental sim | Similarity of semantic orientations [Takamura, Inui, and Manabu2005], based on the average semantic orientations of the previous utterance and the candidate |
| | POS sim | Cosine similarity between BoW (nouns, verbs, adjectives, and adverbs) |
| | Proper Noun sim | Cosine similarity between BoW of proper noun types, extracted by NLTK (https://www.nltk.org/) |
| | Keyword sim | Cosine similarity between the averaged Word2Vec of keywords extracted by the RAKE algorithm [Rose et al.2010] |
| Context | Topic sim | Cosine similarity between topic vectors obtained by feeding a context and a candidate to an LDA model [Blei, Ng, and Jordan2003] |
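Two of the “Pair” features, Word sim and n-gram sim, can be sketched with a single bag-of-words cosine function. This is an illustrative reading of the feature descriptions, assuming whitespace-tokenized input and count vectors; the helper names are hypothetical and the actual feature extraction may differ.

```python
import math
from collections import Counter


def bow_cosine(items_a, items_b):
    """Cosine similarity between the count vectors of two item sequences."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    shared = set(counts_a) & set(counts_b)
    numerator = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


previous = "tokyo and kyoto are anagrams".split()
candidate = "tokyo and kyoto are cities".split()
word_sim = bow_cosine(previous, candidate)
bigram_sim = bow_cosine(ngrams(previous, 2), ngrams(candidate, 2))
```

Because n-grams are represented as tuples, the same function serves both features; the n-gram variant is stricter, since a single differing word removes every n-gram that spans it.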
Sentence Selection with Facts Retrieval
In general, raw human-human conversation is highly fluent and rich in variety, and often contains a considerable amount of information about a specific topic. Thus, this study also proposes a method that selects utterances based on facts: Facts Retrieval (FR) is employed to output responses by matching facts against the responses and contexts of past dialogues. The database stores pairs of a query and a system output, where the query is the word sequence obtained by concatenating a context and its response. Note that the database is built from the training dialogue dataset.
For sentence selection, we extract important words from the facts and use them to query the database. Here, the important words are the overlapping words in the subject facts, which contain titles and headlines. To eliminate noise and improve the quality of retrieval, they are restricted to word sequences that include at least one noun, verb, adjective, and adverb. FR outputs the sentences whose query matches the important words. If multiple sentences match, FR ranks them by the score produced by BM25F [Zaragoza et al.2004] and outputs up to 10 sentences. Note that FR outputs no sentence if nothing matches.
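The retrieval step can be sketched as follows. This is a simplified stand-in rather than the system's implementation: it uses plain single-field BM25 instead of BM25F, assumes that a database entry matches when its query contains every important word, and uses the common default parameters `k1=1.2` and `b=0.75`. The function names and the toy database are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Plain BM25 over tokenized documents (single-field stand-in for BM25F)."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    doc_freq = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def retrieve(keywords, database, limit=10):
    """Return up to `limit` responses whose query contains every keyword."""
    hits = [(query, response) for query, response in database
            if all(k in query for k in keywords)]
    if not hits:
        return []  # FR stays silent when nothing matches
    scores = bm25_scores(keywords, [query for query, _ in hits])
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    return [response for _, (_, response) in ranked[:limit]]


database = [
    ("lake baikal volume is huge".split(), "lake baikal volume."),
    ("tokyo and kyoto".split(), "villages arent cities."),
]
responses = retrieve(["lake", "baikal"], database)
```

Returning an empty list when no entry matches mirrors the behavior described above, where FR produces no candidate for some inputs.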
The outputs of the MHRED and FR modules may contain meaningless and non-fluent responses. These should be eliminated, since a response should be both appropriate and informative. The Reranker receives all of the results of the MHRED and the FR and sorts them, and the highest-ranked candidate is returned to the user as the final response. It classifies each candidate as “positive” or “negative” as a response, where the probability of being “positive” is computed as a confidence score from binary classification with XGBoost [Chen and Guestrin2016]. The features of the Reranker fall into three categories, “Candidate” (responses returned by both the FR and the MHRED), “Pair” (a pair of a previous utterance and a “Candidate”), and “Context” (a pair of a context and a “Candidate”), as shown in Table 1. These categories enable evaluation of the quality of the responses.
Pairs of a context and a response from the dialogue dataset were used to build the training dataset. Contexts and responses with a high “response score” (over 100) on Reddit (https://www.reddit.com/) were chosen as positive examples. Then, negative examples were generated for those contexts according to one of the following rules:
A randomly selected response with a low “response score” (1 or less) from a dialogue on another topic.
A response obtained by swapping words and randomly eliminating some words from a positive example.
A response that matches both of the above descriptions.
As a result, the dataset contains 44,449 “positive” examples and 44,449 “negative” examples.
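The three rules for negative examples can be sketched as below. The drop probability, the choice of swapping exactly one pair of words, the uniform choice among the rules, and the function names are all assumptions made for illustration; the source does not specify these details.

```python
import random


def corrupt(tokens, rng, drop_prob=0.2):
    """Rule 2: swap two randomly chosen words, then randomly drop some words."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept or tokens[:1]  # keep at least one word for non-empty input


def make_negative(positive, low_score_pool, rng):
    """Generate one negative example with a rule chosen uniformly (assumed ratio)."""
    rule = rng.choice([1, 2, 3])
    if rule == 1:
        # Rule 1: a low-scored response taken from a dialogue on another topic.
        return list(rng.choice(low_score_pool))
    if rule == 2:
        return corrupt(positive, rng)
    # Rule 3: both at once, i.e. a corrupted low-scored response.
    return corrupt(rng.choice(low_score_pool), rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
positive = "lake baikal is the deepest lake".split()
pool = ["villages arent cities".split(), "use its hide as shelter".split()]
negative = make_negative(positive, pool, rng)
```

Since rule 2 only perturbs word order and drops words, negatives built this way differ from positives mainly in fluency, which is relevant when interpreting which features the classifier learns to rely on.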
The experiment was performed according to the regulations of DSTC7-Task2. We crawled the dialogue dataset from 178 subreddits (subsidiary threads or categories on Reddit). Markdown and special symbols were eliminated from the crawled dialogue dataset, and when several responses shared the same context, the context-response pair with the highest “response score” was selected. We crawled the facts dataset from 226 information sharing websites, such as Wikipedia. The facts were categorized into subject facts and description facts as mentioned above, and for each context up to 10 sentences with the highest cosine similarity were kept. To calculate the similarity, the average of the Word2Vec output with 256 dimensions was used. Note that the Word2Vec model was trained only on the official training datasets, in accordance with the DSTC7-Task2 regulations. The pre-processing described above yields the dialogue and facts datasets shown in Table 2.
|Avg. Tokens/Sentence (s)||3.86||3.61||3.30|
|Avg. Tokens/Sentence (d)||17.11||16.67||15.63|
|# Topics (s)||27735||1152||3047|
|# Topics (d)||27645||1121||3063|
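The selection of facts by embedding similarity, used to build the facts dataset, can be sketched as follows. The toy two-dimensional embeddings stand in for the 256-dimensional Word2Vec vectors, and the function names and tie-breaking by original order are assumptions for illustration.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def average_embedding(tokens, embeddings, dim):
    """Average the embeddings of known tokens; unknown tokens are ignored."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def select_facts(context_tokens, fact_sentences, embeddings, dim, top_k=10):
    """Keep the `top_k` fact sentences most similar to the context."""
    context_vec = average_embedding(context_tokens, embeddings, dim)
    scored = [
        (cosine(context_vec, average_embedding(fact, embeddings, dim)), index)
        for index, fact in enumerate(fact_sentences)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, stable on ties
    return [fact_sentences[index] for _, index in scored[:top_k]]


toy_embeddings = {"lake": [1.0, 0.0], "water": [0.9, 0.1], "tokyo": [0.0, 1.0]}
selected = select_facts(
    ["lake"], [["tokyo"], ["water"], ["lake", "water"]], toy_embeddings, dim=2, top_k=2
)
```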
Automatic evaluation and human evaluation of the responses were conducted by the DSTC7-Task2 organizers. For automatic evaluation, two types of metrics are used: word-overlap metrics, including BLEU [Papineni et al.2002], NIST [Doddington2002] and METEOR [Banerjee and Lavie2005], and the diversity metric div [Li et al.2015]. In the human evaluation, crowdsourced judges rated each response with a score from 1 (Strongly Disagree) to 5 (Strongly Agree) for Appropriateness and Informativeness.
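As a reference for the diversity metric, the sketch below computes the ratio of distinct n-grams to the total number of n-grams over a set of responses, following the usual definition of the metric of [Li et al.2015]. The function name and the whitespace tokenization are assumptions; the official evaluation script may differ in such details.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of n-grams."""
    total = 0
    unique = set()
    for tokens in responses:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


responses = ["i think so".split(), "i think not".split()]
div1 = distinct_n(responses, 1)  # 4 distinct unigrams out of 6
div2 = distinct_n(responses, 2)  # 3 distinct bigrams out of 4
```

A system that repeats the same safe response for every input would score close to zero on this metric, which is why it complements the word-overlap metrics.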
Models for Comparison
Several models are evaluated to show the effectiveness of the proposed model:
S2S: Sequence-to-sequence (seq2seq) model [Vinyals and Le2015].
HRED: HRED model [Serban et al.2016].
HRED-F: HRED whose responses are generated by DBS [Vijayakumar et al.2018] with the facts penalty term added.
MHRED-F: MHRED whose responses are generated by DBS with the facts penalty term added.
MHRED-F15-R, MHRED-F5-R: MHRED whose responses are generated by DBS with the facts penalty term added, under two settings of its hyper-parameter. The Reranker selects the final response from the candidates returned by MHRED-F.
Ensemble: The Reranker selects the final response from the candidates returned by both MHRED-F and FR.
Moreover, the baseline models “baseline(random)” and “baseline(constant)” provided by the organizers are compared with the proposed model in the human evaluation.
Note that the FR model alone should not be compared with the other models, since FR cannot output a response for every input.
We use two-layer seq2seq, HRED, and MHRED models. All models use a word embedding dimension and a hidden vector size of 256. Mini-batch training was employed with a batch size of 40. The models were trained with the cross-entropy loss function and the Adam optimization algorithm [Kingma and Ba2014] with an initial learning rate of 0.0001. To alleviate over-fitting to the training dataset, a dropout rate of 0.2 was set for all models. Training was conducted for up to 20 epochs, and the model with the lowest perplexity on the dev dataset was selected.
The hyper-parameters of DBS were set according to BLEU on the dev dataset. The vocabulary size was set to 20k, and the vocabulary is shared between the dialogue and facts data. When generating responses, the log probability of out-of-vocabulary (OOV) words was fixed to a value that prevents the special symbol unk from being generated.
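Suppressing the OOV symbol during decoding can be sketched as a simple masking step. The use of negative infinity is an assumption made for this example, since the exact value is not recoverable from the text; the function name and the dictionary representation of the output distribution are likewise hypothetical.

```python
import math


def mask_unk(log_probs, unk_token="unk"):
    """Return a copy of `log_probs` in which the OOV symbol can never win."""
    masked = dict(log_probs)
    if unk_token in masked:
        masked[unk_token] = -math.inf  # assumed value; any sufficiently low score works
    return masked


step_log_probs = {"unk": -0.1, "lake": -2.0, "baikal": -2.5}
masked = mask_unk(step_log_probs)
best_word = max(masked, key=masked.get)
```

Applying the mask before beam expansion, rather than filtering finished hypotheses, keeps the beam from being wasted on sequences that would be discarded later.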
Results and Discussion
Table 6: Examples of responses predicted by the models.
| # | Context | Response |
|---|---|---|
| 1 | til there is enough water in lake superior to flood the entire landmasses of north and south america to a depth of 1 foot. it contains over 3 quadrillion gallons of fresh water. the world’s largest by volume and deepest lake is located in southern russia. lake baikal. | S2S: i’m not true. |
| | | HRED: we are there. i don’t really count, but its not.. |
| | | HRED-F: also, and i have been posted lake baikal are on. the water lake . |
| | | Ensemble (FR): lake baikal volume. so the answer is yes a bit more than the all the great lakes, and a little under twice the volume of lake superior. |
| 2 | til that president jimmy carter placed his peanut farm in a blind trust when he was elected president. after his term ended, he found that it had been badly mismanaged and he was $1 million in debt. god bless jimmy carter america’s last honest president. | S2S: what do you think, i think of. |
| | | HRED: not sure. the best really liked by far. |
| | | HRED-F: yeah. but that is a president obama. |
| | | Ensemble (MHRED): obama was also a bit of course he was president obama. democrat. |
Table 7: Examples of reranking by the Reranker.
| # | Context | Candidate | Score | Rank |
|---|---|---|---|---|
| 1 | til that kyoto, the former capital of japan, just means “capital city” and tokyo means “eastern capital” i only just noticed that tokyo and kyoto are anagrams. | MHRED: i think tokyo godzilla, but as well and kyoto. | 0.9602 | 1 |
| | | MHRED: they also have been a lot of tokyo as the tokyo are they have the same as well. the kyoto is the only one. | 0.9342 | 2 |
| | | FR: villages arent cities. | 0.1817 | worst |
| 2 | til german animal protection law prohibits killing of vertebrates without proper reason. because of this ruling, all german animal shelters are no-kill shelters. i am german. til that there are kill shelters. | FR: wow! i didnt know there was a tv show about for pets/animal shelters. thats pretty cool! do you know if that sort of advertising caused a lot more people to adopt animals? | 0.8144 | 1 |
| | | MHRED: its a good thing about cats are occupying breeds cats. | 0.7297 | 2 |
| | | FR: use its hide as shelter. | 0.7234 | 3 |
Table 3 shows the results of the automatic evaluation. The proposed Ensemble model performs better than the other models. This indicates that Ensemble outputs responses that are more fluent, closer to human responses, and more diverse.
Comparing the results of MHRED-F and HRED-F, notably the div1 score, shows that the proposed MHRED architecture is superior to conventional models. This suggests that the MHRED can use facts to infer words and topics that are hard to handle from conversation data alone, and can generate diverse responses in a new domain.
The comparison of Ensemble and MHRED-F5-R indicates that the FR module is effective. This is because the responses returned by the FR are parts of conversations actually held by humans and are thus highly fluent. Together, the MHRED and the FR are therefore useful for generating informative and appropriate responses.
To analyze the effectiveness of introducing the penalty and the Reranker, we compared HRED-F, HRED, MHRED-F15-R and MHRED-F. The model combined with the Reranker (MHRED-F15-R) performs significantly better than the model without the Reranker, even though it uses a single generation model. This suggests that capturing diverse perspectives of dialogue with various features is important for response generation. Meanwhile, the model introducing the penalty (HRED-F) showed a slight improvement on NIST4 and BLEU4. This indicates that adding the penalty to DBS has the potential to generate responses closer to human-made ones.
Table 4 shows the results of the human evaluation. Since our primary model beats the official baseline models, which return random and constant responses, the proposed model appears able to capture the context and generate fluent responses.
Case Study and Error Analysis
To validate the MHRED architecture, we examined the attention values of the first and second memories in the facts encoder. Figure 3 depicts an example of the attention paid by the facts encoder. The first memory captures “seven wonders of the ancient world”, which refers to the topic of the context. Subsequently, the second memory captures the facts containing “pyramid”, considering both the context and the subject fact. Finally, the MHRED generates a response including “pyramid”. This indicates that the model can focus on the facts relevant to the context and generate responses that inject them.
Table 5 shows how the accuracy of the Reranker changes when one of the target features is excluded, per feature or per category. A negative value implies that the corresponding feature is important. Among the categories, “Candidate” showed the largest decrease; among the features, “Fluency” showed the largest decrease, followed by “Keyword sim”. Thus, the Reranker tends to select the final response based on fluency and contextual informativeness in the dialogue. This tendency is probably due to the way the training dataset for the Reranker was made. Negative examples are generated using hand-crafted rules such as swapping and eliminating words, which results in a preference for sentences with higher “Fluency” and “Keyword sim”.
Table 6 shows examples of responses predicted by the models. As can be observed from the table, HRED-F and Ensemble output more informative words related to the context, such as “lake baikal” (#1) or “obama” (#2). Table 7 presents examples of reranking by the Reranker. In example #1, the MHRED responses reflect the previous context, and the Reranker selects the most meaningful one. In example #2, the response returned by the FR has high fluency and many content words. However, the response is not suitable for the context in terms of the topic. This indicates, as mentioned above, that the Reranker tends to focus strongly on “Candidate” features because of the way its training examples are made. We expect that making examples from more varied perspectives will further improve the performance.
Conclusion and Future Work
In this paper, we proposed an ensemble dialogue system using external facts for DSTC7-Task2. The proposed system is a combination of three modules: the MHRED, a neural dialogue system that incorporates external facts into response generation; the FR; and the Reranker. For generation, we extended the DBS to generate more meaningful words contained in the facts data. The experimental results showed that the MHRED especially improved the diversity of the responses over the baseline model. Moreover, we confirmed that the combination of multiple modules improved the overall automatic metrics and generated more informative responses. In future work, we plan to introduce end-to-end learning that trains the multiple systems simultaneously.
- [Banerjee and Lavie2005] Banerjee, S., and Lavie, A. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65–72.
- [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.
- [Chen and Guestrin2016] Chen, T., and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.
- [Chung et al.2014] Chung, J.; Gülçehre, Ç.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
- [Doddington2002] Doddington, G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, 138–145. Morgan Kaufmann Publishers Inc.
- [Ghazvininejad et al.2018] Ghazvininejad, M.; Brockett, C.; Chang, M.; Dolan, B.; Gao, J.; Yih, W.; and Galley, M. 2018. A knowledge-grounded neural conversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, 5110–5117.
- [Goodfellow et al.2013] Goodfellow, I. J.; Warde-Farley, D.; Mirza, M.; Courville, A. C.; and Bengio, Y. 2013. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, 1319–1327.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- [Li et al.2015] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2015. A diversity-promoting objective function for neural conversation models. CoRR abs/1510.03055.
- [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 2013, 3111–3119.
- [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- [Rose et al.2010] Rose, S.; Engel, D.; Cramer, N.; and Cowley, W. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory 1–20.
- [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 3776–3784.
- [Serban et al.2017a] Serban, I. V.; Sankar, C.; Germain, M.; Zhang, S.; Lin, Z.; Subramanian, S.; Kim, T.; Pieper, M.; Chandar, S.; Ke, N. R.; Mudumba, S.; de Brébisson, A.; Sotelo, J.; Suhubdy, D.; Michalski, V.; Nguyen, A.; Pineau, J.; and Bengio, Y. 2017a. A deep reinforcement learning chatbot. CoRR abs/1709.02349.
- [Serban et al.2017b] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, 3295–3301.
- [Song et al.2018] Song, Y.; Li, C.; Nie, J.; Zhang, M.; Zhao, D.; and Yan, R. 2018. An ensemble of retrieval-based and generation-based human-computer conversation systems. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2018, 4382–4388.
- [Sukhbaatar et al.2015] Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, 2015, 2440–2448.
- [Takamura, Inui, and Manabu2005] Takamura, H.; Inui, T.; and Manabu, O. 2005. Extracting semantic orientations of words using spin model. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 133–140.
- [Tian et al.2017] Tian, Z.; Yan, R.; Mou, L.; Song, Y.; Feng, Y.; and Zhao, D. 2017. How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 231–236.
- [Vijayakumar et al.2018] Vijayakumar, A. K.; Cogswell, M.; Selvaraju, R. R.; Sun, Q.; Lee, S.; Crandall, D. J.; and Batra, D. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, 7371–7379.
- [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. In ICML Deep Learning Workshop.
- [Yoshino et al.2018] Yoshino, K.; Hori, C.; Perez, J.; D’Haro, L. F.; Polymenakos, L.; Gunasekara, C.; Lasecki, W. S.; Kummerfeld, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B.; Gao, S.; Marks, T. K.; Parikh, D.; and Batra, D. 2018. The 7th dialog system technology challenge. arXiv preprint.
- [Zaragoza et al.2004] Zaragoza, H.; Craswell, N.; Taylor, M. J.; Saria, S.; and Robertson, S. E. 2004. Microsoft cambridge at TREC 13: Web and hard tracks. In Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004.