Automatic dialog/conversation systems have served humans for a long time in various fields, ranging from train routing to museum guiding. In these scenarios, the dialogs are domain-specific, and a typical approach to such in-domain systems is human engineering, for example, using manually constructed ontologies, natural language templates, and even predefined dialog states.
Recently, researchers have paid increasing attention to open-domain, chatbot-style human-computer conversation, because of its important commercial applications, and because it tackles the real challenges of natural language understanding and generation [6, 18, 16]. For open-domain dialogs, rules and templates would probably fail, as they can hardly handle the great diversity of dialog topics and natural language sentences. With the increasing number of human-human conversation utterances available on the Internet, previous studies have developed data-oriented approaches in the open domain, which can be roughly categorized into two groups: retrieval systems and generative systems.
When a user issues an utterance (called a query), retrieval systems search for the most similar query in a massive database (which consists of large numbers of query-reply pairs), and respond to the user with the corresponding reply [6, 7]. Through information retrieval, however, we cannot obtain new utterances; that is, all replies have to appear in the database. Also, the ranking of candidate replies is usually judged by surface forms (e.g., word overlap, tf-idf features) and hardly addresses the real semantics of natural language.
Generative systems, by contrast, can synthesize new utterances. Typically, a recurrent neural network (RNN) captures the query’s semantics with one or a few distributed, real-valued vectors (also known as embeddings); another RNN decodes the query embeddings to a reply. Deep neural networks allow complicated interaction through multiple non-linear transformations; RNNs are further suitable for modeling time-series data (e.g., a sequence of words), especially when enhanced with long short-term memory (LSTM) or gated recurrent units (GRUs). Despite these advantages, RNNs also have their own weakness when applied to dialog systems: the generated sentence tends to be short, universal, and meaningless, for example, “I don’t know” or “something”. This is probably because chatbot-like dialogs are highly diversified and a query may not convey sufficient information for the reply. Even though such universal utterances may suit certain dialog contexts, they bore users and make them lose interest, and thus are not desirable in real applications.
In this paper, we are curious whether we can combine the above two streams of approaches for open-domain conversation. To this end, we propose an ensemble of retrieval and generative dialog systems. Given a user-issued query, we first obtain a candidate reply by information retrieval from a large database. The query, along with the candidate reply, is then fed to an utterance generator based on the “bi-sequence to sequence” (biseq2seq) model. Such a sequence generator takes into consideration the information contained in not only the query but also the retrieved reply; hence, it alleviates the low-substance problem and can synthesize replies that are more meaningful. After that, we use the scorer in the retrieval system again for post-reranking. This step can filter out less relevant retrieved replies or meaningless generated ones. The higher-ranked candidate (either retrieved or generated) is returned to the user as the reply.
From the above process, we see that the retrieval and generative systems are integrated by two mechanisms: (1) The retrieved candidate is fed to the sequence generator to mitigate the “low-substance” problem; (2) The post-reranker can make better use of both the retrieved candidate and the generated utterance. In this sense, we call our overall approach an ensemble in this paper. To the best of our knowledge, we are the first to combine retrieval and generative models for open-domain conversation.
Experimental results show that our ensemble model consistently outperforms each single component in terms of several subjective and objective metrics, and that both retrieval and generative methods contribute an important portion to the overall approach. This also verifies the rationale for building model ensembles for dialog systems.
2 The Proposed Model Ensemble
Figure 1 depicts the overall framework of our proposed ensemble of retrieval and generative dialog systems. It mainly consists of the following components.
When a user sends a query utterance $q$, our approach utilizes a state-of-the-practice information retrieval system to search for a query-reply pair $\langle q^*, r^* \rangle$ that best matches the user-issued query $q$. The corresponding reply $r^*$ is retrieved as a candidate reply.
Then a biseq2seq model takes the original query $q$ and the retrieved candidate reply $r^*$ as input, each sequence being transformed to a fixed-size vector. These two vectors are concatenated and linearly transformed as the initial state of the decoder, which generates a new utterance $r^+$ as another candidate reply.
Finally, we use a reranker (which is a part of the retrieval system) to select either the retrieved reply $r^*$ or the generated reply $r^+$ as the ultimate response to the original query $q$.
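The three steps above can be summarized in a minimal Python sketch. The callables `retrieve`, `generate`, and `score` are hypothetical stand-ins for the retrieval system, the biseq2seq generator, and the reranker's scorer; these names and the toy implementations are assumptions for illustration and do not come from the system described here.

```python
from typing import Callable


def ensemble_reply(
    query: str,
    retrieve: Callable[[str], str],        # query -> retrieved candidate reply
    generate: Callable[[str, str], str],   # (query, retrieved reply) -> generated reply
    score: Callable[[str, str], float],    # (query, reply) -> relevance score
) -> str:
    """Return the higher-scored candidate among the retrieved and generated replies."""
    retrieved = retrieve(query)
    generated = generate(query, retrieved)
    # Post-reranking; ties go to the retrieved reply (an arbitrary choice here).
    if score(query, generated) > score(query, retrieved):
        return generated
    return retrieved


# Toy stand-ins, for illustration only.
def toy_retrieve(query: str) -> str:
    return "I really like it"


def toy_generate(query: str, retrieved: str) -> str:
    return "buy it now"


def toy_score(query: str, reply: str) -> float:
    # Hypothetical scorer: fraction of reply words that also occur in the query.
    query_words = set(query.split())
    words = reply.split()
    return sum(w in query_words for w in words) / len(words)


print(ensemble_reply("should I buy it", toy_retrieve, toy_generate, toy_score))
```

Passing the three components as callables keeps the sketch agnostic to how each one is built, which mirrors the fact that the components are trained separately.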
In the rest of this section, we describe each component in detail.
2.2 Retrieval-Based Dialog System
We utilize a state-of-the-practice retrieval system that involves extensive manual engineering and is built on tens of millions of existing human-human utterance pairs. Basically, it works in a two-step retrieval-and-ranking strategy, similar to the Lucene (http://lucene.apache.org) and Solr (http://lucene.apache.org/solr) systems.
First, a user-issued utterance is treated as bag-of-words features with stop words removed. After querying it in a knowledge base, we obtain a list containing up to 1000 query-reply pairs $\langle q^*, r^* \rangle$, whose queries share the most words with the input query $q$. This step retrieves coarse-grained candidates efficiently, which is accomplished by an inverted index.
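The coarse-grained step can be illustrated with a small inverted index. This is a sketch under stated assumptions, not the production system: the stop-word list, whitespace tokenization, and the function names `tokenize`, `build_index`, and `coarse_retrieve` are all hypothetical.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "to", "of"}  # illustrative list only


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(pairs):
    """Map each word to the ids of the pairs whose query contains it."""
    index = defaultdict(set)
    for pair_id, (query, _reply) in enumerate(pairs):
        for word in set(tokenize(query)):
            index[word].add(pair_id)
    return index


def coarse_retrieve(query, pairs, index, limit=1000):
    """Return up to `limit` pairs, ordered by the number of shared query words."""
    shared = Counter()
    for word in set(tokenize(query)):
        for pair_id in index.get(word, ()):
            shared[pair_id] += 1
    # Sort by overlap (descending), then by id for a deterministic order.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [pairs[pair_id] for pair_id, _count in ranked[:limit]]


pairs = [
    ("how is the weather today", "sunny and warm"),
    ("is the phone camera good", "the photos look great"),
    ("weather forecast for today", "rain in the afternoon"),
]
index = build_index(pairs)
candidates = coarse_retrieve("what is the weather today", pairs, index, limit=2)
```

Only pairs that share at least one word with the query are ever touched, which is what makes this step cheap compared with scoring every pair in the database.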
Then, we measure the relatedness between the query $q$ and each pair $\langle q^*, r^* \rangle$ in a fine-grained fashion. In our system, both query-query ($q$-$q^*$) and query-reply ($q$-$r^*$) relevance scores are considered. A classifier judges whether $q$ matches $q^*$ and $r^*$; its confidence is used as the score. We have tens of features; several important ones are the word overlap ratio, the cosine measure of pretrained topic model coefficients, and the cosine measures of word embedding vectors. (Details are beyond the scope of this paper; any well-designed retrieval system might fit into our framework.)
In this way, we obtain a query-reply pair $\langle q^*, r^* \rangle$ that best matches the original query $q$; the corresponding utterance $r^*$ is considered as a candidate reply retrieved from the database.
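Two of the features named above, the word overlap ratio and the cosine of word embedding vectors, can be sketched as follows. The exact feature definitions are not given in this paper, so the normalization by the shorter text, the averaging of word vectors, and all function names are assumptions.

```python
import math


def word_overlap_ratio(a, b):
    """Shared distinct words divided by the distinct words of the shorter text."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known words; a zero vector if no word is known."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def match_features(query, candidate, embeddings, dim):
    """Feature vector for a (query, candidate) pair, to be fed to a classifier."""
    return [
        word_overlap_ratio(query, candidate),
        cosine(mean_embedding(query, embeddings, dim),
               mean_embedding(candidate, embeddings, dim)),
    ]


toy_embeddings = {"phone": [1.0, 0.0], "camera": [0.0, 1.0]}  # made-up vectors
features = match_features("phone camera", "camera phone nice", toy_embeddings, 2)
```

The same feature function can be applied to both the query-query and the query-reply comparison; a classifier trained on such vectors would then supply the confidence used as the score.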
2.3 The biseq2seq Utterance Generator
Using neural networks to build end-to-end trainable dialog systems has become a new research trend in the past year. A generative dialog system can synthesize new utterances, which is complementary to retrieval-based methods.
Typically, an encoder-decoder architecture is applied to encode a query as vectors and to decode the vectors to a reply utterance. With recurrent neural networks (RNNs) as the encoder and decoder, such an architecture is also known as a seq2seq model, which has wide applications in neural machine translation, abstractive summarization, etc. That being said, previous studies indicate that seq2seq has its own shortcoming for dialog systems: in open-domain conversation systems, the query does not carry sufficient information for the reply, and the seq2seq model thus tends to generate short and meaningless sentences with little substance.
To address this problem, we adopt a biseq2seq model, which was originally proposed for multi-source machine translation. The biseq2seq model takes into consideration the retrieved reply as a reference in addition to the query information (Figure 2). Hence, the generated reply can be not only fluent and logical with respect to the query, but also meaningful, as it is enhanced by a retrieved candidate.
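Specifically, we use an RNN with gated recurrent units (GRUs) for sequence modeling. Let $\mathbf{x}_t$ be the word embedding at time step $t$ and $\mathbf{h}_{t-1}$ be the previous hidden state of the RNN. In the standard GRU formulation (bias terms omitted), we have
$$
\begin{aligned}
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1}) \\
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1}) \\
\tilde{\mathbf{h}}_t &= \tanh\big(W_x \mathbf{x}_t + U_x (\mathbf{r}_t \circ \mathbf{h}_{t-1})\big) \\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \circ \mathbf{h}_{t-1} + \mathbf{z}_t \circ \tilde{\mathbf{h}}_t
\end{aligned}
$$
where $\mathbf{r}_t$ and $\mathbf{z}_t$ are known as gates, the $W$’s and $U$’s are parameters, and “$\circ$” refers to the element-wise product.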
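A single GRU update can be written out in plain Python to make the gating explicit. This is a sketch of the standard GRU formulation without bias terms, which is an assumption about the exact variant used here; the parameter names `W_r`, `U_r`, `W_z`, `U_z`, `W_x`, and `U_x` are likewise illustrative.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, params):
    """One GRU update; `params` maps names such as "W_r" to weight matrices."""
    def affine(name, hidden):
        wx = matvec(params["W_" + name], x)
        uh = matvec(params["U_" + name], hidden)
        return [a + b for a, b in zip(wx, uh)]

    r = [sigmoid(a) for a in affine("r", h_prev)]          # reset gate
    z = [sigmoid(a) for a in affine("z", h_prev)]          # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]         # element-wise r * h_prev
    h_tilde = [math.tanh(a) for a in affine("x", gated)]   # candidate state
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


# With all-zero weights both gates equal 0.5 and the candidate state is zero,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W_x", "U_x")}
h_next = gru_step([1.0, -1.0], [0.4, -0.2], params)
```

Encoding a sentence amounts to applying `gru_step` to each word embedding in turn and keeping the final state as the sentence vector.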
After two RNNs go through the query $q$ and the retrieved reply $r^*$, respectively, we obtain two vectors capturing their meanings. We denote them by the bold letters $\mathbf{q}$ and $\mathbf{r}^*$, which are concatenated as $[\mathbf{q}; \mathbf{r}^*]$ and linearly transformed before being fed to the decoder as the initial state.
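The concatenation and linear transformation can be sketched in a few lines. The function name `initial_decoder_state` and the presence of a bias vector are assumptions; the text only states that the concatenated vector is linearly transformed.

```python
def initial_decoder_state(query_vec, reply_vec, weight, bias):
    """Concatenate the two encoder vectors and apply a linear map."""
    concatenated = query_vec + reply_vec  # list concatenation of the two vectors
    return [
        sum(w * c for w, c in zip(row, concatenated)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example: two 2-dimensional encoder outputs mapped to a 2-dimensional state.
state = initial_decoder_state(
    [1.0, 2.0],
    [3.0, 4.0],
    weight=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    bias=[0.5, -1.0],
)
```

The weight matrix has as many columns as the two encoder vectors combined, so the decoder's hidden size does not need to match the encoders' sizes.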
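The decoder is also an RNN. At each time step, however, a softmax layer outputs the probability that a word would occur in the next step, i.e.,
$$
p(w_i \mid \mathbf{h}_t) = \frac{\exp(\mathbf{W}_i^\top \mathbf{h}_t + b_i)}{\sum_{j} \exp(\mathbf{W}_j^\top \mathbf{h}_t + b_j)},
$$
where $\mathbf{W}_i$ is the $i$-th row of the output weight matrix (corresponding to word $w_i$) and $b_i$ is a bias term.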
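The output layer is a standard softmax over the vocabulary. The following sketch computes it from a hidden state; the function name and the toy numbers are illustrative, and subtracting the maximum logit is a common numerical-stability step that does not change the result.

```python
import math


def next_word_distribution(hidden, output_weights, biases):
    """Softmax over the vocabulary; row i of `output_weights` scores word i."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_weights, biases)
    ]
    top = max(logits)  # subtract the maximum logit for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A zero hidden state with zero biases gives a uniform distribution.
uniform = next_word_distribution(
    [0.0, 0.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 0.0, 0.0]
)
```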
Notice that we assign different sets of parameters (indicated by three colors in Figure 2) to the two encoders (for $q$ and $r^*$) and the decoder (for $r^+$). We do so because the semantics of the three RNNs differ significantly from one another (even between the two encoders).
Now that we have a retrieved candidate reply $r^*$ as well as a generated one $r^+$, we select one as the final reply using the query-reply ($q$-$r$) scorer of the retrieval-based dialog system (described in previous sections and not repeated here).
Using manually engineered features, this step can eliminate either meaningless short replies that are unfortunately generated by biseq2seq or less relevant replies given by the retrieval system. We call this a post-reranker in our model ensemble.
We train each component separately because the retrieval part is not end-to-end learnable.
In the retrieval system, we use the classifier’s confidence as the relevance score. The training set consists of 10k samples, which are either in the original human-human utterance pairs or generated by negative sampling. We made efforts to collect binary labels from a crowd-sourcing platform, indicating whether a query is relevant to another query and whether it is relevant to a particular reply. We find using crowd-sourced labels results in better performance than original negative sampling.
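Negative sampling for the matching classifier can be sketched as follows. The sampling scheme, the number of negatives per pair, and the function name are assumptions; the paper only states that samples come from the original pairs or from negative sampling, and that crowd-sourced labels worked better.

```python
import random


def make_matching_samples(pairs, negatives_per_pair=1, seed=0):
    """Label original pairs 1 and randomly mismatched (query, reply) pairs 0."""
    rng = random.Random(seed)
    samples = [(query, reply, 1) for query, reply in pairs]
    for i, (query, _reply) in enumerate(pairs):
        # Candidate negatives: replies that belong to other queries.
        others = [r for j, (_q, r) in enumerate(pairs) if j != i]
        count = min(negatives_per_pair, len(others))
        for reply in rng.sample(others, count):
            samples.append((query, reply, 0))
    return samples


toy_pairs = [("q1", "r1"), ("q2", "r2"), ("q3", "r3")]
toy_samples = make_matching_samples(toy_pairs)
```

A randomly drawn reply may happen to be an acceptable answer, so such automatic negatives are noisy, which is one plausible reason why human labels help.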
For biseq2seq, we use human-human utterance pairs as data samples. A retrieved candidate is also provided as input when we train the neural network. The standard cross-entropy loss over all words in the reply is applied as the training objective. For a particular training sample whose reply is of length $T$, the cost is
$$
J = -\sum_{i=1}^{T} \sum_{j=1}^{V} t_j^{(i)} \log y_j^{(i)},
$$
where $\mathbf{t}^{(i)}$ is the one-hot vector of the next target word, serving as the ground truth, $\mathbf{y}^{(i)}$ is the output probability by softmax, and $V$ is the vocabulary size. We adopt mini-batched AdaDelta for optimization.
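Because each target vector is one-hot, the inner sum over the vocabulary keeps only the probability assigned to the correct word. The following sketch computes the cost of one reply on that basis; the function name and the toy probabilities are illustrative.

```python
import math


def reply_cost(target_ids, predicted):
    """Cross-entropy of one reply.

    `target_ids[i]` is the vocabulary index of the i-th target word, and
    `predicted[i]` is the softmax output over the vocabulary at step i.
    """
    return -sum(math.log(predicted[i][t]) for i, t in enumerate(target_ids))


# Two-step reply over a two-word vocabulary.
cost = reply_cost([0, 1], [[0.5, 0.5], [0.25, 0.75]])
```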
In this section, we evaluate our model ensemble on Chinese (language) human-computer conversation. We first describe the datasets and settings. Then we compare our approach with strong baselines.
3.1 Experimental Setup
Typically, a very large database of query-reply pairs is a premise for a successful retrieval-based conversation system, because the reply must appear in the database. For RNN-based sequence generators, however, it is time-consuming to train with such a large dataset; RNN’s performance may also saturate when we have several million samples.
| Dataset Split | # of samples |
| Matching scorer (Train) | 50,000 |
To construct a database for information retrieval, we collected human-human utterances from massive online forums, microblogs, and question-answering communities, such as Sina Weibo (http://weibo.com), Baidu Zhidao (http://zhidao.baidu.com), and Baidu Tieba (http://tieba.baidu.com). We filtered out short and meaningless replies like “…” and “Errr.” In total, the database contains 7 million query-reply pairs for retrieval.
For the generation part, we constructed another dataset from various resources on public websites, comprising 1,606,741 query-reply pairs. For each query $q$, we searched for a candidate reply $r^*$ by the retrieval component and obtained a triple $\langle q, r^*, r \rangle$, where $r$ is the original human reply. As a reminder, $q$ and $r^*$ are the input of biseq2seq, whose output should approximate $r$. We randomly selected 100k triples for validation and another 6,741 for testing. The train-val-test split remains the same for all competing models.
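A random split of this kind can be sketched as below. With 1,606,741 triples, 100,000 for validation, and 6,741 for testing, 1,500,000 remain for training. The function name and the fixed seed are assumptions; fixing the seed is one simple way to keep the split identical across competing models.

```python
import random


def split_triples(triples, n_val, n_test, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy example with 20 items standing in for the triples.
train, val, test = split_triples(range(20), n_val=5, n_test=3)
```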
If the retrieved candidate coincided with the ground-truth reply, biseq2seq would degrade to an utterance autoencoder. Also, the validation and test sets are disjoint from the training set and the database for retrieval, which complies with the convention of machine learning.
To train our neural models, we followed previous work for hyperparameter settings. All embeddings were set to 620 dimensions and hidden states to 1000 dimensions. We applied AdaDelta with a mini-batch size of 80 and other default hyperparameters for optimization. Chinese word segmentation was performed on all utterances. We kept the same set of 100k words (Chinese terms) for the two encoders, but 30k for the decoder due to efficiency concerns. The three neural networks do not share parameters (neither connection weights nor embeddings).
We did not tune the above hyperparameters, which were set empirically. The validation set was used for early stopping based on the perplexity measure.
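Early stopping on validation perplexity can be sketched as follows. Perplexity is the exponential of the average negative log-likelihood per word; the `patience` rule and both function names are assumptions, since the exact stopping criterion is not specified here.

```python
import math


def perplexity(total_neg_log_likelihood, num_words):
    """Exponential of the average negative log-likelihood (natural log) per word."""
    return math.exp(total_neg_log_likelihood / num_words)


def best_epoch(val_perplexities, patience=2):
    """Index of the best epoch, stopping after `patience` epochs without improvement."""
    best, best_idx = float("inf"), -1
    for epoch, ppl in enumerate(val_perplexities):
        if ppl < best:
            best, best_idx = ppl, epoch
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_idx


# Training stops at epoch 4, so the later drop to 70 is never observed.
chosen = best_epoch([120.0, 90.0, 80.0, 85.0, 83.0, 70.0], patience=2)
```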
3.2 Competing Methods
We compare our model ensemble with each individual component and provide a thorough ablation test. Listed below are the competing methods in our experiments.
Retrieval. A state-of-the-practice dialog system, which is a component of our model ensemble; it is also a strong baseline because of extensive human engineering.
biseq2seq. Another component in our approach, adapted from multi-source machine translation, which is essentially a seq2seq model extended with a retrieved reply.
Rerank(Retrieval,seq2seq). Post-reranking between a retrieved candidate and one generated by seq2seq.
Rerank(Retrieval,biseq2seq). This is the full proposed model ensemble.
All baselines were trained and tuned in the same way as our full model, when applicable, so that the comparison is fair.
3.3 Overall Performance
We evaluated our approach in terms of both subjective and objective evaluation.
Human evaluation, albeit time- and labor-consuming, conforms to the ultimate goal of open-domain conversation systems. We asked three educated volunteers to annotate the results using a common protocol known as pointwise annotation [18, 10, 12]. In other words, annotators were asked to label each query-reply pair as “0” (bad), “1” (borderline), or “2” (good). The subjective evaluation was performed in a strictly random and blind fashion to rule out human bias.
We adopted BLEU-1, BLEU-2, BLEU-3, and BLEU-4 for automatic evaluation. Although previous work argues that no existing automatic metric is appropriate for open-domain dialogs, it also shows a slight positive correlation between BLEU-2 and human evaluation in the non-technical Twitter domain, which is similar to our scenario. We therefore include BLEU scores as an expedient objective evaluation, serving as supporting evidence. BLEU scores are also used in prior work for model comparison and for model selection.
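For readers unfamiliar with the metric, BLEU-n is the geometric mean of clipped n-gram precisions up to order n, multiplied by a brevity penalty. The sketch below is a simplified sentence-level version with a single reference and no smoothing; these simplifications and the function names are assumptions and may differ from the evaluation script behind the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """BLEU up to order `max_n` for one candidate and one reference (token lists)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
perfect = bleu(reference, reference, 4)
```

Clipping prevents a reply from scoring well by repeating one matching word, and the brevity penalty discourages very short replies, both of which matter for the universal utterances discussed earlier.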
Note that automatic metrics were computed on the entire test set, whereas subjective evaluation was based on 79 randomly chosen test samples because of limited human resources.
We present our main results in Table 2. As shown, the retrieval system, which our model ensemble is based on, achieves better performance than RNN-based sequence generation. This result is not consistent with previous work, where RNNs are slightly better than retrieval-based methods. After closely examining that work, we find that its database is several times smaller than ours, which may, along with different features and retrieval methods, explain the phenomenon. This also verifies that the retrieval-based dialog system in our experiment is a strong baseline to compare with.
Combining the retrieval system and the RNN generator by bi-sequence input and post-reranking, we achieve the highest performance in terms of both human evaluation and BLEU scores. Concretely, our model ensemble outperforms the state-of-the-practice retrieval system in averaged human scores by what we believe is a large margin.
3.4 Analysis and Discussion
Having verified that our model ensemble achieves better performance than all baselines, we are further curious how each gadget contributes to our final system. Specifically, we focus on the following research questions.
RQ1: What is the performance of biseq2seq (step ① in Figure 1) in comparison with traditional seq2seq?
From the BLEU scores in Table 2, we see biseq2seq significantly outperforms conventional seq2seq, showing that, if enriched with a retrieved human utterance as a candidate, the encoder-decoder framework can generate much more human-like utterances.
We then introduce in Table 3 another measure, the entropy of a sentence, defined as
$$
-\frac{1}{|R|} \sum_{w \in R} \log_2 p(w),
$$
where $R$ refers to all replies and $p(w)$ is the probability of word $w$. Entropy is used in previous work to measure the serendipity of generated utterances. (Note that the entropy of retrieved replies is not a fair metric to compare in Table 3, because the retrieval system has filtered out short, meaningless utterances in advance by surface statistics, e.g., length. We nevertheless report the result here out of curiosity: its entropy is 9.507, which is even higher than that of the ground truth.) The results in Table 3 confirm that biseq2seq indeed integrates information from the retrieved candidate, so that it alleviates the “low-substance” problem of RNNs and can generate more meaningful utterances than traditional seq2seq. The statistics also show that biseq2seq generates longer sentences than seq2seq.
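The entropy measure can be sketched as the average negative log-probability of the words in a set of replies. Several details are assumptions here: the probabilities are unigram frequencies estimated from the replies themselves, the average is taken over word tokens, and the logarithm is base 2. The paper may instead estimate word probabilities from a larger corpus.

```python
import math
from collections import Counter


def average_word_entropy(replies):
    """Average of -log2 p(w) over all word tokens in `replies`."""
    tokens = [word for reply in replies for word in reply.split()]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(math.log2(counts[word] / total) for word in tokens) / total


# Four equally frequent words give 2 bits per word.
diverse = average_word_entropy(["a b", "c d"])
# A single repeated word carries no information.
repetitive = average_word_entropy(["a a", "a a"])
```

Under this reading, replies built from frequent, universal words score low, whereas replies containing rarer content words score high.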
|Query||(This mobile phone’s photo effect is pretty good.)|
|Retrieved||所以好心动 (I really have a crush on it.)|
|biseq2seq||心动不如行动 (Rushing for it rather than having a crush on it.)|
|Query||(Could you please recommend some health products for pregnant women?)|
|Retrieved||打算上孕妇装？ (Do you want maternity clothes?)|
|biseq2seq||打算买护肤品么？ (Do you want some skin care products?)|
|seq2seq||不好意思 (I’m sorry.)|
RQ2: How do the retrieval- and generation-based systems contribute to post-reranking (step ② in Figure 1)?
We plot in Figure 3 the percentage of cases in which the post-reranker chooses a retrieved candidate or a generated one. In the retrieval-and-seq2seq ensemble (Figure 3a), 54.65% of the selected replies are retrieved and 45.35% are generated. In the retrieval-and-biseq2seq ensemble, the percentages become 44.77% vs. 55.23%. The trend further indicates that biseq2seq is better than seq2seq (at least) from the reranker’s point of view. More importantly, as the percentages are close to 50%, both the retrieval system and the generation system contribute a significant portion to our final ensemble.
RQ3: Do we obtain further gain by combining the two gadgets ① and ② in Figure 1?
We would also like to verify whether the combination of the biseq2seq and post-reranking mechanisms yields further gain in our ensemble. To test this, we compare the full model Rerank(Retrieval,biseq2seq) with an ensemble that uses traditional seq2seq, i.e., Rerank(Retrieval,seq2seq). As indicated in Table 2, even with the post-reranking mechanism, the ensemble with underlying biseq2seq still outperforms the one with seq2seq. Likewise, Rerank(Retrieval,biseq2seq) outperforms both Retrieval and biseq2seq. These results are consistent across all metrics except for one BLEU-4 score.
Through the above ablation tests, we conclude that both gadgets (biseq2seq and post-reranking) play a role in our ensemble when we combine the retrieval and the generative systems.
3.5 Case Study
Table 4 presents two examples of our ensemble and its “base” models. We see that biseq2seq is indeed influenced by the retrieved candidates. As opposed to traditional seq2seq, several content words in the retrieved replies (e.g., crush) also appear in biseq2seq’s output, making the utterances more meaningful. The post-reranker also chooses a more appropriate candidate as the reply.
4 Related Work
In early years, researchers mainly focused on domain-specific dialog systems, e.g., train routing, movie information, and human tutoring. Typically, a pre-constructed ontology defines a finite set of slots and values, for example, cuisine, location, and price range in a food service dialog system; during human-computer interaction, a state tracker fills plausible values into each slot from user input, and recommends the restaurant that best meets the user’s requirement [26, 13, 24].
In the open domain, however, such slot-filling approaches would probably fail because of the diversity of topics and natural language utterances. One line of work applies information retrieval techniques to search for related queries and replies. Other studies use both shallow hand-crafted features and deep neural networks for matching. Another study proposes a random walk-style algorithm to rank candidate replies; in addition, its model can introduce additional content (related entities in the dialog context) by searching a knowledge base when a stalemate occurs during human-computer conversation.
Generative dialog systems have recently attracted increasing attention in the NLP community. One early study formulates query-reply transformation as phrase-based machine translation. Since last year, the renewed prosperity of neural networks has brought an emerging trend of using RNNs for dialog systems [22, 19, 18, 16]. However, a known issue with RNNs is that they tend to generate short, meaningless utterances. One remedy is a mutual information objective in place of the conventional maximum likelihood criterion. Other studies introduce additional content (either the most mutually informative word or topic information) to the reply generator. Yet another applies a variational encoder to capture query information as a distribution, from which a random vector is sampled for reply generation.
To the best of our knowledge, we are the first to combine retrieval-based and generation-based dialog systems. The use of biseq2seq and post-reranking is also a new insight of this paper.
5 Conclusion and Future Work
In this paper, we propose a novel ensemble of retrieval-based and generation-based open-domain dialog systems. The retrieval part searches for a best-match candidate reply, which is, along with the original query, fed to an RNN-based biseq2seq reply generator. The generated utterance is fed back as a new candidate to the retrieval system for post-reranking. Experimental results show that our ensemble outperforms its underlying retrieval system and generation system by a large margin. In addition, the ablation test demonstrates that both the biseq2seq and post-reranking mechanisms play an important role in the ensemble.
Our research also points out several promising directions for future work, for example, developing new mechanisms of combining retrieval and generative dialog systems, as well as incorporating other data-driven approaches to human-computer conversation.
References
[1] MIMIC: An adaptive mixed initiative spoken dialogue system for Proc. Conf. Applied Natural Language Processing, pages 97–104, 2000.
[2] G. Ferguson, J. Allen, B. Miller, et al. Trains-95: Towards a mixed-initiative planning assistant. In AIPS, pages 70–77, 1996.
[3] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[4] Nadine Glas, Ken Prepin, and Catherine Pelachaud. Engagement driven topic selection for an information-giving agent. In Proc. Workshop on the Semantics and Pragmatics of Dialogue, 2015.
[5] A. Graesser, P. Chipman, B. Haynes, and A. Olney. AutoTutor: An intelligent tutoring system with mixed-initiative dialogue. IEEE Trans. Education, 48(4):612–618, 2005.
[6] Charles Lee Isbell, Michael Kearns, Dave Kormann, Satinder Singh, and Peter Stone. Cobot in LambdaMOO: A social statistics agent. In AAAI, pages 36–41, 2000.
[7] Zongcheng Ji, Zhengdong Lu, and Hang Li. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988, 2014.
[8] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In NAACL-HLT, pages 110–119, 2016.
[9] Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In ACL-IJCNLP, pages 1106–1115, 2015.
[10] Xiang Li, Lili Mou, Rui Yan, and Ming Zhang. StalemateBreaker: A proactive content-introducing approach to automatic human-computer conversation. In IJCAI, pages 2845–2851, 2016.
[11] Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP (to appear), 2016.
[12] Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970, 2016.
[13] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Multi-domain dialog state tracking using recurrent neural networks. In ACL-IJCNLP, pages 794–799, 2015.
[14] Alan Ritter, Colin Cherry, and William B Dolan. Data-driven response generation in social media. In EMNLP, pages 583–593, 2011.
[15] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, pages 379–389, 2015.
[16] Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3783, 2016.
[17] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069, 2016.
[18] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural responding machine for short-text conversation. In ACL-IJCNLP, pages 1577–1586, 2015.
[19] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, pages 196–205, 2015.
[20] Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689, 2016.
[21] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
[22] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[23] Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In SIGDIAL, pages 275–284, 2015.
[24] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.
[25] Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. The dialog state tracking challenge. In SIGDIAL, pages 404–413, 2013.
[26] Jason D Williams. Web-style ranking and SLU combination for dialog state tracking. In SIGDIAL, pages 282–291, 2014.
[27] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic augmented neural response generation with a joint attention mechanism. arXiv preprint arXiv:1606.08340, 2016.
[28] Rui Yan, Yiping Song, and Hua Wu. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In SIGIR, pages 55–64, 2016.
[29] Matthew D Zeiler. AdaDelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[30] Barret Zoph and Kevin Knight. Multi-source neural translation. In NAACL-ACL, pages 30–34, 2016.