Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems

by Yiping Song, et al.

Open-domain human-computer conversation has attracted much attention in the field of NLP. Contrary to rule- or template-based domain-specific dialog systems, open-domain conversation usually requires data-driven approaches, which can be roughly divided into two categories: retrieval-based and generation-based systems. Retrieval systems search for a user-issued utterance (called a query) in a large database, and return a reply that best matches the query. Generative approaches, typically based on recurrent neural networks (RNNs), can synthesize new replies, but they suffer from the problem of generating short, meaningless utterances. In this paper, we propose a novel ensemble of retrieval-based and generation-based dialog systems in the open domain. In our approach, the retrieved candidate, in addition to the original query, is fed to an RNN-based reply generator, so that the neural model is aware of more information. The generated reply is then fed back as a new candidate for post-reranking. Experimental results show that such an ensemble outperforms each single part of it by a large margin.



1 Introduction

Automatic dialog/conversation systems have served humans for a long time in various fields, ranging from train routing [2] to museum guiding  [4]. In the above scenarios, the dialogs are domain-specific, and a typical approach to such in-domain systems is by human engineering, for example, using manually constructed ontologies [23], natural language templates [20], and even predefined dialog states [25].

Recently, researchers have paid increasing attention to open-domain, chatbot-style human-computer conversation, because of its important commercial applications, and because it tackles the real challenges of natural language understanding and generation [6, 18, 16]. For open-domain dialogs, rules and templates would probably fail, as we can hardly handle the great diversity of dialog topics and natural language sentences. With the increasing number of human-human conversation utterances available on the Internet, previous studies have developed data-oriented approaches in the open domain, which can be roughly categorized into two groups: retrieval systems and generative systems.

When a user issues an utterance (called a query), retrieval systems search for the most similar query in a massive database (which consists of large numbers of query-reply pairs), and respond to the user with the corresponding reply [6, 7]. Through information retrieval, however, we cannot obtain new utterances; that is, all replies have to appear in the database. Also, the ranking of candidate replies is usually judged by surface forms (e.g., word overlaps, tf-idf features) and hardly addresses the real semantics of natural languages.

Generative dialog systems, on the other hand, can synthesize a new sentence as the reply by language models [19, 18, 16]. Typically, a recurrent neural network (RNN) captures the query's semantics with one or a few distributed, real-valued vectors (also known as embeddings); another RNN decodes the query embeddings to a reply. Deep neural networks allow complicated interaction by multiple non-linear transformations; RNNs are further suitable for modeling time-series data (e.g., a sequence of words), especially when enhanced with long short-term memory (LSTM) or gated recurrent units (GRUs). Despite these strengths, the RNN also has its own weakness when applied to dialog systems: the generated sentence tends to be short, universal, and meaningless, for example, "I don't know" [8] or "something" [16]. This is probably because chatbot-like dialogs are highly diversified and a query may not convey sufficient information for the reply. Even though such universal utterances may be suitable in certain dialog contexts, they bore users and make them lose interest, and thus are not desirable in real applications.

In this paper, we are curious whether we can combine the above two streams of approaches for open-domain conversation. To this end, we propose an ensemble of retrieval and generative dialog systems. Given a user-issued query, we first obtain a candidate reply by information retrieval from a large database. The query, along with the candidate reply, is then fed to an utterance generator based on the "bi-sequence to sequence" (biseq2seq) model [30]. Such a sequence generator takes into consideration the information contained in not only the query but also the retrieved reply; hence, it alleviates the low-substance problem and can synthesize replies that are more meaningful. After that, we use the scorer in the retrieval system again for post-reranking. This step can filter out less relevant retrieved replies or meaningless generated ones. The higher ranked candidate (either retrieved or generated) is returned to the user as the reply.

From the above process, we see that the retrieval and generative systems are integrated by two mechanisms: (1) The retrieved candidate is fed to the sequence generator to mitigate the “low-substance” problem; (2) The post-reranker can make better use of both the retrieved candidate and the generated utterance. In this sense, we call our overall approach an ensemble in this paper. To the best of our knowledge, we are the first to combine retrieval and generative models for open-domain conversation.

Experimental results show that our ensemble model consistently outperforms each single component in terms of several subjective and objective metrics, and that both retrieval and generative methods contribute an important portion to the overall approach. This also verifies the rationale for building model ensembles for dialog systems.

2 The Proposed Model Ensemble

2.1 Overview

Figure 1: The overall architecture of our model ensemble. We combine retrieval and generative dialog systems by ① enhancing the generator with the retrieved candidate and by ② post-reranking of both retrieved and generated candidates.

Figure 1 depicts the overall framework of our proposed ensemble of retrieval and generative dialog systems. It mainly consists of the following components.

  • When a user sends a query utterance $q$, our approach utilizes a state-of-the-practice information retrieval system to search for a query-reply pair $\langle q^*, r^* \rangle$ that best matches the user-issued query $q$. The corresponding $r^*$ is retrieved as a candidate reply.

  • Then a biseq2seq model takes the original query $q$ and the retrieved candidate reply $r^*$ as input, each sequence being transformed to a fixed-size vector. These two vectors are concatenated and linearly transformed as the initial state of the decoder, which generates a new utterance $r^+$ as another candidate reply.

  • Finally, we use a reranker (which is a part of the retrieval system) to select either $r^*$ or $r^+$ as the ultimate response to the original query $q$.

In the rest of this section, we describe each component in detail.
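
The three components above can be read as a short pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: `retrieve`, `generate`, and `score` are hypothetical callables standing in for the retrieval system, the biseq2seq generator, and the retrieval system's scorer, and returning the retrieved reply on a tie is an arbitrary choice that the text does not specify.

```python
from typing import Callable, Tuple


def ensemble_reply(
    query: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (reply, source), where source is 'retrieved' or 'generated'."""
    retrieved = retrieve(query)             # candidate reply from the database
    generated = generate(query, retrieved)  # generator sees query AND candidate
    # Post-reranking: keep whichever candidate the scorer prefers.
    # Ties go to the retrieved reply (an assumption of this sketch).
    if score(query, generated) > score(query, retrieved):
        return generated, "generated"
    return retrieved, "retrieved"


# Toy stand-ins (hypothetical) so that the sketch runs end to end.
def toy_retrieve(q: str) -> str:
    return "so tempting"


def toy_generate(q: str, r: str) -> str:
    return "just buy it then"


def toy_score(q: str, r: str) -> float:
    # Number of distinct words shared by query and reply.
    return float(len(set(q.split()) & set(r.split())))


reply, source = ensemble_reply("should I buy it", toy_retrieve, toy_generate, toy_score)
```

Passing the components as callables keeps the sketch independent of any particular retrieval engine or neural toolkit.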

2.2 Retrieval-Based Dialog System

Information retrieval is among the prevailing techniques for open-domain, chatbot-style human-computer conversation [6, 7].

We utilize a state-of-the-practice retrieval system with extensive manual engineering, built on tens of millions of existing human-human utterance pairs. Basically, it works in a two-step retrieval-and-ranking strategy, similar to the Lucene and Solr systems.

First, a user-issued utterance is treated as bag-of-words features, with stop words removed. After querying it in a knowledge base, we obtain a list containing up to 1000 query-reply pairs $\langle q^*, r^* \rangle$, whose queries share the most words with the input query $q$. This step retrieves coarse-grained candidates efficiently, which is accomplished by an inverted index.
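
To make the coarse-grained step concrete, here is a small sketch of bag-of-words retrieval through an inverted index. It only illustrates the general technique: the stop-word list, the whitespace tokenizer, and the function names are assumptions, and the actual system is described as being far more heavily engineered.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "to", "of"}  # hypothetical stop-word list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(pairs):
    """Map each word to the set of pair indices whose query contains it."""
    index = defaultdict(set)
    for i, (query, _reply) in enumerate(pairs):
        for word in set(tokenize(query)):
            index[word].add(i)
    return index


def coarse_retrieve(user_query, pairs, index, limit=1000):
    """Return up to `limit` pairs, ordered by the number of shared query words."""
    shared = Counter()
    for word in set(tokenize(user_query)):
        for i in index.get(word, ()):
            shared[i] += 1
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pairs[i] for i, _count in ranked[:limit]]


pairs = [
    ("the camera of this phone is great", "I want one too"),
    ("is this phone expensive", "not really"),
    ("what a sunny day", "let us go hiking"),
]
index = build_inverted_index(pairs)
candidates = coarse_retrieve("this phone camera is great", pairs, index)
```

Only pairs that share at least one non-stop word with the input are ever touched, which is what makes the lookup cheap compared with scanning the whole database.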

Then, we measure the relatedness between the query $q$ and each $\langle q^*, r^* \rangle$ pair in a fine-grained fashion. In our system, both $q$-$q^*$ and $q$-$r^*$ relevance scores are considered. A classifier judges whether $q$ matches $q^*$ and $r^*$; its confidence degree is used as the scorer. We have tens of features, and several important ones include the word overlap ratio, the cosine measure of pretrained topic model coefficients, and the cosine measures of word embedding vectors. (Details are beyond the scope of this paper; any well-designed retrieval system might fit into our framework.)

In this way, we obtain a query-reply pair $\langle q^*, r^* \rangle$ that best matches the original query $q$; the corresponding utterance $r^*$ is considered as a candidate reply retrieved from the database.
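
Two of the feature families named above, word overlap ratio and cosine similarity of embedding vectors, can be sketched as follows. This is an illustration only: the paper does not define the features precisely, so the overlap normalization, the averaging of word vectors into a sentence vector, and all names here are assumptions. The resulting feature list would be the input of the matching classifier.

```python
import math


def word_overlap_ratio(a, b):
    """Fraction of distinct words in `a` that also occur in `b`."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa) if wa else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of known words (zero vector if none are known)."""
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def match_features(q, q_star, r_star, embeddings, dim):
    """Query-query and query-reply features for one retrieved pair."""
    vq = sentence_vector(q, embeddings, dim)
    return [
        word_overlap_ratio(q, q_star),
        word_overlap_ratio(q, r_star),
        cosine(vq, sentence_vector(q_star, embeddings, dim)),
        cosine(vq, sentence_vector(r_star, embeddings, dim)),
    ]


toy_embeddings = {"phone": [1.0, 0.0], "camera": [0.8, 0.6], "hiking": [0.0, 1.0]}
features = match_features("phone camera", "camera phone", "hiking", toy_embeddings, 2)
```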

2.3 The biseq2seq Utterance Generator

Using neural networks to build end-to-end trainable dialog systems has become a new research trend in the past year. A generative dialog system can synthesize new utterances, which is complementary to retrieval-based methods.

Typically, an encoder-decoder architecture is applied to encode a query as vectors and to decode the vectors to a reply utterance. With recurrent neural networks (RNNs) as the encoder and decoder, such an architecture is also known as a seq2seq model, which has wide applications in neural machine translation [21], abstractive summarization [15], etc. That being said, previous studies indicate that seq2seq has its own shortcoming for dialog systems. [12] suggests that, in open-domain conversation systems, the query does not carry sufficient information for the reply, and that the seq2seq model thus tends to generate short and meaningless sentences with little substance.

To address this problem, we adopt a biseq2seq model, which is proposed in [30] for multi-source machine translation. The biseq2seq model takes into consideration the retrieved reply as a reference in addition to query information (Figure 2). Hence, the generated reply can be not only fluent and logical with respect to the query, but also meaningful as it is enhanced by a retrieved candidate.

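Specifically, we use an RNN with gated recurrent units (GRUs) for sequence modeling. Let $x_t$ be the word embedding at time step $t$ and $h_{t-1}$ be the previous hidden state of the RNN. We have

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{1}$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{2}$$
$$\tilde{h}_t = \tanh\big(W_x x_t + U_x (r_t \circ h_{t-1})\big) \tag{3}$$
$$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \tag{4}$$

where $r_t$ and $z_t$ are known as gates, the $W$'s, $U$'s, and $b$'s are parameters, and "$\circ$" refers to element-wise product.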
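
A single GRU update in its common textbook form (reset gate, update gate, candidate state, interpolation) can be sketched in plain Python as below. The parameter names `W_*`, `U_*`, `b_*` and the dictionary layout are assumptions made for illustration, and the exact parameterization used in the paper may differ in details such as bias terms.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update; `p` holds W_*, U_* matrices and b_* bias vectors."""
    # Reset gate and update gate.
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"])]
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev), p["b_z"])]
    # Candidate state computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in
               zip(matvec(p["W_x"], x), matvec(p["U_x"], gated))]
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


# With all-zero parameters both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous one.
zeros2x2 = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros2x2 for name in ("W_r", "U_r", "W_z", "U_z", "W_x", "U_x")}
params.update({"b_r": [0.0, 0.0], "b_z": [0.0, 0.0]})
h = gru_step([1.0, -1.0], [0.4, -0.2], params)
```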

After two RNNs go through $q$ and $r^*$, respectively, we obtain two vectors capturing their meanings. We denote them as bold letters $\mathbf{q}$ and $\mathbf{r}^*$, which are concatenated as $[\mathbf{q}; \mathbf{r}^*]$ and linearly transformed before being fed to the decoder as the initial state.
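
The concatenate-then-transform step can be written in a few lines. This is a sketch with hypothetical names; in particular, the bias vector `b` is an assumption, since the text only says that the concatenated vector is "linearly transformed".

```python
def decoder_init_state(q_vec, r_vec, W, b):
    """Concatenate the two encoder vectors and apply a linear map (W, b)."""
    concat = list(q_vec) + list(r_vec)
    return [sum(w * c for w, c in zip(row, concat)) + bias
            for row, bias in zip(W, b)]


q_vec = [1.0, 2.0]   # encoding of the query
r_vec = [3.0, 4.0]   # encoding of the retrieved candidate
W = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
b = [0.5, -0.5]
h0 = decoder_init_state(q_vec, r_vec, W, b)  # [5.5, 4.5]
```

The linear map brings the concatenated vector, which is twice the encoder size, back to the decoder's hidden size.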

During reply generation, we also use a GRU-RNN, given by Equations 1–4. But at each time step, a softmax layer outputs the probability that a word would occur in the next step, i.e.,

$$p(w_i \mid h_t) = \frac{\exp(W_i^\top h_t + b_i)}{\sum_j \exp(W_j^\top h_t + b_j)} \tag{5}$$

where $W_i$ is the $i$-th row of the output weight matrix (corresponding to $w_i$) and $b_i$ is a bias term.
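
The softmax output layer can be sketched as follows, assuming one weight row and one bias entry per vocabulary word. Subtracting the maximum logit is a standard numerical-stability trick that does not change the result; the names are illustrative.

```python
import math


def next_word_probs(h, W_out, b_out):
    """Softmax over the vocabulary given the decoder hidden state `h`."""
    logits = [sum(w * x for w, x in zip(row, h)) + bi
              for row, bi in zip(W_out, b_out)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three-word toy vocabulary: logits are 0, 0, and ln 2.
probs = next_word_probs(
    [1.0, 0.0],
    [[0.0, 0.0], [0.0, 0.0], [math.log(2.0), 0.0]],
    [0.0, 0.0, 0.0],
)
```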

Notice that we assign different sets of parameters (indicated by three colors in Figure 2) for the two encoders (for $q$ and $r^*$) and the decoder (for $r^+$). This treatment is because the RNNs' semantics differ significantly from one another (even between the two encoders).

Figure 2: The biseq2seq model, which takes as input a query $q$ and a retrieved candidate reply $r^*$; it outputs a new reply $r^+$.

2.4 Post-Reranking

Now that we have a retrieved candidate reply $r^*$ as well as a generated one $r^+$, we select one as the final reply by the $q$-$r$ scorer in the retrieval-based dialog system (described in previous sections and not repeated here).

Using manually engineered features, this step can eliminate either meaningless short replies that are unfortunately generated by biseq2seq or less relevant replies given by the retrieval system. We call this a post-reranker in our model ensemble.

2.5 Training

We train each component separately because the retrieval part is not end-to-end learnable.

In the retrieval system, we use the classifier’s confidence as the relevance score. The training set consists of 10k samples, which are either in the original human-human utterance pairs or generated by negative sampling. We made efforts to collect binary labels from a crowd-sourcing platform, indicating whether a query is relevant to another query and whether it is relevant to a particular reply. We find using crowd-sourced labels results in better performance than original negative sampling.

For biseq2seq, we use human-human utterance pairs as data samples. A retrieved candidate is also provided as the input when we train the neural network. Standard cross-entropy loss of all words in the reply is applied as the training objective. For a particular training sample whose reply is of length $T$, the cost is

$$J = -\sum_{i=1}^{T} \sum_{j=1}^{V} t_j^{(i)} \log y_j^{(i)} \tag{6}$$

where $t^{(i)}$ is the one-hot vector of the next target word, serving as the groundtruth, $y^{(i)}$ is the output probability by softmax, and $V$ is the vocabulary size. We adopt mini-batched AdaDelta [29] for optimization.
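
Because the target vector is one-hot, the inner sum over the vocabulary keeps only the term of the target word, so the cost of one reply reduces to the summed negative log-probability of its words. A small worked sketch, with illustrative names and toy probabilities:

```python
import math


def reply_cross_entropy(target_ids, step_probs):
    """Sum over time steps of -log(probability assigned to the target word)."""
    return -sum(math.log(probs[t]) for t, probs in zip(target_ids, step_probs))


# A two-word reply over a three-word vocabulary.
# Step 1 targets word 0 (probability 0.5); step 2 targets word 2 (probability 0.8).
loss = reply_cross_entropy([0, 2], [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]])
```

Here the loss equals -(ln 0.5 + ln 0.8), which is roughly 0.916.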

3 Evaluation

In this section, we evaluate our model ensemble on Chinese (language) human-computer conversation. We first describe the datasets and settings. Then we compare our approach with strong baselines.

3.1 Experimental Setup

Typically, a very large database of query-reply pairs is a premise for a successful retrieval-based conversation system, because the reply must appear in the database. For RNN-based sequence generators, however, it is time-consuming to train with such a large dataset; RNN’s performance may also saturate when we have several million samples.

Dataset split              # of samples
Retrieval (Database)       7,053,820
Matching scorer (Train)    50,000
Generator (Train)          1,500,000
Validation                 100,000
Testing                    6,741
Table 1: Statistics of our datasets.

To construct a database for information retrieval, we collected human-human utterances from massive online forums, microblogs, and question-answering communities, such as Sina Weibo, Baidu Zhidao, and Baidu Tieba. We filtered out short and meaningless replies like "…" and "Errr." In total, the database contains 7 million query-reply pairs for retrieval.

For the generation part, we constructed another dataset from various resources in public websites comprising 1,606,741 query-reply pairs. For each query $q$, we searched for a candidate reply $r^*$ by the retrieval component and obtained a tuple $\langle q, r^*, r \rangle$. As a friendly reminder, $q$ and $r^*$ are the input of biseq2seq, whose output should approximate $r$. We randomly selected 100k triples for validation and another 6,741 for testing. The train-val-test split remains the same for all competing models.
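
The split arithmetic is consistent with Table 1: 1,606,741 triples minus 100,000 for validation and 6,741 for testing leaves 1,500,000 for training. A random, disjoint split of this kind might be sketched as follows on scaled-down toy data; the function name, the fixed seed, and the toy triples are assumptions.

```python
import random


def split_triples(triples, n_val, n_test, seed=0):
    """Shuffle once, then carve out disjoint validation, test, and training sets."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy triples (query, retrieved candidate, ground-truth reply).
triples = [(f"q{i}", f"r_star{i}", f"r{i}") for i in range(100)]
train, val, test = split_triples(triples, n_val=10, n_test=5)

# Same arithmetic at the scale reported in the paper.
n_train_reported = 1_606_741 - 100_000 - 6_741
```

Fixing the seed keeps the split identical across runs, which matters when the same split is reused for all competing models.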

The biseq2seq then degrades to an utterance autoencoder. Also, the validation and test sets are disjoint from the training set and the database for retrieval, which complies with the convention of machine learning.

To train our neural models, we followed [18] for hyperparameter settings. All embeddings were set to 620 dimensions and hidden states to 1000 dimensions. We applied AdaDelta with a mini-batch size of 80 and other default hyperparameters for optimization. Chinese word segmentation was performed on all utterances. We kept the same set of 100k words (Chinese terms) for the two encoders, but 30k for the decoder due to efficiency concerns. The three neural networks do not share parameters (neither connection weights nor embeddings).

We did not tune the above hyperparameters, which were set empirically. The validation set was used for early stopping based on the perplexity measure.
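
Perplexity-based early stopping can be sketched as below. Perplexity is the exponential of the average per-word negative log-likelihood; the patience rule is an assumption, as the paper does not describe its exact stopping criterion.

```python
import math


def perplexity(total_neg_log_likelihood, num_words):
    """exp of the average per-word negative log-likelihood (natural log)."""
    return math.exp(total_neg_log_likelihood / num_words)


def early_stop_epoch(val_perplexities, patience=2):
    """Index of the best epoch, stopping after `patience` epochs without improvement."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, ppl in enumerate(val_perplexities):
        if ppl < best:
            best, best_epoch, waited = ppl, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Validation perplexity stops improving after epoch 2, so training halts at
# epoch 4 and the later value 80.0 is never reached.
best_epoch = early_stop_epoch([120.0, 95.0, 90.0, 92.0, 93.0, 80.0])
```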

Method                        Human Score  BLEU-1  BLEU-2  BLEU-3  BLEU-4
Retrieval                     0.996        5.707   3.092   2.406   2.094
seq2seq                       0.907        3.676   1.228   0.564   0.286
biseq2seq                     0.966        7.762   3.293   2.056   1.487
Rerank(Retrieval,seq2seq)     1.030        4.500   2.041   1.364   1.060
Rerank(Retrieval,biseq2seq)   1.131        7.260   3.503   2.480   2.000
Table 2: Results of our ensemble and competing methods in terms of average human scores and BLEUs. Inter-annotator agreement for human annotation: Fleiss’  [3], std , indicating moderate agreement.

3.2 Competing Methods

We compare our model ensemble with each individual component and provide a thorough ablation test. Listed below are the competing methods in our experiments.

  • Retrieval. A state-of-the-practice dialog system, which is a component of our model ensemble; it is also a strong baseline because of extensive human engineering.

  • seq2seq. An encoder-decoder framework [21], first introduced in [18] for dialog systems.

  • biseq2seq. Another component in our approach, adapted from [30], which is essentially a seq2seq model extended with a retrieved reply.

  • Rerank(Retrieval,seq2seq). Post-reranking between a retrieved candidate and one generated by seq2seq.

  • Rerank(Retrieval,biseq2seq). This is the full proposed model ensemble.

All baselines were trained and tuned in the same way as our full model, when applicable, so that the comparison is fair.

3.3 Overall Performance

We evaluated our approach in terms of both subjective and objective evaluation.

  • Human evaluation, albeit time- and labor-consuming, conforms to the ultimate goal of open-domain conversation systems. We asked three educated volunteers to annotate the results using a common protocol known as pointwise annotation [18, 10, 12]. In other words, annotators were asked to assign a label of "0" (bad), "1" (borderline), or "2" (good) to a query-reply pair. The subjective evaluation was performed in a strictly random and blind fashion to rule out human bias.

  • We adopted BLEU-1, BLEU-2, BLEU-3 and BLEU-4 for automatic evaluation. Although [11] argues that no existing automatic metric is appropriate for open-domain dialogs, it shows a slight positive correlation between BLEU-2 and human evaluation in the non-technical Twitter domain, which is similar to our scenario. We therefore include BLEU scores as an expedient objective evaluation, serving as supporting evidence. BLEUs are also used in [8] for model comparison and in [12] for model selection.
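
For readers unfamiliar with the metric, BLEU-n combines clipped n-gram precisions up to order n with a brevity penalty. The sketch below is a minimal sentence-level version with a single reference and no smoothing. The paper does not state its exact BLEU configuration (corpus-level or sentence-level, tokenization, scaling), so this only illustrates the metric and is not the evaluation script.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU up to order `max_n`: one reference, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty for candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


cand = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
bleu1 = bleu(cand, ref, 1)  # unigram precision 5/5, brevity penalty exp(-0.2)
bleu2 = bleu(cand, ref, 2)  # bigram precision 3/4 enters the geometric mean
```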

Note that automatic metrics were computed on the entire test set, whereas subjective evaluation was based on 79 randomly chosen test samples due to the limited human resources available.

We present our main results in Table 2. As shown, the retrieval system, which our model ensemble is based on, achieves better performance than RNN-based sequence generation. The result is not consistent with [18], where their RNNs are slightly better than retrieval-based methods. After closely examining their paper, we find that their database is multiple times smaller than ours, which may, along with different features and retrieval methods, explain the phenomenon. This also verifies that the retrieval-based dialog system in our experiment is a strong baseline to compare with.

Combining the retrieval system and the RNN generator by bi-sequence input and post-reranking, we achieve the highest performance in terms of both human evaluation and BLEU scores. Concretely, our model ensemble outperforms the state-of-the-practice retrieval system in averaged human score (1.131 vs. 0.996 in Table 2), which we believe is a large margin.

3.4 Analysis and Discussion

Method        Entropy  Length
seq2seq       7.420    7.362
biseq2seq     8.302    8.185
Groundtruth   8.625    12.62
Table 3: Entropy and length of generated replies. We also include groundtruth for reference. A larger entropy value indicates that the replies are less common, and probably, more meaningful.

Having verified that our model ensemble achieves better performance than all baselines, we are further curious how each gadget contributes to our final system. Specifically, we focus on the following research questions.

RQ1: What is the performance of biseq2seq (step ① in Figure 1) in comparison with traditional seq2seq?

From the BLEU scores in Table 2, we see biseq2seq significantly outperforms conventional seq2seq, showing that, if enriched with a retrieved human utterance as a candidate, the encoder-decoder framework can generate much more human-like utterances.

We then introduce in Table 3 another measure, the entropy of a sentence, defined as

$$-\frac{1}{|R|} \sum_{w \in R} \log p(w)$$

where $R$ refers to all replies. Entropy is used in [17] and [12] to measure the serendipity of generated utterances. (Note that the entropy of retrieved replies is not a fair metric to compare in Table 3, because the retrieval system has filtered out short, meaningless utterances in advance by surface statistics, e.g., length. We nevertheless report the result here out of curiosity: its entropy is 9.507, which is even higher than groundtruth.) The results in Table 3 confirm that biseq2seq indeed integrates information from the retrieved candidate, so that it alleviates the "low-substance" problem of RNNs and can generate utterances more meaningful than traditional seq2seq. The statistics also show that biseq2seq generates longer sentences than seq2seq.
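
One way to compute an entropy measure of this kind is sketched below. Several choices here are assumptions rather than the authors' procedure: the word probability p(w) is estimated as the unigram frequency within the replies themselves, the normalizer is the total number of word tokens, and the logarithm is taken in base 2.

```python
import math
from collections import Counter


def average_entropy(replies):
    """Average of -log2 p(w) over all word tokens, with unigram p(w)."""
    tokens = [w for reply in replies for w in reply.split()]
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(math.log2(counts[w] / n) for w in tokens) / n


# Tokens: a, a, b, c -> p(a) = 0.5, p(b) = p(c) = 0.25.
# The per-token values of -log2 p(w) are 1, 1, 2, 2, so the average is 1.5.
entropy = average_entropy(["a a b", "c"])
```

Frequent, generic words receive a small -log p(w), so replies made of such words pull the average down, which matches the reading that a larger value indicates less common replies.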

Figure 3: The percentage by which our post-reranker chooses a retrieved reply or a generated reply. (a) Ensemble of Retrieval and seq2seq; (b) Ensemble of Retrieval and biseq2seq.

            Utterance (Translated)                                        Reranked
Query       手机拍照效果不错 (This mobile phone's photo effect is pretty good.)
Retrieved   所以好心动 (I really have a crush on it.)
biseq2seq   心动不如行动 (Rushing for it rather than having a crush on it.)
seq2seq     哈哈 (Haha)

Query       有给孕妇推荐的保健品啥的么? (Could you please recommend some health products for pregnant women?)
Retrieved   打算上孕妇装? (Do you want maternity clothes?)
biseq2seq   打算买护肤品么? (Do you want some skin care products?)
seq2seq     不好意思 (I'm sorry.)
Table 4: Examples of retrieved replies and generated ones. An arrow indicates the one selected during post-reranking. Also included are replies generated by conventional seq2seq. Note that seq2seq is not part of our model and thus not considered for reranking.

RQ2: How do the retrieval- and generation-based systems contribute to post-reranking (step ② in Figure 1)?

We plot in Figure 3 the percentage by which the post-reranker chooses a retrieved candidate or a generated one. In the retrieval-and-seq2seq ensemble (Figure 3a), 54.65% retrieved results and 45.35% generated ones are selected. In the retrieval-and-biseq2seq ensemble (Figure 3b), the percentages become 44.77% vs. 55.23%. The trend further indicates that biseq2seq is better than seq2seq (at least) from the reranker's point of view. More importantly, as the percentages are close to 50%, both the retrieval system and the generation system contribute a significant portion to our final ensemble.

RQ3: Do we obtain further gain by combining the two gadgets ① and ② in Figure 1?

We would also like to verify if the combination of biseq2seq and post-reranking mechanisms will yield further gain in our ensemble. To test this, we compare the full model Rerank(Retrieval,biseq2seq) with an ensemble that uses traditional seq2seq, i.e., Rerank(Retrieval,seq2seq). As indicated in Table 2, even with the post-reranking mechanism, the ensemble with underlying biseq2seq still outperforms the one with seq2seq. Likewise, Rerank(Retrieval,biseq2seq) outperforms both Retrieval and biseq2seq. These results are consistent in terms of all metrics except a BLEU-4 score.

Through the above ablation tests, we conclude that both gadgets (biseq2seq and post-reranking) play a role in our ensemble when we combine the retrieval and the generative systems.

3.5 Case Study

Table 4 presents two examples of our ensemble and its “base” models. We see that biseq2seq is indeed influenced by the retrieved candidates. As opposed to traditional seq2seq, several content words in the retrieved replies (e.g., crush) also appear in biseq2seq’s output, making the utterances more meaningful. The post-reranker also chooses a more appropriate candidate as the reply.

4 Related Work

In early years, researchers mainly focused on domain-specific dialog systems, e.g., train routing [2], movie information [1], and human tutoring [5]. Typically, a pre-constructed ontology defines a finite set of slots and values, for example, cuisine, location, and price range in a food service dialog system; during human-computer interaction, a state tracker fills plausible values into each slot from user input, and recommends the restaurant that best meets the user's requirement [26, 13, 24].

In the open domain, however, such slot-filling approaches would probably fail because of the diversity of topics and natural language utterances. [6] applies information retrieval techniques to search for related queries and replies. [7] and [28] use both shallow hand-crafted features and deep neural networks for matching. [10] proposes a random walk-style algorithm to rank candidate replies. In addition, their model can introduce additional content (related entities in the dialog context) by searching a knowledge base when a stalemate occurs during human-computer conversation.

Generative dialog systems have recently attracted increasing attention in the NLP community. [14] formulates query-reply transformation as phrase-based machine translation. Since last year, the renewed prosperity of neural networks has brought an emerging trend of using RNNs for dialog systems [22, 19, 18, 16]. However, a known issue with RNNs is that they prefer to generate short, meaningless utterances. [8] proposes a mutual information objective in contrast to the conventional maximum likelihood criterion. [12] and [27] introduce additional content (either the most mutually informative word or topic information) to the reply generator. [17] applies a variational encoder to capture query information as a distribution, from which a random vector is sampled for reply generation.

To the best of our knowledge, we are the first to combine retrieval-based and generation-based dialog systems. The use of biseq2seq and post-reranking is also a new insight of this paper.

5 Conclusion and Future Work

In this paper, we propose a novel ensemble of retrieval-based and generation-based open-domain dialog systems. The retrieval part searches for a best-match candidate reply, which is, along with the original query, fed to an RNN-based biseq2seq reply generator. The generated utterance is fed back as a new candidate to the retrieval system for post-reranking. Experimental results show that our ensemble outperforms its underlying retrieval system and generation system by a large margin. In addition, the ablation test demonstrates that both the biseq2seq and post-reranking mechanisms play an important role in the ensemble.

Our research also points out several promising directions for future work, for example, developing new mechanisms of combining retrieval and generative dialog systems, as well as incorporating other data-driven approaches to human-computer conversation.