EnsembleGAN: Adversarial Learning for Retrieval-Generation Ensemble Model on Short-Text Conversation

04/30/2020 ∙ by Jiayi Zhang, et al. ∙ Peking University

Generating high-quality responses has always been a challenge for human-computer dialogue systems. Existing dialogue systems generally derive from either retrieval-based or generative-based approaches, both of which have their own pros and cons. Despite the natural idea of an ensemble model of the two, existing ensemble methods only focused on leveraging one approach to enhance another; we argue, however, that the two can be further mutually enhanced with a proper training strategy. In this paper, we propose EnsembleGAN, an adversarial learning framework for enhancing a retrieval-generation ensemble model in the open-domain conversation scenario. It consists of a language-model-like generator, a ranker generator, and a ranker discriminator. Aiming at generating responses that approximate the ground truth and receive high ranking scores from the discriminator, the two generators learn to generate improved, highly relevant responses and competitive unobserved candidates respectively, while the discriminative ranker is trained to identify true responses from adversarial ones, thus featuring the merits of both generator counterparts. Experimental results on large short-text conversation data demonstrate the effectiveness of EnsembleGAN through the amelioration of both human and automatic evaluation metrics.




1. Introduction

Natural language human-computer conversation has long been an attractive yet challenging task in artificial intelligence (AI), for it requires both language understanding and reasoning (Shang et al., 2015). While early works mainly focused on domain-specific scenarios such as ticket booking, open-domain chatbot-human conversation has gained popularity recently, not only for its commercial value (e.g., Xiaoice, http://www.msxiaoice.com/, from Microsoft), but also for the rapid growth of online social media, which supplies the tremendous data that data-driven deep learning methods need to prove worthwhile. Current conversation systems can generally be divided into two categories, namely the retrieval-based and the generative-based approach.

Given a user input utterance (also called a query), a retrieval-based system usually retrieves a number of response candidates from a pre-constructed index, and then selects the best-matching one as the response to the human input using semantic matching (Wang et al., 2013; Yan et al., 2016; Wu et al., 2017). The retrieved responses usually have varied expressions with rich information and language fluency. However, limited by the capacity of the pre-constructed repository, the selected response might seem less customized for unobserved novel queries.

Meanwhile, a generative conversation system works differently: it generates responses token by token according to conditional probabilistic language models (LM) such as seq2seq with attention (Bahdanau et al., 2015), which produces appropriate and tailored responses to most queries, but often suffers from a lack of language fluency and the problem of universal responses (e.g., "I don't know" and "Me too") due to statistical model incapabilities (Chen et al., 2017). Various ameliorations have been proposed to enrich the generation, either by better exploiting internal features such as a mutual-information-based objective function (Li et al., 2016), dynamic vocabularies (Wu et al., 2018b) and diverse beam search (Vijayakumar et al., 2018), or by incorporating external knowledge, such as topic information (Xing et al., 2017), cue words (Mou et al., 2016; Yao et al., 2017), dialog acts (Zhao et al., 2017), and common-sense knowledge (Young et al., 2018).

On the other hand, studies seeking an ensemble of both retrieval and generative approaches show great improvements in dialogue generation performance. Song et al. (2018) proposed the Multi-Seq2Seq model, which leverages responses retrieved by retrieval-based dialog systems to enhance generation-based dialog systems, thus synthesizing a more informative response. Similarly, Weston et al. (2018) designed a retrieve-and-refine model that treats the retrieval as additional context for the sequence generator, avoiding universal issues such as producing short sentences with frequent words. Wu et al. (2018a) introduced a prototype-then-edit paradigm for their conversation system, editing a retrieval-based prototype with a seq2seq model to increase the diversity and informativeness of the generation results.

Despite the performance gain of an ensemble compared with either a retrieval or a generative model alone, previous works only focused on ameliorating one approach based on the other, leaving great potential for further progress by allowing both methods to be mutually enhanced. Inspired by adversarial learning (Goodfellow et al., 2014), we propose a generative adversarial framework for improving an ensemble on short-text conversation, called EnsembleGAN throughout the paper. Particularly, EnsembleGAN consists of two generators and one discriminator. The LM-like generator G_θ is responsible for synthesizing tailored responses via a sequence-to-sequence framework; the ranking-based generator G_φ aims at selecting highly competitive negative responses from a pre-retrieval module and G_θ; and finally the ranking-based discriminator D_ψ endeavors to distinguish the ground truth from adversarial candidates provided by the pre-retrieval module and the two generators (G_θ and G_φ).

The motivation is that, through adversarial learning, with G_θ generating improved, highly relevant responses and G_φ providing enriched and fluent unobserved as well as synthetic candidates, the discriminative ranker D_ψ can be further trained to identify responses that are highly correlated, informative and fluent, thus absorbing the merits of both of its generative counterparts. The proposed EnsembleGAN framework is intuitively suited for improving a combination of any neural-based generative and retrieval approaches towards better global optima. The main contribution of this paper is threefold, summarized as follows:

  • We introduce a novel end-to-end generative adversarial framework that aims to mutually enhance both the generative and the retrieval module, leading to a better amelioration of a dialogue ensemble model.

  • We make extensive studies on ensembles of various generators and discriminators, providing insights into global and local optimization from the ensemble perspective through both quantitative and qualitative analysis.

  • We demonstrate the effectiveness of the proposed EnsembleGAN by performing experiments on a large mixed STC dataset; the gains on various metrics confirm that the ensemble model as well as each of its modules can be enhanced by our method.

2. Related Work

Open-domain dialogue systems have been attracting increasing attention in recent years. Researchers have made various progress on building both generative-based (Shang et al., 2015; Mou et al., 2016; Tao et al., 2018a) and retrieval-based conversation systems (Yan et al., 2016, 2017; Wu et al., 2017; Yan and Zhao, 2018; Tao et al., 2019). Besides, with the success of generative adversarial networks (GANs) (Goodfellow et al., 2014) on computer vision tasks such as image translation (Zhu et al., 2017) and image captioning (Dai et al., 2017), GAN applications have also started to emerge in the domain of natural language processing (NLP), such as dialogue generation (Li et al., 2017; Xu et al., 2018), machine translation (Yang et al., 2018) and text summarization (Liu et al., 2018), all demonstrating the effectiveness of the GAN mechanism in NLP. With respect to dialogue generation, GAN-related research can be generally categorized into GANs on generative-based models and GANs on retrieval-based models.

As for sequence generation models, where generation is also regarded as a sequential decision-making process in reinforcement learning, Yu et al. (2017) proposed the seqGAN framework, which bypasses the differentiation problem for discrete token generation by applying a Monte Carlo roll-out policy, with a recurrent neural network (RNN) as generator and a binary classifier as discriminator. What followed are RankGAN (Lin et al., 2017), which treats the discrimination phase as a learning-to-rank optimization problem as opposed to binary classification; dialogueGAN (Li et al., 2017), which adapts the GAN mechanism on a seq2seq model for the dialogue generation scenario, with a discriminator capable of identifying true query-response pairs from fake pairs; and DPGAN (Xu et al., 2018), which promotes response diversity by introducing an LM-based discriminator that overcomes the saturation problem of classifier-based discriminators. Nevertheless, even state-of-the-art generative approaches cannot achieve performance comparable to retrieval-based approaches in terms of language fluency and diversity of the generated responses.

As for retrieval-based models, Wang et al. (2017b) proposed the IRGAN framework, which unifies generative and discriminative ranking-based retrieval models through adversarial learning. While the generator learns the document relevance distribution and is able to generate (or select) unobserved documents that are difficult for the discriminative ranker to rank correctly, the discriminator is trained to distinguish well-matching query-response pairs from bad ones. However effective IRGAN may be, in a conversation scenario a pure retrieval system will always be limited by the constructed query-response repository. The adversarial responses, observed or not, might not be suitable for novel queries after all, which is a common problem for retrieval-based conversation systems that lies beyond IRGAN's capability.

While previous GAN-related studies only focused on improving either the generative-based or the retrieval-based approach alone, our work can be categorized as a unified GAN framework that applies the aforementioned GAN mechanisms to both the retrieval model and the sequence generation model of an ensemble, with each of its modules involved in adversarial learning in a different role. While being most related to RankGAN and IRGAN, our work has the following differences:

  • RankGAN only trains a language model through point-wise ranking of independent human-written and synthetic sentences, while EnsembleGAN trains a generative seq2seq model (G_θ) through pair-wise ranking (D_ψ) of ground-truth and negative responses, with both G_θ and D_ψ conditioned on the user's query, let alone the existence of another strong competitor, the ranker generator G_φ, providing negative adversarial samples.

  • While IRGAN allows a generative and a discriminative retrieval model to compete against each other, EnsembleGAN lets both rankers G_φ and D_ψ compete against each other as ensembles, with the constant involvement of the response generation module G_θ included in a more delicate three-stage sampling strategy.

  • EnsembleGAN unifies both GAN mechanisms with a shared overall learning objective among all generators and the discriminator, enhancing an ensemble of generative and retrieval-based approaches towards better global optima.

3. Preliminaries

Before diving into the details of our EnsembleGAN framework, we first introduce the generation-based conversation model and the retrieval-based conversation model, which are the basis of our ensemble model.

Response Generation Model

An LM-based probabilistic conversation model usually employs the seq2seq encoder-decoder framework, where in general the encoder learns the query representation and the decoder generates the response sequence token by token based on the encoder output (Shang et al., 2015). For an RNN-based seq2seq model with attention mechanism (Bahdanau et al., 2015), the generation probability of the current word y_t of the response, given a query x = (x_1, …, x_{T_x}) of length T_x, can be generally modeled as follows:

p(y_t | y_1, …, y_{t−1}, x) = g(y_{t−1}, s_t, c_t),  s_t = f(s_{t−1}, y_{t−1}, c_t)   (1)

where f and g are the recurrence functions, h_i and s_t represent the hidden states of the encoder and decoder, and c_t = Σ_{i=1}^{T_x} α_{ti} h_i is the context vector obtained by the attention mechanism, which takes the form of a weighted sum of the h_i. The weight factor α_{ti} is generally computed as a similarity between s_{t−1} and each h_i, allowing the decoder to attend to different parts of the context at every decoding step. The cross-entropy loss is often applied for model training, with ŷ_t the corresponding ground-truth word.
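To make the attention step concrete, here is a minimal NumPy sketch of computing one decoding step's context vector as a softmax-weighted sum of encoder states. The dot-product scorer is an illustrative stand-in for the learned alignment model; all shapes and names are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(h_enc, s_prev):
    # Dot-product similarity stands in for the learned alignment scorer.
    scores = h_enc @ s_prev          # similarity of s_{t-1} to each encoder state h_i
    alpha = softmax(scores)          # attention weights, sum to 1
    c_t = alpha @ h_enc              # context vector: weighted sum of the h_i
    return c_t, alpha

# Toy example: 4 encoder hidden states of dimension 3
h_enc = np.random.RandomState(0).randn(4, 3)
s_prev = np.ones(3)
c_t, alpha = attention_context(h_enc, s_prev)
```

The same weights alpha are recomputed at every decoding step, so the decoder can attend to different parts of the query as generation proceeds.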

Response Ranking Model

Given a query q and candidates r provided by a fast pre-retrieval module (we apply Lucene, https://lucenenet.apache.org/, to index all query-response pairs and use the built-in TF-IDF method to retrieve candidates, following Song et al. (2018)), the ranking model learns to compute a relevance score S(q, r) between each candidate r and the query q. Instead of the absolute relevance of individual responses (a.k.a. point-wise ranking), we train the model through pair-wise ranking, for a user's relative preference on a pair of documents is often more easily captured (Wang et al., 2017b). Hence, the probability of a response pair with r⁺ more relevant than r⁻ (noted r⁺ ≻ r⁻) being correctly ranked can be estimated by the normalized distance of their matching relevance to q:

P(r⁺ ≻ r⁻ | q) = σ(S(q, r⁺) − S(q, r⁻))   (2)

where σ is the sigmoid function and S(·, ·) is the ranker's scoring function, defined by any matching model. We train the ranker to rank the ground-truth response r⁺ higher than a sampled negative candidate r⁻, with the pair-wise ranking loss defined as a hinge function (Herbrich, 2008):

L_rank = (1/N) Σ_{n=1}^{N} max(0, ε − S(q_n, r_n⁺) + S(q_n, r_n⁻))   (3)

where N is the number of training samples and ε denotes the margin allowing for a flexible decision boundary. While both the response generation and ranking models can be used alone as single models, they form an ensemble when the latter reranks both pre-retrieved candidates and generated responses and finally selects the top-ranked response.
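The pair-wise hinge objective can be sketched in a few lines of NumPy; the margin value and the batch of scores below are illustrative assumptions.

```python
import numpy as np

def pairwise_hinge_loss(s_pos, s_neg, margin=1.0):
    """Hinge ranking loss: penalize whenever the ground-truth response
    is not scored at least `margin` above the sampled negative."""
    return np.maximum(0.0, margin - (s_pos - s_neg)).mean()

# Scores S(q, r+) and S(q, r-) for a toy batch of 3 query-response pairs
s_pos = np.array([2.0, 0.5, 1.0])
s_neg = np.array([0.0, 0.8, 1.0])
loss = pairwise_hinge_loss(s_pos, s_neg, margin=1.0)
```

Only pairs violating the margin contribute to the loss, so a well-separated pair (like the first one here) incurs zero penalty.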

4. EnsembleGAN Framework

Figure 1. Illustration of the EnsembleGAN architecture (best viewed in color): the generators G_θ (seq2seq) and G_φ (ranker), the discriminator D_ψ, and the three-stage sampling strategy are represented by blue, green, grey and orange blocks, respectively. (1) and (2) denote the training phases of G-steps and D-steps respectively, as defined in Algorithm 1.

4.1. Model Overview

Figure 1 illustrates the overall architecture of our proposed EnsembleGAN framework. Given a set of user queries {q_n}, the original ensemble applies both its generation and pre-retrieval modules to synthesize and retrieve response candidates for each q_n, respectively. All candidates are ranked together based on the scoring function S of the ranking module.

1) Generative seq2seq model G_θ, which inherits from the generation module of the ensemble, is responsible for synthesizing response candidates r̃ given a query q, as depicted in Eq.(7), with the application of a Monte Carlo (MC) roll-out policy. By combining each r̃ with the ground-truth response r⁺, we directly generate negative response pairs aiming at receiving high ranking scores from the discriminator; this process is also noted G_θ(r̃ | q) for formulation coherence.

2) Generative ranking model G_φ, which inherits from the response ranking model of the ensemble, learns to approximate the true relevance distribution over response pairs. Hence, given the true response r⁺, we generate highly competitive negative samples r̃ ∼ G_φ(r̃ | q) as specified in Eq.(9) so as to challenge the discriminator.

3) Discriminative ranking model D_ψ, which inherits from the same ranking model as G_φ, endeavors on the contrary to distinguish the true response pairs from adversarial candidates provided by both generators (G_θ and G_φ). After the adversarial training, each of G_θ, G_φ and D_ψ can be used alone as a single model, or we can form an improved ensemble consisting of the generation model G_θ and a ranking model (either G_φ or D_ψ) as described previously.
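The rerank-and-select behavior of the resulting ensemble can be sketched as follows; the `generate`, `retrieve` and `score` callables are hypothetical stand-ins for the generation module, pre-retrieval module and ranking model.

```python
def ensemble_respond(query, generate, retrieve, score):
    """Form the ensemble: pool synthetic and pre-retrieved candidates,
    rerank them with the ranking model, and return the top one."""
    candidates = generate(query) + retrieve(query)
    return max(candidates, key=lambda r: score(query, r))

# Toy stand-ins (hypothetical) for the three modules
gen = lambda q: ["generated reply"]
ret = lambda q: ["retrieved reply A", "retrieved reply B"]
scr = lambda q, r: len(r)  # dummy scorer: prefer longer responses
best = ensemble_respond("hello", gen, ret, scr)
```

At inference time only the query is needed: both candidate sources and the ranker operate without access to any ground-truth response.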

4.2. Adversarial Training for the Ensemble

4.2.1. Overall Objective

In our generative adversarial framework for the ensemble, both generators try to generate fake samples that get high ranking scores so as to fool the discriminator; the discriminator, on the contrary, is expected to distinguish the good samples from the bad by ranking more precisely as well as scoring down negative samples. We summarize the minimax game among the generators G_θ, G_φ and the discriminator D_ψ with the objective function as follows:

J = min_{θ,φ} max_{ψ} Σ_{n=1}^{N} ( E_{(r_n⁺ ≻ r_n⁻) ∼ p_true(·|q_n)} [log D_ψ(r_n⁺ ≻ r_n⁻ | q_n)] + E_{r̃_n ∼ G(·|q_n)} [log(1 − D_ψ(r̃_n ≻ r_n⁺ | q_n))] ),  G ∈ {G_θ, G_φ}   (4)

where E denotes the mathematical expectation, N the number of training samples, and (r_n⁺, r_n⁻) and (r_n⁺, r̃_n) are the true and generated response pairs, respectively.

4.2.2. Optimizing Discriminative Ranker

As shown in Eq.(2) previously, we design the discriminative ranker D_ψ to evaluate the probability of a response pair being correctly ranked given the query q. Combining the ground-truth responses with the fake ones generated by both the current G_θ and G_φ, the optimal parameters ψ* of D_ψ are obtained as follows:

ψ* = argmax_{ψ} Σ_{n=1}^{N} ( E_{(r_n⁺ ≻ r_n⁻) ∼ p_true(·|q_n)} [log D_ψ(r_n⁺ ≻ r_n⁻ | q_n)] + E_{r̃_n ∼ G(·|q_n)} [log(1 − D_ψ(r̃_n ≻ r_n⁺ | q_n))] )   (5)

where (r_n⁺, r_n⁻) and r̃_n are defined in Eq.(4). Such an optimization problem is usually solved by gradient descent as long as D_ψ is differentiable with respect to ψ. When training the generators, D_ψ is used to provide the reward of generated negative samples, which will be detailed later in this section.

4.2.3. Optimizing Generative Seq2Seq

At the first stage, we enhance the generative seq2seq model G_θ through the discriminative ranker D_ψ. Given a user query q, the generation of a sequence Y_{1:T} = (y_1, …, y_T) can be regarded as a series of decisions made at time steps t by the policy G_θ(y_t | Y_{1:t−1}, q) as defined in Eq.(1). However, since D_ψ only provides the reward for a complete sequence, the lack of intermediate reward at every time step ignores the long-term reward and causes the model to be shortsighted. We hence apply the MC roll-out policy (Yu et al., 2017; Lin et al., 2017) to tackle the problem, which repeatedly rolls out incomplete sequences until the end-of-sequence token so as to get an expected reward from D_ψ for every time step. With the true response r⁺, the expected end reward of a response pair is defined as follows:

R(Y_{1:t−1}, y_t | q, r⁺) = E_{Y_{1:T} ∼ G_θ(·|Y_{1:t}, q)} [ D_ψ(Y_{1:T} ≻ r⁺ | q) ]   (6)

where Y_{1:t−1} is the current state with t−1 tokens already generated, and y_1 the initial token. Writing MC^{G_θ}(Y_{1:t}; m) for the m completed T-length sequences Ȳ_{1:T} rolled out from the current Y_{1:t} according to the m-time MC roll-out policy, the action-value function is defined as follows:

Q(Y_{1:t−1}, y_t) = (1/m) Σ_{j=1}^{m} D_ψ(Ȳ_{1:T}^{(j)} ≻ r⁺ | q), with Ȳ_{1:T}^{(j)} ∈ MC^{G_θ}(Y_{1:t}; m), for t < T;  Q(Y_{1:T−1}, y_T) = D_ψ(Y_{1:T} ≻ r⁺ | q)   (7)

Hence, the instant reward for time step t is calculated as the average ranking score from D_ψ over all sampled response pairs obtained by repeatedly rolling out Y_{1:t} for m times. In contrast to the original RankGAN, both generator and discriminator are conditioned on the given query q, which is a necessary adaptation in the case of dialogue generation. Note that such a configuration can also be referred to as the conditional GAN framework (Mirza and Osindero, 2014).
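The roll-out reward estimate amounts to completing a partial sequence several times and averaging the discriminator's scores for the completions; a minimal sketch, where `rollout` and `rank_score` are toy stand-ins for the generator policy and the discriminative ranker.

```python
import random

def mc_rollout_reward(prefix, rollout, rank_score, n_rollouts=16):
    """Estimate the intermediate reward of a partial sequence by
    completing it n_rollouts times and averaging the ranking score
    of each completed response."""
    total = 0.0
    for _ in range(n_rollouts):
        completed = rollout(prefix)     # sample a full sequence from the policy
        total += rank_score(completed)  # discriminator's score for the completion
    return total / n_rollouts

# Toy stand-ins: rollout appends random tokens, scorer rewards token "a"
rng = random.Random(0)
rollout = lambda p: p + [rng.choice(["a", "b"]) for _ in range(3)]
score = lambda seq: 1.0 if "a" in seq else 0.0
r = mc_rollout_reward(["<bos>"], rollout, score, n_rollouts=8)
```

More rollouts give a lower-variance reward estimate at proportionally higher cost, which is the usual trade-off when choosing the roll-out count.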

4.2.4. Optimizing Generative Ranker

The second stage involves the amelioration of the generator G_φ through the discriminator D_ψ, with the objective function defined as below:

φ* = argmin_{φ} Σ_{n=1}^{N} E_{r̃_n ∼ G_φ(·|q_n)} [ log(1 − D_ψ*(r̃_n ≻ r_n⁺ | q_n)) ]   (8)

where ψ* denotes that this second stage is actually based on the first stage discussed above, with D_ψ fixed as a result. Inheriting from the same ranking model as D_ψ, we train G_φ to generate competitive negative response pairs that receive high ranking scores from D_ψ, whereby both the ranking-based generative and discriminative models can get improved (Wang et al., 2017b). More precisely, given a true pair (q, r⁺) and a scoring function S_φ, the chance of selecting a negative sample r̃ according to the relevance distribution of response pairs is defined by a softmax function as follows:

G_φ(r̃ | q) = exp(S_φ(q, r̃)) / Σ_{r ∈ C} exp(S_φ(q, r))   (9)

where C represents the M-sized candidate pool with ground-truth responses excluded. Despite other possible configurations, as observed in Wang et al. (2017b), we take G_φ(r̃ | q) directly as the relevance distribution of an individual response r̃, not only for simplicity, but for the coherence of both G_θ and G_φ being able to sample responses independently of the ground-truth response, as is the real usage case after training.
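Drawing a negative candidate in proportion to the softmax of the ranker's relevance scores can be sketched as follows; the score values are illustrative.

```python
import numpy as np

def sample_negative(scores, rng):
    """Draw a negative candidate index from the softmax of the ranker's
    relevance scores over the M-sized candidate pool."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    probs = e / e.sum()
    return rng.choice(len(scores), p=probs), probs

rng = np.random.default_rng(0)
scores = np.array([0.1, 2.0, -1.0, 0.5])  # S(q, r) for 4 pool candidates
idx, probs = sample_negative(scores, rng)
```

High-scoring candidates are sampled most often, so the negatives handed to the discriminator tend to be the hardest ones the ranker can currently find.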

The candidate pool C is of importance for the capability of G_φ to sample unobserved as well as highly competitive responses. In addition to the random sampling strategy that draws random responses (C_rand) from the database, as in the original IRGAN, we apply both the pre-retrieval module, to retrieve candidates (C_retr) similar to the ground-truth responses regardless of queries, and G_θ, to synthesize relevant responses (C_syn), all of which is summarized as a three-stage sampling strategy:

C = C_rand ∪ C_retr ∪ C_syn   (10)

The design of C not only compensates for the ineffectiveness of random sampling at producing competitive responses from a huge dialogue database in our case, it also enables the generator G_φ to work as an ensemble with the response generation model G_θ, thus always considering the cooperation of both generative-based and retrieval-based approaches during adversarial learning.
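A minimal sketch of assembling the three-stage candidate pool; the module callables (`retrieve_similar`, `generate`) and data are hypothetical stand-ins, not the paper's implementation.

```python
import random

def build_candidate_pool(query, r_plus, database, retrieve_similar,
                         generate, n_random=4, rng=random.Random(0)):
    """Three-stage candidate pool: random responses from the database,
    responses retrieved by similarity to the ground truth r_plus, and
    responses synthesized by the seq2seq generator for the query."""
    c_rand = rng.sample(database, n_random)
    c_retr = retrieve_similar(r_plus)
    c_syn = generate(query)
    pool = c_rand + c_retr + c_syn
    return [r for r in pool if r != r_plus]  # exclude the ground truth

# Hypothetical toy modules
db = [f"resp{i}" for i in range(10)]
pool = build_candidate_pool("q", "resp1", db,
                            retrieve_similar=lambda r: ["resp-like-" + r],
                            generate=lambda q: ["generated-for-" + q])
```

The generator-supplied entries are what distinguish this pool from IRGAN's purely random one: they keep the retrieval side exposed to fluent synthetic candidates throughout training.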

0:    Input: Generators G_θ, G_φ and discriminator D_ψ; training data S_gen, S_rank and retrieval database R; three-stage sampling approach G_φ(r̃ | q) as in Eq.(10); K_θ, K_φ the sampling sizes of G_θ and G_φ, respectively.
0:    Output: Ensemble of the seq2seq model G_θ and a ranker model (G_φ or D_ψ)
1:  Initialize G_θ, D_ψ with random weights θ, ψ; G_φ shares the initialization of D_ψ;
2:  Pretrain G_θ, D_ψ on S_gen, S_rank respectively
3:  for G_θ-steps do
4:     G_θ generates K_θ samples r̃ for each q;
5:     Update θ via the policy gradient defined in Eq.(11);
6:  end for
7:  for G_φ-steps do
8:     for each q do
9:        G_θ generates samples for the candidate pool C;
10:       G_φ generates K_φ samples via G_φ(r̃ | q);
11:    end for
12:    Update φ via the policy gradient defined in Eq.(11);
13: end for
14: for D_ψ-steps do
15:    G_θ generates samples for each q;
16:    G_φ generates negative samples via G_φ(r̃ | q) and combines them with positive samples from S_rank;
17:    Train the discriminator D_ψ according to Eq.(5)
18: end for
Algorithm 1 EnsembleGAN Minimax Game

4.2.5. Policy Gradient

Following Sutton et al. (2000), we apply policy gradient to update the generators' parameters through the feedback of D_ψ, for the sampling processes of both generators are non-differentiable. Hence, with ψ fixed, for each query q_n with true-negative response pair (r_n⁺, r̃_n), the minimization of Eq.(4) with respect to θ and φ can be deduced as follows (Lin et al., 2017; Wang et al., 2017b):

∇_θ J ≃ (1/T) Σ_{t=1}^{T} E_{y_t ∼ G_θ(·|Y_{1:t−1}, q_n)} [ R(Y_{1:t−1}, y_t | q_n, r_n⁺) ∇_θ log G_θ(y_t | Y_{1:t−1}, q_n) ]
∇_φ J ≃ (1/K) Σ_{k=1}^{K} log(1 − D_ψ(r̃_k ≻ r_n⁺ | q_n)) ∇_φ log G_φ(r̃_k | q_n)   (11)

where the reward terms are defined in Eq.(6) and Eq.(8) respectively, ∇ is the differential operator, T the length of the sequence generated by G_θ, and K the negative sampling size of G_φ.
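The policy-gradient update multiplies the gradient of the log-probability of a sampled action by its reward; a single-step categorical sketch in NumPy, where the softmax-policy parameterization is an illustrative assumption.

```python
import numpy as np

def reinforce_grad(logits, action, reward):
    """Single-step REINFORCE: gradient of reward * log pi(action) with
    respect to the logits of a categorical (softmax) policy.
    In practice `reward` is the roll-out score minus a baseline
    (an advantage) to reduce variance."""
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # d/d(logits) of log softmax(logits)[action] = one_hot - probs
    return reward * (one_hot - probs)

g = reinforce_grad(np.array([0.0, 1.0, 0.0]), action=1, reward=0.8)
```

A positive reward pushes probability mass toward the sampled action (positive gradient at its logit) and away from the others, which is exactly how the discriminator's score steers both generators here.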

4.2.6. Reward Setting

Normally, we would take the reward to be R = log(1 − D_ψ(r̃ ≻ r⁺ | q)). It is however problematic that the logarithm leads to instability of training (Goodfellow et al., 2014). We thus follow Wang et al. (2017b) with the advantage-function reward implementation defined as below:

R(r̃ | q, r⁺) = log(1 + exp(S_ψ(q, r̃) − S_ψ(q, r⁺))) − b(q)   (12)

where b(q) is a baseline term, taken as the average reward of the sampled negatives, which reduces the variance of the policy-gradient updates.
4.2.7. Overall Algorithm

We summarize the EnsembleGAN algorithm in Algorithm 1, where the generators G_θ, G_φ and the discriminator D_ψ are all initialized from a pretrained ensemble, with G_φ and D_ψ sharing the same parameter initialization.

Despite the existence of a Nash equilibrium between generator and discriminator in their minimax game, how they can be trained to achieve the desired convergence remains an open problem (Goodfellow et al., 2014). In our empirical study, we confirm that both the ranker D_ψ and the generator G_θ are enhanced by EnsembleGAN, while the ranker generator G_φ encounters a loss of performance after adversarial training, as also observed in Wang et al. (2017b).

5. Experiments

In this section, we compare our EnsembleGAN with several representative GAN mechanisms on a huge dialogue corpus. The goal of our experiments is to 1) evaluate the performance of our generation module and retrieval module for response generation and selection, and 2) evaluate the effectiveness of the proposed EnsembleGAN framework from the ensemble perspective.


| Features | | Retrieval Pool | Ranking Set | Generation Set | Test Set |
|---|---|---|---|---|---|
| Corpus | | Weibo+Toutiao | Weibo | Toutiao | Toutiao |
| Post | # Sent | 2,065,908 | 30,000 | 1,000,000 | 2,000 |
| | # Vocab | 251,523 | 29,272 | 120,996 | 5,642 |
| | Avg_L | 11.4 | 13.1 | 9.3 | 10.1 |
| Resp. | # Sent | 5,230,048 | 360,000 | 1,000,000 | 2,000 |
| | # Vocab | 628,254 | 28,000 | 121,763 | 4,544 |
| | Avg_L | 8.7 | 9.8 | 7.1 | 7.7 |
| Pair | # Pairs | 6,000,000 | 360,000 | 1,000,000 | 2,000 |

Table 1. The statistics of the mixed short-text conversation dataset. Resp. is short for response; # Sent, # Vocab and Avg_L denote the number of sentences, the vocabulary size and the average sentence length, respectively.

5.1. Dataset

We conduct our experiments on a large mixed dialogue dataset crawled from the Chinese online forums Weibo (https://www.weibo.com/) and Toutiao (https://www.toutiao.com/), containing millions of query-response pairs. For data pre-processing, we remove trivial responses like "wow" as well as the responses after the first 30 ones for topic consistency, following Shang et al. (2015). We use Jieba (https://github.com/fxsjy/jieba), a common Chinese NLP tool, to perform Chinese word segmentation on all sentences. Each query and reply contains on average 10.2 and 8.44 tokens, respectively. From the remaining query-response pairs, we randomly sampled 6,000,000 pairs as the retrieval pool for the pre-retrieval module, 1,000,000 and 50,000 pairs for training and validating the sequence generation model, 360,000 and 2,000 pairs for training and validating the ranking model (we apply the three-stage sampling strategy to generate 11 negative samples for each of 30,000 true query-response pairs), and finally 2,000 pairs as the test set for both models. We make sure that all test query-response pairs are excluded from the training and validation sets. More detailed data statistics are summarized in Table 1.

5.2. Baselines

We introduce baseline models and GAN competitors on three levels, namely the generation approach, the retrieval approach and the ensemble approach. In this section, we note GAN-G (resp. GAN-D) for the generator (resp. discriminator) of a GAN mechanism. EnsembleGAN is abbreviated as ensGAN for ease of presentation.

DialogueGAN. We consider dialogueGAN (Li et al., 2017) as our GAN competitor for the generation part, with a seq2seq generator and a binary-classifier-based discriminator that is trained to distinguish the true query-response pairs from the fake ones. In order to eliminate structural biases for a fair comparison, we adopt the very same deep matching model structure as our ranking model (detailed later) for its discriminator, instead of the hierarchical recurrent architecture applied in the original paper.

DPGAN. We consider the diversity-promoting GAN (DPGAN) (Xu et al., 2018) as a second GAN competitor for the generation part, with a seq2seq generator and a language-model-based discriminator that is trained to assign higher probability (lower perplexity) to true responses than to fake ones. The LM-based discriminator consists of a uni-directional LSTM (Hochreiter and Schmidhuber, 1997), as in the original paper.

RankGAN. We consider RankGAN as another GAN competitor. The original RankGAN (Lin et al., 2017) is an unconditional language model that is unsuitable for the dialogue generation scenario, as discussed previously; we hence modify RankGAN to consist of a seq2seq generator and a pairwise discriminative ranker, which can be considered as ensGAN without the ranker generator G_φ involved.

IRGAN. We also consider IRGAN (Wang et al., 2017b) as a GAN competitor. Similarly, it can be considered as ensGAN without any involvement of the seq2seq generator or the three-stage sampling strategy. All GAN mechanisms are applied to exactly the same pre-trained generation or ranking models for a fair comparison, and we evaluate each single component (generator or discriminator) as well as the derived ensemble (generation + ranking) for each GAN mechanism, resulting in various combinations which will be detailed later.

Response Generation Models (S2SA). We compare with the attention-based seq2seq model (S2SA) (Bahdanau et al., 2015), which has been widely adopted as a baseline in recent studies (Shang et al., 2015; Li et al., 2017). As a result, we have three derived adversarial sequence generators, namely dialogueGAN-G, DPGAN-G and RankGAN-G, which compete against ensGAN-G. Besides, we include the mutual-information-enhanced seq2seq model (MMI-S2SA) (Li et al., 2016) as another generative baseline method.

Pre-Retrieval Module (TF-IDF). The pre-retrieval module, as the basis of the retrieval approach, first calculates similarities between utterances (queries) based on simple TF-IDF scores and then retrieves the corresponding responses (Song et al., 2018). We report the Top-{1,2} responses, noted TF-IDF-{1,2}, respectively.

Response Ranking Models (Ranking). The pure retrieval system consists of a pre-retrieval module and a ranking (matching) model, where the pre-retrieved candidates are reranked by the ranker; we apply the state-of-the-art attentive conv-RNN model (Wang et al., 2017a) as our ranker baseline. Therefore, we have 5 derived adversarial rankers based on the same original ranker, namely RankGAN-D, IRGAN-G and IRGAN-D, which compete against our ensGAN-G and ensGAN-D.

Ensemble Models (Generation+Ranking). Ensemble models are constructed with a generation model, a pre-retrieval module and a ranking model. When given a query, the generative model (e.g., S2SA, RankGAN-G and ensGAN-G) synthesizes candidate responses. Then the ranking model (e.g., conv-RNN, IRGAN-D, RankGAN-D and ensGAN-D) is required to rerank both pre-retrieved candidates and synthetic responses, and select the top one response in the end. Besides, following Song et al. [18] and Wu et al. [28], we also consider Multi-Seq2Seq + GBDT reranker and Prototype-Edit as two baseline ensemble models.

5.3. Implementation Details

The seq2seq model is trained with a word embedding size of 300 for source and target vocabularies of the 30,000 most frequent tokens of queries and responses in the generation training set, covering 97.47% and 97.22% of the tokens that appear in queries and responses respectively. The remaining tokens are treated as the unknown token "UNK". We set the hidden size of the encoder and decoder to 512. The adversarial sampling size K_θ is applied during the G_θ training steps.

The conv-RNN ranker is trained with 200-dimensional word embeddings for a shared vocabulary of 40,000 tokens, covering 93.54% of the words in the retrieval pool. The size of the GRU is set to 200. The window sizes of the convolution kernels are set to (2, 3, 4, 8), with the numbers of filters equal to (250, 200, 200, 150), following Wang et al. (2017a). We pretrain the ranker to rank the ground-truth response to the top among negative samples consisting of 5 random samples, the top 5 pre-retrieved candidates and a synthetic one generated by the seq2seq model. During adversarial training, the ranker generator G_φ generates negative samples from a candidate pool according to the three-stage sampling strategy.
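The pretraining negatives described above (5 random, top-5 pre-retrieved, 1 synthetic) can be sketched as follows; the data and helper names are illustrative stand-ins.

```python
import random

def pretrain_negatives(r_plus, database, pre_retrieved, synthetic,
                       n_random=5, n_retrieved=5, rng=random.Random(0)):
    """Negative set used to pretrain the ranker: n_random random
    responses, the top pre-retrieved candidates, and one synthetic
    response from the seq2seq model."""
    negs = rng.sample([r for r in database if r != r_plus], n_random)
    negs += pre_retrieved[:n_retrieved] + [synthetic]
    return negs

# Toy database and hypothetical pre-retrieved candidates
db = [f"r{i}" for i in range(20)]
negs = pretrain_negatives("r0", db,
                          pre_retrieved=["p1", "p2", "p3", "p4", "p5", "p6"],
                          synthetic="gen")
```

Note that 5 + 5 + 1 gives the 11 negatives per true pair mentioned in the dataset construction.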

We use dropout of 0.2 for all models, and Adam optimizer (Kingma and Ba, 2015) with a mini-batch of 50. The learning rate of S2SA and conv-RNN are respectively and 0.001 during pre-training, and during adversarial learning.


| Module | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | EA | GM | VE | RUBER(arith) | RUBER(geo) |
|---|---|---|---|---|---|---|---|---|---|---|
| Generation | S2SA | 7.334 | 2.384 | 0.987 | 0.340 | 0.503 | 0.154 | 0.332 | 0.550 | 0.500 |
| Generation | MMI-S2SA | 8.468 | 2.464 | 0.956 | 0.404 | 0.526 | 0.149 | 0.342 | 0.557 | 0.521 |
| Generation | DialogGAN-G | 9.465 | 2.483 | 0.912 | 0.349 | 0.533 | 0.161 | 0.344 | 0.560 | 0.533 |
| Generation | DPGAN-G | 8.578 | 2.474 | 0.922 | 0.385 | 0.535 | 0.165 | 0.345 | 0.588 | 0.557 |
| Generation | RankGAN-G | 10.033 | 2.545 | 0.967 | 0.436 | 0.560 | 0.145 | 0.343 | 0.602 | 0.580 |
| Generation | ensGAN-G1 | 9.530 | 2.487 | 0.872 | 0.352 | 0.531 | 0.163 | 0.347 | 0.598 | 0.584 |
| Retrieval | TF-IDF-1 (pre-retrieval) | 7.026 | 2.175 | 0.928 | 0.460 | 0.537 | 0.152 | 0.337 | 0.541 | 0.486 |
| Retrieval | TF-IDF-2 (pre-retrieval) | 7.120 | 2.108 | 0.990 | 0.581 | 0.538 | 0.153 | 0.338 | 0.539 | 0.499 |
| Retrieval | Ranking (conv-RNN) | 7.242 | 2.213 | 0.933 | 0.488 | 0.543 | 0.151 | 0.339 | 0.558 | 0.519 |
| Retrieval | Ranking (RankGAN-D) | 7.441 | 2.194 | 0.945 | 0.490 | 0.547 | 0.152 | 0.341 | 0.571 | 0.535 |
| Retrieval | Ranking (IRGAN-G) | 7.225 | 2.166 | 0.867 | 0.409 | 0.540 | 0.152 | 0.337 | 0.560 | 0.518 |
| Retrieval | Ranking (IRGAN-D) | 7.451 | 2.362 | 1.012 | 0.528 | 0.553 | 0.156 | 0.343 | 0.573 | 0.542 |
| Retrieval | Ranking (ensGAN-G) | 7.057 | 2.129 | 0.897 | 0.460 | 0.539 | 0.150 | 0.338 | 0.549 | 0.516 |
| Retrieval | Ranking (ensGAN-D) | 7.452 | 2.320 | 1.004 | 0.527 | 0.548 | 0.153 | 0.341 | 0.579 | 0.539 |
| Ensemble | Multi-Seq2Seq + GBDT (Song et al., 2018) | 7.542 | 2.173 | 0.993 | 0.569 | 0.540 | 0.152 | 0.338 | 0.592 | 0.568 |
| Ensemble | Prototype-Edit (Wu et al., 2018a) | 7.926 | 2.334 | 1.120 | 0.571 | 0.557 | 0.164 | 0.346 | 0.610 | 0.587 |
| Ensemble | S2SA + conv-RNN | 7.630 | 2.299 | 1.125 | 0.555 | 0.544 | 0.153 | 0.341 | 0.564 | 0.535 |
| Ensemble | RankGAN-G + conv-RNN | 7.755 | 2.275 | 0.889 | 0.432 | 0.549 | 0.150 | 0.340 | 0.572 | 0.543 |
| Ensemble | ensGAN-G + conv-RNN | 7.570 | 2.168 | 0.871 | 0.423 | 0.544 | 0.156 | 0.339 | 0.568 | 0.540 |
| Ensemble | RankGAN-G + IRGAN-D | 8.827 | 2.693 | 1.234 | 0.716 | 0.560 | 0.152 | 0.348 | 0.608 | 0.577 |
| Ensemble | S2SA + IRGAN-D | 8.375 | 2.850 | 1.232 | 0.637 | 0.558 | 0.162 | 0.347 | 0.600 | 0.573 |
| Ensemble | S2SA + ensGAN-D | 8.535 | 2.749 | 1.297 | 0.715 | 0.547 | 0.159 | 0.345 | 0.595 | 0.569 |
| Ensemble | RankGAN-G + RankGAN-D | 8.715 | 2.501 | 1.075 | 0.580 | 0.561 | 0.154 | 0.347 | 0.615 | 0.591 |
| Ensemble | ensGAN-G + ensGAN-D | 9.339 | 2.876 | 1.277 | 0.763 | 0.559 | 0.178 | 0.352 | 0.621 | 0.605 |


Table 2. Overall performance of baselines and GAN competitors. Ranking(X) means that the candidate responses (generated by the pre-retrieval module) are re-ranked by the ranking model X. The highest score within each block is the reference point for comparison. The ground-truth RUBER scores are 0.815 and 0.798 for RUBER(arith) and RUBER(geo), respectively.

5.4. Evaluation Metrics

We adopt multiple automatic evaluation criteria as well as human evaluation for a comprehensive comparison.

BLEU. BLEU (Papineni et al., 2002) evaluates the word overlap between the proposed and the ground-truth responses. We use BLEU-k (k = 1, 2, 3, 4) to calculate their n-gram overlap, where BLEU-k denotes the BLEU score considering n-grams of length k.
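As a rough sketch, the core of BLEU-k (the modified k-gram precision, omitting the brevity penalty and smoothing of the full corpus-level metric) can be computed as:

```python
from collections import Counter

def bleu_k(hypothesis, reference, k):
    """Sentence-level modified k-gram precision on whitespace tokens:
    clipped k-gram matches divided by total hypothesis k-grams."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + k]) for i in range(len(hyp) - k + 1))
    ref_ngrams = Counter(tuple(ref[i:i + k]) for i in range(len(ref) - k + 1))
    if not hyp_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / sum(hyp_ngrams.values())
```

For example, `bleu_k("the cat", "the dog", 1)` yields 0.5, since one of the two hypothesis unigrams appears in the reference.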

Embedding-based metrics (EA, GM, VE). Following Liu et al. [12], we also apply three heuristics to measure the similarity between the proposed and ground-truth responses based on pre-trained word embeddings (we use pre-trained Chinese word embeddings available at https://github.com/Embedding/Chinese-Word-Vectors): Embedding Average (EA), Greedy Matching (GM), and Vector Extrema (VE).
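A minimal sketch of the three heuristics on toy word vectors (pure Python; `emb` is a hypothetical word-to-vector table assumed to contain every word):

```python
import math

def _cos(u, v):
    """Cosine similarity of two vectors (0.0 if either is a zero vector)."""
    nu, nv = math.sqrt(sum(x * x for x in u)), math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def embedding_average(resp, truth, emb):
    """EA: cosine between the mean word vectors of the two sentences."""
    mean = lambda ws: [sum(col) / len(ws) for col in zip(*(emb[w] for w in ws))]
    return _cos(mean(resp), mean(truth))

def greedy_matching(resp, truth, emb):
    """GM: each word greedily matched to its closest word on the other side,
    averaged over both directions."""
    def g(a, b):
        return sum(max(_cos(emb[w], emb[v]) for v in b) for w in a) / len(a)
    return (g(resp, truth) + g(truth, resp)) / 2

def vector_extrema(resp, truth, emb):
    """VE: cosine between the dimension-wise extreme values (largest
    magnitude per dimension) of the word vectors."""
    def ext(ws):
        return [max(col, key=abs) for col in zip(*(emb[w] for w in ws))]
    return _cos(ext(resp), ext(truth))
```

In practice these would run over the pre-trained embeddings above, with out-of-vocabulary handling added.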

Semantic Relevance (RUBER(arith), RUBER(geo)). Together with the embedding similarity, Tao et al. [21] evaluate the semantic relatedness between a response and its query based on neural matching models. Following the original paper, we report the arithmetic and geometric mean of the embedding similarity and the semantic relatedness, denoted as RUBER(arith) and RUBER(geo), respectively.
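The final combination step is simply a mean of the two component scores (a trivial sketch; `referenced` and `unreferenced` stand for the embedding-similarity and query-response relatedness scores):

```python
import math

def ruber(referenced, unreferenced, mode="arith"):
    """Combine the referenced (embedding-similarity) score and the
    unreferenced (query-response relatedness) score, RUBER-style."""
    if mode == "arith":
        return (referenced + unreferenced) / 2
    if mode == "geo":
        return math.sqrt(referenced * unreferenced)
    raise ValueError(mode)
```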

Retrieval Precision (P@1). We evaluate pure ranking-based retrieval systems by precision at position 1 (P@1), i.e., the ratio of queries for which a relevant response (in our case, the ground-truth response) is ranked first.
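Concretely, P@1 over a test set can be computed as (a trivial sketch with hypothetical inputs):

```python
def precision_at_1(ranked_lists, ground_truths):
    """P@1: fraction of queries whose top-ranked candidate is the ground truth."""
    hits = sum(ranked[0] == gt for ranked, gt in zip(ranked_lists, ground_truths))
    return hits / len(ground_truths)
```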

Human evaluation. We also conduct human evaluations for the generation and ensemble models, since automatic metrics might not be consistent with human annotations (Liu et al., 2016; Tao et al., 2018b). Following previous studies (Xing et al., 2017; Tao et al., 2018b; Song et al., 2018), we invited 3 well-educated volunteers to judge the quality of 100 randomly generated responses by different models (due to the numerous generation + ranking possibilities and space limitations, we only asked annotators to evaluate representative models with high automatic metric scores), based on the following criteria: a score of 0 indicates a bad response that is either dis-fluent or semantically irrelevant; +1 means a relevant but universal response; +2 indicates a fluent, relevant and informative response. We report the proportion of each score (0, +1, +2) for each model. Fleiss' kappa (Fleiss, 1971) scores are also reported.
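For reference, Fleiss' kappa over an items-by-categories count matrix (each row sums to the number of raters) can be computed as:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category-count rows.
    Each row must sum to the (constant) number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # per-item observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # chance agreement from the marginal category proportions
    totals = [sum(col) for col in zip(*ratings)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1, while values between 0.4 and 0.6 correspond to "moderate agreement".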

5.5. Results and Analysis

5.5.1. Overall Performance

Our evaluation is divided into three parts: the pure generation module, the pure retrieval module, and the ensemble. Table 2 summarizes the overall dialogue generation performance on automatic metrics of word overlap, embedding similarity, and semantic relevance. Figure 2 shows the P@1 scores for different retrieval systems together with the contribution of the two modules that constitute an ensemble, and Table 3 reports human evaluation results for representative models. Human agreement is validated by the kappa scores, whose values of 0.4 to 0.6 indicate "moderate agreement" among annotators. A higher value denotes a higher degree of agreement, e.g., 0.65 for S2SA, probably because it generates more dis-fluent or irrelevant responses that are easy to recognize. We first make several observations:


| Model | +2 | +1 | 0 | Kappa |
|---|---|---|---|---|
| S2SA | 0.12 | 0.40 | 0.48 | 0.65 |
| ensGAN-G | 0.14 | 0.49 | 0.36 | 0.55 |
| RankGAN-G | 0.16 | 0.39 | 0.45 | 0.43 |
| S2SA + conv-RNN | 0.21 | 0.33 | 0.47 | 0.52 |
| S2SA + IRGAN-D | 0.22 | 0.35 | 0.43 | 0.47 |
| S2SA + ensGAN-D | 0.25 | 0.35 | 0.40 | 0.49 |
| RankGAN-G + RankGAN-D | 0.28 | 0.35 | 0.36 | 0.46 |
| RankGAN-G + IRGAN-D | 0.30 | 0.36 | 0.35 | 0.53 |
| ensGAN-G + ensGAN-D | 0.37 | 0.38 | 0.26 | 0.45 |


Table 3. Results of human evaluation for generation and ensemble models. “Kappa” means Fleiss’ kappa.



As for the generation module, we first notice that GAN-enhanced seq2seq models achieve considerable improvement on most evaluation metrics, outperforming the S2SA and MMI-S2SA baselines. Both RankGAN-G and ensGAN-G aim at synthesizing responses that approximate true responses with higher ranking scores, which is demonstrated by the clear gain in their contribution ratios within ensembles shown in Figure 2: each contributes more than 40% of the final responses in the RankGAN and ensGAN ensembles. Their comparable enhancement of RUBER scores indicates better generations in terms of semantic relevance. Despite the higher word overlap and embedding average of RankGAN-G, ensGAN-G is not only better at improving the GM and VE metrics, indicating that it generates more key words carrying important information that is semantically similar to the ground truth (Liu et al., 2016), but is also capable of generating more satisfying responses with fewer 0 human scores according to Table 3.

Figure 2. (a) P@1 scores for various ranker-based retrieval systems. * denotes significant precision improvement (compared with conv-RNN) according to the Wilcoxon signed-rank test; and (b) The final response contribution of generation and ranking modules for ensembles.

As for retrieval methods, we see that they often achieve higher-order BLEU scores (e.g., BLEU-3 and BLEU-4) superior to generative approaches, since producing responses of better language fluency (and hence higher n-gram overlap to some extent) is undoubtedly their strong point. They are, however, inferior to generative methods in terms of RUBER scores, since the latter are generally better at generating tailored responses of high semantic relatedness (Song et al., 2018); similar results are also reported by Tao et al. [21]. Together with the P@1 scores in Figure 2, all discriminative rankers of the GAN approaches (IRGAN-D, RankGAN-D, ensGAN-D) are generally ameliorated in various aspects, while the generative rankers (IRGAN-G, ensGAN-G) somewhat deteriorate, which is also confirmed by Wang et al. [26]. One possible explanation is the sparsity of the positive response distribution compared with the negative ones during training, making it hard for a generative ranker to receive positive feedback from the discriminator. Without any generation module, IRGAN outperforms the others at enhancing a pure retrieval system, notably achieving the highest P@1 score. On the other hand, the P@1 scores of all methods remain low compared with common QA tasks (Wang et al., 2017b, a), which might be explained by the more complicated and noisy nature of the STC dataset (Wang et al., 2013).


As for the ensembles, they commonly outperform the single approaches: the scores in the third block (Ensemble) are generally better than those in the first two blocks (Generation and Retrieval), which is especially true for the GAN-enhanced ensembles. Among the various generation + ranking combinations, the ensGAN ensemble (ensGAN-G + ensGAN-D) outperforms both the IRGAN (S2SA + IRGAN-D) and RankGAN (RankGAN-G + RankGAN-D) ensembles with the largest gain on almost all metrics, while also achieving the most +2 and the fewest 0 scores in the human judgement. While RankGAN and IRGAN bring specific enhancement to the generative and retrieval module respectively, ensGAN improves the whole ensemble by letting its modules compete against each other, which might be regarded as seeking a global optimum, in contrast to other GANs that search for the local optimum of a single approach. While the ensGAN-G generation module accounts for more of the ensemble's final selections, ensGAN-D learns to rank (select) responses featuring the advantages of both the generative and the retrieval approach as expected, with the help of another strong negative sampler (the ranker generator) during adversarial training.

Figure 3. Cosine similarity between the ground-truth and synthetic word distributions by various generative models on test data, for different word-frequency levels (e.g., the top 300 frequent words). EnsGAN achieves satisfying performance, especially when considering words of lower frequency.

5.5.2. Discussion

In addition to the previous observations, in this section we provide further insights into the EnsembleGAN framework on several interesting aspects.

Ranking versus LM versus Binary-Classification.

As for the amelioration of the generative seq2seq model, DialogueGAN uses a binary classifier as the discriminator, DPGAN utilizes an LM-based discriminator, and both RankGAN and ensGAN apply a ranking-based discriminator. The superiority of adversarial ranking over binary classification is not only observed in our experiments but also confirmed in (Lin et al., 2017). The LM-based discriminator (DPGAN-D), on the other hand, by addressing the saturation issue of binary classification (Xu et al., 2018), brings improvement comparable to adversarial ranking; both help generate responses of higher quality as observed previously, as well as achieving better cosine similarity of word distributions (Figure 3). In particular, we apply adversarial ranking in our work because it is the very bridge that connects the adversarial training of the generative-based and retrieval-based methods in the EnsembleGAN framework.

1 + 1 > 2 for the Ensemble Approach.

Although it is unreasonable to exhaustively study all possible generation + ranking combinations, it is nevertheless interesting to directly combine the seemingly best two modules from their separate worlds, namely RankGAN-G + IRGAN-D, to see how such an ensemble performs compared with ensGAN. Apart from the overall results in Table 2, which already indicate that these two "best winners" do not get along as well as ensGAN-G + ensGAN-D, further evidence lies in the analysis of the ranking module shown in Figure 4. On one hand, the P@1 adversarial learning curves (Figure 4) show that IRGAN is better at enhancing a pure retrieval system, while RankGAN-D exhibits higher oscillation, probably due to its concentration on ranking the synthetic responses to the top, which makes its pure-retrieval P@1 performance unpredictable. On the other hand, the ensemble ensGAN-G + ensGAN-D turns out to be clearly advantageous among ensemble approaches in terms of the ranking loss (Figure 4) defined in Eq. (2). More specifically, we calculate the module-wise ranking loss for the final chosen responses coming from either the generation module or the pre-retrieval module; the overall ranking loss is computed as the weighted sum of the two losses based on the module contributions. We see that ensGAN-D generally achieves the lowest ranking loss with moderate variance, which clearly demonstrates that EnsembleGAN is indeed more inclined towards a global optimum without unilaterally enhancing a single module, and is thus better adapted for an ensemble of multiple modules, especially given that the direct combination of the two "best winners" RankGAN-G + IRGAN-D does not result in the lowest overall ranking loss (not even close).
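The weighted combination of module-wise losses amounts to the following (a trivial sketch; `contrib_gen` denotes the fraction of final responses contributed by the generation module):

```python
def overall_ranking_loss(loss_gen, loss_ret, contrib_gen):
    """Overall ranking loss as the contribution-weighted sum of the
    generation-module and pre-retrieval-module ranking losses."""
    return contrib_gen * loss_gen + (1.0 - contrib_gen) * loss_ret
```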

Figure 4. (a) Ranker P@1 learning curves; and (b) error bars of the mean and standard deviation of the ranking loss for different modules of the ensembles. Results are calculated on the test set.

The Merits of the Ranking Module.

In addition, we find a clear performance gap among the ensembles themselves. As shown in Table 2, combinations of the original S2SA + GAN-enhanced rankers generally bring better amelioration than combinations of GAN-enhanced S2SA + the original conv-RNN, underscoring the importance of the re-ranker for a dialogue ensemble, which is reasonable because all candidates have to be reranked by this final decision maker. Hence, despite the trend towards improving generative approaches, it is also worthwhile to concentrate on research into retrieval or ensemble methods so as to improve open-domain human-computer conversation.


Response generation cases

Case 1
Query: I can't play sniper.
Ground truth: You're messing up with me.
TF-IDF-1: Really can't play this.
TF-IDF-2: I really can't play games.
S2SA: Have you ever played?
MMI-S2SA: Right, I've been playing for a year.
DialogueGAN: I've played once, I don't know.
DPGAN: They're my men, what're you thinking?
RankGAN-G: You'll know once you've played.
ensGAN-G1: What I played was real.
Multi-Seq2Seq: I've played this game, and you?

Case 2
Query: Looking for the title of this song.
Ground truth: It's the theme song from Rudy.
TF-IDF-1: It's Faith.
TF-IDF-2: Faith by Xinzhe Zhang.
MMI-S2SA: I'd like to know where you're from.
DialogueGAN: UNK.
DPGAN: Yeah!
RankGAN-G: UNK by UNK.
ensGAN-G1: Thanks for your love.
Multi-Seq2Seq: UNK.

Case 3
Query: You're pretty, with or without fringe.
Ground truth: Well, I've got a big face actually.
TF-IDF-1: Hahahaha.
S2SA: Thanks.
MMI-S2SA: Haha, thanks.
DialogueGAN: I think so.
RankGAN-G: Haha, I'm also UNK.
ensGAN-G1: Thank you my friend.
Multi-Seq2Seq: Haha.

Case 4
Query: Who's called hot pepper?
Ground truth: It's a girl we saw previously.
TF-IDF-1: Our favourite fast food.
TF-IDF-2: You don't even know hot pepper?
S2SA: I know you're male.
MMI-S2SA: I know you're male, I'm female.
DialogueGAN: Let me guess who you are.
DPGAN: I mean that I'm actually her.
RankGAN-G: Are you talking about your joke?
ensGAN-G1: I knew it was you!
Multi-Seq2Seq: Yeah, me too.


Table 4. Response generation case study. The final decision of each ranker is marked by its symbol; white and gray cells denote valid and inaccessible candidates for a ranker when combined with its corresponding generative module as an ensemble. The rankers considered are the original conv-RNN ranker, IRGAN-D, RankGAN-D, ensGAN-D, and GBDT.

5.5.3. Case Study

Table 4 shows several example responses generated by the ensembles, together with various baselines. An ensemble clearly makes it possible to select one final response from multiple candidates when a single approach fails to respond correctly, as in the second and third cases, which correspond to generative failure and retrieval failure respectively. We also observe that, as for the generation module, most enhanced seq2seq models are better than S2SA in terms of both language fluency and informativeness. Moreover, the GAN-enhanced seq2seq models are generally better than MMI-S2SA, which generates irrelevant responses like "I know you're male" given the query "Who's called hot pepper?" in the last case. Among the GAN-based generators, DPGAN, RankGAN and ensGAN all achieve similar performance in terms of generation richness, and seem slightly better than DialogueGAN. Besides, while the original GBDT ranker and IRGAN-D mostly prefer the retrieved candidates, RankGAN-D largely favors synthetic responses, in line with their respective GAN objectives. In contrast, ensGAN-D performs more balanced and logical selections between its generation module and pre-retrieval module, demonstrating its ability to leverage the advantages of both retrieval-based and generation-based approaches in dialogue generation scenarios.

6. Conclusion and Future Work

In this paper, we proposed a novel generative adversarial framework that aims at enhancing a conversational retrieval-generation ensemble model by unifying the GAN mechanism for both generative and retrieval approaches. EnsembleGAN enables the two generators to generate responses that receive higher scores from the discriminative ranker, while the discriminator scores down adversarial samples and selects responses featuring the merits of both generators, allowing the generation-based and retrieval-based methods to be mutually enhanced. Experimental results on a large STC dataset demonstrate that our ensembleGAN outperforms other GAN mechanisms on both human and automatic evaluation metrics and is capable of bringing better globally optimal results.

Acknowledgments. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC Nos. 61672058 and 61876196).


  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
  • H. Chen, X. Liu, D. Yin, and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. SIGKDD Explorations 19 (2), pp. 25–35.
  • B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional GAN. In ICCV, pp. 2989–2998.
  • J. L. Fleiss (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5), pp. 378–382.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
  • R. Herbrich (2008) Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers 88.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In NAACL, pp. 110–119.
  • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In EMNLP, pp. 2157–2169.
  • K. Lin, D. Li, X. He, M. Sun, and Z. Zhang (2017) Adversarial ranking for language generation. In NIPS, pp. 3158–3168.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, pp. 2122–2132.
  • L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, and H. Li (2018) Generative adversarial network for abstractive text summarization. In AAAI.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784.
  • L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin (2016) Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. In COLING, pp. 3349–3358.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
  • L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In ACL, pp. 1577–1586.
  • Y. Song, C. Li, M. Zhang, D. Zhao, and R. Yan (2018) An ensemble of retrieval-based and generation-based human-computer conversation systems. In IJCAI, pp. 4382–4388.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063.
  • C. Tao, S. Gao, M. Shang, W. Wu, D. Zhao, and R. Yan (2018a) Get the point of my utterance! Learning towards effective responses with multi-head attention mechanism. In IJCAI, pp. 4418–4424.
  • C. Tao, L. Mou, D. Zhao, and R. Yan (2018b) RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI, pp. 722–729.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019) Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In WSDM, pp. 267–275.
  • A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. J. Crandall, and D. Batra (2018) Diverse beam search for improved description of complex scenes. In AAAI, pp. 7371–7379.
  • C. Wang, F. Jiang, and H. Yang (2017a) A hybrid framework for text modeling with convolutional RNN. In SIGKDD, pp. 2061–2069.
  • H. Wang, Z. Lu, H. Li, and E. Chen (2013) A dataset for research on short-text conversations. In EMNLP, pp. 935–945.
  • J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang (2017b) IRGAN: a minimax game for unifying generative and discriminative information retrieval models. In SIGIR, pp. 515–524.
  • J. Weston, E. Dinan, and A. H. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. CoRR abs/1808.04776.
  • Y. Wu, F. Wei, S. Huang, Z. Li, and M. Zhou (2018a) Response generation by context-aware prototype editing. CoRR abs/1806.07042.
  • Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In ACL, pp. 496–505.
  • Y. Wu, W. Wu, D. Yang, C. Xu, and Z. Li (2018b) Neural response generation with dynamic vocabularies. In AAAI, pp. 5594–5601.
  • C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W. Ma (2017) Topic aware neural response generation. In AAAI, pp. 3351–3357.
  • J. Xu, X. Sun, X. Ren, J. Lin, B. Wei, and W. Li (2018) DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text. In EMNLP, pp. 3940–3949.
  • R. Yan, Y. Song, and H. Wu (2016) Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In SIGIR, pp. 55–64.
  • R. Yan, D. Zhao, and W. E (2017) Joint learning of response ranking and next utterance suggestion in human-computer conversation system. In SIGIR, pp. 685–694.
  • R. Yan and D. Zhao (2018) Coupled context modeling for deep chit-chat: towards conversations between human and computer. In SIGKDD, pp. 2574–2583.
  • Z. Yang, W. Chen, F. Wang, and B. Xu (2018) Improving neural machine translation with conditional sequence generative adversarial nets. In NAACL, pp. 1346–1355.
  • L. Yao, Y. Zhang, Y. Feng, D. Zhao, and R. Yan (2017) Towards implicit content-introducing for generative short-text conversation systems. In EMNLP, pp. 2190–2199.
  • T. Young, E. Cambria, I. Chaturvedi, M. Huang, H. Zhou, and S. Biswas (2018) Augmenting end-to-end dialogue systems with commonsense knowledge. In AAAI, pp. 4970–4977.
  • L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, pp. 654–664.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232.