The terms coherence and cohesion in linguistics are commonly defined as follows (Williams and Colomb, 1995).
Cohesion: sentence pairs fitting together the way two pieces of a jigsaw puzzle do.
Coherence: what all the sentences in a piece of writing add up to, the way all the pieces in a puzzle add up to the picture on the box.
In layman’s terms, cohesion indicates that two consecutive sentences are locally well-connected, and coherence indicates that multiple sentences globally hold together.
Generating cohesive and coherent natural language texts that span multiple sentences is challenging mainly for two reasons. First, there is no principled way of modeling cross-sentence linguistic properties, such as the cohesion and coherence of a text. Second, there is no widely accepted metric for evaluating the quality of generated text in terms of cohesion and coherence.
Most state-of-the-art approaches to natural language generation (NLG) rely on a large amount of human-generated text to train neural language models (Cho et al., 2014; Graves, 2013; Sutskever et al., 2014). Although these models can generate sentences that, judged individually, are similar to human-generated ones, they often fail to capture the local and global dependencies among sentences, resulting in text that is neither coherent nor cohesive. For example, neural language models based on Recurrent Neural Networks (RNNs) are widely applied to response generation for dialogue (Vinyals and Le, 2015; Shang et al., 2015; Sordoni et al., 2015; Li et al., 2015). Although the responses look reasonable by themselves, they are either bland, such as "I don't know", or incoherent with the whole dialogue session. See Gao et al. (2018) for a comprehensive survey.
In this paper, we strive to address this challenge in a principled manner. We propose a pair of discriminators that score whether, and to what extent, a text is coherent or cohesive, respectively. The coherence discriminator measures the compatibility among all sentences in a generated text using sentence-level features, providing a bird's-eye view of the text. The cohesion discriminator, on the other hand, measures the compatibility of each pair of consecutive sentences using only word-level features, providing a worm's-eye view of the text. Given a conditional input text and multiple candidate output texts, each model is trained to score the candidates with respect to its criterion by optimizing a pairwise ranking loss. These scores are then used as reward signals to train an RNN-based language model to generate (more) coherent and cohesive texts.
Our main contributions are three-fold: (1) we propose two linguistic discriminators for measuring coherence and cohesion of a text, respectively; (2) we present a simple yet efficient training mechanism to encode these linguistic properties; and (3) we propose negative-critical sequence training, a variant of policy gradient method, which uses negative samples to construct its reward baseline.
To the best of our knowledge, this paper is the first attempt to explicitly capture cross-sentence linguistic properties, i.e., coherence and cohesion, for long text generation. Despite the encouraging initial results, we have only scratched the surface of the problem; the proposed method has yet to be significantly improved to meet the ultimate goal of generating meaningful and logical long-form texts. We cast text generation as an RL problem and review recent work in Section 2, and detail our approach in Section 3.
2 Related work
A word sequence generation task can be framed as a reinforcement learning (RL) problem, in which the generator acts as a policy π_θ with parameters θ, and each generated word y_t at time t can be viewed as an action chosen by the policy from a large discrete space, or vocabulary, conditioned on the state s_t, which encodes the previously generated text sequence.
Let r(s_t, y_t) denote the reward for a partially generated text sequence at step t. We define the long-term expected reward J(θ) = E_{π_θ}[Σ_t r(s_t, y_t)], where the expectation is also taken over the initial distribution of conditional input texts. Following Sutton et al. (1999), the gradient of J with respect to θ is

∇_θ J(θ) = Σ_s ρ^π(s) Σ_y ∇_θ π_θ(y | s) Q^π(s, y),

where ρ^π is the stationary state distribution and Q^π(s, y) is the expected return from state s after taking action y, both under policy π_θ. For brevity, we omit the derivation. In our work, we formulate text generation as an episodic RL problem with episode length T, with rewards available only at the end of an episode (r_t = 0 for t < T).
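To make this setup concrete, here is a minimal pure-Python sketch of a REINFORCE-style gradient estimate for a one-step softmax policy with a reward available only at the end of the episode; the four-word toy vocabulary and the reward function are invented for illustration and are not the paper's:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(logits, reward, n_samples=1000):
    """Monte-Carlo REINFORCE estimate of d E[R] / d logits for a
    one-step softmax policy; the gradient of log pi(a) w.r.t. the
    logits is onehot(a) - probs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = random.choices(range(len(logits)), weights=probs)[0]
        r = reward(a)  # reward only at the end of the (one-step) episode
        for i in range(len(logits)):
            onehot = 1.0 if i == a else 0.0
            grad[i] += r * (onehot - probs[i]) / n_samples
    return grad

# hypothetical 4-word vocabulary; reward 1 iff word 0 is sampled
g = reinforce_gradient([0.0, 0.0, 0.0, 0.0],
                       lambda a: 1.0 if a == 0 else 0.0)
```

The estimated gradient pushes probability mass toward the rewarded word, which is exactly the behavior a policy-gradient update exploits.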
There is much work on training neural language models using reward signals, such as Ranzato et al. (2015) and Paulus et al. (2017). These works directly optimize for specific metrics, such as BLEU (Papineni et al., 2002) or ROUGE (Lin and Hovy, 2003), using REINFORCE (Williams, 1992; Sutton et al., 1999). However, it is well known that these metrics do not give a complete picture of the quality of the generated text. Only recently have there been efforts to optimize for more relevant quality objectives, such as consistency and repetition (Li et al., 2015, 2016a; Holtzman et al., 2018). But these works use the objective function to re-rank candidate outputs, not to reward or penalize outputs as they are generated in the first place. Li et al. (2016b) constructed a set of reward models, such as information flow and semantic coherence, to tune the generator, yet they do not provide an ablation study quantifying the relative contribution of each reward model.
Another line of research uses Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to incorporate feedback signals for text generation (Yu et al., 2017; Lin et al., 2017; Zhang et al., 2017c; Guo et al., 2017; Fedus et al., 2018; Zhang et al., 2018). However, the discriminators in these works are trained to distinguish real texts from generated ones, operating as black boxes rather than providing fine-grained feedback on particular linguistic aspects of the texts. Yang et al. (2018) partially addressed this issue by using a trained language model as the discriminator. Although that discriminator provides fine-grained feedback at the word level, it does not critique many important linguistic properties of the generated texts, such as cohesion and coherence.
These text generators, when facing a long-form generation task that spans multiple sentences, are by no means perfect and often exhibit critical errors, such as a breakdown of local connections between consecutive sentences (cohesion), let alone a globally consistent intent (coherence). As a result, readers can easily pick up on these cues and distinguish such generated texts from real ones. In this paper we argue that the primary reason is the lack of an effective mechanism for measuring and controlling text quality during generation. The method we propose in the next section is intended to address this problem.
We assume that the global coherence of a text depends to a large degree on how its individual sentences, with their different meanings, are organized. We therefore focus our evaluation of coherence solely on the sentence level. If the meanings of sentences are not organized properly, we have difficulty picking up the intent of the paragraph as a whole, regardless of seamless local connectivity between consecutive sentences.
This is not to say that local connections between two consecutive sentences should be overlooked. One can easily distinguish a model-generated sentence from a real one simply by checking whether it logically follows the preceding sentence, regardless of grammar.
We instill these two distinct yet important concepts into two discriminators, operating at the sentence level and the word level, respectively. Our models closely resemble successful hierarchical models in computer vision, such as StackGAN (Zhang et al., 2017a, b) and PatchGAN (Isola et al., 2017), in that they all provide hierarchical signals to their corresponding generators, where the signals are derived from raw low-level data. We call the sentence-level discriminator the coherence discriminator, and the word-level discriminator the cohesion discriminator.
3.1 Coherence discriminator
This discriminator measures how likely two text chunks form a coherent paragraph. Let the source text chunk, the real target text chunk, and the model-generated target text chunk each consist of multiple sentences. The coherence discriminator is designed to distinguish a real (source, target) pair from a synthetic (source, generated) pair by assigning them different scores, i.e., scoring the real pair higher.
Design. Our design of the coherence discriminator is inspired by the Deep Semantic Similarity Model (DSSM) (Huang et al., 2013; Gao et al., 2014; Xu et al., 2017). Given a source text chunk and a target text chunk, it computes their coherence score in two steps, as illustrated in Figure 1. First, a pair of convolutional networks (CNNs) [1] is applied to encode the source and target chunks into vectors. The coherence score is then computed from the two encodings, with higher values indicating greater coherence (1 being the maximal coherence score). [1: We experimented with deeper networks, but the performance difference was marginal. For simplicity, we use a 1-layer convolutional network architecture (Kim, 2014; Collobert et al., 2011).]
The score measures how likely the source and target chunks add up to a single coherent passage. It depends on the parameters of the two CNNs, in other words, on how the two chunks are encoded. Since we focus solely on sentence-level features, we view a text chunk as a sequence of sentences, and each sentence as a bag of words. We therefore represent each word by its pre-trained word embedding vector (Pennington et al., 2014) and each sentence by the average of its word embedding vectors. A text chunk is then represented as a sequence of sentence vectors, which are fed to the corresponding CNN in Figure 1. The parameters of the two CNNs are optimized so that a real pair scores higher than a synthetic pair.
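As a sketch of this representation (using a hypothetical 3-dimensional toy vocabulary in place of the 300-dimensional GloVe vectors):

```python
def sentence_vector(sentence, embeddings, dim=3):
    """Bag-of-words sentence vector: the average of the word embedding
    vectors. Unknown words fall back to a zero vector here (one simple
    convention, assumed for this sketch)."""
    vecs = [embeddings.get(w, [0.0] * dim) for w in sentence.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def chunk_to_matrix(sentences, embeddings, dim=3):
    """A text chunk becomes a sequence of sentence vectors, which is
    what the convolutional encoder consumes."""
    return [sentence_vector(s, embeddings, dim) for s in sentences]

# toy embeddings, not GloVe
emb = {"the": [1.0, 0.0, 0.0], "room": [0.0, 1.0, 0.0], "was": [0.0, 0.0, 1.0]}
v = sentence_vector("the room was", emb)
```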
Formally, the task of optimizing the discriminator can be stated as follows. Given a set of training samples of the form (source chunk, positive target, negative target), we optimize the discriminator parameters by minimizing a pairwise rank loss over the training data of the form

L = Σ ℓ(score of positive pair, score of negative pair),

where ℓ is a loss function, differentiable w.r.t. the discriminator parameters, that penalizes the negative pair scoring close to or above the positive pair.
In the following subsection, we describe in turn how we construct the pairwise training samples and the form of the loss function. Since the discriminator is used as a pairwise ranker, we employ the metrics commonly used in information retrieval for evaluation, such as recall at K (R@K), defined as the fraction of queries for which the correct item appears in the top-K retrieved list (Baeza-Yates and Ribeiro-Neto, 1999). We present the retrieval results on test data in Table 2.
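Recall at K for a single query can be sketched as follows (the candidate scores are invented for illustration):

```python
def recall_at_k(scores, positive_index, k):
    """R@K for one query: 1.0 if the true target is among the top-k
    scored candidates, else 0.0 (averaged over queries in practice)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if positive_index in ranked[:k] else 0.0

# invented candidate coherence scores for one query
scores = [0.1, 0.9, 0.4, 0.7]
```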
[Table 2: target-sentence retrieval results on the TripAdvisor and Yelp datasets.]
Training mechanism. How do we construct the list of candidate target sentences, given the source sentences? We assume that the target chunk that follows a given source chunk in the data is the positive target sample, i.e., the correct item to retrieve. Negative samples are constructed using three different methods within a batch while iterating through the training data, motivated by Wieting et al. (2016):
The first method is simply to rotate the target chunks while keeping the source chunks fixed within a batch. For a single source, this method yields B − 1 negative samples, where B is the batch size.
The second method is to shuffle the sentence order within each target chunk once, into an order different from the original (a derangement), to break coherence; this yields one negative sample.
Lastly, we combine the previous two methods: rotate target chunks across the batch and shuffle the sentences within each, yielding another B − 1 negative samples.
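The three construction methods can be sketched on a hypothetical toy batch (two-sentence target chunks here; real chunks are longer):

```python
import random

def rotation_negatives(batch_targets, i):
    """Method 1: every other target chunk in the batch serves as a
    negative for source i, giving batch_size - 1 negatives."""
    return [t for j, t in enumerate(batch_targets) if j != i]

def derangement(sentences, rng):
    """Method 2: reshuffle until no sentence keeps its original slot
    (assumes at least two sentences), yielding one negative."""
    order = list(range(len(sentences)))
    while True:
        rng.shuffle(order)
        if all(o != k for k, o in enumerate(order)):
            return [sentences[o] for o in order]

rng = random.Random(0)
batch = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
negs = rotation_negatives(batch, 0)             # method 1
shuffled = derangement(batch[0], rng)           # method 2
combined = [derangement(t, rng) for t in negs]  # method 3: rotate + shuffle
```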
These negative samples, together with a single positive sample, pose a significant challenge in learning. To fit this training task into a ranking framework, we optimize the pairwise rank loss over the positive score and a weighted arithmetic mean of the negative scores,

s̄ = Σ_i w_i D(x, y⁻_i),  with Σ_i w_i = 1,

where D(x, y⁻_i) is the discriminator score of the i-th negative sample and the weights w_i, parametrized by a scalar γ, place more mass on higher-scoring negatives.
In our experiments, we fix γ so that more weight is assigned to more challenging negative samples [2]. Notice that the weighted mean reduces to the plain arithmetic mean at one extreme of γ and approaches the max at the other. Empirically, training the models using the weighted mean resulted in faster convergence than using either the single most challenging negative sample score (the max) or the unweighted mean over all negative samples in the batch. [2: We performed a coarse grid search over the values of γ; our setting resulted in fast convergence to high recall scores on the dev set.]
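One parameterization consistent with these limits is a softmax-weighted mean, sketched below; the paper's exact weighting scheme is not reproduced here, so the exponential weights are an assumption of this sketch:

```python
import math

def weighted_mean(scores, gamma):
    """Weighted arithmetic mean with weights proportional to
    exp(gamma * s): the plain mean at gamma = 0, approaching
    max(scores) as gamma grows."""
    ws = [math.exp(gamma * s) for s in scores]
    z = sum(ws)
    return sum(w * s for w, s in zip(ws, scores)) / z

scores = [0.1, 0.5, 0.9]  # invented negative-sample scores
```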
3.2 Cohesion discriminator
Our second discriminator attends only to low-level features, such as the grammar of each sentence and the logical flow between two consecutive sentences. These two aspects combined, along with other linguistic aspects we do not address here, heavily influence readability.
The cohesion discriminator is similar to the coherence discriminator, except that its architecture, input, and negative-sample construction are modified to encode cohesion between pairs of consecutive sentences at the word level. A single input sample is a pair of consecutive sentences, each represented as the sequence of its word embedding vectors. We construct negative samples using the same three methods used for training the coherence discriminator, except that shuffling occurs at the word level within each sentence, rather than at the sentence level across multiple sentences.
The two pre-trained discriminators are used to modify the text generation behavior of the generator, an attention-based bidirectional sequence-to-sequence model. The generator is initially pre-trained by maximizing the word-level likelihood of the training data; we refer to this pre-trained model as the MLE baseline.
However, the texts generated by this model often do not meet the standards of the two discriminators. We therefore need to change the generator's behavior with respect to these criteria. To this end, the scores from the two discriminators are used as direct reward or penalty signals to modify the parameters of the generator. Given these signals, we use our proposed variant of policy gradient, negative-critical sequence training, to update its parameters and generate (more) coherent and cohesive texts. We discuss the details in the next section.
4 Negative-Critical Sequence Training
Policy gradient methods parameterized by neural networks typically require learning a separate critic network to estimate the expected future reward as a baseline, which in many cases is a difficult task by itself. In NLP, similar practices and challenges have been reported by Ranzato et al. (2015), Bahdanau et al. (2016), and Nguyen et al. (2017). Recently, however, Rennie et al. (2017) proposed an effective self-critical sequence training (SCST) mechanism that avoids learning a separate critic network. Similarly, our method does not require learning a separate critic network; instead, we directly use the scores that the discriminators assign to negative samples as the baseline.
For an arbitrary pair of a source chunk and a generated output conditioned on it, we compute the coherence and cohesion scores by calling the two discriminators, respectively. Since each review consists of multiple sentences, the overall cohesion score is computed as the average of the scores of all consecutive sentence pairs. These scalar scores, however, have no absolute interpretation, since the discriminators are trained by optimizing a margin ranking loss. Instead, the differences between positive sample scores and the maximal or average negative sample scores provide insight into how well the models distinguish positives from negatives. Therefore, these margins can be treated as rewards with baselines, and we define the reward functions as

r(source, generated) = D(source, generated) − (1/N) Σ_i D(source, negative_i),

for each discriminator D, where negative_i denotes a negative sample for the given source condition and the baseline is computed by averaging over an ensemble of N negative samples. Notice that this reward resembles the ranking loss we use to train our discriminators, except that the baseline is an average score (instead of the weighted arithmetic mean) over negative samples. The rationale for this difference is that the maximal or weighted arithmetic mean score baseline is too noisy to be used as a reward, because the best randomly constructed negative sample may happen to be a genuinely good one. To reduce this noise, we use the average discriminator score of the negative samples as the baseline, which turned out to be the empirically better alternative.
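A minimal sketch of the negative-critical reward follows; the word-overlap scorer stands in for a trained discriminator and is purely illustrative:

```python
def negative_critical_reward(score_fn, source, generated, negatives):
    """Reward = discriminator score of the generated text minus the
    average score of randomly constructed negatives for the same
    source; no learned critic network is required."""
    baseline = sum(score_fn(source, n) for n in negatives) / len(negatives)
    return score_fn(source, generated) - baseline

def toy_score(src, tgt):
    # stand-in scorer: word overlap with the source (NOT the paper's model)
    return float(len(set(src.split()) & set(tgt.split())))

r = negative_critical_reward(toy_score, "the hotel was clean",
                             "the hotel room was clean",
                             ["pizza was cold", "we ran out of time"])
```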
5 Experiments

In this section, we show the results of training both discriminators, and compare our RL-tuned generators with the baseline MLE-trained model. We argue that, through the use of feedback from our simple discriminators, the quality of the generated text improves significantly. See Table 1 for a comparison.
Dataset. We use the publicly available TripAdvisor hotel review dataset collected by Wang et al. (2010) and the Yelp review dataset (https://www.yelp.com/dataset). We consider only the subsets of the two review datasets satisfying two conditions: a review must (1) have at least 10 sentences, and (2) each sentence must have more than 5 and fewer than 30 words. This yields roughly 60,000 TripAdvisor reviews and 220,000 Yelp reviews, split into train/dev/test sets. We merge the source and target vocabularies and limit them to the 50,000 most frequent words, excluding special tokens. For each of these reviews, as in Holtzman et al. (2018), we consider the first five sentences as the source input to the generator, and the following five sentences as the target output.
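The two filtering conditions and the source/target split can be sketched as follows (the sample review text is invented):

```python
def keep_review(sentences):
    """Keep a review only if it has at least 10 sentences and every
    sentence has more than 5 and fewer than 30 words."""
    if len(sentences) < 10:
        return False
    return all(5 < len(s.split()) < 30 for s in sentences)

def split_source_target(sentences):
    """First five sentences condition the generator; the next five
    are the target output."""
    return sentences[:5], sentences[5:10]

review = ["sentence %d about the hotel stay overall" % i for i in range(10)]
```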
Evaluation metrics. It is widely acknowledged that there is no single accurate metric for evaluating a text generator. Nevertheless, we report scores on standard metrics: negative log-likelihood (NLL), perplexity (PPL), BLEU, and the proportion of unique n-grams within a single generation (intra-unique-n) and across generations (inter-unique-n), as in Gu et al. (2018). Results are shown in Table 3.
5.1 Implementation details
The generator takes individual words as inputs and embeds them into pre-trained 300-dimensional GloVe word vectors (Pennington et al., 2014); this embedding layer is fixed throughout training. The generator uses a gated recurrent unit with two layers and a hidden size of 1024 for both the bidirectional encoder and the attention-based decoder. During optimization with Adam (Kingma and Ba, 2014), we use a fixed learning rate and clip the gradient's L2-norm to 1.0. We initially train the generator by maximizing the word-level likelihood (MLE) of the training data for 60 epochs on the TripAdvisor data and 30 epochs on the Yelp data. These are our baseline models, against which we empirically demonstrate the value of our hierarchical discriminators.
The coherence discriminator also uses the pre-trained GloVe word vectors, which are fixed (the vector dimension can differ from the generator's; the differences were marginal for sizes 50, 100, and 300, and for the results in this paper we use 300). The source processing network and the target processing network have the same structure but different parameters. The convolutional layer has filters of sizes 2, 3, 4, and 5, each with 512 filters. Each convolution filter is followed by a nonlinear activation. We use an Adam optimizer. The cohesion discriminator is the same, except that it has convolutional filters of sizes 3, 4, 5, and 6. We train both discriminators for 50 epochs and choose the models with the best R@1 validation scores.
In the tuning stage, we use negative-critical sequence training as explained in Section 4 for up to 5 epochs. We also continue with supervised learning to limit the policy search to a grammatically correct space, similar to Paulus et al. (2017); Wu et al. (2016); Lewis et al. (2017). In practice, sequence-level rewards are only available upon a completed generation, so they are sparse signals for the generator. Typically, sparse end-of-sequence rewards entail noisy training, yet we want the learning to generalize to the test data. We observed that, for our particular task, most of the noise was caused by exploration, and the learning did generalize to the test data. Thus, reward shaping was unnecessary, unlike in previous works (Li et al., 2017; Yang et al., 2018) that provided additional signals for partially generated sequences.
5.2 Sanity check on the discriminators
Most of the reviews written by hotel guests are considered coherent. Suppose we randomly select a negative sample from a pool of other continuations in the data. Even a layman in linguistics can effortlessly discern whether the continuation repeats similar ideas in different wording, whether it supports or contradicts the source, and whether it reads as a natural continuation. To show that our coherence discriminator does these jobs, and likewise for the cohesion discriminator, we show some randomly selected positive and negative samples and their assigned margin scores in Table 4.
[Table 4 (excerpt): sample sentences and sentence pairs with assigned margin scores.]

| this hotel was unbelievably overpriced . | 0.0002 |
| it wasn t clear when booking that we would have to share a bathroom . | 0.0084 |
| there was one shower for the whole floor which was tiny and unclean . | 0.0054 |
| the room was old and lacking in facilities . | |
| the beds were very uncomfortable and the linen was very old . | 0.0768 |
| breakfast was ok , but the staff were incompetent . | 0.0591 |
| on our last day they were too lazy to clean our table and never bothered taking our order . | -0.0097 |
| we had to leave having had no breakfast , as we ran out of time . | 0.0457 |
| they saw us get up and leave and didn t even apologise for the appalling lack of service . | +0.3735 |
| the staff recommended great restaurants with very reasonable prices within walking distance . | 0.0514 |
| the paris hop on bus stops nearby . | 0.0798 |
| the gare l est is within 3 blocks . | -0.0156 |
| the rooms are clean and bathrooms ensuite . | -0.2001 |

More examples of cohesion:

| once you get there you are greeted by the staff . | |
| they explain everything to you , and in english , not the best , but good enough . | 0.1004 |
| the coffee was even good for a coffee snob like myself . | |
| the hotel is much smaller than i thought and only has six floors . | -0.1103 |
| the only negative was the curtain in the bathroom . | |
| the beer at the lobby bar was stale . | |
| there are many friendly cats on the grounds . | -0.0830 |
We first comment on the text-to-text retrieval results in Table 2. Compared with image-to-caption retrieval tasks, the numbers are lower. One plausible reason is that an image is rich in semantics: an image is worth a thousand words. In contrast, a few sentences are limited in their capacity to convey a message, in addition to being bound by grammar and readability constraints. For this reason, the discriminator models face a more difficult challenge in identifying the correct query (source sentences) and item (target sentences) pairs.
Furthermore, we note that our methods for constructing negative samples, despite their simplicity, are not thorough. For example, a randomly selected next sentence, given an arbitrary sentence, may actually be a valid continuation. In this work, given an unlabelled dataset, we ignore this problem, which may not be negligible and may explain the lower retrieval performance. Despite the potential drawbacks of our methodology, we have shown significant improvements even with imperfect discriminators. After experimenting with different architectures and hyper-parameters, we conclude that the reported numbers are reasonable for the task.
We do note that the results will improve with more data: our discriminators, as well as the generator, would be better trained by seeing more samples. We consider the dataset rather small because each pre-processing condition is quite restrictive. However, our goal is to demonstrate the efficacy of our discriminator models, rather than to show good results arising from a large amount of data.
6.1 What do the two discriminators capture?
Ideally, given how the coherence discriminator is constructed (it processes the source and target sentences independently), we would want it to detect whether the target sentences support the message delivered in the source sentences. Cues that signal a lack of such support include repetitive, contradicting, irrelevant, or sentiment-inconsistent statements without appropriate transition phrases. Although the discriminator only computes scores and does not point to specific cues, we observed that it learns to pick up on the ideas that govern the coherence of a multi-sentence paragraph. Despite some randomness in the scores of randomly constructed negative samples, its critique margin scores are fairly consistent, based on a collection of margin scores on hotel reviews. More detailed incoherent aspects that it learns to distinguish include parts of a review commenting on different countries, cities, or locations, or listing prices in different currencies. This is a positive result of training on negative samples randomly selected within the same batch.
The role of the cohesion discriminator is similar, except that it operates between consecutive sentences. We observed that it captures some low-level logical connections. For example, it scores favorably items that are commonly mentioned together in positive samples; this is again a result of constructing negative samples by shuffling and rotating target sentences within the same batch. As a consequence, items that rarely appear together in consecutive sentences yield low cohesion scores.
However, we note that low cohesion scores among consecutive sentences do not necessarily imply bad writing. Although one should avoid consistently low-cohesion writing, a writer may simply enumerate seemingly disparate aspects in separate sentences, which implies a weak connection between immediately neighboring sentences. Therefore, we do not optimize solely for the cohesion criterion.
6.2 Potential improvements in our approach
We would also like to comment on our imperfect discriminators. When we tuned the generator for up to as many epochs as in the pre-training stage, we noticed that it learns a policy that maximizes the discriminators' margin scores yet diverges from what we consider good writing. Although this problem can be partially overcome by training on a larger dataset, we determined that these two discriminators do not give a comprehensive critique, and that there exist other equally, if not more, important linguistic properties that we did not address. We hope to extend our hands-free model framework to encode these features and provide richer signals for the generator to improve upon.
While this work is a first attempt at reinforcing linguistic properties such as coherence and cohesion, and we believe this is an important research direction, we consider our results preliminary; many of our experimental figures, such as the recall scores, leave plenty of room for improvement. We acknowledge that the structure of the two discriminators can be argued either way: whether to process the source and target sentence(s) independently with different parameters, or together. Nevertheless, we are convinced that our architectures for both discriminators generalize well to unseen texts. Due to limited space, we plan to provide online a collection of examples with corresponding margin rewards, and to release all of our resources to promote this line of research.
In this paper we propose to encode essential linguistic properties, coherence and cohesion, with a simple neural network architecture, and quantify them using negative-critical margin scores. The coherence discriminator provides a bird’s-eye view on coherence. It assesses how likely two text chunks form a coherent paragraph, using sentence-level features. On the other hand, the cohesion discriminator provides a worm’s-eye view on cohesion. It assesses how cohesive two consecutive sentences are using word-level features.
The scores computed by these discriminators are used as reward signals for training neural language models via policy gradient. Empirical results on two long-form text generation tasks show that our method outperforms a strong baseline, an attention-based bidirectional MLE-trained sequence-to-sequence model, on a number of automatic metrics.
Future work will focus on casting the long-form text generation task in the GAN framework. In this framework, the coherence and cohesion discriminators are updated adversarially against model-generated texts and, in turn, provide signals for learning the neural language model. This work extends GANs in that we use multiple discriminators, similar to Durugkar et al. (2016), but each of our discriminators reinforces a distinct linguistic behavior in the generator.
References
- Baeza-Yates and Ribeiro-Neto (1999) Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern information retrieval, volume 463. ACM Press Books, 1999.
- Bahdanau et al. (2016) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
- Barto et al. (1983) Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, SMC-13(5):834–846, 1983.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, 2014.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November 2011.
- Durugkar et al. (2016) Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. CoRR, abs/1611.01673, 2016.
- Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew Dai. MaskGAN: Better text generation via filling in the ____. In ICLR, 2018.
- Gao et al. (2014) Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, and Li Deng. Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2–13, 2014.
- Gao et al. (2018) Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
- Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Gu et al. (2018) Xiaodong Gu, Kyunghyun Cho, JungWoo Ha, and Sunghun Kim. DialogWAE: Multimodal response generation with conditional wasserstein auto-encoder. CoRR, abs/1805.12352, 2018.
- Guo et al. (2017) Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624, 2017.
- Holtzman et al. (2018) Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. In Proceedings of the Association for Computational Linguistics, 2018.
- Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
- Kim (2014) Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Lewis et al. (2017) Mike Lewis, Denis Yarats, Yann N Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.
- Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
- Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016a.
- Li et al. (2016b) Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016b.
- Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
- Lin and Hovy (2003) Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 71–78, Stroudsburg, PA, USA, 2003.
- Lin et al. (2017) Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165, 2017.
- Nguyen et al. (2017) Khanh Nguyen, Hal Daumé, and Jordan L. Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. In EMNLP, 2017.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, 2002.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304, 2017.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.
- Rennie et al. (2017) Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1195, 2017.
- Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364, 2015.
- Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, May 2015.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
- Sutton et al. (1999) Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063. MIT Press, 1999.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. A neural conversational model. In ICML Deep Learning Workshop, 2015.
- Wang et al. (2010) Hongning Wang, Yue Lu, and ChengXiang Zhai. Latent aspect rating analysis on review text data: a rating regression approach. In KDD, 2010.
- Wieting et al. (2016) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. ICLR, 2016.
- Williams and Colomb (1995) J.M. Williams and G.G. Colomb. Style: Toward Clarity and Grace. Chicago guides to writing, editing, and publishing. University of Chicago Press, 1995.
- Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256, May 1992.
- Witten (1977) Ian H Witten. An adaptive optimal controller for discrete-time markov environments. Information and control, 34(4):286–295, 1977.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Xu et al. (2017) Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint, 2017.
- Yang et al. (2018) Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and Taylor Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. arXiv preprint arXiv:1805.11749, 2018.
- Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
- Zhang et al. (2017a) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017a.
- Zhang et al. (2017b) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. CoRR, abs/1710.10916, 2017b.
- Zhang et al. (2017c) Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In NIPS, 2017c.
- Zhang et al. (2018) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. Generating informative and diverse conversational responses via adversarial information maximization. In NIPS, 2018.