To control output length, Kikuchi et al. (2016) first proposed two learning-based models for the neural encoder-decoder, named LenInit and LenEmb. We observe that when two models have the same or similar structures, the evaluation score of the model with more precise length control is usually lower than that of the one with weaker length control. In other words, worse length-control (LC) capacity yields better output quality. For instance, LenEmb generates sequences with more accurate lengths, but its evaluation scores are lower than LenInit's. In most situations, as long as the sentence length falls in an adequate range, i.e. the length constraint is satisfied, people prefer to focus on the semantic accuracy of the produced sentence; in that case, LenInit is the more appropriate choice. It therefore makes sense to study how to control the trade-off between LC capacity and sentence quality, which we call controllable length control (CLC).
To control this trade-off, we turn to Reinforcement Learning (RL) [Sutton and Barto2018]. In neural language generation, RL is commonly used to overcome two issues: exposure bias [Ranzato et al.2015] and the inconsistency between the training objective and the evaluation metrics, and great efforts have been devoted to solving these two problems [Ranzato et al.2015, Rennie et al.2017, Paulus, Xiong, and Socher2018]. In addition, RL brings two further benefits for LC neural language generation. First, most datasets provide only one reference summary per sentence pair, so under maximum likelihood (ML) training the model can only learn a fixed-length summary for each source document. With RL, we can instead feed various lengths as input and sample sentences for training, which makes the model more robust at generating sentences for different desired lengths. Second, length information can easily be incorporated into the reward design in RL to induce different levels of LC capacity; in this way, CLC can be achieved.
Normally, RL for sequence generation operates on ML-trained models; however, we find that directly applying an RL algorithm to pre-trained models dramatically degrades LC capacity. In this paper, we design two RL methods for LC neural text generation: MTS-RL and SCD-RL. By adjusting the rewards in RL according to output score and length, MTS-RL and SCD-RL improve summarization performance while keeping the LC capacity under control. Furthermore, we can modify previous models to improve the score by leveraging the trade-off. An intuitive approach is to add a "regulator" between the length input and the decoder to suppress or enhance the transmission of the length information. Guided by this idea, two models named LenLInit and LenMC are proposed. These two LC models significantly improve the evaluation score at a small cost to their ability to control length, under both ML and RL. The major contributions of our paper are four-fold:
- To the best of our knowledge, this is the first work applying reinforcement learning to length-control neural abstractive summarization, and we present the concept of CLC.
- Two RL methods are developed that successfully control the LC capacity and significantly improve the scores. Meanwhile, we find that RL for LC text generation alleviates the scarcity and imbalance of ground-truth references across different lengths.
- Two models named LenLInit and LenMC are proposed based on previous neural LC models [Kikuchi et al.2016].
- Extensive experiments verify that the proposed models with the devised RL algorithms cover a wide range of LC ability and smoothly achieve CLC on the Gigaword summarization dataset.
Abstractive Text Summarization
A growing body of work builds on the encoder-decoder framework [Rush, Chopra, and Weston2015, Nallapati et al.2016]. DRGD, designed by Li et al. (2017), is a seq2seq-oriented model equipped with a deep recurrent generative decoder. See, Liu, and Manning (2017) proposed a hybrid pointer-generator network that uses a pointer to copy words from articles while producing words with a generator. Cao et al. (2018) used OpenIE and a dependency parser to extract fact descriptions from the source text, then adopted a dual-attention model to enforce the faithfulness of outputs. Yang et al. (2019) explored a human-like reading strategy for abstractive summarization and leveraged it by training the model in a multi-task learning setup.
Length-Control Neural Encoder-Decoder
Kikuchi et al. (2016) first proposed two learning-based neural encoder-decoder models to control sequence length, named LenInit and LenEmb. LenEmb mixes the decoder inputs with the remaining length, embedded at each time step, while LenInit initializes the memory cell state of the LSTM decoder with the whole length information. Before that, sentence length was controlled by ignoring "EOS" at certain times or by truncating the output sentence. Fan, Grangier, and Auli (2018) treated the lengths of ground-truth summaries in different ranges as independent properties and encoded them as discrete marks in an embedding unit. Liu, Luo, and Zhu (2018) presented a convolutional neural network (CNN) encoder-decoder in which the inputs and the length information are processed by a CNN before entering the decoder unit. Generally, length control models in the neural encoder-decoder setting can be divided into two types: Whole Length Infusing (WLI) models and Remaining Length Infusing (RLI) models. A WLI model informs the decoder of the entire length of the target sentence, while an RLI model tells it the remaining length at each time step. LenInit [Kikuchi et al.2016], Fan [Fan, Grangier, and Auli2018] and LCCNN [Liu, Luo, and Zhu2018] are all WLI models, while LenEmb [Kikuchi et al.2016] is a typical RLI model. Ordinarily, RLI models have better length control capacity but produce poorer sentence quality compared with WLI models. We follow Kikuchi et al. (2016) in defining the length of a sentence at the character level, which is more challenging than the word-level setting of Liu, Luo, and Zhu (2018).
Reinforcement Learning in NLG
There have been several successful attempts to integrate the encoder-decoder framework with RL for neural language generation. Ranzato et al. (2015) applied an RL algorithm to directly optimize the non-differentiable evaluation metric, which raises scores substantially. Rennie et al. (2017) modified the RL algorithm by replacing the critic model with inference results to produce baselines; this simple modification brought significant improvements in image captioning. Yu et al. (2017) rewarded Monte-Carlo-sampled sentences with an adversarially trained discriminator. Paulus, Xiong, and Socher (2018) employed intra-temporal attention and combined supervised word prediction with RL to generate more readable summaries. Liu et al. (2018) designed an adversarial process for abstractive text summarization. Chen and Bansal (2018) first selected the salient sentences and then rewrote the summary, connecting the non-differentiable computation via policy gradient. However, none of the above work involves or explores length control in RL.
The dataset for text summarization contains pairs of an input source sequence x = (x_1, ..., x_N) and a corresponding ground-truth summary y = (y_1, ..., y_M), where N and M are the lengths of the input article and the reference, respectively. The goal of summarization is to learn a transform from x to y using a parameterized policy p_θ; this can be formalized as maximizing the conditional probability p_θ(y|x) = ∏_{t=1}^{M} p_θ(y_t | y_{<t}, x), where y_{<t} = (y_1, ..., y_{t-1}).
Encoder-Decoder Attention Model
Encoder-decoder with an attention mechanism [Bahdanau, Cho, and Bengio2014] is selected as the basic framework in this work. The RNN encoder sequentially takes each word embedding of the input sentence, and the final hidden state of the encoder, which carries the whole information of the source sentence, is fed into the decoder as its initial state. We select a bi-directional Long Short-Term Memory (BiLSTM) [Hochreiter and Schmidhuber1997] as the encoder to read the source sequence. Here we denote h_t^f as the hidden state of the BiLSTM encoder in the forward direction at time step t and h_t^b as that in the backward direction; c_t^f and c_t^b are the corresponding memory cell states of the BiLSTM encoder. The outputs of the encoder at time t are concatenated as h_t = [h_t^f; h_t^b], the vector used for attention, where [·;·] denotes concatenation.
The decoder unrolls the output summary from its initial hidden state by predicting one word at a time. Neglecting length control, the initial states of the decoder are set as s_0 = h_N and c_0 = [c_N^f; c_N^b], and the hidden state s_t is calculated by the LSTM recurrence: s_t = LSTM(s_{t-1}, e(y_{t-1})), where e(y_{t-1}) is the embedding of the previously generated word.
A context vector d_t is used to measure which parts of the source words the decoder attends to at time t: e_{ti} = v^T tanh(W_s s_t + W_h h_i), α_{ti} = exp(e_{ti}) / Σ_j exp(e_{tj}), d_t = Σ_i α_{ti} h_i. Then we concatenate d_t with the hidden state s_t to predict the next word: p(y_t | y_{<t}, x) = softmax(W_o [s_t; d_t]).
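The attention computation above can be sketched with NumPy as follows; the function and weight names (`W_s`, `W_h`, `v`) are our own illustrative choices, and the toy dimensions are for the demo only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_t, H, W_s, W_h, v):
    """One additive (Bahdanau-style) attention step: score every encoder
    output h_i against the decoder state s_t, normalize the scores, and
    form the context vector as the weighted sum of encoder outputs."""
    scores = np.array([v @ np.tanh(W_s @ s_t + W_h @ h_i) for h_i in H])
    alpha = softmax(scores)   # attention weights over source positions
    ctx = alpha @ H           # context vector, concatenated with s_t downstream
    return ctx, alpha

rng = np.random.default_rng(0)
dim, src_len = 4, 6
H = rng.normal(size=(src_len, dim))    # encoder outputs h_1..h_6
s_t = rng.normal(size=dim)             # current decoder state
W_s, W_h = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
v = rng.normal(size=dim)
ctx, alpha = attention_step(s_t, H, W_s, W_h, v)
```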
Length Control Models
To control the length of the output, we need to feed the desired length information into the decoder; hence, the training objective in supervised ML with "teacher forcing" [Williams and Zipser1989] becomes: maximize Σ_{t=1}^{M} log p_θ(y_t | y_{<t}, x, l_t). Here, l_t denotes the length information the decoder perceives at time t. As introduced before, LC models fall into two groups. For an RLI model, the remaining length is updated at each time step by l_t = l_{t-1} − len(y_{t-1}), with l_1 set to the desired length l. In a WLI model, the decoder is only aware of the whole length of the sentence, so we set all l_t to l.
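The update rule for l_t can be sketched as follows (function and variable names are ours):

```python
def length_info(prev_l, prev_word, mode, total_l):
    """Length information l_t the decoder perceives at step t:
    an RLI model counts down the remaining characters (spaces are not
    counted, matching the character-level setting used here), while a
    WLI model always sees the whole desired length."""
    if mode == "WLI":
        return total_l
    return prev_l - len(prev_word)  # RLI: subtract chars of the previous word

# RLI countdown for a 24-character target
l = 24
for w in ["arsenal", "fears", "henry"]:
    l = length_info(l, w, "RLI", 24)
```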
In this section, we introduce four models: LenInit, LenEmb, LenLInit and LenMC. The first two are proposed by [Kikuchi et al.2016]; we modify them to obtain the remaining two.
This WLI model uses the memory cell to control the output length by rewriting the initial state as: c_0 = l · b, where l is the entire desired length of the output sentence and b is a learnable vector.
This model can be viewed as a variant of LenInit. To produce higher scores by leveraging the LC capacity, we simply add a linear transformation W of the length information; the model is thus named Length Linear Initialization (LenLInit). Unlike in LenInit, b is replaced by b̃, a Gaussian-sampled non-trainable vector, and the initial memory cell state of the decoder is: c_0 = W(l · b̃).
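The two initializations can be sketched in NumPy as below; the exact form c_0 = W(l · b̃) for LenLInit is our reading of the description above, and the toy dimension stands in for the 512 used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy size; 512 in the experiments

b = rng.normal(size=dim)         # LenInit: learnable vector
b_tilde = rng.normal(size=dim)   # LenLInit: fixed Gaussian-sampled vector
W = rng.normal(size=(dim, dim))  # LenLInit: learnable linear "regulator"

def leninit_c0(l):
    """LenInit: the initial memory cell carries the whole desired length l."""
    return l * b

def lenlinit_c0(l):
    """LenLInit: a linear map W sits between the length signal and the cell
    state, letting training suppress or enhance the length information."""
    return W @ (l * b_tilde)
```

Because the map is linear in l, doubling the desired length exactly doubles the initialization; the trainable W is what allows the model to attenuate this signal.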
For this RLI model, an embedding matrix transforms l_t into a vector e(l_t), where the number of possible length types is L; e(l_t) is then concatenated with the word embedding as an additional input for the LSTM decoder: s_t = LSTM(s_{t-1}, [e(y_{t-1}); e(l_t)]).
Unlike LenEmb, where the length information is concatenated as an additional input, here we infuse l_t into the memory cell at each time step in the same way as LenLInit; we name this RLI model LenMC.
Length Control Reinforcement Learning
Models trained by maximum likelihood estimation with "teacher forcing" suffer from "exposure bias" [Ranzato et al.2015]. Moreover, training minimizes the cross-entropy loss, while at test time results are evaluated with language metrics. One way to resolve these conflicts is to learn a policy that directly maximizes the evaluation metric instead of the maximum-likelihood objective, for which RL is a natural choice.
From the perspective of RL for sequence generation, our LC models can be viewed as an agent, the parameters θ of the network form a policy p_θ, and predicting a word at each step is an action. After generating a complete sentence, the agent receives a reward r computed by the evaluation metrics. During training, the decoder produces two types of output: y^g with greedy search, and y^s in which each word y_t^s is sampled from the probability distribution p_θ(y_t | y_{<t}, x, l_t) at time t. We assign a random number within an appropriate range as the target summary length l for each article and feed it into the LC model to sample a sentence y^s; the reward r(y^s) is then evaluated between the ground-truth summary y and the sampled sentence y^s. We apply self-critical sequence training (SCST) [Rennie et al.2017] as our RL backbone, whose training objective is: L_θ = −(r(y^s) − r(y^g)) Σ_t log p_θ(y_t^s | y_{<t}^s, x, l_t). This reveals that the goal of policy-gradient RL in sequence generation is equivalent to increasing the probability of generating high-score sentences.
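The self-critical objective can be sketched for a single sampled summary as follows (a minimal sketch; names are ours):

```python
import math

def scst_loss(log_probs_sampled, r_sampled, r_greedy):
    """Self-critical loss for one sampled summary: the greedy-decoded
    reward serves as the baseline, so a sample scoring above the greedy
    output gets its token log-probabilities pushed up, and one scoring
    below gets them pushed down."""
    advantage = r_sampled - r_greedy
    return -advantage * sum(log_probs_sampled)

# a sampled sentence that beats the greedy one yields a positive loss,
# which gradient descent reduces by raising the sampled tokens' probability
loss = scst_loss([math.log(0.5), math.log(0.4)], r_sampled=0.9, r_greedy=0.6)
```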
We encounter two additional problems in LC summarization. The first is that LC models are designed to generate summaries of different lengths, but existing datasets provide only one or a few ground-truth references per article; worse still, the numbers of references at different lengths are severely unbalanced (see Figure 2). Consequently, models trained under ML on such a dataset tend to perform well only at particular lengths. By sampling sequences with randomly assigned lengths in reinforcement training, sentences of uniformly distributed lengths serve as additional summaries to be judged by the RL system, which alleviates this issue.
The second problem is that directly applying SCST to LC models seriously diminishes the LC capacity. Some of the sampled sentences deviate in length; enlarging the generation probability of these sentences corrupts the LC capacity, which in turn drives the model to generate even more length-deviating sentences, a vicious cycle that ends in LC-capacity collapse. To save the model from this length-control collapse in RL, an intuitive idea is to adjust the reward according to the output length, especially for sentences with high scores but mismatched lengths. We therefore propose two training approaches for length-control RL: Manually Threshold Select (MTS) and Self-Critical Dropout (SCD). Both algorithms expose a hyper-parameter that trades LC capacity against sentence quality in either direction, i.e., they accomplish CLC.
Manually Threshold Select
As a starting point, semantic accuracy remains the most critical indicator to be guaranteed. A sentence with a low score should have its generation probability reduced during training even if it has the expected length. Among sentences with high scores, those with the expected length should naturally retain their reward; thus, we only need to deal with the remaining high-scoring sentences of unqualified length.
Suppose the desired length for the sampled sentence is l and the length of the output sequence is l'. The length prediction error is the difference between the two: d = |l − l'|. We manually choose an error threshold ε and eliminate the reward of a sentence when d exceeds ε, so that such samples contribute no gradient. The LC capacity can be adjusted by setting different ε: a larger ε yields better evaluation scores, while a smaller ε gives better length control.
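One way to realize the MTS rule, following our reading of the description above (the exact reward adjustment is not reproduced here):

```python
def mts_advantage(r_sampled, r_greedy, d, eps):
    """Manually Threshold Select: a high-scoring sample whose length
    error d exceeds the threshold eps is ignored (zero advantage, hence
    no gradient); low-scoring samples keep their negative advantage even
    when their length is correct, so bad summaries are still suppressed."""
    adv = r_sampled - r_greedy
    if adv > 0 and d > eps:
        return 0.0
    return adv
```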
Two drawbacks arise in MTS-RL. First, sentences exceeding the limit are completely ignored even when they reach high evaluation scores and d is only slightly larger than ε. Second, ε can only take discrete values, which makes it hard to control models with precise length control such as LenMC. Inspired by SCST [Rennie et al.2017], which approximates the baseline from the current training model, we propose the Self-Critical Dropout RL approach. In each iteration, a batch of B sampled outputs is obtained, where y^s_i is the sentence sampled with desired length l_i. The mean of the length errors d_i is approximated by: d̄ = (1/B) Σ_{i=1}^{B} d_i.
We take d̄ as the threshold. Unlike the previous method, which removes the rewards of all sentences with d larger than the threshold, we keep their rewards with a probability p that is larger the closer d is to d̄, scaled by a hyper-parameter λ. λ reflects the degree of the length constraint on the output sequence and therefore controls the LC capacity: a larger λ forces the model to generate sentences with more accurate lengths, while a smaller λ imposes weaker length control and thus improves the scores.
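A sketch of SCD under stated assumptions: the exponential keep-probability below is an illustrative assumption (any function decreasing in d − d̄ and sharpened by λ fits the description), not the exact form used in the experiments:

```python
import math
import random

def mean_length_error(errors):
    """Self-critical baseline for length: the batch mean of the errors d_i."""
    return sum(errors) / len(errors)

def scd_keep_prob(d, d_bar, lam):
    """Assumed keep-probability: rewards are always kept when d <= d_bar;
    beyond d_bar they survive with a probability that decays as d grows,
    scaled by the hyper-parameter lam."""
    if d <= d_bar:
        return 1.0
    return math.exp(-lam * (d - d_bar))

def scd_advantage(adv, d, d_bar, lam, rng=random):
    """Drop the positive advantage of a length-deviating sample with
    probability 1 - keep_prob; negative advantages always pass through."""
    if adv > 0 and rng.random() > scd_keep_prob(d, d_bar, lam):
        return 0.0
    return adv
```

With this form, raising `lam` shrinks the keep-probability for any fixed deviation, which matches the stated behavior that a larger λ enforces more accurate lengths.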
|Source article||arsenal chairman peter hill-wood revealed thursday that he fears french striker thierry henry will leave highbury at the end of the season .|
|Reference summary||arsenal boss fears losing henry|
|model||desired length||sampled summary (true length)|
|LenLInit||25||arsenal chief quits to leave (24)|
|||45||gunners chief fears french striker will leave the end (45)|
|||65||gunners chief says he will leave as he fears french striker will leave (58)|
|LenInit||25||arsenal fears henry henry (22)|
|||45||arsenal fears french striker henry will leave arsenal (46)|
|||65||arsenal fears french striker henry will leave arsenal says arsenal chairman (65)|
|LenMC||25||arsenal fear henry will leave (25)|
|||45||arsenal 's arsenal worried about henry 's return home (45)|
|||65||arsenal 's arsenal worried about french striker henry will leave wednesday (64)|
|LenEmb||25||arsenal 's henry to quit again (25)|
|||45||arsenal chairman fears henry 's fate of henry 's boots (45)|
|||65||arsenal chairman fears french striker henry says he 's will leave retirement (65)|
The experiments are divided into two parts. We first run basic experiments under ML to observe the accuracy gap between the LC models and other summarization baselines; these trained models also serve as the initial state for RL. We then conduct a further comparison of the LC models under different RL methods; this is the part we focus on, and we perform extensive experiments to demonstrate the effectiveness of controllable length control under the designed RL.
The Gigaword dataset is selected for our experiments. The corpus consists of pairs of collected news articles and their corresponding headlines [Napoles, Gormley, and Van Durme2012]. We use the standard train/valid/test splits following [Rush, Chopra, and Weston2015], which are pruned to improve data quality. The processed dataset contains nearly 3.8 million sentence pairs for training, with one summary each. In the ML experiments, to compare with other summarization models under a unified standard, we train on the entire dataset; results are reported on the standard Gigaword test set, which contains 1951 instances and which we call "test-1951". For the RL experiments, we shrink the training set by sampling 600K pairs from it, and the validation/test sets are rebuilt following Song, Zhao, and Liu (2018): two non-overlapping sets, "valid-10K" and "test-4K", are sampled from the standard validation set for model selection and result evaluation, respectively.
Note that the scores on "test-4K" are much higher than those on "test-1951"; this is because in the standard test set, words in the summary sentences occur less frequently in the source texts, which makes word prediction during decoding harder. We build a dictionary containing the 50000 most frequent words; all other words are replaced by an "unk" tag.
Following other summarization work, we evaluate the quality of generated sentences by the F-1 scores of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L) [Lin2004].
To measure LC capacity, Liu, Luo, and Zhu (2018) use the variance of the summary lengths l'_i against the target lengths l_i. In this paper, we use the square root of this variance (svar): svar = sqrt((1/n) Σ_{i=1}^{n} (l'_i − l_i)^2).
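The svar metric is straightforward to compute (a minimal sketch):

```python
import math

def svar(output_lengths, target_lengths):
    """Square root of the mean squared deviation between generated and
    desired lengths; 0 means every output hit its target length exactly."""
    n = len(output_lengths)
    return math.sqrt(sum((lo - lt) ** 2
                         for lo, lt in zip(output_lengths, target_lengths)) / n)
```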
The dimensions of the hidden states of our BiLSTM encoder and one-layer LSTM decoder are both fixed to 512. The size of the vectors b and b̃ incorporating the length input is 512, and the number of possible lengths L in LenEmb is 150.
We first train our models in supervised ML using Adam [Kingma and Ba2014], clipping gradients [Pascanu, Mikolov, and Bengio2013] to the range [-10, 10], with a batch size of 64.
We then run the RL algorithms on the previously trained LC models with an initial learning rate of 0.00001; the reward in RL is set to the sum of the R-1, R-2 and R-L scores. During RL, the desired lengths used to sample sentences are uniformly distributed over a fixed interval. We evaluate the model on the validation set every 2000 iterations and select the model according to its cumulative R-1, R-2 and R-L score.
Note that in our experiments spaces are not counted in the sentence length, which is slightly different from [Kikuchi et al.2016].
Experiment Results Analysis
Length Control in ML
Although the evaluation score is not the only objective of this research, it is of interest how exactly the score is affected by LC capacity. The results of the four LC models under ML are presented in Table 2; ROUGE scores are collected with the desired length set to that of the reference. To situate the accuracy of our LC models, we list several existing summarization baselines, including ABS, ABS+ [Rush, Chopra, and Weston2015], RAS-LSTM, RAS-Elman and Luong-NMT [Chopra, Auli, and Rush2016]. Comparing the two WLI models and the two RLI models pair-wise, we find that the two proposed models, LenLInit and LenMC, give up a little LC capacity while improving the scores markedly.
Table 1 provides a representative example of the summaries generated by the LC models; the results demonstrate that these models can output well-formed sentences at various lengths. We also observe that LenLInit and LenMC perform better on the short summary in this case.
RL for Length Control
Table 3 displays an overall comparison of all models under RL. We evaluate our models at sentence lengths of 25, 45 and 65, representing short, median and long sentences respectively. Since RL is usually unstable and results may vary between training runs, we repeat training multiple times for each model and report averaged results.
We first present the results of the four LC models under ML. We then apply raw self-critical sequence training (SCST) on this basis, without any constraint on output length, and find that WLI models tend to lose control of length sharply while increasing accuracy significantly, whereas RLI models keep good LC ability. This is mainly because, for RLI models, the lengths of the sampled sentences are consistent with the input length in most cases, so the training process remains stable.
To further investigate the impact of RL on the LC models, we evaluate them at all desired lengths within the range [20, 70]. The results are reported in Figure 3, where the x-axis represents the length, and the ROUGE score and svar on the y-axes measure output quality and LC capacity, respectively. For convenience, we take the average of the R-1, R-2 and R-L values as the ROUGE score. Clearly, RL improves scores across the whole range of lengths but relaxes LC capacity. The gain in scores is significant on both short and long sentences for WLI models, as well as on short sentences for RLI models, which indicates that RL alleviates the problem caused by the unbalanced amount of training references at different lengths. In particular, LenLInit achieves the highest score among the four models but has poor LC on long sentences. It is worth noting that LenMC with SCST achieves an even higher score than LenInit on short summaries while still preserving excellent LC ability. Since SCST has a negligible effect on LenEmb, we exclude LenEmb from further comparison under length-control RL.
Controllable Length Control Analysis
Figure panels: (a) LenLInit, (b) LenInit, (c) LenEmb, (d) LenMC.
The results of MTS-RL and SCD-RL in Table 3 follow the SCST part. We run MTS-RL on three LC models. For the WLI models LenLInit and LenInit, accuracy and svar both rise as the selected threshold ε increases, which means the hyper-parameter ε in MTS-RL can be used to adjust the LC capacity. However, for the RLI model LenMC, the results show no obvious distinction in scores under different ε. We therefore adopt the SCD-RL training algorithm for LenMC; the results show that SCD-RL controls the LC capacity for the RLI model just as MTS-RL does for the WLI models, and SCD-RL can also manage the LC capacity for WLI models. Overall, the two RL training algorithms prevent the models from length-control collapse and make this capacity controllable via their own hyper-parameters.
To make a comprehensive comparison considering all factors, we build a scatter map (see Figure 4) displaying the performance of the models under different training strategies. The x-axis is svar, measuring LC capacity. To evaluate scores integrating different lengths, we take the average of the R-1, R-2 and R-L scores at lengths 25, 45 and 65 as the value on the y-axis. From Figure 4, we can draw some intuitive interpretations: (i) SCST as length-control RL for WLI models is extremely unstable. (ii) Among models with similar average ROUGE scores, LenMC has strictly better LC capability than LenInit. (iii) Statistically, LenLInit scores higher than LenInit when their svar values are relatively close. (iv) The models with the designed RL algorithms cover a wide range of LC capacity while keeping accuracy in a reasonable scope.
Conclusion and Future Work
In this paper, we proposed LenLInit and LenMC, inspired by former work; our modified models improve length-control summarization performance on the Gigaword dataset. Two newly developed RL algorithms were successfully applied to length-control models, significantly improving the scores on short, median and long sentences alike, and allowing users to select a model with the expected length-control capacity. Given how little research exists in this area, further work is needed. We plan to run experiments on other tasks, such as image captioning and dialogue systems, to further verify our RL algorithms. It would also be valuable to investigate the mathematical relationship between length-control capacity and evaluation scores, which would be beneficial for model selection. Furthermore, the controllable ability could be extended to other attributes such as sentiment or style.
- [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.
- [Cao et al.2018] Cao, Z.; Wei, F.; Li, W.; and Li, S. 2018. Faithful to the original: Fact aware neural abstractive summarization. In AAAI.
- [Chen and Bansal2018] Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, 675–686.
- [Chopra, Auli, and Rush2016] Chopra, S.; Auli, M.; and Rush, A. M. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL, 93–98.
- [Fan, Grangier, and Auli2018] Fan, A.; Grangier, D.; and Auli, M. 2018. Controllable abstractive summarization. ACL 45.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- [Kikuchi et al.2016] Kikuchi, Y.; Neubig, G.; Sasano, R.; Takamura, H.; and Okumura, M. 2016. Controlling output length in neural encoder-decoders. In EMNLP, 1328–1338.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Li et al.2017] Li, P.; Lam, W.; Bing, L.; and Wang, Z. 2017. Deep recurrent generative decoder for abstractive text summarization. In EMNLP, 2091–2100.
- [Lin2004] Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In ACL, 74–81.
- [Liu et al.2018] Liu, L.; Lu, Y.; Yang, M.; Qu, Q.; Zhu, J.; and Li, H. 2018. Generative adversarial network for abstractive text summarization. In AAAI.
- [Liu, Luo, and Zhu2018] Liu, Y.; Luo, Z.; and Zhu, K. 2018. Controlling length in abstractive summarization using a convolutional neural network. In EMNLP, 4110–4119.
- [Nallapati et al.2016] Nallapati, R.; Zhou, B.; dos Santos, C.; Gulcehre, C.; and Xiang, B. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL, 280–290.
- [Napoles, Gormley, and Van Durme2012] Napoles, C.; Gormley, M.; and Van Durme, B. 2012. Annotated gigaword. In ACL, 95–100. Association for Computational Linguistics.
- [Pascanu, Mikolov, and Bengio2013] Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In ICML, 1310–1318.
- [Paulus, Xiong, and Socher2018] Paulus, R.; Xiong, C.; and Socher, R. 2018. A deep reinforced model for abstractive summarization. ICLR.
- [Ranzato et al.2015] Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
- [Rennie et al.2017] Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR, 1179–1195.
- [Rush, Chopra, and Weston2015] Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. In EMNLP, 379–389.
- [See, Liu, and Manning2017] See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In ACL, 1073–1083.
- [Song, Zhao, and Liu2018] Song, K.; Zhao, L.; and Liu, F. 2018. Structure-infused copy mechanisms for abstractive summarization. In COLING, 1717–1729.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.
- [Sutton and Barto2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
- [Vinyals et al.2015] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR, 3156–3164.
- [Williams and Zipser1989] Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1(2):270–280.
- [Yang et al.2019] Yang, M.; Qu, Q.; Tu, W.; Shen, Y.; Zhao, Z.; and Chen, X. 2019. Exploring human-like reading strategy for abstractive text summarization. In AAAI, volume 33, 7362–7369.
- [Yu et al.2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.