Code for "Language GANs Falling Short"
Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias: at inference time, the model is fed its own prediction instead of a ground-truth token, which can lead to accumulating errors and poor samples. This line of reasoning has led to an outbreak of adversarial approaches for NLG, on the account that GANs do not suffer from exposure bias. In this work, we make several surprising observations which contradict common beliefs. We first revisit the canonical evaluation framework for NLG, and point out fundamental flaws with quality-only evaluation: we show that one can outperform such metrics using a simple, well-known temperature parameter to artificially reduce the entropy of the model's conditional distributions. Second, we leverage the control over the quality / diversity tradeoff given by this parameter to evaluate models over the whole quality-diversity spectrum, and find that MLE models consistently outperform the proposed GAN variants over the whole quality-diversity space. Our results have several implications: 1) the impact of exposure bias on sample quality is less severe than previously thought, 2) temperature tuning provides a better quality / diversity trade-off than adversarial training, while being easier to train, easier to cross-validate, and less computationally expensive.
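GANs are an instance of generative models based on a competition between a generator network $G_\theta$ and a discriminator network $D_\phi$. The generator network $G_\theta$ represents a probability distribution $p_{\text{model}}(x)$. The discriminator $D_\phi(x)$ attempts to distinguish whether an input value $x$ is real (came from the training data) or fake (came from the generator). Mathematically, the GAN objective can be formulated as a minimax game:
$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D_\phi(x)\right] + \mathbb{E}_{x \sim G_\theta}\left[\log\left(1 - D_\phi(x)\right)\right].$$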
GANs were originally applied to continuous data like images. This is because the training procedure relies on backpropagation through the discriminator into the generator. Discrete (sequential) data require an alternative approach. Yu et al. (2017) estimate the gradient to the generator via REINFORCE policy gradients Williams (1992). In their formulation, the discriminator evaluates full sequences. Therefore, to provide error attribution earlier for incomplete sequences and to reduce the variance of gradients, they perform Monte-Carlo rollouts until the sentence is completed.
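As a rough illustration of the policy-gradient estimator described above, the following writes a standard REINFORCE gradient with rollout-based rewards. The notation is an assumption made for this sketch and is not taken from Yu et al. (2017) verbatim: $G_\theta$ is the generator, $D_\phi$ the discriminator, $T$ the sequence length, $N$ the number of Monte-Carlo rollouts, and $Q_t$ the estimated reward for the prefix $x_{1:t}$.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x_{1:T} \sim G_\theta}
    \left[ \sum_{t=1}^{T} Q_t \, \nabla_\theta \log G_\theta(x_t \mid x_{1:t-1}) \right],
\qquad
Q_t \approx \frac{1}{N} \sum_{n=1}^{N} D_\phi\!\left(\tilde{x}^{(n)}_{1:T}\right),
\quad
\tilde{x}^{(n)}_{1:T} \sim G_\theta(\,\cdot \mid x_{1:t}).
```

Each rollout $\tilde{x}^{(n)}_{1:T}$ completes the prefix $x_{1:t}$ by sampling from the generator, so the discriminator only ever scores full sequences; for $t = T$ no rollout is needed and $Q_T = D_\phi(x_{1:T})$.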
Yu et al. (2017) advertise their model using two tasks which we argue (with hindsight) are flawed. First, they introduce a synthetic evaluation procedure where the underlying data distribution is known and can be queried. By representing the data distribution with an LSTM (referred to as an oracle in the literature), they directly compute the likelihood of samples drawn from a generative model. The problem is that they benchmark models against each other on this likelihood alone, i.e., the diagnostic is completely blind to diversity. For example, a model that always outputs the same highly likely sequence would easily outperform other potentially superior models. For real data, there was no agreed-upon metric to evaluate the quality of unconditional NLG at the time. This led the authors to propose a new metric, Corpus-level BLEU, which computes the fraction of n-grams in a sample that appear in a reference corpus. Again, this metric is agnostic to diversity. Generating a single good sentence over and over will give a perfect BLEU score.
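The diversity blindness described above can be illustrated with a deliberately simplified stand-in for Corpus-level BLEU. The function `ngram_precision` and the toy corpus are hypothetical names and data introduced for this sketch; it computes only the fraction of sample n-grams found in a reference corpus and omits the clipping, brevity penalty, and multi-order averaging of real BLEU.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(samples, reference, n=2):
    """Fraction of n-grams in the samples that also occur in the reference corpus.

    Simplified illustration only: not the full BLEU definition.
    """
    reference_grams = {g for sent in reference for g in ngrams(sent, n)}
    sample_grams = [g for sent in samples for g in ngrams(sent, n)]
    if not sample_grams:
        return 0.0
    hits = sum(g in reference_grams for g in sample_grams)
    return hits / len(sample_grams)


# Toy reference corpus (made up for this example).
reference = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "a bird sang in the tree".split(),
]

# A collapsed "model" that repeats one reference sentence over and over.
collapsed = [reference[0]] * 3

# A more varied "model" that recombines words into new sentences.
diverse = [
    "the cat ate the bone".split(),
    "a dog sat on the tree".split(),
    "the bird sang on the mat".split(),
]

print("collapsed:", ngram_precision(collapsed, reference))
print("diverse:  ", ngram_precision(diverse, reference))
```

The collapsed sampler gets a perfect score although it produces a single sentence, while the varied sampler is penalized for every novel bigram.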
This paper sparked a stream of research adopting the same evaluation framework. RankGAN Lin et al. (2017) proposes to replace the discriminator with an adversarial ranker to provide better signal to the generator. MaliGAN Che et al. (2017) in turn proposes a normalized maximum-likelihood optimization target and a variance-reduction technique. GANs are notorious for suffering from mode collapse. This is also an issue for GANs with discrete-sequential data. To remedy this, TextGAN Zhang et al. (2017) performs high-dimensional latent feature matching where the generator incurs a high cost for having low-diversity samples. LeakGAN Guo et al. (2017) allows the generator to access leaked information from the discriminator on partially generated sentences at every time step. LeakGAN has a sophisticated two-level hierarchical architecture, inspired by FeUdal Networks Vezhnevets et al. (2017).
All of the aforementioned papers evaluate sample quality alone. As a remedy, Zhu et al. (2018) propose a metric that compares a generated sentence with a corpus of generated sentences, called Self-BLEU. They, along with Lu et al. (2018b), provide an extensive comparison of quality and diversity in GANs using Negative BLEU and Self-BLEU. However, it is not clear which algorithm is superior, as evidenced by Figure 1, because no model simultaneously outperforms another on both metrics.
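The idea behind Self-BLEU, scoring each generated sentence against the other generated sentences, can be sketched with a simplified n-gram overlap. The function `self_overlap` is a hypothetical name for this illustration and is not the Self-BLEU implementation of Zhu et al. (2018); it only measures how many of a sentence's n-grams reappear in the rest of the generated set.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def self_overlap(samples, n=2):
    """Average fraction of each sample's n-grams that occur in the other samples.

    Values near 1.0 indicate low diversity; values near 0.0 indicate high diversity.
    Simplified illustration only: not the full Self-BLEU definition.
    """
    scores = []
    for i, sent in enumerate(samples):
        others = {
            g
            for j, other in enumerate(samples)
            if j != i
            for g in ngrams(other, n)
        }
        grams = ngrams(sent, n)
        if grams:
            scores.append(sum(g in others for g in grams) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0


repeated = ["the company shares rose".split()] * 3
distinct = ["a b c".split(), "d e f".split()]

print("repeated:", self_overlap(repeated))
print("distinct:", self_overlap(distinct))
```

A sampler that repeats one sentence scores the maximum overlap, which is exactly the failure case that quality-only metrics cannot see.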
Newer NLG models have been proposed and now evaluate quality and diversity simultaneously. Shi et al. (2018) use inverse reinforcement learning to model a reward function on training data. This produces a denser signal for the generator and better samples. An even more recent line of work explores alternatives to adversarial training for text generation. Cooperative training Lu et al. (2018a) replaces the discriminator with a mediator that interpolates between generator and target distributions. They propose a low-variance and unbiased training procedure that minimizes an approximation of the Jensen-Shannon divergence. Two other GAN variants have since been proposed Xu et al. (2018); Chen et al. (2018).
During natural language generation, a single spurious sampled token can lead to an overall low-quality and incoherent sequence. In other words, a high-entropy conditional distribution may result in poor sample quality at inference. To address this problem one can modulate the entropy of the generator's conditional distribution $G_\theta(x_t \mid x_{1:t-1})$ with a Boltzmann temperature parameter $\alpha$ Ackley et al. (1988). More precisely, if $o_t$ is the generator's pre-logit activation and $W$ is the word embedding matrix, then the conditional distribution of the generator is given by
$$G_\theta(x_t \mid x_{1:t-1}) = \mathrm{softmax}\left(\frac{o_t \cdot W}{\alpha}\right).$$
Decreasing $\alpha$ below 1.0 will increase the magnitude of the scaled logits $o_t \cdot W / \alpha$ and thus decrease the entropy of $G_\theta$'s conditional probability. This is a useful tool to reduce the probability of mistakes in NLG and thus improve sample quality. Concretely, temperature tuning naturally moves the model in quality/diversity space. We demonstrate this in Table 1.
|Temperature||Samples|
|2.0||(1) If you go at watch crucial characters putting awareness in Washington , forget there are now unique developments organized personally then why charge .|
|||(2) Front wants zero house blood number places than above spin 5 provide school projects which youth particularly teenager temporary dollars plenty of investors enjoy headed Japan about if federal assets own , at 41 .|
|1.0||(1) Researchers are expected to comment on where a scheme is sold , but it is no longer this big name at this point .|
|||(2) We know you ’ re going to build the kind of home you ’ re going to be expecting it can give us a better understanding of what ground test we ’ re on this year , he explained .|
|0.7||(1) The other witnesses are believed to have been injured , the police said in a statement , adding that there was no immediate threat to any other witnesses .|
|||(2) The company ’ s net income fell to 5 . 29 billion , or 2 cents per share , on the same period last year .|
|0.0||(1) The company ’ s shares rose 1 . 5 percent to 1 . 81 percent , the highest since the end of the year .|
|||(2) The company ’ s shares rose 1 . 5 percent to 1 . 81 percent , the highest since the end of the year .|
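The effect of the temperature on the entropy of a conditional distribution can be checked numerically. The sketch below is a minimal illustration using made-up logits for a four-word vocabulary; `softmax_with_temperature` and `entropy` are hypothetical helper names, and `alpha` denotes the temperature.

```python
import math


def softmax_with_temperature(logits, alpha):
    """Softmax of logits / alpha, for a temperature alpha > 0.

    alpha -> 0 approaches greedy (argmax) decoding and is not handled here.
    """
    if alpha <= 0:
        raise ValueError("alpha must be strictly positive")
    scaled = [x / alpha for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up logits for a four-word vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for alpha in (2.0, 1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, alpha)
    print(f"alpha={alpha:<4} entropy={entropy(probs):.3f} "
          f"max_prob={max(probs):.3f}")
```

As the temperature decreases, probability mass concentrates on the most likely token and the entropy shrinks, which matches the progression from incoherent to repetitive samples in the table above.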
The current evaluation protocol for NLG is to compare models with respect to separate quality and diversity metrics. Often, we are left in a situation where it is impossible to tell which algorithm is superior, as shown in Figure 1 and further exemplified in Figure 2. However, one can control the quality-diversity trade-off of autoregressive text generators with the temperature. Thus, we can leverage this tool to design a new evaluation framework that sheds light on the real performance of each model without heuristically combining a quality and a diversity metric. More precisely, we propose to perform a sweep of temperatures for each model in order to compute temperature curves in quality-diversity space. This is exemplified in Figure 2. We refer to this procedure as temperature sweep.
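A temperature sweep can be sketched on a toy problem where everything is computed exactly. All quantities below are assumptions made for illustration: a single-step "oracle" distribution over four tokens, a slightly misfit "model", quality measured as the oracle's negative log-likelihood of model samples, and diversity measured as the model's negative log-likelihood of oracle data (lower is better for both). Real experiments would use sampled sequences and the metrics reported in the experiments.

```python
import math


def softmax(logits, alpha=1.0):
    """Softmax of logits divided by the temperature alpha (alpha > 0)."""
    scaled = [x / alpha for x in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """Expected negative log-probability under q of outcomes drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def temperature_sweep(model_logits, oracle_probs, alphas):
    """Return one (alpha, quality, diversity) point per temperature."""
    curve = []
    for alpha in alphas:
        q = softmax(model_logits, alpha)
        quality = cross_entropy(q, oracle_probs)    # oracle NLL of model samples
        diversity = cross_entropy(oracle_probs, q)  # model NLL of oracle data
        curve.append((alpha, quality, diversity))
    return curve


# Toy oracle and a slightly misfit model (both made up).
oracle_probs = [0.5, 0.3, 0.15, 0.05]
model_logits = [math.log(p) for p in (0.45, 0.30, 0.20, 0.05)]

curve = temperature_sweep(model_logits, oracle_probs, [2.0, 1.0, 0.5, 0.25])
for alpha, quality, diversity in curve:
    print(f"alpha={alpha:<5} quality={quality:.3f} diversity={diversity:.3f}")
```

Lowering the temperature improves the quality number while the diversity number eventually degrades, so each model traces a curve rather than a single point, and two models can be compared curve against curve.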
This new way of evaluating NLG models allows practitioners to answer questions such as: which model should I use if I’m interested in high quality (diversity) samples? Does a new model learn something interesting such that it can outperform others somewhere in the quality/diversity space, or is it just, e.g., reducing the entropy of the distribution? It could also be leveraged as a cross-validation tool, e.g., to early-stop once the best temperature curve is achieved according to a heuristic.
As shown in the next section, temperature sweep demonstrates that MLE models consistently outperform the newly proposed GAN variants everywhere in the quality-diversity space and perform on par with CoT, which is computationally and algorithmically more complex.
Concurrent with our work, Semeniuta et al. (2018) demonstrated the issues of local n-gram metrics. Their extensive empirical evaluation of GAN models and language models (LM) did not result in evidence of GAN-trained models outperforming on the new and improved global metrics from Cífka et al. (2018). Our analysis continues along this path by examining the performance of these models under a sweep of temperatures. We believe this difference to be of utmost importance, as it is the necessary ingredient towards definitively showing MLE models outperform the currently proposed GAN variants on quality-diversity global metrics.
Empowered with this evaluation approach, we examine several recent GAN text generation models and compare against an MLE baseline. The experiments consist of two parts: synthetic data generation and long text generation. We will provide strong empirical evidence that MLE-trained models often outperform textual GANs in the quality-diversity space.
For the synthetic data experiment, we follow the procedure proposed in Yu et al. (2017). We create an oracle parametrized by an LSTM with a hidden dimension of 32 with parameters drawn from a standard normal distribution. We use EMNLP2017 News (http://www.statmt.org/wmt17/) for the long-text generation task Guo et al. (2017). This corpus has become a common benchmark for neural text generation.
Regarding the experimental procedure, for each experiment we report the best performing GAN on the quality metric alone and on the sum of the quality and diversity metrics. We refer to these models as GAN (quality) and GAN (quality+diversity) respectively. For more details, see Appendix A.
In the synthetic experiment, we learn a generative model of data produced from a fixed oracle LSTM. This allows us to compute a perfect quality metric, the negative log-likelihood of samples under the oracle (oracle NLL). In Table 4 we see that artificially reducing the temperature at inference achieves state-of-the-art results as evaluated by the oracle NLL.
Of course, quality alone is misleading. The MLE-trained model with extremely low temperature will repeatedly output only a single sequence. It is therefore essential to evaluate the resulting sample diversity. The temperature sweep is shown in Figure 4. Interestingly, our GAN algorithm is far superior to the official SeqGAN and LeakGAN implementations. Nonetheless, MLE outperforms GANs everywhere in the quality-diversity space. Finally, CoT achieves a similar performance to MLE. This is unsurprising as their objectives are somewhat similar. Future work will investigate if the added algorithmic and computational complexity actually adds value, e.g., on other datasets.
Furthermore, the best performing GAN models have the smallest learning rate and a considerable amount of pretraining. These results are consistent with Semeniuta et al. (2018), suggesting that the best-performing GANs tend to stay close to the solution given by maximum-likelihood training.
|Model||Oracle NLL|
|SeqGAN Yu et al. (2017)||8.74|
|RankGAN Lin et al. (2017)||8.25|
|LeakGAN Guo et al. (2017)||7.04|
|IRL Shi et al. (2018)||6.91|
Secondly, we study long text generation. We again compare an MLE model to reported GAN results on the local metrics Negative BLEU and Self-BLEU. Negative BLEU-5 and SBLEU-5 are plotted in Figure 6 and the full results are reported in Appendix 2. Similarly to the short text generation task, MLE outperforms all text GANs considered in the quality-diversity trade-off.
Now, we compare MLE to GAN training on the new and improved global metrics, i.e., Language Model score (quality) and Reverse Language Model score (diversity). We refer the readers to Cífka et al. (2018); Semeniuta et al. (2018) for details on these metrics. See Figure 6 for a comparison of the training schemes at multiple temperatures in quality-diversity space. The story repeats itself: MLE outperforms GAN training everywhere in the quality-diversity space.
These findings on global metrics combined with the local ones are important: a great body of research speculates that exposure bias is to blame for bad sample quality. Under the accumulating-error hypothesis, MLE should be at its weakest on long text generation tasks. However, these results are evidence that exposure bias is less of an issue than GAN training combined with deep Reinforcement Learning. There is also another way to interpret these results. It seems that MLE training leaves the generator with a great policy on a quality-diversity basis. However, because GAN training removes entropy from the learned distribution (see Appendix E), one can be left with the illusion that GAN training is superior.
This research demonstrates that well-adjusted language models are a remarkably strong baseline and that temperature sweeping can provide a very clear characterization of model performance. A well-adjusted language model outperforms the considered GAN variants as evaluated on both local and, more surprisingly, global metrics of quality and diversity. Our temperature sweeping framework shares characteristics with a Receiver Operating Characteristic (ROC) curve. Analogously, if one needed a single scalar to compare NLG models, one could compute the area under the curve and seek the model with the smallest value (lower is better for our considered metrics).
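The single-scalar summary suggested above can be sketched with the trapezoidal rule. The function `area_under_curve` and the two curves are hypothetical and made up for illustration; each point is a (diversity, quality) pair obtained at one temperature, with lower values better on both axes.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) pairs.

    Points are sorted by x first, so they may be passed in any order.
    """
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up (diversity, quality) curves for two models over the same x-range.
model_a = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.5)]
model_b = [(1.0, 3.5), (2.0, 2.6), (3.0, 2.0)]

print("model_a area:", area_under_curve(model_a))
print("model_b area:", area_under_curve(model_b))
```

Such a comparison is only meaningful when both curves cover the same range on the x-axis; otherwise the areas should be computed over the overlapping interval.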
GAN-based generative models have been proven effective on real-valued data; however, there exist many difficult, pernicious issues in moving to discrete data. These issues must be overcome before GANs will improve over the strong MLE baselines. On the datasets and tasks considered, potential issues caused by exposure bias were less severe than the issues of training GANs on discrete data. GAN training may prove fruitful eventually, but this research lays forth clear boundaries that it must first surpass.
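Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1988). Connectionist Models and Their Implications: Readings from Cognitive Science, chapter A Learning Algorithm for Boltzmann Machines, pages 285–307. Ablex Publishing Corp., Norwood, NJ, USA.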
We conducted each experiment in the same fashion. We first trained 50 MLE models and cross-validated them with the log-likelihood on the validation set. Next, we trained 300 GANs that were cross-validated in two ways: quality-only or quality + diversity, using the quality and diversity metrics of the corresponding setting (synthetic data or real data).
Our GANs are based on SeqGAN with additional improvements shown to be useful, including: MLE pretraining Yu et al. (2017), leaky discriminator Guo et al. (2017), step-level loss instead of sequence-level loss Fedus et al. (2018), learned baseline to reduce variance Fedus et al. (2018), regularizing REINFORCE with a maximum entropy loss Williams and Peng (1991), and alternating adversarial loss with MLE loss Guo et al. (2017). Each facet could be toggled during training. Finally, the discriminator architecture is identical to the generator’s. This allows us to pretrain the discriminator with the generator’s weights learned by MLE.
The authors of SeqGAN and LeakGAN open-sourced a benchmarking platform to support research on open-domain text generation models, called Texygen Zhu et al. (2018). Later, the same authors published Lu et al. (2018b). This work reviews the current state of Neural Text Generation. They used the Texygen platform to benchmark multiple models, including their own. It is thus fair to assume that those numbers are the official ones. These numbers are the ones reported in the experimental section. Furthermore, we use the exact datasets found on Texygen to train our models.
We also report results from Lu et al. (2018b) for MaliGAN, RankGAN and TextGAN. Because these models are not from the same authors as Texygen and Lu et al. (2018b), we must assume that their results could be slightly better than reported, since researchers in general are biased towards working harder on their own models at the expense of baselines.
The Lu et al. (2018a) paper doesn’t report results on Image CoCo. They do report results on the EMNLP2017 News dataset. However, at the time of this writing, they do not report Self-BLEU for their top performing model (this is the one with ). Moreover, their official implementation is not complete yet, as it only works on synthetic data. For these reasons, we didn’t report results for CoT in the real data experiment.
The Shi et al. (2018) paper reports results on Image CoCo. However, they use a training dataset of 80,000 sentences. At the time of this writing, there is no official implementation. For these reasons, we didn’t report results for IRL in the real data experiment.
MaskGAN Fedus et al. (2018) is a conditional generative model. For this reason, it doesn’t fit in the scope of this work.
|Model||BLEU-2||BLEU-3||BLEU-4||BLEU-5||SBLEU-2||SBLEU-3||SBLEU-4||SBLEU-5|
|SeqGAN Yu et al. (2017)||0.72||0.42||0.18||0.09||0.91||0.70||0.46||0.27|
|MaliGAN Che et al. (2017)||0.76||0.44||0.17||0.08||0.91||0.72||0.47||0.25|
|RankGAN Lin et al. (2017)||0.69||0.39||0.18||0.09||0.90||0.68||0.45||0.30|
|TextGAN Zhang et al. (2017)||0.21||0.17||0.15||0.13||1.00||0.98||0.97||0.96|
|LeakGAN Guo et al. (2017)||0.84||0.65||0.44||0.27||0.94||0.82||0.67||0.51|

|Model||BLEU-2||BLEU-3||BLEU-4||BLEU-5||SBLEU-2||SBLEU-3||SBLEU-4||SBLEU-5|
|SeqGAN Yu et al. (2017)||0.75||0.50||0.29||0.18||0.95||0.84||0.67||0.49|
|MaliGAN Che et al. (2017)||0.67||0.43||0.26||0.16||0.92||0.78||0.61||0.44|
|RankGAN Lin et al. (2017)||0.74||0.47||0.26||0.16||0.96||0.88||0.76||0.62|
|TextGAN Zhang et al. (2017)||0.59||0.46||0.28||0.21||0.94||0.93||0.80||0.75|
|LeakGAN Guo et al. (2017)||0.74||0.52||0.33||0.21||0.93||0.82||0.66||0.51|
In this section, we want to highlight an important flaw in using BLEU as a proxy for quality. We tuned the temperature in order to find an MLE model with a BLEU score equal to the training data’s. We show three randomly sampled sentences from the model in Table 4. Although sometimes grammatically correct, the samples lack semantic and/or global coherence. It seems the generated text has poor information content. Surprisingly, in order to get great samples on a consistent basis, the temperature needs to be reduced to a level where BLEU-5 is twice as large as the training data’s. Thus, it seems that BLEU is not always a good proxy of sample quality. Again, we think it is of utmost importance to develop better metrics and modernize NLG’s canonical evaluation framework.
|MLE||(1) He explained that the Government ’ s plan to cut tax on unemployment was 3 . 3 percent lower than forecast for the first increase of 16 percent in 2015 , the fastest rate in the state since 2004 .|
|(2) On the policy , it ’ s no more than the amount of money we have of the decades and Senate of our assets .|
|(3) They say it was possible supporting the Scottish government to make the changes as secret free environment based on competition .|
As a penultimate experiment, we analyze the effect of changing the temperature at training instead of inference. Guo et al. (2017) suggested that increasing the temperature at training time leads to more diverse samples. However, we argue that this procedure leads to the opposite outcome, as a model can adapt to the temperature change. This would have the net result of lowering the entropy at test time. To examine this, we trained 30 GANs, keeping everything constant except the training temperature. Negative BLEU-5 against SBLEU-5 is plotted in Figure 8. The darker the dot, the higher the training temperature. As hypothesized, models trained with increased temperature at training time adapted to the change and the net result was a colder temperature at inference (hence reduced diversity). We therefore recommend only adjusting the temperature at inference. One should consider other techniques to facilitate exploration during training.
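The adaptation argument above can be made concrete under a strong idealization, which is an assumption of this sketch and not a claim about the trained GANs: suppose training perfectly fits the tempered distribution to a target, i.e. softmax(logits / train_alpha) equals the target. Then the logits must equal train_alpha times the log of the target probabilities (up to a constant), and sampling at temperature 1.0 yields a distribution proportional to the target raised to the power train_alpha. The target distribution and helper names are made up for illustration.

```python
import math


def softmax(logits, alpha=1.0):
    """Softmax of logits divided by the temperature alpha (alpha > 0)."""
    scaled = [x / alpha for x in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adapted_logits(target, train_alpha):
    """Logits of an idealized model whose tempered output matches the target.

    Assumption: training drives softmax(logits / train_alpha) exactly to target.
    """
    return [train_alpha * math.log(p) for p in target]


target = [0.5, 0.3, 0.15, 0.05]  # made-up target distribution

for train_alpha in (1.0, 1.5, 2.0):
    logits = adapted_logits(target, train_alpha)
    at_train = softmax(logits, train_alpha)  # what the model matched in training
    at_test = softmax(logits, 1.0)           # what it samples from at inference
    print(f"train_alpha={train_alpha:<4} "
          f"entropy_train={entropy(at_train):.3f} "
          f"entropy_test={entropy(at_test):.3f}")
```

In this idealized setting, a higher training temperature leaves the training-time distribution unchanged but lowers the entropy at inference, which is consistent with the reduced diversity described above.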
We present additional samples for the EMNLP2017 News dataset.
|SeqGAN||You only certainly might not rush it down for those circumstances where we are when they were the heads , and when she ’ s name .|
|“ I think you should really really leave for because we hadn ’ t been busy , where it goes to one , ” he wrote .|
|All the study knew was that they are , so they continue to provide support service and it doesn ’ t exist .|
|‘ It can say become up with nothing sales have reached the charge for the other any evidence that been virtually well below the $ 800 .|
|Three times before the start of the season is much early on 2015 we are in the third training every year .|
|That ’ s the idea of strength that decision they said, we haven ’ t already lost four or seven, or Liverpool ’ s team .|
|That is not the time for the cost of changing the system and it was pushing for $ 20 million .|
|We had to take it a good day for a military , but nearly 6 , 000 ] and prepare for them through .|
|I actually didn ’ t tell the background check the difference after my hour was to be recalled . . . and it was great .|
|We are thinking about 40 , 000 and jobs in what is wrong in the coming and you know .|
|That is out how working you can ’ t set out some pretty tight . . . or what I ’ m going through .|
|“ I wanted to be made you decided to have a crisis that way up and get some sort of weapon , not much to give birth to for an American room .|
|She had been fined almost 200, 000 with couple of asylum seekers in Syria and Iraq .|
|Perhaps not , in looking for , housing officials would help the frustration of Government , with an FBI shortly before 2020 .|
|Once we got to real show for the young man since I ’ m sure she went to love it just , whether to be late later last year .|
|But , after a holiday period we might have to go on a total - out debate like that could have happened to us .|
|LeakGAN||A man has been arrested at age 28 , a resident in Seattle , which was widely reported in 2007 .|
|I also think that ’ s a good place for us , I ’ m sure that this would be a good opportunity for me to get in touch .|
|What is the biggest problem for Clinton is that Donald Trump will be in the race and he ’ s unlikely to be the nominee .|
|” We ’ re going to do and we ’ re going to put it out and get the ball , ” he said .|
|“ I would be afraid to blame the girls to go back but I was just disappointed with the race , ” he said.|
|“ I ’ m not going to work together with a different role and we can win the game , ” he added .|
|The couple ’ s lives are still missing and they have been killed in the city ’ s way to play against them , and because I came out there .|
|For the last three years , we ’ ve got a lot of things that we need to do with this is based on the financial markets .|
|Don ’ t ask me , but I know , if I ’ ll be able to be out of Hillary Clinton , I think it ’ s being made for the Congress.|
|“ I am proud to be able to move forward because we don ’ t have to look at about , ” he said .|
|That ’ s why we ’ re the most important people for the African American community and we ’ ve made a good response .|
|But the move will be only in a fight against them, as well as likely to prevent an agreement to remain in the EU .|
|The American Medical Association said that the militants had been arrested in connection with the murder of the same incident.|
|The two - year - old girl has been charged with a suspect who was in the vehicle to the police station.|
|It is hard to buy on the Olympics , but we probably don ’ t see a lot of it.|
|“ I ’ m not going to be very proud of the other countries , ” he said .|
|He said the U . N . intelligence industry will not comment on the ground , which would be sensitive to the European Union .|
|I take my work in the days , but I would have to go down on Wednesday night .|
|MLE||The UN Security Council is a major concern for the U . S . government , as well as a NATO ally in the Syrian civil war .|
|A spokesman for the Met Office said the death toll was only slightly higher than the previous year , according to the report .|
|But I hope that at the end of the day , I ’ m going to give her the best chance to go to the gym and go out and play .|
|The man , who cannot be named , said that he had never had sex with him , and he didn ’ t want to see him .|
|And it ’ s just one of those things that I have to say , I ’ m a Democrat , and I ’ m a conservative .|
|The bank is now the fastest growing market in the world and it is a significant change in the economy .|
|The two men , aged 20 and 22 , were arrested and charged with the murder of a man , a police officer .|
|The company will be able to provide a post Brexit strategy , which will be published in the coming weeks .|
|She said she had been on the wrong side of the road and was finally caught in a car accident and was taken to hospital .|
|I don ’ t think he ’ s even a good player , he said , but he ’ s got a good chance to win the game .|
|I don ’ t know what the future holds but I ’ m sure it will be a good thing .|
|It ’ s a very important step forward , and we ’ re going to be able to get the right results .|
|The driver of the vehicle , who was inside the vehicle , was taken to hospital for treatment , but said he was not aware of the incident .|
|The leak was obtained by a Freedom of Information request , which is based on the number of people claiming to be a victim of fraud .|
|The former secretary of state has made a major speech in New York , where she ’ s running for president .|
|The US economy grew at a record low of 1 . 6 percent in 2014 , and the unemployment rate has fallen by 0 . 9 percent .|
|The new rules are put into the hands of a member of the police , and the public is not aware of the situation .|
|The World Health Organization said a number of people were killed in the attack , according to the Pentagon .|
|The study also found that women who are not particularly vulnerable to women ’ s health problems are more likely to commit suicide .|