Normally, generative RNNs are trained to maximize the likelihood of samples from an empirical distribution of target sequences, i.e., using maximum-likelihood estimation (MLE), which is equivalent to minimizing the KL-divergence between the distribution of target sequences and the distribution defined by the model. While principled and effective, this KL-divergence objective tends to favor a model that overestimates its uncertainty / smoothness, which can lead to unrealistic samples (goodfellow2016nips).
For a purely generative RNN, the desired behavior is for the output of the free-running sampling process to match samples from the target distribution. The most common approach to training RNNs is to maximize the likelihood of each token from the target sequences given the previous tokens in the same sequence (a.k.a. Teacher Forcing, williams1989learning). In principle, if the generated sequences match the target sequences perfectly, then the generative model is the same as the one from the training procedure. In practice, however, because of the directed dependence structure in RNNs, small deviations from the target sequence can cause a large discrepancy between the training and evaluation models, so it is typical to employ beam search or scheduled sampling to expose the model to its own generative behavior during training (bengio2015scheduled).
Generative adversarial networks (GANs; goodfellow2014generative) estimate a difference measure (i.e., a divergence, such as the KL or Jensen-Shannon, or a distance, such as the Wasserstein distance; nowozin2016f; arjovsky2017wgan) using a binary classifier, called a discriminator, trained only on samples from the target and generated distributions. GANs rely on back-propagating this difference estimate through the generated samples to train the generator to minimize the estimated difference.
In many important applications in NLP, however, sequences are composed of discrete elements, such as character- or word-based representations of natural language, and exact back-propagation through generated samples is not possible. There are many approximations to the back-propagated gradient through discrete variables, but in general credit assignment with discrete variables is an active and unsolved area of research (bengio2013estimating; gu2015muprop; maddison2016concrete; jang2016categorical; tucker2017rebar). A direct solution was proposed in boundary-seeking GANs (BGANs hjelm2017bsgan), but the method does not address difficulties associated with credit assignment across a generated sequence.
In this work, we propose Actor-Critic under Adversarial Learning (ACtuAL) to overcome these limitations of existing generative RNNs. Our model is inspired by classical actor-critic algorithms in reinforcement learning, and we extend this family of algorithms to the setting of sequence generation by incorporating a discriminator. Here, we treat the generator as the “actor”, whose actions are evaluated by a “critic” that estimates a reward signal provided by the discriminator. Our newly proposed ACtuAL framework offers the following contributions:
- A novel way of applying Temporal Difference (TD) learning to capture long-term dependencies in generative models without full back-propagation through time.
- A new method for training generative models on discrete sequences.
- A novel regularizer for RNNs, which improves generalization in terms of likelihood on both character- and word-level language modeling across several language-modeling benchmarks.
2.1 Generative RNNs and Teacher Forcing
A generative recurrent neural network (RNN) models the joint distribution of a sequence, $p(y_{1:T})$, as a product of conditional probability distributions: $p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$.
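To make the chain-rule factorization concrete, the sketch below scores a sequence under a toy first-order conditional model, standing in for an RNN's conditionals (the table `COND` and all names here are illustrative, not from the paper; a real RNN conditions on the full prefix through its hidden state):

```python
import math

# Toy conditional model over a 3-token vocabulary: p(y_t | y_{t-1}).
# A first-order table keeps the chain-rule arithmetic visible.
COND = {
    None: {0: 0.5, 1: 0.3, 2: 0.2},  # distribution over the first token
    0:    {0: 0.1, 1: 0.6, 2: 0.3},
    1:    {0: 0.4, 1: 0.2, 2: 0.4},
    2:    {0: 0.3, 1: 0.3, 2: 0.4},
}

def sequence_log_prob(seq):
    """log p(y_1..y_T) = sum_t log p(y_t | y_{t-1})."""
    logp, prev = 0.0, None
    for y in seq:
        logp += math.log(COND[prev][y])
        prev = y
    return logp
```

For example, the sequence (0, 1) scores log(0.5) + log(0.6), exactly the product of its per-step conditionals.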
RNNs are commonly trained with Teacher Forcing (williams1989learning), where at each time step, $t$, of training, the model is judged on the likelihood of the target token, $y_t$, given the ground-truth prefix, $y_{1:t-1}$.
In principle, RNNs trained by Teacher Forcing should be able to model a distribution that matches the target distribution, as the joint distribution will be modeled perfectly if the RNN models each one-step-ahead prediction perfectly. However, if the training procedure does not model the conditional distribution perfectly, then during the free-running phase even small deviations from the ground truth may push the model into a regime it has never seen during training, and the model may not be able to recover from these mistakes. From this, it is clear that the optimal model under Teacher Forcing can be undesirable, as we want our generative model to generalize well. This problem can be addressed with beam search or scheduled sampling (bengio2015scheduled), which add some exploration to the training procedure. However, the central limitation of these approaches is that training with MLE necessitates that the model stay close to the target distribution, which can limit overall capacity and generalization.
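Scheduled sampling can be sketched as follows; the helper below is our illustrative reconstruction of the general idea, not the reference implementation. At each step the model's input is the ground-truth previous token with probability $1-\epsilon$, or a token sampled from the model itself with probability $\epsilon$:

```python
import random

def scheduled_sampling_inputs(target, sample_fn, eps):
    """Build the input sequence for one training pass over `target`:
    at each step, feed the ground-truth previous token with probability
    (1 - eps), or a token sampled from the model's own prediction
    (via `sample_fn`) with probability eps.  eps = 0 recovers pure
    Teacher Forcing; eps = 1 is fully free-running."""
    inputs, prev = [], None
    for y in target:
        inputs.append(prev)
        if random.random() < eps:
            prev = sample_fn(prev)   # model's own sample (free-running)
        else:
            prev = y                 # ground truth (teacher forcing)
    return inputs
```

In practice $\epsilon$ is annealed upward over training so the model is gradually exposed to its own prediction regime.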
2.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs; goodfellow2014generative) define a framework for training a generative model by posing it as a minimax game. GANs have been shown to generate realistic-looking images that are crisper than those from models trained under maximum likelihood (radford2015unsupervised).
Let us assume that we are given empirical samples from a target distribution, $p(x)$, over a domain $\mathcal{X}$. In its original, canonical form, a GAN is composed of two parts: a generator network, $G$, and a discriminator, $D$. The generator takes as input random noise, $z$, drawn from a simple prior, $p(z)$ (such as a spherical Gaussian or uniform distribution), and produces a generated distribution, $q(x)$, over $\mathcal{X}$. The discriminator is optimized to minimize the mis-classification rate, while the generator is optimized to maximize it: $\min_G \max_D \mathbb{E}_{x \sim p(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$.
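The minimax value can be estimated from discriminator logits on mini-batches of real and generated samples; the following is a minimal sketch (the function names are ours, not a library API):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gan_value(d_logits_real, d_logits_fake):
    """Monte-Carlo estimate of the minimax value
    E_x[log D(x)] + E_z[log(1 - D(G(z)))] from discriminator logits on
    real and generated samples.  D is trained to maximize this quantity;
    G is trained to minimize it."""
    real = sum(math.log(sigmoid(l)) for l in d_logits_real) / len(d_logits_real)
    fake = sum(math.log(1.0 - sigmoid(l)) for l in d_logits_fake) / len(d_logits_fake)
    return real + fake
```

Note that when the discriminator cannot tell the two apart (logit 0, i.e., $D = 1/2$ everywhere), the value is exactly $-\log 4$, the minimum of the Jensen-Shannon-based bound discussed next.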
As the discriminator improves, it better estimates $2\,\mathrm{JSD}(p \parallel q) - \log 4$, where $\mathrm{JSD}$ is the Jensen-Shannon divergence. The advantage of this approach is that the discriminator provides an estimate of this divergence based purely on samples from the generator and the target distribution; unlike Teacher Forcing and related MLE-based methods, the generated samples need not be close to the target samples to provide a training signal.
However, character- and word-based representations of language are typically discrete (e.g., represented as discrete “tokens”), and GANs require a gradient that back-propagates from the discriminator through the generated samples. Solutions have been proposed for GANs with discrete variables that use policy gradients based on the discriminator score (hjelm2017bsgan), and related methods have applied similar policy gradients to discrete sequences of language (yu2016poem; li2017adversarial). However, for reasons covered in detail below, these methods provide a high-variance gradient for the sequence generator, which may limit their ability to scale to long, realistic real-world sequences.
2.3 The Actor-Critic Framework
The actor-critic framework is a widely used approach in reinforcement learning for addressing the credit assignment issues associated with making sequential discrete decisions. In this framework, an “actor” models a policy over actions, and a “critic” estimates the expected reward or value function.
Let us assume that there is a state, $s_t$, which is used as input to a policy: a conditional multinomial distribution over actions with density $\pi(a \mid s_t)$. At timestep $t$, an agent / actor samples an action, $a_t \sim \pi(a \mid s_t)$, and performing the action generates the new state, $s_{t+1}$. This procedure ultimately leads to a reward, $R$ (where, in general, the total reward could be an accumulation of intermediate rewards), over the whole episode, $\tau = (s_1, a_1, s_2, a_2, \dots, s_T)$, where $s_T$ is some terminal state.
The REINFORCE algorithm uses this long-term reward by running the agent / actor for a full episode, gathering rewards, and then using them to update the policy through a policy gradient. However, there are issues with this approach: there is no mechanism to decide which actions affected the long-term outcome, so it can be ambiguous how to distribute credit for the reward among them. This is referred to as the credit assignment problem. On top of this, the REINFORCE estimator also suffers from high variance.
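The estimator can be sketched for a single state-independent categorical policy (a deliberate simplification; all names here are illustrative). Note how the one episode-level reward multiplies every step's score function, which is exactly the credit-assignment ambiguity just described:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(theta, actions, reward):
    """REINFORCE gradient for a categorical policy with logits `theta`:
    grad = R * sum_t grad_theta log pi(a_t), where for a softmax policy
    grad_theta log pi(a) = onehot(a) - softmax(theta).
    The single episode-level reward R scales every step's score, so
    individual actions receive no differentiated credit."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a in actions:
        for i in range(len(theta)):
            grad[i] += reward * ((1.0 if i == a else 0.0) - probs[i])
    return grad
```

Because every step shares the same scalar $R$, a single lucky episode inflates the probability of all its actions, good and bad alike, which is one source of the estimator's high variance.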
In actor-critic methods, the critic attempts to estimate the expected reward associated with each state, typically using Temporal Difference (TD; sutton1988learning) learning. The advantage of actor-critic methods is that, despite the added bias from estimating the expected reward, the parameters of the agent / actor can be updated at each step using the critic’s estimates. This lowers the variance of the gradient estimates, as the policy gradient is based on each state, $s_t$, rather than the whole episode, $\tau$. It also allows faster convergence of the policy, as updates can be made online while the episode, $\tau$, is being generated.
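A minimal sketch of the TD(0) value update a critic might use (tabular for illustration; the paper's critic is of course a learned function approximator):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move the value estimate V(s) toward the
    bootstrapped target r + gamma * V(s_next).  Because the target uses
    the critic's own estimate of the next state, the critic learns
    online, without waiting for the episode to terminate."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```

The returned TD error is also what scales the actor's per-step update, replacing the whole-episode reward of REINFORCE.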
3 Actor-Critic under Adversarial Learning (ACtuAL)
At this point, the analogy between training a generative RNN adversarially and the reinforcement learning setting above should be clear. Namely, the reward signal is provided by the discriminator, which, as it is trained to minimize the mis-classification rate, corresponds roughly to the likelihood ratio between the distribution of generated sequences and the target distribution. The state, $s_t$, is the sequence of generated tokens, $y_{1:t}$, up to time $t$. The actor is the generative RNN, whose policy, $\pi$, corresponds to the conditional distribution defined by the RNN. In order to perform better credit assignment, we introduce a third component into the generative adversarial framework: the critic, which is optimized to estimate the expected reward at each step of generation.
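Schematically, the per-token actor signal under this correspondence might look as follows; this is our illustrative sketch of the generic actor-critic pattern applied to the setup above, not the paper's exact objective, and all names are hypothetical:

```python
def actual_actor_step(log_pi, critic_value, critic_value_next,
                      reward=0.0, gamma=1.0):
    """Schematic per-token actor surrogate: the critic's TD error
    (bootstrapped toward the discriminator-derived reward, which in
    general arrives only at the end of the sequence) scales the
    log-probability of the sampled token.  Maximizing this surrogate
    replaces REINFORCE's whole-episode reward with a per-step signal."""
    advantage = reward + gamma * critic_value_next - critic_value
    return advantage * log_pi   # gradient flows through log_pi
```

The key point of the construction is that the generator receives a learning signal at every generation step, rather than a single high-variance signal per sequence.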