1 Introduction
Sequence generation is an important and broad family of machine learning problems that includes image generation, language generation, and speech synthesis. Variants of recurrent neural networks
(RNNs; hochreiter1997long; chung2014empirical; cho2014learning) have attained state-of-the-art results on many sequence-to-sequence tasks, such as language modeling (mikolov2012context), machine translation (sutskever2014sequence), speech recognition (chorowski2015attention), and time-series forecasting (flunkert2017deepar). Much of the popularity of RNNs derives from their ability to handle variable-length inputs and outputs, along with a simple learning algorithm based on backpropagation through time (BPTT).

Normally, generative RNNs are trained to maximize the likelihood of samples from an empirical distribution of target sequences, i.e., using maximum-likelihood estimation (MLE), which essentially minimizes the KL-divergence between the distribution of target sequences and the distribution defined by the model. While principled and effective, this KL-divergence objective tends to favor a model that overestimates its uncertainty / smoothness, which can lead to unrealistic samples (goodfellow2016nips).

For a purely generative RNN, the desired behavior is for the output of the free-running sampling process to match samples from the target distribution. The most common approach to training RNNs is to maximize the likelihood of each token from the target sequences given the previous tokens in the same sequence (a.k.a. Teacher Forcing; williams1989learning). In principle, if the generated sequences match the target sequences perfectly, then the generative model is the same as the one from the training procedure. In practice, however, because of the directed dependence structure in RNNs, small deviations from the target sequence can cause a large discrepancy between the training and evaluation models, so it is common to employ beam search or scheduled sampling to incorporate the model's actual generative behavior during training (bengio2015scheduled).
Generative adversarial networks (GANs; goodfellow2014generative) estimate a difference measure (i.e., a divergence, e.g., the KL or Jensen-Shannon, or a distance, e.g., the Wasserstein; nowozin2016f; arjovsky2017wgan) using a binary classifier, called a discriminator, trained only on samples from the target and generated distributions. GANs rely on backpropagating this difference estimate through the generated samples to train the generator to minimize the estimated difference.

In many important applications in NLP, however, sequences are composed of discrete elements, such as character- or word-based representations of natural language, and exact backpropagation through generated samples is not possible. There are many approximations to the backpropagated gradient through discrete variables, but in general, credit assignment with discrete variables is an active and unsolved area of research (bengio2013estimating; gu2015muprop; maddison2016concrete; jang2016categorical; tucker2017rebar). A direct solution was proposed in boundary-seeking GANs (BGANs; hjelm2017bsgan), but that method does not address the difficulties associated with credit assignment across a generated sequence.
In this work, we propose Actor-Critic under Adversarial Learning (ACtuAL) to overcome these limitations of existing generative RNNs. Our model is inspired by classical actor-critic algorithms from reinforcement learning, which we extend to the setting of sequence generation by incorporating a discriminator. Here, we treat the generator as the “actor”, whose actions are evaluated by a “critic” that estimates a reward signal provided by the discriminator. The ACtuAL framework offers the following contributions:

A novel way of applying Temporal Difference (TD) learning to capture long-term dependencies in generative models without full backpropagation through time.

A new method for training generative models on discrete sequences.

A novel regularizer for RNNs, which improves generalization in terms of likelihood on both character- and word-level language modeling across several language-modeling benchmarks.
2 Background
2.1 Generative RNNs and Teacher Forcing
A generative recurrent neural network (RNN) models the joint distribution of a sequence, $x_{1:T} = (x_1, \dots, x_T)$, as a product of conditional probability distributions:
$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).$$
RNNs are commonly trained with Teacher Forcing (williams1989learning), where at each time step, $t$, of training, the model is judged on the likelihood of the target, $x_t$, given the ground-truth prefix, $x_{1:t-1}$.
In principle, RNNs trained by Teacher Forcing should be able to model a distribution that matches the target distribution, as the joint distribution will be modeled perfectly if the RNN models each one-step-ahead prediction perfectly. However, if the training procedure does not model the conditional distributions perfectly, then during the free-running phase even small deviations from the ground truth may push the model into a regime it has never seen during training, from which it may be unable to recover. In other words, the model found by Teacher Forcing alone may be undesirable, as we want our generative model to generalize well under free-running sampling. This problem can be addressed with beam search or scheduled sampling (bengio2015scheduled), which add some exploration to the training procedure. However, the central limitation of these approaches is that training with MLE necessitates that the model stay close to the target distribution, which can limit its overall capacity and generalization properties.
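Teacher Forcing amounts to a few lines of code: at every step the model is scored on the ground-truth next token, and the ground-truth token (never a sampled one) drives the recurrence. Below is a minimal numpy sketch with a toy tanh recurrence; the function name and parameterization are illustrative, not the models used in this paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def teacher_forcing_nll(W, h, target):
    """Negative log-likelihood of `target` under a toy autoregressive
    model: at every step the model is scored on the ground-truth token,
    and the ground-truth token (never a sample) updates the state."""
    nll = 0.0
    for token in target:
        probs = softmax(W @ h)        # next-token distribution given state h
        nll -= np.log(probs[token])   # judged against the true next token
        h = np.tanh(W[token] + h)     # teacher forcing: feed the true token
    return nll
```

Note that the sampling path of the model is never exercised: the state trajectory is entirely determined by the ground-truth sequence, which is exactly why free-running deviations at test time land the model in unfamiliar regimes.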
2.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs; goodfellow2014generative) define a framework for training a generative model by posing it as a minimax game. GANs have been shown to generate realistic-looking images that are crisper than those of models trained under maximum likelihood (radford2015unsupervised).
Let us assume that we are given empirical samples, $x \sim p_{\text{data}}$, from a target distribution over a domain $\mathcal{X}$ (such as $\mathbb{R}^n$). In its original, canonical form, a GAN is composed of two parts: a generator network, $G$, and a discriminator, $D$. The generator takes as input random noise, $z \sim p(z)$, from a simple prior (such as a spherical Gaussian or uniform distribution), and produces a generated distribution, $p_g$, over $\mathcal{X}$. The discriminator is then optimized to minimize the misclassification rate, while the generator is optimized to maximize it:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]. \qquad (1)$$
As the discriminator improves, it better estimates $2\,\mathcal{D}_{JS}(p_{\text{data}} \,\|\, p_g) - \log 4$, where $\mathcal{D}_{JS}$ is the Jensen-Shannon divergence. The advantage of this approach is that, because the discriminator provides an estimate of this divergence based purely on samples from the generated and target distributions, unlike Teacher Forcing and related MLE-based methods, the generated samples need not be close to the target samples for training to proceed.
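A small numeric sketch makes the divergence connection concrete: at the equilibrium where the discriminator outputs $D = 1/2$ everywhere (generated and target samples indistinguishable), the value of the minimax game is $-\log 4$, corresponding to a Jensen-Shannon divergence of zero. The helper below is illustrative only.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Value of the canonical GAN minimax game given discriminator
    outputs on target samples (d_real = D(x)) and generated samples
    (d_fake = D(G(z))). D ascends this value; G descends it."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

# With D = 1/2 on every sample, the value is exactly -log 4:
# log(1/2) + log(1 - 1/2) = -2 log 2.
```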
However, character- and word-based representations of language are typically discrete (e.g., represented as discrete “tokens”), and GANs require a gradient that backpropagates from the discriminator through the generated samples. Solutions have been proposed for GANs with discrete variables that use policy gradients based on the discriminator score (hjelm2017bsgan), and other methods have applied similar policy gradients to discrete sequences of language (yu2016poem; li2017adversarial). However, for reasons covered in detail below, these methods provide a high-variance gradient for a sequence generator, which may limit their ability to scale to long, realistic real-world sequences.
2.3 The Actor-Critic Framework
The actor-critic framework is a widely used approach in reinforcement learning for addressing the credit-assignment issues associated with making sequential discrete decisions. In this framework, an “actor” models a policy over actions, and a “critic” estimates the expected reward or value function.
Let us assume that there is a state, $s_t$, which is used as input to a conditional multinomial policy with density $\pi_\theta(a \mid s_t)$ over actions, $a \in \mathcal{A}$. At time step $t$, an agent / actor samples an action, $a_t \sim \pi_\theta(a \mid s_t)$, and performs the action, which generates the new state, $s_{t+1}$. This procedure ultimately leads to a reward, $R(\tau)$ (where, in general, the total reward could be an accumulation of intermediate rewards), over the whole episode, $\tau = (s_1, a_1, s_2, a_2, \dots, s_T)$, where $s_T$ is some terminal state.
The REINFORCE algorithm uses this long-term reward by running the agent / actor for an episode, gathering the reward, and then using it to update $\theta$ through the policy gradient, $\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\!\left[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$. However, there are issues with this approach: there is no mechanism to decide which actions affected the long-term outcome, so it can be ambiguous how to distribute the reward among them. This is referred to as the credit-assignment problem. On top of this, the REINFORCE estimator also suffers from high variance.
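The uniform spreading of a single episode-level reward over every step can be seen directly in a sketch of the REINFORCE gradient. This is pure numpy; the logits-based parameterization of the policy is an illustrative assumption.

```python
import numpy as np

def reinforce_grads(logits_per_step, actions, episode_reward):
    """REINFORCE gradient (w.r.t. the per-step logits) for a sequence
    of categorical actions. The single episode-level reward multiplies
    grad log pi(a_t | s_t) at *every* step, so credit is spread
    uniformly over the episode -- the root of both the credit-
    assignment ambiguity and the high variance discussed above."""
    grads = []
    for z, a in zip(logits_per_step, actions):
        p = np.exp(z - z.max())
        p /= p.sum()
        g = -p
        g[a] += 1.0                       # d log softmax(z)[a] / dz
        grads.append(episode_reward * g)  # same scalar at every step
    return np.stack(grads)
```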
In actor-critic methods, the critic attempts to estimate the expected reward associated with each state, typically using Temporal Difference (TD; sutton1988learning) learning. The advantage of actor-critic methods is that, despite the added bias from estimating the expected reward, we can update the parameters of the agent / actor at each step using the critic's estimates. This lowers the variance of the gradient estimates, as the policy gradient is based on each state, $s_t$, rather than the whole episode, $\tau$. It also allows faster convergence of the policy, as the updates can be made online while the episode, $\tau$, is being generated.
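A tabular TD(0) critic update is only a few lines. This sketch (hypothetical names, with states indexed by time step for simplicity) shows how the critic bootstraps from its own next-state estimate rather than waiting for the episode to finish:

```python
def td0_update(values, t, reward, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update of a tabular critic, with states indexed by
    time step (an illustrative simplification). The target bootstraps
    from the critic's own next-state estimate, so the update can be
    applied online, without waiting for the episode to end."""
    target = reward if terminal else reward + gamma * values[t + 1]
    td_error = target - values[t]
    values[t] += alpha * td_error  # move the estimate toward the target
    return td_error
```

Replacing the full return $R(\tau)$ with this bootstrapped target is precisely the bias-for-variance trade mentioned above.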
3 Actor-Critic under Adversarial Learning (ACtuAL)
At this point, the analogy between training a generative RNN adversarially and the reinforcement learning setting above should be clear. Namely, the reward signal is provided by the discriminator, which, since it is trained to minimize the misclassification rate, corresponds more or less to the likelihood ratio between the distribution of generated sequences and the target distribution. The state, $s_t$, is the sequence of generated tokens, $x_{1:t}$, up to time $t$. The actor is the generative RNN, whose policy, $\pi_\theta(x_{t+1} \mid x_{1:t})$, corresponds to the conditional distribution defined by the RNN. In order to do better credit assignment, we introduce a third component into the generative adversarial framework: the critic, which is optimized to estimate the expected reward at each step of generation.
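Under this analogy, the per-step actor update can be sketched as follows: the discriminator score stands in for the reward, and the critic's value estimates form a TD error that scales the policy gradient at that step. This is a schematic sketch with invented names, not the paper's exact training rule.

```python
import numpy as np

def actual_actor_grad(logits, action, v_t, v_next, d_score, gamma=1.0):
    """Schematic per-step actor gradient in an ACtuAL-style setup
    (invented names): the discriminator score on the generated prefix
    plays the role of the reward, the critic values v_t and v_next
    form a TD error, and that error scales grad log pi(a_t | s_t)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = -p
    g[action] += 1.0                           # grad of log pi wrt logits
    td_error = d_score + gamma * v_next - v_t  # critic-corrected reward
    return td_error * g
```

Because the update depends only on the current step's logits and the critic's two value estimates, the actor can be trained online as the sequence is generated, rather than after a full episode as in REINFORCE.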