Text Generation with Efficient (Soft) Q-Learning

Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which is not applicable to many applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL) on the other hand offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning perspective. It further enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates, and learn effectively from sparse reward. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and the previous RL methods. On standard supervised tasks where MLE prevails, our approach also achieves competitive performance and stability by training text generation from scratch.





1 Introduction

Recent natural language generation systems have made remarkable progress in producing well-formed coherent text, especially with the massive pretrained language models (LMs) Radford et al. (2019); Brown et al. (2020); Lewis et al. (2020); Raffel et al. (2019). Those models are typically trained with maximum likelihood estimation (MLE) on large supervised data. Despite its efficiency and successful outcomes, the standard training method suffers from limited applicability to many emerging text generation problems, where little or no standard supervised data is available. Prominent examples include learning to generate prompts to control the massive LMs Yin et al. (2019); Shin et al. (2020); Zhong et al. (2021), learning text generation from noisy or even negative data, learning to generate adversarial text attacks for robustness study Wallace et al. (2019); Atanasova et al. (2020), and others, where people have to devise specialized algorithms due to the failure of the standard MLE.

On the other hand, reinforcement learning (RL) Sutton and Barto (2018) offers an alternative principled formulation for learning in general. The framework enjoys added flexibility by allowing users to plug in arbitrary reward functions. Instead of (blindly) imitating the training data, the model is trained to maximize the rewards so as to acquire the desired generation abilities. However, RL has so far seen limited success in training text generation (Choshen et al., 2020; Wu et al., 2018). A popular family of RL algorithms studied extensively for text generation is the policy-based Williams (1992) or actor-critic based Bahdanau et al. (2016); Rennie et al. (2017) algorithms, with policy gradient (PG) (Ranzato et al., 2015; Li et al., 2016; Rennie et al., 2017; Tan et al., 2018; Pasunuru and Bansal, 2018; Paulus et al., 2018) being the most prevalent example. Those algorithms train the model with on-policy updates, i.e., the text samples used for estimating policy gradients are drawn from the target model itself. Due to the exponentially large space of sequences, on-policy updates often suffer from extremely high variance and low data efficiency (e.g., most model samples are not useful for learning). Thus directly training with PG from scratch is usually impossible. In practice, the model has to be initialized by MLE training, followed by PG as finetuning, which often leads to limited improvement (Choshen et al., 2020; Wu et al., 2018).

To overcome the shortcomings, another set of work has resorted to off-policy RL. The key advantage of off-policy updates is that samples from other sources, such as human-written text, can be used, making them more data efficient than on-policy methods. Previous work has used either importance weighted PG Pang and He (2021); Zhou et al. (2017); Kandasamy et al. (2017) or Q-learning based algorithms Guo (2015); Jaques et al. (2020); Narasimhan et al. (2015). However, the off-policy methods have been considered less stable. For example, Q-learning performance relies heavily on how accurately the learned Q-function assesses the quality of intermediate subsequences – a challenging task given the sparse reward signals (e.g., reward is received only after the whole sequence is generated). Further, previous work has largely focused on the extreme of using only off-policy data, mostly for offline training of chatbots Jaques et al. (2020). As a result, the opportunity of directly improving the reward (as in on-policy updates) for other rich tasks is missed.

In this paper, we develop a new RL formulation for text generation that addresses the above issues. Specifically, we reframe the text generation problem from the soft Q-learning perspective Haarnoja et al. (2017); Schulman et al. (2017), which enables us to further take advantage of the latest successful techniques from the RL literature. In particular, we introduce the principled path consistency learning Nachum et al. (2017), which (1) offers a natural way to train the model with both on-policy and off-policy updates, hence combining the best of the two strategies, and (2) bridges the sparse reward signals directly to supervise the Q-function learning, leading to more accurate Q estimation and credit assignment.

The generality of the proposed learning framework allows us to train text generation in a wide range of applications: (1) With noisy and negative training examples for entailment generation, our approach manages to greatly improve upon the data and generate accurate entailment text; (2) The method also applies to train an effective generator for black-box adversarial attacks against a popular entailment classifier; (3) We train a prompt generator with our algorithm to achieve controllable generation from pretrained LMs in terms of topics. On all three tasks, our approach consistently improves over both task-specialized algorithms and other general RL methods such as PG. Finally, (4) we study the standard supervised tasks (E2E (Novikova et al., 2017), CommonGen (Lin et al., 2020)) where MLE prevails. We show that our approach is competitive for training text generation models from scratch, which was usually impossible for previous RL algorithms.

Figure 1: Schematic views of MLE, on-policy RL, and off-policy RL training. MLE requires clean training data for direct supervision. On-policy RL can suffer from low data efficiency due to gibberish text sampled from the current model being learned. Pure off-policy RL can also be sensitive to the quality of training data and lacks the exploration of on-policy RL.

2 Background

The goal of text generation is to produce coherent text $y = (y_1, \dots, y_T)$ of certain properties for a given task, where $y_t$ is a token from a vocabulary $\mathcal{V}$, and $T$ is the text length. The generation can condition on arbitrary input context, which we omit for simplicity of notation. We aim to learn a generation model $p_\theta(y)$ which is typically decomposed autoregressively as $p_\theta(y) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t})$, where $y_{<t} = (y_1, \dots, y_{t-1})$ is the prefix, and the distribution at each step $t$ is obtained by applying the softmax function on the output logits:

$$p_\theta(y_t \mid y_{<t}) = \frac{\exp f_\theta(y_t, y_{<t})}{\sum_{y' \in \mathcal{V}} \exp f_\theta(y', y_{<t})}. \tag{1}$$

Here $f_\theta(y_t, y_{<t})$ is the logit of token $y_t$ computed by the generation model.
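As a toy illustration of Eq.(1), the per-step distribution is simply a numerically stable softmax over the model's output logits. The sketch below assumes a plain NumPy vector of logits for a hypothetical 3-token vocabulary, not any particular model architecture.

```python
import numpy as np

def policy_from_logits(logits):
    """Turn the logits f_theta(., y_<t) over the vocabulary into the
    per-step distribution p_theta(y_t | y_<t) via a stable softmax (Eq.1)."""
    z = logits - np.max(logits)   # subtract the max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits for a 3-token vocabulary
p = policy_from_logits(logits)       # a valid distribution over the vocabulary
```

Subtracting the maximum logit before exponentiating leaves the distribution unchanged but avoids overflow for large logits.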

2.1 Maximum Likelihood Estimation (MLE)

Given a training example $y^*$, MLE trains the model by maximizing the data log-likelihood. More concretely, at every time step $t$, the ground-truth prefix $y^*_{<t}$ is given as input to the model. The output at the current time step is then compared against the corresponding ground-truth token $y^*_t$ via cross-entropy. The objective leads to the following update:

$$\nabla_\theta \mathcal{L}_{\text{MLE}}(\theta) = \nabla_\theta \sum\nolimits_{t=1}^{T} \log p_\theta(y^*_t \mid y^*_{<t}). \tag{2}$$
Despite its popularity, MLE-based training only applies when supervised data is available, and cannot be used to optimize arbitrary task metrics (e.g., BLEU, entailment score) which are typically the goal in many text generation tasks.
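In code, the teacher-forced MLE objective amounts to summing per-step cross-entropy terms. The sketch below is a minimal NumPy version; it assumes the per-step logits (computed from the ground-truth prefixes) are already given as input.

```python
import numpy as np

def mle_loss(step_logits, target_ids):
    """Negative log-likelihood of the ground-truth tokens: at each step t
    the logits are computed from the ground-truth prefix y*_<t (teacher
    forcing) and scored against the true token y*_t via cross-entropy."""
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        z = logits - logits.max()                 # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum())
        total -= log_probs[y]                     # cross-entropy term for step t
    return total

# A model that puts almost all mass on the correct tokens has near-zero loss.
confident = [np.array([10.0, 0.0, 0.0]), np.array([0.0, 10.0, 0.0])]
loss = mle_loss(confident, [0, 1])
```

Note that every step is scored only against the single observed token, which is exactly the "blind imitation" behavior contrasted with reward-driven training later in the paper.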

2.2 Reinforcement Learning (RL) Formulation for Text Generation

To formulate text generation as an RL problem, we consider the following finite-time Markov Decision Process (MDP). At each time step $t$, let the “state” be $s_t = y_{<t}$, namely the partial sequence generated so far. The model, also known as the “agent”, takes as input the current state $s_t$ and outputs a token, also called “action”, $a_t \in \mathcal{V}$ according to a policy $\pi(a_t \mid s_t)$. The agent then receives a reward $r_t = r(s_t, a_t)$ and (deterministically) transitions to the next state $s_{t+1} = (y_{<t}, a_t)$. Letting the trajectory be defined as $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, the agent’s objective is to maximize the accumulative reward,

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum\nolimits_{t=0}^{T} \gamma^t\, r_t\Big], \tag{3}$$

where $\gamma \in (0, 1]$ is the discount factor. In text generation, the reward signal is usually sparse, i.e., $r_t = 0$ for $t < T$, and the agent receives a non-zero reward $r_T$ only after it generates the full sequence. A central concept in RL is the state-action value function ($Q$-function) of policy $\pi$, defined as $Q^\pi(s_t, a_t) = \mathbb{E}_\pi\big[\sum\nolimits_{l \ge t} \gamma^{l-t} r_l \mid s_t, a_t\big]$, which is the expected future reward of taking action $a_t$ (i.e., generating token $y_t$) in state $s_t$ and continuing with the policy $\pi$. There are two major families of RL approaches to parameterizing and training the agent as below.

Policy-based RL

The first family is the policy-based techniques that directly parameterize the policy $\pi_\theta$ with parameters $\theta$. Thus the policy $\pi_\theta(a_t \mid s_t)$ exactly corresponds to the above generation model $p_\theta(y_t \mid y_{<t})$. To learn the parameters $\theta$, policy gradient (PG) is one of the most widely used algorithms for text generation (Ranzato et al., 2015). It optimizes the cumulative reward with the policy gradient:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum\nolimits_{t=0}^{T} \hat{Q}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big], \tag{4}$$

where $\hat{Q}(s_t, a_t)$ is the estimated $Q$ value with sample $\tau$. Notice that the expectation is taken w.r.t. the policy $\pi_\theta$, which makes PG an on-policy algorithm, meaning that the sample $\tau$ needs to come from the current policy itself. Intuitively, the update is analogous to maximizing the likelihood of the sampled sequences weighted by $\hat{Q}$.

In practice, however, optimizing this objective alone from scratch is unlikely to work, because most samples are simply gibberish with zero reward, failing to provide meaningful training signals for updating the policy. Previous literature either initializes the policy with MLE training, and/or uses a combination of MLE and PG updates, which often leads to marginal gains in practice Wu et al. (2018); Choshen et al. (2020).
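To make the on-policy mechanics concrete, the following toy sketch runs REINFORCE on a single-step "sequence" over a hypothetical 4-token vocabulary with a sparse reward (only token 2 is rewarded). It is a drastically simplified stand-in for Eq.(4), not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Policy "parameters": one logit per token of a hypothetical 4-token vocabulary.
theta = np.zeros(4)
for _ in range(500):
    pi = softmax(theta)
    a = rng.choice(4, p=pi)            # on-policy sample from pi_theta
    r = 1.0 if a == 2 else 0.0         # sparse reward, as in text generation
    # REINFORCE: grad of log pi(a) w.r.t. theta is onehot(a) - pi,
    # weighted by the return (Eq.4 with Q-hat = r).
    theta += 0.5 * (np.eye(4)[a] - pi) * r
```

The policy concentrates on the rewarded token, but note that samples with zero reward (the vast majority early on) contribute no learning signal at all, which illustrates the data-inefficiency problem at sequence scale.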

Value-based RL

The second family is the value-based techniques, such as $Q$-learning, that implicitly learn the policy by approximating the optimal value $Q^*(s, a)$ directly. Specifically, let $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ denote the optimal value over policies. The optimal policy $\pi^*$ then simply takes the action of maximal $Q^*$ value at each state, i.e., $\pi^*(a \mid s) = \mathbb{1}\big[a = \arg\max_{a'} Q^*(s, a')\big]$, where $\mathbb{1}[\cdot]$ is the indicator function that takes $1$ if its condition holds and $0$ otherwise. The approximation of $Q^*$ is based on the well-known Bellman temporal consistency:

$$Q^*(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}). \tag{5}$$
Recall that in the context of text generation, $s_{t+1} = (y_{<t}, y_t)$, i.e., the concatenation of the tokens in $s_t$ and the token $a_t = y_t$. Deep $Q$-learning (Mnih et al., 2013) parameterizes the $Q$-function as $Q_\theta(s, a)$ (e.g., a neural network), and trains the parameters $\theta$ by minimizing the following regression objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{\pi'}\Big[\tfrac{1}{2}\big(r_t + \gamma \max_{a_{t+1}} Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t)\big)^2\Big], \tag{6}$$

where $\bar{\theta}$ is the parameters of the target $Q$-network, which is a slow copy of $\theta$ and is considered constant for gradient computation of $\theta$. Here $\pi'$ is a behavior policy which can be an arbitrary distribution over text, such as the data distribution or a replay buffer (Mnih et al., 2013). This makes $Q$-learning an off-policy algorithm because of its ability to use samples coming from another policy. After learning $Q_\theta$, we can induce a policy from it as above, which takes $\arg\max_a Q_\theta(s, a)$ at each state $s$. Jaques et al. (2017) instead sample tokens from the softmax function applied to the $Q$-values.
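A single-transition version of the regression in Eq.(6) can be sketched as follows; `q_target_next` plays the role of the frozen target network's $Q$-values at $s_{t+1}$, and the bootstrapped target is treated as a constant.

```python
import numpy as np

def q_learning_loss(q_sa, r_t, q_target_next, gamma=1.0):
    """Deep Q-learning regression for one transition (Eq.6): pull
    Q_theta(s_t, a_t) toward r_t + gamma * max_a' Q_thetabar(s_{t+1}, a').
    The target uses the slow-moving target network and is held fixed
    during the gradient computation."""
    target = r_t + gamma * np.max(q_target_next)
    return 0.5 * (target - q_sa) ** 2

loss = q_learning_loss(q_sa=0.2, r_t=0.0,
                       q_target_next=np.array([0.1, 0.5, -0.3]))
# target = 0.5, so loss = 0.5 * (0.5 - 0.2)^2 = 0.045
```

Notice that only the scalar `q_sa` for the one taken action receives a gradient, which is the inefficiency point (2) discussed next.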

However, the training can be unstable and inefficient due to several challenges: (1) The bootstrapping nature of the above regression problem can make the training unstable. That is, the regression target is itself derived from the $Q$-function to be learned (Kumar et al., 2019). The problem is exacerbated in the presence of sparse reward in text generation, where the real observed signal $r_t$ is zero for all intermediate steps $t < T$; (2) The large action space (i.e., the whole vocabulary $\mathcal{V}$) in text generation results in slow updates. In particular, notice that Eq.(6) applies the gradient update to the $Q$-value of only the one particular token $a_t$ (out of all candidate tokens in the vocabulary), making the training inefficient; (3) Besides, pure off-policy updates can be highly sensitive to the quality of training data, and miss the opportunity of on-policy exploration that maximizes the reward of interest in a more direct way.

3 The Soft Q-Learning Framework

In this section, we combat the difficulties of previous RL methods by introducing the soft Q-learning (SQL) formulation of text generation. We show that the formulation is seamlessly compatible with the common architecture of text generation models (Eq.1), permitting easy implementation (§3.1). The formulation further allows us to integrate the latest advances in RL, in particular path consistency learning (Nachum et al., 2017), which makes the RL training efficient and stable in practice (§3.2).

3.1 Soft Q-Learning Formulation for Text Generation

Soft Q-learning (Haarnoja et al., 2017; Schulman et al., 2017; Nachum et al., 2017) is a maximum-entropy (MaxEnt) extension to the standard (hard) Q-learning Mnih et al. (2015); Sutton and Barto (2018). Under this framework, the agent is encouraged to optimize the reward while staying as stochastic as possible, with the following objective:

$$J_{\text{SQL}}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum\nolimits_{t=0}^{T} \gamma^t \big(r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big)\Big], \tag{7}$$

which augments the vanilla $J(\pi)$ in Eq.(3) with the additional Shannon entropy term $\mathcal{H}$ with coefficient $\alpha$. We can assume $\alpha = 1$ without loss of generality, as it can be folded into the reward function by scaling the latter with $1/\alpha$. This framework is appealing because it seamlessly connects the $Q$-values to the familiar output logits of a text generation model, which enables straightforward implementation of the SQL formulation.

Q-values as Generation Model Logits

We show the connection of the $Q$-values with the logits, i.e., the model outputs right before the softmax layer. Concretely, with the SQL objective in Eq.(7), the following relationship between the optimal policy $\pi^*$ and the optimal action-value $Q^*$ holds (Haarnoja et al., 2017; Schulman et al., 2017):

$$\pi^*(a \mid s) = \frac{\exp Q^*(s, a)}{\sum_{a'} \exp Q^*(s, a')}. \tag{8}$$

This form is highly reminiscent of the softmax layer of the generation model in Eq.(1). The connection suggests that we can naturally parameterize the $Q$-function in SQL as the generation model logit function, i.e., $Q_\theta(s, a) \equiv f_\theta(y_t, y_{<t})$. In other words, the model output $f_\theta(y_t, y_{<t})$, originally interpreted as the “logit” of token $y_t$ given the preceding tokens $y_{<t}$, is now re-interpreted as the $Q$-value of action $a_t = y_t$ in state $s_t = y_{<t}$. When achieving optimality, $f_\theta(y_t, y_{<t})$, namely $Q^*(s_t, a_t)$, represents the best possible future reward achievable by generating token $y_t$ in state $y_{<t}$. Similarly, the full generation model in Eq.(1) that applies softmax to $f_\theta$ now precisely corresponds to the policy induced from $Q_\theta(s, a)$. That is,

$$\pi_\theta(a \mid s) = \frac{\exp Q_\theta(s, a)}{\sum_{a'} \exp Q_\theta(s, a')}. \tag{9}$$
We can gain an even more intuitive and concise interpretation of the above generation policy through the lens of the advantage function (Sutton and Barto, 2018). Specifically, in SQL, the optimal state-value function $V^*$ is the log-normalizer of the optimal $Q$-values (Haarnoja et al., 2017; Schulman et al., 2017), i.e.,

$$V^*(s) = \log \sum\nolimits_{a} \exp Q^*(s, a). \tag{10}$$

This allows us to rewrite Eq.(8) into a more concise form:

$$\pi^*(a \mid s) = \exp\big(Q^*(s, a) - V^*(s)\big) = \exp A^*(s, a), \tag{11}$$

where $A^*(s, a) = Q^*(s, a) - V^*(s)$ is the optimal advantage function. The equation says that the optimal policy generates a token $a$ in state $s$ according to the token’s advantage.
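Eqs.(8)–(11) translate directly into a few lines of code: the soft value is a log-sum-exp of $Q$-values, and the induced policy is the exponentiated advantage, which coincides with the softmax over the logits. A small NumPy sketch:

```python
import numpy as np

def soft_value(q):
    """V(s) = log sum_a exp Q(s, a) (Eq.10), via the stable log-sum-exp trick."""
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

def induced_policy(q):
    """pi(a|s) = exp(Q(s,a) - V(s)) = exp(A(s,a)) (Eq.11), i.e., exactly the
    softmax over the Q-values / generation-model logits."""
    return np.exp(q - soft_value(q))

q = np.array([1.0, 2.0, 0.5])   # hypothetical Q-values over a 3-token vocabulary
pi = induced_policy(q)          # a valid distribution; advantages are <= 0
```

Since $V(s) \ge Q(s,a)$ for every $a$, the advantages are non-positive and exponentiating them yields valid probabilities.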

Figure 2: Soft Q-learning with path consistency learning (PCL) objectives, illustrated with a vocabulary of size 3. Left: Single-step objective (Eq.15), where for each $t$ the computation involves steps $t$ and $t+1$. Dashed boxes in dark green and gray indicate the regression target, where the intermediate reward $r_t$ is often 0 due to sparsity. The gradient is applied to parameters $\theta$ at step $t$ (indicated by orange color). Right: Multi-step objective (Eq.17), which aggregates from step $t$ all the way to $T$. In this way, the final-step non-zero reward $r_T$ is used as the regression target.

3.2 Learning the Q-function via Path Consistency Learning (PCL)

The above section has discussed parameterizing the $Q$-function with the common generation model with parameters $\theta$. Now we describe how to learn the $Q$-function within the SQL framework. Intuitively, learning is related to the credit assignment problem in text generation: given a sparse sequence-level reward $r_T$, how do we properly assign credits to the tokens (actions) taken along the way?

In the following, we first discuss the vanilla training method based on the “temporal consistency”, as in standard $Q$-learning (Eq.5). We then introduce a more efficient method based on another optimality property of “path consistency” (Nachum et al., 2017) that enables fast and effective updates for the $Q$-function given sparse reward.

3.2.1 Vanilla Training with Temporal Consistency

Much like the Bellman temporal consistency in standard $Q$-learning (Eq.5), in SQL the optimal action-value function follows the softmax form of the temporal consistency Ziebart et al. (2008); Ziebart (2010); Fox et al. (2016); Nachum et al. (2017):

$$Q^*(s_t, a_t) = r_t + \gamma \log \sum\nolimits_{a_{t+1}} \exp Q^*(s_{t+1}, a_{t+1}). \tag{12}$$

We thus can again derive a bootstrapping-like regression objective similar to that of standard $Q$-learning (Eq.6):

$$\mathcal{L}_{\text{SQL, vanilla}}(\theta) = \mathbb{E}_{\pi'}\Big[\tfrac{1}{2}\big(r_t + \gamma \log \sum\nolimits_{a_{t+1}} \exp Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t)\big)^2\Big]. \tag{13}$$
Recall that $\pi'$ is an arbitrary behavior policy (e.g., the data distribution), and $\bar{\theta}$ is the target $Q$-network, which is a slow copy of the $\theta$ to be learned and is held fixed during the gradient updates. However, the above objective is inefficient for exactly the same reasons as in standard $Q$-learning discussed earlier, namely the unstable per-step bootstrapping-style training with sparse reward signals, plus the slow updates w.r.t. only one token out of the large vocabulary (action space).

3.2.2 Efficient Training with Path Consistency

We instead derive a new gradient update rule following the unified path consistency learning (PCL) Nachum et al. (2017), which addresses the above two challenges. In particular, the PCL-based training updates the $Q$-values of all tokens at once through a connection between the value function and the induced policy. More specifically, it is shown in Nachum et al. (2017) that the optimal policy $\pi^*$ (Eq.8) and the optimal state value function $V^*$ (Eq.10) in SQL must satisfy the following consistency property for all states and actions:

$$V^*(s_t) - \gamma V^*(s_{t+1}) = r_t - \log \pi^*(a_t \mid s_t), \quad \forall s_t,\, a_t. \tag{14}$$

Accordingly, the PCL-based training attempts to encourage the satisfaction of the consistency with the following regression objective:

$$\mathcal{L}_{\text{SQL, PCL}}(\theta) = \mathbb{E}_{\pi'}\Big[\tfrac{1}{2}\big({-V_{\bar{\theta}}(s_t)} + \gamma V_{\bar{\theta}}(s_{t+1}) + r_t - \log \pi_\theta(a_t \mid s_t)\big)^2\Big], \tag{15}$$

where $\pi_\theta$ is the induced policy defined in Eq.(9), and $V_{\bar{\theta}}$ is defined similarly to Eq.(10) but depends on the target $Q$-network $Q_{\bar{\theta}}$. Please see Figure 2 (left) for an illustration. Crucially, notice that the gradient update is applied to $Q_\theta$ through the $\log \pi_\theta(a_t \mid s_t)$ term, whose normalizer explicitly involves the $Q$-values of all tokens in the vocabulary. This marks an important difference from the above vanilla training where $Q_\theta$ is updated only through the one particular token $a_t$. The PCL training thus offers more efficient updates for the $Q$-function.
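Under the stated definitions, the single-step PCL residual for one transition can be sketched as below. Here `q_t` holds the learned $Q$-values (logits) over the whole vocabulary at step $t$, while `q_target_t` / `q_target_next` are the corresponding target-network values; the names are illustrative, not from the paper.

```python
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def pcl_single_step_loss(q_t, a_t, r_t, q_target_t, q_target_next, gamma=1.0):
    """Single-step PCL regression (Eq.15). The residual is
    -V_bar(s_t) + gamma * V_bar(s_{t+1}) + r_t - log pi_theta(a_t | s_t);
    the log pi term's normalizer touches the Q-values of ALL tokens,
    unlike the single-token update of vanilla Q-learning."""
    log_pi = q_t[a_t] - logsumexp(q_t)     # induced policy, Eq.(9)
    v_bar_t = logsumexp(q_target_t)        # target-network soft values, Eq.(10)
    v_bar_next = logsumexp(q_target_next)
    residual = -v_bar_t + gamma * v_bar_next + r_t - log_pi
    return 0.5 * residual ** 2

# When Q satisfies the softmax temporal consistency (Eq.12), the loss is ~0:
q_next = np.array([0.0, 0.0])              # so V(s_{t+1}) = log 2
q_t = np.array([1.0 + np.log(2.0), 0.0])   # Q(s_t, a=0) = r_t + gamma*V(s_{t+1})
loss = pcl_single_step_loss(q_t, a_t=0, r_t=1.0,
                            q_target_t=q_t, q_target_next=q_next)
```

The worked example confirms the fixed point: a $Q$-function satisfying Eq.(12) (with the target network equal to the learned one) makes the residual vanish.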

Comparison with MLE Objective

Before moving on to a further extension of the training, it is interesting to take a closer look at the above objective and compare it with the common MLE training (§2.1). Specifically, we notice the following relations between the optimal $A^*$, $Q^*$, and $V^*$ functions: $A^*(s_t, a_t) = Q^*(s_t, a_t) - V^*(s_t) = r_t + \gamma V^*(s_{t+1}) - V^*(s_t)$, where the first equation is the definition of $A^*$ (see Eq.10) and the second equation is due to Eqs.(12) and (10). We thus can see the regression target in the above objective as an approximation to the advantage function: $\tilde{A}(s_t, a_t) = r_t + \gamma V_{\bar{\theta}}(s_{t+1}) - V_{\bar{\theta}}(s_t)$. Therefore, by optimizing the regression objective, $\log \pi_\theta(a_t \mid s_t)$, which is the log probability of generating token $y_t$ given the preceding tokens $y_{<t}$, is encouraged to match the approximate advantage value $\tilde{A}(s_t, a_t)$, no more and no less. This is different from the MLE objective, where the model is trained to (blindly) increase the probability of the observed token $y^*_t$ given $y^*_{<t}$ and decrease the probabilities of the rest.

Multi-step PCL for Sparse Reward

The above PCL objective (Eq.15) does not resolve the potential instability issue due to the bootstrapped value $V_{\bar{\theta}}(s_{t+1})$ and the sparse reward (i.e., $r_t = 0$ for $t < T$). To this end, we additionally incorporate the multi-step variant of the PCL training (Nachum et al., 2017). Specifically, by applying a telescoping sum on the consistency equation (Eq.14) starting from $t$ up to $T$, we arrive at the multi-step temporal consistency:

$$V^*(s_t) - \gamma^{T-t+1}\, V^*(s_{T+1}) = \sum\nolimits_{l=t}^{T} \gamma^{l-t}\big(r_l - \log \pi^*(a_l \mid s_l)\big), \tag{16}$$

where the value of the past-terminal state is zero, $V^*(s_{T+1}) = 0$, and the rewards are only available at the end, i.e., $r_T$ is the only non-zero reward. We can then come to the following multi-step objective function,

$$\mathcal{L}_{\text{SQL, PCL-ms}}(\theta) = \mathbb{E}_{\pi'}\Big[\tfrac{1}{2}\big({-V_{\bar{\theta}}(s_t)} + \sum\nolimits_{l=t}^{T} \gamma^{l-t}\big(r_l - \log \pi_\theta(a_l \mid s_l)\big)\big)^2\Big]. \tag{17}$$

We can see the objective side-steps the need to bootstrap intermediate value functions $V_{\bar{\theta}}(s_l)$ for $t < l \le T$. Instead, it directly uses the non-zero end reward $r_T$ to derive the update for $Q_\theta$. Please see Figure 2 (right) for an illustration.
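The multi-step objective replaces bootstrapped intermediate values with the telescoped sum of rewards and policy log-probabilities. A sketch of the inner term of Eq.(17) for one trajectory suffix, reusing the illustrative naming from before:

```python
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def pcl_multi_step_loss(qs, actions, rewards, q_targets, gamma=1.0, t=0):
    """Multi-step PCL (Eq.17): regress V_bar(s_t) against
    sum_{l=t..T} gamma^(l-t) * (r_l - log pi_theta(a_l | s_l)).
    Only the end reward r_T is non-zero, so no intermediate state
    values need to be bootstrapped."""
    v_bar_t = logsumexp(q_targets[t])
    acc = 0.0
    for l in range(t, len(actions)):
        log_pi = qs[l][actions[l]] - logsumexp(qs[l])   # induced policy, Eq.(9)
        acc += gamma ** (l - t) * (rewards[l] - log_pi)
    return 0.5 * (-v_bar_t + acc) ** 2

# Toy 2-step trajectory over a 2-token vocabulary, with sparse reward at the end.
qs = [np.zeros(2), np.zeros(2)]        # uniform policy: log pi = -log 2 per step
rewards = [0.0, 1.0]                   # r_T = 1, all intermediate rewards 0
loss = pcl_multi_step_loss(qs, actions=[0, 1], rewards=rewards,
                           q_targets=[np.zeros(2), np.zeros(2)])
```

Only `q_targets[t]` (the value at the starting step) appears; the remaining supervision comes straight from the end reward and the policy terms, as described above.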

In practice, we combine the single- and multi-step objectives (Eqs.15 and 17) together for training.

On- and Off-policy Training

Finally, we highlight that the behavior policy $\pi'$ involved in the objectives Eqs.(15) and (17) can be an arbitrary policy (i.e., distribution over text sequences), from which we can draw trajectories $\tau$ (i.e., text samples). For example, $\pi'$ can be a (possibly noisy) text dataset, or a set of text samples produced by other generation models, resulting in off-policy training. We can also set $\pi'$ to be the current generation model being learned, resulting in on-policy training. In practice, we usually first train the model with only off-policy data for warm-up, and then continue with joint on- and off-policy training to further maximize the reward.

Algorithm 1 summarizes the resulting SQL framework for efficient training of text generation (where we show joint on- and off-policy updates).

Algorithm 1: Efficient Soft Q-Learning for Text Generation
Input: $Q_\theta$ function (i.e., generation model logit function $f_\theta$ in Eq.1); reward function $r$; training examples (for off-policy updates; optional)
1:  Initialize $\theta$ and target model parameters $\bar{\theta}$
2:  repeat
3:     Draw a batch of off-policy samples
4:     Draw a batch of on-policy samples by decoding with policy $\pi_\theta$ (Eq.9)
5:     Compute $Q_\theta$ values and target values $V_{\bar{\theta}}$
6:     Compute the objectives in Eqs.(15) and (17)
7:     Update the model parameters $\theta$ via gradient descent
8:     Update the target model parameters $\bar{\theta} \leftarrow \rho\,\bar{\theta} + (1-\rho)\,\theta$ with update rate $\rho$
9:  until convergence
Output: The trained $Q_\theta$ function and the induced generator $\pi_\theta$
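The target update on line 8 is a standard Polyak (exponential moving) average. A minimal sketch, with a hypothetical update rate `rho` (the actual value is left to the experimental configuration):

```python
import numpy as np

def polyak_update(theta_bar, theta, rho=0.999):
    """Slow-moving target parameters: theta_bar <- rho*theta_bar + (1-rho)*theta.
    A large rho keeps the regression targets in Eqs.(15)/(17) stable
    while theta changes quickly during gradient descent."""
    return rho * theta_bar + (1.0 - rho) * theta

theta = np.array([1.0, -2.0])
theta_bar = np.zeros(2)
theta_bar = polyak_update(theta_bar, theta, rho=0.9)
# theta_bar has moved 10% of the way toward theta
```

Each training step applies this after the gradient update, so the target network trails the learned parameters smoothly rather than being copied abruptly.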

4 Applications and Experiments

We show broad applications of the general RL framework to a variety of problems where no clean supervision data is available, including learning with noisy or even negative data (§4.1), generating adversarial text attacks (§4.2), and generating prompts to steer pretrained LMs (§4.3). Our approach shows substantial improvement over both previous RL algorithms and the specialized methods specific to each individual task. We also study the performance on standard supervised generation tasks (§4.4) and show that our RL algorithm can train models from scratch in a stable way, achieving results competitive with MLE training. For each of the experiments, we provide detailed configurations in the appendix.

4.1 Learning from Noisy (Negative) Text

The popular MLE algorithm learns by (blindly) imitating training data, and thus often requires high-quality training examples. However, for text generation tasks with an enormous output space, it is often expensive to curate clean quality data. It is thus highly desirable to be able to learn from data with noises, or even negative examples. With the guidance of task metrics (rewards), the model can even learn to “outperform” the training data and achieve desired generation behaviors.

Figure 3: Entailment generation performance plotted against diversity (average of the unigram and bigram entropies $H_1$ and $H_2$). Entailment rate is the percentage of samples classified as “entailment” by the entailment classifier. Circles represent results of top-$p$ sample outputs (obtained by varying the $p$ value), and triangles represent results of beam-search outputs.

To this end, we consider the task of entailment generation Pasunuru and Bansal (2017). Given a sentence (premise), the goal of the task is to generate a new sentence (hypothesis) that logically follows the premise. For example, given the source sentence ‘‘Sophie is walking a dog outside her house’’, the hypotheses ‘‘Sophie is outdoor’’ and ‘‘Sophie is walking a dog’’ are considered entailed, but ‘‘Sophie is inside her house’’ is not entailed and is in fact a negative (contradictory) sentence.


We study using the SNLI dataset Bowman et al. (2015), a dataset commonly used in training entailment classifiers. The original dataset contains (premise, hypothesis) sentence pairs, where the hypothesis may or may not entail the premise. We sub-sampled training examples from the corpus such that the hypotheses have a low average entailment probability in terms of the premises, and a substantial portion of the examples have entailment probabilities low enough to be seen as negative (contradictory) examples. The resulting training set poses a significant challenge for the models to learn from the noise. We present more details of the dataset in the appendix.

Baselines and Setup.

The entailment generation model takes as input a premise and generates a hypothesis. We compare our training approach with several baselines, including (1) the standard MLE training (MLE), (2) PG with MLE initialization (MLE+PG), and (3) one of the latest text-generation RL algorithms, GOLD Pang and He (2021), a pure off-policy method based on importance-sampled PG. To ablate the effect of multi-step training (§3.2), we additionally compare with a simplified variant of our approach that uses only the vanilla single-step PCL training (SQL (single)).

The RL algorithms (including PG and ours) permit us to plug in arbitrary reward functions to drive learning. Based on the goal of the task, we use the following intuitive rewards to ensure entailment accuracy and language quality: (1) a robust entailment classifier (Nie et al., 2020) that measures the entailment score of a generation in terms of the input premise, (2) a GPT-2 language model Radford et al. (2019) that measures the log-likelihood of the generation as an indicator of language quality, and (3) the BLEU score w.r.t. the input premise as another language-quality reward that penalizes trivial outputs. We sum together all rewards with fixed weights. For all experiments in this and the following sections, we use a transformer model Vaswani et al. (2017) based on Texar-PyTorch Hu et al. (2019) by default (see the appendix for the hidden dimension, number of blocks, and number of heads).
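The composite reward can be implemented as a plain weighted sum. The function below is a hypothetical sketch: the weight values and score scales are illustrative placeholders, since the paper's actual settings are deferred to the appendix.

```python
def combined_reward(entail_score, lm_logprob, bleu, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task rewards described above: entailment
    classifier score, LM log-likelihood (language quality), and BLEU
    w.r.t. the input premise. Weights here are illustrative placeholders."""
    w_ent, w_lm, w_bleu = weights
    return w_ent * entail_score + w_lm * lm_logprob + w_bleu * bleu

r = combined_reward(entail_score=0.9, lm_logprob=-0.5, bleu=0.3)
# r = 0.9 - 0.5 + 0.3 = 0.7
```

Because the RL objectives only consume a scalar reward, any such composition (including black-box scorers) can be plugged in without changing the training code.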


We evaluate generation results in terms of entailment rate, language quality (perplexity), and diversity, which is measured by the Shannon entropy over unigrams and bigrams ($H_1$, $H_2$) (Gehrmann et al., 2021). Since text generation models intrinsically trade off diversity and quality (Caccia et al., 2019; Hashimoto et al., 2019), we vary the generation diversity by generating samples via top-$p$ sampling (Holtzman et al., 2019) with different $p$ values, and plot the entailment rate and perplexity against diversity, respectively. We also evaluate the samples produced by beam-search decoding.
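The diversity measure can be sketched as the Shannon entropy of the empirical n-gram distribution over a set of generated samples ($H_1$ for unigrams, $H_2$ for bigrams). A minimal version, using the natural log (the entropy base is a convention):

```python
import math
from collections import Counter

def ngram_entropy(samples, n):
    """Shannon entropy of the empirical n-gram distribution over generated
    samples (each a list of tokens). Higher entropy = more diverse output."""
    counts = Counter()
    for tokens in samples:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

samples = [["a", "b", "a"], ["a", "b", "c"]]
h1 = ngram_entropy(samples, 1)   # unigram entropy
h2 = ngram_entropy(samples, 2)   # bigram entropy
diversity = 0.5 * (h1 + h2)      # the averaged diversity axis in Figures 3-4
```

A collapsed model that emits the same few strings yields near-zero entropy, which is exactly the failure mode visible for MLE+PG in the attack experiments below.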

Figure 3 shows the results. First, notice that MLE performs poorly. This is not surprising, as the training data contain noisy/negative examples. Similarly, since the pure off-policy algorithm GOLD relies heavily on the data distribution, we observe that it achieves sub-optimal performance. The on-policy PG with MLE initialization gives a better entailment rate. In comparison, our full SQL framework achieves the best entailment-diversity trade-off. The comparison between SQL and SQL (single) highlights the importance of the multi-step objective, which directly uses the end reward rather than bootstrapping intermediate values for supervision.

Figure 4: Entailment attack performance against diversity (average of $H_1$ and $H_2$). Only a few MLE+PG dots are visible because the model is not able to generate more diverse samples even with an increasing $p$ value in top-$p$ decoding, i.e., the model collapses.

4.2 Black-box Universal Adversarial Attacks

We next study the application of our approach to a very different problem, namely generating text adversarial attacks, where again no supervised data is available. Adversarial attacks are an increasingly important research topic as they reveal models’ vulnerabilities and flaws. This is especially true for universal attacks Wallace et al. (2019); Atanasova et al. (2020), where we want to generate universal examples that trick the model on all possible inputs. For instance, consider the context of entailment classification, where the classifier takes as input a premise sentence and a hypothesis sentence, and predicts the probability that the hypothesis entails the premise. Our goal is to find universal human-readable hypotheses that are classified as “entailment” with as high probability as possible, regardless of the input premise. This is a more challenging setting compared to the previous instance-specific attacks Morris et al. (2020); Jin et al. (2020); Ebrahimi et al. (2017), where the attack model conditions on a premise and generates an adversarial hypothesis specific to that premise.

Dataset, Baselines, and Setup.

We study the task of attacking an entailment classifier. In particular, we aim to attack one of the most popular entailment classifiers on the HuggingFace model hub (https://github.com/pytorch/fairseq/tree/master/examples/roberta), which is ranked #1 as of May 20, 2021 based on https://huggingface.co/models?search=nli. The attack generation model generates adversarial text without conditioning on any inputs, so that the generated attacks are universal to all premises. The generation model is trained with mostly the same setting as in §4.1, where the entailment classifier to be attacked is used as the entailment score reward function. Besides, we additionally include a token-level repetition penalty reward, which empirically benefits readability. Finally, we use the MultiNLI dataset Williams et al. (2018), which includes more diverse examples than the SNLI used above.

We compare our SQL with MLE+PG. We use all hypotheses in the MultiNLI dataset as the training data for the MLE training in MLE+PG and the off-policy updates for our SQL. We do not compare with previous specialized adversarial text attack methods, because they either are not applicable to the universal attack setting Morris et al. (2020); Jin et al. (2020); Ebrahimi et al. (2017), or were not designed to generate human-readable sentences (Wallace et al., 2019). Besides, it is worth noting that the general RL algorithms have the additional advantage of enabling black-box attacks. That is, the algorithms only require the ability to query the entailment classifier for the entailment probability, without needing to know the internal structure of the classifier (e.g., for computing gradients) as in previous attack algorithms (Ebrahimi et al., 2017; Wallace et al., 2019).

Model       | Generation                                                            | Rate (%)
MLE+PG      | it ’s .                                                               | 90.48
SQL (ours)  | the person saint-pierre-et-saint-paul is saint-pierre-et-saint-paul . | 97.40
Table 1: Entailment attack samples and their respective entailment rates across all test premises. For example, the adversarial sample by SQL is considered to entail 97.40% of test premises by the entailment classifier.

To explore the diversity-quality trade-off as in §4.1, we similarly generate samples from models using various k values in top-k decoding, and plot the entailment rate and perplexity against diversity, respectively. Figure 4 shows the results. We can see that SQL outperforms MLE+PG consistently across different diversity values. The outputs from MLE+PG are not diverse even with high k's, indicating the model collapses and can generate only a small set of unique adversarial examples. Table 1 shows the generated sample from each method with the highest entailment rate. The model trained by SQL discovers the pattern “saint-pierre-et-saint-paul” (an entity name) and exploits it to generate samples with a high universal entailment rate.
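Top-k decoding, the knob used above to trade quality against diversity, can be sketched as follows. This is a minimal single-step illustration; production decoders operate on logit tensors and repeat this per generation step.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`.

    Keeps the top-k logits, renormalizes their softmax weights, and samples.
    Small k concentrates on high-probability tokens (quality); large k
    spreads mass over more tokens (diversity).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i]) for i in top]
    total = sum(weights)
    r, acc = rng.random() * total, 0.0
    for i, w in zip(top, weights):
        acc += w
        if r <= acc:
            return i
    return top[-1]  # numerical safety fallback
```

With k=1 this reduces to greedy decoding; sweeping k traces out the diversity axis of Figure 4.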

4.3 Prompt Generation for Controlling Pretrained Language Models

Figure 5: The scheme of prompt generation for controlling the outputs of pretrained LMs. Conditioning on a topic (e.g., ‘‘science’’), the prompt generator automatically produces a short piece of text (i.e., a prompt) such that, by prepending the prompt to the input text, the pretrained LM will generate continuation sentences of the particular topic. We aim to learn the prompt generator's parameters. The subsequent components, including the pretrained LM and evaluation functions such as the topic classifier, serve as reward functions to train the prompt generator. Discrete steps in the pipeline are highlighted in red, including the prompt produced by the generator and the continuation sentences produced by the pretrained LM. These discrete steps make previous gradient-based prompt tuning approaches inapplicable here.
Figure 6: Average topic accuracy.
Table 2: Language perplexity results averaged across topics, for PPLM, GeDi, MLE (5), SQL (off, 5), MLE+PG (5/10/15), and SQL (5/10/15, ours). The lower, the more fluent the generated continuation sentences.
Figure 7: Average time cost (in seconds) for generating a topic-specific sentence with PPLM, GeDi, and SQL.

The ability to optimize an arbitrary black-box reward has a broader implication. In particular, a reward function does not just have to be a metric like the BLEU score. It can also be a composition of multiple functions that eventually return a score.
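For instance, a composite reward can simply be a weighted sum of component scorers. This is a generic sketch; the weights and component functions below are illustrative, not the paper's exact configuration.

```python
from typing import Callable, Sequence, Tuple

def compose_rewards(
    weighted_fns: Sequence[Tuple[float, Callable[[str], float]]],
) -> Callable[[str], float]:
    """Combine several (weight, scorer) pairs into a single scalar reward.

    Each scorer maps generated text to a score; the composite reward is
    their weighted sum, so any black-box function can participate.
    """
    def reward(text: str) -> float:
        return sum(w * fn(text) for w, fn in weighted_fns)
    return reward
```

The RL training loop only ever sees the final scalar, so the components themselves (a classifier, an LM likelihood, a heuristic penalty) need not be differentiable.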

To demonstrate this, we consider the task of prompting a large pretrained LM Radford et al. (2019); Brown et al. (2020) for controllable generation (Reiter and Dale, 1997; Hu et al., 2017). The goal is to learn to generate text prompts that steer the LM to generate sentences of certain desired attributes (e.g., topics). The problem of controlling the generation of pretrained LMs was previously approached through specialized algorithms such as modifying the LM hidden states during decoding Dathathri et al. (2020); Krause et al. (2020); Qin et al. (2020). Here we show that prompts offer an easier, faster, and more effective way to control generation.

Learning to automatically generate or tune prompts has gained increasing attention since the advent of massive pretrained LMs (Brown et al., 2020). Most existing approaches Wallace et al. (2019); Li and Liang (2021); Lester et al. (2021) rely on gradient backpropagation and are applicable only when the whole training pipeline is differentiable. This differentiability does not hold for the text generation setting, as illustrated in Figure 5. In contrast, the RL framework is generally applicable to any differentiable or discrete pipeline.


Following Dathathri et al. (2019), we aim to control the generation to have one of 7 topics (e.g., “science”); the generated prompt is prepended to one of 20 input sentences (Figure 5) for the pretrained LM to generate continuation sentences. There is no direct supervision data available for training the prompt generator. We randomly create noisy text as the training data for the MLE baselines below and for the off-policy updates of our algorithm. Specifically, the noisy text is created by sampling keywords and topics from the list used in Dathathri et al. (2020) and running them through a paraphrase generation model.

Baselines and Setup.

Figure 5 shows the architecture of prompt-based controllable generation. We compare our SQL method with MLE+PG as before. At training time, for each generated prompt sample, the pretrained LM generates 2 continuation sentences for evaluating average reward. We use a zero-shot classifier to evaluate the topic accuracy of the continuation sentences (see appendix). That is, we do not assume access to classifiers pretrained on topic-specific sentences, because generating such topic-specific sentences is the goal of the task in the first place. We additionally use an LM to evaluate the log-likelihood of continuation sentences for measuring language quality. Since the prompt length could impact the generated sentences, we conducted experiments with maximum prompt lengths 5, 10, and 15. As an ablation study, we also evaluate the SQL algorithm with only off-policy updates (i.e., without on-policy exploration), denoted as SQL (off), and compare it with vanilla MLE training. At test time, given a topic, the trained prompt generator produces one prompt using beam search decoding. For each generated prompt, the pretrained LM generates 100 sentences using top-k decoding for evaluation. Finally, we also compare with two specialized controllable generation techniques based on pretrained LMs, namely PPLM Dathathri et al. (2019) and GeDi Krause et al. (2020), following similar procedures using their open-sourced code. We use a distilled GPT-2 model (https://huggingface.co/distilgpt2) as the pretrained LM to be controlled.
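The reward computation in this setup can be sketched as follows. `generate_continuations` and `topic_prob` are hypothetical stand-ins for the pretrained LM and the zero-shot topic classifier; the real pipeline wraps actual models behind these interfaces.

```python
from typing import Callable, Sequence

def prompt_reward(
    prompt: str,
    inputs: Sequence[str],
    topic: str,
    generate_continuations: Callable[[str], Sequence[str]],  # pretrained LM
    topic_prob: Callable[[str, str], float],  # zero-shot topic classifier
) -> float:
    """Average topic score of LM continuations produced from a prompt.

    The prompt is prepended to each input sentence; the LM's continuations
    are scored by the classifier, and the mean score is the reward.
    """
    scores = []
    for x in inputs:
        for sentence in generate_continuations(prompt + " " + x):
            scores.append(topic_prob(sentence, topic))
    return sum(scores) / len(scores)
```

Note that both the LM's sampling and the classifier query are discrete/black-box steps, which is why this reward can drive RL training but not gradient-based prompt tuning.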


Figure 6 shows the topic accuracy of the controlled LM outputs averaged across the 7 topics, and Table 2 shows the respective language quality results. More detailed topic accuracy results are provided in the appendix (where GeDi obtained low accuracy on 2 of the 7 topics, possibly because the topic tokens are tokenized into two subwords for which the model released by the authors was not specifically trained). We can see that the prompts generated by our SQL cause the LM to generate sentences with high topic accuracy while maintaining low perplexity in most settings. Increasing the prompt length positively impacts the topic accuracy, which makes sense because longer prompts give more flexibility for steering the LM. The comparison between MLE and SQL (off) shows that the off-policy component of SQL is better than standard MLE training, as it incorporates reward signals instead of just blindly following the (noisy) data.

Finally, the comparison with previous steered-decoding methods such as PPLM and GeDi shows the advantage of prompt-based control trained with RL, which achieves a better trade-off between topic accuracy and language quality. Moreover, once a prompt is produced, we can use the pretrained LM to generate text of desired topics efficiently, with the same time cost as standard non-controlled decoding. In comparison, the dedicated steered decoding is often orders of magnitude slower, as shown in Figure 7.

4.4 Supervised Text Generation Tasks

Finally, we conduct experiments on standard generation tasks where clean supervised data is available. The study examines the capability of the proposed RL method to train a text generation model from scratch, which has been considered exceedingly challenging for previous RL algorithms.

Table 3: BLEU results on the E2E val/test sets, comparing MLE, PG, MLE+PG, and SQL (ours).
Datasets, Baselines, and Setup.

We study two tasks, E2E (Novikova et al., 2017) and CommonGen (Lin et al., 2020), and use the respective datasets pre-processed by Gehrmann et al. (2021), which allow sequence-to-sequence modeling with standard transformers. We run four sets of methods: the standard MLE training (MLE); PG training from scratch (PG); joint MLE and PG training with MLE initialization (MLE+PG); and our SQL training from scratch with both off-policy and on-policy updates (SQL). We use the standard BLEU as reward. We additionally investigate the training stability and sensitivity w.r.t. hyperparameters, in particular the scale of reward. To this end, for MLE+PG and SQL, we vary the reward scale over a range of values and evaluate the respective performance under each scale.
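Using a sequence-level metric such as BLEU as the reward can be sketched as below. This is a simplified single-reference BLEU with uniform n-gram weights, for illustration only; the experiments use a standard BLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty, against a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            precisions.append(0.0)
            continue
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in Counter(hyp_ngrams).items())
        precisions.append(clipped / len(hyp_ngrams))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_avg)
```

The reward-scale experiments then amount to multiplying this score by a constant before it enters the RL update.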

Figure 8: Training curves on validation sets. Left: Training curves on E2E with best hyperparameter configurations. Middle: Training curves on E2E with varying reward scale. Right: Training curves on CommonGen with varying reward scale.

Table 3 shows the performance on E2E of the different models, whose hyperparameters are picked using the validation set. We can see the proposed SQL, which trains models from scratch, achieves results competitive with the common MLE and MLE+PG. In contrast, the PG algorithm alone, without MLE, fails to train. Figure 8 (left) shows the respective training curves (on the validation set), demonstrating that SQL converges as efficiently and stably as MLE.

We further demonstrate the sensitivity of MLE+PG and SQL w.r.t. the reward scale as a key hyperparameter. Figure 8 (middle and right) shows the training curves of the two methods with varying reward scales. We can see SQL is significantly more robust as the reward scale changes, while MLE+PG tends to collapse under improper reward scale configurations.

5 Related Work

Standard RL algorithms, such as Q-learning Sutton and Barto (2018), aim to find the best way to solve a given task, but the training can be over-sensitive to randomness in the environment. Recent work has considered maximum-entropy RL (MaxEnt RL) extensions, which optimize policies to maximize both the expected reward and the expected entropy of the policy. Previous work in robotics and game control demonstrated that this formulation provides a substantial improvement in exploration and robustness Ziebart et al. (2008); Todorov (2008); Toussaint (2009); Ziebart (2010); Rawlik et al. (2013); O’Donoghue et al. (2017); Haarnoja et al. (2017); Nachum et al. (2017); Schulman et al. (2017); Nachum et al. (2018); Eysenbach and Levine (2021). One of the most prominent examples is soft Q-learning (SQL) Haarnoja et al. (2017); Nachum et al. (2017); Schulman et al. (2017), which modifies the Q-learning objective to optimize not only the total reward but also the entropy (diversity) of the induced policy. In this work, we leverage one of the latest advances related to SQL, namely path consistency learning Nachum et al. (2017).
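Concretely, the MaxEnt RL formulation referenced here can be written in standard textbook form (temperature α, discount γ; these are the usual forms from the cited works, not necessarily the exact equations of this paper):

```latex
% MaxEnt RL objective: expected return plus an entropy bonus
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) \;+\; \alpha\,
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]

% The optimal policy is a softmax over soft Q-values:
\pi^{*}(a \mid s) \;\propto\; \exp\!\big(Q^{*}(s, a)/\alpha\big)

% One-step path consistency (Nachum et al., 2017) relates the optimal
% values and optimal policy along any observed transition:
V^{*}(s_t) - \gamma\, V^{*}(s_{t+1}) \;=\; r(s_t, a_t) \;-\; \alpha \log \pi^{*}(a_t \mid s_t)
```

The path consistency condition holds for transitions from any behavior policy, which is what enables mixing on- and off-policy updates.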

Applying RL to text generation has been discussed with the goals of alleviating the exposure bias problem and directly optimizing task metrics (Ranzato et al., 2015; Li et al., 2016; Wu et al., 2016; Rennie et al., 2017; Paulus et al., 2018; Chen and Bansal, 2018). For example, Ranzato et al. (2015) used the REINFORCE algorithm Williams (1992), and Bahdanau et al. (2016) used the actor-critic algorithm. Both are on-policy RL algorithms that need to pretrain their models using MLE. Ding and Soricut (2017) proposed a cold-start softmax policy gradient (SPG) that does not rely on MLE pretraining but requires various dedicated techniques for effective training (e.g., token-level decomposition of the sequence reward). Tan et al. (2018) proposed an entropy-regularized policy optimization (ERPO) formulation that subsumes many of the previous text generation training algorithms, ranging from MLE to cold-start SPG, as special cases. Our proposed framework offers solutions for efficient training from scratch in the presence of the large action space and sparse sequence-level reward of text generation.

6 Conclusion

We have developed a new RL formulation for text generation based on soft Q-learning and path consistency learning. The proposed method combines off- and on-policy updates, and uses multi-step returns to alleviate the issues with sparse sequence-level rewards. We conduct four sets of experiments covering a wide range of applications: learning with noisy and negative data, black-box adversarial attack, prompting a pretrained language model for controllable generation, and standard supervised tasks, and show strong performance throughout. The RL formulation opens up enormous new opportunities to integrate advances from the fertile RL literature to improve text and other sequence generation problems.


  • P. Atanasova, D. Wright, and I. Augenstein (2020) Generating label cohesive and well-formed adversarial claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3168–3177. Cited by: §1, §4.2.
  • D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086. Cited by: §1, §5.
  • S. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §A.1, §4.1.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: §1, §4.3, §4.3.
  • M. Caccia, L. Caccia, W. Fedus, H. Larochelle, J. Pineau, and L. Charlin (2019) Language GANs falling short. In International Conference on Learning Representations, Cited by: §4.1.
  • Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 675–686. Cited by: §5.
  • L. Choshen, L. Fox, Z. Aizenbud, and O. Abend (2020) On the weaknesses of reinforcement learning for neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.2.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2019) Plug and play language models: a simple approach to controlled text generation. In International Conference on Learning Representations, Cited by: §4.3, §4.3.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2020) Plug and play language models: a simple approach to controlled text generation. In International Conference on Learning Representations, External Links: Link Cited by: §4.3, §4.3.
  • N. Ding and R. Soricut (2017) Cold-start reinforcement learning with softmax policy gradient. arXiv preprint arXiv:1709.09346. Cited by: §5.
  • O. Dušek, D. M. Howcroft, and V. Rieser (2019) Semantic noise matters for neural natural language generation. In Proceedings of the 12th International Conference on Natural Language Generation, pp. 421–426. Cited by: §A.1.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2017) Hotflip: white-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751. Cited by: §4.2, §4.2.
  • B. Eysenbach and S. Levine (2021) Maximum entropy rl (provably) solves some robust rl problems. arXiv preprint arXiv:2103.06257. Cited by: §5.
  • R. Fox, A. Pakman, and N. Tishby (2016) Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. Cited by: §3.2.1.
  • S. Gehrmann, T. Adewumi, K. Aggarwal, P. S. Ammanamanchi, A. Anuoluwapo, A. Bosselut, K. R. Chandu, M. Clinciu, D. Das, K. D. Dhole, et al. (2021) The gem benchmark: natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672. Cited by: §A.1, §4.1, §4.4.
  • H. Guo (2015) Generating text with deep reinforcement learning. arXiv preprint arXiv:1510.09202. Cited by: §1.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361. Cited by: §1, §3.1, §3.1, §3.1, §5.
  • T. Hashimoto, H. Zhang, and P. Liang (2019) Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1689–1701. Cited by: §4.1.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. In International Conference on Learning Representations, Cited by: §4.1.
  • Z. Hu, H. Shi, B. Tan, W. Wang, Z. Yang, T. Zhao, J. He, L. Qin, D. Wang, X. Ma, et al. (2019) Texar: a modularized, versatile, and extensible toolkit for text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 159–164. Cited by: §4.1.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. Xing (2017) Toward controlled generation of text. In International Conference on Machine Learning (ICML), Cited by: §4.3.
  • N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck (2017) Sequence tutor: conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pp. 1645–1654. Cited by: §2.2.
  • N. Jaques, J. H. Shen, A. Ghandeharioun, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard (2020) Human-centric dialog training via offline reinforcement learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3985–4003. Cited by: §1.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2020) Is bert really robust? a strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 8018–8025. Cited by: §4.2, §4.2.
  • K. Kandasamy, Y. Bachrach, R. Tomioka, D. Tarlow, and D. Carter (2017) Batch policy gradient methods for improving neural conversation models. In ICLR, Cited by: §1.
  • B. Krause, A. D. Gotmare, B. McCann, N. S. Keskar, S. Joty, R. Socher, and N. F. Rajani (2020) Gedi: generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367. Cited by: §4.3, §4.3.
  • A. Kumar, J. Fu, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In NeurIPS, Cited by: §2.2.
  • B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691. Cited by: §4.3.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Cited by: §A.1, §1.
  • J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016) Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. Cited by: §1, §5.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190. Cited by: §4.3.
  • B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020) CommonGen: a constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1823–1840. External Links: Link, Document Cited by: §A.1, §1, §4.4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §3.1.
  • J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi (2020) TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in nlp. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 119–126. Cited by: §4.2, §4.2.
  • O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans (2017) Bridging the gap between value and policy based reinforcement learning. In NIPS, Cited by: §1, §3.1, §3.2.1, §3.2.2, §3.2.2, §3.2, §3, §5.
  • O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans (2018) Trust-PCL: an off-policy trust region method for continuous control. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • K. Narasimhan, T. Kulkarni, and R. Barzilay (2015) Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1–11. Cited by: §1.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2020) Adversarial nli: a new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901. Cited by: §A.1, §4.1.
  • J. Novikova, O. Dušek, and V. Rieser (2017) The e2e dataset: new challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201–206. Cited by: §A.1, §1, §4.4.
  • B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih (2017) Combining policy gradient and q-learning. In ICLR, Cited by: §5.
  • R. Y. Pang and H. He (2021) Text generation by learning from demonstrations. In International Conference on Learning Representations, External Links: Link Cited by: Table 4, §1, §4.1.
  • R. Pasunuru and M. Bansal (2017) Multi-task video captioning with video and entailment generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1273–1283. Cited by: §4.1.
  • R. Pasunuru and M. Bansal (2018) Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 646–653. Cited by: §1.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
  • L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. Le Bras, A. Bosselut, and Y. Choi (2020) Backpropagation-based decoding for unsupervised counterfactual and abductive reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 794–805. Cited by: §4.3.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §A.1, §1, §4.1, §4.3.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §1, §2.2, §5.
  • K. Rawlik, M. Toussaint, and S. Vijayakumar (2013) On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-third international joint conference on artificial intelligence, Cited by: §5.
  • E. Reiter and R. Dale (1997) Building applied natural language generation systems. Natural Language Engineering 3 (1), pp. 57–87. Cited by: §4.3.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §1, §5.
  • J. Schulman, X. Chen, and P. Abbeel (2017) Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440. Cited by: §1, §3.1, §3.1, §3.1, §5.
  • T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) AutoPrompt: eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §3.1, §3.1, §5.
  • B. Tan, Z. Hu, Z. Yang, R. Salakhutdinov, and E. Xing (2018) Connecting the dots between mle and rl for sequence prediction. arXiv preprint arXiv:1811.09740. Cited by: §1, §5.
  • E. Todorov (2008) General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. Cited by: §5.
  • M. Toussaint (2009) Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30, pp. 5998–6008. Cited by: §4.1.
  • E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh (2019) Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2153–2162. Cited by: §1, §4.2, §4.2, §4.3.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §A.1, §4.2.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §1, §5.
  • L. Wu, F. Tian, T. Qin, J. Lai, and T. Liu (2018) A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3612–3621. Cited by: §1, §2.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §5.
  • W. Yin, J. Hay, and D. Roth (2019) Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3905–3914. Cited by: §1.
  • J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu (2019) PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. External Links: 1912.08777 Cited by: §A.1.
  • R. Zhong, K. Lee, Z. Zhang, and D. Klein (2021) Meta-tuning language models to answer prompts better. arXiv preprint arXiv:2104.04670. Cited by: §1.
  • L. Zhou, K. Small, O. Rokhlenko, and C. Elkan (2017) End-to-end offline goal-oriented dialog policy learning via policy gradient. arXiv preprint arXiv:1712.02838. Cited by: §1.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning.. In Aaai, Vol. 8, pp. 1433–1438. Cited by: §3.2.1, §5.
  • B. D. Ziebart (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §3.2.1, §5.

Appendix A Appendix

a.1 Setup Details


We use four datasets: E2E [Novikova et al., 2017, Dušek et al., 2019], CommonGen [Lin et al., 2020], SNLI [Bowman et al., 2015], and MultiNLI [Williams et al., 2018]. Our evaluation follows the GEM Benchmark Gehrmann et al. [2021] when applicable (https://github.com/GEM-benchmark/GEM-metrics), and is otherwise the same as the reward function used in training. SNLI is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Please see Williams et al. [2018] for MultiNLI licensing information, and https://gem-benchmark.com/data_cards/E2E and https://gem-benchmark.com/data_cards/CommonGen for E2E and CommonGen licensing information.

Reward Functions

We use the robust entailment classifier of Nie et al. [2020] in §4.1 (https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli); one of the most used entailment classifiers on HuggingFaceHub in §4.2 (https://github.com/pytorch/fairseq/tree/master/examples/roberta; this classifier is ranked #1 as of May 20, 2021 based on https://huggingface.co/models?search=nli); and a zero-shot classifier based on BART Lewis et al. [2020] to compute the topic score in §4.3 (https://huggingface.co/facebook/bart-large-mnli). To compute perplexities, we use a GPT-2 model Radford et al. [2019] fine-tuned on the corresponding datasets in §4.1 and §4.2, and a distilled GPT-2 model (https://huggingface.co/distilgpt2) in §4.3 without fine-tuning. We use simple fixed reward weights throughout, except in §4.2, where the entailment weight and the log-likelihood and repetition-penalty weights are set separately.
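Perplexity from token log-probabilities follows the standard definition; the minimal helper below assumes the natural-log token probabilities have already been obtained from the LM (extracting them from GPT-2 is model-specific and omitted here).

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, a sequence whose tokens each receive probability 0.5 has perplexity 2.0 regardless of length.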


For experiments that involve policy gradient training, we initialize the model with maximum likelihood training by default unless specified otherwise. We train the soft Q-learning model from scratch with both off-policy (using data) and on-policy (using samples) updates by default, except in §4.1 and §4.3, where we find it beneficial to warm up the model with just off-policy training. We apply similar tuning budgets to both the soft Q-learning and policy-gradient models (tuning mostly the reward scale and top-k), based on performance on the validation dataset and sample qualities.


For top-k sampling results, we sample one hypothesis for each premise and measure the average attack rate across the dataset, because sampling multiple hypotheses per premise and measuring their performance would be expensive. Since the hypotheses are sampled independently of the inputs, this should be a good approximation.


We use a paraphrase generation model based on Zhang et al. [2019] (https://huggingface.co/tuner007/pegasus_paraphrase). During decoding, we set no_repeat_ngram_size (see https://huggingface.co/blog/how-to-generate), which improves readability.

a.2 Experimental Results


Please see Table 4 for beam search results.


Please see Table 5 for detailed results breakdown, and Table 6-9 for examples. Examples are in the format: topic: [prompt] input sentence generated text.

Table 4: Beam search results on entailment generation (Entl. Prob, Entl. Rate, PPL), reported in the format val/test, comparing MLE, GOLD-s Pang and He [2021], MLE+PG, SQL, and SQL (single). Higher is better for the entailment metrics and lower is better for PPL. SQL (single) achieves zero on some metrics, as it generates a single token.
Table 5: Prompt generation results: topic scores for each of the seven topics (legal, politics, computers, space, religion, science, military) and their average, for GeDi and for SQL (off) and SQL with prompt lengths 5, 10, and 15. Note that some of the numbers from GeDi are low because the topics are tokenized into two subword tokens, which the model was not trained with.
legal: legal space religion and space In summary, a good understanding of these concepts is that by giving an explicit understanding to a person, they provide an avenue to be studied and studied. But the concept of one person being a space is also very confusing, and can be very difficult to obtain.\n So,
politics: the primary referendum is In summary, the outcome will be a referendum on the EU membership for the first time of its kind for EU citizens, and the full extent of the benefits of a single market and a flexible single EU state."
computers: macintoshintoshintoshintosh In summary, it appears that the company and IBM products are currently in need of upgrading the computer. This can be seen in a detailed review of the Macintosh version of Windows Vista and XP. However, when looking at the changes made by the HP Macintosh hardware and software versions of
space: legal space science and space In summary:\n\n The purpose of this paper is to investigate and test the theory of space space and other objects. This project will support NASA.s efforts to demonstrate these theories, and to propose other relevant new theories.\n This paper provides the following arguments for the
religion: space legal religion religion religion In summary, to the author the current discussion is the position of the Church and the community. While we acknowledge that we should not be commenting upon claims such as our recent cases or the other ones that contradict our view, we conclude it is appropriate to include these cases. Further
science: the chemical microscope is In summary, the most sophisticated of these experiments is a technique that gives no obvious, no apparent way of revealing that the material was obtained. In this study, we examine how the compounds in the samples in question make up the composition of the chemical and its properties. The chemical composition
military: arms defense battalion battalion cavalry In summary: 6th Panzer Field Division, Second Division.\n\n The main task of the battalion in the main counterinsurgency campaign was to counter the enemy in any counter-incursion. The main objective of this campaign is to eliminate enemy groups and the remnants of
legal: legal space religion and space This essay discusses the idea of space and time as a space, in both theoretical and conceptual terms, as not an individual time period or anything else. The emphasis is on time itself, rather than having a fixed central space. Space was the object of the first chapter, and
politics: the primary referendum is This essay discusses the nature of the EU referendum. The purpose of this essay is to shed light on the importance of a public referendum, on a question of whether the decision of an EU member states to remain in the European Union is constitutional and thus in accord with constitutional guarantees of sovereignty
computers: macintoshintoshintoshintosh This essay discusses hardware devices and software systems for Mac OS X, MacOS X and Linux. To view the latest version of Macintosh OS: Mac 8.7.x\n\n For more information or for information about Macintosh systems, visit Mac MacSystems.\n More
space: legal space science and space This essay discusses science for teens, adults and teenagers.\n\n When the idea of studying space was first implemented as a method to test, the question was: What if a student has been "comfortable" with space without its body? What would their body like to be
religion: space legal religion religion religion This essay discusses an alternative religion that focuses on the role of a particular religion and views some form of religious ethics as the form when the law is applied to that particular religious community . This discussion is concerned with the status of faith for individuals or groups which may be members and members
science: the chemical microscope is This essay discusses the mechanisms of reaction with a focus on the molecular structure of nucleite and of enzymes within the cytoskeleton, thus making it easier to understand the process of metabolism and other elements of cellular life. In this essay, we use techniques such as the photochemical transfer
military: arms defense battalion battalion cavalry This essay discusses three main themes:\n\n 1) Lack of uniformed soldiers is an unacceptable and unconscionable strategy for the Army.\n 2) Poor and inadequate training does not compensate the soldiers, and may deprive them of the necessary and competitive training from their instructors
Table 6: Prompt samples from SQL.
legal: there liable injunction In summary, the court decided that defendants had a right to petition the high court to intervene, and therefore granted the injunction. However, a Court of Appeal decision in the US District Court ruled with no effect on the petitioned by the plaintiffs, citing two reasons.\n\n The US
politics: the primary referendum was In summary, the majority of Russians would support the idea of banning all human rights, and that would be a major part of the government’s effort to build a new liberal economy, which President Vladimir Putin says would boost Russian tourism.\n\n The results were published in the
computers: the macintoshintosh In summary, the kernel has the best quality of life, and kernel security is at the heart of any OS development. The first time Linux released was when i386 released the Macintosh version, which had a rather low standard. Other issues like the introduction of the OS6 operating system
space: comet asteroid spaceship In summary, this asteroid asteroid ship is about to make its maiden flight to explore deep-space.\n\n "The asteroid craft was a very early attempt at the discovery of a new home for the world’s first black hole," NASA said in a news release. "Today we
religion: faith faith faith salvation In summary, Christian beliefs are not a new way to use the time spent thinking about God’s world as a source for faith. Faith is an effort to think of the world without fear that it might become a dangerous place for the human family. Because it represents the very essence that
science: climate research chemistry In summary of the study, this review aims to determine how in a single study where the same number of data was analysed, a new methodology is needed to better understand who produced a different graph than the one suggested. The paper will be published in issue #5, Issue #18.
military: the cavalry battalion a In summary, the army are a unit of the same type and in all, so there is no need to declare one. The unit does not constitute a cavalry unit or for use on troops.\n\n The army is not under the command of a brigade from the front. For
legal: there liable injunction This essay discusses the potential legal consequences of a stay in the United States for an indefinite period of time if the government continues to delay the process of de-instituting it. To apply such a request, all applicable laws shall apply either the same terms as the existing statutes. In
politics: the primary referendum was This essay discusses the electoral strategy against a candidate for governor of the Commonwealth.\n\n The survey of British voters in this survey provides an overview of what the candidates for the United Kingdom will be seeking in the next Parliament. In the general election a few seats will lead up to a
computers: the macintoshintosh This essay discusses the various problems of the Macintosh, the first two-year running environment. An early version of this paper was originally published in 1982. The MacSX was not designed and managed by Kia.\n\n Macintosh\n The mac has been a family invention
space: comet asteroid spaceship This essay discusses a topic: the impact of two of the Earth’s two-thirds comet-sized moon Charon on Earth, and why asteroids are so close to the sun; why people are looking for ways to find a way to keep Earth-shaped asteroids out of orbit.
religion: faith faith faith salvation This essay discusses the impact religion has on the American experience and in American culture. Since the beginning of my career I have found that faith and belief have often been linked to economic growth, social development and education. I believe that all people need to know that there is no reason for
science: climate research chemistry This essay discusses the role of molecular information and its interaction with the general organism and human health.\n\n "The idea of biological information is not really a new concept. We used genetic information as a medium to define, identify, and store information about biology and biology," explains Dr.
military: the cavalry battalion a This essay discusses the potential for the development of a small infantry brigade as an infantry regiment. It is also a contribution to the larger cavalry corps as it would require a larger brigade for battle. For more information see the original article on this page.
Table 7: Prompt samples from MLE+PG.
legal: In summary Currently: In 1966 the Act was amended into state of law through amendments.\n\n\n Defent No. 1 etc 695 [The character in question for judicial decision purposes; participation t concerned you; "but not acceptance.")\n\n Generally held: Just
politics: In summary Senate candidates, senator (Republican); senator (Democrat); and opinion-former (2002-08). - 2012 Senate results are based on the federal Election Commission’s October 2016 Current Opinion Polling Reports. Key figures : Open Gallup poll Most Americans view the
computers: In summary: 12-16 add-on chips. Trace out the type predefined ORDER parameters, and write to /dev/tty with them.\n\n\n\n\n\n\n\n\n Roundset sizes with mm(831x810 x870 x81f);
space: In summary Space Station - Farm Station (1985 by Mike Lazarra) Here is an article developed by Maregnus Spirit Experimentator on WinViotrv - An exploration benefit for compute-enriched array data densities (UPERS).This thesis home
religion: In summary nice things about Android 6.1 Jelly Bean!\n Searching for OP lag fixes one of my cllcs or some other improvements that’s fixing a bug due to this nerf! (model causing Huge Frame Decay!) It also fixed an upper turret hook
science: In summary Computer Age Experience Overview\n\n\n\n Networking skills are the most developed skill set for Internetthumb members at universities at this time. In computer science, we are introducing various gatekeepers to intellectual property ownership and cyberware acquisitions, entry program makers post a
military: In summary Army Sgt. Harold Tolbard (5/16/2018) Lt. Gen. Michael Bachaes 1 Dickie Powell 2 Lt. Zachary Bram 9 *Gen. Robert Eisen: Warfighter – Soldier + Genoured\n\n\n – Senior Bush Doctrine
legal: This essay discusses Illinois cases on issues such as drug trafficking and drug Social Security.
politics: This essay discusses federal ethics as the key area on which current and past state and local governments have been operating.
computers: This essay discusses the very development of alternative technology for young people.
space: This essay discusses NASA’s StarHubble satellite mission development. Transcript here.
religion: This essay discusses various aspects of the relays of mediocality and Hammazanna.
science: This essay discusses Linux desktop computing, and IRI video-game applications.\n\n The zooming in — even after the GNOME 3 transition came to an end, is all about figuring out how you have run a software operating system so vital that any hacker can mine it
military: This essay discusses military courage that included in the combat operations in Iraq and Afghanistan.
Table 8: Prompt samples from GeDi.
legal: In summary we have published in the journal Nature Neuroscience: A systematic review of human brain tissue has found no evidence for any association between the presence of the presence of a particular form of human neuropathy in the brain, a condition that is not normally associated with cognitive impairment. We found that
politics: In summary we have a list of 10 of the best and most common types of drugs for people with HIV. This is a very short list of recommendations from a national and international community.\n\n\n\n This article has been updated to make the official state of the EU state of
computers: In summary, we believe that the current system has no way of doing anything about it.\n\n\n\n The following steps are taken to get the system working.\n\n 1. Install a new operating system with a Linux Mint operating system\n 2. Start a new Linux Mint operating
space: In summary we have some important news from the moment of the year and some important information about these two major planets. This new discovery is the first to confirm this important planet has an active life in its home planet, a planet with a mass of about 5.8 billion tons. It
religion: In summary, we believe that the current administration has no way of doing anything about the Benghazi attacks. This is a very interesting story, and I think it has been a very nice surprise. This is a very nice and well thought out piece that is a must for the
science: In summary we use this approach to evaluate if the number of data points (in the dataset) that are relevant for each data set is the same (in this case, the data are not in one data set). In this approach we can test the data points in a different way.
military: In summary we have some important news from the moment of the year and some important information from the moment of the year.\n\n\n\n\n We’ve also added an additional update for our new feature, which includes:\n • Improved access and access in all of the main
legal: This essay discusses how you can build a community of dedicated people. If you’re a member of a community of people who want to contribute to the environment, you’ll also be helping them build communities in order to support the local economy, and the future of the city. The latest report
politics: This essay discusses how we can build on previous research findings about the role religion plays in human development in human development. This is a very interesting and highly entertaining story. What is an "independent" political party in the United States, the U.S. political party, and the United
computers: This essay discusses how you can build a new browser to view and share your favorite web sites.\n\n\n A browser that is open source can also be built from a web browser, which can be a browser that does not allow browser extensions (e.g. Firefox, Chrome, Opera
space: This essay discusses how you can build a life with a healthy diet and how you can use it when you’re ready to move forward. It’s a very simple approach to building a life with a healthy diet and what it means to be healthy and healthy for the
religion: This essay discusses how you can build a new game without having to play the original game, and how you can make a new title that is completely different to the original. It has been around since 2007, when the first game, The Elder Scrolls IV: Oblivion, was released in the PlayStation
science: This essay discusses how we can build on previous research findings about the role of obesity in human metabolism and how we can improve our health.\n\n\n\n In this essay, we explore why eating a whole whole diet does not help prevent obesity (1). We find that a whole food diet
military: This essay discusses how you can build a community with the help of friends and family.\n\n\n\n\n "The people around me are the ones who need help. They are the ones who need help. They are the ones who are not alone."\n - Michael\n "It’s
Table 9: Prompt samples from PPLM.