and Recurrent Neural Networks (RNNs) that use internal gating mechanisms to preserve long-term dependencies, showing impressive results for density estimation (i.e., language modelling) and text generation. These encoder-decoder networks are usually trained end-to-end using Maximum Likelihood (ML) training with full supervision (i.e., the model learns explicitly from expert demonstrations). This is also referred to as teacher forcing.
Typically, for text generation tasks such as image captioning, CNNs encode an image, and the encoding is passed to the start of the decoder RNN language model via a linear map. The decoder policy then performs a set of actions given an expert policy over the time steps (e.g., human captions).
However, ML can act as a poor surrogate loss for the task-specific score we are interested in and evaluate on (e.g., BLEU). Moreover, these scores are non-differentiable and hence cannot be used in the standard supervised learning paradigm.
Deep Reinforcement Learning (DRL) can be used to optimize for task scores as rewards [2, 32], for the purpose of higher-quality text generation, e.g., n-gram overlap based metrics such as ROUGE-L and BLEU.
Actor-critic networks in particular have shown state-of-the-art (SoTA) results for image captioning [2, 32]. These models use a policy network that produces actions, which are evaluated by a value network that outputs scores given both actions and targets as input for each state. The critic’s value prediction is then used to train the actor network, assuming the critic values are exact (pre-training the critic is often necessary). However, parameter-free measures such as BLEU and ROUGE-L do not correlate as well with human judgements as word- and sentence-embedding-based semantic similarity measures between predicted and target sequences. Moreover, the frequency distribution of generated captions is significantly different from that of human captions.
We argue that n-gram overlap based measures in DRL models can be improved using model-based reward estimators that are transferred from sentence similarity models that directly learn from the manually annotated similarity of paired sentences.
In the context of DRL, we view this type of transfer learning as generating the environment of a target task given a source model that implicitly learns relationships on a pairwise source task (i.e., sentence similarity). Additionally, the reward is continuous everywhere, and out-of-vocabulary terms do not hinder the TRL model’s ability to estimate return, since sentence similarity can still be inferred even when unk tags are used at test time to replace tokens not seen at training time. We are further motivated by the fact that transfer learning for text generation has already been successfully demonstrated using pretrained ImageNet CNN encoders. However, transfer learning with respect to the decoder of DRL-based encoder-decoder models has remained unexplored until this point.
We also note that in the common uses of DRL in games and robotics, transfer learning is often made difficult because the environments and dynamics are distinctly different from one another (e.g., games usually do not share the same states, actions, transition probabilities and rewards, i.e., the same Markov Decision Process (MDP)). In contrast, the MDP for natural language is defined by the vocabulary used for a given corpus. Thus, given a sufficient amount of text, the MDPs for all corpora converge. Hence, transfer learning becomes easier, which is not typical for robotics and games.
This paper proposes to transfer pairwise models that have been trained to learn a similarity score between various universal phrase and sentence representations. These models are trained on a set of sentence-pairwise learning problems such as sentence similarity and natural language inference (NLI).
Herein, we refer to this as Transferable Reward Learning (TRL), a method that incorporates model-based reward shaping to improve task-specific scores in relation to semantic similarity as a measure of language generation quality. We baseline both unsupervised and supervised TRL models against ML training and previous actor-critic models with model-free rewards such as BLEU and ROUGE-L. To our knowledge, this is the first work that focuses on learning to transfer model-based reward in sequence prediction.
DRL-based Conditional Text Generation
zhang2017actor have previously used actor-critic sequence training for image captioning using a token-level advantage and value function, achieving SoTA on MSCOCO at that time. In contrast, TRL can evaluate return at both the token and sentence level, learned from human similarity judgements between sentences.
ranzato2015sequence proposed Mixed Incremental Cross-Entropy Reinforce (MIXER), which uses REINFORCE for text generation with a mixed expert policy (originally inspired by DAgger), whereby incremental learning is used with REINFORCE and cross-entropy (CE). During training, the policy gradually moves from relying on the expert-provided tokens under CE to using its own past predictions.
rennie2017self propose Self-Critical Sequence Training (SCST), which extends REINFORCE by using the test-time model predictions to normalize the reward. This avoids the need for a separately estimated baseline to normalize the rewards and reduce variance, while mitigating exposure bias.
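As an illustrative sketch (not the implementation used in any of the cited work), the self-critical idea can be written as subtracting the reward of the greedy, test-time caption from the reward of a sampled caption. The function name and the reward values below are hypothetical.

```python
def self_critical_advantages(sampled_rewards, greedy_rewards):
    """Reward of each sampled caption minus the reward of the greedy caption.

    The greedy (test-time) reward plays the role of the baseline, so no
    separate baseline has to be learned.
    """
    return [s - g for s, g in zip(sampled_rewards, greedy_rewards)]


# Hypothetical rewards for two images: one sampled and one greedy caption each.
advantages = self_critical_advantages([0.8, 0.3], [0.5, 0.5])
# A positive entry means the sample beat the model's own greedy output.
```

Samples that score above the greedy caption receive a positive weight and are reinforced; those that score below it are suppressed.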
Reward Augmented ML (RAML) combines conditional log-likelihood and reward objectives, while showing that the highest reward is found when the conditional distribution is proportional to the normalized exponentiated rewards, referred to as the payoff distribution.
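A minimal sketch of an exponentiated payoff distribution follows, assuming a finite set of candidate outputs with known rewards and a temperature parameter. The function name, the candidate rewards and the temperature value are illustrative assumptions rather than details taken from the RAML paper.

```python
import math


def payoff_distribution(rewards, temperature=1.0):
    """Normalize exponentiated rewards into a probability distribution."""
    # Subtracting the maximum reward keeps the exponentials numerically stable
    # and does not change the normalized result.
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical rewards for three candidate captions of the same image.
probs = payoff_distribution([0.9, 0.5, 0.1], temperature=0.5)
```

Lower temperatures concentrate the probability mass on the highest-reward candidate, while higher temperatures flatten the distribution.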
ren2017deep have defined the reward as the embedding similarity between images and sentences that are projected into the same embedding space, instead of the similarity of embeddings corresponding to predicted and target tokens alone. To the best of our knowledge, it is the only other method that uses a continuous reward signal from an embedded space. We can consider this to be an embedding measure that is model-free. In the context of this work, we also consider a model-free sentence embedding similarity measure (see Transfer Reward Learner) in our AC model.
Given that our main contribution is the adaptation of TRL pairwise models to model rewards (i.e., sentence-level similarity between predicted and target caption), we briefly introduce the relevant SoTA that we consider in our experiments. kiros2015skip presented Skipthought vectors, which are formed using either a bidirectional or unidirectional Long Short-Term Memory (LSTM) encoder-decoder that learns to predict adjacent sentences from encodings of the current sentence, which is a fully unsupervised approach.
In contrast to Skipthought, Conneau et al. (2017) propose InferSent, which is a supervised method to learn sentence representations from natural language inference data. InferSent outperforms unsupervised sentence representations such as Skipthought on various sentence-pair tasks. We also include this approach in our experiments as the second TRL model to estimate cumulative rewards. O’Neill and Bollegala (2018) proposed contextualized embeddings by learning to reconstruct a weighted combination of multiple pretrained word embeddings as an auxiliary task, acting as a regularizer for the main task.
Learning Reward Functions
Apprenticeship learning has also focused on learning the reward function from expert demonstrations, albeit in the context of robotics. christiano2017deep proposed to learn to play Atari using deep RL from human preferences. This is analogous to learning from similarity scores that are used as labels for pairwise learning between sentence representations. We share a similar motivation in that learning from demonstrations (i.e., reference captions) can be difficult, particularly when there are many permutable demonstrations that lead to a similar goal (i.e., many semantically and grammatically correct ways to describe what is in an image). By learning a pairwise model from human preference scores (e.g., sentence similarity scores) between trajectories, we can model the reward.
Image Caption Setup
For each image there is a corresponding caption sequence of tokens drawn from the vocabulary. A CNN encoder encodes the image into a hidden state that is passed to a Recurrent Neural Network (RNN) decoder, which generates a predicted sequence that is then evaluated with a task-specific score. In the DRL setting, we can treat the problem as a finite Markov Decision Process (MDP) in which each word is a state and each prediction is an action in the action space, taken with some probability under the policy. After receiving the set of actions, the environment issues a discounted return governed by a discount factor, and the objective is to maximize the total expected return. We use an actor-critic model as the basis of our experiments, with a ResNet-152 encoder and an LSTM network.
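To make the notion of a discounted return concrete, the following minimal sketch accumulates per-step rewards backwards through an episode. The reward values and the discount factor are hypothetical and are not taken from the experiments.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of rewards in which the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Working backwards lets each step reuse the return of the step after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-token rewards for a three-token caption.
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With `gamma=1.0` the function reduces to a plain sum of rewards, which is the undiscounted case.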
Policy Gradient Training
We define the policy network as an encoder-decoder architecture that encodes an image through the ResNet-152 CNN-based encoder followed by a linear projection (Equation 1). The image encoding is then concatenated with the embedding of the input word to form the state. The LSTM decoder takes this state as input (Equation 2). The policy network parameters therefore comprise the ResNet-152 parameters, the linear projection, the LSTM parameters and the decoder projection layer. For a given sequence length, the predictions are defined over the action space given by the vocabulary and are compared against the targets.
Value Function Approximation
For Value Function Approximation (VFA), the gradient of the expected cumulative discounted reward is typically estimated from the rewards received at each time step of an $n$-step return.
We then use a critic network to estimate the state-value function, which takes the policy, the actions and the parameterized rewards, and computes the expected return from a given state over the remaining steps. This is given as the expectation over the sum of discounted rewards in Equation 5, where rewards are issued by our proposed TRL model-based reward with frozen parameters, and the discount factor is not used since discounted rewards are not applicable in the TRL.
Advantage Function Approximation
Above, we considered using the critic to estimate the value function. However, training the value network from scratch can result in high variance in the gradient and therefore poor convergence. Similarly to zhang2017actor, we use the Advantage Function Approximator (AFA) to reduce the variance in gradient updates. This is achieved using temporal-difference, TD($\lambda$), learning as shown in Equation 6, where the Q-function is defined over the $N$-step expected return and $\lambda$ is the trace decay weight (a larger $\lambda$ assigns more credit to distant rewards).
The gradient of the policy network can then be rewritten as Equation 7. Here, we set the trace decay parameter $\lambda = 1$ in our experiments, which corresponds to Monte Carlo estimation and means that large traces are also assigned to distant states and actions. In the context of image captioning, episodes are typically short and hence this is feasible.
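The role of the trace decay weight can be illustrated with a small sketch of the standard recursive form of the lambda-return, assuming a terminal state of value zero. This is a textbook formulation written for illustration, not a transcription of Equations 6 and 7, and the reward and value numbers are hypothetical.

```python
def lambda_returns(rewards, values, gamma=1.0, lam=1.0):
    """Lambda-returns via G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    returns = [0.0] * len(rewards)
    next_return = 0.0  # return after the terminal step
    next_value = 0.0   # value of the terminal state (assumed to be zero)
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * ((1.0 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns


def advantages(rewards, values, gamma=1.0, lam=1.0):
    """Advantage estimate: lambda-return minus the critic's value prediction."""
    targets = lambda_returns(rewards, values, gamma, lam)
    return [g - v for g, v in zip(targets, values)]


# Hypothetical episode: a single reward at the end, with made-up critic values.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]
monte_carlo = lambda_returns(rewards, values, lam=1.0)  # full Monte Carlo returns
one_step_td = lambda_returns(rewards, values, lam=0.0)  # bootstraps from the critic
```

Setting `lam=1.0` ignores the critic's bootstrapped values entirely and propagates the final reward to every step, which is affordable when episodes are short.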
This is achieved by computing the gradient of the log-likelihood multiplied by the advantage function shown in Equation 8. Here, the advantage reduces the variance of the gradient, increasing the probability of actions when the advantage is positive and decreasing it otherwise.
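The effect of the sign of the advantage can be seen in a toy softmax policy over a three-word vocabulary. The sketch below is illustrative only: the function names, learning rate and logits are assumptions, and it applies a single gradient-ascent step directly to the logits rather than to network parameters.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One ascent step on advantage * log pi(action) with respect to the logits."""
    probs = softmax(logits)
    # d log pi(action) / d logit_k = 1[k == action] - pi(k)
    grads = [advantage * ((1.0 if k == action else 0.0) - p)
             for k, p in enumerate(probs)]
    return [x + lr * g for x, g in zip(logits, grads)]


start = [0.0, 0.0, 0.0]  # uniform policy over a toy vocabulary
raised = softmax(policy_gradient_step(start, action=1, advantage=1.0))
lowered = softmax(policy_gradient_step(start, action=1, advantage=-1.0))
```

A positive advantage raises the probability of the chosen token above its starting value of one third, and a negative advantage lowers it.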
Transfer Reward Learner
We now consider two sentence encoders, as mentioned in Related Work, as the TRL. We note that, although there have been considerable breakthroughs in recent years in models that could be used for sentence similarity tasks, these models are too computationally costly to consider for issuing rewards and typically have more parameters than the whole actor-critic network combined.
Both TRL models that evaluate state-action pairs have parameters that are not updated, as rewards are kept static throughout training. The advantage of this is that we are not restricted to the setting used for sentence-level n-gram overlap measures such as BLEU, ROUGE and CIDEr. The critic can evaluate partially generated sequences and sentence pairs of different lengths, since the TRL models have been trained to learn similarity between sentences of unequal length. We emphasize that the TRL is not updated in our experiments; updates are only carried out for approximating the advantage and value functions.
For reward shaping, we use a pretrained sentence similarity model such as InferSent, tuned on the SemEval 2017 Semantic Textual Similarity (STS) dataset (http://alt.qcri.org/semeval2017/task1/), which consists of English monolingual pairs labelled with a score from 0 (semantic independence) to 5 (semantic equivalence). The scores are scaled from the continuous [0, 5] range to [0, 1] using a sigmoid to convert them to a probability.
Conneau et al. (2017) use the scoring function in Equation 9 between the encodings of the two sentences in a pair. We also use this scoring function for the pretrained InferSent model.
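As a hedged illustration, the sketch below builds the sentence-pair features commonly described for InferSent: the two encodings, their absolute element-wise difference and their element-wise product, concatenated into one vector. The function name and example vectors are hypothetical, and the classifier or regressor that would map these features to a similarity score is omitted.

```python
def pair_features(u, v):
    """Concatenate u, v, |u - v| and u * v for two equal-length sentence encodings."""
    if len(u) != len(v):
        raise ValueError("sentence encodings must have the same dimensionality")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


# Hypothetical 2-dimensional encodings of a predicted and a reference caption.
features = pair_features([1.0, 2.0], [3.0, -1.0])
```

The resulting vector is four times the size of a single encoding, and both the difference and product terms are symmetric in the two sentences.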
The model is a bidirectional GRU (or BiLSTM) with max (or mean) pooling, as in Equation 10, which applies the pooling function over the hidden states computed from the embedding of each word.
We also use the self-attentive variation in Equation 11, where the max-pooling operation is replaced with self-attention that produces a weighted average whose weights sum to 1. Hence, attention focuses on the hidden states of important tokens prior to using the scoring function.
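The three pooling choices can be compared in a small sketch that operates on a list of per-token hidden states. It is an illustration under stated assumptions: the `query` vector is a hypothetical stand-in for learned attention parameters, attention scores are plain dot products, and the hidden states are made-up numbers rather than BiLSTM outputs.

```python
import math


def max_pool(states):
    """Element-wise maximum over time steps."""
    return [max(column) for column in zip(*states)]


def mean_pool(states):
    """Element-wise mean over time steps."""
    return [sum(column) / len(states) for column in zip(*states)]


def attention_pool(states, query):
    """Weighted average of hidden states with softmax weights that sum to 1."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * state[d] for w, state in zip(weights, states))
              for d in range(len(states[0]))]
    return pooled, weights


# Hypothetical hidden states for a three-token sentence.
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
pooled, weights = attention_pool(states, query=[1.0, 0.0])
```

With an all-zero query the attention weights are uniform and the result coincides with mean pooling; a non-zero query shifts weight toward tokens whose hidden states align with it.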
We also consider Skipthought vectors as unsupervised sentence representations, which have shown competitive performance on sentence-level pairwise tasks. This allows us to compare against the supervised InferSent model. As with the InferSent model, we use the scoring function in Equation 9.
Prior work in this area used a different loss between policy values and the critic scores. In our experiments, we found that a KL-divergence loss outperformed it.
We minimize the KL divergence between the normalized policy values and critic scores, which corresponds to minimizing the cross-entropy loss when ignoring the entropy term that does not depend on the parameters being optimized. The logit penalizes values that are much higher than the baseline more than the loss used in prior work does, but shows a larger gap between small value improvements over the baseline.
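The relationship between the two losses can be checked numerically: the KL divergence equals the cross-entropy minus the entropy of the fixed distribution, so the two differ only by a term that is constant with respect to the distribution being optimized. The distributions below are hypothetical and the function names are illustrative.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# p stands in for the fixed (normalized) distribution, q for the one being trained.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))  # close to zero
```

Because `entropy(p)` does not depend on `q`, minimizing either quantity over `q` yields the same gradients.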
This has the effect of stabilizing the critic network over consecutive iterations: because the critic’s gradient updates are not as large, its training is more stable and, subsequently, the change in the policy network between iterations is less drastic.
We use the Microsoft Common Objects in Context (MSCOCO) 2014 image captioning dataset proposed by lin2014microsoft, which is a de facto benchmark for image captioning. The dataset includes 164,062 images (82,783 training images, 40,504 validation images, and 40,775 test images) covering 80 object categories and 91 stuff categories. Each image is paired with at least five manually annotated captions.
We also use the smaller Flickr-30k dataset, which contains 30k images with 150k corresponding captions. It also includes a constructed denotation graph that can be used to define denotational similarity, giving more generic descriptions through lexical and syntactic operations.
As mentioned before, we use the ResNet-152 classifier trained on ImageNet as our encoder. The reported experimental results are those of a 2-hidden-layer LSTM decoder network. For both MSCOCO and Flickr30k, we train with mini-batches and use adaptive moment estimation (Adam) for stochastic gradient-based optimization of the LSTM decoder while, as mentioned, the image encoder is not updated in our experiments.
Training both actor and critic networks from scratch is difficult, as is the case more generally for related policy gradient algorithms, since the reward signal often leads to high variance in the gradient updates, particularly in the early phase of training when the actor and critic parameters are initialized randomly.
Therefore, in all our experiments we pre-train the actor and critic networks (similarly to ren2017deep) for 5 and 7 epochs respectively by minimizing the cross-entropy loss for both the actor network and the critic network (as mentioned, we also consider CSP loss). After the actor is pre-trained, the critic network is passed sampled actions from the fixed pre-trained actor and updated accordingly. After this initial phase, we train the actor and critic together.
|Mind’s Eye chen2015mind||-||-||-||0.13||19.10|
|Strong Sup||-||-||30.2 21.0||0.19||-|
Embedding Similarity Evaluation
Word Mover’s Distance Sentence Similarity
We also include WMD for measuring semantic similarity between normalized embeddings associated with predicted and target words. Word-level embedding similarities offer a faster alternative to model-based sentence-level evaluation, hence we include WMD in our experiments.
To align WMD with word overlap metrics, we also include penalization terms such as the brevity penalty used in BLEU, as shown in Equation 13. The terms in Equation 13 are the similarity measure, the length ratio and the brevity penalty, which penalizes generated sentences that are shorter than the reference.
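A minimal sketch of a length-penalized similarity follows, assuming the standard BLEU brevity penalty (1 when the candidate is longer than the reference, otherwise exp(1 - r/c) for reference length r and candidate length c). How the penalty is combined with the similarity here, by a simple product, is an assumption made for illustration and may differ from Equation 13.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BLEU-style brevity penalty for a candidate of length c against a reference of length r."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def penalized_similarity(similarity, candidate_len, reference_len):
    """Scale an embedding similarity by the brevity penalty (assumed combination)."""
    return brevity_penalty(candidate_len, reference_len) * similarity


# Hypothetical case: a 5-token caption scored 0.8 against a 10-token reference.
score = penalized_similarity(0.8, candidate_len=5, reference_len=10)
```

Candidates at least as long as the reference are left unpenalized, so the penalty only discourages overly short captions.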
Sliding Kernel Cosine Similarity
We also considered a decayed $n$-pairwise cosine similarity, where $n$ is a sliding window span that compares embeddings corresponding to n-gram groupings, with a decay factor that depends on the distance between the windows. In the case below we use the kernel shown in Equation 14, where $i$ and $j$ index the two sentences being compared, respectively.
This allows for misalignments between sentences, as some may be shorter than others. Because the number of window spans varies with sentence length, we multiply the similarity by the brevity penalty.
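Since Equation 14 is not reproduced here, the following is only a rough sketch of what a decayed sliding-window cosine similarity could look like. Every design choice in it is an assumption: windows are represented by the mean of their word embeddings, the kernel is an exponential decay in the distance between window positions, and the result is a weighted average of the window-level cosines.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def window_means(embeddings, n):
    """Mean embedding of every length-n window (one per n-gram grouping)."""
    dim = len(embeddings[0]) if embeddings else 0
    return [[sum(e[d] for e in embeddings[i:i + n]) / n for d in range(dim)]
            for i in range(len(embeddings) - n + 1)]


def sliding_kernel_cosine(pred, target, n=2, decay=0.5):
    """Decay-weighted average cosine between all pairs of windows (assumed kernel)."""
    total, weight_sum = 0.0, 0.0
    for i, p in enumerate(window_means(pred, n)):
        for j, t in enumerate(window_means(target, n)):
            weight = decay ** abs(i - j)  # nearby windows count more
            total += weight * cosine(p, t)
            weight_sum += weight
    return total / weight_sum if weight_sum > 0.0 else 0.0


# Hypothetical 2-dimensional word embeddings for sentences of different length.
pred = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
target = [[1.0, 0.0]] * 5
similarity = sliding_kernel_cosine(pred, target, n=2, decay=0.5)
```

Because every window of one sentence is compared with every window of the other, the two sentences do not need to have the same length.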
|RAML @ ||-||-||-||-||-||-||-||-||27.6||-||-||-||-||-||-|
Table 1 shows the SoTA for image captioning on the Flickr30k dataset, not specific to policy-gradient methods as not all relevant papers include Flickr30k in experiments. Models proposed by wu2018image incorporate external knowledge (SPARQL queries over DBpedia knowledge base) for image captioning, hence the increase in BLEU and Perplexity (PPL).
Table 2 compares ML training with previously published actor-critic approaches that use BLEU and ROUGE as the reward signal. We use beam search at test time. Beam search retains the most probable partial predictions at each timestep, considers the possible next-token extensions for each beam, and repeats until the final timestep.
When using only WMD as the reward signal, which is model-free, we find an improvement on semantic similarity measures (i.e., WMD and COS). Here, COS refers to the Sliding Kernel Cosine Similarity described in the previous section. Interestingly, we also find that WMD improves over ML on word overlap even though, unlike ML training, the measure does not optimize for a Dirac distribution.
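The beam search procedure can be sketched with a toy next-token model. The `toy_step` table, the token names and the beam sizes below are hypothetical and stand in for the LSTM decoder; the sketch also assumes strictly positive next-token probabilities and scores beams by summed log-probability without length normalization.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Keep the beam_size most probable partial sequences at each timestep."""
    beams = [([start_token], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished beams are carried over
                continue
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(tokens[-1] == end_token for tokens, _ in beams):
            break
    return beams[0][0]


def toy_step(tokens):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"car": 0.55, "cat": 0.45},
        "the": {"car": 0.9, "cat": 0.1},
        "car": {"</s>": 1.0},
        "cat": {"</s>": 1.0},
    }
    return table[tokens[-1]]


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
wider = beam_search(toy_step, "<s>", "</s>", beam_size=2)
```

In this toy model a beam of size 1 commits to the locally best first word and ends with a sequence of probability 0.33, while a beam of size 2 keeps the alternative first word alive and recovers a sequence of probability 0.36.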
Both TRLs (InferSent and Skipthought) make significant improvements on WMD and COS. Hence, we infer that these TRLs that learn sentence similarity produce semantically similar sentences at the expense of a decrease in word-overlap (expected since the model is not restricted to predicting the exact ground truth tokens). This relaxes the strictness of word-overlap and allows for diversity in the generated captions. Moreover, WMD does not penalize sentence length and thus promotes diversity in caption length. However, as mentioned, we do include brevity penalty in WMD and COS for the purposes of easier comparison to word overlap metrics.
The top of Table 3 shows SoTA results for policy-gradient methods based on the best average score on the BLEU, ROUGE-L, METEOR and CIDEr evaluation metrics. VSE is the aforementioned Visual Semantic Embedding model. For SCST, these results are from the test portion of the Karpathy splits using CIDEr for optimization. Policy gradient methods have reached near the top of the MSCOCO competition leaderboard without using ensemble models.
The lower part of the table shows the results of our proposed models and baselines evaluated on both n-gram overlap based measures and word-level (Cosine) and sentence-level (WMD) embedding based evaluation measures. We find that the largest gap in performance between our proposed TRLs and the n-gram overlap (BLEU and ROUGE-L) reward signals is found on the embedding-based evaluation measures.
For all TRLs (WMD, InferSent, Skipthought) performance consistently improves over ML, BLEU and ROUGE-L when evaluated on WMD and Cosine. This suggests that even though we may not strictly predict the correct word as measured by word-overlap measures, the semantic similarity of sentences is preserved as measured by WMD and Sliding Kernel Cosine Similarity. Furthermore, this results in more diverse text generation as the policy network is not penalized for constructing candidate sentences that do not have high word overlap with the reference captions.
TRLs outperform word overlap policy rewards such as BLEU and ROUGE-L on our embedding similarity based metrics. Of the three, we find that the InferSent TRL outperforms the other two, with the unsupervised Skipthought TRL being competitive on all metrics. We also see that results are competitive with the SoTA. We observe similar trends for TRL models evaluated on CIDEr and METEOR.
|Human||An image of a cars driving on the highway|
|A section of traffic coming to a stop at an intersection.|
|A bunch of cars sit at the intersection of a street.|
|This is a picture of traffic on a very busy street.|
|A busy intersection filled with cars in asia.|
|ML||an image of a sitting car in traffic|
|AC-BLEU||A group of cars at an intersection.|
|AC-WMD||A group of cars at lights near a traffic intersection.|
|AC-Skipthought||A group of cars near a busy intersection road.|
|AC-InferSent||A picture of cars stopping near the traffic intersection.|
Figure 2 shows an example of the ground truth captions (Human), ML trained generated caption, a baseline AC trained with BLEU scores and our three proposed alternatives that improve for semantic similarity. We demonstrate the difference between text generated for an image of a traffic jam near an intersection. The example also illustrates that the ground truth itself is imperfect, both syntactically (‘..of a cars..’) and semantically (‘..cars sit at the intersection..’). The TRL will assign lower return in these cases, whereas word-overlap measures do not explicitly penalize how bad the semantic or syntactic differences are between predicted and ground truth sentences.
We proposed to use pretrained models, trained specifically on sentence similarity tasks, to issue rewards and to define, optimize and evaluate language quality for neural text generation. We find that performance on semantic similarity metrics improves over a policy gradient model, namely the actor-critic model, that uses unbiased word overlap metrics as rewards.
The InferSent actor-critic model improves over a BLEU-trained actor-critic model on MSCOCO by 6.97 points when evaluated on a Word Mover’s Distance similarity measure, and by 10.48 points on the sentence-level cosine embedding metric. Large performance gains are also found for the Flickr-30k dataset, demonstrating the general applicability of the proposed transfer learning method. We conclude that model-based task rewards should be considered for reinforcement learning based approaches to conditional text generation.
-  Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: Learning Reward Functions.
-  (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086. Cited by: Introduction.
-  (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics 1 (5), pp. 834–846. Cited by: Image Caption Setup.
-  (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326. Cited by: Sentence Representations.
-  (2018) Learning to evaluate image captioning. In , pp. 5804–5812. Cited by: Introduction.
-  (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pp. 376–380. Cited by: MSCOCO Results.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Transfer Reward Learner.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Image Caption Setup, Policy Gradient Training, Training Details.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: Introduction, Training Details.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: MSCOCO Results.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Training Details.
-  (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: Skipthought Rewards.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: Introduction.
-  (2015) From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966. Cited by: Word Mover’s Distance Sentence Similarity.
-  (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: Introduction.
-  (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: MSCOCO Results.
-  (2016) Optimization of image description metrics using policy gradient methods. CoRR abs/1612.00370. External Links: Cited by: Table 3.
-  (2017) Softmax q-distribution estimation for structured prediction: a theoretical interpretation for raml. arXiv preprint arXiv:1705.07136. Cited by: Table 3.
-  (2016) Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731. Cited by: DRL-based Conditional Text Generation.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: Word Mover’s Distance Sentence Similarity, MSCOCO Results.
-  (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: Table 3.
-  (2017) Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 290–298. Cited by: Table 3.
-  (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: Table 3.
-  Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979. Cited by: DRL-based Conditional Text Generation.
-  Why neural translations are the right length. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2278–2282. Cited by: Word Mover’s Distance Sentence Similarity.
-  (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: Introduction.
-  Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Introduction.
-  (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: MSCOCO Results.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: DRL-based Conditional Text Generation.
-  (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1367–1381. Cited by: Table 3.
-  (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: Dataset Details.
-  (2017) Actor-critic sequence training for image captioning. arXiv preprint arXiv:1706.09601. Cited by: Introduction, Critic Loss, Flickr30k Results, Table 3.