Transfer Reward Learning for Policy Gradient-Based Text Generation

09/09/2019
by James O'Neill, et al.

Task-specific scores are often used both to optimize and to evaluate the performance of conditional text generation systems. However, such scores are non-differentiable and cannot be used in the standard supervised learning paradigm; policy gradient methods are therefore used, since they compute the gradient without requiring a differentiable objective. We argue that the n-gram overlap measures currently used as rewards can be improved upon by model-based rewards transferred from tasks that directly compare the similarity of sentence pairs. These reward models either output a sentence-level score of syntactic and semantic similarity between the entire predicted and target sentences as the expected return, or score intermediate phrases as segmented accumulative rewards. We demonstrate that using a Transferable Reward Learner leads to improved results on semantic evaluation measures in policy-gradient models for image captioning tasks. Our InferSent actor-critic model improves over a BLEU-trained actor-critic model on MSCOCO by 6.97 points on a Word Mover's Distance similarity measure and by 10.48 points on a Sliding Window Cosine Similarity measure. Similar performance improvements on the smaller Flickr-30k dataset demonstrate the general applicability of the proposed transfer learning method.
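The core mechanism the abstract describes can be sketched in a few lines: sample a sentence from the policy, score the whole sample against the target with an embedding-based similarity model, and scale the log-probability gradient by that sentence-level return. The sketch below is a deliberately minimal, unconditional REINFORCE step (no image conditioning, no critic baseline), and it substitutes a bag-of-words cosine similarity for the paper's InferSent reward model; all function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def bow(tokens, vocab):
    # Bag-of-words count vector; a toy stand-in for an InferSent sentence embedding.
    v = np.zeros(len(vocab))
    for t in tokens:
        v[vocab[t]] += 1.0
    return v

def reinforce_step(logits, target, vocab, length, lr=0.5, rng=None):
    """One REINFORCE update with a sentence-level, model-based reward.

    Samples a sentence from an unconditional categorical policy, scores the
    whole sample against the target via embedding cosine similarity (the
    sentence-level return), and nudges logits along reward * grad(log-prob).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax(logits)
    sampled = [int(rng.choice(len(probs), p=probs)) for _ in range(length)]
    inv = {i: w for w, i in vocab.items()}
    reward = cosine(bow([inv[i] for i in sampled], vocab), bow(target, vocab))
    grad = np.zeros_like(logits)
    for i in sampled:
        onehot = np.zeros_like(logits)
        onehot[i] = 1.0
        grad += reward * (onehot - probs)  # d log p(i) / d logits = onehot - probs
    return logits + lr * grad, reward
```

Running a few hundred such steps shifts probability mass onto tokens that overlap the target, because sampled sentences with higher semantic overlap earn larger returns. A real captioning system would condition the policy on the image, use a pretrained sentence encoder for the reward, and subtract a learned critic baseline as in actor-critic training.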

