Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

05/27/2018
by Julia Kreutzer, et al.

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze how reliability influences the learnability of a reward estimator and how the quality of reward estimates affects the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. The best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also the easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator, trained on cardinal feedback for 800 translations, into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to great potential for applications at larger scale.
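To make "standardized cardinal feedback" concrete, here is a minimal sketch that assumes standardization means converting each annotator's 5-point ratings to z-scores using that annotator's own mean and standard deviation. The abstract does not spell out the procedure, so the function name `standardize_per_rater`, the tuple layout, and the handling of zero-variance annotators are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_rater(ratings):
    """Convert raw ratings to per-rater z-scores.

    ratings: list of (rater_id, item_id, score) tuples (hypothetical layout).
    Returns a list of (rater_id, item_id, z_score) in the same order.
    """
    # Collect every score given by each rater.
    scores_by_rater = defaultdict(list)
    for rater, _, score in ratings:
        scores_by_rater[rater].append(score)

    # Per-rater mean and population standard deviation.
    stats = {r: (mean(s), pstdev(s)) for r, s in scores_by_rater.items()}

    standardized = []
    for rater, item, score in ratings:
        mu, sigma = stats[rater]
        # Assumption: a rater who always gives the same score carries no
        # ranking information, so their z-scores are set to 0.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        standardized.append((rater, item, z))
    return standardized


if __name__ == "__main__":
    raw = [("a", 1, 1), ("a", 2, 3), ("a", 3, 5), ("b", 1, 4), ("b", 2, 4)]
    for row in standardize_per_rater(raw):
        print(row)
```

Standardizing in this way removes each annotator's individual bias (a harsh or lenient rater) and scale use before the ratings are pooled, for example as regression targets for a reward estimator.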

