DORB: Dynamically Optimizing Multiple Rewards with Bandits

11/15/2020
by   Ramakanth Pasunuru, et al.
Policy gradient-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics in language generation tasks. However, optimizing for a single metric reward improves mostly that metric alone, suggesting that the model games the formulation of that metric in a particular way, often without achieving real qualitative improvements. Hence, it is more beneficial to have the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via a multi-armed bandit approach (DORB), where at each round the bandit chooses which metric reward to optimize next, based on expected arm gains. We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks: question generation and data-to-text generation, including on an unseen-test transfer setup. Finally, we present interpretable analyses of the learned bandit curriculum over the optimized rewards.
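For intuition on the bandit controller described above: Exp3 maintains one weight per arm (here, per metric reward), samples an arm from a mixture of the exponentially weighted distribution and uniform exploration, then updates the chosen arm's weight with an importance-weighted reward estimate. The sketch below is a generic, minimal Exp3 loop, not the authors' implementation; the `reward_fn` stand-in (e.g., a validation-metric improvement after training on the chosen reward) and all parameter values are illustrative assumptions.

```python
import math
import random

def exp3(num_arms, reward_fn, num_rounds, gamma=0.1):
    """Minimal Exp3 sketch. Each arm could correspond to one metric
    reward (e.g., BLEU, ROUGE, entailment); reward_fn(arm) is assumed
    to return a bandit reward in [0, 1] for pulling that arm."""
    weights = [1.0] * num_arms
    for _ in range(num_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm)
        # Importance-weighted estimate compensates for the sampling probability.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
    total = sum(weights)
    return [w / total for w in weights]  # final arm-selection distribution
```

In a multi-reward training loop, one pull would correspond to running some training steps while optimizing the selected metric reward, so the distribution over arms shifts over time and induces a curriculum over rewards.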


