Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

05/25/2022
by Jiawei Huang, et al.

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups according to their tolerance of exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies, π^O and π^E: π^O ("O" for "online") interacts with the more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while π^E ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users from the second tier, utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., π^E = π^O) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is not beneficial from a minimax perspective. For the latter, we show that, by choosing Pessimistic Value Iteration as the exploitation algorithm to produce π^E, we can achieve constant regret for the risk-averse users, independent of the number of episodes K. This is in sharp contrast to the Ω(log K) regret of any online RL algorithm in the same setting. Meanwhile, π^O (almost) retains its online regret optimality and does not need to be compromised for the success of π^E.
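The two-policy interaction pattern can be illustrated with a toy sketch. This is not the paper's algorithm: it collapses the MDP to a single state (a bandit), uses a UCB rule as a stand-in for a full online RL algorithm producing π^O, and a lower-confidence-bound action choice as a stand-in for Pessimistic Value Iteration producing π^E. The function name `tiered_bandit` and all parameters are illustrative assumptions. The key structural point it preserves is that only π^O's interactions generate new data, while π^E is recomputed pessimistically from that shared dataset each episode.

```python
import numpy as np

def tiered_bandit(true_means, K, rng=None):
    """Toy sketch of the tiered framework on a one-state (bandit) MDP.

    pi_O explores with a UCB rule (optimism in the face of uncertainty);
    pi_E is recomputed every episode by picking the arm with the highest
    lower confidence bound (pessimism), using only the data collected by
    pi_O so far. Returns the cumulative regret of each policy.
    """
    rng = np.random.default_rng(rng)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)   # number of pulls per arm (data from pi_O)
    sums = np.zeros(n_arms)     # cumulative reward per arm
    regret_O = regret_E = 0.0
    best = max(true_means)
    for _ in range(K):
        means = sums / np.maximum(counts, 1)
        bonus = np.sqrt(2 * np.log(max(K, 2)) / np.maximum(counts, 1))
        # pi_O: optimistic choice (unpulled arms get infinite bonus)
        a_O = int(np.argmax(np.where(counts == 0, np.inf, means + bonus)))
        # pi_E: pessimistic choice (never recommend an arm with no data)
        a_E = int(np.argmax(np.where(counts == 0, -np.inf, means - bonus)))
        # only pi_O's interaction produces a new sample
        r = rng.normal(true_means[a_O], 1.0)
        counts[a_O] += 1
        sums[a_O] += r
        regret_O += best - true_means[a_O]
        regret_E += best - true_means[a_E]
    return regret_O, regret_E
```

Under this simplification, the pessimistic policy stops recommending a suboptimal arm once its upper estimate falls below the best arm's lower estimate, which is the intuition behind the constant-regret result for π^E in the gap-dependent setting.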


