Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

05/04/2023
by Washim Uddin Mondal, et al.

We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. Delay and compositeness mean that the reward generated by taking an action at a given state is split into several components, which are realized sequentially at delayed time instances. Partial anonymity means that, for each state, the learner observes only the aggregate of the past reward components that were generated by the different actions taken at that state and are realized at the observation instance. We propose an algorithm named DUCRL2 to obtain a near-optimal policy for this setting and show that it achieves a regret bound of 𝒪̃(DS√(AT) + d(SA)^3), where S and A are the sizes of the state and action spaces, respectively, D is the diameter of the MDP, d is a parameter upper bounded by the maximum reward delay, and T denotes the time horizon. The bound is therefore order-optimal in T, and the impact of the delay is additive.
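
The feedback model described above can be illustrated with a small simulation. The sketch below is not DUCRL2 and is not taken from the paper; it only shows, under simplifying assumptions, what a learner would observe. The function name `simulate_feedback`, the `reward_parts` table, and the convention that component j of a reward is realized j steps after the action are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def simulate_feedback(
    trajectory: Sequence[Tuple[int, int]],
    reward_parts: Callable[[int, int], List[float]],
    num_states: int,
) -> List[List[float]]:
    """Return observed[t][s]: the aggregate reward seen at time t for state s.

    Assumption (illustrative only): reward_parts(s, a) returns the components
    of the reward for taking action a in state s, and component j is realized
    j steps after the action was taken.
    """
    horizon = len(trajectory)
    observed = [[0.0] * num_states for _ in range(horizon)]
    for t, (state, action) in enumerate(trajectory):
        for delay, part in enumerate(reward_parts(state, action)):
            if t + delay < horizon:
                # Components are summed per state, so the learner cannot tell
                # which action (or which past visit) produced each part.
                observed[t + delay][state] += part
    return observed


# Hypothetical reward components for each (state, action) pair.
PARTS = {
    (0, 0): [0.5, 0.25],  # half now, a quarter one step later
    (0, 1): [0.0, 1.0],   # entire reward arrives one step later
    (1, 0): [0.25],       # no delay
}

trajectory = [(0, 0), (0, 1), (1, 0)]
observed = simulate_feedback(trajectory, lambda s, a: PARTS[(s, a)], num_states=2)
for t, row in enumerate(observed):
    print(t, row)
```

At time 1 the value observed for state 0 mixes the delayed component of action 0 with the immediate component of action 1, which is the per-state aggregation that the abstract calls partial anonymity.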


