Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space

02/21/2023
by Jonatha Anselmi, et al.

In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs, where the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the diameter D of the MDP is Ω(S^S), where S is the number of states. Therefore, the existing lower and upper bounds on the regret at time T, of order O(√(DSAT)) for MDPs with S states and A actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm UCRL2 is in fact upper bounded by Õ(√(E_2 A T)), where E_2 is related to the weighted second moment of the stationary measure of a reference policy. Importantly, E_2 is bounded independently of S. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform.
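To make the setting concrete, the sketch below simulates a controlled birth-and-death queue with impatient jobs of the kind described in the abstract: states are queue lengths, actions are service speeds, and the per-step cost trades off energy against queue length. This is an illustrative toy model, not the paper's exact construction; the rates (lambda_, mu, theta), the cost weights, the uniformization constant, and the threshold policy used in the example are all assumptions made for demonstration.

```python
import numpy as np

# Illustrative sketch (not the paper's exact model): a controlled birth-death
# queue with impatient jobs. States are queue lengths 0..S-1; the controller
# picks a service speed a in {0, ..., A-1}. All numeric parameters below are
# hypothetical choices for illustration.
def transition_rates(s, a, S, lambda_=1.0, mu=1.0, theta=0.2):
    """Birth (arrival) and death (service + abandonment) rates in state s under action a."""
    birth = lambda_ if s < S - 1 else 0.0           # arrivals blocked at the boundary
    death = (a * mu + s * theta) if s > 0 else 0.0  # speed-a service plus per-job impatience
    return birth, death

def cost(s, a, energy_weight=0.5, holding_weight=1.0):
    """Trade-off between energy consumption (grows with speed a) and
    user-perceived performance (grows with queue length s)."""
    return energy_weight * a + holding_weight * s

def simulate(policy, S=20, A=3, horizon=10_000, seed=None):
    """Simulate the uniformized discrete-time MDP under a state-feedback policy."""
    rng = np.random.default_rng(seed)
    unif = float(S + A)  # crude uniformization constant (upper bound on total rate here)
    s, total_cost = 0, 0.0
    for _ in range(horizon):
        a = policy(s)
        birth, death = transition_rates(s, a, S)
        total_cost += cost(s, a)
        u = rng.uniform() * unif
        if u < birth:
            s += 1
        elif u < birth + death:
            s -= 1
        # otherwise: self-loop introduced by uniformization
    return total_cost / horizon

if __name__ == "__main__":
    # A simple threshold policy: run at full speed whenever the queue is non-empty.
    avg = simulate(lambda s: 2 if s > 0 else 0, seed=0)
    print(f"average cost under threshold policy: {avg:.3f}")
```

A learning algorithm such as UCRL2 would interact with this environment without knowing the rates, and the paper's point is that in such birth-and-death structures the visits it makes are concentrated on low-occupancy states, which is what allows a regret bound that does not scale with S or with the diameter.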
