Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

04/02/2020
by Manuel Schneckenreither, et al.

Although reinforcement learning has become very popular in recent years, the number of successful applications to operations research problems remains small. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast it does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which explain why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. The Laurent series expansion of the discounted state values forms the foundation for this development and also provides the connection between the two approaches. Finally, we prove the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. The main insight is that, in the operations research domain, machine learning techniques have to be adapted and advanced before they can be applied successfully in our settings.
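
To illustrate the Laurent series connection mentioned in the abstract, the following minimal sketch uses a fixed two-state Markov chain with non-zero rewards in both states. It relies on the standard expansion of the discounted value, V_gamma(s) = rho / (1 - gamma) + h(s) + e_gamma(s), where rho is the average reward per step, h(s) is the bias value and e_gamma(s) vanishes as gamma approaches 1. The chain, its transition probabilities `a` and `b`, the rewards and all function names are assumptions made for illustration; this is not the algorithm proposed in the paper.

```python
# Illustrative two-state Markov chain under a fixed policy (all numbers assumed):
#   P = [[1 - a, a],
#        [b, 1 - b]],  rewards r = (r0, r1), both non-zero.

def discounted_values(a, b, r0, r1, gamma):
    """Solve (I - gamma * P) V = r exactly with Cramer's rule."""
    m00, m01 = 1 - gamma * (1 - a), -gamma * a
    m10, m11 = -gamma * b, 1 - gamma * (1 - b)
    det = m00 * m11 - m01 * m10
    v0 = (r0 * m11 - m01 * r1) / det
    v1 = (m00 * r1 - m10 * r0) / det
    return v0, v1

def average_reward(a, b, r0, r1):
    """Average reward per step: stationary distribution (b, a) / (a + b) times r."""
    return (b * r0 + a * r1) / (a + b)

def bias_values(a, b, r0, r1):
    """Bias values h, fixed by h + rho = r + P h and a zero stationary mean."""
    diff = (r0 - r1) / (a + b) ** 2
    return a * diff, -b * diff

if __name__ == "__main__":
    a, b, r0, r1 = 0.3, 0.2, 1.0, 5.0
    rho = average_reward(a, b, r0, r1)
    h0, h1 = bias_values(a, b, r0, r1)
    for gamma in (0.9, 0.99, 0.999):
        v0, v1 = discounted_values(a, b, r0, r1, gamma)
        scaled = rho / (1 - gamma)
        # The average-reward term dominates both state values as gamma -> 1,
        # while the remainder approaches the bias values.
        print(gamma, v0, v1, scaled, v0 - scaled - h0, v1 - scaled - h1)
```

In this sketch the term rho / (1 - gamma) grows without bound as gamma approaches 1 and is identical for both states, so the differences between states that determine the policy become small relative to the magnitude of the values. This is one way to see why the abstract argues for assessing the average reward separately.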

research
03/22/2020

Reinforcement Learning in Economics and Finance

Reinforcement learning algorithms describe how an agent can learn an opt...
research
03/22/2023

Reinforcement Learning with Exogenous States and Rewards

Exogenous state variables and rewards can slow reinforcement learning by...
research
10/07/2021

Reinforcement Learning in Reward-Mixing MDPs

Learning a near optimal policy in a partially observable system remains ...
research
06/16/2020

Preference-based Reinforcement Learning with Finite-Time Guarantees

Preference-based Reinforcement Learning (PbRL) replaces reward values in...
research
07/07/2023

Action-State Dependent Dynamic Model Selection

A model among many may only be best under certain states of the world. S...
research
03/03/2019

Scaling up budgeted reinforcement learning

Can we learn a control policy able to adapt its behaviour in real time s...
research
12/24/2021

Multi-Provider NFV Network Service Delegation via Average Reward Reinforcement Learning

In multi-provider 5G/6G networks, service delegation enables administrat...
