Geometry of Policy Improvement

04/06/2017
by Guido Montufar, et al.

We investigate the geometry of optimal memoryless, time-independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long-term reward, discounted or per time step, is maximized by policies that randomize among at most k actions whenever at most k world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states.
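The classical policy improvement step underlying the lemma can be illustrated numerically. The sketch below, with hypothetical transition and reward numbers not taken from the paper, evaluates a memoryless stochastic policy by solving the discounted Bellman linear system (I - gamma * P_pi) V = r_pi, then performs one greedy improvement step and checks that the state value function increases in every state:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s'] = transition probability, R[s, a] = expected reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # transitions under action 0
    [[0.1, 0.9], [0.6, 0.4]],   # transitions under action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def evaluate(pi):
    """Discounted state values of a memoryless policy pi[s, a],
    obtained by solving (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,ast->st', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Start from the uniform policy and take one greedy improvement step.
pi0 = np.full((2, 2), 0.5)
V0 = evaluate(pi0)
Q = R + gamma * np.einsum('ast,t->sa', P, V0)   # Q[s, a] under V0
pi1 = np.eye(2)[Q.argmax(axis=1)]               # deterministic greedy policy
V1 = evaluate(pi1)
assert np.all(V1 >= V0 - 1e-12)  # policy improvement: value rises in all states
```

The final assertion is the pointwise guarantee of the policy improvement lemma; the paper's geometric version characterizes the whole cone of policy changes, not just the greedy one, for which this holds.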

