
How memory architecture affects performance and learning in simple POMDPs

by Mario Geiger, et al.

Reinforcement learning becomes much harder when the agent's observations are partial or noisy. This setting corresponds to a partially observable Markov decision process (POMDP). One strategy for achieving good performance in a POMDP is to endow the agent with a finite memory whose update is governed by the policy. However, policy optimization is then non-convex and, from a random initialization, training can perform poorly. Performance can be improved empirically by constraining the memory architecture, sacrificing optimality to ease training. Here we study this trade-off in the two-armed bandit problem and compare two extreme cases: (i) a random-access memory, in which any transitions between M memory states are allowed, and (ii) a fixed memory, in which the agent can access only its last m actions and rewards. For (i), the probability q of playing the worst arm is known to be exponentially small in M for the optimal policy. Our main result is to show that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which q is exponentially small in 2^m, i.e., q ∼ α^(2^m) for some α < 1. Interestingly, we observe empirically that training from a random initialization leads to very poor results for (i), and significantly better results for (ii).
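To make the fixed-memory setting (ii) concrete, here is a minimal sketch of a two-armed Bernoulli bandit in which the policy conditions only on a window of its last m (action, reward) pairs. This is an illustrative toy, not the paper's construction: the arm probabilities, the window size, and the "win-stay, lose-shift" policy are all assumptions chosen for simplicity.

```python
import random

def run_bandit(policy, p_best=0.9, p_worst=0.1, m=2, steps=10_000, seed=0):
    """Simulate a two-armed Bernoulli bandit where the agent's policy sees
    only a fixed-size window of its last m (action, reward) pairs.
    Returns the empirical frequency q of playing the worst arm."""
    rng = random.Random(seed)
    history = ((0, 0),) * m          # fixed memory: last m (action, reward) pairs
    pulls_worst = 0
    for _ in range(steps):
        action = policy(history)     # 0 = best arm, 1 = worst arm (labels are a convention here)
        p = p_best if action == 0 else p_worst
        reward = 1 if rng.random() < p else 0
        if action == 1:
            pulls_worst += 1
        history = history[1:] + ((action, reward),)   # slide the memory window
    return pulls_worst / steps

def win_stay_lose_shift(history):
    """Toy window policy: repeat the last action if it was rewarded,
    otherwise switch arms. Uses only the most recent (action, reward) pair."""
    last_action, last_reward = history[-1]
    return last_action if last_reward == 1 else 1 - last_action

q = run_bandit(win_stay_lose_shift)
```

A policy this simple cannot reach the q ∼ α^(2^m) rate from the abstract; it only illustrates the interface: the agent's state is the window itself, and richer window policies (such as those built from the necklace conjecture) plug into the same `policy(history)` slot.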




The act of remembering: a study in partially observable reinforcement learning

Reinforcement Learning (RL) agents typically learn memoryless policies—p...

Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Partial observability is a common challenge in many reinforcement learni...

Memory-based Deep Reinforcement Learning for POMDP

A promising characteristic of Deep Reinforcement Learning (DRL) is its c...

Provable Reinforcement Learning with a Short-Term Memory

Real-world sequential decision making problems commonly involve partial ...

Optimal Attacks on Reinforcement Learning Policies

Control policies, trained using the Deep Reinforcement Learning, have be...

Belief Tree Search for Active Object Recognition

Active Object Recognition (AOR) has been approached as an unsupervised l...

Learning classifier systems with memory condition to solve non-Markov problems

In the family of Learning Classifier Systems, the classifier system XCS ...