Policy Dispersion in Non-Markovian Environment

by Bohao Qu, et al.

A Markov decision process (MDP) provides a mathematical framework for formulating the learning process of an agent in reinforcement learning. MDPs are limited by the Markovian assumption that a reward depends only on the immediate state and action. However, a reward sometimes depends on the history of states and actions, which makes the decision process non-Markovian. In such environments, agents receive rewards sparsely, through temporally extended behaviors, and the learned policies tend to be similar. As a result, agents equipped with similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper learns diverse policies from the history of state-action pairs in a non-Markovian environment, using a policy dispersion scheme designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings effectively enlarge the disagreement across policies, yielding a diverse expression of the original policy-embedding distribution. Experimental results show that this dispersion scheme obtains more expressive diverse policies, which in turn deliver more robust performance than recent learning baselines across various environments.
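The positive-definiteness condition mentioned in the abstract has a simple linear-algebra reading: if the Gram matrix of stacked policy embeddings is positive definite, the embeddings are linearly independent, so no policy's representation is a combination of the others. The sketch below illustrates this check; the shapes, the Gram-style construction, and the eigenvalue threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch: stack one embedding vector per policy into a matrix E,
# form a Gram-style "dispersion" matrix D = E E^T of pairwise inner products,
# and test positive definiteness via its eigenvalues. All eigenvalues > 0
# means the embeddings are linearly independent, i.e. the policies disagree.
rng = np.random.default_rng(0)
num_policies, embed_dim = 4, 8
E = rng.normal(size=(num_policies, embed_dim))  # stand-in for learned embeddings

D = E @ E.T                       # dispersion (Gram) matrix, shape (4, 4)
eigvals = np.linalg.eigvalsh(D)   # eigvalsh: D is symmetric by construction
is_positive_definite = bool(np.all(eigvals > 1e-9))
print(is_positive_definite)
```

Random Gaussian vectors in a higher-dimensional space are almost surely linearly independent, so the check passes here; for embeddings collapsed onto each other (similar policies), the smallest eigenvalue would approach zero and the check would fail.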




