Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures

04/17/2002
by Marcus Hutter

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle t, action y_t results in perception x_t and reward r_t, where all quantities in general may depend on the complete history. The perception x_t and reward r_t are sampled from the (reactive) environmental probability distribution μ. This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if μ is known. Reinforcement learning is usually used if μ is unknown. In the Bayesian approach one defines a mixture distribution ξ as a weighted sum of distributions ν∈ℳ, where ℳ is any class of distributions including the true environment μ. We show that the Bayes-optimal policy p^ξ based on the mixture ξ is self-optimizing in the sense that the average value converges asymptotically for all μ∈ℳ to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ, which knows μ in advance. We show that the necessary condition that ℳ admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on ℳ. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν∈ℳ and a strictly higher value in at least one.
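For concreteness, the central objects of the abstract can be written out as follows. This is a sketch using notation consistent with the abstract; the weight symbols w_ν, the value function V, and the finite-horizon normalization by m are assumptions filled in from standard conventions in Bayesian reinforcement learning, not details stated above.

    % Bayes mixture over the class M; the weights w_nu are assumed
    % positive and summing to one (standard but not stated in the abstract):
    \xi(x_{1:t} \mid y_{1:t}) = \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x_{1:t} \mid y_{1:t}),
    \qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu = 1

    % Value of policy p in environment mu: expected total reward over cycles 1..m
    V_\mu^{p}(m) = \mathbf{E}_\mu^{p}\Bigl[\textstyle\sum_{t=1}^{m} r_t\Bigr]

    % Self-optimizing: the average value of p^xi approaches that of the
    % informed policy p^mu, for every environment in the class:
    \tfrac{1}{m} V_\mu^{p^\xi}(m) - \tfrac{1}{m} V_\mu^{p^\mu}(m)
    \;\xrightarrow{\,m \to \infty\,}\; 0 \qquad \text{for all } \mu \in \mathcal{M}

    % Pareto-optimal: no policy p weakly dominates p^xi on all of M
    % and strictly beats it in at least one environment:
    \neg\exists p: \; V_\nu^{p} \ge V_\nu^{p^\xi} \;\; \forall \nu \in \mathcal{M}
    \;\text{ and }\; V_{\nu_0}^{p} > V_{\nu_0}^{p^\xi} \;\text{ for some } \nu_0 \in \mathcal{M}

Note that the self-optimizing claim concerns the average value, i.e. the value normalized by the horizon, which is what makes asymptotic convergence attainable in this generality.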


Related research

06/30/2019
Detecting Spiky Corruption in Markov Decision Processes
Current reinforcement learning methods fail if the reward function is im...

09/23/2020
CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Reinforcement learning algorithms solve sequential decision-making probl...

11/08/2018
Meta-Learning for Multi-objective Reinforcement Learning
Multi-objective reinforcement learning (MORL) is the generalization of s...

11/28/2016
Nonparametric General Reinforcement Learning
Reinforcement learning (RL) problems are often phrased in terms of Marko...

08/18/2023
Intrinsically Motivated Hierarchical Policy Learning in Multi-objective Markov Decision Processes
Multi-objective Markov decision processes are sequential decision-making...

02/25/2016
Thompson Sampling is Asymptotically Optimal in General Environments
We discuss a variant of Thompson sampling for nonparametric reinforcem...

04/20/2015
Optimal Nudging: Solving Average-Reward Semi-Markov Decision Processes as a Minimal Sequence of Cumulative Tasks
This paper describes a novel method to solve average-reward semi-Markov ...
