Chaining Value Functions for Off-Policy Learning

01/17/2022
by   Simon Schmitt, et al.

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn "off-policy" about policies that differ from the policy used to generate its experience. This is important for learning counterfactuals, or when the experience was generated outside the agent's control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms that are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable, and hence the complete algorithm is guaranteed to be stable. Under mild conditions, the chain's estimate comes arbitrarily close to the off-policy TD solution as the length of the chain increases. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective, which we call a "k-step expedition", of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
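The chaining idea can be illustrated with a minimal sketch in the simplest possible setting. It assumes a small tabular MDP with a known model and exact expectations, which is a simplification: the abstract concerns sample-based prediction, and no function approximation or importance sampling appears here. All names (`policy_model`, `on_policy_values`, `chained_values`) and the random toy MDP are hypothetical and chosen only for illustration. The first link of the chain is the on-policy value of the behaviour policy, and each further link bootstraps one step of the target policy on the previous link, so link k is the value of a "k-step expedition".

```python
import numpy as np


def policy_model(P, r, policy):
    """Average transitions P[s, a, s'] and rewards r[s, a] over policy[s, a]."""
    P_pi = np.einsum("sa,sat->st", policy, P)
    r_pi = np.einsum("sa,sa->s", policy, r)
    return P_pi, r_pi


def on_policy_values(P_pi, r_pi, gamma):
    """Exact values of a policy: solve (I - gamma * P_pi) v = r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def chained_values(P, r, behaviour, target, gamma, chain_length):
    """Return [v^0, ..., v^K], where v^0 is the behaviour policy's value and
    v^k = r_target + gamma * P_target @ v^(k-1) (a k-step expedition)."""
    P_mu, r_mu = policy_model(P, r, behaviour)
    P_pi, r_pi = policy_model(P, r, target)
    chain = [on_policy_values(P_mu, r_mu, gamma)]
    for _ in range(chain_length):
        # Bootstrap one step of the target policy on the previous estimate.
        chain.append(r_pi + gamma * P_pi @ chain[-1])
    return chain


# Hypothetical toy MDP: random transitions and rewards (illustration only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)  # each P[s, a, :] sums to one
r = rng.random((n_states, n_actions))
behaviour = np.full((n_states, n_actions), 0.5)  # uniform behaviour policy
target = np.tile([0.9, 0.1], (n_states, 1))      # target prefers action 0

chain = chained_values(P, r, behaviour, target, gamma, chain_length=50)
v_target = on_policy_values(*policy_model(P, r, target), gamma)
errors = [np.max(np.abs(v - v_target)) for v in chain]
```

In this exact tabular setting, the max-norm distance between link k and the target policy's value shrinks by at least a factor of `gamma` per link, so a longer chain gives a closer estimate. With function approximation, the conditions and guarantees are those stated in the paper, not those of this sketch.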

research
04/20/2002

Learning from Scarce Experience

Searching the space of policies directly for the optimal policy has been...
research
12/28/2022

Lexicographic Multi-Objective Reinforcement Learning

In this work we introduce reinforcement learning techniques for solving ...
research
02/26/2020

Policy Evaluation Networks

Many reinforcement learning algorithms use value functions to guide the ...
research
06/08/2016

Safe and Efficient Off-Policy Reinforcement Learning

In this work, we take a fresh look at some old and new algorithms for of...
research
01/09/2020

Population-Guided Parallel Policy Search for Reinforcement Learning

In this paper, a new population-guided parallel learning scheme is propo...
research
11/15/2018

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Learning policies on data synthesized by models can in principle quench ...
research
05/21/2018

Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

Multiple-step lookahead policies have demonstrated high empirical compet...
