Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

05/21/2017
by   Sahil Sharma, et al.

Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision-making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: the value function of the current state is moved towards a bootstrapped target that is estimated using the next state's value function. λ-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While λ-returns have been extensively studied in RL, they have not been explored much in Deep RL. This paper's first contribution is an exhaustive benchmarking of λ-returns. Although mathematically tractable, the exponentially decaying weighting of n-step return targets in λ-returns is a rather ad hoc design choice. Our second, major contribution is a generalization of λ-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. This allows the agent to decide how much weight to place on each n-step return target, whereas λ-returns restrict the agent to an exponentially decaying weighting scheme. Autodidactic returns can be used to improve any RL algorithm that uses TD learning. We empirically demonstrate that sophisticated weighted mixtures of multi-step returns (like CAR and λ-returns) considerably outperform plain n-step returns. We perform our experiments on the Asynchronous Advantage Actor-Critic (A3C) algorithm in the Atari 2600 domain.
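To make the contrast concrete, here is a minimal NumPy sketch of the two weighting schemes the abstract compares: the λ-return as a fixed, exponentially decaying mixture of n-step returns, and a CAR-style autodidactic return where the mixture weights come from a learned softmax instead. The function names and the use of raw logits as stand-ins for the network's predicted confidences are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def n_step_returns(rewards, next_values, gamma):
    """All n-step bootstrapped returns G^(n) from time 0.

    rewards:     r_0 .. r_{T-1} along a trajectory.
    next_values: V(s_1) .. V(s_T), the bootstrap values.
    Returns the array [G^(1), ..., G^(T)].
    """
    T = len(rewards)
    returns = np.empty(T)
    discounted_reward_sum = 0.0
    for n in range(1, T + 1):
        discounted_reward_sum += gamma ** (n - 1) * rewards[n - 1]
        returns[n - 1] = discounted_reward_sum + gamma ** n * next_values[n - 1]
    return returns

def lambda_return(rewards, next_values, gamma, lam):
    """Fixed exponential mixture: (1 - lam) * sum_n lam^(n-1) * G^(n),
    with the remaining tail mass placed on the longest return."""
    G = n_step_returns(rewards, next_values, gamma)
    T = len(G)
    w = (1.0 - lam) * lam ** np.arange(T)
    w[-1] += lam ** T  # truncated trajectory: tail mass goes to G^(T)
    return float(w @ G)

def autodidactic_return(rewards, next_values, gamma, confidence_logits):
    """CAR-style mixture (sketch): the weights are a softmax over learned
    per-step confidences rather than a fixed exponential decay."""
    G = n_step_returns(rewards, next_values, gamma)
    w = np.exp(confidence_logits - confidence_logits.max())
    w /= w.sum()
    return float(w @ G)
```

As a sanity check on the limiting cases, `lam=0` recovers the 1-step TD target `r_0 + gamma * V(s_1)`, while `lam=1` recovers the full (Monte-Carlo-like) T-step return; the autodidactic variant with uniform logits averages all n-step returns equally.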


