DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training

01/18/2023
by   Philipp Altmann, et al.
0

We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer to store beneficial trajectories generated by the policy determined by their return. A discriminator network is trained concurrently to the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape thus learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments being able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.

READ FULL TEXT
research
06/23/2020

Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Adversarial imitation learning alternates between learning a discriminat...
research
12/03/2018

Generative Adversarial Self-Imitation Learning

This paper explores a simple regularizer for reinforcement learning by p...
research
07/24/2020

Bayesian Robust Optimization for Imitation Learning

One of the main challenges in imitation learning is determining what act...
research
07/19/2021

Reward-Weighted Regression Converges to a Global Optimum

Reward-Weighted Regression (RWR) belongs to a family of widely known ite...
research
05/25/2018

Learning Self-Imitating Diverse Policies

Deep reinforcement learning algorithms, including policy gradient method...
research
02/19/2019

Learning to Generalize from Sparse and Underspecified Rewards

We consider the problem of learning from sparse and underspecified rewar...
research
07/06/2018

Memory Augmented Policy Optimization for Program Synthesis with Generalization

This paper presents Memory Augmented Policy Optimization (MAPO): a novel...

Please sign up or login with your details

Forgot password? Click here to reset