
DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training

by Philipp Altmann, et al.
Universität München

We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer that stores the most beneficial trajectories generated by the policy, ranked by their return. A discriminator network is trained concurrently with the policy to distinguish trajectories generated by the current policy from beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT acts as a surrogate, steering policy optimization towards more valuable regions of the reward landscape and thus towards an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments, as the surrogate reward is able to direct the optimization towards valuable areas.
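The two components described above can be sketched in a few lines of code: a return-ranked imitation buffer and a binary discriminator whose output is turned into a surrogate reward. This is a minimal illustrative sketch, not the paper's implementation; the class names, the simple logistic discriminator, and the GAIL-style `-log(1 - D)` reward shaping are all assumptions made for clarity.

```python
import heapq
import numpy as np

class ImitationBuffer:
    """Keeps the k highest-return trajectories seen so far (min-heap on return)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []        # entries: (return, insertion_id, trajectory)
        self._count = 0       # tie-breaker so trajectories are never compared

    def add(self, trajectory, ret):
        item = (ret, self._count, trajectory)
        self._count += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif ret > self.heap[0][0]:        # beats the worst stored return
            heapq.heapreplace(self.heap, item)

    def sample(self, n, rng):
        idx = rng.integers(len(self.heap), size=n)
        return [self.heap[i][2] for i in idx]

class Discriminator:
    """Logistic classifier D(s) ~ P(s came from a buffered beneficial trajectory)."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def prob(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def update(self, buffer_states, policy_states):
        # One gradient step on binary cross-entropy:
        # buffered states are labeled 1, current-policy states 0.
        xs = np.vstack([buffer_states, policy_states])
        ys = np.concatenate([np.ones(len(buffer_states)),
                             np.zeros(len(policy_states))])
        p = self.prob(xs)
        self.w -= self.lr * ((p - ys) @ xs) / len(xs)
        self.b -= self.lr * np.mean(p - ys)

    def surrogate_reward(self, x):
        # GAIL-style shaping (an assumption here): reward is high where the
        # discriminator believes the state resembles stored beneficial data.
        return -np.log(1.0 - self.prob(x) + 1e-8)
```

In a full training loop, the policy would be optimized on this surrogate reward (or a mixture of surrogate and environment reward) while the discriminator is updated each iteration, so the surrogate continually tracks the gap between current behavior and the best trajectories found so far.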

