Generalized Data Distribution Iteration

06/07/2022
by Jiajun Fan, et al.

Achieving higher sample efficiency and superior final performance simultaneously has been one of the major challenges in deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address both concurrently. In this paper, we try to tackle the two challenges simultaneously. To achieve this, we first decouple them into two classic RL problems: data richness and the exploration-exploitation trade-off. We then cast these two problems as a training data distribution optimization problem, namely obtaining the desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of the behavior policy and ii) more fine-grained and adaptive control of the selective/sampling distribution of the behavior policy using a monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods from DQN to Agent57. We also provide a theoretical guarantee of the superiority of GDI over GPI. We demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.33 normalized score (HNS), 1146.39 records using only 200M training frames. Our performance is comparable to Agent57's while we consume 500 times less data. We argue that there is still a long way to go before obtaining real superhuman agents in ALE.
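The idea of adaptively controlling the sampling distribution over a family of behavior policies can be illustrated with a toy sketch. This is not the paper's algorithm or its operators: it swaps in a standard gradient-bandit update over a small, hypothetical policy family, and every name and number below (`run_toy_sampling_loop`, `mean_returns`, the step size) is an assumption made for illustration only.

```python
import math
import random


def softmax(prefs):
    """Turn preference values into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_toy_sampling_loop(num_iters=2000, step_size=0.1, seed=0):
    """Toy loop: adapt a sampling distribution over behavior policies.

    Assumption: each index stands for one behavior policy in a
    hypothetical family (for example, a different exploration rate),
    and mean_returns are made-up average episode returns.
    """
    rng = random.Random(seed)
    mean_returns = [0.2, 0.5, 0.8]
    prefs = [0.0] * len(mean_returns)  # parameters of the sampling distribution
    baseline = 0.0                     # running average of observed returns

    for t in range(1, num_iters + 1):
        probs = softmax(prefs)
        # Data collection: pick a behavior policy from the current distribution.
        k = rng.choices(range(len(prefs)), weights=probs)[0]
        ret = mean_returns[k] + rng.gauss(0.0, 0.1)
        # Distribution update (gradient-bandit step, an assumed stand-in):
        # shift probability mass toward policies with above-average return.
        for i in range(len(prefs)):
            indicator = 1.0 if i == k else 0.0
            prefs[i] += step_size * (ret - baseline) * (indicator - probs[i])
        baseline += (ret - baseline) / t
        # A full agent would also run policy evaluation and improvement
        # on the collected data at this point; that part is omitted here.

    return softmax(prefs)


if __name__ == "__main__":
    print(run_toy_sampling_loop())
```

In this sketch the distribution drifts toward the behavior policy with the highest assumed return, while the softmax keeps some probability on the others so that their data is still sampled occasionally.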

