Optimizing Data Collection in Deep Reinforcement Learning

07/15/2022
by James Gleeson, et al.

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU implementations of simulators incur high overhead from repeatedly switching back and forth between CPU simulation and GPU computation. We explore two optimizations that increase RL data collection efficiency by increasing GPU utilization: (1) GPU vectorization: parallelizing simulation on the GPU for increased hardware parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps into a single GPU kernel launch to reduce global memory bandwidth requirements. We find that GPU vectorization can achieve up to a 1024× speedup over commonly used CPU simulators. We profile the performance of different implementations and show that, for a simple simulator, an ML compiler implementation (XLA) of GPU vectorization outperforms a DNN framework implementation (PyTorch) by 13.4× by reducing the CPU overhead of repeated Python-to-DL-backend API calls. We show that simulator kernel fusion speedups with a simple simulator are 11.3× and increase by up to 1024× as simulator complexity increases in terms of memory bandwidth requirements. Finally, we show that the speedups from simulator kernel fusion are orthogonal to and combinable with GPU vectorization, leading to a multiplicative speedup.
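
To make the two optimizations concrete, here is a minimal, illustrative sketch (not the paper's code) of how GPU vectorization and multi-step fusion can be expressed for a toy point-mass simulator. It assumes JAX as a stand-in for the XLA path evaluated in the paper; the simulator dynamics, the DT constant, and the step/batched_step/rollout names are hypothetical.

```python
# Illustrative sketch only (not the authors' implementation): a toy point-mass
# simulator in JAX showing (1) GPU vectorization via jax.vmap across many
# environments and (2) a kernel-fusion-style optimization, where several
# simulation steps are compiled into a single XLA computation with
# jax.jit + lax.fori_loop instead of one framework call per step.
import jax
import jax.numpy as jnp

DT = 0.05  # integration time step (arbitrary choice for this toy example)

def step(state, action):
    """Advance one environment by one simulation step (toy dynamics)."""
    position, velocity = state
    velocity = velocity + DT * action
    position = position + DT * velocity
    return (position, velocity)

# (1) GPU vectorization: map the single-environment step over a batch of
# environments so all of them advance in parallel on the accelerator.
batched_step = jax.vmap(step)

# (2) Multi-step fusion: run several steps inside one jitted function, so XLA
# compiles the whole loop into one computation and the per-step Python /
# framework call overhead is paid once per rollout rather than once per step.
@jax.jit
def rollout(state, actions):
    def body(i, s):
        return batched_step(s, actions[i])
    return jax.lax.fori_loop(0, actions.shape[0], body, state)

if __name__ == "__main__":
    num_envs, num_steps = 1024, 64
    key = jax.random.PRNGKey(0)
    positions = jnp.zeros(num_envs)
    velocities = jnp.zeros(num_envs)
    actions = jax.random.normal(key, (num_steps, num_envs))
    final_pos, final_vel = rollout((positions, velocities), actions)
    print(final_pos.shape, final_vel.shape)  # (1024,) (1024,)
```

In this sketch, jax.vmap supplies the vectorization across environments, while jitting the lax.fori_loop approximates the paper's kernel fusion idea: many simulation steps execute inside one compiled computation rather than incurring a separate launch and API call per step. Because the two changes address different bottlenecks (hardware parallelism versus per-step launch and memory-traffic overhead), their benefits can compose, consistent with the multiplicative speedup described in the abstract.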
