Efficient RLHF: Reducing the Memory Usage of PPO

09/01/2023
by Michael Santacroce, et al.

Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible for most practitioners to use. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. using LoRA during PPO reduces its memory usage below that of SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65%. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread use of RLHF.
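The two ideas in the abstract, sharing one backbone between the policy and the reward/reference models and toggling LoRA adapters off to recover the frozen base behavior, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the class names (LoRALinear, HydraModel), the toy backbone, and the toggle mechanism are hypothetical stand-ins for the approach the abstract describes.

```python
# Minimal sketch (assumed, not the authors' code): a "hydra" model whose frozen
# backbone feeds both an LM (policy) head and a scalar reward head, with LoRA
# adapters that can be switched off so the same weights act as the frozen
# reference / reward model during PPO.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank adapter that can be toggled."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # backbone stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # adapter starts as a no-op
        self.scaling = alpha / rank
        self.enabled = True                      # flip to turn LoRA "off"

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scaling * self.lora_b(self.lora_a(x))
        return out


class HydraModel(nn.Module):
    """Shared backbone with both an LM (policy) head and a reward head."""

    def __init__(self, vocab_size=1000, d_model=128, rank=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = LoRALinear(nn.Linear(d_model, d_model), rank=rank)
        self.lm_head = nn.Linear(d_model, vocab_size)   # policy / SFT head
        self.reward_head = nn.Linear(d_model, 1)        # reward / value head

    def set_lora(self, enabled: bool):
        for m in self.modules():
            if isinstance(m, LoRALinear):
                m.enabled = enabled

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))
        return self.lm_head(h), self.reward_head(h).squeeze(-1)


model = HydraModel()
tokens = torch.randint(0, 1000, (2, 16))

model.set_lora(True)                  # adapted weights act as the PPO policy
policy_logits, values = model(tokens)

model.set_lora(False)                 # same weights, adapters off: frozen
with torch.no_grad():                 # reference / reward behavior
    ref_logits, rewards = model(tokens)
```

Under this reading, turning the adapters off lets the reference and reward passes reuse the shared frozen weights instead of keeping separate full model copies in memory, which is the source of the memory savings the abstract reports.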

Related research
Aligning Language Models with Offline Reinforcement Learning from Human Feedback (08/23/2023)
Learning from human preferences is crucial for language models (LMs) to ...

Efficient Fine-Tuning of BERT Models on the Edge (05/03/2022)
Resource-constrained devices are increasingly the deployment targets of ...

Optimal Heap Limits for Reducing Browser Memory Use (04/22/2022)
Garbage collected language runtimes must carefully tune heap limits to r...

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models (05/19/2023)
A centerpiece of the ever-popular reinforcement learning from human feed...

Reinforced Self-Training (ReST) for Language Modeling (08/17/2023)
Reinforcement learning from human feedback (RLHF) can improve the qualit...

The Weighted Tsetlin Machine: Compressed Representations with Weighted Clauses (11/28/2019)
The Tsetlin Machine (TM) is an interpretable mechanism for pattern recog...

Fine-Tuning Language Models with Advantage-Induced Policy Alignment (06/04/2023)
Reinforcement learning from human feedback (RLHF) has emerged as a relia...
