Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

03/07/2023
by   Zhiwei Tang, et al.

In this paper, we study a novel optimization problem in which the objective function is a black box that can only be evaluated through a ranking oracle. This setting is common in real-world applications, particularly when the function is assessed by human judges. Reinforcement Learning with Human Feedback (RLHF) is a prominent example, which recent works <cit.> have adopted to improve the quality of Large Language Models (LLMs) with human guidance. We propose ZO-RankSGD, a first-of-its-kind zeroth-order optimization algorithm that solves this problem with a theoretical guarantee. Specifically, our algorithm employs a new rank-based random estimator of the descent direction and is proven to converge to a stationary point. ZO-RankSGD can also be applied directly to the policy-search problem in reinforcement learning when only a ranking oracle over episode rewards is available. This makes ZO-RankSGD a promising alternative to existing RLHF methods, as it optimizes in an online fashion and therefore requires no pre-collected data. Furthermore, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model using human ranking feedback. In our experiments, we find that ZO-RankSGD significantly enhances the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and it offers an effective approach for aligning human and machine intentions across a wide range of domains. Our code is available at <https://github.com/TZW1998/Taming-Stable-Diffusion-with-Human-Ranking-Feedback>.
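To make the idea concrete, the following is a minimal sketch of how a rank-based zeroth-order descent step could look. It is not the paper's exact ZO-RankSGD estimator: the centered-rank weighting, the function and parameter names (`zo_rank_step`, `m`, `mu`, `lr`), and the hyperparameter values are illustrative assumptions. The key property it shares with the setting described above is that the oracle reveals only the ranking of query points, never function values.

```python
import numpy as np

def zo_rank_step(x, rank_oracle, m=8, mu=0.1, lr=0.01, rng=None):
    """One step of a rank-based zeroth-order descent (illustrative sketch,
    NOT the paper's exact ZO-RankSGD estimator).

    rank_oracle(points) must return the indices of `points` sorted from
    best (lowest objective) to worst -- no function values are revealed.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((m, x.size))              # random perturbation directions
    order = rank_oracle([x + mu * ui for ui in u])    # ranking feedback only
    ranks = np.empty(m)
    ranks[order] = np.arange(1, m + 1)                # ranks[i] = rank of direction i
    weights = ranks - (m + 1) / 2                     # centered: worst -> +, best -> -
    g_hat = (weights @ u) / (m * mu)                  # rank-weighted ascent estimate
    return x - lr * g_hat                             # move against the estimate

# Demo: minimize f(x) = ||x||^2 using only ranking feedback.
f = lambda p: float(np.dot(p, p))
oracle = lambda pts: np.argsort([f(p) for p in pts])

x = np.ones(5)
rng = np.random.default_rng(0)
best_f = f(x)
for _ in range(300):
    x = zo_rank_step(x, oracle, rng=rng)
    best_f = min(best_f, f(x))
```

Because the update direction depends only on the ordering of the queried points, the same loop runs unchanged when the oracle is a human ranking a batch of candidates (e.g., generated images) instead of a programmatic comparison.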


