Everyone Deserves A Reward: Learning Customized Human Preferences

09/06/2023
by Pengyu Cheng, et al.

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences and improving interaction quality. However, the real world is pluralistic: human preferences diverge across religions, politics, cultures, and other factors, and each individual can hold unique preferences on various topics. Neglecting this diversity, current LLM training pipelines rely on a single general reward model, which falls short in customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which gathers preferred responses to each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, whose effectiveness is empirically verified on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies across the three learning stages and identify several ways to better preserve general preference ability while training customized RMs, especially general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.
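The abstract does not spell out the RM training objective, but reward models for preference alignment are typically trained with a pairwise (Bradley-Terry) ranking loss over chosen and rejected responses. The sketch below is a minimal, hypothetical PyTorch illustration of that standard objective; the RewardModel head and pairwise_preference_loss names are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Hypothetical scalar reward head on top of pooled LM hidden states.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: (batch, hidden_size) representation of a (query, response) pair
        return self.value_head(pooled_hidden).squeeze(-1)  # (batch,)

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the preferred response should receive the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example usage with random features standing in for LM outputs.
if __name__ == "__main__":
    model = RewardModel(hidden_size=768)
    chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
    loss = pairwise_preference_loss(model(chosen), model(rejected))
    loss.backward()

In practice the pooled hidden states would come from the same LM backbone scoring the query-response pair; a customized RM would additionally condition on or be fine-tuned for a specific preference domain, as the paper's three-stage scheme suggests.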
