RL agents Implicitly Learning Human Preferences

02/14/2020
by Nevan Wichers, et al.

In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict whether a simulated human's preferences are fulfilled, using the activations of an RL agent's neural network as input, achieves an AUC of 0.93, while training the same classifier on the raw environment state achieves an AUC of only 0.80. Training the classifier on the RL agent's activations also performs much better than training it on activations from an autoencoder. The human preference classifier can then be used as the reward function of an RL agent, making that agent more beneficial for humans.
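A minimal sketch of the comparison described above, assuming rollout data has already been collected: the file names, the logistic-regression probe, and the 80/20 split are illustrative choices rather than details from the paper; only the pattern of training a preference classifier on agent activations versus raw environment states and comparing held-out AUC follows the abstract.

```python
# Hypothetical sketch: probe an RL agent's hidden activations for information
# about whether a simulated human's preferences were fulfilled, and compare
# against a probe trained on the raw environment state. All file names and
# the choice of classifier are assumptions, not taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auc(features: np.ndarray, labels: np.ndarray) -> float:
    """Train a classifier on `features` to predict preference fulfillment
    and report AUC on a held-out split."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])

# Placeholder data, assumed to come from rollouts of the trained RL agent in
# an environment containing a simulated human (hypothetical files).
activations = np.load("agent_activations.npy")   # RL agent hidden activations
raw_states = np.load("raw_env_states.npy")       # raw environment observations
labels = np.load("preference_labels.npy")        # 1 if the preference was fulfilled

print("AUC from agent activations:", probe_auc(activations, labels))  # abstract reports 0.93
print("AUC from raw state:", probe_auc(raw_states, labels))           # abstract reports 0.80
```

The same probe applied to autoencoder activations would give the third baseline the abstract mentions, and the trained preference classifier could then serve as the reward signal for a new RL agent.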

Related research

DIP-RL: Demonstration-Inferred Preference Learning in Minecraft (07/22/2023)
In machine learning for sequential decision-making, an algorithmic agent...

Preferences Implicit in the State of the World (02/12/2019)
Reinforcement learning (RL) agents optimize only the features specified ...

Planning to Give Information in Partially Observed Domains with a Learned Weighted Entropy Model (05/21/2018)
In many real-world robotic applications, an autonomous agent must act wi...

Prompt-Tuning Decision Transformer with Preference Ranking (05/16/2023)
Prompt-tuning has emerged as a promising method for adapting pre-trained...

Modelling non-reinforced preferences using selective attention (07/25/2022)
How can artificial agents learn non-reinforced preferences to continuous...

Bob and Alice Go to a Bar: Reasoning About Future With Probabilistic Programs (08/09/2021)
Agent preferences should be specified stochastically rather than determi...

AI safety via debate (05/02/2018)
To make AI systems broadly useful for challenging real-world tasks, we n...
