AI safety via debate

05/02/2018
by Geoffrey Irving, et al.

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self-play on a zero-sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
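To make the protocol concrete, here is a minimal sketch of the debate game loop in Python. The Agent and Judge interfaces (make_statement, pick_winner) are hypothetical stand-ins for illustration, not an API from the paper.

```python
# Minimal sketch of the zero-sum debate game, assuming hypothetical
# Agent and Judge interfaces; the paper does not prescribe a specific API.

def run_debate(question, agent_a, agent_b, judge, num_statements):
    """Two agents alternate short statements up to a limit, then the judge
    decides which agent gave the most true, useful information."""
    transcript = []
    for turn in range(num_statements):
        agent = agent_a if turn % 2 == 0 else agent_b
        # Each statement can condition on the question and everything said so far.
        statement = agent.make_statement(question, transcript)
        transcript.append((agent.name, statement))
    # Zero-sum payoff: the winner scores +1, the loser -1.
    winner = judge.pick_winner(question, transcript)
    return winner, transcript
```

In the paper's MNIST experiment, the "statements" are individual revealed pixels from the image under debate, and the judge is a classifier pre-trained to predict from only a few pixels, so the same loop applies with a machine judge standing in for a human.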


Related research

Designing a Safe Autonomous Artificial Intelligence Agent based on Human Self-Regulation (01/05/2017)
There is a growing focus on how to design safe artificial intelligent (A...

On the Utility of Learning about Humans for Human-AI Coordination (10/13/2019)
While we would like agents that can coordinate with humans, current algo...

Evidence of behavior consistent with self-interest and altruism in an artificially intelligent agent (01/05/2023)
Members of various species engage in altruism–i.e. accepting personal co...

Collaborating with Humans without Human Data (10/15/2021)
Collaborating with humans requires rapidly adapting to their individual ...

RL agents Implicitly Learning Human Preferences (02/14/2020)
In the real world, RL agents should be rewarded for fulfilling human pre...

Rat big, cat eaten! Ideas for a useful deep-agent protolanguage (03/17/2020)
Deep-agent communities developing their own language-like communication ...

Chatbots as Problem Solvers: Playing Twenty Questions with Role Reversals (01/01/2023)
New chat AI applications like ChatGPT offer an advanced understanding of...
