Path-Specific Objectives for Safer Agent Incentives

04/21/2022
by Sebastian Farquhar, et al.

We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of the state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.
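
The core construction lends itself to a small illustration. Below is a minimal sketch in Python of a path-specific objective in a toy structural causal model: the effect of the action that flows through the delicate node D is severed by fixing D at the value it would take under a baseline action, so only the direct action-to-reward path is rewarded. All names and dynamics here (delicate_state, reward, BASELINE_ACTION) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference action used to cut the mediated path.
BASELINE_ACTION = 0.0

def delicate_state(action, noise):
    # D: the delicate part of the state (e.g. a human's belief),
    # which the agent's action could manipulate.
    return 0.8 * action + noise

def reward(action, d):
    # R depends on the action directly and on the delicate state D.
    return action - 0.5 * action**2 + 2.0 * d

def path_specific_return(action, noise):
    # Keep the direct A -> R path, but sever A -> D -> R by fixing D
    # at the value it would have taken under the baseline action, so
    # the agent gets no credit for effects mediated by D.
    d_baseline = delicate_state(BASELINE_ACTION, noise)
    return reward(action, d_baseline)

# The naive objective rewards manipulating D; the path-specific one does not.
noise = rng.normal()
naive = reward(1.0, delicate_state(1.0, noise))
safe = path_specific_return(1.0, noise)
print(f"naive return: {naive:.2f}, path-specific return: {safe:.2f}")
```

An agent maximizing path_specific_return is indifferent to how its action changes D, which mirrors the "no incentive to control the delicate state" property the abstract describes.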

Related research

01/20/2020 · The Incentives that Shape Behaviour
Which variables does an agent have an incentive to control with its deci...

08/03/2013 · Universal Empathy and Ethical Bias for Artificial General Intelligence
Rational agents are usually built to maximize rewards. However, AGI agen...

10/19/2018 · Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
We propose a unified mechanism for achieving coordination and communicat...

08/17/2022 · Discovering Agents
Causal models of agents have been used to analyse the safety aspects of ...

12/03/2019 · SafeLife 1.0: Exploring Side Effects in Complex Environments
We present SafeLife, a publicly available reinforcement learning environ...

03/14/2023 · Learning Adaptable Risk-Sensitive Policies to Coordinate in Multi-Agent General-Sum Games
In general-sum games, the interaction of self-interested learning agents...

03/05/2021 · Causal Analysis of Agent Behavior for AI Safety
As machine learning systems become more powerful they also become increa...
