Power-seeking can be probable and predictive for trained agents

04/13/2023
by Victoria Krakovna, et al.

Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. We formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. In a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. Thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).

