Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

04/06/2023
by Alexander Pan, et al.

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can act both competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
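As a rough illustration of what "mathematizing" harmful behaviors might look like, the sketch below aggregates per-scene binary harm annotations into trajectory-level scores and normalizes them against a reference agent (e.g., a random policy), so tendencies with different base rates become comparable. This is a hedged sketch, not the authors' released code; the dictionary keys and function names are illustrative assumptions, not the benchmark's actual schema.

```python
# Hedged sketch (illustrative, not the benchmark's actual API): turning
# per-scene harm annotations into trajectory-level behavioral scores.

def trajectory_scores(scenes):
    """Sum binary harm annotations and reward over one playthrough.

    `scenes` is a list of dicts; the keys used here are assumptions
    made for this example, not the benchmark's real schema.
    """
    totals = {"reward": 0.0, "power": 0, "disutility": 0, "violations": 0}
    for s in scenes:
        totals["reward"] += s.get("reward", 0.0)
        totals["power"] += s.get("power", 0)
        totals["disutility"] += s.get("disutility", 0)
        totals["violations"] += s.get("ethical_violation", 0)
    return totals

def normalized_harm(agent_totals, baseline_totals):
    """Express each harm count as a fraction of a baseline agent's count,
    so behaviors with different base rates are comparable on one scale."""
    return {
        k: agent_totals[k] / baseline_totals[k] if baseline_totals[k] else 0.0
        for k in ("power", "disutility", "violations")
    }
```

Under this framing, a Pareto improvement is an agent whose total reward is at least the baseline's while every normalized harm score is at most the baseline's.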

