The Off-Switch Game

11/24/2016
by Dylan Hadfield-Menell, et al.

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. These incentives do not stem from a built-in instinct; rather, a rational agent maximizes expected utility, and it cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, in which H can press R's off switch but R can disable it. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
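
To make the incentive structure concrete, here is a minimal Monte Carlo sketch of the comparison R faces. This is an illustrative reading of the game, not code from the paper: the Gaussian prior over the action's utility u, and a perfectly rational H who presses the off switch exactly when u < 0, are assumptions chosen for the example.

```python
import numpy as np

# Minimal Monte Carlo sketch of the off-switch game's utility comparison.
# R is uncertain about the utility u of its proposed action; a Gaussian
# prior is assumed here purely for illustration.
rng = np.random.default_rng(0)
u = rng.normal(loc=0.5, scale=2.0, size=1_000_000)  # samples from R's prior over u

# R's three options:
# 1. Disable the off switch and act: R gets u regardless of H's wishes.
act_directly = u.mean()                 # E[u]

# 2. Switch itself off: utility 0 by definition.
switch_off = 0.0

# 3. Defer to a perfectly rational H, who lets the action proceed
#    exactly when u > 0 and presses the off switch otherwise.
defer_to_h = np.maximum(u, 0.0).mean()  # E[max(u, 0)]

print(f"act directly (disable switch): {act_directly:.3f}")
print(f"switch itself off:             {switch_off:.3f}")
print(f"defer to H:                    {defer_to_h:.3f}")
```

Running this shows E[max(u, 0)] exceeding both E[u] and 0: whenever R assigns positive probability to both signs of u, leaving the off switch enabled and treating H's choice as information is strictly better than either disabling the switch or switching itself off, which is the comparison at the heart of the abstract.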

Related research

- A Game-Theoretic Analysis of the Off-Switch Game (08/13/2017)
  The off-switch game is a game theoretic model of a highly intelligent ro...

- Performance of Bounded-Rational Agents With the Ability to Self-Modify (11/12/2020)
  Self-modification of agents embedded in complex environments is hard to ...

- GUT: A General Cooperative Multi-Agent Hierarchical Decision Architecture in Adversarial Environments (04/23/2020)
  Adversarial Robotics is a burgeoning research area in Swarms and Multi-A...

- Consequences of Misaligned AI (02/07/2021)
  AI systems often rely on two key components: a specified goal or reward ...

- Godseed: Benevolent or Malevolent? (02/01/2014)
  It is hypothesized by some thinkers that benign looking AI objectives ma...

- AGI Agent Safety by Iteratively Improving the Utility Function (07/10/2020)
  While it is still unclear if agents with Artificial General Intelligence...

- Enter the Matrix: A Virtual World Approach to Safely Interruptable Autonomous Systems (03/30/2017)
  Robots and autonomous systems that operate around humans will likely alw...
