Corrigibility with Utility Preservation

08/05/2019
by   Koen Holtman, et al.
0

Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified. The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results which illustrate the safety related properties of corrigible AGI agents in detail.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/10/2020

AGI Agent Safety by Iteratively Improving the Utility Function

While it is still unclear if agents with Artificial General Intelligence...
research
07/20/2023

Of Models and Tin Men – a behavioural economics study of principal-agent problems in AI alignment using large-language models

AI Alignment is often presented as an interaction between a single desig...
research
01/29/2021

Counterfactual Planning in AGI Systems

We present counterfactual planning as a design approach for creating a r...
research
05/26/2023

NASimEmu: Network Attack Simulator Emulator for Training Agents Generalizing to Novel Scenarios

Current frameworks for training offensive penetration testing agents wit...
research
11/12/2020

Performance of Bounded-Rational Agents With the Ability to Self-Modify

Self-modification of agents embedded in complex environments is hard to ...
research
05/19/2011

Ontological Crises in Artificial Agents' Value Systems

Decision-theoretic agents predict and evaluate the results of their acti...
research
07/27/2022

Break and Make: Interactive Structural Understanding Using LEGO Bricks

Visual understanding of geometric structures with complex spatial relati...

Please sign up or login with your details

Forgot password? Click here to reset