Experiments with Detecting and Mitigating AI Deception

06/26/2023
by Ismail Sahbane, et al.

How to detect and mitigate deceptive AI systems is an open problem in the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception: the first is based on the path-specific objectives framework, in which paths in the game that incentivise deception are removed; the second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
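The shielding idea described above can be illustrated with a minimal sketch: at each step a monitor checks the action proposed by the learned policy and, if it is judged unsafe (e.g. deceptive), substitutes the action of a safe reference policy. This is not the authors' implementation; the environment interface, the policies, and the `is_unsafe` predicate below are hypothetical placeholders.

```python
# Minimal shielding sketch (illustrative only, not the paper's code).
# A shield monitors the learned policy and overrides unsafe actions
# with those of a safe reference policy.
from typing import Any, Callable

State = Any
Action = Any


def shielded_step(
    state: State,
    learned_policy: Callable[[State], Action],
    safe_reference_policy: Callable[[State], Action],
    is_unsafe: Callable[[State, Action], bool],
) -> Action:
    """Return the learned policy's action unless the shield flags it as unsafe."""
    action = learned_policy(state)
    if is_unsafe(state, action):
        # Shield triggers: fall back to the safe reference policy.
        action = safe_reference_policy(state)
    return action


def run_shielded_episode(env, learned_policy, safe_reference_policy, is_unsafe, max_steps=100):
    """Roll out one episode with the shield active, accumulating reward.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done).
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = shielded_step(state, learned_policy, safe_reference_policy, is_unsafe)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

In this sketch the agent is prevented from ever executing an action the monitor deems deceptive, while still following the learned policy whenever it is judged safe, which is consistent with the abstract's observation that shielding preserves more reward than removing deception-incentivising paths outright.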


