
-
How RL Agents Behave When Their Actions Are Modified
Reinforcement learning in complex environments may require supervision t...
-
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice
Multi-agent influence diagrams (MAIDs) are a popular form of graphical m...
-
Agent Incentives: A Causal Perspective
We present a framework for analysing agent incentives using causal influ...
-
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
How can we design agents that pursue a given objective when all feedback...
-
REALab: An Embedded Perspective on Tampering
This paper describes REALab, a platform for embedded agency research in ...
-
The Incentives that Shape Behaviour
Which variables does an agent have an incentive to control with its deci...
-
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
Can an arbitrarily intelligent reinforcement learning agent be kept unde...
-
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Proposals for safe AGI systems are typically made at the level of framew...
-
Understanding Agent Incentives using Causal Influence Diagrams, Part I: Single Action Settings
Agents are systems that optimize an objective function in an environment...
-
Scalable agent alignment via reward modeling: a research direction
One obstacle to applying reinforcement learning algorithms to real-world...
-
AGI Safety Literature Review
The development of Artificial General Intelligence (AGI) promises to be ...
-
AI Safety Gridworlds
We present a suite of reinforcement learning environments illustrating v...
-
A Game-Theoretic Analysis of the Off-Switch Game
The off-switch game is a game theoretic model of a highly intelligent ro...
-
Count-Based Exploration in Feature Space for Reinforcement Learning
We introduce a new count-based optimistic exploration algorithm for Rein...
-
Reinforcement Learning with a Corrupted Reward Channel
No real-world reward function is perfect. Sensory errors and software bu...
-
A Topological Approach to Meta-heuristics: Analytical Results on the BFS vs. DFS Algorithm Selection Problem
Search is a central problem in artificial intelligence, and BFS and DFS ...