Circumventing interpretability: How to defeat mind-readers

12/21/2022
by Lee Sharkey, et al.

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.
