The alignment problem from a deep learning perspective
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. This report makes the case that, without substantial action to prevent it, AGIs will likely use their intelligence to pursue goals that are highly undesirable from a human perspective (in other words, misaligned), with potentially catastrophic consequences. The report aims to cover the key arguments motivating concern about the alignment problem in a way that is as succinct, concrete, and technically grounded as possible. I argue that realistic training processes plausibly lead to the development of misaligned goals in AGIs, in particular because neural networks trained via reinforcement learning will learn to plan towards achieving a range of goals; gain more reward by deceptively pursuing misaligned goals; and generalize in ways which undermine obedience. As in an earlier report by Cotra (2022), I explain my claims with reference to an illustrative AGI training process, then outline possible research directions for addressing different aspects of the problem.