AI Safety and Reproducibility: Establishing Robust Foundations for the Neuroscience of Human Values

We propose the creation of a systematic effort to identify and replicate key findings in neuroscience and allied fields related to understanding human values. Our aim is to ensure that research underpinning the value alignment problem of artificial intelligence has been sufficiently validated to play a role in the design of AI systems.





I Anthropomorphic Design of Superintelligent AI Systems

There has been considerable discussion in recent years about the consequences of achieving human-level artificial intelligence. In a survey of top researchers in computer science, an aggregate forecast of 352 scientists assigned a 50% probability of human-level machine intelligence being realized within 45 years. In the same survey, 48% responded that greater emphasis should be placed on minimizing the societal risks of AI, an emerging area of study known as “AI safety” [1].


A distinct area of research within AI safety concerns software systems whose capacities substantially exceed those of human beings along every dimension, that is, superintelligence [2]. Within the framework of superintelligence theory, a core research topic known as the value alignment problem is to specify a goal structure for autonomous agents that is compatible with human values. The logic behind the framing of this problem is the following: Current software and AI systems are brittle and primitive, showing little capacity for generalized intelligence. However, ongoing research advances suggest that future systems may someday show fluid intelligence, creativity, and true thinking capacity. Defining the parameters of goal-directed behavior will be a necessary component of designing such systems. Because of the complex and intricate nature of human behavior and values, an emerging train of thought in the AI safety community is that such a goal structure will have to be inferred by software systems themselves, rather than pre-programmed by their human designers. Russell summarizes the notion of indirect inference of human values by stating three principles that should guide the development of AI systems [3]:

  1. The machine’s purpose must be to maximize the realization of human values. In particular, it has no purpose of its own and no innate desire to protect itself.

  2. The machine must be initially uncertain about what those human values are. The machine may learn more about human values as it goes along, but it may never achieve complete certainty.

  3. The machine must be able to learn about human values by observing the choices that we humans make.

In other words, rather than have a detailed ethical taxonomy programmed into them, AI systems should infer human values by observing and emulating our behavior [4, 5, 3].
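As an informal illustration of how values might be inferred from observed choices while uncertainty is retained, consider the following minimal Python sketch. It is not taken from [3, 4, 5]: the three candidate utility functions, the softmax ("noisily rational") choice model, and the rationality parameter `beta` are all assumptions introduced here purely for illustration.

```python
import math

# Hypothetical candidate utility functions over three options (illustrative only).
CANDIDATES = {
    "prefers_a": {"a": 1.0, "b": 0.0, "c": 0.0},
    "prefers_b": {"a": 0.0, "b": 1.0, "c": 0.0},
    "prefers_c": {"a": 0.0, "b": 0.0, "c": 1.0},
}


def choice_likelihood(utility, choice, options, beta=2.0):
    """Probability that a noisily rational agent picks `choice` (softmax model)."""
    weights = {o: math.exp(beta * utility[o]) for o in options}
    return weights[choice] / sum(weights.values())


def update(posterior, choice, options, beta=2.0):
    """One Bayesian update of the belief over candidate utility functions."""
    unnormalized = {
        name: p * choice_likelihood(CANDIDATES[name], choice, options, beta)
        for name, p in posterior.items()
    }
    total = sum(unnormalized.values())
    return {name: v / total for name, v in unnormalized.items()}


def infer(prior, observed_choices, options=("a", "b", "c"), beta=2.0):
    """Fold a sequence of observed choices into the prior belief."""
    posterior = dict(prior)
    for choice in observed_choices:
        posterior = update(posterior, choice, options, beta)
    return posterior


if __name__ == "__main__":
    # Principle 2: start out uncertain (uniform prior over the candidates).
    uniform_prior = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
    # Principle 3: learn from observed choices, one of which is "inconsistent".
    print(infer(uniform_prior, ["a", "a", "b", "a"]))
```

Because the assumed choice model assigns non-zero probability to every option, the belief over candidates sharpens with each observation but never reaches complete certainty, in the spirit of the second principle.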

In a recent article, we argued that ideas from affective neuroscience and related fields may play a key role in developing AI systems that can acquire human values. The broader context of this proposal is an inverse reinforcement learning (IRL) style paradigm in which an AI system infers the underlying utility function of an agent by observing its behavior. Our perspective is that a neuroscientific understanding of human values may play a role in characterizing the initially uncertain structure that the AI system refines over time. Having a more accurate initial goal structure may allow an agent to learn from fewer examples. For a system that is actively taking actions and having an impact on the world, a more efficient learning process can directly translate into a lower risk of adverse outcomes. As an example, we suggested that human values could be schematically and informally decomposed into three components: 1) mammalian values, 2) human cognition, and 3) several millennia of human social and cultural evolution [6]. This decomposition is simply one possible framing of the problem. There are major controversies within these fields and many avenues to approach the question of how neuroscience and cognitive psychology can inform the design of future AI systems. We refer to this broader perspective, i.e. building AI systems which possess structural commonalities with the human mind, as anthropomorphic design.
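The suggestion that a more accurate initial goal structure may allow learning from fewer examples can be made concrete with a toy calculation. The sketch below is an idealization introduced here, not a model from [6]: it assumes only two competing hypotheses about the agent's utility function, and that every observation favors the correct one by the same fixed likelihood ratio. The function name and all numerical values are hypothetical.

```python
import math


def observations_needed(prior_true, likelihood_ratio, target=0.95, max_steps=1000):
    """Count the observations required for the posterior probability of the
    correct hypothesis to reach `target`, working in log-odds.

    Assumes (for illustration) that each observation multiplies the odds in
    favor of the correct hypothesis by the same `likelihood_ratio` > 1.
    """
    log_odds = math.log(prior_true / (1.0 - prior_true))
    target_log_odds = math.log(target / (1.0 - target))
    step = math.log(likelihood_ratio)
    steps = 0
    while log_odds < target_log_odds and steps < max_steps:
        log_odds += step
        steps += 1
    return steps


if __name__ == "__main__":
    # Uninformed start: both hypotheses equally likely.
    print("uniform prior:    ", observations_needed(0.5, 2.0))
    # Better-informed start: prior already leans toward the correct hypothesis.
    print("informative prior:", observations_needed(0.8, 2.0))
```

Under these assumptions the better-informed prior reaches the same confidence threshold with fewer observations. The same arithmetic cuts the other way: a prior that leans toward the wrong hypothesis would require more observations than a uniform one, which is one reason the reliability of the underlying science matters.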

II Formal Models of Human Values and the Reproducibility Crisis

The connection of the value alignment problem to research in the biological and social sciences intertwines this work with another major topic in contemporary scientific discussion, the reproducibility crisis. Systematic studies conducted recently have uncovered astonishingly low rates of reproducibility in several areas of scientific inquiry [7, 8, 9]. Although we do not know what the “reproducibility distribution” looks like for the entirety of science, the shared incentive structures of academia suggest that we should view all research with some amount of skepticism.

How then do we prioritize research to be the focus of targeted replication efforts? Surely not all results merit the same level of scrutiny. Moreover, all areas likely have “linchpin results,” which, if verified, will substantially increase researchers’ confidence in entire bodies of knowledge. Therefore, a challenge for modern science is to efficiently identify areas of research and corresponding linchpin results that merit targeted replication efforts [10]. A natural strategy to pursue is to focus such efforts around major scientific themes or research agendas. The Reproducibility Projects of the Center for Open Science, for example, are targeted initiatives aimed at replicating key results in psychology and cancer biology [11, 12].

In a similar spirit, we propose a focused effort aimed at investigating and replicating results which underpin the neuroscience of human values. Artificial intelligence has already been woven into the fabric of modern society, a trend that will only increase in scope and pace in the coming decades. If, as we strongly believe, a neuroscientific understanding of human values plays a role in the design of future AI systems, it is essential that this knowledge base is thoroughly validated.

III Next Steps

We have deliberately left this commentary brief and open-ended. The topic is broad enough that it merits substantial discussion before proceeding. In addition to the obvious questions of which subjects and studies should fall under the umbrella of the reproducibility initiative that we are proposing, it is also worth asking how such an effort will be coordinated, for instance through a single research group or via a collaborative, open-science framework. Furthermore, this initiative should also be an opportunity to take advantage of novel scientific practices and strategies aimed at improving research quality, such as preprints, post-publication peer review, and pre-registration of study design.

It is also important to note that the specific task of replication is likely only applicable to a subset of results that are relevant to anthropomorphic design. There are legitimate scientific disagreements in these fields and many theories and frameworks that have yet to achieve consensus. Therefore, in addition to identifying those studies that are sufficiently concrete and precise to be the focus of targeted replication efforts, it is also our aim to identify important controversies that are of high value to resolve, for example, via special issues in journals, workshops, or more rapid, iterated discussion among experts.

Our overarching message: From philosophers pursuing fundamental theories of ethics, to artists immersed in crafting compelling emotional narratives, to ordinary individuals struggling with personal challenges, deep engagement with the nature of human values is a fundamental part of the human experience. As AI systems become more powerful and widespread, such an understanding may also prove to be important for ensuring the safety of these systems. We propose that enhancing the reliability of our knowledge of human values should be a priority for researchers and funding agencies concerned about AI safety and existential risks. We hope this brief note brings to light an important set of contemporary scientific issues and we are eager to collaborate with other researchers in order to take informed next steps.


We would like to thank Owain Evans for insightful discussions on the topics of value alignment and reproducibility in psychology and neuroscience.


  • [1] K. Grace, J. Salvatier, A. Dafoe, B. Zhang, and O. Evans, “When Will AI Exceed Human Performance? Evidence from AI Experts,” ArXiv e-prints, May 2017.
  • [2] N. Bostrom, Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
  • [3] S. Russell, “Should We Fear Supersmart Robots?,” Scientific American, vol. 314, no. 6, pp. 58–59, 2016.
  • [4] O. Evans, A. Stuhlmüller, and N. D. Goodman, “Learning the Preferences of Ignorant, Inconsistent Agents,” arXiv:1512.05832, 2015.
  • [5] O. Evans and N. D. Goodman, “Learning the Preferences of Bounded Agents,” in NIPS Workshop on Bounded Optimality, 2015.
  • [6] G. P. Sarma and N. J. Hay, “Mammalian Value Systems,” Informatica, forthcoming.
  • [7] M. R. Munafò, B. A. Nosek, D. V. Bishop, K. S. Button, C. D. Chambers, N. P. du Sert, U. Simonsohn, E.-J. Wagenmakers, J. J. Ware, and J. P. Ioannidis, “A manifesto for reproducible science,” Nature Human Behaviour, vol. 1, p. 0021, 2017.
  • [8] R. Horton, “What’s medicine’s 5 sigma?,” The Lancet, vol. 385, no. 9976, 2015.
  • [9] P. Campbell, ed., Challenges in Irreproducible Research, vol. 526, Nature Publishing Group, 2015.
  • [10] G. P. Sarma, “Doing Things Twice: Strategies to Identify Studies for Targeted Validation,” ArXiv e-prints, Mar. 2017.
  • [11] The Open Science Collaboration, “Estimating the reproducibility of psychological science,” Science, vol. 349, no. 6251, 2015.
  • [12] T. M. Errington, E. Iorns, W. Gunn, F. E. Tan, J. Lomax, and B. A. Nosek, “Science forum: An open investigation of the reproducibility of cancer biology research,” eLife, vol. 3, p. e04333, Dec. 2014.