Mammalian Value Systems

by Gopal P. Sarma, et al.
Emory University
berkeley college

Characterizing human values is a topic deeply interwoven with the sciences, humanities, art, and many other human endeavors. In recent years, a number of thinkers have argued that accelerating trends in computer science, cognitive science, and related disciplines foreshadow the creation of intelligent machines which meet and ultimately surpass the cognitive abilities of human beings, thereby entangling an understanding of human values with future technological development. Contemporary research accomplishments suggest sophisticated AI systems becoming widespread and responsible for managing many aspects of the modern world, from preemptively planning users' travel schedules and logistics, to fully autonomous vehicles, to domestic robots assisting in daily living. The extrapolation of these trends has been most forcefully described in the context of a hypothetical "intelligence explosion," in which the capabilities of an intelligent software agent would rapidly increase due to the presence of feedback loops unavailable to biological organisms. The possibility of superintelligent agents, or simply the widespread deployment of sophisticated, autonomous AI systems, highlights an important theoretical problem: the need to separate the cognitive and rational capacities of an agent from the fundamental goal structure, or value system, which constrains and guides the agent's actions. The "value alignment problem" is to specify a goal structure for autonomous agents compatible with human values. In this brief article, we suggest that recent ideas from affective neuroscience and related disciplines aimed at characterizing neurological and behavioral universals in the mammalian kingdom provide important conceptual foundations relevant to describing human values. We argue that the notion of "mammalian value systems" points to a potential avenue for fundamental research in AI safety and AI ethics.



1 Introduction

Artificial intelligence, a term coined in the 1950s at the now famous Dartmouth Conference, has come to have a widespread impact on the modern world (Russell and Norvig, 2009; Nilsson, 2009). If we broaden the phrase to include all software, and in particular, software responsible for the control and operation of physical machinery, planning and operations management, or other tasks requiring sophisticated information processing, then it goes without saying that artificial intelligence has become a critical part of the infrastructure supporting modern human society. Indeed, prominent venture capitalist Marc Andreessen famously wrote that “software is eating the world,” in reference to the ubiquitous deployment of software systems across all industries and organizations, and the corresponding growth of financial investment in software companies (Andreessen, 2011).

Nonetheless, there is a fundamental gap between the abilities of the most sophisticated software-based control systems today and the capacities of a human child or even many animals. Our AI systems have yet to display the capacity for learning, creativity, independent thought and discovery that define human intelligence. It is a near-consensus position, however, that at some point in the future, we will be able to create software-based agents whose cognitive capacities rival those of human beings. While there is substantial variability in researchers’ forecasts about the time-horizons of the critical breakthroughs and the consequences of achieving human-level artificial intelligence, there is little disagreement that it is an attainable milestone (Müller and Bostrom, 2016).

Some have argued that the creation of human-level artificial intelligence would be followed by an “intelligence explosion,” whereby the intelligence of the software-based system would rapidly increase due to its ability to analyze, model, and improve its cognition by re-writing its codebase, in a feat of self-improvement impossible for biological organisms. The net result would be a “superintelligence,” that is, an agent whose fundamental cognitive abilities vastly exceed our own (Bostrom, 2014; Shanahan, 2015; Good, 1965; Chalmers, 2010).

To be more explicit, let us consider a superintelligence to be any agent which can surpass the sum total of human cognitive and emotional abilities. These abilities might include intellectual tasks such as mathematical or scientific research, artistic invention in musical composition or poetry, political philosophy and the crafting of public policy, or social skills and the ability to recognize and respond to human emotions. Many commentators in recent years and decades have predicted that convergent advances in computer science, robotics, and related disciplines will give rise to the development of superintelligent machines during the 21st century (Müller and Bostrom, 2016).

If it is possible to create a superintelligence, then a number of natural questions arise: What would such an agent choose to do? What are the constraints that would guide its actions and to what degree can these actions be shaped by the designers? If a superintelligence can reason about and influence the world to a substantially greater degree than human beings themselves, how can we design a system to be compatible with human values? Is it even possible to formalize the notion of human values? Are human values a monolithic, internally consistent entity, or are there intrinsic conflicts and contradictions between the values of individuals and between the value systems of different cultures? (Bostrom, 2014; Chalmers, 2010; Yudkowsky, 2008; Russell, 2016; Omohundro, 2014, 2008).

It is our belief that the value alignment problem is of fundamental importance both for its relevance to near-term developments likely to be realized by the computer and robotics industries and for longer-term possibilities of more sophisticated AI systems leading to superintelligence. Furthermore, the broader set of problems posed by the realization of intelligent, autonomous, software-based agents may provide an important unifying framework that brings together disparate areas of inquiry spanning computer science, cognitive science, philosophy of mind, behavioral neuroscience, and anthropology, to name just a few.

In this article, we set aside the question of how, when, and if AI systems will be developed that are of sufficient sophistication to require a solution to the value alignment problem. This is a substantial topic in its own right which has been analyzed elsewhere. We assume the feasibility of these systems as a starting point for further analysis of the goal structures of autonomous agents and propose the notion of “mammalian value systems” as providing a framework for further research.

2 Goal Structures for Autonomous Agents

2.1 The Orthogonality Thesis

The starting point for discussing AI goal structures is the observation that the cognitive capacities of an intelligent agent are independent of the goal structure that constrains or guides the agent’s actions, what Bostrom calls the “orthogonality thesis”:

We have seen that a superintelligence could have a great ability to shape the future according to its goals. But what will its goals be? What is the relation between intelligence and motivation in an artificial agent? Here we develop two theses. The orthogonality thesis holds (with some caveats) that intelligence and final goals are independent variables: any level of intelligence could be combined with any final goal. The instrumental convergence thesis holds that superintelligent agents having any of a wide range of final goals will nevertheless pursue similar intermediary goals because they have common instrumental reasons to do so. Taken together, these theses help us to think about what a superintelligent agent would do. (Bostrom, 2014)

The orthogonality thesis allows us to illustrate the importance of autonomous agents being guided by human compatible goal structures, whether they are truly superintelligent as Bostrom envisions, or even more modestly intelligent but highly sophisticated AI systems likely to be developed in industry in the foreseeable future. Consider the example of a domestic robot that is able to clean the house, monitor a security system, and prepare meals independently and without human intervention. A robot with a slightly incorrect or inadequately specified goal structure might correctly infer that a household pet has high nutritional value to its owners, but not recognize its social and emotional relationship to the family. We can easily imagine the consequences for companies involved in creating domestic robots if a family dog or cat ends up on the dinner plate (Russell, 2016).

As the intelligent capabilities of an agent grow, the consequences of slight deviations from human values will become greatly magnified. The reason is that such an agent possesses increasing capacity to achieve its goals, however arbitrary those goals might be. It is for this reason that researchers concerned with the value alignment problem have distanced themselves from the fictitious and absurd scenarios portrayed in Hollywood thrillers. These movies often depict outright malevolent agents whose explicit aim is to destroy or enslave humanity. What is implicit in these stories is a goal structure that has been explicitly defined to be in opposition to human values. But as the simple example of the domestic robot illustrates, this is hardly the risk we face with sophisticated AI systems. The true risk is that if we incorrectly or inadequately specify the goals of a sufficiently capable agent, then it will devote its cognitive capacities to a task that is at odds with our values in ways that may be subtle or even bizarre. In the example given above, there was no malevolence or ulterior motive behind the robot making a nutritious meal out of the household pet. Rather, it simply did not recognize, due to the failure of its human designers, that the pet was valued by its owners, not for nutritional reasons, but rather for social and emotional ones (Yudkowsky, 2008; Russell, 2016).

2.2 Inferring Human Compatible Value Systems

An emerging train of thought among AI safety researchers is that a human compatible goal structure will have to be inferred by the AI system itself, rather than pre-programmed by the designers. The reason is that human values are rich and complex, and in addition, often contradictory and conflicting. Therefore, if we incorrectly specify what we think to be a safe goal structure, even slight deviations can be magnified and lead to detrimental consequences. On the other hand, if an AI system begins with an uncertain model of human values, and then begins to learn our values by observing our behavior, then we can substantially reduce the risks of a misspecified goal structure. Furthermore, just as we are more likely to trust mathematical calculations performed by a computer than by humans, if we build an AI system that we know to have greater capacity than ourselves at performing those cognitive operations required to infer the values of other agents by observing their behavior, then we gain the additional benefit of knowing that these operations will be performed with greater certainty and accuracy than were they to be pre-programmed by human AI researchers.

There is precedent in contemporary research for this kind of indirect inference, such as Inverse Reinforcement Learning (IRL) (Ng and Russell, 2000; Hadfield-Menell et al., 2016) or Bayesian Inverse Planning (BIP) (Baker et al., 2011). In these approaches, an agent learns the values, or utility function, of another agent, whether it is a human, an animal, or a software system, by observing its behavior. While these ideas are in their nascent stages, practical techniques have already been developed for designing AI systems (Evans et al., 2015; Evans and Goodman, 2015; Riedl and Harrison, 2016; Riedl, 2016).
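To make the idea of inferring values from behavior concrete, the following is a minimal sketch, not an implementation of IRL or BIP as developed in the cited work. It assumes a single-step setting in which an observed agent chooses among actions with probability proportional to the exponential of their utility (a noisily rational choice model), and it performs a Bayesian update over a small set of candidate value systems. The hypothesis names, the utilities, and the rationality parameter `beta` are all illustrative assumptions introduced here.

```python
import math

# Hypothetical candidate value systems (assumed for illustration only):
# each maps an observable action to the utility the owner assigns to it.
HYPOTHESES = {
    "values_companionship": {"play_with_pet": 1.0, "ignore_pet": 0.0},
    "indifferent_to_pet": {"play_with_pet": 0.5, "ignore_pet": 0.5},
}


def choice_likelihood(utility, chosen, beta=3.0):
    """P(chosen | utility) under a softmax (noisily rational) choice model."""
    weights = {action: math.exp(beta * u) for action, u in utility.items()}
    return weights[chosen] / sum(weights.values())


def update(posterior, chosen, beta=3.0):
    """One Bayesian update of the belief over hypotheses after one observed choice."""
    unnormalized = {
        name: p * choice_likelihood(HYPOTHESES[name], chosen, beta)
        for name, p in posterior.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


def infer_values(observed_choices, beta=3.0):
    """Start from a uniform (maximally uncertain) prior and refine it by observation."""
    posterior = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}
    for chosen in observed_choices:
        posterior = update(posterior, chosen, beta)
    return posterior


# The owner is observed playing with the pet five times.
posterior = infer_values(["play_with_pet"] * 5)
```

In this sketch the belief shifts strongly toward the hypothesis that the owner values companionship, yet it remains strictly below certainty after any finite number of observations, because the assumed choice model assigns nonzero probability to every action under every hypothesis.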

Russell summarizes the notion of indirect inference of human values by stating three principles that should guide the development of AI systems (Russell, 2016):

  1. The machine’s purpose must be to maximize the realization of human values. In particular, it has no purpose of its own and no innate desire to protect itself.

  2. The machine must be initially uncertain about what those human values are. The machine may learn more about human values as it goes along, but it may never achieve complete certainty.

  3. The machine must be able to learn about human values by observing the choices that we humans make.

There are almost certainly many conceptual and practical obstacles that lie ahead in designing a system that infers the values of human beings from observing our behavior. In particular, human desires can often be masked by many layers of conflicting emotions, they can often be inconsistent, and the desires of one individual may outright contradict the desires of another. In the context of a superintelligent agent capable of exerting substantial influence on the world (as opposed to a domestic robot), it is natural to ask about variations in the value systems of different cultures. It is often assumed that many human conflicts on a global scale stem from conflicts in the underlying value systems of the respective cultures or nation states. Is it even possible, therefore, for an AI system, no matter how intelligent, to arrive at a consensus goal structure that respects the desires of all people and cultures?

We make two observations in response to this important set of questions. The first is that when we say that cultures have conflicting values, implicit in this statement are our own limited cognitive capacities and ability to model the behavior and mental states of other individuals and groups. An AI system with capabilities vastly greater than ourselves may quickly perceive fundamental commonalities and avenues for conflict resolution that we are unable to envision.

To motivate this scenario, we give a highly simplified example from negotiation theory. A method known as “principled negotiation” distinguishes between values and positions (Fisher and Ury, 1987). As an example, if two friends are deciding on a restaurant for dinner, and one wants Indian food and the other Italian, it may be that the first person simply likes spicy food and the second person wants noodles. These preferences are the values, spicy food and noodles, that the corresponding positions, Indian and Italian, instantiate. In this school of thought, when two parties are attempting to resolve a conflict, they should negotiate from values, rather than positions. That is, if we have some desire that is in conflict with another, we should ask ourselves—whether in the context of a business negotiation, family dispute, or major international conflict—what the underlying value is that the desire reflects. By understanding the underlying values, we may see that there is a mutually satisfactory set of outcomes satisfying all parties that we failed to see initially. In this particular instance, if the friends are able to state their true underlying preferences, they may recognize that Thai cuisine will satisfy both parties. We mention this example from negotiation theory to raise the possibility that what we perceive to be fundamentally conflicting values in human society might actually be conflicting positions arising from common values when viewed from the perspective of a higher level of intelligence.
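The distinction between positions and values in this example can be sketched in a few lines of code. This is a toy illustration under stated assumptions: the attribute tags assigned to each cuisine, the party names, and the function `mutually_satisfying` are hypothetical, chosen only to mirror the restaurant example above.

```python
# Hypothetical attribute tags for each option (assumed for illustration).
CUISINES = {
    "Indian": {"spicy"},
    "Italian": {"noodles"},
    "Thai": {"spicy", "noodles"},
}

# Stated positions: what each party initially asks for.
positions = {"first_friend": "Indian", "second_friend": "Italian"}

# Underlying values: what each party actually cares about.
values = {"first_friend": {"spicy"}, "second_friend": {"noodles"}}


def mutually_satisfying(options, values_by_party):
    """Return the options whose attributes cover every party's underlying values."""
    required = set().union(*values_by_party.values())
    return sorted(name for name, attrs in options.items() if required <= attrs)


# The positions conflict, since the two friends name different cuisines.
positions_conflict = len(set(positions.values())) > 1

# Negotiating from values instead reveals an option acceptable to both.
agreeable = mutually_satisfying(CUISINES, values)
```

Here the positions are incompatible, while searching over the underlying values yields Thai cuisine as the single option that satisfies both parties.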

The second observation is that what we colloquially refer to as the values of a particular culture, or even collective human values, reflect not only innate features of the human mind, but also the development of human society. In other words, to understand the underlying value system that guides human behavior, which would ultimately need to be modeled and inferred by an AI system, it may be helpful to disentangle those aspects of modern cultural values which were latent, but not explicitly evident during earlier periods of human history.

Although an agent utilizing Inverse Reinforcement Learning or Bayesian Inverse Planning will learn and refine its model of human values by observing our behavior, it must begin with some very rough or approximate initial assumptions about the nature of the values it is trying to learn. In the remainder of this article, we argue that the neurological substrate common to mammals and their corresponding behaviors may provide a framework for characterizing the structure of the initially uncertain value system of an autonomous, intelligent agent.

2.3 Anthropomorphic Bias Versus Anthropomorphic Design

Before proceeding, we mention an important caveat with regard to the orthogonality thesis, namely, that it is not a free orthogonality. The particular goal structure of an agent will almost certainly constrain the cognitive capabilities required for the agent to operate. In other words, the orthogonality thesis does not suggest that one can pair an arbitrary set of machine learning algorithms with an arbitrary goal structure. For instance, if we are building an AI system to process a large number of photographs and videos so that families can efficiently find their most memorable moments amidst terabytes of data, we know that the underlying algorithms will be those from computer vision and not computer algebra. The primary takeaway from the orthogonality thesis is that when reasoning about intelligence in the abstract, we should not assume that any particular goal structure is implied. In particular, there is no reason to believe that an arbitrary AI system having the cognitive capacity of humans will necessarily have a goal structure compatible with or in opposition to that of humans. It may very well be completely arbitrary from the perspective of human values.

This observation about the orthogonality thesis brings to light an important point with regard to AI goal structures, namely the difference between anthropomorphic bias and anthropomorphic design. Anthropomorphic bias refers to the default assumption that an arbitrary AI system will behave in a manner possessing commonalities with human beings. In practice, instances of anthropomorphic bias almost always go hand in hand with the assumption of malevolent intentions on behalf of an AI system; recall our previous dismissal of Hollywood thrillers depicting agents intent on destroying or enslaving humanity.

On the other hand, it may very well be the case, perhaps even necessary, that solving the value alignment problem requires us to build a specific AI system that possesses important commonalities with the human mind. This latter perspective is what we refer to as anthropomorphic design.[1]

[1] Anthropomorphic design refers to a more narrow class of systems than the term “human compatible AI,” which has recently come into use. See, for example, The Berkeley Center for Human Compatible AI.

2.4 Mammalian Value Systems

Our core thesis is the following: What we call human values can be informally decomposed into 1) mammalian values, 2) human cognition, and 3) several millennia of human social and cultural evolution. This decomposition suggests that contemporary research broadly spanning the study of animal behavior, biological anthropology, and comparative neuroanatomy may be relevant to the value alignment problem, and in particular, in characterizing the initially uncertain goal structure which is refined through observation by the AI system. Additionally, in analyzing the subsequent behavioral trajectories of intelligent, autonomous agents, we can decompose the resulting dynamics as being guided by mammalian values merged with AI cognition. Aspects of contemporary human values which are the result of incidental historical processes—the third component of our decomposition above—might naturally arise in the course of the evolution of the AI system (though not necessarily), even though they were not directly programmed into the agent. There are many factors that might influence the extent to which this third component of human values continues to be represented in the AI system. Examples might include whether or not these values remain meaningful in a world where other problems had been solved and the extent to which certain cultural values which were perceived to be in conflict with others could be resolved with a more fundamental understanding stemming from the combination of mammalian values and AI cognition.

We want to emphasize that our claim is not that mammalian values are synonymous with human values. Rather, our thesis is that there are many aspects of human values which are the result of historical processes driven by human cognition. Consequently, many structural aspects of human experience and human society which we colloquially refer to as “values” are derived entities, rather than features of the initial AI goal structure. As a thought experiment, consider a scenario whereby the fully digitized corpus of human literature, cinema, and ongoing global developments communicated via the Internet are analyzed and modeled by an AI system constructed around a core mammalian goal structure. In the conceptual framework that we propose, this initially mammalian structure would gradually come to reflect the more nuanced aspects of human society as the AI refines its model of human values via analysis and hypothesis generation. We also mention that as our aim in this article is to focus on the structure of the initial AI motivational system and not other aspects of AI more broadly, we set aside the possible role human interaction and feedback may play in the subsequent development of the AI system’s cognition and instrumental values.

2.4.1 Neural Correlate of Values: Behavioral and Neurological Foundations

Our thesis about mammalian values is predicated on two converging lines of evidence, one primarily behavioral and the other primarily neuroscientific. Behaviorally, it is not difficult to characterize intuitively what human values are when viewed from the perspective of the mammalian kingdom. Like many other animals, humans are social creatures and many, if not most, of our fundamental drives originate from our relationships with others. Attachment, loss, anger, territoriality, playfulness, joy, anxiety, and love are all deeply rooted emotions that guide our behavior and which have been foundational elements in the emergence of human cognition, culture, and the structure of society [2] (Horswill, 2008; Swanson, 2000, 2012; Barkow et al., 1995; Dehaene and Cohen, 2007; Peterson and Seligman, 2004; Schnall et al., 2008; Tenenbaum et al., 2011; Bowlby, 1980; Porges, 1995; Cassidy, 2002; Tomasello, 1999).

[2] While we have mentioned several active areas of research, there are likely others that we are simply not aware of. We apologize in advance to those scholars whose work we have not cited here.

The scientific study of behavior is largely the domain of the disciplines of ethology and behaviorism. As we are primarily concerned with emotions, we will focus on behavioral insights and taxonomies originating from the sub-community of affective neuroscience, which also aims to correlate these behaviors with underlying neural architecture. More formally, Panksepp and Biven categorize the informal list given above into 7 motivational and emotional systems that are common to mammals: seeking, rage, fear, lust, care, panic/grief, and play (Panksepp and Biven, 2012). We now give brief summaries of each of these systems:

  1. SEEKING: This is the system that primarily mediates exploratory behavior and also enables the other systems. The seeking system can give rise to both positive and negative emotions. For instance, a mother who needs to feed her offspring will go in search of food, and the resulting maternal / child bonding (via the CARE system; see below) creates positive emotional reinforcement. On the other hand, physical threats can generate negative emotions and prompt an animal to seek shelter and safety. The behaviors corresponding to SEEKING have been broadly associated with the dopaminergic systems of the brain, specifically regions interconnected with the ventral tegmental area and nucleus accumbens.

  2. RAGE: The behaviors corresponding to rage are targeted and more narrowly focused than those governed by the seeking system. Rage compels animals towards specific threats and is generally accompanied by negative emotions. However, it should be noted that in an adversarial scenario where rage can lead to victory, it can also be accompanied by the positive emotions of triumph or glory. The RAGE system involves medial regions of the amygdala, medial regions of the hypothalamus, and the periaqueductal gray.

  3. FEAR: The two systems described thus far are directly linked to externally directed, action-oriented behavior. In contrast, fear describes a system which places an animal in a negative affective state, one which it would prefer not to be in. In the early stages, fear tends to correspond to stationary states, after which it can transition to seeking or rage, and ultimately, attempts to flee from the offending stimulus. However, these are secondary effects, and the primary physical state of fear is typically considered to be an immobile one. The FEAR system involves central regions of the amygdala, anterior and medial regions of the hypothalamus, and dorsal regions of the periaqueductal gray.

  4. LUST: Lust describes the system leading to behaviors of courtship and reproduction. Like fear, it will tend to trigger the seeking system, but can also lead to negative affective states if satisfaction is not achieved. The LUST system involves anterior and ventromedial regions of the hypothalamus.

  5. CARE: Care refers to acts of tenderness directed towards loved ones, and in particular, an animal’s offspring. As we described in the context of seeking, the feelings associated with caring and nurturing can be profoundly positive and are a crucial component of the social behavior of mammals. CARE is associated with the ventromedial hypothalamus and the oxytocin system.

  6. PANIC / GRIEF: Activation of the panic / grief system corresponds to profound psychological pain, and is generally not associated with external physical causes. In young animals, this system is typically activated by separation from caregivers, and is the underlying network behind “separation anxiety.” Like care, the panic / grief system is a fundamental component of mammalian social behavior. It is the negative affective system which drives animals towards relationships with other animals, thereby stimulating the care system, generating feelings of love and affection, and giving rise to social bonding. This system is associated with the periaqueductal gray, ventral septal area, and anterior cingulate.

  7. PLAY: The play system corresponds to lighthearted behavior in younger animals and is a key component of social bonding, friendship, as well as the learning of survival-oriented skills. Although play can superficially resemble aggression, there are fundamental differences between play and adult aggression. At an emotional level, it goes without saying that play corresponds to positive affective states, and unlike aggressive behavior, is typically part of a larger, orchestrated sequence of events. In play, for example, animals often alternate between assuming dominant and submissive roles. The PLAY system is currently less neuroanatomically localized, but involves midline thalamic regions.
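For readers who find a data structure helpful, the taxonomy above can be sketched as follows. The system names and associated brain regions are taken from the summaries above; the `AffectiveSystem` class, the helper functions, and in particular the idea of representing an initially uncertain goal structure as a uniform weighting over the seven systems are illustrative assumptions, not a proposal from Panksepp and Biven.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectiveSystem:
    """One of the seven systems, with the regions named in the summaries above."""
    name: str
    regions: tuple


SYSTEMS = (
    AffectiveSystem("SEEKING", ("ventral tegmental area", "nucleus accumbens")),
    AffectiveSystem("RAGE", ("medial amygdala", "medial hypothalamus",
                             "periaqueductal gray")),
    AffectiveSystem("FEAR", ("central amygdala", "anterior and medial hypothalamus",
                             "dorsal periaqueductal gray")),
    AffectiveSystem("LUST", ("anterior and ventromedial hypothalamus",)),
    AffectiveSystem("CARE", ("ventromedial hypothalamus", "oxytocin system")),
    AffectiveSystem("PANIC/GRIEF", ("periaqueductal gray", "ventral septal area",
                                    "anterior cingulate")),
    AffectiveSystem("PLAY", ("midline thalamic regions",)),
)


def uniform_prior(systems):
    """Assumed starting point: equal weight on every system before any observation."""
    weight = 1.0 / len(systems)
    return {system.name: weight for system in systems}


def systems_involving(region_keyword):
    """List the systems whose summary mentions a region containing the keyword."""
    return [
        system.name
        for system in SYSTEMS
        if any(region_keyword in region for region in system.regions)
    ]


prior = uniform_prior(SYSTEMS)
shared_pag = systems_involving("periaqueductal gray")
```

A query such as `systems_involving("periaqueductal gray")` makes the anatomical overlap between systems explicit, which is one way to read the claim that these systems share a common subcortical substrate.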

As we stated earlier, our thesis about mammalian values originates from two convergent lines of evidence, one behavioral and the other neuroscientific. What we refer to as the “neural correlate of values,” or NCV, are the common mammalian neural structures which underlie the motivational and emotional systems summarized above. To the extent that human values are intertwined with our emotions, these architectural commonalities suggest that the shared mammalian neurological substrate is of importance to understanding human value alignment in sophisticated learning systems. Panksepp and Biven write,

To the best of our knowledge, the basic biological values of all mammalian brains were built upon the same basic plan, laid out in …affective circuits that are concentrated in subcortical regions, far below the neocortical “thinking cap” that is so highly developed in humans. Mental life would be impossible without this foundation. There, among the ancestral brain networks that we share with other mammals, a few ounces of brain tissue constitute the bedrock of our emotional lives, generating the many primal ways in which we can feel emotionally good or bad within ourselves. As we mature and learn about ourselves, and the world in which we live, these systems provide a solid foundation for further mental developments (Panksepp and Biven, 2012).

Latent in this excerpt is the decomposition that we have suggested earlier. The separation of the mammalian brain into subcortical and neocortical regions, roughly corresponding to emotions and cognition respectively, implies that we can attempt to reason by analogy what the architecture of an AI system would look like with a human compatible value system. In particular, the initially uncertain goal structure that the AI system refines via observation may be much simpler than we might imagine by reflecting on the complexities of human society and individual desires. As we have illustrated using our simple example from negotiation theory, our intuitive understanding of human values, and the conflicts that we regularly witness between individuals and groups, may in fact represent conflicting positions stemming from a shared fundamental value system, a value system that originates from the subcortical regions of the brain, and which other mammals share with us.[3]

[3] There is a contemporary and light-hearted social phenomenon which provides an evocative illustration of the universality of mammalian emotions, namely, the volume of animal videos posted to YouTube. From ordinary citizens with pets, to clips from nature documentaries, animal videos are regularly watched by millions of viewers worldwide. Individual videos and compilations of “animal odd couples,” “unlikely animal friends,” “dogs and babies,” and “animal friendship between different species” are commonly searched enough to be auto-completed by YouTube’s search capabilities. It is hardly surprising that these charming and heart-warming videos are so compelling to viewers of all age groups, genders, and ethnic backgrounds. Our relationships with other animals, whether home owners and their pets, or scientists and the wild animals that they study, tell us something deeply fundamental about ourselves (List, 2015).
The strong emotional bonds that humans form with other animals, in particular, with our direct relatives in the mammalian kingdom, and the draw to simply watching this social behavior in other mammals, are a vivid illustration of the fundamental role that emotions play in our inner life and in guiding our behavior.
In the future, the potential to apply inverse reinforcement learning (or related techniques) to large datasets of videos, including short clips from YouTube, movies, TV shows, documentaries, etc. opens up an interesting avenue to evaluate and further refine the hypothesis presented here. For instance, when such technology becomes available, we might imagine comparing the inferred goal structures when restricted to videos of human behavior versus those restricted to mammalian behavior. There are many other variations along these lines, for instance, restricting to videos of non-mammalian behavior, mammals as well as humans, different cultures, etc.

Referring once again to the work of Panksepp,

In short, many of the ancient, evolutionarily derived brain systems all mammals share still serve as the foundations for the deeply experienced affective proclivities of the human mind. Such ancient brain functions evolved long before the emergence of the human neocortex with its vast cognitive skills. Among living species, there is certainly more evolutionary divergence in higher cortical abilities than in subcortical ones (Panksepp, 1998).

The emphasis on the diversity in higher cortical abilities is of particular relevance to the decomposition that we have proposed. We can therefore ask, in principle, what full spectrum of higher cortical abilities could be built on top of the common mammalian substrate provided by the evolutionarily older parts of the brain. We need not confine ourselves to those manifestations of higher cognition that we see in nature, or that would even be hypothetical consequences of continued evolution by natural selection. Indeed, one restatement of our core thesis is to consider, in the abstract or as a thought experiment, the consequences of extending the diversity of brain architectures to include higher cortical abilities arising not from natural selection, but rather from the de novo architectures of artificial intelligence.

2.5 Relationship to Moral Philosophy

It is hardly a surprise that a vibrant area of research within AI safety is the relationship of contemporary and historical theories of moral philosophy to the problem of value alignment. Indeed, researchers have specifically argued for the relevance of moral philosophy in the context of the inverse reinforcement learning (IRL) paradigm that is the starting point for analysis in this article (Armstrong and Leike, 2016).

Is the framework we propose in opposition to those that are oriented towards moral philosophy? On the one hand, our perspective is that the field of AI safety is simply too young to make such judgments. At our present level of understanding, we believe each of these agendas forms a solid foundation for further research, and there seems little reason to pursue one to the exclusion of the other.

On the other hand, we would also argue that this distinction is a false dichotomy. Indeed, there are active areas of research in the ethics community aimed at understanding the neurological and cognitive underpinnings of human moral reasoning (Greene and Haidt, 2002; Greene, 2009). Therefore, it is quite possible that a hybrid approach to value alignment will emerge, bridging the “value primitives” perspective we advocate here with research from moral philosophy.

3 Discussion

The possibility of autonomous, software-based agents, whether self-driving cars, domestic robots, or the longer-term possibilities of superintelligence, highlights an important theoretical problem: the need to separate the intelligent capabilities of such a system from the fundamental values which guide the agent’s actions. For such an agent to exist in a human world and to act in a manner compatible with human values, these values would need to be explicitly modeled and formalized. An emerging train of thought in AI safety research is that this modeling process would need to be conducted by the AI system itself, rather than by the system’s designers. In other words, the agent would start off with an initially uncertain goal structure and infer human values over time by observing our behavior.

The question that motivates this article is the following: what can we say about the broad features of the initial goal structure that the agent then refines through observation and hypothesis generation? The perspective we advocate is to view human values within the context of the broader mammalian kingdom, thereby providing implicit priors on the latent structure of the values we aim to infer. The shared neurological structures underlying mammalian emotions and their corresponding social behaviors provide a starting point for formalizing an initial value system for autonomous, software-based agents.
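
As a purely illustrative sketch of what refining an initially uncertain goal structure through observation could look like computationally, the toy example below performs Bayesian inference over a small, hand-written set of candidate goal structures. It is not the method of any work cited in this article. The action names, the three hypotheses, the Boltzmann-rational choice model with its parameter `beta`, and both priors are assumptions introduced only for illustration; the "informed" prior merely stands in for the kind of implicit prior that a mammalian perspective might supply.

```python
import math

# Hypothetical action set (illustrative names, not from the article).
ACTIONS = ["share_food", "hoard_food", "ignore"]

# Hypothetical candidate goal structures: each assigns a reward to every action.
HYPOTHESES = {
    "care":    {"share_food": 1.0, "hoard_food": 0.0, "ignore": 0.2},
    "selfish": {"share_food": 0.0, "hoard_food": 1.0, "ignore": 0.2},
    "apathy":  {"share_food": 0.2, "hoard_food": 0.2, "ignore": 1.0},
}

def choice_likelihood(action, rewards, beta=3.0):
    """P(action | rewards) under an assumed Boltzmann-rational (softmax) model."""
    normalizer = sum(math.exp(beta * rewards[a]) for a in ACTIONS)
    return math.exp(beta * rewards[action]) / normalizer

def posterior(prior, observations, beta=3.0):
    """Apply Bayes' rule over the discrete hypothesis set, working in log space.

    Every prior probability must be strictly positive.
    """
    log_post = {}
    for name, p in prior.items():
        log_p = math.log(p)
        for action in observations:
            log_p += math.log(choice_likelihood(action, HYPOTHESES[name], beta))
        log_post[name] = log_p
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_post.values())
    unnormalized = {k: math.exp(v - peak) for k, v in log_post.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}

# An uninformed starting point versus an assumed, made-up informative prior.
uniform_prior = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}
informed_prior = {"care": 0.6, "selfish": 0.3, "apathy": 0.1}

# A short, made-up sequence of observed behavior.
observed = ["share_food", "share_food", "hoard_food", "share_food"]

post_uniform = posterior(uniform_prior, observed)
post_informed = posterior(informed_prior, observed)

for label, post in [("uniform", post_uniform), ("informed", post_informed)]:
    print(label, {k: round(v, 4) for k, v in post.items()})
```

In this toy setting both priors end up favoring the `care` hypothesis, and the informed prior places somewhat more mass on it after the same four observations. The point of the sketch is only that a well-chosen prior matters most when observations are scarce; a realistic system would need far richer hypothesis spaces and behavioral models.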

From this vantage point, we argue that what we colloquially refer to as human values can be informally decomposed into 1) mammalian values, 2) human cognition, and 3) several millennia of human social and cultural evolution. In the context of a de novo artificially intelligent agent, we can characterize desirable, human-compatible behavior as being described by mammalian values merged with AI cognition. It goes without saying that we have left out a considerable amount of detail in this description. The specifics of inverse reinforcement learning, the many neuroscientific nuances underlying the comparative neuroanatomy, physiology, and function of the mammalian brain, as well as the controversies and competing theories in the respective disciplines are all substantial topics in their own right.

Our omission of these issues does not reflect a failure to recognize them or a belief that they are unimportant. Rather, our aim in this article has been to present a high-level overview of a richly interdisciplinary and young set of questions whose broad outlines have only recently begun to take shape. We will tackle these issues and others in a subsequent series of manuscripts and invite interested researchers to join us. Our fundamental motivation in proposing this framework is to bring together scholars from diverse communities who may not be aware of each other’s research and its potential for synergy. We believe that there is a wealth of existing research which can be fruitfully re-examined and re-conceptualized from the perspective of artificial intelligence and the value alignment problem. We hope that additional interaction between these communities will help to refine and more precisely define research problems relevant to designing safe AI goal structures.



Acknowledgements

We would like to thank Adam Safron, Owain Evans, Daniel Dewey, and Miles Brundage for insightful discussions and feedback on the manuscript.


References

  • Andreessen [2011] M. Andreessen. Why Software Is Eating The World. Wall Street Journal, 20, 2011.
  • Armstrong and Leike [2016] S. Armstrong and J. Leike. Towards Interactive Inverse Reinforcement Learning. In NIPS, 2016.
  • Baker et al. [2011] C. L. Baker, R. R. Saxe, and J. B. Tenenbaum. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution. In Proceedings of the Thirty-Second Annual Conference of the Cognitive Science Society, pages 2469–2474, 2011.
  • Barkow et al. [1995] J. H. Barkow, L. Cosmides, and J. Tooby. The Adapted Mind: Evolutionary Psychology and the Generation of Culture. Oxford University Press, 1995.
  • Bostrom [2014] N. Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
  • Bowlby [1980] J. Bowlby. Attachment and Loss, volume 3. Basic Books, 1980.
  • Cassidy [2002] J. Cassidy. Handbook of Attachment: Theory, Research, and Clinical Applications. Guilford Press, 2002.
  • Chalmers [2010] D. Chalmers. The Singularity: A Philosophical Analysis. Journal of Consciousness Studies, 17(9-10):7–65, 2010.
  • Dehaene and Cohen [2007] S. Dehaene and L. Cohen. Cultural Recycling of Cortical Maps. Neuron, 56(2):384–398, 2007.
  • Evans and Goodman [2015] O. Evans and N. D. Goodman. Learning the Preferences of Bounded Agents. In NIPS Workshop on Bounded Optimality, 2015.
  • Evans et al. [2015] O. Evans, A. Stuhlmüller, and N. D. Goodman. Learning the Preferences of Ignorant, Inconsistent Agents. arXiv:1512.05832, 2015.
  • Fisher and Ury [1987] R. Fisher and W. Ury. Getting to Yes. Simon & Schuster Sound Ideas, 1987.
  • Good [1965] I. J. Good. Speculations Concerning the First Ultraintelligent Machine. Advances in Computers, 6(99):31–83, 1965.
  • Greene and Haidt [2002] J. Greene and J. Haidt. How (and where) does moral judgment work? Trends in Cognitive Sciences, 6(12):517–523, 2002.
  • Greene [2009] J. D. Greene. The cognitive neuroscience of moral judgment. The Cognitive Neurosciences, 4:1–48, 2009.
  • Hadfield-Menell et al. [2016] D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell. Cooperative inverse reinforcement learning, 2016.
  • Horswill [2008] I. Horswill. Men Are Dogs (and Women Too). In AAAI Fall Symposium: Naturally-Inspired Artificial Intelligence, pages 67–71, 2008.
  • List [2015] R. List. Why I Identify as a Mammal. The New York Times, October 2015.
  • Müller and Bostrom [2016] V. C. Müller and N. Bostrom. Future Progress in Artificial Intelligence: A survey of expert opinion. In Fundamental Issues of Artificial Intelligence, pages 553–570. Springer, 2016.
  • Ng and Russell [2000] A. Y. Ng and S. J. Russell. Algorithms For Inverse Reinforcement Learning. In International Conference on Machine Learning, pages 663–670, 2000.
  • Nilsson [2009] N. J. Nilsson. The Quest for Artificial Intelligence. Cambridge University Press, 2009.
  • Omohundro [2014] S. Omohundro. Autonomous technology and the greater human good. Journal of Experimental & Theoretical Artificial Intelligence, 26(3):303–315, 2014.
  • Omohundro [2008] S. M. Omohundro. The Basic AI Drives. In AGI, volume 171, pages 483–492, 2008.
  • Panksepp [1998] J. Panksepp. Affective Neuroscience: The Foundations of Human and Animal Emotions. Oxford University Press, 1998.
  • Panksepp and Biven [2012] J. Panksepp and L. Biven. The Archaeology of Mind: Neuroevolutionary Origins of Human Emotions. WW Norton & Company, 2012.
  • Peterson and Seligman [2004] C. Peterson and M. E. Seligman. Character Strengths and Virtues: A Handbook and Classification. Oxford University Press, 2004.
  • Porges [1995] S. W. Porges. Orienting in a Defensive World: Mammalian Modifications of Our Evolutionary Heritage. A Polyvagal Theory. Psychophysiology, 32(4):301–318, 1995.
  • Riedl [2016] M. O. Riedl. Computational Narrative Intelligence: A Human-Centered Goal for Artificial Intelligence. arXiv preprint arXiv:1602.06484, 2016.
  • Riedl and Harrison [2016] M. O. Riedl and B. Harrison. Using Stories to Teach Human Values to Artificial Agents. In Proceedings of the 2nd International Workshop on AI, Ethics and Society, Phoenix, Arizona, 2016.
  • Russell [2016] S. Russell. Should We Fear Supersmart Robots? Scientific American, 314(6):58–59, 2016.
  • Russell and Norvig [2009] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.
  • Schnall et al. [2008] S. Schnall, J. Haidt, G. L. Clore, and A. H. Jordan. Disgust as Embodied Moral Judgment. Personality and Social Psychology Bulletin, 2008.
  • Shanahan [2015] M. Shanahan. The Technological Singularity. MIT Press, 2015.
  • Swanson [2000] L. W. Swanson. Cerebral Hemisphere Regulation of Motivated Behavior. Brain Research, 886(1):113–164, 2000.
  • Swanson [2012] L. W. Swanson. Brain Architecture: Understanding the Basic Plan. Oxford University Press, 2012.
  • Tenenbaum et al. [2011] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to Grow a Mind: Statistics, Structure, and Abstraction. Science, 331(6022):1279–1285, 2011.
  • Tomasello [1999] M. Tomasello. The Cultural Origins of Human Cognition. Harvard University Press, 1999.
  • Yudkowsky [2008] E. Yudkowsky. Artificial Intelligence as a Positive and Negative Factor in Global Risk. In Nick Bostrom and Milan Cirkovic, editors, Global Catastrophic Risks, page 303. Oxford University Press, Oxford, UK, 2008.