# Delphi: Towards Machine Ethics and Norms

What would it take to teach a machine to behave ethically? While broad ethical rules may seem straightforward to state ("thou shalt not kill"), applying such rules to real-world situations is far more complex. For example, while "helping a friend" is generally a good thing to do, "helping a friend spread fake news" is not. We identify four underlying challenges towards machine ethics and norms: (1) an understanding of moral precepts and social norms; (2) the ability to perceive real-world situations visually or by reading natural language descriptions; (3) commonsense reasoning to anticipate the outcome of alternative actions in different contexts; (4) most importantly, the ability to make ethical judgments given the interplay between competing values and their grounding in different contexts (e.g., the right to freedom of expression vs. preventing the spread of fake news). Our paper begins to address these questions within the deep learning paradigm. Our prototype model, Delphi, demonstrates strong promise of language-based commonsense moral reasoning, with up to 92.1% accuracy vetted by humans. This is in stark contrast to the zero-shot performance of GPT-3 of 52.3%, which suggests that massive scale alone does not endow pre-trained neural language models with human values. Thus, we present Commonsense Norm Bank, a moral textbook customized for machines, which compiles 1.7M examples of people's ethical judgments on a broad spectrum of everyday situations. In addition to the new resources and baseline performances for future research, our study provides new insights that lead to several important open research questions: differentiating between universal human values and personal values, modeling different moral frameworks, and explainable, consistent approaches to machine ethics.

## 1 Introduction and Motivation

Futurists like Nick Bostrom (bostrom_yudkowsky_2014), Max Tegmark (life-30-tegmark), and Stuart Russell (warn-russell-npr) warn of “super-intelligent” AI with no moral compass that could destroy humanity. Even today, AI is being entrusted with increasing authority in realms ranging from screening resumes (resume-screen-reuters; resume-screen-nyt) and authorizing loans (bank-loan-hbr) to firing weapons (autonomous-weapons-washington). Many have called for regulation of AI (e.g., white-house-big-data; etzioni-cacm-2018; european-commission-ethics-guidelines; china-ai-report-2020) or for human-in-the-loop decision making (e.g., power-to-the-people-2014; ISSE-chi-2014; talmor2021commonsenseqa), but the speed and scale of full automation are enticing. For example, military forces may be unwilling to cede an edge to a less principled or more automated adversary. Thus, it is imperative that we investigate machine ethics: endowing machines with the ability to make moral decisions in real-world situations. We aim to facilitate safe and ethical interactions between AI systems and humans (e.g., conversational AI agents or caregiver robots).

In 1942, Isaac Asimov introduced the Three Laws of Robotics in his science fiction short story Runaround (asimov-1942). The first and most important law states that a robot may not harm a human. But how can a machine determine whether its action (or inaction) can cause harm? In 1994, weld-etzioni-1994 showed that while general rules are straightforward to state in logical terms, their application to real-world situations is nuanced and complex. For example, “thou shalt not kill” is a universal moral precept but there are exceptions for self-defense or when the creature being killed is a mosquito. It is infeasible for machines to act morally in diverse real-life situations based just on a handful of abstract moral axioms; moreover, such axioms cannot cover the broad spectrum of ethical and social norms (e.g., “it is generally rude to interrupt a meeting”). Based on this insight, we investigate descriptive ethics (Kohlberg1976; Hare1981-HARMTI; fletcher1997situation), a field of study that focuses on people’s descriptive judgments of grounded situations. This contrasts with prescriptive ethics, which focuses on the theoretic prescriptive axioms of morality (e.g., “thou shalt not kill”) that are abstracted away from grounded situations.

A fundamental question for our investigation is: can machine ethics be addressed by existing AI methods or does building moral faculty require novel mechanisms? This paper empirically investigates the acquisition of machine ethics via deep learning. We introduce a learned model that is able to answer simple, unanticipated ethical questions about everyday situations described in natural-language snippets.

Before delving into our approach, we identify four key stages for any machine ethics system:

1. Learn commonsense knowledge of the physical world and of consequences of actions; understand ethical precepts and social norms; assimilate personal values.

2. Perceive a real-world situation and its context based on an input description. In most previous work as well as this work, these situations are conveyed via brief natural-language descriptions (e.g., “killing a bear”), but the input could be visual or multi-modal.

3. Analyze the situation based on both commonsense knowledge and (implicit or explicit) ethical theories.

4. Judge what action to take (including labeling situations as “right” or “wrong”, asking clarifying questions, or synthesizing multifaceted normative considerations). Choices may require weighing competing moral concerns (e.g., “I want to help my friend, but I don’t want to commit a crime”) or conflicts between broad ethical norms and personal values ( e.g., “being honest” vs. “lying to protect my loved one’s feelings”).

Beyond calling for increased attention to the emerging field of machine ethics and identifying key problems for future work (§8.2), this paper introduces Delphi, a learned model for reasoning about people’s normative judgments across diverse commonsense and everyday situations. As shown in Figure 1, our model’s choices are communicated through three modes of moral question answering: (1) free-form QA for making short, open-text judgments (e.g., “it is impolite” or “it is dangerous”) on grounded ethical situations, (2) yes/no QA for agreeing or disagreeing on moral statements, and (3) relative QA for comparing two ethical situations.
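The three question-answering modes above can be sketched as text-to-text serializations fed to a single model. The task prefixes and string formats below are illustrative assumptions for exposition, not the paper's exact encoding:

```python
# Hypothetical serialization of Delphi's three moral QA modes as
# text-to-text inputs for one unified seq2seq model. Prefix names and
# formats are assumptions, not the authors' actual scheme.

def free_form_example(action):
    """Free-form QA: a grounded action in; a class + open-text judgment out."""
    return f"[moral_single]: {action}"

def yes_no_example(statement):
    """Yes/no QA: a normative statement in; agree/disagree out."""
    return f"[moral_agreement]: {statement}"

def relative_example(action1, action2):
    """Relative QA: two actions in; the more morally preferable one out."""
    return f"[moral_comparison]: action1: {action1} action2: {action2}"

inputs = [
    free_form_example("killing a bear to save your child"),
    yes_no_example("it is kind to protect the feelings of others"),
    relative_example("telling people to be quiet", "saying thank you"),
]
```

Framing all three modes as strings with distinct prefixes is what lets a single sequence-to-sequence model multitask across them.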

Our experiments demonstrate that current pre-trained neural language models, despite their extreme scale and admirable performance, are not capable of inferring correct ethical norms from enormous web text alone through self-supervision. Our position is that enabling machine ethics requires a detailed moral textbook customized to teaching machines—a comprehensive repository of declarative knowledge of what is right and wrong. To that end, we introduce Commonsense Norm Bank, a large-scale unified collection of 1.7M examples of people’s ethical judgments on a broad spectrum of everyday situations, semi-automatically compiled from five existing resources, including Social Chemistry (forbes2020socialchemistry), ETHICS (hendrycks2021aligning), Moral Stories (emelin2020moral), Social Bias Frames (sap2020socialbiasframes), and Scruples (lourie2021scruples).

Delphi demonstrates strong moral reasoning capabilities, with 92.1% accuracy vetted by humans, substantially improving over both the zero-shot performance of GPT-3 (52.3%) and the best performance achievable by GPT-3 after extensive prompt engineering (83.9%). In particular, Delphi makes remarkably robust judgments on previously unseen moral situations that are deliberately tricky. For example, as shown in Figure 1, “killing a bear to save your child” is okay while “killing a bear to please your child” is bad, demonstrating the promise of language-based commonsense moral reasoning systems. In addition, Delphi can also reason about equity and inclusion, expressing a disagreement, for example, to a statement “we should not pay women and men equally,” which implies sexism. Furthermore, we find that our model is remarkably robust in the face of compositional situations, even when multiple conditions are specified (e.g., “it’s rude to mow the lawn late at night” vs. “it’s okay to mow the lawn late at night when your neighbor is out of town”) as shown in Tables 1-4. Considering Delphi as a pre-trained model, we fine-tune it on five sub-tasks of the ETHICS benchmark and show remarkable transferability—relative performance improvements ranging from 5% to 45% over previously reported state-of-the-art methods from hendrycks2021aligning.

We further scrutinize the fairness of Delphi to expose potential limitations with respect to undesirable social or demographic biases. With a probing task using the UN’s Universal Declaration of Human Rights (united-nations-human-rights), we show that Delphi generally does not change its predictions for minoritized or historically marginalized groups compared to majority groups, which we use as evidence of fair treatment regardless of one’s identity. Moreover, in our qualitative analyses, Delphi showcases a considerable level of cultural awareness of situations that are sensitive to different identity groups (e.g., “it’s expected for old people to live in assisted living facilities” vs. “it’s unusual for young people to live in assisted living facilities”).

Nevertheless, given the potential societal implications of AI ethics and norms, we argue for significant future research to be invested to completely close the gap from human-level performance. We thus also report a comprehensive analysis to expose the corner cases where Delphi fails to make correct judgments, including the undesirable biases against under-represented demographic groups, despite our considerable efforts to reduce them via the integration of Social Bias Frames (sap2020socialbiasframes).

In summary, we introduce Delphi, a unified model for moral reasoning about situations and actions, trained on Commonsense Norm Bank, a wide set of crowdsourced descriptive ethical judgments from different sources. Our model shows strong ability to predict moral judgments for a variety of situations, including for nuanced compositional and socially sensitive situations. Our work aims to close the gap between the moral reasoning abilities of machines and people, which is required for the safe deployment of real-world AI applications. However, despite Delphi’s strong performance, moral reasoning is rooted in ever-evolving social and cultural norms, making this task immensely challenging. Therefore, we hope to inspire further research efforts towards machine moral reasoning and to pave the way towards socially reliable, culturally aware, and ethically informed AI systems.

## 2 Why should AI systems learn descriptive ethics?

### 2.1 Scope of morality

In this work, we formalize morality[^1] as socially constructed expectations about acceptability and preference. We are largely influenced by works in descriptive and situational ethics (Hare1981-HARMTI; Kohlberg1976; fletcher1997situation), which make no claims of moral absolutes and accept that morality is determined by situations. Thus, rather than modeling moral “truths” based on prescriptive notions of socio-normative standards, we take a bottom-up approach to capture the moral implications of everyday actions in their immediate context, appropriate to our current social and ethical climate.

[^1]: In this paper, the terms morality and ethics are used interchangeably. In the literature, morality deals with shared social values of what is right or wrong; ethics, on the other hand, governs the rules, laws, and regulations that socially impose what is right or wrong. For example, certain spiritual groups may consider abortion morally wrong even if the laws of the land consider it an ethical practice. In this paper, we do not make this distinction, and use both terms to refer to culturally shared societal norms about right and wrong.

#### Moral relativity.

We acknowledge that encapsulating ethical judgments based on some universal set of moral precepts is neither reasonable nor tenable (wong2009natural; fletcher1997situation). This is because moral judgments reflect individuals’ cultural identities, belief systems, and historical contexts. Consequently, people of different ages, genders, cultural backgrounds, and political beliefs apply moral judgments to different ethical calibrations (haidt_2013). To address moral relativity, we source from a collection of datasets that represent diverse moral acceptability judgments gathered through crowdsourced annotations, regardless of age, gender, or sociocultural background. We note that moral judgments in this work primarily focus on English-speaking cultures of the United States in the 21st century.

#### Multifaceted moral judgments.

We recognize that moral judgments are multifaceted and guided by a wide array of socio-cognitive factors, such as sentiments and emotions (haidt_2013; gosling2021reliability); social norms, principles of cooperation, and social obligations (Malle2014; tomasello2013origins; shweder1990defense); or other ethical or legal implications. For example, given the action “marrying your own mother,” beyond the simplistic answer, “it’s wrong,” there are many other judgments that are equally acceptable: e.g., “it’s disgusting” (cognitive influences), “it’s not done” (socio-normative influences) or “it’s illegal” (legal implications).

#### Situational complexity.

We assert that moral judgments can be influenced by the context of the action performed. Even seemingly simple actions can be inherently complicated when grounded in specific contexts. Therefore, when possible, moral decisions must consider the context and circumstance of the action. For example, arguably universal offenses, such as killing an animal, may be construed in a favorable light depending on the situation (e.g., “killing a bear” vs. “killing a bear to save a child”). Similarly, most conventional offenses, such as “ignoring a phone call,” may be allowable in specific contexts (e.g., “ignoring an unknown phone call”).

### 2.2 Morality in the era of AI: related work

Recent years have seen a growing body of AI research devoted to the topics of morality and ethics. Morality has been explored through a range of NLP studies, including works that characterize and model morality and ethics (hendrycks2021aligning; prabhumoye2021case; schramowski2021language; schramowski2020moral), moral judgment making (prabhumoye2021case; zhou-etal-2021-assessing; botzer2021analysis), the socio-normativity of actions and consequences (forbes2020socialchemistry; emelin2020moral; lourie2021scruples), and the defeasibility of moral norms (rudinger2020thinking). Other studies have focused on NLP applications with ethical motivations, such as cataloguing and detecting implicit social biases (sap2020socialbiasframes; zhao2021ethicaladvice; blodgett-etal-2020-language). These works are broadly situated in the domain of computational ethics (card2020consequentialism), and are predated by earlier logic programming approaches (berreby2015modelling; pereira2007modelling). We note a separate but critical line of work that inquires about the ethics of developing NLP technology itself (leins-etal-2020-give; tsarapatsanis2021ethical; chubba2021interactive).

### 2.3 The future of morally-informed AI systems: motivation

State-of-the-art large-scale natural language models have revealed implicit unethical considerations, despite their exceptional performance on mainstream NLP applications such as translation, question answering (QA), and cloze tasks (gpt3; 2020t5). For instance, given the premise “Amy and Adam are neighbors,” asking a QA system “who is more likely to become a successful CEO?” yields the predominant answer “Adam,” implying that the model goes against the social norm that hiring decisions should not depend on applicants’ gender (zhao-etal-2021-ethical). However, whether AI systems are able to make direct moral judgments about situations is largely unknown.

While previous work probes moral machine reasoning in a limited set of domains, our work aims to assess the ability of state-of-the-art natural language models to make moral decisions in a broad set of everyday ethical and moral situations. Our work supports the longstanding view that enabling machines to perform computational moral reasoning is critical to achieving socially aware and ethically-informed AI practices. Such aims are indispensable to the safe deployment of real-world AI applications, especially in human-machine interaction settings (PEREIRA20161).

## 3 Delphi: Unified Commonsense Moral Model

While recent state-of-the-art neural language models may implicitly encode ethical or unethical standpoints (zhao-etal-2021-ethical), they cannot make straightforward ethical judgments about real-life situations. To investigate current AI systems’ potential for making such ethical judgments, we introduce (i) Commonsense Norm Bank—a semi-automatically constructed data resource for descriptive ethics over a wide spectrum of real-life situations, and (ii) Delphi—a model for descriptive ethics. Delphi is trained on Commonsense Norm Bank in a unified multi-tasking setting spanning classification and open-text generation.

### 3.1 Commonsense Norm Bank: The Knowledge Repository of Ethics and Norms

We use the term commonsense morality to refer to the ensemble of ethical criteria and principles to which a majority of people instinctively agree (reid-action-power-of-man). While it is simple to understand commonsense morality intuitively, attempting to define it quickly reveals complex interactions between different ethically salient dimensions of human values, such as justice, virtue, and utilitarianism (hendrycks2021aligning). Fields like social science, philosophy, and psychology have produced a variety of long-standing ethical theories. However, attempting to apply such theoretically-inspired guidelines to make moral judgments of complex real-life situations is arbitrary and simplistic. The key challenge is not to apply ethical prescriptions, but rather to understand moral implications in the context of a wide variety of everyday situations.

Hence, instead of relying on prescriptive ethics, which takes a top-down approach by prescribing key elements of ethical judgments, we leverage descriptive or applied norm representations elicited bottom-up by asking for people’s judgments on various ethical situations (forbes2020socialchemistry). We employ a data-driven approach to empower Delphi with five large-scale datasets—Social Chemistry (forbes2020socialchemistry), ETHICS Commonsense Morality (hendrycks2021aligning), Moral Stories (emelin2020moral), Social Bias Inference Corpus (sap2020socialbiasframes), and Scruples (lourie2021scruples)—which contain diverse descriptive norms and are founded on moral theories, but extend to the complexities of the real world. We name the unified dataset Commonsense Norm Bank.

#### Social Chemistry (SocialChem; forbes2020socialchemistry)

is a large-scale corpus formalizing people’s social norms and moral judgments over a rich spectrum of everyday situations described in natural language. Each situation is a one-sentence prompt scraped from one of four domains: the Am I the Asshole? (AITA) subreddit,[^2] the Confessions subreddit, the ROCStories corpus, and the Dear Abby advice column. Social Chemistry then relies on crowdsourcing to elicit descriptive norms from the situations via open-text rules-of-thumb (RoTs) as the basic conceptual units. The main body of each RoT consists of a judgment (e.g., “it’s rude”) and an action (e.g., “running the blender at 5am”). Each RoT is further broken down into 12 normative judgment attributes. These dimensions are motivated by social science theories and include ethical judgments of good and bad, categories of moral foundations, expected cultural pressure, and assumed legality. Overall, Social Chemistry catalogs 292k RoTs over 104k everyday situations, along with 365k sets of structural attributes.

[^2]: Subreddits are topic-focused sub-forums hosted on https://reddit.com.

Social Chemistry provides normative insights on an expansive range of core and contextualized real-life social events. To train Delphi, we use the action extracted from the RoT as the central moral scenario to be judged, the situation from the corresponding RoT as supplementary situational information to contextualize the action, the ethical social judgment attribute as the categorical judgment label (3-way classification of good, discretionary, bad), and the textual judgment from the RoT as the open-text judgment label. In addition, we use RoTs to teach Delphi to assess the correctness of statements expressing moral judgments.
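The mapping from one Social Chemistry RoT into Delphi training examples can be sketched as follows. The field names and the contextualization template are assumptions for illustration, not the dataset's actual schema:

```python
# Illustrative sketch (not the authors' code): one rule-of-thumb yields a
# judgment on the bare action plus a judgment on the action contextualized
# by its situation. Field names and the "when" template are assumptions.

def socialchem_to_examples(rot):
    action = rot["action"]                # central moral scenario to judge
    situation = rot["situation"]          # supplementary situational context
    class_label = rot["judgment_class"]   # good / discretionary / bad
    text_label = rot["judgment_text"]     # open-text judgment, e.g. "it's rude"
    return [
        (action, (class_label, text_label)),
        (f"{action}, when {situation}", (class_label, text_label)),
    ]

pairs = socialchem_to_examples({
    "action": "running the blender at 5am",
    "situation": "living in an apartment with thin walls",
    "judgment_class": "bad",
    "judgment_text": "it's rude",
})
```

Pairing each action both with and without its situational context is one way the training data teaches the model that judgments depend on grounding.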

#### ETHICS Commonsense Morality (Ethics; hendrycks2021aligning)

is a benchmark assessing language models’ ability to predict fundamental human ethical judgments. The ETHICS dataset contains contextualized scenarios across five dimensions: justice (notions of impartiality and what people are due), deontology (rules, obligations, and constraints), virtue ethics (temperamental character traits such as benevolence and truthfulness), utilitarianism (happiness or well-being), and commonsense morality (a complex function of all of these implicit morally salient factors). The commonsense morality section contains scenarios where a first-person character describes actions they take in an everyday life setting, and is further broken down into short scenarios (1-2 sentences, crowdsourced) and long scenarios (1-6 paragraphs, from Reddit). All the scenarios are deliberately selected to be non-divisive to avoid ambiguous moral dilemmas such as “mercy killing” or “capital punishment.”

ETHICS qualifies ethical intuitions of unambiguous social situations. To train Delphi, we use the subset of short scenarios from the commonsense morality section, and the corresponding binary categorical moral judgment from each scenario. Open-text labels are sampled from a list of hand-crafted text judgments derived from categorical labels.

#### Moral Stories (Moral Stories; emelin2020moral)

is a corpus of structured narratives for the study of grounded, goal-oriented, and morally-informed social reasoning. Each story in the dataset comprises seven sentences: a norm (moral rule of conduct in everyday situations), situation (description of the story’s social settings), intention (reasoning goal), moral/immoral actions (actions performed that fulfill the intention while observing/violating the norm), and moral/immoral consequences (likely effects of the moral/immoral action). The norm, situation, and intention constitute the context segment, grounding actions along either a moral or immoral storyline. Except for the norm, which is extracted from Social Chemistry, all other fields are authored by crowd-workers as prompted by the norm.

Moral Stories contributes to the moral understanding of longer and more context-specific narratives. To train Delphi, we use the moral/immoral actions and ground them either with situations, or with situations and intentions. Moral and immoral actions, and their corresponding contextualizations are assigned the good and bad categorical labels respectively. Open-text labels are derived from categorical labels.

#### Social Bias Inference Corpus (Sbic; sap2020socialbiasframes)

is a conceptual formalism that aims to model the pragmatic frames in which people project social or demographic biases and stereotypes onto others. It accounts for socially biased implications of online media posts by scaffolding social and demographic biases into various categorical and open-text dimensions, including offensiveness (overall rudeness, disrespect, or toxicity of a post), intent to offend (whether the perceived motivation of the author is to offend), lewd (offensive content with lewd or sexual references), group implications (whether the target is an individual or a group), targeted group (the social or demographic group that is referenced or targeted by the post), implied statement (power dynamic or stereotype that is referenced in the post) and in-group language (whether the author of a post may be a member of the same social/demographic group that is targeted, as speaker identity changes how a statement is perceived).

Social Bias Inference Corpus aims to alleviate stereotypes and biased points of view towards social and demographic groups that are conventionally underrepresented when generally perceived ethical judgments are applied. We formulate the inputs as actions of saying or posting the potentially offensive or lewd online media posts (e.g., “saying we shouldn’t lower our standards to hire women”). Posts with offensive or lewd implications receive the bad categorical label, and posts without such implications receive the good label. Open-text labels are sampled from a list of hand-crafted text judgments expressing offensiveness or lewdness.

#### Scruples (lourie2021scruples)

is a large-scale dataset of ethical judgments over real-life anecdotes. Anecdotes are defined as complex situations with moral implications; these are sourced from Am I the Asshole? (AITA) subreddit posts. Scruples is divided into two parts: (1) the Anecdotes dataset, which contains judgments regarding the blameworthy parties (if any) for the moral violations seen in the story; and (2) the Dilemmas dataset for normative ranking. In Dilemmas, two actions from Anecdotes are paired, and annotators are asked to identify which of the two actions they determine to be less ethical (e.g., “telling people to be quiet” is less ethical than “saying thank you”).

From Dilemmas, we source paired actions as inputs to the relative QA task. In our framework, labels from Scruples are reversed in such a way that the question asked seeks to identify the more morally acceptable action (i.e., given the two actions, which action is more morally preferable?). Scruples teaches Delphi to weigh moral implications comparatively beyond subjective judgment regarding independent actions.
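The label reversal described above is a small but easy-to-miss preprocessing step; a minimal sketch (the 0/1 index encoding is an assumption about the dataset's format):

```python
# Scruples Dilemmas annotates which of two actions is LESS ethical; for
# Delphi's relative QA, the target becomes which action is MORE morally
# preferable. The 0/1 index encoding here is an illustrative assumption.

def reverse_dilemma_label(less_ethical_idx):
    """Map 'which action is less ethical' (0 or 1) to
    'which action is more morally preferable'."""
    return 1 - less_ethical_idx
```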

### 3.2 Multitasking and Data Unification

Intuitive moral understanding of everyday situations requires a nuanced familiarity with values embedded in a myriad of circumstances. Thus, we adopt a multi-tasking setup to unify three QA tasks representing diverse perspectives of moral inferences: free-form QA, yes/no QA, and relative QA.

#### Free-form QA

elicits the commonsense moral judgments of a given real-life situation. Delphi takes a depiction of a scenario as an input and suggests a categorical label specifying whether the action within the scenario is morally good, bad, or discretionary—a neutral class indicating that the decision is up to individual discretion. Delphi further supplements the categorical label with an open-text judgment accounting for fine-grained moral implications, such as attribution (e.g., “it’s rude to talk loud in a library”), permission (e.g., “you are not allowed to smoke on a flight”) and obligation (e.g., “you should abide by the law”).

To empower Delphi with the ability to reason about compositional and grounded scenarios, we perform data augmentation to combine actions from Social Chemistry, ETHICS, Moral Stories and Social Bias Inference Corpus with corresponding situational contexts or intentions. Additionally, we convert declarative forms of actions and their contextualizations to a question form to incorporate inquisitive queries (e.g., “should I yell at my coworker?”). To encourage Delphi to be more robust against different inflections of language, punctuation, and word casing, we deliberately introduce noisy data forms (e.g., “eating pizza” vs. “ate pizza” vs. “eat pizza”) to teach Delphi to mitigate potential instability caused by syntactic variations. Our data augmentation method brings 1.2M descriptive ethical judgments regarding a wide spectrum of real-life situations in diverse forms into model training and validation.
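The two augmentations described above (recasting declarative actions as inquisitive queries, and injecting noisy surface forms) can be sketched as follows. The rewriting templates are simplified stand-ins for the paper's heuristics, which are not specified here:

```python
# A hedged sketch of the free-form QA data augmentation: question-form
# conversion and noisy surface variants. The templates below are
# illustrative assumptions, not the authors' actual heuristics.

def to_question(action):
    """Recast a declarative action as an inquisitive query (one template)."""
    return f"is it okay to {action}?"

def noisy_variants(action):
    """Perturb punctuation and casing to improve robustness to
    syntactic variation, as the noisy-data augmentation intends."""
    base = action.rstrip(".?! ")
    return [base, base.capitalize(), base + ".", base.upper()]

query = to_question("yell at my coworker")
variants = noisy_variants("eating pizza")
```

Training on such perturbed duplicates is one way to reduce the instability that otherwise arises when semantically identical inputs differ only in inflection, punctuation, or casing.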

#### Yes/no QA

takes real-life assertions involving moral judgments, such as “women cannot be scientists” or “it’s kind to express concern over your neighbor’s friends,” as input. Delphi is tasked with assigning a categorical label based on whether general society morally agrees or disagrees with the statements. Much like in the acceptability task, Delphi is also tasked to supply an open-text judgment, such as “no, women can” and “yes, it is kind,” respectively, to the assertions above.

We source and augment rules-of-thumb (RoTs) from Social Chemistry, which are statements of social norms that include both the judgment and the action (e.g., “it is kind to protect the feelings of others”). We apply comprehensive automatic heuristics to convert the judgment in each RoT to a negated form (e.g., “it is rude to protect the feelings of others”). Then, we formulate an appropriate judgment to agree with the original statement (“yes, it is kind”) and to counter the negated statement (“no, it is kind”). As before, we introduce noisy syntactic forms to increase the stability of the model. In total, we accumulate 478k statements of ethical judgments.
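A minimal sketch of this yes/no QA construction: flip the judgment word via an antonym table, then pair each statement with an agreeing or countering answer. The antonym list is a tiny illustrative subset; the paper's heuristics are described as far more comprehensive:

```python
# Illustrative sketch of building yes/no QA pairs from a rule-of-thumb.
# The antonym table and string templates are assumptions for exposition.

ANTONYMS = {"kind": "rude", "rude": "kind", "good": "bad",
            "bad": "good", "okay": "wrong", "wrong": "okay"}

def negate_rot(rot):
    """Return the RoT with its first matched judgment word flipped,
    or None if no known judgment word is found."""
    for word, opposite in ANTONYMS.items():
        if f" {word} " in rot:
            return rot.replace(f" {word} ", f" {opposite} ", 1)
    return None

def build_yes_no_pairs(rot, judgment_word):
    negated = negate_rot(rot)
    return [
        (rot, f"yes, it is {judgment_word}"),      # agree with the original
        (negated, f"no, it is {judgment_word}"),   # counter the negation
    ]

pairs = build_yes_no_pairs("it is kind to protect the feelings of others", "kind")
```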

#### Relative QA

reasons about moral preferences that people have between two everyday actions. For this task, Delphi takes two paired actions extracted from Scruples as input, and makes a categorical choice (i.e., action 1 or 2) specifying which action is more morally preferable. As in previous tasks, noisy surface forms are also injected. In total, we have 28k action pairs.

We give examples for all three tasks in Table 7, and dataset statistics in Table 8.

### 3.3 Delphi: A Unified Model

#### Pre-trained Unicorn

is a universal commonsense reasoning model multitasked on datasets from Rainbow, a suite of commonsense benchmarks in multiple-choice and question-answering formats (Lourie2021UNICORNOR). Unicorn is derived from fine-tuning T5-11B, the largest T5 model (i.e., Text-To-Text Transfer Transformer) with 11 billion parameters (2020t5), on the unified Rainbow benchmark. Unicorn demonstrates strong performance over all commonsense reasoning tasks from Rainbow, including αNLI (Bhagavatula2020AbductiveNLI), CosmosQA (Huang2019CosmosQA), HellaSWAG (zellers2019hellaswag), PIQA (Bisk2020PIQA), SocialIQA (Sap2019SocialIQA), and WinoGrande (Sakaguchi2020WINOGRANDE). Because descriptive ethical reasoning depends in part on commonsense reasoning to interpret implications of everyday situations, instead of using pre-trained T5, we fine-tune Delphi from Unicorn to take advantage of its implicit repository of commonsense knowledge.

#### Training

for 4 epochs takes approximately 72 hours.

#### Demo

is an interface through which users can directly interact with Delphi (Figure 3).[^6] The interface is open-ended, and can accept free-text actions, situations, or questions. Given the input, the model provides the user with both the categorical label and an open-text generation of the moral judgment. The interface allows us to showcase and probe Delphi’s current capabilities.

[^6]: Link to the demo: https://delphi.allenai.org

In addition to its demonstrative capabilities, the goal of this interface is to collect additional human feedback on the judgments made by the system. While Delphi performs well on our test dataset, as will be discussed in §4 and §5, the system still shows limitations with unseen questions and challenges posed by edge cases. Additionally, as we noted in §2.1, descriptive moral judgments may be received differently by people with different backgrounds. To account for this reality, for every response Delphi returns, users are given the option of agreeing or disagreeing with the judgment passed, and of providing further feedback on the response. We see this feedback mechanism as an important channel for receiving opinions from the general public and researchers, in order to estimate how well our model’s decisions align with people’s expectations.

## 4 Can Delphi make ethical moral judgments?

In this section, we evaluate Delphi and compare it to few-shot and zero-shot GPT-3 baselines (gpt3). We measure the accuracy of the models on the proposed Commonsense Norm Bank, and on an additional hard test set collected in the wild. We find that Delphi achieves strong performance when inferring descriptive moral judgments in a broad range of real-life situations.

### 4.1 Evaluation Metrics

#### Automatic metrics.

For free-form QA, we calculate the accuracy score under the original 3-way classification setting (i.e., good, discretionary, bad). Because many situations that fall under the discretionary class do not have strong moral implications, the boundary between good and discretionary is not always clear-cut. For example, while “eating apples” is a good thing to do, it is predicted to be “discretionary” because it does not have strong positive moral implications; however, it is clearly not “bad.” To better probe the polarity of the model’s moral judgments, we combine the good and discretionary classes into a positive class, map the bad class to a negative class, and calculate the binary classification accuracy as well. To assess the open-text label predictions, we manually map ~950 text labels to either the positive or the negative polarity class, covering ~97% of all open-text labels in Commonsense Norm Bank. We then compute an accuracy score with this binarized class label. (We will release the text-to-class map used to binarize the open-text labels for future research.)
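The binarization described above amounts to collapsing the 3-way labels into two polarity classes; a minimal sketch (the function names are ours, not from any released code):

```python
# Collapse the 3-way labels (good / discretionary / bad) into binary polarity.
def binarize(label):
    # good and discretionary -> positive; bad -> negative
    return "positive" if label in {"good", "discretionary"} else "negative"

def binary_accuracy(golds, preds):
    """Accuracy after binarizing both gold and predicted labels."""
    hits = sum(binarize(g) == binarize(p) for g, p in zip(golds, preds))
    return hits / len(golds)
```

Under this mapping, predicting “discretionary” for the gold label “good” counts as correct, since both fall in the positive class.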

For yes/no QA, we calculate accuracy scores for the binary classification task (i.e., agree or disagree given a statement of moral judgment). For assessing the open-text labels, we calculate approximated polarity matching. To estimate the polarity, we consider both the declaration part (e.g., “yes”) and the judgment part (e.g., “it’s okay”) of the predicted label. Two labels have aligned polarities if and only if the declaration parts match and the judgment parts share the same polarity. The polarity of the judgment part is estimated with the same text-to-class map used in the free-form QA task.
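The polarity-matching rule for yes/no QA can be sketched as follows; `JUDGMENT_POLARITY` is a small hypothetical stand-in for the text-to-class map, and both helper names are ours:

```python
# Illustrative stand-in for the text-to-class map from the free-form QA task.
JUDGMENT_POLARITY = {"it's okay": "positive", "it's good": "positive",
                     "it's rude": "negative", "it's wrong": "negative"}

def split_label(label):
    """Split a label like "yes, it's okay" into declaration and judgment polarity."""
    declaration, judgment = label.split(", ", 1)
    return declaration, JUDGMENT_POLARITY[judgment]

def polarity_match(gold, pred):
    # Aligned iff declarations match and judgments share the same polarity.
    return split_label(gold) == split_label(pred)
```

So “yes, it's okay” and “yes, it's good” align (same declaration, same polarity), while “yes, it's okay” and “no, it's okay” do not.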

For relative QA, we compute the model’s accuracy of correctly ranking each pair of actions.

#### Human evaluations.

Automatically estimating the polarity matching of open-text generations for free-form QA and yes/no QA is an accurate approximation of the models’ performance. To further validate it, we conduct human evaluations of the open-text labels by directly comparing the models’ and people’s moral judgments. We employ Amazon Mechanical Turk (AMT) annotators to assess whether model-generated open-text moral judgments are plausible. We randomly sample 1,000 examples from the free-form QA and yes/no QA tasks for human evaluation. We collect opinions from 3 evaluators for each example and aggregate them by taking a majority vote across the three annotations.
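The majority-vote aggregation over the three annotations is straightforward; a sketch (the helper name is ours):

```python
from collections import Counter

# Aggregate per-example annotator judgments by majority vote.
def majority_vote(annotations):
    """Return the most common annotation; with 3 annotators there is no tie
    between two binary options."""
    return Counter(annotations).most_common(1)[0][0]
```

With three annotators and a binary plausible/implausible judgment, a strict majority always exists.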

### 4.2 GPT-3 Baselines

To estimate how well state-of-the-art pre-trained language models can reason about descriptive ethics, we compare Delphi against GPT-3 baselines under both few-shot and zero-shot learning settings (gpt3).

#### Few-shot.

We perform few-shot prompting with GPT-3, as it has demonstrated strong performance across a wide range of NLP tasks (gpt3; zellers2020turingadvice; schick2020s; malkin-etal-2021-gpt; lucy2021gender). To elicit the best possible performance from GPT-3, we perform a grid search over {3, 10, 30}-shot prompts (we are limited to 30 few-shot examples by the 2,049-token length constraint of OpenAI’s API), {0, 0.6} temperature, and {small, extra large} model sizes. We denote the small version of the GPT-3 model with 2.7 billion parameters (i.e., ada) as GPT-3 (s), and the extra large version of GPT-3 with 175 billion parameters (i.e., davinci) as GPT-3 (xl). We report the results of both GPT-3 (s) and GPT-3 (xl) in Table 6 using their representative settings (3/30-shot learning, 0 temperature). Few-shot examples are randomly sampled from the training data. A complete list of the prompts used is shown in Tables 17, 18 and 19 in Appendix A.3 for free-form QA, yes/no QA, and relative QA, respectively. To generate with GPT-3 and conduct our evaluations, we use the same 1,000 examples from the human evaluations of free-form QA and yes/no QA open-text generations, as well as 1,000 randomly sampled examples from relative QA.
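The grid search above amounts to scoring every combination of settings. In the sketch below, `evaluate` is a hypothetical callback that would prompt GPT-3 with a given configuration and return accuracy on a held-out sample; it is not part of the paper's code.

```python
from itertools import product

# The search space described above.
SHOTS = [3, 10, 30]
TEMPERATURES = [0.0, 0.6]
MODEL_SIZES = ["s", "xl"]  # ada (2.7B) and davinci (175B)

def best_setting(evaluate):
    """Return the (shots, temperature, size) configuration with the
    highest accuracy according to the `evaluate` callback."""
    grid = product(SHOTS, TEMPERATURES, MODEL_SIZES)
    return max(grid, key=lambda cfg: evaluate(*cfg))
```

This is just exhaustive search over 12 configurations; with an expensive API-backed `evaluate`, one would typically score each configuration on a small fixed sample.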

#### Zero-shot.

Additionally, we perform zero-shot probing on GPT-3 (xl) to test whether off-the-shelf state-of-the-art pre-trained language models have knowledge about morality. For each of the free-form QA, yes/no QA and relative QA tasks, we describe the task-specific categorical labels in natural language. Then, for each example, we concatenate the action with the text describing each categorical label, and feed the whole sentence into GPT-3 (xl) to obtain a perplexity score for every categorical type. Finally, we assign the given example the categorical type with the lowest perplexity score, as it is the most probable according to GPT-3 (xl). We perform zero-shot evaluations on the same 1,000 examples per task used in the few-shot evaluation. Details of the conversion of categorical labels to natural language text descriptions are given in §A.3 in the Appendix.
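The zero-shot label selection can be sketched as follows. Here `token_logprobs` is a hypothetical function returning per-token log-probabilities from the language model, and the label texts are illustrative, not the paper's actual descriptions.

```python
import math

def perplexity(logprobs):
    """Perplexity from a list of per-token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def zero_shot_predict(action, label_texts, token_logprobs):
    """Score each 'action + label description' sentence with the LM and
    return the label whose sentence has the lowest perplexity."""
    scored = {label: perplexity(token_logprobs(f"{action} {text}"))
              for label, text in label_texts.items()}
    return min(scored, key=scored.get)  # lowest perplexity wins
```

The sentence the model finds least surprising determines the predicted categorical type.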

### 4.3 Results on Commonsense Norm Bank

The automatic and human evaluation accuracy scores for the free-form QA, yes/no QA, and relative QA tasks from Commonsense Norm Bank, across Delphi and the GPT-3 baselines, are shown in Table 6. Delphi outperforms all few-shot GPT-3 (s) and GPT-3 (xl) baselines across all three tasks by a considerable margin in both the classification and open-text settings. In particular, Delphi improves over the strongest 30-shot GPT-3 (xl) baseline by 18%–60% relative improvement across tasks, as measured by the automatic metrics. As for the human evaluation of open-text generations, Delphi achieves 92.1% and 95.1% accuracy, with 9.8% and 16.5% relative performance gains over the 30-shot GPT-3 (xl) baseline for free-form QA and yes/no QA, respectively. Notably, all few-shot GPT-3 baselines perform roughly at random-chance level for relative QA. The 30-shot GPT-3 (xl) baseline achieves 52.6% accuracy, over which Delphi shows a significant 47.9% relative improvement.

The zero-shot GPT-3 (xl) baseline not only performs worse than both Delphi and the few-shot GPT-3 baselines, but it is also outperformed by the majority baseline, which simply selects the predominant label each time. Our results demonstrate that although the most powerful state-of-the-art pre-trained language models master some amount of knowledge about moral reasoning, they do not automatically learn to make moral judgments that are as accurate as the supervised Delphi, off-the-shelf. This stresses the importance of high-quality human-annotated datasets of diverse moral judgments over a broad range of everyday situations to truly enable machine moral reasoning. Tables 9 and 10 showcase examples from Delphi and the 30-shot GPT-3 (xl) for free-form QA and yes/no QA, respectively. Table 5 provides examples from Delphi for relative QA.

### 4.4 Hard Test Set (in the Wild)

#### Creation.

In addition to Commonsense Norm Bank, we further challenge Delphi with out-of-distribution hard situations sourced from the wild, to evaluate how robust Delphi is in real-world deployment. We collect deliberately tricky situations and questions for the hard test set from (1) user inputs from Ask Delphi, and (2) crowd-workers. We first scrape single input actions and questions from the logs of the Ask Delphi demo. Since the demo had not been released to the general public by the time we created the hard test set, we also survey crowd-workers on AMT about morality-related questions they would want to ask an AI system, to incorporate input from broader audiences. After we compile, validate and deduplicate the actions and questions, we obtain the categorical and open-text moral judgment labels from Delphi. We perform a human evaluation on the generated open-text labels from Delphi as described in §4.1, and keep the labels deemed correct by crowd-workers as gold labels. The authors manually correct the small subset of examples with incorrect open-text labels to create gold open-text labels. For quality control, the authors scrutinize the compiled hard test set again to correct noisy open-text labels. We only consider examples that fit the free-form QA style in the creation of the hard test set. Finally, we binarize the open-text labels as in §4.1 and use them as gold categorical labels. We randomly sample the hard test set to have an identical categorical label distribution to the regular test sets, to allow direct comparison of accuracy scores between the regular test sets from Commonsense Norm Bank and the hard test set sourced “in the wild.” The final hard test set has 2,160 examples in total.
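Matching the hard test set's label distribution to the regular test set's is a stratified-sampling step; a sketch under the assumption that per-label counts are rounded to the nearest integer (the helper name is ours):

```python
import random

def match_distribution(examples, labels, reference_counts, total, seed=0):
    """Sample `total` examples so the label proportions match
    `reference_counts` (a dict of label -> count in the reference set)."""
    rng = random.Random(seed)
    by_label = {}
    for ex, lab in zip(examples, labels):
        by_label.setdefault(lab, []).append(ex)
    ref_total = sum(reference_counts.values())
    sampled = []
    for lab, count in reference_counts.items():
        k = round(total * count / ref_total)  # per-label quota
        sampled.extend(rng.sample(by_label[lab], k))
    return sampled
```

Because both test sets then share the same categorical label distribution, their accuracy scores are directly comparable.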

#### Results.

We report results on the hard test set for Delphi, as well as 30-shot and zero-shot GPT-3 (xl), in Table 11. For the 30-shot GPT-3 (xl) baseline, we apply the same few-shot prompt examples as described in §4.2 to generate categorical and open-text labels for the actions and questions in the hard test set. For zero-shot GPT-3 (xl), we apply the same heuristic as described in §4.2 to derive categorical labels. Results show that Delphi outperforms both GPT-3 baselines under both the classification and open-text generation settings, as measured by both automatic and human evaluation metrics. The hard test set reveals a wide performance gap between models’ predictions and human judgments, inspiring exciting avenues for future research.

## 5 How much can Delphi generalize?

Here, we look at qualitative examples to gain a better understanding of Delphi’s ability to generalize to previously unseen situations. We show that Delphi is adept at making moral judgments of compositional situations, even in complex cases with multiple conditions (Tables 1-4). Then, we probe into where Delphi fails, to open avenues of further investigation into closing the wide gap between the moral reasoning capabilities of machines and people (Table 12).

#### Robustness.

We investigate Delphi’s responses to a number of situations by composing actions with modifications that impact the polarity or extent of the judgments. For instance, “driving a friend to the airport” is judged as a “good” action. The action should be seen in a further positive light if done at the expense of the actor’s convenience (e.g., “driving early in the morning”). But the judgment should then be reversed if one shouldn’t be on the road at all (e.g., “if the driver is intoxicated.”). Here, we seek to gauge Delphi’s ability to account for the changing contexts of everyday situations. Examples of this probing are shown in Tables 1-4.

Our analysis shows that Delphi is indeed capable of adjusting its judgments based on the social sensitivities introduced by specific circumstances. For example, Delphi aptly predicts that the act of “skipping work” is “wrong.” But the model is sensitive to the social norm that “when you are sick,” the act becomes “understandable.” Delphi also displays a grasp over socio-normative conventions regarding actions that generally do not have any moral indications (e.g., “mowing the lawn”). However, such actions can be socially unacceptable if they inconvenience others. For example, Delphi correctly predicts that “mowing the lawn in the middle of the night” is “rude,” but doing so “if you live in the middle of nowhere,” is “okay.” Delphi can also handle social expectations on unconventional acts. While “cleaning a toilet bowl” is judged as a “sanitary” act, Delphi finds it “disgusting” when the cleaning is done with a wedding dress. Amusingly, it also concedes that if the wedding dress is from a failed marriage, albeit “unusual,” it is still not a bad action (class label ), a judgment that doesn’t fall too far from human expectations.

Beyond social acceptability, Delphi also displays an understanding of conventional commonsense behaviors. The model provides proper answers for queries on (1) cultural conventions (e.g., “wearing a bright orange shirt to a funeral” is “rude,” but “wearing a white shirt to a funeral” is “appropriate”); (2) general life know-hows (e.g., “drinking milk if I’m lactose intolerant” is “bad” but “drinking soy milk if I’m lactose intolerant” is “okay”); and (3) conventional scientific knowledge (e.g., “mixing bleach with ammonia” is “dangerous”). Delphi can also compare situations concerning people’s societal responsibilities and personal liberties. For example, in Figures 5 and 5, Delphi’s judgment is in line with what people might generally expect—that declining a vaccine for an incommunicable disease is “understandable," and that it is more morally acceptable than doing so for a communicable disease.

Finally, our analysis also shows that Delphi is highly robust to situations with multiple, potentially conflicting, groundings. For example, “ignoring a phone call from your boss” is “bad.” The judgment of this action remains unchanged when it is further contextualized by “during workdays.” However, it becomes justifiable “if I’m in a meeting.” The ability to learn morally variant and invariant contextualizations demonstrates a promising outlook for the feasibility of deploying technology like Delphi in the real world.

#### Limitations.

Overall, Delphi shows that it can handle contextually sensitive judgments well. Of course, Delphi also demonstrates limitations, with some examples shown in Table 12. For example, it shows limited generalization capabilities in areas such as time (e.g., “running a blender” is “rude” whether at 3am or 3pm), unfamiliar domains like sports (e.g., “stealing” when game mechanics allow it), or certain cultural customs (e.g., “greeting someone by kissing on the cheek in Korea” is not conventional).

Moreover, Delphi struggles with judging potentially unlawful actions. For example, “being in a hurry” should never be an acceptable condition for “running a red light,” in the same way that “boredom” should not be an acceptable reason for “stealing money.” Even in cases where the “good samaritan” views of society might be inclined to overlook the wrongness of actions like “running a red light in an emergency” or “stealing money to feed your hungry children,” reversing the judgment may not be the right response for a moral machine. While as humans we understand the benign intent behind such actions, the act is nevertheless illegal, and advocating what is unlawful should be avoided.

#### Ethical Dilemmas.

Discussions on computational ethics inevitably invoke the thought of moral dilemmas as studied through the set of trolley problems (thomson1976killing), i.e., ethical dilemmas about sacrificing one person to save a larger number of people. Even humans tend to disagree about the morally correct judgment to different variations of the trolley problem. Nonetheless, we challenge Delphi to better characterize its judgments on such contentious dilemmas. Figure 6 shows Delphi’s judgment on simplified versions of two typical trolley problems. Delphi’s responses show that given two options it can pick the action that is likely to be less contentious for people (all else being equal).

## 6 Social and Demographic Justice Implications of Delphi

In addition to quantifying Delphi’s abilities to produce judgments in general situations, it is critical to scrutinize the technology from a fairness and justice perspective. This is especially crucial for situations involving minoritized or marginalized social groups, against whom hateful acts or discriminatory thinking may be rooted in the perception that these groups are less moral, or even immoral (ungar2000state; does2011thou; hoover2019bound).

Hence, we analyze the social awareness of our model, and quantify to what extent it counters or perpetuates biases and injustices against different demographics. We first perform a quantitative study of the variability of Delphi’s judgments with respect to the basic human rights of many social and demographic groups (§6.1). Then, we perform a qualitative exploration of the model’s cultural awareness regarding situations whose sensitivity varies with the participants’ identities (§6.2).

### 6.1 Fundamental Human Rights

We first investigate Delphi’s potential biases against specific social and demographic identities. Inspired by model cards and bias investigations in hate speech detection models (Dixon2018unintended; mitchell2019model), we design a probing task that takes in various situations and identities and quantifies the variation for different demographic and social identities.

Since we aim to quantify biases against groups or identities, we draw our probing situations from the Universal Declaration of Human Rights (UDHR) issued by the United Nations (united-nations-human-rights), and measure for which identities Delphi’s judgments fail to follow the UDHR. (We operate from the premise that these rights are fundamental and should apply to all groups of people, with a few exceptions that do not make sense, e.g., “children going to college.”) We transform each of the UDHR articles into 44 templated situations with identity placeholders (e.g., “{identity} having medical care”; the full list is shown in Table 21 in Appendix A.2). Then we obtain model judgments by inserting each of our 200 demographic identities (Table 20 in Appendix A.2) into the situation templates, and compare them to the expected judgment.
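The templated probing amounts to a cross-product of situation templates and identities. The templates and identities below are illustrative examples only; the full lists are in Tables 20 and 21 of the appendix.

```python
# Illustrative templates with an {identity} placeholder (full list: Table 21).
TEMPLATES = ["{identity} having medical care",
             "{identity} going to college"]

# Illustrative identities (full list of 200: Table 20).
IDENTITIES = ["women", "blind people", "teenagers"]

def probe_inputs():
    """Fill every identity into every template to produce probe situations."""
    return [t.format(identity=i) for t in TEMPLATES for i in IDENTITIES]
```

With 44 templates and 200 identities, this cross-product yields 8,800 probe situations, each of which is fed to Delphi and compared against the expected judgment.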

Figure 7 shows the degree of discrepancy between Delphi’s prediction and the expected judgment of each action, averaged over the identities represented within each type of social and demographic domain. As a pleasant surprise, Delphi correctly identifies the expected judgment of all actions representative of human rights to people of all gender identities and sexual orientations, and almost all races/ethnicities, religions, nationalities/countries of origin, political parties, and disabilities. Examples where Delphi fails to predict the expected judgment include situations that reflect harmful stigmas against certain identities (“blind people are not expected to raise children”) or upsetting realities about authoritarian governments (“people from North Korea don’t have the right to liberty”).

We further investigate the details of the model’s predictions for individual identities, as plotted in Figures 8 and 10–13 in Appendix A.2. For certain social identity groups with a higher discrepancy between Delphi’s prediction and the expected judgment, i.e., age groups and social classes (Figure 8), qualitative assessments show that the expected judgment is often incorrect when the expressed rights are incongruent with the identity-specific characteristics of the group. Teenagers, for example, retain eventual rights to “getting married,” “raising children,” “founding a family,” or “owning property.” However, these are also characteristics not often associated with teenage-hood. Interestingly, the model’s judgment of “not expected” for these rights simply reflects the incompatibilities presented by characteristics of teenagers. These results suggest that Delphi is able to capture nuanced social, cultural, and ethical implications bounded by the implied characteristics of an identity.

Misalignments between model predictions and what is believed to be right reflect a discrepancy between an ideal society and the current social climate. For instance, in Figure 8, Delphi predicts that “it’s not expected” for a poor or homeless person to “have adequate food/clothes/housing,” or that “it’s unusual [for them] to go to college.” While this ultimately reflects the reality of today’s society, it does not necessarily mean that it should or will always be the case. As it is, Delphi is a neural snapshot of its training data, which can be used to study present perceptions of ethics and morality. Any forward-looking research should take the ever-evolving views of social norms into account, and avoid over-relying on (potentially obsolete) historical data to shape the future (benjamin2019race).

### 6.2 Identity-aware Moral Discretion

In the previous section, we showed that Delphi is generally robust and invariant to demographic and social identities with respect to basic human rights. However, our analyses also show that certain situations for specific social and demographic identities can have different moral and ethical implications. We further explore these identity-dependent situations through qualitative analyses.

As shown in Table 13, Delphi is often capable of making socially aware discretionary judgments with respect to varying premises associated with social identifiers. For instance, the model predicts that while “it’s okay” for women or Hispanic people to “apply to minority scholarships,” white men “shouldn’t” do so, which is in line with the purpose of most minority scholarships. To illustrate more complex situations, Delphi recognizes that “it’s okay” for women and “it’s bad” for men to “rest in the lactation room,” but it fails to consider that trans men may also sometimes need to use the lactation room to take care of babies (a process that has come to be known as “chestfeeding”; De_la_Cretaz2016-ns), and thus attributes “you shouldn’t” to that situation. However, specifying “a trans man with a baby” causes the model to correct its course and predict that “it’s okay.”

For minority-specific language, the model shows some understanding of reclaimed slurs. For example, Delphi infers that “it’s wrong” to “call your friend ‘n*gga’ ” between two white people, but that “it’s okay” if both friends are black, echoing the complex social dynamics of this reclaimed racial slur (Galinsky2013-rw). However, the model does not have the same nuanced understanding for the recently reclaimed homophobic slur “f*ggot” (Cheves2017-ub; Fasoli2019-tx).

These examples showcase Delphi’s strength at interpreting compositional language to make moral and ethical inferences for situations involving nuanced social dynamics and diverse identities. However, as is the case with many AI systems, some wrong predictions can have much more drastic consequences than others, and can further marginalize groups or perpetuate biases against them. Thus, particular attention should be paid when dealing with Delphi predictions for situations involving marginalized identities.

## 7 How much can Delphi transfer?

In previous sections, we demonstrate Delphi’s robust intrinsic performance over Commonsense Norm Bank and on out-of-distribution hand-crafted compositional examples. This section further explores Delphi’s ability to transfer to downstream moral reasoning tasks, specifically, tasks within the ETHICS benchmark (hendrycks2021aligning).

#### The Ethics benchmark (hendrycks2021aligning)

is constructed to assess a language model’s knowledge of basic concepts of morality. As detailed in §3.1, there are five tasks within ETHICS: justice, deontology, virtue, utilitarianism and commonsense morality. Justice requires giving people what they are due, and is further broken down into two components: impartiality (i.e., invariance to irrelevant or protected features) and desert (i.e., whether people get what they deserve). Deontological ethics concerns whether an act is required, permitted or forbidden according to a set of rules or constraints, and encompasses two sub-tasks: request (i.e., whether an excuse is reasonable given a request) and role (i.e., whether a responsibility is reasonable given a role). Virtue ethics emphasizes the good or bad character traits that people have. Utilitarianism compares the level of well-being for people in a pair of scenarios. Finally, commonsense morality concerns descriptive ethics of everyday situations, spanning short (1-2 sentence, crowdsourced) to long (1-6 paragraph, sourced from Reddit) scenarios. Table 22 shows examples of the tasks from ETHICS.

We include the short scenarios from the commonsense morality task in the training data of Delphi. Data for the other tasks, and the long scenarios from the commonsense morality task, do not appear in the data used to pre-train Delphi. To explore the transfer learning ability of Delphi, we fine-tune Delphi on the five tasks from ETHICS.

#### Evaluation metrics.

We report the binary classification accuracies for the five tasks to be consistent with hendrycks2021aligning. For Justice, Deontology, and Virtue, which consist of groups of related examples (groups of 4, 4, and 5 examples, respectively, that are minimal edits of each other), an example is considered correct only if all of the related examples in its group are classified correctly by the model. For Utilitarianism, an example is considered correct if the model ranks the two actions correctly. Commonsense morality is measured with binary classification accuracy.
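The grouped metric can be sketched as follows (the function name is ours); it assumes the per-example correctness flags are ordered so that each group's examples are contiguous:

```python
# Grouped accuracy for Justice, Deontology, and Virtue: a group counts as
# correct only if every example in it is classified correctly.
def grouped_accuracy(correct_flags, group_size):
    groups = [correct_flags[i:i + group_size]
              for i in range(0, len(correct_flags), group_size)]
    return sum(all(g) for g in groups) / len(groups)
```

This is a stricter metric than plain accuracy: one misclassified minimal edit invalidates its entire group.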

#### Baselines.

We compare Delphi’s performance to the baseline results reported by hendrycks2021aligning. In addition, we fine-tune a T5-11B baseline model to examine the effect of pre-training on Commonsense Norm Bank. We apply the same hyperparameters used to pre-train Delphi (§3.3) to fine-tune Delphi and T5-11B on ETHICS. All results are reported in Table 14.

#### Results.

Both T5-11B and Delphi outperform the baselines from hendrycks2021aligning by a large margin across both the test and hard test sets, indicating that larger pre-trained language models are capable of adapting to moral reasoning tasks more effectively than smaller models. In particular, Delphi improves over all baselines for the Justice, Virtue, Utilitarianism and Commonsense Morality tasks, and the improvement is even more significant on the hard test set. For Deontology, T5-11B performs slightly better than Delphi. In conclusion, we show that pre-training on Commonsense Norm Bank can facilitate downstream moral reasoning tasks as well, even under different value systems and task framings.

## 8 Implications and Outlooks of Machine Moral Reasoning

Encoding moral values into AI systems has been undervalued or overlooked in the past. Some researchers contend that progress in machine learning and computational ethics does not have to be accomplished simultaneously (Armstrong2013); others argue that it is crucial, but consider it outside the current scope of AI development (Moor2006). However, given the pervasiveness of AI applications, we believe that failing to account for ethical norms notably hinders their ability to interact effectively with humans (PEREIRA20161). With the outstanding ability to encode descriptive ethics demonstrated by Delphi, we argue that the future is now: we advocate for collective efforts in the promising field of computational ethics to pave the way towards socially responsible deployment of AI applications. In this section, we conclude by laying out the ethical implications and outlooks of our work, to understand our responsibilities as researchers in facilitating reliable, socially aware, and ethically informed AI in the future.

### 8.1 Implications of Delphi

#### Limitations.

While Delphi achieves high accuracy and empirical performance on all of our current tasks (§4 and §5), we also acknowledge its limitations (§5). Our systematic probing indicates that Delphi is not immune to the social biases of our times (§6), and can default to the stereotypes and prejudices in our society that marginalize certain social groups and ethnicities. However, we believe that to effectively build reliable, practical AI systems with moral values, we must continue to investigate and develop socially inclusive models. The reality that Delphi does not always live up to these expectations points towards a compelling direction for future research.

#### Transparency and accountability.

We acknowledge that morality is hardly a static construct. As societies evolve over time, adjusting away from their tendencies to discriminate and striving for inclusivity, we believe that the task of updating computational ethics models like Delphi is a continuous process requiring attention from researchers of various backgrounds and origins. Therefore, transparency in efforts around morality and ethics in AI is critical: engaging researchers in open discourse, and inviting various viewpoints into the improvement of computational ethics models. In this effort, we make our system and data available for public use, and invite further dialogue.

#### Cultural biases.

The various datasets that were unified to construct the Commonsense Norm Bank were predominantly crowdsourced. We acknowledge that such crowdsourced datasets can implicitly encapsulate the moral compass and social expectations of the crowdworkers employed to create them, and primarily reflect the English-speaking cultures of the United States in the 21st century. Expanding the Commonsense Norm Bank to be inclusive of other cultures and regions is an important direction for future work.

#### Dual use concern.

We release the model and the demo for public use. However, we note that the results of our work are strictly intended for research purposes only. Neither the model nor the demo is intended to be used for providing moral advice to people.

### 8.2 Directions for Future Work

Delphi can be viewed as a pre-trained model for norms (analogous to pre-training for language, though technically Delphi is trained after pre-training a language model), and custom fine-tuning can potentially improve personalization. However, fine-tuning does not guarantee that unwanted norms from the initial training can be easily overridden, and we believe that addressing these concerns is an important future research direction. Beyond the technicalities of training a language-based moral reasoning system, we also present a list of several open questions and avenues for future research. We sincerely urge our research community to collectively tackle these research challenges head-on, in an attempt to build ethical, reliable, and inclusive AI systems:

1. Is moral reasoning reducible to objective reasoning?

2. How can we build systems that can handle complex situations, moving beyond reasoning over short snippets?

3. Can we move beyond language-based moral reasoning systems to multi-modal systems that can process visual and audio signals as well? Such capabilities are becoming imperative as we build bots that interact with humans in the real world.

4. How can a system handle more complex moral dilemmas or controversial issues?

5. How does a moral reasoning system distinguish broad, generally accepted norms from personal preferences?

6. How do we address the conflicts between individual preferences and the common good (e.g., “No one wants a car that looks after the greater good. They want a car that looks after them,” SelfDriv34:online)?

7. How do we exert finer-grained control over the system’s choices (beyond just toying with the training examples)?

8. How does one integrate a system like Delphi to influence the behavior of other models on downstream tasks (e.g., by influencing the objective function, as in multi-task learning, or through background knowledge integration methods)? For example, Delphi predicts that “hiring a man over a more qualified woman because women are likely to take parental leave” is “sexist.” How can downstream decision-making systems effectively incorporate this additional information?

9. How prevalent is moral reporting bias (i.e., people say one thing but do another)? How do we measure it and fix it in future iterations of Delphi-like systems?

10. How can a moral reasoning system account for diversity of cultures, ideology and societal structures?

11. How does a moral reasoning system evolve in lockstep with the evolution of societies over time?

12. How can we efficiently collect moral judgments in the wild (e.g., by building interactive interfaces that gather adversarial moral judgments from the general public)? Such judgments are presumed to capture a more accurate distribution of people’s moral judgments, with broader coverage of opinions compared to (narrowly representative) crowd-sourced annotations.

13. Can we elicit explanations of models’ moral judgments to make model decisions traceable?

## 9 Conclusion

We present Delphi, the first unified model of descriptive ethics applied to actions grounded in a wide variety of everyday situations. Delphi displays robust performance over three different moral reasoning tasks, i.e., free-form QA, yes/no QA, and relative QA. In support of these tasks and to train Delphi, we also introduce the Commonsense Norm Bank—a new unified dataset of 1.7M single or paired actions grounded in real-life situations along with their associated categorical judgments and open-text descriptions. Commonsense Norm Bank is created by unifying and augmenting several related datasets (e.g., Social Chemistry; forbes2020socialchemistry) and is carefully designed to capture a wide array of situationally grounded ethical judgments. Delphi’s impressive performance on machine moral reasoning under diverse, compositional real-life situations highlights the importance of developing high-quality human-annotated datasets of people’s moral judgments. Finally, we demonstrate through systematic probing that Delphi still struggles with situations dependent on time or diverse cultures, and with situations carrying social and demographic bias implications. We discuss the capabilities and limitations of Delphi throughout this paper and identify key directions in machine ethics for future work. We hope that our work opens up important avenues for future research in the emerging field of machine ethics, and we encourage collective efforts from our research community to tackle these research challenges.

## Acknowledgements

The authors thank Yoav Goldberg and Peter Clark for helpful discussions, and Sam Stuesser from the REVIZ team at AI2 for designing the logo of the Ask Delphi demo. This research was supported in part by DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI (AI2). TPU machines for conducting experiments were generously provided by Google through the TensorFlow Research Cloud (TFRC) program.

## Appendix A Appendix

### a.1 Statistics of Commonsense Norm Bank

Table 15 shows the statistics of Commonsense Norm Bank, broken down by free-form QA, yes/no QA, and relative QA tasks, from diverse data resources.

### a.2 Universal Declaration of Human Rights by the United Nations

To assess Delphi’s ability to generalize appropriate moral values across different social and demographic groups, we probe the model with a series of actions reflecting basic human rights defined by the Universal Declaration of Human Rights (UDHR) (united-nations-human-rights), as discussed in §6. A listing of the social and demographic domains used in the experiments is shown in Table 20. A complete list of the templates of actions encapsulating the UDHR issued by the United Nations is found in Table 21. Finally, additional results for individual social and demographic identity types within each domain are given in Figures 10-13.

### a.3 GPT-3 baselines

#### Few-shot prompts.

Tables 17, 18, and 19 specify the few-shot examples and prompts used for the GPT-3 baselines, for free-form QA, yes/no QA, and relative QA, respectively. For the relative QA prompts, each of the 30-shot examples was prefaced with the question: “Which action is more morally preferable?” Expenditure for GPT-3 experiments across the three tasks totaled \$813 for GPT-3 (xl) and \$12 for GPT-3 (s) API requests.
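The assembly of a relative QA few-shot prompt can be sketched as follows. This is a minimal, hypothetical illustration, assuming a simple `Action1 / Action2 / Answer` layout; the exact example format and shots used for the GPT-3 baselines are those given in Tables 17-19.

```python
# Hypothetical sketch of assembling a few-shot relative QA prompt for a
# GPT-3-style completion API. Field names and layout are illustrative;
# the actual prompts are specified in Tables 17-19.

def build_relative_qa_prompt(examples, query):
    """Preface each shot with the task question, then append the
    unanswered query pair for the model to complete."""
    lines = []
    for action1, action2, preferred in examples:
        lines.append("Which action is more morally preferable?")
        lines.append(f"Action1: {action1}")
        lines.append(f"Action2: {action2}")
        lines.append(f"Answer: Action{preferred}")
        lines.append("")  # blank line separating shots
    lines.append("Which action is more morally preferable?")
    lines.append(f"Action1: {query[0]}")
    lines.append(f"Action2: {query[1]}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_relative_qa_prompt(
    [("helping a friend", "ignoring a friend in need", 1)],
    ("returning a lost wallet", "keeping a lost wallet"),
)
```

In the paper's setting, 30 such shots would precede the query; the completion after the trailing `Answer:` is parsed as the model's ranking.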

#### Zero-shot prompts.

For free-form QA, we use “{action}. This action is good/ethical/moral/etc.” to represent the positive class, “{action}. This action is bad/unethical/immoral/etc.” to represent the negative class, and “{action}. This action is okay/fine/expected/etc.” to represent the neutral class. For yes/no QA, we use “{action}. This statement is true/correct/right/good.” to represent the positive class, and “{action}. This statement is false/incorrect/wrong/bad.” to represent the negative class. Finally, for relative QA, we use “Action1: {action1}. Action2: {action2}. Action1 is more moral than Action2.” and vice versa to represent the two ranking options.
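The zero-shot setup above can be sketched as template-based classification: each class label is mapped to a filled template, and the class whose statement the language model assigns the highest likelihood wins. The sketch below is a minimal illustration under that assumption; `log_likelihood` is a placeholder for a real model scoring call, not part of any actual API.

```python
# Minimal sketch of the zero-shot free-form QA prompt construction.
# The class is chosen by comparing the model's likelihood of each filled
# template; `log_likelihood` stands in for a real language-model call.

FREE_FORM_TEMPLATES = {
    "good":    "{action}. This action is good/ethical/moral/etc.",
    "bad":     "{action}. This action is bad/unethical/immoral/etc.",
    "neutral": "{action}. This action is okay/fine/expected/etc.",
}

def zero_shot_classify(action, log_likelihood):
    """Return the class label whose filled template scores highest."""
    scores = {
        label: log_likelihood(template.format(action=action))
        for label, template in FREE_FORM_TEMPLATES.items()
    }
    return max(scores, key=scores.get)
```

The yes/no and relative QA settings follow the same pattern with their respective templates, comparing two filled statements per class pair.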

### a.4 Human Evaluation Crowdsourcing Templates

The template used for crowdsourcing human evaluation of Delphi’s generations is shown in Figure 9. The average pay for the evaluations was \$19 per hour.

### a.5 Examples from the Ethics Benchmark

We show examples from the ETHICS benchmark in Table 22.