AI systems are currently used to make, or contribute to, many important decisions. These systems are deployed in self-driving cars, organ allocation programs, businesses for hiring, and courtrooms to set bail. It is an ongoing challenge for AI researchers to ensure that these systems make decisions that align with human values.
A growing body of research views this challenge through the lens of preference aggregation. From this perspective, researchers aim to (1) understand the preferences (or values) of the relevant stakeholders, and (2) design an AI system that aligns with the aggregated preferences of all stakeholders. This approach has been proposed recently in the context of self-driving cars Noothigattu et al. (2018) and organ allocation Freedman et al. (2020). These approaches rely on a mathematical model of stakeholder preferences, which is typically learned using data collected via hypothetical decision scenarios or online surveys. (The MIT Moral Machine project, https://www.moralmachine.net/, is one example.) There is a rich literature addressing how to elicit preferences accurately and efficiently, spanning the fields of computer science, operations research, and social science.
It is critical that these observed preferences accurately represent people’s true preferences, since these observations guide deployed AI systems. Importantly, the way we measure (or elicit) preferences is closely tied to the accuracy of these observations. In particular, it is well-known that both the order in which questions are asked and the set of choices presented impact expressed preferences Day et al. (2012); DeShazo and Fermo (2002).
Often people choose not to express a strict preference, in which case we call them indecisive. The economics literature has suggested a variety of explanations for indecision Gerasimou (2018)—for example when there are no desirable alternatives, or when all alternatives are perceived as equivalent. Moral psychology research has found that people often “do not want to play god” in moral situations, and would prefer for somebody or something else to take responsibility for the decision Gangemi and Mancini (2013).
In philosophy, indecision of the kind discussed in this paper is typically linked to a class of moral problems called symmetrical dilemmas, in which an agent is confronted with the choice between two alternatives that are or appear to the agent equal in value Sinnott-Armstrong (1988). (Sophie’s Choice is a well-known example: a guard at the concentration camp cruelly forces Sophie to choose one of her two children to be killed. The guard will kill both children if Sophie refuses to choose. Sophie’s reason for not choosing one child applies equally to the other, hence the symmetry.) Much of the literature concerns itself with the morality and rationality of the use of a randomizer, such as flipping a coin, to resolve these dilemmas. Despite some disagreements over details McIntyre (1990); Donagan (1984); Blackburn (1996); Hare (1981), many philosophers do agree that flipping a coin is often a viable course of action in response to indecision (with some exceptions; see, for example, Railton (1992)).
The present study accepts the assumption that flipping a coin is typically an expression of one’s preference to not decide between two options, but goes beyond the received view in philosophy by suggesting that indecision can also be common and acceptable when the alternatives are asymmetric. We show that people often do adopt coin flipping strategies in asymmetrical dilemmas, where the alternatives are not equal in value. Thus, the use of a randomizer is likely to play a more complex role in moral decision making than simply as a tie breaker for symmetrical dilemmas.
Naturally, people are also sometimes indecisive when faced with difficult decisions related to AI systems. However, it is commonly assumed in the preference modeling literature that people always express a strict preference, unless (A) the alternatives are approximately equivalent, or (B) the alternatives are incomparable. Assumption (A) is mathematically convenient, since it is necessary for preference transitivity. (My preferences are transitive if “I prefer A over B” and “I prefer B over C” together imply “I prefer A over C”.) Since indecision is both a common and meaningful response, strict preferences alone cannot accurately represent people’s real values. Thus, AI researchers who wish to guide their systems using observed preferences should be aware of the hidden meanings of indecision. We aim to uncover these meanings in a series of studies.
Our Contributions. First, we conduct a pilot experiment to illustrate how different interpretations of indecision lead to different outcomes (§ Study 1: Indecision is Not Random Choice). Using hypothesis testing, we reject the common assumption (A) that indecision is expressed only toward equivalent—or symmetric—alternatives.
Then, drawing on ideas from psychology, philosophy, and economics, we discuss several other potential reasons for indecision (§ Models for Indecision). We formalize these ideas as mathematical indecision models, and develop a probabilistic interpretation that lends itself to computation (§ Indecision Model Formalism).
To test the utility of these models, we conduct a second experiment to collect a much larger dataset of decision responses (§ Study 2: Fitting Indecision Models). We take a machine learning (ML) perspective, and evaluate each model class based on its goodness-of-fit to this dataset. We assess each model class for predicting individual people’s responses, and then we briefly investigate group decision models.
In all of our studies, we ask participants “who should receive the kidney?” in a hypothetical scenario where two patients are in need of a kidney, but only one kidney is available. As a potential basis for their answers, participants are given three “features” of each patient: age, amount of alcohol consumption, and number of young dependents.
We chose this task for several reasons: first, kidney exchange is a real application where algorithms influence, and sometimes make, important decisions about who receives which organ. (Many exchanges match patients and donors algorithmically, including the United Network for Organ Sharing, https://unos.org/transplant/kidney-paired-donation/, and the UK national exchange, https://www.odt.nhs.uk/living-donation/uk-living-kidney-sharing-scheme/.) Second, organ allocation is a difficult problem: there are far fewer donor organs than there are people in need of a transplant. (Current figures on the number of people awaiting a transplant and the number of transplants conducted are reported at https://unos.org/data/transplant-trends/.) Third, the question of who should receive these scarce resources raises serious ethical dilemmas Scheunemann and White (2011). Kidney allocation is also a common motivation for studies of fair resource allocation Agarwal et al. (2019); McElfresh and Dickerson (2018); Mattei et al. (2018). Furthermore, this type of scenario is frequently used to study people’s preferences and behavior Freedman et al. (2020); Furnham et al. (2000, 2002); Oedingen et al. (2018). Importantly, this prior work focuses on people’s strict preferences, while we aim to study indecision.
Study 1: Indecision is Not Random Choice
We first conduct a pilot study to illustrate the importance of measuring indecision. Here we take the perspective of a preference-aggregator; we illustrate this perspective using a brief example: Suppose we must choose between two alternatives (X or Y), based on the preferences of several stakeholders. Using a survey we ask all stakeholders to express a strict preference (to “vote”) for their preferred alternative; X receives 10 votes while Y receives 6 votes, so X wins.
Next we conduct the same survey, but allow stakeholders to vote for “indecision” instead; now, X receives 4 votes, Y receives 5 votes, and “indecision” receives 7 votes. If we assume that voters are indecisive only when alternatives are nearly equivalent (assumption (A) from Section Introduction), then each “indecision” vote is analogous to one half-vote for both X and Y, and therefore Y wins. In other words, in the first survey we assume that all indecisive voters choose randomly between X and Y. However, if indecision has another meaning, then it is not clear whether X or Y wins. Thus, in order to make the best decision for our constituents we must understand what meaning is conveyed by indecisive voters. Unfortunately for our hypothetical decision-maker, assumption (A) is not always valid.
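The tally in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `effective_votes` is hypothetical, and it encodes assumption (A) by treating each indecision vote as half a vote for each alternative.

```python
def effective_votes(votes_x, votes_y, votes_indecision):
    """Tally under assumption (A): each indecision vote counts as
    half a vote for X and half a vote for Y."""
    half = votes_indecision / 2
    return votes_x + half, votes_y + half


# Second survey from the example: X = 4, Y = 5, indecision = 7.
x_eff, y_eff = effective_votes(4, 5, 7)
if x_eff > y_eff:
    winner = "X"
elif y_eff > x_eff:
    winner = "Y"
else:
    winner = "tie"
print(x_eff, y_eff, winner)
```

Because every indecision vote adds the same amount to both alternatives, the winner under assumption (A) is decided by the strict votes alone; the indecision votes only matter if they carry some other meaning.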
Using a small study, we test, and reject, assumption (A), which we frame as two different hypotheses. The first is H0-1: if we discard all indecisive votes, then both X and Y receive the same proportion of votes, whether or not indecision is allowed. A second, related hypothesis is H0-2: if we assign half of a vote to both X and Y when someone is indecisive, then both X and Y receive the same proportion of votes, whether or not indecision is allowed. We conducted the hypothetical surveys described above, using 15 kidney allocation questions (see Appendix A for the survey text and analysis). Participants were divided into two groups: participants in group Indecisive (N=62) were allowed to express indecision (phrased as “flip a coin to decide who receives the kidney”), while group Strict (N=60) was forced to choose one of the two recipients. We test H0-1 by identifying the majority patient “X” (who received the most votes) and the minority patient “Y” for each of the 15 questions (details of this analysis are in Appendix A). Overall, group Indecisive cast 581 votes for the majority patient, 74 votes for the minority patient, and 275 indecision votes; the Strict group cast 751 votes for the majority patient and 149 for the minority patient. Using a Pearson’s chi-squared test we reject H0-1. According to H0-2, we might assume that all indecision votes are “effectively” one half-vote for both the minority and majority patient. In this case, the Indecisive group casts 718.5 “effective” votes for the majority patient and 211.5 for the minority patient; using these votes we reject H0-2.
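One way such a test could be set up is sketched below, using only the vote counts reported above. This is an assumption about the form of the analysis (a 2x2 contingency table of group by majority/minority votes, with no continuity correction), not a reproduction of the procedure in Appendix A; the function name `pearson_chi2_2x2` is hypothetical.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and
    p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for r, row in enumerate(table):
        for k, observed in enumerate(row):
            expected = row_totals[r] * col_totals[k] / n
            stat += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-squared survival function
    # at x equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: group Indecisive, group Strict. Columns: majority, minority votes.
h01_stat, h01_p = pearson_chi2_2x2([[581, 74], [751, 149]])        # indecision votes discarded
h02_stat, h02_p = pearson_chi2_2x2([[718.5, 211.5], [751, 149]])   # half-votes added
print(h01_stat, h01_p)
print(h02_stat, h02_p)
```

Under these assumptions both statistics exceed the 5% critical value for one degree of freedom (about 3.84), which agrees with the rejections reported in the text. Applying a chi-squared test to fractional "effective" counts is itself a simplification.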
In the context of our hypothetical choice between X and Y, this finding is troublesome: since we reject H0-1 and H0-2, we cannot choose a winner by selecting the alternative with the most votes—or, if indecision is measured, the most “effective” votes. If indecision has other meanings, then the “best” alternative depends on which meanings are used by each person; this is our focus in the remainder of this paper.
Models for Indecision
The psychology and philosophy literature find several reasons for indecision, and many of these reasons can be approximated by numerical decision models. Before presenting these models, we briefly discuss their related theories from psychology and philosophy.
Difference-Based Indecision. In the preference modeling literature it is sometimes assumed that people are indecisive only when both alternatives (X and Y) are indistinguishable. That is, the perceived difference between X and Y is too small to arrive at a strict preference. In philosophy, this is referred to as “the possibility of parity” Chang (2002).
Desirability-Based Indecision. In cases where both alternatives are not “good enough”, people may be reluctant to choose one over the other. This has been referred to as “single option aversion” Mochon (2013), when consumers do not choose between product options if none of the options is sufficiently likable. Zakay (1984) observes this effect in single-alternative choices: people reject an alternative if it is not sufficiently close to a hypothetical “ideal”. Similarly, people may be indecisive if both alternatives are attractive. People faced with the choice between two highly valued options often opt for an indecisive resolution in order to manage negative emotions Luce (1998).
Conflict-Based Indecision. People may be indecisive when there are both good and bad attributes of each alternative. This is phrased as conflict by Tversky and Shafir (1992): people have trouble deciding between two alternatives if neither is better than the other in every way. In the AI literature, the concept of incomparability between alternatives is also studied Pini et al. (2011).
While these notions are intuitively plausible, we need mathematical definitions in order to model observed preferences. That is the purpose of the next section.
Indecision Model Formalism
In accordance with the literature, we refer to decision-makers as agents. Agent preferences are represented by binary relations over each pair of items $(i, j) \in \mathcal{I} \times \mathcal{I}$, where $\mathcal{I}$ is a universe of items. We assume agent preferences are complete: when presented with item pair $(i, j)$, they express exactly one response $r \in \{0, 1, 2\}$, which indicates:
$i \succ j$, or $r = 1$: the agent prefers $i$ more than $j$
$j \succ i$, or $r = 2$: the agent prefers $j$ more than $i$
$i \sim j$, or $r = 0$: the agent is indecisive between $i$ and $j$
When preferences are complete and transitive (agent preferences are transitive if $i \succ j$ and $j \succ k$ together imply $i \succ k$), the preference relation corresponds to a weak ordering over all items Shapley and Shubik (1974). In this case there is a utility function representation for agent preferences, such that $i \succ j \iff u(i) > u(j)$, and $i \sim j \iff u(i) = u(j)$, where $u : \mathcal{I} \to \mathbb{R}$ is a continuous function. We assume each agent has an underlying utility function; however, in general we do not assume preferences are transitive. In other words, we assume agents can rank items based on their relative value (represented by $u(\cdot)$), but in some cases they consider other factors in their response, causing them to be indecisive. Next, to model indecision we propose mathematical representations of the causes for indecision from Section Models for Indecision.
Mathematical Indecision Models
All models in this section are specified by two parameters: a utility function $u(\cdot)$ and a threshold $\lambda$. Each model is based on scoring functions: when the agent observes a query $(i, j)$ they assign a numerical score to each response, and they respond with the response type that has maximal score; we assume that score ties are broken randomly, though this assumption will not be important. In accordance with the literature, we assume the agent observes random iid additive error for each response score (see, e.g., Soufiani et al. (2013)). Let $S_r(i, j)$ be the agent’s score for response $r$ to comparison $(i, j)$; the agent’s response is given by
$$R(i, j) = \arg\max_{r \in \{0, 1, 2\}} \; S_r(i, j) + \epsilon_r .$$
That is, the agent has a deterministic score $S_r(i, j)$ for each response $r$, but when making a decision the agent observes a noisy version of this score, $S_r(i, j) + \epsilon_r$. We make the common assumption that noise terms $\epsilon_r$ are iid Gumbel-distributed, with scale $\mu = 1$. In this case, the distribution of agent responses is
$$p(i, j, r) = \frac{e^{S_r(i, j)}}{e^{S_0(i, j)} + e^{S_1(i, j)} + e^{S_2(i, j)}}. \tag{1}$$
Each indecision model is defined using different score functions $S_r(i, j)$, where $r = 0$ denotes indecision and $r = 1, 2$ denote the two strict responses. Score functions for strict responses are always symmetric, in the sense that $S_1(i, j) = S_2(j, i)$; thus we need only define $S_1$ and $S_0$. We group each model by their cause for indecision from Section Models for Indecision.
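The link between noisy scores and a logit-style response distribution can be illustrated in code. The sketch below assumes standard Gumbel noise (location 0, scale 1) and uses hypothetical response codes 0 (indecision), 1, and 2 (strict responses); the function names are illustrative only. It relies on the standard fact that taking the argmax of scores plus iid Gumbel noise yields softmax choice probabilities.

```python
import math
import random


def response_distribution(scores):
    """Softmax over deterministic scores: P(r) = exp(S_r) / sum_r' exp(S_r')."""
    peak = max(scores.values())  # subtract the max for numerical stability
    weights = {r: math.exp(s - peak) for r, s in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def sample_response(scores, rng):
    """Add iid standard Gumbel noise to each score and return the argmax."""
    noisy = {}
    for r, s in scores.items():
        e = max(rng.expovariate(1.0), 1e-300)  # Exp(1) draw, guarded against zero
        noisy[r] = s - math.log(e)             # -log(Exp(1)) is a standard Gumbel draw
    return max(noisy, key=noisy.get)


# Hypothetical scores for one comparison.
scores = {0: 0.0, 1: 1.0, 2: -1.0}
probs = response_distribution(scores)
print(probs)
```

Sampling many responses with `sample_response` and comparing the empirical frequencies against `probs` is a quick way to confirm that the two descriptions agree.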
Difference-Based Models: Min-$\delta$, Max-$\delta$. Agents are indecisive when the utility difference between alternatives is either smaller than threshold $\lambda$ (Min-$\delta$) or greater than $\lambda$ (Max-$\delta$). The score functions for these models are
Here $\lambda$ should be non-negative: for example with Min-$\delta$, $\lambda = 0$ means the agent is never indecisive, while for Max-$\delta$ this means the agent is always indecisive. Model Max-$\delta$ seems counter-intuitive (if one alternative is clearly better than the other, why be indecisive?), yet we include it for completeness. Note that this is only one example of a difference-based model: instead the agent might assess alternatives using a distance measure between the items, rather than the difference in their utilities.
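The score functions themselves are not reproduced in this text, so the sketch below assumes one form that is consistent with the verbal description of the two difference-based models; it should not be read as the authors' definition. The response codes (0 for indecision, 1 and 2 for strict responses), the parameter name `threshold`, and the function names are all hypothetical.

```python
def min_delta_scores(u_i, u_j, threshold):
    """Assumed form: indecision scores highest when |u_i - u_j| < threshold."""
    diff = u_i - u_j
    return {0: threshold, 1: diff, 2: -diff}


def max_delta_scores(u_i, u_j, threshold):
    """Assumed form: indecision scores highest when |u_i - u_j| > threshold."""
    diff = u_i - u_j
    return {0: 2 * abs(diff) - threshold, 1: diff, 2: -diff}


def noise_free_response(scores):
    """Response with the maximal score, ignoring noise. Ties resolve to the
    first key here, whereas the text breaks ties randomly."""
    return max(scores, key=scores.get)


print(noise_free_response(min_delta_scores(1.0, 0.8, 0.5)))  # small difference
print(noise_free_response(max_delta_scores(2.0, 0.0, 0.5)))  # large difference
```

With a zero threshold and unequal utilities, the assumed Min-style scores never select indecision and the assumed Max-style scores always do, matching the remark about non-negative thresholds.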
Desirability-Based Models: Min-U, Max-U. Agents are indecisive when the utility of both alternatives is below threshold $\lambda$ (Min-U), or when the utility of both alternatives is greater than $\lambda$ (Max-U). Unlike the difference-based models, here $\lambda$ may be positive or negative. The score functions for these models are
Both of these models are motivated in the literature (see § Models for Indecision).
Conflict-Based Model: Dom. In this model the agent is indecisive unless one alternative dominates the other in all features, by at least threshold $\lambda$. For this indecision model, we need a utility measure associated with each feature of each item; for this purpose, let $u_n(i)$ be the utility associated with feature $n$ of item $i$. As before, here $\lambda$ may be positive or negative. The score functions for this model are
This is one example of a conflict-based indecision model, though we might imagine others.
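The desirability-based and conflict-based rules can be written directly as noise-free decision rules. The sketch below follows only the verbal descriptions above, not the score functions; the function names, the `threshold` parameter, and the response codes (0 for indecision, 1 for the first item, 2 for the second) are hypothetical.

```python
def min_u_indecisive(u_i, u_j, threshold):
    """Indecisive when both utilities fall below the threshold."""
    return u_i < threshold and u_j < threshold


def max_u_indecisive(u_i, u_j, threshold):
    """Indecisive when both utilities exceed the threshold."""
    return u_i > threshold and u_j > threshold


def dominates(features_a, features_b, threshold):
    """True if a beats b by at least the threshold on every per-feature utility."""
    return all(a - b >= threshold for a, b in zip(features_a, features_b))


def dom_response(features_i, features_j, threshold):
    """Strict response only if one item dominates the other; otherwise indecision.
    With a negative threshold both items may dominate; item i is checked first."""
    if dominates(features_i, features_j, threshold):
        return 1
    if dominates(features_j, features_i, threshold):
        return 2
    return 0


print(dom_response((3.0, 0.0, 3.0), (1.0, 1.0, 1.0), 1.0))  # mixed features
```

In the last call the first item is better on two features and worse on one, so neither item dominates and the rule returns indecision.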
These models serve as a class of hypotheses which describe how agents respond to comparisons when they are allowed to be indecisive. Using the response distribution in (1), we can assess how well each model fits with an agent’s (possibly indecisive) responses. However, in many cases agents are required to express strict preferences—they are not allowed to be indecisive (as in Section Study 1: Indecision is Not Random Choice). With slight modification the score-based models from this section can be used even when agents are forced to express only strict preferences; we discuss this in the next section.
Indecision Models for Strict Comparisons
We assume that agents may prefer to be indecisive, even when they are required to express strict preferences. That is, we assume that agents use an underlying indecision model to express strict preferences. When they cannot express indecision, we assume that they either resample from their decision distribution, or they choose randomly. That is, we assume agents use a two-stage process to respond to queries: first they sample a response $r$ from their response distribution; if $r$ is strict ($r = 1$ or $r = 2$), then they express it, and we are done. If they sample indecision ($r = 0$), then they flip a weighted coin to decide how to respond:
with probability $q$ (heads) they re-sample from their response distribution until they sample a strict response, without flipping the weighted coin again;
with probability $1 - q$ (tails) they choose uniformly at random between responses $r = 1$ and $r = 2$.
That is, for $r \in \{1, 2\}$ they respond according to distribution
$$p^{\text{strict}}(i, j, r) = p(i, j, r) + p(i, j, 0) \left[ q \, \frac{e^{S_r(i, j)}}{e^{S_1(i, j)} + e^{S_2(i, j)}} + \frac{1 - q}{2} \right]. \tag{2}$$
Here $p(i, j, r)$ denotes the response distribution in (1), $S_r(i, j)$ denotes the score for response $r$, and the fraction multiplying $q$ is the probability of response $r$ after re-sampling. The (heads) condition from above has another interpretation: the agent chooses to sample from a “strict” logit, induced by only the score functions for strict responses, $S_1$ and $S_2$. We discuss this model in more detail, and provide an intuitive example, in Appendix B.
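The two-stage process can be turned into a distribution over strict responses in a few lines. The sketch below is a direct reading of the process described above; the parameter name `p_resample` (the probability that the weighted coin leads to re-sampling) and the response codes 0, 1, 2 are hypothetical.

```python
def strict_distribution(probs, p_resample):
    """Distribution over strict responses 1 and 2, given a distribution
    `probs` over responses {0: indecision, 1, 2} and the probability
    `p_resample` of re-sampling (rather than choosing uniformly at random)
    after indecision is drawn."""
    strict_mass = probs[1] + probs[2]
    if strict_mass <= 0:
        raise ValueError("re-sampling needs some probability on a strict response")
    out = {}
    for r in (1, 2):
        resampled = probs[r] / strict_mass  # outcome of re-sampling until strict
        out[r] = probs[r] + probs[0] * (
            p_resample * resampled + (1 - p_resample) * 0.5
        )
    return out


probs = {0: 0.5, 1: 0.4, 2: 0.1}
always_resample = strict_distribution(probs, 1.0)
always_uniform = strict_distribution(probs, 0.0)
print(always_resample, always_uniform)
```

When `p_resample` is 1 the result depends only on the relative weight of the two strict responses, which is the "strict" logit interpretation; when it is 0 the indecision mass is split evenly.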
We now have mathematical indecision models which describe how indecisive agents respond to comparison queries, both when they are allowed to express indecision (§ Mathematical Indecision Models), and when they are not (§ Indecision Models for Strict Comparisons). The model in this section, and response distributions (1) and (2), represent one way indecisive agents might respond when they are forced to express strict preferences. The question remains whether any of these models accurately represent people’s expressed preferences in real decision scenarios. In the next section we conduct a second, larger survey to address this question.
Study 2: Fitting Indecision Models
In our second study, we aim to model people’s responses in the hypothetical kidney allocation scenario using indecision models from the previous section as well as standard preference models from the literature. The models from the previous section can be used to predict people’s responses, both when they are allowed to be indecisive, and when they are not. To test both classes of models, we conducted a survey with two groups of participants, where one group was given the option to express indecision, and the other was not. Each participant was assigned to one of 150 random sequences, each of which contains 40 pairwise comparisons between two hypothetical kidney recipients with randomly generated values for age, number of dependents, and number of alcoholic drinks per week. We recruited 150 participants for group Indecisive, which was given the option to express indecision (as in Study 1, this is phrased as “flip a coin”). Eighteen participants were excluded from the analysis for failing attention checks, leaving us with a final sample of N=132. Another group, Strict (N=132), was recruited to respond to the same 132 sequences, but without the option to express indecision.
We remove 26 participants from Indecisive who never express indecision, because it is not sensible to compare goodness-of-fit for different indecision models when the agent never chooses to be indecisive. This study was reviewed and approved by our organization’s Institutional Review Board; please see Appendix A for a full description of the survey and dataset.
Model Fitting. In order to fit these indecision models to data, we assume that agent utility functions are linear: each item $i$ is represented by a feature vector $x_i$; agent utility for item $i$ is $u(i) = w^\top x_i$, where $w$ is the agent’s utility vector. We take a maximum likelihood estimation (MLE) approach to fitting each model: i.e., we select the agent parameters (the utility vector and the threshold) which maximize the log-likelihood (LL) of the training responses. Since the LL of these models is not convex, we use random search via a Sobol process Sobol’ (1967). Utility vectors and probability parameters are each searched over a fixed domain, and the domain for the threshold depends on the model type (see Appendix B). The number of candidate parameters tested and the nature of the train-test split vary between experiments. All code used for our analysis is available online (https://github.com/duncanmcelfresh/indecision-modeling), and details of our implementation can be found in Appendix B.
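A minimal sketch of likelihood-based fitting by random search is given below. It makes several assumptions that are not taken from the paper: plain pseudo-random sampling stands in for the Sobol sequence, the search domains ([-1, 1] per utility weight and [0, 1] for the threshold) are placeholders, and the score functions follow an assumed difference-based form. All function names are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def log_likelihood(weights, threshold, data):
    """Mean log-likelihood of responses under an assumed difference-based
    model with linear utility and softmax responses. Each datum is
    (features_i, features_j, response) with response in {0, 1, 2}."""
    total = 0.0
    for x_i, x_j, response in data:
        diff = sum(w * (a - b) for w, a, b in zip(weights, x_i, x_j))
        scores = {0: threshold, 1: diff, 2: -diff}
        peak = max(scores.values())
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
        total += scores[response] - log_norm
    return total / len(data)


def random_search(data, n_features, n_candidates, rng):
    """Return (best LL, weights, threshold) over randomly drawn candidates."""
    best = None
    for _ in range(n_candidates):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]  # placeholder domain
        threshold = rng.uniform(0.0, 1.0)                              # placeholder domain
        ll = log_likelihood(weights, threshold, data)
        if best is None or ll > best[0]:
            best = (ll, weights, threshold)
    return best


# Toy responses over three features (made-up values, for illustration only).
toy_data = [
    ((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1),
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2),
    ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5), 0),
]
best_ll, best_weights, best_threshold = random_search(toy_data, 3, 200, random.Random(0))
print(best_ll, best_weights, best_threshold)
```

A low-discrepancy sequence such as Sobol covers the search domain more evenly than independent draws, which is the usual reason to prefer it when the number of candidates is limited.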
We explore two different preference-modeling settings: learning individual indecision models, and learning group indecision models.
Individual Indecision Models
The indecision models from Section Indecision Model Formalism are intended to describe how an indecisive agent responds to queries, both when they are given the option to be indecisive and when they are not. Thus, we fit each of these models to responses from both participant groups: Indecisive and Strict. For each participant we randomly split their question-response pairs into a training and testing set of equal size (20 responses each). For each participant we fit all five models from Section Indecision Model Formalism, and two baseline methods: Rand (expresses indecision with a fixed probability, and chooses randomly between alternatives otherwise) and MLP (a multilayer perceptron classifier). We include MLP as a state-of-the-art benchmark against which we compare our models; we use this benchmark to see how close our new models are to modern ML methods.
For group Indecisive we estimate the indecision probability for Rand from the training queries; for Strict this probability is 0. For MLP we train a classifier with one class for each response type, using scikit-learn Pedregosa et al. (2011): for Indecisive responses we train a three-class model (indecision and the two strict responses), and for Strict we train a two-class model (the two strict responses).
Table 1: Number of participants for whom each model is the 1st-, 2nd-, and 3rd-best fit (by training-set LL), and median train/test LL, for each group. Group Indecisive includes both indecision and strict responses; group Strict includes only strict responses.

| Model | Indecisive #1st | Indecisive #2nd | Indecisive #3rd | Indecisive Train/Test LL | Strict #1st | Strict #2nd | Strict #3rd | Strict Train/Test LL |
|---|---|---|---|---|---|---|---|---|
| Min-$\delta$ | 29 (27%) | 23 (22%) | 13 (12%) | -0.82/-0.85 | 26 (20%) | 53 (40%) | 34 (26%) | -0.44/-0.47 |
| Max-$\delta$ | 11 (10%) | 12 (11%) | 19 (18%) | -0.81/-0.90 | 31 (23%) | 57 (43%) | 25 (19%) | -0.44/-0.47 |
| Min-U | 8 (8%) | 32 (30%) | 17 (16%) | -0.83/-0.88 | 1 (1%) | 5 (4%) | 20 (15%) | -0.53/-0.56 |
| Max-U | 22 (21%) | 23 (22%) | 12 (11%) | -0.81/-0.83 | 1 (1%) | 5 (4%) | 15 (11%) | -0.53/-0.55 |
| Dom | 0 (0%) | 3 (3%) | 9 (8%) | -0.88/-0.95 | 2 (2%) | 4 (3%) | 3 (2%) | -0.57/-0.58 |
| Logit | 5 (5%) | 12 (11%) | 31 (29%) | -0.84/-0.90 | 4 (3%) | 5 (4%) | 27 (20%) | -0.53/-0.55 |
| Rand | 1 (1%) | 0 (0%) | 3 (3%) | -1.10/-1.10 | 6 (5%) | 0 (0%) | 1 (1%) | -0.69/-0.69 |
| MLP | 30 (28%) | 1 (1%) | 2 (2%) | -0.04/-1.15 | 61 (46%) | 3 (2%) | 7 (5%) | -0.03/-0.49 |
Goodness-of-fit. Using the standard ML approach, we select the best-fit models for each agent using the training-set LL, and evaluate the performance of these best-fit models using the test-set LL. Table 1 shows the number of participants for which each model was the 1st-, 2nd-, and 3rd-best fit (those with the greatest training-set LL), and the median test and train LL for each model. First we observe that no indecision model is a clear winner: several different models appear in the top 3 for each participant. This suggests that different indecision models fit different individuals better than others: there is not a single model that reflects everyone’s choices. However, some models perform better than others: Min-$\delta$ and Max-$\delta$ appear often in the top 3 models, as does Max-U for group Indecisive.
It is somewhat surprising that Max-$\delta$ fits participant responses, since this model does not seem intuitive: in Max-$\delta$, agents are indecisive when two alternatives have very different utility, i.e., one has much greater utility than the other. It is also surprising that Max-U is a good fit for group Indecisive, but not for Strict. One interpretation of this fact is that some people use (a version of) Max-U when they have the option, but they do not use Max-U when indecision is not an option. Another interpretation is that our modeling assumptions in Section Indecision Models for Strict Comparisons are wrong; however, our dataset cannot definitively explain this discrepancy.
Finally, MLP is the most common best-fit model in both groups, though it is rarely a 2nd- or 3rd-best fit. This suggests that the MLP benchmark accurately models some participants’ responses, and performs poorly for others; we expect this is due to overfitting. While MLP is more accurate than our models in some cases, it does not shed light on why people are indecisive.
It is notable that some indecision models (Min-$\delta$ and Max-$\delta$) outperform the standard logit model (Logit), both when they are learned from responses including indecision (group Indecisive), and when they are learned from only strict responses (group Strict). Thus, we believe that these indecision models give a more accurate representation of people’s decisions than the standard logit, both when they are given the option to be indecisive, and when they are not.
Since these indecision models may be accurate representations of people’s choices, it is informative to examine the best-fit parameters. Figure 1 shows best-fit parameters for participants in group Indecisive (top) and Strict (bottom); for each indecision model, we show all learned parameters for participants for whom the model is the 1st-best-fit (see Table 1). Importantly, the best-fit utility weights for age, alcohol consumption, and number of dependents are similar for all models, in both groups. That is, in general, people have similar relative valuations for different alternatives: a negative weight on age means younger patients are preferred over older patients, a negative weight on alcohol consumption means patients who consume less alcohol are preferred, and a positive weight on number of dependents means patients with more dependents are preferred. We emphasize that the indecision model parameters for group Strict (bottom panel of Figure 1) are learned using only strict responses.
These models are fit using only 20 samples, yet they provide useful insight into how people make decisions. Importantly, our simple indecision models fit observed data better than the standard logit, both when people can express indecision and when they cannot. Thus, contrary to the common assumption in the literature, not all people are indecisive only when two alternatives are nearly equivalent. This assumption may be true for some people (participants for whom Min-$\delta$ is a best-fit model), but it is not always true.
Next we turn to group decision models, where the goal is for an AI system to make decisions that reflect the values of a certain group of humans. In the spirit of the social choice literature, we refer to agents as “voters”, and suggested decisions as “votes”. We consider two distinct learning paradigms, where each reflects a potential use-case of an AI decision-making system.
The first paradigm, Population Modeling, concerns a large or infinite number of voters; our goal is to estimate responses to new decision problems that are the best for the entire population. This scenario is similar to conducting a national poll: we have a population including thousands or millions of voters, but we can only sample a small number (say, hundreds) of votes. Thus, we aim to build a model that represents the entire population, using a small number of votes from a small number of voters. There are several ways to aggregate uncertain voter models (see for example Chapter 10 of Brandt et al. (2016)); our approach is to estimate the next vote from a random voter in the population. Since we cannot observe all voters, our model should generalize not only to a “known” voter’s future behavior, but to all voters’ future behavior.
In the second paradigm, Representative Decisions, we have a small number of “representative” voters; our goal is to estimate best responses to new decision problems for this group of representatives. This scenario is similar to multi-stakeholder decisions including organ allocation or public policy design: these decisions are made by a small number of representatives (e.g., experts in medicine or policy), who often have very limited time to express their preferences. As in Population Modeling we aim to estimate the next vote from a random expert—however in this paradigm, all voters are “known”, i.e., in the training data.
Both voting paradigms can be represented as a machine learning problem: observed votes are “data”, with which we select a best-fit model from a hypothesis class; these models make predictions about future votes. (Several researchers have used techniques from machine learning for social choice Doucette et al. (2015); Conitzer et al. (2017); Kahng et al. (2019); Zhang and Conitzer (2019).) Thus, we split all observed votes into a training set (for model fitting) and a test set (for evaluation). How we split votes into a training and test set is important: in Representative Decisions we aim to predict future decisions from a known pool of voters, so both the training and test set should contain votes from each voter. In Population Modeling we aim to predict future decisions from the entire voter population, so the training set should contain only some votes from some voters (i.e., “training” voters), while the test set should contain the remaining votes from training voters, and all responses from the non-training voters.
VMixture Model. We first learn a best-fit indecision (sub)model for each training voter; the overall model generates responses by first selecting a training voter uniformly at random, and then responding according to their submodel.
$k$-Mixture Model. This model consists of $k$ submodels, each of which is an indecision model with its own utility vector and threshold. The type of each submodel (Min/Max-$\delta$, Min/Max-U, Dom) is itself a categorical variable. Weight parameters $\omega_1, \dots, \omega_k$ indicate the importance of each submodel. This model votes by selecting a submodel from the softmax distribution on these weights, and responds according to the chosen submodel. (With the softmax distribution, the probability of selecting submodel $m$ is $e^{\omega_m} / \sum_{m'=1}^{k} e^{\omega_{m'}}$. We use this distribution for mathematical convenience, though it is straightforward to learn the distribution directly.)
$k$-Min-$\delta$ Mixture. This model is equivalent to $k$-Mixture, however all submodels are of type Min-$\delta$. We include this model since Min-$\delta$ is the most-common best-fit indecision model for individual participants (see § Study 2: Fitting Indecision Models).
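How a mixture of submodels produces a vote can be sketched as follows. The example assumes each submodel is already summarized by its response distribution for a given query; the function names and the response codes (0 for indecision, 1 and 2 for strict responses) are hypothetical.

```python
import math


def softmax(weights):
    """Softmax of a list of real-valued weights."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_response_distribution(submodel_dists, weights):
    """Response distribution of a mixture that first picks a submodel from
    softmax(weights) and then responds according to that submodel."""
    pick_probs = softmax(weights)
    responses = list(submodel_dists[0])
    return {
        r: sum(p * dist[r] for p, dist in zip(pick_probs, submodel_dists))
        for r in responses
    }


# Two hypothetical submodels: one always indecisive, one always choosing item 1.
dists = [{0: 1.0, 1: 0.0, 2: 0.0}, {0: 0.0, 1: 1.0, 2: 0.0}]
equal_mix = mixture_response_distribution(dists, [0.0, 0.0])
print(equal_mix)
```

Setting all weights equal makes the selection uniform over submodels, which corresponds to choosing a fitted voter submodel uniformly at random.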
We simulate both the Population Modeling and Representative Decisions settings using various train/test splits of our survey data. For Population Modeling we randomly select 100 training voters; half of each training voter’s responses are added to the test set, and half to the training set. All responses from non-training voters are added to the test set. (Each voter in our data answers different questions, so all questions in the test set are “new.”)
For Representative Decisions we randomly select 20 training voters (“representatives”), and randomly select half of each voter’s responses for testing; all other responses are used for training; all non-training voters are ignored.
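The two splitting schemes can be sketched as a single function. This is an illustrative reading of the description above, not the authors' code; the function name `split_votes`, the `setting` labels, and the data layout (a dict from voter id to a list of votes) are assumptions.

```python
import random


def split_votes(votes_by_voter, n_training_voters, setting, rng):
    """Split votes into (train, test) lists of (voter, vote) pairs.

    setting="population": non-training voters' votes all go to the test set.
    setting="representatives": non-training voters are ignored.
    """
    if setting not in ("population", "representatives"):
        raise ValueError("unknown setting: " + setting)
    voters = sorted(votes_by_voter)
    training_voters = set(rng.sample(voters, n_training_voters))
    train, test = [], []
    for voter in voters:
        votes = list(votes_by_voter[voter])
        if voter in training_voters:
            rng.shuffle(votes)
            half = len(votes) // 2
            train += [(voter, v) for v in votes[:half]]  # half for training
            test += [(voter, v) for v in votes[half:]]   # the rest for testing
        elif setting == "population":
            test += [(voter, v) for v in votes]
    return train, test


# Toy data: 10 voters with 4 votes each (made-up labels).
toy_votes = {v: ["q%d_%d" % (v, k) for k in range(4)] for v in range(10)}
pop_train, pop_test = split_votes(toy_votes, 3, "population", random.Random(0))
rep_train, rep_test = split_votes(toy_votes, 3, "representatives", random.Random(0))
print(len(pop_train), len(pop_test), len(rep_train), len(rep_test))
```

The only difference between the two settings is what happens to voters outside the training pool, which is what makes the Population test set a measure of generalization to unseen voters.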
|Model Name||Representatives (20)||Population (100)|
For both of these settings we fit all mixture models ($k$-Mixture, $k$-Min-$\delta$, and VMixture), each individual indecision model from Section Indecision Model Formalism, and each baseline model. Table 2 shows the training-set and test-set LL for each method, for both voting paradigms. Most indecision models achieve similar test-set LL, with the exception of Dom. In the Representatives setting, both mixture models and (non-mixture) indecision models perform well (notably, better than MLP). This is somewhat expected, as the Representatives setting uses very little training data, and complex ML approaches such as MLP are prone to overfitting; this is certainly the case in our experiments. In the Population setting the mixture models outperform individual indecision models; this is expected, as these mixture models have a strictly larger hypothesis class than any individual model. Unsurprisingly, MLP achieves the greatest test-set LL in the Population setting, yet provides no insight as to how these decisions are made.
In many cases it is natural to feel indecisive, for example when voting in an election or buying a new car; people are especially indecisive when their choices have moral consequences. Importantly, there are many possible causes for indecision, and each conveys different meaning: I may be indecisive when voting for a presidential candidate because I feel unqualified to vote; I may be indecisive when buying a car because all options seem too similar. Using a small study, in Section Study 1: Indecision is Not Random Choice we demonstrate that indecision cannot be interpreted as “flipping a coin” to decide between alternatives. This violates a key assumption in the technical literature, and it complicates the task of selecting the best alternative for an individual or group. Indeed, defining the “best” alternative for indecisive agents depends on what indecision means.
These philosophical and psychological questions have become critical to computer science researchers, since we now use preference modeling and social choice to guide deployed AI systems. The indecision models we develop in Section Indecision Model Formalism and test in Section Study 2: Fitting Indecision Models provide a framework for understanding why people are indecisive—and how indecision may influence expressed preferences when people are allowed to be indecisive (§ Mathematical Indecision Models), and when they are required to express strict preferences (§ Indecision Models for Strict Comparisons). The datasets collected in Study 1 (§ Study 1: Indecision is Not Random Choice) and Study 2 (§ Study 2: Fitting Indecision Models) provide some insight into the causes for indecision, and we believe other researchers will uncover more insights from this data in the future.
Several questions remain for future work. First, what are the causes for indecision, and what meaning do they convey? This question is well-studied in the philosophy and social science literature, and AI researchers would benefit from interdisciplinary collaboration. Methods for preference elicitation Blum et al. (2004) and active learning Freund et al. (1997) may be useful here. Second, if indecision has meaning beyond the desire to “flip a coin”, then what is the best outcome for an indecisive agent, or for a group of indecisive agents? This might be seen as a problem of winner determination, from a perspective of social choice Pini et al. (2011).
Dickerson and McElfresh were supported in part by NSF CAREER IIS-1846237, NSF CCF-1852352, NSF D-ISN #2039862, NIST MSE #20126334, NIH R01 NLM-013039-01, DARPA GARD #HR00112020007, DoD WHS #HQ003420F0035, DARPA Disruptioneering (SI3-CMD) #S4761, and a Google Faculty Research Award. Conitzer was supported in part by NSF IIS-1814056. This publication was made possible through the support of a grant (TWCF0321) from Templeton World Charity Foundation, Inc. to Conitzer, Schaich Borg, and Sinnott-Armstrong. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of Templeton World Charity Foundation, Inc.
Many AI systems are designed and constructed with the goal of promoting the interests, values, and preferences of users and stakeholders who are affected by the AI systems. Such systems are deployed to make or guide important decisions in a wide variety of contexts, including medicine, law, business, transportation, and the military. When these systems go wrong, they can cause irreparable harm and injustice. There is, thus, a strong moral imperative to determine which AI systems best serve the interests, values, and preferences of those who are or might be affected.
To satisfy this imperative, designers of AI systems need to know what affected parties really want and value. Most surveys and experiments that attempt to answer this question study decisions between two options without giving participants any chance to refuse to decide or to adopt a random method, such as flipping a coin. Our studies show that these common methods are inadequate, because providing this third option (which we call indecision) changes the preferences that participants express in their behavior. Our results also suggest that people often decide to use a random method for a variety of reasons captured by the models we studied. Thus, we need to use these more complex methods, that is, to allow indecision, in order to discover and design AI systems to serve what people really value and see as morally permitted. That lesson is the first ethical implication of our research.
Our paper also teaches important ethical lessons regarding justice in data collection. It has been shown that biases can be, and are, introduced at the level of data collection. Our results open the door to the suggestion that biases could be introduced when a participant’s values are elicited under the assumption of a strict preference. Consider a simple case of choosing between two potential kidney recipients, A and B, who are identical in all aspects, except A has 3 drinks a week while B has 4. Throughout our studies, we have consistently observed that participants would overwhelmingly give the kidney to patient A, who has 1 fewer drink each week, when forced to choose between them. However, when given the option to do so, most would rather flip a coin. An argument can be made here that the data collection mechanism under the strict-preference assumption is biased against patient B and others who drink more than average.
Finally, our studies also have significant relevance to randomness as a means of achieving fairness in algorithms. As our participants were asked to make moral decisions regarding who should get the kidney, one interpretation of their decisions to flip a coin is that the fair thing to do is often to flip a coin so that they (and humans in general) do not have to make an arbitrary decision. The modeling techniques proposed here differ from the approach to fairness that conceives random decisions as guaranteeing equity in the distribution of resources. Our findings about model fit suggest that humans sometimes employ random methods largely in order to avoid making a difficult decision (and perhaps also in order to avoid personal responsibility). If our techniques are applied to additional problems, they will further the discussion of algorithmic fairness by emphasizing the role of randomness and indecision. This advance can improve the ability of AI systems to serve their purposes within moral constraints.
Experiment Scenario: Organ Allocation. Our experiments focus on a hypothetical scenario involving the allocation of scarce donor organs. We use organ allocation since it is a real, ethically fraught problem, which often involves AI or other algorithmic guidance. However, our hypothetical organ allocation, and our survey experiments, are not intended to reflect the many ethical and logistical challenges of organ transplantation; these issues are settled by medical experts and policymakers. Our experiments do not focus on a realistic organ allocation scenario, and our results should not be interpreted as guidance for transplantation policy.
- Equilibrium allocations under alternative waitlist designs: evidence from deceased donor kidneys. Working Paper Technical Report 25607, Working Paper Series, National Bureau of Economic Research.
- Dilemmas: dithering, plumping, and grief. In Moral Dilemmas and Moral Theory, H. E. Mason (Ed.), pp. 127.
- Preference elicitation and query learning. Journal of Machine Learning Research 5 (Jun), pp. 649–667.
- Handbook of computational social choice. Cambridge University Press.
- The possibility of parity. Ethics 112 (4), pp. 659–688.
- Moral decision making frameworks for artificial intelligence. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 4831–4835.
- Ordering effects and choice set awareness in repeat-response stated preference studies. Journal of Environmental Economics and Management 63 (1), pp. 73–91.
- Designing choice sets for stated preference methods: the effects of complexity on choice consistency. Journal of Environmental Economics and Management 44 (1), pp. 123–143.
- Consistency in rationalist moral systems. Journal of Philosophy 81 (6), pp. 291–309.
- Conventional machine learning for social choice. In AAAI, pp. 858–864.
- Adapting a kidney exchange algorithm to align with human values. Artificial Intelligence, pp. 103261.
- Selective sampling using the query by committee algorithm. Machine Learning 28 (2-3), pp. 133–168.
- Decisions concerning the allocation of scarce medical resources. Journal of Social Behavior and Personality 15 (2), pp. 185.
- The allocation of scarce medical resources across medical conditions. Psychology and Psychotherapy: Theory, Research and Practice 75 (2), pp. 189–203.
- Moral choices: the influence of the do not play god principle. In Proceedings of the 35th Annual Meeting of the Cognitive Science Society, Cooperative Minds: Social Interaction and Group Dynamics, pp. 2973–2977.
- Indecisiveness, undesirability and overload revealed through rational choice deferral. The Economic Journal 128 (614), pp. 2450–2479.
- Moral thinking: its levels, method, and point. Oxford: Oxford University Press.
- Statistical foundations of virtual democracy. In International Conference on Machine Learning, pp. 3173–3182.
- Choosing to avoid: coping with negatively emotion-laden consumer decisions. Journal of Consumer Research 24 (4), pp. 409–433.
- Fairness in deceased organ matching. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 236–242.
- Balancing lexicographic fairness and a utilitarian objective with application to kidney exchange. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- Moral dilemmas. Philosophy and Phenomenological Research 50 (n/a), pp. 367–382.
- Single-option aversion. Journal of Consumer Research 40 (3), pp. 555–566.
- A voting-based system for ethical decision making. AAAI 2018.
- Public, medical professionals’ and patients’ preferences for the allocation of donor organs for transplantation: study protocol for discrete choice experiments. BMJ Open 8 (10), pp. e026040.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- Incompleteness and incomparability in preference aggregation: complexity results. Artificial Intelligence 175 (7-8), pp. 1272–1289.
- Pluralism, determinacy, and dilemma. Ethics 102 (4), pp. 720–742.
- The ethics and reality of rationing in medicine. Chest 140 (6), pp. 1625–1632.
- Game theory in economics: chapter 4, preferences and utility. RAND Corporation, Santa Monica, CA.
- Moral dilemmas. Blackwell.
- On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 7 (4), pp. 784–802.
- Preference elicitation for general random utility models. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 596–605.
- Choice under conflict: the dynamics of deferred decision. Psychological Science 3 (6), pp. 358–361.
- “To choose or not to choose”: on choice strategy in face of a single alternative. The American Journal of Psychology, pp. 373–389.
- A PAC framework for aggregating agents’ judgments. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2237–2244.
Appendix A Online Survey Experiments
This appendix describes our survey experiments in greater detail. Section A.1 describes the online platform we used for this survey, Section A.2 describes Study 1 and our analysis, and Section A.3 describes the design of Study 2.
A.1 Online Platform
All online experiments were conducted using a custom online survey platform. After agreeing to an online consent form, participants were shown background information on kidney allocation and about the patient features in this survey, shown below:
Sometimes people with certain diseases or injuries require a kidney transplant. If they don’t have a biologically compatible friend or family member who is willing to donate a kidney to them, they must wait to receive a kidney from a stranger.
Choose which of two patients should receive a sole available kidney. Information about Patient A will always be on the left. Information about Patient B will always be on the right. The characteristics of each patient will change in each trial. Patients who do receive the kidney will undergo an operation that is almost always successful. Patients who do not receive the kidney will remain on dialysis and are likely to die within a year.
After completing an online consent form, participants were asked to respond to a series of comparisons between two potential kidney recipients. Each recipient is represented by three features: “number of child dependent(s)”, “years old”, and “drinks per day prediagnosis.” Figure 2 shows a screenshot of the decision scenario.
All participants were recruited on Amazon Mechanical Turk (MTurk; https://www.mturk.com/). We included only participants in the United States, who have completed more than 500 HITs, with HIT approval rate at least , and who have not participated in any previous studies for this project.
A.2 Study 1
We recruited 120 participants via MTurk. One participant was excluded from the cohort due to incompleteness, leaving a sample of N=119 (32% female and 68% male; mean age = 35.2, SD = 10.12; 82% white), with N=60 for group Indecisive and N=59 for group Strict. On our online platform, both groups were asked to make decisions on a set of 15 pairs of hypothetical patients, whose features were determined a priori. Both groups were given the same sequence of scenarios; the features of each patient in these scenarios are included in our dataset (included in the supplement and online, see below).
The Indecisive group was given the additional option to flip a coin, instead of choosing one of the two patients.
Anonymized responses from Study 1 are available online (link removed for blind review) and are included in this paper’s online supplement.
Study 1 Analysis: Hypothesis Testing
For this analysis we refer to each strict response as a “vote”. For example, if a participant expresses a preference for one patient over the other, we say this is a vote for the preferred patient. To test hypotheses H0-1 and H0-2 we first identify the majority patient (the patient who received more votes than the other patient); the other patient is referred to as the minority patient. Coincidentally, the majority and minority patients were the same for both groups, Indecisive and Strict. Table 3 shows the number of votes for the minority and majority patient for each question, for both groups.
|Q#||Group Indecisive||Group Strict|
A.3 Study 2
We first recruited 150 participants using MTurk for the Indecisive group. Each participant was assigned a randomly generated sequence of 40 pairs of hypothetical patients, and they were presented with the option to either give a kidney to one of the patients, or flip a coin (see Figure 2). Patient features were generated uniformly at random from the ranges:
# dependents: 0, 1, 2
age: 25, …, 70
# drinks: 1, 2, 3, 4, 5
In addition, 3 or 4 attention-check pairs, in which the participant is presented with the choice between an already deceased patient and a “favorable” patient (a 30-year-old patient who consumed 1 alcoholic drink per week, with 2 dependents), were randomly distributed in each sequence. After data collection, 18 participants were excluded for failing at least one attention check, i.e., choosing to give the kidney to the deceased patient. This leaves us with N=132 participants (age distribution was 31%: 18-29, 48%: 30-39, 10%: 40-49, 6%: 50-59, 3%: 60+; gender distribution was 29%: female, 70%: male; racial distribution was 75%: white, 25% nonwhite).
Next we recruited 153 participants for group Strict; these participants were given the exact same task as the Indecisive group, but without the option to flip a coin. 21 participants were excluded from the analysis due to attention check failures, leaving us with a final sample of N=132 (age distribution was 26%: 18-29, 46%: 30-39, 17%: 40-49, 10%: 50-59, 2%: 60+; gender distribution was 36%: female, 63%: male; racial distribution was 72%: white, 28% nonwhite).
Anonymized responses from Study 2 are available online (link removed for blind review) and are included in this paper’s online supplement.
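The uniform sampling of patient features used to build each participant's sequence in Study 2 can be sketched as follows. This is an illustrative sketch only: the feature names, the assumption that ages are integers from 25 to 70 inclusive, and the helper names are not taken from the survey platform, and attention-check pairs are deliberately left out.

```python
import random

# Feature ranges as listed above; integer ages are an assumption.
FEATURE_RANGES = {
    "dependents": [0, 1, 2],
    "age": list(range(25, 71)),
    "drinks": [1, 2, 3, 4, 5],
}


def random_patient(rng):
    """Draw each feature independently and uniformly from its range."""
    return {name: rng.choice(values) for name, values in FEATURE_RANGES.items()}


def random_sequence(n_pairs=40, seed=0):
    """Generate a sequence of patient pairs (hypothetical helper)."""
    rng = random.Random(seed)
    return [(random_patient(rng), random_patient(rng)) for _ in range(n_pairs)]
```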
Appendix B Fitting Indecision Models
In this appendix we provide additional details on the indecision models from Section Indecision Model Formalism, as well as details of our computational experiments.
First, in B.1 we motivate the score-based decision models (from Section Indecision Model Formalism) using an intuitive—and equivalent—representation as response functions. In B.2 we provide additional motivation for the strict response model from Section Indecision Models for Strict Comparisons. In B.3 we provide additional details on our group indecision models. Finally, in B.4 we describe the implementation of our computational experiments.
B.1 Response Functions vs Score-Based Models
In the score-based models from Section Indecision Model Formalism, the agent responds by evaluating a “score” for each possible response. Here we provide an intuitive motivation for each of these indecision models, framed as response functions. As in Section Indecision Model Formalism, an agent response function maps a pair of items (a comparison question) to a response. In this section, all agent response functions are expressed in terms of the agent utility function and threshold . Each response function identifies a set of feasible responses for the agent, which depend on the agent utility function and threshold. If there are multiple feasible responses, the agent chooses one uniformly at random. Importantly, we show below that these response functions are identical to the score-based response functions for models in Section Indecision Model Formalism, when the agent observes no “noise.”
Below we formalize each response function, grouped by their “causes” (see Section Models for Indecision).
We emphasize that each of these response “functions” is in fact a multifunction, as multiple responses may be possible.
Difference-Based Response Functions: Min-, Max- Here the agent is indecisive when the utility difference between alternatives is either smaller than threshold (Min-) or greater than (Max-). The corresponding response functions are
Note that in these definitions, multiple responses may be feasible (i.e., the conditions may be met for multiple responses). In this case, we assume the agent selects a feasible response uniformly at random. For example, for both models Min- and Max-, if then the agent selects a response randomly with either or .
In these models the agent response depends on the utility difference between and , (). Depending on how this utility difference compares with threshold , the agent may be indecisive. Since the agent is indecisive only when the absolute difference in item utility () is too large or too small, negative is not meaningful here—thus, we only consider .
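The difference-based response functions can be sketched as multifunctions that return a set of feasible responses. The exact inequalities are not reproduced in the text above, so the conditions below are an assumption chosen to match the verbal description: indecision is feasible when the absolute utility difference is at most the threshold (Min) or at least the threshold (Max). The names `lam`, `feasible_min_diff`, `feasible_max_diff`, and the integer response encoding are hypothetical.

```python
import random

# Hypothetical response encoding: 1 = prefer i, 2 = prefer j, 0 = indecision.
PREFER_I, PREFER_J, INDECISION = 1, 2, 0


def feasible_min_diff(u_i, u_j, lam):
    """Indecision is feasible when |u_i - u_j| <= lam (assumed form, lam >= 0)."""
    feasible = set()
    if u_i - u_j >= lam:
        feasible.add(PREFER_I)
    if u_j - u_i >= lam:
        feasible.add(PREFER_J)
    if abs(u_i - u_j) <= lam:
        feasible.add(INDECISION)
    return feasible


def feasible_max_diff(u_i, u_j, lam):
    """Indecision is feasible when |u_i - u_j| >= lam (assumed form, lam >= 0)."""
    diff = u_i - u_j
    feasible = set()
    if 0 <= diff <= lam:
        feasible.add(PREFER_I)
    if 0 <= -diff <= lam:
        feasible.add(PREFER_J)
    if abs(diff) >= lam:
        feasible.add(INDECISION)
    return feasible


def respond(feasible, rng=random):
    """Pick one feasible response uniformly at random."""
    return rng.choice(sorted(feasible))
```

When the utility difference equals the threshold exactly, both a strict response and indecision are feasible, and `respond` breaks the tie at random.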
Desirability-Based Models: Min-, Max- Here the agent is indecisive when the utility of both alternatives is below threshold (Min-), or when the utility of both alternatives is greater than (Max-). Unlike the difference-based models, here may be positive or negative. The response functions for these models are
As before, if there are multiple feasible responses, the agent selects one feasible response uniformly at random.
Unlike the difference-based models, both positive and negative are reasonable here. For example: suppose an agent is only indecisive when both alternatives are very undesirable (e.g., both items have utility less than ). This agent’s decisions might be best modeled by Min-, with .
Conflict-Based Model: Dom Here the agent is indecisive unless one alternative dominates the other in all features, by threshold at least . For this indecision model, we need a utility measure associated with each feature of each item; for this purpose, let be the utility associated with feature of item . As before we assume may be positive or negative. The response function for this model is
where and . In other words, is the minimum difference between the feature utilities of and : if is positive, then all features of alternative are strictly better than those of . If neither nor “dominates” the other by at least threshold , then the agent is indecisive. As before, the agent selects uniformly at random from all feasible responses.
While these response functions appear qualitatively different from the score functions in Section Indecision Model Formalism, they are in fact identical under certain circumstances.
For each indecision model (Min-, Max-, Min-, Max-, Dom), the response function given in Appendix B.1 is identical to the response function induced by score functions and as in Section Indecision Model Formalism, when the agent observes no score error. This score-induced response function is expressed as
where if multiple scores are maximal (i.e., the corresponding response is feasible), the agent selects a response with maximal score uniformly at random.
We prove equivalence for each indecision model separately. Note that, if both response functions and have the same set of feasible responses for a given comparison , then these responses are identical, since both response functions choose a feasible response uniformly at random. Thus, we prove that the set of feasible responses is the same for both and , for an arbitrary comparison .
For score-based response function , response is feasible if the following conditions are met
where the left and right side are equivalent. Note that the right side conditions are equivalent to the conditions for response in , since is positive. Note that the same argument holds for response .
Next, for score-based response function , response is feasible if the following conditions are met
and these conditions are equivalent to , since is positive. This condition is equivalent to the condition for response in .
For score-based response function , response is feasible if the following conditions are met
Note that the first constraint right side reduces to ; thus, these conditions are equivalent to the conditions for response in . Note that the same argument holds for response .
Next, for score-based response function , response is feasible if the following conditions are met
There are two cases: (1) if , then the first condition on the right side reduces to , and the second condition on the right side holds trivially; (2) if , then the second condition on the right side reduces to , and the first condition on the right side holds trivially. In both cases, these conditions are equivalent to the conditions for response in .
For score-based response function , response is feasible if the following conditions are met
where the right-side conditions reduce to , which is equivalent to the condition for response in . Note that the same argument holds for response .
Next, for score-based response function , response is feasible if the following conditions are met
which is equivalent to , the condition for response in .
For score-based response function , response is feasible if the following conditions are met
The first condition on the right side reduces to ; thus, the right side conditions are equivalent to , which is the condition for response in function . Note that the same argument holds for response .
Next, for score-based response function , response is feasible if the following conditions are met
There are two cases: (1) if , then the first condition on the right side reduces to , and the second condition on the right side reduces to (which holds trivially); (2) if , then the second condition reduces to (and the first condition holds trivially). In both cases, these conditions are equivalent to , which is the condition for response in .
This proof is identical to that of Min-: let and be replaced by and , respectively, and the proof is identical.
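The equivalence argument can also be checked numerically for one model. The sketch below compares the feasible sets of a difference-based response function with those induced by maximizing scores, over a small grid of utilities and non-negative thresholds. Both the response-function inequalities and the score functions (utility difference for strict responses, the threshold itself for indecision) are assumptions made for this illustration; all names are hypothetical.

```python
from itertools import product

# Hypothetical response encoding: 1 = prefer i, 2 = prefer j, 0 = indecision.
PREFER_I, PREFER_J, INDECISION = 1, 2, 0


def feasible_from_response_function(u_i, u_j, lam):
    """Assumed Min-difference response function (lam >= 0)."""
    feasible = set()
    if u_i - u_j >= lam:
        feasible.add(PREFER_I)
    if u_j - u_i >= lam:
        feasible.add(PREFER_J)
    if abs(u_i - u_j) <= lam:
        feasible.add(INDECISION)
    return feasible


def feasible_from_scores(u_i, u_j, lam):
    """Responses whose (assumed) noise-free score is maximal."""
    scores = {PREFER_I: u_i - u_j, PREFER_J: u_j - u_i, INDECISION: lam}
    best = max(scores.values())
    return {response for response, score in scores.items() if score == best}


def check_equivalence(utilities, thresholds):
    """True if both constructions agree on every grid point."""
    return all(
        feasible_from_response_function(a, b, lam) == feasible_from_scores(a, b, lam)
        for a, b, lam in product(utilities, utilities, thresholds)
    )


# Multiples of 0.25 are exactly representable, so equality tests are safe.
utilities = [k * 0.25 for k in range(-8, 9)]
thresholds = [k * 0.25 for k in range(0, 9)]
```

Calling `check_equivalence(utilities, thresholds)` exercises ties as well as strict cases, since the grid includes points where the utility difference equals the threshold.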
B.2 Strict Decision Models
In Section Indecision Models for Strict Comparisons we describe how indecision models can be used to model scenarios where an indecisive agent is required to express a strict preference. Here we assume that the agent uses a two-step process to respond, represented in Figure 3.
If the agent’s coin flip is “heads” (with probability ), then the agent draws from a strict version of their response distribution, defined as
for . Note that this is similar to the agent’s true response distribution (Equation 1), but assigns zero probability to response .
The overall response distribution described in Figure 3 has a closed-form expression, since the probability- coin flip is independent from each draw of the agent’s decision function. As stated in Section Indecision Models for Strict Comparisons, this distribution is
where, , and . The (heads) condition from above has another interpretation: the agent chooses to sample from a “strict” logit, induced by only the score functions for strict responses, and . We discuss this model in more detail, and provide an intuitive example, in Appendix B.
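The "strict" logit mentioned above can be sketched as a softmax restricted to the two strict responses. This assumes that the agent's response distribution (Equation 1) is a logit over the three response scores, which is suggested by the text but not shown here; the function names and response encoding are hypothetical, and the sketch covers only the strict distribution, not the full two-step process of Figure 3.

```python
import math

# Hypothetical response encoding: 1 = prefer i, 2 = prefer j, 0 = indecision.
PREFER_I, PREFER_J, INDECISION = 1, 2, 0


def logit_distribution(scores):
    """Softmax over response scores (assumed form of the response distribution)."""
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {r: math.exp(s - top) for r, s in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def strict_logit_distribution(scores):
    """Softmax over strict responses only; indecision gets probability zero."""
    strict_scores = {r: s for r, s in scores.items() if r != INDECISION}
    distribution = logit_distribution(strict_scores)
    distribution[INDECISION] = 0.0
    return distribution
```

Dropping the indecision score before normalizing means the relative odds of the two strict responses are unchanged, while their probabilities are rescaled to sum to one.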
B.3 Group Decision Models
Here we outline the mathematical group decision models from Section Group Models.
A set of observed responses is represented by vectors , , , where and are the indices of items and in query , and is the observed agent’s response.
This model is parameterized only by the best-fit models for each of its constituent voters. Let be the number of voters, and let be the best-fit score function for voter and response . Since we take an MLE approach, the goodness-of-fit metric for these models is the log-likelihood of the model, given observed responses.
The log-likelihood for model VMixture is
is the response distribution for voter .
This model class is parameterized by distinct sets of submodel parameters: each submodel consists of a utility vector and threshold ; the type of each model is also a variable (i.e., a categorical variable). Weight parameters indicate the importance of each submodel. Let be the score function for model and response ; these score functions depend on the type of each model (see Section Indecision Model Formalism). For the -Mixture model, the log-likelihood is
is the response distribution for model .
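A mixture log-likelihood of this kind can be sketched as below. The sketch assumes each observed response is drawn from a weighted mixture of submodel response distributions, with weights normalized to sum to one; equal weights over per-voter best-fit models would then correspond to a voter mixture. This functional form, and every name in the code, is an assumption for illustration rather than the paper's exact expression.

```python
import math


def mixture_log_likelihood(observations, submodels, weights):
    """Log-likelihood of observed responses under a weighted mixture.

    observations: list of (i, j, response) tuples.
    submodels: callables mapping (i, j) to a dict {response: probability}.
    weights: non-negative submodel weights (normalized inside).
    """
    total_weight = sum(weights)
    log_likelihood = 0.0
    for i, j, response in observations:
        probability = sum(
            (weight / total_weight) * model(i, j)[response]
            for model, weight in zip(submodels, weights)
        )
        log_likelihood += math.log(probability)
    return log_likelihood


# Toy submodels with fixed response distributions (0 = indecision).
def toy_model_a(i, j):
    return {0: 0.2, 1: 0.5, 2: 0.3}


def toy_model_b(i, j):
    return {0: 0.6, 1: 0.2, 2: 0.2}
```

With equal weights, an observed response of 1 has mixture probability 0.35 under the two toy submodels, and an observed indecision has probability 0.4.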
B.4 Experiments and Implementation
All code used for our computational experiments is available online (https://github.com/duncanmcelfresh/indecision-modeling) and attached in our supplementary material. All code is written in Python 3.7, and uses the Ax package (https://ax.dev/) for random sampling. All experiments were run on a single Intel Xeon E5-2690 node with 16GB memory.
For all experiments, models were fit by sampling several random parameter sets using a Sobol process (implemented using Ax). Each model is “trained” using a different number of random Sobol points in our experiments:
Individual indecision models (Table 1): 1,000 points for Indecisive, and 5,000 for Strict (which uses an additional parameter ).
Group indecision models (Table 2, models Min-, Max-, Min-, Max-, Dom, Logit): 5,000 points
VMixture: 500 points for group Indecisive and 1,000 points for Strict, for each individual model.
-Mixture, -Min-: 100,000 points
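The fitting procedure above amounts to a quasi-random search: sample many parameter sets, evaluate the log-likelihood of each, and keep the best. The sketch below illustrates that loop using only the standard library. Note that it substitutes a Halton sequence for the Sobol process used in the experiments, purely to keep the example self-contained; the function names, the bounds format, and the toy objective are all assumptions.

```python
def van_der_corput(index, base):
    """Radical-inverse of a positive integer in the given base, in [0, 1)."""
    result, denominator = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denominator *= base
        result += digit / denominator
    return result


def halton_points(n_points, bounds, primes=(2, 3, 5, 7, 11, 13)):
    """Low-discrepancy points scaled to per-dimension (low, high) bounds.

    A Halton sequence is used here as a stand-in for Sobol sampling.
    """
    points = []
    for k in range(1, n_points + 1):
        unit = [van_der_corput(k, primes[d]) for d in range(len(bounds))]
        points.append([low + x * (high - low) for x, (low, high) in zip(unit, bounds)])
    return points


def fit_by_sampling(log_likelihood, bounds, n_points):
    """Return the sampled parameter set with the greatest log-likelihood."""
    best_params, best_ll = None, float("-inf")
    for params in halton_points(n_points, bounds):
        ll = log_likelihood(params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll


def toy_objective(params):
    """Stand-in for a model log-likelihood, maximized at (0.3, -0.5)."""
    x, y = params
    return -((x - 0.3) ** 2) - ((y + 0.5) ** 2)
```

A search such as `fit_by_sampling(toy_objective, [(-1, 1), (-1, 1)], 1000)` returns the best sampled point, and using more points (as for the larger mixture models) trades computation for a finer search of the parameter space.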