Ask Not What AI Can Do, But What AI Should Do: Towards a Framework of Task Delegability

02/08/2019 · by Brian Lubars, et al. · University of Colorado Boulder

Although artificial intelligence holds promise for addressing societal challenges, issues of exactly which tasks to automate and to what extent remain understudied. We approach the problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to artificial intelligence. We consider four high-level factors that can contribute to a delegation decision: motivation, difficulty, risk, and trust. To obtain an empirical understanding of human preferences in different tasks, we build a dataset of 100 tasks from academic papers, popular media portrayals of AI, and everyday life. For each task, we administer a survey to collect judgments of each factor and ask subjects to pick the extent to which they prefer AI involvement. We find little preference for full AI control and a strong preference for machine-in-the-loop designs, in which humans play the leading role. Our framework can effectively predict human preferences in degrees of AI assistance. Among the four factors, trust is the most predictive of human preferences for optimal human-machine delegation. This framework represents a first step towards characterizing human preferences of automation across tasks. We hope this work may encourage and aid future efforts towards understanding such individual attitudes; our goal is to inform the public and the AI research community rather than to dictate any direction in technology development.


1. Introduction

Recent developments in machine learning have led to significant excitement about the promise of artificial intelligence.

Ng (2017) claims that “artificial intelligence is the new electricity.” Artificial intelligence indeed approaches or even outperforms human-level performance in critical domains such as hiring, medical diagnosis, and judicial systems (Cheng et al., 2016; Erel et al., 2018; Fakoor et al., 2013; Kleinberg et al., 2018; Litjens et al., 2016). However, we also observe growing concerns about which problems artificial intelligence is applied to. For instance, a recent study used deep learning to predict sexual orientation from images (Wang and Kosinski, 2018). The study caused controversy (Murphy, 2017; Mezzofiore, 2017): GLAAD and the Human Rights Campaign denounced it as “junk science” that “threatens the safety and privacy of LGBTQ and non-LGBTQ people alike” (Anderson, 2017). More generally, researchers worry about the impact on jobs and the future of employment (Frey and Osborne, 2013; Schwab, 2017; Susskind and Susskind, 2015).

Such excitement and concern raise a fundamental question at the interface of artificial intelligence and human society: which tasks should be delegated to AI, and in what way?

To answer this question, we need to at least consider two dimensions. The first dimension is capability. Machines may excel at some tasks, but struggle at others; this area has been widely explored since Fitts first tackled function allocation in 1951 (Fitts et al., 1951; Parasuraman et al., 2000; Price, 1985). The goal of AI research has traditionally focused on pushing the ability boundary of machines and exploring what AI can do.

The second dimension is human preferences, i.e., what role humans would like AI to play. The automation of some tasks is celebrated, while others should arguably not be automated for reasons beyond capability alone. For instance, automated civil surveillance may be disastrous on ethical, privacy, or legal grounds. Motivation may be another reason: no matter how good machines get at writing novels, it is unlikely that an aspiring writer will derive the same satisfaction or value from delegating their writing to an automated system. Despite the clear importance of understanding human preferences, the question of what AI should do remains understudied in artificial intelligence research.

In this work, we present the first empirical study to understand how different factors relate to human preferences of delegability, i.e., to what extent AI should be involved. Our contributions are threefold. First, building on prior literature on function allocation, mixed-initiative systems, and trust and reliance on machines (Lee and See, 2004; Parasuraman et al., 2000; Horvitz, 1999, inter alia), we develop a framework of four factors — motivation, difficulty, risk, and trust — to explain task delegability. Second, we construct a dataset of diverse tasks ranging from those found in academic research to ones that people routinely perform in daily life. Third, we conduct a survey to solicit human evaluations of the four factors and delegation preferences and validate the effectiveness of our framework.

Organization and highlights. We first summarize related work. We then explain our framework’s four factors and their relation to task delegability, and present survey questions to measure each factor. We next describe our construction of a task database and survey collection procedures from Amazon Mechanical Turk.

Based on the survey responses, we find that participants seldom prefer full automation. In fact, upon averaging responses for each task, we find full automation is preferred for none. Finally, we use a prediction task to examine connections between our framework and human preferences of task delegability. Our classifiers clearly outperform the random baseline, demonstrating the effectiveness of our framework. Our results align with existing literature, showing trust as the most predictive factor of task delegability preferences.

As AI grows further embedded in human society, human preference becomes an increasingly important dimension. Our study contributes towards a framework of task delegability and an evolving reference database of tasks and their associated human preferences. Our dataset is released at https://delegability.github.io.

2. Related Work

Task allocation and delegation. Several studies have proposed theories of delegation in the context of general automation (Parasuraman et al., 2000; Price, 1985; Castelfranchi and Falcone, 1998; Milewski and Lewis, 2014). Function allocation examines how to best divide tasks based on human and machine abilities (Fitts et al., 1951; Parasuraman et al., 2000; Price, 1985). Castelfranchi and Falcone (1998) emphasize the role of risk, uncertainty, and trust in delegation, which we build on in developing our framework. Milewski and Lewis (2014) suggest that people may not want to delegate to machines in tasks characterized by low trust or low confidence, where automation is unnecessary, or where automation does not add to utility. In the context of jobs and their susceptibility to automation, Frey and Osborne (2013) find social, creative, and perception-manipulation requirements to be good prediction criteria for machine ability.

Parasuraman et al.’s Levels of Automation is the closest to our work (Parasuraman et al., 2000). However, their work is primarily concerned with performance-based criteria (e.g., capability, reliability, cost), while our primary interest involves human preferences.

Shared-control design paradigms. Many tasks are amenable to a mix of human and machine control. Mixed-initiative systems and collaborative control have gained traction over function allocation (Bradshaw et al., 2012; Horvitz, 1999), mainly through recognition that such systems can transform the task itself through conflicts and interdependence (Johnson et al., 2014).

We broadly split work on such systems into two categories; we find this split more flexible and practical for our application than the Levels of Automation. On one side we have human-in-the-loop ML designs, wherein humans assist machines: people handle edge cases, provide labels, and refine system outputs. Such designs are prevalent in applications from visual recognition to machine translation (Branson et al., 2010; Fails and Olsen Jr, 2003; Green et al., 2015).

Alternatively, a machine-in-the-loop paradigm places the human in the primary position of action and control while the machine assists. Examples include a creative-writing assistance system that generates contextual suggestions (Clark et al., 2018; Roemmele and Gordon, 2015) and a system that predicts situations in which people are likely to make judgment errors (Anderson et al., 2016). Even tasks that should not be automated may still benefit from machine assistance, especially when human performance is not the upper bound, as Kleinberg et al. (2018) found for judges' bail decisions.

Trust and reliance on machines. Finally, we consider the community’s interest in trust. As automation grows in complexity, complete understanding becomes impossible; trust serves as a proxy for rational decisions in the face of uncertainty, and appropriate use of technology becomes a concern (Lee and See, 2004). As such, calibration of trust continues to be a popular avenue of research (Lewandowsky et al., 2000; Gombolay et al., 2016). Lee and See (2004) identify three bases of trust in automation: performance, process, and purpose. Performance describes the automation’s ability to reliably achieve the operator’s goals. Process describes the inner workings of the automation; examples include dependability, integrity, and interpretability (in particular, interpretable ML has received significant interest (Ribeiro et al., 2018, 2016; Kim et al., 2016; Doshi-Velez and Kim, 2017)). Finally, purpose refers to the intent behind the automation and its alignment with the user’s goals.

3. A Framework for Task Delegability

To explain human preferences of task delegation to AI, we develop a framework with four factors: a person’s motivation in undertaking the task, their perception of the task’s difficulty, their perception of the risk associated with accomplishing the task, and finally their trust in the AI agent. We choose these factors because motivation, difficulty, and risk respectively cover why a person chooses to perform a task, the process of performing a task, and the outcome, while trust captures the interaction between the person and the AI. We now explain the four factors, situate them in literature, and present the statements used in the surveys to capture each component. Table 1 presents an overview.

Motivation. Motivation is an energizing factor that helps initiate, sustain, and regulate task-related actions by directing our attention towards goals or values (Locke, 2000; Latham and Pinder, 2005). Affective (emotional) and cognitive processes are thought to be collectively responsible for driving action, so we consider intrinsic motivation and goals as two components in motivation (Locke, 2000). Specifically we distinguish between learning goals and performance goals, as indicated by Goal Setting Theory (Locke and Latham, 2002). Finally, the expected utility of a task captures its value from a rational cost-benefit analysis perspective (Horvitz, 1999). Note that a task may be of high intrinsic motivation yet low utility, e.g., reading a novel. Specifically, we use the following statements to measure these motivation components in our surveys:

  1. Intrinsic motivation: I would feel motivated to perform this task, even without needing to; for example, it is fun, interesting, or meaningful to me.

  2. Goals: I am interested in learning how to master this task, not just in completing the task.

  3. Utility: I consider this task especially valuable or important; I would feel committed to completing this task because of the value it adds to my life or the lives of others.

Factor | Components
Motivation | Intrinsic motivation, goals, utility
Difficulty | Social skills, creativity, effort required, expertise required, human ability
Risk | Accountability, uncertainty, impact
Trust | Machine ability, interpretability, value alignment
Table 1. An overview of the four factors in our AI task delegability framework.

Difficulty. Difficulty is a subjective measure reflecting the cost of performing a task. For delegation, we frame difficulty as the interplay between task requirements and the ability of a person to meet those requirements. Some tasks are difficult because they are time-consuming or laborious; others, because of the required training or expertise. To differentiate the two, we include effort required and expertise required as components in difficulty. The third component, belief about abilities possessed, can also be thought of as task-specific self-confidence (also called self-efficacy (Bandura, 1989)) and has been empirically shown to predict allocation strategies between people and automation (Lee and Moray, 1994).

Additionally, we contextualize our difficulty measures with two specific skill requirements: the amount of creativity and social skills required. We choose these because they are considered more difficult for machines than for humans (Frey and Osborne, 2013).

  1. Social skills: This task requires social skills to complete.

  2. Creativity: This task requires creativity to complete.

  3. Effort: This task requires a great deal of time or effort to complete.

  4. Expertise: It takes significant training or expertise to be qualified for this task.

  5. (Perceived) human ability: I am confident in [my own/a qualified person’s] ability to complete this task. (We flip this axis in data analysis so that lower self-confidence indicates higher difficulty.)

Risk. Real-world tasks involve uncertainty and risk in accomplishing the task, so a rational decision on delegation involves more than just cost and benefit. Delegation amounts to a bet: a choice that weighs the probabilities of accomplishing the goal against the risks and costs of each agent (Castelfranchi and Falcone, 1998). Perkins et al. (2010) define risk practically as a “probability of harm or loss,” finding that people rely on automation less as the probability of mortality increases. Responsibility or accountability may play a role if delegation is seen as a way to share blame (Lewandowsky et al., 2000; Stout et al., 2014). We thus decompose risk into three components: personal accountability for the task outcome; uncertainty, or the probability of errors; and the scope of impact, or the cost or magnitude of those errors.

  1. Accountability: In the case of mistakes or failure on this task, someone needs to be held accountable.

  2. Uncertainty: A complex or unpredictable environment/situation is likely to cause this task to fail.

  3. Impact: Failure would result in a substantial negative impact on my life or the lives of others.

Trust. Trust captures how people deal with risk or uncertainty. We use Lee and See’s definition of trust as “the attitude that an agent will help achieve an individual’s goals in a situation characterized by uncertainty and vulnerability (Lee and See, 2004).” Trust is generally regarded as the most salient factor in reliance on automation (Lee and See, 2004; Lewandowsky et al., 2000). Here, we consider trust as a combination of perceived ability of the AI agent, agent interpretability (ability to explain itself), and perceived value alignment. Each of these corresponds to a component of trust in automation in Lee and See (2004): performance, process, and purpose.

  1. (Perceived) machine ability: I trust the AI agent’s ability to reliably complete the task.

  2. Interpretability: Understanding the reasons behind the AI agent’s actions is important for me to trust the AI agent on this task (e.g., explanations are necessary).

  3. Value alignment: I trust the AI agent’s actions to protect my interests and align with my values for this task.

Degree of delegation. We develop this framework of motivation, difficulty, risk, and trust to explain human preferences of delegation. To measure human preferences, we split the degree of delegation into the following four categories (a sketch of how a single response can be encoded follows the list):

  1. No AI assistance: the person does the task completely on their own. We refer to this mode as “human only” in this paper.

  2. The human leads and the AI assists: the person does the task mostly on their own, but the AI offers recommendations or help when appropriate (e.g., the human gets stuck or the AI sees possible mistakes). We refer to this mode as “machine in the loop” in this paper.

  3. The AI leads and the human assists: the AI performs the task, but asks the person for suggestions/confirmation when appropriate. We refer to this mode as “human in the loop” in this paper.

  4. Full AI automation: decisions and actions are made automatically by the AI once the task is assigned; no human involvement. We refer to this mode as “AI only”.
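To make the data concrete, the sketch below shows one possible way to represent a single survey response under this framework. The class, field, and component names are our own illustrative choices, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import Dict

# Delegation preference labels, ordered from least to most AI involvement.
DELEGATION_LEVELS = {
    1: "human only",
    2: "machine in the loop",  # human leads, AI assists
    3: "human in the loop",    # AI leads, human assists
    4: "AI only",
}

@dataclass
class SurveyResponse:
    """One participant's evaluation of one task (field names are illustrative)."""
    task: str
    # Component ratings on a 1-5 Likert scale, keyed by component name, e.g.
    # {"intrinsic_motivation": 4, "social_skills": 2, "machine_ability": 3, ...}
    components: Dict[str, int] = field(default_factory=dict)
    # Preferred degree of AI assistance: one of the keys in DELEGATION_LEVELS.
    delegation_label: int = 1
```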

Figure 1. Factors behind task delegability.

Figure 1 presents our expectation of how these four factors relate to delegability. Motivation describes how invested someone is in the task, including how much effort they are willing to expend, while difficulty determines the amount of effort the task requires. Therefore we expect difficulty and motivation to relate to each other: we hypothesize that people are more likely to delegate tasks which they find difficult (or have low confidence in their abilities), and less likely to delegate tasks which they are highly invested in. Risk reflects uncertainty and vulnerability in performing a task, the situational conditions necessary for trust to be salient (Lee and See, 2004). We thus expect risk to moderate trust. Finally, we hypothesize that the correlation between components within each factor should be greater than that across different factors: that is, factors should show coherence in component measurements.

4. A Task Dataset and Survey Design

To evaluate our framework empirically, we build a database of diverse tasks that cover settings ranging from academic research to daily life, and develop and administer a survey to gather perceptions of those tasks under our framework.

4.1. A Dataset of Tasks

We choose 100 tasks drawn from academic conferences, popular discussions in the media, well-known occupations, and mundane tasks people encounter in their everyday lives. These tasks are generally relevant to current AI research and discussion and present challenging delegability decisions with which to evaluate our framework. Examples of tasks from each of these sources can be found in Table 2. To additionally balance the variety of tasks chosen, we categorize them as art, creative, business, civic, entertainment, health, living, and social, and keep a minimum of 7 tasks in each category. Ideally, the reference set would cover the entire automation “task space”; our task set is intended as a reasonable starting point.

Since some tasks, e.g., medical diagnosis, require expertise, and since motivation does not apply if the subject does not personally perform the task, we develop two survey versions.

  • Personal survey. We include all four factors and ask participants “If you were to do the given (above) task, what level of AI/machine assistance would you prefer?”

  • Expert survey. We include only difficulty, risk, and trust, and ask participants “If you were to ask someone to complete the given (above) task, what level of AI/machine assistance would you prefer?”

Source | Task Example
Conferences | Identifying & flagging fake/deceptive news articles
Media | Interviewing job applicants and rating candidates
Occupations | Cutting, drying, and styling hair, similar to what a barber or hairstylist might do
Everyday life | Picking out and buying a birthday present for an acquaintance
Table 2. Example tasks from each source: academic conferences, media, occupations, and everyday life.

Following a general explanation, our survey begins by asking subjects for basic demographic information. Subjects are then presented a randomly-selected task from our database. They evaluate the task under each component in our framework (see Table 1) according to a five-point Likert scale. Finally, subjects select between four choices for the degree of AI assistance they would prefer for the task. As discussed previously, the choices are: Full Automation, AI leads and human assists, Human leads and AI assists, or No AI assistance. Note that subjects are not told which factor each question measures beyond the question text itself, and can choose the degree of AI assistance independently of our framework.

We administer this 5-minute survey on Amazon Mechanical Turk. To improve the quality of surveys, we require that participants have completed 200 HITs with at least a 99% acceptance rate and are from the United States. We additionally add two attention check questions to ensure participants read the survey carefully. Subjects are paid $0.80 upon completing the survey and passing the checks; otherwise the data is discarded. We record 1000 survey responses: 500 each for the personal and the expert versions, composed of 5 responses for each of the 100 tasks. We obtain a gender-balanced sample with 530 males, 466 females, and 4 identifying otherwise.

4.2. Survey Results

We present an overview of the survey responses by examining the preferred degree of AI assistance and the correlation between components in our four factors.

4.2.1. Distribution of Survey Responses

We consider two ways of analyzing survey responses: 1) treat each individual response as a data point, yielding 500 data points each in the personal and expert surveys; 2) average the responses for each task, resulting in 100 data points each in the two surveys.
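As a rough illustration (assuming the responses live in a pandas DataFrame with one row per response; the column names and values are our own hypothetical stand-ins), the two analysis modes differ only in whether responses are aggregated by task:

```python
import pandas as pd

# Hypothetical layout: one row per survey response.
responses = pd.DataFrame({
    "task": ["diagnose_cancer"] * 5 + ["pick_birthday_present"] * 5,
    "delegation_label": [2, 2, 1, 3, 2, 3, 4, 3, 2, 3],  # 1 = human only ... 4 = AI only
})

# Mode 1: treat each individual response as a data point.
individual = responses["delegation_label"]

# Mode 2: average the five responses per task, yielding one data point per task.
per_task = responses.groupby("task")["delegation_label"].mean()
print(per_task)
```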

Participants seldom choose “AI only” and prefer designs where humans play the leading role. Figure 2 presents the distribution of labels from all survey responses. In both the personal survey and the expert survey, a label of 4 (“AI only”) is seldom chosen. Instead, both distributions peak at 2, indicating strong preferences towards machine-in-the-loop designs. This result becomes even more striking when averaging the five responses received per task, which concentrates almost half the mass in the bin between 2 and 2.5, again indicating a preference for machine-in-the-loop designs. In fact, after averaging responses, we find that none of the 100 tasks yields an overall preference for full automation (an average rating above 3.5). Taken together, these results imply that people prefer humans to keep control over the vast majority of these tasks, yet are also open to AI assistance.

If we view our surveys as a labeling exercise, the agreement is relatively low: Krippendorff’s α is 0.064 in the personal survey and 0.175 in the expert survey (Hayes and Krippendorff, 2007). The lower agreement in the personal survey is consistent with heterogeneity between individuals. Two of the most contentious personal survey tasks were: “Planning menus and developing recipes at a restaurant” and “Breaking up with your romantic partner”.
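For reference, agreement of this kind can be computed with the open-source krippendorff Python package; the snippet below is our usage sketch rather than the authors' code, and it treats the five anonymous responses per task as five rater slots with ordinal labels.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Hypothetical data: rows are the five response "slots", columns are tasks,
# entries are delegation labels (1-4); np.nan would mark missing responses.
ratings = np.array([
    [2, 3, 1, 2],
    [2, 4, 1, 2],
    [1, 3, 2, 3],
    [3, 3, 1, 2],
    [2, 2, 1, 2],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```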

(a) Distribution of individual preferences.
(b) Distribution of average preferences for each task.
Figure 2. Label distributions, where 1 is “human only” and 4 is “AI only”: (a) distributions over the 500 individual labels in each survey; (b) distributions after averaging responses by task (each bar covers a range of size 0.5, e.g., [2, 2.5)). In both figures, participants seldom choose “AI only”.

Trust is most correlated with human preferences of automation. Table 3 shows the components that are significantly correlated with the delegation labels. In the personal survey, only 6 of the 14 components are significantly correlated with the delegability label, and the two trust components, machine ability and value alignment, take the top two spots. Surprisingly, interpretability (the third trust component) is not.

Consistent with our hypothesis in Figure 1, motivation, difficulty, and risk are generally negatively correlated with the degree of AI assistance. However, we highlight some exceptions. Though risk is correlated with delegability in the expert survey, it is not in the personal survey. Also, human ability (within the difficulty factor) is positively correlated with delegability, suggesting that people are happy to delegate tasks that they excel at to AI, e.g., chores in daily life.

Factor | Component | Personal | Expert
Motivation | Utility | -0.114 (*) | N/A
Motivation | Intrinsic Motiv. | -0.091 (*) | N/A
Difficulty | Social Skills | -0.289 (***) | -0.283 (***)
Difficulty | Creative Skills | -0.208 (***) | -0.278 (***)
Difficulty | Human Ability | NS | 0.143 (**)
Difficulty | Effort Req. | NS | -0.112 (*)
Difficulty | Expertise Req. | NS | -0.110 (*)
Risk | Uncertainty | NS | -0.123 (**)
Risk | Accountability | NS | -0.112 (*)
Risk | Impact | NS | -0.097 (*)
Trust | Machine Ability | 0.525 (***) | 0.597 (***)
Trust | Value Alignment | 0.487 (***) | 0.527 (***)
Table 3. Pearson correlation of framework components with the delegability label for individual responses to the personal and expert surveys (*** for p < 0.001, ** for p < 0.01, * for p < 0.05; NS = not significant).
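Correlations like those in Table 3 can be reproduced with scipy; the sketch below uses our own hypothetical column names and toy values rather than the released analysis code.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical DataFrame: one row per response, one column per framework
# component plus the 1-4 delegation label.
df = pd.DataFrame({
    "machine_ability":  [3, 4, 2, 5, 1, 4, 3, 2],
    "social_skills":    [2, 1, 5, 1, 4, 2, 3, 5],
    "delegation_label": [3, 3, 1, 4, 1, 3, 2, 1],
})

for component in ["machine_ability", "social_skills"]:
    r, p = pearsonr(df[component], df["delegation_label"])
    print(f"{component}: r = {r:+.3f}, p = {p:.3f}")
```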
Figure 3. Average pairwise correlation between components in the four factors. We observe high coherence in motivation and risk.
Figure 4. Breakdown of component correlations in difficulty and trust: (a) correlation of components in the difficulty factor; (b) correlation of components in the trust factor.

4.2.2. Coherence of Factors

We next study the correlation between components in our framework. We focus on the personal survey here because it has all four factors, but results are consistent in the expert survey.

Figure 3 presents the average pairwise component correlations between the four factors: the correlation along the diagonal is generally higher than the off-diagonal ones. This finding confirms that factors are generally “coherent”.

However, coherence is lower in difficulty and trust than in risk and motivation. To investigate this, we zoom in on the correlation matrix in Figure 4 to show individual components. Figure 4(a) shows that difficulty has two groups: social and creative skills are well correlated with each other, but poorly correlated with the rest. This suggests that “skills required” may be a separate factor from difficulty. Similarly, in trust, interpretability has little correlation with machine ability and value alignment, suggesting that the need for explanation is independent of whether the machine is perceived as capable or benign.
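A coherence check of this kind can be sketched as averaging pairwise correlations within versus across factor groups. The code below is our own illustration with random placeholder ratings and only two of the four factors.

```python
import numpy as np
import pandas as pd

# Hypothetical component ratings: one row per response, one column per component.
df = pd.DataFrame(np.random.randint(1, 6, size=(200, 6)),
                  columns=["intrinsic", "goals", "utility",
                           "machine_ability", "interpretability", "value_alignment"])
factors = {"motivation": ["intrinsic", "goals", "utility"],
           "trust": ["machine_ability", "interpretability", "value_alignment"]}

corr = df.corr()  # pairwise Pearson correlations between components

def avg_corr(cols_a, cols_b):
    """Average pairwise correlation between two groups of components."""
    vals = [corr.loc[a, b] for a in cols_a for b in cols_b if a != b]
    return float(np.mean(vals))

for fa, ca in factors.items():
    for fb, cb in factors.items():
        print(f"{fa:10s} vs {fb:10s}: {avg_corr(ca, cb):+.3f}")
```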

4.2.3. Relations Between Factors

Factor relations differ from our expectation in Figure 1. Risk has almost no correlation with trust, and motivation has low correlation with difficulty. Between components of different factors, we find the strongest correlations between risk and difficulty: expertise (difficulty) with impact (risk), followed by expertise with accountability (risk). This suggests that difficult tasks, i.e., those requiring expertise, are associated with higher risk as reflected in impact and accountability.

5. Predicting Task Delegability

To further validate our framework, we turn to machine learning models to examine whether an individual’s delegation preference can be predicted from their component evaluations for a task. Our results show that our framework can significantly outperform the weighted random baseline in F1. Among the four factors, trust gives the best prediction.

5.1. Experiment Setup

We consider two classifiers in the prediction experiments: logistic regression and decision trees. We first conduct separate training on the personal and expert surveys. Since we have limited data (500 individual responses in each survey), we use 5-fold nested cross-validation to estimate prediction performance and a grid search to set the hyperparameters. We further evaluate generalizability by training on the personal survey and predicting the expert survey, and vice versa.
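The setup can be approximated with scikit-learn roughly as follows; this is a sketch with randomly generated stand-in feature and label arrays and an illustrative hyperparameter grid, not the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hypothetical data: X holds the 1-5 component ratings, y the 1-4 delegation labels.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(500, 14)).astype(float)
y = rng.integers(1, 5, size=500)

# Inner loop: grid search over hyperparameters; outer loop: 5-fold CV estimate.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1, 10]},
                     cv=5, scoring="f1_macro")
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="f1_macro")
print(f"Nested-CV macro F1: {outer_scores.mean():.3f}")
```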

Evaluation metrics. Since this is an imbalanced multi-class classification problem, we focus our evaluation on macro F1. We use a weighted random baseline as a benchmark for macro F1, because macro F1 is ill-defined for the majority baseline, which never predicts the minority classes. We also report accuracy, for which we use the majority baseline rather than the random one since it yields higher accuracy.
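A weighted random baseline corresponds to what scikit-learn calls a "stratified" dummy classifier. A minimal sketch, again with hypothetical stand-in data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(500, 14)).astype(float)  # hypothetical component ratings
y = rng.integers(1, 5, size=500)                      # hypothetical delegation labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "stratified" predicts labels at random in proportion to their training
# frequencies, i.e., a weighted random baseline.
baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
print("baseline macro F1:",
      f1_score(y_test, baseline.predict(X_test), average="macro"))
```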

Figure 5. Accuracy and F1 comparisons. Classifiers significantly outperform the baseline in F1, and slightly outperform the baseline in accuracy.
Figure 6. Confusion matrices in the prediction tasks for (a) the personal survey and (b) the expert survey. Delegation modes are only confused with nearby ones.
Subject | Task | Impact | Machine ability | Label | Predicted
A | Diagnosing whether a person has cancer | 5 | 2 | 2 | 2
B | Devising treatment plans for patients with cancer | 5 | 4 | 2 | 3
C | Explaining the diagnosis and treatment options of cancer to a patient | 5 | 3 | 2 | 2
Table 4. A case study of three individual responses in the expert survey.

5.2. Prediction Performance

We first examine the prediction performance in the personal survey and the expert survey respectively. Figure 5 presents the results. Both logistic regression and decision tree clearly outperform (almost double) the baseline in F1 (0.457 in logistic regression vs. 0.25 in baseline in personal, 0.513 in logistic regression vs. 0.25 in baseline in expert). In terms of accuracy, the improvement is much smaller: logistic regression outperforms the baseline by 5% in both setups.

Classification performance does not reveal the complete story because our classes are ordinal. Figure 6 shows the confusion matrices for both surveys: 1 (“human only”) is rarely confused with 4 (“AI only”). Although predicting the exact preference of delegation is a challenging task, our framework provides significant explanatory power.

Consistently, components within trust contribute the most predictive power for both the personal and expert surveys, as ranked by logistic regression coefficients and Gini importance. Ablation experiments also show that removing trust leads to the greatest drop in prediction performance.
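Such rankings can be read off fitted models. The sketch below again uses hypothetical data and our own placeholder feature names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(500, 14)).astype(float)   # hypothetical component ratings
y = rng.integers(1, 5, size=500)                        # hypothetical delegation labels
feature_names = [f"component_{i}" for i in range(X.shape[1])]  # placeholder names

logreg = LogisticRegression(max_iter=1000).fit(X, y)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Rank features by the mean absolute per-class logistic regression coefficient,
# and by the decision tree's Gini importance.
coef_rank = sorted(zip(feature_names, np.abs(logreg.coef_).mean(axis=0)),
                   key=lambda t: -t[1])
gini_rank = sorted(zip(feature_names, tree.feature_importances_),
                   key=lambda t: -t[1])
print(coef_rank[:3])
print(gini_rank[:3])
```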

train \ test | Personal | Expert
Personal | (0.457) | 0.493
Expert | 0.460 | (0.513)
Table 5. Generalization performance in F1 across the two setups. Rows indicate the training survey and columns the test survey; the parenthesized diagonal entries come from nested cross-validation, so they are not directly comparable.

Comparing personal vs. expert. We next attempt to predict preferences in the expert survey using the classifier trained on the personal survey, and vice versa. Table 5 presents the prediction performance. We include the nested CV results as benchmarks, though they come from a different procedure. The classifier trained on either dataset still obtains much better F1 performance than a random baseline.
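The cross-survey evaluation reduces to fitting on one response set and scoring on the other. A minimal sketch with random stand-in arrays (restricted to the 11 difficulty, risk, and trust components shared by both surveys):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the shared components in each survey;
# motivation is excluded since the expert survey omits it.
X_personal = rng.integers(1, 6, (500, 11)).astype(float)
y_personal = rng.integers(1, 5, 500)
X_expert = rng.integers(1, 6, (500, 11)).astype(float)
y_expert = rng.integers(1, 5, 500)

clf = LogisticRegression(max_iter=1000).fit(X_personal, y_personal)
print("personal -> expert macro F1:",
      f1_score(y_expert, clf.predict(X_expert), average="macro"))
```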

Case study and error analysis. To further illustrate our classifiers, Table 4 shows three individual responses in the expert survey as a case study. Because trust and risk have been shown important to this model, we use machine ability (trust) and impact (risk) to demonstrate two correct and one incorrect classification. The three tasks are all related to cancer and all subjects rated the impact as high (5). The model classifies subject A’s response correctly as 2 (machine-in-the-loop) based on low levels of trust. For subject B, the model fails. It assigns a level of 3 (human-in-loop) based on the high trust, in spite of the high impact. Such contrast demonstrates the inherent difficulty of this task: calibration between individuals can be important given sufficient data.

6. Concluding Discussion

In this work, we present first steps towards understanding the topology of human preferences in human-machine task delegation decisions. We develop an intuitive framework of motivation, difficulty, risk, and trust, which we hope may serve as a starting point for reasoning about delegation preferences across tasks. We develop a survey to quantify human preferences, and validate the promise of such an empirical approach through correlation analysis and prediction experiments. Our findings show a clear preference for machine-in-the-loop designs and a disinclination towards “AI only”.

In developing this framework, our intent is not to suppress development or to drive policy, but rather to provide an avenue for more effective human-centered automation. Human preferences regarding the extent of autonomy, and the reasons and motivations behind these preferences, are an understudied and valuable source of information. There is a clear gap in methodologies for both understanding and contextualizing such preferences across tasks; it is this gap we wish to address.

Limitations. Our empirical survey is based on participants on Amazon Mechanical Turk. We use a strict filter and attention check questions to guarantee quality, but this sample may not be representative of the general population in the US, or of other populations with different cultural expectations.

We suggest several improvements to our survey design. While useful as a first step, future surveys should include a thorough validation of the survey questions and tasks. Though we designed our survey to avoid requiring respondents to understand our framework conceptually, a brief training or examples may also help reduce noise stemming from abstractions. We also recommend consideration of individual baseline attitudes towards automation. Since each respondent evaluated only one task, we were unable to determine whether individuals were consistently biased towards or against AI, and what effect this might have had on factor measurements. Finally, we chose only four delegability categories, ranging from “human only” to “AI only”. This was a deliberate abstraction choice to handle the wide variety of tasks presented. However, since the majority of responses fell into one of the two shared-control categories, we suggest future studies may benefit from more fine-grained choices on shared control.

Although our classifiers significantly outperform the random baseline, they do not fully explain human preferences in task delegability, leaving plenty of room for improvement. Due to limited data, we also do not explore higher-order feature interactions or causal relations. Since this is a poorly understood area with practical implications, we strongly believe such higher-order dependencies are worth exploring.

Finally, human preferences are dynamic and survey results may not hold over time. Nevertheless, mapping current perceptions enables tracking any future changes, providing an additional mechanism to understand how basic changes in factors like machine ability or media coverage could manifest through trust and belief in optimal delegability.

Implications. The above limitations notwithstanding, we highlight some factor correlations worthy of further analysis. First, we consider trust. Our finding of trust as the most salient factor behind delegation preferences supports the community’s widespread interest in trust and reliance. We find negative correlations between trust in machine abilities and the social and creative skill requirements. These are skills commonly considered difficult for machines, hinting at a method of estimating trust using perceived skill assessments. We also note the low correlation between interpretability and delegation labels, demonstrating the complex relation between interpretable machine learning and trust. Research directions connected to trust, such as interpretable ML and algorithmic fairness, could benefit from a fine-grained understanding of human preferences across tasks.

Moreover, although people do not show high agreement in the delegation labels, classifiers based on our framework can effectively predict individual responses. This suggests that delegation concerns may be moderated by educating people, e.g., about machine ability, or by giving users the flexibility to modify levels of control.

Lastly, our findings show that people do not prefer “AI only”, instead opting for machine-in-the-loop designs. Interestingly, we note that even for low-trust tasks such as cancer diagnosis or babysitting, people are still receptive to the idea of a machine-in-loop assistant. We should explore paradigms which let people maintain high-level control over tasks while leveraging machines as support.

We hope this framework will serve as a starting point for understanding task relations through the lens of human perceptions, and considering how to generalize task-specific results to other domains. For instance, how might results on trust and delegation for flu diagnosis generalize to cancer or depression diagnosis tasks? According to our results, trust is highest for flu diagnosis and lowest for cancer diagnosis, with depression in the middle. Delegability levels followed trust, though the overall preference for all three remains a machine-in-loop design. Ultimately, we believe an effective understanding of human preferences and task relations will prove invaluable for the community and the public as a whole.

References

  • Anderson et al. (2016) Ashton Anderson, Jon M. Kleinberg, and Sendhil Mullainathan. 2016. Assessing Human Error Against a Benchmark of Perfection. In Proceedings of KDD.
  • Anderson (2017) Drew Anderson. 2017. GLAAD and HRC call on Stanford University & responsible media to debunk dangerous & flawed report claiming to identify LGBTQ people through facial recognition technology. https://www.glaad.org/blog/glaad-and-hrc-call-stanford-university-responsible-media-debunk-dangerous-flawed-report. [Online; accessed 3-Sep-2018].
  • Bandura (1989) Albert Bandura. 1989. Human Agency in Social Cognitive Theory. 44 (10 1989), 1175–84.
  • Bradshaw et al. (2012) J. M. Bradshaw, V. Dignum, C. Jonker, and M. Sierhuis. 2012. Human-agent-robot teamwork. IEEE Intelligent Systems 27, 2 (March 2012), 8–13. https://doi.org/10.1109/MIS.2012.37
  • Branson et al. (2010) Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. 2010. Visual recognition with humans in the loop. In Proceedings of ECCV.
  • Castelfranchi and Falcone (1998) Cristiano Castelfranchi and Rino Falcone. 1998. Towards a theory of delegation for agent-based systems. Robotics and Autonomous Systems 24, 3 (1998), 141 – 157. https://doi.org/10.1016/S0921-8890(98)00028-1 Multi-Agent Rationality.
  • Cheng et al. (2016) Jie-Zhi Cheng, Dong Ni, Yi-Hong Chou, Jing Qin, Chui-Mei Tiu, Yeun-Chung Chang, Chiun-Sheng Huang, Dinggang Shen, and Chung-Ming Chen. 2016. Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans. Scientific reports 6 (2016), 24454.
  • Clark et al. (2018) Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A Smith. 2018. Creative Writing with a Machine in the Loop: Case Studies on Slogans and Stories. In Proceedings of IUI.
  • Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017).
  • Erel et al. (2018) Isil Erel, Léa H Stern, Chenhao Tan, and Michael S Weisbach. 2018. Selecting Directors Using Machine Learning. Technical Report. National Bureau of Economic Research.
  • Fails and Olsen Jr (2003) Jerry Alan Fails and Dan R Olsen Jr. 2003. Interactive machine learning. In Proceedings of IUI.
  • Fakoor et al. (2013) Rasool Fakoor, Faisal Ladhak, Azade Nazi, and Manfred Huber. 2013. Using deep learning to enhance cancer diagnosis and classification. In Proceedings of ICML, Vol. 28.
  • Fitts et al. (1951) Paul M Fitts, MS Viteles, NL Barr, DR Brimhall, Glen Finch, Eric Gardner, WF Grether, WE Kellum, and SS Stevens. 1951. Human engineering for an effective air-navigation and traffic-control system, and appendixes 1 thru 3. Technical Report. Ohio State University Research Foundation Columbus.
  • Frey and Osborne (2013) Carl Benedikt Frey and Michael A. Osborne. 2013. The Future of Employment: How Susceptible are Jobs to Computerisation? Technological Forecasting and Social Change 114 (2013), 254 – 280. https://doi.org/10.1016/j.techfore.2016.08.019
  • Gombolay et al. (2016) Matthew Gombolay, Xi Jessie Yang, Bradley Hayes, Nicole Seo, Zixi Liu, Samir Wadhwania, Tania Yu, Neel Shah, Toni Golen, and Julie Shah. 2016. Robotic assistance in the coordination of patient care. The International Journal of Robotics Research (2016), 0278364918778344.
  • Green et al. (2015) Spence Green, Jeffrey Heer, and Christopher D. Manning. 2015. Natural Language Translation at the Intersection of AI and HCI. Queue 13, 6, Article 30 (June 2015), 13 pages. https://doi.org/10.1145/2791301.2798086
  • Hayes and Krippendorff (2007) Andrew F Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication methods and measures 1, 1 (2007), 77–89.
  • Horvitz (1999) Eric Horvitz. 1999. Principles of Mixed-initiative User Interfaces. In Proceedings of CHI. ACM, 159–166.
  • Johnson et al. (2014) Matthew Johnson, Jeffrey M Bradshaw, Paul J Feltovich, Catholijn M Jonker, M Birna Van Riemsdijk, and Maarten Sierhuis. 2014. Coactive design: Designing support for interdependence in joint activity. Journal of Human-Robot Interaction 3, 1 (2014), 43–69.
  • Kim et al. (2016) Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of NIPS.
  • Kleinberg et al. (2018) Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2018. Human Decisions and Machine Predictions. The Quarterly Journal of Economics 133, 1 (2018), 237–293. https://doi.org/10.1093/qje/qjx032
  • Latham and Pinder (2005) Gary P Latham and Craig C Pinder. 2005. Work motivation theory and research at the dawn of the twenty-first century. Annu. Rev. Psychol. 56 (2005), 485–516.
  • Lee and Moray (1994) John D. Lee and Neville Moray. 1994. Trust, self-confidence, and operators’ adaptation to automation. International Journal of Human-Computer Studies 40, 1 (1994), 153 – 184. https://doi.org/10.1006/ijhc.1994.1007
  • Lee and See (2004) John D Lee and Katrina A See. 2004. Trust in Automation: Designing for Appropriate Reliance. Human Factors: The Journal of Human Factors and Ergonomics Society 46 (2004), 50–80. Issue 1.
  • Lewandowsky et al. (2000) Stephan Lewandowsky, Michael Mundy, and Gerard P. A. Tan. 2000. The Dynamics of Trust: Comparing Humans to Automation. Journal of Experimental Psychology: Applied 6 (2000), 104–123. Issue 2.
  • Litjens et al. (2016) Geert Litjens, Clara I Sánchez, Nadya Timofeeva, Meyke Hermsen, Iris Nagtegaal, Iringo Kovacs, Christina Hulsbergen-Van De Kaa, Peter Bult, Bram Van Ginneken, and Jeroen Van Der Laak. 2016. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific reports 6 (2016), 26286.
  • Locke (2000) Edwin Locke. 2000. Motivation, Cognition, and Action: An Analysis of Studies of Task Goals and Knowledge. Applied Psychology 49, 3 (2000), 408–429. https://doi.org/10.1111/1464-0597.00023
  • Locke and Latham (2002) Edwin Locke and Gary Latham. 2002. Building a practically useful theory of goal setting and task motivation - A 35-year odyssey. 57 (10 2002), 705–17.
  • Mezzofiore (2017) Gianluca Mezzofiore. 2017. That AI study which claims to guess whether you’re gay or straight is flawed and dangerous. https://mashable.com/2017/09/11/artificial-intelligence-ai-lgbtq-gay-straight/. [Online; accessed 3-Sep-2018].
  • Milewski and Lewis (2014) Allen Milewski and Steven Lewis. 2014. When People Delegate. (11 2014).
  • Murphy (2017) Heather Murphy. 2017. Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html. [Online; accessed 3-Sep-2018].
  • Ng (2017) Andrew Ng. 2017. Artificial Intelligence is the New Electricity. In presentation at the Stanford MSx Future Forum.
  • Parasuraman et al. (2000) Raja Parasuraman, Thomas B Sheridan, and Christopher D Wickens. 2000. A model for types and levels of human interaction with automation. IEEE Transactions on systems, man, and cybernetics-Part A: Systems and Humans 30, 3 (2000), 286–297.
  • Perkins et al. (2010) LeeAnn Perkins, Janet E. Miller, Ali Hashemi, and Gary Burns. 2010. Designing for Human-Centered Systems: Situational Risk as a Factor of Trust in Automation. 2130–2134.
  • Price (1985) Harold E. Price. 1985. The Allocation of Functions in Systems. Human Factors 27, 1 (1985), 33–45. https://doi.org/10.1177/001872088502700104
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of KDD.
  • Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In Proceedings of AAAI.
  • Roemmele and Gordon (2015) Melissa Roemmele and Andrew S. Gordon. 2015. Creative Help: A Story Writing Assistant. In Proceedings of ICIDS.
  • Schwab (2017) Klaus Schwab. 2017. The fourth industrial revolution. Crown Business.
  • Stout et al. (2014) Nathan Stout, Alan Dennis, and Taylor Wells. 2014. The Buck Stops There: The Impact of Perceived Accountability and Control on the Intention to Delegate to Software Agents. AIS Transactions on Human-Computer Interaction 6 (03 2014). Issue 1.
  • Susskind and Susskind (2015) Richard E Susskind and Daniel Susskind. 2015. The future of the professions: How technology will transform the work of human experts. Oxford University Press, USA.
  • Wang and Kosinski (2018) Yilun Wang and Michal Kosinski. 2018. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of Personality and Social Psychology 114, 2 (2018), 246.