Do Humans Trust Advice More if it Comes from AI? An Analysis of Human-AI Interactions

07/14/2021
by   Kailas Vodrahalli, et al.
Stanford University

In many applications of AI, the algorithm's output is framed as a suggestion to a human user. The user may ignore the advice or take it into consideration to modify their decision. With the increasing prevalence of such human-AI interactions, it is important to understand how users act (or do not act) upon AI advice, and whether users regard advice differently if they believe it comes from an "AI" versus another human. In this paper, we characterize how humans use AI suggestions relative to equivalent suggestions from a group of peer humans across several experimental settings. We find that participants' beliefs about human versus AI performance on a given task affect whether or not they heed the advice. When participants decide to use the advice, they do so similarly for human and AI suggestions. These results provide insights into factors that affect human-AI interactions.


1 Introduction

Artificial intelligence (AI) technology has rapidly matured and recently become prevalent in many aspects of our life. However, in safety-critical areas like medicine, there remain significant barriers to widespread adoption of AI. For example, it would be reckless for a doctor to fully rely on an AI algorithm with unknown failure modes to diagnose life-threatening diseases.

One remediation, particularly relevant for medicine, is to encourage humans to treat the output of AI algorithms as “advice”, while allowing a human to make the final decision. In the diagnosis example, this essentially places the AI algorithm in the same category as lab tests or other exams a doctor may order to aid in diagnosis. While this mitigates issues related to safety, one of the biggest barriers to adoption remains the black-box nature of current AI systems that limits people’s trust in the AI’s suggestions Feldman et al. (2019); Ribeiro et al. (2016); Xie et al. (2020); Miller (2019).

In this paper, we draw upon literature in psychology to better characterize how humans use advice from an AI algorithm. We ask whether participants use advice from an AI differently than advice from humans. To answer this question, we employ a paradigm from psychology for comparing how people use advice from different sources Van Swol et al. (2018); Prahl and Van Swol (2017). We develop several experiments for a layperson audience and deploy these experiments on a crowdsourcing platform. Our findings suggest that the primary difference between receiving human advice and AI advice lies in how likely the individual is to use the advice. Individuals use the advice to the extent that they believe the AI or other humans are good at the given task. If an individual chooses to use the advice, there is little difference in how they incorporate human advice versus AI advice.

Our contributions

Understanding how users account for AI advice is an important and under-explored aspect of human-AI interaction. Here we systematically study how humans use advice in several experiments. We propose a two-stage model according to which humans first decide whether to use advice, and then decide how to update their judgments. The predictions of this model are supported by data from participants living in a variety of geographic regions (US, UK, and Asia). In our analysis, we also quantify how different demographic and task-related factors contribute to how much people respond to advice from an AI versus a human.

Related work

Advice utilization has been studied in the psychology literature where studies often employ the judge-advisor system (JAS) framework for comparing how humans use advice from various sources when completing a given task Van Swol et al. (2018). This setup is similar to ours in that people are given a task, asked to complete the task, and subsequently receive “advice” from an outside source. This advice is in the form of a sample response, and the participant is then given the option to change their answer. By comparing various types of advice, it is possible to draw conclusions on how different sources of advice are utilized. Previous work has investigated human peer advice, human expert advice, and algorithms or statistical models. This work has ascribed differences in advice utilization to several factors including difficulty of the task, the human’s expertise on the task, and the perception of the advice source Prahl and Van Swol (2017); Dzindolet et al. (2002); Madhavan and Wiegmann (2007); Önkal et al. (2009); Gino and Moore (2007); Sah et al. (2013); Schultze et al. (2015).

Recently, work has begun to explore advice utilization specifically for AI algorithms Tauchert et al. (2019); Mesbah et al. (2021); Gaube et al. (2021). We seek to extend this line of work. While previous work focused on advice utilization in a single context or in relation to a specific task of interest, we run experiments in a variety of contexts and data modalities (image-based tasks, text-based tasks, and tasks involving tabular data), as well as across multiple geographic regions. This allows us to draw more general conclusions about how AI advice is utilized relative to human advice.

2 Study Design

Figure 1: Visualization of the study design. Participants are randomly split into two groups: those who receive human advice, and those who receive AI advice. After reading through instructions, participants complete a fixed set of tasks. For each task, they first answer on their own (Response 1). Subsequently, they receive advice from their designated source and are allowed to change their answer (Response 2). After completing all tasks, we ask participants several survey questions and debrief them.

To investigate whether participants utilize advice from an AI algorithm differently than from humans, we develop a set of experiments on a variety of datasets that we generated. The experiments follow the two-stage procedure shown in Figure 1. Participants from the US were first recruited from a crowdsourcing platform and randomly assigned to two groups. The first group received advice from an AI algorithm, while the second group received advice from a human.

To isolate the effect that the source of advice has on people’s judgments, the AI-labeled advice and human-labeled advice were identical. The advice was generated using the average response from a previous set of 25-50 human labelers. These labelers were shown each task instance (without advice) prior to the experiment. In each condition, participants were informed that the advice source was 80% accurate, and the actual advice participants received was identical in both conditions. Between the conditions, we only varied whether the advice was described as coming from a human or an AI, and how the advice was presented, using a small icon of a person or a computer (see Figure 1). When showing the advice to participants, we randomly perturbed the advice in one of 3 fixed ways: (1) shifting it a fixed fraction of the scale width away from the correct response, (2) shifting it a fixed fraction of the scale width towards the correct response, or (3) leaving it unchanged. Options (1) and (2) were each selected with one fixed probability, and option (3) with another. We added these perturbations as a means of both increasing the variety of advice given and decreasing the accuracy of the advice to the 80% level (while the average performance of an individual was below 80%, the average performance of the average response was above 80% across our datasets). The randomness in these perturbations is specific to each participant. Note that participants do not receive any feedback on their accuracy during the experiment.
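
The perturbation step can be summarized in a short sketch, written in R to match the analysis tooling used later in the paper. The exact perturbation magnitude and sampling probabilities are not reproduced in the text above, so the `shift` and `probs` values below are placeholders, and responses are assumed to be recorded on a 0-1 scale; this is an illustrative sketch rather than the study's exact procedure.

    # Illustrative sketch of the advice-perturbation step (placeholder values).
    perturb_advice <- function(advice, correct_end, shift = 0.1,
                               probs = c(away = 0.25, towards = 0.25, none = 0.5)) {
      # correct_end is the scale endpoint of the correct class (0 or 1 here)
      direction <- sample(c("away", "towards", "none"), size = 1, prob = probs)
      perturbed <- switch(direction,
        away    = advice - shift * sign(correct_end - advice),  # shift away from the correct end
        towards = advice + shift * sign(correct_end - advice),  # shift towards the correct end
        none    = advice)                                       # keep the averaged advice unchanged
      pmin(pmax(perturbed, 0), 1)                               # keep the advice on the response scale
    }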

After reading through instructions, each participant completed a series of tasks in randomized order. Within each experiment, the tasks were all of the same type (e.g., image classification on city images), but the difficulty of each task instance could vary significantly (see Table 1). The tasks we considered are all binary classification tasks, and participants submitted answers on a sliding scale with the endpoints labeled “Definitely [Class A]” and “Definitely [Class B]”. We analyze participants’ responses both on this continuous scale and in a binarized form, where we treat a response as correct if it was on the correct side of the scale and incorrect otherwise.
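
For concreteness, a response can be binarized as in the following R sketch; the assumption that the slider is recorded on a 0-1 scale with 0.5 as the midpoint is ours, since the exact encoding is not specified above.

    # Map a continuous slider response (assumed to lie in [0, 1], with 0.5 as the
    # midpoint of the scale) to a binary prediction and score it against the label.
    binarize_correct <- function(response, true_label) {
      predicted <- ifelse(response > 0.5, "B", "A")   # above midpoint reads as Class B
      predicted == true_label                         # TRUE if on the correct side of the scale
    }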

After completing all tasks in an experiment, participants were asked a series of survey questions to gauge their trust in and familiarity with AI. Participants were incentivized to perform well on the task by receiving a bonus based on their averaged judgments across all task instances (both before and after advice). Participants were informed of how the bonus was calculated in the instructions. More details on the instructions and the survey questions are provided in the Appendix.

Figure 2: Example tasks for each of the 4 datasets we use: (a) Art, (b) Cities, (c) Sarcasm, (d) Census.
Dataset Data Type Description # of Tasks Baseline accuracy Prior belief
Art Image identify the art period 32 65.7% +/- 17.0% 66.7% (AI)
Cities Image identify the city 32 72.1% +/- 17.3% 65.2% (AI)
Sarcasm Text identify sarcasm 32 73.1% +/- 20.2% 79.4% (human)
Census Tabular identify income level 32 70.5% +/- 28.7% 67.3% (AI)
Table 1: Summary of the 4 datasets we use in our experiments. Baseline accuracy contains the average and standard deviation at a task level, prior to receiving advice. The prior belief comes from a survey question asking whether the participant thinks a human or an AI would do better on the given task (see Appendix for the full question). The reported value indicates what percent of participants favored an AI or a human for the given task; the label in parentheses indicates which type of advice was more favored.

Dataset # of Participants Sex (Percent Female) Age Socioeconomic Score Education Level
Art 147 52.4% 33.0 +/- 12.3 5.2 +/- 1.5 5.4 +/- 1.3
Cities 94 50.0% 26.1 +/- 7.8 5.3 +/- 1.5 5.4 +/- 1.2
Sarcasm 97 56.7% 31.2 +/- 11.7 5.3 +/- 1.6 5.2 +/- 1.4
Census 98 53.1% 30.5 +/- 10.9 5.2 +/- 1.6 5.3 +/- 1.2
Table 2: Summary of participant demographics from the 4 datasets we use in our experiments. Socioeconomic score is on a scale from 1 to 10; a value of 5 corresponds to a perceived middle class status. Education level is on a scale from 1 to 8; a value of 5 corresponds to attaining at least an associate’s degree or higher. More details on these measures are provided in the Appendix.

Tasks

We provide an overview of the datasets we use for our experiments in Table 1 and a summary of our participant demographics in Table 2. The socioeconomic score and education level are provided by the crowdsourcing platform we use for running our experiment (Prolific [16]). More details on these values are provided in the Appendix. For each experiment, every participant completes all tasks from a given dataset twice – before and after receiving advice. More details about each dataset are given below. In Figure 2, we show sample tasks from each dataset. These were the examples shown to participants in the instructions and were selected to be easier than the tasks we use in our experiments. The tasks were designed to cover diverse data modalities (visual, text, and tabular) and to be sufficiently challenging that participants could benefit from advice.

Art dataset (Image)

This dataset contains images of paintings from 4 art periods: Renaissance, Baroque, Romanticism, and Modern Art. The dataset contains 8 paintings from each time period. Participants were asked to determine the art period a painting is from given a binary choice. The incorrect label was selected to be from adjacent time periods to increase task difficulty (e.g., if Romanticism is the correct label, then the alternate choice would be Baroque or Modern Art; this choice was fixed for all participants). An example task is shown in Figure 2a.

Cities dataset (Image)

This dataset contains images from Google from 4 major US cities: San Francisco, Los Angeles, Chicago, and New York City. The dataset contains 8 images from each city. The task is to identify which city an image is from given a binary choice. For each image, an alternate city was randomly selected as the negative label presented to participants. The images were selected to be moderately difficult to identify – images of major landmarks were excluded, but differentiating aspects of the cities like architecture were included. An example task is shown in Figure 2b.

Sarcasm dataset (Text)

This dataset is a subset of the Reddit sarcasm dataset Khodak et al. (2017), which includes text snippets from the discussion forum website Reddit. Specifically, the subset was selected from posts in the “AskReddit” forum and was hand-filtered to exclude posts that either contained potentially offensive content or were too long. Participants were asked to identify sarcasm in the text snippet. The dataset was balanced to contain 16 sarcastic posts and 16 non-sarcastic posts. An example task is shown in Figure 2c.

Census dataset (Tabular)

This dataset comes from a subset of US census data West and Praturu (2019). The task is to identify an individual’s income level (is the individual’s annual income above or below a given threshold?) from their demographic information: state of residence, age, education level, marital status, sex, and race. The dataset was balanced to contain 16 individuals in each of the two income categories. The dataset was further balanced across race: each income category contains 4 Asian, 4 Black, and 8 White individuals. An example task is shown in Figure 2d.

3 Results

Our results are summarized in Table 3. These results can be broken down into 3 components: (1) the effect of the advice on performance, (2) the difference between human and AI advice in this effect, and (3) how people’s prior beliefs about the task affect how they use the advice. We discuss each of these in turn.

Dataset Art Dataset Cities Dataset Sarcasm Dataset Census Dataset
Baseline accuracy 65.7% 72.5% 73.1% 70.5%
Change in accuracy (human) +7.1% +3.7% +3.3% +2.4%
Change in accuracy (AI) +11.8% +6.2% +3.3% +3.5%
Activation (human) 46.0% 48.2% 39.8% 41.8%
Activation (AI) 52.3% 54.5% 34.0% 47.5%
Activation Difference +6.4% +6.3% -5.8% +5.7%
Table 3: Summary of the treatment effect across the 4 datasets we use in our experiments. The first row is copied over from Table 1 for convenience. The change in accuracy is the average change in accuracy after receiving advice, across all participants in the specified sub-group. Activation refers to the percentage of participants who change their response by at least 3.5% (the width of the slider bar) after receiving advice. The activation difference is the difference between AI and human activation (a positive value indicates participants with AI advice were more likely to change their label).

3.1 Effect of advice

Figure 3: Average accuracy before and after receiving advice, by dataset: (a) Art, (b) Cities, (c) Sarcasm, (d) Census. Accuracy is averaged at a task level; we order the tasks by average accuracy after receiving advice. In each plot, chance is the accuracy expected from random guessing; advice is the average accuracy of the advice on a given task (recall the advice is randomly perturbed for each individual and task); before advice and after advice are the average accuracies across both arms of the experiment for a given task.

First, we consider what effect receiving advice has on performance across both conditions. This effect is summarized in the first three rows of Table 3. The first row reports the average accuracy across tasks before receiving advice, and the second and third rows report the average change in accuracy after receiving advice. In all 4 datasets, there is an increase in accuracy after advice under both the human and AI conditions.

In Figure 3, we show the accuracy level of individual tasks across the 4 datasets. Note that when the advice for a given task is on average incorrect, people on average make more mistakes on the task after receiving advice. These results confirm that for all of our datasets, receiving advice is on average beneficial.

3.2 Difference between AI and human advice

Figure 4: PDF of the weight of advice (WoA) metric, conditioned on being activated, for each dataset: (a) Art, (b) Cities, (c) Sarcasm, (d) Census. Weight of advice is clipped to have a maximum magnitude of 5. Some intuition on the WoA values: WoA < 0 if the participant changed their label in the opposite direction of the advice; WoA = 0 if the participant did not change their label; WoA = 1 if the participant’s 2nd response exactly equals the advice; WoA > 1 if the participant updated beyond the given advice. The percentage activated refers to the corresponding value in Table 3.

Now that we have confirmed that participants use the advice to update their judgments, we can ask how this effect differs between receiving human versus AI advice. To help answer this question, we will use the weight of advice (WoA), a common metric in the psychology of advice utilization Harvey and Fischer (1997); Yaniv and Foster (1997). The WoA is defined as

    WoA = (r_2 - r_1) / (a - r_1),    (1)

where r_1 and r_2 are the participant’s first and second answers respectively, and a is the advice presented to the participant. The value is typically clipped to have bounded magnitude (e.g., in the case that the first response happens to be close to the advice given). A positive value implies that the participant changed their response in accordance with the advice, while a negative value implies the participant changed their response away from the advice. A value of 0 occurs when the participant did not change their response. We define a user to be “activated” for a given task if they change their response by at least a threshold amount after receiving advice. The threshold was chosen to be equal to the width of the slider bar.
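
As a concrete illustration, the WoA and the activation flag can be computed as in the R sketch below. The clipping bound of 5 follows Figure 4 and the 3.5% threshold follows Table 3; the assumption that responses lie on a 0-1 scale is ours.

    # Weight of advice (WoA) and activation flag for one task (sketch).
    weight_of_advice <- function(r1, r2, advice, clip = 5) {
      woa <- (r2 - r1) / (advice - r1)   # Equation (1); undefined if advice == r1
      pmax(pmin(woa, clip), -clip)       # clip to |WoA| <= 5, as in Figure 4
    }

    is_activated <- function(r1, r2, threshold = 0.035) {
      abs(r2 - r1) >= threshold          # changed by at least the slider-bar width (3.5% of the scale)
    }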

We plot the PDF of the WoA for each dataset, comparing across arms in Figure 4. From these plots, we make the following observations:

  1. People change their label more often for either human or AI advice, depending on the dataset. This effect is statistically significant, and the favored advice source differs by task. For the art, cities, and census tasks, the AI arm is favored; for the sarcasm task the human arm is favored. We describe this result as an “activation difference,” where the arm that people are more likely to use advice from is the more activated arm. The size of this difference also varies by dataset and is shown in the “Activation Difference” row of Table 3; note here that a positive difference favors the AI advice while a negative difference favors the human advice.

    Also note that the sign of the activation difference is reflected in the change in accuracy across the human and AI subsets. For the datasets where AI had a higher activation rate, the participants receiving AI advice had a greater average change in accuracy. On the sarcasm dataset, the change in accuracy is identical. While this may seem contradictory (there is a non-zero activation difference), Figure 3 provides an explanation. Note that in the Sarcasm dataset, there are more tasks where the advice is incorrect. So, the accuracy improvement we would expect to see is mitigated by bad advice decreasing accuracy. This explanation can also be applied to the Census dataset which has a smaller accuracy difference between AI and human than the Art or Cities dataset.

  2. Conditioned on having changed their label, people behave similarly. As shown in Figure 4, this subset of the population corresponds to roughly a third to a half of the responses, depending on the experiment (see the activation rates in Table 3). We ran a 2-sample Kolmogorov–Smirnov test for each set of tasks with a p-value threshold of 0.01, including all responses where the label was changed after receiving advice and splitting on the treatment arm to get 2 samples (a minimal sketch of this comparison follows this list). In all 4 datasets, there was no statistically significant difference between the samples, suggesting that the main effect of the treatment may be to modulate the activation level rather than how the advice gets used.
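
A minimal R sketch of the comparison in item 2, assuming a hypothetical data frame `woa_df` with one clipped WoA value per activated response and an `arm` column marking the treatment arm:

    # Two-sample Kolmogorov-Smirnov test comparing WoA distributions across arms,
    # restricted to activated responses (`woa_df` is a hypothetical data frame).
    compare_arms <- function(woa_df) {
      human_woa <- woa_df$woa[woa_df$arm == "human"]
      ai_woa    <- woa_df$woa[woa_df$arm == "AI"]
      ks.test(human_woa, ai_woa)   # compare the resulting p-value against 0.01, per the text
    }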

A two-stage model for the impact of advice

Based on these experiments, we propose a two-stage model for how participants utilize the advice they receive. In the first stage, a participant decides whether or not to use the advice. In the second stage, they decide to what extent they will use the advice. There is evidence in the literature for such a model: in Harvey et al. (2000), the authors find that people are better at assessing advice than at using it. Our two-stage model captures this effect nicely.

Using this model, we can understand the difference between AI and human advice as a difference in the first stage but not in the second stage. In particular, the rate at which people utilize advice is modulated by the advice source, but the extent to which people utilize advice is not significantly different if they decide to use the advice.
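
To make the two-stage reading concrete, the following toy simulation (not the authors' fitted model) draws a second response in two steps: stage one decides whether the participant activates, with a probability nudged by the prior in favor of the received source; stage two, reached only when activated, draws a weight on the advice that is independent of the source. All parameter values and functional forms are illustrative assumptions.

    # Toy simulation of the two-stage advice-utilization model (illustrative only).
    simulate_second_response <- function(r1, advice, prior,
                                         base_rate = 0.45, prior_gain = 0.2) {
      # Stage 1: decide whether to use the advice at all; the prior in favor of
      # the received source shifts the activation probability (placeholder form).
      p_activate <- plogis(qlogis(base_rate) + prior_gain * prior)
      if (runif(1) > p_activate) {
        return(r1)                   # not activated: keep the initial response
      }
      # Stage 2: decide how much weight to put on the advice; this stage does not
      # depend on whether the source was labeled human or AI.
      woa <- rbeta(1, 2, 2)          # placeholder distribution over weights
      r1 + woa * (advice - r1)
    }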

3.3 Effect of participants’ prior beliefs about the task

Covariates Activation Coefficients Response Change Coefficients
Given AI Advice? 0.162 +/- 0.134 -0.023 +/- 0.016
Prior Belief 0.244 +/- 0.059 0.001 +/- 0.008
Response 1 Confidence -0.584 +/- 0.025 -0.212 +/- 0.004
Advice Confidence 0.218 +/- 0.024 0.025 +/- 0.004
Response 1 and Advice Consistent? -1.555 +/- 0.051 0.661 +/- 0.008
Age -0.089 +/- 0.072 -0.026 +/- 0.008
Sex -0.046 +/- 0.137 0.010 +/- 0.016
Education 0.043 +/- 0.076 0.021 +/- 0.009
Socioeconomic Status 0.139 +/- 0.070 0.007 +/- 0.008
Has Programming Experience? 0.233 +/- 0.147 0.014 +/- 0.017
Table 4: Learned coefficients and confidence levels from a logistic regression mixed effects model predicting whether an individual changed their response after receiving advice on a given task, and a linear regression mixed effects model predicting the change in response after advice, conditioned on being activated. Coefficients are printed in bold if non-zero with high confidence. “Response 1 and Advice Consistent?” refers to whether the initial response and the advice favor the same label. “Response 1 Confidence” and “Advice Confidence” refer to the magnitude of the initial response and the advice, respectively.

Now we seek to explain what accounts for the treatment effect we observed. The final column of Table 1 offers an explanation: it shows participants’ answers to a survey question about whether an AI algorithm or the average human would perform better on the given task. Responses were recorded on a sliding scale with human at one end and AI at the other end. We computed the prior by averaging all responses and rescaling them to a fixed range; we then set the sign such that a positive prior corresponds to the participant’s belief that the advice source they received would perform better than the alternate source. We visualize the full distribution in the Appendix.

Now we use 2 mixed effects models to test our two-stage model and show the effect of the prior. The list of covariates used for fixed effects is included in Table 4; we use task ID and subject ID for the random effects. A mixed effects model has the form

    y = Xβ + Zu + ε,    (2)

where y is the outcome variable, X is a matrix of the covariates of interest (the fixed effects), β is the vector of coefficients for the covariates, Z is the random effects design matrix, u is the vector of coefficients for the random effects, and ε is additive noise. The random effects u are Gaussian random variables that, in our case, model variations across participants and stimuli. As per standard practice, we include random intercepts for both random effects Barr (2013). Continuous covariates were normalized to have zero mean and unit variance. We jointly fit the models across all 4 datasets using the lme4 R package Bates et al. (2014).
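
Concretely, the two regressions can be fit with lme4 along the following lines; the data frame `df` and its column names are hypothetical stand-ins for the covariates in Table 4, and the second model is restricted to activated responses as described below.

    library(lme4)

    # Stage 1: logistic mixed effects model for activation (binary outcome), with
    # random intercepts for subject and task. Column names in `df` are hypothetical.
    activation_fit <- glmer(
      activated ~ ai_advice + prior + resp1_conf + advice_conf + consistent +
        age + sex + education + ses + programming +
        (1 | subject_id) + (1 | task_id),
      data = df, family = binomial)

    # Stage 2: linear mixed effects model for the change in response, fit only on
    # activated responses.
    change_fit <- lmer(
      response_change ~ ai_advice + prior + resp1_conf + advice_conf + consistent +
        age + sex + education + ses + programming +
        (1 | subject_id) + (1 | task_id),
      data = subset(df, activated))

    summary(activation_fit)
    summary(change_fit)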

In the first stage of our two-stage model, we use activation on a task as our outcome variable. As this is a binary variable, we use a logistic regression mixed effects model. We list the fitted coefficients in the “Activation Coefficients” column of Table 4. There are a few key takeaways. The first is that the prior has a significant effect on whether an individual is activated. The sign of the prior coefficient indicates that participants who received advice from the source they believe is better for the task are more likely to be activated. Other important coefficients include Response 1 Confidence, whose negative value suggests that unsure participants are more inclined to be activated, and Advice Confidence, whose positive value suggests that participants are more likely to activate when presented with more confident advice. While one might think Response 1 and Advice Consistent? should have a positive coefficient, it is negative, which suggests that when a participant’s initial response agrees with the advice, they are less likely to activate. The reason it is negative is that the response scale has a fixed width: participants who are already very sure about their response cannot become more sure when the advice agrees with them.

To model the second stage of our two-stage model, we use the change in response after advice (sign-aligned so that a positive change corresponds to greater certainty in the original label prediction) as our outcome variable. This outcome is a continuous variable, so we use a linear regression mixed effects model. We list the learned coefficients in the “Response Change Coefficients” column of Table 4. There are three key takeaways. Response 1 Confidence has a large negative coefficient, suggesting, similar to the activation model, that participants who are sure about their response do not change it as much. Response 1 and Advice Consistent? has a large positive coefficient here: agreement between the participant’s initial response and the advice encourages larger changes. Lastly, note that the prior is not associated with the response change after advice. This suggests that the primary effect of the prior is determining whether a person activates.

3.4 Replication In Other Countries

In all of the experiments above, we restricted study participants to be located in the US. As perception of AI may be tied to geographic location, it is also important to ask whether our results generalize to regions outside the US. To answer this question, we recruit approximately 200 participants from the UK and 200 participants from Asia and run experiments with our art and sarcasm datasets. The studies are run identically to the US versions, with the exception that we now recruit participants exclusively from the corresponding geographic region. We include participant demographics and performance statistics in the Appendix.

The main finding is that our two-stage model is largely validated in the UK and in Asia. In the activation model, the prior is associated with activation in the UK. The UK prior coefficient (0.217) is similar to the US coefficient (0.244), suggesting the two geographical groups are similar. While we do not have sufficient power to say the prior is associated with activation in Asia, its positive value (0.101) is in line with the UK and US prior coefficients. Response 1 Confidence (US: -0.584, UK: -0.491, Asia: -0.355), Advice Confidence (US: 0.218, UK: 0.146, Asia: 0.143), and Response 1 and Advice Consistent? (US: -1.555, UK: -1.157, Asia: -1.492) are associated with activation and are similar in sign and magnitude across all three regions. Sex, education, age, and programming experience are not associated with activation probability in any of the three regions.

In the response change model, Response 1 Confidence (US: -0.212, UK: -0.199, Asia: -0.226) and Response 1 and Advice Consistent? (US: 0.661, UK: 0.589, Asia: 0.674) are associated with response change and have similar sign and magnitude across all three regions. Neither the treatment nor the prior is associated with response change in any of the three regions, validating our earlier conclusion that the prior has a strong effect in determining activation, but a weak effect on the utilization of advice after activation.

4 Discussion and Future Work

This work investigates empirically how individuals incorporate external advice depending on whether the advice comes from an AI or from other humans. This is an important topic for the broader impact of AI and has been under-explored. We propose a two-stage model for advice utilization to explain our empirical findings. In the first stage, a person decides whether to use the advice. In the second stage, they decide how to incorporate the advice. Our main finding is that there is a significant difference in whether AI versus human advice is used, but not in how it is used. This finding holds over a variety of settings with 3 different data modalities and across participants from the US, UK, and Asia.

The work in this paper is a first step toward better understanding how humans incorporate advice from AI. Several exciting research directions lie ahead. For example, our work focuses on common tasks (e.g., identifying paintings or locations) that require no special expertise. This is a good starting point given that there have not been similar experiments before. An interesting question is to what extent our two-stage model applies to settings where human experts incorporate external advice (e.g., a doctor receives a suggested medical diagnosis from an AI or from another doctor). Another important question is how these results generalize across different types of advice. Here we focus on the most direct type of external feedback (i.e., a prediction with a confidence score), because this is the most common type of output in machine learning. It would be interesting to explore more complex advice, e.g., where the AI provides an explanation together with its prediction.

Ethical Considerations

We collect data from human subjects in our work through survey questions, where we asked crowdsourced workers to complete simple tasks (image classification, sarcasm identification in text, and income level prediction from tabular data). We do not collect or release any personally identifiable data associated with the workers. Given the low-risk nature of this data, IRB approval was not needed. More information on the specifics of the data we collected, including compensation and survey design, is provided in the Appendix.

This work and its potential extensions also raise some ethical questions. If it is possible to manipulate human utilization of AI advice, either through the presentation of the algorithm or through the form of the advice given, there is a risk that miscalibrated beliefs about the AI could harm performance. Such manipulation need not be intentional to be harmful. So while understanding what factors influence human perception and utilization of AI advice is critical to designing systems where humans and AI work jointly, it must be done with care to ensure an AI agent is not over-trusted.

References

  • D. J. Barr (2013) Random effects structure for testing interactions in linear mixed-effects models. Frontiers in psychology 4, pp. 328. Cited by: §3.3.
  • D. Bates, M. Mächler, B. Bolker, and S. Walker (2014) Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823. Cited by: §3.3.
  • J. R. De Leeuw (2015) JsPsych: a javascript library for creating behavioral experiments in a web browser. Behavior research methods 47 (1), pp. 1–12. Cited by: §C.4.
  • M. T. Dzindolet, L. G. Pierce, H. P. Beck, and L. A. Dawe (2002) The perceived utility of human and automated aids in a visual detection task. Human Factors 44 (1), pp. 79–94. Cited by: §1.
  • R. C. Feldman, E. Aldana, and K. Stein (2019) Artificial intelligence in the health care space: how we can trust what we cannot know. Stan. L. & Pol’y Rev. 30, pp. 399. Cited by: §1.
  • S. Gaube, H. Suresh, M. Raue, A. Merritt, S. J. Berkowitz, E. Lermer, J. F. Coughlin, J. V. Guttag, E. Colak, and M. Ghassemi (2021) Do as ai say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 4 (1), pp. 1–8. Cited by: §1.
  • F. Gino and D. A. Moore (2007) Effects of task difficulty on use of advice. Journal of Behavioral Decision Making 20 (1), pp. 21–35. Cited by: §1.
  • N. Harvey and I. Fischer (1997) Taking advice: accepting help, improving judgment, and sharing responsibility. Organizational behavior and human decision processes 70 (2), pp. 117–133. Cited by: §3.2.
  • N. Harvey, C. Harries, and I. Fischer (2000) Using advice and assessing its quality. Organizational behavior and human decision processes 81 (2), pp. 252–273. Cited by: §3.2.
  • M. Khodak, N. Saunshi, and K. Vodrahalli (2017) A large self-annotated corpus for sarcasm. arXiv preprint arXiv:1704.05579. Cited by: §2.
  • P. Madhavan and D. A. Wiegmann (2007) Effects of information source, pedigree, and reliability on operator interaction with decision support systems. Human Factors 49 (5), pp. 773–785. Cited by: §1.
  • N. Mesbah, C. Tauchert, and P. Buxmann (2021) Whose advice counts more–man or machine? an experimental investigation of ai-based advice utilization. In Proceedings of the 54th Hawaii International Conference on System Sciences, pp. 4083. Cited by: §1.
  • T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial intelligence 267, pp. 1–38. Cited by: §1.
  • D. Önkal, P. Goodwin, M. Thomson, S. Gönül, and A. Pollock (2009) The relative influence of advice from human experts and statistical methods on forecast adjustments. Journal of Behavioral Decision Making 22 (4), pp. 390–409. Cited by: §1.
  • A. Prahl and L. Van Swol (2017) Understanding algorithm aversion: when is advice from automation discounted?. Journal of Forecasting 36 (6), pp. 691–702. Cited by: §1, §1.
  • [16] Prolific. Note: https://www.prolific.co Cited by: Figure 6, item 1:, §C.1, §2.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1.
  • S. Sah, D. A. Moore, and R. J. MacCoun (2013) Cheap talk and credibility: the consequences of confidence and accuracy on advisor credibility and persuasiveness. Organizational Behavior and Human Decision Processes 121 (2), pp. 246–255. Cited by: §1.
  • T. Schultze, A. Rakotoarisoa, and S. Schulz-Hardt (2015) Effects of distance between initial estimates and advice on advice utilization. Judgment & Decision Making 10 (2). Cited by: §1.
  • C. Tauchert, N. Mesbah, et al. (2019) Following the robot? investigating users’ utilization of advice from robo-advisors.. In ICIS, Cited by: §1.
  • L. M. Van Swol, J. E. Paik, and A. Prahl (2018) Advice recipients: the psychology of advice utilization. Cited by: Outline, §C.4, §1, §1.
  • A. West and A. Praturu (2019) Enhancing the census income prediction dataset. Note: https://people.ischool.berkeley.edu/~alexwest/w210_census_income_html/ Accessed: 2021-05-15. Cited by: §2.
  • Y. Xie, M. Chen, D. Kao, G. Gao, and X. Chen (2020) CheXplain: enabling physicians to explore and understand data-driven, ai-enabled medical imaging analysis. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–13. Cited by: §1.
  • I. Yaniv and D. P. Foster (1997) Precision and accuracy of judgmental estimation. Journal of behavioral decision making 10 (1), pp. 21–32. Cited by: §3.2.

Appendix A Distribution of prior

We show the distributions of prior belief across our four datasets in Figure 5. For the art, cities, and census datasets, the prior belief favors the AI advice; for the sarcasm dataset, the prior belief favors the human advice. Notice that for all datasets there is a high variance, and there are always participants who strongly believe either the AI or human advice is better regardless of the average belief.

Figure 5: Histogram of the prior belief across each of our 4 datasets (for US participants): (a) Art, (b) Cities, (c) Sarcasm, (d) Census. The value of prior belief shown here is taken directly from the survey question (see Section C.2) and normalized; a negative value indicates human favored and a positive value indicates AI favored. The histogram includes both participants who received human advice and participants who received AI advice. The dotted vertical line is the average prior across all participants.

Appendix B UK and Asia validation experiments

b.1 Dataset Overview

We show an overview of the datasets used for experiments in the UK and Asia in Table 5. Comparing with Table 1, note that the baseline accuracies in the UK are similar to the US, while the baseline accuracies in Asia are lower on both datasets. The prior belief across both datasets in Asia and the UK is in the same direction as in the US (AI or human favored). In the UK, the prior belief is stronger (e.g., a higher percentage of people favor AI for the art dataset and human for the sarcasm dataset); in Asia, it is only stronger for the art dataset.

Dataset Data Type Description # of Tasks Baseline accuracy Prior belief
Art Dataset (UK) Image identify the art period 32 65.5% +/- 16.6% 74.4% (AI)
Sarcasm Dataset (UK) Text identify sarcasm 32 70.6% +/- 20.8% 83.0% (human)
Art Dataset (Asia) Image identify the art period 32 62.2% +/- 17.0% 76.3% (AI)
Sarcasm Dataset (Asia) Text identify sarcasm 32 62.1% +/- 21.9% 78.2% (human)
Table 5: Summary of the datasets we use in our experiments in the UK and Asia. Baseline accuracy contains the average and standard deviation at a task level, prior to receiving advice. The prior belief comes from a survey question asking whether the participant thinks a human or AI would do better on the given task (see Section C.2). The reported value indicates what percent of participants favored an AI or a human for the given task; the label in parentheses indicates which type of advice was more favored.

b.2 Demographics

Demographic information for the UK and Asia experiments is shown in Table 6. All values are similar to the US datasets, with two slight differences: (1) the average age of participants on the Sarcasm dataset in the UK is close to 10 years higher, and (2) the socioeconomic score and education level in Asia are both about 0.5 higher than in the US data.

Participants from Asia came from a diverse set of countries. At least 10 participants across both art and sarcasm datasets are Indian, Chinese, Vietnamese, Filipino, Pakistani, Indonesian, or Korean.

Dataset # of Participants Sex (Percent Female) Age Socioeconomic Score Education Level
Art Dataset (UK) 86 54.7% 32.2 +/- 12.3 5.2 +/- 1.6 5.1 +/- 1.4
Sarcasm Dataset (UK) 88 44.3% 41.4 +/- 15.4 4.9 +/- 1.8 5.1 +/- 1.6
Art Dataset (Asia) 97 43.3% 27.4 +/- 6.8 5.8 +/- 1.6 5.9 +/- 1.1
Sarcasm Dataset (Asia) 101 51.5% 29.0 +/- 7.3 5.8 +/- 1.4 6.1 +/- 1.3
Table 6: Summary of participant demographics for the UK and Asia experiments. See Section C.1 for information on the socioeconomic score and education level.

b.3 Two stage model

Here we include the coefficient values referenced in Section 3.4. In Table 7, we show the coefficients for the activation model trained using US data, UK data, and Asia data. As was noted earlier, the magnitude and sign of the significant coefficients are largely the same across all three regions. The single exception is that the prior effect in Asia is smaller, and consequently we do not have enough data to conclude significance.

In Table 8, we show the coefficients for the response change model across all three regions. Response 1 Confidence and Response 1 and Advice Consistent? have identical sign and similar magnitude across all three regions. The prior effect is not significant in any of the regions, supporting our two-stage model.

Covariates Coefficients (US) Coefficients (UK) Coefficients (Asia)
Given AI Advice? 0.162 +/- 0.134 -0.253 +/- 0.218 0.389 +/- 0.207
Prior Belief 0.244 +/- 0.059 0.217 +/- 0.107 0.101 +/- 0.100
Response 1 Confidence -0.584 +/- 0.025 -0.491 +/- 0.037 -0.355 +/- 0.038
Advice Confidence 0.218 +/- 0.024 0.146 +/- 0.036 0.143 +/- 0.033
Response 1 and Advice Consistent? -1.555 +/- 0.051 -1.157 +/- 0.076 -1.492 +/- 0.073
Age -0.089 +/- 0.072 -0.074 +/- 0.084 0.110 +/- 0.192
Sex -0.046 +/- 0.137 0.201 +/- 0.229 0.157 +/- 0.224
Education 0.043 +/- 0.076 0.027 +/- 0.106 -0.118 +/- 0.133
Socioeconomic Status 0.139 +/- 0.070 0.013 +/- 0.109 -0.010 +/- 0.109
Has Programming Experience? 0.233 +/- 0.147 -0.209 +/- 0.256 0.031 +/- 0.217
Table 7: Coefficients for mixed effects logistic regression model for activation outcome. Comparison across different geographical regions.
Covariates Coefficients (US) Coefficients (UK) Coefficients (Asia)
Given AI Advice? -0.023 +/- 0.016 0.008 +/- 0.024 -0.000 +/- 0.027
Prior Belief 0.001 +/- 0.008 -0.023 +/- 0.012 0.005 +/- 0.013
Response 1 Confidence -0.212 +/- 0.004 -0.199 +/- 0.007 -0.226 +/- 0.007
Advice Confidence 0.025 +/- 0.004 0.033 +/- 0.007 0.016 +/- 0.006
Response 1 and Advice Consistent? 0.661 +/- 0.008 0.589 +/- 0.014 0.674 +/- 0.011
Age -0.026 +/- 0.008 -0.013 +/- 0.009 -0.014 +/- 0.025
Sex 0.010 +/- 0.016 0.027 +/- 0.025 -0.031 +/- 0.029
Education 0.021 +/- 0.009 -0.016 +/- 0.012 0.015 +/- 0.017
Socioeconomic Status 0.007 +/- 0.008 -0.020 +/- 0.012 -0.028 +/- 0.014
Has Programming Experience? 0.014 +/- 0.017 0.044 +/- 0.028 -0.007 +/- 0.028
Table 8: Coefficients for mixed effects linear regression model for response change of activated population. Comparison across different geographical regions.

Appendix C Survey Details

c.1 Demographic Information Collection

We conducted our experiments using Prolific [16]. Prolific provides access to various demographic information that we used in our study, including age, sex, education level, socioeconomic status, and whether the participant has programming experience. Education level is defined on a scale from 1 to 8. The interpretation of each education level is given below:

  1. Don’t know / not applicable

  2. No formal qualifications

  3. Secondary education (e.g. GED/GCSE)

  4. High school diploma/A-levels

  5. Technical/community college

  6. Undergraduate degree (BA/BSc/other)

  7. Graduate degree (MA/MSc/MPhil/other)

  8. Doctorate degree (PhD/other)

Socioeconomic status is defined on a scale from 1 to 10. The question participants are asked to determine their socioeconomic status is included in Figure 6.

Figure 6: Question participants are asked by Prolific [16] to obtain the participant’s socioeconomic status.

c.2 Survey Question About Prior

The prior belief value reported in Table 1 was obtained using a survey question asked to every participant after they had finished all the tasks. The survey question is shown below. The wording was slightly modified for those who received AI advice or human advice to account for the advice they received. The ordering of AI and human in the question was randomized for each participant.

Human advice:

“Do you think an artificial intelligence (AI) algorithm or the average person (without help) can do better on this task?”

AI advice:

“Do you think the AI or the average person (without help) can do better on this task?”

c.3 Participant Compensation

Participants were compensated at an hourly rate following Prolific’s recommended rates. The survey was estimated to take 10 minutes based on several trial runs by the authors, and participants were compensated under this assumption.

Participants were also informed that they could receive up to a 30% bonus. This bonus was calculated as a function of the participant’s average performance across all tasks, both before and after receiving advice. Performance on a single task is computed from the participant’s response on the continuous scale; note that this performance metric penalizes incorrect responses.

The total cost of running all of our experiments (including the participants we used to calibrate the advice) was around .

c.4 Survey Screenshots

We designed our study using standard methods in the psychology of advice utilization literature Van Swol et al. (2018). The study was implemented for web deployment using jsPsych De Leeuw (2015) and a simple Python-based web server. In the “survey-art-dataset” directory of the Supplementary material, we include a set of PDF documents showing screenshots of our survey for the Art dataset for a participant who received human advice. Each “page” in the directory corresponds to a separate web page. Filenames are numbered in the order a participant encounters them. Clicking “continue” / “submit” brings the participant to the next web page.

A brief description of the survey screenshots follows. Any content referring to the art data or human advice was substituted with appropriate content for each dataset and AI advice respectively.

  1. Participant enters a unique ID assigned to them through Prolific 16, the crowdsource platform we use to recruit participants.

  2. Instructions specific to the task and advice source participants will receive. Note that this page is seen by participants as a single, continuous web page.

  3. Additional information on the advice source. (For participants receiving AI advice, the text is changed accordingly.)

  4. Information on bonus payment.

  5. Manipulation check. Participants who got the wrong answer were sent back to Page 2.

  6. Example task: recording Response 1. The “initial response” block in Figure 1 shows the bottom half of this slide (for the Cities dataset). Figure 2 shows example images of this slide for each of our four datasets.

  7. Example task: recording Response 2. The “Human advice” block in Figure 1 shows the bottom half of this slide (for the Cities dataset).

  8. After completing all tasks, participants are shown this screen.

  9. Additional survey questions. The response on slide 12 is used for the belief prior.

  10. Check for errors in survey. Primarily used when developing the experiment to ensure the survey was bug-free.

  11. Debrief slide.

  12. Bonus payment and survey submission slide.
