Conceptual Metaphors Impact Perceptions of Human-AI Collaboration

08/05/2020 ∙ by Pranav Khadpe, et al. ∙ Stanford University

With the emergence of conversational artificial intelligence (AI) agents, it is important to understand the mechanisms that influence users' experiences of these agents. We study a common tool in the designer's toolkit: conceptual metaphors. Metaphors can present an agent as akin to a wry teenager, a toddler, or an experienced butler. How might a choice of metaphor influence our experience of the AI agent? Sampling metaphors along the dimensions of warmth and competence—defined by psychological theories as the primary axes of variation for human social perception—we perform a study (N=260) where we manipulate the metaphor, but not the behavior, of a Wizard-of-Oz conversational agent. Following the experience, participants are surveyed about their intention to use the agent, their desire to cooperate with the agent, and the agent's usability. Contrary to the current tendency of designers to use high competence metaphors to describe AI products, we find that metaphors that signal low competence lead to better evaluations of the agent than metaphors that signal high competence. This effect persists despite both high and low competence agents featuring human-level performance and the wizards being blind to condition. A second study confirms that intention to adopt decreases rapidly as competence projected by the metaphor increases. In a third study, we assess effects of metaphor choices on potential users' desire to try out the system and find that users are drawn to systems that project higher competence and warmth. These results suggest that projecting competence may help attract new users, but those users may discard the agent unless it can quickly correct with a lower competence metaphor. We close with a retrospective analysis that finds similar patterns between metaphors and user attitudes towards past conversational agents such as Xiaoice, Replika, Woebot, Mitsuku, and Tay.




1. Introduction

Collaboration between people and conversational artificial intelligence (AI) agents—AI systems that communicate through natural language (Grudin and Jacques, 2019)—is now prevalent. As a result, there is increasing interest in designing these agents and studying how users interact with them (Abokhodair et al., 2015; Cranshaw et al., 2017; Ferrara et al., 2016; Grudin and Jacques, 2019; Luger and Sellen, 2016; Shamekhi et al., 2018). While the technical underpinnings of these systems continue to improve, we still lack fundamental understanding of the mechanisms that influence our experience of them. What mechanisms cause some conversational AI agents to succeed at their goals, while others are discarded? Why would Xiaoice (Shum et al., 2018) amass millions of monthly users, while the same techniques powering Tay (Hunt, 2016) led to the agent being discontinued for eliciting anti-social troll interactions? Many AI agents have received polarized receptions despite offering very similar functionality: for example, Woebot (Nutt, 2017) and Replika (Kuyda and Dudchuk, 2015) continue to evoke positive user behavior, while Mitsuku (Worswick, 2015) is often subjected to dehumanization. Even with millions of similar AI systems available online (Chang and Kannan, 2018; Lee, 2018), only a handful are not abandoned (Grudin and Jacques, 2019; Zamora, 2017). The emergence of social robots and human-AI collaborations has driven home a need to understand the mechanisms that inform users’ evaluations of such systems.

In HCI, experiences of a system are typically understood as being mediated by a person’s mental model of that system (Norman, 1988). Conveying an effective understanding of the system’s behavior can enable users to build mental models that increase their desire to cooperate with the system (Kocielnik et al., 2019; Bansal et al., 2019; Cassell et al., 2000; Jakesch et al., 2019). However, a mental model explanation is insufficient to answer the present question: in the case of Xiaoice and Tay, both agents were based on the same underlying technology from Microsoft, but they resulted in very different reactions by users. Likewise, other agents such as Replika and Mitsuku elicit very different evaluations while existing even within the same cultural context. While theories of mental models and culture each help us understand how users experience conversational AI agents, we require additional theoretical scaffolding to understand the phenomenon.

Figure 1. We explore how the metaphors used to describe an AI agent, by influencing pre-use expectations, have a downstream impact on evaluations of those AI agents.

An important and unexamined difference between these otherwise similar agents is the metaphors that they project. Conceptual metaphors are short descriptions attached to a system that are suggestive of its functionality and intentions (McGlone, 1996; Crawford, 2009). For instance, Microsoft described Tay as an “AI that’s got no chill” (Sandvig, 2015), while it markets Xiaoice as an “empathetic ear”: two very different metaphors. Metaphors are a central mechanism in the designer’s toolkit. Unlike mental models, they offer more than just functional understandings of the system; they shape users’ expectations of the system. And while most existing expectation-shaping mechanisms depend on the functionality of the specific AI system or task (Kocielnik et al., 2019), metaphors are agnostic to the specifics of a system and can be used to shape expectations for nearly any AI system. Prior theory suggests that pre-use expectations of AI systems influence both initial behaviors (Hartmann et al., 2008b; Klaaren et al., 1994; Wilson et al., 1989) and long-term behaviors (Kujala et al., 2017), even when the system itself is held constant and only user expectations vary (Raita and Oulasvirta, 2011).

We propose that these metaphors are a powerful mechanism to shape expectations and mediate experiences of AI systems. If, for example, the metaphor primes people to expect an AI that is highly competent and capable of understanding complex commands, they will evaluate the same interaction with the system differently than if users expect their AI to be less competent and only comprehend simple commands (Figure 1). Similarly, if users expect a warm, welcoming experience, they will evaluate an AI agent differently than if they expect a colder, professional experience — even if the interaction with the agent is identical in both cases.

In this paper, we test the effect of metaphors on evaluations of AI agents. We draw on the Stereotype Content Model (SCM) from psychology (Fiske et al., 2018; Cuddy et al., 2008), which demonstrates that the two dimensions of warmth and competence are the principal axes of human social perception. Judgements along these dimensions provoke systematic cognitive, emotional, and behavioral reactions (Cuddy et al., 2008). The SCM suggests that user expectations, and therefore evaluations, are mediated by judgements of warmth and competence. We crowdsource the labeling of a set of metaphors along these axes to identify metaphors that appear in different quadrants of the SCM: for example, a toddler, who is high warmth and low competence, and a shrewd executive, who is low warmth and high competence.

We perform an experiment (N=260) that manipulates the metaphor associated with an AI agent and measures how it invokes expectations of competence and warmth and how those two dimensions affect ratings of usability, intention to adopt, and desire to cooperate. We draw on an established method from prior experiments (Ho et al., 2018; Wallis and Norling, 2005; Thies et al., 2017; Chaves and Gerosa, 2019) to instantiate the agent itself as a remote Wizard-of-Oz who is blind to the condition and randomized across conditions for each participant. Participants are first exposed to the agent’s metaphor, then converse with the agent to complete a travel planning task (Asri et al., 2017).

Our results suggest that, contrary to how designers typically describe their AI agents, low competence metaphors lead to increases in perceived usability, intention to adopt, and desire to cooperate relative to high competence metaphors. These results persist despite both the low competence and high competence agents operating at full human-level performance via a wizard, suggesting that no matter how competent the agent actually is, people will view it negatively if it projects a high level of competence. Participants perceive the wizards to possess lower competence than the expectations implied by high competence metaphors. These results align with Contrast Theory (Sherif et al., 1958), which states that users’ evaluations are defined by the difference between their experiences and expectations. Finally, we find that the warmth axis operates conversely to competence: users viewed the AI with higher warmth more positively, interacted with it longer, and were more willing to cooperate with it. This result aligns with Assimilation Theory (Sherif et al., 1958): users recolor warmth experiences in light of their initial expectation.

Previous work has sought explanations for user behavior and evaluations of AI by profiling users (DeChurch and Mesmer-Magnus, 2010; Bansal et al., 2019) or by making the AI more interpretable (Caruana et al., 2015; Rudin, 2018; Lage et al., 2018). However, these approaches fail to explain why otherwise functionally similar systems elicited vastly different user responses. Our analysis suggests that designers should carefully analyze the effects of metaphors that they associate with the AI systems they create, especially whether they are communicating expectations of high competence. In the discussion, we consider implications for design by retrospectively analyzing the metaphors used to describe existing and past AI agents, such as Xiaoice, Tay, and Mitsuku, and show that our results are consistent with the adoption of and user cooperation with these products. The connection between our conclusions and the outcomes experienced by Xiaoice and Tay cannot explain the whole story; however, the pattern is striking and motivates the need for exploration of mechanisms to shape expectations and elicit prosocial user behavior.

We begin by laying out related work, deriving our research question and hypotheses from prior theories. We then describe our procedure for sampling metaphors. In Study 1, we study the effects of metaphor warmth and competence. In Study 2, we sample additional metaphors along the competence axis in order to understand the effects of competence at a more fine-grained level. In Study 3, we test the negative effects of portraying a low competence metaphor by studying the effect that warmth and competence have on participants’ interest in using the system in the first place. Finally, we discuss the implications of our findings for the choice of metaphors when designers deal with the dual objective of attracting more users and ensuring a positive user experience.

2. Related Work

Pre-use expectations play a critical role in users’ initial usage of a system or design (Hartmann et al., 2008a; Klaaren et al., 1994; Wilson et al., 1989). Setting positive or negative expectations colors users’ evaluation of what would otherwise be identical experiences (Raita and Oulasvirta, 2011). These pre-use expectations can affect evaluations even after weeks of interaction with a service (Kujala et al., 2017).

In the case of AI systems, which are often data-driven and probabilistic, there exists no simple method of setting user expectations. Providing users with performance metrics does not establish an accurate expectation for how the system behaves (Kocielnik et al., 2019). In the absence of effective mental models of AI systems, users instead develop folk theories — intuitive, informal theories — as expansive guiding beliefs about the system and its goals (Gelman and Legare, 2011; Sease, 2008; Lakoff and Johnson, 2008; French and Hancock, 2017).

Prior work has shown how subjective evaluations of interface agents are strongly influenced by the face, voice, and other design aspects of the agent (Xiao et al., 2004; Nass and Brave, 2005), beyond just the actual capabilities of the agent. These results motivate our study of how metaphors set expectations that affect how users view and interact with conversational AI systems. Inaccurate expectations can be consequential. Previously, interviews have established that expectations of conversational agents such as Siri, Google Assistant, and Alexa are out of sync with the actual capabilities and performance of the systems (Luger and Sellen, 2016; Zamora, 2017). So, after repeatedly hitting the agent’s capability limits, users retreat to using the agents only for menial, low-level tasks (Luger and Sellen, 2016). While these prior interview-based studies have demonstrated that a mismatch between user expectations and system operation is detrimental to user experiences (Luger and Sellen, 2016), they have not been able to establish causality or quantify the magnitude of this effect. This gap motivates our inquiry into understanding mechanisms that might shape these expectations and measuring the effect of expectations on user experiences and attitudes. We are guided by the following research question:

Research Question: How do metaphors impact evaluations of interactions with conversational AI systems?

2.1. Metaphors shape expectations

Conceptual metaphors are one of the most common and powerful means that a designer has to influence user expectations. We refer to a conceptual metaphor (or user interface metaphor, or just metaphor) as the understanding and expression of complex or abstract ideas using simple terms (Lakoff and Johnson, 2008). Metaphors are attached to all types of AI systems, both by designers to communicate aspects of the system and by users to express their understanding of the system. For instance, Google describes its search algorithm as a “robotic nose” (French and Hancock, 2017) and YouTube users think of the recommendation algorithm as a “drug dealer” (Wu et al., 2019). Starting with the desktop metaphor for personal computing in the Xerox Star (Kimball and Harslem, 1982), conceptual metaphors proliferated through the design of user interfaces — trash cans for deleted files, notepads for freetext notes, analog shutter clicking sounds for mobile phone cameras, and more.

Some AI agents utilize metaphors based in personas or human roles, for example an administrative assistant, a teenager, a friend, or a psychotherapist, and some are metaphors grounded in other contexts, for example a Jetsons-style humanoid servant robot. Such metaphors are meant to help human-AI collaboration in complex domains by aiding users’ ability to understand and predict the agent’s behavior (DeChurch and Mesmer-Magnus, 2010; Bansal et al., 2019). Metaphors include system descriptions outside of those rooted in human roles as well: Google describing its search algorithm as a “robotic nose” (French and Hancock, 2017) and Microsoft’s Zo marketed as a bot that “Will make you LOL”. The notion of “metaphors” extends beyond conversational AI to non-anthropomorphic systems that “personas” or “roles” may be ill-equipped to describe. Metaphors are effective: they influence a person’s folk theories of an AI system even before they use it (DeVito et al., 2018). Prior work has developed methods to extract conceptual metaphors (Sease, 2008; Lakoff and Johnson, 2008) for how people understand AI systems and aggregate them into underlying folk theories (French and Hancock, 2017).

Metaphors impact expectations, sometimes implicitly by activating different norms, biases, and expectations. For example, social robots that are racialized as Black or Asian are more likely to be subject to antisocial behaviour such as aggression and objectification (Strait et al., 2018b). Similarly, female-gendered robots can elicit higher levels of dehumanisation than male-gendered bots. Antisocial behavior leads to verbal disinhibition toward AI systems (Strait et al., 2018a), and in some extreme cases, to physical abuse and even dismemberment (Salvini et al., 2010; Brscić et al., 2015). Female voice agents are viewed as friendlier but less intelligent (Nass and Brave, 2005). Users also have a higher tendency to disclose information to female gendered agents (Nass and Brave, 2005). Race and gender of pedagogical agents affect learning outcomes—agents racialized as Black or female-gendered lead to improved attention and learning (Baylor and Kim, 2004). Beyond race and gender, agents portrayed as less intelligent, taking on roles such as “motivator” or “mentor”, promote more self-efficacy than agents projected as “experts” (Baylor and Kim, 2004). Young, urban users respond positively to bots that can add value to their life by suggesting recommendations, while in the role of a “friend” (Thies et al., 2017).

However, designers typically aim to use metaphors to affect expectations in more explicit, controlled, and pro-social ways. Most obviously, a metaphor communicates expectations of what can and cannot be done with an AI agent (Kimball and Harslem, 1982). Just as we expect an administrative assistant to know our calendar but not to know the recipe for the best stoat sandwiches, an AI agent that communicates a metaphor as an “administrative assistant” projects the same skills and boundaries. In a similar vein, describing an agent as a “toddler” suggests that the agent can interact in natural language and understand some, but not all, of our communication.

While other expectation-shaping mechanisms for AI agents such as tutorials and instructions have been studied (Kocielnik et al., 2019), the effect of metaphors on user expectations and evaluations has not. Our work also bridges to research suggesting that people already form metaphor-based theories of socio-technical systems (French and Hancock, 2017) and suggests design implications for how designers should choose their metaphors.

2.2. Competing predictions: assimilation vs. contrast

As people view AI agents as social agents (Reeves and Nass, 1996), the metaphor—and thus the nature of that agent—is likely to influence their experience. However, the literature presents two competing theories for how changes to the metaphor — and thus to expectations — will impact user evaluation of an AI system. Assimilation theory (Sherif et al., 1958) states that people adapt their perceptions to match their expectations, and thus adjust their evaluations to be positively correlated with their initial expectations. (As Dumbledore points out to Snape in Harry Potter and the Deathly Hallows, “You see what you expect to see, Severus.”) Assimilation theory argues that users don’t perceive a difference between their pre-use expectations and actual experiences. Prior work supports that, for interactive systems, users’ expectations do influence evaluations (Hartmann et al., 2008b; Van Schaik and Ling, 2008). For example, users rate an interactive system higher when they are shown a positive review of that system before using it, and rate the system lower if they are shown a negative review before using it (Raita and Oulasvirta, 2011). Likewise, humor and other human-like characteristics that create high social intelligence expectations can be crucial in producing positive evaluations (Liao et al., 2018; Jain et al., 2018).

Assimilation theory would predict that a metaphor signaling high competence will set positive expectations and subsequently lead to positive evaluation:

Hypothesis 1 (H1).

Positive metaphors (e.g., high competence, high warmth) will lead to higher average intention to adopt and desire to cooperate with an AI agent than if it had no metaphor or negative metaphors.

Contrast theory (Sherif et al., 1958), on the other hand, attributes user evaluations to the difference they perceive between expectations and actual experience. Contrast theory argues that we are attuned not to absolute experiences, but to differences between our expectations and our experiences. For example, exceeding expectations results in high satisfaction, whereas falling short of expectations results in lower satisfaction. This suggests that it is beneficial to set users’ initial expectations to be low (with practitioners reasoning in the manner of George Weasley, in Harry Potter and the Order of the Phoenix, “‘E’ is for ‘Exceeds Expectations’ and I’ve always thought Fred and I should’ve got ‘E’ in everything, because we exceeded expectations just by turning up for the exams.”) Users of conversational AI agents such as Alexa stumble onto humorous easter egg commands that raise their expectations of what the system can do, but then report disappointment when they discover the system’s actual limits (Luger and Sellen, 2016). Likewise, ratings of interactive games are driven in part by contrasting players’ experiences against their expectations of the game (Michalco et al., 2015).

Contrast theory predicts that positive metaphors will backfire because AI agents inevitably make mistakes and have limits:

Hypothesis 2 (H2).

Positive metaphors (e.g., high competence, high warmth) will lead to lower average intention to adopt and desire to cooperate with an AI agent than if it had no metaphor or negative metaphors.
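
One way to make the two competing predictions concrete is sketched below. The notation is illustrative and not taken from the paper: $X$ denotes the pre-use expectation set by the metaphor, $P$ the experience the agent actually delivers, and $E$ the post-use evaluation. The functions $f$ and $g$ are hypothetical, and $g$ is assumed to be differentiable and strictly increasing.

```latex
% Illustrative notation only; f and g are assumed, not defined in the source.
\begin{align*}
\text{Assimilation (H1):} \quad & E = f(P, X), & \frac{\partial E}{\partial X} &> 0 \\
\text{Contrast (H2):} \quad & E = g(P - X), & \frac{\partial E}{\partial X} &= -g'(P - X) < 0
\end{align*}
```

Under this reading, an experiment that holds $P$ fixed and varies only $X$ can separate the two hypotheses by the sign of the effect of $X$ on $E$.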

3. Methods

Our research aim is to study the effect of metaphors on experiences with AI agents. So, we seek an experimental setup where participants accomplish a task in collaboration with an AI system, while avoiding effects introduced by idiosyncrasies of any particular AI system. We situate our method in goal-oriented conversational agents (or task-focused bots) as these systems represent a broad class of agents in research and product (Nutt, 2017; Shum et al., 2018; Worswick, 2015; Hunt, 2016; Woollaston, 2016; Damani et al., 2018).

3.1. Collaborative AI task

Goal-oriented AI systems, such as those for booking flights, hotel rooms, or navigating customer service requests, have become pervasive on social media platforms including Kik, Slack, and Facebook Messenger, with as many as one million flooding the Web between and (Grudin and Jacques, 2019). Surveys revealed that as of 2018, as many as 60% of surveyed millennials had used a chatbot (Arnold, 2018) and 15% of surveyed internet users had used customer-service chatbots (Drift, 2018). Their prevalence means that interaction with such an agent is an ecologically valid task, and that many users online are familiar with how to interact with them. We draw on a common set of transactional tasks such as appointment booking, scheduling, and purchasing, which require people to engage with the agent in task-focused dialogue to acquire information or complete their task (Gao et al., 2018). Inspired by the popular Maluuba Frames (Asri et al., 2017) data collection task templates, used to evaluate conversational agents in the natural language processing community, we utilize a travel planning task. More concretely, the task is a vacation planning endeavor where users must pick a vacation package that meets a set of experimenter-specified requirements through a search-compare-decide process. Specifically, every participant is presented with the following prompt:

You are considering going to New York, Berlin or Paris from Montreal. You want to travel sometime between August 23rd and September 1st. You are traveling alone. Ask for information about options available in all cities. Compare the alternatives and make your decision.

Participants were further instructed to determine what they could get for their money and to take into consideration factors they would consider while actually planning a vacation, including wifi, breakfast options, and a spa. The task is structured to involve three sub-goals: finalize a hotel package, an outgoing flight and an incoming flight back to Montreal.

3.2. Wizard-of-Oz conversational agent

We sought a conversational AI agent whose actual performance was strong enough for our result to generalize as the underlying AI models improve. So, following a pattern in prior work (Ho et al., 2018; Wallis and Norling, 2005; Thies et al., 2017; Chaves and Gerosa, 2019), we adopt a Wizard-of-Oz study paradigm.

We hire and train customer-support professionals from the Upwork platform to act as wizards in our experiment and pay them their posted hourly rate of per hour. The wizards play the role of the conversational AI agent in the text chat. We filtered for workers who had at least a job success rating from past work, and had already earned USD through the Upwork platform. We also filtered for workers with English proficiency by asking them to submit a cover letter detailing their past work experience and manually checked for spelling or grammatical errors. We hired 5 wizards in all. To eliminate wizard-specific confounds across our different conditions, wizards were blind to the treatment condition of the participants in the study, and randomized to a new condition for each new participant that they interacted with. Randomizing this source of variation produces an unbiased estimate of the effect of each condition.
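
The blinding and randomization described above can be sketched as follows. This is a minimal illustration and not the authors' assignment procedure: the condition labels and wizard IDs are hypothetical placeholders, and the sketch assumes a simple independent random draw per participant, whereas the study may have balanced assignments differently.

```python
import random

# Hypothetical placeholder labels for the treatment conditions and the hired wizards.
CONDITIONS = ["metaphor_a", "metaphor_b", "metaphor_c", "metaphor_d", "control"]
WIZARDS = ["wizard_1", "wizard_2", "wizard_3", "wizard_4", "wizard_5"]


def assign(participant_ids, seed=0):
    """Independently draw a condition and a wizard for every participant."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    return [
        {
            "participant": pid,
            "condition": rng.choice(CONDITIONS),
            "wizard": rng.choice(WIZARDS),
        }
        for pid in participant_ids
    ]


def wizard_view(assignment):
    """What a wizard is allowed to see: everything except the condition."""
    return {k: v for k, v in assignment.items() if k != "condition"}


assignments = assign([f"p{i}" for i in range(260)])
print(wizard_view(assignments[0]))
```

Because each wizard's condition is redrawn for every participant, any systematic difference between wizards is spread across conditions and is not confounded with a single metaphor.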

Wizards were trained on how to provide responses by engaging in practice tasks with the authors as participants. They were instructed to send emotionally neutral responses simply addressing the participant’s query. Wizards were required to, at every turn, search for hotels that met the requirements specified by the participant. In case multiple hotels met the set of requirements, they were asked to present all the options to the participant, ordered according to how they appear in the database. Once users achieved their three goals (booking two flights and a hotel), the wizards informed the users to proceed to the next step of the study. In order to minimize inter-wizard differences, we provided feedback to wizards from trial conversations to ensure that there were no drastic differences across the wizards’ performance. In many common exchanges, wizards were also provided with template responses. To highlight the similarity of the wizards’ responses, sample responses of three wizards to similar input queries are provided in the Supplementary Material.

Task-focused agents are knowledgeable within a narrow task focus (Grudin and Jacques, 2019) and are often unable to answer questions that require external knowledge. In order to retain phenomena associated with access to finite knowledge, we provide our wizards with a database of hotels and flights. We construct this database by hand to mimic what one would find on a standard travel booking platform. We have provided details on the construction of the database and examples of hotels and flights from the database in the supplementary material. Consistent with instructions provided to wizards in the creation of Frames (Asri et al., 2017) corpus, if the wizard is asked about knowledge that is outside of their available database, wizards were trained to respond that they do not have that information.
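
The lookup behavior the wizards were trained to follow can be sketched as below. This is an illustration under stated assumptions: the `Hotel` fields, the example rows, and the exact wording of the fallback reply are hypothetical, and the paper's actual database is described only in its supplementary material.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hotel:
    # Hypothetical schema; the real database columns are not given in the text.
    name: str
    city: str
    price_per_night: int
    wifi: bool
    breakfast: bool
    spa: bool


# Hypothetical rows, listed in "database order".
DATABASE = [
    Hotel("Hotel A", "Paris", 120, True, True, False),
    Hotel("Hotel B", "Berlin", 90, True, False, False),
    Hotel("Hotel C", "Paris", 200, True, True, True),
    Hotel("Hotel D", "New York", 250, False, True, True),
]

KNOWN_FIELDS = {f.name for f in fields(Hotel)}
NO_INFO = "I do not have that information."  # assumed wording


def search(requirements):
    """Return all hotels matching every requirement, in database order.

    If the participant asks about something the database does not cover,
    return the fixed "no information" reply instead of guessing.
    """
    if set(requirements) - KNOWN_FIELDS:
        return NO_INFO
    return [
        hotel
        for hotel in DATABASE
        if all(getattr(hotel, key) == value for key, value in requirements.items())
    ]


paris_with_wifi = search({"city": "Paris", "wifi": True})
out_of_scope = search({"pool": True})
```

Returning every match in database order mirrors the instruction that wizards present all options without ranking them, which keeps responses comparable across wizards.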

We constructed a conversational chat platform using Chatplat and embedded this chat widget into our web-based survey.

3.3. Sampling metaphors

We place participants into treatment groups each defined by the metaphor used to describe their AI collaborator. Instead of randomly sampling metaphors, we draw on the Stereotype Content Model (SCM) (Fiske et al., 2018; Cuddy et al., 2008), an influential psychological theory that articulates two major axes in social perception: warmth and competence. These two dimensions have proven to far outweigh others and repeatedly come up as prime factors in the literature (Asch, 1946; Judd et al., 2005; Wojciszke, 2005). Judgements on warmth and competence are made within milliseconds (Todorov et al., 2008) and a change in these traits alone can wholly change impressions (Zanna and Hamilton, 1972). The SCM proposes a quadrant structure: cognitive notions of warmth and competence, better understood as discrete, are characterized as being low or high (Cuddy et al., 2008). Warmth is characterised by notions such as good-naturedness and sincerity, while competence is characterised by notions of intelligence, responsibility, and skillfulness. For example, a “shrewd travel executive” can be described as high competence and low warmth. We sample metaphors such that they have either high or low values of warmth and competence.

Figure 2. Average warmth and competence measured for the conceptual metaphors sampled for our studies. Both the axes ranged from to .

Table 1. Warmth and competence values (average ± standard deviation) for the metaphors we use across the studies. High competence and warmth values are in bold.

In our first study, we use four metaphors, one in each quadrant, to study the impact of competence and warmth. We pre-tested a set of metaphors for a conversational agent, measuring the perceived competence and warmth of conversational AI agents described with these metaphors using a point Likert scale. We captured ratings for each metaphor from workers on Amazon Mechanical Turk — a mutually exclusive set of workers from those who will later be involved in the experiment. Based on the results (see Figure 2), we chose “trained professional travel assistant” (high competence, high warmth), “shrewd travel executive” (high competence, low warmth), “toddler” (low competence, high warmth), and “inexperienced teenager” (low competence, low warmth). We selected metaphors that were otherwise agendered, with similar socio-cultural connotations across the world, and representative of actual metaphors that could be associated with a travel assistant bot. These four metaphors form our four treatment groups in Study 1; their mean and standard deviation values of competence and warmth are reported in Table 1.
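
The quadrant selection step can be sketched as follows. This is only an illustration: the metaphor names are placeholders, the ratings are invented for the example and are not the study's data, and both the 1 to 5 scale and the midpoint threshold are assumptions that the text does not state.

```python
from statistics import mean

# Assumed rating scale; the pre-test scale is not specified in the text.
SCALE_MIN, SCALE_MAX = 1, 5
MIDPOINT = (SCALE_MIN + SCALE_MAX) / 2

# Hypothetical crowd ratings for placeholder metaphors (NOT the study's data).
RATINGS = {
    "metaphor_a": {"warmth": [4, 5, 4], "competence": [5, 4, 4]},
    "metaphor_b": {"warmth": [2, 1, 2], "competence": [4, 5, 5]},
    "metaphor_c": {"warmth": [5, 4, 5], "competence": [1, 2, 2]},
    "metaphor_d": {"warmth": [2, 2, 1], "competence": [2, 1, 2]},
}


def level(scores):
    """Label a dimension as high or low relative to the scale midpoint."""
    return "high" if mean(scores) > MIDPOINT else "low"


def quadrant(name):
    """Return the (competence, warmth) quadrant for one metaphor."""
    ratings = RATINGS[name]
    return (level(ratings["competence"]), level(ratings["warmth"]))


quadrants = {name: quadrant(name) for name in RATINGS}
```

Picking one metaphor from each distinct quadrant then yields the four treatment groups of a 2×2 design.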

In Study 2, we follow the same procedure to characterize several additional metaphors: “middle schooler”, “young student”, and “recent graduate”. These metaphors offer intermediate levels of competence, with “toddler” less competent than “middle schooler”, “middle schooler” less competent than “young student”, and “young student” less competent than “trained professional”. “Young student” is associated with higher competence levels than “middle schooler”, suggesting that people’s impression of a “young student” is a high schooler or college student, somewhere between a “middle schooler” and a “recent graduate”. In Study 3, we revisit the metaphors we analyzed in Study 1 to understand the effects of metaphors on potential users’ likelihood of trying out the system and their intentions of co-operating with it prior to using it.

4. Study 1: Metaphors drive evaluations

In our first study, we examine the effects of metaphors attached to the conversational system on user pre-use expectations and post-use evaluations. Specifically, we examine participants’ perceived pre- and post-use usability and warmth of the AI system. Additionally, we measure their post-use intention to adopt and their desire to cooperate with such a system given their treatment metaphor condition. Finally, we analyze the chat logs to explore if there are behavioral differences between the participants in different conditions.

4.1. Procedure

We perform a between-subjects experiment where participants in each treatment condition are primed with a metaphor to associate with the system. As described in the previous section, metaphors were chosen to cross low/high warmth with low/high competence, resulting in four treatment conditions. In addition, we included a control condition where participants were not primed with a metaphor, resulting in five total conditions.

After consenting to the study, participants were introduced to one of the study conditions, i.e. they were shown one of the four metaphors, or a control condition of no metaphor:

The bot you are about to interact with is modeled after a “shrewd travel executive”.

With the study condition revealed, participants were asked questions about their pre-use expectations of the AI system’s competence and warmth. Next, participants were shown the goal-oriented task description and allowed to interact with the wizard posing as conversational agent via the chat widget until they completed their task.

After finalizing their travel plans, participants were asked to evaluate their experience with the AI system and answer the manipulation check question. Finally, participants were debriefed, informed of the actual purpose of the study, and made aware that they were talking to a human and not an AI system. A high-level workflow is depicted in Figure 1.

4.2. Measures

User evaluation measures. To test contrast theory, it is important to measure a user’s evaluation of the experience without drawing explicit attention to the contrast between their expectations and their experience, since doing so makes the contrast salient (Kujala et al., 2017). So, we independently measure pre-use expectations and post-use evaluations without explicitly asking if expectations were met or violated. To gauge participants’ pre-use expectations and post-use perceptions of the system’s competence and warmth, we ask the participants to report how strongly, on a 5-point Likert scale (where 1 = strongly disagree and 5 = strongly agree), they agree with the following statements, both before and after they interacted with the AI system. Questions asked before use simply replaced the past tense of the verb with the future tense; the question ordering was randomized.

  • Usability: Since our notion of a system’s competence is akin to the notion of usability in previous studies, we adapt questions from previous surveys that examine usability (Kujala et al., 2017). These questions are: 1) “Using the AI system was (will be) a frustrating experience.” 2) “The AI system was (will be) easy to use.” 3) “I spent (will spend) too much time correcting things with this AI system.” 4) “The AI system met (will meet) my requirements.” Responses from before using the system are combined to form a pre-use usability index () while responses from after the conversation are combined to form a post-use usability index ().

  • Warmth: To measure the warmth of the AI system, we draw on different warmth levels articulated in the stereotype content model (Fiske et al., 1999): 1) “This AI system was (will be) good-natured.” 2) “This AI system was (will be) warm.” Responses from before using the system are combined to form a pre-use warmth index (). Similarly, responses from after the conversation are combined to form a post-use warmth index ().

  • Intention to Adopt and Desire to Cooperate: We borrow from prior work (Kujala et al., 2017) that captures user evaluations through their intentions to adopt the system. Since we increasingly have situations in which humans work alongside AI systems where these systems augment human efforts, it also becomes necessary to understand users’ behavioural tendencies towards these systems. So, we draw on prior work in HRI (Mieczkowski et al., 2019) and capture users’ behavioural tendencies through their desire to help and cooperate with the AI system. After their interaction with the system, participants are probed for their intentions to adopt as well as their desire to cooperate with the system. To probe for the participants’ intentions to adopt, we asked them the following two questions on point Likert scales: 1) “Based on your experience, how willing are you to continue using the service?”, 2) “How likely is it that you will be using the service in the future?”. Like previous work (Kujala et al., 2017), these two questions are combined to form an intention to adopt index (). To understand the participants’ desire to cooperate, we use questions about behavioral tendencies towards stereotyped groups adapted to the context of social robots (Mieczkowski et al., 2019). Users are asked on a 5 point Likert scale: “How likely would you be to cooperate with this AI?” and “How likely would you be to help this AI?”. Like previous work (Mieczkowski et al., 2019), these two questions are combined to form a cooperation index ().
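
The text above says only that item responses are "combined" into each index. As an illustration, the following minimal sketch assumes the index is the mean of the items on a 1 to 5 scale, and that the two negatively worded usability items (frustration, time spent correcting) are reverse-coded first. The item keys, the reverse-coding step, and the use of a mean are assumptions made for this example, not details confirmed by the study.

```python
from statistics import mean

# Hypothetical item keys for this sketch; not taken from the study materials.
USABILITY_ITEMS = ["frustrating", "easy_to_use", "too_much_correcting", "met_requirements"]
REVERSED_ITEMS = {"frustrating", "too_much_correcting"}  # negatively worded items
WARMTH_ITEMS = ["good_natured", "warm"]
SCALE_MAX = 5  # assumes ratings from 1 (strongly disagree) to 5 (strongly agree)


def reverse_code(score, scale_max=SCALE_MAX):
    """Flip a rating so that a higher value always means a more favorable answer."""
    return scale_max + 1 - score


def composite_index(responses, items, reversed_items=frozenset()):
    """Average the listed items after reverse-coding the negatively worded ones."""
    scores = []
    for item in items:
        score = responses[item]
        if not 1 <= score <= SCALE_MAX:
            raise ValueError(f"rating out of range for {item!r}: {score}")
        scores.append(reverse_code(score) if item in reversed_items else score)
    return mean(scores)


# One made-up participant's post-use answers.
participant = {
    "frustrating": 2,
    "easy_to_use": 4,
    "too_much_correcting": 1,
    "met_requirements": 5,
    "good_natured": 4,
    "warm": 3,
}
usability_index = composite_index(participant, USABILITY_ITEMS, REVERSED_ITEMS)
warmth_index = composite_index(participant, WARMTH_ITEMS)
```

The same function could be applied to the two intention-to-adopt items and the two cooperation items, since none of those is negatively worded.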

Conversational behavior measures. To investigate if participant behavior changes across conditions we include measures to analyze differences in the conversational behavior of users.

  • Language measures: To measure differences in the chatlogs by the participant and by the wizards across the various conditions, we utilize the popular linguistic dictionary Linguistic Inquiry and Word Count, known as LIWC (Pennebaker et al., 2001). LIWC uses dimensions to determine if a text uses positive or negative emotions, self-references, and causal words, to help assess physical and mental health, intentions and expectations of the writers. We categorize all the words used by the participants and the wizards into LIWC categories and create normalized frequency histograms of these categories. We compare the words used by the participants across all the conditions to see if there are significant differences in the types of LIWC categories used. Similarly, we compare the wizards’ words across all conditions. Finally, we combine the words used by the wizard and participant together and check whether the conversations differ across conditions.

  • Conversation measures: We also investigate differences across conditions, at the level of individual messages and whole conversations in terms of number of words used and duration of interaction.
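
The normalized category frequencies described above can be sketched as follows. The real LIWC dictionary is licensed, much larger, and matches word stems, so this example substitutes a tiny hypothetical lexicon with exact-match words. The category names and word lists are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for the licensed LIWC dictionary.
TOY_LEXICON = {
    "posemo": {"great", "thanks", "good", "happy"},
    "negemo": {"bad", "sorry", "annoying"},
    "self": {"i", "me", "my"},
    "cause": {"because", "since", "so"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_frequencies(messages, lexicon=TOY_LEXICON):
    """Return each category's share of all tokens across the given messages."""
    counts = Counter()
    total_tokens = 0
    for message in messages:
        for token in tokenize(message):
            total_tokens += 1
            for category, words in lexicon.items():
                if token in words:
                    counts[category] += 1
    if total_tokens == 0:
        return {category: 0.0 for category in lexicon}
    return {category: counts[category] / total_tokens for category in lexicon}


# Made-up participant messages from one conversation.
chat = ["Thanks, I need a flight because my meeting moved.", "Great, so book it."]
freqs = category_frequencies(chat)
```

Computing one such frequency vector per conversation gives the per-condition distributions that can then be compared across conditions.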

4.3. Participants

For all the studies in this paper, we hired participants to interact with our Wizard-of-Oz AI system from Amazon Mechanical Turk (AMT). Participants were all US citizens aged . Each participant was allowed to take part in the experiment only once. Participants were compensated for a survey lasting an average of minutes, for a rate of roughly /hr in accordance with fair work standards on Mechanical Turk (Whiting et al., 2019). Participants’ data was discarded if they failed to follow instructions, left the conversation midway or did not follow the task specifications. of our participants were female and the mean age of participants was .

In this specific study, for a small expected effect size of , a power analysis with a significance level of , powered at , indicated that we require participants per condition, or total participants. Thirteen participants’ responses were discarded because the raters had coded their WoZ manipulation check as expressing suspicion that the agent might be human. After these exclusions, we had a sample size of participants, which met the requirements from our power analysis.

4.4. Wizard-of-Oz manipulation check

To ensure that our study was not compromised by participants who identified that they were speaking to a wizard instead of an AI, we included a manipulation check at the end of the survey. The manipulation check gauges whether the participant was suspicious of the AI without explicitly drawing their attention to the fact that this might be the case. So, drawing on prior Wizard-of-Oz studies (Hinds et al., 2004), we asked the participants how they thought the system worked from a technical standpoint.

The responses were sent to two coders who inspected each response individually and marked all the responses that suspected a person was pretending to be an AI system. Participants who expressed suspicion that the system might be human were excluded from further analysis. Some participants were very confident they knew how such a conversational AI could be built: “I am a programmer so i understand the bot has a vocabulary of words it attempts to parse through, then it takes what it finds from the user and checks against a database to output information it thinks is relevant.” Others talked about how “Most chat bots go through ‘training’ beforehand to be able to parse commonly asked questions and phrasing” or how “it must be using a database full of responses.”

Out of all our participants, both coders identified the same participants () who failed the manipulation check by calling out the agent as a human, resulting in a suspicion level of . In most of these cases, it was triggered by a wizard making a typo or taking too long to respond. One suspicious participant exclaimed, “I’m like sure it’s not a bot, but if it were a bot, machine learning, though I don’t know exactly what THAT means”. These participants were excluded from our analysis.
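
The coding and exclusion step can be sketched as follows, under stated assumptions: each coder's judgments are stored as a mapping from a hypothetical participant ID to a boolean "suspicious" flag, a participant is excluded if either coder flags them, and agreement is summarized with Cohen's kappa. The text reports only that both coders flagged the same participants; the kappa statistic and the either-coder rule are additions for this example.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders assigning binary labels to the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # share of items coder A flagged
    p_b = sum(labels_b) / n  # share of items coder B flagged
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # convention used here when both coders give one label throughout
    return (observed - expected) / (1 - expected)


# Made-up flags: True means the response suggested a human was behind the agent.
coder_a = {"p1": False, "p2": True, "p3": False, "p4": False}
coder_b = {"p1": False, "p2": True, "p3": False, "p4": False}

ids = sorted(coder_a)
kappa = cohens_kappa([coder_a[i] for i in ids], [coder_b[i] for i in ids])
excluded = {i for i in ids if coder_a[i] or coder_b[i]}
retained = [i for i in ids if i not in excluded]
suspicion_rate = len(excluded) / len(ids)
```

When the coders agree on every participant, as reported above, the either-coder and both-coders rules exclude the same set.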

4.5. Results: Metaphors shape pre-use expectations

We compare the impact of setting expectations by varying the competence and warmth of the metaphors used. We perform our analysis using a pair of two-way analyses of variance (ANOVAs), where competence and warmth are two categorical independent variables, and pre-use usability and warmth are the dependent variables. We also compare the metaphor conditions against the control condition to measure whether the metaphors change participants’ default expectations of conversational AI systems. So, the independent variables are categorized into high, low, or control categories.

Figure 3. (a) Metaphors that signal high competence lead to higher pre-use usability scores. (b) Similarly metaphors that signal high warmth lead to higher pre-use warmth scores. We also notice from both (a) and (b) that participants are naturally predisposed to have high expectations of usability and warmth from conversational systems; however, priming them with metaphors reduces the variance of their expectations as opposed to when their expectations are uninformed.

Pre-use usability is affected by the metaphor’s competence. For pre-use usability (Figure 3 (a)), we find that competence has a large main effect . By default, participants have high expectations of competence. A post-hoc Tukey test revealed that pre-use usability was significantly lower for the low competence condition than for the high competence or control conditions. We found no main effects for warmth.

Pre-use warmth is likewise affected by the metaphor’s warmth. For pre-use warmth (see Figure 3 (b)), we find that warmth has a large main effect . By default, participants have high expectations of warmth. A post-hoc Tukey test revealed that pre-use warmth is significantly lower for the low warmth condition than for both the high warmth and control conditions. We found no main effects of competence.

Together, these tests imply that participants, by default, expect conversational AI to possess high competence and high warmth. However, the change in expectation caused by low competence and low warmth implies that these conceptual metaphors do affect participants’ expectations of how the AI system will perform and behave. We visualize the means and standard errors for these conditions in Figure 3 (a, b).

4.6. Results: Metaphors impact user evaluations and user attitudes

We compare the impact of varying the competence and warmth of the metaphors on participants’ post-use evaluations of the AI system’s usability and warmth. We perform our analysis using a pair of two-way ANOVAs where competence and warmth are categorical independent variables, and post-use usability and post-use warmth are the two dependent variables.
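
For readers who want to see what the two-way ANOVA computes, here is a minimal sketch of the sums-of-squares decomposition and the resulting F statistics. It assumes a balanced, fully crossed design (equal cell sizes), which may not hold after exclusions; unbalanced data would need Type II or Type III sums of squares from a statistics package. The function name, dictionary keys, and scores are made up for this example, and p-values are omitted because the Python standard library has no F distribution.

```python
from statistics import mean


def two_way_anova(cells):
    """Return F statistics for a balanced, fully crossed two-factor design.

    `cells` maps (level_a, level_b) to the list of scores in that cell.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # scores per cell
    if len(levels_a) < 2 or len(levels_b) < 2 or len(cells) != len(levels_a) * len(levels_b):
        raise ValueError("design must be fully crossed with at least two levels per factor")
    if n < 2 or any(len(scores) != n for scores in cells.values()):
        raise ValueError("every cell needs the same number (at least 2) of scores")

    grand = mean([x for scores in cells.values() for x in scores])
    mean_a = {a: mean([x for b in levels_b for x in cells[(a, b)]]) for a in levels_a}
    mean_b = {b: mean([x for a in levels_a for x in cells[(a, b)]]) for b in levels_b}
    cell_mean = {key: mean(scores) for key, scores in cells.items()}

    # Partition the variation into main effects, interaction, and within-cell error.
    ss_a = n * len(levels_b) * sum((mean_a[a] - grand) ** 2 for a in levels_a)
    ss_b = n * len(levels_a) * sum((mean_b[b] - grand) ** 2 for b in levels_b)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in levels_a
        for b in levels_b
    )
    ss_within = sum((x - cell_mean[key]) ** 2 for key, scores in cells.items() for x in scores)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_within = len(cells) * (n - 1)
    ms_within = ss_within / df_within
    if ms_within == 0:
        raise ValueError("no within-cell variance; F statistics are undefined")
    return {
        "f_a": (ss_a / df_a) / ms_within,
        "f_b": (ss_b / df_b) / ms_within,
        "f_interaction": (ss_ab / (df_a * df_b)) / ms_within,
        "df_within": df_within,
    }


# Made-up post-use usability scores keyed by (competence, warmth).
scores = {
    ("high", "high"): [3, 4],
    ("high", "low"): [3, 4],
    ("low", "high"): [4, 5],
    ("low", "low"): [4, 5],
}
result = two_way_anova(scores)
```

In this toy data only the first factor (competence) separates the cell means, so `f_a` is positive while `f_b` and `f_interaction` are zero.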

Figure 4. The low competence metaphor condition features the highest post-use usability, intention to adopt, and desire to cooperate. This result suggests that metaphors that undersell the AI agent’s competence are most likely to succeed.

Participants perceive agents with low competence to be more usable after interaction. For post-use usability (Figure 4 (a)), competence has a main effect . Competence has a smaller effect on post-use usability than on pre-use usability, implying that the actual interaction of the participant with the system affects their final evaluations. In the low competence condition, post-use usability was rated at and in the high competence condition, it was rated . These results suggest that users perceive a difference between their experience and their expectations in terms of competence of the agent. The means for post-use usability are also higher than for pre-use usability for both high and low competence conditions. We found no main effects of warmth or interaction effects between competence and warmth.

We observe post-use warmth ratings to be higher in the high warmth condition than the low warmth condition though the difference is not significant. For post-use warmth (Figure 4 (d)), we find no main effects of competence or warmth and no interaction effects. In the high warmth condition, warmth was rated at and in the low warmth condition, it was rated . There is no significant difference between the means of pre-use and post-use warmth for high warmth , but the difference is significant for low warmth .

Using the composites described in the study design, we measure the effect of conceptual metaphors on the participants’ intention to adopt and desire to cooperate after interacting with the system.

Low competence metaphors increase participants’ likelihood of adopting the AI agent. For their intention to adopt (Figure 4 (b)), competence has a main effect . In the high competence condition, the intention to adopt was rated at and in the low competence condition, it was rated . These results support Hypothesis 2 as we see support for contrast theory: participants are more likely to adopt an agent that they originally expected to have low competence but outperforms that expectation. They are less forgiving of mistakes made by AI systems they expect to have high competence. We found no main effects of warmth or interaction effects between competence and warmth.

Participants prefer to cooperate with agents that have high warmth and low competence. For their desire to cooperate with the AI system, we found that both competence and warmth had main effects but no interaction effect (see Figure 4 (c, e)). The means increase from high to low competence of . Similarly, the means decrease from high to low warmth . These results provide mixed support to both Hypothesis 1 and Hypothesis 2 as we see support for assimilation theory along the warmth dimension and contrast theory along the competence dimension. If participants are told that the AI system is high warmth, they are more likely to cooperate with it. But if the AI system is described as high competence, they are less likely to cooperate.

Figure 5. Segments of two example conversations between a participant and our conversational AI system. In both cases, the participant expects the AI system to have low competence. While the left conversation is in the high warmth metaphor condition, the right conversation is in the low warmth metaphor condition. Participants in the high warmth condition ask more questions and explore the space of possible interactions by asking the agent details about checked luggage and hotel amenities. Wizards, acting as conversation agents, are given a fixed knowledge set, mimicking how today’s systems are designed, and reply with apologies when asked about details outside of their knowledge.

4.7. Results: Expectations change, but behavior doesn’t

We analyze the chat logs with LIWC features, following a standard LIWC analysis protocol of building a frequency count of how often words belonging to a specific LIWC category were used. We contrasted these counts across the various conditions and observed no significant differences () in language-level phenomena in the chatlogs across the conditions. This result implies that the post-use evaluations are driven primarily by the expectations set by the metaphors, not by the actual content of the conversation. In other words, evaluations differed between conditions, but the actual conversations themselves did not. The wizard was blinded to the condition, so any differences would have needed to be prompted by the participant. However, we acknowledge that there might be language shifts that LIWC categories cannot capture.

Participants use more words and spend more time speaking to agents with high warmth. We find a significant main effect of warmth on the number of words used per conversation . The number of words increases from in low to in high warmth. We also find that participants in the high warmth condition typically spend an average of minutes longer while interacting with the AI system. On a qualitative inspection, we find that these participants tend to ask more questions and spend more time exploring the AI system’s capabilities. Consider, for example, the conversation shown in Figure 5, where the participant expects the bot to have low competence and high warmth. The participant asks numerous questions to test the system’s capabilities and even though it fails, they later express a high intention to adopt and cooperate with the system.

4.8. Summary

Our results support contrast theory (Hypothesis 2) for the competence axis. Users are more tolerant of gaps in knowledge of systems with low competence but are less forgiving of high competence systems making mistakes. The intention to adopt and desire to cooperate decrease as the competence of the AI system metaphor increases. For the warmth axis, our results provide some support for assimilation theory (Hypothesis 1): users are more likely to cooperate and interact longer with agents portraying high warmth, but we do not observe a significant impact of warmth on users’ intention to adopt.

5. Study 2: The competence-adoption curve

Figure 6. A larger positive violation of expectation increases adoption intentions. The intention to adopt decreases monotonically with an increase in expected competence of the system. The red vertical line shows the average score users in the control condition assigned the system and the yellow shaded region around the vertical line depicts the standard deviation.

Study 1 established that setting low expectations and violating them positively increased the likelihood of users adopting the system. In Study 2, we zoom in and try to understand how user evaluations change as the magnitude of that gap changes. We sample additional metaphors and use the same experiment procedure as before to characterize how users’ intentions to adopt the system vary as the gap between users’ expectations and their experience changes. For this purpose, we rely on the same measure of Intention to Adopt as Study 1.

5.1. Procedure

To precisely traverse the range of perceived competence, we sampled additional metaphors — “middle schooler”, “young student”, and “recent graduate”. Our pre-experiment survey revealed that these metaphors had perceived competence levels between the “toddler” and the “trained professional”. Together, these five metaphors formed five treatment conditions. As Figure 2 demonstrates, all five metaphors lie in the high warmth half of the space, minimizing any interfering effects of variations in warmth. Participants in these five conditions were primed with the respective metaphor.

To understand users’ unprimed evaluations of the system, we asked a sixth control group to participate without a metaphor (similar to the control condition in Study 1). Afterwards, we asked them to pick from the list of five metaphors and identify which one they felt described the system most accurately after use.

5.2. Measures

Similar to Study 1, we measure participants’ intention to adopt and desire to co-operate across all five metaphors.

5.3. Participants

Similar to the protocol in Study 1, participants were recruited on AMT. We recruited participants for each condition, for a total of participants. The duration of the study was similar to Study 1 and participants were compensated at the same rate. The average age of participants was ; identified as female.

5.4. Wizard-of-Oz manipulation check

The two coders were consistent and identified the same participants () as being suspicious, implying a low suspicion level of . These five participants were removed from analysis.

5.5. Extreme violations of expectations have stronger effects

The five metaphors we sampled are shown in Figure 6, where the x-axis depicts workers’ perceived competence of a system with that metaphor and the y-axis depicts a different set of users’ intention to adopt. The red vertical line shows the average score users in the control condition assigned the system and the yellow shaded region around the vertical line depicts the standard deviation. The unprimed system was viewed roughly as competently as a recent graduate.

Over-performing low competence leads to higher adoption than over-performing medium competence, and projecting any more competence than the toddler metaphor incurs an immediate cost (Figure 6). These results paint a fuller picture of contrast theory at play, as the intention to adopt decreases monotonically as the expected competence of the system increases. Consistent with prior literature, the effect is greater as the contrast is greater (Geers and Lassiter, 1999; Brown et al., 2012). However, the effect is nonlinear, with only the lowest competence metaphor receiving a substantial benefit.

The “toddler” metaphor sees the highest (beneficial) violation as it is furthest away from the vertical line and sees the greatest intention to adopt and desire to cooperate. There was a statistically significant difference between groups as determined by one-way ANOVA for intention to adopt and for desire to cooperate . A Tukey post-hoc test revealed that intention to adopt was statistically significantly higher for “toddler” than “young student” , “recent graduate” , and “trained professional” . There was no statistically significant difference between other metaphors. Similarly, a Tukey post-hoc test revealed that desire to cooperate was statistically significantly higher for “toddler” than “middle schooler” , “young student” , “recent graduate” , and “trained professional” . There was no statistically significant difference between other metaphors.

5.6. Summary

Our results further support contrast theory (Hypothesis 2) for the competence axis. Users are more likely to adopt a lower competence agent than one with high competence, even though all conditions were exposed to human-level performance. We additionally see an asymmetry — users are even more likely to adopt an agent that exceeds extremely low expectations than one that exceeds slightly higher (but still low) expectations. And as the agent begins to under-perform expectations, intentions to adopt decrease further.

6. Study 3: The cost of low-competence metaphors

From our results so far, it might appear that designers should pick metaphors that project lower competence and high warmth regardless of experience, as these conditions are most conducive to cooperative and patient user interactions. However, such a conclusion might be myopic. Metaphors attached to a system also have the ability to attract people or drive them away.

To test the effect of metaphor on pre-use intention to adopt, and pre-use desire to cooperate with an AI system, we ran a third study. In this study, we present participants with AI systems, described using conceptual metaphors. We ask participants to identify which systems they are more likely to try out and potentially adopt, prior to using the system.

6.1. Procedure

We perform a between-subjects experiment. Each participant was introduced to an AI agent described using one of the metaphors in Study 1. Unlike the previous experiments, participants do not actually interact with an AI system (or wizard). Instead, they are asked to rate their likelihood of trying out a new AI system service described by each metaphor.

6.2. Measures

To probe for the participants’ intentions to try the system, we asked them the following two questions on point Likert scales: “How likely are you to try out this AI system?” and “Do you envision yourself engaging in long-term use of such an AI system?” These two questions are combined to form a trial index ().

To understand the participants’ pre-use desire to cooperate with the system, they are asked on a point Likert scale: “How likely are you to cooperate with such an AI system?” and “How likely are you to tolerate errors made by this AI system?” These two questions are combined to form a pre-use desire to cooperate index ().

6.3. Participants

Similar to the previous studies, we recruited participants from AMT. new participants participated in this survey: participants exposed to a metaphor from each quadrant. We ensured that none of the participants in this study participated in any of our other studies.

6.4. More interest in trying out high competence and high warmth AI systems

Participants were more interested in trying out AI systems that were described by high competence and high warmth. A two-way ANOVA revealed that competence and warmth both had significant impact on their intention to try out the AI system. The average trial index response was for high competence, and for low competence. Following a similar pattern, the average trial index response was for high warmth as opposed to for low warmth. The ANOVA also showed an interaction effect between competence and warmth . In this case, the combination of high competence and high warmth produced a substantial benefit compared to the effects of warmth and competence individually: low competence and high warmth , high competence and low warmth , and both low competence and low warmth .

Participants were more likely to cooperate with AI systems that were described by high competence and high warmth. A two-way ANOVA revealed that competence and warmth both had significant impact on the pre-use desire to cooperate index. The average pre-use desire to cooperate index response was for the high competence as opposed to for low competence. Similarly, the average pre-use desire to cooperate index response was for the high warmth as opposed to for low warmth. The ANOVA also showed an interaction effect between competence and warmth . People expected to behave more positively with high competence and high warmth AI systems over the low competence and high warmth , high competence and low warmth , and both low competence and low warmth bots.

6.5. Summary

While our previous studies demonstrated the detrimental effects of presenting an AI system with a high competence metaphor, this study shows a positive benefit of high competence: people are more likely to try out a new service if it is described with a high competence metaphor. This study also shows that metaphors that project high warmth increase people’s likelihood of trying out a service and of behaving positively with it. We discuss the implications of these findings and suggest guidelines for choosing metaphors that balance the competing objectives of attracting more users and ensuring favorable evaluations and cooperative behavior.

7. Discussion

Metaphors, as an expectation setting mechanism, are task- and model-agnostic. Users reason about complex algorithmic systems, including news feeds (DeVito et al., 2018), content curation, and recommender systems, using metaphors. This implies their effects are not limited to conversational agents or even to AI systems and can be used to set expectations of any algorithmic system (e.g., is Facebook’s newsfeed algorithm a gossipy teen, an information butler, or a spy?), although the implications of our study might differ depending on the task, interaction and context.

With our findings in mind, this section explores their design implications and limitations, and situates our work amongst existing literature in HCI. We end with a retrospective analysis on existing and previous conversational AI products, reinterpreting their metaphors and adoption/user cooperation patterns through the lens of our results.

7.1. User behavior around algorithmic systems

Our work contributes to a growing body of work in HCI that seeks to understand how people reason about algorithmic systems with the aim of facilitating more informed and engaging interactions (French and Hancock, 2017; DeVito et al., 2018; Eslami et al., 2015). Previous work has looked at how users form informal theories about the technical mechanisms behind social media feeds (French and Hancock, 2017; DeVito et al., 2018; Eslami et al., 2016) and how these “folk theories” drive their interactions with these systems. People’s conceptual understanding of such systems has been known to be metaphorical in nature, leading them to form folk theories of socio-technical systems in terms of metaphors. Folk theories for Facebook and Twitter news feeds include metaphors rooted in personas such as “rational assistant” and “unwanted observer” as well as metaphors tied to more abstract concepts such as “corporate black box” and “transparent platform”. More recent work has sought to study the social roles of algorithms by looking at how people personify algorithms and attach personas to them (Wu et al., 2019). Prior work in the domain of interactive systems and embodied agents has observed that the mental schemas people apply towards agents affect the way they behave with the agent and that it is possible to detect users’ schematic orientation through initial interaction (Lee et al., 2010). Diverging from previous work on folk theories, our work takes a complementary route: instead of studying which metaphors users attach to systems, we study how metaphors explicitly attached to the system by designers impact experiences.

7.2. Design implications

Studies 1 and 2 demonstrate that low competence metaphors lead to the highest evaluations of an AI agent, but Study 3 counters that agents with low competence metaphors are least likely to be tried out. What should a designer do?

From Study 3, it becomes clear that associating a high warmth metaphor is always beneficial—however, the choice of competence level projected by the metaphor becomes a more nuanced decision. One possible approach might be to choose a higher-competence metaphor but to lower competence expectations right after interaction begins (e.g., “Great question! One thing I should mention: I’m still learning how to best respond to questions like yours, so please have patience if I get something wrong.”) Another approach might be to age the metaphor over time: to present a high competence metaphor such as a professional, but when a user first encounters it, the agent introduces itself via a lower-competence version such as a professional trainee and tells the user that it will evolve over time into a full professional (Seering et al., 2020).

If designers are unwilling to change or adapt their high-competence metaphor, then their designs run the risk of being abandoned for being less effective than users expect. There may be other ways to disarm the contrast between expectations and reality. Having the agent blame itself or the user for errors creates challenging issues, but blaming an intermediary might work (Nass and Brave, 2005): for example, “I’ve seen that previous folks who asked that question meant multiple different things by it. To make sure I can help effectively, can you reword that question?”

7.3. Limitations and future work

The scope of the study was limited to a conversational AI, as an instance of an algorithmic system, where interaction is devoid of strong visual cues (Saygin et al., 2011). In the case of embodied agents and systems where visual communication is a major aspect of the interaction, visual factors might have a strong effect on expectations. It is important to understand how users factor in these visual signals in forming an impression of the system. Additionally, our choice of conceptual metaphors was solely textual metaphors; future work should explore how these findings translate to visual metaphors such as the abstract shape associated with Siri, or the cartoonish rendering of Clippy, because such visual abstractions also inform users’ judgements of a system’s competence and warmth.

Since the task in our study was highly structured and participants had no incentive to explore peripheral conversational topics, we did not observe significant differences in user vocabulary across the conditions. It surprised us that evaluations would differ even though the interactions themselves showed no major differences between conditions. Future work should explore user behavior in open-ended conversations, which are more likely to contain personal stories and anecdotes that can elicit greater behavior changes. The conversations, and therefore interactions with the AI system, were limited to

minutes, so further research needs to establish the effects of metaphors on prolonged exposure to the AI system. Additionally, our service is not commonly used by people today to book flights or hotels and it is possible that the novelty of performing this task with a conversational agent might have skewed evaluations. It is therefore necessary to understand how prior experience with similar technology changes people’s susceptibility to such expectation shaping and subsequently their evaluations.

We observed partial support for Assimilation Theory along the warmth axis: participants preferred to cooperate with agents projecting higher warmth but at the same time, they perceived a difference between the agent’s projected warmth and the actual warmth. Future work is needed to develop more robust theories along the warmth axis. One potential direction could create conditions of more extreme violation — sampling metaphors that signal either extremely low or extremely high warmth and measuring to see if participants’ attitudes towards the agent are still driven by their pre-use warmth perceptions or whether the larger perceived difference in warmth alters their attitudes towards the agent.

Our study explored the effect of metaphors on adoption and behavioral intentions, but metaphors could also impact many other factors, including perceived trustworthiness of the system. Along this direction, future work should explore the impact on user evaluations when the interaction with the AI system results in a failure to accomplish the task. Finally, the actual competence and warmth of the AI system should be varied to analyze the effects of metaphors as the AI system’s competence is lowered from our human-level performance.

7.4. Retrospective analysis

Studies have repeatedly shown that initial user expectations of their conversational agents (including Google Now, Siri, and Cortana) are not met (Zamora, 2017; Jain et al., 2018; Luger and Sellen, 2016), causing users to abandon these services after initial use. Initial experiences with conversational agents are often decisive: users reported that initial failures in achieving a task with Siri caused them to retreat to simple tasks they were sure the system could handle. While bloated expectations of users before interacting with AI systems have been acknowledged, little work has explored what those expectations are and how they contribute to user adoption and behavior.

Are today’s conversational agents being set up for failure? Our studies establish that the descriptions and metaphors attached to these systems can play a key role in shaping expectations. Woebot was introduced as “Tiny conversations to feel your best”; Replika was presented as “The AI companion who cares”; and Mitsuku was revealed as “a record breaking five-time winner of the Loebner Prize Turing Test […] the world’s best conversational chatbot”. We collected these descriptions associated with popular social chatbots (Xiaoice, Mitsuku, Tay, Replika and Woebot) and ran the exact same warmth-competence measurement on those descriptions with participants from AMT as we used for the metaphors in our study.
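The aggregation behind a plot like Figure 7 can be sketched as a per-description average of participant ratings. This is a minimal sketch under stated assumptions rather than our analysis code: the function name `average_ratings`, the tuple layout, and the example responses are hypothetical placeholders, and the rating scale is deliberately left unspecified.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(ratings):
    """Average (warmth, competence) per description.

    `ratings` is an iterable of (description_name, warmth, competence)
    tuples, one per participant response.
    """
    by_name = defaultdict(list)
    for name, warmth, competence in ratings:
        by_name[name].append((warmth, competence))
    return {
        name: (mean(w for w, _ in pairs), mean(c for _, c in pairs))
        for name, pairs in by_name.items()
    }


# Placeholder responses for illustration only; these are NOT data from the study.
example = [
    ("bot_a", 4, 5),
    ("bot_a", 2, 3),
    ("bot_b", 1, 4),
]
print(average_ratings(example))
```

Each resulting pair gives one point on the warmth-competence plane, which can then be compared against the positions of the study's metaphors.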

Figure 7. Average warmth and competence measured for popular social chat-bots. Both axes range from to .

We find that today’s social chatbots signal high competence (Figure 7), between “recent graduate” and “trained professional”. (Tay, incidentally, also projects very low warmth.) Descriptions of this kind, as we’ve shown, might be setting such systems up for failure. As Ars Technica reported: “You might come away thinking that Apple found some way to shrink Mad Men’s Joan Holloway and pop her into a computer chip. Though Siri shows real potential, these kinds of high expectations are bound to be disappointed” (Cheng, 2011). It is important to note, then, that users often report disappointment after using these agents, especially since Apple’s announcement of Siri included the sentence: “Ask Siri and get the answer back almost instantly without having to type a single character”; Google Assistant was heralded as “the first virtual assistant that truly anticipates your needs”.

With the recent glut of Twitter bots and other social agents that learn from their interactions and are adaptive in nature (Park et al., 2019), it also becomes important to understand what factors drive users’ antisocial behavior towards such bots. Previous work has sought explanations through the lens of user profiling, gender attributions, and racial representations. Our work provides another lens on why otherwise similar systems such as Xiaoice and Tay (both female, teen-aged, and not representative of marginalised communities in their respective countries) might have elicited vastly different responses from their users. While Tay’s official Twitter account described it as “Microsoft’s AI fam from the internet that’s got zero chill!”, signaling high competence and low warmth, Xiaoice was set up to be a “Sympathetic ear” and an “Empathetic social chatbot” (Zhou et al., 2018), very clearly signaling high warmth and even priming behaviors around warmth such as personal disclosure. Our study suggests that people are more likely to cooperate with a bot that is perceived as higher warmth before use, a result consistent with the fact that Xiaoice continued to be a friend and remained popular with its user base while Tay was pulled down within hours of its release for attracting trolls.

Xiaoice is not an isolated case. Other bots, such as Woebot and Replika, which were set up with high warmth, have had success in garnering users. Even though both Woebot and Replika had competence expectations comparable to Tay’s, they were far warmer and obtained an altogether different outcome. Similarly, among the bots perceived as high warmth, Mitsuku stood out as exceptionally competent and, consistent with our finding that perceptions of very high competence decrease the desire to cooperate with the AI system, up to of the messages received by Mitsuku are antisocial (Worswick, 2018). For their part, Microsoft may have absorbed the lesson, as Tay’s successor, named Zo, was described more warmly as “Always down to chat. Will make you LOL”.

While we acknowledge that there are several variables that affect user reception of these systems, the fact that our findings are consistent with in-the-wild outcomes of extant conversational systems is notable. It is, of course, impossible to prove that the expectations set by attached metaphors are a causal factor in the users’ reception of these specific systems, and we caution readers against concluding that metaphors alone are responsible.

8. Conclusion

We explore metaphors as a causal factor in determining users’ evaluations of AI agents. We demonstrate experimentally that these conceptual metaphors change users’ pre-use expectations as well as their post-use evaluations of the system, their intentions to adopt, and their desire to cooperate. While people are more likely to cooperate with agents that they expect to be warm, they are more likely to adopt and cooperate with agents that project low competence. This result runs counter to designers’ usual default towards projecting high competence to attract more users.

We thank Jacob Ritchie, Mitchell Gordon and Mark Whiting for their valuable comments and feedback. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (TRI) but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


  • N. Abokhodair, D. Yoo, and D. W. McDonald (2015) Dissecting a social botnet: growth, content and influence in twitter. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 839–851. Cited by: §1.
  • A. Arnold (2018) How chatbots feed into millennials’ need for instant gratification. Cited by: §3.1.
  • S. E. Asch (1946) Forming impressions of personality.. The Journal of Abnormal and Social Psychology 41 (3), pp. 258. Cited by: §3.3.
  • L. E. Asri, H. Schulz, S. Sharma, J. Zumer, J. Harris, E. Fine, R. Mehrotra, and K. Suleman (2017) Frames: A corpus for adding memory to goal-oriented dialogue systems. CoRR abs/1704.00057. External Links: Link, 1704.00057 Cited by: §1, §3.1, §3.2.
  • G. Bansal, E. Kamar, W. S. Lasecki, and D. S. W. E. Horvitz (2019) Beyond accuracy: the role of mental models in human-ai team performance. Cited by: §1, §1, §2.1.
  • A. L. Baylor and Y. Kim (2004) Pedagogical agent design: the impact of agent realism, gender, ethnicity, and instructional role. In International conference on intelligent tutoring systems, pp. 592–603. Cited by: §2.1.
  • S. A. Brown, V. Venkatesh, and S. Goyal (2012) Expectation confirmation in technology use. Information Systems Research 23 (2), pp. 474–487. Cited by: §5.5.
  • D. Brscić, H. Kidokoro, Y. Suehiro, and T. Kanda (2015) Escaping from children’s abuse of social robots. In Proceedings of the tenth annual acm/ieee international conference on human-robot interaction, pp. 59–66. Cited by: §2.1.
  • R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad (2015) Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. Cited by: §1.
  • J. Cassell, T. Bickmore, L. Campbell, H. Vilhjálmsson, and H. Yan (2000) Embodied conversational agents. pp. 29–63. External Links: ISBN 0-262-03278-3, Link Cited by: §1.
  • E. Chang and V. Kannan (2018) Conversational ai: best practices for building bots. External Links: Link Cited by: §1.
  • A. P. Chaves and M. A. Gerosa (2019) How should my chatbot interact? A survey on human-chatbot interaction design. CoRR abs/1904.02743. External Links: Link, 1904.02743 Cited by: §1, §3.2.
  • J. Cheng (2011) IPhone 4s: a siri-ously slick, speedy smartphone. Cited by: §7.4.
  • J. Cranshaw, E. Elwany, T. Newman, R. Kocielnik, B. Yu, S. Soni, J. Teevan, and A. Monroy-Hernández (2017) Calendar help: designing a workflow-based scheduling agent with humans in the loop. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2382–2393. Cited by: §1.
  • L. E. Crawford (2009) Conceptual metaphors of affect. Emotion review 1 (2), pp. 129–139. Cited by: §1.
  • A. J. Cuddy, S. T. Fiske, and P. Glick (2008) Warmth and competence as universal dimensions of social perception: the stereotype content model and the bias map. Advances in experimental social psychology 40, pp. 61–149. Cited by: §1, §3.3.
  • S. Damani, N. Raviprakash, U. Gupta, A. Chatterjee, M. Joshi, K. Gupta, K. N. Narahari, P. Agrawal, M. K. Chinnakotla, S. Magapu, et al. (2018) Ruuh: a deep learning based conversational social agent. arXiv preprint arXiv:1810.12097. Cited by: §3.
  • L. A. DeChurch and J. R. Mesmer-Magnus (2010) The cognitive underpinnings of effective teamwork: a meta-analysis.. Journal of Applied Psychology 95 (1), pp. 32. Cited by: §1, §2.1.
  • M. A. DeVito, J. Birnholtz, J. T. Hancock, M. French, and S. Liu (2018) How people form folk theories of social media feeds and what it means for how we study self-presentation. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18, New York, NY, USA, pp. 120:1–120:12. External Links: ISBN 978-1-4503-5620-6, Link, Document Cited by: §2.1, §7.1, §7.
  • S. Drift (2018) 2018 state of chatbots report. Cited by: §3.1.
  • M. Eslami, K. Karahalios, C. Sandvig, K. Vaccaro, A. Rickman, K. Hamilton, and A. Kirlik (2016) First i “like” it, then i hide it: folk theories of social feeds. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 2371–2382. External Links: ISBN 9781450333627, Link, Document Cited by: §7.1.
  • M. Eslami, A. Rickman, K. Vaccaro, A. Aleyasen, A. Vuong, K. Karahalios, K. Hamilton, and C. Sandvig (2015) “I always assumed that i wasn’t really that close to [her]”: reasoning about invisible algorithms in news feeds. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, New York, NY, USA, pp. 153–162. External Links: ISBN 9781450331456, Link, Document Cited by: §7.1.
  • E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini (2016) The rise of social bots. Communications of the ACM 59 (7), pp. 96–104. Cited by: §1.
  • S. T. Fiske, A. J. Cuddy, P. Glick, and J. Xu (2018) A model of (often mixed) stereotype content: competence and warmth respectively follow from perceived status and competition (2002). In Social cognition, pp. 171–222. Cited by: §1, §3.3.
  • S. T. Fiske, J. Xu, A. C. Cuddy, and P. Glick (1999) (Dis)respecting versus (dis)liking: status and interdependence predict ambivalent stereotypes of competence and warmth. Journal of Social Issues 55 (3), pp. 473–489. External Links: Document, Link, Cited by: 2nd item.
  • M. French and J. Hancock (2017) What’s the folk theory? reasoning about cyber-social systems. Reasoning About Cyber-Social Systems (February 2, 2017). Cited by: §2.1, §2.1, §2.1, §2, §7.1.
  • J. Gao, M. Galley, and L. Li (2018) Neural approaches to conversational AI. CoRR abs/1809.08267. External Links: Link, 1809.08267 Cited by: §3.1.
  • A. L. Geers and G. D. Lassiter (1999) Affective expectations and information gain: evidence for assimilation and contrast effects in affective experience. Journal of Experimental Social Psychology 35 (4), pp. 394–413. Cited by: §5.5.
  • S. A. Gelman and C. H. Legare (2011) Concepts and folk theories. Annual review of anthropology 40, pp. 379–398. Cited by: §2.
  • J. Grudin and R. Jacques (2019) Chatbots, humbots, and the quest for artificial general intelligence. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA, pp. 209:1–209:11. External Links: ISBN 978-1-4503-5970-2, Link, Document Cited by: §1, §3.1, §3.2.
  • J. Hartmann, A. De Angeli, and A. Sutcliffe (2008a) Framing the user experience: information biases on website quality judgement. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’08, New York, NY, USA, pp. 855–864. External Links: ISBN 978-1-60558-011-1, Link, Document Cited by: §2.
  • J. Hartmann, A. De Angeli, and A. Sutcliffe (2008b) Framing the user experience: information biases on website quality judgement. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 855–864. Cited by: §1, §2.2.
  • P. J. Hinds, T. L. Roberts, and H. Jones (2004) Whose job is it anyway? a study of human-robot interaction in a collaborative task.. Cited by: §4.4.
  • A. Ho, J. Hancock, and A. S. Miner (2018) Psychological, relational, and emotional effects of self-disclosure after conversations with a chatbot. Journal of Communication 68 (4), pp. 712–733. Cited by: §1, §3.2.
  • E. Hunt (2016) Tay, microsoft’s ai chatbot, gets a crash course in racism from twitter. The Guardian 24. Cited by: §1, §3.
  • M. Jain, P. Kumar, R. Kota, and S. N. Patel (2018) Evaluating and informing the design of chatbots. In Proceedings of the 2018 Designing Interactive Systems Conference, pp. 895–906. Cited by: §2.2, §7.4.
  • M. Jakesch, M. French, X. Ma, J. T. Hancock, and M. Naaman (2019) AI-mediated communication: how the perception that profile text was written by ai affects trustworthiness. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 239. Cited by: §1.
  • C. M. Judd, L. James-Hawkins, V. Yzerbyt, and Y. Kashima (2005) Fundamental dimensions of social judgment: understanding the relations between judgments of competence and warmth.. Journal of personality and social psychology 89 (6), pp. 899. Cited by: §3.3.
  • R. Kimball and B. V. E. Harslem (1982) Designing the star user interface. Byte 7 (1982), pp. 242–282. Cited by: §2.1, §2.1.
  • K. J. Klaaren, S. D. Hodges, and T. D. Wilson (1994) The role of affective expectations in subjective experience and decision-making. Social Cognition 12 (2), pp. 77–101. Cited by: §1, §2.
  • R. Kocielnik, S. Amershi, and P. N. Bennett (2019) Will you accept an imperfect ai?: exploring designs for adjusting end-user expectations of ai systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA, pp. 411:1–411:14. External Links: ISBN 978-1-4503-5970-2, Link, Document Cited by: §1, §1, §2.1, §2.
  • S. Kujala, R. Mugge, and T. Miron-Shatz (2017) The role of expectations in service evaluation: a longitudinal study of a proximity mobile payment service. International Journal of Human-Computer Studies 98, pp. 51–61. Cited by: §1, §2, 1st item, 3rd item, §4.2.
  • E. Kuyda and P. Dudchuk (2015) Replika [computer program]. External Links: Link Cited by: §1.
  • I. Lage, A. Ross, S. J. Gershman, B. Kim, and F. Doshi-Velez (2018) Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems, pp. 10159–10168. Cited by: §1.
  • G. Lakoff and M. Johnson (2008) Metaphors we live by. University of Chicago press. Cited by: §2.1, §2.1, §2.
  • J. Lee (2018) Chatbots were the next big thing: what happened?. External Links: Link Cited by: §1.
  • M. K. Lee, S. Kiesler, and J. Forlizzi (2010) Receptionist or information kiosk: how do people talk with a robot?. In Proceedings of the 2010 ACM conference on Computer supported cooperative work, pp. 31–40. Cited by: §7.1.
  • Q. V. Liao, M. Hussain, P. Chandar, M. Davis, Y. Khazaeni, M. P. Crasso, D. Wang, M. Muller, N. S. Shami, W. Geyer, et al. (2018) All work and no play? Conversations with a question-and-answer chatbot in the wild. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 3. Cited by: §2.2.
  • E. Luger and A. Sellen (2016) Like having a really bad pa: the gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5286–5297. Cited by: §1, §2.2, §2, §7.4.
  • M. S. McGlone (1996) Conceptual metaphors and figurative language interpretation: food for thought?. Journal of memory and language 35 (4), pp. 544–565. Cited by: §1.
  • J. Michalco, J. G. Simonsen, and K. Hornbæk (2015) An exploration of the relation between expectations and user experience. International Journal of Human-Computer Interaction 31 (9), pp. 603–617. Cited by: §2.2.
  • H. Mieczkowski, S. X. Liu, J. Hancock, and B. Reeves (2019) Helping not hurting: applying the stereotype content model and bias map to social robotics. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Vol. , pp. 222–229. External Links: Document, ISSN 2167-2148 Cited by: 3rd item.
  • C. I. Nass and S. Brave (2005) Wired for speech: how voice activates and advances the human-computer relationship. MIT press Cambridge, MA. Cited by: §2.1, §2, §7.2.
  • D. A. Norman (1988) The psychology of everyday things.. Basic books. Cited by: §1.
  • A. Nutt (2017) The woebot will see you now. the rise of chatbot therapy: Washington Post. Cited by: §1, §3.
  • J. Park, R. Krishna, P. Khadpe, L. Fei-Fei, and M. Bernstein (2019) AI-based request augmentation to increase crowdsourcing participation. In AAAI Conference on Human Computation and Crowdsourcing, Cited by: §7.4.
  • J. W. Pennebaker, M. E. Francis, and R. J. Booth (2001) Linguistic inquiry and word count: liwc 2001. Mahway: Lawrence Erlbaum Associates 71 (2001), pp. 2001. Cited by: 1st item.
  • E. Raita and A. Oulasvirta (2011) Too good to be bad: favorable product expectations boost subjective usability ratings. Interacting with Computers 23 (4), pp. 363–371. Cited by: §1, §2.2, §2.
  • B. Reeves and C. I. Nass (1996) The media equation: how people treat computers, television, and new media like real people and places.. Cambridge university press. Cited by: §2.2.
  • C. Rudin (2018) Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154. Cited by: §1.
  • P. Salvini, G. Ciaravella, W. Yu, G. Ferri, A. Manzi, B. Mazzolai, C. Laschi, S. Oh, and P. Dario (2010) How safe are service robots in urban environments? bullying a robot. In 19th International Symposium in Robot and Human Interactive Communication, pp. 1–7. Cited by: §2.1.
  • C. Sandvig (2015) Seeing the sort: the aesthetic and industrial defense of “the algorithm.”. Journal of the New Media Caucus, 11, 35-51. Cited by: §1.
  • A. P. Saygin, T. Chaminade, H. Ishiguro, J. Driver, and C. Frith (2011) The thing that should not be: predictive coding and the uncanny valley in perceiving human and humanoid robot actions. Social cognitive and affective neuroscience 7 (4), pp. 413–422. Cited by: §7.3.
  • R. Sease (2008) Metaphor’s role in the information behavior of humans interacting with computers. Information technology and libraries 27 (4), pp. 9–16. Cited by: §2.1, §2.
  • J. Seering, M. Luria, C. Ye, G. Kaufman, and J. Hammer (2020) It takes a village: integrating an adaptive chatbot into an online gaming community. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20. Cited by: §7.2.
  • A. Shamekhi, Q. V. Liao, D. Wang, R. K. Bellamy, and T. Erickson (2018) Face value? exploring the effects of embodiment for a group facilitation agent. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 391. Cited by: §1.
  • M. Sherif, D. Taub, and C. I. Hovland (1958) Assimilation and contrast effects of anchoring stimuli on judgments.. Journal of experimental psychology 55 (2), pp. 150. Cited by: §1, §2.2, §2.2.
  • H. Shum, X. He, and D. Li (2018) From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 10–26. Cited by: §1, §3.
  • M. Strait, V. Contreras, and C. D. Vela (2018a) Verbal disinhibition towards robots is associated with general antisociality. arXiv preprint arXiv:1808.01076. Cited by: §2.1.
  • M. Strait, A. S. Ramos, V. Contreras, and N. Garcia (2018b) Robots racialized in the likeness of marginalized social identities are subject to greater dehumanization than those racialized as white. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 452–457. Cited by: §2.1.
  • I. M. Thies, N. Menon, S. Magapu, M. Subramony, and J. O’neill (2017) How do you want your chatbot? an exploratory wizard-of-oz study with young, urban indians. In IFIP Conference on Human-Computer Interaction, pp. 441–459. Cited by: §1, §2.1, §3.2.
  • A. Todorov, C. P. Said, A. D. Engell, and N. N. Oosterhof (2008) Understanding evaluation of faces on social dimensions. Trends in cognitive sciences 12 (12), pp. 455–460. Cited by: §3.3.
  • P. Van Schaik and J. Ling (2008) Modelling user experience with web sites: usability, hedonic value, beauty and goodness. Interacting with computers 20 (3), pp. 419–432. Cited by: §2.2.
  • P. Wallis and E. Norling (2005) The trouble with chatbots: social skills in a social world. Virtual Social Agents 29. Cited by: §1, §3.2.
  • M. E. Whiting, G. Hugh, and M. S. Bernstein (2019) Fair work: crowd work minimum wage with one line of code. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7, pp. 197–206. Cited by: §4.3.
  • T. D. Wilson, D. J. Lisle, D. Kraft, and C. G. Wetzel (1989) Preferences as expectation-driven inferences: effects of affective expectations on affective experience.. Journal of personality and social psychology 56 (4), pp. 519. Cited by: §1, §2.
  • B. Wojciszke (2005) Affective concomitants of information on morality and competence. European psychologist 10 (1), pp. 60–70. Cited by: §3.3.
  • V. Woollaston (2016) Following the failure of tay, microsoft is back with new chatbot zo. Wired. Cited by: §3.
  • S. Worswick (2015) Mitsuku [computer program]. Cited by: §1, §3.
  • S. Worswick (2018) The curse of the chatbot users. Cited by: §7.4.
  • E. Y. Wu, E. Pedersen, and N. Salehi (2019) Agent, gatekeeper, drug dealer: how content creators craft algorithmic personas. Proceedings of the ACM on Human-Computer Interaction 3 (CSCW), pp. 219. Cited by: §2.1, §7.1.
  • J. Xiao, J. Stasko, and R. Catrambone (2004) An empirical study of the effect of agent competence on user performance and perception. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 1, pp. 178–185. Cited by: §2.
  • J. Zamora (2017) I’m sorry, dave, i’m afraid i can’t do that: chatbot perception and expectations. In Proceedings of the 5th International Conference on Human Agent Interaction, HAI ’17, New York, NY, USA, pp. 253–260. External Links: ISBN 978-1-4503-5113-3, Link, Document Cited by: §1, §2, §7.4.
  • M. P. Zanna and D. L. Hamilton (1972) Attribute dimensions and patterns of trait inferences. Psychonomic Science 27 (6), pp. 353–354. Cited by: §3.3.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2018) The design and implementation of xiaoice, an empathetic social chatbot. CoRR abs/1812.08989. External Links: Link, 1812.08989 Cited by: §7.4.