Explaining Reinforcement Learning to Mere Mortals: An Empirical Study

03/22/2019 ∙ by Andrew Anderson, et al. ∙ Oregon State University

We present a user study investigating the impact of explanations on non-experts' understanding of reinforcement learning (RL) agents. We investigate both a common RL visualization, saliency maps (the focus of attention), and a more recent explanation type, reward-decomposition bars (predictions of future types of rewards). We designed a 124-participant, four-treatment experiment to compare participants' mental models of an RL agent in a simple Real-Time Strategy (RTS) game. Our results show that the combination of both saliency and reward bars was needed to achieve a statistically significant improvement in mental model score over the control. In addition, our qualitative analysis of the data reveals a number of effects for further study.


1 Introduction

Although eXplainable Artificial Intelligence (XAI) has seen increasing interest as AI becomes more pervasive in society, much of XAI work does not attend to the people who consume explanations. In this paper, we draw upon work that does, which introduced four principles for explaining AI systems to people who are not AI experts [Kulesza et al.2015]. These principles were: be iterative, be sound, be complete, and do not overwhelm the user, where the notions of soundness and completeness are analogous to "the whole truth (completeness) and nothing but the truth (soundness)."

Empirical results showed that explanations adhering to these principles enabled non-AI experts to build higher-fidelity mental models of the agent than non-AI experts who received less sound/complete explanations [Kulesza et al.2015]. People's mental models, in the context of XAI, are their understanding of how the agent works. More formally, mental models are "internal representations that people build based on their experiences in the real world" [Norman and Gentner1983]. People's mental models vary in complexity and accuracy, but a good mental model will enable a person to understand system behavior, and a very good one will enable them to predict future behaviors.

In this paper, we investigate how people’s mental models of a reinforcement-learning agent vary in response to different explanation styles–saliency maps showing where the agent is “looking,” and reward decomposition bars showing the agent’s current prediction of its future score. To do so, we conducted a controlled lab study with 124 participants across four treatments (saliency, rewards, both, neither), and measured both their understanding of the agent and their ability to predict its decisions. Our investigation was in the context of Real-Time Strategy (RTS) games.

However, publicly available RTS games have stringent time constraints, complex concepts, and myriad decisions, which would have introduced too many confounding variables for a controlled study. For example, we needed each participant to consider the same set of decisions. Thus, we built our own game, inspired by RTS, which we describe later.

In this context, we structured our investigation around the following research questions:

  • RQ-Describe - Which treatment is better (and how) at enabling people to describe how the system works?

  • RQ-Predict - Which treatment is better (and how) at enabling people to predict what the system will do?

2 Background & Related Work

We focus on model-free RL agents that learn a Q-function Q(s, a) to estimate the expected future cumulative reward of taking action a in state s. After learning, the agent greedily selects actions according to Q, i.e. selecting argmax_a Q(s, a) in state s. RL agents are typically trained with scalar rewards, leading to scalar Q-values. While a human can compare the scalars to see how much the agent prefers one action over another, the scalars give no insight into the cost/benefit factors contributing to action preferences.

Reward Decomposition. We draw on work by [Erwig et al.2018] that exploited the fact that rewards can typically be grouped into semantically meaningful types. For example, in RTS games, reward types might be "enemy damage" (positive reward) or "ally damage" (negative reward). Reward decomposition exposes reward types to an RL agent by specifying a set of K types and letting the agent observe, at each step, a K-dimensional decomposed reward vector R(s, a) = (R_1(s, a), …, R_K(s, a)), which gives the reward for each type. The total scalar reward is the sum across types, i.e. R(s, a) = Σ_k R_k(s, a). The learning objective is still to maximize the long-term scalar reward.

By leveraging the extra type information in R, the RL agent can learn a decomposed Q-function (Q_1, …, Q_K), where each component Q_k is a Q-value that only accounts for rewards of type k. Using the definition of R, the overall scalar Q-function can be shown to be the sum of the component Q-functions, i.e. Q(s, a) = Σ_k Q_k(s, a). Prior work has shown how to learn the Q_k via a decomposed SARSA algorithm [Russell and Zimdars2003, Erwig et al.2018].
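To make the decomposition concrete, here is a minimal tabular sketch of the idea (our own illustration, not the authors' code; the paper's agent uses neural networks, as described later). States are assumed to be hashable, and the class and argument names are ours.

    import numpy as np
    from collections import defaultdict

    class DecomposedSarsaAgent:
        """Tabular sketch of decomposed SARSA: one Q-table per reward type.
        Action selection is greedy over the SUM of the component Q-values."""

        def __init__(self, n_actions, reward_types, alpha=0.1, gamma=0.9, epsilon=0.1):
            self.n_actions = n_actions
            self.reward_types = reward_types
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            # Q[k][state] -> vector of action values for reward type k.
            self.Q = {k: defaultdict(lambda: np.zeros(n_actions)) for k in reward_types}

        def q_total(self, state):
            # Overall scalar Q(s, .) is the sum of the component Q_k(s, .).
            return sum(self.Q[k][state] for k in self.reward_types)

        def act(self, state):
            # epsilon-greedy over the summed Q-values while learning;
            # epsilon = 0 recovers the greedy post-training policy.
            if np.random.rand() < self.epsilon:
                return np.random.randint(self.n_actions)
            return int(np.argmax(self.q_total(state)))

        def update(self, state, action, decomposed_reward, next_state, next_action):
            # One SARSA backup per reward type, all using the SAME next action a'.
            for k in self.reward_types:
                target = decomposed_reward[k] + self.gamma * self.Q[k][next_state][next_action]
                self.Q[k][state][action] += self.alpha * (target - self.Q[k][state][action])

At each step, the environment would return a per-type reward dict, and update would be called with the state/action pair actually taken next, keeping the learning on-policy.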

Before [Erwig et al.2018], others considered using reward decomposition [Van Seijen et al.2017, Russell and Zimdars2003]—but for speeding up learning. Our focus here is on its visual explanation value. For a state of interest, the decomposed Q-function can be visualized for each action as a set of "reward bars", one bar for each component. By comparing the bars of two actions, a human can gain insight into the trade-offs responsible for the agent's preference.

Saliency Visualization. To gain further insight into the agent's action choices, a human may want to know which parts of the agent's input were most important to the value computed for a reward bar (i.e. a particular Q_k(s, a)). Such information is often visualized via saliency maps over the input. Our agent uses neural networks to represent the component Q-functions, letting us draw on the many neural network saliency techniques (e.g. [Simonyan et al.2013, Zeiler and Fergus2014, Springenberg et al.2014, Zhang et al.2018]). While there have been a number of comparison and utility studies (e.g. [Riche et al.2013, Ancona et al.2017, Kim et al.2018, Adebayo et al.2018]), there is no clear consensus on the best approach.

After exploring various techniques, we developed a simple saliency approach, modified from [Fong and Vedaldi2017]'s work on image classification, which we found to be effective in our RTS environment (details are given later). Since the network may "focus" on different parts of the input for each reward bar, we compute saliency maps for each bar, which are available to the UI for visualization.

3 Methodology

We performed an in-lab study using a between-subjects design with explanation style (Control, Saliency, Rewards, Everything) as the independent variable. Our dependent variable was the quality of participants' mental models, measured from two main data sources: (1) answers to a post-task question, and (2) the accuracy of participants' predictions of the agent's selected action at each decision point (DP).

We ran an ablation study, measuring the impact of each explanation type by adding or removing it, as shown in Figure 1. Thus, Everything - Rewards - Saliency = Control, as follows:
1. Control participants saw only the agent's actions, their consequences on the game state, the score, and the question area (Regions 1 & 4).
2. Saliency participants saw Regions 1 & 4 plus the saliency maps (Region 2), allowing them to infer intention from gaze [Newn et al.2016].
3. Rewards participants saw Regions 1 & 4 plus the reward decomposition bars (Region 3), allowing them to see the agent's cost/benefit analysis.
4. Everything participants saw all regions.

Figure 1: The interface the Everything participants saw. Region 1: game map, which we expand on in Figure 3. Region 2: saliency maps. Region 3: reward decomposition bars for each action. Region 4: participant question/response area.

3.1 Participants and Procedures

We selected 124 participants from 208 survey respondents at X University. Since we were interested in AI non-experts, our selection criteria excluded Computer Science majors and anyone who had taken an AI course. We assigned the participants to a two-hour in-lab session based on their availability and randomly assigned a treatment to each session.

We collected the following demographics: major and experience with RTS games (Table 1). 78% of our participants were "Gamers," defined as those with 10+ hours of RTS experience, consistent with prior research [Penney et al.2018]. Gamers were spread evenly enough across treatments that it was unnecessary to control for this factor statistically (Figure 2).

Academic Discipline Participants   Gamers
Agricultural Sciences: 4 unique majors 8 2
Business: 3 unique majors 5 4
Engineering: 11 unique majors 63 56
Forestry: 3 unique majors 4 4
Science: 10 unique majors 25 20
Liberal Arts: 8 unique majors 9 6
Public Health & Human Sciences: 2 majors 5 1
Undisclosed 5 4
Totals 124 97
Table 1: Participant demographics, per academic discipline.
Figure 2: Distribution of RTS “gamers” in our study. “Gamers” are shown in grey, others in white.
Figure 3: A sequence of the first three DPs of the game. For each DP (circled in red), participants saw the game map and the score (boxed in red). Next, they predicted which object the AI would choose to attack. Last, they received an explanation and could "play" the DP. At DP1, the agent chose to attack Q2, causing a score change of 121 (+21 pts for damaging it and +100 for destroying it).

We began sessions with a 20-minute, hands-on tutorial on the system/game, with 3 practice DPs. Since participants were AI non-experts, we described saliency maps as "…like where the eyeballs of the AI fall" and reward bars as "…the AI's prediction for the score it will receive in the future."

During the main tasks, participants examined the game state at each DP and predicted what the agent would attack and why. Then, they saw the agent's actual choice, along with an explanation. Participants had 12 minutes to complete DP1 and 8 minutes per DP for the remaining 13. Finally, participants filled out a questionnaire.

Figure 4: The objects appearing in our game states. Enemy objects were black, and allied objects were white.

3.2 System Overview

Popularly available RTS games have an enormous action space – [Vinyals et al.2019] estimate roughly 10^26 possible actions at each step for StarCraft II. With so many possibilities, it is not surprising that researchers have reported large differences in individual participants' focus, leading them to notice different decisions [Dodge et al.2018, Penney et al.2018]. To avoid this, we built a game with a tightly controlled action space.

In our game, the agent’s goal was to maximize its score over each task (Figure 3), subject to the following rules:

  • Only Forts/Tanks could attack objects (Figure 4).

  • At each DP, the agent had to attack one of the quadrants.

  • If agent/friendlies were damaged/destroyed, it lost points.

  • If enemies were damaged/destroyed, it gained points.

  • Once the agent killed something, it “respawned” on a new map, carrying over its health.

3.2.1 The Reward Decomposition Implementation

The agent used six reward types to learn its decomposed Q-function: {Enemy Fort Damaged, Enemy Fort Destroyed, Friendly Fort Damaged, Friendly Fort Destroyed, Town/City Damaged, Town/City Destroyed}. The RL agent used a neural network representation of each Q_k. For each reward type k, there was a separate network for Q_k, which took the state description as input: seven 40x40 greyscale image layers, each representing information about the state: {Health Points (HP), enemy tank, small forts, big forts, towns, cities, and friend/enemy}. The overall scalar Q-values used for action selection were the sum of the Q_k values. The agent trained using the decomposed SARSA learning algorithm with a discount factor of 0.9, a learning rate of 0.1, and ε-greedy exploration (ε decayed from 0.9 to 0.1). It trained for 30,000 games, at which point it demonstrated high-quality actions.
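A minimal sketch of how the per-type networks and summed action selection could be wired up, assuming a small convolutional network over the 7-layer 40x40 input. The layer sizes, activations, and four-action output (one attack per quadrant) are our assumptions; the paper does not give architecture details.

    import torch
    import torch.nn as nn

    REWARD_TYPES = ["enemy_fort_damaged", "enemy_fort_destroyed",
                    "friendly_fort_damaged", "friendly_fort_destroyed",
                    "town_city_damaged", "town_city_destroyed"]
    N_ACTIONS = 4  # assumed: one attack action per quadrant

    class ComponentQNetwork(nn.Module):
        """One Q_k head; input is the 7-layer, 40x40 greyscale state stack."""
        def __init__(self, n_actions=N_ACTIONS):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(7, 16, kernel_size=5, stride=2), nn.ReLU(),   # 40x40 -> 18x18
                nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),  # 18x18 -> 8x8
                nn.Flatten(),
                nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
                nn.Linear(128, n_actions),
            )

        def forward(self, state):      # state: (batch, 7, 40, 40)
            return self.net(state)     # (batch, n_actions) component Q_k values

    # One network per reward type; the agent acts on the SUM of their outputs.
    q_networks = {k: ComponentQNetwork() for k in REWARD_TYPES}

    def greedy_action(state):          # state: (1, 7, 40, 40) tensor for a single DP
        with torch.no_grad():
            q_total = sum(net(state) for net in q_networks.values())
        return int(q_total.argmax(dim=1).item())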

3.2.2 The Saliency Map Implementation

Given a state s, our perturbation-based approach produced a saliency map for each bar by giving Q_k data representing the true state s and a perturbed state s′ (close to s), then subtracting the outputs for the two states. A large output difference meant the system was more sensitive to the perturbed part of the state–indicating importance, which we showed with a brighter color. We chose a heated object scale, since [Newn et al.2017] found it to be the most understandable for their participants. Our perturbations modify properties of objects in the game state and thus modify groups of pixels, not individual pixels.

Each of the perturbations represented a semantically meaningful operation:
1. Tank Perturbation. If a tank was present, we removed it by zeroing out its pixels in the tank layer.
2. Friend/Enemy Perturbation. Transform an object from friend to enemy by moving the friend-layer pixels to the enemy layer (and vice versa).
3. Size Perturbation. Transform an object from big to small (or vice versa) by moving the pixels from one size layer to the other.
4. City/Fort Perturbation. Similarly transform whether an object is a City or a Fort.
5. HP Perturbation. Since HP is real-valued, it is treated differently: we perturbed the object's HP value by a small amount (30%).
These operations were represented in five saliency maps: HP, Tank, Size, City/Fort, & Friend/Enemy.

To make the saliency maps comparable across types, we found the maximum saliency value in each map for each reward type & class from 16,855 episodes. Normalizing each of the 5 maps by this value ensured that each saliency map pixel's value is in [0, 1].
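The perturbation-and-difference recipe above, together with the per-map normalization, might look roughly like the sketch below. The function signature, the mask/perturbation representation, and the q_k wrapper are all our assumptions about plumbing the paper does not spell out.

    import numpy as np

    def saliency_for_component(q_k, state, perturb_fns, norm_constants):
        """Perturbation-based saliency sketch for one component Q-value.

        q_k            : callable mapping a state array to a scalar Q_k(s, a) for the
                         action under inspection (assumed to be wrapped elsewhere).
        perturb_fns    : dict of {map_name: list of (object_mask, perturb) pairs},
                         one entry per map (HP, Tank, Size, City/Fort, Friend/Enemy).
        norm_constants : dict of per-map maxima gathered offline, so every pixel
                         value lands in [0, 1].
        """
        base_value = q_k(state)
        maps = {}
        for name, perturbations in perturb_fns.items():
            saliency = np.zeros(state.shape[-2:])            # one 40x40 map
            for object_mask, perturb in perturbations:
                perturbed_state = perturb(state.copy())      # e.g. remove tank, flip friend/enemy
                # Sensitivity = how much the component Q-value moves under the perturbation.
                difference = abs(base_value - q_k(perturbed_state))
                saliency[object_mask] = difference           # colour the object's pixels
            maps[name] = saliency / norm_constants[name]     # normalise into [0, 1]
        return maps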

4 Results: People Describing the AI

To elicit participants’ understanding of the agent’s decision making (RQ-Describe), we qualitatively analyzed participants’ answers to the end-of-session question: “Please describe the steps in the agent’s approach or method…” [Lippa et al.2008]. Figure 5 shows an example response.

The agent began worried about damaging its allies…focused little on its own health and made decisions with respect to its allies… By DP3 it actually assigned a positive point value to destroying itself in the long term because it so heavily weighted potential damage to allies. This is because as its health dropped, it would only be able to attack allies in order to stay alive which would cause a massive penalty. Therefore, the agent decided to always attack the largest base with the most health so that it would take the most damage which would benefit allies in the long run.  (E23)

Figure 5: Top scoring mental model question response. The highlighted portions illustrate both “basic” and “extra credit” concepts, some of which are described in Table 2.

First, two researchers independently coded 20% of the data with 18 codes and reached agreement on them (measured with the Jaccard index). This process is consistent with [Hughes and Parkes2003]. Given this level of reliability, one researcher coded the rest.
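For reference, the Jaccard index used for inter-rater agreement is simply the size of the intersection of the two coders' code sets over the size of their union; a small sketch (the example code names are drawn from Table 2):

    def jaccard_agreement(coder_a, coder_b):
        """Jaccard index between two coders' code sets for one response:
        |intersection| / |union| of the codes each applied."""
        a, b = set(coder_a), set(coder_b)
        if not a and not b:
            return 1.0   # both applied no codes: treat as full agreement
        return len(a & b) / len(a | b)

    # Example: agreement on a single response coded by two researchers.
    print(jaccard_agreement({"Maximize Score", "Forward Looking"},
                            {"Maximize Score", "Episode Over"}))   # 1/3 ≈ 0.33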

In parallel with this process, we generated a rubric of scores to associate with each code. The codes representing the agent's four basic concepts were each worth 25% (e.g., its score-maximization objective). Extra nuances in participants' descriptions earned small additions of "extra credit" (e.g., saying the agent maximized its future score), and errors earned small deductions (e.g., saying it tried to preserve its HP). Experimenting with different values for the small extras and errors had little effect on comparisons of score distributions among treatments. Figure 6 shows the score distribution.
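As an illustration of the rubric's arithmetic only: a hypothetical scorer in which four basic-concept codes are worth 25 points each, with small (here, 5-point) extras and deductions. Only the score-maximization concept is named in the paper; the other concept labels and the exact 5-point values are our assumptions, and the paper notes that varying the small values had little effect.

    # Hypothetical rubric sketch. Only "maximize_score" is named in the paper;
    # the other basic-concept labels and the +/-5 point values are illustrative.
    BASIC_CONCEPTS = {"maximize_score", "basic_concept_2", "basic_concept_3", "basic_concept_4"}

    def mental_model_score(codes_present, extra_credit=0, errors=0):
        """25 points per basic concept mentioned, plus small extras, minus small deductions."""
        base = 25 * len(set(codes_present) & BASIC_CONCEPTS)
        return base + 5 * extra_credit - 5 * errors

    # A response covering two basic concepts with one "extra credit" nuance:
    print(mental_model_score({"maximize_score", "basic_concept_3"}, extra_credit=1))  # 55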

Code Count Definition
Maximize Score 46 The agent’s overall objective is to maximize its long term score.
Forward Looking 13 The AI looks towards future instances when accounting for the action that it takes now.
Paranoia 8 The AI is paranoid about extending its life too much, expecting penalties when it should not.
Episode Over 15 When the AI is nearing death, it behaves differently than it has in previous decision points.
Table 2: The four mental model codes revealing particularly interesting differences in nuances of participants’ mental models.

4.1 The More, the Better?

As Figure 6 shows, Everything participants had significantly better mental model scores than Control participants (ANOVA, F = 8.369, df = (1,59), p = .005). (We consider p < .05 as significant and p < .10 as marginally significant, by convention [Cramer and Howitt2004]; note that all our ANOVAs are pairwise.) One possible interpretation is that the Everything participants' performance was due to receiving the most sound and complete explanation, consistent with [Kulesza et al.2015]'s results.

However, another possibility is that the participants in the Everything treatment were benefiting from only one of the explanation types, and that the other type was making little difference. Thus, we isolated each explanation type.

To isolate the effect of the reward bars, we compared all participants who saw the decomposed reward bars (the Rewards and Everything treatments) with those who did not. As the left side of Figure 7 illustrates, participants who saw reward bars had significantly better mental model scores than those who did not (ANOVA, F = 6.454, df = (1,122), p = .0123). Interestingly, isolating the effect of saliency produced a similar pattern. As the right side of Figure 7 illustrates, those who saw saliency maps (the Saliency and Everything treatments) had marginally better mental model scores (ANOVA, F = 3.001, df = (1,122), p = .0858). This suggests that each component brought its own strengths.
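These pairwise comparisons correspond to a one-way ANOVA on two groups (with df = (1, N-2)); a sketch of how one could run them, using placeholder score lists rather than the study's data:

    import numpy as np
    from scipy import stats

    # Placeholder mental-model scores per treatment (NOT the study's data).
    control    = np.array([40, 55, 35, 60, 45, 50])
    saliency   = np.array([50, 45, 65, 55, 60, 40])
    rewards    = np.array([75, 60, 85, 55, 70, 65])
    everything = np.array([80, 65, 70, 90, 55, 75])

    # Pairwise comparison, e.g. Everything vs. Control.
    f_stat, p_value = stats.f_oneway(everything, control)
    print(f"Everything vs Control: F = {f_stat:.3f}, p = {p_value:.4f}")

    # Isolating one explanation type pools treatments, e.g. reward-bar seers vs. not.
    saw_rewards    = np.concatenate([rewards, everything])
    no_reward_bars = np.concatenate([control, saliency])
    print(stats.f_oneway(saw_rewards, no_reward_bars))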

Figure 6: The participants' final mental model scores. Box colors from top to bottom: Everything, Rewards, Saliency, and Control.
Figure 7: Same data as Figure 6. Left: Mental model scores for those who saw rewards (top) and those who did not. Right: Same data, but those who saw saliency (top) and those who did not.

4.2 Different Explanations, Different Strengths

Four of the 18 codes in our mental model codeset revealed nuanced differences among treatments in the participants’ understanding of the agent. Table 2 lists these four codes.

Participants who saw rewards (Rewards and Everything) often mentioned that the agent was driven by its objective to maximize its score (Table 2's Maximize Score). Over 3/4 (36 out of 46) of the people who mentioned this were in treatments that saw rewards. For example: "The agent always tried to get as high a possible total sum of all rewards as possible. It valued allies getting damaged in the future as a rather large negative value, and dealing damage and killing enemy forts as rather high positive values." (R81) (the first letter of a participant ID denotes treatment: Control, Saliency, Rewards, or Everything), and: "These costs and rewards are then summed up into an overall cost/reward value, and this value is then used to dictate the agent's action; whichever overall value is greater will be the action that the agent takes." (E14)

Some participants who saw rewards also mentioned the nuance that the agent’s interest was in its future score (Table 2’s Forward Looking), not the present one: “The AI simply takes in mind the unknown of the future rounds and keeps itself in range to be destroyed ‘quickly’ if a future city is under attack…”  (E83). Over 2/3 of the participants (9 out of 13) who pointed out this nuance saw decomposed reward bars.

Figure 8: Percentage of participants who successfully predicted the AI's next move at each decision point (DP). Bar colors denote treatment (from left to right): Control, Saliency, Rewards, and Everything. Participants' results varied markedly for the different situations these DPs captured, and there is no evidence that any of the treatments got better over time.

Even more subtle was the agent's paranoia (Table 2's Paranoia). It had learned Q-value components that reflected a paranoia about receiving negative rewards for attacking its own friendly units. Specifically, even though the learned greedy policy appeared to never attack a friendly unit unless there was no other option, in many cases the Q-components for friendly damage were highly negative even for actions that attacked enemies. After investigating, we determined that this was a result of learning via the on-policy SARSA algorithm, which learns while it explores. (SARSA learns the value of the ε-greedy exploration policy, which can randomly attack friendly units. Thus, the learned Q-values reflect those random future negative rewards. However, after learning, exploration stops and friendlies are not randomly attacked.)

This paranoia can be a type of "bug" in the agent's value estimates. The only 8 participants in the entire study who pointed out this bug were participants who saw rewards. For example: "The AI appears to be afraid of what might happen if a map is generated containing four [friendly] forts or something, in which it can do a lot of damage to itself." (R73).
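The mechanism behind this "paranoia" can be seen by comparing on-policy and off-policy backup targets. The sketch below uses an expected-SARSA-style view of the ε-greedy target for a friendly-damage component; the numbers are made up for illustration and are not the agent's actual rewards.

    import numpy as np

    def sarsa_expected_target(q_next, reward, gamma=0.9, epsilon=0.1):
        """Expected-SARSA-style view of the on-policy target: the next-state value
        averages over epsilon-greedy behaviour, so occasional random attacks on
        friendly units (large negative component rewards) leak into Q_k."""
        greedy = np.max(q_next)
        exploratory = np.mean(q_next)                        # uniform random action
        next_value = (1 - epsilon) * greedy + epsilon * exploratory
        return reward + gamma * next_value

    def q_learning_target(q_next, reward, gamma=0.9):
        """Off-policy target: ignores exploration, so it shows no such 'paranoia'."""
        return reward + gamma * np.max(q_next)

    # Friendly-damage component: attacking an enemy is worth 0 here, but a random
    # exploratory attack on a friendly unit costs -100 in this component.
    q_next = np.array([0.0, 0.0, 0.0, -100.0])
    print(sarsa_expected_target(q_next, reward=0.0))   # -2.25: negative despite a "safe" action
    print(q_learning_target(q_next, reward=0.0))       # 0.0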

On the other hand, participants who saw saliency maps (Saliency + Everything) had a different advantage over the others–noticing how the agent changed behavior when it thought it was going to die (Episode Over). For example, it tended to embark on “suicide” missions at the end of a task when its health was low. About 2/3 (10 out of 15) of the participants who talked about such behaviors were those who saw saliency maps. As one participant put it: “If it cannot take down any structures, it will throw itself to wherever it thinks it will deal the most damage.” (S74).

4.3 Discussion: Which explanation?

On the surface, Section 4.1 suggests that, in explainable systems, the more explanation we give people, the better. However, Section 4.2 suggests that the question of which explanation or combination of explanations is better is more complex – each type has different strengths, which may matter differently in different situations. To investigate how situational an explanation type’s strength is, we turn next to a qualitative view of how participants fared in individual situations, which we captured with their predictions at each DP.

5 Results: People Predicting the AI

Participants' predictions of the agent's action at each DP provided us with in situ data [Muramatsu and Pratt2001]. The state for each DP is depicted in Table 3. As Figure 8 shows, participants' ability to predict the agent's behavior varied widely, but qualitative analysis revealed several phenomena that explain why.

Task 1: DP1–DP4
Task 2: DP5–DP8
Task 3: DP9–DP11
Task 4: DP12–DP14
Table 3: The tasks and their DPs. The action the AI chose at each DP is highlighted in green.

5.1 Help! The choice is counter-intuitive

Situations where the agent went against participants' intuitions proved confusing. These cases all had low accuracy, with all treatments' accuracy below random guessing (25%).

One of these situations was the agent choosing neither the strongest nor weakest of similar enemies (DPs 10,12). When the Everything participants got it right, their comments suggest they combined both saliency and rewards into their reasoning; e.g.,  (E71) for DP10: “As it will look at the HP of the tank more it will not attack Q4 instead it will go for Q1 which will give it enough benefit but also maintain its HP.”

However, all of the participants in the Rewards treatment got DP10 wrong, suggesting that they needed the saliency maps to factor in how much the AI focuses on its own HP, which was key in this situation. For example: “Lowest HP out of the 3 big fort.” (R94).

A second situation that was counter-intuitive to participants was the agent choosing to attack an enemy elsewhere over saving a friend. The worst accuracy for this type was at DP4: 77% of the participants got it wrong. They incorrectly predicted it would attack the enemy tank, citing its health: "This is the enemy object with the lowest value for HP." (S18) or its threat to a friend: "The enemy tank poses the greatest threat [to] allies…" (S25). Of the few participants (19 total) who predicted correctly, most (68%) were in treatments that saw reward bars, e.g.: "… destroying enemy [fort] will give you more point than destroying a tank." (R94).

5.2 Overwhelmed!

Everything participants’ predictions had the lowest accuracy in certain DPs (6,9,11), while Control had the highest. This phenomenon seems tied to Everything participants coping with too much information, highlighting the importance of balancing completeness with not overwhelming users [Kulesza et al.2015].

Some Everything participants tried to account for all the information they had seen. For example, at DP6: "I think it considers own HP first then Friend/Enemy status, so going by that it will attack Q4. Also, …it attacks enemies with more HP." (E38). Some explicitly bemoaned the complexity of the information: "It was confusing all around to figure out the main factor for movement using the maps and bars…" (E39). In contrast, participants who saw no explanations (Control) were able to apply simpler reasoning for the correct Q2 prediction at DP6: "because it is the lowest health of all of the enemy objects." (C69).

Figure 9: Average task time vs. DP, per treatment. Participants had 12 minutes for the first DP and 8 minutes for all subsequent DPs. "Σ": see text. Colors: same as Figure 8.

Participants' timing data also attest to Everything participants' burden of processing all the information (Figure 9). In the figure, "Σ" depicts how much time an Everything participant would spend if they spent as long as Control, plus the average extra time Saliency participants incurred above Control, plus the average extra time Rewards participants incurred above Control. As the figure shows, Everything participants' time to act upon their explanations exceeded the sum of acting upon the component parts at every DP. Further, since participants had time limits for each DP (not shown in the figure), some "timed out"–Everything accounted for 17 timeouts, while Saliency and Rewards accounted for a mere 2 and 8, respectively.
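To make the Σ baseline concrete, it is just an additive cost model; a tiny sketch with hypothetical per-DP averages (not the measured times):

    # Hypothetical average times (minutes) at one DP, NOT the study's data.
    control_time  = 3.0
    saliency_time = 4.0
    rewards_time  = 4.5

    # The Sigma baseline: Control, plus each explanation's extra cost over Control.
    sum_of_parts = control_time + (saliency_time - control_time) + (rewards_time - control_time)
    print(sum_of_parts)   # 5.5 -- Everything participants exceeded this baseline at every DP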

5.3 No help needed… yet

For some DPs (2, 3, 5, 13), explanations seemed unnecessary, as the Control proved "good enough" (at least 75% of participants predicted correctly). In such "easy" situations, explanations may simply interfere. However, what is easy for some may not be easy for everyone. On-demand explanations can provide more information to those who need it, without overwhelming those who do not.

5.4 Discussion: It all depends…

Participants' explanation needs depended on the situation; hence the variability illustrated in Figure 8. From a statistical perspective, putting these different situations together would have simply "canceled each other out." In retrospect, we view such wide variation as expected, given the large variability in state/action pairs combined with the noisiness of human data. The mix of quantitative and qualitative methods for RQ-Predict served us well, and we recommend it to other XAI researchers facing similarly situation-dependent data.

6 Threats to Validity

Any empirical study has threats to validity, which might skew the results towards certain conclusions [Wohlin et al.2000].

Participants’ proficiency in RTS games might have assisted in understanding the agent’s tactics. To control for this, the RTS “gamers” were fairly evenly distributed across our treatments (Figure 2). However, we did not collect many demographics, preventing us from considering other factors that may impact people’s mental models of games, such as age.

Controlled studies like this one emphasize control (careful isolation of variables) over external validity. For example, to remove uncontrolled sources of variation, we simplified our game, but this also means that these results might not hold in more complex RTS domains. We controlled the time all participants spent on each DP, but this may have impacted their mental models in two ways: limited time to examine each DP (8 minutes) and a limited number of DPs (14).

These threats can be addressed only by additional studies across a spectrum of empirical methods, to assess generality of findings across other populations and other RTS games.

7 Concluding Remarks

In this paper, we report on a mixed methods study with 124 participants with no AI background. Our goal was to investigate which of four explanation possibilities–saliency, rewards, both, or neither–would enable participants to build the most accurate mental models, and in what circumstances. Our quantitative results showed that the Everything participants, who saw both saliency and rewards explanations, scored significantly higher on their mental model descriptions than the Control participants. However, considering the qualitative results in light of the quantitative results for different DPs showed that the type of explanation that helped participants the most was very situation-dependent.

In some situations, full explanation clearly helped, allowing the Everything participants to statistically outperform the Control participants in their descriptions. Likewise, Figure 8 shows that Everything participants were top or near-top predictors in about half the DPs.

Other times, adding explanations caused issues. Some participants in all three of the explanation treatments complained about too much information. At almost every DP, Everything participants took longer than the time costs of the "sum of its parts" (Figure 9). Finally, for some DPs, Control participants, who had no explanations, succeeded at predicting the AI's move where other treatments' participants failed.

The results of our quantitative and qualitative analyses suggest several one-size-does-not-fit-all takeaway messages. First, one type of explanation does not fit all situations, as Section 5 shows. Second, one type of explanation does not fit all people, as the distribution ranges in Figure 6 show. And perhaps most critical, one type of empirical analysis (strictly quantitative or strictly qualitative) was not enough; only by combining these techniques were we able to make sense of the wide differences among individual participants at individual DPs. We believe that, only by our community applying an arsenal of empirical techniques, can we gain the rich insights needed to learn how to explain AI effectively to mere mortals.

References

  • [Adebayo et al.2018] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
  • [Ancona et al.2017] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv:1711.06104, 2017.
  • [Cramer and Howitt2004] D. Cramer and D. Howitt. The Sage dictionary of statistics: a practical resource for students in the social sciences. Sage, 2004.
  • [Dodge et al.2018] J. Dodge, S. Penney, C. Hilderbrand, A. Anderson, and M. Burnett. How the experts do it: Assessing and explaining agent behaviors in real-time strategy games. In ACM Conference on Human Factors in Computing Systems, CHI ’18. ACM, 2018.
  • [Erwig et al.2018] M. Erwig, A. Fern, M. Murali, and A. Koul. Explaining deep adaptive programs via reward decomposition. In IJCAI Workshop on Explainable Artificial Intelligence, 2018.
  • [Fong and Vedaldi2017] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2017.
  • [Hughes and Parkes2003] J. Hughes and S. Parkes. Trends in the use of verbal protocol analysis in software engineering research. Behaviour & Information Technology, 22(2), 2003.
  • [Kim et al.2018] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, volume 80. PMLR, 2018.
  • [Kulesza et al.2015] T. Kulesza, M. Burnett, W. Wong, and S. Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In ACM International Conf. on Intelligent User Interfaces, IUI ’15. ACM, 2015.
  • [Lippa et al.2008] K. D. Lippa, H. A. Klein, and V. L. Shalin. Everyday expertise: cognitive demands in diabetes self-management. Human Factors, 50(1):112–120, 2008.
  • [Muramatsu and Pratt2001] J. Muramatsu and W. Pratt. Transparent queries: investigating users' mental models of search engines. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001.
  • [Newn et al.2016] J. Newn, E. Velloso, M. Carter, and F. Vetere. Exploring the effects of gaze awareness on multiplayer gameplay. In ACM Symposium on Computer-Human Interaction in Play Companion Extended Abstracts. ACM, 2016.
  • [Newn et al.2017] J. Newn, E. Velloso, F. Allison, Y. Abdelrahman, and F. Vetere. Evaluating real-time gaze representations to infer intentions in competitive turn-based strategy games. In ACM Symposium on Computer-Human Interaction in Play. ACM, 2017.
  • [Norman and Gentner1983] D. Norman and D. Gentner. Mental models. Lawrence Erlbaum Associates, Hillsdale, NJ, 1983.
  • [Penney et al.2018] S. Penney, J. Dodge, C. Hilderbrand, A. Anderson, L. Simpson, and M. Burnett. Toward foraging for understanding of StarCraft agents: An empirical study. In ACM International Conference on Intelligent User Interfaces, IUI ’18. ACM, 2018.
  • [Riche et al.2013] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit. Saliency and human fixations: state-of-the-art and study of comparison metrics. In IEEE International Conference on Computer Vision, 2013.
  • [Russell and Zimdars2003] S. J. Russell and A. Zimdars. Q-decomposition for reinforcement learning agents. In International Conference on Machine Learning, 2003.
  • [Simonyan et al.2013] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.
  • [Springenberg et al.2014] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014.
  • [Van Seijen et al.2017] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang. Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
  • [Vinyals et al.2019] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, and R. Powell. AlphaStar: Mastering the real-time strategy game StarCraft II, Jan 2019.
  • [Wohlin et al.2000] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, 2000.
  • [Zeiler and Fergus2014] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 2014.
  • [Zhang et al.2018] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.