Semantic-Based Explainable AI: Leveraging Semantic Scene Graphs and Pairwise Ranking to Explain Robot Failures

by   Devleena Das, et al.
Georgia Institute of Technology

When interacting in unstructured human environments, occasional robot failures are inevitable. When such failures occur, everyday people, rather than trained technicians, will be the first to respond. Existing natural language explanations hand-annotate contextual information from an environment to help everyday people understand robot failures. However, this methodology lacks generalizability and scalability. In our work, we introduce a more generalizable semantic explanation framework. Our framework autonomously captures the semantic information in a scene to produce semantically descriptive explanations for everyday users. To generate failure-focused explanations that are semantically grounded, we leverage both semantic scene graphs, which extract spatial relations and object attributes from an environment, and pairwise ranking. Our results show that, compared to the existing state-of-the-art context-based explanations, these semantically descriptive explanations significantly improve everyday users' ability to both identify failures and provide assistance for recovery.


I Introduction

Increasingly, robots are being deployed in everyday environments, such as homes, hospitals, and offices, in which the robot’s primary users are everyday people rather than trained technicians [33]. Occasional robot failures are inevitable when operating in unstructured human environments, as the robot may be unable to find an object it requires, be unable to reach an object, encounter a planning error, etc. When an error occurs, everyday people in the robot’s environment are typically the first to respond, but to effectively assist in failure recovery, users must understand the robot’s behavior, its decision making, and what went wrong [6].

Research on Explainable AI (XAI) focuses on the development of techniques that increase the transparency and interpretability of complex, black box systems [1]. The vast majority of XAI techniques developed to date have been designed for experts and system developers [1, 20, 22, 9]; however, XAI systems also have the potential to explain the cause of a system error to everyday users. In particular, recent work has shown that natural language explanations are effective in improving user confidence in an AI system [7], and in improving user assistance in fault recovery [6]. In both of the above works, a contributing factor to the effectiveness of the explanations is the ability to incorporate situational or environmental context from the agent’s environment. However, these early works lack generalizability and scalability, as both techniques require all domain-specific context to be hand-annotated a priori, preventing generalization to new scenarios. This leads to the question: How can we autonomously extract contextual information grounded in an environment to provide meaningful explanations of system failures to everyday users?

In this work, we introduce a generalizable framework for explaining robot pick errors to non-expert users. Specifically, we focus on explaining pick errors that occur amidst a robot’s task plan, causing a halt in the robot’s task execution. The key innovation of our approach is the use of scene graphs to produce semantically descriptive explanations that communicate why a failure to manipulate a given object in the scene occurred. A semantic scene graph (SSG) is a data structure that represents the entities of a scene as a graph, in which objects are nodes and edges represent relationships between objects [32]. Given an image of the scene from the robot’s viewpoint, we utilize a semantic scene graph model, in conjunction with pairwise ranking [8], to produce semantically descriptive explanations. The use of scene graphs enables our approach to autonomously extract semantic context from novel scenes, thereby providing detailed explanations even for scenes and objects not previously encountered by the robot. Additionally, we expand the types of robot failures beyond those addressed in prior work [6].

Our work makes the following contributions. First, we adapt a state-of-the-art semantic scene graph model, MOTIFNET [34], to autonomously extract both inter-object spatial relations and object attribute information as contextual reasoning for robot failures in any scene. Second, we improve the semantically descriptive explanations producible through scene graphs by utilizing pairwise ranking. We show that pairwise ranking can be utilized to autonomously place attention on the parts of a scene graph output that are relevant to a given failure. As a result, our framework can produce failure-focused, semantically descriptive explanations.

We validate our approach across 4 failure types in a user study with 90 participants. Our results show that our semantic explanation framework can produce semantically descriptive explanations that significantly improve everyday users’ robot failure understanding, as well as their ability to provide assistance in failure recovery, in comparison to the state-of-the-art context-based explanations.

II Related Works

The XAI community has developed methodologies that increase the interpretability and transparency of black box models [1]. Most of these techniques are model-agnostic and aimed at understanding classification problems. Example techniques include perturbing input data to analyze consequential prediction changes [20, 21], leveraging saliency maps to visualize a model’s attention during prediction [22], and utilizing inherently interpretable models, such as decision trees or rule lists, as approximate surrogate models. While the above techniques provide insight into the inner workings of machine learning models, they have been developed for expert understanding. In our work, we focus on developing a framework that generates explanations accessible to everyday users who are not AI experts.

Additionally, the Explainable AI Planning (XAIP) community has developed techniques that specifically focus on sequential decision-making problems [4]. Much of the work in this area has focused on generating plan explanations that explain the reasoning behind the agent’s selected plan. For example, these explanations may take the form of contrastive explanations that answer “Why plan X instead of plan Y?” [18, 11]. Other works use model reconciliation, seeking to identify divergences between the mental model of the agent and that of the human user, to design explanations that bring such mental models closer together [29, 5]. Furthermore, when a planning problem is unsolvable, infeasible plans can be abstracted into simpler plans as a method for explaining unmet properties [23, 24]. In our work, while we generate explanations in the context of sequential decision-making tasks, our explanations focus on explaining the causes of a failure that may occur within a plan, as opposed to explaining the chosen plan.

A growing body of work is leveraging natural language explanations to explain AI decision-making to everyday users. Specifically, Ehsan et al. utilize sequence to sequence learning to autonomously generate natural language rationales that explain an agent’s decision making in the context of the game Frogger [7]. To train their model, the authors collect annotations in the form of behavior rationales from everyday users. Their results demonstrate that users significantly prefer “complete-view” rationales, which utilize the entire state space as context, as opposed to “focused-view”, which utilize only a subset of the full state space. Most closely related to our work, Das et al. utilize sequence to sequence learning to autonomously generate natural language explanations in the context of robot failures [6]. To train their model, the authors collect expert annotations for each timestep in a robot’s task plan. Their results demonstrate that the inclusion of environmental context and history of past actions help improve user ability to correctly identify failures and their solutions. In our work, we aim to produce natural language explanations grounded in semantic context. Instead of hand-annotating contextual information, we leverage semantic scene graphs to autonomously capture the semantic information from a scene. In doing so, we are able to expand the set of explainable failure scenarios from [6].

Within the robotics community, scene graphs have been utilized for scene analysis and goal-directed manipulation. For instance, Zeng et al. use scene graphs to parse a goal scene configuration in an effort to allow robots to efficiently motion plan and transform an initial scene into such a pre-defined goal [35]. Sui et al. leverage scene graphs for axiomatic particle filtering, which allows robots to disambiguate objects in a cluttered scene for effective manipulation [26]. Kenfack et al. intersect Visual Question Answering (VQA) and robotics and develop a RobotVQA architecture. RobotVQA provides semantically grounded answers to questions about a scene, with the motivation to elicit more meaningful robot object manipulations in the future [15]. Scene graphs have also been utilized to mitigate safety risks in human-robot collaboration scenarios [19, 12]. Most closely related to our work, scene graphs have been shown to be effective in producing explainable answers for VQA queries [10]. The authors utilize attention maps to autonomously select relevant relations from a scene graph to explain their VQA model’s answers. In our work, instead of attention maps, we utilize pairwise ranking of inter-object relations and object attributes to provide ranked, semantically descriptive explanations which include only the relevant semantic information needed to explain a robot failure.

Research on fault diagnosis has led to the development of a wide range of techniques for diagnosing, and suggesting recoveries for, robot failures [16]. Example techniques include execution monitoring [3, 13], sensor-processing [14], neural networks [30, 28], and statistical filtering [31]. However, such techniques are aimed either at autonomous recovery [16, 3] or at aiding an expert operator who is deeply familiar with the inner workings of the system, not everyday users.

Fig. 1: Average selected explanation type based on users’ perceived helpfulness. Statistical significance is reported as: * p < 0.05, ** p < 0.01, *** p < 0.001.

III Semantically Descriptive Explanations

Fig. 2: Breakdown of failure causes, and failure types expanded in this work, as well as the specific and explained by CB explanations from prior work.

Given a set of failure types, , that prevent a robot from picking up a desired object, , our goal is to produce natural language explanations that help everyday users (1) understand the cause of the robot’s failure, and (2) identify the correct way to assist the robot in recovery. As seen in Figure 2, we characterize by . In this set represent single spatial failures, represent compound spatial failures, and represent attribute failures. We denote that each failure type can be caused by the set , where define failures caused by spatial relationships and define object attribute causes. We believe that an effective solution to our objective is to utilize the semantic information from a scene and generate semantically descriptive explanations for robot failures. We qualify a semantically descriptive explanation as one that utilizes the inter-object spatial relationships and object specific attributes from a given scene. An inter-object spatial relationship describes an object’s location with respect to other objects in a scene. For instance, “a credit card is underneath a newspaper”. An object specific attribute describes a property of the object, such as “the vase is fragile”.

To validate the importance of semantically descriptive explanations, we developed a qualitative study in which our explanations were derived from hand-crafted semantic relationships in a scene (see footnote 1). We compared our approach to the only previously published error explanation technique [6]. In [6], context-based (CB) explanations were generated for novel scenes based on similarity to previously annotated scenarios. CB explanations focus on single failure types, , and are developed without the use of semantic information; for instance, in the example where a credit card is underneath a newspaper, a CB explanation would be “the credit card is occluded”. Although correct, this statement is vaguer than the one that utilizes the semantic scene information.

Footnote 1: Participants were recruited through Amazon Mechanical Turk; they were 18 years or older (M=31.4, SD=9.5) and were compensated $2.50 for the task.

In our study, each user was presented with a scene from a household environment, as well as three descriptions for the cause of a pick error: no explanation, a CB explanation, and a semantically descriptive explanation. Users were asked to select which explanation was most helpful in understanding the robot’s cause of failure. In Figure 1, we present participants’ perceived usefulness both over aggregated failures, as well as across each failure type. We observe that for all failure types, the semantically descriptive explanations are perceived as significantly more useful than no explanations and the CB explanations.

Fig. 3: The semantic explanation framework is used to generate unranked explanations and ranked - explanations. The framework consists of three modules: (1) a scene graph network that autonomously extracts semantic information from a scene, (2) a pairwise ranking algorithm that ranks semantic information based on relevancy to a failure scenario, and (3) an explanation generation template that produces both variants of natural language explanations.
Fig. 4: Our adapted MOTIFNET model architecture utilized to evaluate predicate classification.
Fig. 5: Confusion matrix of our SSG model’s performance, where the y-axis denotes the ground truth predicates and the x-axis denotes the predicted predicates.

IV Scene Graph Model

Given that explanations grounded in inter-object relationships and object attributes were perceived as significantly more useful by everyday users, we next developed a methodology to autonomously generate these semantically descriptive explanations. To do so, we introduce the semantic explanation framework shown in Figure 3. Our framework leverages semantic scene graphs and pairwise ranking to deliver two variants of semantically descriptive explanations: and -. In Section IV-A we discuss how we adapt the semantic scene graph architecture MOTIFNET [34] to predict spatial relationships and object attributes from a given scene. In Section V we demonstrate how the semantic scene graph model outputs are utilized to template unranked explanations. We also showcase the utility of pairwise ranking to develop ranked - explanations that select only the relevant semantic information in a scene. Finally, in Section VI, we perform a quantitative analysis comparing both variants of semantically descriptive explanations with the existing baselines. We demonstrate the effectiveness of ranked - to everyday users in improving their understanding of robot failures and ability to accurately assist in failure recovery.

IV-A Semantic Scene Graphs

A scene graph, , describes the semantic information contained in a given image and is represented by a set of nodes, , and edges, , [34]. Each is defined by a pair in which represents a detected bounding box, and represents the associated object label. Similar to [2], we also provide each with an object attribute, , where is the set of object attributes from Figure 2. Additionally, each is defined as a predicate label between and . The predicate labels refer to the inter-object relations in a scene (e.g., underneath, inside, close to). Given these definitions, the output of a scene graph is defined by a set of triples = { in which each is defined by .
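As a minimal sketch of the structure described above, a scene graph can be held as a list of triples over labelled nodes. The class and field names (`Node`, `Triple`, `bbox`, `attribute`), the bounding box layout, and the example objects and coordinates are illustrative assumptions; they are not taken from MOTIFNET or from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One detected object: bounding box, object label, optional attribute."""
    bbox: Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)
    label: str
    attribute: Optional[str] = None  # None stands in for the "None" attribute


@dataclass(frozen=True)
class Triple:
    """A directed edge: subject node, predicate label, object node."""
    subject: Node
    predicate: str
    obj: Node


def describe(triple: Triple) -> str:
    """Render a triple as a short phrase, e.g. for inspection."""
    return f"{triple.subject.label} {triple.predicate} {triple.obj.label}"


# Hypothetical two-object scene: a credit card underneath a newspaper.
card = Node(bbox=(40.0, 60.0, 90.0, 95.0), label="credit card")
paper = Node(bbox=(20.0, 30.0, 200.0, 180.0), label="newspaper")
scene_graph: List[Triple] = [Triple(card, "underneath", paper)]

print(describe(scene_graph[0]))
```

Keeping the attribute on the node rather than on the edge mirrors the description above, where each node carries its own attribute and each edge carries only a predicate label.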

We adapt the state-of-the-art scene graph model MOTIFNET [34] (see footnote 2) to predict spatial relationships and object attributes in a given scene. Figure 4 depicts our model architecture. For the purposes of our application, we evaluate our model on predicate classification, a form of SSG evaluation that utilizes both ground truth bounding box regions and object labels to predict predicate edge labels. As shown in Figure 4, ground truth bounding box regions, , and object labels, , are extracted from an image and passed into a bi-LSTM network structure with highway connections [25]. To predict a triple , the contextualized information for two objects, and , is utilized in conjunction with the union of corresponding bounding box information, and , to determine the final predicate label .

Footnote 2: We utilize the codebase provided by [27] to adapt our MOTIFNET.

IV-B Data Collection

To train our adapted MOTIFNET model, we collected a dataset consisting of 188 cluttered household images from the AI2Thor simulator [17]. Images in were taken from the perspective of the robot, and capture the unstructured, cluttered environment of human households. The images in represent what a robot would perceive as it attempts to pick up a desired object . Therefore, our images capture close-view scenes of major receptacles such as kitchen countertops, dining tables or desks. On average, each image in includes 13 objects. These objects include approximately 6 object attributes and 30 inter-object spatial relations. In each image, we assume every object to have only one attribute, including “None”, which denotes that an object does not have any attribute listed in Figure 2.

IV-C Training & Evaluation

We train our adapted MOTIFNET on ground truth bounding box regions and object labels to predict predicate and attribute labels. We utilize a 66%-17%-17% split in which we use 126 scenes for training, 32 for validation and 32 for evaluation. Our model is trained for 2000 iterations using a cross-entropy loss that is optimized with SGD, with a learning rate of 0.01 and momentum of 0.9.

The confusion matrix in Figure 5 shows the average performance of our predicate classification. Our adapted MOTIFNET classifies the predicate labels with 84.9% accuracy. While our model produces few false positives for most relations, it has greater difficulty differentiating labels that are semantically similar. For example, “close to” is most often misclassified as “near”. Similarly, “in” is most often misclassified as “on”. These labels learn the relationship between 2D bounding boxes, with a threshold as a discriminator. Improvements to the SSG model architecture, as well as additional training data, will likely lead to improvements in the classification accuracies. As we will show in Section VI, the current level of performance is sufficient to effectively convey error explanations to users.

Fig. 6: Sample failure scenarios, where the red boxes indicate the bounding boxes of ground truth objects in the scene, and the yellow box represents . We illustrate model-generated explanations, comparing CB, and - explanations.

V Generating Explanations from SSGs

Given a desired object , an image of the local environment from the robot’s camera corresponding to one of the failures in Figure 2, and the corresponding image scene graph , our goal is to produce semantically descriptive, natural language explanations that reason about why a robot cannot pick up . Below, we detail how explanation variants and - are generated.

V-A Explanations

To develop explanations, we follow a template-based approach that traverses a scene graph, , and extracts a subgraph containing all relations which contain as a node in the triple. To generate an SSG explanation, we describe , the elements of the scene that pertain to our object of interest. Specifically, we format the explanation as The robot could not pick up the because , where is a list of phrases that enumerates all of the object relations .
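The template step above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: triples are simplified to `(subject, predicate, object)` string tuples, the function names `extract_subgraph` and `explain` are hypothetical, the phrase wording and the use of "and" to join phrases are guesses at the template's surface form, and the fallback sentence for an object with no relations is an added assumption.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject label, predicate, object label)


def extract_subgraph(triples: List[Triple], target: str) -> List[Triple]:
    """Keep only the triples that contain the target object as a node."""
    return [t for t in triples if target in (t[0], t[2])]


def explain(triples: List[Triple], target: str) -> str:
    """Fill the explanation template with every relation involving the target."""
    relevant = extract_subgraph(triples, target)
    if not relevant:
        # Assumed fallback: the source does not say what happens in this case.
        return f"The robot could not pick up the {target}."
    phrases = [f"the {s} is {p} the {o}" for s, p, o in relevant]
    return (f"The robot could not pick up the {target} because "
            + " and ".join(phrases) + ".")


# Hypothetical scene: only two of the three triples mention the credit card.
scene = [
    ("credit card", "underneath", "newspaper"),
    ("mug", "on", "table"),
    ("credit card", "close to", "mug"),
]
print(explain(scene, "credit card"))
```

Because every relation touching the target object is enumerated, the output grows with scene clutter, which is the behaviour an unranked explanation is expected to show.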

In Figure 6 we showcase examples of explanations in the context of our failure types . In every scene, the explanations include all relationships associated with . We observe that explanations are semantically richer and more detailed than their CB counterpart. However, we also observe that these explanations include extraneous semantic information that hides the true cause of a failure. In other words, a drawback of these explanations is that the scene graph model has no insight into which are relevant in describing the robot’s failure.

V-B - Explanations via Pairwise Ranking

To provide only relevant relations as for a failure, we develop - explanations. In addition to extracting a subgraph , we utilize pairwise ranking to autonomously determine the relevancy of each triple . Pairwise ranking is used to learn preferences between pairs of entities when multiple available entities exist [8]. In our application, a preference denotes how accurately a relationship describes the true cause(s) of failure(s).

To formulate our pairwise ranking problem we let a pair of relationship triples, and , be defined by feature vectors and . Recall from Section IV that represents the predicate label between two object nodes and , while and represent the predicted object attributes for and .

Given our feature vectors, Algorithm 1 further details the pairwise ranking process for a subgraph . We utilize to determine how many binary classifiers are required to represent each unique pair of labels. For our application, we had a set of three preference labels , and therefore required three binary classifiers to be instantiated, one for each label pair: , , and . The list contains the three instantiated classifiers (lines 1-2). We iterate through , and pass the feature vectors of each relationship pair, and , as an input to each classifier (lines 3-4). The following logic denotes how a predicted label is determined for a given input :

A predicted label 0 represents when the relationship better describes the cause of failure than . A label 1 represents when the relationship better describes the cause of failure than , and a label 2 represents when both relationships equally describe the cause of failure. For our purposes, we utilize random forest classifiers, trained using cross validation via scikit-learn. Depending on the predicted label, the rank of one or both relationships is incremented (lines 5-10). The resulting list of relationships , sorted by rank, is returned by the algorithm (lines 14-15). Note that the annotation of a training label 0, 1 or 2 is determined via domain knowledge of the failure scenario.

To develop an - explanation, we follow the identical template utilized for ; however, is now replaced with the top ranked relationship(s) in . Note that including a tie label 2, which represents equally ranked relationships, allows our pairwise ranking to represent more than one relationship with the max rank. Figure 6 exemplifies how pairwise ranking can eliminate the presence of extraneous relationships when compared to explanations, while still being semantically richer than CB explanations.

Input:  - scene subgraph
Output:  - ranked relation list

1:   = 3
2:  classifierList = loadClassifiers((-1)/2)
3:  for all , in  do
4:     for classifier in classifierList do
5:        if classifier([, ]) = 0 then
6:           incrementRank()
7:        else if classifier([, ]) = 1 then
8:           incrementRank()
9:        else if classifier([, ]) = 2 then
10:           incrementRank(, )
11:        end if
12:     end for
13:  end for
14:   = sortByRank()
15:  return  
Algorithm 1 Pairwise Ranking Algorithm
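The control flow of Algorithm 1 can be sketched in Python as below. Several things here are assumptions made only for illustration: triples are plain string tuples instead of feature vectors, the loop visits each unordered pair once, label 0 is read as favouring the first triple of the pair, and `toy_classifier` with its `TOY_RELEVANCE` table is a hand-written stand-in for the trained random forest classifiers, which are not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Classifier = Callable[[Triple, Triple], int]  # returns preference label 0, 1 or 2


def pairwise_rank(subgraph: Sequence[Triple],
                  classifiers: Sequence[Classifier]) -> List[Tuple[Triple, int]]:
    """Score each triple by pairwise votes and return (triple, rank), best first."""
    rank: Dict[Triple, int] = {t: 0 for t in subgraph}
    for first, second in combinations(subgraph, 2):
        for classify in classifiers:
            label = classify(first, second)
            if label == 0:      # first triple better describes the failure
                rank[first] += 1
            elif label == 1:    # second triple better describes the failure
                rank[second] += 1
            elif label == 2:    # tie: both describe the failure equally well
                rank[first] += 1
                rank[second] += 1
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)


def top_ranked(ranked: List[Tuple[Triple, int]]) -> List[Triple]:
    """Return every triple that shares the maximum rank (ties are kept)."""
    if not ranked:
        return []
    best = ranked[0][1]
    return [t for t, r in ranked if r == best]


# Hypothetical relevance scores standing in for a learned model.
TOY_RELEVANCE = {"underneath": 2, "close to": 1, "on": 0}


def toy_classifier(first: Triple, second: Triple) -> int:
    a, b = TOY_RELEVANCE[first[1]], TOY_RELEVANCE[second[1]]
    if a > b:
        return 0
    if a < b:
        return 1
    return 2


subgraph = [
    ("credit card", "close to", "mug"),
    ("credit card", "underneath", "newspaper"),
    ("credit card", "on", "table"),
]
ranked = pairwise_rank(subgraph, [toy_classifier])
print(top_ranked(ranked))
```

Returning all triples with the maximum rank, rather than a single winner, is what lets a tie label surface more than one relationship in the final explanation.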
Fig. 7: Average F1 and Recall score for participants’ failure identification across all study conditions. Statistical significance is reported as: * p < 0.05, ** p < 0.01, *** p < 0.001.
Fig. 8: Average F1 and Recall score for participants’ solution identification across all study conditions. Statistical significance is reported as: * p < 0.05, ** p < 0.01, *** p < 0.001.

VI Quantitative Analysis of Semantic Explanations

From Section III, we observed that semantically descriptive explanations were perceived to be more helpful for failure understanding than the existing CB explanations. In this section, we evaluate the efficacy of our model-generated explanations, and - explanations in improving users’ ability to identify a failure and provide assistance for recovery. For our analyses we conducted a six-way between subjects study, with the following study conditions that differed by the type of explanation participants received:

  • None (Baseline): Participant receives no explanation describing the cause of error. As noted by [6], this is the standard in currently deployed robotic systems.

  • CB (Baseline): Participant receives a context-based explanation from prior work [6].

  • : Participant receives a ground truth, unranked, semantically descriptive explanation.

  • -: Participant receives a ground truth, ranked, semantically descriptive explanation.

  • : Participant receives a model-generated, unranked semantically descriptive explanation.

  • -: Participant receives a model-generated, ranked, semantically descriptive explanation.

VI-A Study Design

Similar to Das et al. [6], our user study consisted of two stages, Pre-Test and Explanation. In both stages, users were presented with images of the environment in which the robot encountered a failure when tasked to pick up .

In the Pre-Test stage, participants were shown 16 randomly ordered failure scenarios representing all from Figure 2. None of the participants were provided explanations in this stage in order to establish their initial level of error understanding. For each scenario, participants were tasked to identify the possible cause(s) of failure and suggest possible solution(s) as a remedy.

In the Explanation stage, participants were shown 16 different failure scenarios. However, participants were now provided an explanation depending on their study condition. Similar to the Pre-Test stage, participants were tasked to identify the cause(s) of robot failure and suggest possible solution(s) as a remedy.

VI-B Metrics

We evaluate participant performance using either an F1 score or a Recall score. Participants were allowed to select multiple answers for each question, so for the compound spatial failure types from Figure 2, the number of false negatives is a more important measure of a participant’s performance. Therefore, we analyze participants’ Recall score for compound failures and F1 score for all other failure types. Specifically, we measure the difference between each participant’s Pre-Test and Explanation F1 score or Recall score using metrics similar to Das et al. [6]:

  • Failure Identification (FId): The ability to accurately select the correct cause(s) of failure in a scene.

  • Solution Identification (SId): The ability to accurately select the actions needed to remedy the failure in a scene.
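For a multi-select question, F1 and Recall can be computed from the set of options a participant selected and the set of correct options. The sketch below uses the standard definitions; the option strings, the score values, and the sign convention of `improvement` (Explanation score minus Pre-Test score) are assumptions for illustration.

```python
from typing import Iterable


def precision(selected: Iterable[str], correct: Iterable[str]) -> float:
    """Fraction of selected options that are correct."""
    sel, cor = set(selected), set(correct)
    return len(sel & cor) / len(sel) if sel else 0.0


def recall(selected: Iterable[str], correct: Iterable[str]) -> float:
    """Fraction of correct options that were selected (penalizes false negatives)."""
    sel, cor = set(selected), set(correct)
    return len(sel & cor) / len(cor) if cor else 0.0


def f1(selected: Iterable[str], correct: Iterable[str]) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(selected, correct), recall(selected, correct)
    return 2 * p * r / (p + r) if p + r else 0.0


def improvement(pre_test_score: float, explanation_score: float) -> float:
    """Assumed convention: positive values mean the explanation helped."""
    return explanation_score - pre_test_score


# Hypothetical answer: one correct cause plus one wrong extra selection.
selected = {"object is occluded", "object is too far"}
correct = {"object is occluded"}
print(recall(selected, correct), f1(selected, correct))
```

In this example Recall is unaffected by the extra wrong selection while F1 drops, which illustrates why Recall is the more forgiving choice when several causes must all be found.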

VI-C Participants

We recruited 93 participants from Amazon Mechanical Turk. Participants were required to be non-experts in the domain of robotics; thus, we removed three participants who scored 100% accuracy on the Pre-Test stage. The remaining 90 participants, 15 for each study condition, included 53 males and 37 females, all of whom were over the age of 18 (M=39.0, SD=10.8). The task took 20-30 minutes on average, and participants were compensated $2.50.

VI-D Quantitative Results

The participants’ failure identification (FId) and solution identification (SId) scores follow a normal distribution, thus we utilize a one-way ANOVA with a Tukey HSD post-hoc test to evaluate statistical significance between study conditions.
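To make the omnibus test concrete, the one-way ANOVA F statistic can be computed directly from per-condition score lists. This is a plain-Python sketch of the standard formula only: the three groups of values are made up for illustration and are not study data, and the Tukey HSD post-hoc step and the p-value lookup are not shown. In practice a statistics library such as SciPy (`scipy.stats.f_oneway`) would typically be used instead.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """Return the F statistic: between-group variance over within-group variance."""
    k = len(groups)                              # number of conditions
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up score improvements for three hypothetical conditions.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(example_groups))
```

A large F value indicates that the condition means differ by more than the spread within conditions would suggest; the post-hoc test then identifies which pairs of conditions differ.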

In Figure 7, we examine the average F1 score and Recall score for participants’ failure identification (FId) across the aggregated failure types as well as across each individual failure type. Overall, we see that - and explanations have the highest improvement in FId scores in comparison to the other study conditions. This indicates the effectiveness of semantically descriptive explanations in improving participants’ understanding of robot failures.

When looking at the FId scores for aggregate failures (Figure 7(a)), we see -, and - explanations lead to a significant improvement in failure understanding compared to None, CB, and both and explanations ( for all). Similar trends are observed with single spatial failures (Figure 7(b)), and attribute failures (Figure 7(d)), reiterating the effectiveness of grounding explanations in the semantic information present in a scene.

For compound spatial failures (Figure 7(c)), we observe that only - leads to significant improvement in FId scores. This highlights the importance of ranked semantic-based explanations in significantly improving participants’ failure understanding, and also highlights an area of improvement for our SSG model in detecting multiple failures in a scene. Furthermore, for compound failures, we also observe a strong learning effect across all study conditions. This indicates variability in difficulty across compound spatial failures, where the Explanation compound failure scenarios may be more visually apparent than the Pre-Test compound failure scenarios. This situation demonstrates the key reason for measuring the differences in performance between both stages, as opposed to solely analyzing the Explanation stage performances.

Furthermore, Figure 7(d) presents the effectiveness of and explanations. We observe that and lead to significantly improved failure understanding in comparison to None () and CB (). However, - and show a significantly higher rate of improvement than and (). This further demonstrates the benefit of ranked - and - explanations in helping participants understand the true cause(s) of robot failure. Interestingly, in Figure 7(d), we observe the adverse effects CB explanations have in participants’ understanding of attribute failures. Given that CB explanations can only predict a limited set of single spatial failures, when used to explain attribute failures, they cannot express the true cause of failure.

Given the trends in participants’ failure identification scores, in Figure 8, we also examine the average F1 score and Recall score for participants’ solution identification (SId) across all failure types. We see that participants’ SId scores closely follow the trends observed for failure identification. However, there are instances in which solution identification is harder than failure identification. For example, when analyzing single spatial failures in Figure 8(b), we see that only - significantly improves SId in comparison to None, CB, and (). Overall, we find that both - and - explanations lead to the highest SId scores. This displays not only the importance of generating semantic-based explanations, but also the effectiveness of ranked, semantic-based explanations that include only the semantic information relevant to a failure.

VII Conclusion and Future Work

In this work, we have introduced a generalizable framework that autonomously captures the semantic information in a scene to explain robot pick errors to everyday users. We leverage both semantic scene graphs and pairwise ranking to develop semantically descriptive explanations that highlight the true cause of a robot failure. Our results demonstrate that ranked, semantically descriptive explanations significantly improve everyday users’ ability to understand robot failures and provide assistance for fault recovery. Although the results are promising, there are limitations that should be addressed in future work. For example, we demonstrate the usefulness of semantically descriptive explanations only for pick errors. It would be interesting to examine the effects of these semantic explanations in the context of other robot failures, such as navigation-related errors. Additionally, while the current scene graph model can generalize over many spatial relations and object attributes, future work, in the form of additional data collection and model improvements, is required to expand the scope of relations needed to explain a wider range of robot failures.


This material is based upon work supported by the NSF Graduate Research Fellowship under Grant No. DGE-1650044.


  • [1] A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6, pp. 52138–52160. Cited by: §I, §II.
  • [2] I. Armeni, Z. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese (2019) 3d scene graph: a structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5664–5673. Cited by: §IV-A.
  • [3] S. Banerjee, A. Daruna, D. Kent, W. Liu, J. Balloch, A. Jain, A. Krishnan, M. A. Rana, H. Ravichandar, B. Shah, et al. (2020) Taking recoveries to task: recovery-driven development for recipe-based robot tasks. arXiv preprint arXiv:2001.10386. Cited by: §II.
  • [4] T. Chakraborti, S. Sreedharan, and S. Kambhampati (2020) The emerging landscape of explainable ai planning and decision making. arXiv preprint arXiv:2002.11697. Cited by: §II.
  • [5] T. Chakraborti, S. Sreedharan, Y. Zhang, and S. Kambhampati (2017) Plan explanations as model reconciliation: moving beyond explanation as soliloquy. arXiv preprint arXiv:1701.08317. Cited by: §II.
  • [6] D. Das, S. Banerjee, and S. Chernova (2021) Explainable ai for robot failures: generating explanations that improve user assistance in fault recovery. arXiv preprint arXiv:2101.01625. Cited by: §I, §I, §I, §II, §III, 1st item, 2nd item, §VI-A, §VI-B.
  • [7] U. Ehsan, P. Tambwekar, L. Chan, B. Harrison, and M. O. Riedl (2019) Automated rationale generation: a technique for explainable ai and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 263–274. Cited by: §I, §II.
  • [8] J. Fürnkranz and E. Hüllermeier (2010) Preference learning and ranking by pairwise comparison. In Preference learning, pp. 65–82. Cited by: §I, §V-B.
  • [9] K. Gade, S. C. Geyik, K. Kenthapadi, V. Mithal, and A. Taly (2019) Explainable ai in industry. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3203–3204. Cited by: §I, §II.
  • [10] S. Ghosh, G. Burachas, A. Ray, and A. Ziskind (2019) Generating natural language explanations for visual question answering using scene graphs and visual attention. arXiv preprint arXiv:1902.05715. Cited by: §II.
  • [11] J. Hoffmann and D. Magazzeni (2019) Explainable ai planning (xaip): overview and the case of contrastive explanation. Reasoning Web. Explainable Artificial Intelligence, pp. 277–282. Cited by: §II.
  • [12] R. Inam, K. Raizer, A. Hata, R. Souza, E. Forsman, E. Cao, and S. Wang (2018) Risk assessment for human-robot collaboration in an automated warehouse scenario. In 2018 IEEE 23rd International Conference on Emerging Technologies and Factory Automation (ETFA), Vol. 1, pp. 743–751. Cited by: §II.
  • [13] F. Ingrand and M. Ghallab (2017) Deliberation for autonomous robots: a survey. Artificial Intelligence 247, pp. 10–44. Cited by: §II.
  • [14] G. Jäger, S. Zug, T. Brade, A. Dietrich, C. Steup, C. Moewes, and A. Cretu (2014) Assessing neural networks for sensor fault detection. In 2014 IEEE international conference on computational intelligence and virtual environments for measurement systems and applications (CIVEMSA), pp. 70–75. Cited by: §II.
  • [15] F. K. Kenfack, F. A. Siddiky, F. Balint-Benczedi, and M. Beetz (2020) RobotVQA: a scene-graph- and deep-learning-based visual question answering system for robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, USA. Cited by: §II.
  • [16] E. Khalastchi and M. Kalech (2018) On fault detection and diagnosis in robotic systems. ACM Computing Surveys (CSUR) 51 (1), pp. 1–24. Cited by: §II.
  • [17] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474. Cited by: §IV-B.
  • [18] B. Krarup, M. Cashmore, D. Magazzeni, and T. Miller (2019) Model-based contrastive explanations for explainable planning. Cited by: §II.
  • [19] H. Riaz, A. Terra, K. Raizer, R. Inam, and A. Hata (2020) Scene understanding for safety analysis in human-robot collaborative operations. In 2020 6th International Conference on Control, Automation and Robotics (ICCAR), pp. 722–731. Cited by: §II.
  • [20] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §I, §II.
  • [21] M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Anchors: high-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §II.
  • [22] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller (2019) Explainable ai: interpreting, explaining and visualizing deep learning. Vol. 11700, Springer Nature. Cited by: §I, §II.
  • [23] S. Sreedharan, T. Chakraborti, C. Muise, Y. Khazaeni, and S. Kambhampati (2020) D3WA+: a case study of xaip in a model acquisition task for dialogue planning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30, pp. 488–497. Cited by: §II.
  • [24] S. Sreedharan, S. Srivastava, D. Smith, and S. Kambhampati (2019) Why can’t you do that hal? explaining unsolvability of planning tasks. In International Joint Conference on Artificial Intelligence, Cited by: §II.
  • [25] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. arXiv preprint arXiv:1507.06228. Cited by: §IV-A.
  • [26] Z. Sui, L. Xiang, O. C. Jenkins, and K. Desingh (2017) Goal-directed robot manipulation through axiomatic scene estimation. The International Journal of Robotics Research 36 (1), pp. 86–104. Cited by: §II.
  • [27] K. Tang (2020) A scene graph generation codebase in pytorch. Cited by: footnote 2.
  • [28] M. Van and H. Kang (2015) Robust fault-tolerant control for uncertain robot manipulators based on adaptive quasi-continuous high-order sliding mode and neural network. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 229 (8), pp. 1425–1446. Cited by: §II.
  • [29] S. L. Vasileiou, A. Previti, and W. Yeoh (2020) On exploiting hitting sets for model reconciliation. arXiv preprint arXiv:2012.09274. Cited by: §II.
  • [30] A. T. Vemuri, M. M. Polycarpou, and S. A. Diakourtis (1998) Neural network based fault detection in robotic manipulators. IEEE Transactions on Robotics and Automation 14 (2), pp. 342–348. Cited by: §II.
  • [31] V. Verma, G. Gordon, R. Simmons, and S. Thrun (2004) Real-time fault diagnosis [robot fault diagnosis]. IEEE Robotics & Automation Magazine 11 (2), pp. 56–66. Cited by: §II.
  • [32] P. Xu, X. Chang, L. Guo, P. Huang, X. Chen, and A. G. Hauptmann (2020) A survey of scene graph: generation and application. EasyChair Preprint (3385). Cited by: §I.
  • [33] G. A. Zachiotis, G. Andrikopoulos, R. Gomez, K. Nakamura, and G. Nikolakopoulos (2018) A survey on the application trends of home service robotics. In 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1999–2006. Cited by: §I.
  • [34] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi (2018) Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840. Cited by: §I, §IV-A, §IV-A, §IV.
  • [35] Z. Zeng, Z. Zhou, Z. Sui, and O. C. Jenkins (2018) Semantic robot programming for goal-directed manipulation in cluttered scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7462–7469. Cited by: §II.