Towards A Robot Explanation System: A Survey and Our Approach to State Summarization, Storage and Querying, and Human Interface

by Zhao Han, et al.

As robot systems become more ubiquitous, developing understandable robot systems becomes increasingly important in order to build trust. In this paper, we present an approach to developing a holistic robot explanation system, which consists of three interconnected components: state summarization, storage and querying, and human interface. To find trends towards and gaps in the development of such an integrated system, a literature review was performed and categorized around those three components, with a focus on robotics applications. After the review of each component, we discuss our proposed approach for robot explanation. Finally, we summarize the system as a whole and review its functionality.




1 Introduction

With the advancement and wide adoption of deep learning techniques, the explainability of software systems and the interpretability of machine learning models have attracted both human-computer interaction (HCI) researchers and the artificial intelligence (AI) community (e.g., [33]). Work in human-robot interaction (HRI) has shown that improving understanding of a robot makes it more trustworthy [15] and more efficient [4]. However, how robots can explain themselves at a holistic level (i.e., generate explanations and communicate them, with a supporting data storage system with efficient querying) remains an open research question.

As opposed to virtual AI agents or computer software, robots have physical embodiment, which influences metrics such as empathy [41] and cooperation [6] with humans. Given this embodiment, some research in human-agent interaction is not applicable to human-robot interaction. For example, in a literature review about explainable agents and robots [5], approximately half of the explanation systems examined used text-based communication methods, which are less relevant for robots that are not usually equipped with display screens. Instead, HRI researchers have been exploring non-verbal physical behavior such as arm movement [17, 29] and eye gaze [34]. Non-verbal behaviors can help people to anticipate a robot’s actions [30], but understanding why that behavior occurred can improve one’s prediction of behaviors, especially if the behavior is opaque [31]. Thus, robot explanations of their own behavior are needed.

In this paper, a holistic robot explanation system is decomposed into three components (see Figure 1), the research literature is surveyed to explore trends and gaps, and a proposed design philosophy and approach are detailed as informed by the results of the literature review. We aim to provide important considerations and directions towards a robot explanation system that can accelerate robot acceptance. The term “summarization” addresses the process of shortening the description of the robot’s activities while “explanation” strives to give insight into why the robot performed the summarized behaviors.

1.1 System Components

Figure 1: High-level representation of the robot explanation system’s three components. Fig. 2 shows a detailed version.

A robot explanation system requires state summarization, data storage and querying, and a human interface.

State summarization is at the core of the system, manually or automatically generating varying levels of summaries from different robot states while performing tasks or from the stored states in a post-hoc fashion. The varying summary levels allow people to receive explanations ranging from more abstract to more detailed [10] (e.g., processed data compared to raw sensor data, respectively). Explanations that utilize raw sensor data will likely only be favored by the minority of expert users while explanations involving processed data will be useful for both expert users and the majority of non-expert users.

A persistent storage system is needed to retain robot data and generated explanations. This system will pass the generated explanations, or summaries, to the human interface to be communicated. While storing them, different levels of explanations stemming from the same instance of source data need to be linked to maintain fluid interactions with a person. The person may request follow up explanations with more or less detail than the initial explanation. The storage system must also have a query component as part of the database interface to support online state summarization, which is needed when the stored summaries are not sufficient to answer users’ questions. Querying must also be efficient, given the potentially large amount of robot data being stored.

The human interface component communicates the explanations from the robot to the human and allows the person to ask the system questions. The human interface can use several different modalities, such as natural language dialogues, a traditional graphic user interface (GUI) on a display screen or virtual reality (VR), and augmented reality (AR) that directly projects onto the robot’s environment. The communication method could also involve moving the robot system, such as moving the robot’s head or arm.

1.2 Scope and Contributions of the Work

This paper surveys the literature about the three components of an explanation system to provide critique, summarize trends, and discover gaps towards the design of a robot explanation system. By leveraging the trends observed in the research literature, we propose a robot explanation system architecture that will aim to fill the discovered gaps.

The literature review was focused on research involving robots, rather than more general AI agents or the interpretation of machine learning models. This constraint was relaxed when work with robots was underrepresented or shared commonalities in one component, such as state summarization, where abstract states apply to both AI agents and robots.

Thus, the literature review is not meant to be exhaustive, but comprehensive under these constraints to cover the three components in the context of physical co-located robot systems. We refer interested readers to [1] for explainability of computer software, [33, 2] for the explainability of virtual AI agents, and [49, 22] for interpretability of machine learning/deep learning models.

2 State Summarization

Before a robot can begin to explain its actions, it must first translate its decisions into a form that a human can understand. Significant research has been performed within the fields of HCI and HRI towards this goal. In this section, we discuss systems in the literature that are deployable to physical robots. There has been significant research in the field of explainable AI, sometimes referred to as XAI [45]; however, this is beyond our scope. We are specifically interested in summarization methods that can work for a variety of different systems.

2.1 Manual Methods

There are two components of state summarization: the state of which the robot is aware and the state that the robot communicates to the user. A common approach is for developers to manually create categories by which the robot can explain its actions. For example, programmer-specified function annotations for each designated robot action are used in [23]. By creating a set of robot actions, correlated with code functions, the system is able to snapshot the state of the robot before and after a function is called. Since the state of the robot could be exceedingly large in a real-world, deployed system, the state space is shrunk by isolating the variables that are predetermined to be most relevant. These annotated variables are recorded every time a pre- and post-action snapshot is made. The robot then uses inspection to compare the pre- and post-variables of one action with those of other similar successful actions to make judgments.
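The snapshot idea described above can be illustrated with a small sketch. This is not the implementation from [23]; the decorator name robot_action, the robot_state dictionary, and the helper changed_variables are all hypothetical names chosen for illustration, and the "robot" is just an in-memory dictionary.

```python
import copy
import functools

# Hypothetical robot state; a real system would read this from the robot.
robot_state = {"gripper_open": True, "arm_pose": "home",
               "battery": 0.93, "held_object": None}

# One entry is appended per call to an annotated action.
snapshots = []


def robot_action(name, relevant_vars):
    """Annotate a function as a robot action and record pre/post snapshots
    of only the variables listed in relevant_vars."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pre = {k: copy.deepcopy(robot_state[k]) for k in relevant_vars}
            success = func(*args, **kwargs)
            post = {k: copy.deepcopy(robot_state[k]) for k in relevant_vars}
            snapshots.append({"action": name, "pre": pre,
                              "post": post, "success": success})
            return success
        return wrapper
    return decorator


@robot_action("grasp", relevant_vars=["gripper_open", "held_object"])
def grasp(obj, reachable=True):
    """Toy action: succeeds only if the object is reachable."""
    if not reachable:
        return False
    robot_state["gripper_open"] = False
    robot_state["held_object"] = obj
    return True


def changed_variables(snapshot):
    """Return annotated variables whose values differ between pre and post."""
    return {k: (snapshot["pre"][k], snapshot["post"][k])
            for k in snapshot["pre"]
            if snapshot["pre"][k] != snapshot["post"][k]}


grasp("cup")                    # successful call, state changes
grasp("box", reachable=False)   # failed call, state unchanged
```

Comparing changed_variables for a failed call against those of earlier successful calls of the same action is one simple way to approximate the inspection step; restricting snapshots to relevant_vars keeps each record small even when the full state is large.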

A different approach, suggested in [28], is to adapt hierarchical task analysis [39] to a goal hierarchy tree (GHT). This involves creating a tree where the top node is a high-level task, which can be broken into a number of sub-goals, each linked by a belief (i.e., condition). Each sub-goal can then be broken into either sub-goals or actions. Choosing one sub-goal or action over another is based on a belief. The GHT can then be used to generate explanations. When comparing goal-based vs. belief-based explanations, Kaptein et al. [kaptein2017personalised] found that adults significantly preferred goal-based explanations.

2.2 Summarization Algorithms

While manually creating categories or explanations can be effective, it is time consuming and not easily generalizable. Many techniques attempt to automate the process.

Programmer-supplied explanations might be able to accurately describe the state of a robot; however, they can prove to be inadequate for a user. Ehsan et al. [ehsan2019automated] state that it is best to use a rationale justification to explain to non-expert users, differentiating between a rationale and an explanation. An explanation can be made by exposing the inner workings of a system, but this type of explanation may not be understandable to non-experts. They suggest that the alternative, a rationale, is meant to be an accessible and intuitive way of describing what the robot is doing. They also discuss how explanations can be tailored to optimize for different factors, including relatability, intelligibility, contextual accuracy, awareness and strategic detail; these factors can affect the user’s confidence, the understandability of the explanations, and how human-like the explanation is. The approach does not attempt to provide an explanation that reveals the underlying algorithm, but rather attempts to justify an action based on how a non-developer bystander would think. The authors explore two different explanation strategies: “focused view rationale” provides concise and localized rationale, which is more intelligible and easier to understand, whereas “complete view rationale” provides detailed and holistic rationale, which has better strategic detail and increased awareness.

Haidarian et al. [haidarian2010metacognitive] proposed a metacognitive loop (MCL) architecture with a generalized metacognition module that monitors and controls the performance of the system. Every decision performed by the system has a set of expectations and a set of corrections or corrective responses. Their framework does not attempt to monitor and respond to specific expectation failures, which would require intricate knowledge of how the world works. However, the abandonment of intricate knowledge makes it difficult to provide specialized, highly detailed explanations to an expert operator.

Most of this prior work examined explanations within rule-based and logic-based AI systems, not addressing the quantitative nature of much of the AI used in HRI. More recent work on automatic explanations instead used Partially Observable Markov Decision Problems (POMDPs), which have seen success in several situations within robotics [46]. Unfortunately, the quantitative nature of these models and the complexity of their solution algorithms also make POMDP-based reasoning opaque to people. Wang et al. [wang2016impact] propose an approach to automatically generate natural-language explanations for POMDP-based reasoning, with predefined string representations of the potential actions, accompanied by the level of uncertainty and the relative likelihood of outcomes. The system could also reveal information about its sensing abilities along with how accurate its sensor is likely to be. However, modeling using POMDPs can be time consuming.

Miller [miller2019] discusses how explanations delivered to the user should be generated based on data from social and behavioral research, which could increase user understanding. Whether the explanation is generated from expert developers or from a large dataset of novice operators, both cases still require manually tying the robot algorithm to an explanation, a process that can be difficult and error-prone.

In their literature review, Anjomshoae et al. [anjomshoae2019explainable] conclude that context-awareness and personalization remain under-researched despite having been determined to be key factors in explainable agency. They also suggest that multi-modal explanation presentation is possibly useful, which would mean that the underlying state representation would need to be robust enough to handle several different approaches. Finally, they propose that a robot should keep track of a user’s knowledge, with the explanation generation model updated to reflect the evolution of user expertise.

2.3 Proposed Approach

Much of the prior research focuses on scenarios where the system is designed for either novice or expert users but not both. De Graaf and Malle [de2017people] argue that a robot should take a person’s knowledge and role into account when formulating a response. While context-awareness and personalization have been outlined as key factors for effective state summarization, there is little research where the robot adjusts its explanation depending on the user [5], even in simple cases such as expert vs. novice user.

There are many identified differences between expert users and novice users. For example, traceability and verification are very important for software and hardware engineers [12] while explainability or intelligibility are particularly important for laymen [13]. Cases where a user starts as a novice then gains experience over time are under-researched. In our system, the state representation and explanation generation need to be able to adjust to the user and possibly change depending on context. In addition, even if the user is taken into account, such systems fail to account for cases where a user could potentially want to have both levels of explanations available simultaneously or to switch between them. For example, the user could quickly receive a high-level explanation for why the robot performed an action, then, if that proves insufficient, inquire for more details. Ideally, the robot would be able to provide explanations with varying granularity and context, tailored to the experience level of the user (e.g., bystander, operator, programmer, etc.).

Given the conclusion that participants preferred annotations from developers of the actions taken by a reinforcement learning game agent [19], both manual and automated methods to generate explanations should be considered. Manually generated explanations fill the gap that new users do not have a thorough understanding of the logic in the underlying algorithm. However, when developers manually put explanations in code (e.g., using [23]), they should always consider the audience of new users and provide easy-to-understand explanations that are not tightly coupled with implementation details.

In order to cover a wide variety of possible situations, our proposed approach is for an expertly created inner state representation based on categories and goals with methods to automatically create desirable explanations delivered to the user, where those explanations can adjust based on the user’s experience (bystander, novice, expert) as well as other factors. The system should isolate and convey necessary context for a decision or state, or be prepared to provide it if additional information is requested. Specifically, if confidence in these generated responses is low, the state summarization algorithms can fall back to the expertly created explanations.

For example, the sample task of picking up an object could be decomposed into subtasks: locate the object, navigate to it, then grasp the located object. Each of those subtasks can then be broken down into subtasks of their own. The last step of grasping can be decomposed into reaching, grabbing, and retreating the arm back to its home location. Eventually the subtasks end up in robot primitives, the simplest actions the system can describe. Each of these actions can have a failure reason, along with context, which would include sensor data and prior relevant state information. If the motion planner fails to find a valid solution, the error propagates up and the action of “grasping an object” failed because the subtask “grabbing the object” failed when “reach for the object” failed as a result of no valid inverse kinematics solution being found. The system needs to correctly determine the most relevant and useful failure level to report; for example, in this case, an expert operator would be told, “No valid inverse kinematics solution was found” while a bystander would be told “I could not reach the object.”
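As a minimal sketch of how a failure could propagate up such a task hierarchy and be reported at different levels, consider the following. The Task class, the failure_chain and explain functions, and the two audience labels are assumptions made for illustration, not part of the proposed system; for simplicity the sketch treats reaching as a direct subtask of grasping.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    subtasks: List["Task"] = field(default_factory=list)
    failure_reason: Optional[str] = None   # set only on a primitive that failed
    bystander_text: Optional[str] = None   # hand-written plain-language summary


def failure_chain(task):
    """Depth-first search for the first failing primitive. Returns the path
    of tasks from the root down to it, or an empty list if nothing failed."""
    if task.failure_reason is not None:
        return [task]
    for sub in task.subtasks:
        chain = failure_chain(sub)
        if chain:
            return [task] + chain
    return []


def explain(root, audience):
    """Pick the failure level to report based on the audience."""
    chain = failure_chain(root)
    if not chain:
        return "The task succeeded."
    if audience == "expert":
        # Experts receive the lowest-level cause.
        return chain[-1].failure_reason
    # Bystanders receive the deepest plain-language text on the chain.
    for task in reversed(chain):
        if task.bystander_text:
            return task.bystander_text
    return f"I could not complete: {root.name}."


reach = Task("reach for the object",
             failure_reason="No valid inverse kinematics solution was found",
             bystander_text="I could not reach the object.")
grasp = Task("grasp the object",
             [reach, Task("grab the object"), Task("retreat arm to home")])
pick = Task("pick up an object",
            [Task("locate the object"), Task("navigate to the object"), grasp])

expert_msg = explain(pick, "expert")
bystander_msg = explain(pick, "bystander")
```

Keeping the full chain, rather than only the leaf error, is what lets the same failure be reported at several granularities and leaves room for follow-up questions.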

3 Storage and Querying

Terminal output and logs are common methods for debugging during active development and for error analysis after a robot has been deployed, but both methods have drawbacks. Terminal output is essentially volatile memory, lost after the terminal window is closed, disallowing retrospection. Software logs, although persistent on disk, are unstructured and do not link related data, which makes them hard to query effectively and efficiently. Thus, researchers have been exploring database techniques to better store and query robotic data. Because storage and querying are under-discussed in the robotic community, this section is more detailed than the other two components.

3.1 Storing Unprocessed Data

Many researchers have been leveraging the schemaless MongoDB database to store unprocessed data from sensors or communication messages from lower-level middleware such as motion planners [36, 9]. Being schemaless allows for recording different hierarchical data messages without declaring the hierarchy in the database (i.e., tables in relational databases such as MySQL). One such hierarchical example is the popular Pose message type present in the Robot Operating System (ROS) framework [38]. A Pose message contains a position Point message and an orientation Quaternion message; a Point message contains float values x, y, and z; a Quaternion message is represented by x, y, z, and w. In a relational database, one would have to go through the cumbersome process of creating tables for Pose, Point, and Quaternion, and even more tables for every other hierarchical data message. This advantage is also described as minimal configuration and allows evolving data structures to support innovation and development [36].
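To make the contrast concrete, the sketch below represents a Pose as a nested document and then flattens it into the column names a single relational table would have to declare in advance. The field values are made up, and the flatten helper is a hypothetical illustration rather than part of any database library.

```python
import json

# A ROS Pose represented as a nested document, the way a schemaless
# document store could hold it without any table declaration.
pose_doc = {
    "position": {"x": 1.0, "y": 2.0, "z": 0.5},
    "orientation": {"x": 0.0, "y": 0.0, "z": 0.0, "w": 1.0},
}


def flatten(doc, prefix=""):
    """Flatten nested fields into dotted column names, roughly what one
    relational table would need to declare up front."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


columns = flatten(pose_doc)          # 7 scalar columns for one Pose
serialized = json.dumps(pose_doc)    # the document can be stored as-is
```

Every new message type would need its own flattening or its own set of tables in a relational schema, whereas the nested document can be stored unchanged.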

Niemueller, Lakemeyer and Srinivasa open-sourced the library and are among the first to introduce MongoDB to robotics for logging purposes, which has applications to fault analysis and performance evaluation [36]. In addition to being schemaless, the features that support scalability, such as capped collections, indexing and replication, are discussed. Capped collections handle limited storage capacity by replacing old records with new ones. Indexing on a field or a combination of fields speeds up querying. Replication allows storing data across computers using the distributed paradigm. Note that the indexing and replication features are also supported by relational databases.
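The replacement behavior of a capped collection can be illustrated with an in-memory stand-in. The CappedLog class below is a hypothetical illustration and not MongoDB's API; it also caps by record count for simplicity, whereas a real capped collection is bounded by storage size.

```python
from collections import deque


class CappedLog:
    """In-memory stand-in for a capped collection: fixed capacity, with the
    oldest records dropped first. Illustration only, not MongoDB's API."""

    def __init__(self, max_records):
        self._records = deque(maxlen=max_records)

    def insert(self, record):
        # When full, deque silently discards the oldest record.
        self._records.append(record)

    def find(self, start, end):
        """Return records whose 'stamp' lies in [start, end]."""
        return [r for r in self._records if start <= r["stamp"] <= end]

    def __len__(self):
        return len(self._records)


log = CappedLog(max_records=3)
for t in range(5):
    log.insert({"stamp": t, "topic": "/joint_states"})

remaining = [r["stamp"] for r in log.find(0, 10)]
```

After five inserts into a log capped at three records, the two oldest records can no longer be queried, so any information they contained is lost.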

While low-level data is needed, recording all raw data will soon hit the storage capacity limit: when old data is replaced by new records, the important information in the old data will be lost. This is particularly true when the data comes at a high rate; e.g., a HERB robot generates 0.1 GB per minute typically and 0.5 GB at peak times [36]. A more effective way is to be selective, only storing the data of interest [37]. However, storing raw sensor data only facilitates debugging for developers; it does not solve the high-level explanation storage that will help non-expert users to understand the robot.

In addition, while it might be appropriate to expose the database to developers, a more effective way may be an interface that hides the database complexity, easing the cognitive burden on developers. This could be programming-language agnostic, for example, by having an HTTP REST API.

Other researchers have also used MongoDB to store low-level data [8, 35, 48, 7], with the exception of Oliveira et al. [oliveira2014perceptual], who used LevelDB, a key-value database, for perceived object data. Ravichandran et al. [ravichandran2018workbench] benchmarked major types of databases and found that, on average, MongoDB has the best performance for continuous robotic data. However, time-series and key-value databases were not included in the benchmark.

3.2 Storing Processed Data

Instead of looking for related data using the universal time range, Bálint-Benczédi et al. [balint2017storing] proposed the Common Analysis Structure to store linked data for manipulation tasks. The structure includes timestamp, scene, image, and camera information. A scene has a viewpoint coordinate frame, annotations, and object hypotheses. Annotations are supporting planes or a semantic location, and object hypotheses are regions of raw data and their respective annotations. The authors considered storage space constraints, thus filtering and storing only regions of interest in unblurred images or point clouds. In their follow-up work [18], the Common Analysis Structure is used to optimize perception parameters by having users provide ground truth labels.

Similarly, Oliveira et al. [oliveira2014perceptual] proposed a perception database using LevelDB to enable object category learning from users. Instead of regions of raw point cloud data, user-mediated key views of the same object are stored and linked to one object category.

Wang et al. [wang2012cloudrobot] utilized a relational database as cloud robotics storage so that multiple low-end robots can retrieve 3D laser scan data from a high-end robot, which has a laser sensor and processes its data onboard with more storage and better computation power. Specifically, low-end robots can send a query with their poses on a map to retrieve 3D map data and image data. PostgreSQL is used but the data structure detail is not discussed, as the paper focuses on resource allocation and scheduling. However, a local data buffer on robots is proposed to store frequently accessed data to reduce the database access latency bottleneck.

Dietrich et al. [dietrich2014distributed] used Cassandra to store and query 2D and 3D map data with spatial context such as building, floor, and room. There are several benefits of using Cassandra, such as the ability to have a local server that can query both local data and remote data, avoiding a single point of failure. Developers can also define TTLs (Time to Live) to remove data automatically, avoiding a maintenance burden.

In addition, Fourie et al. [fourie2017slamindb] leveraged a graph database, Neo4j, to link vision sensor data stored in MongoDB to pose-keyed data. Graph databases allow complex queries with spatial context for multiple mapping mobile robots, which enables multi-robot mapping. This line of research focuses on storing processed data but did not discuss a way to link raw data back to the processed data. This is important because not storing linked raw data may lead to loss of information during retrospection. There is a trend that other types of database systems, e.g., relational databases (PostgreSQL) and key-value databases (LevelDB and Cassandra), are used to store those processed data, because only a few ever-evolving data structures need to be stored.

3.3 Querying

There is no unified method for querying; most are application specific, such as efficient debugging [36, 7] and task representation [9, 43]. Interfaces are also tightly coupled to programming languages: JavaScript from MongoDB [36], Prolog [9, 43] and SQL [16].

Niemueller, Lakemeyer and Srinivasa [niemueller2012genericdb] proposed a knowledge hierarchy for manipulation tasks to enable efficient querying for debugging. The knowledge hierarchy consists of all raw data and the poses of the robot and manipulated objects, all of which are timestamped. When a manipulation task fails, a top-down search is performed in the knowledge hierarchy in a specific time range. Poses are at the root of the hierarchy and raw data, such as coordinate frames and point cloud data, are replayed in a visualization tool for further investigation (i.e., Rviz in ROS). The query language is JavaScript using the MapReduce paradigm, which supports aggregation of data natively.

Beetz et al. [beetz2015openease] proposed Open-EASE, a web interface for robotic knowledge representation and processing for developers [9, 43]. Robotics and AI researchers are able to encapsulate manipulation tasks semantically as temporal events with sets of predefined semantic predicates. Manipulation episodes are logged by storing low-level data, which are the environment model, object detection results and poses, and planned tasks in an XML-based Web Ontology Language (OWL) [44]. High-velocity raw data such as sensor data and robot poses are logged in a schemaless MongoDB database. Querying uses Prolog with a predefined concept vocabulary, similar to the semantic predicates.

While Open-EASE allows semantic querying, it does not come easily. One disadvantage is the introduction of a different programming paradigm, logic programming in Prolog, which robot developers have to learn for querying regardless of the paradigm being used for robot programming. It is also unclear how to extend the pre-defined semantic predicates for other generic tasks in different environments.

Bálint-Benczédi et al. [balint2017storing] use a similar high-level description language to replace the JavaScript query feature in MongoDB, avoiding the need for in-depth knowledge of the internal data structure. The description language also contains predefined predicates and can be queried through Prolog. This work has the same drawbacks as Open-EASE.

Interestingly, Dietrich et al. [dietrich2015selectscript] proposed SelectScript, a SQL-inspired query-only language for robotic world models that does not require relevant tables in the database. Without using a different programming language to specify how to retrieve data, SelectScript provides a declarative and language-agnostic way to specify what data are needed rather than how. SelectScript also features custom function support for queries and custom return types native to robotic applications, such as an occupancy grid map.

While SelectScript is modeled on the well-known standard SQL, it is not language-agnostic as stated. Custom functions are only supported in Python, leaving ROS C++ programmers behind. Besides requiring significant effort to support C++, it is not trivial in SelectScript to extend the return type to new data types such as the popular Octomap used in 3D mapping [25] for obstacle avoidance.

Fourie et al. [fourie2017slamindb] proposed to use a graph database to query spatial data from multiple mobile robots. However, it is not suitable for our use given that only one relationship is used: odometry poses linked to image and RGBD data. This work also suffers from the same drawback as SelectScript in that custom queries have to be programmed in one specific language, in this case Java.

Similar to our argument in the previous storage sections, robotic database designers should embrace the programming languages with which robotic developers are already familiar. Database technology should be hidden by interfaces written in programming languages that also support access to the underlying database for advanced and customized use.

3.4 Proposed Approach

MongoDB’s use has been proven by robotic developers and should thus be chosen to store low-level sensor data, which is potentially large and high velocity. This is mainly to replace the rosbag utility, which relies on a filesystem and is not easy to query. To store summaries at different levels and to link them, a relational database should be chosen because it is specialized to store relational data. Additional columns will be used as references to the sensor data in MongoDB. When data are deleted in MongoDB or the relational database, the linked data should be deleted as well; this can be achieved by a background job system or enforced in the programming interface to be discussed below.
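A minimal sketch of the relational side of this design is shown below, using SQLite from the Python standard library purely so the example is runnable. The table name summary, its columns, and the raw_ref identifier are assumptions made for illustration; the raw record it points to would live in the separate store for sensor data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
conn.execute("""
    CREATE TABLE summary (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES summary(id) ON DELETE CASCADE,
        level     TEXT NOT NULL,   -- e.g. 'bystander' or 'expert'
        stamp     REAL NOT NULL,   -- time of the summarized event
        text      TEXT NOT NULL,
        raw_ref   TEXT             -- id of the raw record in the sensor store
    )
""")


def add_summary(level, stamp, text, parent_id=None, raw_ref=None):
    """Insert one summary and return its row id."""
    cur = conn.execute(
        "INSERT INTO summary (parent_id, level, stamp, text, raw_ref) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent_id, level, stamp, text, raw_ref))
    return cur.lastrowid


def follow_up(summary_id):
    """Return the more detailed summaries linked to the given one."""
    return conn.execute(
        "SELECT level, text, raw_ref FROM summary WHERE parent_id = ?",
        (summary_id,)).fetchall()


top = add_summary("bystander", 12.5, "I could not reach the object.")
detail = add_summary("expert", 12.5,
                     "No valid inverse kinematics solution was found",
                     parent_id=top, raw_ref="hypothetical-raw-id-42")
```

Here the cascade only removes linked summaries inside the relational database; deleting the referenced raw record in the other store would still need the background job or interface-level enforcement mentioned above.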

Instead of directly exposing the database to robot developers, the storage and querying interface should be written in the same programming language the developer is using for the robot system. The programming interface should be written in an object-oriented manner to allow easy extension (e.g., from a single robot to a cluster). A minimum subset of functional programming should also be used to support custom functions, similar to SelectScript [16]. For common use cases, the determination of which data storage method to use should be handled by the interface so users do not need to be concerned with the underlying database being used. However, a raw interface that enables developers to directly communicate with the underlying database should not be completely left out, to allow for use cases that are not in consideration by interface designers. In terms of programming languages, C++ should be used, given its popularity among developers and most packages of the ROS framework. Python support should also be provided by binding the C++ implementation (i.e., exposing a C++ class in Python).

Human interface developers should be able to query using indexes such as state summarization level, time range, and others specific to the domain. Custom queries should also be allowed by having interface functions exposing the database, as interface designers are not able to consider an exhaustive list of all use cases. Human interface developers should also be able to easily query linked follow-up data after the first query. This is essential for the human interface to provide interaction (e.g., interactive conversation or projection).
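The querying requirements in this section could look roughly like the following sketch, again using SQLite from the Python standard library only to keep the example runnable. The class ExplanationStore and its methods add, query, and raw are hypothetical names, not an existing library.

```python
import sqlite3


class ExplanationStore:
    """Hypothetical storage interface that hides the database from callers."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE summary ("
            "id INTEGER PRIMARY KEY, level TEXT, stamp REAL, text TEXT)")
        # Index the fields the human interface is expected to filter on.
        self._conn.execute(
            "CREATE INDEX idx_level_stamp ON summary (level, stamp)")

    def add(self, level, stamp, text):
        self._conn.execute(
            "INSERT INTO summary (level, stamp, text) VALUES (?, ?, ?)",
            (level, stamp, text))

    def query(self, level, start, end):
        """Common case: summaries of one level within a time range."""
        rows = self._conn.execute(
            "SELECT text FROM summary "
            "WHERE level = ? AND stamp BETWEEN ? AND ? ORDER BY stamp",
            (level, start, end))
        return [text for (text,) in rows]

    def raw(self, sql, params=()):
        """Escape hatch for use cases the interface designers did not foresee."""
        return self._conn.execute(sql, params).fetchall()


store = ExplanationStore()
store.add("bystander", 10.0, "I am looking for the object.")
store.add("bystander", 12.5, "I could not reach the object.")
store.add("expert", 12.5, "No valid inverse kinematics solution was found")

recent = store.query("bystander", 12.0, 13.0)
```

The query method covers the indexed common case, while raw keeps direct database access available for custom queries without making it the default path.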

4 Human Interface

A human interface is used to communicate the explanations generated by the robot. Communication of the explanations can occur in different channels, such as a traditional graphical user interface (GUI) on a monitor, head-mounted displays, and robot movements. While some human interface methods have been studied for decades in the HCI community, our literature review is selective. We focus largely on novel approaches (e.g., AR techniques) and the most prominent related work, justified by the citation number relative to the publication year. There is a large body of research for some techniques with existing comprehensive literature reviews. We direct readers to the following papers: eye gaze in social robotics [3], using animation techniques with robots [40] (which provides 12 design guidelines), speech and natural language processing for robotics [32], and tactile communication via artificial skins in social robots [42].

4.1 Display Screen

While computer interfaces largely use a display screen for the primary communication channel, screens on robots are largely used to display facial expressions [27] due to their physicality, but are considered less convenient than speech [14]. For co-location scenarios, it is rare to find a display screen as part of a robot that is not attached to its head, so very little research has been performed for simple displays or visualizations of sensor values or other relevant information.

brooks2017human investigated displaying a general set of state icons on the body of robots to indicate internal states brooks2017human. Five icons – OK, Help, Off, Safe, and Dangerous – were shown to participants for evaluation. The results show that while bystanders are able to understand those icons, their level of understanding is vague. For example, the “Off” icon could be interpreted as stating that the robot is powered off or that it is just not currently actively operating.

SoftBank Robotics’ Pepper robot is one of the few robot systems that features a touch screen not attached to its head. feingold9differences found that people enjoyed interacting with a touch screen on a robot more than using a computer screen with a keyboard feingold9differences. Specifically, participants preferred to use the touch screen to indicate the completion of a task. de2018towards used Pepper’s screen to present buttons for inputting instructions, such as object directions de2018towards. bruno2018culturally used Pepper’s touch screen like a tablet, showing multiple-choice questions that users answer by tapping on the choices bruno2018culturally.

While a display screen has been demonstrated to be effective at showing accurate information (e.g., replaying past events [26], which can be used during explanations), there is sometimes a mental conversion issue where humans have to map what is displayed on the screen to the physical environment. A display screen may also suffer from being less readable from a longer distance, which is important as such proximity to a robot may not be safe during certain failure cases [24].

4.2 Augmented Reality (AR)

Utilizing AR for explainability allows visual cues to be projected directly into the environment with which the robot interacts, allowing for more specificity and reference points to be drawn. This technique can make explanations more accurate and less ambiguous, and can remove the burden of mental mapping between different reference frames (e.g., a 2D display screen compared to the real-world 3D environment).

andersen2016projecting proposed to use a projector to communicate a robot’s intent and task information onto the workspace to facilitate human-robot collaboration in a manufacturing environment andersen2016projecting. The robot locates a physical car door using an edge-based detection method, then projects visualizations of parts onto it to indicate its perception and intended manipulation actions. The authors also conducted an experiment by asking participants to collaboratively rotate and move cubes with the robot arm, comparing the AR projector method to the use of a display screen with text. Results show there were fewer performance errors and questions asked by the participants when using the projector method.

For mobility, researchers have also leveraged projection onto the ground to show robot intention. chadalavada2015s projected a green line to indicate the planned path and two white lines to the left and right of the robot to visualize the collision avoidance range of the robot chadalavada2015s. Gradient light bands have also been used to show a robot’s path [47]. Similarly, coovert2014spatial projected arrows to show the robot’s path coovert2014spatial, while daily2003world used a head-mounted display to visualize the robot’s path onto the user’s view of the environment daily2003world. Circles have also been used to show landmarks on a robot swarm using a projector located above the performance space [21]. However, the AR techniques used in these papers are passive and not interactive. chakraborti2018projection proposed using Microsoft HoloLens to enable a user to interact with AR projections chakraborti2018projection, where users can use pinch gestures to move a robot’s arm or base, start or stop robot movement, and pick a block for stacking.

AR may be more salient than a display screen for our use case, but it does have some drawbacks. For example, it cannot be used by a robot to initiate explanations, because a projection is easily ignored if a human is not paying attention to the projected area.

4.3 Robot Activities

Figure 2: The workflow of the robot explanation system. State summaries are saved in the databases which are hidden by programming interfaces. A multi-modal human interface queries the database through another programming interface and answers follow-up conversation and interactions. The state summarization component can also initiate explanations.

Due to the physicality of robot systems, body language of robots has been studied extensively in the HRI community to communicate intent. For example, dragan2015effects proposed using legible robot arm movements to allow people to quickly infer the robot’s next grasp target dragan2015effects. Repeated arm movement has also been proposed to communicate a robot’s incapability to pick up an object [29]. Eye gaze behavior or head movement has also been studied (e.g., [34, 3]). However, this communication method is limited in the amount of information that it can convey, if used as the only channel of communication.

In addition to robot movements, researchers have also explored auxiliary methods of communication such as light. Notably, the Rethink Robotics Baxter system utilizes a ring of lights on its head to indicate the distance of humans moving nearby to support safe HRI. Similarly, szafir2015communicating used light to indicate the flying direction of a drone when co-located with humans in close proximity szafir2015communicating; the results show improvement in response time and accuracy.

4.4 Proposed Approach

Given that speech is a natural interaction method for people, it should be considered for initial explanations. It can be used to garner a person’s attention and to initiate high-level summary explanations. However, it is limited in that only one audio stream is available at a time (i.e., the human cannot listen to multiple streams simultaneously without interference). Body language can be used to supplement speech explanations, but likely should not be used as the only communication channel. For example, robots can use arm movements to refer to relevant geography of the robot (e.g., components, actions) and the task space (e.g., objects, areas of the environment) simultaneously with another communication channel such as speech.

When a human requests more detailed explanations, other communication channels should be used to avoid misinterpretation. One such interface is the AR projection method in the literature, given the limitations of display screens (i.e., availability, size, reading distance). When using a projection method, one should keep in mind that a projection on the ground is not always visible [11].

Communication of explanations may occur at three temporal levels (a priori, in situ, or post hoc), which will impact the effectiveness of a chosen communication channel. More detailed, in-depth communications via visual or audio means may be better suited for explanations of planned actions (a priori) or analysis of resulting actions (post hoc). Simpler techniques for alerting people (e.g., flashing lights, vibrating tactors) may be better suited for conveying state information in situ, at least to garner attention before more information is conveyed.

Thus, communication should be multi-modal, as different methods are better suited to different levels of explanation, temporal levels, and data types. Designers also need to ensure that the human understands every means of communication that is used.
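One way to read the guidance in this section is as a channel-selection policy. The sketch below is only an illustration under assumptions: the channel names, the `TemporalLevel` enum, and the `choose_channels` function are hypothetical, and the policy encodes just the three points made above (an alert first when in situ, speech for the initial high-level summary, AR projection when more detail is requested).

```python
from enum import Enum


class TemporalLevel(Enum):
    A_PRIORI = "a priori"   # explaining planned actions
    IN_SITU = "in situ"     # conveying state while acting
    POST_HOC = "post hoc"   # analyzing resulting actions


def choose_channels(level, detailed=False):
    """Return an ordered list of channels for one explanation (hypothetical policy)."""
    channels = []
    if level is TemporalLevel.IN_SITU:
        # A simple alert garners attention before more information is conveyed.
        channels.append("flashing_lights")
    # Speech carries the initial, high-level summary.
    channels.append("speech")
    if detailed:
        # A richer visual channel is added when the human asks for detail.
        channels.append("ar_projection")
    return channels
```

A real system would likely also weigh factors this sketch ignores, such as whether the projection surface is visible and whether the human is attending to it.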

5 Integration

Next, we describe the integration of the three components of our approach and their interconnections in the robot explanation system. The proposed component designs are intended to serve as guidelines for development with reasonable justifications, rather than compulsory decisions. In addition, the design of each component should be self-contained and decoupled from the operation of other components, allowing each component to be used independently.

5.1 System Review and Workflow

A diagram of the system is shown in Figure 2. There are two main methods for state summarization: manual methods and summarization algorithms. Manually, developers can annotate functions or specify goals as leaves in a tree structure in the robotic applications to provide explanations. When using summarization algorithms, explanations can be learned using end-to-end or semi-supervised deep learning from robot states and annotator-provided explanations. Summarization algorithms should also be able to generalize summaries online when the stored summaries are not sufficient to answer some users’ questions. While generating explanations, the state summarization component can initiate explanations if necessary (e.g., in cases of incapability or failures).
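The manual method of annotating functions could look roughly like the following sketch. Everything here is an assumption made for illustration: the `explain` decorator, the `SUMMARY_LOG` list standing in for the storage component, the `grasp_caddy` function, and the 1.0 m reach limit are hypothetical and do not describe the authors' code or the Fetch robot's real specifications.

```python
import functools
import time

SUMMARY_LOG = []  # stand-in for the storage component (assumption)


def explain(summary, on_failure=None):
    """Hypothetical decorator that attaches a human-readable summary to a robot function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "function": func.__name__,
                "summary": summary,
                "stamp": time.time(),
                "robot_initiated": False,
            }
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # On failure, flag the record so the robot can initiate an explanation.
                record["summary"] = on_failure or f"{summary} failed"
                record["error"] = str(exc)
                record["robot_initiated"] = True
                SUMMARY_LOG.append(record)
                raise
            SUMMARY_LOG.append(record)
            return result
        return wrapper
    return decorator


@explain("Grasping the caddy", on_failure="I cannot reach the caddy")
def grasp_caddy(distance_m, max_reach_m=1.0):
    """Toy grasp routine; max_reach_m is a made-up value, not a real robot limit."""
    if distance_m > max_reach_m:
        raise RuntimeError("target out of reach")
    return True
```

Calling `grasp_caddy(0.5)` appends a normal summary record, while `grasp_caddy(2.0)` appends a record flagged as robot-initiated before re-raising the error.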

The generated state summaries and their linked raw data are then saved to databases through a programming interface, currently using C++ or Python. Two databases are used: MongoDB for storing raw sensor data and a relational database to store explanations (i.e., linked summaries). The two data storage methods are mostly hidden by the interface, allowing developers to use the programming language and avoid knowing database details. However, the interface will also provide ways to directly access each database if needed.
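A minimal sketch of such a storage interface is given below. It is not the authors' implementation: the `ExplanationStore` class and its methods are hypothetical, an in-memory dictionary stands in for the document store that holds raw sensor data (MongoDB in the paper), and SQLite stands in for the relational database that holds summaries.

```python
import sqlite3
import uuid


class ExplanationStore:
    """Hypothetical facade that hides both storage back ends from developers."""

    def __init__(self):
        self._raw = {}  # stand-in for a document store of raw sensor data
        self._sql = sqlite3.connect(":memory:")  # stand-in relational database
        self._sql.execute(
            "CREATE TABLE summaries ("
            " id INTEGER PRIMARY KEY,"
            " text TEXT NOT NULL,"
            " raw_id TEXT)"  # key of the linked raw data, if any
        )

    def save(self, text, raw_data=None):
        """Store a summary and, optionally, the raw data it was derived from."""
        raw_id = None
        if raw_data is not None:
            raw_id = uuid.uuid4().hex
            self._raw[raw_id] = raw_data
        cur = self._sql.execute(
            "INSERT INTO summaries (text, raw_id) VALUES (?, ?)", (text, raw_id)
        )
        self._sql.commit()
        return cur.lastrowid

    def load(self, summary_id):
        """Return (summary text, linked raw data or None)."""
        row = self._sql.execute(
            "SELECT text, raw_id FROM summaries WHERE id = ?", (summary_id,)
        ).fetchone()
        if row is None:
            raise KeyError(summary_id)
        text, raw_id = row
        return text, self._raw.get(raw_id)

    @property
    def sql(self):
        """Direct access to the relational back end, for cases the facade does not cover."""
        return self._sql
```

Callers use only `save` and `load` in the common case, while the `sql` property mirrors the stated requirement that direct database access remain possible.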

With the stored summaries in the database, after the robot or human initiates communication, the human interface is responsible for all follow-up conversations or interactions by passing the semantics to the state summarization component. Communication should occur in multiple modalities including speech, body language, screen, and AR (e.g., projection techniques). Due to the differences of each communication method in terms of fidelity, attention-getting, etc., the system will utilize multi-modal communication.

5.2 Evaluation

To evaluate the effectiveness of the system, usability testing should be performed with users of varying experience to assess the acceptance and understanding of the system.

Figure 3: The FetchIt! task environment we recommend for evaluation. The robot’s goal is to place the irregular parts into the correct sections of the concave caddy, and transport the caddy to the bottom-left table for inspection.

For an example scenario and tasks, we recommend using the FetchIt! mobile manipulation challenge [20]; Figure 3 shows a rendering of the task space. The tasks are to assemble a kit by navigating to collect parts at different stations. While scoped for a manufacturing environment, the same types of tasks are relevant to home environments: e.g., navigating between areas in a narrow hallway kitchen and a dining table, and manipulating objects in these places. The challenges are also not unique to a work cell manufacturing environment. One challenge is detecting different objects whose shapes are complex and irregular (e.g., large bolts and gearbox parts; kitchen utensils would be similarly difficult). The screws are in the blue container and are not visible in the rendered task space. Another challenge is to navigate through the narrow and constrained work cell; the available space and obstructions therein are similar to those of a kitchen.

The FetchIt! environment provides a reasonable test bed for a robot explanation system because there are several opportunities for unexpected or opaque events to occur. For example, a common occurrence is that the Fetch robot may not be able to grasp a caddy or a gearbox part if it is placed too close to a wall. Fetch’s arm may not be long enough to reach given the constraints presented by the end-effector orientation (it must be pointed down in order to grasp the caddy) and standoff distances imposed by the dimensions of the tables. These scenarios are not apparent to novice users or bystanders who do not have intimate knowledge of Fetch’s characteristics. In this scenario, Fetch should initiate an explanation to inform the user. Another common occurrence is confusion when differentiating between two gearbox parts that appear similar in height via point cloud due to sensor noise. This can make the object detection fail, causing the robot to grasp the incorrect object. In this scenario, a human might initiate a robot explanation, as the robot may not be aware that it performed incorrectly. Another human-initiated robot explanation might be when the robot stops at a different location in front of the caddy table than what was expected, and places a part into the incorrect caddy. This could occur due to navigation error range and the narrow horizontal field of view (54°) of its RGBD camera, which may cause part of the caddy to be occluded.

The FetchIt! competition testbed is available for Gazebo on GitHub: _gazebo/tree/gazebo9/fetchit_challenge. A working implementation, including navigation and manipulation, is available at

6 Future Work

To date, we have started implementing the robot explanation system. The system will be open-sourced to facilitate advancing research and to assist other practitioners in integrating their existing software with the system. We plan to evaluate the implementation with a formal HRI user study and analyze the results for further improvements.

7 Conclusion

This paper presents a survey of the three components of the robot explanation system. For state summarization, manually generated summaries may be a practical workaround because learning methods are not yet mature and require more research effort. For storage and querying, the programming interface should be developed for easy integration by state summarization and human interface developers. Multi-modal human interface communication methods should be used not only to garner attention from humans when initiating an explanation, but also as the enabling methods to convey both high-level summarized explanations and low-level detailed explanations.


This work has been supported in part by the Office of Naval Research (N00014-18-1-2503).


  • [1] A. Abdul, J. Vermeulen, D. Wang, B. Y. Lim, and M. Kankanhalli (2018) Trends and trajectories for explainable, accountable and intelligible systems: an hci research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 582:1–582:18. Cited by: §1.2, §1.
  • [2] A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6, pp. 52138–52160. Cited by: §1.2.
  • [3] H. Admoni and B. Scassellati (2017) Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction 6 (1), pp. 25–63. Cited by: §4.3, §4.
  • [4] H. Admoni, T. Weng, B. Hayes, and B. Scassellati (2016) Robot nonverbal behavior improves task performance in difficult collaborations. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, pp. 51–58. Cited by: §1.
  • [5] S. Anjomshoae, A. Najjar, D. Calvaresi, and K. Främling (2019) Explainable agents and robots: results from a systematic literature review. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1078–1088. Cited by: §1, §2.3.
  • [6] W. A. Bainbridge, J. Hart, E. S. Kim, and B. Scassellati (2008) The effect of presence on human-robot interaction. In The 17th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 701–706. Cited by: §1.
  • [7] F. Balint-Benczédi, Z. Márton, M. Durner, and M. Beetz (2017) Storing and retrieving perceptual episodic memories for long-term manipulation tasks. In 2017 18th International Conference on Advanced Robotics (ICAR), pp. 25–31. Cited by: §3.1, §3.3.
  • [8] M. Beetz, L. Mösenlechner, and M. Tenorth (2010) CRAM — a cognitive robot abstract machine for everyday manipulation in human environments. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1012–1017. Cited by: §3.1.
  • [9] M. Beetz, M. Tenorth, and J. Winkler (2015) Open-EASE — a knowledge processing service for robots and robotics/ai researchers. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1983–1990. Cited by: §3.1, §3.3, §3.3.
  • [10] D. Brooks, A. Shultz, M. Desai, P. Kovac, and H. A. Yanco (2010) Towards state summarization for autonomous robots. In 2010 AAAI Fall Symposium Series, Cited by: §1.1.
  • [11] T. Chakraborti, S. Sreedharan, A. Kulkarni, and S. Kambhampati (2018) Projection-aware task planning and execution for human-in-the-loop operation of robots in a mixed-reality workspace. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4476–4482. Cited by: §4.4.
  • [12] J. Cleland-Huang, O. Gotel, A. Zisman, et al. (2012) Software and systems traceability. Vol. 2, Springer. Cited by: §2.3.
  • [13] M. M. De Graaf and B. F. Malle (2017) How people explain action (and autonomous intelligent systems should too). In 2017 AAAI Fall Symposium Series, Cited by: §2.3.
  • [14] M. de Jong, K. Zhang, A. M. Roth, T. Rhodes, R. Schmucker, C. Zhou, S. Ferreira, J. Cartucho, and M. Veloso (2018) Towards a robust interactive and learning social robot. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 883–891. Cited by: §4.1.
  • [15] M. Desai, P. Kaniarasu, M. Medvedev, A. Steinfeld, and H. Yanco (2013) Impact of robot failures and feedback on real-time trust. In Proceedings of the 8th ACM/IEEE International Conference on Human-robot Interaction, pp. 251–258. Cited by: §1.
  • [16] A. Dietrich, S. Zug, and J. Kaiser (2015) Selectscript: a query language for robotic world models and simulations. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 6254–6260. Cited by: §3.3, §3.4.
  • [17] A. D. Dragan, K. C. Lee, and S. S. Srinivasa (2013) Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE international conference on Human-robot interaction, pp. 301–308. Cited by: §1.
  • [18] M. Durner, S. Kriegel, S. Riedel, M. Brucker, Z. Márton, F. Bálint-Benczédi, and R. Triebel (2017) Experience-based optimization of robotic perception. In 2017 18th International Conference on Advanced Robotics (ICAR), pp. 32–39. Cited by: §3.2.
  • [19] U. Ehsan, P. Tambwekar, L. Chan, B. Harrison, and M. O. Riedl (2019) Automated rationale generation: a technique for explainable ai and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI), pp. 263–274. Cited by: §2.3.
  • [20] FetchIt!, a mobile manipulation challenge. Note: 2019-09-11 Cited by: §5.2.
  • [21] F. Ghiringhelli, J. Guzzi, G. A. Di Caro, V. Caglioti, L. M. Gambardella, and A. Giusti (2014) Interactive augmented reality for understanding and analyzing multi-robot systems. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1195–1201. Cited by: §4.2.
  • [22] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2018) A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51 (5), pp. 93. Cited by: §1.2.
  • [23] B. Hayes and J. A. Shah (2017) Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, pp. 303–312. Cited by: §2.1, §2.3.
  • [24] S. Honig and T. Oron-Gilad (2018) Understanding and resolving failures in human-robot interaction: literature review and model development. Frontiers in psychology 9, pp. 861. Cited by: §4.1.
  • [25] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard (2013) OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots. Note: Software available at External Links: Link, Document Cited by: §3.3.
  • [26] S. Jeong, I. Choi, Y. Kim, Y. Shin, J. Han, G. Jung, and K. Kim (2017) A study on ros vulnerabilities and countermeasure. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pp. 147–148. Cited by: §4.1.
  • [27] A. Kalegina, G. Schroeder, A. Allchin, K. Berlin, and M. Cakmak (2018) Characterizing the design space of rendered robot faces. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 96–104. Cited by: §4.1.
  • [28] F. Kaptein, J. Broekens, K. Hindriks, and M. Neerincx (2017) Personalised self-explanation by robots: the role of goals versus beliefs in robot-action explanation for children and adults. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 676–682. Cited by: §2.1.
  • [29] M. Kwon, S. H. Huang, and A. D. Dragan (2018) Expressing robot incapability. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pp. 87–95. Cited by: §1, §4.3.
  • [30] P. A. Lasota, T. Fong, J. A. Shah, et al. (2017) A survey of methods for safe human-robot interaction. Foundations and Trends® in Robotics 5 (4), pp. 261–349. Cited by: §1.
  • [31] B. F. Malle (2006) How the mind explains behavior: folk explanations, meaning, and social interaction. MIT Press. Cited by: §1.
  • [32] N. Mavridis (2015) A review of verbal and non-verbal human–robot interactive communication. Robotics and Autonomous Systems 63, pp. 22–35. Cited by: §4.
  • [33] T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1 – 38. Cited by: §1.2, §1.
  • [34] A. Moon, D. M. Troniak, B. Gleeson, M. K. Pan, M. Zheng, B. A. Blumer, K. MacLean, and E. A. Croft (2014) Meet me where i’m gazing: how shared attention gaze affects human-robot handover timing. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pp. 334–341. Cited by: §1, §4.3.
  • [35] T. Niemueller, N. Abdo, A. Hertle, G. Lakemeyer, W. Burgard, and B. Nebel (2013) Towards deliberative active perception using persistent memory. In Proc. IROS 2013 Workshop on AI-based Robotics, Cited by: §3.1.
  • [36] T. Niemueller, G. Lakemeyer, and S. S. Srinivasa (2012) A generic robot database and its application in fault analysis and performance evaluation. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 364–369. Cited by: §3.1, §3.1, §3.1, §3.3.
  • [37] M. Oliveira, G. H. Lim, L. S. Lopes, S. H. Kasaei, A. M. Tomé, and A. Chauhan (2014) A perceptual memory system for grounding semantic representations in intelligent service robots. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2216–2223. Cited by: §3.1.
  • [38] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng (2009) ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, pp. 5. Cited by: §3.1.
  • [39] J. M. Schraagen, S. F. Chipman, and V. L. Shalin (2000) Cognitive task analysis. Psychology Press. Cited by: §2.1.
  • [40] T. Schulz, J. Torresen, and J. Herstad (2019) Animation techniques in human-robot interaction user studies: a systematic literature review. ACM Transactions on Human-Robot Interaction (THRI) 8 (2), pp. 12. Cited by: §4.
  • [41] S. H. Seo, D. Geiskkovitch, M. Nakane, C. King, and J. E. Young (2015) Poor thing! would you feel sorry for a simulated robot?: a comparison of empathy toward a physical and a simulated robot. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pp. 125–132. Cited by: §1.
  • [42] D. Silvera-Tawil, D. Rye, and M. Velonaki (2015) Artificial skin and tactile sensing for socially interactive robots: a review. Robotics and Autonomous Systems 63, pp. 230–243. Cited by: §4.
  • [43] M. Tenorth, J. Winkler, D. Beßler, and M. Beetz (2015) Open-EASE: a cloud-based knowledge service for autonomous learning. KI-Künstliche Intelligenz 29 (4), pp. 407–411. Cited by: §3.3, §3.3.
  • [44] W3C (2009-10) OWL 2 web ontology language document overview. W3C Recommendation W3C. Note: Cited by: §3.3.
  • [45] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §2.
  • [46] N. Wang, D. V. Pynadath, and S. G. Hill (2016) The impact of pomdp-generated explanations on trust and performance in human-robot teams. In Proceedings of the 2016 international conference on autonomous agents & multiagent systems, pp. 997–1005. Cited by: §2.2.
  • [47] A. Watanabe, T. Ikeda, Y. Morales, K. Shinozawa, T. Miyashita, and N. Hagita (2015) Communicating robotic navigational intentions. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5763–5769. Cited by: §4.2.
  • [48] J. Winkler, M. Tenorth, A. K. Bozcuoglu, and M. Beetz (2014) CRAMm–memories for robots performing everyday manipulation activities. Advances in Cognitive Systems 3, pp. 47–66. Cited by: §3.1.
  • [49] Q. Zhang and S. Zhu (2018) Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 27–39. Cited by: §1.2.