Improving HRI through robot architecture transparency

08/26/2021 ∙ by Lukas Hindemith, et al. ∙ Honda

In recent years, increased effort has been invested in improving the capabilities of robots. Nevertheless, human-robot interaction remains a complex field of application where errors occur frequently. The reasons for these errors can primarily be divided into two classes. First, the recent increase in capabilities has also widened the possible sources of errors on the robot's side. This entails problems in the perception of the world, but also faulty behavior based on errors in the system. Second, non-expert users frequently have incorrect assumptions about the functionality and limitations of a robotic system. This leads to incompatibilities between the user's behavior and the functioning of the robot's system, causing problems on the robot's side and in the human-robot interaction. While engineers constantly improve the reliability of robots, the user's understanding of robots and their limitations has to be addressed as well. In this work, we investigate ways to improve the understanding of robots. For this, we employ FAMILIAR - FunctionAl user Mental model by Increased LegIbility ARchitecture, a robot architecture that is transparent with regard to the robot's behavior and decision-making process. We conducted an online simulation user study to evaluate two complementary approaches to convey knowledge about this architecture to non-expert users: a dynamic visualization of the system's processes and a visual programming interface. The results of this study reveal that visual programming improves knowledge about the architecture. Furthermore, we show that with increased knowledge about the control architecture of the robot, users were significantly better at reaching the interaction goal. Finally, we show that anthropomorphism may reduce interaction success.




1 Introduction

Advancements in robot capabilities have facilitated the transition of application fields from mainly static environments to more dynamic ones. Not only did the environment change, but also the involvement of humans (sendhoff2020cooperative). As a result, scenarios emerged where the robot cannot plan all actions in advance to accomplish a certain goal. Instead, only close interactions with a human partner allow the robot to determine the subsequent action. Moreover, not only experts are supposed to interact with a robot, but also naïve users. In the following, the term naïve user denotes a human user without a computer science background who is unfamiliar with interacting with a robotic system. However, we assume they live in a society where almost everyone is exposed to technology to some degree, such as computers or smartphones.

Although robots are capable of interacting with naïve users, problems occur regularly (johnson2009autonomy; tsarouhas2016mission). These can be attributed to both robots and users. More functionality on the robot side increases the potential for errors. To perceive the world, various hardware sensors need to work reliably, several modules need to communicate with each other, and data needs to be transferred (brooks2017human). The behavior of a robot also depends on successful perception of the environment, working actuators, and a correct design by engineers (steinbauer2012survey). Besides errors in the robot system, users have a tremendous influence on the performance of a robot. Especially in close human-robot interaction and cooperation, the successful behavior of a robot depends heavily on correct input by the user. A more detailed failure taxonomy can be found in (honig2018understanding).

Naïve users are often unaware of the functionality and limitations of a robot. This can be ascribed to the mental model a naïve user has about the robot. According to (staggers1993mental), a mental model is defined as a cognitive framework of internal representations humans build about things they interact with. When a person interacts with an artifact for the first time, an initial representation is built and continuously updated while interacting with it. This is influenced by prior experiences with other artifacts and expectations towards them. In the case of robots, people relate to experiences with other people (nass1994computers). Additionally, expectations about robots, formed by movies and media, may lead a naïve user’s mental model further away from reality. Consequently, a non-functional mental model is shaped, resulting in input that does not match the robot’s way of processing (kriz2010fictional). Henceforth, we refer to a non-functional mental model in cases where the internal representation about the robot’s functionality differs from reality in a way that leads the user to generate incorrect input and to be unable to comprehend the behavior of the robot. If the user is aware of the functionality and limitations of a robot, we refer to a functional mental model. We argue that a functional user’s mental model of the robot reduces erroneous human-robot interactions.

One approach for the user to gain a functional mental model is to convey the robot’s internal architecture. To achieve this goal, two main factors need to be considered. First, the architecture of the robot itself needs to be designed in a way that even non-expert users can understand its functionality. Second, knowledge about the architecture needs to be conveyed to the user comprehensibly. To achieve the best result, one has to balance functionality and comprehensibility. The development of a complex control architecture that solves a wide variety of problems comes at the cost of comprehensibility for users.

This work investigates ways to increase users’ knowledge about the architecture of the robot and its influence on the success of human-robot interactions. To increase the comprehensibility of the robot’s inner working, we employ a behavior-based architecture we call FAMILIAR – FunctionAl user Mental model by Increased LegIbility ARchitecture. We chose this architecture because it focuses on legibility for the user, while still abstracting parts. Based on this architecture, we implemented two complementary approaches to increase users’ knowledge about the architecture. To evaluate these approaches, we conducted an interactive online user study in a simulation to test whether they increase knowledge about the robot’s architecture and whether more architecture knowledge does indeed improve human-robot interactions.

2 Related Work

2.1 Mental Model Improvement

Communication in human-robot interaction fulfills a crucial role in improving the human’s understanding of the robot. In the field of didactics for computer science, researchers differentiate between the relevance and the architecture of computational artifacts (rahwan2019machine; schulte2018framework). This differentiation describes the dual nature of such artifacts. While an internal mechanism processes the input to generate a behavior (Architecture), this behavior becomes observable from the outside (Relevance). Knowledge about the relevance of a robot is the basis for a successful interaction. However, in erroneous human-robot interactions, knowledge about the architecture is needed to comprehend the source of the problem.

When interacting with a robot, it is of considerable importance for the user to comprehend the course of the interaction (Wortham2017) and whether an error occurred while executing a task (BERT; kwon2018expressing). In many instances, particular actions are only understandable if the overarching goal is known. Therefore, a robot should not only communicate which action it executes but also what goal should be achieved by this action (huang2019enabling; kaptein2017personalised).

The execution of actions is preceded by a series of perceptions by the robot which led to the decision. For the user to comprehend whether a wrong action was executed because of design errors or because the perception process failed, the decision process should also be communicated (Breazeal; Thomaz09; Otero08).

While various studies were conducted to measure the influence of communication strategies on users’ attitudes towards robots, the mental model was rarely investigated explicitly. Previous work showed naïve users can comprehend the architectural concepts of a robot. Furthermore, the comprehension depends on the familiarity and observability of concepts (lukas2020robots).

2.2 Robot Control Architecture

For a robot to interact with the world, sensors need to observe the environment. Based on the perception of the robot, the control module defines which actuators will interact with the world to produce a certain behavior. As this robot control module forms the basis for the robot’s interaction behavior, the user’s perception and comprehension of it have a tremendous influence on the success of the interaction. As previous work showed, the widely used concept of the state machine for robot control is incomprehensible for naïve users (lukas2020robots). State machines are composed of states and connections between them. A transition from one state to another depends on the state outcome for a given input. Furthermore, in robotic applications, state machines often allow for concurrency (bohren2010smach). This highly interlaced structure makes it difficult for naïve users to trace the decision-making process, which is especially crucial in case of errors.

Various architectures for robot control use complex control structures to achieve goal-oriented behavior. While these allow for solving challenging tasks, the comprehension by naïve users suffers. This is primarily due to the frequently used, hierarchical structure of the planning process (bryson2001intelligence; colledanchise2017behavior; erol1996hierarchical; peterson1977petri).

In contrast, behavior-based controls are more reactive in their traditional form (michaud2016behavior). They define a set of modules (called behaviors), which are composed of expected sensory input as the trigger and behavioral patterns that achieve a certain goal. These behavioral patterns, in turn, produce an output on actuators (maes1990learning). Even though the reactivity reduces the applicability for complex tasks, it increases the legibility. This is achieved by the tight coupling of sensors and their impact on executed behaviors. In contrast, state machines mostly run through an extensive process of state transitions to determine the execution of an action.

2.3 Knowledge Comprehension

To be useful for naïve users in the long run, robots need to be flexible and adjustable. To achieve this, non-expert users need to be able to adjust and program the behavior of robots. Out of this need, the field of End-User Development (EUD) emerged (paterno2017new). To enable novice users to program the behavior of robots, many tools use a visual interface (CORONADO2020100970). These interfaces allow the user to define the behavior of robots via drag-and-drop of behavioral primitives (huang2017code3; 5326209). One important difference between the various visual programming software modules is the abstraction of the low-level architecture of the robot. A trade-off between user experience, or simplicity of the software, and the abstraction and deception of the robot architecture must be made (DAGIT2006302). In our work, we want to impart the architecture of the robot as accurately as possible, while still being understandable. Therefore, our interface is designed for transparency, accepting a loss of user satisfaction.

3 Hypotheses

With the goal to improve human-robot interactions through increased knowledge about the architecture of the robot, we hypothesize the following:

3.1 Hypothesis 1

A visualization that imparts the inner processes of the robot will increase users’ knowledge about the processes of the robot control system.

3.2 Hypothesis 2

Through visually programming the behavior of the robot, the users’ structural knowledge about the robot’s control system will increase.

3.3 Hypothesis 3

Insights about the architecture of the robot control system will reduce user-induced errors in interactions with a robot.

4 Methodology

To design an appropriate robot control architecture, a trade-off between functionality and comprehensibility needs to be made. diprose2017designing suggests multiple abstraction levels for robot architectures. Their work separates social interaction into five abstraction levels: hardware primitives, algorithm primitives, social primitives, emergent primitives, and methods of controlling these primitives. Hardware primitives include the low-level hardware, e.g. a laser scanner. The algorithm primitives describe algorithms, such as speech recognition. The next higher level of abstraction is the social primitives, which are reusable units for social interaction (e.g. speaking). Building on these are the emergent primitives, which are functions that emerge through the combination of social primitives. The highest abstraction, the methods of controlling primitives, describes how the lower-level abstractions are controlled. Their investigations conclude that the abstraction level of social primitives is most suitable for programming robot social interactions. While our research focuses on conveying the architecture of the robot to the user, we followed this decision.

In the following, we will first describe the architecture, which consists of a behavior-based method for controlling primitives (cf. Section 4.1). For our first approach to improve the understandability of the architecture, we developed a dynamic visualization that communicates how the social primitives are used by the architecture (cf. Section 4.2). The second approach, the visual programming of the robot, is described in Section 4.3.

4.1 FAMILIAR – Architecture

The human-robot interaction failure taxonomy by (honig2018understanding) describes various sources of errors. According to their taxonomy, faulty interactions can be based either on technical issues of the robot or on incorrect input by the user. In both cases, the problem typically affects the behavior of the robot and thereby becomes observable. Therefore, users need to understand how the behaviors of the robot are triggered. Based on this, our control system was designed with regard to two essential factors. The system should be capable of acting in a dynamic, fast-changing environment and should be comprehensible for naïve users. Because the comprehensibility of the architecture was more important, we reduced the complexity of the system for better legibility. We named the architecture FAMILIAR – FunctionAl user Mental model by Increased LegIbility ARchitecture. The system also incorporates a general visualization to make the architecture observable for users. This visualization forms the basis of our dynamic visualization. We developed our behavior system in Python 2.7 (van1995python) using the Robot Operating System (ROS) (ros) as middleware.

Figure 1: Overview of the robot control architecture.

4.1.1 General Structure

The general structure of our behavior system is displayed in Figure 1. The architecture controls the behaviors of the robot by two primary components: the Perception and the Behavior Guidance. The Perception perceives the environment with sensors and extracts higher-level semantic sensors. The Behavior Guidance contains the selection process of behaviors to execute. Behaviors to execute are determined by the current set of semantic sensor values. The selected behavior for execution is sent to the Execution component, which executes low-level actions to generate the desired behavior.

4.1.2 Perception

To perceive the world, robotic systems need to pass through various processing stages. First, the low-level hardware primitives need to observe the world and represent the observations as numbers. These observations are then interpreted by algorithm primitives. While all these operations are important, they are difficult for users to understand. More importantly, if an error occurs in one of these steps, non-expert users would not be able to counteract the problem. Therefore, a more abstract and comprehensible representation of the robot’s perception is needed. saunders2015teach used so-called semantic sensors in their architecture. This type of sensor abstracts the low-level perception and extracts semantically simpler observations of the environment. In that way, users can more easily understand what the robot observed.

Values of semantic sensors are extracted in three steps. First, the robot uses an arbitrary number of hardware sensors together with software components to perceive the world. This can be, for example, a camera for object recognition or a microphone for speech recognition. Second, the perceptions are sent to a memory module, where they are combined with a set of sensors that represent the world (henceforth referred to as "world state"). Third, this world state is used to extract and update semantic sensors. These semantic sensors are then transmitted to the Behavior Guidance module. For example, a sensor module that recognizes people would return an array of bounding boxes, which denotes the locations of people as seen by the camera. A semantic sensor that extracts a user-comprehensible meaning from this might be a Boolean sensor that indicates whether a person is visible.
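The person example above can be sketched in a few lines. This is only an illustration under assumptions: the bounding-box format, the world-state key `person_boxes`, and both function names are hypothetical and not taken from the FAMILIAR implementation, and the sketch targets Python 3 rather than the Python 2.7 used for the behavior system.

```python
from typing import Dict, List, Tuple

# Hypothetical bounding box format: (x, y, width, height) in image pixels.
BoundingBox = Tuple[int, int, int, int]


def person_visible(bounding_boxes: List[BoundingBox]) -> bool:
    """Semantic sensor: True if the detector reported at least one person."""
    return len(bounding_boxes) > 0


def update_semantic_sensors(world_state: Dict[str, list]) -> Dict[str, bool]:
    """Derive user-facing semantic sensor values from the raw world state."""
    # "person_boxes" is an assumed key; a missing key means nothing was detected.
    return {"person_visible": person_visible(world_state.get("person_boxes", []))}


world_state = {"person_boxes": [(120, 40, 60, 180)]}
semantic_sensors = update_semantic_sensors(world_state)
```

The user only ever sees `person_visible`, while the array of boxes stays inside the perception layer.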

4.1.3 Behavior Guidance

The Behavior Guidance process is at the core of our architecture. We use the word guidance because behaviors are not only executed based on the current set of semantic sensors. The interaction context (represented as Interaction Protocols) further guides the selection process. The overall guidance component was kept as simple as possible to be comprehensible for naïve users. When errors in the interaction occur, the user should be able to trace the error back to its origin. With more complex robot control designs, the process of error tracing becomes too complex for non-expert users. For example, saunders2015teach integrated a hierarchical task network planner (SHOP2) (nau1999shop) to plan subsequent actions to achieve more complex behaviors of the robot.

Our system specifies an Interaction Protocol (IP) as a data structure to group behaviors that need to be executed to fulfill the goal of the interaction. The IP structure allows for more goal-specific interactions which, in turn, supports comprehension for users. An IP is a hierarchical structure and consists of one behavior that starts the protocol, an arbitrary number of behaviors that can relate to each other, and one or more exit behaviors to end the IP. The data structure also stores information about the execution status of behaviors, as well as a priority value that determines the rank compared to other protocols. IPs with a higher priority value are preferred. Moreover, it is possible to switch from an active IP to a higher prioritized one and later switch back. That way, timing-sensitive goals can be achieved faster.

A behavior is described by preconditions and predecessors that need to be satisfied for execution, and the action (with its parameters) that is triggered by the behavior. Preconditions specify desired values of semantic sensors for the behavior to be executable. In addition to preconditions, where semantic sensors are compared to desired values, a behavior can also define a set of predecessors that need to be completed before the behavior can be executed.

When the preconditions of a behavior are satisfied and all predecessors are completed, the behavior can be executed. The Behavior Selection process determines the behavior to execute (cf. Section 4.1.4). Executing it triggers the corresponding action. An action is defined by the process triggered in the Execution component and its parameters. The parameters can be either static values (specified in a configuration file) or dynamic values that depend on the current world state. For example, a static parameter value for a navigation action could be the goal "kitchen". In contrast, a dynamic parameter could define the goal as the current position of the interaction partner. The triggered action is transmitted to the Execution component. To achieve the goal of the behavior action, the Execution component executes multiple low-level actions. After the execution of the action is completed, the corresponding behavior is marked as finished.
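A minimal sketch of such a behavior, assuming that every precondition must hold and that dynamic parameters are modeled as functions of the world state. The class layout, field names, and the example behaviors are illustrative assumptions, not the original data structure.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union

# A parameter is either a fixed value or a function of the world state (assumption).
Parameter = Union[Any, Callable[[Dict[str, Any]], Any]]


@dataclass
class Behavior:
    name: str
    action: str
    preconditions: Dict[str, Any] = field(default_factory=dict)
    predecessors: List["Behavior"] = field(default_factory=list)
    parameters: Dict[str, Parameter] = field(default_factory=dict)
    finished: bool = False

    def is_executable(self, sensors: Dict[str, Any]) -> bool:
        """All preconditions match and all predecessors have finished."""
        conditions_met = all(
            sensors.get(sensor) == wanted
            for sensor, wanted in self.preconditions.items()
        )
        return conditions_met and all(p.finished for p in self.predecessors)

    def resolve_parameters(self, world_state: Dict[str, Any]) -> Dict[str, Any]:
        """Static values pass through; dynamic ones are read from the world state."""
        return {
            key: value(world_state) if callable(value) else value
            for key, value in self.parameters.items()
        }


# Hypothetical behaviors: a static goal ("kitchen") and a dynamic goal (partner position).
greet = Behavior("greet", action="say", preconditions={"person_visible": True})
go_kitchen = Behavior("go_kitchen", action="navigate",
                      predecessors=[greet], parameters={"goal": "kitchen"})
follow = Behavior("follow", action="navigate",
                  parameters={"goal": lambda ws: ws["partner_position"]})
```

With `person_visible` set, `greet` is executable right away, whereas `go_kitchen` only becomes executable once `greet.finished` is set to `True`.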

4.1.4 Behavior Selection

After semantic sensors are updated, multiple behaviors from different IPs might be executable. Our system does not allow concurrent execution of behaviors. That way, users merely need to focus on one behavior, which improves comprehensibility and failure detection. Therefore, a selection process has to determine the subsequent behavior to execute. This task of guiding the behavior execution is carried out by a higher-level process. This process determines which IP is active (and therefore preferred), and which behavior within the active IP should be executed.

The selection process starts with an update of the current world state. Based on the world state, the affected semantic sensors are updated. Preconditions of behaviors are subscribed to their corresponding semantic sensors. Therefore, an update of semantic sensors also leads to an update of preconditions. If a precondition is fulfilled by the update, the status of the corresponding behavior changes to executable.

After all semantic sensors are updated and all executable behaviors are determined, the selection process decides which behavior should be executed. In the first step, the process selects an IP from which a behavior should be executed. In general, the IP that is currently active and includes executable behaviors is preferred. The sole exception to this is the priority of IPs: a higher prioritized IP is preferred to an active IP. This is important to favor time-sensitive tasks. When an IP is selected, the behavior within the protocol needs to be determined. If the protocol is not yet active, only the entry behavior can be executed. Otherwise, multiple behaviors may be executable. In that case, behaviors whose predecessors were executed last are prioritized. Otherwise, behaviors are selected based on their position within the IP, i.e., behaviors that were defined first are also selected first.
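The selection rules described above can be sketched as follows. This is one possible reading, not the original code: it assumes the entry behavior is the first behavior in an IP's list, that priorities are integers where larger means more important, and that the active IP wins only among IPs of equal priority. All class, field, and behavior names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Behavior:
    name: str
    executable: bool = False  # set by the precondition update
    predecessors: List[str] = field(default_factory=list)


@dataclass
class InteractionProtocol:
    name: str
    priority: int
    behaviors: List[Behavior]  # definition order; first one is the entry (assumption)
    active: bool = False
    last_executed: Optional[str] = None

    def candidates(self) -> List[Behavior]:
        """Executable behaviors; an inactive IP only offers its entry behavior."""
        pool = self.behaviors if self.active else self.behaviors[:1]
        return [b for b in pool if b.executable]


def select_behavior(protocols: List[InteractionProtocol]) -> Optional[Behavior]:
    usable = [ip for ip in protocols if ip.candidates()]
    if not usable:
        return None
    # Highest priority wins; among equal priorities the active IP is preferred.
    ip = max(usable, key=lambda p: (p.priority, p.active))
    options = ip.candidates()
    # Prefer a behavior whose predecessor was executed last, else definition order.
    for behavior in options:
        if ip.last_executed in behavior.predecessors:
            return behavior
    return options[0]


teach = InteractionProtocol(
    "teach_region", priority=1, active=True, last_executed="follow",
    behaviors=[
        Behavior("follow"),
        Behavior("ask_label", executable=True),
        Behavior("learn_region", executable=True, predecessors=["follow"]),
    ],
)
alarm = InteractionProtocol("low_battery", priority=5, behaviors=[Behavior("warn")])
chosen = select_behavior([teach, alarm])
```

Here `chosen` is `learn_region`, because its predecessor ran last. As soon as the entry behavior of the higher-priority `low_battery` protocol becomes executable, that protocol takes over.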

4.2 Architecture Visualization

Along with the behavior architecture, we also developed a graphical user interface to convey each part of the architecture. The goal was to convey the inner workings of the robot as accurately as possible, while still being abstract enough to be understandable. To investigate our first hypothesis (cf. Section 3), the user interface was developed in two versions. The first version only displayed the static information of the architecture, i.e. the components and structures as described above. Based on this basic visualization, we developed an enhanced version, which also displays the dynamic process information.

Figure 2: Architecture Visualization Overview. The sensors are displayed at the top. In the bottom left, the interaction protocols and behaviors are displayed. To the right, details of behaviors are displayed. In addition, new behaviors are defined there.
Figure 3: A close-up view of a behavior. When clicking on a behavior, this behavior is zoomed in and details are displayed on the right side.

Our interface was developed as a web application in JavaScript (flanagan2006javascript) and is divided into two parts (cf. Figure 2): a visualization of the semantic sensors and their current values, and a visualization of the Behavior Guidance component. All visualization parts were integrated in the web interface of the simulation (cf. Section 4.4.3). In the following, we will discuss each part in more detail.

4.2.1 Sensor Visualization

For a successful human-robot interaction, the user must be able to comprehend and track the perception of the robot. Our behavior system extracts semantic sensor values that abstract from the low-level perception of the robot (cf. Section 4.1.2). Semantic sensors are displayed side by side at the top of our user interface (cf. Figure 2). The basic visualization only displays a representative icon for each sensor. The enhanced version also displays the current value of the sensor.

We used icons to represent each sensor to facilitate fast perception and understanding. For icons to represent the underlying semantic sensor concisely, the selection has to be made with considerable care. Our strategy was to select icons similar to those of known interfaces. For example, the recognized speech command could be represented by sound waves. This icon is potentially known from speech recognition systems, such as Google Assistant [accessed: 2021-07-05]. To further improve the understanding of the meaning of each icon, we added descriptive tooltip boxes. The tooltip of an icon was shown above it when the mouse hovered over the sensor icon.

4.2.2 Behavior Visualization

Our Behavior Guidance visualization was displayed at the center of our interface. We focused on displaying the important parts of the system as simply as possible without simplifying or abstracting the inner workings too much. To represent the IP structure with its behaviors, we used a network representation based on the cytoscape.js library (franz2016cytoscape). Each IP is drawn as a rectangular node that includes its behaviors as children. IPs are vertically aligned, and the names are drawn above the node. The behaviors contained in an IP are aligned horizontally within the node. To display the behaviors in a legible way, we calculated an even distribution of the graphical positions of the behavior nodes within the IP node with the Reingold-Tilford algorithm (reingold1981tidier).

The behaviors are likewise drawn as rectangular nodes (cf. Figure 3). Each node includes information about the preconditions that have to be fulfilled, as well as the corresponding action that is triggered by the behavior. To distinguish immediately between the various behaviors, a description of the action that is triggered by the behavior is displayed as the title. The preconditions are visualized with the corresponding semantic sensor icons. Clicking on a behavior node displays more detailed information about the behavior on the right side. In addition, entry behaviors have big double right arrows on the left side, indicating the entry point to the corresponding IP. In contrast, exit behaviors have the arrows on the right side, indicating the exit point of the IP. To express the predecessors of a behavior, arrows are drawn from the behavior node towards the predecessor node.

The enhanced visualization adds coloring and highlighting as mechanisms to indicate the processes of the architecture. For the preconditions, we used colored icons to indicate the satisfaction status. Behaviors of an inactive IP have black precondition icons, indicating that the corresponding preconditions are not updated. In an active IP, a fulfilled condition is displayed in green and an unfulfilled condition in red. A predecessor arrow is colored red while the predecessor has not been executed and green afterwards.

To highlight the selection process, we used border and background colors, varying sizes, and zooming in on nodes. Nodes of inactive IPs and behaviors have blue borders and light gray backgrounds. Active IPs and executable behaviors have green borders. When a behavior is executed, the background color of the corresponding node turns green and the node is zoomed in. After the execution of a behavior is finished, the background and border color turn dark gray to indicate that this behavior has already been executed in the current interaction. Only after the respective IP has finished does the behavior border color change back to blue and the background color to light gray.
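The coloring rules of the enhanced visualization amount to a small lookup from state to style. The sketch below is written in Python for illustration only (the interface itself was a JavaScript web application). The status names and the style dictionary layout are assumptions, as are the two combinations the text leaves open: the background of an executable node and the border of an executing node.

```python
from typing import Dict, Union

Style = Dict[str, Union[str, bool]]


def precondition_icon_color(ip_active: bool, fulfilled: bool) -> str:
    """Black while the IP is inactive, otherwise green or red by fulfillment."""
    if not ip_active:
        return "black"
    return "green" if fulfilled else "red"


def predecessor_arrow_color(predecessor_executed: bool) -> str:
    """Red until the predecessor has been executed, green afterwards."""
    return "green" if predecessor_executed else "red"


# Hypothetical status names; colors follow the description above.
BEHAVIOR_NODE_STYLES: Dict[str, Style] = {
    "inactive": {"border": "blue", "background": "lightgray", "zoom": False},
    # Background of an executable node is assumed to stay light gray.
    "executable": {"border": "green", "background": "lightgray", "zoom": False},
    # Border of an executing node is assumed to stay green.
    "executing": {"border": "green", "background": "green", "zoom": True},
    "finished": {"border": "darkgray", "background": "darkgray", "zoom": False},
}


def behavior_node_style(status: str) -> Style:
    """Look up the node style for a behavior status."""
    return BEHAVIOR_NODE_STYLES[status]
```

Once an IP has finished, its behaviors would simply return to the `inactive` status and therefore to the blue and light gray style.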

4.3 Visual Programming

Our second approach to improve the users’ knowledge about the architecture of the robot is the active definition of the interaction protocol. Through the process of visual programming, the users need to reflect on their own role in the interaction, as well as think through the robot’s role. In this way, they develop a deeper understanding of the robot’s architecture.

For this, we developed an editor interface to add new IPs and define the behaviors they contain, together with their preconditions and actions. This editor interface is structured like a form to be filled out. The most important part of this is the behavior definition (cf. Figure 4). Checkboxes define whether the behavior is an entry or exit behavior. The predecessor can be selected by entering the ID of the preceding behavior. The preconditions are defined by selecting the corresponding sensors and their expected values in drop-down menus. Similarly, the action with its parameters is added.

To guide participants of the study through the process of defining an IP, we developed an interactive tutorial. In the tutorial, participants had to read text boxes with information and had to interact with the editor interface. In this way, we were able to introduce the interface, as well as ensure that all participants defined the same IP for the following interaction. Each step of the tutorial was only shown once. The information presented in the tutorial only described what was shown and which value an input field needed. Therefore, no additional architectural knowledge about the robot was communicated.

The tutorial steps to define the first behavior were highly detailed, while each of the other behaviors was covered by a single tutorial step describing what the behavior should look like. In that way, we wanted the participants not only to follow each step, but also to reflect on the final interaction protocol.

Figure 4: The behavior editor interface with a tutorial step. The tutorial step describes how the behavior should be defined. In the interface, all parts of a behavior can be defined.

4.4 Simulation

To evaluate our developed architecture and investigate the hypotheses (cf. Section 3), we designed an online simulation to conduct a human-robot interaction user study. The simulation was developed with Gazebo (koenig2004design) and ROS. For users to interact with the simulation remotely, we implemented a web visualization using gzweb [accessed: 2021-07-05]. For easy deployment, the simulation was built as a Docker (merkel2014docker) image and ran on university servers.

4.4.1 Server infrastructure and communication with simulation

Our infrastructure used one main server connected to the internet and multiple backend servers, each running one instance of the simulation. The main server was configured as a reverse proxy to distribute incoming connections to the backend servers. To ensure that each simulation instance was only used by one client, we limited incoming connections to each backend server to one. In addition, we configured the reverse proxy to always connect each client to the same backend server. Besides the reverse proxy, the main server also ran a web server to provide the resources of the study website.

4.4.2 Scenario

The simulation and its web interface were developed for a region learning scenario. The goal of the human-robot interaction was to teach the robot about three regions in an apartment. The region learning was realized with a scikit [accessed: 2021-07-05] implementation of the k-nearest neighbor algorithm. Therefore, for learning a region the x-y coordinate had to be stored together with the region name as the label. The taught regions and the resulting segmentation of the apartment by the robot were visualized with a color texture on the floor of the apartment. An example of what a classified apartment could look like is shown in Figure 5.

Figure 5: Example of an apartment with two regions taught to the robot. The classification of the robot is indicated on the floor. The yellow region is the kitchen, the purple region is the entrance and the white areas are unclassified.
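To make the region learning step concrete, the following is a minimal pure-Python sketch of k-nearest-neighbor classification over stored x-y coordinates. It is a stand-in written for illustration, not the scikit implementation used in the study; the class name `RegionLearner`, the choice of k=1, and the coordinates are assumptions, and the handling of unclassified (white) areas shown in Figure 5 is not modeled.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

Sample = Tuple[float, float, str]  # (x, y, region label)


class RegionLearner:
    """Stores labeled positions and classifies new ones by nearest neighbors."""

    def __init__(self, k: int = 1) -> None:
        self.k = k  # number of neighbors; k=1 is an assumption of this sketch
        self.samples: List[Sample] = []

    def learn(self, x: float, y: float, label: str) -> None:
        """Store the current x-y position together with the region name."""
        self.samples.append((x, y, label))

    def classify(self, x: float, y: float) -> Optional[str]:
        """Return the majority label among the k closest stored positions."""
        if not self.samples:
            return None
        nearest = sorted(
            self.samples,
            key=lambda s: math.hypot(s[0] - x, s[1] - y),
        )[: self.k]
        votes = Counter(label for _, _, label in nearest)
        return votes.most_common(1)[0][0]


learner = RegionLearner(k=1)
learner.learn(1.0, 1.0, "kitchen")    # hypothetical coordinates
learner.learn(6.0, 2.0, "entrance")
near_kitchen = learner.classify(1.5, 0.5)
near_entrance = learner.classify(5.0, 2.5)
```

Evaluating `classify` on a grid of floor positions would yield a segmentation of the kind that was rendered as a color texture in the simulation.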

To trigger the robot to learn a new region, the user had to move the avatar, representing an interacting user, in sight of the robot. Afterward, a specific command, together with the label of the region, had to be provided. While in a real interaction this command would be given verbally and, thus, have to be recognized by an ASR component, in our online version we used written text that the participant had to specify via a keyboard. This command triggered the robot to follow the avatar. The user then had to move the avatar around the apartment to guide the robot to the region to learn. When the avatar and the robot arrived at the location to learn, the user had to give the command that they arrived. This triggered the robot to learn the current x-y location of the robot, together with the label as class. Afterward, the user had to repeat the interaction for the other regions.

For this scenario, we developed several sensors and actuators for the robot. To allow the user to interact with the robot, we equipped the robot with the ability to recognize the avatar of the user and to interpret natural language. To simulate how person recognition would work in the real world, the robot was only able to recognize the avatar within a certain radius of itself. For this, we calculated the Euclidean distance between the current position of the robot and the avatar. This distance determined whether the avatar was in sight of the robot. This approximated the real behavior of the person recognition component, which requires users to be well visible and close enough. For the natural language understanding, we used the Snips NLU (coucke2018snips) project. We trained the NLU module to understand more commands than those necessary to achieve the interaction goal. The module also supported the recognition of slight variations of the intended commands. This allowed us to differentiate later whether participants of the study only applied a process of elimination or understood how to interact with the robot. To interact with the environment, the robot was able to answer the user verbally via a chat box. Moreover, the robot was able to navigate through the apartment and follow the avatar.
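The simulated person recognition reduces to a distance threshold. A minimal sketch, assuming 2D positions in the floor plane and a hypothetical `radius` parameter (the actual value used in the study is not stated):

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def avatar_in_sight(robot_xy: Point, avatar_xy: Point, radius: float) -> bool:
    """True if the avatar lies within `radius` of the robot (Euclidean distance)."""
    distance = math.hypot(avatar_xy[0] - robot_xy[0], avatar_xy[1] - robot_xy[1])
    return distance <= radius


inside = avatar_in_sight((0.0, 0.0), (3.0, 4.0), radius=5.0)   # distance is 5.0
outside = avatar_in_sight((0.0, 0.0), (3.0, 4.0), radius=4.9)
```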

4.4.3 Web-Interface

The Web-Interface to interact with the simulation consisted of five parts (cf. Figure 6). The graphical interface of the Gazebo simulation was located in the top left corner. On the right side was the Chat-Box interface. The visual components of the control architecture were displayed below the simulation and Chat-Box interface.

Figure 6: The Simulation Web-Interface. The simulation visualization is displayed in the top left. On the right side, the chat-box interface to communicate with the robot is displayed. Below, the FAMILIAR-architecture visualization is located.

The visualization of the Gazebo simulation was realized with gzweb. The camera view was fixed at one location and was oriented towards the apartment. Therefore, the user was able to see the avatar, the robot, and the whole apartment. The height of the walls of the apartment and the height of the camera were adjusted to prevent the users’ view from being blocked. To reduce the amount of resources that needed to be loaded, and to reduce the calculation time of the simulation, we excluded any furnishings.

The user was able to move the avatar by clicking in the simulation window. The 2D position of the click event was then projected onto the floor plane of the simulation, and the avatar’s position was updated with the new position. Therefore, the avatar did not move continuously towards the new location, but appeared there directly. Because this mechanism sometimes resulted in the avatar falling over, the physics calculations of the avatar were disabled.

To prevent the avatar from appearing outside the apartment, in a wall, or too close to the robot, the position of the click event was checked to be free of obstacles and within the borders of the apartment. If the user clicked on an unreachable position, a pop-up window appeared, which informed the user that this location was unreachable. (The starting positions of the robot and the avatar were so far apart that the avatar was not in sight of the robot.)
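The click handling described above can be sketched in two steps: intersect the camera ray through the clicked pixel with the floor plane, then validate the resulting position. The sketch assumes the floor is the plane z=0, that the ray origin and direction are already known from the camera, and that the apartment and walls can be approximated by axis-aligned rectangles. All names and values are hypothetical and do not reflect the gzweb implementation.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Vec3 = Tuple[float, float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def project_to_floor(origin: Vec3, direction: Vec3) -> Optional[Point]:
    """Intersect a camera ray with the floor plane z=0."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if oz <= 0 or dz >= 0:
        # Camera is not above the floor, or the ray does not point downward.
        return None
    t = -oz / dz  # ray parameter at which z becomes 0
    return (ox + t * dx, oy + t * dy)


def inside(point: Point, rect: Rect) -> bool:
    return rect[0] <= point[0] <= rect[2] and rect[1] <= point[1] <= rect[3]


def is_reachable(target: Point, apartment: Rect, walls: Sequence[Rect],
                 robot_xy: Point, min_robot_dist: float) -> bool:
    """Reject targets outside the apartment, in a wall, or too close to the robot."""
    if not inside(target, apartment):
        return False
    if any(inside(target, wall) for wall in walls):
        return False
    gap = math.hypot(target[0] - robot_xy[0], target[1] - robot_xy[1])
    return gap >= min_robot_dist


apartment = (-1.0, -1.0, 8.0, 8.0)
walls = [(4.0, -1.0, 4.5, 8.0)]
hit = project_to_floor((0.0, 0.0, 10.0), (1.0, 0.0, -2.0))
ok = is_reachable(hit, apartment, walls, robot_xy=(0.0, 0.0), min_robot_dist=1.0)
```

A target for which `is_reachable` returns False would correspond to the case in which the pop-up window was shown instead of moving the avatar.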


To verbally interact with the robot, we developed a chat-box window. Because we aimed at easy accessibility, the design was inspired by common messengers, such as [accessed: 2021-07-05] or Facebook [accessed: 2021-07-05]. The main window displayed the communication history, while the bottom window consisted of elements to write new messages.

The messages of the user were displayed in speech bubbles on the right side. Because the robot could only understand certain commands, the color and content of these speech bubbles depended on the content of the message. If a message could not be interpreted by the robot, the speech bubble was colored red and the message was displayed in it. If the message could be interpreted by the robot, the speech bubble was colored green. In addition, the content of the message, as well as the intent classified by the robot, was displayed in the speech bubble. The answers of the robot were displayed in a white speech bubble on the left side. In addition to the verbal communication of the user and the robot, the chat-box window also displayed the (de-)activation events of the robot control architecture.

The bottom window, to write new messages, consisted of three elements. At the center of this window, the message could be entered in an input field. This message could be sent by clicking on the send-button on the right side. In addition, an information-button was displayed on the left side. Clicking this button displayed a pop-up window, presenting the intents the robot was able to understand, as well as example sentences for each intent.

4.5 User Study

The conducted user study was carried out online using the online platform [accessed: 2021-07-05]. Based on our two complementary approaches, we employed a 2-by-2 between-subjects study design, with the two dimensions being participating in a visual programming tutorial before the interaction (VP, no VP) and seeing a dynamic visualization during the interaction (DV, no DV). Therefore, participants in the Baseline condition were neither shown the process visualization, nor did participants complete the tutorial. Participants in the DV (Dynamic Visualization) condition were only shown the process visualization, while participants in the VP (Visual Programming) condition only completed the tutorial. Participants in the VP+DV condition completed the tutorial and were shown the process visualization.

Acquired participants were forwarded to our study website, where they were assigned to one of the four conditions. The study procedure can be seen in Figure 7. First, all participants were briefly informed about the structure of the study and the estimated time scope. Afterward, their gender, age, and Prolific ID were gathered.

To measure the technical affinity of participants, we administered the ATI (ati) questionnaire. Moreover, we measured participants' memory performance with a word memory test (green1996word). In this test, each participant was shown thirteen word items for ten seconds. They were asked to memorize as many word items as they could. Afterward, participants were asked to recall and list the memorized word items.

Participants were then shown an instruction video, introducing the simulation and the robot control architecture (cf. Section 4.5.1). Depending on the condition, participants were forwarded to one of two stages. The Baseline and DV conditions were shown another video, explaining at the relevance level, how to achieve the interaction goal. In contrast, the VP and VP+DV conditions followed the tutorial to program the behaviors.

In the next phase of the study, all participants interacted with the robot, trying to accomplish the interaction goal (cf. Section 4.4.2). To have the same amount of time for each participant, we limited the interaction with the robot to thirty minutes. After the interaction, participants were forwarded to the knowledge questionnaires. Participants who did not achieve the interaction goal within the time limit were asked why they could not achieve the goal before they were also forwarded to the knowledge questionnaires.

To obtain subjective ratings of the system, after the knowledge questionnaires, we collected parts of the Godspeed questionnaire (GOD) and the System-Usability-Scale (SUS) (SUS). We limited the Godspeed questionnaire to the scales of Anthropomorphism, Likability and Perceived Intelligence. These scales were most relevant to rate the technical system. To end the study, participants were thanked for their participation and forwarded to Prolific for their payment.

Figure 7: Study schedule. First, participants were briefed and had to answer preceding questionnaires. Afterwards, the architecture instruction video was shown. Depending on the condition, participants completed the tutorial or watched an interaction instruction video. In the main part, participants interacted with the robot to achieve a joint goal. At the end, various questionnaires were collected.

4.5.1 Introduction Video

Based on our hypotheses (cf. Section 3), the goal of the introduction video was to inform participants about the architecture of the robot as well as the simulation. The total length of the video was 8:25 minutes. The video was designed as a series of questions and answers together with corresponding video material. The questions were asked by a novice. Thus, questions were asked by someone who is not aware of how the simulation or the robot works. The answers were given by an expert who introduces all important aspects. In that way, we explained all important parts naturally in a conversation instead of an enumeration of facts. The questions and answers were spoken as well as transcribed and displayed. All important aspects of the architecture were introduced together with their visualization in the web interface. The simulation interface is shown in the center of the video. At the bottom of the video, icons for the novice and the expert, as well as the transcription of the spoken text, are displayed.

4.5.2 Interaction Briefing

While the VP and VP+DV conditions went through the tutorial steps, the other conditions were shown an interaction video. This video showed the resulting interaction protocol and described which steps were to follow to teach the robot. This way, participants of all conditions were exposed to the way in which they could interact with the robot, before interacting with it. Thus, the tutorial only influenced how intensively participants interacted with the low-level parts of the architecture.

4.5.3 Measures

Architectural Knowledge

To measure the resulting architectural knowledge of each participant, we asked several multiple-choice questions about the robot’s functionality. Each question consisted of three statements, of which only one was correct. The question categories for the structural knowledge were Sensor, IP, Behavior, Precondition, Action, and Predecessor. The process knowledge was asked in the Process category. The knowledge needed to answer all questions was provided in the introduction video (cf. Section 4.5.1). Therefore, all conditions were given the same foundation to answer these questions. Our approaches of dynamic visualization and visual programming only helped consolidate the knowledge. The results of this measurement were used to investigate whether the dynamic visualization (H.1) and the visual programming (H.2) improved the architectural knowledge. Additionally, they were used to analyze whether an increase in users’ knowledge improves the interaction (H.3).

Interaction Success

To further analyze the interaction of the participants, we calculated several key figures from the log data of the simulation interaction. To get an overall impression of how successful participants were in each condition, we divided participants into two groups based on whether they achieved the interaction goal: interactions of participants who taught all three regions to the robot within the time limit were classified as successful; all other interactions were classified as unsuccessful. This grouping was used together with the architectural knowledge to answer hypothesis 3.

Interaction Failures

We also evaluated mistakes of the user during the interaction. For this, we counted the number of wrong commands provided by the user. We classified a command as wrong if it was not needed to reach the interaction goal. Therefore, a command could be wrong even though the robot was able to interpret it. Moreover, we counted the number of times users moved the avatar out of sight of the robot. This count indicates how aware participants were of the perception limitations of the robot. These measurements were used to further investigate the influence of knowledge on the interaction success.
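The three log-based measures (interaction success, wrong commands, and avatar losses) can be sketched as a single pass over the interaction log. The log format below, a list of `(event type, payload)` tuples, as well as the event and intent names, are hypothetical; the actual log structure of the simulation is not described here. The sketch also assumes that the log ends at the time limit.

```python
from typing import Dict, Iterable, Set, Tuple, Union

Event = Tuple[str, str]  # (event type, payload); hypothetical log format


def interaction_measures(events: Iterable[Event], needed_intents: Set[str],
                         regions: Set[str]) -> Dict[str, Union[bool, int]]:
    """Compute success, wrong-command count, and avatar-loss count from a log."""
    taught: Set[str] = set()
    wrong_commands = 0
    avatar_lost = 0
    for kind, payload in events:
        if kind == "command" and payload not in needed_intents:
            # Wrong even if the robot could interpret it: not needed for the goal.
            wrong_commands += 1
        elif kind == "avatar_lost":
            avatar_lost += 1
        elif kind == "region_learned":
            taught.add(payload)
    return {
        "successful": regions <= taught,  # all regions taught within the log
        "wrong_commands": wrong_commands,
        "avatar_lost": avatar_lost,
    }


log = [
    ("command", "follow"), ("avatar_lost", ""), ("command", "greet"),
    ("command", "arrived"), ("region_learned", "kitchen"),
]
measures = interaction_measures(
    log, needed_intents={"follow", "arrived"},
    regions={"kitchen", "entrance", "bedroom"},
)
```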

Godspeed, SUS, and ATI

As mentioned above, we also collected data from the Godspeed questionnaire (Anthropomorphism, Likability, Perceived Intelligence) and the System-Usability-Scale. In contrast to the objective measurements above, the results from these questionnaires were used to gain insights into the subjective rating of the interaction.

5 Results

We conducted our study with 85 participants, of which 4 were excluded due to technical problems. The remaining 81 participants were randomly assigned to the four conditions (Baseline: n=20, DV: n=21, VP: n=20, DV+VP: n=20). The average age of participants was 35 (SD=9.53) years. Overall, 30 female, 50 male and 1 diverse person participated. A Kruskal-Wallis test (Kruskal) showed no significant difference between the conditions regarding age (H(3)=3.9091, p=0.2714), gender (H(3)=0.9606, p=0.8108), ATI (H(3)=0.8627, p=0.8344) or word recall (H(3)=0.5732, p=0.9025).
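As a worked check of these values, the Kruskal-Wallis statistic for four conditions is usually compared against a chi-square distribution with k − 1 = 3 degrees of freedom. The sketch below assumes this standard approximation (the exact procedure of the statistics software used is not stated) and uses the closed-form upper-tail probability for 3 degrees of freedom, so that only the standard library is needed. The function name `chi2_sf_df3` is an illustrative choice.

```python
import math


def chi2_sf_df3(h: float) -> float:
    """Upper-tail probability P(X > h) for a chi-square variable with 3 dof.

    Closed form: erfc(sqrt(h/2)) + sqrt(2h/pi) * exp(-h/2).
    """
    return math.erfc(math.sqrt(h / 2.0)) + math.sqrt(2.0 * h / math.pi) * math.exp(-h / 2.0)


# H statistics reported for the comparison of the four conditions
reported = {"age": 3.9091, "gender": 0.9606, "ATI": 0.8627, "word recall": 0.5732}
p_values = {name: chi2_sf_df3(h) for name, h in reported.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Under this assumption, the computed probabilities should agree with the reported p-values up to rounding.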

For a first overview, we analyzed the architectural knowledge between the conditions. The data of each knowledge category was checked for normality via a Shapiro-Wilk (SHAPIRO1965) test. In the case of normally distributed data, we applied a one-way ANOVA. Otherwise, we used a Kruskal-Wallis test. When the one-way ANOVA indicated significance, we conducted a Tukey-HSD (tukeyHSD) post-hoc test. In the case of the Kruskal-Wallis test, we used a follow-up post-hoc Dunn (dunn) test. The results can be seen in Table 1 and Figure 8. For the overall knowledge, the only pairwise difference was found between the Baseline (M=39.29, SD=10.16) and VP+DV (M=52.44, SD=19.5) conditions, and it reached only the level of a tendency (p=0.0582). Additionally, the Behavior knowledge differed between the Baseline (M=35.0, SD=27.84) and the VP (M=70.0, SD=29.15) (p=0.0011) and VP+DV (M=57.5, SD=36.31) (p=0.0349) conditions, and between the DV (M=45.24, SD=30.49) and VP (p=0.0190) conditions.

Category   Test                   Condition 1   Condition 2   p-value
Behavior   Kruskal-Wallis                                     0.0071
Behavior   Posthoc Dunn           Baseline      VP            0.0011
Behavior   Posthoc Dunn           Baseline      VP+DV         0.0349
Behavior   Posthoc Dunn           DV            VP            0.0190
All        1-Way ANOVA                                        0.0389
All        Posthoc Tukey-HSD      Baseline      VP+DV         0.0582
Table 1: Statistical tests for architectural knowledge of conditions.

5.1 H1: Process Visualization improves process knowledge

Our first hypothesis stated that the visualization of the processes of the architecture would increase the process knowledge about this architecture. To gain insights into this hypothesis, we analyzed the knowledge questionnaire scores. We tested between each pair of conditions and also with the DV and DV+VP conditions pooled together. No significant differences between the conditions could be found. Therefore, our first hypothesis could not be confirmed.

Test Statistic p-value
Kruskal-Wallis 0.0397
Condition 1 Condition 2 p-value
DV 0.0491
Baseline VP+DV
DV VP 0.0225
VP+DV 0.0145
Table 2: Statistical tests for the amount of visualization usages. P-values < 0.1 are indicated by italic font, P-values < 0.05 are indicated in bold font.

5.2 H2: Visual programming improves structural knowledge about the robot

In contrast to our process visualization, we hypothesized that visually programming the interaction would improve the structural knowledge about the robot, which corresponds to the knowledge categories Sensor, IP, Behavior, Precondition, Action, and Predecessor. To evaluate this, we compared the architectural knowledge of the visual programming (VP, VP+DV) conditions with the conditions without visual programming (Baseline, DV). A Shapiro-Wilk normality check with a follow-up T-test showed that participants in the visual programming conditions had significantly more knowledge about the overall architecture (statistic=-2.8264, p=0.0060). Additionally, a Mann-Whitney-U test showed that knowledge about the Behavior concept (statistic=520.50, p=0.0018) was also increased. Moreover, there was a tendency towards more knowledge of the Precondition concept in the VP condition (cf. Table 3 and Figure 8). Even though the visual programming did not improve knowledge about all concepts, we conclude that our second hypothesis can be partly confirmed.

Category (test)   p-value (no VP vs. VP)
Sensor (MWU)      0.5772
IP (MWU)          0.5505
Behavior (MWU)    0.0018
Prec. (MWU)       0.0894
Action (MWU)      0.1996
Pred. (MWU)       0.1977
Process (MWU)     0.5039
All (T)           0.0060
Table 3: Statistical analysis of the architectural knowledge for groupings of no VP vs. VP. Tests were either Mann-Whitney-U (MWU) or T-Test (T) depending on the results from the Shapiro-Wilk normality check. P-Values < 0.1 are indicated by italic font, P-Values < 0.05 are indicated in bold font.
Figure 8: Architecture Knowledge for each condition (left); and groups successful and unsuccessful (right)

5.3 H3: Architecture knowledge improves human-robot interaction

Our third hypothesis stated that knowledge about the robot control architecture would improve the human-robot interaction. Our first measure concerned the number of successful interactions per group. As can be seen in Figure 9, 80.95% of the participants in the DV condition achieved the interaction goal, while only 65% of the Baseline and 60% of the VP and VP+DV conditions achieved the goal. To further investigate the influence of architectural knowledge on the interaction, we compared the successful and unsuccessful interaction groups. We applied a Shapiro-Wilk normality check, followed by a Mann-Whitney-U test for non-normally distributed data and a T-Test otherwise. The results can be seen in Table 4 and Figure 8. Overall, the results show that participants who were able to achieve the interaction goal had more knowledge about the architecture of the robot. A closer look at each category reveals that knowledge about Behavior, Precondition and Process was higher.

Figure 9: (left) Amount of successful interactions per condition; (right) Amount of visualization usages per condition.

Another key measure was the number of wrong commands given. To compare participants who used a high number of wrong commands with those who used fewer wrong commands, we calculated a median split for the number of wrong commands. Again, we compared the answer score of each category and the overall score with a Mann-Whitney-U or T-Test. Results can be seen in Table 4. They showed a tendency for the overall knowledge (statistic=1.9199, p=0.0585), the IP (statistic=677.0, p=0.0986) and the Behavior (statistic=988.50, p=0.0789) categories. This indicates that participants with a low number of wrong commands had more knowledge about the IP and Behavior categories, and more overall architecture knowledge. Moreover, the Process (statistic=1068.0, p=0.0116) knowledge was significantly higher for those with a lower number of wrong commands.
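A median split of this kind can be sketched as follows. How values equal to the median were assigned is not stated; placing them in the lower group is an assumption of this sketch, and the example counts are made up for illustration.

```python
from statistics import median
from typing import List, Sequence, Tuple


def median_split(values: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Return participant indices with values at or below vs. above the median."""
    m = median(values)
    low = [i for i, v in enumerate(values) if v <= m]   # ties go to the low group
    high = [i for i, v in enumerate(values) if v > m]
    return low, high


wrong_commands = [2, 7, 4, 9, 1, 4]  # hypothetical counts per participant
low_group, high_group = median_split(wrong_commands)
```

The knowledge scores of the two resulting index groups would then be compared with a Mann-Whitney-U or T-Test, as described above.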

We also investigated how often the participants moved the avatar out of sight of the robot. For this, we compared the visualization conditions with the conditions without a visualization. We only analyzed the successful interactions, because some participants who did not achieve the interaction goal never moved the avatar. Thus, they also never actively moved the avatar out of sight of the robot. A Mann-Whitney-U test showed a tendency towards fewer avatar losses in the visualization conditions (M=6.72, SD=4.30) than in the conditions without visualization (M=9.16, SD=4.95) (statistic=468.0, p=0.0676).

Based on these results, we conclude that our third hypothesis can be confirmed. We could further observe that certain concepts of our architecture were of particular importance for an improved interaction. This also explains why the visual programming improved the knowledge about the architecture but could not improve the interaction itself.

Category (test)   Successful Interaction (p-value)   Wrong Commands (p-value)
Sensor (MWU)      0.6565                             0.5599
IP (MWU)          0.9951                             0.0986
Behavior (MWU)    0.0168                             0.0789
Prec. (MWU)       0.0289                             0.1024
Action (MWU)      0.7997                             0.2581
Pred. (MWU)       0.8676                             0.5098
Process (MWU)     0.0344                             0.0116
All (T)           0.0273                             0.0585
Table 4: Statistical analysis of the architectural knowledge for groupings of Successful Interaction, Wrong Commands. Tests were either Mann-Whitney-U (MWU) or T-Test (T) depending on the results from the Shapiro-Wilk normality check. P-Values < 0.1 are indicated by italic font, P-Values < 0.05 are indicated in bold font.

5.4 Subjective Ratings of the Interaction

In addition to the knowledge questionnaires, we also collected data from the Godspeed questionnaire and the SUS. While we could not find any differences for likability or perceived intelligence, differences for the anthropomorphism score could be found. In general, in social robotics it is seen as a positive result if users anthropomorphize a robot, as this indicates that the robot is seen as a social entity. However, it is unknown how this affects the user’s reasoning about a non-human entity such as a robot. To investigate what influence the participants' anthropomorphization had on the success of the interaction, we applied a median split on the anthropomorphism score. We then compared the interaction success of participants with a lower anthropomorphism score with the interaction success of participants with a higher anthropomorphism score (cf. Figure 10). A Shapiro-Wilk normality check (statistic=0.9096, p=0.0000) with a Mann-Whitney-U test (statistic=488.50, p=0.0518) revealed that participants with a lower anthropomorphism score were more likely to achieve the interaction goal (cf. Table 5).

We also compared the SUS and ATI for the anthropomorphism median split. A Mann-Whitney-U test revealed that higher anthropomorphism is related to a higher rating of the system usability (statistic=417.50, p=0.0002), as well as a higher ATI score (statistic=602.50, p=0.0449). Additionally, a Mann-Whitney-U test showed a significantly lower SUS for participants in the VP conditions (statistic=1089.50, p=0.0110).

Measure               Test             Statistic   P-Value
Interaction Success   Shapiro-Wilk     0.9096      0.0000
Interaction Success   Mann-Whitney-U   488.50      0.0158
ATI                   Shapiro-Wilk     0.9694      0.0000
ATI                   Mann-Whitney-U   602.50      0.0449
SUS                   Shapiro-Wilk     0.9603      0.0000
SUS                   Mann-Whitney-U   417.50      0.0002
Table 5: Statistical tests of anthropomorphism median split for Interaction Success, ATI and SUS. P-Values < 0.1 are indicated by italic font, P-Values < 0.05 are indicated in bold font.
Figure 10: Comparison of godspeed anthropomorphism median split for Interaction Success, ATI and SUS

5.5 Qualitative Observations

In addition to the quantitative results, we also made some qualitative observations. While 54 of the 81 participants were able to achieve the interaction goal, the participants who did not achieve it revealed some insights into their problems. One observation was that of the 27 participants who did not achieve the interaction goal within the 30-minute time limit, only 9 were able to teach at least one region. Of the 18 participants who did not teach even one region, 14 were in one of the VP conditions. This result is a consequence of the fact that participants in these conditions either took too long to complete the visual programming sequence or were not able to complete it within the time limit. Closer investigation of the log data of the visual programming sequence revealed that participants had problems specifying the Predecessor and adding Preconditions to the behavior.

Problems in the interaction with the robot itself were mostly based on providing the correct commands. Even though the VP interface provided a pop-up window to retrieve commands that can be interpreted by the robot, a number of participants did not use it. Instead, they seemed to assume a chatbot-like interaction, where the robot always answers. Another common problem was that participants correctly triggered the "Following" behavior of the robot to guide it to the region to learn, but then never gave the second command to trigger the learning of this region.

6 Discussion

The analysis of our user study revealed some interesting insights into the effects that knowledge has on interaction performance. Participants who achieved the interaction goal had significantly more knowledge about the robot’s architecture. Upon closer inspection, the knowledge about Behavior, Precondition and Process was critical for the success of the interaction. These concepts, together with the Action concept, build the core of our architecture. While knowledge about the Action concept did not differ between successful and unsuccessful participants, successful participants answered the Action questions as correctly as the Behavior and Precondition questions. Additionally, we observed not only that architectural knowledge improves the interaction, but also that it is a factor that makes the interaction possible in the first place. Of the 27 participants who could not achieve the interaction goal within the 30-minute time limit, 18 did not teach even one region. The investigation of the log data revealed that these participants had general knowledge gaps. Not only did they provide the wrong commands, but they also communicated with the robot as if it were a chatbot, something that participants who achieved the interaction goal did not do. This indicates problems with the mental model. We assume that the wrong implications of the visual interface are the reason for this misconception. Because chat-boxes are often used to communicate with other human beings or a chatbot, wrong assumptions about the functionality were created. Thus, a different interface might have worked better.

While the concept of visual programming is not new, the resulting understanding of the system by novice users has rarely been investigated (CORONADO2020100970). In this study, we used visual programming as an approach to improve knowledge about the structure of the architecture. The results show that participants of the VP conditions had increased knowledge about the architecture. In addition to the overall knowledge, the Behavior knowledge in particular was significantly improved. While our visual programming interface showed positive effects on the knowledge, not all aspects of our architecture could be conveyed. Because additional knowledge about the Precondition and Process concepts was critical for the interaction success, we could not observe an increase in the interaction performance for the VP conditions. Additionally, a negative side effect, in the form of a lower SUS rating, could be observed. Therefore, further work needs to investigate how a visual programming interface should be designed to successfully communicate all aspects of the architecture. For this, an increased focus should be put on the Precondition and Process knowledge. We expect that the abstract mechanism of the visual programming, together with the highly flexible way people act, has a major influence on the difficulty of understanding these concepts. Even though Preconditions in our architecture are more flexible than in, e.g., state machines, they still expect the environment to be in a specific state. In contrast, humans can reason about their environment and other humans on a higher level. To overcome these false assumptions, the visual programming interface should be closely related to the interaction itself. In that way, users might better understand the impact of the concepts on the interaction.

Although the dynamic visualization of our architecture did not lead to a measurable improvement of the knowledge, participants were yet more likely to achieve the interaction goal. The improved ability to reach the interaction goal indicates improvement of knowledge. Presumably, our knowledge questionnaire did not retrieve this information. The effect of an improved interaction could not be observed for the combination of the VP and the DV in the DV+VP condition. One reason might be a cognitive overload caused by the VP. Presumably, a more accessible visual programming interface together with the dynamic visualization could have improved the interaction.

Previous work showed that visualizing the behavior decision improves the understanding of what the robot does (Wortham2017). Additionally, lukas2020robots showed that the communication of the robots’ functionality improves the ability to recognize erroneous interactions in videos. In this work, we have gone a step further, by mediating in-depth knowledge about the architecture of the robot and testing this knowledge in actual human-robot interactions. In this way, we were able to get insights into the specific aspects of a robot control architecture that influence the interaction. Moreover, we enabled persons not only to detect errors in interactions, but also to overcome and reduce these errors and achieve a joint goal with the robot. While the work in (lukas2020robots) is not entirely comparable to ours, only around of participants were able to detect errors based on the way the state machine works. In contrast, we enabled of our participants to achieve an interaction goal.

Besides the influence of the communicated knowledge on the interaction, the anthropomorphization of the robot influenced the interaction. We observed that participants with less anthropomorphism were more likely to achieve the interaction goal. Similar results were also observed in (lukas2020robots). In that work, participants with a lower anthropomorphism were more likely to detect errors in a human-robot interaction. In general, the question of cause and effect arises. One explanation could be that participants interacted unsuccessfully with the robot because they anthropomorphized the robot. The other direction would be that participants anthropomorphized the robot because the interaction was unsuccessful. We believe that increased anthropomorphization of the robot negatively affected the interaction success. A reason for this might be the difference between the flexible way humans can act and the comparatively rigid behavior architecture of the robot. When humans interact with other people, the theory of mind area in the brain is activated (frith2005theory). The same effect can be observed when interacting with an anthropomorphized robot (krach2008can). Using this area of the brain might lead to a misunderstanding of how to interact with the robot. This, in turn, results in faulty inputs by the user. Therefore, in contrast to many current approaches, inducing an anthropomorphic image in users (breazeal2016social) can reduce the interaction success.

7 Conclusion and Future Work

In this work, we investigated the influence of knowledge about the robot control architecture on human-robot interactions. For this, we employed a behavior-based architecture, which aims at better understandability and transparency. Our study showed that knowledge about the architecture improves human-robot interactions. Moreover, only this knowledge makes it possible to successfully interact with the robot. Therefore, we argue that researchers and engineers should include factors of transparency and comprehensibility in their design processes. These factors can critically influence the success of interactions between users and robots. We also showed that visual programming can be used to improve the knowledge about architectural aspects of the robot. Moreover, design decisions regarding the anthropomorphization of robots can negatively influence the interaction.

While the visual programming approach improved the overall knowledge of the architecture, knowledge of some concepts could not be improved significantly. Therefore, the interaction itself did not improve. One reason might be that the tutorial was too abstract for participants to understand some aspects of the architecture. Therefore, we will further investigate this approach. One idea could be to incorporate approaches closer to the interaction. In that way, abstract aspects of the architecture could be directly observed in the interaction.


We acknowledge the support of the Honda Research Institute Europe GmbH, Carl-Legien-Strasse 30, 63073 Offenbach, Germany. Christiane B. Wiebel-Herboth is employed by the Honda Research Institute Europe GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. We also acknowledge the financial support of the German Research Foundation (DFG) and the Open Access Publication Fund of Bielefeld University for the article processing charge.