MIRROR: Differentiable Deep Social Projection for Assistive Human-Robot Communication

by Kaiqi Chen, et al.
National University of Singapore

Communication is a hallmark of intelligence. In this work, we present MIRROR, an approach to (i) quickly learn human models from human demonstrations, and (ii) use the models for subsequent communication planning in assistive shared-control settings. MIRROR is inspired by social projection theory, which hypothesizes that humans use self-models to understand others. Likewise, MIRROR leverages self-models learned using reinforcement learning to bootstrap human modeling. Experiments with simulated humans show that this approach leads to rapid learning and more robust models compared to existing behavioral cloning and state-of-the-art imitation learning methods. We also present a human-subject study using the CARLA simulator which shows that (i) MIRROR is able to scale to complex domains with high-dimensional observations and complicated world physics and (ii) provides effective assistive communication that enabled participants to drive more safely in adverse weather conditions.





I Introduction

Fig. 1: Human-Robot Communication Example. (A) A Robot Assistant needs to provide information to help a human driving the blue car in dense fog. The human has limited visibility and the assistant can highlight cars on a heads-up display or provide verbal cues (as the human is unable to see highlighted cars that are in the rear). In this scenario, a collision is imminent—the red car in front is slowing down and a yellow car is speeding up from the rear. How can the robot determine what, when, and how to inform the user? Our proposed Mirror assistant immediately highlights the red car in front and verbally tells the user about the yellow car at the rear. It chooses not to tell the human about the green car, whose location has little impact on the human’s decision. (B) Our Mirror framework is inspired by social projection; the robot reasons using a human model that is constructed from its own internal self-model. Strategically-placed learnable “implants” capture how the human is different from the robot. Mirror plans communicative actions by forward simulating possible futures, coupling its own internal model (to simulate the environment) and the human model (to simulate their actions).

Communication is an essential skill for intelligent agents; it facilitates cooperation and coordination, and enables teamwork and joint problem-solving. However, effective communication is challenging for robots; given a multitude of information that can be relayed in different ways, how should the robot decide what, when, and how to communicate?

In this paper, we focus on assistive shared-control or teleoperation scenarios. As an example, consider the scenario in Fig. 1 where a robot assistant is tasked to provide helpful information to a human driving the blue car in fog (or to explain the robot’s own driving behavior). There are other cars in the scene, which may not be visible to the human. The robot, however, has access to sensor readings that reveal the environment and surrounding cars. To prevent potential collisions, the robot needs to communicate relevant information — either using a heads-up display or verbally — that intuitively, should take into account what the human driver currently believes, what they can perceive, and the actions they may take.

Prior work on communication planning in HRI typically relies on human models, which are either handcrafted using prior knowledge (e.g., [44, 5]) or learned from collected human demonstrations (e.g., [39]). Unfortunately, handcrafted models do not easily scale to complex real-world environments with high-dimensional observations, and data-driven models typically require a large number of demonstrations to generalize well. In this work, we seek to combine prior knowledge with data in a manner that reduces both manual specification and sample complexity.

Our key insight is that learning differences from a suitable reference model is more data-efficient than learning an entire human model from scratch. We take inspiration from social projection theory [1], which suggests that humans tend to expect others to be similar to themselves, i.e., a person understands other individuals using their own self as a reference. This inductive bias can be effective when the agents are similar and may promote cooperation [24, 25]. From a cognitive perspective, social projection is a heuristic by which we evaluate and predict another agent’s behavior. Likewise, a robot can use its self-model to reason about a human; in our setup, the robot first learns how to perform the task on its own, then uses this model to reason about other agents.

A natural concern is that the human is unlikely to be similar to the robot; indeed, recent work has shown that (model-free) RL agents that are trained to respond to self-policies do not work well with actual humans [8]. The ways in which the human and robot perceive the world and make decisions are likely to differ. Here, we aim to isolate and learn these differences in a sample-efficient manner.

We argue that latent state-space models obtained via deep reinforcement learning are well-suited for this purpose — modern deep variants [28] can handle multiple high-dimensional sensory modalities, capture complex dynamics, and are useful for robot decision-making, yet are sufficiently modular to permit structural interpretability [16, 9]. We exploit this modularity and strategically place learnable implants that capture differences in perception and/or policy (Fig. 1.B). These implants can be small and relatively unstructured (e.g., neural networks) or specified using prior knowledge about the human (e.g., known cognitive biases or perceptual limitations), with associated parameters that can be quickly optimized via gradient-based learning.

We call our framework Model Implants for Rapid Reflective Other-agent Reasoning/Learning (Mirror); an allusion to the mirror neurons in human brains that are hypothesized to play a role in social projection [4, 40]. Returning to our example, our Mirror-enabled robot highlights the red car in front and verbally informs the user about the yellow car at the rear. To avoid distracting the user, it chooses not to tell the human about the green car, whose location and velocity have little impact on the human’s decision and safety.

These communicative actions are the result of planning over the implant-augmented self-model and the robot’s internal model (Fig. 4). Specifically, perceptual implants in the Mirror model were used to capture that the human had limited visibility, and could not see highlighted cars in the rear (but could be notified about them via auditory means). To plan forward, the robot samples possible future trajectories using both its self-model and the human model, which are coupled together via generated observations and predicted actions (Fig. 1.B). Unlike recent work (e.g., [39, 44, 11, 30]), planning with structured deep multi-modal self-models allows the robot to take into account longer-term behavior and choose among rich communication modalities, without having to hard-code environmental or human properties (e.g., belief dynamics).

Experiments in three simulated domains show that Mirror is able to learn human models faster and more accurately compared to behavioral cloning (BC) and a state-of-the-art imitation learning method [38]. In addition, we report on a human-subject study using the CARLA simulator [12], which reveals that Mirror provides useful assistive information, enabling participants to complete a driving task with fewer collisions in adverse visibility conditions.

In summary, this paper makes the following contributions:

  • Mirror, a sample-efficient framework for learning human models using deep self-models for initial structure;

  • A planning-based communication approach that leverages learned world dynamics and human models;

  • Findings from a human-subject study in the assistive driving domain, showing that Mirror provides useful communication that improves task performance.

We believe Mirror is a step towards better data-efficient human models for human-robot interaction; to our knowledge, Mirror is the first work to demonstrate how deep representation learning during RL can be combined with demonstrations for human modeling and planning. Mirror can be used for human-robot communication during robot tele-operation and shared-control settings, and it opens up an alternative path to human-robot collaboration with deep models. To motivate future work, we discuss current limitations and potential research avenues in the concluding section of this paper.

II Background & Related Work

Mirror builds upon the existing literature on human modeling and human-robot communication. Due to its importance, the field of agent communication is large; here, we briefly summarize closely-related work that learns and uses human models for human-robot/AI interaction and communication [45, 8, 10, 32].

Human Models for HRI. In this work, we focus on model-based methods that explicitly model human behavior. Compared to model-free approaches to HRI, model-based methods tend to make reasonable predictions with far less data [10]. Model-based methods can be “black-box” in that they make few assumptions about the human and focus on learning a policy function. For example, recent work learns a human policy via imitation learning, followed by a residual policy for shared control [42]. In contrast, Theory of Mind (ToM) models incorporate (possibly strong) assumptions about how humans perceive the world and make decisions. For example, a ToM model may assume people are rational and learn in a Bayesian manner [2, 20, 43], which is generally not true [20].

Mirror can be seen as a hybrid approach that scaffolds human model learning using the robot’s own internal model (obtained using RL). Compared to standard black-box human models, Mirror provides additional structure that can ease data requirements. Compared to handcrafted ToM approaches [5], Mirror is able to handle high-dimensional observations. Mirror is related to recent approaches that focus on capturing human traits, e.g., biases under risk and uncertainty [26] or action errors due to misunderstood environmental dynamics [37]. However, these approaches typically build on top of hand-crafted ToM models.

Assistance via Human-Robot Communication. Enabling robots to communicate with humans has had a long history; early robots in the 1990s (e.g., Polly [21] and RHINO [6]) were simple stimulus-response systems. In contrast, modern day robots leverage learning and planning to generate a variety of communication patterns, e.g., legible motion [14, 13, 7], and natural language [23, 46].

Recent work has shown that human-robot communication can improve human task performance [44, 47], explain robot errors [11] and calibrate human-robot trust [29]. However, these approaches typically use hand-specified human models and known environment models. In contrast, Mirror plans communication actions using learned models. Mirror is related to prior work on personalized assistive navigation [34], but the mechanism differs: Mirror adapts implant parameters whilst [34] uses a mixture of expert models.


Mirror is closely related to Assistive State Estimation (ASE) [39] in that both approaches augment user observations to communicate state information. However, there are crucial differences: ASE assumes known dynamics and perceptual models to compute near-optimal human policies, while Mirror uses learned dynamics and implant models. In addition to differences in the human model, ASE generates observations that minimize the KL divergence between the (predicted) user’s beliefs and the assistant’s beliefs. Instead, Mirror forward simulates possible futures using its internal models, and plans communication to maximize task rewards while minimizing communication costs.

III Model Implants for Rapid Reflective Other-agent Reasoning/Learning (Mirror)

Our problem setting is one of assistance: a (human) user is acting in a partially-observable environment to maximize rewards. The assistive robot’s goal is to help the user achieve their objective. The robot may receive different observations from the environment and can modify the user’s observations to provide additional information. We seek to derive an effective assistant. At a high level, we will imbue the robot with a structured model of the human that can be adapted with data. After learning, the robot plans using its own internal (self) model and the human model to communicate valuable information. We first detail the robot’s underlying self-model, then describe the human model (specifically the implants), and finally, how communication can be achieved using both the self and human models.

III-A Self-Model: Multi-Modal Latent State-Space Model

Fig. 2: Mirror’s self-model is a multi-modal latent state-space model (MSSM). In the above, circle nodes represent random variables and shaded nodes are observed during learning.

Model Structure. In Mirror, the robot’s self-model is a multi-modal state-space model (MSSM) [9] (Fig. 2). Intuitively, the MSSM models an agent that is sequentially taking actions in a world and receiving rewards and multi-modal observations. The observations $o^{1:M}_t$ at time $t$ for the $M$ sensory modalities are generated from the latent state $s_t$. The model assumes Markovian transitions where the next state $s_{t+1}$ is conditioned upon the current state $s_t$ and the action $a_t$ taken by the agent. Upon taking an action, the agent receives reward $r_t$. In general, the reward can also be conditioned upon the action and next state. Given the graphical structure in Fig. 2, the MSSM’s joint distribution factorizes as:

$$p(o^{1:M}_{1:T}, s_{1:T}, r_{1:T}, a_{1:T}) = \prod_{t=1}^{T} p_\theta(s_t | s_{t-1}, a_{t-1})\, p_\theta(o^{1:M}_t | s_t)\, p_\theta(r_t | s_t)\, \pi_\phi(a_t | s_t)$$

where $\theta$ are the world model parameters and $\phi$ are the policy parameters, $o^{1:M}_{1:T}$ denotes all observations from $t=1$ to $T$, and similarly for $s_{1:T}$, $r_{1:T}$, and $a_{1:T}$. In this work, each of the factorized distributions is modelled using nonlinear functions:

  • Transitions: $p_\theta(s_{t+1} | s_t, a_t) = f_\theta(s_t, a_t)$

  • Observations: $p_\theta(o^m_t | s_t) = g^m_\theta(s_t)$ for each modality $m$

  • Rewards: $p_\theta(r_t | s_t) = h_\theta(s_t)$

  • Policy: $\pi_\phi(a_t | s_t)$

where $f_\theta$, $g^m_\theta$, $h_\theta$, and $\pi_\phi$ are neural networks. Note the slight abuse of notation; we denote both the reward function and the reward random variable as $r$, where the meaning should be clear from context. Depending on the application, the policy may be deterministic or stochastic. We write $\pi_\phi(a_t | s_t)$ to refer to the latter case.
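To make the factorization concrete, the following is a minimal runnable sketch that computes the MSSM log-joint of a trajectory. The Gaussian log-density stand-ins below are toy assumptions for illustration, not the paper’s learned networks:

```python
import math

LOG_2PI = math.log(2 * math.pi)

# Toy stand-ins for the learned networks f, g, h, and pi
# (all functional forms here are our own assumptions).
def log_transition(s_next, s, a):   # log p(s_t | s_{t-1}, a_{t-1})
    return -0.5 * (s_next - (s + a)) ** 2 - 0.5 * LOG_2PI

def log_observation(o, s):          # log p(o_t | s_t), one modality
    return -0.5 * (o - s) ** 2 - 0.5 * LOG_2PI

def log_reward(r, s):               # log p(r_t | s_t)
    return -0.5 * (r - s) ** 2 - 0.5 * LOG_2PI

def log_policy(a, s):               # log pi(a_t | s_t)
    return -0.5 * (a - 0.1 * s) ** 2 - 0.5 * LOG_2PI

def log_joint(states, obs, rewards, actions):
    """Factorized MSSM log-joint: sum the four factors over time."""
    total = 0.0
    for t in range(1, len(states)):
        total += log_transition(states[t], states[t - 1], actions[t - 1])
    for t in range(len(states)):
        total += log_observation(obs[t], states[t])
        total += log_reward(rewards[t], states[t])
        total += log_policy(actions[t], states[t])
    return total
```

Swapping each stand-in for a neural density model recovers the structure in Fig. 2.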

Dynamics Learning. Given trajectories of the form $(o^{1:M}_{1:T}, a_{1:T}, r_{1:T})$, we seek to learn the parameters $\theta$. Because maximum likelihood estimation is intractable in this setting, we optimize the evidence lower bound (ELBO) under the data distribution using a variational distribution $q_\psi$ over the latent state variables $s_{1:T}$,

$$\mathcal{L}(\theta, \psi) = \mathbb{E}_{q_\psi}\Big[\sum_{t} \log p_\theta(o^{1:M}_t | s_t) + \log p_\theta(r_t | s_t) - \mathrm{KL}\big(q_\psi(s_t | s_{t-1}, a_{t-1}, o^{1:M}_t) \,\|\, p_\theta(s_t | s_{t-1}, a_{t-1})\big)\Big]$$

The first two terms in the ELBO are reconstruction terms and the Kullback-Leibler (KL) divergence term encourages consistency between the variational distribution and the transition dynamics. In this work, $q_\psi$ is an inference network,

$$q_\psi(s_t | s_{t-1}, a_{t-1}, o^{1:M}_t) = \mathcal{N}\big(\mu_\psi(s_{t-1}, a_{t-1}, o^{1:M}_t),\, \Sigma_\psi(s_{t-1}, a_{t-1}, o^{1:M}_t)\big)$$

where $\mu_\psi$ and $\Sigma_\psi$ are neural networks that give the mean and the covariance of the Gaussian latent state variable, respectively. When observations are missing, we apply a simple masking operation similar to prior work [31], but alternative approaches such as a product-of-experts [18] can be used as well (with corresponding changes to the inference network structure). The inference network will continue to play an important role in Mirror to facilitate fast inference and planning during communication.
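A minimal sketch of one ELBO step in the scalar case: a masked reconstruction term (zeroing out missing modalities) minus a Gaussian KL. The masking convention is our own assumption, modelled on the description above:

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo_step(recon_log_probs, mask, mu_q, var_q, mu_p, var_p):
    """One time-step of the ELBO.

    recon_log_probs: per-modality reconstruction log-likelihoods
    mask: 1 if the modality was observed, 0 if missing (masked out)
    (mu_q, var_q): variational posterior; (mu_p, var_p): transition prior
    """
    recon = sum(lp * m for lp, m in zip(recon_log_probs, mask))
    return recon - gaussian_kl(mu_q, var_q, mu_p, var_p)
```

Summing `elbo_step` over time (and averaging over samples of the latent state) yields the training objective.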

Policy Learning. Given the latent states sampled from the inference network, we leverage RL to learn optimal policies. Our approach is based on Stochastic Latent Actor Critic [28]. In our human-subject study involving the assistive driving task, we train a critic network and an actor network using Soft-Actor Critic (SAC) [15]. In our simulated experiments involving gridworld environments, we train a Q-network using Deep Q-Learning (DQN) [33].

Practical Aspects. We find that eventual human model performance is significantly improved when the self-model is trained with data-augmentation [27] and random dropouts across the sensory modalities [48, 49]. For example, we randomly drop LIDAR range readings in each training batch when training our robot in CARLA. We hypothesize that to generalize appropriately, the model should be trained with data that approximates the different ways observations can appear to a human. As an added bonus, this training method often resulted in more robust internal world models and policies.
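The modality-dropout augmentation described above can be sketched as follows; the batch format and drop probability are our own illustrative assumptions:

```python
import random

def dropout_modalities(batch, p_drop=0.3, rng=random):
    """Randomly mask out whole sensory modalities in a training batch.

    batch: dict mapping modality name -> data. A dropped modality is
    replaced by None with its mask bit set to 0, mirroring the masking
    operation used by the inference network when observations are missing.
    """
    masked, mask = {}, {}
    for name, data in batch.items():
        if rng.random() < p_drop:
            masked[name], mask[name] = None, 0
        else:
            masked[name], mask[name] = data, 1
    return masked, mask
```

For example, applying this to a CARLA batch would occasionally drop the LIDAR readings, forcing the model to reconstruct the state from the remaining modalities.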

III-B Learning Human Models via Mirror

Fig. 3: An Example Thresholding Filter as a Perceptual Implant. The (generated) range observations (blue bars) are passed through a thresholding filter to eliminate observations that are beyond a parameterized distance (red bars) in each segment from the human.

In Mirror, the human model is identical to the robot’s self-model except for implants that are injected to change the model’s behavior. To distinguish variables in the human and robot models, we will use the superscripts H and R to refer to the human or robot, respectively. For example, the human action at time $t$ is $a^H_t$. Furthermore, we will denote generated/predicted variables by a hat accent, e.g., $\hat{o}_t$, and variables that are changed by an implant with a check accent, e.g., $\check{o}_t$.

Model Implants. Once the self-model is trained, we augment it with implanted functions parameterized by $\omega$. In this work, we examine two implant classes:

  • Perceptual implants model how humans perceive the world by changing the observation $\check{o}^H_t$. In several of our experiments, we use a threshold mask/filter implant that models that the human can only see objects within a certain range (Fig. 3).

  • Policy implants model how the human acts differently from the robot by changing the policy. Simple implants can add noise to increase policy entropy. In this work, we add a state-dependent residual term $\delta_\omega(s_t)$ to the parameters of the policy distribution $\pi_\phi(a_t | s_t)$. As an example, for the Gaussian policies used in our CARLA experiments, we add $\delta_\omega(s_t)$ to the mean of the Gaussian such that the augmented policy distribution follows $\check{\pi}(a_t | s_t) = \mathcal{N}(\mu_\phi(s_t) + \delta_\omega(s_t),\, \Sigma_\phi(s_t))$. For our discrete policy $\pi_\phi(a_t | s_t) = \sigma(\ell_\phi(s_t))$, we add $\delta_\omega(s_t)$ to the distribution parameters, $\check{\pi}(a_t | s_t) = \sigma(\ell_\phi(s_t) + \delta_\omega(s_t))$, where $\sigma$ is the softmax function. The residual $\delta_\omega$ is parameterized by a small neural network.

The above provides a flavor of what is possible and is not exhaustive. For example, to model humans who are slow to change their beliefs, we could implant a low-pass filter $\check{s}_t = \beta \check{s}_{t-1} + (1 - \beta) s_t$, where $\beta$ is a learnable parameter. We leave exploration of other implants to future work.
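Both implant classes can be illustrated with small stand-ins: the thresholding filter follows Fig. 3, while `residual_policy` sketches the discrete (softmax) case. Function names and input shapes are our own assumptions, not the released implementation:

```python
import math

def threshold_filter(ranges, max_dist):
    """Perceptual implant: zero out range readings beyond a visibility
    threshold (cf. Fig. 3); max_dist is the learnable implant parameter."""
    return [r if r <= max_dist else 0.0 for r in ranges]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def residual_policy(logits, delta):
    """Policy implant (discrete case): add a state-dependent residual to
    the policy parameters before normalizing with the softmax."""
    return softmax([l + d for l, d in zip(logits, delta)])
```

In the full model, `delta` would come from a small neural network conditioned on the latent state.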

Learning Implant Parameters. Given an implant parameterized by $\omega$, we can learn $\omega$ by minimizing the following loss given data $\mathcal{D}$ of observed human behavior:

$$\mathcal{L}(\omega) = -\mathbb{E}_{(o_t, a^H_t) \sim \mathcal{D}}\big[\log \check{\pi}_\omega(a^H_t | s_t)\big] + \lambda R(\omega)$$

where $\lambda$ is a regularization hyperparameter that controls the strength of an optional prior $R(\omega)$. Intuitively, this loss optimizes the likelihood of observing a human’s actions given the self-model and the implant parameters. In our work, we approximate the posterior over latent states using the learnt inference network $q_\psi$ and perform stochastic gradient descent by sampling $s_t \sim q_\psi$. Note that only the implant parameters $\omega$ are modified when learning from human data.
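A sketch of implant-parameter learning for a single scalar parameter `omega`, using a finite-difference gradient in place of backpropagation (a simplifying assumption so the example is self-contained):

```python
def implant_loss(omega, data, nll_fn, lam=0.1):
    """Average NLL of human actions under the implanted model, plus an
    optional quadratic prior on the implant parameter (strength lam)."""
    nll = sum(nll_fn(omega, s, a) for s, a in data) / len(data)
    return nll + lam * omega ** 2

def sgd_step(omega, data, nll_fn, lr=0.01, lam=0.1, eps=1e-4):
    """One gradient-descent step on the implant parameter only; the
    self-model itself stays frozen, as in the text."""
    g = (implant_loss(omega + eps, data, nll_fn, lam)
         - implant_loss(omega - eps, data, nll_fn, lam)) / (2 * eps)
    return omega - lr * g
```

Repeating `sgd_step` over minibatches of human state-action pairs drives `omega` toward the parameters that best explain the human’s behavior.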

III-C Human-Robot Communication with Mirror

Next, we turn our attention to how the implanted self-model can be used for human-robot communication. The key idea is to plan using both learned models; we couple the robot’s self-model and the human model together via generated robot observations and communication actions, and predicted human actions (Fig. 4).

Fig. 4: Human-Robot Communication via Forward Simulation. In brief, the robot’s self-model is used to simulate the environment and the human model is used to simulate actions. By leveraging the learnt dynamics network and inference network, the two models are coupled and used to “imagine” possible futures. The communication pathways (in teal) serve to link the robot’s communication actions (filtered generated observations) to the human model’s observations. The multi-modal models support multiple potential communication pathways and the robot may choose one or more of these pathways at any given time step. Please see text for additional details.

Communication Pathways. To enable the robot to influence the human, we must first establish a means of communication between the two agents. Similar to [39], we assume the robot can change the human’s observations of the world, e.g., by providing some visual cues or alerting the user verbally. A communication pathway for modality $m$ is illustrated in Fig. 4 (teal block), where the human’s observation $\check{o}^m_t$ and the robot’s communication action $x^m_t$ are combined via an overlay function to yield a “combined” observation. As an example, for images, we can replace selected pixels of the human’s observation with corresponding pixels from the communication action.

Communication actions are actually observations that are generated from the robot belief state and passed through a communication filter; for example, we can use simple mask filters (parameterized by $\eta$) to filter away irrelevant information about the robot’s belief. Note that a communication action $x^m_t$ can be generated for each modality $m$, which offers the robot multiple ways to communicate with the human.
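The overlay function can be as simple as a per-element replacement. This sketch treats `None` entries of the communication action as “no communication” for that element (a convention of our own for illustration):

```python
def overlay(human_obs, comm_action):
    """Combine the human's own observation with the robot's communication
    action: None entries in comm_action leave the human's reading/pixel
    untouched; any other entry replaces it."""
    return [c if c is not None else h
            for h, c in zip(human_obs, comm_action)]
```

For images, `human_obs` and `comm_action` would be flattened pixel arrays; for verbal cues, a different overlay (e.g., appending tokens) would be used.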

Forward Imagination. Given our robot model, human model, and communication pathway, we now seek to sample future trajectories. We begin at time-step $t$, where we have samples from the robot’s belief over the world state, $s^R_t$, and the (predicted) human belief, $s^H_t$. We take a step from $t$ to $t+1$ to obtain $s^R_{t+1}$ and $s^H_{t+1}$. This can be accomplished in seven steps:

  1. Sample the human’s action, $\hat{a}^H_t \sim \check{\pi}(a_t | s^H_t)$;

  2. Forward simulate using the world dynamics, $s^R_{t+1} \sim p_\theta(s_{t+1} | s^R_t, \hat{a}^H_t)$;

  3. Generate the observations $\hat{o}^{1:M}_{t+1} \sim p_\theta(o^{1:M}_{t+1} | s^R_{t+1})$;

  4. Obtain the human observation $\check{o}^{1:M}_{t+1}$ using the perceptual implants across the modalities;

  5. Obtain the communication action $x^m_{t+1}$ using the communication filter for each modality $m$;

  6. Combine the communication action and the observation to obtain the combined observation;

  7. Finally, sample the human’s belief state $s^H_{t+1}$ using the inference network $q_\psi$.

Given the human’s action $\hat{a}^H_t$ (the observed action if available or generated by the human policy), the forward dynamics function gives the next state $s^R_{t+1}$. Given $s^R_{t+1}$, we generate observations $\hat{o}^m_{t+1}$ for modalities $m = 1, \dots, M$, which are modified by the perceptual implants to yield $\check{o}^m_{t+1}$. The modified observations are used to update the human’s beliefs using the inference network, $s^H_{t+1} \sim q_\psi(s_{t+1} | s^H_t, \hat{a}^H_t, \check{o}^{1:M}_{t+1})$. This process can be iterated to forward propagate the belief states up to a future horizon $T$.
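The seven steps above can be sketched as a single coupled rollout loop. Every callable in `models` is a stand-in for the corresponding learned component (the dict-of-callables interface is our own assumption for this sketch):

```python
def imagine(s_R, s_H, models, horizon):
    """Forward-simulate coupled robot/human beliefs for `horizon` steps.

    models: dict of stand-ins for the learned components:
      'policy' (implanted human policy), 'dynamics' (world transition),
      'generate_obs' (observation model), 'perceive' (perceptual implants),
      'comm_filter' (communication filter), 'overlay' (combine obs),
      'infer' (inference network for the human belief update).
    """
    m = models
    traj = []
    for _ in range(horizon):
        a_H = m['policy'](s_H)                 # 1. sample human action
        s_R = m['dynamics'](s_R, a_H)          # 2. world step (robot model)
        o = m['generate_obs'](s_R)             # 3. generate observations
        o_H = m['perceive'](o)                 # 4. apply perceptual implants
        x = m['comm_filter'](o)                # 5. communication action
        o_comb = m['overlay'](o_H, x)          # 6. combine obs + comms
        s_H = m['infer'](s_H, a_H, o_comb)     # 7. update human belief
        traj.append((s_R, s_H, a_H))
    return traj
```

Sampling many such rollouts under different communication-filter parameters yields the candidate futures that the planner scores.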

Planning for Communication. Given the forward simulation and communication pathways above, the robot can optimize communication to maximize task rewards while minimizing communication costs:

$$\max_{\eta_{0:T}} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t \big( r(s_t, a^H_t) - c(x_t) \big)\right]$$

where $\eta_{0:T}$ are the parameters of our communication filter for time-steps 0 to $T$, $\gamma$ is the discount factor, $r$ is the task reward function (in principle, this task reward function may differ from the reward function that the model was trained with, but we leave such experiments to future work), and $c$ is the cost function. The expectation is taken with respect to the trajectories under the models and filter parameters $\eta_{0:T}$. Various methods can be used to optimize this objective; in our work, we use the cross-entropy (CE) method [41], and re-plan at each time-step. Note that real observations are obtained after each step, which are used to update the beliefs of the robot and human models.
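A minimal cross-entropy method over the communication-filter parameters: sample candidates from a Gaussian, keep the elites under the imagined reward-minus-cost objective, and refit. Population sizes and the diagonal-Gaussian sampling distribution are illustrative assumptions:

```python
import random

def cross_entropy_plan(score_fn, dim, iters=25, pop=64, elite=8, rng=random):
    """Cross-entropy method: score_fn maps a parameter vector to its
    expected reward-minus-communication-cost under forward imagination."""
    mu = [0.0] * dim
    sigma = [1.0] * dim
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
                   for _ in range(pop)]
        samples.sort(key=score_fn, reverse=True)    # best candidates first
        elites = samples[:elite]
        # Refit the sampling distribution to the elite set.
        mu = [sum(e[i] for e in elites) / elite for i in range(dim)]
        sigma = [max(1e-3, (sum((e[i] - mu[i]) ** 2 for e in elites)
                            / elite) ** 0.5) for i in range(dim)]
    return mu
```

In Mirror, `score_fn` would run the coupled forward simulation with the candidate filter parameters; the robot executes the first step of the best plan and re-plans at the next time-step.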

Fig. 5: Experimental domains. (A) Gridworld Driving, where the blue vehicle is moving on a road at constant speed and has to avoid the other red vehicles. In the Fog setting, visibility is reduced; the black region indicates areas not visible to the human. (B) A Search-&-Rescue task where the agent (blue box) starts at the door and is tasked to rescue a victim at the green goal and bring them back to the door. The obstacle in red can appear in either the top or the bottom path, and the victim’s position is randomly initialized in one of three potential positions. In the Smoke variant, visibility is reduced to a small region around the human. (C) A Bomb Defusal game where a teleoperated robot has 15 seconds to disarm the bomb by pressing three buttons (one in each stage). The correct button at each stage depends on six visible “terminals” (which change after each button press), the bomb type (not visible to the human, but detectable by the robot), and game rules. The rules differ slightly between the robot training environment and the test environment. The human, who has access to the updated rules, has to confirm the robot’s selection. This domain features asymmetric sensing and knowledge/policy between the human and robot. In all three domains, the assistive agent/robot has to communicate relevant information to help the human agent complete the task successfully.

IV Experiments

In the following, we describe experiments with simulated humans (created using real human data) in a simplified driving task, a search-&-rescue task, and a bomb defusal game (Fig. 5). Using simulated humans allowed us to compare the learned models against ground-truth models with varying amounts of training data, and enabled us to test communication. Our primary hypothesis was that leveraging the self-model and learning the implant parameters allows Mirror to better model human behavior using fewer samples. In particular, we posit that using Mirror results in better communication during transfer (e.g., from clear weather to dense fog/smoke where visibility is reduced). We focus on the main results and relegate details (e.g., domain parameters) to the Appendix. Source code is available at https://github.com/clear-nus/mirror.

IV-A Experimental Setup

Domains and Communication Modalities. Our setup comprised a primary human agent who is attempting to complete a task, and an assistive robot who is able to reveal information via communication. Figure 5 summarizes the three domains used in our experiments. The Driving task is similar to those used in prior work in HRI (e.g., [10]) where state information is available to the agent. The Search-&-Rescue domain is a more complex gridworld environment, where the robot can only observe raw image data and textual descriptions of the position of the victim and obstacles. In these two domains, there is a “transfer” setting (fog/smoke) where the human’s visual perception is degraded, i.e., the human does not receive any state information beyond their field of view. For the Bomb Defusal game, the transfer setting is different in that the robot’s performance is diminished; the rules dictating which button to press are different from the rules it was trained with. As such, the communication explains the robot’s choice and the human has to verify it.

In all three domains, there are two communication modalities: visual and verbal. The total cost for each modality is quadratically related to the number of items communicated. Verbal communication is more costly, but can reveal information that is not visually apparent (e.g., vehicles in the rear or the bomb type). For more information, please see Table I in the Appendix.

Simulated Humans. For each domain, we created simulated human agents using human data; we collected data from 10 real humans playing multiple rounds (24 to 36 rounds depending on the domain). The data from each person was used to train an agent model for that individual. For example, in the Driving domain, we used Maximum Entropy IRL [50] to learn a reward function from the collected trajectories. Similar to prior work [10], we use as features (i) the distance to the center of the road, (ii) distances to the other cars, and (iii) the driver’s action. Given the learnt reward function, the simulated human agent perceives state information (up to its perceptual capabilities) and plans at each time-step to maximize its discounted cumulative reward up to a specified horizon. Further details about the other simulated agents are in Sec. VII-B of the Appendix.

Compared Methods. In total, we compare five different human modeling methods:

  • Ideal Model (Im): This baseline uses the same model as the simulated human and thus represents an “ideal” model that is typically unavailable in practice. It serves as an upper bound on performance.

  • Behavioral cloning (Bc): a black-box policy learnt via supervised learning on observed trajectories.

  • Soft Q-Imitation Learning (Sqil) [38]: A state-of-the-art method that trains a policy via RL to match the human demonstrations. Sqil represents a class of imitation learning methods that have access to the environment (in preliminary experiments, we also tested GAIL [19], but Sqil outperformed GAIL on all our tests).

  • Mirror: With a perceptual implant and a policy implant. See Appendix (Sec. VII-C) for implant details.

  • Mirror-P: A Mirror variant with the perceptual implant only.

It is important to note that Bc, Sqil, and Im are aware of what the human can perceive; we provided the ground-truth human observations. This is not the case for the Mirror models, which perceive what the robot observes and had to learn what the human could observe. The Mirror self-models were trained to perform the task in the original domain, and only the implants were adapted using the simulated human data (in both the original and transfer settings).

To train Bc and Sqil in the transfer domains, we combined both the original and transfer data, e.g., if 10 trajectories were used to train Mirror, we used 20 trajectories (10 original, 10 transfer) for Bc and Sqil. This was necessary to obtain reasonable performance for these methods. Each method was used to train a neural network policy (comprising GRUs with fully-connected layers) with early-stopping on a validation set. We varied the number of layers and report best results for the baseline models.

Fig. 6: Policy Negative Log-Likelihood (NLL) in the different domains after training with different proportions of data. Lower scores indicate better fit to the data. In both the original and transfer settings, Bc achieved the best scores. However, this did not translate into good performance during communication. See main text for details.
Fig. 7: Task Performance and Communication Amount in the Transfer/Test Environment. Nc indicates agents that received no communication. In all domains, the simulated human’s performance improved given communication. For the Search & Rescue Task, the Sqil agents slightly outperform the Mirror agents but at far higher communication cost. In the Driving and Bomb Defusal tasks, the best performing simulated agents were those that were paired with the assistive Mirror agents. The performance was achieved with reasonable amounts of communication, close to that of the ideal models.

IV-B Results and Analysis

Overall, we find that Mirror is able to learn better human models with fewer samples compared to Bc and Sqil. We initially compared the negative log-likelihood (NLL) of the trained policies (on a test set) with different amounts of training data (Fig. 6). Interestingly, Bc achieved the best NLL scores, indicating that the Bc models best fit the data and may be good proxy human models for planning. However, these scores are misleading.

The Mirror models were far more effective than Bc for communication planning. Fig. 7 summarizes the performance of the simulated human agents with communication from the compared methods, along with the amount of communication provided. In all domains, the agents that received communication outperformed the agents that did not (Nc). For the Gridworld Driving and Bomb Defusal tasks, the simulated humans achieved significantly better performance scores with reasonable amounts of communication from the Mirror assistants. Mirror learned rapidly: good models could be obtained with 20% of the data (4 trajectories). For the Search-&-Rescue task, Sqil obtained the best performance scores (about 1-2 steps better than Mirror), but only by using large amounts of verbal communication. Qualitatively, the Sqil agent would reveal almost the entire map, with no discernible pattern to the communication; moreover, providing more training data did not reduce the amount of verbal communication. In contrast, the Mirror models revealed information more selectively, often focusing on the locations of the goal item and the obstacle.

V Human-Subject Experiments

In this section, we report on a second set of experiments designed to test whether Mirror is able to provide useful information to human users in a more realistic setting. We use CARLA [12], a modern driving simulator (Fig. 8). Unlike the previous gridworld experiments, the CARLA environment has continuous state and action spaces, with realistic dynamics and visuals. Our main hypotheses were that (H1) planning with implanted self-models would yield more helpful communication than planning with behavioral cloning models and (H2) planning to optimize task rewards and communication costs would lead to less redundant communication compared to belief matching. Our study was approved by our institution's ethical review board.

Fig. 8: CARLA Experiment Setup. (A) The stretch of highway that participants drove along. (B) Participants drove the simulated car using a steering wheel with accelerator and brake pedals. (C) and (D) show the difference in visibility in clear and foggy weather. Both cars are visible in the clear setting. In the fog setting, the car on the left is visible, but the car in the front can barely be seen.

V-a Experimental Setup

Task Description. Participants interacted with our assistive driving agents in a highway driving task under dense fog (see Fig. 8.C and 8.D for a comparison between clear and adverse weather conditions). The goal was to drive along a stretch of CARLA's Town04 two-lane carriageway from the starting position to the destination while navigating through normal highway traffic. The other vehicles may slow down or speed up and, due to poor visibility, would not be able to avoid the participant's vehicle; the participant had to actively avoid other vehicles along the way.

Assistive Communication. The car is equipped with a semantic LIDAR and a driving assistant that can provide both visual and verbal cues. Specifically, the agent could highlight selected vehicles with visual bounding boxes and/or provide informative speech (as previously shown in Fig. 1.A). Visual bounding boxes can be generated whenever other cars enter the LIDAR's detection range. Verbal communication comprised free text generated by OpenAI's GPT-2 language model [35] (fine-tuned for our task). The system is capable of informing participants when a car approaches or slows down, as well as when no cars are detected in a specific direction. The robot observes 36 LIDAR beams along each of three angles (108 beams in total), the velocity of the ego car, the distance to the center of the lane, and the relative curvature of the road 10 meters ahead.

Compared Methods. Based on our previous experiments, we trained Mirror with both perceptual and policy implants. We compared four conditions:

  • No Communication (Nc): the human receives no assistance.

  • Behavioral Cloning (Bc): the human model is trained using behavioral cloning. The model is provided with both expert demonstrations in clear weather, and participant data in the foggy weather.

  • Mirror: our Mirror method that first trains a self-model via RL, then learns the implants from demonstrations, and plans communication.

  • Mirror-Kl: a Mirror variant that does not plan but minimizes the KL-Divergence between the human’s mental state and the robot’s belief, similar to [39]. Unlike [39], Mirror-Kl uses a learnt implant model and is capable of multi-modal communication.
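To illustrate the belief-matching objective, the sketch below scores candidate utterances purely by the KL divergence between the robot's belief and the human belief each utterance would induce, regardless of task relevance; `candidate_updates` is a hypothetical stand-in for the learnt implant model's prediction, not code from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def pick_utterance(robot_belief, candidate_updates):
    """Select the utterance whose predicted human belief is closest
    (in KL) to the robot's belief -- the belief-matching criterion."""
    return min(candidate_updates.items(),
               key=lambda kv: kl_divergence(robot_belief, kv[1]))[0]

robot = [0.9, 0.1]  # robot is fairly sure a car is approaching
updates = {         # hypothetical implant predictions of the human's belief
    "silence": [0.5, 0.5],
    "car approaching": [0.85, 0.15],
}
print(pick_utterance(robot, updates))  # "car approaching"
```

Because the criterion penalizes any belief mismatch, even task-irrelevant ones, a belief-matching agent tends to communicate whenever beliefs diverge, which is consistent with Mirror-Kl's over-communication reported below.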

We were unable to use Sqil due to experimental time constraints; training on the demonstrations with Sqil required several hours.

Participants. A total of 21 participants (mean age = 23.2, 10 females) were recruited from the university community. The experiment was designed to be within-subjects with all 21 participants in each condition.

Procedure. Participants entered the lab and were briefed about the task. They then engaged in two practice trials; the first trial involved driving freely along the highway in clear weather conditions, and the second trial involved three rounds of the driving task under dense fog conditions without any assistive communication. Thereafter, they performed a total of 24 rounds, with the first 6 rounds without any assistive communication, followed by 18 rounds with three different agents (six rounds per agent). The data from the first 6 rounds were used to train the models and the order of the three agents was counterbalanced. Participants could choose to take a one minute rest after every 6-th round to reduce fatigue.

Dependent Measures. We use both objective and subjective measures to evaluate agent performance. Objective measures comprised the number of collisions with other vehicles/environment, and the amount of communication (e.g., the number of speech utterances). We also collected a range of subjective measures to ascertain cognitive load, communication properties (helpfulness, redundancy, timeliness, modality selection), and trust after interaction with each agent (See Table III in the Appendix).

V-B Results and Analysis

Fig. 9: Objective Measures. Error bars indicate one standard error. (A) Number of Collisions. (B) Amount of Visual Communication. (C) Amount of Verbal Communication. Mirror results in fewer collisions compared to the behavioral cloning (Bc) and no communication (Nc) conditions, with far less communication required compared to belief matching (Mirror-Kl).

In brief, the results support both hypotheses; we give a summary of key findings below and place additional details (e.g., a breakdown across NASA-TLX dimensions) in the Appendix. We compared the methods along both objective and subjective dimensions using a repeated-measures one-way ANOVA, followed by selected pairwise t-tests with Bonferroni-adjusted p-values.
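For reference, the Bonferroni adjustment simply scales each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch of this standard correction (not code from the study):

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p_adj = min(1, m * p) for m comparisons."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

# Three pairwise comparisons with hypothetical raw p-values.
print([round(p, 4) for p in bonferroni([0.01, 0.04, 0.20])])  # [0.03, 0.12, 0.6]
```

A comparison is then declared significant if its adjusted p-value falls below the chosen alpha (typically 0.05), which controls the family-wise error rate across the multiple pairwise tests.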

Fig. 9.A shows that participants experienced significantly fewer collisions when interacting with Mirror (, ; Mirror vs Bc: , ). Subjectively, participants found Mirror provided information that was more helpful and timely compared to Bc, and were also more comfortable with the communication modality chosen ( across the measures and pairwise tests, Fig. 13 in the appendix). Participants also trusted the Mirror agent more than Bc (, ; Mirror vs Bc: , ). The overall Raw-TLX scores indicated that the participants felt less mentally burdened when they interacted with Mirror (, ; Mirror vs Bc: , ).

Taken together, both objective and subjective evidence strongly support hypothesis H1. This finding is corroborated by participant survey responses; they shared that Mirror “conveys critical information at a good timing”, in contrast to Bc, which they felt “is not helping me at all” and “doesn’t inform me about the cars that are approaching from the rear”. Qualitatively, we found the Bc human model to be inaccurate for planning; it would quickly overfit, which led to poor communication. In contrast, the Mirror implant models resulted in better communication; Fig. 10 shows that the perceptual implants learned by Mirror closely approximate what the human could see, even with a small amount of training data (six demonstrations).

Next, we turn our attention to H2, i.e., whether planning with task rewards and communication costs reduced redundant communication relative to belief matching. Figs. 9.B and 9.C show how long each agent highlighted cars and the number of times they verbally alerted the driver, respectively. Mirror-Kl tended to be overly communicative; it provided more than 5 times more visual and verbal communication compared to Mirror without significant benefits in terms of task performance. Subjectively, participants rated Mirror-Kl to provide more redundant information compared to Mirror (, ). In their survey responses, participants wrote that the Mirror-Kl agent “is too talkative”, “told me a lot of useless information” and “is distracting and annoying”. In comparison, they found Mirror to be “straight to the point.” Recall that the only difference between the two is the objective function; Mirror-Kl tries to align belief distributions, regardless of whether the alignment leads to better task accomplishment. We observed that Mirror-Kl would communicate verbally even when there was no need to, e.g., it would repeatedly tell participants that “there is no car in the rear”.

Fig. 10: Samples of Learned Perceptual Implants (threshold filters) for the CARLA experiment. Top is the front of the vehicle. Length of red bars indicate visibility distance. The implants indicate the human was not able to see far ahead or in the rear but could see cars at the side. Compare against Fig. 8.C and 8.D.

When asked which agent they were most comfortable with, a majority of the participants (14 out of 21) selected Mirror. The remaining participants picked Mirror-Kl. When asked about which agent they would prefer for long-distance driving (above 1 hour), the number of participants selecting Mirror increased to 18. No participant selected Bc. Interestingly, some participants preferred Mirror-Kl’s talkative nature, with one participant stating that they “felt annoyed and safe at the same time”. A few participants found Mirror to be too quiet: “I have some confidence that it works but I’m not entirely sure because it is quieter”. These responses suggest individual preferences for information/reassurance and differing trust in the system. How we can incorporate these aspects within Mirror would make for interesting future work.

Vi Conclusion

In summary, we present Mirror, a framework that uses deep self-models as initial structure for learning human models, along with a planning-based communication approach that couples the human model with a learned world model. Experiments show Mirror to be effective, outperforming existing behavioral cloning and imitation learning methods. The results show that bootstrapping human model learning with latent-variable models learnt during reinforcement learning leads to generalizable models that are more useful for interaction planning.

More broadly, we consider Mirror to be a step towards more data-efficient human models for human-robot interaction. The key idea examined in this work—learning differences from the robot self-model—can potentially be applied to general human-robot collaboration. Compared to existing work on human models, Mirror embodies an alternative paradigm that “front-loads” the learning of environmental dynamics and task structure, and thus offers savings in both sample complexity and computation when learning from human demonstrations. Below, we highlight potential areas for future work:

  • In our current setup, the agent and human are not co-present in the environment. We believe Mirror can be adapted to such situations by incorporating observations/models of other agents acting in the world. A related challenge is crossing the Sim2Real gap, which remains difficult for deep RL.

  • Intuitively, Mirror’s effectiveness depends on similarity between the human and the robot self-model; drastic differences will render the self-model ineffective as a reference point. We are working on how to characterize this intuition theoretically and potentially estimate in advance how effective Mirror might be given varying amounts of data.

  • A related open question is whether the human model is identifiable within our framework. Similar to IRL, we posit that the differences in the policy and perception cannot be completely ascertained given a dataset. Future work can look into methods that attempt to address this ambiguity, e.g., by maintaining a distribution over parameters [36] or via Bayesian Optimization [3]. A related issue is how to handle alternative policies and norms that achieve the same objective.

  • As our human-subject experiments reveal, some human traits cannot be initially captured within an RL self-model. How to incorporate elements such as human-robot trust [22], emotions, and social norms remains a key open problem.

We believe that building upon Mirror forms a compelling pathway towards robots that can interact fluently with humans.


This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-RP-2019-011).


  • [1] F. H. Allport (1924) Chapter 13: social attitudes and social consciousness. Social Psychology. Cited by: §I.
  • [2] C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum (2017) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 1 (4), pp. 1–10. Cited by: §II.
  • [3] S. Balakrishnan, Q. P. Nguyen, B. K. H. Low, and H. Soh (2020) Efficient exploration of reward functions in inverse reinforcement learning via bayesian optimization. In Advances in Neural Information Processing Systems, Cited by: 3rd item.
  • [4] S. Bekkali, G. J. Youssef, P. H. Donaldson, N. Albein-Urios, C. Hyde, and P. G. Enticott (2021) Is the putative mirror neuron system associated with empathy? a systematic review and meta-analysis. Neuropsychology review 31 (1), pp. 14–57. Cited by: §I.
  • [5] M. C. Buehler and T. H. Weisswange (2020) Theory of mind based communication for human agent cooperation. In 2020 IEEE International Conference on Human-Machine Systems (ICHMS), Vol. , pp. 1–6. External Links: Document Cited by: §I, §II.
  • [6] J. Buhmann, W. Burgard, A. B. Cremers, D. Fox, T. Hofmann, F. E. Schneider, J. Strikos, and S. Thrun (1995) The mobile robot rhino. Ai Magazine 16 (2), pp. 31–31. Cited by: §II.
  • [7] B. Busch, J. Grizou, M. Lopes, and F. Stulp (2017) Learning legible motion from human–robot interactions. International Journal of Social Robotics 9 (5), pp. 765–779. Cited by: §II.
  • [8] M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan (2019) On the utility of learning about humans for human-ai coordination. Advances in Neural Information Processing Systems 32, pp. 5174–5185. Cited by: §I, §II.
  • [9] K. Chen, Y. Lee, and H. Soh (2021) Multi-modal mutual information (mummi) training for robust self-supervised deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §I, §III-A.
  • [10] R. Choudhury, G. Swamy, D. Hadfield-Menell, and A. D. Dragan (2019) On the utility of model learning in hri. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 317–325. Cited by: §II, §II, §IV-A, §IV-A, §VII-B.
  • [11] D. Das, S. Banerjee, and S. Chernova (2021) Explainable ai for robot failures: generating explanations that improve user assistance in fault recovery. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, pp. 351–360. Cited by: §I, §II.
  • [12] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §I, §V.
  • [13] A. D. Dragan, S. Bauman, J. Forlizzi, and S. S. Srinivasa (2015) Effects of robot motion on human-robot collaboration. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 51–58. Cited by: §II.
  • [14] A. D. Dragan, K. C. Lee, and S. S. Srinivasa (2013) Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308. Cited by: §II.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. Cited by: §III-A.
  • [16] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Cited by: §I.
  • [17] S. G. Hart and L. E. Staveland (1988) Development of nasa-tlx (task load index): results of empirical and theoretical research. In Advances in psychology, Vol. 52, pp. 139–183. Cited by: §VII-F, TABLE III.
  • [18] G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800. Cited by: §III-A.
  • [19] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. Advances in neural information processing systems 29, pp. 4565–4573. Cited by: footnote 2.
  • [20] M. K. Ho and T. L. Griffiths (2021) Cognitive science as a source of forward and inverse models of human decisions for robotics and control. arXiv preprint arXiv:2109.00127. Cited by: §II.
  • [21] I. Horswill (1993) Polly: a vision-based artificial agent. In AAAI, pp. 824–829. Cited by: §II.
  • [22] B. C. Kok and H. Soh (2020) Trust in robots: challenges and opportunities. Current Robotics Reports. External Links: Document Cited by: 4th item.
  • [23] T. Kollar, S. Tellex, D. Roy, and N. Roy (2010) Toward understanding natural language directions. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 259–266. Cited by: §II.
  • [24] J. I. Krueger, T. E. DiDonato, and D. Freestone (2012) Social projection can solve social dilemmas. Psychological Inquiry 23 (1), pp. 1–27. Cited by: §I.
  • [25] J. I. Krueger (2013) Social projection as a source of cooperation. Current Directions in Psychological Science 22 (4), pp. 289–294. Cited by: §I.
  • [26] M. Kwon, E. Biyik, A. Talati, K. Bhasin, D. P. Losey, and D. Sadigh (2020) When humans aren’t optimal: robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 43–52. Cited by: §II.
  • [27] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas (2020) Reinforcement learning with augmented data. Advances in Neural Information Processing Systems 33. Cited by: §III-A.
  • [28] A. Lee, A. Nagabandi, P. Abbeel, and S. Levine (2020) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems 33. Cited by: §I, §III-A.
  • [29] J. Lee, J. Fong, B. C. Kok, and H. Soh (2020) Getting to know one another: calibrating intent, capabilities and trust for human-robot collaboration. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6296–6303. Cited by: §II.
  • [30] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2019) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. Cited by: §I.
  • [31] Z. C. Lipton, D. Kale, and R. Wetzel (2016) Directly modeling missing data in sequences with rnns: improved classification of clinical time series. In Machine learning for healthcare conference, pp. 253–270. Cited by: §III-A.
  • [32] N. Mavridis (2015) A review of verbal and non-verbal human-robot interactive communication. Robotics and Autonomous Systems 63 (P1), pp. 22–35. External Links: Document, arXiv:1401.4994v1, ISBN 0921-8890, ISSN 09218890, Link Cited by: §II.
  • [33] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §III-A.
  • [34] E. OhnBar, K. Kitani, and C. Asakawa (2018) Personalized dynamics models for adaptive assistive navigation systems. In Conference on Robot Learning, pp. 16–39. Cited by: §II.
  • [35] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §V-A.
  • [36] D. Ramachandran and E. Amir (2007) Bayesian inverse reinforcement learning.. In IJCAI, Vol. 7, pp. 2586–2591. Cited by: 3rd item.
  • [37] S. Reddy, A. D. Dragan, and S. Levine (2018) Where do you think you’re going?: inferring beliefs about dynamics from behavior. In NeurIPS, Cited by: §II.
  • [38] S. Reddy, A. D. Dragan, and S. Levine (2019) SQIL: imitation learning via reinforcement learning with sparse rewards. In International Conference on Learning Representations, Cited by: §I, 3rd item.
  • [39] S. Reddy, S. Levine, and A. D. Dragan (2020) Assisted perception: optimizing observations to communicate state. In Conference on Robot Learning (CoRL 2020), Cited by: §I, §I, §II, §III-C, 4th item.
  • [40] G. Rizzolatti and L. Craighero (2004) The mirror-neuron system. Annu. Rev. Neurosci. 27, pp. 169–192. Cited by: §I.
  • [41] R. Rubinstein (1999) The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 1 (2), pp. 127–190. Cited by: §III-C.
  • [42] C. Schaff and M. R. Walter (2020) Residual policy learning for shared autonomy. In Proceedings of Robotics: Science and Systems (RSS), Cited by: §II.
  • [43] H. Soh, Y. Xie, M. Chen, and D. Hsu (2020) Multi-task trust transfer for human–robot interaction. The International Journal of Robotics Research 39 (2-3), pp. 233–249. External Links: Document Cited by: §II.
  • [44] A. Tabrez, S. Agrawal, and B. Hayes (2019) Explanation-based reward coaching to improve human performance via reinforcement learning. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 249–257. Cited by: §I, §I, §II.
  • [45] A. Tabrez, M. B. Luebbers, and B. Hayes (2020) A Survey of Mental Modeling Techniques in Human–Robot Teaming. Current Robotics Reports 1 (4), pp. 259–267. External Links: Document Cited by: §II.
  • [46] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer (2020) Vision-and-dialog navigation. In Conference on Robot Learning, pp. 394–406. Cited by: §II.
  • [47] V. V. Unhelkar, S. Li, and J. A. Shah (2020) Decision-making for bidirectional communication in sequential human-robot collaborative tasks. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 329–341. Cited by: §II.
  • [48] M. Wu and N. Goodman (2018) Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pp. 5575–5585. Cited by: §III-A.
  • [49] T. Zhi-Xuan, H. Soh, and D. C. Ong (2020) Factorized inference in deep markov models for incomplete multimodal time series. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §III-A.
  • [50] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd national conference on Artificial intelligence-Volume 3, pp. 1433–1438. Cited by: §IV-A, §VII-B.

Vii Appendix

This section contains supplementary material, specifically details on the experiment domains, the simulated human agents, and experimental results relating to the learnt implants. We also provide additional details on the human-subject experiment.

  • Driving. Visual: relative positions of other cars in front of the agent. Verbal: symbols representing the position and speed of other vehicles.
  • Search-&-Rescue. Visual: raw image pixels revealing a location in the map (each map is divided into 9 locations). Verbal: GPT-2 embedding of speech utterances indicating the position of the victim and obstacles, e.g., “victim in top-right”.
  • Bomb-Defusal. Visual: raw image pixels of a specific terminal. Verbal: GPT-2 embedding of speech utterances indicating the bomb type, e.g., “Type A Bomb”.

TABLE I: Communication Modalities and Costs.

Vii-a Domains for Simulated Human Experiment

Our experiments made use of three Gridworld domains, each with two communication/observation modalities (see Fig. 5 in the main paper and Table I below).

Gridworld Driving In this task, a human drives a car at a constant speed along a road with two lanes. The driver has two possible actions: stay in the current lane or switch lanes. There are four cars around the driver that may speed up or slow down; these cars will not avoid the human’s car. The positions of the cars are randomly initialized at the beginning of each episode. We examined 8 different scenarios in total with varying car movements. While simple, the state space of this domain is larger than .

In addition to observing their lane, the human driver can receive two types (modalities) of discrete observations (i) the relative position of the other cars and (ii) specific “speech” symbols that describe the position of the other cars as well as the speed of the other cars. When driving in clear weather, all cars are visible. However, in dense fog, the driver can only see two units ahead and cannot see cars in the rear.

Search & Rescue In this task, a firefighter has to rescue a victim from a burning house; the firefighter has to find the victim’s position and bring them back to the entrance. The victim may appear at one of three potential positions. There are two corridors to the victim’s potential positions, but one of them may be blocked by an obstacle. At the outset, the firefighter is unaware of the positions of the victim and the obstacle. We examined 9 different scenarios in total with different corridor configurations and victim positions.

The firefighter can receive two kinds of observations in addition to his own position: (i) a raw image of the map and (ii) specific speech utterances that describe the corridors (blocked/unblocked) and the victim’s potential position (victim is/isn’t there). In the original training (clear) setting, all information is visible. In the transfer setting (dense smoke), the firefighter can only see units around himself and does not observe the speech utterances.

Bomb Defusal In this task, a human tele-operates a robot arm to defuse a bomb. To defuse the bomb, the robot needs to press buttons (a button at each stage). Which button to press at each stage depends on the bomb type and the “terminals” alongside the bomb.

The human can receive two kinds of observations: (i) raw images of specific terminals and (ii) specific speech utterances that indicate the type of the bomb. The robot’s knowledge of the rules is out-of-date, i.e., its policy is wrong; as such, the robot cannot defuse the bomb by itself. The human knows the updated rules but is slower than the robot at identifying the correct terminals and is unable to identify the bomb type. To help the human quickly defuse the bomb, the robot advises the human which button to press and provides explanations (images of specific terminals and the bomb type). The human then chooses a button to press. Note that the human is unable to complete the task on their own since they cannot perceive the bomb type.

Vii-B Simulated Human Agents

In the following, we describe the simulated agents that were created for our experiments. For each domain, we collected data from 10 participants and trained a simulated agent for each participant. The simulated agents are able to perceive state information, up to their perceptual limitations. Qualitatively, we find the behavior of the simulated agents to be very similar to their human counterparts.

Gridworld Driving Each participant played 24 rounds in each setting (clear weather and dense fog). We trained a reward function on all the collected human data using Maximum Entropy Inverse Reinforcement Learning [50]. Similar to previous work [10], we use as features (i) the distance to the center of the road, (ii) the distances to the other four cars, and (iii) the driver’s action. Given the learnt reward function, the simulated human agent plans actions at each time-step using the cross-entropy (CE) method. Each simulated agent plays 40 rounds (clear weather and dense fog). The last 20 rounds of each simulated agent’s data were used as a validation set (10 rounds) and testing set (6 rounds). We assumed that the simulated human agent can digest all communicated information and never forgets.
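The cross-entropy (CE) method used by the simulated driver can be sketched as follows for discrete action sequences: sample candidate plans from a per-step categorical distribution, keep the highest-reward "elite" samples, and refit the distribution to them. The reward function and hyperparameters below are illustrative placeholders, not those learnt in the paper.

```python
import random

def cem_plan(reward_fn, n_actions, horizon, iters=20, pop=100, elite=10):
    """Cross-entropy method over discrete action sequences.

    Maintains an independent categorical distribution per time step and
    repeatedly refits it to the highest-reward samples.
    reward_fn(actions) -> float stands in for the learnt reward model.
    """
    probs = [[1.0 / n_actions] * n_actions for _ in range(horizon)]
    samples = []
    for _ in range(iters):
        samples = [[random.choices(range(n_actions), weights=probs[t])[0]
                    for t in range(horizon)] for _ in range(pop)]
        samples.sort(key=reward_fn, reverse=True)
        elites = samples[:elite]
        for t in range(horizon):
            counts = [sum(a[t] == k for a in elites) for k in range(n_actions)]
            # Small additive smoothing keeps every action reachable.
            probs[t] = [(c + 1e-3) / (elite + n_actions * 1e-3) for c in counts]
    return samples[0]  # best plan from the final population

# Toy reward: prefer action 1 at every step.
plan = cem_plan(lambda acts: sum(acts), n_actions=2, horizon=3)
print(plan)  # typically [1, 1, 1]
```

In the actual setup only the first action of the returned plan would be executed before replanning at the next time step (receding-horizon control).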

Search & Rescue Each participant played 36 rounds in each setting (clear and smoke). The simulated agent was a planning agent that computes and executes the shortest path between subgoal positions. The subgoals were manually specified, and the transition probabilities between subgoals were learnt using the collected data. In the smoke setting, if the simulated agent does not know the position of the victim, it searches all of the victim’s potential positions until the victim is found. Similar to Gridworld Driving, each simulated agent plays 40 rounds (clear and smoke). The last 20 rounds of each simulated agent’s data were used as a validation set (10 rounds) and testing set (6 rounds). As before, we assumed that the simulated human agent can digest all communicated information and never forgets.

Bomb Defusal Each participant played 36 rounds. For the human participants, we informed them of the bomb type to reduce the number of samples that needed to be collected. To determine which button to press, the simulated human has to guess the type of bomb and identify one correct terminal out of the six displayed. The bomb type is not visible to the simulated human and hence, the chance of guessing the correct type is 0.5. To model the length of time the human takes to find the correct terminal, we use a geometric distribution with success probability learnt from human data. Once the simulated human finds the relevant terminal, it will always press the correct button. Otherwise, it refrains from pressing any button and continues searching for the correct terminal. As before, each simulated agent plays 40 rounds, and the last 20 rounds of each simulated agent’s data were used as a validation set (10 rounds) and testing set (6 rounds). We also assumed that the simulated human agent can digest all communicated information and never forgets.
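The geometric search-time model can be sketched as follows; the success probability `p` below is a hypothetical value for illustration, since the learnt value is not reproduced here.

```python
import random

_rng = random.Random(0)  # fixed seed for reproducibility of this sketch

def steps_to_find_terminal(p_success, rng=_rng):
    """Sample the number of 1-second steps until the simulated human
    locates the correct terminal (geometric distribution)."""
    steps = 1
    while rng.random() >= p_success:
        steps += 1
    return steps

# With success probability p, the expected search time is 1/p steps.
p = 0.25  # hypothetical; the paper learns this value from human data
samples = [steps_to_find_terminal(p) for _ in range(10000)]
print(round(sum(samples) / len(samples), 2))  # close to 1/p = 4.0
```

Per the model above, each second the simulated human independently succeeds with probability p, so the memoryless geometric distribution is the natural choice for the time-to-find variable.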

Vii-C Mirror Implants

The implants used in our experiments are similar to those described in Section III-B.

Perceptual Implants.

In Gridworld Driving, the perceptual implant is a threshold filter governed by 4 parameters. Each parameter specifies the distance the human can see along a specific direction (front/back and left/right lane) on the road. If a car on a lane is within the specified threshold distance, the model observes the position of the car. For Search & Rescue, the perceptual implant is a single-parameter threshold filter; we split the image map into 9 portions, and a portion is observable if its center is within this range. In Bomb Defusal, the perceptual implant comprises 6 parameters; each models the probability that the human perceives one of the 6 terminals within 1 second (1 time step). Given these 6 probabilities, we sample a 6-dimensional binary vector; if a dimension is 1, we show the corresponding terminal in the image, otherwise we mask it out.
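A four-parameter threshold filter of the kind used for Gridworld Driving might look like the sketch below; the coordinate convention and names are illustrative assumptions, not the paper's code.

```python
def visibility_mask(car_positions, thresholds):
    """Threshold-filter perceptual implant for gridworld driving.

    car_positions: dict car_id -> (dx, dy) offset relative to the ego car
                   (dx > 0 is ahead, dx < 0 is behind; dy is lane offset).
    thresholds: dict with keys 'front', 'back', 'left', 'right' giving
                the learnt visibility distance in each direction.
    Returns the subset of cars the modeled human can observe.
    """
    visible = {}
    for car, (dx, dy) in car_positions.items():
        ok_x = -thresholds['back'] <= dx <= thresholds['front']
        ok_y = -thresholds['left'] <= dy <= thresholds['right']
        if ok_x and ok_y:
            visible[car] = (dx, dy)
    return visible

# In dense fog the human sees only two units ahead and nothing behind.
fog = {'front': 2, 'back': 0, 'left': 1, 'right': 1}
cars = {'a': (1, 1), 'b': (4, 0), 'c': (-2, 1)}
print(sorted(visibility_mask(cars, fog)))  # ['a']
```

Because the filter is differentiable in the threshold parameters (up to the hard cutoff, which the paper's learning procedure handles), the visibility distances can be fit from a handful of demonstrations.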

Policy Implants. For all three domains, the policy distributions are modeled as categorical distributions. The policy implant is a small neural network that takes the latent state as input and outputs a residual term that is added to the parameters of the original policy distribution; the resultant action distribution is then parameterized by the sum.
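One way to illustrate this residual parameterization is to add the residual to the action logits and renormalize; treating the policy parameters as logits, and replacing the implant network with a fixed residual vector, are assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Convert logits to a categorical probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def implanted_policy(base_logits, residual):
    """Policy implant sketch: add a residual to the self-model's action
    logits, then renormalize to obtain the human action distribution."""
    return softmax([b + r for b, r in zip(base_logits, residual)])

base = [2.0, 0.0]       # robot self-model strongly prefers action 0
residual = [-2.0, 0.0]  # learnt correction: the modeled human is indifferent
print(implanted_policy(base, residual))  # [0.5, 0.5]
```

A zero residual recovers the robot's own policy exactly, so the implant only has to learn the (typically small) human-robot difference rather than the full policy.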

Learnt perceptual implants. Samples of the perceptual implants learnt by Mirror are shown in Fig. 11, Fig. 12, and Table II. In both Gridworld Driving and Search & Rescue, the learnt perceptual implants indicate that the human was able to see most of the scene in the original setting but only a small nearby area in the transfer setting (e.g., fog). For the Bomb Defusal task, we see the average success rate was approximately 0.05.

Fig. 11: Samples of learnt perceptual implants for Gridworld Driving. The blue block represents the ego car and the red blocks represent other cars. The black areas represent the region that the human cannot see. The ground-truth images represent what the human actually sees in the original (clear weather) and transfer (dense fog) settings. (A) The learnt implants indicate that the human was able to see most of the road in clear weather. (B) The implants show the human was only able to see a small nearby area in dense fog.
Fig. 12: Samples of learnt perceptual implants in the Search & Rescue task. The red and purple blocks represent obstacles and the green box represents the victim. The black areas represent the region that the human cannot see. The implants are similar to the ground-truth images, which represent what the human actually sees in the original (clear) and transfer (dense smoke) settings. (A) In the clear setting, the learnt implants indicate the human was able to see most of the map. (B) In the dense smoke setting, the implants show the human was only able to see a small surrounding region.
TABLE II: Samples of learnt perceptual implants in the Bomb Defusal game.

VII-D Model Architectures and Training

In all three domains, we approximate the posterior over latent states with a variational distribution. Below, we give an overview of our models; source code implementing these models is available at http://blinded-for-review.

Gridworld Driving. Recall that in this domain, state information (e.g., the positions of cars) is available to the agent. The Mirror model consists of the following main components:

  • The transition distribution and the variational distribution are both 48-dimensional Gaussian distributions with diagonal covariance. The transition network is 3 FC layers deep with hidden size 32, and the network parameterizing the variational distribution is an MLP with 3 fully-connected (FC) layers of hidden size 64.

  • The observation (decoder) and reward distributions are Gaussian distributions with diagonal covariance. Each decoder is 3 FC layers deep with hidden size 32.

  • The Q function is modeled by a Q-network, which takes the latent state and an action as input and outputs the Q value for that particular action. The Q-network consists of 3 FC layers with hidden size 128.

The learning rate was set to 0.0003. Before each experience-collection phase in Deep Q-Learning, we probabilistically determine whether to use the expert or the model policy via a mixing hyperparameter, which in our experiments was initialized to 0.5 with a decay factor of 0.8. Furthermore, whenever the model policy is chosen to perform the rollout, we perform ε-greedy action selection to allow for further randomization; ε was initialized to 0.5 with a decay of 0.75 after every rollout phase.
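The expert/model mixing and ε-greedy selection might be sketched as follows. For simplicity the expert-vs-model choice is made per call here rather than once per collection phase, and all names are our own:

```python
import random

def select_action(expert_policy, model_q, state, beta, epsilon, rng):
    """Choose a rollout action during experience collection.

    With probability `beta` the expert policy acts; otherwise the
    model acts greedily w.r.t. its Q values, with epsilon-greedy
    exploration on top. Both `beta` and `epsilon` are decayed between
    rollout phases (e.g. beta: 0.5 with decay 0.8; epsilon: 0.5 with
    decay 0.75, as in the gridworld setup above).
    """
    q_values = model_q(state)
    if rng.random() < beta:
        return expert_policy(state)          # expert rollout
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Setting `beta = 1` recovers pure expert rollouts (useful early in training), while `beta = 0, epsilon = 0` yields purely greedy model rollouts.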

For Bc, the policy distribution is a categorical distribution. The policy network is modeled as a 48-dimensional GRU followed by one FC layer of size 32. At each time step, the GRU takes the current observation and action, together with the hidden state from the previous time step, as input and outputs a new hidden state, which is fed into a fully-connected layer to produce the parameters of the categorical policy distribution. For Sqil, the Q function is modeled as a 48-dimensional GRU followed by 3 FC layers with hidden size 32; the GRU likewise takes the current observation, action, and previous hidden state as input and outputs a new hidden state, which is fed into the 3 FC layers to produce the Q value.

Search and Rescue. The distributions and neural networks in the Mirror model are similar to those in the Gridworld Driving environment. Accommodating differences in the input (“raw” image and text) led to differences in:

  • The encoder, which uses convolutional layers for the raw image input and feedforward layers for the symbolic speech observations. Due to computational limits, we pre-trained an autoencoder to reduce the dimensionality of the speech observations (768-dimensional GPT-2 word vectors). The encoder has 1 convolutional layer (kernel size 3, stride 3, 4 output channels) followed by 3 FC layers of 48 dimensions, while the feedforward portion for the symbolic speech input contains just 1 layer. Each network (Conv for image, FC for speech) produces a 16-dimensional vector; the two vectors are concatenated and passed through 3 FC layers (hidden size 128) to derive the latent state parameters.
  • The image decoder, which consists of a deconvolution layer with the same kernel size and stride as the convolution layer in the encoder. The symbolic speech decoder has 3 FC layers with hidden size 64.

The Q-network was an MLP with three FC layers (hidden size 200). During training, the expert/model mixing hyperparameter was initialized to 1.0 with a decay factor of 0.98, while ε was initialized to 0.5 with a decay factor of 0.9.

The Bc and Sqil models are also similar to those for the Gridworld Driving domain, except for the observation input networks. We used the same networks as the Mirror model, but the concatenated features are fed into a GRU instead of FC layers.

Bomb Defusal. The distributions and model setups were very similar to the above domains; there were only minor differences in the convolutional layers (stride size 2) and FC layers (64 dimensions). The Q-network was a larger MLP with 3 FC layers (2048 neurons each). During training, the mixing hyperparameter was initialized to 1.0 with a decay factor of 0.98, while ε was initialized to 0.5 with a decay factor of 0.9. Likewise, the Bc and Sqil models were similar to those in the other domains above, but the networks were larger (three layers as before, but with 2048 neurons in each layer).

CARLA. The Mirror model used higher-capacity representations/networks; both the transition and variational distributions were 128-dimensional Gaussian distributions with diagonal covariance. Similarly, the observation, reward, and policy distributions were Gaussian distributions with diagonal covariance. These distributions and the Q-function were modeled using neural networks with 3 fully-connected layers of 128 neurons each. During training, the mixing hyperparameter was initialized to 1.0 with a decay factor of 0.995, while ε was initialized to 0.5 with a decay factor of 0.9.

For Bc, the policy distribution is a Gaussian distribution with a diagonal covariance matrix. The policy network is modeled as a GRU with hidden size 128 followed by 3 FC layers with hidden size 128. The GRU takes the observation and the hidden state from the previous time step as input and outputs a new hidden state, which is fed into the 3 FC layers to produce the mean and variance of the Gaussian policy distribution.

VII-E Task Reward

In the following, we describe the task reward functions used to train our agents and to model the simulated humans.

Gridworld Driving. If another car comes within a threshold distance of the ego car, the agent receives a penalty. Colliding with other cars or going off the road incurs a further penalty, as does changing lanes.

Search and Rescue. During each episode, the human agent receives a reward for finding the victim and another reward for subsequently returning to the entrance. Colliding with obstacles incurs a penalty, and each time step incurs a time penalty.

Bomb Defusal. During each stage, the human agent receives a reward for pressing the correct button and a penalty for pressing a wrong one. Each time step incurs a cost.
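The bomb-defusal reward structure can be sketched as a simple per-step function; the specific magnitudes below are illustrative placeholders, since the text does not list them:

```python
def bomb_defusal_reward(pressed, correct, step_cost=0.1,
                        press_reward=1.0, wrong_penalty=1.0):
    """Per-step reward for a bomb-defusal stage.

    Pressing the correct button yields a reward, pressing a wrong one
    a penalty, and every step incurs a small time cost. `pressed` is
    None while the agent is still searching for the terminal.
    """
    r = -step_cost                         # time cost accrues every step
    if pressed is not None:
        r += press_reward if pressed == correct else -wrong_penalty
    return r
```

Under these placeholder values, a searching step costs −0.1, a correct press nets +0.9, and a wrong press nets −1.1, so the time cost alone already discourages indefinite searching.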

CARLA. The reward function comprises several components:

  • Speed reward, which encourages the agent to drive as fast as possible without exceeding a speed limit (set to 40 km/h in the experiments).

  • Braking and steering penalties, to promote smooth driving.

  • Lane change reward, which provides a negative reward of -1 whenever the agent changes lanes, and lane center reward, which penalizes off-center driving using a normalized distance from the lane center.

  • Proximity reward, which is separated into front and back proximity (penalty of -2 whenever a car is detected within 20 meters by the front- and back-facing LIDAR beams) and immediate surrounding proximity (penalty of -4 whenever a car is detected within 1.6 meters by any LIDAR beam).

  • Road shoulder penalty of -4 whenever the ego car goes onto the road shoulder.
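Putting the components above together, a per-step reward might look like the following sketch. The fixed penalties (-1 lane change, -2/-4 proximity, -4 shoulder) follow the text, while the speed and lane-centering shapes are our assumptions since only their intent is described:

```python
def carla_reward(speed, speed_limit, changed_lane, center_offset,
                 front_back_dist, nearest_dist, on_shoulder):
    """Combine the CARLA reward components into one per-step reward.

    `center_offset` is the normalized distance from the lane center;
    `front_back_dist` / `nearest_dist` are the closest LIDAR-detected
    car distances (None if no car detected).
    """
    r = 0.0
    r += min(speed, speed_limit) / speed_limit   # speed reward in [0, 1]
    r -= abs(center_offset)                      # lane-centering penalty
    if changed_lane:
        r -= 1.0                                 # lane change penalty
    if front_back_dist is not None and front_back_dist < 20.0:
        r -= 2.0                                 # front/back proximity
    if nearest_dist is not None and nearest_dist < 1.6:
        r -= 4.0                                 # immediate surroundings
    if on_shoulder:
        r -= 4.0                                 # road shoulder penalty
    return r
```

For instance, driving centered at the 40 km/h limit with no nearby cars yields the maximum +1.0, while changing lanes with a car 15 m ahead drops the step reward to -2.0.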

Subjective Measures

After each Method (Including NC):
  Cognitive Load (7-point Likert Scale)
    - NASA Task Load Index (NASA-TLX) [17]

After each Method (Excluding NC):
  What, When, How of Human-Robot Communication (7-point Likert Scale)
    - The assistive driving agent’s communication was helpful in accomplishing the task.
    - The assistive driving agent’s communication was redundant.
    - The assistive driving agent’s communication was timely.
    - I feel comfortable with the mode of communication (visual and/or speech) selected by the assistive driving agent at different scenarios.
  Human-Robot Trust (7-point Likert Scale)
    - I trust the assistive driving agent to provide useful communication on this task.

After Interaction:
  Perceived Relative Method Preferences (Short Distance vs Long Distance)
    - Which agent were you the most comfortable with?
    - Imagine that you have to drive for long distances of at least one hour under the same weather conditions and traffic, which agent would you be the most comfortable with?
TABLE III: Subjective measures in the human experiment.
Fig. 13: Subjective Measures in the CARLA Driving Experiments across conditions. Error bars indicate one standard error. (A) Human-Robot Communication and Trust; (B) Individual NASA-TLX ratings; (C) Raw-TLX score; (D) Counts of which agent was preferred for short and long distance driving. Overall, Mirror was perceived more positively than Bc. Please see the main text for additional details.

VII-F Human-Subject Experiments

To supplement the main results in the paper, this section provides the specific survey questions (Table III) and an analysis of the cognitive workload experienced by the participants.

We focus on the responses collected using the NASA-TLX [17]. Fig. 13.B shows the user ratings for the individual subscales, which support the notion that Mirror lessened cognitive load compared to Bc. Differences between the communication agents were statistically significant across the subscales except for Physical Demand and Temporal Demand; this was likely because the task (driving along a relatively straight highway) was not physically or temporally demanding in nature. Pairwise t-tests (with adjusted p-values) indicate the differences between Mirror vs. Bc and Mirror-Kl vs. Bc to be statistically significant, except for Performance (Mirror-Kl vs. Bc).

The overall Raw-TLX scores (Fig. 13.C) show that the participants felt less mentally burdened when they interacted with Mirror compared to Bc (for both the Mirror vs. Bc and Mirror-Kl vs. Bc comparisons). Nc was always run first (to collect data for training the models), so we cannot completely rule out ordering effects. That said, participants only started the task after sufficient practice, and the differences in scores between the Nc and Mirror conditions are large, which suggests that Mirror is effective at reducing cognitive load via assistive communication.
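The pairwise comparisons above rely on paired t-tests over matched per-participant scores; a minimal sketch of the statistic (function name and return convention ours):

```python
import math

def paired_t(x, y):
    """Paired t statistic for two matched samples, e.g. one participant's
    Raw-TLX scores under two communication conditions.

    Returns (t, degrees_of_freedom); a p-value would then come from the
    t distribution with n - 1 degrees of freedom.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance of differences
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

In practice one would apply a multiple-comparison adjustment (e.g. Bonferroni) to the resulting p-values, as done for the pairwise tests reported above.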