Consider a scenario where a human operator must manage several autonomous search-and-rescue agents that can move through, observe, and modify their respective environments, which are sites of recent disasters. The agents have a highly important primary objective: to rescue trapped victims. Secondarily, they should keep the human operator informed about what is taking place, but should not sacrifice their primary objective just to transmit information. Rather than receiving a continuous stream of data such as a video feed from each agent, from which it would be hard to extract salient findings, the human may only want to receive important information, forcing the agents to make decisions about what information is worth giving. Naturally, the human will have preferences about what information is important to them: for instance, they would want to be notified when an agent encounters a victim, but probably not every time it encounters a pile of rubble.
In this work, we address the algorithmic question of how an agent should plan out what actions to take in the world and what information to transmit. We treat this problem as a sequential decision task where on each timestep the agent can choose to transmit information, while also acting in the world. To capture the notion that the human has preferences, we model the human as an entity that scores the agent based on how interesting the transmitted information is to them. The agent’s primary objective is to act optimally in the world; secondarily, it should transmit score-maximizing information while acting. We formulate this setting as a decomposable belief Markov decision process (belief mdp) and give a tractable algorithm for solving it approximately in practice.
We model the human’s score function information-theoretically. First, we suppose that the human maintains a belief state, a probability distribution over the set of possible environment states; this belief gets updated based on information received from the agent. Next, we let the human’s score for a given piece of information be a function of the change in weighted entropy induced by the belief update. This weighting is a crucial aspect of our approach: it allows the human to describe, in a natural way, which aspects of the environment they want to be informed about.
We give an algorithm that allows the agent to learn the human’s preferences online, through exploration. In this setting, online learning is very important: the agent must explore in order to discover the human’s preferences, by giving them a variety of information. We validate our approach experimentally in simulated discrete and continuous partially observed search-and-recover domains, and find that our belief mdp framework and corresponding planning and learning algorithms are effective in practice. Visit http://tinyurl.com/chitnis-corl-18 for a supplementary video.
2 Related Work
The problem setting we consider, in which an agent must act optimally in its environment while secondarily giving information that optimizes a human’s score function, is novel but has connections to several related problems in human-robot interaction. Our work is unique in using weighted entropy to capture the human’s preferences over which aspects of the environment are important.
Information-theoretic perspective on belief updates. The idea of taking actions that lower the entropy of a belief state has been studied in robotics for decades. Originally, it was applied to navigation [1] and localization [2]. More recently, it has also been used in human-robot interaction settings [3, 4]: the robot asks the human clarifying questions about its environment to lower the entropy of its own belief, which helps it plan more safely and robustly. By contrast, in our method the robot is concerned with estimating the entropy of the human’s belief, as in work by Roy et al. [5].
Estimating the human’s mental state. Having a robot make decisions based on its current estimate of the human’s mental state has been studied in human-robot collaborative settings [6, 7, 8]. The robot first estimates the human’s belief about the world state and goal, then uses this information to build a human-aware policy for the collaborative task. This strategy allows the robot to exhibit desirable behavior, such as signaling its intentions in order to avoid surprising the human.
Modeling user preferences with active learning. The idea of using active learning to understand user preferences has received significant attention [9, 10, 11]. Typically in these methods, the agent gathers information from the user through some channel, estimates a reward function from this information, and acts based on this estimated reward. Our method for learning the human’s preferences online works similarly, but we assume that the reward has an information-theoretic structure.
3.1 Partially Observable Markov Decision Processes and Belief States
Our work considers agent-environment interaction in the presence of uncertainty, which is often formalized as a partially observable Markov decision process (pomdp) [12]. An undiscounted pomdp is a tuple $\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R \rangle$: $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $\Omega$ is the observation space; $T(s, a, s') = P(s' \mid s, a)$ is the transition distribution with $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$; $O(s', a, o) = P(o \mid s', a)$ is the observation model with $s' \in \mathcal{S}$, $a \in \mathcal{A}$, $o \in \Omega$; and $R(s, a, s')$ is the reward function with $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. Some states in $\mathcal{S}$ are said to be terminal, ending the episode and generating no further reward. The agent’s objective is to maximize its overall expected reward, $\mathbb{E}\left[\sum_t R(s_t, a_t, s_{t+1})\right]$. The optimal solution to a pomdp is a policy that maps the history of observations and actions to the next action to take, such that this objective is optimized. Exact solutions for interesting pomdps are typically infeasible to compute, but some popular approximate approaches are online planning [13, 14, 15] and finding a policy offline with a point-based solver [16, 17].
The sequence of states is unobserved, so the agent must instead maintain a belief state, a probability distribution over the space of possible states. This belief is updated on each timestep, based on the received observation and taken action. Unfortunately, representing this distribution exactly is prohibitively expensive for even moderately-sized pomdps. One typical alternative representation is a factored one, in which we assume the state can be decomposed into variables (features), each with a value; the factored belief then maps each variable to a distribution over possible values.
A Markov decision process (mdp) is a simplification of a pomdp where the states are fully observed by the agent, so the observation space $\Omega$ and observation model $O$ are not needed. The optimal solution to an mdp is a policy that maps the state to the next action to take, such that the same objective as before is optimized.
Every pomdp induces an mdp on the belief space, known as a belief mdp, where: is the space of beliefs over ; ; and . See Kaelbling et al.  for details.
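As a rough illustration of the belief update that underlies a belief mdp, the following sketch implements one step of a discrete Bayes filter with NumPy. The function name `belief_update` and the array layouts `T[a, s, s2]` and `O[a, s2, o]` are assumptions made for this example, not notation taken from the paper.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One step of a discrete Bayes filter (illustrative sketch).

    belief: array of shape (S,), current distribution over states.
    T:      array of shape (A, S, S), with T[a, s, s2] = P(s2 | s, a)  (assumed layout).
    O:      array of shape (A, S, Z), with O[a, s2, o] = P(o | s2, a)  (assumed layout).
    """
    # Prediction step: push the belief through the transition model.
    predicted = belief @ T[action]
    # Correction step: weight each state by the observation likelihood.
    unnormalized = predicted * O[action][:, observation]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Toy problem: two states, one action that leaves the state unchanged,
# and a noisy binary observation.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.array([[[0.9, 0.1], [0.2, 0.8]]])
posterior = belief_update(np.array([0.5, 0.5]), action=0, observation=0, T=T, O=O)
print(posterior)
```

The returned array is itself a state of the belief mdp, so repeated calls trace out a trajectory in belief space.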
3.2 Weighted Entropy and Weighted Information Gain
Weighted entropy is a generalization of Shannon entropy that was first presented and analyzed by Guiaşu [18]. The Shannon entropy of a discrete probability distribution $p$, given by $H(p) = -\sum_i p_i \log p_i$, is a measure of the expected amount of information carried by samples from the distribution, and can also be viewed as a measure of the distribution’s uncertainty. Note that trying to replace the summation with integration for continuous distributions would not be valid, because the interpretation of entropy as a measure of uncertainty gets lost; e.g., the integral can be negative. The information gain in going from a distribution $p$ to another $p'$ is $H(p) - H(p')$.
The weighted entropy of a discrete probability distribution $p$ is given by $H_w(p) = -\sum_i w_i p_i \log p_i$, where all $w_i \geq 0$. The weighted information gain in going from a distribution $p$ to another $p'$ is $H_w(p) - H_w(p')$.
Weighted entropy allows certain values of the distribution to heuristically have more impact on the uncertainty, but cannot be interpreted as the expected amount of information carried by samples.
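To make these definitions concrete, here is a minimal sketch of weighted entropy and weighted information gain for discrete distributions given as Python lists. The function names are chosen for this example only, and zero-probability entries are taken to contribute nothing to the sum (the usual convention that $0 \log 0 = 0$).

```python
import math

def weighted_entropy(p, w):
    """Weighted entropy: -sum_i w_i * p_i * log(p_i), skipping zero-probability entries."""
    return -sum(wi * pi * math.log(pi) for pi, wi in zip(p, w) if pi > 0)

def weighted_information_gain(p_old, p_new, w):
    """Drop in weighted entropy when moving from p_old to p_new."""
    return weighted_entropy(p_old, w) - weighted_entropy(p_new, w)

# With all weights equal to 1, weighted entropy reduces to Shannon entropy.
print(weighted_entropy([0.5, 0.5], [1.0, 1.0]))            # log(2)
# Weight only on the first value: telling the other two apart carries no weight.
print(weighted_entropy([0.0, 0.5, 0.5], [1.0, 0.0, 0.0]))
# No weight on the first value: all of the weighted uncertainty is between the other two.
print(weighted_entropy([0.0, 0.5, 0.5], [0.0, 1.0, 1.0]))  # log(2)
```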
Intuition. Figure 1 helps give intuition about weighted entropy by plotting it for the case of a distribution with three values. In the figure, we only let $p_1$ vary freely and set $p_2 = p_3 = (1 - p_1)/2$, so that the plot can be visualized in two dimensions. When only one value is possible ($p_1 = 1$), the entropy is always 0 regardless of the setting of weights, but as $p_1$ approaches 1 from the left, the entropy drops off more quickly the higher $w_1$ is (relative to $w_2$ and $w_3$). If all weight is placed on $p_1$ (the orange curve), then when $p_1 = 0$ the entropy also goes to 0, because the setting of weights conveys that distinguishing between $p_2$ and $p_3$ gives no information. However, if no weight is placed on $p_1$ (the green curve), then when $p_1 = 0$ we have $p_2 = p_3 = 0.5$, and the entropy is high because the setting of weights conveys that all of the information lies in telling $p_2$ and $p_3$ apart.
4 Problem Setting
We formulate our problem setting as a belief mdp (Section 3.1) from the agent’s perspective, then give an algorithm for solving it approximately. At each timestep, the agent takes an action in the environment and chooses a piece of information (or null if it chooses not to give any) to transmit, along with the marginal probability of that information under the agent’s current belief. See Figure 2.
Our presentation of the formulation will assume that the agent knows 1) the human’s initial belief, 2) the model for how the human updates their belief, and 3) that only information from the agent can induce belief updates; this assumption effectively renders the human’s belief state fully observed by the agent. We can easily relax this assumption: suppose the agent were allowed to query only some aspects of the human’s belief; then, it could incorporate the remainder into its own belief as part of the latent state. We will not complicate our presentation by describing this setting explicitly.
4.1 Belief mdp Formulation
Let the agent-environment interaction be modeled as a pomdp , where is continuous or discrete. This induces a belief mdp , where is the space of beliefs over . The agent maintains a belief state , updated with a Bayes filter .
The human maintains their own belief state over environment states, updated based only on information transmitted by the agent, and gives the agent a real-valued score on each timestep for this information. We model the human as a tuple : is a set of fluents (Boolean atoms that may or may not hold in the state) that defines the space of information the agent can transmit; is the human’s forward model of the world with ; is the human’s initial belief; and is the human’s score function with . The allows the human to model the degradation of information over time; we use a simple that is almost the identity function, but gives probability to non-identity transitions.
At each timestep, the agent selects information to give and transmits it along with the marginal probability of under , defined as . We update the belief according to Jeffrey’s rule , which is based on the principle of probability kinematics for minimizing the change in belief. First, we define . Then the full belief update, , is if holds in and if does not hold in , . The summations can be replaced with integration if is continuous.
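The core of Jeffrey's rule can be sketched as follows for a discrete belief stored as a dictionary. This is a simplified illustration under stated assumptions: it applies only the probability-kinematics step and ignores the human's forward model, and the names `jeffrey_update`, `holds`, and `p` are hypothetical.

```python
def jeffrey_update(belief, holds, p):
    """Apply Jeffrey's rule to a discrete belief (illustrative sketch).

    belief: dict mapping each state to its current probability.
    holds:  function returning True if the transmitted fluent holds in a state.
    p:      probability of the fluent reported by the agent.
    """
    # q is the probability the current belief assigns to the fluent.
    q = sum(prob for state, prob in belief.items() if holds(state))
    if (p > 0.0 and q == 0.0) or (p < 1.0 and q == 1.0):
        raise ValueError("reported probability is incompatible with the prior belief")
    updated = {}
    for state, prob in belief.items():
        if holds(state):
            # Rescale states where the fluent holds so that they sum to p.
            updated[state] = p * prob / q if q > 0.0 else 0.0
        else:
            # Rescale the remaining states so that they sum to 1 - p.
            updated[state] = (1.0 - p) * prob / (1.0 - q) if q < 1.0 else 0.0
    return updated

prior = {"a": 0.25, "b": 0.25, "c": 0.5}
posterior = jeffrey_update(prior, holds=lambda s: s in ("a", "b"), p=0.8)
print(posterior)
```

Relative probabilities among states that agree on the fluent are preserved, which is the sense in which the rule minimizes the change in belief.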
Objective. We define the agent’s objective as follows: to act optimally in the environment (maximizing the expected sum of environment rewards) and, subject to acting optimally, to give information such that the expected sum of the human’s scores over the trajectory is maximized.
The full belief mdp for this setting (from the agent’s perspective) is a tuple :
. A state is a pair of the agent’s belief and the human’s belief .
. An action is a pair of environment action and information .
if satisfies the update equation, else 0.
is a pair with the comparison operation ; similarly for .
The following algorithm for solving by decomposition will help us give an approximation next.
Algorithm 1 returns an optimal solution for .
Proof. Note that a policy for maps pairs to pairs , with and . We have . Define , the first entry in the pair. Due to the comparison operation we defined on , we can write , and if there are multiple such , pick the one that also maximizes . The decomposition strategy exactly achieves this, by leveraging the fact that the human cannot affect the environment. ∎
4.2 Approximation Algorithm
The full belief mdp can be hard to solve optimally even using the decomposition strategy of Algorithm 1. A key challenge is that the problem branches due to uncertainty about observations and transitions, so searching for the optimal policy becomes computationally infeasible. Instead, we apply the determinize-and-replan strategy [21, 22, 23], which is not optimal but often works well in practice. We determinize the problem using a maximum likelihood assumption, then use Algorithm 1. This procedure is repeated any time the determinization is found to have been violated. See Algorithm 2 for full pseudocode.
Line 3 generates the trajectory of the agent’s beliefs under the determinized model, which works because the determinization does not contain branches. Line 8 constructs a directed acyclic graph (dag) whose states are tuples of (human belief, timestep). An edge exists between two nodes at consecutive timesteps iff some piece of information causes the corresponding update of the human’s belief under the determinized model. The edge weight is the human’s score for that information. Note that all paths through the dag have the same number of steps, and because the edge weights are the human’s scores, the longest weighted path through the dag is precisely the information-giving plan that maximizes the total score over the trajectory. Our implementation does not build the full dag; we prune the search using domain-specific heuristics.
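The longest-path computation over a layered graph of (human belief, timestep) nodes can be sketched with dynamic programming, one timestep at a time. This is an illustrative sketch rather than the paper's Algorithm 2: the function `best_information_plan` and the `successors` callback are hypothetical, beliefs are assumed to be hashable, and no pruning heuristics are applied.

```python
def best_information_plan(initial_belief, horizon, successors):
    """Longest weighted path through a layered dag (illustrative sketch).

    successors(belief, t) must yield (info, next_belief, score) triples and is
    assumed to always offer at least one option (for example, null information).
    Returns (total score, list of information given at each timestep).
    """
    # layer maps each reachable belief at the current timestep to the best
    # (total score, plan) found for reaching it.
    layer = {initial_belief: (0.0, [])}
    for t in range(horizon):
        next_layer = {}
        for belief, (total, plan) in layer.items():
            for info, next_belief, score in successors(belief, t):
                candidate = (total + score, plan + [info])
                if next_belief not in next_layer or candidate[0] > next_layer[next_belief][0]:
                    next_layer[next_belief] = candidate
        layer = next_layer
    return max(layer.values(), key=lambda entry: entry[0])

# Toy example: the "belief" is just the number of facts the human does not yet
# know, and telling a fact later happens to score more (hypothetical scores).
def toy_successors(unknown, t):
    yield ("null", unknown, 0.0)
    if unknown > 0:
        yield (f"tell{t}", unknown - 1, float(t + 1))

total, plan = best_information_plan(2, 3, toy_successors)
print(total, plan)
```

Because every path has the same number of edges, processing the graph layer by layer visits each edge once, so the cost is linear in the number of edges that survive pruning.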
5 Learning an Information-Theoretic Score Function
In this section, we first model the human’s score function information-theoretically using the notion of weighted entropy. Then, we give an algorithm by which the agent can learn this score function online.
5.1 Score Function Model
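We model the score function as a function of the weighted information gain (Section 3.2) of the belief update induced by information. Writing $b$ and $b'$ for the human’s belief before and after the update, the score is

$$f\left(-\sum_i w_i\, b_i \log b_i + \sum_i w_i\, b'_i \log b'_i\right),$$

where the $w_i$ are a set of weights. The human chooses both $f$ and the weights $w$ to suit their preferences.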
Assumptions. This model introduces two assumptions. 1) The human’s belief, which is ideally over the environment state space, must be over a discrete space in order for its entropy to be well-defined. If the state space is continuous, the human can make any discrete abstraction of it and maintain their belief over this abstraction instead. Note that the agent must know this discrete abstraction. 2) If the belief is factored (Section 3.1), we calculate the total entropy by summing the entropy of each factored distribution. This is an upper bound that assumes independence among the factors.
Motivation. Assuming structure in the form of the score function makes it easier for the agent to learn the human’s preferences; the notion of weighted entropy is a compelling choice. The human’s belief state captures their perceived likelihood of each possible environment state (or value of each factor in the state). Each term in the entropy formula corresponds to an environment state or value of a factor, so the weights $w_i$ encode the human’s preferences about which states or values of factors are important.
Interpretation of $f$. Different choices of $f$ allow the human to exhibit various preferences. Choosing $f$ to be the identity function means that the human wants the agent to act greedily, transmitting the highest-scoring piece of information at each timestep. The human may instead prefer for $f$ to impose a threshold: if the gain is smaller than the threshold, then $f$ could return a negative score to penalize the agent for not being sufficiently informative. A sublinear $f$ rewards the agent for splitting up information into subparts and transmitting it over multiple timesteps, while a superlinear $f$ rewards the agent for withholding partial information in favor of occasionally transmitted, more complete information.
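The effect of sublinear versus superlinear score functions can be checked numerically. The sketch below compares the total score for transmitting two pieces of information separately against transmitting them together, under the simplifying assumption that their weighted information gains add. The helper names, the gains of 3.0, and the penalty of -1.0 are hypothetical values chosen for illustration.

```python
import math

def with_threshold(f, threshold, penalty):
    """Wrap f so that gains below the threshold receive a fixed penalty."""
    def score(gain):
        return f(gain) if gain >= threshold else penalty
    return score

def split_vs_whole(f, gain_a, gain_b):
    """Total score for two separate transmissions vs. one combined transmission."""
    return f(gain_a) + f(gain_b), f(gain_a + gain_b)

identity = lambda g: g
square = lambda g: g * g   # superlinear
log = math.log             # sublinear

for name, f in [("identity", identity), ("square", square), ("log", log)]:
    split, whole = split_vs_whole(f, 3.0, 3.0)
    print(f"{name}: split={split:.3f}, whole={whole:.3f}")

# A thresholded score penalizes uninformative transmissions (hypothetical penalty).
thresholded = with_threshold(identity, threshold=1.0, penalty=-1.0)
print(thresholded(0.5), thresholded(2.0))
```

With these numbers the identity is indifferent between the two strategies, the square favors the combined transmission, and the logarithm favors splitting.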
5.2 Learning Preferences Online
We now give Algorithm 3, which allows the agent to learn $f$ and $w$ online through exploration. This algorithm works for both single-episode lifelong learning problems where no states are terminal and short-horizon problems where the agent must learn over several episodes. In Line 7, the agent explores the human’s preferences using an $\epsilon$-greedy policy that gives a random piece of information with probability $\epsilon$ and otherwise follows the policy that solves the belief mdp under the current estimates of $f$ and $w$.
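The exploration step can be sketched as a standard epsilon-greedy choice with an exponentially decaying exploration rate. This is a minimal illustration, not the paper's Algorithm 3: the function names and the decay factor of 0.7 are hypothetical, and `greedy_choice` stands in for whatever the planner recommends under the current learned score model.

```python
import random

def epsilon_schedule(episode, start=1.0, decay=0.7):
    """Exponentially decaying exploration rate (decay factor is a hypothetical value)."""
    return start * decay ** episode

def choose_information(candidates, greedy_choice, epsilon, rng=random):
    """With probability epsilon give a random piece of information, else the planner's choice."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return greedy_choice

rng = random.Random(0)
candidates = ["At(gem, cell3)", "NotAt(gem, cell1)", None]  # None stands for null information
for episode in range(3):
    eps = epsilon_schedule(episode)
    info = choose_information(candidates, greedy_choice=candidates[0], epsilon=eps, rng=rng)
    print(episode, round(eps, 3), info)
```

Resetting the episode counter passed to `epsilon_schedule` is one simple way to restart exploration if the human's preferences change.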
If the human’s preferences ($f$ or $w$) ever change, we can reset $\epsilon$ to an appropriate value and continue running the algorithm, so the agent can explore information that the human now finds interesting. Additionally, with some small modifications we can make the score depend on the last few timesteps of transmitted information: we need only augment states in our belief mdp with this history so it can be used to calculate the score, and include this history in the dataset used for learning in Algorithm 3.
We show results for three settings of the function $f$: identity, square, and natural logarithm. All three use a threshold of 1: if the weighted information gain is less than 1, then $f$ returns a negative score, penalizing the agent. (This threshold is arbitrary, as the weights can always be rescaled to accommodate any threshold.) If the information is null, then the human returns a fixed score that causes the agent to slightly prefer giving no information rather than information that induces no change in the human’s belief. We use the same weights for each factor in the belief, though this simplification is not required.
We implemented $f$ and $w$ as a single fully connected network, with hidden layer sizes [100, 50], that outputs the predicted score. The model takes as input a vector of the change, between and , in for each entry in the belief. We used a gradient descent optimizer with learning rate , regularization scale , sigmoid activations, batch size 100, and $\epsilon$ exponentially decaying from 1 to roughly over the first 20 episodes.
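One plausible way to build such an input vector is sketched below. It assumes, as a guess not confirmed by the text above, that the per-entry quantity is the entropy term $-p \log p$ and that the change is taken between the human's belief before and after the update. The function names are hypothetical.

```python
import numpy as np

def entropy_terms(belief):
    """Per-entry entropy terms -p * log(p), with zero-probability entries mapped to 0."""
    b = np.asarray(belief, dtype=float)
    safe = np.where(b > 0, b, 1.0)  # log(1) = 0, so zero entries contribute nothing
    return -b * np.log(safe)

def score_model_features(belief_before, belief_after):
    """Change in each entry's entropy term across a belief update (assumed input format)."""
    return entropy_terms(belief_before) - entropy_terms(belief_after)

features = score_model_features([0.5, 0.5], [1.0, 0.0])
print(features)
# Under this assumption, a dot product with the weights gives the weighted information gain.
print(features @ np.array([1.0, 1.0]))
```

With this representation, the first layer of a network can in principle recover the weights, and the later layers the shape of the score function.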
We experiment with simulated discrete and continuous partially observed search-and-recover domains, where the agent must find and recover objects in the environment while transmitting information about these objects based on the human’s preferences. Although the pomdps we consider are simple, the aim of our experiments is to understand and analyze the nature of the transmitted information, not to require the agent to plan out long sequences of actions in the environment.
| Experiment | Score from Human | # Info / Timestep | Alg. 2 Runtime (sec) |
|---|---|---|---|
| N=4, M=1, f=id | 375 | 0.34 | 6.2 |
| N=4, M=5, f=id | 715 | 0.25 | 6.7 |
| N=6, M=5, f=id | 919 | 0.24 | 24.1 |
| N=4, M=1, f=sq | 13274 | 0.25 | 4.7 |
| N=4, M=5, f=sq | 33222 | 0.2 | 6.7 |
| N=6, M=5, f=sq | 41575 | 0.19 | 23.6 |
| N=4, M=1, f=log | 68 | 0.39 | 5.6 |
| N=4, M=5, f=log | 91 | 0.32 | 5.7 |
| N=6, M=5, f=log | 142 | 0.3 | 23.8 |

| Experiment | Score from Human | # Info / Timestep | Alg. 2 Runtime (sec) |
|---|---|---|---|
| N=5, M=5, f=id | 362 | 0.89 | 0.4 |
| N=5, M=10, f=id | 724 | 1.12 | 2.0 |
| N=10, M=10, f=id | 806 | 1.56 | 48.4 |
| N=5, M=5, f=sq | 37982 | 0.52 | 0.4 |
| N=5, M=10, f=sq | 99894 | 0.67 | 1.8 |
| N=10, M=10, f=sq | 109207 | 0.71 | 39.7 |
| N=5, M=5, f=log | 19 | 1.05 | 0.4 |
| N=5, M=10, f=log | 31 | 1.39 | 1.8 |
| N=10, M=10, f=log | 39 | 1.7 | 42.7 |
6.1 Domain 1: Search-and-Recover 2D Gridworld Task
Our first domain is a 2D gridworld in which locations form a discrete grid, objects are scattered across the environment, and the agent must find and recover all objects. Each object is of a particular type; the world of object types is known, but all types need not be present in the environment. The actions that the agent can perform on each timestep are as follows: Move by one square in a cardinal direction, with reward -1; Detect whether an object of a given type is present at the current location, with reward -5; and Recover the given object type at the current location, which succeeds with reward -20 if an object of that type is there, otherwise fails with reward -100.
An episode terminates when all objects have been recovered. To initialize an episode, we randomly assign each object a type and a unique grid location. The factored belief representation for both the agent and the human maps each grid location to a distribution over what object type (or nothing) is located there, initialized uniformly. This choice of representation implies that each $w_i$ in the human’s weights represents their interest in receiving information about object type $i$; for example, the human may prioritize information regarding valuable objects. The space of information that the agent can select from is: At($o$, $l$) for every object type $o$ and location $l$; NotAt($o$, $l$) for every object type $o$ and location $l$; and null (no information). Our experiments vary the grid size $N$, the number of objects $M$, the human’s choice of weights $w$, and the human’s choice of $f$. Table 1, Figure 3, and Figure 4 show and discuss our results.
Figure: The human’s preferences are made time-varying by changing the weights at the training epochs shown by the dotted lines. The agent learns to give good information after an exploratory period following each change in the human’s preferences. Note: Learning curves are averaged over 5 independent trials, with standard deviations shaded in green.
6.2 Domain 2: Search-and-Recover 3D Continuous Task
Our second domain is a more realistic 3D robotic environment implemented in pybullet [25]. There are $M$ objects in the world with continuous-valued positions, scattered across $N$ “zones” which partition the position space, and the agent must find and recover all objects. The actions that the agent can perform on each timestep are as follows: Move to a given pose, with reward -1; Detect all objects within a cone of visibility in front of the current pose, with reward -5; and Recover the closest object within a cone of reachability in front of the current pose, which succeeds with reward -20 if such an object exists, otherwise fails with reward -100.
An episode terminates when all objects have been recovered. To initialize an episode, we place each object at a random collision-free position. The factored belief representation for the agent maps each known object to a distribution over its position, whereas the one for the human (which must be over a discrete space per our assumptions) maps each known object to a distribution over which of the $N$ zones it could be in; both are initialized uniformly. This choice of representation implies that each $w_i$ in the human’s weights represents their interest in receiving information about zone $i$; for example, the zones could represent sections of the ocean floor or rooms within a building on fire. The space of information that the agent can select from is: In($o$, $z$) for every object $o$ and zone $z$; NotIn($o$, $z$) for every object $o$ and zone $z$; and null (no information). Our experiments vary the number of zones $N$, the number of objects $M$, the human’s choice of weights $w$, and the human’s choice of $f$. Table 1 and Figure 5 show and discuss our results.
7 Conclusion and Future Work
We have formulated a problem setting in which an agent must act optimally in a partially observed environment while learning to transmit information to a human teammate, based on their preferences. We modeled the human’s score as a function of the weighted information gain of their belief.
One direction for future work is to experiment with settings where the human has preferences over information about the different factors. Such preferences could be realized by having different scales of weights across factors, or by calculating the weighted entropy as a weighted sum across factors according to some other weights (rather than an unweighted sum as in this work), possibly learned. Another future direction is to have the agent learn to generate good candidates for information to transmit, rather than naively consider all available options at each timestep.
We gratefully acknowledge support from NSF grants 1420316, 1523767, and 1723381; from AFOSR grant FA9550-17-1-0165; from Honda Research; and from Draper Laboratory. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
- Cassandra et al.  A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien. Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Intelligent Robots and Systems’ 96, IROS 96, Proceedings of the 1996 IEEE/RSJ International Conference on, volume 2, pages 963–972. IEEE, 1996.
- Burgard et al.  W. Burgard, D. Fox, and S. Thrun. Active mobile robot localization by entropy minimization. In Advanced Mobile Robots, 1997. Proceedings., Second EUROMICRO workshop on, pages 155–162. IEEE, 1997.
- Deits et al.  R. Deits, S. Tellex, P. Thaker, D. Simeonov, T. Kollar, and N. Roy. Clarifying commands with information-theoretic human-robot dialog. Journal of Human-Robot Interaction, 2(2):58–79, 2013.
- Tellex et al.  S. Tellex, P. Thaker, R. Deits, D. Simeonov, T. Kollar, and N. Roy. Toward information theoretic human-robot dialog. Robotics, page 409, 2013.
- Roy et al.  N. Roy, J. Pineau, and S. Thrun. Spoken dialogue management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 93–100. Association for Computational Linguistics, 2000.
- Devin and Alami  S. Devin and R. Alami. An implemented theory of mind to improve human-robot shared plans execution. In Human-Robot Interaction (HRI), 2016 11th ACM/IEEE International Conference on, pages 319–326. IEEE, 2016.
- Lemaignan et al.  S. Lemaignan, M. Warnier, E. A. Sisbot, A. Clodic, and R. Alami. Artificial cognition for social human–robot interaction: An implementation. Artificial Intelligence, 247:45–69, 2017.
- Trafton et al.  G. Trafton, L. Hiatt, A. Harrison, F. Tamborello, S. Khemlani, and A. Schultz. Act-r/e: An embodied cognitive architecture for human-robot interaction. Journal of Human-Robot Interaction, 2(1):30–55, 2013.
- Racca and Kyrki  M. Racca and V. Kyrki. Active robot learning for temporal task models. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pages 123–131. ACM, 2018.
- Sadigh et al.  D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
- Boutilier  C. Boutilier. A POMDP formulation of preference elicitation problems. In AAAI/IAAI, pages 239–246, 2002.
- Kaelbling et al.  L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101:99–134, 1998.
- Silver and Veness  D. Silver and J. Veness. Monte-carlo planning in large POMDPs. In Advances in neural information processing systems, pages 2164–2172, 2010.
- Somani et al.  A. Somani, N. Ye, D. Hsu, and W. S. Lee. DESPOT: Online POMDP planning with regularization. In Advances in neural information processing systems, pages 1772–1780, 2013.
- Bonet and Geffner  B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space. In Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, pages 52–61, 2000.
- Kurniawati et al.  H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and systems, volume 2008. Zurich, Switzerland., 2008.
- Pineau et al.  J. Pineau, G. Gordon, S. Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.
- Guiaşu  S. Guiaşu. Weighted entropy. Reports on Mathematical Physics, 2(3):165–179, 1971.
- Russell and Norvig  S. J. Russell and P. Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
- Jeffrey  R. C. Jeffrey. The logic of decision. University of Chicago Press, 1965.
- Platt Jr. et al.  R. Platt Jr., R. Tedrake, L. Kaelbling, and T. Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
- Hadfield-Menell et al.  D. Hadfield-Menell, E. Groshev, R. Chitnis, and P. Abbeel. Modular task and motion planning in belief space. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 4991–4998, 2015.
- Yoon et al.  S. W. Yoon, A. Fern, and R. Givan. FF-Replan: A baseline for probabilistic planning. In ICAPS, volume 7, pages 352–359, 2007.
- Abadi et al.  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- Coumans et al.  E. Coumans, Y. Bai, and J. Hsu. Pybullet physics engine. 2018. URL http://pybullet.org/.