Deploying autonomous agents such as robots equipped with an appropriate set of sensors allows automated execution of various information gathering tasks. The tasks can include monitoring and identification of spatio-temporal processes, automated exploration, or other data collection campaigns in environments where human presence is undesired or infeasible. Robots are mobile sensor platforms whose actions are optimized to maximize the informativeness of measurement data.
As the target state is not known, a probability density function (pdf) over the state, called a belief state, is maintained. Information conveyed by measurement data is incorporated into the belief state by Bayesian filtering. Assuming Markovian dynamics and conditional independence of measurement data given the system state, the problem is a partially observable Markov decision process, or POMDP.
Optimal information gathering has been studied in the context of sensor management , and a review of applying POMDPs for sensor management is presented in . The problem is formulated as a decision process under uncertainty. The goal is to find a control policy mapping belief states to actions, that when followed maximizes the expected sum of discounted rewards over a horizon of time. The reward associated with an action may depend either on the true state of the system or the belief state. The former can encode objectives such as reaching a favorable state or avoiding costly ones, useful e.g. for navigation and obstacle avoidance. The latter option allows information theoretic rewards, such as mutual information, applied in various sequential information gathering problems in robotics, see e.g. [4, 5, 6]. Indefinite-horizon problems that terminate when a special stopping action is executed are a natural model for tasks that may be stopped once a certain level of confidence about the state is reached .
track so-called alpha vectors at a set of points in the belief space. The alpha vectors may then be used to approximate the optimal policy at any belief state. Online planning methods find an optimal action for the current belief state instead of a representation of the optimal policy. The problem is cast as a search over the tree of belief states reachable from the current belief state under various action-observation histories. Combining online methods with Monte Carlo simulations to evaluate the utility of actions has led to approximate algorithms able to handle problems with up to states .
In mixed observability domains a part of the state space is fully observable. The belief space is a union of low-dimensional subspaces, one for each value of a fully observable state variable. Robotic systems often exhibit mixed observability which may be exploited to derive efficient POMDP algorithms .
A multi-armed bandit (MAB) is a model for sequential decision-making also applied in sensor management . A decision-maker plays one arm of the MAB and collects a reward depending on the state of the arm. The arm then randomly transitions to a new state while other arms remain stationary. Solutions to MABs are index policies that are easier to compute than solutions to general POMDPs .
Most of the aforementioned research applies reward functions that only depend on the true state and action. The expectation of the reward is linear in the belief state, a feature leveraged by many of the solution algorithms. Information theoretic quantities such as entropy and mutual information that would be useful as reward functions in optimal sensing problems are nonlinear in the belief state. Classical POMDP algorithms cannot be applied to solve such problems.
In this paper, we study mixed-observability POMDPs with mutual information as the reward function. Our approach is therefore especially suited for optimal sensing problems in robotics domains. We remove constraints on the available actions to obtain a relaxed problem. The optimal value of the relaxed problem is an upper bound on the optimal value of the POMDP. We identify the conditions under which the relaxed problem is a MAB and has an easily computable optimal solution. The upper bound is applied in an online planning algorithm to prune the search space.
The paper is organized as follows. In Section II, the mixed-observability POMDP is defined. In Section III, methods for solving the problem are discussed. In Section IV, two relaxations are derived that provide upper bounds for the optimal value function. Section V determines the conditions under which the relaxations are MABs. Empirical results are provided in Section VI. Section VII concludes the paper.
II A Mixed Observability POMDP
We denote random variables and sets by uppercase letters, and realizations of random variables and members of sets as lowercase letters. Time instants are distinguished by writing e.g.and for realizations at time and , respectively.
An agent, e.g. a robot or another sensor platform, has an internal state that captures the dynamics and constraints of operating on-board sensors and other devices. The internal state evolves according to a deterministic dynamics model , defined where is a control action in the finite set of actions allowed in internal state .
Let , , denote a set of random inference variables an agent wishes to obtain information about. The dynamics of the variables are governed by a stochastic model
, defined as a Markov chain. The complete state of the system is .
The problem features mixed observability, where the internal state is fully observable and the inference variables are partially observable. The agent’s observations follow an observation model , defined by .
The agent maintains a belief state , consisting of the deterministic, fully observable internal state and a pdf over . The initial belief state is given. Given a belief state , an action , and an observation , the belief state at the next time instant is given by the belief update equation where , and the pdf over the inference variables is obtained from a Bayesian filter
where is the predictive pdf and
is the normalization factor denoting the prior probability of observing. Given any sequence of actions and observations, there is no uncertainty about the resulting internal state . Thus we can equivalently define the set of allowed actions via the belief state as .
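The belief update over the inference variables can be illustrated for a finite state space. The following is a minimal sketch, not the paper's implementation: it assumes a single discrete inference variable, and the names `bayes_filter_step`, `transition`, and `likelihood` are hypothetical. In the paper's setting the transition and observation models would additionally depend on the action and the internal state.

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One predict/update step of a discrete Bayesian filter (illustrative).

    belief:     shape (n,), current pdf over the inference variable
    transition: shape (n, n), transition[i, j] = P(next = j | current = i)
    likelihood: shape (n,), likelihood[j] = P(observation | next = j)
    Returns the posterior pdf and the normalization factor.
    """
    belief = np.asarray(belief, dtype=float)
    transition = np.asarray(transition, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)

    predictive = belief @ transition        # predictive pdf
    unnormalized = likelihood * predictive  # numerator of Bayes' rule
    evidence = unnormalized.sum()           # prior probability of the observation
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the predictive pdf")
    return unnormalized / evidence, evidence
```

The returned normalization factor is the quantity that weights each observation branch when expectations over future observations are taken.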
The agent’s objective is encoded by a reward function . The objective is to maximize the expected sum of discounted rewards over a horizon of decisions. The discount factor is .
We consider belief-dependent reward functions. Let , i.e. the mutual information (MI) between the posterior state and observation. MI is defined as
where is the entropy of the predictive pdf and the second term is the expected entropy of the posterior pdf (1) under the prior pdf .
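The verbal definition above (entropy of the predictive pdf minus the expected posterior entropy) can be sketched for a discrete state and a discrete observation. This is an illustrative sketch under assumptions: the function names and the matrix layout `obs_model[j, z] = P(z | state j)` are hypothetical, and natural logarithms are used, so the result is in nats.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def mutual_information(predictive, obs_model):
    """MI between state and observation for a given predictive pdf.

    predictive: shape (n,), predictive pdf over the state
    obs_model:  shape (n, m), obs_model[j, z] = P(z | state j)
    """
    predictive = np.asarray(predictive, dtype=float)
    obs_model = np.asarray(obs_model, dtype=float)
    joint = predictive[:, None] * obs_model  # P(state, z)
    evidence = joint.sum(axis=0)             # P(z), the normalization factors
    expected_posterior_entropy = 0.0
    for z, pz in enumerate(evidence):
        if pz > 0:
            expected_posterior_entropy += pz * entropy(joint[:, z] / pz)
    return entropy(predictive) - expected_posterior_entropy
```

Two sanity checks follow from the definition: a noiseless sensor on a uniform binary state yields the full prior entropy, and a sensor whose output does not depend on the state yields zero.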
The problem where is an instance of a POMDP. By Bellman’s principle of optimality  the solution may be found via a backward in time recursion procedure known as value iteration. An optimal value function maps a belief state to its maximum expected sum of discounted rewards when an optimal policy is followed for the next decisions. Optimal value functions are computed by
starting from . The optimal policy for remaining decisions is found by extracting the argument maximizing . The recursion is continued up to .
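The backward recursion can also be evaluated top-down from a single belief state by enumerating all reachable successors. The sketch below is illustrative only and assumes a zero terminal value, a standard discounted sum, and three hypothetical callbacks (`actions`, `successors`, `reward`) supplied by the user; none of these names come from the paper.

```python
def optimal_value(belief, horizon, actions, successors, reward, gamma=0.95):
    """Finite-horizon optimal value by exhaustive recursion (illustrative).

    actions(belief)            -> iterable of allowed actions
    successors(belief, action) -> iterable of (probability, next_belief),
                                  one pair per possible observation
    reward(belief, action)     -> immediate belief-dependent reward
    Assumes the value with zero decisions remaining is 0.
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for action in actions(belief):
        q_value = reward(belief, action)
        for prob, next_belief in successors(belief, action):
            q_value += gamma * prob * optimal_value(
                next_belief, horizon - 1, actions, successors, reward, gamma
            )
        best = max(best, q_value)
    return best
```

The cost of this enumeration grows exponentially with the horizon, which is what motivates the pruning discussed in Section III.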
III Solving POMDPs With Belief-Dependent Rewards
In most POMDPs, the reward function is state-dependent and its expectation is linear in the belief state. The finite-horizon optimal value function then has a finite representation by a convex hull of a set of hyperplanes over the belief space. Many exact  and approximate [18, 10, 9] offline algorithms for POMDPs rely on this piecewise linearity and convexity of the value function. Reward functions such as mutual information and entropy that are useful in optimal sensing problems are nonlinear in the belief state. Thus, these offline algorithms are not applicable to solve the recursion (3) with a belief-dependent reward function.
Online planning methods  find an optimal action for the current belief state instead of a closed form representation of the optimal policy. As explicit representations of policies are not required, a nonlinear belief-dependent reward function does not constitute any additional difficulty.
In online planning, a tree graph of belief states reachable from the current belief state is constructed. The current belief state is the root of the tree, and belief states computed via are added as child nodes of node . When a desired search depth is reached, the values from the leaves of the tree are propagated back to the root according to (3).
Suboptimal actions may sometimes be pruned from the search tree by branch-and-bound pruning when the optimal value for executing action in belief state , , has an upper bound and a lower bound . For a given and any , if then action is suboptimal at and all its successor nodes may be pruned from the tree. The bounds may similarly be propagated via (3). The number of belief states in the search tree is reduced.
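The pruning test at a single node can be sketched as follows. This is a hedged illustration: the function name and the dictionary representation of the bounds are hypothetical, and the sketch assumes an action is pruned when its upper bound is strictly below the best lower bound among the actions at that node.

```python
def surviving_actions(lower, upper):
    """Branch-and-bound filter at one belief node (illustrative).

    lower, upper: dicts mapping each action to a lower / upper bound on the
    value of executing that action at the node.
    An action is pruned when its upper bound is strictly below the largest
    lower bound, since some other action is then guaranteed to be better.
    """
    if not lower or lower.keys() != upper.keys():
        raise ValueError("bounds must be given for the same non-empty action set")
    best_lower = max(lower.values())
    return [a for a in upper if upper[a] >= best_lower]
```

Only the subtrees below the surviving actions need to be expanded, so tighter bounds translate directly into fewer visited belief states.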
Alternatives to online tree search include specialized approximate methods , which are however limited to small problems; open-loop approximation applied with the receding horizon control principle ; and reduced value iteration  for Gaussian beliefs over in a mixed-observability case. For a theoretical treatment of nonlinear but convex reward functions in POMDPs, we refer the reader to .
IV Bounds for the Value Function
The optimal policy attains the optimal value for all belief states. Then any other policy achieves a value that is a lower bound on the optimal value. A simple choice is to set as the greedy one-step look-ahead policy . Other options include random policies or blind policies  always executing a single fixed action.
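The value of the greedy one-step look-ahead policy can be evaluated with the same kind of recursion as the optimal value, replacing the maximization over future values by a maximization over immediate rewards. The sketch below is illustrative; the callbacks `actions`, `successors`, and `reward` are hypothetical, and a zero terminal value is assumed.

```python
def greedy_value(belief, horizon, actions, successors, reward, gamma=0.95):
    """Value of the greedy one-step look-ahead policy (illustrative).

    At each belief the action with the largest immediate reward is chosen,
    and the expectation is taken over the resulting successor beliefs.
    The result is a lower bound on the optimal value for the same horizon.
    """
    if horizon == 0:
        return 0.0
    action = max(actions(belief), key=lambda a: reward(belief, a))
    value = reward(belief, action)
    for prob, next_belief in successors(belief, action):
        value += gamma * prob * greedy_value(
            next_belief, horizon - 1, actions, successors, reward, gamma
        )
    return value
```

Because only one action is followed per belief, the number of evaluated beliefs grows with the number of observations only, not with the number of actions.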
Upper bounds are found by deriving two relaxed versions of the original POMDP problem by removing constraints on the applicable actions. The set of internal states reachable from a subset in a single time step is
The set of internal states reachable in steps from is
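Since the internal dynamics are deterministic, the reachable sets can be computed by a breadth-first expansion. The sketch below is an illustration under assumptions: `step` is a hypothetical callback returning the one-step successors of an internal state, and the function returns the union of states reachable in at most `k` steps, including the start state.

```python
def reachable_within(start, k, step):
    """Internal states reachable from `start` in at most k steps (illustrative).

    step(state) -> iterable of internal states reachable in one time step,
                   i.e. the successors under every action allowed in `state`.
    """
    reached = {start}
    frontier = {start}
    for _ in range(k):
        # Expand only the newly found states; drop those already seen.
        frontier = {nxt for s in frontier for nxt in step(s)} - reached
        if not frontier:
            break
        reached |= frontier
    return reached
```

The action set of a relaxation based on reachability would then be the union of the actions allowed in the returned states.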
The first relaxation is obtained by removing all constraints imposed by the internal state as follows.
Universal sensor relaxation.
Given a POMDP problem , its universal sensor relaxation is , where contains all actions, is the stochastic part from , and is replaced by .
When we consider only actions applicable in the internal states reachable within decisions, where is the current time step, we obtain the -step sensor relaxation.
k-step sensor relaxation.
Given a POMDP problem , its -step sensor relaxation is , where is the set of all actions possible in the internal states reachable within time steps from the current internal state , and and are as for .
As , the optimal value in either relaxed problem is greater than or equal to the optimal value in the original problem. Let , and denote the optimal value functions for , , and , respectively, and let denote the value function for the greedy policy in for a given . Now
holds for the optimal value and the bounds.
V Multi-Armed Bandit Index Policies for POMDP Relaxations
Both relaxations defined above are POMDPs themselves. Solving even the relaxed problems may thus be a computationally intractable task. This motivates identifying POMDPs whose relaxations have easily computable optimal policies.
In a multi-armed bandit (MAB) problem, a decision-maker plays one arm of the MAB and collects a reward depending on the state of the arm. Four requirements distinguish MABs among general stochastic control problems : 1) exactly one machine is played by the agent per action, and the state of that machine evolves in a way the agent cannot affect, 2) machines not played remain in their current state, 3) the machines are independent, and 4) the machines that are not played do not contribute any reward. Gittins  showed that the optimal policies in MABs are so-called greedy index allocation policies. For each arm, an allocation index known as the Gittins index is calculated, and the optimal choice is to play the arm with the highest index value. Index policies are optimal when actions are not irrevocable [14, 2]: any action is available at any stage, and may be chosen at a later stage with the same reward, excluding the effect of the discount factor. Index policies are usually much easier to compute than backward induction solutions of POMDPs .
An index policy is in general not optimal for the mixed-observability POMDP of Section II, as actions are irrevocable due to the constraints imposed by the internal state. However, both of the relaxations and have a fixed action space. The following three properties are required for the relaxations to be MABs. Results are derived for , and they hold for the more restricted case as well.
Each is related to , such that and .
Given , each is conditional on the values of some subset of the inference variables in , and are stationary, i.e.
where is the Dirac delta function.
For , the observation is conditional on , i.e.
Equation (9) is seen to hold applying (1) to the given prior with models satisfying (7) and (8). Equation (10) is seen to hold through two steps. First, due to the independence structure of the prior and posterior, and similarly for . Second, by (1) we see from (9) that for . Applying these steps to (2) leads to (10).
We now state our main result determining the conditions under which a POMDP relaxation is a MAB.
Proposition 1 (MAB equivalence of POMDP relaxations).
(Sketch). Consider the four requirements for MABs introduced above. Property 1 establishes the “arms” of the bandit, partly satisfying requirement 1. The rest of requirements 1 and 2 are satisfied by Properties 2 and 3, which establish the states of the bandit arms as . Requirement 3 is satisfied by the independence properties in the first part of Corollary 1. The latter part of the corollary shows that requirement 4 is satisfied. ∎
When the proposition holds, the optimal policies for and are greedy index policies with values and , respectively. These optimal values are thus much easier to compute than for general POMDPs.
Let us consider the following example problem.
Monitoring reactive targets.
An agent is located at . At every time step the agent may either stay where it is or move to one of the neighboring locations . The applicable actions are . Let , with assuming value if a target is present at location and 0 if not. Each target reacts to the agent’s presence such that
The agent records measurements in according to
where are the false negative and positive probabilities, respectively, and . The reward function is (2).
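The effect of the false negative and false positive probabilities on the belief about one target can be sketched as below. This is a hedged illustration, not the paper's observation model: it assumes a single binary measurement about a single location, and the names `update_presence`, `p_fn`, and `p_fp` are hypothetical.

```python
def update_presence(prior, detected, p_fn, p_fp):
    """Posterior probability that a target is present (illustrative).

    prior:    probability of presence before the measurement
    detected: True if the sensor reports a target, False otherwise
    p_fn:     false negative probability, P(no detection | target present)
    p_fp:     false positive probability, P(detection | target absent)
    """
    like_present = (1.0 - p_fn) if detected else p_fn
    like_absent = p_fp if detected else (1.0 - p_fp)
    numerator = like_present * prior
    evidence = numerator + like_absent * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("measurement has zero probability under the prior")
    return numerator / evidence
```

A detection raises the presence probability and a missed detection lowers it, with the size of the shift governed by the two error probabilities.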
VI Empirical Evaluation
We ran simulation experiments on the monitoring problem defined above. There were inference variables, arranged on a rectangular two-dimensional four-connected grid. The agent was allowed to move on this grid and sense the targets. We examined two cases. In the first case, all of the properties 1-3 were satisfied. In the second case, we relaxed Property 2 by allowing all inference variables to change state. In all cases, the optimization horizon was varied from 1 to 6 decisions. The other parameters were .
We implemented the real-time belief space search (RTBSS) algorithm of  as presented in . RTBSS implements an online search of belief states reachable from the current belief state, and applies lower and upper bounds to prune suboptimal actions. We applied the greedy lower bound (Section IV) and upper bounds or (Section V). We compared this approach to an exhaustive search of all reachable belief states equivalent to using lower and upper bounds , and to the POMCP algorithm , which gives a recommendation on the next action to execute based on a series of Monte Carlo (MC) simulations.
We defined , and were two-state Markov chains with parameters , , where denotes the probability that transitions from to . For each , we sampled uniformly at random . A set of 1000 initial belief states satisfying the independence assumption between inference variables was sampled uniformly at random.
As Proposition 1 holds, and , and the greedy MAB policies give valid upper bounds, see (6). Applying RTBSS with these bounds hence always finds the optimal solution, which was verified in our simulations. The number of visited nodes in the search tree for each of the 1000 belief states is shown in Fig. 1 for and both upper bounds. Since the bound is tighter, applying it results in a lower or equal number of visited nodes than . For comparison, the average number of visited nodes for the exhaustive search is shown in Table I. We note that applying either bound greatly reduces the number of visited nodes, in some cases by up to an order of magnitude. Although the reduction in the number of visited nodes is substantial, evaluating the bounds has a computational cost that must be balanced with the savings from visiting fewer nodes. This point is discussed in more detail in the next subsection.
POMCP recommendations coincide with the optimal action more reliably when the number of MC simulations is increased and the optimization horizon is short, see Table II. We compared the values of optimal actions to those recommended by POMCP when the two differed. The difference between the two values is the performance loss, for which we computed the mean values and worst-case maximum values. The results are shown in Table III. Performance loss tends to be greater for fewer MC simulations and a greater optimization horizon . As the number of MC simulations increases the mean performance loss is low, indicating that on average POMCP performs very well compared to the optimal solution. However, even if the mean performance loss is low, the worst case performance loss from following POMCP recommendations may be significantly greater. In problems where suboptimal actions may lead to unacceptable performance loss, methods such as RTBSS with valid bounds may be preferable to POMCP.
VI-B Case 2: Property 2 not satisfied
We next examined the case where Property 2 was not satisfied. We set for each . Each of the dynamics models was a two-state Markov chain. We considered three subcases distinguished by the rate of the state transitions: slow, medium or fast. For slow dynamics, the parameters were sampled for each uniformly at random such that , for medium dynamics , and for fast dynamics . Each experiment was repeated for 1000 randomly sampled initial belief states and dynamics models. All beliefs satisfied the independence assumption between inference variables.
The problem is quite similar to the one in Subsection VI-A, and POMCP performance was also observed to be very good on average. The MAB equivalence, Proposition 1, is now not satisfied for the relaxed problems. Thus, the upper bounds are approximate, and optimality for RTBSS cannot be guaranteed. We examined the effect that this had on solutions provided by RTBSS. The results are summarized in Table IV. The table shows the percentage of solutions equal to the optimal solution in case of slow, medium or fast dynamics for either the universal sensor upper bound from or the -step sensor upper bound from .
Optimal solutions are found in the majority of cases, with the percentage decreasing as the optimization horizon grows and the dynamics become faster. Since often , it is more likely that the bound obtained from the universal sensor relaxation does not underestimate the optimal value, and consequently better agreement with the optimal solution is observed. The results suggest that it may still be reasonable to approximate upper bounds for by the value of greedy policies in the relaxed problems or , even if their optimality cannot be guaranteed.
The efficiency of pruning the search tree was not affected significantly compared to the case of the previous subsection. Applying either bound dramatically reduced the number of visited nodes in the search tree. We examined the mean time required to find a solution for a belief state either by exhaustive search or branch-and-bound pruning. A representative comparison is presented in Fig. 2 for the case of medium dynamics. For , exhaustive search performs fastest: the computational burden of computing the bounds outweighs the savings from visiting fewer nodes during the search. The advantages of pruning the search tree become apparent for . At best, applying pruning is an order of magnitude faster than exhaustive search. For , the upper bound from is fastest. Using the upper bound from is faster than exhaustive search for . Comparing computation times between POMCP and RTBSS was not meaningful, as the experiments were run on different computer platforms with different implementations of e.g. the search trees.
An optimal sensing problem in a mixed-observability domain, where an internal state is fully observable and a set of inference variables is partially observable, was formulated as a POMDP. The objective was the sequential maximization of the mutual information between the inference variables and the observations. Upper bounds for the optimal value function were found by relaxing constraints of the original problem.
When three conditions are fulfilled, the relaxed problems are MABs. First, each action is related to a unique subset of inference variables. Secondly, only inference variables in the subset corresponding to the current action evolve, while the other inference variables remain stationary. Finally, observations depend only on the inference variables in the subset related to the current action. The optimal solution of a MAB problem is a greedy index allocation policy, which is much easier to find than solving a general POMDP.
The POMDP was solved by a branch-and-bound search. The effectiveness of the bounds for pruning the search space was empirically verified in a target monitoring problem. Finding an optimal action with pruning requires searching only a fraction of the reachable belief states compared to an exhaustive search. Computation time is at best an order of magnitude smaller when applying pruning. The computational savings become apparent when the savings due to the reduced search space size exceed the additional cost of computing the bounds.
Future work includes studying applicability of our methodology in a wider range of mixed observability domains. Motivated by positive results on optimality of greedy policies for the restless bandit problem , we believe there may exist more classes of stochastic control problems than currently known where a greedy policy is optimal. Identifying such classes would further expand the applicability of our results.
[1] L. Kaelbling, M. Littman, and A. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.
[2] A. O. Hero, D. Castañón, D. Cochran, and K. Kastella, Eds., Foundations and Applications of Sensor Management. New York, NY: Springer, 2007.
[3] E. K. P. Chong, C. M. Kreucher, and A. O. Hero, “Partially Observable Markov Decision Process Approximations for Adaptive Sensing,” Discrete Event Dynamic Systems, vol. 19, no. 3, pp. 377–422, May 2009.
[4] B. Charrow, V. Kumar, and N. Michael, “Approximate representations for multi-robot control policies that maximize mutual information,” Autonomous Robots, vol. 37, no. 4, pp. 383–400, Aug. 2014.
[5] N. Atanasov, J. Le Ny, K. Daniilidis, and G. Pappas, “Information Acquisition with Sensing Robots: Algorithms and Error Bounds,” in IEEE Int. Conf. on Robotics and Automation (ICRA), Hong Kong, China, June 2014, pp. 6447–6454.
[6] M. Lauri and R. Ritala, “Stochastic control for maximizing mutual information in active sensing,” in ICRA 2014 Workshop on Robots in Homes and Industry: Where to Look First?, Hong Kong, China, June 2014.
[7] E. A. Hansen, “Indefinite-horizon POMDPs with action-based termination,” in Proceedings of the National Conference on Artificial Intelligence, Vancouver, Canada, July 2007, pp. 1237–1242.
[8] C. H. Papadimitriou and J. N. Tsitsiklis, “The Complexity of Markov Decision Processes,” Mathematics of Operations Research, vol. 12, pp. 441–450, 1987.
[9] M. T. Spaan and N. A. Vlassis, “Perseus: Randomized point-based value iteration for POMDPs,” Journal of Artificial Intelligence Research, vol. 24, pp. 195–220, 2005.
[10] J. Pineau, G. Gordon, and S. Thrun, “Anytime point-based approximations for large POMDPs,” Journal of Artificial Intelligence Research, vol. 27, no. 1, pp. 335–380, 2006.
[11] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa, “Online planning algorithms for POMDPs,” Journal of Artificial Intelligence Research, vol. 32, pp. 663–704, 2008.
[12] D. Silver and J. Veness, “Monte-Carlo Planning in Large POMDPs,” in Advances in Neural Information Processing Systems 23, Vancouver, Canada, Dec. 2010, pp. 2164–2172.
[13] S. C. Ong, S. W. Png, D. Hsu, and W. S. Lee, “Planning under uncertainty for robotic tasks with mixed observability,” International Journal of Robotics Research, vol. 29, no. 8, pp. 1053–1068, 2010.
[14] A. O. Hero and D. Cochran, “Sensor management: Past, present, and future,” IEEE Sensors Journal, vol. 11, no. 12, pp. 3064–3075, 2011.
[15] R. Bellman, Dynamic Programming. Princeton, New Jersey: Princeton University Press, 1957.
[16] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,” Operations Research, vol. 21, no. 5, pp. 1071–1088, 1973.
[17] W. S. Lovejoy, “A survey of algorithmic methods for partially observed Markov decision processes,” Annals of Operations Research, vol. 28, no. 1, pp. 47–65, 1991.
[18] M. Hauskrecht, “Value-function approximations for partially observable Markov decision processes,” Journal of Artificial Intelligence Research, vol. 13, no. 1, pp. 33–94, 2000.
[19] V. Krishnamurthy and D. V. Djonin, “Structured Threshold Policies for Dynamic Sensor Scheduling – A Partially Observed Markov Decision Process Approach,” IEEE Transactions on Signal Processing, vol. 55, no. 10, pp. 4938–4957, Oct. 2007.
[20] M. Araya, O. Buffet, V. Thomas, and F. Charpillet, “A POMDP Extension with Belief-dependent Rewards,” in Advances in Neural Information Processing Systems 23, Vancouver, Canada, Dec. 2010, pp. 64–72.
[21] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 148–177, 1979.
[22] S. Paquet, B. Chaib-draa, and S. Ross, “Hybrid POMDP algorithms,” in Proceedings of The AAMAS Workshop on Multi-Agent Sequential Decision Making in Uncertain Domains, Hakodate, Japan, May 2006, pp. 133–147.
[23] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, “Optimality of myopic sensing in multichannel opportunistic access,” IEEE Transactions on Information Theory, vol. 55, no. 9, pp. 4040–4050, 2009.