1. Introduction
Autonomous agents and robots can be deployed in information gathering tasks in environments where human presence is either undesirable or infeasible. Examples include monitoring of deep ocean conditions, or space exploration. It may be desirable to deploy a team of agents, e.g., due to the large scope of the task at hand, resulting in a decentralized information gathering task.
Some recent works, e.g., Charrow et al. (2014); Schlotfeldt et al. (2018), tackle decentralized information gathering while assuming perfect, instantaneous communication between agents and centrally planning how the agents should act. In terms of communication, we approach the problem from the other extreme, as a decentralized partially observable Markov decision process (DecPOMDP) Oliehoek and Amato (2016). In a DecPOMDP, no explicit communication between the agents is assumed (if desired, communication may be included in the DecPOMDP model Spaan et al. (2006); Wu et al. (2011)). Each agent acts independently, without knowing what the other agents have perceived or how they have acted.
Informally, a DecPOMDP model consists of a set of agents in an environment with a hidden state. Each agent has its own set of local actions, and a set of local observations it may observe. Markovian state transition and observation processes conditioned on the agents’ actions and the state determine the relative likelihoods of subsequent states and observations. A reward function determines the utility of executing any action in any state. The objective is to centrally design optimal control policies for each agent that maximize the expected sum of rewards over a finite horizon of time. The control policy of each agent depends only on the past actions and observations of that agent, hence no communication during execution of the policies is required. However, as policies are planned centrally, it is possible to reason about the joint information state of all the agents. It is thus possible to calculate probability distributions over the state, also known as joint beliefs.
A decentralized information gathering task differs from other multiagent control tasks by the lack of a goal state. It is not the purpose of the agents to execute actions that reach a particular state, but rather to observe the environment in a manner that provides the greatest amount of information while satisfying operational constraints. As the objective is information acquisition, the reward function depends on the joint belief of the agents. Convex functions of a probability mass function naturally model certainty DeGroot (2004), and have been proposed in the context of single-agent POMDPs Araya-López et al. (2010) and DecPOMDPs Lauri et al. (2017). However, to the best of our knowledge no heuristic or approximate algorithms for convex reward DecPOMDPs have been proposed, and no theoretical results on the properties of such DecPOMDPs exist in the literature.
In this paper, we propose the first heuristic algorithm for DecPOMDPs with a convex reward. We prove the value function of such DecPOMDPs is convex, generalizing the similar result for single-agent POMDPs Araya-López et al. (2010). The DecPOMDP generalizes other decision-making formalisms such as multiagent POMDPs and DecMDPs Bernstein et al. (2002). Thus, our results also apply to these special cases after removing the parts required only by the more general DecPOMDP.
Our paper has three contributions. Firstly, we prove that in DecPOMDPs where the reward is a convex function of the joint belief, the value function of any finite horizon policy is convex in the joint belief. Secondly, we propose the first heuristic algorithm for DecPOMDPs with a reward that is a function of the agents’ joint state information. The algorithm is based on iterative improvement of the value of fixed-size policy graphs. We derive a lower bound that may be improved instead of the exact value, leading to computational speedups. Thirdly, we experimentally verify the feasibility and usefulness of our algorithm. For DecPOMDPs with a state-information-dependent reward, we find policies for problems an order of magnitude larger than previously.
The paper is organized as follows. We review related work in Section 2. In Section 3, we define our DecPOMDP problem and introduce notation and definitions. Section 4 derives the value of a policy graph node. In Section 5, we prove convexity of the value in a DecPOMDP where the reward is a convex function of the state information. Section 6 introduces our heuristic policy improvement algorithm. Experimental results are presented in Section 7, and concluding remarks are provided in Section 8.
2. Related work
Computationally finding an optimal decentralized policy for a finite-horizon DecPOMDP is NEXP-complete Bernstein et al. (2002). Exact algorithms for DecPOMDPs are usually based either on backwards-in-time dynamic programming Hansen et al. (2004), forwards-in-time heuristic search Szer et al. (2005); Oliehoek et al. (2013), or on exploiting the inherent connection of DecPOMDPs to non-observable Markov decision processes Dibangoye et al. (2016); MacDermed and Isbell (2013). Approximate and heuristic methods have been proposed, e.g., based on finding locally optimal “best response” policies for each agent Nair et al. (2003), memory-bounded dynamic programming Seuken and Zilberstein (2007), cross-entropy optimization over the space of policies Oliehoek et al. (2008a), or monotone iterative improvement of fixed-size policies Pajarinen and Peltonen (2011). Algorithms for special cases such as goal-achievement DecPOMDPs Amato and Zilberstein (2009) and factored DecPOMDPs, e.g., Oliehoek et al. (2008b), have also been proposed. Structural properties, such as transition, observation, and reward independence between the agents, can also be leveraged and may even result in a problem with a lower computational complexity Allen and Zilberstein (2009). Some DecPOMDP algorithms Oliehoek et al. (2013) take advantage of plan-time sufficient statistics, which are joint distributions over the hidden state and the histories of the agents’ actions and observations Oliehoek (2013). The sufficient statistics provide a means to reason about possible distributions over the hidden state, also called joint beliefs, reached under a given policy.
The expected value of a reward function that depends on the hidden state and action is a linear function of the joint belief. These types of rewards are standard in DecPOMDPs. In the context of single-agent POMDPs, Araya-López et al. (2010) argue that information gathering tasks are naturally formulated using a reward function that is a convex function of the state information and introduce the POMDP model with such a reward. This enables application of, e.g., the negative Shannon entropy of the state information as a component of the reward function. Under certain conditions, an optimal value function of a POMDP is Lipschitz-continuous Fehr et al. (2018), which may be exploited in a solution algorithm. An alternative formulation for information gathering in single-agent POMDPs is presented in Spaan et al. (2015), and its connection to POMDPs is characterized in Satsangi et al. (2018). Recently, Lauri et al. (2017) propose an extension of the ideas presented in Araya-López et al. (2010) to the DecPOMDP setting. Entropy is applied in the reward function to encourage information gathering. Problem domains with up to 25 states and 5 actions per agent are solved with an exact algorithm.
In this paper, we present the first heuristic algorithm for DecPOMDPs with rewards that depend nonlinearly on the joint belief. Our algorithm is based on the combination of the idea of using a fixed-size policy represented as a graph Pajarinen and Peltonen (2011) with plan-time sufficient statistics Oliehoek (2013) to determine joint beliefs at the policy graph nodes. The local policy at each policy graph node is then iteratively improved, monotonically improving the value of the node. We show that if the reward function is convex in the joint belief, then the value function of any finite-horizon DecPOMDP policy is convex as well. This is a generalization of a similar result known for single-agent POMDPs Araya-López et al. (2010). From this property, we obtain a lower bound for the value of a policy that we empirically show improves the efficiency of our algorithm. Compared to prior state-of-the-art in DecPOMDPs with convex rewards Lauri et al. (2017), our algorithm is capable of handling problems an order of magnitude larger.
3. Decentralized POMDPs
We next formally define the DecPOMDP problem we consider. Contrary to most earlier works, we define the reward as a function of state information and action. This allows us to model information acquisition problems. We choose the finite-horizon formulation to reflect the fact that a decentralized information gathering task should have a clearly defined end after which the collected information is pooled and subsequent inference or decisions are made.
A finite-horizon DecPOMDP is a tuple $(T, I, S, \{A_i\}, \{Z_i\}, P^s, P^z, b^0, \{\rho_t\})$, where $I = \{1, \ldots, n\}$ is the set of agents, $S$ is a finite set of hidden states, $A_i$ and $Z_i$ are the finite action and observation sets of agent $i \in I$, respectively, $P^s(s' \mid s, a)$ is the state transition probability that gives the conditional probability of the new state $s'$ given the current state $s$ and joint action $a \in A$, where $A$ is the joint action space obtained as the Cartesian product of $A_i$ for all $i \in I$, $P^z(z \mid s', a)$ is the observation probability that gives the conditional probability of the joint observation $z \in Z$ given the state $s'$ and previous joint action $a$, with $Z$ being the joint observation space defined as the Cartesian product of $Z_i$ for $i \in I$, $b^0 \in \Delta(S)$ is the initial state distribution at time $t = 0$ (we denote by $\Delta(S)$ the space of probability mass functions over $S$), $T$ is the problem horizon, and $\rho_t : \Delta(S) \times A \to \mathbb{R}$ are the reward functions at times $t = 0, \ldots, T-1$, while $\rho_T : \Delta(S) \to \mathbb{R}$ determines a final reward obtained at the end of the problem horizon.
The DecPOMDP starts from some state $s^0$ drawn from the initial state distribution $b^0$. Each agent $i$ then selects an action $a_i^0$, and the joint action $a^0 = (a_1^0, \ldots, a_n^0)$ is executed. The state then transitions according to $P^s$, and each agent perceives an observation $z_i^1$, where the likelihood of the joint observation $z^1 = (z_1^1, \ldots, z_n^1)$ is determined according to $P^z$. The agents then select the next actions $a_i^1$, and the same steps are repeated until $t = T$ and the task ends.
Optimally solving a DecPOMDP means designing a policy for each agent that encodes which action the agent should execute conditional on its past observations and actions, such that the expected sum of rewards collected is maximized. In the following, we make the notion of a policy exact, and determine the expected sum of rewards collected when executing a policy.
3.1. Histories and policies
Define the history set of agent at time as , and . A local history contains all information available to agent to decide its next action . We define the joint history set as the Cartesian product of over . We write a joint history as , or equivalently as where and . Both the local and joint histories satisfy the recursion .
A solution of a finitehorizon DecPOMDP is a local policy for each agent that determines which action an agent should take given a local history in for any . We define a local policy similarly as Pajarinen and Peltonen (2011) as a deterministic finitehorizon controller viewed as a directed acyclic graph.
Definition (Local policy).
For agent , a local policy is , where is a finite set of nodes, is a starting node, is an output function, and is a node transition function.
Fig. 1 shows an example of a local policy. Note that a sufficiently large graph can represent any finite horizon local policy.
We constrain the structure of local policies by enforcing that each node can be identified with a unique time step. We call this the property of temporal consistency.
Definition (Temporal consistency).
A local policy , , , is temporally consistent if where are pairwise disjoint and nonempty, and , and for any , for , for all , .
In a temporally consistent policy, at a node in the agent has decisions left until the end of the problem horizon. Temporal consistency guarantees that exactly one node in each set can be visited, and that after visiting a node in , the next node will belong to . In Fig. 1, , and , , . Temporal consistency is assumed throughout the rest of the paper.
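As an illustration of the definitions above, the following minimal sketch stores a local policy as layers of nodes and checks temporal consistency. The class name `LocalPolicy`, its fields, and the action and observation labels are hypothetical choices made for this example; they are not taken from the paper or from any library.

```python
from dataclasses import dataclass


@dataclass
class LocalPolicy:
    """Deterministic finite-horizon controller of one agent (hypothetical layout)."""
    layers: list      # layers[t] lists the node ids usable at time step t
    output: dict      # node id -> local action
    transition: dict  # (node id, local observation) -> next node id


def is_temporally_consistent(policy, observations):
    """Check that layers partition the nodes and edges lead to the next layer."""
    # The first layer must contain exactly the starting node.
    if not policy.layers or len(policy.layers[0]) != 1:
        return False
    seen = set()
    for layer in policy.layers:
        # Layers must be nonempty and pairwise disjoint.
        if not layer or seen & set(layer):
            return False
        seen |= set(layer)
    # Every node needs an output action.
    if any(q not in policy.output for q in seen):
        return False
    # Every non-final node must move to the next layer for every observation.
    for t in range(len(policy.layers) - 1):
        next_layer = set(policy.layers[t + 1])
        for q in policy.layers[t]:
            for z in observations:
                if policy.transition.get((q, z)) not in next_layer:
                    return False
    return True


# Three decision steps: one start node, then two nodes per later time step.
example = LocalPolicy(
    layers=[["q0"], ["q1", "q2"], ["q3", "q4"]],
    output={"q0": "look", "q1": "look", "q2": "move", "q3": "move", "q4": "look"},
    transition={
        ("q0", "z0"): "q1", ("q0", "z1"): "q2",
        ("q1", "z0"): "q3", ("q1", "z1"): "q4",
        ("q2", "z0"): "q3", ("q2", "z1"): "q3",
    },
)
print(is_temporally_consistent(example, ["z0", "z1"]))
```

Storing the layers explicitly makes the time step of each node available without searching the graph, which is the property that temporal consistency is meant to guarantee.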
A joint policy describes the joint behaviour of all agents and is defined as the combination of the local policies .
Definition (Joint policy).
Given local policies , , , for all , a joint policy is , where is the Cartesian product of all , , and for and , is such that , and is such that .
Temporal consistency naturally extends to joint policies, such that there exists a partition of by pairwise disjoint sets .
3.2. Bayes filter
While planning policies for information gathering, it is useful to reason about the joint belief of the agents given some joint history. This can be done via Bayesian filtering as described in the following.
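The initial state distribution $b^0$ is a function of the state at time $t = 0$, and for any state $s \in S$, $b^0(s)$ is equal to the probability $P(s^0 = s)$. When action $a^0$ is executed and observation $z^1$ is perceived, we may find the posterior belief $b^1$, where $b^1(s') = P(s^1 = s' \mid a^0, z^1)$, by applying a Bayes filter.
In general, given any current joint belief $b$ corresponding to some joint history (for notational convenience, we drop the explicit dependence of $b$ on the joint history), and a joint action $a$ and joint observation $z$, the posterior joint belief $b'$ is calculated by

$$b'(s') = \frac{P^z(z \mid s', a) \sum_{s \in S} P^s(s' \mid s, a)\, b(s)}{\eta(z \mid b, a)}, \tag{1}$$

where

$$\eta(z \mid b, a) = \sum_{s' \in S} P^z(z \mid s', a) \sum_{s \in S} P^s(s' \mid s, a)\, b(s)$$

is the normalization factor equal to the prior probability of observing $z$. Given $b^0$ and any joint history $h^t = (a^0, z^1, \ldots, a^{t-1}, z^t)$, repeatedly applying Eq. (1) yields a sequence $b^0, b^1, \ldots, b^t$ of joint beliefs. We shall denote the application of the Bayes filter by the shorthand notation $b' = \zeta(b, a, z)$. Furthermore, we shall denote the filter that recovers $b^t$ given $b^0$ and $h^t$ by repeated application of $\zeta$ by a function $\tau$, so that $b^t = \tau(b^0, h^t)$.
3.3. Value of a policy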
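The value of a policy is equal to the expected sum of rewards collected when acting according to the policy. We define value functions $V_t^{\pi}$ that give the expected sum of rewards when following policy $\pi$ until the end of the horizon when $t$ decisions have been taken so far, for any joint belief $b$ and any policy node $q$. Below, $\gamma$ and $\lambda$ denote the output and node transition functions of the joint policy, $\eta(z \mid b, a)$ is the prior probability of the joint observation $z$, and $\zeta(b, a, z)$ is the Bayes filter of Section 3.2.
The time step $t = T$ is a special case when all actions have already been taken, and the value function only depends on the joint belief and is equal to the final reward: $V_T^{\pi}(b) = \rho_T(b)$.
For $t = T-1$, one decision remains, and the remaining expected sum of rewards of executing policy $\pi$ is equal to

$$V_{T-1}^{\pi}(b, q) = \rho_{T-1}(b, \gamma(q)) + \sum_{z \in Z} \eta(z \mid b, \gamma(q))\, V_T^{\pi}\big(\zeta(b, \gamma(q), z)\big), \tag{2}$$

i.e., the sum of the immediate reward and the expected final reward at time $T$. From the above, we define $V_t^{\pi}$ iterating backwards in time for $t = T-2, \ldots, 0$ as

$$V_t^{\pi}(b, q) = \rho_t(b, \gamma(q)) + \mathbb{E}\Big[ V_{t+1}^{\pi}\big(\zeta(b, \gamma(q), z), \lambda(q, z)\big) \Big], \tag{3}$$

where the expectation is under $\eta(z \mid b, \gamma(q))$. The expected sum of rewards collected when following a policy $\pi$ is equal to its value $V_0^{\pi}(b^0, q_0)$, where $q_0$ is the starting node of the joint policy. The objective is to find an optimal policy whose value is greater than or equal to the value of any other policy.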
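The Bayes filter and the backward value recursion can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: joint actions, joint observations and states are plain integer indices, the nested-list layout of `P_s` and `P_z` is hypothetical, and the reward at every time step is the negative Shannon entropy of the joint belief, independent of the action.

```python
import math


def bayes_filter(b, a, z, P_s, P_z):
    """Return (posterior belief, eta) for joint action a and joint observation z.

    Assumed layout: P_s[a][s][s2] is the transition probability and
    P_z[a][s2][z] is the observation probability.
    """
    n = len(b)
    unnorm = [
        P_z[a][s2][z] * sum(P_s[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    eta = sum(unnorm)  # prior probability of observing z
    if eta == 0.0:
        return None, 0.0  # observation impossible under (b, a)
    return [u / eta for u in unnorm], eta


def neg_entropy(b):
    """Negative Shannon entropy, a convex function of the belief."""
    return sum(p * math.log(p) for p in b if p > 0.0)


def policy_value(b, q, t, T, gamma, lam, P_s, P_z, n_obs):
    """Expected sum of rewards from joint node q at time t."""
    if t == T:
        return neg_entropy(b)  # final reward only
    a = gamma[q]
    value = neg_entropy(b)  # immediate reward (assumed action-independent)
    for z in range(n_obs):
        b_next, eta = bayes_filter(b, a, z, P_s, P_z)
        if eta > 0.0:
            # After the last decision no next node exists; get() returns None
            # and the node is not used at t == T.
            q_next = lam.get((q, z))
            value += eta * policy_value(
                b_next, q_next, t + 1, T, gamma, lam, P_s, P_z, n_obs
            )
    return value


# Toy model: static two-state world; joint action 0 uses an informative
# sensor, joint action 1 an uninformative one.
P_s = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
P_z = [[[0.8, 0.2], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
lam = {("q0", 0): "q1", ("q0", 1): "q1"}
b0 = [0.5, 0.5]
v_sense = policy_value(b0, "q0", 0, 2, {"q0": 0, "q1": 0}, lam, P_s, P_z, 2)
v_idle = policy_value(b0, "q0", 0, 2, {"q0": 1, "q1": 1}, lam, P_s, P_z, 2)
print(v_sense > v_idle)
```

In this toy model the policy that always uses the informative sensor obtains the higher value, because the negative entropy reward favors peaked beliefs.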
4. Value of a policy node
Executing a policy corresponds to a stochastic traversal of the policy graphs (Fig. 1) conditional on the observations perceived. In this section, we first answer two questions related to this traversal process. First, given a history, when is it consistent with a policy, and which nodes in the policy graph will be traversed (Subsection 4.1)? Second, given an initial state distribution, what is the probability of reaching a given policy graph node, and what are the relative likelihoods of histories if we assume a given node is reached (Subsection 4.2)? With the above questions answered, we define the value of a policy graph node both in a joint and in a local policy (Subsection 4.3). These values will be useful in designing a policy improvement algorithm for DecPOMDPs.
4.1. History consistency
As illustrated in Fig. 1, there can be multiple histories along which a node can be reached. We define when a history is consistent with a policy, i.e., when executing a policy could have resulted in the given history. As histories in are reached after executing all actions, in the remainder of this subsection we consider .
Definition (History consistency).
We are given for all ,,,, and the corresponding joint policy ,,,.

A local history is consistent with if the sequence of nodes where for satisfies: for every . We say ends at under .

A joint history is consistent with if for all , is consistent with and ends at . We say ends at under .
Due to temporal consistency, any consistent with a policy will end at some . Similarly, any ends at some .
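History consistency amounts to replaying a history through the policy graph. The sketch below is one possible reading of the definition, with hypothetical names: a local history is a list of (action, observation) pairs, and a local policy is given as a `(start, output, transition)` triple of plain Python values.

```python
def end_node(history, start, output, transition):
    """Return the node at which a local history ends, or None if the
    history is not consistent with the local policy."""
    q = start
    for action, observation in history:
        # The action recorded in the history must be the one the node outputs.
        if output.get(q) != action:
            return None
        q = transition.get((q, observation))
        if q is None:
            return None  # no such edge, e.g. the history is too long
    return q


def joint_end_node(local_histories, policies):
    """A joint history is consistent if every local history is; it then
    ends at the tuple of local end nodes."""
    ends = tuple(end_node(h, *p) for h, p in zip(local_histories, policies))
    return None if None in ends else ends


output = {"q0": "look", "q1": "look", "q2": "move"}
transition = {("q0", "z0"): "q1", ("q0", "z1"): "q2"}
policy = ("q0", output, transition)

print(end_node([], *policy))                # empty history ends at the start node
print(end_node([("look", "z1")], *policy))  # consistent history
print(end_node([("move", "z1")], *policy))  # inconsistent: q0 outputs "look"
print(joint_end_node([[("look", "z0")], [("look", "z1")]], [policy, policy]))
```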
4.2. Node reachability probabilities
Above, we have defined when a history ends at a particular node. Using this definition, we now derive the joint probability mass function (pmf) of policy nodes and joint histories given that a particular policy is executed.
We note that and first consider . The unconditional a priori probability of experiencing the joint history is . For , the unconditional probability of experiencing is obtained recursively by . Conditioning on a policy yields if is consistent with and 0 otherwise. Next, we have , with if ends at under and 0 otherwise.
Combining the above, the joint pmf is defined as
Marginalizing over , the probability of ending at node under is
(4) 
and by definition of conditional probability,
(5) 
We now find the probability of ending at under . Let denote the Cartesian product of all except . Then denotes the nodes for all agents except . We have . The probability of ending at under is
(6) 
where the sum terms are determined by Eq. (4). Again, by definition of conditional probability,
(7) 
where the term in the numerator is obtained from Eq. (4).
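The marginalization and conditioning steps behind Eqs. (4) to (7) can be illustrated on a small table. The joint pmf below is made up for this example (two agents, four joint histories); in the algorithm it would instead be computed from the model and the policy. All function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical joint pmf over (joint history, joint node) pairs under a policy.
# Joint nodes are pairs (node of agent 0, node of agent 1).
joint_pmf = {
    ("h_a", ("x1", "y1")): 0.4,
    ("h_b", ("x1", "y1")): 0.1,
    ("h_c", ("x1", "y2")): 0.2,
    ("h_d", ("x2", "y2")): 0.3,
}


def node_probs(pmf):
    """Eq. (4): marginalize over histories to get the joint node probability."""
    out = defaultdict(float)
    for (_, q), p in pmf.items():
        out[q] += p
    return dict(out)


def histories_given_node(pmf, q):
    """Eq. (5): history probabilities conditional on ending at joint node q."""
    total = node_probs(pmf)[q]
    return {h: p / total for (h, node), p in pmf.items() if node == q}


def local_node_probs(pmf, i):
    """Eq. (6): sum joint node probabilities over the other agents' nodes."""
    out = defaultdict(float)
    for q, p in node_probs(pmf).items():
        out[q[i]] += p
    return dict(out)


def others_given_local(pmf, i, q_i):
    """Eq. (7): probability of each joint node containing q_i, given q_i.
    Keys are full joint nodes, which also identify the other agents' nodes."""
    total = local_node_probs(pmf, i)[q_i]
    return {q: p / total for q, p in node_probs(pmf).items() if q[i] == q_i}


print(node_probs(joint_pmf))
print(histories_given_node(joint_pmf, ("x1", "y1")))
print(local_node_probs(joint_pmf, 0))
print(others_given_local(joint_pmf, 0, "x1"))
```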
4.3. Value of policy nodes
We define the values of a node in a joint policy and an individual policy.
Definition (Value of a joint policy node).
Given a joint policy , the value of a node is defined as
where is defined in Eq. (5) and is the joint belief corresponding to history .
Definition (Value of a local policy node).
For , let be the local policy and let be the corresponding joint policy. For any , the value of a local node is
where is defined in Eq. (7).
In other words, the value of a local node is equal to the expected value of the value of the joint node under .
5. Convex-reward DecPOMDPs
In this section, we prove several results for the value function of a DecPOMDP whose reward function is convex in the joint belief. Convex rewards are of special interest in information gathering. This is because of their connection to so-called uncertainty functions DeGroot (2004), which are nonnegative functions concave in the belief. Informally, an uncertainty function assigns large values to uncertain beliefs, and smaller values to less uncertain beliefs. Negative uncertainty functions are convex and assign high values to less uncertain beliefs, and are thus suitable as reward functions for information gathering. Examples of uncertainty functions include Shannon entropy, generalizations such as Rényi entropy, and types of value of information, e.g., the probability of error in hypothesis testing.
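As a small numerical illustration, the sketch below evaluates the negative Shannon entropy at a few beliefs and checks the defining inequality of convexity on mixtures of two beliefs. The beliefs and mixing weights are arbitrary example values.

```python
import math


def neg_entropy(b):
    """Negative Shannon entropy of a probability mass function."""
    return sum(p * math.log(p) for p in b if p > 0.0)


def mix(b1, b2, w):
    """Convex combination w * b1 + (1 - w) * b2 of two beliefs."""
    return [w * p + (1.0 - w) * r for p, r in zip(b1, b2)]


# A certain belief receives the highest reward, the uniform belief the lowest.
print(neg_entropy([1.0, 0.0]), neg_entropy([0.9, 0.1]), neg_entropy([0.5, 0.5]))

# Convexity: f(w b1 + (1 - w) b2) <= w f(b1) + (1 - w) f(b2).
b1, b2 = [0.9, 0.1], [0.2, 0.8]
for w in (0.25, 0.5, 0.75):
    lhs = neg_entropy(mix(b1, b2, w))
    rhs = w * neg_entropy(b1) + (1.0 - w) * neg_entropy(b2)
    print(w, lhs <= rhs)
```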
The following theorem shows that if the immediate reward functions are convex in the joint belief, then the finite-horizon value function of any policy is convex in the joint belief.
Theorem
If the reward functions $\rho_T$ and $\rho_t$ are convex in the joint belief $b$, then for any policy $\pi$, the value function $V_T^{\pi}$ is convex and $V_t^{\pi}$ is convex in $b$ for any time step $t$.
Proof.
Let , and . We proceed by induction ( is trivial). For , let , and denote . From Eq. (2), . We recall from above that is convex, and by Eq. (1), the Bayes filter is a linear function of . The composition of a linear and convex function is convex, so is a convex function of . The nonnegative weighted sum of convex functions is also convex, and by assumption is convex in , from which it follows that is convex in .
Now assume is convex in for some . By the definition in Eq. (3) and the same argumentation as above, it follows that is convex in . ∎
Since a sufficiently large policy graph can represent any policy, we infer that the value function of an optimal policy is convex.
The following corollary gives a lower bound for the value of a policy graph node.
Corollary
Let be a probability mass function over the joint histories at time . If the reward functions and are convex in , then for any time step and any policy ,
Proof.
By Theorem 5, is convex in . The claim immediately follows applying Jensen’s inequality. ∎
Applied to the definition of the value of a joint policy node (Section 4.3), the corollary says the value of a joint policy node is lower bounded by the value of the expected joint belief at that node. Applied to the definition of the value of a local policy node (Section 4.3), we obtain a lower bound for the value of a local policy node as
where inside the inner expectation we write . Thus, we can evaluate a lower bound for the value of any local node by finding the values of all joint nodes and then taking the expectation of where under .
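The lower bound can be checked numerically at a single node. In the sketch below, the histories ending at a node are represented by hypothetical (probability, belief) pairs, and the negative entropy stands in for the convex value function; the exact node value is the expectation of the function over the beliefs, while the bound evaluates the function once at the expected belief.

```python
import math


def neg_entropy(b):
    """Negative Shannon entropy, used here as a convex stand-in for the value."""
    return sum(p * math.log(p) for p in b if p > 0.0)


def expected_belief(weighted_beliefs):
    """Average the beliefs of all histories ending at a node."""
    n = len(weighted_beliefs[0][1])
    return [sum(p * b[s] for p, b in weighted_beliefs) for s in range(n)]


def exact_and_bound(weighted_beliefs, f):
    """Return (expected value over beliefs, value at the expected belief)."""
    exact = sum(p * f(b) for p, b in weighted_beliefs)
    bound = f(expected_belief(weighted_beliefs))
    return exact, bound


# Two equally likely histories with opposite, fairly certain beliefs.
two_histories = [(0.5, [0.9, 0.1]), (0.5, [0.1, 0.9])]
exact, bound = exact_and_bound(two_histories, neg_entropy)
print(bound <= exact)

# With a single history at the node the bound is tight.
one_history = [(1.0, [0.9, 0.1])]
exact_1, bound_1 = exact_and_bound(one_history, neg_entropy)
print(abs(exact_1 - bound_1) < 1e-12)
```

The first case shows how loose the bound can become when a node merges histories with very different beliefs: the expected belief is uniform even though each individual belief is fairly certain.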
Corollary 5 has applications in policy improvement algorithms that iteratively improve the value of a policy by modifying the output and node transition functions at each local policy node. Instead of directly optimizing the value of a node, the lower bound can be optimized. We present one such algorithm in the next section.
As Corollary 5 holds for any pmf over joint histories, it could be applied also with pmfs other than . For example, if it is expensive to enumerate the possible histories and beliefs at a node, one could approximate the lower bound, e.g., through importance sampling (Murphy, 2012, Ch. 23.4).
In standard DecPOMDPs, the expected reward is a linear function of the joint belief. Then, the corollary above holds with equality.
Corollary
Consider a DecPOMDP where the reward functions are defined as and for , , where is a statedependent final reward function and are the statedependent reward functions. Then, the conclusion of Corollary 5 holds with equality.
Proof.
Let and . First note that . Consider then , and let , and write . Then from the definition of in Eq. (2), consider first the latter sum term which equals
which follows by replacing by Eq. (1), canceling out , and rearranging the sums. The above is clearly a linear function of , and by definition, so is , the first part of . Thus, is linear in . By an induction argument, it is now straightforward to show that is linear in for all . Finally,
for any pmf over joint histories by linearity of expectation. ∎
The corollary above shows that a solution algorithm for a DecPOMDP with a reward convex in the joint belief that uses the lower bound on the value of a policy node will also work for standard DecPOMDPs with a reward linear in the joint belief.
Since a linear function is both convex and concave, rewards that are state-dependent and rewards that are convex in the joint belief can be combined on different time steps in one DecPOMDP and the lower bound still holds.
6. Policy graph improvement
The Policy Graph Improvement (PGI) algorithm Pajarinen and Peltonen (2011) was originally introduced for standard DecPOMDPs with a reward function linear in the joint belief. PGI monotonically improves policies by locally modifying the output and node transition functions of the individual agents’ policies. The policy size is fixed, such that the worst case computation time for an improvement iteration is known in advance. Moreover, due to the limited size of the policies, the method produces compact, understandable policies.
We extend PGI to the nonlinear reward case, and call the method nonlinear PGI (NPGI). Contrary to tree-based DecPOMDP approaches, the policy does not grow double-exponentially with the planning horizon, as we use a fixed-size policy. If the reward function is convex in the joint belief, NPGI may improve the lower bound from Corollary 5. The lower bound is tight when each policy graph node corresponds to only one history, suggesting we can improve the quality of the lower bound by increasing policy graph size.
NPGI is shown in Algorithm 1. At each improvement step, NPGI repeats two steps: the forward pass and the backward pass. In the forward pass, we use the current best joint policy to find the set of expected joint beliefs at every policy graph node. In practice, we do this by first enumerating for each agent the sets of local histories ending at all local nodes, then taking the appropriate combinations to create the joint histories for joint policy graph nodes. We then evaluate the expected joint beliefs at every joint policy graph node.
In the backward pass, we improve the current policy by modifying its output and node transition functions locally at each node. As output from the backward pass, we obtain an updated policy using the improved output and node transition functions and , respectively. As NPGI may optimize a lower bound of the node values, we finally check if the value of the improved policy, , is greater than the value of the current best policy, and update the best policy if necessary.
Backward pass.
The backward pass of NPGI is shown in Algorithm 2. At each time step and for each agent, for each of the agent’s nodes at that time step, we maximize either the value or its lower bound with respect to the local policy parameters. In the following, we present the details for maximizing the lower bound; the algorithm for the exact value can be derived analogously.
For , we consider the last remaining action.
Fix a local node .
Denote the expected belief at as .
We write as the joint action where local actions of all other agents except are fixed to those specified by the current output function.
We solve
(8) 
where the outer expectation is under , the distribution over the nodes of agents other than , and the inner expectation is under . Note that in general, is different for each , as will be different. We assign equal to the local action that maximizes Eq. (8). Note that this modification of the policy does not invalidate any of the expected beliefs at the nodes in .
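The structure of the last-step improvement can be sketched as a weighted argmax over the local actions of one agent. This is an illustration under assumptions, not Algorithm 2 itself: `others` lists, for every node combination of the remaining agents, its conditional probability, the expected joint belief at the resulting joint node, and the action fixed by the other agents' current output functions; `value_fn` is a hypothetical callable returning the lower-bound value of a joint action at a belief.

```python
def improve_last_step(local_actions, others, value_fn):
    """Pick the local action that maximizes the expected lower-bound value.

    others: list of (probability, expected_belief, other_action) triples.
    value_fn(belief, joint_action): value of joint_action at the belief.
    """
    best_action, best_value = None, float("-inf")
    for a_i in local_actions:
        # Outer expectation over the nodes of the other agents.
        value = sum(p * value_fn(b, (a_i, a_other)) for p, b, a_other in others)
        if value > best_value:
            best_action, best_value = a_i, value
    return best_action, best_value


def toy_value(belief, joint_action):
    """Toy stand-in: each sensing agent helps more when the belief is uncertain."""
    n_sensing = sum(1 for a in joint_action if a == "sense")
    return n_sensing * (1.0 - max(belief))


others = [
    (0.6, [0.5, 0.5], "sense"),  # other agent senses, belief is uncertain
    (0.4, [0.9, 0.1], "idle"),   # other agent idles, belief is fairly certain
]
action, value = improve_last_step(["sense", "idle"], others, toy_value)
print(action, round(value, 3))
```

Because the other agents' actions stay fixed during the maximization, each call changes only one local node, which matches the local nature of the improvement step described above.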
For , we consider both the current action and the next nodes via the node transition function. Fix a local node , and define and similarly as above. Additionally, for any joint observation , define
as the next node in when transitions of all other agents except are fixed to those specified by the current node transition function. We solve
(9) 
where the outer expectation is under , and the inner expectation is under . We assign and to the respective maximizing values of Eq. (9). This assignment potentially invalidates the expected beliefs in for any nodes in for . However, as in the subsequent optimization steps we only require the expected beliefs for , , we do not need to repeat the forward pass.
Line 12 of Algorithm 2 checks if there exists a node that we have already optimized that has the same local policy as the current node. If such a node exists, we redirect all of the in-edges of the current node to the already optimized node instead. This redirection is required to maintain correct estimates of the respective node probabilities in the algorithm. If we redirected the in-edges of the current node, on Line 14 we randomize the local policy of the now useless node that has no in-edges, in the hopes that it may be improved on subsequent backward passes (to randomize the local policy of a node, we sample new local policies until we find one that is not identical to the local policy of any other node at the same time step; likewise, when randomly initializing a new policy in our experiments, we avoid including any nodes with identical local policies). If a node is to be improved that is unreachable, i.e., it has no in-edges or the probabilities of all histories ending in it are zero, we likewise randomize the local policy at that node.
Policy initialization.
We initialize a random policy for each agent with a given policy graph width for each as follows^{5}^{5}5At the last time step, it is only meaningful to have . In our experiments if , we instead set .. For example, for a problem with and , we create a policy similar to Fig. 1 for each agent, where there is one initial node , and 2 nodes at each time step . The action determined by the output function is sampled uniformly at random from