Planning for multiagent systems (MASs) under uncertainty is an important research problem in artificial intelligence. The decentralized partially observable Markov decision process (Dec-POMDP) is a general principled framework for addressing such problems. Many recent approaches to solving Dec-POMDPs propose to exploit locality of interaction Nair05AAAI, also referred to as value factorization Kumar11IJCAI. However, without making very strong assumptions, such as transition and observation independence Becker03AAMAS, there is no strict locality: in general, the actions of any agent may affect the rewards received in a different part of the system, even if that agent and the origin of that reward are (spatially) far apart. For instance, in a traffic network the actions taken in one part of the network will eventually influence the rest of the network Oliehoek08AAMAS.
A number of approaches have been proposed to generate solutions for large MASs Velagapudi11AAMAS; Yin11IJCAI; Oliehoek13AAMAS; Wu13IJCAI; Dibangoye14AAMAS; Varakantham14AAAI. However, these heuristic methods come without guarantees. In fact, since it has been shown that approximation (given some ε > 0, finding a solution with value within ε of optimal) of Dec-POMDPs is NEXP-complete Rabinovich03AAMAS, it is unrealistic to expect to find general, scalable methods that have such guarantees. However, the lack of guarantees also makes it difficult to meaningfully interpret the results produced by heuristic methods. In this work, we mitigate this issue by proposing a novel set of techniques that can be used to provide upper bounds on the performance of large factored Dec-POMDPs.
More generally, the ability to compute upper bounds is important for numerous reasons: 1) As stated above, they are crucial for a meaningful interpretation of the quality of heuristic methods. 2) Such knowledge of performance gaps is crucial for researchers to direct their focus to promising areas. 3) Such knowledge is also crucial for understanding which problems seem simpler to approximate than others, which in turn may lead to improved theoretical understanding of different problems. 4) Knowledge about the performance gap of the leading heuristic methods can also accelerate their real-world deployment, e.g., when their performance gap is proven to be small over sampled domain instances, or when the selection of which heuristic method to deploy is facilitated by clarifying the trade-off of computation and closeness to optimality. 5) Upper bounds on achievable value without communication may guide decisions on investments in communication infrastructure. 6) Last, but not least, these upper bounds can directly be used in current and future heuristic search methods, as we will discuss in some more detail at the end of this paper.
Computing upper bounds on the achievable value of a planning problem typically involves relaxing the original problem by making some optimistic assumptions. For instance, in the case of Dec-POMDPs typical assumptions are that the agents can communicate or observe the true state of the system Emery-Montemerlo04AAMAS; Szer05UAI_MAA; Roth05AAMAS; Oliehoek08JAIR. By exploiting the fact that transition and observation independence leads to a value function that is additively factored into a number of small components (we say that the value function is 'factored', or that the setting exhibits 'value factorization'), such techniques have been extended to compute upper bounds for so-called network-distributed POMDPs (ND-POMDPs) with many agents. This has greatly increased the size of the problems that can be solved Varakantham07AAMAS; Marecki08AAMAS; Dibangoye14AAMAS. Unfortunately, assuming both transition and observation independence (or, more generally, value factorization) narrows down the applicability of the model, and no techniques for computing upper bounds for more general factored Dec-POMDPs with many agents are currently known.
We address this problem by proposing a general technique for computing what we call influence-optimistic upper bounds. These are upper bounds on the achievable value in large-scale MASs formed by computing local influence-optimistic upper bounds on the value of sub-problems that consist of small subsets of agents and state factors. The key idea is that if we make optimistic assumptions about how the rest of the system will influence a sub-problem, we can decouple it and effectively compute a local upper bound on the achievable value. Finally, we show how these local bounds can be combined into a global upper bound. In this way, the major contribution of this paper is that it shows how we can compute factored upper bounds for models that do not admit factored value functions.
We empirically evaluate the utility of influence-optimistic upper bounds by investigating the quality guarantees they provide for heuristic methods, and by examining their application in a heuristic search method. The results show that the proposed bounds are tight enough to give meaningful quality guarantees for the heuristic solutions for factored Dec-POMDPs with hundreds of agents. (Footnote 1: In the paper, we use the word 'tight' for its (empirical) meaning of "close to optimal", not for its (theoretical CS) meaning of "coinciding with the best possible bound".) This is a major accomplishment since previous approaches that provide guarantees 1) have required very particular structure such as transition and observation independence Becker03AAMAS; Becker04AAMAS; Varakantham07AAMAS; Dibangoye14AAMAS or 'transition-decoupledness' combined with very specific interaction structures (transitions of an agent can be affected in a directed fashion and only by a small subset of other agents) Witwicki11PhD, and 2) have not scaled beyond 50 agents. In contrast, this paper demonstrates quality bounds in settings of hundreds of agents that all influence each other via their actions.
This paper is organized as follows. First, Section 2 describes the required background by introducing the factored Dec-POMDP model. Next, Section 3 describes the sub-problems that form the basis of our decomposition scheme. Section 4 proposes local influence-optimistic upper bounds for such sub-problems together with the techniques to compute them. Subsequently, Section 5 discusses how these local upper bounds can be combined into a global upper bound for large problems with many agents. Section 6 empirically investigates the merits of the proposed bounds. Section 7 places our work in the context of related work in more detail, and Section 8 concludes.
In this paper we focus on factored Dec-POMDPs Oliehoek08AAMAS ,
which are Dec-POMDPs where the transition and observation models can
be represented compactly as a two-stage dynamic Bayesian network (2DBN) Boutilier99JAIR:
A factored Dec-POMDP is a tuple ⟨D, A, O, X, T, O, R, b^0⟩ where:
D = {1, ..., n} is the set of agents.
A = A_1 × ... × A_n is the set of joint actions a = ⟨a_1, ..., a_n⟩.
O = O_1 × ... × O_n is the set of joint observations o = ⟨o_1, ..., o_n⟩.
X = {X_1, ..., X_|X|} is a set of state variables, or factors, that take values x_j ∈ X_j and thus span the set of states S = X_1 × ... × X_|X|.
T is the transition model, which is specified by a set of conditional probability tables (CPTs), one for each factor.
O is the observation model, specified by a CPT per agent.
R = {R_1, ..., R_ρ} is a set of local reward functions.
b^0 is the (factored) initial state distribution.
Each local reward function R_e has a state factor scope X(e) and agent scope A(e) over which it is defined: R_e(x_e, a_e). These local reward functions form the global immediate reward function via addition: R(s, a) = Σ_e R_e(x_e, a_e). (1) We slightly abuse notation and overload e to denote both an index into the set of reward functions, as well as the corresponding scopes.
Every Dec-POMDP can be converted to a factored Dec-POMDP, but the additional structure that a factored model specifies is most useful when the problem is weakly coupled, meaning that there is sufficient conditional independence in the 2DBN and that the scopes of the reward functions are small.
For instance, Fig. 1 shows the FireFightingGraph (FFG) problem Oliehoek13AAMAS, which we adopt as a running example. This problem defines a set of houses, each with a particular 'fire level' indicating if the house is burning and with what intensity. Each agent can fight fire at the house to its left or right, making observations of flames (or no flames) at the visited house. Each house has a local reward function associated with it, which depends on the next-stage fire level, as illustrated in Fig. 2(left), which shows the 2DBN for a 4-agent instantiation of FFG. (Footnote 2: FFG has rewards of the form R(x'), but we support more general forms.) The figure shows that the connections are local, but there is no transition independence Becker03AAMAS or value factorization Kumar11IJCAI; Witwicki11PhD: all houses and agents are connected such that, over time, the actions of each agent can influence the entire system. While FFG is a stylized example, such locally-connected systems can be found in applications such as traffic control Wu13IJCAI or communication networks Ooi96; Hansen04AAAI; Mahajan14AOR.
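To make the local structure concrete, the following Python sketch mimics FFG-style per-house dynamics. The fire levels, probabilities, and function names are illustrative assumptions for exposition, not the parameters of the actual benchmark.

```python
import itertools

# Toy sketch of FFG-style local dynamics; the fire levels and
# probabilities below are illustrative assumptions, not the
# parameters of the original benchmark.
FIRE_LEVELS = [0, 1, 2]  # 0 = no fire, 2 = fully ablaze

def next_level_dist(level, n_agents_present, neighbor_burning):
    """CPT sketch for one house: agents reduce fire, neighbors may ignite it."""
    if n_agents_present >= 2:
        return {0: 1.0}                      # two agents extinguish the fire
    if n_agents_present == 1:
        return {max(level - 1, 0): 1.0}      # one agent lowers the fire level
    if level == 0:
        # unattended house: may catch fire from a burning neighbor
        return {0: 0.2, 1: 0.8} if neighbor_burning else {0: 1.0}
    return {min(level + 1, 2): 1.0}          # unattended fire grows

def local_reward(next_level):
    """Local reward of the R(x') form used for FFG: burning is penalized."""
    return -next_level

# sanity check: every CPT entry is a proper probability distribution
for lvl, n, nb in itertools.product(FIRE_LEVELS, [0, 1, 2], [False, True]):
    assert abs(sum(next_level_dist(lvl, n, nb).values()) - 1.0) < 1e-9
```

Note how a house's next-stage distribution depends on its neighbors, so the model is locally connected but not transition independent.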
This paper focuses on problems with a finite horizon h. A policy π_i for an agent i specifies an action for each of its observation histories. The task of planning for a factored Dec-POMDP entails finding a joint policy π = ⟨π_1, ..., π_n⟩ with maximum value, i.e., expected sum of rewards:

V(π) = E[ Σ_{t=0}^{h-1} R(s^t, a^t) | b^0, π ].
Such an optimal joint policy is denoted π*. (Footnote 3: We omit the '*' on values; all values are assumed to be optimal with respect to their given arguments.)
In recent years, a number of methods have been proposed to find approximate solutions for factored Dec-POMDPs with many agents Pajarinen11IJCAI; Kumar11IJCAI; Velagapudi11AAMAS; Oliehoek13AAMAS; Wu13IJCAI, but none of these methods are able to give guarantees with respect to the solution quality (i.e., they are heuristic methods), leaving the user unable to confidently gauge how well these methods perform on their problems. This is a fundamental problem: even finding an ε-approximate solution is NEXP-complete Rabinovich03AAMAS, which implies that general and efficient approximation schemes are unlikely to be found. In this paper, we propose a way forward by trying to find instance-specific upper bounds in order to provide information about the solution quality offered by heuristic methods.
3 Sub-Problems and Influences
The overall approach that we take is to divide the problem into sub-problems (defined here), compute overestimations of the achievable value for each of these sub-problems (discussed in Section 4) and combine those into a global upper bound (Section 5).
3.1 Sub-Problems (SPs)
The notion of a sub-problem generalizes the concept of a local-form model (LFM) Oliehoek12AAAI_IBA to multiple agents and reward components. We give a relatively concise description of this formalization; for more details, please see Oliehoek12AAAI_IBA.
A sub-problem (SP) s of a factored Dec-POMDP is a tuple ⟨D_s, X_s, R_s⟩, where D_s ⊆ D, X_s ⊆ X and R_s ⊆ R denote subsets of agents, state factors, and local reward functions.
An SP inherits many features from the factored Dec-POMDP: we can define local states x_s, and the subsets induce local joint actions a_s, observations o_s, and rewards R_s(x_s, a_s).
However, this is generally not enough to end up with a fully specified, but smaller, factored Dec-POMDP. This is illustrated in Fig. 2(left), which shows the 2DBN for a sub-problem of FFG involving two agents and three houses (the dependence of observations on actions is not displayed). The figure shows that state factors (in this case, the two boundary houses) can be the target of arrows pointing into the sub-problem from the non-modeled (dashed) part. We refer to such state factors as non-locally affected factors (NLAFs) and denote them x_{s,j}, where s indexes the SP and j indexes the factor. The other state factors in X_s are referred to as only-locally affected factors (OLAFs). The figure clearly shows that the transition probabilities are not well-defined, since the NLAFs depend on the sources of the highlighted influence links. We refer to these sources as influence sources (in this case, the dashed variables from which those links originate). This means that an SP has an underspecified transition model.
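The NLAF/OLAF distinction can be made concrete with a small sketch: a factor is an NLAF exactly when some of its 2DBN parents lie outside the modeled part of the SP. The factor and agent names below are hypothetical, not taken from the paper's figures.

```python
# Sketch: split an SP's state factors into NLAFs and OLAFs by checking
# whether any 2DBN parent lies outside the modeled part of the SP.
def classify_factors(sp_factors, sp_agents, parents):
    """parents[f] = set of previous-stage variables (factors/actions) feeding f."""
    modeled = set(sp_factors) | set(sp_agents)
    nlafs = {f for f in sp_factors if parents[f] - modeled}  # external parent
    olafs = set(sp_factors) - nlafs
    return nlafs, olafs

# Hypothetical 3-house sub-problem: x0 and x4 are outside the SP,
# so the boundary factors x1 and x3 become NLAFs.
parents = {
    'x1': {'x0', 'x1', 'a1'},
    'x2': {'x1', 'x2', 'x3', 'a1', 'a2'},
    'x3': {'x2', 'x3', 'x4', 'a2'},
}
nlafs, olafs = classify_factors(['x1', 'x2', 'x3'], ['a1', 'a2'], parents)
```

In this toy scope structure, `nlafs` contains the two boundary factors and `olafs` the interior one, mirroring the situation in the figure.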
3.2 Structural Assumptions
In the most general form, the observation and reward model could also be underspecified. In order to simplify the exposition, we make two assumptions on the structure of an SP:
For all included agents i ∈ D_s, the state factors that can influence their observations (i.e., the ancestors of o_i in the 2DBN) are included in X_s.
For all included reward components e ∈ R_s, the state factors and actions that influence R_e are included in X_s and D_s.
That is, we assume that SPs exhibit generalized forms of observation independence,
and reward independence (cf. (1)). These are more general notions of observation and reward independence than used in previous work on TOI-Dec-MDPs Becker03AAMAS and ND-POMDPs Nair05AAAI, since we allow overlap on state factors that can be influenced by the agents themselves. (Footnote 4: Previous work only allowed 'external' or 'unaffectable' state factors to affect the observations or rewards of multiple components.)
Crucially, however, we do not assume any form of transition independence (for instance, the sets of SPs can overlap), nor do we assume any of the transition-decoupling (i.e., TD-POMDP Witwicki10ICAPS) restrictions. That is, we neither restrict which node types can affect 'private' nodes, nor do we disallow concurrent interaction effects on 'mutually modeled' nodes.
This means that assumptions 1 and 2 (above) that we do make are without loss of generality: it is possible to make any Dec-POMDP problem satisfy them by introducing additional (dummy) state factors. (Footnote 5: In contrast, TOI-Dec-MDPs and ND-POMDPs impose both transition and observation independence, thereby restricting consideration to a proper subclass of the problems considered here.)
3.3 Influence-Augmented SPs
An LFM can be transformed into a so-called influence-augmented local model, which captures the influence of the policies and parts of the environment that are not modeled in the local model Oliehoek12AAAI_IBA . Here we extend this approach to SPs, thus leading to influence-augmented sub-problems (IASPs).
Intuitively, the construction of an IASP consists of two steps: 1) capturing the influence of the non-modeled parts of the problem (given the policies of non-modeled agents) in an incoming influence point I_s, and 2) using this to create a model with a transformed transition model and no further dependence on the external problem.
Step (1) can be done as follows: an incoming influence point I_s can be specified as an incoming influence for each stage: I_s = (I_s^0, ..., I_s^{h-1}). Each such I_s^t corresponds to the influence that the SP experiences at stage t, and thus specifies the conditional probability distribution of the influence sources. That is, assuming that the influencing agents use deterministic policies that map observation histories to actions, I_s^t is the conditional probability distribution of the influence sources given the d-separating set d_s^t: the history of a subset of all the modeled variables that d-separates the modeled variables from the non-modeled ones. The influencing agents' deterministic policies enter this distribution via Kronecker delta functions on the actions they prescribe. (Footnote 6: d_s^t is defined such that, given d_s^t, the influence sources are conditionally independent of the other modeled variables; see Oliehoek12AAAI_IBA for details.)
Step (2) involves replacing the CPTs of all the NLAFs by the CPTs induced by I_s.
Let x'_{s,j} be an NLAF (with index j), and x_{src,j} (the instantiation of) the corresponding influence sources. Given the influence I_s^t and the d-separating set d_s^t, we define the induced CPT for x'_{s,j} as the CPT that specifies the probabilities

Pr(x'_{s,j} | x_s, a_s, d_s^t) = Σ_{x_{src,j}} I_s^t(x_{src,j} | d_s^t) Pr(x'_{s,j} | x_s, a_s, x_{src,j}).
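The induced CPT is a straightforward marginalization over the influence sources. A minimal sketch, with hypothetical callables standing in for the NLAF's CPT and the influence:

```python
# Sketch of the induced-CPT construction: marginalize the influence
# sources out under the incoming influence point. `influence(d)` plays
# the role of I_s^t(. | d); all names are illustrative.
def induced_cpt(cpt_nlaf, influence, x_local, a_local, dsep):
    """P(x' | x, a, d) = sum_src I(src | d) * P(x' | x, a, src)."""
    out = {}
    for src, p_src in influence(dsep).items():
        for x_next, p in cpt_nlaf(x_local, a_local, src).items():
            out[x_next] = out.get(x_next, 0.0) + p_src * p
    return out

# Toy usage: a uniform influence over two source values, and an NLAF
# whose next value is determined by the source.
influence = lambda d: {0: 0.5, 1: 0.5}
cpt = lambda x, a, src: {src: 1.0}
mixed = induced_cpt(cpt, influence, 'x', 'a', ('d0',))
```

Here `mixed` is the uniform mixture of the source-conditioned CPTs, exactly the weighted sum in the displayed equation.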
Finally, we can define the IASP.
An influence-augmented SP (IASP) for an SP is a factored Dec-POMDP with the following components:
The agents from the respective sub-problem participate (implying the same actions and observations).
The set of state factors is such that states ⟨x_s, d_s⟩ specify a local state of the SP, as well as the d-separating set for the next-stage influences.
Transitions are specified as follows: for all OLAFs we take the CPTs from the factored Dec-POMDP, but for all NLAFs we take the induced CPTs, leading to an influence-augmented transition model that is the product of the CPTs of the OLAFs and the NLAFs.
(Note that the local state and the current d-separating set together uniquely specify the next-stage d-separating set.)
The observation model follows directly from that of the factored Dec-POMDP (by Assumption 1, all its inputs are modeled).
The reward is identical to that of the SP.
Fig. 2(right) illustrates the IASP for FFG. It shows how the d-separating set acts as a parent for all NLAFs, thus replacing the dependence on the external part of the problem.
We write V_s(π) for the value that would be realized for the reward components modeled in sub-problem s under a given joint policy π:

V_s(π) = E[ Σ_{t=0}^{h-1} Σ_{e ∈ R_s} R_e(x_e^t, a_e^t) | b^0, π ].
As one can derive, given the policies π_{-s} of the agents not modeled in the sub-problem, the value of the optimal solution of an IASP constructed for the influence corresponding to π_{-s} is equal to the best-response value:
This extends the result in Oliehoek12AAAI_IBA to multiagent SPs.
4 Local Upper Bounds
In this section we present our main technical contribution: the machinery to compute influence-optimistic upper bounds (IO-UBs) for the value of sub-problems. In order to properly define this class of upper bounds, we first define the locally-optimal value:
The locally-optimal value V_s* for an SP s is the local value (considering only the rewards R_s) that can be achieved when all agents use a joint policy selected to optimize this local value. We will denote the maximizing argument by π_s*.
Note that V_s* ≥ V_s(π*), the value for the rewards R_s under the optimal joint policy π*, since π* optimizes the sum of all local reward functions: it might be optimal to sacrifice some local reward if that is made up for by higher rewards outside of the sub-problem.
The locally-optimal value expresses the maximal value achievable under a feasible incoming influence point; i.e., it is optimistic about the influence, but maintains that the influence is feasible. Computing this value can be difficult, since computing influences and subsequently constructing and optimally solving an IASP can be very expensive in general. However, it turns out that upper bounds to the locally-optimal value can be computed more efficiently, as discussed in Section 4.4.
The IO-UBs that we propose in the remainder of this section upper-bound the locally-optimal value by relaxing the requirement that the incoming influence be feasible, thus allowing for more efficient computation. We present three approaches that each overestimate the value by being optimistic with respect to the assumed influence, but that differ in the additional assumptions they make.
4.1 A Q-MMDP Approach
The first approach we consider is called influence-optimistic Q-MMDP (IO-Q-MMDP). Like all the heuristics we introduce, it assumes that the considered SP will receive the most optimistic (possibly infeasible) influence. In addition, it assumes that the SP is fully observable such that it reduces to a local multiagent MDP (MMDP) Boutilier96AAAI . In other words, this approach resembles Q-MMDP Szer05UAI_MAA ; Oliehoek08JAIR
, but is applied to an SP, and performs an influence-optimistic estimation of value. (Footnote 7: What we have termed "Q-MMDP" has been referred to in past work as "Q-MDP"; we add the extra M to emphasize the presence of multiple agents.) IO-Q-MMDP makes, in addition to influence optimism, another overestimation due to its assumption of full observability. While this negatively affects the tightness of the upper bound, it has the advantage that its computational complexity is relatively low.
Formally, we can describe IO-Q-MMDP as follows. In the first phase, we apply dynamic programming to compute the action-values for all local states:

Q^t(x_s, a_s) = R(x_s, a_s) + max_{x_src} Σ_{x'_s} Pr(x'_s | x_s, a_s, x_src) max_{a'_s} Q^{t+1}(x'_s, a'_s).   (6)
Comparing this equation to (3), it is clear that this equation is optimistic with respect to the influence: it selects the sources in order to select the most beneficial transition probabilities. In the second phase, we use these values to compute an upper bound by evaluating Q^0 at the initial state distribution restricted to the sub-problem:

Σ_{x_s} b^0(x_s) max_{a_s} Q^0(x_s, a_s).
This procedure is guaranteed to yield an upper bound to the locally-optimal value for the SP.
IO-Q-MMDP yields an upper bound to the locally-optimal value.
An inductive argument easily establishes that, due to the maximization it performs, (6) is at least as great as the Q-MMDP value (for all stages t) of any feasible influence, given by:
Moreover, it is well known that, for any Dec-POMDP, the Q-MMDP value is an upper bound to its value Szer05UAI_MAA , such that
We can conclude that the IO-Q-MMDP value is an upper bound to the Dec-POMDP value of the IASP induced by any feasible influence:
with the identities given by (5), thus proving the theorem.∎
The upshot of (6) is that there are no dependencies on d-separating sets and incoming influences anymore: the IO assumption effectively eliminates these dependencies. As a result, there is no need to actually construct the IASPs (which potentially have a very large state space) if all we are interested in is an upper bound.
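As a minimal illustration of the two-phase procedure, the sketch below performs influence-optimistic value iteration on a toy model; the model callables and their signatures are hypothetical stand-ins, not an interface from the paper.

```python
# Influence-optimistic Q-MMDP sketch: at each backup, choose the
# influence-source instantiation that maximizes the expected value,
# as in the optimistic backup described in the text.
def io_q_mmdp(states, actions, sources, trans, reward, horizon):
    """trans(x, a, src) -> {x': prob}; reward(x, a) -> float."""
    V = {x: 0.0 for x in states}
    Q = None
    for _ in range(horizon):
        Q = {}
        for x in states:
            for a in actions:
                # max over influence-source instantiations (the IO step)
                Q[(x, a)] = reward(x, a) + max(
                    sum(p * V[x2] for x2, p in trans(x, a, src).items())
                    for src in sources
                )
        V = {x: max(Q[(x, a)] for a in actions) for x in states}
    return Q, V

# Toy model: the source fully determines the next state, so the IO
# backup always assumes a transition to the best successor.
Q, V = io_q_mmdp(
    states=[0, 1], actions=['a'], sources=[0, 1],
    trans=lambda x, a, src: {src: 1.0},
    reward=lambda x, a: x, horizon=2,
)
```

In this toy instance the optimistic assumption lets every state reach the high-reward state in one step, which is exactly how the bound overestimates.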
4.2 A Q-MPOMDP Approach
The IO-Q-MMDP approach of the previous section introduces overestimations through influence-optimism as well as by assuming full observability. Here we tighten the upper bound by weakening the second assumption. In particular, we propose an upper bound based on the underlying multiagent POMDP (MPOMDP).
An MPOMDP Messias11NIPS24; Amato13MSDM is partially observable, but assumes that the agents can freely communicate their observations, such that the problem reduces to a special type of centralized model in which the decision maker (representing the entire team of agents) takes joint actions, and receives joint observations. As a result, the optimal value for an MPOMDP is analogous to that of a POMDP:

V^t(b) = max_a [ R(b, a) + Σ_o Pr(o | b, a) V^{t+1}(b_a^o) ],   (8)

where b_a^o is the joint belief resulting from performing Bayesian updating of b given a and o.
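The belief update referred to above is the standard Bayesian filter; a sketch with hypothetical model callables:

```python
# Standard (M)POMDP belief update: b'(x') ∝ P(o | x', a) Σ_x P(x' | x, a) b(x).
# `trans` and `obs_prob` are hypothetical stand-ins for the model.
def belief_update(b, a, o, trans, obs_prob):
    new = {}
    for x, p_x in b.items():
        for x2, p in trans(x, a).items():
            new[x2] = new.get(x2, 0.0) + obs_prob(x2, a, o) * p * p_x
    z = sum(new.values())  # Pr(o | b, a), the normalization constant
    if z == 0.0:
        raise ValueError("observation has zero probability under b and a")
    return {x2: p / z for x2, p in new.items()}

# Toy usage: a uniform transition and a perfectly informative observation
# collapse the belief onto the observed state.
b_next = belief_update(
    {0: 1.0}, 'a', 'hot',
    trans=lambda x, a: {0: 0.5, 1: 0.5},
    obs_prob=lambda x2, a, o: 1.0 if (o == 'hot') == (x2 == 1) else 0.0,
)
```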
Using the value function of the MPOMDP solution as a heuristic (i.e., an upper bound) for the value function of a Dec-POMDP is a technique referred to as Q-MPOMDP Roth05AAMAS ; Oliehoek08JAIR . Here we combine this approach with optimistic assumptions on the influences, leading to influence-optimistic Q-MPOMDP (IO-Q-MPOMDP).
In case the influence on an SP is fully specified, (8) can be readily applied to the IASP. However, we want to deal with the case where this influence is not specified. The basic, conceptually simple, idea is to move from the influence-optimistic MMDP-based upper-bounding scheme of Section 4.1 to one based on MPOMDPs. However, this presents a technical difficulty, since it is not directly obvious how to extend (6) to deal with partial observability. In particular, in the MPOMDP case as given by (8), the state is replaced by a belief over local states, and the influence sources affect the value both by manipulating the transition and observation probabilities and through the resulting beliefs.
To overcome these difficulties, we propose a formulation that is not directly based on (8), but that instead makes use of 'back-projected value vectors'. That is, it is possible to rewrite the optimal MPOMDP value function as:

V^t(b) = max_a Σ_o max_{v ∈ V^{t+1}} b · g_{a,o}^v,   (9)

(Footnote 8: In this section and the next, we will restrict ourselves to rewards of the form R(x') to reduce the notational burden, but the presented formulas can be extended to deal with more general formulations in a straightforward way.)
where b · g denotes the inner product Σ_x b(x) g(x), and where the g_{a,o}^v are the back-projections of the next-stage value vectors v:

g_{a,o}^v(x) = Σ_{x'} Pr(x', o | x, a) ( R(x') + v(x') ).   (10)
The key insight that enables carrying influence-optimism to the MPOMDP case is that this back-projected form (10) does allow us to take the maximum with respect to unspecified influences. That is, we define the influence-optimistic back-projection as:

ĝ_{a,o}^v(x_s) = max_{x_src} Σ_{x'_s} Pr(x'_s, o | x_s, a_s, x_src) ( R(x'_s) + v(x'_s) ).   (11)
Since this equation does not depend in any way on the d-separating sets and influence, we can completely avoid generating large IASPs. As for implementation, many POMDP solution methods Cassandra97UAI; Kaelbling98AI are based on such back-projections and can therefore be easily modified; all that is required is to substitute the back-projections with their modified form (11). When combined with an exact POMDP solver, such influence-optimistic backups lead to an upper bound on the locally-optimal value, to which we refer as IO-Q-MPOMDP.
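A sketch of such a modified back-projection follows; it omits the immediate-reward term for brevity, and all callables and names are illustrative stand-ins.

```python
# Sketch of an influence-optimistic back-projection: for each local
# state, take the maximum over influence-source instantiations
# (immediate reward omitted for brevity; names are illustrative).
def io_backproject(v, states, sources, trans, obs_prob, a, o):
    """g[x] = max_src Σ_x' P(o | x', a) P(x' | x, a, src) v[x']."""
    return {
        x: max(
            sum(obs_prob(x2, a, o) * p * v[x2]
                for x2, p in trans(x, a, src).items())
            for src in sources
        )
        for x in states
    }

# Toy usage: the source fully determines the successor, so every entry
# of the back-projected vector optimistically picks the best successor.
g = io_backproject(
    v={0: 0.0, 1: 1.0}, states=[0, 1], sources=[0, 1],
    trans=lambda x, a, src: {src: 1.0},
    obs_prob=lambda x2, a, o: 1.0, a='a', o='o',
)
```

Substituting this operation for the regular back-projection is the only change needed in a vector-based POMDP solver.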
To formally prove this claim, we will need to discriminate a few different types of value, and associated constructs. Let us define:
b_I, an MPOMDP belief over augmented states for the IASP induced by an arbitrary influence I_s,
V_I, the optimal value function when the IASP is solved as an MPOMDP, such that V_I(b_I) is the value of b_I. V_I is represented using vectors v.
b_s, an arbitrary distribution over local states x_s that can be thought of as the MPOMDP belief for the 'influence-optimistic SP'. (Footnote 9: For instance, in the case of FFG from Fig. 2, we can imagine an SP that encodes optimistic assumptions by assuming that the neighboring agents will always fight fire at the houses bordering the sub-problem. Even though it may not be possible to define such a model for all problems, since the optimistic influence could depend on the local (belief) state in intricate ways, this gives some interpretation to b_s.) Additionally, we exploit the fact that construction of an optimistic SP model is possible for the domains considered in Section 6.
V̂_s, the value function computed by an (exact) influence-optimistic MPOMDP method, which assigns a value to any b_s. V̂_s is represented using vectors v̂. The IO-Q-MPOMDP upper bound is defined by plugging in the true initial state distribution restricted to the factors of the sub-problem: V̂_s(b_s^0).
First we establish a relation between the different vectors representing V_I and V̂_s.
Let π_s be a k-steps-to-go policy. Let v and v̂ be the vectors induced by π_s under regular MPOMDP back-projections (for some influence I_s), and under IO back-projections, respectively. Then, for all augmented states, v(⟨x_s, d_s⟩) ≤ v̂(x_s).
The proof is listed in Appendix A.∎
This lemma provides a strong result on the relation of values computed under regular MPOMDP backups versus influence-optimistic ones. It allows us to establish the following theorem:
For an SP s and for all beliefs b_I, V_I(b_I) ≤ V̂_s(b_s), provided that b_s coincides with the marginals of b_I:

b_s(x_s) = Σ_{d_s} b_I(⟨x_s, d_s⟩).   (A1)
We start with the left-hand side: writing V_I(b_I) as the inner product of b_I with its maximizing vector and applying the lemma above yields V_I(b_I) ≤ V̂_s(b_s), thus proving the theorem. ∎
IO-Q-MPOMDP yields an upper bound to the locally-optimal value.
The initial beliefs are defined such that the above condition (A1) holds. That is, b_s^0(x_s) = Σ_{d_s} b_I^0(⟨x_s, d_s⟩). Therefore, application of Theorem 8 to the initial belief yields V_I(b_I^0) ≤ V̂_s(b_s^0). It is well known that the MPOMDP value is an upper bound to the Dec-POMDP value Oliehoek08JAIR, and we can immediately conclude that the locally-optimal value is upper bounded by V̂_s(b_s^0), with the identities given by (5), proving the result. ∎
4.3 A Dec-POMDP Approach
The previous approaches compute upper bounds by, apart from the IO assumption, additionally making optimistic assumptions on observability or communication capabilities. Here we present a general method for computing Dec-POMDP-based upper bounds that, other than the optimistic assumptions about neighboring SPs, makes no additional assumptions and thus provides the tightest bounds of the three that we propose. This approach builds on the recent insight MacDermed13NIPS26; Dibangoye13IJCAI; Oliehoek13JAIR that a Dec-POMDP can be converted to a special case of POMDP (for an overview of this reduction, see Oliehoek14IASTR); we can thereby leverage the influence-optimistic back-projection (11) to compute an IO-UB that we refer to as IO-Q-Dec-POMDP.
As in the previous two sub-sections, we will leverage optimism with respect to an influence-augmented model that we will never need to construct. In particular, as explained in Section 3 we can convert an SP to an IASP given an influence . Since such an IASP is a Dec-POMDP, we can convert it to a special case of a POMDP:
A plan-time influence-augmented sub-problem (PT-IASP) is a tuple consisting of the following components:
The set of states; each state augments an IASP state with a joint observation history of the SP.
The set of actions; each action corresponds to a local joint decision rule (a mapping from observation histories to joint actions) in the SP.
The transition function, defined below.
The observation model, which specifies that a single (null) observation is received with probability 1 (irrespective of the state and action).
The horizon, which is not modified.
The initial state distribution. Since at stage 0 there is only one joint observation history (i.e., the empty one), it follows directly from the initial state distribution of the SP.
The transition function assigns positive probability only to successor states whose joint observation history extends the current one with the received joint observation; in that case the probability is the product of the IASP's state-transition and observation probabilities under the joint action selected by the decision rule, and it is 0 otherwise. The probabilities in this construction are given by the IASP (cf. Section 3). (Footnote 10: Remember that the next-stage d-separating set is a function of the specified quantities.)
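To illustrate the plan-time construction, the sketch below computes the successor distribution over augmented (state, observation-history) pairs under a decision rule; all model callables and names are hypothetical stand-ins.

```python
# Sketch: a plan-time transition over augmented states (x, obs_history).
# A decision rule maps each joint observation history to a joint action.
def pt_transition(x, history, decision_rule, trans, obs_prob, observations):
    """Return {(x', history + (o,)): prob} under the action δ(history)."""
    a = decision_rule[history]
    out = {}
    for x2, p_x2 in trans(x, a).items():
        for o in observations:
            p = p_x2 * obs_prob(x2, a, o)
            if p > 0.0:
                key = (x2, history + (o,))
                out[key] = out.get(key, 0.0) + p
    return out

# Toy usage: deterministic dynamics and a single observation, so the
# augmented state moves to (1, ('o',)) with probability one.
succ = pt_transition(
    0, (), {(): 'act'},
    trans=lambda x, a: {1: 1.0},
    obs_prob=lambda x2, a, o: 1.0,
    observations=['o'],
)
```

Note that, as in the text, the observation history grows inside the state; the plan-time "observation" itself carries no information.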
This reduction shows that it is possible to compute the optimal value for an SP given an influence point, but the formulation is subject to the same computational burden as solving a regular IASP: constructing it is complex due to the inference that needs to be performed to compute the influence, and subsequently solving the IASP is complex due to the large number of augmented states.
Fortunately, here too we can compute an upper bound for any feasible incoming influence, and thus for the locally-optimal value, by using optimistic backup operations with respect to an underspecified model, to which we refer as simply the plan-time SP:
We define the plan-time sub-problem as an under-specified POMDP with states that pair a local state with a joint observation history, an underspecified transition model, and the remaining components as above.
Since this model is a special case of a POMDP, the theory developed in Section 4.2 applies: we can maintain a plan-time sufficient statistic (essentially the 'belief' over the augmented states) and we can write down the value function using (9). Most importantly, the IO back-projection (11) also applies, which means that (similar to the MPOMDP case) we can avoid ever constructing the full PT-IASP. The IO back-projection in this case translates to:

ĝ^v_δ(⟨x_s, ō_s⟩) = max_{x_src} Σ_{x'_s, o'_s} Pr(x'_s, o'_s | x_s, δ(ō_s), x_src) ( R(x'_s) + v(⟨x'_s, (ō_s, o'_s)⟩) ).   (12)

Here, we omitted the superscript for the observation. Also note that the observation o in (11) corresponds to the single (null) observation in the plan-time model; since the observation histories are part of the states, the next joint observation o'_s comes out of the transition model.
Again, given this modified back-projection, the IO-Q-Dec-POMDP value can be computed using any exact POMDP solution method that makes use of vector back-projections; all that is required is to substitute the back-projections with their modified form (12).
IO-Q-Dec-POMDP yields an upper bound to the locally-optimal value.
Directly, by applying Theorem 9 to the plan-time SP. ∎
4.4 Computational Complexity
Due to the maximization in (6), (11) and (12), IO back-projections are more costly than regular (non-IO) back-projections. Here we analyze the computational complexity of the proposed algorithms relative to the regular, non-IO, backups.
We start by comparing the IO-Q-MMDP backup operation (6) to the regular MMDP backup for an SP that does not have incoming influences. For such an SP, the MMDP backup is given by

Q^t(x_s, a_s) = R(x_s, a_s) + Σ_{x'_s} Pr(x'_s | x_s, a_s) max_{a'_s} Q^{t+1}(x'_s, a'_s).   (13)
Comparing (6) and (13), we see two differences: 1) the transition probabilities can be written as one term, since we do not need to discriminate NLAFs from OLAFs, and 2) there is no maximization over influence sources. The first difference does not induce a change in computational cost, only in notation: in both cases the entire transition probability is given as the product of next-stage CPTs. The second difference does induce a change in computational cost: in (6), in order to select the maximum, the inner part of the right-hand side needs to be evaluated for each instantiation of the influence sources. That is, the computational complexity of a Q-MMDP backup (for a particular state-action pair) is
whereas the total comput