1 Introduction
Decentralized partially observable Markov decision processes (DecPOMDPs) [3, 25] provide a general framework for solving the cooperative multiagent sequential decision-making problems that arise in numerous applications, including robotic soccer [24], transportation [4], extraplanetary exploration [8], and traffic control [35]. A DecPOMDP can be viewed as a POMDP controlled by multiple distributed agents. These agents make decisions based on their own local streams of information (i.e., observations), and their joint actions control the global state dynamics and the expected reward of the team. Because decision making is decentralized, an individual agent generally does not have enough information to compute the global belief state, which is a sufficient statistic for decision making in POMDPs. This makes generating an optimal solution in a DecPOMDP more difficult than for a POMDP [10], especially for long planning horizons.
To circumvent the difficulty of solving long-horizon DecPOMDPs optimally while still generating a high-quality policy, this paper presents scalable learning methods using a finite-memory policy representation. For infinite-horizon problems (which continue for an infinite number of steps), significant progress has been made with agent policies represented as finite-state controllers (FSCs) that map observation histories to actions [9, 2]. Recent work has shown that expectation-maximization (EM) [14] is a scalable method for generating controllers for large DecPOMDPs [19, 27]. In addition, EM has been shown to be an efficient algorithm for policy-based reinforcement learning (RL) in DecPOMDPs, where agents learn FSCs based on trajectories, without knowing or learning the DecPOMDP model [35].
An important and yet unanswered question is how to choose an appropriate number of nodes in each FSC. Previous work assumes a fixed FSC size for each agent, but the number of nodes affects both the quality of the policies and the convergence rate. When the number of nodes is too small, the FSC is unable to represent the optimal policy and therefore quickly converges to a suboptimal result. By contrast, when the number is too large, the FSC overfits the data, often yielding slow convergence and, again, a suboptimal policy.
This paper uses a Bayesian nonparametric approach to determine the appropriate controller size in a variable-size FSC. Following previous methods [35, 25], learning is assumed to be centralized, and execution is decentralized. That is, learning is accomplished offline based on all available information, but the optimization is only over decentralized solutions. Such a controller is constructed using the stick-breaking (SB) prior [18]. The SB prior allows the number of nodes to be variable, but the set of nodes that is actively used by the controller is encouraged to be compact. The nodes that are actually used are determined by the posterior, combining the SB prior and the information from trajectory data. The framework is called the decentralized stick-breaking policy representation (DecSBPR) to recognize the role of the SB prior.
In addition to the use of variable-size FSCs, the paper makes several other contributions. Specifically, our algorithm operates directly on the (shifted) empirical value function of DecPOMDPs, which is simpler than the likelihood functions (a mixture of dynamic Bayes nets (DBNs)) in existing planning-as-inference frameworks [19, 35]. Moreover, we derive a variational Bayesian (VB) algorithm for learning the DecSBPR based only on the agents’ trajectories (or episodes) of actions, observations, and rewards. The VB algorithm is linear in the number of agents and at most quadratic in the problem size, and is therefore scalable to large application domains. In practice, these trajectories can be generated by a simulator or provided as a set of real-world experiences. This batch data scenario is general and realistic, as it is widely adopted in learning from demonstration [23] and reinforcement learning. To the best of our knowledge, this is the first application of Bayesian nonparametric methods to the difficult and little-studied problem of policy-based RL in DecPOMDPs, and the proposed method is able to generate high-quality solutions for large problems.
2 Background and Related Work
Before introducing the proposed method, we first describe the DecPOMDP model and some related work.
2.1 Decentralized POMDPs
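A DecPOMDP can be represented as a tuple $M = \langle \mathcal{N}, \mathcal{A}, \mathcal{S}, \mathcal{O}, \mathcal{T}, \Omega, \mathcal{R}, \gamma \rangle$, where $\mathcal{N} = \{1, \dots, N\}$ is a finite set of agent indices; $\mathcal{A} = \otimes_n \mathcal{A}_n$ and $\mathcal{O} = \otimes_n \mathcal{O}_n$ respectively are sets of joint actions and observations, with $\mathcal{A}_n$ and $\mathcal{O}_n$ available to agent $n$. At each step, a joint action $\vec{a} = (a_1, \dots, a_N) \in \mathcal{A}$ is selected and a joint observation $\vec{o} = (o_1, \dots, o_N) \in \mathcal{O}$ is received; $\mathcal{S}$ is a finite set of world states; $\mathcal{T}$ is the state transition function with $\mathcal{T}(s' \mid s, \vec{a})$ denoting the probability of transitioning to $s'$ after taking joint action $\vec{a}$ in $s$; $\Omega$ is the observation function with $\Omega(\vec{o} \mid s', \vec{a})$ the probability of observing $\vec{o}$ after taking joint action $\vec{a}$ and arriving in state $s'$; $\mathcal{R}$ is the reward function with $r(s, \vec{a})$ the immediate reward received after taking joint action $\vec{a}$ in $s$; $\gamma \in [0, 1)$ is a discount factor. A global reward signal is generated for the team of agents after joint actions are taken, but each agent only observes its local observation. Because each agent lacks access to other agents’ observations, each agent maintains a local policy $\Psi_n$, defined as a mapping from local observation histories to actions. A joint policy $\Psi = (\Psi_1, \dots, \Psi_N)$ consists of the local policies of all agents. For an infinite-horizon DecPOMDP with initial belief state $b_0$, the objective is to find a joint policy $\Psi$ such that the value of $\Psi$ starting from $b_0$, $V^{\Psi}(b_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \vec{a}_t) \mid b_0, \Psi\right]$, is maximized.
An FSC is a compact way to represent a policy as a mapping from histories to actions. Formally, a stochastic FSC for agent $n$ is defined as a tuple $\Theta_n = \langle \mathcal{A}_n, \mathcal{O}_n, \mathcal{Z}_n, \mu_n, W_n, \pi_n \rangle$, where $\mathcal{A}_n$ and $\mathcal{O}_n$ are the same as defined in the DecPOMDP; $\mathcal{Z}_n$ is a finite set of controller nodes for agent $n$; $\mu_n$ is the initial node distribution with $\mu_n^{z}$ the probability of agent $n$ initially being in $z$; $W_n$ is a set of Markov transition matrices with $W_{n,a,o}^{z,z'}$ denoting the probability of the controller node transiting from $z$ to $z'$ when agent $n$ takes action $a$ in $z$ and sees observation $o$; $\pi_n$ is a set of stochastic policies with $\pi_{n,z}^{a}$ the probability of agent $n$ taking action $a$ in $z$.
For simplicity, we use the following notational conventions. $\mathcal{Z}_n = \{1, 2, \dots, |\mathcal{Z}_n|\}$, where $|\mathcal{Z}_n|$ is the cardinality of $\mathcal{Z}_n$, and $\mathcal{A}_n$ and $\mathcal{O}_n$ follow similarly. $\Theta = \{\Theta_1, \dots, \Theta_N\}$ is the joint FSC of all agents. A consecutively indexed variable is abbreviated as the variable with the index range shown in the subscript or superscript; when the index range is obvious from the context, a simple “:” is used instead. Thus, $a_{n,0:T} = (a_{n,0}, \dots, a_{n,T})$ represents the actions of agent $n$ from step $0$ to $T$, and $W_{n,a,o}^{z,:} = (W_{n,a,o}^{z,1}, \dots, W_{n,a,o}^{z,|\mathcal{Z}_n|})$ represents the node transition probabilities for agent $n$ when starting in node $z$, taking action $a$, and seeing observation $o$. Given $h_{n,t} = \{a_{n,0:t-1}, o_{n,1:t}\}$, a local history of actions and observations up to step $t$, as well as an agent controller $\Theta_n$, we can calculate a local policy $p(a_{n,t} \mid h_{n,t}, \Theta_n)$, the probability that agent $n$ chooses its action $a_{n,t}$.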
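The FSC definition above can be made concrete with a minimal sketch of one agent's stochastic controller. This is only an illustration under stated assumptions: the class name `StochasticFSC` and the argument names `mu`, `pi`, and `W` are hypothetical, actions, observations, and nodes are encoded as integer indices, and the probability tables are passed in as plain nested lists rather than learned.

```python
import random


class StochasticFSC:
    """Illustrative stochastic finite-state controller for a single agent."""

    def __init__(self, mu, pi, W, rng=None):
        # mu[z]: probability of starting in node z
        # pi[z][a]: probability of taking action a while in node z
        # W[(a, o)][z][z2]: probability of moving from node z to node z2
        #                   after taking action a and seeing observation o
        self.mu, self.pi, self.W = mu, pi, W
        self.rng = rng or random.Random(0)
        self.last_action = None
        self.node = self._sample(mu)

    def _sample(self, probs):
        # Draw an index from a discrete distribution given as a list.
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def act(self):
        # Choose an action from the current node's action distribution.
        self.last_action = self._sample(self.pi[self.node])
        return self.last_action

    def observe(self, obs):
        # Update the node using the last action and the new local observation.
        row = self.W[(self.last_action, obs)][self.node]
        self.node = self._sample(row)
```

Because the controller conditions only on its own node, last action, and local observation, executing one such object per agent is decentralized by construction.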
3 Bayesian Learning of Policies
Because EM algorithms infer policies based on a fixed-size representation and observed data only, it is difficult for them to explicitly handle model uncertainty and encode prior (or expert) knowledge. To address these issues, a Bayesian learning method is proposed in this section. This is accomplished by measuring the likelihood of the joint FSC using the (shifted) empirical value function, which is combined with the prior in Bayes’ rule to yield the posterior
(1) 
where is the marginal likelihood of the joint FSC and, up to an additive constant, proportional to the marginal value function,
(2)  
(3) 
To compute the posterior, Markov chain Monte Carlo (MCMC) simulation [32] is the most straightforward method. However, MCMC is costly in terms of computation and storage, and lacks a strong convergence guarantee. An alternative is a variational Bayes (VB) method [7], which performs approximate posterior inference by minimizing the Kullback-Leibler (KL) divergence between the true and approximate posterior distributions. Because the VB method has a (local) convergence guarantee and is able to trade off scalability and accuracy, we focus on the derivation of the VB method here. Denoting as the variational approximation to , and as the approximation to , a VB objective function (see the appendix for derivation details) is
(4) 
where
(5) 
is the lower bound of and
(6) 
is the reweighted reward. Since in equation (4) is independent of and , minimizing the KL divergence is equivalent to maximizing the lower bound, leading to the following constrained optimization problem,
(7)  
(8)  
(9)  
(10) 
where the constraint in the second line arises both from the mean-field approximation and from the decentralized policy representation, and the last two lines summarize the normalization constraints. It is worth emphasizing that we developed this variational mean-field approximation to optimize a decentralized policy representation, showing that the VB learning problem formulation (7) is both a general and an accurate method for the multiagent problem considered in this paper.
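The equivalence used above, that minimizing the KL divergence to the true posterior is the same as maximizing a lower bound, can be checked numerically on a toy problem. The sketch below is a generic discrete illustration, not the paper's objective (4)-(5): the three-valued parameter, the `prior` and `lik` numbers, and the function names are all hypothetical, with `lik` standing in for the role the empirical value function plays in this section.

```python
import math


def elbo(q, prior, lik):
    """Lower bound sum_i q_i * log(prior_i * lik_i / q_i)."""
    return sum(qi * math.log(pi * li / qi)
               for qi, pi, li in zip(q, prior, lik) if qi > 0)


def kl(q, p):
    """KL divergence KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical toy numbers: a prior over three parameter values and
# positive likelihood-like weights for each value.
prior = [0.5, 0.3, 0.2]
lik = [1.0, 4.0, 2.0]

evidence = sum(p * l for p, l in zip(prior, lik))            # marginal
posterior = [p * l / evidence for p, l in zip(prior, lik)]   # Bayes' rule

q = [0.2, 0.5, 0.3]  # an arbitrary approximate posterior
gap = math.log(evidence) - elbo(q, prior, lik)
print(gap, kl(q, posterior))  # the two quantities coincide
```

Since the log marginal does not depend on the approximating distribution, any increase in the bound is exactly matched by a decrease in the KL divergence, and the bound is tight when the approximation equals the true posterior.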
3.1 Stick-breaking Policy Priors
To solve the Bayesian learning problem described above and obtain variable-size FSCs, the stick-breaking prior is used to specify the policy’s structure. As such, DecSBPR is formally given in Definition 1.
Definition 1.
The decentralized stick-breaking policy representation (DecSBPR) is a tuple ( ), where and are as in the definition of DecPOMDP; is an unbounded set of nodes indexed by positive integers; for notational simplicity, are assumed to be deterministic with (nonparametric priors over can also be used); determine , the FSC parameters defined in Section 2.1, as follows
(11) 
where represents the Dirichlet distribution and represents the stick-breaking process with and , and .
DecSBPR differs from previous nonparametric Bayesian RL methods [21, 16]. Specifically, DecSBPR performs policy-based RL and generalizes the nonparametric Bayesian policy representation of POMDPs [21] to the decentralized domain. In contrast, [16] is a model-based RL method that does not assume knowledge of the world model, but explicitly learns it and then performs planning. Moreover, DecSBPR is further distinguished from previous methods [16, 21] by the prior distributions and inference methods employed. These previous methods employed hierarchical Dirichlet process hidden Markov models (HDP-HMMs) to infer the number of controller nodes. However, due to the lack of conjugacy between the two levels of DPs in the HDP-HMM, a fully conjugate Bayesian variational inference does not exist (the VB method in [12] imposes point-mass proposals over the top-level DPs and thus lacks an uncertainty measure). Therefore, these methods used MCMC, which requires high computational and storage costs, making them poorly suited to large problems. By contrast, DecSBPR employs single-layer SB priors over FSC transition matrices and sparse Gamma priors over SB weight hyperparameters to bias transitions toward nodes with smaller indices. A similar framework has been explored to infer HMMs, and we refer readers to [26] for more details.
It is worth noting that SB processes subsume Dirichlet processes (DPs) [17] as a special case, when (in DecSBPR). The purpose of using SB priors is to encourage a small number of FSC nodes. Compared to a DP, the SB priors can represent richer patterns of sparse transition between the nodes of an FSC, because they allow arbitrary correlation between the stick-breaking weights (the weights are always negatively correlated in a DP).
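How a stick-breaking construction concentrates mass on low-indexed nodes can be illustrated with a short sketch. This is a generic truncated stick-breaking sampler, not the exact prior in (11): the function names are hypothetical, and drawing each fraction from a Beta distribution with per-index parameters `a[i]`, `b[i]` is an assumption made for illustration.

```python
import random


def stick_breaking_weights(fractions):
    """Map stick fractions v_1..v_K to weights w_i = v_i * prod_{j<i}(1 - v_j).

    Returns the weights and the length of stick left over.
    """
    weights, remaining = [], 1.0
    for v in fractions:
        weights.append(v * remaining)   # break off a fraction of what is left
        remaining *= (1.0 - v)          # shrink the remaining stick
    return weights, remaining


def sample_sb_weights(a, b, truncation, rng):
    """Sample truncated stick-breaking weights with v_i ~ Beta(a[i], b[i]).

    The last fraction is fixed to 1 so the truncated weights sum to 1.
    """
    fractions = [rng.betavariate(a[i], b[i]) for i in range(truncation - 1)]
    fractions.append(1.0)
    return stick_breaking_weights(fractions)[0]


rng = random.Random(0)
print(sample_sb_weights([1.0] * 4, [2.0] * 4, truncation=5, rng=rng))
```

Each weight is carved from whatever stick remains after the earlier breaks, which is why earlier (smaller-index) nodes tend to receive most of the transition probability unless the hyperparameters say otherwise.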
3.2 Variational Stick-breaking Policy Inference
It is shown in [18] that the random weights constructed by the SB prior are equivalently governed by a generalized Dirichlet distribution (GDD) and are therefore conjugate to the multinomial distribution; hence an efficient variational Bayesian algorithm for learning the decentralized policies can be derived. To accommodate an unbounded number of nodes, we apply the retrospective representation of SB priors [28] to the DecSBPR. For agent , the SB prior is set with a truncation level , taking into account the current occupancy as well as additional nodes reserved for future new occupancies. The solution to (7) under the stick-breaking priors is given in Theorem 2, the proof of which is provided in the appendix.
Theorem 2.
Let be constructed by the SB priors defined in (11) with hyperparameters . Then iterative application of the following updates leads to a monotonic increase of (5) until convergence to a local maximum. The updates of are
(12) 
where is computed using (6) with replaced by , a set of undernormalized probability (mass) functions , with , and , and denotes expectations of with respect to distributions . The hyperparameters of the posterior distribution are updated as
(13)  
(14) 
with , where is the indicator function, and both and are marginals of , i.e.
(15)  
(16) 
The update equations in Theorem 2 constitute the VB algorithm for learning variable-size joint FSCs under SB priors with batch data. In particular, (12) is a policy-evaluation step where the rewards are reweighted to reflect the improved marginal value of the new policy posterior updated in the previous iteration, and (13) is a policy-improvement step where the reweighted rewards are used to further improve the policy posterior. Both steps require (15), which is computed based on and , . The are forward-backward messages; their update equations are derived in the appendix.
To determine the number of controller nodes , the occupancy of a node is computed by checking if there is a positive reward assigned to it. For example, for action and node , is the reward being assigned. If this quantity is greater than zero, then node is visited. Summing over all actions gives the value of node . Hence can be computed based on the following formula
(17) 
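The occupancy check described above can be sketched as follows. This is a simplified illustration rather than formula (17) itself: `assigned_reward` is a hypothetical table whose entry `[z][a]` holds the (nonnegative) reweighted reward assigned to node `z` and action `a`, and the optional `tol` threshold is an added assumption for numerical robustness.

```python
def count_occupied_nodes(assigned_reward, tol=0.0):
    """Count nodes whose reward, summed over all actions, exceeds tol."""
    return sum(1 for per_action in assigned_reward if sum(per_action) > tol)


# Three candidate nodes, two actions; the middle node is never used.
table = [
    [0.2, 0.0],
    [0.0, 0.0],
    [0.0, 1.5],
]
print(count_occupied_nodes(table))  # 2
```

Nodes that receive no reward mass are left unoccupied, so the inferred controller size can stay well below the truncation level.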
The complete algorithm is described in Algorithm 1. Upon convergence of Algorithm 1, point estimates of the decentralized policies may be obtained by calculating the expectations , , and (see the appendix for details).
[Table 1: columns are var, best case, and worst case.]
3.3 Computational complexity
The time complexity of Algorithm 1 for each iteration is summarized in Table 1, assuming the length of an episode is on the order of magnitude of , and the number of nodes per controller is on the order of magnitude of . In Table 1, the worst case refers to when there is a nonzero reward at every time step of an episode (dense rewards), while the best case is when a nonzero reward is received only at the terminal step. Hence, in general, the algorithm scales linearly with the number of episodes and the number of agents. The time dependency on is between linear and quadratic. In any case, the computational complexity of Algorithm 1 is independent of the number of states, making it scalable to large problems.
3.4 Exploration and Exploitation Tradeoff
Algorithm 1 assumes off-policy batch learning where trajectories are collected using a separate behavior policy. This is appropriate when data has been generated from real-world or simulated experiences without any input from the learning algorithm (e.g., learning from demonstration). Off-policy learning is efficient if the behavior policy is close to optimal, as in the case when expert information is available to guide the agents. With a random behavior policy, it may take a long time for the policy to converge to optimality; in this case, the agents may want to exploit the policies learned so far to speed up the learning process.
An important issue concerns keeping a proper balance between exploration and exploitation to prevent premature convergence to a suboptimal policy while still allowing the algorithm to learn quickly. Since the execution of DecPOMDP policies is decentralized, it is difficult to design an efficient exploration strategy that guarantees optimality. The authors of [35] count the visiting frequency of FSC nodes and apply an upper-confidence-bound style heuristic to select the next controller nodes, and use an $\epsilon$-greedy strategy to select actions. However, $\epsilon$-greedy might be sample inefficient. [6] proposed a distributed learning approach where agents take turns learning the best response to each other’s policies. This framework applies an R-max type of heuristic, using the counts of trajectories to distinguish known and unknown histories, to trade off exploration and exploitation. However, this method is confined to tree-based policies in finite-horizon problems, and requires synchronized multiagent learning.
To better accommodate our Bayesian policy learning framework for RL in infinite-horizon DecPOMDPs, we define an auxiliary FSC, , to represent the policy of each agent in balancing exploration and exploitation. To avoid confusion, we refer to as a primary FSC. The only two components distinguishing from are and , where encodes exploration () or exploitation (), and with denoting the probability of agent choosing in . One can express in the same way as one expresses (which is described in Section 2.1). The behavior policy of agent is given as
(18) 
where is the primary FSC policy, and is the exploration policy of agent , which is usually a uniform distribution.
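One plausible reading of such a behavior policy is a mixture of the primary FSC policy and the exploration policy, weighted by the probability of exploring. The sketch below assumes that mixture form, which is not confirmed by the text reproduced here; the function name and the `explore_prob` argument are hypothetical, and the exploration policy defaults to uniform as described above.

```python
def behavior_action_probs(primary_probs, explore_prob, exploration_probs=None):
    """Mix a primary action distribution with an exploration distribution.

    primary_probs: action probabilities from the primary FSC at the current node
    explore_prob: probability of exploring (assumed to come from the auxiliary FSC)
    exploration_probs: exploration policy; uniform if not given
    """
    n = len(primary_probs)
    if exploration_probs is None:
        exploration_probs = [1.0 / n] * n  # uniform exploration policy
    return [
        (1.0 - explore_prob) * p + explore_prob * e
        for p, e in zip(primary_probs, exploration_probs)
    ]


print(behavior_action_probs([1.0, 0.0], explore_prob=0.5))  # [0.75, 0.25]
```

Setting `explore_prob` to 0 recovers pure exploitation of the primary FSC, and setting it to 1 recovers the exploration policy alone.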
4 Experiments
The performance of the proposed algorithms is evaluated on five benchmark problems [1] and a large-scale problem (traffic control) [35]. The experimental procedure in [35] was used for all the results reported here. For DecSBPR, the hyperparameters in (11) are set to and to promote sparse usage of FSC nodes (these values were chosen for testing, but our approach is robust to other values of and ). The policies are initialized as FSCs converted from the episodes with the highest rewards using a method similar to [5].
[Table 2. Column groups: policy learning (unknown model): DecSBPR (fixed iteration), DecSBPR (fixed time), MCEM; planning (known model): PeriEM, FBHSVI. Rows: DecTiger (2, 3, 3); Broadcast (4, 2, 5); Recycling Robots (3, 3, 2); Box Pushing (100, 4, 5); Mars Rovers (256, 6, 8).]
Learning variable-size FSCs vs. learning fixed-size FSCs
To demonstrate the advantage of learning variable-size FSCs, DecSBPR is compared with an implementation of the previous EM algorithm [35]. The comparison is for the Mars Rover problem using episodes to learn the FSCs (with a smaller training sample size , our method still performs robustly, as shown in the appendix) and evaluating the policy by the discounted accumulated reward averaged over 100 test episodes of 1000 steps. Here, we consider off-policy learning and apply a semi-random policy to collect samples. Specifically, the learning agent is allowed access to episodes collected by taking actions according to a POMDP algorithm (point-based value iteration (PBVI) [29]). Let be the probability that the agents follow the PBVI policy and be the probability that the agents take random actions. This procedure mimics the approach used in previous work [35]. The results with are reported in Figure 1, which shows the exact value and computation time as a function of the number of controller nodes . As expected, for the EM algorithm, when is too small, the FSCs cannot represent the optimal policy (underfitting), and when the number of nodes is too large, the FSCs overfit a limited amount of data and perform poorly. Even if is set to the number inferred by DecSBPR, EM can still suffer severely from initialization and local maxima issues, as can be seen from the large error bar. By setting a high truncation level (), DecSBPR employs Algorithm 1 to integrate out the uncertainty of the policy representation (under the SB prior). As a result, DecSBPR can infer both the number of nodes that is needed () and optimal controller parameters simultaneously. Furthermore, this inference is done with less computation time and with a higher value and improved robustness (low variance of test value) than EM.
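The semi-random data collection and the evaluation metric described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `expert_action` stands in for whatever the PBVI policy would choose (PBVI itself is not implemented here), and the function names and arguments are hypothetical.

```python
import random


def semi_random_action(expert_action, actions, follow_prob, rng):
    """Follow the expert with probability follow_prob, else act uniformly at random."""
    if rng.random() < follow_prob:
        return expert_action
    return rng.choice(actions)


def discounted_return(rewards, gamma):
    """Discounted accumulated reward of one episode: sum_t gamma^t * r_t."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


rng = random.Random(0)
episode_actions = [semi_random_action(2, [0, 1, 2], 0.3, rng) for _ in range(5)]
print(episode_actions)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Averaging `discounted_return` over a set of test episodes gives the kind of test value used to compare the learned controllers.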
Comparison with other methods
The performance of DecSBPR is also compared to several state-of-the-art methods, starting with Monte Carlo EM (MCEM) [35]. Similar to DecSBPR, MCEM is a policy-based RL approach. We apply the exploration-exploitation strategy described in Section 3.4 and follow the same experimental procedure as in [35] to report the results (the learning curves of DecSBPR are shown in the appendix). The rewards after running a fixed number of iterations and a fixed amount of time are summarized in Table 2 under the policy learning category (DecSBPR (fixed iteration) and DecSBPR (fixed time), respectively). DecSBPR is shown to achieve better policy values than MCEM on all problems (the MCEM results were provided through personal communication with its authors and were run on the same benchmarks, which are available online). These results can be explained by the fact that EM is (more) sensitive to initialization and (more) prone to local optima. Moreover, because the size of the controllers is fixed, the policy learned by EM algorithms might be overfitted or underfitted. By using a Bayesian nonparametric prior, DecSBPR learns the policy with variable-size controllers, allowing more flexibility for representing the optimal policy. We also show the result of DecSBPR running for the same amount of clock time as MCEM (DecSBPR (fixed time)), which indicates that DecSBPR can achieve a better trade-off between policy value and learning time than MCEM.
Finally, DecSBPR is compared to Periodic EM (PeriEM) [27] and FBHSVI [15], two state-of-the-art planning methods (with known models) for generating controllers. Because having a DecPOMDP model allows more accurate value function calculations than a finite number of trajectories, the values of PeriEM and FBHSVI are treated as upper bounds for the policy-based methods. Our DecSBPR approach can sometimes outperform PeriEM, but produces lower value than FBHSVI. FBHSVI is a boundedly optimal method, which shows that DecSBPR can produce near-optimal solutions in some of these problems and produces solutions that are much closer to optimal than previous RL methods. It is also worth noting that neither PeriEM nor FBHSVI can scale to large problems (such as the one discussed below), while DecSBPR, by using a policy-based RL approach, can scale well.
Scaling up to larger domains
To demonstrate scalability to both large problem sizes and large numbers of agents, we test our algorithm on a traffic problem [35], with states. Here, there are agents controlling the traffic flow at intersections with one agent located at each intersection. Except for MCEM, no previous DecPOMDP algorithms are able to solve such large problems.
Since the authors in [35] use a hand-coded policy (comparing the traffic flow between two directions) as a heuristic for generating training trajectories, we also use such a heuristic for a fair comparison. In addition, to examine the effectiveness of the exploration-exploitation strategy described in Section 3.4, we also consider the case where the initial behavior policy is random and is then optimized as discussed. From Figure 2, we can see that, with the help of the heuristic, DecSBPR can achieve the best performance. Without using the heuristic (by just using our exploration-exploitation strategy), in a few iterations, DecSBPR is able to produce a higher quality policy than MCEM. Moreover, the inferred number of FSC nodes (averaged over all agents) is smaller than the number pre-selected by MCEM. This shows that not only can DecSBPR scale to large problems, but it can also produce higher-quality solutions than other methods for those large problems.
5 Conclusions
The paper presented a scalable Bayesian nonparametric policy representation and an associated learning framework (DecSBPR) for generating decentralized policies in DecPOMDPs. A new exploration-exploitation method, which extends the popular $\epsilon$-greedy method, was also provided for reinforcement learning in DecPOMDPs. Experimental results show that DecSBPR produces higher-quality solutions than the state-of-the-art policy-based method, and has the additional benefit of inferring the number of nodes needed to represent the optimal policy. The resulting method is also scalable to large domains (in terms of both the number of agents and the problem size), allowing high-quality policies for large DecPOMDPs to be learned efficiently from data.
Acknowledgments
This research was supported by the US Office of Naval Research (ONR) under MURI program award #N000141110688 and NSF award #1463945.
Appendices
A Proof of Theorem 2: (Mean-field) Variational Bayesian (VB) Inference for DecSBPR
Under the standard variational theory [7, 11], minimizing the KL divergence between and is equivalent to maximizing the lower bound of the log marginal likelihood (the empirical value function in our case). Using Jensen’s inequality, we can obtain the following lower bound of the log marginal value function
(19)  
(20)  
(21)  
(22)  
where and . We assume and to accommodate decentralized policy representations.
To derive the VB updating equations, we rewrite the lower bound in equation (19) as follows
(23)  
(24)  
(25) 
The VB inference algorithm for DecSBPR is based on maximizing w.r.t. the distribution of the joint DecSBPR parameters , which can be achieved by alternating the following steps.
Update the distribution of nodes (VB E-step): Keeping and fixed, solve subject to the normalization constraint for . In this step, we construct the Lagrangian
(26) 
then take the derivative w.r.t. and set the result to zero
(27)  
(28)  
(29)  
(30) 
which is solved to give the distribution of nodes for the agent