I Introduction
Decision-making under uncertainty is a ubiquitous robotics problem wherein a robot collects data from its environment and decides which tasks to execute next. While low-cost robotics platforms and sensors have increased the affordability of multi-robot systems, derivation of policies dictating robot decisions remains a challenge. This decision-making problem is even more complex in noisy settings with imperfect communication, requiring a formal framework for its treatment.
A general representation of the multi-agent planning under uncertainty problem is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [1], which extends single-agent POMDPs to decentralized domains. Because Dec-POMDPs use primitive actions (atomic actions assumed to each take a single time unit to execute), they have exceedingly large policy spaces, which severely limits planning scalability. Recent efforts have extended Dec-POMDPs to use macro-actions (temporally extended actions), resulting in the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) [2, 3]. The result is a scalable asynchronous multi-robot decision-making framework which plans over the space of high-level robot tasks (e.g., Open-the-valve or Find-the-key) with nondeterministic durations.
Despite the increased action-space scalability offered by Dec-POSMDPs, they have so far been limited to planning over the space of discrete observations. To date, no algorithms exist for continuous-observation Dec-POSMDPs (or Dec-POMDPs [4]). This is a major research gap, especially important in the context of robotics, where a vast number of real-world sensors provide continuous observation data. Application of Dec-POSMDPs to continuous problems such as robot navigation currently mandates observation space discretization, resulting in loss of valuable sensor information which could otherwise be used to better inform the decision-making policy. Several approaches have targeted single-agent continuous-observation POMDPs. These include partitioning of continuous spaces into lossless discrete spaces [5], Gaussian mixtures for belief representation [6], use of continuous-observation classifiers [7], and learned discrete representations for continuous state spaces [8]. This paper expands this body of work beyond the single-agent case, targeting scalable treatment of continuous-observation Dec-POSMDPs. The methods presented are applicable to domains with continuous underlying state spaces, as shown in some of the experiments used for evaluation.

In order to develop solvers for continuous-observation Dec-POSMDPs, we build on current state-of-the-art discrete policy search methods [9, 2, 3]. Unfortunately, these algorithms suffer from convergence speed limitations, an issue which was identified in prior work but remains untreated [9]. These issues must be addressed before extending the foundations of these discrete algorithms to the continuous case, where such convergence issues are exacerbated. To resolve this, we first introduce a maximal entropy injection approach targeting convergence acceleration for both discrete and continuous algorithms, without degrading overall policy quality. The approach is shown to significantly outperform existing search acceleration methods.
The paper's key contribution is a stochastic kernel-based policy representation and search algorithm, allowing direct mapping of continuous observations to robot decisions (with no discretization necessary). This algorithm leverages the proposed entropy injection acceleration method and is evaluated on a multi-robot nuclear contamination domain, the first-ever continuous-observation Dec-POMDP/Dec-POSMDP domain, in which discrete policy search algorithms perform extremely poorly. Failure modes of discrete methods are analyzed and compared to the superior continuous policy behavior. The contributions introduced in this paper can be readily applied to Dec-POMDPs and Dec-POSMDPs. However, as we are motivated by applications to extremely large action-observation spaces, the notation used and experiments conducted focus on the more scalable Dec-POSMDP framework.
II Background
II-A Decentralized Planning using Macro-Actions
This section summarizes the Dec-POSMDP, a multi-robot decentralized decision-making under uncertainty framework targeting action-space scalability. For a more detailed introduction to Dec-POSMDPs, we refer readers to [9, 2, 3].
The Dec-POSMDP is a belief-space framework in which agents execute macro-actions (temporally extended actions) with nondeterministic completion times, and receive noisy high-level observations of their post-MA state. Macro-actions (MAs) are abstractions of low-level POMDPs involving primitive actions and observations, allowing execution of high-level tasks (e.g., Park-the-car).^1 Each MA executes until an $\epsilon$-neighborhood of its belief milestone is reached. This neighborhood defines the MA termination condition or goal belief node [3].

^1 We denote a generic parameter of the $i$th robot as $x^i$, a joint team parameter as $\bar{x}$, and a joint team parameter at timestep $k$ as $\bar{x}_k$.
Upon completion of an MA, each robot makes a macro (or high-level) observation of the underlying high-level system state. It also calculates its own final belief state. Thus far, both Dec-POMDPs and Dec-POSMDPs have only seen limited applications to finite discrete observation spaces. Due to its action-space scalability, let us focus on the Dec-POSMDP, defined as follows:

$\mathcal{I} = \{1, \dots, n\}$ is the set of heterogeneous robots.

$\mathbb{B}$ is the belief space, with local belief milestones and joint environment (or high-level) state space $\mathbb{S}^e$.

$\mathbb{T} = \prod_i \mathbb{T}^i$ is the joint MA space, where $\mathbb{T}^i$ is the finite set of MAs for the $i$th robot.

$\bar{\Omega}$ is the space of all joint MA-observations.

$\bar{R}(\bar{b}, s^e, \bar{\pi})$ is the high-level reward of taking joint MA $\bar{\pi}$ at $(\bar{b}, s^e)$.

$P(\bar{o} \mid \bar{b}', s'^e, \bar{\pi})$ is the joint observation likelihood model, with joint observation $\bar{o} \in \bar{\Omega}$.

$\gamma \in [0, 1)$ is the reward discount factor.
Macro-observations and final beliefs are jointly denoted as the MA-observation $\tilde{o}^i = (o^i, b^i)$. Trajectories of MAs and received MA-observations are denoted as the MA-history,

(1) $\xi^i_k = \left(\pi^i_0, \tilde{o}^i_1, \pi^i_1, \tilde{o}^i_2, \dots, \tilde{o}^i_k\right)$
The transition probability from $(\bar{b}, s^e)$ to $(\bar{b}', s'^e)$ under joint MA $\bar{\pi}$ in $k$ timesteps is denoted,

(2) $P\left(\bar{b}', s'^e, k \mid \bar{b}, s^e, \bar{\pi}\right)$
The generalized high-level team reward for a discrete-time Dec-POSMDP during execution of joint MA $\bar{\pi}$ is defined [9],

(3) $\bar{R}\left(\bar{b}, s^e, \bar{\pi}\right) = \mathbb{E}\left[\sum_{t=0}^{t_f - 1} \gamma^{t} R(s_t, \bar{a}_t) \,\middle|\, \bar{b}_0 = \bar{b},\, s^e_0 = s^e,\, \bar{\pi}\right]$

where $t_f$ is the first timestep at which any robot completes its current MA.
The joint high-level policy, $\bar{\Phi}$, dictates MA selection. High-level policy $\Phi^i$ maps the $i$th robot's MA-history to the next MA to be executed. The joint Dec-POSMDP value under policy $\bar{\Phi}$ is then [9],

(4) $V^{\bar{\Phi}}\left(\bar{b}_0, s^e_0\right) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{t_k}\, \bar{R}\left(\bar{b}_k, s^e_k, \bar{\pi}_k\right) \,\middle|\, \bar{b}_0, s^e_0, \bar{\Phi}\right]$

(5) $\phantom{V^{\bar{\Phi}}\left(\bar{b}_0, s^e_0\right)} = \bar{R}\left(\bar{b}_0, s^e_0, \bar{\pi}_0\right) + \sum_{k}\sum_{\bar{b}', s'^e} \gamma^{k}\, P\left(\bar{b}', s'^e, k \mid \bar{b}_0, s^e_0, \bar{\pi}_0\right) V^{\bar{\Phi}}\left(\bar{b}', s'^e\right)$

where $t_k$ is the timestep at which the $k$th joint MA decision epoch occurs.
The optimal joint high-level policy is,

(6) $\bar{\Phi}^{*} = \operatorname*{arg\,max}_{\bar{\Phi}}\, V^{\bar{\Phi}}\left(\bar{b}_0, s^e_0\right)$
Solving the Dec-POSMDP results in a joint high-level decision-making policy dictating the MA executed by each robot based on its MA-history. Each MA is, itself, a policy over low-level actions and observations. Thus, decision-making using the Dec-POSMDP allows abstraction of task-level actions from low-level actions, leading to significantly improved planning scalability over Dec-POMDPs.
II-B Dec-POSMDP Policy Search Algorithms
So far, research efforts have focused on Dec-POSMDP policy search for discrete observation spaces, resulting in several algorithms: Masked Monte Carlo Search (MMCS) [3], MacDec-POMDP Heuristic Search (MDHS) [2], and the Graph-based Direct Cross Entropy method (G-DICE) [9]. These algorithms use Finite State Automata (FSAs) for policy representation. An FSA-based policy for robot $i$ consists of a finite set of FSA nodes. FSA-based decision-making is twofold: each robot begins execution in an initial FSA node, where the MA output function assigns it an MA. Following MA execution, the robot receives a high-level observation and selects its next FSA node using the transition function. The graph-based nature of FSAs allows their application to infinite-horizon domains.

Though Dec-POSMDPs have increased the size of solvable planning domains beyond their Dec-POMDP counterparts, major algorithm limitations still exist. MMCS is a greedy algorithm which succumbs to local optimality issues [3]. MDHS uses lower and upper bound value heuristics to bias search towards promising policy regions, initiating an empty (partial) FSA and incrementally assigning MAs and transitions to nodes. Partial policies with high upper bounds are expanded incrementally. Yet, each expansion spawns a large number of child policies, severely limiting usage for large observation spaces.
G-DICE is a cross entropy-based algorithm which iteratively updates policies using two sampling distributions at each FSA node: an MA distribution and a node transition distribution, with parameter vectors $\theta$. Each iteration samples the distributions a fixed number of times, resulting in a set of deterministic FSA policies. Maximum likelihood estimates (MLE) $\hat{\theta}$ of the parameters are calculated using the best sampled policies. To prevent convergence to local optima, smooth parameter updates,

(7) $\theta_{k+1} = \alpha\, \hat{\theta}_{k+1} + (1-\alpha)\, \theta_{k}$

are used, with iteration number $k$ and learning rate $\alpha \in (0, 1]$. For sufficiently small values of $\alpha$, this process minimizes the cross entropy between each sampling distribution and a unit mass centered at the optimal policy [10]. G-DICE is executed until convergence, after which the best deterministic policy from the history of samples is returned.
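A minimal sketch of one smoothed update of a single categorical MA distribution, in the spirit of Eq. (7); the sample values, number of retained policies, and learning rate are hypothetical:

```python
import numpy as np

def smoothed_ce_update(theta, sampled_choices, values, n_best, alpha):
    """One smoothed cross-entropy update of a categorical sampling
    distribution theta: MLE from the best samples, blended with the
    previous distribution at learning rate alpha (cf. Eq. (7))."""
    best = np.argsort(values)[-n_best:]        # indices of best-valued samples
    counts = np.zeros_like(theta)
    for idx in best:
        counts[sampled_choices[idx]] += 1.0
    mle = counts / counts.sum()                # MLE from best samples
    return alpha * mle + (1.0 - alpha) * theta # smooth update

theta = np.full(3, 1 / 3)                      # uniform over 3 MAs
choices = [0, 1, 2, 1, 1]                      # sampled MA indices
values = [0.1, 0.9, 0.2, 0.8, 0.7]             # their evaluated values
theta = smoothed_ce_update(theta, choices, values, n_best=3, alpha=0.2)
```

The smoothing term keeps probability mass on unselected MAs, preserving exploration across iterations.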
Using smooth parameter updates and sampling distributions initialized as uniform distributions allows G-DICE to trade off exploration and exploitation in the policy space, outperforming other Dec-POSMDP search approaches given a fixed computational budget. Yet, G-DICE suffers from sample degeneracy and convergence issues related to the sampling distributions, and in its current form only applies to discrete observation settings. The following sections resolve these issues, resulting in a scalable, accelerated continuous-observation search algorithm.
III Accelerated Policy Search
Prior to extending to continuous observations, this section treats the sampling distribution degeneracy issue in sampling-based Dec-POSMDP approaches. It also introduces a maximal entropy injection scheme which is then embedded in the proposed continuous-observation Dec-POSMDP algorithm.
III-A Sampling Distribution Degeneracy Problem
A major issue with sampling distribution-based approaches, such as G-DICE, occurs when a sufficiently low learning rate is not used, causing the underlying sampling distributions to rapidly converge to degenerate distributions far from the optimum [11]. All subsequent search iterations return identical samples of the policy space, stifling exploration altogether. Yet, one benefit of a high learning rate is fast convergence, especially useful for complex Dec-POSMDPs with large observation spaces and computationally expensive trajectory sampling and evaluation. Sampling distribution-based approaches such as G-DICE often require hand-tuned selection of the learning rate for good performance, even after which convergence may be excessively slow, hindering experimentation and analysis. This tradeoff was noted in [9], where it was left as future work. Recall that the motivation behind the Dec-POSMDP framework is scalability to very large multi-robot planning domains. Although policy search is conducted offline, hindrance of human-in-the-loop analysis due to slow convergence is undesirable. A naïve solution is to set the learning rate arbitrarily low, but this implies arbitrarily high convergence time (on the order of many days for complex domains). These foundational issues must first be resolved before extending these algorithms to treat the more complex continuous observation case.
Several works have targeted this degeneracy problem. One approach uses dynamic smoothing of learning rates [12],

(8) $\alpha_k = \beta\left(1 - \left(1 - \tfrac{1}{k}\right)^{q}\right)$

where $\beta$ is the baseline rate (typically close to $1$) and $q$ is the dropoff rate. The result is a monotonically decreasing $\alpha_k$ which initially starts high.
Another approach involves the addition of a noise term to the sampling distribution at each iteration to prevent degeneration. Linearly decreasing noise injection,

(9) $z_k = \max\left(z_{\max} - k\rho,\; 0\right)$

was investigated in [13]. In the above, $z_{\max}$ is the maximum allowable noise and $\rho$ is the noise dropoff rate.
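Assuming Eqs. (8) and (9) take the standard forms from the cross-entropy literature, the two schedules can be sketched as follows; all parameter values are hypothetical:

```python
def dynamic_alpha(k, beta=0.9, q=7):
    """Dynamically smoothed learning rate (cf. Eq. (8)): starts at the
    baseline rate beta at k = 1 and decreases monotonically with
    iteration k, at a speed controlled by the dropoff rate q."""
    return beta * (1.0 - (1.0 - 1.0 / k) ** q)

def linear_noise(k, z_max=0.2, rho=0.01):
    """Linearly decreasing injected noise (cf. Eq. (9)): starts at the
    maximum allowable noise z_max and decays to zero at rate rho."""
    return max(z_max - rho * k, 0.0)
```

Both schedules are fixed functions of the iteration index, which is precisely the open-loop behavior critiqued next.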
These approaches are not ideal, as they are agnostic to Dec-POSMDP value function convergence, meaning they do not adapt to domain-specific behaviors. Thus, their sub-parameters (baseline rate, dropoff rate, maximum noise, and noise dropoff rate) typically need significant tuning to alleviate convergence issues for individual domains.
III-B Maximal Entropy Injection
A principled approach combining policy exploration with fast convergence is desired, without reliance on sensitive dynamic smoothing or noise terms. As degenerate distributions have minimal entropy [14], an intuitive idea is to simultaneously monitor policy value convergence and underlying sampling distribution entropy to alleviate degeneracy issues.
In the proposed acceleration approach, search is conducted as usual in iterations where the policy value has not converged, allowing policy space exploration. Once convergence occurs, the entropies of the MA and node transition sampling distributions are calculated. If a distribution's entropy is significantly below the maximum entropy for its distribution family, degeneracy has likely occurred [14]. Maximum entropy distributions are well studied, and closed-form results for many families and constraint sets are known [15]. For Dec-POSMDPs, these entropy calculations are computationally cheap, as the sampling distributions are categorical, with corresponding discrete uniform maximal entropy distributions.
In post-degeneracy iterations, each sampling distribution's entropy is increased by incrementally combining its parameters $\theta$ with the maximum entropy distribution parameters $\theta_{\max}$,

(10) $\theta \leftarrow (1-\eta)\,\theta + \eta\,\theta_{\max}$

where $\eta$ is the entropy injection rate. This encourages policy space exploration while still allowing usage of high learning rates for fast convergence. In practice, the entropy injection rate $\eta$ has a low value (between 1% and 3% per iteration). As this process is repeated only in post-convergence iterations, there is low sensitivity to $\eta$, as entropy is incrementally increased whenever necessary. Injection stops as soon as the policy value diverges, allowing unhindered exploration. This acceleration approach is evaluated in Section V-A and also integrated into the proposed continuous-observation search algorithm in the next section.
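A minimal sketch of the monitoring-plus-injection step for one categorical sampling distribution; the degeneracy threshold (a fraction of the family's maximum entropy) and the injection rate are hypothetical choices:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a categorical distribution (in nats)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def inject_entropy(theta, eta=0.03):
    """Blend a categorical distribution with the maximum-entropy
    (uniform) distribution at rate eta (cf. Eq. (10))."""
    theta_max = np.full_like(theta, 1.0 / len(theta))
    return (1.0 - eta) * theta + eta * theta_max

theta = np.array([0.98, 0.01, 0.01])   # near-degenerate distribution
h_max = np.log(len(theta))             # max entropy for the categorical family
if entropy(theta) < 0.5 * h_max:       # hypothetical degeneracy test
    theta = inject_entropy(theta, eta=0.03)
```

Each injection nudges the distribution back toward uniform, restoring exploration without discarding what the search has learned.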
IV Continuous-Observation Dec-POSMDP Search
This section focuses on multi-robot policy search in continuous observation spaces. It first presents an extension of traditional discrete, deterministic FSAs to allow representation of continuous policies. A continuous-observation Dec-POSMDP search algorithm is then introduced.
IV-A Stochastic Kernel-Based Finite State Automata
We first extend the notion of deterministic policies used in existing Dec-POSMDP algorithms to stochastic policies. In a stochastic FSA, the MA output function and node transition function provide robots with probability distributions over MAs and next-nodes during policy execution, rather than deterministic MA and transition assignments. The resulting stochastic decision-making scheme allows robots to escape cycles of incorrect decisions which may otherwise occur in deterministic FSAs [16]. While it has been shown that finite-horizon Dec-POMDPs have at least one optimal deterministic policy (i.e., guaranteed to at least equal the performance of the optimal stochastic policy) [17], in approximate searches, stochastic FSAs often result in a higher joint value [18, 16]. One can readily modify cross entropy-based search to provide such a stochastic policy by simply using the underlying sampling distributions to define the policy, rather than the best sampled deterministic policy (as done in G-DICE).

A second issue is the extension of FSAs to support continuous observations, a formidable task as continuous observation spaces are uncountably infinite. Existing Dec-POSMDP algorithms are, thus, inapplicable. To resolve this, we assume policy smoothness over the observation space, a characteristic which occurs naturally in many robotics domains. In other words, the controller structure should induce similar decisions from similar observation chains. This typical assumption is also made in the continuous state-action MDP and POMDP literature [19, 8, 7].
We exploit this smoothness assumption and introduce Stochastic Kernel-based Finite State Automata (SK-FSAs) for policy representation (Fig. 1), which have a similar structure to the controllers used in [7]. Policy execution in SK-FSAs is similar to traditional FSAs. Each robot's SK-FSA node outputs a categorical MA distribution, which the robot samples to select its next MA (Fig. 1). Following MA execution, the robot receives a continuous high-level observation, which the SK-FSA node transition function uses to output a corresponding node transition distribution. Note the distinction between transition function and transition distribution: the transition function maps continuous observations to the probability simplex over next-nodes. Given an observation, it outputs an infinitesimal 'slice' representing a categorical transition distribution over next-nodes. Fig. 1 illustrates such a slice, evaluated at a given high-level observation. The robot samples this categorical distribution, transitions to its next SK-FSA node, and repeats this process indefinitely.
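The SK-FSA execution step can be sketched as follows; the two-node controller, MA distributions, and sigmoid-shaped transition function are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def skfsa_step(node, eta, delta, observation):
    """One SK-FSA decision step: sample an MA from the node's categorical
    MA distribution eta[node]; then, given the continuous observation
    received after MA execution, sample the next node from the transition
    function's categorical 'slice' delta(node, observation)."""
    ma = rng.choice(len(eta[node]), p=eta[node])
    next_node_probs = delta(node, observation)   # a point on the simplex
    next_node = rng.choice(len(next_node_probs), p=next_node_probs)
    return ma, next_node

# Toy controller: 2 nodes, 2 MAs; transition probabilities vary
# smoothly with the (1-D) continuous observation.
eta = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]

def delta(node, obs):
    p = 1.0 / (1.0 + np.exp(-obs[0]))            # smooth in the observation
    return np.array([p, 1.0 - p])

ma, nxt = skfsa_step(0, eta, delta, observation=np.array([0.3]))
```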
We propose the use of kernel logistic regression (KLR) to represent node transition functions. KLR is a nonparametric multi-class classification model (i.e., model complexity grows with the number of kernel points). In SK-FSAs, node transition functions use KLR with high-level observation inputs, and output probabilities over next-nodes. KLR is a natural model for stochastic policies as it is a probabilistic classifier (i.e., SK-FSA transition distributions correspond to KLR probabilities) [20]. Our approach uses KLR with radial basis function (RBF) kernels over the observation space,

(11) $K(o, o') = \exp\left(-\frac{\lVert o - o' \rVert^{2}}{2\sigma^{2}}\right)$

where $\sigma$ is the kernel radius. RBF kernels are preferred as they provide smooth classification outputs while allowing nonlinear decision boundaries [20], in contrast to linear kernels. The next section discusses SK-FSA policy search, including details on kernel basis selection and kernel weight training.
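A sketch of a KLR transition-function 'slice', mapping one continuous observation to a categorical distribution over next-nodes via a softmax of RBF-kernel scores; the basis points, weights, and kernel radius are hypothetical:

```python
import numpy as np

def rbf_kernel(o, o_prime, radius):
    """RBF kernel over the continuous observation space (cf. Eq. (11))."""
    d2 = np.sum((np.asarray(o) - np.asarray(o_prime)) ** 2)
    return np.exp(-d2 / (2.0 * radius ** 2))

def klr_transition_probs(obs, basis, weights, radius=0.5):
    """Evaluate one KLR 'slice': kernel similarities of obs to each basis
    point, weighted into per-node scores, then softmax-normalized into a
    categorical distribution over next FSA nodes.
    basis: (m, d) kernel points; weights: (m, n_nodes) kernel weights."""
    k = np.array([rbf_kernel(obs, b, radius) for b in basis])  # (m,)
    scores = k @ weights                                       # (n_nodes,)
    e = np.exp(scores - scores.max())
    return e / e.sum()

basis = np.array([[0.0, 0.0], [1.0, 1.0]])      # hypothetical kernel points
weights = np.array([[2.0, -2.0], [-2.0, 2.0]])  # hypothetical kernel weights
probs = klr_transition_probs([0.1, 0.0], basis, weights)
```

An observation near the first basis point yields a transition distribution concentrated on the first node, illustrating the smoothness assumption: nearby observations induce similar transition distributions.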
IV-B Entropy-based Policy Search over SK-FSAs
This section introduces an SK-FSA search algorithm titled Entropy-based Policy Search using Continuous Kernel Observations (EPSCKO). EPSCKO consists of three steps: cross entropy search for MA distributions (as done in G-DICE), memory-bounded KLR training for SK-FSA node transition functions, and entropy injection for search acceleration (as in Section III-B). In each EPSCKO iteration, decision trajectories are sampled from the SK-FSA policy. The best trajectories (evaluated using Equation (4)) are used for the policy update.
We first detail the KLR training approach and then present the overall algorithm. As the transition function uses a kernel-based representation over the observation space, it requires a set of observation kernel basis points and weights. In EPSCKO, the kernel weights constitute the node transition parameter vector; to simplify notation, references to $\theta$ in this section refer to this transition parameter vector.
The computational cost of training KLR models grows superlinearly with the training input size [20]. For a sustainable training time, EPSCKO uses a memory-bounded kernel basis consisting of continuous observations received during evaluation of the best policies in each of the latest iterations. In each iteration, the bundle of observations in the best decision trajectories is pushed to a first-in, first-out (FIFO) circular queue of fixed length. KLR training outputs are the corresponding sampled node transitions taken along these same trajectories. The nonparametric nature of KLR ensures that node transition function complexity increases in regions with high observation density, so the policy naturally focuses on prominent observation space regions. The result is a compact yet informative policy representation.
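The memory-bounded kernel basis can be sketched with a FIFO circular queue; the queue length and bundle contents are hypothetical:

```python
from collections import deque

# FIFO circular queue holding observation bundles (with their sampled
# node transitions) from the best trajectories of the latest iterations.
# The queue length is a hypothetical parameter bounding KLR training size.
QUEUE_LENGTH = 3
kernel_queue = deque(maxlen=QUEUE_LENGTH)

for iteration in range(5):
    # bundle: list of (observation, sampled next node) training pairs
    bundle = [((0.1 * iteration, 0.2), iteration % 2)]
    kernel_queue.append(bundle)  # oldest bundle is popped automatically

# Only the latest QUEUE_LENGTH bundles remain as the kernel basis /
# KLR training set.
training_pairs = [pair for bundle in kernel_queue for pair in bundle]
```

The `deque(maxlen=...)` discards the oldest bundle on overflow, matching the first-in, first-out behavior described above.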
To counter convergence to locally optimal SK-FSAs, EPSCKO uses a weighted log-likelihood function to train the KLR model. Weights are discounted such that observations sampled in earlier algorithm iterations are given lower value. Given learning rate $\alpha$, the following weight set is used,

(12) $w_j = (1-\alpha)^{\,j-1}$

where $w_j$ is the training weight for the $j$th most recent observation bundle in the FIFO kernel queue. This weighting is derived from recursive application of (7), and is analogous to the smoothing step used in G-DICE. For each robot, the weighted log-likelihood function is maximized over $\theta$ for KLR training,

(13) $\mathcal{L}(\theta) = \sum_{j} w_j \sum_{m} \log P\left(q_{jm} \mid o_{jm}; \theta\right)$

where $o_{jm}$ and $q_{jm}$ are the transition function training input and output for the $m$th sample of the $j$th observation bundle. The partial derivative with respect to the component of $\theta$ associated with next-node $c$ and kernel basis point $o_p$ is,

(14) $\frac{\partial \mathcal{L}}{\partial \theta_{c,p}} = \sum_{j} w_j \sum_{m} K\left(o_{jm}, o_p\right)\left(\mathbb{1}\left[q_{jm} = c\right] - P\left(c \mid o_{jm}; \theta\right)\right)$

where $\mathbb{1}[\cdot]$ is the indicator function. The log-likelihood can be maximized using a quasi-Newton method (our implementation uses the Broyden-Fletcher-Goldfarb-Shanno algorithm). To improve the generalization of the learned model, regularization is used during weight training.
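A sketch of the weighted maximum-likelihood KLR training step, using plain gradient ascent with L2 regularization as a stand-in for the quasi-Newton solver; the kernel matrix, labels, and per-bundle weights are hypothetical toy data:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_klr(K, y, sample_w, n_classes, lam=1e-2, lr=0.1, iters=500):
    """Weighted maximum-likelihood training of a multi-class KLR model by
    gradient ascent. K: (n, m) kernel matrix (observations vs. basis
    points); y: (n,) next-node labels; sample_w: (n,) discounted weights
    as in Eq. (12); lam: L2 regularization strength."""
    W = np.zeros((K.shape[1], n_classes))       # kernel weights per node
    Y = np.eye(n_classes)[y]                    # one-hot labels
    for _ in range(iters):
        P = softmax(K @ W)                      # (n, n_classes) probabilities
        # Weighted log-likelihood gradient (cf. Eq. (14)), minus L2 penalty.
        grad = K.T @ (sample_w[:, None] * (Y - P)) - lam * W
        W += lr * grad
    return W

# Toy data: kernel matrix for 4 observations vs. 2 basis points, 2 nodes.
K = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 0.8, 1.0, 0.8])              # discounted bundle weights
W = train_klr(K, y, w, n_classes=2)
P = softmax(K @ W)                              # trained transition probs
```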
EPSCKO is outlined in Algorithm 1. It begins by specifying an empty SK-FSA policy and a fixed-length FIFO circular kernel basis queue for each robot. The best-value-so-far and worst-joint-value trackers are initialized to $-\infty$. To encourage policy space exploration, SK-FSA parameter vectors are initialized such that the associated distributions are uniform.
The main algorithm loop updates the SK-FSA policy over iterations, using the maximal entropy injection scheme detailed in Section III-B to accelerate search. Entropy injection is initially disabled, and a flag indicating successful entropy injection in the current iteration is set to False. The team's SK-FSA policies are evaluated repeatedly, with perceived continuous observation and node transition trajectories saved for KLR training. MA selections and node transitions from policies exceeding the previous iteration's worst joint value are tracked, and the best-value-so-far is saved. Trajectory lists are pruned to retain only the best trajectories. Continuous observations and node transitions from this list are pushed to the FIFO queue, causing the oldest trajectories to be popped. The iteration's worst joint value is then updated.
At this point, the algorithm checks if the Dec-POSMDP joint value has converged. If so, entropy injection is enabled to counter convergence to a local optimum. This does not imply entropy injection will occur, only that it is allowed to occur. Each robot subsequently updates its MA distribution parameter vector using a smoothed MLE approach. As discussed earlier, weighted log-likelihood maximization is used to train the KLR model for each node transition function.
Next, if maximal entropy injection is allowed, the entropies of the sampling distributions are calculated and (if necessary) injection occurs. As the transition function is continuous and nonlinear, an approximate measure of its entropy is calculated using transition distributions sampled at its underlying set of observation kernels. This approximation was found to work well in practice (Section V-B) and is computationally efficient, as it avoids domain resampling. To increase the entropy of the node transition function, a continuous uniform distribution injection is done using update rule (10). If entropy injection is conducted for any robot, the current iteration's worst joint value is reset to $-\infty$. This critical step ensures trajectories sampled in the next iteration can actually be used for policy exploration.
EPSCKO is an anytime algorithm applicable to continuous-observation Dec-POMDPs and Dec-POSMDPs. The approach also offers memory advantages over discretization, as SK-FSA memory usage grows with the number of retained kernel points, whereas discretized FSA memory usage grows exponentially in the observation dimensionality for a fixed discretization resolution.
V Experiments
This section first validates maximal entropy search acceleration, which resolves a long-standing convergence issue for sampling-based Dec-POSMDP algorithms. Then, EPSCKO is evaluated against discrete approaches in the first-ever continuous-observation Dec-POMDP/Dec-POSMDP domain.
V-A Accelerated Policy Search
We evaluate the policy search acceleration approaches discussed in Section III on the benchmark Navigation Among Movable Obstacles (NAMO) grid domain [21] with a finite horizon. Fig. 2 shows convergence trends for all approaches. A low learning rate is needed in G-DICE [9] to find the optimal policy, at the cost of many iterations. With 50 policies sampled per iteration and 1000 trajectories used to approximate policy value in each iteration, a very large total number of policy evaluations is conducted. This computationally expensive evaluation becomes prohibitively large as domain complexity grows. Increasing the learning rate causes fast convergence to a suboptimal solution, after which exploration stops due to sampling distribution degeneration.
Existing search acceleration approaches are also evaluated. Dynamic smoothing with a moderate baseline rate slightly improves value. However, its decay rate is static, with no closed-loop feedback from the underlying sampling distributions. The result is a suboptimal policy which then quickly converges to the same value as the baseline high-learning-rate approach. Linearly decreasing noise injection performs similarly, with a fast initial increase in value and subsequent degeneration to a suboptimal policy.
The proposed entropy injection method significantly outperforms the above approaches. The same baseline learning rate as the previous methods is used with a 3% entropy injection rate, resulting in much faster convergence. Sensitivity to the learning rate and injection rate is low, as value convergence monitoring is conducted in all iterations. While some initial tuning of the entropy injection rate is necessary, the key insight is that post-tuning results converge much faster and are more conducive to additional experimentation and analysis (e.g., with domain/policy structure). Oscillations in the plots are due to post-convergence injections, which reset the underlying sampling distributions and force further policy space exploration. In practice, the best policy found in a fixed number of iterations would be returned by the algorithm.
V-B Continuous Observation Domain
To evaluate EPSCKO, a multi-robot continuous-observation nuclear contamination domain is considered (Fig. 4). This first-ever continuous-observation Dec-POMDP/Dec-POSMDP domain involves 3 robots cleaning up nuclear waste. The MAs are Navigate to base, Navigate to waste zone, Correct position, and Collect nuclear contaminant. Following MA execution, each robot receives a noisy high-level observation of its 2D state. The above MAs have nondeterministic durations and a 30% failure probability (due to nuclear contaminant degrading the robots). This causes poor performance of observation-agnostic policies which memorize chains of MAs rather than make informed decisions using the observations.
Robots are initially at the base and must first navigate to the waste zone prior to a collection attempt. Robots which execute the Navigate to base MA terminate with a random continuous state in a region centered on the base (brown region marked 'B' in Fig. 4). The Navigate to waste zone MA results in a random terminal state within two large regions surrounding the nuclear zone (everything interior to the gray regions marked 'L' in Fig. 4, including the green regions marked 'S'). Collection attempts are only possible if the robot is within the waste zone (green regions marked 'S' in Fig. 4). Collections attempted outside these small contamination regions result in wasted time, which further discounts the team's future joint rewards. Robots can attempt a Correct position MA, which resamples their state to be within these smaller regions. However, repeated attempts may be necessary due to the 30% MA failure probability.
After a successful collection, each robot must return to the base to deposit the waste before attempting another collection. Each collection results in a joint team reward, discounted over time. This domain is particularly challenging due to the high failure rate of MAs and the presence of a continuous, nonlinear decision boundary in the nuclear zone center, where the trade-off between the correction and collection MAs must be considered by robots given their noisy observations.
Fig. 3 compares the best values obtained using continuous-observation and discrete-observation policy search (EPSCKO, G-DICE with maximal entropy injection, and MDHS). A fixed time horizon was used for evaluation, with each MA taking several time units on average to complete. The same number of policy nodes was used for both discrete and continuous policies. G-DICE and MDHS results are shown for a range of observation discretization factors, with uniform discretization in each observation dimension. EPSCKO significantly outperforms the discrete approaches, more than doubling the mean policy value of the best discrete-observation case. MDHS faces the policy expansion issues discussed in Section II-B.
G-DICE policy values initially increase with higher discretization resolutions, yet a dropoff occurs beyond a certain discretization factor. While initially counterintuitive, as higher discretization factors imply increased precision regarding important decision boundaries in the continuous domain, Fig. 4 reveals the underlying problem. These plots show the normalized count of observation samples used to compute node discrete policies for low- and high-resolution cases, with discounting of old observation samples using (7). In other words, they provide a measure of the discrete observation bins which have informed each G-DICE policy throughout its iterations. The core issue for discrete policies is that no correlation exists between decisions at nearby observation bins. Fine discretization meshes (Fig. 4) result in cyclic processes where observation bins with no previous samples are encountered, causing the robot to make a poor MA selection. Nearby observation bins do not inform the robot during this process, leading it to repeatedly make incorrect decisions. This issue is especially compounded in this domain due to delays caused by high MA failure probabilities, which reduce the overall number of observations received by robots. The result is a highly uninformative policy with no observations made in many bins, in contrast to policies with a lower discretization factor (Fig. 4).
To build intuition on continuous-policy decision-making, Fig. 5 plots the transition functions for a 6-node EPSCKO policy. For each node, colored 3D manifolds represent probabilities of transitioning to next-nodes given a continuous observation. Circles plotted beneath the transition functions indicate base and nuclear zone locations. Colorbars indicate the transition manifold color associated with each node and the highest-probability MA executed in it.
Consider a robot starting at the first node (far left in Fig. 5), which has two major manifolds (beige and green). Observations under a prominent green manifold region indicate a high probability of transitioning to the node whose colorbar is green, which executes Navigate to waste zone. For the first node, this green manifold is centered on the base, which makes intuitive sense, as the Navigate to waste zone MA should only be executed if the robot is confident it is at the base. Thus, the robot most likely transitions to that node, where a complex transition function manifold is encountered. Two beige peaks are centered on the small inner regions of the nuclear zone, indicating transition to the node executing Collect nuclear contaminant. Thus, when the robot is confident that it is in the center of the nuclear zone, it attempts a collection MA. Yet, for observations outside the inner nuclear zone, the red and blue manifolds are most prominent. These indicate high probabilities of transitioning to the nodes executing Correct position. Thus, the robot most likely performs a position correction before continuing policy execution and attempting waste collection. This process continues indefinitely, or until the time horizon is reached. Recall that SK-FSA policies are stochastic, so these discussions provide an intuition of the 'most likely' continuous-policy behaviors.
VI Conclusion
This paper presented an approach for solving continuous-observation multi-robot planning under uncertainty problems. Entropy injection for policy search acceleration was presented, targeting convergence issues of existing algorithms, which are exacerbated in the continuous case. Stochastic Kernel-based Finite State Automata (SK-FSAs) were introduced for policy representation in continuous domains, with the Entropy-based Policy Search using Continuous Kernel Observations (EPS-CKO) algorithm for continuous policy search. EPS-CKO was shown to significantly outperform discrete search approaches on a complex multi-robot continuous-observation nuclear contamination mission, the first ever continuous-observation Dec-POMDP/Dec-POSMDP domain. Future work includes extending the framework to continuous-time planning.
References
 [1] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of Markov decision processes,” Math. of Oper. Research, vol. 27, no. 4, pp. 819–840, 2002.
 [2] C. Amato, G. Konidaris, A. Anders, G. Cruz, J. How, and L. Kaelbling, "Policy search for multi-robot coordination under uncertainty," in Robotics: Science and Systems XI (RSS), 2015.
 [3] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, and J. P. How, "Decentralized control of partially observable Markov decision processes using belief space macro-actions," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 5962–5969.
 [4] F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs. Springer, 2016.
 [5] J. Hoey and P. Poupart, “Solving POMDPs with continuous or large discrete observation spaces,” in IJCAI, 2005, pp. 1332–1338.
 [6] J. M. Porta, N. Vlassis, M. T. Spaan, and P. Poupart, "Point-based value iteration for continuous POMDPs," Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2329–2367, 2006.
 [7] H. Bai, D. Hsu, and W. S. Lee, "Integrated perception and planning in the continuous space: A POMDP approach," The International Journal of Robotics Research, vol. 33, no. 9, pp. 1288–1302, 2014.
 [8] S. Brechtel, T. Gindele, and R. Dillmann, “Solving continuous POMDPs: Value iteration with incremental learning of an efficient space representation.” in ICML (3), 2013, pp. 370–378.
 [9] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, S.-Y. Liu, J. P. How, and J. Vian, "Graph-based cross entropy method for solving multi-robot decentralized POMDPs," in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 5395–5402.
 [10] A. Costa, O. D. Jones, and D. P. Kroese, "Convergence properties of the cross-entropy method for discrete optimization," Oper. Res. Lett., vol. 35, no. 5, pp. 573–580, 2007.
 [11] Z. I. Botev and D. P. Kroese, "Global likelihood optimization via the cross-entropy method, with an application to mixture models," in Winter Simulation Conference. WSC, 2004, pp. 529–535.
 [12] D. P. Kroese, S. Porotsky, and R. Y. Rubinstein, "The cross-entropy method for continuous multi-extremal optimization," Methodology and Computing in Applied Probability, vol. 8, no. 3, pp. 383–407, 2006.
 [13] C. Thiery and B. Scherrer, “Improvements on learning tetris with cross entropy.” ICGA Journal, vol. 32, no. 1, pp. 23–33, 2009.
 [14] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Science & Business Media, 2013, vol. 31.
 [15] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mob. Comp. and Comm. Rev., vol. 5, no. 1, 2001.
 [16] C. Amato, D. S. Bernstein, and S. Zilberstein, "Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs," Auton. Agents and Multi-Agent Sys., vol. 21, no. 3, pp. 293–320, 2010.
 [17] F. Oliehoek, Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments. Amsterdam University Press, 2010.
 [18] D. S. Bernstein, C. Amato, E. A. Hansen, and S. Zilberstein, "Policy iteration for decentralized control of Markov decision processes," J. of Artif. Intell. Res., vol. 34, no. 1, p. 89, 2009.
 [19] S. W. Carden, "Convergence of a Q-learning variant for continuous states and actions," J. of Artif. Intell. Res., vol. 49, pp. 705–731, 2014.
 [20] J. Zhu and T. Hastie, “Kernel logistic regression and the import vector machine,” Journal of Computational and Graphical Statistics, 2012.
 [21] M. Stilman and J. J. Kuffner, "Navigation among movable obstacles: Real-time reasoning in complex environments," International Journal of Humanoid Robotics, vol. 2, no. 04, pp. 479–503, 2005.