I. Introduction
Multi-robot systems are a widely studied field, but the research typically focuses on a single team of cooperative, or self-interested, robots [1, 2, 3]. In contrast, many real-world domains consist of a team of robots that must complete tasks while competing against other, adversarial robots. For instance, consider a team of UAVs tasked with surveying a scene or locating a secret base, as well as an opposing team of UAVs tasked with preventing the secret base from being found. These adversarial scenarios require reasoning not only about completing the tasks designated to the team, but also about what the adversarial robots may do to prevent their completion. In this paper, we study the general multi-robot decision-making problem with uncertainty in outcomes, sensors and communication, while incorporating multiple adversarial robots into the problem. Communication uncertainty and limitations further necessitate the design of decentralized agents that can coordinate with their teammates while anticipating changes in the adversary strategies using only their partial views of the world. This is the first time all these forms of uncertainty, as well as adversarial behavior, have been considered in the same decision-theoretic planning framework (previous works [4, 5] addressing this challenge have mostly focused on reactive frameworks or do not consider multiple adversarial robots). Furthermore, the spaces of possible actions and coordination strategies for both the teammates and the adversaries scale exponentially in the number of agents [6] and are typically much too large for an agent to reason about directly. Hence, a successful agent usually requires some form of high-level abstraction to reduce its effective planning space [7, 8, 9].
One approach is therefore to create a set of basic stratagems, which are best responses to particular forms of adversarial behavior. The reasoning problem is then reduced to choosing among these basic stratagems in a given situation, thus significantly improving the scalability of planning. This approach can be realized by anticipating in advance a small set of high-level tactics from which the adversaries can choose in any situation, capturing the diversity of their intentions. The task of the high-level planner is then to choose a response to the adversaries' current tactic and follow it until it is determined that they have changed tactics and a new response is needed. This fits particularly well in asynchronous robotic planning scenarios: since each stratagem has a different execution time, agent decision-making is no longer synchronized as assumed in existing non-cooperative multi-agent frameworks [6, 10, 11].
The main contribution of this paper therefore focuses on the design of such a high-level planner, which can be decoupled into two separate tasks. The first task involves generating a set of basic stratagems for a team of decentralized agents, each of which is optimized to work best against a particular tactic of the adversaries. This is formulated as a set of Macro-Action Decentralized Partially Observable Markov Decision Processes (MacDec-POMDPs) [9, 12], each of which characterizes a cooperative scenario where a team of decentralized agents collaborate to maximize the team's expected performance while operating in a stationary environment simulated by a single tactic of the adversaries (Section III). The stratagems can therefore be acquired by solving for a set of probabilistic policy controllers that maximize the expected total reward of the corresponding MacDec-POMDPs. The second task is to integrate these specialized policy controllers into a unified policy controller that works best on average against the adversaries' switching tactics. This again can be achieved by optimizing the unified controller with respect to a series of MacDec-POMDPs (Section IV) so that it can detect situation changes and switch opportunistically between the stratagems to respond effectively to the adversaries' new tactical choice. Interestingly, under a mild assumption, the result of this stratagem integration/fusion scheme can be shown to be near-optimal with high probability (Section V).
Finally, to empirically demonstrate the effectiveness of the proposed framework, experiments conducted for a robotic scenario are presented in Section VI, with results consistent with our theoretical analysis.

II. Background and Notations
This section provides a short overview of MacDec-POMDPs [9, 12] for decentralized multi-agent decision-making under uncertainty. Formally, a MacDec-POMDP is defined as a Dec-POMDP [13, 14] tuple augmented with a finite set of macro-actions $M_i$ for each agent $i$, with $M = \times_i M_i$ denoting the set of joint macro-actions. Each macro-action $m_i \in M_i$ is defined as a tuple $m_i = (\beta_{m_i}, I_{m_i}, \pi_{m_i})$ where $\beta_{m_i}$ and $I_{m_i}$ are sets of rules that decide, respectively, the termination of and the eligibility to initiate the macro-action $m_i$, while $\pi_{m_i}$ denotes a low-level policy that maps agent $i$'s local histories to primitive actions. Each agent follows a chosen macro-action until its termination condition is met; the stream of observations collected during the execution of $m_i$ is jointly defined as a macro-observation. As such, each individual high-level policy $\Phi_i$ of agent $i$ can be characterized as a mapping from its history of macro-actions and macro-observations to the next macro-action. Planning in a MacDec-POMDP therefore involves maximizing the following total expected reward with respect to the joint high-level policy $\Phi = (\Phi_1, \ldots, \Phi_n)$:
$$V(\Phi) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{H-1} r(s_t, \vec{a}_t) \,\middle|\, \Phi\right] \qquad (1)$$
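To make the macro-action formalism concrete, the execution semantics of a single macro-action can be sketched as follows (a minimal illustration in Python; the class and function names are our own, not from [9, 12]):

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical, minimal encoding of a macro-action tuple: a termination
# rule, an initiation-eligibility rule and a low-level policy over local
# histories of (action, observation) pairs.
@dataclass
class MacroAction:
    terminates: Callable[[list], bool]    # termination rule over local history
    can_initiate: Callable[[list], bool]  # initiation-eligibility rule
    policy: Callable[[list], int]         # low-level policy: history -> primitive action

def run_macro_action(m: MacroAction, step: Callable[[int], int], history: list) -> list:
    """Execute m until its termination rule fires; the stream of primitive
    observations collected along the way forms the macro-observation."""
    macro_obs = []
    while not m.terminates(history):
        a = m.policy(history)
        obs = step(a)                     # one black-box environment step
        history.append((a, obs))
        macro_obs.append(obs)
    return macro_obs
```

A high-level policy then only chooses which macro-action to start next, based on the macro-action/macro-observation history, rather than reasoning over primitive actions directly.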
Unlike Dec-POMDPs, the MacDec-POMDP formalism is naturally suited to asynchronous multi-robot planning scenarios since the macro-actions need not share the same execution time. In fact, from the perspective of an individual agent, the outcome of its selected macro-action (e.g., when it terminates) is non-deterministic, as its termination rule may depend on the global state of the environment as well as on the movements of the other parties, which are not observable to the agent. This makes optimizing $\Phi$ via (1) using traditional model-based dynamic programming techniques [15, 16, 17, 18, 19, 20, 9] possible only if the probability distribution over the stochastic outcomes of each macro-action is explicitly characterized. This is not trivial and does not scale well in complex decision problems with long planning horizons and vast state and action spaces. Alternatively, to sidestep this difficulty, it is also possible to parameterize $\Phi$ and optimize it directly via interaction with a black-box simulator (in many real-world scenarios, it is often easier to hand-code a simulator that captures the interaction rules between agents than to learn probability models of their outcomes) that implicitly encodes the probabilistic models of transition, observation, reward and termination [8]. This interestingly allows us to avoid modeling these probabilistic quantities directly and improves the scalability of solving MacDec-POMDPs. The specifics of this model-free approach are detailed in Section III, which serves as the building block of our adversarial multi-agent planning paradigm in Section IV.

III. Generating Basic Stratagems
This section assumes we have access to a set of black-box simulators, preset by the domain expert, that accurately simulate the adversaries' basic tactics, upon which more advanced strategies might be built. For example, in popular real-time strategy (RTS) games, a player can often anticipate in advance a small set of effective basic tactics from which the other competitors might choose in any situation. The decision-making process of a player therefore comprises two parts. The first part focuses on formulating fundamental stratagems to counter the anticipated tactics of the adversaries and is addressed in the remainder of this section. The second part is to integrate the resulting stratagems into a unified strategy that can detect changes in the adversaries' tactical choice and switch opportunistically between them in response to those changes (see Section IV).
In particular, formulating a stratagem to counter a specific tactic of the adversaries can be posed as solving a MacDec-POMDP which characterizes a cooperative scenario where a team of decentralized agents collaborate to maximize their total expected reward while operating in an artificial environment driven by the corresponding tactic simulator. The stratagem can then be optimized via simulation, as detailed next. Formally, we represent a stratagem of a team of agents as a set of decentralized finite-state-automaton (FSA) policy controllers, each of which characterizes a single agent's part of the stratagem. Each individual controller has a finite set of nodes, and two probabilistic functions are associated with each node $q$: (a) an output function that gives the probability that macro-action $m$ is selected by the agent at $q$; and (b) a transition function that gives the probability of transiting from $q$ to $q'$ following the execution of the selected macro-action $m$. These weights can then be optimized via simulation using the graph-based direct cross-entropy (GDICE) optimization method described in [8] (see Figure 1).
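The controller parameterization and the cross-entropy search used here can be sketched compactly as follows (an illustrative reimplementation in the spirit of GDICE [8], under our own simplifying assumptions such as fixed dimensions and deterministic controller samples; it is not the reference code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an FSA controller with Q nodes over M macro-actions.
Q, M = 4, 3
N_SAMPLES, N_ELITE = 50, 5

def sample_controller(theta_out, theta_tr):
    """Draw a deterministic controller from the sampling distribution:
    an output macro-action per node and a successor node per (node, action)."""
    out = np.array([rng.choice(M, p=theta_out[q]) for q in range(Q)])
    tr = np.array([[rng.choice(Q, p=theta_tr[q, m]) for m in range(M)]
                   for q in range(Q)])
    return out, tr

def gdice(simulate, n_iters=30, alpha=0.2):
    """Cross-entropy search: sample controllers, evaluate each with the
    black-box simulator, keep the elite subset, refit the categorical
    sampling parameters by MLE (smoothed by a learning rate)."""
    theta_out = np.full((Q, M), 1.0 / M)
    theta_tr = np.full((Q, M, Q), 1.0 / Q)
    best, best_val = None, -np.inf
    for _ in range(n_iters):
        pop = [sample_controller(theta_out, theta_tr) for _ in range(N_SAMPLES)]
        vals = np.array([simulate(c) for c in pop])   # black-box estimates
        elite_idx = np.argsort(vals)[-N_ELITE:]
        new_out = np.zeros_like(theta_out)
        new_tr = np.zeros_like(theta_tr)
        for i in elite_idx:                           # MLE over elite samples
            out, tr = pop[i]
            for q in range(Q):
                new_out[q, out[q]] += 1
                for m in range(M):
                    new_tr[q, m, tr[q, m]] += 1
        theta_out = (1 - alpha) * theta_out + alpha * new_out / N_ELITE
        theta_tr = (1 - alpha) * theta_tr + alpha * new_tr / N_ELITE
        if vals[elite_idx[-1]] > best_val:
            best_val, best = vals[elite_idx[-1]], pop[elite_idx[-1]]
    return best, best_val
```

Because `simulate` is only ever queried as a black box, the same loop applies unchanged whether the performance estimate comes from a single tactic simulator (this section) or a meta simulator (Section IV).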
In essence, GDICE iteratively samples candidate controllers from a parameterized sampling distribution and simulates each induced policy against the opponent's tactic using its black-box simulator to acquire a performance estimate. At each iteration, the subset of samples with top performance estimates is used to update the sampling parameters via maximum likelihood estimation (MLE). This process has been demonstrated empirically in [21] to converge towards a uniform distribution over the optimal solutions. In practice, this optimization paradigm is well suited to multi-robot planning scenarios since it allows us to bypass explicit probabilistic modeling of the opponent's tactic, which is usually fraught with the curses of dimensionality and histories, especially in complex problem domains with a large number of agents and vast action and observation spaces [8]. This method will also serve as the building block for our stratagem fusion scheme detailed in Section IV below.

IV. Stratagem Fusion
This section introduces the stratagem fusion scheme that integrates all basic stratagems (see Section III) into a set of unified policies for a team of agents to collaborate effectively against the adversaries' high-level policies that switch opportunistically among a set of basic tactics. The task of stratagem fusion is to formulate a high-level policy that can automatically detect situation changes and choose which response to follow at any decision point so as to adapt effectively to new situations (e.g., the adversaries decide to switch to a different tactic) and, consequently, maximize its expected performance. To achieve this, we model the team's high-level policy as a set of unified controllers, each of which characterizes a single agent's high-level individual policy and results from connecting its low-level controllers via inter-controller transitions (see Figure 2). This essentially allows the agents to change their strategic choices during real-time execution by transiting between nodes of different controllers. The weights associated with these transitions regulate the switching decisions of the high-level controller and need to be optimized. If we knew exactly how the adversaries change their tactics (i.e., their black-box simulators) in response to our strategic choices, these weights could be optimized using the same approach described in Section III (see Figure 2b).
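One step of such a unified controller can be sketched as follows (a deliberately simplified illustration with our own names and data layout, not the paper's implementation): within-controller behavior is kept, and extra inter-controller transition weights let the agent jump from a node of one specialized controller into another controller.

```python
import random

# Hypothetical layout: low_level maps a stratagem id k to that controller's
# node-transition function (node, macro_obs) -> next node, and
# switch_weights[(k, q)] maps candidate stratagems k' to switching weights.
def step_unified(state, low_level, switch_weights, macro_obs, rng=random):
    k, q = state                                   # current stratagem, node
    ks, ws = zip(*switch_weights[(k, q)].items())
    k_next = rng.choices(ks, weights=ws)[0]        # maybe switch stratagem
    if k_next != k:
        return (k_next, 0)                         # enter k' at its start node
    return (k, low_level[k](q, macro_obs))         # normal in-controller step
```

Only the entries of `switch_weights` are optimized in this section; the low-level controllers are the frozen stratagems from Section III.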
[Figure 2: (a) and (b); see text.]
In practice, however, the adversaries' switching mechanism is often unknown or highly non-trivial to characterize, especially in decentralized settings where their strategic choices are largely influenced by their limited observation and communication capacities, which are also unknown. Existing works [10, 6, 22, 23] that attempt to reason explicitly about the adversaries' strategic rationality are therefore impractical and less robust in situations where irrational choices arise due to limited cognitive abilities and lack of communication. This motivates us to consider a more reasonable approach: formulate a robust policy that works well on average when tested against all possible high-level strategies of the adversaries. To achieve this, the adversaries' switching policies are similarly modeled as high-level controllers that connect low-level controllers representing their basic tactics via inter-controller transitions, as illustrated in Figure 2a. The weights of these inter-controller transitions (which regulate switching decisions) are then treated as random variables drawn from a known distribution. Thus, instead of optimizing our agents' switching weights with respect to a single realization of the adversaries' inter-controller transitions, we optimize them with respect to the distribution of these switching weights to embrace their uncertainty.
Formally, let $C$ and $\bar{C}$ denote the sets of high-level controllers for the teams of collaborative agents and adversaries, respectively, where each member of $C$ ($\bar{C}$) denotes a single agent's (adversary's) individual switching policy. Let $\theta$ denote the weights associated with the inter-controller transitions of $C$, and let $\Omega$ denote the distribution over the random weights $\omega$ that regulate the switching decisions of $\bar{C}$. Our approach optimizes $\theta$ such that the expected performance of the induced high-level controller, when tested against a random adversary with $\omega \sim \Omega$, is maximized:
$$\theta^{*} \;=\; \arg\max_{\theta} V(\theta), \qquad V(\theta) \;\triangleq\; \mathbb{E}_{\omega \sim \Omega}\big[V(\theta, \omega)\big] \qquad (2)$$
where $V(\theta, \omega)$ denotes the simulated performance of $C(\theta)$ against $\bar{C}(\omega)$. However, since we can only access the value of $V(\theta, \omega)$ via simulation, solving (2) requires simulating $\theta$ against infinitely many candidates of $\omega$ and is therefore intractable. To sidestep this intractability, we instead exploit the following surrogate objective function,
$$\widehat{V}(\theta) \;\triangleq\; \frac{1}{n}\sum_{i=1}^{n} V(\theta, \omega_i) \qquad (3)$$
where $\omega_1, \ldots, \omega_n$ are i.i.d. samples drawn from $\Omega$. Intuitively, these are potential candidates for the adversaries' switching weights that can be identified in advance using the domain expert's knowledge. We can now solve (3) using GDICE [8] (see Section III) with a meta black-box that aggregates the feedback of the black-box induced by each $\omega_i$.
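The meta black-box amounts to averaging the per-adversary black-boxes, so any simulation-based optimizer can be reused unchanged (a sketch; all names are illustrative):

```python
# Empirical surrogate for the expected performance against a random
# adversary: average the simulated performance of candidate switching
# weights theta against n sampled adversary weight vectors.
def surrogate_value(theta, omega_samples, simulate):
    """Empirical estimate of E[V(theta, omega)] over omega ~ Omega."""
    return sum(simulate(theta, omega) for omega in omega_samples) / len(omega_samples)

def meta_blackbox(omega_samples, simulate):
    """Wrap the per-adversary black-boxes into one meta black-box, so a
    standard simulation-based optimizer (e.g., GDICE) sees a single
    objective function of theta alone."""
    return lambda theta: surrogate_value(theta, omega_samples, simulate)
```

Passing the returned function to a cross-entropy optimizer then directly maximizes the surrogate objective over the team's switching weights.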
V. Theoretical Analysis
This section derives performance guarantees for the above stratagem fusion scheme (Section IV) which depend on the solution quality of the graphbased direct crossentropy (GDICE) optimization method described in [8].
To enable the analysis, we put forward the following assumption:
Assumption 1. Let  denote an arbitrary black-box function being optimized via simulation with GDICE using (3), and let  denote the set of optimal solutions to (3). Then, let  and  denote the uniform distribution over the optimal set and the sampling distribution of GDICE (see Section III), respectively. For any , there exists a non-decreasing sequence for which:

where  and  denote the size of the optimal set and the optimal parameterization found by GDICE, respectively.
This is a reasonable assumption since it has previously been demonstrated that the underlying cross-entropy optimization process of GDICE empirically converges towards the uniform distribution over optimal values [8, 21]. We are then interested in the gap between the expected performance of the solution found by GDICE and the best performance that can be achieved (see Eq. (2)). This gap essentially characterizes the near-optimality of our solution, which we bound below. To do this, we first establish Lemmas 1 and 2, which bound the difference between the generalized performance and its empirical version (i.e., the average performance when tested against a finite set of adversary candidates). Lemma 3 is then established to bridge the remaining gap. The main result bounding the overall performance gap is derived in Theorem 1 as a direct consequence of these lemmas.
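Although the precise quantities are elided in this version, the structure of the argument is the standard decomposition through the empirical objective. Writing $V$ for the generalized performance in (2), $\widehat{V}$ for the empirical surrogate in (3), $\hat{\theta}$ for the solution found by GDICE and $\theta^{*}$ for the optimum of (2), one plausible reading of the proof outline is:

```latex
V(\theta^{*}) - V(\hat{\theta})
  \;=\; \underbrace{\big[V(\theta^{*}) - \widehat{V}(\theta^{*})\big]}_{\text{generalization gap at } \theta^{*}}
  \;+\; \underbrace{\big[\widehat{V}(\theta^{*}) - \widehat{V}(\hat{\theta})\big]}_{\text{empirical suboptimality}}
  \;+\; \underbrace{\big[\widehat{V}(\hat{\theta}) - V(\hat{\theta})\big]}_{\text{generalization gap at } \hat{\theta}}
```

so bounding each bracket with high probability bounds the total gap, which is the role the lemmas and theorem below play.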
Lemma 1. For any sampling distribution, let , with  defined in (3), denote the empirical performance when the candidate is randomly drawn from that distribution. Then, with probability at least  over the choice of candidates for the adversaries' switching weights,
(4) 
holds universally for all possible sampling distributions, where  denotes the uniform distribution over the set of optimal choices for (3).
Exploiting the result of Lemma 1, we can further derive a tighter, domain-specific bound on the difference between the generalized and empirical performance of our stratagem fusion scheme (see Section IV) that incorporates the empirical optimality of GDICE (see Assumption 1):
Lemma 2. Let  denote the optimal sampling distribution found by GDICE [8], let  denote the number of stratagems of each agent (Section III), and let  denote the number of nodes in each agent's individual specialized controller. It then follows that with probability at least ,
(5) 
where the remaining quantities are defined in Lemma 1.
Lemmas 1 and 2 thus bound the gap between the generalized and empirical performance of the GDICE solution. To relate this to the best achievable performance, we also need the following result:
Lemma 3. Let  denote the optimal sampling distribution found by GDICE [8]; then, with probability at least ,
(6) 
where is defined in Lemma 1.
Using these results, the key result can be stated and proven:
Theorem 1. Let  denote the optimal sampling distribution found by GDICE [8] and let  denote the optimal solution to (2), whose value thus represents the best possible performance. Then, with probability at least ,
(7) 
where  is defined in Lemma 1. Due to limited space, all proofs of the above results are deferred to the appendix of the extended version of this paper at https://www.dropbox.com/s/ao7onnpq52t3ar3/icra18.pdf?dl=0.
VI. Experiments
This section presents an adversarial, multi-robot Capture-The-Flag (CTF) domain, adapted from its original form in [4], to demonstrate the effectiveness of our stratagem fusion framework in Section IV. The specific domain setup for our CTF variant is detailed in Section VI-A below.
[Figure 3: (a) and (b); see text.]
VI-A. Capture-The-Flag Domain
The domain settings for Capture-The-Flag are shown in Fig. 3a, which depicts a competitive scenario between two teams of decentralized, collaborative robots. The two teams divide the environment into two parts separated by a horizontal boundary (the cyan line in Fig. 3a), each of which belongs to one team (red or blue). There are several vantage points within each team's territory, one of which contains the team's flag (the red and blue circles in Fig. 3a denote the locations of the flags for the red and blue teams, respectively). Each team, however, only knows the location of its own flag, making observations necessary to correctly detect the enemy's flag. The goal of the game is for each team to defend its own flag while seeking to capture the flag of the opposing team without getting caught; the game ends when one team successfully captures the flag of the opposing team. To achieve this, each team of agents needs to coordinate its movements between vantage points to reach the opposing team's flag while avoiding being seen by opposing agents. If an agent engages an opposing agent on foreign territory, its team is charged a penalty. The particular macro-actions and observations available to each robot are detailed below; they support a wide range of interesting patterns of collaborative attack and defense between the opposing robots:
Macro-Actions. There are four classes of macro-actions available to each robot at any decision time: (a) a navigation macro, which invokes a collision-avoidance navigation procedure that directs the robot to a vantage point from its current location; (b) a patrol macro, which directs the robot to a vantage point and then lets it move in a closed loop through a predefined sequence of vantage points and back (there are several predefined instances for each team); (c) a pincer macro, which directs the robot to a vantage point determined by its role index in the team, creating an effective pincer attack when multiple robots choose the same instance (there are several different pincer macros predefined for each team); and (d) a tagging macro, which allows a robot to catch an opposing robot on its own territory provided the opposing robot is within a predefined tagging range.
Macro-Observations. The macro-observations for each robot are generated by first collecting raw observations of the environment using the robot's onboard visual recognition/detection modules and then summarizing the raw information into a six-dimensional binary observation vector whose components correspond to yes/no (1/0) answers to the following questions: (a) Is the robot residing in its own or the opposing territory? (b) Does the enemy flag appear in sight? (c) Is there an opposing robot in close proximity? (d) Is there an opposing robot further away? (e) Is there an allied robot in the vicinity? and (f) Is there an observed pincer signal emitted from allied robots? The answers to these questions can be generated from the raw visual processing unit onboard each robot.

Rewards. Finally, in order to encourage each team to discover and capture the opposition's flag as soon as possible while avoiding being tagged, a reward mechanism is implemented which issues (a) a small negative reward to each robot at each time step; (b) a positive reward to a team if one of its members successfully tags an enemy; (c) a negative reward to the entire team if one of its members gets caught; and (d) a large reward to a team when it successfully captures the opposition's flag, which conversely implies a correspondingly large penalty for the team that loses its flag.
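The six yes/no questions above suggest a straightforward encoding of the macro-observation (a sketch; the field names are our own illustration, not the paper's code):

```python
from dataclasses import dataclass

# Hypothetical container for the onboard detector outputs that answer the
# six macro-observation questions (a)-(f) described above.
@dataclass
class Percepts:
    in_own_territory: bool
    enemy_flag_in_sight: bool
    enemy_close: bool
    enemy_far: bool
    ally_nearby: bool
    pincer_signal: bool

def encode_macro_observation(p: Percepts) -> list:
    """Summarize the detector outputs as the yes/no (1/0) vector fed to
    the robot's high-level policy controller."""
    return [int(p.in_own_territory), int(p.enemy_flag_in_sight),
            int(p.enemy_close), int(p.enemy_far),
            int(p.ally_nearby), int(p.pincer_signal)]
```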
Black-Box Simulators. In addition to the domain specification above, the allied robots also have access to a set of black-box simulators of the opposition's fundamental tactics, upon which more advanced strategies might be built. In our experiments, these are constructed as tuples of individual hand-coded tactics (see Table I below) that include: (a) left- and right-flank defenses, which script the robot to play defensively on the left or right flank of its territory using patrol macro-actions; (b) a middle-front defense, which scripts the robot to play defensively on the middle front of the allied territory; (c) a scouting attack, which leads the robot to a vantage point inside the opposition's territory to get an observation; depending on the collected observation, the robot either moves to another vantage point or launches a pincer attack on a vantage point estimated to contain the opposition's flag; and (d) a cautious variant of (c), which additionally enables the robot to retreat to a safe place within the allied territory to gather extra observations if it observes an opposing robot in close proximity. The team of allied robots, however, does not have access to these details and can only interact with the tactics via a black-box interface that gives feedback on how well the allied strategies fare against the opposition's.
[Table I: hand-coded tactic assignments for robots R1, R2 and R3.]
VI-B. Experiment: Generating Basic Stratagems
To learn the fundamental stratagems that counter the opposition's basic tactics described in Section VI-A, we construct separate MacDec-POMDPs (see Section II), each of which encapsulates the corresponding opposition tactic simulator. The corresponding stratagems can then be formulated and computed as decentralized FSA controllers (Section III) that optimize these MacDec-POMDPs. This is achieved via the recently developed graph-based direct cross-entropy (GDICE) stochastic optimization method of [8]. Fig. 4 shows that the empirical performance of each stratagem, when tested against the corresponding opposition tactic, increases and converges rapidly to the optimal performance as the number of optimization iterations increases. Table II then reports the averaged performance (with standard errors) of each stratagem when tested against all other opposition tactics over independent simulations. The results interestingly show that the quality of each stratagem decreases significantly when tested against opposition tactics that it was not optimized to interact with (see Table II's first rows and columns). This implies a performance risk when applying a single stratagem against a non-stationary opponent with switching tactics: the applied stratagem may no longer be optimal when the opponent switches to a new tactic. This necessitates the design of agents that can detect and respond aptly when the opponents change their tactics, which constitutes the main contribution of our work (Section IV). Its effectiveness is demonstrated next in Section VI-C.

[Figure 4: (a)-(d). The shaded area represents the confidence interval of the average performance.]
VI-C. Experiment: Stratagem Fusion
This section empirically demonstrates the effectiveness of our stratagem fusion framework (Section IV) against more sophisticated, non-stationary/strategic opponents. In particular, we first evaluate the performance of the stratagems optimized in the previous experiments (Section VI-B) against a team of opponents with switching tactics: each opponent independently switches its tactic based on a set of probability weights (as described in Section IV). The results (averaged over independent runs) are reported in the last columns of Table II and show significant decreases in the performance of each stratagem when tested against an opponent that keeps switching between tactics. This corroborates our earlier observation that a single stratagem is generally ineffective against opponents with unexpected behaviors. This can be remedied using our stratagem fusion scheme (see Fig. 2) to integrate all single stratagems into a unified (switching) policy that performs effectively against the switching tactics of the opponents (assuming the switching weights are known). The results reported in the last row of Table II in fact show that, among all policies, the optimized switching policy performs best against the tactic-switching opponents and near-optimally against each stationary opponent: its performance is, in most cases, only second (and very close) to the corresponding stratagem specifically designed to counter that opponent's tactic.
In practice, however, since the switching weights of the opponents are usually not known a priori, a similar problem arises when the actual weights used by the opposition's switching tactic differ from those used to optimize the switching policy of the allied robots. To resolve this, our stratagem fusion scheme further treats the switching weights of the opposition as random variables whose samples are either given in advance or can be drawn directly from a black-box distribution. A good-for-all switching policy can thus be computed using our sampling method in Section IV (specifically, see Eq. (3)), which is guaranteed, with high probability, to produce near-optimal performance against unseen switching weights of the opposition. This is empirically demonstrated in Table III, which shows the superior performance of the good-for-all policy relative to that of the good-for-one policy when tested against opponents with unseen tactic-switching weights. Also, similar to the case of the basic stratagems in Section III, the quality of these switching policies increases and converges rapidly to the optimal value as the number of optimization iterations increases (Fig. 5), which demonstrates the stability of our stratagem fusion framework.
[Figure 5: (a) and (b); see text.]
VII. Hardware Experiments
In addition to the simulated experiments, we also conduct real-time experiments with real robots to showcase the robustness of our proposed framework in practical RTS scenarios. The specifics of our robot configuration and domain setup are shown in Fig. 3. Each robot is built on the Kobuki base of the TurtleBot and configured with an onboard processing unit (a Gigabyte Aero 14 laptop with an Intel Core quad-core CPU and an NVIDIA GTX GPU) as well as sensory devices including (1) an Intel RealSense camera developer kit providing depth/IR and RGB streams; and (2) an omnidirectional RPLIDAR A2. The information provided by the LIDAR sensor is fed to each robot's onboard collision-avoidance navigation procedure [24] to help it localize and move around without colliding with other robots and obstacles in the environment. The visual feed from the RealSense camera is passed through the Single Shot MultiBox Detector [25], implemented on each allied robot's processing unit, to detect surrounding objects (e.g., the opposing robots, other allied robots and the flags). The processed information is then used to generate the high-level macro-observations (Section VI-A) for the robot's onboard policy controller. Fig. 6 shows a visual excerpt from our video demo featuring a CTF scenario in which the allied robots implement the optimized policy produced by our framework to compete against an opposing team of adversary robots implementing the hand-coded tactics of Section VI-A. The excerpt shows interesting teamwork among the allied robots in capturing the opposing team's flag despite their partial, decentralized views of the world (see the detailed narration in Fig. 6's caption), which further demonstrates the robustness of our proposed framework in practical robotic applications. Interested readers are referred to our attached video demo for a complete visual demonstration.
[Figure 6: (a)-(f); see caption narration in the text.]
VIII. Conclusion
This paper introduces a novel near-optimal adversarial policy switching algorithm for decentralized, non-cooperative multi-agent systems. Unlike existing works in the literature, which are mostly limited to simple decision-making scenarios where a single agent plans its best response against an adversary whose strategy is specified a priori under reasonable assumptions, we investigate a class of multi-agent scenarios where multiple robots must operate independently, in collaboration with their teammates, to act effectively against adversaries with changing strategies. To achieve this, we first optimize a set of basic stratagems, each of which is tuned to respond optimally to a pre-identified basic tactic of the adversaries. The stratagems are then integrated into a unified policy that performs near-optimally against any high-level strategy of the adversaries that switches between their basic tactics. The near-optimality of our proposed framework is established both theoretically and empirically, with interesting and consistent results. We believe this is a significant step towards bridging the gap between theory and practice in multi-agent research.
Acknowledgements. This research is funded in part by the ONR under MURI program award N000141110688 and BRC award N000141712072 and Lincoln Lab.
References
 [1] Michael Rubenstein, Alejandro Cornejo, and Radhika Nagpal. Programmable self-assembly in a thousand-robot swarm. Science, 345(6198):795–799, 2014.
 [2] Justin Werfel, Kirstin Petersen, and Radhika Nagpal. Designing collective behavior in a termite-inspired robot construction team. Science, 343(6172):754–758, 2014.
 [3] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Yu Fan Chen, Nazim Kemal Ure, Shih-Yuan Liu, Brett T. Lopez, Rajeev Surati, Jonathan P. How, and John Vian. Measurable augmented reality for prototyping cyber-physical systems: A robotics platform to aid the hardware prototyping and performance testing of algorithms. IEEE Control Systems, 36:65–87, 2016.
 [4] Raffaello D'Andrea and Richard M. Murray. The RoboFlag competition. In Proc. ACC, 2003.
 [5] Peter Stone. Intelligent Autonomous Robotics: A Robot Soccer Case Study. Morgan and Claypool Publishers, 2007.
 [6] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. JAIR, 24:49–79, 2005.
 [7] S. Omidshafiei, A. Agha-Mohammadi, C. Amato, and J. P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Proc. ICRA, 2015.
 [8] S. Omidshafiei, A. Agha-Mohammadi, C. Amato, S.-Y. Liu, J. P. How, and J. Vian. Graph-based cross entropy method for solving multi-robot decentralized POMDPs. In Proc. ICRA, 2016.
 [9] Christopher Amato, George D. Konidaris, and Leslie P. Kaelbling. Planning with macro-actions in decentralized POMDPs. In Proc. AAMAS, 2014.
 [10] T. N. Hoang and K. H. Low. Interactive POMDP Lite: Towards practical planning to predict and exploit intentions for interacting with self-interested agents. In Proc. IJCAI, 2013.

 [11] T. N. Hoang and K. H. Low. A general framework for interacting Bayes-optimally with self-interested agents using arbitrary parametric model and model prior. In Proc. IJCAI, pages 1394–1400, 2013.
 [12] Christopher Amato, George D. Konidaris, Ariel Anders, Gabriel Cruz, Jonathan P. How, and Leslie P. Kaelbling. Policy search for multi-robot coordination under uncertainty. The International Journal of Robotics Research, 2017.
 [13] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. The complexity of decentralized control of Markov decision processes. In Proc. UAI, June 2000.
 [14] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.

 [15] D. Szer, F. Charpillet, and S. Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proc. UAI, 2005.
 [16] S. Seuken and S. Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. In Proc. UAI, July 2007.
 [17] S. Seuken and S. Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In Proc. IJCAI, pages 2009–2015, 2007.
 [18] A. Boularias and B. Chaib-draa. Exact dynamic programming for decentralized POMDPs with lossless policy compression. In Proc. ICAPS, 2008.
 [19] M. T. J. Spaan, F. A. Oliehoek, and N. Vlassis. Multiagent planning under uncertainty with stochastic communication delays. In Proc. ICAPS, 2008.
 [20] M. T. J. Spaan, F. A. Oliehoek, and C. Amato. Scaling up optimal heuristic search in DEC-POMDPs via incremental expansion. In Proc. IJCAI, pages 2027–2032, 2011.

 [21] R. Y. Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Monte Carlo Simulation, Randomized Optimization and Machine Learning. Springer, 2004.
 [22] P. Doshi and D. Perez. Generalized point based value iteration for interactive POMDPs. In Proc. AAAI, pages 63–68, 2008.
 [23] P. Doshi, Xia Qu, Adam Goodie, and Diana Young. Modeling recursive reasoning by humans using empirically informed interactive POMDPs. In Proc. AAMAS, pages 1223–1230, 2010.
 [24] Y. Chen, S. Liu, M. Liu, J. Miller, and J. How. Motion planning with diffusion maps. In Proc. IROS, 2016.
 [25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C. Berg. SSD: Single Shot MultiBox Detector. arXiv:1512.02325, 2016.
 [26] David McAllester. PACBayesian model averaging. In COLT, 1999.