I Introduction
There has been significant progress in recent years on developing cooperative multi-robot systems that can operate in real-world environments with uncertainty. Example applications of social and economic interest include search and rescue (SAR) [1], traffic management for smart cities [2], planetary navigation [3], robot soccer [4], and e-commerce and transport logistics processes [5]. Planning in such environments must address numerous challenges, including imperfect models and knowledge of the environment, restricted communication between robots, noisy and limited sensors, different viewpoints of each robot, asynchronous computation, and computational limitations.
These planning problems, in the most general form, can be formulated as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [6], a general framework for cooperative sequential decision making under uncertainty. In Dec-POMDPs, robots make decisions based on local streams of information (i.e., observations), such that the expected value of the team (e.g., number of victims rescued, average customer satisfaction) is maximized. However, representing and solving Dec-POMDPs is often intractable for large domains, because finding an optimal (even approximate) solution of a Dec-POMDP (even for finite horizon) is NEXP-complete [6]. To combat this issue, recent research has addressed the more scalable macro-action based Dec-POMDP (MacDec-POMDP), where each agent has temporally-extended actions, which may require different amounts of time to complete [7]. Moreover, significant progress has been made on demonstrating the usefulness of MacDec-POMDPs via a range of challenging robotics problems, such as a warehouse domain [8], bartending and beverage service [9], and package delivery [10, 11]. However, current MacDec-POMDP methods require knowing domain models a priori. Unfortunately, for many real-world problems, such as SAR, the domain model may not be completely available. Recently, researchers started to address this issue via reinforcement learning and proposed a policy-based EM algorithm (PoEM) [12], which can learn valid controllers using only trajectory data containing observations, macro-actions (MAs), and rewards. Although PoEM has convergence guarantees in the batch learning setting and can recover optimal policies for benchmark problems given sufficient data, it suffers from local optimality and sensitivity to initial conditions for complicated real-world problems. Inevitably, as an EM-type algorithm, the results of PoEM can be arbitrarily poor given bad initialization. Additionally, few hardware demonstrations exist of challenging tasks such as SAR, which involve a large team of heterogeneous robots (both ground and aerial vehicles) under a MacDec-POMDP formulation. This paper addresses these gaps by proposing an iterative sampling-based Expectation-Maximization algorithm (iSEM) to learn policies. Specifically, this paper extends previous approaches by using concurrent (multi-threaded) EM iterations that provide feedback to one another, enabling resampling of parameters and reallocation of computational resources for threads that are clearly converging to poor values.
The algorithm is tested in the batch learning setting, which is commonly used in learning from demonstration. Through theoretical analysis and numerical comparisons on a large multi-robot SAR domain, we demonstrate that the new algorithm can better explore the policy space. As a result, iSEM is able to achieve better expected values than the state-of-the-art learning-based method, PoEM. Finally, we present an implementation of two variants of the multi-robot SAR domain (with and without obstacles) on hardware to demonstrate that the learned policies can effectively control a team of distributed robots cooperating in a partially observable stochastic environment.
II Background
We first discuss the background on Dec-POMDPs and MacDec-POMDPs and then describe the PoEM algorithm.
II-A Dec-POMDPs and MacDec-POMDPs
Decentralized POMDPs (Dec-POMDPs) generalize POMDPs to the multi-agent, decentralized setting [6, 13]. Multiple agents operate under uncertainty based on partial views of the world, with execution unfolding over a bounded or unbounded number of steps. At each step, every agent chooses an action (in parallel) based on locally observable information and then receives a new observation. The agents share a joint reward based on their joint concurrent actions, making the problem cooperative. However, agents' local views mean that execution is decentralized.
Formally, a Dec-POMDP is represented as an octuple $\langle \mathcal{I}, \mathcal{A}, \mathcal{O}, \mathcal{S}, \mathcal{T}, \Omega, \mathcal{R}, \gamma \rangle$, where $\mathcal{I} = \{1, \ldots, n\}$ is a finite set of agent indices; $\mathcal{A} = \otimes_i \mathcal{A}_i$ and $\mathcal{O} = \otimes_i \mathcal{O}_i$ respectively are sets of joint actions and observations, with $\mathcal{A}_i$ and $\mathcal{O}_i$ available to agent $i$. At each step, a joint action $\vec{a} = (a_1, \ldots, a_n) \in \mathcal{A}$ is selected and a joint observation $\vec{o} = (o_1, \ldots, o_n) \in \mathcal{O}$ is received; $\mathcal{S}$ is a set of finite world states; $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the state transition function with $\mathcal{T}(s' \mid s, \vec{a})$ denoting the probability of transitioning to $s'$ after taking joint action $\vec{a}$ in $s$; $\Omega: \mathcal{S} \times \mathcal{A} \times \mathcal{O} \to [0,1]$ is the observation function with $\Omega(\vec{o} \mid s', \vec{a})$ the probability of observing $\vec{o}$ after taking joint action $\vec{a}$ and arriving in state $s'$; $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function with $r = \mathcal{R}(s, \vec{a})$ the immediate reward received after taking joint action $\vec{a}$ in $s$; $\gamma \in [0,1)$ is a discount factor. Because each agent lacks access to other agents' observations, each agent maintains a local policy $\pi_i$, defined as a mapping from local observation histories to actions. A joint policy $\pi = \otimes_i \pi_i$ consists of the local policies of all agents. For an infinite-horizon Dec-POMDP with initial state $s_0$, the objective is to find a joint policy $\pi$, such that the value of $\pi$ starting from $s_0$, $V^{\pi}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, \vec{a}_t) \mid s_0, \pi\right]$, is maximized. Specifically, given $h_{i,t} = (a_{i,0:t-1}, o_{i,1:t})$, the history of actions and observations up to step $t$, the policy $\pi_i$ probabilistically maps $h_{i,t}$ to $a_{i,t}$: $\pi_i(a_{i,t} \mid h_{i,t})$.

A MacDec-POMDP with (local) macro-actions extends the MDP-based options framework [14] to Dec-POMDPs. Formally, a MacDec-POMDP is defined as a tuple $\langle \mathcal{I}, \mathcal{A}, \mathcal{O}, \mathcal{S}, \mathcal{T}, \Omega, \mathcal{R}, \gamma, \mathcal{Z}, \mathcal{M} \rangle$, where $\mathcal{I}, \mathcal{A}, \mathcal{O}, \mathcal{S}, \mathcal{T}, \Omega, \mathcal{R}$ and $\gamma$ are the same as defined in the Dec-POMDP; $\mathcal{Z} = \otimes_i \mathcal{Z}_i$ are sets of joint macro-action observations, which are functions of the state; $\mathcal{M} = \otimes_i \mathcal{M}_i$ are sets of joint macro-actions, with $m_i = \langle I_{m_i}, \beta_{m_i}, \pi_{m_i} \rangle \in \mathcal{M}_i$, where $I_{m_i} \subset H_i^{M}$ is the initiation set that depends on macro-action observation histories, defined as $h_{i,t}^{M} = (m_{i,0:t-1}, z_{i,1:t})$, $\beta_{m_i}: \mathcal{S} \to [0,1]$ is a stochastic termination condition that depends on the underlying states, and $\pi_{m_i}: H_i^{A} \to \mathcal{A}_i$ is an option policy for macro-action $m_i$ ($H_i^{A}$ is the space of histories of primitive actions and observations). Macro-actions are natural representations for robot or human operation for completing a task (e.g., navigating to a way point or placing an object on a robot). MacDec-POMDPs can be thought of as decentralized partially observable semi-Markov decision processes (Dec-POSMDPs) [9, 10], because it is important to consider the amount of time that may pass before a macro-action is completed. A high-level policy $\Psi_i$ can be defined for each agent $i$ for choosing macro-actions, depending on macro-action observation histories. Given a joint policy, the primitive action at each step is determined by the high-level policy that chooses the MA, and the MA policy that chooses the primitive action. The joint high-level policies $\Psi = \otimes_i \Psi_i$ and macro-action policies $\pi = \otimes_i \pi_{m_i}$ can be evaluated as
$$V^{\Psi}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \mathcal{R}(s_t, \vec{a}_t) \,\middle|\, s_0, \pi, \Psi \right].$$
(Note that MacDec-POMDPs allow asynchronous decision making, so synchronization issues must be dealt with by the solver as part of the optimization. Some temporal constraints, e.g., timeouts, can be encoded into the termination condition of a macro-action.)
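For concreteness, the sketch below shows one possible way to encode a macro-action $m_i = \langle I_{m_i}, \beta_{m_i}, \pi_{m_i} \rangle$ in Python; the class layout, field names, and the toy navigation example are illustrative assumptions, not the formulation's canonical implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MacroAction:
    """A macro-action m = <I_m, beta_m, pi_m> (illustrative encoding)."""
    name: str
    # Initiation set: may this MA start, given the MA-observation history?
    initiation: Callable[[List[object]], bool]
    # Stochastic termination: probability of terminating in a given state.
    termination_prob: Callable[[dict], float]
    # Option policy: maps a primitive action/observation history to an action.
    option_policy: Callable[[List[object]], str]

    def is_terminal(self, state: dict, rng=random) -> bool:
        return rng.random() < self.termination_prob(state)

# Example: a "go to site" macro-action that can always start, terminates
# (deterministically) once the robot is inside the site, and greedily steps
# toward the site -- purely illustrative dynamics.
go_to_site = MacroAction(
    name="GoToSite1",
    initiation=lambda ma_obs_history: True,
    termination_prob=lambda state: 1.0 if state.get("in_site_1") else 0.0,
    option_policy=lambda history: "move_toward_site_1",
)
```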
II-B Solution Representation
A Finite State Controller (FSC) is a compact way to represent a policy as a mapping from histories to actions. Formally, a stochastic FSC for agent $i$ is defined as a tuple $\Theta_i = \langle \mathcal{Q}_i, \mathcal{A}_i, \mathcal{Z}_i, \mu_i, W_i, \pi_i \rangle$, where $\mathcal{Q}_i$ is the set of nodes (a controller node can be understood as a decision state, i.e., a summary of history; such nodes are commonly used for policy representation when solving infinite-horizon POMDPs [15] and Dec-POMDPs [6]); $\mathcal{A}_i$ and $\mathcal{Z}_i$ are the output and input alphabets (i.e., the macro-action chosen and the observation seen); $W_i$ is the node transition probability, i.e., $W_i(q_i' \mid q_i, z_i)$; $\pi_i^{0}$ is the output probability for the initial node, such that $\pi_i^{0}(a_i \mid q_{i,0})$; $\pi_i$ is the output probability for nodes that associates output symbols with transitions, i.e., $\pi_i(a_i \mid q_i, z_i)$; $\mu_i$ is the initial node distribution $\mu_i(q_{i,0})$. This type of FSC is called a Mealy machine [16], where an agent's local policy for action selection depends on both the current controller node (an abstraction of history) and the immediate observation. By conditioning action selections on immediate observations, a Mealy machine can use this observable information to help ensure a valid macro-action controller is constructed [12].
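For illustration, such a stochastic Mealy controller can be stored as a handful of normalized probability tables. The minimal sketch below follows the definition above (table names mu, W, pi0, pi are ours); the Dirichlet initialization mirrors the random restarts used later in the paper.

```python
import numpy as np

class MealyFSC:
    """Stochastic Mealy finite state controller (illustrative sketch).

    Tables (all rows normalized to sum to one):
      mu[q]         -- initial node distribution
      W[q, z, q']   -- node transition probability p(q' | q, z)
      pi0[q, a]     -- output probability for the initial node, p(a | q0)
      pi[q, z, a]   -- output probability on transitions, p(a | q, z)
    """

    def __init__(self, n_nodes, n_obs, n_actions, rng=None):
        self.rng = rng or np.random.default_rng()
        # Random valid initialization, e.g. from symmetric Dirichlets.
        self.mu = self.rng.dirichlet(np.ones(n_nodes))
        self.W = self.rng.dirichlet(np.ones(n_nodes), size=(n_nodes, n_obs))
        self.pi0 = self.rng.dirichlet(np.ones(n_actions), size=n_nodes)
        self.pi = self.rng.dirichlet(np.ones(n_actions), size=(n_nodes, n_obs))

    def reset(self):
        # Sample the initial node, then an action that depends on it alone.
        self.q = self.rng.choice(len(self.mu), p=self.mu)
        return self.rng.choice(self.pi0.shape[1], p=self.pi0[self.q])

    def step(self, obs):
        # Mealy machine: the action depends on both the current node and
        # the immediate observation; then the node transitions.
        action = self.rng.choice(self.pi.shape[2], p=self.pi[self.q, obs])
        self.q = self.rng.choice(len(self.mu), p=self.W[self.q, obs])
        return action
```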
II-C Policy Learning Through EM
A Dec-POMDP problem can be transformed into an inference problem and then efficiently solved by an EM algorithm. Previous EM methods [17, 18] have achieved success in scaling to larger problems, but these methods require a Dec-POMDP model both to construct a Bayes net and to evaluate policies. When the exact model parameters ($\mathcal{T}$, $\Omega$, and $\mathcal{R}$) are unknown, a Reinforcement Learning (RL) problem must be solved instead. To this end, EM has been adapted to model-free RL settings to optimize FSCs for Dec-POMDPs [19, 20] and MacDec-POMDPs [12].
For both self-containment and ease of analysis of the new algorithm, we first review the policy-based EM algorithm (PoEM) developed for the MacDec-POMDP case [12].
Definition 1
(Global empirical value function) Let $\mathcal{D}^{(K)} = \{(\vec{o}_{0:T_k}^{k}, \vec{a}_{0:T_k}^{k}, r_{0:T_k}^{k})\}_{k=1}^{K}$ be a set of episodes resulting from $n$ agents who choose macro-actions according to $\pi = \otimes_i \pi_i$, a set of arbitrary stochastic policies with $\pi_i(a \mid h) > 0$ for every action $a$ and history $h$. The global empirical value function is defined as
$$\hat{V}(\mathcal{D}^{(K)}; \Theta) \;\overset{\text{def}}{=}\; \sum_{k=1}^{K} \sum_{t=0}^{T_k} \frac{\gamma^{t} r_t^{k} \, p(\vec{a}_{0:t}^{k} \mid \vec{o}_{0:t}^{k}; \Theta)}{K \prod_{\tau=0}^{t} \pi_\tau^{k}} \tag{1}$$
where $\pi_\tau^{k} = \prod_{i=1}^{n} \pi_i(a_{i,\tau}^{k} \mid h_{i,\tau}^{k})$, and $\gamma \in (0,1)$ is the discount.
Definition 1 provides an off-policy learning objective: given data $\mathcal{D}^{(K)}$ generated from a set of behavior policies $\pi$, find a set of parameters $\Theta$ such that $\hat{V}(\mathcal{D}^{(K)}; \Theta)$ is maximized. Here, we assume a factorized policy representation $p(\vec{a}_{0:t} \mid \vec{o}_{0:t}; \Theta) = \prod_{i=1}^{n} p(a_{i,0:t} \mid o_{i,0:t}; \Theta_i)$ to accommodate decentralized policy execution.
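As a concrete reading of Definition 1, the sketch below estimates the empirical value of candidate parameters $\Theta$ from logged episodes by importance-reweighting each discounted reward with the ratio of the candidate policy's action likelihood to the behavior policy's. The episode format and the `policy_action_prob` callback are assumptions for illustration.

```python
def empirical_value(episodes, policy_action_prob, gamma=0.95):
    """Off-policy estimate of V-hat (Definition 1), as a sketch.

    episodes: list of dicts with keys
      'obs'   -- joint observations o_0 .. o_T
      'acts'  -- joint macro-actions a_0 .. a_T
      'rews'  -- rewards r_0 .. r_T
      'behav' -- behavior-policy probabilities pi(a_t | h_t) at each step
    policy_action_prob(obs_prefix, act_prefix) should return the candidate
    policy's joint likelihood p(a_0..a_t | o_0..o_t; Theta).
    """
    K = len(episodes)
    total = 0.0
    for ep in episodes:
        behav_cum = 1.0
        for t, r in enumerate(ep['rews']):
            behav_cum *= ep['behav'][t]  # product of pi_tau up to step t
            num = policy_action_prob(ep['obs'][:t + 1], ep['acts'][:t + 1])
            total += (gamma ** t) * r * num / behav_cum
    return total / K
```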
II-D PoEM
Direct maximization of $\hat{V}(\mathcal{D}^{(K)}; \Theta)$ is difficult; instead, $\hat{V}$ can be augmented with controller node sequences $\vec{z}_{0:t}^{k}$, and a lower bound of the logarithm of (1) (obtained by Jensen's inequality) maximized:
$$\ln \hat{V}(\mathcal{D}^{(K)}; \Theta) \geq \hat{F}(\{q_t^{k}\}, \Theta) \tag{2}$$
$$\overset{\text{def}}{=} \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\vec{z}_{0:t}^{k}} q_t^{k}(\vec{z}_{0:t}^{k}) \ln \frac{\tilde{r}_t^{k} \, p(\vec{a}_{0:t}^{k}, \vec{z}_{0:t}^{k} \mid \vec{o}_{0:t}^{k}; \Theta)}{K \, q_t^{k}(\vec{z}_{0:t}^{k}) \prod_{\tau=0}^{t} \pi_\tau^{k}} \tag{3}$$
where the variational distributions $\{q_t^{k}\}$ satisfy the normalization constraint $\sum_{\vec{z}_{0:t}^{k}} q_t^{k}(\vec{z}_{0:t}^{k}) = 1$, with $\hat{\Theta}$ the most recent estimate of $\Theta$, and $\tilde{r}_t^{k} = \gamma^{t}(r_t^{k} - r_{\min})$ are reweighted rewards with $r_{\min}$ denoting the minimum reward, leading to the following constrained optimization problem
$$\max_{\{q_t^{k}\}, \Theta} \; \hat{F}(\{q_t^{k}\}, \Theta) \tag{4}$$
$$\text{s.t.} \;\; \sum_{\vec{z}_{0:t}^{k}} q_t^{k}(\vec{z}_{0:t}^{k}) = 1, \;\; \forall k, t, \tag{5}$$
$$\textstyle\sum_{q'} W_i(q' \mid q, z) = 1,\; \sum_{a} \pi_i(a \mid q, z) = 1,\; \sum_{q} \mu_i(q) = 1, \;\; \forall i. \tag{6}$$
Based on the problem formulation (4), an EM algorithm can be derived to learn the macro-action FSCs. Algorithmically, the main steps involve alternating between computing the lower bound of the log empirical value function (2) (E-step) and parameter estimation (M-step). This optimization algorithm is called policy-based expectation maximization (PoEM); the reader is referred to [12] for details.
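Schematically, the alternation looks as follows. Here `e_step` and `m_step` are assumed callables (the actual update equations are derived in [12]), so this is a sketch of the control flow rather than the authors' implementation.

```python
def poem(episodes, theta, e_step, m_step, max_iters=200, tol=1e-6):
    """PoEM skeleton (illustrative): alternate E- and M-steps until the
    lower bound on the log empirical value stops improving."""
    prev_bound = float('-inf')
    bound = prev_bound
    for _ in range(max_iters):
        # E-step: posteriors over controller-node sequences under theta,
        # weighted by the reweighted rewards; tightens the bound (2)-(3).
        q, bound = e_step(episodes, theta)
        # M-step: re-estimate the FSC tables (mu, W, pi) from q.
        theta = m_step(episodes, q)
        if bound - prev_bound < tol:  # the bound increases monotonically
            break
        prev_bound = bound
    return theta, bound
```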
III Related Work
The use of multi-robot teams has recently become viable for large-scale operations due to the ever-decreasing cost and increasing accessibility of robotics platforms, allowing robots to replace humans in team-based decision-making settings including, but not limited to, search and rescue [1]. Use of multiple robots allows dissemination of heterogeneous capabilities across the team, increasing fault tolerance and decreasing the risk associated with losing or damaging a single all-encompassing vehicle [21].
The large body of work on multi-robot task allocation (MRTA) comes in centralized, distributed/hybrid, and decentralized flavors. Centralized architectures [22, 23] rely on full information sharing between all robots. However, in settings such as SAR, communication infrastructure may be unavailable, requiring the use of alternative frameworks. Distributed frameworks, such as those used in auction-based algorithms [24], use local communication for consensus on robot policies. This enables robustness against communication failures in hazardous, real-world settings. However, in settings such as SAR, it can be unreasonable or impossible for robots to communicate with one another during task execution. Decentralized frameworks, such as Dec-POMDPs [13] and the approach proposed in this paper, target this setting, allowing a spectrum of policies ranging from communication-free to explicitly communication-enabled. The flexibility offered by decentralized planners makes them suitable candidates for multi-robot operation in hazardous or uncertain domains, such as SAR.
Finally, note that unlike the majority of the existing MRTA literature, the work presented here exploits the strengths of the MacDec-POMDP framework [8] to develop a unifying framework which considers sources of uncertainty, task-level learning and planning, temporal constraints, and nondeterministic action durations.
IV Iterative Sampling-Based Expectation Maximization Algorithm
The PoEM algorithm [12] is the first attempt to address policy learning for MacDec-POMDPs with batch data. However, one of the biggest challenges for PoEM is that it only guarantees convergence to a local solution, a problem often encountered when optimizing mixture models, such as the empirical value function (1). (Note that the empirical value function (1) can be interpreted as a likelihood function for FSCs with the number of mixture components equal to the total number of sub-episodes [25].) Moreover, PoEM is a deterministic algorithm for approximate optimization, meaning that it converges to the same stationary point if initialized repeatedly from the same starting value. Hence, PoEM can be prone to poor local solutions for more complicated real-world problems (as will be shown in a later numerical experiment). To address these issues, we propose a concurrent (multi-threaded) randomized method called iterative sampling-based Expectation Maximization (iSEM). The iSEM algorithm is designed to run multiple instances of PoEM with randomly initialized FSC parameters in parallel to minimize the probability of converging to a suboptimal solution due to poor initialization. Furthermore, to exploit the information and computational effort spent on runs of PoEM that are clearly converging to poor values, iSEM allows resampling of parameters once convergence of a thread is detected, increasing the chance of overcoming poor local optima. Because of the resampling step, which involves random reinitialization of threads converging to poor local values, iSEM can be viewed as a randomized version of the PoEM algorithm. This is essential for convergence to well-performing policies, since it is widely known that global optimization paradigms are often based on the principle of stochasticity [26].
iSEM is outlined in Algorithm 1. Domain experience data is first partitioned into a training set and an evaluation set. iSEM takes the partitioned data, the number of Monte Carlo samples (threads), and parameters controlling convergence as input, and maintains two complementary sets of thread indices: one recording the threads whose evaluation values are lower than the best value, and the other recording the remaining threads (initialized as empty). iSEM iteratively applies four steps: 1) update the index sets (line 3); 2) for the threads whose values are below the best, randomly initialize FSC parameters by drawing samples from Dirichlet distributions, run the PoEM algorithm [12] until convergence, and evaluate the resulting policy (lines 5-7); 3) update the best policy and its evaluation value obtained in the current iteration (line 8); 4) update the index sets by recording the threads whose converged policy values are close to the best policy (lines 10-14). Critically, the final step enables distinguishing threads that clearly converge to poor local solutions from those reaching "good" local solutions. In the subsequent iteration, threads with poor local solutions are reinitialized and re-executed until the policy values from all threads are close to the best solution learned so far. The iSEM algorithm is guaranteed to monotonically increase the lower bound of the empirical value function over successive iterations, and the convergence property is summarized by the following theorem.
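The resampling logic of Algorithm 1 can be summarized as follows. In this sketch, `random_fsc`, `poem`, and `evaluate` are assumed helper callables (e.g., the PoEM skeleton above wrapped to take data and an initialization), and `epsilon` is the tolerance that decides which threads count as clearly converging to poor values.

```python
def isem(train, valid, n_threads, random_fsc, poem, evaluate,
         epsilon=0.05, max_rounds=20, seed=0):
    """iSEM outer loop (illustrative sketch): run PoEM once per thread,
    keep threads near the best evaluation value, resample the rest."""
    import numpy as np
    rng = np.random.default_rng(seed)
    values = [float('-inf')] * n_threads
    policies = [None] * n_threads
    stale = list(range(n_threads))      # initially, every thread must run
    best_val, best_policy = float('-inf'), None
    for _ in range(max_rounds):
        for j in stale:                 # (re)sample and re-run poor threads
            theta0 = random_fsc(rng)    # fresh random FSC parameters
            policies[j], _ = poem(train, theta0)
            values[j] = evaluate(valid, policies[j])
        i = max(range(n_threads), key=lambda j: values[j])
        if values[i] > best_val:        # keep the best policy found so far
            best_val, best_policy = values[i], policies[i]
        # Threads clearly below the best get resampled in the next round.
        stale = [j for j in range(n_threads) if values[j] < best_val - epsilon]
        if not stale:
            break
    return best_policy, best_val
```

Threads are independent between resampling rounds, so the inner loop parallelizes directly (e.g., with a process pool).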
Theorem 2
Algorithm 1 monotonically increases the best evaluation value $\hat{V}(\Theta^{best})$, until convergence to a maximum.
Proof: Assume that $\Theta_l^{best}$ is a policy with the highest evaluation value among the policies learned by all the threads at iteration $l$, and the set $\mathcal{C}_l$ records the thread indices with corresponding policy values close to $\hat{V}(\Theta_l^{best})$. In iteration $l$, the complementary set $\bar{\mathcal{C}}_l$ contains the thread indices whose corresponding policy values satisfy $\hat{V}(\Theta_l^{j}) < \hat{V}(\Theta_l^{best}) - \epsilon$. Starting from $l = 1$, we have $\hat{V}(\Theta_1^{best}) \geq \hat{V}(\Theta_1^{j})$ for all threads $j$. In the next iteration (i.e., $l = 2$), steps 5-7 allow the threads in $\bar{\mathcal{C}}_1$ to rerun with randomly reinitialized parameters. According to step 9 (Algorithm 1), we can obtain $\hat{V}(\Theta_2^{best}) \geq \hat{V}(\Theta_1^{best})$. Following the same analysis for $l = 2, 3, \ldots$, we can obtain $\hat{V}(\Theta_{l+1}^{best}) \geq \hat{V}(\Theta_l^{best})$. Since $\{\hat{V}(\Theta_l^{best})\}_{l \geq 1}$ is a monotone sequence and it is upper bounded by the optimal value, according to the monotone convergence theorem, it has a finite limit, which completes the proof.
Note that the convergence of iSEM differs from that of PoEM in the sense that iSEM updates a global parameter estimate based on feedback from several local optima (obtained from random initializations). It is also worth mentioning that with a finite number of threads, iSEM might still converge to a local maximum. However, we can show that, on average, iSEM has a higher probability of converging to better solutions than PoEM. Moreover, the iSEM algorithm can be considered a special case of evolutionary programming (EP) [27], which maintains a population of solutions (i.e., the set of policy parameters across threads). Yet, there are obvious differences between iSEM and EP. Notably, instead of mutating existing solutions, iSEM resamples completely new initializations for parameters and optimizes them using PoEM. In addition, iSEM is highly parallelizable due to its use of concurrent threads.
V Experiments
This section presents simulation and hardware experiments evaluating the proposed policy learning algorithm. First, a simulator for a large problem motivated by SAR is introduced. Then, the performance of iSEM is compared to previous work on the simulated SAR problem. Finally, a multi-robot hardware implementation is presented to demonstrate a working real-world system.
V-A Search and Rescue Problem
The SAR problem involves a heterogeneous set of robots searching for victims and rescuing survivors after a disaster (e.g., bringing them to a location where medical attention can be provided). Each robot has to make decisions using information gathered from observations and limited communications with teammates. Robots must decide how to explore the environment and how to prioritize rescue operations for the various victims discovered.
The scenario begins after a natural disaster strikes the simulated world. The search and rescue domain considered is a 20 × 10 unit grid with designated sites: 1 muster site and 5 victim sites. All robots are initialized at the muster site. Victim sites are randomly populated with victims (6 victims total). Each victim has a randomly-initialized health state. While the locations of the sites are known, the number of victims and their health at each site are unknown to the robots. The maximum victim capacity of each site also varies based on the site size. Each victim's health degrades with time.
An unmanned aerial vehicle (UAV) surveys the disaster from above. A set of 3 unmanned ground vehicles (UGVs) can search the space or retrieve victims and deliver them to the muster site, where medical attention is provided. The objective of the team is to maximize the number of victims returned to the muster site while they are still alive. This is a challenging domain due to its sequential decision-making nature, large size (4 agents), and both transition and observation process uncertainty, including stochasticity in communication. Moreover, as communication only happens within a limited radius, synchronization and sharing of global information are prohibited, making this a highly realistic and challenging domain.
V-B Simulator Description
All simulation is conducted within the Robot Operating System (ROS) [28]. The simulator executes a time-stepped model of the scenario, where scenario parameters define the map of the world, the number of each type of robot, and the locations and initial states of victims.
Each robot's macro-controller policy is executed by a lower-level controller which checks the initiation and termination conditions for the macro-action and generates sequences of primitive actions.
V-B.1 Primitive Actions
The simulator models primitive actions, each of which takes one timestep to execute. The primitive actions for the robots include: (a) move vehicle, (b) pick up victim (UGVs only), (c) drop off victim (UGVs only), and (d) do nothing. Observations and communication occur automatically whenever possible and do not take any additional time to execute.
Macro-action policies, built from these primitive actions, may take an arbitrary amount of time in multiples of the simulator timestep. Macro-action durations are also nondeterministic, as they are a function of the scenario parameters, world state, and inter-robot interactions (e.g., collision avoidance).
V-B.2 The World
While the underlying robotics simulators utilized are three-dimensional, the world representation is two-dimensional. This allows increased computational efficiency while not detracting from policy fidelity, as the sites for ground vehicles are ultimately located on a 2D plane. The world is modeled as a 2D plane divided into an evenly-spaced grid within a rectangular boundary of arbitrary size. Each rescue site is a discrete rectangle of grid spaces of arbitrary size within the world.
Some number of victims are initially located in each rescue site. Victim health is represented as a value from 0 to 1, with 1 being perfectly healthy and 0 being deceased. Each victim may start at any level of health, and its health degrades linearly with time. If a victim is brought to the muster location, its health goes to 1 and no longer degrades. One victim at a time may be transported by a UGV to the muster, although this can be generalized to larger settings by allowing the vehicle to carry multiple victims simultaneously.
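For intuition, the health dynamics described above reduce to a clamped linear decay that resets on rescue; a minimal sketch follows, where the decay rate is an assumed placeholder, not a value from the paper.

```python
def step_health(health, at_muster, decay_rate=0.01):
    """Advance one victim's health by one timestep (illustrative model).

    health in [0, 1]: 1 = perfectly healthy, 0 = deceased.
    A victim delivered to the muster recovers fully and stops degrading.
    """
    if at_muster:
        return 1.0
    return max(0.0, health - decay_rate)  # linear degradation, clamped at 0
```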
V-B.3 Movement
Simulated dynamical models are used to represent the motion of the air and ground vehicles within ROS. The vehicles can move within the rectangular boundaries of the world defined in the scenario.
UGV motion is modeled using a Dubins car model. Real-time multi-robot collision avoidance is conducted using the reciprocal velocity obstacles (RVO) formulation [29]. State estimates are obtained using a motion capture system and processed within RVO to compute safe velocity trajectories for the vehicles.
UAV dynamics are modeled using a linearization of a quadrotor around the hover state, as detailed in [30]. Since the UAV operates at a higher altitude than the UGVs and obstacles, there are no restrictions on the air vehicle's movement. These dynamics correspond to the transition model specified in the (Mac)Dec-POMDP frameworks discussed in Section II-A.
V-B.4 Communication
Communication is range-limited. When robots are within range (which is larger for UAV-UGV communication than for UGV-UGV communication), they will automatically share their observations with two-way communication. Communication is imperfect and has a 0.05 probability of failing to occur even when robots are in range. For the scenarios used to generate the results in this study, a UGV can communicate its observations with any other UGV within 3 grid spaces in any direction; the UAV can communicate with any UGV within 6 grid spaces in any direction.
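A minimal sketch of this communication model, using the grid-space ranges and the 0.05 failure probability given above; reading "within k grid spaces in any direction" as a Chebyshev-distance check is our assumption.

```python
import random

UGV_UGV_RANGE = 3   # grid spaces, any direction
UAV_UGV_RANGE = 6
FAILURE_PROB = 0.05

def try_communicate(robot_a, robot_b, rng=random):
    """Return True if two robots exchange observations this step."""
    limit = UAV_UGV_RANGE if 'uav' in (robot_a['type'], robot_b['type']) \
        else UGV_UGV_RANGE
    dx = abs(robot_a['x'] - robot_b['x'])
    dy = abs(robot_a['y'] - robot_b['y'])
    in_range = max(dx, dy) <= limit   # "within k spaces in any direction"
    return in_range and rng.random() >= FAILURE_PROB
```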
V-C MacDec-POMDP Representation
We now describe the MacDec-POMDP representation that is used for learning. Note that the representation in Section V-B is not observable to the robots and is only used for constructing the simulator.
V-C.1 Rewards
The joint reward comprises a reward for each victim brought back to the muster alive and a penalty for each victim who dies.
V-C.2 Observations
In the SAR domain, a UAV can observe victim locations when over a rescue site. However, victim health status is not observable from the air. A UGV that is in a rescue site can observe all victims (location and health status) within that site. Robots are always able to observe their own location and whether they are holding a victim at a given moment.
The observation vector on which the macro-controller makes decisions is a subset of the raw observations each robot may have accumulated through the execution of the prior macro-action. The robots report the state of their current location and of one other location (which could be directly observed or received via communication while completing the macro-action). The second location reported is the one with the most urgent state among the most recent new observations. If there are no new observations other than the robot's own location, the second location observation is equivalent to the self location. The observation vector is as follows:
$$\vec{o} = [\,\text{self state},\ \text{self location},\ \text{location state},\ \text{second location},\ \text{second location state}\,] \tag{7}$$
where self state ∈ {is/is not holding victim}, self location ∈ {site 1, site 2, …, site s}, location state ∈ {no victims needing help, victims needing help (not critical), victims needing help (critical)}, second location ∈ {site 1, site 2, …, site s}, and second location state ∈ {no victims needing help, victims needing help (not critical), victims needing help (critical)}. The number of possible observation vectors makes the observation space substantially larger than in previous macro-action based domains [10, 8].
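A sketch of how a robot might assemble the five-component vector in (7) from the raw observations accumulated during the prior macro-action; the field names and the (urgency, recency) ordering are illustrative assumptions.

```python
CRITICAL, NOT_CRITICAL, NO_VICTIMS = 2, 1, 0   # urgency ordering (assumed)

def build_observation(self_state, self_location, raw_obs):
    """Assemble the macro-controller observation vector of Eq. (7).

    raw_obs: dict mapping site -> (location_state, timestamp), accumulated
    during the previous macro-action via direct observation or comms.
    """
    own = raw_obs.get(self_location, (NO_VICTIMS, 0))
    others = {s: v for s, v in raw_obs.items() if s != self_location}
    if others:
        # Most urgent state, breaking ties by most recent observation.
        second_loc, (second_state, _) = max(
            others.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    else:
        # No new observations elsewhere: report the robot's own location.
        second_loc, second_state = self_location, own[0]
    return (self_state, self_location, own[0], second_loc, second_state)
```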
V-C.3 Macro-Actions
The macro-actions utilized in this problem are as follows:

- Go to Muster (available to both UAV and UGV): The robot attempts to go to the muster point from anywhere else, but only if it is holding a live victim. If a victim is onboard, the victim will always disembark at the muster.

- Pick up Victim (available only to UGV): The robot attempts to go to a victim's location from a starting point within the site. Terminates when the robot reaches the victim; it may also terminate if there is no longer a victim needing help at the site (i.e., another robot picked the victim up first or the victim died). If the victim and robot are located in the same grid cell, the victim can be "picked up".

- Go to Site (available to both UAV and UGV): The robot goes to a specified disaster site. Terminates when the robot is in the site. The robot can receive observations of the victims at the site.
V-D Simulations and Numerical Results
The SAR domain extends previous benchmarks for MacDec-POMDPs both in terms of the number of robots and the number of states/actions/observations. Notably, due to the very large observation space cardinality of the SAR domain, it is difficult to generate an optimal solution with existing solvers such as [10, 8] in a reasonable amount of time. Hence, in the absence of a known global optimum, the RL algorithms (iSEM and PoEM) are compared over the same datasets. The dataset is collected through the simulator by using a behavior policy combining a hand-coded expert policy (the same used in [12]) and a random policy, with a mixing parameter denoting the percentage of the expert policy.
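A minimal sketch of such a mixed behavior policy; the function names and the mixing parameter value are assumptions for illustration.

```python
import random

def behavior_policy(history, expert_policy, valid_macro_actions,
                    expert_fraction=0.8, rng=random):
    """Choose a macro-action for data collection: with probability
    `expert_fraction`, follow the hand-coded expert; otherwise act
    uniformly at random (illustrative sketch)."""
    if rng.random() < expert_fraction:
        return expert_policy(history)
    return rng.choice(valid_macro_actions)
```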
To compare iSEM and PoEM on the SAR domain, experiments are conducted with varying amounts of training data and varying controller sizes (the smallest controller size corresponds to reactive policies, based only on current observations). Corresponding test (holdout) set results are plotted in Figure 1. Several conclusions can be drawn from the results. First, as the amount of training data increases, the cumulative reward increases for both PoEM and iSEM (under the same controller size, as shown in Fig. 1(b)). Second, with the same amount of data, iSEM achieves better performance than PoEM, which validates that iSEM is better at overcoming the local optimality limitation suffered by PoEM. In addition, as the number of threads increases, iSEM converges to higher average values and smaller variance (as indicated by the error bars, compared to PoEM), according to Fig. 1(c), which empirically justifies the discussion under Theorem 2. Moreover, as shown in Fig. 1(d), under all three controller-size settings, the FSCs learned by iSEM render higher value than the PoEM policy. As the controller size increases, the difference between PoEM and iSEM (with a fixed number of threads) tends to decrease, which indicates that the number of threads should be increased as iSEM explores higher-dimensional parameter spaces. Finally, even in cases where the mean of iSEM is only slightly higher than PoEM's, the variance of iSEM is consistently lower than PoEM's, a critical performance difference given the uncertainty involved in the underlying domain tested.

RVIZ [31] was used in conjunction with ROS to visualize the simulations. Fig. 1(a) shows the start of one trial, with the different colored circles being sites, the stacked cubes positioned at sites being victims (with colors indicating their health values), and the 4 green cylinders indicating the 3 UGVs and the UAV. The sites are as follows (from furthest to closest): site 1 (red circle), site 2 (green), site 3 (sky blue), site 4 (pink), site 5 (turquoise), site 6 (orange). Note that the normal gridworld model used in POMDP formulations usually assumes discrete states and discrete primitive actions, whereas the simulation models are based on macro-actions comprising low-level controllers that can deal with both discrete and continuous primitive actions and states.
V-E Hardware Implementation
While the simulation results validate that the proposed MacDec-POMDP search algorithm achieves better performance than state-of-the-art solvers, we also verify the approach on a SAR mission with real robots. This allows further learning from real-world experiences. A video demo is available online (https://youtu.be/B3b60VqWMIE). Learning from simulation allows robots to operate in a reasonable (safe) way, whereas real-robot experiments can potentially provide "real-world" experiences that are not fully captured by the simulators, hence allowing the robots to improve their baseline policy (learned from simulators). The video demonstrates this potential, assuming the training data is collected from the "real world".
A DJI F330 quadrotor is used as the UAV for hardware experiments, with a custom autopilot for low-level control and an NVIDIA Jetson TX1 for high-level planning and task allocation (Fig. 2(a)). The UGVs are Duckiebots [32], custom-made ground robots with an onboard Raspberry Pi 2 for computation (Fig. 2(b)). Experiments were conducted in a 40 ft. × 20 ft. flight space with a ceiling-mounted projection system [33] used to visualize site locations, obstacles, and victims. As discussed earlier, limited communication occurs between robots, with a motion capture system used to ensure adherence to maximal inter-robot communication distances.
The hardware experiments demonstrated that the policy generated from iSEM (with a fixed setting of training set size, controller size, and number of threads) was able to save victims consistently, despite the robots having to adhere to collision avoidance constraints. In some instances, the robots were not able to save all 6 victims. However, in these scenarios, only 1 victim was lost, with the loss caused by an extremely low starting health for multiple victims. In such cases, an early victim death would occur before any robot could respond.
Fig. 4 shows the progression of one hardware trial. Sites are randomly populated with 6 victims total. All robots start at the muster site (Fig. 4(a)). As the UGVs navigate towards sites (dictated by their policy), they simultaneously begin observing their surroundings. When they do, the outer ring surrounding them turns the color of the latest victim observed (Fig. 4(b)). A UGV can only pick up a new victim if it is not currently carrying a victim. Its inner circle then indicates the health of the victim it is carrying, while its outer ring indicates the health of a randomly-selected victim still present at the site (if any). Fig. 4(c) illustrates a situation where no more victims are present at site 6, causing the UGV's outer ring to turn black (no victims to save at the latest encountered site). Note that an observed deceased victim also falls under this category. After a UGV picks up a victim, it drops it off at the muster (Fig. 4(d)). The victim returns to full health, indicating a successful rescue. When the UAV visits a site, its outer ring also turns the color of the victim observed at the site (Fig. 4(e)). The UAV has no inner circle because it cannot pick up victims. Figs. 4(f) and 4(g) show two more instances of a UGV picking up a victim from site 5. As mentioned before, a deceased victim results in an observation color of black in Fig. 4(g). Fig. 4(h) shows the end of the hardware trial, where all healthy victims have been rescued.
VI Conclusion
This paper presents iSEM, an efficient algorithm which improves upon state-of-the-art learning-based methods for coordinating multiple robots operating in partially observable environments. iSEM enables cooperative sequential decision making under uncertainty by modeling the problem as a MacDec-POMDP and using iterative sampling-based Expectation Maximization trials to automatically learn macro-action FSCs. The proposed algorithm is demonstrated to address the local convergence issues of the state-of-the-art macro-action based reinforcement learning approach, PoEM. Moreover, simulation results showed that iSEM is able to generate higher-quality solutions with fewer demonstrations than PoEM. The iSEM policy is then applied to a hardware-based multi-robot search and rescue domain, where we demonstrate effective control of a team of distributed robots cooperating in this partially observable stochastic environment. In the future, we will make our demonstration even closer to real-world scenarios by modeling observations and communications as actions and assigning them costs. We will also experiment with methods other than random sampling, such as active sampling, for the resampling step in iSEM, to accommodate restrictions on computational resources (i.e., the number of threads).
References
[1] S. Grayson, "Search & Rescue using Multi-Robot Systems," http://www.maths.tcd.ie/~graysons/documents/COMP47130_SurveyPaper.pdf, 2014.
 [2] K. Dresner and P. Stone, “Multiagent traffic management: Opportunities for multiagent learning,” in Learning and Adaption in MultiAgent Systems. Springer, 2006, pp. 129–138.

[3] D. Bernstein, S. Zilberstein, R. Washington, and J. Bresina, "Planetary rover control as a Markov decision process," in Sixth Int'l Symposium on Artificial Intelligence, Robotics, and Automation in Space, 2001.
 [4] K. Jolly, K. Ravindran, R. Vijayakumar, and R. S. Kumar, "Intelligent decision making in multi-agent robot soccer system through compounded artificial neural networks," Robotics and Autonomous Systems, vol. 55, no. 7, pp. 589–596, 2007.
 [5] M. Gath, Optimizing transport logistics processes with multiagent planning and control. Springer, 2016.
 [6] F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[7] C. Amato, G. D. Konidaris, and L. P. Kaelbling, "Planning with macro-actions in decentralized POMDPs," in Proc. of the Int'l Conf. on Autonomous Agents and Multiagent Systems (AAMAS-14), 2014.
 [8] C. Amato, G. Konidaris, G. Cruz, C. Maynor, J. How, and L. Kaelbling, "Planning for decentralized control of multiple robots under uncertainty," in 2015 IEEE Int'l Conf. on Robotics and Automation (ICRA), 2015.
 [9] C. Amato, G. Konidaris, A. Anders, G. Cruz, J. P. How, and L. P. Kaelbling, "Policy search for multi-robot coordination under uncertainty," in Proc. of the 2015 Robotics: Science and Systems Conference (RSS-15), 2015.
 [10] S. Omidshafiei, A.-A. Agha-mohammadi, C. Amato, and J. P. How, "Decentralized control of partially observable Markov decision processes using belief space macro-actions," in 2015 IEEE International Conference on Robotics and Automation (ICRA).
 [11] S. Omidshafiei, A.-A. Agha-mohammadi, C. Amato, S.-Y. Liu, J. P. How, and J. Vian, "Graph-based cross entropy method for solving multi-robot decentralized POMDPs," in 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5395–5402.
[12] M. Liu, C. Amato, E. Anesta, J. Griffith, and J. How, "Learning for decentralized control of multiagent systems in large, partially-observable stochastic environments," in AAAI Conf. on Artificial Intelligence, 2016.
 [13] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of Markov decision processes,” Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, 2002.
[14] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1, pp. 181–211, 1999.
 [15] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1, pp. 99–134, 1998.
[16] C. Amato, B. Bonet, and S. Zilberstein, "Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs," in Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI-10), 2010.
 [17] A. Kumar, S. Zilberstein, and M. Toussaint, “Probabilistic inference techniques for scalable multiagent decision making,” Journal of Artificial Intelligence Research, vol. 53, no. 1, pp. 223–270, 2015.
[18] Z. Song, X. Liao, and L. Carin, "Solving DEC-POMDPs by expectation maximization of value functions," 2016.
 [19] F. Wu, S. Zilberstein, and N. R. Jennings, "Monte-Carlo expectation maximization for decentralized POMDPs," in Proc. of the 23rd Int'l Joint Conf. on Artificial Intelligence (IJCAI-13), 2013.
 [20] M. Liu, C. Amato, X. Liao, J. P. How, and L. Carin, "Stick-Breaking Policy Learning in Dec-POMDPs," in Proc. of the 24th Int'l Joint Conf. on Artificial Intelligence (IJCAI-15), 2015.
 [21] C. Y. Wong, G. Seet, and S. K. Sim, “Multiplerobot systems for USAR: key design attributes and deployment issues,” International Journal of Advanced Robotic Systems, vol. 8, no. 1, pp. 85–101, 2011.
 [22] Y. Jin, A. A. Minai, and M. M. Polycarpou, “Cooperative realtime search and task allocation in UAV teams,” in Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, vol. 1, 2003, pp. 7–12.
 [23] D. Turra, L. Pollini, and M. Innocenti, “Fast unmanned vehicles task allocation with moving targets,” in Decision and Control, 2004. CDC. 43rd IEEE Conference on, vol. 4, 2004, pp. 4280–4285.
 [24] H.L. Choi, L. Brunet, and J. P. How, “Consensusbased decentralized auctions for robust task allocation,” IEEE transactions on robotics, vol. 25, no. 4, pp. 912–926, 2009.
 [25] A. Kumar and S. Zilberstein, “Anytime planning for decentralized POMDPs using expectation maximization,” in Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence (UAI10), 2010.
 [26] R. Horst and P. M. Pardalos, Handbook of global optimization. Springer Science & Business Media, 2013, vol. 2.
[27] D. Simon, Evolutionary Optimization Algorithms: Biologically-Inspired and Population-Based Approaches to Computer Intelligence. Hoboken, NJ: Wiley, 2013.
[28] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source robot operating system," in ICRA Workshop on Open Source Software, 2009.
 [29] J. Van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obstacles for realtime multiagent navigation,” in 2008 IEEE Int’l Conf. on Robotics and Automation (ICRA), pp. 1928–1935.
 [30] D. Mellinger, N. Michael, and V. Kumar, “Trajectory generation and control for precise aggressive maneuvers with quadrotors,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 664–674, 2012.
[31] D. Gossow and W. Woodall. (2016, Nov.) RVIZ. http://wiki.ros.org/rviz
 [32] L. Paull, J. Tani, H. Ahn, J. AlonsoMora, L. Carlone, M. Cap, Y. F. Chen, C. Choi, J. Dusek, Y. Fang, et al., “Duckietown: an open, inexpensive and flexible platform for autonomy education and research,” in 2017 IEEE Int’l Conf. on Robotics and Automation (ICRA).
 [33] S. Omidshafiei, A.A. AghaMohammadi, Y. F. Chen, N. K. Ure, S.Y. Liu, B. T. Lopez, R. Surati, J. P. How, and J. Vian, “Measurable augmented reality for prototyping cyberphysical systems: A robotics platform to aid the hardware prototyping and performance testing of algorithms,” IEEE Control Systems, vol. 36, no. 6, pp. 65–87, 2016.