A multi-robot system (MRS) is a collection of coordinated robots. Compared with a single-robot system, an MRS often has a larger action scope and can handle more complex tasks involving multiple targets, large areas, and complex mission procedures. For example, teams of multiple unmanned aerial vehicles (UAVs) and multiple unmanned ground vehicles (UGVs) have been successfully applied to disaster relief, explosive ordnance disposal, structural inspection, reconnaissance, cargo freight, hostage search, and many other scenarios.
MRS deployments usually rely on real-time human guidance. However, the complexity of these tasks imposes a heavy workload on the human operator, leading to delayed or unsafe guidance for the robots. Unexpected emergencies that require rapid adjustments of the MRS inevitably happen during missions and exacerbate this problem. The result is performance degradation, physical damage to the robot team, and even threats to human safety [8, 12, 18]. Therefore, to ensure task success and the safety of both humans and robots, an emergency reaction mechanism is highly desirable for real-world MRS deployments.
Due to the dynamic and uncertain nature of emergencies and the complexity of MRS control, it is challenging to ensure the safety of an MRS in real-world deployments. Traditional autonomous MRS mainly rely on centralized control to plan paths [14, 20]. However, a centralized emergency reaction must pass through the central decision-making node, which limits reaction speed and creates a single point of failure that is fragile during an emergency. Simple decentralized control, on the other hand, suffers from the opposite problem: weak coordination among robots and globally suboptimal task execution. A drawback shared by centralized and decentralized methods is that they cannot interpret the physical properties of an emergency, such as its type and degree, which decreases overall task performance. In addition, traditional planning methods lack robustness against unexpected emergencies that the robots have not sufficiently explored.
To address these issues, we develop the bio-inspired Collective Conditioned Reflex (CCR) algorithm for fast emergency reactions. CCR is inspired by the collective behaviors of animal groups responding to predators and obstacles. When a school of fish encounters a predator such as a shark, the closest fish quickly react and escape together along safe directions while sharing emergency information with more distant fish, which adjust their motion so that the whole school avoids the predator [9, 26, 5, 6]. This local-global collective emergency reaction minimizes losses because the individuals in the most urgent danger react first, and it reduces response time by relying on local analysis rather than a global one. This collective intelligence effectively helps an animal group avoid danger. The mechanism is similar to spinal reflexes in human muscles, which are fast responses to stimulation that do not pass through the central decision-maker, the brain. These collective behaviors let the group react to dangers preemptively and rapidly, providing safety guidance at an early stage for the whole school. Such observations inspire us to apply a similar approach to training multi-robot systems (Fig. 1). In particular, we encourage pioneer robots that encounter danger to collaborate locally to avoid it without going through a centralized search for a globally optimal strategy; meanwhile, the pioneer robots share danger information with the other robots to guide their motion, so that the whole team avoids the emergency.
The contributions of this study are threefold:
A novel bio-inspired emergency reaction model is developed by mimicking animals’ emergency reaction behavior. The module can be integrated with any existing cooperative multi-agent reinforcement learning algorithm to improve the emergency reaction capabilities of robot teams.
An environment-adaptive safe learning framework is developed by considering safety assessment and safety adjustment. This framework provides a pipeline for customizing the robot teams’ trade-offs between safety and efficiency when handling different levels of dangers.
Emergencies typically encountered by multi-UGV and multi-UAV systems are summarized and modeled as three scenarios: unexpected obstacle, turbulence, and strong wind. The experimental results reveal that, with a small sacrifice in efficiency, CCR significantly reduces risky robot behavior and substantially improves MRS resilience to common unexpected dangers.
II Related Work
Safe Multi-Robot System. Because robot teams are attractive for performing large-scale and complex tasks, their safety has been actively researched in recent years. One popular way to achieve safety was to eliminate dangerous behaviors among robots and between robots and the environment. This approach formulated safety as a constraint satisfaction problem and offered a theoretical guarantee of a collision-free system [27, 2]. A more relaxed approach tolerated unsafe behaviors and robot failures [23, 21, 22], which made it more robust in complex environments. However, both approaches are based on an explicit mathematical model of the emergency situation, which assumes full knowledge of its type, size, location, and intensity. In contrast, the CCR algorithm introduced in this paper can adapt to emergency situations with high randomness and poor observability.
Safe Reinforcement Learning for MRS.
There has been an increasing effort to incorporate reinforcement learning (RL) into safe multi-robot systems. A carefully designed reward function was usually necessary for such methods to work. Existing related works fall into two paradigms: communicative emergency reflex and non-communicative emergency reflex. In one work, a multi-scenario multi-stage training framework was developed to train a fully decentralized sensor-level collision avoidance policy for MRS. To improve system coordination without adding more communication assumptions, Chen et al. proposed using value networks to directly generate trajectories, which showed strong performance for non-communicating MRS. Although these previous works produced safer robot behaviors, important real-world metrics such as minimum safety distance and group response time were not considered. Consequently, these algorithms generalized poorly and were difficult to apply to real-world robots. In contrast, CCR offers a tunable parameter that controls these properties by adjusting the trade-off between efficiency and safety, which allows more flexibility when applying it to real-world systems.
Emergent Behavior Development. Emergent behaviors give an MRS better adaptability to sudden disruptions and damage. In one study, the authors applied Monte Carlo methods to enable an MRS to develop emergent behaviors, such as attaching to other units in order to move. However, such behaviors generalized poorly to unseen environments because Monte Carlo methods are inherently planning rather than learning methods, which resulted in weak adaptation to unstructured environments and a limited application scope. CCR differs in that it uses a novel muscle conditioned reflex mechanism to help an MRS assess the scope and magnitude of an emergency; based on this assessment, the MRS initiates complex emergent behaviors such as evasive maneuvers in time-varying airflow, thereby adapting to more dynamic environments. In another study, the authors developed a competitive multi-robot environment and trained the robots with Proximal Policy Optimization (PPO). The robots were found to naturally develop emergent heterogeneous offensive and defensive behaviors. However, the emergent behaviors developed in that work only concerned the global space in which all robots participated, which hindered the development of small, local-scale emergent behaviors. Our work focuses on developing emergent behaviors at both the global and the local scale when robots are exposed to dangers, allowing for a faster response.
III Collective Conditioned Reflex (CCR)
Ensuring safety and efficiency during emergency reactions in multi-robot systems has been an open research topic. Existing approaches tend to focus only on control commands at the physical level while ignoring heuristic awareness of the danger that the robot itself can build. To help build this recognition and use it to achieve better performance, we propose Collective Conditioned Reflex (CCR), a novel bio-inspired emergency reaction mechanism that can be attached to most multi-agent reinforcement learning algorithms. The algorithm utilizes its three key components to effectively generate MRS emergent behaviors. First, a physical model estimates the ideal next state for the system assuming no emergencies. Then, a heuristic algorithm computes an emergency score for each robot based on the difference between the ideal and actual observation of each robot. Last, an intrinsic reward is computed and applied to the learning robots to encourage robots to avoid regions with high emergency scores. After sufficient training, robots in the MRS will learn to use these avoidance behaviors when they believe the danger is imminent, even when the danger is not in their sensor range. This mechanism draws inspiration from animals inferring dangers from other nearby animals that are alerted; then effective local reactions are organized to avoid dangers collectively.
Preliminaries of MRS Reinforcement Learning. The standard setup of reinforcement learning assumes a Markov decision process (MDP) model for the agent-environment interaction. An MDP is a 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$ representing the set of states, the set of actions, the reward function, the transition function, and the discount factor, respectively. The agent is tasked with learning a policy $\pi$ such that the total expected discounted reward is maximized. This study uses the multi-agent extension of the MDP, where each agent $i$ has its own state $s_i$ and action $a_i$. Each agent is tasked with maximizing its own expected total discounted reward.
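As a small worked example of the objective described above, the following sketch computes the discounted return that each agent tries to maximize. The reward sequences, the robot names, and the discount factor of 0.5 are illustrative assumptions, not values from the experiments.

```python
from typing import Dict, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * r_t for one episode, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-robot reward sequences for a three-step episode.
per_agent_rewards: Dict[str, Sequence[float]] = {
    "robot_0": [1.0, 0.0, 2.0],
    "robot_1": [0.0, 1.0, 1.0],
}

# In the multi-agent setting each agent maximizes its own return.
returns = {
    name: discounted_return(rs, gamma=0.5)
    for name, rs in per_agent_rewards.items()
}
```

With these numbers, the first robot's return is 1 + 0.5·0 + 0.25·2 = 1.5 and the second robot's is 0 + 0.5·1 + 0.25·1 = 0.75.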
III-A Overview of CCR Design
The overall framework figure gives an overview of CCR. On the left-hand side, the MARL algorithm learns a policy for each robot by interacting with the environment. Let $o_i^t$, $r_i^t$, and $a_i^t$ denote the local observation, the reward, and the action of the $i$-th robot at time $t$, respectively. The CCR module comes into play by redefining the reward and observation received by the robots based on the emergency score derived from the learned physical model. The last-step observations and actions are stored and used to compute the ideal current observation $\hat{o}_i^t$ for each robot, which represents how the environment dynamics are expected to evolve without disturbance. The difference between $\hat{o}_i^t$ and $o_i^t$ is then used to compute an emergency score for each robot, which is appended to every other robot's observation. This gives the robots more explicit information about the emergency. Next, the emergency scores are used to compute an intrinsic reward for each robot via a heuristic algorithm, and this intrinsic reward is added to the reward received from the environment. The intrinsic reward provides the robots with additional information about the potential consequences of their actions: robots are discouraged from getting close to other robots that are in danger and encouraged to approach other robots that are safe. The details of the algorithm are introduced in the following sections.
III-B Emergency Score Assessment
The emergency score is derived from the way animals perceive sudden external emergencies and is used to measure how dangerous they are. In general, animal groups respond at different levels to crises with different emergency scores. In an MRS, emergency scores prompt robots to learn local reaction policies that adapt to different levels of danger. CCR assumes access to the transition of robot $i$ at time $t$ and to the last-step transitions of all other robots at time $t-1$. In simulation environments, these last-step transitions can be saved in memory; in real-world scenarios, they can be obtained through communication among robots. The predicted states of all other robots are obtained by passing their last-step transitions to the physical model.
The emergency score of each other robot is then computed as a form of prediction error, that is, the discrepancy between its predicted and actual observations.
In practice, we observe that smoothing the emergency score with a rolling average over a fixed number of recent time steps can further improve performance.
The emergency score of each other robot is then appended to the observation of robot $i$. This step is repeated for all robots.
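The steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the physical model `ideal_next_obs` (a point mass with observation `[x, y, vx, vy]` and an acceleration action), the Euclidean norm as the prediction error, and the window length of 3 are all hypothetical choices.

```python
from collections import deque

import numpy as np


def ideal_next_obs(obs, action, dt=0.1):
    """Hypothetical physical model: point mass, obs = [x, y, vx, vy]."""
    pos, vel = obs[:2], obs[2:]
    new_vel = vel + dt * action
    return np.concatenate([pos + dt * new_vel, new_vel])


class EmergencyScorer:
    """Keeps a rolling average of each robot's prediction error."""

    def __init__(self, n_robots, window=3):
        self.history = [deque(maxlen=window) for _ in range(n_robots)]

    def update(self, prev_obs, prev_actions, obs):
        scores = np.empty(len(obs))
        for j, (o_prev, a_prev, o_now) in enumerate(zip(prev_obs, prev_actions, obs)):
            predicted = ideal_next_obs(o_prev, a_prev)
            # Prediction error: gap between the ideal and the actual observation.
            error = float(np.linalg.norm(predicted - o_now))
            self.history[j].append(error)
            # Rolling average over the most recent `window` errors.
            scores[j] = sum(self.history[j]) / len(self.history[j])
        return scores


def append_scores(obs, scores):
    """Append the other robots' scores to each robot's own observation."""
    return [np.concatenate([o, np.delete(scores, i)]) for i, o in enumerate(obs)]


# Robot 0 moves as the model predicts; robot 1 is pushed sideways by 0.5.
prev_obs = [np.array([0.0, 0.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0, 0.0])]
prev_actions = [np.zeros(2), np.zeros(2)]
obs = [
    ideal_next_obs(prev_obs[0], prev_actions[0]),
    ideal_next_obs(prev_obs[1], prev_actions[1]) + np.array([0.0, 0.5, 0.0, 0.0]),
]
scorer = EmergencyScorer(n_robots=2)
scores = scorer.update(prev_obs, prev_actions, obs)
augmented_obs = append_scores(obs, scores)
```

In this toy case the undisturbed robot gets a score of zero while the pushed robot gets a positive score, and each robot's observation grows by one entry holding the other robot's score.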
III-C Potential Consequences Evaluation
Reward in reinforcement learning can be understood as an evaluation of how advantageous a robot's action is. In the traditional setting, this evaluation is done by the environment, and the robot receives the reward in return. However, the robot can also evaluate its actions with a metric different from the environment's. This is the motivation behind intrinsic reward. In this work, the intrinsic reward is designed to estimate the potential consequences caused by the emergency.
Specifically, the emergency score introduced in the previous section is used to compute the intrinsic reward as an estimate of potential consequences. The intrinsic reward is then added to the reward returned by the environment to form the final reward used to train the policy of each robot.
First, the emergency scores of all robots are compared pairwise. The idea is that, within each pair, the robot with the lower emergency score should be in a safer state than the other. A path safety score is then computed for each pair of robots $i$ and $j$ to measure whether robot $i$ moved away from the emergency, which is represented by the state of the robot with the higher emergency score.
The magnitude of the path safety score is essentially the change in distance to the emergency during the transition. The score is positive when robot $j$ has the higher emergency score and robot $i$ is moving away from it, or when robot $j$ has the lower emergency score and robot $i$ is moving towards it; both cases are desirable. The score is negative when robot $j$ has the lower emergency score and robot $i$ is moving away from it, or when robot $j$ has the higher emergency score and robot $i$ is moving towards it; both cases are undesirable. The score is then normalized and multiplied by a coefficient computed from the difference in emergency scores to produce the intrinsic reward induced by robot $j$.
Summing over all other robots gives the final intrinsic reward of robot $i$ for this time step.
The reward that each robot receives from the environment is then augmented with this intrinsic reward, scaled by a tunable coefficient. The augmented reward replaces the reward obtained from the environment, and the transition is then saved to the experience replay (if using off-policy algorithms) or summed into the return (if using on-policy algorithms). Regular MARL algorithms take over from here.
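The pairwise evaluation described above can be sketched as below. The exact formulas are not reproduced here, so several choices are assumptions made for illustration only: distances are measured to the other robot's position before the transition, the change in distance is normalized by the length of the step taken, the coefficient is the absolute difference in emergency scores, and the weight is called `beta`.

```python
import numpy as np


def intrinsic_rewards(pos_before, pos_after, scores, eps=1e-8):
    """Sum, for each robot i, the rewards induced by every other robot j.

    pos_before, pos_after: (n, 2) positions around one transition.
    scores: (n,) emergency scores.
    """
    n = len(scores)
    rewards = np.zeros(n)
    for i in range(n):
        step = np.linalg.norm(pos_after[i] - pos_before[i]) + eps
        for j in range(n):
            if i == j:
                continue
            # Change in distance from robot i to robot j during the transition.
            d_before = np.linalg.norm(pos_before[i] - pos_before[j])
            d_after = np.linalg.norm(pos_after[i] - pos_before[j])
            delta = d_after - d_before
            # Positive when j is in more danger than i.
            score_diff = scores[j] - scores[i]
            # Moving away from a riskier robot, or towards a safer one, is positive.
            path_safety = np.sign(score_diff) * delta
            # Assumed normalization (by step length) and coefficient (|score_diff|).
            rewards[i] += abs(score_diff) * path_safety / step
    return rewards


def augment_rewards(env_rewards, intrinsic, beta):
    """Environment reward plus the weighted intrinsic reward."""
    return np.asarray(env_rewards, dtype=float) + beta * intrinsic


pos_before = np.array([[0.0, 0.0], [1.0, 0.0]])
scores = np.array([0.0, 1.0])  # robot 1 is the one in danger
retreat = intrinsic_rewards(pos_before, np.array([[-0.1, 0.0], [1.0, 0.0]]), scores)
approach = intrinsic_rewards(pos_before, np.array([[0.1, 0.0], [1.0, 0.0]]), scores)
total = augment_rewards([1.0, 1.0], retreat, beta=0.5)
```

Here robot 0 is rewarded for retreating from the endangered robot 1 and penalized for approaching it, while the stationary robot 1 receives no intrinsic reward. Setting `beta` to zero recovers the unmodified environment reward.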
III-D Developing Emergent Collective Behaviors
Emergency scores give robots immediate information about emergency hazards, which stimulates them to react at the same time. Without this information, robots would miss the best time window for a first reaction, which may lead to global failure. Since the estimate of potential consequences is computed from the emergency scores of other robots, the behavior of each robot is heavily influenced by the states and actions of its neighbors. Thus, CCR enhances emergent behavior in the MRS by actively propagating emergency information from the local level to the whole team. Robots in the danger zone self-organize to form a local first response to the emergency.
IV-A Experiment Setting
Experiments were conducted in a simulated environment to evaluate the effectiveness of the CCR algorithm. The goal of these experiments was to investigate how the behavior of the robots changed when CCR was applied and to quantitatively measure the effect of this change. Our experimental environment was based on the Multi-Agent Particle Environment [28, 24], a simple multi-agent particle world with basic simulated physics. The environment was modified to include more objects and to improve rendering. We implemented three scenarios that simulate how an MRS might encounter danger in the real world: turbulence, strong wind, and unexpected obstacle. According to our investigation, the unexpected obstacle is the most common emergency for UGV systems, while strong wind and turbulence are the two most common emergencies for UAV systems, as shown in Fig. 3.
Emergency 1: Interfered by Turbulence: In the turbulence scenario, four robots need to find their way to the target area marked with green color. A circular turbulence area will appear on their way and apply forces with random direction and magnitude on them.
Emergency 2: Blown Away by Strong Wind: In the strong wind scenario, four robots are trained to navigate to a target point. A rectangular strong wind area appears on their way and applies a force with a fixed direction and a higher magnitude than the forces in the turbulence scenario.
Emergency 3: Collide with Unexpected Obstacle: In the unexpected obstacle scenario, four robots are trained to navigate to a green triangle target point. A rectangular solid obstacle is located on their way.
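The two force-based emergencies above can be sketched as simple disturbance fields. This is an illustrative sketch rather than the modified environment's code: the function names, the uniform sampling of direction and magnitude, and the axis-aligned rectangle are assumptions.

```python
import numpy as np


def turbulence_force(pos, center, radius, max_magnitude, rng):
    """Force with random direction and magnitude inside a circular area."""
    if np.linalg.norm(pos - center) > radius:
        return np.zeros(2)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    magnitude = rng.uniform(0.0, max_magnitude)
    return magnitude * np.array([np.cos(angle), np.sin(angle)])


def wind_force(pos, lower, upper, force):
    """Constant force inside an axis-aligned rectangle, zero outside."""
    inside = bool(np.all(pos >= lower) and np.all(pos <= upper))
    return np.array(force, dtype=float) if inside else np.zeros(2)


rng = np.random.default_rng(0)
center = np.zeros(2)

# A robot inside the turbulence circle is pushed in a random direction.
f_turb_in = turbulence_force(np.array([0.2, 0.1]), center, 1.0, 2.0, rng)
# A robot outside the circle feels nothing.
f_turb_out = turbulence_force(np.array([3.0, 0.0]), center, 1.0, 2.0, rng)

# The wind pushes every robot inside the rectangle with the same force.
lower, upper = np.array([-1.0, -0.5]), np.array([1.0, 0.5])
f_wind_in = wind_force(np.array([0.0, 0.0]), lower, upper, [0.0, -5.0])
f_wind_out = wind_force(np.array([0.0, 2.0]), lower, upper, [0.0, -5.0])
```

In both cases the force would be added to the robot's dynamics by the simulator but left out of the physical model used for prediction, so robots inside the area accumulate prediction error.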
Metrics. We evaluated the performance of MARL algorithms equipped with the CCR module using four metrics designed to account for both safety and efficiency. Safety was evaluated with the average distance from each robot to the center of the danger area at every time step and with the number of time steps each robot spent in a dangerous state. Efficiency was evaluated with the number of robots that successfully accomplished the task (the success criterion varies by scenario) and with the reward.
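A minimal sketch of how the safety metrics and the success count could be computed from recorded trajectories is given below. It assumes, for illustration only, that a robot is in a dangerous state when it is within a fixed radius of the danger center and that success means ending within a tolerance of the target; the actual criteria vary by scenario.

```python
import numpy as np


def safety_metrics(trajectories, danger_center, danger_radius):
    """trajectories: (T, n, 2) positions of n robots over T time steps."""
    dist = np.linalg.norm(trajectories - danger_center, axis=-1)  # shape (T, n)
    mean_distance = float(dist.mean())
    dangerous_steps = (dist < danger_radius).sum(axis=0)  # per robot
    return mean_distance, dangerous_steps


def success_count(final_positions, target, tolerance):
    """Number of robots that end within `tolerance` of the target."""
    reached = np.linalg.norm(final_positions - target, axis=-1) <= tolerance
    return int(reached.sum())


# Three time steps, two robots; the danger area is a unit circle at the origin.
trajectories = np.array([
    [[2.0, 0.0], [0.0, 0.5]],
    [[1.5, 0.0], [0.0, 0.5]],
    [[0.5, 0.0], [0.0, 2.0]],
])
mean_distance, dangerous_steps = safety_metrics(trajectories, np.zeros(2), 1.0)
n_success = success_count(trajectories[-1], np.array([0.0, 2.0]), 0.1)
```

In this toy episode the first robot spends one step in the danger area, the second spends two, and only the second reaches the target.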
Baseline. To validate that the reflex mechanism is consistently effective in emergency reaction, experiments were conducted with independent DDPG (IDDPG), MADDPG, and MAAC, which are regarded as the most representative multi-agent reinforcement learning algorithms. Each pure RL algorithm was compared with its CCR-aided counterpart to demonstrate the benefit of the CCR module. The state (observation) space was continuous for all three algorithms, while the action space was continuous for IDDPG and MADDPG and discrete for MAAC. This diversified the settings in which the CCR module was tested and supports the broad applicability of our approach.
To illustrate that robots equipped with CCR react in time, different colors are assigned to the robots' trajectories according to the stage CCR is in. The color pattern is shown in Fig. 3. The trajectory is blue when the robot is operating normally and the CCR module is on standby, yellow when the robot senses the danger, and pink when CCR is activated and triggers an intrinsic reward. Figure 3 also illustrates the unsafe behavior and success criteria described above.
The robot operator can tune the efficiency-safety trade-off by changing the hyperparameter that controls the weight of the intrinsic reward added to the environment reward. After conducting a parameter search, we found a setting that gave a fairly balanced trade-off between efficiency decrease and safety improvement and generalized well across all three scenarios. Therefore, all experiments shown here were conducted with that setting.
IV-B Overall Performance
We measured the task success rate and the number of dangerous time steps to quantify the trade-off between safety and efficiency. The results are shown in Tables I and II. Across all scenarios and all baseline algorithms, the task success rate is only slightly lower when the CCR module is applied, by 2.0% on average. Meanwhile, the number of dangerous time steps is often clearly reduced. The most significant case is MADDPG in the unexpected obstacle scenario, where the number of collisions with the obstacle is reduced by 23.3% while the success rate decreases by only 2.4%. To examine this effect more closely, we also plot the number of dangerous steps and the task success rate over the entire training process. Figure 5 shows that the episodic number of collisions decreases by nearly 20% in the unexpected obstacle scenario when CCR is applied. Surprisingly, the success rate with CCR remains higher than the baseline for most of training and is only overtaken at the very end. Meanwhile, the number of collisions with the obstacle is significantly lower than the baseline throughout training, especially at the beginning, where it is less than 1/15 of the baseline value. We examined the results from the other scenarios and observed similar trends. These results show that CCR can reduce the frequency of unsafe behavior in a wide range of MRS tasks while sacrificing no more than 5.0% efficiency, which is generally considered acceptable in practical deployment. More importantly, unsafe behaviors during training are reduced significantly, which offers great potential for deploying the algorithm on real-world systems, where exploration under uncertainty can be costly and dangerous. These experiments provide a foundation for extending CCR to more realistic and complex scenarios, possibly in physical environments.
[Tables I and II: task success rate and number of dangerous time steps for each baseline, with and without CCR, in the unexpected obstacle, turbulence, and strong wind scenarios.]
IV-C Behavioral Analysis
Figure 4 illustrates how the behavior of the robots in the unexpected obstacle scenario changes substantially when the CCR module is introduced into training. In this scenario, since the robots can only sense the obstacle when they get sufficiently close to it, all the robots trained with the baseline MADDPG algorithm repeat the process of running into the obstacle and then moving away from it. In contrast, for the robots trained with CCR-aided MADDPG, only the first few robots run into the obstacle, while the others steer away from it without even seeing it. This is because the obstacle is not present in the physical model, so robots that run into it have high emergency scores: they do not expect an obstacle to be there. Other robots thus receive positive auxiliary rewards for moving away from these robots and negative auxiliary rewards for moving towards them. In particular, the following robots obtain danger information from observing the emergency reaction of the leading robot and react in advance. This is eventually reflected in the policy after training. The behavior is confirmed by the average distance to the obstacle in Fig. 4, which shows that, overall, the robots trained with CCR-aided MADDPG stay much farther from the obstacle than those trained with MADDPG, by approximately 1.3x to 6.2x.
Scalability Discussion. In addition to the discussion above, we conducted extension experiments to examine our method's scalability. We set the robot team size to 2 and 6 and repeated the previous experimental process. In both extension experiments, our algorithm reached a performance no worse than that with a team size of 4. In particular, MADDPG+CCR reduced the dangerous behavior frequency by 23.5% and 25.2%, respectively, in the unexpected obstacle scenario, which supports the scalability of our algorithm.
This paper introduced Collective Conditioned Reflex (CCR), a biology-inspired fast emergency reaction module based on multi-agent reinforcement learning for multi-robot systems. In the future, more realistic 3D physical dynamics simulators such as Gazebo and AirSim can be used to evaluate the effectiveness of our method. Moreover, how to maintain the system's performance when safety is fully achieved can also be investigated. Further research will focus on generalizing this work to real robot teams for practical deployment.
- (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In ICRA, pp. 285–292.
- (2018) Multi robot collision avoidance in a shared workspace. Autonomous Robots 42 (8), pp. 1749–1770.
- Natural emergence of heterogeneous strategies in artificially intelligent competitive teams. In ICSI, pp. 13–25.
- (2017) Nonsmooth barrier functions with applications to multi-robot systems. L-CSS 1 (2), pp. 310–315.
- (2016) From division of labor to the collective behavior of social insects. Behavioral Ecology and Sociobiology 70 (7), pp. 1101–1108.
- (2019) The ecology of collective behavior in ants. Annual Review of Entomology.
- (2016) Application of multi-robot systems to disaster-relief scenarios with limited communication. In Field and Service Robotics, pp. 639–653.
- (2017) Multi-robot coalitions formation with deadlines: complexity analysis and solutions. PloS one 12 (1), pp. e0170659.
- (2021) The influence of social environment on cooperation and conflict in an incipiently social bee, Ceratina calcarata. Behavioral Ecology and Sociobiology 75 (4), pp. 1–11.
- (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML, pp. 2961–2970.
- (2017) Programming and coordination in distributed multi-robot systems. Ph.D. Thesis, PolyU.
- (2015) Multi-robot task allocation: a review of the state-of-the-art. CRSN 2015, pp. 31–51.
- (2015) Continuous control with deep reinforcement learning. arXiv.
- (2019) Trust-aware behavior reflection for robot swarm self-healing. In AAMAS, pp. 122–130.
- (2018) Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In ICRA, pp. 6252–6259.
- (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv.
- (2021) Evidence-based physical diagnosis e-book. Elsevier Health Sciences.
- (2013) Emergency response to the nuclear accident at the Fukushima Daiichi nuclear power plants using mobile rescue robots. J. Field Robot. 30 (1), pp. 44–63.
- (2021) Continuous learning of emergent behavior in robotic matter. PNAS 118 (21).
- (2020) Trust aware emergency response for a resilient human-swarm cooperative system. arXiv.
- (2017) Fault-tolerant rendezvous of multirobot systems. T-RO 33 (3), pp. 565–582.
- (2016) An efficient algorithm for fault-tolerant rendezvous of multi-robot systems with controllable sensing range. In ICRA, pp. 358–365.
- (2013) Scalable, fault-tolerant and distributed multi-robot patrol in real world environments. In IROS, pp. 4759–4764.
- (2016) Human-inspired motion model of upper-limb with fast response and learning ability–a promising direction for robot system and control. Assembly Automation.
- (2008) Springer handbook of robotics. Vol. 200, Springer.
- (2019) The social structure of Bombus terrestris colonies: a review. The biology of social insects, pp. 196–200.
- (2017) Safety barrier certificates for collisions-free multirobot systems. T-RO 33 (3), pp. 661–674.
- (2021) Toward safe human-robot interaction: a fast-response admittance control method for series elastic actuator. T-ASE.