Safe Multi-Agent Reinforcement Learning via Shielding

01/27/2021 ∙ by Ingy Elsayed-Aly, et al. ∙ Northeastern University, University of Virginia, The University of Texas at Austin, Clausthal University of Technology

Multi-agent reinforcement learning (MARL) has been increasingly used in a wide range of safety-critical applications, which require guaranteed safety (e.g., no unsafe states are ever visited) during the learning process. Unfortunately, current MARL methods do not have safety guarantees. Therefore, we present two shielding approaches for safe MARL. In centralized shielding, we synthesize a single shield to monitor all agents' joint actions and correct any unsafe action if necessary. In factored shielding, we synthesize multiple shields based on a factorization of the joint state space observed by all agents; the set of shields monitors agents concurrently and each shield is only responsible for a subset of agents at each step. Experimental results show that both approaches can guarantee the safety of agents during learning without compromising the quality of learned policies; moreover, factored shielding is more scalable in the number of agents than centralized shielding.




1. Introduction

Multi-agent reinforcement learning (MARL) addresses sequential decision-making problems where multiple agents interact with each other in a common environment. In recent years, MARL methods have been increasingly used in a wide range of safety-critical applications from traffic management singh2020hierarchical to robotic control yu2019coordinated to autonomous driving shalev2016safe. Existing MARL methods hernandez2019survey; zhang2019multi focus mostly on optimizing policies based on returns, none of which can guarantee safety (e.g., no unsafe states are ever visited) during the learning process. Nevertheless, learning with provable safety guarantees is necessary for many safety-critical MARL applications where the agents (e.g., robots, autonomous cars) may break during the exploration process and lead to catastrophic outcomes.

A recent work alshiekh2018safe developed a shielding framework for single-agent reinforcement learning (RL), which synthesizes a shield to enforce the correctness of safety specifications in linear temporal logic (LTL) pnueli1977temporal. The shield guarantees safety during learning by monitoring the RL agent’s actions and preventing the exploration of any unsafe action that violates the LTL safety specification. In this paper, we adapt the shielding framework to the multi-agent setting. Guaranteeing safety for multiple agents with potentially competing goals is more challenging than the single-agent setting, because safety is an emergent property that concerns the coupling of all agents. In addition, the combinatorial nature of MARL (i.e., the joint state space and joint action space increase exponentially with the number of agents) poses scalability issues to the computation of shields.

We present in this paper the first work to provide safety guarantees (expressed as LTL specifications) for MARL. Our contributions are threefold. First, we develop a centralized shielding approach for MARL, where we synthesize a single shield to centrally monitor the joint actions of all agents. The shield determines that a joint action is safe if all agents satisfy the safety specification. We follow the minimal interference principle proposed in alshiekh2018safe; that is, a shield should restrict the agents as infrequently as possible and correct only the actions that violate the safety specification. Moreover, we introduce an additional interpretation of minimal interference in the multi-agent setting: a shield should change the actions of as few agents as possible when correcting an unsafe joint action. The centralized shielding approach has limited scalability, because the computational cost of synthesizing shields depends on the number of MARL agents and the complexity of the safety specification.

Second, we develop a factored shielding approach for MARL to address the aforementioned scalability issues. The factored shielding offers a divide-and-conquer approach: multiple shields are computed based on a factorization of the joint state space observed by all agents. The set of factored shields monitors agents concurrently and each shield is only responsible for a subset of agents at each step. Agents can join or leave a factored shield at any time depending on their states. Factored shields enforce the correctness of safety specification by preventing unsafe actions similarly to the centralized shield. While each individual factored shield can only monitor a limited number of agents due to the restriction of shield computation, we can employ as many shields as needed; and together the set of factored shields can monitor a large number of MARL agents.

Third, we showcase the performance of the two shielding approaches via experimental evaluation on six benchmark problems in a grid world melo2009learning and a cooperative navigation yang2019cm3 environment. We used two MARL algorithms, CQ-learning de2010learning and MADDPG lowe2017multi, in our experiments to demonstrate that the shielding approaches are compatible with different MARL algorithms. Experimental results show that the two shielding approaches can both guarantee the safety of agents during learning without compromising the quality of learned policies; moreover, factored shielding is more scalable in the number of agents than centralized shielding.

2. Related Work

Safe reinforcement learning (RL) is an active research area, but existing results focus mostly on the single-agent setting garcia2015comprehensive, while safe MARL is still relatively uncharted territory zhang2019multi. To the best of our knowledge, this paper presents the first safety-constrained MARL method. The survey in garcia2015comprehensive classifies safe RL methods into two categories: (1) transforming the optimization criterion with a safety factor, such as the worst case criterion, risk-sensitive criterion, or constrained criterion; and (2) modifying the exploration process through the incorporation of external knowledge (e.g., demonstrations, teacher advice) or the guidance of a risk metric. Our shielding approaches fall into the second category. In particular, shields act similarly to a teacher who provides information (e.g., safe actions) to the learner when necessary (e.g., unsafe situations are detected). The concept of shielding was introduced to RL for the single-agent setting in alshiekh2018safe. In this work, we adapt the shielding framework for MARL by addressing challenges such as the coupling of agents and scalability issues in the multi-agent setting.

Different safety objectives for RL have been considered in the literature, such as the variance of the return, or limited visits of error states garcia2015comprehensive. In this work, we synthesize shields that enforce safety specifications expressed in linear temporal logic (LTL) pnueli1977temporal, which is a commonly used specification language in formal methods for safety-critical systems alur2015principles; baier2008principles. For example, LTL has been used to express complex task specifications for robotic planning and control kress2009temporal; ulusoy2013optimality. Several recent works hasanbeig2020cautious; bozkurt2020control; hahn2019omega have developed reward shaping techniques that translate logical constraints expressed in LTL to reward functions for RL. However, as we demonstrate in our experiments (Section 6), relying on reward functions only is not sufficient for MARL methods to learn policies that guarantee safety (e.g., no collisions).

The shield synthesis technique based on solving two-player safety games was developed in bloem2015shield for enforcing safety properties of a system at runtime, and was adopted in alshiekh2018safe to synthesize shields for single-agent RL. We further adapt this technique to synthesize centralized and factored shields for MARL in this paper. There are a few recent works raju2019decentralized; bharadwaj2019synthesis considering the shield synthesis for multi-agent (offline) planning and coordination, none of which are directly applicable for MARL.

3. Background

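A discrete probability distribution over a (countable) set $S$ is a function $\mu: S \to [0, 1]$ such that $\sum_{s \in S} \mu(s) = 1$. Let $\mathit{Distr}(S)$ denote the set of distributions over $S$. We use $\mathbb{R}$ to denote the real numbers. Given an alphabet $\Sigma$, we denote by $\Sigma^\omega$ and $\Sigma^*$ the set of infinite and finite words over $\Sigma$, respectively.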

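Multi-Agent Reinforcement Learning (MARL). We follow the Markov game formulation of MARL in zhang2019multi. A Markov game is a tuple $(N, S, \{A^i\}_{i \in N}, P, \{R^i\}_{i \in N}, \gamma)$ with a finite set $N = \{1, \ldots, n\}$ of agents, and a finite state space $S$ observed by all agents; let $A = A^1 \times \cdots \times A^n$ be the set of joint actions for all agents, where $A^i$ denotes the actions of agent $i$; the probabilistic transition function $P: S \times A \to \mathit{Distr}(S)$ is defined over the joint states and actions of all agents; $R^i: S \times A \times S \to \mathbb{R}$ is an immediate reward function for agent $i$ under the joint states and actions; $\gamma \in [0, 1]$ is the discount factor of future rewards. At time step $t$, each agent $i \in N$ chooses an action $a_t^i \in A^i$ based on the observed state $s_t \in S$. The environment moves to state $s_{t+1}$ with the probability $P(s_t, a_t)(s_{t+1})$, where $a_t = (a_t^1, \ldots, a_t^n)$ is the joint action of all agents, and rewards agent $i$ with $R^i(s_t, a_t, s_{t+1})$. The goal of an individual agent $i$ is to learn a policy $\pi^i: S \to \mathit{Distr}(A^i)$ that optimizes the expectation of cumulative future rewards $\mathbb{E}\left[\sum_{t \ge 0} \gamma^t R^i(s_t, a_t, s_{t+1})\right]$. The performance of an individual agent $i$ is influenced not only by its own policy, but also by the choices of all other agents.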

Depending on agents’ goals, MARL algorithms can be categorized as fully cooperative (i.e., agents collaborate to optimize a common long-term return), fully competitive (i.e., zero-sum game among agents), or a mixed setting that involves both cooperative and competitive agents. In our experiments (Section 6), we used the following three mixed-setting algorithms. Independent Q-learning tan1993multi is a baseline algorithm where agents learn Q-values over their own action set independently and do not use any information about other agents. CQ-learning de2010learning is an algorithm that allows agents to act independently most of the time and only accounts for the other agents when necessary (e.g., when conflict situations are detected). MADDPG lowe2017multi is a deep MARL algorithm featuring centralized training with decentralized execution, in which each agent trains models simulating each of the other agents’ policies based on its observation of their actions.
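As a point of reference for the baseline above, the following is a minimal sketch of a tabular independent Q-learner: each agent keeps its own table over its own actions and never looks at the other agents. The class name, hyperparameter defaults, and epsilon-greedy exploration are illustrative assumptions, not details taken from the paper's implementation.

```python
import random
from collections import defaultdict


class IndependentQLearner:
    """Tabular Q-learner for one agent; other agents are treated as part of the environment."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        # alpha, gamma, epsilon are assumed defaults chosen for illustration only.
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state, action) -> estimated return
        self.rng = random.Random(seed)

    def act(self, state):
        # Epsilon-greedy choice over this agent's own action set.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

In a multi-agent run, one such learner would be created per agent and updated with that agent's own reward only, which is exactly why the environment appears non-stationary from each learner's point of view.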

Scalability is a key challenge of MARL, due to its combinatorial nature. For example, our experiments can only use two agents with CQ-learning, but more than four agents with MADDPG, which applies deep neural networks for function approximation to mitigate the scalability issue. Another key challenge of MARL is the lack of convergence guarantees in general, except for some special settings zhang2019multi. As multiple agents learn and act concurrently, the environment faced by an individual agent becomes non-stationary, which invalidates the stationarity assumption used for proving convergence in single-agent RL algorithms.

Safety Specifications and Safety Games. We use linear temporal logic (LTL) pnueli1977temporal to express safety specifications. In addition to propositional logical operators, LTL employs temporal operators such as $\mathsf{X}$ (next), $\mathsf{U}$ (until), $\mathsf{G}$ (always), and $\mathsf{F}$ (eventually). The set of words that satisfies an LTL formula $\varphi$ represents a language $\mathcal{L}(\varphi) \subseteq (2^{AP})^\omega$, where $AP$ is a given set of atomic propositions. LTL formulas can be used to express a wide variety of requirements. We focus on safety specifications, which are informally interpreted as “something bad should never happen”. For example, the LTL formula $\mathsf{G}\, \neg \mathit{unsafe}$ expresses that “unsafe states should never be visited”. An LTL safety specification can be translated into a safety language accepted by a deterministic finite automaton (DFA) kupferman2001model.

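Formally, a deterministic finite automaton is a tuple $\mathcal{A} = (Q, q_0, \Sigma, \delta, F)$ with a finite set of states $Q$, an initial state $q_0 \in Q$, a finite alphabet $\Sigma$, the transition function $\delta: Q \times \Sigma \to Q$, and a finite set of accepting states $F \subseteq Q$. Let $q_0 q_1 \ldots$ be the run of the DFA induced by a word $\sigma_0 \sigma_1 \ldots$, with $q_{i+1} = \delta(q_i, \sigma_i)$. The word is in the safety language accepted by the DFA if the run only visits accepting states of the DFA, i.e., $q_i \in F$ for all $i \ge 0$.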
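The acceptance condition for a safety language can be checked mechanically by following the run and rejecting as soon as a non-accepting state is reached. The sketch below illustrates this on finite prefixes; the `DFA` class, its field names, and the two-state "never unsafe" automaton are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    initial: str
    accepting: frozenset
    delta: dict  # (state, symbol) -> next state

    def accepts_safely(self, word):
        """Return True iff the run over `word` only visits accepting states."""
        state = self.initial
        if state not in self.accepting:
            return False
        for symbol in word:
            state = self.delta[(state, symbol)]
            if state not in self.accepting:
                return False
        return True


# Illustrative automaton for "unsafe states should never be visited".
never_unsafe = DFA(
    initial="ok",
    accepting=frozenset({"ok"}),
    delta={
        ("ok", "safe"): "ok",
        ("ok", "unsafe"): "bad",
        ("bad", "safe"): "bad",   # once violated, the run can never recover
        ("bad", "unsafe"): "bad",
    },
)
```

Because the "bad" state is a sink, a single violation rejects every extension of the word, which is the defining feature of a safety property.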

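We use Mealy machines to represent shields. Formally, a Mealy machine is a tuple $\mathcal{M} = (Q, q_0, \Sigma_I, \Sigma_O, \delta, \lambda)$ with a finite set of states $Q$, an initial state $q_0 \in Q$, finite sets of input alphabet $\Sigma_I$ and output alphabet $\Sigma_O$, the transition function $\delta: Q \times \Sigma_I \to Q$, and the output function $\lambda: Q \times \Sigma_I \to \Sigma_O$. For a given input trace $\sigma_0 \sigma_1 \ldots \in \Sigma_I^\omega$, the Mealy machine generates a corresponding output trace $\lambda(q_0, \sigma_0) \lambda(q_1, \sigma_1) \ldots \in \Sigma_O^\omega$ where $q_{i+1} = \delta(q_i, \sigma_i)$ for all $i \ge 0$.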
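To make the input/output behaviour concrete, here is a small sketch of a Mealy machine that emits one output symbol per input symbol. The class and the one-state toy machine (which maps observation-action pairs to a possibly corrected action) are assumptions made for illustration; they are not a shield synthesized by the method in this paper.

```python
class MealyMachine:
    def __init__(self, initial, delta, output):
        self.initial = initial
        self.delta = delta    # (state, input symbol) -> next state
        self.output = output  # (state, input symbol) -> output symbol

    def run(self, inputs):
        """Return the output trace produced for a finite input trace."""
        state, outputs = self.initial, []
        for symbol in inputs:
            outputs.append(self.output[(state, symbol)])
            state = self.delta[(state, symbol)]
        return outputs


# Toy machine: inputs are (observation, action) pairs; "go" is overridden when "near".
symbols = [(obs, act) for obs in ("far", "near") for act in ("go", "stay")]
toy = MealyMachine(
    initial="q",
    delta={("q", s): "q" for s in symbols},
    output={("q", (obs, act)): ("stay" if obs == "near" else act) for obs, act in symbols},
)
```

For instance, `toy.run([("far", "go"), ("near", "go")])` forwards the first action unchanged and replaces the second one with "stay".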

As we will describe later, we synthesize shields by solving two-player safety games. Formally, a two-player safety game is a tuple $\mathcal{G} = (G, g_0, \Sigma_1, \Sigma_2, \delta, F)$ with a finite set of game states $G$, an initial state $g_0 \in G$, finite sets of alphabet $\Sigma_1$ and $\Sigma_2$ for Player 1 and Player 2 respectively, the transition function $\delta: G \times \Sigma_1 \times \Sigma_2 \to G$, and a set of safe states $F \subseteq G$ that defines the winning condition such that a play $g_0 g_1 \ldots$ of the game is winning iff $g_i \in F$ for all $i \ge 0$. At each game state $g \in G$, Player 1 chooses an action $\sigma_1 \in \Sigma_1$, then Player 2 chooses an action $\sigma_2 \in \Sigma_2$, and the game moves to the next state $\delta(g, \sigma_1, \sigma_2)$. A memoryless strategy for Player 2 is a function $\rho: G \times \Sigma_1 \to \Sigma_2$. A winning region $W \subseteq F$ is the set of states from which there exists a winning strategy (i.e., all plays constructed using the strategy satisfy the winning condition).
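The winning region of such a game can be computed with the textbook greatest-fixed-point iteration: start from the safe states and repeatedly discard any state where some Player 1 move leaves Player 2 with no reply that stays inside the current set. The sketch below is that generic procedure on an explicit game graph, written under the turn order described above; it is not a description of how the synthesis tool used in the experiments works internally, and the three-state game is a made-up example.

```python
def winning_region(states, sigma1, sigma2, delta, safe):
    """Greatest fixed point: states from which Player 2 can stay in `safe` forever.

    delta maps (state, player1_move, player2_move) -> next state.
    """
    win = set(safe) & set(states)
    changed = True
    while changed:
        changed = False
        for g in list(win):
            # Player 2 needs a safe reply to every Player 1 move.
            has_reply = all(
                any(delta[(g, s1, s2)] in win for s2 in sigma2) for s1 in sigma1
            )
            if not has_reply:
                win.discard(g)
                changed = True
    return win


# Made-up game: from "b" every Player 2 move leads to "bad".
example_delta = {
    ("a", "x", "l"): "a", ("a", "x", "r"): "b",
    ("b", "x", "l"): "bad", ("b", "x", "r"): "bad",
    ("bad", "x", "l"): "bad", ("bad", "x", "r"): "bad",
}
example_region = winning_region(
    states={"a", "b", "bad"}, sigma1={"x"}, sigma2={"l", "r"},
    delta=example_delta, safe={"a", "b"},
)
```

Here "b" is safe but not winning, because no move from it avoids "bad"; a strategy that always plays "l" in "a" keeps the play inside the winning region.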

4. Centralized Shielding

We introduce a centralized shield (i.e., a single shield for all agents) into the traditional MARL process. In the following, we first describe how the centralized shield interacts with the learning agents and the environment to achieve safe MARL, then we present our method for synthesizing the centralized shield.

Figure 1 illustrates the interaction of the centralized shield, the MARL agents, and the environment. Algorithm 1 summarizes the centralized shield’s behavior at time step $t$. The shield monitors the joint action $a_t$ chosen by the MARL agents. If the shield detects that $a_t$ is unsafe (i.e., violates the safety specification) at the agents’ joint state $s_t$, the shield substitutes $a_t$ with a safe joint action $\bar{a}_t$; otherwise, the shield forwards $a_t$ to the environment directly (i.e., $\bar{a}_t = a_t$). The environment receives the action $\bar{a}_t$ output by the shield, moves to state $s_{t+1}$, and provides reward $r_t^i$ for each agent $i$ to update its policy. Meanwhile, the shield assigns a punishment $p_t^i$ to agent $i$ (where $a_t^i \neq \bar{a}_t^i$) to help the MARL algorithm learn about the cost of unsafe actions.

A centralized shield enforces the safety specification during the learning process (i.e., any unsafe action is corrected to a safe action before being sent to the environment). Moreover, we require the shield to restrict MARL agents as rarely as possible via the minimal interference criteria: (1) the shield only corrects the joint action $a_t$ if it violates the safety specification, and (2) the shield seeks a safe joint action $\bar{a}_t$ that changes as few of the agents’ actions as possible from $a_t$.

Figure 1. Safe MARL with centralized shielding.

Our approach synthesizes a centralized shield based on the safety specification and a coarse environment abstraction. Note that we do not require the environment dynamics to be completely known in advance. The shield can be synthesized based on a coarse abstraction of the environment that is sufficient to reason about the potential violations of safety specifications. For example, before deploying a team of robots for a disaster search and rescue mission, we may use some low-resolution satellite imagery to build a coarse, high-level abstraction about the terrain environment for shield synthesis. However, such a coarse environment abstraction is not sufficient for planning algorithms that rely on complete models of the environment. Therefore, MARL agents still need to learn about the concrete environment dynamics.

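Input: Shield $\mathcal{S}$, MARL agents’ joint action $a_t$ and joint state $s_t$, a constant punishment cost $c$
Output: Safe joint action $\bar{a}_t$, punishment $p_t$
  $p_t \leftarrow (0, \ldots, 0)$
  $\bar{a}_t \leftarrow$ safe action output by the shield $\mathcal{S}$ for $(s_t, a_t)$
  for all agent $i$ such that $a_t^i \neq \bar{a}_t^i$ do
    $p_t^i \leftarrow c$
  end for
  return $\bar{a}_t$, $p_t$
Algorithm 1 Centralized shielding at time step $t$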
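The per-step behaviour summarized in Algorithm 1 can be sketched as a thin wrapper around any shield that maps a joint state and joint action to a safe joint action. In the sketch below, `shield` is an arbitrary callable standing in for the synthesized Mealy machine, the punishment value `-1.0` is an assumed placeholder, and `toy_shield` hard-codes a single correction purely for demonstration.

```python
def centralized_shield_step(shield, joint_state, joint_action, cost=-1.0):
    """Return the shielded joint action and a per-agent punishment vector."""
    safe_action = tuple(shield(joint_state, joint_action))
    # Only agents whose action was changed by the shield are punished.
    punishment = [
        cost if chosen != safe else 0.0
        for chosen, safe in zip(joint_action, safe_action)
    ]
    return safe_action, punishment


def toy_shield(joint_state, joint_action):
    # Hypothetical stand-in: corrects one known-unsafe case, forwards the rest.
    if joint_state == (3, 4) and joint_action == ("stay", "left"):
        return ("stay", "stay")
    return joint_action
```

Because the wrapper only touches inputs and outputs, the same function can sit in front of any learner: the environment receives `safe_action`, and each learner adds its entry of `punishment` to the reward it observes.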

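We describe how to synthesize centralized shields as follows. We assume some coarse environment abstraction has been given as a DFA $\mathcal{A}^e = (Q^e, q_0^e, \Sigma^e, \delta^e, F^e)$ with the alphabet $\Sigma^e = L \times A$, where an observation function $f: S \to L$ maps the MARL agents’ joint state space $S$ to some observation set $L$, and $A$ is the joint action set of all agents. We translate the safety specification expressed as an LTL formula $\varphi^s$ to another DFA $\mathcal{A}^s = (Q^s, q_0^s, \Sigma^s, \delta^s, F^s)$ with the same alphabet $\Sigma^s = \Sigma^e$. We combine $\mathcal{A}^e$ and $\mathcal{A}^s$ into a two-player safety game $\mathcal{G} = (G, g_0, \Sigma_1, \Sigma_2, \delta^g, F^g)$ where $G = Q^e \times Q^s$, $g_0 = (q_0^e, q_0^s)$, $\Sigma_1 = L$, $\Sigma_2 = A$, $\delta^g((q^e, q^s), l, a) = (\delta^e(q^e, (l, a)), \delta^s(q^s, (l, a)))$ for all $(q^e, q^s) \in G$, $l \in L$, and $a \in A$, and the safe states $F^g \subseteq G$ are induced by the accepting states $F^e$ and $F^s$ of the two DFAs. We solve the two-player safety game and compute the winning region $W \subseteq F^g$ using the techniques described in bloem2015shield. We construct the centralized shield represented as a Mealy machine $\mathcal{S} = (Q, q_0, \Sigma_I, \Sigma_O, \delta, \lambda)$, where the state space $Q$ is given by the game states $G$, the initial state $q_0 = g_0$, the input alphabet $\Sigma_I = L \times A$, the output alphabet $\Sigma_O = A$; the transition function $\delta(g, (l, a)) = \delta^g(g, l, \lambda(g, (l, a)))$ for all $g \in G$, $l \in L$, and $a \in A$; the output function $\lambda(g, (l, a)) = a$ if $\delta^g(g, l, a) \in W$, and $\lambda(g, (l, a)) = a'$ if $\delta^g(g, l, a) \notin W$, where $a'$ is a safe action with $\delta^g(g, l, a') \in W$ and only differs from the unsafe action $a$ in terms of the minimal number of agents’ actions. We also define a (negative) constant $c$ as punishment for unsafe actions. The computational cost of synthesizing centralized shields grows exponentially as the number of agents increases, and also depends on the complexity of the safety specification and environment abstraction.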

To exemplify the shield synthesis method, let us consider two agents (blue and orange) in the grid map shown in Figure 2. Each agent can move left or right, or stay in the same grid. An agent receives a positive reward if it reaches grid 1 or 6, and a negative reward if it collides with the other agent; future rewards are discounted. Each agent tries to learn an optimal policy based on the observed rewards. However, the negative reward cannot completely prevent collisions during the learning process of traditional MARL algorithms, because the agents need to explore different (even unsafe) actions to learn about states and rewards from the environment. Now we show how to construct a shield that can block unsafe actions and guarantee collision-free learning. We use an observation set $L$ that measures the distance $d$ between the blue and orange agents. For example, $d = 1$ for the agents’ positions shown in Figure 2. We build a coarse environment abstraction DFA that captures the relation of agents’ distances and joint actions. Figure 3(a) shows a fragment of this DFA. We can express the safety specification of collision avoidance using the following LTL formula:

which indicates that the following bad scenarios should never occur: two agents being in the same grid (distance $d = 0$), or taking certain unsafe joint actions that would make them collide into each other when $d = 1$ or $d = 2$. We can translate the LTL formula into the DFA shown in Figure 3(b). We build a two-player safety game from the product of the two DFAs in Figure 3. Figure 4 shows a fragment of the safety game. For example, in the game state with $d = 1$, the blue and orange agents should not choose a joint action (stay, left) that leads to an unsafe game state where two agents collide into each other. The synthesized centralized shield prevents the collision by correcting the unsafe action (stay, left) with a safe action (stay, stay) and assigns the punishment cost to the orange agent.
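The correction in this running example can be mimicked with a one-step lookahead on the six-grid corridor: a joint action is treated as unsafe if the two agents would end up in the same grid or swap grids, and an unsafe joint action is replaced by the safe one that changes the fewest agents. This is a simplified stand-in for the synthesized shield, not the safety-game construction itself; the swap check and the tie-break that prefers "stay" are assumptions added so that the sketch reproduces the (stay, left) to (stay, stay) correction described above.

```python
from itertools import product

MOVES = {"left": -1, "stay": 0, "right": 1}


def step(pos, action, n_cells=6):
    """Move along the corridor, clamping at the walls."""
    return min(max(pos + MOVES[action], 1), n_cells)


def is_safe(positions, joint_action):
    nxt = tuple(step(p, a) for p, a in zip(positions, joint_action))
    same_cell = nxt[0] == nxt[1]
    swapped = nxt == (positions[1], positions[0])  # agents pass through each other
    return not (same_cell or swapped)


def correct(positions, joint_action):
    """Forward safe actions; otherwise pick a safe action changing the fewest agents."""
    if is_safe(positions, joint_action):
        return joint_action
    candidates = [c for c in product(MOVES, repeat=2) if is_safe(positions, c)]
    return min(
        candidates,
        key=lambda c: (
            sum(x != y for x, y in zip(c, joint_action)),  # minimal interference
            sum(a != "stay" for a in c),                    # assumed tie-break
        ),
    )
```

With blue in grid 3 and orange in grid 4, `correct((3, 4), ("stay", "left"))` changes only the orange agent's action and returns `("stay", "stay")`.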

Figure 2. Example grid map with two agents.
Figure 3. (a) An example environment abstraction DFA. (b) An example safety specification DFA. (Double circles denote accepting states of the DFAs. * refers to any action.)
Figure 4. An example safety game given by the product of the two DFAs shown in Figure 3. Double circles denote safe states.

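Correctness. We show that the synthesized centralized shields can indeed enforce safety specifications for MARL agents as follows. Given a trace $s_0 \bar{a}_0 s_1 \bar{a}_1 \ldots$ jointly produced by MARL agents, the centralized shield, and the environment, there is a corresponding run $q_0 q_1 \ldots$ of the shield such that $\bar{a}_t = \lambda(q_t, (f(s_t), a_t))$ and $q_{t+1} = \delta(q_t, (f(s_t), a_t))$ for all $t \ge 0$, where $\lambda$ and $\delta$ are the shield’s output and transition functions, $a_t$ is the joint action chosen by the agents, and $f$ is the observation function. By the construction of the shield, we have $Q = Q^e \times Q^s$, where $Q^e$ and $Q^s$ are the state space of the environment abstraction DFA $\mathcal{A}^e$ and the safety specification DFA $\mathcal{A}^s$, respectively. Thus, we can project the run of the shield onto a trace $q_0^s q_1^s \ldots$ on $\mathcal{A}^s$. The shield is constructed from the winning region of the safety game, which ensures that only safe states are ever visited along the trace of $\mathcal{A}^s$ (i.e., $q_t^s \in F^s$ for all $t \ge 0$). Thus, the centralized shield can guarantee that the safety specification is never violated.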

Impact on Learning Performance. The centralized shielding approach is agnostic to the choice of a MARL algorithm, because the shield interacts with the learner only via inputs and outputs, and does not rely on the inner-workings of the learning algorithm. As explained in Section 3, there is a lack of theoretical convergence guarantees for MARL algorithms in general. Thus, a full theoretical analysis of the shielding approach’s impact on MARL convergence is out of scope for this paper. We show empirically in our experiments (Section 6) that (1) MARL with and without centralized shielding both converge; (2) centralized shielding can guarantee the safety in all examples, while MARL without shielding does not prevent agents’ unsafe behavior; (3) centralized shielding learns more optimal policies with better returns than non-shielded MARL in some examples (e.g., due to the removal of unsafe actions that may destabilize learning).

5. Factored Shielding

Figure 5. Safe MARL with factored shielding.

The centralized shielding approach has limited scalability, because the computational cost of shield synthesis grows exponentially with the number of agents. To address this limitation, we develop a factored shielding approach that synthesizes multiple shields to monitor MARL agents concurrently, as illustrated in Figure 5.

Let us consider a finite set of factored shields $\{\mathcal{S}_1, \ldots, \mathcal{S}_m\}$ where each shield $\mathcal{S}_k$ is synthesized based on a factorization of the joint state space observed by all agents. We can leverage problem-specific knowledge to achieve an efficient factorization scheme (e.g., how many shields to use, what is the state space covered by each shield). For example, we synthesize two factored shields $\mathcal{S}_1$ and $\mathcal{S}_2$ for monitoring agents’ behavior in grids 1-3 and 4-6 of Figure 2, respectively. A factored shield monitors a subset of agent actions at each time step. A shield is not tied to any specific agent; instead, an agent can request to join or leave a shield from border states at any time. For example, if the orange agent in Figure 2 wants to move from grid 4 to grid 3, it would request to join $\mathcal{S}_1$ and leave $\mathcal{S}_2$.

Algorithm 2 describes how the factored shielding works at each time step $t$. There are three phases: (1) factorization, (2) shielding, and (3) coordination. In the factorization phase (lines 5-14), the algorithm identifies the factored shields that are responsible for monitoring each agent $i$ in the current time step $t$, based on a mapping between the agent state $s_t^i$ and the factored state space $S_k$ assigned to each shield $\mathcal{S}_k$. Thus, there must exist at least one factored shield monitoring each agent. If agent $i$ happens to be in a border state within the shield $\mathcal{S}_j$ and, by taking action $a_t^i$, the agent would cross the border to another shield $\mathcal{S}_k$, the algorithm relates agent $i$ with both shields and renames its actions in shield $\mathcal{S}_j$ and $\mathcal{S}_k$ as leave and join, respectively. Next, in the shielding phase (lines 16-33), each factored shield checks if the set of related agents act safely (i.e., not violating the safety specification within it) and substitutes any unsafe action with a default safe action (e.g., stay in our running example). In the coordination phase (lines 35-47), the algorithm checks the output of all shields to make sure compatible decisions are made for each agent. For example, if an agent action $a_t^i$ is translated to requests of leaving $\mathcal{S}_j$ and joining $\mathcal{S}_k$, then both requests need to be approved by the shields; however, if $\mathcal{S}_k$ considers join as unsafe at this time and substitutes it with the default safe action stay, then the algorithm corrects the agent action $a_t^i$ and outputs the safe action $\bar{a}_t^i =$ stay. Finally, the algorithm assigns a punishment cost for any unsafe action with $a_t^i \neq \bar{a}_t^i$.

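Input: A set of factored shields $\{\mathcal{S}_1, \ldots, \mathcal{S}_m\}$, MARL agents’ joint action $a_t$ and joint state $s_t$, a default safe action $a_d$, a constant punishment cost $c$
Output: Safe joint action $\bar{a}_t$, punishment $p_t$
1:  Initialize int array $\mathit{idx}[n][2]$ // related shield index
2:  Initialize string array $\mathit{act}[n][2]$ // actions
3:  Initialize Boolean array $\mathit{member}[m][n]$ // agents in each shield
4:  // Factorization phase
5:  for all agent $i \in N$ do
6:     find a factored shield $\mathcal{S}_j$ related to the agent state $s_t^i$
7:     if agent $i$ may leave shield $\mathcal{S}_j$ and join shield $\mathcal{S}_k$ then
8:         $\mathit{idx}[i][0] \leftarrow j$, $\mathit{idx}[i][1] \leftarrow k$
9:         $\mathit{act}[i][0] \leftarrow$ leave, $\mathit{act}[i][1] \leftarrow$ join
10:         $\mathit{member}[j][i] \leftarrow$ true, $\mathit{member}[k][i] \leftarrow$ true
11:     else
12:         $\mathit{idx}[i][0] \leftarrow j$, $\mathit{act}[i][0] \leftarrow a_t^i$, $\mathit{member}[j][i] \leftarrow$ true
13:     end if
14:  end for
15:  // Shielding phase
16:  for all shield $\mathcal{S}_j$ with some $\mathit{member}[j][i] =$ true do
17:     initialize an empty local joint action $a^j$
18:     for all $i$ with $\mathit{member}[j][i] =$ true do
19:         if $\mathit{idx}[i][0] = j$ then
20:             append $\mathit{act}[i][0]$ to $a^j$
21:         else
22:             append $\mathit{act}[i][1]$ to $a^j$
23:         end if
24:     end for
25:     $\bar{a}^j \leftarrow$ safe action output by the shield $\mathcal{S}_j$
26:     for all agent $i$ such that $\mathit{member}[j][i] =$ true do
27:         if $\mathit{idx}[i][0] = j$ then
28:             $\mathit{act}[i][0] \leftarrow$ the entry of $\bar{a}^j$ for agent $i$
29:         else
30:             $\mathit{act}[i][1] \leftarrow$ the entry of $\bar{a}^j$ for agent $i$
31:         end if
32:     end for
33:  end for
34:  // Coordination
35:  for all agent $i \in N$ do
36:     if agent $i$ is related to two shields then
37:         if $\mathit{act}[i][0] =$ leave and $\mathit{act}[i][1] =$ join then
38:             $\mathit{act}[i][0] \leftarrow a_t^i$
39:         else
40:             $\mathit{act}[i][0] \leftarrow a_d$
41:         end if
42:     end if
43:     $\bar{a}_t^i \leftarrow \mathit{act}[i][0]$, $p_t^i \leftarrow 0$
44:     if $\bar{a}_t^i \neq a_t^i$ then
45:         $p_t^i \leftarrow c$
46:     end if
47:  end for
48:  return $\bar{a}_t$, $p_t$
Algorithm 2 Factored shielding at time step $t$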
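The three phases of factored shielding can be illustrated on the six-grid corridor with two shields covering grids 1-3 and 4-6. The sketch below is a simplified reading of the algorithm, not the authors' implementation: `toy_shield` is a hypothetical local shield that only vets join requests against grids claimed by agents already inside, whereas a synthesized shield would also account for collisions among its own agents and for rejected leave requests. The punishment value `-1.0` is an assumed placeholder.

```python
MOVES = {"left": -1, "stay": 0, "right": 1}


def step(pos, action, n_cells=6):
    """Move along the corridor, clamping at the walls."""
    return min(max(pos + MOVES[action], 1), n_cells)


def region_of(pos):
    """Assumed factorization: shield 1 covers grids 1-3, shield 2 covers grids 4-6."""
    return 1 if pos <= 3 else 2


def toy_shield(requests):
    """Hypothetical local shield: rejects a join whose target grid is already claimed."""
    claimed = {step(p, a) for p, a in requests.values() if a not in ("join", "leave")}
    return {i: ("stay" if a == "join" and p in claimed else a)
            for i, (p, a) in requests.items()}


def factored_shield_step(shields, positions, joint_action, default="stay", cost=-1.0):
    # Factorization: translate each action into requests for the related shields.
    requests = {k: {} for k in shields}
    for i, (pos, act) in enumerate(zip(positions, joint_action)):
        target = step(pos, act)
        src, dst = region_of(pos), region_of(target)
        if src != dst:
            requests[src][i] = (pos, "leave")
            requests[dst][i] = (target, "join")
        else:
            requests[src][i] = (pos, act)
    # Shielding: each shield with related agents vets its local requests.
    approved = {i: True for i in range(len(positions))}
    for k, local in requests.items():
        if local:
            output = shields[k](local)
            for i, (_, act) in local.items():
                approved[i] = approved[i] and output[i] == act
    # Coordination: an action passes only if every related shield approved it.
    safe_action = tuple(a if approved[i] else default for i, a in enumerate(joint_action))
    punishment = [0.0 if approved[i] else cost for i in range(len(positions))]
    return safe_action, punishment
```

With blue in grid 3 and orange in grid 4, the joint action (stay, left) turns into a join request that shield 1 rejects and a leave request that shield 2 accepts; the coordination step resolves the conflict by falling back to the default action for the orange agent.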

We synthesize factored shields using a similar method as the synthesis of centralized shields. However, instead of building a safety game that accounts for the joint states and joint actions of all MARL agents, we only consider a factorization of states and actions for the synthesis of each factored shield. Let $S_k \subseteq S$ be the factored state space of shield $\mathcal{S}_k$. We factor the coarse environment abstraction DFA $\mathcal{A}^e$ into a DFA $\mathcal{A}^e_k$ with the alphabet $\Sigma^e_k = L_k \times A_k$, where an observation function $f_k: S_k \to L_k$ maps the factored states to some observation set $L_k$, and $A_k$ is the joint action in shield $\mathcal{S}_k$, whose dimension is determined by the maximum number of agents that shield $\mathcal{S}_k$ can monitor at once. Note that we need to translate the agent actions at border states of a shield to join or leave requests. Intuitively, since any agent may request to join or leave shield $\mathcal{S}_k$ at any time, the joint action $A_k$ needs to account for any possible combination of agents. This allows us to synthesize factored shields offline with a fixed alphabet, instead of re-computing shields for different agents at each step during learning. Similarly, we can factor the safety specification DFA $\mathcal{A}^s$ into a DFA $\mathcal{A}^s_k$ with the alphabet $\Sigma^s_k = \Sigma^e_k$. We obtain the shield $\mathcal{S}_k$ as a Mealy machine by solving the two-player safety game built from $\mathcal{A}^e_k$ and $\mathcal{A}^s_k$, in a similar way as described in Section 4.

Figure 6. An excerpt of the safety game for constructing shield of our running example. Double lines indicate safe states. To simplify the graphic notation, we put observations inside each state which should be labeled on all outgoing transitions from that state. The observations are about agents’ grid positions, with denoting outside. refers to any action except “join”.

Figure 6 shows an example safety game for synthesizing the shield $\mathcal{S}_1$ that monitors agents’ actions in grids 1-3 of our running example. The initial game state observes that the blue agent is in grid 3 and the orange agent is outside the shield. If the blue and orange agents ask for a pair of actions (stay, join), then the game would move to an unsafe state where both agents collide into each other in grid 3. In this case, shield $\mathcal{S}_1$ substitutes (stay, join) with safe actions (stay, stay). Since the orange agent is involved in two shields $\mathcal{S}_1$ and $\mathcal{S}_2$ (the latter covering grids 4-6), we need to coordinate the output of both shields. For example, if $\mathcal{S}_1$ rejects the orange agent’s join request but $\mathcal{S}_2$ accepts the same agent’s leave request, then there is a conflict between the outputs of $\mathcal{S}_1$ and $\mathcal{S}_2$. In such a case, our coordination algorithm chooses the default safe action stay for the orange agent. Note that, if there is another agent in shield $\mathcal{S}_2$, then it should not be allowed to move to grid 4 before the orange agent successfully leaves $\mathcal{S}_2$, to avoid collision. Such safety constraints can be encoded in the safety game for synthesizing the shield $\mathcal{S}_2$.

Correctness. We show that the factored shielding algorithm can guarantee safety for MARL agents. Given a trace $s_0 \bar{a}_0 s_1 \bar{a}_1 \ldots$ jointly produced by MARL agents, the factored shielding, and the environment, we prove that the state-action pair $(s_t, \bar{a}_t)$ is safe at every time step $t$. There are several cases. First, suppose none of the agents requests to switch shields at time step $t$. By the construction of factored shields, each shield monitors a subset of agents based on the factored state space and outputs a safe joint action that does not violate the safety specification. Thus, the joint state $s_t$ and joint action $\bar{a}_t$ output by all shields are safe for all agents. Second, suppose there is some agent $i$ requesting to leave shield $\mathcal{S}_j$ and join shield $\mathcal{S}_k$. If both shields accept agent $i$’s requests, then agent $i$ does not cause a violation of the safety specification with either shield, so we still have $s_t$ and $\bar{a}_t$ safe for all agents. If $\mathcal{S}_k$ rejects agent $i$’s joining request and substitutes it with a default safe action, then the factored shielding algorithm coordinates with the output of $\mathcal{S}_j$ and corrects agent $i$’s leaving request with the default safe action as well. Such a correction does not affect the safety of other agents in shield $\mathcal{S}_j$, because by construction the shield accounts for the worst-case scenario of the leaving request being rejected. Therefore, we have the joint state-action pair $(s_t, \bar{a}_t)$ safe at every time step $t$ for all agents.

Impact on Learning Performance. Similarly to centralized shielding, the factored shielding approach is agnostic to the choice of a MARL algorithm. We show empirically via our experiments that adding factored shields does not prevent MARL algorithms from converging. In addition, our experiments show that the factored shielding approach can be applied to examples where the synthesis of centralized shields is not feasible due to a large number of agents. While the two shielding approaches can both guarantee the safety during learning in all examples, factored shielding sometimes leads to less optimal policies than centralized shielding (e.g., due to the delay caused by agents switching shields).

6. Experiments

We implemented both the centralized shielding and factored shielding approaches in Python and used the Slugs tool ehlers2016slugs to synthesize shields via solving two-player safety games. We applied our prototype implementation to six benchmark problems in the grid world (Figure 7) and a cooperative navigation environment (Figure 8). We used two MARL algorithms, CQ-learning de2010learning and MADDPG lowe2017multi, in experiments to show that our shielding approaches are agnostic to the choice of MARL algorithms. The experiments were run on a computer with Intel i5 CPU and 16 GB of RAM. Each experiment was split into a training phase (linearly decreasing exploration) and an evaluation phase (immediately following the training phase and with an exploration rate of 5%). All experiments were conducted for 10 independent runs whose results were averaged to reduce the impact of outliers. The shields in all examples were synthesized within two minutes.

Figure 7. Maps of grid world examples adapted from melo2009learning. In each map, blue and orange agents aim to learn optimal policies to navigate from start (circles) to target (squares) while avoiding collisions.
Figure 8. Visualisations of cooperative navigation examples adapted from yang2019cm3. Four agents (blue, orange, green, and grey) aim to learn optimal policies to navigate from start (large circles) to target (small circles) while avoiding collisions.

Problem Setup. Figure 7 shows four maps of benchmark grid world examples adapted from melo2009learning. Each map has two agents, where each agent aims to learn its own optimal policy for navigating from the start position to the target position while trying to avoid collisions. Each agent has five possible actions: stay, up, down, left, right. Once an agent reaches its target position, it stays there. A learning episode ends when both agents have reached their target positions. Both agents have the same reward function, which assigns separate values to a valid move, a collision with a wall, a collision with the other agent, and arrival at the agent’s target position.

Figure 8 shows two benchmark cooperative navigation examples adapted from yang2019cm3. Each example has four agents represented as particles. The goal is for the agents to cooperate and reach their designated target positions as fast as possible while avoiding collisions. We discretize the fully continuous environment in yang2019cm3 by restricting agents to take only positions with a precision of . An agent receives a higher reward the closer it gets to its target position (i.e., the reward is the negation of the distance value), and a negative reward for any collision.

Figure 9. Collision variation experiment results (average of 10 evaluation episodes conducted after 1,000 training episodes, across 10 independent runs).

Collision Variation Experiments. We conducted a set of experiments using the grid world examples to highlight why relying on the reward function alone is not sufficient to achieve safety (i.e., collision avoidance in our examples). To prevent collisions, the traditional practice in reinforcement learning is to assign a negative reward (we refer to its absolute value as the cost of collision) whenever a collision occurs, and to increase the cost until the probability of a collision becomes negligible. Figure 9 shows the results of our experiments using independent Q-learning tan1993multi and CQ-learning de2010learning. The left side of the figure shows that, for independent Q-learning, increasing the cost of collision cannot guarantee that the evaluation phase will be completely collision-free; moreover, the increased cost of collision leads to a significant degradation in agent performance, measured by a larger number of steps to reach the target positions. In the MIT and SUNY maps, agents even learn policies that give up the primary task of reaching their target positions in order to avoid the high collision cost. The results of CQ-learning (shown on the right side of the figure) are better than those of independent Q-learning: the number of collisions drops quickly at a relatively low cost. However, CQ-learning cannot guarantee zero collisions either (see Table 1).

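| Map | Optimal steps | IQL steps | IQL reward | IQL collisions | CQ steps | CQ reward | CQ collisions | CQ with centralized shield steps | CQ with centralized shield reward | CQ with centralized shield collisions | CQ with factored shield steps | CQ with factored shield reward | CQ with factored shield collisions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISR | 5 | 30.35 | -10.20 | 20.30 | 8.66 | 89.53 | 0.40 | 7.03 | 93.85 | 0.00 | 7.31 | 93.74 | 0.00 |
| Pentagon | 10 | 46.58 | -19.17 | 11.60 | 10.96 | 88.96 | 0.20 | 12.08 | 88.44 | 0.00 | 13.20 | 84.88 | 0.00 |
| MIT | 18 | 20.84 | 77.33 | 0.00 | 42.93 | 30.38 | 0.90 | 28.38 | 73.94 | 0.00 | 29.96 | 37.96 | 0.00 |
| SUNY | 10 | 34.80 | -160.175 | 72.60 | 13.97 | 84.78 | 0.30 | 11.97 | 88.44 | 0.00 | 14.02 | 83.77 | 0.00 |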
Table 1. Results comparing the independent Q-learning, CQ-learning, CQ-learning with centralized and factored shields (average of 10 evaluation episodes conducted after 1,000 training episodes, across 10 independent runs).
Figure 10. Comparison of CQ-learning without shielding, with centralized or factored shielding based on the accumulated rewards per episode (average and standard deviation over 1,000 training episodes, across 10 independent runs).

Centralized Shielding Evaluation. We integrated CQ-learning with centralized shielding and applied it to the four grid world examples shown in Figure 7. The results in Table 1 show that centralized shielding guarantees collision-free learning in all cases. Moreover, in three out of four maps, CQ-learning with a centralized shield obtained better policies, with higher rewards and fewer steps to reach the target, than CQ-learning without shielding. Figure 10 shows that centralized shielding achieves the highest accumulated reward most of the time; moreover, the blue shaded area (standard deviation of no shielding) tends to stretch lower than the others, indicating that CQ-learning without shielding obtains lower rewards on average than with centralized shielding. The learning curves also show that centralized shielding does not prevent the learner from converging across the different examples. However, we failed to synthesize centralized shields for more than two agents in these grid maps, due to the scalability issues of shield synthesis.
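The "three out of four maps" comparison can be checked directly against Table 1. The short script below copies the steps and reward columns for CQ-learning without shielding and with a centralized shield; the dictionary layout and function name are illustrative choices, not part of the original implementation.

```python
# (steps, reward) pairs copied from Table 1.
table1 = {
    "ISR":      {"cq": (8.66, 89.53),  "centralized": (7.03, 93.85)},
    "Pentagon": {"cq": (10.96, 88.96), "centralized": (12.08, 88.44)},
    "MIT":      {"cq": (42.93, 30.38), "centralized": (28.38, 73.94)},
    "SUNY":     {"cq": (13.97, 84.78), "centralized": (11.97, 88.44)},
}

def shield_improves(row: dict) -> bool:
    """True if the shielded policy has both higher reward and fewer steps."""
    cq_steps, cq_reward = row["cq"]
    sh_steps, sh_reward = row["centralized"]
    return sh_reward > cq_reward and sh_steps < cq_steps

improved = [name for name, row in table1.items() if shield_improves(row)]
print(improved)  # maps where the centralized shield helps on both metrics
```

Pentagon is the exception: there the shielded policy needs slightly more steps and earns a slightly lower reward.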

Figure 11. Comparison of MADDPG without and with factored shielding based on the accumulated rewards per episode (average and standard deviation over 20,000 training episodes, across 10 independent runs).
| Example | MADDPG | MADDPG with factored shield |
|---|---|---|
| Cross | 207.20 | 0.00 |
| Antipodal | 14,419.20 | 0.00 |
Table 2. Total number of collisions over 20,000 training episodes for the cooperative navigation examples.

Factored Shielding Evaluation. First, we applied CQ-learning with factored shielding to the four grid world examples. We adopted a factorization scheme in which each shield monitors the agent actions occurring within one grid block of the map. The results in Table 1 show that CQ-learning with factored shielding guarantees zero collisions in all examples, while the learned policies are of similar quality to those obtained from CQ-learning with centralized shielding. Figure 10 shows that factored shielding achieves similar performance in terms of the accumulated rewards per episode, compared to centralized shielding and to no shielding. Due to the scalability limitation of CQ-learning, we can only consider two agents in these examples.
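A block-based factorization of this kind can be sketched as a mapping from agent positions to the shield responsible for them. The square block shape, the `block_size` parameter, and the function names are assumptions made for illustration. The sketch only shows which shield would monitor which agents at one time step, and does not model how agents moving between blocks are handled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pos = Tuple[int, int]  # (row, col)

def block_of(pos: Pos, block_size: int) -> Pos:
    """Index of the square grid block (and hence shield) containing pos."""
    return (pos[0] // block_size, pos[1] // block_size)

def agents_per_shield(positions: Dict[str, Pos],
                      block_size: int) -> Dict[Pos, List[str]]:
    """Group agents by the block-level shield that monitors them."""
    groups: Dict[Pos, List[str]] = defaultdict(list)
    for agent, pos in positions.items():
        groups[block_of(pos, block_size)].append(agent)
    return dict(groups)
```

For example, with `block_size=3`, agents at (0, 1) and (1, 0) fall under the same shield, while an agent at (4, 4) is monitored by a different one.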

Additionally, we integrated a different algorithm, MADDPG lowe2017multi, with factored shielding and applied it to the cooperative navigation examples shown in Figure 8 with a shield size where one unit of distance corresponds to in the environment. There are four agents in each example, which is more than the centralized shielding approach can handle. Table 2 shows that MADDPG with factored shielding guarantees zero collisions over the training period of 20,000 episodes for both examples. By contrast, MADDPG without shielding leads to about 207 and 14,419 collisions for the cross and antipodal examples, respectively. Figure 11 shows that in the cross example, MADDPG without and with factored shielding have comparable learning performance in terms of the accumulated rewards per episode; in the antipodal example, MADDPG without shielding achieves higher rewards than MADDPG with factored shielding, though this comes at the cost of more collisions. The learning curves in Figure 11 also show that factored shielding does not have a negative impact on the learner's ability to converge.

Summary. Our experiments demonstrate that the two shielding approaches can guarantee safety without compromising learning performance in terms of the convergence rate and the quality of the learned policies. Moreover, factored shielding is more scalable in the number of agents than centralized shielding.

7. Conclusion

In this paper, we present two shielding approaches that guarantee safety specifications expressed in linear temporal logic (LTL) during the learning process of MARL. The centralized shielding approach synthesizes a single shield that centrally monitors the joint actions of all agents and corrects only those actions that violate the LTL safety specification. However, the scalability of centralized shielding is restricted because the computational cost of shield synthesis grows exponentially with the number of agents. The factored shielding approach addresses this limitation by synthesizing multiple factored shields, each of which monitors a subset of agents at each time step. Our experimental results show that both shielding approaches can guarantee the safety specification (e.g., collision avoidance) during learning, and achieve learning performance (e.g., convergence speed, quality of learned policies) similar to that of non-shielded MARL. In our experiments, we manually devised the factorization schemes for the factored shielding approach based on problem-specific knowledge. In the future, we will explore the automated learning of efficient factorization schemes.

8. Acknowledgements

This work was supported in part by the Office of Naval Research Science of AI Program (grant N00014-18-1-2829). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research.