Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments

Action and observation delays are prevalent in real-world cyber-physical systems and may pose challenges for reinforcement learning design. Handling delays is particularly arduous in multi-agent systems, where the delay of one agent can spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game, which incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution, which allows agents to use extra information during training to ease the non-stationarity of multi-agent systems, without the need for a centralized controller during execution. Experiments are conducted in multi-agent particle environments, including cooperative communication, cooperative navigation, and competitive tasks. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Code and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.


I Introduction

Deep reinforcement learning (DRL) has made rapid progress in solving challenging problems, including games [30, 43] and robotic control [42, 10, 19]. Recently, DRL has been used in multi-agent scenarios since many important applications involve multiple agents cooperating or competing with each other, including multi-robot control [26], the emergence of multi-agent communication and language [12, 33, 45], multi-player games [37], etc. Learning in multi-agent scenarios is fundamentally more difficult than the single-agent case for many reasons, e.g., non-stationarity [17], the curse of dimensionality [8], multi-agent credit assignment [1], and global exploration [27]. For a more comprehensive review of DRL applied to multi-agent scenarios, readers are referred to [18].

Most DRL algorithms are evaluated in turn-based simulators like Gym [7] and MuJoCo [48], where the observation, action selection, and actuation of the agent are assumed to be instantaneous. Delays in observation and action, although prevalent in many areas of the real world such as robotic systems [5], communication networks [32], and parallel computing [16], may not be directly handled in this scheme. The issue is even worse in multi-agent scenarios, where the delay of one agent can spread to other coupled agents. For example, in tasks involving communication between agents, the action delay of a speaker gives rise to observation delays for all listeners subscribing to that speaker. Ignoring agent delays not only degrades the performance of the agents but can also induce instability in the system [14], which is a fatal threat in safety-critical systems like connected and autonomous vehicles (CAVs) [13]. For instance, it usually takes more than 0.4 seconds for a hydraulic automotive brake system to generate the desired deceleration [5], which can significantly affect the planning and control modules of CAVs [38].

The control community has proposed several methods to address the delay problem, such as the Smith predictor [4, 25], Artstein reduction [3, 34], finite spectrum assignment [24, 31], and $H_\infty$ controller design [28]. However, most of these methods depend on an accurate dynamic model of the system [35, 14], which is usually not available in real-world applications.

Recently, DRL has offered the potential to resolve this issue. The problems that DRL solves are usually modeled as Markov Decision Processes (MDPs). However, ignoring the delay of agents violates the Markov property and results in partially observable MDPs (POMDPs) with historical actions as hidden states. It is shown in [44] that solving POMDPs without estimating hidden states can lead to arbitrarily suboptimal policies. To retrieve the Markov property, the delayed system can be reformulated as an augmented MDP, as in [50]. Travnik et al. [49] noticed the problem with the traditional MDP framework but did not provide a theoretical analysis. Ramstedt & Pal [40] proposed an off-policy model-free algorithm known as Real-Time Actor-Critic to efficiently address the 1-step delayed problem. The delay issue can also be relieved in a model-based manner by learning a dynamics model to predict future states, as in [50, 11]. However, most of the previous works are limited to single-agent tasks and cannot directly handle the non-stationarity introduced by multiple agents. To the best of our knowledge, there has not been a general framework that uses model-free DRL for multi-agent tasks with delayed agents. As for model-based DRL, dealing with multi-agent tasks involves agents modeling agents, which introduces extra non-stationarity since the policies of all agents are continually updated [18].

In this paper, we propose a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free DRL. We first propose a general model for multi-agent delayed systems, Delay-Aware Markov Game (DA-MG), by augmenting standard Markov Game with agent delays. We prove the solidity of this new structure with the Markov reward process. We then develop a delay-aware training algorithm for DA-MGs that utilizes centralized training and decentralized execution to alleviate the non-stationary issue of multi-agent training: for each agent, we learn a centralized Q function which conditions on global information and a decentralized policy that only needs partial observation. We test our algorithm in both benchmark platforms and practical traffic scenarios.

The rest of the paper is organized as follows. We first review the preliminaries in Section II. In Section III, we formally define the Delay-Aware Markov Game (DA-MGs) and prove the solidity of this new structure with the Markov reward process. In Section IV, we introduce the proposed framework of delay-aware multi-agent reinforcement learning for DA-MGs with centralized training and decentralized execution. In Section V, we demonstrate the performance of the proposed algorithm in cooperative and competitive multi-agent particle environments, as well as traffic scenarios that require coordination of autonomous vehicles.

II Preliminaries

II-A Markov Decision Process and Markov Game

In the framework of reinforcement learning, the problem is often represented by a Markov Decision Process (MDP). The definition of a standard delay-free MDP is:

Definition 1.

A Markov Decision Process (MDP) is characterized by a tuple $\langle \mathcal{S}, \mathcal{A}, \rho, p, r \rangle$ with
(1) state space $\mathcal{S}$,    (2) action space $\mathcal{A}$,
(3) initial state distribution $\rho: \mathcal{S} \to \mathbb{R}$,
(4) transition distribution $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$,
(5) reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

The agent is represented by a policy $\pi$ that directs the action selection given the current observation. The goal of the agent is to find the optimal policy $\pi^*$ that maximizes its expected return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$, where $\gamma \in (0, 1]$ is a discount factor and $T$ denotes the time horizon.

A Markov game is a multi-agent extension of the MDP to partially observable environments. The definition of a standard delay-free Markov game is:

Definition 2.

A Markov Game (MG) for $N$ agents is characterized by a tuple $\langle \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, \{\mathcal{O}_i\}_{i=1}^{N}, \rho, p, \{r_i\}_{i=1}^{N} \rangle$ with
(1) a state space $\mathcal{S}$ describing all agents,
(2) a set of action spaces $\{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$,
(3) a set of observation spaces $\{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$,
(4) initial state distribution $\rho: \mathcal{S} \to \mathbb{R}$,
(5) transition distribution $p: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \times \mathcal{S} \to \mathbb{R}$,
(6) a reward function $r_i: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$ for each agent $i$.

Each agent $i$ receives an individual observation $o_i$ correlated with the state and uses a policy $\pi_i: \mathcal{O}_i \to \mathcal{A}_i$ to choose actions. The goal of each agent is to maximize its own expected return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r_{i,t}\big]$, where $\gamma$ is a discount factor and $T$ denotes the time horizon.

II-B Delay-Aware Markov Decision Process

The delay-free MDP formulation is problematic in the presence of agent delays and can lead to arbitrarily suboptimal policies [44]. To retrieve the Markov property, the Delay-Aware MDP (DA-MDP) is proposed:

Definition 3.

A Delay-Aware Markov Decision Process $\mathrm{DAMDP}(E, k) = \langle \mathcal{X}, \mathcal{A}, \boldsymbol{\rho}, \boldsymbol{p}, \boldsymbol{r} \rangle$ augments a Markov Decision Process $\mathrm{MDP}(E) = \langle \mathcal{S}, \mathcal{A}, \rho, p, r \rangle$, such that
(1) state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}^{k}$, where $k$ denotes the delay step,
(2) action space $\mathcal{A}$,
(3) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}(s_0, a_0, \ldots, a_{k-1}) = \rho(s_0) \prod_{i=0}^{k-1} \delta(a_i - c_i),$$
where $(c_0, \ldots, c_{k-1})$ denotes the initial action sequence,
(4) transition distribution
$$\boldsymbol{p}(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t, \boldsymbol{a}_t) = p(s_{t+1} \mid s_t, a_t) \prod_{i=1}^{k-1} \delta\big(a_{t+i}^{(\boldsymbol{x}_{t+1})} - a_{t+i}^{(\boldsymbol{x}_t)}\big)\, \delta\big(a_{t+k}^{(\boldsymbol{x}_{t+1})} - \boldsymbol{a}_t\big),$$
(5) reward function
$$\boldsymbol{r}(\boldsymbol{x}_t, \boldsymbol{a}_t) = r(s_t, a_t).$$

The state vector of the DA-MDP is augmented with the action sequence $(a_t, \ldots, a_{t+k-1})$ that will be executed in the next $k$ steps, where $k$ is the delay duration. The superscript of $a_{t+i}^{(\boldsymbol{x}_t)}$ means that the action is one element of $\boldsymbol{x}_t$, and the subscript denotes the time at which the action is executed. $\boldsymbol{a}_t$ is the action taken at time $t$ in a DA-MDP but executed at time $t+k$ due to the $k$-step action delay, i.e., $\boldsymbol{a}_t = a_{t+k}$.

Policies interacting with DA-MDPs, which also need to be augmented since the dimension of the state vectors has changed, are denoted by bold $\boldsymbol{\pi}$.

It should be noted that both action delays and observation delays can exist in real-world systems. However, it has been proven that, from the perspective of the learning agent, observation and action delays form the same mathematical problem, since both lead to a mismatch between the current observation and the executed action [21]. For simplicity, we focus on action delays in this paper; the algorithm and conclusions should generalize to systems with observation delays.

The above definition of the DA-MDP assumes that the delay time of the agent is an integer multiple of the time step of the system, which is often not true for real-world tasks like robotic control. To address this, Schuitema et al. [41] proposed an approximation approach that assumes a virtual effective action at each discrete system time step, which achieves first-order equivalence in linearizable systems with arbitrary delay time. With this approximation, the DA-MDP structure can be adapted to systems with arbitrary delay values.
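To make this concrete, a minimal sketch of such a discretization is given below. This is our illustration of the general idea only, not the exact formulation of [41]; the function names, the rounding convention, and the blending rule are assumptions.

import math

def split_delay(delay_time: float, dt: float):
    # Decompose a continuous action delay into whole control steps plus a fraction.
    # The integer part sizes the DA-MDP action buffer; within the step where a new
    # action "arrives", the old and new actions are each active for part of dt.
    k = int(delay_time // dt)              # whole delayed steps
    frac_old = (delay_time - k * dt) / dt  # share of that step still governed by the old action
    return k, frac_old

def virtual_effective_action(a_old, a_new, frac_old):
    # First-order surrogate: blend the two actions active within one timestep.
    return frac_old * a_old + (1.0 - frac_old) * a_new

if __name__ == "__main__":
    k, f = split_delay(delay_time=0.25, dt=0.1)
    print(k, f)                                    # 2 whole steps; old action covers ~50% of the next step
    print(virtual_effective_action(1.0, -1.0, f))  # blended control for that step (~0.0)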

II-C Multi-Agent Deep Deterministic Policy Gradient

Reinforcement learning has been used to solve Markov games. The simplest way is to directly train each agent with a single-agent reinforcement learning algorithm. However, this approach introduces a non-stationarity issue, since the learning agent is not aware of the evolution of the other agents, which are treated as part of the environment, thus violating the Markov property required for the convergence of most reinforcement learning algorithms [47, 46]. To alleviate the non-stationarity introduced by the multi-agent setting, several approaches have been proposed [36]. Centralized training with decentralized execution is one of the most widely used paradigms for multi-agent reinforcement learning. Lowe et al. [23] utilized this paradigm and proposed the multi-agent deep deterministic policy gradient (MADDPG) algorithm. The core idea of MADDPG is to learn a centralized action-value function (critic) and a decentralized policy (actor) for each agent. The centralized critic conditions on global information to alleviate the non-stationarity problem, while the decentralized actor conditions only on private observations to avoid the need for a centralized controller during execution.
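For illustration, a minimal PyTorch-style sketch of this structure is given below (layer sizes and class names are assumptions; this is not the reference implementation of [23]). The decentralized actor sees only its own observation, while the centralized critic consumes the concatenation of all observations and actions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps one agent's own observation to its action."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous action in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized Q function: conditions on all agents' observations and actions."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_acts):
        # all_obs / all_acts: concatenation over agents, shape [batch, total_dim]
        return self.net(torch.cat([all_obs, all_acts], dim=-1))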

A brief description of MADDPG is as follows. In a game with $N$ agents, let $\mu = \{\mu_1, \ldots, \mu_N\}$ be the set of all agent policies parameterized by $\{\theta_1, \ldots, \theta_N\}$, respectively. Based on the deterministic policy gradient (DPG) algorithm [22], we can write the gradient of the objective function for agent $i$ as:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\Big[\nabla_{\theta_i}\mu_i(o_i)\,\nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\Big] \qquad (1)$$

In Eq. 1, $Q_i^{\mu}$ is the centralized Q function (critic) for agent $i$ that conditions on global information, including the global state representation $\mathbf{x}$ and the actions of all agents $(a_1, \ldots, a_N)$. Under this setting, agents can have different reward functions since each $Q_i^{\mu}$ is learned separately, which means this algorithm can be used in both cooperative and competitive tasks.

Based on deep Q-learning [30], the centralized Q function $Q_i^{\mu}$ for agent $i$ is updated as:

$$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\Big[\big(Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N) - y\big)^2\Big], \qquad y = r_i + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N')\Big|_{a_j' = \mu_j'(o_j)},$$

where $\mu' = \{\mu_1', \ldots, \mu_N'\}$ is the set of target policies with soft-updated parameters used to stabilize training [30].
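The two updates above can be sketched as follows for one agent. This is our illustration with assumed container names (actors, critics, target_actors, optimizers) and batch layout, not the authors' released code.

import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95):
    # batch["obs"][j], batch["acts"][j], batch["next_obs"][j]: tensors for agent j
    # batch["rew"], batch["done"]: tensors for agent i, shape [batch_size]
    obs, acts = batch["obs"], batch["acts"]
    rew, next_obs, done = batch["rew"], batch["next_obs"], batch["done"]
    n_agents = len(actors)

    # Centralized critic update: TD target computed with target actors and critics.
    with torch.no_grad():
        next_acts = [target_actors[j](next_obs[j]) for j in range(n_agents)]
        q_next = target_critics[i](torch.cat(next_obs, -1), torch.cat(next_acts, -1)).squeeze(-1)
        y = rew + gamma * (1.0 - done) * q_next
    q = critics[i](torch.cat(obs, -1), torch.cat(acts, -1)).squeeze(-1)
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Decentralized actor update: deterministic policy gradient through the critic,
    # differentiating only through agent i's own action.
    acts_pg = [a.detach() for a in acts]
    acts_pg[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs, -1), torch.cat(acts_pg, -1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()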

III Delay-Aware Markov Game

Ignoring delays violates the Markov property in multi-agent scenarios and could lead to arbitrarily suboptimal policies. To retrieve the Markov property, we formally define the Delay-Aware Markov Game (DA-MG) as below:

Definition 4.

A Delay-Aware Markov Game $\mathrm{DAMG}(E, \boldsymbol{k}) = \langle \mathcal{X}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, \boldsymbol{\rho}, \boldsymbol{p}, \{\boldsymbol{r}_i\} \rangle$ with $N$ agents augments a Markov Game $\mathrm{MG}(E) = \langle \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, \rho, p, \{r_i\} \rangle$, such that
(1) state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$, where $k_i$ denotes the delay step of agent $i$,
(2) action space $\mathcal{A}_1 \times \cdots \times \mathcal{A}_N$,
(3) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \boldsymbol{\rho}\big(s_0, a_{1,0:k_1-1}, \ldots, a_{N,0:k_N-1}\big) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a_{i,j} - c_{i,j}),$$
where $\{c_{i,0:k_i-1}\}_{i=1}^{N}$ denotes the initial action sequences of all agents,
(4) transition distribution
$$\boldsymbol{p}(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t, \boldsymbol{a}_t) = p(s_{t+1} \mid s_t, a_{1,t}, \ldots, a_{N,t}) \prod_{i=1}^{N} \Big[ \prod_{j=1}^{k_i-1} \delta\big(a_{i,t+j}^{(\boldsymbol{x}_{t+1})} - a_{i,t+j}^{(\boldsymbol{x}_t)}\big) \, \delta\big(a_{i,t+k_i}^{(\boldsymbol{x}_{t+1})} - \boldsymbol{a}_{i,t}\big) \Big],$$
(5) reward function
$$\boldsymbol{r}_i(\boldsymbol{x}_t, \boldsymbol{a}_t) = r_i(s_t, a_{1,t}, \ldots, a_{N,t}).$$

DA-MGs have an augmented state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$, where $k_i$ denotes the delay step of agent $i$. $a_{i,t+j}^{(\boldsymbol{x}_t)}$ is one element of $\boldsymbol{x}_t$ and denotes the action of agent $i$ executed at time $t+j$. $\boldsymbol{a}_t$ is the action vector taken at time $t$ in a DA-MG; its $i$-th element $\boldsymbol{a}_{i,t}$ is executed by agent $i$ at time $t+k_i$ due to the $k_i$-step action delay, i.e., $\boldsymbol{a}_{i,t} = a_{i,t+k_i}$. Policies interacting with DA-MGs, which also need to be augmented since the dimension of the state vectors has changed, are denoted by bold $\boldsymbol{\pi}$.

To prove the solidity of Definition 4, we need to show that a Markov game with multi-step action delays can be converted to a regular Markov game by state augmentation (DA-MG). We prove the equivalence of the two by comparing their corresponding Markov Reward Processes (MRPs). The delay-free MRP for a Markov Game is:

Definition 5.

A Markov Reward Process $\mathrm{MRP}(E, \boldsymbol{\pi}) = \langle \mathcal{S}, \rho, \kappa, \bar{r} \rangle$ can be derived from a Markov Game $\mathrm{MG}(E) = \langle \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, \rho, p, \{r_i\} \rangle$ with a set of policies $\boldsymbol{\pi} = \{\pi_1, \ldots, \pi_N\}$, such that
$$\kappa(s_{t+1} \mid s_t) = \int_{\mathcal{A}} p(s_{t+1} \mid s_t, a_{1,t}, \ldots, a_{N,t}) \prod_{i=1}^{N} \pi_i(a_{i,t} \mid o_{i,t}) \, d\boldsymbol{a}_t, \qquad \bar{r}_i(s_t) = \int_{\mathcal{A}} r_i(s_t, a_{1,t}, \ldots, a_{N,t}) \prod_{i=1}^{N} \pi_i(a_{i,t} \mid o_{i,t}) \, d\boldsymbol{a}_t,$$
where $\kappa$ is the state transition distribution and $\bar{r}_i$ is the state-reward function of the MRP for agent $i$, and $E$ is the original environment without delays.

In the delay-free framework, at each time step, the agents select actions based on their current observations. The actions will immediately be executed in the environment to generate the next observations. However, if the action delay exists, the interaction manner between the environment and the agents changes, and a different MRP will be generated. An illustration of the delayed interaction between agents and the environment is shown in Fig. 1. The agents interact with the environment not directly but through an action buffer.

Fig. 1: Interaction manner between delayed agents and the environment. The agents interact with the environment not directly but through an action buffer. At time $t$, agents get the observation from the environment as well as their future action sequences from the action buffer. The agents then decide their future actions and store them in the action buffer. The action buffer then pops the actions to be executed in the environment.
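In code, this interaction manner can be sketched as a thin wrapper around a delay-free multi-agent environment. This is a minimal illustration under assumed interfaces (in particular, that env.step takes a list of per-agent actions), not the implementation in the released repository.

from collections import deque

class DelayedMultiAgentEnv:
    """Route agent actions through per-agent action buffers before execution."""

    def __init__(self, env, delay_steps, init_actions):
        # delay_steps[i] = k_i; init_actions[i] seeds agent i's buffer with k_i actions.
        self.env = env
        self.buffers = [deque(init_actions[i][:k]) for i, k in enumerate(delay_steps)]

    def reset(self):
        obs = self.env.reset()
        # Each agent observes the environment state plus its planned action sequence.
        return [(o, list(buf)) for o, buf in zip(obs, self.buffers)]

    def step(self, new_actions):
        executed = []
        for buf, a in zip(self.buffers, new_actions):
            executed.append(buf.popleft())  # action whose delay has elapsed is executed now
            buf.append(a)                   # freshly selected action is executed k_i steps later
        obs, rewards, dones, info = self.env.step(executed)
        aug_obs = [(o, list(buf)) for o, buf in zip(obs, self.buffers)]
        return aug_obs, rewards, dones, info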

Based on the delayed interaction manner between the agents and the environment, the Delay-Aware MRP (DA-MRP) is defined as below.

Definition 6.

A Delay-Aware Markov Reward Process $\mathrm{DAMRP}(E, \boldsymbol{\pi}, \boldsymbol{k}) = \langle \mathcal{X}, \boldsymbol{\rho}, \boldsymbol{\kappa}, \bar{\boldsymbol{r}} \rangle$ with $N$ agents can be derived from a Markov Game $\mathrm{MG}(E)$ with a set of policies $\boldsymbol{\pi} = \{\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_N\}$ and a set of delay steps $\boldsymbol{k} = \{k_1, \ldots, k_N\}$, such that
(1) state space
$$\mathcal{X} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N},$$
(2) initial state distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a_{i,j} - c_{i,j}),$$
where $\{c_{i,0:k_i-1}\}_{i=1}^{N}$ denotes the initial action sequences of all agents,
(3) state transition distribution
$$\boldsymbol{\kappa}(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t) = p(s_{t+1} \mid s_t, a_{1,t}, \ldots, a_{N,t}) \prod_{i=1}^{N} \Big[ \prod_{j=1}^{k_i-1} \delta\big(a_{i,t+j}^{(\boldsymbol{x}_{t+1})} - a_{i,t+j}^{(\boldsymbol{x}_t)}\big) \, \boldsymbol{\pi}_i\big(a_{i,t+k_i}^{(\boldsymbol{x}_{t+1})} \mid \boldsymbol{x}_{i,t}\big) \Big],$$
(4) state-reward function
$$\bar{\boldsymbol{r}}_i(\boldsymbol{x}_t) = r_i(s_t, a_{1,t}, \ldots, a_{N,t}).$$

The input of the policy $\boldsymbol{\pi}_i$ for agent $i$ at time $t$ has two parts: $\boldsymbol{x}_{i,t} = (o_{i,t}, a_{i,t:t+k_i-1})$, where $o_{i,t}$ is the observation of the environment and $a_{i,t:t+k_i-1}$ is a planned action sequence for agent $i$ of length $k_i$ that will be executed from the current time step: $a_{i,t:t+k_i-1} = (a_{i,t}, \ldots, a_{i,t+k_i-1})$.

With the above definitions, we are ready to prove that DA-MG is a correct augmentation of MG with delay, as stated in Theorem 1.

Theorem 1.

A set of augmented policies $\boldsymbol{\pi}$ interacting with $\mathrm{DAMG}(E, \boldsymbol{k})$ in the delay-free manner produces the same Markov Reward Process as $\boldsymbol{\pi}$ interacting with $\mathrm{MG}(E)$ with action delays $\boldsymbol{k}$ for the agents, i.e.,

$$\mathrm{MRP}\big(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi}\big) = \mathrm{DAMRP}(E, \boldsymbol{\pi}, \boldsymbol{k}). \qquad (2)$$
Proof.

For any $E$, $\boldsymbol{\pi}$ and $\boldsymbol{k}$, we need to prove that the above two MRPs are the same. Referring to Def. 4 and 5, for $\mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi})$ we have
(1) state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}_1^{k_1} \times \cdots \times \mathcal{A}_N^{k_N}$,
(2) initial distribution
$$\boldsymbol{\rho}(\boldsymbol{x}_0) = \rho(s_0) \prod_{i=1}^{N} \prod_{j=0}^{k_i-1} \delta(a_{i,j} - c_{i,j}),$$
(3) transition kernel
$$\boldsymbol{\kappa}(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t) = \int_{\mathcal{A}} \boldsymbol{p}(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t, \boldsymbol{a}_t) \prod_{i=1}^{N} \boldsymbol{\pi}_i(\boldsymbol{a}_{i,t} \mid \boldsymbol{x}_{i,t}) \, d\boldsymbol{a}_t = p(s_{t+1} \mid s_t, a_{1,t}, \ldots, a_{N,t}) \prod_{i=1}^{N} \Big[ \prod_{j=1}^{k_i-1} \delta\big(a_{i,t+j}^{(\boldsymbol{x}_{t+1})} - a_{i,t+j}^{(\boldsymbol{x}_t)}\big) \, \boldsymbol{\pi}_i\big(a_{i,t+k_i}^{(\boldsymbol{x}_{t+1})} \mid \boldsymbol{x}_{i,t}\big) \Big],$$
(4) state-reward function
$$\bar{\boldsymbol{r}}_i(\boldsymbol{x}_t) = r_i(s_t, a_{1,t}, \ldots, a_{N,t}).$$
Since the expanded terms of $\mathrm{MRP}(\mathrm{DAMG}(E, \boldsymbol{k}), \boldsymbol{\pi})$ match the corresponding terms of $\mathrm{DAMRP}(E, \boldsymbol{\pi}, \boldsymbol{k})$ (Def. 6), Eq. 2 holds. ∎

Fig. 2: The framework of Delay-Aware Multi-Agent Reinforcement Learning (DAMARL). We adopt the paradigm of centralized training with decentralized execution: for each agent $i$, we learn a centralized action-value function (critic) that conditions on global information and a decentralized policy (actor) that only needs partial observation. The input of agent $i$'s policy has two parts: $\boldsymbol{x}_{i,t} = (o_{i,t}, a_{i,t:t+k_i-1})$, where $o_{i,t}$ is the observation of the environment and $a_{i,t:t+k_i-1}$ is a planned action sequence that will be executed from the current time step.

IV Delay-Aware Multi-Agent Reinforcement Learning

Theorem 1 shows that instead of solving MGs with delays, we can alternatively solve the corresponding DA-MGs directly with DRL. Based on this finding, we propose the framework of Delay-Aware Multi-Agent Reinforcement Learning (DAMARL). To alleviate the non-stationarity issue introduced by the multi-agent setting, we adopt the paradigm of centralized training with decentralized execution: for each agent, we learn a centralized Q function (critic) that conditions on global information and a decentralized policy (actor) that only needs partial observation. An illustration of the framework is shown in Fig. 2. The main advantages of this structure are as follows:

  • The non-stationary problem is alleviated by centralized training since the transition distribution of the environment is stationary when knowing all agent actions.

  • A centralized controller to direct all agents, which is not realistic to deploy in many real-world multi-agent scenarios, is not needed with decentralized policies.

  • We learn an individual Q function for each agent, allowing them to have different reward functions so that we can adopt this algorithm in both cooperative and competitive multi-agent tasks.

  • Individual Q functions and policies allow agents to have different delay steps.

  Initialize the experience replay buffer $\mathcal{D}$
  for episode = 1 to $M$ do
     Initialize the action noise $\mathcal{N}_t$ and the action buffer $\mathcal{F}$
     Get initial state $\boldsymbol{x}$
     for $t$ = 1 to $T$ do
        for agent $i$ = 1 to $N$ do
           get $\boldsymbol{x}_{i} = (o_{i,t}, a_{i,t:t+k_i-1})$ from the environment and $\mathcal{F}$
           select action $a_{i,t+k_i} = \boldsymbol{\mu}_i(\boldsymbol{x}_{i}) + \mathcal{N}_t$
        end for
        Store actions $(a_{1,t+k_1}, \ldots, a_{N,t+k_N})$ in $\mathcal{F}$
        Pop actions $\boldsymbol{a} = (a_{1,t}, \ldots, a_{N,t})$ from $\mathcal{F}$ and execute them
        get the reward $r$ and the new state $\boldsymbol{x}'$
        Store $(\boldsymbol{x}, \boldsymbol{a}, r, \boldsymbol{x}')$ in $\mathcal{D}$
        $\boldsymbol{x} \leftarrow \boldsymbol{x}'$
        for agent $i$ = 1 to $N$ do
           Randomly sample a batch of $S$ samples $(\boldsymbol{x}^j, \boldsymbol{a}^j, r^j, \boldsymbol{x}'^j)$ from $\mathcal{D}$
           Set $y^j = r_i^j + \gamma \, Q_i^{\boldsymbol{\mu}'}(\boldsymbol{x}'^j, a_1', \ldots, a_N')\big|_{a_l' = \boldsymbol{\mu}_l'(\boldsymbol{x}_l'^j)}$
           Update centralized critics by minimizing $\mathcal{L}(\theta_i) = \frac{1}{S}\sum_j \big(y^j - Q_i^{\boldsymbol{\mu}}(\boldsymbol{x}^j, a_1^j, \ldots, a_N^j)\big)^2$
           Update decentralized actors by $\nabla_{\theta_i} J \approx \frac{1}{S}\sum_j \nabla_{\theta_i}\boldsymbol{\mu}_i(\boldsymbol{x}_i^j)\,\nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\boldsymbol{x}^j, a_1^j, \ldots, a_i, \ldots, a_N^j)\big|_{a_i = \boldsymbol{\mu}_i(\boldsymbol{x}_i^j)}$
        end for
        Soft update of target networks for each agent $i$: $\theta_i' \leftarrow \tau \theta_i + (1-\tau)\theta_i'$
     end for
  end for
Algorithm 1 DAMA-DDPG
(a) Cooperative communication
(b) Cooperative navigation
(c) Predator-prey
Fig. 3: Tasks in the multi-agent particle environment.

With the framework of DAMARL, we can adapt any DRL algorithm with an actor-critic structure [46], such as Advantage Actor-Critic (A2C) [29], Deep Deterministic Policy Gradient (DDPG) [22], and Soft Actor-Critic (SAC) [15], to a delay-aware algorithm. In this paper, we equip the multi-agent version of DDPG with delay-awareness and propose delay-aware multi-agent deep deterministic policy gradient (DAMA-DDPG). Concretely, in a game with $N$ agents, let $\boldsymbol{\mu} = \{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_N\}$ be the set of all augmented agent policies parameterized by $\{\theta_1, \ldots, \theta_N\}$, respectively. Based on the deterministic policy gradient (DPG) algorithm [22], we can write the gradient of the objective function for agent $i$ as:

$$\nabla_{\theta_i} J(\boldsymbol{\mu}_i) = \mathbb{E}_{\boldsymbol{x}, a \sim \mathcal{D}}\Big[\nabla_{\theta_i}\boldsymbol{\mu}_i(\boldsymbol{x}_i)\,\nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\boldsymbol{x}, a_1, \ldots, a_N)\big|_{a_i = \boldsymbol{\mu}_i(\boldsymbol{x}_i)}\Big] \qquad (3)$$

The structure of Eq. 3 is in conformity with the original deterministic policy gradient (Eq. 1). However, the policies $\boldsymbol{\mu}_i$, states, and observations are augmented based on the DA-MG proposed in Def. 4. In Eq. 3, $Q_i^{\boldsymbol{\mu}}$ is the centralized Q function (critic) for agent $i$ that conditions on global information, including the global state representation $\boldsymbol{x}$ and the actions of all agents $(a_1, \ldots, a_N)$. In the delay-aware case, $\boldsymbol{x}$ consists of the observations of all agents as well as the action sequences of all agents in the near future, $\boldsymbol{x} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N)$, where $\boldsymbol{x}_i$ is the input of the policy of agent $i$ and has two parts: $\boldsymbol{x}_{i,t} = (o_{i,t}, a_{i,t:t+k_i-1})$. Here, $o_{i,t}$ is the observation of the environment by the $i$-th agent, and $a_{i,t:t+k_i-1}$ is a planned action sequence for agent $i$ of length $k_i$ that will be executed from the current time step. For example, at time $t$, $a_{i,t:t+k_i-1} = (a_{i,t}, \ldots, a_{i,t+k_i-1})$. The planned action sequence is fetched from an action buffer that serves as a bridge between the agents and the environment, as shown in Fig. 1.
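A minimal sketch of how these augmented inputs could be assembled from the raw observation and the action buffer is given below (our illustration; array handling and function names are assumptions, not part of the released code).

import numpy as np

def policy_input(obs_i, planned_actions_i):
    # x_{i,t} = (o_{i,t}, a_{i,t:t+k_i-1}): the agent's observation followed by its queued actions.
    parts = [np.asarray(obs_i, dtype=np.float32).ravel()]
    parts += [np.asarray(a, dtype=np.float32).ravel() for a in planned_actions_i]
    return np.concatenate(parts)

def critic_input(all_obs, all_planned, all_new_actions):
    # Global information for the centralized critic: every agent's augmented
    # observation plus every agent's freshly selected action.
    xs = [policy_input(o, p) for o, p in zip(all_obs, all_planned)]
    acts = [np.asarray(a, dtype=np.float32).ravel() for a in all_new_actions]
    return np.concatenate(xs + acts)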

The replay buffer $\mathcal{D}$ records historical experiences of all agents. The centralized Q function $Q_i^{\boldsymbol{\mu}}$ for agent $i$ is updated as:

$$\mathcal{L}(\theta_i) = \mathbb{E}_{\boldsymbol{x}, a, r, \boldsymbol{x}'}\Big[\big(Q_i^{\boldsymbol{\mu}}(\boldsymbol{x}, a_1, \ldots, a_N) - y\big)^2\Big], \qquad y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\boldsymbol{x}', a_1', \ldots, a_N')\Big|_{a_j' = \boldsymbol{\mu}_j'(\boldsymbol{x}_j')},$$

where $\boldsymbol{\mu}' = \{\boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_N'\}$ is the set of augmented target policies with soft-updated parameters used to stabilize training.

The description of the full algorithm is shown in Algorithm 1.

V Experiment

To show the performance of DAMARL, we adopt two environment platforms. One is the multi-agent particle environment platform proposed in [23] (https://github.com/openai/multiagent-particle-envs), where the agents are particles that move on a two-dimensional plane to achieve cooperative or competitive tasks. The other is the traffic platform CARLA [9] (http://carla.org/), where we construct multi-agent traffic scenarios that require coordination of all road users.

V-A Multi-Agent Particle Environment

The multi-agent particle environment is composed of several agents and landmarks in a two-dimensional world with continuous state space. Agents can move in the environment and send out messages that are broadcast to other agents. Some tasks are cooperative, where all agents share a common reward function, while others are competitive or mixed, where agents have opposing or different reward functions. In some tasks, agents need to communicate to achieve the goal, while in other tasks agents can only perform movements in the two-dimensional plane. Details of the environments used are provided below.

Cooperative communication. Two cooperating agents are involved in this task: a speaker and a listener. They are spawned in an environment with three landmarks of different colors. The goal of the listener is to navigate to a landmark of a particular color. However, the listener can only observe the relative positions and colors of the landmarks, not which landmark it must navigate to. In contrast, the speaker knows the color of the goal landmark and can send out a message at each time step, which is heard by the listener. Therefore, to finish the cooperative task, the speaker must learn to output the correct landmark color.

Cooperative navigation. In this environment, three agents must collaborate to ‘cover’ all three landmarks in the environment by moving. In addition, these agents occupy a large physical space and are penalized when they collide with each other.

Predator-prey. In this environment, three slower cooperative predators must catch up with a faster prey in a randomly generated environment, with two large landmarks blocking the way. Each time a collaborating predator collides with the prey, the predators will be rewarded and the prey will be punished. The agents can observe the relative position and speed of other agents as well as the positions of the landmarks.

An illustration of the tasks introduced above is shown in Fig. 3.

(a) Cooperative communication ($\Delta t$ = 0.1 s)
(b) Cooperative navigation ($\Delta t$ = 0.2 s)
(c) Predator-prey with fixed prey policy ($\Delta t$ = 0.2 s)
Fig. 4: Effect of delay-awareness. DAMA is the proposed algorithm that utilizes delay-awareness as well as multi-agent centralized training. DAMA clearly outperforms the other algorithms in all three tasks, while the vanilla DDPG has the worst performance.
Delay time (s) | Algorithm of prey | Algorithm of predators: DAMA / MA / DA / DDPG | Improvement of predators: DAMA / MA / DA
0.2 | DAMA | 10.3±2.1 / 9.6±2.0 / 8.5±1.9 / 8.1±1.8 | 2.2 / 1.5 / 0.4
0.2 | MA   | 12.1±2.3 / 10.1±2.1 / 9.0±2.1 / 8.5±1.8 | 3.6 / 1.6 / 0.5
0.2 | DA   | 15.8±2.9 / 13.2±2.7 / 9.7±2.0 / 8.8±1.8 | 7.0 / 4.4 / 0.9
0.2 | DDPG | 17.0±3.2 / 14.9±2.9 / 11.4±2.2 / 9.1±1.9 | 7.9 / 5.8 / 2.3
0.4 | DAMA | 10.1±1.9 / 8.6±1.7 / 8.2±1.7 / 7.3±1.6 | 2.8 / 1.3 / 0.9
0.4 | MA   | 14.2±2.8 / 9.5±2.0 / 9.0±1.8 / 7.9±1.8 | 6.3 / 1.6 / 1.1
0.4 | DA   | 14.9±2.9 / 10.7±2.2 / 9.5±1.9 / 8.2±1.8 | 6.7 / 2.5 / 1.3
0.4 | DDPG | 17.6±3.3 / 14.5±2.9 / 13.1±2.5 / 8.8±1.9 | 8.8 / 5.7 / 4.3
0.6 | DAMA | 9.6±1.9 / 6.9±1.7 / 7.6±1.7 / 5.9±1.6 | 3.7 / 1.0 / 1.7
0.6 | MA   | 16.0±3.1 / 8.8±2.0 / 12.5±2.4 / 7.4±1.7 | 7.6 / 1.4 / 5.1
0.6 | DA   | 13.7±2.9 / 7.5±1.7 / 9.3±1.9 / 6.2±1.6 | 7.5 / 1.3 / 3.1
0.6 | DDPG | 19.2±3.8 / 13.5±2.8 / 16.8±3.4 / 8.3±1.8 | 10.9 / 5.2 / 8.5
TABLE I: Number of touches in predator-prey ($\Delta t$ = 0.2 s)
(a) Cooperative communication, $\Delta t$ = 0.1 s
(b) Cooperative communication, $\Delta t$ = 0.2 s
(c) Cooperative navigation, $\Delta t$ = 0.1 s
(d) Cooperative navigation, $\Delta t$ = 0.2 s
Fig. 5: Performance of the delay-aware (DAMA) and delay-unaware (MA) algorithms in cooperative communication and cooperative navigation scenarios with different agent delay times. As the delay time increases, the performance of both DAMA and MA degrades. In most cases, DAMA maintains higher performance than MA, and the performance gap becomes more significant as the delay time increases.

V-A1 Effect of Delay-Awareness

To show the effect of delay-awareness, we first test our algorithm on cooperative tasks, including cooperative communication, cooperative navigation, and predator-prey, where we adopt a fixed prey policy and only train the cooperative predators. To support the discrete actions used for message communication in the cooperative communication task, we use the Gumbel-Softmax estimator [20]. Unless specified otherwise, our policies and Q functions are parameterized by two-layer neural networks with 128 units per layer. Each experiment is run with 5 random seeds. The baseline algorithms are DDPG and MA-DDPG, which use decentralized and centralized training, respectively, without delay-awareness. We test the proposed delay-aware algorithm DAMA-DDPG (Algorithm 1). We also adapt DAMA-DDPG to Delay-Aware DDPG (DA-DDPG), which uses decentralized training, and test it for comparison.
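For reference, the straight-through Gumbel-Softmax trick can be invoked directly through torch.nn.functional.gumbel_softmax; the snippet below is an illustrative sketch of sampling a discrete speaker message differentiably, not the paper's training code.

import torch
import torch.nn.functional as F

logits = torch.randn(1, 3, requires_grad=True)          # speaker scores over 3 candidate messages
message = F.gumbel_softmax(logits, tau=1.0, hard=True)   # one-hot in the forward pass, soft gradients
score_per_message = torch.tensor([0.2, 1.0, -0.5])       # placeholder downstream objective
loss = -(message * score_per_message).sum()
loss.backward()                                          # gradients flow back into the logits
print(message, logits.grad)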

For simplicity, we omit ’-DDPG’ when referring to an algorithm throughout the experiment section, since our framework can be adapted to any DRL algorithm with the actor-critic structure. For example, ’DAMA-DDPG’ is shortened to ’DAMA’.

The performance of the aforementioned algorithms in the cooperative tasks, indicated by episodic reward, is shown in Fig. 4. We use $\Delta t$ to denote the simulation timestep, which is 0.1 seconds for cooperative communication and 0.2 seconds for cooperative navigation and predator-prey. The agents have a 1-step action delay in all tasks ($k_i = 1$ for all $i$). We vary the simulation timestep as well as the agent delay time later in this section. It is clearly shown that DAMA, with delay-awareness and centralized training, outperforms the other algorithms in all three tasks, while the vanilla DDPG has the worst performance. The result of cooperative communication (Fig. 4(a)) shows the importance of centralized training. In this task, the action of the speaker significantly affects the behavior of the listener by setting the goal, so a centralized Q function conditioned on all agent actions greatly stabilizes training. The advantage of delay-awareness is more significant in the highly dynamic predator-prey task, where the prey runs fast to escape from the predators (Fig. 4(c)).

We also test our algorithm in a competitive task: predator-prey. To compare performance, we train predators and prey with different algorithms and let them compete with each other. The simulation timestep is set to 0.2 seconds. The delay time of the agents is 0.2 s, 0.4 s, or 0.6 s in each set of experiments. We evaluate the performance of the aforementioned algorithms by the number of prey touches by predators per episode. Since the goal of the predators is to touch the prey as many times as possible, a higher number of touches indicates a stronger predator policy against a weaker prey policy. The results on the predator-prey task are shown in Table I. All agents are trained for 30,000 episodes. DAMA clearly performs best against the other algorithms: for any delay time, the DAMA predators get the highest touch count against the DDPG prey, while the DDPG predators get the lowest touch count against the DAMA prey. Also, as the delay time increases, delay-awareness becomes more important than multi-agent centralized training, as shown in the last column of Table I. When the delay time is relatively small (0.2 s), the improvement of the predator policies from multi-agent centralized training is larger than that from delay-awareness. When the delay time grows to 0.6 s, however, the situation is reversed and delay-awareness has the larger impact on the improvement of the predator policies.

(a) Unsignalized intersection
(b) Parking lot exit
Fig. 6: Traffic Scenarios

V-A2 Delay Sensitivity

Ignoring the delay of agents violates the Markov property and can lead to arbitrarily suboptimal policies [44]. However, delay-unaware algorithms may still achieve acceptable performance in certain tasks, especially when the delay is small [41]. On the other hand, though maintaining theoretical optimality, delay-aware algorithms can suffer from performance degradation caused by the augmented state space. This phenomenon leads to a trade-off between precision and efficiency.

To show the value of delay-awareness, we compare the performance of the aforementioned delay-aware (DAMA) and delay-unaware (MA) algorithms under different agent delay times. We perform experiments in the cooperative communication and cooperative navigation scenarios. Results are shown in Fig. 5. The agent delay time is varied within each sub-figure. The simulation timestep is 0.1 seconds in Fig. 5(a) and 5(c) and 0.2 seconds in Fig. 5(b) and 5(d). As the delay time increases, the performance of both the delay-aware and delay-unaware algorithms degrades. In most cases, the delay-aware algorithm maintains higher performance than the delay-unaware algorithm, and the performance gap becomes more significant as the delay time increases. The only exception is in Fig. 5(c) when the delay time is less than 0.2 seconds, where the delay-unaware algorithm performs slightly better. We hypothesize that the primary reason is that, when the delay is small, the effect of the expanded state space on training outweighs the model error introduced by delay-unawareness.

(a) Success rate in unsignalized intersection
(b) Crash rate in unsignalized intersection
(c) Success rate in parking lot exit
(d) Crash rate in parking lot exit
Fig. 7: Success rate and crash rate of vehicle agents trained with delay-aware and delay-unaware algorithms in the traffic scenarios. Delay-awareness drastically improves the performance of the vehicles, in both success rate and crash rate. Delay-unaware agents suffer from large model error and are not able to learn good policies.

V-B Traffic Environment

To show the practical value of delay-awareness for real-world applications, we construct two traffic scenarios that require coordination of autonomous vehicles. The details are described below.

Unsignalized intersection. This scenario consists of four vehicles coming from the four directions (north, south, west, east) of the intersection. The common goal of the vehicles is to take a left turn at the intersection. Vehicles are spawned at a random distance from the intersection center with a given initial velocity. They can observe the positions and velocities of other vehicles as well as their own, and decide their longitudinal acceleration based on their policies. The intersection is unsignalized, so the vehicles need to coordinate to decide the passing order. Vehicles are positively rewarded if all of them successfully finish the left turn and penalized if any collision happens.

Parking lot exit. This scenario consists of three controlled vehicles inside a parking lot. The goal of the vehicles is to navigate to the exit without collisions, which requires them to cooperate with each other. This scenario is extracted from a real-world application of self-driving vehicles: Autonomous Valet Parking (AVP). Vehicles are spawned at their initial positions with Gaussian random noise, and their initial velocities are 5 m/s. They can observe the positions and velocities of other vehicles as well as their own, and decide their longitudinal acceleration based on their observations. Vehicles are positively rewarded if all of them successfully drive out of the parking lot and penalized if any collision happens.
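As an illustration of this shared reward scheme in both traffic tasks, a hypothetical sketch is given below; the exact reward values and any shaping terms used in the experiments are not specified here and are our assumptions.

def shared_traffic_reward(all_finished, any_collision,
                          success_bonus=10.0, collision_penalty=-10.0, step_cost=-0.01):
    # All vehicles share one reward signal: a bonus once every vehicle has completed
    # its maneuver and a penalty if any collision happens. The small per-step cost
    # is our own assumption, added only to encourage finishing quickly.
    if any_collision:
        return collision_penalty
    if all_finished:
        return success_bonus
    return step_cost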

An illustration of the two scenarios is shown in Fig. 6.

In the traffic scenarios, the delay of a vehicle mainly includes the communication delay $\tau_1$, the sensor delay $\tau_2$, the time for decision making $\tau_3$, and the actuator delay $\tau_4$. The total delay is the sum of all components: $\tau = \tau_1 + \tau_2 + \tau_3 + \tau_4$. The literature shows that the communication delay of vehicle-to-vehicle (V2V) systems with dedicated short-range communication (DSRC) devices can be minimal, with a mean value of 1.1 ms under good conditions [6, 2]; the delay of sensors (cameras, LIDARs, radars, GPS, etc.) is usually between 0.1 and 0.3 seconds [51]; the time for decision making depends on the complexity of the algorithm and is usually minimal; and the actuator delay of the vehicle powertrain and hydraulic brake systems is usually between 0.3 and 0.6 seconds [5, 39]. Adding these together, a conservative estimate of the total delay time of a vehicle is roughly between 0.4 and 0.8 seconds, assuming no communication loss. Thus, we test the delay-aware and delay-unaware algorithms under delays of 0.4 and 0.8 seconds in both tasks.

There are three possible outcomes for each experiment: success, crash, or stuck. We evaluate the performance of the learned policies by the success rate and the crash rate. The results are shown in Fig. 7. Delay-awareness drastically improves the performance of the vehicles, in both success rate and crash rate. The delay-aware agents successfully learn how to cooperate and finish both tasks without crashes, while the delay-unaware agents suffer from the large model error introduced by delay and are not able to learn good policies: under a 0.8-second delay, the success rate is limited to less than 30% for the unsignalized intersection task and 40% for the parking lot exit task. These results are reasonable: at a velocity of 10 m/s, a 0.8-second delay corresponds to a position error of 8 m, which injects large uncertainty and bias into the agents' understanding of the state. With such biased observations, the agents are unable to learn good policies to finish the tasks.

VI Conclusion

In this work, we propose a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define a general model for multi-agent delayed systems, the Delay-Aware Markov Game, by augmenting the standard Markov Game with agent delays while maintaining the Markov property. The solidity of this new structure is proved with the Markov reward process. For agent training, we propose a delay-aware algorithm that adopts the paradigm of centralized training with decentralized execution, and refer to it as delay-aware multi-agent reinforcement learning. Experiments are conducted in the multi-agent particle environment as well as a practical traffic simulator with autonomous vehicles. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay.

References

  • [1] A. K. Agogino and K. Tumer (2004) Unifying temporal and structural credit assignment problems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 980–987. Cited by: §I.
  • [2] S. Ammoun, F. Nashashibi, and C. Laurgeau (2006) Real-time crash avoidance system on crossroads based on 802.11 devices and gps receivers. In 2006 IEEE Intelligent Transportation Systems Conference, pp. 1023–1028. Cited by: §V-B.
  • [3] Z. Artstein (1982) Linear systems with delayed controls: a reduction. IEEE Transactions on Automatic control 27 (4), pp. 869–879. Cited by: §I.
  • [4] K. J. Astrom, C. C. Hang, and B. Lim (1994) A new smith predictor for controlling a process with an integrator and long dead-time. IEEE transactions on Automatic Control 39 (2), pp. 343–345. Cited by: §I.
  • [5] F. P. Bayan, A. D. Cornetto, A. Dunn, and E. Sauer (2009) Brake timing measurements for a tractor-semitrailer under emergency braking. SAE International Journal of Commercial Vehicles 2 (2009-01-2918), pp. 245–255. Cited by: §I, §V-B.
  • [6] S. Biswas, R. Tatchikou, and F. Dion (2006) Vehicle-to-vehicle wireless communication protocols for enhancing highway traffic safety. IEEE communications magazine 44 (1), pp. 74–82. Cited by: §V-B.
  • [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §I.
  • [8] L. Buşoniu, R. Babuška, and B. De Schutter (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172. Cited by: §I.
  • [9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §V.
  • [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §I.
  • [11] V. Firoiu, T. Ju, and J. Tenenbaum (2018) At human speed: deep reinforcement learning with action delay. arXiv preprint arXiv:1810.07286. Cited by: §I.
  • [12] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems, pp. 2137–2145. Cited by: §I.
  • [13] S. Gong, J. Shen, and L. Du (2016) Constrained optimization and distributed computation based car following control of a connected and autonomous vehicle platoon. Transportation Research Part B: Methodological 94, pp. 314–334. Cited by: §I.
  • [14] K. Gu and S. Niculescu (2003) Survey on recent results in the stability and control of time-delay systems. Journal of dynamic systems, measurement, and control 125 (2), pp. 158–165. Cited by: §I, §I.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §IV.
  • [16] R. Hannah and W. Yin (2018) On unbounded delays in asynchronous parallel fixed-point algorithms. Journal of Scientific Computing 76 (1), pp. 299–326. Cited by: §I.
  • [17] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote (2017) A survey of learning in multiagent environments: dealing with non-stationarity. arXiv preprint arXiv:1707.09183. Cited by: §I.
  • [18] P. Hernandez-Leal, B. Kartal, and M. E. Taylor (2019) A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33 (6), pp. 750–797. Cited by: §I, §I.
  • [19] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103. Cited by: §I.
  • [20] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §V-A1.
  • [21] K. V. Katsikopoulos and S. E. Engelbrecht (2003) Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control 48 (4), pp. 568–574. Cited by: §II-B.
  • [22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §II-C, §IV.
  • [23] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pp. 6379–6390. Cited by: §II-C, §V.
  • [24] A. Manitius and A. Olbrot (1979) Finite spectrum assignment problem for systems with delays. IEEE transactions on Automatic Control 24 (4), pp. 541–552. Cited by: §I.
  • [25] M. R. Matausek and A. Micic (1999) On the modified smith predictor for controlling a process with an integrator and long dead-time. IEEE Transactions on Automatic Control 44 (8), pp. 1603–1606. Cited by: §I.
  • [26] L. Matignon, L. Jeanpierre, and A. Mouaddib (2012) Coordinated multi-robot exploration under communication constraints using decentralized markov decision processes. In Twenty-sixth AAAI Conference on Artificial Intelligence. Cited by: §I.
  • [27] L. Matignon, G. J. Laurent, and N. Le Fort-Piat (2012) Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27 (1), pp. 1–31. Cited by: §I.
  • [28] L. Mirkin (2000) On the extraction of dead-time controllers from delay-free parametrizations. IFAC Proceedings Volumes 33 (23), pp. 169–174. Cited by: §I.
  • [29] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §IV.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §I, §II-C.
  • [31] S. Mondié and W. Michiels (2003) Finite spectrum assignment of unstable time-delay systems with a safe implementation. IEEE Transactions on Automatic Control 48 (12), pp. 2207–2212. Cited by: §I.
  • [32] S. B. Moon, P. Skelly, and D. Towsley (1999) Estimation and removal of clock skew from network delay measurements. In IEEE INFOCOM’99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No. 99CH36320), Vol. 1, pp. 227–234. Cited by: §I.
  • [33] I. Mordatch and P. Abbeel (2018) Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I.
  • [34] E. Moulay, M. Dambrine, N. Yeganefar, and W. Perruquetti (2008) Finite-time stability and stabilization of time-delay systems. Systems & Control Letters 57 (7), pp. 561–566. Cited by: §I.
  • [35] S. Niculescu (2001) Delay effects on stability: a robust control approach. Vol. 269, Springer Science & Business Media. Cited by: §I.
  • [36] G. Papoudakis, F. Christianos, A. Rahman, and S. V. Albrecht (2019) Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737. Cited by: §II-C.
  • [37] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang (2017) Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069 2. Cited by: §I.
  • [38] J. Ploeg, N. Van De Wouw, and H. Nijmeijer (2013) Lp string stability of cascaded systems: application to vehicle platooning. IEEE Transactions on Control Systems Technology 22 (2), pp. 786–793. Cited by: §I.
  • [39] R. Rajamani (2011) Vehicle dynamics and control. Springer Science & Business Media. Cited by: §V-B.
  • [40] S. Ramstedt and C. Pal (2019) Real-time reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3067–3076. Cited by: §I.
  • [41] E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker (2010) Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. Cited by: §II-B, §V-A2.
  • [42] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §I.
  • [43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §I.
  • [44] S. P. Singh, T. Jaakkola, and M. I. Jordan (1994) Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Cited by: §I, §II-B, §V-A2.
  • [45] S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In Advances in neural information processing systems, pp. 2244–2252. Cited by: §I.
  • [46] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §II-C, §IV.
  • [47] M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §II-C.
  • [48] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §I.
  • [49] J. B. Travnik, K. W. Mathewson, R. S. Sutton, and P. M. Pilarski (2018) Reactive reinforcement learning in asynchronous environments. Frontiers in Robotics and AI 5, pp. 79. Cited by: §I.
  • [50] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman (2009) Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems 18 (1), pp. 83. Cited by: §I.
  • [51] M. Wang, S. P. Hoogendoorn, W. Daamen, B. van Arem, B. Shyrokau, and R. Happee (2018) Delay-compensating strategy to enhance string stability of adaptive cruise controlled vehicles. Transportmetrica B: Transport Dynamics 6 (3), pp. 211–229. Cited by: §V-B.