Multi-Agent Broad Reinforcement Learning for Intelligent Traffic Light Control

An Intelligent Traffic Light Control System (ITLCS) is a typical Multi-Agent System (MAS) comprising multiple roads and traffic lights. Constructing a MAS model for ITLCS is the basis for alleviating traffic congestion. Existing MAS approaches are largely based on Multi-Agent Deep Reinforcement Learning (MADRL). Although the Deep Neural Network (DNN) of MADRL is effective, its training time is long and its parameters are difficult to trace. Recently, Broad Learning Systems (BLS) have provided an alternative to learning in deep neural networks by using a flat network. Moreover, Broad Reinforcement Learning (BRL) extends BLS to Single-Agent Deep Reinforcement Learning (SADRL) problems with promising results. However, BRL does not address the intricate structures and interactions of multiple agents. Motivated by the features of MADRL and the limitations of BRL, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework to explore the function of BLS in MAS. First, unlike most existing MADRL approaches, which use a series of deep neural network structures, we model each agent with broad networks. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: When to interact, Which agents to consider, and What information to transmit. Finally, we conduct experiments on an intelligent traffic light control scenario. We compare MABRL with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.


I Introduction

Intelligent Traffic Light Control (ITLC) is a typical application of Multi-Agent Systems (MAS) [tits_tlc2]. Intersections of road networks in ITLC are modeled as agents, and the agents aim to relieve traffic congestion [tiv_tlc1]. Therefore, building a suitable MAS model of ITLC is the basis for achieving this goal. In MAS, agents are intelligent controllers with autonomy, inferential capability, and social behavior. A MAS organizes a group of agents to achieve a common objective by interacting, making decisions, coordinating, and learning [MASintroduction]. Although MAS has made significant progress in many areas [bc2, visualization], it still faces many challenges. First, each agent of a MAS is a local observer with limited perception. Second, the decisions of each agent disturb the whole environment and make it unstable, so the mappings from state to action among Multiple Agents (MA) are complicated. Third, the interaction between different agents needs to be considered. To reach a globally optimal solution, agents need to quantify the information and abilities of other agents and then build suitable inference mechanisms [2018survery, survery2021] to make decisions. In summary, agents of a MAS need to take perception, decision, and inference into account.

Many approaches have been proposed to enhance the performance of MAS [ACO, MH, PSO, tiv_madrl_1]. Among these methods, MADRL has received more attention. MADRL adopts Deep Reinforcement Learning (DRL) algorithms with Deep Neural Networks (DNN) to solve MAS problems. Compared with the typical structure of traditional Single-Agent DRL (SADRL), a MADRL approach adds a new component that accounts for the influence of other agents. There are two reasons why MADRL is not a simple extension of SADRL. In terms of information, the information space scales up as the number of agents increases, and both the individual and the joint state-action values of agents should be utilized. Thus, the storage pressure and the computational difficulty are heavier than in SADRL. In terms of framework, Fully Decentralized (FD) [FD], Fully Centralized (FC) [FC], and Centralized Training with Decentralized Execution (CTDE) [CTDE] are the typical paradigms of MADRL, and all of them are built on DNN. Before agents make decisions, these paradigms standardize the interaction structure by constructing multiple serviceable layers. Because traditional SADRL approaches involve no crossing or parallelism of policies between MA, the network structures of MADRL are more intricate than those of SADRL, and the training time is lengthened when parameters are transmitted layer by layer. Accordingly, MADRL has a heavier computational burden and a more intricate structure than SADRL.

Broad Learning Systems (BLS) [2017Broad], [BLS] is an incremental algorithm inspired by Single-Layer Feedforward Neural networks (SLFN) [SLFN]. BLS offers a new way, with fast remodeling speed, to substitute for DNN learning. Unlike single-layer network structures such as the Radial Basis Function (RBF) network, BLS uses mapped nodes and enhancement nodes to handle the input data and trains the model by ridge regression. Besides, Broad Learning with RL signal Feedback (BLRLF) [BLRLF] brings BLS to the RL area by introducing a weight optimization mechanism into Adaptive Dynamic Programming (ADP) [ADP] to enhance the expansion capability of BLS. Recently, Broad Reinforcement Learning (BRL) [IOT-BRL], which combines BLS and DRL, has been investigated. The BRL framework has two key elements: one is using the Broad Networks (BN) of BLS to replace the DNN, and the other is adopting a training pool to introduce the labels required by BLS. Compared with DRL approaches, BRL achieves better performance with a shorter execution time. BRL is the first algorithm to solve control problems by using BLS. However, BRL has concentrated on single-agent problems and has paid little attention to MAS. As the number of agents increases, it is worth investigating how BLS handles the mutual effects between MA.

Inspired by the issues of MADRL and the features of BRL, we propose Multi-Agent Broad Reinforcement Learning (MABRL). First, we outline the framework of MABRL, which combines MADRL and BRL to explore the function of BLS in MAS. Each agent has an integrated decision-making structure built on broad networks and updates its policy based on its memory and a pseudoinverse calculation. Second, agents of MABRL adopt a joint policy based on the stochastic game to interact with the environment continually. They evaluate the influence of other agents through the Dynamic Self-Cycling Interaction Mechanism (DSCIM) to make decisions. Finally, we model an instance of ITLC and experiment with three datasets to verify the effectiveness of MABRL. To the best of our knowledge, this is the first approach that applies BRL in MADRL. The contributions of our work are summarized as follows:

  • A novel MABRL framework is proposed that adopts BRL to solve MAS problems, as Fig. 1 shows. Unlike traditional MADRL, MABRL has a simple and traceable BN structure and updates its training models and parameters by pseudoinverse calculation. Compared with BRL, the multiple agents of MABRL adopt BN with interaction mechanisms to make decisions.

  • The Dynamic Self-Cycling Interaction Mechanism (DSCIM) is designed in MABRL to enhance the interaction between agents, drawing on the attention mechanism. Agents adopt DSCIM to confirm the joint "3W" information: When to interact with others, Which agents to consider, and What information to transmit. After obtaining the joint information, agents of MABRL process the mapped features and the joint information through the enhancement nodes.

  • We build a model of ITLC with MABRL and experiment on three datasets. The ability of MABRL is measured by how well it relieves traffic congestion when the environment and datasets are intricate.

Fig. 1: The framework of MABRL.

This paper is organized as follows: Section II summarizes the related work. Section III introduces the framework of MABRL. Section IV presents the experiments and results. Finally, Section V concludes the paper.

II Related Work

In this section, two pivotal topics about MABRL are briefly reviewed. One is the multi-agent deep reinforcement learning method, and the other is broad learning systems.

II-A Multi-Agent Reinforcement Learning

DRL has had a profound impact in recent years [chang2020important], [2018deeplearning]. A series of relevant methods have been proposed and extensively applied in various fields, such as robot control [gu2017deep], [zhang2015towards], automatic driving [auto_drive_tits], edge computing [tits_edge_compute], and traffic control [tiv_tlc2]. Specifically, a DRL agent repeatedly explores the environment and maximizes its reward to achieve excellent performance. However, as the complexity of the environment increases, it is becoming difficult for agents to make the right decisions using SADRL methods.

Recently, many researchers have built multi-agent models to explore real-world problems. MADRL has been proposed for large-scale and complex practical scenarios. As the cooperation or competition of multiple agents has become more important to explore, studies have been conducted separately on emergent behaviors [behavior], communication between multiple agents [comm], and cooperation among agents [2018deeplearning]. Many MADRL methods and applications have been presented in recent years. A simple approach that extends the single-agent method to independent multi-agent methods was proposed first [IMADRL]. It is a baseline method for MADRL, but it only handles the problem with an unstable environment. Moreover, how and when to communicate with other agents are essential questions [AQ]. CTDE has been adopted to answer these questions. Two methods based on CTDE were proposed, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL), in which agents learn elegant communication protocols to complete the task [foerster2016learning]. However, they only consider discrete variables. Moreover, three kinds of methods for cooperative learning have been introduced: value-based methods [vdn], policy-based approaches [coma], and experience replay-based methods [chu2019multi]. In traffic control, MADRL has been widely applied to ITLC. By adjusting the phases and durations of traffic lights, the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method can reduce the average waiting time of vehicles [S2020MADDPG, tcbtraffic]. To address the mixed manual-automated traffic scenario at an isolated intersection, an adaptive traffic signal control scheme has been used to minimize the total traveling time of all vehicles. As the proportion of automated vehicles increases, this method outperforms the others [MAS2020development].

II-B Broad Learning Systems

A DNN is governed by a wealth of hyperparameters and a complicated structure. DNN-based methods learn policies to solve decision-making problems. One conventional measure to enhance their capacity is to deepen the network. This makes the network structure hard to analyze and leads to long training times. Recently, methods that rely less on DNN have been developed to solve decision problems. In 2017, Chen et al. [2017Broad] proposed a new framework named BLS, based on SLFN, which has a flat network. Its remodeling is fast, and the model is updated dynamically without reshaping a deep structure [BLS]. The core difference between DRL and BLS is the network framework: DRL has multiple network layers, whereas BLS has only one. In BLS, the input is processed into mapped features and enhancement features. The weights on both kinds of features are updated by the pseudoinverse, which requires a label. Recently, Stacked BLS has been proposed to enhance the accuracy of models [2020Stacked]. Moreover, BLS and its variants [2020Semi, 2020Analysis, chen2020deep] have been used in other fields, including image classification [wang2020hyperspectral, 2020BLS], industrial processes [chu2019multi], resource utilization [wu2019prediction], medical care [ali2020optic], time series prediction [xu2018recurrent], maximum information networks [han2020maximum], network traffic flow prediction [chen2020deep, peng2020broad], event commentary [sheng2020greensea], and traffic forecasting [liu2020training]. In the traffic area, BRL was the first such method adopted in TLC [IOT-BRL]. BRL combines the DRL approach with BLS. While its effectiveness is on a par with DRL, its execution time is shorter than that of the comparable DRL methods. An application to training traffic prediction [liu2020training] has also been proposed recently for the first time. A new method for traffic services in intelligent cities adopted a semi-supervised DRL variant [tang2020semi] based on BLS.

III Proposed Approach

In this section, we first present the motivation of our work by reviewing the frameworks of MADRL and BLS from previous work. Then, we describe the framework of MABRL. Finally, we build an instantiation model that adopts MABRL.

III-A Motivation

Fig. 2: The typical network structure of DRL approaches.
Fig. 3: The framework of MADRL approaches.
Fig. 4: The typical structure of the BLS.

The conventional DRL approach models an agent with neural networks that learn by trial and error. In the DRL learning mechanism, the agent learns policies to make decisions and updates them through a specific loss function, aiming to maximize rewards or achieve specific goals by interacting with the environment. The objective of the loss function is to optimize the parameters of each network to build a suitable model. Because of this learning mechanism, both SADRL and MADRL approaches need to update models with DNN [DRL-network1], [DRL-network2]. Fig. 2 illustrates the basic framework of deep networks. It differs from Single Layer Neural Networks (SLNN), such as RBF, which have only one layer. A typical DNN framework includes a series of Fully Connected (FC), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN) layers. Most SADRL methods have one DNN framework.

As Fig. 3 shows, a MADRL approach contains more than one DNN framework. During training and execution, the parameters of different agents must pass through other agents to be updated, so MADRL is more complex than SADRL. Moreover, training speed has recently emerged as a major issue. As the number of network layers and agents increases, training slows down and the training time lengthens, especially for MADRL. When a MADRL approach updates its parameters, the local agent needs to interact with other agents through layer-by-layer iteration of the neural networks, which increases the network burden and lengthens the training time. Therefore, it is crucial to find a suitable way to reduce the complexity of the networks.

BLS provides a novel way to replace the DNN structure. It reduces the depth of the network and expands the network structure laterally, as Fig. 4 indicates. BLS designs mapped nodes and enhancement nodes to manage the input of its network. The initial inputs are transformed into mapped features, and the network structure is broadened through enhancement nodes. Specifically, the mapped nodes map the state into features, and the enhancement nodes process these mapped features. Ridge regression is adopted to update the connection weights between the two kinds of nodes. Moreover, BLS improves performance by increasing the number of mapped nodes, enhancement nodes, and input data. Unlike SLNN, which feeds the high-dimensional raw data directly into the network, BLS maps the inputs to form mapped features and enhancement features and updates with the pseudoinverse. Compared with DRL, the remodeling speed of BLS is faster.

Recently, BRL has been proposed based on BLS and DRL [IOT-BRL]. BRL builds on the Deep Q-learning Network (DQN) approach and comprises five elements: environment, experience, training pool, evaluation BLS, and target BLS. As BLS needs a label to update its parameters, the target value is regarded as the label that provides the training samples, and the target network updates its parameters with a delay. The evaluation network selects an appropriate action according to the trained parameters and is updated more promptly than the target network. BRL is the first work to combine BLS and DRL. However, it only considers Single-Agent (SA) control problems. As the number of agents increases, many agents need to be modeled. The dimensions of the data are larger than in an SA environment, the computational difficulty increases exponentially, and the stability of the environment must be considered. Moreover, an SA problem has only one agent, with no interaction with others. Multi-agent interactions are ubiquitous, and scaling from single to multiple agents raises several difficulties. MABRL needs to handle the mutual effects of agents. Thus, it is worth developing BLS for MAS.

In summary, there are two reasons for us to extend BRL to MAS. First, the structure of MADRL is complicated and time-consuming. Second, the current BRL method focuses only on SA and does not consider the interactions between MA. Motivated by these reasons, we outline a MABRL framework and propose a specific approach, whose effectiveness we verify through experiments.

Fig. 5: The network structure of MABRL.

III-B MABRL Framework

We introduce the MABRL algorithm from the implementation side. Fig. 5 illustrates the complete setup for MABRL. At each time step , every agent obtains the local observation from the environment, . Secondly, regarding the DSCIM, obtains information about other , and organizes the joint state . Thirdly, we parameterize the execution policy by , expresses the mapped node, means the enhancement node. inputs the into Evaluation Broad Network (EBN) to acquire action at time . Fourthly, executes action into environment to obtain new observation , joint-state and reward at time . Fifthly, stores into each memory. Finally, updates policy via the iteration and calculation between EBN and Target Broad Network (TBN) by parameterizing the . In summary, the network structure consists of four key components: the interaction mechanism, the EBN, the TBN, and the update mechanism, which we explain as follows.

Fig. 6: The dynamic self-cycling interaction mechanism of MABRL

III-B1 Interaction Mechanism

MABRL involves two aspects of interaction: the interaction between agents and the environment, and the interaction between different agents. The first aspect confirms when to communicate. The second determines whom to interact with and what information to adopt. Specifically, with respect to the environment, agents obtain an observation from the environment after executing the action calculated by the EBN. The observation determines whether the agent is capable of handling the current situation. With respect to other agents, if agents need to cooperate to alleviate the current situation, the local agent obtains the joint state value according to the interaction mechanism. Many approaches focus on the interaction between multiple agents. CommNet has been proposed to manage continuous variables [CommNet]. BiCNet shares partial parameters to interact [BiCNet]. These approaches need to cooperate with other agents at every time step, which increases the burden on the networks. The Attention Mechanism (AM) has been adopted in MADRL [DIAL]; it provides a way to communicate with other agents dynamically and thus reduces communication costs.

Inspired by the AM, we design the DSCIM to confirm the "3W" information: when agents need to communicate, which other agents to interact with, and what information to transmit. Firstly, we evaluate whether the local agent needs to communicate. The has been defined to reflect the ability of the local agent to handle the current situation, calculated by:


represents the threshold; if the reward of is smaller than , the local agent needs to cooperate with the other agents, and the label is true:


Then we confirm which other agents to interact with. The position of each agent can be obtained in most applications, so we use the locations of agents to decide. indicates the position of local , indicates the position of other . We calculate the Euclidean distance between and :


then regard the shortest distance as the threshold. Following the determination of distance, we can obtain the number and identity of other agents, then we use these messages to confirm the final information:


The calculated by:


where is the observation of . Finally, the state of is at current, and in the next time.
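The three DSCIM decisions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dscim`, its arguments, and the choice to average the selected neighbours' observations into a single message are hypothetical. The text only states that the joint information is built from the observations of the other agents.

```python
import math


def dscim(i, rewards, positions, observations, threshold, tol=1e-9):
    """Return (needs_help, neighbour_ids, message) for local agent i.

    All names and the averaging step are illustrative assumptions.
    """
    # When: interact only if the local reward falls below the threshold.
    if rewards[i] >= threshold:
        return False, [], None

    # Which: other agents located at the shortest Euclidean distance.
    dists = {
        j: math.dist(positions[i], positions[j])
        for j in range(len(positions))
        if j != i
    }
    if not dists:  # no other agent exists
        return True, [], None
    d_min = min(dists.values())
    neighbours = sorted(j for j, d in dists.items() if d <= d_min + tol)

    # What: element-wise mean of the neighbours' observations (assumption).
    dim = len(observations[i])
    message = [
        sum(observations[j][k] for j in neighbours) / len(neighbours)
        for k in range(dim)
    ]
    return True, neighbours, message
```

For example, with four agents at `(0, 0)`, `(1, 0)`, `(0, 1)`, and `(5, 5)`, an agent at the origin whose reward is below the threshold would select the two agents at distance 1 and receive the mean of their observations.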

III-B2 Decision-making Policy

The EBN of MABRL has been used to make decision by the value of action . EBN has the mapped and enhancement nodes, the parameter of it for each agent are . Firstly, the observation of each agent transfers to the mapped features, considered as the input .


is a nonlinear transformation for . and are the random weights and biases with the proper dimensions, both of them without update and randomly generate. All mapped features of denote as . denotes the numbers of input samples,

denotes the input vector dimension.

is the number of mapping features, and each mapping has nodes. All agents have the same number of . In MABRL, regards as the input , and combines the : , then conducts by enhancement node:



is an activation function,

and also random respectively generated weights and biases, which remain unchanged. The enhancement features integrate into . The concatenation between and as:


is imported into the output layer. The above parameters do not need to be updated; all of them are initialized once at the beginning. Finally, the is calculated by:


Accordingly, we adopt the greedy algorithm to acquire the action:


As Fig. 5 shows, the need output weights to update, the and obtain from the TBN, which introduce in the next section.
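The forward pass of one agent's EBN can be sketched as follows. This is a hedged illustration rather than the authors' code: the class name `BroadNetwork`, the `tanh` activations, the standard-normal initialization, and the use of a single group of mapped nodes are all assumptions. What follows the text is the overall flow: fixed random weights produce mapped features from the observation, the mapped features are combined with the joint information and passed through enhancement nodes, both feature sets are concatenated, and only the output weights are trainable. The default sizes (10 mapped and 25 enhancement features) are the values reported in the experimental settings.

```python
import numpy as np


class BroadNetwork:
    """Flat network: mapped nodes + enhancement nodes + linear output."""

    def __init__(self, obs_dim, info_dim, n_actions,
                 n_mapped=10, n_enhance=25, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights and biases, generated once and never updated.
        self.W_map = rng.standard_normal((obs_dim, n_mapped))
        self.b_map = rng.standard_normal(n_mapped)
        self.W_enh = rng.standard_normal((n_mapped + info_dim, n_enhance))
        self.b_enh = rng.standard_normal(n_enhance)
        # Output weights: the only parameters that are trained.
        self.W_out = np.zeros((n_mapped + n_enhance, n_actions))

    def features(self, obs, info):
        obs = np.atleast_2d(np.asarray(obs, dtype=float))
        info = np.atleast_2d(np.asarray(info, dtype=float))
        # Mapped features from the local observation (tanh is an assumption).
        Z = np.tanh(obs @ self.W_map + self.b_map)
        # Enhancement features from mapped features plus joint information.
        H = np.tanh(np.hstack([Z, info]) @ self.W_enh + self.b_enh)
        # Concatenation fed to the output layer.
        return np.hstack([Z, H])

    def q_values(self, obs, info):
        return self.features(obs, info) @ self.W_out

    def act(self, obs, info):
        # Greedy action: index of the largest action value.
        return int(np.argmax(self.q_values(obs, info)[0]))
```

Because the hidden weights are fixed, `features` is a deterministic transform of the inputs, which is what allows the output weights to be fitted in closed form instead of by gradient descent.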

Notation Meaning
the observation of
The information about others
The state of
The action of
The total number of agents
The total training episodes
The total time of each episode
The discount factor
The parameter of in EBN
The parameter of in TBN
The updating weight of
TABLE I: Notations of MABRL

III-B3 Update Mechanism

The EBN and TBN both combine to update the policy. As action value needs to calculate the , the prime mission of MABRL is to update . It is different from the DRL method which need to update all hyper-parameters. Regarding the principle of BLS, we initialize randomly in the first time, and each agent collects the information into memory. After pre-training, EBN has enough samples. We then continue to initialize the parameters of TBN, . In order to optimize the , we calculate the as following:


where is the fixed value without update. Then, we calculate the target value considering as the output :


After obtaining the output , the optimization problem is formulated to find the regularized least squares solution of





denotes an identity matrix with proper dimensions, and

is a non-negative constant for regularization. When , the above solution is equivalent to


denotes the pseudo-inverse of .

In summary, the optimization of MABRL updates the weight of each agent: the output of EBN is taken as the label, the is obtained by the pseudo-inverse, and finally the TBN is used to calculate the to determine the action. The notations of the framework are described in Table I, and the integrated algorithm is shown in Algorithm 1.
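The closed-form weight update can be illustrated with the standard BLS ridge-regression solution, W = (λI + AᵀA)⁻¹AᵀY, which approaches the pseudo-inverse solution as λ tends to zero. The sketch below is an assumption-laden illustration: the symbols `A` (concatenated features), `Y` (labels), and `lam` (regularization constant) follow common BLS notation and may differ from the authors' symbols, and `build_labels` assumes a DQN-style target (reward plus discounted maximum target value), which the text suggests but does not spell out.

```python
import numpy as np


def ridge_output_weights(A, Y, lam=1e-3):
    """Regularized least-squares solution W = (lam*I + A^T A)^-1 A^T Y."""
    A = np.asarray(A, dtype=float)
    Y = np.asarray(Y, dtype=float)
    gram = A.T @ A + lam * np.eye(A.shape[1])
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(gram, A.T @ Y)


def build_labels(q_eval, actions, rewards, q_next, gamma=0.99):
    """DQN-style labels (assumption): replace the value of the action
    taken with reward + gamma * max target value; keep other entries."""
    labels = np.array(q_eval, dtype=float)
    targets = np.asarray(rewards, dtype=float) + gamma * np.max(q_next, axis=1)
    labels[np.arange(len(labels)), np.asarray(actions)] = targets
    return labels
```

With a very small `lam`, `ridge_output_weights(A, Y, lam)` agrees numerically with `np.linalg.pinv(A) @ Y` for a well-conditioned `A`, which mirrors the equivalence stated in the text.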

1:  Initialize the EBN, TBN and the parameters , for each agent.
2:  for   do
3:     Initialize the environment;
4:     for   do
5:        Each agent obtain the observation ;
6:        if  then
7:           Random initialization ;
8:           Putting the to EBN calculating the by Eq. (10);
9:           Storing the in each memory;
10:        else
11:           Determining the joint-state value as time by Eq. (1) - (6);
12:           Extracting the from memory into EBN;
13:           Obtaining by Eq. (12);
14:           Acquiring the by Eq. (16);
15:           Each agent input the and into EBN to obtain the action by Eq. (11);
16:        end if
17:        Executing action , and calculate reward ;
18:        Storing the in memory;
19:        Updating EBN and TBN parameters respectively.
20:     end for
21:  end for
Algorithm 1     MABRL
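Algorithm 1 stores every transition in a per-agent memory and later extracts samples from it. A minimal sketch of such a memory is given below, assuming a bounded first-in-first-out buffer with uniform sampling; the class name `Memory`, its method names, and the sampling strategy are hypothetical, since the text does not specify how samples are drawn.

```python
import random
from collections import deque


class Memory:
    """Bounded FIFO buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10000, seed=None):
        # The oldest transitions are dropped once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement (assumption).
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

During the pre-training stage, an agent would fill this buffer with transitions from random actions; once enough samples are available, batches drawn from it provide the features and labels for the closed-form weight update.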

III-C Instantiation of MABRL for Intelligent Traffic Light Control

This section instantiates the MABRL approach, applying it to the ITLC problems. In ITLC, the road network comprises several intersections, and we model each intersection as an agent via MABRL to make decisions and update. Fig. 7 shows the structure of this application. We indicate the models as follows, and the notations are described in Table II.

Fig. 7: The structure of ITLC adopting MABRL.

State: In an area road network, each agent observes the number of vehicles , the waiting time , and the queue length for each lane at time . Let denote the . We collect these characteristics in :


represents the car number of at time . Similar to the , we can calculate the severally. The above is the observation of one agent. In the multi-agents environment, agents consider the local observation and the information about others . Basically, each agent has local observation , the state of it is , can be obtained regarding the section III-B1. As one area road network comprises more than one intersection, the whole environment state formulates as .
Action: After each agent obtains the state, as the input is fed into the EBN to obtain the action , as described in Section III-B2. In ITLC problems, agents adopt the action to control the switching of the traffic phase. Regarding the policy of MABRL, the probability of each phase is calculated by:


According to the greedy policy, each agent then chooses the phase with the maximum value as action :


Once the action is determined, the chosen phase is first compared with the previous phase. If it is the same as the previous one , the phase continues to be executed. If not, the traffic light first turns yellow and then changes to the next phase , according to the action.
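The phase-switching rule can be sketched as a small helper. This is an illustrative sketch only: the function names, the `"yellow"` marker, and the zero-based phase indices are assumptions (the paper numbers the four phases 1 to 4), and the duration of the yellow light is left out because it is not given here.

```python
YELLOW = "yellow"


def greedy_phase(phase_values):
    """Index of the phase with the largest value (zero-based)."""
    return max(range(len(phase_values)), key=lambda p: phase_values[p])


def signal_sequence(prev_phase, chosen_phase):
    """Signals to display for one decision step.

    Same phase: keep it running. Different phase: insert a yellow
    interval before switching to the chosen phase.
    """
    if chosen_phase == prev_phase:
        return [prev_phase]
    return [YELLOW, chosen_phase]
```

For instance, if the previous phase is 1 and the greedy choice is 3, the intersection would display yellow first and then phase 3; if the greedy choice is again 1, the current phase simply continues.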


Reward: We record the features of each intersection during execution time . As the queue length directly represents the traffic condition, we use the queue length as the reward. The reward is defined as:


represents the weight of the queue length. is the negative feature of the reward: the higher the reward, the fewer vehicles are waiting. In addition to recording the performance of the current action, rewards also guide the update between TBN and EBN. The TBN uses the reward to calculate the target value , then uses the target value to update the of EBN. Specifically, just as ITLC methods aim to minimize queue length, each agent of MABRL strives to maximize its reward by updating its policy. If agents obtain value to decide action, they need to calculate. The is the core of MABRL, updated by incremental learning as detailed in the update mechanism. This update policy reduces the computation time and makes adequate use of the information from the environment.
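A reward of this kind can be sketched as the negative weighted sum of lane queue lengths. The exact form of the reward equation is not recoverable here, so the function below is an assumption consistent with the description: the reward is driven by queue length, carries a weight, and is negative so that a higher reward means fewer waiting vehicles. The name `queue_reward` and the default weight are hypothetical.

```python
def queue_reward(lane_queue_lengths, weight=1.0):
    """Negative weighted total queue length at one intersection."""
    return -weight * sum(lane_queue_lengths)
```

With queues of 3, 0, 5, and 2 vehicles and a weight of 0.5, the reward would be -5.0, and an intersection with shorter queues always receives a higher (less negative) reward.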

Notation Meaning
The number of vehicles in each intersection
The waiting time of each intersection
The queue length of each intersection
The number of intersections
The execution time
TABLE II: Notations of MABRL for ITLC

IV Experiment

In this section, we conduct experiments on three open traffic datasets. We first introduce the datasets in detail and then demonstrate the experimental details: the experimental setting, evaluative indicators, and comparison algorithm. Finally, we analyze the experimental results in three aspects.

IV-A Datasets

We run experiments under three conditions, each comprising an independent dataset and a road network. The three open datasets are obtained from the database. Owing to the low quality of the real data, the datasets have been simplified as necessary. The datasets of the three conditions are described in detail below, and their information is shown in Table III.

G-X Dataset: The G-X dataset is constructed artificially. It obeys a Gaussian distribution in which 30% of the vehicles turn right, 60% go straight, and the rest turn left. The maximum speed of the vehicles is limited to 35 km/h. The number of vehicles is 8412. We validate it on the road network of the Gaoxin sub-district, as Fig. 8 (a) shows.

Jinan Dataset: This traffic flow data is obtained from camera data in Jinan and has been simplified as necessary because of the low quality of the actual data. It contains the traffic data of Hongqi street in Jinan, which has 12 intersections and 2983 cars. The road network structure is shown in Fig. 8 (b).

Hangzhou Dataset: The Hangzhou dataset is synthesized from the taxi GPS data of Guandong street. The road network of this condition contains 16 intersections and 6295 vehicles, as Fig. 8 (c) shows.

Fig. 8: The road networks of the three conditions.
Fig. 9: The phases of traffic light.
Datasets | Number of Intersections | Time Span (seconds) | Number of Vehicles
G-X Dataset | 9 | 3600 | 8412
Jinan Dataset | 12 | 3600 | 2983
Hangzhou Dataset | 16 | 3600 | 6295
TABLE III: The information about three datasets

IV-B Experimental Details

IV-B1 Experimental Settings

We conduct our experiments on CityFlow (CF), a simulation platform, using a desktop equipped with an Intel i7 2.4 GHz CPU and 32 GB of memory. Many intersections in CF are closely linked and form complicated road networks. CF visualizes the traffic flow in the area road network. Vehicles with a defined route drive into the road network and exit from it when they reach their destination. When traffic lights turn red or congestion occurs, vehicles queue up at the intersection until the lights turn green or the congestion eases. Thus, the queue length and the waiting time change over time in this process.

In this experiment, the number of mapped features is 10, the number of enhancement features is 25, the memory capacity is 10000, and the decay coefficient is 0.99. One episode consists of 3600 seconds, and agents are trained for 100 episodes. In terms of the model, four phases represent the actions that the agents can choose, as Fig. 9 shows. Phase 1 denotes going straight in the east-west direction, and phase 2 denotes turning left in the same direction. Phase 3 denotes going straight in the south-north direction, and phase 4 denotes turning left in that direction. The more precise settings of MABRL are listed in Table IV.

Model Parameter | Symbol | Value
Decay coefficient | | 0.99
Simulation time of each epoch | |
Iterations | | 100
Memory capacity | m | 100000
Learning Rate | | 0.001
Mapped Features | | 10
Enhancement Features | | 25
TABLE IV: The parameter settings of MABRL

IV-B2 Evaluative Indicators

The MABRL approach intends to mitigate traffic congestion and minimize the overall waiting time of vehicles. The key metric is the reward, which reflects the queue length of vehicles. In addition, the waiting time is collected to evaluate the performance. These indicators are described in detail below.

Reward: The reward represents the total queue length of the vehicles calculated by Equation 22. The methods with a higher reward have shorter queue lengths.

Waiting Time: The waiting time is the time a vehicle spends waiting at an intersection until it starts to move. The average waiting time reflects the quality of the chosen actions: the shorter the average waiting time, the better the performance.
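The two indicators can be illustrated with a short sketch. Equation 22 is not reproduced here, so the reward below is an assumption: it is taken to be the negative of the total queue length, which agrees with the negative reward values reported in the tables and with the statement that a higher reward means shorter queues. The function names are hypothetical.

```python
def queue_reward(queue_lengths):
    """Assumed reward: the negative total queue length over all lanes.

    This is an illustrative stand-in for Equation 22, not a copy of it.
    """
    return -float(sum(queue_lengths))


def average_waiting_time(waiting_times):
    """Mean waiting time (in seconds) over the observed vehicles."""
    if not waiting_times:
        return 0.0
    return sum(waiting_times) / len(waiting_times)


# Four incoming lanes with 3, 0, 5 and 2 queued vehicles.
reward = queue_reward([3, 0, 5, 2])
# Three vehicles that waited 10 s, 20 s and 30 s.
mean_wait = average_waiting_time([10, 20, 30])
```

Under this assumption a reward closer to zero corresponds to fewer queued vehicles.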

Fig. 10: The reward of different methods on the (a) G-X, (b) Jinan, and (c) Hangzhou datasets; (d) the reward of MABRL on the three datasets.

IV-B3 Comparison Algorithms

We compare the MABRL approach with six algorithms. FT and SOTL are traditional approaches for ITLC. SADRL, MARDDPG, and CILDDQN are based on DRL, and SABRL is based on BRL. We detail these approaches as follows.

Fixed Time traffic light control (FT): FT fixes the duration of each phase. When the current phase has lasted for its preset time, the traffic light changes to the next phase in the cycle.

Self-Organizing Traffic Light control (SOTL)[sotl]: SOTL is a modified version of FT. It sets a threshold on the queue length of each intersection to determine the traffic light phase. The traffic light switches to the next phase in the cycle when the queue length reaches the threshold.

Single-Agent Deep Reinforcement Learning (SADRL): This method is based on the traditional DQN method. It keeps a replay memory from which it learns from prior experiences. It treats the whole environment as one agent, which is responsible for scheduling the traffic lights of all intersections.

Single-Agent Broad Reinforcement Learning (SABRL)[IOT-BRL]: Like SADRL, SABRL models a single agent that controls the traffic lights of all junctions. The difference is that SABRL uses broad networks, whereas SADRL uses deep networks.

Multi-Agent Recurrent Deep Deterministic Policy Gradient (MARDDPG)[MARDDPG]: Built on MADDPG, MARDDPG adopts centralized learning in each critic network, while each agent makes its decisions in a decentralized way.

Cooperative Important Lenient Double DQN (CILDDQN) [CILDDQN]: CILDDQN models each intersection as an independent agent and aims to reduce the difficulty of cooperative learning.
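The two traditional baselines are simple enough to sketch directly from the descriptions above. The following is a minimal illustration under stated assumptions, not the implementation used in the experiments: time advances in one-second steps, the four phases cycle in numerical order, and the function names and arguments are hypothetical.

```python
PHASES = (1, 2, 3, 4)  # cycling order assumed for illustration


def next_phase(phase):
    """Return the phase that follows `phase` in the cycle."""
    return PHASES[(PHASES.index(phase) + 1) % len(PHASES)]


def fixed_time_step(phase, elapsed, duration):
    """FT: switch once the current phase has lasted `duration` seconds.

    Returns the (phase, elapsed) pair to use at the next time step.
    """
    if elapsed >= duration:
        return next_phase(phase), 0
    return phase, elapsed + 1


def sotl_step(phase, queue_length, threshold):
    """SOTL as described above: switch when the queue reaches the threshold."""
    if queue_length >= threshold:
        return next_phase(phase)
    return phase


# FT with a 3-second phase duration: the fourth step triggers the switch.
state = (1, 0)
for _ in range(4):
    state = fixed_time_step(state[0], state[1], 3)
```

Both rules ignore the state of neighbouring intersections, which is the main contrast with the learning-based methods in the comparison.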

Fig. 11: The average traveling time of different methods on the (a) G-X, (b) Jinan, and (c) Hangzhou datasets; (d) the waiting time of MABRL on the three datasets.

IV-C Experimental Results

IV-C1 Comparison of Network Structures

In this experiment, SADRL is a DRL method, MARDDPG and CILDDQN are MADRL methods, and MABRL and SABRL are BRL methods. Fig. 10 and Fig. 11 display the training curves of the reward and the average waiting time, respectively. The DRL and BRL methods all perform better than FT and SOTL. On the G-X dataset, the reward curves of MABRL, MARDDPG, and CILDDQN almost coincide. Nevertheless, the average reward of MABRL is 16.08% and 5.49% higher than that of MARDDPG and CILDDQN, respectively, which indicates that MABRL fits faster than the MADRL methods. Additionally, MABRL performs better than MARDDPG and CILDDQN on the Jinan and Hangzhou datasets, as Fig. 10 (b) and Fig. 10 (c) show, which demonstrates the robustness of MABRL. The upward trend of the reward in MABRL is relatively stable, without large fluctuations. This is because MABRL relies on a rigorous pseudoinverse calculation, unlike the other methods, and because the choice of the next action is based on the last state, which is close to the current state. Thus, the training process of MABRL is steadier. Concretely, the rewards of SOTL are 17.16%, 48.24%, and 35.86% higher than those of FT on the three datasets, and its waiting times are 8.02%, 24.73%, and 13.95% lower, respectively. The performance of SABRL and SADRL is similar. Compared with MARDDPG, MABRL reduces the waiting time by 6.00%, 7.52%, and 9.08% on the three datasets. Compared with CILDDQN, MABRL reduces the waiting time by 2.91%, 8.68%, and 6.49%.
In conclusion, MABRL performs better than the MADRL methods, and SABRL has almost the same effect as SADRL. Moreover, the BRL methods fit faster than the others.
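The percentages quoted above appear to be relative changes in magnitude with respect to the baseline method. The sketch below reproduces a few of them from the mean values in Tables V and VI; the formula is inferred from the numbers rather than stated in the text, and the function name is hypothetical.

```python
def relative_improvement(baseline, value):
    """Percentage by which |value| is smaller than |baseline|.

    For rewards (negative queue lengths) and for waiting times alike,
    a smaller magnitude is better, so a positive result is an improvement.
    """
    return (abs(baseline) - abs(value)) / abs(baseline) * 100.0


# Mean rewards on the G-X dataset (Table V).
reward_gx = {"MARDDPG": -27.48, "CILDDQN": -24.40, "MABRL": -23.06}
# Mean waiting times on the G-X dataset (Table VI).
wait_gx = {"MARDDPG": 45.00, "CILDDQN": 43.57, "MABRL": 42.30}

for name in ("MARDDPG", "CILDDQN"):
    r = relative_improvement(reward_gx[name], reward_gx["MABRL"])
    w = relative_improvement(wait_gx[name], wait_gx["MABRL"])
    print(f"MABRL vs {name}: reward +{r:.2f}%, waiting time -{w:.2f}%")
```

With the G-X values this yields 16.08% and 5.49% for the reward and 6.00% and 2.91% for the waiting time, matching the figures in the text.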

Algorithms G-X Dataset Jinan Dataset Hangzhou Dataset
FT -54.46 -36.98 -10.68
SOTL -45.11 -19.14 -6.85
SADRL -28.29 ± 17.53 -15.23 ± 8.94 -6.40 ± 2.02
SABRL -26.78 ± 13.71 -15.54 ± 5.89 -5.40 ± 1.64
MARDDPG -27.48 ± 19.43 -13.93 ± 5.55 -4.71 ± 1.62
CILDDQN -24.40 ± 15.64 -14.39 ± 5.05 -4.28 ± 1.73
MABRL -23.06 ± 15.36 -11.10 ± 6.35 -3.27 ± 2.90
TABLE V: The average reward performance with different traffic datasets.

Algorithms G-X Dataset Jinan Dataset Hangzhou Dataset
FT 73.16 57.87 21.43
SOTL 67.29 43.56 18.44
SADRL 47.14 ± 16.85 38.41 ± 9.00 17.45 ± 2.02
SABRL 46.11 ± 14.44 38.89 ± 6.21 16.24 ± 1.78
MARDDPG 45.00 ± 19.38 37.08 ± 5.84 15.53 ± 1.77
CILDDQN 43.57 ± 16.75 37.55 ± 5.33 15.10 ± 1.88
MABRL 42.30 ± 16.30 34.29 ± 6.65 14.12 ± 2.31
TABLE VI: The average traveling time performance of different traffic datasets.

IV-C2 Effectiveness of Multi-Agent Methods

We find that the three MA methods achieve improvements of approximately 32%, 17%, and 13% over the SADRL method, and 27%, 16%, and 9% over SABRL, on the three datasets. Moreover, the performance of the MABRL approach is on average 32% higher than that of SADRL, and MABRL performs 14% better than SABRL on the Jinan dataset. Fig. 10 (c) clearly shows that the reward of MABRL is much higher than that of SABRL. Because SA-based methods treat the whole traffic situation as global information, they neglect the communication between intersections, which causes prolonged blockage of the intersections. Besides, the excessively large state and action spaces of a single agent burden network training. In contrast, MAS-based control methods treat each intersection, or a group of several intersections, as one agent that handles its traffic flow independently. Each agent takes into account its own state and the states of other agents. Thus, the MA-based methods obtain a higher reward and a shorter traveling time than the SA control methods in the experiments.
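The burden of a large action space on a single agent can be made concrete with a small calculation. The sketch below rests on two assumptions that the text does not state explicitly: that a single agent selects one of the four phases for every intersection jointly, and that the Jinan and Hangzhou networks contain 12 and 16 intersections, as the first column of Table III suggests. The function name is hypothetical.

```python
def joint_action_count(num_intersections, phases_per_intersection=4):
    """Number of joint phase assignments a single agent would choose from."""
    return phases_per_intersection ** num_intersections


for name, n in (("Jinan", 12), ("Hangzhou", 16)):
    single_agent = joint_action_count(n)
    per_agent = joint_action_count(1)
    print(f"{name}: single agent {single_agent} joint actions, "
          f"each of {n} agents {per_agent} actions")
```

Under these assumptions the joint action space grows exponentially with the number of intersections (4^12 = 16777216 for Jinan), whereas each agent in a multi-agent formulation still chooses among four phases.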

Fig. 12: The effect of different mechanisms on the (a) G-X, (b) Jinan, and (c) Hangzhou datasets.

Comparison G-X Dataset Jinan Dataset Hangzhou Dataset
Reward Waiting Time Reward Waiting Time Reward Waiting Time
Random -25.86 ± 14.98 48.61 ± 15.18 -14.20 ± 6.47 37.43 ± 6.74 -4.77 ± 1.86 15.58 ± 2.00
Original -29.32 ± 14.67 45.44 ± 15.89 -13.74 ± 6.63 36.95 ± 6.88 -4.82 ± 1.61 15.70 ± 1.76
DSCIM -23.06 ± 15.36 42.30 ± 16.30 -11.10 ± 6.35 34.29 ± 6.65 -3.27 ± 2.19 14.12 ± 2.31
TABLE VII: The different performance of MABRL with three traffic datasets.

IV-C3 Impact of Dynamic Self-Cycling Interaction Mechanism

To verify the effectiveness of the DSCIM, we run three control experiments: the original MABRL, MABRL with a random value, and MABRL with DSCIM. Fig. 12 (a) shows that MABRL with a random value performs worse than the original MABRL and MABRL with DSCIM, which we attribute to the uncertainty of the random value. As Fig. 12 (b) shows, the results of the random-value method and the original method differ little. According to Fig. 12 (c), the random-value method performs better than the original method. However, its curve is unstable because the attention mechanism influences the proportion of joint state-action information.

These observations about the original MABRL and MABRL with a random value illustrate that if the AM is not suitable, the control mechanism has adverse effects, and that it is difficult to set an appropriate fixed value that achieves a better result. The reward of MABRL with DSCIM is 19.21% and 21.38% higher than that of the original MABRL and MABRL with a random value on the G-X dataset, respectively. On the other two datasets, MABRL with DSCIM gains even more. As Table VII shows, the waiting times of MABRL with DSCIM are shorter than those of the other two methods. Besides, although the curve of the original MABRL is steady, its final reward is no better than that of MABRL with DSCIM. Thus, compared with the method without an interaction mechanism or with a random value, DSCIM is more flexible and reasonable.

V Conclusion

In this paper, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework that develops the interaction of MA with broad networks. Specifically, BLS replaces the DNN-based network architecture of MADRL, making the approach more flexible and intelligible. Besides, we design DSCIM, inspired by AM, to enhance the interaction between agents. Moreover, we apply the MABRL approach to ITLC to adjust the traffic flow dynamically. The experimental results show that MABRL obtains stable performance across different scenarios compared with other approaches for ITLC. In the future, we will focus on more complex structures and paradigms of MABRL to solve the problems of MAS in a new way, and apply MABRL to more practical and extensive scenarios.


This work was supported in part by National Science Foundation of China under Grant 62001422, China Postdoctoral Science Foundation No. 2021T140622, and Open Fund of IPOC (BUPT) No. IPOC2019A008.