1 Introduction
The world is experiencing a period of rapid urbanization, with more than 60 percent of the world population expected to live in cities by 2025 Prendinger et al. (2013). The sharp increase in the number and length of traffic jams on freeways has led to the use of several dynamic traffic management measures. Traffic network demands have also recently grown in both sophistication and severity. Consequently, the need for optimal and reliable traffic control for urban freeway networks has become increasingly critical.
Generally, there are three freeway control strategies: on-ramp control, mainline control, and Dynamic Speed Limits (DSLs) control. A ramp metering device generates only two signals, red and green (no yellow), and is driven by a smart or basic controller that regulates the flow of traffic entering the freeway according to the current traffic volume. We aim to control the number of vehicles entering the mainstream freeway from the ramp merging area, so that the freeway density is kept below the critical density. To achieve this, optimal coordination of the freeway traffic control measures at the network level is needed.
Machine Learning (ML) is one of the fastest growing areas of science. It has been used in many applications, e.g., traffic signal control Khamis and Gomaa (2014, 2012); Khamis et al. (2012b, a), car-following models Zaky et al. (2015), bioinformatics Khamis et al. (2015); Khamis and Gomaa (2015), etc. Reinforcement Learning (RL) is a machine learning technique that addresses the question of how an autonomous agent can learn behavior through trial-and-error interaction with a dynamic environment to accomplish its goals Kaelbling and Moore (1996). To solve the problem of the optimal metering rate at the network level, a collaborative Multi-Agent System (MAS) Vlassis et al. (2004) based on reinforcement learning is appropriate. A collaborative MAS is defined as a system in which the agents are designed to act together in order to optimize the sequence of actions. The framework of Coordination Graphs (CGs) Guestrin et al. (2001), which is designed based on the dependencies between agents, is utilized. It decomposes the global payoff function into a sum of local payoffs based on those dependencies.
Traffic congestion is a challenging problem faced in everyday life. It has multiple negative effects on average speed, total travel time, fuel consumption, safety (it is a primary cause of accidents), and the environment (air pollution). Hence the need for an intelligent, reliable traffic control system. The objective of this article is to reduce overall traffic congestion in freeways via multiple ramp metering controls plus complementary multiple DSLs controls. Specifically, the aims are to: maximize the freeway traffic flow by always keeping the freeway density within a small margin of the critical ratio (which is calculated from the calibrated fundamental diagram of traffic flow); prevent the breakdown of a downstream bottleneck by regulating the traffic stream speed; and prevent the capacity drop at the merging area via ramp metering control plus complementary DSLs control, which complements the function of ramp metering (RM) when the flow is above the critical ratio.
In this article, a Multi-Agent Reinforcement Learning for Freeways Control (MARLFWC) system for ramp metering and dynamic speed limits is proposed based on a Reinforcement Learning density Control Agent (RLCA). MARLFWC introduces a new microscopic framework at the network level based on collaborative Markov Decision Process (Markov game) modeling and an associated cooperative Q-learning algorithm. The technique incorporates payoff propagation (the MaxPlus algorithm) under the coordination graphs framework, which is particularly suited for optimal control purposes. A model for the local payoff, the joint payoff, and the global payoff is proposed. MARLFWC provides three control designs, fully independent, fully distributed, and centralized, suited for different network architectures.
Preliminary results of this work have been published in Fares and Gomaa (2014, 2015). In this article, a more detailed description of MARLFWC and improvements to it are provided. An adaptive objective function plus DSLs are presented. The new approach for ramp metering plus DSLs has shown considerable improvement over the base case. In addition, a detailed survey of the state-of-the-art work is presented.
The article is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the RL approach, particularly the Q-learning algorithm. In addition, it briefly describes the MaxPlus algorithm and the associated CGs framework. Section 4 illustrates the design of the RLCA. Furthermore, it presents the design of three approaches to solve the freeway congestion problem, particularly the design of the global and joint payoffs, and it describes MARLFWC for adaptive ramp metering plus DSLs. The VISSIM traffic simulator capabilities are also presented. Section 5 presents the results of applying MARLFWC to different traffic networks and compares its performance to the base case. Finally, Section 6 summarizes the conclusions.
2 Related Work
Ramp metering and speed limits fall under advanced traffic management systems and are considered among the most effective traffic control strategies on freeways. Intelligent ramp metering and DSLs depend mainly on the current traffic state. This gives them an advantage over fixed-time control (predefined signal timing) and actuated control (sensors located underground detect the vehicles and change the signal to green). Several different techniques have recently been applied, including intelligent control theory, fuzzy control Yu et al. (2012); Ghods et al. (2009), neural network control Liang et al. (2010); Li and Liang (2009), and even hybrids of them Feng et al. (2011).
In this article, a new framework for ramp metering at the network level is proposed based on modeling by a collaborative Markov Decision Process (Markov Game) Vlassis et al. (2004); Guestrin (2003) and an associated cooperative Q-learning algorithm, which is based on a payoff propagation algorithm Pearl (1988) under the CGs framework. MARLFWC avoids not only the locality of all the techniques mentioned above, but also the computational complexity and the risk of being trapped in a local optimum of Model Predictive Control (MPC) Ernst et al. (2009). In addition, it does not require the solid knowledge of the system that is needed to extract the rules in a fuzzy control system. MARLFWC has been extensively tested in order to assess the proposed model of the joint payoff, as well as the global payoff.
3 Single and MultiAgent Algorithms
In this section, the RL approach is illustrated. The section describes the Q-learning algorithm and shows the difference between traditional Q-learning and the modified one (the temporal difference algorithm). It then briefly describes the MaxPlus algorithm. This algorithm is used to determine the optimal joint action in a connected graph, so the coordination graphs framework is introduced. Finally, the freeway traffic flow model is illustrated in detail, particularly the fundamental diagram of traffic flow.
3.1 Reinforcement learning
The RLCA interacts with the traffic network. The agent receives the traffic state $s$ from the network detectors. This state consists of the number of vehicles and the current density. Based on this information the RLCA chooses an action $a$. As a consequence of $a$, the agent receives a reward $r$ that keeps the density of the network within a small neighborhood of the critical density. The aim of the agent is to learn a policy $\pi$ which maps an arbitrary state $s$ into an optimal action $a$. An optimal action is one that optimizes the long-term cumulative discounted reward; thus the optimization is done over an infinite horizon. Given a policy $\pi$, the value function $V^{\pi}$ corresponding to $\pi$ is given by Eq. 1:
$$V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, \pi \right] \qquad (1)$$
where $r_{t}$ is the reward received at time step $t$ and $\gamma \in [0, 1)$ is the discount factor, which is necessary for the convergence of the previous formula. A policy $\pi^{*}$ is optimal if its corresponding value function is optimal as in Eq. 2, that is, for any control policy $\pi$ and any state $s$ (here it is assumed that the objective function is to be maximized):
$$V^{\pi^{*}}(s) \geq V^{\pi}(s) \quad \forall \pi, \; \forall s \qquad (2)$$
3.2 Q-learning
A Q-learning agent can learn from experienced action sequences in a Markovian environment. In fact, Q-learning is a model-free method for solving MDPs. In their proof of convergence, Watkins and Dayan (1992) assumed a lookup table to represent the value function.
This algorithm is guaranteed to converge to the optimal $Q$ values with probability one under some conditions: the $Q$ function is represented exactly using a lookup table and each state-action pair is visited infinitely often. After convergence to the optimal function $Q^{*}$, the optimal control policy can be extracted as in Eq. 3:
$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a) \qquad (3)$$
And the optimal value function (the function that gives a valuation of the states; it can be viewed as a projection of the $Q$ function over the action space) is represented as in Eq. 4:
$$V^{*}(s) = \max_{a} Q^{*}(s, a) \qquad (4)$$
3.2.1 A modified Q-learning: Temporal difference algorithm
Since there is no model of the environment (the environment is unknown), the infinite sum of discounted rewards is no longer considered as a function of the state only, as in Dynamic Programming (DP) value iteration Bertsekas (1996), but as a function of the state-action pair $(s, a)$. That is why the $Q$ function Watkins and Dayan (1992) is used. When the agent is in state $s$ and performs action $a$, its long-term reward is $Q(s, a)$; if every state-action pair is visited infinitely often, then $Q$ converges to $Q^{*}$ Kaelbling and Moore (1996). Based on this long-term reward the optimal policy is calculated using Eq. (3). Q-learning converges no matter how the actions are chosen during learning, as long as every action is selected infinitely often (a fair action selection strategy). A modification of the traditional Q-learning rule is given in Eq. (5). This scheme is called Temporal Difference (TD) learning, where $\alpha$ is the learning rate, $r$ is the immediate reward, and $s'$ is the next state Kaelbling and Moore (1996).
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \qquad (5)$$
For the learning rate $\alpha$, the easiest approach is to take a fixed value, but it is even better to use a variable one. Eq. (6) is one possibility, in which the learning rate depends on the number of times the action $a$ has been chosen in the state $s$.
(6) 
There are two extreme possibilities for the action selection strategy Kaelbling and Moore (1996). One extreme is to always choose the action randomly. This is called exploration: the agent just explores what the environment gives in terms of feedback. At the beginning of the learning process this is a good idea. But after some time, when the RLCA has already learned, the other alternative can be tried, which is to select the best action. This is called exploitation. In the present case, a combination of exploitation and exploration is used through the $\epsilon$-greedy action selection strategy, that is, with a certain probability $\epsilon$ the agent selects a random action, and otherwise it selects the best known action. The state space, the action space and the reward function of the single control agent are described in Fares and Gomaa (2014) and Subsection 4.1. The joint action space and the (joint and global) payoff functions of the collaborative multiagent system are described in detail in Fares and Gomaa (2015) and Subsection 4.2.
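The temporal-difference update and the mix of exploration and exploitation described above can be sketched as tabular Q-learning on a toy problem. Everything about the environment here is an assumption made for illustration: five discretized density levels, a critical level in the middle, deterministic dynamics in which green raises the level and red lowers it, and a reward that penalizes the distance from the critical level. A fixed learning rate is used for simplicity; a visit-count-based schedule could be swapped in. This is not the VISSIM-based setup of the article.

```python
import random

N_STATES = 5        # hypothetical discretized density levels 0..4
CRITICAL = 2        # hypothetical index of the critical-density level
RED, GREEN = 0, 1   # the two ramp-metering actions


def step(state, action):
    """Toy dynamics (an assumption): green lets traffic in, red holds it back."""
    if action == GREEN:
        nxt = min(state + 1, N_STATES - 1)
    else:
        nxt = max(state - 1, 0)
    reward = -abs(nxt - CRITICAL)  # closer to the critical level is better
    return nxt, reward


def q_learning(steps=20000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    state = 0
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best action.
        if rng.random() < epsilon:
            action = rng.choice((RED, GREEN))
        else:
            action = max((RED, GREEN), key=lambda a: q[state][a])
        nxt, reward = step(state, action)
        # Temporal-difference update toward r + gamma * max_a' Q(s', a').
        target = reward + gamma * max(q[nxt])
        q[state][action] += alpha * (target - q[state][action])
        state = nxt
    return q


def greedy_policy(q):
    """Extract the greedy action for every state from the Q table."""
    return [max((RED, GREEN), key=lambda a: row[a]) for row in q]


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

In this toy setting the learned policy chooses green below the critical level and red above it; at the critical level itself both actions are equally good, so either may be returned.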
3.3 Payoff propagation
In this subsection, the problem of dynamics in MASs is discussed, together with the suggestion that the agents should take the behavior of other agents into account.
3.3.1 Coordination graphs
CGs Guestrin et al. (2001) can be represented as an undirected graph $G = (V, E)$, where each node $i \in V$ represents an agent, and an edge $(i, j) \in E$ between agents means that a dependency exists between them ($i$ and $j$). Thus, they need to coordinate their actions. CGs allow for the decomposition of a coordination problem into several smaller subproblems that are easier to solve. CGs have been represented in the context of probabilistic graphs, where the global function, consisting of many variables, is decomposed into local functions. Each of these local functions depends only on a subset of the variables. The Maximum A Posteriori (MAP) configuration (see Subsection 3.3.2) is then estimated by sending locally optimized messages between connected agents (nodes) in the CGs. Even though the message-passing algorithm was developed for estimating the MAP configuration in a probabilistic graph, it is applicable for estimating the optimal joint action of a group of agents in CGs, because in both situations the function that is being optimized is decomposed into local terms Kok and Vlassis (2006).
In collaborative multiagent systems, the CGs framework Papageorgiou et al. (1991) assumes that the action of an agent depends only on a subset of other agents which may influence its state. The global payoff function is then broken down into a linear combination of local payoff functions. The proposed design of the local and global payoff is found in Subsections 4.2.1 and 4.2.2 respectively. The global function is then decomposed into a sum of local functions given by:
(7) 
Where the joint action results from each agent selecting an action from its own action set, and the state is the joint state. Each local function depends only on the actions of the agent and of the agents in its neighborhood. The decomposition in Eq. (7) is illustrated in Fig. 1(a), an example of a CG of four agents where the dependence among agents is decomposed into pairwise functions Yedidia et al. (2003), each of which is the joint payoff between a pair of neighboring agents.
3.3.2 MaxPlus algorithm
The MaxPlus algorithm approximates the optimal joint action by iteratively sending locally optimized messages between connected agents (neighbors) in the coordination graph. It is similar to belief propagation or the max-product algorithm, which estimates the MAP configuration in Belief Networks Kschischang et al. (2001). MaxPlus was originally developed for computing MAP solutions in Bayesian networks Pearl (1988). The MAP for each agent is estimated using only the incoming messages from other agents. It has been shown that the message updates converge to a fixed point after a finite number of iterations for cycle-free graphs. The message mainly depends on the decomposition given by Eq. (7) and can be defined as follows:
$$\mu_{ij}(a_j) = \max_{a_i} \Big\{ f_i(a_i) + f_{ij}(a_i, a_j) + \sum_{k \in \Gamma(i) \setminus j} \mu_{ki}(a_i) \Big\} \qquad (8)$$
Where: $\Gamma(i) \setminus j$ is the subset of all neighbors connected to agent $i$ except agent $j$, and $\mu_{ij}(a_j)$ is the message from agent $i$ to agent $j$, which evaluates the influence of agent $j$ performing the action $a_j$ on the state of agent $i$. The message approximates the maximum reward agent $i$ can achieve for a committed action $a_j$ of agent $j$. This message is the max-sum of the local payoff $f_i(a_i)$, the joint payoff $f_{ij}(a_i, a_j)$ and all incoming messages agent $i$ received except that from agent $j$. The design of the joint payoff is found in detail in Subsection 4.2.2. Figure 1(b) shows a CG with three agents and the corresponding messages.
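If the graph has no cycles, MaxPlus always converges after a finite number of steps to a fixed point in which the messages change by less than a certain threshold Wainwright et al. (2004). At each time step, the value $g_i(a_i)$ for an action $a_i$ of an agent $i$ can be determined from its local payoff $f_i(a_i)$ and the messages $\mu_{ji}(a_i)$ received from its neighbors $j \in \Gamma(i)$ as follows:
$$g_i(a_i) = f_i(a_i) + \sum_{j \in \Gamma(i)} \mu_{ji}(a_i) \qquad (9)$$
Upon convergence, the optimal joint action $a^{*}$ has the elements $a_i^{*}$, which can be computed as follows:
$$a_i^{*} = \arg\max_{a_i} g_i(a_i) \qquad (10)$$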
In the MaxPlus algorithm, at each iteration an agent sends a message to each neighbor, given the joint payoff and the incoming messages it has received so far, and also computes its current optimal action given all the messages it has received so far; that is why it is called an anytime algorithm Kok and Vlassis (2006). The process continues until all messages converge, or until a deadline signal is received, which means the agents must report their individual actions. In the anytime algorithm, the joint action of Eq. 10 is only updated when the corresponding global payoff improves.
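The message passing described above can be sketched on a small cycle-free coordination graph. The sketch below is a plain synchronous max-plus iteration, without the anytime extension, on a chain of three agents with two actions each. The payoff tables, agent indices, and function names are hypothetical and serve only to show the mechanics; the result is checked against exhaustive enumeration of all joint actions.

```python
import itertools

ACTIONS = (0, 1)           # e.g. 0 = red, 1 = green
EDGES = [(0, 1), (1, 2)]   # chain coordination graph: agent 0 - 1 - 2

# Hypothetical payoff tables, for illustration only.
LOCAL = {0: [1.0, 0.0], 1: [0.0, 0.5], 2: [0.2, 0.0]}
PAIR = {(0, 1): [[0.0, 2.0], [1.0, 0.0]],
        (1, 2): [[0.0, 1.0], [3.0, 0.0]]}


def pair_payoff(i, j, ai, aj):
    """Joint payoff on the edge between i and j, whatever the storage order."""
    if (i, j) in PAIR:
        return PAIR[(i, j)][ai][aj]
    return PAIR[(j, i)][aj][ai]


def neighbors(i):
    return [b if a == i else a for a, b in EDGES if i in (a, b)]


def max_plus(iterations=10):
    # msg[(i, j)][aj]: message from agent i to agent j about j's action aj.
    msg = {(i, j): [0.0] * len(ACTIONS)
           for a, b in EDGES for i, j in ((a, b), (b, a))}
    for _ in range(iterations):
        new = {}
        for (i, j) in msg:
            # Maximize over the sender's action: local payoff + joint payoff
            # + all messages received by i except the one coming from j.
            new[(i, j)] = [
                max(LOCAL[i][ai] + pair_payoff(i, j, ai, aj)
                    + sum(msg[(k, i)][ai] for k in neighbors(i) if k != j)
                    for ai in ACTIONS)
                for aj in ACTIONS]
        msg = new
    # Each agent picks the action maximizing local payoff + incoming messages.
    return tuple(
        max(ACTIONS, key=lambda ai: LOCAL[i][ai]
            + sum(msg[(k, i)][ai] for k in neighbors(i)))
        for i in sorted(LOCAL))


def global_payoff(joint):
    return (sum(LOCAL[i][joint[i]] for i in LOCAL)
            + sum(PAIR[(i, j)][joint[i]][joint[j]] for i, j in EDGES))


def brute_force():
    return max(itertools.product(ACTIONS, repeat=len(LOCAL)), key=global_payoff)


if __name__ == "__main__":
    print(max_plus(), brute_force())
```

On this chain the messages stop changing after a couple of iterations, and each agent recovers its part of the jointly optimal action using only local computations, whereas the brute-force search grows exponentially with the number of agents.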
4 MARLFWC Architecture
Figure 2 demonstrates MARLFWC architecture. MARLFWC consists of two layers; the intelligent control layer and the configuration layer. The configuration layer consists of three blocks:

Ramp metering parameters: configure the ramp metering.

DSLs parameters: configure the DSLs.

Agents’ parameters: configure the control agent.
The main task of the configuration layer is to configure the control layer. The control layer consists of three interacting modules (blocks):

Environment module (VISSIM simulator): models the traffic network.

Multiagent reinforcement-learning module (Computing environment): implements different control strategies.

Interface module (VISSIM interface): facilitates the interaction between the Multiagent reinforcementlearning module and the environment module.
In the following subsections, the main components of the MARLFWC architecture are described in detail, e.g., the RLCA (used in the independent agents (IAs) framework).
4.1 MARLFWC with single ramp and RLCA as control measure
In the following subsections, all the components of the Markov decision process that corresponds to the onramp metering RLCA are described.
4.1.1 State space
4.1.2 Action space
By metering on the onramp, the red and green phases of the ramp metering change in order to control the flow entering the freeway from the merging point. So, the action space is modeled as consisting only of two actions: red and green. The optimal action is then chosen based on Eq. (3).
4.1.3 Reward function
Since the RLCA’s goal is to keep the freeway density within a small neighborhood of the critical density, the reward function is designed to depend on the current freeway density and on how much it deviates from the critical density. Hence, the reward of taking an action in a given state is designed as follows:
(11) 
As long as the Q value in Eq. (5) (as a function of the current state and action) is maximized, the difference between the freeway density and the critical density is minimized. Therefore, the best utilization of the freeway is achieved without causing congestion.
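To make the idea of a density-deviation reward concrete, the sketch below shows one possible form. It is an assumption made for illustration and is not claimed to be Eq. (11): the reward is zero at the critical density and becomes more negative as the freeway density moves away from it in either direction. The function name and the numeric densities are hypothetical.

```python
def density_reward(density, critical_density):
    """Hypothetical reward: 0 at the critical density, negative elsewhere.

    The deviation is normalized by the critical density so the reward is
    dimensionless. This form is an illustrative assumption only.
    """
    return -abs(density - critical_density) / critical_density


if __name__ == "__main__":
    # Hypothetical densities in veh/km/lane with a critical density of 30.
    for rho in (20.0, 28.0, 30.0, 40.0):
        print(rho, density_reward(rho, 30.0))
```

Under this assumed form, maximizing the accumulated reward is the same as keeping the density as close as possible to the critical density, which matches the stated goal of the RLCA.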
4.2 MARLFWC based on multiagent and cooperative Qlearning
4.2.1 Independent agents
The “independent learning” design is the first proposed design to solve the freeway congestion problem (Fig. 3(a)). The Q function of each agent is updated as follows:
(12) 
Where the reward is taken from Eq. (11). With such a conception of the objective function, the agents try to keep the freeway density within a small margin of the critical ratio, which ensures the maximum utilization of the freeway without entering congestion. In this design, the agents observe the environment only partially. An agent observes its state, which is defined as the number of vehicles in the areas of interest associated with that agent, and chooses its local action, either red or green, independently of all other agents' actions. This cheap design is recommended when the distances between ramps are very long.
4.2.2 Coordinated reinforcement learning with Maxplus
The second design is a completely distributed design (Fig. 3(b)). It is a MARL-based design that works at the network level and is based on modeling by a collaborative Markov Decision Process (Markov Game) and an associated cooperative Q-learning algorithm. This design incorporates payoff propagation (the Maxplus algorithm) under the coordination graphs framework, which is particularly suited for optimal control purposes. In this design, the control agent coordinates its actions with its neighboring agents (message-passing algorithm). The agent updates the cooperative Q function globally as follows:
(13) 
Where:

can be computed using Eq. (7).

can be computed using Eq. (6).

R(s,a) is given by the proposed conception of the global reward function, which uses the harmonic mean. Such a design guarantees the balance between the control agents' payoffs, and hence an optimal response to the freeway dynamics, as follows:
(14) 
The maximizing control action in the current state and its associated estimate of the optimal future value can be computed using the MaxPlus algorithm. The joint function can be computed using the proposed design of the joint payoff between two neighboring control agents (such a design guarantees the balance between connected control agents), as follows:
(15) 
This moderate design is recommended when the distances between ramps are short and the traffic network has many ramps.
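The balancing effect of a harmonic-mean global reward can be illustrated numerically. The sketch below is not the article's Eq. (14); it only assumes that each agent's local payoff has been rescaled to a positive value (the harmonic mean is not defined for negative inputs), and the payoff values are hypothetical.

```python
from statistics import harmonic_mean, mean


def global_reward(local_payoffs):
    """Hypothetical global reward: harmonic mean of positive local payoffs."""
    return harmonic_mean(local_payoffs)


if __name__ == "__main__":
    balanced = [0.6, 0.6, 0.6]     # all agents do reasonably well
    unbalanced = [0.9, 0.9, 0.1]   # one agent does badly
    print(global_reward(balanced), global_reward(unbalanced), mean(unbalanced))
```

Because the harmonic mean is dominated by the smallest value, the unbalanced case scores much lower than its arithmetic mean would suggest, so a joint action that sacrifices one agent for the benefit of the others is penalized.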
4.2.3 Centralized agent
The third, extreme design considers a centralized controller (Fig. 3(c)): using a collaborative MAS, the coordinated actions are orchestrated as a single action. The cooperative Q function for the joint actions is updated using a single function, as follows:
(16) 
Where: R(s,a) is given by Eq. (14). Although this design leads to an optimal solution Watkins and Dayan (1992), it is not scalable, as it suffers from the curse of dimensionality. This costly design is recommended when the distances between ramps are short and the traffic network has few ramps.
4.3 MARLFWC for adaptive ramp metering plus DSLs
This section illustrates the new conception of ramp metering objective function. This conception takes into account the ramp queue length. This section also demonstrates how DSLs can complement the ramp metering in order to mitigate the traffic congestion when the network is dense.
4.3.1 MARLFWC another conception of ramp metering objective function
By considering the weighted sum of both the freeway objective and the ramp objective, it is possible to minimize both the normalized difference from the critical density in the freeway and the normalized queue length in the ramp.
The ramp metering objective function is demonstrated as follows:
Minimize O:
Collective Objective Function  (17)  
Freeway Objective Function  
(normalized difference from critical density)  
Ramp Objective Function  
(normalized queue length)  
Adaptive Freeway Weight  
Adaptive Ramp Weight  
Control Parameter 
Where the adaptive freeway weight is adaptive because it depends on the current freeway state: as the density increases, this weight also increases. Likewise, the adaptive ramp weight depends on the current ramp state: as the queue length increases, this weight also increases. Finally, the control parameter is used to determine the importance of each weight (biasing factor). With such a conception of the ramp metering objective function, there is a balance between mitigating the traffic congestion at the freeway (capacity drop) and moving the problem to another road segment.
4.3.2 DSLs objective function
This section presents the analytical argument that changing the velocity affects the freeway density. It demonstrates how a change in velocity affects both the freeway density and the flow, and shows that the slope of the line relating density and flow is the velocity, see Fig. 4.
Freeway Flow (veh/h)  
Freeway Density (veh/km/lane)  
Freeway Velocity (km/h)  
(18) 
Where: is the number of vehicles (veh) (see Fares and Gomaa (2014)) and is the time (h).
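The relation between flow, density, and velocity used in this argument is the fundamental relation of traffic flow, flow = density × velocity. The short sketch below evaluates it for hypothetical values (the densities and speeds are not taken from the experiments) to show how lowering the speed at a fixed density reduces the flow that reaches the merging area.

```python
def flow_veh_per_h(density_veh_per_km, speed_km_per_h):
    """Fundamental relation of traffic flow: flow = density * speed."""
    return density_veh_per_km * speed_km_per_h


if __name__ == "__main__":
    density = 30.0  # hypothetical density in veh/km/lane
    for speed_limit in (100.0, 80.0, 60.0):  # hypothetical speeds in km/h
        print(speed_limit, flow_veh_per_h(density, speed_limit))
```

Read in the other direction, the velocity is the ratio of flow to density, which is the slope of the line from the origin to the operating point in the flow-density diagram.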
DSLs objective function: taking the queue length into account in designing the objective function in Eq. (17) raises a question: how to control the number of vehicles that enter the merging area from the mainstream freeway. Hence, the role of the DSLs can be considered as mainstream metering that complements the ramp metering function. When the ramp metering cannot mitigate the capacity drop at the merging area, the DSLs can complement it by limiting the number of vehicles entering the merging area. That being said, the DSLs objective function is as follows:
(normalized difference from critical density)  (19) 
4.4 The VISSIM simulation environment
In VISSIM there are many functions and parameters which control VISSIM itself and the associated study experiments. These parameters can be assigned through the VISSIM GUI and remain fixed while the experiment is running, or they can be manipulated programmatically via the VISSIM COM interface, which makes it possible to change them at run time. For example, the traffic signal can be changed during the experiment in order to respond to the dynamics of the freeway. The VISSIM COM interface can be programmed in any programming language that is able to handle COM objects. Accordingly, the next section illustrates the advanced capabilities of the VISSIM simulator, particularly ramp control development and DSLs. Some examples of dense networks with different numbers of ramps are given; however, MARLFWC can handle any type of network with any number of ramps.
5 Performance Evaluation
5.1 Multiagent reinforcement learning
The studied network in Fig. 5 consists of a mainstream freeway with three lanes and three metered on-ramps with one lane each. The network has four sources of inflow: one for the mainstream flow and three for the on-ramp flows. It has only one discharge point with unrestricted outflow.
The mainstream freeway consists of three areas of interest: , and , plus three control agents (RLCA) which are located at the entrance points of each ramp. Each area of interest is associated with one control agent. Each area is about as follows: before the ramp, as a merging area and after the ramp. There are four uncontrolled sections; three of them before , (which is about ) and (which is about ) and the fourth is after .
Table 1 presents the demand for both the mainstream freeway and the ramps. This scenario was chosen because, with such demand, the density can exceed the critical density, as was observed during the learning process. Random choice of action can lead to a scenario with density equal to (Veh/km), which is higher than the critical one, which is equal to (Veh/km). The studied network is considered a dense network where a smart control system is highly needed.
Simulation Time (sec)  300  600  900  1200  1500  1800  2100
Freeway Demand Flow (veh/h)  4000  5500  7000  6500  6000  5500  4500
Ramp Demand Flow (veh/h)  1000  1000  1000  1000  1000  1000  1000
Table 2 shows the MARLFWC performance evaluation per agent for the three approaches: independent agents (IAs), coordinated reinforcement learning with Maxplus, and the centralized agent (CA). The reported areas are the freeway area, which runs from the beginning of the first area of interest to the end of the last one, and, for each ramp, the area that starts at the beginning point of the ramp and ends at the end point of its associated area of interest. In the IAs, an agent greedily tries to solve only its local congestion, regardless of the other agents’ problems. This leads to overpopulating another section of the road, hence increasing the average travel time in the affected areas. In contrast, in coordinated reinforcement learning with Maxplus, the agents cooperate to resolve the congestion problem, which leads to a 3.2% improvement over the IAs case. Although the CA gives some further improvement, this costly solution is not recommended compared to RL with Maxplus.
Approach  TT(AVG) (s) per area  Improvement over IAs
IAs  109.05  89.7  112.06  113.07  97.53  80.92  401.79  -
Maxplus  112.87  69.18  100.34  77.36  94.18  54.93  388.85  3.2%
CA  110.05  70.47  99.09  77.9  93.68  54.01  385.32  4%
The line graphs in Fig. 6 show the freeway density associated with each approach of MARLFWC over 2100 seconds of simulation. An important observation is that, in Fig. 6(a), an independent agent tries to maintain its own density within a small neighborhood of the critical density regardless of the other agents’ performance. This leads to overpopulating another section of the road. Hence, there is a dramatic increase in the density of that section above the critical ratio during part of the simulation. This leads to the conclusion that the IAs design does not converge to the optimal solution. Figures 6(b) and 6(c) show that coordinated reinforcement learning with Maxplus and the centralized agent successfully keep the density of the freeway at the required level over the simulation period.
Table 3 provides the results obtained from applying MARLFWC with all three approaches and compares them to the base case (no metering). It can be seen that coordinated reinforcement learning with MaxPlus and the centralized agent show considerable improvement over the base case. The centralized agent and cooperative learning with Maxplus give 6.9% and 6.5% improvements in terms of total travel time and 7.5% and 6.74% improvements in terms of average speed over the base case, respectively.
Metric  Nometering  IAs  Maxplus  Centralized agent
Total travel time (h)  333  331.5  311.5  308.7
Improvement over nometering  -  0.5%  6.5%  6.9%
Average speed (km/h)  45.25  45.49  48.5  48.65
Improvement over nometering  -  0.5%  6.74%  7.5%
An experiment on a dense network with three ramps is provided; however, MARLFWC can handle any type of network with any number of ramps. In addition, the recommendation for this type of network and similar ones is to apply the second approach of MARLFWC, which is cooperative learning with Maxplus.
5.2 Adaptive objective function plus DSLs
The studied network in Fig. 7 consists of a mainstream freeway with two lanes and an onramp with one lane. The network consists of two sources of inflow: for the mainstream flow and for the onramp flow and it has only one discharge point with unrestricted outflow. The mainstream freeway consists of two sections: and . The first section lies before the ramp, it is long. The second section lies after the ramp, it is long. The mainstream freeway capacity is for both lanes, and the onramp capacity is . The network parameters are taken as follows: , , and .
Table 4 presents the demand (for both the mainstream freeway and the ramp) considered in the simulation experiment over a four-hour period. This demand scenario gives the opportunity to study the effect of the on-ramp smart control and the DSLs. On the freeway, the demand flow increases over the first hour, then remains steady at a high level near the capacity of the freeway for two and a half hours, and finally decreases to a low level within one hour. On the ramp, the demand flow rises over the first hour, then remains at a high level near the capacity for half an hour, and finally drops to a low value and remains steady for two and a half hours. The studied network is considered a dense network where a smart control system is highly needed.
Simulation Time (h)  0.5  1  1.5  2  2.5  3  3.5  4
Freeway Flow (veh/h)  2500  3000  3500  3500  3500  3500  2500  1000
Ramp Flow (veh/h)  500  1000  1500  500  500  500  500  500
Table 5 compares the total time spent (TTS, in veh.h) for three different strategies. The total time spent declines when MARLFWC is used for adaptive ramp metering, and for ramp metering plus DSLs, as compared to the base case.
Nometering  Ramp metering  Ramp metering plus DSL  
TTS(veh.h.)  1240.0 


Notably, the new approach for ramp metering plus DSLs has shown considerable improvement over the base case. Both adaptive ramp metering and adaptive ramp metering plus DSLs reduce the total time spent relative to the base case (Table 5). Finally, under an on-ramp queue constraint, the DSLs can complement the ramp metering by preventing traffic breakdown and maintaining the traffic flow at a high level.
6 Conclusions
In this article, the problem of traffic congestion in freeways at the network level has been addressed. A new system for controlling ramp metering and speed limits has been introduced based on a multiagent reinforcement-learning framework. MARLFWC comprises both an MDP modeling technique and an associated cooperative Q-learning technique, which is based on payoff propagation (the Maxplus algorithm) under the coordination graphs framework. For MARLFWC to be as close to optimal as possible, the approach is evaluated using different modes of operation depending on the network architecture. This framework has been tested on the state-of-the-art VISSIM traffic simulator in dense practical scenarios.
The experimental results have been thoroughly analyzed to study the performance of MARLFWC using a concrete set of metrics, namely, total travel time, average speed, and total time spent, while keeping the freeway density at an optimum level close to the critical ratio for maximum traffic flow. The findings show that MARLFWC achieves significant improvement on these three metrics as compared to the base case. It has also been shown that when the ramp metering cannot mitigate the capacity drop at the merging area, the DSLs can complement the ramp metering by adjusting the flow rate on the mainstream freeway via changing the speed limit values. The advantages of applying MARLFWC include saving car fuel, decreasing air pollution, mitigating capacity drop, and saving lives by reducing the chances of car accidents.
7 Acknowledgements
This work is mainly supported by the Ministry of Higher Education (MoHE) of Egypt through a PhD fellowship awarded to Dr. Ahmed Fares. This work is supported in part by the Science and Technology Development Fund (STDF), Project ID 12602 “Integrated Traffic Management and Control System”, and by an E-JUST Research Fellowship awarded to Dr. Mohamed A. Khamis.