1 Abstract
This paper uses supervised learning, random search and deep reinforcement learning (DRL) methods to control large signalized intersection networks. The control policy at each intersection is parameterized as a deep neural network that approximates the “best” signal setting as a function of the state of its incoming and outgoing approaches. The traffic model is Cellular Automaton Rule 184, which has been shown to be a parameter-free representation of traffic flow, and which is probably the most efficient implementation of the Kinematic Wave model with a triangular fundamental diagram. We are interested in the steady-state performance of the system, both spatially and temporally: we consider a homogeneous grid network inscribed on a torus, which makes the network boundary-free, and drivers choose random routes. As a benchmark we use the longest-queue-first (LQF) greedy algorithm [arel2010reinforcement]. We find that: (i) a policy trained with supervised learning with only two examples outperforms LQF; (ii) random search is able to generate near-optimal policies; and (iii) the prevailing average network occupancy during training is the major determinant of the effectiveness of DRL policies. When trained under free-flow conditions one obtains DRL policies that are optimal for all traffic conditions, but this performance deteriorates as the occupancy during training increases. For sufficiently high occupancies during training, DRL policies perform very poorly for all traffic conditions, which means that DRL methods cannot learn under highly congested conditions. We conjecture that DRL’s inability to learn under congestion might be explained by a property of urban networks found here, whereby even a very bad policy produces an intersection throughput higher than downstream capacity. This means that the actual throughput tends to be independent of the policy. Our findings imply that it is advisable for current DRL methods in the literature to discard any congested data when training, and that doing this will improve their performance under all traffic conditions. They also suggest that this inability to learn under congestion might be alleviated by combining DRL for free-flow and supervised learning for congested conditions.
Keywords: Traffic signal control, machine learning, deep reinforcement learning
2 Introduction
The use of deep neural networks within reinforcement learning algorithms has produced important breakthroughs in recent years. These deep reinforcement learning (DRL) methods have outperformed expert-knowledge methods in areas such as arcade games, backgammon, the game of Go and autonomous driving [mnih2015human, silver2017mastering, chen2019model]. In the area of traffic signal control, numerous DRL control methods have been proposed both for isolated intersections [li2016traffic, genders2016using] and for small networks [chu2015traffic, chu2019multi, tan2019cooperative, ge2019cooperative]. The vast majority of these methods have been trained with a single (dynamic) traffic demand profile and then validated using another one, possibly including a surge [ge2019cooperative].
A gap in the literature appears to be a consistent analysis of the different aspects of large traffic flow networks that influence the performance of DRL methods. For example, it is not clear if and how network congestion levels affect the learning process, or if other machine learning methods are effective, or if current findings also apply to large networks. This paper is a step in this direction, where we examine the simplest possible DRL setup in order to gain some insight into how the optimal policy changes with respect to different configurations of the learning framework. In particular, we are interested in the steady-state performance of the system, both spatially and temporally: we consider a homogeneous grid network inscribed on a torus, which makes the network boundary-free, and drivers choose random routes.
In the current signal control DRL literature the problem is treated, invariably, as an episodic process, which is puzzling given that the problem is naturally a continuing (infinite-horizon) one. Here, we adopt the continuing approach to maximize the long-term average reward. We argue that in signal control there is no terminal state because the process actually goes on forever. What may appear to be a terminal state, such as an empty network, cannot be considered one because it is not achieved through the correct choice of actions but by the traffic demand, which is uncontrollable. An explanation for this puzzling choice in the literature might be that DRL training methods for episodic problems have a much longer history and are implemented in most machine learning development frameworks. For continuing problems this is unfortunately not the case, and we propose here the training algorithm REINFORCE-TD, which is in the spirit of REINFORCE with baseline [willianms1988toward] but for continuing problems. To the best of our knowledge, this extension of REINFORCE is not available in the literature.
The remainder of the paper is organized as follows. We start with the background section regarding DRL methods and the macroscopic fundamental diagram of urban networks, followed by a survey of related work. Then, we define the problem set up and apply it to a series of experiments that highlight the main properties found here. Finally, the paper concludes with a discussion and outlook section.
3 Background
3.1 The macroscopic fundamental diagram (MFD) of urban networks
Macroscopic models for traffic flow have become increasingly popular after the empirical verification of a network-level Macroscopic Fundamental Diagram (MFD) in congested urban areas [Daganzo2007Urban, Geroliminis2008Existence]. For a given traffic network, the MFD describes the relationship between traffic variables averaged across all lanes in the network. In this paper we will use the flow-density MFD, which gives the average flow on the network as a function of the average density on the network.
The main requirement for a well-defined MFD is that congestion be homogeneously distributed across the network, i.e. there must be no “hot spots” in the network. For analytical derivations it is often also assumed that each lane of the network obeys the kinematic wave model [Lighthill1955Kinematic, richards1956shock] with a common fundamental diagram [daganzo2008analytical, Laval2015Stochastic]. In this way, upper bounds for the MFD have been found using the method of cuts in the case of homogeneous networks. For general networks, [Laval2015Stochastic] show that (the probability distribution of) the MFD can be well approximated by a function of mainly two parameters: the mean distance between traffic lights divided by the mean green time, and the mean red-to-green ratio across the network.
3.2 Reinforcement learning
Reinforcement learning is typically formulated within the framework of a Markov decision process (MDP). At discrete time step $t$ the environment is in state $S_t$, and the agent chooses an action $A_t$ to maximize a function of the future rewards $R_{t+1}, R_{t+2}, \ldots$ with $R_t \in \mathbb{R}$. The state transition probability distribution, which gives the probability of making a transition from state $s$ to state $s'$ using action $a$, is denoted $p(s' \mid s, a)$ and is commonly referred to as the “model”. The model is Markovian since the state transitions are independent of any previous environment states or agent actions. For more details on MDP models the reader is referred to [bellman1957markovian, bertsekas1987dynamic, howard1960dynamic, puterman1994markovian].
The agent’s decisions are characterized by a stochastic policy $\pi(a \mid s)$, which is the probability of taking action $a$ in state $s$. In the continuing case the agent seeks to maximize the average reward:
$$\eta(\pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\pi\left[R_t\right] \tag{1}$$
The term $\mathbb{E}_\pi$ means that the expected value (with respect to the distribution of states) assumes that the policy is followed.
In the case of traffic signal control for large-scale grid networks, methods based on transition probabilities are impractical because the state-action space tends to be too large as the number of agents increases. An alternative that circumvents this curse of dimensionality, and the approach we pursue here, is the family of “policy-gradient” algorithms, where the policy is parameterized as $\pi(a \mid s; \theta)$, typically a neural network. The parameters $\theta$ are adjusted to improve the performance of the policy by following the gradient of cumulative future rewards, given by the identity
$$\nabla_\theta\, \eta(\pi) = \mathbb{E}_\pi\left[G_t \, \nabla_\theta \log \pi(A_t \mid S_t; \theta)\right] \tag{2}$$
as shown in [sutton1999policy] for both continuing and episodic problems. In continuing problems cumulative rewards are measured relative to the average reward $\eta(\pi)$:
$$G_t = \sum_{k=1}^{\infty} \left(R_{t+k} - \eta(\pi)\right) \tag{3}$$
and $G_t$ is known as the differential return. The value function $v_\pi(s)$ is the expected differential return the agent will gain if it starts in state $s$ and executes the policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \tag{4}$$
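As a small illustration of the differential return in the continuing setting, the sketch below estimates the average reward from a finite reward trace and accumulates rewards relative to it. It is only a finite-horizon approximation; the function names and the example trace are hypothetical and not taken from the paper.

```python
def average_reward(rewards):
    """Empirical average reward over a finite trace (stand-in for the long-run limit)."""
    return sum(rewards) / len(rewards)


def differential_returns(rewards, avg=None):
    """Truncated differential returns for a finite trace.

    rewards[t] plays the role of the reward received after the action at step t.
    Each entry is the sum of (reward - average) from step t to the end of the trace.
    """
    if avg is None:
        avg = average_reward(rewards)
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the tail sum already computed.
    for t in reversed(range(len(rewards))):
        running += rewards[t] - avg
        out[t] = running
    return out


if __name__ == "__main__":
    trace = [1, 0, 1, 0]
    print(average_reward(trace))
    print(differential_returns(trace))
```

Because rewards are measured relative to their average, the returns stay bounded on a long trace, which is what makes the continuing formulation workable without a discount factor.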
3.3 Related work
The existing literature is split between two approaches for formulating the large-scale traffic control problem: either a centralized DRL algorithm or a decentralized method with communication and cooperation among multiple agents. The centralized approach [genders2016using, li2016traffic, chu2016large] usually adopts a single-agent learning algorithm, as in many DRL control problems, and tries to tackle the high-dimensional continuous control problem with novel techniques such as memory replay, dual networks and advantage actor-critic [lillicrap2015continuous, mnih2015human]. The decentralized method takes advantage of multiple agents and requires the design of efficient communication and coordination to address the limitation of partial observation by local agents. Current studies [khamis2014adaptive, wei2019colight, tan2019cooperative, gong2019decentralized] often decompose the large network into small regions or individual intersections, and train locally optimal policies separately given reward functions reflecting a certain level of cooperation. It is worth noting that different observational measures of the environment are used as communication information between agents, such as information from neighbouring, downstream or upstream intersections. How to incorporate this communication information to help design the reward function for local agents remains an open question.
Environment modeling, state representation and reward function design are key ingredients in DRL. For the environment emulator, most studies are based on popular microscopic traffic simulation packages such as AIMSUN or SUMO. Recently, FLOW [kheterpal2018flow] has been developed as a computational framework integrating SUMO with some advanced DRL libraries to implement DRL algorithms on ground traffic scenarios. [vinitsky2018benchmarks] provided a benchmark for major traffic control problems, including multiple-intersection signal timing. There also exist studies [chu2015traffic, arel2010reinforcement, ge2019cooperative] that use self-defined traffic models as the environment. Complementary to those microscopic simulation packages, macroscopic models are able to represent the traffic state using cell or link flows. The advantage of macroscopic models is twofold: (i) reducing complexity in state space and computation, and (ii) being compatible with domain knowledge from traffic flow theory such as MFD theory.
Expert knowledge has been included in some studies to reduce the scale of the network control problem. In [xu2018network], critical nodes dictating the traffic network were identified first, before the DRL was implemented, so that the state space can be remarkably reduced. Macroscopic fundamental diagram (MFD) theory cannot provide sufficient information to determine the traffic state of a network on its own, but it can be combined with simulation. For instance, [chu2015traffic] successfully integrated the MFD with a microscopic simulator to constrain the search space of the control policies in their signal design problem. They defined the reward as the trip completion rate of the network, while simultaneously constraining the network to remain under or near the critical density. The numerical experiments demonstrated that their policy trained with the MFD yields a more robust shape of the MFD, as well as better trip completion, compared to a fixed and a greedy policy.
While most of the related studies on traffic control focus only on developing effective and robust deep learning algorithms, few of them have addressed traffic considerations, such as the impact of traffic density. The learning performance of RL-based methods under different densities has not been sufficiently addressed. To the best of our knowledge, [camponogara2003distributed] is the only study that trained an RL policy for specific and varied density levels, but unfortunately their study only accounted for free-flow and mid-level congestion. [dai2011neural] classified the traffic demand into four vague levels and reported that inflow rates of 1000 and 1200 veh/h needed more time for the algorithm to show convergence. But they did not report network density, did not try more congested situations, and did not discuss why convergence was delayed. Most studies only trained RL methods under non-congested conditions; [ge2019cooperative] adopted the Q-value transfer algorithm (QTCDQN) for cooperative signal control on a simple 2×2 grid network and validated the adaptability of their algorithm to dynamic traffic environments with different densities, such as recurring congestion and occasional congestion.
In summary, most recent studies focus on developing effective and robust multi-agent DRL algorithms to achieve coordination among intersections. The number of intersections in those studies is usually limited, so their results might not apply to large open networks. Although signal control is indeed a continuing problem, it has always been modeled as an episodic process. From the perspective of traffic considerations, expert knowledge has only been incorporated in downscaling the size of the control problem or in designing novel reward functions for DRL algorithms. Few studies have tested their methods under different traffic demands, or shed light on the learning performance under different traffic conditions, especially in congested regimes. To fill this gap, our study treats large-scale traffic control as a continuing problem and extends a classical RL algorithm to fit it. More importantly, noticing the lack of traffic considerations on learning performance, we train DRL policies under different density levels and explore the results from a traffic flow perspective.
4 Problem set up
The traffic flow model used in this paper is the kinematic wave model [Lighthill1955Kinematic, richards1956shock] with a triangular flow-density fundamental diagram, which is the simplest model able to predict the main features of traffic flow. The shape of the triangular fundamental diagram is irrelevant due to a symmetry in the kinematic wave model whereby flows and delays are invariant with respect to linear transformations, which renders the kinematic wave model parameter-free; see [Lav16] for the details. This allows us to use an isosceles fundamental diagram, which in combination with a cellular automaton (CA) implementation of the kinematic wave model, produces its most computationally efficient numerical solution method: Elementary CA Rule 184 [wolfram1984cellular].
In a CA model, each lane of the road is divided into small cells $i = 1, \ldots, n$ the size of a vehicle jam spacing, where cell $n$ is the most downstream cell of the lane. The value in each cell, namely $c_i$, can be either “1” if a vehicle is present or “0” otherwise. The update scheme for CA Rule 184, shown in Fig. 1, operates over a neighborhood of length 3, and can be written as:
$$c_i^{t+1} = \left(c_{i-1}^{t} \wedge \neg c_i^{t}\right) \vee \left(c_i^{t} \wedge c_{i+1}^{t}\right) \tag{5}$$
where $c_i^{t}$ is the value of cell $i$ at time step $t$, and cells $i-1$ and $i+1$ are its upstream and downstream neighbors, respectively. The vector $c = (c_1, \ldots, c_n)$ is a vector of bits and (5) is Boolean algebra, which explains the high computational efficiency of this traffic model. Notice that (5) implies that the current state of the system is described completely by the state in the previous time step; i.e. it is Markovian and deterministic. Stochastic components are added by the signal control policy, and therefore our traffic model satisfies the main assumption of the MDP framework.
The signalized network corresponds to a homogeneous grid network of bidirectional streets, with one lane per direction of length
(6) 
To attain spatial homogeneity, the network is defined on a torus: Street ends on the edge of the network are connected so that each street can be thought of as a ring road. This is illustrated in Fig. 2, where we have omitted the cells on the connecting links (to form the torus) to reduce clutter. Notice that in this setting all intersections have 4 incoming and 4 outgoing approaches.
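To make the Rule 184 update concrete, here is a minimal sketch of one time step on a single ring road, which is how each street behaves once the grid is wrapped onto a torus. It ignores intersections and signals, and the function name and the list-of-bits representation are illustrative assumptions rather than the authors' implementation.

```python
def rule184_step(cells):
    """Advance one lane by one time step under Elementary CA Rule 184.

    cells is a list of 0/1 values ordered from upstream to downstream; the lane
    is treated as a ring, so the last cell feeds back into the first one.
    """
    n = len(cells)
    new = [0] * n
    for i in range(n):
        upstream = cells[i - 1]          # index -1 wraps around to the last cell
        here = cells[i]
        downstream = cells[(i + 1) % n]
        # A cell is occupied next step if a vehicle moves in from upstream
        # (cell currently empty) or its own vehicle is blocked from moving out.
        new[i] = int((upstream and not here) or (here and downstream))
    return new


if __name__ == "__main__":
    lane = [1, 1, 0, 0]
    for _ in range(3):
        print(lane)
        lane = rule184_step(lane)
```

All cells are updated from the previous state simultaneously, so the number of vehicles on the ring is conserved from one step to the next.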
Vehicle routing is such that a vehicle reaching the stop line will choose to turn right, turn left or keep going straight with equal probability. This promotes a uniform distribution of density on the network.
Traffic signals operate with only two restrictions: a red-red interval of one time step (of the CA model) to account for the lost time when switching lights, and a minimum green time of 3 time steps. This means that one iteration in the learning framework spans multiple time steps of the CA model.
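The two signal restrictions can be sketched as a small state machine. This is one possible reading, assuming that a switch request is ignored until the minimum green has elapsed and that the red-red step is taken at the moment of switching; the class name `TwoPhaseSignal` and its attributes are hypothetical.

```python
class TwoPhaseSignal:
    """Two-phase signal with a minimum green time and a red-red clearance interval."""

    def __init__(self, min_green=3, all_red=1):
        self.min_green = min_green   # minimum green duration, in CA time steps
        self.all_red = all_red       # red-red duration on each switch (assumed >= 1)
        self.green = "NS"            # approaches that hold (or will next hold) green
        self.green_elapsed = 0       # green steps served in the current phase
        self.clearing = 0            # remaining red-red steps

    def step(self, want_switch):
        """Advance one CA time step; return "NS", "EW", or None during red-red."""
        if self.clearing > 0:
            self.clearing -= 1
            return None
        if want_switch and self.green_elapsed >= self.min_green:
            # Accept the switch: this step is red for everyone.
            self.green = "EW" if self.green == "NS" else "NS"
            self.green_elapsed = 0
            self.clearing = self.all_red - 1
            return None
        self.green_elapsed += 1
        return self.green


if __name__ == "__main__":
    signal = TwoPhaseSignal()
    print([signal.step(want_switch=True) for _ in range(9)])
```

With a controller that always asks to switch, the signal serves three green steps, one red-red step, and then three green steps for the other direction.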
4.1 The DRL framework
Each traffic signal is considered an agent that learns from the environment. There are two possible actions for each agent: turning the light red or green for the North-South approaches (and therefore green or red for the East-West approaches). We do not consider a yellow phase in this paper. The state observable by the agent is a matrix of bits, given by the four incoming and the four outgoing vectors from the CA model, one for each approach to the intersection.
The policy for each traffic signal agent is approximated by a deep neural network, as shown in Fig. 3. It is a 3-layer perceptron with tanh nonlinearity, known to approximate any continuous function with arbitrary accuracy provided the network is “deep enough” [kuurkova1992kolmogorov]. The input to the network is the state observable by the agent, while the output is a single real number that gives the probability of turning the light red for the North-South approaches.
We define the reward at time $t$ as the incremental average flow per lane, defined here as the average flow through the intersection during the corresponding iteration minus the flow predicted by the network MFD at the prevailing density. We will see that this definition of reward is superior to the more standard average flow per lane: in that case the resulting parameter variance is larger, which makes it more difficult for training algorithms to converge. In this context the MFD can be seen as a baseline for the learning algorithm, which reduces parameter variance. But for a baseline to be effective it needs to be independent of the actions taken. To this end, we use the longest-queue-first (LQF) algorithm as a baseline, whose mean MFD is shown as a thick dashed curve starting in Fig. 4.
Because our network is spatially homogeneous and without boundaries, there is no reason why policies should be different across agents, and therefore we will train a single agent and share its parameters with all other agents. After training, we evaluate the performance of the policy by observing the resulting MFD.
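A minimal NumPy sketch of such a policy network is given below. The hidden width, the 8-by-5 state shape, the sigmoid on the output unit and the function names are assumptions made for illustration; the text only specifies a 3-layer perceptron with tanh nonlinearity and a single probability as output.

```python
import numpy as np


def init_policy(n_inputs, hidden=16, seed=0):
    """Draw weights for a 3-layer perceptron from a standard normal; biases start at zero."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, hidden, hidden, 1]   # hidden width is a hypothetical choice
    return [(rng.standard_normal((m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]


def red_probability(params, state):
    """Probability of giving red to the North-South approaches for a bit-matrix state."""
    x = np.asarray(state, dtype=float).ravel()
    for weights, bias in params[:-1]:
        x = np.tanh(x @ weights + bias)
    weights, bias = params[-1]
    z = (x @ weights + bias)[0]
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid output (assumed)


if __name__ == "__main__":
    state = np.zeros((8, 5))                # 4 incoming + 4 outgoing lanes, 5 cells each (assumed)
    params = init_policy(state.size)
    p_red = red_probability(params, state)
    action_is_red = np.random.default_rng(1).random() < p_red
    print(p_red, action_is_red)
```

Since all agents share parameters, the same `params` would be evaluated at every intersection, each with its own local state.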
4.2 The training algorithm REINFORCE-TD
In this paper we propose the training algorithm REINFORCE-TD, which is in the spirit of REINFORCE with baseline [willianms1988toward] but for continuing problems. To the best of our knowledge, this extension of REINFORCE is not available in the literature. Notice that we tried other methods in the literature with very similar results, so REINFORCE-TD is chosen here since it has the fewest hyperparameters: two learning rates, one for the policy parameters and one for the average reward. Both learning rates were set using a grid search.
Recall that REINFORCE is probably the simplest policy gradient algorithm that uses (2) to guide the parameter search. In the episodic setting it is considered a Monte-Carlo method since it requires full episode replay, and it has been considered incompatible with continuing problems in the literature [sutton2018reinforcement]. Here, we argue that a one-step Temporal Difference (TD) approach [sutton1988learning] can be used instead of the Monte-Carlo replay to fit the continuing setting. This boils down to estimating the differential return (3) by the temporal one-step differential return of an action:
$$G_t \approx R_{t+1} - \eta(\pi) \tag{7}$$
where $R_{t+1}$ is the reward received after the action and $\eta(\pi)$ is the average reward. Notice that the second term in this expression can be interpreted as a baseline in REINFORCE, and baselines are known to reduce parameter variance. The pseudocode is shown in Algorithm 1.
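Since Algorithm 1 is not reproduced in the text, the following is only a sketch of what a single REINFORCE-TD update might look like under one reading of the description above: the one-step differential return is taken as the reward minus a running estimate of the average reward, and there are two learning rates. To stay short it uses a logistic policy instead of the 3-layer perceptron; `alpha`, `beta` and all function names are hypothetical.

```python
import numpy as np


def reinforce_td_step(theta, avg_reward, state, action, reward, alpha=0.01, beta=0.01):
    """One policy-gradient update for a Bernoulli (red / not red) logistic policy.

    theta      : policy weights (logistic stand-in for the neural network)
    avg_reward : running estimate of the average reward, used as the baseline
    action     : 1 if red was given to North-South, 0 otherwise
    """
    x = np.asarray(state, dtype=float).ravel()
    p_red = 1.0 / (1.0 + np.exp(-(theta @ x)))
    delta = reward - avg_reward            # assumed one-step differential return
    grad_log_pi = (action - p_red) * x     # gradient of log pi(action | state) for this policy
    theta = theta + alpha * delta * grad_log_pi
    avg_reward = avg_reward + beta * delta
    return theta, avg_reward


if __name__ == "__main__":
    theta, avg = np.zeros(3), 0.0
    theta, avg = reinforce_td_step(theta, avg, [1, 0, 1], action=1, reward=1.0,
                                   alpha=0.1, beta=0.5)
    print(theta, avg)
```

Actions followed by a better-than-average reward become more likely, and the average itself is tracked online, so no episode boundary is needed.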
5 Experiments
In this section we perform a series of experiments to highlight the main properties of our problem. The different policies will be compared based on the MFD they produce once deployed to all intersections in the network. The MFD for each policy is obtained by simulating this policy for several network densities and reporting the average flow in the network after 40 time steps. This process is repeated 20 times for each density value to obtain an approximate 95% confidence interval (mean ± 2 standard deviations) of the flow for each density value, shown as the shaded areas in all flow-density diagrams that follow.
As a visual benchmark we also use a greedy method, i.e. the longest-queue-first (LQF) policy, whose mean MFD is shown as a thick dashed curve on the following flow-density diagrams (only the mean values are reported here to avoid clutter). In this way, we are able to test the hypothesis that the policy outperforms LQF simply by observing if the shaded area is above the dashed line. In particular, we will say that a policy is “optimal” if it outperforms LQF, “competitive” if it performs similarly to LQF (shaded area overlaps with dashed line), and “suboptimal” if it underperforms LQF (shaded area below the dashed line).
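The comparison rule described above can be sketched for a single density value as follows. The helper names and the sample flows are hypothetical; in the experiments the flows would come from repeated simulation runs at a given density.

```python
import statistics


def flow_band(flows):
    """Approximate 95% interval (mean ± 2 sample standard deviations) of repeated flows."""
    mean = statistics.mean(flows)
    spread = 2 * statistics.stdev(flows)
    return mean - spread, mean + spread


def classify(flows, lqf_mean_flow):
    """Label a policy at one density by where its band lies relative to the LQF mean flow."""
    low, high = flow_band(flows)
    if low > lqf_mean_flow:
        return "optimal"
    if high < lqf_mean_flow:
        return "suboptimal"
    return "competitive"


if __name__ == "__main__":
    flows = [0.30, 0.32, 0.31, 0.29, 0.30]   # made-up repetitions at one density
    print(flow_band(flows))
    print(classify(flows, lqf_mean_flow=0.25))
```

Repeating this for every density value on the MFD gives the same visual test as checking whether the shaded area lies above, on, or below the dashed curve.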
5.1 Random policies
In this experiment the weights for the policy are set according to a standard normal distribution. As illustrated in Fig. 4, it is possible to find a competitive policy after just a few trials, as in trial 9 in the figure; similar random search methods for RL problems can be found in [mania2018simple, camponogara2003distributed]. A visual analysis of a large collection of such images reveals that about 15% of these random policies are competitive.
Notice that this figure also reveals that all policies, no matter how bad, are optimal when the density exceeds approximately 75%. To the best of our knowledge, this property of signalized traffic networks has not been reported previously. The closest study is [camponogara2003distributed], which compared the average delay of random search, LQF and RL-based policies under densities ranging from 0.1 to 0.75. However, they did not show any results for densities higher than 0.75, and no explanation or discussion was provided. Even so, their results for densities from 0.1 to 0.75 are consistent with our finding and support the density threshold of 0.75 revealed in our study.
To see this property, consider the upper and lower bounds on the MFD in the figure as a way of defining a feasibility region for the MFD. The existence of this lower bound is unexpected, and the fact that it overlaps with the upper bound for congested densities means that intersection throughput is not affected by the control.
A possible explanation is that under heavily congested conditions there will always be a queue waiting to discharge at all intersections, and therefore which approach gets the green becomes irrelevant.
5.2 Supervised learning policies
In this section we report a rather surprising result: training the policy with only two examples yields a near-optimal policy. These examples are shown in Fig. 5 and correspond to two extreme situations where the choice is trivial: the left panel shows extreme state $s_1$, where both North-South approaches are empty and the East-West ones are at jam density (and therefore red should be given to the North-South approaches with probability one), while the middle panel shows $s_2$, the opposite situation (and therefore red should be given to the North-South approaches with probability zero); in both cases all outgoing approaches are empty. The training data is simply:
$$\left\{(s_1, 1),\ (s_2, 0)\right\} \tag{8}$$
The figure also shows the MFD resulting from this policy, where it can be seen that it outperforms our benchmark for all densities.
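The two-example training set can be illustrated with a short sketch. It makes several assumptions that are not in the paper: the state is an 8-by-5 bit matrix with rows 0-1 the North-South incoming lanes and rows 2-3 the East-West incoming lanes, and a logistic regression trained by gradient descent on the cross-entropy loss stands in for the neural network.

```python
import numpy as np


def extreme_states(n_cells=5):
    """Build the two extreme states; outgoing lanes (rows 4-7) are empty in both."""
    s1 = np.zeros((8, n_cells))
    s1[2:4] = 1            # East-West incoming at jam density, North-South empty
    s2 = np.zeros((8, n_cells))
    s2[0:2] = 1            # North-South incoming at jam density, East-West empty
    return s1, s2


def fit_two_examples(s1, s2, steps=500, lr=0.5):
    """Fit a logistic model so that P(red for NS) is near 1 at s1 and near 0 at s2."""
    X = np.stack([s1.ravel(), s2.ravel()])
    y = np.array([1.0, 0.0])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                       # gradient of cross-entropy w.r.t. the logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


def p_red_ns(w, b, state):
    """Probability of giving red to the North-South approaches."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(state, dtype=float).ravel() @ w + b)))


if __name__ == "__main__":
    s1, s2 = extreme_states()
    w, b = fit_two_examples(s1, s2)
    print(p_red_ns(w, b, s1), p_red_ns(w, b, s2))
```

For intermediate states the fitted model interpolates between the two extremes, which is roughly a soft version of giving green to the more occupied direction.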
5.3 DRL policies
Here we train the policy using DRL, as discussed earlier. We include two experiments. In the first one the policy is trained under a constant number of vehicles in the network, and we show that as soon as congestion builds up the learning process deteriorates. The second experiment considers the standard definition of reward in the literature, and we show that it produces slower convergence.
5.3.1 Constant demand
In these experiments we consider a constant traffic demand, i.e. the density of vehicles in the network is kept constant during the entire training process.
The results for three levels of demand, and for random and supervised initial conditions for the policy weights, are shown in Fig. 6 and Fig. 7, respectively. Each row corresponds to a constant density level, while the first column depicts the NS red probabilities of the two extreme states described in section 5.2 as a function of the iteration number; these probabilities should tend to the values in (8) for “sensible” policies. These figures reveal considerable insight:

the first column in Fig. 6 reveals that a sensible policy cannot be achieved for . This is apparent because probabilities and converge to the wrong values. We have verified that for congested traffic conditions with this result is observed.

for in Fig. 6 all the policies obtained are suboptimal and deteriorate as density increases.

the best policies in Fig. 6, albeit only competitive, are obtained for free-flow conditions, with lower density leading to slightly better policies.

Fig. 7 shows that even starting with initial parameters from the supervised experiment, the additional DRL training under congested conditions leads to a deterioration of the policy. Under free-flow, conversely, policies seem to improve slightly.
These observations indicate that DRL policies lose their ability to learn and deteriorate as density increases. A possible explanation that deserves further research is elaborated momentarily in the discussion section. We conjecture that this result is a consequence of a property of congested urban networks and has nothing to do with the algorithm to train the DRL policy.
5.3.2 Nonincremental rewards
In this experiment we define the reward as the average flow per lane through the intersection during each iteration, without subtracting the MFD flow at the prevailing density as in all previous experiments. This is the standard definition in the literature, and we show here that it produces slower convergence than incremental rewards. This is shown in Fig. 8, which depicts the NS red probabilities of the two extreme states for random initial parameters, for the same density levels as in the previous experiments. Comparing these results with the first column in Fig. 6, we can see that within the first 2000 or so iterations these probabilities tend to the wrong values but then converge to the expected ones, except for the heavily congested case.
6 Discussion and outlook
This paper exposed several important properties of machine learning methods applied to traffic signal control on large networks. We have raised more questions than answers at this point, but our future research will focus on formalizing and extending these results. It is important to note that we have verified that these results remain true for configurations not shown here, i.e. for different (i) numbers of cells in each lane and minimum green times, (ii) numbers of intersections in the network, and (iii) DRL training algorithms.
Based on our results, and to facilitate this discussion, we argue that networks have four distinct traffic states: extreme free-flow, moderate free-flow, moderate congestion and extreme congestion; see Fig. 9. To see this, we reason as follows. We know that traffic is symmetric with respect to the critical density: free-flow traffic and congested traffic share the same mathematical properties. In particular, close to the critical density, in moderate free-flow and moderate congestion, the network flow exhibits moderate variance, but in extreme free-flow or extreme congestion the network flow becomes deterministic, as can be confirmed from the numerous simulated MFDs shown here (or from the method of cuts in [Laval2015Stochastic]).
We found a property of congested urban networks, namely the congested network property, that makes DRL methods unable to find sensible policies under congested traffic conditions. We conjecture that this behavior is consistent with the shape of the MFD bounds uncovered in section 5.1, whereby the more the congestion, the less the policy affects intersection throughput. This tendency of the flow to be independent of the policy under congestion renders gradient information less meaningful, which corrupts the learning process. Even starting with initial weights given by the supervised training policy, we saw that additional training under congested conditions leads to a deterioration of the policy. We have also verified similar behavior under dynamic demands whenever congestion appears in the network.
To the best of our knowledge, this behavior has been mentioned only once in the literature [de2006reinforcement], but no follow-up research has been generated since. This is unfortunate because it means, potentially, that all the DRL methods proposed in the literature to date are unable to learn as soon as congestion appears on the network. It also means that the limited success of DRL for traffic signal control might be explained by the congested network property, which has been overlooked so far. The current explanation in the literature for the limited success of DRL is instead that the problem is non-stationary and/or non-Markovian [choi2000hidden, da2006dealing], which probably explains why we still have not solved the problem.
It is important to investigate new DRL methods able to cope with the congested network property, i.e. methods able to extract relevant knowledge from congested conditions. In the meantime, it is advisable to train DRL policies under free-flow conditions only, discarding any information from heavily congested ones. We have shown here that such policies are nearly optimal for all traffic conditions. This intriguing result indicates that most of what the agent needs to learn is encoded in free-flow conditions, and that the data in congestion is irrelevant. We will challenge this idea in future studies with networks that are not “ideal” as they are here.
We suspect that the congested network property will still hold for general networks because every network should have a lower bound for the MFD that is greater than zero under congestion. To see this, consider a very bad policy that gives the green to the smallest queue at every time instant. In free-flow conditions it is likely that the smallest queue will be an empty queue, and therefore the throughput would be zero. But under congestion the smallest queue will tend to be greater than zero, and therefore the throughput has no choice but to increase. We can formalize this idea by assuming that at each time instant the number of vehicles in the network is described by a Bernoulli process with a given probability. It follows that the number of vehicles in the NS approaches is binomial, with the number of trials equal to the total number of cells of both NS approaches and the success probability equal to that of the Bernoulli process; the same holds for the number of vehicles in the EW approaches. This can be used to obtain the flow through the intersection (provided outgoing approaches do not block traffic) simply by dividing by the number of cells to obtain the density and multiplying by the free-flow speed of 1 to obtain the flow. Therefore, the distribution of the number of vehicles flowing through the intersection under this bad policy is simply a linear transformation of the distribution of the minimum of these two binomial random variables. It turns out that the percentile function of this distribution for small probabilities exhibits the desired shape for the MFD lower bound. The left panel of Fig. 10 shows this function for selected probabilities along with the MFD for the LQF policies for reference, where the infeasible region for the MFD becomes apparent. Similar behavior is observed for other parameter values, not shown here for brevity, and therefore these lower bounds should exist in general networks.
For the upper bounds, the method of cuts provides the answer, and the reader is referred to [daganzo2008analytical, Laval2015Stochastic] for the details. Given this upper bound, it becomes clear that the congested states beyond the intersection point “A” in Fig. 9 obey very different rules compared to congestion near capacity. In this extreme congestion state the control has absolutely no influence on network flow, which is deterministic; this explains the failure of DRL methods to learn (because there is nothing to learn!). In moderate congestion, closer to the critical density, the flow is stochastic and the lower bound starts activating, causing the distance between upper and lower bounds to decrease with density, which might explain the learning difficulties in this traffic state. Moderate free-flow is similar, except that the lower bound remains at zero flow. Finally, extreme free-flow is different from extreme congestion because in the former the lower bound is zero.
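The lower-bound argument based on the minimum of two binomial counts can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not the code used to produce Fig. 10: the function names, the choice of 20 cells, the 5% percentile and the density values are all hypothetical, and the NS and EW counts are assumed to be independent and identically distributed, with the same number of cells in each direction. It uses the identity P(min(X, Y) ≤ x) = 1 − P(X > x)² for independent, identically distributed counts, and then converts vehicles to flow by dividing by the number of cells and multiplying by the free-flow speed of 1.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    if x < 0:
        return 0.0
    upper = min(x, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(upper + 1))


def min_binom_cdf(x, n, p):
    """P(min(X_NS, X_EW) <= x) for two i.i.d. Binomial(n, p) counts (assumption)."""
    survival = 1.0 - binom_cdf(x, n, p)  # P(X > x) for a single direction
    return 1.0 - survival**2


def lower_bound_flow(q, n, p, free_flow_speed=1.0):
    """q-th percentile of the flow served by the smallest-queue policy.

    n: total number of cells in the two approaches of one direction (hypothetical value)
    p: Bernoulli probability that a cell is occupied
    """
    for x in range(n + 1):
        if min_binom_cdf(x, n, p) >= q:
            # vehicles -> density (divide by cells) -> flow (times free-flow speed)
            return free_flow_speed * x / n
    return free_flow_speed  # guard against floating-point round-off


if __name__ == "__main__":
    n_cells = 20  # hypothetical number of cells
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, lower_bound_flow(0.05, n_cells, p))
```

Plotting `lower_bound_flow` against the occupancy probability for a few small percentiles should give a curve that stays at zero for low occupancy and rises under congestion, which is the qualitative shape described in the text.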
The bounds discussed above suggest that, at any given density, the distance between the upper bound in congestion and the lower bound is a measure of the potential for DRL methods to learn. This “learning potential” is maximum in free-flow and shrinks until it disappears at point “A” in Fig. 9. But in practice this learning potential might be much narrower than shown in the figure, which would explain why learning difficulties start near the critical density. To see this, consider the median outflow in Fig. 10 (left), which indicates that most of the time a bad policy will produce remarkably high flows, with a median outflow that can be around 0.2. But this high outflow will be above the upper bounds in Fig. 9. This means that most of the time the throughput of an intersection is dictated by the infrastructure (the upper bounds) rather than the policy. This is the simplest explanation consistent with the results in this paper, but research is needed to understand how to use it for learning under congested conditions.
Notably, we also found that supervised learning with only two examples yields a near-optimal policy. This intriguing result indicates that the two extreme states used as examples encode vital information and that the neural network can successfully extrapolate to all other states. (Notice, however, that this result cannot be obtained if the input to the neural network is a matrix with dimensions other than those used in this paper.) Understanding precisely why this happens could lead to very effective supervised learning methods based on expert knowledge, which could compensate for DRL’s inability to learn under congested conditions.
Combining the results in this paper with those in [Laval2015Stochastic], we conjecture that a necessary condition for a policy to be optimal under congestion is that the average green time given to any incoming approach be proportional to the length of the approach. To see this, recall that [Laval2015Stochastic] show that the MFD can be well approximated by a function of mainly two parameters: the mean distance between traffic lights divided by the mean green time, and the mean red-to-green ratio across the network. Since our network is spatially homogeneous, the mean red-to-green ratio is known. Using this in equation (17b) of [Laval2015Stochastic], we infer that in the deterministic case of (17b) the slope of the MFD in extreme congestion (see Fig. 9) is given by:
(9) 
The reader can verify the value of this slope from the many MFDs shown here; the resulting ratio implies that the mean green time produced by the policy matches the mean distance between traffic lights (in dimensionless form). Notice that blocks that are short compared to the green time were shown in [Laval2015Stochastic] to define the short-block condition, i.e., the network becomes prone to spillback, which can have a severe effect on capacity. Conversely, a network with long blocks (compared to the green time) will not exhibit spillback. Therefore, it is not surprising that an optimal policy produces green times that match the block length, as this indicates that green times are just long enough not to produce spillback. This highlights the importance of considering segment length when deciding signal timing, a subject rarely mentioned in the signal control literature.
7 Acknowledgements
This study has received funding from NSF research projects # 1562536 and # 1826162.