1 Introduction
Cities are characterized by
the evolution of their transit dynamics. Originally
meant solely for pedestrians, urban streets were soon shared with
carriages and later with cars. Traffic organization soon became
an issue that led to the introduction of signaling, traffic lights
and transit planning.
Nowadays, traffic lights either have fixed programs or are actuated.
Fixed programs (also referred to as pretimed control) are
those where the timings of the traffic lights
are fixed, that is, the sequences of red, yellow and green phases
have fixed durations. Actuated traffic lights change their
phase to green or red depending on traffic detectors
located near the intersection; this way, actuated
traffic lights are dynamic and adapt to the traffic conditions to
some degree. However, they only take into account the conditions
local to the intersection, which leads to a lack of coordination with
the traffic light cycles of nearby intersections; hence they
are not used in dense urban areas.
Neither pretimed nor actuated traffic lights take into account the current
traffic flow conditions at the city level. Nevertheless, cities have
large vehicle detector
infrastructures that feed traffic volume forecasting tools used
to predict congestion situations. Such information is normally only used
to apply classic traffic management actions, like sending police officers
to divert part of the traffic.
This way, traffic light timings could be improved by means of machine learning algorithms that take advantage of the knowledge about traffic conditions by optimizing the flow of vehicles. This has been the subject of several lines of research in the past. For instance, Wiering proposed different variants of reinforcement learning to be applied to traffic light control
(wiering2004simulation), and created the Green Light District (GLD) simulator to demonstrate them, which was further used in other works like (prashanth2011reinforcement). Several authors explored the feasibility of applying fuzzy logic, like (favilla1993fuzzy) and (chiu1993adaptive). Multiagent systems were also applied to this problem, like (cai2007study) and (shen2011agent).
Most of the aforementioned approaches simplify the scenario to a
single intersection or a reduced
group of them. Other authors propose multiagent systems where each agent
controls a single intersection and where
agents may
communicate with each other to share information to improve coordination
(e.g. in a connected vehicle setup (feng2015real)) or
may receive a piece of shared information to be aware of the
crossed effects on other agents’ performance
(el2013multiagent).
However, none of the aforementioned approaches fully profited from
the availability of all the vehicle flow information, that is,
the decisions taken by those agents were in all cases partially informed.
The main justification for the lack of holistic traffic light
control algorithms is the poor scalability of most algorithms. In
a big city there can be thousands of vehicle detectors and
hundreds of traffic lights. Those numbers amount to huge state
and action spaces, which are difficult to handle by
classical approaches.
This way, the problem addressed in this work is to devise an agent that receives traffic data and, based on it, controls the traffic lights in order to improve the flow of traffic, doing so at a large scale.
2 Traffic Simulation
In order to evaluate the performance of our work, we make use
of a traffic simulation.
The base of a traffic simulation is the network, that is,
the representation of roads and intersections where the vehicles
are to move. Connected to some roads, there are centroids,
that act as sources/sinks of vehicles. The amount of vehicles
generated/absorbed by centroids is expressed in a
traffic demand matrix, or origin-destination (OD) matrix,
which contains one cell per each
pair of origin and destination centroids. During a simulation,
different OD matrices can be applied to different periods of time
in order to mimic the dynamics of the real traffic through time.
In the roads of the network there can be traffic detectors,
which mimic induction loops beneath the ground that are able to
measure traffic data as vehicles pass over them. Typical
measurements that can be taken with traffic detectors include
vehicle counts, average speed and percentage of occupancy.
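As an illustration, the measurements of a single detector over one sampling period could be represented as follows (a minimal sketch; the field names are ours, not from any simulator API):

```python
from dataclasses import dataclass

@dataclass
class DetectorSample:
    """Measurements from one induction-loop detector over one sampling period."""
    vehicle_count: int   # number of vehicles that passed over the detector
    avg_speed: float     # average speed of those vehicles, in km/h
    occupancy: float     # fraction of the period a vehicle was over the loop, in [0, 1]

sample = DetectorSample(vehicle_count=12, avg_speed=38.5, occupancy=0.22)
```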
There can also be traffic lights. In many cases they are used to
regulate the traffic at intersections. In those cases, all the
traffic lights in an intersection are coordinated so that
when one is red, another one is green, and vice versa
(this way, the use of the intersection is regulated so that
vehicles do not block the intersection while trying to reach
an exit of the intersection that is currently in use).
All the traffic lights in the intersection
change their state at the same time. This intersection-level
configuration of the traffic lights is called a phase, and
it is completely defined by the states of each traffic light in the
intersection plus its duration. The different phases in an
intersection form its control plan. The phases in the
control plan are applied cyclically, so the phases
are repeated after the cycle duration elapses. Normally,
control plans of adjacent intersections are synchronized to
maximize the flow of traffic avoiding unnecessary stops.
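The cyclic application of phases described above can be sketched as follows; the phase names and durations are hypothetical example values, not taken from any real control plan:

```python
# Hypothetical control plan: (phase name, duration in seconds) for one intersection.
phases = [
    ("north-south green", 40),
    ("all red", 5),
    ("east-west green", 30),
    ("all red", 5),
]

def active_phase(t):
    """Return the name of the phase active at simulation time t (in seconds).

    Phases repeat after the cycle duration elapses, so we reduce t modulo
    the cycle and walk the phase list until the remaining time fits.
    """
    cycle = sum(duration for _, duration in phases)  # total cycle duration
    t = t % cycle
    for name, duration in phases:
        if t < duration:
            return name
        t -= duration

# At t = 130 s with an 80 s cycle, we are 50 s into the cycle,
# which falls inside the "east-west green" phase.
```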
Urban traffic simulation software can keep models at different levels of abstraction. Microscopic simulators simulate vehicles individually, computing their positions every few milliseconds; the dynamics of the vehicles are governed by a simplified model that drives the behaviour of the driver under different conditions. Macroscopic simulators, in contrast, work in an aggregated way, managing traffic like a flow network in fluid dynamics. There are different variations between microscopic and macroscopic models, broadly referred to as mesoscopic simulators. For our purposes, the proper simulation level is microscopic, because we need information on individual vehicles and their responses to changes in the traffic lights, closely mimicking real-world dynamics in terms of congestion. As third-party simulator we chose Aimsun (casas2010traffic; aimsun2012dynamic), a widely used commercial microscopic, mesoscopic and macroscopic simulator, present both in the private consulting sector and in traffic organization institutions.
3 Preliminary Analysis
The main factor that has prevented further advances in the
traffic light timing control problem is the large scale of
any realistic experiment. On the other hand, there is a
family of machine learning algorithms whose very strength
is their ability to handle large input spaces, namely
deep learning.
Recently, deep learning has been successfully applied to reinforcement
learning, gaining
much attention due to the effectiveness of Deep Q-Networks (DQN)
at playing Atari games using as input the raw pixels of the game
(mnih2013playing; mnih2015human). Subsequent successes
of a similar approach called Deep Deterministic Policy
Gradient (DDPG) were achieved in
(lillicrap2015continuous), which will be used in our
work as reference articles, given the similarity of the
nature of the problems addressed there, namely large continuous
state and action spaces.
This way, the theme of this work is the application of
Deep Reinforcement Learning
to the traffic light optimization problem with a holistic
approach, leveraging deep learning to cope with the large
state and action spaces. Specifically, the hypothesis that
drives this work is that deep reinforcement learning
can be successfully applied to urban
traffic light control, with similar or better performance than other
approaches.
This is hence the main contribution of the present work, along with the different techniques applied to make this application possible and effective. Taking into account the nature of the problem and the abundant literature on the subject, we know that some of the challenges of devising a traffic light timing control algorithm that acts at a large scale are:

Define a sensible state space. This includes finding a suitable representation of the traffic information. Deep learning is normally used with input signals over which convolution is easily computable, like images (i.e. pixel matrices) or sounds (i.e. 1D signals). Traffic information may not be easily represented as a matrix, but as a labelled graph. This is addressed in section 6.4.

Define a proper action space that our agent is able to perform. The naive approach would be to let the controller simply control the traffic light timing directly (i.e. setting the color of each traffic light individually at each simulation step). This, however, may lead to breaking the normal routing rules, as the traffic lights in an intersection have to be synchronized so that the different intersection exit routes do not interfere with each other. Therefore a careful definition of the agent’s actions is needed. This is addressed in section 6.5.

Study and ensure the convergence of the approach: despite the successes of Deep Q-Networks and DDPG, granted by their numerous contributions to the stability of reinforcement learning with value function approximation, convergence of such approaches is not guaranteed. Stability of the training is studied and measures for palliating divergence are put in place. This is addressed in section 6.9.

Create a sensible test bed: a proper test bed should simulate relatively realistically the traffic of a big city, including a realistic design of the city itself. This is addressed in section 7.
4 Related Work
In this section we identify and explore other lines of research that also try to solve the traffic light control problem.
4.1 Offline Approaches
The most simple traffic light control approaches are those that define fixed timings for the different traffic light phases. These timings are normally defined offline (i.e. not in closed loop). Several different approaches have been proposed in the literature for deriving the phase timings, which can be grouped into the following categories (the categorization focuses both on the adaptive nature, or lack thereof, of the approach and on the type of algorithms used and their similarity to the approach proposed in this work):

Model-based: a mathematical model of the target urban area is prepared and then used to derive an optimal timing, either via derivative calculus, numerical optimization, integer linear programming, or any other method. An example of this approach is MAXBAND
(little1966synchronization), which defines a model for arterials and optimizes it for maximum bandwidth by means of linear programming. Another example is the TRANSYT system (robertson1969transyt), which uses an iterative process to minimize the average journey time in a network of intersections. 
Simulation-based: this case is analogous to the model-based one, but the core of the validation of the timings is a traffic simulation engine, connected to a black-box optimization computation that iteratively searches the traffic light timing space to find an optimal control plan. Some examples of this approach are (rouphail2000direct)
, which makes use of genetic algorithms together with the CORSIM simulator
(holm2007traffic), or (garcia2013optimal), which uses particle swarm optimization with the SUMO simulator
(SUMO2012).
The usual way of maximizing the success of these kinds of methods is to analyze historical traffic data and identify time slots with different traffic volume characteristics; once defined, a different timing strategy is derived for each of these time bands. However, not even this partitioning scheme adapts to the dynamism of traffic demand or to progressive changes in drivers' behaviour.
4.2 Model-based Adaptive Approaches
The simplest of these approaches take only one intersection
into consideration. They define a model (e.g. based
on queueing theory) that is fed with real detector
data (normally from the detectors closest to the
intersection). Then, by using algorithmic logic
based on thresholds and rules, like (lin89binary),
or optimization
techniques, like (shao2009adaptive), they try to minimize
waiting times.
More complex approaches are based on traffic network models of several intersections that are fed with the real time data from multiple traffic detectors. Some of these approaches are heuristically defined algorithms that
tune their parameters by performing tests with variations on the aforementioned models. For example, the SCOOT system (hunt1982scoot) performs small reconfigurations (e.g. individual intersection cycle offsets or cycle splits) on a traffic network model. More recent approaches like (tubaishat2007adaptive) make use of the information collected by Wireless Sensor Networks (WSN) to pursue the same goal. There are also approaches where more formal optimization methods are employed on the traffic network models fed with real-time data, like the case of (gartner1983opac), (henry1983prodyn), (boillot1992optimal) or (sen1997controlled), which compute in real time the switch times of the traffic lights within the next following minutes by solving dynamic optimization problems on realistic models fed with data from real traffic detectors.
4.3 Classic Reinforcement Learning
Reinforcement Learning has been applied in the past to urban traffic
light control. Most of the instances from the literature consist of a
classical algorithm like Q-Learning, SARSA or TD(λ) to control
the timing of a single intersection. Rewards are typically
based on the reduction of the travel time of the vehicles or the queue
lengths at the traffic lights. (el2014design) offers a thorough review of
the different approaches followed by a subset of articles from the literature
that apply reinforcement learning to traffic light timing control. As
shown there, many studies use as state space information such as
the length of the queues and the travel time delay; these types of
measures are rarely available in a real-world setup and can therefore
only be obtained in a simulated environment. Most of the approaches
use discrete actions (or alternatively, discretize the continuous
actions by means of tile coding),
and use either ε-greedy selection (choose the action with the
highest Q value with probability 1−ε, or a random action
otherwise) or softmax selection (turn Q values into probabilities
by means of the softmax function and then choose stochastically accordingly).
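The two selection strategies just described can be sketched as follows; this is a minimal illustration over a hypothetical vector of Q values, not code from any of the cited works:

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_selection(q_values, temperature=1.0):
    """Turn Q values into a probability distribution and sample from it."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(q_values)), weights=probs)[0]

# With epsilon = 0, epsilon-greedy always returns the greedy action.
```

The temperature parameter of the softmax controls how peaked the resulting distribution is: low temperatures approach greedy selection, high temperatures approach uniform random selection.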
In most of the applications of reinforcement learning to traffic control,
the validation scenario consists of a single intersection, like in
(thorpe1997vehicle). This is due to
the scalability problems of classical tabular RL approaches: as the number
of controlled intersections increases, so does the
state space, making learning unfeasible, as it becomes impossible
for the agent to apply every action in every possible state.
This led some researchers to study multiagent approaches, with varying
degrees of complexity: some approaches like that of
(arel2010reinforcement) train each agent separately,
without any notion that other agents exist, despite the
coordination problems that this approach poses. Others, like (wiering2000multi),
train each agent separately, but only the intersection with maximum
reward executes its action.
More elaborate approaches, like (camponogara2003distributed),
train several agents together,
modeling their interaction as a competitive stochastic game.
Alternatively, some lines of research like (kuyer2008multiagent)
and (bakker2010traffic)
study cooperative interaction of agents by means of coordination mechanisms,
like coordination graphs (guestrin2002coordinated).
As described throughout this section, there are several examples in the literature of the application of classical reinforcement learning to traffic light control. Many of them focus on a single intersection. Others apply multiagent reinforcement learning techniques to address the problems derived from the high dimensionality of the state and action spaces. Two characteristics of most of the explored approaches are that the information used to elaborate the state space is hardly available in a real-world environment and that no realistic testing environments are used.
4.4 Deep Reinforcement Learning
There are some recent works that, like ours, study the applicability of
deep reinforcement learning to traffic light control:
Li et al. studied in (li2016traffic) the application of deep learning to traffic light timing in a single intersection. Their testing setup consists of a single cross-shaped intersection with two lanes per direction, where no turns are allowed at all (i.e. all traffic either flows North-South (and South-North) or East-West (and West-East)), hence the traffic light set only has two phases. This scenario is therefore simpler than our simple network A presented in 7.2. For the traffic simulation, they use the proprietary software PARAllel MICroscopic Simulation (Paramics) (cameron1996paramics), which implements the model by Fritzsche (fritzsche1994model). Their approach consists of a Deep Q-Network (mnih2013playing; mnih2015human) comprised of a stack of autoencoders (bengio2007greedy; vincent2010stacked) with sigmoid activation functions, where the input is the state of the network and the output is the Q function value for each action. The inputs to the deep Q network are the queue lengths of each lane at time t (measured in meters), totalling 8 inputs. The actions generated by the network are 2: remain in the current phase or switch to the other one. The reward is the absolute value of the difference between the maximum North-South flow and the maximum East-West flow. The stacked autoencoders are pretrained (i.e. trained using the state of the traffic as both input and output) layer-wise so that an internal representation of the traffic state is learned, which should improve the stability of the learning in further fine-tuning to obtain the Q function as output (erhan2010does). The authors use an experience-replay memory to improve learning convergence. In order to balance exploration and exploitation, the authors use an ε-greedy policy, choosing a random action with a small probability ε. For evaluating the performance of the algorithm, the authors compare it with normal Q-learning (sutton1998reinforcement). For each algorithm, they show the queue lengths over time and perform a linear regression plot on the queue lengths for each direction (in order to check the balance of their queue lengths).
Van der Pol explores in (van2016deep) the application of deep learning to traffic light coordination, both in a single intersection and in a more complex configuration. Their testing setup consists of a single cross-shaped intersection with one lane per direction, where no turns are allowed. For the simulation software, the author uses SUMO (Simulation of Urban MObility), a popular open-source microscopic traffic simulator. Given that SUMO teleports vehicles that have been stuck for a long time (see http://sumo.dlr.de/wiki/Simulation/Why_Vehicles_are_teleporting), the author needs to take this into account in the reward function, in order to penalize traffic light configurations that favour vehicle teleportation. Their approach consists of a Deep Q-Network. The author experiments with two alternative architectures, taken verbatim from (mnih2013playing) and (mnih2015human), respectively. Those convolutional networks were meant to play Atari games and receive as input the pixel matrix with bare preprocessing (downscaling and graying). In order to enable those architectures to be fed with the traffic data as input, an image is created by plotting a point on the location of each vehicle. The action space is comprised of the different legal traffic light configurations (i.e. those that do not lead to flow conflicts), among which the network chooses which to apply. The reward is a weighted sum of several factors: vehicle delay (defined as the road maximum speed minus the vehicle speed, divided by the road maximum speed), vehicle waiting time, the number of times the vehicle stops, the number of times the traffic light switches and the number of teleportations. In order to improve convergence of the algorithm, the author applies deep reinforcement learning techniques such as prioritized experience replay and keeping a shadow target network, but also experimented with double Q-learning (hasselt2010double; van2015deep).
They also tested different optimization algorithms apart from normal stochastic gradient descent, such as the ADAM optimizer (kingma2014adam), Adagrad (duchi2011adaptive) or RMSProp (tieleman2012lecture). The performance of the algorithm is evaluated visually by means of plots of the reward and average travel time during the training phase. The author also explores the behaviour of the algorithm in a scenario with multiple intersections (up to four) by means of a multiagent approach. This is achieved by training two neighbouring intersections on their mutual influence; the learned joint Q function is then transferred to higher numbers of intersections.
Genders et al. explore in (genders2016using) the application of deep convolutional learning to traffic light timing. Their test setup consists of a single cross-shaped intersection with four lanes in each direction, where the inner lane is meant only for turning left and the outer lane only for turning right. As simulation software, the authors use SUMO, like the work by Van der Pol (van2016deep) (see previous bullet). However, Genders et al. do not address the teleportation problem and do not take into account its effect on the results. Their approach consists of a Deep Convolutional Q-Network. Like in (van2016deep), Genders et al. transform the vehicle positions into a matrix so that it becomes a suitable input for the convolutional network. They, however, scale the value of the pixels with the local density of vehicles. The authors refer to this representation as discrete traffic state encoding (DTSE). The actions generated by the Q-Network are the different phase configurations of the traffic light set in the intersection. The reward is defined as the variation in cumulative vehicle delay since the last action was applied. The network is fed using experience replay.
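The position-to-matrix transformations used by the works above can be illustrated with a simplified sketch; this binary occupancy encoding is our own illustration, not the exact encoding of either (van2016deep) or the DTSE of (genders2016using):

```python
def positions_to_matrix(vehicle_positions, road_length=100.0, cell_size=5.0):
    """Map 1D vehicle positions along a lane into a binary occupancy list.

    Each cell covers `cell_size` meters of road; a cell is 1.0 if at least
    one vehicle is inside it. Stacking one such list per lane yields a
    matrix suitable as input to a convolutional network.
    """
    n_cells = int(road_length // cell_size)
    cells = [0.0] * n_cells
    for pos in vehicle_positions:
        idx = min(int(pos // cell_size), n_cells - 1)  # clamp to the last cell
        cells[idx] = 1.0
    return cells

# Two vehicles at 3 m and 12 m fall into cells 0 and 2 of a 20-cell lane.
```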
5 Theoretical Background
Reinforcement Learning (RL) aims at training an agent so that it applies actions optimally to an environment based on its state, with the downside that it is not known which actions are good or bad, but it is possible to evaluate the goodness of their effects after they are applied. Using RL terminology, the goal of the algorithm is to learn an optimal policy for the agent, based on the observable state of the environment and on a reinforcement signal that represents the reward (either positive or negative) obtained when an action has been applied. The underlying problem that reinforcement learning tries to solve is that of credit assignment. For this, the algorithm normally tries to estimate the expected cumulative future reward to be obtained when applying a certain action in a certain state of the environment. RL algorithms act at discrete points in time. At each time step $t$, the agent tries to maximize the expected total return $R_t$, that is, the accumulated rewards obtained after each performed action: $R_t = r_{t+1} + r_{t+2} + \dots + r_{T}$, where $T$ is the number of time steps ahead until the problem finishes. However, as $T$ is normally dynamic or even infinite (i.e. the problem has no end), instead of the plain summation of the rewards, the discounted return is used:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \qquad (1)$$

where $\gamma \in [0, 1]$ is the discount factor, which controls the weight of future rewards.
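The discounted return can be sketched numerically for a finite reward sequence; the rewards and discount factor below are arbitrary example values:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute the discounted return: sum over k of gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With rewards [1, 1, 1] and gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
```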
The state of the environment is observable, either totally or partially. The definition of the state is specific to each problem. One example of state of the environment is the position of a vehicle that moves in one dimension. Note that the state can certainly contain information that condenses pasts states of the environment. For instance, apart from the position from the previous example, we could also include the speed and acceleration
in the state vector. Reinforcement Learning problems that depend only on the current state of the environment are said to comply with the Markov property and are referred to as Markov Decision Processes (MDPs). Their dynamics are therefore defined by the probability of reaching state $s'$ from state $s$ by means of action $a$:

$$p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \qquad (2)$$

This way, we can define the expected reward obtained when transitioning from state $s$ to $s'$ by means of action $a$:

$$r(s, a, s') = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right] \qquad (3)$$
Deep Reinforcement Learning
refers to reinforcement learning algorithms that use a deep neural network as value function approximator. The first success of reinforcement learning with neural networks as function approximators was TD-Gammon
(tesauro1995temporal). Despite the initial enthusiasm in the scientific community, the approach did not succeed when applied to other problems, which led to its abandonment (pollack1997did). The main reason for its failure was a lack of stability derived from:
The neural network was trained with values generated on the go; such values were therefore sequential in nature and thus autocorrelated (i.e. not independently and identically distributed).

Oscillation of the policy with small changes to Q-values that change the data distribution.

Too large optimization steps upon large rewards.
Their recent rise in popularity is due to the success of Deep Q-Networks (DQN) at playing Atari games using as input the raw pixels of the game (mnih2013playing; mnih2015human).
In DQNs, there is a neural network that receives the environment state as input and generates as output the Q-values for each of the possible actions; it is trained with the loss function (4), which implies following the direction of the gradient (5):

$$L(\theta) = \mathbb{E}_{s, a, r, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2 \right] \qquad (4)$$

$$\nabla_{\theta} L(\theta) = \mathbb{E}_{s, a, r, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right) \nabla_{\theta} Q(s, a; \theta) \right] \qquad (5)$$

where $\theta^{-}$ are the weights used to compute the target values.
In order to mitigate the stability problems inherent to reinforcement learning with value function approximation, in (mnih2013playing; mnih2015human), the authors applied the following measures:

Experience replay: keep a memory of past action-reward transitions and train the neural network with random samples from it instead of using the real-time data, therefore eliminating the temporal autocorrelation problem.

Reward clipping: scale and clip the values of the rewards to the range
$[-1, 1]$ so that the weights do not boost when backpropagating.

Target network: keep a separate DQN so that one is used to compute the target values and the other one accumulates the weight updates, which are periodically loaded onto the first one. This avoids oscillations in the policy upon small changes to Q-values.
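The first and third measures can be sketched as follows; the buffer capacity and synchronization period are arbitrary example values, and the weights are represented as plain lists for simplicity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the temporal autocorrelation of transitions.
        return random.sample(self.memory, batch_size)

def maybe_sync_target(step, online_weights, target_weights, period=1000):
    """Copy the online network weights onto the target network every `period` steps."""
    if step % period == 0:
        target_weights[:] = online_weights[:]
    return target_weights
```

The target network would then be used to compute the bootstrap term of the loss, while only the online weights receive gradient updates.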
However, DQNs are meant for problems with a few possible actions, and are therefore not appropriate for continuous action spaces like ours. Nevertheless, a recently proposed deep RL algorithm referred to as Deep Deterministic Policy Gradient, or DDPG (lillicrap2015continuous), naturally accommodates this kind of problem. It combines the classical actor-critic RL approach (sutton1998reinforcement) with the Deterministic Policy Gradient (silver2014deterministic). The original formulation of the policy gradient algorithm was proposed in (sutton1999policy), which proved the policy gradient theorem for a stochastic policy $\pi(s, a; \theta)$:
Theorem 1.
(Policy Gradient theorem from (sutton1999policy)) For any MDP, if the parameters $\theta$ of the policy are updated proportionally to the gradient of its performance $\rho$, that is,

$$\Delta \theta = \alpha \frac{\partial \rho}{\partial \theta},$$

then $\theta$ can be assured to converge to a locally optimal policy in $\rho$, the gradient being computed as

$$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a),$$

with $\alpha$ being a positive step size and where $d^{\pi}(s)$ is defined as the discounted weighting of states encountered starting at $s_0$ and then following $\pi$.
This theorem was further extended in the same article to the case where an approximation function is used in place of the true policy value. In these conditions the theorem holds as long as the weight updates of the approximation tend to zero upon convergence. In our reference articles (silver2014deterministic) and (lillicrap2015continuous), the authors propose to use a deterministic policy (as opposed to a stochastic one) approximated by a neural network actor $\mu(s \mid \theta^{\mu})$ that depends on the state of the environment $s$ and has weights $\theta^{\mu}$, and another separate network $Q(s, a \mid \theta^{Q})$ implementing the critic, which is updated by means of the Bellman equation like DQN (5):

$$L(\theta^{Q}) = \mathbb{E}\left[ \left( r + \gamma\, Q\big(s', \mu(s' \mid \theta^{\mu}) \mid \theta^{Q}\big) - Q(s, a \mid \theta^{Q}) \right)^2 \right] \qquad (6)$$

And the actor is updated by applying the chain rule to the loss function (4), updating the weights by following the gradient of the loss with respect to them:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \mu(s)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right] \qquad (7)$$
In order to introduce exploration behaviour, thanks to the DDPG
algorithm being off-policy, we can add random noise to the policy.
This enables the algorithm to try unexplored areas of the action
space to discover improvement opportunities, much like the role of
$\epsilon$ in $\epsilon$-greedy policies in Q-learning.
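For instance, (lillicrap2015continuous) adds temporally correlated Ornstein-Uhlenbeck noise to the actor's output; a minimal sketch follows, where the parameter values are the illustrative defaults commonly associated with that work:

```python
import random

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise process: dx = theta * (mu - x) + sigma * N(0, 1)."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.size, self.mu, self.theta, self.sigma = size, mu, theta, sigma
        self.state = [mu] * size  # start at the long-term mean

    def sample(self):
        # Mean-reverting step plus Gaussian perturbation, per action dimension.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * random.gauss(0.0, 1.0)
            for x in self.state
        ]
        return self.state

# Exploration: action = actor(state) + noise.sample()
```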
In order to improve stability, the same measures as for DQNs can also be applied to DDPG, namely reward clipping, experience replay (by means of a replay buffer, referred to as $R$ in algorithm 1) and a separate target network. In order to implement this last measure for DDPG, two extra target actor and critic networks (referred to as $\mu'$ and $Q'$ in algorithm 1) are kept to compute the target Q values, separate from the normal actor and critic (referred to as $\mu$ and $Q$ in algorithm 1), which are updated at every step and whose weights are used to compute small updates to the target networks. The complete DDPG algorithm, as proposed in (lillicrap2015continuous), is summarized in algorithm 1.
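The small updates to the target networks mentioned above can be sketched as a soft update with a small mixing rate τ (the value below is illustrative, and weights are represented as plain lists):

```python
def soft_update(target_weights, online_weights, tau=0.001):
    """Move the target network weights a small step towards the online ones:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_weights, target_weights)
    ]

# With tau = 0.5, a target weight of 0.0 and an online weight of 1.0 meet at 0.5.
```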
6 Proposed Approach
In this section we explain the approach we are proposing to address the control of urban traffic lights, along with the rationale that led to it. We begin with section 6.1 by defining which information shall be used as input to our algorithm among all the data that is available from our simulation environment. We proceed by choosing a problem representation for such information to be fed into our algorithm in section 6.4 for the traffic state and section 6.6 for the rewards.
6.1 Input Information
The fact that we are using a simulator to evaluate the performance of our proposed application of deep learning to traffic control, makes the traffic state fully observable to us. However, in order for our system to be applied to the real world, it must be possible for our input information to be derived from data that is available in a typical urban traffic setup. The most remarkable examples of readily available data are the ones sourced by traffic detectors. They are sensors located throughout the traffic network that provide measurements about the traffic passing through them. Although there are different types of traffic detectors, the most usual ones are induction loops placed under the pavement that send real time information about the vehicles going over them. The information that can normally be taken from such type of detectors comprise vehicle count (number of vehicles that went over the detector during the sampling period), vehicle average speed during the sampling period and occupancy (the percentage of time in which there was a vehicle located over the detector). This way, we decide to constrain the information received about the state of the network to vehicle counts, average speed and occupancy of every detector in our traffic networks, along with the description of the network itself, comprising the location of all roads, their connections, etc.
6.2 Congestion Measurement
Following the self-imposed constraint to use only data that is actually available in a real scenario, we shall elaborate a summary of the state of the traffic based on vehicle counts, average speeds and occupancy. This way, we defined a measure called speed score, defined for detector $i$ as:

$$\mathit{speed\ score}_i = \min\left( \frac{\bar{v}_i}{v^{max}_i},\ 1 \right) \qquad (8)$$

where $\bar{v}_i$ refers to the average of the speeds measured by traffic detector $i$ and $v^{max}_i$ refers to the maximum speed in the road where detector $i$ is located. Note that the speed score hence ranges in $[0, 1]$. This measure will be the base to elaborate the representation of both the state of the environment (section 6.4) and the rewards for our reinforcement learning algorithm (section 6.6).
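The speed score computation can be sketched as follows; the clipping reflects that the score ranges in [0, 1] even if some vehicles exceed the road's maximum speed:

```python
def speed_score(avg_speed, max_speed):
    """Speed score of a detector: average measured speed relative to the
    road's maximum speed, clipped to the [0, 1] range."""
    if max_speed <= 0:
        raise ValueError("max_speed must be positive")
    return min(avg_speed / max_speed, 1.0)

# A detector averaging 25 km/h on a 50 km/h road scores 0.5.
```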
6.3 Data Aggregation Period
The microscopic traffic simulator used for our experiments divides the simulation into steps. At each step, a small fixed amount of time is simulated and the state of the vehicles (e.g. position, speed, acceleration) is updated according to the dynamics of the system. This amount of time is configured to be 0.75 seconds by default, and we have kept this value. However, such an amount of time is too short to produce a change in the vehicle counts of the detectors. Therefore, a larger period over which the data is aggregated is needed; we refer to this period as the episode step, or simply "step" when there is no risk of confusion. The data is collected at each simulation step and then aggregated every episode step for the DDPG algorithm to receive it as input. In order to properly combine the speed scores of several simulation steps, we take their weighted average, using the vehicle counts as weights. Analogously, the traffic light timings generated by the DDPG algorithm are used during the following episode step. The duration of the episode step was chosen by means of grid search, determining an optimal value of 120 seconds.
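The count-weighted aggregation of simulation-step scores into one episode-step score can be sketched as follows (the zero-traffic fallback value is our assumption, not stated in the text):

```python
def aggregate_speed_score(scores, counts):
    """Combine per-simulation-step speed scores into one episode-step
    value, weighting each score by the vehicle count observed in that
    simulation step."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: no vehicles observed -> neutral score
    return sum(s * c for s, c in zip(scores, counts)) / total
```

This way, a simulation step in which many vehicles passed the detector influences the episode-step score more than a nearly empty one.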
6.4 State Space
In order to keep a state vector of the environment, we make direct use of the speed score described in section 6.2, as it not only properly summarizes the congestion of the network, but also incorporates the notion of the maximum speed of each road. The state vector thus has one component per detector, each defined as shown in (9).
(9)  $s_i = \mathit{speed\_score}_i, \quad i = 1, \dots, N$
The rationale for choosing the speed score is that, the higher the speed score, the higher the speed of the vehicles relative to the maximum speed of the road, and hence the higher the traffic flow.
6.5 Action Space
In the real world there are several instruments to dynamically regulate traffic: traffic lights, police officers, traffic information displays, temporary traffic signs (e.g. to block a road where there has been an accident), etc. Although it is possible to model many of these alternatives in traffic simulation software, we opted to keep the problem at a manageable level and constrain the actions to traffic lights only. The naive approach would be to let our agent control the traffic lights directly, setting the color of each traffic light individually at every simulation step; that is, the actions generated by our agent would be a list with the color (red, green or yellow) of each traffic light. However, traffic lights in an intersection are synchronized: when one of the traffic lights of the intersection is green, traffic in the perpendicular direction is forbidden by setting the traffic lights of that direction to red. This makes it possible to multiplex the usage of the intersection. Letting our agent freely control the colors of the traffic lights would therefore probably lead to chaotic situations. To avoid this, we keep the phases of the traffic lights in each intersection and only control the phase durations; the dynamics thus remain the same, only accelerated or decelerated. This way, if the network has $P$ phases, the action vector has $P$ components, each of them a real number that has a scaling effect on the duration of its phase. However, for each intersection, the total duration of the cycle (i.e. the sum of all phase durations in the intersection) should be kept unchanged. This is important because in most cases the cycles of nearby intersections are synchronized so that vehicles travelling from one intersection to the other can catch the proper phase, thus improving the traffic flow.
In order to ensure that the intersection cycle is kept, the scaling factors of the phases from the same intersection are passed through a softmax function (also known as the normalized exponential function). The result is the ratio of each phase duration over the total cycle duration. In order to ensure a minimum phase duration, the scaling is only applied to 80% of the duration.
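The phase adjustment for one intersection can be sketched as follows. Note that the interpretation of the 80% rule here (each phase keeps 20% of its original duration as a floor, while the softmax ratios redistribute the remaining 80% of the cycle) is our reading of the text, not a confirmed detail:

```python
import math

def adjust_phases(raw_actions, original_durations):
    """Turn the raw actor outputs for one intersection into new phase
    durations. A (numerically stabilized) softmax over the raw actions
    gives each phase's share of the adjustable 80% of the cycle; each
    phase keeps 20% of its original duration as a minimum, so the total
    cycle duration is preserved."""
    cycle = sum(original_durations)
    m = max(raw_actions)
    exps = [math.exp(a - m) for a in raw_actions]
    total = sum(exps)
    ratios = [e / total for e in exps]
    return [0.2 * d + 0.8 * cycle * r
            for d, r in zip(original_durations, ratios)]
```

With equal raw actions every phase gets an equal share of the adjustable part of the cycle, and in all cases the durations sum back to the original cycle length.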
6.6 Rewards
The role of the rewards is to provide feedback to the reinforcement learning algorithm about the performance of the actions taken previously. As discussed in the previous section, it would be possible for us to define a reward scheme that makes use of information about the travel times of the vehicles. However, as we constrain ourselves to the information that is available in real-world scenarios, we cannot rely on measures other than detector data (e.g. vehicle counts, speeds). We therefore use the speed score described in section 6.2. But the speed score alone does not tell whether the actions taken by our agent actually improve the situation or make it worse. Therefore, in order to capture such information, we introduce the concept of baseline, defined as the speed score of a detector during a hypothetical simulation that is exactly like the one under evaluation but without any intervention by the agent, recorded at the same time step. Our reward is then the difference between the speed score and the baseline, scaled by the vehicle count passing through each detector (in order to give more weight to scores where the number of vehicles is higher), and further scaled by a constant factor to keep the reward in a narrow range, as shown in (10).
(10)  $r_i = \alpha \, c_i \left( \mathit{speed\_score}_i - \mathit{baseline}_i \right)$, where $c_i$ is the vehicle count at detector $i$ and $\alpha$ is the constant scaling factor.
Note that we might want to normalize the weights by dividing by the total number of vehicles traversing all the detectors, which would restrain the rewards to a fixed range. This, however, would make the rewards obtained in different simulation steps not comparable (i.e. a lower total number of vehicles in the simulation at a given instant would lead to higher rewards). The scaling factor was chosen empirically, by observing the unscaled values of different networks and choosing a value of an order of magnitude that keeps the scaled rewards small. This is important in order to control the scale of the resulting gradients. Another alternative, used in (mnih2013playing; mnih2015human) with this very purpose, is reward clipping; this, however, implies losing information about the scale of the rewards, so we chose to apply a proper scaling instead. There is one reward computed for each detector at each simulation time step. Such rewards are not combined in any way; they are all used for the DDPG optimization, as described in section 6.8. Given the stochastic nature of the microsimulator used, the results obtained depend on the random seed set for the simulation. Therefore, when computing the reward, the baseline is taken from a simulation with the same seed as the one under evaluation.
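The per-detector reward computation can be sketched as follows (the default value of `alpha` here is a placeholder, since the paper's empirically chosen constant is not preserved in the text):

```python
def detector_rewards(scores, baselines, counts, alpha=0.001):
    """Per-detector rewards: improvement of the speed score over the
    no-intervention baseline, weighted by the vehicle count of each
    detector and scaled by a small constant alpha (placeholder value;
    the paper chooses it empirically per network)."""
    return [alpha * c * (s - b)
            for s, b, c in zip(scores, baselines, counts)]
```

A detector whose speed score matches the baseline contributes a zero reward, while busier detectors contribute proportionally larger positive or negative rewards.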
6.7 Deep Network Architecture
Our neural architecture follows the deep deterministic actor-critic policy gradient approach. It comprises two networks: the actor network $\mu$ and the critic network $Q$. The actor network receives the current state of the simulation (as described in section 6.4) and outputs the actions, as described in section 6.5. As shown in figure 1, the network comprises several layers. It starts with several fully connected (also known as dense) layers with leaky ReLU activations (maas2013rectifier), where the number of units is indicated in brackets, with $N$ being the number of detectors in the traffic network and $P$ being the number of phases of the traffic network. Across those layers, the width of the network first increases and then decreases, down to having as many units as actions; that is, the last of these dense layers has as many units as there are traffic light phases in the network. At that point, we introduce a batch normalization layer and another fully connected layer with ReLU activation. The outputs of this last layer are real numbers in the range $[0, +\infty)$, so we should apply some kind of transformation that allows us to use them as scaling factors for the phase durations (e.g. clipping them to a valid range). However, as mentioned in section 6.5, we want to keep the traffic light cycles constant. Therefore, we apply an elementwise scaling computed on the summation of the actions of the phases in the same traffic light cycle: for each scaling factor, we divide by the sum of all the factors of the phases belonging to the same group (hence obtaining the new ratio of each phase over the cycle duration) and then multiply by the original duration of the cycle. In order to keep a minimum duration for each phase, this computation is only applied to 80% of the duration of the cycle. This computation can be precalculated into a matrix, which we call the phase adjustment matrix, applied in the layer labeled "Phase adjustment" in figure 1, and which finally gives the scaling factors to be applied to the phase durations. This careful scaling, meant to keep the total cycle duration, could be ruined by the exploration component of the algorithm (described in algorithm 1), which consists of adding noise to the actions and would therefore likely break the total cycle duration. For this reason, we implement the injection of noise as another layer prior to the phase adjustment. The critic network receives the current state of the simulation plus the action generated by the actor, and outputs the Q-values associated with them. Like the actor, it comprises several fully connected layers with leaky ReLU activations, plus a final dense layer with linear activation.
6.8 Disaggregated Rewards
In our reference article (lillicrap2015continuous), as well as in landmark ones like (mnih2013playing) and (mnih2015human), the reward is a single scalar value. In our case, however, we build a reward value for each detector in the network. One option to use such a vector of rewards would be to scalarize it into a single value. This, however, would imply losing valuable information regarding the location of the effects of the actions taken by the actor. Instead, we keep the rewards disaggregated, leveraging the structure of the DDPG algorithm, which climbs in the direction of the gradient of the critic. This is partially analogous to a regression problem on the Q-value and hence does not impose constraints on the dimensionality of the rewards. We thus have an $N$-dimensional reward vector, where $N$ is the number of detectors in the network. This extends the policy gradient theorem from (silver2014deterministic) so that the reward function is no longer defined as $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ but as $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^N$. This is analogous to having $N$ agents sharing the same actor and critic networks (i.e. sharing the weights $\theta^\mu$ and $\theta^Q$) and being trained simultaneously over different unidimensional reward functions; effectively, this implements a form of multiobjective reinforcement learning. To the best of our knowledge, disaggregated rewards have not been used before in the reinforcement learning literature. Despite having proved useful in our experiments, further study is needed in order to fully characterize the effect of disaggregated rewards on benchmark problems; this is one of the future lines of research that can be spawned from this work. Such an approach could be further refined by weighting rewards according to traffic control expert knowledge, which would then be incorporated into the computation of the policy gradients.
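The elementwise nature of the disaggregated targets can be sketched as follows (an illustrative helper, not the authors' code; it assumes the critic outputs one Q-value per detector):

```python
def vector_td_targets(rewards, next_q, gamma):
    """TD targets with disaggregated rewards: one target per detector,
    each combining that detector's reward with the critic's predicted
    Q-value for the same component at the next state. This is the usual
    scalar TD target applied elementwise over the reward vector."""
    return [r + gamma * q for r, q in zip(rewards, next_q)]
```

The critic is then regressed against this vector of targets exactly as it would be against a single scalar target, which is why the dimensionality of the reward imposes no extra constraint.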
6.9 Convergence
There are different aspects that needed to be properly tuned in order for the learning to achieve convergence:

Weight initialization has been a key issue in the results obtained by deep learning algorithms. Early architectures could only achieve acceptable results if they were pretrained by means of unsupervised learning so that they could learn the structure of the input data (erhan2010does). The use of sigmoid or hyperbolic tangent activations makes neural networks difficult to optimize due to the numerous local minima of the loss function defined over the parameter space. With pretraining, the exploration of the parameter space does not begin at a random point, but at a point that is hopefully not too far from a good local minimum. Pretraining is no longer necessary for convergence thanks to the use of rectified linear units (ReLUs) (nair2010rectified) and sensible weight initialization strategies. In our case, we tested different random weight initializations (namely Glorot's (glorot2010understanding) and He's (he2015delving)), finally selecting He's approach, which gave the best results.
Updates to the critic: after our first experiments, the divergence of the learning became evident. Careful inspection of the algorithm's byproducts revealed that the cause of the divergence was that the critic network predicted higher outcomes at every iteration, as it is trained according to equation (11), extracted from algorithm 1.
(11)  $y_i = r_i + \gamma \, Q'\!\left( s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'} \right)$

As DDPG learning, like any other reinforcement learning approach with value function approximation, is a closed-loop system in which the target value at a given step is biased by the training at previous steps, drifts can be amplified, thus ruining the learning, as the distance between the desired value of $Q$ and the obtained one grows larger and larger. In order to mitigate this divergence problem, our proposal consists in reducing the coupling by applying a schedule to the value of the discount factor $\gamma$ from Bellman's equation, which is shown in figure 2.
The schedule of $\gamma$ is applied at the level of the experiment, not within the episode. The oscillation in $\gamma$ shown in figure 2 is meant to keep the critic network from entering the regime where the feedback leads to divergence. Discount factor scheduling has been proposed before in (harrington2013robot) with positive results, although in that case the schedule consisted in a decaying rate.

Gradient evolution: the convergence of the algorithm can be assessed through the norm of the gradient used to update the actor network $\mu$. If this norm decreases over time and stagnates around a low value, it is a sign that the algorithm has reached a stable point and that the results might not improve further. Accordingly, in the experiments described in the subsequent sections, monitoring of the gradient norm is used to track progress. The gradient norm can also be controlled in order to avoid overly large updates that make the algorithm diverge, e.g. (mnih2013playing). This mechanism, called gradient norm clipping, consists of scaling the gradient so that its norm does not exceed a certain value, which was established empirically in our case.
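Gradient norm clipping as described above can be sketched in a few lines (over a flat gradient vector, for illustration):

```python
import math

def clip_gradient_norm(grad, max_norm):
    """Gradient norm clipping: if the Euclidean norm of the gradient
    exceeds max_norm, rescale the gradient so its norm equals max_norm;
    otherwise return it unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad
```

Unlike reward clipping, this keeps the direction of the update intact and only bounds its magnitude.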
6.10 Summary
Our proposal is to apply deep deterministic policy gradient, as formulated in (lillicrap2015continuous), to the traffic optimization problem by controlling the traffic light timings. We make use of a multilayer perceptron architecture for both the actor and the critic networks. The actor is designed so that the modifications to the traffic light timings keep the cycle durations. In order to optimize the networks we use stochastic gradient descent, and to improve convergence we employ a replay memory, gradient norm clipping and a schedule for the discount rate $\gamma$. The input state used to feed the network consists of traffic detector information, namely vehicle counts and average speeds, which are combined in a single speed score. The rewards used as the reinforcement signal are the improvements over the measurements obtained without any control action being performed (i.e. the baseline). Such rewards are not aggregated but fed directly as expected values of the critic network.
7 Experiments
In this section we describe the experiments conducted in order to evaluate the performance of the proposed approach. In section 7.1 we present the different traffic scenarios used, while in section 7.5 we describe the results obtained in each one, along with lessons learned from the problems found and hints for future research.
7.1 Design of the Experiments
In order to evaluate our deep RL algorithm, we devised increasing
complexity traffic networks.
For each one, we applied our DDPG algorithm to control the traffic
light timing, but also applied multiagent Q-learning and
random timing in order to have a reference to properly assess
the performance of our approach.
At each experiment, the DDPG algorithm receives as input the
information of all detectors in the network, and generates the timings
of all traffic light phases.
In the multiagent Q-learning implementation, there is one agent
managing each intersection phase. It receives the information
from the few closest detectors and generates the timing for
that phase. Given the tabular nature of Q-learning, both the
state space and the action space need to be categorical. For
this, tile coding is used. Regarding the state space, the tiles
are defined based on the same state space values as DDPG (see
section 6.4), clustered into one of four ranges, which were
chosen empirically. As each agent receives the data of several
detectors, its number of states is the number of combinations
of the discretized values of its input detectors. The action
space is analogous, the generated timing being one of four
discrete values. The selected ratio (i.e. the ratio over the
original phase duration) is applied to the duration of the
phase controlled by the Q-learning agent. As there is one agent
per phase, this is a multiagent reinforcement learning setup in
which agents do not communicate with each other. They do have
overlapping inputs, though, as the data from a detector can be
fed to the agents of several phases. In order to keep the cycle
times constant, we apply the same phase adjustment used for the
DDPG agent, described in section 6.5.
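The tabular state construction described above can be sketched as follows. The range boundaries here are evenly spaced for illustration only; the paper's actual boundaries were chosen empirically and are not preserved in the text:

```python
def discretize(score, boundaries=(0.25, 0.5, 0.75)):
    """Map a speed score in [0, 1] to one of four ranges (tile coding).
    The boundaries used here are illustrative placeholders."""
    for i, b in enumerate(boundaries):
        if score < b:
            return i
    return len(boundaries)

def state_index(scores, boundaries=(0.25, 0.5, 0.75)):
    """Combine the discretized scores of an agent's input detectors into
    a single tabular state index via base-4 encoding, so an agent fed by
    d detectors has 4**d possible states."""
    idx = 0
    for s in scores:
        idx = idx * 4 + discretize(s, boundaries)
    return idx
```

This makes explicit why the number of tabular states grows exponentially with the number of detectors feeding an agent, which is the scaling limitation DDPG is meant to overcome.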
The random agent generates random timings within a fixed range, after which the previously mentioned phase adjustment is applied to keep the cycle durations constant (see section 6.5).
Given the stochastic nature of the microscopic traffic simulator used, the results obtained in the experiments depend on the random seed set for the simulation. In order to address the implications of this, we proceed as follows:

In order for the algorithms not to overfit to the dynamics of a single simulation, we randomize the seed of each simulation. We also take this into account for the computation of the baseline, as described in section 6.6.

We repeat the experiments several times and present the results over all of them (showing the average, maximum or minimum, depending on the case).
7.2 Network A
This network, shown in figure 3, consists of a single intersection of two 2-lane roads. At the intersection, vehicles can either go straight or turn right; turning left is forbidden, simplifying the traffic dynamics and the traffic light phases. There are 8 detectors (on each road there is one detector before the intersection and another one after it). There are two phases in the traffic light group: phase 1 allows horizontal traffic while phase 2 allows vertical circulation. Phase 1 lasts 15 seconds and phase 2 lasts 70 seconds, with a 5-second interphase. Phases 1 and 2 have unbalanced durations on purpose, so that the horizontal road accumulates vehicles for a long time. This gives our algorithm room to easily improve the traffic flow through phase duration changes. The simulation comprises 1 hour and the vehicle demand is constant: for each pair of centroids, there are 150 vehicles.
The traffic demand was defined by hand, with the proper order of magnitude to ensure congestion. The definition and duration of the phases were computed by means of the classical cycle length determination and green time allocation formulas from (webster1958traffic).
7.3 Network B
This network, shown in figure 4, consists of a grid layout of 3 vertical roads and 2 horizontal ones, crossing at 6 intersections, all of which have traffic lights.
Traffic at an intersection can go straight, left or right; that is, all turns are allowed, complicating the traffic light phases, which were generated algorithmically by the software with optimal timings following, as for network A, the classical approach from (webster1958traffic), totalling 30 phases. Four of the six junctions have 5 phases, while the remaining two have 4 and 6 phases, respectively. There are detectors before and after each intersection, totalling 17 detectors. The traffic demand was created in a random manner, but ensuring that enough vehicles are present to congest, and try to collapse, some sections of the network.
7.4 Network C
This network, shown in figure 5, is a replica of the Sants area in the city of Barcelona (Spain). There are 43 junctions, totalling 102 traffic light phases, and 29 traffic detectors. The locations of the detectors match the real world. The traffic demand matches that of the peak hour in Barcelona and presents a high degree of congestion.
The number of controlled phases per junction ranges from 1 to 6, with most junctions having only two phases (note that phases with a very small duration, i.e. 2 seconds or less, are excluded from the control of the agent).
7.5 Results
In order to evaluate the performance of our DDPG approach compared to both tabular Q-learning and random timings on each of our test networks, our main reference measure is the episode average reward of the best experiment trial, where "best" means the trial in which the maximum episode average reward was obtained. (Note that, as described in section 6.6, there is actually a vector of rewards, with one element per detector in the network, which is why we compute the average reward.)
In figure 6 we can see the performance comparison for network A. Both the DDPG approach and classical Q-learning reach the same levels of reward. On the other hand, the difference in convergence between the two approaches is noticeable: while Q-learning is unstable, DDPG remains remarkably stable once it reaches its peak performance.
In figure 7 we can see the performance comparison for network B. While Q-learning stays within the same band of variation along the simulations, DDPG starts to converge. Given the great computational cost of running the full set of simulations for one network, it was not affordable to let it run indefinitely, despite the promising trend.
Figure 8 shows the performance comparison for network C, from which we can see that DDPG and Q-learning perform at the same level, and that this level is below zero, meaning that both actually perform worse than doing nothing. In summary, the performance of DDPG is clearly superior to Q-learning in the simplest scenario (network A), slightly better in a scenario with a few intersections (network B) and at the same level in a real-world network. From the evolution of the gradient for the medium and large networks, we observed that convergence was not achieved, as the gradient norm always remains at the maximum value induced by gradient norm clipping. This suggests that the algorithm needs more training time to converge (probable for network B) or that it diverges (probable for network C). In any case, further study would be needed in order to assess the required training times and convergence improvement techniques.
8 Conclusions
We studied the application of Deep Deterministic
Policy Gradient (DDPG) to increasingly complex
scenarios. We obtained good results in network
A, which is analogous to most of the scenarios
used to test reinforcement learning applied to
traffic light control (see section 4
for details); nevertheless, for such
a small network, vanilla Q-learning performs
on par, though with less stability.
However, when the complexity of the network
increases, Q-learning can no longer scale,
while DDPG can still consistently improve the
obtained rewards. In the real-world scenario,
our DDPG approach is not able to control
the traffic better than doing nothing. The
good trend for network B shown in figure
7 suggests that longer
training times may lead to better results. This
might also be true for network C, but the
extremely high computational costs could not
be handled without a large-scale hardware
infrastructure.
Our results show that DDPG is able to better
scale to larger networks than classical tabular
approaches like Qlearning.
Therefore, DDPG is able
to address, at least partially, the curse of
dimensionality (Goodfellowetal2016) in the
traffic light control domain.
However, it is
not clear that the chosen reward scheme
(described in section 6.6) is
appropriate. One of its weaknesses is
that it judges the performance of the
algorithm based on individual detector
information, treating all detectors
equally. In real-life traffic
optimization it is common to favour some
areas so that traffic flow in arterials or
large roads is improved, at the cost of
worsening small side roads. The same principle
could be applied to engineer a more realistic
reward function from the point of view of
traffic control theory.
In order to properly assess the applicability of
the proposed approach to real world setups, it
would also be needed to provide a wide degree of
variations in the conditions of the simulation,
from changes in the traffic demand to having
road incidents in the simulation.
Another aspect that needs further study is the
effect of the amount and location of traffic
detectors on the performance of the algorithm.
In our networks A and B, there were detectors
at every section of the network, while in network
C their placement was scattered, which is the norm
in real-world scenarios. We observe a loose
relation between the degree of observability
of the state of the network and the performance
of our proposed traffic light timing control
algorithm.
Further assessment about
the influence of observability
of the state of the network would help characterize
the performance of the DDPG algorithm and even
turn it into a means of choosing potential locations
for new detectors in the real world. Also,
the relevance of the provided information
is not the same for all detectors; some of them
may provide almost irrelevant information while
others are key for understanding the traffic state.
This is another aspect that should be further
studied, along with the effect of the noise
present in data delivered by real traffic detectors.
An issue regarding the performance of our
approach is the sudden drops in the rewards
obtained through the training process. This
suggests that the landscape of the reward
function with respect to the actor and critic
network parameters is very irregular, which
leads the optimization to fall into
bad regions when climbing in the
direction of the gradient. A possible future line of
research that addressed this problem could
be applying Trusted Region Policy Optimization (schulman2015trust),
that is, leveraging the simulated nature of
our setup to explore more efficiently the
solution space. This would make the approach
more data-efficient, achieving comparable results
with less training.
We have introduced a
concept that, to the best of our
knowledge, has not been used before in the deep
reinforcement learning literature,
namely the use of disaggregated rewards (described
in section 6.8). This technique
needs to be studied in isolation from other factors
on benchmark problems
in order to properly assess its effect and contribution
to the performance of the algorithms. This is another possible
line of research to be spawned from this work.
On the other hand, we have failed to profit from the
geometric information about the traffic network.
This
is clearly a possible future line of research, that
can leverage recent advances in the application of
convolutional networks to arbitrary graphs, similar
to (defferrard2016convolutional).
Finally, we have verified the applicability of simple deep learning architectures to the problem of traffic flow optimization by traffic light timing control on small and medium-sized traffic networks. However, for larger networks further study is needed, probably along the lines of exploring significantly longer training times, using the geometric information of the network and devising data efficiency improvements.