Effective traffic signal control is one of the key issues in Urban Traffic Control (UTC), as it decides how the available resources (green time) in our urban travel networks are allocated. The efficiency of this allocation has an important impact on travel times, harmful emissions and economic activity.
First fixed-time controllers and, later, adaptive systems have been used to optimise the global traffic flow in our cities. Recent improvements in CPU and especially GPU power are allowing vision-based sensors to gather large amounts of real-time data that seemed unattainable a few years ago, such as individual vehicle positions and speeds, at a much lower marginal cost than would be feasible with traditional actuated sensors. As a side effect of these developments, the area covered by sensors is ever increasing, and it is becoming possible to direct some of them towards pedestrians. This has allowed the development of novel smart control approaches that use real-time data to deliver cheap and responsive systems able to adapt to a variety of situations. Reinforcement Learning (RL) approaches have shown promising results in this field. However, most existing works restrict themselves to vehicles only and do not attempt to jointly optimise vehicular and pedestrian travel times, even though pedestrians are present in the great majority of real urban intersections.
This paper compares the performance of 30 different reward functions used by Deep Q-Network agents, split into 5 different classes based on the magnitudes they use, when controlling a simulation of a real-world junction in Greater Manchester (UK) that has been calibrated using 3.5 months of data gathered from Vivacity Labs vision-based sensors.
The paper is structured as follows: Section II reviews previous literature in the field. Section III states the mathematical framework used and provides some theoretical background. Section IV reviews the environment, the agents and their implementation. Section V introduces the reward functions tested in this paper and provides their analytical expressions. Section VI contains details about the training and evaluation of the agents. Lastly, Section VII provides the experimental results and discusses them.
II Related Work
RL for UTC has been explored and discussed in a variety of research, aiming to eventually replace existing adaptive control methods such as SCOOT, MOVA and SCATS. The field has evolved from early inquiries about its theoretical potential to progressively more applied and realistic scenarios that look towards real-world use and deployment. Recent works use different magnitudes in the reward function of the controlling agents (delay, queues, waiting time, throughput, …); however, it is not clear what benefits each choice provides. The different magnitudes used as reward are thoroughly indexed in , although no direct performance comparisons are made. Different approaches are taken regarding inputs, such as pixel-based vectors passed to a CNN, per-lane state signals using fully connected neural networks, or hybrid approaches. Recent research suggests that more complex state representations only provide marginal gains, if any, so in this paper the second approach is taken. A common thread in most previous works is the need for approximations about the network being studied and the lack of both pedestrian modelling and joint optimisation of vehicle and pedestrian travel times. As indicated in , pedestrian implementation has a high impact on learning performance, yet it is often discarded as unimportant or left for future work, save for two exceptions, the first of which uses a genetic algorithm instead of RL, and the second of which explores a single reward function. In this paper we attempt to cover this gap in the literature, providing a robust performance assessment of RL agents serving both vehicles and pedestrians, using a variety of rewards, both novel and from the literature, and attempting to uncover which state variables should be used in the reward to obtain the best performance. These rewards are applied to an RL agent in a calibrated model of a real-world junction, using real geometry, calibrated demand, realistic sensor inputs and emulated traffic light controllers. Some of these agents have since been deployed to control real traffic at this junction. This paper delivers the future work deferred from in terms of shifting the focus towards pedestrians and multi-objective optimisation, while keeping the problem grounded in the real world.
III Problem Definition
III-A Markov Decision Processes and Reinforcement Learning
The problem is framed as a Markov Decision Process (MDP), satisfying the Markov property: given a current state, the next state is independent of the succession of previous states. An MDP is defined by the 5-element tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, R)$:
- The set of possible states $\mathcal{S}$.
- The set of possible actions $\mathcal{A}$.
- The probabilistic transition function between states $\mathcal{T}$.
- The discount factor $\gamma \in [0, 1]$.
- The scalar Reward Function $R$.
The objective of an MDP optimisation is to find an optimal policy $\pi^*$, mapping states to actions, that maximises the sum of the expected discounted reward,
$$\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right].$$
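As a small illustration of the discounted sum of rewards that the optimal policy seeks to maximise, the sketch below evaluates it for a finite reward sequence. The function name `discounted_return` and the numerical values are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Iterate backwards so that each step applies one more discount to the tail.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps with a constant penalty of -1 and a discount factor of 0.5:
# -1 + 0.5 * (-1) + 0.25 * (-1) = -1.75
example_return = discounted_return([-1.0, -1.0, -1.0], gamma=0.5)
```

With a discount factor below 1, penalties incurred later weigh less than immediate ones, which is what allows the sum to stay bounded over long horizons.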
In the case of RL for UTC, the transition function $\mathcal{T}$ is unknown, making it necessary to approach the problem from a model-free RL perspective. Model-free RL is a sub-field of RL covering how independent agents can take sequential decisions in an unknown environment and learn from their interactions in order to obtain the optimal policy $\pi^*$. There are two main approaches: Policy-Based RL, which maps states to a distribution of potential actions, and Value-Based RL, which is used in this paper and estimates the value (expected return) of the state-action pairs under a given policy $\pi$, defined as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right].$$
III-B Q-Learning and Value-Based RL
Q-Learning is an off-policy, model-free, value-based RL algorithm. For any finite MDP, it can find an optimal policy which maximises the expected total discounted reward, starting from any state. Q-Learning aims to learn an optimal action-value function $Q^*(s, a)$, defined as the total return after being in state $s$, taking action $a$ and then following the optimal policy $\pi^*$.
Traditional table-based Q-Learning approximates $Q^*$ recursively through successive Bellman updates,
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( y_t - Q(s_t, a_t) \right),$$
with $\alpha$ the learning rate and $y_t$ the Temporal Difference (TD) target for the Q-function:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a').$$
This table representation is not useful for high dimensional cases, since the size of the table would increase exponentially with the number of dimensions, nor for continuous cases, since every distinct state $s$ would require an entry.
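The table-based update described above can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`q_learning_update`, the state labels and the action labels are all hypothetical), not the implementation used in the paper.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD target."""
    # TD target: immediate reward plus the discounted best value of the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return td_target


# Hypothetical two-state, two-action example; unseen entries default to 0.
q = defaultdict(float)
q[("s1", "extend")] = 2.0
target = q_learning_update(q, "s0", "switch", reward=-1.0, next_state="s1",
                           actions=["extend", "switch"], alpha=0.5, gamma=0.9)
```

Here the target is -1 + 0.9 * 2 = 0.8, and the entry for ("s0", "switch") moves halfway from 0 to that target. The table needs one entry per state-action pair, which is the scaling problem noted in the text.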
III-C Deep Q-Network
One way of addressing the issues of Q-Learning in high dimensional spaces is to use neural networks as function approximators. This approach is called Deep Q-Network (DQN). The Q-function approximation is then denoted in terms of the parameters $\theta$ of the DQN as $Q(s, a; \theta)$. DQN stabilises the learning process by introducing a Target Network that works alongside the main network. The main network, with parameters $\theta$, approximates the Q-function, and the target network, with parameters $\theta^{-}$, provides the TD targets for the DQN updates. The target network is updated every given number of episodes by copying the weights, $\theta^{-} \leftarrow \theta$. With $Q(s, a; \theta^{-})$ representing the target network, this results in the TD target:
$$y_t^{DQN} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}).$$
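The role of the target network can be illustrated without a deep learning library. In the sketch below a linear function stands in for the neural network purely for brevity; this substitution, and the names `q_values`, `dqn_td_target` and `sync_target`, are assumptions made for the example rather than details of the paper's PyTorch agent.

```python
import copy


def q_values(params, state):
    """Linear stand-in for a Q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, state)) for weights in params]


def dqn_td_target(target_params, reward, next_state, gamma):
    """TD target computed with the frozen target-network parameters."""
    return reward + gamma * max(q_values(target_params, next_state))


def sync_target(main_params):
    """Copy the main-network parameters into a fresh target network."""
    return copy.deepcopy(main_params)


main_params = [[1.0, 0.0], [0.0, 2.0]]   # two actions, two state features
target_params = sync_target(main_params)
main_params[1][1] = 10.0                  # the main network keeps training
y = dqn_td_target(target_params, reward=-1.0, next_state=[1.0, 1.0], gamma=0.5)
```

Because the target is computed from the copied parameters, the change made to the main network after the copy does not move the target until the next synchronisation, which is the stabilising effect described above.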
IV-A Reinforcement Learning Agent
The agent used to obtain these results is a standard implementation of a DQN in PyTorch, optimising its weights via Stochastic Gradient Descent using ADAM as optimiser. The learning rate is and the discount factor is
IV-B Reinforcement Learning Environment
The environment is modelled in the microscopic traffic simulator SUMO, representing a real-world intersection in Greater Manchester, UK. The junction consists of four arms, with 6 incoming lanes (two each in the north-south orientation, and one each in the east-west orientation) and 4 pedestrian crossings. The real-world site also contains 4 Vivacity vision-based sensors, able to supply occupancy, queue length, waiting time, speed and flow data. The demand and turning ratios at the junction have been calibrated using 3.5 months of journey time and flow data collected by these sensors. The environment includes an emulated traffic signal controller, responsible for changing between the different stages in the intersection and for enforcing the operational limitations, which are primarily safety-related: green times, intergreen times and the set of allowed stages. A stage is defined as a group of non-conflicting green lights (phases) in a junction which change at the same time. The agent decides which stage to select next and requests it from the emulated controller, which moves to that stage subject to these limitations. The data available to the agent is restricted to what can be obtained from the sensors.
IV-C State Representation
The agent receives an observation of the simulator state as input, using the same state information across all experiments presented here. Each observation is a combination of the state of the traffic controller (which stage is active) and data from the sensors. The sensor data comprises the occupancy in each lane area detector and a binary signal representing whether the pedestrian crossing button has been pushed. The agent receives a concatenation of the last 20 measurements at a time, covering the previous 12 seconds at a resolution of 0.6 seconds.
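One possible way to assemble such an observation is a fixed-length buffer that is flattened on request. The sketch below is an assumption-laden illustration: the class `ObservationBuffer`, the one-hot stage encoding, the zero padding before 20 measurements are available, and the choice of one occupancy value per incoming lane and one button flag per crossing are all hypothetical.

```python
from collections import deque

HISTORY_LENGTH = 20   # measurements kept, as stated in the text
STEP_SECONDS = 0.6    # resolution of each measurement, as stated in the text


class ObservationBuffer:
    """Keeps the most recent measurements and flattens them for the agent."""

    def __init__(self, measurement_size):
        # Assumption: start with all-zero measurements so the size is constant.
        blank = [0.0] * measurement_size
        self._frames = deque([blank] * HISTORY_LENGTH, maxlen=HISTORY_LENGTH)

    def push(self, stage_one_hot, lane_occupancies, crossing_requested):
        """Append one measurement: active stage, occupancies, button flags."""
        frame = [float(v) for v in stage_one_hot]
        frame += [float(v) for v in lane_occupancies]
        frame += [float(flag) for flag in crossing_requested]
        self._frames.append(frame)

    def observation(self):
        """Return the stored measurements concatenated, oldest first."""
        return [value for frame in self._frames for value in frame]


# Assumed layout: 4 stages + 6 lane occupancies + 4 crossing buttons = 14 values.
buffer = ObservationBuffer(measurement_size=14)
buffer.push([0, 1, 0, 0], [0.1] * 6, [True, False, False, False])
observation = buffer.observation()
```

Under these assumptions the agent input has 20 x 14 = 280 entries, and the deque discards the oldest measurement automatically whenever a new one arrives.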
IV-D Actions of the Agent
The junction is configured to have 4 available stages. The agent is able to choose Stage 2, Stage 3 or Stage 4, yielding an action space size of 3. Stage 1 services a protected right turn coming from the north. It is used by the traffic light controller, as a transitional step for reaching Stage 2, as defined by the transport authority. Stage 2 deals with the traffic in the north-south orientation. Stage 3 is the pedestrian stage, setting all pedestrian crossings to green, and all other phases to red. Stage 4 services the roads in the east-west orientation, which have considerable demand.
Once the controller has had a stage active for the minimum green time duration, the agent is requested to compute the value of all potential state-action pairs (i.e. the value of each available stage given the current state) once per time-step. From these, the action with the highest expected value is selected following an $\epsilon$-greedy policy. Should the agent choose the same action, the current stage will be extended for a further time-step (0.6 seconds). There is no built-in limit to the maximum number of such extensions, leaving it to the agent to learn the optimal green time for any given situation. If a different stage is chosen, the controller will proceed to the intergreen transition between them.
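The selection and extension logic described above can be sketched as follows. The function names, the dictionary of value estimates and the string labels returned to the controller are hypothetical; only the three selectable stages and the extend-or-switch behaviour come from the text.

```python
import random

STAGES = (2, 3, 4)  # stages the agent may request, as listed in the text


def select_stage(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over the selectable stages.

    q_values maps each stage number to its estimated value.
    """
    if rng.random() < epsilon:
        return rng.choice(STAGES)                     # explore: random stage
    return max(STAGES, key=lambda s: q_values[s])     # exploit: best estimate


def controller_request(current_stage, chosen_stage):
    """Translate the agent's choice into a request for the controller."""
    if chosen_stage == current_stage:
        return "extend"       # keep the current stage for one more time-step
    return "intergreen"       # start the transition towards the new stage


greedy_choice = select_stage({2: -5.0, 3: -1.0, 4: -3.0}, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy, so the example selects Stage 3, the entry with the highest estimated value.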
There are 2 situations that further add to the complexity of this control process:
- The variable number of extensions, and hence the variable length of the stages, creates a distribution of values over the state-action pairs in most rewards, which the agent must approximate. The variance of this distribution will be higher than the variance that would be obtained using a constant stage length.
- The requirement that Stage 1 must be used as an intermediate step to reach Stage 2 implies less certainty in the control process than in other stages, since there is an unaccounted dilated temporal horizon between the state that triggered the action and the effects of said action on the state variables.
IV-E Modal Prioritisation and Adjusting by Demand
The agent serves vehicles and pedestrians arriving at the intersection, seeking to jointly optimise the intersection for both modes of transport.
All the reward functions presented in this paper follow the same structure. The reward, as seen by the agent, is a linear combination of an independently calculated reward for the vehicles and another for the pedestrians, as can be seen in Eq. 7.
$$r = \eta_v r_v + \eta_p r_p \tag{7}$$
In this way, $\eta_v$ and $\eta_p$ are the Modal Prioritisation coefficients for our rewards, with $r_v$ and $r_p$ being respectively the vehicular and pedestrian rewards.
Of the rewards presented in the following section, those that were more sensitive to the ratio of demand between pedestrians and vehicles require manual tuning of the modal prioritisation parameters. While this is undesirable from a modeller and operator point of view, since it partially counters the benefits that RL provides in terms of self-adjustment, these rewards are included so that potential users and researchers can evaluate the trade-off between potentially increased performance and increased configuration effort. These series are identified by the weight applied to the pedestrians. As such, series identified as P80 and P95 are those in which the pedestrian weight was 0.8 and 0.95, respectively. Those series without an identifier did not require modal prioritisation.
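The weighted combination of the two per-mode rewards can be sketched as below. The function name and the example numbers are hypothetical, and pairing a pedestrian weight of 0.8 with a vehicle weight of 0.2 assumes that the two weights sum to one, which the text does not state.

```python
def combined_reward(vehicle_reward, pedestrian_reward,
                    vehicle_weight, pedestrian_weight):
    """Linear combination of independently computed per-mode rewards."""
    return (vehicle_weight * vehicle_reward
            + pedestrian_weight * pedestrian_reward)


# Assumed P80-style weighting: 0.8 on pedestrians and 0.2 on vehicles.
p80_example = combined_reward(vehicle_reward=-10.0, pedestrian_reward=-5.0,
                              vehicle_weight=0.2, pedestrian_weight=0.8)
```

In this example the smaller pedestrian penalty contributes twice as much to the total as the larger vehicle penalty (-4 against -2), which shows how strongly the weights can tilt the agent towards one mode.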
Another addition that can be made to the rewards is to add a term scaling the difficulty with the demand level, implicitly accepting that higher demand typically worsens the performance of a network, independent of the actions of the controlling agent. These series are identified with the suffix AD (Adjusted by Demand).
V Reward Functions
All reward functions tested are presented in this section with their analytical expressions.
Let be the set of lane queue sensors present in the intersection. Let be the set of pedestrian occupancy sensors in the junction. Let and be respectively the set of vehicles in incoming lanes, and the set of pedestrians waiting to cross in the intersection at time . Let be the individual speeds of the vehicles, and the waiting times of vehicles and pedestrians, respectively. Let and be the vehicular and pedestrian flows across the junction over the length of the action. Let be the time at which the previous action was taken and the time of the action before that. Lastly, let and be the entry times of vehicles and pedestrians to the area covered by sensors.
V-A Queue Length based Rewards
V-A1 Queue Length
V-A2 Queue Squared
As seen in , this function squares the result of adding all queues.
V-A3 Queues PLN
As Queue Length, but dividing the sum by the phase length (Phase Length Normalisation), approximating the reward that the action generates per unit of time it is active.
V-A4 Delta Queue
The reward is the variation of the sum of queues between actions.
V-A5 Delta Queue PLN
As Delta Queue, but dividing the sum by the phase length (Phase Length Normalisation).
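The five queue-based variants described in this subsection can be sketched as plain functions. The signs are assumptions: queues are negated so that shorter queues give a higher reward, and the variation is taken as previous minus current so that a shrinking queue is rewarded. The function names are hypothetical, and `queues` is assumed to hold one reading per queue sensor.

```python
def queue_length(queues):
    """Negative sum of all queue readings."""
    return -sum(queues)


def queue_squared(queues):
    """Negative square of the summed queues."""
    return -(sum(queues) ** 2)


def queues_pln(queues, phase_length):
    """Queue Length divided by the phase length in seconds."""
    return -sum(queues) / phase_length


def delta_queue(previous_queues, queues):
    """Variation of the summed queues between consecutive actions."""
    return sum(previous_queues) - sum(queues)


def delta_queue_pln(previous_queues, queues, phase_length):
    """Delta Queue divided by the phase length in seconds."""
    return delta_queue(previous_queues, queues) / phase_length


current = [3, 2, 1]
previous = [5, 3, 2]
rewards = {
    "queue": queue_length(current),
    "squared": queue_squared(current),
    "pln": queues_pln(current, phase_length=12.0),
    "delta": delta_queue(previous, current),
    "delta_pln": delta_queue_pln(previous, current, phase_length=8.0),
}
```

Squaring penalises long total queues disproportionately, while the normalised variants express the same quantities per second of green time, making actions of different duration comparable.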
V-B Waiting Time based Rewards
These rewards require Modal Prioritisation weights.
V-B1 Wait Time
The reward is the negative sum of time in queue accumulated since the last action by all vehicles.
V-B2 Delta Wait Time
As seen in , the reward is the variation in queueing time between actions.
V-B3 Waiting Time Adjusted by Demand
Negative sum of waiting time, adding a factor to scale it according to an estimate of the demand.
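A minimal sketch of the first two waiting-time rewards is given below, under assumed names and an assumed sign convention (previous minus current for the variation). The demand-adjusted variant is left out because the form of its scaling factor is not given in the text.

```python
def wait_time_reward(vehicle_waits, pedestrian_waits,
                     vehicle_weight, pedestrian_weight):
    """Negative weighted sum of the waiting time accumulated by each mode."""
    return -(vehicle_weight * sum(vehicle_waits)
             + pedestrian_weight * sum(pedestrian_waits))


def delta_wait_time_reward(previous_total_wait, current_total_wait):
    """Variation in total queueing time between consecutive actions."""
    return previous_total_wait - current_total_wait


# Hypothetical readings in seconds, with an assumed 0.2 / 0.8 weighting.
wait_example = wait_time_reward([4.0, 6.0], [10.0],
                                vehicle_weight=0.2, pedestrian_weight=0.8)
delta_example = delta_wait_time_reward(30.0, 20.0)
```

Because waiting time keeps accumulating while a stage is active, the magnitude of these rewards depends on how long the previous action lasted, unlike an instantaneous measurement.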
V-C Delay based Rewards
These rewards require Modal Prioritisation weights.
V-C1 Delay
As seen in , the reward is the negative weighted sum of the delay of all entities. Delay is understood as deviation from the maximum allowed speed. For the pedestrians, the time in queue is used, given that, from the point of view of the sensors, pedestrian presence is binary. Assuming a simulator time step of length :
V-C2 Delta Delay
V-C3 Delay Adjusted by Demand
Same as in Eq. (16), introducing a scaling demand term.
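One common way to turn "deviation from the maximum allowed speed" into a number is to charge each vehicle the fraction of a time step it loses relative to travelling at the maximum speed. The sketch below uses that definition as an assumption; it is not confirmed to be the paper's exact expression, and the function names and weights are hypothetical.

```python
def vehicle_delay(speeds, max_speed, step_seconds):
    """Delay accrued over one simulator step, summed over all vehicles."""
    return sum(step_seconds * (1.0 - min(v, max_speed) / max_speed)
               for v in speeds)


def delay_reward(speeds, pedestrian_waits, max_speed, step_seconds,
                 vehicle_weight, pedestrian_weight):
    """Negative weighted sum of vehicle delay and pedestrian time in queue."""
    vehicle_term = vehicle_delay(speeds, max_speed, step_seconds)
    pedestrian_term = sum(pedestrian_waits)
    return -(vehicle_weight * vehicle_term
             + pedestrian_weight * pedestrian_term)


# One vehicle at half the maximum speed and one stopped, over a 0.6 s step.
delay_example = vehicle_delay([10.0, 0.0], max_speed=20.0, step_seconds=0.6)
reward_example = delay_reward([10.0, 0.0], [5.0], 20.0, 0.6,
                              vehicle_weight=0.2, pedestrian_weight=0.8)
```

A stopped vehicle loses the full step and a vehicle at the maximum speed loses nothing, so the vehicle term is bounded by the number of vehicles times the step length, whereas the pedestrian term is measured in accumulated seconds. This difference in scale is one reason such rewards may need modal weights.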
V-D Average Speed based Rewards
V-D1 Average Speed, Wait Time Variant
The vehicle reward is the average speed of vehicles in the area covered by sensors and normalised by the maximum speed. The pedestrian reward is the minimum between the sum of the waiting time of the pedestrian divided by a theoretical desirable maximum waiting time and 1. This produces two components of the reward .
V-D2 Average Speed, Occupancy Variant
Vehicle reward as in the previous entry. Pedestrian reward is the minimum between the sum of pedestrians waiting divided by a theoretical maximum desirable capacity and 1.
V-D3 Average Speed Adjusted by Demand, Wait Time and Occupancy Variants
As in the previous two entries, adding a multiplicative factor equal to the estimate of the demand, scaling the reward with the difficulty of the task.
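The building blocks of the Average Speed rewards can be sketched as two bounded terms. The names are hypothetical, treating an empty approach as free-flowing is an assumption, and the sketch deliberately stops short of combining the terms because the text does not specify how the pedestrian term enters the final reward.

```python
def vehicle_speed_term(speeds, max_speed):
    """Average vehicle speed normalised by the maximum speed, in [0, 1]."""
    if not speeds:
        # Assumption: with no vehicles detected, treat traffic as free-flowing.
        return 1.0
    return min(sum(speeds) / len(speeds) / max_speed, 1.0)


def capped_ratio(total, desirable_maximum):
    """min(total / desirable_maximum, 1), used for both pedestrian variants."""
    return min(total / desirable_maximum, 1.0)


speed_term = vehicle_speed_term([10.0, 5.0], max_speed=20.0)
# Wait Time variant: summed pedestrian waiting time against a desirable cap.
wait_variant = capped_ratio(sum([30.0, 45.0]), desirable_maximum=60.0)
# Occupancy variant: number of waiting pedestrians against a desirable cap.
occupancy_variant = capped_ratio(3, desirable_maximum=10)
```

Both terms are dimensionless and bounded by 1, so neither mode can dominate the other simply by being measured in larger units, which is consistent with these rewards not needing modal prioritisation weights.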
V-E Throughput based Rewards
These rewards require Modal Prioritisation weights.
The reward is the sum of the pedestrians and vehicles that cleared the intersection since the last action.
Table: mean waiting time and standard deviation for each reward function under the Normal, Peak and Oversaturated Scenarios (values not available). Rows: Average Speed - Wait; Average Speed - Occ; Average Speed AD - Wait; Average Speed AD - Occ; Wait Time P80; Wait Time P95; Wait Time AD; Wait Time AD P80; Wait Time AD P95; Delay AD P80; Delay AD P95; Vehicle Actuated System D.
VI-A DQN Agents Training
The training process covers 1500 episodes, each running for 3000 steps of length 0.6 seconds, for a simulated time of 30 minutes (1800 seconds) per episode. The traffic demand is increased as the training advances, with the agent progressively facing sub-saturated, near-saturated and over-saturated scenarios, with a minimum of 1 vehicle / 3 seconds (1200 vehicles/h) and a maximum of 1 vehicle / 1.4 seconds (2571 vehicles/h).
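The figures quoted above follow from simple arithmetic, reproduced in the short sketch below. The helper name `headway_to_flow` is hypothetical.

```python
def headway_to_flow(seconds_per_vehicle):
    """Convert a mean arrival headway in seconds to vehicles per hour."""
    return 3600.0 / seconds_per_vehicle


# Episode duration: number of steps times the step length in seconds.
episode_seconds = 3000 * 0.6

# Lowest and highest training demand, expressed in vehicles per hour.
min_flow = headway_to_flow(3.0)
max_flow = headway_to_flow(1.4)
```

The highest demand evaluates to roughly 2571.4 vehicles per hour, which the text truncates to 2571.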
For each reward function, 10 copies of the agent are trained, and their performance is compared against two reference systems: Maximum Occupancy (longest queue first) and Vehicle Actuated System D (vehicle-triggered green time extensions), which is commonly used in the UK. The agent performing best against the reference systems in each class is selected for detailed scoring.
VI-B Evaluation and Scoring
Each selected agent is tested and its performance scored over 100 copies of 3 different scenarios with different demand levels. Each evaluation is the same length as the training episodes, with the demand kept constant during each run. These three scenarios are intended to test the agents during normal operation, peak times and over-saturated conditions, and will henceforth be referred to as the Normal, Peak and Over-saturated Scenarios. The Peak Scenario uses the level of demand observed in the junction that results in saturated traffic conditions under traditional controllers.
The Normal Scenario uses an arrival rate of 1 vehicle / 2.1 seconds (1714 vehicles/h). The Peak Scenario uses an arrival rate of 1 vehicle / 1.7 seconds (2117 vehicles/h). The Over-saturated Scenario uses an arrival rate of 1 vehicle / 1.4 seconds (2571 vehicles/h).
VII Results and Discussion
The results from the simulations of the different reward functions are summarised in Fig. 3, which includes the performance of the 15 rewards found to have the lowest waiting times and that seem most desirable in practice. They are detailed for all 30 rewards in the results table. Fig. 3 presents the distribution of pedestrian and vehicle waiting times, and the combination of mean performances for both modes of transportation, across 100 repetitions of each demand level. The results table shows the mean waiting time for each distribution and its standard deviation, also calculated across all three demand levels.
The results provide further evidence that RL agents can reach better performance than reference adaptive methods, more evidently so when pedestrians are added. In the case of Maximum Occupancy (MO), the poor performance can be explained by the need to have more pedestrians queued than vehicles in any sensor in order to start the pedestrian stage. Vehicle Actuated System D (VA) suffers due to its predisposition towards extending green times by 1.5 s in the presence of any vehicle, making it more difficult to reach a state in which the pedestrian stage can be started. Both of these characteristics make the vanilla reference methods less suited for intersections including pedestrians than the RL methods presented in Fig. 3, especially in situations of high demand.
At a global level, methods based on maximisation of the average network speed show the lowest global waiting times for pedestrians and vehicles combined across all demand levels, while also obtaining some of the lowest spreads, as shown in the case with no pedestrians. Their performance is closely followed by Queue minimisation, which obtains the lowest average waiting times for vehicles in the Normal and Peak Scenarios, but falls behind in Over-saturated conditions and when dealing with pedestrians. Queue Squared minimisation has a comparable yet slightly worse performance, followed by Delta Queues and Delta Queues PLN. This last reward has been shown to obtain better performance with higher demand, which is consistent with it generating less variance in the state, since it models arrival rates given an action, and makes it an option that could be further explored for permanently congested intersections.
Prioritised rewards based on Waiting Time show acceptable performance, but also a high sensitivity to changes in the modal prioritisation weights. This is similar to the behaviour shown by the Delay-based rewards, which overall perform worse, potentially due to the need to use Wait Time for pedestrians, mixing the state variables, although this does not seem to be an issue for Average Speed based rewards. Without a weight configuration heavily favouring the pedestrians, these reward functions were found to converge for vehicles only, obtaining the lowest vehicle waiting times overall in the case of the Delay functions, at the expense of rarely, if ever, serving pedestrians. The suitability of a given choice of modal prioritisation weights is further affected by the functional form of the reward. In the results, it can be observed that while in general a choice of weights favouring the pedestrians obtains better results (e.g. Wait Time and Delay), for certain functional choices equal weighting is the one producing the best results, which would not be the case if the suitability of the weights was only affected by the relative demand ratios between vehicles and pedestrians. This is the case with Throughput based functions, which, unlike the Wait and Delay functions, obtained lower waiting times with equal modal weights, and a general increase in waiting time as the weights become more skewed towards the pedestrians. Rewards using Differences in Delay or Wait Time, despite performing well in the literature, were found either not to converge for pedestrians or to produce mediocre results. The addition of a demand scaling term generates, in general, a slight improvement in waiting times across the rewards using Wait Time and Delay, particularly at higher demand levels.
Overall, the dominance shown by speed maximisation methods could be attributed to several factors. Average Speed based functions, like Queue based functions, obtain an instantaneous snapshot of a magnitude that does not intrinsically grow over time, as opposed to Delay, Wait and Throughput, so they exclusively encode information about the moment the action is requested. It can also be argued that speed maximisation rewards are not affected by the correspondence between agent actions and time-steps in the environment. In the specific case of RL for UTC, the values of the reward received by an agent using a reward based on Queues, Delay, Wait or Throughput are a function of the length of the phase that generated them, making them theoretically less suitable for the underlying MDP than speed maximisation. Lastly, speed maximisation and queue minimisation have an extra benefit that makes them serious candidates for widespread real-world use: they need no modal prioritisation tuning. One of the main selling points of ML and RL methods stems from their ability to perform as well as or better than traditional systems at a lower cost in a variety of situations. However, a lengthy manual tuning process to find the exact weights for a given junction is not only untransferable to any other intersection, but may also fail to reduce planning and execution times compared with traditional control. The lack of need for manual tuning, especially in the case of Average Speed functions, which are specifically crafted to avoid it, makes them in our view applicable more widely and more quickly than any of the other reward functions presented here.
One limitation of this paper is that the results are only relevant in the case of value-based DQN agents as introduced in Section III and Section IV, and not for CNN or Policy Gradient architectures. This work could be extended to account for other modes of transportation, performing a similar optimisation based on different vehicle classes (buses, cyclists, personal vehicles, trucks, etc.). The optimisation could seek to prioritise them based on different criteria (e.g. priority to cyclists and public transport during rush hours or weighting vehicles according to the expected number of passengers).
This work was part funded by EPSRC Grant EP/L015374 and part funded by InnovateUK grant 104219. Vivacity Labs thanks Transport for Greater Manchester for helping take this work to the real world, Immense for calibrating the simulation model, and InnovateUK for funding that made it possible. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
-  Hunt, P. B., Robertson, D. I., Bretherton, R. D., & Royle, M. C. (1982). The SCOOT on-line traffic signal optimisation technique. Traffic Engineering & Control, 23(4).
-  Vincent, R. A., & Peirce, J. R. (1988). ’MOVA’: Traffic Responsive, Self-optimising Signal Control for Isolated Intersections. Traffic Management Division, Traffic Group, Transport and Road Research Laboratory.
-  Lowrie, P. R. (1990). SCATS, Sydney Co-ordinated Adaptive Traffic System: A traffic responsive method of controlling urban traffic.
-  Wiering, M. A. (2000). Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML’2000) (pp. 1151-1158).
-  Abdulhai, B., Pringle, R., & Karakoulas, G. J. (2003). Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering, 129(3), 278-285.
-  Prashanth, L. A., & Bhatnagar, S. (2010). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412-421.
-  El-Tantawy, S., & Abdulhai, B. (2010, September). An agent-based learning towards decentralized and coordinated traffic signal control. In 13th International IEEE Conference on Intelligent Transportation Systems (pp. 665-670). IEEE.
-  Abdoos, M., Mozayani, N., & Bazzan, A. L. (2011, October). Traffic light control in non-stationary environments based on multi agent Q-learning. In 2011 14th International IEEE conference on intelligent transportation systems (ITSC) (pp. 1580-1585). IEEE.
-  Yau, K. L. A., Qadir, J., Khoo, H. L., Ling, M. H., & Komisarczuk, P. (2017). A survey on reinforcement learning models and algorithms for traffic signal control. ACM Computing Surveys (CSUR), 50(3), 1-38.
-  Haydari, A., & Yilmaz, Y. (2020). Deep Reinforcement Learning for Intelligent Transportation Systems: A Survey. Preprint arXiv:2005.00935.
-  Wei, H., Zheng, G., Gayah, V., & Li, Z. (2019). A Survey on Traffic Signal Control Methods. Preprint arXiv:1904.08117.
-  Liang, X., Du, X., Wang, G., & Han, Z. (2019). A deep reinforcement learning network for traffic light cycle control. IEEE Transactions on Vehicular Technology, 68(2), 1243-1253.
-  Gao, J., Shen, Y., Liu, J., Ito, M., & Shiratori, N. (2017). Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network. arXiv preprint arXiv:1705.02755.
-  Mousavi, S. S., Schukat, M., & Howley, E. (2017). Traffic light control using deep policy-gradient and value-function-based reinforcement learning. IET Intelligent Transport Systems, 11(7), 417-423.
-  El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2014). Design of reinforcement learning parameters for seamless application of adaptive traffic signal control. Journal of Intelligent Transportation Systems, 18(3), 227-245.
-  Aslani, M., Mesgari, M. S., Seipel, S., & Wiering, M. (2019, October). Developing adaptive traffic signal control by actor–critic and direct exploration methods. In Proceedings of the Institution of Civil Engineers-Transport (Vol. 172, No. 5, pp. 289-298). Thomas Telford Ltd.
-  Genders, W., & Razavi, S. (2019). Asynchronous n-step Q-learning adaptive traffic signal control. Journal of Intelligent Transportation Systems, 23(4), 319-331.
-  Genders, W., & Razavi, S. (2016). Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142.
-  Genders, W. (2018). Deep reinforcement learning adaptive traffic signal control (Doctoral dissertation).
-  Wan, C. H., & Hwang, M. C. (2018). Value-based deep reinforcement learning for adaptive isolated intersection signal control. IET Intelligent Transport Systems, 12(9), 1005-1010.
-  Genders, W., & Razavi, S. (2018). Evaluating reinforcement learning state representations for adaptive traffic signal control. Procedia computer science, 130, 26-33.
-  Turky, A. M., Ahmad, M. S., Yusoff, M. Z. M., & Hammad, B. T. (2009, July). Using genetic algorithm for traffic light control system with a pedestrian crossing. In International Conference on Rough Sets and Knowledge Technology (pp. 512-519). Springer, Berlin, Heidelberg.
-  Liu, Y., Liu, L., & Chen, W. P. (2017, October). Intelligent traffic light control using distributed multi-agent Q learning. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC) (pp. 1-8). IEEE.
-  Chen, C., Wei, H., Xu, N., Zheng, G., Yang, M., Xiong, Y., Xu, K., & Li, Z. Toward A Thousand Lights: Decentralized Deep Reinforcement Learning for Large-Scale Traffic Signal Control.
-  Cabrejas Egea, A., Howell, S., Knutins, M., & Connaughton, C. (2020, October). Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations. IEEE Systems, Man, and Cybernetics (October 2020) (Accepted).
-  Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279-292.
-  Melo, F. S. (2001). Convergence of Q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep, 1-4.
-  Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., … & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
-  Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … & Desmaison, A. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (pp. 8024-8035).
-  Kiefer, J., & Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3), 462-466.
-  Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-  Lopez, P. A., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y. P., Hilbrich, R., … & Wießner, E. (2018, November). Microscopic traffic simulation using SUMO. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 2575-2582). IEEE.
-  Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
-  Highways Agency (2002). Siting Of Inductive Loops For Vehicle Detecting Equipments At Permanent Road Traffic Signal Installations. MCE 0108 Issue C.