Deep Reinforcement Learning for Dynamic Urban Transportation Problems

by Laura Schultz, et al.
George Mason University

We explore the use of deep learning and deep reinforcement learning for optimization problems in transportation. Many transportation system analysis tasks are formulated as optimization problems, such as optimal control problems in intelligent transportation systems and long-term urban planning. The transportation models used to represent the dynamics of a transportation system often involve large data sets with complex input-output interactions and are difficult to use in the context of optimization. Deep learning metamodels can produce a lower-dimensional representation of those relations and allow optimization and reinforcement learning algorithms to be implemented efficiently. In particular, we develop deep learning models for calibrating transportation simulators and for reinforcement learning to solve the problem of optimal scheduling of travelers on the network.



1 Introduction

Many modern transportation system analysis problems lead to high-dimensional and highly nonlinear optimization problems. Examples include fleet management [39], intelligent system operations [26] and long-term urban planning [45]. Transportation system models used in those optimization problems typically assume analytical formulations using mathematical programming [27, 44] or conservation laws [16, 40] and rely on high-level abstractions, such as origin-destination matrices for demand. An alternative approach is to model the transportation system using a complex simulator [38, 43, 7], which models individual travelers and provides a flexible way to represent traffic and demand patterns in large-scale multi-modal transportation systems. However, solving optimization problems and dynamic-control problems that rely on simulator models of the system is prohibitive due to computational costs. Recently, a metamodel-based approach was proposed to solve simulation-based transportation optimization problems [14, 42].

In this paper, we propose an alternative approach to solve optimization problems for large-scale transportation systems. Our approach relies on low-complexity metamodels deduced from complex simulators and on reinforcement learning to solve optimization problems. We use deep learning approximators for the low-complexity metamodels. A deep learning model is a Latent Variable Model (LVM) that is capable of extracting the underlying low-dimensional pattern in high-dimensional input-output relations. Deep learners have proven highly effective in combination with Reinforcement and Active Learning [36] to recognize such patterns for exploitation. Our approach builds on the work on simulation-based optimization [42, 14], deep learning [41, 18] as well as reinforcement learning [51, 50] techniques recently proposed for transportation applications. The two main contributions of this paper are

  1. Development of innovative deep learning architecture for reducing dimensionality of search space and modeling relations between transportation simulator inputs (travel behavior parameters, traffic network characteristics) and outputs (mobility patterns, traffic congestion)

  2. Development of reinforcement learning techniques that rely on deep learning approximators to solve optimization problems for dynamic transportation systems

We demonstrate our methodologies using two applications. First, we solve the problem of calibrating complex, stochastic transportation simulators, which need to be systematically adjusted to match field data. Calibration is key to making a simulator useful for both short-term operational decisions and long-term urban planning. We improve on the numerous previously proposed approaches for the calibration of simulation-based traffic flow models that treat the problem as an optimization problem [12, 34, 33, 15, 29, 24, 25, 20, 21, 19]. Our approach makes no assumption about the form of the simulator and types of inputs and outputs used. Further, we show that deep learning models are more sample efficient when compared to Bayesian techniques or more traditional filtering techniques. We build on our calibration framework [42] by further exploring the dimensionality reduction utilized for more efficient input parameter space exploration. More specifically, we introduce the formulation and analysis of a combinatorial Neural Network method and compare it with previous work that used Active Subspace methods.

The second application builds upon recent advances in deep learning approaches to reinforcement learning (RL) that have demonstrated impressive results in game playing [35] through the application of neural networks for approximating state-action functions. Reinforcement learning mimics the way humans learn new tasks and behavioral policies via trial and error and has proven successful in many applications [47]. While most of the research on RL is done in the field of machine learning and applied to classical AI problems, such as robotics, language translation and supply chain management problems [23], some classical transportation control problems have previously been solved using RL [1, 6, 11, 8, 5, 31, 17, 2, 13]. Furthermore, recent attempts have successfully demonstrated applications of deep RL to traffic flow control [3, 51, 50, 9, 22].

The remainder of this paper is organized as follows: Section 2 briefly documents the highlights of neural network architectures; Section 3 describes the new deep learning architecture that finds low-dimensional patterns in the simulator’s input-output relations and applies our deep learner to the problem of model calibration; Section 4 describes the additional application of deep reinforcement learning to transportation system optimization. Finally, Section 5 offers avenues for further research.

2 Deep Learning

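Let $y$ denote a (low dimensional) output and $x = (x_1, \ldots, x_p)$ a (high dimensional) set of inputs. We wish to recover the multivariate function (map), denoted by $y = F(x)$, using training data of input-output pairs $\{(x_i, y_i)\}_{i=1}^{N}$, that generalizes well for out-of-sample data. Deep learning uses compositions of functions, rather than traditional additive ones. By composing $L$ layers, a deep learning predictor becomes

$$\hat{y} = F_{W,b}(x) = \left(f_{W_L,b_L} \circ \cdots \circ f_{W_1,b_1}\right)(x), \qquad f_{W_l,b_l}(z) = f_l(W_l z + b_l),$$

where $f_l$ is a univariate activation function. The set $(W, b) = (W_1, \ldots, W_L, b_1, \ldots, b_L)$ is the set of weights and offsets which are learned from training data. Here $W_l \in \mathbb{R}^{p_l \times p_{l-1}}$, $b_l \in \mathbb{R}^{p_l}$, and the dimensionality $p_l$ of each layer is part of the architecture specification.

Training the parameters $(W, b)$ and selecting an architecture is achieved by regularized least squares. Stochastic gradient descent (SGD) and its variants are used to find the solution to the optimization problem

$$\min_{W,b} \; \frac{1}{N}\sum_{i=1}^{N} \left\| y_i - F_{W,b}(x_i) \right\|_2^2 + \phi(W, b),$$

where $\{(x_i, y_i)\}_{i=1}^{N}$ is training data of input-output pairs, and $\phi(W, b)$ is a regularisation penalty on the network parameters (weights and offsets).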
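As an illustration of the layered predictor and the regularized least-squares training described above, the following is a minimal sketch in plain Python. It is not the implementation used in this paper: the network size (one tanh hidden layer with four units), the L2 penalty, the learning rate, and the synthetic training data are all assumptions chosen only to keep the example small.

```python
import math
import random


def predict(x, W1, b1, w2, b2):
    """Compose two layers: a tanh hidden layer, then a linear output layer."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2, hidden


def objective(data, W1, b1, w2, b2, lam):
    """Mean squared error plus an (assumed) L2 penalty on the weights."""
    mse = sum((y - predict(x, W1, b1, w2, b2)[0]) ** 2 for x, y in data) / len(data)
    penalty = sum(w * w for row in W1 for w in row) + sum(w * w for w in w2)
    return mse + lam * penalty


def sgd_epoch(data, W1, b1, w2, b2, lr, lam, rng):
    """One pass of stochastic gradient descent; returns the updated output offset."""
    for x, y in rng.sample(data, len(data)):
        y_hat, hidden = predict(x, W1, b1, w2, b2)
        err = 2.0 * (y_hat - y)                      # d(loss)/d(y_hat)
        for k, h in enumerate(hidden):
            grad_pre = err * w2[k] * (1.0 - h * h)   # back through tanh
            w2[k] -= lr * (err * h + 2.0 * lam * w2[k])
            b1[k] -= lr * grad_pre
            for j, xj in enumerate(x):
                W1[k][j] -= lr * (grad_pre * xj + 2.0 * lam * W1[k][j])
        b2 -= lr * err
    return b2


rng = random.Random(0)
# Hypothetical training set: two inputs, one output.
data = []
for _ in range(40):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 0.5 * x[0] - 0.3 * x[1]))

W1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
b1 = [0.0] * 4
w2 = [rng.uniform(-0.5, 0.5) for _ in range(4)]
b2 = 0.0

loss_before = objective(data, W1, b1, w2, b2, lam=1e-4)
for _ in range(200):
    b2 = sgd_epoch(data, W1, b1, w2, b2, lr=0.05, lam=1e-4, rng=rng)
loss_after = objective(data, W1, b1, w2, b2, lam=1e-4)
print(loss_before, loss_after)
```

In practice a deep learning library would compute the gradients automatically; the manual updates here only make the composition and the penalty term explicit.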

In this paper we develop a new deep learning architecture for simultaneously learning a low-dimensional representation of the simulator’s input parameter space as well as the relation between the simulator’s inputs and outputs. Our architecture relies on multi-layer perceptron and auto-encoding layers. A Multilayer Perceptron Network (MLP) is a neural network which takes a set of inputs, $x$, and feeds them through one or more sets of intermediate layers to compute one or more output values, $\hat{y}$. Although a single architecture is commonly implemented in practice, some success has been found through comparative or combinatorial means [53].

2.1 Auto-Encoder

An auto-encoder is a deep learning routine which trains the architecture to approximate $x$ by itself (i.e., $x = \hat{x}$) via a bottleneck structure. This means we select a model $F_{W,b}(x)$ which aims to concentrate the information required to recreate $x$. Put differently, an auto-encoder creates a more cost effective representation of $x$. For example, under an L2-loss function, we wish to solve

$$\min_{W,b} \; \left\| F_{W,b}(x) - x \right\|_2^2$$

subject to a regularization penalty on the weights and offsets. In an auto-encoder, for a training data set $\{x_1, \ldots, x_N\}$, we set the target values as $y_i = x_i$. A static auto-encoder with two linear layers, akin to a traditional factor model, can be written as a deep learner as

$$z_1 = W_1 x + b_1, \quad a_1 = f_1(z_1), \quad z_2 = W_2 a_1 + b_2, \quad \hat{x} = F_{W,b}(x) = a_2 = f_2(z_2),$$

where $a_1$ and $a_2$ are activation vectors. The goal is to find the weights and biases so that the size of $z_1$ is much smaller than the size of $x$.
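The bottleneck idea can be illustrated with a small linear auto-encoder written in plain Python. This is a sketch under stated assumptions rather than the architecture used in this paper: the data are synthetic three-dimensional points that vary along a single direction, the bottleneck has one unit, offsets are omitted, and the learning rate and initial weights are arbitrary.

```python
import random


def encode(x, W_enc):
    """Project the input onto the bottleneck (one row of W_enc per code unit)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W_enc]


def decode(z, W_dec):
    """Map the bottleneck code back to the original input dimension."""
    return [sum(w * zk for w, zk in zip(row, z)) for row in W_dec]


def reconstruction_error(data, W_enc, W_dec):
    """Average squared distance between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, W_enc), W_dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train(data, W_enc, W_dec, lr=0.05, epochs=200):
    """Plain SGD on the squared reconstruction error, with targets equal to inputs."""
    for _ in range(epochs):
        for x in data:
            z = encode(x, W_enc)
            x_hat = decode(z, W_dec)
            err = [2.0 * (a - b) for a, b in zip(x_hat, x)]
            # Gradient with respect to the code, taken before the decoder changes.
            grad_z = [sum(err[i] * W_dec[i][k] for i in range(len(x)))
                      for k in range(len(z))]
            for i in range(len(x)):
                for k in range(len(z)):
                    W_dec[i][k] -= lr * err[i] * z[k]
            for k in range(len(z)):
                for j in range(len(x)):
                    W_enc[k][j] -= lr * grad_z[k] * x[j]


rng = random.Random(0)
# Hypothetical data: 3-dimensional inputs that vary along one direction only.
direction = [0.5, 1.0, -0.5]
data = [[t * d for d in direction] for t in (rng.uniform(-1, 1) for _ in range(30))]

W_enc = [[0.3, 0.3, 0.3]]        # 3 inputs -> 1 bottleneck unit
W_dec = [[0.3], [0.3], [0.3]]    # 1 bottleneck unit -> 3 outputs

error_before = reconstruction_error(data, W_enc, W_dec)
train(data, W_enc, W_dec)
error_after = reconstruction_error(data, W_enc, W_dec)
print(error_before, error_after)
```

Because the inputs lie on a line, a single bottleneck unit is enough to reconstruct them; with real simulator inputs the bottleneck width would have to be chosen by experiment.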

3 Deep Learning for Calibration

In this section we develop a new deep learning architecture that can be used to learn low-dimensional structure in the simulator input-output relations. We then use the low-dimensional representation to solve an optimization problem, namely the calibration of a transportation simulator. As simulators become more detailed, their high dimensionality has become a pressing concern. In practice, high-dimensional data possesses a natural structure that can be expressed in low dimensions. Exploiting this structure, known as dimension reduction, reduces the effective number of parameters in the model and enables analysis from smaller data sets.

Recently, we developed a low-dimensional metamodel to be used for transportation simulation-based problems [42]. We calculated active subspaces to capture low-dimensional structures in the simulator and used a Gaussian process to represent the input-output relations. There are several key concerns when developing efficient simulation-based algorithms for transportation applications:

  1. Algorithms must be sample efficient and parallelizable. Each simulation run is computationally expensive and can take up to a few days. This computational constraint could potentially limit the scale and scope of calibration investigations and result in large areas of sample space unexplored and sub-optimal decisions. As High-Performance Computing (HPC) resources have become increasingly available in most research environments, new modes of computational processing and experimentation have become possible – parallel tasking capabilities allow multiple simulated runs to be performed simultaneously and HPC programs aid in coordinating worker units to run codes across multiple processors to maximize the available resources and time management. By leveraging these advances and running a queue of pending input sets concurrently through the simulator, a larger set of unknown inputs can be evaluated in an acceptable time frame.

    The variational landscape for a simulation model will not be uniform throughout the state-space. Although HPC can ease the constraint of limited computational resources, additional care should be taken to decide when exploration or exploitation should be encouraged given the data collected, and to avoid redundant sampling.

    A machine learning technique known as active learning is leveraged to provide such a scheme. A utility function (a.k.a. acquisition function) is built to balance the exploration of unknown portions of the input sample space with the exploitation of all information gathered from previously evaluated data, resulting in a prioritized ordering that reflects the motivation and objectives behind the calibration effort. The expectation of the utility function is taken over the Bayesian posterior distribution and maximized to provide a predetermined number of recommendations.

  2. Transportation modelers have many ways to model complex interactions represented within the transportation simulator. Calibration methodologies that account for the internal structure of a simulator could be more efficient. On the other hand, they are hardly generalizable to other types of simulation models. By treating the relationship between the inputs and outputs of the simulator in question as an unknown, stochastic function, black-box optimization methodologies can be leveraged. Specifically, our previously developed Gaussian process framework took the Bayesian approach to construct a probability distribution over all potential linear and nonlinear functions matching the simulator and leveraged evidential data to determine the most likely match. Once this distribution is sufficiently mapped, an estimate of the sought parameters can be made with minimal uncertainty.
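To make the role of the acquisition function concrete, the sketch below ranks candidate input sets by expected improvement, one common choice of utility for minimization under a Gaussian posterior. The text above does not name the utility that was used, so expected improvement, the candidate labels, and the posterior means and standard deviations are all assumptions for illustration.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """Expected improvement over the best (lowest) objective value seen so far,
    for a candidate whose posterior is Gaussian with the given mean and std."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * normal_cdf(z) + std * normal_pdf(z)


def recommend(candidates, batch_size, best):
    """Return the labels of the batch_size candidates with the highest utility.

    candidates: list of (label, posterior_mean, posterior_std) tuples.
    """
    ranked = sorted(candidates,
                    key=lambda c: expected_improvement(c[1], c[2], best),
                    reverse=True)
    return [label for label, _, _ in ranked[:batch_size]]


# Hypothetical posterior summaries for three untested input sets.
candidates = [
    ("A", 1.0, 0.01),   # poor predicted value, almost no uncertainty
    ("B", 0.7, 0.10),   # good predicted value (exploitation)
    ("C", 1.0, 1.00),   # poor predicted value, high uncertainty (exploration)
]
queue = recommend(candidates, batch_size=2, best=0.8)
print(queue)
```

The returned queue can then be evaluated concurrently by the simulator, in line with the parallel evaluation of pending input sets described in item 1.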

3.1 Deep Learning Architecture

Within the calibration framework, two objectives must be realized by the neural network:

  1. A reduced dimension subspace which captures the relationship between the simulator inputs and outputs must be bounded to allow for adequate exploration

  2. Given the reduced dimension sample determined by the framework, a method to convert the reduction to the original dimension subspace must exist to allow for simulator evaluations

To address these objectives, we use an MLP architecture to capture the relations between inputs and outputs and an auto-encoder architecture to capture the low-dimensional structure in the input parameters. We run optimization algorithms inside the low-dimensional representation of the input parameter space to address the curse of dimensionality. The Autoencoder and MLP share the same initial layers up to the reduced dimension layer, as shown in Figure 1.


Figure 1: Graphical Representation of the Combinatorial Neural Network for Calibration

The activation function used is the hyperbolic tangent, $\tanh$, which has a range of $(-1, 1)$ and is a close approximation to the sign function.

We ran the simulator several times to generate an initial sample set, which was used by the calibration framework to explore the relationship between the inputs and outputs. Additionally, to quantify the discrepancies during training, the following loss functions were used:

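  1. The MLP portion of the architecture used the mean squared error function

    $$\mathcal{L}_{\mathrm{MLP}} = \frac{1}{N}\sum_{i=1}^{N} \left\| y_i - \hat{y}_i \right\|_2^2$$

    where $\hat{y}_i$ represents the predicted values produced by the neural network for the simulator’s output $y_i$ given the input set $x_i$

  2. The Autoencoder portion of the architecture used the mean squared error function and a quadratic penalty cost for producing predicted values outside of the original subspace bounds

    $$\mathcal{L}_{\mathrm{AE}} = \frac{1}{N}\sum_{i=1}^{N} \left[ \left\| x_i - \hat{x}_i \right\|_2^2 + \left\| \max(0, \hat{x}_i - x_u) \right\|_2^2 + \left\| \max(0, x_l - \hat{x}_i) \right\|_2^2 \right]$$

    where $\hat{x}_i$ represents the predicted values produced by the neural network for the simulator’s input given the input set $x_i$, $x_u$ represents the input set’s upper bound, and $x_l$ represents the input set’s lower bound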
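The two training losses described above can be sketched in a few lines of Python. The function names and the unit penalty weight are assumptions made for this illustration; the relative weighting of the bound penalty is not stated in the text.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def bound_penalty(predictions, lower, upper):
    """Quadratic cost for every predicted input that leaves [lower, upper]."""
    total = 0.0
    for p, lo, hi in zip(predictions, lower, upper):
        total += max(0.0, p - hi) ** 2 + max(0.0, lo - p) ** 2
    return total


def autoencoder_loss(inputs, reconstructed, lower, upper, weight=1.0):
    """Reconstruction error plus the out-of-bounds penalty (weight is assumed)."""
    return (mean_squared_error(inputs, reconstructed)
            + weight * bound_penalty(reconstructed, lower, upper))


# Hypothetical example: the second reconstructed input overshoots its upper bound.
inputs = [0.5, 0.5]
reconstructed = [0.5, 1.2]
loss = autoencoder_loss(inputs, reconstructed, lower=[0.0, 0.0], upper=[1.0, 1.0])
print(loss)
```

The penalty is zero whenever every reconstructed input stays inside its bounds, so it only shapes the decoder near the edges of the original input space.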

3.2 Empirical Results

We use the Sioux Falls transportation model [28] for our empirical results. This model consists of 24 traffic analysis zones and 24 intersections with 76 directional roads, or arcs. The network structure and input data provided by [46] have been adjusted from the original dataset to approximate hourly demand flows in the form of Origin-Destination (O-D) pairs, the simulation’s input set.

The input data is provided to a simulator package which implements the iterative Frank-Wolfe method to determine the traffic equilibrium flows and outputs average travel times across each arc. Due to limited computing availability, only the first twenty O-D pairs are treated as bounded unknown input variables which need to be calibrated, while the other O-D pairs are assumed to be known. Random noise is added to the simulator to emulate the observational and variational errors expected in real-world applications. The calibration framework’s objective is to minimize the mean discrepancy between the simulated travel times resulting from the calibrated O-D pairs and the ’true’ times resulting from the full set of true O-D pair values.
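For readers unfamiliar with the Frank-Wolfe method, the following sketch computes a user equilibrium on a toy network of parallel links. It is not the simulator package used in this paper: the network (one O-D pair served by two parallel links), the BPR-type travel time function with coefficients 0.15 and 4, and the link parameters are assumptions for illustration.

```python
def travel_time(flow, free_time, capacity):
    """BPR-type link travel time (assumed functional form)."""
    return free_time * (1.0 + 0.15 * (flow / capacity) ** 4)


def frank_wolfe(demand, links, iterations=20):
    """Equilibrium flows for one O-D pair served by parallel links.

    links: list of (free_time, capacity) tuples, one per link.
    Returns (flows, travel_times).
    """
    n = len(links)
    costs = [travel_time(0.0, *link) for link in links]
    flows = [0.0] * n
    flows[costs.index(min(costs))] = demand          # initial all-or-nothing load
    for _ in range(iterations):
        costs = [travel_time(v, *link) for v, link in zip(flows, links)]
        target = [0.0] * n
        target[costs.index(min(costs))] = demand     # all-or-nothing assignment
        direction = [y - x for x, y in zip(flows, target)]

        def slope(step):
            # Derivative of the Beckmann objective along the search direction.
            return sum(travel_time(x + step * d, *link) * d
                       for x, d, link in zip(flows, direction, links))

        if slope(1.0) <= 0.0:
            step = 1.0
        else:
            low, high = 0.0, 1.0
            for _ in range(40):                      # bisection line search
                mid = 0.5 * (low + high)
                if slope(mid) > 0.0:
                    high = mid
                else:
                    low = mid
            step = 0.5 * (low + high)
        flows = [x + step * d for x, d in zip(flows, direction)]
    return flows, [travel_time(v, *link) for v, link in zip(flows, links)]


# Hypothetical two-route example: a short link and a longer alternative.
flows, times = frank_wolfe(10.0, [(10.0, 5.0), (15.0, 5.0)])
print(flows, times)
```

At equilibrium both used links have the same travel time, which is the condition the line search drives toward; on a full network such as Sioux Falls the all-or-nothing step would instead be a shortest-path assignment for every O-D pair.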

Overall, the calibration using a deep neural network performed well; see Figure 2(a). A calibrated solution set was produced which resulted in outputs, on average, within 3% of the experiment’s true output, with a standard deviation of 5%. Figure 2(b) visualizes those links whose variation from the true demand’s output was greater than average. Given the same computational budget, Bayesian optimization that uses the low-dimensional representation from the deep learner leads to a 25% more accurate match between measured and simulated data when compared to active subspaces.

(a) Objective Function (b) Calibrated vs True
Figure 2: Results of demand matrix calibration using Bayesian optimization. (a) Comparison of the three methods in terms of objective evaluations applied to the original parameter space (black line), the reduced dimensionality parameter space of Active Subspaces (red line), and the reduced dimensionality parameter space of Neural Networks (blue line). (b) Comparison of calibrated and true travel time outputs with above average differences.

4 Deep Reinforcement Learning

Consider a calibrated simulator that is used not for the evaluation of scenarios of interest but as a tool for designing a policy $\pi$ which dictates an optimal action $a$ for the current state of the system $s$.

Once calibrated, the simulator is no longer regarded as a black-box but as an interactive system of players, known as agents, and their environment. In such a system, the agent interacts with an environment in discrete time steps. At each timestep, $t$, the agent has a set of actions, $A$, which can be executed. Given the action, the environment changes from its original state, $s_t$, to a new, influenced state, $s_{t+1}$. If, for every action or through a set of actions, a reward $r_t$ is derived by the agent, the sequential decision problem can be solved through a concept known as Reinforcement Learning (RL).

Such a structure is quite conducive to transportation. For example, if a commuter chooses to leave the house after rush hour has ended, he will eventually be rewarded at the end of his commute with a shorter travel time than if he had left at the beginning of rush hour. Although not immediately realized, the reward is no less desired and will, in the future, encourage the agent to perform the same behavior when possible.

The quantification of such an action-reward trade-off is represented through the Q-function used in Q-learning [49]. The Q-function, referencing the ’quality’ of a certain action within the environment’s current state, represents the maximum discounted reward that can be obtained in the future if action $a$ is performed in state $s$ and all subsequent actions follow the optimal policy from that state on:

$$Q^*(s, a) = \max_{\pi} \, \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right], \qquad R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},$$

where $R_t$ is the discounted return and $\gamma \in [0, 1]$ is the factor used to weigh the importance of immediate and future rewards.

In other words, it is the greatest reward we can expect given we follow the best set of action sequences after performing action $a$ in state $s$. Subsequently, the optimal policy requires choosing the optimal, or maximum, value for each state:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a).$$
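A tabular version of Q-learning makes the update rule and the greedy policy concrete. The sketch below applies it to a toy version of the commuter example given earlier; the two states, the two actions, and the reward values (negative travel times in minutes) are hypothetical and chosen only for illustration.

```python
import random


def q_learning(transitions, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    transitions[(state, action)] = (next_state, reward, done)
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)                       # explore
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])  # exploit
            next_state, reward, done = transitions[(state, action)]
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


def greedy_policy(Q):
    """Pick the action with the largest Q-value in every state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]


# Hypothetical commuter problem.
# State 0: at home during rush hour. State 1: at home after rush hour.
# Action 0: leave now. Action 1: wait.
transitions = {
    (0, 0): (0, -60.0, True),    # leave during rush hour: 60-minute trip
    (0, 1): (1, 0.0, False),     # wait until rush hour ends
    (1, 0): (1, -30.0, True),    # leave after rush hour: 30-minute trip
    (1, 1): (1, -5.0, False),    # keep waiting, with a small lateness cost
}
Q = q_learning(transitions, n_states=2, n_actions=2)
policy = greedy_policy(Q)
print(policy, Q)
```

The learned policy waits during rush hour and leaves afterwards, even though the benefit of waiting is only realized at the end of the commute.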


Unfortunately, the Q-functions for these transportation simulators continue to possess the high-dimensionality concerns noted in our previous calibration work. However, recent advancements have allowed for the successful integration of reinforcement learning’s Q-function with deep neural networks [37]. Known as Deep Q Networks (DQN), these neural networks have the potential to provide a diminished feature set for highly structured, high-dimensional data without hindering the power of reinforcement learning.

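For development and training of such a network, a neural network architecture best fitting the problem is constructed with the following loss function [48]

$$L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^2 \right],$$

where $\theta$ are the parameters, $Q(s, a; \theta)$ is the Q-function for state $s$ and action $a$ and

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$

where $\theta^{-}$ represents parameters of a fixed and separate target network.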
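The loss with a separate target network can be sketched independently of any particular neural network library. In the example below the online and target networks are stand-in Python callables that return one Q-value per action; these callables, the batch contents, and the discount factor are assumptions for illustration only.

```python
def td_target(reward, next_q_values, gamma, done):
    """Bootstrapped target computed from the frozen target network's outputs."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared difference between targets and online Q-value estimates.

    batch: list of (state, action, reward, next_state, done) transitions.
    q_online, q_target: callables mapping a state to a list of Q-values.
    """
    total = 0.0
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, q_target(next_state), gamma, done)
        total += (y - q_online(state)[action]) ** 2
    return total / len(batch)


# Hypothetical stand-ins for the two networks (constant outputs, two actions).
def q_online(state):
    return [1.0, 2.0]


def q_target(state):
    return [0.5, 3.0]


batch = [
    (0, 1, 1.0, 1, False),   # non-terminal transition
    (1, 0, -1.0, 0, True),   # terminal transition: no bootstrapping
]
loss = dqn_loss(batch, q_online, q_target, gamma=0.9)
print(loss)
```

Only the online network's parameters would be updated by gradient descent on this loss; the target network is held fixed and refreshed from the online network at intervals.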

Furthermore, to increase the data efficiency and reduce the correlation among samples during training, DQNs leverage a buffer system known as experience replay. Each transition and answer set, $(s_t, a_t, r_t, s_{t+1})$, is stored offline and, throughout the training process, small random batches from the replay memory are used instead of the most recent transition.
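A replay memory of this kind can be sketched with the Python standard library. The class name, the fixed capacity, and the extra `done` flag stored with each transition are assumptions of this illustration, not details taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)   # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random mini-batch, which breaks up correlated consecutive steps."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Hypothetical usage: store five transitions in a buffer that holds three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(step, 0, -1.0, step + 1, False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

During training, each gradient step would draw such a batch and pass it to the loss function instead of using only the latest transition.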

For the purpose of this paper, an MLP network is utilized as the neural architecture.

4.1 Empirical Results

For demonstration and analysis, a small transportation network, depicted in Figure 3, consisting of 3 nodes and 2 routes, or arcs, is used.

Figure 3: Graphical Representation of the Example Transportation System

The small network has varying demand originating from node to node for time periods. Using RL, we find the best policy to handle this demand with the lowest system travel time given that any single period has two allowable actions:

  1. units of demand from node to node can be delayed up to one hour

  2. units of demand from node to node can be re-routed to node as an alternative destination at a further distance

In essence we solve the optimal traffic assignment problem. Our state contains the following information: (i) the amount of original demand from node to node that is to be executed at time , ; (ii) the amount of demand moved to time for execution from time , ; (iii) the amount of demand left to be met between time and divided by the amount of time left. The action includes the option to move ,,or units of demand from the current period to the subsequent period or move ,,or units of demand from the arc between node and node to the arc between node and node , . The reward is calculated using the same simulator package from Section 3.2, which implements the iterative Frank-Wolfe method to determine the traffic equilibrium flows and outputs total system travel time for the period. Since Q-learning seeks the maximum reward, we took the negative total system travel time.

After running the network on 100 of -long episodes, a randomly generated set of demand was produced and run through the resulting neural network. A improvement in the system travel time was achieved. Table 1 illustrates the adjustments decided by the network and Figure 4 compares the travel times by period between the original and adjusted demands.

Figure 4: Comparison of System Travel Time per Period

Original Demand of    DQN Adjusted Demand of    DQN Adjusted Demand of
4                     4                         0
2 2 0 1
3                     1                         1
1                     2                         1
1                     0                         1
3                     3                         0
0                     0                         0
0                     0                         0
1                     0                         1
1                     0                         1
1                     0                         1
2                     0                         1
2                     1                         1
4                     2                         1
2                     2                         1
1                     1                         1
3                     0                         1
2                     2                         1
0                     1                         0
2                     0                         1
3                     1                         1
2                     2                         1
1                     1                         1
2                     0                         2

Table 1: Adjustments to Demands per Period using DQN

5 Discussion

Deep learning provides a general framework for modeling complex relations in transportation systems. As such, deep learning frameworks are well-suited to many optimization problems in transportation. This paper presents an innovative deep learning architecture for applying reinforcement learning and calibrating a transportation model. We have demonstrated that deep learning is a viable option compared to other metamodel-based approaches. Our calibration and reinforcement learning examples demonstrate how to develop and apply deep learning models in transportation modeling.

At the same time, there are significant challenges associated with using deep learning for optimization problems, most notably the issue of performance of deep reinforcement learning [52]. Though theoretical bounds on performance of different RL algorithms do exist, the research done over the past few decades showed that worst case analysis is not the right framework for studying artificial intelligence: every model that is interesting enough to use in practice leads to computationally hard problems [10]. Similarly, while there are many important theoretical results that show very slow convergence of many RL algorithms, RL was shown to work well empirically on specific classes of problems. The convergence analysis developed for RL techniques is usually asymptotic and worst case. Asymptotic optimality was shown by [49], who showed that Q-learning, which is an iterative scheme to learn optimal policies, does converge to the optimal solution asymptotically. Littman and Szepesvari [32] showed that a general reinforcement learning model based on exploration does converge to an optimal solution. It is not uncommon for convergence rates in practice to be much better than predicted by worst case scenario analysis. Some of the recent work suggests that using recurrent architectures for Value Iteration Networks (VIN) can achieve good empirical performance compared to fully connected architectures [30]. Adaptive approaches that rely on meta-learning were shown to improve performance of reinforcement learning algorithms [4].

Another issue that requires further research is the bias-variance trade-off in the context of deep reinforcement learning. Traditional regularization techniques that add stochasticity to RL functions do not prevent over-fitting.


In the meantime, deep learning and deep reinforcement learning are likely to exert greater and greater influence in the practice of transportation.


  • [1] Baher Abdulhai and Lina Kattan. Reinforcement learning: Introduction to theory and potential for transport applications. Canadian Journal of Civil Engineering, 30(6):981–991, 2003.
  • [2] Zain Adam, Montasir Abbas, and Pengfei Li. Evaluating Green-Extension Policies with Reinforcement Learning and Markovian Traffic State Estimation. Transportation Research Record: Journal of the Transportation Research Board, 2128:217–225, December 2009.
  • [3] Zain Adam, Montasir Abbas, and Pengfei Li. Evaluating green-extension policies with reinforcement learning and markovian traffic state estimation. Transportation Research Record: Journal of the Transportation Research Board, (2128):217–225, 2009.
  • [4] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
  • [5] Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
  • [6] Theo Arentze and Harry Timmermans. Albatross: a learning based transportation oriented simulation system. Eirass Eindhoven, 2000.
  • [7] Joshua Auld, Michael Hope, Hubert Ley, Vadim Sokolov, Bo Xu, and Kuilin Zhang. Polaris: Agent-based modeling framework development and implementation for integrated travel demand and network and operations simulations. Transportation Research Part C: Emerging Technologies, 64:101–116, 2016.
  • [8] Ana LC Bazzan. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3):342–375, 2009.
  • [9] Francois Belletti, Daniel Haziza, Gabriel Gomes, and Alexandre M Bayen. Expert level control of ramp metering based on multi-task deep reinforcement learning. arXiv preprint arXiv:1701.08832, 2017.
  • [10] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. CoRR, abs/1311.3651, 2013.
  • [11] Ella Bingham. Reinforcement learning in neurofuzzy traffic signal control. European Journal of Operational Research, 131(2):232–241, 2001.
  • [12] R.L. Cheu, X. Jin, K.C. Ng, Y.L. Ng, and D. Srinivasan. Calibration of FRESIM for Singapore expressway using genetic algorithm. Journal of Transportation Engineering, 124(6):526–535, November 1998.
  • [13] L. Chong, M. Abbas, B. Higgs, A. Medina, and C. Y. D. Yang. A revised reinforcement learning algorithm to model complicated vehicle continuous actions in traffic. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1791–1796, October 2011.
  • [14] Linsen Chong and Carolina Osorio. A simulation-based optimization algorithm for dynamic large-scale urban transportation problems. Transportation Science, 2017.
  • [15] Ernesto Cipriani, Michael Florian, Michael Mahut, and Marialisa Nigro. A gradient approximation approach for adjusting temporal origin–destination matrices. Transportation Research Part C: Emerging Technologies, 19(2):270–282, April 2011.
  • [16] Christian G Claudel and Alexandre M Bayen. Lax–Hopf based incorporation of internal boundary conditions into Hamilton–Jacobi equation. Part I: Theory. IEEE Transactions on Automatic Control, 55(5):1142–1157, 2010.
  • [17] Raymond Cunningham, Anurag Garg, Vinny Cahill, et al. A collaborative reinforcement learning approach to urban traffic control optimization. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT’08. IEEE/WIC/ACM International Conference on, volume 2, pages 560–566. IEEE, 2008.
  • [18] Matthew F Dixon, Nicholas G Polson, and Vadim O Sokolov. Deep learning for spatio-temporal modeling: Dynamic traffic flows and high frequency trading. arXiv preprint arXiv:1705.09851, 2017.
  • [19] Tamara Djukic, Gunnar Flötteröd, Hans Van Lint, and Serge Hoogendoorn. Efficient real time OD matrix estimation based on Principal Component Analysis. In Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, pages 115–121. IEEE, 2012.
  • [20] Gunnar Flötteröd. A general methodology and a free software for the calibration of DTA models. In The Third International Symposium on Dynamic Traffic Assignment, 2010.
  • [21] Gunnar Flötteröd, Michel Bierlaire, and Kai Nagel. Bayesian demand calibration for dynamic traffic simulations. Transportation Science, 45(4):541–561, 2011.
  • [22] Wade Genders and Saiedeh Razavi. Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142, 2016.
  • [23] Ilaria Giannoccaro and Pierpaolo Pontrandolfo. Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002.
  • [24] David K. Hale, Constantinos Antoniou, Mark Brackstone, Dimitra Michalaka, Ana T. Moreno, and Kavita Parikh. Optimization-based assisted calibration of traffic simulation models. Transportation Research Part C: Emerging Technologies, 55:100–115, June 2015.
  • [25] Martin L. Hazelton. Statistical inference for time varying origin–destination matrices. Transportation Research Part B: Methodological, 42(6):542–552, July 2008.
  • [26] Kiet Lam, Walid Krichene, and Alexandre Bayen. On learning how players learn: estimation of learning dynamics in the routing game. In Cyber-Physical Systems (ICCPS), 2016 ACM/IEEE 7th International Conference on, pages 1–10. IEEE, 2016.
  • [27] Jeffrey Larson, Todd Munson, and Vadim Sokolov. Coordinated platoon routing in a metropolitan network. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, pages 73–82. SIAM, 2016.
  • [28] Larry J. LeBlanc, Edward K. Morlok, and William P. Pierskalla. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation research, 9(5):309–318, 1975.
  • [29] Jung Beom Lee and Kaan Ozbay. New calibration methodology for microscopic traffic simulation using enhanced simultaneous perturbation stochastic approximation approach. Transportation Research Record, (2124):233–240, 2009.
  • [30] Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, and Ruslan Salakhutdinov. LSTM iteration networks: An exploration of differentiable path finding. 2018.
  • [31] Kenny Ling and Amer S Shalaby. A reinforcement learning approach to streetcar bunching control. Journal of Intelligent Transportation Systems, 9(2):59–68, 2005.
  • [32] Michael L. Littman and Csaba Szepesvari. A Generalized Reinforcement-Learning Model: Convergence and Applications. Technical report, Brown University, Providence, RI, USA, 1996.
  • [33] Lu Lu, Yan Xu, Constantinos Antoniou, and Moshe Ben-Akiva. An enhanced SPSA algorithm for the calibration of Dynamic Traffic Assignment models. Transportation Research Part C: Emerging Technologies, 51:149–166, February 2015.
  • [34] T. Ma and B. Abdulhai. Genetic algorithm-based optimization approach and generic tool for calibrating traffic microscopic simulation parameters. Intelligent Transportation Systems and Vehicle-highway Automation 2002: Highway Operations, Capacity, and Traffic Control, (1800):6–15, 2002.
  • [35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [36] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [37] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
  • [38] Kai Nagel and Gunnar Flötteröd. Agent-based traffic assignment: Going from trips to behavioural travelers. In Travel Behaviour Research in an Evolving World–Selected papers from the 12th international conference on travel behaviour research, pages 261–294. International Association for Travel Behaviour Research, 2012.
  • [39] Rahul Nair and Elise Miller-Hooks. Fleet management for vehicle sharing operations. Transportation Science, 45(4):524–540, 2011.
  • [40] Nicholas Polson, Vadim Sokolov, et al. Bayesian analysis of traffic flow on Interstate I-55: The LWR model. The Annals of Applied Statistics, 9(4):1864–1888, 2015.
  • [41] Nicholas G Polson and Vadim O Sokolov. Deep learning for short-term traffic flow prediction. Transportation Research Part C: Emerging Technologies, 79:1–17, 2017.
  • [42] Laura Schultz and Vadim Sokolov. Bayesian optimization for transportation simulators. Procedia Computer Science, 130:973–978, 2018.
  • [43] Vadim Sokolov, Joshua Auld, and Michael Hope. A flexible framework for developing integrated models of transportation systems using an agent-based approach. Procedia Computer Science, 10:854–859, 2012.
  • [44] Vadim Sokolov, Jeffrey Larson, Todd Munson, Josh Auld, and Dominik Karbowski. Maximization of platoon formation through centralized routing and departure time coordination. Transportation Research Record: Journal of the Transportation Research Board, (2667):10–16, 2017.
  • [45] Heinz Spiess and Michael Florian. Optimal strategies: a new assignment model for transit networks. Transportation Research Part B: Methodological, 23(2):83–102, 1989.
  • [46] Ben Stabler. TransportationNetworks: Transportation Networks for Research, September 2017.
  • [47] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition (in preparation), 2017.
  • [48] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • [49] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, May 1992.
  • [50] Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.
  • [51] Cathy Wu, Kanaad Parvate, Nishant Kheterpal, Leah Dickstein, Ankur Mehta, Eugene Vinitsky, and Alexandre M Bayen. Framework for control and deep reinforcement learning in traffic. In Intelligent Transportation Systems (ITSC), 2017 IEEE 20th International Conference on, pages 1–8. IEEE, 2017.
  • [52] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
  • [53] Bailing Zhang, Minyue Fu, and Hong Yan. A nonlinear neural network model of mixture of local principal component analysis: application to handwritten digits recognition. Journal of the Pattern Recognition Society, 34(2):203–214, February 2001.
  • [54] C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A Study on Overfitting in Deep Reinforcement Learning. ArXiv e-prints, April 2018.