1 Introduction
Many modern transportation system analysis problems lead to high-dimensional and highly nonlinear optimization problems. Examples include fleet management [39], intelligent system operations [26] and long-term urban planning [45]. Transportation system models used in those optimization problems typically assume analytical formulations based on mathematical programming [27, 44] or conservation laws [16, 40] and rely on high-level abstractions, such as origin-destination matrices for demand. An alternative approach is to model the transportation system using a complex simulator [38, 43, 7], which models individual travelers and provides a flexible way to represent traffic and demand patterns in large-scale multimodal transportation systems. However, solving optimization and dynamic-control problems that rely on simulator models of the system is prohibitive due to computational costs. Recently, metamodel-based approaches were proposed to solve simulation-based transportation optimization problems [14, 42].
In this paper, we propose an alternative approach to solving optimization problems for large-scale transportation systems. Our approach relies on low-complexity metamodels deduced from complex simulators, combined with reinforcement learning. We use deep learning approximators for the low-complexity metamodels. A deep learner is a latent variable model (LVM) capable of extracting the underlying low-dimensional pattern in high-dimensional input-output relations. Deep learners have proven highly effective in combination with reinforcement and active learning [36] for recognizing such patterns for exploitation. Our approach builds on the work on simulation-based optimization [42, 14], deep learning [41, 18], as well as reinforcement learning [51, 50] techniques recently proposed for transportation applications. The two main contributions of this paper are:
Development of an innovative deep learning architecture for reducing the dimensionality of the search space and modeling relations between transportation simulator inputs (travel behavior parameters, traffic network characteristics) and outputs (mobility patterns, traffic congestion)

Development of reinforcement learning techniques that rely on deep learning approximators to solve optimization problems for dynamic transportation systems
We demonstrate our methodologies using two applications. First, we solve the problem of calibrating a complex, stochastic transportation simulator, which needs to be systematically adjusted to match field data. Calibrating a simulator is key to making it useful for both short-term operational decisions and long-term urban planning. We improve on numerous previously proposed approaches that treat the calibration of simulation-based traffic flow models as an optimization problem [12, 34, 33, 15, 29, 24, 25, 20, 21, 19]. Our approach makes no assumption about the form of the simulator or the types of inputs and outputs used. Further, we show that deep learning models are more sample efficient when compared to Bayesian techniques or more traditional filtering techniques. We build on our calibration framework [42] by further exploring the dimensionality reduction utilized for more efficient input parameter space exploration. More specifically, we introduce the formulation and analysis of a combinatorial neural network method and compare it with previous work that used active subspace methods.
The second application builds upon recent advances in deep learning approaches to reinforcement learning (RL) that have demonstrated impressive results in game playing [35] through the application of neural networks for approximating state-action functions. Reinforcement learning mimics the way humans learn new tasks and behavioral policies via trial and error, and has proven successful in many applications [47]. While most of the research on RL is done in the field of machine learning and applied to classical AI problems, such as robotics, language translation and supply chain management [23], some classical transportation control problems have been previously solved using RL [1, 6, 11, 8, 5, 31, 17, 2, 13]. Furthermore, recent attempts have successfully demonstrated applications of deep RL to traffic flow control [3, 51, 50, 9, 22].

The remainder of this paper is organized as follows: Section 2 briefly documents the highlights of neural network architectures; Section 3 describes the new deep learning architecture that finds low-dimensional patterns in the simulator's input-output relations, and applies our deep learner to the problem of model calibration; Section 4 describes the additional application of deep reinforcement learning to transportation system optimization. Finally, Section 5 offers avenues for further research.
2 Deep Learning
Let $y$ denote a (low-dimensional) output and $x$ a (high-dimensional) set of inputs. We wish to recover the multivariate function (map), denoted by $y = F(x)$, using training data of input-output pairs $D = \{(x_i, y_i)\}_{i=1}^n$, that generalizes well for out-of-sample data. Deep learning uses compositions of functions, rather than traditional additive ones. By composing $L$ layers, a deep learning predictor becomes

$$\hat{y} = F_{W,b}(x) = \left( f^{(1)}_{W_1,b_1} \circ \cdots \circ f^{(L)}_{W_L,b_L} \right)(x), \qquad f^{(l)}_{W_l,b_l}(z) = \sigma_l(W_l z + b_l).$$

Here $\sigma_l$ is a univariate activation function applied element-wise. The set $\{(W_l, b_l)\}_{l=1}^L$ is the set of weights and offsets which are learned from training data; the dimensionality of each $W_l$ and $b_l$ is part of the architecture specification. Training the parameters and selecting an architecture is achieved by regularized least squares. Stochastic gradient descent (SGD) and its variants are used to find the solution to the optimization problem [41]

$$\min_{W,b} \; \sum_{i=1}^{n} \lVert y_i - F_{W,b}(x_i) \rVert^2 + \lambda \phi(W, b),$$

where $D = \{(x_i, y_i)\}_{i=1}^n$ is the training data of input-output pairs, and $\phi$ is a regularisation penalty on the network parameters (weights and offsets).
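The composition-of-layers predictor and its SGD training can be illustrated with a minimal numpy sketch. This is not the paper's architecture: the two-layer network $\hat{y} = W_2 \tanh(W_1 x + b_1) + b_2$, the toy target, the layer sizes, and the learning rate are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: a two-layer composed predictor trained by per-sample SGD
# on a ridge-penalized squared loss. All sizes and rates are illustrative.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (8, 3)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (1, 8)), np.zeros(1)

def predict(x):
    # Composition of layers: tanh hidden layer, linear output layer.
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Toy training data: y = sum of the inputs (a smooth target the net can fit).
X = rng.uniform(-1, 1, (200, 3))
Y = X.sum(axis=1, keepdims=True)

lam, lr = 1e-4, 0.05            # regularization weight and learning rate
for epoch in range(300):
    for x, y in zip(X, Y):
        h = np.tanh(W1 @ x + b1)
        err = (W2 @ h + b2) - y          # prediction error
        # Backpropagate through the composition (chain rule).
        gW2 = np.outer(err, h) + lam * W2
        gb2 = err
        dh = (W2.T @ err) * (1 - h**2)   # tanh'(u) = 1 - tanh(u)^2
        gW1 = np.outer(dh, x) + lam * W1
        gb1 = dh
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean([(predict(x) - y) ** 2 for x, y in zip(X, Y)]))
```

After training, the in-sample mean squared error is small, illustrating that the composed map recovers the input-output relation on this toy problem.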
In this paper we develop a new deep learning architecture for simultaneously learning a low-dimensional representation of the simulator's input parameter space as well as the relation between the simulator's inputs and outputs. Our architecture relies on multilayer perceptron and autoencoding layers. A multilayer perceptron (MLP) network is a neural network which takes a set of inputs, $x$, and feeds them through one or more sets of intermediate layers to compute one or more output values, $y$. Although a single architecture is commonly implemented in practice, some success has been found through comparative or combinatorial means [53].

2.1 Autoencoder
An autoencoder is a deep learning routine which trains the architecture to approximate $x$ by itself (i.e., $\hat{x} = x$) via a bottleneck structure. This means we select a model $F_{W,b}(x)$ which aims to concentrate the information required to recreate $x$. Put differently, an autoencoder creates a more cost-effective representation of $x$. For example, under an $L^2$-loss function, we wish to solve

$$\min_{W,b} \; \lVert F_{W,b}(x) - x \rVert^2,$$

subject to a regularization penalty on the weights and offsets. In an autoencoder, for a training data set $\{x_1, \ldots, x_n\}$, we set the target values as $y_i = x_i$. A static autoencoder with two linear layers, akin to a traditional factor model, can be written as a deep learner as

$$z = \sigma(W_1 x + b_1), \qquad \hat{x} = W_2 z + b_2,$$

where $z$ is the activation vector at the bottleneck layer. The goal is to find the weights and biases so that the size of $z$ is much smaller than the size of $x$.

3 Deep Learning for Calibration
In this section we develop a new deep learning architecture that can be used to learn low-dimensional structure in the simulator input-output relations. We then use the low-dimensional representations to solve an optimization problem, namely the calibration of a transportation simulator. As simulators become more detailed, high dimensionality has become a pressing concern. In practice, high-dimensional data possesses a natural structure that can be expressed in low dimensions. Through dimension reduction, the effective number of parameters in the model is reduced, enabling analysis from smaller data sets.
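The dimension-reduction idea can be made concrete with a small numpy sketch: high-dimensional samples that in fact live near a low-dimensional subspace are compressed by a truncated SVD (classical PCA, not the autoencoder or active subspace methods of this paper). The data-generating process and the 99% variance threshold are illustrative assumptions.

```python
import numpy as np

# 50-dimensional points generated from only 3 latent factors, plus
# small noise; PCA recovers the effective dimensionality.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))            # 3 true degrees of freedom
mixing = rng.normal(size=(3, 50))             # embedding into 50 dimensions
X = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

Xc = X - X.mean(axis=0)                       # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = np.cumsum(S**2) / np.sum(S**2)    # cumulative variance explained
k = int(np.searchsorted(explained, 0.99)) + 1 # components for 99% variance
X_reduced = Xc @ Vt[:k].T                     # low-dimensional representation
```

Here `k` recovers the 3 latent factors, and downstream analysis (e.g., calibration search) can operate on the 3-dimensional `X_reduced` instead of the 50-dimensional raw inputs.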
Recently, we developed a low-dimensional metamodel to be used for transportation simulation-based problems [42]. We calculated active subspaces to capture low-dimensional structures in the simulator, and used a Gaussian process to represent the input-output relations. There are several key concerns in developing efficient simulation-based algorithms for transportation applications:

Algorithms must be sample efficient and parallelizable. Each simulation run is computationally expensive and can take up to a few days. This computational constraint could potentially limit the scale and scope of calibration investigations, leaving large areas of the sample space unexplored and yielding suboptimal decisions. As high-performance computing (HPC) resources have become increasingly available in most research environments, new modes of computational processing and experimentation have become possible: parallel tasking capabilities allow multiple simulated runs to be performed simultaneously, and HPC schedulers coordinate worker units running code across multiple processors to make the best use of the available resources and time. By leveraging these advances and running a queue of pending input sets concurrently through the simulator, a larger set of unknown inputs can be evaluated in an acceptable time frame.
The variational landscape for a simulation model will not be uniform throughout the state space. Although the cost of constrained computational resources can be mitigated by HPC, additional care should be taken to determine, given the data collected, when exploration or exploitation should be encouraged, and redundant sampling should be avoided.
A machine learning technique known as active learning is leveraged to provide such a scheme. A utility function (a.k.a. acquisition function) is built to balance the exploration of unknown portions of the input sample space with the exploitation of all information gathered from previously evaluated data, resulting in a prioritized ordering reflecting the motivation and objectives behind the calibration effort. The expectation of the utility function is taken over the Bayesian posterior distribution and maximized to provide a predetermined number of recommendations.

Transportation modelers have many ways to model the complex interactions represented within a transportation simulator. Calibration methodologies that account for the internal structure of a simulator can be more efficient; on the other hand, they are hardly generalizable to other types of simulation models. By treating the relationship between the inputs and outputs of the simulator in question as an unknown, stochastic function, black-box optimization methodologies can be leveraged. Specifically, our previously developed Gaussian process framework took the Bayesian approach of constructing a probability distribution over all potential linear and nonlinear functions matching the simulator and leveraging evidential data to determine the most likely match. Once this distribution is sufficiently mapped, an estimate of the sought parameters can be made with minimal uncertainty.
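The acquisition-function idea from the active learning scheme above can be sketched as follows. The paper does not specify a particular utility function; here we use the common upper-confidence-bound (UCB) rule, `mu + kappa * sigma`, over a surrogate model's posterior mean and standard deviation, with the candidate values and `kappa` as illustrative assumptions.

```python
import numpy as np

# UCB acquisition: rank candidate inputs by posterior mean (exploitation)
# plus a multiple of posterior uncertainty (exploration).
def ucb(mean, std, kappa=2.0):
    return mean + kappa * std

mean = np.array([0.2, 0.9, 0.5, 0.1])   # surrogate posterior mean at candidates
std = np.array([0.05, 0.05, 0.6, 0.9])  # surrogate posterior std at candidates
scores = ucb(mean, std)
next_candidate = int(np.argmax(scores)) # index of the point to simulate next
```

Note that the highest-mean candidate (index 1) is not selected; the rule instead prioritizes a poorly explored candidate (index 3), illustrating the exploration-exploitation balance.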
3.1 Deep Learning Architecture
Within the calibration framework, two objectives must be realized by the neural network:

A reduced-dimension subspace which captures the relationship between the simulator inputs and outputs must be bounded in order to allow adequate exploration

Given the reduced-dimension sample determined by the framework, a method to convert the reduction back to the original dimension subspace must exist to allow for simulator evaluations
To address these objectives, we use an MLP architecture to capture the relations between inputs and outputs, and an autoencoder architecture to capture the low-dimensional structure in the input parameters. We run optimization algorithms inside the low-dimensional representation of the input parameter space to address the curse of dimensionality. The autoencoder and MLP share the same initial layers up to the reduced-dimension layer, as shown in Figure 1. The activation function used is the hyperbolic tangent (tanh), which has a range of $(-1, 1)$ and is a close approximation to the sign function.
We ran the simulator several times to generate an initial sample set, which was used by the calibration framework to explore the relationship between the inputs and outputs. Additionally, to quantify the discrepancies during training, the following loss functions were used:

The MLP portion of the architecture used the mean squared error function

$$L_{\mathrm{MLP}} = \frac{1}{n} \sum_{i=1}^{n} \lVert y_i - \hat{y}_i \rVert^2, \qquad (1)$$

where $\hat{y}_i$ represents the predicted values produced by the neural network for the simulator's output given the input set $x_i$.

The autoencoder portion of the architecture used the mean squared error function plus a quadratic penalty cost for producing predicted values outside of the original subspace bounds

$$L_{\mathrm{AE}} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 + \sum_{i=1}^{n} \left( \lVert \max(0,\, \hat{x}_i - u) \rVert^2 + \lVert \max(0,\, l - \hat{x}_i) \rVert^2 \right), \qquad (2)$$

where $\hat{x}_i$ represents the predicted values produced by the neural network for the simulator's input given the input set $x_i$, $u$ represents the input set's upper bound, and $l$ represents the input set's lower bound.
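The shared-layer architecture and the two loss functions can be sketched in numpy. This is only an illustrative forward pass, not the trained network of the experiments: the layer sizes, random weights, and unit box bounds are assumptions, and the out-of-bounds penalty follows the quadratic hinge form described above.

```python
import numpy as np

# Shared encoder to a low-dimensional code, with two heads: an MLP head
# predicting the simulator output and a decoder head reconstructing the input.
rng = np.random.default_rng(2)
d_in, d_code, d_out = 10, 3, 2                  # illustrative sizes
W_enc = rng.normal(0, 0.3, (d_code, d_in))      # shared layers (bottleneck)
W_dec = rng.normal(0, 0.3, (d_in, d_code))      # autoencoder head
W_mlp = rng.normal(0, 0.3, (d_out, d_code))     # MLP head

def encode(x):
    # Shared initial layers up to the reduced-dimension layer (tanh activation).
    return np.tanh(W_enc @ x)

def losses(x, y, lo, hi):
    code = encode(x)
    y_hat = W_mlp @ code                        # predicted simulator output
    x_hat = W_dec @ code                        # reconstructed simulator input
    mlp_loss = float(np.mean((y - y_hat) ** 2))               # MSE, as in (1)
    bound_pen = float(np.sum(np.maximum(0.0, x_hat - hi) ** 2
                             + np.maximum(0.0, lo - x_hat) ** 2))
    ae_loss = float(np.mean((x - x_hat) ** 2)) + bound_pen    # MSE + penalty, as in (2)
    return mlp_loss, ae_loss

x = rng.uniform(0, 1, d_in)
y = rng.uniform(0, 1, d_out)
mlp_loss, ae_loss = losses(x, y, lo=0.0, hi=1.0)
```

Because the two heads share the encoder, minimizing the combined loss forces the bottleneck code to serve both reconstruction and output prediction, which is what makes the code a usable search space for calibration.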
3.2 Empirical Results
We use the Sioux Falls transportation model [28] for our empirical results. This model consists of 24 traffic analysis zones and 24 intersections with 76 directional roads, or arcs. The network structure and input data provided by [46] have been adjusted from the original dataset to approximate hourly demand flows in the form of origin-destination (OD) pairs, the simulation's input set.
The input data is provided to a simulator package which implements the iterative Frank-Wolfe method to determine the traffic equilibrium flows and outputs average travel times across each arc. Due to limited computing availability, only the first twenty OD pairs are treated as unknown input variables (within fixed bounds) which need to be calibrated, while the other OD pairs are assumed to be known. Random noise is added to the simulator to emulate the observational and variational errors expected in real-world applications. The calibration framework's objective is to minimize the mean discrepancy between the simulated travel times resulting from the calibrated OD pairs and the 'true' times resulting from the full set of true OD pair values.
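The calibration objective and the concurrent evaluation of pending candidates can be sketched together. The `simulate` function here is a stand-in for the Frank-Wolfe assignment simulator, with an invented functional form and noise level; the candidate OD vectors are likewise illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def simulate(od_demand):
    # Placeholder simulator: travel time grows with demand, plus random
    # noise emulating observational/variational error. A fresh generator
    # per call keeps concurrent evaluations independent.
    rng = np.random.default_rng()
    return 1.0 + 0.1 * od_demand**2 + 0.01 * rng.normal(size=od_demand.shape)

true_od = np.array([4.0, 2.0, 3.0, 1.0])      # the 'true' demand
observed_times = simulate(true_od)            # field data to match

def objective(candidate_od):
    # Mean discrepancy between simulated and observed travel times.
    return float(np.mean(np.abs(simulate(candidate_od) - observed_times)))

# Evaluate a queue of pending candidates concurrently, as on an HPC system.
candidates = [np.array([4.1, 2.0, 3.0, 1.0]),   # close to the truth
              np.array([8.0, 6.0, 0.5, 5.0])]   # far from the truth
with ThreadPoolExecutor(max_workers=2) as ex:
    scores = list(ex.map(objective, candidates))
best = int(np.argmin(scores))                   # candidate closest to field data
```

A real deployment would swap `ThreadPoolExecutor` for process- or node-level parallelism, since each true simulation run is hours to days of compute.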
Overall, the calibration using a deep neural network performed well, see Figure 2(a). A calibrated solution set was produced which resulted in outputs, on average, within 3% of the experiment's true output, with a standard deviation of 5%. Figure 2(b) provides a visualization of those links which possessed greater than average variation from the true demand's output. Given the same computational budget, Bayesian optimization that uses the low-dimensional representation from the deep learner leads to a 25% more accurate match between measured and simulated data when compared to active subspaces.

Figure 2: (a) Objective function; (b) Calibrated vs. true demand.
4 Deep Reinforcement Learning
Consider the desire for a calibrated simulator to be used not just for the evaluation of scenarios of interest, but as a tool for designing a policy which dictates an optimal action $a$ for the current state $s$ of the system.

Once calibrated, the simulator is no longer regarded as a black box but as an interactive system of players, known as agents, and their environment. In such a system, the agent interacts with an environment in discrete time steps. At each time step $t$, the agent has a set of actions $A$ which can be executed. Given the chosen action $a_t$, the environment changes from its original state, $s_t$, to a new, influenced state, $s_{t+1}$. If, for every action or through a set of actions, a reward is derived by the agent, such sequential decision problems can be solved through a concept known as reinforcement learning (RL).
Such a structure is quite conducive to transportation. For example, if a commuter chooses to leave the house after rush hour has ended, he will eventually be rewarded at the end of his commute with a shorter travel time than if he had left at the beginning of rush hour. Although not immediately realized, the reward is no less desired and will, in the future, encourage the agent to perform the same behavior when possible.
The quantification of such an action-reward tradeoff is represented through a function learned via Q-learning [49]. The Q-function, referencing the 'quality' of a certain action within the environment's current state, represents the maximum discounted reward that can be obtained in the future if action $a_t$ is performed in state $s_t$ and all subsequent actions are continued following the optimal policy from that state on:

$$Q(s_t, a_t) = \max_{\pi} \, \mathbb{E}\left[ R_t \mid s_t, a_t, \pi \right], \qquad R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},$$

where $R_t$ is the discounted return and $\gamma \in (0, 1]$ is the factor used to weight the importance of immediate and future rewards. In other words, it is the greatest reward we can expect given we follow the best set of action sequences after performing action $a_t$ in state $s_t$. Subsequently, the optimal policy chooses the action with the maximum Q-value in each state:

$$\pi(s) = \arg\max_{a} Q(s, a).$$
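These definitions can be illustrated with a minimal tabular Q-learning sketch on a toy chain environment (not the traffic network of Section 4.1): from each state the agent either stays or moves right toward a terminal reward, and the table is updated with the standard rule `Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))`. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 4, 2           # state 3 is terminal
Q = np.zeros((n_states, n_actions))  # the tabular Q-function
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

def step(s, a):
    # Action 1 moves right; reaching state 3 yields reward 1, else 0.
    s2 = min(s + 1, 3) if a == 1 else s
    r = 1.0 if (s2 == 3 and s != 3) else 0.0
    return s2, r

for _ in range(500):                 # episodes
    s = 0
    while s != 3:
        # Epsilon-greedy action selection.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # The Q-learning update toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2

policy = np.argmax(Q, axis=1)        # greedy policy pi(s) = argmax_a Q(s, a)
```

The learned greedy policy moves right from every non-terminal state, and the Q-values decay geometrically with distance from the reward, reflecting the discount factor.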
As noted in the introduction, while most RL research targets classical AI problems [23], several classical transportation control problems have been solved using RL [1, 6, 11, 8, 5, 31, 17, 2, 13].
Unfortunately, the Q-functions for these transportation problems possess the same high-dimensionality concerns noted in our calibration work. However, recent advancements have allowed for the successful integration of reinforcement learning's Q-function with deep neural networks [37]. Known as deep Q-networks (DQN), these neural networks have the potential to provide a diminished feature set for highly structured, high-dimensional data without hindering the power of the reinforcement learning.
For development and training of such a network, a neural network architecture best fitting the problem is constructed with the following loss function [48]

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],$$

where $\theta_i$ are the network parameters, $Q(s, a; \theta_i)$ is the Q-function for state $s$ and action $a$, and

$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$

where $\theta^{-}$ represents the parameters of a fixed and separate target network.
Furthermore, to increase data efficiency and reduce the correlation among samples during training, DQNs leverage a buffer system known as experience replay. Each transition tuple, $(s_t, a_t, r_t, s_{t+1})$, is stored offline and, throughout the training process, random small batches from the replay memory are used instead of the most recent transition.
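A replay buffer of this kind is a few lines of Python. The capacity, batch size, and dummy transitions below are illustrative; a real agent would push transitions observed from the simulator.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions (s, a, r, s') and sample random mini-batches,
    breaking the correlation between consecutive training samples."""

    def __init__(self, capacity=10000):
        # Bounded deque: oldest transitions are evicted once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch from the whole memory.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.push(t, t % 2, float(t), t + 1)   # dummy transitions for illustration
batch = buf.sample(8)
```

During DQN training, each sampled batch is scored against the target-network values $y_i$ above, rather than training on the most recent transition alone.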
For the purposes of this paper, an MLP network is utilized as the neural architecture.
4.1 Empirical Results
For demonstration and analysis, a small transportation network, depicted in Figure 3, consisting of 3 nodes and 2 routes, or arcs, is used.

The small network has varying demand originating from the origin node to the destination node across the 24 hourly time periods. Using RL, we find the best policy to handle this demand with the lowest system travel time, given that in any single period there are two allowable actions:

A portion of the demand between the origin and destination nodes can be delayed up to one hour

A portion of the demand can be rerouted to the third node as an alternative destination at a further distance
In essence, we solve the optimal traffic assignment problem. Our state contains the following information: (i) the amount of original demand from the origin node to the destination node that is to be executed at time $t$; (ii) the amount of demand moved to time $t$ for execution from the previous period; (iii) the amount of demand left to be met after time $t$ divided by the amount of time left. The action includes the option to move units of demand from the current period to the subsequent period, or to move units of demand from the arc toward the original destination to the arc toward the alternative destination. The reward is calculated using the same simulator package from Section 3.2, which implements the iterative Frank-Wolfe method to determine the traffic equilibrium flows and outputs the total system travel time for the period. Since Q-learning seeks the maximum reward, we take the negative total system travel time.
After training the network over 100 long episodes, a randomly generated set of demand was produced and run through the resulting neural network, achieving an improvement in the system travel time. Table 1 illustrates the adjustments decided by the network and Figure 4 compares the travel times by period between the original and adjusted demands.
Table 1: Hourly demand adjustments chosen by the DQN.

Hour | Original Demand | DQN Adjusted Demand | DQN Adjusted Demand
  1  |        4        |          4          |          0
  2  |        2        |          0          |          1
  3  |        3        |          1          |          1
  4  |        1        |          2          |          1
  5  |        1        |          0          |          1
  6  |        3        |          3          |          0
  7  |        0        |          0          |          0
  8  |        0        |          0          |          0
  9  |        1        |          0          |          1
 10  |        1        |          0          |          1
 11  |        1        |          0          |          1
 12  |        2        |          0          |          1
 13  |        2        |          1          |          1
 14  |        4        |          2          |          1
 15  |        2        |          2          |          1
 16  |        1        |          1          |          1
 17  |        3        |          0          |          1
 18  |        2        |          2          |          1
 19  |        0        |          1          |          0
 20  |        2        |          0          |          1
 21  |        3        |          1          |          1
 22  |        2        |          2          |          1
 23  |        1        |          1          |          1
 24  |        2        |          0          |          2
5 Discussion
Deep learning provides a general framework for modeling complex relations in transportation systems. As such, deep learning frameworks are well-suited to many optimization problems in transportation. This paper presents an innovative deep learning architecture for applying reinforcement learning and calibrating a transportation model. We have demonstrated that deep learning is a viable option compared to other metamodel-based approaches. Our calibration and reinforcement learning examples demonstrate how to develop and apply deep learning models in transportation modeling.
At the same time, there are significant challenges associated with using deep learning for optimization problems, most notably the performance of deep reinforcement learning [52]. Though theoretical bounds on the performance of different RL algorithms do exist, research over the past few decades has shown that worst-case analysis is not the right framework for studying artificial intelligence: every model that is interesting enough to use in practice leads to computationally hard problems [10]. Similarly, while many important theoretical results show very slow convergence of RL algorithms, they work well empirically on specific classes of problems. The convergence analysis developed for RL techniques is usually asymptotic and worst case. Asymptotic optimality was shown by [49], who proved that Q-learning, an iterative scheme to learn optimal policies, converges to the optimal solution asymptotically. Littman et al. [32] showed that a general reinforcement learning model based on exploration converges to an optimal solution. It is not uncommon for convergence rates in practice to be much better than predicted by worst-case analysis. Some recent work suggests that using recurrent architectures for value iteration networks (VIN) can achieve good empirical performance compared to fully connected architectures [30]. Adaptive approaches that rely on meta-learning were shown to improve the performance of reinforcement learning algorithms [4]. Another issue that requires further research is the bias-variance tradeoff in the context of deep reinforcement learning: traditional regularization techniques that add stochasticity to RL functions do not prevent overfitting [54]. In the meantime, deep learning and deep reinforcement learning are likely to exert greater and greater influence on the practice of transportation.
References
 [1] Baher Abdulhai and Lina Kattan. Reinforcement learning: Introduction to theory and potential for transport applications. Canadian Journal of Civil Engineering, 30(6):981–991, 2003.
 [2] Zain Adam, Montasir Abbas, and Pengfei Li. Evaluating GreenExtension Policies with Reinforcement Learning and Markovian Traffic State Estimation. Transportation Research Record: Journal of the Transportation Research Board, 2128:217–225, December 2009.
 [3] Zain Adam, Montasir Abbas, and Pengfei Li. Evaluating greenextension policies with reinforcement learning and markovian traffic state estimation. Transportation Research Record: Journal of the Transportation Research Board, (2128):217–225, 2009.
 [4] Maruan AlShedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via metalearning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
 [5] Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. Reinforcement learningbased multiagent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
 [6] Theo Arentze and Harry Timmermans. Albatross: a learning based transportation oriented simulation system. Eirass Eindhoven, 2000.
 [7] Joshua Auld, Michael Hope, Hubert Ley, Vadim Sokolov, Bo Xu, and Kuilin Zhang. Polaris: Agentbased modeling framework development and implementation for integrated travel demand and network and operations simulations. Transportation Research Part C: Emerging Technologies, 64:101–116, 2016.
 [8] Ana LC Bazzan. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and MultiAgent Systems, 18(3):342–375, 2009.
 [9] Francois Belletti, Daniel Haziza, Gabriel Gomes, and Alexandre M Bayen. Expert level control of ramp metering based on multitask deep reinforcement learning. arXiv preprint arXiv:1701.08832, 2017.
 [10] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. CoRR, abs/1311.3651, 2013.
 [11] Ella Bingham. Reinforcement learning in neurofuzzy traffic signal control. European Journal of Operational Research, 131(2):232–241, 2001.

 [12] R.L. Cheu, X. Jin, K.C. Ng, Y.L. Ng, and D. Srinivasan. Calibration of FRESIM for Singapore expressway using genetic algorithm. Journal of Transportation Engineering, 124(6):526–535, November 1998.
 [13] L. Chong, M. Abbas, B. Higgs, A. Medina, and C. Y. D. Yang. A revised reinforcement learning algorithm to model complicated vehicle continuous actions in traffic. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1791–1796, October 2011.
 [14] Linsen Chong and Carolina Osorio. A simulationbased optimization algorithm for dynamic largescale urban transportation problems. Transportation Science, 2017.
 [15] Ernesto Cipriani, Michael Florian, Michael Mahut, and Marialisa Nigro. A gradient approximation approach for adjusting temporal origin–destination matrices. Transportation Research Part C: Emerging Technologies, 19(2):270–282, April 2011.
 [16] Christian G Claudel and Alexandre M Bayen. Lax–hopf based incorporation of internal boundary conditions into hamilton–jacobi equation. part i: Theory. IEEE Transactions on Automatic Control, 55(5):1142–1157, 2010.
 [17] Raymond Cunningham, Anurag Garg, Vinny Cahill, et al. A collaborative reinforcement learning approach to urban traffic control optimization. In Web Intelligence and Intelligent Agent Technology, 2008. WIIAT’08. IEEE/WIC/ACM International Conference on, volume 2, pages 560–566. IEEE, 2008.
 [18] Matthew F Dixon, Nicholas G Polson, and Vadim O Sokolov. Deep learning for spatiotemporal modeling: Dynamic traffic flows and high frequency trading. arXiv preprint arXiv:1705.09851, 2017.

 [19] Tamara Djukic, Gunnar Flötteröd, Hans Van Lint, and Serge Hoogendoorn. Efficient real time OD matrix estimation based on Principal Component Analysis. In Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, pages 115–121. IEEE, 2012.
 [20] Gunnar Flötteröd. A general methodology and a free software for the calibration of DTA models. In The Third International Symposium on Dynamic Traffic Assignment, 2010.
 [21] Gunnar Flötteröd, Michel Bierlaire, and Kai Nagel. Bayesian demand calibration for dynamic traffic simulations. Transportation Science, 45(4):541–561, 2011.
 [22] Wade Genders and Saiedeh Razavi. Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142, 2016.
 [23] Ilaria Giannoccaro and Pierpaolo Pontrandolfo. Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002.
 [24] David K. Hale, Constantinos Antoniou, Mark Brackstone, Dimitra Michalaka, Ana T. Moreno, and Kavita Parikh. Optimizationbased assisted calibration of traffic simulation models. Transportation Research Part C: Emerging Technologies, 55:100–115, June 2015.
 [25] Martin L. Hazelton. Statistical inference for time varying origin–destination matrices. Transportation Research Part B: Methodological, 42(6):542–552, July 2008.
 [26] Kiet Lam, Walid Krichene, and Alexandre Bayen. On learning how players learn: estimation of learning dynamics in the routing game. In CyberPhysical Systems (ICCPS), 2016 ACM/IEEE 7th International Conference on, pages 1–10. IEEE, 2016.
 [27] Jeffrey Larson, Todd Munson, and Vadim Sokolov. Coordinated platoon routing in a metropolitan network. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, pages 73–82. SIAM, 2016.
 [28] Larry J. LeBlanc, Edward K. Morlok, and William P. Pierskalla. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation research, 9(5):309–318, 1975.
 [29] Jung Beom Lee and Kaan Ozbay. New calibration methodology for microscopic traffic simulation using enhanced simultaneous perturbation stochastic approximation approach. Transportation Research Record, (2124):233–240, 2009.
 [30] Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, and Ruslan Salakhutdinov. Lstm iteration networks: An exploration of differentiable path finding. 2018.
 [31] Kenny Ling and Amer S Shalaby. A reinforcement learning approach to streetcar bunching control. Journal of Intelligent Transportation Systems, 9(2):59–68, 2005.
 [32] Michael L. Littman and Csaba Szepesvari. A Generalized ReinforcementLearning Model: Convergence and Applications. Technical report, Brown University, Providence, RI, USA, 1996.
 [33] Lu Lu, Yan Xu, Constantinos Antoniou, and Moshe BenAkiva. An enhanced SPSA algorithm for the calibration of Dynamic Traffic Assignment models. Transportation Research Part C: Emerging Technologies, 51:149–166, February 2015.
 [34] T. Ma and B. Abdulhai. Genetic algorithmbased optimization approach and generic tool for calibrating traffic microscopic simulation parameters. Intelligent Transportation Systems and Vehiclehighway Automation 2002: Highway Operations, Capacity, and Traffic Control, (1800):6–15, 2002.
 [35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [36] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [37] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
 [38] Kai Nagel and Gunnar Flötteröd. Agentbased traffic assignment: Going from trips to behavioural travelers. In Travel Behaviour Research in an Evolving World–Selected papers from the 12th international conference on travel behaviour research, pages 261–294. International Association for Travel Behaviour Research, 2012.
 [39] Rahul Nair and Elise MillerHooks. Fleet management for vehicle sharing operations. Transportation Science, 45(4):524–540, 2011.
 [40] Nicholas Polson, Vadim Sokolov, et al. Bayesian analysis of traffic flow on interstate i55: The lwr model. The Annals of Applied Statistics, 9(4):1864–1888, 2015.
 [41] Nicholas G Polson and Vadim O Sokolov. Deep learning for shortterm traffic flow prediction. Transportation Research Part C: Emerging Technologies, 79:1–17, 2017.
 [42] Laura Schultz and Vadim Sokolov. Bayesian optimization for transportation simulators. Procedia Computer Science, 130:973–978, 2018.
 [43] Vadim Sokolov, Joshua Auld, and Michael Hope. A flexible framework for developing integrated models of transportation systems using an agentbased approach. Procedia Computer Science, 10:854–859, 2012.
 [44] Vadim Sokolov, Jeffrey Larson, Todd Munson, Josh Auld, and Dominik Karbowski. Maximization of platoon formation through centralized routing and departure time coordination. Transportation Research Record: Journal of the Transportation Research Board, (2667):10–16, 2017.
 [45] Heinz Spiess and Michael Florian. Optimal strategies: a new assignment model for transit networks. Transportation Research Part B: Methodological, 23(2):83–102, 1989.
 [46] Ben Stabler. TransportationNetworks: Transportation Networks for Research, September 2017. originaldate: 20160312T22:38:10Z.
 [47] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition (in preparation), 2017.
 [48] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 [cs], November 2015. arXiv: 1511.06581.
 [49] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, May 1992.
 [50] Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.
 [51] Cathy Wu, Kanaad Parvate, Nishant Kheterpal, Leah Dickstein, Ankur Mehta, Eugene Vinitsky, and Alexandre M Bayen. Framework for control and deep reinforcement learning in traffic. In Intelligent Transportation Systems (ITSC), 2017 IEEE 20th International Conference on, pages 1–8. IEEE, 2017.
 [52] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with actiondependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.

 [53] Bailing Zhang, Minyue Fu, and Hong Yan. A nonlinear neural network model of mixture of local principal component analysis: application to handwritten digits recognition. Pattern Recognition, 34(2):203–214, February 2001.
 [54] C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A Study on Overfitting in Deep Reinforcement Learning. ArXiv e-prints, April 2018.