Car companies have been increasing their R&D spending on Automated Driving (AD) technology in recent years, and for good reason: AD promises to greatly reduce accident-related fatalities and increase the productivity of society as a whole. Although AD technology has made important strides over the last couple of years, it is still not ready for large-scale roll-out. Urban environments pose especially significant challenges for AD because of the unpredictable behavior of pedestrians and vehicles in city traffic. Handling intersections safely and efficiently is one of the most challenging problems for urban AD.
Rule-based methods provide a predictable way to handle intersections. However, they do not scale well, because hand-crafting rules becomes increasingly difficult as scene complexity increases. Moreover, the algorithm designer has to come up with hand-crafted rules and parameters for each type of intersection. By intersection types, we mean single-lane or multi-lane right turns, left turns, and forward crossings.
Our goal for this research is to explore a machine learning based method that generalizes to various types of intersections. Machine learning, and particularly deep learning, is a growing field that has had a tremendous impact on applications such as computer vision, speech recognition, and language translation, and it is increasingly being used for decision making. We model the AD vehicle as a learning agent, which learns from positive experiences (successful passing) and negative experiences (collisions) in a reinforcement learning framework.
In this paper we focus on how the knowledge for one type of intersection, represented as a Deep Q-Network (DQN), translates to other types of intersections (tasks). First, we look at direct copy: how well a network trained for Task A performs on Task B. Second, we analyze how the performance of a network initialized from Task A and fine tuned on Task B compares to that of a randomly initialized network trained exclusively on Task B. Third, we investigate reverse transfer: whether a network pre-trained on Task A and fine tuned on Task B preserves knowledge for Task A. Finally, we explore training a network on five tasks sequentially as a lifelong learning scenario.
This paper is organized as follows. After providing a brief literature survey in Section II, we present the problem formulation as a DQN in Section III, before examining various knowledge sharing strategies in Section IV. After explaining the experimental setup in Section V, we present our results in Section VI before concluding in Section VII.
II Related Work
Recently there has been increased interest in using machine learning techniques to control autonomous vehicles. In imitation learning, the policy is learned from a human driver [1]. Online planners based on partially observable Monte Carlo Planning (POMCP) have been shown to handle intersections [2] if an accurate generative model is available, and Markov Decision Processes (MDP) have been used offline to address the intersection problem [3, 4]. Additionally, machine learning has been used to optimize comfort from a set of safe trajectories [5].
Machine learning has greatly benefited from training on large amounts of data. This helps a system learn general representations and prevents overfitting to incidental correlations in the sampled data. In the absence of huge datasets, training on multiple related tasks can give similar gains [6]. A large body of research has investigated transferring knowledge from one system to another, both in machine learning in general [7] and in reinforcement learning specifically [8].
The training time and sample complexity of deep networks make transfer methods particularly appealing [9], and have prompted in-depth investigation to help understand the behavior of transfer [10]. Recent work in deep reinforcement learning has looked at combining networks from different tasks to share information [11, 12]. Efforts have also been made to enable a unified framework for learning multiple tasks through changes in architecture design [13] and modified objective functions [14] to address known problems like catastrophic forgetting [15].
III Intersection Handling using Deep Q-Networks
We view intersection handling as a reinforcement learning problem, and use a Deep Q-Network (DQN) to learn the state-action value function (the Q-function). We assume the AD vehicle is at the intersection and knows its path, and the network is tasked with choosing between two actions, wait or go, at every time step. Once the agent decides to go, it follows an intelligent driver model to keep its distance from the vehicles in front.
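For readers unfamiliar with the intelligent driver model mentioned above, the following is a minimal sketch of its standard acceleration rule. The parameter values (desired speed `v0`, time headway `T`, maximum acceleration `a_max`, comfortable deceleration `b`, minimum gap `s0`, exponent `delta`) are illustrative assumptions, not values reported in this paper.

```python
import math


def idm_acceleration(v, v_lead, gap,
                     v0=20.0,     # desired speed in m/s (assumed)
                     T=1.5,       # desired time headway in s (assumed)
                     a_max=1.0,   # maximum acceleration in m/s^2 (assumed)
                     b=2.0,       # comfortable deceleration in m/s^2 (assumed)
                     s0=2.0,      # minimum standstill gap in m (assumed)
                     delta=4.0):  # acceleration exponent (assumed)
    """Return the IDM acceleration of a follower at speed v behind a leader."""
    closing_speed = v - v_lead
    # Desired dynamic gap: standstill gap plus headway and braking terms.
    desired_gap = s0 + max(
        0.0, v * T + v * closing_speed / (2.0 * math.sqrt(a_max * b))
    )
    # Free-road term minus interaction term.
    return a_max * (1.0 - (v / v0) ** delta - (desired_gap / gap) ** 2)
```

With a very large gap the rule reduces to free-road acceleration toward `v0`; with a small gap and a high closing speed the interaction term dominates and the result is a strong deceleration.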
III-A Reinforcement Learning
In reinforcement learning, an agent in state $s$ takes an action $a$ according to the policy $\pi$ parameterized by $\theta$. The agent transitions to the state $s'$ and receives a reward $r$. This collection is defined as an experience $e = (s, a, r, s')$.
This is typically formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions that the agent may execute, $\mathcal{P}$ is the state transition function, $\mathcal{R}$ is the reward function, and $\gamma$ is a discount factor that adds preference for earlier rewards and provides stability in the case of infinite time horizons. MDPs follow the Markov assumption that the probability of transitioning to a new state given the current state and action is independent of all previous states and actions.
The goal at any time step $t$ is to maximize the future discounted return $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$. In order to optimize the expected return we use Q-learning [16].
Q-learning defines an optimal action-value function $Q^*(s, a)$ as the maximum expected return that is achievable following any policy given a state $s$ and action $a$:
$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right].$$
This follows the dynamic programming properties of the Bellman equation, which state that if the values $Q^*(s', a')$ are known for all $a'$, then the optimal strategy is to select the $a'$ that maximizes the expected value of $r + \gamma Q^*(s', a')$:
$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right].$$
In Deep Q-learning [17], the optimal value function is approximated with a neural network, $Q^*(s, a) \approx Q(s, a; \theta)$. The parameters $\theta$ are learned by using the Bellman equation as an iterative update and minimizing the error between the expected return and the state-action value predicted by the network. This gives the loss for an individual experience in a deep Q-network (DQN):
$$\mathcal{L}(s, a, r, s', \theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2.$$
Early in training, $Q(s', a'; \theta)$ is a poor estimate, which can make learning slow since many updates are required to propagate the reward to the appropriate preceding states and actions. One way to make learning more efficient is to use the $n$-step return [18]:
$$\mathbb{E}\left[ R_t \mid s_t = s, a, \pi \right] \approx r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a'} Q(s_{t+n}, a'; \theta).$$
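As a rough illustration of the difference between the one-step Bellman target and an n-step return, the sketch below computes a bootstrapped target from a list of observed rewards and the squared error used as the loss for one experience. The function names and the numeric values in the usage note are hypothetical; `bootstrap_value` stands in for the maximum network output at the state reached after the last reward.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Return r_t + gamma*r_{t+1} + ... + gamma^n * bootstrap_value.

    `rewards` holds the n observed rewards; with a single reward this is the
    usual one-step Q-learning target.
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def squared_td_loss(target, q_predicted):
    """Squared error between a target return and the predicted Q-value."""
    return (target - q_predicted) ** 2
```

For example, `n_step_target([1.0], 2.0, 0.5)` is the one-step case, while passing three rewards propagates a reward received three steps later into the target in a single update.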
During learning, an $\epsilon$-greedy policy is followed: a random action is selected with probability $\epsilon$ to promote exploration, and otherwise the best action according to the current network is selected greedily. In order to improve the effectiveness of the random exploration we make use of dynamic frame skipping. Frequently the same action must be repeated over several time steps. It was recently shown that allowing an agent to select actions over extended time periods improves the learning time of an agent [19]. For example, rather than having to discover through trial and error, over a series of learning steps, that eight time steps is the appropriate amount of time to wait for a car to pass, the agent need only discover that a "wait eight steps" action is appropriate. Dynamic frame skipping can be viewed as a simplified version of options [20], which have recently begun to be explored by the deep RL community [21, 22, 23].
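The exploration rule described above can be sketched in a few lines. This is a generic epsilon-greedy selector, not the authors' code; the `rng` argument is an assumption added so that a seeded `random.Random` instance can be passed in for reproducibility.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With extended "wait" actions in the action set, a single random draw can commit the agent to several time steps of waiting, which is what makes the random exploration cover longer behaviors.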
III-C Deep Neural Network Setup
The DQN uses a convolutional neural network with two convolution layers and one fully connected layer. The first convolutional layer has filters with stride two, the second convolution layer has … [24]. The final linear output layer has five outputs: a single go action, and a wait action at four time scales (1, 2, 4, and 8 time steps). The network is optimized using the RMSProp algorithm [25].
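To make the five-output action set concrete, the sketch below expands a sequence of network decisions into per-time-step primitive actions. The ordering of the outputs in `ACTIONS` is an assumption (the paper does not state which output index corresponds to which action), and `expand_decisions` is a hypothetical helper.

```python
# Assumed ordering of the five network outputs: (primitive action, duration).
ACTIONS = (("go", 1), ("wait", 1), ("wait", 2), ("wait", 4), ("wait", 8))


def expand_decisions(action_indices):
    """Expand network decisions into one primitive action per time step."""
    steps = []
    for index in action_indices:
        kind, duration = ACTIONS[index]
        steps.extend([kind] * duration)
        if kind == "go":
            # Once the agent goes, no further wait/go decisions are needed.
            break
    return steps
```

For instance, choosing "wait 4", then "wait 1", then "go" yields five wait steps followed by a go, using three decisions instead of six.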
Our experience replay buffers have an allotment of experiences. At each learning iteration we sample a batch of 60 experiences. Since the experience replay buffer imposes off-policy learning, we are able to calculate the return for each state-action pair in the trajectory prior to adding each step into the replay buffer. This allows us to train directly on the n-step return and forgo the added complexity of using target networks [26].
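One way to realize the idea of computing returns before insertion is sketched below: the discounted return of every step is computed once the trajectory is complete, and the buffer stores (state, action, return) triples. The class and method names are hypothetical, and the capacity is left as a parameter; only the batch size of 60 comes from the text.

```python
import random
from collections import deque


def discounted_returns(rewards, gamma):
    """Return the discounted return from each step to the end of the trajectory."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, return) triples."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest entries are dropped first

    def add_trajectory(self, states, actions, rewards, gamma):
        # Returns are computed up front, so no target network is needed later.
        for item in zip(states, actions, discounted_returns(rewards, gamma)):
            self._items.append(item)

    def sample(self, batch_size=60, rng=random):
        items = list(self._items)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._items)
```

Training then regresses the network output for each sampled state-action pair directly onto the stored return.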
The state space of the DQN is represented as a grid in global coordinates. The epsilon governing random exploration was . For the reward we used for successfully navigating the intersection, for a collision, and step cost.
IV Knowledge Transfer
We are interested in sharing knowledge between different driving tasks. By sharing knowledge across tasks we can reduce learning time and create more general and capable systems. Ideally, knowledge sharing can be extended to a system that continues to learn after it has been deployed [27] and can enable a system to accurately predict appropriate behavior in novel situations [28]. We examine the behavior of various knowledge sharing strategies in the autonomous driving domain.
IV-A Direct Copy
To demonstrate the extent of transfer and show the difference between tasks, we train a network on a single source task for 25,000 iterations. The unmodified network is then evaluated on every other task. We repeat this process, using each different task as a source task.
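The direct copy protocol amounts to filling a source-by-target table of scores. A minimal sketch follows, assuming hypothetical `train(task)` and `evaluate(network, task)` callables that wrap the actual DQN training and testing; the diagonal (source equals target) is included as the matching-task baseline.

```python
def direct_copy_matrix(tasks, train, evaluate):
    """Train one network per source task and score it, unmodified, on every task."""
    results = {}
    for source in tasks:
        network = train(source)  # trained on the source task only
        for target in tasks:
            results[(source, target)] = evaluate(network, target)
    return results
```

Each off-diagonal entry `(source, target)` is the direct copy score; comparing it with the diagonal entry `(target, target)` shows how much is lost by not training on the target task.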
IV-B Fine Tuning
Starting with a network trained for 10,000 iterations on a source task, we then fine tune it for an additional 25,000 iterations on a second, target task. We use 10,000 iterations because it demonstrates substantial learning while remaining suboptimal, which emphasizes the possible benefits gained from transfer. Fine tuning demonstrates the jumpstart and asymptotic performance as described by Taylor and Stone [8].
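The two transfer measures named above can be read off a pair of learning curves. The sketch below assumes each curve is a list of success rates at successive evaluation checkpoints on the target task, with index 0 taken before any target-task training; the `tail` window used to estimate asymptotic performance is an arbitrary assumption.

```python
def mean(values):
    return sum(values) / len(values)


def jumpstart(transfer_curve, scratch_curve):
    """Initial advantage of the pre-trained network over random initialization."""
    return transfer_curve[0] - scratch_curve[0]


def asymptotic_gain(transfer_curve, scratch_curve, tail=3):
    """Difference in final performance, averaged over the last `tail` checkpoints."""
    return mean(transfer_curve[-tail:]) - mean(scratch_curve[-tail:])
```

A positive jumpstart with near-zero asymptotic gain means transfer mainly speeds up early learning; a positive asymptotic gain means the final policy is also better.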
IV-C Reverse Transfer
After a network has been fine tuned on the target task, we evaluate the performance of that network on the source task. If training on a later task improves the performance on an earlier task, this is known as reverse transfer. Neural networks, however, often forget earlier tasks, a phenomenon called catastrophic forgetting [29, 30, 15].
In the case of forgetting, retention describes the amount of previous knowledge retained by the network after training on a new task. This value is difficult to define formally, since it must exclude any knowledge relevant to the source task that was obtained from training on the target task, and since retention might include aspects that are not quantifiable, such as weight configurations in the network. For example, a network might exhibit catastrophic forgetting but in fact have retained a weight configuration that greatly reduces the training time needed to relearn the source task. Because of the difficulty of defining retention, we define the empirical retention as the difference in source-task performance between the fine tuned network and a direct copy of a network trained only on the target task.
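Under one reading of the definition above, consistent with the comparison against direct copy made in the results and conclusion, empirical retention is a simple per-pair difference. The dictionary layout and function name below are assumptions made for illustration.

```python
def empirical_retention(fine_tuned_on_source, direct_copy_on_source):
    """Per task pair, source-task score of the fine tuned net minus direct copy.

    Both arguments map (source, target) to a success rate measured on the
    source task: the first for a network pre-trained on the source and fine
    tuned on the target, the second for a network trained only on the target.
    """
    return {
        pair: fine_tuned_on_source[pair] - direct_copy_on_source[pair]
        for pair in fine_tuned_on_source
    }
```

A positive value for a pair suggests that some source-task knowledge survived fine tuning, beyond what training on the target task alone provides.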
IV-D Lifelong Learning
Lifelong learning is the process of learning multiple tasks sequentially where the goal is to optimize the performance on every task [27, 31]. The combination of information from all previous tasks can be used to jumpstart learning a new task. In a reciprocal fashion, learning a new task can potentially refine existing knowledge for previous tasks. By having a single system that handles all tasks, the system is able to handle a broader set of problems and will likely generalize better to new problems.
We examine how a deep Q-network performs when learning a sequence of tasks. The order in which tasks are encountered does impact learning, and several groups have investigated the effects of ordering [32, 33]. For our experiments we use a task ordering that demonstrates forgetting and hold it fixed for all experiments.
We are interested in how each task's performance changes over time. We test at regular intervals, with testing run as a separate procedure that has no impact on the replay buffer or the learning process of the network.
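The sequential protocol with periodic testing can be outlined as follows. This is a skeleton under stated assumptions: `train_step(task)` and `evaluate(task)` are hypothetical callables, with `evaluate` expected to run test episodes without writing to the replay buffer or updating the network, and the step counts are parameters rather than values from the paper.

```python
def lifelong_training(task_order, train_step, evaluate, steps_per_task, eval_every):
    """Train on tasks in a fixed order, periodically testing on all of them."""
    history = []
    step = 0
    for current in task_order:
        for _ in range(steps_per_task):
            train_step(current)  # learning happens only on the current task
            step += 1
            if step % eval_every == 0:
                # Testing covers every task, including ones not yet trained on.
                scores = {task: evaluate(task) for task in task_order}
                history.append((step, current, scores))
    return history
```

Plotting the recorded scores against the step count, one line per task, shows both forward transfer to tasks not yet trained on and forgetting of tasks trained earlier.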
V Experimental Setup
Experiments were run using the Sumo simulator [34], which is an open source traffic simulation package. This package allows users to model road networks, road signs, traffic lights, a variety of vehicles (including public transportation), and pedestrians to simulate traffic conditions in different types of scenarios. Importantly for the purpose of testing and evaluating autonomous vehicle systems, Sumo provides tools that facilitate online interaction and vehicle control. For any traffic scenario, users can control a vehicle's position, velocity, acceleration, and steering direction, and can simulate motion using basic kinematics models. Traffic scenarios like multi-lane intersections can be set up by defining the road network (lanes and intersections) along with specifications that control traffic conditions. To simulate traffic, users have control over the types of vehicles, road paths, vehicle density, and departure times. Traffic cars follow the intelligent driver model (IDM) to control their motion. In Sumo, randomness is simulated by varying the speed distribution of the vehicles and by using parameters that control driver imperfection (based on the Krauss stochastic driving model [35]). The simulator advances in steps whose length is set by a predefined time interval.
We ran experiments using five different intersection scenarios: Right, Left, Left2, Forward and a Challenge. Each of these scenarios is depicted in Figure 2. The Right scenario involves making a right turn, the Forward scenario involves crossing the intersection, the Left scenario involves making a left turn, the Left2 scenario involves making a left turn across two lanes, and the Challenge scenario involves crossing a six lane intersection.
The Sumo traffic simulator is configured so that each lane has a maximum speed of 45 miles per hour (20 m/s). The car begins from a stopped position. Each time step is equal to 0.2 seconds. The maximum number of steps per trial is capped at 100, which is equivalent to 20 seconds. The traffic density is set by the probability that a vehicle will be emitted randomly per second. We use a depart probability of 0.2 for each lane for all tasks.
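The quantities in this configuration are related by simple arithmetic, spelled out below. The constant names are hypothetical; the values are those stated above. The expected number of departures treats the depart probability as an average rate per second over a full-length trial, which is an interpretive assumption.

```python
STEP_LENGTH_S = 0.2             # simulation time step
MAX_STEPS = 100                 # cap on steps per trial
DEPART_PROBABILITY_PER_S = 0.2  # per lane, per second
MAX_SPEED_MPS = 20.0            # lane speed limit
MPS_TO_MPH = 3600.0 / 1609.344  # metres per second to miles per hour

# Longest possible trial before a time-out is declared.
trial_timeout_s = MAX_STEPS * STEP_LENGTH_S

# Speed limit expressed in miles per hour (close to the quoted 45 mph).
max_speed_mph = MAX_SPEED_MPS * MPS_TO_MPH

# Average number of vehicles emitted per lane during a full-length trial.
expected_departures_per_lane = DEPART_PROBABILITY_PER_S * trial_timeout_s

# Farthest a traffic car can move between two consecutive decisions.
max_travel_per_step_m = MAX_SPEED_MPS * STEP_LENGTH_S
```

At this density each lane sees about four vehicles per full-length trial, and a car at the speed limit covers about four metres per time step.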
Navigating intersections involves multiple conflicting objectives. We evaluate four metrics in order to collect our statistics. The metrics are as follows:
Percentage of successes: the percentage of runs in which the car successfully reached the goal. This metric takes into account both collisions and time-outs.
Percentage of collisions: a measure of the safety of the method.
Average time: how long it takes a successful trial to run to completion.
Average braking time: the amount of time other cars in the simulator are braking. This can be seen as a measure of how disruptive the autonomous car is to traffic.
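The four metrics listed above can be computed from per-trial records as sketched below. The `Trial` record and its field names are hypothetical, as is the choice to average braking time over all trials; the paper does not describe its logging format.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    outcome: str           # "success", "collision", or "timeout"
    duration_s: float      # time until the trial ended
    braking_time_s: float  # total time other cars spent braking


def summarize(trials):
    """Compute success rate, collision rate, average time, and braking time."""
    total = len(trials)
    successes = [t for t in trials if t.outcome == "success"]
    collisions = [t for t in trials if t.outcome == "collision"]
    return {
        "success_pct": 100.0 * len(successes) / total,
        "collision_pct": 100.0 * len(collisions) / total,
        # Average time is taken over successful trials only.
        "avg_time_s": (
            sum(t.duration_s for t in successes) / len(successes)
            if successes else None
        ),
        "avg_braking_s": sum(t.braking_time_s for t in trials) / total,
    }
```

Because time-outs count as neither successes nor collisions, the success and collision percentages need not sum to 100.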
While there are multiple metrics, we focus on the percentage of successes, which is the metric used in all our plots. All state representations ignore occlusion, assuming all cars are always visible.
VI Results
Direct Copy: Table I shows the results when a network is trained on one task and applied to another. In no instance does a network trained on a different task do better than a network trained on the matching task, but several tasks achieve similar performance with transfer. In particular, the single lane tasks (Right, Left, and Forward) are related: networks trained on them are consistently the top performers on all single lane tasks. Additionally, the more challenging multi-lane settings (Left2 and Challenge) appear connected: the Left2 network does substantially better than any of the single lane networks on the Challenge task.
Fine Tuning: Figure 4 shows fine tuning results. In nearly all cases, pre-training on a different task gives a significant advantage in jumpstart [8], and in several cases there is an asymptotic benefit as well. When the fine tuned networks are re-applied to the source task, the performance looks similar to direct copy, as shown in Figure 5.
Reverse Transfer: While the performance on the source task drops after fine tuning, we see a trend of positive improvement compared to direct copy. This indicates that some information was retained by the network. Figure 3 shows the retention for each task pair, showing the percentage gain resulting from the initialization. The Left2 and Challenge tasks have less overlap with other tasks in the state space, so it is possible that more aspects of the initialization are left unchanged, which might explain why there is the largest amount of retention for these tasks. This hypothesis is supported by the fact that training on the Right task exhibits the most retention, since these two tasks have the least overlap.
Lifelong learning: The results for the lifelong learning experiment are shown in Figure 6. Every task initially benefits from learning on the first task (Forward), although performance in the Left2 and Challenge settings benefits less. In some cases we see that training on a different task helps up to a point, after which further training hurts other tasks. For example, after approximately 5000 trials of training on the Forward setting, the Right task performance starts to decrease.
Overall, we see an affinity between both the single lane tasks (Left, Right, and Forward) and the multi-lane tasks. When training on the Challenge task starts, Left2 benefits, but the single lane tasks exhibit catastrophic forgetting. Training on the Left task helps the other single lane tasks, but Challenge decreases in performance.
However, the results are not consistent across the group of single lane tasks. Training on the Right task has a much more detrimental effect on the multi-lane tasks than training on either Forward or Left. We suspect this is because right turns can ignore one of the lanes of traffic, a lane which matters to all other tasks. Overall, the negative effects of catastrophic forgetting negate many of the positive effects of transfer.
VII Conclusion
In this paper we view the AD vehicle as a learning agent in a reinforcement learning setting and analyze how the knowledge for handling one type of intersection, represented as a Deep Q-Network, translates to other types of intersections. We investigated and compared four different transfer methods between different intersections (tasks): direct copy, fine tuning, reverse transfer, and lifelong learning. Our results support several conclusions. First, success rates were consistently low when a network was trained on Task A but directly tested on Task B. Second, a network initialized from Task A and then fine tuned on Task B generally performed better than a randomly initialized network trained on Task B. Third, a network initialized from Task A, fine tuned on Task B, and tested back on Task A performed better than a network directly copied from Task B to Task A. Finally, we examined a lifelong learning scenario in which a single network was trained to handle all five intersection scenarios, and showed that the resulting network exhibited catastrophic forgetting of previous task knowledge.
As future work, we will conduct research on the concept of a long-term memory and investigate how to effectively preserve previous task knowledge for lifelong learning.
- [1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
- [2] M. Bouton, A. Cosgun, and M. J. Kochenderfer, “Belief state planning for navigating urban intersections,” IEEE Intelligent Vehicles Symposium (IV), 2017.
- [3] S. Brechtel, T. Gindele, and R. Dillmann, “Probabilistic decision-making under uncertainty for autonomous driving using continuous pomdps,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 392–399.
- [4] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driving decision-making in an uncontrolled intersection,” Mathematical Problems in Engineering, vol. 2016, 2016.
- [5] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
- [6] R. Caruana, “Multitask Learning,” Machine Learning, vol. 28, pp. 41–75, 1997.
- [7] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, 2010.
- [8] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1633–1685, 2009.
- [9] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN Features off-the-shelf: an Astounding Baseline for Recognition,” Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 512–519, Mar. 2014.
- [10] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” NIPS, vol. 27, 2014.
- [11] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
- [12] H. Yin and S. J. Pan, “Knowledge transfer for deep reinforcement learning with hierarchical experience replay,” in AAAI Conference on Artificial Intelligence (AAAI), 2017.
- [13] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber, “Compete to compute,” in Advances in neural information processing systems, 2013, pp. 2310–2318.
- [14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., “Overcoming catastrophic forgetting in neural networks,” arXiv preprint arXiv:1612.00796, 2016.
- [15] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical investigation of catastrophic forgetting in gradient-based neural networks,” arXiv preprint arXiv:1312.6211, 2013.
- [16] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
- [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
- [18] J. Peng and R. J. Williams, “Incremental multi-step q-learning,” Machine learning, vol. 22, no. 1-3, pp. 283–290, 1996.
- [19] A. Srinivas, S. Sharma, and B. Ravindran, “Dynamic action repetition for deep reinforcement learning,” AAAI Conference on Artificial Intelligence (AAAI), 2017.
- [20] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
- [21] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
- [22] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in minecraft,” arXiv preprint arXiv:1604.07255, 2016.
- [23] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.
- [24] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013.
- [25] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop, coursera: Neural networks for machine learning,” University of Toronto, Tech. Rep, 2012.
- [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [27] S. Thrun, “Is learning the n-th thing any easier than learning the first?” Advances in neural information processing systems, pp. 640–646, 1996.
- [28] D. Isele, M. Rostami, and E. Eaton, “Using task features for zero-shot knowledge transfer in lifelong learning,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2016.
- [29] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” Psychology of learning and motivation, vol. 24, pp. 109–165, 1989.
- [30] R. Ratcliff, “Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions,” Psychological review, vol. 97, no. 2, pp. 285–308, 1990.
- [31] P. Ruvolo and E. Eaton, “ELLA: An efficient lifelong learning algorithm,” Proceedings of the International Conference on Machine Learning, vol. 28, pp. 507–515, 2013.
- [32] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning, pp. 41–48, 2009.
- [33] P. Ruvolo and E. Eaton, “Active task selection for lifelong machine learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2013.
- [34] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of SUMO–simulation of urban mobility,” International Journal on Advances in Systems and Measurements (IARIA), vol. 5, no. 3–4, 2012.
- [35] S. Krauss, “Microscopic modeling of traffic flow: Investigation of collision free vehicle dynamics,” Ph.D. dissertation, Deutsches Zentrum fuer Luft-und Raumfahrt, 1998.