Transferring Autonomous Driving Knowledge on Simulated and Real Intersections

11/30/2017 ∙ by David Isele, et al.

We view intersection handling on autonomous vehicles as a reinforcement learning problem, and study its behavior in a transfer learning setting. We show that a network trained on one type of intersection generally is not able to generalize to other intersections. However, a network that is pre-trained on one intersection and fine-tuned on another performs better on the new task compared to training in isolation. This network also retains knowledge of the prior task, even though some forgetting occurs. Finally, we show that the benefits of fine-tuning hold when transferring simulated intersection handling knowledge to a real autonomous vehicle.




1 Introduction

Autonomous Driving (AD) has the potential to reduce accidents caused by driver fatigue and distraction and will enable more active lifestyles for the elderly and disabled. While AD technology has made important strides over the last couple of years, current technology is still not ready for large scale roll-out. Urban environments are particularly difficult for AD, due to the unpredictable nature of pedestrians and vehicles in city traffic.

Rule-based methods provide a predictable way to handle intersections. However, rule-based intersection handling approaches do not scale well due to the difficulty of designing hand-crafted rules that remain valid as the diversity and complexity of possible scenes increase. Recently it has been shown that deep reinforcement learning can improve over rule-based techniques (Isele et al., 2016). However, it is unclear how well these techniques are able to generalize to different scenarios and real systems.

Figure 4: (a) Direct Copy, (b) Fine Tuning, (c) Reverse Transfer. We analyze knowledge transfer between different types of intersections. The knowledge to handle an intersection is represented as a Deep Q-Network (DQN). We investigate (a) directly copying a network to a new intersection, (b) fine-tuning a previously trained network on a new intersection, and (c) whether fine-tuning destroys old intersection knowledge in reverse transfer.

We explore the ability of a reinforcement learning agent to generalize, focusing specifically on how the knowledge for one type of intersection, represented as a Deep Q-Network (DQN), translates to other types of intersections (tasks). First, we look at direct copy: how well a network trained for Task A performs on Task B. Second, we analyze how a network initialized on Task A and fine-tuned on Task B compares to a randomly initialized network trained exclusively on Task B. Third, we investigate reverse transfer: whether a network pre-trained for Task A and fine-tuned on Task B preserves knowledge for Task A. Finally, we present early results of using a network trained in simulation to initialize learning on real data.

Figure 10: Visualizations of different intersection scenarios: (a) Right, (b) Left, (c) Left2, (d) Forward, (e) Challenge.

2 Related Work

Researchers have recently been investigating using machine learning techniques to control autonomous vehicles (Cosgun et al., 2017). Imitation learning strategies have investigated learning from a human driver (Bojarski et al., 2016). Markov Decision Processes (MDP) have been used offline to address the problem of intersection handling (Brechtel et al., 2014; Song et al., 2016). And online planners based on partially observable Monte Carlo Planning (POMCP) have been applied to intersection problems when an accurate generative model is available (Bouton et al., 2017). Additionally, machine learning techniques have been used to optimize comfort in a space where solutions are constrained to safe trajectories (Shalev-Shwartz et al., 2016).

Large amounts of data often improve the performance of machine learning techniques. In the absence of huge datasets, training on multiple related tasks can give similar performance gains (Caruana, 1997). A large breadth of research has investigated transferring knowledge from one system to another in machine learning in general (Pan & Yang, 2010), and reinforcement learning specifically (Taylor & Stone, 2009).

Large training times and high sample complexity make transfer methods particularly appealing in deep networks (Razavian et al., 2014; Yosinski et al., 2014). Recent work in deep reinforcement learning has looked at combining networks from different tasks to share information (Rusu et al., 2016; Yin & Pan, 2017). Researchers have looked at using options (Sutton & Barto, 1998) in Deep RL to expand an agent’s capabilities (Jaderberg et al., 2016; Tessler et al., 2016; Kulkarni et al., 2016), and efforts have been made to enable a unified framework for learning multiple tasks through changes in architecture design (Srivastava et al., 2013) and modified objective functions (Kirkpatrick et al., 2016) to address the problem of catastrophic forgetting (Goodfellow et al., 2013).

In our scenario, there is the added difficulty of transferring to the real vehicle. It is a well-known problem that policies trained in simulation rarely work on real robots (Barrett et al., 2010). Recent work has investigated grounding imperfect simulators to a robot’s behavior (Hanna & Stone, 2017), and there is evidence that transferring from simulation to real robots can be addressed by targeting the system’s ability to generalize (Tobin et al., 2017). Given the variety of intersections, the importance of learning a general model, and the difficulty of training on the real vehicle, we examine the prospects of multi-task learning for the problem of intersection handling.

3 Intersection Handling using Deep Networks

Each intersection handling task is viewed as a reinforcement learning problem, and we use a Deep Q-Network (DQN) to learn the state-action value Q-function. We assume the vehicle is at the intersection, the path is known, and the network is tasked with choosing between two actions: wait or go, for every time step. Once the agent decides to go, it continues until it either collides or successfully navigates the intersection. Previous work has shown that deciding the wait time generally outperforms approaches that learn an entire acceleration profile (Isele et al., 2017).

3.1 Reinforcement Learning

The reinforcement learning framework considers an agent in state $s_t$ taking an action $a_t$ according to the policy $\pi$. After taking an action, the agent transitions to the state $s_{t+1}$, and receives a reward $r_t$. This collection is defined as an experience $e_t = (s_t, a_t, r_t, s_{t+1})$. Learning is formulated as a Markov decision process (MDP) and follows the Markov assumption that the probability of transitioning to a new state given the current state and action is independent of all previous states and actions:

$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, \ldots, s_0, a_0)$

The objective at time step $t$ is to maximize the future discounted return $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $\gamma \in [0, 1]$ is the discount factor and $T$ is the final time step. We optimize this objective using Q-learning (Watkins & Dayan, 1992).

3.2 Q-Learning

In Q-learning an optimal action-value function $Q^*(s, a)$ is defined as the maximum expected return that is achievable following any policy given a state $s$ and action $a$, $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$.

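Deep Q-learning (Mnih et al., 2013) approximates the optimal value function with a neural network $Q^*(s, a) \approx Q(s, a; \theta)$. The parameters $\theta$ are learned by using the Bellman equation as an iterative update and minimizing the error between the expected return and the state-action value predicted by the network. This gives the loss for an individual experience in a deep Q-network (DQN)

$L(s, a; \theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2$

where $s'$ is the state reached after taking action $a$ in state $s$ and $r$ is the reward received.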
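The per-experience loss can be illustrated with a small sketch. It assumes the standard one-step form of the DQN loss (squared difference between a bootstrapped Bellman target and the predicted value). The nested list `q_table` standing in for the network, the function name `dqn_loss`, the discount value, and the toy numbers are all illustrative assumptions, not values from the paper.

```python
def dqn_loss(q_table, state, action, reward, next_state, terminal, gamma):
    """Squared one-step Bellman error for a single experience.

    q_table[s][a] is a stand-in for the network output Q(s, a; theta).
    """
    # Target: r if the episode ended, otherwise r + gamma * max_a' Q(s', a').
    bootstrap = 0.0 if terminal else gamma * max(q_table[next_state])
    target = reward + bootstrap
    # Difference between the target and the currently predicted value.
    td_error = target - q_table[state][action]
    return td_error ** 2


# Toy table with two states and two actions (index 0 = wait, 1 = go).
q_table = [[0.5, 0.2],
           [0.1, 0.9]]

# Non-terminal wait step: target = -0.1 + 0.5 * 0.9 = 0.35, error = -0.15.
wait_loss = dqn_loss(q_table, state=0, action=0, reward=-0.1,
                     next_state=1, terminal=False, gamma=0.5)

# Terminal go step: target is just the reward, error = 1.0 - 0.9 = 0.1.
go_loss = dqn_loss(q_table, state=1, action=1, reward=1.0,
                   next_state=1, terminal=True, gamma=0.5)
```

In a real DQN the table lookup would be a forward pass through the network, and the loss would be averaged over a batch and minimized by gradient descent on $\theta$.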

Figure 16: Fine-tuning comparison for the (a) Right, (b) Left, (c) Left2, (d) Forward, and (e) Challenge tasks. A network for one task is initialized with the network of a different task. The colored lines indicate the initialization network. The black line indicates the performance of a network trained with a random initialization. Initializing a network with a network trained on another task is almost always advantageous. We notice a jumpstart benefit in every tested example, and observe several asymptotic improvements.

4 Knowledge Transfer

We investigate the benefits of policy re-use for sharing knowledge between different driving tasks. By sharing knowledge from different tasks we can reduce learning time and create more general and capable systems. Ideally knowledge sharing can be extended to involve a system that continues to learn after it has been deployed (Thrun, 1996) and can enable a system to accurately predict appropriate behavior in novel situations (Isele et al., 2016). We examine the behavior of various knowledge sharing strategies in the autonomous driving domain.

4.1 Direct Copy

Directly copying a policy indicates the differences between tasks. To demonstrate how well a network trained on one task fits another, we train a network on a single source task for 25,000 iterations. The unmodified network is then evaluated on every other task. We repeat this process, using each different task as a source task.
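The direct copy protocol amounts to filling in a source-by-target table of success rates. The sketch below shows one way to organize it. The `train` and `evaluate` callables are hypothetical placeholders for the actual training and evaluation routines, and the toy stand-ins and their success rates are invented purely so the example runs.

```python
from statistics import mean

TASKS = ["Right", "Left", "Left2", "Forward", "Challenge"]


def direct_copy_matrix(train, evaluate, tasks=TASKS, iterations=25_000):
    """Train on each source task, then evaluate the unmodified network everywhere.

    The diagonal (source == target) is kept as the single-task baseline.
    """
    results = {}
    for source in tasks:
        network = train(source, iterations)
        for target in tasks:
            results[(source, target)] = evaluate(network, target)
    return results


def average_transfer(results, target):
    """Mean success on `target` over networks trained on a different task."""
    return mean(rate for (source, tgt), rate in results.items()
                if tgt == target and source != target)


# Toy stand-ins (hypothetical): a "network" only remembers its source task.
def toy_train(task, iterations):
    return {"trained_on": task, "iterations": iterations}


def toy_evaluate(network, task):
    return 0.9 if network["trained_on"] == task else 0.5


results = direct_copy_matrix(toy_train, toy_evaluate)
avg_on_right = average_transfer(results, "Right")
```

Averaging the off-diagonal entries per target task gives the kind of per-condition summary that a bar chart of direct copy performance would report.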

4.2 Fine-Tuning

Fine-tuning allows a network to adapt from the source to the target task. Starting with a network trained for 10,000 iterations on a source task, we fine-tune it for an additional 25,000 iterations on a second target task. We pre-train for 10,000 iterations because this is enough to demonstrate substantial learning while leaving the network suboptimal, which emphasizes the possible benefits gained from transfer. Fine-tuning demonstrates the jumpstart and asymptotic performance as described by Taylor and Stone (2009).
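Jumpstart and asymptotic performance can both be read off a pair of learning curves. The following sketch assumes each curve is a list of success rates measured at the same evaluation points on the target task. The function names, the averaging window, and the curve values are illustrative assumptions rather than anything reported in the paper.

```python
def jumpstart(transfer_curve, baseline_curve):
    """Performance gap at the very start of training on the target task."""
    return transfer_curve[0] - baseline_curve[0]


def asymptotic_gain(transfer_curve, baseline_curve, window=3):
    """Gap between the mean of the last `window` evaluations of each curve."""
    def tail_mean(curve):
        tail = curve[-window:]
        return sum(tail) / len(tail)
    return tail_mean(transfer_curve) - tail_mean(baseline_curve)


# Made-up success rates at successive evaluation points.
baseline = [0.10, 0.40, 0.70, 0.80, 0.80, 0.80]   # random initialization
transfer = [0.50, 0.70, 0.80, 0.85, 0.85, 0.85]   # initialized from a source task

js = jumpstart(transfer, baseline)          # initial advantage from transfer
gain = asymptotic_gain(transfer, baseline)  # final advantage from transfer
```

A positive jumpstart with a near-zero asymptotic gain would indicate that transfer mainly speeds up early learning, while a positive value for both indicates a better final policy as well.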

4.3 Reverse Transfer

After a network has been fine-tuned on the target task, we evaluate the performance of that network on the source task. If training on a later task improves the performance of an earlier task this is known as reverse transfer. It is known that neural networks often forget earlier tasks in what has been termed catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990; Goodfellow et al., 2013). Since we are interested in a system learning a large variety of intersections, we wish to understand how much knowledge of a previous task is preserved.
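The reverse transfer measurement can be sketched as a three-stage procedure: train on the source, fine-tune on the target, then re-evaluate on the source. The `train`, `fine_tune`, and `evaluate` callables are hypothetical placeholders, the default iteration counts are borrowed from the fine-tuning setup as an assumption, and the toy stand-ins (including the halving of old skills) are invented only to make the example runnable.

```python
def reverse_transfer(train, fine_tune, evaluate, source, target,
                     pre_iters=10_000, tune_iters=25_000):
    """Measure how source-task performance changes after fine-tuning on target."""
    network = train(source, pre_iters)
    before = evaluate(network, source)
    network = fine_tune(network, target, tune_iters)
    after = evaluate(network, source)
    # change > 0 would be positive reverse transfer; change < 0 is forgetting.
    return {"before": before, "after": after, "change": after - before}


# Toy stand-ins (hypothetical): a "network" is a dict of per-task skill levels.
def toy_train(task, iterations):
    return {task: 0.8}


def toy_fine_tune(network, task, iterations):
    tuned = {name: skill * 0.5 for name, skill in network.items()}  # toy forgetting
    tuned[task] = 0.9
    return tuned


def toy_evaluate(network, task):
    return network.get(task, 0.0)


summary = reverse_transfer(toy_train, toy_fine_tune, toy_evaluate, "Left", "Right")
```

Comparing `after` against the direct copy result for the same task pair shows how much of the source knowledge survived fine-tuning.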

5 Experimental Setup

Experiments were run using the Sumo simulator (Krajzewicz et al., 2012), which is an open source traffic simulation package. Traffic scenarios like multi-lane intersections can be set up by defining the road network (lanes and intersections) along with specifications that control traffic conditions. To simulate traffic, users have control over the types of vehicles, road paths, vehicle density, and departure times. Traffic cars follow the intelligent driver model to control their motion. In Sumo, randomness is simulated by varying the speed distribution of the vehicles and by using parameters that control driver imperfection (based on the Krauss stochastic driving model (Krauss, 1998)). The simulator runs based on a predefined time interval which controls the length of every step. We ran experiments using five different intersection scenarios: Right, Left, Left2, Forward, and Challenge. Each of these scenarios is depicted in Figure 10.

The Sumo traffic simulator is configured so that each lane has a maximum speed of 45 miles per hour (20 m/s). The car begins from a stopped position. Each time step is equal to 0.2 seconds. The maximum number of steps per trial is capped at 100 steps (20 seconds). The traffic density is set by the probability that a vehicle will be emitted randomly per second. We use a depart probability of 0.2 for each lane for all tasks.
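The quantities in this configuration are related by simple arithmetic, which the sketch below checks. The variable names are illustrative, and the expected-vehicle estimate assumes one independent emission chance per lane per second, which is a simplification of how the simulator inserts vehicles.

```python
# Unit conversion: one mile is exactly 1609.344 m, one hour is 3600 s.
MPH_TO_MPS = 1609.344 / 3600.0

max_speed_mps = 45 * MPH_TO_MPS                # roughly 20 m/s, as quoted
step_seconds = 0.2
max_steps = 100
max_trial_seconds = step_seconds * max_steps   # 100 steps * 0.2 s = 20 s

# Farthest a traffic car at the speed limit can move in one simulation step.
max_metres_per_step = max_speed_mps * step_seconds

# Assumption: one independent emission chance per lane per second.
depart_prob_per_second = 0.2
expected_cars_per_lane = depart_prob_per_second * max_trial_seconds
```

Under these assumptions a full-length trial sees about four vehicles emitted per lane, and a car at the speed limit covers roughly four metres per step.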

While navigating intersections involves multiple conflicting metrics (including time to cross, number of collisions, and disruption to traffic), we focus on the percentage of trials the vehicle successfully navigates the intersection. All simulated state representations ignore occlusion, assuming all cars are always visible.

5.1 Deep Neural Network Setup

Our DQN uses a convolutional neural network with two convolution layers, and one fully connected layer. The first convolutional layer has filters with stride two, the second convolution layer has filters with stride two. The fully connected layer has 100 nodes. All layers use leaky ReLU activation functions (Maas et al., 2013). The final linear output layer has five outputs: a single go action, and a wait action at four time scales (1, 2, 4, and 8 time steps) inspired by dynamic frame skipping techniques (Srinivas et al., 2017). The network is optimized using the RMSProp algorithm (Tieleman & Hinton, 2012).
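The five-output action space can be made concrete with a small decoder that maps the network's outputs to a go action or a wait of a given duration. The ordering of the outputs, the names `ACTIONS`, `decode_action`, and `wait_seconds`, and the example Q-values are assumptions for illustration. Only the set of actions and the 0.2 second step length come from the text.

```python
# Assumed output ordering: index 0 is go, indices 1-4 are waits of 1, 2, 4, 8 steps.
ACTIONS = [("go", 0), ("wait", 1), ("wait", 2), ("wait", 4), ("wait", 8)]


def decode_action(q_outputs):
    """Return the (kind, steps) pair with the highest predicted value."""
    if len(q_outputs) != len(ACTIONS):
        raise ValueError("expected one Q-value per action")
    best = max(range(len(q_outputs)), key=lambda i: q_outputs[i])
    return ACTIONS[best]


def wait_seconds(action, step_seconds=0.2):
    """Wall-clock duration of a wait action; a go action has no wait."""
    kind, steps = action
    return steps * step_seconds if kind == "wait" else 0.0


# Made-up network outputs: the largest value belongs to the 4-step wait.
chosen = decode_action([0.1, 0.3, 0.2, 0.7, 0.4])
duration = wait_seconds(chosen)
```

Offering waits at several time scales lets the agent commit to a longer pause in a single decision instead of re-deciding at every step.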

At each learning iteration we sample a batch of 60 experiences. Since the use of an experience replay buffer imposes a delay between an experience occurring and being trained on, we are able to calculate the return for each state-action pair in the trajectory prior to adding each step into the replay buffer. This allows us to train directly on the n-step return (Peng & Williams, 1996) and forgo the added complexity of using target networks (Mnih et al., 2015).
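One way to read this procedure is that each finished trajectory is annotated with its discounted return before it enters the replay buffer. The sketch below takes that reading and uses the full return to the end of the episode. The buffer capacity, discount, rewards, and state labels are hypothetical, and only the batch size of 60 comes from the text.

```python
import random
from collections import deque


def returns_to_go(rewards, gamma):
    """Discounted return from each step to the end of the trajectory."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def add_trajectory(buffer, states, actions, rewards, gamma):
    """Store (state, action, return) triples once the episode has finished."""
    for state, action, ret in zip(states, actions, returns_to_go(rewards, gamma)):
        buffer.append((state, action, ret))


buffer = deque(maxlen=10_000)   # hypothetical capacity
add_trajectory(buffer,
               states=["s0", "s1", "s2"],
               actions=["wait", "wait", "go"],
               rewards=[-0.1, -0.1, 1.0],   # hypothetical step cost and success reward
               gamma=0.5)                   # hypothetical discount

# Sample a training batch of up to 60 stored experiences.
batch = random.sample(list(buffer), k=min(60, len(buffer)))
```

Because the stored return already serves as the regression target, no bootstrapped estimate from a separate target network is needed at training time.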

The state space of the DQN is represented as a grid in local coordinates that denote the speed and direction of cars within the grid. The epsilon governing random exploration is . The reward is for successfully navigating the intersection, for a collision, and step cost.

5.2 Real data

We conducted a preliminary study in order to evaluate whether knowledge obtained by simulated intersections could be useful for real ones. We collected data from an autonomous vehicle in Mountain View, California, at an unsigned T-junction, similar to the Left scenario. A point cloud, obtained by a combination of six IBEO Lidar sensors, is first pre-processed to remove points that reside outside the road boundaries. A clustering method with hand-tuned geometric thresholds is used for vehicle detection. Each vehicle is tracked by a separate particle filter. During data collection, a human observer in the vehicle labeled at times whether making a left turn would be safe or not. Given a random starting point in the recording, the system is able to select wait actions that move it ahead in the recording, and a go action which results in either a collision or a success based on the human provided labels. Because data is collected at a higher sampling rate than the simulation step frequency (and therefore the behavior frequency), states are sampled from within the simulation step window, allowing a recording to be inflated into a large number of experiments. Note that this process only gives real sensor readings, and that the system is not able to observe how its behavior affects other drivers. Because the same few recorded scenarios are replayed, we expect training on recorded data in this way will overfit to the recording.
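The replay mechanism described above can be sketched as a tiny environment over a labeled recording. The class name `RecordedIntersection`, the per-step boolean labels, and the timeout when a wait runs past the end of the recording are all assumptions made for illustration. The text only specifies that wait actions advance through the recording and that a go action succeeds or collides according to the human label.

```python
import random


class RecordedIntersection:
    """Replays a recording in which each step is labeled safe or unsafe to go."""

    def __init__(self, safe_labels, start):
        self.safe_labels = safe_labels  # True where the observer judged a turn safe
        self.t = start                  # current position in the recording

    def step(self, action, wait_steps=1):
        """Apply an action and return (outcome, done)."""
        if action == "go":
            outcome = "success" if self.safe_labels[self.t] else "collision"
            return outcome, True
        # A wait action moves the agent ahead in the recording.
        self.t += wait_steps
        if self.t >= len(self.safe_labels):
            return "timeout", True      # assumed handling of running out of data
        return "waiting", False


labels = [False, False, True, True, False]   # made-up human labels

# Episodes begin at a random point in the recording.
start = random.randrange(len(labels))
episode = RecordedIntersection(labels, start)

# Fixed starts are used below so the outcomes are deterministic.
patient = RecordedIntersection(labels, start=0)
first = patient.step("wait", wait_steps=2)   # moves to index 2
second = patient.step("go")                  # index 2 is labeled safe

hasty = RecordedIntersection(labels, start=0)
third = hasty.step("go")                     # index 0 is labeled unsafe
```

Since the recording is fixed, other vehicles never react to the agent's choice, which matches the limitation noted in the text.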

Figure 17: Direct Copy and Reverse Transfer. The x axis denotes the test condition. Black bars show the performance of single task learning. Light gray bars show the average performance of a network trained on one task and tested on another. The drop in performance demonstrates the difference between tasks. The dark gray indicates the average performance of reverse transfer: a network is trained on Task A, fine-tuned on Task B, and then evaluated on Task A. The drop in performance indicates catastrophic forgetting, but networks exhibit some retention of the initial task.

We train a network on approximately one minute of recorded data during which time approximately 20 cars drive past. We then test the network on a separate recording made at the same intersection. These are preliminary results, as we are currently in the process of collecting a much larger dataset. We compare the results against a network that has been pre-trained on simulation data and then fine-tuned on the real data.

6 Results

Direct Copy: Figure 17 shows the average performance of training on one task and applying it to another in light gray. While we only plot the average performance, the quality of transfer is dependent on the particular source and target task. In no instance does a network trained on a different task surpass the performance of a network trained on the matching task, but several tasks achieve similar performance with transfer. In particular, we see that each network trained on a single-lane task (Right, Left, and Forward) is consistently a top performer on the other single-lane tasks. Additionally, the more challenging multi-lane settings (Left2 and Challenge) appear related. The Left2 network does substantially better than any of the single-lane networks on the Challenge task.

Fine-Tuning: Figure 16 shows fine-tuning results. We see that in nearly all cases, pre-training with a different network gives a significant advantage in jumpstart (Taylor & Stone, 2009) and in several cases there is an asymptotic benefit as well. When the fine-tuned networks are re-applied to the source task the performance looks similar to direct copy, as shown in Figure 17.

Reverse Transfer: The performance on the source task dropped after fine-tuning on the target, but performance improved compared to direct copy. This indicates that some information was retained by the network. Note that the Left2 and Challenge tasks have less overlap with other tasks in the state space. It is possible that non-overlapping regions can be left unchanged by fine-tuning. This might explain why we see the most retention on the least related tasks.

Figure 18: Transfer from Simulation to Real. The yellow lines indicate the training and test performance of a network trained on real data collected from an autonomous vehicle. The blue lines indicate the performance of a network that is first trained on simulated data and then fine-tuned on a real vehicle. We see that fine-tuning speeds up training time and improves generalization.

Real Data: Figure 18 shows the performance of training on real data. The blue lines indicate a network that has been pre-trained on simulated data, the yellow lines indicate the performance of a network that has only been trained on the real data. Lighter lines indicate performance on the training data, and darker lines indicate performance on the test data. The network pre-trained in simulation rises above success on the training data in approximately half the iterations required by the network trained only on real data to reach the same performance. The test results follow the learning curve of the training data, but the performance asymptotes at success. These results show that fine-tuning can reduce the training time of the network, however in both networks the lower performance on the test data suggests over-fitting.

It is interesting to note that training on the real task required more iterations. This may be due to imperfections in the labeling process or greater noise and variation in the state space. For completeness, we looked at transfer from real data to simulation. We observed that using a network initialized on a real left turn speeds up training a left turn in simulation. Directly applying the network fine-tuned on real data to a left turn in simulation resulted in a model that consistently timed out. We suspect this is due to the network over-fitting the small amount of data, and we will investigate this further when we have collected more data.

7 Conclusion

We view autonomous driving as a reinforcement learning problem, and analyze how the knowledge for handling one type of intersection, represented as a Deep Q-Network, translates to other types of intersections. We investigated different properties of transfer between intersections, namely the performance of direct copy, fine-tuning, and reverse transfer and showed how transfer extends from simulated to real intersections.

Our results identify autonomous intersection handling as a domain that benefits from transfer. First, we found the success rates were consistently low when a network is trained on Task A but directly tested on Task B. Second, a network that is initialized with the network of Task A and then fine-tuned on Task B generally performed better than a randomly initialized network. Third, a network that is initialized with Task A, fine-tuned on Task B, and tested on Task A performed better than a network directly copied from Task B to Task A, but worse than a network trained recently on Task A. Fourth, we show that a real intersection can be treated as a separate task and transfer from simulation can be used to improve learning. Moving forward, we are interested in how transfer can be used to improve the training and robustness of a real system.