LaTeS: Latent Space Distillation for Teacher-Student Driving Policy Learning

12/06/2019 ∙ by Albert Zhao, et al. ∙ Beijing Kuaishou Technology Co., Ltd.

We describe a policy learning approach to map visual inputs to driving controls that leverages side information on semantics and affordances of objects in the scene from a secondary teacher model. While the teacher receives semantic segmentation and stop "intention" values as inputs and produces an estimate of the driving controls, the primary student model only receives images as inputs, and attempts to imitate the controls while being biased towards the latent representation of the teacher model. The latent representation encodes task-relevant information in the inputs of the teacher model, which are semantic segmentation of the image, and intention values for driving controls in the presence of objects in the scene such as vehicles, pedestrians and traffic lights. Our student model does not attempt to infer semantic segmentation or intention values from its inputs, nor to mimic the output behavior of the teacher. It instead attempts to capture the representation of the teacher inputs that are relevant for driving. Our training does not require laborious annotations such as maps or objects in three dimensions; even the teacher model just requires two-dimensional segmentation and intention values. Moreover, our model runs in real time at 59 FPS. We test our approach on recent simulated and real-world driving datasets, and introduce a more challenging but realistic evaluation protocol that considers a run that reaches the destination successful only if it does not violate common traffic rules.


1 Introduction

Figure 1: Sample frames from CARLA [11], a photorealistic driving simulator, along with our agent’s output controls. The green and blue trajectories represent the ground-truth and estimated steering angles. Top: the agent is performing a left turn with a moving car in front of it. It correctly turns left with a low throttle value. Bottom: the agent is on a straight road with a pedestrian crossing the road. It correctly outputs a high brake value to stop for the pedestrian.

Driving is a complex endeavor that requires planning and control given sensory inputs and possibly other side information, considering nearby agents and the effect of our behaviors on their actions. Due to the complexity of the full driving problem, we ignore multi-agent dynamics and focus on learned reactive driving, where we train a model that can then map images and speed data directly to low-level controls, namely throttle, steering angle, and brakes, as shown in Fig. 1.

In addition to sensory inputs and corresponding controls to imitate, supervision can include other side information like “affordances” of objects such as pedestrians, traffic lights or stalled cars, the most important being that one should stop for them (stop intention). The driving task also informs what variability in the data is irrelevant such as photometric variability due to illumination or texture and material properties of objects such as the color of cars or buildings. Ideally, we want our driving policy to be invariant to such nuisance variability.

This could be achieved simply by pre-processing the data; for instance, the curvature of the level sets is a maximal invariant to monotonic continuous transformations of the image range [31]. However, real illumination variability is far more complex, including inter-reflections, translucency, transparency, cast shadows, etc. Alternatively, one could learn away illumination variability directly in the driving model, but that would require observing images at all times of the day, all days of the year, and all weather conditions, causing the sample complexity of the training set to explode.

Towards this end, we propose to separate the management of illumination variability into two stages: first, a teacher network is trained, via its latent representation, to encode side information, which is chosen to be illumination-invariant (semantic segmentation) and informed by traffic rules and object affordances (stop intention values). The training of the teacher is constructed so that the resulting representation is a sufficient invariant for the task, meaning that it retains all and only the information in the semantic segmentation and stop intention values that is relevant for driving. Subsequently, our main student model is trained via a latent embedding loss to learn the sufficient invariant representation of the teacher, improving invariance to nuisances without throwing away information relevant to the driving task. Unlike previous distillation approaches, our student model has different inputs from the teacher model (images, rather than semantic segmentation and intentions), and is trained with the ground-truth driving controls, rather than with the outputs from the teacher model. Our student model is trained not to mimic the teacher's behavior, but to distill its (sufficient invariant) latent representation. This is important, since the student has access to the images, which may contain information relevant for driving that is not present in the semantic segmentation and therefore not available to the teacher.

We further show that our method is effective at exploiting side information from the teacher while not discarding information in the images. We test the resulting agent on simulated benchmarks, where the inputs are just images, speed measurements, and high-level planning indicators (namely turn signals), as well as on real-world datasets. We show that the method performs at the state of the art as evaluated by the most commonly adopted protocols [11, 9], which however we find lacking. We do not think one should be rewarded for reaching the destination while driving through red lights and mowing down pedestrians. Therefore we introduce a more realistic protocol, Traffic-school, that rewards successful routes only if they do not violate basic traffic rules. To summarize, the main contributions of this work are as follows:

  • a novel teacher-student method, latent space distillation, to encourage the student to learn the teacher’s sufficient, invariant representation for driving

  • a novel design for the teacher network, which learns a sufficient, invariant representation for driving, using (a) semantic segmentation, which provides the teacher with object class knowledge for basic tasks such as lane following and (b) stop intention values, which provide causal knowledge relating braking to dangerous driving situations

  • a new evaluation protocol, Traffic-school, that fixes the flaws of older benchmarks and hence is more realistic. In particular, it penalizes traffic infractions that were previously ignored, such as driving onto the sidewalk.

2 Related Work

Autonomous driving has drawn significant attention for several decades [22, 17, 30]. Approaches to the driving task can generally be categorized into three classes: modular pipelines, direct perception, and end-to-end learning. Out of these three categories, end-to-end learning is most related to our work.

Modular pipelines form the most popular approach [33, 21, 12], separating driving into two components: intermediate perception modules [13, 23, 14] and control modules that estimate low-level controls. For earlier literature on modular approaches, we refer the reader to [10]. More recently, [20, 3] use semantic segmentation and environment maps as intermediate representations in a recurrent network. This approach requires laborious mapping with continuous updates, but has the advantage of outputting interpretable representations, which allow for the diagnosis of failure cases. Some methods where the intermediate representation is more directly related to the driving task are referred to as direct perception; [6] trains a network to estimate affordances (e.g., distance to vehicle, center-of-lane, orientation, etc.) directly linked to low-level controls [34]. [28] proposes Conditional Affordance Learning (CAL) to apply a similar approach to urban driving [11]. With these approaches, there is still a considerable amount of hand-crafting, and the link to the driving task is unprincipled, with no guarantees that the intermediate representation has the optimal separating properties [1].

Our paper falls in the category of direct image-to-control (i.e., end-to-end) methods [17, 30, 4], which are currently not viable as an overall solution to autonomous driving. However, these methods are nevertheless worth exploring for reactive driving in an academic setting to understand the full extent of the limitations and potential of the approach, and to push the envelope of bootstrapping [5]. [8] introduces Conditional Imitation Learning (CIL), an offline behavioral cloning method, to solve the ambiguity problem at traffic intersections by conditioning the model on high-level turning commands. [9], the current state-of-the-art in this arena, analyzes issues within the CIL approach and proposes CILRS. Various methods collect data online using reinforcement learning [29, 36, 19] or the DAgger [26] imitation learning algorithm [37, 7] to reduce the covariate shift between training and online evaluation.

To improve generalization and acquire perception knowledge, [35] adds a semantic segmentation loss while training a model for steering angle estimation. [18], which we denote as MT (multi-task), further adds depth estimation as a side task. [16] takes a different approach, which distills [2, 15, 25] the representations of pretrained networks for semantic segmentation and optical flow estimation, essentially changing the side task to mimicking features. [7] also applies distillation to driving to take advantage of ground-truth maps only available at training time. However, these approaches suffer from various issues. [8] and [9] fail to generalize to test conditions with dense traffic. Online training and the associated data collection [29, 36, 19, 37, 7] is expensive and unsafe and can be performed only in photorealistic driving simulators such as CARLA [11]. Traditional multi-task approaches suffer from the side task and main driving task not being completely correlated, leading to suboptimal representations for driving. The approach of [7] is expensive, both in terms of 3D ground-truth map annotations and online training.

Unlike modular approaches that infer semantic segmentation and intention values from the input image and use those to inform the control, we do not discard task-relevant information in the images that is not present in the semantic segmentation mask and inferred intentions. Unlike multi-task learning that attempts to infer the semantic segmentation and intention values while imitating the controls, we do not learn task-irrelevant information in the semantic segmentation mask, relying on the teacher not to encode it in its latent representation. Our student only attempts to mimic the latent space of the teacher, which ideally has converged to a sufficient representation for driving that is invariant to other nuisance variability in the teacher's input. We note that our method does not require expensive annotations, such as (3D) maps or knowledge of other agents' locations, nor does it require online learning in closed loop, since we aim only to capture reactive driving behavior irrespective of its effect on other agents.

3 Method

The training data for our method consists of temporal tuples collected from a rule-based autonomous driving agent that has access to all internal states of the CARLA driving simulator. Each tuple contains an RGB image taken at time t, the self-speed measurement, a high-level command, the ground-truth low-level controls, and annotations of the image consisting of a semantic segmentation mask and three stop intention values. High-level commands are provided by a navigation system and represent the (blind) high-level plan, which avoids confusion at intersections and guides the agent to its destination. Low-level controls comprise brake, throttle and steering angle, respectively. The stop intention values indicate how urgent it is for the agent to brake in order to avoid hazardous traffic situations, namely collisions with vehicles, collisions with pedestrians, and red-light violations. The stop intentions are like instructions given by a traffic-school instructor, which inform the agent of the causal relationships between braking behaviors and different types of dangerous driving scenarios.
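For concreteness, a minimal sketch of how one such training tuple could be organized; the field names are our own illustration of the quantities just described, not the authors' actual data schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingSample:
    """One training tuple collected from the rule-based CARLA agent.

    Field names are illustrative; they mirror the quantities described in the
    text rather than any released data format.
    """
    image: np.ndarray            # RGB frame at time t, e.g. shape (H, W, 3)
    speed: float                 # self-speed measurement
    command: int                 # high-level navigation command (e.g. follow / left / right / straight)
    controls: np.ndarray         # ground-truth (brake, throttle, steering angle)
    seg_mask: np.ndarray         # semantic segmentation annotation of the image, shape (H, W)
    stop_intentions: np.ndarray  # (vehicle, pedestrian, traffic-light) stop intention values
```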

In our method, referred to as LaTeS, we first use the provided segmentation masks, three-category stop intentions and self-speed measurement to learn an expert model (a.k.a. the teacher network) that maps them to the low-level controls. Then, we conduct latent space distillation from the teacher network to learn a driving model (a.k.a. the student network) that does not have access to the side information and maps only the RGB image and self-speed to the controls. The overall pipeline of our method is shown in Fig. 2.

Figure 2: Architecture overview of our expert (teacher) and driving (student) models. Both use a three-branch structure. (a) Expert model: it processes the semantic segmentation, stop intention values, and self-speed in separate branches and then feeds their concatenated representation to a command-conditioned driving module. (b) Driving model: it has a similar structure but takes in only the RGB image and self-speed. Latent space distillation is achieved by enforcing losses between the outputs of the teacher's segmentation and intention-value branches and the student's two ResNet branches.

3.1 Expert Model (Teacher)

The task of our teacher is to learn a representation of the segmentation mask, three-category stop intentions and self-speed measurement that is relevant for driving, with nuisances removed. Such a representation will later be used to supervise the student model, a process we call latent embedding distillation. As shown in Fig. 2, the expert model uses a three-branch network architecture for estimating the low-level controls. The upper branch uses a ResNet34 backbone to represent the segmentation mask input. We use a second fully-connected (FC) branch to process the three-category stop intention values, informing the agent of the causal relationships between braking behaviors and the presence of objects in the context of safe driving. A third FC branch ingests the self-speed measurement, which is helpful for avoiding the long-stop inertia issue [9].

The latent feature vectors from the three branches are concatenated and fed into a driving module, composed of several FC layers and a conditional switch that chooses one out of four different output branches depending on the given turning signal. The four output branches share the same network architecture but with separately learned weights. We use a norm on each control error to construct the loss for expert model training:

\mathcal{L}_{control} = \lambda_b \,\| \hat{b} - b \| + \lambda_t \,\| \hat{t} - t \| + \lambda_s \,\| \hat{s} - s \|     (1)

where \hat{b}, \hat{t}, \hat{s} are the estimated controls of brake, throttle and steering angle respectively, and b, t, s are the ground-truth controls. The \lambda's are weights for the different loss terms.
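For illustration, a minimal PyTorch sketch of the expert model and the control loss of Eq. (1). Layer sizes follow Tab. 14; everything else (class and function names, feeding the mask as a one-hot tensor, ReLU activations, the L1 form of the norm, and the unit loss weights) is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision


class ExpertModel(nn.Module):
    """Teacher: (segmentation mask, stop intentions, self-speed, command) -> controls.

    Layer sizes follow Tab. 14; activations and the one-hot mask input are assumptions.
    """

    def __init__(self, num_seg_classes=6, num_commands=4):
        super().__init__()
        # Segmentation branch: ResNet34 backbone producing a 512-d embedding.
        resnet = torchvision.models.resnet34()
        resnet.conv1 = nn.Conv2d(num_seg_classes, 64, 7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, 512)
        self.seg_branch = resnet
        # Stop-intention branch: 3 -> 128 -> 128.
        self.intent_branch = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                           nn.Linear(128, 128), nn.ReLU())
        # Self-speed branch: 1 -> 128 -> 128.
        self.speed_branch = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                                          nn.Linear(128, 128), nn.ReLU())
        # Joint embedding: (512 + 128 + 128) -> 512.
        self.joint = nn.Sequential(nn.Linear(512 + 128 + 128, 512), nn.ReLU())
        # One control head per high-level command (the "conditional switch").
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 3))
            for _ in range(num_commands)])

    def forward(self, seg_onehot, intentions, speed, command):
        e_seg = self.seg_branch(seg_onehot)      # (B, 512) segmentation embedding
        e_int = self.intent_branch(intentions)   # (B, 128) stop-intention embedding
        e_spd = self.speed_branch(speed)         # (B, 128) speed embedding
        z = self.joint(torch.cat([e_seg, e_int, e_spd], dim=1))
        out = torch.stack([head(z) for head in self.heads], dim=1)   # (B, num_commands, 3)
        idx = command.long().view(-1, 1, 1).expand(-1, 1, 3)
        controls = out.gather(1, idx).squeeze(1)  # select the head of the given command
        return controls, e_seg, e_int             # controls + the embeddings later distilled


def control_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted per-control loss of Eq. (1); an L1 norm is assumed here."""
    per_control = (pred - target).abs().mean(dim=0)   # (brake, throttle, steering)
    return (torch.tensor(weights, device=pred.device) * per_control).sum()
```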

3.2 Driving Model (Student)

The driving model does not have direct access to side information, but instead observes a single RGB image and the self-speed measurement. Its goal is also to estimate the low-level controls, for which it also adopts a three-branch network. The first and the second branches are now both ResNet34 backbones. When training the driving model, as illustrated in Fig. 2, besides applying the control loss (Eq. (1)), we extract latent embeddings from the segmentation mask branch and the stop intention branch of the expert model and enforce a latent embedding loss to push the student's estimated embeddings towards the teacher's distilled ones:

\mathcal{L}_{distill} = \frac{\lambda_{seg}}{N_{seg}} \,\| e_{seg} - \tilde{e}_{seg} \|^2 + \frac{\lambda_{int}}{N_{int}} \,\| e_{int} - \tilde{e}_{int} \|^2     (2)

in which e_{seg} and e_{int} are latent embeddings in the driving model, \tilde{e}_{seg} and \tilde{e}_{int} are the distilled high-dimensional feature vectors from the expert model, N_{seg} and N_{int} are the lengths of the latent vectors, and the \lambda's are tunable hyperparameters. The proposed latent space distillation loss \mathcal{L}_{distill} is trained jointly with the low-level control estimation loss \mathcal{L}_{control} of Eq. (1) in order to learn a robust driving policy for the student network:

\mathcal{L}_{student} = \mathcal{L}_{control} + \mathcal{L}_{distill}
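A corresponding sketch of the student objective: the latent space distillation loss of Eq. (2) combined with the control loss. The squared-error form, the normalization by embedding length, and the unit weights are our reading of the text rather than the authors' exact choices.

```python
import torch


def distillation_loss(e_seg_s, e_int_s, e_seg_t, e_int_t, lam_seg=1.0, lam_int=1.0):
    """Latent space distillation loss of Eq. (2): push the student's embeddings
    towards the teacher's, each term normalized by the embedding length."""
    n_seg, n_int = e_seg_t.shape[1], e_int_t.shape[1]
    l_seg = (e_seg_s - e_seg_t.detach()).pow(2).sum(dim=1).mean() / n_seg
    l_int = (e_int_s - e_int_t.detach()).pow(2).sum(dim=1).mean() / n_int
    return lam_seg * l_seg + lam_int * l_int


def student_objective(pred_controls, gt_controls, e_seg_s, e_int_s, e_seg_t, e_int_t):
    """Joint training objective for the student: control imitation + distillation."""
    l_ctrl = (pred_controls - gt_controls).abs().mean()   # control loss (Eq. (1), L1 assumed)
    l_dist = distillation_loss(e_seg_s, e_int_s, e_seg_t, e_int_t)
    return l_ctrl + l_dist


if __name__ == "__main__":
    # Toy check with random tensors: batch of 8, embeddings of length 512 and 128.
    e_t = (torch.randn(8, 512), torch.randn(8, 128))   # teacher (seg, intention) embeddings
    e_s = (torch.randn(8, 512), torch.randn(8, 128))   # student (seg, intention) embeddings
    loss = student_objective(torch.rand(8, 3), torch.rand(8, 3), *e_s, *e_t)
    print(loss.item())
```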

3.3 Implementation details

We implement our approach using CARLA 0.8.4 [11]. For both the expert model and the student model, we use a batch size of 120 and the ADAM optimizer with an initial learning rate of 0.0002. An adaptive learning rate schedule reduces the learning rate by a factor of 10 if the training loss has not decreased in 1000 iterations. We use a validation set for early stopping, validating every 20K iterations and stopping the training when the validation loss stops decreasing.
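A sketch of the optimizer, learning rate schedule, and early stopping just described (PyTorch); mapping the schedule onto ReduceLROnPlateau stepped once per iteration is our translation, and the names are placeholders.

```python
import torch


def make_optimizer_and_scheduler(model: torch.nn.Module):
    """ADAM with initial learning rate 2e-4 (the paper uses batch size 120);
    drop the learning rate by 10x when the monitored training loss has not
    decreased for 1000 iterations (scheduler stepped once per iteration)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=1000)
    return optimizer, scheduler


class EarlyStopper:
    """Early stopping on a validation loss evaluated every 20K iterations:
    stop once it fails to improve over the best value seen so far."""

    def __init__(self):
        self.best = float("inf")

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            return False
        return True
```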

4 Experiments

For the sake of fair empirical evaluation, our model is compared against a suite of competing offline alternative methods, including CIL [8], CAL [28], MT [18], and the current state-of-the-art, CILRS [9]. We use three standard online evaluation benchmarks: CARLA [11], NoCrash, and Traffic-light [9]. Concerned that multiple driving infractions are not penalized in the existing benchmarks, we then introduce a more realistic evaluation standard, Traffic-school. To show that our policy generalizes well to real-world data, we also perform experiments on two real-world datasets: Comma.ai [27] and Udacity [32]. We conduct an empirical analysis of alternative driving policy learning methods (e.g., two-stage [20], multi-task [18]) that leverage the segmentation mask and stop intention values, to demonstrate the advantages of our proposed latent embedding distillation method. Finally, in ablation studies, we show that it is critical to distill the three-category stop intention and segmentation mask embeddings jointly from the expert model.

4.1 CARLA Simulator

The CARLA, NoCrash, Traffic-light, and Traffic-school benchmarks are evaluated online using fixed routes in the CARLA simulator [11]. The CARLA simulator provides two towns, Town-A and Town-B, which differ in road layout and visual nuisances such as static obstacles, and two sets of weathers, Weather-A and Weather-B. Town-A and Weather-A are used for training, while all other town and weather combinations are used to test generalization. Specifically, Weather-A is the set of weathers {Clear Noon, Clear Noon After Rain, Heavy Rain Noon, Clear Sunset} used in training. Weather-B is a set of weathers used only at test time: for the CARLA benchmark, Weather-B is {Cloudy After Rain, Soft Rain Sunset}, while for the other benchmarks it is {After Rain Sunset, Soft Rain Sunset}.
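For reference, the town/weather split described above written as a small configuration; the set names are from the text, while the dictionary layout is ours.

```python
# Training vs. test conditions in the CARLA-based benchmarks (Sec. 4.1).
WEATHER_A = ["Clear Noon", "Clear Noon After Rain", "Heavy Rain Noon", "Clear Sunset"]  # training
WEATHER_B_CARLA = ["Cloudy After Rain", "Soft Rain Sunset"]   # test weathers for the CARLA benchmark
WEATHER_B_OTHER = ["After Rain Sunset", "Soft Rain Sunset"]   # test weathers for the other benchmarks

# Generalization splits (shown with the NoCrash / Traffic-light / Traffic-school test weathers).
SPLITS = {
    "train":            {"town": "Town-A", "weathers": WEATHER_A},
    "new_weather":      {"town": "Town-A", "weathers": WEATHER_B_OTHER},
    "new_town":         {"town": "Town-B", "weathers": WEATHER_A},
    "new_town_weather": {"town": "Town-B", "weathers": WEATHER_B_OTHER},
}
```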

4.2 Comparison with State of the Art on CARLA

We first test our method, LaTeS, on three widely adopted benchmarks: CARLA [8], NoCrash, and Traffic-light [9]. Results and further explanation of each benchmark are as follows.

CARLA The CARLA benchmark (results in Tab. 1) evaluates navigation with and without dynamic obstacles. The metric is navigation success rate, where a route is completed successfully if the agent reaches the destination within a time limit. This benchmark is relatively easy due to not penalizing crashes and therefore has been saturated. In other words, the agent can crash many times along its way to the destination and still successfully complete a route. Nevertheless, our model LaTeS significantly outperforms all competing methods, achieving an average error of 2% and a relative failure reduction of 85% over the second-best CILRS, which achieves an average error of 13%.

NoCrash We report results on the NoCrash benchmark in Tab. 2. For this benchmark, the metric is navigation success rate in three different levels of traffic: empty, regular, and dense. Unlike the CARLA benchmark, a route is considered successful for NoCrash only if the agent reaches the destination within the time window without crashing. Hence, NoCrash is more challenging than the CARLA benchmark [8]. Though the second-best CILRS significantly outperforms the other baselines, our method, with average error 28%, still achieves a relative failure rate reduction of 30% over CILRS, which has average error 40%. Our method achieves especially large performance gains over CILRS in town-B, the test town, and under regular and dense traffic, showing the generalization benefits of distilling the segmentation and stop intention value embeddings.

Traffic-light The results are shown in Tab. 3. Since NoCrash does not penalize running red lights when computing navigation success rate, the Traffic-light benchmark is proposed to analyze an agent’s traffic light violation behavior. The average traffic light non-violation rate of our model LaTeS (87%) is more than twice as high as that of CILRS (42%).

4.3 Comparison on Traffic-school

To resolve the flaws of previous benchmarks, we propose the Traffic-school benchmark, which shares the same routes and weathers as NoCrash but uses a more restrictive evaluation protocol. In the previous benchmarks, multiple driving infractions are ignored when judging whether a route is successfully finished. In the Traffic-school benchmark, a route is considered a success only if the agent reaches the destination while satisfying the following requirements: a) no overtime, b) no crashes, c) no traffic light violations, d) no driving into the opposite lane, and e) no driving onto the sidewalk. As shown in Tab. 4, under this more realistic evaluation protocol, our results still significantly surpass the previous state-of-the-art CILRS in terms of navigation success rate. This improvement is primarily due to the large improvement in stopping for traffic lights shown before, demonstrating the effectiveness of distilling the latent embedding of stop intention values.
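A minimal sketch of the stricter success criterion relative to NoCrash; the per-route infraction flags are hypothetical names for the events the protocol checks.

```python
from dataclasses import dataclass


@dataclass
class RouteOutcome:
    """Per-route evaluation record; field names are illustrative."""
    reached_destination: bool
    overtime: bool
    crashed: bool
    ran_red_light: bool
    entered_opposite_lane: bool
    entered_sidewalk: bool


def nocrash_success(r: RouteOutcome) -> bool:
    """NoCrash: reach the destination within the time limit without crashing."""
    return r.reached_destination and not r.overtime and not r.crashed


def traffic_school_success(r: RouteOutcome) -> bool:
    """Traffic-school: additionally require no red-light, opposite-lane,
    or sidewalk infractions."""
    return (nocrash_success(r) and not r.ran_red_light
            and not r.entered_opposite_lane and not r.entered_sidewalk)
```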

Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Static Dynamic Static Dynamic Static Dynamic Static Dynamic
CIL [8] 86 83 84 82 40 38 44 42 62
CAL [28] 92 83 90 82 70 64 68 64 77
MT [18] 81 81 88 80 72 53 78 62 74
CILRS [9] 95 92 96 96 69 66 92 90 87
LaTeS 100 100 100 100 95 92 98 98 98
Table 1: Results on CARLA benchmark  [11]. We show navigation success rate in test data from, left to right, training town and weather, training town and test weather, test town and training weather, as well as test town and test weather. Dynamic / static indicate whether the test routes have moving objects (i.e. vehicles, pedestrians) or not. Our model LaTeS achieves an average error of 2%, surpassing the second-best CILRS, which has average error 13%, by 85% in terms of relative failure reduction.
Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
CIL [8]      
CAL [28]      
MT [18]         
CILRS [9]   
LaTeS 100±0 94±2 54±3 100±0 89±3 47±5 92±1 74±2 29±3 83±1 68±7 29±2 72±0.9
Table 2: Comparison against competing methods on the NoCrash benchmark [9]. Empty, regular and dense refer to three levels of traffic congestion. We show navigation success rate in test data from left to right: training town and weather, training town and test weather, test town and training weather, as well as test town and test weather. A route is completed successfully only if the agent reaches the destination within a certain time window without any crash along its way. Due to simulation randomness, all methods are evaluated 3 times to compute standard deviations (reported as mean ± std). The average error over all runs for our model LaTeS is 28%, which is 30% better than the next-best CILRS, 40%, in terms of relative failure reduction.
Town-A Town-B Mean
Method Weather-A Weather-B Weather-A Weather-B
CILRS [9]
LaTeS 97±0 96±1 81±1 73±1 87±0.4
Table 3: Traffic light success rate (percentage of not running the red light). We only compare with CILRS as the previous state of the art. The average success rate over all runs for our model LaTeS, 87%, is more than twice as high as CILRS, 42%.
Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
CILRS [9]                              
LaTeS 90±2 79±1 43±5 83±3 73±1 39±4 46±2 39±3 12±2 15±3 25±2 14±0 47±0.8
Table 4: The newly proposed Traffic-school benchmark provides a more solid evaluation standard than both the NoCrash, Tab. 2, and the old CARLA benchmarks, Tab. 1. The previous benchmarks are flawed due to not penalizing several driving infractions such as violating a red traffic light and running into the opposite lane or the sidewalk. On our Traffic-school benchmark, a route is considered successful only if the agent arrives at the destination within a given time without making any of the following mistakes: a) crash, b) traffic light violation, c) out-of-road infraction. Under this more realistic evaluation protocol, our results, in all conditions, still significantly surpass the previous state-of-the-art CILRS in terms of navigation success rate.
Figure 3: Qualitative comparison of LaTeS fine-tuning and LaTeS from scratch on the Comma.ai (top images) and Udacity (bottom images) real-world datasets. The green trajectory represents the ground-truth steering angle, while the blue and red trajectories represent the steering angles estimated by LaTeS fine-tuning and LaTeS from scratch, respectively. We see that the fine-tuned model is accurate in a variety of situations.

4.4 Real-world Data Generalization

To demonstrate that our driving policy, learned with simulated data, generalizes to real-world driving scenarios, we conduct steering angle estimation experiments on two well-known real-world datasets, Comma.ai [27] and Udacity [32]. The numerical results are reported in Tab. 5, and qualitative results are shown in Fig. 3. They suggest that the driving policy learned using simulated data generalizes well to real-world applications, as fine-tuning this policy outperforms training from scratch by a large margin. Moreover, note that the agent trained solely on simulation data (LaTeS initialization) already demonstrates performance comparable to an agent trained on the real-world datasets (LaTeS from scratch).

Method Comma.ai [27] Udacity[32]
LaTeS from scratch        2.62      7.61
LaTeS initialization        2.90      7.91
LaTeS fine-tuning        2.22      4.99
Table 5: Real-world data steering angle (degrees) estimation. The results are evaluated by mean absolute error (MAE). LaTeS from scratch: we use the same network architecture as LaTeS and train it from scratch on the real-world datasets using a steering angle estimation loss. LaTeS initialization: we directly take a simulation-trained agent and test it on the real-world datasets. LaTeS fine-tuning: we fine-tune a simulation-trained LaTeS model on the real-world data. Without fine-tuning, our LaTeS agent already demonstrates performance comparable to an agent trained on real-world data. With fine-tuning, we achieve the best performance, showing that the driving policy learned from simulation data generalizes well to real-world driving scenarios.
Figure 4: Two-stage and multi-task approaches. Top: the two-stage (modular) approach splits the driving model into two modules. The perception module estimates a high-level intermediate representation while the driving module uses this representation to output low-level controls. Bottom: the multi-task learning approach adds high-level side information estimation as a side task to driving and trains for both tasks using a shared encoder but separate decoders.

4.5 Advantages of Model Distillation

Inspired by [20, 18], we compare against two alternative methods, depicted in Fig. 4, that can also leverage the provided segmentation masks and stop intention values for driving policy learning. Results are in Tab. 6.

Two-stage-(F) We apply an intuitive strategy of utilizing two separately trained modules: a) perception networks, b) driving networks. The perception networks are trained for segmentation mask and stop intention value estimation. In the second step, the driving networks use the same architecture as the expert model and take the estimated segmentation masks and stop intentions as input for low-level control estimation. For two-stage, we directly take the learned weights from the expert model as the driving network. Note that the expert model is trained with ground-truth segmentation masks and stop intention values. Thus, for two-stage-F we further fine-tune the driving networks on the segmentation masks and stop intentions estimated by the perception networks in order to account for perception errors.

Multi-task We apply a similar multi-task training strategy as MT [18] but with different auxiliary tasks. On the same latent feature maps where we enforce model distillation losses in our LaTeS method to train the student model, we now train decoders to estimate segmentation masks and stop intentions as side tasks. The motivation is that by simultaneously supervising these auxiliary problems, the learned features are more invariant to environmental changes like buildings, weather, etc.
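For contrast with latent space distillation, a rough sketch of how the multi-task baseline attaches auxiliary decoders to the same branch embeddings; the toy decoder shapes are placeholders (the actual segmentation decoder is given in Tab. 17), and the class name is ours.

```python
import torch.nn as nn


class MultiTaskHeads(nn.Module):
    """Auxiliary decoders for the multi-task baseline: the student's two branch
    embeddings are decoded back into a coarse segmentation map and the stop
    intentions, instead of being matched to the teacher's embeddings."""

    def __init__(self, seg_dim=512, int_dim=128, num_classes=6, out_res=8):
        super().__init__()
        self.num_classes, self.out_res = num_classes, out_res
        # Toy segmentation decoder: embedding -> low-resolution class map.
        self.seg_decoder = nn.Sequential(
            nn.Linear(seg_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes * out_res * out_res))
        # Stop-intention decoder: embedding -> 3 intention values.
        self.intent_decoder = nn.Sequential(
            nn.Linear(int_dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, e_seg, e_int):
        seg_logits = self.seg_decoder(e_seg).view(-1, self.num_classes, self.out_res, self.out_res)
        intentions = self.intent_decoder(e_int)
        return seg_logits, intentions
```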

Our method outperforms the alternative approaches, showing the effectiveness of the latent embedding distillation in encouraging the driving model to learn a more optimal representation for driving. We note that all methods using the provided segmentation masks and stop intentions outperform the no-distillation baselines, showing that the side information is useful for generalization. Out of the methods that use side information, we note that two-stage(-F) performs the worst, potentially due to perception errors and the loss of relevant information present in the image but not in the semantic segmentation or intention values.

Weather-A Weather-B Mean
Method Empty Regular Dense Empty Regular Dense
LaTeS-ND 65±3 36±1 9±2 42±3 31±2 7±3 31.7±1.0
Res101-ND 70±3 44±2 13±4 50±2 33±1 7±3 36.2±1.1
Two-stage 92±1 50±3 12±1 81±2 41±6 9±3 47.5±1.3
Two-stage-F 90±2 57±4 13±1 79±3 42±4 8±2 48.2±1.2
Multi-task 91±0 62±2 17±2 83±1 65±6 16±2 55.7±1.2
LaTeS 92±1 74±2 29±3 83±1 68±7 29±2 62.5±1.4
Table 6: Comparison of alternative methods of using side information on NoCrash. We only show navigation success rate in Town-B, the test town. More results in Town-A are included in the Supp. Material. In LaTeS-ND and Res101-ND, we use only the low-level control estimation losses, without enforcing the distilled latent embedding supervision. LaTeS-ND uses the same two-branch architecture as LaTeS. Res101-ND is a single-branch baseline that has a comparable number of network parameters. Though two-stage(-F) and multi-task both improve over the two non-distillation models by large gaps, they are still much worse than our LaTeS model. The results provide thorough comparisons among multiple alternatives of using the provided segmentation masks and stop intention values, and show that our proposed model distillation method performs best.

4.6 Ablation Studies

We first analyze the influence of utilizing only one type of knowledge for model distillation: the segmentation mask embedding or the stop intention embedding. Then, we conduct fine-grained ablation studies to understand the importance of each individual category of stop intention values, namely vehicle, pedestrian and traffic light.

Segmentation masks and stop intentions We use two different types of knowledge from the expert model for latent embedding model distillation. In Tab. 7 we conduct ablation studies using each type of information separately for model distillation. Segmentation masks provide the student model with some simple concepts of object identities, and therefore help the agent to learn basic driving skills like lane following and making turns. Meanwhile, stop intentions inform the agent of different hazardous traffic situations where braking is needed, such as getting close to pedestrians and vehicles or seeing a red traffic light. Separately, both types of knowledge bring performance gains by latent embedding model distillation, but the best results are achieved only when they are utilized jointly.

Stop intention categories An autonomous driving agent might apply its brake for various reasons, such as approaching other vehicles, pedestrians, or a red light. To analyze the impact of individual stop intentions on the learned driving model, we present Tab. 8. The results indicate that the agent achieves the best performance when all three types of intentions are used. The set of all three stop intentions provides the model with comprehensive indication signals of the causal link between braking and different hazardous traffic situations.

In brief, the ablation studies indicate that it is beneficial to conduct model distillation jointly for three-category stop intention embeddings and segmentation mask embeddings, in order to learn a driving model that generalizes well across various maps, weathers and traffic situations.

Weather-A Weather-B Mean
Distillation source Empty Regular Dense Empty Regular Dense
Only stop intention 86±2 47±3 8±4 73±3 53±6 9±5 46.0±1.7
Only seg. mask 93±1 50±4 8±4 85±1 52±7 7±3 49.2±1.6
Both 92±1 74±2 29±3 83±1 68±7 29±2 62.5±1.4
Table 7: Partial latent embedding model distillation. We only show navigation success rate in Town-B, the test town, on NoCrash. More results in Town-A are included in the Supp. Material. Only stop intention and only segmentation mask both improve upon the LaTeS-ND baseline (31.7% mean success rate in Tab. 6), which does not conduct any model distillation at all. Moreover, the best results are achieved when these two types of latent embedding model distillation are applied jointly.
Weather-A Weather-B Mean
Stop intention type Empty Regular Dense Empty Regular Dense
Traffic light 60±3 37±4 11±3 46±2 25±4 9±1 31.3±1.2
Vehicle 81±2 50±4 11±2 73±2 49±7 13±3 46.2±1.5
Pedestrian 84±2 61±3 19±1 71±1 43±1 13±3 48.5±0.8
All 92±1 74±2 29±3 83±1 68±7 29±2 62.5±1.4
Table 8: Ablation study on three different categories of stop intention values: traffic light, vehicle and pedestrian. We only show navigation success rate in Town-B on NoCrash. More results in Town-A are included in the Supp. Material. We conduct ablation studies by applying our approach to expert models trained using individual stop intentions. The best performance is achieved when all three types of stop intentions are used.

5 Discussion

Our latent-embedding distillation method is unusual in that the teacher model has a different input than the student model, and the student does not use the output of the teacher model. Instead, it learns to match the teacher’s internal hidden representation. We have motivated this choice in Sect. 1, and validated this practice’s underlying assumptions in Sect. 4. There we can see that the teacher/student split significantly improves performance over both multi-task models that attempt to learn how to semantically segment images and estimate intention values while learning how to drive, as well as two-stage models that first segment the images and estimate the intention values, and then use the result for driving.

An additional factor that leads to the improved performance of our method is that the distilled information is relatively low-dimensional and therefore easier to estimate from RGB images than the raw segmentation masks and stop intention values, leading to fewer errors in perception. Furthermore, since the distilled information comes from the expert model, the latent embedding loss is more correlated with driving than the intermediate side tasks, encouraging the driving model to learn a sufficient, invariant representation of the teacher’s inputs. Finally, the latent embedding loss promotes generalization, as the distilled information is a function of inputs, such as the segmentation mask, that are invariant to photometric nuisances.

In the experiments section, we demonstrate these advantages by testing multiple alternative methods of using the provided segmentation masks and stop intention values. All the benchmark comparisons as well as ablation studies demonstrate that the proposed latent space distillation method is effective for robust driving policy learning, at least in the reactive setting. Moreover, compared with other methods which require expensive online training procedures or high-accuracy 3D maps, our model distillation based teacher-student learning method is easy to implement and only requires segmentation masks and stop intentions annotations.

Our method is not a panacea, and we are not advocating it as an overall solution to autonomous driving. Modularity and failure mode management are necessary in any safety-critical system, and our model does not provide them. However, our model can be a component of a sub-system. In the Supp. Material we provide the network architecture details as well as an assessment of the computational cost of training and inference.

Thinking more broadly than driving, our method uses a secondary teacher model which is trained separately for the main task using side information as inputs to learn a sufficient invariant representation for the main task. This overall idea is not specific to driving and could potentially be applied to other challenging tasks in robotics.

References

  • [1] A. Achille and S. Soatto (2018) A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems 1 (1). Cited by: §2.
  • [2] J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2654–2662. Cited by: §2.
  • [3] M. Bansal, A. Krizhevsky, and A. Ogale (2019) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems, Cited by: §2.
  • [4] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. ArXiv abs/1604.07316. Cited by: §2.
  • [5] A. Censi (2012) Bootstrapping Vehicles: a formal approach to unsupervised sensorimotor learning based on invariance. Technical report External Links: Link Cited by: §2.
  • [6] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015-12) DeepDriving: learning affordance for direct perception in autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730. Cited by: §2.
  • [7] D. Chen, B. Zhou, V. Koltun, and P. Krähenbühl (2019) Learning by cheating. In CoRL, Cited by: §2, §2.
  • [8] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §2, §2, §4.2, §4.2, Table 1, Table 2, §4.
  • [9] F. Codevilla, E. Santana, A. M. Lopez, and A. Gaidon (2019-10) Exploring the limitations of behavior cloning for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 9, Table 18, Appendix F, Table 19, §1, §2, §2, §3.1, §4.2, Table 1, Table 2, Table 3, Table 4, §4.
  • [10] E.D. Dickmanns and Th. Christians (1991) Relative 3d-state estimation for autonomous visual guidance of road vehicles. Robotics and Autonomous Systems 7 (2), pp. 113 – 123. Note: Special Issue Intelligent Autonomous Systems Cited by: §2.
  • [11] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning (CoRL), pp. 1–16. Cited by: Figure 1, §1, §2, §2, §3.3, §4.1, Table 1, §4.
  • [12] U. Franke (2017) Autonomous driving. In Computer Vision in Vehicle Technology: Land, Sea & Air, pp. 24–54. Cited by: §2.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. International Journal of Robotics Research (IJRR). Cited by: §2.
  • [14] T. He and S. Soatto (2019) Mono3D++: monocular 3d vehicle detection with two-scale 3d hypotheses and task priors. In AAAI, Cited by: §2.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. NeurIPS 2014 Deep Learning Workshop. Cited by: §2.
  • [16] Y. Hou, Z. Ma, C. Liu, and C. C. Loy (2019) Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [17] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp (2005) Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 739–746. Cited by: §2, §2.
  • [18] Z. Li, T. Motoyoshi, K. Sasaki, T. Ogata, and S. Sugano (2018) Rethinking self-driving: multi-task knowledge for better generalization and accident explanation ability. ArXiv abs/1809.11100. Cited by: §2, §4.5, §4.5, Table 1, Table 2, §4.
  • [19] X. Liang, T. Wang, L. Yang, and E. Xing (2018) Cirl: controllable imitative reinforcement learning for vision-based self-driving. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 584–599. Cited by: §2, §2.
  • [20] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun (2018) Driving policy transfer via modularity and abstraction. In Proceedings of the Conference on Robot Learning (CoRL), pp. 1–15. Cited by: §D.2, §2, §4.5, §4.
  • [21] B. Paden, M. Cap, S. Z. Yong, D. Yershov, and E. Frazzoli (2016) A survey of motion planning and control techniques for self-driving urban vehicles. In IEEE Transactions on Intelligent Vehicles, Vol. 1, pp. 33–55. Cited by: §2.
  • [22] D. A. Pomerleau (1988) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 305–313. Cited by: §2.
  • [23] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §2.
  • [24] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo (2018-01) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272. Cited by: §D.2.
  • [25] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) Fitnets: hints for thin deep nets. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [26] S. Ross, G. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 15, pp. 627–635. Cited by: §2.
  • [27] E. Santana and G. Hotz (2016) Learning a driving simulator. ArXiv abs/1608.01230. Cited by: Appendix A, §4.4, Table 5, §4.
  • [28] A. Sauer, N. Savinov, and A. Geiger (2018) Conditional affordance learning for driving in urban environments. In CoRL, Cited by: §2, Table 1, Table 2, §4.
  • [29] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. NeurIPS 2016 Learning, Inference and Control of Multi-Agent Systems Workshop. Cited by: §2, §2.
  • [30] D. Silver, J. A. Bagnell, and A. Stentz (2010) Learning from demonstration for autonomous navigation in complex unstructured terrain. International Journal of Robotics Research 29, pp. 1565–1592. Cited by: §2, §2.
  • [31] G. Sundaramoorthi, P. Petersen, V. Varadarajan, and S. Soatto (2009) On the set of images modulo viewpoint and contrast changes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 832–839. Cited by: §1.
  • [32] (2018) Udacity. Note: Accessed: 2019-11-09. Cited by: Appendix A, §4.4, Table 5, §4.
  • [33] S. Ullman (1980) Against direct perception. Behavioral and Brain Sciences 3, pp. 373–381. Cited by: §2.
  • [34] B. Wymann, C. Dimitrakakis, A. Sumner, E. Espié, and C. Guionneau (2014) TORCS, the open racing car simulator. Note: http://www.torcs.org Cited by: §2.
  • [35] H. Xu, Y. Gao, F. Yu, and T. Darrell (2016) End-to-end learning of driving models from large-scale video datasets. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3530–3538. Cited by: §2.
  • [36] Y. You, X. Pan, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §2, §2.
  • [37] J. Zhang and K. Cho (2017) Query-efficient imitation learning for end-to-end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 4, pp. 2891–2897. Cited by: §2, §2.

Supplementary Material

Appendix A Video Demo

Please see the video attached to this Supplementary Material (https://tonghehehe.com/lates), which illustrates the diversity of conditions (weather, number of agents) as well as the covariate shift between training (Town-A) and test (Town-B) maps, with representative successful runs of various maneuvers (avoiding pedestrians, stopping at lights, turns under strong weather conditions, etc.). We also show our LaTeS model operating off-line on real-world (natural imaging) data from the Comma.ai [27] and Udacity [32] datasets.

Appendix B Additional Results on Town-A

For the tables where only Town-B results were provided, here we report the Town-A results. Tab. 9, 10, and 11 provide the Town-A results corresponding to Tab. 6, 7, and 8 in the main paper, respectively. Note that our model outperforms all alternative methods in Town-A, similar to the results in Town-B.

Weather-A Weather-B Mean
Method Empty Regular Dense Empty Regular Dense
LaTeS-ND 98±0 81±2 19±3 96±0 72±4 18±5 64.0±1.2
Res101-ND 99±1 85±2 22±1 89±1 72±3 27±1 65.7±0.7
Two-stage 100±0 80±4 29±4 83±1 63±1 15±4 61.7±1.2
Two-stage-F 100±0 83±1 29±4 87±1 67±2 22±5 64.7±1.1
Multi-task 100±0 94±3 41±2 96±0 87±2 37±5 75.8±1.1
LaTeS 100±0 94±2 54±3 100±0 89±3 47±5 80.7±1.1
Table 9: Comparison of alternative methods that also use the side information on NoCrash [9] in Town-A, the training town.
Weather-A Weather-B Mean
Distillation source Empty Regular Dense Empty Regular Dense
Only stop intention 100±0 86±2 33±4 97±1 79±1 25±2 70.0±0.8
Only seg. mask 100±0 83±3 31±3 98±2 79±5 23±6 69.0±1.5
Both 100±0 94±2 54±3 100±0 89±3 47±5 80.7±1.1
Table 10: Partial latent embedding model distillation. Here, we show navigation success rate in Town-A, the training town, on NoCrash.
Weather-A Weather-B Mean
Stop intention type Empty Regular Dense Empty Regular Dense
Traffic light 95±1 73±1 16±2 92±0 63±6 9±1 58.0±1.1
Vehicle 100±0 89±1 27±3 94±2 69±1 25±5 67.3±1.1
Pedestrian 100±0 93±1 43±2 98±0 87±5 41±1 77.0±0.9
All 100±0 94±2 54±3 100±0 89±3 47±5 80.7±1.1
Table 11: Ablation study on three different categories of stop intention values: traffic light, vehicle and pedestrian. We show navigation success rate in Town-A on NoCrash.

Appendix C Expert Model Results

We compare the results of our LaTeS driving model (a.k.a. the student network) with the expert model (a.k.a. the teacher network) on NoCrash and Traffic-school benchmarks in Tab. 12 and Tab. 13, respectively. The expert model, that has access to the ground-truth semantic segmentation and stop intention values at test time, generally outperforms the driving model; however, the driving model will occasionally outperform the expert model due to the image containing relevant information that is not in the semantic segmentation and stop intentions such as object shapes, traffic light colors and so on. Generally the expert model shows better generalization ability with higher success rates in (Town-B, Weather-B), the test town and weathers.

Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
LaTeS 100±0 94±2 54±3 100±0 89±3 47±5 92±1 74±2 29±3 83±1 68±7 29±2 72±0.9
Expert 100±0 93±2 63±7 100±0 93±2 59±4 97±1 76±3 40±4 99±2 81±3 39±1 78±0.9
Table 12: Comparison of LaTeS and expert models on the NoCrash benchmark.
Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
LaTeS 90±2 79±1 43±5 83±3 73±1 39±4 46±2 39±3 12±2 15±3 25±2 14±0 47±0.8
Expert 76±1 61±1 45±4 75±2 61±4 45±10 39±1 40±2 23±4 39±3 43±1 23±1 48±1.1
Table 13: Comparison of LaTeS and expert models on the newly proposed Traffic-school benchmark.

Appendix D Detailed Model Architectures

D.1 Our LaTeS Model

Network architectures of the expert model and the driving model are explained in Tab. 14 and Tab. 15, respectively.

D.2 Two-Stage-(F)

The two-stage-(F) model uses ErfNet [24] for semantic segmentation, which is also used in [20], and a ResNet34 based network, Tab. 16, for stop intentions estimation. The two-stage-(F) model then feeds the estimated segmentation masks and intention values to the expert model for low-level control estimation (finetuned with estimated perception inputs for two-stage-F).

D.3 Multi-task

For the multi-task model, we use the same architecture as the driving model except that we add two additional decoders, which use the outputs of the segmentation mask embedding and stop intentions embedding branches to estimate the segmentation mask and stop intentions, respectively. The architecture of the segmentation mask decoder is given in Tab. 17. The segmentation mask decoder consists of several deconvolution (Deconv) blocks with skip connections between the outputs of the Deconv block and the outputs of the corresponding ResNet block from the segmentation mask encoder. Furthermore, every deconvolution and convolution layer in the Deconv blocks is followed by a batch normalization layer and ReLU activation. The architecture of the intentions decoder is similar to that of the Controls module for the driving model in Tab. 15, except that it takes in a vector of length 128, the length of the intentions embedding.
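A sketch of one such Deconv block with a skip connection; the kernel sizes are not recoverable from Tab. 17, so the 3x3 choice (and the additive skip) is an assumption.

```python
import torch.nn as nn


class DeconvBlock(nn.Module):
    """One upsampling block of the segmentation mask decoder: a stride-2 transposed
    convolution followed by a convolution, each with batch normalization and ReLU,
    plus a skip connection from the corresponding ResNet encoder block."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x, skip=None):
        x = self.deconv(x)
        if skip is not None:
            x = x + skip   # skip connection from the matching encoder block
        return self.conv(x)
```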

D.4 Res101-ND

For the Res101-ND model, the architecture is the same as that of the driving model except that the segmentation mask and stop intentions embedding branches are replaced by a single ResNet101 branch with an output FC vector size of 640 (512 + 128), keeping the total latent embedding size the same as in LaTeS.

Module Layer Input Dimension Output Dimension
Segmentation Mask ResNet34 512
Stop Intentions FC 3 128
128 128
Self-Speed FC 1 128
128 128
Joint Embedding FC 512 + 128 + 128 512
Controls FC 512 256
256 256
256 3
Table 14: Network architectures of the expert model. The dimension format is channels × height × width for feature maps, or just the vector length for feature vectors.
Module Layer Input Dimension Output Dimension
Seg Mask Embedding ResNet34 512
Stop Intentions Embedding ResNet34 128
Self-Speed FC 1 128
128 128
Joint Embedding FC 512 + 128 + 128 512
Controls FC 512 256
256 256
256 3
Table 15: Network architectures of the driving model.
Module Layer Input Dimension Output Dimension
Perception ResNet34 512
Stop Intentions FC 512 256
256 256
256 3
Table 16: Network architectures of the stop intention estimation network used in Two-stage-(F).
Layer Input Dimension Output Dimension
FC 512 1536
Reshape 1536
Deconv (, 512, stride 2)
Conv (, 512, stride 1)
Deconv (, 256, stride 2)
Conv (, 256, stride 1)
Deconv (, 128, stride 2)
Conv (, 128, stride 1)
Deconv (, 64, stride 2)
Conv (, 64, stride 1)
Deconv (, 64, stride 2)
Conv (, 64, stride 1)
Deconv (, 64, stride 2)
Conv (, 64, stride 1)
Conv (, 6, stride 1)
Table 17: Network architectures of the segmentation mask decoder used in Multi-task. The layer format is (kernel size, output channel, stride).
Town-A & Weather-A Town-A & Weather-B Town-B & Weather-A Town-B & Weather-B Mean
Method Empty Regular Dense Empty Regular Dense* Empty Regular Dense Empty* Regular Dense
CILRS Original
CILRS Rerun 61±0.7
Table 18: Comparison on NoCrash as reported in the original CILRS paper [9] v.s. our rerun with author-released code and best model. Columns with * indicate evaluation settings where we report the numbers from the rerun since the success rate differences are larger than 5%; otherwise, we report the numbers from the original CILRS paper.

Appendix E Semantic Segmentation

For the semantic segmentation annotations, we retain classes relevant to driving and throw out nuisance classes. Specifically, we use the pedestrians, roads, vehicles, and trafficSigns classes and map the roadlines and sidewalks classes to the same class. We map all other classes to a nuisance class. Hence, we obtain a total of 6 classes for the semantic segmentation.
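A sketch of this class remapping; the grouping into 6 classes follows the text above, while the raw CARLA 0.8.x label IDs used here are assumptions.

```python
import numpy as np

# Grouping of raw CARLA semantic classes into the 6 training classes of Appendix E.
NUISANCE, PEDESTRIAN, ROAD, VEHICLE, TRAFFIC_SIGN, LANE_BOUNDARY = range(6)

CARLA_TO_TRAINING = {   # raw label IDs are assumptions
    4: PEDESTRIAN,      # Pedestrians
    7: ROAD,            # Roads
    10: VEHICLE,        # Vehicles
    12: TRAFFIC_SIGN,   # TrafficSigns
    6: LANE_BOUNDARY,   # RoadLines (mapped to the same class as Sidewalks)
    8: LANE_BOUNDARY,   # Sidewalks
}


def remap_segmentation(mask: np.ndarray) -> np.ndarray:
    """Map a raw CARLA label mask to the 6 training classes; every class not
    listed above is treated as nuisance."""
    out = np.full_like(mask, NUISANCE)
    for raw_id, train_id in CARLA_TO_TRAINING.items():
        out[mask == raw_id] = train_id
    return out
```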

Appendix F CILRS Original v.s. Rerun

We rerun CILRS [9] using the author-provided code and best model. Comparisons between the original and the rerun results are shown in Tab. 18 and Tab. 19. We notice that some of the rerun numbers differed significantly (by more than 5%) from those reported in the original CILRS paper. For these numbers, we report the numbers we obtained from rerunning their released code and model.

Appendix G Timing

The expert model trains in 10 hours on a GTX 1080Ti while the driving model trains in 1 day on a Titan Xp. At inference time, our driving model demonstrates real-time performance of 59 FPS on a GTX 1080Ti.

Town-A Town-B Mean
Method Weather-A Weather-B Weather-A Weather-B
CILRS Original 53 N/A N/A 36 45
CILRS Rerun
Table 19: Comparison on traffic light success rate (percentage of not running the red light) as reported in the original CILRS paper [9] v.s. our rerun using author-released code and best model. We note that the CILRS paper does not report standard deviations, and results on (Town-A, Weather-B) as well as (Town-B, Weather-A). In general, the original numbers are comparable with our rerun results in terms of the mean success rate.