1 Introduction
Driving is a complex endeavor that requires planning and control given sensory inputs and possibly other side information, considering nearby agents and the effect of our behavior on their actions. Due to the complexity of the full driving problem, we ignore multi-agent dynamics and focus on learned reactive driving, where we train a model that maps images and speed measurements directly to low-level controls, namely throttle, steering angle, and brake, as shown in Fig. 1.
In addition to sensory inputs and corresponding controls to imitate, supervision can include other side information like “affordances” of objects such as pedestrians, traffic lights or stalled cars, the most important being that one should stop for them (stop intention). The driving task also informs what variability in the data is irrelevant such as photometric variability due to illumination or texture and material properties of objects such as the color of cars or buildings. Ideally, we want our driving policy to be invariant to such nuisance variability.
This could be achieved simply by pre-processing the data; for instance, the curvature of the level sets is a maximal invariant to monotonic continuous transformations of the image range [31]. However, real illumination variability is far more complex, including inter-reflections, translucency, transparency, cast shadows, etc. Alternatively, one could learn away illumination variability directly in the driving model, but that would require observing images at all times of the day, all days of the year, and all weather conditions, causing the sample complexity of the training set to explode.
Towards this end, we propose to separate the management of illumination variability into two stages: first, a teacher network is trained, via its latent representation, to encode side information chosen to be illumination-invariant (semantic segmentation) and informed by traffic rules and object affordances (stop intention values). The training of the teacher is constructed so that the resulting representation is a sufficient invariant for the task, meaning that it retains all and only the information in the semantic segmentation and stop intention values that is relevant for driving. Subsequently, our main student model is trained via a latent embedding loss to learn the sufficient invariant representation of the teacher, improving invariance to nuisances without throwing away information relevant to the driving task. Unlike previous distillation approaches, our student model has different inputs from the teacher model (images, rather than semantic segmentation and intentions) and is trained with the ground-truth driving controls rather than with the outputs of the teacher model. Our student model is trained not to mimic the teacher's behavior, but to distill its (sufficient invariant) latent representation. This is important, since the student has access to the images, which may contain relevant information for driving that is not present in the semantic segmentation and is therefore not available to the teacher.
We further show that our method is effective at exploiting side information from the teacher while not discarding information in the images. We test the resulting agent on simulated benchmarks, where the inputs are just images, a speed measurement, and high-level planning indicators (turn signals), as well as on real-world datasets. We show that the method performs at the state of the art as evaluated by the most commonly adopted protocols [11, 9], which, however, we find lacking: we do not think one should be rewarded for reaching the destination while driving through red lights and mowing down pedestrians. We therefore introduce a more realistic protocol, Traffic-school, that rewards successful routes only if they do not violate basic traffic rules. To summarize, the main contributions of this work are as follows:
- a novel teacher-student method, latent space distillation, which encourages the student to learn the teacher's sufficient, invariant representation for driving;
- a novel design for the teacher network, which learns a sufficient, invariant representation for driving using (a) semantic segmentation, which provides the teacher with object-class knowledge for basic tasks such as lane following, and (b) stop intention values, which provide causal knowledge relating braking to dangerous driving situations;
- a new evaluation protocol, Traffic-school, which fixes the flaws of older benchmarks and is therefore more realistic. In particular, it penalizes traffic infractions that previous benchmarks ignore, such as running onto the sidewalk.
2 Related Work
Autonomous driving has drawn significant attention for several decades [22, 17, 30]. Approaches to the driving task can generally be categorized into three classes: modular pipelines, direct perception, and end-to-end learning. Of these, end-to-end learning is the most closely related to our work.
Modular pipelines form the most popular approach [33, 21, 12], separating driving into two components: intermediate perception modules [13, 23, 14] and control modules that estimate low-level controls. For earlier literature on modular approaches, we refer the reader to [10]. More recently, [20, 3] use semantic segmentation and environment maps as intermediate representations in a recurrent network. This approach requires laborious mapping with continuous updates, but has the advantage of outputting interpretable representations, which allow for the diagnosis of failure cases. Methods where the intermediate representation is more directly related to the driving task are referred to as direct perception: [6] trains a network to estimate affordances (e.g., distance to vehicle, center-of-lane, orientation, etc.) directly linked to low-level controls [34], and [28] proposes Conditional Affordance Learning (CAL) to apply a similar approach to urban driving [11]. With these approaches, there is still a considerable amount of hand-crafting, and the link to the driving task is unprincipled, with no guarantees that the intermediate representation has the optimal separating properties [1].
Our paper falls in the category of direct image-to-control (i.e., end-to-end) methods [17, 30, 4], which are currently not viable as an overall solution to autonomous driving. These methods are nevertheless worth exploring for reactive driving in an academic setting, to understand the full extent of the limitations and potential of the approach and to push the envelope of bootstrapping [5]. [8] introduces Conditional Imitation Learning (CIL), an offline behavioral cloning method, to solve the ambiguity problem at traffic intersections by conditioning the model on high-level turning commands.
[9], the current state-of-the-art in this arena, analyzes issues within the CIL approach and proposes CILRS. Various methods collect data online using reinforcement learning [29, 36, 19] or the DAgger [26] imitation learning algorithm [37, 7] to reduce the covariate shift between training and online evaluation. To improve generalization and acquire perception knowledge, [35] adds a semantic segmentation loss while training a model for steering angle estimation. [18], which we denote as MT (multi-task), further adds depth estimation as a side task. [16] takes a different approach, which distills [2, 15, 25] the representations of pretrained networks for semantic segmentation and optical flow estimation, essentially changing the side task to mimicking features. [7] also applies distillation to driving, to take advantage of ground-truth maps available only at training time. However, these approaches suffer from various issues. [8] and [9] fail to generalize to test conditions with dense traffic. Online training and the associated data collection [29, 36, 19, 37, 7] are expensive and unsafe and can be performed only in photorealistic driving simulators such as CARLA [11]. Traditional multi-task approaches suffer from the side task and the main driving task not being completely correlated, leading to suboptimal representations for driving. The approach of [7] is expensive, both in terms of 3D ground-truth map annotations and online training.
Unlike modular approaches that infer semantic segmentation and intention values from the input image and use those to inform the control, we do not discard task-relevant information in the images that is not present in the semantic segmentation mask and inferred intentions. Unlike multi-task learning, which attempts to infer the semantic segmentation and intention values while imitating the controls, we do not learn task-irrelevant information in the semantic segmentation mask, leaving it to the teacher not to learn it in its latent representation. Our student only attempts to mimic the latent space of the teacher, which ideally has converged to a sufficient representation for driving that is invariant to other nuisance variability in the teacher's input. We note that our method does not require expensive annotations, such as (3D) maps or knowledge of other agents' locations, nor does it require online learning in closed loop, since we aim to capture only reactive driving behavior irrespective of its effect on other agents.
3 Method
The training data for our method consists of temporal tuples $(x_t, v_t, c_t, a_t, m_t, \iota_t)$ collected from a rule-based autonomous driving agent that has access to all internal states of the CARLA driving simulator. Here $x_t$ is an RGB image taken at time $t$ and $v_t$ is the self-speed measurement. The high-level commands $c_t$ are provided by a navigation system and represent the (blind) high-level plan, which avoids confusion at intersections and guides the agent to its destination. The low-level controls $a_t = (a_{\mathrm{brake}}, a_{\mathrm{throttle}}, a_{\mathrm{steer}})$ comprise brake, throttle, and steering angle, respectively. $m_t$ and $\iota_t$ denote the semantic segmentation mask and the three-category stop intention values annotating image $x_t$. The stop intention values indicate how urgent it is for the agent to brake in order to avoid hazardous traffic situations, namely collision with vehicles, collision with pedestrians, and red light violations, respectively. The stop intentions are like instructions given by a traffic-school instructor, informing the agent of the causal relationships between braking behaviors and different types of dangerous driving scenarios.
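For concreteness, one training record can be sketched as follows (a minimal sketch; the field names and shapes are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingSample:
    """One temporal tuple collected from the rule-based CARLA agent.

    Field names are illustrative; the actual data format may differ.
    """
    image: np.ndarray            # RGB frame x_t, e.g. H x W x 3, uint8
    speed: float                 # self-speed measurement v_t
    command: int                 # high-level turn signal c_t (one of four commands)
    controls: np.ndarray         # ground-truth (brake, throttle, steering angle)
    seg_mask: np.ndarray         # semantic segmentation annotation m_t (H x W class ids)
    stop_intentions: np.ndarray  # (vehicle, pedestrian, red-light) braking-urgency values
```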
In our method, referred to as LaTeS, we first use the provided segmentation masks, three-category stop intentions, and self-speed measurement to learn an expert model (a.k.a. the teacher network) $f_T: (m_t, \iota_t, v_t, c_t) \mapsto \hat a_t$. Then, we conduct latent space distillation from the teacher network to learn a driving model (a.k.a. the student network) that does not have access to the side information, $f_S: (x_t, v_t, c_t) \mapsto \hat a_t$. The overall pipeline of our method is shown in Fig. 2.

3.1 Expert Model (Teacher)
The task of our teacher is to learn a representation of the segmentation mask $m_t$, three-category stop intentions $\iota_t$, and self-speed measurement $v_t$ that is relevant for driving, with nuisances removed. Such a representation will later be used to supervise the student model, a process we call latent embedding distillation. As shown in Fig. 2, the expert model utilizes a three-branch network architecture for estimating the low-level controls $\hat a_t$. The upper branch uses a ResNet34 backbone to represent the segmentation mask input. A second fully-connected (FC) branch processes the three-category stop intention values, informing the agent of the causal relationships between braking behaviors and the presence of objects in the context of safe driving. A third FC branch ingests the self-speed measurement, which helps to avoid the long-stop inertia issue [9].
The latent feature vectors from the three branches are concatenated and fed into a driving module, composed of several FC layers and a conditional switch that chooses one out of four different output branches depending on the given turning signal $c_t$. The four output branches share the same network architecture but with separately learned weights. We construct a norm-based loss for the expert model training:

$$\mathcal{L}_{\mathrm{control}} = \lambda_{\mathrm{brake}}\,\big\|\hat a_{\mathrm{brake}} - a_{\mathrm{brake}}\big\| + \lambda_{\mathrm{throttle}}\,\big\|\hat a_{\mathrm{throttle}} - a_{\mathrm{throttle}}\big\| + \lambda_{\mathrm{steer}}\,\big\|\hat a_{\mathrm{steer}} - a_{\mathrm{steer}}\big\|, \quad (1)$$

where $\hat a_{\mathrm{brake}}, \hat a_{\mathrm{throttle}}, \hat a_{\mathrm{steer}}$ are the estimated brake, throttle, and steering angle controls, $a_{\mathrm{brake}}, a_{\mathrm{throttle}}, a_{\mathrm{steer}}$ are the corresponding ground-truth controls, and the $\lambda$'s are weights for the different loss terms.
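A minimal PyTorch sketch of the expert architecture just described; branch widths follow the expert-model table in the supplementary material, while the one-hot encoding of the segmentation input, the activation choices, and all class and variable names are our assumptions. The forward pass also returns the two branch embeddings because they are what the student will later distill (Sec. 3.2).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class ExpertModel(nn.Module):
    """Teacher: maps (seg mask, stop intentions, speed, command) to controls."""

    def __init__(self, num_seg_classes=6, num_commands=4):
        super().__init__()
        backbone = resnet34(weights=None)
        # Accept a one-hot segmentation mask instead of an RGB image (assumption).
        backbone.conv1 = nn.Conv2d(num_seg_classes, 64, 7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                       # 512-d segmentation embedding
        self.seg_branch = backbone
        self.intention_branch = nn.Sequential(            # 3 -> 128 -> 128
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())
        self.speed_branch = nn.Sequential(                # 1 -> 128 -> 128
            nn.Linear(1, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())
        self.joint = nn.Sequential(nn.Linear(512 + 128 + 128, 512), nn.ReLU())
        # One control head per high-level command (the conditional switch).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 3))               # brake, throttle, steering
            for _ in range(num_commands)])

    def forward(self, seg_mask, intentions, speed, command):
        # command: LongTensor of shape (B,) selecting one of the output branches.
        e_seg = self.seg_branch(seg_mask)                  # distilled later by the student
        e_int = self.intention_branch(intentions)          # distilled later by the student
        e_spd = self.speed_branch(speed)
        z = self.joint(torch.cat([e_seg, e_int, e_spd], dim=1))
        out = torch.stack([head(z) for head in self.heads], dim=1)   # (B, num_commands, 3)
        idx = command.view(-1, 1, 1).expand(-1, 1, 3)
        return out.gather(1, idx).squeeze(1), e_seg, e_int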
3.2 Driving Model (Student)
The driving model does not have direct access to side information; instead it observes a single RGB image $x_t$ and the self-speed measurement $v_t$. Its goal is also to estimate the low-level controls $\hat a_t$, for which it also adopts a three-branch network. The first and the second branches are now both ResNet34 backbones. When training the driving model, as illustrated in Fig. 2, besides applying the control loss (Eq. (1)), we extract latent embeddings from the segmentation mask branch and the stop intention branch of the expert model and enforce a norm-based loss that pushes the student's estimated embeddings towards the teacher's distilled ones:

$$\mathcal{L}_{\mathrm{distill}} = \frac{\gamma_{m}}{d_{m}}\,\big\|e^{S}_{m} - e^{T}_{m}\big\| + \frac{\gamma_{\iota}}{d_{\iota}}\,\big\|e^{S}_{\iota} - e^{T}_{\iota}\big\|, \quad (2)$$

in which $e^{S}_{m}$ and $e^{S}_{\iota}$ are the latent embeddings in the driving model, $e^{T}_{m}$ and $e^{T}_{\iota}$ are the distilled high-dimensional feature vectors from the expert model, $d_{m}$ and $d_{\iota}$ are the lengths of the latent vectors, and the $\gamma$'s are tunable hyperparameters. The proposed latent space distillation loss is jointly optimized with the low-level control estimation loss in order to learn a robust driving policy for the student network.

3.3 Implementation details
We implement our approach using CARLA 0.8.4 [11]. For both the expert model and the student model, we use a batch size of 120 and the ADAM optimizer with an initial learning rate of 0.0002. An adaptive learning rate schedule reduces the learning rate by a factor of 10 if the training loss has not decreased in 1000 iterations. We use a validation set for early stopping, validating every 20K iterations and stopping the training when the validation loss stops decreasing.
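A minimal sketch, assuming PyTorch, of one student training step that combines the control loss (Eq. (1)) with the latent distillation loss (Eq. (2)); the choice of L1 distances, the default loss weights, and the student/teacher call signatures (matching the sketches above) are our assumptions.

```python
import torch
import torch.nn.functional as F

def student_step(student, teacher, batch, optimizer,
                 lambdas=(1.0, 1.0, 1.0), gammas=(1.0, 1.0)):
    """One joint training step for the driving model (student).

    `student(image, speed, command)` is assumed to return (controls, seg_embedding,
    intention_embedding); the frozen teacher returns its own embeddings computed
    from the ground-truth side information. L1 distances are used for illustration.
    """
    with torch.no_grad():  # the teacher is pre-trained and frozen
        _, t_seg, t_int = teacher(batch["seg_mask"], batch["stop_intentions"],
                                  batch["speed"], batch["command"])

    controls, s_seg, s_int = student(batch["image"], batch["speed"], batch["command"])

    # Eq. (1): weighted per-control regression loss (brake, throttle, steering).
    control_loss = sum(w * F.l1_loss(controls[:, i], batch["controls"][:, i])
                       for i, w in enumerate(lambdas))

    # Eq. (2): match the student's embeddings to the teacher's distilled ones;
    # reduction='mean' already normalizes by the embedding lengths d_m and d_iota.
    distill_loss = gammas[0] * F.l1_loss(s_seg, t_seg) + gammas[1] * F.l1_loss(s_int, t_int)

    loss = control_loss + distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the schedule above, `optimizer` would be `torch.optim.Adam(student.parameters(), lr=2e-4)`, with the learning rate divided by 10 whenever the training loss has not decreased for 1000 iterations, and early stopping based on a validation pass every 20K iterations.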
4 Experiments
For the sake of fair empirical evaluation, our model is compared against a suite of competing offline alternative methods, including CIL [8], CAL [28], MT [18], and the current state-of-the-art, CILRS [9]. We use three standard online evaluation benchmarks: CARLA [11], NoCrash, and Traffic-light [9]. Since multiple driving infractions are not penalized in the existing benchmarks, we also introduce a more realistic evaluation standard, Traffic-school. To show that our policy generalizes well to real-world data, we perform experiments on two real-world datasets, Comma.ai [27] and Udacity [32]. We conduct an empirical analysis of alternative driving policy learning methods (e.g., two-stage [20], multi-task [18]) that leverage the segmentation mask and stop intention values, to demonstrate the advantages of our proposed latent embedding distillation method. Finally, in ablation studies, we show that it is critical to jointly distill the three-category stop intention and segmentation mask embeddings from the expert model.
4.1 CARLA Simulator
The CARLA, NoCrash, Traffic-light, and Traffic-school benchmarks are evaluated online using fixed routes in the CARLA simulator [11]. The CARLA simulator provides two towns, Town-A and Town-B, which differ in road layout and visual nuisances such as static obstacles, and two sets of weathers, Weather-A and Weather-B. Town-A and Weather-A are used for training, while all other town and weather combinations are used to test generalization. Specifically, Weather-A represents the set of weathers {Clear Noon, Clear Noon After Rain, Heavy Rain Noon, Clear Sunset} used in training, and Weather-B represents a set of weathers used only at test time. For the CARLA benchmark, Weather-B is {Cloudy After Rain, Soft Rain Sunset}, while for the other benchmarks, Weather-B is {After Rain Sunset, Soft Rain Sunset}.
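The town/weather split can be summarized as a small configuration (a sketch; the dictionary keys and variable names are ours, the weather names are as listed above):

```python
TRAIN_TOWN, TEST_TOWN = "Town-A", "Town-B"

WEATHER_A = ["Clear Noon", "Clear Noon After Rain", "Heavy Rain Noon", "Clear Sunset"]  # training
WEATHER_B = {
    "CARLA":                                    ["Cloudy After Rain", "Soft Rain Sunset"],
    "NoCrash / Traffic-light / Traffic-school": ["After Rain Sunset", "Soft Rain Sunset"],
}
# Only (Town-A, Weather-A) is seen in training; the remaining town/weather
# combinations are used to test generalization.
```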
4.2 Comparison with State of the Art on CARLA
We first test our method, LaTeS, on three widely adopted benchmarks: CARLA [8], NoCrash, and Traffic-light [9]. Results and further explanation of each benchmark are as follows.
CARLA The CARLA benchmark (results in Tab. 1) evaluates navigation with and without dynamic obstacles. The metric is navigation success rate, where a route is completed successfully if the agent reaches the destination within a time limit. This benchmark is relatively easy due to not penalizing crashes and therefore has been saturated. In other words, the agent can crash many times along its way to the destination and still successfully complete a route. Nevertheless, our model LaTeS significantly outperforms all competing methods, achieving an average error of 2% and a relative failure reduction of 85% over the second-best CILRS, which achieves an average error of 13%.
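Concretely, the relative failure reduction quoted above follows from the average error rates:

$$\text{relative failure reduction} = \frac{E_{\text{CILRS}} - E_{\text{LaTeS}}}{E_{\text{CILRS}}} = \frac{0.13 - 0.02}{0.13} \approx 0.85 .$$

The same computation yields the 30% figure reported for NoCrash below: $(0.40 - 0.28)/0.40 = 0.30$.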
NoCrash We report results on the NoCrash benchmark in Tab. 2. For this benchmark, the metric is navigation success rate in three different levels of traffic: empty, regular, and dense. Unlike the CARLA benchmark, a route is considered successful for NoCrash only if the agent reaches the destination within the time window without crashing. Hence, NoCrash is more challenging than the CARLA benchmark [8]. Though the second-best CILRS significantly outperforms the other baselines, our method, with average error 28%, still achieves a relative failure rate reduction of 30% over CILRS, which has average error 40%. Our method achieves especially large performance gains over CILRS in town-B, the test town, and under regular and dense traffic, showing the generalization benefits of distilling the segmentation and stop intention value embeddings.
Traffic-light The results are shown in Tab. 3. Since NoCrash does not penalize running red lights when computing navigation success rate, the traffic light benchmark is proposed to analyze an agent’s traffic light violation behavior. The average traffic light non-violation rate of our model LaTeS (87%) is more than twice as high as CILRS (42%).
4.3 Comparison on Traffic-school
To resolve the flaws of previous benchmarks, we propose the Traffic-school benchmark, which shares the same routes and weathers as NoCrash but uses a more restrictive evaluation protocol. In the previous benchmarks, multiple driving infractions are ignored when judging whether a route is successfully finished. In the Traffic-school benchmark, a route is considered a success only if the agent reaches the destination while satisfying the following requirements: a) no timeout, b) no crashes, c) no traffic light violations, d) no driving into the opposite lane, and e) no driving onto the sidewalk. As shown in Tab. 4, under this more realistic evaluation protocol, our results still significantly surpass the previous state-of-the-art CILRS in terms of navigation success rate. This improvement is primarily due to the large improvement in stopping for traffic lights shown before, demonstrating the effectiveness of distilling the latent embedding of the stop intention values.
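The route-level success criterion can be summarized as follows (a sketch; the flag names are ours, not the benchmark's API):

```python
def traffic_school_success(route_log):
    """Return True only if the route satisfies all Traffic-school requirements.

    `route_log` is assumed to expose boolean infraction flags accumulated during
    the episode; the actual benchmark implementation may differ.
    """
    return (route_log["reached_destination"]
            and not route_log["overtime"]
            and not route_log["crashed"]
            and not route_log["ran_red_light"]
            and not route_log["entered_opposite_lane"]
            and not route_log["ran_onto_sidewalk"])
```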
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||
Method | Static | Dynamic | Static | Dynamic | Static | Dynamic | Static | Dynamic | |
CIL [8] | 86 | 83 | 84 | 82 | 40 | 38 | 44 | 42 | 62 |
CAL [28] | 92 | 83 | 90 | 82 | 70 | 64 | 68 | 64 | 77 |
MT [18] | 81 | 81 | 88 | 80 | 72 | 53 | 78 | 62 | 74 |
CILRS [9] | 95 | 92 | 96 | 96 | 69 | 66 | 92 | 90 | 87 |
LaTeS | 100 | 100 | 100 | 100 | 95 | 92 | 98 | 98 | 98 |
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | |
CIL [8] | |||||||||||||
CAL [28] | |||||||||||||
MT [18] | |||||||||||||
CILRS [9] | |||||||||||||
LaTeS | 100 ± 0 | 94 ± 2 | 54 ± 3 | 100 ± 0 | 89 ± 3 | 47 ± 5 | 92 ± 1 | 74 ± 2 | 29 ± 3 | 83 ± 1 | 68 ± 7 | 29 ± 2 | 72 ± 0.9 |
The columns correspond to the town and weather combinations in the test data, from left to right: training town and weather, training town and test weather, test town and training weather, and test town and test weather. A route is completed successfully only if the agent reaches the destination within a certain time window without any crash along its way. Due to simulation randomness, all methods are evaluated 3 times to compute the standard deviation. The average error over all runs for our model LaTeS is 28%, a 30% relative failure reduction over the next-best CILRS at 40%.

Town-A | Town-B | Mean | | |
---|---|---|---|---|---|
Method | Weather-A | Weather-B | Weather-A | Weather-B | |
CILRS [9] | |||||
LaTeS | 97 ± 0 | 96 ± 1 | 81 ± 1 | 73 ± 1 | 87 ± 0.4 |
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | |
CILRS [9] | |||||||||||||
LaTeS | 90 ± 2 | 79 ± 1 | 43 ± 5 | 83 ± 3 | 73 ± 1 | 39 ± 4 | 46 ± 2 | 39 ± 3 | 12 ± 2 | 15 ± 3 | 25 ± 2 | 14 ± 0 | 47 ± 0.8 |
4.4 Real-world Data Generalization
To demonstrate that our driving policy, learned with simulated data, generalizes to real-world driving scenarios, we conduct steering angle estimation experiments on two well-known real-world datasets, Comma.ai [27] and Udacity [32]. The numerical results are reported in Tab. 5, and qualitative results are shown in Fig. 3. They suggest that the driving policy learned using simulated data generalizes well to real-world applications, as fine-tuning this policy outperforms training from scratch by a large margin. Moreover, note that agents trained solely on simulation data (LaTeS initialization) already demonstrate performance comparable to agents trained on the real-world datasets (LaTeS from scratch).
Method | Comma.ai [27] | Udacity[32] |
---|---|---|
LaTeS from scratch | 2.62 | 7.61 |
LaTeS initialization | 2.90 | 7.91 |
LaTeS fine-tuning | 2.22 | 4.99 |
4.5 Advantages of Model Distillation
Inspired by [20, 18], we compare against two alternative methods, depicted in Fig. 4, that can also leverage the provided segmentation masks and stop intention values for driving policy learning. Results are in Tab. 6.
Two-stage-(F) We apply an intuitive strategy of utilizing two separately trained modules: a) perception networks and b) driving networks. The perception networks are trained to estimate segmentation masks and stop intention values. In the second step, the driving networks use the same architecture as the expert model and take the estimated segmentation masks and stop intentions as input for low-level control estimation. For two-stage, we directly take the learned weights from the expert model as the driving network. Note that the expert model is trained with ground-truth segmentation masks and stop intention values; thus, for two-stage-F we further fine-tune the driving networks on the estimated segmentation masks and stop intentions from the perception networks in order to account for perception errors.
Multi-task We apply a similar multi-task training strategy as MT [18] but with different auxiliary tasks. On the same latent feature maps where we enforce model distillation losses in our LaTeS method to train the student model, we now train decoders to estimate segmentation masks and stop intentions as side tasks. The motivation is that by simultaneously supervising these auxiliary problems, the learned features are more invariant to environmental changes like buildings, weather, etc.
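For clarity, the two alternative uses of the side information can be summarized in code (a sketch; the function names and loss weighting are ours, and the student/expert call signatures follow the sketches in Sec. 3):

```python
# Two-stage(-F): perception networks estimate the side information from the image,
# and the expert-architecture driving network consumes the *estimates* instead of
# ground truth (for the -F variant the driving network is fine-tuned on them).
def two_stage_controls(perception, expert, image, speed, command):
    seg_hat, intentions_hat = perception(image)
    controls, _, _ = expert(seg_hat, intentions_hat, speed, command)
    return controls

# Multi-task: the student estimates controls while auxiliary decoders reconstruct the
# segmentation mask and stop intentions from the same latent embeddings that LaTeS
# would instead match to the teacher's embeddings (Eq. (2)).
def multi_task_loss(student, seg_decoder, int_decoder, batch,
                    control_loss_fn, seg_loss_fn, int_loss_fn):
    controls, e_seg, e_int = student(batch["image"], batch["speed"], batch["command"])
    return (control_loss_fn(controls, batch["controls"])
            + seg_loss_fn(seg_decoder(e_seg), batch["seg_mask"])
            + int_loss_fn(int_decoder(e_int), batch["stop_intentions"]))
```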
Our method outperforms the alternative approaches, showing the effectiveness of latent embedding distillation in encouraging the driving model to learn a representation better suited for driving. We note that all methods using the provided segmentation masks and stop intentions outperform the no-distillation baselines, showing that the side information is useful for generalization. Of the methods that use side information, two-stage(-F) performs the worst, potentially due to perception errors and the loss of relevant information that is present in the image but not in the semantic segmentation or intention values.
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | |
LaTeS-ND | 65 ± 3 | 36 ± 1 | 9 ± 2 | 42 ± 3 | 31 ± 2 | 7 ± 3 | 31.7 ± 1.0 |
Res101-ND | 70 ± 3 | 44 ± 2 | 13 ± 4 | 50 ± 2 | 33 ± 1 | 7 ± 3 | 36.2 ± 1.1 |
Two-stage | 92 ± 1 | 50 ± 3 | 12 ± 1 | 81 ± 2 | 41 ± 6 | 9 ± 3 | 47.5 ± 1.3 |
Two-stage-F | 90 ± 2 | 57 ± 4 | 13 ± 1 | 79 ± 3 | 42 ± 4 | 8 ± 2 | 48.2 ± 1.2 |
Multi-task | 91 ± 0 | 62 ± 2 | 17 ± 2 | 83 ± 1 | 65 ± 6 | 16 ± 2 | 55.7 ± 1.2 |
LaTeS | 92 ± 1 | 74 ± 2 | 29 ± 3 | 83 ± 1 | 68 ± 7 | 29 ± 2 | 62.5 ± 1.4 |
4.6 Ablation Studies
We first analyze the influence of utilizing only one type of knowledge for model distillation: the segmentation mask embedding or the stop intention embedding. Then, we conduct fine-grained ablation studies to understand the importance of each individual category of stop intention values, namely vehicle, pedestrian, and traffic light.
Segmentation masks and stop intentions We use two different types of knowledge from the expert model for latent embedding model distillation. In Tab. 7 we conduct ablation studies using each type of information separately for model distillation. Segmentation masks provide the student model with some simple concepts of object identities, and therefore help the agent to learn basic driving skills like lane following and making turns. Meanwhile, stop intentions inform the agent of different hazardous traffic situations where braking is needed, such as getting close to pedestrians and vehicles or seeing a red traffic light. Separately, both types of knowledge bring performance gains by latent embedding model distillation, but the best results are achieved only when they are utilized jointly.
Stop intention categories An autonomous driving agent might brake for various reasons, such as approaching other vehicles, pedestrians, or a red light. To analyze the impact of individual stop intentions on the learned driving model, we present Tab. 8. The results indicate that the agent achieves the best performance when all three types of intentions are used. The set of all three stop intentions provides the model with comprehensive indications of the causal link between braking and the different hazardous traffic situations.
In brief, the ablation studies indicate that it is beneficial to conduct model distillation jointly for three-category stop intention embeddings and segmentation mask embeddings, in order to learn a driving model that generalizes well across various maps, weathers and traffic situations.
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Distillation source | Empty | Regular | Dense | Empty | Regular | Dense | |
Only stop intention | 86 ± 2 | 47 ± 3 | 8 ± 4 | 73 ± 3 | 53 ± 6 | 9 ± 5 | 46.0 ± 1.7 |
Only seg. mask | 93 ± 1 | 50 ± 4 | 8 ± 4 | 85 ± 1 | 52 ± 7 | 7 ± 3 | 49.2 ± 1.6 |
Both | 92 ± 1 | 74 ± 2 | 29 ± 3 | 83 ± 1 | 68 ± 7 | 29 ± 2 | 62.5 ± 1.4 |
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Stop intention type | Empty | Regular | Dense | Empty | Regular | Dense | |
Traffic light | 60 ± 3 | 37 ± 4 | 11 ± 3 | 46 ± 2 | 25 ± 4 | 9 ± 1 | 31.3 ± 1.2 |
Vehicle | 81 ± 2 | 50 ± 4 | 11 ± 2 | 73 ± 2 | 49 ± 7 | 13 ± 3 | 46.2 ± 1.5 |
Pedestrian | 84 ± 2 | 61 ± 3 | 19 ± 1 | 71 ± 1 | 43 ± 1 | 13 ± 3 | 48.5 ± 0.8 |
All | 92 ± 1 | 74 ± 2 | 29 ± 3 | 83 ± 1 | 68 ± 7 | 29 ± 2 | 62.5 ± 1.4 |
5 Discussion
Our latent-embedding distillation method is unusual in that the teacher model has a different input than the student model, and the student does not use the output of the teacher model; instead, it learns to match the teacher's internal hidden representation. We have motivated this choice in Sect. 1 and validated its underlying assumptions in Sect. 4, where the teacher/student split significantly improves performance over both multi-task models, which attempt to learn how to semantically segment images and estimate intention values while learning how to drive, and two-stage models, which first segment the images and estimate the intention values and then use the result for driving.

An additional factor behind the improved performance of our method is that the distilled information is relatively low-dimensional and therefore easier to estimate from RGB images than the raw segmentation masks and stop intention values, leading to fewer perception errors. Furthermore, since the distilled information comes from the expert model, the latent embedding loss is more correlated with driving than the intermediate side tasks, encouraging the driving model to learn a sufficient, invariant representation of the teacher's inputs. Finally, the latent embedding loss promotes generalization, as the distilled information is a function of inputs, such as the segmentation mask, that are invariant to photometric nuisances.
In the experiments section, we demonstrate these advantages by testing multiple alternative methods of using the provided segmentation masks and stop intention values. All the benchmark comparisons as well as the ablation studies demonstrate that the proposed latent space distillation method is effective for robust driving policy learning, at least in the reactive setting. Moreover, compared with other methods that require expensive online training procedures or high-accuracy 3D maps, our distillation-based teacher-student learning method is easy to implement and only requires segmentation mask and stop intention annotations.

Our method is not a panacea, and we are not advocating it as an overall solution to autonomous driving. Modularity and failure mode management are necessary in any safety-critical system, and our model does not provide them. However, our model can be a component of a sub-system. In the Supp. Material we provide the network architecture details as well as an assessment of the computational cost of training and inference.

Thinking more broadly than driving, our method uses a secondary teacher model, trained separately with side information as inputs, to learn a sufficient invariant representation for the main task. This overall idea is not specific to driving and could potentially be applied to other challenging tasks in robotics.
References
- [1] (2018) A separation principle for control in the age of deep learning. Annual Review of Control, Robotics, and Autonomous Systems 1 (1).
- [2] (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (NeurIPS), pp. 2654–2662.
- [3] (2019) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems.
- [4] (2016) End to end learning for self-driving cars. arXiv abs/1604.07316.
- [5] (2012) Bootstrapping Vehicles: a formal approach to unsupervised sensorimotor learning based on invariance. Technical report.
- [6] (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In The IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730.
- [7] (2019) Learning by cheating. In Proceedings of the Conference on Robot Learning (CoRL).
- [8] (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9.
- [9] (2019) Exploring the limitations of behavior cloning for autonomous driving. In The IEEE International Conference on Computer Vision (ICCV).
- [10] (1991) Relative 3d-state estimation for autonomous visual guidance of road vehicles. Robotics and Autonomous Systems 7 (2), pp. 113–123.
- [11] (2017) CARLA: an open urban driving simulator. In Proceedings of the Conference on Robot Learning (CoRL), pp. 1–16.
- [12] (2017) Autonomous driving. In Computer Vision in Vehicle Technology: Land, Sea & Air, pp. 24–54.
- [13] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
- [14] (2019) Mono3D++: monocular 3d vehicle detection with two-scale 3d hypotheses and task priors. In AAAI.
- [15] (2015) Distilling the knowledge in a neural network. NeurIPS 2014 Deep Learning Workshop.
- [16] Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
- [17] (2005) Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 739–746.
- [18] (2018) Rethinking self-driving: multi-task knowledge for better generalization and accident explanation ability. arXiv abs/1809.11100.
- [19] (2018) CIRL: controllable imitative reinforcement learning for vision-based self-driving. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 584–599.
- [20] (2018) Driving policy transfer via modularity and abstraction. In Proceedings of the Conference on Robot Learning (CoRL), pp. 1–15.
- [21] (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles 1, pp. 33–55.
- [22] (1988) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 305–313.
- [23] (2018) Frustum PointNets for 3d object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 918–927.
- [24] (2018) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272.
- [25] (2015) FitNets: hints for thin deep nets. In International Conference on Learning Representations (ICLR).
- [26] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 15, pp. 627–635.
- [27] (2016) Learning a driving simulator. arXiv abs/1608.01230.
- [28] (2018) Conditional affordance learning for driving in urban environments. In Proceedings of the Conference on Robot Learning (CoRL).
- [29] (2016) Safe, multi-agent, reinforcement learning for autonomous driving. NeurIPS 2016 Learning, Inference and Control of Multi-Agent Systems Workshop.
- [30] (2010) Learning from demonstration for autonomous navigation in complex unstructured terrain. International Journal of Robotics Research 29, pp. 1565–1592.
- [31] (2009) On the set of images modulo viewpoint and contrast changes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 832–839.
- [32] (2018) Udacity. Accessed: 2019-11-09.
- [33] (1980) Against direct perception. Behavioral and Brain Sciences 3, pp. 373–381.
- [34] (2014) TORCS, the open racing car simulator. http://www.torcs.org.
- [35] (2016) End-to-end learning of driving models from large-scale video datasets. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3530–3538.
- [36] (2017) Virtual to real reinforcement learning for autonomous driving. In Proceedings of the British Machine Vision Conference (BMVC).
- [37] (2017) Query-efficient imitation learning for end-to-end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 4, pp. 2891–2897.
Supplementary Material
Appendix A Video Demo
Please see the video attached to this Supplementary Material (https://tonghehehe.com/lates), which illustrates the diversity of conditions (weather, number of agents) as well as the covariate shift between the training (Town-A) and test (Town-B) maps, with representative successful runs of various maneuvers (avoiding pedestrians, stopping at lights, turns under strong weather conditions, etc.). We also show our LaTeS model operating offline on real-world (natural imaging) data from the Comma.ai [27] and Udacity [32] datasets.
Appendix B Additional Results on Town-A
For the tables in the main paper where only Town-B results were provided, we report here the corresponding Town-A results. Tab. 9, 10, and 11 provide the Town-A results corresponding to Tab. 6, 7, and 8 in the main paper, respectively. Note that our model outperforms all alternative methods in Town-A, similar to the results in Town-B.
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | |
LaTeS-ND | 98 ± 0 | 81 ± 2 | 19 ± 3 | 96 ± 0 | 72 ± 4 | 18 ± 5 | 64.0 ± 1.2 |
Res101-ND | 99 ± 1 | 85 ± 2 | 22 ± 1 | 89 ± 1 | 72 ± 3 | 27 ± 1 | 65.7 ± 0.7 |
Two-stage | 100 ± 0 | 80 ± 4 | 29 ± 4 | 83 ± 1 | 63 ± 1 | 15 ± 4 | 61.7 ± 1.2 |
Two-stage-F | 100 ± 0 | 83 ± 1 | 29 ± 4 | 87 ± 1 | 67 ± 2 | 22 ± 5 | 64.7 ± 1.1 |
Multi-task | 100 ± 0 | 94 ± 3 | 41 ± 2 | 96 ± 0 | 87 ± 2 | 37 ± 5 | 75.8 ± 1.1 |
LaTeS | 100 ± 0 | 94 ± 2 | 54 ± 3 | 100 ± 0 | 89 ± 3 | 47 ± 5 | 80.7 ± 1.1 |
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Distillation source | Empty | Regular | Dense | Empty | Regular | Dense | |
Only stop intention | 100 ± 0 | 86 ± 2 | 33 ± 4 | 97 ± 1 | 79 ± 1 | 25 ± 2 | 70.0 ± 0.8 |
Only seg. mask | 100 ± 0 | 83 ± 3 | 31 ± 3 | 98 ± 2 | 79 ± 5 | 23 ± 6 | 69.0 ± 1.5 |
Both | 100 ± 0 | 94 ± 2 | 54 ± 3 | 100 ± 0 | 89 ± 3 | 47 ± 5 | 80.7 ± 1.1 |
Weather-A | Weather-B | Mean | |||||
---|---|---|---|---|---|---|---|
Stop intention type | Empty | Regular | Dense | Empty | Regular | Dense | |
Traffic light | 95 ± 1 | 73 ± 1 | 16 ± 2 | 92 ± 0 | 63 ± 6 | 9 ± 1 | 58.0 ± 1.1 |
Vehicle | 100 ± 0 | 89 ± 1 | 27 ± 3 | 94 ± 2 | 69 ± 1 | 25 ± 5 | 67.3 ± 1.1 |
Pedestrian | 100 ± 0 | 93 ± 1 | 43 ± 2 | 98 ± 0 | 87 ± 5 | 41 ± 1 | 77.0 ± 0.9 |
All | 100 ± 0 | 94 ± 2 | 54 ± 3 | 100 ± 0 | 89 ± 3 | 47 ± 5 | 80.7 ± 1.1 |
Appendix C Expert Model Results
We compare the results of our LaTeS driving model (a.k.a. the student network) with the expert model (a.k.a. the teacher network) on the NoCrash and Traffic-school benchmarks in Tab. 12 and Tab. 13, respectively. The expert model, which has access to the ground-truth semantic segmentation and stop intention values at test time, generally outperforms the driving model; however, the driving model occasionally outperforms the expert model because the image contains relevant information that is not in the semantic segmentation and stop intentions, such as object shapes and traffic light colors. Generally, the expert model shows better generalization ability, with higher success rates in (Town-B, Weather-B), the test town and weathers.
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | |
LaTeS | 100 ± 0 | 94 ± 2 | 54 ± 3 | 100 ± 0 | 89 ± 3 | 47 ± 5 | 92 ± 1 | 74 ± 2 | 29 ± 3 | 83 ± 1 | 68 ± 7 | 29 ± 2 | 72 ± 0.9 |
Expert | 100 ± 0 | 93 ± 2 | 63 ± 7 | 100 ± 0 | 93 ± 2 | 59 ± 4 | 97 ± 1 | 76 ± 3 | 40 ± 4 | 99 ± 2 | 81 ± 3 | 39 ± 1 | 78 ± 0.9 |
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | Empty | Regular | Dense | |
LaTeS | 90 ± 2 | 79 ± 1 | 43 ± 5 | 83 ± 3 | 73 ± 1 | 39 ± 4 | 46 ± 2 | 39 ± 3 | 12 ± 2 | 15 ± 3 | 25 ± 2 | 14 ± 0 | 47 ± 0.8 |
Expert | 76 ± 1 | 61 ± 1 | 45 ± 4 | 75 ± 2 | 61 ± 4 | 45 ± 10 | 39 ± 1 | 40 ± 2 | 23 ± 4 | 39 ± 3 | 43 ± 1 | 23 ± 1 | 48 ± 1.1 |
Appendix D Detailed Model Architectures
D.1 Our LaTeS Model
D.2 Two-Stage-(F)
The two-stage-(F) model uses ErfNet [24] for semantic segmentation, which is also used in [20], and a ResNet34 based network, Tab. 16, for stop intentions estimation. The two-stage-(F) model then feeds the estimated segmentation masks and intention values to the expert model for low-level control estimation (finetuned with estimated perception inputs for two-stage-F).
D.3 Multi-task
For the multi-task model, we use the same architecture as the driving model except that we add two additional decoders, which use the outputs of the segmentation mask embedding and stop intention embedding branches to estimate the segmentation mask and stop intentions, respectively. The architecture of the segmentation mask decoder is given in Tab. 17. It consists of several deconvolution (Deconv) blocks with skip connections between the outputs of each Deconv block and the outputs of the corresponding ResNet block of the segmentation mask encoder. Every deconvolution and convolution layer in the Deconv blocks is followed by a batch normalization layer and a ReLU activation. The architecture of the intentions decoder is similar to that of the Controls module of the driving model in Tab. 15, except that it takes in a vector of length 128, the length of the intentions embedding.

D.4 Res101-ND

For the Res101-ND model, the architecture is the same as that of the driving model except that the segmentation mask and stop intention embedding branches are replaced by a single ResNet101 branch, whose output FC vector is sized to keep the total latent embedding size the same as in LaTeS.
Module | Layer | Input Dimension | Output Dimension |
---|---|---|---|
Segmentation Mask | ResNet34 | 512 | |
Stop Intentions | FC | 3 | 128 |
128 | 128 | ||
Self-Speed | FC | 1 | 128 |
128 | 128 | ||
Joint Embedding | FC | 512 + 128 + 128 | 512 |
Controls | FC | 512 | 256 |
256 | 256 | ||
256 | 3 |
Module | Layer | Input Dimension | Output Dimension |
---|---|---|---|
Seg Mask Embedding | ResNet34 | 512 | |
Stop Intentions Embedding | ResNet34 | 128 | |
Self-Speed | FC | 1 | 128 |
128 | 128 | ||
Joint Embedding | FC | 512 + 128 + 128 | 512 |
Controls | FC | 512 | 256 |
256 | 256 | ||
256 | 3 |
Module | Layer | Input Dimension | Output Dimension |
---|---|---|---|
Perception | ResNet34 | 512 | |
Stop Intentions | FC | 512 | 256 |
256 | 256 | ||
256 | 3 |
Layer | Input Dimension | Output Dimension |
---|---|---|
FC | 512 | 1536 |
Reshape | 1536 | |
Deconv ( , 512, stride 2) |
||
Conv (, 512, stride 1) | ||
Deconv (, 256, stride 2) | ||
Conv (, 256, stride 1) | ||
Deconv (, 128, stride 2) | ||
Conv (, 128, stride 1) | ||
Deconv (, 64, stride 2) | ||
Conv (, 64, stride 1) | ||
Deconv (, 64, stride 2) | ||
Conv (, 64, stride 1) | ||
Deconv (, 64, stride 2) | ||
Conv (, 64, stride 1) | ||
Conv (, 6, stride 1) |
Town-A & Weather-A | Town-A & Weather-B | Town-B & Weather-A | Town-B & Weather-B | Mean | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | Empty | Regular | Dense | Empty | Regular | Dense* | Empty | Regular | Dense | Empty* | Regular | Dense | |
CILRS Original | |||||||||||||
CILRS Rerun | 61 ± 0.7 |
Appendix E Semantic Segmentation
For the semantic segmentation annotations, we retain classes relevant to driving and throw out nuisance classes. Specifically, we use the pedestrians, roads, vehicles, and trafficSigns classes and map the roadlines and sidewalks classes to the same class. We map all other classes to a nuisance class. Hence, we obtain a total of 6 classes for the semantic segmentation.
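A sketch of this remapping, keyed by CARLA class name; the numeric target IDs and their order are our assumptions, only the 6-class grouping follows the text above:

```python
# Retained driving-relevant classes; road lines are merged with sidewalks, and
# everything else collapses into a single nuisance class, giving 6 classes total.
CLASS_REMAP = {
    "Pedestrians":  1,
    "Roads":        2,
    "Vehicles":     3,
    "TrafficSigns": 4,
    "RoadLines":    5,   # merged with Sidewalks
    "Sidewalks":    5,
}
NUISANCE_CLASS = 0       # default for all other CARLA classes

def remap(label_name):
    """Map a CARLA semantic class name to one of the 6 training classes."""
    return CLASS_REMAP.get(label_name, NUISANCE_CLASS)
```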
Appendix F CILRS Original vs. Rerun
We rerun CILRS [9] using the author-provided code and best model. Comparisons between the original and the rerun results are shown in Tab. 18 and Tab. 19. We notice that some of the rerun numbers differ significantly (by more than 5%) from those reported in the original CILRS paper. For these, we report the values obtained from rerunning their released code and model.
Appendix G Timing
The expert model trains in 10 hours on a GTX 1080Ti while the driving model trains in 1 day on a Titan Xp. At inference time, our driving model demonstrates real-time performance of 59 FPS on a GTX 1080Ti.
Town-A | Town-B | Mean | |||
---|---|---|---|---|---|
Method | Weather-A | Weather-B | Weather-A | Weather-B | |
CILRS Original | 53 | N/A | N/A | 36 | 45 |
CILRS Rerun |