
Multi-Task Conditional Imitation Learning for Autonomous Navigation at Crowded Intersections

by   Zeyu Zhu, et al.
Peking University

In recent years, great efforts have been devoted to deep imitation learning for autonomous driving control, where raw sensory inputs are directly mapped to control actions. However, navigating through densely populated intersections remains a challenging task due to the uncertainty introduced by other traffic participants. We focus on autonomous navigation at crowded intersections that require interaction with pedestrians. A multi-task conditional imitation learning framework is proposed to adapt both lateral and longitudinal control tasks for safe and efficient interaction. A new benchmark called IntersectNav is developed and human demonstrations are provided. Empirical results show that the proposed method can achieve a success rate gain of up to 30% over the state-of-the-art.





I Introduction

Navigating through dense intersections is one of the most challenging tasks in autonomous driving due to the uncertainty created by pedestrians and other human-driven vehicles [shu2020autonomous, shu2021driving, li2021planning, wei2021autonomous].

Nowadays autonomous agents generate driving policies at multiple levels of abstraction [zhu2021survey]. Given a planned driving route on a map and mission points, during online navigation, rule- and modular-based methods are utilized to decide appropriate behavior and to plan trajectories taking into account the kinematic and dynamic constraints of the vehicle. MPC (model predictive control) [camacho2013model, qian2015decentralized, schildbach2016collision] or PID (proportional-integral-derivative) [misir1996design] finally realizes autonomous control. While these methods are easy to implement, they generalize poorly to scenarios that cannot be accurately modeled beforehand. In such cases, the parameters are tuned to guarantee safety first. As a result, at dense intersections today’s autonomous driving systems are often criticized for conservative, inefficient and unhumanlike driving.

Recently, great efforts have been devoted to deep learning-based autonomous driving control [kuutti2020survey, zhu2021survey]. The appeal of deep learning is that sensorimotor control actions can be implicitly learned from sensory input (e.g., front-view images) in an end-to-end fashion, where deep reinforcement learning (DRL) [bouton2019safe, liang2018cirl, zhang2021end] and deep imitation learning (DIL) [bojarski2016end, codevilla2018end, codevilla2019exploring, zhao2019sam, chen2020learning] are two representative approaches. On the one hand, DRL typically learns from online trial and error (i.e., interaction with the environment), which can be dangerous in the real world. Therefore, most current DRL methods [bouton2019safe, liang2018cirl, zhang2021end] rely heavily on simulators. On the other hand, DIL learns from expert demonstrations and can be executed offline, which is important for safety-critical applications such as autonomous driving [codevilla2018end, codevilla2019exploring]. Furthermore, it has the potential to achieve human-like driving through human demonstrations, which can be easily collected using low-cost on-board sensors.

Despite recent success, deep imitation learning still suffers from covariate shift [ross2010efficient] and causal confusion [de2019causal]. Generalizing to dense traffic scenarios (e.g., intersections with many pedestrians) remains an open problem [codevilla2019exploring], where autonomous agents need to perform lateral and longitudinal control simultaneously to interact with pedestrians on crosswalks and to navigate the intersection safely and efficiently. While some DIL studies have shown results for intersection navigation [codevilla2018end, sauer2018conditional, codevilla2019exploring], control strategies for interacting with pedestrians have not been rigorously studied, and the different nature of lateral and longitudinal control has been ignored. Several DIL benchmarks [dosovitskiy2017carla, codevilla2019exploring, carlaleaderboard] were developed on the high-fidelity CARLA simulator [dosovitskiy2017carla] in urban scenes. However, none of them focuses on intersection navigation or interaction with pedestrians, which may be one reason why research on this problem has been limited.

This study investigates DIL-based autonomous driving policy learning for intersection navigation with pedestrian interaction. To address the different uncertainties in lateral and longitudinal control, a multi-task setup based on homoscedastic uncertainty is designed. By extending the popular Conditional Imitation Learning (CIL) [codevilla2018end] framework, this work proposes Multi-Task Conditional Imitation Learning (MTCIL) to adapt lateral and longitudinal control simultaneously for safe and smooth interaction with pedestrians on crosswalks, while navigating through intersections efficiently. A new benchmark called IntersectNav is developed, in which about 800 human driving trajectories on 40 routes are collected at four intersections under different weather conditions for training and validation. The other two intersections are used for testing. In addition, new evaluation protocols and metrics are defined to enrich the criteria of traditional benchmarks. The performance of the proposed method is extensively studied, and experimental results show that our model achieves a success rate gain of up to 30% compared to the state-of-the-art. The benchmark, collected dataset and video are available at

The paper is organized as follows. Section II reviews related work. Section III describes the proposed method. Section IV introduces the proposed benchmark. Section V presents the experimental results, and Section VI concludes the paper.

II Related Work

II-A Visual-based Imitation Learning for Autonomous Driving

Direct perception methods [chen2015deepdriving, sauer2018conditional] utilize neural networks to extract compact intermediate representations, which are then passed to subsequent decision and control modules. CAL [sauer2018conditional] learns to predict affordances, such as the distance to the preceding vehicle. However, designing affordances requires system expertise, and a hand-designed set may not be optimal.

End-to-end methods [pomerleau1988alvinn, bojarski2016end, codevilla2018end] learn to map raw sensor input (e.g., images) to control signals (e.g., acceleration, steering). Bojarski et al. [bojarski2016end] successfully learned a steering policy. However, their model only handles lane keeping and has difficulty addressing complex scenarios. Codevilla et al. proposed Conditional Imitation Learning (CIL) [codevilla2018end], where the output is conditioned on high-level commands. They also proposed CILRS [codevilla2019exploring], an improved version of CIL. However, these models have limitations in generalizing to dense traffic due to the covariate shift problem [ross2010efficient] inherent in imitation learning. Furthermore, offline imitation learning suffers from causal confusion [de2019causal], where the model cannot distinguish spurious correlations from true causes in the observed training demonstrations. A large body of CIL-based work has been proposed to address these issues.

Privileged supervisions such as road maps (LBC [chen2020learning]) or BEV representations (Roach [zhang2021end]) are used as input. Object-level detections such as vehicles and pedestrians can be integrated into the input, reducing the perceptual burden on DNNs compared to front-view images. Although privileged information can be easily and efficiently accessed in the simulator, retrieving it from real-world observations is not trivial. To overcome the covariate shift problem, some works [prakash2020exploring, chen2020learning] employ DAgger [ross2010efficient] to transfer offline imitation learning to online refinement. Alternatively, online/on-policy reinforcement learning is utilized for more exploration, where an offline trained IL agent serves as the initialization of the RL agent (CIRL [liang2018cirl], LSD [ohn2020learning]), or the IL agent imitates a well-trained RL agent (Roach [zhang2021end]). However, both DAgger and online RL can only perform effectively in simulation because accessing online demonstrations in the real world is not trivial. They also suffer from expensive training costs. Besides, the learned policy depends heavily on a well-designed reward function [zhu2021survey], which may not reflect realistic human driving behavior. Our work differs from the above works in several ways. First, we leverage efficient offline imitation learning and introduce an additional longitudinal branch to overcome causal confusion, so that a simulator is unnecessary. Therefore, our work is more scalable to the real world. Second, we focus on more complex interactive traffic scenarios. Third, we learn from human demonstrations rather than from the autopilot agents’ demonstrations used in previous work.

II-B Multi-task Learning in Computer Vision

Multi-task learning [ruder2017overview, zhang2021survey] aims to improve learning efficiency by learning multiple complementary tasks from shared representations. Many multi-task methods have been proposed for computer vision. For semantic tasks, different combinations of tasks can be used (e.g., classification and semantic segmentation [liao2016understand] or detection [sermanet2013overfeat]). For geometry and regression tasks, depth, surface normals and semantic segmentation are learned jointly in [eigen2015predicting].

Some works build on multi-task learning and learn policies for robotics [xu2018shared] or self-driving [xu2017end, kim2020multi, ishihara2021multi]. Kim et al. [kim2020multi] used the prediction of future actions and states as side tasks, which are learned together with the primary control task in a multi-task fashion. The works in [xu2017end, ishihara2021multi] trained the policy together with a semantic segmentation side task to obtain a meaningful and generic feature space. Our method differs from these methods in several ways. Instead of introducing side tasks that increase the training cost, we split the primary task into lateral and longitudinal tasks, which are learned together in a multi-task setting. Since the units and scales of the two tasks are different, we build upon homoscedastic uncertainty and learn to adjust their weights adaptively. Empirical results demonstrate the effectiveness of this design.

Fig. 1: Illustration of intersection scenarios. Given a planned route and high-level commands, the agent needs to complete three kinds of missions.
Fig. 2: Our proposed multi-task conditional imitation learning (MTCIL) framework, where two separate branches predict lateral and longitudinal control actions, respectively. Both branches share the same perception representation. For each task, corresponding high-level commands are given by rule-based decision module to select the target submodules. Task-dependent uncertainties are learned to adaptively adjust task weights.

III Methodology

III-A Scenario

This work studies the scenario of an autonomous driving agent navigating through a densely populated intersection, where it needs to adjust its controls and interact safely with pedestrians on crosswalks. To keep the problem focused, this study does not consider interactions with other vehicles or reactions to traffic signals. The influence of these factors will be studied in future work.

As shown in Fig. 1, the autonomous vehicle completes the missions of left turn, right turn and go straight at the intersection, guided by the route from a start point to an end point and commands issued by a higher-level module. To accomplish a mission, the agent needs to perform a sequence of driving behaviors, hereinafter referred to as commands, each of which is completed by a sequence of control actions. Specifically, lateral commands include follow lane, go straight, turn left and turn right. Longitudinal commands are decelerate, maintain and accelerate.
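The mission and command vocabularies listed above can be written down as small enumerations. This is only an illustrative sketch: the class names, the integer codes and the `one_hot` helper are assumptions made here and are not taken from the benchmark code.

```python
from enum import Enum


class Mission(Enum):
    LEFT_TURN = "left turn"
    GO_STRAIGHT = "go straight"
    RIGHT_TURN = "right turn"


class LateralCommand(Enum):
    # Integer codes are arbitrary illustrative choices.
    FOLLOW_LANE = 0
    GO_STRAIGHT = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3


class LongitudinalCommand(Enum):
    DECELERATE = 0
    MAINTAIN = 1
    ACCELERATE = 2


def one_hot(command):
    """Encode a command as a one-hot list over its own command set."""
    return [1.0 if member is command else 0.0 for member in type(command)]


print(one_hot(LateralCommand.TURN_LEFT))
print(one_hot(LongitudinalCommand.ACCELERATE))
```

Keeping the lateral and longitudinal commands in separate types mirrors the text: at every step the agent holds one command of each kind.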

III-B Conditional Imitation Learning (CIL)

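This research follows the Conditional Imitation Learning [codevilla2018end] framework to formulate the problem as follows. The human driving demonstration dataset $\mathcal{D} = \{\xi_i\}_{i=1}^{N}$ consists of $N$ trajectories. Each trajectory $\xi_i$ is composed of a sequence of observation-action pairs $\{(o_{i,t}, a_{i,t}, c_{i,t})\}_{t=1}^{T_i}$, where $o$, $a$ and $c$ denote the observation, action, and high-level command, respectively. The observations are tuples $o = (I, v)$, which include an onboard front-view RGB image $I$ and the scalar ego speed $v$. The actions $a = (a_{lat}, a_{lon})$ contain the steering angle $a_{lat}$ and the acceleration value $a_{lon}$. The goal is to learn a deep neural network policy $\pi_\theta$ parameterized by $\theta$ that imitates human driving behavior. The optimal parameters $\theta^*$ are obtained by minimizing the imitation cost $\mathcal{L}$:

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}\big(\pi_\theta(o_{i,t}, c_{i,t}),\, a_{i,t}\big)$$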
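As a rough illustration of the imitation objective, the sketch below evaluates a cost over a toy dataset of (observation, command, action) samples. The squared-error loss, the averaging over samples, the `toy_policy` and the toy data are assumptions chosen for the example, not details given in the text.

```python
def squared_error(predicted, target):
    """Sum of squared differences between two action tuples."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def imitation_cost(policy, dataset, loss=squared_error):
    """Mean per-sample loss of `policy` over every sample of every trajectory."""
    total, count = 0.0, 0
    for trajectory in dataset:
        for observation, command, action in trajectory:
            total += loss(policy(observation, command), action)
            count += 1
    return total / count


def toy_policy(observation, command):
    """Hypothetical policy: never steers, accelerates in proportion to speed."""
    _image, speed = observation
    return (0.0, 0.1 * speed)


# One toy trajectory; each sample is ((image, speed), command, (steer, accel)).
toy_dataset = [[
    ((None, 2.0), "follow lane", (0.0, 0.2)),
    ((None, 4.0), "turn left", (-0.3, 0.4)),
]]

print(imitation_cost(toy_policy, toy_dataset))
```

Training would search for the policy parameters that make this number as small as possible; averaging instead of summing only rescales the cost and does not change the minimizer.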


III-C Multi-task Learning (MTL)

Lateral and longitudinal control are two tasks of very different properties. For example, scene features have different importance in accomplishing each task, where lane markings and road structures are more important for lateral control task while obstacles ahead and ego speed have significant influence on the longitudinal control task. Lateral and longitudinal control have different tolerances for vibration in the control actions. Faced with the same scenario, the confidence levels of the lateral and longitudinal controls differ, reflecting the various uncertainties inherent in these tasks.

In multi-task learning, separate deep models are learned for each task and the different learning objectives are combined in one loss function [ruder2017overview, zhang2021survey]. A linear combination is typically applied, weighting the loss of each individual task with hand-tuned hyperparameters [kendall2018multi]. However, searching for and tuning these hyperparameters is not trivial. Since model performance is often sensitive to them, a fixed weighting may not transfer well across scenarios.

Following [kendall2018multi], this work formulates simultaneous lateral and longitudinal control learning in a multi-task learning framework, where task-dependent uncertainties are used to weight tasks. These uncertainties are also learned from data and optimized simultaneously with model parameters.

Fig. 3: Benchmark scenes and human demonstration trajectories. (Better view in color)

III-D Task-dependent Uncertainty Loss

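We start the derivation from a single regression task, such as learning only lateral or only longitudinal control. Let $f^{\mathbf{W}}(\mathbf{x})$ be a DNN policy model with parameter $\mathbf{W}$, which takes input data $\mathbf{x}$ and outputs control action $\mathbf{y}$. The likelihood is modeled as a Gaussian with the mean given by the model output, and the noise scalar $\sigma$ represents task-dependent uncertainty:

$$p(\mathbf{y} \mid f^{\mathbf{W}}(\mathbf{x})) = \mathcal{N}\big(\mathbf{y};\, f^{\mathbf{W}}(\mathbf{x}),\, \sigma^2\big)$$

Now consider a multi-task problem that yields two outputs $\mathbf{y}_{lat}$ and $\mathbf{y}_{lon}$. Assuming the independence of the two tasks, we have:

$$p(\mathbf{y}_{lat}, \mathbf{y}_{lon} \mid f^{\mathbf{W}}(\mathbf{x})) = \mathcal{N}\big(\mathbf{y}_{lat};\, f_{lat}^{\mathbf{W}}(\mathbf{x}),\, \sigma_{lat}^2\big)\, \mathcal{N}\big(\mathbf{y}_{lon};\, f_{lon}^{\mathbf{W}}(\mathbf{x}),\, \sigma_{lon}^2\big)$$

Consequently, taking the negative log-likelihood and dropping constants, we have the task-dependent uncertainty loss for the multi-task learning of lateral and longitudinal controls:

$$\mathcal{L}(\mathbf{W}, \sigma_{lat}, \sigma_{lon}) = \frac{1}{2\sigma_{lat}^2}\big\|\mathbf{y}_{lat} - f_{lat}^{\mathbf{W}}(\mathbf{x})\big\|^2 + \frac{1}{2\sigma_{lon}^2}\big\|\mathbf{y}_{lon} - f_{lon}^{\mathbf{W}}(\mathbf{x})\big\|^2 + \log \sigma_{lat}\sigma_{lon}$$

where $\mathbf{W}$ denotes the parameters of the policy network, which is composed of three sub DNN models, i.e., a feature encoder shared by the lateral and longitudinal conditional modules that produce $f_{lat}^{\mathbf{W}}$ and $f_{lon}^{\mathbf{W}}$. $\sigma_{lat}$ and $\sigma_{lon}$ denote the task-dependent uncertainties of lateral and longitudinal controls, respectively. We can interpret the first and second terms in the loss function as the objectives of each individual task, which are weighted by $\frac{1}{2\sigma_{lat}^2}$ and $\frac{1}{2\sigma_{lon}^2}$, respectively. Minimizing the loss function with respect to $\sigma_{lat}$ and $\sigma_{lon}$ learns their relative weights from data. For example, a large $\sigma_{lat}$ implies that the lateral control task is inherently more uncertain, which yields a smaller weight for that task, and vice versa. Unlike prior work, where the weights of the steering and acceleration losses are manually tuned hyperparameters, our method adaptively learns to balance them. The last term serves as a regularizer that prevents $\sigma_{lat}$ and $\sigma_{lon}$ from increasing too much.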
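The weighting behavior described above can be checked numerically with a minimal sketch of an uncertainty-weighted loss in the style of [kendall2018multi]. Two points are assumptions made for this example rather than details from the text: the uncertainties are parameterized by their logarithms (a common choice that keeps them positive), and the per-task losses are passed in as plain numbers.

```python
import math


def task_weight(log_sigma):
    """Weight 1 / (2 * sigma^2) of a task, given log(sigma)."""
    return 0.5 * math.exp(-2.0 * log_sigma)


def uncertainty_weighted_loss(loss_lat, loss_lon, log_sigma_lat, log_sigma_lon):
    """Weighted task losses plus the log(sigma_lat * sigma_lon) regularizer."""
    weighted = task_weight(log_sigma_lat) * loss_lat + task_weight(log_sigma_lon) * loss_lon
    return weighted + log_sigma_lat + log_sigma_lon


def optimal_log_sigma(task_loss):
    """For a fixed task loss L, L / (2 sigma^2) + log(sigma) is minimal at sigma^2 = L."""
    return 0.5 * math.log(task_loss)


# A noisier task (larger loss) ends up with a larger sigma and a smaller weight.
for name, loss in (("lateral", 4.0), ("longitudinal", 0.25)):
    s = optimal_log_sigma(loss)
    print(name, "sigma:", math.exp(s), "weight:", task_weight(s))
```

In actual training the two log-uncertainties would be learnable parameters updated by the same optimizer as the network weights; the closed-form optimum is used here only to make the trade-off visible.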

III-E Multi-Task Conditional Imitation Learning

The proposed Multi-Task Conditional Imitation Learning (MTCIL) architecture is shown in Fig. 2. We take the single-frame front-view image and the ego velocity value as the inputs to the image encoder and measurement encoder, respectively. For the image encoder, we evaluate the performance of CarlaNet [codevilla2018end] and ResNet34 [he2016deep] in the experiments. The measurement encoder is a multi-layer perceptron (MLP) consisting of three fully connected layers. The concatenated features from the two encoders are passed to the control modules. Each of the lateral and longitudinal control tasks is handled by a conditional module, which contains multiple MLPs, one for each lateral or longitudinal command. Given the current commands $c_{lat}$ and $c_{lon}$ determined by a rule-based model, the corresponding MLPs are switched on and are responsible for predicting the control actions $a_{lat}$ and $a_{lon}$.
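The command-conditioned branching can be sketched without any deep learning library. In the toy code below each head is a single linear map standing in for an MLP, and the feature values, weights and function names are hypothetical; only the structure (shared features, one head per command, separate lateral and longitudinal modules) follows the description above.

```python
def make_linear_head(weights, bias):
    """Stand-in for a per-command MLP: a single linear map to one scalar."""
    def head(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return head


class ConditionalModule:
    """Holds one head per command; only the head of the active command is evaluated."""

    def __init__(self, heads):
        self.heads = dict(heads)

    def __call__(self, features, command):
        return self.heads[command](features)


def mtcil_forward(image_features, speed_features, lat_module, lon_module, lat_cmd, lon_cmd):
    """Concatenate the encoder outputs, then query each branch with its own command."""
    shared = list(image_features) + list(speed_features)
    return lat_module(shared, lat_cmd), lon_module(shared, lon_cmd)


lateral = ConditionalModule({
    "follow lane": make_linear_head([0.0, 0.0, 0.0], 0.0),
    "turn left": make_linear_head([-0.1, 0.0, 0.0], -0.2),
})
longitudinal = ConditionalModule({
    "decelerate": make_linear_head([0.0, 0.0, -1.0], 0.0),
    "accelerate": make_linear_head([0.0, 0.0, 0.0], 0.5),
})

steer, accel = mtcil_forward([1.0, 2.0], [0.5], lateral, longitudinal, "turn left", "decelerate")
print(steer, accel)
```

Because the two modules are indexed independently, any lateral command can be combined with any longitudinal command at the same time step.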

| Benchmark | Failure events | Definition of success | Metrics |
|---|---|---|---|
| CoRL2017 benchmark [dosovitskiy2017carla] | 1. collision with static object/car/pedestrian; 2. opposite lane; 3. sidewalk | The agent reaches the goal regardless of what happened during the episode. | 1. success rate; 2. avg. distance travelled between infractions |
| NoCrash benchmark [codevilla2019exploring] | 1. collision with static object/car/pedestrian; 2. timeout; 3. traffic light violations | The agent reaches the goal under a time limit without colliding with any object. | 1. success rate; 2. collision rate; 3. timeout rate |
| Leaderboard [carlaleaderboard] | 1. collision with static object/car/pedestrian; 2. running a red light/stop sign; 3. timeout | not applicable | 1. driving score; 2. route completion rate; 3. infraction penalty |
| IntersectNav (ours) | 1. collision with static object/car/pedestrian; 2. lane invasion (invasion time > 5); 3. poor end pose (the agent approaches the end point, but its heading’s deviation from lane direction > 15° or vertical deviation from lane centerline > 1 m); 4. timeout (failure to arrive at the goal within 1000 steps) | The agent reaches the goal under a time limit without any failure event. | 1. success rate; 2. collision rate; 3. timeout rate; 4. lane invasion rate; 5. poor end pose rate; 6. other metrics reflecting control quality (see Tab. II) |

TABLE I: Considered events in our benchmark and comparison to other benchmarks

| Metric (Unit) | Description |
|---|---|
| Ego Jerk (#) | Average number of times the absolute value of a control action exceeds 0.9 |
| Other Jerk (#) | Average number of times pedestrians are disrupted by the ego agent (e.g., emergency stop at close range) |
| Deviation from Waypoint (m) | Mean deviation of the ego location from the centerline represented by the current nearest waypoint and the next waypoint |
| Deviation from Destination (m) | Mean deviation of the final location from the goal location |
| Heading Angle Deviation (°) | Mean deviation of the final heading from the lane direction at the episode ending |
| Total Step (#) | Average total steps for each episode |

TABLE II: Metrics that reflect the control quality

Compared with prior work that uses a single deep model to output both lateral and longitudinal control actions, separate modeling can greatly improve the performance of longitudinal control, which is crucial for dense intersections with pedestrian interactions, as shown in the experiments. Furthermore, combining both controls in a multi-task framework improves efficiency by sharing encoders, while balancing performance by weighting the two tasks according to task-dependent uncertainties, which are learned automatically from data. Note that this framework can be easily extended to more tasks such as speed prediction. Similar to [codevilla2019exploring], an optional branch predicting the current speed can be added to our framework (see Fig. 2), which encourages the perception module to extract visual cues that reflect the scene dynamics. Its effect is examined in the experiments.

IV A New Benchmark: IntersectNav

We propose a new benchmark named IntersectNav in this section. Unlike the CoRL2017 benchmark [dosovitskiy2017carla] and the NoCrash benchmark [codevilla2019exploring], we focus on intersections, which challenge the ability of driving agents to interact with pedestrians and allow it to be analyzed extensively. Specifically, we use version 0.9.7 of the CARLA [dosovitskiy2017carla] driving simulator for realistic 3D simulation. Compared to the 0.8.X versions used in previous benchmarks [dosovitskiy2017carla, codevilla2019exploring], the graphics and simulation behavior changed considerably in 0.9.7, making it more complex and realistic.

IV-A Scenarios

As shown in Fig. 3, six different US-style unsignalized intersections from two towns are selected for evaluation. Four scenes are used for training and validation, while the other two are reserved for testing. We configure the available start and goal points, which define the reference routes (40 in total). The benchmark adopts an episodic setup. In each episode, an intersection is chosen and the ego car starts from one of the available configurations, selected at random. The weather is randomly selected from {ClearNoon, CloudyNoon, WetNoon, HardRainNoon}. Four other weathers, {ClearSunset, CloudySunset, WetSunset, HardRainSunset}, are reserved for testing. Three missions are considered, i.e., turning left, going straight or turning right to navigate through the intersection (cf. Fig. 3, row 2). A random number of 20 to 30 pedestrians is generated to walk through the crosswalks around the intersection. Our setup ensures that the driving agent will inevitably encounter pedestrians while turning. Although only pedestrians are considered in the current settings, our benchmark can be easily extended to consider other vehicles, traffic lights and signs.
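The episodic setup can be illustrated with a short sampling routine. The weather names and the pedestrian range come from the text; the route table, the scene identifiers and the function name are hypothetical placeholders, and the inclusive bounds on the pedestrian count are an assumption.

```python
import random

TRAIN_WEATHERS = ("ClearNoon", "CloudyNoon", "WetNoon", "HardRainNoon")
TEST_WEATHERS = ("ClearSunset", "CloudySunset", "WetSunset", "HardRainSunset")


def sample_episode(routes_by_scene, weathers, rng):
    """Draw one episode: a scene, a (start, goal) route, a weather and a crowd size."""
    scene = rng.choice(sorted(routes_by_scene))
    start, goal = rng.choice(routes_by_scene[scene])
    return {
        "scene": scene,
        "start": start,
        "goal": goal,
        "weather": rng.choice(weathers),
        "num_pedestrians": rng.randint(20, 30),  # both bounds inclusive
    }


# Hypothetical route table: scene name -> list of (start, goal) identifiers.
toy_routes = {
    "scene_a": [("start_0", "goal_0"), ("start_1", "goal_1")],
    "scene_b": [("start_0", "goal_2")],
}

rng = random.Random(0)
print(sample_episode(toy_routes, TRAIN_WEATHERS, rng))
```

Passing `TEST_WEATHERS` and a route table for the held-out intersections would give the test configuration.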

IV-B Evaluation

During the closed-loop simulation for evaluation, the ego agent and pedestrians are initialized according to the protocols described above. At each simulation step, the current observations and commands are fed into the control model. The network’s control outputs (both lateral and longitudinal) are then clipped to the valid range and passed to the actuators in CARLA. The backend engine simulates the world dynamics and moves on to the next step. This process iterates until the episode terminates. We consider five possible events with which an episode ends: collision, lane invasion, poor end pose, timeout and success. Detailed information can be found in Tab. I, which also compares our benchmark with others. Note that our benchmark sets higher requirements on the model’s control precision through the lane invasion and poor end pose events.
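The evaluation loop can be sketched as follows. The 1000-step timeout comes from Tab. I; the clipping range of [-1, 1], the environment interface (`reset`, `step`) and the one-dimensional `ToyEnv` are assumptions made for illustration and do not correspond to the CARLA API.

```python
def clip(value, low, high):
    return max(low, min(high, value))


def run_episode(env, policy, max_steps=1000, action_range=(-1.0, 1.0)):
    """Roll out one episode and return (terminating event, number of steps)."""
    observation, commands = env.reset()
    for step in range(1, max_steps + 1):
        steer, accel = policy(observation, commands)
        steer = clip(steer, *action_range)
        accel = clip(accel, *action_range)
        observation, commands, event = env.step(steer, accel)
        if event is not None:  # collision, lane invasion, poor end pose or success
            return event, step
    return "timeout", max_steps


class ToyEnv:
    """Hypothetical 1-D stand-in for the simulator: success once the goal is reached."""

    def __init__(self, goal=5.0):
        self.goal = goal
        self.position = 0.0

    def reset(self):
        self.position = 0.0
        return self.position, ("follow lane", "accelerate")

    def step(self, steer, accel):
        self.position += max(accel, 0.0)
        event = "success" if self.position >= self.goal else None
        return self.position, ("follow lane", "accelerate"), event


def eager_policy(observation, commands):
    return 0.0, 3.0  # the acceleration is clipped to 1.0 before it is applied


def idle_policy(observation, commands):
    return 0.0, 0.0


print(run_episode(ToyEnv(), eager_policy))
print(run_episode(ToyEnv(), idle_policy, max_steps=10))
```

An agent that never moves is counted as a timeout rather than a success, which is the failure mode the timeout rate captures.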

Aside from the above metrics, which consider task completeness, we also define metrics to evaluate the model’s control quality. The details are provided in Tab. II. By introducing the statistics of ego and other jerks, we can further analyze the ego’s driving comfort along with its influence on pedestrians. The deviations measure control precision, while the total steps measure how efficient the learned model is.
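Two of the control-quality metrics can be sketched directly from their verbal descriptions in Tab. II. The exact formulas are not available here, so the definitions below are assumptions: an ego jerk is counted for each step in which either control magnitude exceeds 0.9, and the waypoint deviation is the perpendicular distance to the line through the nearest and the next waypoint.

```python
import math


def count_ego_jerks(actions, threshold=0.9):
    """Count steps in which the steering or acceleration magnitude exceeds the threshold."""
    return sum(
        1 for steer, accel in actions
        if abs(steer) > threshold or abs(accel) > threshold
    )


def distance_to_centerline(location, waypoint, next_waypoint):
    """Perpendicular distance from a 2-D location to the line through two waypoints."""
    (x, y), (x1, y1), (x2, y2) = location, waypoint, next_waypoint
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:  # degenerate segment: fall back to point distance
        return math.hypot(x - x1, y - y1)
    return abs(dx * (y - y1) - dy * (x - x1)) / length


def mean(values):
    values = list(values)
    return sum(values) / len(values)


episode_actions = [(0.95, 0.0), (0.1, -1.0), (0.5, 0.5)]
print(count_ego_jerks(episode_actions))
print(distance_to_centerline((1.0, 1.0), (0.0, 0.0), (2.0, 0.0)))
```

Averaging `count_ego_jerks` over episodes and `distance_to_centerline` over the steps of an episode would give values comparable in spirit to the Ego Jerk and Deviation from Waypoint columns.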

IV-C Human Demonstration Dataset

As shown in Fig. 4, we collect human driving demonstrations in CARLA through a driving suite that includes a dual-motor force feedback wheel and a floor pedal. The human driver is provided with real-time front-view RGB images and bird-view images. Reference routes are projected onto the bird-view map to provide the mission information. Real-time high-level driving commands from a rule-based decision module (cf. Fig. 5) are provided for reference. In each episode, the operator is asked to maintain a preferred speed of 20 km/h and drive through the intersection following the high-level commands.

Fig. 4: Data collection procedure. The human operator manipulates the driving suite to demonstrate the mission in CARLA simulator.
Fig. 5: Rule-based decision module.
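
Frames (Trajectories) by Scene: Scene 0, Scene 1, Scene 3, Scene 4, Scene 2, Scene 5

| Breakdown | Category | Frames (Trajectories) |
|---|---|---|
| By mission | Left turn | 25952 (229) |
| By mission | Go straight | 25253 (515) |
| By mission | Right turn | 24578 (244) |
| By lateral command | Follow lane | 34812 |
| By lateral command | Turn left | 18015 |
| By lateral command | Turn right | 13719 |
| By lateral command | Go straight | 9237 |
| By longitudinal command | Decelerate | 16258 |
| By longitudinal command | Maintain | 25432 |
| By longitudinal command | Accelerate | 34093 |

TABLE III: Statistics of the human demonstration dataset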

At each time step, a small uniform random noise is added to the human’s steering with probability 0.1. This technique aims to collect expert demonstrations that recover from perturbations. Once an episode is over, the operator can review the episode’s metrics listed in Tab. II. Data from successful episodes with good control metrics is stored. We record raw sensor data (e.g., RGB/depth images, the ego’s speed and poses) along with the expert’s demonstrations (e.g., steer/throttle/brake controls and the corresponding high-level commands). The observation $o$, expert action $a$ and high-level commands $c$ are bundled together as one tuple $(o, a, c)$, which serves as a training sample. Meta task information such as the town/scene/pose index and weather is also recorded.
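The steering perturbation used during data collection can be sketched as below. The probability of 0.1 is taken from the text; the noise magnitude (`noise_scale`) and the clipping of the result to [-1, 1] are assumptions, since the text only says the noise is small.

```python
import random


def perturb_steering(steer, rng, prob=0.1, noise_scale=0.1):
    """With probability `prob`, add uniform noise in [-noise_scale, noise_scale]."""
    if rng.random() < prob:
        steer += rng.uniform(-noise_scale, noise_scale)
    return max(-1.0, min(1.0, steer))  # keep the command in an assumed valid range


rng = random.Random(0)
applied = [perturb_steering(0.0, rng) for _ in range(10000)]
perturbed_fraction = sum(1 for s in applied if s != 0.0) / len(applied)
print(perturbed_fraction)
```

Over many steps roughly one step in ten is perturbed, so the driver regularly has to steer back toward the intended path, which is the recovery behavior the demonstrations are meant to capture.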

We collected over 30 hours of human driving data at six intersections, comprising more than 800 trajectories. The collected human trajectories are illustrated in Fig. 3, where the colors in rows 2, 3 and 4 represent the different missions, lateral commands and longitudinal commands, respectively. Detailed statistics on the number of samples and trajectories can be found in Tab. III. The dataset covers four training weathers, where the proportion of ClearNoon : CloudyNoon : WetNoon : HardRainNoon is about 0.45 : 0.17 : 0.18 : 0.19. The data from four intersections is split into training and validation sets at a ratio of approximately 5:1. Data from the other two intersections is used for testing.

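| Condition | Model | SR | PR | TR | LR | CR |
|---|---|---|---|---|---|---|
| Train scene, train weather | CIL | 59.5 | 8.7 | 29.4 | 0.8 | 1.6 |
| Train scene, train weather | CILRS | 63.2 | 3.2 | 4.8 | 10.4 | 18.4 |
| Train scene, train weather | Ours | 87.6 | 7.6 | 2.4 | 0.0 | 2.4 |
| Test scene, train weather | CIL | 66.3 | 0.0 | 27.5 | 2.5 | 3.8 |
| Test scene, train weather | CILRS | 57.5 | 0.0 | 17.5 | 6.2 | 18.8 |
| Test scene, train weather | Ours | 92.5 | 2.5 | 0.0 | 2.5 | 2.5 |
| Test scene, test weather | CIL | 51.2 | 0 | 40.0 | 1.3 | 7.5 |
| Test scene, test weather | CILRS | 43.2 | 4.0 | 40.0 | 3.2 | 9.6 |
| Test scene, test weather | Ours | 82.5 | 3.8 | 1.2 | 12.5 | 0.0 |

TABLE IV: Evaluation results of task completeness. Abbreviations: success rate (SR), poor end pose rate (PR), timeout rate (TR), lane invasion rate (LR), collision rate (CR).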
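
| Condition | Model | Ego Jerk | Other Jerk | Deviation from Waypoint (m) | Deviation from Destination (m) | Heading Angle Deviation (°) | Total Steps |
|---|---|---|---|---|---|---|---|
| Train scene, train weather | CIL | 0.294 | 43.96 | 1.4 | 5.248 | 10.618 | 488.556 |
| Train scene, train weather | CILRS | 0 | 20.872 | 0.429 | 3.988 | 11.122 | 226.456 |
| Train scene, train weather | Ours | 0 | 55.088 | 0.658 | 1.588 | 5.472 | 326.008 |
| Test scene, train weather | CIL | 0.125 | 68.988 | 1.286 | 5.286 | 9.83 | 505.7 |
| Test scene, train weather | CILRS | 0 | 173.725 | 0.545 | 6.767 | 12.762 | 376.062 |
| Test scene, train weather | Ours | 0 | 12.863 | 0.606 | 1.153 | 4.182 | 318.375 |
| Test scene, test weather | CIL | 0.038 | 67.7 | 0.938 | 8.713 | 17.185 | 537.888 |
| Test scene, test weather | CILRS | 0 | 106.648 | 0.494 | 10.685 | 19.992 | 539.384 |
| Test scene, test weather | Ours | 0 | 31.438 | 0.627 | 1.038 | 3.927 | 333.55 |

TABLE V: Evaluation results of control quality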
Fig. 6: Models for comparison in ablation studies.

V Experiment

V-A Training Details

All models are trained using the Adam optimizer [kingma2014adam] with an initial learning rate of 2e-4, which is divided by 10 if the validation loss stops decreasing for more than 5 epochs. Dropout is used after fully-connected layers with a probability of 0.5. Each minibatch contains 120 samples, which are randomly sampled from the shuffled training set. We follow Codevilla et al. and employ a 200 × 88 image resolution for the CarlaNet [codevilla2018end] perception backbone. For the ResNet34 backbone, we resize the image to a resolution of 224 × 224. If specified, online image augmentation is performed during training, which includes Gaussian blur and noise, dropout, and adjustment of brightness and contrast, among others. Our results demonstrate the effectiveness of data augmentation, especially for the ResNet34 backbone.
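The learning-rate rule can be illustrated with a small plateau scheduler. The initial rate of 2e-4, the factor of 10 and the 5-epoch patience come from the text; the class itself and the choice to reset the counter after each reduction are assumptions, not the authors' training code.

```python
class PlateauSchedule:
    """Divide the learning rate by 10 once the validation loss has failed to
    improve for more than `patience` consecutive epochs."""

    def __init__(self, lr=2e-4, divisor=10.0, patience=5):
        self.lr = lr
        self.divisor = divisor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                self.lr /= self.divisor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
# Two improving epochs followed by six epochs without improvement.
history = [schedule.step(loss) for loss in [1.0, 0.9] + [0.9] * 6]
print(history)
```

With this reading of "more than 5 epochs", the rate is still 2e-4 after five stale epochs and drops to 2e-5 on the sixth.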

V-B Evaluation Results

Since offline and online methods cannot be directly compared, this work focuses on offline methods and chooses CIL [codevilla2018end] and CILRS [codevilla2019exploring] as our baselines where no additional supervisions (e.g., reconstructions, BEV representations) apart from expert demonstrations are used. Multiple episodes for each route in our benchmark are simulated to calculate the average metrics. The evaluation results of task completeness are presented in Tab. IV. Our reported multi-task model uses ResNet34 backbone and uncertainty loss.

CIL and CILRS have similar success rates of around 60% on train and test scenes with train weather. When facing new weathers, the success rates of both models drop markedly. Besides, CIL suffers from a large timeout rate (about 30%). We attribute this to the inertia problem [codevilla2019exploring], where the model learns a spurious correlation between low speed and no acceleration, inducing excessive stopping and difficulty restarting. CILRS mitigates the problem by introducing the speed prediction branch. However, CILRS suffers from a higher collision rate. These failures show that the baselines have difficulty learning longitudinal control in interactive scenarios. Compared to the baselines, our method achieves a success rate gain of about 24 to 31 percentage points over the stronger baseline in each condition (Tab. IV), which demonstrates the effectiveness of multi-task learning.
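The size of the success rate gain can be read off Tab. IV with a few lines of arithmetic. The numbers below are copied from the SR column of that table; the dictionary layout and the function name are only for illustration.

```python
# Success rates (%) from Tab. IV, keyed by (scene, weather) condition.
SUCCESS_RATE = {
    ("train scene", "train weather"): {"CIL": 59.5, "CILRS": 63.2, "Ours": 87.6},
    ("test scene", "train weather"): {"CIL": 66.3, "CILRS": 57.5, "Ours": 92.5},
    ("test scene", "test weather"): {"CIL": 51.2, "CILRS": 43.2, "Ours": 82.5},
}


def gain_over_best_baseline(rates):
    """Percentage-point gap between our model and the stronger of the two baselines."""
    best_baseline = max(value for name, value in rates.items() if name != "Ours")
    return rates["Ours"] - best_baseline


for condition, rates in SUCCESS_RATE.items():
    print(condition, round(gain_over_best_baseline(rates), 1))
```

The gap to the stronger baseline is 24.4 points on train scenes with train weather, 26.2 points on test scenes with train weather, and 31.3 points on test scenes with test weather, so the largest gain appears in the hardest condition.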

Evaluation results of control quality are provided in Tab. V, which show that our method has better control quality than the baselines in most conditions. Our model produces no ego jerk and, in both test conditions, the smallest other jerk, which means more comfortable driving and less influence on pedestrians. As for deviations, our method achieves the smallest destination deviation, which is consistent with its highest success rate. It also needs far fewer total steps on average than the baselines in the test conditions, which means higher efficiency. When tested on new conditions, our model shows good generalization ability, while the baseline models exhibit a large decline in performance.

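| Group | Model | SR | PR | TR | LR | CR |
|---|---|---|---|---|---|---|
| 1 | CIL w/o Aug | 35 | 1.3 | 62.5 | 0.0 | 1.2 |
| 1 | CILRS w/o Aug | 8.7 | 0.0 | 80.0 | 2.5 | 8.8 |
| 1 | CIL | 51.2 | 0.0 | 40.0 | 1.3 | 7.5 |
| 1 | CILRS | 43.2 | 4.0 | 40.0 | 3.2 | 9.6 |
| 2 | CN+MT+hLoss w/o Aug | 50.0 | 18.8 | 1.2 | 20.0 | 10.0 |
| 2 | RN+MT+hLoss w/o Aug | 71.2 | 2.5 | 5.0 | 3.8 | 17.5 |
| 2 | CN+MT+hLoss | 72.5 | 5.0 | 2.5 | 1.2 | 18.8 |
| 2 | RN+MT+hLoss | 67.5 | 5.0 | 1.3 | 5.0 | 21.2 |
| 3 | | 81.2 | 5.0 | 1.3 | 11.2 | 1.3 |
| 3 | RN+MT+uLoss | 82.5 | 3.8 | 1.2 | 12.5 | 0.0 |

TABLE VI: Task completeness evaluation results of ablation studies on test scene and test weather. "w/o Aug" means without data augmentation.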

V-C Ablation Studies

The ablation experiments outlined in Fig. 6 are conducted to further investigate the importance of three components: the backbone image encoder (CN for CarlaNet [codevilla2018end] or RN for ResNet34), multi-task learning (MT) and the loss (hLoss for hard weight loss and uLoss for uncertainty weighted loss). The influence of data augmentation (Aug) is also evaluated. Detailed results of task completeness in test scenarios are provided in Tab. VI.

Experiments in the first group compare different backbones and demonstrate that data augmentation is of vital importance to the baseline models in our benchmark, especially for ResNet34. Without data augmentation, the baseline models perform poorly due to a high timeout rate. By modeling lateral and longitudinal control as multiple tasks, models in the second group greatly exceed the single-task baselines in success rate and timeout rate.

The last group, which uses uncertainty weighted loss instead of hard weight loss, achieves the best testing performance. Our model adaptively learns to balance between lateral and longitudinal control tasks and further reduces the relatively high collision rates in the second group.

VI Conclusion and Future Works

This work studies DIL-based autonomous control for intersection navigation with pedestrian interaction. We propose a multi-task conditional imitation learning method to adapt both lateral and longitudinal control tasks simultaneously, where task-dependent uncertainties are learned to weight the tasks. We applied the presented approach to our proposed IntersectNav benchmark and learned from human demonstrations. Experimental results show that the proposed multi-task learning and uncertainty weighting improve performance substantially. There remains room for progress: interaction with other vehicles and reaction to traffic signals are left for future work.