Visual perception is at the heart of autonomous systems and vehicles [horgan2015vision, heimberger2017computer]. This field has seen tremendous progress during the recent wave of Deep Neural Network (DNN) architectures and methods [krizhevsky2012imagenet, simonyan2014very, szegedy2014going, he2016deep, he2017mask]. The large majority of computer vision benchmarks are currently dominated by diverse and increasingly effective models, encouraging further use in practical applications, e.g. automatic diagnosis for healthcare, traffic surveillance, autonomous vehicles, etc.
Such methods reach top performance on individual tasks by leveraging multi-million parameter models that require powerful hardware, usually for training but also for prediction. Perception systems in autonomous vehicles must analyse and understand their surroundings at all times in order to support the multiple micro-decisions needed in traffic, e.g. steering, accelerating, braking, signaling, etc. Consequently, a plethora of specific tasks must be addressed simultaneously, e.g. object detection, semantic segmentation [siam2017deep], depth estimation [kumar2018monocular], motion estimation [siam2018modnet], localization [milz2018visual], soiling detection [uvrivcavr2019soilingnet]. Meanwhile, hardware constraints in vehicles significantly limit the capacity and the number of tasks that can be solved. Using a separate neural network for each individual task is infeasible. Thus Multi-Task Learning (MTL) is a highly appealing solution that strikes a good compromise between the two sides: reliable, high-performing methods under limited hardware.
Multi-task networks consist of a shared network backbone followed by a collection of "heads", usually one for each task. The flexibility of DNNs makes it easy for practitioners to envision diverse architectures according to the available data and annotations. The main advantage of a unified model is improved computational efficiency [sistu2019real, sistu2019neurall]. Moreover, such models reduce development effort and training time, as shared layers minimize the need to learn multiple sets of parameters in different models. Unified models learn features jointly for all tasks, which acts as a regularizer and makes them robust to over-fitting, as demonstrated in various multi-task networks [kokkinos2017ubernet, neven2017fast, teichmann2018multinet].
However, multitask networks are typically difficult to train as different tasks need to be adequately balanced such that learned network parameters are useful across all tasks. Furthermore, tasks might have different difficulties and learning paces [guo2018dynamic] and negatively impact each other once a task starts overfitting before others. Multiple MTL approaches have recently attempted to mitigate this problem through optimization of multi-task architectures [Misra_2016, rusu2016progressive, mallya2018packnet, mallya2018piggyback], learning relationships between tasks [long2017learning, standley2019tasks] or, most commonly, by weighting the task losses [Chen2018GradNormGN, kendall2017multi, liu2018endtoend] (Fig. 1). Given the versatility of MTL, in most works a new problem and task configuration is proposed and only a few baselines are considered. It remains difficult to conclude which technique is better, given a new problem and dataset. In this work we benchmark multiple task-weighting methods for a better view on the progress so far.
Meta-learning derived techniques are increasingly popular for solving the tedious and difficult task of tuning hyper-parameters for training a network. Recent methods show encouraging results in finding the network architecture for a given task [zoph2016neural, liu2019auto]. We propose an evolutionary meta-learning strategy for finding the optimal task weights.
In summary, the contributions of our work are: (1) We conduct a thorough evaluation of several popular and high-performing task-weighting approaches on a two-task setup across three automotive datasets. We observe that among state-of-the-art methods there is no clear winner across datasets, as methods are relatively close in performance (including simple baselines) and the ranking varies. (2) We propose a simple weight learning technique for the two-task setting, where the network learns the task weights by itself. (3) We propose learning the optimal task weights by combining evolutionary meta-learning with task-based selective backpropagation (deciding which tasks to turn off for a number of iterations). This method outperforms baseline methods across tasks and datasets.
II Related work
MTL is not a novel problem and has been studied before the deep learning revival [caruana93multitasklearning]. MTL has been applied to various applications outside computer vision, e.g. natural language processing [collobert2008unified], speech processing [huang2015rapid], reinforcement learning [lazaric2010bayesian]. For additional background on MTL we refer the reader to the recent review [ruder2017learning].
Multi-task networks. In general, MTL is compatible with several computer vision problems where the tasks are rather complementary and help optimization. MultiNet [teichmann2018multinet] introduces an architecture for semantic segmentation and object detection and classification. With UberNet [kokkinos2017ubernet], Kokkinos tackles 7 computer vision problems over the same backbone architecture. Cross-Stitch networks [Misra_2016] learn to combine multi-task neural activations at multiple intermediate layers. Progressive Networks [rusu2016progressive] consist of multiple neural networks which are added sequentially with new tasks and transfer knowledge from previously trained networks to the newly added one. In PackNet [mallya2018packnet], the authors train a network over a sequence of tasks and for each new task they train only the least-active neurons from the previous task. Rebuffi et al. [rebuffi2017learning] train a network over 10 datasets and tasks, and for each task require a reduced set of parameters attached to several intermediate layers. In some cases, a single computer vision problem can be transformed into an MTL problem, e.g. Mask R-CNN for instance segmentation, which is object detection + classification + semantic segmentation [he2017mask], or YOLO for object detection [redmon2016you].
Task loss weighting. Initial deep MTL networks made use of a weighted sum of individual task losses [8500504, 8100062]. Recently, more complex heuristics have started to emerge for balancing the task weights using: per-task uncertainty estimation [kendall2017multi], difficulty of the tasks in terms of precision and accuracy [guo2018dynamic], statistics from task losses over time [liu2018endtoend] or from their corresponding gradients [Chen2018GradNormGN].
Meta-learning is a learning mechanism that uses experience from other tasks. The most common use-case of meta-learning is the automatic adaptation of an algorithm to the task at hand. More specifically, meta-learning can be used for hyper-parameter optimization [li2016hyperband], for exploring network architectures [zoph2016neural, liu2019auto, jaderberg2017population] or various non-trivial combinations of variables, e.g. data augmentation [cubuk2018autoaugment]. In this line of research, we adapt an evolutionary meta-learning strategy for finding the optimal task weights along with the strategy for alternately training one of the two tasks.
In the following, we provide a formal definition of the MTL setting, which gives a common background for the multiple task weighting approaches compared and proposed in this work. Consider an input data space $\mathcal{X}$ and a collection of $T$ tasks with corresponding label spaces $\{\mathcal{Y}_t\}_{t=1}^{T}$. In MTL problems, we have access to a dataset of $N$ i.i.d. samples $\mathcal{D} = \{x_i, y_i^1, \dots, y_i^T\}_{i=1}^{N}$, where $y_i^t$ is the label of the data point $x_i$ for the task $t$. In computer vision $x_i$ usually corresponds to an image, while $y_i^t$ can correspond to a variety of data types, e.g. scalar, class label, 2D heatmap, 2D class map, list of 2D/3D coordinates, etc.
The main component in any MTL is a model $f$, which in our case is a CNN with learnable parameters $\theta$. The most commonly encountered approach for MTL in neural networks is hard parameter sharing [caruana93multitasklearning], where there is a set of hidden layers shared between all tasks, i.e., the backbone, to which multiple task-specific layers are connected. Formally, the model becomes:
$$f(x; \theta_{sh}, \theta_1, \dots, \theta_T): \mathcal{X} \rightarrow \{\mathcal{Y}_t\}_{t=1}^{T}$$
where $\theta_{sh}$ are the parameters of the shared backbone and $\theta_t$ those of the layers specific to task $t$. For clarity, we denote by $\theta_{ts} = \{\theta_1, \dots, \theta_T\}$ the set of parameters of all task-specific layers. Each task has its own specific loss function $\mathcal{L}_t$ attached to both its specific layers $\theta_t$ and the common backbone $\theta_{sh}$. The optimization objective for $f$ boils down to the joint minimization of all the task losses as follows:
$$\min_{\theta_{sh},\, \theta_{ts}} \; \sum_{t=1}^{T} w_t \, \mathcal{L}_t(\theta_{sh}, \theta_t)$$
where $w_t$ are per-task weights that can be static, computed dynamically or learned by $f$, in which case $w_t \in \theta$.
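As a minimal illustration of the weighted objective described above, the following sketch computes a weighted sum of per-task losses. The function name, the task names and the numeric values are hypothetical and serve only as an example; in practice the losses would come from the task heads of the network.

```python
from typing import Dict


def weighted_mtl_loss(task_losses: Dict[str, float],
                      task_weights: Dict[str, float]) -> float:
    """Return the weighted sum of task losses: sum over tasks of w_t * L_t."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical loss values and weights for a two-task setup.
losses = {"segmentation": 0.8, "detection": 2.5}
weights = {"segmentation": 1.0, "detection": 0.5}
total = weighted_mtl_loss(losses, weights)
```

Setting all weights to 1.0 recovers the plain sum of losses; every weighting method discussed in this work can be seen as a different rule for choosing the values in `weights`.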
Weighted losses for MTL are intuitive and easy to formulate; however, they are more difficult to deploy. The main challenge is related to computing the task weights. This is non-trivial as the optimal weight for a given task can evolve over time depending on the difficulty of the task and on the content of the training set [Chen2018GradNormGN, guo2018dynamic], e.g. diversity of samples, class imbalance, etc. Moreover, the task weights can depend on the affinity between the considered tasks [zamir2018taskonomy] and the way they help, complement [standley2019tasks] or counter each other [sener2018multitask], relationships that potentially evolve across training iterations. Recent moment-based optimization algorithms with adaptive updates, e.g. SGD [bottou2010large], and adaptive step-size, e.g. Adam [kingma2014adam], can also influence the dynamics of MTL, by attenuating the impact of a wrongly tuned weight or, on the contrary, by keeping the bias of a previously wrong direction for more iterations. In practice, this challenging problem is solved via lengthy and expensive grid search or alternatively via a diversity of heuristics with varying degrees of complexity. In this work we rather explore the latter type of approaches and propose heuristics for computing the weights towards improving performance: two simple dynamic task weighting loss approaches and a meta-learning based approach with asynchronous backpropagation.
IV Task-weighting Methods
In this section, we first review the most frequent task weighting methods encountered in the literature and in practice (IV-A) and then describe our contributed approaches for this problem (IV-B, IV-C, IV-D). In this work we consider a two-task setup, where we train a CNN for joint object detection and semantic segmentation (Fig. 2). In the following we will adapt the definitions of the task weighting methods to this setup with $T = 2$.
IV-A1 No task weighting
An often encountered approach in MTL is to not assign any weights to the task losses [TeichmannWZCU18, neven2017fast, 8100062]. The optimized loss is then just the plain sum of task losses, with all task weights set to 1.0. This can also happen when the practitioner adds an extra loss at the output of the network, not necessarily realising that the problem has become MTL. While very simple, there are a number of issues with this approach. First, the network is now extremely sensitive to imbalances in task data and to task loss ranges and scales (cross entropy, etc.). Due to these variations and desynchronization, some of the task losses advance faster than the others. These tasks will start overfitting by the time the other task losses converge, highlighting the necessity of balancing the losses during training.
IV-A2 Handcrafted task weighting
Here, the loss weights are found and set manually. We can achieve this by inspecting the value of the loss for several samples. The losses are then weighted such that they are brought to the same scale: the weight $w_t$ of each task is computed from the value of its loss at the first iterations and remains constant during training (see footnote 1). The optimized loss is
$$\mathcal{L} = w_{seg}\,\mathcal{L}_{seg} + w_{det}\,\mathcal{L}_{det}$$
where $\mathcal{L}_{seg}$ and $\mathcal{L}_{det}$ are the losses for the semantic segmentation branch and object detection respectively, and each weight $w_t$ is set from $\mathcal{L}_t(0)$, the loss for task $t$ at the first training iterations.
Footnote 1: A more technically sound way of selecting the loss weights would be to look at the gradients of the losses instead of the values of the losses. However, we include this baseline as it is frequently performed by practitioners when tuning hyper-parameters after short trials and inspections.
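The exact scaling rule is not spelled out in the text. A common choice that brings the losses to the same scale, assumed here purely for illustration, is to take the inverse of the loss observed at the first iterations. The function name and the numeric values below are hypothetical.

```python
def handcrafted_weights(initial_losses):
    """Assumed rule: w_t = 1 / L_t(0), so that every scaled loss starts near 1.

    `initial_losses` maps a task name to its loss value observed during the
    first training iterations. The weights are then kept fixed for training.
    """
    return {task: 1.0 / value for task, value in initial_losses.items()}


# Hypothetical initial losses for the two tasks.
initial = {"segmentation": 0.5, "detection": 4.0}
weights = handcrafted_weights(initial)

# After scaling, both losses start at the same magnitude.
scaled = {task: weights[task] * initial[task] for task in initial}
```

With this rule a task whose raw loss is eight times larger receives a weight eight times smaller, which is the "same scale" effect described above.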
IV-A3 Dynamic task loss scaling
For this method, we take into account the evolution of per-task losses during training. We compute the task weights dynamically, at the end of every training epoch, from $\bar{\mathcal{L}}_t$, the average loss of task $t$ over the previous epoch.
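The sketch below illustrates one possible epoch-wise update. It assumes, as a labeled guess rather than the rule used in this work, that each weight is the inverse of the average loss of the task over the previous epoch, normalized so that the weights sum to one. All names and values are hypothetical.

```python
def dynamic_weights(epoch_losses):
    """Recompute task weights at the end of an epoch.

    `epoch_losses` maps a task name to the list of its per-iteration losses
    during the epoch that just finished. Assumed rule: weight proportional to
    the inverse of the average loss, normalized to sum to one.
    """
    averages = {task: sum(vals) / len(vals) for task, vals in epoch_losses.items()}
    inverse = {task: 1.0 / avg for task, avg in averages.items()}
    norm = sum(inverse.values())
    return {task: inv / norm for task, inv in inverse.items()}


# Hypothetical per-iteration losses recorded during one epoch.
recorded = {"segmentation": [0.6, 0.4], "detection": [1.6, 1.4]}
weights = dynamic_weights(recorded)
```

Unlike the handcrafted variant, the weights here follow the losses as training progresses, so a task whose loss drops quickly is given a larger relative weight in the next epoch under this assumed rule.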
IV-A4 Uncertainty-based weighting
Kendall et al. [kendall2017multi] propose looking into aleatoric, or data, uncertainty for computing the task weights adaptively during training. They argue that each task has its own homoscedastic uncertainty $\sigma_t$, which can be learned by the network for each task during training. Since they are based on homoscedastic uncertainty, the task weights are not input-dependent and have been shown to converge to a constant value after some iterations [kendall2017multi]. The loss functions for this method are derived from the Gaussian likelihood.
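For intuition, the regression form of this loss can be written as $\sum_t \frac{1}{2\sigma_t^2}\mathcal{L}_t + \log \sigma_t$. The sketch below uses the common reparameterization $s_t = \log \sigma_t^2$, which is an implementation choice assumed here rather than something stated in the text; the function names and values are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of 0.5 * exp(-s_t) * L_t + 0.5 * s_t, with s_t = log(sigma_t^2)."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


def optimal_log_var(loss):
    """For a fixed loss value L, the minimizer over s satisfies exp(-s) * L = 1."""
    return math.log(loss)


# Hypothetical task losses; s_t = 0 corresponds to sigma_t = 1.
losses = [2.0, 0.5]
at_zero = uncertainty_weighted_loss(losses, [0.0, 0.0])
at_optimum = uncertainty_weighted_loss(losses, [optimal_log_var(l) for l in losses])
```

The term `0.5 * s` penalizes large uncertainties, so the network cannot trivially drive all task weights to zero. In a real network the values in `log_vars` would be trainable parameters updated by backpropagation together with the rest of the model.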
IV-A5 GradNorm
This method from [Chen2018GradNormGN] sees multi-task network training as a problem of unbalanced gradient magnitudes back-propagated through the shared layers (encoder). It proposes to normalize the unbalanced task gradients by optimizing a new gradient loss that controls the task loss weights. These task loss weights are updated using gradient descent on this new loss.
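As a rough sketch of the gradient loss in GradNorm, the snippet below computes the target gradient norm of each task and the resulting L1 penalty from already-measured quantities. It assumes the gradient norms of the weighted task losses with respect to the shared layers are given as plain numbers; the function name, the default value of `alpha` and the example values are hypothetical.

```python
def gradnorm_loss(grad_norms, losses, initial_losses, alpha=1.5):
    """Return (gradient loss, per-task target norms) in the spirit of GradNorm.

    grad_norms: norm of the gradient of each weighted task loss w.r.t. the
        shared layers.
    losses, initial_losses: current and initial loss of each task.
    alpha: strength of the balancing force (hyper-parameter).
    """
    # Loss ratio L_t / L_t(0): lower means the task has trained faster.
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / len(ratios)
    relative_rates = [r / mean_ratio for r in ratios]

    # Each task should have the mean gradient norm, scaled by its relative rate.
    mean_norm = sum(grad_norms) / len(grad_norms)
    targets = [mean_norm * rate ** alpha for rate in relative_rates]

    grad_loss = sum(abs(g - t) for g, t in zip(grad_norms, targets))
    return grad_loss, targets


# Hypothetical measurements: both tasks train equally fast, gradients differ.
loss_value, targets = gradnorm_loss([2.0, 1.0], [0.5, 0.5], [1.0, 1.0])
```

In the original method the targets are treated as constants and the gradient loss is differentiated only with respect to the task weights; that step needs an automatic differentiation framework and is omitted here.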
IV-A6 Geometric loss
The authors of the Geometric Loss Strategy proposed a parameter-free loss function to overcome the manual fine-tuning of task weights. A geometric mean of losses is used instead of a weighted arithmetic mean. For example, a $T$-task loss function can be expressed as
$$\mathcal{L}_{total} = \prod_{t=1}^{T} \sqrt[T]{\mathcal{L}_t}$$
The loss strategy was tested with a three-task network on the KITTI [Geiger2012CVPR] and Cityscapes [Cordts2016Cityscapes] datasets. The loss function acts as a dynamically adapted weighted arithmetic sum in log space; these weights act as regularizers and control the rate of convergence between the losses.
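The geometric mean of the task losses and its log-space form can be sketched as follows. The function names and the example values are hypothetical, and the losses are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def geometric_loss(task_losses):
    """Geometric mean of the task losses: product of L_t ** (1 / T)."""
    n = len(task_losses)
    return math.prod(loss ** (1.0 / n) for loss in task_losses)


def geometric_loss_log(task_losses):
    """Logarithm of the geometric mean: arithmetic mean of the log-losses."""
    return sum(math.log(loss) for loss in task_losses) / len(task_losses)


# Hypothetical loss values for two tasks.
value = geometric_loss([4.0, 1.0])
log_value = geometric_loss_log([4.0, 1.0])
```

In log space the gradient of each term is scaled by `1 / L_t`, so a task with a large loss is implicitly down-weighted and a task with a small loss is up-weighted, which is the dynamic weighting effect mentioned above.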
In the following we describe our proposed approaches for task weighting.
IV-B Weight learning
In [doersch2017multi], cross connections between a shared encoder and task-specific decoders are adjusted as learnable parameters. In [kendall2017multi], task weighting parameters are learned during training. Inspired by these two works, we propose a single-parameter learning strategy for a two-task network as follows:
$$\mathcal{L} = \alpha\,\mathcal{L}_{seg} + (1 - \alpha)\,\mathcal{L}_{det}, \qquad \alpha = \operatorname{sigmoid}(\beta)$$
where $\alpha$ is the weight balancing term, computed from the learnable parameter $\beta$, which is updated by backpropagation at each training iteration. Note that here the task weights are updated after each mini-batch.
This simple weight learning method enables the network to adjust by itself the pace of learning of the two tasks. The sigmoid outputting the balancing term serves as a gating mechanism [cho2014learning] to balance the two tasks while taking into consideration the interactions between the two. Bounding the weights in $[0, 1]$ implicitly regularizes learning by removing the risk of having extremely unbalanced task weights.
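A minimal sketch of such a sigmoid-gated combination is given below. It assumes that the balancing term multiplies the segmentation loss and its complement multiplies the detection loss; since the learnable parameter can take either sign, the opposite assignment is equivalent. The name `beta` and the numeric values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_loss(loss_seg, loss_det, beta):
    """Combine two task losses with a single learnable parameter `beta`.

    Returns the combined loss, the balancing term alpha = sigmoid(beta), and
    the analytic gradient of the combined loss with respect to beta.
    """
    alpha = sigmoid(beta)
    total = alpha * loss_seg + (1.0 - alpha) * loss_det
    # d(total)/d(beta) = (loss_seg - loss_det) * sigmoid'(beta)
    grad_beta = (loss_seg - loss_det) * alpha * (1.0 - alpha)
    return total, alpha, grad_beta


# Hypothetical loss values; beta = 0 gives equal weights of 0.5.
total, alpha, grad_beta = gated_loss(1.0, 3.0, 0.0)
```

In a deep learning framework `beta` would be registered as a trainable parameter and the gradient would be obtained by automatic differentiation; the analytic form is shown only to make the gating behaviour explicit.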
IV-C Task weighting using Evolutionary Meta-learning
The task weighting problem can be understood as a hyperparameter optimization problem with a number of numeric variables equal to the number of tasks. An efficient and extended version of Evolution Strategies [rechenberg1978] is used as the base optimization method. The extensions allow the simultaneous optimization of numerical and categorical variables [burger2016understanding]. Furthermore, gradient information with respect to the target metric is exploited, similar to Natural Evolution Strategies [wierstra2014]. Finally, in order to avoid evaluating parts of the search space multiple times, a Tabu search method [glover1986] is applied.
The search space is defined by one numerical variable for each task. The weight is optimized on an exponential scale, as the optimal weight ratio can be non-linear. Furthermore, the final task weight coefficients are normalized such that their sum is one, with the goal of leaving the overall magnitude of the loss unchanged, i.e. $\sum_t w_t = 1$.
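The mapping from search variables to normalized task weights could look as follows. The base of the exponential scale is not given in the text, so base 10 is assumed here for illustration only; the function name and the example exponents are hypothetical.

```python
def task_weights_from_exponents(exponents, base=10.0):
    """Map one search variable per task to task weights that sum to one.

    Each raw weight is base ** exponent (exponential scale); the raw weights
    are then divided by their sum so the overall loss magnitude is unchanged.
    """
    raw = [base ** e for e in exponents]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical exponents: the first task gets a ten times larger raw weight.
weights = task_weights_from_exponents([1.0, 0.0])
```

Searching over exponents rather than over the weights themselves lets a fixed step in the search variable correspond to a fixed multiplicative change of the weight ratio.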
In order to guide the optimization to an equilibrium between the tasks, the geometric mean between the detection mAP and the segmentation mIoU is used as target metric.
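The target metric can be computed as in the short sketch below; the function name and the example scores are hypothetical.

```python
import math


def target_metric(detection_map, segmentation_miou):
    """Geometric mean of the detection mAP and the segmentation mIoU."""
    return math.sqrt(detection_map * segmentation_miou)


# Hypothetical scores: same arithmetic mean, different balance.
balanced = target_metric(0.5, 0.5)
unbalanced = target_metric(0.9, 0.1)
```

Both example pairs have an arithmetic mean of 0.5, but the geometric mean of the unbalanced pair is lower, so a configuration that sacrifices one task for the other is penalized.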
We accelerate optimization by adopting dynamic weight transfer, which reuses the weights of the current best models during training. For each new configuration of hyper-parameters, we do not start from scratch, but instead train from the previously best model. In this way the number of epochs for each run can be effectively reduced (e.g. to 8 epochs for the WoodScape dataset) by doing continuous fine-tuning while simultaneously tuning the hyperparameters.
One drawback of the meta-learning approach is the increased computational cost, as several trainings need to be performed to find the optimal solutions. However, this algorithm can readily exploit multiple GPUs for speed-up.
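To make the overall search loop concrete, the following toy sketch runs a plain evolution strategy with a tabu set over two search variables. It is not the extended algorithm used in this work: the natural-gradient term, categorical variables and weight transfer are omitted, and `toy_score` is a hypothetical analytic stand-in for a short training run followed by evaluation of the target metric.

```python
import random


def toy_score(config):
    """Hypothetical stand-in for 'train briefly, then return the target metric'."""
    seg_exp, det_exp = config
    return -((seg_exp - 1.0) ** 2 + (det_exp + 0.5) ** 2)


def evolve(score_fn, n_init=4, n_children=4, n_parents=2,
           generations=20, sigma=0.3, seed=0):
    rng = random.Random(seed)
    tabu = set()        # rounded configurations that were already evaluated
    population = []     # list of (score, configuration)

    def evaluate(config):
        key = tuple(round(v, 2) for v in config)
        if key in tabu:             # skip configurations seen before
            return False
        tabu.add(key)
        population.append((score_fn(key), key))
        return True

    # Random initial population.
    while len(population) < n_init:
        evaluate((rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)))

    history = [max(population)[0]]
    for _ in range(generations):
        population.sort(reverse=True)
        elite = population[:n_init]
        for _ in range(n_children):
            parents = rng.sample(elite, n_parents)
            # Recombine by averaging the parents, then mutate with Gaussian noise.
            child = tuple(
                sum(p[1][i] for p in parents) / n_parents + rng.gauss(0.0, sigma)
                for i in range(2)
            )
            evaluate(child)
        history.append(max(population)[0])

    best_score, best_config = max(population)
    return best_config, best_score, history


best_config, best_score, history = evolve(toy_score)
```

Because every evaluation in one generation is independent of the others, the children of a generation can be trained in parallel, which is how such a search can make use of multiple GPUs.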
IV-D Asynchronous backpropagation with task weighting using Evolutionary Meta-learning
In order to balance the convergence speed of the tasks, one option is to control the backpropagation frequency of the tasks [TeichmannWZCU18]. In this way, a task that converges faster is updated less often than a task that takes more time to learn. An implementation trick is to set the task loss weight to 0.0 for the epochs for which we want to slow down training for the fast task.
We denote by $f_d$ the update frequency of the detection task. This frequency is optimized by the meta-learning method described in the previous section using a numeric variable in the range of 1 to 10, followed by a rounding operation to an integer.
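The zero-weight trick can be sketched as below. The exact schedule is not given in the text, so the sketch assumes that a task with frequency `f` is updated on every `f`-th step (epoch or iteration) and switched off otherwise; the function names and values are hypothetical.

```python
def effective_weight(weight, step, frequency):
    """Return `weight` on steps where the task is updated, 0.0 otherwise.

    Assumed schedule: the task is updated when step is a multiple of frequency.
    """
    return weight if step % frequency == 0 else 0.0


def frequency_from_variable(value):
    """Round the continuous search variable and clamp it to the range 1 to 10."""
    return min(10, max(1, round(value)))


# Hypothetical detection weight of 0.4, updated once every 3 steps.
frequency = frequency_from_variable(3.2)
schedule = [effective_weight(0.4, step, frequency) for step in range(6)]
```

With a frequency of 1 the task is updated at every step, which recovers ordinary synchronous training.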
We conduct experiments on three automotive datasets. The proposed meta-learning method outperforms the state-of-the-art techniques [Chen2018GradNormGN] and [kendall2017multi] on all three datasets with a 3-4% margin. The method's only drawback is its higher computational cost, as multiple (shorter) trainings are performed. However, this can be justified by the increased performance and safety of the final ADAS application.
V-A Implementation details
Network architecture. We have tested all the task weighting methods discussed in the previous section with a two-task network. We have designed a model which is suitable for low-power hardware. It consists of ResNet10 as a shared encoder, a YOLO-style bounding box decoder and an FCN8-style semantic segmentation decoder. Fig. 2 shows our network architecture. The encoder is pre-trained on ImageNet for all the experiments.
Meta-learning configuration. We optimized four parameters simultaneously, namely the segmentation task weight (ws), the detection task weight (wd), and the asynchronous frequencies for segmentation (fs) and detection (fd). The variable ranges for the two task weights are chosen so that the segmentation weight can take larger values, as segmentation tasks usually profit from a higher weight due to their longer convergence time. Table I shows the optimal values found via optimization. The values shown are normalized to the range 0-1. The following optimization parameters were determined empirically for these experiments: size of the initial population: 4, number of newly generated configurations: 4, number of parents per generated configuration: 2.
KITTI [Geiger2012CVPR]. This dataset for object detection consists of 7481 training images, which we divided into training and validation sets. The dataset has bounding box annotations for cars, pedestrians and cyclists. For the semantic segmentation task we have used the annotations of [krevso2016convolutional], which provide 445 images. Instead of 11 semantic classes we used only road and sidewalk, and merged the other classes into void. This not only helps to synchronize the classes with the other two datasets but also simplifies the analysis, as semantic data is already highly imbalanced and it is important to balance the class distributions in terms of overall pixel count.
The Cityscapes dataset [Cordts2016Cityscapes] consists of 5000 images with pixel-level annotations. We extracted bounding boxes and semantic annotations from the provided polygon annotations. As the test data is not defined for bounding box regression, we have used a 60/20/20 split of the provided 5000 images for training, validation and testing. Similar to KITTI, the proposed method has removed the skewness towards segmentation performance.
WoodScape [yogamani2019woodscape] is an automotive fisheye dataset with annotations for multiple tasks like detection, segmentation and motion estimation. The dataset consists of 6K training, 2K validation and 2K test images. Similar to other datasets the existing task weighting methods favoured the segmentation task over the detection task.
V-C Insights into the meta-learning method
In order to understand the optimization of the proposed meta-learning approach, some insights into the results on the WoodScape dataset are discussed in the following. Figure 2(a) shows the target metric over tested configurations for the optimization of task weights and the asynchronous backpropagation parameter. From initially low values a slow but steady increase is observed. The best configuration is obtained after 44 iterations.
Figure 2(b) shows the progression of the metrics of the two tasks during optimization. The segmentation performance is initially low and noisy and then steadily increases. The detection metric reaches its maximum early, then degrades slightly to allow a compromise in favor of the segmentation towards the end of the optimization. Figure 3(a) shows the progression of the task loss weights, and Figure 3(b) the progress of the asynchronous backpropagation parameter over time during optimization. Figure 5 contains qualitative examples on the WoodScape and Cityscapes validation datasets demonstrating improvements by the proposed method.
Multi-task learning provides promising performance in autonomous driving applications and is key in enabling efficient implementations at a system level. In this work, we take a closer look at this paradigm, which, albeit popular, has rarely been benchmarked across the same range of tasks and datasets. We thus evaluate nine different weighting strategies for finding the optimal method of training an efficient two-task model. We further propose two novel methods for learning the optimal weights during training: an adaptive one and one based on meta-learning. Our proposed method outperforms state-of-the-art approaches in compromise value. In future work, we intend to extend our benchmarking to additional tasks, e.g. on the wide range of tasks from the WoodScape dataset [yogamani2019woodscape].