
Composite Learning for Robust and Effective Dense Predictions

10/13/2022
by   Menelaos Kanakis, et al.

Multi-task learning promises better model generalization on a target task by jointly optimizing it with an auxiliary task. However, the current practice requires additional labeling efforts for the auxiliary task, while not guaranteeing better model performance. In this paper, we find that jointly training a dense prediction (target) task with a self-supervised (auxiliary) task can consistently improve the performance of the target task, while eliminating the need for labeling auxiliary tasks. We refer to this joint training as Composite Learning (CompL). Experiments of CompL on monocular depth estimation, semantic segmentation, and boundary detection show consistent performance improvements in fully and partially labeled datasets. Further analysis on depth estimation reveals that joint training with self-supervision outperforms most labeled auxiliary tasks. We also find that CompL can improve model robustness when the models are evaluated in new domains. These results demonstrate the benefits of self-supervision as an auxiliary task, and establish the design of novel task-specific self-supervised methods as a new axis of investigation for future multi-task learning research.


1 Introduction

Learning robust and generalizable feature representations has enabled the use of Convolutional Neural Networks (CNNs) on a wide range of tasks, including tasks that require efficient learning due to limited annotations. A commonly used paradigm to improve the generalization of target tasks is Multi-Task Learning (MTL), the joint optimization of multiple tasks. MTL exploits domain information contained in the training signals of related tasks as an inductive bias in the learning process of the target task [8, 9]. The goal is to find joint representations that better explain the optimized tasks. MTL has demonstrated success in tasks such as instance segmentation [16] and depth estimation [12], amongst others. In reality, however, such performance improvements are not common when naively selecting the jointly optimized tasks [43]. To complicate things further, the relationship between tasks in MTL also depends on the learning setup, such as training set size and network capacity [58]. As a consequence, MTL practitioners are forced to iterate through various candidate task combinations in search of a synergetic setting. This empirical process is arduous and expensive, since annotations are required a priori for each candidate task.

(a) Multi-Task Learning
(b) Composite Learning (ours)
Figure 1: The generalization of target tasks can be improved by jointly optimizing with a related auxiliary task. (a) In traditional multi-task learning, one uses labeled auxiliary tasks that require manual annotation efforts. (b) In this paper, we show that jointly training a dense task with a self-supervised task can consistently improve the performance, while eliminating the need for additional labeling efforts.

In this paper, we find that the joint optimization of a dense prediction (target) task with a self-supervised (auxiliary) task improves the performance on the target task, outperforming traditional MTL practices. We refer to this joint training as Composite Learning (CompL), inspired by material science where two materials are merged to form a new one with enhanced properties. The benefits and intuition of CompL resemble those of traditional MTL, however, CompL exploits the label-free supervision of self-supervised methods. This facilitates faster iterations through different task combinations, and eliminates manual labeling effort for auxiliary tasks from the process.

We provide thorough evaluations of CompL on three dense prediction target tasks with different model structures, combined with three self-supervised auxiliary tasks. The target tasks include depth estimation, semantic segmentation, and boundary detection, while the self-supervised tasks include rotation prediction, MoCo, and DenseCL. We find that jointly optimizing with self-supervised auxiliary tasks consistently outperforms ImageNet-pretrained baselines. The benefits of CompL are most pronounced in low-data regimes, where the importance of inductive biases increases [5]. We also find that jointly optimizing monocular depth estimation with a self-supervised objective can outperform most labeled auxiliary tasks. CompL can additionally improve the robustness of semantic segmentation and boundary detection models when they are evaluated on new domains. Our experiments demonstrate the promise of self-supervision as an auxiliary task. We envision these findings will establish the design of novel task-specific self-supervised methods as a new axis of investigation for future multi-task learning research.

2 Related Work

Multi-Task Learning (MTL) MTL aims to enhance the performance and robustness of a predictor by jointly optimizing a shared representation between several tasks [8]. This is accomplished by exploiting the domain-specific information contained in the training signal of one task (e.g., semantic segmentation) to more informatively select hypotheses for other tasks (e.g., depth), and vice versa [54, 7]. For example, pixels of class “sky” will always have a larger depth than those of class “car” [56]. If unrelated tasks are combined, however, the overall performance degrades. This is referred to as task interference and has been well documented in the literature [49, 41]. Moreover, no measurement of task relations can tell us whether a performance gain will be achieved without training the final models. Several works have shown that while MTL can improve performance, it requires an exhaustive manual search over task interactions [58] and labeled datasets with many tasks. In this work we also jointly optimize a network on multiple tasks, but we instead evaluate the efficacy of self-supervision as an auxiliary task. This enables the use of joint training on any dataset and eliminates expensive annotation efforts that do not guarantee performance gains. To further improve the performance of a target task, [38, 30, 4, 24] designed specialised architectures for a predefined set of tasks; these architectures do not generalize to other tasks. At the other end, [47] learn a sub-class labelling problem as an auxiliary task, e.g., for the class dog, learn the breed as a subclass; however, the notion of a subclass does not generalize to dense tasks like depth estimation. Instead, we conduct a systematic investigation using a common pipeline applicable to any dense target task. This enables easy switching between different supervised target tasks or auxiliary self-supervised tasks, without requiring any architectural changes, enabling the wider reach of joint training across tasks and datasets.

Transfer learning Given a large labeled dataset, neural networks can be optimized for any task, whether image-level [44] or dense [33]. In practice, however, large datasets can be prohibitively expensive to acquire, giving rise to the transfer learning paradigm. The most prominent example of transfer learning is the fine-tuning of an ImageNet [17] pre-trained model on target tasks such as semantic segmentation [48] or monocular depth estimation [22]. However, ImageNet models do not always provide the best representations for all downstream tasks, raising interest in finding task relationships for better transfer capabilities [71]. In this work we are not interested in learning better pre-trained networks for knowledge transfer. Rather, we start from strong transfer learning baselines and improve generalization by jointly optimizing the target and auxiliary tasks.

Self-supervised learning Learning representations that can effectively transfer to downstream tasks, coupled with the cost associated with acquiring large labeled datasets, has given rise to self-supervised methods. These methods can learn representations through explicit supervision on pre-text tasks [19, 27], or through contrastive methods [13, 32]. Commonly, self-supervised methods aim to optimize a given architecture, yielding better pre-training models for fine-tuning on the target task [19, 27, 13, 32, 25, 65, 51, 45]. We instead utilize such pre-trained models as a starting point and fine-tune on both the target and self-supervised auxiliary tasks jointly, rather than just the target task, to further improve performance and robustness. More recently, supervised tasks have been used in conjunction with self-supervised techniques by exploiting the labels to guide contrastive learning. This can be seen as a form of sampling guidance and has been utilized in classification [42], semantic segmentation [63], and tracking [53]. These methods differ from our work as they require target task labels to optimize the self-supervised objective, while our self-supervised objectives are independent of the target labels and can be applied to any set of images. [18] jointly train a model for classification and rotation, but utilize the rotation performance at test time as a proxy for classification performance; the goal of our work is instead to improve the target task’s performance and robustness. Closer to our work, [26] and [72] jointly train classification and self-supervised objectives under a semi-supervised training protocol. We also perform joint training with a self-supervised task, however, we follow a more general MTL methodology and investigate whether self-supervised tasks can provide an inductive bias to dense tasks.

Robustness Robust predictors are important to ensure reliable performance under various conditions during deployment. Recent works have focused on improving different aspects of robustness, such as image corruption [36], adversarial samples [73], and domain shifts [68]. More related to our work, [37] jointly train classification and self-supervised rotation, demonstrating that the strong regularization of the rotations improves model robustness to adversarial examples and to label or input corruptions. [64] similarly used joint training but employed both image- and video-level self-supervised tasks and found them to improve the model’s robustness to domain shifts for object detection. We also evaluate the effect of joint training on robustness to unseen datasets, but focus on dense prediction tasks.

3 Composite Learning

In this section, we introduce and motivate Composite Learning (CompL). Specifically, Sec. 3.1 formalizes the problem setting, Sec. 3.2 describes the self-supervised methods investigated, and Sec. 3.3 lists the network structure choices in our study.

3.1 Joint Learning with Supervised and Self-Supervised Tasks

Multi-task learning may improve model robustness and generalizability. We aim to investigate the efficacy of joint training with self-supervision when dense prediction tasks are the targets. A representation shared between the target task and an auxiliary task may be more effective than one trained on the target task alone.

In the traditional MTL setup, the label sets of both the target task and the auxiliary task are manually annotated. In contrast, the auxiliary labels in CompL are implicitly created by the self-supervised task. Formally, CompL aims to produce two predictive functions, $f_t(\cdot\,;\theta_s,\theta_t)$ for the target task and $f_a(\cdot\,;\theta_s,\theta_a)$ for the auxiliary task, which share parameters $\theta_s$ and have disjoint task-specific parameters $\theta_t$ and $\theta_a$. During inference we are only interested in $f_t$; however, we hypothesize that we can learn a more effective parameterization of $f_t$ through the above weight-sharing scheme. In our investigation, $f_t$ and $f_a$ are trained jointly using samples from the target and auxiliary training sets.

The overall optimization objective therefore becomes

$$\mathcal{L}(\theta_s, \theta_t, \theta_a) \;=\; \mathcal{L}_t(\theta_s, \theta_t) \;+\; \lambda\, \mathcal{L}_a(\theta_s, \theta_a), \qquad (1)$$

where $\mathcal{L}_t$ and $\mathcal{L}_a$ are the losses for the supervised and self-supervised tasks, respectively, and $\lambda$ is a scaling factor controlling the magnitude and importance of the self-supervised task.

The experiments in this paper use the same dataset for both the target and auxiliary tasks. We additionally train our models using different-sized subsets of the target task’s training set. However, the above is not a necessary condition for CompL, meaning the self-supervised task could also be trained on an independent dataset.

Training method We jointly optimize the two objectives. We construct each minibatch by sampling at random and independently from the two training sets; for simplicity, we sample an identical number of images from each. The target and auxiliary input images are treated independently, which lets us apply task/method-specific augmentations to each task’s input without causing task conflicts. We apply the baseline augmentations to the target images, ensuring a fair comparison with our single-task baselines. The images used for self-supervised training are instead processed with the task-specific augmentations proposed for each method investigated, such as Gaussian blur and rotation. These augmentations can significantly degrade performance on dense tasks if applied to the target input, but they are important for self-supervision. By using distinct augmentations for the two tasks, we minimize the performance degradation brought about by training the auxiliary task.
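As a concrete illustration, here is a minimal PyTorch-style sketch of one CompL training iteration under the sampling scheme described above. The module and argument names (backbone, target_head, ssl_head, lam, etc.) are illustrative assumptions, not the released implementation.

```python
import torch

def compl_step(backbone, target_head, ssl_head,
               target_batch, ssl_batch,
               target_loss_fn, ssl_loss_fn,
               optimizer, lam=0.2):
    """One CompL iteration: two independently sampled and augmented
    minibatches, one shared backbone, two task-specific heads."""
    x_t, y_t = target_batch   # target images with baseline augmentations
    x_a, y_a = ssl_batch      # auxiliary images with SSL-specific augmentations

    # Shared representation (weights theta_s are used by both tasks).
    f_t = backbone(x_t)
    f_a = backbone(x_a)

    # Task-specific predictions and losses.
    loss_t = target_loss_fn(target_head(f_t), y_t)   # supervised dense loss
    loss_a = ssl_loss_fn(ssl_head(f_a), y_a)         # self-supervised loss

    loss = loss_t + lam * loss_a                     # Eq. (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_t.item(), loss_a.item()
```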

3.2 Self-Supervised Methods in Our Study

Rotation (Rot) [27] proposed to utilize 2-dimensional rotations of the input images to learn feature representations. Specifically, they optimize a classification model to predict the rotation angle, chosen from the four equally spaced angles {0°, 90°, 180°, 270°}. Joint optimization with self-supervised rotation has demonstrated success in semi-supervised image classification [26, 72] and enhanced robustness to input/output corruptions [37], making it a prime candidate for further investigation in a dense prediction setting.
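A minimal sketch of how rotation pseudo-labels can be generated for the auxiliary batch is shown below; it assumes square auxiliary crops and is not the authors' exact pipeline.

```python
import torch

def make_rotation_batch(images):
    """Rotate each (square) image by 0, 90, 180, or 270 degrees and return
    the rotated images together with the rotation-class labels (0-3)."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(-2, -1))   # rotate in the spatial plane
        for img, k in zip(images, labels)
    ])
    return rotated, labels
```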

Global contrastive Global contrastive methods treat every image as its own class, while artificially creating novel instances of that class through random data augmentations. In this work, we evaluate contrastive methods using Momentum Contrast (MoCo) [32], and specifically MoCo v2 [14]. These methods formulate contrastive learning as dictionary look-up, enabling the construction of a large and consistent dictionary of $K$ negative keys without the need for large batch sizes, a common challenge in dense prediction tasks [11]. MoCo is optimized using InfoNCE [52], a contrastive loss function defined as

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k^{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}, \qquad (2)$$

where $q$ is the query representation. InfoNCE is a softmax-based classifier that optimizes for distinguishing the positive representation $k^{+}$ from the negative representations $k_i$. The temperature $\tau$ is used to control the smoothness of the probability distribution, with higher values resulting in softer distributions.
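For reference, a compact PyTorch sketch of the InfoNCE loss of Eq. (2), following the standard MoCo formulation with a queue of negative keys; the tensor names and the temperature default are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.2):
    """InfoNCE: softmax classification of the positive key against the
    negatives stored in the memory bank (queue)."""
    # q, k_pos: (N, C) L2-normalized embeddings; queue: (C, K) negative keys.
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum('nc,ck->nk', q, queue)                # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key is always at index 0 of the logits.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```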

Local contrastive In dense prediction tasks, we desire fine-grained pixel-wise predictions rather than a global one. As such, we further investigate the difference between the global contrastive MoCo v2 [14] and its variant DenseCL [65], which includes an additional contrastive loss acting on local representations.
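The following simplified sketch illustrates the idea behind such a local contrastive term: local features from two views are matched by similarity and each matched pair is contrasted against a queue of negative local features. It is a rough approximation for intuition only; DenseCL's actual correspondence and queue handling differ in detail.

```python
import torch
import torch.nn.functional as F

def dense_contrastive(q_grid, k_grid, dense_queue, tau=0.2):
    """Simplified local contrastive term: each query location is matched to
    its most similar key location across views, then contrasted against a
    queue of negative local features."""
    # q_grid, k_grid: (N, C, S, S) normalized local projections; dense_queue: (C, K).
    n, c, s, _ = q_grid.shape
    q = q_grid.flatten(2).permute(0, 2, 1)   # (N, S*S, C)
    k = k_grid.flatten(2).permute(0, 2, 1)   # (N, S*S, C)
    # Correspondence across views via cosine similarity (argmax matching).
    match = torch.einsum('npc,nqc->npq', q, k).argmax(dim=2)        # (N, S*S)
    k_pos = torch.gather(k, 1, match.unsqueeze(-1).expand(-1, -1, c))
    l_pos = (q * k_pos).sum(-1).reshape(-1, 1)                      # (N*S*S, 1)
    l_neg = torch.einsum('npc,ck->npk', q, dense_queue).reshape(n * s * s, -1)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```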

Model                        Labeled Data
                             5%      10%     20%     50%     100%
Depth                        0.8871  0.8120  0.7471  0.6655  0.6223
Rot → Depth                  1.0830  1.0120  0.9114  0.8322  0.7822
MoCo → Depth                 0.8758  0.7708  0.7113  0.6311  0.5890
DenseCL → Depth              0.8736  0.7726  0.7152  0.6321  0.5982
Depth + Rot                  0.8762  0.8071  0.7298  0.6460  0.6107
Depth + MoCo                 0.8501  0.7955  0.7206  0.6434  0.6000
Depth + DenseCL              0.8479  0.7866  0.7131  0.6420  0.5990
MoCo → MoCo + Depth          0.8614  0.7732  0.7008  0.6220  0.5773
DenseCL → DenseCL + Depth    0.8468  0.7641  0.6989  0.6157  0.5690
Table 1: Monocular depth estimation performance in RMSE on NYUD-v2. ‘→’ denotes transfer learning (pre-train on the first task, fine-tune on the second), while ‘+’ denotes joint training (CompL). Initialization with DenseCL coupled with DenseCL joint training outperforms all other methods.
Figure 2: Monocular depth estimation performance in RMSE on different ResNet encoders. Use of CompL (orange) denotes the addition of the best performing self-supervised objective (DenseCL). CompL consistently outperforms the baselines in all experiments.

3.3 Network Structures

Dense prediction networks are initially pre-trained on classification and then modified according to the downstream task of interest, e.g., by introducing dilations [70]. In our investigation, we jointly optimize heterogeneous tasks, such as a dense prediction task and image rotations; our networks therefore call for special structural considerations, which this section details.

Dense prediction networks Common dense prediction networks use an encoder-decoder structure [55, 3], maintain a constant resolution past a certain network depth [69], or even utilize both high and low representation resolutions in multiple layers of the network [62]. Due to the large differences among networks, we opt to treat the entire network as a single unit and only utilize its last feature representation for the task-specific predictions. In other words, we branch out at the last layer and employ a single task-specific module per prediction. This ensures that our findings do not depend on network structures and makes it easy to generalize to new network designs.

We perform our experiments on DeepLabv3+ [11] based on ResNets [34]. These networks have demonstrated competitive performance on a large number of dense prediction tasks, such as semantic segmentation and depth estimation, and have been used extensively when jointly learning multiple tasks [49, 6]. Our investigation primarily uses the smaller ResNet-26 architecture for easy comparison with existing MTL results. As is common practice in dense prediction tasks, we initialize the ResNet encoder with ImageNet pre-trained weights, unless stated otherwise.

Task-specific heads The final representation of the dense prediction network is fed to two task-specific modules. The first module, consisting of a 1×1 convolutional layer, generates the predictions of the supervised task, with the output dimension being task dependent, e.g., the number of classes. The second prediction head is specific to the self-supervised task. Unlike the supervised prediction head, the self-supervised prediction head is utilized only during network optimization and is discarded at test time. The features for Rot and MoCo are first pooled with a global average pooling layer. The Rot features are then processed by a fully connected layer with output dimension equal to 4 (the number of potential rotations), while the MoCo features are processed by a 2-layer MLP head with output dimension equal to 128 (the feature embedding dimension). DenseCL, on the other hand, generates two outputs. The first is identical to MoCo for the global representation, while for the second, the initial dense features are pooled to a smaller grid size and then processed with two 1×1 convolutional layers to obtain the local feature representations.
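A sketch of the task-specific heads described above is given below (the channel sizes, the ReLU activations inside the MLP, and the local grid size are assumptions for illustration).

```python
import torch
import torch.nn as nn

class CompLHeads(nn.Module):
    """Task-specific heads attached to the last dense feature map.
    Only the supervised head is kept at test time."""
    def __init__(self, in_ch, n_classes, embed_dim=128, grid=7):
        super().__init__()
        # Supervised dense prediction head: a single 1x1 convolution.
        self.dense_head = nn.Conv2d(in_ch, n_classes, kernel_size=1)
        # Rotation head: global average pooling + linear layer over 4 angles.
        self.rot_head = nn.Linear(in_ch, 4)
        # MoCo head: global average pooling + 2-layer MLP projection.
        self.moco_head = nn.Sequential(
            nn.Linear(in_ch, in_ch), nn.ReLU(inplace=True),
            nn.Linear(in_ch, embed_dim))
        # DenseCL local head: pool to a small grid, then two 1x1 convolutions.
        self.local_pool = nn.AdaptiveAvgPool2d(grid)
        self.local_head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, embed_dim, 1))

    def forward(self, feats):
        pooled = feats.mean(dim=(2, 3))   # global average pooling
        return {
            'dense': self.dense_head(feats),
            'rot': self.rot_head(pooled),
            'moco': self.moco_head(pooled),
            'densecl_local': self.local_head(self.local_pool(feats)),
        }
```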

(a) Monocular depth estimation.
(b) Semantic segmentation.
Figure 3: t-SNE visualization of the DenseCL local representations. The representations are depicted using their ground-truth maps. Specifically, (a) depth values for monocular depth estimation and (b) semantic patches for semantic segmentation. The local representations adapt to the target task, i.e., (a) smooth depth variation for the regression task while (b) clusters are formed for the classification task.
Figure 4: Monocular depth estimation performance in RMSE on NYUD-v2 when trained with additional auxiliary tasks. CompL can improve depth more than training with boundary or normal predictions. Semantic segmentation can improve the depth prediction more, but it requires expensive manual annotations.

Normalization Large CNNs are often challenging to train, and thus utilize Batch Normalization (BN) to accelerate training [39]. In self-supervised training, BN often degrades performance due to intra-batch knowledge transfer among samples. Workarounds include shuffling BN [32, 14], using significantly larger batch sizes [13], or replacing BN altogether [35]. To ensure BN does not affect our study, so that findings can be attributed to the jointly trained tasks, we replace BN with Group Normalization (GN) [66]. We chose GN as it yielded the best performance when trained on ImageNet [66]; however, other normalization layers that are not affected by batch statistics could also be used, such as layer [2] and instance [60] normalization.
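A small utility of the kind one might use to perform this replacement is sketched below; the group count of 32 is an assumed default, not a value specified in the paper.

```python
import torch.nn as nn

def replace_bn_with_gn(module, num_groups=32):
    """Recursively swap BatchNorm2d layers for GroupNorm so that training
    is not affected by batch statistics (channel counts are assumed to be
    divisible by num_groups, as in standard ResNets)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name,
                    nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn_with_gn(child, num_groups)
    return module
```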

Model                        Labeled Data
                             1%     2%     5%     10%    20%    50%    100%
Semseg                       30.82  37.66  49.95  55.17  61.30  67.38  70.42
Rot → Semseg                 10.35  12.43  18.29  24.71  29.21  35.43  39.46
MoCo → Semseg                31.55  37.55  48.60  53.27  58.74  64.04  68.09
DenseCL → Semseg             34.89  39.72  50.96  55.60  61.13  65.71  69.56
Semseg + Rot                 28.75  36.81  50.46  56.21  62.17  67.96  70.52
Semseg + MoCo                32.90  40.31  52.18  56.50  62.49  68.40  71.15
Semseg + DenseCL             33.51  40.91  52.76  57.33  63.22  68.81  71.16
DenseCL → Semseg + DenseCL   36.32  41.24  52.94  56.87  62.71  65.89  69.81
Table 2: Semantic segmentation performance in mIoU on the PASCAL VOC dataset. ‘→’ denotes transfer learning, while ‘+’ denotes joint training (CompL). Joint training with DenseCL significantly outperforms the “Semseg” baselines.
Figure 5: Semantic segmentation performance in mIoU on different ResNet encoders. Use of CompL (orange) denotes the addition of the best performing self-supervised objective (DenseCL). CompL consistently outperforms the baselines in all experiments.

4 Experiments

In this section we investigate the effects of jointly training dense prediction and self-supervised tasks. To systematically assess the effect of joint learning in label-deficient cases, we use different-sized subsets of the full target-task training data. To ensure a consistent contribution from the auxiliary task, we always use the full data split for the self-supervised task. The supplementary material includes additional experiments using the same subsets for both tasks.

Implementation details We sample 8 images at random from each of the target and auxiliary training sets. We apply the baseline augmentations to target samples, namely random horizontal flipping, random image scaling in the range [0.5, 2.0] in 0.25 increments, followed by cropping or padding to a consistent size. The auxiliary loss is scaled by $\lambda$; we found 0.2 works best for MoCo and DenseCL, and 0.05 for Rot. The model is optimized using stochastic gradient descent with momentum 0.9, weight decay 0.0001, and the “poly” learning rate schedule [10].
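The “poly” schedule referenced above can be written as follows (power = 0.9 is the commonly used value and is assumed here, as the paper only cites the schedule).

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate schedule: decays base_lr towards zero over
    max_iter iterations."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Usage sketch: update every iteration before the optimizer step, e.g.
#   for g in optimizer.param_groups:
#       g['lr'] = poly_lr(base_lr, it, max_iter=20_000)
```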

4.1 Monocular Depth Estimation

We first evaluate CompL on monocular depth estimation, a widely used dense prediction task that is typically cast as a regression problem.

Experimental protocol Monocular depth estimation is explored on NYUD-v2 [57], comprised of 795 train and 654 test images from indoor scenes, and is evaluated using the root mean squared error (RMSE). All models are trained for 20k iterations, corresponding to 200 epochs of the fully labeled dataset, with an input image size of 425×560, and are optimized with a standard regression loss.

Joint optimization Table 1 presents the performance of the single-task baseline, “Depth”, and of the models trained jointly with different self-supervised tasks, “Depth + Task name”. We find that joint training with any self-supervised task consistently improves the performance of the target task, even on the fully labeled dataset. In particular, joint training with self-supervision yields the biggest performance improvements at the lower labeled percentages, where the importance of inductive bias increases [5]. These findings are also consistent when utilizing stronger ResNet encoders, as depicted in Fig. 2 for the best performing self-supervised method, DenseCL.

DenseCL contrasts both local and global representations, yielding richer representations for dense-task pre-training compared to image-level self-supervised tasks. We find this to also be the case in our joint-training setup, where richer local representations help guide the optimization of depth. To better understand the benefit of utilizing DenseCL for joint training with depth, we visualize the representations in Fig. 3(a) using a t-SNE plot [61]. Specifically, we depict the latent representations of DenseCL using their corresponding ground-truth depth measurements. The depth values transition smoothly from larger distances (in red) to smaller distances (in blue). This indicates that the DenseCL objective, which is discriminative by construction, promotes a smooth variation in the representations when combined with a regression target objective.

Traditional MTL In order to determine how CompL compares to traditional MTL, we evaluate the effect of using labeled auxiliary tasks. Specifically, we investigate the effect of the remaining three tasks of NYUD-v2, that is, boundaries, normals, and semantic segmentation, in Fig. 4. For a fair comparison with CompL, the auxiliary tasks also use the entire dataset. CompL consistently outperforms the use of labeled boundaries and normals as auxiliary tasks. This is particularly pronounced in the lower data splits, where the contribution of CompL becomes more prominent while boundaries and normals contribute less. Surface normals, being derivatives of depth maps, could be expected to boost depth prediction due to their close relationship; however, we find them to help only marginally. On the other hand, joint training with semantic segmentation consistently improves the baseline performance, which aligns with findings in previous works [12, 29, 40]. These results exemplify the importance of an arduous iteration process in search of a synergistic auxiliary task, where knowledge of label interactions is not necessarily helpful. This process is further complicated when additional auxiliary task annotations are needed. Therefore, eliminating manual labeling from auxiliary tasks opens up a new axis of investigation for the future of multi-task learning research, as it enables faster iterations in task interaction research.

Transfer learning The experiments have so far shown that joint training with self-supervision can enhance performance, and in most cases outperforms traditional MTL practices. Notably, CompL outperforms the baselines even when all models are initialized with ImageNet pre-trained weights, a strong transfer learning baseline. However, is ImageNet pre-training the best initialization for depth, and how does it compare to self-supervised pre-training? In Table 1 we repeat the baseline experiments starting from self-supervised pre-training (“Initial task → Depth”). In depth estimation, the contrastive methods gain the advantage and outperform the joint training methods. However, our proposed method is not limited by the initialization used: we find that initialization with MoCo or DenseCL weights coupled with joint training (“Initial task → Initial task + Depth”) increases performance even further, giving the best performing models.

4.2 Semantic Segmentation

We additionally evaluate semantic segmentation, which is representative of dense prediction tasks with discrete labels.

Experimental protocol Semantic segmentation (Semseg) experiments are conducted on PASCAL VOC 2012 [21], specifically the augmented version (aug.) from [31], which provides 10,582 train and 1,449 test images. We evaluate performance in terms of mean Intersection-over-Union (mIoU) across the classes. All models are trained for 80k iterations, accounting for 60 epochs of the fully labeled dataset, with an input image size of 512×512, and are optimized with the cross-entropy loss.

Joint optimization Table 2 presents the performance of the single-task baseline and of the models trained jointly with different self-supervised tasks. In contrast to findings from the classification literature [26, 72], joint training with Rot minimally affects performance in most cases, with the lower labeled percentages even incurring a performance degradation. The contrastive methods, on the other hand, increase performance on all labeled splits, with the lower labeled percentages showing the biggest improvement. These findings are once again consistent when utilizing stronger ResNet encoders, as depicted in Fig. 5 for the best performing self-supervised method, DenseCL. Similar to depth, we further visualize in Fig. 3(b) the latent representations contrasted by DenseCL, depicted with their ground-truth semantic maps. Unlike in depth regression, where the representations were smooth due to the continuous nature of the problem, the DenseCL representations for semantic segmentation form clusters, given the discriminative nature of the task.

Figure 6: Semantic segmentation performance in mIoU trained on PASCAL VOC and evaluated on BDD100K. The local contrastive loss of DenseCL provides significant robustness improvements.

Robustness to zero-shot dataset transfer So far we have only evaluated on the same distribution as that used for training; however, distribution shifts are common during deployment. We therefore investigate the generalization capabilities of our models to new and unseen datasets. We evaluate the zero-shot capabilities of the models on the challenging BDD100K [68] dataset, a diverse driving dataset, in Fig. 6. The test frames from BDD100K are significantly different from those observed during training, making zero-shot transfer particularly interesting due to the large domain shift. We report the mIoU with respect to the shared classes between the two datasets. Please refer to the supplementary material for the table of the BDD100K experiments.

We find that Rot often performs worse than the baseline model. This differs from the findings in classification [37], where increased robustness was observed and attributed to the strong regularization induced by the joint training; for Semseg, such regularization degrades the fine-grained precision required. Joint training with DenseCL significantly outperforms all other self-supervised methods. While MoCo was comparable to DenseCL on VOC (Table 2), we find that the local contrastive term plays a big role in improving robustness. Interestingly, when using 100% of the data points, the performance of all methods utilizing self-supervision is lower than when using 50% of the labels. We conjecture that using the fully labeled split decreases the influence of self-supervision, making the model more prone to overfitting to the training dataset and losing generalizability.

Model                     Labeled Data
                          10%     20%     50%     100%
Boundaries                71.10   73.50   75.90   76.80
Rot → Boundaries          60.20   62.80   66.00   67.70
MoCo → Boundaries         71.00   73.40   75.60   76.40
DenseCL → Boundaries      68.90   71.70   75.40   75.90
Boundaries + Semseg       70.60   73.30   75.60   76.90
Boundaries + Rot          69.70   73.00   75.70   76.60
Boundaries + MoCo         71.30   73.80   76.20   76.90
Boundaries + DenseCL      71.30   73.90   76.00   76.20
Table 3: Boundary detection performance in ODS F-score on the BSDS500 dataset. ‘→’ denotes transfer learning, while ‘+’ denotes joint training. Performance improvements are marginal, in contrast to the findings for the other target tasks.

Transfer learning Table 2 additionally reports the baseline experiments starting from self-supervised pre-training (indicated by “Initial task → Semseg”), or additionally optimized jointly with the best performing DenseCL method, as in the depth experiments. Joint training with self-supervision consistently outperforms the sequential training counterpart, in the majority of cases by a significant margin. In other words, CompL consistently yields performance gains when initializing with either ImageNet or DenseCL weights.

4.3 Boundary Detection

Boundary detection is another common dense prediction task. Unlike depth prediction and semantic segmentation, the target boundary pixels account for only a small percentage of the overall pixels. We find that CompL significantly improves model robustness for boundary detection.

Experimental protocol We study boundary detection on the BSDS500 [1] dataset, consisting of 300 train and 200 test images. Since the ground-truth labels of BSDS500 are provided by multiple annotators, we follow the approach of [67] and only count a pixel as positive if it was annotated as positive by at least three annotators. Performance is evaluated using the Optimal-Dataset-Scale F-measure (ODS F-score) [50]. All models are trained for 10k iterations on input images of size 481×481. Following [67], we use a cross-entropy loss with a weight of 0.95 for the positive and 0.05 for the negative pixels.
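A minimal sketch of such a class-weighted boundary loss is shown below; it assumes binary boundary targets and is only an illustration of the weighting described above.

```python
import torch
import torch.nn.functional as F

def weighted_boundary_loss(logits, targets, w_pos=0.95, w_neg=0.05):
    """Class-weighted binary cross-entropy for boundary detection:
    positive (boundary) pixels weighted 0.95, negatives 0.05."""
    # logits, targets: (N, 1, H, W); targets are 0/1 boundary maps.
    targets = targets.float()
    weights = targets * w_pos + (1.0 - targets) * w_neg
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)
```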

Model                      Depth Labeled Data (RMSE)               Semseg Labeled Data (mIoU)
                           5%     10%    20%    50%    100%        5%     10%    20%    50%    100%
Depth + Semseg             0.997  0.904  0.794  0.665  0.606       10.46  14.99  19.41  26.24  31.66
Depth + Semseg + DenseCL   0.902  0.806  0.744  0.641  0.590       10.72  15.29  20.08  28.18  33.48
Table 4: Performance of a multi-task model for monocular depth estimation in RMSE and semantic segmentation in mIoU on NYUD-v2. ‘+’ denotes joint training. The multi-task model combined with CompL yields consistent improvements on both tasks.

Joint optimization Table 3 presents the performance of the single-task baseline and of the models trained jointly with different self-supervised tasks. Compared to the previous two tasks, boundary detection is only marginally improved by CompL. Since convolutional networks are biased towards recognising texture rather than shape [23], we hypothesize that the supervisory signal of contrastive learning interferes with the learning of the edge/shape filters essential for boundary detection. To investigate this hypothesis further, we jointly train boundary detection with a labeled high-level semantic task. Specifically, we jointly train boundary detection with the ground-truth foreground-background segmentation maps for BSDS500 [1] from [20]. As seen in Table 3, the incorporation of semantic information once again does not enhance the single-task performance of boundaries, and even slightly degrades it at the lower percentage splits.

While CompL yielded performance improvements for monocular depth estimation and semantic segmentation as target tasks, boundary detection does not observe the same benefits. This further demonstrates the complexity of identifying a universal auxiliary task for all target tasks, and highlights the importance of co-designing self-supervised tasks alongside the downstream task.

Robustness to zero-shot dataset transfer We evaluate the zero-shot dataset transfer capabilities of the BSDS500 [1] models from Table 3 on NYUD-v2 [57]. Interestingly, even though CompL did not significantly improve the performance in Table 3, the robustness experiments in Fig. 7 paint a different picture. While MoCo often outperformed DenseCL in Table 3, and most methods perform comparably to the baseline, the additional local contrast of DenseCL significantly improves robustness, with DenseCL consistently outperforming the baseline as well as all other methods.

Figure 7: Boundary detection performance in ODS F-score trained on BSDS and evaluated on NYUD. The additional local contrast of DenseCL increases robustness to zero-shot dataset transfer.

Transfer learning Table 3 also reports the performance of the boundary detection transfer learning experiments. All three transfer learning approaches fare worse than ImageNet initialization, corroborating our hypothesis that boundary detection requires representations which are fairly unrelated to the features learned through self-supervision.

4.4 Multi-Task Model (Semseg and Depth)

Both semantic segmentation (Semseg) and monocular depth estimation (Depth) observed improvements when trained under CompL. In this section, we further investigate the applicability of CompL on MTL models optimized jointly for Depth and Semseg (Depth + Semseg).

Experimental protocol We explore joint training on NYUD-v2 [57], which provides ground-truth labels for both tasks. We maintain the exact same hyperparameters as the models in Sec. 4.1; however, we expect an explicit search could yield additional improvements. No additional task-specific scaling of the losses is used, following [49]. For self-supervised tasks, we only evaluate DenseCL [65], as it performed best for both tasks independently.

Joint optimization Table 4 presents the performance of the baseline multi-task model (Depth + Semseg) and of the model trained jointly with DenseCL (Depth + Semseg + DenseCL). As in the single-task settings, training under CompL enhances the performance of both Semseg and Depth; specifically, we again observe a performance gain at every labeled percentage. This demonstrates that, even in the traditional multi-task setting, the additional use of CompL can yield further performance gains. In the current setting, Depth observes a noticeably larger gain than Semseg in low-data regimes. This can be attributed to the DenseCL hyperparameters being optimized directly for the improvement of Depth. More advanced loss balancing schemes [15] could yield a redistribution of the performance gains; however, such an investigation is beyond the scope of our work.

5 Conclusion

In this paper we introduced CompL, a method that exploits the inductive bias provided by a self-supervised task to enhance the performance of a target task. CompL exploits the label-free supervision of self-supervised methods, facilitating faster iterations through different task combinations. We show consistent performance improvements on fully and partially labeled datasets for both semantic segmentation and monocular depth estimation. While our method eliminates the need for labeling the auxiliary task, it commonly outperforms traditional MTL with labeled auxiliary tasks on monocular depth estimation. Additionally, the semantic segmentation models trained under CompL yield better robustness under zero-shot cross-dataset transfer. We envision our contribution will spark interest in the explicit design of self-supervised tasks for use in joint training, opening up a new axis of investigation for future multi-task learning research.

Supplementary Material

Appendix A Monocular Depth Estimation

Joint optimization on identical dataset subsets Table S.5 presents the performance of the monocular depth estimation single-task baseline and the best performing self-supervised task, DenseCL. While in the main paper (Sec. 4.1, Table 1) the self-supervised objective had access to the entire dataset, in Table S.5 both objectives use the same subset for optimization. Consistent improvements across all dataset splits are still observed.

CompL    Dataset Size
         5%      10%     20%     50%     100%
w/o      0.8871  0.8120  0.7471  0.6655  0.6223
w/       0.8840  0.8080  0.7305  0.6508  0.5990
Table S.5: Monocular depth estimation performance in RMSE on NYUD-v2. Both the supervised and self-supervised objectives use identical splits. CompL denotes the addition of the best performing self-supervised objective, DenseCL, and yields consistent improvements.

Appendix B Semantic Segmentation

Joint optimization on identical dataset subsets Table S.6 presents the performance of the semantic segmentation single-task baseline and the best performing self-supervised task DenseCL. Similar to Table S.5, both objectives use the same subset for optimization. Consistent improvements across all dataset splits are again observed.

Robustness to zero-shot dataset transfer In Sec. 4.2, we additionally investigated the generalization capabilities of CompL to a new and unseen dataset. Table S.7 presents the performance of the BDD100K experiments from Fig. 6.

We additionally evaluate how the models trained on PASCAL VOC from Table 2 (“Semseg” and “Semseg + Task name”) perform, without re-training, on the same classes of COCO [46]. As seen in Table S.8 and Fig. S.8, joint training with the contrastive methods consistently outperforms the baseline across all percentage splits, with the lower labeled percentages observing the biggest improvement.

Appendix C Multi-Task Model (Semseg and Depth)

CompL    Dataset Size
         1%     2%     5%     10%    20%    50%    100%
w/o      30.82  37.66  49.95  55.17  61.30  67.38  70.42
w/       31.59  38.85  50.87  56.45  61.92  68.06  71.15
Table S.6: Semantic segmentation performance in mIoU on PASCAL VOC. Both the supervised and self-supervised objectives use identical splits. CompL denotes the addition of the best performing self-supervised objective, DenseCL, and yields consistent improvements.
Figure S.8: Performance of semantic segmentation in mIoU trained on PASCAL VOC and evaluated on COCO. The local contrastive loss of DenseCL provides consistent robustness improvements.
Model Labeled Data
1% 2% 5% 10% 20% 50% 100%
Semseg 8.18 8.95 10.16 11.18 13.45 17.95 19.51
Semseg + Rot 9.41 8.42 10.71 12.25 13.00 18.00 17.45
Semseg + MoCo 8.56 9.28 11.8 12.28 14.56 20.79 20.45
Semseg + DenseCL 10.36 10.90 15.30 17.71 20.62 23.20 22.03
Table S.7: Performance of semantic segmentation in mIoU trained on PASCAL VOC and evaluated on BDD100K. The local contrastive loss of DenseCL provides significant robustness improvements.
Model Labeled Data
1% 2% 5% 10% 20% 50% 100%
Semseg 23.78 28.62 36.53 39.05 43.85 47.37 50.76
Semseg + Rot 22.05 26.92 36.29 39.64 43.93 48.01 50.70
Semseg + MoCo 25.42 30.06 37.44 40.88 44.64 48.99 51.41
Semseg + DenseCL 25.96 30.67 38.66 41.40 45.21 49.00 50.93
Table S.8: Performance of semantic segmentation in mIoU trained on PASCAL VOC and evaluated on COCO. The local contrastive loss of DenseCL provides consistent robustness improvements.
(a) Depth
(b) Semseg
Figure S.9: Performance of (a) monocular depth estimation (Depth) and (b) semantic segmentation (Semseg) on NYUD-v2 for their multi-task model. The multi-task model combined with CompL yields consistent improvements in both tasks.

Joint optimization In Table 4 of the main paper, we presented the performance of the baseline multi-task model (Depth + Semseg) and of the model trained jointly with DenseCL (Depth + Semseg + DenseCL). For ease of comparison between the different models, Fig. S.9 additionally visualizes the results. Training under CompL enhances the performance of both Semseg and Depth, with Depth observing a noticeable gain over Semseg in low-data regimes. As discussed in the main paper, this can be attributed to the DenseCL hyperparameters being optimized directly for the improvement of Depth. Furthermore, more advanced loss balancing schemes [15] could yield a redistribution of the performance gains; however, such an investigation is beyond the scope of our work.

(a) Input image
(b) Blue crop
(c) Purple crop
Figure S.10: Low overlapping crops can be semantically different. This is more apparent in dense prediction datasets where multiple objects can be present in each image.

Appendix D Experiment Details

D.1 Codebase

In this work, we base our experiments on the VIsion library for state-of-the-art Self-Supervised Learning (VISSL) [28], released under the MIT License. VISSL includes implementations of self-supervised methods, and was adapted to enable the joint optimization of the existing algorithms with supervised methods (semantic segmentation, monocular depth estimation, and boundary detection). The code will be made publicly available upon publication to spark further research in Composite Learning (CompL).

D.2 Technical details

All experiments were conducted on our internal cluster using single V100 GPUs. Due to the considerable costs associated with multiple runs (beyond our compute infrastructure capabilities), we run all experiments with a random seed of 1, the default setting of VISSL. We provide additional details about different aspects that affect the self-supervised methods below:

Memory bank MoCo [32] and DenseCL [65] utilize a memory bank to enlarge the number of negative samples observed during training, while keeping a tractable batch size. Specifically, both methods use a memory bank of size 65,536. All the datasets used in our study are of a smaller magnitude than that memory bank, e.g., 10,582 and 795 images for PASCAL VOC 2012 (aug.) [31] and NYUD-v2 [57], respectively. We therefore set the memory bank to the same size as the training dataset, yielding a single positive per sample and therefore allowing the direct use of the InfoNCE loss [52]. A larger memory bank could also be used, but the contrastive loss would then need to be adapted to account for multiple positives [42].

Image cropping We use nearly identical augmentations to those proposed in MoCo v2 [14] for the self-supervised methods of [32, 65], but found it beneficial to modify the image cropping. In most classification datasets, each image contains a single object, and thus crops with low overlap can still include the same object. In dense tasks such as semantic segmentation, crops with low overlap can contain different objects (Fig. S.10). We follow the practice of [59] and find a constant crop size and distance between the two patches for each task. We empirically find that square crops of size 384 with a distance of 32 pixels on both axes work best for semantic segmentation, crops of size 283×373 (to maintain the input aspect ratio) with a distance of 8 pixels work best for depth, and square crops of size 320 with a distance of 4 pixels work best for boundary detection.
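One plausible reading of this cropping scheme is sketched below: two fixed-size crop boxes whose top-left corners are offset by a fixed distance on both axes, so that the two views overlap heavily. The exact sampling procedure of [59] and of the paper may differ.

```python
import random

def paired_crops(img_w, img_h, crop_w, crop_h, dist):
    """Sample two crop boxes of a fixed size whose top-left corners are a
    fixed distance apart on both axes (boxes given as x0, y0, x1, y1)."""
    x0 = random.randint(0, img_w - crop_w - dist)
    y0 = random.randint(0, img_h - crop_h - dist)
    box_a = (x0, y0, x0 + crop_w, y0 + crop_h)
    box_b = (x0 + dist, y0 + dist, x0 + dist + crop_w, y0 + dist + crop_h)
    return box_a, box_b

# Example with the semantic segmentation setting from the text:
#   box_a, box_b = paired_crops(512, 512, crop_w=384, crop_h=384, dist=32)
```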

DenseCL global vs. local contrastive DenseCL, as discussed in Sec. 3.2 of the main paper, includes a global and a local contrastive term. The importance of the local contrastive term is weighted by a constant parameter. The original paper found that a weight of 0.7 for the local term (0.3 for the global term) performed best for detection, but used 0.5 to strike a balance between the downstream performance on detection and classification. In our study, we also found that a weight of 0.7 for the local contrastive term yields the best performance, and as such, we used it for all DenseCL experiments.

Hyperparameter λ During training, the auxiliary loss is scaled by the hyperparameter λ, weighting the contribution of the auxiliary self-supervised task. The hyperparameter was selected by performing a logarithmic grid search over the set {0.05, 0.1, 0.2, 0.5, 1.0}, as commonly done in the MTL literature. We found the performance of the models to be consistent when λ is in the range 0.1 to 0.5, as seen in Table S.9. The performance quickly degrades for values an order of magnitude larger, as the model prioritizes the auxiliary task over the target task, while smaller values converge to the baseline performance.

λ      Labeled Data
       10%     50%
0.1    57.21   68.64
0.2    57.33   68.81
0.5    57.27   68.79
Table S.9: Ablation of the λ parameter for the semantic segmentation model trained jointly with DenseCL. The model yields comparable performance for all three values.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2010) Contour detection and hierarchical image segmentation. TPAMI 33 (5), pp. 898–916. Cited by: §4.3, §4.3, §4.3.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.3.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. TPAMI 39 (12), pp. 2481–2495. Cited by: §3.3.
  • [4] J. Baek, G. Kim, and S. Kim (2022) Semi-supervised learning with mutual distillation for monocular depth estimation. arXiv preprint arXiv:2203.09737. Cited by: §2.
  • [5] J. Baxter (2000) A model of inductive bias learning. JAIR 12, pp. 149–198. Cited by: §1, §4.1.
  • [6] D. Bruggemann, M. Kanakis, S. Georgoulis, and L. Van Gool (2020) Automated search for resource-efficient branched multi-task networks. In BMVC, Cited by: §3.3.
  • [7] D. Bruggemann, M. Kanakis, A. Obukhov, S. Georgoulis, and L. Van Gool (2021) Exploring relational context for multi-task dense prediction. In ICCV, Cited by: §2.
  • [8] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §1, §2.
  • [9] R. Caruana (1998) A dozen tricks with multitask learning. In Neural networks: tricks of the trade, pp. 165–191. Cited by: §1.
  • [10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §4.
  • [11] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §3.2, §3.3.
  • [12] P. Chen, A. H. Liu, Y. Liu, and Y. F. Wang (2019) Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In CVPR, Cited by: §1, §4.1.
  • [13] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: §2, §3.3.
  • [14] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §D.2, §3.2, §3.2, §3.3.
  • [15] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, Cited by: Appendix C, §4.4.
  • [16] J. Dai, K. He, and J. Sun (2016) Instance-aware semantic segmentation via multi-task network cascades. In CVPR, Cited by: §1.
  • [17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §2.
  • [18] W. Deng, S. Gould, and L. Zheng (2021) What does rotation prediction tell us about classifier accuracy under varying testing environments?. In ICML, Cited by: §2.
  • [19] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §2.
  • [20] I. Endres and D. Hoiem (2010) Category independent object proposals. In ECCV, Cited by: §4.3.
  • [21] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §4.2.
  • [22] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In CVPR, Cited by: §2.
  • [23] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, Cited by: §4.3.
  • [24] M. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah (2021) Anomaly detection in video via self-supervised and multi-task learning. In CVPR, Cited by: §2.
  • [25] G. Ghiasi, B. Zoph, E. D. Cubuk, Q. V. Le, and T. Lin (2021) Multi-task self-training for learning general representations. In ICCV, Cited by: §2.
  • [26] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2019) Boosting few-shot visual learning with self-supervision. In ICCV, Cited by: §2, §3.2, §4.2.
  • [27] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ICLR. Cited by: §2, §3.2.
  • [28] P. Goyal, Q. Duval, J. Reizenstein, M. Leavitt, M. Xu, B. Lefaudeux, M. Singh, V. Reis, M. Caron, P. Bojanowski, A. Joulin, and I. Misra (2021) VISSL. Note: https://github.com/facebookresearch/vissl Cited by: §D.1.
  • [29] V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon (2020) Semantically-guided representation learning for self-supervised monocular depth. In ICLR, Cited by: §4.1.
  • [30] V. Guizilini, J. Li, R. Ambrus, S. Pillai, and A. Gaidon (2020) Robust semi-supervised monocular depth estimation with reprojected distances. In CoRL, Cited by: §2.
  • [31] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In ICCV, Cited by: §D.2, §4.2.
  • [32] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §D.2, §D.2, §2, §3.2, §3.3.
  • [33] K. He, R. Girshick, and P. Dollár (2019) Rethinking imagenet pre-training. In ICCV, Cited by: §2.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.3.
  • [35] O. Henaff (2020) Data-efficient image recognition with contrastive predictive coding. In ICML, Cited by: §3.3.
  • [36] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, Cited by: §2.
  • [37] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, Cited by: §2, §3.2, §4.2.
  • [38] L. Hoyer, D. Dai, Y. Chen, A. Koring, S. Saha, and L. Van Gool (2021) Three ways to improve semantic segmentation with self-supervised depth estimation. In CVPR, Cited by: §2.
  • [39] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.3.
  • [40] J. Jiao, Y. Cao, Y. Song, and R. Lau (2018) Look deeper into depth: monocular depth estimation with semantic booster and attention-driven loss. In ECCV, Cited by: §4.1.
  • [41] M. Kanakis, D. Bruggemann, S. Saha, S. Georgoulis, A. Obukhov, and L. Van Gool (2020) Reparameterizing convolutions for incremental multi-task learning without task interference. In ECCV, Cited by: §2.
  • [42] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. In NeurIPS, Cited by: §D.2, §2.
  • [43] I. Kokkinos (2017) Ubernet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, Cited by: §1.
  • [44] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §2.
  • [45] Z. Li, Y. Zhu, F. Yang, W. Li, C. Zhao, Y. Chen, Z. Chen, J. Xie, L. Wu, R. Zhao, et al. (2022) UniVIP: a unified framework for self-supervised visual pre-training. In CVPR, Cited by: §2.
  • [46] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: Appendix B.
  • [47] S. Liu, A. Davison, and E. Johns (2019) Self-supervised generalisation with meta auxiliary learning. NeurIPS. Cited by: §2.
  • [48] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.
  • [49] K. Maninis, I. Radosavovic, and I. Kokkinos (2019) Attentive single-tasking of multiple tasks. In CVPR, Cited by: §2, §3.3, §4.4.
  • [50] D. R. Martin, C. C. Fowlkes, and J. Malik (2004) Learning to detect natural image boundaries using local brightness, color, and texture cues. TPAMI 26 (5), pp. 530–549. Cited by: §4.3.
  • [51] A. Newell and J. Deng (2020) How useful is self-supervised pretraining for visual tasks?. In CVPR, Cited by: §2.
  • [52] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §D.2, §3.2.
  • [53] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu (2021) Quasi-dense similarity learning for multiple object tracking. In CVPR, Cited by: §2.
  • [54] R. Ranjan, V. M. Patel, and R. Chellappa (2017) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. TPAMI 41 (1), pp. 121–135. Cited by: §2.
  • [55] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §3.3.
  • [56] S. Saha, A. Obukhov, D. P. Paudel, M. Kanakis, Y. Chen, S. Georgoulis, and L. Van Gool (2021) Learning to relate depth and semantics for unsupervised domain adaptation. In CVPR, Cited by: §2.
  • [57] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §D.2, §4.1, §4.3, §4.4.
  • [58] T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, and S. Savarese (2020) Which tasks should be learned together in multi-task learning?. In ICML, Cited by: §1, §2.
  • [59] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. In NeurIPS, Cited by: §D.2.
  • [60] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §3.3.
  • [61] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. JMLR 9 (11). Cited by: §4.1.
  • [62] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. (2020) Deep high-resolution representation learning for visual recognition. TPAMI. Cited by: §3.3.
  • [63] W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool (2021) Exploring cross-image pixel contrast for semantic segmentation. In ICCV, Cited by: §2.
  • [64] X. Wang, T. E. Huang, B. Liu, F. Yu, X. Wang, J. E. Gonzalez, and T. Darrell (2021) Robust object detection via instance-level temporal cycle confusion. In ICCV, Cited by: §2.
  • [65] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li (2021) Dense contrastive learning for self-supervised visual pre-training. In CVPR, Cited by: §D.2, §D.2, §2, §3.2, §4.4.
  • [66] Y. Wu and K. He (2018) Group normalization. In ECCV, Cited by: §3.3.
  • [67] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In ICCV, Cited by: §4.3.
  • [68] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) Bdd100k: a diverse driving dataset for heterogeneous multitask learning. In CVPR, Cited by: §2, §4.2.
  • [69] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, Cited by: §3.3.
  • [70] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR, Cited by: §3.3.
  • [71] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In CVPR, Cited by: §2.
  • [72] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4l: self-supervised semi-supervised learning. In ICCV, Cited by: §2, §3.2, §4.2.
  • [73] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In ICML, Cited by: §2.