UM-Adapt: Unsupervised Multi-Task Adaptation Using Adversarial Cross-Task Distillation

by Jogendra Nath Kundu, et al.
Indian Institute of Science

Aiming towards human-level generalization, there is a need to explore adaptable representation learning methods with greater transferability. Most existing approaches independently address task-transferability and cross-domain adaptation, resulting in limited generalization. In this paper, we propose UM-Adapt, a unified framework to effectively perform unsupervised domain adaptation for spatially-structured prediction tasks while simultaneously maintaining a balanced performance across individual tasks in a multi-task setting. To realize this, we propose two novel regularization strategies: a) Contour-based content regularization (CCR) and b) exploitation of inter-task coherency using a cross-task distillation module. Furthermore, avoiding a conventional ad-hoc domain discriminator, we re-utilize the cross-task distillation loss as the output of an energy function to adversarially minimize the input domain discrepancy. Through extensive experiments, we demonstrate superior generalizability of the learned representation simultaneously for multiple tasks under domain-shifts from synthetic to natural environments. UM-Adapt yields state-of-the-art transfer learning results on ImageNet classification and comparable performance on the PASCAL VOC 2007 detection task, even with a smaller backbone-net. Moreover, the resulting semi-supervised framework outperforms the current fully-supervised multi-task learning state-of-the-art on both the NYUD and Cityscapes datasets.





1 Introduction

Deep networks have proven to be highly successful in a wide range of computer vision problems. They not only excel in classification or recognition based tasks, but also deliver comparable performance improvements for complex spatially-structured prediction tasks like semantic segmentation, monocular depth estimation etc. However, generalizability of such models is one of the major concerns before deploying them in a target environment, since such models exhibit alarming dataset or domain bias [5, 29]. To effectively address this, researchers have started focusing on unsupervised domain adaptation approaches [7]. In a fully-unsupervised setting without target annotations, one of the effective approaches [13, 58] is to minimize the domain discrepancy at a latent feature level so that the model extracts domain agnostic and task-specific representations. Although such approaches are very effective in classification or recognition based tasks [57], they yield suboptimal performance for adaptation of fully-convolutional architectures, which is particularly essential for spatial prediction tasks [62]. One of the major issues encountered in such scenarios is attributed to the spatially-structured high-dimensional latent representation in contrast to vectorized form [30]. Moreover, preservation of spatial-regularity, avoiding mode-collapse [50] becomes a significant challenge while aiming to adapt in a fully unsupervised setting.

Figure 1: A schematic diagram to understand the implications of cross-task distillation. The green arrows show consistency in cross-task transfer for a source sample. Whereas, the red and purple arrows show a discrepancy (yellow arrows) in cross-task transfer for a target sample as a result of input domain-shift. UM-Adapt aims to minimize this discrepancy as a proxy to achieve adaptation at a spatially-structured common latent representation.

While we aim towards human level performance, there is a need to explore scalable learning methods, which can yield generic image representation with improved transferability across both tasks and data domains. Multi-task learning [39, 25] is an emerging field of research in this direction, where the objective is to realize a task-agnostic visual representation by jointly training a model on several complementary tasks [18, 37]. In general, such networks are difficult to train as they require explicit attention to balance performance across each individual task. Also, such approaches only address a single aspect of the final objective (i.e. generalization across tasks), ignoring the other important aspect of generalization across data domains.

In this paper, we focus on multi-task adaptation of spatial prediction tasks, proposing efficient solutions to the specific difficulties discussed above. To deliver strong performance in both generalization across tasks and adaptation to input domain shift, we formulate a multi-task adaptation framework called UM-Adapt. To effectively preserve the spatial-regularity information during the unlabelled adversarial adaptation [57] procedure, we propose two novel regularization techniques. Motivated by the fact that the output representations share a common spatial-structure with respect to the input image, we first introduce a novel contour-based content regularization procedure. Additionally, we formalize a novel idea of exploiting cross-task coherency as an important cue to further regularize the multi-task adaptation process.

Consider a base-model trained on two different tasks, task-A and task-B. Can we use the supervision of task-A to learn the output representation of task-B and vice versa? Such an approach is feasible, particularly when the tasks in consideration share some common characteristics (i.e. consistency in spatial structure across task outputs). Following this reasoning, we introduce a cross-task distillation module (see Figure 1). The module essentially consists of multiple encoder-decoder architectures called task-transfer networks, which are trained to get back the representation of a certain task as output from a combination of the remaining tasks as the input. The overarching motivation behind such a framework is to effectively balance performance across all the tasks in consideration, thereby avoiding domination of the easier tasks during training. To intuitively understand the effectiveness of cross-task distillation, let us consider a particular training state of the base-model, where the performance on task-A is much better than the performance on task-B. Here the task-transfer network (which is trained to output the task-B representation using the base-model task-A prediction as input) will yield improved performance on task-B, as a result of the dominant base task-A performance. This results in a clear discrepancy between the base task-B performance and the task-B performance obtained through the task-transfer network. We aim to minimize this discrepancy, which in turn acts as a regularization encouraging balanced learning across all the tasks.

In single-task domain adaptation approaches [57, 30], it is common to employ an ad-hoc discriminator to minimize the domain discrepancy. However, in the presence of a cross-task distillation module, this approach considerably complicates the overall training pipeline. Therefore, avoiding such a direction, we propose to design a unified framework to effectively address both the diverse objectives, i.e. a) to realize a balanced performance across all the tasks and b) to perform domain adaptation in an unsupervised setting. Taking inspiration from energy-based GAN [65], we re-utilize the task-transfer networks, treating the transfer-discrepancies as outputs of energy-functions, to adversarially minimize the domain discrepancy in a fully-unsupervised setting.

Our contributions in this paper are as follows:

  • We propose a simplified, yet effective unsupervised multi-task adaptation framework, utilizing two novel regularization strategies: a) Contour-based content regularization (CCR) and b) exploitation of inter-task coherency using a cross-task distillation module.

  • Further, we adopt a novel direction by effectively utilizing cross-task distillation loss as an energy-function to adversarially minimize the input domain discrepancy in a fully-unsupervised setting.

  • UM-Adapt yields state-of-the-art transfer learning results on ImageNet classification, and comparable performance on the PASCAL VOC 2007 detection task, even with a smaller backbone-net. The resulting semi-supervised framework outperforms the current fully-supervised multi-task learning state-of-the-art on both the NYUD and Cityscapes datasets.

Figure 2: An overview of the proposed UM-Adapt architecture for multi-task adaptation. The blue and pink background wide-channel indicates data flow for synthetic and natural domain respectively. On the right we show an illustration of the proposed cross-task distillation module, which is later utilized as an energy-function for adversarial adaptation (Section 3.3.2).

2 Related work

Domain adaptation. Recent adaptation approaches for deep networks focus on minimization of domain discrepancy by optimizing some distance function related to higher order statistical distributions [7]. Works following adversarial discriminative approaches [56, 57, 11, 12] utilize motivations from Generative Adversarial Networks [15] to bridge the domain-gap. Recent domain adaptation approaches, particularly targeting spatially-structured prediction tasks, can be broadly divided into two sub-branches, viz. a) pixel-space adaptation and b) feature-space adaptation. In pixel-space adaptation [2, 20, 3] the objective is to train an image-translation network [67], which can transform an image from the target domain to resemble an image from the source domain. On the other hand, feature-space adaptation approaches focus on minimization of various statistical distance metrics [16, 37, 54, 22] at some latent feature-level, mostly in a fully-shared source and target domain parameter setting [46]. However, unshared setups show improved adaptation performance as a result of learning dedicated filter parameters for both the domains in consideration [57, 30]. But a fully unshared setup comes with other difficulties such as mode-collapse due to inconsistent output in the absence of paired supervision [23, 67]. Therefore, an optimal strategy would be to adapt as few parameters as possible separately for the target domain in a partially-shared architecture setting [30].

Multi-task learning. Multi-task learning [4] has been applied in computer vision literature [39] for quite a long time for a varied set of tasks in consideration [28, 9, 10, 17, 44]. To realize this, a trivial direction is to formulate a multi-task loss function, which weighs the relative contribution of each task, enabling equal importance to individual task performance. It is desirable to formulate techniques which adaptively modify the relative weighting of individual tasks as a function of the current learning state or iteration. Kendall et al. [25] proposed a principled approach by utilizing a joint likelihood formulation to derive task weights based on the intrinsic uncertainty in individual tasks. Chen et al. [64] proposed a gradient normalization (GradNorm) algorithm that automatically balances training in deep multi-task models by dynamically tuning gradient magnitudes. Another set of works focuses on learning task-agnostic generalized visual representations utilizing the advances in multi-task learning techniques [61, 46].

3 Approach

Here we define the notations and problem setting for unsupervised multi-task adaptation. Consider source input image samples $x_s \in X_s$ with the corresponding outputs for different tasks being $y_s^{depth} \in Y_s^{depth}$, $y_s^{seg} \in Y_s^{seg}$ and $y_s^{normal} \in Y_s^{normal}$ for a set of three complementary tasks, namely monocular depth, semantic segmentation and surface normal respectively. We have full access to the source image and output pairs, as they are extracted from a synthetic graphical environment. The objective of UM-Adapt is to estimate the most reliable task-based predictions for an unknown target domain input, $x_t$. Considering natural images $x_t \in X_t$ as samples from the target domain, to emulate an unsupervised setting we restrict access to the corresponding task-specific outputs, viz. $y_t^{depth}$, $y_t^{seg}$ and $y_t^{normal}$. Note that the objective can be readily extended to a semi-supervised setting, considering availability of output annotations for only a few input samples from the target domain.

3.1 UM-Adapt architecture

As shown in Figure 2, the base multi-task adaptation architecture is motivated by the standard CNN encoder-decoder framework. The mapping function from the source domain, $X_s$, to a spatially-structured latent representation is denoted as $M_s(x)$. Following this, three different decoders with up-convolutional layers [31] are employed for the three tasks in consideration (see Figure 2), i.e., $\hat{y}^{depth} = F_{depth}(M_s(x))$, $\hat{y}^{seg} = F_{seg}(M_s(x))$ and $\hat{y}^{normal} = F_{normal}(M_s(x))$. Initially the entire architecture is trained with full supervision on source domain data. To effectively balance performance across all the tasks, we introduce the cross-task distillation module as follows.

3.1.1 Cross-task Distillation module

This module aims to get back the representation of a certain task through a transfer function which takes a combined representation of all other tasks as input. Consider $T$ as the set of all tasks, i.e. $T = \{t_1, t_2, \ldots, t_n\}$, where $n$ is the total number of tasks in consideration. For a particular task $t_i$, we denote $\hat{y}_{t_i}$ as the prediction of the base-model at the $t_i$ output-head. We denote the task-specific loss function as $\mathcal{L}_{t_i}(\cdot, \cdot)$ in further sections of the paper. Here, the task-transfer network is represented as $H_{t_i}$, which takes the combined set of ground-truth maps of all other tasks, $\{y_{t_j}\}_{j \neq i}$, as the input representation, and the corresponding output is denoted as $\tilde{y}_{t_i} = H_{t_i}(\{y_{t_j}\}_{j \neq i})$. The parameters of $H_{t_i}$ are obtained by optimizing a task-transfer loss function, denoted by $\mathcal{L}_{t_i}(\tilde{y}_{t_i}, y_{t_i})$, and are kept frozen in further stages of training. However, one can feed the predictions of the base-model through the task-transfer network to realize another estimate of task $t_i$, represented as $\acute{y}_{t_i}$, where $\acute{y}_{t_i} = H_{t_i}(\{\hat{y}_{t_j}\}_{j \neq i})$. Following this, we define the distillation-loss for task $t_i$ as $\mathcal{L}_{t_i}(\hat{y}_{t_i}, \acute{y}_{t_i})$. While optimizing parameters of the base-model, this distillation-loss is utilized as one of the important loss components to realize an effective balance across all the task objectives (see Algorithm 1). Here, we aim to minimize the discrepancy between the direct and indirect prediction (via other tasks) of individual tasks. The proposed learning algorithm does not allow any single task to dominate the training process, since the least performing task will exhibit higher discrepancy and hence will be given more importance in further training iterations.

Compared to the general knowledge-distillation framework [19], one can consider the indirect estimate $\acute{y}_{t_i}$ to be analogous to the output of a teacher network and the direct base-model prediction $\hat{y}_{t_i}$ as the output of a student network. Here the objective is to optimize the parameters of the base-model by effectively employing the distillation loss, which in turn enforces coherence among the individual task performances.
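The interplay between direct and indirect predictions can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the implementation used in the paper: output maps are flat lists of floats, `l1_loss` stands in for the task-specific losses, and the hypothetical `transfer_nets` are hand-written functions standing in for the frozen, pre-trained task-transfer networks.

```python
from typing import Callable, Dict, List

Map = List[float]  # toy stand-in for a spatial output map


def l1_loss(a: Map, b: Map) -> float:
    """Mean absolute difference; a stand-in for a task-specific loss."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_losses(
    preds: Dict[str, Map],
    transfer_nets: Dict[str, Callable[[Dict[str, Map]], Map]],
) -> Dict[str, float]:
    """Compare each direct prediction with the indirect estimate that the
    frozen transfer network produces from the predictions of the other tasks."""
    losses = {}
    for task, net in transfer_nets.items():
        others = {t: p for t, p in preds.items() if t != task}
        indirect = net(others)  # estimate of `task` obtained via the other tasks
        losses[task] = l1_loss(preds[task], indirect)
    return losses


# Toy world in which the true cross-task relation is normal = 2 * depth.
transfer_nets = {
    "depth": lambda others: [v / 2.0 for v in others["normal"]],
    "normal": lambda others: [2.0 * v for v in others["depth"]],
}

# The depth head is accurate; the normal head is off by 0.5 everywhere.
preds = {"depth": [1.0, 2.0, 3.0], "normal": [2.5, 4.5, 6.5]}
print(distillation_losses(preds, transfer_nets))  # prints {'depth': 0.25, 'normal': 0.5}
```

In this toy setting the weaker normal head receives the larger distillation loss, which is the mechanism the text relies on to give lagging tasks more weight during training.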

Figure 3: An overview of the (a) proposed CCR framework with (b) evidence of consistency in contour-map computed on the output-map of each task to that of the input RGB image.

3.1.2 Architecture for target domain adaptation

Following a partially-shared adaptation setup, a separate latent mapping network is introduced specifically for the target domain samples, i.e. $M_t(x_t)$ (see Figure 2). In line with AdaDepth [30], we initialize the target encoder $M_t$ using the pre-trained parameters of the source domain counterpart, $M_s$ (ResNet-50 encoder), in order to start with a good baseline initialization. Following this, only the Res-5 block parameters (i.e. $\theta_{res5}$) are updated for the additional encoder branch, $M_t$. Note that the learning algorithm for unsupervised adaptation does not update the other layers, i.e. the task-specific decoders and the initial shared layers up to the Res-4f block.
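The partially-shared update rule amounts to masking the optimizer step by parameter name. The following is a minimal sketch under stated assumptions: parameters are stored in a plain dictionary, the names (`res4f.w`, `res5.w`, `decoder_depth.w`) are hypothetical, and plain SGD is used only for brevity.

```python
from typing import Dict, List, Tuple


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable_prefixes: Tuple[str, ...],
    lr: float = 0.1,
) -> Dict[str, List[float]]:
    """Apply plain SGD only to parameters whose name starts with one of the
    trainable prefixes; every other parameter is copied unchanged (frozen)."""
    updated = {}
    for name, values in params.items():
        if name.startswith(trainable_prefixes):
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)
    return updated


# Hypothetical parameter names for the shared layers, Res-5 block and a decoder.
params = {"res4f.w": [1.0], "res5.w": [1.0], "decoder_depth.w": [1.0]}
grads = {name: [1.0] for name in params}

new_params = sgd_step(params, grads, trainable_prefixes=("res5.",), lr=0.5)
print(new_params)  # only "res5.w" changes
```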

Algorithm 1: Base-model training algorithm on fully-supervised source data with cross-task distillation loss. In outline: after initializing the base-model parameters, each training iteration loops over the tasks one at a time.

3.2 Contour-based Content Regularization (CCR)

Spatial content inconsistency is a serious problem for unsupervised domain adaptation focused on pixel-wise dense-prediction tasks [20, 30]. In order to address this, [30] proposes a Feature Consistency Framework (FCF), where the authors employ a cyclic feature reconstruction setup to preserve the spatially-structured content, such as semantic contours, at the Res-4f activation map of the frozen ResNet-50 (till Res-4f) encoder. However, the spatial size of the output activation of the Res-4f feature (i.e. 20×16) is too small to capture the spatial regularities required to alleviate the contour alignment problem.

To address the above issue, we propose a novel Contour-based Content Regularization (CCR) method. As shown in Figure 3a, we introduce a shallow (4 layer) contour decoder to reconstruct only the contour-map of the given input image, where the ground-truth is obtained using a standard contour prediction algorithm [1]. This content regularization loss (a mean-squared loss) is referred to as the CCR loss in further sections of the paper. Assuming that the majority of the image-based contours align with the contours of task-specific output maps, we argue that the encoded feature (Res-5c activation) must retain the contour information during the adversarial training for improved adaptation performance. This makes CCR preferable to the existing approaches of enforcing image-reconstruction-based regularization [2, 41, 51], since the additionally introduced decoder is simpler and is freed from the burden of generating irrelevant color-based appearance. The contour decoder is trained on pairs of the fixed source encoder output and the corresponding ground-truth contour map. However, unlike FCF regularization [30], the parameters of the contour decoder are not updated during the adversarial learning, as the expected output contour map is independent of whether the source or the target encoder transformation is used. As a result, the CCR loss is treated as the output of an energy-function, which is later minimized for target samples to bridge the discrepancy between the source and target latent feature distributions during adaptation, as shown in Algorithm 2.
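The CCR objective reduces to a mean-squared error between a decoded contour map and a contour map extracted from the input image. The sketch below only illustrates that computation under loud assumptions: `gradient_contours` is a crude finite-difference stand-in for the contour prediction algorithm cited as [1], images are small nested lists, and the "decoded" map is supplied by hand rather than produced by a contour decoder.

```python
from typing import List

Image = List[List[float]]


def gradient_contours(img: Image) -> Image:
    """Crude contour map: sum of absolute forward differences along x and y.
    This is an illustrative stand-in, not the contour detector used in the paper."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dx = img[r][c + 1] - img[r][c] if c + 1 < w else 0.0
            dy = img[r + 1][c] - img[r][c] if r + 1 < h else 0.0
            out[r][c] = abs(dx) + abs(dy)
    return out


def mse(a: Image, b: Image) -> float:
    """Mean-squared error over all pixels of two equally sized maps."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# A 2x3 image with a vertical edge between the second and third column.
image = [[0.0, 0.0, 1.0],
         [0.0, 0.0, 1.0]]
target_contours = gradient_contours(image)   # [[0, 1, 0], [0, 1, 0]]

# Hand-written "decoded" contour map that misses the edge in the second row.
decoded = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
ccr_loss = mse(decoded, target_contours)     # one wrong pixel out of six
print(ccr_loss)
```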

3.3 Unsupervised Multi-task adaptation

In unsupervised adaptation, the overall objective is to minimize the discrepancy between the source and target input distributions. However, minimizing the discrepancy between the target prediction distribution $P(\hat{y}_t)$ and the source ground-truth distribution $P(y_s)$ can possibly overcome the differences between the ground-truth and prediction better, when compared to matching $P(\hat{y}_t)$ with the source prediction distribution $P(\hat{y}_s)$ as proposed in some previous approaches [55]. Aiming towards optimal performance, UM-Adapt focuses on matching the target prediction with the actual ground-truth map distribution, and the proposed cross-task distillation module provides a means to effectively realize such an objective.

3.3.1 UM-Adapt baseline (UM-Adapt-B)

Existing literature [38, 30] shows the efficacy of simultaneous adaptation at hierarchical feature levels while minimizing domain discrepancy for multi-layer deep architectures. Motivated by this, we design a single discriminator which can match the joint distribution of the latent representation and the final task-specific structured prediction maps with the corresponding true joint distribution. As shown in Figure 2, the predicted joint distribution (target latent representation together with the target task predictions) is matched with the true joint distribution (source latent representation together with the source ground-truth maps), following the usual adversarial discriminative strategy [30] (see Supplementary for more details). We will denote this framework as UM-Adapt-B in further sections of this paper.

Algorithm 2: Training algorithm of UM-Adapt-(Adv.) utilizing energy-based adversarial cross-task distillation. In UM-Adapt-(noAdv.) we do not update the parameters of the task-transfer networks throughout the adaptation procedure (see Section 3.3.2).

  /* Initialization of parameters */
  Res-5 parameters of the target encoder: initialized from the source encoder
  Task-transfer network parameters: taken from the networks fully trained on ground-truth task output-maps
  for each training iteration do
      for a fixed number of steps do
          for each task do
              /* Update trainable parameters of the target encoder by minimizing energy of target samples. */
      for each task do
          /* Update the energy function */

3.3.2 Adversarial cross-task distillation

Aiming to formalize a unified framework that addresses multi-task adaptation as a whole, we treat the task-transfer networks as energy functions to adversarially minimize the domain discrepancy. Following the analogy of Energy-based GAN [65], the task-transfer networks are first trained to assign low energy to the ground-truth task-based source tuples and high energy to the corresponding tuples formed from the target predictions. This is realized by minimizing the energy-function loss defined in Algorithm 2. Conversely, the trainable parameters of the target encoder are updated to assign low energy to the target prediction tuples, as enforced by the encoder loss in Algorithm 2. Along with the previously introduced CCR regularization, the final update objective for the target encoder combines this energy term with the CCR loss. We use a different optimizer for the energy function of each task. As a result, the target encoder is optimized to have a balanced performance across all the tasks even in a fully unsupervised setting. We denote this framework as UM-Adapt-(Adv.) in further sections of this paper.
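The two opposing objectives can be sketched with the margin-based formulation of Energy-based GAN [65]. This is an assumption: the exact loss expressions of Algorithm 2 are not reproduced here, so the hinge form, the `margin` parameter and the function name `energy_losses` are illustrative choices, with the energy of a tuple taken to be its cross-task distillation loss.

```python
from typing import Tuple


def energy_losses(
    source_energy: float, target_energy: float, margin: float = 1.0
) -> Tuple[float, float]:
    """EBGAN-style objectives, used here as an assumed stand-in.

    source_energy: distillation loss of a ground-truth source tuple
    target_energy: distillation loss of a predicted target tuple
    Returns (loss for the energy function, loss for the target encoder).
    """
    # Energy function: pull source energy down, push target energy up to the margin.
    loss_energy_fn = source_energy + max(0.0, margin - target_energy)
    # Target encoder: make target prediction tuples look low-energy.
    loss_encoder = target_energy
    return loss_energy_fn, loss_encoder


print(energy_losses(0.25, 0.5, margin=1.0))  # prints (0.75, 0.5)
print(energy_losses(0.25, 2.0, margin=1.0))  # prints (0.25, 2.0)
```

Once the target energy exceeds the margin, the hinge term vanishes and the energy function stops pushing target tuples further away, which is the usual motivation for the margin in this family of losses.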

Note that the task-transfer networks are trained only on ground-truth output-maps under sufficient regularization due to the compressed latent representation, as a result of the encoder-decoder setup. This enables the task-transfer networks to learn a better approximation of the intended cross-task energy manifold, even in the absence of negative examples (target samples) [65]. This analogy is used in Algorithm 1 to effectively treat the frozen task-transfer network as an energy-function to realize a balanced performance across all the tasks on the fully-supervised source domain samples. Following this, we formulate an ablation of UM-Adapt, where we restrain the parameter update of the task-transfer networks in Algorithm 2. We denote this framework as UM-Adapt-(noAdv.) in further sections of this paper. This modification simplifies the unsupervised adaptation algorithm, as it finally retains only the Res-5 parameters of the target encoder as the minimal set of trainable parameters (with the task-transfer network parameters frozen at their pre-trained values).

4 Experiments

To demonstrate effectiveness of the proposed framework, we evaluate on three different publicly available benchmark datasets, separately for indoor and outdoor scenes. Further, in this section, we discuss details of our adaptation setting and analysis of results on standard evaluation metrics for a fair comparison against prior art.

4.1 Experimental Setting

We follow the encoder-decoder architecture exactly as proposed by Laina et al. [31]. The decoder architecture is replicated three times to form the depth, segmentation and surface-normal decoders, $F_{depth}$, $F_{seg}$ and $F_{normal}$ respectively. However, the number of feature maps and the nonlinearity for the final task-based prediction layers are adapted according to the standard requirements of each task. We use the BerHu loss [31] as the loss function for the depth estimation task, i.e. $\mathcal{L}_{depth}$. Following Eigen et al. [9], an inverse of the element-wise dot product on unit normal vectors for each pixel location is considered as the loss function for surface-normal estimation, $\mathcal{L}_{normal}$. Similarly for segmentation, i.e. $\mathcal{L}_{seg}$, a classification-based cross-entropy loss is implemented with a weighting scheme to balance gradients from different classes depending on their coverage.
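The two regression losses can be written compactly. The sketch below is illustrative only: it operates on flat Python lists rather than image tensors, the BerHu threshold is set to one fifth of the largest absolute residual as in Laina et al. [31], and the "inverse of the dot product" is read here as the negative mean dot product between unit normals, which is an interpretation rather than a confirmed detail.

```python
from typing import List, Sequence


def berhu_loss(pred: List[float], target: List[float], c_ratio: float = 0.2) -> float:
    """Reverse Huber (BerHu): L1 for small residuals, scaled L2 for large ones."""
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = c_ratio * max(residuals)  # threshold relative to the largest residual
    if c == 0.0:
        return 0.0  # perfect prediction
    total = 0.0
    for r in residuals:
        total += r if r <= c else (r * r + c * c) / (2.0 * c)
    return total / len(residuals)


def normal_loss(pred: List[Sequence[float]], target: List[Sequence[float]]) -> float:
    """Negative mean dot product between per-pixel unit normal vectors."""
    dots = [sum(a * b for a, b in zip(p, g)) for p, g in zip(pred, target)]
    return -sum(dots) / len(dots)


# Residuals are 0, 0.5 and 5, so c = 1 and the terms are 0, 0.5 and 13.
print(berhu_loss([1.0, 2.0, 5.0], [1.0, 1.5, 0.0]))               # prints 4.5
print(normal_loss([(0.0, 0.0, 1.0)], [(0.0, 0.0, 1.0)]))          # prints -1.0
```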

We also consider a semi-supervised setting (UM-Adapt-S), where the training starts from the initialization of the trained unsupervised version, UM-Adapt-(Adv.). For better generalization, alternate batches of labelled target samples (optimizing the supervised loss) and unlabelled target samples (optimizing the unsupervised loss) are used to update the network parameters.
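The alternating update schedule can be sketched as follows. The helper name `alternating_schedule`, the choice to start with a labelled batch, and the cycling of the smaller labelled set are assumptions made for illustration.

```python
from itertools import cycle
from typing import List, Sequence, Tuple


def alternating_schedule(
    labelled_batches: Sequence[str],
    unlabelled_batches: Sequence[str],
    num_steps: int,
) -> List[Tuple[str, str]]:
    """Interleave labelled and unlabelled batches, cycling through each pool.
    Even steps use the supervised loss, odd steps the unsupervised loss."""
    labelled, unlabelled = cycle(labelled_batches), cycle(unlabelled_batches)
    schedule = []
    for step in range(num_steps):
        if step % 2 == 0:
            schedule.append(("supervised", next(labelled)))
        else:
            schedule.append(("unsupervised", next(unlabelled)))
    return schedule


for loss_kind, batch in alternating_schedule(["L0", "L1"], ["U0"], num_steps=4):
    print(loss_kind, batch)
```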

Method | Supervised target samples | rel | log10 | rms | δ<1.25 | δ<1.25² | δ<1.25³
(rel, log10, rms: error, lower is better; δ columns: accuracy, higher is better)
Saxena et al. [52] | 795 | 0.349 | - | 1.214 | 0.447 | 0.745 | 0.897
Liu et al. [34] | 795 | 0.230 | 0.095 | 0.824 | 0.614 | 0.883 | 0.975
Eigen et al. [10] | 120K | 0.215 | - | 0.907 | 0.611 | 0.887 | 0.971
Roy et al. [48] | 795 | 0.187 | 0.078 | 0.744 | - | - | -
Laina et al. [31] | 96K | 0.129 | 0.056 | 0.583 | 0.801 | 0.950 | 0.986
Simultaneous multi-task learning
Multi-task baseline | 0 | 0.27 | 0.095 | 0.862 | 0.559 | 0.852 | 0.942
UM-Adapt-B(FCF) | 0 | 0.218 | 0.091 | 0.679 | 0.67 | 0.898 | 0.974
UM-Adapt-B(CCR) | 0 | 0.192 | 0.081 | 0.754 | 0.601 | 0.877 | 0.971
UM-Adapt-(noAdv.)-1 | 0 | 0.181 | 0.077 | 0.743 | 0.623 | 0.889 | 0.978
UM-Adapt-(noAdv.) | 0 | 0.178 | 0.063 | 0.712 | 0.781 | 0.917 | 0.984
UM-Adapt-(Adv.) | 0 | 0.175 | 0.065 | 0.673 | 0.783 | 0.92 | 0.984
Wang et al. [59] | 795 | 0.220 | 0.094 | 0.745 | 0.605 | 0.890 | 0.970
Eigen et al. [9] | 795 | 0.158 | - | 0.641 | 0.769 | 0.950 | 0.988
Jafari et al. [24] | 795 | 0.157 | 0.068 | 0.673 | 0.762 | 0.948 | 0.988
UM-Adapt-S | 795 | 0.149 | 0.067 | 0.637 | 0.793 | 0.938 | 0.983
Table 1: Quantitative comparison of different ablations of the UM-Adapt framework against prior art for depth estimation on NYUD-v2. The second column indicates the number of supervised target samples used during training.

Method | Supervised target samples | mean | median | 11.25° | 22.5° | 30°
(mean, median: angular error; 11.25°, 22.5°, 30°: accuracy within the threshold)
Eigen et al. [9] | 120k | 22.2 | 15.3 | 38.6 | 64 | 73.9
PBRS [63] | 795 | 21.74 | 14.75 | 39.37 | 66.25 | 76.06
SURGE [60] | 795 | 20.7 | 12.2 | 47.3 | 68.9 | 76.6
GeoNet [43] | 30k | 19.0 | 11.8 | 48.4 | 71.5 | 79.5
Simultaneous multi-task learning
Multi-task baseline | 0 | 25.8 | 18.73 | 29.65 | 61.69 | 69.83
UM-Adapt-B(FCF) | 0 | 24.6 | 16.49 | 37.53 | 65.73 | 75.51
UM-Adapt-B(CCR) | 0 | 23.8 | 14.67 | 42.08 | 69.13 | 77.28
UM-Adapt-(noAdv.)-1 | 0 | 22.3 | 15.56 | 43.17 | 69.11 | 78.36
UM-Adapt-(noAdv.) | 0 | 22.2 | 15.31 | 43.74 | 70.18 | 78.83
UM-Adapt-(Adv.) | 0 | 22.2 | 15.23 | 43.68 | 70.45 | 78.95
UM-Adapt-S | 795 | 21.2 | 13.98 | 44.66 | 72.11 | 81.08
Table 2: Quantitative comparison of different ablations of the UM-Adapt framework against prior art for surface-normal estimation on the standard test-set of NYUD-v2.
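For reference, the surface-normal columns of Table 2 follow the usual angular-error protocol, which can be sketched as below. This is a minimal illustration on a handful of vectors, not the evaluation script behind the reported numbers; the function names are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def angular_errors_deg(
    pred: List[Sequence[float]], gt: List[Sequence[float]]
) -> List[float]:
    """Per-pixel angle in degrees between predicted and ground-truth normals."""
    errors = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        errors.append(math.degrees(math.acos(cosine)))
    return errors


def normal_metrics(
    pred: List[Sequence[float]], gt: List[Sequence[float]]
) -> Dict[str, float]:
    """Mean and median angular error plus the fraction of pixels within each threshold."""
    errors = angular_errors_deg(pred, gt)
    result = {"mean": mean(errors), "median": median(errors)}
    for threshold in (11.25, 22.5, 30.0):
        result[f"within_{threshold}"] = sum(e < threshold for e in errors) / len(errors)
    return result


# Three pixels with angular errors of 0, 20 and 90 degrees.
a = math.radians(20.0)
pred = [(0.0, 0.0, 1.0), (0.0, math.sin(a), math.cos(a)), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0)] * 3
print(normal_metrics(pred, gt))
```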

Datasets. For representation learning on indoor scenes, we use the publicly available NYUD-v2 [53] dataset, which has been used extensively for supervised multi-task prediction of depth estimation, semantic segmentation and surface-normal estimation. The processed version of the dataset consists of 1449 sample images with a standard split of 795 for training and 654 for testing. While adapting in the semi-supervised setting, we use the corresponding ground-truth maps of all the 3 tasks (795 training images) for the supervised loss. The CNN takes an input of size 228×304 with various augmentations of scale and flip following [10], and outputs three task-specific maps, each of size 128×160. For the synthetic counterpart, we use 100,000 randomly sampled synthetic renders from the PBRS [63] dataset along with the corresponding clean ground-truth maps (for all the three tasks) as the source domain samples.

To demonstrate generalizability of UM-Adapt, we consider outdoor-scene datasets for two different tasks, semantic segmentation and depth estimation. For the synthetic source domain, we use the publicly available GTA5 [47] dataset consisting of 24966 images with the corresponding depth and segmentation ground-truths. However, for real outdoor scenes, the widely used KITTI dataset does not have semantic labels that are compatible with the synthetic counterpart. On the other hand, the natural image Cityscapes dataset [6] does not contain ground-truth depth maps. Therefore, to formulate a simultaneous multi-task learning problem and to perform a fair comparison against prior art, we consider the Eigen test-split on KITTI [9] for comparison of depth-estimation results and the Cityscapes validation set to benchmark our outdoor segmentation results in a single UM-Adapt framework. For the semi-supervised setting, we feed alternate KITTI and Cityscapes minibatches with the corresponding ground-truth maps for supervision. Here, the input and output resolutions for the network are 256×512 and 128×256 respectively.

Method | Supervised target samples | Mean IOU | Mean Accuracy | Pixel Accuracy
PBRS [63] | 795 | 0.332 | - | -
Long et al. [36] | 795 | 0.292 | 0.422 | 0.600
Lin et al. [33] | 795 | 0.406 | 0.536 | 0.700
Kong et al. [27] | 795 | 0.445 | - | 0.721
RefineNet(Res50) [32] | 795 | 0.438 | - | -
Simultaneous multi-task learning
Multi-task baseline | 0 | 0.022 | 0.063 | 0.067
UM-Adapt-B(FCF) | 0 | 0.154 | 0.295 | 0.514
UM-Adapt-B(CCR) | 0 | 0.163 | 0.308 | 0.557
UM-Adapt-(noAdv.)-1 | 0 | 0.189 | 0.345 | 0.603
UM-Adapt-(noAdv.) | 0 | 0.214 | 0.364 | 0.608
UM-Adapt-(Adv.) | 0 | 0.221 | 0.366 | 0.619
Eigen et al. [9] | 795 | 0.341 | 0.451 | 0.656
Arsalan et al. [40] | 795 | 0.392 | 0.523 | 0.686
UM-Adapt-S | 795 | 0.444 | 0.536 | 0.739
Table 3: Quantitative comparison of different ablations of the UM-Adapt framework against prior art for semantic segmentation on the standard test-set of NYUD-v2.
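The three segmentation columns of Table 3 can be computed from per-class counts, as in the sketch below. Labels are flat lists of class indices, and skipping classes that never occur in the ground truth is an assumed convention; the benchmark scripts behind the reported numbers may differ.

```python
from typing import Dict, List


def segmentation_metrics(
    pred: List[int], gt: List[int], num_classes: int
) -> Dict[str, float]:
    """Mean IoU, mean per-class accuracy and overall pixel accuracy."""
    true_pos = [0] * num_classes
    pred_count = [0] * num_classes
    gt_count = [0] * num_classes
    for p, g in zip(pred, gt):
        pred_count[p] += 1
        gt_count[g] += 1
        if p == g:
            true_pos[g] += 1

    ious, accs = [], []
    for c in range(num_classes):
        if gt_count[c] == 0:
            continue  # assumed convention: ignore classes absent from ground truth
        union = gt_count[c] + pred_count[c] - true_pos[c]
        ious.append(true_pos[c] / union)
        accs.append(true_pos[c] / gt_count[c])

    return {
        "mean_iou": sum(ious) / len(ious),
        "mean_acc": sum(accs) / len(accs),
        "pixel_acc": sum(true_pos) / len(gt),
    }


# Class 0: IoU 1/2, accuracy 1/2. Class 1: IoU 2/3, accuracy 1.
print(segmentation_metrics(pred=[0, 1, 1, 1], gt=[0, 0, 1, 1], num_classes=2))
```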

Training details. We first train a set of task-transfer networks on the synthetic task label-maps separately for both indoor (PBRS) and outdoor (GTA5) scenes. For the indoor dataset, we train only two task-transfer networks, chosen considering the fact that surface-normal and depth estimation are more correlated than the other pairs of tasks. Similarly for outdoor, we choose the only two possible task-transfer combinations (depth to segmentation and segmentation to depth). Following this, two separate base-models are trained with full supervision on the synthetic source domain using Algorithm 1, with a different optimizer (Adam [26]) for each individual task. After obtaining a frozen, fully-trained source-domain network, the contour decoder is trained as discussed in Section 3.2, and it remains frozen during its further usage as a regularizer.

4.2 Evaluation of the UM-Adapt Framework

We have conducted a thorough ablation study to establish effectiveness of different components of the proposed UM-Adapt framework. We report results on the standard benchmark metrics as followed in literature for each of the individual tasks, to have a fair comparison against state-of-the-art approaches. Considering the difficulties of simultaneous multi-task learning, we have clearly segregated prior art based on single-task or multi-task optimization approaches in all the tables in this section.
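As a reference for the depth columns in Tables 1 and 4, the standard error and threshold-accuracy metrics can be sketched as follows. This is a minimal illustration on flat lists of positive depths; the function name `depth_metrics` is hypothetical and the code is not the evaluation script behind the reported numbers.

```python
import math
from typing import Dict, List


def depth_metrics(pred: List[float], gt: List[float]) -> Dict[str, float]:
    """Standard monocular-depth metrics over positive depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        "rel": sum(abs(p - g) / g for p, g in pairs) / n,           # mean relative error
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,      # squared relative error
        "rms": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),  # root mean squared error
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
    }
    for k in (1, 2, 3):
        # Fraction of pixels whose ratio max(p/g, g/p) is below 1.25^k.
        metrics[f"delta_{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics


# Two pixels: one exact, one over-estimated by a factor of 1.5.
print(depth_metrics(pred=[2.0, 3.0], gt=[2.0, 2.0]))
```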

Method | Target image supervision | rel | sq.rel | rms | rms(log10)
Eigen et al. [10] | Full | 0.203 | 1.548 | 6.307 | 0.282
Godard et al. [14] | Binocular | 0.148 | 1.344 | 5.927 | 0.247
Zhou et al. [66] | Video | 0.208 | 1.768 | 6.856 | 0.283
AdaDepth [30] | No | 0.214 | 1.932 | 7.157 | 0.295
Simultaneous multi-task learning
Multi-task baseline | No | 0.381 | 2.08 | 8.482 | 0.41
UM-Adapt-(noAdv.) | No | 0.28 | 1.99 | 7.791 | 0.346
UM-Adapt-(Adv.) | No | 0.27 | 1.98 | 7.823 | 0.336
UM-Adapt-S | few-shot | 0.201 | 1.72 | 5.876 | 0.259
Table 4: Quantitative comparison of ablations of the UM-Adapt framework against prior art for depth estimation on the Eigen test-split [10] of the KITTI dataset.

Ablation study of UM-Adapt. As a multi-task baseline, we report performance on the standard test-set of natural samples with direct inference on the frozen source-domain parameters without adaptation. With the exception of results on semantic segmentation, baseline performance for the other two regression tasks (i.e. depth estimation and surface-normal prediction) is strong enough to support the idea of achieving first-level generalization using multi-task learning. However, the prime focus of UM-Adapt is to achieve the second level of generalization through unsupervised domain adaptation. In this regard, to analyze the effectiveness of the proposed CCR regularization (Section 3.2) against FCF [30], we conduct an experiment on the UM-Adapt-B framework defined in Section 3.3.1. The reported benchmark numbers for unsupervised adaptation from PBRS to NYUD (see Tables 1, 2 and 3) indicate the advantage of CCR for adaptation of structured prediction tasks on most metrics. Following this inference, all later ablations (i.e. UM-Adapt-(noAdv.), UM-Adapt-(Adv.) and UM-Adapt-S) use only CCR as the content-regularizer.

| Method | Image supervision | Mean IOU |
|---|---|---|
| FCN-Wild [21] | 0 | 0.271 |
| CDA [62] | 0 | 0.289 |
| DC [56] | 0 | 0.376 |
| Cycada [20] | 0 | 0.348 |
| AdaptSegNet [55] | 0 | 0.424 |
| Simultaneous multi-task learning | | |
| Multi Task Baseline | 0 | 0.224 |
| UM-Adapt-(noAdv.) | 0 | 0.408 |
| UM-Adapt-(Adv.) | 0 | 0.420 |
| UM-Adapt-S | 500 | 0.544 |

Table 5: Quantitative comparison of UM-Adapt ablations against prior art for semantic segmentation on the validation set of the Cityscapes dataset.
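
The mean IOU reported in Table 5 can be illustrated with a minimal sketch that accumulates a confusion matrix over flattened label maps. The function name `mean_iou`, the `ignore_label` value of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the paper does not describe its evaluation code here.

```python
def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean intersection-over-union from flat lists of class indices.

    Ground-truth pixels equal to `ignore_label` are excluded
    (an assumed convention, not taken from the paper).
    """
    # confusion[g][p]: pixels with ground-truth class g predicted as class p
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue
        confusion[g][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Toy usage with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
```

The function returns a fraction in [0, 1], matching the scale used in Table 5; multiplying by 100 gives the percentage scale used in Table 7.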

Utilizing gradients from the frozen task-transfer network yields a clear improvement over UM-Adapt-B, as shown in Tables 1, 2 and 3 for all three tasks on the NYUD dataset. This highlights the significance of effectively exploiting inter-task correlation for adaptation of a multi-task learning framework. To quantify the importance of multiple task-transfer networks against employing a single such network, we designed another ablation setting, denoted UM-Adapt-(noAdv.)-1, which utilizes only one task-transfer network for adaptation to NYUD, as reported in Tables 1, 2 and 3. Next, we compare the two proposed energy-based cross-task distillation frameworks (Section 3.3.2), i.e. a) UM-Adapt-(Adv.) and b) UM-Adapt-(noAdv.). Although UM-Adapt-(Adv.) shows a minimal improvement over its counterpart, training of UM-Adapt-(noAdv.) is found to be significantly more stable and faster, as it does not update the parameters of the task-transfer networks during adaptation.

Figure 4: Qualitative comparison of different ablations of UM-Adapt, i.e. a) Multi-task baseline, b) UM-Adapt-(Adv.), and c) UM-Adapt-S.

Comparison against prior structured-prediction works.   The final unsupervised multi-task adaptation result of the best UM-Adapt variant, i.e. UM-Adapt-(Adv.), is comparable to previous fully-supervised approaches (see Tables 1 and 3). Note that UM-Adapt must simultaneously balance performance across multiple tasks in a unified architecture, whereas prior art focuses on a single task at a time. This demonstrates the strength of the proposed approach towards the final objective of generalization across both tasks and data domains. The semi-supervised variant, UM-Adapt-S, achieves state-of-the-art multi-task learning performance when compared against other fully-supervised approaches, as highlighted in Tables 1, 2 and 3.

Note that adapting depth estimation on KITTI and semantic segmentation on Cityscapes in a single UM-Adapt framework is a much harder task, due to the input domain discrepancy (cross-city [5]) along with the challenges of simultaneous multi-task optimization. Even in such a drastic scenario, UM-Adapt achieves reasonable performance in both depth estimation and semantic segmentation compared to the other unsupervised single-task adaptation approaches reported in Tables 4 and 5.

Comparison against prior multi-task learning works.   Tables 6 and 7 present a comparison of UM-Adapt with recent multi-task learning approaches [25, 64] on the NYUD test-set and the Cityscapes validation-set respectively. They highlight the state-of-the-art performance achieved by UM-Adapt-S as a result of the proposed cross-task distillation framework.

| Method | Sup. | Depth rms Err. (m) | Seg. Err. | Normals Err. |
|---|---|---|---|---|
| Kendall et al. [25] | 30k | 0.702 | - | 0.182 |
| GradNorm [64] | 30k | 0.663 | 67.5 | 0.155 |
| UM-Adapt-S | 795 | 0.637 | 55.6 | 0.139 |

Table 6: Test error on NYUDv2 with ResNet as the base-model.

| Method | Semantic (mean IOU) |
|---|---|
| Kendall et al. [25] (Uncert. Weights) | 51.52 |
| Liu et al. [35] | 52.68 |
| UM-Adapt-S | 54.4 |

Table 7: Validation mIOU on Cityscapes, where all the approaches are trained simultaneously for segmentation and depth estimation.
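
To make the margins in Table 6 concrete, the short sketch below computes the relative error reduction of UM-Adapt-S over GradNorm [64] using only the numbers in the table. Reading the "Sup." column as a count of labeled training images, with "30k" meaning 30,000, is an assumption; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

# (GradNorm [64], UM-Adapt-S) error pairs copied from Table 6.
table6 = {
    "depth_rms_m": (0.663, 0.637),
    "seg_err": (67.5, 55.6),
    "normals_err": (0.155, 0.139),
}
reductions = {name: relative_reduction(b, o) for name, (b, o) in table6.items()}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower error")

# Assumed reading of the "Sup." column: 795 vs. 30,000 labeled images.
label_fraction = 795 / 30_000
print(f"labels used: {100 * label_fraction:.2f}% of 30k")
```

Under these assumptions the reductions come to roughly 3.9% for depth, 17.6% for segmentation and 10.3% for surface normals, while using under 3% of the labeled images.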

Transferability of the learned representation.   One of the overarching goals of UM-Adapt is to learn a general-purpose visual representation with improved transferability across both tasks and data domains. To evaluate this, we perform experiments on large-scale representation learning benchmarks. Following the evaluation protocol of Doersch and Zisserman [8], we set up UM-Adapt for transfer learning on the ImageNet [49] classification and PASCAL VOC 2007 detection tasks. The base trunk up to the Res5 block is initialized from our UM-Adapt-(Adv.) variant (the adaptation of PBRS to NYUD) for both the classification and detection tasks. For ImageNet classification, we train randomly initialized fully connected layers after the output of the Res5 block. Similarly, for detection we use Faster-RCNN [45] with 3 different output heads for object proposal, classification, and localization after the Res4 block. We fine-tune all the network weights separately for classification and detection [8]. The results in Table 8 highlight the superior transfer learning performance of our learned representation, even on novel unseen tasks and with a smaller backbone-net.

| Method | Backbone | ImageNet top5 | PASCAL VOC 2007 detection |
|---|---|---|---|
| Motion Seg. [42] | ResNet-101 | 48.29 | 61.13 |
| Exemplar [8] | ResNet-101 | 53.08 | 60.94 |
| RP+Col+Ex+MS [8] | ResNet-101 | 69.30 | 70.53 |
| UM-Adapt-S | ResNet-50 | 69.51 | 70.02 |

Table 8: Transfer learning results on novel unseen tasks.
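
The initialization step of the transfer protocol described above (copy the adapted trunk, start the new task layers from random weights, then fine-tune everything) can be sketched without any framework. This is an illustration only: the function `init_for_transfer`, the parameter names, and the Gaussian initialization with standard deviation 0.01 are all hypothetical and not taken from the paper.

```python
import random

def init_for_transfer(pretrained, model_shapes, trunk_prefixes, seed=0):
    """Build a parameter dict for a downstream model.

    pretrained:     name -> flat list of floats from the adapted network
    model_shapes:   name -> number of elements in the downstream model
    trunk_prefixes: name prefixes treated as the shared trunk
    Returns (params, copied_names, randomly_initialized_names).
    """
    rng = random.Random(seed)
    prefixes = tuple(trunk_prefixes)
    params, copied, fresh = {}, [], []
    for name, size in model_shapes.items():
        reusable = (
            name.startswith(prefixes)
            and name in pretrained
            and len(pretrained[name]) == size
        )
        if reusable:
            # Copy so that later fine-tuning leaves the source weights untouched.
            params[name] = list(pretrained[name])
            copied.append(name)
        else:
            # New task-specific layer: small random values (assumed scheme).
            params[name] = [rng.gauss(0.0, 0.01) for _ in range(size)]
            fresh.append(name)
    return params, copied, fresh

# Toy usage with hypothetical layer names.
pretrained = {
    "res4.conv.weight": [0.1, 0.2],
    "res5.conv.weight": [0.3, 0.4],
    "depth_head.weight": [0.5],  # source-task head, not reused downstream
}
model_shapes = {"res4.conv.weight": 2, "res5.conv.weight": 2, "fc.weight": 3}
params, copied, fresh = init_for_transfer(
    pretrained, model_shapes, trunk_prefixes=["res4.", "res5."]
)
```

In this toy case the two trunk tensors are copied and `fc.weight` is freshly initialized; since all weights are fine-tuned afterwards, none of the copied tensors would be frozen.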

5 Conclusion

The proposed UM-Adapt framework addresses two important aspects of generalized feature learning by formulating the problem as multi-task adaptation. While multi-task training ensures that a task-agnostic representation is learned, the unsupervised domain adaptation method provides a domain-agnostic representation projected to a common spatially-structured latent representation. The idea of exploiting cross-task coherence as an important cue for preserving spatial regularity can be utilized in many other scenarios involving fully-convolutional architectures. Exploiting an auxiliary-task setting is another direction that remains to be explored in this context.

Acknowledgements. This work was supported by a CSIR Fellowship (Jogendra) and a grant from RBCCPS, IISc. We also thank Google India for the travel grant.


  • [1] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In ICCV, Cited by: §3.2.
  • [2] A. Atapour-Abarghouei and T. P. Breckon (2018) Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, Cited by: §2, §3.2.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, Cited by: §2.
  • [4] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §2.
  • [5] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. F. Wang, and M. Sun (2017) No more discrimination: cross city adaptation of road scene segmenters. In ICCV, Cited by: §1, §4.2.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.1.
  • [7] G. Csurka (2017) Domain adaptation for visual applications: a comprehensive survey. arXiv preprint arXiv:1702.05374. Cited by: §1, §2.
  • [8] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In ICCV, Cited by: §4.2, Table 8.
  • [9] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 3.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, pp. 2366–2374. Cited by: §2, §4.1, Table 1, Table 4.
  • [11] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §2.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2.
  • [13] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi (2015) Domain generalization for object recognition with multi-task autoencoders. In ICCV, Cited by: §1.
  • [14] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: Table 4.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2.
  • [16] A. Gretton, AJ. Smola, J. Huang, M. Schmittfull, KM. Borgwardt, and B. Schölkopf Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning, Cited by: §2.
  • [17] H. Han, A. K. Jain, S. Shan, and X. Chen (2017) Heterogeneous face attribute estimation: a deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [18] K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher (2016) A joint many-task model: growing a neural network for multiple nlp tasks. arXiv preprint arXiv:1611.01587. Cited by: §1.
  • [19] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the knowledge in a neural network. Cited by: §3.1.1.
  • [20] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §2, §3.2, Table 5.
  • [21] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: Table 5.
  • [22] W. Hong, Z. Wang, M. Yang, and J. Yuan (2018) Conditional generative adversarial network for structured domain adaptation. In CVPR, Cited by: §2.
  • [23] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.
  • [24] O. H. Jafari, O. Groth, A. Kirillov, M. Y. Yang, and C. Rother (2017) Analyzing modular cnn architectures for joint depth prediction and semantic segmentation. In ICRA, Cited by: Table 1.
  • [25] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, Cited by: §1, §2, §4.2, Table 6, Table 7.
  • [26] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [27] S. Kong and C. Fowlkes (2017) Recurrent scene parsing with perspective understanding in the loop. arXiv preprint arXiv:1705.07238. Cited by: Table 3.
  • [28] S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. Venkatesh Babu (2016) Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation. In CVPR, Cited by: §2.
  • [29] J. N. Kundu, M. Gor, P. K. Uppala, and V. B. Radhakrishnan (2019) Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In WACV, Cited by: §1.
  • [30] J. N. Kundu, P. K. Uppala, A. Pahuja, and R. V. Babu (2018) AdaDepth: unsupervised content congruent adaptation for depth estimation. In CVPR, Cited by: §1, §1, §2, §3.1.2, §3.2, §3.2, §3.3.1, §4.2, Table 4.
  • [31] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 3DV, Cited by: §3.1, §4.1, Table 1.
  • [32] G. Lin, A. Milan, C. Shen, and I. D. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR, Cited by: Table 3.
  • [33] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, Cited by: Table 3.
  • [34] F. Liu, C. Shen, and G. Lin (2015) Deep convolutional neural fields for depth estimation from a single image. In CVPR, Cited by: Table 1.
  • [35] S. Liu, E. Johns, and A. J. Davison (2018) End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704. Cited by: Table 7.
  • [36] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: Table 3.
  • [37] M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In ICML, Cited by: §1, §2.
  • [38] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In NIPS, Cited by: §3.3.1.
  • [39] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert (2016) Cross-stitch networks for multi-task learning. In CVPR, Cited by: §1, §2.
  • [40] A. Mousavian, H. Pirsiavash, and J. Košecká (2016) Joint semantic segmentation and depth estimation with deep convolutional networks. In 3DV, Cited by: Table 3.
  • [41] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2018) Image to image translation for domain adaptation. In CVPR, Cited by: §3.2.
  • [42] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In CVPR, Cited by: Table 8.
  • [43] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia (2018) GeoNet: geometric neural network for joint depth and surface normal estimation. In CVPR, Cited by: Table 2.
  • [44] R. Ranjan, V. M. Patel, and R. Chellappa (2017) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • [45] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.2.
  • [46] Z. Ren and Y. J. Lee (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In CVPR, Cited by: §2, §2.
  • [47] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, Cited by: §4.1.
  • [48] A. Roy and S. Todorovic (2016) Monocular depth estimation using neural regression forest. In CVPR, Cited by: Table 1.
  • [49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.2.
  • [50] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In NIPS, Cited by: §1.
  • [51] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In CVPR, Cited by: §3.2.
  • [52] A. Saxena, M. Sun, and A. Y. Ng (2009) Make3d: learning 3d scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840. Cited by: Table 1.
  • [53] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §4.1.
  • [54] B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In ECCV Workshops, Cited by: §2.
  • [55] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §3.3, Table 5.
  • [56] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In ICCV, Cited by: §2, Table 5.
  • [57] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §1, §1, §1, §2.
  • [58] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §1.
  • [59] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille (2015) Towards unified depth and semantic prediction from a single image. In CVPR, Cited by: Table 1.
  • [60] X. Wang, D. Fouhey, and A. Gupta (2015) Designing deep networks for surface normal estimation. In CVPR, Cited by: Table 2.
  • [61] J. Yao, S. Fidler, and R. Urtasun (2012) Describing the scene as a whole: joint object detection, scene classification and semantic segmentation. In CVPR, Cited by: §2.
  • [62] Y. Zhang, P. David, and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, Cited by: §1, Table 5.
  • [63] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. Funkhouser (2017) Physically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR, Cited by: §4.1, Table 2, Table 3.
  • [64] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, Cited by: §2, §4.2, Table 6.
  • [65] J. Zhao, M. Mathieu, and Y. LeCun (2017) Energy-based generative adversarial network. In ICLR, Cited by: §1, §3.3.2, §3.3.2.
  • [66] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Cited by: Table 4.
  • [67] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.