Deep networks have proven highly successful in a wide range of computer vision problems. They not only excel in classification or recognition tasks, but also deliver comparable improvements on complex spatially-structured prediction tasks
like semantic segmentation and monocular depth estimation. However, the generalizability of such models is a major concern before deploying them in a target environment, since they exhibit an alarming dataset or domain bias [5, 29]. To address this, researchers have focused on unsupervised domain adaptation approaches. In a fully-unsupervised setting without target annotations, one effective strategy [13, 58] is to minimize the domain discrepancy at a latent feature level so that the model extracts domain-agnostic, task-specific representations. Although such approaches are very effective for classification or recognition tasks, they yield suboptimal performance when adapting fully-convolutional architectures, which are particularly essential for spatial prediction tasks. One of the major issues in such scenarios is the spatially-structured, high-dimensional latent representation, in contrast to a vectorized form. Moreover, preserving spatial regularity while avoiding mode-collapse becomes a significant challenge when adapting in a fully unsupervised setting.
As we aim toward human-level performance, there is a need to explore scalable learning methods that can yield generic image representations with improved transferability across both tasks and data domains. Multi-task learning [39, 25] is an emerging field of research in this direction, where the objective is to realize a task-agnostic visual representation by jointly training a model on several complementary tasks [18, 37]. In general, such networks are difficult to train, as they require explicit attention to balance performance across each individual task. Moreover, such approaches address only a single aspect of the final objective (i.e. generalization across tasks), ignoring the other important aspect of generalization across data domains.
In this paper, we focus on multi-task adaptation of spatial prediction tasks, proposing efficient solutions to the specific difficulties discussed above. To deliver optimal performance in both generalization across tasks and robustness to input domain shift, we formulate a multi-task adaptation framework called UM-Adapt. To preserve spatial-regularity information during the unlabelled adversarial adaptation procedure, we propose two novel regularization techniques. Motivated by the fact that the output representations share a common spatial structure with the input image, we first introduce a novel contour-based content regularization procedure. Additionally, we formalize the idea of exploiting cross-task coherency as an important cue to further regularize the multi-task adaptation process.
Consider a base-model trained on two different tasks, task-A and task-B. Can we use the supervision of task-A to learn the output representation of task-B, and vice versa? Such an approach is feasible particularly when the tasks in consideration share some common characteristics (i.e. consistency in spatial structure across task outputs). Following this reasoning, we introduce a cross-task distillation module (see Figure 1). The module consists of multiple encoder-decoder architectures, called task-transfer networks, which are trained to recover the representation of a certain task as output, given a combination of the remaining tasks as input. The overarching motivation behind such a framework is to effectively balance performance across all the tasks in consideration, thereby avoiding domination by the easier tasks during training. To understand the effectiveness of cross-task distillation intuitively, consider a training state of the base-model where the performance on task-A is much better than on task-B. Here the task-transfer network (trained to output the task-B representation using the base-model's task-A prediction as input) will yield improved performance on task-B, as a result of the dominant base task-A performance. This results in a clear discrepancy between the base task-B performance and the task-B performance obtained through the task-transfer network. We minimize this discrepancy, which in turn acts as a regularization encouraging balanced learning across all the tasks.
In single-task domain adaptation approaches [57, 30], it is common to employ an ad-hoc discriminator to minimize the domain discrepancy. However, in the presence of the cross-task distillation module, this approach highly complicates the overall training pipeline. Avoiding such a direction, we instead design a unified framework that effectively addresses both diverse objectives, i.e. a) realizing a balanced performance across all the tasks, and b) performing domain adaptation in an unsupervised setting. Taking inspiration from energy-based GANs, we re-utilize the task-transfer networks, treating the transfer discrepancies as outputs of energy functions, to adversarially minimize the domain discrepancy in a fully-unsupervised setting.
Our contributions in this paper are as follows:
We propose a simplified yet effective unsupervised multi-task adaptation framework, utilizing two novel regularization strategies: a) contour-based content regularization (CCR), and b) exploitation of inter-task coherency using a cross-task distillation module.
Further, we adopt a novel direction by effectively utilizing the cross-task distillation loss as an energy function to adversarially minimize the input domain discrepancy in a fully-unsupervised setting.
UM-Adapt yields state-of-the-art transfer-learning results on ImageNet classification and comparable performance on the PASCAL VOC 2007 detection task, even with a smaller backbone-net. The resulting semi-supervised framework outperforms the current fully-supervised multi-task learning state-of-the-art on both the NYUD and Cityscapes datasets.
2 Related work
Domain adaptation. Recent adaptation approaches for deep networks focus on minimizing domain discrepancy by optimizing distance functions related to higher-order statistical distributions. Works following adversarial discriminative approaches [56, 57, 11, 12] draw on Generative Adversarial Networks to bridge the domain gap. Recent domain adaptation approaches, particularly those targeting spatially-structured prediction tasks, can be broadly divided into two sub-branches: a) pixel-space adaptation and b) feature-space adaptation. In pixel-space adaptation [2, 20, 3], the objective is to train an image-translation network that can transform an image from the target domain to resemble an image from the source domain. On the other hand, feature-space adaptation approaches focus on minimizing various statistical distance metrics [16, 37, 54, 22] at some latent feature level, mostly in a fully-shared source-and-target-domain parameter setting. However, unshared setups show improved adaptation performance as a result of learning dedicated filter parameters for each of the domains in consideration [57, 30]. But a fully unshared setup comes with other difficulties, such as mode-collapse due to inconsistent output in the absence of paired supervision [23, 67]. Therefore, an optimal strategy is to adapt the minimal possible set of parameters separately for the target domain, in a partially-shared architecture setting.
Multi-task learning. A central difficulty in multi-task learning is balancing performance across the individual tasks. To realize this, a trivial direction is to formulate a multi-task loss function that weighs the relative contribution of each task, giving equal importance to individual task performance. It is desirable to formulate techniques that adaptively modify the relative weighting of individual tasks as a function of the current learning state or iteration. Kendall et al. proposed a principled approach utilizing a joint likelihood formulation to derive task weights based on the intrinsic uncertainty of individual tasks. Chen et al. proposed a gradient normalization (GradNorm) algorithm that automatically balances training in deep multi-task models by dynamically tuning gradient magnitudes. Another line of work focuses on learning task-agnostic, generalized visual representations utilizing advances in multi-task learning techniques [61, 46].
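Adaptive task weighting of the kind described above can be illustrated with the uncertainty-based formulation. The following is a minimal sketch, not the authors' implementation: each task loss is scaled by its (assumed) homoscedastic uncertainty, and the log-variance terms stand in for what would normally be trainable parameters.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses using homoscedastic-uncertainty weighting:
    each task loss is scaled by 1/(2*sigma^2), and a log(sigma) term
    discourages sigma from growing without bound.

    task_losses, log_sigmas: sequences of length num_tasks. In training,
    log_sigmas would be learnable parameters of the model."""
    log_sigmas = np.asarray(log_sigmas, dtype=float)
    sigmas_sq = np.exp(2.0 * log_sigmas)
    weighted = np.asarray(task_losses, dtype=float) / (2.0 * sigmas_sq) + log_sigmas
    return float(weighted.sum())

# A task with higher assumed uncertainty (larger sigma) contributes less
# to the combined objective, so no single task can dominate the gradient.
total = uncertainty_weighted_loss([1.0, 4.0], [0.0, 1.0])
```

Here the second task's raw loss is four times larger, yet its weighted contribution is damped by its larger assumed uncertainty.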
3 Approach

Here we define the notation and problem setting for unsupervised multi-task adaptation. Consider source input image samples with corresponding outputs for a set of three complementary tasks, namely monocular depth estimation, semantic segmentation, and surface-normal estimation. We have full access to the source image-output pairs, as they are extracted from a synthetic graphical environment. The objective of UM-Adapt is to estimate the most reliable task-based predictions for an unknown target-domain input. Considering natural images as samples from the target domain, to emulate an unsupervised setting we restrict access to the corresponding target task-specific outputs. Note that the objective can be readily extended to a semi-supervised setting, where output annotations are available for only a few input samples from the target domain.
3.1 UM-Adapt architecture
As shown in Figure 2, the base multi-task adaptation architecture is motivated by the standard CNN encoder-decoder framework. An encoder maps the source domain input to a spatially-structured latent representation. Following this, three different decoders with up-convolutional layers are employed, one for each of the three tasks in consideration (see Figure 2). Initially, the entire architecture is trained with full supervision on source domain data. To effectively balance performance across all the tasks, we introduce the cross-task distillation module as follows.
3.1.1 Cross-task Distillation module
This module aims to recover the representation of a certain task through a transfer function that takes a combined representation of all the other tasks as input. For each task in the full set of tasks in consideration, a task-transfer network takes the combined outputs of the remaining tasks as input and produces an estimate of that task as output. The parameters of each task-transfer network are obtained by optimizing a task-transfer loss on ground-truth maps and are kept frozen in further stages of training. One can then feed the predictions of the base-model through the frozen task-transfer network to obtain another, indirect estimate of the task. Following this, we define the distillation-loss for a task as the discrepancy between its direct base-model prediction and this indirect prediction obtained via the other tasks. While optimizing parameters of the base-model, this distillation-loss is utilized as one of the important loss components to realize an effective balance across all the task objectives (see Algorithm 1). The proposed learning algorithm does not allow any single task to dominate the training process, since the least-performing task will exhibit a higher discrepancy and hence will be given more importance in subsequent training iterations.
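The distillation discrepancy can be sketched as follows. This is a toy illustration under stated assumptions: the fixed random linear map `transfer_a_to_b` is a hypothetical stand-in for a trained and frozen task-transfer encoder-decoder, and the flat vectors stand in for dense prediction maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: base-model predictions for two tasks on the same image,
# flattened to vectors for illustration.
pred_a = rng.standard_normal(16)   # direct base prediction for task A
pred_b = rng.standard_normal(16)   # direct base prediction for task B

# A frozen "task-transfer network" mapping task-A space to task-B space.
# In UM-Adapt this is an encoder-decoder trained on ground-truth maps
# and then frozen; a fixed random linear map stands in for it here.
W_ab = rng.standard_normal((16, 16)) * 0.1

def transfer_a_to_b(a):
    return W_ab @ a

def distillation_loss(pred_a, pred_b):
    """Discrepancy between the direct task-B prediction and the indirect
    one obtained via task A. Minimizing it w.r.t. the base-model only
    (the transfer net stays frozen) pushes the tasks toward coherence."""
    return float(np.mean((transfer_a_to_b(pred_a) - pred_b) ** 2))

loss = distillation_loss(pred_a, pred_b)
```

When the direct prediction already agrees with the transferred one, the discrepancy vanishes; a weak task therefore contributes a large gradient and receives more attention.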
Compared to the general knowledge-distillation framework, one can consider the frozen task-transfer network's indirect prediction to be analogous to the output of a teacher network, and the direct base-model prediction as the output of a student network. Here the objective is to optimize the parameters of the base-model by effectively employing the distillation loss, which in turn enforces coherence among the individual task performances.
3.1.2 Architecture for target domain adaptation
Following a partially-shared adaptation setup, a separate latent-mapping encoder is introduced specifically for the target-domain samples (see Figure 2). In line with AdaDepth, we initialize it with the pre-trained parameters of its source-domain counterpart (a ResNet-50 encoder) in order to start from a good baseline initialization. Following this, only the Res-5 block parameters of the additional encoder branch are updated. Note that the unsupervised adaptation algorithm does not update the other layers in the task-specific decoders, nor the initial shared layers up to the Res-4f block.
3.2 Contour-based Content Regularization (CCR)
Spatial content inconsistency is a serious problem for unsupervised domain adaptation focused on pixel-wise dense-prediction tasks [20, 30]. To address this, a Feature Consistency Framework (FCF) has been proposed, in which a cyclic feature-reconstruction setup preserves spatially-structured content, such as semantic contours, at the Res-4f activation map of the frozen ResNet-50 (up to Res-4f) encoder. However, the small spatial size of the Res-4f output activation is insufficient to capture the spatial regularities required to alleviate the contour-alignment problem.
To address the above issue, we propose a novel Contour-based Content Regularization (CCR) method. As shown in Figure 3a, we introduce a shallow (4-layer) contour decoder to reconstruct only the contour-map of the given input image, where the ground-truth is obtained using a standard contour prediction algorithm. We refer to this mean-squared loss as the content regularization loss in further sections of the paper. Assuming that the majority of image-based contours align with the contours of the task-specific output maps, we argue that the encoded feature (Res-5c activation) must retain the contour information during adversarial training for improved adaptation performance. This makes CCR clearly superior to existing approaches that enforce image-reconstruction-based regularization [2, 41, 51], by simplifying the additionally introduced decoder architecture, which is freed from the burden of generating irrelevant color-based appearance. The contour decoder is trained on the frozen source encoder's output and the corresponding ground-truth contour pairs. However, unlike FCF regularization, the parameters of the contour decoder are not updated during adversarial learning, as the expected contour map is independent of the source or target encoder transformation. As a result, the content regularization loss is treated as the output of an energy function, which is later minimized with respect to the target encoder to bridge the discrepancy between the source and target latent distributions during adaptation, as shown in Algorithm 2.
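A minimal sketch of the CCR objective follows. The assumptions are explicit: a finite-difference gradient magnitude is a crude stand-in for the learned (HED-style) contour detector used in the paper, and `decoded_contour` stands in for the output of the frozen contour decoder.

```python
import numpy as np

def contour_map(img):
    """Crude contour proxy: gradient magnitude of a grayscale image.
    The paper uses a learned contour detector; simple finite
    differences stand in for it here."""
    gy, gx = np.gradient(img)           # derivatives along rows, columns
    return np.sqrt(gx ** 2 + gy ** 2)

def ccr_loss(decoded_contour, img):
    """Mean-squared error between the contour decoder's output and the
    contour map of the input image (the fixed regularization target)."""
    target = contour_map(img)
    return float(np.mean((decoded_contour - target) ** 2))

img = np.outer(np.linspace(0.0, 1.0, 8), np.ones(8))  # vertical ramp image
perfect = contour_map(img)                            # ideal decoder output
```

Since the target contour map depends only on the input image, the loss can be treated as a fixed energy that the target encoder must keep low during adaptation.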
3.3 Unsupervised Multi-task adaptation
In unsupervised adaptation, the overall objective is to minimize the discrepancy between the source and target input distributions. However, matching the target prediction distribution with the source ground-truth distribution can better overcome the differences between ground-truth and prediction than matching it with the source prediction distribution, as proposed in some previous approaches. Aiming toward optimal performance, UM-Adapt focuses on matching the target predictions with the actual ground-truth map distribution, and the proposed cross-task distillation module provides a means to effectively realize such an objective.
3.3.1 UM-Adapt baseline (UM-Adapt-B)
Prior work shows the efficacy of simultaneous adaptation at hierarchical feature levels while minimizing domain discrepancy for multi-layer deep architectures. Motivated by this, we design a single discriminator that matches the joint distribution of the latent representation and the final task-specific structured prediction maps with the corresponding true joint distribution. As shown in Figure 2, the predicted joint distribution is matched with the true distribution following the usual adversarial discriminative strategy (see Supplementary for more details). We denote this framework as UM-Adapt-B in further sections of this paper.
3.3.2 Adversarial cross-task distillation
Aiming toward a unified framework that addresses multi-task adaptation as a whole, we treat the task-transfer networks as energy functions to adversarially minimize the domain discrepancy. Following the analogy of energy-based GANs, the task-transfer networks are first trained to assign low energy to ground-truth task tuples from the source domain and high energy to the corresponding tuples formed from target predictions; this is realized by minimizing the energy objective defined in Algorithm 2. Conversely, the trainable parameters of the target encoder are updated to assign low energy to the target prediction tuples (see Algorithm 2). Along with the previously introduced CCR regularization, these terms constitute the final update for the target encoder. We use a different optimizer for the energy function of each task. As a result, the target encoder is optimized to achieve a balanced performance across all the tasks even in a fully unsupervised setting. We denote this framework as UM-Adapt-(Adv.) in further sections of this paper.
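The two adversarial objectives can be sketched in the energy-based GAN form. This is an illustrative sketch, not the paper's exact formulation: `MARGIN` is a hypothetical hinge hyperparameter, and the scalar energies stand in for the task-transfer discrepancies of Algorithm 2.

```python
MARGIN = 1.0  # hinge margin m for the energy objective (a hyperparameter)

def energy_losses(e_source, e_target, m=MARGIN):
    """EBGAN-style objectives with the task-transfer discrepancy as the
    energy. The transfer networks (playing the discriminator) are pushed
    to assign low energy to source ground-truth tuples and energy above
    the margin to target-prediction tuples; the target encoder (playing
    the generator) is updated to lower the energy of its own predictions."""
    loss_discriminator = e_source + max(0.0, m - e_target)
    loss_generator = e_target
    return loss_discriminator, loss_generator
```

Once the target energy exceeds the margin, the hinge term vanishes and the transfer networks stop pushing it higher, which keeps the two players from diverging.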
Note that the task-transfer networks are trained only on ground-truth output-maps, under sufficient regularization due to the compressed latent representation of the encoder-decoder setup. This enables them to learn a good approximation of the intended cross-task energy manifold, even in the absence of negative examples (target samples). This analogy is used in Algorithm 1 to treat the frozen task-transfer networks as energy functions that realize a balanced performance across all the tasks on the fully-supervised source domain samples. Following this, we formulate an ablation of UM-Adapt in which we restrain the parameter update of the task-transfer networks in Algorithm 2. We denote this framework as UM-Adapt-(noAdv.) in further sections of this paper. This modification gracefully simplifies the unsupervised adaptation algorithm, as it finally retains only the target encoder's unshared parameters as trainable, with the task-transfer networks kept frozen.
4 Experiments

To demonstrate the effectiveness of the proposed framework, we evaluate on three different publicly available benchmark datasets, separately for indoor and outdoor scenes. In this section, we discuss details of our adaptation setting and analyze results on standard evaluation metrics for a fair comparison against prior art.
4.1 Experimental Setting
We follow the encoder-decoder architecture exactly as proposed by Laina et al. The decoder architecture is replicated three times to form the three task-specific decoders. However, the number of feature maps and the nonlinearity of the final task-based prediction layers are adapted to the standard requirements of each task. We use the BerHu loss as the loss function for the depth estimation task. Following Eigen et al., an inverse of the element-wise dot product on unit normal vectors at each pixel location is considered as the loss function for surface-normal estimation. Similarly, for segmentation, a classification-based cross-entropy loss is implemented with a weighting scheme to balance gradients from different classes depending on their coverage.
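The two regression losses can be written down concretely. This is an illustrative sketch under stated assumptions: the per-batch threshold c = 0.2·max|residual| follows the common BerHu convention, and the normal loss is expressed here as a negative mean dot product of unit normals.

```python
import numpy as np

def berhu_loss(pred, gt, c=None):
    """Reverse-Huber (BerHu) loss: L1 for small residuals, scaled L2
    above a threshold c (commonly set per batch to 0.2 * max|residual|)."""
    r = np.abs(np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float))
    if c is None:
        c = 0.2 * r.max()
    quadratic = (r ** 2 + c ** 2) / (2.0 * c)
    return float(np.mean(np.where(r <= c, r, quadratic)))

def normal_loss(pred_n, gt_n):
    """Surface-normal loss: negative mean dot product of unit normals.
    Pixels with perfectly matching normals contribute -1, so lower is
    better; normals are shaped (num_pixels, 3)."""
    pred_u = pred_n / np.linalg.norm(pred_n, axis=-1, keepdims=True)
    gt_u = gt_n / np.linalg.norm(gt_n, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(pred_u * gt_u, axis=-1)))
```

BerHu behaves like L1 near zero (robust to small noise) while still penalizing large residuals quadratically, which suits the heavy-tailed error distribution of depth regression.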
We also consider a semi-supervised setting (UM-Adapt-S), where training starts from the initialization of the trained unsupervised version, UM-Adapt-(Adv.). For better generalization, alternate batches of labelled (optimizing the supervised loss) and unlabelled (optimizing the unsupervised loss) target samples are used to update the network parameters.
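The alternating schedule above can be sketched as follows; this is a trivial illustration (in training, each step would compute the corresponding loss and update the target encoder):

```python
def semi_supervised_schedule(num_steps):
    """Alternate labelled (supervised loss) and unlabelled (unsupervised
    adaptation loss) target batches, as in the UM-Adapt-S setting.
    Returns the loss type used at each training step."""
    return ["supervised" if t % 2 == 0 else "unsupervised"
            for t in range(num_steps)]
```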
Datasets. For representation learning on indoor scenes, we use the publicly available NYUD-v2 dataset, which has been used extensively for supervised multi-task prediction of depth estimation, semantic segmentation and surface-normal estimation. The processed version of the dataset consists of 1449 sample images with a standard split of 795 for training and 654 for testing. While adapting in the semi-supervised setting, we use the corresponding ground-truth maps of all three tasks (795 training images) for the supervised loss. The CNN takes an input of size 228×304 with various scale and flip augmentations, and outputs three task-specific maps, each of size 128×160. For the synthetic counterpart, we use 100,000 randomly sampled synthetic renders from the PBRS dataset, along with the corresponding clean ground-truth maps (for all three tasks), as the source domain samples.
To demonstrate the generalizability of UM-Adapt, we consider outdoor-scene datasets for two different tasks, semantic segmentation and depth estimation. For the synthetic source domain, we use the publicly available GTA5 dataset consisting of 24966 images with the corresponding depth and segmentation ground-truths. However, for real outdoor scenes, the widely used KITTI dataset does not have semantic labels compatible with the synthetic counterpart, while the natural-image Cityscapes dataset does not contain ground-truth depth maps. Therefore, to formulate a simultaneous multi-task learning problem and to perform a fair comparison against prior art, we consider the Eigen test-split on KITTI for comparison of depth-estimation results and the Cityscapes validation set to benchmark our outdoor segmentation results in a single UM-Adapt framework. For the semi-supervised setting, we feed alternate KITTI and Cityscapes minibatches with the corresponding ground-truth maps for supervision. Here, the input and output resolutions of the network are 256×512 and 128×256 respectively.
Training details. We first train a set of task-transfer networks on the synthetic task label-maps, separately for indoor (PBRS) and outdoor (GTA5) scenes. For the indoor dataset, we train only two of the possible task-transfer networks, considering the fact that surface-normal and depth estimation are more correlated than other pairs of tasks. Similarly for outdoor scenes, we use the only two possible task-transfer combinations, between depth and segmentation. Following this, two separate base-models are trained with full supervision on the synthetic source domain using Algorithm 1, with a different optimizer (Adam) for each individual task. After obtaining a frozen, fully-trained source-domain network, the contour decoder is trained as discussed in Section 3.2 and remains frozen during its further usage as a regularizer.
4.2 Evaluation of the UM-Adapt Framework
We have conducted a thorough ablation study to establish the effectiveness of the different components of the proposed UM-Adapt framework. We report results on the standard benchmark metrics followed in the literature for each of the individual tasks, to allow a fair comparison against state-of-the-art approaches. Considering the difficulties of simultaneous multi-task learning, we clearly segregate prior art based on single-task or multi-task optimization in all the tables in this section.
Ablation study of UM-Adapt. As a multi-task baseline, we report performance on the standard test-set of natural samples with direct inference on the frozen source-domain parameters, without adaptation. With the exception of the results on semantic segmentation, the baseline performance for the two regression tasks (i.e. depth estimation and surface-normal prediction) is strong enough to support the idea of achieving a first level of generalization using multi-task learning. However, the prime focus of UM-Adapt is to achieve a second level of generalization through unsupervised domain adaptation. In this regard, to analyze the effectiveness of the proposed CCR regularization (Section 3.2) against FCF, we conduct an experiment on the UM-Adapt-B framework defined in Section 3.3.1. The reported benchmark numbers for unsupervised adaptation from PBRS to NYUD (see Tables 1, 2 and 3) clearly indicate the superiority of CCR for adaptation of structured prediction tasks. Following this inference, all later ablations (i.e. UM-Adapt-(noAdv.), UM-Adapt-(Adv.) and UM-Adapt-S) use only CCR as the content regularizer.
Utilizing gradients from the frozen task-transfer networks yields a clear improvement over UM-Adapt-B, as shown in Tables 1, 2 and 3 for all three tasks on the NYUD dataset. This highlights the significance of effectively exploiting inter-task correlation for adaptation of a multi-task learning framework. To quantify the importance of multiple task-transfer networks against employing a single such network, we designed another ablation setting, denoted UM-Adapt-(noAdv.)-1, which utilizes only one task-transfer network for adaptation to NYUD, as reported in Tables 1, 2 and 3. Next, we report a comparison between the proposed energy-based cross-task distillation frameworks (Section 3.3.2), i.e. a) UM-Adapt-(Adv.) and b) UM-Adapt-(noAdv.). Though UM-Adapt-(Adv.) shows a marginal improvement over its counterpart, training of UM-Adapt-(noAdv.) is found to be significantly more stable and faster, as it does not include parameter updates of the task-transfer networks during the adaptation process.
Comparison against prior structured-prediction works. The final unsupervised multi-task adaptation result of the best UM-Adapt variant, i.e. UM-Adapt-(Adv.), delivers comparable performance against previous fully-supervised approaches (see Tables 1 and 3). One must consider the explicit challenge faced by UM-Adapt of simultaneously balancing performance across multiple tasks in a unified architecture, as compared to prior art focusing on a single task at a time. This clearly demonstrates the superiority of the proposed approach toward the final objective of realizing generalization across both tasks and data domains. The semi-supervised variant, UM-Adapt-S, achieves state-of-the-art multi-task learning performance when compared against other fully-supervised approaches, as clearly highlighted in Tables 1, 2 and 3.
Note that adapting depth estimation on KITTI and semantic segmentation on Cityscapes in a single UM-Adapt framework is a much harder task, due to the input domain discrepancy (cross-city) on top of the challenges of the simultaneous multi-task optimization setting. Even in such a drastic scenario, UM-Adapt achieves reasonable performance in both depth estimation and semantic segmentation as compared to other unsupervised single-task adaptation approaches, as reported in Tables 4 and 5.
Comparison against prior multi-task learning works. Tables 6 and 7 present comparisons of UM-Adapt with recent multi-task learning approaches [25, 64] on the NYUD test-set and the Cityscapes validation-set, respectively. They clearly highlight the state-of-the-art performance achieved by UM-Adapt-S as a result of the proposed cross-task distillation framework.
Transferability of the learned representation. One of the overarching goals of UM-Adapt is to learn a general-purpose visual representation that demonstrates improved transferability across both tasks and data domains. To evaluate this, we perform experiments on large-scale representation learning benchmarks. Following the evaluation protocol of Doersch et al., we set up UM-Adapt for transfer learning on ImageNet classification and PASCAL VOC 2007 detection tasks. The base trunk up to the Res-5 block is initialized from our UM-Adapt-(Adv.) variant (adapted from PBRS to NYUD) for both classification and detection. For ImageNet classification, we train randomly initialized fully-connected layers on the output of the Res-5 block. Similarly, for detection we use Faster-RCNN with three different output heads for object proposal, classification, and localization after the Res-4 block. We finetune all the network weights separately for classification and detection. The results in Table 8 clearly highlight the superior transfer-learning performance of our learned representation, even for novel unseen tasks with a smaller backbone-net.
5 Conclusion

The proposed UM-Adapt framework addresses two important aspects of generalized feature learning by formulating the problem as a multi-task adaptation approach. While the multi-task training ensures learning of a task-agnostic representation, the unsupervised domain adaptation method provides a domain-agnostic representation, projected to a common spatially-structured latent space. The idea of exploiting cross-task coherence as an important cue for the preservation of spatial regularity can be utilized in many other scenarios involving fully-convolutional architectures. Exploitation of auxiliary-task settings is another direction that remains to be explored in this context.
Acknowledgements. This work was supported by a CSIR Fellowship (Jogendra) and a grant from RBCCPS, IISc. We also thank Google India for the travel grant.
-  (2015) Holistically-nested edge detection. In ICCV, Cited by: §3.2.
-  (2018) Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, Cited by: §2, §3.2.
-  (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, Cited by: §2.
-  (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §2.
-  (2017) No more discrimination: cross city adaptation of road scene segmenters. In ICCV, Cited by: §1, §4.2.
The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.1.
-  (2017) Domain adaptation for visual applications: a comprehensive survey. arXiv preprint arXiv:1702.05374. Cited by: §1, §2.
-  (2017) Multi-task self-supervised visual learning. In ICCV, Cited by: §4.2, Table 8.
-  (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, Cited by: §1, §2, §4.1, §4.1, Table 1, Table 2, Table 3.
-  (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, pp. 2366–2374. Cited by: §2, §4.1, Table 1, Table 4.
Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §2.
Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2.
Domain generalization for object recognition with multi-task autoencoders. In ICCV, Cited by: §1.
-  (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: Table 4.
-  (2014) Generative adversarial nets. In NIPS, Cited by: §2.
-  Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning, Cited by: §2.
-  (2017) Heterogeneous face attribute estimation: a deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
-  (2016) A joint many-task model: growing a neural network for multiple nlp tasks. arXiv preprint arXiv:1611.01587. Cited by: §1.
-  (2014) Distilling the knowledge in a neural network. Cited by: §3.1.1.
-  (2018) Cycada: cycle-consistent adversarial domain adaptation. In ICML, Cited by: §2, §3.2, Table 5.
-  (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: Table 5.
-  (2018) Conditional generative adversarial network for structured domain adaptation. In CVPR, Cited by: §2.
-  (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §2.
-  (2017) Analyzing modular cnn architectures for joint depth prediction and semantic segmentation. In ICRA, Cited by: Table 1.
-  (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, Cited by: §1, §2, §4.2, Table 6, Table 7.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2017) Recurrent scene parsing with perspective understanding in the loop. arXiv preprint arXiv:1705.07238. Cited by: Table 3.
-  (2016) Saliency unified: a deep architecture for simultaneous eye fixation prediction and salient object segmentation. In CVPR, Cited by: §2.
-  (2019) Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In WACV, Cited by: §1.
-  (2018) AdaDepth: unsupervised content congruent adaptation for depth estimation. In CVPR, Cited by: §1, §2, §3.1.2, §3.2, §3.3.1, §4.2, Table 4.
-  (2016) Deeper depth prediction with fully convolutional residual networks. In 3DV, Cited by: §3.1, §4.1, Table 1.
-  (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR, Cited by: Table 3.
-  (2016) Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, Cited by: Table 3.
-  (2015) Deep convolutional neural fields for depth estimation from a single image. In CVPR, Cited by: Table 1.
-  (2018) End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704. Cited by: Table 7.
-  (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: Table 3.
-  (2015) Learning transferable features with deep adaptation networks. In ICML, Cited by: §1, §2.
-  (2016) Unsupervised domain adaptation with residual transfer networks. In NIPS, Cited by: §3.3.1.
-  (2016) Cross-stitch networks for multi-task learning. In CVPR, Cited by: §1, §2.
-  (2016) Joint semantic segmentation and depth estimation with deep convolutional networks. In 3DV, Cited by: Table 3.
-  (2018) Image to image translation for domain adaptation. In CVPR, Cited by: §3.2.
-  (2017) Learning features by watching objects move. In CVPR, Cited by: Table 8.
-  (2018) GeoNet: geometric neural network for joint depth and surface normal estimation. In CVPR, Cited by: Table 2.
-  HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.2.
-  (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In CVPR, Cited by: §2.
-  (2016) Playing for data: ground truth from computer games. In ECCV, Cited by: §4.1.
-  (2016) Monocular depth estimation using neural regression forest. In CVPR, Cited by: Table 1.
-  (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.2.
-  (2016) Improved techniques for training GANs. In NIPS, Cited by: §1.
-  (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In CVPR, Cited by: §3.2.
-  (2009) Make3D: learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5), pp. 824–840. Cited by: Table 1.
-  (2012) Indoor segmentation and support inference from rgbd images. In ECCV, Cited by: §4.1.
-  (2016) Deep CORAL: correlation alignment for deep domain adaptation. In ECCV Workshops, Cited by: §2.
-  (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §3.3, Table 5.
-  (2015) Simultaneous deep transfer across domains and tasks. In ICCV, Cited by: §2, Table 5.
-  (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §1, §2.
-  (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §1.
-  (2015) Towards unified depth and semantic prediction from a single image. In CVPR, Cited by: Table 1.
-  (2015) Designing deep networks for surface normal estimation. In CVPR, Cited by: Table 2.
-  (2012) Describing the scene as a whole: joint object detection, scene classification and semantic segmentation. In CVPR, Cited by: §2.
-  (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, Cited by: §1, Table 5.
-  Physically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR, Cited by: §4.1, Table 2, Table 3.
-  (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, Cited by: §2, §4.2, Table 6.
-  (2017) Energy-based generative adversarial network. In ICLR, Cited by: §1, §3.3.2, §3.3.2.
-  (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Cited by: Table 4.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §2.