Anchor Tasks: Inexpensive, Shared, and Aligned Tasks for Domain Adaptation

08/16/2019, by Zhizhong Li et al., Snap Inc.

We introduce a novel domain adaptation formulation from a synthetic dataset (source domain) to a real dataset (target domain) for the category of tasks with per-pixel predictions. The annotations of these tasks are relatively hard to acquire in the real world, such as single-view depth estimation or surface normal estimation. Our key idea is to introduce anchor tasks, whose annotations are (1) less expensive to acquire than those of the main task, such as facial landmarks and semantic segmentation; (2) available for both the synthetic and the real dataset, so that they serve as an "anchor" between the two domains; and finally (3) spatially aligned with the main task annotations on a per-pixel basis, so that they also serve as a spatial anchor between the tasks' outputs. To further exploit the spatial alignment between the anchor and main tasks, we introduce a novel freeze approach that freezes the final layers of our network after training on the source domain, so that the spatial and contextual relationship between the tasks is maintained when adapting on the target domain. We evaluate our methods on two pairs of datasets, performing surface normal estimation on indoor scenes and on faces, using semantic segmentation and facial landmarks as the respective anchor tasks. We show the importance of using anchor tasks in both the synthetic and the real domain, and that the freeze approach outperforms competing approaches, reaching results on facial images on par with a state-of-the-art system that leverages a detailed facial appearance model.


1 Introduction

Figure 1: Illustration of our formulation compared to unsupervised domain adaptation. Although target domain main task labels are hard or expensive to obtain, we can use cheaper, easily available labels from an ‘‘anchor’’ task to help align the domains with clear correspondence between images and the anchor task label space.

While deep learning achieves pinnacle success with the powerful machinery of supervised learning on tasks with abundant annotations (e.g. image classification and object detection), its progress is throttled on tasks whose annotations are unavailable or much harder to acquire manually, such as depths [32, 43], surface normals [29], 3D poses [39], and albedo textures [29]. Fortunately, these tasks can access a seemingly infinite amount of annotations from synthetically-generated datasets. However, the domain gap between these synthetic data and the real data remains significant, which prevents supervised learning methods from being applied directly. How to perform domain adaptation from the source domain to the target domain and improve machine learning generalization remains an active line of recent research [34, 26, 28, 12].

Without further annotation on the target domain, unsupervised domain adaptation usually works by forcing the distribution (or conditional distribution) of some variables to be similar across domains [8, 37]. These variables can be the input [30], the intermediate features [8], or the output [36]. Performance improvements over a predictor trained only on the source domain have been shown. In practice, however, this distribution alignment approach to domain adaptation is often too coarse to account for the fine-grained distinctions between different datasets. For example, it is not uncommon that the target domain contains many more objects of one category than the source, and completely lacks another. Moreover, dataset-specific distinctive details (such as surface normal details and head poses) can make the alignment process, which usually involves training a discriminator, difficult to converge.

On the other hand, multi-task learning (MTL) has long been used to improve generalization across tasks [16, 42]. Inspired by MTL, using auxiliary tasks for domain adaptation on main tasks, which we name Task-Assisted Domain Adaptation (or ‘‘TADA’’), has also been studied in various specific problem settings -- e.g. using attributes for fine-grained classification [9], 2D human pose for 3D human pose [41], image-level labels for object detection [14], and indiscriminate grasping for instance grasping [5]. However, these formulations all rely on problem-specific mappings between the auxiliary tasks and the main tasks, so their applications are restricted to their own problem settings.

In this paper, we introduce a novel domain adaptation formulation from a synthetic dataset (source domain) to a real dataset (target domain) for the category of tasks with per-pixel predictions. The annotations of these tasks are relatively hard to acquire in the real world, such as single-view depth estimation or normal estimation. Our key idea is to introduce anchor tasks, whose annotations are (1) inexpensive to acquire since the anchor task is a ‘‘weaker’’ task than the main task (such as facial landmarks and semantic segmentation); one could even apply existing pre-trained models to generate anchor task annotations for our formulation; (2) available for both the synthetic and the real dataset, so that the anchor task serves as an ‘‘anchor’’ between the two domains; and finally (3) spatially aligned with the main task annotations on a per-pixel basis, so that it also serves as a spatial anchor between the tasks’ outputs.

Specifically, the anchor task can improve domain adaptation performance in two ways. (1) Since the anchor task is shared between domains, it aligns the domains in the feature space with a clear annotation-space correspondence. This alignment can be more fine-grained than the overall feature distribution matching used in unsupervised domain adaptation. (2) Since the annotations of the anchor task are spatially aligned with the main task’s, they help to more reliably infer the main task prediction on the target domain, regardless of whether the features are adapted. These advantages parallel how multi-task learning within one domain may already improve generalization performance.

To further utilize the spatial alignment between the anchor and main tasks, we adopt a Freeze approach inspired by Mostajabi et al. [23], who model the semantic segmentation label space structure by freezing a decoder. Specifically, we freeze the final layers of our network after they are trained on the source domain. Then we adapt the rest of the network to the target domain. Intuitively, this constrains the output to conform with the spatial and contextual relationship between the tasks’ annotation spaces that the final layers have learned on the source domain, thus discouraging the features from overfitting to only the anchor task in the target domain. In contrast to the problem-specific treatments between auxiliary tasks and main tasks [9, 41, 14, 5], our Freeze approach is generic and supports any pair of spatially-aligned auxiliary and main tasks. We demonstrate this with different anchor tasks, such as semantic segmentation and facial landmarks, for the same main task in a single training framework.

We evaluate our methods on two pairs of datasets, performing surface normal estimation on indoor scenes and on faces, using semantic segmentation and facial landmarks as the respective anchor tasks, both with a synthetic source domain. We show the importance of anchor labels in both domains, and that Freeze outperforms the compared approaches, reaching results on facial images on par with a state-of-the-art system that leverages a detailed facial appearance model. We also find that, surprisingly, distribution-matching adaptation methods may fail when the domains differ in their ground truth distributions, whereas Freeze is more robust.

In summary, our main contributions are:

  • We propose a novel domain adaptation formulation for per-pixel main tasks from a synthetic domain to a real domain using inexpensive, shared, and aligned anchor tasks.

  • We introduce a Freeze approach that further utilizes the spatial and contextual alignment between anchor tasks and the main task, and that can be applied to different pairings of anchor tasks with the same main task.

2 Related work

TADA constrained to specific task pairs. An emerging line of recent work uses multi-task learning or weakly supervised learning to help unsupervised domain adaptation. Gebru et al. [9] adapt fine-grained classification between an easy domain and in-the-wild images with the help of class attributes. They adopt a consistency loss between attributes and classes and the domain adaptation losses from Tzeng et al. [38]. Yang et al. [41] adapt lab-environment 3D human pose estimation to in-the-wild data with only 2D pose ground truth, by jointly training on 2D and 3D labels and aligning domains with a GAN-based discriminator. Fang et al. [5] adapt a robot grasping application from simulation to real images, and from the indiscriminate grasping task to instance-specific grasping. They perform joint training on all existing labels and an optional instance-mask input. Inoue et al. [14] adapt image object detection to paintings by generating pseudo-labels which are filtered using auxiliary image-level labels.

Our paper has two major differences from these prior works. (1) All prior works have very application-specific constraints on the formulation of their task-pair relationship, making them inapplicable to nearly all other cases. Gebru et al. [9] assume the auxiliary and main annotations to have a known linear relationship. Yang et al. [41] constrain the 2D pose and 3D pose to both use the same 2D pose output layer. Fang et al. [5] assume both tasks’ outputs are in the same format (binary prediction) and the tasks are differentiated by an extra input. Inoue et al. [14] must use a hard-coded procedure to filter erroneous outputs of the main detection task using the anchor task classification labels. In contrast, our work only requires that the two tasks’ annotations are spatially aligned, without any constraint on the output layers or the loss of each task -- a much weaker assumption -- and models the inter-task relationship without hard-coded domain knowledge. In this way, we show that the TADA formulation helps more generally, beyond these task-specific designs on very closely related tasks. (2) We focus on tasks with pixel-wise outputs, such as surface normal estimation or keypoint detection (in the form of heatmaps for each keypoint). We argue that such spatial information in the anchor task is more informative to the main task than e.g. attribute labels that are global to the sample.

Weakly supervised learning [44] uses a weaker label to help infer a stronger label, e.g. when inexact coarse categories are provided to help fine-grained classification [18]. Hu et al. [13] use network weights for object detection to predict weights for instance segmentation, thus learning to segment objects with only bounding box labels in a zero-shot learning manner. These weakly supervised frameworks differ from ours in that they (1) do not aim to use the weaker task to help domain adaptation, and (2) are task-specific in the design of network structure or task-pair consistency constraints (e.g. Hu et al. [13] require the main and auxiliary tasks to have overlapping categories).

Unsupervised Domain Adaptation [26, 28, 12, 30, 8] is similar to our formulation without the anchor task. Especially worth mentioning is Tsai et al. [36]. Instead of matching feature space distributions, they adapt the structured output space to have a similar distribution between domains. This is done by applying a GAN-based domain confusion loss over the outputs from the two domains, and optionally over the feature space as well. Our method additionally uses the anchor task on top of these methods, aiming for a more fine-grained adaptation through the correspondence the anchor brings. We experimentally show that UDA either works well with our method, or adding it degrades performance for nearly all methods due to systematic domain differences.

Semi-supervised Domain Adaptation, on the other hand, assumes a small number of target domain samples are labeled, compared to our assumption that an inexpensive task is labeled in both domains. This is a separate research direction orthogonal to ours.

Combining Multi-task Learning and Domain Adaptation [38, 7, 40] is topically similar. Besides the TADA works we mentioned earlier, most assume all tasks’ labels from the target domain are available, and some still require specially designed losses or constraints for the task pair (e.g. Tzeng et al. [38] constrain all tasks to be classification tasks). Whether their formulations remain effective in our unsupervised case is beyond the scope of our paper.

Transfer Learning (e.g. Taskonomy [42]), including Multi-task Learning [4] (e.g. UberNet [16]) and Meta-transfer Learning (e.g. MAML [6]), uses knowledge learned from one task to help another. Most methods assume all tasks are in the same domain or ignore the domain difference (e.g. Taskonomy [42], UberNet [16]), and some assume one task per domain or dataset (e.g. Liu et al. [20]). Our idea also makes use of the knowledge in one task to help another, but we are more interested in how having the anchor task in both domains can help domain adaptation. Compared to prior methods, we empirically show that the anchor task is needed in both domains to bridge the domain gap.

Among these, UberNet [16] has a formulation similar to our MTL-a baseline, but without an anchor task shared by all samples. The paper also ignores any domain difference and only focuses on tasks in datasets where it has supervision, so it does not address domain adaptation.

Some other methods consider performing the same task on different domains as multi-task learning [22], but under our formulation of multi-task learning (performing tasks with conceptually different labels) they are performing semi-supervised domain adaptation instead.

Modeling output spatial structure [23, 36, 37] is related to how we preserve the relationship between the two tasks’ outputs. Mostajabi et al. [23] regularize semantic segmentation by training an autoencoder on the semantic labels and forcing the network to use the fixed decoder to produce its prediction. We are inspired by these ideas, but focus on how the two tasks’ output spaces interact, and on generalizing across domains.

3 Method

Figure 2: Illustration of the compared methods and their training label usage. TADA methods (d-f) use the anchor task on both domains to establish a clear correspondence in the anchor task annotation space. Our Freeze method first trains only on the source domain, and then freezes the final network layers to consolidate the learned inter-task output spatial relationship.

To formulate our Task-Assisted Domain Adaptation (TADA), we first start from a brief review of Unsupervised Domain Adaptation (UDA). In UDA, we have labeled data $\mathcal{D}_S = \{(x_S, y_S)\}$ in the source domain, and unlabeled data $\mathcal{D}_T = \{x_T\}$ in the target domain. Only the test set of $\mathcal{D}_T$ may contain labels, for evaluation purposes; in the train set, only $x_T$ is provided. A model, perhaps of the form $f(x) = h(g(x))$, is trained on all available data, where $g$ is the network backbone for the input-to-feature mapping, and $h$ is the network head for the feature-to-prediction mapping. In this paper, unless otherwise specified, we refer to the networks’ second-to-last layer output as the features.

Usually, to reduce the domain gap, the features in both domains, $g(x_S)$ and $g(x_T)$, are encouraged to follow the same distribution [8] (although this can also be done in the output space [36]). However, it is usually not guaranteed that the ground truths follow the same distribution. When that is not the case, the ideal features and outputs have to be distributed differently too. Forcing either of them to follow similar distributions would make the prediction deviate from the ground truth.

In our Task-Assisted Domain Adaptation scenario, in addition to the main task, an anchor task is defined for both domains. The domains become $\mathcal{D}_S = \{(x_S, y^m_S, y^a_S)\}$ and $\mathcal{D}_T = \{(x_T, y^m_T, y^a_T)\}$. Here, $m$ and $a$ stand for the main and anchor tasks. In the train set of $\mathcal{D}_T$, only $y^a_T$ is provided, while $y^m_T$ is unknown or unavailable. A model, perhaps of the form $f(x) = (h_m(g(x)), h_a(g(x)))$, is trained on these data, where $h_m$ and $h_a$ are sub-modules specific to each task. In this work, we focus on the popular formulation above where the two tasks share the same network backbone $g$.

In this work, we consider that the anchor task exists solely to aid the learning of the main task. We evaluate only on the target domain main task, not on the anchor. If the anchor task is important, one can always train a separate model for it using a variety of transfer learning methods.

3.1 MTL-a for feature alignment

In prior work on Multi-task Learning (MTL), either all tasks are assumed to be in one domain, or one task is available in each domain ($\mathcal{D}_S$ main, and $\mathcal{D}_T$ anchor). Formally, with task-specific losses $\ell_m$ and $\ell_a$, there can be three supervised losses in the TADA scenario:

$$\mathcal{L}^m_S = \mathbb{E}_{(x_S,\, y^m_S) \sim \mathcal{D}_S}\; \ell_m\big(h_m(g(x_S)),\, y^m_S\big) \qquad (1)$$
$$\mathcal{L}^a_S = \mathbb{E}_{(x_S,\, y^a_S) \sim \mathcal{D}_S}\; \ell_a\big(h_a(g(x_S)),\, y^a_S\big) \qquad (2)$$
$$\mathcal{L}^a_T = \mathbb{E}_{(x_T,\, y^a_T) \sim \mathcal{D}_T}\; \ell_a\big(h_a(g(x_T)),\, y^a_T\big) \qquad (3)$$

In prior work, a multitask learning loss may comprise only two of the three:

$$\mathcal{L}_{\text{MTL-src}} = \mathcal{L}^m_S + \lambda\, \mathcal{L}^a_S \qquad (4)$$
$$\mathcal{L}_{\text{MTL-SmTa}} = \mathcal{L}^m_S + \lambda\, \mathcal{L}^a_T \qquad (5)$$

for the cases ‘‘everything in the source domain’’ and ‘‘source main + target anchor’’ respectively, where $\lambda$ is a weight balancing the tasks. We instead use the alternative baseline -- MTL-a (multitask learning with anchors) -- which simply uses all the supervised losses:

$$\mathcal{L}_{\text{MTL-a}} = \mathcal{L}^m_S + \lambda\,\big(\mathcal{L}^a_S + \mathcal{L}^a_T\big) \qquad (6)$$

It may be tempting to hypothesize that the losses in Eqs. 4-5 are enough for the TADA scenario, and that we do not need to collect anchor labels on both domains. Perhaps in the first case the multitask learning aspect can already improve model generalization, and in the second case the network is trained on the target domain, so it may be forced to adapt in order to perform well on the anchor task. One can also add an unsupervised domain adaptation loss to reduce the domain gap. We show experimentally that these baselines underperform MTL-a.
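As a concrete illustration, a minimal PyTorch-style sketch of one MTL-a training step (Eq. 6) is shown below. The names (`g`, `h_main`, `h_anchor`, `lam`, and the loss callables) are illustrative placeholders for whatever backbone, heads, and task losses are used, not our exact implementation (see Section 4.4).

```python
def mtl_a_step(g, h_main, h_anchor, batch_src, batch_tgt,
               loss_main, loss_anchor, lam, optimizer):
    """One MTL-a update: source main + source anchor + target anchor losses.

    `g`, `h_main`, `h_anchor` are torch nn.Modules (shared backbone and the
    two task heads); `lam` plays the role of the balancing weight in Eq. 6.
    """
    x_s, y_s_main, y_s_anchor = batch_src   # source: both labels available
    x_t, y_t_anchor = batch_tgt             # target: only anchor labels

    f_s, f_t = g(x_s), g(x_t)               # shared features for both domains

    l_s_main = loss_main(h_main(f_s), y_s_main)           # Eq. 1
    l_s_anchor = loss_anchor(h_anchor(f_s), y_s_anchor)   # Eq. 2
    l_t_anchor = loss_anchor(h_anchor(f_t), y_t_anchor)   # Eq. 3

    loss = l_s_main + lam * (l_s_anchor + l_t_anchor)     # Eq. 6
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```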

To prove that TADA works in a more general formulation than the very task-pair-specific ones from prior work, we do not impose any further constraints or losses between the tasks. Yet the anchor losses still act as anchors that establish a correspondence between $\mathcal{D}_S$ and $\mathcal{D}_T$ in the anchor label space.

In addition to any of these supervised losses, an unsupervised domain adaptation loss can be added. For example, adversarial losses (a.k.a. GAN losses) on the features or the output space are used in prior work [8, 36]:

$$\mathcal{L}_{\text{adv}} = -\,\mathbb{E}_{x_T \sim \mathcal{D}_T} \log D\big(g(x_T)\big) \qquad (7)$$

where $D$ is the discriminator network, trained in a mini-max fashion,

$$\min_D\; -\,\mathbb{E}_{x_S \sim \mathcal{D}_S} \log D\big(g(x_S)\big) \;-\; \mathbb{E}_{x_T \sim \mathcal{D}_T} \log\big(1 - D(g(x_T))\big) \qquad (8)$$

We refer our readers to the prior work [8, 36] for details.
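For concreteness, a minimal PyTorch sketch of a feature-level adversarial alignment of this kind is shown below. The discriminator architecture and names are illustrative assumptions in the spirit of [8], not the exact configurations from prior work (our actual discriminator settings are described in Section A.1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Toy per-location domain discriminator over (B, C, H, W) feature maps."""
    def __init__(self, feat_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_channels, 256, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, kernel_size=3, padding=1),  # source-vs-target logit
        )

    def forward(self, feat):
        return self.net(feat)

def adversarial_losses(D, f_s, f_t):
    """D learns to tell source features from target features (Eq. 8-style),
    while the backbone is trained to make target features look like source
    ones (Eq. 7-style)."""
    bce = F.binary_cross_entropy_with_logits
    d_src, d_tgt = D(f_s.detach()), D(f_t.detach())
    d_loss = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))
    d_tgt_adv = D(f_t)                      # gradients flow back into the backbone
    adv_loss = bce(d_tgt_adv, torch.ones_like(d_tgt_adv))
    return d_loss, adv_loss
```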

3.2 Freeze for preserving inter-task relationship

Building on MTL-a, we further propose our Freeze method to exploit the output space relationship between the two tasks, which helps guide the target domain main task based on the anchor. However, with no target main labels, this relationship can only be learned on the source domain.

Analogous to the observation in Mostajabi et al. [23], the final layers of a trained multitask network can act as a decoder from its input feature space to the joint label space of the two tasks. These final layers are trained to produce outputs on the joint manifold of the two label spaces. In other words, improperly paired outputs (such as a side-facing surface with a ceiling label, or a concave surface with a nose label) are outside this manifold, and ideally the network learns not to output them. This joint manifold can be seen as a soft constraint on the relationship between the tasks’ labels, and we assume that the constraint may be learned internally by those layers.

We first train the multitask network on the source domain, using $\mathcal{L}^m_S + \lambda\,\mathcal{L}^a_S$ (Eq. 4). When it approaches convergence (or just before it overfits to one of the tasks), we freeze the parameters of its final layers. We then train only the lower layers jointly on all available labels using $\mathcal{L}_{\text{MTL-a}}$ (Eq. 6), forcing their output to go through the pre-trained final layers.

For implementation details such as network structure, please see Section 4.4.
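As a rough sketch of the two-stage procedure, using hypothetical module names (`backbone` for the lower layers, `final_layers` for the frozen top layers holding both task heads), Freeze could be implemented as follows; this is an illustration under those assumptions, not our exact training code.

```python
import itertools
import torch

def train_freeze(backbone, final_layers, src_loader, joint_loader,
                 source_loss, mtl_a_loss, src_steps, adapt_steps, lr=1e-4):
    # Stage 1: train the whole network on the source domain only (Eq. 4).
    params = itertools.chain(backbone.parameters(), final_layers.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _, (x, targets) in zip(range(src_steps), src_loader):
        loss = source_loss(final_layers(backbone(x)), targets)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the final layers, which now encode the joint
    # (main, anchor) output structure learned on the source domain.
    for p in final_layers.parameters():
        p.requires_grad = False

    # Adapt only the lower layers, on all available labels (Eq. 6).
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _, (x, targets) in zip(range(adapt_steps), joint_loader):
        loss = mtl_a_loss(final_layers(backbone(x)), targets)
        opt.zero_grad(); loss.backward(); opt.step()
```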

4 Experiment setup

We validate our methods and claims on two sets of experiments, facial images and indoor scenes, both adapting from synthetic data to real images -- our motivating scenario.

4.1 Facial images

We perform facial surface normal estimation as the main task, and for the anchor task we choose 3D facial keypoint detection with automatically generated ground truth. Intuitively, 3D keypoints can inform much, but not all, of the surface normal information, and thus provide a good form of guidance. As 3D keypoints can currently be reliably generated by methods that generalize well across domains, we use this setting to show whether essentially free anchor task labels can still be helpful for another label-deprived task.

We adapt from synthetic data generated by Sengupta et al. [29] (‘‘SfSsyn’’), which uses 3DMM models [1]. The dataset provides facial images with surface normal ground truth, with synthetic faces both frontal and looking to the side. We change the reference frame of the surface normals to camera coordinates to follow the definition used by all other datasets. In our experiments we found that both domain adaptation techniques we compare to fail to adapt; to analyze the reason, we generate a second version of the dataset with only frontal faces (‘‘SfSsyn-front’’), with a rotation distribution closely following the estimated poses of the target dataset.

For the target domain, we use real data from FaceWarehouse [3] (‘‘FaceWH’’). The dataset provides facial models fitted using a morphable model followed by a Laplacian-based mesh deformation without any PCA reduction, so the surface normals rendered from them are both clean and faithful to the raw RGBD scans.

Neither of the two datasets provides an official split. We split the subjects (separated by dataset folders) into 70% for training, and 15% each for validation and test.

On both datasets, we use the state-of-the-art method of Bulat et al. [2] to extract both 3D keypoints and 2D keypoints using their separate models. The 3D keypoints are used as anchor training ground truth. We compute the facial region mask from the 2D keypoints for evaluation, which is standard practice in facial surface normal estimation [35, 29].

During training, we use the standard losses for both tasks: for surface normal estimation, the cosine loss (see [35]); for 3D keypoint detection, a heatmap regression for the 2D positions and a vector regression for depth (see [2]). For evaluating surface normals, we use five metrics from the literature. Specifically, the angular difference between the predicted 3D surface normal and the ground truth is treated as the error and computed for each pixel. We then aggregate the root mean square angular error (RMSE), the mean of the error (Mean), the median of the error (Median), and the percentages of pixels with errors below 11.25° and 30°. Only valid regions are considered, so we ignore pixels outside the face or where there is no ground truth (e.g. where depth is missing and the surface normal cannot be correctly estimated).
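For reference, the sketch below shows one way these angular-error metrics can be computed from per-pixel predictions, assuming unit-normalized normal maps and a boolean validity mask; it is an illustration, not our exact evaluation code.

```python
import numpy as np

def normal_metrics(pred, gt, valid_mask):
    """Angular-error metrics for surface normal estimation.

    pred, gt: (H, W, 3) unit-length normal maps; valid_mask: (H, W) bool
    marking pixels inside the face / where ground truth exists.
    """
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))[valid_mask]   # per-pixel angular error (deg)
    return {
        "RMSE":    np.sqrt(np.mean(err ** 2)),
        "Mean":    np.mean(err),
        "Median":  np.median(err),
        "<11.25°": np.mean(err < 11.25),           # fraction of valid pixels
        "<30°":    np.mean(err < 30.0),
    }
```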

4.2 Indoor scenes

We again perform surface normal estimation as the main task, but use semantic segmentation as the anchor task. Semantic boundaries can inform discontinuities in surface normal space, and some categories, such as ceilings, have very constrained normal directions. Other categories, with no fixed shape or expected direction, can be harder to improve.

We adapt from the SUNCG dataset [33] with physically-based rendering [43], which provides images, semantic segmentation, and surface normal ground truth. We use NYUdv2 [31] as the target domain, with additional surface normals estimated from depth by Ladicky et al. [17]. We only use the labeled portion of the dataset.

SUNCG is large, so we use a 90%-5%-5% split for train, validation, and test. We use NYUdv2’s official split. Normal estimation loss and metrics are the same as before, and semantic segmentation is trained using cross-entropy.

4.3 Compared methods

We compare our methods to the baselines shown in Fig. 2: single task learning (STL), multitask learning with only one domain (MTL-src), multitask learning with source main task and target anchor task (MTL-SmTa) as used in prior work such as Liu et al. [20], and MTL-a. Since we address the domain adaptation problem, we also compare to unsupervised adaptation methods that apply either a multi-level version of Ganin et al. [8] or the state-of-the-art Tsai et al. [36] on top of these baselines (*+DA). STL+DA is the unsupervised domain adaptation method, which we denote by DA. Adversarial training is brittle and not all configurations work well. We implement our own version, perform hyperparameter tuning, and omit some of the underperforming combinations.

We also compare to an oracle method that uses both tasks’ labels on both domains, including the target domain main task. This gauges how far each method is from fully successful adaptation. For facial surface normals, we compare to a state-of-the-art intrinsic decomposition method, SfSNet [29], which produces state-of-the-art surface normals based on extra domain knowledge (a lighting model for unsupervised learning). We use their released model trained on synthetic data and on unlabeled CelebA [21], a much larger dataset. This comparison only serves to show that our method is effective rather than being a controlled experiment, since neither our network structure nor our external knowledge is similar.

4.4 Implementation details

Code will be released for our experiments at https://github.com/lizhitwo/TADA. We use a ResNet50 [11] with FPN [19] as our network backbone, with the ResNet pre-trained on ImageNet [25]. We use the variant with 3 upsampling layers with skip connections, and use a deconvolution layer as the output layer for both tasks, making the output 50% of the input resolution. For Freeze, we freeze the layers after the second upsampling layer, including any skip connection weights. Some tasks require additional non-spatial outputs: a common practice in 3D keypoint estimation [2] is to output a heatmap for the projected 2D positions and a vector for 3D depth, so we add a fully-connected branch of 2 layers with 256 hidden units after global average pooling over the second upsampling layer’s output. Batches of the same size are sampled from each domain at each iteration. We choose $\lambda$ so that losses from different domains and tasks have similar magnitudes. For adversarial training and dataset processing, please refer to our supplemental material.

Hyperparameter tuning is hard in TADA, just like in any unsupervised domain adaptation, due to the lack of target domain main task ground truth for validation. Although by evaluating against the available ground truth we can tune most hyperparameters (e.g. stopping criteria, learning rate, layers to freeze), some parameters critical to the target main task (e.g. discriminator network complexity, its learning rate, and loss weights) may barely cause any change in it. We empirically find that the discriminator accuracy being very frequently lower than 55% and the absence of artifacts are good indicators of successful adaptation, and tune the parameters accordingly.

5 Results

                    Faces: SfSsyn→FaceWH                        Indoor: SUNCG→NYUdv2
                    ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓     ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓
STL                 0.424     0.929   17.8   14.8   12.8        0.298     0.683   33.5   25.8   18.8
MTL-src             0.409     0.935   17.7   14.9   13.1        0.280     0.666   34.1   26.6   19.8
MTL-SmTa            0.162     0.791   24.3   21.8   20.4        0.260     0.662   32.9   26.2   20.6
MTL-a               0.492     0.953   16.0   13.3   11.4        0.275     0.675   32.4   25.7   19.8
Freeze (ours)       0.519     0.954   15.8   12.9   10.9        0.301     0.708   31.8   24.6   18.0
Oracle              0.907     0.995   7.8    6.2    5.2         0.340     0.734   30.4   23.1   16.5
SfSNet [29]         0.495*    0.965   15.2   12.9*  11.3*       --        --      --     --     --
with unsupervised domain adaptation [8, 36]:
DA                  0.456     0.937   17.2   14.2   12.1        0.316     0.703   33.3   25.2   17.6
MTL-src             0.402     0.932   18.0   15.1   13.3        0.304     0.698   33.2   25.4   18.1
MTL-SmTa            0.216     0.854   22.0   19.5   18.1        0.328     0.720   31.9   24.2   16.9
MTL-a               0.455     0.946   16.7   13.9   12.1        0.309     0.708   32.2   24.7   17.7
Freeze (ours)       0.455     0.935   17.2   14.2   12.1        0.316     0.715   32.0   24.4   17.4
Table 1: Comparison in our two experimental settings. Based on performance, unsupervised domain adaptation with Tsai et al. [36] is shown for indoor scenes, and with Ganin et al. [8] for faces; the other combinations underperform (see supplemental). Our baseline MTL-a outperforms the other MTL variants, indicating the importance of shared anchor tasks. Our Freeze method is comparable to the state-of-the-art surface normals estimated by SfSNet, without the use of a lighting model. Adaptation fails to improve some of the methods for faces. With adaptation, MTL-a and Freeze still outperform the other MTL variants in most cases. Statistical significance is computed from 3 runs. (*) marks entries where a method with extra domain knowledge performs equal to or worse than our best performing method.

Facial images. Table 1 shows our results on SfSsyn to FaceWH adaptation. The baselines STL, MTL-src, and MTL-SmTa all underperform. Two observations are interesting: (1) MTL-src does not perform very differently from STL on the target domain, indicating that the effect of multi-task learning alone is limited here. (2) MTL-SmTa vastly underperforms STL when trained with one task per domain. We hypothesize that although the network sees target domain images, the task performed on them is too different, which encourages the network to learn very different features for the two tasks, harming adaptation. For these baselines, adding domain adaptation does improve their results, but the effect is not strong enough to close the gap to our best performing methods.

Our TADA baseline MTL-a outperforms both MTL-SmTa and MTL-src. This indicates the importance of the anchor task being trained on both domains, affirming our hypothesis. Freeze further improves all criteria by a margin, implying that the inter-task relationship learned in the source domain can be helpful for the target domain as well. Freeze is comparable to the state-of-the-art SfSNet [29], which underperforms on Median and ≤11.25° but outperforms on RMSE and ≤30°.

Perhaps the most surprising observation is that adding unsupervised domain adaptation to either MTL-a or Freeze hurts performance instead of improving it. In fact, Freeze without adaptation is the best method apart from the oracle (and SfSNet). The adaptation puts them at the same level as DA [8], eliminating any advantage brought by the anchor task. We have extensively tuned the adversarial loss hyperparameters, yet still cannot find a configuration that does not hurt performance. In comparison, the two compared TADA methods work naturally. We analyze the reason for this robustness in Section 5.1.

Indoor scenes. In contrast to the facial experiments, all MTL variants here suffer from negative transfer, i.e. main task performance degrades as the second task is jointly learned. This makes DA [36] a strong method. We still observe that our baseline MTL-a outperforms the other MTL variants, indicating that MTL-a has an adaptation effect the other variants do not possess, despite the negative transfer. We also observe that Freeze brings a much larger improvement over MTL-a than in the facial experiments, and outperforms DA [36] on three metrics. Freeze +DA [36] performs second best, closing much of the gap between STL and the oracle. Surprisingly, MTL-SmTa with adaptation [36] outperforms every method, despite MTL-SmTa performing the worst. We attribute this to the instability of adversarial-loss-based adaptation. We are not aware of prior work observing that the MTL-SmTa+DA [36] formulation, which specifically omits the source domain anchor task and applies adaptation, outperforms a variety of methods. Nonetheless, Freeze +DA [36] outperforms all other MTL variants.

                      ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓
STL                   0.418     0.913   18.6   15.3   13.0
DA [8]                0.495     0.944   16.6   13.5   11.3
MTL-a                 0.540     0.958   15.4   12.5   10.5
MTL-a +DA [8]         0.560     0.961   15.0   12.1   10.2
Freeze (ours)         0.550     0.958   15.2   12.4   10.4
Freeze (ours)+DA [8]  0.573     0.963   14.7   11.9   10.0
Table 2: Facial normal estimation, with SfSsyn-frontal as the source domain, which has a head pose distribution similar to FaceWH. In this experiment, domain adaptation [8] always helps performance, indicating that adaptation by matching distributions is not robust to systematic dataset differences, while our methods are.

5.1 Analysis

Frontal faces. We analyze why the compared domain adaptation methods fail in the SfSsyn→FaceWH experiment. After trial and error, we found that the difference in head pose distributions between the domains may be a major contributor. In Table 2 we therefore evaluate the domain adaptation methods with SfSsyn-frontal as the source domain instead, with all methods using the same hyperparameters.

Figure 3: PCA visualization of the compared methods’ feature space at different facial keypoint locations. (a) The oracle does not have fully overlapping domains due to systematic distribution differences. (b) Using domain adaptation to force the distributions to be similar deviates the features from the oracle and hurts performance (see Table 1). (c) Training multitask learning with one task per domain encourages using separate feature space regions for different domains. (d) MTL-a produces feature distributions more visually similar to the oracle. Disclaimer: STL’s visualization (not shown) is also similar to the oracle, so apparent similarity alone cannot indicate higher performance. Other facial locations may not exhibit the observed behavior as clearly. Best viewed in color.
Figure 4: Qualitative results for the compared methods. For domain adaptation, Ganin et al. [8] is shown for facial images (top), and Tsai et al. [36] is shown for indoor scenes (bottom). Best viewed in color.

The trends and conclusions are exactly the same, except that unsupervised domain adaptation now always helps, making our Freeze +DA the top method. This experiment indicates that the distributional difference is indeed why the adaptations [8, 36] fail. We conclude that while these prior works are effective, they can hurt performance when the domains are systematically differently distributed, whereas our methods are more robust to such differences. While these differences may sometimes be easily eliminated in the data synthesis procedure, at other times they may be expensive to eliminate, or difficult to pinpoint.

Impact on feature space. To better understand the impact of different methods on the feature distribution, we visualize their feature spaces for the source and target domains. Since features at different spatial locations may encode information differently, we extract the features at separate facial keypoint locations in the facial experiment. For each location (e.g. the nose tip), we perform PCA, obtain the top two components, and visualize them in Fig. 3. Please refer to its caption for observations. This experiment supports our hypothesis that training MTL-SmTa with one task per domain maps source and target to different feature space regions, and that blindly matching feature distributions may be suboptimal.
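The sketch below illustrates this visualization procedure with scikit-learn; the array shapes, stride handling, and variable names are assumptions for illustration rather than our exact analysis code.

```python
import numpy as np
from sklearn.decomposition import PCA

def keypoint_feature_pca(features, keypoints, feat_stride):
    """Project features sampled at one facial keypoint onto their top-2 PCA axes.

    features:    (N, C, H, W) backbone features for N images (both domains)
    keypoints:   (N, 2) pixel coordinates (x, y) of one keypoint per image
    feat_stride: downsampling factor between input image and feature map
    """
    coords = np.round(keypoints / feat_stride).astype(int)
    sampled = np.stack([features[i, :, y, x]                 # (N, C)
                        for i, (x, y) in enumerate(coords)])
    return PCA(n_components=2).fit_transform(sampled)        # (N, 2), one point per image

# Run once per keypoint and color the points by domain (source vs. target)
# to inspect how well the two domains overlap in feature space.
```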

Qualitative results are shown in Figure 4. For faces, the synthetic dataset has fewer facial expressions than FaceWarehouse, so the baselines struggle with e.g. open mouths. Unsupervised adaptation [8] tends to erroneously turn the cheek and nose normals to the side to make the output look locally like side-facing faces. The ground truth is not perfectly faithful to the image because it is fitted to RGBD scans, and both our Freeze method and SfSNet [29] capture local details better than the ground truth, although SfSNet performs better on open mouths due to its use of a lighting model on unlabeled real faces.

For indoor scenes, Freeze improves the performance on shelves, cabinets, and ceilings more effectively than in the facial experiments, possibly because those objects’ semantic labels provide much information about their surface normals.

6 Summary and future work

In this work, we propose TADA, a more general task-assisted domain adaptation strategy for per-pixel prediction tasks. We use the spatial information of an inexpensive shared anchor task to align both features between domains and spatial predictions between tasks, and propose Freeze to further exploit the inter-task relationship for any spatially aligned task pair. We show the effectiveness and robustness of using anchor tasks against multitask training, and of Freeze against conventional domain adaptation methods.

There are many open questions regarding the effect of anchor tasks. How do we make sure the main task gets information from the anchor task output directly? Would a design built on PAD-Net [40] work? Can we adapt multiple main tasks from only one anchor task to leverage all the rich labeling of synthetic data? How cheap can the anchor task be made? Can Taskonomy [42] help in choosing which anchor task to use? We leave these questions for future work.

Acknowledgments

This work is supported in part by the Office of Naval Research grant ONR MURI N00014-16-1-2007 and by a gift from Snap Inc.

References

  • [1] V. Blanz and T. Vetter (1999) A morphable model for the synthesis of 3D faces. In SIGGRAPH.
  • [2] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision (ICCV).
  • [3] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2014) FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, pp. 413–425.
  • [4] R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
  • [5] K. Fang, Y. Bai, S. Hinterstoißer, S. Savarese, and M. Kalakrishnan (2018) Multi-task domain adaptation for deep learning of instance grasping from simulation. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3516–3523.
  • [6] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
  • [7] D. Fourure, R. Emonet, É. Fromont, D. Muselet, N. Neverova, A. Trémeau, and C. Wolf (2017) Multi-task, multi-domain learning: application to semantic segmentation and pose regression. Neurocomputing 251, pp. 68–80.
  • [8] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pp. 1180–1189.
  • [9] T. Gebru, J. Hoffman, and L. Fei-Fei (2017) Fine-grained recognition in the wild: a multi-task domain adaptation approach. In IEEE International Conference on Computer Vision (ICCV), pp. 1358–1367.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [12] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML).
  • [13] R. Hu, P. Dollár, K. He, T. Darrell, and R. B. Girshick (2018) Learning to segment every thing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4233–4241.
  • [14] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5001–5009.
  • [15] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
  • [16] I. Kokkinos (2017) UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5454–5463.
  • [17] L. Ladicky, B. Zeisl, and M. Pollefeys (2014) Discriminatively trained dense surface normal estimation. In ECCV.
  • [18] J. Lei, Z. Guo, and Y. Wang (2017) Weakly supervised image classification with coarse and fine labels. In 14th Conference on Computer and Robot Vision (CRV), pp. 240–247.
  • [19] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944.
  • [20] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang (2015) Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In HLT-NAACL.
  • [21] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV).
  • [22] M. Long, Z. Cao, J. Wang, and P. S. Yu (2017) Learning multiple tasks with multilinear relationship networks. In NIPS.
  • [23] M. Mostajabi, M. Maire, and G. Shakhnarovich (2018) Regularizing deep networks by modeling and predicting label structure. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5629–5638.
  • [24] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
  • [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [26] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3723–3732.
  • [27] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In NIPS.
  • [28] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3752–3761.
  • [29] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs (2018) SfSNet: learning shape, reflectance and illuminance of faces ‘in the wild’. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6296–6305.
  • [30] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2242–2251.
  • [31] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In ECCV.
  • [32] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [33] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser (2017) Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 190–198.
  • [34] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977.
  • [35] G. Trigeorgis, P. Snape, I. Kokkinos, and S. P. Zafeiriou (2017) Face normals ‘‘in-the-wild’’ using fully convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 340–349.
  • [36] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. K. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7472–7481.
  • [37] Y. Tsai, K. Sohn, S. Schulter, and M. K. Chandraker (2019) Domain adaptation for structured output via discriminative patch representations. CoRR abs/1901.05427.
  • [38] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In ICCV.
  • [39] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR.
  • [40] D. Xu, W. Ouyang, X. Wang, and N. Sebe (2018) PAD-Net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 675–684.
  • [41] W. Yang, W. Ouyang, X. Wang, J. S. J. Ren, H. Li, and X. Wang (2018) 3D human pose estimation in the wild by adversarial learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5255–5264.
  • [42] A. R. Zamir, A. Sax, W. B. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3712–3722.
  • [43] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. A. Funkhouser (2017) Physically-based rendering for indoor scene understanding using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5057–5065.
  • [44] Z. Zhou (2017) A brief introduction to weakly supervised learning. National Science Review 5.

Appendix A Supplemental material

A.1 Implementation details (cont’d)

The discriminator for the adversarial loss is based on DCGAN [24]. However, adversarial training is very delicate, and hyperparameters have to be carefully tuned to avoid artifacts that devastate performance. The discriminator is trained jointly with the main network. For stability, we make the following adjustments. We use a 2-layer discriminator with a batchnorm between layers for facial images, and a 5-layer discriminator for indoor scenes. We use the common training strategy of adding the adversarial loss after training has nearly converged. The DCGAN is trained using a PatchGAN loss that reduces overfitting [30], and noise is added to the input, activations, and real-fake ground truth, as recommended by Salimans et al. [27]. Stochastic gradient descent is used for the adversarial network, while the main network uses Adam [15] with a lower learning rate. The surface normals are normalized to unit magnitude before being input to the discriminator. For faces, regions outside the facial area are masked to zero for both features and outputs. New parameters are initialized following He et al. [10].

Facial images are resized to 256×256 for input. Indoor scene images are randomly cropped to 256×256 during training, and resized to 50% resolution at test time. For data augmentation, images are subject to random color shifts, but not geometric transformations or flipping, due to the nature of surface normal estimation. All evaluations are against the original-resolution ground truth.

A.2 Underperforming domain adaptation

Facial images: SfSsyn→FaceWH
                 ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓
MTL-a            0.492     0.953   16.0   13.3   11.4
MTL-a +DA [8]    0.455     0.946   16.7   13.9   12.1
MTL-a +DA [36]   0.407     0.931   18.0   15.0   13.1
Frontal faces: SfSsyn-frontal→FaceWH
                 ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓
MTL-a            0.540     0.958   15.4   12.5   10.5
MTL-a +DA [8]    0.560     0.961   15.0   12.1   10.2
MTL-a +DA [36]   0.522     0.951   16.2   13.0   10.9
Indoor scenes: SUNCG→NYUdv2
                 ≤11.25°↑  ≤30°↑   RMSE↓  Mean↓  Median↓
MTL-a            0.275     0.675   32.4   25.7   19.8
MTL-a +DA [8]    0.282     0.682   32.4   25.4   19.3
MTL-a +DA [36]   0.309     0.708   32.2   24.7   17.7
Table 3: Comparing unsupervised domain adaptation with Tsai et al. [36] and with Ganin et al. [8]. Tsai et al. [36] works better for indoor scenes, while Ganin et al. [8] works better for faces.

Table 3 shows results with different combinations of unsupervised domain adaptation methods and datasets. Tsai et al. [36] (originally tested on semantic segmentation on Cityscapes) works better for indoor scenes, while Ganin et al. [8] works better for faces. These combinations are used in the main paper.

In addition, we are unable to find a hyperparameter setting that makes Tsai et al. [36] improve performance on facial images, even with SfSsyn-frontal as the source domain. Although SfSsyn-frontal eliminates the head pose distribution difference from the target domain, which helped the performance of Ganin et al. [8], we note that the distribution of ground truth local details (such as the complexity of facial expressions) still differs from FaceWarehouse. This difference may have contributed to the lower performance of Tsai et al. [36], which tries to eliminate the domain difference in the output space. This indicates the sensitivity of adaptation methods based on adversarial networks.