1 Introduction
The major driving force behind modern computer vision, machine learning, and deep neural network models is the availability of large amounts of curated labeled data. Deep models have shown state-of-the-art performance on different vision tasks. Effective models that work in practice require very large labeled datasets due to their large parameter spaces. Expecting large-scale hand-annotated datasets to be available for every vision task is not practical. Some tasks require extensive domain expertise, long hours of human labor, or expensive data collection sensors, which collectively make the overall process very expensive. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. Due to this, many vision tasks are considered expensive
[43], and practitioners either avoid such tasks or continue with smaller amounts of data, which can lead to poorly performing models. We seek to address this problem in this work, viz., to build an alternative approach that can obtain model parameters for tasks without any labeled data. Extending the definition of zero-shot learning from basic recognition settings, we call our work Zero-Shot Task Transfer.

Cognitive studies show that a subject (a human baby) can adapt to a novel concept (e.g. depth understanding) by correlating it with known concepts (hand movement or self-motion), without receiving explicit supervision. In a similar spirit, we present a meta-learning algorithm that computes model parameters for novel tasks for which no ground truth is available (called zero-shot tasks). In order to adapt to a zero-shot task, our meta-learner learns from the model parameters of known tasks (with ground truth) and their correlation to the novel task. Formally, given the knowledge of known tasks {τ_1, …, τ_m}, a meta-learner can be used to extrapolate parameters for a novel task.
However, with no knowledge of the relationships between tasks, it may not be plausible to learn such a meta-learner, as its output could map to any point on the meta-manifold (see Figure 1). We hence consider the task correlation between known tasks and a novel task as an additional input to our framework. There could be different notions of how task correlation is obtained. In this work, we use the wisdom-of-the-crowd approach for this purpose. Many vision [30] and non-vision machine learning applications [32], [38] encode such crowd wisdom in their learning methods. Harvesting task correlation knowledge from the crowd is fast, cheap, and brings in domain knowledge. High-fidelity aggregation of crowd votes is used to integrate the task correlation between known and zero-shot tasks into our model. We note, however, that our framework can admit any other source of task correlation beyond crowdsourcing. (We show our results with other sources in the supplementary section.)
Our broad idea of leveraging task correlation may appear similar to the recently proposed Taskonomy [42], but our method and objectives differ in many ways (see Figure 1): (i) Taskonomy studies task correlation to find a way to transfer one task model to another, while our method extrapolates to a zero-shot task, for which no labeled data is available; (ii) to adapt to a new task, Taskonomy requires a considerable amount of target labeled data, while our work does not require any target labeled data (which is, in fact, our objective); (iii) Taskonomy obtains a task transfer graph based on the representations learned by neural networks, while in this work, we leverage task correlation to learn new tasks; and (iv) lastly, our method can be used to learn multiple novel tasks simultaneously. As stated earlier, though we use crowdsourced task correlation, any other compact notion of task correlation can easily be encoded in our methodology. More precisely, our proposal in this work is not to learn an optimal task relation, but to extrapolate to zero-shot tasks.
Our contributions can be summarized as follows:


- We propose a novel methodology to infer zero-shot task parameters that can be used to solve vision tasks with no labeled data.

- The methodology can scale to solving multiple zero-shot tasks simultaneously, as shown in our experiments. Our methodology provides near state-of-the-art results when considering a smaller set of known tasks, and outperforms state-of-the-art models (learned with ground truth) when using all the known tasks, although trained with no labeled data.

- We also show how our method can be used in a transfer learning setting, and conduct various studies on the effectiveness of the proposed method.
2 Related Work
We divide our discussion of related work into subsections that capture earlier efforts that are related to ours from different perspectives.
Transfer Learning:
Reusing supervision is the core component of transfer learning, where an already learned model of a task is fine-tuned to a target task. From early experimentation on CNN features [41], it was clear that the initial layers of deep networks learn similar kinds of filters, and can hence be shared across tasks. Methods such as [3], [23] augment the generation of samples by transferring knowledge from one category to another. Recent efforts have shown the capability to transfer knowledge from the model of one task to a completely new task [34], [33]. Zamir et al. [42] extended this idea and built a task graph for 26 vision tasks to facilitate task transfer. However, unlike our work, [42] cannot generalize to a novel task without accessing the ground truth.
Multi-task Learning:
Multi-task learning learns multiple tasks simultaneously with a view to task generalization. Some methods in multi-task learning assume a prior and then iterate to learn a joint space of tasks [7], [19], while other methods [26], [19] do not use a prior but learn a joint space of tasks during the learning process. Distributed multi-task learning methods [25] address the same objective when tasks are distributed across a network. However, a thread binding all these methods is the explicit need for labeled data for all tasks in the setup; unlike our method, they cannot solve a zero-shot target task without labeled samples.
Domain Adaptation:
The main focus of domain adaptation is to transfer domain knowledge from a data-rich domain to a domain with limited data [27], [9]. Learning domain-invariant features requires domain alignment. Such matching is done either via mid-level features of a CNN, using an autoencoder [14], by clustering [36], or, more recently, by using generative adversarial networks [24]. In some recent efforts [35], [6], the source and target domain discrepancy is learned in an unsupervised manner. However, a considerable amount of labeled data from both domains is still unavoidable. In our methodology, we propose a generalizable framework that can learn models for a novel task from the knowledge of available tasks and their correlation with novel tasks.

Meta-Learning:
Earlier efforts on meta-learning (with other objectives) assume that task parameters lie on a low-dimensional subspace [2], share a common probabilistic prior [22], etc. Unfortunately, these efforts target only knowledge transfer among known tasks and tasks with limited data. Recent meta-learning approaches consider all task parameters as input signals to learn a meta-manifold that helps few-shot learning [28], [37], transfer learning [33], and domain adaptation [14]. A recent approach learns a meta model in a model-agnostic manner [13], [17] such that it can be applied to a variety of learning problems. Unfortunately, all these methods depend on the availability of a certain amount of labeled data in the target domain to learn the transfer function, and cannot be scaled to novel tasks with no labeled data. Besides, the meta-manifold learned by these methods is not explicit enough to extrapolate parameters of zero-shot tasks. Our method relaxes the need for ground truth by leveraging task correlation among known tasks and novel tasks. To the best of our knowledge, this is the first such work that regresses model parameters of novel tasks without using any ground truth information for the task.
Learning with Weak Supervision:
Task correlation is used as a form of weak supervision in our methodology. Recent methods such as [32], [38] proposed generative models that use a fixed number of user-defined weak supervision sources to programmatically generate synthetic labels for data in near-constant time. Alfonseca et al. [1] use heuristics for weak supervision to accomplish hierarchical topic modeling. Broadly, such weak supervision is harvested from knowledge bases, domain heuristics, ontologies, rules-of-thumb, educated guesses, decisions of weak classifiers, or crowdsourcing. Structure learning [4] also exploits distant supervision signals for generating labels. Such methods use a factor graph to learn a high-fidelity aggregation of crowd votes. Similarly, [30] uses weak supervision signals inside the framework of a generative adversarial network. However, none of these operate in a zero-shot setting. We also found related work on zero-shot task generalization in the context of reinforcement learning (RL) [29] and lifelong learning [16], where an agent is validated on its performance on unseen or longer instructions. We find that the interpretation of task, as well as the primary objectives, are very different from our present course of study.

3 Methodology
The primary objective of our methodology is to learn a meta-learning algorithm that regresses nearly optimal parameters of a novel task for which no ground truth (data or labels) is available. To this end, our meta-learner seeks to learn from the model parameters of known tasks (with ground truth) to adapt to a novel zero-shot task. Formally, let us consider K tasks to accomplish, i.e. τ_1, …, τ_K, each of whose model parameters lie on a meta-manifold M of task model parameters. We have ground truth available for the first m tasks, i.e. {τ_1, …, τ_m}, and we know their corresponding model parameters {θ_1, …, θ_m} on M. Complementarily, we have no knowledge of the ground truth for the zero-shot tasks {τ_{m+1}, …, τ_K}. (For convenience, we call the tasks {τ_1, …, τ_m} known tasks, and the rest {τ_{m+1}, …, τ_K} zero-shot tasks.) Our aim is to build a meta-learning function F that can regress the unknown zero-shot model parameters {θ_{m+1}, …, θ_K} from the knowledge of the known model parameters (see Figure 2(b)), i.e.:
F(θ_1, …, θ_m) = (θ_{m+1}, …, θ_K)    (1)
However, with no knowledge of the relationships between tasks, it may not be plausible to learn F, as it can map to any point on M. We hence introduce a task correlation matrix Γ, where each entry γ_{ij} captures the correlation between two tasks τ_i and τ_j. Equation 1 hence becomes:
F(θ_1, …, θ_m, Γ) = (θ_{m+1}, …, θ_K)    (2)
The function F is itself parameterized by W. We design our objective function to compute an optimum value for W as follows:
W* = argmin_W Σ_{i=1}^{m} ‖ F(θ_1, …, θ_m, Γ; W)_i − θ_i ‖²    (3)

where F(·)_i denotes the parameters regressed for task τ_i.
Similar to [42], without any loss of generality, we assume that all task parameters are learned as an autoencoder. Hence, the task parameters θ_i mentioned previously can be described in terms of an encoder θ_i^e and a decoder θ_i^d. We observed that considering only encoder parameters in Equation 3 is sufficient to regress zero-shot encoders and decoders for tasks {τ_{m+1}, …, τ_K}. Based on this observation, we rewrite our objective as (we show how our methodology works with other inputs in later sections of the paper):
W* = argmin_W Σ_{i=1}^{m} ‖ F(θ_1^e, …, θ_m^e, Γ; W)_i − (θ_i^e, θ_i^d) ‖²    (4)
where θ_i^e and θ_i^d are the learned encoder and decoder parameters of a known task τ_i. This alone is, however, insufficient. The model parameters thus obtained should not only minimize the above loss function on the meta-manifold M, but should also have low loss on the original data manifold (ground truth of known tasks). Let D_i denote the data decoder parameterized by θ_i^d, and E_i denote the data encoder parameterized by θ_i^e. We now add a data-model consistency loss to Equation 4 to ensure that our regressed encoder and decoder parameters perform well on both the meta-manifold network as well as the original data network:
W* = argmin_W Σ_{i=1}^{m} [ ‖ F(θ_1^e, …, θ_m^e, Γ; W)_i − (θ_i^e, θ_i^d) ‖² + Σ_{(x,y)∈τ_i} L_i( D_i(E_i(x)), y ) ]    (5)
where L_i is an appropriate loss function (mean-squared error, cross-entropy, or similar) defined for task τ_i.
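To make the shape of this combined objective concrete, the following is a minimal illustrative sketch (not the authors' implementation) that evaluates the two terms of Equation 5 for toy linear encoder/decoder task models; the function names `task_model` and `combined_loss` are ours.

```python
import numpy as np

def task_model(enc, dec, x):
    """Apply a linear encoder/decoder pair (toy stand-in for an autoencoder) to data x."""
    return x @ enc @ dec

def combined_loss(regressed, targets, data, labels, weight=1.0):
    """Parameter-space loss plus data-model consistency loss, per Eqn (5).

    regressed/targets: lists of (enc, dec) parameter pairs, one per known task.
    data/labels: lists of (x, y) arrays, one per known task.
    """
    loss = 0.0
    for (e_hat, d_hat), (e, d), x, y in zip(regressed, targets, data, labels):
        # distance between regressed and learned parameters on the meta-manifold
        loss += np.mean((e_hat - e) ** 2) + np.mean((d_hat - d) ** 2)
        # consistency on the original data manifold (MSE as the task loss L_i)
        loss += weight * np.mean((task_model(e_hat, d_hat, x) - y) ** 2)
    return loss
```

The loss is zero only when the regressed parameters match the learned ones and also reproduce the ground-truth labels, which is exactly the two-term requirement above.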
Network:
To accomplish the objective in Equation 5, we design F as a network of m branches, with parameters {W_1, …, W_m} respectively. These branches are not coupled in the initial layers, but are later combined in a block W_c that regresses the encoder and decoder parameters. Dividing F into two parts, the W_i's and W_c, is driven by the intuition discussed in [41]: the initial layers of F transform the individual task model parameters into a suitable representation space, while the later layers parameterized by W_c capture the relationships between tasks and contribute to regressing the encoder and decoder parameters. For simplicity, we use W to refer to {W_1, …, W_m} and W_c collectively. More specifics of the architecture of our model, TTNet, are discussed as part of our implementation details in Section 4.
Learning Task Correlation:
Our methodology admits any source of task correlation, including other work such as [42]. In this work, we obtain the task correlation matrix Γ using crowdsourcing. Obtaining task relationships from the wisdom of the crowd (and subsequent vote aggregation) is fast, cheap, and allows several inputs such as rules-of-thumb, ontologies, domain expertise, etc. We obtain correlations for the commonplace tasks used in our experiments from multiple human users. The obtained crowd votes are aggregated using the Dawid-Skene algorithm [10] to provide a high-fidelity task relationship matrix Γ.
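For readers unfamiliar with Dawid-Skene aggregation, the following is a minimal EM sketch of the algorithm (assuming complete categorical votes; production variants also handle missing votes). The function name and return convention are ours.

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=20):
    """Aggregate crowd votes with the Dawid-Skene EM algorithm [10].

    votes: (n_items, n_workers) integer class indices.
    Returns an (n_items, n_classes) posterior over the true label of each item.
    """
    n_items, n_workers = votes.shape
    # initialize label posteriors with per-item vote frequencies (majority vote)
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for v in votes[i]:
            T[i, v] += 1.0
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices
        prior = T.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for w in range(n_workers):
            for i in range(n_items):
                conf[w, :, votes[i, w]] += T[i]  # accumulate P(true=c, vote=v)
            conf[w] /= conf[w].sum(axis=1, keepdims=True)
        # E-step: posterior over true labels given votes and confusions
        logT = np.tile(np.log(prior + 1e-12), (n_items, 1))
        for w in range(n_workers):
            logT += np.log(conf[w, :, votes[:, w]] + 1e-12)
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T
```

The E-step weights each worker's vote by that worker's estimated confusion matrix, so reliable annotators dominate the aggregated correlation values.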
Input:
To train our meta network F, we need a batch of model parameters for each known task τ_i. This process is similar to the way a batch of data samples is used to train a standard data network. To obtain a batch of model parameters for each task, we closely follow the procedure described in [40], as follows. To obtain one model parameter set θ_i for a known task τ_i, we train a base learner (autoencoder), defined by (θ_i^e, θ_i^d). This is achieved by optimizing the base learner on a subset (of size s) of the data and corresponding labels with an appropriate loss function for the known task (mean-squared error, cross-entropy, or the like, based on the task). Hence, we learn one θ_i. Similarly, N subsets of labeled data are obtained using a sampling-with-replacement strategy from the dataset corresponding to τ_i. Following this, we obtain a set of optimal model parameters (one for each of the N subsets sampled) for task τ_i. A similar process is followed to obtain "optimal" model parameters for each known task. These model parameters (a total of mN across all known tasks) serve as the input to our meta network F.
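The bootstrap procedure above can be sketched for a toy base learner; here a least-squares linear model stands in for the autoencoder, and the names `parameter_batch`, `n_sets`, and `subset_size` are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def parameter_batch(x, y, n_sets, subset_size):
    """Bootstrap a batch of 'model parameters' for one known task.

    Each parameter set is obtained by fitting a least-squares linear model
    (a stand-in for training the base autoencoder) on a subset of size
    subset_size sampled with replacement from (x, y).
    """
    params = []
    for _ in range(n_sets):
        idx = rng.integers(0, len(x), size=subset_size)       # sampling with replacement
        theta, *_ = np.linalg.lstsq(x[idx], y[idx], rcond=None)
        params.append(theta)
    return np.stack(params)                                    # (n_sets, d_in, d_out)
```

Repeating this per known task yields the mN parameter sets that feed the meta network, exactly mirroring how a batch of data samples feeds an ordinary network.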
Training:
The meta network F is trained on the objective function in Equation 5 in two modes for each task: a self mode and a transfer mode. Given a known task τ_i, training in self mode updates the weights W_i and W_c alone. Training in transfer mode, on the other hand, updates the weights W_j (for all j ≠ i) and W_c. Self mode is similar to training a standard autoencoder, where F learns to project the model parameters near the given model parameters θ_i (learned from ground truth). In transfer mode, the model parameters of tasks other than τ_i attempt to map the regressed parameters near the given model parameters θ_i on the meta-manifold. We note that the transfer mode is essential for being able to regress the model parameters of a task given the model parameters of other tasks. At inference time (for zero-shot task transfer), F operates in transfer mode.
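The two modes can be illustrated with a deliberately tiny linear meta-learner (all names and the gradient-descent update are ours, not the authors' training code): each branch is a matrix B[i], the common block is a matrix C, and self/transfer mode differ only in which branches feed the common block.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                 # flattened parameter dimension (toy)
m = 3                                 # number of known tasks
theta = rng.normal(size=(m, d))       # one "model parameter" vector per known task
B = [np.eye(d) for _ in range(m)]     # branch weights, standing in for W_1..W_m
C = np.eye(d)                         # common block, standing in for W_c

def forward(i, mode):
    """Regress the parameters of task i in 'self' or 'transfer' mode."""
    if mode == "self":
        z = theta[i] @ B[i]                                   # task i's own branch only
    else:
        z = np.mean([theta[j] @ B[j] for j in range(m) if j != i], axis=0)
    return z @ C

def step(i, mode, lr=0.1):
    """One gradient step on ||forward(i, mode) - theta[i]||^2 w.r.t. the common block C."""
    global C
    if mode == "self":
        z = theta[i] @ B[i]
    else:
        z = np.mean([theta[j] @ B[j] for j in range(m) if j != i], axis=0)
    residual = z @ C - theta[i]
    C -= lr * np.outer(z, residual)   # gradient through the linear common block
```

Training transfer mode teaches the common block to reconstruct a task's parameters from the *other* tasks' branches, which is precisely the operation used at inference time for zero-shot tasks.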
Regressing ZeroShot Task Parameters:
Once we learn the optimal parameters W* for F using Algorithm 1, we use them to regress the zero-shot task parameters θ_j for all j ∈ {m+1, …, K}. (We note that the implementation of Algorithm 1 was found to be independent of the ordering of the tasks.)
4 Results
To evaluate our proposed framework, we consider the vision tasks defined in [42]. (Whether this is an exhaustive list of vision tasks is arguable, but they are sufficient to support our proof of concept.) In this section, we consider four of the tasks as unknown or zero-shot: surface normal estimation, depth estimation, room layout estimation, and camera pose estimation. We curated this list based on the complexity of data acquisition and of the learning process using a deep network. Surface normal, depth estimation, and room layout estimation are monocular tasks but involve expensive sensors to obtain labeled data points. Camera pose estimation requires multiple images (two or more) to infer six degrees of freedom and is generally considered a difficult task. We train four different TTNets to accomplish them: (1) one that considers 6 vision tasks as known tasks; (2) one that considers 10 vision tasks as known tasks; and (3) one that considers 20 vision tasks as known tasks. In addition, we have another model (with 20 known tasks) in which the regressed parameters are fine-tuned on a small amount (20%) of data for the zero-shot tasks. (This provides low supervision, hence its name.) Studies on other sets of tasks as zero-shot tasks are presented in Section 5. We also performed an ablation study on permuting the source tasks differently, which is presented in the supplementary section due to space constraints.

4.1 Dataset
We evaluated TTNet on the Taskonomy dataset [42], a publicly available dataset comprising more than 150K RGB samples of indoor scenes. It provides ground truth for 26 tasks on the same RGB images, which is the main reason for considering this dataset. We used 120K images for training, 16K for validation, and 17K for testing.
4.2 Implementation Details
Network Architecture:
Following Section 3, each data network is an autoencoder, and closely follows the model architecture of [42]. The encoder is a fully convolutional ResNet-50 model without pooling; the decoder comprises 15 fully convolutional layers for all pixel-to-pixel tasks (e.g. normal estimation), and 2-3 fully connected layers for low-dimensional tasks (e.g. vanishing points). To make input samples for TTNet, we created 5000 sets of model parameters for each task, each obtained by training the model on 1K data points sampled (with replacement) from the Taskonomy dataset. These data networks were trained with mini-batches of size 32, a learning rate of 0.001, a momentum factor of 0.5, and the Adam optimizer.
TTNet:
TTNet’s architecture closely follows the “classification” network of [13]; our network is shown in Figure 2(b). TTNet initially has m branches, where m depends on the model under consideration. Each branch comprises 15 fully convolutional (FCONV) layers followed by 14 fully connected layers. The branches are then merged into a common block comprising 15 FCONV layers. We trained the complete model with mini-batches of size 32, a learning rate of 0.0001, a momentum factor of 0.5, and the Adam optimizer.
Task correlation:
Crowd workers are asked to respond for each pair of tasks (known and zero-shot) on a scale ranging from strong correlation to no correlation, with one value reserved to denote self-relation. We then aggregated the crowd votes using the Dawid-Skene algorithm, which is based on the principle of Expectation-Maximization (EM). More details of the Dawid-Skene methodology and vote aggregation are deferred to the supplementary section.
4.3 Comparison with StateoftheArt Models
We show below both qualitative and quantitative results for our TTNet, trained using the aforementioned methodology, on each of the four identified zero-shot tasks, against state-of-the-art models for each respective task. We note that the same TTNet is validated against all tasks.
4.3.1 Qualitative Results
Surface Normal Estimation:
For this task, our TTNet is compared against the following state-of-the-art models: Multi-scale CNN (MC) [12], Deep3D (D3D) [39], Deep Network for surface normal estimation (DD) [39], SkipNet [5], GeoNet [31], and Taskonomy (TN) [42]. The results are shown in Figure 3(a), where the red boxes correspond to our models trained under different settings (as described at the beginning of Section 4). It is evident from the results that even our smallest model gives visual results similar to [42]. As we increase the number of source tasks, our TTNet shows improved results; our largest model captures finer details (see the edges of the chandelier) that are not visible in any other result.
Room Layout Estimation:
We followed the definition of layout types in [20], and our TTNet's results are compared against the following room layout methods: Volumetric [15], Edge Map [44], LayoutNet [46], RoomNet [20], and Taskonomy [42]. The green boxes in Figure 3(b) indicate TTNet results; the red edges indicate the predicted room edges. Each model infers room corner points and joins them with straight lines. We report two complex cases in Figure 3(b): (1) heavy occlusion, and (2) multiple edges such as rooftop, door, etc.
Depth Estimation:
Camera Pose Estimation (fixed):
Camera pose estimation requires two images captured from two different geometric points of view of the same scene. Fixed camera pose estimation predicts any five of the six degrees of freedom: yaw, pitch, roll, and x, y, z translation. In Figure 3(d), we show two different geometric camera translations: (1) perspective, and (2) translation in the y and z coordinates. The first image is the reference frame of the camera (green arrow); the second image (red arrow) is taken after a geometric translation w.r.t. the first image. We compared our model against RANSAC [11], Latent RANSAC [18], Generic 3D pose [43], and Taskonomy [42]. Once again, our larger models outperform all other methods studied.
4.3.2 Quantitative Results
Surface Normal Estimation:
We evaluated our method based on the evaluation criteria described in [31], [5]. The results are presented in Table 1. Our smallest model is comparable to the state-of-the-art Taskonomy [42] and GeoNet [31], while our larger models outperform all state-of-the-art models.
Method  Mean (°)↓  Median (°)↓  RMSE↓  11.25°↑  22.5°↑  30°↑

MC[12]  30.30  35.30    30.29  57.17  68.29 
D3D [39]  25.71  20.81  31.01  38.12  59.18  67.21 
DD[39]  21.10  15.61    44.39  64.48  66.21 
SkipNet [5]  20.21  12.19  28.20  47.90  70.00  78.23 
TN[42]  19.90  11.93  23.13  48.03  70.02  78.88 
TTNet  19.22  12.01  26.13  48.02  71.11  78.29 
GeoNet [31]  19.00  11.80  26.90  48.04  72.27  79.68 
TTNet  19.81  11.09  22.37  48.83  71.61  79.00 
TTNet  19.27  11.91  26.44  48.81  71.97  79.72 
TTNet  15.10  9.29  24.31  56.11  75.19  84.71 
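The criteria in Table 1 are standard per-pixel angular-error statistics (mean, median, RMSE, and the percentage of pixels with error below each threshold). A minimal sketch of their computation (function name and dictionary keys are ours):

```python
import numpy as np

def normal_metrics(pred, gt):
    """Angular-error statistics between predicted and ground-truth surface normals.

    pred, gt: arrays of shape (N, 3); rows are unit-normalized internally.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))                    # per-pixel angular error
    return {
        "mean": ang.mean(),
        "median": np.median(ang),
        "rmse": np.sqrt(np.mean(ang ** 2)),
        "11.25": 100.0 * np.mean(ang <= 11.25),         # % of pixels within 11.25 deg
        "22.5": 100.0 * np.mean(ang <= 22.5),
        "30": 100.0 * np.mean(ang <= 30.0),
    }
```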
Room Layout Estimation:
We use two standard evaluation criteria: (1) keypoint error, a global measure averaged over the Euclidean distance between the model's predicted keypoints and the ground truth; and (2) pixel error, a local measure that estimates the pixel-wise error between the predicted surface labels and the ground truth labels. Table 2 presents the results; the lower numbers corresponding to our TTNet models indicate good performance.
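The two criteria reduce to simple computations; a sketch (the diagonal normalization in `keypoint_error` is a common convention and an assumption on our part, as benchmarks vary):

```python
import numpy as np

def keypoint_error(pred_kp, gt_kp, diag):
    """Mean Euclidean distance between predicted and ground-truth room corner
    keypoints, normalized by the image diagonal."""
    d = np.linalg.norm(np.asarray(pred_kp) - np.asarray(gt_kp), axis=1)
    return d.mean() / diag

def pixel_error(pred_labels, gt_labels):
    """Fraction of pixels whose predicted surface label disagrees with the ground truth."""
    return np.mean(np.asarray(pred_labels) != np.asarray(gt_labels))
```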
Depth Estimation:
We followed the evaluation criteria for depth estimation in [21], where the metrics are: RMSE(lin) = sqrt( (1/N) Σ_i (d_i − d_i*)² ); RMSE(log) = sqrt( (1/N) Σ_i (log d_i − log d_i*)² ); absolute relative distance (ARD) = (1/N) Σ_i |d_i − d_i*| / d_i*; squared relative distance (SRD) = (1/N) Σ_i |d_i − d_i*|² / d_i*. Here, d_i* is the ground truth depth, d_i is the estimated depth, and N is the total number of pixels in all images in the test set.
Method  RMSE(lin)  RMSE(log)  ARD  SRD 

FDA [21]  0.877  0.283  0.214  0.204 
TTN  0.745  0.262  0.220  0.210 
TN [42]  0.591  0.231  0.242  0.206 
TTNt  0.575  0.172  0.236  0.179 
Geonet[31]  0.591  0.205  0.149  0.118 
TTNet  0.597  0.204  0.140  0.106 
TTNet  0.572  0.193  0.139  0.096 
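The four depth metrics defined above translate directly to code; a minimal sketch (function name and keys are ours):

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE(lin), RMSE(log), absolute and squared relative distance over all pixels.

    pred, gt: flattened arrays of estimated and ground-truth depths (positive values).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return {
        "rmse_lin": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "ard": np.mean(np.abs(pred - gt) / gt),          # absolute relative distance
        "srd": np.mean((pred - gt) ** 2 / gt),           # squared relative distance
    }
```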
Camera Pose Estimation (fixed):
We adopted the win rate (%) evaluation criterion [42], which counts the proportion of images on which a baseline is outperformed. Table 4 shows the win rate of TTNet models on angular error with respect to state-of-the-art models: RANSAC [11], Latent RANSAC [18], Generic 3D pose (G3D) [43], and Taskonomy [42]. The results show the promising performance of TTNet.
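The win-rate criterion is simply the percentage of test images on which our per-image error beats the baseline's; a one-function sketch (name is ours):

```python
import numpy as np

def win_rate(our_err, baseline_err):
    """Percentage of test images on which our per-image error is strictly
    lower than the baseline's (the win-rate criterion of [42])."""
    our_err = np.asarray(our_err, float)
    baseline_err = np.asarray(baseline_err, float)
    return 100.0 * np.mean(our_err < baseline_err)
```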
5 Discussion and Analysis
Significance Analysis of Source Tasks:
An interesting question about our approach is: how do we quantify the contribution of each individual source task towards regressing the parameters of a target task? In other words, which source task plays the most important role in regressing the zero-shot task parameters? Figure 5 quantifies this using a latent task basis, computed following the GO-MTL approach [19]. Formally, the optimal model parameters of each known task τ_i are first mapped to a low-dimensional vector space using an autoencoder trained on the model parameters of the known tasks (similar to Equation 5); the same autoencoder infers a latent representation for the regressed model parameters of a zero-shot task. We used ResNet-18 for both the encoder and decoder, a latent dimension of 100, and a task basis of dimension 8. Following GO-MTL, we then have a task matrix W = LS, where L holds the latent basis vectors and S the task-specific coefficients. In Figure 5, boxes of the same color denote similar-valued weights of the task basis vectors; the most important source task has the highest number of basis elements with values similar to the zero-shot task. In Figure 5(a), the source task "Autoencoding" is important for the zero-shot task "Z-Depth", as they share 4 such basis elements.
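The factorization W = LS can be sketched with plain alternating least squares (GO-MTL additionally imposes sparsity on S, which this illustrative sketch, with names of our choosing, omits):

```python
import numpy as np

rng = np.random.default_rng(2)

def factorize_tasks(W, k, n_iter=100):
    """Approximate the d x T task matrix W as L @ S with k latent basis vectors.

    L: (d, k) latent task basis; S: (k, T) per-task coefficients.
    Uses alternating least squares; GO-MTL's sparsity penalty on S is omitted.
    """
    d, T = W.shape
    L = rng.normal(size=(d, k))
    for _ in range(n_iter):
        S, *_ = np.linalg.lstsq(L, W, rcond=None)        # fix L, solve for S
        Lt, *_ = np.linalg.lstsq(S.T, W.T, rcond=None)   # fix S, solve for L
        L = Lt.T
    return L, S
```

Comparing a zero-shot task's column of S against the source tasks' columns is what Figure 5 visualizes: shared, similar-valued basis elements indicate which sources contribute most.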
Why Do Zero-Shot Task Parameters Perform Better than Supervised Training?
It is evident from our qualitative and quantitative studies that the regressed zero-shot parameters outperform results from supervised learning. When tasks are related (which is the setting in our work), learning from similar tasks can by itself provide good performance. From Figure 5, we can see that the basis vector of the zero-shot task "Z-Depth" is composed of latent elements from several source tasks; e.g. in Figure 5(b), the learning of one element (red box) of "Z-Depth" is supported by 4 related source tasks.

Zero-Shot to Known Task Transfer:
Are our regressed model parameters for zero-shot tasks capable of transferring to a known task? To study this, we take the autoencoder parameters for a zero-shot task and fine-tune the decoder to a target known task, following the procedure in [42] (the encoder remains that of the zero-shot task). Figure 4 shows the qualitative results, which are promising. We also compared our TTNet against [42] quantitatively by studying the win rate (%) of the two methods against other state-of-the-art methods: Wang et al. [40], G3D [43], and full supervision. Owing to space constraints, these results are presented in the supplementary section.
Choice of Zero-Shot Tasks:
To study the generalizability of our method, we conducted experiments with a different set of zero-shot tasks than those considered in Section 4. Figure 6 shows promising results for our weakest model (6 known tasks) on these other tasks. More results for our other models are included in the supplementary section.
Performance on Other Datasets:
Object detection on COCO-Stuff: TTNet is fine-tuned on the COCO-Stuff dataset to perform object detection. To facilitate object detection, we considered object classification as a source task instead of colorization. TTNet performs fairly well on this task.

Optimal Number of Known Tasks:
In this work, we have reported results of TTNet with 6, 10, and 20 known tasks. We studied the question of how many tasks are sufficient to adapt to zero-shot tasks in the considered setting; the results are reported in Table 5. Expectedly, a higher number of known tasks provides improved performance. A direction of our future work is to study the impact of negatively correlated tasks on zero-shot task transfer.
6 Conclusion
In summary, we presented a meta-learning algorithm to regress the model parameters of a novel task for which no ground truth is available (a zero-shot task). We evaluated our learned model on the Taskonomy dataset [42], with four zero-shot tasks: surface normal estimation, room layout estimation, depth estimation, and camera pose estimation. We conducted extensive experiments to study the usefulness of zero-shot task transfer, and showed how the proposed TTNet can also be used in transfer learning. Our future work will involve closer analysis of the implications of obtaining task correlation from various sources, and the corresponding results for zero-shot task transfer. In particular, negative transfer in task space is a particularly interesting direction of future work.
References
 [1] E. Alfonseca, K. Filippova, J.Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short PapersVolume 2, pages 54–59. Association for Computational Linguistics, 2012.
 [2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multitask feature learning. Machine Learning, 73(3):243–272, 2008.
 [3] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
 [4] S. H. Bach, B. He, A. Ratner, and C. Ré. Learning the structure of generative models without labeled data. arXiv preprint arXiv:1703.00854, 2017.
 [5] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5965–5974, 2016.
 [6] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty. Reweighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7976–7985, 2018.
 [7] H. Cohen and K. Crammer. Learning multiple tasks in parallel with a shared annotator. In Advances in Neural Information Processing Systems, pages 1170–1178, 2014.

 [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [9] G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
 [10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer errorrates using the em algorithm. Applied statistics, pages 20–28, 1979.
 [11] K. G. Derpanis. Overview of the ransac algorithm. Image Rochester NY, 4(1):2–3, 2010.
 [12] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multiscale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
 [13] C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 [14] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstructionclassification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
 [15] A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in neural information processing systems, pages 1288–1296, 2010.
 [16] D. Isele, M. Rostami, and E. Eaton. Using task features for zeroshot knowledge transfer in lifelong learning. In IJCAI, pages 1620–1626, 2016.
 [17] T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian modelagnostic metalearning. arXiv preprint arXiv:1806.03836, 2018.
 [18] S. Korman and R. Litman. Latent ransac. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6693–6702, 2018.
 [19] A. Kumar and H. Daume III. Learning task grouping and overlap in multitask learning. arXiv preprint arXiv:1206.6417, 2012.
 [20] C.Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. Roomnet: Endtoend room layout estimation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4875–4884. IEEE, 2017.
 [21] J.H. Lee, M. Heo, K.R. Kim, and C.S. Kim. Singleimage depth estimation based on fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 330–339, 2018.
 [22] S.I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a metalevel prior for feature relevance from multiple related tasks. In Proceedings of the 24th international conference on Machine learning, pages 489–496. ACM, 2007.
 [23] J. J. Lim, R. R. Salakhutdinov, and A. Torralba. Transfer learning by borrowing examples for multiclass object detection. In Advances in neural information processing systems, pages 118–126, 2011.
 [24] M.Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
 [25] S. Liu, S. J. Pan, and Q. Ho. Distributed multitask relationship learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2017.
 [26] M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.
 [27] Z. Luo, Y. Zou, J. Hoffman, and L. F. FeiFei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
 [28] D. K. Naik and R. Mammone. Metaneural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE, 1992.
 [29] J. Oh, S. Singh, H. Lee, and P. Kohli. Zeroshot task generalization with multitask deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
 [30] A. Pal and V. N. Balasubramanian. Adversarial data programming: Using gans to relax the bottleneck of curated labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1556–1565, 2018.
 [31] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
 [32] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575, 2016.
 [33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, realtime object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
 [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [35] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1712.02560, 3, 2017.
 [36] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 2110–2118, 2016.
 [37] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
 [38] P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123, 2016.
 [39] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
 [40] Y.X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In NIPS, 2017.
 [41] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
 [42] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
 [43] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. In European Conference on Computer Vision, pages 535–553. Springer, 2016.
 [44] W. Zhang, W. Zhang, K. Liu, and J. Gu. Learning to predict highquality edge maps for room layout estimation. IEEE Transactions on Multimedia, 19(5):935–943, 2017.
 [45] C. Zhu, H. Xu, and S. Yan. Online crowdsourcing. CoRR, abs/1512.02393, 2015.
 [46] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.
7 More on Task Correlation
Dawid-Skene method:
As mentioned in Section 3, we used the well-known Dawid-Skene (DS) method [10], [45] to aggregate votes from human users and compute the task correlation matrix $\Gamma$. We now describe the DS method.
We assume a total of $M$ annotators providing labels for $N$ items, where each label belongs to one of $K$ classes. DS associates each annotator $a$ with a confusion matrix $\pi^{(a)} \in \mathbb{R}^{K \times K}$ to measure the annotator's performance; each entry $\pi^{(a)}_{jk}$ is the probability that annotator $a$ predicts class $k$ when the true class is $j$. The final label of an item is a weighted aggregate of the annotators' decisions based on their confusion matrices. The true label of item $i$ is $y_i$, and the vector $\mathbf{y} = \{y_1, \ldots, y_N\}$ denotes the true labels of all items. Let $x_{ia}$ denote annotator $a$'s label for item $i$, i.e. if annotator $a$ labeled item $i$ as class $k$, we write $x_{ia} = k$. Let the matrix $X = \{x_{ia}\}$ denote all labels for all items given by all annotators. The DS method estimates the annotators' error tensor $\pi = \{\pi^{(a)}\}_{a=1}^{M}$, where the entry $\pi^{(a)}_{j x_{ia}}$ is the probability of annotator $a$ giving label $x_{ia}$ to item $i$ whose true class is $j$. With class priors $p_j = P(y_i = j)$, the joint likelihood of true labels and observed labels can hence be written as:

$$P(\mathbf{y}, X \mid \pi, p) = \prod_{i=1}^{N} p_{y_i} \prod_{a=1}^{M} \pi^{(a)}_{y_i x_{ia}} \tag{6}$$
Maximizing the above likelihood provides a mechanism to aggregate the votes from the annotators. To this end, we find the maximum likelihood estimate using Expectation Maximization (EM) on the marginal log likelihood below:

$$\log P(X \mid \pi, p) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} p_j \prod_{a=1}^{M} \pi^{(a)}_{j x_{ia}} \right) \tag{7}$$
The E step computes the posterior over the true label of each item, and is given by:

$$\mu_{ij} := P(y_i = j \mid X, \pi, p) \propto p_j \prod_{a=1}^{M} \pi^{(a)}_{j x_{ia}} \tag{8}$$
The M step subsequently computes the estimates that maximize the expected log likelihood:

$$\hat{p}_j = \frac{1}{N} \sum_{i=1}^{N} \mu_{ij} \tag{9}$$

$$\hat{\pi}^{(a)}_{jk} = \frac{\sum_{i=1}^{N} \mu_{ij} \, \mathbb{1}[x_{ia} = k]}{\sum_{i=1}^{N} \mu_{ij}} \tag{10}$$
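The EM procedure above can be sketched in code as follows; this is a minimal NumPy illustration, not the paper's implementation. The function name, the majority-vote initialization, and the small smoothing constants are our assumptions:

```python
import numpy as np

def dawid_skene(X, K, n_iter=50):
    """EM for the Dawid-Skene model (a minimal sketch).
    X: (N items, M annotators) integer label matrix, entries in {0..K-1}.
    Returns class priors p (K,), confusion tensor pi (M, K, K),
    and label posteriors mu (N, K)."""
    N, M = X.shape
    # Initialize posteriors with per-item vote frequencies (soft majority vote).
    mu = np.zeros((N, K))
    for i in range(N):
        for a in range(M):
            mu[i, X[i, a]] += 1
    mu /= mu.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M step: class priors (eq. 9) and annotator confusion matrices (eq. 10).
        p = mu.mean(axis=0)
        pi = np.zeros((M, K, K))
        for a in range(M):
            for k in range(K):
                # pi[a, j, k] accumulates sum_i mu[i, j] * 1[x_ia = k]
                pi[a, :, k] = mu[X[:, a] == k].sum(axis=0)
        pi /= pi.sum(axis=2, keepdims=True) + 1e-12
        # E step (eq. 8): posterior over true labels, computed in log domain.
        log_mu = np.tile(np.log(p + 1e-12), (N, 1))
        for a in range(M):
            # pi[a][:, X[:, a]].T has shape (N, K): log-prob of each observed label.
            log_mu += np.log(pi[a][:, X[:, a]].T + 1e-12)
        mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
        mu /= mu.sum(axis=1, keepdims=True)
    return p, pi, mu
```

On unanimous votes this reduces to majority voting; its value appears when annotators disagree, since unreliable annotators receive flatter confusion matrices and thus lower weight.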
Implementation:
In our experiments, as mentioned before, we considered the Taskonomy dataset [42]. This dataset has 26 vision-related tasks, and we are interested in finding the task correlation for each pair of tasks. Suppose we have $M$ annotators. To fit our setting into the DS framework, we flatten the task correlation matrix $\Gamma$ (described in Section 3) in row-major order to obtain the item set, one item per ordered task pair. For each item, an annotator gives a task correlation label on a scale of $\{4, 3, 2, 1, 0\}$: on this scale, 4 denotes self-relation, 3 a strong relation, 2 a weak relation, 1 an abstain vote, and 0 no relation between the two tasks. After collecting the annotators' votes, we build the label matrix $X$. Subsequently, we estimate the annotators' error tensor and maximize the likelihood via EM (Equations 6-10). Winner-takes-all over the resulting posteriors, $\hat{y}_i = \arg\max_j \mu_{ij}$, yields the predicted class labels, which are the task correlations we seek. De-flattening the predicted labels gives the final task correlation matrix $\Gamma$.
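The winner-takes-all and de-flattening steps above can be sketched as follows; the helper name is hypothetical, and `mu` stands for the $N \times K$ posterior matrix produced by DS aggregation over the $N = T \times T$ flattened matrix entries for $T$ tasks:

```python
import numpy as np

def correlation_matrix(mu, T):
    """Winner-takes-all over DS posteriors, then de-flatten (row-major)
    back into a T x T task correlation matrix. Hypothetical helper."""
    labels = mu.argmax(axis=1)   # predicted relation class per task pair
    return labels.reshape(T, T)  # inverse of the row-major flattening
```

Since NumPy's `reshape` uses row-major (C) order by default, this exactly inverts the row-major flattening described above.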
Figure 9 shows the final task correlation matrix $\Gamma$ used in our experiments. The matrix fairly reflects an intuitive knowledge of the tasks considered. We also considered an alternate mechanism for obtaining the task correlation matrix from the task graph computed in [42]. We present these results later in Section 9.2.
Table 6: Win rate (%) of TTNet on the camera pose estimation task for varying numbers of annotators.

| Annotators | RANSAC [11] | LR [18] | G3D [43] | TN [42] |
|---|---|---|---|---|
| 3 | 28% | 22% | 29% | 40% |
| 10 | 51% | 29% | 31% | 52% |
| 20 | 90% | 82% | 92% | 42% |
| 30 | 88% | 81% | 72% | 64% |
| 35 | 88% | 82% | 75% | 61% |
| 40 | 90% | 72% | 69% | 63% |
| 45 | 87% | 80% | 61% | 70% |
| 50 | 90% | 82% | 72% | 50% |
Ablation study on the number of annotators:
The results in the main paper were obtained with 30 annotators. In this section, we study the robustness of our method when $\Gamma$ is obtained by varying the number of annotators $M$, where $M \in \{3, 10, 20, 30, 35, 40, 45, 50\}$. Table 6 shows the win rate (%) [42] on the camera pose estimation task using TTNet (when 6 source tasks are used). While there are variations, we identified $M = 30$ as the number of annotators for which the results are most robust, and used this setting for the rest of our experiments in the main paper.
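For reference, the win-rate metric [42] used throughout these tables can be sketched as below. Taskonomy reports the percentage of test images on which one method outperforms another; the function name and the choice to count ties as non-wins are our assumptions:

```python
def win_rate(losses_ours, losses_baseline):
    """Percentage of test items on which our method achieves strictly lower
    loss than the baseline (ties counted as non-wins; an assumption)."""
    wins = sum(a < b for a, b in zip(losses_ours, losses_baseline))
    return 100.0 * wins / len(losses_ours)
```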
8 Ablation Studies on Varying Known Tasks
In this section, we present two ablation studies w.r.t. known tasks, viz., (i) the number of known tasks and (ii) the choice of known tasks. These studies attempt to answer two questions: how many known tasks are sufficient to adapt to zero-shot tasks in the considered setting, and which known tasks are more favorable for transfer to zero-shot tasks? While an exhaustive study is infeasible, we attempt to answer these questions by conducting a study across six model variants that differ in the number and choice of source tasks considered. We used the win rate (%) against [42] for each of the zero-shot tasks. Table 7 shows the results of our studies with a varying number and choice of known source tasks. Expectedly, a higher number of known tasks provides improved performance. It is observed from the table that our methodology is fairly robust to changes in the choice of source tasks, and that the variant trained on six source tasks provides a good balance, performing well even with a low number of source tasks. Interestingly, these six source tasks (autoencoding, denoising, 2D edges, occlusion edges, vanishing point, and colorization) do not require significant annotation, thus providing a model where very little source annotation can help generalize to more complex target tasks on the same domain.
Table 7: Win rate (%) of TTNet and Taskonomy against Wang et al. [40], Zamir et al. (G3D) [43], and full supervision, when transferring from the zero-shot source tasks surface normal estimation (N) and room layout estimation (L) to known target tasks.

| Target task | TTNet vs. Wang (N / L) | TTNet vs. Zamir (N / L) | TTNet vs. Full Sup. (N / L) | Taskonomy vs. Wang (N / L) | Taskonomy vs. Zamir (N / L) | Taskonomy vs. Full Sup. (N / L) |
|---|---|---|---|---|---|---|
| Depth | 85 / 87 | 81 / 97 | 67 / 42 | 98 / 85 | 92 / 88 | 60 / 46 |
| 2.5D segm. | 88 / 75 | 75 / 81 | 89 / 35 | 88 / 77 | 73 / 88 | 85 / 39 |
| Curvature | 84 / 87 | 91 / 58 | 86 / 47 | 78 / 89 | 88 / 78 | 60 / 50 |
9 Other Results
9.1 Zero-shot to Known Task Transfer: Quantitative Evaluation
In continuation of our discussions in Section 5, we ask ourselves the question: are our regressed model parameters for zero-shot tasks capable of transferring to a known task? To study this, we consider the encoder-decoder parameters for a zero-shot task learned through our methodology, and fine-tune the decoder (fixing the encoder parameters) on a target known task, following the procedure in [42]. Table 7 shows the quantitative results when choosing the source (zero-shot) tasks as surface normal estimation (N) and room layout estimation (L). We compared our TTNet against [42] quantitatively by studying the win rate (%) of the two methods against other state-of-the-art methods: Wang et al. [40], G3D [43], and full supervision. However, it is worth mentioning that our parameters are obtained through the proposed zero-shot task transfer, while all the other compared methods are explicitly trained on the dataset for the task.
9.2 Alternate Methods for Task Correlation Computation
In our results so far, we studied the effectiveness of computing the task correlation matrix $\Gamma$ by aggregating crowd votes. In this section, we instead use the task graph obtained in [42] to build the task correlation matrix; we refer to this as the graph-based correlation matrix. Figure 10 shows a qualitative comparison of the two settings: $\Gamma$ obtained from the Taskonomy graph, and $\Gamma$ based on crowd knowledge. It is evident that our method shows promising results in both cases.
It is worth noting that although one can use the Taskonomy graph to build $\Gamma$: (i) the Taskonomy graph is model- and data-specific [42], while a $\Gamma$ coming from crowd votes does not explicitly assume any model or data and can be easily obtained; (ii) building the Taskonomy graph requires explicit access to zero-shot task ground truth, while constructing $\Gamma$ from crowd votes is possible without accessing any ground truth.
9.3 Evolution of TTNet:
Thus far, we showed the final results of our meta-learner after the model is fully trained. We now ask the question: how does the TTNet model evolve over the course of training? We examined the zero-shot task model parameters regressed at different points during training, and Figure 11 shows qualitative results at different epochs for four zero-shot tasks. The results show that the model improves gradually over the epochs and obtains promising results in later epochs. For example, in Figure 11(a), finer details such as wall boundaries, the sofa, the chair, and other minute details are learned in later epochs.
9.4 Qualitative Results on Cityscapes Dataset
To further study the generalizability of our models, we fine-tuned TTNet on the Cityscapes dataset [8]. We use the source task model parameters (trained on the Taskonomy dataset) to train our meta-learner, and then fine-tune using the segmentation model parameters trained on Cityscapes data. (We modified one source task of our proposed TTNet, replacing autoencoding with segmentation; see Table LABEL:table_ablation_source_task, 3rd row. All other source tasks are unaltered.) Results of the learned model parameters for four zero-shot tasks, viz. surface normal, depth, 2D edge, and 3D keypoint estimation, are reported in Figure 12, with comparison to [42] (which is trained explicitly for these tasks). Despite the lack of explicit supervision, the qualitative assessment in Figure 12 shows that our model captures more detail.