Zero-Shot Task Transfer

03/04/2019 ∙ by Arghya Pal, et al. ∙ Indian Institute of Technology Hyderabad

In this work, we present a novel meta-learning algorithm, TTNet, that regresses model parameters for novel tasks for which no ground truth is available (zero-shot tasks). In order to adapt to novel zero-shot tasks, our meta-learner learns from the model parameters of known tasks (with ground truth) and the correlation of known tasks to zero-shot tasks. Such intuition finds its foothold in cognitive science, where a subject (human baby) can adapt to a novel concept (depth understanding) by correlating it with old concepts (hand movement or self-motion), without receiving explicit supervision. We evaluated our model on the Taskonomy dataset, with four tasks as zero-shot: surface normal, room layout, depth, and camera pose estimation. These tasks were chosen based on the data acquisition complexity and the complexity associated with the learning process using a deep network. Our proposed methodology outperforms state-of-the-art models (which use ground truth) on each of our zero-shot tasks, showing promise on zero-shot task transfer. We also conducted extensive experiments to study the various choices of our methodology, as well as showed how the proposed method can also be used in transfer learning. To the best of our knowledge, this is the first such effort on zero-shot learning in the task space.


1 Introduction

The major driving force behind modern computer vision, machine learning, and deep neural network models is the availability of large amounts of curated labeled data. Deep models have shown state-of-the-art performance on different vision tasks. Effective models that work in practice require very large amounts of labeled data owing to their large parameter spaces. Expecting availability of large-scale hand-annotated datasets for every vision task is not practical. Some tasks require extensive domain expertise, long hours of human labor, or expensive data collection sensors, which collectively make the overall process very expensive. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. Due to this, many vision tasks are considered expensive [43], and practitioners either avoid such tasks or continue with lesser amounts of data, which can lead to poorly performing models. We seek to address this problem in this work, viz., to build an alternative approach that can obtain model parameters for tasks without any labeled data. Extending the definition of zero-shot learning from basic recognition settings, we call our work Zero-Shot Task Transfer.

Figure 1: Our Zero-Shot Task Transfer framework explores a meta-manifold of model parameters to regress model parameters of zero-shot tasks for which no ground truth is available. We compare our objective with that of Taskonomy [42] to delineate the difference. Our algorithm TTNet (described in Section 3) assumes a data manifold and a meta-manifold, which is further divided into a meta-encoder manifold and a meta-decoder manifold.

Cognitive studies show results where a subject (human baby) can adapt to a novel concept (e.g. depth understanding) by correlating it with known concepts (hand movement or self-motion), without receiving explicit supervision. In a similar spirit, we present our meta-learning algorithm that computes model parameters for novel tasks for which no ground truth is available (called zero-shot tasks). In order to adapt to a zero-shot task, our meta-learner learns from the model parameters of known tasks (with ground truth) and their task correlation to the novel task. Formally, given the knowledge of known tasks {τ_1, ..., τ_K}, a meta-learner can be used to extrapolate parameters for τ_{K+1}, a novel task.

However, with no knowledge of relationships between the tasks, it may not be plausible to learn a meta-learner, as its output could map to any point on the meta-manifold (see Figure 1). We hence consider the task correlation between known tasks and a novel task as an additional input to our framework. There are different ways in which task correlation can be obtained. In this work, we use the approach of wisdom-of-crowd for this purpose. Many vision [30] and non-vision machine learning applications [32], [38] encode such crowd wisdom in their learning methods. Harvesting task correlation knowledge from the crowd is fast, cheap, and brings in domain knowledge. High-fidelity aggregation of crowd votes is used to integrate the task correlation between known and zero-shot tasks in our model. We however note that our framework can admit any other source of task correlation beyond crowdsourcing. (We show our results with other sources in the supplementary section.)

Our broad idea of leveraging task correlation is similar in spirit to the recently proposed Taskonomy [42], but our method and objectives are different in many ways (see Figure 1): (i) Taskonomy studies task correlation to find a way to transfer one task model to another, while our method extrapolates to a zero-shot task, for which no labeled data is available; (ii) To adapt to a new task, Taskonomy requires a considerable amount of target labeled data, while our work does not require any target labeled data (which is, in fact, our objective); (iii) Taskonomy obtains a task transfer graph based on the representations learned by neural networks, while in this work, we leverage task correlation to learn new tasks; and (iv) Lastly, our method can be used to learn multiple novel tasks simultaneously. As stated earlier, though we use crowdsourced task correlation, any other compact notion of task correlation can easily be encoded in our methodology. More precisely, our proposal in this work is not to learn an optimal task relation, but to extrapolate to zero-shot tasks.

Our contributions can be summarized as follows:

  • We propose a novel methodology to infer zero-shot task parameters that can be used to solve vision tasks with no labeled data.

  • The methodology can scale to solving multiple zero-shot tasks simultaneously, as shown in our experiments. Our methodology provides near state-of-the-art results by considering a smaller set of known tasks, and outperforms state-of-the-art models (learned with ground truth) when using all the known tasks, although trained with no labeled data.

  • We also show how our method can be used in a transfer learning setting, and conduct various studies on the effectiveness of the proposed method.

Figure 2: Overview of our work

2 Related Work

We divide our discussion of related work into subsections that capture earlier efforts that are related to ours from different perspectives.

Transfer Learning:

Reusing supervision is the core component of transfer learning, where an already learned model of a task is finetuned to a target task. From early experimentation on CNN features [41], it was clear that initial layers of deep networks learn similar kinds of filters, which can hence be shared across tasks. Methods such as [3], [23] augment generation of samples by transferring knowledge from one category to another. Recent efforts have shown the capability to transfer knowledge from the model of one task to a completely new task [34], [33]. Zamir et al. [42] extended this idea and built a task graph for 26 vision tasks to facilitate task transfer. However, unlike our work, [42] cannot be generalized to a novel task without accessing the ground truth.

Multi-task Learning:

Multi-task learning learns multiple tasks simultaneously with a view of task generalization. Some methods in multi-task learning assume a prior and then iterate to learn a joint space of tasks [7], [19], while other methods [26], [19] do not use a prior but learn a joint space of tasks during the process of learning. Distributed multi-task learning methods [25] address the same objective when tasks are distributed across a network. However, unlike our method, a binding thread across all these methods is the explicit need for labeled data for all tasks in the setup. These methods cannot solve a zero-shot target task without labeled samples.

Domain Adaptation:

The main focus of domain adaptation is to transfer domain knowledge from a data-rich domain to a domain with limited data [27], [9]. Learning domain-invariant features requires domain alignment. Such matching is done either by mid-level features of a CNN [14], using an autoencoder [14], by clustering [36], or, more recently, by using generative adversarial networks [24]. In some recent efforts [35], [6], source and target domain discrepancy is learned in an unsupervised manner. However, a considerable amount of labeled data from both domains is still unavoidable. In our methodology, we propose a generalizable framework that can learn models for a novel task from the knowledge of available tasks and their correlation with novel tasks.

Meta-Learning:

Earlier efforts on meta-learning (with other objectives) assume that task parameters lie on a low-dimensional subspace [2], share a common probabilistic prior [22], etc. Unfortunately, these efforts target only knowledge transfer among known tasks and tasks with limited data. Recent meta-learning approaches consider all task parameters as input signals to learn a meta-manifold that helps few-shot learning [28], [37], transfer learning [33] and domain adaptation [14]. A recent approach introduces learning a meta model in a model-agnostic manner [13], [17] such that it can be applied to a variety of learning problems. Unfortunately, all these methods depend on the availability of a certain amount of labeled data in the target domain to learn the transfer function, and cannot be scaled to novel tasks with no labeled data. Besides, the meta-manifold learned by these methods is not explicit enough to extrapolate parameters of zero-shot tasks. Our method relaxes the need for ground truth by leveraging task correlation among known tasks and novel tasks. To the best of our knowledge, this is the first such work that involves regressing model parameters of novel tasks without using any ground truth information for the task.

Learning with Weak Supervision:

Task correlation is used as a form of weak supervision in our methodology. Recent methods such as [32], [38] proposed generative models that use a fixed number of user-defined weak supervision sources to programmatically generate synthetic labels for data in near-constant time. Alfonseca et al. [1] use heuristics for weak supervision to accomplish hierarchical topic modeling. Broadly, such weak supervision is harvested from knowledge bases, domain heuristics, ontologies, rules-of-thumb, educated guesses, decisions of weak classifiers, or obtained using crowdsourcing. Structure learning [4] also exploits the use of distant supervision signals for generating labels. Such methods use a factor graph to learn a high-fidelity aggregation of crowd votes. Similar to this, [30] uses weak supervision signals inside the framework of a generative adversarial network. However, none of these methods operate in a zero-shot setting. We also found related work on zero-shot task generalization in the context of reinforcement learning (RL) [29], and in lifelong learning [16]. There, an agent is validated based on its performance on unseen or longer instructions. We find that the interpretation of task, as well as the primary objectives, are very different from our present course of study.

3 Methodology

The primary objective of our methodology is to learn a meta-learning algorithm that regresses nearly optimal parameters of a novel task for which no ground truth (data or labels) is available. To this end, our meta-learner seeks to learn from the model parameters of known tasks (with ground truth) to adapt to a novel zero-shot task. Formally, let us consider K+m tasks to accomplish, i.e. τ_1, ..., τ_{K+m}, each of whose model parameters lie on a meta-manifold M_θ of task model parameters. We have ground truth available for the first K tasks, i.e. {τ_1, ..., τ_K}, and we know their corresponding model parameters {θ_1, ..., θ_K} on M_θ. Complementarily, we have no knowledge of the ground truth for the zero-shot tasks {τ_{K+1}, ..., τ_{K+m}}. (We call the tasks {τ_1, ..., τ_K} known tasks, and the rest {τ_{K+1}, ..., τ_{K+m}} zero-shot tasks for convenience.) Our aim is to build a meta-learning function F that can regress the unknown zero-shot model parameters {θ_{K+1}, ..., θ_{K+m}} from the knowledge of the known model parameters (see Figure 2(b)), i.e.:

(θ_{K+1}, ..., θ_{K+m}) = F(θ_1, ..., θ_K)        (1)

However, with no knowledge of relationships between the tasks, it may not be plausible to learn F, as it could map to any point on M_θ. We hence introduce a task correlation matrix Γ, where each entry Γ(i, j) captures the task correlation between two tasks τ_i and τ_j. Equation 1 hence becomes:

(θ_{K+1}, ..., θ_{K+m}) = F(θ_1, ..., θ_K, Γ)        (2)

The function F is itself parameterized by W. We design our objective function to compute an optimum value for W as follows:

W* = argmin_W Σ_{j=1}^{K} || F(θ_1, ..., θ_K, Γ; W) − θ_j ||^2        (3)

Similar to [42], without any loss of generality, we assume that all task parameters are learned as an autoencoder. Hence, our previously mentioned task parameters θ_j can be described in terms of an encoder, θ_j^E, and a decoder, θ_j^D. We observed that considering only encoder parameters in Equation 3 is sufficient to regress zero-shot encoders and decoders for tasks {τ_{K+1}, ..., τ_{K+m}}. Based on this observation, we rewrite our objective as (we show how our methodology works with other inputs in later sections of the paper):

W* = argmin_W Σ_{j=1}^{K} || F(θ_1^E, ..., θ_K^E, Γ; W) − (θ_j^E, θ_j^D) ||^2        (4)

where θ_j^E and θ_j^D are the learned encoder and decoder model parameters of a known task τ_j. This alone is, however, insufficient. The model parameters thus obtained should not only minimize the above loss function on the meta-manifold M_θ, but should also have low loss on the original data manifold (ground truth of known tasks).

Let D_j denote the data decoder parametrized by the regressed decoder parameters, and E_j denote the data encoder parametrized by the regressed encoder parameters. We now add a data model consistency loss to Equation 4 to ensure that our regressed encoder and decoder parameters perform well on both the meta-manifold network as well as the original data network:

W* = argmin_W Σ_{j=1}^{K} [ || F(θ_1^E, ..., θ_K^E, Γ; W) − (θ_j^E, θ_j^D) ||^2 + L_{τ_j}( D_j(E_j(x_j)), y_j ) ]        (5)

where L_{τ_j} is an appropriate loss function (mean-squared error, cross-entropy or similar) defined for the task τ_j, and (x_j, y_j) denote data and corresponding ground truth of task τ_j.

Network:

To accomplish the aforementioned objective in Equation 5, we design F as a network of K branches, each with parameters W_1, ..., W_K respectively. These are not coupled in the initial layers but are later combined in a block W_common that regresses encoder and decoder parameters. Dividing F into two parts, the W_j's and W_common, is driven by the intuition discussed in [41]: the initial layers of F transform the individual task model parameters into a suitable representation space, while the later layers parametrized by W_common capture the relationships between tasks and contribute to regressing the encoder and decoder parameters. For simplicity, we use W to refer to {W_1, ..., W_K} and W_common together. More specifics of the architecture of our model, TTNet, are discussed as part of our implementation details in Section 4.
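To make the structure of F concrete, the following PyTorch-style sketch shows one way such a branched meta-network could be organized. It is an illustration only, not the authors' released code: the use of plain fully connected layers (instead of the FCONV layers described in Section 4), the flattening of parameters into vectors, and the names TTNetSketch and branch_dim are assumptions made for brevity.

import torch
import torch.nn as nn

class TTNetSketch(nn.Module):
    """Sketch of the meta-network F: K per-task branches (W_1..W_K) whose
    outputs are concatenated with a row of the task correlation matrix and
    passed through a common block (W_common) that regresses the encoder and
    decoder parameters of a target task."""
    def __init__(self, num_known_tasks, enc_param_dim, dec_param_dim, branch_dim=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(enc_param_dim, branch_dim), nn.ReLU())
            for _ in range(num_known_tasks)
        ])
        self.common = nn.Sequential(
            nn.Linear(num_known_tasks * branch_dim + num_known_tasks, branch_dim),
            nn.ReLU(),
            nn.Linear(branch_dim, enc_param_dim + dec_param_dim),
        )
        self.enc_param_dim = enc_param_dim

    def forward(self, known_enc_params, correlation_row):
        # known_enc_params: list of K tensors of shape (batch, enc_param_dim),
        # the flattened encoder parameters theta_1^E .. theta_K^E.
        # correlation_row: (batch, K) correlations of the target task with
        # each known task (one row of Gamma).
        feats = [branch(p) for branch, p in zip(self.branches, known_enc_params)]
        merged = torch.cat(feats + [correlation_row], dim=1)
        out = self.common(merged)
        return out[:, :self.enc_param_dim], out[:, self.enc_param_dim:]

# Toy usage with made-up dimensions.
net = TTNetSketch(num_known_tasks=6, enc_param_dim=1024, dec_param_dim=512)
theta_E = [torch.randn(4, 1024) for _ in range(6)]
gamma_row = torch.rand(4, 6)
enc_hat, dec_hat = net(theta_E, gamma_row)   # regressed encoder/decoder parameters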

Learning Task Correlation:

Our methodology admits any source of obtaining task correlation, including through other work such as [42]. In this work, we obtain the task correlation matrix Γ using crowdsourcing. Obtaining task relationships from wisdom-of-crowd (and subsequent vote aggregation) is fast, cheap, and allows several kinds of input such as rules-of-thumb, ontologies, domain expertise, etc. We obtain correlations for the commonplace tasks used in our experiments from multiple human users. The obtained crowd votes are aggregated using the Dawid-Skene algorithm [10] to provide a high-fidelity task relationship matrix Γ.

Input:

To train our meta network F, we need a batch of model parameters for each known task τ_j. This process is similar to the way a batch of data samples is used to train a standard data network. To obtain a batch of model parameters for each task, we closely follow the procedure described in [40]. This process is as follows. In order to obtain one model parameter set (θ_j^E, θ_j^D) for a known task τ_j, we train a base learner (autoencoder) defined by these parameters. This is achieved by optimizing the base learner on a subset of the data and corresponding labels with an appropriate loss function for the known task (mean-square error, cross-entropy or the like, based on the task). Hence, we learn one such parameter set. Similarly, B subsets of labeled data are obtained using a sampling-with-replacement strategy from the dataset corresponding to τ_j. Following this, we obtain a set of B optimal model parameters (one for each of the B subsets sampled) for task τ_j. A similar process is followed to obtain "optimal" model parameters for each known task. These model parameters (a total of K×B across all known tasks) serve as the input to our meta network F.
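As a small illustration of this sampling-with-replacement procedure, the sketch below trains a toy base autoencoder on several bootstrapped subsets of a task's data and collects one flattened parameter vector per subset. The tiny linear autoencoder, the subset sizes, and the function names are placeholders we introduce for illustration; the paper's actual base learners are the ResNet-50 encoder-decoder networks of Section 4.

import torch
import torch.nn as nn

def make_base_autoencoder(in_dim=64, hid=16):
    # Toy stand-in for the task autoencoder (theta^E, theta^D).
    enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
    dec = nn.Linear(hid, in_dim)
    return enc, dec

def collect_task_parameters(dataset_x, dataset_y, loss_fn, num_subsets=5, subset_size=100, epochs=3):
    """Returns a list of flattened encoder parameter vectors, one per subset
    sampled with replacement, mimicking the batch-of-parameters construction."""
    params = []
    n = dataset_x.shape[0]
    for _ in range(num_subsets):
        idx = torch.randint(0, n, (subset_size,))          # sampling with replacement
        enc, dec = make_base_autoencoder(dataset_x.shape[1])
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
        for _ in range(epochs):
            pred = dec(enc(dataset_x[idx]))
            loss = loss_fn(pred, dataset_y[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
        params.append(torch.cat([p.detach().flatten() for p in enc.parameters()]))
    return params

# Example with synthetic data standing in for one known task's dataset.
x = torch.randn(1000, 64)
y = x.clone()                                              # autoencoding-style task
theta_batch = collect_task_parameters(x, y, nn.MSELoss())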

Training:

The meta network F is trained on the objective function in Equation 5 in two modes: a self mode and a transfer mode for each task. Given a known task τ_j, training in self mode implies updating the weights W_j and W_common alone. On the other hand, training in transfer mode implies updating the weights W_i (for all i ≠ j) and W_common. Self mode is similar to training a standard autoencoder, where F learns to project the model parameters near the given model parameter θ_j (learned from ground truth). In transfer mode, the model parameters of tasks other than τ_j attempt to map the regressed parameters near the given model parameter θ_j on the meta-manifold. We note that the transfer mode is essential in being able to regress model parameters of a task given model parameters of other tasks. At inference time (for zero-shot task transfer), F operates in transfer mode.
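The two training modes can be thought of as deciding which parts of W receive gradients for a given target task τ_j. The sketch below illustrates this idea with toy stand-in modules; freezing branch j in transfer mode (rather than removing its input entirely) and the layer sizes are simplifying assumptions on our part, not the paper's exact procedure.

import torch
import torch.nn as nn

K, d = 4, 32                                                     # toy sizes
branches = nn.ModuleList([nn.Linear(d, d) for _ in range(K)])    # W_1 .. W_K
common = nn.Linear(K * d, d)                                     # W_common
opt = torch.optim.Adam(list(branches.parameters()) + list(common.parameters()), lr=1e-4)

def training_step(theta, j, mode):
    """theta: (K, d) known-task parameter vectors; j: index of the task whose
    parameters are the regression target; mode: 'self' or 'transfer'.
    Self mode updates only branch j (and the common block); transfer mode
    updates every branch except j (and the common block)."""
    for i, branch in enumerate(branches):
        trainable = (i == j) if mode == 'self' else (i != j)
        for p in branch.parameters():
            p.requires_grad_(trainable)
    feats = torch.cat([branch(theta[i]) for i, branch in enumerate(branches)])
    pred = common(feats)
    loss = ((pred - theta[j]) ** 2).mean()      # regression toward theta_j on the meta-manifold
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    return loss.item()

theta = torch.randn(K, d)
training_step(theta, j=0, mode='self')
training_step(theta, j=0, mode='transfer')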

Regressing Zero-Shot Task Parameters:

Once we learn the optimal parameters W* for F using Algorithm 1, we use this to regress zero-shot task parameters, i.e. θ_p for all p ∈ {K+1, ..., K+m}. (We note that the implementation of Algorithm 1 was found to be independent of the ordering of the tasks.)

4 Results

To evaluate our proposed framework, we consider the vision tasks defined in [42]. (Whether this is an exhaustive list of vision tasks is arguable, but they are sufficient to support our proof of concept.) In this section, we consider four of the tasks as unknown or zero-shot: surface normal, depth estimation, room layout, and camera pose estimation. We have curated this list based on the data acquisition complexity and the complexity associated with the learning process using a deep network. Surface normal, depth estimation and room layout estimation are monocular tasks but involve expensive sensors to get labeled data points. Camera pose estimation requires multiple images (two or more) to infer six degrees of freedom and is generally considered a difficult task. We have four different TTNets to accomplish them: (1) TTNet_6 considers 6 vision tasks as known tasks; (2) TTNet_10 considers 10 vision tasks as known tasks; and (3) TTNet_20 considers 20 vision tasks as known tasks. In addition, we have another model, TTNet_20LS (20 known tasks), in which the regressed parameters are finetuned on a small amount (20%) of data for the zero-shot tasks. (This provides low supervision and hence the name TTNet_20LS.) Studies on other sets of tasks as zero-shot tasks are presented in Section 5. We also performed an ablation study on permuting the source tasks differently, which is presented in the supplementary section due to space constraints.

4.1 Dataset

We evaluated TTNet on the Taskonomy dataset [42], a publicly available dataset comprising more than 150K RGB data samples of indoor scenes. It provides the ground truth of 26 tasks for the same RGB images, which is the main reason for considering this dataset. We considered 120K images for training, 16K images for validation, and 17K images for testing.

4.2 Implementation Details

Network Architecture:

Following Section 3, each data network is considered an autoencoder, and closely follows the model architecture of [42]. The encoder is a fully convolutional ResNet-50 model without pooling; the decoder comprises 15 fully convolutional layers for all pixel-to-pixel tasks (e.g. normal estimation), and 2-3 fully connected layers for low-dimensional tasks (e.g. vanishing points). To make input samples for TTNet, we created 5000 samples of the model parameters for each task, each of which is obtained by training the model on 1K data points sampled (with replacement) from the Taskonomy dataset. These data networks were trained with mini-batches of size 32, a learning rate of 0.001, a momentum factor of 0.5, and Adam as the optimizer.

TTNet:

TTNet’s architecture closely follows the “classification” network of [13]. Our network is shown in Figure 2(b). TTNet initially has K branches, where K depends on the model under consideration (TTNet_6, TTNet_10, or TTNet_20). Each branch comprises 15 fully convolutional (FCONV) layers followed by 14 fully connected layers. The branches are then merged to form a common block comprising 15 FCONV layers. We trained the complete model with mini-batches of size 32, a learning rate of 0.0001, a momentum factor of 0.5, and Adam as the optimizer.

Task correlation:

Crowd workers are asked to respond for each pair of tasks (known and zero-shot) on a numeric scale ranging from strong correlation to no correlation, with a separate value reserved to denote self-relation. We then aggregated the crowd votes using the Dawid-Skene algorithm, which is based on the principle of Expectation-Maximization (EM). More details of the Dawid-Skene methodology and vote aggregation are deferred to the supplementary section.

4.3 Comparison with State-of-the-Art Models

Figure 3: Qualitative comparison (best viewed in color): TTNet models compared against other state-of-the-art models; see Section 4.3.1 for details. (a) Surface Normal Estimation: red boxes indicate results of our TTNet models; (b) Room Layout: red edges indicate the predicted room edges; green boxes indicate our TTNet model results; (c) Depth Estimation: red bounding boxes show our results; (d) Camera Pose Estimation: the first image is the reference frame of the camera (green arrow). The second image (red arrow) is taken after a geometric translation w.r.t. the first image. Blue rectangles show our results.

We show both qualitative and quantitative results for our TTNet, trained using the aforementioned methodology, on each of the four identified zero-shot tasks against state-of-the-art models for each respective task below. We note that the same TTNet is validated against all tasks.

4.3.1 Qualitative Results

Surface Normal Estimation:

For this task, our TTNet is compared against the following state-of-the-art models: Multi-scale CNN (MC) [12], Deep3D (D3D) [39], Deep Network for surface normal estimation (DD) [39], SkipNet [5], GeoNet [31] and Taskonomy (TN) [42]. The results are shown in Figure 3(a), where the red boxes correspond to our models trained under different settings (as described at the beginning of Section 4). It is evident from the results that TTNet_6 gives visual results similar to [42]. As we increase the number of source tasks, our TTNet shows improved results. TTNet_20LS captures finer details (see the edges of the chandelier) that are not visible in any other result.

Room Layout Estimation:

We followed the definition of layout types in [20], and our TTNet's results are compared against the following room layout estimation methods: Volumetric [15], Edge Map [44], LayoutNet [46], RoomNet [20], and Taskonomy [42]. The green boxes in Figure 3(b) indicate TTNet results; the red edges indicate the predicted room edges. Each model infers room corner points and joins them with straight lines. We report two complex cases in Figure 3(b): (1) heavy occlusion, and (2) multiple edges such as roof-top, door, etc.

Depth Estimation:

Depth is computed from a single image. We compared our TTNet against: FDA [21], Taskonomy [42], and GeoNet [31]. The red bounding boxes show our results. It can be observed from Figure 3(c) that our TTNet models compare favorably with [42], and that TTNet_20 and TTNet_20LS outperform all other methods studied.

Camera Pose Estimation (fixed):

Camera pose estimation requires two images captured from two different geometric points of view of the same scene. Fixed camera pose estimation predicts any five of the six degrees of freedom: yaw, pitch, roll and x, y, z translation. In Figure 3(d), we show two different geometric camera angle translations: (1) perspective, and (2) translation in the y and z coordinates. The first image is the reference frame of the camera (green arrow). The second image (red arrow) is taken after a geometric translation w.r.t. the first image. We compared our model against: RANSAC [11], Latent RANSAC [18], Generic 3D pose [43] and Taskonomy [42]. Once again, TTNet_20 and TTNet_20LS outperform all other methods studied.

4.3.2 Quantitative Results

Surface Normal Estimation:

We evaluated our method based on the evaluation criteria described in [31], [5]. The results are presented in Table 1. Our TTNet_6 is comparable to the state-of-the-art Taskonomy [42] and GeoNet [31]. Our TTNet_10, TTNet_20, and TTNet_20LS outperform all state-of-the-art models.
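For reference, the surface normal criteria of [31], [5] used in Table 1 can be computed as in the short NumPy sketch below; the function name and the assumption that the normals are already unit vectors are ours.

import numpy as np

def surface_normal_metrics(pred, gt):
    """pred, gt: (N, 3) unit surface normals. Returns mean / median / RMSE of
    the angular error in degrees, plus the fraction of pixels within the
    11.25, 22.5 and 30 degree thresholds (multiply by 100 for Table 1)."""
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    within = lambda t: float(np.mean(err < t))
    return {"mean": float(err.mean()), "median": float(np.median(err)),
            "rmse": float(np.sqrt((err ** 2).mean())),
            "11.25": within(11.25), "22.5": within(22.5), "30": within(30.0)}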

Method Mean (↓) Median (↓) RMSE (↓) 11.25° (↑) 22.5° (↑) 30° (↑)
MC [12] 30.30 35.30 - 30.29 57.17 68.29
D3D [39] 25.71 20.81 31.01 38.12 59.18 67.21
DD [39] 21.10 15.61 - 44.39 64.48 66.21
SkipNet [5] 20.21 12.19 28.20 47.90 70.00 78.23
TN [42] 19.90 11.93 23.13 48.03 70.02 78.88
TTNet_6 19.22 12.01 26.13 48.02 71.11 78.29
GeoNet [31] 19.00 11.80 26.90 48.04 72.27 79.68
TTNet_10 19.81 11.09 22.37 48.83 71.61 79.00
TTNet_20 19.27 11.91 26.44 48.81 71.97 79.72
TTNet_20LS 15.10 9.29 24.31 56.11 75.19 84.71
Table 1: Surface Normal Estimation. Mean, median and RMSE refer to the angular difference between the model's predicted surface normals and the ground truth surface normals (lower is better). The other three columns give the percentage of pixels whose prediction lies within 11.25°, 22.5° and 30° of the ground truth (higher is better). A '-' indicates values that could not be obtained for the corresponding method.
Room Layout Estimation:

We use two standard evaluation criteria: (1) Keypoint error: a global measurement averaged over the Euclidean distance between the model's predicted keypoints and the ground truth; and (2) Pixel error: a local measurement that estimates the pixelwise error between the predicted surface labels and the ground truth labels. Table 2 presents the results. A lower number for our TTNet models indicates good performance.
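A minimal sketch of the two criteria is given below; note that the room-layout literature often normalizes keypoint error by the image diagonal, which is omitted here since the text only specifies an averaged Euclidean distance (an assumption on our part).

import numpy as np

def keypoint_error(pred_kpts, gt_kpts):
    # Mean Euclidean distance between predicted and ground-truth room corner
    # points; both arrays have shape (num_keypoints, 2).
    return float(np.linalg.norm(pred_kpts - gt_kpts, axis=1).mean())

def pixel_error(pred_labels, gt_labels):
    # Fraction of pixels whose predicted surface label differs from the
    # ground-truth label (multiply by 100 for the numbers in Table 2).
    return float(np.mean(pred_labels != gt_labels))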

Method VM [15] EM [44] LN [46] TTNet_6 RN [20] TN [42] TTNet_10 TTNet_20 TTNet_20LS
Keypt. 15.48 11.2 7.64 7.51 6.30 6.22 6.00 5.82 5.52
Pixel 24.33 16.71 10.63 8.10 8.00 8.00 7.72 7.10 6.81
Table 2: Room Layout. TTNet_20 and TTNet_20LS both outperformed state-of-the-art models on keypoint and pixel error.
Depth Estimation:

We followed the evaluation criteria for depth estimation as in [21], where the metrics are: RMSE(lin) = sqrt( (1/T) Σ_p (d_p − d̂_p)^2 ); RMSE(log) = sqrt( (1/T) Σ_p (log d_p − log d̂_p)^2 ); Absolute relative distance (ARD) = (1/T) Σ_p |d_p − d̂_p| / d_p; Squared absolute relative distance (SRD) = (1/T) Σ_p (d_p − d̂_p)^2 / d_p. Here, d_p is the ground truth depth at pixel p, d̂_p is the estimated depth, and T is the total number of pixels in all images in the test set.
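The four criteria can be computed directly as below; the small epsilon guarding against division by zero and the logarithm of zero is an implementation assumption.

import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """pred, gt: arrays of estimated and ground-truth depth over all test
    pixels. Implements the four criteria listed above."""
    pred, gt = pred.ravel(), gt.ravel()
    rmse_lin = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred + eps) - np.log(gt + eps)) ** 2))
    ard = np.mean(np.abs(pred - gt) / (gt + eps))
    srd = np.mean(((pred - gt) ** 2) / (gt + eps))
    return {"RMSE(lin)": rmse_lin, "RMSE(log)": rmse_log, "ARD": ard, "SRD": srd}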

Method RMSE(lin) RMSE(log) ARD SRD
FDA [21] 0.877 0.283 0.214 0.204
TTNet_6 0.745 0.262 0.220 0.210
TN [42] 0.591 0.231 0.242 0.206
TTNet_10 0.575 0.172 0.236 0.179
GeoNet [31] 0.591 0.205 0.149 0.118
TTNet_20 0.597 0.204 0.140 0.106
TTNet_20LS 0.572 0.193 0.139 0.096
Table 3: Depth estimation: TTNet_20 and TTNet_20LS outperform all other methods studied.
Camera Pose Estimation (fixed):

We adopted the win rate (%) evaluation criterion [42], which counts the proportion of images for which a baseline is outperformed. Table 4 shows the win rate of TTNet models on angular error with respect to state-of-the-art models: RANSAC [11], Latent RANSAC (LR) [18], G3D [43] and Taskonomy [42]. The results show the promising performance of TTNet.
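The win rate reduces to a simple per-image comparison of angular errors, as in the sketch below.

import numpy as np

def win_rate(our_errors, baseline_errors):
    # Percentage of test images on which our model's angular error is lower
    # than the baseline's.
    our_errors = np.asarray(our_errors)
    baseline_errors = np.asarray(baseline_errors)
    return 100.0 * float(np.mean(our_errors < baseline_errors))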

Method RANSAC [11] LR [18] G3D [43] TN [42]
TTNet_6 88% 81% 72% 64%
TTNet_10 90% 82% 79% 82%
TTNet_20 90% 82% 92% 80%
TTNet_20LS 96% 88% 96% 87%
Table 4: Camera Pose Estimation (fixed). We have considered the win rate (%) on angular error. Columns are state-of-the-art methods and rows are our four TTNet models.
Figure 4: Zero-shot task to known task transfer. We consider the zero-shot tasks: surface normal estimation and room layout estimation, and transfer to models for Keypoint 3D, 2.5D segmentation and curvature estimation.

5 Discussion and Analysis

Significance Analysis of Source Tasks:

An interesting question on our approach is: how do we quantify the contribution of each individual source task towards regressing the parameters of a target task? In other words, which source task plays the most important role in regressing the zero-shot task parameters? Figure 5 quantifies this by considering a latent task basis. We followed the GO-MTL approach [19] to compute the task basis. Optimal model parameters of known tasks are first mapped to a low-dimensional vector space using an autoencoder, before applying GO-MTL.

Formally speaking, the optimal model parameters of each known task are mapped to a low-dimensional space using an autoencoder trained on the model parameters of the known tasks (similar to Equation 5). The same autoencoder infers a latent representation for the regressed model parameters of a zero-shot task. We used ResNet-18 for both the encoder and decoder, a latent dimension of 100, and a task basis dimension of 8. We can then form the task matrix W = LS, comprising the latent basis L and the per-task combination weights S. In Figure 5, boxes of the same color denote similar-valued weights of the task basis vectors. The most important source task has the highest number of basis elements with values similar to those of the zero-shot task. In Figure 5(a), the source task "Autoencoding" is important for the zero-shot task "Z-Depth" as they share 4 such basis elements.
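As a rough stand-in for the factorization W = LS (the original GO-MTL [19] uses its own alternating optimization), the sketch below obtains a small latent basis and sparse per-task weights with scikit-learn's dictionary learning; the task count, the random placeholder data and the hyperparameters are assumptions made for illustration.

import numpy as np
from sklearn.decomposition import DictionaryLearning

# task_params: one row per task (known + zero-shot), each a low-dimensional
# task parameter vector, e.g. the 100-d autoencoder codes described above.
rng = np.random.default_rng(0)
task_params = rng.standard_normal((24, 100))      # placeholder values

# Factorize W ~ S L with sparse combination weights S and 8 basis vectors L.
dl = DictionaryLearning(n_components=8, alpha=1.0, random_state=0)
S = dl.fit_transform(task_params)                 # (num_tasks, 8) sparse weights
L = dl.components_                                # (8, 100) latent task basis
# Tasks whose rows of S share many similarly-valued entries rely on the same
# basis elements, which is how source-task importance is read off Figure 5.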

Figure 5: Finding the basis of tasks. Latent task bases are estimated following the GO-MTL approach [19]. (a) The most important source task has the highest number of basis elements with values similar to those of the zero-shot task. Example: the source task "Autoencoding" is important for the zero-shot task "Z-Depth" as they share 4 such basis elements. (b) When tasks are related (which is the setting in our work), learning from similar tasks can by itself provide good performance. Example: the basis vector of the zero-shot task "Depth" is composed of latent elements of several source tasks.
Why Do Zero-Shot Task Parameters Perform Better than Supervised Training?

It is evident from our qualitative and quantitative study that regressed zero-shot parameters outperform results from supervised learning. When tasks are related (which is the setting in our work), learning from similar tasks can by itself provide good performance. From Figure 5, we can see that the basis vector of the zero-shot task "Depth" is composed of latent elements of several source tasks. For example, in Figure 5(b), the learning of one element (red box) of the zero-shot task "Z-Depth" is supported by 4 related source tasks.

Zero-shot to Known Task Transfer:

Are our regressed model parameters for zero-shot tasks capable of transferring to a known task? To study this, we consider the encoder-decoder parameters for a zero-shot task, and finetune the decoder to a target known task, following the procedure in [42] (the encoder remains that of the zero-shot task). Figure 4 shows the qualitative results, which are promising. We also compared our TTNet against [42] quantitatively by studying the win rate (%) of the two methods against other state-of-the-art methods: Wang et al. [40], G3D [43], and full supervision. Owing to space constraints, these results are presented in the supplementary section.

Choice of Zero-shot Tasks:

In order to study the generalizability of our method, we conducted experiments with a different set of zero-shot tasks than those considered in Section 4.

Figure 6: Different choice of zero-shot tasks. Results of TTNet_6 on a different set of zero-shot tasks: 2D segmentation, vanishing point estimation, curvature estimation, 2.5D segmentation and reshading.

Figure 6 shows promising results for our weakest model, TTNet_6, on other tasks as zero-shot tasks. More results of our other models, TTNet_10, TTNet_20 and TTNet_20LS, are included in the supplementary section.

Figure 7: Surface normal estimation on Cityscapes. Red circles highlight details (car, tree, human) captured by our model which are missed by Taskonomy [42].
Performance on Other Datasets:

To further study the generalizability of our models, we finetuned TTNet on the Cityscapes dataset [8], and the surface normal results are reported in Figure 7, with comparison to [42]. Our model captures more detail.

Object detection on COCO-Stuff dataset:

TTNet is finetuned on the COCO-Stuff dataset for object detection. To facilitate object detection, we considered object classification as a source task instead of colorization. TTNet performs fairly well (Figure 8).

Figure 8: Object detection using TTNet. TTNet is finetuned on the COCO-Stuff dataset for object detection.
Model TTNet (seven variants with a varying number of known tasks)
Wang [40] 81 84 84 88 88 91 97
Zamir [43] 73 75 81 82 86 87 90
TN [42] 62 65 84 85 84 89 94
Table 5: Win rate (%) of surface normal estimation for TTNet models with a varying number of known tasks against [40], [43], and [42].
Optimal Number of Known Tasks:

In this work, we have reported results of TTNet with 6, 10 and 20 known tasks. We studied how many tasks are sufficient to adapt to zero-shot tasks in the considered setting; the results are reported in Table 5. Expectedly, a higher number of known tasks provided improved performance. A direction of our future work includes the study of the impact of negatively correlated tasks on zero-shot task transfer.

We also conducted experiments using our methodology with the task correlations obtained directly from the results of [42]. We present these, as well as other results, including the evolution of our TTNet model over the epochs of training, in the supplementary section.

6 Conclusion

In summary, we present a meta-learning algorithm to regress model parameters of a novel task for which no ground truth is available (zero-shot task). We evaluated our learned model on the Taskonomy [42] dataset, with four zero-shot tasks: surface normal estimation, room layout estimation, depth estimation and camera pose estimation. We conducted extensive experiments to study the usefulness of zero-shot task transfer, as well as showed how the proposed TTNet can also be used in transfer learning. Our future work will involve closer analysis of the implications of obtaining task correlation from various sources, and the corresponding results for zero-shot task transfer. In particular, negative transfer in task space is a particularly interesting direction of future work.

References

  • [1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 54–59. Association for Computational Linguistics, 2012.
  • [2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
  • [3] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
  • [4] S. H. Bach, B. He, A. Ratner, and C. Ré. Learning the structure of generative models without labeled data. arXiv preprint arXiv:1703.00854, 2017.
  • [5] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5965–5974, 2016.
  • [6] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty. Re-weighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7976–7985, 2018.
  • [7] H. Cohen and K. Crammer. Learning multiple tasks in parallel with a shared annotator. In Advances in Neural Information Processing Systems, pages 1170–1178, 2014.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [9] G. Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
  • [10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pages 20–28, 1979.
  • [11] K. G. Derpanis. Overview of the ransac algorithm. Image Rochester NY, 4(1):2–3, 2010.
  • [12] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
  • [13] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • [14] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
  • [15] A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in neural information processing systems, pages 1288–1296, 2010.
  • [16] D. Isele, M. Rostami, and E. Eaton. Using task features for zero-shot knowledge transfer in lifelong learning. In IJCAI, pages 1620–1626, 2016.
  • [17] T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
  • [18] S. Korman and R. Litman. Latent ransac. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6693–6702, 2018.
  • [19] A. Kumar and H. Daume III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.
  • [20] C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. Roomnet: End-to-end room layout estimation. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4875–4884. IEEE, 2017.
  • [21] J.-H. Lee, M. Heo, K.-R. Kim, and C.-S. Kim. Single-image depth estimation based on fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 330–339, 2018.
  • [22] S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th international conference on Machine learning, pages 489–496. ACM, 2007.
  • [23] J. J. Lim, R. R. Salakhutdinov, and A. Torralba. Transfer learning by borrowing examples for multiclass object detection. In Advances in neural information processing systems, pages 118–126, 2011.
  • [24] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
  • [25] S. Liu, S. J. Pan, and Q. Ho. Distributed multi-task relationship learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, 2017.
  • [26] M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.
  • [27] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
  • [28] D. K. Naik and R. Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE, 1992.
  • [29] J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. arXiv preprint arXiv:1706.05064, 2017.
  • [30] A. Pal and V. N. Balasubramanian. Adversarial data programming: Using gans to relax the bottleneck of curated labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1556–1565, 2018.
  • [31] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
  • [32] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575, 2016.
  • [33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [35] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1712.02560, 3, 2017.
  • [36] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 2110–2118, 2016.
  • [37] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
  • [38] P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123, 2016.
  • [39] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
  • [40] Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In NIPS, 2017.
  • [41] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
  • [42] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
  • [43] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3d representation via pose estimation and matching. In European Conference on Computer Vision, pages 535–553. Springer, 2016.
  • [44] W. Zhang, W. Zhang, K. Liu, and J. Gu. Learning to predict high-quality edge maps for room layout estimation. IEEE Transactions on Multimedia, 19(5):935–943, 2017.
  • [45] C. Zhu, H. Xu, and S. Yan. Online crowdsourcing. CoRR, abs/1512.02393, 2015.
  • [46] C. Zou, A. Colburn, Q. Shan, and D. Hoiem. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2051–2059, 2018.

7 More on Task Correlation

Dawid-Skene method:

As mentioned in Section 3, we used the well-known Dawid-Skene (DS) method [10], [45] to aggregate votes from human users to compute the task correlation matrix Γ. We now describe the DS method.

We assume a total of A annotators providing labels for N items, where each label belongs to one of C classes. DS associates each annotator a with a confusion matrix π^(a) to measure that annotator's performance. The final label is a weighted combination of the annotators' decisions based on their confusion matrices. Each entry π^(a)_{cd} of the confusion matrix is the probability of annotator a predicting class d when the true class is c. The true label of an item i is y_i, and the vector y denotes the true labels of all items, i.e. y = {y_1, ..., y_N}. Let x_{ia} denote annotator a's label for item i, i.e. if annotator a labeled item i as class c, we write x_{ia} = c. Let the matrix X denote all labels for all items given by all annotators. The DS method computes the annotators' error tensor π = {π^(1), ..., π^(A)}, where each entry denotes the probability of an annotator assigning a particular label to an item given its true label. The joint likelihood of true labels and observed labels can hence be written as:

P(X, y | π, p) = Π_{i=1}^{N} p_{y_i} Π_{a=1}^{A} Π_{d=1}^{C} (π^(a)_{y_i d})^{1[x_{ia} = d]}        (6)

where p_c denotes the prior probability of class c.

Maximizing the above likelihood provides us a mechanism to aggregate the votes from the annotators. To this end, we find the maximum likelihood estimate using Expectation Maximization (EM) for the marginal log likelihood below:

log P(X | π, p) = Σ_{i=1}^{N} log ( Σ_{c=1}^{C} p_c Π_{a=1}^{A} Π_{d=1}^{C} (π^(a)_{cd})^{1[x_{ia} = d]} )        (7)

The E step is given by:

q_i(c) := P(y_i = c | X, π, p) ∝ p_c Π_{a=1}^{A} Π_{d=1}^{C} (π^(a)_{cd})^{1[x_{ia} = d]}        (8)

The M step subsequently computes the estimate that maximizes the log likelihood:

π^(a)_{cd} = Σ_{i=1}^{N} q_i(c) 1[x_{ia} = d] / Σ_{i=1}^{N} q_i(c)        (9)
p_c = (1/N) Σ_{i=1}^{N} q_i(c)        (10)
Figure 9: Task correlation matrix. We obtain the task correlation matrix Γ after receiving votes from annotators. Annotators are asked to give a task correlation label on a fixed numeric scale whose values denote, respectively, self-relation, strong relation, weak relation, abstain, and no relation between two tasks. We use this matrix to build our meta-learner TTNet.

Once we obtain the annotators' error tensor π and the class priors p from Equations 9 and 10, we can estimate:

q_i(c) ∝ p_c Π_{a=1}^{A} Π_{d=1}^{C} (π^(a)_{cd})^{1[x_{ia} = d]}        (11)

for all items i and classes c. To get the final predicted label for each item, we adopt a winner-takes-all strategy on q_i(c) across all c. We request readers to refer to [45] and [10] for more details.
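A compact NumPy sketch of the EM iterations above is given below; the handling of missing votes with -1, the majority-vote initialization, and the smoothing constant are implementation assumptions on our part rather than details from the paper.

import numpy as np

def dawid_skene(labels, num_classes, num_iters=50):
    """labels: (num_items, num_annotators) integer array, -1 for missing votes.
    Returns (posterior over true labels, per-annotator confusion matrices),
    following the EM scheme of Equations 7-10."""
    N, A = labels.shape
    C = num_classes
    # Initialize posteriors with majority voting.
    q = np.zeros((N, C))
    for i in range(N):
        votes = labels[i][labels[i] >= 0]
        q[i] = np.bincount(votes, minlength=C) / max(len(votes), 1)
    for _ in range(num_iters):
        # M step: class priors and annotator confusion matrices (Eqs. 9-10).
        p = q.mean(axis=0)
        pi = np.full((A, C, C), 1e-6)
        for a in range(A):
            for i in range(N):
                if labels[i, a] >= 0:
                    pi[a, :, labels[i, a]] += q[i]
            pi[a] /= pi[a].sum(axis=1, keepdims=True)
        # E step: posterior over the true label of each item (Eq. 8).
        log_q = np.tile(np.log(p + 1e-12), (N, 1))
        for i in range(N):
            for a in range(A):
                if labels[i, a] >= 0:
                    log_q[i] += np.log(pi[a, :, labels[i, a]])
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q, pi

# Winner-takes-all over the returned posterior (Equation 11), e.g.:
#   q, pi = dawid_skene(X, num_classes)
#   aggregated_labels = q.argmax(axis=1)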

Implementation:

In our experiments, as mentioned before, we considered the Taskonomy dataset [42]. This dataset has 26 vision-related tasks. We are interested in finding the task correlation for each pair of known and zero-shot tasks. Let us assume that we have A annotators. To fit our setting into the DS framework, we flatten the task correlation matrix Γ (described in Section 3) in row-major order to get an item set, one item per task pair. For each item, each annotator is asked to give a task correlation label on the scale described above, whose values denote self-relation, strong relation, weak relation, abstain, and no relation between the two tasks. After collecting the annotators' votes, we build the matrix X. Subsequently, we find the annotators' error tensor (Equation 6) and the likelihood estimates (Equations 7, 8, 9, 10). We get the predicted class labels after a winner-takes-all step in Equation 11. These predicted class labels are the task correlations we wish to obtain. We get the final task correlation matrix Γ after de-flattening the predicted labels back into matrix form.

Figure 9 shows the final task correlation matrix used in our experiments. The matrix fairly reflects an intuitive knowledge of the tasks considered. We also considered an alternate mechanism for obtaining the task correlation matrix from the task graph computed in [42]. We present these results later in Section 9.2.

Annotators RANSAC [11] LR [18] G3D [43] TN [42]
3 28% 22% 29% 40%
10 51% 29% 31% 52%
20 90% 82% 92% 42%
30 88% 81% 72% 64%
35 88% 82% 75% 61%
40 90% 72% 69% 63%
45 87% 80% 61% 70%
50 90% 82% 72% 50%
Table 6: Win rates (%) of TTNet_6 with a varied number of annotators. We considered the win rate (%) on angular error. Columns are state-of-the-art methods and rows are our TTNet_6 trained using a Γ obtained from different numbers of annotators A, where A ∈ {3, 10, 20, 30, 35, 40, 45, 50}.
Ablation study on different number of annotators:

The results in the main paper were obtained with 30 annotators. In this section, we studied the robustness of our method when Γ is obtained by varying the number of annotators A, where A ∈ {3, 10, 20, 30, 35, 40, 45, 50}. Table 6 shows the win rate (%) [42] on the camera pose estimation task using TTNet_6 (when 6 source tasks are used). While there are variations, we identified 30 as the number of annotators at which the results are most robust, and used this setting for the rest of our experiments (in the main paper).

8 Ablation Studies on Varying Known Tasks

In this section, we present two ablation studies w.r.t. known tasks, viz., (i) the number of known tasks and (ii) the choice of known tasks. These studies attempt to answer the questions: how many known tasks are sufficient to adapt to zero-shot tasks in the considered setting? Which known tasks are more favorable to transfer to zero-shot tasks? While an exhaustive study is infeasible, we attempt to answer these questions by conducting a study across six different TTNet models, where the subscript of each model denotes the number of source tasks considered. We used the win rate (%) against [42] for each of the zero-shot tasks. Table 7 shows the results of our studies with a varying number and choice of known source tasks. Expectedly, a higher number of known tasks provides improved performance. It is observed from the table that our methodology is fairly robust despite changes in the choice of source tasks, and that TTNet_6 provides a good balance by achieving good performance even with a low number of source tasks. Interestingly, most of the source tasks considered for TTNet_6 (autoencoding, denoising, 2D edges, occlusion edges, vanishing point, and colorization) are tasks that do not require significant annotation, thus providing a model where very little source annotation can help generalize to more complex target tasks on the same domain.

Model | TTNet: Wang [40] (N, L), Zamir [43] (N, L), Full Sup (N, L) | Taskonomy: Wang [40] (N, L), Zamir [43] (N, L), Full Sup (N, L)
Depth | 85 87, 81 97, 67 42 | 98 85, 92 88, 60 46
2.5D seg. | 88 75, 75 81, 89 35 | 88 77, 73 88, 85 39
Curvature | 84 87, 91 58, 86 47 | 78 89, 88 78, 60 50
Table 7: Zero-shot to known task transfer. We consider the encoder-decoder parameters for a zero-shot task learned through our method, and finetune the decoder (fixing the encoder) to a target known task, following the procedure in [42]. Source tasks (zero-shot) are surface normal (N) and room layout (L). Target tasks are depth, 2.5D segmentation and curvature. Win rates (%) of task transfer with respect to self-supervised methods, such as Wang et al. [40] and Zamir et al. [43], as well as a fully supervised setting are shown (all values are in %).

9 Other Results

9.1 Zero-shot to Known Task Transfer: Quantitative Evaluation

In continuation of our discussion in Section 5, we ask the question: are our regressed model parameters for zero-shot tasks capable of transferring to a known task? To study this, we consider the encoder-decoder parameters for a zero-shot task learned through our methodology, and finetune the decoder (fixing the encoder parameters) to a target known task, following the procedure in [42]. Table 7 shows the quantitative results when choosing the source (zero-shot) tasks as surface normal estimation (N) and room layout estimation (L). We compared our TTNet against [42] quantitatively by studying the win rate (%) of the two methods against other state-of-the-art methods: Wang et al. [40], G3D [43], and full supervision. It is worth mentioning, however, that our parameters are obtained through the proposed zero-shot task transfer, while all other comparative methods are explicitly trained on the dataset for the task.

Figure 10: Qualitative results when the task correlation matrix Γ is obtained from the task graph computed in [42]. We studied using the task graph computed in [42] (instead of crowd votes) to build the task correlation matrix Γ. The first column shows the RGB image, and the subsequent columns show the zero-shot tasks: curvature, vanishing points, 2D keypoint and surface normal estimation.

9.2 Alternate Methods for Task Correlation Computation

In our results so far, we studied the effectiveness of computing the task correlation matrix Γ by aggregation of crowd votes. In this section, we instead use the task graph obtained in [42] to build the task correlation matrix; we refer to this as the Taskonomy-derived correlation matrix. Figure 10 shows a qualitative comparison between results obtained with the Taskonomy-derived matrix and with the Γ based on crowd knowledge. It is evident that our method shows promising results in both cases.

It is worth noting that although one can use the Taskonomy graph to build Γ: (i) the Taskonomy graph is model and data specific [42], while a Γ coming from crowd votes does not explicitly assume any model or data and can be easily obtained; and (ii) during the process of building the Taskonomy graph, explicit access to zero-shot task ground truth is unavoidable, while constructing Γ from crowd votes is possible without accessing any explicit ground truth.

Figure 11: Zero-shot task results during the training of TTNet. We regressed zero-shot task parameters from TTNet during its course of training. The qualitative results show the gradual learning of the model parameters over the epochs.

9.3 Evolution of TTNet:

Thus far, we showed the final results of our meta-learner after the model is fully trained. We now ask the question: how does the TTNet model evolve over the course of training? We regressed the zero-shot task model parameters from TTNet during its course of training, and Figure 11 shows qualitative results at different epochs for four zero-shot tasks over the training phase. The results show that the model's training progresses gradually over the epochs, and the model obtains promising results in later epochs. For example, in Figure 11(a), finer details such as wall boundaries, sofa, chair and other minute details are learned in later epochs.

9.4 Qualitative Results on Cityscapes Dataset

To further study the generalizability of our models, we finetuned TTNet on the Cityscapes dataset [8]. We take the source task model parameters (trained on the Taskonomy dataset) to train TTNet, and then finetune on segmentation model parameters trained on Cityscapes data. (We modified one source task of our proposed TTNet, replacing autoencoding with segmentation; all other source tasks are unaltered.) Results of the learned model parameters for four zero-shot tasks, i.e. surface normal, depth, 2D edge and 3D keypoint estimation, are reported in Figure 12, with comparison to [42] (which is trained explicitly for these tasks). Despite the lack of supervised learning, it is evident from the qualitative assessment in Figure 12 that our model captures more detail.

Figure 12: Results on Cityscapes data. We finetuned TTNet on the Cityscapes dataset [8], and the surface normal, depth, 2D edge and 3D keypoint results are reported using the learned model parameters.

9.5 More Qualitative Results

We report more qualitative results of: (i) room layout in Figure 13; (ii) surface normal estimation in Figure 14; (iii) depth estimation in Figure 15; and (iv) camera pose estimation in Figure 16.

Figure 13: More results of room layout estimation
Figure 14: More results of surface normal estimation
Figure 15: More results of depth estimation
Figure 16: More results of camera pose estimation