Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference

07/24/2020 · Menelaos Kanakis et al.

Multi-task networks are commonly utilized to alleviate the need for a large number of highly specialized single-task networks. However, two common challenges in developing multi-task models are often overlooked in the literature. First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning). Second, eliminating adverse interactions amongst tasks, which has been shown to significantly degrade single-task performance in a multi-task setup (task interference). In this paper, we show that both can be achieved simply by reparameterizing the convolutions of standard neural network architectures into a non-trainable shared part (filter bank) and task-specific parts (modulators), where each modulator has a fraction of the filter bank parameters. Thus, our reparameterization enables the model to learn new tasks without adversely affecting the performance of existing ones. The results of our ablation study attest to the efficacy of the proposed reparameterization. Moreover, our method achieves state-of-the-art performance on two challenging multi-task learning benchmarks, PASCAL-Context and NYUD, and also demonstrates superior incremental learning capability compared to its close competitors.

Code: RCM — the official PyTorch repository for the ECCV 2020 paper "Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference".
1 Introduction

(a) Single-Task setup
(b) Multi-Task setup
(c) RCM setup (ours)
Figure 1: (a) Optimizing independent models per task allows for the easy addition of new tasks, at the expense of a multiplicative increase in the total number of parameters with respect to a single model (green and blue denote task-specific parameters). (b) A single backbone for multiple tasks must be meaningful to all, thus, all tasks interact with the said backbone (black indicates common parameters). (c) Our proposed setup, RCM (Reparameterized Convolutions for Multi-task learning), uses a pre-trained filter bank (denoted in black) and independently optimized task-specific modulators (denoted in colour) to adapt the filter bank on a per-task basis. New task addition is accomplished by training the task-specific modulators, thus explicitly addressing task interference, while parameters scale at a slower rate than having independent models per task.

Over the last decade, convolutional neural networks (CNNs) have been established as the standard approach for many computer vision tasks, like image classification [26, 58, 17], object detection [15, 52, 34], semantic segmentation [35, 3, 67], and monocular depth estimation [12, 27]. Typically, these tasks are handled by CNNs independently, i.e., a separate model is optimized for each task, resulting in several task-specific models (Fig. 1(a)). However, real-world problems are more complex and require models to perform multiple tasks on-demand without significantly compromising each task's performance. For example, an interactive advertisement system tasked with displaying targeted content to its audience should be able to detect the presence of humans in its viewpoint effectively, estimate their gender and age group, recognize their head pose, etc. At the same time, there is a need for flexible models able to gradually add more tasks to their knowledge, without forgetting previously known tasks or having to re-train the whole model from scratch. For instance, a car originally deployed with lane and pedestrian detection functionalities can be extended with depth estimation capabilities post-production.

When it comes to learning multiple tasks under a single model, multi-task learning (MTL) techniques [2, 54] have been employed in the literature. On the one hand, encoder-focused approaches [41, 25, 36, 10, 43, 33, 1, 61] emphasize learning feature representations from multi-task supervisory signals by employing architectures that encode shared and task-specific information. On the other hand, decoder-focused approaches [63, 65, 66, 62] utilize the multi-task feature representations learned at the encoding stage to distill cross-task information at the decoding stage, thus refining the original feature representations. In both cases, however, the joint learning from multiple supervisory signals (i.e., tasks) can hinder the individual task performance if the associated tasks point to conflicting gradient directions during the update step of the shared feature representations (Fig. 1(b)). Formally, this is known as task interference or negative transfer and has been well documented in the literature [25, 39, 69]. To suppress negative transfer, several approaches [6, 21, 59, 16, 69, 56, 39] dynamically re-weight each task's loss function or re-order the task learning, to find a 'sweet spot' where individual task performance does not degrade significantly. Arguably, such approaches mainly focus on mitigating the negative transfer problem in the MTL architectures above, rather than eliminating it (see Section 3.2). At the same time, existing works seem to disregard the fact that MTL models are commonly desired to be incremental, i.e., information from new tasks should be continuously incorporated while existing task knowledge is preserved. In existing works, the MTL model has to be re-trained from scratch if the task dictionary changes; this is arguably sub-optimal.

Recently, task-conditional networks [39] emerged as an alternative for MTL, inspired by work in multi-domain learning [49, 50]. That is, performing separate forward passes within an MTL model, one for each task, every time activating a set of task-specific residual responses on top of the shared responses. Note that this is useful for many real-world setups (e.g., an MTL model deployed in a mobile phone with limited resources that adapts its responses according to the task at hand), and particularly for incremental learning (e.g., a scenario where the low-level tasks should be learned before the high-level ones). However, the architecture proposed in [39] is prone to task interference due to the inherent presence of shared modules, which is why the authors introduced an adversarial learning scheme on the gradients to minimize the performance degradation. Moreover, the model needs to be trained from scratch if the task dictionary changes.

Overall, existing works primarily focus on either improving the multi-task performance or reducing the number of parameters and computations in the MTL model. In this paper, we take a different route and explicitly tackle the problems of incremental learning and task interference in MTL. We show that both problems can be addressed simply by reparameterizing the convolutional operations of a neural network. In particular, building upon the task-conditional MTL direction, we propose to decompose each convolution into a shared part that acts as a filter bank encoding common knowledge, and task-specific modulators that adapt this common knowledge uniquely for each task. Fig. 1(c) illustrates our approach, RCM (Reparameterized Convolutions for Multi-task learning). Unlike existing works, the shared part in our case is not trainable, to explicitly avoid negative transfer. Most notably, as any number of task-specific modulators can be introduced in each convolution, our model can incrementally solve more tasks without interfering with the previously learned ones. Our results demonstrate that the proposed RCM can outperform state-of-the-art methods in multi-task (Section 4.6) and incremental learning (Section 4.7) experiments. At the same time, we address the common multi-task challenge of task interference by construction, by ensuring tasks can only update task-specific components and cannot interact with each other.

2 Related Work

Multi-task learning (MTL) aims at developing models that can solve a multitude of tasks [2, 54]. In computer vision, MTL approaches can roughly be divided into encoder-focused and decoder-focused ones. Encoder-focused approaches primarily emphasize architectures that can encode multi-purpose feature representations through supervision from multiple tasks. Such encoding is typically achieved, for example, via feature fusion [41], branching [25, 43, 36, 61], self-supervision [10], attention [33], or filter grouping [1]. Decoder-focused approaches start from the feature representations learned at the encoding stage, and further refine them at the decoding stage by distilling information across tasks in a one-off [63], sequential [65], recursive [66], or even multi-scale [62] manner. Due to the inherent layer sharing, the approaches above typically suffer from task interference. Several works proposed to dynamically re-weight the loss function of each task [6, 21, 59, 56], sort the order of task learning [16], or adapt the feature sharing between 'related' and 'unrelated' tasks [69], to mitigate the effect of negative transfer. In general, existing MTL approaches have primarily focused on improving multi-task performance or reducing the network parameters and computations. Instead, in this paper, we look at the largely unexplored problems of incremental learning and negative transfer in MTL models and propose a principled way to tackle them.

Incremental learning (IL) is a paradigm that attempts to augment the existing knowledge by learning from new data. IL is often used, for example, when aiming to add new classes [51] to an existing model, or learn new domains [31]. It aims to mitigate ‘catastrophic forgetting’ [14], the phenomenon of forgetting old tasks as new ones are learned. To minimize the loss of existing knowledge, Li and Hoiem [31] optimized the new task while preserving the old task’s responses. Other works [23, 29] constrained the optimization process to minimize the effect learning has on weights important for older tasks. Rebuffi et al. [51] utilized exemplars that best approximate the mean of the learned classes in the feature space to preserve performance. Note that the performance of such techniques is commonly upper bounded by the joint training of all tasks. More relevant to our work, in a multi-domain setting, a few approaches [49, 50, 53, 37] utilize a pre-trained network that remains untouched and instead learn domain-specific components that adapt the behavior of the network to address the performance drop common in IL techniques. Inspired by this research direction, we investigate the training of parts of the network, while keeping the remaining components constant from initialization amongst all tasks. This technique not only addresses catastrophic forgetting but also task interference, which is crucial in MTL.

Decomposition of filters and tensors within CNNs has been explored in the literature. In particular, filter-wise decompositions into a product of low-rank filters [20], filter groups [47], a basis of filter groups [30], etc. have been utilized. In contrast, tensor-wise examples include SVD decomposition [9, 64], CP-decomposition [28], Tucker decomposition [22], Tensor-Train decomposition [45], Tensor-Ring decomposition [68], T-Basis [44], etc. These techniques have been successfully used for compressing neural networks or reducing their inference time. Instead, in this paper, we utilize decomposition differently: we decompose each convolutional operation into two components, a shared and a task-specific part. Note that although we utilize the SVD decomposition for simplicity, the same principles hold for other decomposition types too.

3 Reparameterizing CNNs for Multi-Task Learning

In this section, we present techniques to adapt a CNN architecture such that it can incrementally learn new tasks in an MTL setting, while scaling more efficiently than simply adding single-task models. Section 3.1 introduces the problem formulation. Section 3.2 demonstrates the effect of task interference in MTL and motivates the importance of CNN reparameterization. Section 3.3 presents techniques to reparameterize CNNs and limit the parameter increase with respect to task-specific models.

(a) Standard Conv.
(b) RC without NFF
(c) RC with NFF
Figure 2: (a) A standard convolutional module for a given task $t$, with task-specific weights $W_t$ in orange. (b) A reparameterized convolution (RC) consisting of a shared filter bank $W_s$ in black, and a task-specific modulator $A_t$ in orange. (c) An RC with Normalized Feature Fusion (NFF), consisting of a shared filter bank $W_s$ in black, and a task-specific modulator in orange. Each row $a_t$ of $A_t$ is reparameterized as $a_t = g_t \, v_t / \lVert v_t \rVert$.

3.1 Problem Formulation

Given $T$ tasks and an input tensor $x$, we aim to learn a function $f(x; W_s, W_t) = y_t$ that holds for each task $t = 1, \dots, T$, where $W_s$ and $W_t$ are the shared and task-specific parameters, respectively. Unlike existing approaches [36, 41], which learn such functions on the layer level of the network, i.e., explicitly designing shared and task-specific layers, we aim to learn $f$ on a block level by reparameterizing the convolutional operation and adapting its behaviour conditioned on the task $t$, as depicted in Fig. 2(b) and Fig. 2(c). By doing so, we can explicitly address the task interference and catastrophic forgetting problems within an MTL setting.

3.2 Task Interference

To motivate the importance of addressing task interference by construction, we analyze the task-specific gradient directions on the shared modules of a state-of-the-art MTL model. Specifically, we utilize the work of [39], who used a discriminator to enforce indistinguishable gradients amongst tasks.

We acquire the gradients from the training dataset of PASCAL-Context [42] for each task, using minibatches of size 128, yielding 40 minibatches. We then use Representation Similarity Analysis (RSA), proposed in [11] for transfer learning, as a means to quantify the correlation of the gradients amongst the different tasks. Fig. 3 depicts the task gradient correlations at different depths of a ResNet-26 model [17], trained to have indistinguishable gradients in the output layer [39]. It can be seen that there is only limited gradient correlation amongst the tasks, demonstrating that addressing task interference indirectly (here, with the use of adversarial learning on the gradients) is a very challenging problem. We instead follow a different direction and propose to utilize reparameterizations with shared components that remain untouched during the training process, with each task able to optimize only its own parameters. As such, task interference is eliminated by construction.
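To make the analysis concrete, the sketch below illustrates the procedure (our own illustration, not the authors' released code; `model`, `shared_layer`, `loss_fn`, and `loader` are hypothetical placeholders): per-task gradients of a shared layer are collected over minibatches, a representational dissimilarity matrix (RDM) is built per task, and the RDMs of task pairs are compared with Spearman correlation.

```python
import numpy as np
import torch
from scipy.stats import spearmanr

def layer_gradients(model, shared_layer, loss_fn, loader, n_batches=40):
    """Collect one flattened gradient vector of a shared layer per minibatch."""
    grads = []
    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(shared_layer.weight.grad.detach().flatten().cpu().numpy().copy())
    return np.stack(grads)  # shape: (n_batches, n_params)

def rsa_score(grads_a, grads_b):
    """RSA between two tasks: Spearman correlation of the upper triangles of
    their RDMs, where RDM = 1 - Pearson correlation between minibatch gradients."""
    rdm_a, rdm_b = 1.0 - np.corrcoef(grads_a), 1.0 - np.corrcoef(grads_b)
    iu = np.triu_indices(rdm_a.shape[0], k=1)
    return spearmanr(rdm_a[iu], rdm_b[iu]).correlation
```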

Figure 3: Visualization of the Representation Similarity Analysis (RSA) on the task-specific gradients at different depths of a ResNet-26 model [39]. The analysis was conducted on: human parts segmentation (Parts), semantic segmentation (SemSeg), saliency estimation (Sal), normals estimation (Normals), and edge detection (Edge).

3.3 Reparameterizing Convolutions

We define a convolutional operation $y = w^\top x$ for the single-task learning setup (Fig. 2(a)). $w \in \mathbb{R}^{ck^2}$ denotes the parameters of a single convolutional layer (we omit the bias to simplify notation) for a kernel size $k \times k$ and $c$ channels. $x \in \mathbb{R}^{ck^2}$ is the input tensor volume at a given spatial location ($w$ and $x$ are expressed in vector notation), and $y$ is the scalar response. Assuming $f$ such filters, the convolutional operator can be rewritten in matrix notation as $y = Wx$, where $W \in \mathbb{R}^{f \times ck^2}$ provides $f$ responses, and $y \in \mathbb{R}^f$. In a single-task setup:

$y_t = W_t x$    (1)

where $W_t$ and $y_t$ are the task-specific parameters and responses for a given convolutional layer, respectively. The total number of parameters for this setup is $T f c k^2$. Our goal is to reparameterize $W_t$ in Eq. 1 as:

$y_t = W_t x = A_t W_s x$    (2)

using a set of shared ($W_s \in \mathbb{R}^{f \times ck^2}$) and task-specific ($A_t \in \mathbb{R}^{f \times f}$) parameters for each convolutional layer of the backbone. Our formulation aims to retain the prediction performance of the original convolutional layer (Eq. 1), while simultaneously reducing the rate at which the total number of parameters grows. The complexity now becomes $f c k^2 + T f^2$, which is less than the $T f c k^2$ of standard layers. We argue that this reparameterization is necessary for coping with task interference and incremental learning in an MTL setup, in which we only optimize the task-specific parameters $A_t$ while keeping the shared parameters $W_s$ intact. Note that, when adding a new task $T{+}1$, we do not need to train the entire network from scratch as in [39]; we only optimize $A_{T+1}$ for each layer of the reparameterized CNN.

We denote our reparameterized convolutional layer as a matrix multiplication between the two sets of parameters: $W_t = A_t W_s$. In order to find a set of parameters $\{A_t, W_s\}$ that approximates the single-task weights $W_t$, a natural choice is to directly minimize the Frobenius norm $\lVert A_t W_s - W_t \rVert_F$. Even though direct minimization of this metric is appealing due to its simplicity, it poses some major caveats. (i) It assumes that all directions in the parameter space affect the final performance for task $t$ in the same way, and thus penalizes them uniformly. However, two different solutions for $A_t W_s$ with the same Frobenius norm can yield drastically different losses. (ii) This approximation is performed independently for each convolutional layer, neglecting the chain effect an inaccurate approximation in one layer can have on the succeeding layers. In the remainder of this section, we propose different techniques to address these limitations.
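For illustration, the naive layer-wise objective above even has a closed-form least-squares solution; the sketch below (our own, under the assumption of a fixed filter bank with full row rank) computes it, and is precisely the kind of approximation the two caveats argue against.

```python
import torch

def frobenius_fit(W_task: torch.Tensor, W_s: torch.Tensor) -> torch.Tensor:
    """Closed-form minimizer of ||A_t @ W_s - W_task||_F for a fixed filter bank.

    W_task: (f, c*k*k) single-task weights; W_s: (f, c*k*k) filter bank.
    Returns A_t of shape (f, f) via the right pseudo-inverse: A_t = W_task W_s^+."""
    return W_task @ torch.linalg.pinv(W_s)
```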

Reparameterized Convolution. We implement the Reparameterized Convolution (RC) as a stack of two 2D convolutional layers without a non-linearity in between, with $W_s$ having a spatial filter size $k$ and $A_t$ being a $1 \times 1$ convolution (Fig. 2(b)). (To ensure compliance with the ImageNet [8] initialization, the new architecture is first pre-trained on ImageNet using the publicly available training script from PyTorch [46].) We optimize only $A_t$ directly on the task-specific loss function using stochastic gradient descent, while keeping the shared weights $W_s$ constant. This ensures that training for one task is independent of the other tasks, ruling out interference amongst tasks while optimizing the metric of interest.
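A minimal PyTorch sketch of this design (ours, not the official RCM implementation) is a frozen $k \times k$ filter bank followed by per-task $1 \times 1$ modulators; incrementally adding a task amounts to appending, and training, one more modulator.

```python
import torch.nn as nn

class RC(nn.Module):
    """Reparameterized convolution: a shared, frozen k x k filter bank (W_s)
    followed by a task-specific 1x1 modulator (A_t), with no non-linearity in between."""
    def __init__(self, in_channels, f, k, num_tasks, stride=1):
        super().__init__()
        self.filter_bank = nn.Conv2d(in_channels, f, k, stride=stride,
                                     padding=k // 2, bias=False)
        self.filter_bank.weight.requires_grad_(False)  # shared part is never trained
        self.modulators = nn.ModuleList(
            nn.Conv2d(f, f, kernel_size=1, bias=False) for _ in range(num_tasks))

    def add_task(self):
        # incremental learning: a new task only adds (and trains) its own modulator
        f = self.filter_bank.out_channels
        self.modulators.append(nn.Conv2d(f, f, kernel_size=1, bias=False))

    def forward(self, x, task: int):
        return self.modulators[task](self.filter_bank(x))
```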

Normalized Feature Fusion. One can view $a_t \in \mathbb{R}^f$, a row in matrix $A_t$, as a soft filter-adaptation mechanism, i.e., a modulator which generates a new task-specific filter from the given filter bank $W_s$, depicted in Fig. 2(b). However, instead of training the vector $a_t$ directly, we propose its reparameterization into two terms, a vector term $v_t \in \mathbb{R}^f$ and a scalar term $g_t$, as:

$a_t = g_t \dfrac{v_t}{\lVert v_t \rVert}$    (3)

where $\lVert \cdot \rVert$ denotes the Euclidean norm. We refer to this reparameterization as Normalized Feature Fusion (NFF), depicted in Fig. 2(c). NFF provides an easier optimization process in comparison to an unconstrained $a_t$: the reparameterization enforces $v_t / \lVert v_t \rVert$ to be of unit length and to point in the direction which best merges the filter bank, while the norm $g_t$ independently learns the appropriate scale of the newly generated filter, and thus the scale of the activation. Directly optimizing $a_t$ attempts to learn both jointly, which is a harder optimization problem. Normalizing weight tensors has been explored before for speeding up the convergence of the optimization process [7, 55, 60]. In our work, we use it differently and demonstrate empirically (see Section 4.5) that such a reparameterization in series with a filter bank also improves performance in the MTL setting. As seen in Eq. 3, additional learnable parameters ($v_t$ and $g_t$) are introduced in the training process; however, $a_t$ can be computed once after training and used directly for deployment, eliminating the additional overhead.
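A sketch of the modulator with NFF (our illustration; it is essentially weight normalization [55] applied to the rows of $A_t$):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NFFModulator(nn.Module):
    """1x1 modulator whose rows follow Eq. 3: a_t = g_t * v_t / ||v_t||."""
    def __init__(self, f):
        super().__init__()
        self.v = nn.Parameter(torch.randn(f, f))  # direction term (one row per filter)
        self.g = nn.Parameter(torch.ones(f, 1))   # scale term (one scalar per filter)

    def forward(self, x):
        A = self.g * F.normalize(self.v, dim=1)     # unit-length rows, rescaled by g
        return F.conv2d(x, A.view(*A.shape, 1, 1))  # applied as a 1x1 convolution
```

After training, $A$ can be computed once and deployed as a plain $1 \times 1$ convolution, so NFF adds no overhead at inference time.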

Response Initialization. We build upon the findings of the matrix/tensor decomposition literature [9, 64] that network weights/responses lie on a low-dimensional subspace. We further assume that such a subspace can be beneficial for multiple tasks, and thus good for network initialization under an MTL setup. To this end, we identify a meaningful subspace of the responses for the generation of a better filter bank when compared to that learned by directly pre-training on ImageNet. More formally, let $y = W_p x$ be the responses for input tensor $x$, where $W_p$ are the pre-trained ImageNet weights. We define $Y \in \mathbb{R}^{f \times N}$ as a matrix containing $N$ responses of $y$ with the mean vector $\bar{y}$ subtracted. We compute the eigen-decomposition of the covariance matrix $Y Y^\top = U S U^\top$ (using Singular Value Decomposition, SVD), where $U$ is an orthogonal matrix with the eigenvectors on the columns, and $S$ is a diagonal matrix of the corresponding eigenvalues. We can now initialize the shared convolution parameters $W_s$ with $U^\top W_p$, and the task-specific $A_t$ with $U$. We refer to this initialization methodology as Response Initialization (RI). We point the reader to the supplementary material for more details.
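The RI computation can be sketched as follows (our illustration; $Y$ would be gathered by forwarding ImageNet images through the pre-trained layer):

```python
import torch

@torch.no_grad()
def response_initialization(W_p: torch.Tensor, Y: torch.Tensor):
    """W_p: (f, c*k*k) ImageNet-pretrained weights; Y: (f, N) collected responses.
    Returns initializations for the filter bank W_s and the modulators A_t."""
    Y = Y - Y.mean(dim=1, keepdim=True)  # subtract the mean response vector
    # SVD of the symmetric PSD covariance equals its eigen-decomposition U S U^T
    U, S, _ = torch.linalg.svd(Y @ Y.T)
    W_s = U.T @ W_p                      # shared filter bank: U^T W_p
    A_t = U.clone()                      # each task's modulator is initialized to U
    return W_s, A_t
```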

4 Experiments

4.1 Datasets

We focus our evaluation on dense prediction tasks, making use of two datasets. We conduct the majority of the experiments on PASCAL [13], and more specifically, PASCAL-Context [42]. We address edge detection (Edge), semantic segmentation (SemSeg), human parts segmentation (Parts), surface normals estimation (Normals), and saliency (Sal). We evaluate single-task performance using optimal dataset F-measure (odsF) [40] for edge detection, mean intersection over union (mIoU) for semantic segmentation, human parts and saliency, and finally mean error (mErr) for surface normals. Labels for human parts segmentation are acquired from [5], while for saliency and surface normals from [39].

We further evaluate the proposed method on the smaller NYUD dataset [57], comprised of indoor scenes, on edge detection (Edge), semantic segmentation (SemSeg), surface normals estimation (Normals), and depth estimation (Depth). The evaluation metrics for edge detection, semantic segmentation, and surface normals estimation are identical to those for PASCAL-Context, while for depth we use the root mean squared error (RMSE).

4.2 Architecture

All of our experiments make use of the DeepLabv3+ architecture [4], originally designed for semantic segmentation, which performs competitively for all tasks of interest, as demonstrated in [39]. DeepLabv3+ encodes multi-scale contextual information by utilizing a ResNet [17] encoder with atrous convolutions [3] and an atrous spatial pyramid pooling (ASPP) module, while a decoder with a skip connection refines the predictions. Unless otherwise stated, we use a ResNet-18 (R-18) based DeepLabv3+ and report the mean performance of five runs for each experiment. (Baseline comparisons to competing methods, as well as additional backbone experiments, can be found in the supplementary material.)

4.3 Evaluation Metric

We follow standard practice [39, 62] and quantify the performance of a model $m$ as the average per-task performance drop with respect to the corresponding single-task baseline $b$:

$\Delta_m = \dfrac{1}{P} \sum_{i=1}^{P} (-1)^{l_i} \, \dfrac{M_{m,i} - M_{b,i}}{M_{b,i}}$    (4)

where $l_i$ is either 1 or 0 if a lower or a greater value, respectively, indicates better performance for the performance measure $M_i$, and $P$ indicates the total number of tasks.
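As a concrete helper (our sketch, with the sign arranged so that a positive value denotes a drop, matching the numbers reported in the tables):

```python
def avg_performance_drop(model_scores, baseline_scores, lower_is_better):
    """Average per-task performance drop (%) of a model w.r.t. single-task baselines.
    lower_is_better[i] is True for measures such as mErr and RMSE."""
    drops = [((m - b) if low else (b - m)) / b * 100.0
             for m, b, low in zip(model_scores, baseline_scores, lower_is_better)]
    return sum(drops) / len(drops)

# e.g., the RCM row of Table 3 on PASCAL-Context:
# avg_performance_drop([71.34, 65.70, 58.12, 13.70, 66.38],
#                      [71.88, 66.22, 59.69, 13.64, 66.62],
#                      [False, False, False, True, False])  # -> ~0.99
```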

4.4 Analysis of network module sharing

We investigate the level of task-specific adaptation required for a common backbone to perform competitively with single-task models, while additionally eliminating negative transfer; in other words, the necessity for task-specific modules, i.e., convolutions (Convs) and batch normalizations (BNs) [19]. Specifically, we optimize task-specific Convs, BNs, or both along the network's depth. Modules that are not being optimized maintain their ImageNet pre-trained parameters. Table 1 presents the effect on performance, while Fig. 4 depicts the total number of parameters with respect to the number of tasks. Experiments vary from common Convs and BNs (Freeze encoder) to task-specific Convs and BNs (Single-task), and anything in-between.
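The variants in Table 1 can be realized by freezing the pre-trained encoder and re-enabling gradients only for selected module types; a minimal sketch (ours):

```python
import torch.nn as nn

def select_trainable(encoder: nn.Module, train_convs: bool, train_bns: bool):
    """Freeze the ImageNet-pretrained encoder, then mark only the chosen
    module types (Convs and/or BNs) as task-specific, i.e., trainable."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    for m in encoder.modules():
        if (isinstance(m, nn.Conv2d) and train_convs) or \
           (isinstance(m, nn.BatchNorm2d) and train_bns):
            for p in m.parameters():
                p.requires_grad_(True)
```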

Figure 4: Backbone parameter scaling. Total number of parameters with respect to the number of tasks for R-18 backbone.

The model utilizing a common backbone pre-trained on ImageNet (Freeze encoder) is, as expected, unable to perform competitively with the single-task counterpart, with a performance drop of 14.98%. Task-specific BNs significantly improve performance, with a drop of 5.76%, at a minimal increase in parameters (Fig. 4). Optimizing the Convs is essential for performance competitive with the single-task setup, with a drop of 0.62%. However, the increase in parameters is then comparable to the single-task setup, which is undesirable (Fig. 4).

Method | Edge (odsF) ↑ | SemSeg (mIoU) ↑ | Parts (mIoU) ↑ | Normals (mErr) ↓ | Sal (mIoU) ↑ | Δ% ↓
Freeze encoder | 67.32 | 60.37 | 47.86 | 17.40 | 58.39 | 14.98
Task-specific BNs | 69.80 | 63.93 | 53.22 | 14.78 | 64.44 | 5.76
Task-specific Convs | 71.72 | 66.00 | 59.05 | 13.78 | 66.31 | 0.62
Single-task | 71.88 | 66.22 | 59.69 | 13.64 | 66.62 | -
Table 1: Performance analysis of task-specific modules. We report the effect that task-specific network modules (Convs and BNs) have on PASCAL-Context performance. Δ% denotes the average per-task performance drop with respect to the single-task baseline (Eq. 4).

4.5 Ablation study

To validate the proposed methodology from Section 3, we conduct an ablation study, presented in Table 2. We additionally report the performance of a model trained jointly on all tasks, consisting of a fully shared encoder and task-specific decoders (Multi-task). This multi-task model is not trained in an IL setup but merely serves as a reference to the traditional multi-tasking techniques. We report a performance drop of 3.32% with respect to the single-task setup.

Method | Edge ↑ | SemSeg ↑ | Parts ↑ | Normals ↓ | Sal ↑ | Δ% ↓
Single-task | 71.88 | 66.22 | 59.69 | 13.64 | 66.62 | -
Multi-task | 70.74 | 62.43 | 57.89 | 14.43 | 66.31 | 3.32
RC | 71.10 | 64.56 | 56.87 | 13.91 | 66.37 | 2.13
RC+NFF | 71.12 | 64.71 | 56.91 | 13.90 | 66.33 | 2.07
RC+RI | 71.36 | 65.58 | 57.99 | 13.70 | 66.21 | 1.12
RC+RI+NFF | 71.34 | 65.70 | 58.12 | 13.70 | 66.38 | 0.99
Table 2: Ablation study of the proposed RCM. We present ablation experiments for the proposed Reparameterized Convolution (RC), Response Initialization (RI), and Normalized Feature Fusion (NFF) on the PASCAL-Context dataset.

Reparameterized Convolution. We first develop a new baseline for our proposed reparameterization, where we replace every convolution with its RC (Section 3.3) counterpart. As seen in Table 2, RC achieves a performance drop of 2.13%, outperforming the 3.32% drop of the multi-task baseline, as well as the Task-specific BNs (Table 1), which achieved a performance drop of 5.76%. This observation corroborates the claim made in Section 4.4 that task-specific adaptation of the convolutions is essential for a model to perform competitively on all tasks. Additionally, we demonstrate that even without training entirely task-specific convolutions, as in Table 1 (Task-specific Convs), a performance boost can still be observed, albeit at a smaller magnitude, while the total number of parameters scales at a slower rate (Fig. 4). RCM in Fig. 4 depicts the parameter scaling of all the RC-based methods introduced in Table 2 and described in this section. As such, improvements in performance over this baseline do not stem from an increase in network capacity.

Response Initialization. We investigate the effect on performance of a more meaningful filter bank, generated with RI (Section 3.3), compared to the filter bank learned by directly pre-training the RC architecture on ImageNet. In Table 2 we report the performance of our proposed model when directly pre-trained on ImageNet (Table 2, RC), and with the RI-based filter bank (Table 2, RC+RI). Compared to the RC model, the performance significantly improves from a 2.13% drop to a 1.12% drop with the RC+RI model. This observation clearly demonstrates that the filter bank generated using our proposed RI approach provides a better weight initialization.

Normalized Feature Fusion. We replace the unconstrained task-specific components of RC with the proposed NFF (Section 3.3). We demonstrate in Table 2 that NFF improves the performance regardless of the initialization of the filter bank: RC improves from a 2.13% drop to 2.07% with RC+NFF, while RC+RI improves from a 1.12% drop to 0.99% with RC+RI+NFF.

The architecture used for the remaining experiments is the Reparameterized Convolution (RC) with Normalized Feature Fusion (NFF), initialized using the Response Initialization (RI) methodology. This architecture is denoted as RCM.

4.6 Comparison to state-of-the-art

Method | Edge ↑ | SemSeg ↑ | Parts ↑ | Normals ↓ | Sal ↑ | Δ% ↓
Single-task | 71.88 | 66.22 | 59.69 | 13.64 | 66.62 | -
ASTMT (R-18 w/o SE) [39] | 71.20 | 64.31 | 57.79 | 15.06 | 66.59 | 3.49
ASTMT (R-26 w SE) [39] | 71.00 | 64.61 | 57.25 | 15.00 | 64.70 | 4.12
Series RA [49] | 70.62 | 65.99 | 55.32 | 14.27 | 66.08 | 2.97
Parallel RA [50] | 70.84 | 66.51 | 56.56 | 14.16 | 66.36 | 2.09
RCM (ours) | 71.34 | 65.70 | 58.12 | 13.70 | 66.38 | 0.99
Table 3: Comparison with state-of-the-art methods on PASCAL-Context.
Method | Edge ↑ | SemSeg ↑ | Normals ↓ | Depth ↓ | Δ% ↓
Single-task | 68.83 | 35.45 | 22.20 | 0.56 | -
ASTMT (R-18 w/o SE) [39] | 68.60 | 30.69 | 23.94 | 0.60 | 6.96
ASTMT (R-26 w SE) [39] | 73.50 | 30.07 | 24.32 | 0.63 | 7.56
Series RA [49] | 67.56 | 31.87 | 23.35 | 0.60 | 5.88
Parallel RA [50] | 68.02 | 32.13 | 23.20 | 0.59 | 5.02
RCM (ours) | 68.44 | 34.20 | 22.41 | 0.57 | 1.48
Table 4: Comparison with state-of-the-art methods on NYUD.
(a) Input image
(b) Semseg
(c) Parts
(d) Edge
(e) Normals
(f) Sal
Figure 5: Feature visualizations. We visualize the features of the input image (a) for the tasks of PASCAL-Context. The first row of each sub-figure corresponds to the responses of the single-task model (ST), the second row those of Parallel RA (Par. RA) [50] and the final row of our proposed method (RCM). For all tasks and depths of the network, the responses of RCM closely resemble those of ST, in contrast to the responses of Par. RA. This is made apparent from the colours utilized by the different methods. The RGB values were identified from a common PCA basis across the three methods in order to highlight similarities and differences between them.

In this work, we focus on comparing to task-conditional methods that can address MTL. We compare the performance of our method to the Series Residual Adapter (Series RA) [49] and Parallel RA [50]. Series and Parallel RAs learn multiple visual domains by optimizing domain-specific residual adaptation modules on an ImageNet pre-trained backbone (rather than using RCM as in our work, Fig. 1(c)). Since both methods were developed for multi-domain settings, we optimize them using our own pipeline, ensuring a fair comparison amongst the methods while additionally benchmarking the capabilities of multi-domain methods in a multi-task setup. We further report the performance of ASTMT [39], which utilizes an architecture resembling that of Parallel RA [50] with Squeeze-and-Excitation (SE) blocks [18] and adversarial task disentanglement of the gradients. Specifically, we report the performance of their models using a ResNet-26 (R-26) DeepLabv3+ with SE, as reported in [39], and additionally optimize, using their codebase, a ResNet-18 model without SE. The latter model uses an architecture resembling more closely that of the other methods, since SE could likewise be incorporated into the others. We report the average performance drop with respect to our single-task baseline.

The results for PASCAL-Context (Table 3) and NYUD (Table 4) demonstrate that our method achieves the best performance, outperforming the other methods that make use of RA modules. This demonstrates that although the RA can perform competitively in multi-domain settings, placing the convolution in series without non-linearity is a more promising direction for the drastic adaptations required for different tasks in a multi-task learning setup.

We visualize in Fig. 5 the learned representations of single-task, Parallel RA [50], and RCM across tasks and network depths. For each task and layer combination, we compute a common PCA basis for the methods above and depict the first three principal components as RGB values. For all tasks and layers of the network, the representations of RCM closely resemble those of the single-task models. Simultaneously, Parallel RA is unable to adapt the convolution behavior to the extent required to be comparable to single-task models.

4.7 Incremental learning for multi-tasking

Method | Edge ↑ | Normals ↓ | SemSeg ↑ | Parts ↑ | Sal ↑ | Δ% ↓
Single-task | 71.88 | 13.64 | 66.22 | 59.69 | 66.62 | -
ASTMT (R-18 w/o SE) [39] | 70.70 | 14.84 | 55.32 | 50.49 | 64.34 | 11.77
Series RA [49] | 70.62 | 14.27 | 65.99 | 55.32 | 66.08 | 2.83
Parallel RA [50] | 70.84 | 14.16 | 66.51 | 56.56 | 66.36 | 1.73
RCM (ours) | 71.34 | 13.70 | 65.70 | 58.12 | 66.38 | 1.26
Table 5: Incremental learning experiments on a network originally trained on the low-level tasks (Edge and Normals) of PASCAL-Context. The first two task columns are the initially learned tasks; Δ% is computed over the incrementally added tasks.
Method | SemSeg ↑ | Parts ↑ | Edge ↑ | Normals ↓ | Sal ↑ | Δ% ↓
Single-task | 66.22 | 59.69 | 71.88 | 13.64 | 66.62 | -
ASTMT (R-18 w/o SE) [39] | 63.91 | 57.33 | 68.67 | 14.12 | 64.43 | 3.76
Series RA [49] | 65.99 | 55.32 | 70.62 | 14.27 | 66.08 | 2.39
Parallel RA [50] | 66.51 | 56.56 | 70.84 | 14.16 | 66.36 | 1.88
RCM (ours) | 65.70 | 58.12 | 71.34 | 13.70 | 66.38 | 0.52
Table 6: Incremental learning experiments on a network originally trained on the high-level tasks (SemSeg and Parts) of PASCAL-Context. The first two task columns are the initially learned tasks; Δ% is computed over the incrementally added tasks.

We further evaluate the methods from Section 4.6 in the incremental learning (IL) setup. In other words, we investigate the capabilities of the models to learn new tasks without the need to be completely retrained on the entire task dictionary. We divide the tasks of PASCAL-Context into three groups, (i) edge detection and surface normals (low-level tasks), (ii) saliency (mid-level task) and (iii) semantic segmentation and human parts segmentation (high-level tasks). IL experiments are conducted by allowing the base network to initially use knowledge from either (i) or (iii), and reporting the capability for the optimized model to learn additional tasks without affecting the performance of the already learned tasks (the performance drop is calculated over the new tasks that were not used in the initial training). In the IL setup, ASTMT [39] is initially trained using an R-18 backbone without SE (a comparable backbone to the competing methods for a fair comparison) on the subset of the tasks (either i or iii). New tasks can be incorporated by training their task-specific modules independently. On the other hand, Series RA, Parallel RA, and RCM, were designed to be inherently incremental due to directly optimizing only the task-specific modules. Consequently, their task-specific performance in the IL setup is identical to that reported in Section 4.6.

In Tables 5 and 6, the first two task columns report the performance on the tasks utilized to generate the initial knowledge of the model (important for ASTMT [39]), while the remaining columns report the performance of the incrementally learned tasks. As shown in both tables, and in particular Table 5, ASTMT does not perform competitively in the IL experiments. This observation further demonstrates the importance of utilizing generic filter banks that can be adapted to task-specific needs, particularly in IL setups. We consider research on generic multi-task filter banks to be a promising direction.

5 Conclusion

We have presented a novel reparameterization of the convolutional operation and its application to training multi-task learning architectures. The reparameterized architectures can be applied to a multitude of different tasks, and allow the CNN to be inherently incremental while additionally eliminating task interference, all by construction. We evaluate our model on two datasets and multiple tasks, and show experimentally that it outperforms competing baselines that address similar challenges. We further demonstrate its efficacy when compared to the state-of-the-art task-conditional multi-task method, which is unable to tackle incremental learning.

Acknowledgments. This work was sponsored by Advertima AG and co-financed by Innosuisse. We thank the anonymous reviewers for their valuable feedback.

Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference
(Supplementary Material)

Menelaos Kanakis David Bruggemann Suman Saha
Stamatios Georgoulis Anton Obukhov Luc Van Gool

Appendix 0.A Implementation Details

We based our implementation details on the work of [39], listed below for completeness.

Generic hyperparameters. All models are optimized using SGD with a learning rate of 0.005, momentum of 0.9, weight decay of 0.0001, and the "poly" learning rate schedule [3]. We use a single GPU with a minibatch of 8 images. The input images during training are augmented with random horizontal flips and random scaling in the range [0.5, 2.0] in increments of 0.25. The validity of these hyperparameters has already been tested in [39], and hence they are used in all our experiments too, in order to ensure fair comparisons amongst the different methods.

Dataset-specific hyperparameters. PASCAL-Context [42] models are trained for 60 epochs; the spatial size of the input images is 512×512. NYUD [57] models are trained for 200 epochs; the spatial size of the input images is 425×560. Images of insufficient size are padded with the mean color.


Task weighting and loss functions. As is common in multi-task learning (MTL), the losses require careful weighting [39, 62, 21, 56], where each loss weight is task-dependent. For edge detection, we optimize the binary cross-entropy (BCE) loss, scaled by 50. Due to the class imbalance between edge and non-edge pixels, edge pixels are weighted by 0.95 and non-edge pixels by 0.05, following [24, 38]. For evaluation, we set the maximum allowed mislocalization of the optimal dataset F-measure (odsF) [40] to 0.0075 and 0.011 for PASCAL-Context and NYUD, respectively, using the package of [48]. Semantic segmentation and human parts segmentation are optimized with the cross-entropy loss, weighted by factors of 1 and 2, respectively. Predictions of the surface normals (normalized to unit vectors) and depth modalities are penalized using the $L_1$ loss, scaled by 10 and 1, respectively. Saliency is optimized using the BCE loss, weighted by a factor of 5.
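A sketch of these task-dependent losses (our illustration of the weights listed above; predictions are assumed to be raw logits where BCE is used):

```python
import torch
import torch.nn.functional as F

def edge_loss(pred, gt):     # BCE scaled by 50; edge pixels weighted 0.95, non-edge 0.05
    w = torch.where(gt > 0.5, torch.full_like(gt, 0.95), torch.full_like(gt, 0.05))
    return 50.0 * F.binary_cross_entropy_with_logits(pred, gt, weight=w)

def semseg_loss(pred, gt):   # cross-entropy, weight 1
    return F.cross_entropy(pred, gt)

def parts_loss(pred, gt):    # cross-entropy, weight 2
    return 2.0 * F.cross_entropy(pred, gt)

def normals_loss(pred, gt):  # L1 on unit-normalized predictions, weight 10
    return 10.0 * F.l1_loss(F.normalize(pred, dim=1), gt)

def depth_loss(pred, gt):    # L1, weight 1
    return F.l1_loss(pred, gt)

def sal_loss(pred, gt):      # BCE, weight 5
    return 5.0 * F.binary_cross_entropy_with_logits(pred, gt)
```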

Appendix 0.B Reparameterization Details

In Section 3.3 of the main text (Response Initialization, RI), we introduced the methodology for the generation of a better filter bank when compared to that learned by directly pre-training on ImageNet, and demonstrated improved performance when utilizing RI in Section 4. In this section, we present additional details.

Recall that we defined the responses of a convolutional layer as $y = W_p x$ for an input tensor $x$, where $W_p$ are the pre-trained ImageNet weights. We specify $Y \in \mathbb{R}^{f \times N}$ as a matrix containing $N$ responses of $y$ with the mean vector $\bar{y}$ subtracted. To generate the new filter bank, we first compute the eigen-decomposition of the covariance matrix $Y Y^\top = U S U^\top$ (using Singular Value Decomposition, SVD), where $U$ is an orthogonal matrix with the eigenvectors on the columns, and $S$ is a diagonal matrix of the corresponding eigenvalues. We can now utilize $U$, which acts as a method to project to ($U^\top$) and from ($U$) a latent space. Thus, we can rewrite $y = U (U^\top (y - \bar{y})) + \bar{y}$, with the centering operation being of importance due to the space being generated from centred responses. This gives rise to

$y_t = A_t \left( U^\top (W_p x - \bar{y}) \right) + \bar{y}$    (5)

where $A_t$, initialized by $U$, represents the task-specific parameters optimized independently for each task $t$, and is implemented as a $1 \times 1$ convolution. The non-trainable shared parameters are defined as $W_s = U^\top W_p$ and implemented as a $k \times k$ convolution, with $k$ being the filter size of $W_p$. The bias terms ($-U^\top \bar{y}$ and $\bar{y}$) can be added to the running means of the batchnorm layers following the respective convolutions [19].
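Folding such a constant bias into the subsequent batchnorm can be sketched as follows (ours): since BN computes $(z - \mu)/\sigma$, feeding $z$ into a BN whose running mean is shifted to $\mu - b$ is equivalent to feeding $z + b$ into the original BN.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bias_into_bn(bn: nn.BatchNorm2d, bias: torch.Tensor):
    """Emulate adding a constant per-channel bias b (e.g., -U^T y_bar) to the
    convolution responses: (z - (mu - b)) / sigma == ((z + b) - mu) / sigma."""
    bn.running_mean -= bias
```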

Appendix 0.C Baseline

Method | Edge ↑ | SemSeg ↑ | Parts ↑ | Normals ↓ | Sal ↑
ASTMT [39] | 70.30 | 63.90 | 55.90 | 15.10 | 63.90
MTI-Net [62] | 68.20 | 64.49 | 57.43 | 14.77 | 66.38
Ours | 71.88 | 66.22 | 59.69 | 13.64 | 66.62
Table 7: Single-task baseline comparison. We report the single-task performance of the baseline implementations of [39, 62] for similar architectures on PASCAL-Context. The arrows indicate the direction of better performance.

To ensure our re-implementation provides a stable baseline, Table 7 compares the single-task performance of our implementation, which uses a ResNet-18 based DeepLabv3+, against the results from [62], using a ResNet-18 based FPN [32], and the results from [39], who utilized a ResNet-26 based DeepLabv3+. Our single-task baseline outperforms both works on every task; even though the numbers are not directly comparable due to minor implementation differences, this verifies the strength of our baseline.

Appendix 0.D Additional Backbone Experiments

We additionally compare the proposed RCM (Reparameterized Convolutions for Multi-task learning) with respect to the single-task performance on DeepLabv3+ with the deeper ResNet-34 (R-34) [17] backbone. Results for PASCAL-Context [42] and NYUD [57] can be seen in Table 8 and Table 9, respectively. The percentage drops of 1.18% and 1.77% for PASCAL-Context and NYUD, respectively, are comparable to those of the ResNet-18 backbone reported in the main paper.

Method | Edge ↑ | SemSeg ↑ | Parts ↑ | Normals ↓ | Sal ↑ | Δ% ↓
Single-task | 73.63 | 69.34 | 62.96 | 13.39 | 67.49 | -
RCM (ours) | 72.87 | 69.11 | 61.41 | 13.71 | 67.69 | 1.18
Table 8: Comparison with the single-task baseline on PASCAL-Context for a DeepLabv3+ with an R-34 backbone.
Method | Edge ↑ | SemSeg ↑ | Normals ↓ | Depth ↓ | Δ% ↓
Single-task | 70.13 | 37.39 | 21.47 | 0.54 | -
RCM (ours) | 69.50 | 36.19 | 21.70 | 0.55 | 1.77
Table 9: Comparison with the single-task baseline on NYUD for a DeepLabv3+ with an R-34 backbone.

References

  • [1] F. J. Bragman, R. Tanno, S. Ourselin, D. C. Alexander, and J. Cardoso (2019) Stochastic filter groups for multi-task cnns: learning specialist and generalist convolution kernels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1385–1394. Cited by: §1, §2.
  • [2] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §1, §2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: Appendix 0.A, §1, §4.2.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §4.2.
  • [5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978. Cited by: §4.1.
  • [6] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2017) Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257. Cited by: §1, §2.
  • [7] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 933–941. Cited by: §3.3.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: footnote 2.
  • [9] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §2, §3.3.
  • [10] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: §1, §2.
  • [11] K. Dwivedi and G. Roig (2019) Representation similarity analysis for efficient task taxonomy & transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12387–12396. Cited by: §3.2.
  • [12] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §1.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.1.
  • [14] R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §2.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • [16] M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §1, §2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix 0.D, §1, §3.2, §4.2.
  • [18] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §4.6.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: Appendix 0.B, §4.4.
  • [20] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §2.
  • [21] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491. Cited by: Appendix 0.A, §1, §2.
  • [22] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §2.
  • [23] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2.
  • [24] I. Kokkinos (2015) Pushing the boundaries of boundary detection using deep learning. arXiv preprint arXiv:1511.07386. Cited by: Appendix 0.A.
  • [25] I. Kokkinos (2017) Ubernet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6129–6138. Cited by: §1, §2.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [27] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pp. 239–248. Cited by: §1.
  • [28] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §2.
  • [29] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in neural information processing systems, pp. 4652–4662. Cited by: §2.
  • [30] Y. Li, S. Gu, L. V. Gool, and R. Timofte (2019) Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5623–5632. Cited by: §2.
  • [31] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §2.
  • [32] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: Appendix 0.C.
  • [33] S. Liu, E. Johns, and A. J. Davison (2019) End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1871–1880. Cited by: §1, §2.
  • [34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • [35] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • [36] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. Feris (2017) Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5334–5343. Cited by: §1, §2, §3.1.
  • [37] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–82. Cited by: §2.
  • [38] K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool (2017) Convolutional oriented boundaries: from image segmentation to high-level tasks. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 819–833. Cited by: Appendix 0.A.
  • [39] K. Maninis, I. Radosavovic, and I. Kokkinos (2019) Attentive single-tasking of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1860. Cited by: Appendix 0.A, Appendix 0.A, Appendix 0.A, Table 7, Appendix 0.C, §1, §1, Figure 3, §3.2, §3.2, §3.3, §4.1, §4.2, §4.3, §4.6, §4.7, §4.7, Table 3, Table 4, Table 5, Table 6.
  • [40] D. R. Martin, C. C. Fowlkes, and J. Malik (2004) Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE transactions on pattern analysis and machine intelligence 26 (5), pp. 530–549. Cited by: Appendix 0.A, §4.1.
  • [41] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert (2016) Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §1, §2, §3.1.
  • [42] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. Cited by: Appendix 0.A, Appendix 0.D, §3.2, §4.1.
  • [43] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool (2017) Fast scene understanding for autonomous driving. arXiv preprint arXiv:1708.02550. Cited by: §1, §2.
  • [44] A. Obukhov, M. Rakhuba, S. Georgoulis, M. Kanakis, D. Dai, and L. Van Gool (2020) T-basis: a compact representation for neural networks. In Proceedings of Machine Learning and Systems 2020, pp. 8889–8901. Cited by: §2.
  • [45] I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §2.
  • [46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: footnote 2.
  • [47] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu (2018) Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316. Cited by: §2.
  • [48] J. Pont-Tuset and F. Marques (2015) Supervised evaluation of image segmentation and object proposal techniques. IEEE transactions on pattern analysis and machine intelligence 38 (7), pp. 1465–1478. Cited by: Appendix 0.A.
  • [49] S. Rebuffi, H. Bilen, and A. Vedaldi (2017) Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pp. 506–516. Cited by: §1, §2, §4.6, Table 3, Table 4, Table 5, Table 6.
  • [50] S. Rebuffi, H. Bilen, and A. Vedaldi (2018) Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8119–8127. Cited by: §1, §2, Figure 5, §4.6, §4.6, Table 3, Table 4, Table 5, Table 6.
  • [51] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.
  • [52] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [53] A. Rosenfeld and J. K. Tsotsos (2018) Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [54] S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §1, §2.
  • [55] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in neural information processing systems, pp. 901–909. Cited by: §3.3.
  • [56] O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527–538. Cited by: Appendix 0.A, §1, §2.
  • [57] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746–760. Cited by: Appendix 0.A, Appendix 0.D, §4.1.
  • [58] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §1.
  • [59] A. Sinha, Z. Chen, V. Badrinarayanan, and A. Rabinovich (2018) Gradient adversarial training of neural networks. arXiv preprint arXiv:1806.08028. Cited by: §1, §2.
  • [60] N. Srebro and A. Shraibman (2005) Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pp. 545–560. Cited by: §3.3.
  • [61] S. Vandenhende, S. Georgoulis, B. De Brabandere, and L. Van Gool (2019) Branched multi-task networks: deciding what layers to share. arXiv preprint arXiv:1904.02920. Cited by: §1, §2.
  • [62] S. Vandenhende, S. Georgoulis, and L. Van Gool (2020) MTI-net: multi-scale task interaction networks for multi-task learning. arXiv preprint arXiv:2001.06902. Cited by: Appendix 0.A, Table 7, Appendix 0.C, §1, §2, §4.3.
  • [63] D. Xu, W. Ouyang, X. Wang, and N. Sebe (2018) Pad-net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 675–684. Cited by: §1, §2.
  • [64] X. Zhang, J. Zou, K. He, and J. Sun (2015) Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1943–1955. Cited by: §2, §3.3.
  • [65] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang (2018) Joint task-recursive learning for semantic segmentation and depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 235–251. Cited by: §1, §2.
  • [66] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang (2019) Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4106–4115. Cited by: §1, §2.
  • [67] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1.
  • [68] Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki (2016) Tensor ring decomposition. arXiv preprint arXiv:1606.05535. Cited by: §2.
  • [69] X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu (2018) A modulation module for multi-task learning with applications in image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 401–416. Cited by: §1, §2.