Maximum Roaming Multi-Task Learning

Lucas Pascal et al. · 06/17/2020

Multi-task learning has gained popularity due to the advantages it provides with respect to resource usage and performance. Nonetheless, the joint optimization of parameters with respect to multiple tasks remains an active research topic. Sub-partitioning the parameters between different tasks has proven to be an efficient way to relax the optimization constraints over the shared weights, whether the partitions are disjoint or overlapping. However, one drawback of this approach is that it can weaken the inductive bias generally set up by the joint task optimization. In this work, we present a novel way to partition the parameter space without weakening the inductive bias. Specifically, we propose Maximum Roaming, a method inspired by dropout that randomly varies the parameter partitioning, forcing the parameters to visit as many tasks as possible at a regulated frequency, so that the network fully adapts to each update. We study the properties of our method through experiments on a variety of visual multi-task data sets. Experimental results suggest that the regularization brought by roaming has more impact on performance than usual partitioning optimization strategies. The overall method is flexible, easily applicable, provides superior regularization and consistently achieves improved performance compared to recent multi-task learning formulations.

1 Introduction

Multi-task learning (MTL) consists in learning different tasks jointly, rather than treating them individually, to improve generalization performance. This is done by training the tasks together over a shared representation (caruana_multitask_1997). This approach has gained much popularity in recent years with the breakthrough of deep networks in many vision tasks. Deep networks are quite demanding in terms of data, memory and speed, which makes sharing strategies between tasks attractive.

MTL exploits the plurality of the domain-specific information contained in training signals issued from different related tasks. This plurality of signals serves as an inductive bias (baxter_model_2000) and has a regularizing effect during training, similar to the one observed in transfer learning (yosinski_how_2014). This allows us to build task-specific models that generalize better within their specific domains. However, the plurality of tasks optimizing the same set of parameters can lead to cases where the improvement imposed by one task comes to the detriment of another. This phenomenon is called task interference, and can be explained by the fact that different tasks need a certain degree of specificity in their representation to avoid under-fitting.

To address this problem, several works have proposed to enlarge deep networks with task-specific parameters (gao_nddr-cnn_2019; he_mask_2017; kokkinos_ubernet_2017; liu_end--end_2019; lu_fully-adaptive_2017; misra_cross-stitch_2016; mordan_revisiting_2018), giving tasks more room for specialization and thus achieving better results. Other works adopt architectural adaptations to fit a specific set of tasks (xu_pad-net_2018; zhang_learning_2018; zhang_pattern-affinitive_2019; vandenhende_mti-net_2020). These approaches, however, do not solve the problem of task interference in the shared portions of the networks. Furthermore, they generally do not scale well with the number of tasks. A more recent stream of works addresses task interference by constructing task-specific partitionings of the parameters (bragman_stochastic_2019; maninis_attentive_2019; strezoski_many_2019), allowing a given parameter to be constrained by fewer tasks. As such, these methods sacrifice inductive bias to better handle the problem of task interference.

In this work, we introduce Maximum Roaming, a dynamic partitioning scheme that sequentially creates the inductive bias while keeping task interference under control. Inspired by the dropout technique (srivastava_dropout_2014), our method allows each parameter to roam across several task-specific sub-networks, giving it the ability to learn from a maximum number of tasks and to build representations more robust to variations in the input domain. It can therefore be considered as a regularization method in the context of multi-task learning. Unlike other recent partitioning methods that aim at optimizing (bragman_stochastic_2019; maninis_attentive_2019) or fixing (strezoski_many_2019) a specific partitioning, ours favours a continuous, random re-assignment of parameters to tasks, allowing each parameter to learn from every task. Experimental results show consistent improvements over state-of-the-art methods.

The remainder of this document is organized as follows. Section 2 discusses related work. Section 3 sets out preliminary elements and notations before Maximum Roaming is detailed in Section 4. Extensive experiments are conducted in Section 5 to, first, study the properties of the proposed method and, second, demonstrate its superior performance with respect to other state-of-the-art MTL approaches. Finally, conclusions and perspectives are discussed in Section 6.

2 Related Work

Several prior works have pointed out the problems incurred by task interference in multi-task learning (chen_gradnorm_2018; kendall_multi-task_2018; liu_end--end_2019; maninis_attentive_2019; sener_multi-task_2018; strezoski_many_2019). We refer here to the three main categories of methods.

Loss weighting. A common countermeasure to task interference is to correctly balance the influence of the different task losses in the main optimization objective, usually a weighted sum of the task losses. The goal is to prevent the variations of one task objective from being absorbed by other task objectives of higher magnitude. In (kendall_multi-task_2018), each task loss coefficient is expressed as a function of some task-dependent uncertainty to make it trainable. In (liu_end--end_2019), these coefficients are modulated according to the rate of loss change for each task. GradNorm (chen_gradnorm_2018) adjusts the weights to control the gradient norms with respect to the learning dynamics of the tasks. More recently, (sinha_gradient_2018) proposed a similar scheme using adversarial training. These methods, however, do not aim at addressing task interference: their main goal is to let each task objective weigh more or less in the main objective according to its learning dynamics. Maximum Roaming, instead, is explicitly designed to control task interference during optimization.

Multi-objective optimization. Other works have formulated multi-task learning as a multi-objective optimization problem. Under this formulation, (sener_multi-task_2018) proposed MGDA-UB, a multi-gradient descent algorithm (desideri_multiple-gradient_2012) addressing task interference as the problem of optimizing multiple conflicting objectives. MGDA-UB learns a scaling factor for each task gradient to avoid conflicts. This has been extended by (lin_pareto_2019) to obtain a set of solutions with different trade-offs among tasks. These methods ensure, under reasonable assumptions, convergence to a Pareto-optimal solution, from which no task can be improved without deteriorating another. They keep the parameters in a fully shared configuration and try to determine a consensual update direction at every iteration, assuming such a direction exists. In cases with strongly interfering tasks, this can lead to stagnation of the parameters. Our method avoids this stagnation by reducing the amount of task interference and by applying discrete updates in the parameter space, which ensures a broader exploration of the latter.

Parameter partitioning. Attention mechanisms are often used in vision tasks to make a network focus on different feature map regions (liu_end--end_2019). Recently, some works have shown that these mechanisms can be used at the convolutional filter level, allowing each task to select, i.e. partition, a subset of parameters to use at every layer. The more selective the partitioning, the fewer tasks are likely to use a given parameter, which reduces task interference. The authors of (strezoski_many_2019) randomly initialize hard binary task partitions with a hyper-parameter controlling their selectivity. (bragman_stochastic_2019) sets up task-specific binary partitions along with a shared one, and trains them using a Gumbel-Softmax distribution (maddison_concrete_2017; jang_categorical_2017) to avoid the discontinuities created by binary assignments. Finally, (maninis_attentive_2019) uses task-specific Squeeze and Excitation (SE) modules (hu_squeeze-and-excitation_2018) to optimize soft parameter partitions. Despite the promising results, these methods may reduce the inductive bias usually produced by the plurality of tasks: (strezoski_many_2019) uses a rigid partitioning, assigning each parameter to a fixed subset of tasks, whereas (bragman_stochastic_2019) and (maninis_attentive_2019) focus on obtaining an optimal partitioning without taking into account the contribution of each task to the learning process of each parameter. Our work addresses this issue by pushing each parameter to learn sequentially from every task.

3 Preliminaries

Let us define a training set $\mathcal{D}=\{(x_i, y_i^1, \dots, y_i^T)\}_{i=1}^{N}$, where $T$ is the number of tasks and $N$ the number of data points. The set $\mathcal{D}$ is used to learn the $T$ tasks with a standard shared convolutional network of depth $L$, whose final prediction layer is different for each task $t$. Under this setup, we refer to the convolutional filters of the network as parameters. We denote $K_l$ the number of parameters of layer $l$ and use $j$ to index them. Finally, $K = \max_l K_l$ represents the maximum number of parameters contained by a network layer.

In standard MTL, with fully shared parameters, the output of layer $l$ for task $t$ is computed as:

$$x^{l+1} = \phi\left(W^l \ast x^l\right), \qquad (1)$$

where $\phi$ is a non-linear function (e.g. ReLU), $x^l$ a hidden input, and $W^l$ the convolutional kernel composed of the $K_l$ parameters of layer $l$.

3.1 Parameter Partitioning

Let us now introduce $\mathcal{M}^l \in \{0,1\}^{K_l \times T}$, the binary parameter partitioning matrix, with $m^l_t$ a column vector associated to task $t$ in the $l$-th layer, and $m^l_{j,t}$ the element of such vector associated to the $j$-th parameter. As $m^l_t$ selects a subset of parameters for every task $t$, the output of layer $l$ for task $t$ (Eq. 1) is now computed as:

$$x_t^{l+1} = \phi\left(\left(m_t^l \odot W^l\right) \ast x_t^l\right), \qquad (2)$$

where $\odot$ denotes the filter-wise product between the binary mask and the convolutional kernel. This notation is consistent with formalizations of dropout (e.g. (gomez_learning_2019)). By introducing $\mathcal{M}^l$, the hidden inputs are now also task-dependent: each task requires an independent forward pass, as in (maninis_attentive_2019; strezoski_many_2019). In other words, given a training point $x_i$, for each task $t$ we compute an independent forward pass and then back-propagate the associated task-specific loss $\mathcal{L}_t$. Each parameter $j$ receives independent training gradient signals from the tasks using it, i.e. the tasks $t$ with $m^l_{j,t} = 1$. If the parameter is not used by task $t$, i.e. $m^l_{j,t} = 0$, the training gradient signal received from that task amounts to zero.

For the sake of simplicity in the notation, and without loss of generality, in the remainder of this document we omit the index $l$ indicating a given layer.
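To make the mechanism concrete, here is a minimal PyTorch sketch of the per-task masked forward pass of Eq. 2. It is an illustration under our notation, not the authors' implementation; the class name PartitionedConv2d and its arguments are ours.

```python
# Minimal PyTorch sketch of Eq. 2: each task t selects a subset of the filters
# of a shared convolution through the binary column m[:, t] of the
# partitioning matrix. Illustrative only; names are not from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartitionedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, n_tasks, kernel_size=3, p=0.6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        # Binary partitioning matrix M (out_ch filters x n_tasks), Bernoulli(p) init.
        self.register_buffer("mask", (torch.rand(out_ch, n_tasks) < p).float())

    def forward(self, x, task):
        h = self.conv(x)                          # shared convolution W * x
        m = self.mask[:, task].view(1, -1, 1, 1)  # column m_t, broadcast over channels
        return F.relu(h * m)                      # keep only the filters assigned to `task`

# One independent forward pass per task, as described above.
layer = PartitionedConv2d(3, 64, n_tasks=8)
x = torch.randn(4, 3, 64, 64)
outputs = [layer(x, task=t) for t in range(8)]
```

Masking the output channels after the shared convolution is equivalent to masking the corresponding filters, since each output channel is produced by exactly one filter.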

3.2 Parameter Partitioning Initialization

Every element of $\mathcal{M}$ follows a Bernoulli distribution of parameter $p$:

$$m_{j,t} \sim \mathrm{Bernoulli}(p), \quad \forall j \in \{1,\dots,K\},\; \forall t \in \{1,\dots,T\}.$$

We denote $p$ the sharing ratio (strezoski_many_2019). We use the same value for every layer of the network. The sharing ratio controls the overlap between task partitions, i.e. the number of different gradient signals a given parameter $j$ will receive through training. Reducing the number of training gradient signals reduces task interference, by reducing the probability of having conflicting signals, and eases optimization. However, reducing the number of task gradient signals received by $j$ also reduces the amount and the quality of the inductive bias that different task gradient signals provide, which is one of the main motivations and benefits of multi-task learning (caruana_multitask_1997).

To guarantee the use of the full capacity of the network, we impose

$$\sum_{t=1}^{T} m_{j,t} \geq 1, \quad \forall j \in \{1,\dots,K\}. \qquad (3)$$

Parameters not satisfying this constraint are attributed to a unique, uniformly sampled task. The case $p = 0$ thus corresponds to a fully disjoint parameter partitioning, i.e. each parameter is used by exactly one task, whereas $p = 1$ is a fully shared network, i.e. $m_{j,t} = 1$ for all $j$ and $t$, equivalent to Eq. 1.
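As an illustration, the following is a minimal sketch (the helper name init_partition is ours, not from the paper's code) of this initialization: Bernoulli($p$) masks, followed by the correction enforcing constraint (3).

```python
# Minimal sketch of the partition initialization of Sec. 3.2: Bernoulli(p)
# binary masks, then every filter used by no task is attributed to a single
# uniformly sampled task so that constraint (3) holds. Illustrative only.
import torch

def init_partition(n_filters: int, n_tasks: int, p: float) -> torch.Tensor:
    m = (torch.rand(n_filters, n_tasks) < p).float()         # m[j, t] ~ Bernoulli(p)
    unused = (m.sum(dim=1) == 0).nonzero(as_tuple=True)[0]   # rows violating (3)
    if len(unused) > 0:
        t = torch.randint(n_tasks, (len(unused),))
        m[unused, t] = 1.0                                   # assign to one random task
    return m
```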

Following a strategy similar to dropout (srivastava_dropout_2014), which forces parameters to successively learn efficient representations in many different randomly sampled sub-networks, we aim to make every parameter learn from every possible task by regularly updating the parameter partitioning $\mathcal{M}$, i.e. to make parameters roam among tasks to sequentially build the inductive bias, while still taking advantage of the "simpler" optimization setup regulated by $p$. For this we introduce Maximum Roaming Multi-Task Learning, a learning strategy consisting of two core elements: 1) a parameter partitioning update plan that establishes how to introduce changes in $\mathcal{M}$, and 2) a parameter selection process to identify the elements of $\mathcal{M}$ to be modified.

4 Maximum Roaming Multi-Task Learning

In this section we formalize the core of our contribution. We start with an assumption that relaxes what can be considered as inductive bias.

Assumption 1.

The benefits of the inductive bias provided by the simultaneous optimization of parameters with respect to several tasks can be obtained by a sequential optimization with respect to different subgroups of these tasks.

This assumption is in line with (yosinski_how_2014), where the authors state that initializing the parameters with transferred weights can improve generalization performance, and with other works showing the performance gain achieved by inductive transfer (see (he_mask_2017; singh_transfer_nodate; tajbakhsh_convolutional_2016; zamir_taskonomy_2018)).

Assumption 1 allows us to introduce the concept of evolution in time of the parameter partitioning $\mathcal{M}$, by indexing it over time as $\mathcal{M}^k$, where $k$ indexes update time-steps and $\mathcal{M}^0$ is the partitioning initialization from Section 3.2. At every step $k$, the values of $\mathcal{M}^{k-1}$ are updated, under constraint (3), allowing parameters to roam across the different tasks.

Definition 1.

Let $\mathcal{A}_t^k = \{j : m^k_{j,t} = 1\}$ be the set of parameter indices used by task $t$ at update step $k$, and $\bar{\mathcal{A}}_t^k = \bigcup_{k' \leq k} \mathcal{A}_t^{k'}$ the set of parameter indices that have been visited by $t$ at least once after $k$ update steps. At step $k$, the binary parameter partitioning matrix is updated, for each task $t$, according to the following update rules:

$$m^k_{j,t} = \begin{cases} 1 & \text{if } j = j^{\,k}_{\text{in}}, \\ 0 & \text{if } j = j^{\,k}_{\text{out}}, \\ m^{k-1}_{j,t} & \text{otherwise,} \end{cases} \qquad (4)$$

where the incoming candidate $j^{\,k}_{\text{in}}$ is taken among the parameters not yet visited by $t$, i.e. $j^{\,k}_{\text{in}} \in \{1,\dots,K\} \setminus \bar{\mathcal{A}}_t^{k-1}$, and the outgoing candidate $j^{\,k}_{\text{out}}$ is taken from the current partition, i.e. $j^{\,k}_{\text{out}} \in \mathcal{A}_t^{k-1}$.

The frequency at which $\mathcal{M}$ is updated is governed by $\Delta$, where $\Delta \in \mathbb{N}^{*}$ denotes a number of training epochs. This lets parameters learn from a fixed partitioning over $\Delta$ training epochs in a given partitioning configuration. $\Delta$ has to be significantly large (we express it in terms of training epochs) so that the network can fully adapt to each new configuration, while a too low value could reintroduce task interference by alternating different task signals on the parameters too frequently. Considering that we apply discrete updates in the parameter space, which has an impact on model performance, we only exchange one parameter per task partition at each update step to minimize the short-term impact.
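The sketch below illustrates one such update step under the reconstruction of Def. 1 above (again illustrative, not the authors' code); mask and visited are assumed per-layer buffers, with visited initialized as a copy of the initial mask.

```python
# Minimal sketch of one Maximum Roaming update step (Def. 1): for each task,
# one not-yet-visited filter j_in enters the partition and one currently used
# filter j_out leaves it, both sampled uniformly. `mask` and `visited` are
# (n_filters x n_tasks) binary tensors; names are illustrative.
import torch

def roaming_update(mask: torch.Tensor, visited: torch.Tensor) -> None:
    n_filters, n_tasks = mask.shape
    for t in range(n_tasks):
        cand_in = (visited[:, t] == 0).nonzero(as_tuple=True)[0]
        cand_out = (mask[:, t] == 1).nonzero(as_tuple=True)[0]
        if len(cand_in) == 0:                 # update plan finished for this task
            continue
        j_in = cand_in[torch.randint(len(cand_in), (1,))]
        j_out = cand_out[torch.randint(len(cand_out), (1,))]
        mask[j_out, t] = 0.0
        mask[j_in, t] = 1.0
        visited[j_in, t] = 1.0

# Called every `delta` training epochs, e.g.:
#   if epoch > 0 and epoch % delta == 0:
#       roaming_update(layer.mask, layer.visited)
```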

Lemma 1.

Any update plan as in Def. 1, with update frequency $\Delta$, has the following properties:

  1. The update plan finishes in at most $\lceil (1-p)K \rceil \Delta$ training epochs.

  2. At completion, every parameter has been trained by each task for at least $\Delta$ training epochs.

  3. The number of parameters attributed to each task remains constant over the whole duration of the update plan.

Proof: The first property comes from the fact that $\bar{\mathcal{A}}_t^k$ grows by one at every step $k$, until all parameters of a given layer are included and no new $j_{\text{in}}$ can be sampled. At initialization, $|\bar{\mathcal{A}}_t^0| \approx pK_l$ in expectation, and it increases by one every $\Delta$ training epochs, which gives the indicated result, upper bounded by the layer containing the most parameters. The proof of the second property is straightforward, since each new parameter partition remains frozen for at least $\Delta$ training epochs. The third property is also straightforward, since every update consists in the exchange of the parameters $j_{\text{in}}$ and $j_{\text{out}}$.

Definition 1 requires selecting the update candidate parameters $j_{\text{in}}$ and $j_{\text{out}}$ from their respective subsets (Eq. 4). We select both under a uniform distribution (without replacement), a lightweight solution that guarantees a constant overlap between the parameter partitions of the different tasks.

Lemma 2.

The overlap between parameter partitions of different tasks remains constant, on average, when the candidate parameters $j_{\text{in}}$ and $j_{\text{out}}$, at every update step $k$, are sampled without replacement under a uniform distribution from $\{1,\dots,K\} \setminus \bar{\mathcal{A}}_t^{k-1}$ and $\mathcal{A}_t^{k-1}$, respectively.

Proof: We prove by induction that $P(m^k_{j,t} = 1)$ remains constant over $k$, for all $j$ and $t$, which ensures a constant expected overlap between the parameter partitions of the different tasks. The detailed proof is provided in Appendix A.

We now formulate the probability of a parameter $j$ having been used by task $t$ after $k$ update steps as:

$$P\!\left(j \in \bar{\mathcal{A}}_t^{k}\right) = p + (1-p)\,\mu, \qquad (5)$$

where

$$\mu = \frac{k}{\lceil (1-p)K \rceil}, \qquad 0 \leq k \leq \lceil (1-p)K \rceil, \qquad (6)$$

is the update ratio, which indicates the completion rate of the update process within a layer. The condition $k \leq \lceil (1-p)K \rceil$ refers to the fact that there cannot be more updates than the number of available parameters; it is also a necessary condition for $P(j \in \bar{\mathcal{A}}_t^{k}) \leq 1$. The increase of this probability represents the increase in the number of visited tasks for a given parameter, which is what creates the inductive bias, following Assumption 1.
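As a quick numerical illustration of the reconstructed form of Eqs. 5 and 6 above (the numbers are ours, chosen only for illustration):

```latex
% Illustrative computation under the form of Eqs. (5)-(6) given above.
\[
p = 0.5,\quad K = 512
\;\Rightarrow\;
\lceil (1-p)K \rceil = 256 \text{ possible update steps per task.}
\]
\[
\text{After } k = 128 \text{ steps:}\qquad
\mu = \frac{128}{256} = 0.5,
\qquad
P\!\left(j \in \bar{\mathcal{A}}_t^{k}\right) = 0.5 + 0.5 \times 0.5 = 0.75 .
\]
```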

We formalize the benefits of Maximum Roaming in the following theorem:

Theorem 1.

Starting from a random binary parameter partitioning controlled by the sharing ratio $p$, Maximum Roaming maximizes the inductive bias across tasks while controlling task interference.

Proof: Under Assumption 1, the inductive bias is correlated with the average number of tasks having optimized any given parameter, which is expressed by Eq. 5. This probability is maximized as the number of updates $k$ increases, compensating the initial loss imposed by $p$. The control over task interference is guaranteed by Lemma 2.

5 Experiments

This section first describes the datasets (Sec. 5.1) and the baselines used for comparison (Sec. 5.2). We then evaluate the presented Maximum Roaming MTL method on several problems. First, we study its properties: the effect of the sharing ratio $p$, the impact of the interval $\Delta$ between two updates and of the completion rate $\mu$ of the update process, and the importance of a random selection process for the parameters to update (Sec. 5.3). Finally, Section 5.4 presents a benchmark comparing our approach with the baseline methods. All code, data and experiments are available at GITHUB URL.

5.1 Datasets

We use three publicly available datasets in our experiments:

Celeb-A. We use the official release (http://personal.ie.cuhk.edu.hk/~lz013/projects/FaceAttributes.html) of the Celeb-A dataset (liu_deep_2015), which consists of more than 200k celebrity images annotated with 40 different facial attributes. To reduce the computational burden and allow for faster experimentation, we cast it into a multi-task problem by grouping the 40 attributes into eight groups of spatially or semantically related attributes (e.g. eye attributes, hair attributes, accessories) and creating one attribute prediction task for each group. Details on the pre-processing procedure are provided in Appendix B.
CityScapes. The CityScapes dataset (cordts_cityscapes_2016) contains street-view images with pixel-level annotations captured from a car point of view. We consider the seven main semantic segmentation tasks, along with a depth-estimation regression task, for a total of 8 tasks.

NYUv2. The NYUv2 dataset (silberman_indoor_2012) is a challenging dataset containing indoor images recorded over different scenes with a Microsoft Kinect camera. It provides 13 semantic segmentation tasks, plus depth estimation and surface normals estimation tasks, for a total of 15 tasks. As with CityScapes, we use the pre-processed data provided by (liu_end--end_2019) (https://github.com/lorenmt/mtan).

5.2 Baselines

We compare our method with several alternatives, including two parameter partitioning approaches (maninis_attentive_2019; strezoski_many_2019). Among these, we have not included (bragman_stochastic_2019), as we were not able to correctly replicate the method with the available resources. Specifically, we evaluate: i) MTL, a standard fully shared network with uniform task weighting; ii) GradNorm (chen_gradnorm_2018), a fully shared network with a trainable task weighting method; iii) MGDA-UB (sener_multi-task_2018), a fully shared network that formulates MTL as a multi-objective optimization problem; iv) Task Routing (TR) (strezoski_many_2019), a parameter partitioning method with fixed binary masks; and v) SE-MTL (maninis_attentive_2019), a parameter partitioning method with trainable real-valued masks. Note that the work of (maninis_attentive_2019) consists of a more complex framework comprising several other contributions. For a fair comparison with the other baselines, we only consider the parameter partitioning and not the other elements of their work.

5.3 Facial attributes detection

In these first experiments, we study in detail the properties of our method using the Celeb-A dataset (liu_deep_2015). Being a small dataset, it allows for fast experimentation. We use a ResNet-18 (he_deep_2016) as a base network for all experiments. All models are optimized with the Adam optimizer (kingma_adam_2017) (learning rate of ). The reported results are averaged over five seeds.

Effect of Roaming. In a first experiment, we study the effect of the roaming imposed on the parameters on MTL performance as a function of the sharing ratio $p$, and compare it with a fixed partitioning setup. Figure 1 (left) reports the achieved F-scores as $p$ varies. Let us remark that, since all model scores are averaged over five seeds, the fixed partitioning scores are the average of five different (fixed) partitionings.

Results show that, for the same network capacity, Maximum Roaming provides improved performance w.r.t. a fixed partitioning approach. Moreover, for smaller values of $p$, and still for the same network capacity, Maximum Roaming does not suffer the dramatic drop in performance observed with a fixed partitioning. This behaviour suggests that parameter partitioning does have an unwanted effect on the inductive bias, reflected in poorer generalization performance, but that these negative effects can be compensated by letting parameters roam across tasks.

The fixed partitioning scheme achieves its best F-score at a high sharing ratio. This is explained by the fact that the dataset is not originally made for multi-task learning: all its classes are closely related, so they naturally have a lot to share with little task interference. Maximum Roaming achieves higher performance than this nearly fully shared configuration (in which the overlap between task partitions is close to its maximum) for every $p$ in the reported range. In this range, the smaller $p$ is, the greater the gain in performance: it can be profitable to partially separate tasks even when they are very similar (i.e. multi-class, multi-attribute datasets), as long as parameters are allowed to roam.

Effect of $\Delta$ and $\mu$. Here we study the impact of the interval $\Delta$ between two updates and of the completion rate $\mu$ of the update process (Eq. 6). Using a fixed sharing ratio, we report the F-score values obtained by our method over a grid search on these two hyper-parameters in Figure 1 (middle).

Figure 1: (left) Contribution of Maximum Roaming depending on the parameter partitioning selectivity $p$. (middle) F-score of our method for different values of the update interval $\Delta$ and the update completion rate $\mu$. (right) Comparison of Maximum Roaming with random and non-random selection processes of parameter candidates for updates.

Results show that model performance increases with $\Delta$: improved F-scores are observed once $\Delta$ (expressed in epochs) is large enough. This suggests that $\Delta$ needs to be large enough for the network to correctly adapt its weights to each new configuration; a rough knowledge of the overall learning behaviour on the training dataset, or a coarse grid search, is enough to set it. Regarding the completion percentage, as would be expected, the F-score increases with $\mu$, and the improvement becomes substantial beyond a certain completion rate. Furthermore, the performance saturates for the highest values of $\mu$, suggesting that, at this point, the parameters have built robust representations, so additional parameter roaming does not contribute further improvements.

Role of random selection. Finally, we assess the importance of choosing the candidate parameters for updates under a uniform distribution. To this end, we define a deterministic selection process to systematically choose $j_{\text{in}}$ and $j_{\text{out}}$ within the update plan of Def. 1. New candidate parameters are selected so as to minimize the average cosine similarity within the task parameter partition. The intuition behind this update plan is to select the parameters that are the most likely to provide additional information for a task, while discarding the more redundant ones based on their weights. The candidate parameters $j_{\text{in}}$ and $j_{\text{out}}$ are thus respectively selected such that:

$$j^{\,k}_{\text{in}} = \underset{j \,\notin\, \bar{\mathcal{A}}_t^{k-1}}{\arg\min} \; \frac{1}{|\mathcal{A}_t^{k-1}|} \sum_{j' \in \mathcal{A}_t^{k-1}} \cos\!\left(w_j, w_{j'}\right), \qquad j^{\,k}_{\text{out}} = \underset{j \,\in\, \mathcal{A}_t^{k-1}}{\arg\max} \; \frac{1}{|\mathcal{A}_t^{k-1}| - 1} \sum_{\substack{j' \in \mathcal{A}_t^{k-1} \\ j' \neq j}} \cos\!\left(w_j, w_{j'}\right), \qquad (7)$$

with $w_j$ the parameters of the convolutional kernel associated with filter $j$. Figure 1 (right) compares this deterministic selection process with Maximum Roaming by reporting the best F-scores achieved by the fully converged models for different completion rates of the update process.
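A minimal sketch of such a deterministic selection, assuming the cosine-similarity form given above (the function name and tensor layout are ours, not from the paper's code):

```python
# Sketch of the deterministic candidate selection used as an ablation in
# Sec. 5.3: the incoming filter is the unvisited one least similar (average
# cosine similarity) to the task's current partition, the outgoing filter is
# the most redundant one within it. Illustrative only.
import torch
import torch.nn.functional as F

def deterministic_candidates(weight, mask, visited, t):
    # weight: (n_filters, in_ch, k, k) convolution kernel; flatten each filter.
    w = F.normalize(weight.flatten(1), dim=1)
    sim = w @ w.t()                                     # pairwise cosine similarities
    part = mask[:, t].bool()
    unvisited = visited[:, t] == 0
    avg_to_part = sim[:, part].mean(dim=1)              # mean similarity to current partition
    j_in = avg_to_part.masked_fill(~unvisited, float("inf")).argmin()
    # Within the partition, exclude each filter's self-similarity (equal to 1).
    within = (sim[:, part].sum(dim=1) - 1.0) / max(int(part.sum()) - 1, 1)
    j_out = within.masked_fill(~part, float("-inf")).argmax()
    return j_in, j_out
```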

Results show that, while both selection methods perform about equally at low values of $\mu$, Maximum Roaming progressively improves as $\mu$ grows. We attribute this to the varying overlap induced by the deterministic selection: outliers in the parameter space have a higher chance than other parameters of being quickly selected as update candidates, which favours a specific update order common to every task. This increases the overlap between the different task partitions, and with it the cases of task interference.

It should be noted that the deterministic selection method still provides a significant improvement compared to a fixed partitioning ($\mu = 0$). This highlights the primary importance of making the parameters learn from a maximum number of tasks, which is guaranteed by the update plan (Def. 1), i.e. the roaming, used by both selection methods.

5.4 Scene understanding

Our final experiment compares the performance of Maximum Roaming (MR) with other state-of-the-art methods on two well-established scene-understanding benchmarks: CityScapes (cordts_cityscapes_2016) and NYUv2 (silberman_indoor_2012). For the sake of this study, we consider each segmentation class as an independent task, although it is common to treat all of them as a single segmentation task. As a base network we use a SegNet (badrinarayanan_segnet_2017), split after the last convolution, with independent outputs for each task, on top of which we build the different methods to compare. All models are trained with Adam (learning rate of ). We report mean Intersection over Union (mIoU) and pixel accuracy (Pix. Acc.) averaged over all segmentation tasks, average absolute (Abs. Err.) and relative error (Rel. Err.) for the depth estimation tasks, and mean (Mean Err.) and median (Med. Err.) errors for the normals estimation task. Tables 1 and 2 show the results of the different models on each dataset. The reported results are the best we could achieve with each method, after a basic grid search on its hyper-parameter(s).

Segmentation Depth estimation
mIoU Pix. Acc. Abs. Err. Rel. Err.
MTL
GradNorm () (chen_gradnorm_2018)
MGDA-UB (sener_multi-task_2018)
SE-MTL (maninis_attentive_2019)
TR () (strezoski_many_2019)
MR ()
Table 1: CityScapes results
Segmentation Depth estimation Normals estimation
(Higher Better) (Lower Better) (Lower Better)
mIoU Pix. Acc. Abs. Err. Rel. Err. Mean Err. Med. Err.
MTL
GradNorm () (chen_gradnorm_2018)
MGDA-UB (sener_multi-task_2018)
SE-MTL (maninis_attentive_2019)
TR () (strezoski_many_2019)
MR ()
Table 2: NYUv2 results

Maximum Roaming reaches the best scores on the segmentation and normals estimation tasks, and ranks second on the depth estimation tasks. In particular, it outperforms the other methods by a significant margin on the segmentation tasks: our method restores the inductive bias reduced by parameter partitioning, so the tasks benefiting the most from it are the ones most related to the others, here the segmentation tasks. We observe that GradNorm (chen_gradnorm_2018) fails on the regression tasks (depth and normals estimation): it seems that equalizing the respective gradient magnitudes of the tasks emphasizes the imbalance between the number of segmentation tasks and the number of regression tasks. Surprisingly, MGDA-UB (sener_multi-task_2018) reaches rather low performance on the NYUv2 dataset, especially on the segmentation tasks, while being one of the best performing methods on CityScapes. It appears that, during training, the loss computed for the shared weights quickly converges to zero, leaving the task-specific prediction layers to learn their tasks independently from an almost frozen shared representation. This could also explain why it still achieves good results on the regression tasks, these being easier tasks. We hypothesize that the solver fails at finding update directions improving all tasks, leaving the model stuck in a Pareto-stationary point.

6 Conclusion

In this paper, we introduced Maximum Roaming, a dynamic parameter partitioning method that reduces task interference while taking full advantage of the latent inductive bias offered by the plurality of tasks. Our approach makes each parameter learn successively from all possible tasks, with a simple yet effective parameter selection process. The proposed algorithm achieves this in a minimal time, without any additional cost compared to other partitioning methods, nor any additional parameters to be trained on top of the base network. Experimental results show substantially improved performance on all reported datasets, regardless of the type of convolutional network it is applied to, which suggests this work could form a basis for optimizing the shared parameters in many future multi-task learning works.

Maximum Roaming relies on a binary partitioning scheme that is applied at every layer independently of the layer's depth. However, it is well known that the parameters in the lower layers of deep networks are generally less subject to task interference. Furthermore, our method fixes an update interval, while our experiments show that the update process can in some cases be stopped prematurely. We encourage future work to apply Maximum Roaming or similar strategies to more complex partitioning methods, and to allow the different hyper-parameters to be tuned automatically during training. For example, one could include a term favoring roaming within the loss of the network.

References

Appendix A Proof of Lemma 2

At $k = 0$, every element of $\mathcal{M}^0$ follows a Bernoulli distribution of parameter $p$, so that $P(m^0_{j,t} = 1) = p$ for every parameter $j$ and task $t$.

We assume $P(m^{k-1}_{j,t} = 1) = p$ and prove that the property holds at step $k$. A parameter $j$ belongs to the partition of task $t$ after step $k$ either because it is the incoming candidate $j^{\,k}_{\text{in}}$, or because it already belonged to the partition and is not the outgoing candidate $j^{\,k}_{\text{out}}$:

$$P(m^k_{j,t} = 1) = P(j = j^{\,k}_{\text{in}}) + P(m^{k-1}_{j,t} = 1 \wedge j \neq j^{\,k}_{\text{out}}).$$

Since $j^{\,k}_{\text{in}}$ is uniformly sampled (without replacement) from the parameters not yet visited by task $t$, and $j^{\,k}_{\text{out}}$ is uniformly sampled from the current partition $\mathcal{A}_t^{k-1}$, the probability mass added by the incoming candidate compensates, in expectation, the mass removed by the outgoing one, and the two terms sum to $p$. Hence $P(m^k_{j,t} = 1) = p$, which demonstrates that $P(m^k_{j,t} = 1)$ remains constant over $k$, given a uniform sampling of $j^{\,k}_{\text{in}}$ and $j^{\,k}_{\text{out}}$ from $\{1,\dots,K\} \setminus \bar{\mathcal{A}}_t^{k-1}$ and $\mathcal{A}_t^{k-1}$, respectively. The expected overlap between the parameter partitions of the different tasks is therefore constant. ∎

Appendix B Experimental Setup

In this section we provide a detailed description of the experimental setup used for the experiments on each of the considered datasets.

b.1 Celeb-A

Table 3 provides details on the distribution of the 40 facial attributes among the created tasks. Every attribute in a task uses the same parameter partition. During training, the losses of all the attributes of the same task are averaged to form a task-specific loss. All baselines use a ResNet-18 [he_deep_2016], truncated after the last average pooling, as a shared network. We then add fully connected layers of input size , one per task, with the appropriate number of outputs, i.e. the number of facial attributes in the task. The partitioning methods ([maninis_attentive_2019], [strezoski_many_2019] and Maximum Roaming) are applied to every shared convolutional layer in the network. The parameter in GradNorm [chen_gradnorm_2018] has been optimized over the set of values . All models were trained with an Adam optimizer [kingma_adam_2017] and a learning rate of , until convergence, using a binary cross-entropy loss function averaged over the different attributes of a given task. We use a batch size of , and all input images are resized to . The reported results are evaluated on the validation split provided in the official release of the dataset [liu_deep_2015].

Tasks Classes
Global Attractive, Blurry, Chubby, Double Chin, Heavy Makeup, Male, Oval Face, Pale Skin, Young
Eyes Bags Under Eyes, Eyeglasses, Narrow Eyes, Arched Eyebrows, Bushy Eyebrows
Hair Bald, Bangs, Black Hair, Blond Hair, Brown Hair, Gray Hair, Receding Hairline, Straight Hair, Wavy Hair
Mouth Big Lips, Mouth Slightly Open, Smiling, Wearing Lipstick
Nose Big Nose, Pointy Nose
Beard 5 o’ Clock Shadow, Goatee, Mustache, No Beard, Sideburns
Cheeks High Cheekbones, Rosy Cheeks
Wearings Wearing Earrings, Wearing Hat, Wearing Necklace, Wearing Necktie
Table 3: Class composition of each of the tasks for the Celeb-A dataset.

b.2 CityScapes

All baselines use a SegNet [badrinarayanan_segnet_2017] outputting 64 feature maps of the same height and width as the inputs. For each of the tasks, we add one prediction head, composed of one and one convolutions. A sigmoid function is applied to the output of the segmentation tasks. The partitioning methods ([maninis_attentive_2019], [strezoski_many_2019] and Maximum Roaming) are applied to every shared convolutional layer in the network; this excludes the convolutions in the tasks' respective prediction heads. The parameter in GradNorm [chen_gradnorm_2018] has been optimized over the set of values . All models were trained with an Adam optimizer [kingma_adam_2017] and a learning rate of , until convergence. We use the binary cross-entropy as a loss function for each segmentation task, and the averaged absolute error for the depth estimation task. We use a batch size of , and the input samples are resized to , as provided by [liu_end--end_2019] (https://github.com/lorenmt/mtan). The reported results are evaluated on the validation split provided by [liu_end--end_2019].

b.3 NYUv2

For the segmentation tasks and the depth estimation task, we use the same configuration as for CityScapes. For the normals estimation task, the prediction head is made of one and one convolutions, and its loss is computed as an element-wise dot product between the normalized predictions and the ground-truth map. We use a batch size of , and the input samples are here resized to , as provided by [liu_end--end_2019]. The reported results are evaluated on the validation split provided by [liu_end--end_2019].

Appendix C Celeb-A Dataset Benchmark

We have excluded from the main document a benchmark performed on all baselines using the Celeb-A dataset, as this dataset was used to study, understand and tune our method. As such, we consider that a comparison on this dataset might not be entirely fair. Nevertheless, and for the sake of completeness, we provide it in this section.

Table 4 reports precision, recall and F-score values averaged over the facial attributes for every model. Figure 2 shows radar charts with the individual F-scores obtained for each of the facial attributes. For improved readability, the scores have been plotted in two different charts, one for the highest scores and one for the remaining lower ones.

Results show that, overall, partitioning strategies perform better than the other approaches, with Maximum Roaming achieving the best performance in terms of the F-score (see Table 4), as well as on several individual attributes (see Figure 2). It is important to remark that in [sener_multi-task_2018] the authors report an error of for MGDA-UB and for GradNorm on the Celeb-A dataset, while in our experimental setup MGDA-UB reports an error of , GradNorm and Maximum Roaming . These differences might be explained by factors linked to the different experimental setups. Firstly, [sener_multi-task_2018] uses each facial attribute as an independent task, while we create tasks out of different attribute groups. Secondly, both works use different reference metrics: we report performance at the highest validation F-score, while they report accuracy.

Classification
Precision Recall F-Score
MTL
GradNorm () [chen_gradnorm_2018]
MGDA-UB [sener_multi-task_2018]
SE-MTL[maninis_attentive_2019]
TR () [strezoski_many_2019]
MR ()
Table 4: Precision, Recall and F-score measures, averaged over the facial attributes of the Celeb-A dataset.
Figure 2: Radar chart comparing different baselines F-scores on every facial attribute of Celeb-A. (left) attributes with highest scores, (right) attributes with lowest scores. Each plot is displayed at a different scale.