Cross-Task Consistency Learning Framework for Multi-Task Learning

by Akihiro Nakano, et al.
The University of Tokyo

Multi-task learning (MTL) is an active field in deep learning in which we train a model to jointly learn multiple tasks by exploiting relationships between the tasks. It has been shown that MTL helps the model share learned features between tasks and improves predictions compared to learning each task independently. We propose a new learning framework for the 2-task MTL problem that uses the predictions of one task as inputs to another network to predict the other task. We define two new loss terms inspired by cycle-consistency loss and contrastive learning: alignment loss and cross-task consistency loss. Both losses are designed to make the model align the predictions of multiple tasks so that it predicts consistently. We theoretically prove that both losses help the model learn more efficiently and that cross-task consistency loss is better in terms of alignment with the straightforward predictions. Experimental results also show that our proposed model achieves strong performance on the benchmark Cityscapes and NYU datasets.




I Introduction

Deep learning has made significant progress in the last decade, improving its ability to classify, predict, and understand inputs from various modalities. Although its performance now exceeds that of humans in certain domains, one disadvantage of deep learning models is that their application is typically limited to single-task problems. While humans are capable of handling multiple tasks simultaneously, common deep learning approaches require a different model to be trained for each task. Therefore, recent works have focused on developing multi-task learning models, i.e., models optimized for multiple tasks.

Multi-Task Learning (MTL) is a method to train a model to learn multiple tasks jointly. Humans are able to process multiple tasks at the same time and combine the information for a more semantic task. For example, when we look at an image, we are able to recognize objects, estimate depth, infer the scene, and predict what will happen next. One reason we can process and integrate multiple sensory tasks is that they are closely related to each other. MTL aims to exploit this “relationship” in a way that machines can utilize.

Recent MTL works have investigated sharing all or part of the architecture. Some works have proposed methods that explicitly force the model to capture relationships between tasks using additional units or matrices [long17, misra16, yang19]. Other works have simply divided the features into task-common and task-specific features so that the model learns the relationships implicitly, and have focused on balancing the losses between multiple tasks [liu19, jha20, kendall18, chen18, yu20].

Less research has been conducted on self-supervised learning for MTL. Works such as [chen19, klingner20] have utilized stereo RGB image pairs for better semantic consistency or better depth estimation. However, these methods require the data to be collected using stereo cameras and therefore cannot be applied if the data were collected otherwise (e.g., the NYU dataset [silberman12], which was collected using a Microsoft Kinect camera, an infrared camera). To the best of our knowledge, no prior work has explored self-supervised MTL that can be used with data from both stereo cameras and infrared cameras.

In this work, we propose a new framework, called the cross-task consistency learning framework, for the 2-task MTL problem, inspired by CycleGAN’s [zhu17] cycle-consistency loss and contrastive learning using multi-views [chen20], [tosh20]. We introduce two loss terms, alignment loss and cross-task consistency loss, which aim to maintain consistency between the predictions. We then prove an inequality relating alignment loss and cross-task consistency loss, and show not only that cross-task consistency loss is superior to alignment loss in approximating the straightforward predictions but also that both terms are upper-bounded by a small value.

Specifically, we demonstrate our method on learning semantic segmentation and depth estimation in the computer vision field. We use these two tasks because (i) scene segmentation and depth perception are related to each other [burge10], and (ii) the encoder-decoder architecture, which has shown strong performance on both tasks, is suited to capturing task-common and task-specific features. Our architecture, which we name XTasC-Net (Cross-Task Consistency Network), consists of two modules. First, the input image is fed to an encoder-decoder architecture to produce predictions for the two tasks (direct predictions). Then, we prepare two separate encoder-decoder networks (Task-Transfer Networks, TTNets) that take the direct predictions as inputs and output predictions for the other task (task-transferred predictions). The proposed alignment loss and cross-task consistency loss utilize the task-transferred predictions so that some of the information from the other task is learned through the scope of the TTNets. The experimental results on the Cityscapes [cordts16] and NYU [silberman12] datasets show that our model achieves competitive performance with fewer parameters.

In summary, we make two main contributions. First, we present a principled multi-task learning framework for the 2-task MTL problem by relating it to cycle-consistency and contrastive learning. We theoretically derive our loss term using implicit latent variable models (LVMs) and prove an inequality relating the two proposed loss terms. Second, we propose a new architecture called XTasC-Net for semantic segmentation and depth estimation. Our method is applicable not only to data collected via stereo cameras but also to data from monocular cameras.

II Related Work

MTL has been one of the key approaches to improving generalization and learning efficiency by using information from related tasks [caruana98, ruder17]. A common challenge of MTL is the formulation of task relationships. Some works have approached this using regularization methods [micchelli04] or by defining a convex optimization problem to estimate task relationships [ciliberto10, zhang10, ciliberto17]. Zamir et al. [zamir18] built a taxonomic structure between multiple vision-related tasks. Furthermore, Zhou et al. [zhou21] used adversarial networks to learn the task relationships.

In the computer vision field, some works have explored modeling task relationships explicitly in their model architecture. For example, Long et al. [long17] developed Deep Relationship Networks, which place a matrix prior between the fully-connected layers of each task so that the model learns the task relationship. Misra et al. [misra16] proposed Cross-stitch Networks, which use cross-stitch units, modules that learn how much sharing is needed between the respective layers of the networks. Yang et al. [yang19] used min-max optimization to make the model learn both task-specific and task-common features.

In contrast, other works have experimented with making the model learn task relationships implicitly. There have been three main approaches: using knowledge distillation, balancing the multi-task loss function, and utilizing stereo views of the data.

One approach has been to use knowledge distillation [hinton15] by preparing two phases of training to enhance performance [xu18, vande19, zhang19, li20, vande20]. These models are trained on several tasks jointly in the first phase. Then, using the trained weights from the first phase, the features are combined through another network that distills their information to make the final predictions. For example, Xu et al. [xu18] proposed PAD-Net, which experiments with concatenating the features or applying an attention map when passing features of other tasks to a given task. Zhang et al. [zhang19] and Vandenhende et al. [vande20] utilized self-attention modules in each task’s network. Li et al.’s [li20] model used knowledge distillation by adding a loss term between the pretrained single-task learning (STL) network and the MTL model’s weights.

Another approach is to train a single model by balancing the loss functions of each task [mousa16, chen18, kendall18, zhang18, sener18, liu19, jha20, yu20]. Kendall et al. [kendall18] proposed uncertainty weights, a quantity that can be seen as the relative confidence between tasks. Other methods [chen18, jha20] have used loss magnitude to balance the losses. On the other hand, Liu et al. [liu19] introduced DWA (Dynamic Weight Average), which uses the relative loss reduction, and Yu et al. [yu20] implemented PCGrad, an algorithm that resolves conflicting gradients in the shared architecture.

Finally, some MTL works that include a depth estimation task have utilized multiple views of the data [chen19, klingner20]. Taking advantage of the fact that some datasets were captured using a stereo camera, these methods have applied the photometric reconstruction loss originally proposed by MonoDepth [godard17, godard19].

III Method

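We use the following notation. A common network architecture in recent works consists of two modules: a shared network $f_{\theta_{sh}}$ followed by individual networks $f_{\theta_t}$ for each task, where $\theta_{sh}$ and $\theta_t$ are the weights of the networks for $t = 1, \dots, T$. Since the number of tasks is limited to $T = 2$, we refer to the two tasks as $Y$ and $Z$ and write their task-specific networks as $f_Y$ and $f_Z$. Further, for simplicity, we regard the output of the shared network as our input and denote it as $X$.

Our key assumption is that there exist mappings $g_{\phi_Y}$ and $g_{\phi_Z}$ that describe the likelihood between the tasks,

$$y = g_{\phi_Y}(z) + \epsilon_Y, \qquad z = g_{\phi_Z}(y) + \epsilon_Z, \tag{1}$$

where $\epsilon_Y, \epsilon_Z$ are noise.

Since $y$ and $z$ are both outputs of a neural network given an input $x$, we can further write this using $f_Y$ and $f_Z$ as,

$$f_Y(x) = g_{\phi_Y}(f_Z(x)) + \epsilon_Y, \qquad f_Z(x) = g_{\phi_Z}(f_Y(x)) + \epsilon_Z. \tag{2}$$

Using this notation, we propose cross-task consistency loss, which is defined as,

$$\mathcal{L}_{\mathrm{XTC}} = \left\| f_Y(x) - g_{\phi_Y}(f_Z(x)) \right\| + \left\| f_Z(x) - g_{\phi_Z}(f_Y(x)) \right\|, \tag{3}$$

where $\|\cdot\|$ denotes a norm. The loss takes the difference between the outputs of the task-specific networks, $f_Y(x)$ and $f_Z(x)$, and those outputs transferred to predict the other task, $g_{\phi_Y}(f_Z(x))$ and $g_{\phi_Z}(f_Y(x))$. Below, we refer to the former as direct predictions and the latter as task-transferred predictions.

The intuition of cross-task consistency loss is to pass partial information of task-specific features to the other task using the assumed mappings $g_{\phi_Y}$ and $g_{\phi_Z}$. Although the shared network helps the model learn task-common features, we expect that some of the task-specific features learned in each task-specific network can also be utilized for the other task.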
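As a minimal sketch of the cross-task consistency loss described above, the following assumes Euclidean distance as the norm and uses toy linear maps in place of the learned task-transfer mappings. The names `l2_distance` and `cross_task_consistency_loss` and the toy data are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def cross_task_consistency_loss(
    direct_a: Vector,
    direct_b: Vector,
    transfer_b_to_a: Callable[[Vector], Vector],
    transfer_a_to_b: Callable[[Vector], Vector],
) -> float:
    """Sum of two terms: each direct prediction is compared with the
    prediction transferred from the other task's direct prediction."""
    term_a = l2_distance(direct_a, transfer_b_to_a(direct_b))
    term_b = l2_distance(direct_b, transfer_a_to_b(direct_a))
    return term_a + term_b


# Toy relationship (hypothetical): task B is exactly twice task A.
def double(v: Vector) -> list:
    return [2.0 * x for x in v]


def halve(v: Vector) -> list:
    return [0.5 * x for x in v]


# Predictions that respect the relationship give zero loss.
consistent = cross_task_consistency_loss([1.0, 2.0], [2.0, 4.0], halve, double)
# Predictions that violate it are penalized in both directions.
inconsistent = cross_task_consistency_loss([1.0, 2.0], [2.0, 5.0], halve, double)
```

In this toy case the loss vanishes only when the two direct predictions agree with each other through the transfer maps, which is the consistency the loss is meant to encourage.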

In the following sections, we derive our proposed loss term following [tiao18, tosh20]. First, in section III-A, we derive a similar loss term, which we call alignment loss, from Tiao et al.’s [tiao18] derivation of CycleGAN based on implicit latent variable models (LVMs). Then, in section III-B, we prove that cross-task consistency loss is better than alignment loss based on Tosh et al.’s proof [tosh20].

III-A Alignment Loss

Fig. 1: Diagrams of CycleGAN (left), contrastive learning (center), and our cross-task consistency learning framework (right).

Fig. 1 shows diagrams of CycleGAN, contrastive learning, and our cross-task consistency learning framework. Similar to how CycleGAN has a bidirectional mapping between the input (real images) and the output (generated images), our framework also has a bidirectional mapping between the two tasks’ outputs.

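Using LVMs, we can describe the joint probability of observing $y$ and $z$, the two tasks, as the product of the prior distribution and the likelihood. Since we model the likelihoods using (1), we can express the joint probability as,

$$p_{\phi_Y}(y, z) = p(y \mid z; \phi_Y)\, p(z), \qquad p_{\phi_Z}(y, z) = p(z \mid y; \phi_Z)\, p(y), \tag{4}$$

where $p$ denotes probability and $\phi_Y, \phi_Z$ are the parameters of the mappings in (1). We further use Tiao et al.’s implicit LVMs, in which we replace the prior distributions $p(y), p(z)$ with implicit distributions $p^{*}(y), p^{*}(z)$. Implicit distributions are distributions given by a finite collection of data, $\{y_n\}_{n=1}^{N}$, i.e., $y_n \sim p^{*}(y)$ for $n = 1, \dots, N$. Therefore, (4) can be written as,

$$p_{\phi_Y}(y, z \mid x) = p(y \mid z; \phi_Y)\, p^{*}(z \mid x), \qquad p_{\phi_Z}(y, z \mid x) = p(z \mid y; \phi_Z)\, p^{*}(y \mid x), \tag{5}$$

since both $y$ and $z$ are conditioned on input $x$.

Now, regardless of using parameters $\phi_Y$ or $\phi_Z$, the joint distribution should be equivalent. Therefore, we consider minimizing the statistical distance between $p_{\phi_Y}$ and $p_{\phi_Z}$ using symmetric KL divergence, $\mathrm{KL}_{\mathrm{sym}}$, where

$$\mathrm{KL}_{\mathrm{sym}}(p_{\phi_Y}, p_{\phi_Z}) = \mathrm{KL}(p_{\phi_Y} \,\|\, p_{\phi_Z}) + \mathrm{KL}(p_{\phi_Z} \,\|\, p_{\phi_Y}).$$

Then, since minimizing the KL divergence is equivalent to maximizing the likelihood in (5), we can derive the loss as follows.

Assume Gaussian noise, i.e., $\epsilon_Y \sim \mathcal{N}(0, \sigma_Y^2 I)$, $\epsilon_Z \sim \mathcal{N}(0, \sigma_Z^2 I)$. Then,

$$p(y \mid z; \phi_Y) = \mathcal{N}\!\left(y;\, g_{\phi_Y}(z),\, \sigma_Y^2 I\right).$$

Therefore using maximum likelihood estimation,

$$\arg\max_{\phi_Y} \log p(y \mid \hat{z}; \phi_Y) = \arg\min_{\phi_Y} \left\| y - g_{\phi_Y}(\hat{z}) \right\|_2^2,$$

where $\hat{y} = f_Y(x)$ and $\hat{z} = f_Z(x)$ are the direct predictions. A similar thing can be said for maximizing $p(z \mid y; \phi_Z)$. Hence,

$$\mathcal{L}^{Y}_{\mathrm{ALIGN}} = \left\| y - g_{\phi_Y}(\hat{z}) \right\|_2^2, \qquad \mathcal{L}^{Z}_{\mathrm{ALIGN}} = \left\| z - g_{\phi_Z}(\hat{y}) \right\|_2^2$$

are the alignment losses. However, we introduce another loss, cross-task consistency loss, which is defined as (3), by replacing the first term of each equation with the direct predictions. In the following section, we explain the merits of using cross-task consistency loss over alignment loss.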

III-B Cross-Task Consistency Loss

Our learning framework can also be related to contrastive learning using multi-views (Fig. 1). Methods such as SimCLR [chen20] learn in a self-supervised fashion by creating additional inputs that are correlated with the original input. Whereas contrastive learning uses the redundancy between the inputs and the generated images, our framework uses the redundancy between the two tasks.

Below, we use the following notations to express direct predictors, alignment (ALIGN) predictors, and cross-task consistency (XTC) predictors,


Furthermore, without proof, we can state that the quantity,


is small.

By assuming that there exists some relationship between $Y$ and $Z$, predicting $Y$ by first predicting $Z$ from $X$ should, intuitively, achieve similar predictions as directly predicting $Y$ from $X$.

The following proposition tells us that not only does this strategy work but also using the direct predictions as the target works better.


Let $X$, $Y$, $Z$ be random variables. Then cross-task consistency loss yields a smaller expected difference from the direct loss than alignment loss does, i.e.,

The same holds when $Y$ and $Z$ are switched.

See Appendix A for proof.

The proposition implies that (1) cross-task consistency loss yields a smaller expected difference from the direct loss than alignment loss does, and (2) both losses are upper-bounded by some small value. Our proposed loss can be seen as learning one task through the scope of the other task. As the model is composed of a shared network and task-specific networks, cross-task consistency loss forces the model to efficiently pass the information of one task to the other and keep its predictions consistent. Furthermore, since both cross-task consistency loss and alignment loss are upper-bounded by the small quantity above, the cross-task terms are competitive with direct predictions.

IV XTasC-Net

IV-A Model Architecture

Based on our cross-task consistency learning framework, we propose an original neural network model named XTasC-Net (Cross-Task Consistency Network) to conduct the experiments. Fig. 2 shows our model which is built from 3 modules.

Fig. 2: Architecture overview of XTasC-Net.

XTasC-Net is an encoder-decoder architecture with a separate decoder for each task. The encoder and decoder modules are also connected with skip connections, similar to U-Net [ronne15]. The outputs of the decoders, the direct predictions, are then fed into separate networks, which we refer to as TTNets (Task-Transfer Networks), similar to [zamir20]. The outputs of the TTNets are the task-transferred predictions.

We choose ResNet [he16], a commonly used network for image encoding, as the backbone of the encoder. There are 5 common variants of ResNet, namely ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. After comparing the results, we decided to use ResNet34, which gave the best performance with relatively few parameters (see Appendix B).

The decoder consists of 5 blocks and 1 convolutional layer. Based on U-Net, each decoder block takes in the concatenation of the feature maps of the previous block and those of the respective encoder block. It then passes them through 3 convolutional layers followed by batch normalization and a ReLU activation function. The number of channels is set to 128 throughout. At the end of each block, the feature maps are upsampled by a factor of 2. The output of the 5th block is then fed into a final convolutional layer so that the output has the dimensions required for the given task. For the segmentation task, the output dimension is the number of classes, while for depth estimation, the dimension is 1.
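To make the decoder's shape arithmetic concrete, here is a small sketch that tracks the spatial size through the 5 upsampling blocks and the channel count of the final layer. It assumes an encoder with an overall stride of 32 (as in a standard ResNet) and a hypothetical 128 x 256 input; both the input size and the helper names are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def decoder_feature_sizes(
    bottleneck_hw: Tuple[int, int], num_blocks: int = 5, scale: int = 2
) -> List[Tuple[int, int]]:
    """Spatial size after each decoder block, starting from the encoder output."""
    h, w = bottleneck_hw
    sizes = []
    for _ in range(num_blocks):
        # Each block ends with an upsampling by `scale`.
        h, w = h * scale, w * scale
        sizes.append((h, w))
    return sizes


def output_channels(task: str, num_classes: Optional[int] = None) -> int:
    """Channels of the final convolution: classes for segmentation, 1 for depth."""
    if task == "segmentation":
        if num_classes is None:
            raise ValueError("num_classes is required for segmentation")
        return num_classes
    if task == "depth":
        return 1
    raise ValueError(f"unknown task: {task}")


# Hypothetical 128 x 256 input with an assumed encoder stride of 32.
input_hw = (128, 256)
bottleneck = (input_hw[0] // 32, input_hw[1] // 32)  # (4, 8)
sizes = decoder_feature_sizes(bottleneck)
```

Under these assumptions, five doublings undo the factor-32 downsampling, so the last decoder block is back at the input resolution before the task-specific output layer is applied.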

TTNets are also designed similarly to U-Net, with 3 blocks in the contracting path and 3 blocks in the expansive path. In the contracting path, each block consists of 2 repeated applications of a convolutional layer followed by a ReLU activation function. At the end of each block, the feature maps are downsampled by a max-pooling layer with stride 2. The numbers of channels are, in order, 64, 128, and 256. The expansive path is built symmetrically, with 2 convolutional layers of the same kernel size followed by a ReLU activation layer. At the end of each block, the feature maps are upsampled by a factor of 2. The output of the final block is then fed into a convolutional layer similar to that of the decoder.

IV-B Loss Functions

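As mentioned in section III, our proposed method adds the cross-task consistency loss to the loss function. Below, we denote $y_1, y_2$ as the target values/classes we wish to predict, $\hat{y}_1, \hat{y}_2$ as the direct predictions, and $\tilde{y}_1, \tilde{y}_2$ as the task-transferred predictions for task 1 and 2 respectively. We prepare 2 types of loss functions: $\mathcal{L}^{d}_1, \mathcal{L}^{d}_2$ are the losses for direct predictions and $\mathcal{L}^{xtc}_1, \mathcal{L}^{xtc}_2$ are the cross-task consistency losses. Using these notations, the loss of task $t$, $\mathcal{L}_t$, can be denoted as the weighted combination of the 2 losses,

$$\mathcal{L}_t = \mathcal{L}^{d}_t(\hat{y}_t, y_t) + \lambda\, \mathcal{L}^{xtc}_t(\tilde{y}_t, \hat{y}_t),$$

where $\lambda$ is the weight and $t'$ is the other task, whose direct prediction $\hat{y}_{t'}$ is passed through the TTNet to obtain $\tilde{y}_t$.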

Let task 1 be the segmentation task and task 2 be the depth estimation task.

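Following numerous previous works, we use cross-entropy loss as the loss function for the segmentation task. Given an output, the cross-entropy loss for direct predictions is expressed as,

$$\mathcal{L}^{d}_1 = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_1(i, c) \log \hat{y}_1(i, c),$$

where $i$ indexes the $N$ valid pixels and $c$ the $C$ classes. For task-transferred predictions, the loss is expressed as,

$$\mathcal{L}^{xtc}_1 = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \hat{y}_1(i, c) \log \tilde{y}_1(i, c).$$

We do not backpropagate the cross-task consistency loss via the direct prediction, because the target we are minimizing against is the direct prediction, the output of the other task’s decoder.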
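The two segmentation losses can be sketched with one cross-entropy routine that accepts either one-hot labels or a predicted distribution as the target. This is an illustrative sketch with made-up probabilities; in an autodiff framework the direct prediction would additionally be detached from the graph when used as a target (for example with `Tensor.detach()` in PyTorch), which plain Python cannot show.

```python
import math
from typing import Sequence


def cross_entropy(
    target_probs: Sequence[Sequence[float]],
    predicted_probs: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Mean over pixels of -sum_c target[c] * log(predicted[c])."""
    if len(target_probs) != len(predicted_probs):
        raise ValueError("target and prediction must cover the same pixels")
    total = 0.0
    for t_pix, p_pix in zip(target_probs, predicted_probs):
        # eps guards against log(0) for zero-probability classes.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_pix, p_pix))
    return total / len(target_probs)


# Two pixels, two classes (toy values).
one_hot_labels = [[1.0, 0.0], [0.0, 1.0]]
direct = [[0.9, 0.1], [0.2, 0.8]]        # direct segmentation prediction
transferred = [[0.7, 0.3], [0.4, 0.6]]   # task-transferred prediction

# Direct loss: ground-truth labels are the target.
direct_loss = cross_entropy(one_hot_labels, direct)
# Consistency loss: the direct prediction is a fixed target.
xtc_loss = cross_entropy(direct, transferred)
```

With a soft target the cross-entropy is bounded below by the target's own entropy, so the consistency term is smallest when the task-transferred prediction matches the direct prediction.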

For the depth estimation task, we use the $L_1$ norm distance as the loss function. Given an output, the depth loss for direct predictions is written as,

$$\mathcal{L}^{d}_2 = \frac{1}{N} \sum_{i=1}^{N} \left| y_2(i) - \hat{y}_2(i) \right|,$$

for all pixels $i$ with valid depth values. During training, we mask the invalid pixels so that their loss is not backpropagated through the network. We also use the $L_1$ norm for task-transferred predictions,

$$\mathcal{L}^{xtc}_2 = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_2(i) - \tilde{y}_2(i) \right|.$$

Similar to the task-transferred prediction of the segmentation task, the loss is not backpropagated via the direct prediction.
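A masked depth loss of this kind can be sketched as follows. The sketch assumes an absolute-difference (L1) distance and assumes that the same validity mask is reused for the consistency term; both choices, along with the function name `masked_l1` and the toy values, are assumptions made for illustration.

```python
from typing import Optional, Sequence


def masked_l1(
    target: Sequence[float],
    prediction: Sequence[float],
    valid: Optional[Sequence[bool]] = None,
) -> float:
    """Mean absolute difference over the pixels marked as valid."""
    if valid is None:
        valid = [True] * len(target)
    diffs = [abs(t - p) for t, p, v in zip(target, prediction, valid) if v]
    if not diffs:
        return 0.0  # no valid pixel contributes to the loss
    return sum(diffs) / len(diffs)


gt_depth = [2.0, 0.0, 4.0]      # 0.0 marks an invalid depth measurement
direct = [2.5, 1.0, 3.0]        # direct depth prediction
transferred = [2.0, 1.5, 3.5]   # depth predicted from the segmentation output
valid = [d > 0 for d in gt_depth]

direct_loss = masked_l1(gt_depth, direct, valid)    # target: ground truth
xtc_loss = masked_l1(direct, transferred, valid)    # target: direct prediction
```

The invalid middle pixel is excluded from both terms, so it contributes neither loss nor gradient.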

Overall, the loss we wish to minimize is the weighted sum of the 2 tasks,

$$\mathcal{L} = w_1 \mathcal{L}_1 + w_2 \mathcal{L}_2,$$

where $w_1, w_2$ are the weights. For $w_t$, we experiment using equal weights, uncertainty weights [kendall18], and GradNorm [chen18].
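The sketch below shows how the per-task and overall losses might be combined. It assumes an additive form for the per-task loss (direct loss plus a weighted consistency loss) and uses a commonly implemented simplified variant of uncertainty weighting with one learnable log-variance per task; neither is confirmed to match the authors' exact formulation, and GradNorm is omitted because it requires gradient norms.

```python
import math
from typing import Sequence


def per_task_loss(direct_loss: float, xtc_loss: float, lam: float) -> float:
    """Direct loss plus the consistency loss scaled by lam (assumed additive form)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return direct_loss + lam * xtc_loss


def total_loss_equal(task_losses: Sequence[float]) -> float:
    """Equal weighting: every task loss has weight 1."""
    return sum(task_losses)


def total_loss_uncertainty(
    task_losses: Sequence[float], log_vars: Sequence[float]
) -> float:
    """Simplified uncertainty weighting: sum_t exp(-s_t) * L_t + s_t,
    where s_t is a learnable log-variance for task t."""
    if len(task_losses) != len(log_vars):
        raise ValueError("one log-variance per task is required")
    return sum(math.exp(-s) * l + s for l, s in zip(task_losses, log_vars))


# Toy loss values; lam = 0 recovers a model without the consistency term.
seg = per_task_loss(direct_loss=0.5, xtc_loss=2.0, lam=0.01)
depth = per_task_loss(direct_loss=0.3, xtc_loss=1.0, lam=0.0)
equal = total_loss_equal([seg, depth])
weighted = total_loss_uncertainty([seg, depth], log_vars=[0.0, 0.0])
```

With all log-variances at zero the uncertainty-weighted loss reduces to equal weighting; during training the log-variances would be optimized jointly with the network weights.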

V Experiments

V-A Datasets

We consider 2 datasets, Cityscapes [cordts16] and NYU [silberman12], to validate our proposed method.

The Cityscapes dataset is a collection of diverse urban street scenes gathered using a stereo camera. It provides 19-class and 7-class segmentation labels, and we use the 7-class version so that our results can be compared with previous works. The original image resolution is $2048 \times 1024$, but we resize the images to a lower resolution to speed up training.

The NYU dataset is composed of a wide variety of indoor scenes recorded by the RGB and depth cameras of a Microsoft Kinect, with 13-class segmentation labels. The original image resolution is $640 \times 480$, but we resize the images to a lower resolution to speed up training.

For both datasets, we apply normalization, random horizontal flipping with a probability of 0.5, and random scaled cropping with randomly chosen scales. We mask out invalid pixels during training, such as void classes for the segmentation task and pixels with depth 0 (incorrectly calculated depth) for the depth estimation task. Following previous works, for the depth of Cityscapes, we use the inverse disparity values as the target because the raw disparity values range from 0 to infinity (e.g., sky), and training to predict such unbounded values leads to poor generalization.
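As an illustration of the flipping and masking steps, the sketch below flips an image and its two label maps together, so that pixels stay aligned across tasks, and builds a validity mask for depth. It represents each map as a 2D list for simplicity; the helper names and toy values are assumptions, and scaled cropping is left out.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]


def hflip(grid: Grid) -> Grid:
    """Mirror a 2D map left to right."""
    return [row[::-1] for row in grid]


def random_joint_hflip(
    image: Grid, seg: Grid, depth: Grid, rng: random.Random, p: float = 0.5
) -> Tuple[Grid, Grid, Grid]:
    """Flip the image and both label maps together with probability p."""
    if rng.random() < p:
        return hflip(image), hflip(seg), hflip(depth)
    return image, seg, depth


def valid_depth_mask(depth: Grid) -> List[List[bool]]:
    """Pixels with depth 0 are treated as invalid and masked out."""
    return [[d > 0 for d in row] for row in depth]


rng = random.Random(0)
image = [[1.0, 2.0, 3.0]]
seg = [[0.0, 1.0, 1.0]]
depth = [[0.0, 2.5, 4.0]]

flipped = random_joint_hflip(image, seg, depth, rng, p=1.0)   # always flips
unchanged = random_joint_hflip(image, seg, depth, rng, p=0.0)  # never flips
mask = valid_depth_mask(flipped[2])
```

Applying one random decision to all three maps is what keeps the segmentation and depth targets consistent with the augmented image.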

V-B Evaluation Metrics

We use the following metrics to evaluate the performance. For segmentation task, we use mean Intersection-over-Union (mIoU) and pixel accuracy (Pix. Acc.). For depth estimation task, we use absolute error (Abs. Err.) and absolute relative error (Rel. Err.).
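For reference, the four metrics can be sketched on flattened pixel lists as below. The sketch uses common definitions (per-class intersection over union averaged over classes that occur, and errors averaged over pixels with valid depth); conventions such as how ignored pixels and absent classes are handled vary between codebases, so those details are assumptions here.

```python
from typing import Sequence


def pixel_accuracy(pred: Sequence[int], gt: Sequence[int], ignore: int = -1) -> float:
    """Fraction of non-ignored pixels whose predicted class is correct."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g != ignore]
    return sum(p == g for p, g in pairs) / len(pairs)


def mean_iou(
    pred: Sequence[int], gt: Sequence[int], num_classes: int, ignore: int = -1
) -> float:
    """Intersection over union per class, averaged over classes that occur."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g != ignore]
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


def abs_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute depth error over pixels with valid (positive) depth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) for p, g in valid) / len(valid)


def rel_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error relative to the ground-truth depth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


seg_pred, seg_gt = [0, 0, 1, 1], [0, 1, 1, -1]   # last pixel is a void class
depth_pred, depth_gt = [2.0, 3.0, 1.0], [2.5, 2.0, 0.0]  # last pixel invalid

acc = pixel_accuracy(seg_pred, seg_gt)
miou = mean_iou(seg_pred, seg_gt, num_classes=2)
abs_err = abs_error(depth_pred, depth_gt)
rel_err = rel_error(depth_pred, depth_gt)
```

Because mIoU averages per class while pixel accuracy averages per pixel, errors on rare classes affect mIoU much more strongly than pixel accuracy.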

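Furthermore, following [vande20, maninis19], we compute the performance improvement of a model $m$ against the baseline model $b$ as the average percentage gain over the evaluation metrics,

$$\Delta_m = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N_t} \sum_{n=1}^{N_t} (-1)^{l_{t,n}}\, \frac{M_{m,t,n} - M_{b,t,n}}{M_{b,t,n}} \times 100,$$

where $M_{m,t,n}$ and $M_{b,t,n}$ are the values of measure $n$ of task $t$ for the model and the baseline, $N_t$ is the number of measures of task $t$, and $l_{t,n} = 1$ if a lower value means better performance for measure $n$ of task $t$ and 0 otherwise.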
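The performance measure can be checked numerically. The sketch below averages the signed relative change per task and then across tasks, and applies it to the NYU values reported in Table III with Base ST-Net as the baseline; the dictionary layout and function name are illustrative assumptions.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]


def performance_gain(
    model: Scores, baseline: Scores, lower_is_better: Dict[str, Dict[str, bool]]
) -> float:
    """Average percentage gain over the baseline, per task then across tasks."""
    task_gains = []
    for task, measures in baseline.items():
        gains = []
        for name, base in measures.items():
            # Flip the sign for measures where a lower value is better.
            sign = -1.0 if lower_is_better[task][name] else 1.0
            gains.append(sign * (model[task][name] - base) / base)
        task_gains.append(sum(gains) / len(gains))
    return 100.0 * sum(task_gains) / len(task_gains)


lower = {
    "seg": {"miou": False, "pix_acc": False},
    "depth": {"abs_err": True, "rel_err": True},
}
# Values taken from Table III (NYU test set).
base_st = {"seg": {"miou": 30.62, "pix_acc": 62.30},
           "depth": {"abs_err": 0.6451, "rel_err": 0.2443}}
base_mt = {"seg": {"miou": 31.02, "pix_acc": 63.58},
           "depth": {"abs_err": 0.6103, "rel_err": 0.2284}}
xtasc = {"seg": {"miou": 30.31, "pix_acc": 63.02},
         "depth": {"abs_err": 0.5954, "rel_err": 0.2235}}

gain_mt = performance_gain(base_mt, base_st, lower)
gain_xtasc = performance_gain(xtasc, base_st, lower)
```

Rounded to two decimals, these evaluate to 3.82 and 4.09, matching the Performance column of Table III for Base MT-Net and XTasC-Net.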

V-C Training Protocols

For all datasets, we use the Adam optimizer with an initial learning rate of 0.0001. For the Cityscapes dataset, the learning rate is halved every 80 epochs, and we train the model for 250 epochs with batch size 8. For the NYU dataset, we halve the learning rate every 60 epochs and train the model for 100 epochs with batch size 6. We choose the loss weight $\lambda$ as 0.01 for Cityscapes and 0.0001 for the NYU dataset.
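The step schedule described above can be written as a one-line function of the epoch. This is a sketch of the schedule only; in PyTorch the same behavior would typically be obtained with `torch.optim.lr_scheduler.StepLR` using `gamma=0.5`, though the source does not say which mechanism was used.

```python
def learning_rate(epoch: int, base_lr: float = 1e-4, halve_every: int = 80) -> float:
    """Learning rate at a given 0-indexed epoch under a halving step schedule."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * 0.5 ** (epoch // halve_every)


# Cityscapes setting: 250 epochs, halved every 80 epochs.
cityscapes_lrs = [learning_rate(e, halve_every=80) for e in (0, 79, 80, 160, 249)]
# NYU setting: 100 epochs, halved every 60 epochs.
nyu_lrs = [learning_rate(e, halve_every=60) for e in (0, 59, 60, 99)]
```

Under these settings the learning rate is halved three times over the Cityscapes run and once over the NYU run.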

We conduct all experiments using PyTorch. Our code is available at

V-D Baseline Models

We compare the results of our XTasC-Net with 3 models: (1) the same encoder and decoder trained on each task separately, (2) the same architecture as XTasC-Net but without TTNet (by setting $\lambda = 0$), and (3) the same architecture as XTasC-Net but using alignment loss. We refer to these models as Base ST-Net, Base MT-Net, and Align-Net, respectively. For models except Base ST-Net, we use uncertainty weights [kendall18] to balance the 2 tasks’ losses when learning jointly. Further analysis of weighting methods is given in Appendix C.

V-E Results

V-E1 Results on Cityscapes dataset

First, we compare our results against the 3 baseline models with several weighting methods in Table I.

| Model | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) | Performance (higher better) |
|---|---|---|---|---|---|
| Base ST-Net | 66.40 | 93.48 | 0.0124 | 19.78 | 0.00 |
| Base MT-Net | 66.84 | 93.57 | 0.0122 | 19.61 | 1.44 |
| Align-Net | 66.61 | 93.53 | 0.0124 | 19.83 | 0.67 |
| XTasC-Net | 66.51 | 93.56 | 0.0122 | 19.40 | 1.58 |
TABLE I: Ablation Results of Different Models on Cityscapes Validation Set for 7-Class Semantic Segmentation and Depth Estimation Task. Best Results are in Bold.

Overall, we can observe that our XTasC-Net achieves higher overall performance than the 3 baseline models. Although Base MT-Net achieves the highest scores for the segmentation task, using TTNet to transfer losses between tasks leads to better overall performance. Between Align-Net and XTasC-Net, XTasC-Net achieves higher overall performance. These results show that task-transferred predictions do not interfere with direct predictions but exploit the task relationships to improve both tasks’ predictions.

Table II shows the results for all methods.

| Model | #P. | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) |
|---|---|---|---|---|---|
| STAN [liu19] | 12.52 | 51.90 | 90.87 | 0.0145 | 27.46 |
| Dense† [huang17] | 14.96 | 51.89 | 91.22 | 0.0134 | 25.36 |
| Cross-Stitch† [misra16] | | 50.31 | 90.43 | 0.0152 | 31.36 |
| MTAN* [liu19] | 4.12 | 53.04 | 91.11 | 0.0144 | 33.63 |
| PCGrad* [yu20] | 4.12 | 53.59 | 91.45 | 0.0171 | 31.34 |
| KD4MTL* [li20] | 4.12 | 52.71 | 91.54 | 0.0139 | 27.33 |
| AdaMT-Net‡ [jha20] | 4.91 | 62.53 | 94.16 | 0.0125 | 22.23 |
| XTasC-Net† (Ours) | 3.15 | 66.51 | 93.56 | 0.0122 | 19.40 |
TABLE II: Results of Multi-Task Learning on Cityscapes Validation Set for 7-Class Semantic Segmentation and Depth Estimation Task. #P Shows the Number of Parameters of the Model. Italic Represents Estimated Values. Best Results are in Bold and Second Best are Underlined. * Equal Weights. † Uncertainty Weights. ‡ Gradient-based Weight Learning.

We list the number of parameters in the table to show that our XTasC-Net achieves the best results with fewer parameters. Our model outperforms previous works on most evaluation metrics with considerably fewer parameters.

For the segmentation task, we observe a large improvement in mIoU. We think that this is due to whether or not attention modules are used. Previous works such as [liu19, jha20, li20, yu20] have used attention modules in their networks, enabling the model to “look” at the entire image in the training phase. On the other hand, since our model only uses convolutional layers, the model can only learn from nearby pixels. Intuitively, this leads to higher mIoU, while attention modules improve pixel accuracy.

The result shows that our method reduces error both in terms of absolute error and relative error for the depth estimation task. Intuitively, learning from the segmentation task’s predictions motivates the model to output different depth ranges for each class. We think that this leads to depth predictions with more contrast and hence higher accuracy.

Our qualitative results are shown in Fig. 3.

Fig. 3: Qualitative results on Cityscapes validation set. From top to bottom: input image, ground truth segmentation, predicted segmentation, ground truth depth, and predicted depth.

V-E2 Results on NYU dataset

We compare our results against the 3 baseline models in Table III.

| Model | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) | Performance (higher better) |
|---|---|---|---|---|---|
| Base ST-Net | 30.62 | 62.30 | 0.6451 | 0.2443 | 0.00 |
| Base MT-Net | 31.02 | 63.58 | 0.6103 | 0.2284 | 3.82 |
| Align-Net | 30.46 | 63.32 | 0.6514 | 0.2295 | 1.55 |
| XTasC-Net | 30.31 | 63.02 | 0.5954 | 0.2235 | 4.09 |
TABLE III: Ablation Results of Different Models on NYU Test Set for 13-Class Semantic Segmentation and Depth Estimation Task. Best Results are in Bold.

Again, we see a slight drop in performance for the segmentation task. Compared to the Cityscapes dataset, we observe that using MTL results in a lower mIoU score because scenes in the NYU dataset differ largely, with different depth ranges, lighting conditions, and camera angles. However, using the proposed method leads to better overall performance.

Table IV shows the result for all methods.

| Model | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) |
|---|---|---|---|---|
| STAN [liu19] | 16.65 | 55.07 | 0.6935 | 0.2891 |
| Dense [huang17] | 17.22 | 55.59 | 0.6002 | 0.2654 |
| Cross-Stitch [misra16] | 17.01 | 53.99 | 0.6095 | 0.2671 |
| MTAN [liu19] | 20.10 | 53.73 | 0.6417 | 0.2758 |
| PCGrad† [yu20] | 21.29 | 54.07 | 0.6705 | 0.3000 |
| KD4MTL* [li20] | 22.44 | 57.32 | 0.6003 | 0.2601 |
| AdaMT-Net‡ [jha20] | 20.61 | 58.91 | 0.6136 | 0.2547 |
| XTasC-Net† (Ours) | 30.31 | 63.02 | 0.5954 | 0.2235 |
TABLE IV: Results of Multi-Task Learning on NYU Test Set for 13-Class Semantic Segmentation and Depth Estimation Task. Best Results are in Bold and Second Best are Underlined. * Equal weights. † Uncertainty weights. ‡ Gradient-based weight learning. DWA.

As shown in the table, our model outperforms all previous works except for AdaMT-Net [jha20]. Compared to AdaMT-Net, our model improves mIoU and relative depth error by a fair margin.

For the segmentation task, we again observe a larger improvement for mIoU compared to pixel accuracy. As described in section V-E1, we can infer that this is due to the models’ characteristics. Using only convolutional layers without any attention modules, we achieve a 7.87 point improvement in mIoU and a smaller gain of 2.67 points in pixel accuracy. Our model also shows strong performance on the depth estimation task (Abs Err: 0.6002 → 0.5954, Rel Err: 0.2547 → 0.2235).

VI Conclusion

In this work, we made two main contributions. First, we proposed a new model architecture called XTasC-Net (Cross-Task Consistency Network) that achieved state-of-the-art results for most evaluation metrics on two benchmark datasets, Cityscapes and NYU. We also showed that our model is parameter-efficient. Second, we introduced a new loss term, cross-task consistency loss, for the 2-task MTL setting. Our method performs well both on a dataset collected using a stereo camera (Cityscapes) and on a dataset collected using an infrared camera (NYU).

Cross-task consistency loss is a term that takes the loss between the task-transferred prediction and the prediction of the other task. This loss motivates the model to output consistent predictions for both tasks by transferring information from one task to the other. We showed both theoretically and empirically the effect of cross-task consistency loss. Although our proposed loss term is limited to 2-task MTL problems, it efficiently uses the task relationship for performance improvement. Furthermore, compared to previous self-supervised learning methods, our self-supervised training method can be applied even if the data were collected using non-stereo cameras.

While we only explored the case of semantic segmentation and depth estimation, we expect that our proposed framework can be applied to any MTL setting in which there exists some relationship between the tasks. In the field of computer vision, many tasks are related to each other [zamir18]. With the introduction of datasets aimed at MTL [zamir18, roberts20], we think that our framework will help models learn and exploit the task relationships. Furthermore, if the framework can be extended to an arbitrary number of tasks, it will help our understanding of task relationships.


Appendix A Proof of the Proposition


By the law of total expectation, the first part of the equation is,

The first inequality of the equation holds because,

Furthermore, the second inequality holds from Jensen’s inequality,

Appendix B Effects of Different Encoders

As explained in section IV, we use ResNet as our encoder. ResNet has several variants of different depths, and different works have used different variants as their encoder. Therefore, we also evaluate the change in performance between 3 types of ResNet: ResNet18, 34, and 50 (Tables V and VI). We evaluate ResNet18 because it has fewer parameters than ResNet34. We also evaluate ResNet50 because it has more parameters, but no more than the other models used for comparison.

| ResNet | #P. | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) |
|---|---|---|---|---|---|
| 18 | 2.14 | 66.15 | 93.47 | 0.0124 | 19.28 |
| 34 | 3.15 | 66.51 | 93.56 | 0.0122 | 19.40 |
| 34* | 3.15 | 68.71 | 94.23 | 0.0111 | 18.48 |
| 50 | 4.04 | 66.12 | 93.50 | 0.0124 | 19.98 |

TABLE V: Ablation Results of Different Encoders on Cityscapes Validation Set for 7-class Semantic Segmentation and Depth Estimation Task. #P Shows the Number of Parameters of the Model. * Use Weights Pretrained on ImageNet.

| ResNet | #P. | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) |
|---|---|---|---|---|---|
| 18 | 2.14 | 29.72 | 61.44 | 0.6174 | 0.2314 |
| 34 | 3.15 | 30.31 | 63.02 | 0.5954 | 0.2235 |
| 34* | 3.15 | 44.75 | 75.81 | 0.4851 | 0.1835 |
| 50 | 4.04 | 28.63 | 61.66 | 0.6115 | 0.2287 |

TABLE VI: Ablation Results of Different Encoders on NYU Test Set for 13-class Semantic Segmentation and Depth Estimation Task. #P Shows the Number of Parameters of the Model. * Use Weights Pretrained on ImageNet.

The results show that ResNet34 is the best encoder for both datasets, giving the best scores for most evaluation metrics. One would expect a deeper encoder to help the model learn more contextual features and improve performance. However, for our model, the results show no such improvement from using a deeper ResNet. The results also confirm that using pretrained weights for ResNet34 leads to better performance. Although the results in section V-E use the scores achieved without pretrained weights for fair comparison, one should consider using the pretrained weights in applications.

Appendix C Effects of Different Weighting Methods

In Tables VII and VIII below, we examine the effects of using different weighting methods. We consider equal weights, uncertainty weights [kendall18], and gradient normalization (GradNorm) [chen18].

| Model | Weighting | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) | Performance (higher better) |
|---|---|---|---|---|---|---|
| Base MT-Net | Equal weights | 65.86 | 93.25 | 0.0129 | 20.47 | -1.50 |
| | Uncert. weights [kendall18] | 66.84 | 93.57 | 0.0122 | 19.61 | 1.44 |
| XTasC-Net | Equal weights | 66.32 | 93.51 | 0.0126 | 21.19 | -1.55 |
| | Uncert. weights [kendall18] | 66.51 | 93.56 | 0.0122 | 19.40 | 1.58 |
| | GradNorm [chen18] | 66.48 | 93.52 | 0.0124 | 19.58 | 0.93 |

TABLE VII: Ablation Results of Different Weighting Methods on Cityscapes Validation Set for 7-Class Semantic Segmentation and Depth Estimation Task.

| Model | Weighting | mIoU (higher better) | Pix Acc (higher better) | Abs Err (lower better) | Rel Err (lower better) | Performance (higher better) |
|---|---|---|---|---|---|---|
| Base MT-Net | Equal weights | 30.42 | 62.89 | 0.6384 | 0.2326 | 1.53 |
| | Uncert. weights [kendall18] | 31.02 | 64.58 | 0.6103 | 0.2284 | 3.82 |
| XTasC-Net | Equal weights | 30.65 | 63.13 | 0.6319 | 0.2237 | 2.98 |
| | Uncert. weights [kendall18] | 30.31 | 63.02 | 0.5954 | 0.2235 | 4.09 |
| | GradNorm [chen18] | 30.71 | 63.44 | 0.6002 | 0.2222 | 4.53 |

TABLE VIII: Ablation Results of Different Weighting Methods on NYU Test Set for 13-Class Semantic Segmentation and Depth Estimation Task.

Compared to Base MT-Net with equal weights, we observe that XTasC-Net with uncertainty weights or GradNorm leads to improvement in all four evaluation metrics, except for uncertainty weights on the NYU dataset. Between the two weighting methods, we find that the model with uncertainty weighting excels especially on the depth estimation task. Furthermore, under the uncertainty weighting regime, XTasC-Net has better overall scores than Base MT-Net (Cityscapes: 1.44 → 1.58, NYU: 3.82 → 4.09).