Multi-task learning in Convolutional Networks has displayed remarkable success in the field of recognition. This success can be largely attributed to learning shared representations from multiple supervisory tasks. However, existing multi-task approaches rely on enumerating multiple network architectures specific to the tasks at hand, that do not generalize. In this paper, we propose a principled approach to learn shared representations in ConvNets using multi-task learning. Specifically, we propose a new sharing unit: "cross-stitch" unit. These units combine the activations from multiple networks and can be trained end-to-end. A network with cross-stitch units can learn an optimal combination of shared and task-specific representations. Our proposed method generalizes across multiple tasks and shows dramatically improved performance over baseline methods for categories with few training examples.
Over the last few years, ConvNets have given huge performance boosts in recognition tasks ranging from classification and detection to segmentation and even surface normal estimation. One reason for this success is the inbuilt sharing mechanism, which allows ConvNets to learn representations shared across different categories. This insight naturally extends to sharing between tasks (see Figure 1) and leads to further performance improvements, e.g., the gains in segmentation and detection [21, 19]. A key takeaway from these works is that multiple tasks, and thus multiple types of supervision, help achieve better performance with the same input. Unfortunately, the network architectures they use for multi-task learning differ notably. There are no insights or principles for how one should choose ConvNet architectures for multi-task learning.
How should one pick the right architecture for multi-task learning? Does it depend on the final tasks? Should we have a completely shared representation between tasks? Or should we have a combination of shared and task-specific representations? Is there a principled way of answering these questions?
To investigate these questions, we first perform extensive experimental analysis to understand the performance trade-offs amongst different combinations of shared and task-specific representations. Consider a simple experiment where we train a ConvNet on two related tasks (e.g., semantic segmentation and surface normal estimation). Depending on the amount of sharing one wants to enforce, there is a spectrum of possible network architectures. Figure 2(a) shows different ways of creating such network architectures based on AlexNet. On one end of the spectrum is a fully shared representation where all layers, from the first convolution (conv2) to the last fully-connected (fc7), are shared and only the last layers (two fc8s) are task specific. Such sharing has been used with separate fc8 layers for classification and bounding box regression. On the other end of the sharing spectrum, we can train two networks separately, one for each task, with no cross-talk between them. In practice, different amounts of sharing tend to work best for different tasks.
So given a pair of tasks, how should one pick a network architecture? To empirically study this question, we pick two varied pairs of tasks:
We first pair semantic segmentation (SemSeg) and surface normal prediction (SN). We believe the two tasks are closely related to each other since segmentation boundaries also correspond to surface normal boundaries. For this pair of tasks, we use the NYU-v2 dataset.
For the second pair, we use object detection (Det) and attribute prediction (Attr) on the PASCAL VOC 2008 dataset [12, 16].
We exhaustively enumerate all the possible Split architectures, as shown in Figure 2(a), for these two pairs of tasks and show their respective performance in Figure 2(b). The best performance for both the SemSeg and SN tasks is obtained with the “Split conv4” architecture (splitting at conv4), while the Det task performs best with Split conv2 and the Attr task with Split fc6. These results indicate two things: 1) networks learned in a multi-task fashion have an edge over networks trained on one task; and 2) the best Split architecture for multi-task learning depends on the tasks at hand.
While the gain from multi-task learning is encouraging, getting the most out of it is still cumbersome in practice. This is largely due to the task dependent nature of picking architectures and the lack of a principled way of exploring them. Additionally, enumerating all possible architectures for each set of tasks is impractical. This paper proposes cross-stitch units, using which a single network can capture all these Split-architectures (and more). It automatically learns an optimal combination of shared and task-specific representations. We demonstrate that such a cross-stitched network can achieve better performance than the networks found by brute-force enumeration and search.
Multi-task learning has a rich history in machine learning. The term multi-task learning (MTL) itself has been broadly used [28, 42, 2, 55, 54, 14] as an umbrella term to include representation learning and selection [13, 4, 31, 37], [39, 56, 41] etc. and their widespread applications in other fields, such as genomics [8, 7, 35] and computer vision [31, 58, 53, 10, 30, 51, 40, 3]. In fact, multi-task learning is often used implicitly without reference; a good example is fine-tuning or transfer learning, now a mainstay in computer vision, which can be viewed as sequential multi-task learning. Given the broad scope, in this section we focus only on multi-task learning in the context of ConvNets used in computer vision.
, face landmark detection and face detection [59, 57], auxiliary tasks in detection, related classes for image classification, etc. Usually these methods share some features (layers in ConvNets) amongst tasks and have some task-specific features. This sharing or Split architecture (as explained in Section 1.1) is decided after experimenting with splits at multiple layers and picking the best one. Of course, depending on the tasks at hand, a different Split architecture tends to work best, and thus, given new tasks, new Split architectures need to be explored. In this paper, we propose cross-stitch units as a principled approach to explore and embody such Split architectures, without having to train all of them.
To demonstrate the robustness and effectiveness of cross-stitch units in multi-task learning, we choose varied tasks on multiple datasets. In particular, we select four well-established and diverse tasks on different types of image datasets: 1) We pair semantic segmentation [45, 46, 27] and surface normal estimation [18, 11, 52], both of which require predictions over all pixels, on the NYU-v2 indoor dataset. These two tasks capture both semantic and geometric information about the scene. 2) We choose the tasks of object detection [44, 17, 20, 21] and attribute prediction [15, 1, 33] on web images from the PASCAL dataset [16, 12]. These tasks make predictions about localized regions of an image.
In this paper, we present a novel approach to multi-task learning for ConvNets by proposing cross-stitch units. Cross-stitch units try to find the best shared representations for multi-task learning. They model these shared representations using linear combinations, and learn the optimal linear combinations for a given set of tasks. We integrate these cross-stitch units into a ConvNet and provide an end-to-end learning framework. We use detailed ablative studies to better understand these units and their training procedure. Further, we demonstrate the effectiveness of these units for two different pairs of tasks. To limit the scope of this paper, we only consider tasks which take the same single input, e.g., an image as opposed to say an image and a depth-map .
Given a single input image with multiple labels, one can design “Split architectures” as shown in Figure 2. These architectures have both a shared representation and a task specific representation. ‘Splitting’ a network at a lower layer allows for more task-specific and fewer shared layers. One extreme of Split architectures is splitting at the lowest convolution layer which results in two separate networks altogether, and thus only task-specific representations. The other extreme is using “sibling” prediction layers (as in ), which allows for a more shared representation. Thus, Split architectures allow for a varying amount of shared and task-specific representations.
Given that Split architectures hold promise for multi-task learning, an obvious question is – At which layer of the network should one split? This decision is highly dependent on the input data and tasks at hand. Rather than enumerating the possibilities of Split architectures for every new input task, we propose a simple architecture that can learn how much shared and task specific representation to use.
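The family of Split architectures discussed above can be enumerated mechanically. The following is a minimal sketch, not the authors' code: the helper names are hypothetical, the layer list uses standard AlexNet layer names, and it assumes the convention that "Split X" means layer X is the first task-specific layer.

```python
# Layer names follow standard AlexNet naming; the helpers are hypothetical.
ALEXNET_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]


def split_architecture(layers, split_at):
    """Return (shared, task_specific) layers for a split at `split_at`.

    Assumed convention: layers before `split_at` are shared by both tasks;
    `split_at` and all later layers are duplicated, one copy per task.
    """
    idx = layers.index(split_at)
    return layers[:idx], layers[idx:]


def enumerate_splits(layers):
    """Yield every Split architecture, from two separate networks to sibling heads."""
    for name in layers:
        yield name, split_architecture(layers, name)


for name, (shared, specific) in enumerate_splits(ALEXNET_LAYERS):
    print(f"Split {name}: shared={shared} task-specific={specific}")
```

Under this convention, splitting at the first layer leaves nothing shared (two separate networks), and splitting at fc8 shares everything except the sibling prediction layers. Each yielded architecture would still have to be trained separately, which is the cost the cross-stitch unit is meant to avoid.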
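Consider a case of multi-task learning with two tasks A and B on the same input image. For the sake of explanation, consider two networks that have been trained separately for these tasks. We propose a new unit, the cross-stitch unit, that combines these two networks into a multi-task network in such a way that the tasks supervise how much sharing is needed, as illustrated in Figure 3. At each layer of the network, we model sharing of representations by learning a linear combination of the activation maps [31, 4] using a cross-stitch unit. Given two activation maps $x_A$, $x_B$ from layer $l$ for the two tasks, we learn linear combinations $\tilde{x}_A$, $\tilde{x}_B$ (Eq 1) of both input activations and feed these combinations as input to the next layers’ filters. This linear combination is parameterized using $\alpha$. Specifically, at location $(i, j)$ in the activation map,

$$\begin{bmatrix} \tilde{x}_A^{ij} \\ \tilde{x}_B^{ij} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x_A^{ij} \\ x_B^{ij} \end{bmatrix} \qquad (1)$$

We refer to this as the cross-stitch operation, and the unit that models it for each layer as the cross-stitch unit. The network can decide to make certain layers task specific by setting $\alpha_{AB}$ or $\alpha_{BA}$ to zero, or choose a more shared representation by assigning a higher value to them.

Since cross-stitch units are modeled as a linear combination, their partial derivatives for loss $L$ with tasks A, B are computed as

$$\begin{bmatrix} \dfrac{\partial L}{\partial x_A^{ij}} \\[2ex] \dfrac{\partial L}{\partial x_B^{ij}} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{BA} \\ \alpha_{AB} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} \dfrac{\partial L}{\partial \tilde{x}_A^{ij}} \\[2ex] \dfrac{\partial L}{\partial \tilde{x}_B^{ij}} \end{bmatrix} \qquad (2)$$

$$\frac{\partial L}{\partial \alpha_{AB}} = \frac{\partial L}{\partial \tilde{x}_A^{ij}} \, x_B^{ij}, \qquad \frac{\partial L}{\partial \alpha_{AA}} = \frac{\partial L}{\partial \tilde{x}_A^{ij}} \, x_A^{ij} \qquad (3)$$

where Eq 3 gives the contribution of location $(i, j)$, and the gradients for $\alpha_{BA}$ and $\alpha_{BB}$ follow by symmetry.

We denote $\alpha_{AB}, \alpha_{BA}$ by $\alpha_D$ and call them the different-task values because they weigh the activations of another task. Likewise, $\alpha_{AA}, \alpha_{BB}$ are denoted by $\alpha_S$, the same-task values, since they weigh the activations of the same task. By varying the $\alpha_D$ and $\alpha_S$ values, the unit can freely move between shared and task-specific representations, and choose a middle ground if needed.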
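To make the cross-stitch operation and its derivatives concrete, here is a minimal pure-Python sketch under stated assumptions. The four weights of one unit are written as `a_AA`, `a_AB`, `a_BA`, `a_BB` (names chosen here for illustration), activation maps are flattened to lists of floats, and the weight gradients are summed over all locations because one unit is shared across the whole map. It is not the authors' Caffe implementation.

```python
def cross_stitch_forward(x_a, x_b, alpha):
    """Apply one cross-stitch unit to two activation maps (flat lists of floats).

    alpha = ((a_AA, a_AB), (a_BA, a_BB)); the same four values are used at
    every location of the map.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * xa + a_ab * xb for xa, xb in zip(x_a, x_b)]
    out_b = [a_ba * xa + a_bb * xb for xa, xb in zip(x_a, x_b)]
    return out_a, out_b


def cross_stitch_backward(x_a, x_b, alpha, g_out_a, g_out_b):
    """Backpropagate loss gradients through the unit.

    g_out_a / g_out_b are dL/d(out_a) and dL/d(out_b). Returns gradients with
    respect to both inputs and with respect to the four alpha values.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    # Input gradients use the transposed weight matrix.
    g_x_a = [a_aa * ga + a_ba * gb for ga, gb in zip(g_out_a, g_out_b)]
    g_x_b = [a_ab * ga + a_bb * gb for ga, gb in zip(g_out_a, g_out_b)]
    # Alpha gradients sum the per-location products over the whole map.
    g_alpha = (
        (sum(ga * xa for ga, xa in zip(g_out_a, x_a)),
         sum(ga * xb for ga, xb in zip(g_out_a, x_b))),
        (sum(gb * xa for gb, xa in zip(g_out_b, x_a)),
         sum(gb * xb for gb, xb in zip(g_out_b, x_b))),
    )
    return g_x_a, g_x_b, g_alpha


# Toy inputs (illustrative values only).
x_a, x_b = [1.0, 2.0], [3.0, -1.0]
alpha = ((0.9, 0.1), (0.1, 0.9))
out_a, out_b = cross_stitch_forward(x_a, x_b, alpha)

# Gradients for the toy loss L = sum(out_a) + 2 * sum(out_b).
g_x_a, g_x_b, g_alpha = cross_stitch_backward(x_a, x_b, alpha, [1.0, 1.0], [2.0, 2.0])
print(out_a, out_b, g_alpha)
```

Setting the off-diagonal weights to zero and the diagonal weights to one turns the unit into an identity, i.e. two fully task-specific networks, which is one end of the spectrum the text describes.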
We use the cross-stitch unit for multi-task learning in ConvNets. For the sake of simplicity, we assume multi-task learning with two tasks. Figure 4 shows this architecture for two tasks A and B. The sub-network in Figure 4 (top) gets direct supervision from task A and indirect supervision (through cross-stitch units) from task B. We call the sub-network that gets direct supervision from task A network A, and correspondingly the other network B. Cross-stitch units help regularize both tasks by learning and enforcing shared representations through combining activation (feature) maps. As we show in our experiments, in the case where one task has fewer labels than the other, such regularization helps the “data-starved” tasks.
Next, we enumerate the design decisions when using cross-stitch units with networks, and in later sections perform ablative studies on each of them.
Cross-stitch units initialization and learning rates: The α values of a cross-stitch unit model linear combinations of feature maps. Their initialization in the range [0, 1] is important for stable learning, as it ensures that values in the output activation map (after the cross-stitch unit) are of the same order of magnitude as the input values before linear combination. We study the impact of different initializations and learning rates for cross-stitch units in Section 5.
Network initialization: Cross-stitch units combine two networks as shown in Figure 4. However, an obvious question is: how should one initialize the networks A and B? We can initialize networks A and B with networks that were trained on these tasks separately, or give them the same initialization and train them jointly.
We now describe the experimental setup in detail, which is common throughout the ablation studies.
Datasets and Tasks: For ablative analysis we consider the tasks of semantic segmentation (SemSeg) and Surface Normal Prediction (SN) on the NYU-v2  dataset. We use the standard train/test splits from . For semantic segmentation, we follow the setup from  and evaluate on the 40 classes using the standard metrics from their work
Setup for Surface Normal Prediction: Following prior work, we cast the problem of surface normal prediction as classification into one of 20 categories. For evaluation, we convert the model predictions to 3D surface normals and apply Manhattan-World post-processing, following an existing method. We evaluate all our methods using previously proposed metrics. These metrics measure the error between the ground truth normals and the predicted normals in terms of their angular distance (measured in degrees). Specifically, they measure the mean and median error in angular distance, in which case lower error is better (denoted by ‘Mean’ and ‘Median’ error). They also report the percentage of pixels whose angular distance is under a threshold (denoted by ‘Within t°’ for an angular threshold t), in which case a higher number indicates better performance.
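The angular-distance metrics described above can be sketched as follows. This is an illustrative implementation using only the Python standard library; the function names are hypothetical, and the thresholds and toy normals in the usage lines are placeholders rather than the values or data used in the paper.

```python
import math
from statistics import mean, median


def angular_error_deg(n_pred, n_gt):
    """Angle in degrees between two 3D vectors (normalized internally)."""
    dot = sum(p * g for p, g in zip(n_pred, n_gt))
    norm = math.sqrt(sum(p * p for p in n_pred)) * math.sqrt(sum(g * g for g in n_gt))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def normal_metrics(preds, gts, thresholds):
    """Mean/median angular error (lower is better) and % of pixels under each threshold."""
    errors = [angular_error_deg(p, g) for p, g in zip(preds, gts)]
    within = {t: 100.0 * sum(e < t for e in errors) / len(errors) for t in thresholds}
    return {"mean": mean(errors), "median": median(errors), "within": within}


# Toy per-pixel normals (illustrative only): two exact matches and one 90-degree miss.
preds = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
gts = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
metrics = normal_metrics(preds, gts, thresholds=(10.0, 30.0))  # placeholder thresholds
print(metrics)
```

The mean is sensitive to a few large errors while the median is not, which is why both are reported alongside the thresholded percentages.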
Networks: For semantic segmentation (SemSeg) and surface normal (SN) prediction, we use the Fully-Convolutional Network (FCN 32-s) architecture based on CaffeNet (essentially AlexNet). For both the tasks of SemSeg and SN, we use RGB images at full resolution, and use mirroring and color data augmentation. We then fine-tune the network (referred to as the one-task network) from ImageNet for each task using previously reported hyperparameters. We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations (mini-batch size 20), as these settings gave the best performance and further training (up to 40k iterations) showed no improvement. These one-task networks serve as our baselines and as initializations for cross-stitching, when applicable.
Cross-stitching: We combine two AlexNet architectures using the cross-stitch units as shown in Figure 4. We experimented with applying cross-stitch units after every convolution activation map and after every pooling activation map, and found the latter performed better. Thus, the cross-stitch units for AlexNet are applied on the activation maps for pool1, pool2, pool5, fc6 and fc7. We maintain one cross-stitch unit per ‘channel’ of the activation map, e.g., for pool1 we have 96 cross-stitch units.
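As a rough sense of scale, the number of cross-stitch units implied by the per-channel scheme can be tallied with a few lines of Python. The text gives 96 units for pool1; the other channel counts below are an assumption taken from the standard AlexNet/CaffeNet layer sizes, and treating each fc6/fc7 output as one "channel" is likewise assumed.

```python
# Channels per stitched layer. Only the pool1 value (96) is stated in the text;
# the rest are assumed from the standard AlexNet/CaffeNet layer sizes.
CHANNELS = {"pool1": 96, "pool2": 256, "pool5": 256, "fc6": 4096, "fc7": 4096}
VALUES_PER_UNIT = 4  # two same-task and two different-task weights for two tasks


def count_cross_stitch(channels):
    """Return (number of units, number of learnable cross-stitch values)."""
    units = sum(channels.values())
    return units, units * VALUES_PER_UNIT


units, values = count_cross_stitch(CHANNELS)
print(units, values)
```

Under these assumptions the cross-stitch units add a few tens of thousands of values, which is negligible next to the parameters of the two AlexNet sub-networks they connect.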
Cross-stitch units capture the intuition that shared representations can be modeled by linear combinations. To ensure that values after the cross-stitch operation are of the same order of magnitude as the input values, an obvious initialization of the unit is one where its α values (the linear combination weights) form a convex linear combination, i.e., the different-task and same-task values sum to one. Note that this convexity is not enforced on the α values in either Equation 1 or 2, but serves as a reasonable initialization. For this experiment, we initialize the networks A and B with one-task networks that were fine-tuned on the respective tasks. Table 1 shows the results of evaluating cross-stitch networks for different initializations of the α values.
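The effect of a convex initialization can be checked on a toy example. The sketch below uses hypothetical helper names and illustrative weight values (0.9 and 0.1, and a deliberately non-convex 2.0 and 2.0); it only shows that convex weights keep each output between the two inputs, whereas unconstrained weights can inflate it.

```python
def init_cross_stitch(same_task, different_task):
    """2x2 weight matrix of one unit: rows are the outputs for tasks A and B."""
    return ((same_task, different_task), (different_task, same_task))


def is_convex(alpha, tol=1e-9):
    """True if every row is non-negative and sums to one."""
    return all(abs(sum(row) - 1.0) <= tol and min(row) >= 0.0 for row in alpha)


def stitch(x_a, x_b, alpha):
    """Cross-stitch two scalar activations."""
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    return a_aa * x_a + a_ab * x_b, a_ba * x_a + a_bb * x_b


convex = init_cross_stitch(0.9, 0.1)      # illustrative values
non_convex = init_cross_stitch(2.0, 2.0)  # illustrative values

x_a, x_b = 2.0, 4.0
print(stitch(x_a, x_b, convex))      # both outputs stay between 2.0 and 4.0
print(stitch(x_a, x_b, non_convex))  # outputs grow well beyond the inputs
```

Because convexity is only an initialization and is not enforced during training, the learned weights are free to drift away from it.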
The α values used to initialize the cross-stitch units are about one to two orders of magnitude larger than the typical range of layer parameters in AlexNet. While training, we found that the gradient updates at various layers had magnitudes which were reasonable for updating the layer parameters, but too small for the cross-stitch units. Thus, we use higher learning rates for the cross-stitch units than for the base network. In practice, this leads to faster convergence and better performance. To study the impact of different learning rates, we again use a cross-stitched network initialized with two one-task networks. We scale the learning rates of the cross-stitch units relative to the network’s learning rate (by setting the lr_mult layer parameter in Caffe). Table 2 shows the results of using different learning rates for the cross-stitch units after training for 10k iterations. Setting a higher scale for the learning rate improves performance, but only up to a point: we observed that setting the scale too high made the loss diverge.
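The idea of a per-parameter learning-rate multiplier can be illustrated outside Caffe with a plain SGD step. This is a hedged sketch: scalar parameters stand in for tensors, the parameter names are hypothetical, and the multiplier of 100 is an illustrative value rather than the setting reported in the paper.

```python
def sgd_step(params, grads, base_lr, lr_mult):
    """One plain SGD step where each named parameter has its own lr multiplier.

    Parameters missing from `lr_mult` use a multiplier of 1.0.
    """
    return {
        name: value - base_lr * lr_mult.get(name, 1.0) * grads[name]
        for name, value in params.items()
    }


# Scalars stand in for tensors; names and values are illustrative only.
params = {"conv1_weight": 0.01, "cross_stitch_alpha": 0.9}
grads = {"conv1_weight": 0.002, "cross_stitch_alpha": 0.002}
lr_mult = {"cross_stitch_alpha": 100.0}  # hypothetical scale

updated = sgd_step(params, grads, base_lr=0.001, lr_mult=lr_mult)
print(updated)
```

With the same gradient magnitude, the cross-stitch value moves 100 times further per step than the layer weight, which compensates for it starting one to two orders of magnitude larger.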
When cross-stitching two networks, how should one initialize the networks A and B? Should one start with task-specific one-task networks (fine-tuned for one task only) and add cross-stitch units? Or should one start with networks that have not been fine-tuned for the tasks? We explore the effect of both choices by initializing with two one-task networks and with two networks trained on ImageNet [9, 43]. We train the one-task initialized cross-stitched network for 10k iterations and the ImageNet initialized cross-stitched network for 30k iterations (to account for the 20k fine-tuning iterations of the one-task networks), and report the results in Table 3. Task-specific initialization performs better than ImageNet initialization for both tasks, which suggests that cross-stitching should be used after training task-specific networks.
We visualize the same-task and different-task weights of the cross-stitch units for different initializations in Figure 4. For this experiment, we initialize sub-networks A and B using one-task networks and train the cross-stitched network till convergence. Each plot shows (in sorted order) the weight values for all the cross-stitch units in a layer (one per channel). We show plots for three layers: pool1, pool5 and fc7. The initialization of cross-stitch units biases the network to start its training preferring a certain type of shared representation, e.g., a large same-task weight with a small different-task weight biases the network to learn more task-specific features, while comparable weights bias it to share representations. Figure 4 (second row) shows that both tasks, across all initializations, prefer a more task-specific representation for pool5, as shown by higher values of the same-task weights. This is in line with the observation from Section 1.1 that Split conv4 performs best for these two tasks. We also notice that the surface normal task prefers shared representations, as can be seen in Figure 4(b), where the same-task and different-task values are in a similar range.
We now present experiments with cross-stitch networks for two pairs of tasks: semantic segmentation and surface normal prediction on NYU-v2 , and object detection and attribute prediction on PASCAL VOC 2008 [12, 16]. We use the experimental setup from Section 5 for semantic segmentation and surface normal prediction, and describe the setup for detection and attribute prediction below.
Dataset, Metrics and Network: We consider the PASCAL VOC 20 classes for object detection, and the 64 attribute categories data from . We use the PASCAL VOC 2008 [12, 16] dataset for our experiments and report results using the standard Average Precision (AP) metric. We start with the recent Fast-RCNN  method for object detection using the AlexNet  architecture.
Training: For object detection, Fast R-CNN is trained using 21-way 1-vs-all classification with 20 foreground classes and 1 background class. However, there is a severe data imbalance between the foreground and background data points (boxes). To circumvent this, Fast R-CNN carefully constructs mini-batches with a 1:3 foreground-to-background ratio, i.e., at most 25% of the samples in a mini-batch are foreground. Attribute prediction, on the other hand, is a multi-label classification problem with 64 attributes, which is trained using only foreground bounding boxes. To implement both tasks in the Fast R-CNN framework, we use the same mini-batch sampling strategy; in every mini-batch, only the foreground samples contribute to the attribute loss (and background samples are ignored).
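The sampling strategy described above can be sketched in a few lines. This is an illustration, not the Fast R-CNN code: the function names, the batch size of 128, and the integer box ids are all assumptions made for the example.

```python
import random


def sample_minibatch(foreground, background, batch_size, max_fg_fraction=0.25, rng=random):
    """Sample box ids so that at most `max_fg_fraction` of the batch is foreground.

    Returns a list of (box_id, is_foreground) pairs.
    """
    n_fg = min(int(batch_size * max_fg_fraction), len(foreground))
    n_bg = min(batch_size - n_fg, len(background))
    batch = [(box, True) for box in rng.sample(foreground, n_fg)]
    batch += [(box, False) for box in rng.sample(background, n_bg)]
    return batch


def attribute_loss_mask(batch):
    """1.0 for samples that contribute to the attribute loss, 0.0 otherwise."""
    return [1.0 if is_fg else 0.0 for _, is_fg in batch]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
foreground = list(range(100))        # hypothetical foreground box ids
background = list(range(100, 1000))  # hypothetical background box ids
batch = sample_minibatch(foreground, background, batch_size=128, rng=rng)
mask = attribute_loss_mask(batch)
print(len(batch), sum(mask))
```

Every sample in the batch feeds the detection loss, while the mask zeroes out the attribute loss for background boxes, so attribute prediction sees at most a quarter of each mini-batch.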
Scaling losses: Both SemSeg and SN use the same classification loss for training, and hence we set their loss weights to be equal. However, since object detection is formulated as 1-vs-all classification and attribute classification as multi-label classification, we balance the losses by scaling the attribute loss.
Cross-stitching: We combine two AlexNet architectures using the cross-stitch units after every pooling layer as shown in Figure 4. In the case of object detection and attribute prediction, we use one cross-stitch unit per layer activation map. We found that maintaining a unit per channel, like in the case of semantic segmentation, led to unstable learning for these tasks.
Single-task Baselines: These serve as baselines without the benefits of multi-task learning. First, we evaluate a single network trained on only one task (denoted by ‘One-task’) as described in Section 5. Since our approach cross-stitches two networks and therefore uses 2× the parameters, we also consider an ensemble of two one-task networks (denoted by ‘Ensemble’). However, note that the ensemble has 2× the network parameters for only one task, while the cross-stitch network has roughly 2× the parameters for two tasks. So for a pair of tasks, the ensemble baseline uses about 2× the cross-stitch parameters.
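The parameter comparison can be spelled out as a short worked calculation. Here $P$ is a symbol introduced only for this illustration, denoting the number of parameters in one one-task network; the comparatively tiny number of cross-stitch weights is ignored.

```latex
\begin{align*}
\text{Ensemble, one task:} \quad & 2P \\
\text{Ensemble, two tasks:} \quad & 2 \times 2P = 4P \\
\text{Cross-stitch network, two tasks:} \quad & \approx 2P \\
\text{Ensemble relative to cross-stitch:} \quad & \frac{4P}{2P} = 2
\end{align*}
```

So the ensemble is a generous baseline: for the same pair of tasks it has roughly twice the capacity of the cross-stitched network.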
Multi-task Baselines: The cross-stitch units enable the network to pick an optimal combination of shared and task-specific representation. We demonstrate that these units remove the need for finding such a combination by exhaustive brute-force search (from Section 1.1). So as a baseline, we train all possible “Split architectures” for each pair of tasks and report numbers for the best Split for each pair of tasks.
There has been extensive work on multi-task learning outside of the computer vision and deep learning community. However, most such work with publicly available code formulates multi-task learning in an optimization framework that requires all data points in memory [60, 6, 23, 61, 14, 34, 49]. Such a requirement is not practical for the vision tasks we consider.
So as our final baseline, we compare to a variant of [62, 1] by adapting their method to our setting and report this as ‘MTL-shared’. The original method treats each category as a separate ‘task’: a separate network is required for each category, and all these networks are trained jointly. Directly applied to our setting, this would require training hundreds of ConvNets jointly, which is impractical. Thus, instead of treating each category as an independent task, we adapt their method to our two-task setting. We train these two networks jointly using end-to-end learning, rather than their dual optimization, to reduce hyperparameter search.
Table 5 shows the results for semantic segmentation and surface normal prediction on the NYU-v2 dataset. We compare against two one-task networks, an ensemble of two networks, and the best Split architecture (found using brute-force enumeration). The sub-networks A and B (Figure 4) in our cross-stitched network are initialized from the one-task networks. We use cross-stitch units after every pooling layer and fully connected layer (one per channel). Our proposed cross-stitched network improves results over the baseline one-task networks and the ensemble. Note that even though the ensemble has 2× the parameters of the cross-stitched network, the latter performs better. Finally, our performance is better than that of the best Split architecture network found using brute-force search. This shows that the cross-stitch units can effectively search for the optimal amount of sharing in multi-task networks.
Multiple tasks are particularly helpful in regularizing the learning of shared representations[50, 14, 5]. This regularization manifests itself empirically in the improvement of “data-starved” (few examples) categories and tasks.
For semantic segmentation, there is a high mismatch in the number of labels per category (see the black line in Figure 6). Some classes like wall, floor have many more instances than other classes like bag, whiteboard etc. Figure 6 also shows the per-class gain in performance using our method over the baseline one-task network. We see that cross-stitch units considerably improve the performance of “data-starved” categories (e.g., bag, whiteboard).
We train a cross-stitch network for the tasks of object detection and attribute prediction. We compare against baseline one-task networks and the best split architectures per task (found after enumeration and search, Section 1.1). Table 6 shows the results for object detection and attribute prediction on PASCAL VOC 2008 [12, 16]. Our method shows improvements over the baseline for attribute prediction. It is worth noting that because we use a background class for detection, and not attributes (described in ‘Scaling losses’ in Section 6), detection has many more data points than attribute classification (only 25% of a mini-batch has attribute labels). Thus, we see an improvement for the data-starved task of attribute prediction. It is also interesting to note that the detection task prefers a shared representation (best performance by Split fc7), whereas the attribute task prefers a task-specific network (best performance by Split conv2).
|Method||Detection (mAP)||Attributes (mAP)|
Following a similar analysis to Section 6.3, we plot the relative performance of our cross-stitch approach over the baseline one-task attribute prediction network in Figure 5. The performance gain for attributes with fewer training examples is considerable (4.6% and 4.3% mAP for the 10 and 20 attributes with the least data, respectively). This shows that our proposed cross-stitch method provides significant gains for data-starved tasks by learning shared representations.
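The "gain over the k categories with the least data" statistic quoted above can be computed as in the following sketch. The function name is hypothetical and the numbers are toy values invented purely to exercise the code; they are not results from the paper.

```python
def mean_gain_for_rarest(num_examples, baseline_ap, method_ap, k):
    """Average AP gain over the k categories with the fewest training examples."""
    rarest = sorted(num_examples, key=num_examples.get)[:k]
    return sum(method_ap[c] - baseline_ap[c] for c in rarest) / len(rarest)


# Toy data (illustrative only, not from the paper).
num_examples = {"a": 10, "b": 500, "c": 20, "d": 1000}
baseline_ap = {"a": 20.0, "b": 60.0, "c": 30.0, "d": 70.0}
method_ap = {"a": 26.0, "b": 60.5, "c": 34.0, "d": 70.0}

gain_rarest_2 = mean_gain_for_rarest(num_examples, baseline_ap, method_ap, k=2)
gain_all = mean_gain_for_rarest(num_examples, baseline_ap, method_ap, k=4)
print(gain_rarest_2, gain_all)
```

Comparing the gain on the rarest categories with the gain over all categories is what separates a regularization effect on data-starved categories from a uniform improvement.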
We present cross-stitch units, which are a generalized way of learning shared representations for multi-task learning in ConvNets. Cross-stitch units model shared representations as linear combinations, and can be learned end-to-end in a ConvNet. These units generalize across different types of tasks and eliminate the need to search through several multi-task network architectures on a per-task basis. We show detailed ablative experiments to study the effects of hyperparameters, initialization, etc. when using these units. We also show considerable gains over the baseline methods for data-starved categories. Studying other properties of cross-stitch units, such as where in the network they should be used and how their weights should be constrained, is an interesting future direction.
We would like to thank Alyosha Efros and Carl Doersch for helpful discussions. This work was supported in part by ONR MURI N000141612007 and the US Army Research Laboratory (ARL) under the CTA program (Agreement W911NF-10-2-0016). AS was supported by the MSR fellowship. We thank NVIDIA for donating GPUs.
A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Scalable multitask representation learning for scene classification. In CVPR, 2014.
Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep., 2006.
Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 1956.
Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.