Multi-Task Learning by Deep Collaboration and Application in Facial Landmark Detection

10/28/2017
by   Ludovic Trottier, et al.
Université Laval

Convolutional neural networks (CNNs) have become the most successful and popular approach in many vision-related domains. While CNNs are particularly well-suited for capturing a proper hierarchy of concepts from real-world images, they are limited to domains where data is abundant. Recent attempts have looked into mitigating this data scarcity problem by casting their original single-task problem into a new multi-task learning (MTL) problem. The main goal of this inductive transfer mechanism is to leverage domain-specific information from related tasks, in order to improve generalization on the main task. While recent results in the deep learning (DL) community have shown the promising potential of training task-specific CNNs in a soft parameter sharing framework, integrating the recent DL advances for improving knowledge sharing is still an open problem. In this paper, we propose the Deep Collaboration Network (DCNet), a novel approach for connecting task-specific CNNs in a MTL framework. We define connectivity in terms of two distinct non-linear transformations. One aggregates task-specific features into global features, while the other merges back the global features with each task-specific network. Based on the observation that task relevance depends on depth, our transformations use skip connections as suggested by residual networks, to more easily deactivate unrelated task-dependent features. To validate our approach, we employ facial landmark detection (FLD) datasets as they are readily amenable to MTL, given the number of tasks they include. Experimental results show that we can achieve up to 24.31% relative improvement in failure rate over other state-of-the-art MTL approaches. We finally perform an ablation study showing that our approach effectively allows knowledge sharing, by leveraging domain-specific features at particular depths from tasks that we know are related.



1 Introduction

Over the past few years, Convolutional Neural Networks (CNNs) have become the leading approach in many vision-related tasks [1]. Their ability to learn a hierarchy of increasingly abstract concepts allows them to transform complex high-dimensional input images into simple low-dimensional output features. CNNs have been used in many settings, but their need for large amounts of training data has restricted them to domains where data is abundant. Optimizing CNNs is tricky not only because of problems like vanishing / exploding gradients [2], but also because they typically have many parameters to learn. While previous works have looked at supervised and unsupervised pre-training to improve generalization, others have considered casting their original single-task problem into a new Multi-Task Learning (MTL) problem [3]. As Caruana (1998) [4] explained in his seminal work: “MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks.” Exploring new ways to more efficiently gather information from related tasks would help to further improve generalization on the main task.

MTL has proven its value in several domains over the years. It has become a dominant field of machine learning [5], with many influential works [6]. Although MTL dates back several years, recent major advances in Deep Learning (DL) opened up opportunities for novel contributions. Works on grasping [7], pedestrian detection [8], natural language processing [9], face recognition [10][11] and object detection [12] helped MTL make a resurgence in the DL community. They have shown the potential of MTL to mitigate data scarcity when training deep networks, which has influenced its growing popularity.

MTL approaches can generally be divided into two major categories: hard and soft parameter sharing [13]. Hard-parameter sharing dates back to the original work of Caruana (1998) and is the more common of the two. Approaches in this category have a shared central section with many heads (one per task), as sketched below. Features from the specific tasks compete together, and those relevant to all tasks are favored. Recent works in DL have shown that hard-parameter sharing can be successful [14][15][7][11]. However, too strong an emphasis on features relevant to all tasks can be harmful to learning the high-level features specific to a particular task, which are usually needed to obtain a good representation for that task. Shared layers are also prone to contamination by noise coming from noxious tasks [16]. These limitations can be detrimental even though hard-parameter sharing reduces the risk of over-fitting [17].
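
To make the category concrete, here is a minimal PyTorch sketch of hard-parameter sharing. It illustrates the general pattern only, not any specific published model; the names (HardSharingNet, trunk, heads) are ours.

```python
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Hard-parameter sharing: one shared central section, one head per task."""
    def __init__(self, trunk, feat_dim, out_dims):
        super().__init__()
        self.trunk = trunk  # shared central section (any feature extractor)
        self.heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in out_dims)

    def forward(self, x):
        h = self.trunk(x)                        # features shared by all tasks
        return [head(h) for head in self.heads]  # one prediction per task
```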

Soft-parameter sharing has been proposed as an alternative to alleviate these drawbacks. Approaches in this category substitute the shared central section with separate task-specific CNNs, but provide a knowledge sharing mechanism to connect them. Each CNN can then learn task-specific features and share its knowledge without interfering with the others. Recent works in this category have looked at regularizing the distance between task-specific parameters with a norm [18] or a trace norm [19], training shared and private LSTM submodules [16], partitioning the hidden layers into subspaces [20], and regularizing the FC layers with tensor normal priors [21]. In the domain of continual learning, progressive networks [22] have also shown promising results for sequential transfer learning, by employing lateral connections to previously learned networks.

In this paper, we present a novel soft-parameter knowledge sharing mechanism for connecting task-specific CNNs in a MTL framework, which we refer to as Deep Collaboration. We define connectivity in terms of a collaborative block that uses two non-linear transformations with lateral connections: one aggregates task-specific features into global features, and the other merges the global features back into each task-specific CNN. Our collaborative block is differentiable and can be dropped into any existing CNN architecture as a whole. We evaluated our approach on the problem of facial landmark detection in a MTL framework and obtained better results than other approaches in the literature. We further assess the objectivity of our training framework by randomly varying the contribution of each related task. Finally, we verify that our collaborative block enables knowledge sharing with an ablation study that shows the depth-specific influence of tasks that we know are related.

The content of our paper is organized as follows. In Section 2, we present related works in MTL and facial landmark detection. We elaborate on our approach in Section 3, and present experimental results in Section 4. We finally conclude our paper in Section 5. Our code is available here: [23].

2 Related Work

2.1 Multi-Task Learning

Our proposed Deep Collaboration knowledge sharing mechanism is related to other existing approaches. One is Cross-Stitch (XS) [12], which connects task-specific CNNs by linearly combining their feature maps at certain depths. One drawback of XS is that it can capture only linear dependencies between the CNNs. In contrast to XS, our approach uses non-linear transformations in order to capture more complex dependencies.
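
For reference, a cross-stitch unit for two tasks can be sketched as follows. This is our minimal PyTorch rendition of the linear combination described above, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Linearly combines the feature maps of two task-specific CNNs."""
    def __init__(self):
        super().__init__()
        # Initialized near the identity so each task starts mostly independent.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, xA, xB):
        yA = self.alpha[0, 0] * xA + self.alpha[0, 1] * xB
        yB = self.alpha[1, 0] * xA + self.alpha[1, 1] * xB
        return yA, yB
```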

Another related approach is the Tasks-Constrained Deep Convolutional Network (TCDCN) [15]. The authors proposed an early-stopping criterion to remove auxiliary tasks that start to overfit, before they become detrimental to the main task. This approach, however, has several hyper-parameters that must be selected manually. For each task, it has a hyper-parameter controlling the period length of the local window and a threshold that stops the task when the criterion exceeds it. Unlike TCDCN, our approach has no hyper-parameters that need to be tuned to the tasks at hand. Our collaborative block consists of a series of Batch Normalization [24], ReLU [25], and convolutional layers arranged in a standard configuration commonly found in current works.

Our proposed approach is also related to HyperFace [14]. The authors proposed to fuse layers at various depths in order to exploit features with different levels of complexity. Their goal was to allow low-level features with better localization properties to help tasks such as landmark localization and pose detection, and high-level features with better class-specific properties to help tasks like face detection and gender recognition. Although HyperFace belongs to the hard-parameter sharing category and is therefore not entirely comparable to our approach, the idea of feature fusion is also central to our work. Instead of fusing features at intermediate layers of a single CNN, our approach aggregates same-level features of multiple CNNs, independently at different depths.

2.2 Facial Landmark Detection

Facial Landmark Detection (FLD) is an essential component in many face-related tasks [26][27][28][29]. FLD can be described as follows: given the image of a person's face, the goal is to predict the $(x, y)$-position of specific landmarks associated with key features of the visage. Applications such as face recognition [30], face validation [31], and facial feature detection and tracking [32] rely on the ability to correctly locate these distinct facial landmarks. Localizing facial key points like the center of the eyes, the corners of the mouth, the tip of the nose and the earlobes is however a challenging problem, as varying lighting conditions, head poses, facial expressions and occlusions increase the diversity of the face images. In addition to integrating this variability into the estimation process, a FLD model must also take into account a number of correlated factors. For instance, although both an angry person and a sad person have frowned eyebrows, an angry person will have pinched lips while a sad person will have sunken mouth corners [33]. A particularity of datasets geared towards FLD is that they are particularly well-suited for MTL. In addition to the positions of the facial landmarks, these datasets contain a number of other labels that can be used to define auxiliary tasks. Gender recognition, smile recognition, glasses recognition and face orientation are examples of tasks often chosen to evaluate MTL approaches.

3 Deep Collaboration

Given task-specific Convolutional Neural Networks (CNNs), our goal is to connect them with lateral connections in order to allow domain-specific information sharing. We define connectivity in terms of a collaborative block containing two distinct non-linear transformations: one aggregates task-specific features into global features, and the other merges the global features back into each task-specific CNN. Our collaborative block is differentiable and can be dropped into any existing CNN architecture as a whole. For this reason, we make no assumption about the structure of the task-specific CNNs. Our approach can even work with different CNNs but, for the sake of simplicity, we suppose that all task-specific CNNs share the same architecture, which we refer to as the underlying network.

We also decompose the underlying network into a series of blocks. Each block can be as small as a single layer, as large as the whole network itself, or based on simple rules, such as grouping all layers with matching spatial dimensions or grouping fixed numbers of subsequent layers. The arrangement of the layers into blocks does not change the composition of the underlying network; we only use it to make explicit the depths at which we connect the task-specific CNNs.

Since our collaborative block can be inserted at any depth, we also drop the depth index on the feature maps to further simplify the equations. As such, we define the feature map output of a block at a certain depth as $x_t$, where $t \in \{1, \dots, T\}$ is the task index. Our approach takes as input all task-specific feature maps $x_1, \dots, x_T$ and processes them into new feature maps $y_1, \dots, y_T$ as follows:

$$z = \mathcal{H}\big([x_1, \dots, x_T]\big), \qquad y_t = x_t + \mathcal{G}_t\big([z, x_t]\big), \quad t = 1, \dots, T, \tag{1}$$

where $\mathcal{H}$ and $\mathcal{G}_t$ represent the central and the task-specific aggregations respectively, and $[\cdot]$ denotes depth-wise concatenation. We refer to Eq. (1) as our collaborative block. The goal of $\mathcal{H}$ is to combine all task-specific feature maps into a global feature map $z$ representing unified knowledge, while the goal of $\mathcal{G}_t$ is to merge back the global feature map $z$ with each task-specific input $x_t$. The compositional structure of $\mathcal{H}$ and $\mathcal{G}_t$ is as follows:

$$\mathcal{H} = \mathrm{Conv}^{3 \times 3} \circ \mathrm{ReLU} \circ \mathrm{BN} \circ \mathrm{Conv}^{1 \times 1} \circ \mathrm{ReLU} \circ \mathrm{BN} \tag{2}$$

$$\mathcal{G}_t = \mathrm{Conv}^{3 \times 3} \circ \mathrm{ReLU} \circ \mathrm{BN} \circ \mathrm{Conv}^{1 \times 1} \circ \mathrm{ReLU} \circ \mathrm{BN} \tag{3}$$

where $\mathrm{BN}$ stands for Batch Normalization [24], $\mathrm{Conv}^{k \times k}$ for a standard convolutional layer with filters of size $k \times k$, and $\circ$ is the usual function composition. The first convolutional layer in $\mathcal{H}$ divides the number of feature maps by a factor of $T$, while the first convolutional layer in $\mathcal{G}_t$ divides it to match the size of $x_t$. An illustration of our collaborative block is shown in Fig. 1.

Figure 1: Example of our collaborative block applied on the feature maps of two task-specific networks. The input feature maps (shown at ①) are first concatenated depth-wise and transformed into a global feature map (②). The global feature map is then concatenated with each input feature map individually and transformed into task-specific feature maps (③). Each resulting feature map is then added back to the input feature map using a skip connection (④), which gives the final outputs of the block (⑤).
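
The block can be sketched in a few lines of PyTorch. The sketch below follows Eqs. (1)-(3), assuming the pre-activation BN-ReLU-Conv ordering and the channel reductions stated above; all names (CollaborativeBlock, bn_relu_conv) are ours, not from the released code.

```python
import torch
import torch.nn as nn

def bn_relu_conv(in_ch, out_ch, k):
    """BN -> ReLU -> Conv, the pre-activation ordering assumed in Eqs. (2)-(3)."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
    )

class CollaborativeBlock(nn.Module):
    """Collaborative block of Eq. (1) for T tasks with C channels each."""
    def __init__(self, num_tasks, channels):
        super().__init__()
        T, C = num_tasks, channels
        # Central aggregation H: reduces the T*C concatenated maps by a factor T.
        self.central = nn.Sequential(
            bn_relu_conv(T * C, C, k=1),
            bn_relu_conv(C, C, k=3),
        )
        # Task-specific aggregations G_t: reduce [z, x_t] (2C channels) to C.
        self.task_specific = nn.ModuleList(
            nn.Sequential(bn_relu_conv(2 * C, C, k=1), bn_relu_conv(C, C, k=3))
            for _ in range(T)
        )

    def forward(self, xs):
        # xs: list of T task-specific feature maps, each of shape (N, C, H, W).
        z = self.central(torch.cat(xs, dim=1))  # global feature map
        # Identity skip connection: y_t = x_t + G_t([z, x_t]).
        return [x + g(torch.cat([z, x], dim=1))
                for x, g in zip(xs, self.task_specific)]
```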

One particularity of our approach is that we use a skip connection in mapping $\mathcal{G}_t$. Recent works [34][35][36][37][38] have shown that networks with identity skip connections more easily learn proper input-output mappings. Inspired by these works, we opted for an identity skip connection in $\mathcal{G}_t$ in order to more easily learn the proper mapping for integrating domain-specific information from the other tasks. In particular, identity skip connections put an incentive on learning the identity mapping: the network can obtain it by simply pushing all the weights in $\mathcal{G}_t$ towards zero. In our MTL context, the identity mapping can be seen as a way to remove the influence of the global features $z$. This allows the block to take into account the cases where integrating $z$ back into the task-specific features would not help.

Another motivation for using an identity skip connection around the global feature map comes from the fact that depth influences the relevance of each task to the others. Some task-specific CNNs benefit more from sharing their low-level features, while others benefit more from sharing their high-level features. For instance, tasks such as landmark localization and pose detection profit more from low-level features with better localization properties, while tasks such as face detection and gender recognition profit more from class-specific high-level features. Considering that CNNs learn a hierarchy of increasingly abstract features, our collaborative block can take task relevance into account by deactivating a different set of residual mappings depending on the depth at which it is inserted. An example of such specialization is shown in our ablative study in Section 4.4.

Figure 2: Deep Collaboration Network (DCNet) using ResNet18 as underlying network in a MTL setting on the MTFL dataset. The top part shows the block structure of ResNet18 interleaved with our proposed collaborative block, while the bottom part details each residual block and the task-specific FC blocks.

Fig. 2 presents an example of inserting our collaborative block at different depths in a MTL framework on the MTFL dataset [15]. In this particular case, we opted for a ResNet18 as underlying network. We refer to this network as our Deep Collaboration Network (DCNet). As we can see in the top part of the figure, integrating our approach comes down to interleaving the underlying network block structure with our collaborative block. Each collaborative block receives as input the output of each task-specific block, processes them as detailed in Eq. (1), and sends the result back to each task-specific network. Adding our approach to any underlying network can be done by simply following the same pattern of interleaving the network block structure with our collaborative block.
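
In code, this interleaving pattern might look like the following sketch, which reuses the CollaborativeBlock above. DCNet, task_blocks and heads are our hypothetical names; in practice the per-task blocks would come from splitting copies of the underlying network (e.g. a torchvision ResNet18).

```python
import torch.nn as nn

class DCNet(nn.Module):
    """Interleaves task-specific blocks with collaborative blocks.
    task_blocks[d][t] is the block at depth d of the CNN for task t."""
    def __init__(self, task_blocks, collab_blocks, heads):
        super().__init__()
        self.task_blocks = nn.ModuleList(nn.ModuleList(blocks)
                                         for blocks in task_blocks)
        self.collab_blocks = nn.ModuleList(collab_blocks)  # one per depth
        self.heads = nn.ModuleList(heads)  # one task-specific FC head per task

    def forward(self, x):
        xs = [x] * len(self.heads)  # every task-specific CNN sees the image
        for blocks, collab in zip(self.task_blocks, self.collab_blocks):
            xs = [block(xi) for block, xi in zip(blocks, xs)]  # per-task blocks
            xs = collab(xs)  # exchange knowledge via the collaborative block
        return [head(xi) for head, xi in zip(self.heads, xs)]
```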

4 Experiments

In this section, we detail our Multi-Task Learning (MTL) training framework and present our experiments in Facial Landmark Detection (FLD) tasks. We further evaluate the effect of data scarcity on performance and illustrate an example of knowledge sharing between task-specific CNNs with an ablation study.

4.1 Multi-Task Learning Training Framework

The goal of Facial Landmark Detection (FLD) is to predict the $(x, y)$-position of specific landmarks associated with key features of the visage. While the number and type of landmarks are specific to each dataset, examples of standard landmarks to be predicted are the corners of the mouth, the tip of the nose and the centers of the eyes. In addition to the facial landmarks, each dataset further defines a number of related tasks. These related tasks also vary from one dataset to another, and are typically gender recognition, smile recognition, glasses recognition or face orientation.

On a more technical level, we define a learning framework in which we treat each task as a classification problem. While this is straightforward for gender, smile and glasses recognition, as they are already classification tasks, it is a bit trickier for face orientation and FLD. For face orientation, instead of predicting the real-valued roll, yaw and pitch as in a regression problem, we divide each component into 30-degree-wide bins and predict the label of the bin containing the value. Similarly for FLD, rather than predicting the real $(x, y)$-position of each landmark, we divide the image into 1-pixel-wide bins and predict the label of the bin containing the value. Note that we still use the original real values when comparing our predictions with the ground truth, so that our approximation errors are incorporated in the final score.
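
As an illustration, the discretization could be implemented as follows. This is our own sketch: only the bin widths come from the text above, and the bin-center back-conversion is an assumption.

```python
import numpy as np

def landmark_to_bin(coord_px):
    """A landmark coordinate in pixels becomes a 1-pixel-wide class label."""
    return int(np.floor(coord_px))

def angle_to_bin(angle_deg, bin_width=30.0):
    """A roll/yaw/pitch angle in degrees becomes a 30-degree-wide class label."""
    return int(np.floor(angle_deg / bin_width))

def bin_to_landmark(bin_label):
    """Back-conversion for scoring: predictions are compared against the
    original real values, so this approximation error counts in the score."""
    return bin_label + 0.5  # assumed center of the 1-pixel bin
```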

We report our results using the landmark failure rate metric [15], which is defined as follows: we first compute the mean distance between the predicted landmarks and the ground-truth landmarks, then normalize it by the inter-ocular distance, measured between the centers of the eyes. A normalized mean distance greater than 10% is reported as a failure.
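
In code, the metric reads as follows. This is a sketch under the definition above; the 10% threshold and the array layout are our assumptions.

```python
import numpy as np

def landmark_failure_rate(pred, gt, left_eye, right_eye, threshold=0.10):
    """pred, gt: (N, L, 2) landmark arrays; left_eye, right_eye: (N, 2)."""
    mean_dist = np.linalg.norm(pred - gt, axis=2).mean(axis=1)   # (N,)
    inter_ocular = np.linalg.norm(left_eye - right_eye, axis=1)  # (N,)
    # An image fails when its normalized mean distance exceeds the threshold.
    return float(np.mean(mean_dist / inter_ocular > threshold))
```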

4.2 Facial Landmark Detection on the MTFL Task

Figure 3: Landmark failure rates (%) on the MTFL task. The reported values are the average over the last five epochs, further averaged over three tries. The left plot presents our results with AlexNet as the underlying network, while the right one uses ResNet18. AN-S and RN-S stand for single-task training, AN and RN for multi-task training with a single central network, ANx and RNx for multi-task training with a single central network widened to match the number of parameters of our approach, HF for HyperFace, TCDCN for the approach of [15], and XS for Cross-Stitch. In each instance, the left column (blue) is for un-pretrained networks, while the right column (green) is for pre-trained networks. Our proposed approach obtains the lowest failure rates overall.
Figure 4: Example predictions of our DCNet with pre-trained ResNet18 as underlying network on the MTFL (left) and AFLW (right) tasks. For MTFL, the first two examples are successes, while the last two are failure cases. For AFLW, the first three examples are successes, while the last one is a failure case. Elements in green correspond to ground truth, while those in blue correspond to predictions. Facial landmarks are shown as small dots, and related-task labels are displayed on the side. As we can see, over-exposure and a tilted face profile can have a large impact on prediction quality.

As a first experiment, we performed facial landmark detection on the Multi-Task Facial Landmark (MTFL) task [15]. The dataset contains 12,995 face images annotated with five facial landmarks and four related attributes: gender, smiling, wearing glasses and face profile (five profiles in total). The training set has 10,000 images, while the test set has 2,995 images. We perform four sets of experiments using an ImageNet pre-trained AlexNet, an ImageNet pre-trained ResNet18, an un-pretrained AlexNet and an un-pretrained ResNet18 as underlying networks. For AlexNet, we apply our collaborative block after each max pooling layer, while for ResNet18, we proceed as shown in Fig. 2.

We compare our approach to several other approaches from the literature. We include single-task learning (AN-S when using AlexNet as underlying network, RN-S when using ResNet18), hard-parameter sharing MTL (AN and RN), hard-parameter sharing MTL where the central section is widened to match the number of parameters of our approach (ANx and RNx), HyperFace (HF) [14], the Tasks-Constrained Deep Convolutional Network (TCDCN) [15], Cross-Stitch (XS) [12] and XS widened to match the number of parameters of our approach (XSx). Except for TCDCN, we train each network ourselves three times for 300 epochs, and report landmark failure rates averaged over the last five epochs, further averaged over the three tries.

Fig. 3 presents our FLD results on the MTFL dataset. The left part of the figure corresponds to using AlexNet as underlying network, while the right one corresponds to ResNet18. The top part reports the landmark failure rates, while the bottom part reports the mean error. In each plot, the left bar (blue) is for un-pretrained networks, while the right bar (green) is for ImageNet pre-trained networks. In addition, Fig. 4 shows example predictions from DCNet with pre-trained ResNet18 as underlying network. The first two examples were reported as successes, while the last two as failures. The ground-truth elements are colored in green, while our predictions are colored in blue. We also include the labels of the related tasks: gender, smiling, wearing glasses and face profile.

The results of Fig. 3 show that our proposed approach obtained the lowest failure rates and mean error in each case. Indeed, our DCNet with un-pretrained and pre-trained AlexNet as underlying network obtained 19.67% and 19.96% failure rates respectively, and 14.95% and 13.52% with ResNet18. This is significantly lower than the other approaches to which we compare ourselves. For instance, with AlexNet, HF had 27.75% and 27.32%, XS had 26.41% and 25.65%, TCDCN had 25.00% (Zhang et al. only provided results with a pre-trained AlexNet [15]), and XSx had 25.23%. With ResNet18, XS had 18.43% and 15.52% respectively, and XSx had 17.28%. We obtained the highest improvements over XS when using AlexNet as the underlying network: with un-pretrained and pre-trained AlexNet, we obtained improvements of 6.74% and 5.69%, while we obtained 3.48% and 2.00% with ResNet18. Performing MTL with our approach can thus improve performance over other approaches in the literature.

Another result that we can see from Fig. 3 is that our soft-parameter sharing approach outperforms the hard-parameter sharing approaches with a matching number of parameters. For instance, increasing the number of parameters of the hard-parameter sharing AlexNet lowers its failure rate from 28.02% (AN) to 26.88% (ANx), but our approach lowers it further to 19.67%. Similarly, increasing the number of parameters of the hard-parameter sharing ResNet18 lowers its failure rate from 20.05% (RN) to 16.75% (RNx), but our approach lowers it further to 14.95%. These results are interesting because they show that, while increasing the number of parameters is an effortless avenue for improving performance, it has limitations. Developing novel approaches that enhance network connectivity in a soft-parameter sharing setting appears more rewarding. This may help motivate new efforts in this direction to further leverage the domain-specific information of related tasks.

4.3 Effect of Data Scarcity on the AFLW Task

Train / Test Ratio   RN-S    RN      XS      Ours
0.1 / 0.9            57.39   58.00   73.06   60.64
0.3 / 0.7            31.84   32.00   36.24   29.73
0.5 / 0.5            23.41   23.31   26.02   20.77
0.7 / 0.3            21.47   21.92   22.37   18.50
0.9 / 0.1            13.03   12.80   13.51   10.82
Table 1: Landmark failure rate results on the AFLW dataset using a pre-trained ResNet18 as underlying network. The presented values are averaged over the last five epochs, further averaged over three tries. The first column is the train / test ratio, and the subsequent ones are the networks: single-task ResNet18 (RN-S), multi-task ResNet18 (RN), Cross-Stitch network (XS) and our DCNet (Ours). Our approach obtains the best performance in all cases, except the first one, where we observe over-fitting.

As a second experiment, we evaluated the influence of the number of training examples, simulating data scarcity, on the Annotated Facial Landmarks in the Wild (AFLW) task [39]. The dataset has 21,123 Flickr images, and each image can contain more than one face. Instead of using the images as provided, we process them using the available face bounding boxes. We extract all faces with visible landmarks, which gives a total of 2,111 images. This dataset defines 21 facial landmarks and has three related tasks (gender, wearing glasses and face orientation). For face orientation, we divide the roll, yaw and pitch into 30-degree-wide bins (14 bins in total), and predict the label corresponding to each real value.

Our experiment works as follows. With a pre-trained ResNet18 as underlying network, we compare our approach to single-task ResNet18 (RN-S), multi-task ResNet18 (RN) and the Cross-Stitch network (XS) by training on a varying number of images. We use five different train / test ratios, from 0.1 / 0.9 up to 0.9 / 0.1 in 0.2 increments. In other words, we train each approach on the first portion of the available images and test on the remainder, then repeat for all the other train / test ratios. We use the same training framework as in Section 4.2. We train each network three times for 300 epochs, and report the landmark failure rate averaged over the last five epochs, further averaged over the three tries. Example predictions are shown in Fig. 4.

As we can see in Table 1, our approach obtained the best performance in all cases except the first one. Indeed, we observe between 1.98% and 6.51% improvement with train / test ratios from 0.3 / 0.7 to 0.9 / 0.1, while we obtain a negative relative change of 3.25% with a train / test ratio of 0.1 / 0.9. However, since all multi-task approaches obtained higher failure rates than the single-task approach in that case, the networks appear to be over-fitting the small training set. Nonetheless, these results show that we can obtain better performance using our approach.

One particularity that we observe in Table 1 is that the XS network has relatively high failure rates. In the previous experiment of Section 4.2, XS had either similar or better performance than the other approaches (except ours). This could be due to our multi-task learning framework being unfavorable towards XS. In order to investigate whether this is the case, we performed the following additional experiment. Using a pre-trained ResNet18 as underlying network, we compared our approach to XS by training each network 100 times with task weights randomly sampled from a log-uniform distribution: we first sample an exponent from a uniform distribution, then use 10 raised to that exponent as the weight. We trained both XS and our approach for 300 epochs with the same task weights, using a train / test ratio of 0.5 / 0.5.
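
A log-uniform task weight can be sampled as below. This is a sketch; the exponent bounds are placeholders, since the range is not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_weight(low_exp=-3.0, high_exp=0.0):
    """Sample an exponent s ~ U(low_exp, high_exp) and return 10**s,
    yielding a log-uniformly distributed task weight."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

weights = [sample_task_weight() for _ in range(3)]  # e.g. one per related task
```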

Figure 5: Landmark failure rate improvement (in %) of our approach compared to XS when sampling random task weights, using a pre-trained ResNet18 as underlying network. The histogram at the left and the plot at the top right represent the performance improvement achieved by our proposed approach (a positive value means lower failure rates), while the plot at the bottom right corresponds to the log of the task weights. Our approach outperformed XS in 86 out of the 100 tries, empirically demonstrating that our learning framework was not unfavorable towards XS and that our approach is less sensitive to the task weights.

Figure 5 presents the results of this experiment. The plot at the top right of the figure shows the landmark failure rate improvement (in %) of our approach compared to XS, while the plot at the bottom right corresponds to the log of the task weights for each try. In 86 out of the 100 tries, our approach had a positive failure rate improvement, that is, it obtained lower failure rates than XS. As we can see in the histogram at the left of Fig. 5, the improvement is approximately normally distributed around 2.80%, with a median of 3.66% and a maximum of 9.83%. Even though we sampled the weights of the related tasks at random, our approach outperforms XS in the majority of cases. Our learning framework was therefore not unfavorable towards XS.

4.4 Illustration of Knowledge Sharing With an Ablation Study

As a third experiment, we perform an ablation study on the MTFL task [15] with an un-pretrained ResNet18 as underlying network. The goal of this experiment is to verify that our collaborative block effectively enables knowledge sharing between task-specific CNNs. To do so, we evaluate the impact on facial landmark detection of removing the contribution of each task-specific feature map: we zero out the designated feature map before the concatenation at the input of the central aggregation $\mathcal{H}$. The network is trained using the same framework as explained in Sec. 4.1, and the ablation study is performed at test time, on the test set, once training is done.
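
Concretely, the ablation can reuse the CollaborativeBlock sketch from Section 3: the designated map is zeroed only where it enters the central aggregation $\mathcal{H}$, while each $\mathcal{G}_t$ and the skip connection still receive the real $x_t$. The helper name below is ours.

```python
import torch

def forward_with_ablation(collab, xs, ablated_task):
    """Forward pass of a collaborative block with one task's contribution
    removed from the central aggregation H (cf. the CollaborativeBlock sketch)."""
    xs_ablated = [torch.zeros_like(x) if t == ablated_task else x
                  for t, x in enumerate(xs)]
    z = collab.central(torch.cat(xs_ablated, dim=1))  # H sees the zeroed map
    # G_t and the skip connection still use the original, un-ablated inputs.
    return [x + g(torch.cat([z, x], dim=1))
            for x, g in zip(xs, collab.task_specific)]
```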

Figure 6: Results of our ablation study on the MTFL dataset with an un-pretrained ResNet18 as underlying network. We remove each task-specific feature map from the central aggregation at each respective depth and evaluate the effect on the landmark failure rate. The rows represent the task-specific CNNs, while the columns correspond to the network block structure. Blocks with a highly saturated color were found to have a large impact on the failure rate. In particular, this ablative study shows that the influence of high-level face profile features is large within our proposed architecture. This corroborates the well-known fact that the location of facial landmarks depends closely on the orientation of the face, and constitutes empirical evidence of domain-specific information sharing via our approach.

Figure 6 presents the results of our ablation study. The rows represent each task-specific CNN, while the columns correspond to the network block structure. The blocks are ordered from left (input) to right (output), while the task-specific networks are ordered from top (main task) to bottom (related tasks). The color saturation indicates the influence of removing the task-specific feature maps from the central aggregation at the corresponding depth. A high saturation reflects high influence on failure rate, while a low saturation reflects low influence.

As a first result, removing features from the facial landmark detection network itself significantly increases the landmark failure rate. For instance, we observe a negative (worse) relative change of 29.72% and 47.00% in failure rate when removing features from Block 3 and Block 2 respectively. This illustrates that the main-task network both contributed to and fed from the global features computed by the central aggregation $\mathcal{H}$. The CNN for landmark detection had the possibility to remove the contribution of the global features, and thus isolate itself from the other CNNs, but the opposite occurred. We actually observe a mutual influence between the CNNs, where the task-specific features from the facial landmark CNN influence the quality of the global features, which in turn influence the quality of the subsequent task-specific features.

Another result that we can see from Fig. 6 is that Block 5 of the Profile task has the highest influence on the failure rate. We observe a negative relative change of 83.87% when removing the feature maps of the Profile task from the central aggregation. What is particularly interesting in this case is that this high relative change occurs at Block 5, the highest block in the network. Since this block lies at the top of the network, it outputs features with a high level of abstraction, and we therefore expect these features to represent high-level factors of variation corresponding to face orientation, akin to a rotation. It makes sense that features representing the orientation of the face would be useful for predicting facial landmarks, since we know that the location of the facial landmarks depends closely on the orientation of the face: the landmark CNN can use these rich features to better rotate the predicted facial landmarks. This is indeed what we observe in Fig. 6. These results constitute empirical evidence that our approach allows leveraging domain-specific information from related tasks.

4.5 Facial Landmark Detection With MTCNN

As a final experiment, we performed an experimental evaluation using the recent Multi-Task Cascaded Convolutional Network (MTCNN) [27]. The authors proposed a cascade structure of three stages, where each stage is composed of a multi-task CNN. MTCNN performs predictions in a coarse-to-fine manner: the CNN of the first stage generates (in a fully-convolutional way) many hypotheses about the position of the face and the facial landmarks, and the subsequent second and third stages refine them. The CNNs are trained sequentially with hard-negative mining, in a hard-parameter sharing setting.

We implemented our approach in the available code project [40] and compared it to MTCNN. We followed the provided hard-negative mining recipe to generate our training images. For landmark detection, we used the LFWNet [26] and CelebA [41] datasets, generating 600k face images with facial landmarks. For face detection, we used the WIDER [42] dataset, generating 1.5M face images with bounding boxes. We trained an MTCNN whose stage networks are connected with our collaborative block, and a standard MTCNN with stage networks widened to match the number of parameters.

On the test set of the MTFL [15] dataset, the standard MTCNN obtained a landmark failure rate of 37.85%, a mean error of 0.0996 and 112 face detection misses, while our approach obtained better performance with 28.97%, 0.0930 and 79 respectively. Note that MTCNN obtains worse performance than our Deep Collaboration Network (DCNet) reported in Fig. 3 because it has far fewer parameters: DCNet has about 85M parameters, while the three stage-CNNs of MTCNN together have about 2M. This is because MTCNN is carefully designed to balance computational speed and landmark detection precision. It can detect many faces in high-resolution images with a low computational burden; an example of its prediction capability is shown in Fig. 7.

Figure 7: MTCNN predictions on the photo of the 2018 Oscar nominees (image resolution of 2983 × 1197). The stage-CNNs are trained using our proposed collaborative block. The coarse-to-fine detection scheme employed by MTCNN allows predicting many faces in high-resolution images with a low computational burden.

5 Conclusion and Future Work

In this paper, we proposed a novel soft-parameter knowledge sharing mechanism based on lateral connections for Multi-Task Learning (MTL). Our proposed approach implements connectivity in terms of a collaborative block, which uses two distinct non-linear transformations: one aggregates task-specific features into global features, and the other merges the global features back into each task-specific Convolutional Neural Network (CNN). Our collaborative block is differentiable and can be dropped into any existing CNN architecture as a whole. Our results on facial landmark detection tasks showed that networks connected with our proposed collaborative block outperform other state-of-the-art approaches, including the recent Cross-Stitch and MTCNN approaches. We verified with an ablation study that our collaborative block effectively enables knowledge sharing between task-specific CNNs: the CNNs incorporated features with varying levels of abstraction from the other CNNs, as shown by the depth-specific influence of tasks that we know are related. These results constitute empirical evidence that our approach allows leveraging domain-specific information from related tasks. Evaluating our proposed approach on other MTL problems could be an interesting avenue for future work. For instance, the recurrent networks used to solve natural language processing problems could benefit from our approach.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation for providing a Tesla Titan X for our experiments through their Hardware Grant Program.

References