With the rapid popularization of mobile devices, more and more deep learning based computer vision applications are being ported from computer platforms to mobile platforms. Many low-level computer vision tasks, benefiting from their category-agnostic nature, act as fundamental components on mobile devices. For example, when a user takes a photo with a smartphone, many supporting tasks run in the background to help produce better pictures and provide real-time effect previews. Single-camera smartphones usually apply salient object segmentation to simulate the bokeh effect, which otherwise requires depth information [hou2016deeply]. To help users take pictures with more visually pleasing compositions, edge detection is adopted to obtain structure information [RcfEdgePami2019]. Skeleton extraction plays an important role in supporting gesture-triggered photography and in suggesting more interesting poses to users [zhao2018hifi]. However, due to the limited storage and computing resources of mobile devices, it is inconvenient and inefficient to store a pre-trained model for every application and to perform multiple different tasks sequentially.
One feasible solution is to perform the aforementioned tasks within a single model, but there exist two main challenges. One is how to learn different tasks simultaneously, while the other is how to settle the divergence of feature domains and optimization targets of different tasks. Most previous work [kokkinos2017ubernet, wu2019mutual, liu2019simple, wu2019stacked] addressed the first challenge by observing the characteristics of different tasks and manually designing specialized network structures for each task. They assumed that all the tasks learned jointly are complementary and that some tasks are auxiliary (e.g., utilizing extra edge information to help the salient object detection task produce more accurate segmentations in edge areas). Usually the performance of the auxiliary tasks is sacrificed and ignored. But when facing the second challenge, where the tasks being solved are contrasting, as demonstrated in Fig. 1, directly applying these methods often fails. As shown in the 3rd row of Tab. III, when trained jointly with the other two tasks, the performance of skeleton extraction is badly damaged.
The design criteria of previous work are usually task-oriented and specific, greatly restricting their applicability to other tasks [kokkinos2017ubernet]. From the standpoint of network architecture, although the three tasks differ, all of them require multi-level features, though to varying degrees. Salient object segmentation requires the ability to extract homogeneous regions and hence relies more on high-level features [hou2016deeply]. Edge detection aims at detecting accurate boundaries and hence needs more low-level features to sharpen the coarse edge maps produced by deeper layers [xie2015holistically, maninis2017convolutional]. Skeleton extraction [shen2016object, ke2017srn] prefers a proper combination of low-, mid- and high-level information to detect scale-variant (either thick or thin) skeletons. Thus, a natural question is whether it is possible to design an architecture that can coalesce these three contrasting low-level vision tasks into a unified, end-to-end trainable network with no loss in the performance of each task.
Taking into account the different characteristics of each task, we present a novel, unified framework to settle the above challenges. Specifically, our network comprises a shared backbone and three task branches of identical design, as shown in Fig. 2. To help each task branch select appropriate features at different levels of the backbone automatically, we introduce a dynamic feature integration strategy that is able to choose favored features dynamically in an end-to-end learning manner. This dynamic strategy largely eases the process of architecture building and encourages the backbone to adjust its parameters adaptively for solving multiple problems. Then a task-adaptive attention module is adopted to enforce the interchange of information among different task branches in a separate-gather way. By coupling previously independent branches, we prevent the network from being optimized asymmetrically. Our approach is easy to follow and can be trained end-to-end on a single GPU. Without sacrificing performance, it reaches a speed of 40 FPS when performing the three tasks simultaneously on an image.
To evaluate the performance of the proposed architecture, we compare it with the state-of-the-art methods of the three tasks. Experimental results show that our approach outperforms existing single-purpose methods on multiple widely used benchmarks. Specifically, for salient object segmentation, compared to previous state-of-the-art works, our method has a performance gain of 1.2% in terms of F-measure on average, over six popular datasets. For skeleton extraction, we also improve the state-of-the-art results by 1.9% in terms of F-measure on the SK-LARGE dataset [shen2017deepskeleton]. Furthermore, to let readers better understand the proposed approach, we conduct extensive ablation experiments on different components of the proposed architecture.
To sum up, the contributions of this paper are as follows: (i) We design a dynamic feature integration strategy to explore the feature combinations automatically according to each input and task, and solve three contrasting tasks simultaneously in an end-to-end unified framework, running at 40 FPS; (ii) We compare our multi-task approach with the single-purpose state-of-the-art methods for each task and obtain better performance.
II Related Work
II-A Relevant Binary Tasks
For salient object segmentation, traditional methods are mostly based on hand-crafted features [cheng2015global, huang2017300, liu2011learning, jiang2013salient, perazzi2012saliency]. With the popularity of CNNs, many methods [li2015visual, wang2015deep, zhao2015saliency, lee2016deep, SuperCNN_IJCV2015] started to use CNNs to extract features. Some of them [wangsaliency, liu2016dhsnet, wang2018detect, zhang2018progressive, wang2019iterative, xu2019deepcrf] incorporated the idea of iterative and recurrent learning to refine the predictions. There are also works solving the problem from the aspect of fusing richer features [li2016deep, li2017instance, luo2017non, zhang2017amulet, hou2016deeply, zhang2017learning, zhang2018bi, wu2019cascaded, xu2019structured, zeng2019towards], introducing attention mechanism [chen2018reverse, zhang2018progressive, liu2018picanet, zhao2019pyramid], using multiple stages to learn the prediction in a stage-wise manner [xiao2018deep, wang2017stagewise, wang2018detect], or adding more supervisions to get predictions with sharper edges [li2018contour, liu2019simple, wang2019salient, feng2019attentive, fan2018SOC, zhang2019capsal, qin2019basnet, wu2019mutual, wu2019stacked, su2019selectivity, zhao2019egnet, zhao2019optimizing, liu2019employing]. For edge detection, early works [canny1986computational, marr1980theory, torre1986edge] mostly relied on various gradient operators. Later works [konishi2003statistical, martin2004learning, arbelaez2011contour] further employed manually-designed features. Recently, CNN-based methods solved this problem by using fully-convolutional networks in a patch-wise [ganin2014n, shen2015deepcontour, bertasius2015deepedge, hwang2015pixel] or pixel-wise prediction manner [xie2015holistically, kokkinos2015pushing, yang2016object, liu2016learning, maninis2017convolutional, wang2017deep, RcfEdgePami2019, he2019bi].
For skeleton extraction, earlier methods [yu2004segmentation, jang2001pseudo, majer2004influence] mainly relied on gradient intensity maps of natural images to extract skeletons. Later, learning-based methods [tsogkas2012learning, sironi2014multiscale, levinshtein2013multiscale, widynski2014local] viewed skeleton extraction as a per-pixel classification problem or a super-pixel clustering problem. Recent methods [shen2016object, ke2017srn, zhao2018hifi, wang2019deepflux] designed powerful network structures considering this problem hierarchically. Different from all the above approaches, our approach simultaneously solves the three tasks within a unified framework instead of learning each task with an individual network.
II-B Multi-Task Learning
Multi-task learning (MTL) has a long history in the area of machine learning [caruana1997multitask, doersch2017multi, evgeniou2004regularized, kumar2012learning]. Recently, many CNN-based MTL methods have been proposed, most of which focus on the design of network architectures [misra2016cross, kokkinos2017ubernet, ahn2019deep, strezoski2019many], on loss functions that balance the importance of different tasks [kendall2018multi, chen2018gradnorm], or on both [liu2019end]. Different works also address different task combinations, including: image classification in multiple domains [rebuffi2017learning]; object recognition, localization, and detection [sermanet2013overfeat, ren2015faster, he2017mask]; pose estimation and action recognition [gkioxari2014r, kendall2015posenet, du2019cross]; and semantic classes, surface normals, and depth prediction [eigen2015predicting, teichmann2018multinet, misra2016cross, liu2019end, kendall2018multi, gao2019nddr]. However, the majority of these methods focus on specific related tasks and require datasets that provide several types of annotations simultaneously. Different from the above methods, we aim to incorporate the idea of dynamic feature integration into architecture design. This allows our approach to learn multiple tasks together based on training data from multiple individual datasets. Moreover, unlike the previous methods [kokkinos2017ubernet, wu2019mutual], which fix how features are integrated in the network structure, our approach can adjust the network connections to select features dynamically to facilitate multi-task training.
In this section, instead of attempting to manually design an architecture that might work for all three tasks, we propose to encourage the network to dynamically select features at different levels according to the preferences of each task and the content of each input, as described in Sec. I.
III-A Overall Pipeline
We include three different tasks on multiple individual datasets (i.e., DUTS [wang2017learning] for saliency, BSDS 500 [arbelaez2011contour] and VOC Context [mottaghi2014role] for edge, SK-LARGE [shen2016object] or SYM-PASCAL [ke2017srn] for skeleton) within a unified network which can be trained end-to-end. All the datasets are directly used following the existing single-purpose methods proposed for each task with no extra processing.
The overall pipeline of the proposed framework is illustrated in Fig. 2. We employ the ResNet-50 [He2016] network as the feature extractor. We take the feature maps output by conv_1 as $f_1$, and the outputs of conv2_3, conv3_4, conv4_6, and conv5_3 as $f_2$ to $f_5$, respectively. We set the dilation rates of the convolutional layers in conv5 to 2, as is commonly done in pixel-wise prediction tasks. Moreover, we add a pyramid pooling module (PPM) [zhao2016pyramid] on top of ResNet-50 to capture more global information, as done in [wang2017stagewise, liu2019simple]. Its output is denoted as $f_6$.
Rather than manually fixing the feature integration strategy in the network structure as done in most of the previous single-purpose methods, a series of dynamic feature integration modules (DFIMs) of various output down-sampling rates (orange dashed rounded rectangles in Fig. 2) are arranged to integrate the features extracted from the backbone (i.e., the five ResNet-50 stage outputs and the PPM output) dynamically and separately for the three tasks.
A task-adaptive attention module (TAM) then follows each DFIM to intelligently allocate information across tasks, preventing the network from being optimized in a biased direction. Finally, for each task, the corresponding feature maps output by the TAMs are up-sampled and summed, and then fed to a convolutional layer for the final prediction.
III-B Dynamic Feature Integration
It has been mentioned in many previous multi-task methods [kokkinos2017ubernet, misra2016cross, kendall2018multi, liu2019end] that the features required by different tasks vary greatly. And most of them require multiple kinds of annotations within a single dataset, which is difficult to obtain. Differently, we utilize training data of different tasks from multiple individual datasets, and it is more likely to meet circumstances where features required by different tasks conflict, as demonstrated in Fig. 1. To solve this problem, we propose DFIM, which adjusts the feature integration strategy dynamically according to each task and input during both training and testing. Compared to existing methods that integrate specific levels of features from the backbone based on manual observations of different tasks’ characteristics, DFIM learns the feature integration strategy.
To be specific, we take the set of features extracted from the backbone, $F = \{f_i\}_{i=1}^{N}$, as input for each DFIM. The demanded output down-sampling rate of each DFIM is determined when the network is defined. As illustrated in Fig. 3, we first transform $F$ to the same number of channels (i.e., $C$) and down-sampling rate (i.e., $m$), denoted as $F^{m} = \{f_i^{m}\}_{i=1}^{N}$, through a convolutional layer and bilinear interpolation, respectively. To give DFIM a view of all the features to be selected, we sum the features in $F^{m}$ and follow the sum with a global average pooling (GAP) layer to create a compact, global feature vector with $C$ channels, as done in [hu2018squeeze]. For each task $t$, we use an independent fully connected (FC) layer to map the feature to $N$ channels, and then apply a softmax operator to transform the feature into a probability vector $p^{t,m} \in \mathbb{R}^{N}$ that can be used as an indicator to select features. Different from those who keep dense connections of $p^{t,m}$, we only keep half the connections as $\hat{p}^{t,m}$:

$$\hat{p}_i^{t,m} = \begin{cases} p_i^{t,m}, & \text{if } p_i^{t,m} \ge \operatorname{median}(p^{t,m}), \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

in which $\operatorname{median}(\cdot)$ means taking the median. Thus the output of DFIM with down-sampling rate $m$ for task $t$ can be obtained with

$$d^{t,m} = \sum_{i=1}^{N} \hat{p}_i^{t,m} \cdot f_i^{m}. \tag{2}$$

By arranging a series of DFIMs of various down-sampling rates, we can obtain the dynamically integrated feature maps $\{d^{t,m}\}$, as shown in Fig. 2. Since the feature integration strategies depend only on the input and task type, the network is able to learn integration strategies for each input and task within a broader and more flexible feature combination space in an end-to-end manner.
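The selection step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the names `dfim_forward`, `keep_top_half`, `fc_weight`, and `fc_bias` are hypothetical, each feature map is assumed to be already transformed to a common channel count and resolution, and a probability equal to the median is assumed to be kept.

```python
import math
from statistics import median


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(features, weights):
    """Combine N feature maps, each stored as C channels of flat pixel lists."""
    channels, pixels = len(features[0]), len(features[0][0])
    return [
        [sum(w * f[c][p] for w, f in zip(weights, features)) for p in range(pixels)]
        for c in range(channels)
    ]


def keep_top_half(probs):
    """Zero out every probability below the median (assumption: ties are kept)."""
    med = median(probs)
    return [p if p >= med else 0.0 for p in probs]


def dfim_forward(features, fc_weight, fc_bias):
    """One DFIM for one task: sum, pool, FC, softmax, median gate, weighted sum."""
    fused = weighted_sum(features, [1.0] * len(features))   # plain sum of all levels
    pooled = [sum(ch) / len(ch) for ch in fused]            # global average pooling
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b         # task-specific FC layer
        for row, b in zip(fc_weight, fc_bias)
    ]
    kept = keep_top_half(softmax(logits))
    return weighted_sum(features, kept), kept
```

With six backbone features and distinct probabilities, exactly three selection weights stay non-zero, so each task branch receives a combination of only the levels it currently favors. The kept weights are not renormalized in this sketch, which is also an assumption.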
III-C Task-Adaptive Attention
As we utilize training data from multiple individual datasets, the domain shift problem [caruana1997multitask, pan2009survey] cannot be ignored. Coalescing the information from diverse datasets effectively and efficiently is indispensable for maintaining the overall performance across all tasks. As illustrated in the first row (Single-task) of Fig. 4, the levels of features that different tasks favor vary greatly. If we use the task-specific feature maps generated by DFIMs for the prediction of each task directly, the gradients that some tasks pass to the shared parts of the network may deviate markedly from those of the other tasks, hence deflecting the optimization towards poor local minima and causing under-fitting.
To this end, we propose to give the network the ability to intelligently allocate information for different tasks after the shared features from the backbone are dynamically integrated and tailored for each task. As shown in the top right corner of Fig. 2, the output feature maps from the DFIM with a given down-sampling rate are further rescaled by being forwarded to a TAM. The parameters in TAM are shared across tasks for the exchange of information. Compared to using the outputs for each task directly, the additional modeling of all tasks' relations gives DFIM the ability to adaptively adjust each task's influence on the shared backbone by considering the content of the input and all tasks' characteristics simultaneously. TAM forces the interchange of information across tasks even after the feature maps are separated and tailored for each task. This is quite different from previous methods [kokkinos2017ubernet, liu2019end, wu2019mutual], which kept different tasks' branches independent of each other until the end.
To give a better intuition, we visualize the intermediate feature maps around TAM in Fig. 5. As can be seen, in the 1st row, for salient object segmentation, before TAM (a,e), it is hard to distinguish the dog (child) from the background. The attention maps learned in TAM (b,f) suppress the activation of the background effectively. After TAM (c,g), the dog (child) is highlighted clearly. In the 2nd row, for edge detection, the feature maps after TAM (c,g) have obviously thinner and sharper activations in the areas where edges may be located, compared to the thick and blurry activations before TAM (a,e). A similar phenomenon can also be observed for the skeleton extraction task. As shown in the last row, the skeletons of the dog (child) become stronger and clearer after TAM. These observations support the significant effect of TAM on better allocating the information for different tasks.
IV Experiment Setup
In this section, we describe the experiment setups, including the implementation details of the proposed network, the used datasets, the training procedure, and the evaluation metrics for the three tasks.
We implement the proposed method based on PyTorch (https://pytorch.org). All experiments are carried out on a workstation with an Intel Xeon 12-core CPU (3.6GHz), 64GB RAM, and a single NVIDIA RTX-2080Ti GPU. We use the Adam [kingma2014adam] optimizer with an initial learning rate of 5e-5 and a weight decay of 5e-4. Our network is trained for 12 epochs in total, and the learning rate is divided by 10 after 9 epochs. The parameters of the backbone (i.e., ResNet-50 [He2016]) of our network are initialized with the ImageNet [krizhevsky2012imagenet] pre-trained model, while all other parameters are randomly initialized. Group normalization [wu2018group] is applied after each convolutional layer except for those in the backbone. The optimization configurations for all parameters in our network are identical, except that the parameters of the batch normalization layers of the backbone are frozen during both training and testing.
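The step schedule stated above can be written out explicitly. The snippet below is a small sketch rather than the authors' training script: the names `OPTIMIZER_CONFIG` and `learning_rate` are hypothetical, and it assumes zero-indexed epochs, so that "after 9 epochs" means the drop takes effect from the tenth epoch onward.

```python
# Hyper-parameters quoted in the text; the dictionary layout itself is illustrative.
OPTIMIZER_CONFIG = {
    "optimizer": "Adam",
    "initial_lr": 5e-5,
    "weight_decay": 5e-4,
    "total_epochs": 12,
    "decay_after_epochs": 9,
    "decay_divisor": 10.0,
}


def learning_rate(epoch, config=OPTIMIZER_CONFIG):
    """Learning rate used during a zero-indexed epoch (assumed indexing)."""
    if epoch >= config["decay_after_epochs"]:
        return config["initial_lr"] / config["decay_divisor"]
    return config["initial_lr"]


# One value per epoch: nine epochs at the initial rate, three at the reduced rate.
schedule = [learning_rate(e) for e in range(OPTIMIZER_CONFIG["total_epochs"])]
```

In a PyTorch setup the same behaviour is usually obtained with a multi-step scheduler whose single milestone is epoch 9 and whose decay factor is 0.1.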
Datasets. We use individual datasets for different tasks, and each dataset only has one kind of annotation. The detailed configurations are listed in Tab. I. And all the datasets are directly used following the existing single-purpose methods proposed for each task [zhang2017amulet, RcfEdgePami2019, zhao2018hifi] with no extra pre-processing.
|Task||Training set||# Images||Testing set||# Images|
|Saliency||DUTS-TR [wang2017learning]||10553||ECSSD [yan2013hierarchical], PASCAL-S [li2014secrets], DUT-OMRON [yang2013saliency], SOD [movahedi2010design], HKU-IS [li2015visual], DUTS-TE [wang2017learning]||1000, 850, 5166, 300, 1447, 5019|
|Edge||BSDS500 [arbelaez2011contour] & VOC Context [mottaghi2014role]||300 + 10103||BSDS500 [arbelaez2011contour]||200|
|Skeleton||SK-LARGE [shen2017deepskeleton]||746||SK-LARGE [shen2017deepskeleton]||745|
|Skeleton||SYM-PASCAL [ke2017srn]||648||SYM-PASCAL [ke2017srn]||788|
Training Procedure. To jointly solve three different tasks from three individual datasets in an end-to-end way, for each iteration, we randomly sample an image-groundtruth pair for each of the three tasks. Then, sequentially, each of the three image-groundtruth pairs is forwarded through the network, and the corresponding loss is calculated. Finally, we simply sum the three calculated losses (scalars), back-propagate through the network, and take an optimization step. All the other training procedures are identical to typical single-purpose methods.
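The control flow of one such iteration can be sketched as follows. This is a framework-agnostic illustration, not the authors' code: `train_iteration`, `compute_loss`, and `apply_update` are hypothetical names, with `compute_loss` standing in for a forward pass plus the task loss and `apply_update` standing in for back-propagation and the optimizer step.

```python
import random

TASKS = ("saliency", "edge", "skeleton")


def train_iteration(datasets, compute_loss, apply_update, rng=random):
    """Run one joint iteration: one sample per task, one summed loss, one update.

    datasets     -- dict mapping each task name to a list of (image, groundtruth) pairs
    compute_loss -- callable (task, image, groundtruth) -> scalar loss (hypothetical)
    apply_update -- callable (total_loss) -> None, i.e. backward + optimizer step
    """
    losses = {}
    for task in TASKS:
        # Each task draws from its own dataset, which carries only its own annotation type.
        image, groundtruth = rng.choice(datasets[task])
        losses[task] = compute_loss(task, image, groundtruth)
    total = sum(losses.values())   # unweighted sum of the three scalar losses
    apply_update(total)            # a single optimization step per iteration
    return total, losses
```

Because the three losses are summed before the single update, the gradients of all tasks reach the shared backbone in the same step.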
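Loss Functions. We define the loss functions of the three tasks following most of the previous single-purpose methods. We use the standard binary cross-entropy loss for salient object segmentation [hou2016deeply, zhang2017amulet] and the balanced binary cross-entropy loss [xie2015holistically, RcfEdgePami2019, zhao2018hifi] for edge detection and skeleton extraction. We simply sum the losses of the three tasks as the overall loss. The detailed formulas of the loss functions we used are as follows. Given an image's prediction map $P$ and its corresponding groundtruth map $G$, for all pixels $(i, j)$, we compute the standard binary cross-entropy loss as:

$$L_{bce}(P, G) = -\sum_{i,j} \big[ G_{i,j} \log P_{i,j} + (1 - G_{i,j}) \log (1 - P_{i,j}) \big],$$

and the balanced binary cross-entropy loss as:

$$L_{bal}(P, G) = -\sum_{i,j} \big[ \beta\, G_{i,j} \log P_{i,j} + (1 - \beta)(1 - G_{i,j}) \log (1 - P_{i,j}) \big], \qquad \beta = \frac{|G^{-}|}{|G^{+}| + |G^{-}|}.$$

Salient Object Segmentation: $L_{sal} = L_{bce}(P_{sal}, G_{sal})$.

Edge Detection: $L_{edge} = L_{bal}(P_{edge}, G_{edge})$, in which $G^{+}$ refers to the edge pixels and $G^{-}$ refers to the non-edge pixels.

Skeleton Extraction: $L_{ske} = L_{bal}(P_{ske}, G_{ske})$, in which $G^{+}$ refers to the skeleton pixels and $G^{-}$ refers to the non-skeleton pixels.

The overall loss is $L = L_{sal} + L_{edge} + L_{ske}$.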
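For concreteness, the two losses can be sketched in plain Python on flattened maps. This is a minimal sketch under stated assumptions rather than the authors' implementation: the class weight follows the convention of [xie2015holistically] (the fraction of negative pixels weights the positive term), the per-pixel terms are summed rather than averaged, and the names `bce_loss`, `balanced_bce_loss`, and `EPS` are illustrative.

```python
import math

EPS = 1e-12  # clamps predictions away from 0 and 1 so that log() stays finite


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def bce_loss(pred, target):
    """Standard binary cross-entropy, summed over all pixels."""
    total = 0.0
    for p, g in zip(pred, target):
        p = _clamp(p)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total


def balanced_bce_loss(pred, target):
    """Class-balanced binary cross-entropy for sparse positives (edges, skeletons)."""
    n_pos = sum(1 for g in target if g == 1)
    n_neg = len(target) - n_pos
    beta = n_neg / (n_pos + n_neg)   # rare positives receive the larger weight
    total = 0.0
    for p, g in zip(pred, target):
        p = _clamp(p)
        total -= beta * g * math.log(p) + (1 - beta) * (1 - g) * math.log(1.0 - p)
    return total
```

With one positive pixel among four and every prediction at 0.5, the positive pixel and the three negative pixels contribute equally to the balanced loss, which is the purpose of the re-weighting.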
|Shared: 27.01M (91.34%)||Specific: 2.56M (8.66%)|
Evaluation Criteria. For salient object segmentation, we use the F-measure score ($F_\beta$), mean absolute error (MAE), precision-recall (PR) curves, and S-measure [fan2017structure] for evaluation. The hyper-parameter $\beta^2$ in $F_\beta$ is set to 0.3, as done in previous work, to weight precision more than recall. For edge detection, we use the fixed contour threshold (ODS) and per-image best threshold (OIS) as our measures. Before evaluation, we apply the standard non-maximal suppression (NMS) algorithm to get thinned edges. For skeleton extraction, the skeleton maps are NMS-thinned before evaluation. A series of precision/recall pairs are then obtained by applying different thresholds to the thinned skeleton maps to draw the PR curve. The F-measure score is obtained under the optimal threshold over the whole dataset.
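As a worked illustration of the two scalar saliency metrics, the sketch below computes the F-measure with $\beta^2 = 0.3$ and the MAE on flattened maps. It assumes a single fixed binarization threshold, whereas benchmark protocols typically sweep thresholds, and the function names are illustrative rather than taken from any evaluation toolbox.

```python
def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of a saliency map binarized at one threshold (assumed protocol)."""
    binary = [1 if p >= threshold else 0 for p in pred]
    true_pos = sum(b * g for b, g in zip(binary, gt))
    n_pred, n_gt = sum(binary), sum(gt)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gt if n_gt else 0.0
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between prediction and ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
```

Setting `beta_sq` below 1 makes the score react more strongly to a drop in precision than to an equal drop in recall.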
Tab. III: ablation results on PASCAL-S [li2014secrets], DUT-OMRON [yang2013saliency], DUTS-TE [wang2017learning], BSDS 500 [arbelaez2011contour], and SK-LARGE [shen2017deepskeleton], with rows grouped as Our Method (Single-Task), Our Method (Multi-Task), and Other Methods (Multi-Task).
V Ablation Studies
In this section, we first analyse the composition of parameters of the proposed model. Then we investigate the effectiveness of the proposed DFIM by conducting experiments in both single- and multi-task settings. Finally, we show that TAM leads to better overall convergence and performance.
V-A Composition of Parameters
We list the composition of the parameters of our network in Tab. II. As can be seen, 91.34% of the parameters are shared across tasks, of which the feature extractor (ResNet-50 & PPM) takes up 91.71%. The shared parts of DFIMs and TAMs only bring in 2.25M (8.33%) parameters. Each task owns 0.85M (2.87%) task-specific parameters. This polarized composition of parameters reflects the efficiency of the proposed approach. By taking advantage of the shared features extracted from the backbone and adaptively recombining them, more parameters and storage can be saved. Meanwhile, by handing the design of feature integration strategies to the network itself, less human intervention is required.
V-B Dynamic Feature Integration
Effectiveness of Dynamic Feature Integration. As shown in the 1st row of Tab. III, directly applying our method while only performing a single task obtains results comparable with the state-of-the-art methods on the salient object detection and edge detection tasks. A larger improvement can be observed on the skeleton extraction task (1.7%). This indicates that the proposed DFIM is able to adjust its feature selection strategy according to the characteristics of the target task being solved. Rather than manually engineering a specific network structure for each task, as usually done in previous methods, DFIM requires less human intervention.
When the three tasks are learned jointly (the 5th row of Tab. III), the performance of the salient object segmentation task improves by a clear margin on three datasets in nearly all terms. This is consistent with previous research showing that edge information can help the salient object segmentation task produce more accurate segmentation in edge areas. The performance of the edge detection task also increases, indicating that the edges of salient objects may provide useful guidance signals as well. The performance of the skeleton extraction task drops only slightly.
To obtain a numerical estimate of the difficulty of jointly training the three tasks, we set up a baseline by removing the dynamic feature selecting process in DFIM (the 3rd row in Tab. III, marked as ‘identity’). This means that all the operations after the features extracted from the backbone are summed are removed, as shown in Fig. 3. This is equivalent to replacing the weighted combination in Eqn. (2) with the plain sum of the resampled backbone features.
By comparing the ‘identity’ version with the proposed ‘sparse’ feature selecting version (the 5th row), we can observe clear drops on the tasks of edge detection and skeleton extraction, 0.7% and 3.6%, respectively. These phenomena demonstrate that simply fusing all levels of features damages the detection of edges and skeletons. It is difficult to design network structures manually when the involved tasks have distinct optimization targets and take training samples from different datasets. A similar circumstance occurs in previous works [kokkinos2017ubernet, wu2019mutual], where the performance of some tasks decreases dramatically when solving different tasks jointly, as shown in the last two rows of Tab. III. But with DFIM, by letting the network itself integrate features dynamically, all of the three tasks perform comparably to training each task separately.
Dynamically Learned Integration Strategies. To better understand what feature integration strategies have been learned by our proposed method, we randomly select 100 images from each of the testing sets of DUTS (saliency), BSDS500 (edge) and SK-LARGE (skeleton) to form a 300-image set. By forwarding these images, we average, over all images, the probability values that serve as the indicators for feature selection. We plot the probabilities of each stage of features from the backbone being selected by each DFIM for different tasks in Fig. 4. If we compare the subplots column-wise, the stages of features preferred by different tasks vary greatly. This may explain why a well-performing architecture for one task does not work on the other tasks [RcfEdgePami2019, hou2016deeply, wang2019deepflux]. If we compare the subplots row-wise, the stages of features being selected when each of the three tasks is separately trained in a single-task manner also differ greatly from those when they are jointly trained in a multi-task manner. This may be the reason why each of the three tasks has been well investigated, but little literature has tried to solve them jointly. It is hard to manually design an architecture, as the features from the shared backbone are now simultaneously affected by all the tasks.
Sparse or Dense Connections. In Tab. III, we compare our sparse-connected network with a dense-connected version, which means all the feature maps are kept rather than only half as formulated in Eqn. (1). As shown in the 4th and 5th rows, the dense version performs worse on nearly all three tasks. This may indicate that not every stage of features from the backbone is always helpful [hinton2012improving]. For example, for edge detection, more lower-level feature maps are necessary for precise localization of edge pixels [xie2015holistically, RcfEdgePami2019], while for skeleton extraction, more higher-level information is essential to determine whether a pixel belongs to a skeleton or not [zhao2018hifi, wang2019deepflux].
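The three integration variants compared in Tab. III differ only in the weights applied to the backbone features before they are summed. The sketch below makes that difference explicit on a made-up probability vector; the function name `selection_weights`, the mode strings, and the example values are illustrative assumptions, and a probability equal to the median is assumed to be kept.

```python
from statistics import median


def selection_weights(probs, mode):
    """Per-feature weights for the 'identity', 'dense', and 'sparse' variants."""
    if mode == "identity":
        return [1.0] * len(probs)      # plain sum, no learned selection
    if mode == "dense":
        return list(probs)             # every softmax probability is kept
    if mode == "sparse":
        med = median(probs)            # only the upper half is kept
        return [p if p >= med else 0.0 for p in probs]
    raise ValueError("unknown mode: " + mode)


# Hypothetical indicator over six backbone features (values invented for illustration).
example = [0.05, 0.10, 0.40, 0.25, 0.15, 0.05]
sparse = selection_weights(example, "sparse")
```

For this example the sparse variant keeps the three largest weights and zeroes the other three, so half of the backbone features contribute nothing to the task branch.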
|Down-sampling Rates||Saliency||Edge||Skeleton|
Down-Sampling Rates of DFIMs. As listed in Tab. IV, we conduct ablation experiments on the combinations of down-sampling rates of DFIMs. A wider range of down-sampling rates shows a better equilibrium of overall performances, especially on salient object segmentation and edge detection, which agrees with common sense that richer multi-scale information usually helps.
V-C Task-Adaptive Attention
Effectiveness of TAM. With DFIM, we can jointly train the three tasks under a unified architecture. However, as shown in the 5th row of Tab. III, compared to separate training (the 1st row), the performance of skeleton extraction decreases. As the annotations of the salient object segmentation and edge detection tasks pay more attention to pixels where edges exist, which disagrees with the goal of the skeleton extraction task, the optimization of skeleton extraction can be misguided towards adverse directions. With TAM, the network is able to allocate the information of all tasks from a global view by adaptively adjusting the gradients of each task towards the shared backbone. As can be seen from the 7th row compared to the 5th row in Tab. III, better overall performance is reached. The performance of salient object segmentation and edge detection is slightly better, while the skeleton extraction task improves by 0.7%.
|ECSSD [yan2013hierarchical]||PASCAL-S [li2014secrets]||DUT-OMRON [yang2013saliency]||HKU-IS [li2015visual]||SOD [movahedi2010design]||DUTS-TE [wang2017learning]|
Necessity of Information Exchange. To investigate whether the improvement brought by TAM is due to the introduction of additional parameters, we also conduct an experiment that leaves the parameters of different branches in TAM unshared (the 6th row), so that different task branches are independent of each other after selecting features from the shared backbone. Not sharing the parameters in TAM introduces an extra 1.66M parameters. However, as can be seen from the 6th row in Tab. III, even with more parameters, the overall performance of the unshared version is clearly worse than that of the shared version of TAM (the 7th row). Though the skeleton task performs slightly better, the other two tasks decline greatly. These phenomena show that enforcing the interchange of information across tasks after the branches of each task are separated is helpful to the overall convergence of all tasks, while a simple attention mechanism alone does not work well. This can also be observed from the first two rows of Tab. III: appending TAM when each task is trained separately brings no benefit and even degrades most of the three tasks.
|(a) ECSSD [yan2013hierarchical]||(b) PASCAL-S [li2014secrets]||(c) DUT-OMRON [yang2013saliency]|
|(d) HKU-IS [li2015visual]||(e) SOD [movahedi2010design]||(f) DUTS-TE [wang2017learning]|
VI Comparisons to the State-of-the-Arts
In this section, we compare the proposed method (denoted as DFI for convenience) with state-of-the-art methods on salient object segmentation, edge detection, and skeleton extraction. As very little literature has solved the three tasks jointly before, e.g. UberNet [kokkinos2017ubernet] (CVPR’17) and MLMS [wu2019mutual] (CVPR’19), which solved salient object segmentation jointly with edge detection, we mainly compare with the state-of-the-art single-purpose methods of the three tasks for better illustration. For fair comparisons, for each task, the predicted maps (e.g. saliency maps, edge maps, skeleton maps) of other methods are generated by the original code released by the authors or directly provided by them. All the results are obtained directly from single-model test without relying on any other pre- or post-processing tools except for the NMS process before the evaluation of edge and skeleton maps [xie2015holistically, RcfEdgePami2019, zhao2018hifi]. And for each task, all the predicted maps are evaluated with the same evaluation code.
VI-A Salient Object Segmentation
We exhaustively compare DFI with 16 existing state-of-the-art salient object segmentation methods including DCL [li2016deep], RFCN [wangsaliency], MSR [li2017instance], DSS [hou2016deeply], NLDF [luo2017non], Amulet [zhang2017amulet], PAGR [zhang2018progressive], DGRL [wang2018detect], MLMS [wu2019mutual], JDFPR [xu2019deepcrf], PAGE [wang2019salient], CapSal [zhang2019capsal], CPD [wu2019cascaded], PiCANet [liu2018picanet], AFNet [feng2019attentive], and BASNet [qin2019basnet].
F-measure, MAE and S-measure Scores. Here, we compare DFI with the aforementioned approaches in terms of F-measure, MAE, and S-measure (see Tab. V). As can be seen, compared to the second-best methods on each dataset, DFI outperforms all of them over six datasets with average improvements of 1.2% and 1.0% in terms of F-measure and S-measure, respectively. Especially on the challenging DUTS-TE dataset, improvements of 2.1% and 1.8% in terms of F-measure and S-measure can be observed. Similar patterns can also be observed using the MAE score. Also, when compared to MLMS [wu2019mutual], which learns salient object segmentation and edge detection jointly, DFI has even larger improvements on both tasks, as shown in the 7th and 9th rows of Tab. III. Without TAM, DFI still outperforms MLMS [wu2019mutual] by a large margin (the 3rd and 9th rows). These results demonstrate the effectiveness of the proposed DFIM and TAM.
Figure panels: (a) Image; (b) GT; (c) Ours; (d) CED [wang2017deep]; (e) RCF [RcfEdgePami2019]; (f) HED [xie2015holistically].
PR Curves. In addition to the numerical results, we show the PR curves on the six datasets in Fig. 7. The PR curves of DFI (red solid lines) are comparable to those of previous approaches and even better on some datasets. On the PASCAL-S and DUTS-TE datasets in particular, DFI stands out from all previous approaches. As the recall approaches 1, our precision is much higher than that of other methods, which indicates that our saliency maps contain few false positives.
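The link between high precision at high recall and few false positives can be illustrated with a small sketch of how a PR curve is typically traced: the saliency map is binarized at a sweep of thresholds, and precision and recall are recorded at each one. The function name, the 256-level sweep, and the convention of reporting precision 1.0 when nothing is predicted positive are illustrative assumptions, not details taken from the paper.

```python
def pr_curve(pred, gt, num_thresholds=256):
    """Return a list of (precision, recall) pairs, one per threshold.

    pred: flat list of saliency values in [0, 1].
    gt:   flat list of binary labels (0 or 1).
    """
    actual_pos = sum(gt)
    curve = []
    for k in range(num_thresholds):
        t = k / (num_thresholds - 1)
        tp = sum(1 for p, g in zip(pred, gt) if p >= t and g == 1)
        predicted_pos = sum(1 for p in pred if p >= t)
        # Assumed convention: precision is 1.0 when no pixel is predicted positive.
        precision = tp / predicted_pos if predicted_pos else 1.0
        recall = tp / actual_pos if actual_pos else 0.0
        curve.append((precision, recall))
    return curve


# Toy 4-pixel example (hypothetical values).
curve = pr_curve([0.9, 0.6, 0.4, 0.1], [1, 1, 0, 0])
print(curve[0], curve[-1])
```

At the lowest threshold every pixel is predicted salient, so recall is 1 and precision equals the fraction of truly salient pixels. A method keeps high precision close to that end of the curve only if its background pixels receive very low saliency values.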
Figure and table labels: BSDS 500 [arbelaez2011contour]; (a) SK-LARGE [shen2017deepskeleton]; (b) SYM-PASCAL [ke2017srn].
Visual Comparisons. In Fig. 6, we show visual comparisons with several previous state-of-the-art approaches. In the top row, the salient object is partially occluded, yet DFI is able to segment the entire object without including unrelated areas. As shown in the 2nd row, DFI is also able to segment the salient object with more precise boundaries and details. A similar behavior can be seen in the bottom two rows of Fig. 6, where the salient objects are tiny and irregular or the contrast between foreground and background is low. These results suggest that DFI better distinguishes edge pixels and segments whole objects, which might be an advantage of joint training with the edge detection and skeleton extraction tasks.
Figure panels: (a) GT; (b) Ours; (c) Hi-Fi [zhao2018hifi]; (d) SRN [ke2017srn]; (e) FSDS [shen2016object].
VI-B Edge Detection
We compare DFI with results from 13 existing state-of-the-art edge detection methods, including gPb-owt-ucm [arbelaez2011contour], SE-Var [dollar2015fast], MCG [pont2017multiscale], DeepEdge [bertasius2015deepedge], DeepContour [shen2015deepcontour], HED [xie2015holistically], CEDN [yang2016object], RDS [liu2016learning], COB [maninis2017convolutional], RCF [RcfEdgePami2019], DCNN+sPb [kokkinos2015pushing], CED [wang2017deep] and LPCB [deng2018learning], most of which are CNN-based methods.
Quantitative Analysis. In Tab. VI, we show the quantitative results. DFI achieves an ODS of 0.819 and an OIS of 0.836, which are even better than those of previous works designed specifically for edge detection. Thanks to DFIM and TAM, the information from the other tasks does not harm but actually helps the performance of edge detection, as shown in the 1st and 7th rows of the ‘Edge’ column of Tab. III.
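For readers unfamiliar with the two edge scores, the sketch below illustrates the difference between ODS (one threshold fixed for the whole dataset) and OIS (the best threshold chosen per image). It is a simplified illustration under stated assumptions: the per-image, per-threshold counts of true positives, false positives, and false negatives are taken as given, whereas a real benchmark first matches predicted and ground-truth edge pixels within a distance tolerance. The names `f_score` and `ods_ois` and the toy counts are hypothetical.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ods_ois(counts):
    """counts[i][k] = (tp, fp, fn) for image i at threshold index k."""
    num_thresholds = len(counts[0])

    # ODS: pool counts over all images, then pick one threshold for the dataset.
    ods = 0.0
    for k in range(num_thresholds):
        tp = sum(img[k][0] for img in counts)
        fp = sum(img[k][1] for img in counts)
        fn = sum(img[k][2] for img in counts)
        ods = max(ods, f_score(tp, fp, fn))

    # OIS: pick the best threshold for each image, then pool those counts.
    tp = fp = fn = 0
    for img in counts:
        best = max(img, key=lambda c: f_score(*c))
        tp += best[0]
        fp += best[1]
        fn += best[2]
    return ods, f_score(tp, fp, fn)


# Two images, two thresholds (hypothetical counts).
counts = [
    [(8, 2, 2), (5, 0, 5)],
    [(4, 6, 0), (4, 1, 0)],
]
ods, ois = ods_ois(counts)
print(ods, ois)
```

In this toy example the two images prefer different thresholds, so OIS exceeds ODS, which mirrors the ordering of the two scores reported above.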
PR Curves. The precision-recall curves of our method and some selected methods on the BSDS 500 dataset [arbelaez2011contour] can be found in Fig. 8. One can observe that the PR curve produced by our approach is already better than human performance in certain cases and is comparable to previous methods, especially in precision.
Visual Analysis. In Fig. 9, we show some visual comparisons between DFI and some leading representative methods [wang2017deep, liu2016learning, xie2015holistically]. As can be observed, DFI detects boundaries better than the others. In the last row of Fig. 9, the real boundaries of the wolf are clearly highlighted. Besides, thanks to the dynamic fusion mechanism, the features learned by DFI appear much more powerful than those of [xie2015holistically, liu2016learning], as the areas with no edges are rendered much cleaner. To sum up, beyond the improvements in ODS and OIS, the quality of our results is also much higher visually.
Method labels: PAGE [wang2019salient]; CPD [wu2019cascaded]; DGRL [wang2018detect]; Amulet [zhang2017amulet]; DSS [hou2016deeply]; RCF [RcfEdgePami2019]; CED [wang2017deep]; LPCB [deng2018learning]; Hi-Fi [zhao2018hifi]; DeepFlux [wang2019deepflux].
VI-C Skeleton Extraction
We compare DFI with 9 recent CNN-based methods including MIL [tsogkas2012learning], HED [xie2015holistically], RCF [RcfEdgePami2019], FSDS [shen2016object], LMSDS [shen2017deepskeleton], SRN [ke2017srn], LSN [liu2018linear], Hi-Fi [zhao2018hifi], and DeepFlux [wang2019deepflux] on 2 popular and challenging datasets including SK-LARGE [shen2017deepskeleton] and SYM-PASCAL [ke2017srn]. For fair comparisons, we train two different models using these two datasets separately, as done in the above methods.
Quantitative Analysis. In Tab. VII, we show quantitative comparisons with existing methods. DFI outperforms them by a large margin (1.9 points) on the SK-LARGE dataset [shen2017deepskeleton]. There is also an improvement of 0.9 points on the SYM-PASCAL dataset [ke2017srn].
PR Curves. In Fig. 10, we also show the precision-recall curves of our approach and some selected skeleton extraction methods. On both datasets, our approach outperforms the other methods by a clear margin.
Visual Analysis. In Fig. 11, we show some visual comparisons. Owing to the dynamically performed feature integration strategy, DFI is able to locate the positions of the skeletons more accurately. This is also supported by the fact that our prediction maps are much thinner and stronger than those of other works. Both the quantitative and visual results indicate that DFI provides a better way to combine features from different levels for skeleton extraction, even in a multi-task setting.
VI-D Comparisons of Running Time
As shown in Tab. VIII, we compare the speed of DFI against the other open-source methods evaluated in our paper, covering all three tasks. We report the average speed (FPS) of each method together with the corresponding input size, all tested in the same environment. DFI runs at 57 FPS in single-task mode, which is comparable to other methods while producing better detection results. Moreover, DFI still runs at 40 FPS in multi-task mode, i.e. when predicting all three tasks at the same time.
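As a rough illustration of how an average FPS figure of this kind can be measured, the sketch below times repeated calls to a model after a few warm-up runs. It is not the timing protocol used for Tab. VIII: `measure_fps`, the warm-up and run counts, and `dummy_model` (a stand-in for a forward pass on one image) are all hypothetical. On a GPU, asynchronous execution would additionally require synchronizing the device before reading the timer.

```python
import time


def measure_fps(run_once, num_warmup=5, num_runs=50):
    """Average frames per second of `run_once`, a zero-argument callable."""
    for _ in range(num_warmup):
        run_once()  # warm-up iterations are excluded from the timing
    start = time.perf_counter()
    for _ in range(num_runs):
        run_once()
    elapsed = time.perf_counter() - start
    return num_runs / elapsed


def dummy_model():
    # Hypothetical stand-in for a forward pass on one image.
    return sum(i * i for i in range(10_000))


fps = measure_fps(dummy_model)
print(f"{fps:.1f} FPS")
```

For scale, 57 FPS corresponds to roughly 17.5 ms per image and 40 FPS to 25 ms per image.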
In this paper, we solve three different low-level pixel-wise prediction tasks simultaneously: salient object segmentation, edge detection, and skeleton extraction. We propose a dynamic feature integration module (DFIM) to learn the feature integration strategy for each task dynamically, and a task-adaptive attention module (TAM) to allocate information across tasks for better overall convergence. Experiments on a wide range of datasets show that DFI performs comparably to, and sometimes even better than, the state-of-the-art methods for each task. DFI is also fast: it performs the three pixel-wise prediction tasks simultaneously at 40 FPS.
This research was supported by Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the national youth talent support program, and Tianjin Natural Science Foundation (18ZXZNGX00110).