Salient object detection (SOD) aims at detecting the most visually conspicuous objects or regions in an image [achanta2009frequency, cheng2014global, liu2021dna]. It has a wide range of computer vision applications such as human-robot interaction [meger2008curious], content-aware image editing [cheng2010repfinder, gao2013visual], object recognition [ren2013region], image thumbnailing [marchesotti2009framework], weakly supervised learning [liu2020leveraging], etc. In the last decade, convolutional neural networks (CNNs) have significantly pushed forward this field. Intuitively, the global contextual information (existing in the top CNN layers) is essential for locating salient objects, while the local fine-grained information (existing in the bottom CNN layers) is helpful in refining object details [cheng2014global, luo2017non, liu2018picanet]. This is why the U-shaped encoder-decoder CNNs have dominated this field [wang2015deep, lee2016deep, wang2016saliency, li2016deep, zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet, li2020depthwise, li2020complementarity, chen2020embedding, qiu2019revisiting, qiu2020simple, wu2019cascaded, liu2019simple, zhao2019egnet, liu2021dna, qiu2021miniseg], where the encoder extracts multi-level deep features from raw images and the decoder integrates the extracted features with skip connections to make image-to-image predictions [wang2015deep, lee2016deep, wang2016saliency, li2016deep, zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet, li2020depthwise, li2020complementarity, chen2020embedding]. The encoder is usually an existing CNN backbone, e.g., VGG [simonyan2015very] or ResNet [he2016deep], while most efforts are put into the design of the decoder [qiu2019revisiting, qiu2020simple, wu2019cascaded, liu2019simple, zhao2019egnet]. Although remarkable progress has been seen in this direction, CNN-based encoders share the intrinsic limitation of extracting features from images in a local manner. In other words, CNNs enlarge the receptive field only gradually as the network depth increases, which limits their ability to model global relationships. The lack of powerful global modeling has been the main bottleneck for CNN-based SOD.
To this end, we note that the recently popular transformer networks [vaswani2017attention, dosovitskiy2021image] provide a new perspective on this problem. Originating from machine translation [vaswani2017attention], the transformer relies entirely on self-attention to directly model global dependencies in sequence data. Viewing image patches as tokens (words) in natural language processing (NLP) applications [dosovitskiy2021image], the transformer can be applied to learn powerful global feature representations for images. In a short period, the transformer has advanced various computer vision tasks such as image classification [dosovitskiy2021image, liu2021transformer, liu2021swin], object detection [carion2020end], instance segmentation [wang2020end], action recognition [kalfaoglu2020late], object tracking [yan2021learning], and pose estimation [mao2021tfpose].
Since the global relationship modeling of the transformer is essential for SOD to locate salient objects in a natural scene, some works have tried to bring the transformer into SOD [liu2021visual, mao2021transformer]. For example, Liu et al. [liu2021visual] proposed a pure transformer network for SOD, where the encoder and decoder are both transformers. Mao et al. [mao2021transformer] adopted Swin Transformer [liu2021swin] as the encoder and designed a simple CNN-based decoder. We note that existing transformer-based SOD methods [liu2021visual, mao2021transformer] rely entirely on the transformer to extract global feature representations by using the transformer as the encoder. However, these methods ignore the effect of local representations, which are also essential for SOD in refining object details [zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet, qiu2019revisiting, qiu2020simple, wu2019cascaded, liu2019simple, zhao2019egnet, li2020depthwise, li2020complementarity, chen2020embedding], as mentioned above. Therefore, existing SOD methods have gone from one extreme to the other, i.e., from the lack of powerful global modeling (CNN-based methods) to the lack of local representation learning (transformer-based methods).
Based on the above observation, how to achieve effective local representations alongside transformer networks is the key to further boosting SOD. To this end, we consider combining the merits of transformers and CNNs, which are adept at global relationship modeling and local representation learning, respectively. In this paper, we propose a new encoder-decoder architecture, namely Asymmetric Bilateral U-Net (ABiU-Net). Its encoder consists of two parts: Transformer Encoder Path (TEncPath) and Hybrid Encoder Path (HEncPath). TEncPath directly uses a transformer network, i.e., PVT [wang2021pyramid], for global relationship modeling, in order to locate salient objects. Following PVT [wang2021pyramid], the output resolutions of its four stages are $1/4$, $1/8$, $1/16$, and $1/32$ with respect to the input image, respectively. Since the transformer entirely relies on self-attention to extract global contextual features, TEncPath lacks the local fine-grained features that are essential for refining object details/boundaries. Hence, we introduce HEncPath by stacking six convolution stages, to enhance the locality of the encoder. The output strides of HEncPath are $1$, $2$, $4$, $8$, $16$, and $32$, sequentially. To fuse the global and local features, each stage of HEncPath takes its input from the preceding convolution stage and, where the resolutions match, from the corresponding transformer stage. In this way, HEncPath introduces locality into feature representations with the guidance of global contexts from TEncPath. Therefore, HEncPath is a hybrid encoding of global long-range dependencies and local representations. Note that HEncPath is lightweight, with a small number of channels and a fast downsampling strategy.
In the decoding phase, we design an asymmetric bilateral decoder containing two simple paths, namely Transformer Decoder Path (TDecPath) and Hybrid Decoder Path (HDecPath), which are utilized to decode feature representations from TEncPath and HEncPath, respectively. TDecPath decodes coarse salient object locations, while HDecPath is expected to further refine object details/boundaries. We adopt the standard channel attention mechanism to enhance the feature representations from TEncPath and HEncPath. The output of TDecPath at each stage is fed into the corresponding stage of HDecPath so that these two decoder paths can communicate and learn complementary information, i.e., coarse locations and fine-grained details, respectively. With the above design, ABiU-Net can make complementary use of global contextual modeling (for locating salient objects) and local representation learning (for refining object details) [luo2017non, cheng2014global, liu2018picanet] by exploring the cooperation of transformers and CNNs, which provides a new perspective for SOD in the transformer era. Extensive experiments on six challenging datasets demonstrate that the proposed ABiU-Net favorably outperforms existing state-of-the-art SOD approaches.
II Related Work
II-A CNN-based Salient Object Detection
In the past few years, CNN-based saliency detectors have dominated SOD and the accuracy has been remarkably boosted due to the multi-level representation capability of CNNs [zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet, li2020depthwise, li2020complementarity, chen2020embedding, qiu2019revisiting, qiu2020simple, wu2019cascaded, liu2019simple, zhao2019egnet, yan2020new, wang2021new]. It is widely accepted that the high-level semantic information extracted from the top CNN layers is beneficial to locating the coarse positions of salient objects, while the low-level information extracted from the bottom layers can refine the details of salient objects. Hence, both the high-level and low-level information are important for accurately segmenting salient objects [liu2016dhsnet, li2016deep, luo2017non, zhang2017learning, zhang2017amulet, wang2017stagewise, liu2018picanet, wang2018detect, hou2019deeply]. Most existing CNN-based SOD methods use pre-trained image classification models, e.g., VGG [simonyan2015very] and ResNet [he2016deep], as encoders and focus on designing effective decoders by aggregating multi-level features [hou2019deeply, liu2019simple, zhao2019pyramid, qiu2019revisiting, zhang2018progressive, feng2019attentive]. Liu et al. [liu2020lightweight, liu2021samnet] also designed new encoders to achieve lightweight SOD. However, as analyzed above, the long-range global contexts extracted from the top layers of CNNs are not enough for accurately locating salient objects. Although many CNN techniques, such as the non-local technique [wang2018non] and the pyramid pooling module [zhao2017pyramid], can be embedded into SOD networks to further explore global contexts, these strategies are still limited in global dependency modeling due to the natural property of CNNs. This has been the main bottleneck for improving the accuracy of SOD.
To solve this problem, we note that recent vision transformers [vaswani2017attention, dosovitskiy2021image] are adept at global relationship modeling, which motivates us to explore transformers for boosting SOD.
II-B Vision Transformer
The transformer was first proposed in NLP for machine translation [vaswani2017attention]. The transformer network alternately stacks multi-head self-attention modules, which estimate the global dependencies between every pair of patches, and multilayer perceptrons, which enhance the features. Recently, researchers have brought the transformer into computer vision and achieved remarkable results. Specifically, Dosovitskiy et al. [dosovitskiy2021image] made the first attempt to apply the transformer to image classification, attaining competitive performance on the ImageNet dataset [russakovsky2015imagenet]. They split an image into a sequence of flattened patches that are fed into transformers. Following [dosovitskiy2021image], lots of studies have emerged in a short period, and much better performance than state-of-the-art CNNs has been achieved [yuan2021tokens, wang2021pyramid, liu2021swin, liu2021transformer, wu2021p2t, han2021transformer]. For SOD, Mao et al. [mao2021transformer] adopted Swin Transformer [liu2021swin] as the encoder and designed a simple CNN-based decoder to predict saliency maps. In addition, Liu et al. [liu2021visual] developed a novel pure transformer-based model for both RGB and RGB-D SOD. Although existing transformer-based works [mao2021transformer, liu2021visual] solve the lack of global contexts in CNN-based SOD, they ignore the local representations that are essential for refining salient object details. In this paper, we work on how to effectively learn both global contexts and local features by exploring the cooperation of transformers and CNNs, so salient objects can be precisely located and segmented.
II-C Encoder-decoder Architectures
For the combination of high-level and low-level CNN features, various encoder-decoder architectures are devised. We summarize these architectures in Fig. 1. Hypercolumn [hariharan2015hypercolumns] simply aggregates features from different levels of the encoder for final predictions, as shown in Fig. 1(a). The aggregated hyper-features are so-called hypercolumns. The typical examples of this architecture for SOD include [li2016deep, wang2017stagewise, zeng2018learning, zhao2019pyramid, su2019selectivity, mao2021transformer]. As shown in Fig. 1(b), U-Net [ronneberger2015u] is actually the U-shaped encoder-decoder architecture. The main idea of U-Net is to supplement a contracting encoder with a symmetric decoder, where the pooling operations in the encoder are replaced with upsampling operations in the decoder. Deep supervision [lee2015deeply] is also imposed to ease the training process. Most SOD methods are based on the U-shaped architecture [zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, qiu2019revisiting, wu2019cascaded, liu2019simple, zhao2019egnet, qiu2020simple, liu2020lightweight, liu2021samnet, liu2021dna, liu2021visual, li2020depthwise, li2020complementarity, chen2020embedding]. As shown in Fig. 1(c), BiSeNet [yu2018bisenet], designed for semantic segmentation, is a two-path network. The blue path is the spatial path that stacks only three convolution layers to obtain the feature map to retain affluent spatial details. The green path is the deep context path. These two paths are fused for final prediction. From Fig. 1(d), we can clearly see the difference between the proposed ABiU-Net and other architectures. Although both BiSeNet and ABiU-Net have a two-path encoder, ABiU-Net enables interaction between the two paths, while BiSeNet does not. 
Moreover, ABiU-Net supplements an asymmetric bilateral decoder for better fusing the information from the two-path encoder, while BiSeNet directly utilizes the features of the encoder for final prediction. Therefore, ABiU-Net is a new encoder-decoder architecture in the transformer era.
This section presents the proposed ABiU-Net for accurate SOD. First, we describe the overall framework of ABiU-Net in Section III-A. Then, we introduce the asymmetric bilateral encoder in Section III-B. At last, we present the asymmetric bilateral decoder in Section III-C.
III-A Overall Framework
Since previous CNN-based SOD methods [wang2015deep, lee2016deep, wang2016saliency, li2016deep, zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet] lack powerful global context modeling and existing transformer-based SOD methods [mao2021transformer, liu2021visual] lack effective local representations, in this paper, we aim at learning both global contexts and local features for locating salient objects and refining object details, respectively. For this goal, we improve the traditional U-Net [ronneberger2015u] to the new ABiU-Net by exploring the cooperation of transformers and CNNs, so ABiU-Net can inherit the merit of transformers for global context modeling and the merit of CNNs for local feature learning. Specifically, ABiU-Net consists of an asymmetric bilateral encoder and an asymmetric bilateral decoder.
As shown in Fig. 2, the asymmetric bilateral encoder of ABiU-Net contains two parts: TEncPath in grey and HEncPath in blue. TEncPath is one well-known transformer network, e.g., PVT [wang2021pyramid], with four stages. The main issue is that the transformer network only focuses on learning long-range dependencies with the sacrifice of local information. This is remedied by another encoder path, i.e., HEncPath. HEncPath stacks six lightweight convolution stages. In particular, the inputs of stages are not only from the preceding stage but also from the corresponding transformer stage that has the same output stride. Hence, HEncPath can achieve the hybrid encoding of global long-range dependencies and local representations.
The asymmetric bilateral decoder also consists of two paths, namely TDecPath and HDecPath, decoding feature representations from TEncPath and HEncPath, respectively. TDecPath has four stages, which can be viewed as a simple top-down generation path to regress the coarse locations of salient objects. HDecPath is a hybrid decoding process with six stages in the top-down view. The input of HDecPath at each stage is not only from HEncPath but also from TDecPath. In this way, two decoder paths can communicate and learn complementary information, i.e., the coarse locations of salient objects decoded in TDecPath would guide the learning of HDecPath to further refine object details. We will introduce the decoder in Section III-C.
The output feature map of HDecPath at the last stage is used for final saliency map prediction. We also impose deep supervision [lee2015deeply] on the other five stages of HDecPath and the last stage of TDecPath. To achieve this, we design a simple Prediction Module (PM), which converts a feature map to a saliency map. PM first adopts two successive convolution layers with batch normalization and nonlinearization to convert the input feature map to a single-channel map. Then, the sigmoid activation function follows to predict the saliency probability map, whose values range from 0 to 1. The proposed ABiU-Net is trained end-to-end using the standard binary cross-entropy (BCE) loss. Suppose the saliency maps of HDecPath are denoted by $P^H_i$ ($i = 1, 2, \dots, 6$) from bottom to top, and the saliency map of TDecPath is denoted by $P^T$. The total training loss can be calculated as

$$L = L_{\text{BCE}}(P^H_1, G) + \lambda \Big( \sum_{i=2}^{6} L_{\text{BCE}}(P^H_i, G) + L_{\text{BCE}}(P^T, G) \Big), \tag{1}$$

where $G$ represents the ground-truth saliency map. $\lambda$ is a weighting scalar for loss balance. In this paper, we empirically set $\lambda$ to 0.4, as suggested by [zhao2017pyramid, liu2021dna, liu2020lightweight, liu2021samnet, qiu2019revisiting, qiu2020simple]. During testing, $P^H_1$ is viewed as the final output saliency map.
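The deep-supervision loss described above can be illustrated with a minimal pure-Python sketch. It assumes that all saliency maps have already been resized to the ground-truth resolution and flattened into lists, and that the TDecPath term shares the same weight of 0.4 as the auxiliary HDecPath terms; the function names `bce` and `total_loss` are hypothetical and not taken from the paper.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two flat lists of values in [0, 1]."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(pred)

def total_loss(hdec_maps, tdec_map, gt, lam=0.4):
    """hdec_maps: six HDecPath saliency maps, ordered bottom to top.

    hdec_maps[0] is the final prediction and gets full weight; the other
    five maps and the TDecPath map are auxiliary and are scaled by lam.
    """
    main = bce(hdec_maps[0], gt)
    aux = sum(bce(m, gt) for m in hdec_maps[1:]) + bce(tdec_map, gt)
    return main + lam * aux
```

With identical predictions on all seven outputs, the total reduces to (1 + 6 × 0.4) times a single BCE term, which is a quick sanity check for the weighting.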
III-B Asymmetric Bilateral Encoder
First of all, we want to clarify that the asymmetry of our bilateral encoder mainly refers to: i) two encoder paths are based on different networks (transformer and CNN); ii) the numbers of stages of these two paths are different (four for TEncPath and six for HEncPath); and iii) the targets that they are responsible for are different (global context modeling and hybrid feature encoding). Now, we describe how we designed it and the reasons behind it.
We use the popular transformer network, PVT [wang2021pyramid], as TEncPath. All stages of PVT share a similar architecture, which consists of a patch embedding layer and several transformer blocks. Specifically, given an input image, PVT splits it into small patches with the size of $4 \times 4$, using a patch embedding layer. Then, the flattened patches are added with a position embedding and fed into transformer blocks. From the second stage, PVT utilizes a patch embedding layer to shrink the feature map by a scale of 2 at the beginning of each stage, followed by the addition with a position embedding and then some transformer blocks. Suppose $E^T_1$, $E^T_2$, $E^T_3$, and $E^T_4$ denote the output feature maps of the four stages of PVT from bottom to top, and they have scales of $1/4$, $1/8$, $1/16$, and $1/32$ with 64, 128, 320, and 512 channels, respectively. Please refer to the original paper [wang2021pyramid] for more details.
For HEncPath, we design a lightweight CNN-based sub-network to introduce local sensitivity. Specifically, we stack six convolution stages whose outputs are $E^H_1$, $E^H_2$, $E^H_3$, $E^H_4$, $E^H_5$, and $E^H_6$ from bottom to top, respectively. Except for the last stage, a max-pooling layer with a stride of 2 is connected after each stage for feature downsampling, leading to output scales of $1$, $1/2$, $1/4$, $1/8$, $1/16$, and $1/32$ for the six stages from bottom to top, respectively. The input image is fed into not only TEncPath but also HEncPath. The output of the first stage of HEncPath is used as the input of its second stage. From the third stage, the input of each stage is the concatenation of the feature map from the preceding stage and the feature map from the corresponding stage of TEncPath with the same resolution. The concatenated feature map is first connected to a convolution for integration. Then, HEncPath processes it through convolution layers with batch normalization and nonlinearization. By feeding the TEncPath features into HEncPath, it is easier for HEncPath to learn complementary local fine-grained features with the guidance of the global features. Since TEncPath provides a high-level semantic abstraction for HEncPath, it is unnecessary to use a very deep or cumbersome CNN for HEncPath. Note that HEncPath also takes the original image as input so as to mine complementary local information from the image.
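For a clear presentation, we can formulate these steps of HEncPath as

$$
\begin{aligned}
E^H_1 &= \mathcal{F}^{n}(I), \\
E^H_2 &= \mathcal{F}^{n}(\mathrm{MaxPool}(E^H_1)), \\
E^H_i &= \mathcal{F}^{n}\big(\mathrm{Conv}(\mathrm{MaxPool}(E^H_{i-1}) \oplus E^T_{i-2})\big), \quad i = 3, 4, 5, 6,
\end{aligned} \tag{2}
$$

where $I$ denotes the input color image. $\mathrm{Conv}(\cdot)$ is the convolution used for integration. $\mathcal{F}^{n}(\cdot)$ represents $n$ successive convolutions, with batch normalization and nonlinearization omitted for simplicity, where $n$ is the number of convolution layers at each stage of HEncPath. $\mathrm{MaxPool}(\cdot)$ is a max pooling layer with a stride of 2. "$\oplus$" represents the concatenation operation along the channel dimension. We keep $n$ small (e.g., $n = 2$) and use 16, 64, 64, 128, 256, and 256 output channels for the six stages, respectively, to make HEncPath a lightweight sub-network.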
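The stage pairing between the two encoder paths can be checked with a small bookkeeping sketch. It assumes the usual PVT output scales of 1/4, 1/8, 1/16, and 1/32, a full-resolution first HEncPath stage, and stride-2 max pooling between consecutive HEncPath stages; the helper name `hencpath_plan` is hypothetical.

```python
from fractions import Fraction

def hencpath_plan(num_stages=6, tenc_scales=(Fraction(1, 4), Fraction(1, 8),
                                             Fraction(1, 16), Fraction(1, 32))):
    """Return (stage, output scale, matching TEncPath stage or None) per stage."""
    plan = []
    scale = Fraction(1)  # stage 1 works on the full-resolution image
    for stage in range(1, num_stages + 1):
        # A TEncPath stage is fused only when its scale equals this stage's scale.
        match = tenc_scales.index(scale) + 1 if scale in tenc_scales else None
        plan.append((stage, scale, match))
        scale /= 2  # stride-2 max pooling feeds the next stage
    return plan

for stage, scale, match in hencpath_plan():
    print(stage, scale, match)
```

Under these assumptions, the first two stages have no transformer counterpart, and stage $i \geq 3$ of HEncPath is paired with stage $i - 2$ of TEncPath.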
With the asymmetric bilateral encoder, we obtain two sets of features with different characteristics. $E^T_i$ ($i = 1, 2, 3, 4$), generated by TEncPath, is based on global long-range dependence modeling, thus containing rich contextual information. On the other hand, $E^H_i$ ($i = 1, 2, \dots, 6$) aims at learning information complementary to global modeling, guided by $E^T_i$. Hence, $E^H_i$ contains rich local features. The transformer features, $E^T_i$, would be useful to locate salient objects with the global view on the image scenes, while the CNN features, $E^H_i$, would be useful to refine object details with the local fine-grained representations. Therefore, the combination of the global features and the local features would lead to accurate SOD.
III-C Asymmetric Bilateral Decoder
Corresponding to the encoder, we design an asymmetric bilateral decoder containing two paths, i.e., TDecPath and HDecPath. TDecPath is utilized to decode feature representations from TEncPath, which can be viewed as a simple top-down generation path. The top stage takes $E^T_4$ as input, and its output is denoted as $D^T_4$. The $i$-th ($i = 1, 2, 3$) stage of TDecPath has two inputs, i.e., $E^T_i$ from TEncPath and $D^T_{i+1}$ from the preceding stage of TDecPath, generating the output $D^T_i$. The main operations of TDecPath are as below. First of all, $E^T_i$ ($i = 4, 3, 2, 1$) are separately fed into a convolution layer with batch normalization and nonlinearization to reduce the number of channels to (128, 64, 32, 16), generating feature maps $\hat{E}^T_i$ ($i = 4, 3, 2, 1$), respectively. This process can be formulated as

$$\hat{E}^T_i = \mathrm{Conv}(E^T_i), \quad i = 1, 2, 3, 4. \tag{3}$$

After that, $\hat{E}^T_i$ ($i = 1, 2, 3$) is concatenated with the output of the preceding stage in TDecPath. Note that $\hat{E}^T_4$ is the only input of the fourth stage of TDecPath, so there is no concatenation operation. Then, we adopt the standard Channel Attention (CA) mechanism to process the concatenated feature map for feature enhancement. As shown in Fig. 2, the CA mechanism is a typical squeeze-excitation attention block [hu2018squeeze] whose description is omitted here. After the CA block, we obtain $D^T_4$, $D^T_3$, $D^T_2$, and $D^T_1$ from top to bottom. Formally, this can be written as

$$
\begin{aligned}
D^T_4 &= \mathrm{CA}(\hat{E}^T_4), \\
D^T_i &= \mathrm{CA}\big(\hat{E}^T_i \oplus \mathrm{Up}(\mathrm{Conv}(D^T_{i+1}))\big), \quad i = 3, 2, 1,
\end{aligned} \tag{4}
$$

where the convolution reduces the number of feature channels of $D^T_{i+1}$ ($i = 1, 2, 3$) to the half, "$\oplus$" is channel-wise concatenation, and $\mathrm{Up}(\cdot)$ is to upsample the feature map by a factor of 2. In this way, TDecPath can decode the coarse locations of salient objects using the global contexts in TEncPath.
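Since the text omits the description of the CA block, the following is a minimal pure-Python sketch of a generic squeeze-excitation block in the spirit of [hu2018squeeze]. It is an illustration only: biases are omitted, the weight matrices `w1` and `w2` are hypothetical inputs, and the paper's actual CA configuration (e.g., its reduction ratio) is not specified here.

```python
import math

def channel_attention(feat, w1, w2):
    """feat: C x H x W nested lists; w1: (C/r) x C and w2: C x (C/r) weights."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gates)]
```

The output has the same shape as the input, so the block can be dropped after any concatenation without changing the channel count.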
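HDecPath is expected to further refine salient object details, guided by the coarse locations decoded by TDecPath. Hence, HDecPath takes two inputs: the features from TDecPath and HEncPath, providing the coarse locations of salient objects and the features about object details, respectively. HDecPath contains six decoder stages, whose outputs are denoted as $D^H_i$ ($i = 1, 2, \dots, 6$). First, we connect convolution layers (with batch normalization and nonlinearization) to the side-outputs of HEncPath and TDecPath, generating $\hat{E}^H_i$ and $\hat{D}^T_j$, respectively. This can be formulated as

$$\hat{E}^H_i = \mathrm{Conv}(E^H_i), \quad i = 1, 2, \dots, 6; \qquad \hat{D}^T_j = \mathrm{Conv}(D^T_j), \quad j = 1, 2, 3, 4, \tag{5}$$

in which $\hat{E}^H_i$ ($i = 6, 5, \dots, 1$) has (256, 128, 64, 32, 32, 8) channels, respectively. $\hat{D}^T_{i-2}$ ($i = 3, 4, 5, 6$) has the same number of channels as $\hat{E}^H_i$. Note that $\hat{E}^H_i$ and $\hat{D}^T_{i-2}$ ($i = 3, 4, 5, 6$) have the same scale.
Then, we feed $\hat{E}^H_6$ and $\hat{D}^T_4$ into a CA block for feature fusion, followed by a convolution to produce the refined feature $D^H_6$. Different from the sixth stage, the $i$-th stage ($i = 3, 4, 5$) has three inputs, i.e., $\hat{E}^H_i$, $\hat{D}^T_{i-2}$, and $D^H_{i+1}$, where $D^H_{i+1}$ should be upsampled by a factor of 2 first. These three inputs are concatenated, and the result is fed into a CA block. After that, a convolution is connected for feature fusion, generating the output $D^H_i$ ($i = 3, 4, 5$). However, for the first and second stages of HDecPath, the operations are the same as TDecPath because there are no side-outputs from TDecPath with the same scales. The inputs of these two stages are $\hat{E}^H_i$ and $D^H_{i+1}$ ($i = 1, 2$), and the outputs are $D^H_i$ ($i = 1, 2$). We formulate the computation process of HDecPath as

$$
\begin{aligned}
D^H_6 &= \mathrm{Conv}\big(\mathrm{CA}(\hat{E}^H_6 \oplus \hat{D}^T_4)\big), \\
D^H_i &= \mathrm{Conv}\big(\mathrm{CA}(\hat{E}^H_i \oplus \hat{D}^T_{i-2} \oplus \mathrm{Up}(D^H_{i+1}))\big), \quad i = 5, 4, 3, \\
D^H_i &= \mathrm{Conv}\big(\mathrm{CA}(\hat{E}^H_i \oplus \mathrm{Up}(D^H_{i+1}))\big), \quad i = 2, 1.
\end{aligned} \tag{6}
$$

With the convolutions in Eq. (6), HDecPath produces the final decoded feature maps $D^H_i$ ($i = 6, 5, \dots, 1$) with (128, 64, 32, 32, 8, 8) channels, respectively.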
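The input routing of HDecPath described above can be summarized in a short sketch. It assumes that stages are numbered 1 to 6 from bottom to top and that TDecPath stage $i - 2$ shares its scale with HDecPath stage $i$ (which follows from the 1/4 to 1/32 scales of the transformer path); the function name `hdecpath_inputs` and the string labels are hypothetical.

```python
def hdecpath_inputs(stage):
    """Inputs of one HDecPath stage (stages numbered 1..6 from bottom to top)."""
    if not 1 <= stage <= 6:
        raise ValueError("HDecPath has six stages")
    inputs = [f"HEncPath side-output {stage}"]
    if stage >= 3:
        # Scales 1/4 .. 1/32 have a TDecPath counterpart two indices lower.
        inputs.append(f"TDecPath side-output {stage - 2}")
    if stage <= 5:
        # Every stage but the top one also receives the upsampled deeper output.
        inputs.append(f"upsampled HDecPath output {stage + 1}")
    return inputs

for s in range(6, 0, -1):
    print(s, hdecpath_inputs(s))
```

The top stage therefore fuses two inputs, stages 3 to 5 fuse three, and the two bottom stages fall back to the two-input pattern used in TDecPath.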
IV-A Experimental Setup
IV-A1 Implementation Details
We adopt the PyTorch framework [paszke2019pytorch] to implement the proposed method. The backbone network, i.e., PVT [wang2021pyramid], is pre-trained on the ImageNet dataset [russakovsky2015imagenet]. The AdamW [loshchilov2019decoupled] optimizer with a weight decay of 1e-4 is used to optimize the network. The learning rate policy is poly, so the current learning rate equals the base one multiplied by $(1 - \frac{iter}{max\_iter})^{power}$, where $iter$ and $max\_iter$ mean the numbers of the current and maximum iterations, respectively. We set the initial learning rate to 5e-5 and $power$ to 0.9. The proposed ABiU-Net is trained for 50 epochs with a batch size of 16. All experiments are conducted on a TITAN Xp GPU.
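As a small illustration of the poly learning rate policy, the sketch below evaluates the schedule with the base learning rate of 5e-5 and the exponent of 0.9 reported above. The function name `poly_lr` and the choice of 1000 total iterations in the usage loop are assumptions made only for this example.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly policy: base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Print the learning rate at a few points of a hypothetical 1000-iteration run.
for it in (0, 250, 500, 750, 1000):
    print(it, poly_lr(5e-5, it, 1000))
```

The rate starts at the base value, decays monotonically, and reaches zero at the final iteration.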
We follow recent studies [wang2018detect, liu2018picanet, wang2017stagewise, liu2021dna, liu2019simple, qiu2019revisiting, zhao2019pyramid, wu2019cascaded, feng2019attentive, zeng2019towards] to train the proposed ABiU-Net on the DUTS training set [wang2017learning]. The DUTS training set is comprised of 10553 images and corresponding high-quality saliency map annotations. To evaluate the performance of various SOD methods, we utilize the DUTS test set [wang2017learning] and five other widely used datasets, including SOD [movahedi2010design], HKU-IS [li2015visual], ECSSD [yan2013hierarchical], DUT-OMRON [yang2013saliency], and THUR15K [cheng2014salientshape]. There are 5019, 300, 4447, 1000, 5168, and 6232 natural complex images in the above six test datasets, respectively.
IV-A3 Evaluation Criteria
This paper evaluates the accuracy of various SOD models using four popular evaluation metrics, including the max $F$-measure score $F_\beta$, mean absolute error (MAE), weighted $F$-measure score $F^w_\beta$ [margolin2014evaluate], and structure-measure $S_\alpha$ [fan2017structure]. Here, the performance on a dataset is the average over all images in this dataset. We introduce these metrics in detail as follows.
Suppose $F_\beta$ denotes the $F$-measure score. Given a threshold, the predicted saliency map can be converted into a binary map that is compared to the ground-truth saliency map for computing the precision and recall values. Varying the threshold values, we can derive a series of precision-recall value pairs. With precision and recall, $F_\beta$ can be formulated as

$$F_\beta = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}, \tag{7}$$

in which $\beta^2$ is set to a typical value of 0.3 to emphasize more on precision, following previous works [wang2015deep, lee2016deep, wang2016saliency, li2016deep, zhang2017amulet, zhang2017learning, luo2017non, wang2017stagewise, liu2018picanet, li2018contour, chen2018reverse, liu2020lightweight, liu2021samnet, qiu2019revisiting, qiu2020simple, wu2019cascaded, liu2019simple, zhao2019egnet, liu2021dna, hou2019deeply, wang2018detect, zhang2018progressive, zhao2019pyramid, feng2019attentive, zeng2019towards, zeng2018learning]. We compute $F_\beta$ scores under various thresholds and report the best one, i.e., the maximum $F_\beta$.
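The threshold sweep behind the max F-measure can be sketched in pure Python as follows. The sketch assumes flattened inputs, 255 evenly spaced thresholds, and a strict "greater than" binarization rule; these choices, along with the function name `max_f_measure`, are illustrative assumptions rather than the exact evaluation protocol used in the experiments.

```python
def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """pred: flat list of saliency scores in [0, 1]; gt: flat list of 0/1 labels."""
    best = 0.0
    positives = sum(gt)
    for k in range(num_thresholds):
        t = k / num_thresholds
        binary = [1 if p > t else 0 for p in pred]
        tp = sum(b * g for b, g in zip(binary, gt))
        if tp == 0:
            continue  # precision and recall are both zero at this threshold
        precision = tp / sum(binary)
        recall = tp / positives
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best
```

The same loop yields the precision-recall pairs needed for a PR curve if the pairs are collected instead of reduced to a maximum.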
The MAE metric measures the absolute error between the predicted saliency map and the ground-truth map. MAE is calculated as

$$\text{MAE}(P, G) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| P(i, j) - G(i, j) \right|, \tag{8}$$

in which $P$ and $G$ denote the predicted and ground-truth saliency maps, respectively. $H$ and $W$ are image height and width, respectively. $P(i, j)$ and $G(i, j)$ are the saliency scores of the predicted and ground-truth saliency maps at the location $(i, j)$, respectively.
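For concreteness, MAE on a single image can be computed as in the following sketch, which assumes both maps are given as nested lists of equal size with values in [0, 1]; the function name `mae` is chosen only for this example.

```python
def mae(pred, gt):
    """pred, gt: H x W nested lists with values in [0, 1]."""
    h, w = len(pred), len(pred[0])
    total = sum(abs(pred[i][j] - gt[i][j]) for i in range(h) for j in range(w))
    return total / (h * w)

# Two of the four pixels are off by 0.5, so the mean absolute error is 0.25.
print(mae([[1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0], [1.0, 0.0]]))
```

The dataset-level score is then the average of this value over all images, as stated above.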
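We continue by introducing the weighted $F$-measure score [margolin2014evaluate], denoted as $F^w_\beta$. It is computed as

$$F^w_\beta = \frac{(1 + \beta^2) \times \text{Precision}^w \times \text{Recall}^w}{\beta^2 \times \text{Precision}^w + \text{Recall}^w}, \tag{9}$$

where $\text{Precision}^w$ and $\text{Recall}^w$ are the weighted precision and weighted recall to amend the flaws in other metrics. $\beta^2$ has the same meaning as that in Eq. (7).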
Considering that the above measures are based on pixel-wise errors and often ignore the structural similarities, structure-measure ($S_\alpha$) [fan2017structure] is proposed to simultaneously evaluate region-aware and object-aware structural similarities. $S_\alpha$ is calculated as

$$S_\alpha = \alpha \times S_o + (1 - \alpha) \times S_r, \tag{10}$$

where $S_o$ and $S_r$ are object-aware and region-aware structural similarities, respectively. The balance parameter $\alpha$ is set to 0.5 by default.
Besides the above numerical evaluation, we also show visual comparisons, i.e., precision vs. recall curves (PR curves). PR curves depict the precision-recall value pairs when varying binarization thresholds, suggesting the trade-off between precision and recall.
Table II (header only): Methods | Publication | #Param (M) | FLOPs (G) | Time (s)
IV-B Performance Comparison
We compare the proposed ABiU-Net to 29 previous state-of-the-art SOD methods, including MDF [li2015visual], LEGS [wang2015deep], ELD [lee2016deep], RFCN [wang2016saliency], DCL [li2016deep], DHS [liu2016dhsnet], NLDF [luo2017non], Amulet [zhang2017amulet], UCF [zhang2017learning], SRM [wang2017stagewise], PiCA [liu2018picanet], BRN [wang2018detect], C2S [li2018contour], RAS [chen2018reverse], DSS [hou2019deeply], PAGE-Net [wang2019salient], AFNet [feng2019attentive], DUCRF [xu2019structured], HRSOD [zeng2019towards], CPD [wu2019cascaded], BASNet [qin2019basnet], PoolNet [liu2019simple], EGNet [zhao2019egnet], F$^3$Net [wei2020f3net], ITSD [zhou2020interactive], MINet [pang2020multi], LDF [wei2020label], GCPANet [chen2020global], and GateNet [zhao2020suppress]. For fair comparisons, the predicted saliency maps are downloaded from the official websites or produced by the released code with default settings. Note that we do not provide the results of MDF [li2015visual] on the HKU-IS [li2015visual] dataset because MDF adopts HKU-IS for training. For the same reason, we do not report the results of DHS [liu2016dhsnet] on the DUT-OMRON [yang2013saliency] dataset.
IV-B1 Quantitative Evaluation
The quantitative results of various methods in terms of $F_\beta$, MAE, $F^w_\beta$, and $S_\alpha$ on six datasets are summarized in Table I. We can observe that the proposed ABiU-Net outperforms other methods by a large margin. Specifically, ABiU-Net attains $F_\beta$ values of 87.9%, 95.1%, 95.9%, 84.3%, 82.0%, and 90.6%, which are 1.2%, 1.6%, 1.7%, 3.8%, 0.5%, and 2.0% higher than the second-best results on SOD, HKU-IS, ECSSD, DUT-OMRON, THUR15K, and DUTS-test datasets, respectively. ABiU-Net also achieves the best MAE, i.e., 0.9%, 0.7%, 1.1%, 1.3%, 0.5%, and 0.9% better than the second-best results on SOD, HKU-IS, ECSSD, DUT-OMRON, THUR15K, and DUTS-test datasets, respectively. Moreover, the $F^w_\beta$ values of ABiU-Net are 2.1%, 3.7%, 3.5%, 5.9%, 3.6%, and 5.6% higher than the second-best results on six datasets, respectively. In terms of the metric $S_\alpha$, ABiU-Net achieves the best performance in all cases again, i.e., 0.6%, 1.4%, 1.2%, 2.4%, 1.5%, and 1.7% higher than the second-best results on six datasets, respectively. Therefore, we can conclude that ABiU-Net has pushed forward the state-of-the-art for SOD significantly.
IV-B2 PR curves
We show the PR curves of ABiU-Net and other baselines on six datasets in Fig. 3. The higher curves mean the better performance that the corresponding methods achieve. We can clearly see that the proposed ABiU-Net consistently outperforms other competitors in all cases.
IV-B3 Complexity Analysis
Table II summarizes the complexity comparison between ABiU-Net and some recent competitive methods in terms of the number of parameters, the number of FLOPs, and runtime. We can observe that the complexity of ABiU-Net is comparable to state-of-the-art methods. In particular, ABiU-Net has the fewest FLOPs.
IV-B4 Qualitative Evaluation
We display qualitative comparisons in Fig. 4 to explicitly show the superiority of ABiU-Net over previous state-of-the-art methods. Fig. 4 includes some representative images to incorporate various difficult circumstances, including complex scenes, large/small objects, thin objects, multiple objects, low-contrast scenes, and confusing backgrounds. Overall, ABiU-Net can generate better saliency maps in various scenarios. Surprisingly, ABiU-Net can even accurately segment salient objects with very complicated thin structures (the second thin sample), which is very challenging for all other methods.
IV-C Ablation Studies
In this section, we conduct extensive ablation studies for a better understanding of the proposed method.
IV-C1 Effect of Component Designs
We first evaluate the effect of the component designs of the proposed ABiU-Net. We start with the simple transformer-based encoder network without a decoder, i.e., TEncPath, and directly upsample the feature map from the last stage of the encoder for the final prediction. The results are shown in the 1st line of Table III. The detection performance is quite poor.
Effect of the Decoder
Then, we add a simple decoder, i.e., TDecPath, to TEncPath, resulting in a transformer-based U-shaped encoder-decoder network. The evaluation results are shown in the 2nd line of Table III. A significant performance boost can be observed, which demonstrates that the encoder-decoder structure is necessary to utilize the low-level features for accurate SOD. The performance is even competitive with recent state-of-the-art SOD methods, implying that the powerful global relationship modeling of the vision transformer [vaswani2017attention, dosovitskiy2021image] is essential for discovering salient objects. The goal of this paper is to further boost the SOD performance upon this strong baseline using the natural properties of the transformer.
Effect of the Asymmetric Bilateral Encoder
Next, we adopt TEncPath and HEncPath to build an asymmetric bilateral encoder. We also adopt HDecPath as the decoder, but the inputs of each stage in HDecPath are the outputs from HEncPath and the preceding decoder stage only, i.e., the output from TDecPath used in ABiU-Net is excluded. The experimental results are displayed in the 3rd line of Table III. We can observe a significant performance improvement. As discussed before, the vision transformer can learn global long-range dependencies for locating salient objects, but it lacks the local fine-grained representations that are essential for refining object details. This experiment validates that our asymmetric bilateral encoder can supplement the transformer with complementary local representations, leading to higher SOD accuracy.
“DS” means deep supervision.
| #Channels of HEncPath | (128, 128, 64, 32, 32, 8) | .879 | .092 | .951 | .022 | .958 | .027 | .839 | .047 | .814 | .062 | .901 | .030 |
|  | (256, 128, 64, 32, 16, 8) | .879 | .093 | .951 | .022 | .959 | .026 | .843 | .045 | .815 | .061 | .904 | .029 |
|  | (512, 256, 128, 64, 32, 16) | .877 | .093 | .951 | .022 | .958 | .027 | .839 | .047 | .815 | .062 | .904 | .030 |
|  | (512, 256, 128, 128, 64, 16) | .881 | .090 | .950 | .022 | .959 | .025 | .836 | .048 | .814 | .062 | .901 | .030 |
| #Channels of TDecPath | (64, 32, 16, 8) | .878 | .093 | .951 | .022 | .957 | .027 | .844 | .045 | .817 | .060 | .907 | .028 |
|  | (128, 32, 32, 8) | .868 | .096 | .951 | .022 | .957 | .028 | .841 | .045 | .818 | .060 | .907 | .028 |
|  | (128, 128, 64, 16) | .876 | .092 | .951 | .022 | .958 | .027 | .842 | .045 | .818 | .060 | .907 | .028 |
|  | (256, 128, 64, 32) | .871 | .097 | .951 | .022 | .959 | .027 | .844 | .045 | .817 | .060 | .907 | .029 |
| #Channels of HDecPath | (128, 64, 32, 16, 16, 8) | .875 | .094 | .950 | .022 | .959 | .026 | .843 | .046 | .816 | .061 | .905 | .029 |
|  | (128, 128, 64, 32, 16, 8) | .880 | .090 | .951 | .022 | .958 | .025 | .840 | .045 | .815 | .062 | .906 | .029 |
|  | (256, 256, 64, 32, 16, 8) | .874 | .095 | .951 | .022 | .958 | .025 | .846 | .045 | .813 | .062 | .904 | .029 |
|  | (512, 256, 128, 64, 32, 16) | .872 | .096 | .950 | .022 | .959 | .026 | .840 | .045 | .815 | .061 | .906 | .029 |
“#Channels” means the number of channels. The default numbers of channels for HEncPath, TDecPath, and HDecPath are (256, 256, 128, 64, 64, 16), (128, 64, 32, 16), and (256, 128, 64, 32, 32, 8) from top to bottom, respectively.
Effect of the Asymmetric Bilateral Decoder
We continue by adding TDecPath to the model in Section IV-C1 to form the asymmetric bilateral decoder. The evaluation results are summarized in the 4th line of Table III. The asymmetric bilateral decoder consistently improves the SOD accuracy. Intuitively, the two decoder paths, i.e., TDecPath and HDecPath, play different roles in the decoding process. TDecPath takes the outputs of TEncPath as inputs to decode the coarse locations of salient objects, while HDecPath takes the outputs of TDecPath and HEncPath to decode the fine details of salient objects, so that TDecPath guides the learning of HDecPath. This experiment verifies that the asymmetric bilateral decoder is helpful in learning complementary information for decoding accurate salient objects.
Effect of the Attention Mechanism
After that, we add the standard channel attention mechanism [hu2018squeeze] to the asymmetric bilateral decoder for feature enhancement. The results are provided in the 5th line of Table III. The attention mechanism consistently improves the SOD accuracy, especially in terms of MAE. This suggests that the attention mechanism benefits SOD by suppressing noisy activations and enhancing the necessary ones, as also shown in previous works [zhang2018progressive, wang2019salient, chen2018reverse, zhao2019pyramid].
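To make the channel attention step concrete, the following is a minimal pure-Python sketch of a squeeze-and-excitation style gate in the spirit of [hu2018squeeze]: global average pooling per channel, a small bottleneck with ReLU, and sigmoid gates that rescale each channel. The function name, the omission of biases, the toy weights, and the way the gate would be wired into the decoder are all assumptions for illustration; the paper does not give these details here.

```python
import math


def channel_attention(features, w1, w2):
    """Rescale each channel by a gate computed from global channel statistics.

    features: list of C channels, each an H x W list of floats.
    w1: (C // r) x C weight matrix of the reduction layer.
    w2: C x (C // r) weight matrix of the expansion layer.
    Biases are omitted for brevity (an assumption of this sketch).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features
    ]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(features, gates)]
    return scaled, gates


# Hypothetical toy input: 2 channels of size 2x2 and hand-picked weights.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[-1.0, 1.0]]    # 2 channels -> 1 hidden unit
w2 = [[2.0], [-2.0]]  # 1 hidden unit -> 2 channel gates
scaled, gates = channel_attention(features, w1, w2)
print([round(g, 3) for g in gates])
```

In this toy case the first channel is emphasized and the second is suppressed, which mirrors the role described above of enhancing useful activations and damping noisy ones.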
Effect of Deep Supervision
Finally, we add deep supervision as in Fig. 2 to obtain the final ABiU-Net. The results are shown in the 6th line of Table III. This default ABiU-Net achieves the best accuracy. Note that the 5th line is simply ABiU-Net without deep supervision. From the 5th line to the 6th line, there is a significant performance improvement, indicating the effectiveness of training with deep supervision.
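Deep supervision is commonly implemented by attaching a loss to each intermediate (side) prediction and summing these losses with the loss on the final prediction. The sketch below illustrates that idea under stated assumptions: binary cross-entropy as the per-output loss, equal weights, and side outputs already resized to the ground-truth resolution and flattened. None of these choices is confirmed by the text above, and all names are hypothetical.

```python
import math


def binary_cross_entropy(prediction, ground_truth, eps=1e-7):
    """Mean binary cross-entropy between two flat lists of values in [0, 1]."""
    total = 0.0
    for p, g in zip(prediction, ground_truth):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(prediction)


def deeply_supervised_loss(side_outputs, ground_truth, weights=None):
    """Weighted sum of per-output losses; equal weights are an assumption."""
    if weights is None:
        weights = [1.0] * len(side_outputs)
    return sum(
        w * binary_cross_entropy(out, ground_truth)
        for w, out in zip(weights, side_outputs)
    )


# Hypothetical flattened predictions from three decoder stages.
ground_truth = [1.0, 0.0, 1.0, 0.0]
side_outputs = [
    [0.6, 0.4, 0.6, 0.4],  # coarse intermediate prediction
    [0.8, 0.2, 0.8, 0.2],  # refined intermediate prediction
    [0.9, 0.1, 0.9, 0.1],  # final prediction
]
loss = deeply_supervised_loss(side_outputs, ground_truth)
print(f"total loss = {loss:.4f}")
```

Because every stage receives its own gradient signal, the intermediate features are pushed toward the target directly rather than only through the final output.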
IV-C2 Impact of Hyper-parameters
To explain how the default hyper-parameters of ABiU-Net are set, we evaluate the performance when varying the hyper-parameters. Since PVT [wang2021pyramid] is used as the TEncPath, we study the numbers of channels of HEncPath, TDecPath, and HDecPath. We try several different settings, and the results are summarized in Table IV. ABiU-Net is quite robust to the parameter settings, as there is only a small performance fluctuation. This property gives ABiU-Net the potential to serve as a base architecture for SOD in the transformer era. Since the default parameter setting achieves slightly better performance, we use this setting by default.
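The size of this fluctuation can be checked with simple arithmetic on the rows listed for Table IV. The sketch below copies the first score column of those twelve rows and computes the spread (maximum minus minimum) within each group of settings; the variable names are hypothetical, and the column is used only as an example without restating which dataset and metric it belongs to.

```python
# First score column of the Table IV rows, grouped by the path being varied.
first_score_column = {
    "HEncPath": [0.879, 0.879, 0.877, 0.881],
    "TDecPath": [0.878, 0.868, 0.876, 0.871],
    "HDecPath": [0.875, 0.880, 0.874, 0.872],
}

# Spread within each group, rounded to the table's three-decimal precision.
spread = {
    path: round(max(scores) - min(scores), 3)
    for path, scores in first_score_column.items()
}

for path, value in spread.items():
    print(f"{path}: spread = {value:.3f}")
```

On this column, the scores move by at most about one point (0.010) when the channel numbers of a path are changed, which is consistent with the robustness claim above.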
This paper focuses on boosting SOD accuracy with the vision transformer [vaswani2017attention, dosovitskiy2021image]. It is widely accepted that the transformer is adept at learning global contextual information that is essential for locating salient objects, while CNNs have a strong ability to learn local fine-grained information that is necessary for refining object details [cheng2014global, luo2017non, liu2018picanet]. Therefore, this paper explores the combination of the transformer and CNN to learn discriminative hybrid features for accurate SOD. To this end, we design the Asymmetric Bilateral U-Net (ABiU-Net), in which both the encoder and the decoder have two paths. Extensive experiments demonstrate that ABiU-Net significantly improves SOD performance compared with state-of-the-art methods. Considering that ABiU-Net is an elegant architecture without carefully designed modules or engineering tricks, it provides a new perspective for SOD in the transformer era.