Salient object detection (SOD) [wang2019sodsurvey][2021-TMM-cmSalGAN][2021-TPAMI-rethinking-Co-SOD][2021-TNNLS-RethinkingRGBD][2021-TPAMI-NoisyLabel][2021-PR-EFNet] aims to find salient areas in an image [2020-CVPR-Scribble][2021-AAAI-RD3D][2021-TPAMI-Uncertainty] or video [2018-TIP-VSOD-FCN][2018-TPAMI-SA-VOS] by using intelligent algorithms that mimic human visual characteristics. It has been used in many image comprehension and video processing tasks, such as photo cropping [2019-TPAMI-PhotoCropping], image editing [2010-ACMTOG-RepFinder], 4D saliency detection [2019-ICCV-LightField], photo composition [2009-ACMTOG-Sketch2Photo], and target tracking [2009-CVPR-tracking-SBDT].
With the development of deep neural networks, various network structures and novel convolution modules have been designed to improve the segmentation effect. The majority of salient object detection networks are based on the U-shaped network [2015-CVPR-FCN] to integrate the features of different depths and scales. The network structure represented by U-net [2015-ICM-Unet] and feature pyramid network (FPN) [2027-CVPR-FPNdetection] has an obvious problem of semantic information dilution. Therefore, transferring rich semantic information to shallow layers without losing location information and destroying details is the focus of current algorithms [2019-TPAMI-DSS][2020-AAAI-F3][2019-ICCV-EGNet][2019-CVPR-PFANet][2019-CVPR-PoolNet][2020-AAAI-GCPANet]. Among them, the global guidance structure (GGS) represented by [2019-CVPR-PoolNet][2020-AAAI-GCPANet][2021-AAAI-SCWS] is widely used. The existing methods have made great contributions to network structure and module optimization. These designs are based on rich experimental attempts and the subjective experience of scholars. Although great progress has been made, two key issues are still worthy of further study: how to regulate different types of features to complete better segmentation from the perspective of the overall need for weight regulation, and how to better balance the ability to organize a broad spatial scene with the ability to scrutinize highly detailed objects.
Most of the existing methods fuse multi-level features directly without considering the contribution ratio of the fused features to the final output. Therefore, for the first issue, we propose a perception-and-regulation (PR) block to optimize the FPN structure from the perspective of global perception and local feature fusion fine-tuning. Global perception helps to provide accurate semantic information and make better weight regulation, and local feature fusion fine-tuning helps to enhance useful information and suppress invalid information. The perception part of the PR block realizes global perception, which adopts the deepest feature with the largest receptive field as the input. The regulation part of the PR block realizes local feature fusion fine-tuning, which adopts a weighted method to optimize the feature fusion process.
For the second issue, [2020-CVPR-COD][2019-CVPR-PFANet][2018-TPAMI-DeepLab] use atrous convolution to expand the perception range. However, an overly large atrous ratio makes the information at independent sampling points discontinuous, which is not conducive to the continuity of spatial information in detail. Therefore, the combination of convolutions with multiple atrous rates is widely adopted [2019-CVPR-PFANet][2018-TPAMI-DeepLab][2020-arxiv-detectRS] to give the network the ability to organize a wide spatial scene and scrutinize highly detailed objects. To better balance these two abilities, we propose the imitating eye observation module (IEO).
A partition search strategy is adopted in IEO to effectively alleviate the negative impact of an overly large atrous ratio. The PR block further balances the different types of features inside IEO, which helps to further improve network performance.
Inspired by squeeze-and-excitation (SE) block [2020-TPAMI-SENet] and the classification network, we design a global attention unit called perception-and-regulation (PR) block to regulate the network automatically and achieve the best feature fusion effect. Our goal is to improve the feature fusion effect by explicitly modeling the interdependencies between features to be fused. To achieve this, we design a mechanism that helps the network to recalibrate the feature fusion process, through which it can learn to use global information to adaptively adjust the weights of the features to be fused. In addition, we design an imitating human eye observation module (IEO) to organize a broad spatial scene and have the ability to scrutinize highly detailed objects. Our contributions are summarized as follows:
We propose a perception-and-regulation (PR) block to help the network understand the global information and assign the feature weights uniformly to realize the spontaneous and adaptive global feature regulation.
An imitating human eye observation module (IEO) is proposed to help the network have the ability to organize a wide space scene and scrutinize highly detailed objects. PR block balances and optimizes these two abilities.
Extensive experiments conducted on 5 SOD datasets demonstrate that the proposed method outperforms 22 state-of-the-art SOD methods in terms of six metrics. In addition, ablation experiments on several modules and networks demonstrate the universality of our PR block.
II Related Work
II-A Salient object detection
Early salient object detection methods are based on hand-crafted features [2009-CVPR-Frequency][2015-TPAMI-GC][2013-CVPR-Submodular][2011-ICCV-Center][2013-CVPR-Discriminativ][2013-CVPR-Hierarchical][2013-CVPR-Graph][2016-TIP-CDST] and intrinsic cues without a deep hierarchical structure. With the development of deep learning, deep features with rich contextual information have brought a great breakthrough to the field of salient object detection. As a landmark algorithm, the fully convolutional network (FCN) [2015-CVPR-FCN] creatively removes the fully-connected layer to predict the semantic label for each pixel. Then the U-shaped structures represented by U-net [2015-ICM-Unet] and feature pyramid network (FPN) [2027-CVPR-FPNdetection] have gradually become the mainstream structure by integrating all levels of features layer by layer. Based on this structure, scholars have explored richer multi-layer feature blending methods to make feature expression more effective. Among them, the global guidance structure (GGS) represented by [2019-TPAMI-DSS][2020-AAAI-GCPANet][2019-CVPR-PoolNet] has gradually become a common method to strengthen semantic information.
Our Perception-and-Regulation (PR) network is based on GGS to regulate the network spontaneously and adaptively. It is worth noting that, in contrast to FCN abandoning the fully-connected (FC) layer, our network mainly relies on the FC layer of the classification network to perceive the size and shape of objects. Then the weights of the features to be fused are evaluated according to the semantic information obtained from the FC layer, as shown in Fig.1.
II-B Semantic information reinforcement
Scholars have proposed various methods to transmit global information of high-level features to the shallow layers to help the network get detail information and accurately locate the objects. For example, Hou et al. [2019-TPAMI-DSS] proposed short connections to help shallower side-output layers get semantic information more directly. Zhao et al. [2019-ICCV-EGNet] used the highest-level features to help the explicitly modeled edge features (shallow layer) filter out useless edge details. Zhao et al. [2019-CVPR-PFANet] designed a spatial attention module where context-aware high-level features are added to help the location information transfer to the shallow layer. Liu et al. [2019-CVPR-PoolNet] introduced a global guidance module (GGM) to explicitly make shallower layers aware of the locations of the objects. The global context flow module in [2020-AAAI-GCPANet] addressed the dilution that occurs during high-level feature transmission, which is similar to GGM.
The global guidance idea of [2020-AAAI-GCPANet][2019-CVPR-PoolNet] can be simplified to the GGS structure in Fig. 2 (a), and this structure can effectively solve the problem that semantic information is diluted in the process of feature transfer of the FPN structure or U-net structure [2015-ICM-Unet]. Although these structural designs are based on the subjective experience and repeated attempts [2019-TPAMI-DSS] of excellent scholars, the rigid feature fusion process does not consider the relationship between the different features to be fused and the highest-level features. The GGS-PR structure of Fig. 3 uses the PR block to further slightly regulate and optimize the network. Wei et al. [2020-AAAI-F3] designed a cross feature module (CFM) to select features with rich semantic information to transfer to the shallower layer and let the features with details enter the next cycle, as shown in Fig. 2 (b). CFM can be understood as strengthening the transmission of semantic information in the FPN structure. Our PR block does the same work, but its way of strengthening semantic features is more flexible and adaptive, as shown in the FPN-PR of Fig. 3.
II-C Attention mechanisms
In the classification task, Hu et al. [2020-TPAMI-SENet] greatly improve the quality of feature representation by establishing the interdependencies between the channels of convolutional features. Correspondingly, Woo et al. [2018-arXiv-BAM][2018-ECCV-CBAM] use the spatial attention map generated from the inter-spatial relationship of features, together with the channel attention map, to help the network learn 'where' is an informative part and 'what' is meaningful on the spatial and channel axes respectively. Some SOD methods adopt attention mechanisms [2020-TIP-RAS][2018-CVPR-PAGRN][2019-CVPR-CPD][2019-CVPR-PFANet]. Wang et al. [2019-CVPR-PAGE] design the pyramid attention module to make the network pay more attention to important regions and multi-scale information. As a special attention mechanism, the gated mechanism is widely used by long short term memory (LSTM) and gated recurrent unit (GRU), which play an important role in SOD algorithms [2019-CVPR-TopDown][2019-PAMI-ASNet][2019-CVPR-CapSal]. Some segmentation algorithms [2017-CVPR-GatedDense][2018-CVPR-BMPM][2020-ECCV-GateNet] use the gated mechanism to adjust the network. CGM of Unet3+ [2020-ICASSP-UNet3+] can be regarded as the extreme case of the gated mechanism, because the weights of controlled features are set to 0 or 1 via argmax (Fig.2(c)), which helps to determine whether the target is an organ or noise.
Inspired by SENet [2020-TPAMI-SENet] and the classification network, we design a PR block for global regulation which can be regarded as a macro global attention mechanism. [2020-TPAMI-SENet] adaptively recalibrates channel-wise features at the micro level, while PR block recalibrates different types of features of the whole network at the macro level. Different from the existing methods of the gated network, all the features to be fused are regulated in our network and the perception part is located in the position with rich semantic information, which helps to analyze the size and shape of the object accurately and uniformly. In addition, inspired by SAC of [2020-arxiv-detectRS] (Fig.2(d)), we use softmax, which is commonly used in classification networks, to add constraints to the regulation part of the PR block.
Some algorithms [2019-CVPR-PFANet][2020-CVPR-COD][2020-arxiv-detectRS][2020-ECCV-GateNet][2018-TPAMI-DeepLab] use the atrous convolution to expand the receptive range to better observe the object. The disadvantage of atrous convolution with a large atrous ratio is that the information given by spatial continuity may be lost (such as edges) and it is not conducive to the segmentation of small objects. We design a special spatial attention mechanism to compensate for this shortcoming by imitating foveal vision and peripheral vision.
III Perception-and-Regulation Block
The Perception-and-Regulation (PR) block is a computational unit with semantic perception and global regulation capability. The PR block serves the feature level weight regulation of our final network PRNet, as shown in the network structure on the left side of Fig.3. Besides, it can also be used in many network structures or modules.
Inspired by SENet [2020-TPAMI-SENet] and the classification network, we design a PR block to make the fusion process of different types of features adaptively adjust according to the overall need of weight regulation. SE block focuses on features and adaptively recalibrates features at the channel level, while our PR block focuses on the entire network and recalibrates the entire network at the feature level. Therefore, PR block can be considered as a macro version of SE block.
In addition, FCN [2015-CVPR-FCN] adapts classification networks (AlexNet [2012-NIPS-AlexNet], VGG net [2015-ICLR-VGG] and GoogLeNet [2015-CVPR-GoogLeNet]) into fully convolutional networks by replacing the fully-connected (FC) layer with convolution layers to achieve semantic segmentation. On the contrary, we make full use of the perception and understanding ability of the FC layer to make adaptive regulation for the whole network. In Sec.3.1, we discuss three perception strategies in detail for the perception part of the PR block. The PR block addresses the first issue of how to regulate different types of features to complete better segmentation from the perspective of the overall need for weight regulation. The PR block has a better regulation effect on the fusion process of features with large differences. The differences of features here refer to different levels of features (FPN, GGS), the features produced by atrous convolution or normal convolution (CFE), and the original features or attention features in the residual structure of the attention module (CBAM).
III-A Perception: Semantic Information Embedding
In the perception part of the PR block, three semantic information embedding methods are designed (Fig.4) according to the idea of SENet for global information embedding. In order to introduce our idea clearly, we annotate the features in FPN-PR in the upper right corner of Fig.3.
Strategy 1 in Fig.4 uses the FC layer to perceive the size and shape of the objects. The output of the FC layer of the classification network is the category of the predicted object, while our output results are the weights of the features to be fused. The weights are adjusted adaptively according to the characteristics of the objects. Features regulated by weights are represented by the color of the corresponding weight. Unlike the feature weighting of the SE block for a single channel (Fig.4), the PR block regulates whole features. The perception part of strategy 1 is set at the location of high-level features (i5) with rich semantic information. Max-pooling is used to reduce the size of the feature. Then the feature after pooling is transformed into a one-dimensional vector by a flatten operation. We use a multi-layer perceptron (MLP) [2018-arXiv-BAM] with one hidden layer to enhance the perception ability of the network. To save parameter overhead, the hidden activation size is kept small. C is the number of features that need to be regulated, and it is eight for the FPN structure in Fig.4. The output layer has one unit per regulated feature. The activation function of the output layer is an element-wise sigmoid instead of ReLU, and the final results are multiplied by 2, so that a weight above 1 enhances the corresponding feature and a weight below 1 suppresses it.
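The perception path of strategy 1 (max-pooling, flatten, an MLP with one hidden layer, and a sigmoid output scaled by 2) can be illustrated with a minimal, framework-free sketch. The function names, the pooling size, the hidden width of 4, and the random initialisation are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def max_pool2d(channel, k):
    """Non-overlapping k x k max pooling over one channel (a list of rows)."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[y + dy][x + dx] for dy in range(k) for dx in range(k))
             for x in range(0, w - k + 1, k)]
            for y in range(0, h - k + 1, k)]


def linear(vec, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def perceive_fc(feature, w1, b1, w2, b2, pool=2):
    """Return one regulation weight in (0, 2) per feature to be fused."""
    pooled = [max_pool2d(ch, pool) for ch in feature]        # shrink the deep feature
    flat = [v for ch in pooled for row in ch for v in row]   # flatten to a vector
    hidden = [max(0.0, h) for h in linear(flat, w1, b1)]     # hidden layer + ReLU
    return [2.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2, b2)]


def random_layer(n_out, n_in, rng):
    """Hypothetical small random initialisation, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
i5 = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(2)]
w1, b1 = random_layer(4, 8, rng)   # 2 channels * 2 * 2 pooled values -> 4 hidden units
w2, b2 = random_layer(8, 4, rng)   # C = 8 regulated features
weights = perceive_fc(i5, w1, b1, w2, b2)
```

Because the sigmoid output is doubled, an untrained layer with zero parameters yields a weight of exactly 1 for every feature, which corresponds to unweighted fusion.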
Due to the dense connection of FC layers, the final weights of strategy 1 have a strong correlation. In order to explain the perception part of the PR block more clearly, we provide two other perception designs, one in the spatial dimension (S) and one in the channel dimension (C). Different from strategy 1, we design multiple independent memory units (MU) for strategies 2 and 3 (Fig.4) to better illustrate the mapping relationship. The function of an MU is to establish the mapping between the shape and size of the input features and the weight of a regulated feature. The number of MUs is determined by the number of features that need to be regulated. In the MU of strategy 2, we use two convolution operations to gradually reduce the channel dimension of the input features to 1, with r as the reduction ratio. Here we refer to the design of the spatial attention map [2018-arXiv-BAM][2018-ECCV-CBAM]. The purpose of adding convolutions here is to obtain the output weight from the input features, and the convolutions can be understood as changing the average gray value of the input features. The weight is equal to the average of the single-channel output feature. We change the activation function of the last convolution to sigmoid to enhance the nonlinearity of the output and multiply the final result by 2. In short, the weight of strategy 2 is obtained by applying the convolutions, the last of which uses the element-wise sigmoid function, followed by global average pooling.
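A single memory unit of strategy 2 can be sketched as follows. The sketch assumes 1x1 convolutions and a ReLU between the two convolutions. Neither choice is stated in the text, and all names are hypothetical.

```python
import math


def conv1x1(feature, weights):
    """1x1 convolution: feature[c][y][x] and weights[o][c] give out[o][y][x]."""
    h, w = len(feature[0]), len(feature[0][0])
    return [[[sum(wo[c] * feature[c][y][x] for c in range(len(feature)))
              for x in range(w)] for y in range(h)] for wo in weights]


def memory_unit_spatial(feature, w_reduce, w_out):
    """Map an input feature to one regulation weight in (0, 2)."""
    mid = conv1x1(feature, w_reduce)                         # C -> C / r channels
    mid = [[[max(0.0, v) for v in row] for row in ch] for ch in mid]  # assumed ReLU
    single = conv1x1(mid, w_out)[0]                          # C / r -> 1 channel
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in single]
    mean = sum(map(sum, gate)) / (len(gate) * len(gate[0]))  # global average pooling
    return 2.0 * mean


# Four input channels reduced with r = 2, then to a single channel.
feature = [[[0.5, 1.0], [1.5, 2.0]] for _ in range(4)]
w_reduce = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]]
w_out = [[0.5, -0.5]]
weight = memory_unit_spatial(feature, w_reduce, w_out)
```

One such unit would be instantiated per regulated feature, so the weights of different features do not share parameters.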
For the MU of strategy 3, we use global average pooling to reduce the input feature to a one-dimensional vector. Then we use a fully-connected layer to reduce the channel dimension. After the sigmoid and average operations, the final weight is obtained. In short, the fully-connected layer uses sigmoid as the activation function of its output layer, and the average is taken over the resulting one-dimensional vector.
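The channel-dimension memory unit of strategy 3 can be sketched in the same spirit. The text does not say whether the factor of 2 used by the other strategies is applied here, so the sketch exposes it as a hypothetical `scale` parameter. The reduced width of 2 is also an assumption.

```python
import math


def global_avg_pool(feature):
    """Average every channel (a 2-D list) down to a single value."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]


def memory_unit_channel(feature, fc_w, fc_b, scale=1.0):
    """Global average pooling -> FC -> sigmoid -> average -> one weight."""
    pooled = global_avg_pool(feature)
    reduced = [sum(w * p for w, p in zip(row, pooled)) + b
               for row, b in zip(fc_w, fc_b)]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in reduced]
    return scale * sum(gates) / len(gates)


feature = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]   # 2 channels, 2x2
fc_w = [[0.1, -0.1], [0.2, 0.0]]                                  # 2 channels -> 2 units
fc_b = [0.0, 0.0]
weight = memory_unit_channel(feature, fc_w, fc_b)
```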
III-B Regulation: Adaptive Recalibration
The accurate location information of high-level features in FPN structure is diluted in the process of multiple fusion [2020-AAAI-GCPANet]. This is because element-wise addition or concatenation operations do not treat the weights of the features to be fused differently. Most of the current algorithms focus on changing the network structure [2020-AAAI-GCPANet][2019-ICCV-EGNet][2019-CVPR-PoolNet][2019-TPAMI-DSS] or module [2020-AAAI-F3][2019-CVPR-PAGE] to enhance semantic information, while the PR network focuses on the regulation of network and can greatly improve the performance of simple network structure. The PR network uses a PR block for global regulation. It builds a bridge between the features to be fused and the semantic information made by the fully-connected layer. The semantic information is expressed in the form of weight.
III-B1 Basic Structural Analysis (FPN-PR and GGS-PR)
Both [2019-CVPR-PoolNet] and [2020-AAAI-GCPANet] use the global guidance method to enhance the semantic information of the shallow features of the network. We add a global guidance structure to the FPN to imitate this process, as shown in Fig.2 (a). The global guidance structure can be considered as a simplified version of [2019-TPAMI-DSS][2019-CVPR-PoolNet][2020-AAAI-GCPANet]. The location features are directly added to the shallow features to enhance the ability of salient object location. However, there is still room for improvement, so we add the PR block to both FPN and GGS structures (Fig.3) for further slight regulation. [2020-AAAI-F3] proposes a CFM module to help the output features with high-level features (Fig.2 (b) blue arrow) as the main components to transfer to the shallow layer. This scheme is not flexible enough because the structure of CFM is fixed. In contrast, the perception part of the PR block helps us to adaptively recalibrate the weight of each feature in the whole network according to the characteristics of the object to achieve better segmentation.
Fig.3 shows the realization of perception and regulation of the PR block. In order to simplify the network and reduce the amount of computation, we use a convolution block (convolution, batch normalization, ReLU) to unify the output features of the encoder to 64 channels. FPN-PR and GGS-PR on the right side of Fig.3 do not show this detail. Taking FPN-PR with perception strategy 2 as an example, eight memory units are used to perceive the input feature and evaluate the weights of interlayer features (i1, i2, i3, i4, i5) and decoder features (d1, d2, d3, d4, d5). The gray dotted arrow represents the input feature of the perception part. The gray solid arrows represent the regulation of the eight output weights of the PR block on each feature to be fused. The only difference from the traditional FPN structure is that the PR block weights these features. The features (g1, g2, g3) in the purple region G are global guidance features. It is worth noting that we only weight the three feature fusion processes of GGS to explore the influence of the PR block on the global guidance. In short, in FPN-PR and GGS-PR each feature to be fused is multiplied by its weight from the PR block, deeper features are upsampled to the resolution of the shallower ones, and the fused result passes through a convolution operation.
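The regulated fusion of one decoder stage can be sketched on a single channel as below. Nearest-neighbour 2x upsampling and the omission of the trailing convolution are simplifying assumptions, and the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map (assumed interpolation)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def regulated_fusion(lateral, deeper, w_lateral, w_deeper):
    """Weighted sum of a lateral feature and the upsampled deeper feature."""
    up = upsample2x(deeper)
    return [[w_lateral * a + w_deeper * b for a, b in zip(ra, rb)]
            for ra, rb in zip(lateral, up)]


lateral = [[1.0, 2.0], [3.0, 4.0]]
deeper = [[10.0]]
plain = regulated_fusion(lateral, deeper, 1.0, 1.0)      # ordinary FPN addition
detail_only = regulated_fusion(lateral, deeper, 2.0, 0.0)
```

With both weights equal to 1 the expression reduces to the element-wise addition of a plain FPN, so the PR weights act as a correction on top of the unweighted structure.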
In order to make the perception part of PR block have a better perception effect on different scale objects, we adopt the partial encoder design of SSD algorithm [2016-ECCV-SSD], as shown in the left part of Fig.3. The advantage of SSD is that it can detect objects on multiple scales. We use its 6,7,8 layer structure and combine 4,7,8 layer features with concatenation to realize multi-layer perception.
III-B2 Exemplars (CFE-PR, CBAM-PR and EGNet-PR)
In order to verify the universality of PR block, we apply it to context-aware feature extraction module (CFE) in Fig.2 (e) [2019-CVPR-PFANet] and CBAM module in Fig.2 (f) [2018-ECCV-CBAM]. The features to be fused in FPN and GGS structures are features of different scales and depths, while the features to be fused in CFE-PR modules have different receptive fields.
Different from [2019-CVPR-PFANet], all convolutions in the CFE of this paper are 3x3, and their dilation rates are 1, 3, 5, and 7 respectively. Following [2019-CVPR-PFANet], CFE is used at the i3, i4 and i5 positions of the basic FPN structure to enhance the output feature, and the PR block is used for internal regulation, as shown in Fig.2 (e).
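To make the effect of these dilation rates concrete, the spatial extent covered by a dilated kernel follows the standard relation k_eff = k + (k - 1)(r - 1). The short sketch below evaluates it for the four 3x3 branches. The helper name is hypothetical.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated (atrous) convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


sizes = [effective_kernel_size(3, r) for r in (1, 3, 5, 7)]
skipped = [r - 1 for r in (1, 3, 5, 7)]   # pixels skipped between neighbouring taps
print(sizes)    # [3, 7, 11, 15]
print(skipped)  # [0, 2, 4, 6]
```

Each branch still has only nine taps, so a larger rate widens the covered area by leaving more unsampled pixels between taps, which is the discontinuity discussed for large atrous ratios.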
As complementary attention modules, channel-wise attention and spatial attention are used to calibrate features at the spatial and channel levels and they can learn ’what’ and ’where’ to attend in the channel and spatial axes respectively [2020-TPAMI-SENet][2018-arXiv-BAM][2018-ECCV-CBAM]. PR block can be considered as the third type of attention, which focuses on the influence of the whole feature on the network (FPN, GGS) or module(CFE, CBAM). The perception part of the PR block analyzes the global context information of high-level features, and strengthens or weakens the features to be fused from the perspective of the global needs of the network. So we call it global attention in Fig.2 (f). We use PR block with channel attention and spatial attention to further optimize the attention mechanism. The regulation part of the PR block is added to the attention branch of CBAM in Fig. 2 (f) and five CBAM-PR modules are added to i1, i2, i3, i4, i5 positions of the basic FPN structure.
We also add the PR block to the final output position of EGNet [2019-ICCV-EGNet] for perception and regulation, and the final result is further improved. Fig.2 (g) is a simplified structure diagram of EGNet, in which the PR block is added at the final output location (FPN structure). Lines 11-16 of Tab.I demonstrate the effect of the PR block in the above modules and networks.
IV Imitating Eye Observation Module
The purpose of the imitating human eye observation module (IEO) is to quickly and accurately find and locate salient objects. IEO uses a partition search strategy (Fig.5 step1) to enable the network to focus on the analysis of local areas. Then it uses an integration operation (Fig.5 step2) to associate the local feature analysis results with global information. Inspired by human perception of foveal vision and peripheral vision [PV1], we propose a peripheral vision module (PVM) (Fig.5 blue region) to cooperate with the foveal vision to achieve accurate understanding of the salient objects in step3. The peripheral vision analysis strategy of PVM is applied in both the regional search (step1) and the global search (step3). Step2 is a bridge to make the perception range of peripheral vision wider.
In step 1, we divide the input feature into four regions in the spatial dimension and analyze each region separately with PVM. This mimics the human partition search strategy, which helps to find objects that are not in the center of view. In addition, this operation also helps to form the global receptive field (we will explain it in detail in step 3). Because every sub-region is partitioned again and then passes through the PVM module, we only show the implementation details for one of them in Fig.5 step1 and Eq.7-10. The features obtained by partitioning in the spatial dimension are reorganized in the channel dimension in Eq.8.
The operators in Eq.7-10 are spatial dimension partition, channel dimension partition, spatial dimension concatenation, and channel dimension concatenation. The first 3x3 convolution in PVM is activated with ReLU, and the second 3x3 convolution in PVM is activated with sigmoid.
In step 2, we merge the results of partition search (Eq.10) in the spatial dimension and then use a concatenation operation to merge the merged result with the original feature in the channel dimension. Step 2 strengthens the association between the region search feature and the original feature.
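The partition and integration operations of steps 1 and 2 can be sketched on a single-channel map as follows. The sketch assumes even height and width, uses hypothetical function names, and leaves out the PVM convolutions.

```python
def partition_spatial(fmap):
    """Split a 2-D map into four quadrants: top-left, top-right, bottom-left, bottom-right."""
    hh, hw = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [row[:hw] for row in fmap[:hh]],
        [row[hw:] for row in fmap[:hh]],
        [row[:hw] for row in fmap[hh:]],
        [row[hw:] for row in fmap[hh:]],
    ]


def merge_spatial(quads):
    """Inverse of partition_spatial: reassemble the four quadrants."""
    tl, tr, bl, br = quads
    top = [a + b for a, b in zip(tl, tr)]
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom


fmap = [[4 * y + x for x in range(4)] for y in range(4)]
quads = partition_spatial(fmap)              # stacked like four channels
aligned = [q[0][0] for q in quads]           # same position in every quadrant
restored = merge_spatial(quads)
```

Stacking the quadrants along the channel axis places pixels that are half an image apart at the same spatial position, so a small convolution over the stack compares widely separated locations.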
In step 3, we apply the peripheral visual perception again to the feature obtained in step 2. This process is the same as the previous Eq.7-9. PVM can be considered as a special attention mechanism. It expands the receptive field by comparing the features of corresponding positions in other partitions and then corrects the original features in the spatial dimension (F1 to F2 in Fig.6). Eq.9 shows the combination of the attention branch and the primitive branch [2018-ECCV-CBAM][2018-arXiv-BAM]. PVM can also be considered as a special atrous convolution that has only four sampling points and a very large atrous rate (F3 in Fig.6). The PVM here needs to be used for features with a large receptive field, otherwise the spatial information discontinuity caused by a large atrous rate will appear. This problem can be solved by partitioning and stacking PVMs (F1 to F2 to F3). The partition search strategy allows our IEO module to be used for features with a smaller receptive field, which overcomes the limitation of the global perception strategy in [2019-CVPR-AFNet]. If we want to use the IEO for shallower features, we can further decompose the four sub-region features (left side of Fig.5) into smaller sub-regions for partition search and integration before step1 and step2. To keep the network simple, we only use partitioning and integration once in IEO, and apply IEOs to features i3, i4, i5 in the PRNet of Fig.3.
The disadvantage of atrous convolution with a large atrous ratio is that the information given by spatial continuity (such as edges) may be lost, because the sampling points of atrous convolution are too scattered and the information of a single sampling point is not continuous enough. Similarly, the human eye cannot pay attention to the details of objects if there is only peripheral vision. The human eye diagram on the left of Fig. 5 shows the foveal vision cells and peripheral vision cells. Foveal visual cells are rich and concentrated, and they can observe the details of the object. The peripheral visual cells are sparse and widely distributed, and they can organize a broad spatial scene [PV2][PV3]. They are like convolutions of different atrous rates. Balancing these two capabilities is the purpose of the IEO module design.
The dark orange dots in F3 of Fig.6 can be regarded as four groups of sampling points of PVM. One sampling point of PVM consists of 9 points (3x3 Conv). Therefore, PVM has both a large atrous rate and detail observation ability. It is worth noting that the 3x3 convolution here can also use atrous convolution with a small atrous rate. For simplicity, we use normal convolution here. The receptive field of the dark orange pixel in F3 can cover the dark green areas in F1. The receptive field of F3 is global. In addition, we add the fovea vision feature (Fig.5 the red arrow) to cooperate with the P-feature. In the last stage of step 3, we use concatenation operation to merge the original feature, fovea vision feature (r=5), and peripheral vision feature. Finally, we use PR block to balance the relationship among the three types of features.
Here the peripheral vision feature is obtained by repeating Eq.7-9 on the feature produced in step 2.
The weights are produced by the memory units. Inspired by the classification network, Eq.13 uses a softmax layer to associate the weights. SAC uses a similar approach (1-S) to associate partial weights [2020-arxiv-detectRS]. The three regulated features are the peripheral vision feature, the fovea vision feature (an original feature after an atrous convolution operation), and the original feature. The atrous ratio of the convolution of the foveal vision feature is r = 5, which is chosen according to the experimental effect.
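How a softmax ties the three weights together can be shown with a small sketch. The raw scores, the scalar stand-ins for the three features, and the function names are hypothetical. Whether the softmax replaces or follows the sigmoid of the memory units is not stated, so the sketch applies it directly to raw scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores for the peripheral, foveal (r = 5), and original features.
scores = [1.2, 0.3, -0.5]
w_peripheral, w_foveal, w_original = softmax(scores)

# Scalar stand-ins for the three feature maps, weighted before merging.
features = [0.8, 0.4, 0.6]
weighted = [w * f for w, f in zip((w_peripheral, w_foveal, w_original), features)]
```

Because the outputs sum to 1, raising the weight of one feature necessarily lowers the others, which is what associates the three weights.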
We use the widely used binary cross entropy (BCE) loss and a consistency-enhanced (CEL) loss [2020-CVPR-MINet] to supervise the prediction map, as shown in Eq.15.
BCE loss is computed pixel by pixel from the prediction result of the saliency map at each pixel and the ground truth label of that pixel.
CEL loss is a variant of IoU loss, which can measure the similarity of two images from an overall perspective.
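Both losses can be sketched on flattened prediction and ground-truth lists. The BCE below is the standard mean-reduced form. The CEL expression is the form we understand [2020-CVPR-MINet] to use and should be treated as an assumption, as should the equal weighting of the two terms in the total.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross entropy over all pixels (mean reduction assumed)."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def cel_loss(pred, gt, eps=1e-6):
    """Assumed CEL form: (sum(p) + sum(g) - 2 * sum(p * g)) / (sum(p) + sum(g))."""
    inter = sum(p * g for p, g in zip(pred, gt))
    numerator = (sum(pred) - inter) + (sum(gt) - inter)
    return numerator / (sum(pred) + sum(gt) + eps)


def total_loss(pred, gt):
    """Sum of both terms; equal weighting is an assumption."""
    return bce_loss(pred, gt) + cel_loss(pred, gt)


pred = [0.9, 0.8, 0.2, 0.1]
gt = [1.0, 1.0, 0.0, 0.0]
loss = total_loss(pred, gt)
```

The CEL term is zero for a perfect binary prediction and approaches 1 when prediction and ground truth do not overlap at all.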
VI-A Datasets and Evaluation Metrics
We evaluate the proposed architecture on 5 SOD datasets: DUTS [2017-CVPR-DUTS] with 10,553 training and 5,019 test images, DUT-OMRON [LiThe] with 5,168 images, ECSSD [2013-CVPR-Hierarchical] with 1,000 images, PASCAL-S [LiThe] with 850 images, HKU-IS [2015-CVPR-HKU-IS] with 4,447 images. We follow the data partition of [2020-CVPR-MINet][2019-TPAMI-DSS] to use 1,447 images of HKU-IS for testing.
In addition, we define a large salient object dataset (L) and a small salient object dataset (S). They help to further analyze the dynamic regulation of the PR block when dealing with objects of different scales. We select large object images (1270) and small object images (1576) based on the ratio of white pixels in the GT, as shown in Eq.18 and Fig.7. The pictures and ground truth labels are selected from five common test datasets (ECSSD, PASCAL-S, DUT-OMRON, DUTS, HKU-IS). Eq.18 compares the number of white pixels in the GT with the number of black pixels against a threshold. We set the thresholds for large and small objects to 0.38 and 0.03 respectively to obtain datasets L and S.
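The selection rule can be sketched as below. The sketch assumes that the ratio is the fraction of white pixels among all pixels of the GT and that the comparisons are inclusive. The exact form of Eq.18 may differ, and the function names are hypothetical.

```python
def white_ratio(gt):
    """Fraction of white (non-zero) pixels in a binary ground-truth mask."""
    white = sum(1 for row in gt for v in row if v)
    total = sum(len(row) for row in gt)
    return white / total


def size_group(gt, t_large=0.38, t_small=0.03):
    """Return 'L' for large objects, 'S' for small objects, None otherwise."""
    ratio = white_ratio(gt)
    if ratio >= t_large:
        return "L"
    if ratio <= t_small:
        return "S"
    return None


large_gt = [[1] * 10 for _ in range(4)] + [[0] * 10 for _ in range(6)]   # 40% white
small_gt = [[1, 1] + [0] * 8] + [[0] * 10 for _ in range(9)]             # 2% white
mid_gt = [[1] * 10 for _ in range(2)] + [[0] * 10 for _ in range(8)]     # 20% white
groups = [size_group(g) for g in (large_gt, small_gt, mid_gt)]
```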
We use six metrics to evaluate the performance of PRNet and other state-of-the-art models. Mean absolute error (MAE) [perazzi2012saliency] measures the average pixel-level error between the prediction and the GT by calculating the mean of the absolute value of the difference. F-measure [2009-CVPR-Frequency] is also widely adopted in previous models [2019-ICCV-EGNet][2019-TPAMI-DSS], where β² is usually set to 0.3. The maximal F-measure values are calculated from PR curves. An adaptive threshold (twice the mean value of the prediction) is adopted to calculate a second F-measure value. Weighted F-measure [2014-CVPR-FW] improves the F-measure as a measure of completeness. Structural similarity measure (S-measure) [2017-ICCV-Sm] and E-measure [2018-IJCAI-Em] are also useful for quantitative evaluation of saliency maps. Besides, precision-recall (PR) curves are drawn.
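MAE and the F-measure under the adaptive threshold can be sketched on flattened maps as follows. Capping the threshold at 1 and returning 0 when precision and recall are both zero are conventions assumed here, and the function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between prediction and ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def adaptive_threshold(pred):
    """Twice the mean prediction value, capped at 1 (cap is an assumption)."""
    return min(2.0 * sum(pred) / len(pred), 1.0)


def f_measure(pred, gt, threshold, beta2=0.3):
    """F-measure of the binarised prediction with beta^2 = 0.3."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, gt))
    precision = tp / sum(binary) if sum(binary) else 0.0
    recall = tp / sum(gt) if sum(gt) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall)


pred = [1.0, 0.5, 0.25, 0.25]
gt = [1, 1, 0, 0]
thr = adaptive_threshold(pred)
score = f_measure(pred, gt, thr)
error = mae(pred, gt)
```

With β² = 0.3 the measure weights precision more heavily than recall, so in this example a precision of 1 and a recall of 0.5 still give a score above 0.8.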
We follow [2020-AAAI-F3][2020-CVPR-MINet][2019-ICCV-SCRN] to use DUTS-TR [2017-CVPR-DUTS] as the training dataset, and the other above-mentioned datasets are used as testing datasets. In the training phase, we follow [2020-CVPR-MINet] to use random horizontal flipping, random color jittering, and random rotating as data augmentation techniques to prevent the over-fitting problem. PRNet is trained for 40 epochs on an NVIDIA RTX 2080Ti GPU. The batch size is set to 4. VGG-16, pre-trained on the ImageNet dataset, is used as the backbone network. The parameters for the rest of PRNet are initialized by the default setting of PyTorch. Our model adopts the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 0.0005, and an initial learning rate of 0.001. The "poly" strategy [2015-CS-ParseNet] (factor is 0.9) is applied. During testing, the input size is set to 320x320.
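The "poly" schedule can be written out as a short sketch using the stated initial learning rate and factor. Whether the decay is applied per iteration or per epoch is not stated, so the step unit and the function name are assumptions.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power


# One value per epoch for a 40-epoch run starting at 0.001.
schedule = [poly_lr(0.001, epoch, 40) for epoch in range(40)]
```

The rate starts at the initial value and decays smoothly toward zero at the end of training.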
|2. FPN(PR)||.773||.890||.857||.047||.666||.842||.793||.067|
|10. GGS(PR)||.766||.892||.851||.047||.651||.841||.783||.068|
|12. FPN+CBAM(PR)||.775||.894||.858||.046||.666||.847||.793||.066|
|14. FPN+CFE(PR)||.798||.902||.869||.043||.700||.856||.812||.060|
|16. EGNet(PR) [2019-ICCV-EGNet]||.806||.900||.881||.042||.731||.866||.837||.056|
|17. CFD [2020-AAAI-F3]||.778||.890||.858||.047||.674||.836||.796||.066|
|18. CFD(PR) [2020-AAAI-F3]||.790||.891||.861||.044||.681||.842||.799||.062|
|2. GGS(PR) + IEO||.793||.901||.868||.043||.692||.851||.809||.062|
|3. GGS(PR) + IEO(PR)||.802||.908||.872||.041||.698||.857||.812||.059|
|4. GGS(PR) + IEO(PR)||.794||.903||.868||.042||.690||.856||.807||.060|
|5. GGS(PR) + CFE(PR)||.794||.902||.869||.043||.696||.856||.812||.061|
|6. w/o CEL||.765||.879||.865||.048||.669||.840||.808||.067|
|1. IEO in i5||.790||.901||.867||.043||.684||.849||.805||.063|
|2. IEO in i4, i5||.792||.905||.867||.042||.682||.850||.803||.063|
|3. IEO in i3, i4, i5||.802||.908||.872||.041||.698||.857||.812||.059|
|4. IEO in i2, i3, i4, i5||.797||.904||.870||.042||.699||.859||.813||.060|
|5. IEO in i1, i2, i3, i4, i5||.798||.903||.871||.042||.699||.856||.813||.061|
|2. PRNet(PR)||.802||.908||.872||.041||.698||.857||.812||.059|
VI-C Ablation Studies
Ablation analysis of PR block in different structures and modules. Lines 1-4 of Tab.I analyze the effect of the PR block with different perception strategies in the FPN structure; the three PR variants correspond to strategies 1, 2, and 3 in Fig.4. Strategy 1 has the best regulation effect in the FPN structure because of its rich parameters and intensive interaction analysis in the fully-connected layer. In the final network structure (PRNet), however, strategy 2 is the best (Tab.IV); we explain this phenomenon in the later analysis. To simplify the experiments, we uniformly use the best strategy of the final network in the comparison experiments and in the ablation experiments of Tab.I, Tab.II, and Tab.III. Line 6 of Tab.I is the gate strategy provided by GateNet [2020-ECCV-GateNet], and its effect is not as good as that of the PR block with global perception and regulation. It is worth noting that line 7 of Tab.I is the AIM module of MINet [2020-CVPR-MINet]. AIM has a more complex interaction structure than the PR block, but the regulation effect of the PR block is better. Line 8 verifies the effect of the multi-level perception strategy in the FPN structure: FPN(PR) means that FPN-PR uses the encoder structure shown on the left side of Fig. 3. Lines 9-18 verify the improvement that the PR block brings to the GGS structure (the lower right corner of Fig.3), the CFE module (Fig.2 (e)), the CBAM module (Fig.2 (f)), the EGNet network (Fig.2 (g)), and the CFD decoder (Fig.2 (b)).
Ablation analysis of the PRNet. Tab.II shows the ablation experiments of PRNet (Fig.3). The baseline is GGS(PR). IEO greatly improves the network performance in line 2, but the allocation between foveal vision features and peripheral vision features is not balanced. The performance of IEO improves further after it is regulated by the PR block (line 3). The right side of Fig. 5 shows how the weights obtained by the perception part regulate the features of IEO. Lines 3 and 4 show the effect of multi-level perception. The experiment in line 5 replaces the IEO-PR module of line 3 with the CFE-PR module (Fig.2 (e)), which demonstrates the effectiveness of the IEO-PR module. The model in line 6 is the same as that in line 3, but it is supervised only by the BCE loss (the CEL loss is removed). Comparing lines 3 and 6 shows the effect of the CEL loss.
It is worth noting that, as shown in Fig.3, the IEO and CFE modules in Tab.II are placed at positions i3, i4, and i5, which follows the setting of [2019-CVPR-PFANet]. The peripheral vision module in the IEO module has a large dilation rate, so it is better suited to high-level features with a larger receptive field. Experiments in Tab.III also show that this scheme is the best for the IEO-PR module. Besides, reducing the number of IEO modules helps to simplify the model and improve its speed.
Perception strategy analysis. Lines 1-4 of Tab.I show that strategy 1 is the best in simple structures (FPN), while strategy 2 performs best in complex structures (PRNet), as shown in Tab.IV.
The fully-connected layers in strategy 1, which are inspired by classification networks, make the weights strongly coupled and correlated, as shown in Fig.4. In addition, the FC layer has many parameters, which helps to recalibrate the weights of features with obvious differences in the simple structure (FPN). In the complex network PRNet, however, the difference between the features to be fused becomes smaller. Take the feature fusion process F on the far right of PRNet (Fig.3) as an example: because d4 is the fusion of g2 (d5) and i4, the difference between g2 and d4 is reduced. The fully-connected layers in strategy 1 over-interpret the weights, which makes the effect of the PR block worse. Fully-connected layers are also used in strategy 3, so it has the same problem. The MU of strategy 3 evaluates each feature weight independently, which differs from the strong coupling in strategy 1. Strategy 3 can be considered a simplified version of strategy 1 because its fully-connected layer has fewer parameters. Strategy 1 performs better on simple networks (FPN, CFD), and strategy 2 performs better on complex networks (PRNet, EGNet). Besides, we analyze the weights of strategies 1 and 2 in the FPN structure, as shown in Fig.8. We find that strategy 1 is radical and sensitive, while strategy 2 is conservative and restricted. These characteristics explain the performance differences between the two strategies in simple and complex network structures.
Strategy 2 directly uses global average pooling to reduce the spatial dimension (H, W) to (1, 1), which is beneficial for complex networks (PRNet) in which the features to be fused differ little. Because there is no fully-connected layer, the final weights are more directly and closely related to the spatial features of salient objects. Strategy 2 thus avoids the overfitting of the fully-connected layers in strategies 1 and 3.
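The pooling step described above can be sketched with NumPy. Only the reduction from (H, W) to (1, 1) comes from the text. The mapping from the pooled descriptor to a single weight (a sigmoid of its mean) is a hypothetical stand-in chosen for illustration, since the exact form of the PR block is not given in this section; `global_average_pool` and `gap_weight` are likewise hypothetical names.

```python
import numpy as np

def global_average_pool(feat):
    """Reduce a (C, H, W) feature map to a (C, 1, 1) channel descriptor."""
    return feat.mean(axis=(1, 2), keepdims=True)

def gap_weight(feat):
    """Scalar weight in (0, 1) derived from the pooled descriptor.

    Hypothetical mapping for illustration only: sigmoid of the mean
    of the pooled channels. No fully-connected layer is involved.
    """
    pooled = global_average_pool(feat)
    return float(1.0 / (1.0 + np.exp(-pooled.mean())))

# Usage: weight a feature map by a value computed from its own statistics.
feat = np.ones((4, 8, 8))
print(global_average_pool(feat).shape, gap_weight(feat))
regulated = gap_weight(feat) * feat
```

Because the weight depends only on averaged activations, it has no learned parameters of its own in this sketch, which mirrors the point that the result stays closely tied to the spatial features.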
The PR block is originally inspired by classification networks, and we find that the Network In Network (NIN) paper [2014-ICLR-NIN] verifies this phenomenon from the perspective of the classification task. NIN explains in detail the advantages of replacing the fully-connected layer with global average pooling. According to the above analysis, strategy 1 can be used to regulate feature fusion processes with large feature differences, while strategy 2 can be used for feature fusion processes with small feature differences. In addition, it should be noted that strategy 1 is not suitable for all feature fusion processes: when the difference between the features to be fused is very small, radical weight regulation may have a negative impact.
Feature weight analysis. To further analyze how the PR block works, we show the average weight at each fusion position in the pie chart and line chart in Fig.9. F1-4 and FG1-3 represent the multiple regulated points of FPN-PR and GGS-PR (Fig.3). The pie chart shows the results on the training dataset and the 5 test datasets. The line chart shows the results on the 5 test datasets, on 1300 pictures with large objects, and on 1300 pictures with small objects. The images of large and small objects are obtained by setting a threshold on the area ratio of white pixels in the GT. Blue, red, and purple regions and lines represent decoder features, interlayer features, and global features, respectively. The grey line indicates that the weight of a feature without PR block regulation is 1.
We use the dotted circle to indicate the positions where the weight of the PR block changes noticeably. This change is strongly related to the size of the salient object. It is worth noting that the highest-level features (d5) of small objects are specially enhanced (dotted circle in the small object) to prevent dilution. The lowest-level features (i1) of large objects are sufficiently suppressed to prevent interference. The left side of Fig.9 shows the effect of the PR block. To verify that the dynamic regulation of weights is meaningful, we lock the weights in Tab.I line 5. The weights of FPN(PR) are obtained from the average weights on the training dataset. Lines 1, 4, and 5 show that suppressing low-level features with fixed weights is effective, but that regulating the weights according to the analysis results of the PR block is better.
|DSS||.826||.789||.755||.885||.824||.056||.772||.729||.691||.846||.788||.066||.910||.894||.864||.938||.879||.041||.916||.900||.871||.924||.882||.053||.839||.807||.760||.851||.797||.096|
|BMPM||.852||.745||.761||.863||.862||.049||.774||.692||.681||.839||.809||.064||.920||.871||.860||.938||.907||.039||.928||.868||.871||.916||.911||.045||.864||.771||.785||.847||.845||.075|
|PoolNet||.876||-||-||-||.043||.817||.817||-||-||-||-||.058||.928||-||-||-||-||.035||.936||-||-||-||-||.047||.857||-||-||-||-||.078|
|AFNet||.863||.793||.785||.895||.867||.046||.797||.739||.717||.859||.826||.057||.925||.889||.872||.949||.906||.036||.935||.908||.886||.941||.913||.042||.871||.828||.804||.887||.850||.071|
|MLMSNet||.852||.745||.761||.863||.862||.049||.774||.692||.681||.839||.809||.064||.920||.871||.860||.938||.907||.039||.928||.868||.871||.916||.911||.045||.864||.771||.785||.847||.845||.075|
|PAGE||.838||.777||.769||.886||.854||.052||.792||.736||.722||.860||.825||.062||.920||.884||.868||.948||.904||.036||.931||.906||.886||.943||.912||.042||.859||.817||.792||.879||.840||.078|
|BANet-V||.852||.789||.781||.891||.861||.046||.793||.731||.719||.856||.823||.061||.920||.887||.871||.948||.903||.036||.935||.910||.890||.944||.913||.041||.867||.826||.799||.879||.841||.078|
|GATE-V||.870||.783||.786||.888||.870||.045||.794||.724||.704||.854||.821||.061||.928||.889||.872||.948||.909||.035||.941||.896||.886||.931||.917||.041||.882||.810||.807||.870||.856||.070|
|ITSD-V||.877||.798||.814||.893||.877||.042||.807||.745||.734||.858||.829||.063||.926||.891||.882||.947||.907||.035||.939||.875||.897||.918||.914||.040||.884||.787||.824||.857||.858||.068|
|FCNet||.829||.795||.757||.887||.822||.045||.717||.676||.618||.795||.745||.066||-||-||-||-||-||-||-||-||-||-||-||-||.857||.830||.802||.882||.830||.068|
|HVPNet||.840||.749||.730||.863||.849||.058||.804||.721||.700||.847||.831||.065||.916||.871||.840||.936||.899||.045||.928||.889||.855||.924||.903||.052||.849||.794||.753||.850||.827||.091|
|CAGNet-V||.851||.820||.797||.900||.852||.045||.782||.743||.718||.858||.807||.057||.922||.906||.888||.948||.899||.033||.930||.911||.892||.932||.897||.042||.860||.828||.799||.874||.825||.077|
|SAMNet||.836||.745||.729||.864||.849||.058||.803||.717||.699||.847||.830||.065||.915||.870||.837||.938||.898||.045||.928||.891||.858||.930||.907||.050||.850||.790||.747||.849||.824||.093|
|PRNet||.877||.815||.802||.908||.872||.041||.789||.731||.698||.857||.812||.059||.930||.906||.885||.956||.910||.033||.936||.913||.890||.941||.910||.041||.884||.843||.815||.893||.856||.067|
VI-D Comparison with State-of-the-Art Methods
We compare PRNet against 22 state-of-the-art SOD methods, including DCL [2016-CVPR-DCL], NLDF [Luo2017Non], MSRNet [2017-CVPR-MSRNet], DSS [2019-TPAMI-DSS], BMPM [2018-CVPR-BMPM], RAS [2020-TIP-RAS], PAGRN [2018-CVPR-PAGRN], C2S [2018-ECCV-C2S], PAGE [2019-CVPR-PAGE], BANet [2019-ICCV-BANet], AFNet [2019-CVPR-AFNet], GateNet [2020-ECCV-GateNet], ITSD [2020-CVPR-ITSD], FCNet [2020-NIPS-FCNet], HVPNet [2020-TCYB-HVPNet], CAGNet [2020-PR-CAGNet], SAMNet [2021-TIP-SAMNet], etc.
For a fair comparison, we use the saliency maps provided by the authors or generated by their code. PoolNet [2019-CVPR-PoolNet] (with StdEdge) adds another dataset (BSDS500) for joint training, which makes the comparison unfair. We therefore compare PRNet with PoolNet (with SalEdge, only DUTS-TE dataset) in Tab. V, and the experimental results show that our algorithm is better. Our PRNet (only 130M) is a simple network, so we do not compare it with the much larger EGNet (434M) [2019-ICCV-EGNet] and MINet (650M) [2020-CVPR-MINet].
Quantitative evaluation. Tab. V reports the scores of the proposed model and 22 state-of-the-art saliency detection methods on five widely used datasets. The results demonstrate that the perception-and-regulation strategy enables a simple network to perform favorably against other algorithms. Moreover, the PR curves of our approach outperform those of other methods, as shown in Fig. 10.
Qualitative evaluation. Fig. 11 shows visual examples produced by our model and other models. From the 1st row to the 10th row, the size of the salient object gradually changes from the largest to the smallest. Our algorithm is effective in dealing with multi-scale object detection and performs better in various challenging scenarios, including small, medium-sized, and large objects. Fig.12 visualizes the whole process and clearly shows the difference between the FPN network with a PR block and the FPN network without one. Besides, we provide some failure cases of our algorithm in the supporting document to help future researchers conduct further analysis.
In this paper, we propose a novel framework, PRNet, for salient object detection. A PR block is designed to help the network understand global information and assign feature weights spontaneously and adaptively. To better perceive semantic information and allocate weights reasonably, we propose three perception strategies and carry out comparative experiments, which verify the application scenarios suited to each strategy. Considering the relationship between local perception and global perception, we propose an IEO module that gives the network the ability to organize a wide spatial scene and to scrutinize highly detailed objects. Extensive experiments demonstrate that PRNet performs well. In the future, we may extend PRNet to more complex structures, such as recurrent networks and multi-modal SOD networks (RGB-T, RGB-D).