Perception-and-Regulation Network for Salient Object Detection

07/27/2021
by   Jinchao Zhu, et al.

Effective fusion of different types of features is the key to salient object detection. The majority of existing network structure designs are based on the subjective experience of scholars, and the feature fusion process does not consider the relationship between the fused features and the highest-level features. In this paper, we focus on the feature relationship and propose a novel global attention unit, which we term the "perception-and-regulation" (PR) block, that adaptively regulates the feature fusion process by explicitly modeling interdependencies between features. The perception part uses the structure of fully-connected layers in classification networks to learn the size and shape of objects. The regulation part selectively strengthens and weakens the features to be fused. An imitating eye observation module (IEO) is further employed to improve the global perception ability of the network. The imitation of foveal vision and peripheral vision enables IEO to scrutinize highly detailed objects and to organize the broad spatial scene for better segmentation. Sufficient experiments conducted on SOD datasets demonstrate that the proposed method performs favorably against 22 state-of-the-art methods.

I Introduction

Salient object detection (SOD) [wang2019sodsurvey][2021-TMM-cmSalGAN][2021-TPAMI-rethinking-Co-SOD][2021-TNNLS-RethinkingRGBD][2021-TPAMI-NoisyLabel][2021-PR-EFNet] aims to find salient areas in an image [2020-CVPR-Scribble][2021-AAAI-RD3D][2021-TPAMI-Uncertainty] or video [2018-TIP-VSOD-FCN][2018-TPAMI-SA-VOS] by using intelligent algorithms that mimic human visual characteristics. It has been used in many image comprehension and video processing tasks, such as photo cropping [2019-TPAMI-PhotoCropping], image editing [2010-ACMTOG-RepFinder], 4D saliency detection [2019-ICCV-LightField], photo composition [2009-ACMTOG-Sketch2Photo], and target tracking [2009-CVPR-tracking-SBDT].

With the development of deep neural networks, various network structures and novel convolution modules have been designed to improve the segmentation effect. The majority of salient object detection networks are based on the U-shaped structure [2015-CVPR-FCN] to integrate the features of different depths and scales. The network structures represented by U-net [2015-ICM-Unet] and the feature pyramid network (FPN) [2027-CVPR-FPNdetection] have an obvious problem of semantic information dilution. Therefore, transferring rich semantic information to shallow layers without losing location information or destroying details is the focus of current algorithms [2019-TPAMI-DSS][2020-AAAI-F3][2019-ICCV-EGNet][2019-CVPR-PFANet][2019-CVPR-PoolNet][2020-AAAI-GCPANet]. Among them, the global guidance structure (GGS) represented by [2019-CVPR-PoolNet][2020-AAAI-GCPANet][2021-AAAI-SCWS] is widely used. The existing methods have made great contributions to network structure and module optimization. These designs are based on rich experimental attempts and the subjective experience of scholars. Although great progress has been made, two key issues are still worthy of further study: how to regulate different types of features to complete better segmentation from the perspective of the overall need for weight regulation, and how to better balance the ability to organize a broad spatial scene with the ability to scrutinize highly detailed objects.

Fig. 1: Inspired by the application of fully-connected layers in classification networks, we replace the prediction of categories with the prediction of the weights of the features to be fused. The network regulated by these weights achieves better performance.

Most of the existing methods fuse multi-level features directly without considering the contribution ratio of the fused features to the final output. Therefore, for the first issue, we propose a perception-and-regulation (PR) block to optimize the FPN structure from the perspective of global perception and local feature fusion fine-tuning. Global perception helps to provide accurate semantic information and make better weight regulation, and local feature fusion fine-tuning helps to enhance useful information and suppress invalid information. The perception part of the PR block realizes global perception, which adopts the deepest feature with the largest receptive field as the input. The regulation part of the PR block realizes local feature fusion fine-tuning, which adopts a weighted method to optimize the feature fusion process.

For the second issue, [2020-CVPR-COD][2019-CVPR-PFANet][2018-TPAMI-DeepLab] use atrous convolution to expand the perception range. However, an overly large atrous rate makes the information at the isolated sampling points discontinuous, which harms the spatial continuity of fine details. Therefore, convolution combinations with multiple atrous rates are widely adopted [2019-CVPR-PFANet][2018-TPAMI-DeepLab][2020-arxiv-detectRS] so that the network can both organize a wide spatial scene and scrutinize highly detailed objects. To better balance these two abilities, we propose the imitating eye observation module (IEO).

Fig. 2: Related module diagrams and the application of the PR block in CFE, CBAM, and EGNet. (a) Simplified global guidance structure (GGS) [2020-AAAI-GCPANet][2019-CVPR-PoolNet]. (b) Cross feature module (CFM) of [2020-AAAI-F3], used for semantic information enhancement. (c) Classification-guided module (CGM) of UNet3+ [2020-ICASSP-UNet3+], used to judge whether a region is noise or organ. (d) Switchable atrous convolution (SAC) [2020-arxiv-detectRS], which is helpful for detecting objects of different scales. (e) Context-aware feature extraction module (CFE) [2019-CVPR-PFANet] with PR block. (f) Convolutional block attention module (CBAM) [2018-ECCV-CBAM] with PR block. (g) Simplified structure of EGNet [2019-ICCV-EGNet] with PR block.

A partition search strategy is adopted in IEO to effectively alleviate the negative impact of an overly large atrous rate. The PR block further balances the different types of features within IEO, which helps to improve network performance.

Inspired by squeeze-and-excitation (SE) block [2020-TPAMI-SENet] and the classification network, we design a global attention unit called perception-and-regulation (PR) block to regulate the network automatically and achieve the best feature fusion effect. Our goal is to improve the feature fusion effect by explicitly modeling the interdependencies between features to be fused. To achieve this, we design a mechanism that helps the network to recalibrate the feature fusion process, through which it can learn to use global information to adaptively adjust the weights of the features to be fused. In addition, we design an imitating human eye observation module (IEO) to organize a broad spatial scene and have the ability to scrutinize highly detailed objects. Our contributions are summarized as follows:

  • We propose a perception-and-regulation (PR) block to help the network understand the global information and assign the feature weights uniformly to realize the spontaneous and adaptive global feature regulation.

  • An imitating human eye observation module (IEO) is proposed to help the network have the ability to organize a wide space scene and scrutinize highly detailed objects. PR block balances and optimizes these two abilities.

  • Sufficient experiments conducted on 5 SOD datasets demonstrate that the proposed method outperforms 22 state-of-the-art SOD methods in terms of eight metrics. In addition, ablation experiments on several modules and networks prove the universality of our PR block.

II Related Work

II-A Salient Object Detection

Early salient object detection methods are based on hand-crafted features [2009-CVPR-Frequency][2015-TPAMI-GC][2013-CVPR-Submodular][2011-ICCV-Center][2013-CVPR-Discriminativ][2013-CVPR-Hierarchical][2013-CVPR-Graph][2016-TIP-CDST] and intrinsic cues without a deep hierarchical structure. With the development of deep learning, deep features with rich contextual information have brought a great breakthrough to the field of salient object detection. As a landmark algorithm, fully convolutional networks (FCN) [2015-CVPR-FCN] creatively remove the fully-connected layer to predict the semantic label for each pixel. Then the U-shape based structures represented by U-net [2015-ICM-Unet] and the feature pyramid network (FPN) [2027-CVPR-FPNdetection] gradually became the mainstream by integrating all levels of features layer by layer. Based on this structure, scholars have explored richer multi-layer feature blending methods to express features more effectively. Among them, the global guidance structure (GGS) represented by [2019-TPAMI-DSS][2020-AAAI-GCPANet][2019-CVPR-PoolNet] has gradually become a common method to strengthen semantic information.

Our Perception-and-Regulation (PR) network is based on the GGS to regulate the network spontaneously and adaptively. It is worth noting that, in contrast to FCN, which abandons the fully-connected (FC) layer, our network mainly relies on the FC layer of the classification network to perceive the size and shape of objects. The weights of the features to be fused are then evaluated according to the semantic information obtained from the FC layer, as shown in Fig. 1.

II-B Semantic Information Reinforcement

Scholars have proposed various methods to transmit the global information of high-level features to the shallow layers to help the network obtain detailed information and accurately locate the objects. For example, Hou et al. [2019-TPAMI-DSS] proposed short connections to help shallower side-output layers obtain semantic information more directly. Zhao et al. [2019-ICCV-EGNet] used the highest-level features to help the explicitly modeled edge features (shallow layer) filter out useless edge details. Zhao et al. [2019-CVPR-PFANet] designed a spatial attention module in which context-aware high-level features are added to help transfer location information to the shallow layers. Liu et al. [2019-CVPR-PoolNet] introduced a global guidance module (GGM) to explicitly make shallower layers aware of the locations of the objects. The global context flow module in [2020-AAAI-GCPANet], which is similar to GGM, solves the dilution issue during the transmission of high-level features.

The global guidance idea of [2020-AAAI-GCPANet][2019-CVPR-PoolNet] can be simplified to the GGS structure in Fig. 2 (a), and this structure can effectively solve the problem that semantic information is diluted during the feature transfer of the FPN or U-net structure [2015-ICM-Unet]. Although these structural designs are based on the subjective experience and repeated attempts [2019-TPAMI-DSS] of excellent scholars, the rigid feature fusion process does not consider the relationship between the different features to be fused and the highest-level features. The GGS-PR structure of Fig. 3 uses the PR block to further regulate and optimize the network.

Fig. 3: Perception-and-regulation network (PRNet). On the right side of the picture are the applications of PR block in FPN structure and GGS structure. The middle part of the picture shows the regulation part of PR (F-PR and F-PR). Besides, F in the GGS structure is not regulated by PR. On the left side of the picture, we add SSD structure and IEO modules to GGS-PR to get the final network PRNet. i1, i2, i3, i4, i5 in region I are the interlayer features. d1, d2, d3, d4, d5 in region D are the decoder features. g1, g2, g3 in region G are global guidance features. The gray dotted arrow represents the input feature of the perception part of PR block. The gray solid arrows in region I represent the regulation of the output weights of the PR block on each feature fusion process.

Wei et al. [2020-AAAI-F3] designed a cross feature module (CFM) to select features with rich semantic information to transfer to the shallower layer, while letting the features with details enter the next cycle, as shown in Fig. 2 (b). CFM can be understood as strengthening the transmission of semantic information in the FPN structure. Our PR block does the same work, but its way of strengthening semantic features is more flexible and adaptive, as shown in the FPN-PR of Fig. 3.

II-C Attention Mechanisms

In the classification task, Hu et al. [2020-TPAMI-SENet] greatly improve the quality of feature representation by establishing the interdependencies between the channels of convolutional features. Correspondingly, Woo et al. [2018-arXiv-BAM][2018-ECCV-CBAM] use a spatial attention map generated from the inter-spatial relationship of features, together with a channel attention map, to help the network learn 'where' is an informative part and 'what' is meaningful along the spatial and channel axes, respectively. Some SOD methods adopt attention mechanisms [2020-TIP-RAS][2018-CVPR-PAGRN][2019-CVPR-CPD][2019-CVPR-PFANet]. Wang et al. [2019-CVPR-PAGE] design a pyramid attention module to make the network pay more attention to important regions and multi-scale information. As a special attention mechanism, the gated mechanism is widely used by long short-term memory (LSTM) and gated recurrent units (GRU), which play an important role in SOD algorithms [2019-CVPR-TopDown][2019-PAMI-ASNet][2019-CVPR-CapSal]. Some segmentation algorithms [2017-CVPR-GatedDense][2018-CVPR-BMPM][2020-ECCV-GateNet] use the gated mechanism to adjust the network. The CGM of UNet3+ [2020-ICASSP-UNet3+] can be regarded as an extreme case of the gated mechanism, because the weights of the controlled features are set to 0 or 1 via argmax (Fig. 2 (c)), which helps to determine whether the target is an organ or noise.

Inspired by SENet [2020-TPAMI-SENet] and the classification network, we design a PR block for global regulation, which can be regarded as a macro global attention mechanism. [2020-TPAMI-SENet] adaptively recalibrates channel-wise features at the micro level, while the PR block recalibrates the different types of features of the whole network at the macro level. Different from existing gated-network methods, all the features to be fused are regulated in our network, and the perception part is located at a position with rich semantic information, which helps to analyze the size and shape of the object accurately and uniformly. In addition, inspired by the SAC of [2020-arxiv-detectRS] (Fig. 2 (d)), we use softmax, which is commonly used in classification networks, to add constraints to the regulation part of the PR block.

Some algorithms [2019-CVPR-PFANet][2020-CVPR-COD][2020-arxiv-detectRS][2020-ECCV-GateNet][2018-TPAMI-DeepLab] use atrous convolution to expand the receptive range and better observe the object. The disadvantage of atrous convolution with a large atrous rate is that spatially continuous information (such as edges) may be lost, and it is not conducive to the segmentation of small objects. We design a special spatial attention mechanism that compensates for this shortcoming by imitating foveal vision and peripheral vision.

III Perception-and-Regulation Block

The Perception-and-Regulation (PR) block is a computational unit with semantic perception and global regulation capability. The PR block serves the feature-level weight regulation of our final network, PRNet, as shown in the network structure on the left side of Fig. 3. Besides, it can also be used in many other network structures and modules.

Inspired by SENet [2020-TPAMI-SENet] and the classification network, we design a PR block to make the fusion process of different types of features adaptively adjust according to the overall need of weight regulation. SE block focuses on features and adaptively recalibrates features at the channel level, while our PR block focuses on the entire network and recalibrates the entire network at the feature level. Therefore, PR block can be considered as a macro version of SE block.

In addition, FCN [2015-CVPR-FCN] adapts classification networks (AlexNet [2012-NIPS-AlexNet], VGG net [2015-ICLR-VGG] and GoogLeNet [2015-CVPR-GoogLeNet]) into fully convolutional networks by replacing the fully-connected (FC) layer with convolution layers to achieve semantic segmentation. On the contrary, we make full use of the perception and understanding ability of the FC layer to make adaptive regulation for the whole network. In Sec. III-A, we discuss three perception strategies for the perception part of the PR block in detail. The PR block solves the first issue of how to regulate different types of features to complete better segmentation from the perspective of the overall need for weight regulation.

Fig. 4: Three perception strategies for the perception-and-regulation block. Strategy 1 uses the fully-connected layer of classification networks to evaluate the weights of the features to be fused. Strategies 2 and 3 generate feature weights independently for each feature to be fused according to the highest-level feature. The dimension change of the weight in strategy 2 is (H×W×C)→(H×W×1)→(1×1×1), and the process of strategy 3 is (H×W×C)→(1×1×C)→(1×1×1). The features i1, i2, i3, i4, d2, d3, d4, d5 to be regulated come from the interlayer region (Region I) and the decoder region (Region D) of FPN, which correspond to the upper right corner of Fig. 3. For the convenience of comparison, we show the SE block [2020-TPAMI-SENet] in the upper right corner.

The PR block has a better regulation effect on the fusion process of features with large differences. The differences here refer to features at different levels (FPN, GGS), features produced by atrous convolution or normal convolution (CFE), and the original features versus the attention features in the residual structure of an attention module (CBAM).

III-A Perception: Semantic Information Embedding

In the perception part of the PR block, three semantic information embedding methods are designed (Fig.4) according to the idea of SENet for global information embedding. In order to introduce our idea clearly, we annotate the features in FPN-PR in the upper right corner of Fig.3.

Strategy 1 in Fig. 4 uses the FC layer to perceive the size and shape of the objects. The output of the FC layer in a classification network is the category of the predicted object, while our output is the weights of the features to be fused. The weights are adjusted adaptively according to the characteristics of the objects. Features regulated by weights are represented by the color of the corresponding weight. Unlike the feature weighting of the SE block for a single channel (Fig. 4), the PR block regulates whole features. The perception part of strategy 1 is set at the location of the high-level feature (i5) with rich semantic information. Max-pooling is used to reduce the size of the feature, and the pooled feature is then transformed into a one-dimensional vector by a flatten operation. We use a multi-layer perceptron (MLP) [2018-arXiv-BAM] with one hidden layer to enhance the perception ability of the network; to save parameter overhead, the hidden activation size is reduced. C is the number of features that need to be regulated, and it is eight for the FPN structure in Fig. 4. The output layer size is C. The activation function of the output layer replaces ReLU with sigmoid, and the final results are multiplied by 2 so that features can be either enhanced or suppressed. In short, the perception part (P) is computed as:

w = 2 · σ( W1 · δ( W0 · flatten( maxpool(i5) ) ) )   (1)

where σ is the element-wise sigmoid of the output layer, δ is the ReLU activation of the hidden layer, and W0, W1 are the weights of the MLP.
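For concreteness, the following PyTorch-style sketch illustrates how strategy 1 could be implemented under the description above; the pooled size, hidden width, and the 64-channel input are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class PerceptionFC(nn.Module):
    """Sketch of perception strategy 1: a small MLP over the pooled top-level
    feature predicts one scalar weight per feature to be fused.
    Pool size, hidden width and channel count are illustrative assumptions."""
    def __init__(self, in_channels=64, pooled_size=5, num_weights=8, hidden=128):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(pooled_size)   # shrink i5 spatially
        flat_dim = in_channels * pooled_size * pooled_size
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_weights),
            nn.Sigmoid(),                               # outputs in (0, 1)
        )

    def forward(self, i5):
        # weights in (0, 2): values > 1 enhance a feature, values < 1 suppress it
        return 2.0 * self.mlp(self.pool(i5))

# usage: w = PerceptionFC()(i5); w[:, k] scales the k-th feature to be fused
```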

Due to the dense connection of the FC layers, the final weights of strategy 1 are strongly correlated. In order to explain the perception part of the PR block more clearly, we provide two other perception designs along the spatial dimension (S) and the channel dimension (C). Different from strategy 1, we design multiple independent memory units (MU) for strategies 2 and 3 (Fig. 4) to better illustrate the mapping relationship. The function of an MU is to establish the mapping relationship between the shape and size of the input features and the weight of the regulated feature. The number of MUs is determined by the number of features that need to be regulated. In the MU of strategy 2, we use two convolution operations to gradually reduce the channel dimension of the input features to 1; here we refer to the design of the spatial attention map [2018-arXiv-BAM][2018-ECCV-CBAM], and r is the reduction ratio. The purpose of adding convolutions here is to obtain the output weight according to the input features; it can be understood that the convolutions change the average gray value of the input features. The weight is equal to the average of the output feature (whose channel number is 1). We change the activation function to sigmoid to enhance the nonlinearity of the output and multiply the final result by 2. In short, the perception part (P) is computed as:

w = 2 · GAP( Conv_σ( Conv(F) ) )   (2)

where Conv_σ(·) is the convolution operation using the element-wise sigmoid function and GAP(·) is the global average pooling.

For the MU of strategy 3, we use global average pooling to reduce the input feature to one dimension. Then we use a fully-connected layer to reduce the channel dimension. After the sigmoid and average operations, the final weight is obtained. The perception part (P) is computed as:

w = Avg( FC_σ( GAP(F) ) )   (3)

where FC_σ(·) is the fully-connected layer whose output activation function is the sigmoid and Avg(·) is the average operation over the resulting one-dimensional vector.
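The two memory-unit designs can likewise be sketched in PyTorch. The reduction ratio r, the 3x3 kernels, and the x2 scaling of strategy 3 are assumptions for illustration; only the overall squeeze-to-one-weight behavior follows the description above.

```python
import torch
import torch.nn as nn

class MemoryUnitS(nn.Module):
    """Sketch of a strategy-2 memory unit: two convolutions squeeze the channel
    dimension to 1, sigmoid bounds the map, and global average pooling turns it
    into a single weight in (0, 2). Kernel sizes and ratio r are assumptions."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // r, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return 2.0 * self.reduce(x).mean(dim=(2, 3))     # (B, 1) weight

class MemoryUnitC(nn.Module):
    """Sketch of a strategy-3 memory unit: global average pooling, a fully-
    connected layer that shrinks the channel vector, sigmoid, then averaging.
    The x2 scaling mirrors strategies 1 and 2 and is an assumption here."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.fc = nn.Linear(channels, channels // r)

    def forward(self, x):
        v = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))   # (B, C // r)
        return 2.0 * v.mean(dim=1, keepdim=True)          # (B, 1) weight
```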

III-B Regulation: Adaptive Recalibration

The accurate location information of high-level features in the FPN structure is diluted in the process of multiple fusions [2020-AAAI-GCPANet]. This is because element-wise addition or concatenation operations do not treat the weights of the features to be fused differently. Most current algorithms focus on changing the network structure [2020-AAAI-GCPANet][2019-ICCV-EGNet][2019-CVPR-PoolNet][2019-TPAMI-DSS] or module [2020-AAAI-F3][2019-CVPR-PAGE] to enhance semantic information, while the PR network focuses on the regulation of the network and can greatly improve the performance of simple network structures. The PR network uses a PR block for global regulation. It builds a bridge between the features to be fused and the semantic information produced by the fully-connected layer. The semantic information is expressed in the form of weights.

III-B1 Basic Structural Analysis (FPN-PR and GGS-PR)

Both [2019-CVPR-PoolNet] and [2020-AAAI-GCPANet] use the global guidance method to enhance the semantic information of the shallow features of the network. We add a global guidance structure to the FPN to imitate this process, as shown in Fig. 2 (a). The global guidance structure can be considered a simplified version of [2019-TPAMI-DSS][2019-CVPR-PoolNet][2020-AAAI-GCPANet]. The location features are directly added to the shallow features to enhance the ability to locate salient objects, but there is still room for improvement. We add the PR block to both the FPN and GGS structures (Fig. 3) for further slight regulation. [2020-AAAI-F3] proposes a CFM module that helps the output features, with the high-level features (Fig. 2 (b), blue arrow) as the main components, transfer to the shallow layer. This scheme is not flexible enough because the structure of CFM is fixed, while the perception part of the PR block adaptively recalibrates the weight of each feature in the whole network according to the characteristics of the object to achieve better segmentation.

Fig. 3 shows the realization of the perception and regulation of the PR block. In order to simplify the network and reduce the amount of computation, we use a convolution block (convolution, batch normalization, ReLU) to unify the output features of the encoder to 64 channels; the FPN-PR and GGS-PR on the right side of Fig. 3 do not show this detail. Taking FPN-PR with perception strategy 2 as an example, eight memory units are used to perceive the input feature and evaluate the weights of the interlayer features (i1, i2, i3, i4) and the decoder features (d2, d3, d4, d5). The gray dotted arrow represents the input feature of the perception part. The gray solid arrows represent the regulation of the eight output weights of the PR block on each feature to be fused. The only difference from the traditional FPN structure is that the PR block weights these features. The features (g1, g2, g3) in the purple region G are global guidance features. It is worth noting that we only weight the three feature fusion processes of GGS to explore the influence of the PR block on the global guidance. In short, FPN-PR and GGS-PR are computed as:

d_k = Conv( w_{i_k} · i_k + Up( w_{d_{k+1}} · d_{k+1} ) )   (4)
d_k = Conv( i_k + Up( d_{k+1} ) + Up( w_{g_j} · g_j ) )   (5)

where Conv(·) refers to the convolution operation, w denotes a weight produced by the PR block, and Up(·) is upsampling.
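A minimal sketch of the weighted FPN-PR decoder fusion follows, assuming each PR weight is already broadcastable to its feature map; the helper names and list layout are hypothetical.

```python
import torch.nn.functional as F

def fpn_pr_fuse(i_feats, w_i, w_d, convs):
    """Sketch of the PR-regulated FPN decoder (in the spirit of Eq. 4).

    i_feats: [i1, i2, i3, i4, i5] interlayer features (coarsest last).
    w_i, w_d: PR weights for the interlayer / decoder features, assumed to be
              tensors of shape (B, 1, 1, 1) so they broadcast over the maps.
    convs:    per-level fusion convolutions (hypothetical helper modules).
    """
    d = i_feats[-1]                        # d5 starts from the deepest feature i5
    decoder = [d]
    for k in range(len(i_feats) - 2, -1, -1):          # fuse i4, i3, i2, i1
        up = F.interpolate(w_d[k] * d, scale_factor=2,
                           mode='bilinear', align_corners=False)
        d = convs[k](w_i[k] * i_feats[k] + up)         # weighted add-and-convolve
        decoder.append(d)
    return decoder                          # [d5, d4, d3, d2, d1]
```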

In order to give the perception part of the PR block a better perception effect on objects of different scales, we adopt the partial encoder design of the SSD algorithm [2016-ECCV-SSD], as shown in the left part of Fig. 3. The advantage of SSD is that it can detect objects at multiple scales. We use its 6th, 7th, and 8th layer structure and combine the 4th, 7th, and 8th layer features by concatenation to realize multi-layer perception.

III-B2 Exemplars (CFE-PR, CBAM-PR, and EGNet-PR)

In order to verify the universality of the PR block, we apply it to the context-aware feature extraction module (CFE) in Fig. 2 (e) [2019-CVPR-PFANet] and the CBAM module in Fig. 2 (f) [2018-ECCV-CBAM]. The features to be fused in the FPN and GGS structures are features of different scales and depths, while the features to be fused in the CFE-PR module have different receptive fields.

Different from [2019-CVPR-PFANet], all convolutions in the CFE of this paper are 3x3, and their dilation rates are 1, 3, 5, and 7, respectively. CFE is used at the i3, i4, and i5 positions of the basic FPN structure (following [2019-CVPR-PFANet]) to enhance the output features, and the PR block is used for internal regulation, as shown in Fig. 2 (e).

As complementary attention modules, channel-wise attention and spatial attention are used to calibrate features at the spatial and channel levels, and they can learn 'what' and 'where' to attend along the channel and spatial axes, respectively [2020-TPAMI-SENet][2018-arXiv-BAM][2018-ECCV-CBAM]. The PR block can be considered a third type of attention, which focuses on the influence of a whole feature on the network (FPN, GGS) or module (CFE, CBAM). The perception part of the PR block analyzes the global context information of high-level features and strengthens or weakens the features to be fused from the perspective of the global needs of the network, so we call it global attention in Fig. 2 (f). We use the PR block together with channel attention and spatial attention to further optimize the attention mechanism. The regulation part of the PR block is added to the attention branches of CBAM in Fig. 2 (f), and five CBAM-PR modules are added at the i1, i2, i3, i4, and i5 positions of the basic FPN structure.

We also add the PR block to the final output position of EGNet [2019-ICCV-EGNet] for perception and regulation, and the final result is further improved. Fig. 2 (g) is a simplified structure diagram of EGNet, and we add the PR block at its final output location (FPN structure). Lines 11-16 of Tab. I prove the effect of the PR block in the above modules and networks.

Fig. 5: Imitating human eye observation module. The blue area is the peripheral vision module, which is applied twice (in steps 1 and 3) so that the peripheral vision feature quickly obtains a global receptive field. Under the regulation of the PR block, the peripheral vision feature, the fovea vision feature, and the original feature are fused.

IV Imitating Eye Observation Module

The purpose of the imitating human eye observation module (IEO) is to quickly and accurately find and locate salient objects. IEO uses a partition search strategy (Fig. 5, step 1) to enable the network to focus on the analysis of local areas. Then it uses an integration operation (Fig. 5, step 2) to associate the local feature analysis results with global information. Inspired by the human perception of foveal vision and peripheral vision [PV1], we propose a peripheral vision module (PVM) (Fig. 5, blue region) to cooperate with the foveal vision and achieve an accurate understanding of the salient objects in step 3. The peripheral vision analysis strategy of PVM is applied in both the partition search (step 1) and the global search (step 3). Step 2 is a bridge that makes the perception range of peripheral vision wider.

Fig. 6: The receptive field of features in the peripheral vision module (PVM). The green, orange, and purple squares represent the features F1, F2 (F3), and F3' in Fig. 5, respectively. The receptive fields of the green pixel in F2 and the orange pixel in F3' are indicated by the guide arrows, respectively. The receptive field of the orange pixel in F3' can cover the dark green areas in F1. The atrous convolution (blue square on the far right) is used for comparison with PVM.

In step 1, we divide the input feature into four regions in the spatial dimension and analyze each region separately with PVM. This mimics the human partition search strategy, which helps to find objects that are not in the center of view. In addition, this operation also helps to form the global receptive field (we will explain it in detail in step 3). Because the sub-regions are partitioned again and then all pass through the PVM module, we only show the implementation details of one of them in Fig. 5 (step 1) and Eq. 7-10. The features obtained by partitioning in the spatial dimension are reorganized in the channel dimension in Eq. 8.

(6)
(7)
(8)
(9)

where the four operators refer to spatial-dimension partition, channel-dimension partition, spatial-dimension concatenation, and channel-dimension concatenation, respectively; the first convolution is the 3x3 convolution 1 in PVM, activated with ReLU, and the second is the 3x3 convolution 2 in PVM, activated with sigmoid.

In step 2, we merge the results of the partition search (Eq. 10) in the spatial dimension and then use a concatenation operation to merge the result with the original feature in the channel dimension. Step 2 strengthens the association between the region search feature and the original feature.

(10)
(11)


In step 3, we perform the peripheral visual perception again on the merged feature. This process is the same as the previous Eq. 7-9. PVM can be considered a special attention mechanism: it expands the receptive field by comparing the features at corresponding positions in other partitions and then corrects the original features in the spatial dimension (F1→F2 in Fig. 6). Eq. 9 shows the combination of the attention branch and the primitive branch [2018-ECCV-CBAM][2018-arXiv-BAM]. PVM can also be considered a special atrous convolution that has only four sampling points and a very large atrous rate (Fig. 6, F3→F3'). The PVM here needs to be used on features with a large receptive field; otherwise, the spatial information discontinuity caused by a large atrous rate will appear. The problem can be solved by partitioning and by stacking PVMs (F1→F2→F3). The partition search strategy allows our IEO module to be used on features with a smaller receptive field, which overcomes the limitation of the global perception strategy in [2019-CVPR-AFNet]. If we want to use the IEO for shallower features, we can further decompose the four sub-features (left side of Fig. 5) into smaller sub-regions for partition search and integration before steps 1 and 2. To keep the network simple, we only use partitioning and integration once in IEO, and we apply IEOs to features i3, i4, and i5 in the PRNet of Fig. 3.
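The partition-and-compare behavior of PVM can be sketched as follows; the exact tensor layout (quadrants stacked along the channel axis, attention multiplied back onto the primitive branch) is an interpretation of Fig. 5 and Fig. 6, and even spatial sizes are assumed.

```python
import torch
import torch.nn as nn

class PVM(nn.Module):
    """Sketch of the peripheral vision module: the four spatial quadrants of a
    feature are stacked along the channel axis so that a 3x3 convolution can
    compare corresponding positions across quadrants; a sigmoid map then
    re-weights the original feature (attention branch + primitive branch)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1),
            nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1),
            nn.Sigmoid())

    def forward(self, f):
        b, c, h, w = f.shape                     # assumes even h and w
        # spatial partition into quadrants, reorganised along the channel axis
        q = torch.cat([f[:, :, :h // 2, :w // 2], f[:, :, :h // 2, w // 2:],
                       f[:, :, h // 2:, :w // 2], f[:, :, h // 2:, w // 2:]], dim=1)
        a = self.conv2(self.conv1(q))            # cross-quadrant attention map
        # put the four attention maps back in their spatial positions
        a1, a2, a3, a4 = torch.chunk(a, 4, dim=1)
        attn = torch.cat([torch.cat([a1, a2], dim=3),
                          torch.cat([a3, a4], dim=3)], dim=2)
        return f * attn + f                      # attention + primitive branch
```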

The disadvantage of atrous convolution with a large atrous rate is that the information given by spatial continuity (such as edges) may be lost, because the sampling points of the atrous convolution are too scattered and the information at a single sampling point is not continuous enough. This is similar to the human eye being unable to attend to the details of objects with only peripheral vision. The human eye diagram on the left of Fig. 5 shows the foveal vision cells and peripheral vision cells. Foveal visual cells are rich and concentrated, and they can observe the details of an object. The peripheral visual cells are sparse and widely distributed, and they can organize a broad spatial scene [PV2][PV3]. They are like convolutions with different atrous rates. Balancing these two capabilities is the purpose of the IEO module design.

The dark orange dots in F3 of Fig. 6 can be regarded as four groups of sampling points of PVM. One sampling point of PVM consists of 9 points (3x3 Conv). Therefore, PVM has both a large atrous rate and the ability to observe details. It is worth noting that the 3x3 convolution here could also be an atrous convolution with a small atrous rate; for simplicity, we use a normal convolution. The receptive field of the dark orange pixel in F3' can cover the dark green areas in F1, and the receptive field of F3' is global. In addition, we add the fovea vision feature (Fig. 5, the red arrow) to cooperate with the P-feature. In the last stage of step 3, we use a concatenation operation to merge the original feature, the fovea vision feature (r = 5), and the peripheral vision feature. Finally, we use the PR block to balance the relationship among the three types of features.

(12)
(13)
(14)

where the first term repeats Eq. 7-9 for the merged feature and the weights are produced by the memory units. Inspired by the classification network, Eq. 13 uses a softmax layer to associate the weights; SAC uses a similar approach (1-S) to associate partial weights [2020-arxiv-detectRS]. The three regulated features are the peripheral vision feature, the fovea vision feature (the original feature after an atrous convolution operation), and the original feature. The atrous rate of the convolution producing the foveal vision feature is r = 5, which is chosen according to the experimental effect.
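A sketch of this final fusion stage is given below; the small weight head standing in for the PR memory units and the merging convolution are assumptions, while the atrous rate 5 and the softmax association of the three weights follow the text.

```python
import torch
import torch.nn as nn

class IEOFuse(nn.Module):
    """Sketch of the last stage of the IEO module: the original feature, a
    foveal feature (3x3 atrous convolution, rate 5) and the peripheral-vision
    feature are scaled by softmax-associated weights and then merged."""
    def __init__(self, channels=64):
        super().__init__()
        self.fovea = nn.Conv2d(channels, channels, 3, padding=5, dilation=5)
        # tiny weight head standing in for the PR memory units (assumption)
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 3))
        self.merge = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, original, peripheral, top_feature):
        # one raw weight per feature type, associated by softmax (Eq. 13 in spirit)
        w = torch.softmax(self.weight_head(top_feature), dim=1)   # (B, 3)
        feats = [original, self.fovea(original), peripheral]
        scaled = [w[:, k].view(-1, 1, 1, 1) * f for k, f in enumerate(feats)]
        return self.merge(torch.cat(scaled, dim=1))
```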

V Supervision

We use the widely used binary cross entropy (BCE) loss and a consistency-enhanced loss (CEL) [2020-CVPR-MINet] to supervise the prediction map, as shown in Eq. 15.

L = L_bce + L_cel   (15)

BCE loss is defined as:

L_bce = − Σ_{(x,y)} [ g(x, y) log p(x, y) + (1 − g(x, y)) log(1 − p(x, y)) ]   (16)

where p(x, y) is the prediction result of the saliency map at pixel (x, y) and g(x, y) is the ground-truth label of that pixel.

CEL loss is a variant of the IoU loss, which can measure the similarity of two images from an overall perspective. It is defined as:

L_cel = ( Σ p(x, y) + Σ g(x, y) − 2 Σ p(x, y) g(x, y) ) / ( Σ p(x, y) + Σ g(x, y) )   (17)
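As a reference, the supervision of Eq. 15-17 can be written as a small PyTorch function; the CEL term below follows the usual IoU-style formulation attributed to [2020-CVPR-MINet] and should be checked against the original implementation.

```python
import torch
import torch.nn.functional as F

def sod_loss(pred, gt, eps=1e-6):
    """Sketch of the supervision in Eq. 15: BCE plus the consistency-enhanced
    loss (CEL). `pred` is the sigmoid saliency map and `gt` the binary GT,
    both of shape (B, 1, H, W) with values in [0, 1]."""
    bce = F.binary_cross_entropy(pred, gt)
    inter = (pred * gt).sum(dim=(1, 2, 3))
    total = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3))
    cel = ((total - 2 * inter) / (total + eps)).mean()   # overall dissimilarity
    return bce + cel
```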

VI Experiment

VI-A Datasets and Evaluation Metrics

We evaluate the proposed architecture on 5 SOD datasets: DUTS [2017-CVPR-DUTS] with 10,553 training and 5,019 test images, DUT-OMRON [LiThe] with 5,168 images, ECSSD [2013-CVPR-Hierarchical] with 1,000 images, PASCAL-S [LiThe] with 850 images, HKU-IS [2015-CVPR-HKU-IS] with 4,447 images. We follow the data partition of [2020-CVPR-MINet][2019-TPAMI-DSS] to use 1,447 images of HKU-IS for testing.

In addition, we define a large salient object dataset (L) and a small salient object dataset (S). They are helpful for further analyzing the dynamic regulation of the PR block when dealing with objects of different scales. We select large object images (1270) and small object images (1576) based on the ratio of white pixels in the GT, as shown in Eq. 18 and Fig. 7. The pictures and ground-truth labels are selected from five common test datasets (ECSSD, PASCAL-S, DUT-OMRON, DUTS, HKU-IS). N_w is the number of white pixels in the GT and N_b is the number of black pixels. T is the threshold. We set T_L and T_S to 0.38 and 0.03, respectively, to obtain datasets L and S.

R = N_w / ( N_w + N_b ),  where an image joins dataset L if R > T_L and dataset S if R < T_S   (18)
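The selection rule of Eq. 18 amounts to thresholding the white-pixel ratio of each GT mask; a minimal sketch (the 0-255 mask format and function name are assumptions):

```python
import numpy as np

def classify_by_object_size(gt_mask, t_large=0.38, t_small=0.03):
    """Sketch of the split in Eq. 18: an image joins dataset L or S according to
    the fraction of white (salient) pixels in its ground-truth mask."""
    gt = (np.asarray(gt_mask) > 127)            # binarise a 0-255 GT mask
    ratio = gt.sum() / gt.size                  # N_w / (N_w + N_b)
    if ratio > t_large:
        return 'L'                              # large salient object
    if ratio < t_small:
        return 'S'                              # small salient object
    return None                                 # belongs to neither subset
```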
Fig. 7: On the left are images of a large salient object dataset. On the right are images of a small salient object dataset.

We use six metrics to evaluate the performance of PRNet and other state-of-the-art models. The mean absolute error (MAE) [perazzi2012saliency] measures the average pixel-level relative error between the prediction and the GT by calculating the mean of the absolute value of their difference. The F-measure (Fβ) [2009-CVPR-Frequency] is also widely adopted in previous models [2019-ICCV-EGNet][2019-TPAMI-DSS]. Fβ is the weighted harmonic mean of precision and recall, and β² is usually set to 0.3. The maximal values calculated from the PR curves are reported as Fmax, and an adaptive threshold (twice the mean value of the prediction) is adopted to calculate Fadp. The weighted F-measure (Fw) is a measure of completeness that improves the F-measure [2014-CVPR-FW]. The structural similarity measure (Sm) [2017-ICCV-Sm] and the E-measure (Em) [2018-IJCAI-Em] are also useful for the quantitative evaluation of saliency maps. Besides, precision-recall (PR) curves are drawn.

VI-B Implementation

We follow [2020-AAAI-F3][2020-CVPR-MINet][2019-ICCV-SCRN] in using DUTS-TR [2017-CVPR-DUTS] as the training dataset, and the other above-mentioned datasets are used as testing datasets. In the training phase, we follow [2020-CVPR-MINet] and use random horizontal flipping, random color jittering, and random rotating as data augmentation techniques to prevent over-fitting. PRNet is trained for 40 epochs on an NVIDIA RTX 2080Ti GPU, and the batch size is set to 4. VGG-16, pre-trained on the ImageNet dataset, is used as the backbone network. The parameters of the rest of PRNet are initialized by the default setting of PyTorch. Our model adopts the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 0.0005, and an initial learning rate of 0.001. The "poly" strategy [2015-CS-ParseNet] (factor of 0.9) is applied. During testing, the input size is set to 320x320.
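The optimizer and "poly" schedule described above can be set up as follows; this sketch applies the decay per epoch, which is an assumption (the schedule may instead be applied per iteration).

```python
import torch

def build_optimizer_and_poly_lr(model, base_lr=0.001, total_epochs=40, power=0.9):
    """Sketch of the training schedule described above: SGD with momentum 0.9
    and weight decay 5e-4, plus poly decay lr = base_lr * (1 - t/T)**0.9."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: (1 - epoch / total_epochs) ** power)
    return optimizer, scheduler
```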

Model                                  DUTS-TE                      DUT-OMRON
                                       Fw    Em    Sm    MAE        Fw    Em    Sm    MAE
1.  FPN                                .725  .859  .842  .057       .629  .813  .780  .078
2.  FPN(PR1)                           .773  .890  .857  .047       .666  .842  .793  .067
3.  FPN(PR2)                           .764  .887  .852  .048       .658  .839  .788  .068
4.  FPN(PR3)                           .762  .883  .853  .048       .655  .836  .788  .069
5.  FPN(PR-fixed)                      .735  .866  .840  .053       .617  .810  .769  .078
6.  FPN(GateNet) [2020-ECCV-GateNet]   .744  .886  .837  .051       .624  .839  .767  .071
7.  FPN+AIMs [2020-CVPR-MINet]         .768  .884  .860  .047       -     -     -     -
8.  FPNssd(PR2)                        .774  .889  .861  .047       .675  .839  .802  .067
9.  GGS                                .751  .880  .845  .051       .641  .828  .777  .073
10. GGS(PR2)                           .766  .892  .851  .047       .651  .841  .783  .068
11. FPN+CBAM [2018-ECCV-CBAM]          .768  .888  .856  .048       .661  .837  .792  .069
12. FPN+CBAM(PR2)                      .775  .894  .858  .046       .666  .847  .793  .066
13. FPN+CFE [2019-CVPR-PFANet]         .793  .896  .867  .044       .688  .849  .805  .062
14. FPN+CFE(PR2)                       .798  .902  .869  .043       .700  .856  .812  .060
15. EGNet [2019-ICCV-EGNet]            .802  .897  .878  .044       .728  .864  .836  .057
16. EGNet(PR2) [2019-ICCV-EGNet]       .806  .900  .881  .042       .731  .866  .837  .056
17. CFD [2020-AAAI-F3]                 .778  .890  .858  .047       .674  .836  .796  .066
18. CFD(PR2) [2020-AAAI-F3]            .790  .891  .861  .044       .681  .842  .799  .062
TABLE I: Ablation experiments on various structures and modules. Because strategy 2 (PR2) performs best in the final network structure (PRNet), to simplify the experiments, we use strategy PR2 for the PR block in all experiments after line 4.
Model                                  DUTS-TE                      DUT-OMRON
                                       Fw    Em    Sm    MAE        Fw    Em    Sm    MAE
1. GGS(PR2)                            .755  .877  .859  .050       .660  .837  .801  .069
2. GGS(PR2) + IEO                      .793  .901  .868  .043       .692  .851  .809  .062
3. GGS(PR2) + IEO(PR2)                 .802  .908  .872  .041       .698  .857  .812  .059
4. GGS(PR2) + IEO(PR2)                 .794  .903  .868  .042       .690  .856  .807  .060
5. GGS(PR2) + CFE(PR2)                 .794  .902  .869  .043       .696  .856  .812  .061
6. w/o CEL                             .765  .879  .865  .048       .669  .840  .808  .067
TABLE II: Ablation experiment of PRNet. Because strategy 2 (PR2) performs best in the final network structure (PRNet, 3rd line), to simplify the experiment, PR2 is used in every step of the ablation experiment; lines 3 and 4 differ in the multi-level perception of the PR block. The model in the 6th line is the same as that in the 3rd line, but its loss only uses BCE loss (CEL loss is removed).
Model                                  DUTS-TE                      DUT-OMRON
                                       Fw    Em    Sm    MAE        Fw    Em    Sm    MAE
1. IEO in i5                           .790  .901  .867  .043       .684  .849  .805  .063
2. IEO in i4, i5                       .792  .905  .867  .042       .682  .850  .803  .063
3. IEO in i3, i4, i5                   .802  .908  .872  .041       .698  .857  .812  .059
4. IEO in i2, i3, i4, i5               .797  .904  .870  .042       .699  .859  .813  .060
5. IEO in i1, i2, i3, i4, i5           .798  .903  .871  .042       .699  .856  .813  .061
TABLE III: Ablation experiment of the IEO module. We adjust the location and number of IEO modules on the basis of PRNet(PR2). The experiments show that the best locations are i3, i4, and i5.
Model                                  DUTS-TE                      DUT-OMRON
                                       Fw    Em    Sm    MAE        Fw    Em    Sm    MAE
1. PRNet(PR1)                          .794  .904  .868  .042       .690  .852  .807  .062
2. PRNet(PR2)                          .802  .908  .872  .041       .698  .857  .812  .059
3. PRNet(PR3)                          .792  .902  .868  .043       .686  .851  .806  .062
TABLE IV: Comparison of the three perception strategies (PR1, PR2, PR3) in PRNet.

VI-C Ablation Studies

Ablation analysis of PR block in different structures and modules. Lines 1-4 of Tab. I analyze the effect of the PR block with different perception strategies in the FPN structure; PR1, PR2, and PR3 correspond to strategies 1, 2, and 3 in Fig. 4. PR1 has the best regulation effect in the FPN structure because of its rich parameters and the intensive interaction analysis in the fully-connected layers. But in the final network structure (PRNet), PR2 is the best (Tab. IV); we explain this phenomenon in the later analysis. In order to simplify the experiments, we uniformly use PR2, the best strategy in the final network, to carry out the comparison experiments and the ablation experiments in Tab. I, Tab. II, and Tab. III. Line 6 of Tab. I is the gate strategy provided by GateNet [2020-ECCV-GateNet], and its effect is not as good as that of the PR block with global perception and regulation. It is worth noting that line 7 of Tab. I is the AIMs module of MINet [2020-CVPR-MINet]; AIM has a more complex interaction structure than the PR block, but the regulation effect of the PR block is better. Line 8 verifies the effect of the multi-level perception strategy in the FPN structure: FPNssd(PR2) means that FPN-PR uses the encoder structure shown on the left side of Fig. 3. Lines 9-18 verify the improvement brought by the PR block to the GGS structure (the lower right corner of Fig. 3), the CFE module (Fig. 2 (e)), the CBAM module (Fig. 2 (f)), the EGNet network (Fig. 2 (g)), and the CFD decoder (Fig. 2 (b)).

Ablation analysis of the PRNet. Tab. II shows the ablation experiment of PRNet (Fig. 3). The baseline is GGS(PR2). IEO improves the network performance greatly in line 2, but the allocation between the foveal vision feature and the peripheral vision feature is not balanced. The performance of IEO is further improved after being regulated by the PR block (line 3). The right side of Fig. 5 shows how the weights obtained by the perception part regulate the features of IEO. Lines 3 and 4 show the effect of multi-level perception. The experiment in the 5th line replaces the IEO-PR module in the 3rd line with the CFE-PR module (Fig. 2 (e)), which proves the effectiveness of the IEO-PR module. The model in the 6th line is the same as that in the 3rd line, but it is only supervised by BCE loss (CEL loss is removed). The 3rd and 6th experiments show the effect of CEL loss.

It is worth noting that, as shown in Fig. 3, the IEO and CFE modules in Tab. II are placed at the i3, i4, and i5 positions, which follows the setting of [2019-CVPR-PFANet]. The peripheral vision module in the IEO module has a large atrous rate, so it is better to use it on high-level features with a larger receptive field. The experiments in Tab. III also show that this scheme is the best for the IEO-PR module. Besides, reducing the number of IEO modules is conducive to simplifying the model and improving the speed.

Perception strategy analysis. Lines 1-4 of Tab. I show that strategy 1 (PR1) is the best in simple structures (FPN), while strategy 2 (PR2) performs best in complex structures (PRNet), as shown in Tab. IV.

The fully-connected layers in PR1, which are inspired by the classification network, make the weights strongly coupled and correlated, as shown in Fig. 4. In addition, the FC layers have many parameters, which also helps to recalibrate the weights of features with obvious differences in the simple structure (FPN). But for the complex network PRNet, the difference between the features to be fused becomes smaller. Take the rightmost feature fusion process of PRNet (Fig. 3) as an example: because d4 is the fusion feature of g2 (d5) and i4, the difference between g2 and d4 is reduced. The fully-connected layers in strategy 1 over-interpret the weights, which makes the effect of the PR block worse. Fully-connected layers are also used in strategy 3 (PR3), so it has the same problem. The MU of strategy 3 evaluates each feature weight independently, which is different from the strong coupling in strategy 1. Strategy 3 can be considered a simplified version of strategy 1 because its fully-connected layer has fewer parameters than that of strategy 1. Strategy 1 performs better on simple networks (FPN, CFD), and strategy 2 performs better on complex networks (PRNet, EGNet). Besides, we analyze the weights of strategies 1 and 2 in the FPN structure, as shown in Fig. 8. We find that strategy 1 is radical and sensitive, while strategy 2 is conservative and restricted. These characteristics result in the performance differences between the two strategies in simple and complex network structures.

Fig. 8: The effect analysis of strategy 1 (PR1) and strategy 2 (PR2) in the FPN architecture. The pie chart and line chart show the statistical results of the feature weights on five test datasets. The pie chart presents the proportion of each feature weight, and the line chart presents the concrete numerical values. F1-F5 represent the five feature fusion processes of FPN; F1 represents the fusion of i1 and d2. The blue regions represent the weight ratio of the decoder features (d2, d3, d4, d5). The pink regions represent the weight ratio of the interlayer features (i1, i2, i3, i4). For the definitions of i1 and d2, please refer to the FPN-PR in the upper right corner of Fig. 3.

PR2 directly uses global average pooling to reduce the spatial dimension (H, W) to (1, 1), which is beneficial for complex networks (PRNet) whose features to be fused differ little. Because there is no fully-connected layer, the final weights are more directly and closely related to the spatial features of salient objects. PR2 can also prevent the overfitting of the fully-connected layers in strategies 1 and 3.

The PR block is originally inspired by the classification network, and we find that Network In Network (NIN) [2014-ICLR-NIN] verifies this phenomenon from the perspective of the classification task: NIN explains in detail the advantages of global average pooling as a replacement for the fully-connected layer. According to the above analysis, we can use strategy 1 to regulate feature fusion processes with large feature differences, while strategy 2 can be used for feature fusion processes with small feature differences. In addition, it should be noted that strategy 1 is not suitable for all feature fusion processes: when the difference between the features to be fused is very small, radical weight regulation may have a negative impact.

Fig. 9: Weight analysis. For FPN-PR and GGS-PR, we use pie charts and line charts to show the average values of multiple fusion position weights of 5 test datasets and the training dataset. In the line chart, we use 1270 images of large objects and 1576 images of small objects respectively to show how the PR block regulates the network in the segmentation task of objects of different scales.

Feature weight analysis. To further analyze how the PR block works, we show the average weight of each fusion position in the pie charts and line charts in Fig. 9. F1-4 and FG1-3 represent the regulated points of FPN-PR and GGS-PR (Fig. 3). The pie charts show the results on the training dataset and the 5 test datasets. The line charts show the results on the 5 test datasets as well as on the 1270 pictures with large objects and the 1576 pictures with small objects. The images of large and small objects are obtained by setting a threshold on the area ratio of white pixels in the GT. Blue, red, and purple regions and lines represent decoder features, interlayer features, and global features, respectively. The grey line indicates that the weight of a feature without PR block regulation is 1.

We use dotted circles to indicate the positions where the weights of the PR block change obviously. This change is strongly related to the size of the salient object. It is worth noting that the highest-level features (d5) of small objects are specially enhanced (dotted circle in the small object chart) to prevent dilution, while the lowest-level features (i1) of large objects are sufficiently suppressed to prevent interference. The left side of Fig. 9 shows the effect of the PR block. In order to verify that the dynamic regulation of weights is meaningful, we lock the weights in line 5 of Tab. I. The weights of FPN(PR-fixed) are obtained from the average weights on the training dataset.

Each dataset block lists Fmax, Fadp, Fw, Em, Sm, and MAE, in that order.

Model      | DUTS-TE                       | DUT-OMRON                     | HKU-IS                        | ECSSD                         | PASCAL-S
DCL        | .782 .712 .408 .839 .734 .088 | .757 .695 .583 .830 .772 .080 | .757 .695 .583 .830 .772 .080 | .901 .874 .790 .908 .870 .068 | .830 .776 .685 .836 .794 .108
Amulet     | .778 .671 .655 .798 .803 .085 | .743 .647 .625 .784 .780 .098 | .897 .842 .816 .914 .884 .051 | .915 .869 .841 .912 .893 .059 | .841 .771 .741 .831 .821 .098
NLDF       | .813 .739 .710 .855 .816 .065 | .753 .684 .634 .817 .770 .080 | .902 .872 .839 .929 .878 .048 | .905 .878 .839 .912 .875 .063 | .833 .782 .742 .842 .804 .099
UCF        | .771 .624 .586 .766 .777 .117 | .735 .613 .564 .763 .758 .132 | .888 .810 .753 .893 .866 .074 | .911 .840 .789 .888 .883 .078 | .830 .708 .681 .787 .803 .126
MSRNet     | .829 .703 .720 .840 .839 .061 | .782 .676 .669 .820 .808 .073 | .914 .857 .853 .935 .902 .040 | .911 .839 .849 .905 .895 .054 | .858 .747 .769 .837 .841 .081
DSS        | .826 .789 .755 .885 .824 .056 | .772 .729 .691 .846 .788 .066 | .910 .894 .864 .938 .879 .041 | .916 .900 .871 .924 .882 .053 | .839 .807 .760 .851 .797 .096
BMPM       | .852 .745 .761 .863 .862 .049 | .774 .692 .681 .839 .809 .064 | .920 .871 .860 .938 .907 .039 | .928 .868 .871 .916 .911 .045 | .864 .771 .785 .847 .845 .075
RAS        | .831 .751 .740 .864 .839 .059 | .787 .713 .695 .849 .814 .062 | .913 .871 .843 .931 .887 .045 | .921 .889 .857 .922 .893 .056 | .838 .787 .738 .837 .795 .104
PAGRN      | .854 .784 .724 .883 .838 .056 | .771 .711 .622 .843 .775 .071 | .919 .887 .823 .941 .889 .047 | .927 .894 .834 .917 .889 .061 | .858 .808 .738 .854 .817 .093
C2S        | .811 .717 .717 .847 .831 .062 | .759 .682 .663 .828 .799 .072 | .898 .851 .834 .928 .886 .047 | .911 .865 .854 .915 .896 .053 | .857 .775 .777 .850 .840 .080
PoolNet*   | .876 -    -    -    -    .043 | .817 -    -    -    -    .058 | .928 -    -    -    -    .035 | .936 -    -    -    -    .047 | .857 -    -    -    -    .078
AFNet      | .863 .793 .785 .895 .867 .046 | .797 .739 .717 .859 .826 .057 | .925 .889 .872 .949 .906 .036 | .935 .908 .886 .941 .913 .042 | .871 .828 .804 .887 .850 .071
MLMSNet    | .852 .745 .761 .863 .862 .049 | .774 .692 .681 .839 .809 .064 | .920 .871 .860 .938 .907 .039 | .928 .868 .871 .916 .911 .045 | .864 .771 .785 .847 .845 .075
PAGE       | .838 .777 .769 .886 .854 .052 | .792 .736 .722 .860 .825 .062 | .920 .884 .868 .948 .904 .036 | .931 .906 .886 .943 .912 .042 | .859 .817 .792 .879 .840 .078
BANet-V    | .852 .789 .781 .891 .861 .046 | .793 .731 .719 .856 .823 .061 | .920 .887 .871 .948 .903 .036 | .935 .910 .890 .944 .913 .041 | .867 .826 .799 .879 .841 .078
HRS        | .843 .793 .746 .889 .829 .051 | .762 .708 .645 .842 .772 .066 | .913 .892 .854 .938 .883 .042 | .920 .902 .859 .923 .883 .054 | .852 .809 .748 .850 .801 .090
GATE-V     | .870 .783 .786 .888 .870 .045 | .794 .724 .704 .854 .821 .061 | .928 .889 .872 .948 .909 .035 | .941 .896 .886 .931 .917 .041 | .882 .810 .807 .870 .856 .070
ITSD-V     | .877 .798 .814 .893 .877 .042 | .807 .745 .734 .858 .829 .063 | .926 .891 .882 .947 .907 .035 | .939 .875 .897 .918 .914 .040 | .884 .787 .824 .857 .858 .068
FCNet      | .829 .795 .757 .887 .822 .045 | .717 .676 .618 .795 .745 .066 | -    -    -    -    -    -    | -    -    -    -    -    -    | .857 .830 .802 .882 .830 .068
HVPNet     | .840 .749 .730 .863 .849 .058 | .804 .721 .700 .847 .831 .065 | .916 .871 .840 .936 .899 .045 | .928 .889 .855 .924 .903 .052 | .849 .794 .753 .850 .827 .091
CAGNet-V   | .851 .820 .797 .900 .852 .045 | .782 .743 .718 .858 .807 .057 | .922 .906 .888 .948 .899 .033 | .930 .911 .892 .932 .897 .042 | .860 .828 .799 .874 .825 .077
SAMNet     | .836 .745 .729 .864 .849 .058 | .803 .717 .699 .847 .830 .065 | .915 .870 .837 .938 .898 .045 | .928 .891 .858 .930 .907 .050 | .850 .790 .747 .849 .824 .093
PRNet      | .877 .815 .802 .908 .872 .041 | .789 .731 .698 .857 .812 .059 | .930 .906 .885 .956 .910 .033 | .936 .913 .890 .941 .910 .041 | .884 .843 .815 .893 .856 .067
TABLE V: Quantitative evaluation. We compare PRNet with 22 SOD methods on 5 SOD datasets in terms of the maximum, adaptive, and weighted F-measure (larger is better), E-measure and S-measure (larger is better), and MAE (smaller is better). -V: VGG-16 is used as the backbone for algorithms that provide multiple backbones. *: We compare the PoolNet variant trained only on the DUTS dataset. -: The authors do not provide test results on the corresponding datasets.
Fig. 10: Precision-Recall curves (1st row) and F-measure curves (2nd row) on five common saliency datasets.

Lines 1, 4, and 5 of Tab. I show that it is effective to suppress low-level features with fixed weights, but it is better to regulate the weights according to the analysis results of the PR block.

VI-D Comparison with State-of-the-Arts

We compare PRNet against 22 SOD state-of-the-art methods, including DCL [2016-CVPR-DCL], NLDF [Luo2017Non], MSRNet [2017-CVPR-MSRNet], DSS [2019-TPAMI-DSS], BMPM [2018-CVPR-BMPM], RAS [2020-TIP-RAS], PAGRN [2018-CVPR-PAGRN], C2S [2018-ECCV-C2S], PAGE [2019-CVPR-PAGE], BANet [2019-ICCV-BANet], AFNet [2019-CVPR-AFNet], GateNet [2020-ECCV-GateNet], ITSD [2020-CVPR-ITSD], FCNet [2020-NIPS-FCNet], HVPNet [2020-TCYB-HVPNet], CAGNet [2020-PR-CAGNet], SAMNet [2021-TIP-SAMNet], etc.

For fair comparisons, we use all saliency maps provided by the authors or generated by their codes. PoolNet [2019-CVPR-PoolNet] (with StdEdge) adds another dataset (BSDS500) for joint training, which makes the comparison unfair, so we compare PRNet with PoolNet (with SalEdge, trained only on the DUTS dataset) in Tab. V, and the experimental results show that our algorithm is better. Our PRNet (only 130M) is a simple network, so we do not compare it with EGNet (434M) [2019-ICCV-EGNet] and MINet (650M) [2020-CVPR-MINet], which have much larger numbers of parameters.

Quantitative evaluation. Tab. V shows the scores of the proposed model and 22 state-of-the-art saliency detection methods on five widely used datasets, and it demonstrates that the perception-and-regulation strategy successfully makes a simple network perform favorably against other algorithms. Moreover, the PR curves of our approach outperform those of other methods, as shown in Fig. 10.

Qualitative evaluation. Fig. 11 shows visual examples produced by our model and other models. From the 1st row to the 10th row, the size of the salient object gradually changes from the largest to the smallest. Our algorithm is effective in dealing with multi-scale object detection, and the proposed method performs better in various challenging scenarios, including small, medium-sized, and large objects. Fig. 12 shows the visualization results of the whole process; the difference between the FPN network with a PR block and the FPN network without one is clearly shown. Besides, we provide some failure cases of our algorithm in the supporting document to help future researchers conduct further analysis.

Fig. 11: Qualitative comparisons with state-of-the-art algorithms. From top to bottom, the size of salient objects gradually decreases.
Fig. 12: Visual analysis of PR block regulation process. We show two examples (more examples can be found in the supporting document). For the feature analysis of the bird, the 1st and 2nd rows are the decoder features d1-d5 of FPN and the decoder features d1-d5 of FPN-PR, respectively. The 3rd row is ground truth, the input image and encoder (VGG-16) features i1-i4. The 1st column is the final output saliency maps of FPN and FPN-PR. In the red box is the feature fusion process of FPN-PR. The gray (encoder), red (interlayer features), and blue (decoder features) arrows correspond to the FPN-PR in the upper right corner of Fig.3. The white font is the result of weight regulated by PR block (PR).

VII Conclusions

In this paper, we propose a novel framework PRNet for salient object detection. A PR block is designed to help the network understand the global information and assign the feature weights spontaneously and adaptively. To better perceive semantic information and reasonably allocate weight, we propose 3 perception strategies and carry out comparative experiments. Through experiments, we verify the different application scenarios of different strategies. Considering the relationship between local perception and global perception, we propose an IEO module to help the network have the ability to organize a wide space scene and scrutinize highly detailed objects. Sufficient experiments demonstrate that PRNet performs well. In the future, we may expand PRNet to more complex structures, such as recurrent structure networks and multi-modal SOD networks (RGB-T, RGB-D).

References