
Unifying Global-Local Representations in Salient Object Detection with Transformer

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model to be deep enough to obtain a global receptive field, and such depth always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure the globality of the representations from shallow to deep layers. With a global view even in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers implicitly maximize representation differences and minimize redundant features, so that every output feature of the transformer layers contributes uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average improvement of 12.17% in terms of Mean Absolute Error (MAE). Code will be available at



1 Introduction

Figure 1: Examples of our method (panels: Input, GT, Ours, GateNet). We propose the Global-Local Saliency TRansformer (GLSTR) to unify global and local features in each layer. We compare our method with a state-of-the-art FCN-based method, GateNet [59]. Our method localizes salient regions precisely with accurate boundaries.

Salient object detection (SOD) aims at segmenting the most visually attracting objects of the images, aligning with the human perception. It gained broad attention in recent years due to its fundamental role in many vision tasks, such as image manipulation [27], semantic segmentation [25], autonomous navigation [7], and person re-identification [49, 58].

Traditional methods mainly rely on different priors and hand-crafted features such as color contrast [6], brightness [11], and foreground-background distinctness [15]. However, these hand-crafted features have limited representation power and carry no semantic information, so such methods often fail on complex scenes. With the development of deep convolutional neural networks (CNNs), the fully convolutional network (FCN) [24] became an essential building block for SOD [20, 40, 39, 51]. These methods can encode high-level semantic features and better localize salient regions.

However, due to the locality of convolution, methods relying on FCN face a trade-off between capturing global and local features. To encode high-level global representations, the model must stack many convolutional layers to obtain larger receptive fields, which erases local details. Meanwhile, the shallow layers that retain local details fail to incorporate high-level semantics. This discrepancy makes the fusion of global and local features less efficient. Is it possible to unify global and local features in each layer to reduce ambiguity and enhance accuracy for SOD?

In this paper, we aim to jointly learn global and local features in a layer-wise manner for solving the salient object detection task. Rather than using a pure CNN architecture, we turn to the transformer [35] for help. We are inspired by the recent success of transformers on various vision tasks [9], owing to their superiority in exploring long-range dependency. The transformer applies a self-attention mechanism in each layer to learn a global representation. Therefore, it is possible to inject globality into the shallow layers while maintaining local features. As illustrated in Fig. 2, the attention map of the red block has a global view of the chicken and the egg even in the first layer, while the network still attends to small-scale details in the final (twelfth) layer. More importantly, with the self-attention mechanism, the transformer is capable of modeling the "contrast", which has been demonstrated to be crucial for saliency perception [32, 28, 8].

Figure 2: The visual attention maps of the red block in the input image on the first and twelfth layers of the transformer (panels: Input, GT, Layer 1, Layer 12). The attention maps show that features from shallow layers can also carry global information and features from deep layers can also retain local information.

With the merits mentioned above, we propose a method called Global-Local Saliency TRansformer (GLSTR). The core of our method is a pure transformer-based encoder and a mixed decoder to aggregate the features generated by the transformers. To encode features through transformers, we first split the input image into a grid of fixed-size patches. We use a linear projection layer to map each image patch to a feature vector representing local details. By passing the features through the multi-head self-attention module in each transformer layer, the model further encodes global features without diluting the local ones. To decode the features, each of which carries global-local information over the inputs and the previous layer, we propose to densely decode the feature from each transformer layer. With this dense connection, the representations during decoding preserve rich local and global features. To the best of our knowledge, we are the first to apply a pure transformer-based encoder to the SOD task.

To sum up, our contributions are in three folds:

  • We unify global-local representations with a novel transformer-based architecture, which models long-range dependency within each layer.

  • To take full advantage of the global information of previous layers, we propose a new decoder, deeply-transformed decoder, to densely decode features of each layer.

  • We conduct extensive evaluations on five widely-used benchmark datasets, showing that our method outperforms state-of-the-art methods by a large margin, with an average improvement of 12.17% in terms of MAE.

2 Related Work

Salient Object Detection. FCN has become the mainstream architecture for saliency detection methods, predicting pixel-wise saliency maps directly [20, 40, 39, 51, 23, 55, 14, 31, 36]. In particular, the skip-connection architecture has demonstrated its effectiveness in combining global and local context information [16, 59, 22, 53, 26, 41, 34]. Hou et al. [16] observed that high-level features are capable of localizing salient regions while shallower ones are effective at preserving low-level details. Therefore, they introduced short connections between deeper and shallower feature pairs. Based on the U-shape architecture, Zhao et al. [59] introduced a transition layer and a gate into every skip connection to suppress the misleading convolutional information in the original encoder features. To better localize salient objects, they also built a Fold-ASPP module on top of the encoder. Similarly, Liu et al. [22] inserted a pyramid pooling module on top of the encoder to capture high-level semantic information. To recover the information diluted in the convolutional encoder and alleviate the coarse-level feature aggregation problem during decoding, they introduced a feature aggregation module after every skip connection. In [57], Zhao et al. adopted the VGG network without fully connected layers as their backbone. They fused the high-level feature into the shallower one at each scale to recover the diluted location information. After three convolution layers, each fused feature was enhanced by the saliency map. Specifically, they introduced the edge map as guidance for the shallowest feature and fused this feature into deeper ones to produce the side outputs.

The aforementioned methods usually suffer from the limited receptive field of the convolution operation. Although receptive fields become large in deep convolutional layers, low-level information is increasingly diluted as the number of layers grows.

Transformer. Since the local information covered by the convolution operation is deficient, numerous methods adopt attention mechanisms to capture long-range correlations [43, 1, 17, 56, 37]. The transformer, proposed by Vaswani et al. [35], has demonstrated the power of global attention in machine translation. Recently, more and more works explore the ability of transformers in computer vision tasks. Dosovitskiy et al. [9] proposed to apply a pure transformer directly to sequences of image patches to explore spatial correlations for image classification. Carion et al. [2] proposed a transformer encoder-decoder architecture for object detection. By collapsing the CNN features into a sequence of learned object queries, the transformer can dig out the correlations between objects and the global image context. Zeng et al. [50] proposed a spatial-temporal transformer network to tackle video inpainting along the spatial and temporal dimensions jointly.

The most related work to ours is [60], which replaces the convolution layers with a pure transformer during encoding for semantic segmentation. Unlike that work, we propose a transformer-based U-shape network that is more suitable for dense prediction tasks. With our dense skip-connections and progressive upsampling strategy, the proposed method can sufficiently associate the decoding features with the global-local ones from the transformer.

Figure 3: The pipeline of our proposed method. We first divide the image into non-overlapping patches and map each patch to a token before feeding it to the transformer layers. After encoding features through 12 transformer layers, we decode each output feature in three successive stages with 8×, 4×, and 2× upsampling respectively. Each decoding stage contains four layers, and the input of each layer comes from the features of its previous layer together with those of the corresponding transformer layer.

3 Method

In this paper, we propose the Global-Local Saliency TRansformer (GLSTR), which models local and global representations jointly in each encoder layer and decodes all features from the transformer with gradual spatial resolution recovery. In the following subsections, we first review local and global feature extraction in fully convolutional network-based salient object detection methods in Sec. 3.1. Then we detail how the transformer extracts global and local features in each layer in Sec. 3.2. Finally, we specify different decoders, including the naive decoder, stage-by-stage decoder, multi-level feature aggregation decoder [60], and our proposed decoder. The framework of our method is shown in Fig. 3.

3.1 FCN-based Salient Object Detection

We first revisit how traditional FCN-based salient object detection methods extract global-local representations. Normally, a CNN-based encoder (e.g., VGG [34] or ResNet [13]) takes an image as input. Due to the locality of CNN, a stack of convolution layers is applied to the input image, while the spatial resolution of the feature maps is gradually reduced at the end of each stage. Such an operation extends the receptive field of the encoder and reaches a global view in deep layers. This kind of encoder faces three serious problems: 1) features from deep layers have a global view but diluted local details, due to the many convolution operations and the resolution reduction; 2) features from shallow layers have more local details but lack the semantic understanding needed to identify salient objects, due to the locality of convolution; 3) features from adjacent layers have limited variance and lead to redundancy when they are densely decoded.
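The receptive-field arithmetic behind problem 1) can be made concrete with a minimal sketch (our own illustration, not from the paper): a stack of stride-1 3×3 convolutions grows the receptive field only linearly, so covering a large input without downsampling would require impractical depth.

```python
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    r = 1
    for _ in range(num_layers):
        r += kernel - 1  # each 3x3 layer adds 2 pixels of context
    return r

# One 3x3 layer sees 3 pixels; ten layers still see only 21.
print(receptive_field(1), receptive_field(10))  # 3 21
```

Hence FCN encoders resort to pooling and striding to reach globality, which is exactly what dilutes local detail.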

Recent work focuses on strengthening the ability to extract either local details or global semantics. To learn local details, previous works [57, 42, 47] focus on features from shallow layers by adding edge supervision. Such low-level features lack a global view and may misguide the whole network. To learn global semantics, attention mechanisms are used to enlarge the receptive field. However, previous works [5, 55] only apply the attention mechanism in deep layers (e.g., the last layer of the encoder). In the following subsections, we demonstrate how to better unify local-global features using the transformer.

3.2 Transformer Encoder

3.2.1 Image Serialization

Since the transformer is designed for 1D sequences in natural language processing, we first map the 2D input image into a 1D sequence. Specifically, given an image x ∈ R^{H×W×3} with height H, width W, and 3 channels, a 1D sequence is encoded from x. A simple idea is to directly flatten the image into a sequence of H×W pixel tokens. However, due to the quadratic spatial complexity of self-attention, such an idea leads to enormous GPU memory consumption and computation cost. Inspired by ViT [9], we divide the image into non-overlapping patches with a resolution of 16×16. We then project these patches into tokens T ∈ R^{L×C} with a linear layer, where the sequence length is L = HW/256. Every token in T represents one non-overlapping patch. Such an operation trades off spatial information against sequence length.
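The serialization step can be sketched in a few lines of NumPy (a toy illustration with our own function names; the paper's learned linear projection is replaced here by a plain reshape):

```python
import numpy as np

def serialize(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patch x patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # group the two patch-grid axes
    return x.reshape(-1, patch * patch * C)   # one flattened vector per patch

tokens = serialize(np.zeros((384, 384, 3)))
print(tokens.shape)  # (576, 768): L = HW / 16^2 = 576 tokens
```

For a 384×384 input this yields 576 tokens instead of 147,456 pixel tokens, keeping the quadratic attention cost manageable.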

3.2.2 Transformer

Taking the 1D sequence as input, we use a pure transformer as the encoder instead of the CNN used in previous works [44, 30, 29]. Rather than having the limited receptive field of a CNN-based encoder, the transformer models global representations in each layer and thus requires far fewer layers to obtain a global receptive field. Besides, the transformer encoder has connections across each pair of image patches. Thanks to these two characteristics, more local detailed information is preserved in our transformer encoder. Specifically, the transformer encoder consists of positional encoding and transformer layers with multi-head self-attention and multi-layer perceptrons.

Positional Encoding. Since the attention mechanism cannot distinguish positional differences, the first step is to feed positional information into the sequence and obtain the position-enhanced feature Z_0:

Z_0 = T + E_pos,

where + is element-wise addition and E_pos is the positional code, which is randomly initialized under a truncated Gaussian distribution and is trainable in our method.

Transformer Layer.

The transformer encoder contains 12 transformer layers, and each layer contains multi-head self-attention (MSA) and a multi-layer perceptron (MLP). Multi-head self-attention is the extension of self-attention (SA):

SA(Z) = softmax(QK^T / √d) V,  with Q = ZW_Q, K = ZW_K, V = ZW_V,

where Z indicates the input features of the self-attention, W_Q, W_K, and W_V are trainable weights, d is the dimension of Q and K, and softmax(·) is the softmax function. To apply multiple attentions in parallel, multi-head self-attention runs m independent self-attentions:

MSA(Z) = [SA_1(Z); SA_2(Z); …; SA_m(Z)] W_O,

where [·;·] indicates concatenation and W_O is a trainable projection. To sum up, in the i-th transformer layer, the output feature Z_i is calculated by:

Z'_i = MSA(LN(Z_{i−1})) + Z_{i−1},
Z_i = MLP(LN(Z'_i)) + Z'_i,

where LN indicates layer normalization and Z_i is the feature of the i-th transformer layer.
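Single-head self-attention can be sketched in NumPy as follows (a minimal illustration with toy shapes, not the paper's implementation):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """SA(Z) = softmax(Q K^T / sqrt(d)) V with Q = Z W_q, K = Z W_k, V = Z W_v."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (L, L): every token attends to all tokens
    return A @ V

rng = np.random.default_rng(0)
L, C, d = 576, 768, 64                       # toy sizes: 576 tokens of dimension 768
Z = rng.standard_normal((L, C))
out = self_attention(Z, *(rng.standard_normal((C, d)) for _ in range(3)))
print(out.shape)  # (576, 64)
```

The dense (L, L) attention matrix is what gives even the first layer a global view over all patches.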

3.3 Various Decoders

In this section, we design a global-local decoder to decode the features from the transformer encoder and generate final saliency maps with the same resolution as the inputs. Before presenting our global-local decoder, we briefly introduce the naive decoder, the stage-by-stage decoder, and the multi-level feature aggregation decoder used in [60] (see Fig. 4). To begin with, we reshape each feature from the transformer into R^{(H/16)×(W/16)×C}.

Figure 4: Different types of decoder. Our decoder (d) densely decodes all transformer features and gradually upsamples to the resolution of inputs.

Naive Decoder. As illustrated in Fig. 4(a), a naive way to decode transformer features is to directly upsample the outputs of the last layer to the resolution of the inputs and generate the saliency map. Specifically, three convolution-batch norm-ReLU layers are applied on Z_12. Then we bilinearly upsample the feature maps 16×, followed by a simple classification layer:

S = σ(Up_16(f_θ(Z_12))),

where f_θ indicates the convolution-batch norm-ReLU layers with parameters θ, Up_16 is the 16× bilinear upsampling operation, σ is the sigmoid function, and S is the predicted saliency map.

Stage-by-Stage Decoder. Directly upsampling 16× at one time, as in the naive decoder, injects too much noise and leads to coarse spatial details. Inspired by previous FCN-based methods [30] in salient object detection, a stage-by-stage decoder that upsamples the resolution 2× in each stage mitigates the loss of spatial details, as shown in Fig. 4(b). There are four stages, and in each stage a decoder block contains three convolution-batch norm-ReLU layers:

D_t = Up_2(f_{θ_t}(D_{t−1})),

where D_t denotes the output of the t-th decoding stage and D_0 = Z_12.
Multi-level Feature Aggregation. As illustrated in Fig. 4(c), Zheng et al. [60] propose a multi-level feature aggregation decoder that sparsely fuses multi-level features, following a setting similar to feature pyramid networks. Unlike a pyramid, however, the spatial resolutions of the features remain the same throughout the transformer encoder. Specifically, the decoder takes features from several selected transformer layers, upsamples them 4×, and passes them through several convolutional layers. The features are then aggregated top-down via element-wise addition. Finally, the features from all streams are fused to generate the final saliency maps.

Deeply-transformed Decoder. As illustrated in Fig. 4(d), to benefit from the features of every transformer layer while injecting less upsampling noise, we gradually integrate all transformer layer features and upsample them to the spatial resolution of the input images. Specifically, we divide the twelve transformer features into three stages and upsample them 8×, 4×, and 2× from the first to the third stage respectively, where the upsampling operation combines pixel shuffle with 2× bilinear upsampling. In each layer of a stage, the salient feature comes from the concatenation of the transformer feature of the corresponding layer and the salient feature from the previous decoding layer; a convolution block is applied after upsampling.
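The pixel-shuffle component of the upsampling can be sketched in NumPy (a standard rearrangement; the factor and channel sizes below are toy values, and the 4×-shuffle-plus-2×-bilinear split is one plausible decomposition of an 8× stage, not a detail confirmed by the paper):

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (H, W, C*r*r) -> (H*r, W*r, C): trade channels for resolution."""
    H, W, Cr2 = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(H, W, r, r, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H * r, W * r, C)

# A 24x24 transformer feature map shuffled 4x; a further 2x bilinear
# interpolation would reach 192x192, i.e. 8x in total.
feat = np.zeros((24, 24, 64))
print(pixel_shuffle(feat, 4).shape)  # (96, 96, 4)
```

Unlike interpolation, pixel shuffle injects no new values: it only redistributes learned channel activations into space, which fits the goal of decoding with less noise.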


Then we simply apply a convolution on each salient feature to generate a saliency map. Therefore, our method produces twelve output saliency maps in total.


4 Experiment

Method | DUTS-TE: MAE↓ F_β^w↑ S_m↑ | DUT-OMRON: MAE↓ F_β^w↑ S_m↑ | HKU-IS: MAE↓ F_β^w↑ S_m↑ | ECSSD: MAE↓ F_β^w↑ S_m↑ | PASCAL-S: MAE↓ F_β^w↑ S_m↑
DSS C 0.056 0.801 0.824 0.063 0.737 0.790 0.040 0.889 0.790 0.052 0.906 0.882 0.095 0.809 0.797
UCF C 0.112 0.742 0.782 0.120 0.698 0.760 0.062 0.874 0.875 0.069 0.890 0.883 0.116 0.791 0.806
Amulet C 0.085 0.751 0.804 0.098 0.715 0.781 0.051 0.887 0.886 0.059 0.905 0.891 0.100 0.810 0.819
BMPM C 0.049 0.828 0.862 0.064 0.734 0.809 0.039 0.910 0.907 0.045 0.917 0.911 0.074 0.836 0.846
RAS C 0.059 0.807 0.838 0.062 0.753 0.814 - - - 0.056 0.908 0.893 0.104 0.803 0.796
PSAM C 0.041 0.854 0.879 0.057 0.765 0.830 0.034 0.918 0.914 0.040 0.931 0.920 0.075 0.847 0.851
CPD C 0.043 0.841 0.869 0.056 0.754 0.825 0.034 0.911 0.906 0.037 0.927 0.918 0.072 0.837 0.847
SCRN C 0.040 0.864 0.885 0.056 0.772 0.837 0.034 0.921 0.916 0.038 0.937 0.927 0.064 0.858 0.868
BASNet C 0.048 0.838 0.866 0.056 0.779 0.836 0.032 0.919 0.909 0.037 0.932 0.916 0.078 0.836 0.837
EGNet C 0.039 0.866 0.887 0.053 0.777 0.841 0.031 0.923 0.918 0.037 0.936 0.925 0.075 0.846 0.853
MINet C 0.037 0.865 0.884 0.056 0.769 0.833 0.029 0.926 0.919 0.034 0.938 0.925 0.064 0.852 0.856
LDF C 0.034 0.877 0.892 0.052 0.782 0.839 0.028 0.929 0.920 0.034 0.938 0.924 0.061 0.859 0.862
GateNet C 0.040 0.869 0.885 0.055 0.781 0.838 0.033 0.920 0.915 0.040 0.933 0.920 0.069 0.852 0.858
SAC C 0.034 0.882 0.895 0.052 0.804 0.849 0.026 0.935 0.925 0.031 0.945 0.931 0.063 0.868 0.866
Naive T 0.043 0.855 0.878 0.059 0.776 0.835 0.042 0.905 0.903 0.042 0.927 0.919 0.068 0.854 0.862
SETR T 0.039 0.869 0.888 0.056 0.782 0.838 0.037 0.917 0.912 0.041 0.930 0.921 0.062 0.867 0.859
VST T 0.037 0.877 0.896 0.058 0.800 0.850 0.030 0.937 0.928 0.034 0.944 0.932 0.067 0.850 0.873
Ours T 0.029 0.901 0.912 0.045 0.819 0.865 0.026 0.941 0.932 0.028 0.953 0.940 0.054 0.882 0.881
Table 1: Quantitative comparisons with FCN-based (denoted as C) and transformer-based (denoted as T) salient object detection methods using three evaluation metrics. The top three performances are marked in red, green, and blue respectively. Simply utilizing the transformer (Naive) already achieves comparable performance. Our method significantly outperforms all competitors regardless of their basic component (i.e., FCN or transformer). The results of VST come from its paper due to the lack of released code or saliency maps.
(a) Image
(b) GT
(c) Ours (T)
(d) SETR (T)
(e) GateNet (C)
(f) LDF (C)
(g) EGNet (C)
(h) BMPM (C)
(i) DSS (C)
Figure 5: Qualitative comparisons with state-of-the-art methods. Our method provides more visually reasonable saliency maps by accurately locating salient objects and generating sharp boundaries than other transformer-based (denoted as T) and FCN-based methods (denoted as C).

4.1 Implementation Details

We train our model on DUTS following the same setting as [57]. The transformer encoder is pretrained on ImageNet, and the remaining layers are randomly initialized following the default settings in PyTorch. We use SGD with momentum as the optimizer, setting momentum to 0.9 and weight decay to 0.0005. The learning rate starts from 0.001 and gradually decays to 1e-5. We train our model for 40 epochs with a batch size of 8. We adopt vertical and horizontal flipping as data augmentation. The input images are resized to 384×384. During inference, we take the output of the last layer as the final prediction.

4.2 Datasets and Loss Function

We evaluate our method on five widely used benchmark datasets: DUT-OMRON [48], HKU-IS [19], ECSSD [33], DUTS [38], and PASCAL-S [21]. DUT-OMRON contains 5,169 challenging images which usually have complex backgrounds and more than one salient object. HKU-IS has 4,447 high-quality images with multiple disconnected salient objects per image. ECSSD has 1,000 meaningful images with various scenes. DUTS is the largest salient object detection dataset, including 10,533 training images and 5,019 testing images. PASCAL-S consists of 850 images chosen from PASCAL VOC 2009.

We use the standard binary cross-entropy (BCE) loss. Given the twelve outputs, we define the final loss as:

L = Σ_{i=1}^{12} BCE(S_i, G),

where S_i indicates our predicted saliency maps and G indicates the ground truth.

4.3 Evaluation Metrics

In this paper, we evaluate our method using three widely used evaluation metrics: mean absolute error (MAE), weighted F-measure (F_β^w), and S-measure (S_m) [12].

MAE directly measures the average pixel-wise difference between saliency maps and labels:

MAE = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} |S(i, j) − G(i, j)|.

F_β^w evaluates precision and recall at the same time, using β² to weight precision:

F_β^w = ((1 + β²) · Precision^w · Recall^w) / (β² · Precision^w + Recall^w),

where β² is set to 0.3 as in previous work [57].

S-measure emphasizes the structural information of both foreground and background. Specifically, the region-aware similarity S_r and the object-aware similarity S_o are combined with a weight α:

S_m = α · S_o + (1 − α) · S_r,

where α is set to 0.5 [57].
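As a sanity check, the simpler metrics can be sketched in NumPy (MAE exactly as defined above; for brevity we show the plain thresholded F-measure rather than the weighted variant used in the paper):

```python
import numpy as np

def mae(sal: np.ndarray, gt: np.ndarray) -> float:
    return float(np.abs(sal - gt).mean())

def f_measure(sal: np.ndarray, gt: np.ndarray,
              beta2: float = 0.3, t: float = 0.5) -> float:
    pred, mask = sal >= t, gt >= 0.5
    tp = float((pred & mask).sum())
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(mask.sum(), 1)
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-7)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1     # a toy 4x4 salient square
perfect = gt.copy()
print(mae(perfect, gt), f_measure(perfect, gt))  # 0.0 1.0
```

A perfect map gives MAE 0 and F-measure 1; lower MAE and higher F_β^w/S_m are better throughout the tables.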

4.4 Comparison with state-of-the-art methods

We compare our method with 17 state-of-the-art salient object detection methods: DSS [16], UCF [54], Amulet [52], BMPM [51], RAS [4], PSAM [3], PoolNet [22], CPD [46], SCRN [45], BASNet [30], EGNet [57], MINet [29], LDF [44], GateNet [59], SAC [18], SETR [60], and VST [10], plus a Naive baseline (our transformer encoder with the naive decoder). The saliency maps are provided by the authors or computed with the released code, except for SETR and Naive, which we implemented ourselves. All results are evaluated with the same evaluation code.

Quantitative Comparison. We report the quantitative results in Tab. 1. As can be seen, without any post-processing, our method outperforms all compared methods, whether FCN-based or transformer-based, by a large margin on all evaluation metrics. In particular, performance is improved by 12.17% on average over the second-best method, LDF, in terms of MAE. The superior performance on both easy and difficult datasets shows that our method handles both simple and complex scenes well. Another interesting finding is that the transformer is powerful enough to generate rather good saliency maps even with the "Naive" decoder, although it cannot outperform the state-of-the-art FCN-based method (i.e., SAC). Further investigating the potential of the transformer may keep enhancing performance on the salient object detection task.

Qualitative Comparison. The qualitative results are shown in Fig. 5. With the power of modeling long-range dependencies over the whole image, transformer-based methods capture salient objects more accurately. For example, in the sixth row of Fig. 5, our method accurately captures the salient object without being deceived by the ice hill, while the FCN-based methods all fail in this case. Besides, the saliency maps predicted by our method are more complete and better aligned with the ground truths.

4.5 Ablation Study

In this subsection, we mainly evaluate the effectiveness of dense connections and different upsampling strategies in our decoder design on two challenging datasets: DUTS-TE and DUT-OMRON.

Effectiveness of Dense Connections.

Figure 6: The effect of dense connections during decoding (panels: Input, GT, density 4 through density 0; density 4 is ours and density 0 is the naive decoder). We gradually increase the density of connections between transformer output features and decoding convolution layers in all stages. The naive decoder (i.e., density 0) predicts saliency maps with a lot of noise and is severely influenced by the background. As the density increases, the saliency maps become more accurate and sharp, especially near the boundaries.
Density | DUTS-TE: MAE↓ F_β^w↑ S_m↑ | DUT-OMRON: MAE↓ F_β^w↑ S_m↑
0 0.043 0.855 0.878 0.059 0.776 0.835
1 0.033 0.889 0.902 0.052 0.802 0.852
2 0.031 0.895 0.907 0.050 0.807 0.857
3 0.030 0.899 0.910 0.047 0.813 0.862
4 0.029 0.901 0.912 0.045 0.819 0.865
Table 2: Quantitative evaluation of dense connections during decoding. Adding connections brings significant performance gains compared to the naive decoder (i.e., density 0). As the density increases, performance gradually improves on all three evaluation metrics. The best performances are marked in red.

We first conduct an experiment to examine the effect of densely decoding the features from each transformer encoding layer. Starting from decoding only the features of the last transformer layer (i.e., the naive decoder), we gradually add fuse connections in each stage to increase the density until reaching our final decoder; the settings are denoted as density 0, 1, 2, 3, and 4. For example, density 3 means that three connections from transformer layers are added in each stage, and density 1 means only one.

The qualitative results are illustrated in Fig. 6, and the quantitative results are shown in Tab. 2. When we use the naive decoder without any connections, the results are much worse, with lots of noise. The saliency maps become sharper and more accurate as the decoding density increases, and performance also gradually improves on all three evaluation metrics. As each layer captures a global view of its previous layer, the representation differences are maximized among the transformer features. Thus, each transformer feature contributes individually to the predicted saliency maps.

Effectiveness of Upsampling Strategy.

Strategy | DUTS-TE: MAE↓ F_β^w↑ S_m↑ | DUT-OMRON: MAE↓ F_β^w↑ S_m↑
Density set to 0
16× 0.043 0.855 0.878 0.059 0.776 0.835
Four 2× 0.041 0.861 0.883 0.059 0.775 0.836
Density set to 1
16× 0.036 0.875 0.896 0.052 0.794 0.850
4× and 4× 0.034 0.886 0.900 0.053 0.801 0.850
Ours 0.033 0.889 0.902 0.052 0.802 0.852
Density set to 4
16× 0.034 0.882 0.902 0.051 0.799 0.846
4× and 4× 0.030 0.900 0.911 0.047 0.814 0.862
Ours 0.029 0.901 0.912 0.045 0.819 0.865
Table 3: Quantitative evaluations of the upsampling strategies under different density settings. Gradual upsampling leads to better performance than naive one-shot upsampling, and our upsampling strategy outperforms the others under all density settings.

To evaluate the effect of the gradual upsampling strategy used in our decoder, we compare different upsampling strategies under different density settings. 1) The naive decoder (denoted as 16×) simply upsamples the last transformer features 16× for prediction. 2) When the density is set to 0, we also test the upsampling strategy of the stage-by-stage decoder (denoted as Four 2×), where the output features of each stage are upsampled 2× before being sent to the next stage, with four upsampling operations in total. 3) The strategy used in SETR [60] (denoted as 4× and 4×) upsamples each transformer feature 4×, and another 4× upsampling is applied to obtain the final predicted saliency map. 4) Our decoder (Ours) upsamples transformer features 8×, 4×, and 2× for the layers connected to stage 1, stage 2, and stage 3 respectively.

The quantitative results are reported in Tab. 3, with some interesting findings. Gradual upsampling always performs better than naive one-shot upsampling. With the help of our upsampling strategy, dense decoding shows its advantage, with less noise injected during decoding. Directly decoding the output feature of the last transformer layer in the stage-by-stage mode brings only limited gains, further indicating the importance of our dense connections.

4.6 The Statistics of Timing

Although the computational cost of the transformer encoder may be high, thanks to the parallelism of the attention architecture, the inference speed is still comparable to recent methods, as shown in Tab. 4. Besides, as a pioneering work, we want to highlight the power of the transformer on the salient object detection (SOD) task and leave time-complexity reduction as future work.

Method      DSS  R3Net  BASNet  SCRN  EGNet  Ours
Speed (FPS) 24   19     28      16    7      15
Table 4: Running time of different methods. All methods are tested on a single RTX 2080 Ti.

5 Conclusion

In this paper, we explore the unified learning of global-local representations in salient object detection with an attention-based transformer model. With the transformer's ability to model long-range dependencies, we overcome the limitations caused by the locality of CNNs in previous salient object detection methods. We propose an effective decoder that densely decodes transformer features and gradually upsamples them to predict saliency maps. It makes full use of all transformer features, owing to the high representation differences and low redundancy among them, and its gradual upsampling strategy reduces noise injection during decoding, magnifying the ability to locate salient regions accurately. Experiments on five widely used datasets with three evaluation metrics demonstrate that our method significantly outperforms state-of-the-art methods by a large margin.


  • [1] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le (2019) Attention augmented convolutional networks. In ICCV, pp. 3286–3295. Cited by: §2.
  • [2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, pp. 213–229. Cited by: §2.
  • [3] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In CVPR, pp. 5410–5418. Cited by: §4.4.
  • [4] S. Chen, X. Tan, B. Wang, and X. Hu (2018) Reverse attention for salient object detection. In European Conference on Computer Vision, Cited by: §4.4.
  • [5] S. Chen, X. Tan, B. Wang, and X. Hu (2018) Reverse attention for salient object detection. In ECCV, pp. 234–250. Cited by: §3.1.
  • [6] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2014) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582. Cited by: §1.
  • [7] C. Craye, D. Filliat, and J. Goudou (2016) Environment exploration for object-based visual saliency learning. In ICRA, pp. 2303–2309. Cited by: §1.
  • [8] R. Desimone and J. Duncan (1995) Neural mechanisms of selective visual attention. Annual review of neuroscience 18 (1), pp. 193–222. Cited by: §1.
  • [9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §2, §3.2.1, §4.1.
  • [10] Y. Du, Y. Xiao, and V. Lepetit (2021) Learning to better segment objects from unseen classes with unlabeled videos. arXiv preprint arXiv:2104.12276. Cited by: §4.4.
  • [11] W. Einhäuser and P. König (2003) Does luminance-contrast contribute to a saliency map for overt visual attention?. European Journal of Neuroscience 17 (5), pp. 1089–1097. Cited by: §1.
  • [12] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In ICCV, pp. 4548–4557. Cited by: §4.3.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.1.
  • [14] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau (2017) Delving into salient object subitizing and detection. In ICCV, pp. 1059–1067. Cited by: §2.
  • [15] S. He and R. W. Lau (2014) Saliency detection with flash and no-flash image pairs. In ECCV, pp. 110–124. Cited by: §1.
  • [16] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr (2017) Deeply supervised salient object detection with short connections. In CVPR, pp. 3203–3212. Cited by: §2, §4.4.
  • [17] H. Hu, Z. Zhang, Z. Xie, and S. Lin (2019) Local relation networks for image recognition. In ICCV, pp. 3464–3473. Cited by: §2.
  • [18] X. Hu, C. Fu, L. Zhu, T. Wang, and P. Heng (2020) SAC-net: spatial attenuation context for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology. Note: to appear Cited by: §4.4.
  • [19] G. Li and Y. Yu (2015-06) Visual saliency based on multiscale deep features. In CVPR, pp. 5455–5463. Cited by: §4.2.
  • [20] G. Li and Y. Yu (2016) Deep contrast learning for salient object detection. In CVPR, pp. 478–487. Cited by: §1, §2.
  • [21] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In CVPR, pp. 280–287. Cited by: §4.2.
  • [22] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In CVPR, pp. 3917–3926. Cited by: §2, §4.4.
  • [23] N. Liu, J. Han, and M. Yang (2018) Picanet: learning pixel-wise contextual attention for saliency detection. In CVPR, pp. 3089–3098. Cited by: §2.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1.
  • [25] W. Luo, M. Yang, and W. Zheng (2021) Weakly-supervised semantic segmentation with saliency and incremental supervision updating. Pattern Recognition, pp. 107858. Cited by: §1.
  • [26] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P. Jodoin (2017) Non-local deep features for salient object detection. In CVPR, pp. 6609–6617. Cited by: §2.
  • [27] R. Mechrez, E. Shechtman, and L. Zelnik-Manor (2019) Saliency driven image manipulation. Machine Vision and Applications 30 (2), pp. 189–202. Cited by: §1.
  • [28] C. T. Morgan (1943) Physiological psychology.. Cited by: §1.
  • [29] Y. Pang, X. Zhao, L. Zhang, and H. Lu (2020-06) Multi-scale interactive network for salient object detection. In CVPR, Cited by: §3.2.2, §4.4.
  • [30] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019-06) BASNet: boundary-aware salient object detection. In CVPR, Cited by: §3.2.2, §3.3, §4.4.
  • [31] S. Ren, C. Han, X. Yang, G. Han, and S. He (2020) TENet: triple excitation network for video salient object detection. In ECCV, pp. 212–228. Cited by: §2.
  • [32] J. H. Reynolds and R. Desimone (2003) Interacting roles of attention and visual salience in v4. Neuron 37 (5), pp. 853–863. Cited by: §1.
  • [33] J. Shi, Q. Yan, L. Xu, and J. Jia (2015) Hierarchical image saliency detection on extended cssd. IEEE TPAMI 38 (4), pp. 717–729. Cited by: §4.2.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §3.1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §2.
  • [36] B. Wang, W. Liu, G. Han, and S. He (2020) Learning long-term structural dependencies for video salient object detection. IEEE TIP 29, pp. 9017–9031. Cited by: §2.
  • [37] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L. Chen (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In ECCV, pp. 108–126. Cited by: §2.
  • [38] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In CVPR, pp. 136–145. Cited by: §4.2.
  • [39] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan (2016) Saliency detection with recurrent fully convolutional networks. In ECCV, pp. 825–841. Cited by: §1, §2.
  • [40] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu (2017) A stagewise refinement model for detecting salient objects in images. In ICCV, pp. 4019–4028. Cited by: §1, §2.
  • [41] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In CVPR, pp. 3127–3135. Cited by: §2.
  • [42] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji (2019) Salient object detection with pyramid attention and salient edges. In CVPR, pp. 1448–1457. Cited by: §3.1.
  • [43] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §2.
  • [44] J. Wei, S. Wang, Z. Wu, C. Su, Q. Huang, and Q. Tian (2020-06) Label decoupling framework for salient object detection. In CVPR, Cited by: §3.2.2, §4.4.
  • [45] Z. Wu, L. Su, and Q. Huang (2019) Stacked cross refinement network for edge-aware salient object detection. In ICCV, Vol. , pp. 7263–7272. External Links: Document Cited by: §4.4.
  • [46] Z. Wu, L. Su, and Q. Huang (2019-06) Cascaded partial decoder for fast and accurate salient object detection. In CVPR, Cited by: §4.4.
  • [47] Z. Wu, L. Su, and Q. Huang (2019) Stacked cross refinement network for edge-aware salient object detection. In ICCV, pp. 7264–7273. Cited by: §3.1.
  • [48] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR, pp. 3166–3173. Cited by: §4.2.
  • [49] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li (2014) Salient color names for person re-identification. In ECCV, pp. 536–551. Cited by: §1.
  • [50] Y. Zeng, J. Fu, and H. Chao (2020) Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Cited by: §2.
  • [51] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang (2018) A bi-directional message passing model for salient object detection. In CVPR, pp. 1741–1750. Cited by: §1, §2, §4.4.
  • [52] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan (2017-10) Amulet: aggregating multi-level convolutional features for salient object detection. In ICCV, Cited by: §4.4.
  • [53] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan (2017) Amulet: aggregating multi-level convolutional features for salient object detection. In ICCV, pp. 202–211. Cited by: §2.
  • [54] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin (2017-10) Learning uncertain convolutional features for accurate saliency detection. In ICCV, Cited by: §4.4.
  • [55] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In CVPR, pp. 714–722. Cited by: §2, §3.1.
  • [56] H. Zhao, J. Jia, and V. Koltun (2020) Exploring self-attention for image recognition. In CVPR, pp. 10076–10085. Cited by: §2.
  • [57] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet: edge guidance network for salient object detection. In ICCV, pp. 8779–8788. Cited by: §2, §3.1, §4.1, §4.3, §4.3, §4.4.
  • [58] R. Zhao, W. Oyang, and X. Wang (2016) Person re-identification by saliency learning. IEEE TPAMI 39 (2), pp. 356–370. Cited by: §1.
  • [59] X. Zhao, Y. Pang, L. Zhang, H. Lu, and L. Zhang (2020) Suppress and balance: a simple gated network for salient object detection. In ECCV, pp. 35–51. Cited by: Figure 1, §2, §4.4.
  • [60] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2020) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840. Cited by: §2, §3.3, §3.3, §3, §4.4, §4.5.