We propose a novel Synergistic Attention Network (SA-Net) to address light field salient object detection by establishing a synergistic effect between multi-modal features with advanced attention mechanisms. Our SA-Net exploits the rich information of focal stacks via 3D convolutional neural networks, decodes the high-level features of multi-modal light field data with two cascaded synergistic attention modules, and predicts the saliency map using an effective feature fusion module in a progressive manner. Extensive experiments on three widely-used benchmark datasets show that our SA-Net outperforms 28 state-of-the-art models, demonstrating its effectiveness and superiority. Our code will be made publicly available.
Salient object detection (SOD) is a task that aims to segment the objects that attract the most human attention. It plays a key role in learning and reflecting the human visual mechanism and in various computer vision applications, such as instance segmentation and video object segmentation. According to the input modalities, the SOD task can be classified into three categories: 2D, 3D, and 4D SOD [Zhang2019MemoryorientedDFMoLF].
Recently, light field SOD [li2017saliency] (or 4D SOD) has attracted increasing attention owing to the introduction of various light field benchmark datasets, such as DUT-LF [DUTLF], LFSD [LFSD], HFUT [HFUT], DUT-MV [DUTMV], and Lytro Illum [Lytro]. Besides all-in-focus (AiF) images, light field datasets [LFSD, HFUT, Lytro] also provide focal stacks (FSs), multi-view images, and depth images, where an FS is a series of images focused at different depths of a given scene, while the depth image contains holistic depth information. Unlike 3D (RGB-D) SOD models, which utilize only two modalities, i.e., RGB images and depth maps, light field SOD models also use multi-view images (e.g., [zhang2020multiMTCNet, DUTMV]) or FSs (e.g., [Zhang2019MemoryorientedDFMoLF, Zhang2020LFNetLFNet, Piao2020ExploitARERNet]) as auxiliary information to further improve the performance. It is worth noting that the most recent FS-based deep learning light field SOD models (e.g., ERNet [Piao2020ExploitARERNet]) have achieved state-of-the-art performance (by a large margin) on three widely-used light field benchmark datasets [DUTLF, HFUT, LFSD].
Despite their advantages, existing works suffer from two major limitations. First, they explore little about the complementarities between AiF images and the FSs. In practice, the AiF images are generated from FSs with a photo-montage technique, implying that the former simultaneously depict the global relationship and spatial details of each local region, while the latter asynchronously focus on different local details and contain the global information in temporal dimension. Existing methods either employ simple fusion strategies (e.g., addition or concatenation) before the encoding stage [Zhang2019MemoryorientedDFMoLF], or use pre-trained models of one modality to guide the other [Piao2020ExploitARERNet], which are not theoretically sufficient to discover the close relationship between these two modalities, and are unable to fully explore the rich information from FSs for light field SOD. Therefore, a more sophisticated strategy for the fusion of AiF and FS features is urgently needed. Second, existing methods pay little attention to the inter-frame relationship of a given FS, hindering the further improvement of SOD performance. Based on the observation that the FS is a sequence of images with the same global contents yet alternative depth-dependent local-focused regions (Figure 1), we argue that both the spatial and temporal (inter-frame) information are important for the FS-based SOD task, while few studies focus on spatial-temporal modeling for FS.
To this end, we propose a novel Synergistic Attention Network (SA-Net) to conduct light field SOD with rich information from AiF images and FSs. Specifically, we first employ 3D convolutional neural networks (CNNs) to extract the spatial-temporal features from FSs. At the decoding stage, we propose a synergistic attention (SA) module, where the features from AiF images and FSs are selectively fused and optimized to achieve a synergistic effect for SOD. Finally, the multi-modal features are fed to our progressive fusion (PF) module, which fuses multi-modal features and predicts the saliency map in a progressive manner. In a nutshell, we provide four main contributions as follows:
We propose the SA module to decode the high-level features from both AiF images and FSs with a synergistic attention mechanism. Our SA module exploits the most meaningful information from the multi-modal multi-level features, allowing accurate SOD by taking advantage of light field data.
We introduce a dual-branch backbone to encode the AiF and FS information, simultaneously. To the best of our knowledge, our work is the first attempt to utilize 3D CNNs for the feature extraction of FSs in the field of light field SOD.
We design the PF module to gradually fuse the selective high-level features for the final saliency prediction.
Extensive experiments demonstrate that our SA-Net outperforms 28 state-of-the-art SOD models upon three widely-used light field datasets.
In this section, we discuss recent works from the aspects of light field SOD, attention mechanisms, and 3D CNNs.
Datasets. To the best of our knowledge, LFSD [LFSD] is the earliest public light field SOD dataset, which provides AiF images, FSs, multi-view (MV) images, depth maps, and micro-lens (ML) images for 100 different scenes. Later, [HFUT] constructed HFUT, which contains 255 AiF images and other light field modalities, including FS, MV, ML, and depth. More recent datasets such as DUT-LF [DUTLF] and DUT-MV [DUTMV] contain 1,462 and 1,580 AiF images, respectively. Though much larger than the early datasets, DUT-LF [DUTLF] does not provide MV and ML images, while DUT-MV [DUTMV] only provides MV images. The most recently proposed dataset, Lytro Illum [Lytro], provides 640 AiF images as well as the four extra modalities. Note that all five datasets mentioned above are released with a ground-truth pixel-wise binary mask for each AiF image.
Methods. As light field SOD is an emerging field, to the best of our knowledge, there are only 18 available models (11 traditional and 7 deep learning-based). Among the traditional ones, the early method [LFSD] conducted light field SOD by considering background- and location-related prior knowledge. In addition, [Li2015WSC] proposed a unified architecture based on weighted sparse coding. Later methods [Zhang2015SaliencyDILF, Sheng2016RelativeLFRL, Wang2017BIF, li2017saliency, HFUT, Wang2018AccurateSSDDF, Wang2018SalienceGDSGDC] explored and further combined multiple visual cues (e.g., depth, color contrast, light field flows, and boundary prior) to detect saliency. The most recent methods [Wang2020RegionbasedDRDFD, Piao2020SaliencyDVDCA] shifted more attention to depth information and employed cellular automata for saliency detection in light fields. With the development of public light field datasets, deep learning-based methods were proposed to conduct SOD in alternative light field modalities. Specifically, [DUTMV] developed a view synthesis network to detect salient objects by involving MVs. With MVs as inputs, [zhang2020multiMTCNet] further established a unified structure to synchronously conduct salient object and edge detection. Besides, [Lytro] applied DeepLab-v2 for SOD with MLs. As the mainstream approach, [DUTLF, Zhang2019MemoryorientedDFMoLF, Piao2020ExploitARERNet, Zhang2020LFNetLFNet] all employed ConvLSTM and attention mechanisms to detect saliency from AiF images and FSs.
Generally, there are three main categories of attention mechanisms: channel attention [hu2018squeeze], spatial attention [woo2018cbam], and self-attention [wang2018non], which is a concept borrowed from the field of natural language processing. These attention mechanisms can be easily embedded into different CNN-based architectures. Besides, co-attention, as a specific type of attention mechanism, has been used in the fields of video object segmentation (e.g., [lu2020zero]), RGB-D SOD (e.g., [LiuS2MA]), etc. However, such a mutual attention mechanism has seldom been studied in the field of light field SOD.
3D CNNs have proven highly capable of modeling the spatial-temporal information of video data, and thus dominate video-based detection fields such as action recognition [carreira2017quo]. Recently, RD3D [chen2021rd3d] was proposed to address the task of 3D SOD and achieved promising performance on widely-used RGB-D SOD benchmarks, which further demonstrates the strength of 3D CNNs for multi-modal data processing. Since FSs can be regarded as sequences of images focused at different depths, learning the spatial-temporal features of FSs via 3D CNNs has great potential to boost light field SOD performance, but has so far not been investigated.
In SA-Net, we exploit rich cross-modal complementary information with channel attention and co-attention mechanisms so as to achieve a synergistic effect between multi-level AiF and FS features. In addition, to capture the inter-frame information of FSs, we, for the first time, employ 3D CNNs to extract rich features from FSs. Figure 2 shows an overview of our SA-Net, which consists of three major components: a multi-modal encoder consisting of 2D and 3D CNNs (Section 3.1), two cascaded SA modules (Section 3.2), and a PF module (Section 3.3).
As shown in Figure 2, the encoder of our network is a dual-branch architecture in which the two modalities, i.e., AiF images and FSs, are fed synchronously but passed through separate branches. For the 2D branch, we encode an input AiF image with a group of convolutional blocks. An FS, on the other hand, is represented as a 4D tensor with the last dimension denoting the number of frames. We encode the FS with a stack of 3D convolutional blocks, which are able to jointly capture the rich intra- and inter-frame information for accurate SOD. Note that we adopt the same setting as in [Piao2020ExploitARERNet] in our 3D branch, and apply a zero-padding strategy to any FS with fewer than 12 frames.
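The zero-padding rule can be illustrated with a minimal NumPy sketch. The `(H, W, 3, T)` axis order, the function name `pad_focal_stack`, and the truncation of stacks longer than 12 frames are assumptions made for illustration; only the 12-frame target and the use of zero-padding come from the text.

```python
import numpy as np

NUM_FRAMES = 12  # frame count taken from the zero-padding rule in the text


def pad_focal_stack(stack: np.ndarray, num_frames: int = NUM_FRAMES) -> np.ndarray:
    """Zero-pad a focal stack of shape (H, W, 3, T) along its last (frame) axis."""
    t = stack.shape[-1]
    if t >= num_frames:
        # Assumption: stacks that are already long enough are truncated.
        return stack[..., :num_frames]
    # Append all-zero frames until the stack has num_frames frames.
    pad = np.zeros(stack.shape[:-1] + (num_frames - t,), dtype=stack.dtype)
    return np.concatenate([stack, pad], axis=-1)


# Usage: a hypothetical 9-frame focal stack becomes a 12-frame one.
stack = np.random.rand(8, 8, 3, 9).astype(np.float32)
padded = pad_focal_stack(stack)
```

Padding with zeros keeps every sample at a fixed temporal length, so that stacks can be batched and passed through 3D convolutions with a fixed temporal extent.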
As high-level features tend to preserve the essential cues (e.g., location, shape) of salient objects while the low-level ones contain relatively trivial information (e.g., edges) [wu2019cascaded], our decoder only integrates high-level features to avoid redundant computation. Specifically, the decoder operates on the high-level AiF and FS features extracted from the 2D and 3D CNNs of our dual-branch backbone network (Section 3.1).
Multi-Level Attention. As shown in Figure 2, a receptive field block (RFB) [wu2019cascaded] is first employed to enrich the global context information for each convolution block. Taking the AiF branch as an example, the adjacent high-level features from the encoder are then combined with a channel attention (CA) mechanism.
Here, the RFB provides the features at each level, and the features of one level are upsampled before being combined with those of the adjacent level. The channel-attended feature is further concatenated with the upper-level feature provided by a residual block (RB), yielding the feature that serves as one of the pair-wise inputs for the second stage of our SA module. Note that the FS branch follows the same procedure as the AiF branch, since the two branches are symmetric.
Multi-Modal Attention. In this stage, the high-level feature interaction between the two modalities is conducted with two cascaded co-attention (CoA) modules (Figure 2). As shown in Figure 3, given the pair-wise features at a given layer as inputs, a similarity matrix is computed by flattening each 3D feature map into a 2D matrix and multiplying the two flattened features through a weight matrix.
The weight matrix is all ones, which assigns equal attention to both branches of the encoder. The similarity matrix is then normalized column-wise and row-wise.
From the normalized similarity matrices, the co-attention-based pair-wise features at that layer are computed.
These features are reshaped from the flattened 2D form back to the original 3D feature dimensions. Considering the variations of spatial information among the input scenes, a self-gate mechanism [wang2018non] is further employed to automatically learn the co-attention confidences for the two co-attention features. The final outputs of our SA module at that layer are obtained by gating the co-attention features with these confidences.
Each co-attention confidence is produced by a convolutional layer. Our SA module is particularly effective in exploiting the multi-level and multi-modal complementary information, and therefore provides significantly improved performance, as demonstrated by our ablation studies in Section 4.3.
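A minimal NumPy sketch of one co-attention step is given here for illustration. It is not the authors' implementation. The `(C, H, W)` feature layout, the softmax normalization, the sigmoid gate, the 1x1 convolution modeled as a per-channel weight vector, the choice of which modality attends to which, and all function and variable names are assumptions. Only the flatten operation, the all-ones weight matrix, the column- and row-wise normalization, the reshape, and the convolution-based self-gate come from the text.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def co_attention(f_aif, f_fs, gate_aif, gate_fs):
    """f_aif, f_fs: (C, H, W) features; gate_*: (C,) weights of a 1x1 conv."""
    c, h, w = f_aif.shape
    a = f_aif.reshape(c, h * w)       # flatten to (C, HW)
    b = f_fs.reshape(c, h * w)
    ones = np.ones((c, c))            # all-ones weight matrix, as in the text
    sim = a.T @ ones @ b              # similarity matrix, shape (HW, HW)
    sim_row = softmax(sim, axis=1)    # each row sums to 1
    sim_col = softmax(sim, axis=0)    # each column sums to 1
    # Each AiF position gathers FS features, and each FS position gathers
    # AiF features, weighted by the normalized similarities.
    z_aif = (b @ sim_row.T).reshape(c, h, w)
    z_fs = (a @ sim_col).reshape(c, h, w)
    # Self-gate: a per-pixel confidence in (0, 1) from a 1x1 conv and sigmoid.
    g_aif = sigmoid(np.tensordot(gate_aif, z_aif, axes=1))  # (H, W)
    g_fs = sigmoid(np.tensordot(gate_fs, z_fs, axes=1))
    return g_aif * z_aif, g_fs * z_fs


rng = np.random.default_rng(0)
f_aif = rng.standard_normal((4, 3, 5))
f_fs = rng.standard_normal((4, 3, 5))
out_aif, out_fs = co_attention(
    f_aif, f_fs, rng.standard_normal(4), rng.standard_normal(4)
)
```

With an all-ones weight matrix, each similarity entry reduces to the product of the channel sums at the two positions, so the weight matrix itself introduces no learned preference for either branch.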
To obtain the final prediction, we further design a PF module to gradually combine the selective high-level features provided by our SA module (Figure 2). In practice, AiF images provide more informative and less redundant features than FSs, and are therefore regarded as the higher-quality modality. Based on this observation, we further refine the FS features with an AiF-induced attention (AA) component (Figure 4) before the final fusion of the two modalities. The AA component unifies a spatial attention component and a channel attention component.
We then concatenate the resulting features and feed them to a deconvolutional block for the final prediction.
The deconvolutional block consists of three deconvolutional layers and convolutional layers that are organized in a cascaded manner (Figure 2).
As shown in Figure 2, our model predicts three saliency maps. Given the ground-truth saliency map, we jointly optimize the three-way predictions with a hybrid loss.
The hybrid loss combines the Binary Cross Entropy (BCE) loss, the Intersection over Union (IoU) loss, and a loss term based on the E-Measure [Fan2018Enhanced].
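A hedged NumPy sketch of such a hybrid loss follows. The unweighted sum of the three terms, the soft IoU formulation, the E-Measure term taken as one minus the mean enhanced alignment, and all function names are assumptions; the text only states that BCE, IoU, and an E-Measure-based term are combined over three predictions.

```python
import numpy as np

EPS = 1e-8


def bce_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    p = np.clip(pred, EPS, 1.0 - EPS)  # avoid log(0)
    return float(-np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)))


def iou_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.sum(pred * gt)
    union = np.sum(pred) + np.sum(gt) - inter
    return float(1.0 - (inter + EPS) / (union + EPS))


def e_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    # Enhanced alignment of [Fan2018Enhanced], applied to soft predictions.
    dp = pred - pred.mean()
    dg = gt - gt.mean()
    align = 2.0 * dp * dg / (dp ** 2 + dg ** 2 + EPS)
    enhanced = (align + 1.0) ** 2 / 4.0
    return float(1.0 - enhanced.mean())


def hybrid_loss(preds, gt: np.ndarray) -> float:
    # Assumption: the three predictions and three terms are weighted equally.
    return sum(bce_loss(p, gt) + iou_loss(p, gt) + e_loss(p, gt) for p in preds)


gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
perfect = hybrid_loss([gt, gt, gt], gt)
inverted = hybrid_loss([1.0 - gt] * 3, gt)
```

In this sketch a perfect prediction drives all three terms close to zero, while an inverted prediction is penalized by every term.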
Our SA-Net is implemented in PyTorch and optimized with the Adam algorithm [kingma2014adam]. The backbone of SA-Net is based on a standard 2D ResNet50 for AiF images and an inflated 3D ResNet50 [carreira2017quo] for FSs. The 2D convolutional layers in our backbone are initialized with an ImageNet-pretrained ResNet50, while the 3D convolutional layers are initialized with a 2D weight transfer strategy [carreira2017quo]. During the training stage, the batch size is set to 2, and the learning rate is initialized as 1e-5 and reduced by a factor of 10 when the training loss plateaus. It takes about 14 hours to train the proposed model on a platform consisting of an Intel i9-7900X CPU @ 3.30GHz and one TITAN XP GPU.
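As a sketch of the 2D weight transfer, assuming it follows the kernel inflation described in [carreira2017quo]: each 2D kernel is repeated along a new temporal axis and rescaled by the temporal size. The function name and the `(C_out, C_in, kH, kW)` weight layout are assumptions for illustration.

```python
import numpy as np


def inflate_conv_weight(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D kernel (C_out, C_in, kH, kW) to 3D (C_out, C_in, t, kH, kW)."""
    # Repeat the 2D kernel t times along a new temporal axis.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2)
    # Rescale so that the temporal slices sum back to the original 2D kernel.
    return w3d / t


w2d = np.random.rand(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d, t=7)
```

The rescaling means that, on a clip made of one frame repeated over time, the inflated 3D filter gives the same response as the original 2D filter on that frame, so the ImageNet-pretrained behavior is preserved at initialization.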
Datasets. We evaluate our SA-Net and 28 state-of-the-art SOD methods on three widely-used light field datasets: DUT-LF, HFUT, and LFSD, which all provide an FS and a ground-truth mask for each of the AiF images (see details in Section 2.1). We follow the settings in [Piao2020ExploitARERNet], where 1000/100 AiF images of DUT-LF/HFUT, respectively, are randomly selected as the training set, while the remaining images (462 + 155) and the whole LFSD are used for testing.
Metrics. We adopt the recently proposed S-Measure ($S_\alpha$) [fan2017structure] and E-Measure ($E_\phi$) [Fan2018Enhanced], as well as the widely used Mean Absolute Error ($\mathcal{M}$) [perazzi2012saliency] and F-Measure ($F_\beta$) [borji2015salient], as evaluation metrics for the quantitative comparison between benchmark models and SA-Net. Following the benchmark in [Piao2020ExploitARERNet], we report the adaptive F/E-Measure scores. See the supplemental materials (SM) for more details about the metrics.
We quantitatively compare our SA-Net with 12/9/7 state-of-the-art RGB/RGB-D/light field SOD methods, respectively. As shown in Table 1, our SA-Net outperforms all the state-of-the-art SOD models by a large margin in terms of all four evaluation metrics. We also perform a detailed comparison between our SA-Net and the cutting edge 4D SOD methods by using F/E-Measure curves. The results, shown in Figure 5, indicate that our SA-Net provides the F/E-Measure curves that are higher than the competing models, further confirming the superiority of SA-Net. Finally, we show some of the predicted saliency maps in Figure 6. As can be observed, our SA-Net provides the saliency maps closest to the ground truths. In contrast, the competing models show unsatisfactory performance and give saliency maps with either missing or extra parts. Please refer to our SM for complete quantitative results of 28 baseline models and more visual results.
To verify the effectiveness of each proposed module of our SA-Net, we conduct thorough ablation studies by gradually adding key components. We first construct a baseline “Model0”, which extracts AiF and FS features with two 2D ResNet50 backbones, simply concatenates, and up-samples the pair-wise high-level features for SOD.
Effectiveness of multi-modal encoder. To investigate the effectiveness of our multi-modal encoder, we construct the second ablated version “Model1”, which is similar to “Model0” but uses a 3D backbone to extract FS features, consistent with our multi-modal encoder (Section 3.1). The results, shown in Table 2, indicate that “Model1” outperforms “Model0” on all evaluation metrics, demonstrating the superiority of our 3D CNN-based encoder.
Effectiveness of SA module. To investigate the effectiveness of our SA module, we further construct “Model2” and “Model3”, which incorporate the SA module into “Model1” without and with CoA, respectively. As shown in Table 2, both “Model2” and “Model3” improve the performance in comparison with “Model1”. In particular, the full version of SA (“Model3”) provides a significant improvement over “Model1”, indicating the importance of synergistic attention for learning the complementarities of multi-modal features.
Effectiveness of PF module. Compared with “Model3”, “Model4” uses the deconvolutional block (Figure 2) to gradually up-sample the features for predicting the saliency map. Besides, a three-way supervision (“Model5”) is further employed to provide deep supervision for the training of the AiF and FS branches. Finally, with the AA component (Section 3.3), our SA-Net achieves the best performance (Table 2) and provides the saliency maps closest to the ground truth (Figure 7).
In this paper, we propose a novel deep learning model, SA-Net, which addresses the light field SOD by learning the synergistic attention for AiF and FS features. The innovative attributes of our SA-Net are three-fold: (i) it exploits the cross-modal complementary information by establishing a synergistic effect between multi-modal features, (ii) it is the first attempt to learn both the spatial and inter-frame features of FSs with 3D CNNs, and (iii) it predicts the saliency map with an effective fusion model in a progressive manner. Extensive qualitative and quantitative experimental results on three widely-used light field datasets demonstrate the effectiveness and superiority of our SA-Net.
In this work, we evaluate all 28 benchmark models and our SA-Net with four widely used SOD metrics, computed between the ground-truth binary foreground map and the predicted saliency map. The F-Measure ($F_\beta$) [borji2015salient] and mean absolute error (MAE) [perazzi2012saliency] focus on the local (per-pixel) match between ground truth and prediction, while the S-Measure ($S_\alpha$) [fan2017structure] pays attention to the object structure similarities. Besides, the E-Measure ($E_\phi$) [Fan2018Enhanced] considers both the local and global information.
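MAE computes the mean absolute error between the ground truth $G$ and a normalized predicted saliency map $S$, i.e.,
$$\mathcal{M} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$
where $H$ and $W$ denote height and width, respectively.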
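As a small illustration, MAE can be computed as follows. The min-max normalization of the prediction to [0, 1] is an assumption (the text only says the saliency map is normalized), and the function name is hypothetical.

```python
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and a binary ground truth."""
    pred = pred.astype(np.float64)
    value_range = pred.max() - pred.min()
    if value_range > 0:
        # Assumption: min-max normalization of the prediction to [0, 1].
        pred = (pred - pred.min()) / value_range
    return float(np.mean(np.abs(pred - gt)))


gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
score_perfect = mae(gt * 255, gt)   # an 8-bit style map that matches the mask
score_inverted = mae(1.0 - gt, gt)  # every pixel is wrong
```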
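F-Measure gives a single-valued score ($F_\beta$) considering both Precision and Recall, and is defined as:
$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{|M \cap G|}{|M|}, \quad \mathrm{Recall} = \frac{|M \cap G|}{|G|},$$
where $M$ denotes a binary mask converted from a predicted saliency map and $G$ is the ground truth. Multiple $F_\beta$ scores are computed by applying different thresholds to the saliency map. Note that $\beta^2$ is set to 0.3 according to [achanta2009frequency]. Besides, the adaptive F-Measure-based results reported in the submission are calculated by applying an adaptive threshold algorithm [borji2015salient].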
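The adaptive variant can be sketched as below. The threshold of twice the mean saliency, capped at 1, is a common choice and is assumed here rather than taken from the text; the function name is hypothetical, and the prediction is assumed to lie in [0, 1].

```python
import numpy as np

EPS = 1e-8


def adaptive_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-Measure with beta^2 = 0.3 and an adaptive binarization threshold."""
    # Assumption: adaptive threshold = twice the mean saliency, capped at 1.
    threshold = min(2.0 * float(pred.mean()), 1.0)
    mask = pred >= threshold
    gt_bool = gt > 0.5
    true_pos = np.logical_and(mask, gt_bool).sum()
    precision = true_pos / (mask.sum() + EPS)
    recall = true_pos / (gt_bool.sum() + EPS)
    return float((1.0 + beta2) * precision * recall / (beta2 * precision + recall + EPS))


gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
f_perfect = adaptive_f_measure(gt, gt)
f_inverted = adaptive_f_measure(1.0 - gt, gt)
```

Setting $\beta^2 < 1$ weights Precision more heavily than Recall, which is the usual convention in SOD evaluation.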
S-Measure evaluates the structure similarities between salient objects in ground-truth foreground maps and predicted saliency maps:
$$S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r,$$
where $S_o$ and $S_r$ denote the object-based and region-based structure similarities, respectively. $\alpha$ is set to 0.5 so that equal weights are assigned to both the object-level and region-level assessments [fan2017structure].
E-Measure is a cognitive vision-inspired metric to evaluate both the local and global similarities between two binary maps. Specifically, it is defined as:
$$E_\phi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi(x, y),$$
where $\phi$ represents the enhanced alignment matrix [Fan2018Enhanced]. Similar to $F_\beta$, the adaptive E-Measure is adopted for the evaluation in our submission.
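A compact sketch of the E-Measure for a single binarized prediction is given below, following the enhanced alignment matrix of [Fan2018Enhanced]. The handling of empty or full ground-truth masks and the plain mean over all pixels are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

EPS = 1e-8


def e_measure(mask: np.ndarray, gt: np.ndarray) -> float:
    """E-Measure between a binary prediction mask and a binary ground truth."""
    mask = mask.astype(np.float64)
    gt = gt.astype(np.float64)
    if gt.sum() == 0:          # no foreground: reward predicted background
        return float(np.mean(1.0 - mask))
    if gt.sum() == gt.size:    # all foreground: reward predicted foreground
        return float(np.mean(mask))
    # Bias matrices: deviation of each pixel from its map's global mean.
    dm = mask - mask.mean()
    dg = gt - gt.mean()
    align = 2.0 * dm * dg / (dm ** 2 + dg ** 2 + EPS)  # alignment in [-1, 1]
    phi = (align + 1.0) ** 2 / 4.0                      # enhanced alignment in [0, 1]
    return float(phi.mean())


gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
e_perfect = e_measure(gt, gt)
e_inverted = e_measure(1.0 - gt, gt)
```

For the adaptive E-Measure, the mask would first be obtained by binarizing the saliency map with an adaptive threshold, in the same way as for the adaptive F-Measure.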
We perform extensive experiments to compare our SA-Net with 28 state-of-the-art SOD methods. Due to the page limit, we only report the quantitative results of 18 most recently proposed methods in our submission. According to the complete results shown in Table 3, our SA-Net outperforms all the 28 competing models upon three datasets in terms of all the four metrics, which demonstrates the superiority and effectiveness of our proposed method.
Comparison of Ablation Models. Due to the page limit, we only show partial visual results of ablation studies in our submission. To further illustrate the benefit of each key component in our SA-Net, we show complete qualitative results for all the six ablation models in Figure 8. As can be observed, each component improves the quality of predicted saliency maps and contributes to the superior performance of SA-Net.
Comparison with State-of-the-Arts. To further demonstrate the effectiveness of our SA-Net, we show extensive visual results of our method as well as the competing models on the three benchmark datasets (Figures 9 to 12). Overall, our proposed SA-Net depicts fine object structures and produces fewer false positives and false negatives, thus giving predictions closest to the ground truths.