A Fixation-based 360° Benchmark Dataset for Salient Object Detection

Yi Zhang, et al. · 01/22/2020

Fixation prediction (FP) in panoramic content has been widely investigated along with the booming trend of virtual reality (VR) applications. However, another issue within the field of visual saliency, salient object detection (SOD), has seldom been explored in 360 (or omnidirectional) images, due to the lack of datasets representative of real scenes with pixel-level annotations. Toward this end, we collect 107 equirectangular panoramas with challenging scenes and multiple object classes. Based on the consistency between FP and explicit saliency judgements, we further manually annotate 1,165 salient objects over the collected images with precise masks, under the guidance of real human eye fixation maps. Six state-of-the-art SOD models are then benchmarked on the proposed fixation-based 360 image dataset (F-360iSOD) by applying a multiple cubic projection-based fine-tuning method. Experimental results show the limitations of current methods when applied to SOD in panoramic images, which indicates that the proposed dataset is challenging. Key issues for 360 SOD are also discussed. The proposed dataset is available at https://github.com/Panorama-Bill/F-360iSOD.


1 Introduction

The panoramic image, or 360 (omnidirectional) image, captures the content over the whole 360°×180° viewing range surrounding a viewer; it plays an important role in virtual reality (VR) applications and distinguishes itself from the traditional 2-dimensional (2D) image, which covers only one specific plane. Recently, commercial Head-Mounted Displays (HMDs) have been developed to provide observers with an immersive, even interactive, experience by allowing them to freely rotate their heads and thus focus on desired scenes and objects. Since some salient parts of a 360 image attract more human attention than others [22], visual saliency prediction in panoramas has become one of the central issues within the field of computer vision and is considered key to studying human observation behavior in virtual environments. Fixation prediction (FP) and salient object detection (SOD) are both closely related to the concept of visual saliency. Thanks to the accessibility of HMDs and eye trackers, image [21] and video (e.g., [11, 28, 32]) datasets have been constructed for deep learning-based FP in panoramic content. However, to the best of our knowledge, [13] is the only study of SOD in 360 scenarios, and it does not use fixations as guidance for the salient object annotation.

[figure1.pdf]

Figure 1: Representative samples of the proposed fixation-based 360 image dataset (F-360iSOD). First row: four panoramic images in equirectangular format; second row: images overlaid with thresholded fixation maps; third row: object-level ground truths; fourth row: instance-level ground truths.

As shown in Fig. 1, 360 images tend to contain richer scenes and many more foreground objects than the flat 2D images of traditional SOD datasets (e.g., [29, 18, 15, 12, 23, 30]). It is therefore more challenging to differentiate salient objects from non-salient ones in panoramas. Keeping only 360 images with a few obvious foreground objects while discarding ambiguous ones would introduce selection bias into the dataset and hinder the study of real human attention behavior when viewing panoramic content. Based on the strong correlation between FP and explicit human judgements [2], and on the successfully established fixation-based 2D SOD datasets [2, 14, 9], we argue that salient objects in panoramas can also be manually annotated with the assistance of fixations, thus representing real-world daily scenes. The main contributions of this paper are: 1) a fixation-based 360 image dataset (F-360iSOD) with both object- and instance-level pixel-wise annotations; 2) a benchmark of six recently proposed state-of-the-art 2D SOD models [20, 33, 16, 24, 25, 34], evaluated with five widely used SOD metrics [1, 17, 6, 19, 7] on the proposed dataset in a cross-testing manner, using a multiple cubic projection-based fine-tuning strategy [3]; 3) a discussion of key issues for 360 SOD.

2 Related work

2.1 2D SOD Datasets

ECSSD [29], SOD [18], PASCAL-S [15], HKU-IS [12], DUTS [23] and DUT-OMRON [30] are the six most widely used image datasets for benchmarking deep learning-based 2D SOD models; they all provide precise pixel-wise object-level annotations and challenging scenes. It is worth mentioning that many deep learning-based SOD models have been trained on the training set of DUTS since 2017. DUTS is so far the largest image dataset for SOD, containing 10,553 training images and 5,019 testing images. SOC [5] is a more recently proposed SOD dataset with 6,000 images and about 80 category labels; it includes 3,000 images without salient objects (or with only background objects) and provides both precise object- and instance-level ground truths. Further research [8] also emphasizes the importance of image depth (D) information by proposing RGB-D-based SOD benchmarks. In addition, similar to JUDD-A [2] (image SOD), two video-based SOD datasets [14, 9] also apply fixations to aid the manual annotation of salient objects, offering new ideas for future SOD dataset construction.

2.2 SOD Models for 2D Images

In recent years, fully convolutional network (FCN)-based models have dominated the field of SOD. The FCN-based architecture differentiates itself from other deep learning methods by producing saliency maps as outputs, rather than classification scores, and thus predicts a saliency map in a single feed-forward pass thanks to end-to-end learning. EGNet [33] is one of the recently proposed top-performing state-of-the-art models. It is motivated by the idea that simultaneously learning salient edge and salient object information can help improve SOD performance, and it models these two complementary cues with an independent network outside the VGG-based backbone. SCRN [25] is another newly proposed SOD model that considers edge information; it also performs SOD and salient edge detection synchronously, by stacking several so-called cross refinement units in an end-to-end manner. BASNet [20] proposes a residual refinement module and a hybrid loss to refine salient object boundaries in the predicted saliency maps. PoolNet [16] improves the feature extraction efficiency across the multiple layers of the common U-shaped architecture by adding two new modules, both designed on top of simple pooling techniques. GCPANet [34] is a more recently proposed method which improves the traditional bottom-up/top-down networks with four new modules. CPD [24] modifies the traditional encoder-decoder framework to directly refine high-level features with the generated saliency maps, without considering low-level features; this idea differs from PoolNet and GCPANet, which integrate both low- and high-level features. Due to limited space, we do not cover all SOD models in this section (see recent benchmark studies [5] for more).

2.3 Panoramic Datasets

Generally, there are two types of panoramic datasets, focusing on head movement (HM) prediction and FP, respectively. Datasets such as 360-VHMD [4], VR-VQA48 [26] and PVS-HM [27] contain only head tracking data, while Salient!360 [21], Stanford360 [22], VQA-OV [11], VR-scene [28] and 360-Saliency [32] provide ground-truth eye fixations. In addition, 360-SOD [13] is a newly proposed omnidirectional image dataset for SOD; however, its salient objects are labeled purely by human judgement rather than under fixation-based guidance, and it provides neither instance-level ground truths nor object class labels.

3 A New Dataset for panoramic SOD

[figure2.pdf]

Figure 2: Statistics of the proposed dataset (F-360iSOD).

In this section, we present a new fixation-based 360 image dataset, called F-360iSOD, which contains 107 (52 indoor/55 outdoor) panoramic images of challenging real-world daily scenes and 1,165 salient objects (from 72 object classes) manually labeled with precise object- and instance-level masks.

3.1 Image Collection

F-360iSOD is a small-scale 360 dataset with a total of 107 panoramic images collected from Salient!360 [21] and Stanford360 [22], which contribute 85 and 22 equirectangular images, respectively. To the best of our knowledge, Stanford360 and Salient!360 are the only panoramic image datasets that provide raw eye fixation data. All images of the proposed F-360iSOD are presented in equirectangular format at a size of 2048×1024 for convenient processing.

3.2 Salient Object Annotation

Inspired by the 2D SOD datasets [2, 14, 9] in which fixation data are used to aid the salient object annotation, an expert was asked to manually annotate (by tracing boundaries) the salient objects with both object- and instance-level masks on the collected equirectangular images, under the guidance of fixation maps convolved with a Gaussian whose standard deviation was empirically set to 2° of visual angle (each Gaussian-smoothed fixation map was thresholded with an adaptive saliency value that keeps its top 10% before being shown to the annotator). The whole annotation process was repeated three times and passed a quality check by two other experts to produce the final ground truths. In addition, nine images without any salient object annotation are kept in F-360iSOD, to avoid a common bias of 2D SOD datasets (as mentioned in [31]) introduced by the assumption that there is at least one salient object in every image.
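For concreteness, the sketch below shows one way such a thresholded fixation map could be produced. It is a minimal illustration rather than the annotation tool used for F-360iSOD, and it assumes raw fixations given as (x, y) pixel coordinates on the 2048×1024 equirectangular image; at that resolution, 2° of visual angle corresponds to roughly 11 pixels horizontally, and the latitude-dependent distortion of the equirectangular projection is ignored here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def thresholded_fixation_map(fixations, height=1024, width=2048,
                             sigma_px=11.4, top_fraction=0.1):
    """Accumulate fixation points, smooth them with a Gaussian, and keep
    only the top fraction (here 10%) of saliency values as guidance for
    the annotator. Panoramic distortion is not handled in this sketch."""
    fix_map = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:                      # raw fixations in pixels
        fix_map[int(y) % height, int(x) % width] += 1.0
    smoothed = gaussian_filter(fix_map, sigma=sigma_px)
    threshold = np.quantile(smoothed, 1.0 - top_fraction)  # adaptive cut
    return np.where(smoothed >= threshold, smoothed, 0.0)
```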

3.3 Dataset Statistics

In F-360iSOD, each salient object belongs to one specific class. In total, there are 1,165 salient objects from 72 categories, covering 7 aspects of common real-world scenes: human, text, vehicle, architecture, artwork, animal and daily stuff (Fig. 2). The person category occupies the largest proportion with 386 instances; other relatively large object classes include painting, text, building, person face and car, with 92, 89, 86, 75 and 72 instances, respectively.

4 Experimental results and discussions

4.1 Dataset Split

F-360iSOD consists of one training set and two testing sets, denoted F-360iSOD-train, F-360iSOD-testA and F-360iSOD-testB, respectively. F-360iSOD-train contains 68 equirectangular images from Salient!360 [21], while F-360iSOD-testA holds the remaining 17 of its 85 images. F-360iSOD-testB is established to enable cross-testing of SOD models, with the 22 images from the other panoramic image dataset (Stanford360 [22]).

4.2 Projection Methods

When wearing HMDs, people can freely rotate their heads so that multiple viewports focus on the attractive regions of the surrounding 360 content. Based on this prior knowledge, we apply cubemap projection (in which a 360 image is projected onto 6 rectangular patches) to the 68 panoramic images of F-360iSOD-train with multiple rotation angles (0°, 30° and 60°, both horizontally and vertically [3]). We thus obtain 54 (6×3×3) patches representing multiple fields of view for each 360 image. In total, 3,672 (54×68) 2D patches (256×256) are generated and used as inputs for fine-tuning the 2D SOD models.
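The sketch below illustrates one way to extract such rotated cube-face patches from an equirectangular image. It is our own simplified illustration (90° field of view per face, nearest-neighbour sampling, yaw/pitch rotations only), not the exact projection code used in [3] or in our benchmark.

```python
import numpy as np

def extract_patch(equi, yaw_deg, pitch_deg, fov_deg=90.0, size=256):
    """Sample one perspective (cube-face-like) patch from an equirectangular
    image, centred at the given yaw/pitch angles (nearest-neighbour)."""
    h, w = equi.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)        # focal length
    xs, ys = np.meshgrid(np.arange(size) - size / 2 + 0.5,
                         np.arange(size) - size / 2 + 0.5)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)   # x right, y down, z forward
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    p, y = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    rays = rays @ (Ry @ Rx).T                                 # apply pitch, then yaw
    lon = np.arctan2(rays[..., 0], rays[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    v = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)
    return equi[v, u]

# 6 cube-face centres x 9 combined rotation offsets (3 horizontal x 3 vertical) = 54 patches.
faces = [(0, 0), (90, 0), (180, 0), (-90, 0), (0, 90), (0, -90)]   # (yaw, pitch)
offsets = [(dy, dp) for dy in (0, 30, 60) for dp in (0, 30, 60)]
viewports = [(fy + dy, fp + dp) for fy, fp in faces for dy, dp in offsets]
```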

4.3 Evaluation Metrics

To measure the agreement between the manually labeled ground truths and the model predictions, we adopt five widely used SOD metrics: F-measure curves [1], weighted F-measure (Fbw) [17], mean absolute error (MAE) [19], structural measure (S-measure) [6] and enhanced-alignment measure (E-measure) curves [7]. Note that β² is set to 0.3 in the F-measure and Fbw to place more emphasis on precision, as suggested in [1]. The F-measure, MAE and Fbw address pixel-wise errors, while the S-measure evaluates the structural similarity between a predicted saliency map and the binary ground truth:

S = α · S_o + (1 − α) · S_r,    (1)

where S_o and S_r denote the object-aware and region-aware structural similarities, respectively. α is empirically set to 0.7 (rather than the 0.5 used in 2D) to attach more importance to object structure, based on the observation that panoramic images are usually dominated by small salient objects distributed over the whole image (e.g., Fig. 1), rather than by one or several spatially connected foreground objects located at the image center. The E-measure is a more recently proposed SOD metric which combines pixel- and image-level information:

E = 1 / (W × H) · Σ_{x=1}^{W} Σ_{y=1}^{H} φ_FM(x, y),    (2)

where φ_FM denotes the enhanced alignment matrix, and H and W are the height and width of the foreground map.
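As a concrete illustration of the simpler pixel-wise metrics, the snippet below computes the MAE and the thresholded F-measure (with β² = 0.3) for a single prediction; it is a minimal sketch of our own, and the weighted Fbw, S-measure and E-measure involve additional structural terms that are not shown here.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0, 1] saliency map and a binary mask."""
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))

def f_measure(pred, gt, threshold, beta2=0.3, eps=1e-8):
    """F-measure at one binarisation threshold; beta^2 = 0.3 weights
    precision more heavily than recall, as suggested in [1]."""
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / ((gt > 0).sum() + eps)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + eps)

# An F-measure curve sweeps the threshold over a range of values, e.g.:
# scores = [f_measure(pred, gt, t) for t in np.linspace(0, 1, 255)]
```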

4.4 Benchmarking Results

[figure3.pdf]

Figure 3: F-measure curves and E-measure curves obtained by six state-of-the-art SOD models on the F-360iSOD.

[figure4.pdf]

Figure 4: A qualitative comparison between six state-of-the-art SOD models on F-360iSOD.

In our study, each SOD model is fine-tuned on F-360iSOD-train with an initial learning rate of one tenth of its default value and a batch size of 1. Training stops as soon as the S-measure on F-360iSOD-testA starts to drop. As a result, BASNet [20], EGNet [33], CPD [24] and SCRN [25] take about 20 epochs to converge, while PoolNet [16] takes 70 and GCPANet [34] takes 15. Quantitative and qualitative comparisons of the six state-of-the-art 2D SOD models on F-360iSOD-testA and -testB are shown in Table 1, Fig. 3 and Fig. 4.
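The fine-tuning procedure described above can be summarised by the schematic PyTorch-style loop below. This is a sketch of the training logic only: model, criterion, default_lr, max_epochs, train_loader, testA_loader and evaluate_s_measure are placeholders rather than released code, and the optimizer is shown as Adam purely for illustration (in practice each model keeps its own default optimizer).

```python
import copy
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.1 * default_lr)  # 1/10 of default
best_s = 0.0
best_state = copy.deepcopy(model.state_dict())

for epoch in range(max_epochs):
    model.train()
    for image, mask in train_loader:             # batch size of 1
        optimizer.zero_grad()
        loss = criterion(model(image), mask)     # model-specific loss
        loss.backward()
        optimizer.step()

    s = evaluate_s_measure(model, testA_loader)  # S-measure on F-360iSOD-testA
    if s < best_s:                               # stop once it starts to drop
        model.load_state_dict(best_state)        # keep the best checkpoint
        break
    best_s = s
    best_state = copy.deepcopy(model.state_dict())
```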

Methods         F-360iSOD-testA           F-360iSOD-testB
                Fbw↑   S↑     MAE↓        Fbw↑   S↑     MAE↓
SCRN [25]       .551   .809   .050        .124   .708   .034
BASNet [20]     .567   .825   .046        .118   .683   .048
CPD [24]        .521   .763   .052        .129   .695   .032
PoolNet [16]    .500   .834   .068        .136   .716   .058
GCPANet [34]    .630   .822   .045        .106   .693   .039
EGNet [33]      .715   .864   .045        .190   .714   .041
Table 1: A quantitative comparison of six state-of-the-art SOD models on F-360iSOD, where Fbw is the weighted F-measure, S is the S-measure and MAE is the mean absolute error (↑: higher is better; ↓: lower is better).

4.5 Discussions

Features of 360 datasets. All the benchmarked models are constrained to some extent on the proposed F-360iSOD, even though they achieve high performance in 2D SOD [20, 33, 16, 24, 25, 34]. The limitation is mainly due to the challenges posed by the characteristics of 360 datasets, such as distortions induced by the equirectangular projection, small objects and cluttered scenes.

Fixation-based complexity analysis. Since panoramic images tend to contain many more scenes and objects than 2D images, the ambiguity of saliency judgements in panoramas should also be considered. It can be quantified by the inter-observer congruency (IOC) [10] and by the entropy of the fixation maps (here the fixation maps are smoothed with a Gaussian with a standard deviation of 1° of visual angle, as suggested in [22]). Since an image with high IOC and low entropy is usually considered simple, F-360iSOD-testB should be easier to explore than F-360iSOD-testA (Fig. 5), from the perspective of human judgements.
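For the entropy term, a Gaussian-smoothed fixation map can be normalised into a probability distribution over pixels and its Shannon entropy computed as in the sketch below (our own minimal illustration; the leave-one-out IOC computation [10] is omitted).

```python
import numpy as np

def fixation_map_entropy(smoothed_map, eps=1e-12):
    """Shannon entropy (in bits) of a Gaussian-smoothed fixation map,
    treated as a probability distribution over pixels; higher entropy
    indicates more dispersed, hence more ambiguous, viewing behaviour."""
    p = smoothed_map.astype(np.float64)
    p = p / (p.sum() + eps)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```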

Unseen object classes. All benchmarked models fail on F-360iSOD-testB, mainly due to the presence of object classes unseen during training in Stanford360 [22], such as sharks, bells and robots. People are capable of recognizing new object categories when provided with high-level descriptions; this strong generalization ability is still absent in current SOD models.

[figure5.pdf]

Figure 5: A fixation-based complexity analysis of the proposed F-360iSOD. The F-360iSOD-train, F-360iSOD-testA/B are annotated in black, blue and red, respectively.

Instance-level ground-truths. To the best of our knowledge, the proposed F-360iSOD is the first 360 dataset that provides instance-level semantic labels for salient objects. Future SOD models need to recognize individual instances of multiple classes, which is crucial for practical applications such as image captioning and scene understanding.

5 Conclusions

In this paper, we propose a fixation-based 360 image dataset (F-360iSOD) with precisely annotated salient objects and instances from multiple classes representative of real-world daily scenes. Six recently proposed top-performing SOD methods are fine-tuned and tested on F-360iSOD. The results reveal the limitations of current 2D models when directly applied to SOD in panoramas. We believe F-360iSOD can serve as one of the basic panoramic datasets and thus support future 360 SOD model development.

References

  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk (2009) Frequency-tuned salient region detection. In IEEE CVPR, pp. 1597–1604. Cited by: §1, §4.3.
  • [2] A. Borji (2014) What is a salient object? a dataset and a baseline model for salient object detection. IEEE TIP 24 (2), pp. 742–756. Cited by: §1, §2.1, §3.2.
  • [3] F. Chao, L. Zhang, W. Hamidouche, and O. Deforges (2018) SalGAN360: visual saliency prediction on 360 degree images with generative adversarial networks. In ICMEw, pp. 1–4. Cited by: §1, §4.2.
  • [4] X. Corbillon, F. De Simone, and G. Simon (2017) 360-degree video head movement dataset. In MMSys, pp. 199–204. Cited by: §2.3.
  • [5] D. Fan, M. Cheng, J. Liu, S. Gao, Q. Hou, and A. Borji (2018) Salient objects in clutter: bringing salient object detection to the foreground. In ECCV, pp. 186–202. Cited by: §2.1, §2.2.
  • [6] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In IEEE ICCV, pp. 4548–4557. Cited by: §1, §4.3.
  • [7] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment measure for binary foreground map evaluation. IJCAI, pp. 698–704. Cited by: §1, §4.3.
  • [8] D. Fan, Z. Lin, J. Zhao, Y. Liu, Z. Zhang, Q. Hou, M. Zhu, and M. Cheng (2019) Rethinking rgb-d salient object detection: models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781. Cited by: §2.1.
  • [9] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In IEEE CVPR, pp. 8554–8564. Cited by: §1, §2.1, §3.2.
  • [10] O. Le Meur, T. Baccino, and A. Roumy (2011) Prediction of the inter-observer visual congruency (iovc) and application to image ranking. In ACM MM, pp. 373–382. Cited by: §4.5.
  • [11] C. Li, M. Xu, X. Du, and Z. Wang (2018) Bridge the gap between vqa and human behavior on omnidirectional video: a large-scale dataset and a deep learning model. In ACM MM, pp. 932–940. Cited by: §1, §2.3.
  • [12] G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In IEEE CVPR, pp. 5455–5463. Cited by: §1, §2.1.
  • [13] J. Li, J. Su, C. Xia, and Y. Tian (2019) Distortion-adaptive salient object detection in 360° omnidirectional images. IEEE JSTSP. Cited by: §1, §2.3.
  • [14] J. Li, C. Xia, and X. Chen (2017) A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection. IEEE TIP 27 (1), pp. 349–364. Cited by: §1, §2.1, §3.2.
  • [15] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In IEEE CVPR, pp. 280–287. Cited by: §1, §2.1.
  • [16] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In IEEE CVPR, Cited by: §1, §2.2, §4.4, §4.5, Table 1.
  • [17] R. Margolin, L. Zelnik-Manor, and A. Tal (2014) How to evaluate foreground maps?. In IEEE CVPR, pp. 248–255. Cited by: §1, §4.3.
  • [18] V. Movahedi and J. H. Elder (2010) Design and perceptual validation of performance measures for salient object segmentation. In CVPRw, pp. 49–56. Cited by: §1, §2.1.
  • [19] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung (2012) Saliency filters: contrast based filtering for salient region detection. In IEEE CVPR, pp. 733–740. Cited by: §1, §4.3.
  • [20] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In IEEE CVPR, pp. 7479–7489. Cited by: §1, §2.2, §4.4, §4.5, Table 1.
  • [21] Y. Rai, J. Gutiérrez, and P. Le Callet (2017) A dataset of head and eye movements for 360 degree images. In MMSys, pp. 205–210. Cited by: §1, §2.3, §3.1, §4.1.
  • [22] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia, and G. Wetzstein (2018) Saliency in vr: how do people explore virtual environments?. IEEE TVCG 24 (4), pp. 1633–1642. Cited by: §1, §2.3, §3.1, §4.1, §4.5, §4.5.
  • [23] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In IEEE CVPR, pp. 136–145. Cited by: §1, §2.1.
  • [24] Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In IEEE CVPR, pp. 3907–3916. Cited by: §1, §2.2, §4.4, §4.5, Table 1.
  • [25] Z. Wu, L. Su, and Q. Huang (2019) Stacked cross refinement network for edge-aware salient object detection. In IEEE ICCV, pp. 7264–7273. Cited by: §1, §2.2, §4.4, §4.5, Table 1.
  • [26] M. Xu, C. Li, Y. Liu, X. Deng, and J. Lu (2017) A subjective visual quality assessment method of panoramic videos. In ICME, pp. 517–522. Cited by: §2.3.
  • [27] M. Xu, Y. Song, J. Wang, M. Qiao, L. Huo, and Z. Wang (2018) Predicting head movement in panoramic video: a deep reinforcement learning approach. IEEE TPAMI. Cited by: §2.3.
  • [28] Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, and S. Gao (2018) Gaze prediction in dynamic 360 immersive videos. In IEEE CVPR, pp. 5333–5342. Cited by: §1, §2.3.
  • [29] Q. Yan, L. Xu, J. Shi, and J. Jia (2013) Hierarchical saliency detection. In IEEE CVPR, pp. 1155–1162. Cited by: §1, §2.1.
  • [30] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In IEEE CVPR, pp. 3166–3173. Cited by: §1, §2.1.
  • [31] Y. Zhang, L. Zhang, W. Hamidouche, and O. Deforges (2020) Key issues for the construction of salient object datasets with large-scale annotation. In IEEE MIPR, Cited by: §3.2.
  • [32] Z. Zhang, Y. Xu, J. Yu, and S. Gao (2018) Saliency detection in 360 videos. In ECCV, pp. 488–503. Cited by: §1, §2.3.
  • [33] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet: edge guidance network for salient object detection. In IEEE ICCV, pp. 8779–8788. Cited by: §1, §2.2, §4.4, §4.5, Table 1.
  • [34] Z. Chen, Q. Xu, R. Cong, and Q. Huang (2020) Global context-aware progressive aggregation network for salient object detection. In AAAI. Cited by: §1, §2.2, §4.4, §4.5, Table 1.