ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos

07/24/2021, by Yi Zhang, et al.

Exploring what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD), which mimics the human attention mechanism by segmenting salient objects with the guidance of audio-visual cues. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, distinguishing itself in terms of richness, diversity, and quality. Specifically, each sequence is marked with its super- and sub-class, and the objects of each sub-class are further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models and predicting scanpaths to study the long-term eye fixation behavior of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study can serve as a good starting point for advancing SOD research towards panoramic videos.


1 Introduction

Recently, AI companies and manufacturers have developed several panoramic cameras, such as Facebook’s Surround360, Insta360 One, Ricoh Theta, and Google Jump VR, which produce omnidirectional images (fig:AnnotationExamples (b)) capturing a scene with a 360° field-of-view (FoV). (In the following sections, we use ‘omnidirectional’, ‘panoramic’, and ‘360°’ interchangeably.) Exploring human attention in dynamic scenes captured by these devices is therefore of significant importance to augmented/virtual reality (AR/VR) applications, e.g., shopping, online recruitment, piloting [hu2017deep], automatic cinematography [su2016pano2vid], and immersive games.

In practice, we have found that existing object-level saliency detection (i.e., SOD) techniques and the datasets that underpin their progress are subject to two important limitations. First, the input source only includes the visual information from images (e.g., I-SOD (Image SOD) [fan2018salient, wang2021salient, borji2019salient, borji2015salient], CoSOD [fan2020taking, Fan2021Group, deng2021re], RGB-D SOD [zhou2021rgb, piao2019depth, liu2020learning, 20TNNLS_RgbdBench, fu2021siamese, fu2020jl], RGB-T SOD [tang2019rgbt, tu2019rgb], LFSOD [jiang2020light, piao2020dut], HRSOD [zeng2019towards, zhang2021looking], Remote Sensing SOD [zhang2020dense]) or videos (e.g., V-SOD (Video SOD) [SSAV, ji2021fsnet, wang2021semantic]), ignoring the auditory cues that are ubiquitous in dynamic scenes [tavakoli2019dave, Tsiami_2020_CVPR, van2008audiovisual]. Second, all the above-mentioned SOD tasks focus on 2D images/videos, which are regarded as perspective images with local FoVs (compared with the 360°×180° FoV), thus failing to capture the surrounding context and corresponding layout of immersive real-life daily scenes. However, these rich global geometrical cues are crucial for human attention modeling.

To this end, we envision that segmenting salient objects from panoramic videos with audio-visual data will benefit not only our research community but also commercial products. To facilitate the study of panoramic video salient object detection (PV-SOD), we collect ASOD60K, the first large-scale PV-SOD dataset providing professional annotations. (Collecting the six types of labels was costly and time-consuming; it took us about one year to set up this large-scale database.) ASOD60K has several distinctive features:

  • Hierarchical categories. All videos in the database are labeled in a hierarchical manner, i.e., with the super-class and sub-class. The two-level semantic categories provide a solid foundation for not only weakly supervised approaches but also fully supervised models.

  • Diverse annotations. For each video sequence/frame, we provide coarse-to-fine annotations, including the subjects’ head movement (HM) and eye fixations, bounding boxes, object-level masks, and instance-level labels, which can greatly benefit different vision tasks (e.g., scanpath prediction, fixation prediction, SOD and salient instance detection).

  • Attribute labels. Each sequence is annotated with specific attributes, e.g., geometrical distortion, occlusions, and motion blur. Combined with the performance of the evaluated models, these attributes (tab:Attributes & fig:AttExample) shed new light on the experimental analysis.

  • High quality. All video sequences are in high resolution (4K) to suit VR devices such as Head-Mounted Displays (HMDs). Moreover, cross-checking (i.e., more than three-fold) by multiple experts and volunteers was conducted to maintain reliability, accuracy, and consistency throughout the annotation process.

The aforementioned aspects together provide important support for studying human attention in panoramic videos. Further, we summarize the design rules that a balanced PV-SOD dataset should fulfill, which can serve as a reference for similar fields when collecting and labeling data.

To reveal the challenges of PV-SOD, we perform a set of empirical studies on the collected ASOD60K dataset and obtain three interesting observations. i) According to the overall and attribute-based performance of the tested models, this task is still far from being solved. ii) The eye fixations recorded with audio are relatively consistent across subjects, while the data recorded without audio exhibit large fluctuations between subjects. iii) A sparsely labeled database is more beneficial for image-based models but more challenging for video-based models. These findings clearly show the challenges of salient object detection in panoramic videos.

In a nutshell, our main contributions are twofold: i) We introduce ASOD60K, the first large-scale dataset for PV-SOD, which consists of 62,455 high-resolution (4K) video frames from 67 carefully selected 360° panoramic video sequences. 10,465 key frames are annotated with rich labels, namely, super-class, sub-class, attributes, HM data, eye fixations, bounding boxes, object masks, and instance masks. ii) Based on the established ASOD60K, we present a comprehensive study on 11 representative models, which serves as the first standard leaderboard. Based on the evaluation results, we present insightful conclusions that may inspire novel ideas toward new research directions.

Dataset Task Year Pub. #Img #GT Resolution
ECSSD [ECSSD] I-SOD 2013 CVPR 1,000 1,000 139~400
DUT-OMRON [DUTO] I-SOD 2013 CVPR 5,168 5,168 139~401
SegTrack V2 [SegV2] V-SOD 2013 ICCV 1,065 1,065 212~640
PASCAL-S [PASCALS] I-SOD 2014 CVPR 850 850 139~500
FBMS [FBMS] V-SOD 2014 TPAMI 13,860 720 253~960
HKU-IS [HKUIS] I-SOD 2015 CVPR 4,447 4,447 100~500
MCL [kim2015spatiotemporal] V-SOD 2015 TIP 3,689 463 270~480
ViSal [ViSal] V-SOD 2015 TIP 963 193 240~512
DAVIS2016 [DAVIS] V-SOD 2016 CVPR 3,455 3,455 900~1,920
DUTS [DUTS] I-SOD 2017 CVPR 15,572 15,572 100~500
ILSO [li2017instance] I-SOD 2017 CVPR 1,000 1,000 142~400
UVSD [liu2017saliency] V-SOD 2017 TCSVT 3,262 3,262 240~877
SOC [fan2018salient] I-SOD 2018 ECCV 6,000 6,000 161~849
VOS [VOS] V-SOD 2018 TIP 116,103 7,467 312~800
DAVSOD [SSAV] V-SOD 2019 CVPR 23,938 23,938 360~640
F-360iSOD [Yi2020fSOD] PI-SOD 2020 ICIP 107 107 1,024~2,048
360-SOD [li2020distortion] PI-SOD 2020 JSTSP 500 500 512~1,024
360-SSOD [ma2020stage] PI-SOD 2020 TVCG 1,105 1,105 546~1,024
ASOD60K (OURS) PV-SOD 2021 CVMJ 62,455 10,465 1,920~3,840
Table 1: Summary of widely used salient object detection (SOD) datasets and the proposed panoramic video SOD (PV-SOD) dataset. #Img: the number of images/frames. #GT: the number of ground-truth masks. Pub. = Publication. Resolution: the resolution range of each dataset; for the panoramic datasets, it refers to equirectangular (ER) images.

2 Related Work

Human attention modeling in panoramic videos can be roughly split into four levels: HM prediction [pvshm], eye fixation/gaze prediction [wu2020spherical, xu2018gaze], salient object detection (SOD), and salient instance detection. Our work mainly focuses on the object-level task, leaving other tasks to our future studies. In this section, we only briefly discuss some closely related works, i.e., datasets, models, and techniques for 360° image processing.

2.1 Datasets

Image Salient Object Detection (I-SOD). The image-based SOD task has gained significant attention in the past few years. The remarkable progress of I-SOD is highly related to the development of several representative datasets [ECSSD, DUTO, PASCALS, HKUIS, DUTS, li2017instance, fan2018salient]. ECSSD [ECSSD], DUT-OMRON [DUTO], PASCAL-S [PASCALS], and HKU-IS [HKUIS] are four early, small-scale datasets with limited image resolution. To increase the amount of training data, DUTS [DUTS] was introduced and has become one of the most popular benchmarks. Furthermore, ILSO [li2017instance] and SOC [fan2018salient] were recently proposed with the goal of enabling not only object-level but also instance-level SOD tasks. We refer the reader to the survey paper by [wang2021salient] for a thorough review.

Video Salient Object Detection (V-SOD). In addition to I-SOD datasets, several V-SOD benchmarks have also been introduced. tab:related works summarizes their details. As can be seen, DAVSOD [SSAV] is the largest dataset and provides comprehensive annotations for the V-SOD task.

Panoramic Image Salient Object Detection (PI-SOD). There have been three attempts to establish datasets for 360° panoramic image-based SOD [li2020distortion, ma2020stage, Yi2020fSOD], all of which provide pixel-wise object-level ground truths (GTs) at similar resolutions (see tab:related works). 360-SOD [li2020distortion] is the pioneering work for SOD in 360° scenes. It consists of 500 equirectangular (ER) images (the most widely used planar representation of a 360° image, incurring no loss of spatial information) covering both indoor and outdoor scenes with object-level annotations. 360-SSOD [ma2020stage] is a larger public PI-SOD dataset with 1,105 semantically balanced panoramic (ER) images. F-360iSOD [Yi2020fSOD] is so far the only 360° image SOD dataset that provides pixel-wise instance-level GTs.

Other Datasets. Other closely related datasets are designed to capture either human HM (360VHMD [corbillon2017360] and PVS-HMEM [pvshm]) or eye fixations (i.e., saliency detection) in 360°/panoramic images (Salient360! [rai2017dataset]) or 360° videos (e.g., Wild-360 [cheng2018cube], VR-Scene [xu2018gaze], 360-Saliency [zhang2018saliency]). As far as we know, the work of Chao et al. [chao2020audio] is the only 360° audio-visual saliency dataset; it contains 12 videos with HM-based annotations under mute, mono, and ambisonics modalities. Finally, datasets such as [yang2018object, 360indoorWACV2020, 360sports] focus on bounding-box-level object detection in 360° content.

As summarized in tab:related works, no existing work studies the segmentation of salient objects in free-viewing panoramic videos with audio. The closest works are audio-visual saliency detection in 2D (planar) videos [Tsiami_2020_CVPR, jain2020avinet] and salient object detection in omnidirectional images [li2020distortion, Yi2020fSOD, ma2020stage] without audio. We refer readers to the survey papers on 360° data processing [xu2020state, fan2019survey] for more details.

2.2 SOD Models

Since no SOD approaches currently exist for the PV-SOD task, we present the SOD methodologies for I-SOD, V-SOD, and PI-SOD.

Algorithms for I-SOD. In the past few years, convolutional neural networks (CNNs) have been the most commonly used architecture in state-of-the-art (SOTA) I-SOD models [ASNet, AFNet, BASNet, qin2021boundary, zhuge2021salient, CPD, PoolNet, EGNet, SCRN, F3Net, GCPANet, ITSD, MINet, SOD100K], which are trained on large-scale datasets (e.g., DUTS [DUTS]) in a fully supervised manner. ASNet [ASNet] uses eye fixations to aid salient object localization, while models such as AFNet [AFNet], BASNet [qin2021boundary], PoolNet [PoolNet], EGNet [EGNet], SCRN [SCRN], and LDF [CVPR2020LDF] emphasize object appearance (boundaries or skeletons) as guidance for the accurate segmentation of salient objects. With comparable accuracy, methods such as CPD [CPD], ITSD [ITSD], and CSNet [SOD100K] also achieve improved inference speed.

Models for V-SOD. The recent development of large-scale video datasets such as DAVIS [DAVIS] and DAVSOD [SSAV] has enabled deep learning-based V-SOD. Several works [MGA, li2018flow, RCRNet] have achieved success by introducing optical flow cues into the network. There is, however, the long-standing and often ignored issue of saliency shift, which was first highlighted and modeled in SSAV [SSAV]. According to the open benchmark results, COSNet [COSNet], RCRNet [RCRNet], and PCSA [gu2020PCSA] obtain the best performance on the V-SOD task.


Figure 2: Examples of challenging attributes (see tab:Attributes) on ER images from our ASOD60K, with instance-level GT and fixations overlaid as annotation guidance. The frames shown are randomly sampled from a given video. Best viewed in color. More examples are shown in fig:show_part1.

Methods for PI-SOD. To the best of our knowledge, DDS [li2020distortion], stage-wise SOD [ma2020stage] and FANet [huang2020fanet] are so far the only models exclusively designed for PI-SOD. They all emphasize the importance of mitigating the geometrical distortion brought by ER projection via specific modules.

2.3 360° Image Processing Approaches

The vast majority of current 360° image processing techniques are CNN-based and are proposed for either ER images or spherical representations (e.g., spheres and icosahedrons).

CNNs on ER Images. ER projection is the most widely used approach for the 2D representation of a 360° image. It applies a uniform grid-based sampling on the spherical surface, which inevitably over-samples the spherical regions near the poles. Therefore, salient objects in ER images may suffer geometrical distortions to varying extents, depending on the distance between their locations and the equator of the ER image (fig:AttExample). SphereNet [coors2018spherenet], which employs location-adaptive kernels, was proposed for the classification and detection of objects in ER images. Similar location-dependent convolutional kernels are also applied in [su2017learning, su2019kernel].
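To make the over-sampling concrete, the following minimal sketch (our own illustration, not taken from any cited codebase) maps ER pixel coordinates to longitude/latitude and reports the horizontal stretch factor 1/cos(latitude); the 3840×1920 frame size matches the 4K ER frames of ASOD60K, while the sampled rows are arbitrary.

```python
import numpy as np

def er_pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Longitude spans [-pi, pi) from left to right; latitude spans [pi/2, -pi/2]
    from top to bottom, so the equator sits at the vertical center of the image.
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return lon, lat

def horizontal_stretch(lat):
    """Approximate horizontal over-sampling factor at a given latitude.

    Every ER row has the same number of pixels, but the corresponding circle on
    the sphere shrinks by cos(lat), so rows near the poles are heavily
    over-sampled and objects there appear stretched.
    """
    return 1.0 / np.cos(lat)

if __name__ == "__main__":
    H, W = 1920, 3840  # a 4K ER frame, as in ASOD60K
    for v in (H // 2, H // 4, H // 10):  # equator, mid-latitude, near a pole
        _, lat = er_pixel_to_sphere(W // 2, v, W, H)
        print(f"row {v}: latitude {np.degrees(lat):6.1f} deg, "
              f"stretch x{horizontal_stretch(lat):.2f}")
```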

CNNs on Spherical Representations. As there is no perfect 2D representation of a 360° image, SO(3)-based spherical CNNs [cohen2018spherical, esteves2018learning] were proposed to directly generalize convolutions to the sphere. However, these reparameterized 3D convolutional kernels hinder the use of classical backbones (e.g., ResNet [he2016deep] or VGGNet [Simonyan15]) pre-trained on large-scale datasets (e.g., ImageNet [deng2009imagenet]), which play an essential role in CNN-based SOD models (see sec:SOD_method). Recent studies [jiang2019spherical, cohen2019gauge, zhang2019orientation] generalize convolutions to subdivided icosahedral faces, which exhibit much less geometrical distortion than ER images [eder2020tangent]. In addition, the tangent image representation [eder2020tangent] was proposed to enable semantic segmentation of 4K-resolution 360° images.

3 Proposed ASOD60K Dataset

We elaborate on our ASOD60K in terms of stimuli collection, subjective experiments, the annotation pipeline, and dataset statistics. Our goal is to introduce a new and challenging dataset to the PV-SOD community.

3.1 Stimuli Collection

The stimuli of ASOD60K were searched on YouTube with different keywords (e.g., 360°/panoramic/omnidirectional video, spatial audio, ambisonics [morgadoNIPS18]). As a result, our collected stimuli cover various real-world scenes (e.g., indoor/outdoor scenes), different occasions (e.g., sports, travel, concerts, interviews, dramas), different motion patterns (e.g., static/moving camera), and diverse object categories (e.g., humans, instruments, animals). They possess a wide range of the major challenges found in 360° content, such as objects scattered far from the equator that suffer serious geometrical distortion in ER projections, providing us with a solid foundation to build a representative benchmark. In this way, we initially gathered about 1,000 candidate videos and discarded the noisy ones, e.g., videos with a shaking camera, dark-screen transitions, without key content, displaying too many objects, or of low quality. In line with the video dataset creation rules in [SSAV, wang2018revisiting], we then carefully selected 67 high-quality video sequences with a total of 62,455 frames, recorded with 62,455×40 HM and eye fixation annotations. Similar to [corbillon2017360], the frame rate of the collected videos is not fixed (varying from 24 fps to 60 fps), which did not influence the results of the following subjective experiments, since human attention is mainly event-related rather than frame-rate-dependent. Note that we manually trimmed the videos into short clips (29.6 s on average) to avoid fatigue during the collection of human eye fixations. As a result, the final duration is 1,983 s in total.

3.2 Subjective Experimentation

Equipment. All video clips were displayed using an HTC Vive HMD embedded with a Tobii eye tracker with a 120 Hz sampling rate to collect eye fixations.

Observers. We recruited 40 participants (8 females and 32 males), aged from 18 to 34, who reported normal or corrected-to-normal visual and auditory acuity. Twenty participants were randomly selected to watch the videos with mono sound, while the others watched the videos without sound. Note that the two groups have the same gender and age distributions. Hence, each video under each audio modality (i.e., with or without sound) was viewed by 20 participants, and each participant viewed each video only once. All viewing sessions were task-free.

Settings. All participants were seated in a swivel chair, wearing an HMD with headphones, and were asked to explore the panoramic videos without any specific intention. During the experiments, the starting position was fixed to the center of the panorama at the beginning of every video display. To avoid motion sickness and eye fatigue, we inserted a short rest with a five-second gray screen between two successive videos and a long break of 20 minutes after every 20 videos. We calibrated the system for each participant at the beginning and end of every long break.
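The pipeline ultimately needs per-frame fixations, but how the 120 Hz gaze stream is aligned with the 24-60 fps clips is not spelled out here; the sketch below is one plausible alignment under that assumption, with all function and parameter names being our own.

```python
import numpy as np

def gaze_samples_to_frames(sample_times, gaze_points, fps, num_frames):
    """Group 120 Hz gaze samples into per-frame fixation lists.

    sample_times : (N,) array of timestamps in seconds since clip start.
    gaze_points  : (N, 2) array of (longitude, latitude) gaze positions.
    fps          : frame rate of this particular clip (24-60 fps in ASOD60K).
    num_frames   : number of frames in the clip.
    """
    frame_ids = np.clip((np.asarray(sample_times) * fps).astype(int),
                        0, num_frames - 1)
    per_frame = [[] for _ in range(num_frames)]
    for fid, point in zip(frame_ids, gaze_points):
        per_frame[fid].append(point)
    return per_frame
```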

3.3 Professional Annotation

Super-/Sub-Class Labeling. As shown in fig:categories, our ASOD60K contains 67 videos representing three super-categories of audio-induced scenes: speaking (e.g., monologue, conversation), music (e.g., human singing, instrument playing), and miscellanea (e.g., the sound of vehicle engines and horns on the street, crowd noise in the open air). Each video is named according to its audio-visual content.


Figure 3: Statistics of the proposed ASOD60K. (a) Super-/sub-category information. (b) Instance density of each sub-class. (c) Main components of ASOD60K scenes. Best viewed in color.

Head Movement and Eye Fixations. The recent video object segmentation dataset DAVIS [DAVIS] contains only one or a few foreground objects per frame, so the salient objects can be easily defined. In contrast, other recent video SOD datasets, such as VOS [VOS] and DAVSOD [SSAV], collect video stimuli representing more challenging scenes with multiple salient objects. In such cases, fixation-based annotations (e.g., saliency shift [SSAV]) are used as guidance to alleviate the ambiguity of defining salient objects. Based on the subjects’ per-frame HM and eye fixations obtained from the subjective experiments (sec:subjectExp) with audio-visual stimuli, we produced the final annotations.

Att. Description
MO Multiple Objects. Three or more objects occur simultaneously.
OC Occlusions. Object is partially occluded.
LS Low Space. Object occupies less than 0.5% of the image area.
MB Motion Blur. Moving object with fuzzy boundaries.
OV Out-of-View. Object is cut in half in the ER projection.
GD Geometrical Distortion. Object is distorted by the ER projection.
CS Competing Sounds. Sounding objects compete for attention.
Table 2: Attribute descriptions (see examples in fig:AttExample).

Bounding Box Annotations. Generally, there are two types of labels in 360° object detection, i.e., bounding FoVs [zhao2020spherical, 360indoorWACV2020, yang2018object] and bounding boxes [yang2018object]. As the vast majority of our collected video frames contain multiple salient objects near the 360° camera, bounding FoVs may introduce serious annotation ambiguities due to the divergence of projection-angle choices between multiple annotators [yang2018object], and are thus not suitable for annotating salient objects in the scenes of our ASOD60K. Following [yang2018object], we directly annotated the salient objects with bounding boxes in ER images.

Our annotation protocol is threefold: i) We uniformly extracted 10,465 key frames from the total of 62,455 frames with a sampling rate of 1/6. ii) We Gaussian-smoothed the eye fixations corresponding to each key frame to obtain per-frame fixation maps. iii) We adopted the widely used CVAT toolbox as our annotation platform and recruited an expert to manually annotate the bounding box of each salient object in each ER key frame, under the guidance of the corresponding overlaid fixations (see fig:AttExample). Finally, we obtained a total of 19,904 salient objects labeled with instance-level bounding boxes from the 10,465 key frames. To the best of our knowledge, this is the first attempt to annotate salient objects with the guidance of audio-visual attention data.
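As a rough illustration of step ii), the sketch below rasterizes the fixations of all subjects for one ER key frame and applies a Gaussian blur to obtain the overlay that guides the bounding-box annotation; the kernel width is an assumed value, not the one used for ASOD60K.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, height, width, sigma_deg=2.0):
    """Accumulate (longitude, latitude) fixations into a smoothed ER map.

    fixations : iterable of (lon, lat) in degrees, lon in [-180, 180),
                lat in [-90, 90].
    sigma_deg : Gaussian kernel width in degrees (an assumed value).
    """
    hits = np.zeros((height, width), dtype=np.float32)
    for lon, lat in fixations:
        u = int((lon + 180.0) / 360.0 * width) % width
        v = min(int((90.0 - lat) / 180.0 * height), height - 1)
        hits[v, u] += 1.0
    sigma_px = sigma_deg / 360.0 * width  # degrees -> ER pixels (horizontal)
    # Longitude wraps around, so blur with 'wrap' along the horizontal axis.
    smooth = gaussian_filter(hits, sigma=sigma_px, mode=("constant", "wrap"))
    return smooth / smooth.max() if smooth.max() > 0 else smooth
```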

Object-Level Annotations. With the coarse annotations (i.e., bounding boxes) in hand, we needed to further label the data in a fine-grained manner. Thus, three experts were recruited to manually annotate the salient objects in the 10K key frames. To ensure satisfactory annotations, they were first required to pass a training session (which took the experts about 10 hours), during which they had to correctly segment (by finely tracing object boundaries rather than drawing rough polygons) all the salient objects in a given video (previously shown to three senior researchers, with GTs acquired by consistent opinions), with the guidance of the overlaid per-frame bounding boxes. This was followed by a session during which they annotated all the defined salient objects in the remaining panoramic videos. Finally, a thorough inspection was conducted by the same three senior researchers to ensure the accuracy of the annotations. Following the same pipeline as [deng2021re], we obtained 10,465 object-level annotations.

Instance-Level Annotations. Another three well-trained experts were then recruited to further draw pixel-wise instance-level masks by carefully tracing the boundaries (rather than rough polygons) of the defined salient objects in each of the 10,465 key frames. To ensure high-quality annotations, all the masks were subjected to a quality-check procedure carried out by the same three senior researchers. As a result, we acquired 19,904 instance-level masks representing all the salient objects in the 10,465 key frames (the number of instances in each video is shown in fig:categories (b)). Further, to refine the annotation quality, we converted all the instance-level masks into object-level binary masks. The bounding boxes were also refined according to the object-level masks. Please refer to fig:show_part1 for annotation examples.


Figure 4: Average object-level GT maps of 360-SOD [li2020distortion], 360-SSOD [ma2020stage] and our ASOD60K.


Figure 5: The average object-level GT maps of our ASOD60K at super-class level.


Figure 6: Attributes statistics. (a)/(b) represent the correlation and frequency of ASOD60K’s attributes, respectively.

Attribute Labels. Following two large-scale video object segmentation datasets [DAVIS, SSAV], we also provide seven attributes for the proposed ASOD60K, including multiple objects (MO), occlusions (OC), low space (LS), out-of-view (OV), motion blur (MB), geometrical distortion (GD), and competing sounds (CS) (Table 2). It is worth mentioning that OV and GD (fig:AttExample) are geometrical attributes exclusive to ER images, and CS is a novel attribute attached to audio-visual stimuli (please see the per-video attribute statistics in tab:attributes_details).

Sequences and their attribute counts:

Speaking (35): French (5), WaitingRoom (5), Cooking (5), AudiIntro (3), Ellen (1), GroveAction (5), Warehouse (2), GroveConvo (5), Surfing (3), Passageway (4), RuralDriving (4), Lawn (2), AudiAd (6), ScenePlay (5), UrbanDriving (3), Interview (4), Telephone (5), Walking (3), Bridge (4), Breakfast (5), Debate (3), BadmintonConvo (7), Director (6), ChineseAd (6), Exhibition (1), PianoConvo (3), FilmingSite (5), Brothers (6), Rap (4), Spanish (5), Questions (4), PianoMono (5), Snowfield (3), Melodrama (5), Gymnasium (5).

Music (16): Guitar (4), Subway (5), Jazz (5), Bass (5), Canon (4), MICOSinging (4), Clarinet (5), Trumpet (3), PianoSaxophone (5), Chorus (4), Studio (5), Church (4), Duet (4), Blues (4), Violins (5), SingingDancing (6).

Miscellanea (16): Beach (4), BadmintonGym (4), InVehicle (4), Japanese (4), Tennis (5), Diesel (4), Park (4), Lion (2), Carriage (6), Platform (5), Dog (4), RacingCar (4), Train (4), Football (4), ParkingLot (6), Skiing (6).

Per-attribute totals: MO = 56, OC = 52, LS = 59, MB = 40, OV = 8, GD = 39, CS = 35 (289 attribute labels in total).

Table 3: Attribute details. General attributes: MO = multiple objects, OC = occlusions, LS = low space, MB = motion blur. 360° geometrical attributes: OV = out-of-view, GD = geometrical distortion. Spatial audio attribute: CS = competing sounds. The number in parentheses after each sequence is the number of attributes assigned to it; the numbers after the super-class names are the numbers of sequences per super-class.


Figure 7: Passed and rejected examples of annotation quality control.

3.4 Dataset Features and Statistics

Attribute Distribution. The attributes summarize natural daily scenes viewed omnidirectionally and can also inspire model development for PI-SOD and PV-SOD. As shown in fig:attributes_corr, the seven proposed attributes are closely related to each other, representing common yet challenging scenarios.

Equator Center Bias. fig:GTComparison and fig:center_bias_class visualize the global and super-class-level center bias [fan2018salient, fan2021salient] of ASOD60K. Compared to 360-SOD and 360-SSOD, our dataset shows a stronger center bias, as it is the only one with salient objects annotated according to participants’ eye fixations, with the starting point set to the center of each video display during the subjective experiments. It has been widely shown that photographers tend to capture the main content of a 360° video at the equator center, and that users also usually pay more attention to this area while watching [xu2020state, salient360vid, fan2019survey]. Our ASOD60K hence better reflects real-world viewing behaviors than 360-SOD [li2020distortion] and 360-SSOD [ma2020stage].

Instance Size. Following [fan2018salient], we compute the normalized instance sizes of our ASOD60K. The size distribution ranges from 0.03% to 23.00%, covering very small objects.
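The normalization itself is simple: the hypothetical helper below measures an instance as the fraction of foreground pixels in its ER mask (the quantity whose distribution spans 0.03%-23.00% here) and flags the LS attribute using the 0.5% threshold from tab:Attributes.

```python
import numpy as np

def normalized_instance_size(instance_mask):
    """Fraction of the ER frame covered by one instance mask (in [0, 1])."""
    mask = np.asarray(instance_mask, dtype=bool)
    return mask.sum() / mask.size

def is_low_space(instance_mask, threshold=0.005):
    """LS attribute check: object occupies less than 0.5% of the image area."""
    return normalized_instance_size(instance_mask) < threshold
```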

Quality Control. High-quality annotation is one of the most important requirements for training learning-based models. As illustrated in fig:Pass&Reject2, we carefully conducted three-fold cross-checking to ensure the annotation quality.

4 Empirical Studies

4.1 Settings

Dataset Splits. All 67 videos are split into separate training and test sets with a random selection strategy, at a ratio of about 6:4. We thus obtain a split of 40 training and 27 test videos (5,796/4,669 key frames, respectively), with corresponding per-pixel instance-/object-level GTs. The test set is further divided into test0/test1/test2 with 6/6/15 videos, respectively, according to the super-class labels (i.e., Miscellanea/Music/Speaking).
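For reproducibility, a split of this kind can be drawn as in the sketch below; this is illustrative only, since the official 40/27 video split ships with the dataset, and the test0/test1/test2 grouping simply follows the super-class labels.

```python
import random

def split_videos(video_names, train_ratio=0.6, seed=0):
    """Randomly split video names into train/test sets at roughly 6:4."""
    rng = random.Random(seed)
    names = sorted(video_names)  # sort first so the shuffle is reproducible
    rng.shuffle(names)
    cut = round(len(names) * train_ratio)  # 67 videos -> 40 train / 27 test
    return names[:cut], names[cut:]

def group_test_by_superclass(test_videos, superclass_of):
    """Regroup test videos into test0/test1/test2 by super-class label."""
    order = ["Miscellanea", "Music", "Speaking"]
    return {f"test{i}": [v for v in test_videos if superclass_of[v] == cls]
            for i, cls in enumerate(order)}
```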

Metrics. We apply three widely used SOD metrics to quantitatively compare the SOTA I-SOD/V-SOD models: the structural measure (S-Measure, $S_{\alpha}$) [Fan2017Smeasure, fan2021structure], the maximum enhanced-alignment measure (E-Measure, $E_{\xi}$) [fan2021cognitive, fan2018enhanced], and the mean absolute error (MAE) [MAE]. MAE [MAE] focuses on the local (per-pixel) match between the ground truth and the prediction, while the S-Measure ($S_{\alpha}$) [Fan2017Smeasure] attends to object structure similarities. The E-Measure ($E_{\xi}$) [fan2018enhanced] considers both local and global information.

MAE computes the mean absolute error between the ground truth $G$ and a normalized predicted saliency map $S$, i.e.,

$$\mathcal{M} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| G(i,j) - S(i,j) \right|, \qquad (1)$$

where $H$ and $W$ denote the height and width, respectively.

S-Measure evaluates the structure similarities between salient objects in GT foreground maps and predicted saliency maps:

$$S_{\alpha} = \alpha \times S_{o} + (1 - \alpha) \times S_{r}, \qquad (2)$$

where $S_{o}$ and $S_{r}$ denote the object- and region-based structure similarities, respectively. $\alpha$ is set to 0.5 so that equal weights are assigned to the object-level and region-level assessments [Fan2017Smeasure].

E-Measure is a cognitive-vision-inspired metric that evaluates both the local and global similarities between two binary maps. Specifically, it is defined as:

$$E_{\xi} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \phi(i,j), \qquad (3)$$

where $\phi$ represents the enhanced alignment matrix [fan2018enhanced].
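As a minimal reference, the sketch below implements Eq. (1) directly and the top-level combinations of Eqs. (2) and (3); the object-/region-based terms $S_{o}$ and $S_{r}$ and the alignment matrix $\phi$ are taken as given, since their full definitions live in [Fan2017Smeasure] and [fan2018enhanced].

```python
import numpy as np

def mae(gt, pred):
    """Eq. (1): mean absolute error between a binary GT map and a
    saliency prediction normalized to [0, 1]."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    assert gt.shape == pred.shape
    return np.abs(gt - pred).mean()

def s_measure(s_object, s_region, alpha=0.5):
    """Eq. (2): weighted sum of the object- and region-based structure
    similarities; alpha = 0.5 weighs both terms equally."""
    return alpha * s_object + (1.0 - alpha) * s_region

def e_measure(phi):
    """Eq. (3): mean of the enhanced alignment matrix over all pixels."""
    return np.asarray(phi, dtype=np.float64).mean()
```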

Type Publication Methods Miscellanea (Test0) Music (Test1) Speaking (Test2)
I-SOD CVPR’19 CPD [CPD] 0.654 0.584 0.035 0.608 0.823 0.018 0.588 0.756 0.026
ICCV’19 SCRN [SCRN] 0.665 0.564 0.046 0.683 0.841 0.023 0.636 0.739 0.034
AAAI’20 F3Net [F3Net] 0.655 0.557 0.040 0.662 0.801 0.021 0.626 0.716 0.027
CVPR’20 MINet [MINet] 0.650 0.557 0.050 0.670 0.789 0.020 0.590 0.680 0.053
CVPR’20 LDF [CVPR2020LDF] 0.663 0.557 0.044 0.671 0.828 0.023 0.625 0.761 0.037
ECCV’20 CSF [SOD100K] 0.652 0.575 0.033 0.665 0.833 0.018 0.636 0.791 0.026
ECCV’20 GateNet [GateNet] 0.677 0.596 0.044 0.673 0.852 0.018 0.633 0.739 0.034
V-SOD CVPR’19 COSNet [COSNet] 0.610 0.535 0.031 0.577 0.825 0.016 0.572 0.722 0.023
ICCV’19 RCRNet [RCRNet] 0.661 0.576 0.034 0.695 0.839 0.019 0.632 0.775 0.030
AAAI’20 PCSA [gu2020PCSA] 0.602 0.549 0.034 0.655 0.764 0.021 0.572 0.679 0.026
PI-SOD SPL’20 FANet [huang2020fanet] 0.610 0.513 0.030 0.646 0.814 0.018 0.566 0.696 0.027
Table 4: Performance comparison of 7/3 state-of-the-art conventional I-SOD/V-SOD methods and one PI-SOD method [huang2020fanet] over ASOD60K. For each test set, the three columns report $S_{\alpha}$, max $E_{\xi}$, and MAE, respectively; larger is better for $S_{\alpha}$ and max $E_{\xi}$, while smaller is better for MAE. The best result in each column is bolded.
Attr. Metrics I-SOD V-SOD PI-SOD
CPD [CPD] SCRN [SCRN] F3Net [F3Net] MINet [MINet] LDF [CVPR2020LDF] CSF [SOD100K] GateNet [GateNet] COSNet [COSNet] RCRNet [RCRNet] PCSA [gu2020PCSA] FANet [huang2020fanet]
MO 0.610 0.657 0.644 0.624 0.648 0.649 0.653 0.588 0.661 0.606 0.605
0.741 0.740 0.702 0.691 0.742 0.752 0.733 0.722 0.746 0.681 0.695
0.027 0.034 0.030 0.045 0.033 0.027 0.034 0.024 0.029 0.027 0.025
OC 0.606 0.655 0.641 0.619 0.645 0.645 0.650 0.577 0.652 0.599 0.593
0.772 0.768 0.725 0.699 0.771 0.780 0.755 0.744 0.763 0.704 0.720
0.023 0.029 0.026 0.043 0.028 0.023 0.030 0.020 0.025 0.024 0.022
LS 0.605 0.649 0.639 0.618 0.637 0.644 0.647 0.585 0.650 0.609 0.598
0.721 0.723 0.693 0.665 0.719 0.740 0.715 0.697 0.723 0.674 0.669
0.025 0.034 0.028 0.045 0.037 0.025 0.033 0.022 0.029 0.026 0.025
MB 0.622 0.651 0.630 0.620 0.646 0.638 0.645 0.582 0.642 0.586 0.587
0.728 0.718 0.692 0.675 0.717 0.749 0.701 0.709 0.734 0.702 0.688
0.021 0.029 0.027 0.047 0.029 0.021 0.030 0.019 0.024 0.022 0.020
OV 0.634 0.661 0.568 0.633 0.636 0.636 0.639 0.582 0.630 0.600 0.611
0.844 0.786 0.571 0.711 0.854 0.837 0.841 0.817 0.848 0.700 0.820
0.018 0.021 0.029 0.038 0.039 0.021 0.025 0.021 0.029 0.021 0.018
GD 0.630 0.662 0.639 0.633 0.659 0.646 0.658 0.588 0.651 0.578 0.599
0.680 0.690 0.641 0.666 0.672 0.695 0.684 0.662 0.695 0.655 0.657
0.037 0.042 0.040 0.045 0.043 0.035 0.042 0.032 0.037 0.036 0.034
CS 0.625 0.680 0.667 0.654 0.664 0.670 0.676 0.592 0.680 0.620 0.616
0.748 0.759 0.712 0.718 0.747 0.745 0.762 0.722 0.736 0.689 0.710
0.029 0.035 0.031 0.035 0.034 0.028 0.033 0.026 0.030 0.029 0.028
Table 5: Performance comparison of 7/3/1 state-of-the-art I-SOD/V-SOD/PI-SOD methods on each of the attributes. For each attribute, the three rows report $S_{\alpha}$, max $E_{\xi}$, and MAE, respectively.

Training Protocols. To provide a comprehensive benchmark, we collected the released code of 10 SOTA I-SOD/V-SOD methods and one PI-SOD model, and re-trained these models on the training set of ASOD60K together with the widely used I-SOD training set, DUTS-train [DUTS] (except for FANet [huang2020fanet], which is designed for ER images only). The selected baselines (CPD [CPD], SCRN [SCRN], F3Net [F3Net], MINet [MINet], LDF [CVPR2020LDF], CSF+Res2Net [SOD100K], GateNet [GateNet], RCRNet [RCRNet], COSNet [COSNet], and PCSA [gu2020PCSA]) meet the following criteria: i) classical architectures, ii) recently published and open-sourced, and iii) SOTA performance on existing I-SOD/V-SOD benchmarks. Note that all baselines were trained with their recommended parameter settings.

5 Discussion

From the benchmark results, we observe that the I-SOD models generally obtain performance comparable to their V-SOD counterparts. One possible reason is that, since the annotations of our ASOD60K are based only on key frames, the relatively sparse spatiotemporal information may prevent the V-SOD models from reaching their full performance. In contrast, the visual cues are easily learned by the I-SOD models from such sparsely labeled data.

Overall Performance. From the evaluation in tab:QuantityComparison, we observe that, in most cases, the I-SOD methods (e.g., GateNet and CSF) achieve better performance than the V-SOD (e.g., COSNet, RCRNet, and PCSA) and PI-SOD models. For specific scenes (e.g., speaking), however, SCRN obtains performance very competitive with GateNet, while performing worse than CSF. The E-Measure results are shown in fig:curves_class.


Figure 8: E-Measure (E-M) curves of all baselines upon our ASOD60K.


Figure 9: Attribute-based E-Measure (E-M) curves of all baselines upon our ASOD60K.

Attribute Performance. To provide deeper insights into the challenging cases, we report the performance of all 11 baselines on our seven attributes. Detailed attribute-based E-Measure curves are shown in fig:curves_attr. As shown in tab:QuantityComparisonAttr, the average scores among the models for the different attributes are: 0.631 (MO), 0.626 (OC), 0.626 (LS), 0.623 (MB), 0.621 (OV), 0.631 (GD), and 0.649 (CS). Out-of-view (OV) is the most challenging attribute, as the objects usually appear in the corners of the ER images. Besides, the scores on all attributes are less than 0.65, demonstrating the strong challenge posed by our ASOD60K and leaving large room for future improvement.


Figure 10: Eye fixation distributions of all participants watching PianoConvo sequence without (a)/with (c) audio. Corresponding fixations are overlaid in (b) and (d), respectively.

General Attributes. As shown in Table 3, every collected video has at least one general attribute (i.e., multiple objects (MO), occlusions (OC), motion blur (MB), or low space (LS)), indicating that our ASOD60K contains the main challenges faced in many computer vision fields, such as object detection and video object segmentation. It is worth mentioning that, as a 360° image captures a wide field-of-view (FoV) with a range of 360°×180°, salient objects far from the panoramic camera may appear extremely small, making LS an even more challenging situation within PV-SOD.


Figure 11: Visual comparison of visual and audio-visual attention. The objects with high saliency values are marked with white/red bounding boxes. Zoom in for details.

Eye Fixations With/Without Audio. fig:fixa_distri shows the longitude distributions of eye fixations for 20 participants watching videos without and with audio, respectively. We find that the eye fixation recordings with audio are highly consistent across subjects, while the recordings without audio are not. fig:AVAttenGT highlights this finding by vividly showing a significant divergence between human attention with and without the guidance of audio information (converging versus scattering, respectively). Both the quantitative (fig:fixa_distri) and qualitative (fig:AVAttenGT) results indicate that human attention depends highly on the co-guidance of audio-visual information.

Audio-Induced Attributes. Creating realistic VR experiences requires 360° videos to be captured together with their surrounding visual and audio stimuli. The audio cue plays a significant role in informing viewers about the location of sounding salient objects in the 360° environment [vasudevan2020semantic], providing an immersive multimedia experience. However, existing SOD methods ignore the audio cue and thus fail to detect the correct salient objects annotated based on audio-induced eye fixations (Figure 13). As shown in Figure 13, most of the baselines tend to detect all visually salient objects while ignoring the audio-visual attention shifts among the key frames. It is necessary for future works to model both the visual and audio cues for better performance on our ASOD60K.

360° Geometrical Attributes. A 360° image captures a scene covering omnidirectional (360°×180°) spatial information, thus including more comprehensive object structures than a 2D image, whose FoV has a limited range. For instance, as shown in Figure 12, the out-of-view (OV) object in the equirectangular (ER) image regains its intact shape when re-projected to a specific FoV on the sphere. However, OV objects in 2D images or videos permanently lose spatial information due to the limited viewing range of normal cameras. Geometrical distortion (GD) is the other important attribute of 360° images under ER projection (Figure 12 (c)), which is largely alleviated in a spherical FoV (Figure 12 (d)). As there is no perfect 2D (planar) representation of 360° images, a trade-off between the extent of geometrical distortion and the retention of spatial information will always exist. Future methods may take advantage of multiple projection methods for improved PV-SOD performance.
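As a side note, the re-projection from an ER image to a spherical FoV illustrated in Figure 12 can be sketched with an inverse gnomonic mapping; the nearest-neighbour version below is our own illustration (not the code used to render the figure), with the viewing direction and FoV as free parameters.

```python
import numpy as np

def er_to_perspective(er_img, lon0_deg, lat0_deg, fov_deg=90.0, out_size=512):
    """Sample a square perspective (spherical FoV) view from an ER image.

    Builds a ray per output pixel, rotates it towards the viewing direction
    (lon0, lat0), converts it back to longitude/latitude, and looks up the
    nearest ER pixel.
    """
    H, W = er_img.shape[:2]
    lon0, lat0 = np.radians(lon0_deg), np.radians(lat0_deg)

    # Rays of a pinhole camera looking along +z with the given horizontal FoV.
    f = (out_size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    xs, ys = np.meshgrid(np.arange(out_size) - out_size / 2.0,
                         np.arange(out_size) - out_size / 2.0)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the camera: pitch to the target latitude, then yaw to the longitude.
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(lat0), -np.sin(lat0)],
                      [0, np.sin(lat0), np.cos(lat0)]])
    rot_y = np.array([[np.cos(lon0), 0, np.sin(lon0)],
                      [0, 1, 0],
                      [-np.sin(lon0), 0, np.cos(lon0)]])
    rays = rays @ (rot_y @ rot_x).T

    lon = np.arctan2(rays[..., 0], rays[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(-rays[..., 1], -1.0, 1.0))  # [-pi/2, pi/2]

    u = ((lon + np.pi) / (2 * np.pi) * W).astype(int) % W
    v = np.clip(((np.pi / 2 - lat) / np.pi * H).astype(int), 0, H - 1)
    return er_img[v, u]
```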


Figure 12: An illustration of typical geometrical attributes. (a) Out-of-view (OV) in equirectangular (ER) image. (b) OV in spherical field-of-view (FoV). (c) Geometrical distortion (GD) in ER image. (d) GD in spherical FoV.

Small Objects. We define small objects, i.e., the LS attribute (tab:Attributes), as those that occupy an area smaller than 0.5% of the whole image. LS, one of the well-known challenges in image segmentation, is still not completely solved. As stated in sec:FeaStatis, due to the wide 360°×180° FoV, the smallest object in our ASOD60K occupies only 0.03% of the ER image, making the task even more challenging. As for the attribute-based performance, we observe that models under this situation achieve performance comparable to the others. However, considering that the applied metrics may be biased toward true negatives, the true-positive performance is likely worse than the reported scores suggest.

Novel Metric. In this benchmark, we only adopt metrics widely used for I-SOD. However, PV-SOD involves contextual (e.g., audio, spatial, and temporal) relationships between salient/non-salient objects, which are quite important for PV-SOD assessment. Thus, designing a more suitable evaluation metric for PV-SOD remains an interesting and open issue.

Future Directions. Currently, we only focus on the object-level task. However, the instance-level task is more difficult and may be suitable for many image-editing applications. In addition, as described in [fan2018salient], studying non-salient objects will provide rich context for reasoning about the salient objects in a scene. Finally, in this study, we only provide sparse annotations for the proposed dataset. However, dense annotations like those given in the DAVIS [DAVIS] dataset can provide more valuable information (e.g., for sequence-to-sequence modeling or audio-visual matching) for both traditional I-SOD and PV-SOD models.


Figure 13: An illustration of the unique audio-visual attribute, competing sounds. Img = image. GT = ground truth.

6 Conclusion

We have proposed ASOD60K, the first large-scale dataset for the PV-SOD task. Compared with the traditional SOD task, PV-SOD is more challenging. The hierarchical annotations enable ASOD60K to be easily extended to tasks at different levels, such as weakly supervised learning, multi-modality learning, and head movement/fixation prediction. In addition, we provide several empirical rules for creating high-quality datasets. We have further investigated 11 cutting-edge methods at both the overall and attribute levels. The obtained findings indicate that PV-SOD is far from being solved. We hope that our studies will facilitate SOD research towards panoramic videos and thus inspire more novel ideas for AR/VR applications.

Super-class/Sequence Metrics I-SOD V-SOD PI-SOD
CPD [CPD] SCRN [SCRN] F3Net [F3Net] MINet [MINet] LDF [CVPR2020LDF] CSF [SOD100K] GateNet [GateNet] COSNet [COSNet] RCRNet [RCRNet] PCSA [gu2020PCSA] FANet [huang2020fanet]
Sp./Debate 0.547 0.620 0.605 0.553 0.566 0.576 0.628 0.514 0.559 0.571 0.557
max 0.764 0.855 0.854 0.800 0.842 0.853 0.849 0.844 0.843 0.836 0.755
mean 0.600 0.752 0.818 0.592 0.829 0.802 0.768 0.410 0.809 0.605 0.702
0.014 0.016 0.014 0.012 0.012 0.013 0.015 0.009 0.013 0.012 0.015
Sp./BadmintonConvo 0.712 0.669 0.617 0.712 0.613 0.647 0.652 0.613 0.668 0.550 0.635
max 0.867 0.822 0.611 0.850 0.814 0.826 0.846 0.845 0.804 0.743 0.830
mean 0.749 0.663 0.555 0.815 0.659 0.662 0.737 0.603 0.773 0.434 0.704
0.027 0.034 0.032 0.034 0.068 0.033 0.046 0.032 0.039 0.033 0.030
Sp./Director 0.679 0.753 0.701 0.677 0.756 0.772 0.726 0.716 0.755 0.731 0.672
max 0.852 0.880 0.891 0.832 0.902 0.900 0.894 0.899 0.883 0.918 0.844
mean 0.735 0.729 0.852 0.773 0.849 0.810 0.681 0.744 0.774 0.718 0.768
0.031 0.038 0.032 0.037 0.029 0.028 0.034 0.029 0.037 0.031 0.030
Sp./ChineseAd 0.601 0.645 0.551 0.477 0.631 0.630 0.605 0.553 0.542 0.569 0.595
max 0.895 0.883 0.908 0.695 0.906 0.908 0.913 0.910 0.899 0.910 0.850
mean 0.483 0.524 0.527 0.391 0.576 0.635 0.544 0.641 0.632 0.470 0.502
0.009 0.009 0.042 0.069 0.027 0.011 0.012 0.015 0.028 0.010 0.007
Sp./Exhibition 0.487 0.469 0.480 0.469 0.428 0.492 0.486 0.487 0.473 0.510 0.475
max 0.614 0.658 0.605 0.597 0.689 0.770 0.735 0.811 0.576 0.773 0.560
mean 0.486 0.365 0.460 0.350 0.270 0.508 0.459 0.514 0.349 0.512 0.329
0.013 0.061 0.009 0.042 0.139 0.011 0.013 0.008 0.040 0.014 0.024
Sp./PianoConvo 0.577 0.652 0.579 0.607 0.639 0.636 0.586 0.603 0.718 0.509 0.632
max 0.871 0.851 0.847 0.875 0.880 0.858 0.885 0.888 0.880 0.882 0.863
mean 0.774 0.673 0.745 0.833 0.847 0.807 0.693 0.706 0.803 0.420 0.804
0.037 0.035 0.035 0.038 0.033 0.036 0.036 0.035 0.028 0.033 0.033
Sp./FilmingSite 0.578 0.633 0.603 0.610 0.637 0.645 0.636 0.578 0.640 0.633 0.522
max 0.727 0.762 0.681 0.766 0.799 0.766 0.787 0.708 0.738 0.805 0.793
mean 0.562 0.627 0.626 0.636 0.707 0.654 0.613 0.540 0.628 0.652 0.727
0.013 0.023 0.023 0.030 0.017 0.014 0.020 0.012 0.013 0.017 0.016
Sp./Brothers 0.673 0.686 0.638 0.655 0.652 0.697 0.685 0.662 0.664 0.666 0.623
max 0.778 0.806 0.747 0.772 0.746 0.816 0.792 0.784 0.813 0.820 0.729
mean 0.688 0.677 0.713 0.705 0.706 0.715 0.629 0.661 0.728 0.650 0.681
0.018 0.024 0.023 0.024 0.025 0.017 0.022 0.015 0.019 0.016 0.016
Sp./Rap 0.498 0.477 0.521 0.343 0.507 0.525 0.463 0.482 0.506 0.495 0.532
max 0.830 0.816 0.831 0.471 0.814 0.824 0.761 0.828 0.858 0.832 0.818
mean 0.530 0.387 0.548 0.260 0.484 0.678 0.400 0.513 0.590 0.566 0.733
0.006 0.087 0.021 0.371 0.025 0.012 0.095 0.007 0.020 0.009 0.009
Sp./Spanish 0.606 0.765 0.746 0.679 0.793 0.713 0.701 0.724 0.700 0.543 0.602
max 0.838 0.870 0.851 0.819 0.873 0.854 0.865 0.877 0.862 0.839 0.728
mean 0.651 0.807 0.835 0.727 0.865 0.797 0.822 0.784 0.800 0.486 0.503
0.038 0.030 0.032 0.035 0.025 0.036 0.040 0.032 0.037 0.042 0.035
Sp./Questions 0.505 0.640 0.740 0.563 0.605 0.691 0.671 0.576 0.676 0.595 0.549
max 0.925 0.921 0.901 0.926 0.920 0.915 0.922 0.935 0.907 0.909 0.757
mean 0.763 0.609 0.870 0.576 0.855 0.700 0.574 0.569 0.667 0.540 0.703
0.009 0.011 0.006 0.007 0.010 0.010 0.009 0.009 0.014 0.012 0.013
Sp./PianoMono 0.598 0.555 0.573 0.572 0.629 0.522 0.637 0.506 0.611 0.502 0.502
max 0.702 0.855 0.766 0.796 0.861 0.842 0.755 0.859 0.831 0.796 0.715
mean 0.682 0.736 0.688 0.739 0.746 0.758 0.696 0.500 0.736 0.397 0.633
0.057 0.054 0.044 0.056 0.054 0.048 0.060 0.039 0.047 0.037 0.043
Sp./Snowfield 0.729 0.811 0.778 0.800 0.819 0.779 0.823 0.601 0.794 0.580 0.578
max 0.739 0.816 0.741 0.784 0.819 0.763 0.836 0.864 0.779 0.812 0.775
mean 0.677 0.778 0.720 0.764 0.792 0.716 0.788 0.485 0.753 0.514 0.618
0.032 0.029 0.031 0.028 0.026 0.029 0.027 0.033 0.030 0.035 0.040
Sp./Melodrama 0.609 0.685 0.655 0.673 0.667 0.664 0.617 0.467 0.608 0.604 0.568
max 0.782 0.835 0.837 0.811 0.835 0.841 0.816 0.788 0.816 0.831 0.794
mean 0.699 0.744 0.732 0.773 0.784 0.717 0.710 0.296 0.730 0.521 0.770
0.108 0.084 0.068 0.083 0.079 0.095 0.100 0.076 0.100 0.079 0.098
Sp./Gymnasium 0.551 0.514 0.492 0.501 0.501 0.507 0.537 0.520 0.520 0.503 0.505
max 0.806 0.683 0.830 0.700 0.813 0.686 0.754 0.863 0.760 0.798 0.752
mean 0.584 0.545 0.461 0.593 0.469 0.512 0.487 0.584 0.518 0.468 0.642
0.007 0.013 0.021 0.011 0.020 0.016 0.020 0.007 0.017 0.027 0.010
Mu./Studio 0.741 0.770 0.753 0.788 0.758 0.739 0.724 0.637 0.778 0.756 0.760
max 0.878 0.889 0.898 0.904 0.892 0.899 0.891 0.904 0.893 0.901 0.895
mean 0.745 0.731 0.832 0.826 0.847 0.756 0.601 0.629 0.800 0.729 0.859
0.008 0.009 0.010 0.006 0.009 0.009 0.010 0.008 0.009 0.008 0.007
Table 6: Sequence performance comparison of 7/3/1 SOTA I-SOD/V-SOD/PI-SOD methods. Sp. = Speaking. Mu. = Music. For each sequence, the four rows report $S_{\alpha}$, max $E_{\xi}$, mean $E_{\xi}$, and MAE, respectively.
Super-class/Sequence Metrics I-SOD V-SOD PI-SOD
CPD [CPD] SCRN [SCRN] F3Net [F3Net] MINet [MINet] LDF [CVPR2020LDF] CSF [SOD100K] GateNet [GateNet] COSNet [COSNet] RCRNet [RCRNet] PCSA [gu2020PCSA] FANet [huang2020fanet]
Mu./Church 0.527 0.589 0.621 0.566 0.518 0.624 0.651 0.562 0.676 0.623 0.679
max 0.868 0.917 0.933 0.747 0.932 0.903 0.950 0.887 0.900 0.942 0.866
mean 0.451 0.575 0.731 0.576 0.715 0.601 0.657 0.487 0.635 0.577 0.774
0.007 0.012 0.012 0.021 0.018 0.011 0.008 0.006 0.008 0.018 0.007
Mu./Duet 0.662 0.704 0.698 0.653 0.751 0.648 0.730 0.553 0.731 0.538 0.643
max 0.879 0.891 0.892 0.876 0.898 0.903 0.883 0.901 0.889 0.873 0.876
mean 0.810 0.705 0.792 0.821 0.808 0.693 0.735 0.542 0.776 0.509 0.765
0.041 0.058 0.044 0.039 0.033 0.033 0.036 0.036 0.033 0.036 0.031
Mu./Blues 0.580 0.742 0.776 0.722 0.771 0.734 0.740 0.595 0.765 0.743 0.600
max 0.879 0.889 0.890 0.802 0.871 0.844 0.893 0.884 0.894 0.904 0.852
mean 0.598 0.688 0.830 0.698 0.789 0.766 0.640 0.473 0.834 0.715 0.612
0.016 0.015 0.013 0.027 0.015 0.015 0.015 0.015 0.013 0.017 0.014
Mu./Violins 0.589 0.668 0.537 0.692 0.661 0.631 0.656 0.578 0.669 0.671 0.604
max 0.852 0.877 0.578 0.861 0.883 0.845 0.872 0.851 0.868 0.856 0.790
mean 0.649 0.655 0.477 0.775 0.722 0.724 0.569 0.597 0.724 0.625 0.749
0.017 0.020 0.015 0.016 0.022 0.019 0.017 0.015 0.021 0.020 0.017
Mu./SingingDancing 0.506 0.601 0.582 0.560 0.561 0.594 0.568 0.521 0.569 0.558 0.557
max 0.804 0.820 0.815 0.820 0.673 0.782 0.813 0.791 0.810 0.812 0.759
mean 0.500 0.587 0.758 0.565 0.618 0.589 0.547 0.452 0.637 0.608 0.705
0.026 0.034 0.037 0.025 0.042 0.026 0.026 0.023 0.030 0.034 0.033
Mi./Dog 0.497 0.516 0.571 0.560 0.569 0.557 0.562 0.523 0.562 0.539 0.520
max 0.548 0.685 0.535 0.551 0.593 0.572 0.589 0.671 0.612 0.693 0.345
mean 0.460 0.457 0.493 0.511 0.424 0.532 0.515 0.494 0.548 0.544 0.325
0.013 0.015 0.014 0.007 0.020 0.004 0.009 0.005 0.004 0.005 0.003
Mi./RacingCar 0.770 0.769 0.763 0.770 0.772 0.771 0.791 0.760 0.772 0.755 0.762
max 0.428 0.438 0.365 0.439 0.459 0.458 0.448 0.449 0.453 0.426 0.285
mean 0.315 0.338 0.283 0.349 0.332 0.287 0.370 0.276 0.293 0.261 0.260
0.089 0.115 0.087 0.109 0.102 0.085 0.107 0.083 0.087 0.085 0.081
Mi./Train 0.604 0.616 0.614 0.607 0.629 0.594 0.663 0.501 0.524 0.515 0.489
max 0.618 0.700 0.531 0.676 0.589 0.665 0.780 0.556 0.671 0.526 0.432
mean 0.581 0.553 0.486 0.493 0.554 0.558 0.634 0.351 0.462 0.416 0.386
0.024 0.030 0.020 0.041 0.012 0.013 0.022 0.016 0.016 0.028 0.016
Mi./Football 0.653 0.696 0.618 0.656 0.668 0.658 0.676 0.648 0.710 0.635 0.556
max 0.833 0.856 0.790 0.830 0.835 0.846 0.820 0.811 0.866 0.810 0.742
mean 0.634 0.676 0.755 0.633 0.770 0.721 0.663 0.649 0.732 0.630 0.477
0.004 0.004 0.004 0.003 0.004 0.003 0.004 0.003 0.002 0.003 0.002
Mi./ParkingLot 0.635 0.627 0.624 0.564 0.640 0.562 0.625 0.548 0.624 0.501 0.627
max 0.666 0.645 0.649 0.614 0.650 0.646 0.663 0.659 0.661 0.622 0.665
mean 0.641 0.551 0.600 0.597 0.625 0.602 0.610 0.482 0.612 0.501 0.593
0.028 0.041 0.048 0.059 0.041 0.035 0.045 0.027 0.038 0.029 0.026
Mi./Skiing 0.697 0.728 0.689 0.727 0.632 0.757 0.695 0.624 0.745 0.641 0.590
max 0.784 0.764 0.782 0.730 0.829 0.781 0.761 0.814 0.806 0.781 0.744
mean 0.705 0.645 0.669 0.661 0.517 0.675 0.605 0.573 0.716 0.613 0.500
0.015 0.024 0.027 0.025 0.044 0.015 0.030 0.014 0.016 0.016 0.012
Table 7: Sequence performance comparison of 7/3/1 SOTA I-SOD/V-SOD/PI-SOD methods. Mu. = Music. Mi. = Miscellanea. For each sequence, the four rows report $S_{\alpha}$, max $E_{\xi}$, mean $E_{\xi}$, and MAE, respectively.


Figure 14: Visual results of all baselines on the ASOD60K-test0 (Miscellanea). Img = image. GT = ground truth.


Figure 15: Visual results of all baselines on the ASOD60K-test1 (Music). Img = image. GT = ground truth.


Figure 16: Visual results of all baselines on the ASOD60K-test2 (Speaking). Img = image. GT = ground truth.


Figure 17: Sample key frames from ASOD60K, with fixations and instance-level ground truth overlaid.

Appendix

Per-Video Performance & Visual Results. The per-video quantitative results are shown in tab:SeqQua_1 and tab:SeqQua_2. Please refer to fig:visual_te0, fig:visual_te1, and fig:visual_te2 for visual results.

This research has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 15/RP/27760.

References