DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

05/25/2019 ∙ by Hamed R. Tavakoli, et al.

This paper presents a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction, dubbed "DAVE". Several behavioral studies have shown a strong relation between auditory and visual cues in guiding gaze during free viewing of scenes. Existing video saliency models, however, only consider visual cues for predicting saliency over videos and neglect the auditory information that is ubiquitous in dynamic scenes. We propose a multimodal saliency model that utilizes audio and visual information for predicting saliency in videos. Our model consists of a two-stream encoder and a decoder. First, auditory and visual information are mapped into a feature space using 3D convolutional neural networks (3D CNNs). Then, a decoder combines the features and maps them to a final saliency map. To train such a model, data from various eye tracking datasets containing video and audio are pooled together. We further categorized the videos into 'social', 'nature', and 'miscellaneous' classes to analyze the models over different content types. Several analyses show that our audio-visual model significantly outperforms video-based models on all scores, both overall and within individual categories. A contextual analysis of model performance over the location of the sound source reveals that the audio-visual model behaves similarly to humans in attending to the sound source. Our endeavour demonstrates that audio is an important signal that can boost video saliency prediction and help get closer to human performance.


1 Introduction

Saliency models are often used to predict eye fixations made by humans during task-free scene viewing. Such models have been extensively studied within various disciplines, including psychology, cognitive vision sciences, and computer vision [Itti2000].

In computer vision, while a plethora of models have been reported for image saliency, fewer works have addressed saliency prediction over videos. Video saliency models can be utilized in various applications such as video analysis and summarization [Marat2007], stream compression [Hadi2014], and augmented and virtual reality [Sandor2010, ZhangSaliencyDI]. Many such applications are naturally multi-modal, containing a dynamic sequence of images and audio. In spite of the existing evidence on the correlation between auditory and visual cues and their joint contribution to attention [Burg2008], to date, most video saliency models neglect audio and rely heavily on spatio-temporal visual cues as the only source of information.

We introduce a deep multi-modal video saliency model, which uses both audio and visual information to predict salience. The model is designed to facilitate the analysis of the contribution of each modality. We thus construct two baselines: an audio saliency model and a visual saliency model. The baseline models share the same architecture as our multi-modal model, except that they rely only on video or audio. We then analyze the contribution of each information cue. To train our model, we combine data from various databases so that deep audio-visual models can be trained with more data. In our database, the videos are accompanied by eye tracking data gathered under a free-viewing condition in which both audio and video were presented to the viewers. We categorize the videos into three types and evaluate performance with respect to each category along with the overall performance.

In a nutshell, our main contributions include: (1) introducing a deep audio-visual saliency model for video, (2) providing categorical annotation of the videos to enhance model analysis with respect to stimulus type, (3) assessing the contribution of each modality (audio, video, and audio-video) using deep saliency models, and (4) comparing and analyzing the proposed audio-visual saliency model against existing video saliency models.

2 Related Works

Humans are intelligent multi-sensory creatures, capable of spotting and focusing on a specific audio or visual stimulus in a cluttered environment (i.e., they exhibit attentional behavior). It is, hence, unsurprising that, inspired by such observations, psychologists and neuroscientists have been studying the mechanisms underlying auditory, visual, and auditory-visual attention. The recorded seminal ideas and works on attention can be traced back to the 19th century [James1890]. Several decades of research on attention mechanisms have amassed a rich literature on the topic. Covering the whole literature is thus beyond the scope of this paper. Instead, to stress the need for multi-modal attention models, the following subsections provide a brief account of relevant studies along four axes. We refer the reader to [Carrasco2011, John2011, Borji2013, Kaya2016] for further information on these topics.

2.1 Interplay between visual and auditory attention

Auditory attention concerns mechanisms of attention in the auditory system. A good example of such mechanisms is the famous cocktail party problem [Cherry1953]. Behavioral studies on the formation of auditory attention have been interested in the subject's response to an auditory stimulus, e.g., [Maccoby1966, Bartgis2003]. From a neurophysiological point of view, the mechanisms of auditory attention and their influence on auditory cortical representations have been of interest, e.g., [Mesgarani2012].

Parallel to studies on auditory attention, numerous works have explored visual attention. The span of behavioral studies on visual attention is broad and covers a wide range of experiments on primates and humans. In such experiments, an observer is often presented with a stimulus and his neural and/or behavioral responses are recorded (using single unit recording, brain imaging, or eye tracking) [Carrasco2011].

With respect to multimodal attention, some behavioral studies have investigated the role of audio-visual components during sensory development in infants [Lewkowicz1988]. Richards [Richards2000] studied attention engagement with compound auditory-visual stimuli and showed an increase in sustained attention with age in infants. Burg et al. [Burg2008] conducted experiments to understand the effect of audiovisual events on guiding attention and concluded that audiovisual synchrony guides attention in an exogenous manner in adults.

2.2 Computational models of visual saliency

The proposed model follows the recent wave of data-driven saliency models based on deep neural networks, e.g., [Huang2015ICCV, Kummerer2014b, Cornia2016]. Such data-driven approaches fall within the broader category of models based on the feature integration theory (FIT) [TREISMAN1980]. That is, the visual input is mapped into a feature space and a saliency map is obtained by combining the features. Many saliency models fall within this broad category, though they may follow different computational schemes [Hamed2014phd]. Deep neural saliency models can be seen as encoder-decoder architectures. First, a feed-forward convolutional neural network (CNN) projects the input into a feature space (encoder). Then, a second neural architecture combines the features to form a saliency map (decoder). The decoder consists of a series of convolution operations or more complicated recurrent networks (e.g., [Cornia2016]). Our approach follows the same principle; however, it is applied to both the audio and video modalities.

2.3 Computational models of auditory saliency

The computational modeling of auditory saliency is a relatively younger field than visual saliency modeling. Some works have been inspired by visual saliency techniques. For example, [Kayser2005] adapts the model of [Itti2000] to the audio domain and produces auditory saliency maps. An auditory saliency map identifies the presence of a salient sound over time. Other models, however, are rooted in the biology of the auditory system or extend the saliency map idea to audio-based tasks.

Wrigley and Brown [Wrigley2004] proposed a model in which a network of neural oscillators performs auditory stream segregation on the basis of oscillatory correlation. In brief, the model processes the audio input to simulate auditory nerve encoding. Periodicity information is extracted using a correlogram and noise segments are identified. This information is passed through an array of neural oscillators in which each oscillator corresponds to a particular frequency channel. The output of each oscillator is connected via excitatory links to a leaky integrator that decides which oscillator, and consequently which input channel, should be attended.

Oldoni et al. [Oldoni2013] proposed an auditory attention model for designing better urban soundscapes. Their model is largely inspired by FIT [TREISMAN1980]. The audio signal is converted to a 1/3-octave band spectrogram, from which intensity, spectral contrast, and temporal contrast are computed. This process corresponds to peripheral auditory processing and results in a feature vector that forms a saliency map, which is used to identify the sounds that will be heard. While sharing some commonalities with vision-based models of attention in its first layer, this model goes beyond saliency-map approaches and extends such models to task-based, top-down driven attention.

The audio branch in the proposed framework follows the principle of FIT for the sake of consistency with the visual part. The main difference with the auditory models of saliency is the application of deep neural networks.

2.4 Models of audio-visual saliency

To the best of our knowledge, only a few multi-modal saliency works exist. Some of the early works include [CoutrotModel2014, CoutrotModel2015, CoutrotModel2016], which are extended by [multiECCV2018]. In [CoutrotModel2014], static and dynamic low-level video features are extracted using Gabor filters. Faces are also segmented interactively using a semi-automatic segmentation tool [Bertolino2012] and added to the visual feature space. For the audio track, a speaker diarization technique based on voice activity detection, audio speaker clustering, and motion detection is proposed. This information is then combined with the visual information to obtain a saliency map. In [CoutrotModel2015, CoutrotModel2016], this framework is improved by adding annotations of body parts. The need for manual face and body part segmentation limits the applicability of these models in real-world scenarios.

Boccignone et al. [multiECCV2018] follow a similar path to automate saliency prediction in social interaction scenarios. They extract several priority maps, including (1) spatio-temporal saliency features using [Seo2009], (2) face maps from a convolutional neural network (CNN) face detector, and (3) an active speaker map from the automatic lip-sync algorithm of [Chung16a]. Once the maps are extracted, a sampling scheme is employed to find attention attractors. Instead of relying on a complicated sampling scheme and multiple feature maps, we directly learn the mapping using a deep neural network.

Our model is distinct from the aforementioned audio-visual saliency models because (1) contrary to existing models that focus only on conversations and faces, our model is applicable to any scene type, and (2) it is a single end-to-end trainable framework for multi-modal saliency prediction.

3 Model

Fig. 1 depicts the proposed pipeline for multimodal saliency prediction. We formulate the problem as follows. Given a video segment as a set $\mathcal{V} = \{\mathcal{F}, \mathcal{A}\}$, consisting of a frame sequence $\mathcal{F}$ and an audio signal $\mathcal{A}$, the saliency of the video segment can be computed as the following probability distribution:

$$S = P(\text{fixation} \mid \mathcal{F}, \mathcal{A}) = \varphi(\mathcal{F}, \mathcal{A}), \qquad (1)$$

where $\varphi$ is a neural network.

As shown in Fig. 1, we use a two-stream neural network, with one stream for video and another for audio. These streams are based on 3D convolutional neural networks (3D CNNs), in particular 3D-ResNet [Hara2018CVPR]. The video stream processes 16 video frames at a time. The audio stream applies a 3D-ResNet to the log mel-spectrogram of the audio signal, arranged into 16 frames. The obtained representations are concatenated and fed to a series of upsampling and convolution layers to obtain the final saliency map.
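A minimal PyTorch-style sketch of this two-stream layout is given below. The class and argument names (TwoStreamSaliency, video_encoder, audio_encoder, decoder) are illustrative assumptions rather than the exact components of the released implementation; only the overall data flow (two 3D-CNN encoders, channel-wise concatenation, a convolutional decoder) follows the description above.

```python
import torch
import torch.nn as nn


class TwoStreamSaliency(nn.Module):
    """Sketch of the two-stream audio-visual saliency network.

    `video_encoder` and `audio_encoder` stand in for 3D-ResNet-18 backbones,
    and `decoder` for the upsampling/convolution head described in Sec. 3.1.
    """

    def __init__(self, video_encoder, audio_encoder, decoder):
        super().__init__()
        self.video_encoder = video_encoder
        self.audio_encoder = audio_encoder
        self.decoder = decoder

    def forward(self, frames, audio):
        # frames: (B, 3, 16, H, W) clip of 16 RGB frames
        # audio:  (B, 3, 16, h, w) log mel-spectrogram arranged as 16 frames
        v = self.video_encoder(frames)   # video feature volume
        a = self.audio_encoder(audio)    # audio feature volume
        x = torch.cat([v, a], dim=1)     # concatenate along the channel axis
        return self.decoder(x)           # (B, 1, H', W') saliency map
```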

To train such a network, we utilize the ground-truth saliency distribution obtained from fixation density maps, denoted as $Q$, and the saliency distribution predicted by the model, denoted as $P$. We minimize the KL-divergence between the two distributions,

$$\mathcal{L}(P, Q) = \sum_{x \in X} Q(x)\,\log\frac{Q(x)}{P(x)}, \qquad (2)$$

where $X$ represents the spatial domain of a saliency map.
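A minimal sketch of this objective in PyTorch is given below; it assumes both maps are (re)normalized to sum to one per image, and the small epsilon is an assumed numerical guard.

```python
import torch


def kl_saliency_loss(pred, target, eps=1e-8):
    """KL divergence between ground-truth (target) and predicted saliency maps.

    pred, target: tensors of shape (B, H, W); each map is normalized so that
    its values over the spatial domain sum to 1.
    """
    pred = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)
    target = target / (target.sum(dim=(1, 2), keepdim=True) + eps)
    kl = target * torch.log((target + eps) / (pred + eps))
    return kl.sum(dim=(1, 2)).mean()  # sum over pixels, average over the batch
```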

Figure 1: The overall architecture of the proposed model.

3.1 Implementation details

In the following, we will explain the technical details of the proposed model and what needs to be taken into account for reproducing the results. Our implementation is available at https://github.com/hrtavakoli/DAVE.

Encoding video using 3D CNNs:

The audio and video backbones of our model are based on the 3D ResNet architecture [Hara2018CVPR]. A depth of 18 is chosen to keep a balance between the audio and video branches, as we cannot train a very deep network for audio due to the dimensions of the audio log mel-spectrograms. For the video branch, we initialize the weights from models pre-trained on the Kinetics dataset [Kinetics2017] for action classification. The input is a tensor of shape $C \times T \times H \times W$, where $T$ is the number of frames (16 per clip) and $C$ is the number of frame channels. The input frames are normalized with the mean and standard deviation of frames from the Kinetics training set.

Audio preprocessing:

For the audio signal, we resample the audio to 16 kHz and transform it into a log mel-spectrogram with a window length of 0.025 seconds, a hop length of 0.01 seconds, and 64 bands. We then convert the transformed audio into a sequence of successive overlapping frames, resulting in an audio tensor of shape $C \times T \times H \times W$, where $C$ is the number of channels. This procedure follows the steps of [Hershey2017]. Since we apply a 3D ResNet adapted from a model pre-trained on images, we replicate the input along the channel dimension to obtain $C = 3$ channels and process $T = 16$ audio frames at a time.
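The sketch below illustrates this preprocessing with librosa and NumPy. The use of librosa and the spectrogram-slicing parameters (frame_len, frame_hop) are assumptions for illustration; only the 16 kHz sampling rate, the 25 ms window, the 10 ms hop, the 64 mel bands, the 16-frame arrangement, and the 3-channel replication come from the description above.

```python
import numpy as np
import librosa


def audio_to_tensor(wav_path, n_frames=16, frame_len=64, frame_hop=32):
    """Sketch of the audio preprocessing: log mel-spectrogram -> framed tensor."""
    y, sr = librosa.load(wav_path, sr=16000)                  # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=64,
        win_length=int(0.025 * sr), hop_length=int(0.010 * sr))
    log_mel = np.log(mel + 1e-6)                              # (64, time) log mel-spectrogram

    # Slice the spectrogram into successive overlapping frames
    # (frame_len and frame_hop are assumed values).
    frames = [log_mel[:, i:i + frame_len]
              for i in range(0, log_mel.shape[1] - frame_len + 1, frame_hop)]
    frames = np.stack(frames[:n_frames])                      # (16, 64, frame_len)

    # Replicate to 3 channels so a Kinetics-pretrained 3D ResNet can be applied.
    return np.repeat(frames[np.newaxis], 3, axis=0)           # (3, 16, 64, frame_len)
```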

Encoding audio data using 3D CNNs:

To handle audio data, we initialized the 3D ResNet with weights pre-trained on the Kinetics dataset [Kinetics2017]. We then re-trained the model for audio classification on the speech command database [SpeechCmd2018], where the task is command classification. This dataset consists of one-second-long utterances of 30 short words spoken by several different people. Once training on this data is done, we use the weights of this network to initialize the audio branch of our saliency model for training.

Decoding saliency from audio-visual features:

The visual and auditory features encoded by the 3D CNNs are concatenated. These features are then fed to a 2D convolution layer with kernel size 1, whose purpose is to reduce the feature size by half. This layer is followed by two blocks, each consisting of bilinear up-sampling with factor 2, a 2D convolution layer with kernel size 3, and batch normalization. The final layer is a 2D convolution that produces the output saliency map.
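A sketch of such a decoder in PyTorch is given below. The channel counts (in_ch and the successive halvings), the padding, and the kernel size of the final convolution are assumptions for illustration; the structure (1x1 reduction, two upsample-conv-batchnorm blocks, a final 2D convolution) follows the description above, and the temporal dimension of the 3D features is assumed to have been squeezed or pooled beforehand.

```python
import torch.nn as nn


class SaliencyDecoder(nn.Module):
    """Sketch of the saliency decoder; channel sizes are assumed values."""

    def __init__(self, in_ch=1024):
        super().__init__()
        half = in_ch // 2
        self.reduce = nn.Conv2d(in_ch, half, kernel_size=1)      # halve the feature size
        self.block1 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(half, half // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(half // 2))
        self.block2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(half // 2, half // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(half // 4))
        self.out = nn.Conv2d(half // 4, 1, kernel_size=1)        # final 2D conv -> saliency map

    def forward(self, x):
        # x: (B, in_ch, h, w) concatenated audio-visual features
        x = self.reduce(x)
        x = self.block1(x)
        x = self.block2(x)
        return self.out(x)
```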

3.2 Training by dynamic routing

Training our model is nontrivial. The lack of stimulus diversity in video databases has been shown to hinder video saliency learning [wang2018cvpr]. To circumvent this, we train our model on data from static saliency datasets as well as video data. To this end, we update the computational graph dynamically depending on the source of the input. In the case of static images, the image is replicated to form a volume with the appropriate number of frames. The audio branch and the part of the graph that mixes the audio and video features are then frozen and ignored in the computational graph during training. In other words, the visual features are fed directly to the first upsampling block, so only the visual 3D CNN and part of the saliency decoder are updated during the backward pass. A schematic diagram of the computational graph, for both static-image and audio-visual input, is illustrated in Fig. 2. The model is trained alternately, one epoch on audio-visual input and one epoch on image volumes. During training, the batch size is 10 and the Adam optimizer is used for a total of 10 epochs.
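The sketch below illustrates this alternating, path-selective training scheme. The loader and attribute names (av_loader, img_loader, audio_encoder) and the video-only forward helper are illustrative assumptions; the epoch alternation, the freezing of the audio-related part of the graph for static images, and the KL objective follow the description above.

```python
def train_with_dynamic_routing(model, av_loader, img_loader, optimizer, epochs=10):
    """Alternate one epoch of audio-visual clips with one epoch of replicated images."""
    for epoch in range(epochs):
        use_audio = (epoch % 2 == 0)          # path selection based on the data source
        loader = av_loader if use_audio else img_loader

        # Freeze the audio branch (and, implicitly, the audio-visual mixing layer)
        # when the input consists of static images replicated into frame volumes.
        for p in model.audio_encoder.parameters():
            p.requires_grad = use_audio

        for frames, audio, fixation_density in loader:
            if use_audio:
                pred = model(frames, audio)
            else:
                # Visual features go directly to the first upsampling block;
                # `forward_video_only` is a hypothetical helper for that path.
                pred = model.forward_video_only(frames)
            # kl_saliency_loss is the Eq. (2) sketch shown earlier.
            loss = kl_saliency_loss(pred.squeeze(1), fixation_density)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```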

Figure 2: Computational graph during training, (a) handling audio-visual data, (b) handling static images as video frames by replicating them. Path selection decides which part of the graph should be frozen (light gray areas) based on input during training.

4 Experimental Setup

4.1 Data

The predicted saliency maps are assessed against human fixations obtained by eye tracking while subjects watched video stimuli. In the multimodal attention case, the videos contain both images and audio. Unfortunately, the field lacks a large corpus of fixation data where the audio and video stimuli have been presented simultaneously. Only a limited number of video sequences with fixations are available that meet the required criteria (i.e., presence of both video and audio during eye tracking). To remedy this, we pool the existing data from [Mital2011DIEM], [Coutrot2011erbsound], and [Coutrot2015DB] to construct a dataset for model training. We refer to this dataset as DAVoS, standing for 'Data for Audio Visual Saliency'.

From [Mital2011DIEM], we select only the 76 videos (out of 85) for which the audio was played during eye tracking. We use all the sequences from [Coutrot2011erbsound] and [Coutrot2015DB]. In total, we have 150 videos, corresponding to approximately 2 hours and 33 minutes of video. We categorized the video sequences into three categories based on their content: 'Nature', 'Social events', and 'Miscellaneous'. The 'Nature' category includes videos of nature and animals in their habitats. 'Social events' sequences contain group activities involving at least 2 people and cover a wide range of activities such as sports, crowds, and conversations. The remaining videos are categorized as 'Miscellaneous'. Table 1 summarizes the number of videos in each category per train/validation/test split.

The average number of subjects per video in the dataset is 45 (minimum: 18, maximum: 220, median: 31). All fixation data were recorded at 1000 Hz with an SR-EyeLink1000 eye tracker under the free-viewing task. The presentation setup differed slightly across the above-mentioned datasets. This, however, does not affect us, as (1) the gaze data have been appropriately mapped to the correct 2D pixel coordinates and (2) smoothing appropriate to the viewing angle of each source has been applied. Fig. 3 depicts the average fixation map over all frames as well as over the different categories. As also shown in the literature [Tatler2007], fixations tend to gravitate towards the center (a.k.a. center bias). Examples from each video category and their fixation density maps are provided in Fig. 4.

Figure 3: From left to right: Mean Eye Position (MEP) for the different categories in the training set, for all training sequences (used as a lower-bound baseline), and for the entire dataset (train, val, test).
Figure 4: Example frames of each video category with the corresponding ground truth eye gaze overlaid. From top to bottom: Nature, Social Events, and Miscellaneous classes.
Video Type Train Valid. Test
Nature 37 6 10
Social events 18 16 11
Miscellaneous 37 7 7
Total (150) 92 29 29
Table 1: Number of sequences in each video category.

4.2 Evaluation scores

Evaluating image saliency models has been studied extensively in the past [Borji2013ICCV, Judd2012, salMetrics_Bylinskii, TavakoliABL17]. For video saliency evaluation, individual prediction maps are evaluated using image saliency evaluation scores, which are then averaged over all maps. Here, we follow the steps of [wang2018cvpr] and use the same scores as provided at https://mmcheng.net/videosal/. The scores include Normalized Scanpath Saliency (NSS), Similarity Metric (SIM), Linear Correlation Coefficient (CC), Area Under the ROC Curve (AUC-Judd) [Judd2012], and shuffled AUC (sAUC) [Zhang2008]. Please see the supplement for more details.

4.3 Compared models

Table 2 summarizes information about the compared models. Previous works have demonstrated that video saliency models learning from spatio-temporal data outperform image saliency models in video saliency prediction [Jiang2018ECCV, wang2018cvpr]. We thus focus on video saliency models. The compared models consist of traditional video saliency models that rely on hand-crafted image features or generic image statistics, as well as state-of-the-art deep models. We could not compare our model with [CoutrotModel2016, multiECCV2018] since their code is not publicly available. Furthermore, they are task-specific and work only on sequences of speakers.

Among deep models, we compare with two state-of-the-art works, DeepVS [Jiang2018ECCV] and ACLNet [wang2018cvpr], which have public implementations and pre-trained models available. In addition, for ACLNet, we fine-tuned the model on the DAVoS training data for five epochs, starting from the pre-trained model (DeepVS does not provide training code). The fine-tuned model is referred to as ACLNet*.

Among traditional saliency models, we compare to SEO [Seo2009], FEVS [TavakoliModel2013], UHF-D [Wang_2017_CVPR_Workshops], and AWS-D [AWSD2018]. Except for UHF-D, which learns a set of filters via unsupervised hierarchical independent subspace analysis from a random YouTube video sequence, these models are training-free. The models are developed for videos, though they may have image-only versions. The inputs to the models are video frames of the same size as used by our deep model. The temporal length is equal to each model's expected number of frames, and a model may resize the inputs internally when required.

To understand the contribution of each modality, we also propose two single-modality models. We design an audio saliency model, which uses only the audio branch of our encoder, while the decoder is the same as in the multi-modal saliency model (discarding the concatenation and the first 1x1 convolution block that mixes audio and video). Similarly, we form a video saliency model that uses only the video encoder branch. The two models are trained with the same setup as our proposed audio-visual model, from the same starting point and weight initialization.

Model Train Ext. fin. Deep CB
Audio-only (ours) DAVoS
Video-only (ours) DAVoS
Audio-visual (ours) DAVoS
FEVS [TavakoliModel2013] PO
AWS-D [AWSD2018]
SEO [Seo2009]
UHF-D [Wang_2017_CVPR_Workshops] YouTube PO
ACLNet [wang2018cvpr] DHF1K
ACLNet* [wang2018cvpr] DHF1K DAVoS
DeepVS [Jiang2018ECCV] LEDOV ML
Table 2: The list of the compared models, the source of the training data, the data used for extra fine-tuning (Ext. fin.), the type of the model (deep vs. traditional), and the use of a center prior ('CB'). Fine-tuning was only applied to deep models that provide training scripts; fine-tuned models are indicated by '*'. 'PO' stands for the use of a center prior at the prediction output and 'ML' indicates the use of such a prior in middle layers.

5 Results

5.1 Expected performance bounds

To define a lower-bound performance and establish a strong baseline, we computed a mean eye position (MEP) map from the training sequences, as depicted in Fig. 3. The MEP captures the center bias that a model may learn by training on the dataset. It is, thus, a powerful baseline for assessing the usefulness of saliency models. The upper-bound performance on the database is computed by splitting the subjects into two groups and assessing one group against the other. This is similar to the infinite-human analysis in [Judd2012], which is a common approach for establishing an upper-bound performance for saliency models. We call this upper bound 'Human Infinite'. The results for the bounds are reported in Table 3 along with the performance of the models. We use these bounds to assess the success of a model on the data: a model has an acceptable performance if it outperforms MEP, and the better a model is, the closer its scores are to the Human Infinite scores.
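The sketch below illustrates, under assumptions, how these two bounds can be computed with NumPy: the MEP as the mean of the training fixation density maps, and the split-half 'Human Infinite' bound by scoring one half of the subjects against the other. Function names and the random split are illustrative, not the exact procedure of the paper.

```python
import numpy as np


def mean_eye_position(train_density_maps):
    """Lower bound: average fixation density map over the training frames."""
    mep = np.mean(np.stack(train_density_maps), axis=0)
    return mep / mep.sum()  # renormalize to a distribution


def human_infinite(fixations_by_subject, score_fn, seed=0):
    """Upper-bound sketch: score one half of the subjects against the other.

    fixations_by_subject: list of per-subject fixation maps for one frame.
    score_fn: a saliency score, e.g. NSS or CC, taking (prediction, reference).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(fixations_by_subject))
    half = len(order) // 2
    map_a = np.sum([fixations_by_subject[i] for i in order[:half]], axis=0)
    map_b = np.sum([fixations_by_subject[i] for i in order[half:]], axis=0)
    return score_fn(map_a, map_b)
```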

Model Name NSS AUC Judd sAUC CC SIM
Human Infinite 3.83 0.8852 0.7877 0.7140 0.574
Audio + Video (ours) 2.45 0.8818 0.7268 0.5457 0.4493
Video only (ours) 2.26 0.8793 0.7259 0.5178 0.4158
DeepVS 2.05 0.8603 0.6996 0.4666 0.3733
ACLNet* 1.98 0.8698 0.7003 0.4755 0.3793
ACLNet 1.88 0.8670 0.6960 0.4625 0.3489
FEVS 1.71 0.8532 0.6777 0.4217 0.3629
MEP (dummy baseline) 1.59 0.8443 0.6623 0.4029 0.3263
UHF-D 1.56 0.8401 0.6945 0.38104 0.2752
Audio only (ours) 1.54 0.8419 0.6631 0.3929 0.3246
AWS-D 0.99 0.7309 0.6537 0.2317 0.2248
SEO 0.55 0.6899 0.5512 0.1276 0.1946
Table 3: Performance of various models on the DAVoS test set, sorted by NSS. For all scores, a higher value is better. Mean Eye Position (MEP), computed from the training set, is a strong baseline and Human Infinite defines the upper performance bound. The best score is in bold, and the second best is underlined.

5.2 Contribution of the modalities

In this section, we look more closely into the contribution of the individual modalities and their combination, using the models described in sub-section 4.3. The results are summarized in Table 3. While the video saliency model produces reasonably good predictions, the audio saliency model performs slightly better than MEP only in terms of sAUC. Visual inspection of the audio model shows that its prediction maps are mostly focused on the center of the frames (see examples in Fig. 5). This is reasonable given that the visual spatial information channel is absent and precise spatial localization of the sound source is difficult. The combination of both modalities results in the best saliency prediction performance. An ANOVA test on the scores, with the modalities as factors, indicates a significant difference between all the models on all the scores. This shows that combining audio and video features indeed improves saliency prediction significantly.

Figure 5: Contribution of each modality to the model. From left to right: image, ground truth, audio model, video model, multimodal model. The audio model is focused on the center. The video model finds the correct spatial regions, while the multimodal model has a better attention distribution.

5.3 Audio-visual integration and stimuli categories

We further look into the extra benefit that one can gain from using multimodal data compared to using videos alone. To this end, we focus on the performance of our proposed video-only and audio-visual models over the scene categories. The results are reported in Table 4. Please see the supplement for the results of all models.

The higher NSS score of the Human Infinite baseline for 'Social events' suggests a higher degree of agreement between observers in terms of where they look. For example, if two people are talking in a video, most observers will attend to the same speaker. The MEP has its lowest prediction scores on 'Social events', indicating that the salient regions are likely off-center and more unique. Please see Fig. 4 for some examples.

Comparing the video model with the audio-video model, we observe an overall increase in performance when using both audio and video signals over all the categories. An ANOVA test on the scores within each category, with the models as factors, showed a significant difference among the models over all scores in all the categories. The sAUC values are about the same for both models over the 'Miscellaneous' and 'Social events' categories; both models seem to capture off-center regions similarly. Nevertheless, the audio-visual model has a clear advantage over the video model because its predictions better align with the human scan-paths (higher NSS) and are more similar to the human maps (higher CC and SIM scores).

Cat. Model Name NSS AUC Judd sAUC CC SIM
Nature Human Infinite 3.31 0.8806 0.7724 0.6961 0.5603
Video+audio 2.27 0.8773 0.7233 0.5392 0.4504
Video 2.04 0.8762 0.7191 0.5066 0.4073
MEP 1.76 0.8696 0.6864 0.4714 0.3682
Soc. Ev. Human Infinite 3.65 0.8765 0.7760 0.6971 0.5485
Video+audio 2.65 0.8853 0.7264 0.5453 0.4420
Video 2.45 0.8824 0.7275 0.5136 0.4080
MEP 1.35 0.8196 0.6337 0.3147 0.2744
Misc. Human Infinite 3.34 0.8682 0.7716 0.6543 0.5256
Video+audio 2.39 0.8812 0.7360 0.5495 0.4548
Video 2.25 0.8774 0.7373 0.5321 0.4335
MEP 1.73 0.8446 0.6754 0.4378 0.3422
Table 4: Multimodal integration and contribution with respect to video categories. Mean eye position (MEP) is obtained from all the videos in the training set of DAVoS and assessed against the videos in each category in the test set. Human Infinite is obtained by assessing half of the subjects against the other half. MEP is a strong baseline and human infinite defines the expected upper bound of the test data. The best scores are shown in bold letters.

5.4 Behaviour of the models over time

To understand the performance of the models over the video duration, we plot the score of each frame over the video length (ordered frames). The average performance over time is obtained by normalizing the video lengths (each video has a different number of frames and a different duration). We re-sample the per-frame scores over the video length to a fixed number of points, here 1000. Then, we compute the mean score over all videos at each point, as shown in Fig. 6.
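A minimal NumPy sketch of this length normalization is shown below; the linear interpolation used for re-sampling is an assumption, while the fixed length of 1000 points comes from the text.

```python
import numpy as np


def mean_score_over_time(per_video_scores, n_points=1000):
    """Resample each video's per-frame scores to a fixed length and average.

    per_video_scores: list of 1-D arrays, one array of frame scores per video.
    Returns an array of length n_points: the mean score at each normalized
    time position.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    resampled = []
    for scores in per_video_scores:
        t = np.linspace(0.0, 1.0, len(scores))
        resampled.append(np.interp(grid, t, scores))  # length-normalized scores
    return np.mean(np.stack(resampled), axis=0)
```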

Looking at the overall mean performance of the models over time, we see that the audio-visual model (our best model) has an edge over the audio and video models most of the time. Comparing the NSS scores of the audio-visual predictions with those of the video model over all predicted frames, the audio-visual model outperforms the video model (higher NSS) 53.54% of the time. There are, however, cases where the video model performs slightly better than the audio-visual model.

In order to dig deeper into the cases where one model is better than the other, we looked into the best and worst performing sequences for each category using the NSS score. The results are summarized in Fig. 7. Investigating the video frames and predictions suggests that the audio-visual model is more sensitive to active sound sources, whereas the video model is more responsive to visual cues. In Fig. 7, the middle column shows examples in which the audio-visual model localizes the speaking face well, while the video model captures many faces. We therefore examine the role of the sound source further in Sec. 5.6. To sum up, preferring one cue over the other for computing saliency leads to nuanced performance variations over the video length, though the audio-visual model has the overall advantage.

It is worth noting that the models have roughly equal performance at the beginning of the video sequences. To find out the reason, we sampled the first 10 frames of each video (290 frames) and checked the predictions and ground truth visually and in terms of scores. The analysis indicated that the ground-truth data is mostly focused on the center of the frames (78.97%, 229 out of 290 frames) and that the models predict such cases easily.

Figure 6: Mean NSS and sAUC of the modalities over time. The values are smoothed for visualization purposes. The overall performance of each modality, prior to re-sampling, is also reported.
Figure 7: Comparing predictions over time between the proposed audio-video model and video model. From left to right: nature, social events, and miscellaneous categories. Top row sequences with best NSS score and bottom row sequences with worst NSS score. Sample frames depict ground-truth, video model, and audio-video model maps overlaid on the frame (from left to right).

5.5 Comparison to other models

Table 3 summarizes the performance of all models. The proposed audio-visual model outperforms all other models, and our video-only model outperforms all the other video models. Fine-tuning ACLNet improved its performance, albeit only slightly. This suggests that deep models may already generalize well enough to handle unseen visual data, which alleviates our concern about the lack of access to the DeepVS training code for fine-tuning on DAVoS. Deep video models perform better than the tested traditional saliency models. Most of the traditional models fall short of MEP on most scores, except FEVS. Fig. 8 depicts some randomly chosen predictions to facilitate a better understanding of the models' output. As shown, the proposed audio-visual model produces maps that are more similar to the ground-truth human density maps. We also visualize the mean maps over the test frames.

Contrary to FEVS, UHF-D, and DeepVS, which incorporate a center prior, our proposed models do not use such prior information. This suggests that if a deep model is trained properly, there is no need to incorporate a center prior into the middle layers or predictions (see [Kruthiventi2017, cornia2018]). This can also be observed in Fig. 8, where our audio-visual model has the mean prediction map most similar to the MEP over the test set. The mean prediction map reflects how close the predictions are, on average, to the ground truth.

Figure 8: Example saliency maps from the tested models. The last row contains the mean saliency map of each model over the test frames.

5.6 Impact of sound source location

We noticed that our audio-visual model performed particularly well on cases where the video frame contained a visible sound source. To further assess this observation, we performed a contextual evaluation on the predicted saliency maps. To this end, we randomly sampled frames from each test video and manually annotated the location of the sound source, if present. Only frames where a sound source was found were included in the subsequent analysis. Following the contextual evaluation scheme of [TavakoliABL17], where a score is computed over regions of interest, we computed contextual NSS scores for all the models. That is, we compute the NSS score over (1) the location of the sound source, (2) the rest of the frame excluding the sound source location, and (3) the entire test frame. The results are reported in Fig. 9. As depicted, the audio-visual model has the best contextual performance, attending to the sound source in a manner consistent with human attention. This finding is aligned with human studies showing that a synchronous audio-visual stimulus grabs gaze [Burg2008].
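The sketch below illustrates, with NumPy, one way to compute such contextual NSS scores given a binary mask of the annotated sound-source region; the exact region handling in the paper's evaluation may differ.

```python
import numpy as np


def contextual_nss(saliency, fixations, source_mask):
    """NSS over (1) the sound source, (2) the rest of the frame, (3) the whole frame.

    saliency: predicted map (H, W); fixations: binary fixation map (H, W);
    source_mask: binary mask (H, W) of the annotated sound-source location.
    """
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)  # standardized map

    def nss_in(region):
        fix = (fixations > 0) & region
        return z[fix].mean() if fix.any() else np.nan

    src = source_mask.astype(bool)
    full = np.ones_like(src)
    return nss_in(src), nss_in(~src), nss_in(full)
```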

Figure 9: Contextual evaluation of the saliency models in the case of visible sound source. The red and green lines indicate human performance in the cases (1) and (2), respectively. Similarly, the blue and yellow lines correspond to MEP performance on the cases (1) and (2), respectively.

6 Discussion and conclusion

In this paper, we presented a generic audio-visual deep dynamic saliency model. Our proposed audio-visual saliency model outperformed all other models in predicting fixations over dynamic input. The boost in saliency prediction performance was observed over all video categories. We performed a contextual analysis to assess the performance of the models depending on the location of the sound source. Our analysis showed that, in the presence of a sound source, (1) humans mostly attend to the location of the sound source, and (2) our audio-visual model behaves most similarly to humans in attending to that location. This finding is consistent with studies that support the role of audiovisual synchrony in attention guidance, e.g., [Burg2008].

Analysis of the audio, video, and audio-video models over time showed that their behavior is only similar at the beginning of the videos. The root cause seems to lie in the nature of the ground-truth data, where fixations in over 78% of the first 10 frames of each video are concentrated at the center. This may be due to a lag in the attention system (subjects need some time to understand the story), though digging deeper into it requires controlled behavioural studies. Investigating the performance over the video length hinted at variations in how the audio-video model and the video model favour different input cues. An in-depth analysis of these fine variations, which requires controlled human experiments with extensive annotation, is a future step that can help improve the models further.

Overall, we show that the audio signal contributes significantly to dynamic saliency prediction. Our model can be utilized in applications where dynamic attention and sound are ubiquitous, such as augmented, virtual, and mixed reality, and human-robot interaction.

Acknowledgment

Hamed R. Tavakoli gratefully acknowledges the support of NVIDIA Corporation for the donation of the GPUs used in the development of this work.

Appendix

Appendix A Evaluation Scores

Evaluating image saliency models has been studied extensively in the past; see, for example, [Borji2013ICCV], [Judd2012], [salMetrics_Bylinskii], and [TavakoliABL17]. For video saliency evaluation, individual prediction maps are evaluated using image saliency evaluation scores, which are then averaged. Here, we follow the steps of [wang2018cvpr] and use the same scores as provided at https://mmcheng.net/videosal/. The scores include Normalized Scanpath Saliency (NSS), Similarity Metric (SIM), Linear Correlation Coefficient (CC), Area Under the ROC Curve (AUC-Judd) [Judd2012], and shuffled AUC (sAUC) [Zhang2008].

The NSS is designed to evaluate a saliency map over fixation locations. Given a saliency map $S$ and a binary fixation map $F$, NSS is defined as

$$\mathrm{NSS} = \frac{1}{N}\sum_{i} \bar{S}(i)\,F(i), \qquad \bar{S} = \frac{S - \mu_S}{\sigma_S}, \qquad (3)$$

where $N = \sum_i F(i)$ is the number of fixated pixels, and $\mu_S$ and $\sigma_S$ are the mean and standard deviation of the saliency map.

The linear correlation coefficient (CC) measures the correlation between the saliency map $S$ and a smoothed fixation map $G$. It is defined as $\mathrm{CC} = \mathrm{cov}(S, G) / (\sigma_S\,\sigma_G)$, where $\mathrm{cov}(S, G)$ is the covariance, and $\sigma_S$ and $\sigma_G$ are the standard deviations of the saliency map and the smoothed fixation map, respectively.

The similarity metric (SIM) measures the similarity between two distributions. Treating $S$ and $G$ as probability distributions (i.e., $\sum_i S(i) = 1$ and $\sum_i G(i) = 1$), the SIM is computed as $\mathrm{SIM} = \sum_i \min(S(i), G(i))$.
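For completeness, a minimal NumPy sketch of NSS, CC, and SIM under these definitions is given below (maps are assumed to be 2-D arrays; the epsilon guards are assumptions):

```python
import numpy as np


def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean of the standardized map at fixated pixels."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return z[fixations > 0].mean()


def cc(saliency, density):
    """Linear correlation coefficient between saliency and smoothed fixation maps."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    g = (density - density.mean()) / (density.std() + 1e-8)
    return (s * g).mean()


def sim(saliency, density, eps=1e-8):
    """Similarity metric: histogram intersection of the two normalized maps."""
    p = saliency / (saliency.sum() + eps)
    q = density / (density.sum() + eps)
    return np.minimum(p, q).sum()
```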

The area under the curve (AUC) treats the estimated saliency map as a classifier output. It computes the area under the ROC curve obtained by varying a threshold and computing the true positive rate and false positive rate. AUC-Judd samples the true positives from the saliency map values above the threshold at fixated pixels and measures the false positive rate from the saliency map values above the threshold at non-fixated pixels. The sAUC samples the negatives from fixated locations of other images (other video frames in our case). This sampling scheme penalizes the center-bias phenomenon [Zhang2008].

Appendix B Model Comparison on Categories

We compared all the models within the three categories. The results, summarized in Table 5, indicate the superiority of the audio-visual model across all categories.

Cat. Model Name NSS AUC Judd sAUC CC SIM
Nature Human Infinite 3.31 0.8806 0.7724 0.6961 0.5603
Video+audio (ours) 2.27 0.8773 0.7233 0.5392 0.4504
Video (ours) 2.04 0.8762 0.7191 0.5066 0.4073
DeepVS 1.89 0.8569 0.6959 0.4621 0.3751
ACLNet* 2.03 0.8841 0.7232 0.5179 0.4010
ACLNet 1.96 0.8811 0.7164 0.5116 0.3766
FEVS 1.85 0.8719 0.7019 0.4816 0.3991
MEP 1.76 0.8696 0.6864 0.4714 0.3682
AUDIO (ours) 1.72 0.8684 0.6888 0.4637 0.3678
UHF-D 1.64 0.8456 0.7103 0.4153 0.2943
AWS-D 0.74 0.6799 0.6288 0.1723 0.2092
SEO 0.45 0.6578 0.5434 0.1021 0.1881
Soc. Ev. Human Infinite 3.65 0.8765 0.7760 0.6971 0.5485
Video+audio 2.65 0.8853 0.7264 0.5453 0.4420
Video 2.45 0.8824 0.7275 0.5136 0.4080
DeepVS 2.26 0.8671 0.7008 0.4775 0.3723
ACLNet* 2.02 0.8692 0.6834 0.4488 0.3594
ACLNet 1.91 0.8659 0.6837 0.4324 0.3251
FEVS 1.56 0.8393 0.6546 0.3540 0.3210
UHF-D 1.46 0.8301 0.6675 0.3364 0.2502
MEP 1.35 0.8196 0.6337 0.3147 0.2744
AUDIO 1.32 0.8180 0.6322 0.3123 0.2756
AWS-D 1.13 0.7658 0.6575 0.2656 0.2299
SEO 0.60 0.7137 0.5488 0.1404 0.1936
Misc. Human Infinite 3.34 0.8682 0.7716 0.6543 0.5256
Video+audio 2.39 0.8812 0.7360 0.5495 0.4548
Video 2.25 0.8774 0.7373 0.5321 0.4335
DeepVS 1.98 0.8555 0.7027 0.4574 0.3723
ACLNet* 1.84 0.8517 0.6831 0.4562 0.3778
ACLNet 1.74 0.8497 0.6858 0.4391 0.3450
FEVS 1.74 0.8473 0.6773 0.4354 0.3728
MEP 1.73 0.8446 0.6754 0.4378 0.3422
AUDIO 1.60 0.8400 0.6714 0.4114 0.3368
UHF-D 1.59 0.8465 0.7108 0.3971 0.2843
AWS-D 1.13 0.7658 0.6575 0.2656 0.2299
SEO 0.61 0.6996 0.5647 0.1436 0.2045
Table 5: Comparing models on video categories.

Appendix C Distribution of scores

To better understand the overall behaviour of our proposed models (audio-visual, video-only, and audio-only), we calculated the distribution of NSS scores, summarized in Fig. 10. A better model should have a distribution skewed towards higher NSS values. Although the means of the models are close to each other, the multimodal (audio-visual) model is skewed further towards higher scores and has fewer low scores in comparison to the audio and video models.

Figure 10: Score distribution of NSS scores. A better model should have a distribution skewed towards higher NSS scores (right). The mean of each distribution is also depicted as a vertical line.

Appendix D Comparative prediction improvement in terms of number of frames with improved score

Table 6 summarizes, for each compared model, the percentage of frames on which the audio-visual model achieves a higher NSS score. For example, the proposed audio-visual model outperforms our proposed video-only model on 53.54% of the frames. The amount of improvement is consistent with the overall ranking of the models in terms of the scores.
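This per-frame comparison reduces to a simple computation; a NumPy sketch (with hypothetical per-frame score arrays as inputs) is:

```python
import numpy as np


def improved_frame_percentage(av_nss, other_nss):
    """Percentage of frames where the audio-visual model's NSS exceeds another model's."""
    av_nss, other_nss = np.asarray(av_nss), np.asarray(other_nss)
    return 100.0 * np.mean(av_nss > other_nss)
```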

Compared Model % improved frame prediction
Video (ours) 53.54
Audio (ours) 73.13
ACLNet* 67.18
ACLNet 70.94
FEVS 68.05
AWSD 82.04
deepVS 60.46
SEO 90.36
UHFD 69.31
Table 6: Percentage of frames with better NSS score for the audio-visual model in comparison to other models.

Appendix E 3D vs 2D ResNet for audio classification

We experimented with both 3D and 2D ResNet-18 for audio classification on the speech command database [SpeechCmd2018]. Our experiments showed that, for sound classification, the 3D ResNet-18 achieves a higher accuracy (88.68%) than the 2D ResNet-18 (82.55%). Nevertheless, the preprocessing for the 2D ResNet is slightly different, as we do not convert the log mel-spectrogram into several feature frames. We thus adopted the 3D ResNet architecture, which also provides a symmetric feature encoding architecture: both audio and video use the same type of encoder.

References