A simple and effective method for video saliency prediction
This paper investigates modifying an existing neural network architecture for static saliency prediction using two types of recurrences that integrate information from the temporal domain. The first modification is the addition of a ConvLSTM within the architecture, while the second is a computationally simple exponential moving average of an internal convolutional state. We use weights pre-trained on the SALICON dataset and fine-tune our model on DHF1K. Our results show that both modifications achieve state-of-the-art results and produce similar saliency maps.
Visual saliency pertains to how an object or any piece of information may stand out from its surroundings. Detecting saliency is an integral part of how sentient organisms process information. We live in a world where the visual data we receive on a daily basis is immense and cluttered with noise; therefore, the brain has evolved in such a way that allows living organisms to focus their attention on the most relevant information, so as to function efficiently. Efforts in the computer vision community have been ongoing for many years to simulate this biological process artificially, leading to the development of large-scale static gaze datasets (e.g. SALICON [Jiang et al.(2015)Jiang, Huang, Duan, and Zhao]) and, more recently, dynamic gaze datasets (e.g. DHF1K [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]). Based on these datasets, model-driven approaches tackle the task of saliency prediction by estimating heatmaps of probabilities, where every probability corresponds to how likely it is that the corresponding pixel will attract human attention. Thanks to the availability of large-scale datasets, deep learning architectures have managed to significantly improve the accuracy achievable in this task (e.g. [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji, Pan et al.(2017)Pan, Ferrer, McGuinness, O’Connor, Torres, Sayrol, and Giro-i Nieto, Gorji and Clark(2018), Jiang et al.(2015)Jiang, Huang, Duan, and Zhao, Pan et al.(2016)Pan, Sayrol, Giro-i Nieto, McGuinness, and O’Connor]).
Most scientific interest has so far been focused on image-based saliency models, with video saliency prediction gaining more traction in recent years with the introduction of large-scale video saliency datasets ([Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji, Mathe and Sminchisescu(2015)]). When it comes to extracting visual information from the temporal domain, ConvLSTMs have become increasingly popular, achieving state-of-the-art results in various computer vision tasks (e.g. [Xingjian et al.(2015)Xingjian, Chen, Wang, Yeung, Wong, and Woo, Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji, Xu et al.(2018)Xu, Yang, Fan, Yang, Yue, Liang, Price, Cohen, and Huang]). In this work we augment a state-of-the-art architecture for image saliency [Pan et al.(2017)Pan, Ferrer, McGuinness, O’Connor, Torres, Sayrol, and Giro-i Nieto] by adding a ConvLSTM module within its internal structure, similar to [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji, Gorji and Clark(2018)]. More interestingly, we also test a much simpler method for temporal stability. We wrap a convolutional layer with a temporal exponential moving average (EMA) [Polyak(1964)] operation. Using this recurrence, the output will always be a smoothed average of its previous states. This method is already used in gradient descent with momentum [Sutskever et al.(2013)Sutskever, Martens, Dahl, and Hinton] to speed up convergence, replacing the current gradient with the exponential moving average of current and past gradients, derived from mini-batches of the data. To the best of our knowledge, this is the first time that this method has been applied within the architecture of a neural network.
Ablation studies are commonly used to better understand the performance impact of added components. Whilst this has merit, we propose that simple functions should also be used to investigate the necessity of complex modifications. To this end, in this work we consider both an elaborate ConvLSTM recurrence and a much simpler weighted average recurrence, and show that the simpler approach competes with the ConvLSTM on the task of video saliency.
Video saliency prediction with deep neural networks has largely adapted the architectures originally proposed for video action recognition. A first popular option is the two-stream network [Simonyan and Zisserman(2014b)], in which the motion information is encoded by pre-computing the optical flow and processing it in a separate tower from the RGB channels. This is the approach adopted by STSConvNet [Bak et al.(2017)Bak, Kocak, Erdem, and Erdem]. This solution presents two important limitations: the computational overhead of computing the optical flow, and the lack of temporal perspective beyond the pairs of consecutive frames typically considered when computing optical flow. These shortcomings are partially addressed by neural architectures in which the temporal relation across frames is computed by a recurrent neural network (RNN) [Donahue et al.(2015)Donahue, Anne Hendricks, Guadarrama, Rohrbach, Venugopalan, Saenko, and Darrell]. RNN-based deep models for saliency prediction have already been explored [Bazzani et al.(2017)Bazzani, Larochelle, and Torresani, Jiang et al.(2017)Jiang, Xu, and Wang, Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji, Gorji and Clark(2018)] and are the core of the state-of-the-art solutions. Similarly to [Montes et al.(2016)Montes, Salvador, Pascual, and Giro-i Nieto] for activity detection, RMDN [Bazzani et al.(2017)Bazzani, Larochelle, and Torresani] combined the short-term memory encoded by C3D spatio-temporal convolutions [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] with a long short-term memory encoded by a plain LSTM. However, most current works have adopted a ConvLSTM layer as the temporal recurrence, so that the recurrent layer has a notion of space at a local scale. The OM-CNN model proposed in [Jiang et al.(2017)Jiang, Xu, and Wang] fuses the RGB and optical flow streams of a two-stream architecture with two ConvLSTMs. The authors of the largest dataset for video saliency prediction, the DHF1K (Dynamic Human Fixation 1K) dataset [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji], trained a deep neural model based on ConvLSTM layers with attention (ACLNet). The authors of [Gorji and Clark(2018)] exploit an existing model pre-trained for static saliency prediction, but with a more complex architecture composed of four branches fused with a ConvLSTM.
Our model outperforms the presented state of the art with a simple architecture that only considers RGB frames as input. As in some of the referred works, we exploit a model pre-trained with static images and study its enhancement with two types of temporal recurrence.
The adopted neural architecture follows an encoder-decoder scheme that processes the temporal recurrence in the bottleneck. The topology of both the encoder and the decoder is adopted from SalGAN [Pan et al.(2017)Pan, Ferrer, McGuinness, O’Connor, Torres, Sayrol, and Giro-i Nieto], the current top-performing static saliency model on the DHF1K saliency benchmark. The SalGAN encoder corresponds to the popular VGG-16 convolutional network [Simonyan and Zisserman(2014a)], designed and trained to solve an image classification task. On the decoder side, SalGAN uses the same layers as the encoder in reverse order, interspersed with upsampling instead of pooling operations. The original SalGAN model was trained using a combination of adversarial and binary cross entropy (BCE) losses. Here, for simplicity, we use only BCE and term the resulting architecture SalBCE.
We introduce a temporally aware component into the SalBCE network. This is either the addition of a ConvLSTM layer or an exponential moving average (EMA) applied on a pre-existing convolutional layer. Figure 1 presents a schematic of our architecture.
An LSTM is an autoregressive architecture that controls the flow of information in the network using three gates: update, forget, and output (Figure 2, left). In ConvLSTMs [Xingjian et al.(2015)Xingjian, Chen, Wang, Yeung, Wong, and Woo], the operations at each gate are convolutions. Temporal information is preserved in the cell state $C_t$, upon which gated element-wise operations are performed by the update and forget gates. The hidden state from the previous step, $H_{t-1}$, is concatenated with the input at each step and propagated through linear and non-linear operations at the gates. At each step the current state $S_t$ of the model is passed through the ConvLSTM gates, and the cell state $C_t$ and hidden state $H_t$ are updated. In the following equations ‘$\circ$’ represents the element-wise product, ‘$*$’ a convolution operation, ‘$\sigma$’ the sigmoid logistic function and ‘$\tanh$’ the hyperbolic tangent. The update, forget, and output gates can be written as:
$$U_t = \sigma(W_u * [H_{t-1}, S_t] + b_u)$$
$$F_t = \sigma(W_f * [H_{t-1}, S_t] + b_f)$$
$$O_t = \sigma(W_o * [H_{t-1}, S_t] + b_o)$$
and the new cell state and hidden state are then given by:
$$\tilde{C}_t = \tanh(W_c * [H_{t-1}, S_t] + b_c)$$
$$C_t = F_t \circ C_{t-1} + U_t \circ \tilde{C}_t$$
$$H_t = O_t \circ \tanh(C_t)$$
where $W$ and $b$ are the model parameters.
We added the ConvLSTM architecture at the bottleneck of our model, so that the input to the ConvLSTM is an encoded representation of the frame at time $t$. The output cell state is fed to the decoder for further processing that results in a saliency map. To obtain the saliency map, a convolution is used at the final layer of the decoder, so as to filter out all channels but one. We sequentially pass video frames to the model as input and obtain a sequence of time-correlated saliency maps as output. The ConvLSTM component learns to leverage the temporal features during training. We name this type of model SalCLSTM.
As an alternative approach, the exponential moving average (EMA) recurrence [Polyak(1964)] is added on a specified layer so that at time $t$ the convolutional state of this layer is a decaying weighted average of the current and all previous states (Figure 2, right). At time $t$ the convolutional layer outputs a state $S_t$ that is fed to the exponential weighted average. The output
$$\mathrm{EMA}_t = \alpha S_t + (1 - \alpha)\, \mathrm{EMA}_{t-1}$$
is then propagated further in the model. Note that there is a hyperparameter $\alpha$ that affects the impact of previous states on the current time step (the lower the value, the higher the impact).
This recurrence is straightforward to implement, especially compared to the ConvLSTM. We experimented with the placement of the EMA function at several different layers with . We name our model SalEMA. On the initial step, where there is no past information, the model runs like a static saliency map predictor.
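To illustrate how little machinery this recurrence needs, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each convolutional state has been flattened into a list of floats, and that alpha weights the current state (so a lower alpha gives past states more influence, as described above). The function name `ema_recurrence` is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def ema_recurrence(states: Iterable[List[float]], alpha: float) -> Iterator[List[float]]:
    """Yield an exponentially smoothed state for each incoming state.

    Assumption: alpha multiplies the current state and (1 - alpha) the
    previous smoothed state.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    previous: Optional[List[float]] = None
    for current in states:
        if previous is None:
            # First step: no past information, so the state passes through
            # unchanged and the model behaves like a static predictor.
            smoothed = list(current)
        else:
            smoothed = [alpha * c + (1.0 - alpha) * p
                        for c, p in zip(current, previous)]
        previous = smoothed
        yield smoothed


# Toy example: three "frames" with two activations each.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
smoothed_frames = list(ema_recurrence(frames, alpha=0.5))
# smoothed_frames == [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
```

In the toy example the second activation moves towards 1.0 gradually rather than jumping there, which is the smoothing behaviour the recurrence is meant to provide.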
The parameters of SalCLSTM and SalEMA were estimated by backpropagating a pixel-wise content loss that compared the value of each pixel in the predicted saliency map with its corresponding pixel in the ground truth map. The total binary cross entropy loss was computed as the average of the individual binary cross entropies (BCE) over all pixels:
$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left( Q_j \log P_j + (1 - Q_j) \log (1 - P_j) \right)$$
where $P$ represents the predicted saliency map, $Q$ the ground truth saliency map, and $N$ the number of pixels.
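As a worked illustration of this loss, the sketch below averages the per-pixel binary cross entropy between a predicted map and a ground truth map. It is a plain-Python sketch, not the training code: both maps are assumed to be flattened into sequences of values in [0, 1], and the clamping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math
from typing import Sequence


def bce_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Average binary cross entropy over all pixels of two flattened maps."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("maps must be non-empty and of equal size")
    total = 0.0
    for p, q in zip(pred, target):
        # Clamp the prediction so that log() never receives exactly 0 (assumed safeguard).
        p = min(max(p, eps), 1.0 - eps)
        total += -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return total / len(pred)


# An uninformative prediction of 0.5 everywhere costs log(2) per pixel.
uniform_loss = bce_loss([0.5, 0.5], [1.0, 0.0])
```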
SalCLSTM and SalEMA were not trained from scratch though, as the parameters of the encoder-decoder convolutional layers were adopted from SalBCE. SalBCE was trained for 27 epochs over the SALICON[Jiang et al.(2015)Jiang, Huang, Duan, and Zhao] dataset of still images using only the same BCE loss. We also utilized data augmentation techniques (mirroring and rotation of frames) which resulted in improved performance.
Our next step was to add a recurrence that uses the intrinsic temporal information of video datasets and to train it on the DHF1K dataset [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]. The DHF1K dataset contains 700 annotated videos at 640×360 resolution. We extracted frames at their original 30 fps rate and resized them to 192×256 resolution. We loaded them using a batch size of 10 frames from a single video at a time. By backpropagating the loss through time up to a maximum of 10 frames, we avoid exceeding memory capacity and potential vanishing or exploding gradients. We found it was necessary to initialize the ConvLSTM recurrence with the Xavier initialization method [Glorot and Bengio(2010)]; otherwise this model would converge to black images rather than saliency maps. This was likely due to oversaturation of the sigmoid activation layer. We trained all our models for 7 epochs, where we observed the loss reaching a plateau on our baseline. We used the Adam optimizer [Kingma and Ba(2014)] with a learning rate of .
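The 10-frame limit on backpropagation through time can be pictured as splitting each video into consecutive chunks of frame indices. The sketch below shows only that bookkeeping. The helper name `bptt_chunks` is hypothetical, and how the recurrent state is handled between chunks is not specified in the text, so it is left out.

```python
from typing import List


def bptt_chunks(num_frames: int, chunk_size: int = 10) -> List[range]:
    """Split the frame indices of one video into consecutive chunks.

    Gradients would be propagated within a chunk only, which bounds
    memory use by chunk_size frames.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, num_frames))
            for start in range(0, num_frames, chunk_size)]


# A 25-frame clip yields two full chunks and one shorter final chunk.
chunks = bptt_chunks(25)
chunk_lengths = [len(c) for c in chunks]
```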
|tuned on DHF1K||AUC-J||s-AUC||NSS||CC||SIM|
The effect of temporal recurrences proposed for SalEMA and SalCLSTM was assessed with five different visual saliency metrics: Normalized Scanpath Saliency (NSS), Similarity Metric (SIM), Linear Correlation Coefficient (CC), AUC-Judd (AUC-J), and shuffled AUC (s-AUC). In all cases, a higher value corresponds to a better performance. The reader is referred to [Bylinskii et al.(2019)Bylinskii, Judd, Oliva, Torralba, and Durand] for a detailed description of these metrics. The reported figures correspond to an average per video, that is, we first compute the metric on each frame, then average across all frames of each video, and we finally average across all videos.
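The per-video averaging described above differs from pooling all frames together, because short videos then count as much as long ones. The sketch below shows that aggregation, together with a plain Pearson correlation as one possible reading of the CC metric on flattened maps. The function names and the dictionary layout are assumptions made for illustration, not the evaluation code used for the reported figures.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence


def linear_cc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson correlation between two flattened saliency maps."""
    mean_p, mean_t = fmean(pred), fmean(target)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    # A constant map has no variance; return 0.0 in that case (assumed convention).
    return cov / denom if denom > 0 else 0.0


def dataset_score(per_frame_scores: Dict[str, List[float]]) -> float:
    """Average frames within each video first, then average across videos."""
    video_means = [fmean(scores) for scores in per_frame_scores.values() if scores]
    return fmean(video_means)


# Video "a" has two frames and video "b" has one; each video weighs the same.
score = dataset_score({"a": [1.0, 3.0], "b": [4.0]})
```

In this example the per-video means are 2.0 and 4.0, so the reported score is 3.0, whereas pooling the three frames would give about 2.67.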
We train and evaluate our models on three video saliency datasets, namely DHF1K [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji], Hollywood-2 and UCF-sports [Mathe and Sminchisescu(2015)]. DHF1K is a large scale dataset with a high diversity of contents and variable length (from 400 frames to 1200 frames at 30fps). It includes 1000 videos, out of which 700 are publicly annotated, and 300 are withheld for testing purposes. In contrast to DHF1K, Hollywood-2 [Marszałek et al.(2009)Marszałek, Laptev, and Schmid] and UCF-sports [Soomro and Zamir(2014)] are limited to human actions and can be categorized as task-driven, given that the observers were explicitly asked to identify actions and scene context. These datasets were originally formed for the task of action recognition and were later adopted as a video saliency benchmark. Furthermore, both datasets have been divided into separate shots, so that no scene change occurs in the sequences that are fed into the models. Hollywood-2 is split into a training set of 3100 clips and a test set of 3559 clips, while UCF-sports has been split to a training set of 104 clips and a test set of 48 clips. These shots are much smaller in size than a DHF1K video sample, ranging from 40 frames to just a single frame per shot. We also use SALICON [Jiang et al.(2015)Jiang, Huang, Duan, and Zhao], a large-scale image saliency database, to set a baseline. DHF1K is used for experimenting with variations over the proposed models, as well as for comparison with the state of the art together with Hollywood-2 and UCF-sports.
The results in Table 1 indicate that the simple addition of EMA, even without extra training, does almost as well as a sophisticated ConvLSTM recurrence, and even surpasses it after being fine-tuned on the DHF1K training partition. EMA essentially performs a smoothing over the frames of the video by averaging. A possible explanation for why this boosts performance in video saliency is that saliency tends to be relatively consistent across frames, with the exception of rapid movements.
Encouraged by the positive results of our EMA modification at the bottleneck (layer 30), we explored more possible locations for the EMA function. In particular, we tested its placement at the output (layer 61), in the decoder (layer 56), and in the encoder (layer 7). We also implemented a variation that integrates EMA at two separate layers simultaneously, one in the encoder (7) and one in the decoder (56). In that case we set the EMA hyperparameter $\alpha$ to 0.3 at each location so as to avoid an over-smoothing effect that would result in a significant lag when adapting to changes in the scene. Furthermore, in a video there can be spontaneous scene changes. In such instances, it would be optimal to have the EMA reset and forget all the previous states. However, EMA is not adaptive in this way, so we experimented with a skip connection that allows information to bypass this layer instead [He et al.(2016)He, Zhang, Ren, and Sun]. We also applied a second type of regularization, the dropout technique [Hinton et al.(2012)Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov], at the convolutional layer right before the EMA layer. Dropout essentially turns off neurons with a preassigned probability (0.5) at each training step. This mitigates co-adaptation of neurons during training, allowing for clusters of neurons to learn independently. This way, at test time, we get the average from an ensemble of layers at location 30. The average of this ensemble pertains to spatial information, but since we are also using EMA, we get the moving average across the temporal dimension as well. The results reported in Table 2 do not show a clear winning configuration across the five metrics but, as NSS and CC are considered the most appropriate ones to capture viewing behavior [Bylinskii et al.(2019)Bylinskii, Judd, Oliva, Torralba, and Durand], we adopted SalEMA30 with dropout as our best configuration.
|Model||tuned on DHF1K||AUC-J||s-AUC||NSS||CC||SIM|
Furthermore, we evaluated our two models on Hollywood-2 and UCF-sports [Mathe and Sminchisescu(2015)]. We compare our models to the current state of the art as evaluated on the test split of the corresponding datasets by Wang et al. [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]. Like ACLNet [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji], our models were first trained on DHF1K in all cases, and later fine-tuned on the specific Hollywood-2 or UCF-sports dataset. Table 3 shows that, for DHF1K, SalEMA achieves the best performance compared to other models in the current benchmark across all metrics but s-AUC. On the other hand, SalCLSTM obtains the best results on all metrics for UCF-sports and leads the performance on AUC-J, NSS and CC for Hollywood-2.
|Model||AUC-J||s-AUC||NSS||CC||SIM|
DHF1K
|ACLNet [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]||0.890||0.601||2.354||0.434||0.315|
|SalGAN [Pan et al.(2017)Pan, Ferrer, McGuinness, O’Connor, Torres, Sayrol, and Giro-i Nieto]||0.866||0.709||2.043||0.370||0.262|
|DVA [Wang and Shen(2018)]||0.860||0.595||2.013||0.358||0.262|
Hollywood-2
|ACLNet [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]||0.913||0.757||3.086||0.623||0.542|
|OM-CNN [Jiang et al.(2017)Jiang, Xu, and Wang]||0.887||0.693||2.313||0.446||0.356|
|DVA [Wang and Shen(2018)]||0.860||0.727||2.459||0.482||0.372|
UCF-sports
|ACLNet [Wang et al.(2018)Wang, Shen, Guo, Cheng, and Borji]||0.897||0.744||2.567||0.51||0.406|
|DVA [Wang and Shen(2018)]||0.872||0.725||2.311||0.439||0.339|
|OM-CNN [Jiang et al.(2017)Jiang, Xu, and Wang]||0.870||0.691||2.089||0.405||0.321|
A more detailed comparison between SalEMA and SalCLSTM was obtained by plotting the difference in their NSS and CC performance per video in the DHF1K validation set (100 videos). Concretely, we subtracted the metric value achieved by SalCLSTM from that of SalEMA for each video and display the results in Figure 3. This way, we can assess whether the two configurations end up producing similar results. If they did, we would expect the variance to be low and the NSS difference to be close to zero most of the time. However, the differences are scattered and vary from video to video. This observation serves as evidence that the function approximated by the ConvLSTM differs from that of an exponential moving average, despite its similar overall effectiveness.
We also delved deeper into the Hollywood-2 dataset for potential clues that would explain the difference in performance. This dataset consists of very small shots, including even single-frame shots. In these cases we found that the ConvLSTM does much better than the EMA (by a margin of around NSS points). We also noticed, however, that in these cases the ground truths for the saliency maps correspond to a central Gaussian, despite the fact that other salient objects are present in other locations of the frame. Figure 5 shows two examples in which the provided ground truth focuses in the center, although different faces appear in the image. In these examples, SalEMA captures these salient objects better, while SalCLSTM seems to focus on the center.
Predictions from two Hollywood-2 outliers where SalEMA performed particularly badly. The order corresponds to: SalEMA (left image), SalCLSTM (middle image), ground truth (right image). The ground truth appears aberrant, as it completely ignores human faces, which are well known to be salient objects.
Finally, we experimented with the hyperparameter alpha by varying its value and also by making it trainable. Table 4 shows relatively stable performance despite the variations in its value. We also had our model learn alpha on its own by introducing a trainable parameter. To ensure that the resulting update equation represents a convex combination of the current features and the previous state, this parameter is passed through a sigmoid so that the final value is constrained to the interval (0, 1). The resulting recurrence has the same form as the EMA update, with the sigmoid of the trainable parameter taking the place of alpha.
Whereas all other parameters of the model are set to a learning rate of , the learning rate of alpha was set to 0.1 and was trained separately for 3 epochs on SalEMA pretrained with . We set to 0.5 at the start of this tuning and by the end, it converges to 0.1477. The final performance was found to be approximately the same as the best model in Table 4.
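The sigmoid parameterization can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `raw_alpha` is a hypothetical name for the unconstrained trainable parameter, the states are flattened lists of floats, and the starting value of 0.5 is read here as the effective coefficient after the sigmoid, which the text does not state explicitly.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Inverse of the sigmoid, used to pick a raw value for a desired coefficient."""
    return math.log(p / (1.0 - p))


def ema_step_learned(current: List[float], previous: List[float], raw_alpha: float) -> List[float]:
    """One EMA update whose mixing coefficient is sigmoid(raw_alpha).

    Because the coefficient always lies in (0, 1), the result stays a convex
    combination of the current and previous states for any raw_alpha.
    """
    a = sigmoid(raw_alpha)
    return [a * c + (1.0 - a) * p for c, p in zip(current, previous)]


# Assumed reading: an effective coefficient of 0.5 corresponds to raw_alpha = 0.
raw_init = logit(0.5)
mixed = ema_step_learned([1.0], [0.0], raw_init)
```

Optimizing the unconstrained value and applying the sigmoid inside the update means gradient steps cannot push the coefficient outside the valid range.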
This work has presented SalEMA and SalCLSTM, two variations of a convolutional neural network for video saliency prediction. Their main difference is how temporal recurrence is modelled, whether with a simple yet effective exponential moving average with a single parameter, or a convolutional LSTM that despite being adopted for many video sequence processing tasks, seems needlessly complex for this specific task of video saliency prediction. This indicates that, in some cases, components of more sophisticated models may just learn to approximate much simpler functions. It is likely that similar methods can be conceived of in other types of tasks as well.
On another note, ablation studies are a common practice for evaluating the contribution that an added component has on a model’s performance. We argue that there should be a more detailed effort in analyzing the behavior of deep architectures. Using predefined functions like the one presented in this work may shed more light on the necessity of a complex architecture.
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283 and SFI/12/RC/2289. This work has been developed in the framework of project TEC2016-75976-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF).
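Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):740–757, 2019.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.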