Video saliency detection aims to model the gaze fixation patterns of humans viewing a dynamic scene. Because the predicted saliency map can be used to prioritize video information across space and time, this task has a number of applications, such as video surveillance [12, 41], video captioning, and video compression [11, 13].
Previous state-of-the-art approaches for video saliency detection [3, 19, 38] largely depend on LSTMs to aggregate information temporally. For example, OM-CNN feeds spatial features from YOLO and temporal features from FlowNet into a two-layer LSTM. The leading state-of-the-art model, ACLNet, also uses an LSTM to aggregate spatial features guided by frame-wise image saliency maps. The strong performance of LSTM-based approaches over non-LSTM-based ones suggests that aggregating information temporally boosts performance on video saliency detection.
However, all of these existing LSTM-based video saliency models fail to jointly process spatial and temporal information when predicting a saliency map from the extracted features. Specifically, either spatial decoding and temporal aggregation are performed separately, or only one of these two processes is considered for the final prediction. The existing works are hence unable to leverage the collective spatiotemporal information, which is expected to be important to video saliency [9, 25].
To this end, we propose a novel 3D fully-convolutional encoder-decoder network architecture for video saliency detection, which we call the Temporally-Aggregating Spatial Encoder-Decoder Network (TASED-Net). As illustrated in Figure 1, TASED-Net progressively reduces the temporal dimensionality within both the encoder and the decoder subnetworks, which enables it to spatially upsample the encoded features while temporally aggregating all the information. Similar to other architectures designed for pixel-level tasks [1, 28, 33], TASED-Net compresses the spatial dimensions to extract high-level features at a low resolution, then upscales them to produce a full-resolution prediction map. On top of that, the decoder subnetwork performs temporal aggregation; we refer to it as the prediction network in our architecture since it jointly processes spatial and temporal information in a fully-convolutional way. TASED-Net predicts a single saliency map conditioned on a fixed number of previous frames, so we apply it in a sliding-window fashion to predict a saliency map for every frame in the video.
Just as numerous 2D encoder-decoder architectures adopt VGG-16 pre-trained on ImageNet as their encoder network, we choose S3D pre-trained on the Kinetics dataset as the encoder network for TASED-Net. It has been shown by Xie et al. that S3D is efficient and effective in extracting spatiotemporal features, and by Hara et al. that the Kinetics dataset is sufficiently large for effective transfer-learning. Therefore, we expect that the encoder network of TASED-Net can fully benefit from the successful 3D convolutional network architecture and extremely large-scale video dataset.
For the prediction network, we first place a series of transposed convolution layers and max-unpooling layers for spatial upscaling, and then we use convolution layers for temporal aggregation. The tricky part is that the max-unpooling layers cannot reuse the pooling indices, or switches, from the corresponding max-pooling layers, since the max-pooling layers have a larger temporal receptive field than the max-unpooling layers. We introduce a new type of pooling operation, which we call Auxiliary pooling, that overcomes this non-trivial problem by adding extra max-poolings that can produce the properly-sized switches. Auxiliary poolings first reduce the temporal dimension of the input feature maps, and then obtain the appropriate switches for the matching max-unpooling layers. We compare Auxiliary pooling with two common upsampling operations, interpolation and transposed convolution (deconvolution), to demonstrate its effectiveness and necessity.
We comprehensively evaluate our architecture on three large-scale video saliency datasets: DHF1K , Hollywood2 [23, 24], and UCFSports [24, 32, 35]. Our results demonstrate that TASED-Net significantly outperforms previous state-of-the-art baselines on all three datasets. We believe that our novel architecture is effective in predicting video saliency because it jointly performs spatial decoding and temporal aggregation in a fully-convolutional way, instead of using separate recurrent units such as LSTM.
In summary, our main contributions are threefold:
We develop a powerful end-to-end 3D fully-convolutional network for video saliency detection, comprised of an encoder network followed by a prediction network, which we name TASED-Net.
We propose the novel concept of Auxiliary pooling which obtains switches with reduced temporal dimension so that max-unpooling layers of the prediction network can properly work.
We comprehensively evaluate our proposed network on three large-scale datasets for video saliency and show the effectiveness of our joint modelling of spatial decoding and temporal aggregation.
2 Related Work
Recent Video Saliency Detection Models. Previous state-of-the-art video saliency models rely on optical flow or LSTMs to utilize temporal information. STSConvNet adopts a two-stream architecture where temporal information from optical flow is processed independently by a temporal stream. RMDN uses spatiotemporal features extracted from C3D and then aggregates temporal information in the long term with a subsequent LSTM. OM-CNN first extracts spatial and temporal features from YOLO and FlowNet subnets, which represent objectness and motion respectively, and feeds them into a two-layer LSTM. ACLNet implements an attention module pre-trained on SALICON, a large-scale dataset for image saliency, and uses the frame-wise attention mask to encourage an LSTM to better capture dynamic saliency in the long term. Comparative results of these previous models are reported in Wang et al. Image saliency detection models can also be used to predict video saliency if applied in a frame-wise manner to each frame of a video. However, unsurprisingly, even state-of-the-art image saliency detection models such as SalGAN, DVA, Deep Net, and SALICON are significantly outperformed by ACLNet because they do not consider any temporal information.
Relevant 2D ConvNets. Deep 2D ConvNets have achieved great success in diverse areas of image analysis beyond image classification over the last few years, including object detection, instance segmentation, and image saliency detection. Among such successes, VGG-16 pre-trained on ImageNet has played a key role as an effective feature extractor for transfer learning. Another success in 2D ConvNets has been encoder-decoder networks [1, 28, 33]. For example, SegNet improves a single-stream encoder-decoder architecture by upsampling the feature maps through max-unpooling with switches from the encoder network. Switches are latent variables that record the locations of maximum activation. These variables are used by unpooling layers to partially invert the max-pooling operation. This method shows that max-unpooling is more suitable for decoding than other upsampling operations such as linear upsampling or even learnable upsampling through transposed convolution, which inspires our Auxiliary pooling.
Recent 3D ConvNets. 3D ConvNets have achieved state-of-the-art results in the action recognition task. Above all, 3D ConvNets inflated from 2D ConvNets are leading the field by leveraging successful 2D network architectures as well as their parameters. Carreira and Zisserman  propose I3D, which inflates the 2D convolutional filters of Inception  to produce a 3D ConvNet with strong performance. Xie et al.  further explore inflated 3D ConvNets by proposing a more computationally-efficient architecture called S3D. Hara et al.  experimentally show that various other inflated 3D ConvNets are also effective and predict that 3D ConvNets pre-trained on the Kinetics dataset  can retrace the success story of 2D ConvNets, i.e. that they can be used to initialize models for many other fields of video analysis, just as VGG-16  has been applied to diverse image-based problems. We adopt S3D as the encoder network for our approach with the hope that it takes advantage of the successful architecture and the large-scale video dataset for effective transfer learning.
3.1 Architecture Overview
The overall flow of our proposed architecture is illustrated in Figure 1. We choose this design based on three assumptions:
saliency detection of any frame can be done well by only considering a fixed number of consecutive past frames (we will call this number $T$ throughout this paper);
given an input of $T$ frames, predicting a single saliency map for one specific time step is better than predicting maps for two or more steps at once; and
there is a sufficient number of frames in a video (specifically, the total number of frames of a video is not less than $2T-1$).
The encoder network first encodes an input clip of $T$ frames spatiotemporally; this provides a deep low-resolution feature representation. Then, the following prediction network decodes the features spatially while jointly aggregating temporal information to produce a full-resolution prediction map for a single time step. We note that unlike the previous state-of-the-art models that use LSTMs, our method is conditioned on a fixed number of previous frames when predicting a saliency map. The prediction network is devised to coincide with the second assumption by predicting a single saliency map corresponding to the last frame of an input clip. Frame-wise saliency maps are predicted by applying the architecture in a sliding-window fashion. In other words, $S_t$, a saliency map at $t$, is predicted given an input clip $\{I_{t-T+1}, \ldots, I_t\}$ for any $t \in \{T, \ldots, N\}$, where $I_t$ is the frame at time step $t$ and $N$ is the total number of frames in the video.
The problem with this configuration is that the first $T-1$ saliency maps are not predicted. Our workaround is to reverse the chronological order of the first $T-1$ input clips. That is, $S_t$ for $t \in \{1, \ldots, T-1\}$ is predicted by conditioning on $\{I_{t+T-1}, \ldots, I_t\}$. As a result, our architecture can predict a frame-wise saliency map for every frame as long as our third assumption, $N \geq 2T-1$, is satisfied.
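The sliding-window scheme, including the reversed clips used for the earliest frames, can be sketched as pure index bookkeeping. This is a minimal illustration rather than the authors' implementation: the names `clip_indices`, `clip_len`, and `num_frames` are hypothetical, frames are numbered from 1 as in the text, and the length check simply rejects videos in which a frame needed by a reversed clip does not exist.

```python
def clip_indices(t, clip_len, num_frames):
    """Return the 1-indexed frame numbers fed to the network to predict the
    saliency map of frame t, ordered from first input to last input."""
    if not 1 <= t <= num_frames:
        raise ValueError("t must be a valid frame number")
    if t >= clip_len:
        # Regular case: the clip covers the past frames and ends at frame t.
        return list(range(t - clip_len + 1, t + 1))
    # Early frames: there are not enough past frames, so take the clip that
    # starts at t and feed it in reverse order, making frame t the last input.
    last = t + clip_len - 1
    if last > num_frames:
        raise ValueError("video too short for the reversed clip")
    return list(range(last, t - 1, -1))


if __name__ == "__main__":
    # Hypothetical toy setting: clips of 4 frames in a 10-frame video.
    for t in range(1, 11):
        print(t, clip_indices(t, clip_len=4, num_frames=10))
```

In both branches the frame whose saliency map is predicted is the last element of the clip, which matches the design choice of predicting a single map for the final input frame.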
TASED-Net has a common property with well-known image encoder-decoder networks that reduce and then upsample the spatial resolution [1, 28, 33]. The core difference of our model comes from temporal aggregation inside the prediction network, which requires extra operations that we call Auxiliary pooling. The architecture of TASED-Net, along with Auxiliary pooling, is explained in detail in the following sections.
3.2 Architecture specification
A detailed illustration of TASED-Net is depicted in Figure 2. An input clip is spatiotemporally encoded by 3D convolutional operation blocks of the encoder network taken from the S3D  network pre-trained on the Kinetics dataset . The encoder network takes advantage of the successful 3D ConvNet architecture and the large-scale video dataset to extract rich encoded feature maps. We add a convolution after the convolutional blocks from S3D to re-distribute encoded information across the channel dimension.
Next, we describe the prediction network. We spatially upsample the encoded spatiotemporal features, leaving the time dimension alone, with a series of transposed convolution layers and max-unpooling layers. At this point, we have only upsampled to a quarter of the original spatial resolution (quarter-resolution). Afterwards, we apply spatial transposed convolutions interspersed with temporal convolutions, which finally results in a full-resolution saliency map. The stride of these transposed convolution layers is chosen so that they double the spatial dimensions of the feature maps. The kernel sizes of the two temporal convolutions are parameterized by two values, the first of which is set to 2, so that together they aggregate all temporal information. Batch normalization and ReLU come after all the convolutional operations except the last layer. After the last convolution layer, a sigmoid function is applied to produce an intensity map of saliency. A more thorough description of the architecture can be found in the Supplementary material.
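As a worked check of how a transposed convolution can double a spatial dimension, the sketch below evaluates the standard output-size formula for a transposed convolution with dilation 1. The kernel size 4, padding 1, and the 56 x 96 starting resolution are illustrative assumptions, not values stated in the text; only the doubling behaviour is taken from it.

```python
def transposed_conv_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


if __name__ == "__main__":
    # Hypothetical quarter-resolution feature map of 56 x 96.
    h, w = 56, 96
    for _ in range(2):  # two doubling steps bring it back to full resolution
        h = transposed_conv_out(h, kernel=4, stride=2, padding=1)
        w = transposed_conv_out(w, kernel=4, stride=2, padding=1)
        print(h, w)
```

With stride 2, any kernel and padding satisfying kernel = 2 * padding + 2 give exactly twice the input size, so the choice of 4 and 1 here is only one of several that would work.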
3.3 Auxiliary pooling
In our architecture, we wish to leverage the effective reconstruction ability of max-unpooling layers, which have been used in state-of-the-art pixel-level segmentation models [1, 28]. However, implementing this in our architecture is non-trivial because the decoder (prediction network) never upsamples along the temporal dimension, which makes the temporal dimensions of the switches from the encoder incompatible with those needed by the decoder. Specifically, the switches of the max-unpooling layers and their corresponding max-pooling layers have different temporal sizes. In order to obtain switches with the proper sizes for the max-unpooling layers, extra processing steps are required. For each max-unpooling layer, we add two sequential extra pooling layers, which we call Auxiliary poolings. The first Auxiliary pooling receives the input feature map from the encoder and reduces its temporal length. Then, the following Auxiliary pooling, whose kernel works only spatially, stores the proper switches for the matched unpooling layer, which also works only in the spatial dimensions. These blocks of two sequential Auxiliary poolings make it possible for the decoder to reconstruct spatial information effectively by using the stored switches. Note that Auxiliary poolings are only used for storing switches and are not included in the main data stream. How Auxiliary poolings work is illustrated in detail in Figure 3. A general pooling operation takes an input feature map and produces a pooled map together with switches, which record the locations of maximum activation within the input. The first Auxiliary pooling is applied to obtain the intermediate, temporally reduced pooled map; its switches are not used. The second Auxiliary pooling is applied to this intermediate map to store switches in the reduced temporal domain; its pooled output is not used. The matched unpooling operation then unpools the input feature map from the decoder only spatially, using the stored switches.
More detailed input and output sizes can be found in the Supplementary material. The necessity of Auxiliary pooling in TASED-Net and its variants is further discussed in Section 4.4.
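The two-step procedure described above can be sketched on a single-channel feature map stored as nested lists indexed `[time][row][column]`. This is a toy illustration under stated assumptions, not the authors' code: the function names are hypothetical, the temporal reduction is taken to be a max-pooling with equal kernel and stride, the spatial window is 2 x 2, and the tensor sizes are invented for the example.

```python
def temporal_max_pool(x, k):
    """First auxiliary pooling: max over non-overlapping windows of k time steps."""
    t, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[max(x[i * k + d][r][c] for d in range(k)) for c in range(w)]
             for r in range(h)] for i in range(t // k)]


def spatial_max_pool_with_switches(x, k=2):
    """Second auxiliary pooling: spatial-only max-pooling that also returns,
    for every output cell, the (row, col) of the maximum in the input frame."""
    pooled, switches = [], []
    for frame in x:
        h, w = len(frame), len(frame[0])
        p_frame, s_frame = [], []
        for r in range(0, h - k + 1, k):
            p_row, s_row = [], []
            for c in range(0, w - k + 1, k):
                value, loc = max(((frame[r + dr][c + dc], (r + dr, c + dc))
                                  for dr in range(k) for dc in range(k)),
                                 key=lambda item: item[0])
                p_row.append(value)
                s_row.append(loc)
            p_frame.append(p_row)
            s_frame.append(s_row)
        pooled.append(p_frame)
        switches.append(s_frame)
    return pooled, switches


def spatial_max_unpool(y, switches, out_h, out_w):
    """Place each decoder value at its stored switch location; zeros elsewhere."""
    out = []
    for y_frame, s_frame in zip(y, switches):
        frame = [[0.0] * out_w for _ in range(out_h)]
        for y_row, s_row in zip(y_frame, s_frame):
            for value, (r, c) in zip(y_row, s_row):
                frame[r][c] = value
        out.append(frame)
    return out


# Hypothetical sizes: encoder feature with 4 time steps at 4 x 4,
# decoder feature with 2 time steps at 2 x 2.
enc = [[[(t * 7 + r * 3 + c * 5) % 11 for c in range(4)] for r in range(4)]
       for t in range(4)]
dec = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]

aux1 = temporal_max_pool(enc, k=2)                   # pooled map kept, no switches
_, switches = spatial_max_pool_with_switches(aux1)   # switches kept, pooled map dropped
up = spatial_max_unpool(dec, switches, out_h=4, out_w=4)
```

The auxiliary branch only produces `switches`; the decoder feature `dec` never passes through it, which mirrors the remark that Auxiliary poolings are not part of the main data stream.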
3.4 Temporal aggregation strategy
Temporal aggregation takes a spatiotemporally encoded feature map, whose spatial resolution is a quarter of the full video resolution, and performs the following two operations: reducing the time dimension of the input features to 1, and upscaling the spatial dimensions to full-resolution. There exist a variety of strategies that perform the required spatial upsampling and temporal reduction operations in different orders; we depict a few in Figure 4. The first strategy, late aggregation, performs two spatial upsampling operations followed by one temporal convolutional operation that performs temporal dimension reduction. The second strategy, early two-step aggregation, performs one temporal convolution before each spatial upsampling operation. The final strategy, late two-step aggregation, performs one temporal convolution after each spatial upsampling operation. We found that late two-step aggregation performs best (see Section 4.2), so we implemented it in TASED-Net.
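The three orderings can be compared by tracking feature-map shapes only. The sketch below is a hypothetical bookkeeping exercise: the starting shape (4 time steps at 56 x 96), the temporal reduction factors, and the assumption that each temporal convolution divides the temporal length by its kernel size are all illustrative choices, not values taken from the text.

```python
def trace_shapes(ops, shape):
    """Apply a list of ("up", None) / ("agg", k) steps to a (t, h, w) shape
    and return every intermediate shape, starting with the input shape."""
    trace = [shape]
    for op, arg in ops:
        t, h, w = trace[-1]
        if op == "up":
            # Spatial upsampling: doubles height and width, keeps time.
            trace.append((t, 2 * h, 2 * w))
        elif op == "agg":
            # Temporal aggregation: assumed to divide the temporal length by arg.
            if t % arg != 0:
                raise ValueError("temporal length not divisible by kernel size")
            trace.append((t // arg, h, w))
        else:
            raise ValueError("unknown operation: " + op)
    return trace


STRATEGIES = {
    "late": [("up", None), ("up", None), ("agg", 4)],
    "early two-step": [("agg", 2), ("up", None), ("agg", 2), ("up", None)],
    "late two-step": [("up", None), ("agg", 2), ("up", None), ("agg", 2)],
}

if __name__ == "__main__":
    for name, ops in STRATEGIES.items():
        print(name, trace_shapes(ops, (4, 56, 96)))
```

All three orderings end at a single time step and full resolution; they differ in how many time steps are still present when each spatial upsampling is applied.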
4.1 Experiments setup
Datasets. We evaluate our method on three standard datasets: DHF1K, Hollywood2 [23, 24], and UCFSports [24, 32, 35]. These datasets and some others are compared in terms of variety, scalability, and generality by Wang et al., and we choose the DHF1K dataset as our main benchmark (i.e. we focus our analysis on this dataset) because it includes the most general and diverse scenes with various types of objects, motion, and backgrounds out of the aforementioned datasets. It consists of 1K videos with around 600K frames; 300 videos are reserved as a test set with no public ground-truth annotations of human eye fixation points. There is a public server for reporting results on the test set for fair evaluation. The Hollywood2 dataset contains 1,707 videos focusing on human actions in movie scenes, and the UCFSports dataset contains 150 videos of human actions in sports. We believe that our selection of three datasets is sufficient to show the effectiveness and generality of our approach.
Training/testing process. For training TASED-Net, clips of $T$ consecutive frames are randomly but densely sampled from a video. Note that this sampling scheme is valid because our model predicts each saliency map independently. Each frame is resized to a fixed spatial resolution. We train our network end-to-end with a batch size of 40 on 600 videos from the DHF1K training set using SGD with a momentum of 0.9. The learning rate is fixed at 0.001 for the encoder network. For the prediction network, the learning rate starts at 0.1 and decays twice by a factor of 10 when the validation loss does not decrease for a certain number of steps that depends on $T$. For TASED-Net with $T = 32$, the first decay is at step 750 and the second is at step 950. The whole training process of 1K iterations takes less than 3 hours. Evaluation on the whole validation set is slow because of the large number of frames (60K in the validation set of the DHF1K dataset), so we uniformly sample 2K clips to approximate the validation loss. We choose Kullback-Leibler (KL) divergence as the loss function, which Jiang et al. have shown to be effective for training saliency models. When testing, we apply TASED-Net in a sliding-window fashion to predict a frame-wise saliency map for every frame of all videos within the dataset. It takes around 0.06s to process each frame.
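For readers unfamiliar with KL divergence as a saliency loss, the sketch below shows one common formulation on flattened maps. It is an assumption-laden illustration: the text only states that KL divergence is used, so the normalization of both maps to sum to one, the `eps` stabilizer, and the function name `kl_divergence` are choices made here for the example.

```python
import math


def kl_divergence(pred, target, eps=1e-7):
    """KL divergence of a predicted saliency map from a ground-truth map.

    Both inputs are flat sequences of non-negative values; each is first
    normalized so that it sums to one and can be read as a distribution.
    """
    p_sum, g_sum = sum(pred), sum(target)
    if p_sum <= 0 or g_sum <= 0:
        raise ValueError("maps must contain some positive mass")
    p = [v / p_sum for v in pred]
    g = [v / g_sum for v in target]
    # Cells where the ground truth is zero contribute nothing to the sum.
    return sum(gi * math.log(eps + gi / (pi + eps)) for gi, pi in zip(g, p))


if __name__ == "__main__":
    target = [0.0, 0.1, 0.7, 0.2]
    print(kl_divergence([0.0, 0.1, 0.7, 0.2], target))  # close to zero
    print(kl_divergence([0.7, 0.1, 0.0, 0.2], target))  # clearly positive
```

The loss is asymmetric: it penalizes a prediction most heavily where the ground truth has mass and the prediction has almost none.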
Evaluation metrics. We report five metrics: Normalized Scanpath Saliency (NSS), Linear Correlation Coefficient (CC), Similarity (SIM), Area Under the Curve by Judd (AUC-J), and Shuffled-AUC (s-AUC). NSS and CC estimate a linear correlation between the prediction and the ground-truth fixation map. SIM computes the similarity between two histograms, and AUC-J and s-AUC are variants of the well-known AUC metric. Higher scores on each metric indicate better performance.
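Three of these metrics have short closed forms, sketched below on flattened maps using their usual definitions in the saliency literature. This is not the benchmark's evaluation code: the function names are hypothetical, the inputs are assumed to be non-constant so that standard deviations are non-zero, and the two AUC variants are omitted because they need a thresholding procedure that does not fit a few lines.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def nss(pred, fixations):
    """Mean of the standardized prediction at fixated locations.
    `fixations` is a binary sequence marking fixated pixels."""
    m, s = _mean(pred), _std(pred)
    hits = [(p - m) / s for p, f in zip(pred, fixations) if f]
    return _mean(hits)


def cc(pred, gt):
    """Pearson correlation between the prediction and a continuous
    ground-truth saliency map."""
    mp, mg = _mean(pred), _mean(gt)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt)) / len(pred)
    return cov / (_std(pred) * _std(gt))


def sim(pred, gt):
    """Histogram intersection after normalizing both maps to sum to one."""
    sp, sg = sum(pred), sum(gt)
    return sum(min(p / sp, g / sg) for p, g in zip(pred, gt))


if __name__ == "__main__":
    pred = [0.1, 0.2, 0.9, 0.3]
    print(nss(pred, [0, 0, 1, 0]))
    print(cc(pred, [0.0, 0.3, 1.0, 0.2]))
    print(sim(pred, [0.0, 0.3, 1.0, 0.2]))
```

NSS compares against discrete fixation points, whereas CC and SIM compare against a continuous ground-truth map, which is why a model can rank differently under each metric.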
4.2 Evaluation on DHF1K
Since the ground-truth annotations for the test set of DHF1K are hidden for fair comparison, we first evaluate variants of our model on the validation set. The performance of TASED-Net with different clip lengths $T$ and temporal aggregation strategies is compared in Table 1. The results indicate that TASED-Net with $T = 32$ and late two-step aggregation performs the best, since this configuration achieves the best performance across most metrics (it has 21.2M Params and 63.2G FLOPs; more results for different values of $T$ are provided in Section 4.5). We believe that late two-step aggregation performs better than early two-step aggregation because the feature maps used in spatial upscaling have a larger size in the temporal dimension. That is, late two-step aggregation performs better thanks to temporally richer feature maps. Interestingly, late aggregation performs poorly despite having the richest features, probably due to overfitting. In addition, we observe that the scores drop by 0.5 NSS (0.06 CC, 0.04 SIM, 0.015 AUC) without Kinetics pre-training in most cases. This shows the effectiveness of Kinetics pre-training. For the rest of the paper, we report the performance of TASED-Net with $T = 32$, late two-step aggregation, and pre-training.
|Aggregation strategy (number of input frames)||NSS||CC||SIM||AUC-J||s-AUC|
|Early two-step (16)||2.591||0.464||0.343||0.894||0.708|
|Early two-step (32)||2.673||0.475||0.361||0.891||0.706|
|Late two-step (16)||2.622||0.469||0.349||0.892||0.713|
|Late two-step (32)||2.706||0.481||0.362||0.894||0.718|
Next, we submitted our results to the DHF1K online benchmark. The performance of TASED-Net and previous state-of-the-art methods on the test set of DHF1K is reported in Table 2. Our model outperforms other methods by a wide margin across all evaluation metrics. We note that ACLNet, the leading state-of-the-art method, is arguably better-primed for saliency detection than TASED-Net: it has a component pre-trained on an image-saliency dataset, SALICON, whereas we pre-train the encoder network of TASED-Net on an action recognition dataset. The higher performance of TASED-Net suggests that pre-training on a large-scale video dataset plays a significant role in performing well on other tasks in general. We also want to point out that TASED-Net has a much smaller network size (82MB vs. 252MB). Interestingly, our AUC-J score does not increase much compared to the other metrics. This phenomenon has already been reported by Bylinskii et al., who suggest that AUC-J is less capable of discriminating between different high-performing saliency models because it is invariant to monotonic transformations.
|Method||NSS||CC||SIM||AUC-J||s-AUC|
|Deep Net||1.775||0.331||0.201||0.855||0.592|
To perform a qualitative analysis, we compare the performance of TASED-Net to the leading state-of-the-art method, ACLNet , on videos from the validation set of the DHF1K dataset. We observe that we can easily recognize the differences between the results of each model when the difference of NSS scores between the two is greater than 0.5. Based on this gap, TASED-Net outperforms ACLNet on 37 out of the 100 videos in the validation set, while ACLNet outperforms TASED-Net only on 7 videos. Qualitative results of our model and ACLNet for the better and worse cases are given in Figure 5 (see Supplementary material for more examples of qualitative results). As shown in (a) and (b) in Figure 5, TASED-Net seems highly sensitive to salient moving objects and less sensitive to background objects, which is consistent with the goal of video saliency in general. On the other hand, ACLNet seems to put more weight on spatially conspicuous objects, so sometimes it attends to distracting background objects. This makes the saliency map predicted by ACLNet a lot blurrier than ours in many cases.
We have observed that for videos where the ground-truth fixation points are scattered across a large area, our model quantitatively performs worse than ACLNet. This is because ACLNet generally predicts blurrier maps that better fit highly-scattered fixation points. However, we also find that ground-truth fixation points are unstable for these videos. For example, in (c) of Figure 5, the fixation points do not smoothly follow the carp, but instead flicker and jump between different carp. In (d), because the foreground object is so large, fixation points tend to move around the object. Furthermore, different subjects do not fixate on the same part of a large object. In these cases, it is hard to say that the ground-truth fixation points represent general human gaze behavior well. Therefore, we strongly believe that a larger number of human subjects is needed to properly annotate videos where the fixation points are frequently scattered across a large area. We also believe that a larger and more comprehensive dataset with more diverse scenes is needed to cover general situations where the salient moving objects are not the only dominant information. More qualitative results can be found in Supplementary material.
4.3 Performance on other datasets
We further test our model on two commonly used public datasets, Hollywood2 [23, 24] and UCFSports [24, 32, 35]. To leverage the relatively large scale of the DHF1K dataset, we first pre-train TASED-Net on DHF1K, and then fine-tune on Hollywood2 or UCFSports. For videos that are shorter than the minimum length our method requires (the third assumption in Section 3.1), we simply loop those videos until they fit our method. Table 3 compares our model with various previous state-of-the-art approaches. TASED-Net again achieves the best performance on each dataset across most of the metrics.
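Looping a short video can be sketched as repeating its frame list until a minimum length is reached. The helper below is a hypothetical reading of "simply loop": the name `loop_to_length`, the choice to repeat whole passes of the video, and the `min_len` parameter are assumptions, and the text does not say how predictions for the repeated frames are handled.

```python
def loop_to_length(frames, min_len):
    """Repeat the whole frame sequence until it has at least min_len frames."""
    if not frames:
        raise ValueError("cannot loop an empty video")
    looped = list(frames)
    while len(looped) < min_len:
        looped.extend(frames)
    return looped


if __name__ == "__main__":
    # A 3-frame video padded for a hypothetical minimum of 7 frames.
    print(loop_to_length(["f1", "f2", "f3"], min_len=7))
```

Videos that already satisfy the minimum length are returned unchanged (as a copy), so the helper can be applied uniformly to a whole dataset.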
|Method||NSS||CC||SIM||AUC-J||s-AUC|
|Deep Net||2.066||0.451||0.300||0.884||0.736|
|Deep Net||1.903||0.414||0.282||0.861||0.719|
4.4 Necessity of Auxiliary pooling
As discussed earlier, Auxiliary poolings are needed for the max-unpooling layers to work in our proposed architecture. Here, we compare TASED-Net with two variants that replace its max-unpooling layers. The first variant, which we call TASED-Net-tri, replaces all the max-unpooling layers with trilinear upsampling (interpolation). The second variant, which we name TASED-Net-trp, replaces the max-unpooling layers with transposed convolutions (deconvolution). Note that these two variants do not require Auxiliary poolings. Table 4 compares these variants and shows that TASED-Net without Auxiliary pooling operations performs poorly. In other words, replacing the max-unpooling layers does not work well, although TASED-Net-tri and TASED-Net-trp may seem more straightforward. This supports the effectiveness and necessity of Auxiliary pooling in TASED-Net.
In addition, we apply our temporally-aggregating scheme to other powerful architectures, including FCN, U-Net, and DeepLab [6, 7], which have achieved great success in dense prediction tasks. The results are reported in the Supplementary material. The unsatisfying results justify our architecture with the proposed Auxiliary pooling.
4.5 Other observations
We observe that stacking additional transposed convolution layers within each spatial decoding block in the prediction network does not boost performance. To demonstrate this, we augment TASED-Net by adding two more transposed convolutional layers to each spatial decoding block. This denser (or deeper) version increases the network size by approximately 40%, so we expected it to yield better performance by decoding spatial information more finely. However, we found that it actually yields slightly worse performance (see Supplementary material). This might be because spatial decoding is of less importance in video saliency detection than in other tasks where more precise pixel-wise outputs are required (e.g. video segmentation). Therefore, video saliency models may not necessarily benefit from stronger spatial decoding capabilities. Alternatively, it may be due to overfitting. To better understand how this phenomenon is affected by dataset size and task formulation, we would have to test the denser TASED-Net on larger datasets and alternative tasks like video segmentation.
We also observe that predicting multiple saliency maps at once for each sliding window decreases the overall performance compared to predicting a single saliency map. We believe that this is because enlarging the prediction space makes the decoder (prediction network) harder to optimize. This suggests that our temporally-aggregating scheme is more appropriate for video saliency detection.
Furthermore, we observe that TASED-Net with a clip length $T$ larger than 32 performs worse than with $T = 32$ (see Table 5). These results may indicate that it is sufficient to consider a fixed number of past frames for video saliency detection. However, they could also be a result of overfitting. TASED-Net with $T$ smaller than 32 also performs worse than with $T = 32$, which implies that it is necessary to consider a sufficient number of past frames, with a duration of about one second, for video saliency detection. We believe that further optimization of $T$ is not necessary for this paper.
We have presented TASED-Net as a novel fully-convolutional architecture for video saliency detection. The main idea is simple but effective: spatially decoding the features extracted by the encoder while jointly aggregating all the temporal information in order to produce a single full-resolution prediction map. We also propose the new concept of Auxiliary pooling, which enables our architecture to leverage the benefits of max-unpooling layers for reconstruction. TASED-Net significantly outperforms previous state-of-the-art methods on major video saliency detection datasets, which demonstrates the benefits of performing spatial decoding and temporal aggregation in a fully-convolutional way, as well as the benefits of conditioning on a limited amount of past information when predicting video saliency. Finally, we comprehensively analyze TASED-Net with many variants, and show that our proposed Auxiliary pooling is necessary and effective.
Acknowledgement. We thank Ryan Szeto for his valuable feedback and comments. We also thank Stephan Lemmer, Mohamed El Banani, and Luowei Zhou for their discussions. This research was, in part, supported by NIST grant 60NANB17D191.
- [1] (2015) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561.
- [2] (2018) Spatio-temporal saliency networks for dynamic saliency prediction. IEEE Transactions on Multimedia 20 (7), pp. 1688–1698.
- [3] (2016) Recurrent mixture density network for spatiotemporal visual attention. arXiv preprint arXiv:1603.08199.
- [4] (2018) What do different evaluation metrics tell us about saliency models?. IEEE transactions on pattern analysis and machine intelligence.
- [5] (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 4724–4733.
- [6] (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848.
- [7] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818.
- [8] (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.
- [9] (1995) Neural mechanisms of selective visual attention. Annual review of neuroscience 18 (1), pp. 193–222.
- [10] (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766.
- [11] (2010) A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Processing 19 (1), pp. 185–198.
- [12] (2010) Predictive saliency maps for surveillance videos. In Distributed Computing and Applications to Business Engineering and Science (DCABES), 2010 Ninth International Symposium on, pp. 508–513.
- [13] (2014) Saliency-aware video compression. IEEE Transactions on Image Processing 23 (1), pp. 19–33.
- [14] (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 18–22.
- [15] (2007) Graph-based visual saliency. In Advances in neural information processing systems, pp. 545–552.
- [16] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- [17] Salicon: reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 262–270.
- [18] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- [19] (2017) Predicting video saliency with object-to-motion cnn and two-layer convolutional lstm. arXiv preprint arXiv:1709.06316.
- [20] (2015) Salicon: saliency in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1072–1080.
- [21] (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
- [22] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
- [23] (2009) Actions in context. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 2929–2936.
- [24] (2015) Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (7), pp. 1408–1424.
- [25] (1994) Perceptual integration of motion and form information: is the movement filter involved in form discrimination?. Journal of Experimental Psychology: Human Perception and Performance 20 (2), pp. 397.
- [26] Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814.
- [27] (2013) Static saliency vs. dynamic saliency: a comparative study. In Proceedings of the 21st ACM international conference on Multimedia, pp. 987–996.
- [28] (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 1520–1528.
- [29] (2017) Salgan: visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081.
- [30] (2016) Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 598–606.
- [31] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
- [32] (2008) Action mach a spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8.
- [33] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
- [34] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [35] (2014) Action recognition in realistic sports videos. In Computer vision in sports, pp. 181–208.
- [36] (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
- [37] (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497.
- [38] (2018) Revisiting video saliency: a large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4894–4903.
- [39] (2018) Deep visual attention prediction. IEEE Trans. Image Process 27 (5), pp. 2368–2378.
- [40] (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321.
- [41] (2011) A spatiotemporal saliency model for video surveillance. Cognitive Computation 3 (1), pp. 241–263.
- [42] (2011) Adaptive deconvolutional networks for mid and high level feature learning. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2018–2025.