Ultrafast Video Attention Prediction with Coupled Knowledge Distillation

04/09/2019 ∙ by Kui Fu, et al.

Large convolutional neural network models have recently demonstrated impressive performance on video attention prediction. Conventionally, these models require intensive computation and large memory. To address these issues, we design an extremely light-weight network with ultrafast speed, named UVA-Net. The network is built on depth-wise convolutions and takes low-resolution images as input. However, this straightforward acceleration method decreases performance dramatically. To this end, we propose a coupled knowledge distillation strategy to augment and train the network effectively. With this strategy, the model can automatically discover and emphasize implicit useful cues contained in the data. Both the spatial and the temporal knowledge learned by the high-resolution complex teacher networks can also be distilled and transferred into the proposed low-resolution light-weight spatiotemporal network. Experimental results show that the performance of our model is comparable to ten state-of-the-art models in video attention prediction, while it requires only a 0.68 MB memory footprint and runs at about 10,106 FPS on GPU and 404 FPS on CPU, which is 206 times faster than previous models.




1 Introduction

Recent developments in portable/wearable devices have heightened the need for video attention prediction. Benefiting from comprehensive rules [13, 25], large-scale training datasets [8] and deep learning algorithms [34], it has become feasible in recent years to construct increasingly complex models that steadily improve performance. It is generally expected that such models can be used to facilitate subsequent tasks such as event understanding [39], scene reconstruction [33] and drone navigation [49].

Figure 1: Comparison between the high-resolution and low-resolution videos. a) and b) are videos and their attention maps with high-resolution, while c) and d) are with low-resolution. Salient objects in low-resolution videos are missing much spatial information but still localizable.

However, due to the limited computational ability and memory space of such portable/wearable devices, there are two main issues in video attention prediction: 1) How to reduce the computational cost and memory space to enable the real-time applications on such devices? 2) How to extract powerful spatiotemporal features from videos to avoid remarkable loss of attention prediction accuracy? To address these two issues, it is necessary to explore a feasible solution that converts existing complex and sparse spatial and temporal attention models into a simple and compact spatiotemporal network, which is more efficient in dealing with videos and aims to meet the requirements of practical applications on resource-limited devices.

Over the past years, deep learning has become the dominant approach in attention prediction due to its impressive capability of handling large-scale learning problems [45, 34, 2]. Strictly speaking, two main factors affect the computational cost and memory space of Convolutional Neural Networks (CNNs): the network parameters and the resolution of the feature maps in each network layer. Researchers tend to design more complex networks and collect more data for better performance. However, it has been shown that most complex networks are sparse and contain a lot of redundancy. This fact motivates the study of network compression, which aims to decrease the computational and memory cost. Video attention prediction can be greatly accelerated by such compact networks, but often at the expense of a substantial drop in prediction accuracy. To address this problem, seminal works have investigated knowledge distillation, which trains a simple or compressed network to mimic the behavior of a complex or sparse network in the hope of recovering some or all of the accuracy drop [6, 15, 36].

With these analyses in mind, we first reduce the resolution of the input image, which decreases the computational and memory cost by about one order of magnitude. As demonstrated in Fig. 1, the salient objects in low-resolution videos lose much spatial information but remain localizable, which implies that the missing details can still be recovered. We then construct the network from depth-wise convolutions, which further decreases the computational cost by about one order of magnitude and leads to an Ultrafast Video Attention Prediction Network (UVA-Net). However, UVA-Net suffers from a dramatic performance decrease under a straightforward training strategy. To this end, we propose a coupled knowledge distillation strategy, which augments the model to discover and emphasize useful cues from the data and to extract spatiotemporal features simultaneously. With this strategy, UVA-Net can predict the fixation map at an ultrafast speed while achieving performance comparable to ten state-of-the-art methods.

The contributions of this work are summarized as follows: 1) We design an extremely lightweight spatiotemporal network with an ultrafast speed, which is 206 times faster than previous methods. 2) We propose a coupled knowledge distillation strategy to enable the network to effectively fuse the spatial and temporal features simultaneously and avoid remarkable loss of attention prediction accuracy. 3) Comprehensive experiments are conducted and illustrate that our model can achieve an ultrafast speed with a comparable attention prediction accuracy to the state-of-the-art models.

2 Related Work

In this section, we give a brief review of recent works from two perspectives: visual attention model as well as knowledge distillation and transfer.

2.1 Visual Attention Model

The models of image/video attention prediction can be divided into three categories: heuristic models, non-deep learning models and deep learning models.

The heuristic models [19, 18, 38, 20, 30] can be roughly categorized into bottom-up approaches and top-down approaches. General models in the former category are stimulus-driven and compete to pop out visual signals. For example, Fang et al. [10] proposed a video attention model with a fusion strategy based on the compactness and the temporal motion contrast. Later, Fang et al. [11] proposed another video attention model based on uncertainty weighting. The models in the latter category are task-driven and usually integrate high-level factors. Borji et al. [4] adopted a unified Bayesian approach to model task-driven visual attention.

To overcome the deficiency of the fusion strategies in heuristic models, researchers have proposed plenty of non-deep learning models [28, 24, 42, 44, 43]. Vig et al. [44] proposed a simple bottom-up attention model that uses supervised learning techniques to fine-tune the free parameters for dynamic scenarios. Fang et al. [9] proposed an optimization framework with pairwise binary terms to pop out salient targets and suppress distractors.

For further performance improvements, deep learning models have been proposed, which emphasize the importance of automatic hierarchical feature extraction and end-to-end learning [31]. For example, Kümmerer et al. [23] directly used the features from the VGG-19 network [40] for attention inference without additional fine-tuning. Pan et al. [34] proposed two networks for fixation prediction, with deep and shallow structures, respectively.

These models can achieve impressive performance but usually have a high computational cost. How to obtain a good speed-accuracy trade-off in attention prediction is still a key open issue.

Figure 2: Overview. Our coupled knowledge distillation approach consists of four streams: a spatial teacher stream, a temporal teacher stream, a student stream and a spatiotemporal stream. It is trained in two steps: knowledge distillation and spatiotemporal joint optimization.

2.2 Knowledge Distillation and Transfer

Knowledge distillation is a class of techniques originally proposed by Hinton et al. [15], which aims at transferring the knowledge in a complex teacher model to a simple student model and improving the performance of the student model at test time. In [15], Hinton et al. adopt the soft labels generated by the teacher model as a supervision signal in addition to the regular labeled training data during the training phase. Studies have shown that such an extra supervision signal can take various reasonable forms, such as classification probabilities [15], feature representations [36], or inter-layer flow (the inner product of feature maps) [47]. A new wave of work has since exploited knowledge transfer to distill the knowledge in easy-to-train complex models into hard-to-train simple models [36]. In general, most off-line knowledge distillation frameworks need two training phases, which is resource-intensive. To address this issue, Zhang et al. [50] proposed an on-line deep mutual learning framework, which trains the distillation network in one phase between two peer student models.

There is no doubt that these knowledge distillation approaches are successful in high-resolution scenarios, but their effectiveness in low-resolution dynamic scenarios is questionable since they face the dual challenges of the limited network capacity and the loss of stimulus signal. With such questions, this paper demonstrates an ultrafast attention prediction network for videos.

3 Coupled Knowledge Distillation

In this section, we present a coupled knowledge distillation approach for video attention prediction. We first briefly overview the whole approach and introduce the residual block with channel-wise attention. Then we elaborate on the knowledge distillation and the spatiotemporal joint optimization, respectively. Finally, we describe the training and testing strategy of our approach.

3.1 Overview

We start with an overview of our coupled knowledge distillation approach before going into details below. We learn a strong yet efficient attention predictor by transferring the knowledge of a spatial teacher network and a temporal teacher network to a simple and compact one. Note that both the spatial and temporal teacher networks are easy-to-train and have excellent performance. The overview is as shown in Fig. 2.

The overall training process consists of two steps: 1) Knowledge distillation: distilling the spatial and temporal knowledge inherited in the spatial teacher network and the temporal teacher network into a student network. 2) Spatiotemporal joint optimization: transferring the knowledge learned by the student network to a spatiotemporal network and then fine-tuning it.

3.2 Residual block with Channel-wise attention

Inspired by MobileNet V2 [37], we construct our networks from depth-wise convolutions with an inverted residual structure. The basic convolutional block of MobileNet V2 is illustrated in Fig. 3 (a). Such blocks can greatly improve the compactness of the networks but make it hard to maintain the same accuracy. Aiming to meet the requirements of practical applications on resource-limited devices, we introduce a channel-wise attention mechanism into the MobileNet V2 block and propose a novel light-weight block, named the CA-Res block. The detailed architecture of the CA-Res block is shown in Fig. 3 (b).
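As a rough illustration of why depth-wise convolutions make a network compact, the sketch below counts the weights of a standard convolution and of a depth-wise separable one (a depth-wise k×k convolution followed by a 1×1 point-wise convolution). The layer sizes are hypothetical and are not taken from UVA-Net; biases and the expansion layer of the inverted residual structure are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depth-wise k x k convolution plus a 1 x 1 point-wise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = std / sep
print(std, sep, round(ratio, 2))
```

For this hypothetical layer the separable variant needs roughly 8 times fewer weights, which is in line with the "about one order of magnitude" reduction mentioned in the introduction.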

Each CA-Res block maps an input intermediate representation to an output intermediate representation. The output is formulated in terms of the inverted residual structure of the MobileNet V2 block and a channel-wise attention function, which are combined through element-wise multiplication. The attention function relies on both global average pooling and global max pooling. In general, global average pooling performs well in preserving global characteristics, while global max pooling has the potential to retain texture features [46]. It has been shown that exploiting both of them improves the representation power of networks more than using either one independently. Given an intermediate feature map F, both pooling operations are used to squeeze the spatial dimension of F, a multi-layer perceptron (MLP) functions as a generator of attention vectors, and a sigmoid function is applied in computing the channel-wise attention.
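The channel-wise attention described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a single shared MLP is applied to the average-pooled and max-pooled descriptors, that the two outputs are summed before the sigmoid, and that the resulting per-channel weights rescale the feature map. The function names and the toy weights `w1` and `w2` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; w1 is (hidden x C), w2 is (C x hidden)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature, w1, w2):
    """feature: list of C channels, each an H x W list of lists.

    Assumed form: sigmoid(MLP(avg_pool(F)) + MLP(max_pool(F))).
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    mx = [max(map(max, ch)) for ch in feature]
    from_avg = mlp(avg, w1, w2)
    from_max = mlp(mx, w1, w2)
    return [sigmoid(a + m) for a, m in zip(from_avg, from_max)]


def apply_attention(feature, weights):
    """Element-wise multiplication of each channel by its attention weight."""
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature, weights)]


# Toy example: 2 channels of size 2 x 2, hidden size 1 (all values hypothetical).
feature = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.0, 0.0], [0.0, 2.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
weights = channel_attention(feature, w1, w2)
scaled = apply_attention(feature, weights)
```

Both pooling operations reduce each H x W channel to a single number, so the attention vector has one entry per channel regardless of the spatial resolution.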

Figure 3: Detailed convolutional blocks. (a) MobileNet V2 block. (b) CA-Res block.

3.3 Knowledge Distillation

Given the CA-Res block, we now describe how we construct our knowledge distillation network. Suppose that the dataset consists of pairs of an input image and its corresponding ground-truth map. Since this work aims to learn a spatiotemporal student model, the student architecture should be able to simulate the spatial and temporal features extracted by the teachers. Thus, our student network is a two-branch network that takes a pair of low-resolution successive frames as input and outputs both spatially and temporally predicted attention maps. Note that in teacher networks, the temporal information is often extracted with the assistance of optical flow, as it provides an effective way to smoothly accumulate features along time. However, optical flow is expensive to compute and hard to apply in a lightweight student model.

To effectively mimic temporal features at low cost, we propose to adopt simple operations on intermediate features, which have more representation power, to mimic the temporal dynamics. Specifically, given a pair of successive frames, the network first processes them through a low-level module, which contains one standard convolutional layer and three CA-Res blocks in each branch and shares its weights between the two paths. From the resulting low-level features, the network forms a spatial intermediate representation and a temporal intermediate representation, using a concatenating operation. We feed the spatial intermediate representation into a spatial path, which consists of five blocks, to further extract a high-level spatial intermediate representation. Similarly, the temporal intermediate representation is fed into a temporal path to extract a high-level temporal intermediate representation. Note that the spatial path and the temporal path share the same topology. Each path is followed by two deconvolutional layers that restore the high-level representations to fine feature maps with the original resolution of the input frames. The overall architecture of the student network is shown in Fig. 4 (a).

We consider a spatial teacher network, a temporal teacher network and a student network. The student network can then be trained by optimizing a spatial loss and a temporal loss, each of which combines a hard loss and a soft loss. The hard loss is computed between the predicted density map and the ground-truth density map, while the soft loss is computed between the predicted density map and the teacher prediction, following the practice in knowledge distillation [15]. An empirically set parameter balances the hard loss and the soft loss. The spatial and temporal losses are applied to the spatial and temporal branches of the student network, respectively. With such a strategy, the knowledge inherited in both the spatial and the temporal teacher networks can be distilled into the student network, which can simultaneously generate spatial and temporal attention predictions.
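The combination of a hard loss and a soft loss for one branch can be sketched as follows. This is a minimal sketch under stated assumptions: mean squared error stands in for both loss terms, the two terms are mixed as a convex combination, and the default weight `mu=0.5` is a placeholder. None of these choices is confirmed by the text above.

```python
def mse(pred, target):
    """Mean squared error between two maps given as lists of rows."""
    count = sum(len(row) for row in pred)
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / count


def distillation_loss(student_pred, ground_truth, teacher_pred, mu=0.5):
    """Hard loss against the ground truth, soft loss against the teacher.

    mu is a hypothetical balancing weight: 0 uses only the hard loss,
    1 uses only the soft loss.
    """
    hard = mse(student_pred, ground_truth)
    soft = mse(student_pred, teacher_pred)
    return (1.0 - mu) * hard + mu * soft


# Toy 1 x 2 maps for a single branch (all values hypothetical).
student = [[0.2, 0.8]]
truth = [[0.0, 1.0]]
teacher = [[0.1, 0.9]]
loss = distillation_loss(student, truth, teacher)
```

In a two-branch student, the same function would be evaluated once for the spatial branch with the spatial teacher and once for the temporal branch with the temporal teacher.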

3.4 Spatiotemporal Joint Optimization

Although the student model can distill the spatial and temporal knowledge inherited in the teacher networks separately, it is still challenging to fuse the spatial and temporal features together to provide more powerful representations. To address this issue, we construct a spatiotemporal network. Its first six blocks share the same structure as those of the student network, so that the knowledge of the student network can be transferred to it. In this manner, the spatiotemporal network can generate both a spatial intermediate representation and a temporal intermediate representation, like the student network. Following the shared part, a fusion sub-net takes the concatenation of the two representations as input, fuses them together, and infers the final spatiotemporal attention map. Note that the fusion sub-net has the same structure as the last five blocks except for its first block, which has doubled channels. The details of the spatiotemporal network are illustrated in Fig. 4 (b).

Figure 4: Network architecture. (a) Student network. (b) Spatiotemporal network.

Unlike the student network, the spatiotemporal network is trained by optimizing a spatiotemporal loss that involves only the hard loss.


3.5 Training and Testing Details

All networks proposed in this paper are implemented with TensorFlow [1] and run on an NVIDIA 1080Ti GPU and a six-core 3.4GHz Intel CPU. In the knowledge distillation step, the student network is trained from scratch with a batch size of 96. Adam [22] is employed to minimize the spatial loss and the temporal loss. After training, the student network performs well in extracting both spatial and temporal features from a pair of successive frames. In the spatiotemporal joint optimization step, the knowledge inherited from the student network is transferred into the spatiotemporal network; by optimizing the spatiotemporal loss, the spatiotemporal network learns to extract spatiotemporal features. Note that the learning rate and batch size are the same as in the knowledge distillation step.

For testing, we feed a pair of successive frames into the spatiotemporal network, which generates a spatiotemporal attention map for the first frame of the pair. In this manner, we can iteratively compute the spatiotemporal attention prediction for the next frame until reaching the last frame of the video. The last frame of the video has no next frame, so it is paired with itself as input instead. In this case, both the spatial and the temporal paths take only spatial cues into consideration and ignore temporal cues for attention prediction.
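The frame-pair iteration used at test time can be sketched as below. It assumes that each frame is paired with its successor and that the last frame is paired with itself, which is one reading of the description above. `predict_pair` is a hypothetical stand-in for the spatiotemporal network.

```python
def frame_pairs(frames):
    """Yield (current, next) pairs; the last frame is paired with itself."""
    for t in range(len(frames)):
        nxt = frames[t + 1] if t + 1 < len(frames) else frames[t]
        yield frames[t], nxt


def predict_video(frames, predict_pair):
    """Return one attention map per frame using a pairwise predictor."""
    return [predict_pair(cur, nxt) for cur, nxt in frame_pairs(frames)]


# Toy usage with string placeholders instead of real frames.
pairs = list(frame_pairs(["f0", "f1", "f2"]))
maps = predict_video(["f0", "f1", "f2"], lambda cur, nxt: cur + "|" + nxt)
print(pairs)
print(maps)
```

With this pairing, a video of N frames yields exactly N attention maps, and only the final pair carries no motion information.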

4 Experiments

We evaluate the proposed UVA-Net on a public dataset AVS1K [12], which is an aerial video dataset for attention prediction. The AVS1K is divided into a training set, a validation set and a testing set in a proportion of 2:1:1. It consists of four categories: Building, Human, Vehicle, and Others.

Speed is reported in fps on GPU (NVIDIA 1080Ti) and CPU (Intel 3.4GHz).

Group  Model        AUC    sAUC   NSS    SIM    CC     Parameters (M)  Memory Footprint (MB)  GPU Speed (fps)  CPU Speed (fps)
H      HFT [26]     0.789  0.715  1.671  0.408  0.539  -               -                      -                7.6
H      SP [27]      0.781  0.706  1.602  0.422  0.520  -               -                      -                3.6
H      PNSP [10]    0.787  0.634  1.140  0.321  0.370  -               -                      -                -
NL     SSD [25]     0.737  0.692  1.564  0.404  0.503  -               -                      -                2.9
NL     LDS [9]      0.808  0.720  1.743  0.452  0.565  -               -                      -                4.6
DL     eDN [43]     0.855  0.732  1.262  0.289  0.417  -               -                      -                0.2
DL     iSEEL [41]   0.801  0.767  1.974  0.458  0.636  -               -                      -                -
DL     DVA [45]     0.864  0.761  2.044  0.544  0.658  25.07           59.01                  49               2.5
DL     SalNet [34]  0.797  0.769  1.835  0.410  0.593  25.81           43.22                  28               1.5
DL     STS [2]      0.804  0.732  1.821  0.472  0.578  41.25           86.94                  17               0.9
DL     UVA-DVA-32   0.850  0.740  1.905  0.522  0.615  0.16            0.68                   10,106           404.3
DL     UVA-DVA-64   0.856  0.748  1.981  0.540  0.635  0.16            2.73                   2,588            101.7
Table 1: Performance comparisons of 12 state-of-the-art models on AVS1K. The best and runner-up models of each column are marked with bold and underline, respectively. Except for our model, the other deep models fine-tuned on AVS1K are marked with *.

On the AVS1K, the UVA-Net is compared with ten state-of-the-art models for video attention prediction, including:

1) Three heuristic models (group denoted as [H]): HFT [26], SP [27] and PNSP [10].

2) Two non-deep learning models (group denoted as [NL]): SSD [25] and LDS [9].

3) Five deep learning models (group denoted as [DL]): eDN [43], iSEEL [41], DVA [45], SalNet [34] and STS [2].

Following prior investigations [35, 29, 7], we adopt five evaluation metrics in the comparisons: the traditional Area Under the ROC Curve (AUC), the shuffled AUC (sAUC), the Normalized Scanpath Saliency (NSS), the Similarity Metric (SIM) [16], and the Correlation Coefficient (CC) [3]. AUC is obtained by sweeping all possible thresholds over the predicted map and plotting the true positive rate against the false positive rate; the area under the resulting ROC curve reflects how well the prediction separates fixated from non-fixated locations. sAUC takes the fixations shuffled from other frames as negatives in generating the curve. NSS measures the average response at the eye fixation locations. SIM measures the similarity between the estimated and ground-truth maps, while CC is computed as the linear correlation between them. For all five metrics, higher values indicate better performance. In this paper, we take NSS as the primary metric [45, 32].
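Three of these metrics are simple enough to sketch in a few lines. The code below follows the commonly used definitions (NSS as the mean z-scored saliency at fixated pixels, CC as the Pearson correlation, SIM as the histogram intersection of the two maps normalized to sum to one). It is an illustrative sketch rather than the evaluation code used in the paper, and it assumes non-constant, non-negative maps.

```python
import math


def _flatten(m):
    return [v for row in m for v in row]


def nss(saliency, fixations):
    """Mean z-scored saliency at fixated pixels; fixations are (row, col)."""
    vals = _flatten(saliency)
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)


def cc(pred, gt):
    """Pearson linear correlation between two maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    var_p = sum((a - mp) ** 2 for a in p)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / math.sqrt(var_p * var_g)


def sim(pred, gt):
    """Histogram intersection after normalizing each map to sum to one."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


# Toy 2 x 2 maps (values hypothetical).
pred = [[0.0, 1.0], [2.0, 3.0]]
flipped = [[3.0, 2.0], [1.0, 0.0]]
print(nss(pred, [(1, 1)]), cc(pred, flipped), sim(pred, pred))
```

NSS rewards high normalized saliency exactly where people looked, which is one reason it is often preferred as a primary metric.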

4.1 Comparison with the State-of-the-art Models

The performance of 12 state-of-the-art models on AVS1K is presented in Tab. 1. For simplicity, we evaluate UVA-Net at resolutions 32 and 64 (denoted UVA-DVA-32 and UVA-DVA-64) and fix the spatial and temporal teacher models as DVA and TSNet [2], respectively. Some representative results of those models are illustrated in Fig. 5.

Figure 5: Representative frames of state-of-the-art models on AVS1K. (a) Video frame, (b) Ground truth, (c) HFT, (d) SP, (e) PNSP, (f) SSD, (g) LDS, (h) eDN, (i) iSEEL, (j) DVA, (k) SalNet, (l) STS, (m) UVA-DVA-32, (n) UVA-DVA-64.

From Tab. 1, we observe that both UVA-DVA-32 and UVA-DVA-64 are comparable to the ten state-of-the-art models. In terms of AUC, NSS, and SIM, UVA-DVA-64 ranks second, while its CC ranks third and its sAUC ranks fourth. Such impressive performance can be attributed to the coupled knowledge distillation approach, which distills the spatial and temporal knowledge from well-trained complex teacher networks to a simple and compact student network, and finally transfers it into a spatiotemporal network. The knowledge distillation step enables the spatiotemporal network to separately extract spatial and temporal cues from a pair of successive frames, while the spatiotemporal joint optimization step gives the spatiotemporal network the ability to fuse such spatial and temporal cues together to provide more powerful representations. In particular, we find that UVA-DVA-64 is superior to traditional single-branch networks (such as SalNet) and two-branch networks for video (such as STS), but slightly inferior to multi-branch networks (such as DVA). In addition, our UVA-Net has very few parameters (only 0.16 M) and a low memory footprint (UVA-DVA-32 takes 0.68 MB and UVA-DVA-64 takes 2.73 MB), resulting in high computational efficiency. In summary, UVA-DVA-64 achieves very fast speed (2,588 FPS on GPU) and comparable performance to the state-of-the-art models. UVA-DVA-32 achieves ultrafast speed (10,106 FPS on GPU) but with slight performance degradation.

Our approach has an impressive performance in dynamic scenarios, but its effectiveness in static scenarios is still open to debate. We present the performance of ten state-of-the-art models on MIT1003 [21] in Tab. 2. Note that the network still takes a frame pair as input, and both the spatial and the temporal branches take the same teacher, DVA, as the additional supervision signal. From this table, we find that our approach is still comparable in static scenarios. It outperforms most of the state-of-the-art models but falls clearly behind SalNet and DVA. This can be explained by the absence of temporal cues and the resulting failure of the spatiotemporal fusion strategy. In summary, our UVA-Net is an efficient video attention prediction approach with low computational cost, but it relies on motion cues, which leads to an obvious performance drop in static scenarios.

Model            AUC   sAUC  NSS   SIM   CC
BMS [48]         0.79  0.69  1.25  0.33  0.36
CAS [13]         0.76  0.68  1.07  0.32  0.31
AIM [5]          0.79  0.68  0.82  0.27  0.26
Judd Model [21]  0.76  0.68  1.02  0.29  0.30
GBVS [14]        0.83  0.66  1.38  0.36  0.42
ITTI [19]        0.77  0.66  1.10  0.32  0.33
eDN [43]         0.85  0.66  1.29  0.30  0.41
SalNet [34]      0.85  0.75  1.82  0.43  0.57
DVA [45]         0.86  0.76  1.94  0.49  0.61
Ours             0.83  0.71  1.58  0.43  0.50
Table 2: Performance comparisons of ten state-of-the-art models on MIT1003. The best and runner-up models of each column are marked with bold and underline, respectively. All deep learning models are fine-tuned on SALICON [17] without any extra data.

4.2 Detailed Performance Analysis

Beyond performance comparisons, we also conduct several experiments on AVS1K to verify the effectiveness of the proposed UVA-Net.

Resolution reduction and supervision signal. We conduct an experiment to assess the effect of resolution reduction and of the supervision signal. Without loss of generality, we fix the temporal teacher signal as TSNet, and provide three candidate spatial teacher signals, DVA, SalNet and SSNet [2], at resolutions 256, 128, 96, 64 and 32. The performance of UVA-Net with the different settings is presented in Tab. 3.

T-S         Res  AUC    sAUC   NSS    SIM    CC
UVA-DVA     256  0.786  0.680  1.447  0.397  0.454
            128  0.810  0.698  1.566  0.438  0.498
            96   0.827  0.726  1.765  0.483  0.560
            64   0.856  0.748  1.981  0.540  0.635
            32   0.850  0.740  1.905  0.522  0.615
UVA-SalNet  256  0.775  0.676  1.423  0.394  0.451
            128  0.807  0.698  1.623  0.451  0.511
            96   0.843  0.717  1.746  0.484  0.551
            64   0.859  0.751  1.955  0.529  0.626
            32   0.852  0.739  1.892  0.519  0.611
UVA-SSNet   256  0.786  0.687  1.485  0.404  0.467
            128  0.807  0.703  1.641  0.455  0.517
            96   0.834  0.721  1.788  0.493  0.564
            64   0.852  0.744  1.971  0.535  0.627
            32   0.845  0.736  1.894  0.522  0.610
Table 3: Performance comparisons of the UVA-Net with different settings on the AVS1K dataset. T-S: spatial teacher model, Res: input resolution. The best model of each column for each spatial teacher signal is marked in bold.

From this table, we find that models at low resolution tend to have better performance. For example, with DVA as the spatial supervision signal, the UVA-Net at resolution 64 ranks first in terms of all metrics. However, further resolution reduction results in non-negligible performance degradation. We can infer that our UVA-Net can extract powerful spatiotemporal features from videos at a proper low resolution, while it is still challenging for the UVA-Net to deal with the details in high-resolution videos. In practice, the UVA-Net at the other resolutions suffers from a performance drop compared with the UVA-Net at resolution 64.

Overall, the UVA-Net with DVA as teacher signal achieves the best performance, while SSNet and SalNet rank second and third, respectively. For example, at resolution 64, the UVA-Net supervised with DVA achieves NSS=1.981, while the variants supervised with SalNet and SSNet reach only 1.955 and 1.971, respectively. This is consistent with the performance of the teacher models in Tab. 1, which indicates that the proposed approach can effectively transfer the knowledge inherited in teacher models.

Ablation study. We use the AVS1K dataset and adopt the UVA-Net with MobileNet V2 blocks trained from scratch as the baseline, denoted as MB+scratch. We empirically show the effectiveness of our design choice via six experiments. 1) MB+dis. The UVA-Net with MobileNet V2 blocks with coupled knowledge distillation. 2) MB+SE+dis. The UVA-Net with MobileNet V2 blocks and SE block with coupled knowledge distillation. 3) CA-Res+scratch. The UVA-Net with CA-Res blocks trained from scratch. 4) CA-Res+spa+scratch. The student network spatial branch with CA-Res blocks trained from scratch. 5) CA-Res+tmp+scratch. The student network temporal branch with CA-Res blocks trained from scratch. 6) CA-Res+dis. The UVA-Net with CA-Res blocks with coupled knowledge distillation. The performances of all these ablation models are described in Fig. 6.

Figure 6: The performance comparisons of seven ablation models on AVS1K dataset.

From this figure, we find that CA-Res+scratch achieves a performance gain over MB+scratch in terms of NSS, and that CA-Res+dis is superior to MB+dis in terms of NSS, which verifies the effectiveness of the proposed channel-wise attention mechanism. Similarly, the effectiveness of our coupled knowledge distillation is supported by the fact that MB+dis and CA-Res+dis are superior to MB+scratch and CA-Res+scratch, respectively. CA-Res+dis also has a performance gain over MB+SE+dis in terms of NSS, which implies the superiority of our channel-wise attention mechanism over the traditional SE block. The reason may be that the SE block adopts global average pooling and ignores global max pooling, which makes it hard to maintain the texture characteristics of feature maps. Additionally, without spatiotemporal joint optimization, CA-Res+spa+dis and CA-Res+tem+dis show a non-negligible NSS degradation compared with CA-Res+dis. This verifies that exploiting both spatial and temporal cues improves the representation power of networks more than using either independently.

Speed analysis. As mentioned above, our approach can greatly reduce the computational cost and memory space without remarkable loss of prediction accuracy. In particular, DVA, SalNet, and STS contain 25.07, 25.81 and 41.25 million parameters, while the proposed UVA-Net contains only 0.16 million parameters. Namely, the UVA-Net achieves a substantial reduction in the number of parameters. In addition, the memory footprints of UVA-Net at the different resolutions are illustrated in Fig. 7.

Figure 7: The inference memory footprint for various models. The memory footprint is proportional to the resolution of the input.

From the figure, we find that the memory footprint is proportional to the input resolution. For example, at a low resolution such as 32, the memory footprint of UVA-Net can be reduced to an extremely low value, 0.68 MB. With so few parameters and such a low memory footprint, the process of attention inference can be greatly accelerated.
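A quick arithmetic check on the footprints reported in Tab. 1 (0.68 MB for UVA-DVA-32 and 2.73 MB for UVA-DVA-64) suggests that the footprint grows roughly with the number of input pixels, i.e. with the square of the side length. The snippet below assumes that both spatial dimensions of the input scale together, which the text does not state explicitly.

```python
# Memory footprints in MB, taken from Tab. 1.
footprint_mb = {32: 0.68, 64: 2.73}

measured_ratio = footprint_mb[64] / footprint_mb[32]
pixel_ratio = (64 / 32) ** 2   # ratio of pixel counts when both sides double

print(round(measured_ratio, 2), pixel_ratio)
```

Under this reading, doubling the side length multiplies the footprint by about four, which is consistent with feature maps dominating the memory cost when the parameter count is fixed at 0.16 M.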

For comparison, we demonstrate the inference runtime of UVA-Net with different resolution on both high-end GPU (NVIDIA 1080Ti) and low-end CPU (Intel 3.4GHz) in Tab. 4.

Model        GPU (NVIDIA 1080Ti)  CPU (Intel 3.4GHz)
             time / #FPS          time / #FPS
UVA-Net-256  6.602 ms / 151       134.230 ms / 7.4
UVA-Net-128  1.586 ms / 631       40.928 ms / 24.4
UVA-Net-96   0.897 ms / 1,115     22.495 ms / 44.5
UVA-Net-64   0.386 ms / 2,588     9.830 ms / 101.7
UVA-Net-32   0.099 ms / 10,106    2.474 ms / 404.3
Table 4: Inference time and speed on GPU and CPU.

We observe that the inference runtime (with an NVIDIA 1080Ti) for attention prediction is reduced to 6.602 ms, 1.586 ms, 0.897 ms, 0.386 ms and 0.099 ms at resolutions 256, 128, 96, 64 and 32, respectively. Our model offers a remarkable speed-accuracy trade-off, achieving ultrafast speed and impressive accuracy even on a low-end CPU.
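The relation between per-frame latency and throughput in Tab. 4 can be checked with a few lines of arithmetic. The latencies below are the GPU values from Tab. 4. The comparison against DVA (49 fps on GPU in Tab. 1) is an assumption about which baseline the "206 times faster" figure refers to.

```python
def fps_from_latency(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


# GPU latencies in ms from Tab. 4, keyed by input resolution.
gpu_latency_ms = {256: 6.602, 128: 1.586, 96: 0.897, 64: 0.386, 32: 0.099}
gpu_fps = {res: fps_from_latency(ms) for res, ms in gpu_latency_ms.items()}

# Reported GPU speeds: UVA-DVA-32 (Tab. 4) versus DVA (Tab. 1).
speedup_vs_dva = 10106 / 49

for res, fps in gpu_fps.items():
    print(res, round(fps))
print(round(speedup_vs_dva))
```

The recomputed throughputs agree with the reported FPS values up to the rounding of the latencies, and the ratio to DVA comes out at roughly 206.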

5 Conclusion

In this paper, we propose a simple yet powerful approach via coupled knowledge distillation for video attention prediction. In the knowledge distillation step, we distill the spatial and temporal knowledge inherited in the teacher networks to a student network. In the spatiotemporal joint optimization step, we transfer the knowledge learned by the student network to a spatiotemporal network and then fine-tune it. The proposed approach can greatly reduce the computational cost and memory space without remarkable loss of prediction accuracy. The experimental results on a video dataset have validated the effectiveness of the proposed approach.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] C. Bak, A. Kocak, E. Erdem, and A. Erdem. Spatio-temporal saliency networks for dynamic saliency prediction. IEEE TMM, 2017.
  • [3] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In CVPR, 2012.
  • [4] A. Borji, D. N. Sihite, and L. Itti. Probabilistic learning of task-specific visual attention. In CVPR, 2012.
  • [5] N. D. Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of vision, 9(3):5–5, 2009.
  • [6] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
  • [7] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? IEEE TPAMI, 41(3):740–757, 2019.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [9] S. Fang, J. Li, Y. Tian, T. Huang, and X. Chen. Learning discriminative subspaces on random contrasts for image saliency analysis. IEEE TNNLS, 28(5):1095–1108, 2017.
  • [10] Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin. A video saliency detection model in compressed domain. IEEE TCSVT, 24(1):27–38, 2014.
  • [11] Y. Fang, C. Zhang, J. Li, J. Lei, M. P. Da Silva, and P. Le Callet. Visual attention modeling for stereoscopic video: a benchmark and computational model. IEEE TIP, 26(10):4684–4696, 2017.
  • [12] K. Fu, L. Jia, S. Hongze, and Y. Tian. How drones look: Crowdsourced knowledge transfer for aerial video saliency prediction. arXiv preprint arXiv:1811.05625, 2018.
  • [13] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE TPAMI, 34(10):1915, 2012.
  • [14] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In NIPS, 2007.
  • [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
  • [16] W. Hou, X. Gao, D. Tao, and X. Li. Visual saliency detection using information divergence. Pattern Recognition, 46(10):2658–2669, 2013.
  • [17] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In ICCV, pages 262–270, 2015.
  • [18] L. Itti. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE TIP, 13(10):1304, 2004.
  • [19] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 20(11):1254–1259, 1998.
  • [20] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness. In ICCV, 2013.
  • [21] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In ICCV, pages 2106–2113. IEEE, 2009.
  • [22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] M. Kümmerer, T. S. Wallis, and M. Bethge. Deepgaze ii: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
  • [24] W.-F. Lee, T.-H. Huang, S.-L. Yeh, and H. H. Chen. Learning-based prediction of visual attention for video signals. IEEE TIP, 20(11):3028–3038, 2011.
  • [25] J. Li, L. Y. Duan, X. Chen, T. Huang, and Y. Tian. Finding the secret of image saliency in the frequency domain. IEEE TPAMI, 37(12):2428, 2015.
  • [26] J. Li, M. D. Levine, X. An, X. Xu, and H. He. Visual saliency based on scale-space analysis in the frequency domain. IEEE TPAMI, 35(4):996–1010, 2013.
  • [27] J. Li, Y. Tian, and T. Huang. Visual saliency with statistical priors. IJCV, 107(3):239–253, 2014.
  • [28] J. Li, Y. Tian, T. Huang, and W. Gao. Probabilistic multi-task learning for visual saliency estimation in video. IJCV, 90(2):150–165, 2010.
  • [29] J. Li, C. Xia, Y. Song, S. Fang, and X. Chen. A data-driven metric for comprehensive evaluation of saliency models. In ICCV, 2015.
  • [30] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu. Saliency detection on light field. In CVPR, 2014.
  • [31] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE TIP, 25(8):3919–3930, 2016.
  • [32] N. Liu and J. Han. A deep spatial contextual long-term recurrent convolutional network for saliency detection. arXiv preprint arXiv:1610.01708, 2016.
  • [33] F. Mancini, M. Dubbini, M. Gattelli, F. Stecchi, S. Fabbri, and G. Gabbianelli. Using unmanned aerial vehicles (uav) for high-resolution reconstruction of topography: The structure from motion approach on coastal environments. Remote Sensing, 5(12):6880–6898, 2013.
  • [34] J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O'Connor. Shallow and deep convolutional networks for saliency prediction. In CVPR, 2016.
  • [35] N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit. Saliency and human fixations: state-of-the-art and study of comparison metrics. In ICCV, 2013.
  • [36] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
  • [38] H. J. Seo and P. Milanfar. Static and space-time visual saliency detection by self-resemblance. JOV, 9(12):15–15, 2009.
  • [39] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S. Chun Zhu. Joint inference of groups, events and human roles in aerial videos. In CVPR, pages 4576–4584, 2015.
  • [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [41] H. R. Tavakoli, A. Borji, J. Laaksonen, and E. Rahtu. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. Neurocomputing, 244:10–18, 2017.
  • [42] E. Vig, M. Dorr, and E. Barth. Efficient visual coding and the predictability of eye movements on natural movies. Spatial Vision, 22(5):397–408, 2009.
  • [43] E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In CVPR, 2014.
  • [44] E. Vig, M. Dorr, T. Martinetz, and E. Barth. Intrinsic dimensionality predicts the saliency of natural dynamic scenes. IEEE TPAMI, 34(6):1080–1091, 2012.
  • [45] W. Wang and J. Shen. Deep visual attention prediction. IEEE TIP, 27(5):2368–2378, 2018.
  • [46] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  • [47] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, volume 2, 2017.
  • [48] J. Zhang and S. Sclaroff. Saliency detection: A boolean map approach. In ICCV, pages 153–160, 2013.
  • [49] J. Zhang, Y. Wu, W. Liu, and X. Chen. Novel approach to position and orientation estimation in vision-based uav navigation. IEEE Transactions on Aerospace and Electronic Systems, 46(2):687–700, 2010.
  • [50] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.