1 Introduction
The ability to predict the gaze of humans has many applications in computer vision and related fields. It has been used for image cropping
[1], improving video compression [2], and as a tool to optimize user interfaces [3], for instance. In psychology, gaze prediction models are used to shed light on how the brain might process sensory information [4]. Recently, due to advances in deep learning, human gaze prediction has received large performance gains. In particular, reusing image representations trained for the task of object recognition has proven to be very useful
[5, 6]. However, these networks are relatively slow to evaluate, while many real-world applications require highly efficient predictions. For example, popular websites often deal with large amounts of images which need to be processed in a short amount of time using only CPUs. Similarly, improving video encoding with gaze prediction maps requires the processing of large volumes of data in near real-time.

In this paper we explore the trade-off between computational complexity and gaze prediction performance. Our contributions are twofold: First, using a combination of knowledge distillation [7] and pruning, we show that good performance can be achieved with a much faster architecture, achieving a 10x speedup for the same generalization performance in terms of AUC. Second, we provide a principled derivation for the pruning method of Molchanov et al. [8], extend it, and show that our extension works well when applied to gaze prediction. We further discuss how to choose the trade-off between performance and computational cost, and suggest methods for automatically tuning a weighted combination of the corresponding losses, reducing the need to run expensive hyperparameter searches.
2 Fast Gaze Prediction Models
Our models build on the recent state-of-the-art model DeepGaze II [6], which we first review before discussing our approach to speeding it up. The backbone of DeepGaze II is formed by VGG-19 [9], a deep neural network pre-trained for object recognition. Feature maps are extracted from several of the top layers, upsampled, and concatenated. A readout network with $1 \times 1$ convolutions and ReLU nonlinearities [10] takes in these feature maps and produces a single output channel, implementing a pointwise nonlinearity. This output is then blurred with a Gaussian filter $G_\sigma$, followed by the addition of a center bias to take into account the tendencies of observers to fixate on pixels near the image center. This center bias is computed as the marginal log-probability of a fixation landing on a given pixel, and is dataset dependent. Finally, a softmax operation is applied to produce a normalized probability distribution over fixation locations, or saliency map:

$$Q(\mathbf{y} \mid \mathbf{x}) = \mathrm{softmax}\bigl(G_\sigma * r(U(f(\mathbf{x}))) + b\bigr). \tag{1}$$

Here, $\mathbf{x}$ is the input image, $f$ extracts feature maps, $U$ bilinearly upsamples the feature maps, $r$ is the readout network, $G_\sigma$ is the Gaussian blur, and $b$ is the center bias.
To improve efficiency, we made some minor modifications in our reimplementation of DeepGaze II. We first applied the readout network and then bilinearly upsampled the one-dimensional output of the readout network, instead of upsampling the high-dimensional feature maps. We also used separable filters for the Gaussian blur. To make sure the size of the saliency map matches the size of the input image, we upsample and crop the output before applying the softmax operation.
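As an illustration of why this reordering helps, the following back-of-the-envelope FLOP count compares the two orderings. The constants (e.g., roughly 8 operations per bilinearly interpolated pixel) are rough assumptions for illustration, not measurements from the paper:

```python
def upsample_then_readout_flops(c, h, w, scale, readout_flops_per_px):
    # Upsample all c feature channels to (scale*h, scale*w), then apply the
    # readout at full resolution. Assume ~8 ops per interpolated pixel
    # (4 taps, multiply-add each).
    up = 8 * c * (scale * h) * (scale * w)
    readout = readout_flops_per_px * (scale * h) * (scale * w)
    return up + readout

def readout_then_upsample_flops(c, h, w, scale, readout_flops_per_px):
    # Apply the readout at low resolution, then upsample its single
    # output channel.
    readout = readout_flops_per_px * h * w
    up = 8 * 1 * (scale * h) * (scale * w)
    return readout + up
```

With a 512-channel feature map and an 8x upsampling factor, the readout-first ordering is cheaper by roughly the upsampling factor squared, since both the upsampling and the readout run on far fewer pixels or channels.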
We use two basic alternative architectures providing different trade-offs between computational efficiency and performance. First, instead of VGG-19, we use the faster VGG-11 architecture [9]. As we will see, the performance lost by using a smaller network can for the most part be compensated by fine-tuning the feature map representations instead of using fixed pre-trained representations. Second, we try DenseNet-121 [11] as a feature extractor. DenseNets have been shown to be more efficient, both computationally and in terms of parameters, when compared to state-of-the-art networks on the object recognition task [11].
Even when starting from these more parameter-efficient pre-trained models, the resulting gaze prediction networks remain highly over-parametrized for the task at hand. To further decrease the number of parameters we turn to pruning: greedy removal of redundant parameters or feature maps. In the following section we derive a simple, yet principled, method for greedy network pruning which we call Fisher pruning.
2.1 Fisher Pruning
Our goal is to remove feature maps or parameters which contribute little to the overall performance of the model. In this section, we consider the general case of a network with parameters $\theta$ trained to minimize a cross-entropy loss,

$$\mathcal{L}(\theta) = -\mathbb{E}_P\!\left[\ln Q_\theta(\mathbf{y} \mid \mathbf{x})\right], \tag{2}$$

where $\mathbf{x}$ are inputs, $\mathbf{y}$ are outputs, and the expectation is taken with respect to some data distribution $P$. We first consider pruning single parameters $\theta_k$. For any change in parameters $\mathbf{d}$, we can approximate the corresponding change in loss with a 2nd-order approximation around the current parameter value $\theta$:

$$\mathcal{L}(\theta + \mathbf{d}) - \mathcal{L}(\theta) \approx \mathbf{g}^\top \mathbf{d} + \tfrac{1}{2}\,\mathbf{d}^\top H \mathbf{d}, \tag{3}$$
$$\mathbf{g} = \nabla\mathcal{L}(\theta), \qquad H = \nabla^2\mathcal{L}(\theta). \tag{4}$$

Following this approximation, dropping the $k$th parameter (setting $\mathbf{d} = -\theta_k \mathbf{e}_k$) would lead to the following increase in loss:

$$\mathcal{L}(\theta - \theta_k \mathbf{e}_k) - \mathcal{L}(\theta) \approx -g_k \theta_k + \tfrac{1}{2} H_{kk} \theta_k^2, \tag{5}$$
where $\mathbf{e}_k$ is the unit vector which is zero everywhere except at its $k$th entry, where it is 1. Following related methods which also start from a 2nd-order approximation [12, 13], we assume that the current set of parameters is at a local optimum and that the 1st-order term vanishes as we average over a dataset of input images. In practice, we found that including the first term actually reduced the performance of the pruning method. For the diagonal of the Hessian, we use the approximation

$$H_{kk} \approx \mathbb{E}_P\!\left[\left(\frac{\partial}{\partial \theta_k} \ln Q_\theta(\mathbf{y} \mid \mathbf{x})\right)^{\!2}\right], \tag{6}$$

which assumes that $P(\mathbf{y} \mid \mathbf{x})$ is close to $Q_\theta(\mathbf{y} \mid \mathbf{x})$ (see Supplementary Section 1 for a derivation). Eqn. (6) can be viewed as an empirical estimate of the Fisher information of $\theta_k$, where an expectation over the model is replaced with real data samples. If $P$ and $Q_\theta$ are in fact equal and the model is twice differentiable with respect to the parameters $\theta$, the Hessian reduces to the Fisher information matrix and the approximation becomes exact.

If we use $N$ data points to estimate the Fisher information, our approximation of the increase in loss becomes

$$\Delta_k = \frac{\theta_k^2}{2N} \sum_{n=1}^{N} g_{nk}^2, \tag{7}$$

where $g_{nk}$ is the gradient of the loss on the $n$th data point with respect to the $k$th parameter. In what follows, we are going to use this as a pruning signal to greedily remove parameters one by one where this estimated increase in loss is smallest.
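As a minimal sketch of Equation 7, the following toy example computes the per-parameter Fisher pruning signal for a small logistic regression. The data, training loop, and variable names are hypothetical and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification: feature 0 is predictive, feature 1 is noise.
N = 2000
X = rng.normal(size=(N, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train logistic regression by gradient descent on the cross-entropy.
theta = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ theta)
    theta -= 0.1 * X.T @ (p - y) / N

def nll(theta):
    p = sigmoid(X @ theta)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Per-example gradients g_nk of the loss w.r.t. each parameter theta_k.
p = sigmoid(X @ theta)
G = (p - y)[:, None] * X  # shape (N, 2)

# Fisher pruning signal, Eq. (7): Delta_k = theta_k^2 / (2N) * sum_n g_nk^2
delta = theta**2 * np.mean(G**2, axis=0) / 2.0
```

On this toy problem, the signal for the noise parameter is much smaller than for the predictive one, and zeroing the noise parameter indeed increases the loss far less.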
For convolutional architectures, it makes sense to try to prune entire feature maps instead of individual parameters, since typical implementations of convolutions may not be able to exploit sparse kernels for speedups. Let $a_{nki}$ be the activation of the $k$th feature map at spatial location $i$ for the $n$th datapoint. Let us also introduce a binary mask $\mathbf{m}$ into the network which modifies the activations of each feature map as follows:

$$\tilde{a}_{nki} = m_k \cdot a_{nki}. \tag{8}$$

The gradient of the loss for the $n$th datapoint with respect to $m_k$ is

$$g_{nk} = \frac{\partial \mathcal{L}_n}{\partial m_k} = \sum_i \frac{\partial \mathcal{L}_n}{\partial \tilde{a}_{nki}}\, a_{nki}, \tag{9}$$

and the pruning signal is therefore $\Delta_k = \frac{1}{2N} \sum_n g_{nk}^2$, since $m_k = 1$ before pruning. The gradient with respect to the activations is available during the backward pass of computing the network's gradient, and the pruning signal can therefore be computed at little extra computational cost.
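Given the activations and their backpropagated gradients, the feature-map signal takes only a few lines to compute. A minimal numpy sketch with random stand-in tensors (shapes and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations a[n, k, i] (N datapoints, K feature maps,
# I spatial locations) and backpropagated gradients dL/da of the same
# shape, as would be available during a normal backward pass.
N, K, I = 8, 4, 100
a = rng.normal(size=(N, K, I))
grad_a = rng.normal(size=(N, K, I))

# Eq. (9): g_nk = sum_i (dL_n/da_nki) * a_nki, because the mask enters
# multiplicatively as m_k * a_nki.
g = np.einsum('nki,nki->nk', grad_a, a)

# Pruning signal: Delta_k = 1/(2N) * sum_n g_nk^2
delta = np.mean(g**2, axis=0) / 2.0
```

In a framework like PyTorch, `grad_a` would come from a backward hook on the masked layer, so the signal adds essentially no cost to a training step.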
We note that this pruning signal is very similar to the one used by Molchanov et al. [8] – which uses absolute gradients instead of squared gradients and applies a certain normalization to the pruning signal – but our derivation provides a more principled motivation. An alternative derivation, which does not require $P$ and $Q_\theta$ to be close, is provided in Supplementary Section 2.
2.2 Regularizing Computational Complexity
In the previous section, we have discussed how to reduce the number of parameters or feature maps of a neural network. However, we are often more interested in reducing a network's computational complexity. That is, we are trying to solve an optimization problem of the form

$$\min_\theta\; \mathcal{L}(\theta) \quad \text{subject to} \quad C(\theta) \leq K, \tag{10}$$

where $\theta$ here may contain the weights of a neural network but may also contain a binary mask describing its architecture, and $C(\theta)$ measures the computational complexity of the network. During optimization, we quantify the computational complexity in terms of floating point operations. For example, the number of floating point operations of a convolution with a bias term, filters of size $K_h \times K_w$, $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and producing a feature map with spatial extent $H \times W$ is given by

$$H W C_{\text{out}} \left(2 K_h K_w C_{\text{in}} + 1\right), \tag{11}$$

counting each multiply-accumulate as two operations and each bias addition as one. Since $H$ and $W$ represent the size of the output, this formula automatically takes into account any padding as well as the stride of a convolution. The total cost of a network is the sum of the cost of each of its layers.
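Equation 11 translates directly into code. A small helper under the same counting convention (two operations per multiply-accumulate, one per bias addition):

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k_h, k_w, bias=True):
    """FLOPs of a conv layer under the convention of Eq. (11):
    each multiply-accumulate counts as 2 operations, each bias add as 1."""
    per_output = 2 * k_h * k_w * c_in + (1 if bias else 0)
    return h_out * w_out * c_out * per_output
```

Summing `conv2d_flops` over all layers yields the total network cost used as the regularizer.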
To solve the above optimization problem, we try to minimize the Lagrangian

$$\mathcal{L}(\theta) + \beta \cdot C(\theta), \tag{12}$$

where $\beta > 0$ controls the trade-off between computational complexity and a model's performance. We compute the combined cost of removing a parameter or feature map $k$ as

$$\Delta_k - \beta \cdot C_k, \tag{13}$$

where the increase in loss $\Delta_k$ is estimated as in the previous section and $C_k$ is the corresponding decrease in the number of floating point operations. During training, we periodically estimate the cost of all feature maps and greedily prune feature maps which minimize the combined cost. When pruning a feature map, we expect the loss to go up but the computational cost to go down. For different $\beta$, different architectures will become optimal solutions of the optimization problem.
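A minimal sketch of one greedy step under Equation 13, assuming the per-feature loss increases and FLOP savings have already been estimated (the arrays below are made up):

```python
import numpy as np

def fisher_prune_step(loss_increase, flop_decrease, beta):
    """Return the index of the feature map minimizing the combined cost
    Delta_k - beta * C_k of Eq. (13)."""
    combined = loss_increase - beta * flop_decrease
    return int(np.argmin(combined))
```

For `beta = 0` this reduces to plain Fisher pruning; larger `beta` increasingly favors features whose removal saves many FLOPs.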
2.3 Automatically Tuning $\beta$
How should $\beta$ be chosen? One option is to treat it like any other hyperparameter and to train many models with different values of $\beta$. In some settings, this may not be feasible. In this section, we therefore discuss an approach which allows generating many models of different complexity in a single training run.
For a given $\beta$, a feature should be pruned if Equation 13 is negative, that is, when doing so reduces the overall cost because it decreases the computational cost more than it increases the cross-entropy:

$$\Delta_k - \beta \cdot C_k < 0. \tag{14}$$

We propose choosing the smallest $\beta$ such that, after removing all features with negative or zero pruning signal, a reduction in either a desired number of features or total computational cost is achieved.
The threshold for pruning feature map $k$ is reached when setting the weight to

$$\beta_k = \frac{\Delta_k}{C_k}. \tag{15}$$

Consider pruning only a single feature map. The smallest $\beta$ such that Equation 14 is satisfied for at least one feature map is given by $\beta^* = \min_k \beta_k$. For $k'$ with $\beta_{k'} > \beta^*$, we have

$$\Delta_{k'} - \beta^* \cdot C_{k'} > 0, \tag{16}$$

since $\beta^* < \beta_{k'} = \Delta_{k'} / C_{k'}$. That is, these feature maps should not be pruned, which means that $\beta^*$ is a reasonable choice if we only want to prune 1 feature map. We propose a greedy strategy where, in each iteration of pruning, only 1 feature map is targeted and $\beta^*$ is used as a weight. Note that we can equivalently use the ratios $\beta_k$ directly as a hyperparameter-free pruning signal. This signal is intuitive, as it picks the feature map whose increase in loss is small relative to the decrease in computational cost.
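The hyperparameter-free variant simply ranks features by the ratio of Equation 15. A sketch (the arrays are again made up):

```python
import numpy as np

def beta_signal(loss_increase, flop_decrease):
    """Ratio beta_k = Delta_k / C_k of Eq. (15): loss increase per unit
    of computation saved. The feature with the smallest ratio is pruned."""
    return loss_increase / flop_decrease
```

A feature that costs little accuracy but saves many FLOPs gets a small ratio and is pruned first, with no trade-off weight to tune.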
Another possibility is to automatically tune $\beta$ such that the total reduction in cost reaches a target if one were to remove all feature maps with negative pruning signal (Equation 14). However, we do not explore this option further in this paper.
2.4 Training
Our saliency models were trained in several steps. First, we trained a DeepGaze II model using Adam [14] with a batch size of 8 and an initial learning rate of 0.001, which was slowly decreased over the course of training. As in [6], the model was first pre-trained using the SALICON dataset [15] while using the MIT1003 dataset [16] for validation. The validation data was evaluated every 100 steps, and training was stopped when the cross-entropy on the validation data had not decreased for 20 evaluations in a row. The parameters with the best validation score observed until then were saved. Afterwards, the MIT1003 dataset was split into 10 training and validation sets and used to train 10 DeepGaze II models, again with early stopping.
The ensemble of DeepGaze II models was used to generate an average saliency map for each image in the SALICON dataset. These saliency maps were then used for knowledge distillation [7]. This additional data allows us not only to train the readout network of our own models, but also to fine-tune the underlying feature representation. We used a weighted combination of the cross-entropy for MIT1003 and the cross-entropy with respect to the DeepGaze II saliency maps, using weights of 0.1 and 0.9, respectively.
After training our models to convergence, we start pruning the network. We accumulated pruning signals (Equation 7) for 10 training steps while continuing to update the parameters before pruning a single feature map. The feature map was selected to maximize the reduction in the combined cost (Equation 13). We tried to apply early stopping to the combined cost to automatically determine an appropriate number of feature maps to prune; however, we found that early stopping terminated too early, and we therefore opted to treat the number of pruned features as another hyperparameter which we optimized via random search. During the pruning phase, we used SGD with a fixed learning rate of 0.0025 and a momentum of 0.9, as we found that this led to slightly better results than using Adam. This may be explained by a regularizing effect of SGD [17].
2.5 Related Work
Many recent papers have used pre-trained neural networks as feature extractors for the prediction of fixations [5, 18, 6, 19, 20]. Most closely related to our work is the DeepGaze approach [5, 6]. In contrast to DeepGaze, here we also fine-tune the feature representations, which, despite the limited amount of available fixation data, is possible because we use a combination of knowledge distillation and pruning to regularize our networks. Kruthiventi et al. [18] also tried to fine-tune a pre-trained network by using a smaller learning rate for pre-trained parameters than for other parameters. Vig et al. [21] trained a smaller network end-to-end but did not start from a pre-trained representation and therefore did not achieve the performance of current state-of-the-art architectures. Similarly, Pan et al. [22] trained networks end-to-end while initializing only a few layers with parameters obtained from pre-trained networks, but have since been outperformed by DeepGaze II and other recent approaches.
To our knowledge, none of this prior work has addressed the question of how much computation is actually needed to achieve good performance.
Many heuristics have been developed for pruning [23, 24, 8]. More closely related to ours are methods which try to directly estimate the effect on the loss. Optimal brain damage [12], for example, starts from a 2nd-order approximation of a squared error loss and computes second derivatives analytically by performing an additional backward pass. Optimal brain surgeon [13] extends this method and automatically tries to correct parameters after pruning by computing the full Hessian. In contrast, our pruning signal only requires gradient information which is already computed during the backward pass. This makes the proposed Fisher pruning not only more efficient but also easier to implement.
Most closely related to our pruning method is the approach of Molchanov et al. [8]. By combining a 1st-order approximation with heuristics, they arrive at a very similar estimate of the change in loss due to pruning. We found that in practice, both estimates performed similarly when used without regularization (Figure 1). The main contribution in Section 2.1 is a new derivation which provides a more principled motivation for the pruning signal.
Unlike most papers on pruning, Molchanov et al. [8] also explicitly regularized the computational complexity of the network. However, their approach to regularization differs from ours in two ways. First, a fixed weight was used for the computational cost while pruning different numbers of feature maps. In contrast, here we recognize that each setting of $\beta$ creates a separate optimization problem with its own optimal architecture. In practice, we find that the speed and architecture of a network is heavily influenced by the choice of $\beta$ even when pruning the same number of feature maps, suggesting that using different weights is important. Molchanov et al. [8] further only estimated the computational cost of each feature map once, before starting to prune. This leads to suboptimal pruning, as the computational cost of a feature map changes when neighboring layers are pruned (Figure 1).
3 Experiments
In the following, we first validate the performance of Fisher pruning on a simpler toy example. We then explore the performance and computational cost of two architectures for saliency prediction. First, we try using the smaller VGG-11 variant of Simonyan et al. [9] for feature extraction. In contrast to the readout network of Kümmerer et al. [6], which took as input feature maps from 5 different layers, we only used the output of the last convolutional layer ("conv5_2") as input. Extracting information from multiple layers is less important in our case, since we are also optimizing the parameters of the feature extraction network. As an alternative to VGG, we try DenseNet-121 as a feature extractor [11], using the output of "dense block 3" as input to the readout network. In both cases, the readout network consisted of $1 \times 1$ convolutions with parametric rectified linear units [25] and 32, 16, and 2 feature maps in the hidden layers. In the following, we will call the first network FastGaze and the second network DenseGaze.
3.1 Fisher Pruning
Method                   Error   Computational cost
LeCun et al. [26]        0.80%   100%
Han et al. [24]          0.77%   16%
Fisher ($\beta = 0$)     0.84%   26%
Fisher (fixed $\beta$)   0.79%   10%
Fisher (auto $\beta_k$)  0.86%   17%
We apply Fisher pruning to the example of a LeNet-5 network trained on the MNIST dataset [26]. We compare our method to the pruning method of Han et al. [24], which requires $\ell_1$ or $\ell_2$ regularization of the model's parameters during an initial training phase and cannot directly be applied to a pre-trained model. LeNet-5 consists of two convolutional layers and two pooling layers, followed by two fully connected layers. Following Han et al. [24], the details of the initial architecture were the same as in the MNIST example provided by the Caffe framework¹. We used 53,000 data points of the training set for training and 7,000 data points for validation and early stopping.
We find that Fisher pruning performs well, but that taking the computational cost into account is important (Table 1). Automatically choosing the weight controlling the trade-off between performance and computational cost works better than ignoring computational cost ($\beta = 0$), although not as well as using a fixed but optimized weight $\beta$.

¹ https://github.com/BVLC/caffe/
An incorrect decision of a pruning algorithm may be corrected by retraining the network's weights, especially for toy examples like LeNet-5 which are quick to train. This can mask the bad performance of a poor pruning algorithm. In Figure 1, we therefore look at the performance of various pruning techniques applied to FastGaze while keeping the parameters fixed. We alternate between estimating the pruning signal on the full MIT1003 training set and pruning 10 features at a time. We included the method of Molchanov et al. [8] as well as two naive baselines, namely pruning based on the average absolute activity of a feature map (L1A) and pruning based on the average $\ell_1$-norm of the weight vector corresponding to a feature map (L1W).
While the simple baselines are intuitive (i.e., if a feature is "off" most of the time, one might expect it to not contribute much to a network's performance), they are also inherently flawed. The activations and parameters of a layer with ReLU nonlinearities can be arbitrarily rescaled without changing the function implemented by the network, as long as the rescaling is compensated for in the parameters of the next layer. Correspondingly, we find that the baselines perform poorly.
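This rescaling argument can be checked numerically. A small sketch with a hypothetical two-layer ReLU network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer ReLU network y = W2 @ relu(W1 @ x); all weights hypothetical.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=(8, 32))  # batch of 32 inputs

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescale hidden unit k: multiply its incoming weights by alpha and its
# outgoing weights by 1/alpha. ReLU is positively homogeneous, so the
# network's function is unchanged, while L1A/L1W-style statistics for
# unit k change by a factor of alpha.
k, alpha = 3, 100.0
W1s, W2s = W1.copy(), W2.copy()
W1s[k, :] *= alpha
W2s[:, k] /= alpha
```

The network output is identical before and after rescaling, yet a norm- or activity-based signal would rank unit k completely differently, which is exactly the flaw described above.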
We further find that unregularized Fisher pruning performs as well as the method of Molchanov et al. [8], independent of whether we use the normalized or unnormalized variant of their pruning signal. However, we find that our regularization of the number of FLOPs gives better results. Using a different $\beta$ for differently strongly pruned networks appears to be important, as does updating the cost of a feature map as neighboring feature maps get pruned (Figure 1).
Finally, we also tested our alternative pruning signal, which automatically tunes the trade-off weight ($\beta_k$). The results suggest that this approach works well when the number of features to be pruned is small, but may work worse when the number of pruned features is large (Figure 1).
3.2 Pruning FastGaze and DenseGaze
To find the optimal pruning parameters, we ran multiple experiments with a randomly sampled number of pruned features and a randomly chosen trade-off parameter $\beta$. The total number of feature maps is 2803 for FastGaze and 7219 for DenseGaze. $\beta$ was chosen between 3e-4 and 3e-1. We evaluated each model in terms of log-likelihood, area under the curve (AUC), normalized scanpath saliency (NSS), and similarity (SIM). Other metrics commonly used in the saliency literature, such as sAUC or CC, are closely related to one of these metrics [27]. We used the publicly available CAT2000 [28] dataset for evaluation, which was not used during training of any of the models. While there are many ways to measure computational cost, here we were mostly interested in single-image performance on CPUs. We used a single core of an Intel Xeon E5-2620 (2.4 GHz) and averaged the speed over 6 images of 384 × 512 pixels. Our implementation was written in PyTorch [29].
Figure 2 shows the performance of various models. In terms of log-likelihood, NSS, and SIM, we find that both FastGaze and DenseGaze generalize better to CAT2000 than our reimplementation of DeepGaze II, despite the fact that both models were regularized to imitate DeepGaze II. In terms of AUC, DeepGaze II performs slightly better than FastGaze but is outperformed by DenseGaze. Pruning only seems to have a small effect on performance, as even heavily pruned models still perform well. For the same AUC, we achieve a speedup of roughly 10x with DenseGaze, while in terms of log-likelihood even our most heavily pruned model yielded better performance (which corresponds to a speedup of more than 75x). Comparing DenseGaze and FastGaze, we find that while DenseGaze achieves better AUC performance, FastGaze is able to achieve faster runtimes due to its less complex architecture.
[Figure: example predictions, with columns Data, DeepGaze II (3.59s), DenseGaze (577ms), FastGaze (356ms), and FastGaze (91ms)]
We find that explicitly regularizing computational complexity is important. For the same AUC and depending on the amount of pruning, we observe speedups of up to 2x for FastGaze when comparing regularized and non-regularized models.
In Figure 3 we visualize some of the pruned FastGaze models. We find that at lower computational complexity, optimal architectures have a tendency to alternate between convolutions with large and small numbers of feature maps. This makes sense when only considering the computational cost of a convolutional layer (Equation 11), but it is interesting to see that such an architecture can still perform well in terms of fixation prediction, which requires the detection of various types of objects.
Qualitative results are provided in Figure 4. Even at large reductions in computational complexity, the fixation predictions appear very similar. At a speedup of 39x compared to DeepGaze II, the saliency maps start to become a bit blurrier, but generally detect the same structures. In particular, the model still responds to faces, people, objects, signs, and text.
Model             AUC   KL    SIM   NSS   GFLOP
Center Bias       78%   1.24  0.45  0.92  –
eDN [21]          82%   1.14  0.41  1.14  –
SalNet [22]       83%   0.81  0.52  1.51  –
DeepGaze I [5]    84%   1.23  0.39  1.22  –
SAM-ResNet [31]   87%   1.27  0.68  2.34  –
DSCLRCN [20]      87%   0.95  0.68  2.35  –
DeepFix [18]      87%   0.63  0.60  2.26  –
SALICON [32]      87%   0.54  0.60  2.12  –
DeepGaze II [6]   88%   0.96  0.46  1.29  240.6
FastGaze          85%   1.21  0.61  2.00  10.7
DenseGaze         86%   1.20  0.63  2.16  12.8
To verify that our models indeed perform close to the state of the art, we submitted saliency maps to the MIT Saliency Benchmark [30, 33, 34]. We computed saliency maps for the MIT300 test set, which contains 300 more images of the same type as MIT1003. We evaluated a FastGaze model which took 356 ms to evaluate in PyTorch (2250 pruned features) and a DenseGaze model which took 577 ms (2701 pruned features). We find that both models perform slightly below the state of the art when evaluated on MIT300 (Table 2), but are still comparable to other recent deep saliency models. We explain the discrepancy by the fact that the submitted models were chosen for their performance on CAT2000. That is, they generalize very well to other datasets, but may have lost information about the subtleties of the MIT datasets.
4 Conclusion
We have described a principled pruning method which only requires gradients as input, and which is efficient and easy to implement. Unlike most pruning methods, we explicitly penalized computational complexity and tried to find the architecture which is optimal for a given trade-off between performance and computational complexity. With this we were able to show that the computational complexity of state-of-the-art saliency models can be drastically reduced while maintaining a similar level of performance. Together with a knowledge distillation approach, the reduced complexity allowed us to train the models end-to-end and achieve good generalization performance.
In settings where training is expensive, trying out many different parameters to tune the trade-off between computational complexity and performance may not be feasible. We have discussed an alternative pruning signal which takes computational complexity into account but is free of hyperparameters. This approach applies not only to Fisher pruning, but can be combined with any pruning signal estimating the importance of a feature map or parameter.
Less resource-intensive models are of particular importance in applications where a lot of data is processed, as well as in applications running on resource-constrained devices such as mobile phones. Faster gaze prediction models also have the potential to speed up the development of video models. The larger number of images to be processed in videos impacts training times, making it more difficult to iterate on models. Another issue is that the amount of fixation training data in existence is fairly limited for videos. Smaller models will allow for faster training times and a more efficient use of the available training data.
References
 [1] Ardizzone, E., Bruno, A., Mazzola, G. In: Saliency Based Image Cropping. Springer Berlin Heidelberg (2013) 773–782
 [2] Feng, Y., Cheung, G., Tan, W.T., Ji, Y.: Gaze-driven video streaming with saliency-based dual-stream switching. In: IEEE Visual Communications and Image Processing (VCIP). (2012)
 [3] Xu, P., Sugano, Y., Bulling, A.: Spatiotemporal modeling and prediction of visual attention in graphical user interfaces. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. (2016)
 [4] Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology 4 (1985) 219–227

 [5] Kümmerer, M., Theis, L., Bethge, M.: Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet. In: ICLR Workshop. (May 2015)
 [6] Kümmerer, M., Wallis, T.S.A., Bethge, M.: DeepGaze II: Reading fixations from deep features trained on object recognition. ArXiv e-prints (October 2016)
 [7] Hinton, G., Vinyals, O., Dean, J.: Distilling the Knowledge in a Neural Network. ArXiv e-prints (2015)

 [8] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations (ICLR). (2017)
 [9] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv e-prints (September 2014)

 [10] Nair, V., Hinton, G.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. (2010)

 [11] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
 [12] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal Brain Damage. In Touretzky, D.S., ed.: Advances in Neural Information Processing Systems 2. Morgan Kaufmann (1990) 598–605
 [13] Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal brain surgeon. In: Advances in Neural Information Processing Systems. (1993) 164–171
 [14] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR). (2015)
 [15] Jiang, M., Huang, S., Duan, J., Zhao, Q.: SALICON: Saliency in Context. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)
 [16] Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: IEEE International Conference on Computer Vision (ICCV). (2009)
 [17] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (ICLR). (2017)
 [18] Kruthiventi, S.S.S., Ayush, K., Babu, R.V.: DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations. IEEE Transactions on Image Processing 26 (2017)

 [19] Tavakoli, H.R., Borji, A., Laaksonen, J., Rahtu, E.: Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. Neurocomputing 244 (2017) 10–18
 [20] Liu, N., Han, J.: A deep spatial contextual long-term recurrent convolutional network for saliency detection. CoRR abs/1610.01708 (2016)
 [21] Vig, E., Dorr, M., Cox, D.: Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)
 [22] Pan, J., Sayrol, E., Giró-i-Nieto, X., McGuinness, K., O'Connor, N.E.: Shallow and deep convolutional networks for saliency prediction. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2016)
 [23] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: International Conference on Learning Representations (ICLR). (2017)
 [24] Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Advances in Neural Information Processing Systems. (2015)
 [25] He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In: IEEE International Conference on Computer Vision (ICCV). (2015)
 [26] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998) 2278–2324
 [27] Kümmerer, M., Wallis, T.S.A., Bethge, M.: Saliency benchmarking: Separating models, maps and metrics. ArXiv e-prints (April 2017)
 [28] Borji, A., Itti, L.: CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research. CVPR 2015 workshop on ”Future of Datasets” (2015) arXiv preprint arXiv:1505.03581.
 [29] PyTorch. https://github.com/pytorch
 [30] Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: MIT Saliency Benchmark
 [31] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an LSTM-based saliency attentive model. CoRR abs/1611.09571 (2016)
 [32] Huang, X., Shen, C., Boix, X., Zhao, Q.: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: The IEEE International Conference on Computer Vision (ICCV). (2015)
 [33] Judd, T., Durand, F., Torralba, A.: A benchmark of computational models of saliency to predict human fixations. Technical report, MIT technical report (2012)
 [34] Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? CoRR abs/1604.03605 (2016)
S1 Details of Fisher pruning
Under mild regularity conditions, the diagonal of the Hessian of the cross-entropy loss is given by

$$H_{kk} = \frac{\partial^2}{\partial \theta_k^2}\, \mathbb{E}_P\!\left[-\ln Q_\theta(\mathbf{y} \mid \mathbf{x})\right] \tag{17}$$
$$= \mathbb{E}_P\!\left[-\frac{\partial^2}{\partial \theta_k^2} \ln Q_\theta(\mathbf{y} \mid \mathbf{x})\right] \tag{18}$$
$$= \mathbb{E}_P\!\left[\left(\frac{\partial}{\partial \theta_k} \ln Q_\theta(\mathbf{y} \mid \mathbf{x})\right)^{\!2} - \frac{1}{Q_\theta(\mathbf{y} \mid \mathbf{x})} \frac{\partial^2}{\partial \theta_k^2} Q_\theta(\mathbf{y} \mid \mathbf{x})\right], \tag{19}$$

where the last step follows from the quotient rule. For the second term we have

$$\mathbb{E}_P\!\left[\frac{1}{Q_\theta(\mathbf{y} \mid \mathbf{x})} \frac{\partial^2}{\partial \theta_k^2} Q_\theta(\mathbf{y} \mid \mathbf{x})\right] \tag{20}$$
$$\approx \mathbb{E}_{P(\mathbf{x})}\, \mathbb{E}_{Q_\theta(\mathbf{y} \mid \mathbf{x})}\!\left[\frac{1}{Q_\theta(\mathbf{y} \mid \mathbf{x})} \frac{\partial^2}{\partial \theta_k^2} Q_\theta(\mathbf{y} \mid \mathbf{x})\right] \tag{21}$$
$$= \mathbb{E}_{P(\mathbf{x})}\!\left[\sum_{\mathbf{y}} \frac{\partial^2}{\partial \theta_k^2} Q_\theta(\mathbf{y} \mid \mathbf{x})\right] \tag{22}$$
$$= \mathbb{E}_{P(\mathbf{x})}\!\left[\frac{\partial^2}{\partial \theta_k^2} \sum_{\mathbf{y}} Q_\theta(\mathbf{y} \mid \mathbf{x})\right] = 0, \tag{23}$$

where we have assumed that $Q_\theta$ has been trained to convergence and $P(\mathbf{y} \mid \mathbf{x})$ is close to $Q_\theta(\mathbf{y} \mid \mathbf{x})$.
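The identity underlying this derivation, that at $P = Q_\theta$ the Fisher information matches the Hessian diagonal, can be checked numerically for a categorical model whose logits are the parameters (a sketch under that assumption, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Categorical model Q(y) = softmax(theta). For the cross-entropy
# E_P[-ln Q(y)], the Hessian w.r.t. logit theta_k has diagonal
# q_k (1 - q_k) (from diag(q) - q q^T), independent of P.
theta = rng.normal(size=5)
q = np.exp(theta - theta.max())
q /= q.sum()

hessian_diag = q * (1 - q)

# Fisher diagonal with the expectation taken under the model itself
# (P = Q): E_{y~q}[(d/dtheta_k (-ln q_y))^2], with per-sample gradient
# q_k - 1{y=k}.
fisher_diag = np.array([
    sum(q[y] * (q[k] - (y == k)) ** 2 for y in range(5))
    for k in range(5)
])
```

The two quantities agree exactly here, illustrating why Eq. (6) becomes exact when the data distribution equals the model.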
S2 Alternative derivation of Fisher pruning
Let $Q_\theta(\mathbf{y} \mid \mathbf{x})$ be our original model and $Q_{\theta,\mathbf{m}}(\mathbf{y} \mid \mathbf{x})$ be the pruned model, where we multiply the activations by binary mask parameters $\mathbf{m}$ as in Eqn. 8 of the main text. Pruning is achieved by setting $m_k = 0$ for pruned features, and $m_k = 1$ for features we wish to keep.
We can define the cost of pruning as the extent to which it changes the model's output, which can be measured by the KL divergence

$$d(\mathbf{m}) = \mathbb{E}_{P(\mathbf{x})}\!\left[D_{\mathrm{KL}}\!\left(Q_\theta(\mathbf{y} \mid \mathbf{x}) \,\|\, Q_{\theta,\mathbf{m}}(\mathbf{y} \mid \mathbf{x})\right)\right]. \tag{24}$$

This KL divergence can be approximated locally by a quadratic distance (the Fisher-Rao distance), as we will show below. First, note that when $\mathbf{m} = \mathbf{1}$, $Q_{\theta,\mathbf{m}} = Q_\theta$, so the value of the divergence is 0, and its gradients with respect to both $\theta$ and $\mathbf{m}$ are exactly $\mathbf{0}$ as well.
Thus, we can approximate $d$ by its second-order Taylor approximation around the unpruned model $\mathbf{m} = \mathbf{1}$ as follows:

$$d(\mathbf{m}) \approx \tfrac{1}{2}\, (\mathbf{m} - \mathbf{1})^\top H\, (\mathbf{m} - \mathbf{1}), \tag{25}$$

where $H$ is the Hessian of $d$ at $\mathbf{m} = \mathbf{1}$.
Pruning a single feature $k$ amounts to setting $\mathbf{m} = \mathbf{1} - \mathbf{e}_k$, where $\mathbf{e}_k$ is the unit vector which is zero everywhere except at its $k$th entry, where it is 1. The cost of pruning a single feature is then approximated as:

$$d(\mathbf{1} - \mathbf{e}_k) \approx \tfrac{1}{2} H_{kk}. \tag{26}$$

Under some mild conditions, the Hessian of $d$ at $\mathbf{m} = \mathbf{1}$ is the Fisher information matrix, which can be approximated by the empirical Fisher information. In particular, for the diagonal terms we have that:

$$H_{kk} = \left.\frac{\partial^2}{\partial m_k^2}\, \mathbb{E}_{P(\mathbf{x})}\!\left[D_{\mathrm{KL}}\!\left(Q_\theta \,\|\, Q_{\theta,\mathbf{m}}\right)\right]\right|_{\mathbf{m}=\mathbf{1}} \tag{27}$$
$$= \left.\frac{\partial^2}{\partial m_k^2}\, \mathbb{E}_{P(\mathbf{x})}\, \mathbb{E}_{Q_\theta(\mathbf{y} \mid \mathbf{x})}\!\left[\ln Q_\theta(\mathbf{y} \mid \mathbf{x}) - \ln Q_{\theta,\mathbf{m}}(\mathbf{y} \mid \mathbf{x})\right]\right|_{\mathbf{m}=\mathbf{1}} \tag{28}$$
$$= \left.\mathbb{E}_{P(\mathbf{x})}\, \mathbb{E}_{Q_\theta(\mathbf{y} \mid \mathbf{x})}\!\left[-\frac{\partial^2}{\partial m_k^2} \ln Q_{\theta,\mathbf{m}}(\mathbf{y} \mid \mathbf{x})\right]\right|_{\mathbf{m}=\mathbf{1}} \tag{29}$$
$$= \left.\mathbb{E}_{P(\mathbf{x})}\, \mathbb{E}_{Q_\theta(\mathbf{y} \mid \mathbf{x})}\!\left[\left(\frac{\partial}{\partial m_k} \ln Q_{\theta,\mathbf{m}}(\mathbf{y} \mid \mathbf{x})\right)^{\!2}\right]\right|_{\mathbf{m}=\mathbf{1}} \tag{30}$$
$$\approx \frac{1}{N} \sum_{n=1}^{N} \left.\left(\frac{\partial}{\partial m_k} \ln Q_{\theta,\mathbf{m}}(\mathbf{y}_n \mid \mathbf{x}_n)\right)^{\!2}\right|_{\mathbf{m}=\mathbf{1}} \tag{31}$$
$$= \frac{1}{N} \sum_{n=1}^{N} g_{nk}^2, \tag{32}$$

where the step from (29) to (30) uses the same argument as in Section S1, and $g_{nk}$ is defined as in Eqn. 9 of the main text.