Here we present DeepGaze II, a model that predicts where people look in images. The model uses the features from the VGG-19 deep neural network trained to identify objects in images. Contrary to other saliency models that use deep features, here we use the VGG features for saliency prediction with no additional fine-tuning (rather, a few readout layers are trained on top of the VGG features to predict saliency). The model is therefore a strong test of transfer learning. After conservative cross-validation, DeepGaze II explains about 87% of the explainable information gain in the patterns of fixations and achieves top performance in area under the curve metrics on the MIT300 hold-out benchmark. These results corroborate the finding from DeepGaze I (which explained 56% of the explainable information gain) that deep features trained on object recognition provide a versatile feature space for performing related visual tasks. We explore the factors that contribute to this success and present several informative image examples. A web service is available to compute model predictions at http://deepgaze.bethgelab.org.
Humans and other animals with foveated visual systems make several eye movements per second, bringing their high-resolution fovea to bear on things they want to see. Understanding the factors that guide eye movements is therefore an important component of understanding behaviour. One problem that has received significant attention is that of predicting fixation locations given the image the observer is viewing (usually in a free-viewing paradigm). Here we term this problem saliency prediction, in keeping with the computer vision literature (note that saliency is sometimes defined as the visibility or contrast of some image region, irrespective of whether that predicts human fixations).
The state-of-the-art in saliency prediction has improved markedly since 2014 with the advent of models using deep neural networks. The first of these models, eDN [Vig2014], trained deep neural networks on the task of saliency prediction. We subsequently boosted performance significantly above eDN in our model DeepGaze I [kuemmerer2015deepgaze] by using pretrained features (taken from AlexNet [krizhevsky2012]) trained on the ImageNet object recognition benchmark. This is therefore an example of transfer learning, where features learned on one task are re-used for a second task (with or without fine-tuning). The success of this approach is exciting because it implies that the features learned by deep neural networks on ImageNet have abstracted generalisable information from images. The transfer learning paradigm seems to be particularly important for saliency prediction because typical saliency datasets are relatively small (a few thousand images with fixations in the hundreds per image), making learning of deep neural networks from scratch [Vig2014] relatively unconstrained.
Since DeepGaze I, a variety of new models also apply transfer learning approaches using deep features. In contrast to DeepGaze I, which uses AlexNet, the SALICON model [Huang_2015_ICCV], DeepFix [Kruthiventi2015] and PDP [jetley2016] use the better-performing VGG-19 network [Simonyan2014], whose features are retrained on saliency prediction using the SALICON dataset and then fine-tuned on the MIT1003 dataset. SALICON and DeepFix substantially improved performance over DeepGaze I in the MIT benchmark ([mit-saliency-benchmark]; see below). The scale of this improvement could suggest that retraining deep features is crucial for further performance improvement, or it could suggest that the VGG features themselves (which significantly outperform AlexNet for object recognition) provide a better feature space for saliency prediction irrespective of retraining. In this paper we show the latter is the case.
Here, we introduce DeepGaze II. Relative to DeepGaze I, it uses the VGG-19 pretrained network and pretraining on the SALICON dataset. In addition, rather than using a linear predictor, DeepGaze II uses a pointwise nonlinear combination of deep features. Two additional crucial distinctions between DeepGaze II and the models discussed above (SALICON, DeepFix and PDP) are that we train our model in a probabilistic framework optimising the log-likelihood [kuemmerer2015], and that we do not re-train the VGG features themselves. DeepGaze II (as for DeepGaze I) also models the centre bias as an explicit prior.
As for DeepGaze I, we formulate DeepGaze II as a probabilistic model. Building on previous work applying probabilistic modelling to fixation prediction [vincent2009, Barthelme], we have recently shown that formulating existing models appropriately can remove most of the inconsistencies between existing model evaluation metrics [kuemmerer2015]. Furthermore, we argued that using log-likelihood (the standard way to compare probabilistic models) as an evaluation criterion represents a useful and intuitive loss function for model evaluation with close ties to information theory (though other loss functions may have advantages for certain use cases [Vig2014]). Here, we train and evaluate DeepGaze II using the framework of log-likelihood (specifically reported as information gain explained, see [kuemmerer2015]) for our in-house tests, and present key metrics from the MIT benchmark (AUC, shuffled AUC).
The architecture of DeepGaze II is visualized in Figure 1. The image in question is (possibly after resizing, see below) given as input to the VGG-19 network, from which all fully-connected layers have been removed and for which all filters have been rescaled to yield feature maps with unit variance over the ImageNet dataset [gatys2015]. After processing the image in VGG, the feature maps of a selection of layers (conv5_1, relu5_1, relu5_2, conv5_3, relu5_4; selected via random search) are rescaled and cropped to match an earlier layer (conv2_1 in our implementation). This rescaling is necessary to equate the sizes of the feature maps from different layers; conv2_1 is chosen such that spatial resolution is sufficient for precise prediction but computation time is reduced. Matching here means that we identify a pixel in the output of a convolution with the center of its receptive field in its input layer.
After rescaling and cropping, these feature maps have the same size and can be combined into one 3-dimensional tensor (with $5 \times 512 = 2560$ channels, since each of the five selected VGG-19 layers has 512 channels) which is used as input for a second neural network, called the readout network in the following. This readout network consists of four layers of 1x1 convolutions followed by ReLU nonlinearities. Therefore, the readout network is only able to represent a pointwise nonlinearity in the VGG features. The first three layers use 16, 32, and 2 features. The last layer has only one output channel $O(x, y)$. This final output from the readout network is convolved with a Gaussian $G_\sigma$ to regularize the predictions:

$$S(x, y) = (O * G_\sigma)(x, y)$$

Fixations tend to be near to the center of the image in a way which is strongly task and dataset dependent [Tatler2007Centerbias]. Therefore it is important to model this center bias and do so in a way that allows easy substitution of other centre biases (e.g. depending on the task). We do so by explicitly modelling the center bias as a prior distribution $p_{\mathrm{baseline}}$ whose logarithm is added to $S$:

$$S'(x, y) = S(x, y) + \log p_{\mathrm{baseline}}(x, y)$$

$S'$ is finally converted into a probability distribution over the image by the means of a softmax (as for DeepGaze I):

$$p(x, y) = \frac{\exp S'(x, y)}{\sum_{x', y'} \exp S'(x', y')}$$
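The pipeline from deep features to a fixation density described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not our implementation: the function names (`readout_network`, `gaussian_blur`, `log_density`) are hypothetical, a ReLU is assumed after every 1x1 convolution including the last, the blur uses edge padding, and the centre bias is passed in as a log-density map.

```python
import numpy as np

def readout_network(features, weights, biases):
    """Pointwise readout: features has shape (C, H, W)."""
    x = features
    for w, b in zip(weights, biases):
        # A 1x1 convolution is a matrix product over channels at every pixel.
        x = np.einsum("oc,chw->ohw", w, x) + b[:, None, None]
        x = np.maximum(x, 0.0)  # ReLU (assumed after every layer)
    return x[0]  # the last layer has one output channel -> shape (H, W)

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding (padding mode is an assumption)."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def log_density(features, weights, biases, sigma, log_center_bias):
    """Readout -> Gaussian blur -> add log prior -> softmax over all pixels."""
    blurred = gaussian_blur(readout_network(features, weights, biases), sigma)
    with_prior = blurred + log_center_bias
    # Log-softmax over the whole image, computed stably.
    m = with_prior.max()
    return with_prior - (m + np.log(np.exp(with_prior - m).sum()))
```

Because the centre bias enters only as an additive log prior, swapping it for a uniform prior amounts to passing a constant `log_center_bias` map.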
DeepGaze II is trained using maximum likelihood learning (see [kuemmerer2015] for an extensive discussion of why log-likelihoods are a meaningful metric for saliency modelling). If $p(x, y \mid I)$ denotes the probability distribution over $x$ and $y$ predicted by DeepGaze II for an image $I$, the log-likelihood of a dataset is

$$\frac{1}{N} \sum_{i=1}^{N} \log p(x_i, y_i \mid I_i)$$

for fixations $i = 1, \dots, N$ at locations $(x_i, y_i)$ in the image $I_i$. This loss function depends on the parameters of the readout network and the kernel size of the Gaussian used to regularize the prediction (note that it also depends on the parameters of VGG, but we do not retrain them). As it is also differentiable in these parameters, off-the-shelf optimization techniques can be used to optimize the loss. Here we use the Sum-of-Functions-Optimizer (SFO, [sohldickstein2013]), a mini-batch-based version of L-BFGS. The full training procedure consists of multiple phases and is visualized in Figure 2.
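To make the training objective concrete, the following sketch evaluates the average log-likelihood of a set of fixations given per-image log-density maps. The function name and the `(image_index, x, y)` fixation format are assumptions made for illustration; the optimiser itself (SFO) is not shown.

```python
import numpy as np

def mean_log_likelihood(log_densities, fixations):
    """Average log-likelihood per fixation.

    log_densities: list of (H, W) arrays of log-probabilities, one per image.
    fixations: iterable of (image_index, x, y) with integer pixel coordinates.
    """
    # Arrays are indexed [row, column], i.e. [y, x].
    values = [log_densities[n][y, x] for n, x, y in fixations]
    return float(np.mean(values))
```

Maximising this quantity with respect to the readout weights and the blur width corresponds to the maximum likelihood training described above.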
In the pretraining phase, the readout network is initialized with random weights and trained on the SALICON dataset [Jiang2015]. This dataset consists of 10000 images with pseudofixations from a mouse-contingent task and has proven to be very useful for pretraining saliency models [Huang_2015_ICCV, Kruthiventi2015, jetley2016]. All images are downsampled by a factor of two. We use 100 images per mini-batch for the SFO.
The MIT1003 dataset is used to determine when to stop the training process. After each iteration over the whole dataset (one epoch) we calculate the performance of the model on the MIT1003 (test) dataset. We wish to stop training when the test performance starts to decrease (due to overfitting). We determine this point by comparing the performance from the last three epochs to the performance five epochs before those. Training runs for at least 20 epochs, and is terminated if all three of the last epochs show decreased performance or if 800 epochs are reached. As it is more expensive to use images of many different sizes, we resized all images from the MIT1003 dataset to one of two fixed sizes, depending on their aspect ratio, before downsampling by a factor of two.
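The stopping rule can be written as a small predicate over the history of per-epoch test performance. The sketch below is one possible reading of the rule, not a transcription of our code: it assumes that higher values are better and that each of the last three epochs is compared with the epoch five steps before it. The function and parameter names are hypothetical.

```python
def should_stop(history, min_epochs=20, max_epochs=800, window=3, lag=5):
    """Decide whether to stop, given test performance after each epoch.

    history: list of per-epoch performance values (higher is better).
    """
    n = len(history)
    if n >= max_epochs:
        return True  # hard limit on training length
    if n < max(min_epochs, window + lag):
        return False  # always train for a minimum number of epochs
    # Stop only if every one of the last `window` epochs is worse than
    # the epoch `lag` steps before it.
    return all(history[-k] < history[-k - lag] for k in range(1, window + 1))
```

Requiring all three recent epochs to be worse makes the rule robust to a single noisy epoch.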
After pretraining, the model is fine-tuned on the MIT1003 dataset. As DeepGaze I showed that overfitting to images is in fact a much larger problem than overfitting to subjects, DeepGaze II is crossvalidated over images: the images from the dataset are randomly split into 10 parts of equal size. Then ten models are trained starting from the result of the pretraining, each one using nine of the ten parts for training and the remaining part for the stopping criterion (following the same stopping criterion as above). We use 10 images per mini-batch in the SFO.
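The image-wise crossvalidation split can be illustrated as follows. This is a minimal sketch with hypothetical names; the seed and the way leftover images are distributed are assumptions (with 1003 images the ten parts cannot be exactly equal, so here they differ in size by at most one image).

```python
import random

def crossvalidation_folds(n_images, n_folds=10, seed=0):
    """Return a list of (train_indices, validation_indices), one per fold."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # random split over images
    parts = [indices[k::n_folds] for k in range(n_folds)]
    folds = []
    for k in range(n_folds):
        # Nine parts for training, the remaining part for the stopping criterion.
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        folds.append((train, parts[k]))
    return folds
```

Each image appears in exactly one validation part, so for every image there is exactly one model that never saw it during training.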
When evaluating on any dataset but the MIT1003 dataset, we use a mixture of these ten models. This holds specifically for the MIT300 dataset from the MIT Saliency Benchmark. When evaluating on the MIT1003 dataset for our in-house analyses, for each image we use the model which has not been trained using this image.
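A mixture of the ten crossvalidated models can be computed directly from their log-density maps. The sketch below assumes equal mixture weights, which is an assumption made for illustration; the function name is hypothetical.

```python
import numpy as np

def mixture_log_density(log_densities):
    """Equal-weight mixture of K densities.

    log_densities: array-like of shape (K, H, W) holding log-probabilities.
    Returns the (H, W) log-density of the mixture.
    """
    stacked = np.asarray(log_densities)
    # Average the probabilities (not the log-probabilities), computed stably.
    m = stacked.max(axis=0)
    return m + np.log(np.mean(np.exp(stacked - m), axis=0))
```

Averaging probabilities rather than log-probabilities keeps the result a properly normalised density.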
How well does the DeepGaze II model perform on saliency prediction relative to other saliency models? We first consider this from the standpoint of information theory (information gain explained) evaluated on a subset of the MIT1003 dataset (as used in [kuemmerer2015, kuemmerer2015deepgaze]), and second present results from the MIT saliency benchmark website on the held-out MIT300 set.
In [kuemmerer2015], we described the calculation of information gain explained (an intuitive transformation of log-likelihood). Information gain tells us what the model knows about the data beyond the baseline model, which here is the image-independent centre bias, expressed in bits / fixation:

$$IG(\hat{p} \,\|\, p_{bl}) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log_2 \hat{p}(x_i, y_i \mid I_i) - \log_2 p_{bl}(x_i, y_i) \right]$$

where $\hat{p}(x_i, y_i \mid I_i)$ is the density of the model at location $(x_i, y_i)$ when viewing image $I_i$, and $p_{bl}$ is the density of the baseline model. Information gain explained relates the model’s information gain to the gold standard information gain (the gold standard being the crossvalidated prediction of all subjects from all other subjects, sometimes called the “empirical saliency map”). It is the proportion of the gold standard information gain accounted for by the model:

$$IGE(\hat{p}) = \frac{IG(\hat{p} \,\|\, p_{bl})}{IG(p_{gold} \,\|\, p_{bl})}$$

where $p_{gold}$ is the density of the gold standard model.
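Both quantities are straightforward to compute once the model, baseline and gold standard densities have been evaluated at the fixation locations. The following sketch assumes the inputs are natural-log densities at the same fixations, in the same order; the function names are hypothetical.

```python
import numpy as np

def information_gain(log_p_model, log_p_baseline):
    """Information gain over the baseline, in bits per fixation.

    Both arguments are natural-log densities evaluated at each fixation.
    """
    diff = np.asarray(log_p_model) - np.asarray(log_p_baseline)
    return float(np.mean(diff) / np.log(2))  # convert nats to bits

def information_gain_explained(log_p_model, log_p_gold, log_p_baseline):
    """Proportion of the gold standard information gain reached by the model."""
    return (information_gain(log_p_model, log_p_baseline)
            / information_gain(log_p_gold, log_p_baseline))
```

A model no better than the centre bias scores 0, and a model matching the gold standard scores 1.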
To remain consistent with our previously published work [kuemmerer2015, kuemmerer2015deepgaze], we evaluate DeepGaze II on a subset of the MIT1003 dataset consisting of all images of one particular size. For each image in this set, there is exactly one model from the fine-tuning crossvalidation procedure that did not use that image for training. We use the density from this model for evaluation. This means we are reporting test performance, crossvalidated over images, as opposed to training performance.
The gold standard model is essentially a Gaussian kernel density estimate that predicts one subject’s fixations for a given image from the fixations of all other subjects. That is, the gold standard model is an image-specific prediction crossvalidated over subjects, and as for the models we report test not training performance.
Figure 3 shows the information gain explained for DeepGaze II against that for DeepGaze I and the models evaluated in [kuemmerer2015]. DeepGaze II accounts for 87% of the explainable information gain, a substantial improvement from DeepGaze I’s 56%, and begins to approach the upper limit (according to the gold standard) of performance in saliency prediction. Note that we currently do not include models that improved on DeepGaze I on the MIT benchmark (SALICON, DeepFix and PDP) in this evaluation because the code for these models is not publicly available.
We can also evaluate candidate models according to their performance relative to the gold standard for each image in the dataset (Figure 4). Here, one can see that the AIM, eDN and DeepGaze I model predictions fall largely below the gold standard, and all include a number of images with negative information gain (meaning that the models make worse predictions than the baseline for those images). DeepGaze II clusters much closer to the gold standard predictions (diagonal line) and there are no images for which its prediction is worse than the baseline. Note that it is possible to have images for which the model prediction is better than the gold standard. There can be at least two reasons for this: first, it can be that fixations cluster in smaller areas than predicted by the gold standard (recall that the gold standard kernel size is learned over all images); second, there could be subjects who are inconsistent relative to other subjects but still look at areas that a model can predict. In this case the gold standard model performs poorly when predicting that subject relative to the model (recall that the gold standard performances are test performances).
The area under the ROC curve (AUC) metric expects saliency maps to include the centre bias, whereas shuffled AUC expects models to exclude the centre bias [Barthelme, kuemmerer2015]. Because the DeepGaze II architecture makes it trivial to include the centre bias in the model prediction or to exclude it, we submitted two sets of saliency maps to the MIT benchmark: one uses the centre bias trained on the MIT1003 dataset, the other uses a uniform centre bias. In addition, because the MIT Benchmark requires submission of model predictions as JPEG images, we quantised the log density for each image into 256 values such that each value receives the same number of pixels.
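The quantisation step amounts to a rank-based histogram equalisation of the log density. The sketch below shows one way to do this under stated assumptions: the function name is hypothetical, ties are broken by pixel position, and the levels hold exactly equal pixel counts only when the number of pixels is a multiple of 256.

```python
import numpy as np

def equalize_to_uint8(log_density):
    """Map a log-density map to 256 grey levels with equal pixel counts."""
    flat = log_density.ravel()
    # Rank of every pixel (0 = lowest density); argsort of argsort gives ranks.
    ranks = np.argsort(np.argsort(flat, kind="stable"), kind="stable")
    # Split the ranks into 256 equally sized bins.
    levels = (ranks * 256) // flat.size
    return levels.reshape(log_density.shape).astype(np.uint8)
```

Since AUC-type metrics depend only on the ordering of saliency values, this mapping preserves the ranking up to the resolution of 256 levels.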
Table 1 reports the results of evaluating DeepGaze II on the MIT saliency benchmark (the held-out MIT300 set). For AUC, DeepGaze II beats the nearest competitors, SALICON and DeepFix, by one percent. For shuffled AUC, DeepGaze II beats the nearest competitors by a larger margin (note that this could be due in part to those models not excluding centre biases).
Figure 5 shows the three images for which DeepGaze II explained the most of the explainable information gain in the patterns of fixations, and Figure 6 shows the worst. For visualising probability densities, we include three contour lines which together divide the map into four regions. Each region has the same probability mass: that is, the model expects each area to receive the same number of fixations on average. If the dark areas are very concentrated, then the model expects a small area to receive most of the fixations. In addition, for each image we sample from each model to obtain the same number of fixations as for the ground truth fixations. Sampling is straightforward because the density predicted by the model is a multinomial distribution over the pixels. This allows an intuitive comparison of model and data. Note that both of these analysis approaches are only possible using a probabilistic model.
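Both visualisation steps follow directly from treating the prediction as a probability distribution over pixels. The sketch below computes the density values at which contour lines split the map into regions of equal probability mass, and draws fixation samples from the density. Function names are hypothetical, and the choice of taking regions from the highest-density pixels downwards is an assumption about how the contours are defined.

```python
import numpy as np

def equal_mass_thresholds(density, n_regions=4):
    """Density values of contour lines separating regions of equal mass."""
    p = np.sort(density.ravel())[::-1]  # highest density first
    cumulative = np.cumsum(p) / p.sum()
    targets = np.arange(1, n_regions) / n_regions  # e.g. 0.25, 0.5, 0.75
    idx = np.searchsorted(cumulative, targets)
    return p[idx]

def sample_fixations(density, n, rng):
    """Draw n pixel locations (x, y) with probability given by the density."""
    flat = density.ravel() / density.sum()
    idx = rng.choice(flat.size, size=n, p=flat)
    ys, xs = np.unravel_index(idx, density.shape)
    return list(zip(xs.tolist(), ys.tolist()))
```

Drawing as many samples as there are recorded fixations for an image gives a scatter plot that can be compared with the data by eye.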
Some interesting patterns to consider include the first image in Figure 6, which is a photograph of a bakery shopfront. Humans fixate on the baked goods (which DeepGaze II captures) and on the store logo imprinted on the window in the upper right of the image (which DeepGaze II fails to capture, presumably because it does not detect the low-contrast, partially-occluded text). For the third image of Figure 6, people fixate on the signage above the storefront, which in the image is distorted by perspective projection. Both DeepGaze I and II appear to miss this text. This might be because both the VGG and AlexNet fail to provide features sensitive to such distorted text, or because distorted text is so rare in the training set that the contributions of these features are downweighted by our training procedure. In either case, these two examples highlight one potential avenue for model improvement (better training on text).
Why is DeepGaze II better than DeepGaze I? We quantified the contributions of the three primary changes from DeepGaze I to DeepGaze II on the MIT1003 dataset. (Note that the DeepGaze I model used here is not the original DeepGaze I model as presented in [kuemmerer2015]. Here we have trained on the full MIT1003 dataset and used the same scheme of crossvalidation over images as described in this paper.) As seen in Figure 7, the largest single improvement is brought by using the pretrained VGG features in place of AlexNet (though we also include more channels from VGG than from AlexNet). Using the readout network rather than a linear regression slightly decreases performance when considered independently, likely due to overfitting. Training on the SALICON dataset marginally improved performance. Combining SALICON pretraining with the VGG features yields the largest intermediate model performance improvement.
We additionally provide examples of images for which DeepGaze II improves most from DeepGaze I (Figure 8) and performs worse than DeepGaze I (Figure 9) in terms of information gain differences (in bit/fix). The improvement for the first image in Figure 8 seems to be driven by better recognition of text, whereas for the second and third images DeepGaze II seems to benefit from improved (or more spatially-specific) face and person detection.
Here we have presented DeepGaze II, a model of saliency prediction that uses transfer learning from the VGG-19 network to achieve state-of-the-art performance. Information gain explained is able to quantify precise differences between models, and shows the clear improvement gained by DeepGaze II (note, however, that some high-performing models were not included in these evaluations because their code is not publicly available). Our model is also ranked first on the held-out MIT300 benchmark according to AUC and shuffled AUC, the most commonly reported evaluation metrics. Note, however, that here, at least for AUC, the difference between DeepGaze II and other models is modest.
Why does DeepGaze II perform better relative to other models that also use deep features? We believe this could be because, at least in part, we do not retrain the VGG features. While this reduces the model space, it also greatly reduces the number of parameters that must be learned from data, reducing the chance of overfitting. Furthermore, since we only use convolutions on top of this, we cannot learn new features that are substantially different from VGG: only a pointwise nonlinearity is possible. These two aspects of our model therefore represent a much more stringent test of the transfer success of deep features.
We provide a web service to calculate DeepGaze II predictions for arbitrary images at http://deepgaze.bethgelab.org.