Visual Attention : what is salient in an image with DeepRare2019
The human visual system is modeled in engineering through feature-engineered methods that detect contrasted, surprising, or unusual data in images. This data is "interesting" for humans and leads to numerous applications. Deep neural networks (DNNs) drastically improved the efficiency of these algorithms on the main benchmark datasets. However, DNN-based models are counter-intuitive: surprising or unusual data is by definition difficult to learn because of its low occurrence probability. In practice, DNN-based models mainly learn top-down features such as faces, text, people, or animals, which usually attract human attention, but they are inefficient at extracting surprising or unusual data in images. In this paper, we propose a model called DeepRare2019 (DR) which combines the power of DNN feature extraction with the genericity of feature-engineered algorithms. DR 1) does not need any training, 2) takes less than a second per image on CPU only, and 3) is generic: our tests on three very different eye-tracking datasets show that it is always in the top-3 models on all datasets and metrics, while no other model exhibits such regularity and genericity. The DeepRare2019 code can be found at https://github.com/numediart/VisualAttention-RareFamily
Where do people look in images on average? At rare, and thus surprising, things! Let's compute them automatically.
The human visual system handles a huge quantity of incoming visual information, and it cannot carry out multiple complex tasks at the same time on the whole visual field. This bottleneck [donald1958] implies that it has an exceptional ability to sample the surrounding world and pay attention to objects of interest. In computer vision, visual attention is modeled through so-called saliency maps. The modeling of visual attention has numerous applications such as object detection, image segmentation, image/video compression, robotics, image re-targeting, visual marketing, and so on [mancas2016].
Since the early 2000s, numerous models of visual attention based on image features have been proposed. In this paper, they will be referred to as “classical models”. While they can be very different, most of them share the same main philosophy: search for contrasted, rare, abnormal or surprising features within a given context. Among those models one may find the seminal work of [itti2000] or [rose1999], but also more recent work based on information processing such as AIM [aim]. Finally, some models became references among classical models, such as GBVS [gbvs], RARE [rare2012], BMS [bms2013] or AWS [aws].
With the arrival of the deep learning wave, most researchers have focused on Deep Neural Network saliency models, which will be referred to as “DNN-based” in this paper. DNN-based models triggered a revolution in terms of results on the main benchmark datasets such as the MIT benchmark [mit-saliency-benchmark], where DNN-based saliency models definitely outperformed classical models. DNN-based models have already been used in several applications such as image and video processing, medical signal processing, big data analysis, and saliency modeling as well [sun15], [zhao15], [qin15], [han14], [sun16]. Some of the DNN-based models became new references, such as SALICON [salicon2015], MLNet [mlnet2016] or SAM-ResNet [sam2018].
However, DNN-based models have recently been criticized for some drawbacks. They underestimate the importance of bottom-up attention [kum17], which indicates that they were mostly trained to detect attractive top-down objects rather than saliency itself. In [kong19] the authors found that while saliency models detect top-down features very precisely, they neglect a lot of bottom-up information which is surprising and rare, and thus by definition difficult to learn. This shows that saliency itself cannot be learnt; instead, the models learn objects which are often attended by human gaze (such as faces, text, bodies, etc.), and these incidentally are enough to provide good results on the main benchmarks. Recently, [Kotseruba2019] introduced two novel datasets, one based on psycho-physical patterns (P3) and one based on natural odd-one-out (O3) stimuli. They showed that while DNN-based models are good on the MIT dataset of natural images, their results drop drastically on P3 and O3. This shows that, in addition to not taking low-level features into account, DNN-based models are not generic enough to adapt to new images which differ sufficiently from the training dataset.
In parallel to DNN-based models, DeepFeat [deepFeat] and SCAFI [scafi] are models in which pre-trained deep features are used. Those models will be called “deep-features models” in this paper. However, they are not yet comparable to DNN-based models on general image datasets such as the MIT benchmark.
Based on the new datasets in [Kotseruba2019], we provide a new deep-features saliency model called DeepRare2019, which mixes deep features with the philosophy of an existing classical model [rare2012]. It is efficient on all the datasets, needs no training, is computationally efficient even on CPU, and is easily usable with any DNN architecture.
Our contribution is in mixing the simplicity of rarity computation, used to find the most salient features, with the advantages of deep feature extraction. Indeed, rare features attract human attention as they are surprising compared to the other features within the image. The resulting model is called DeepRare2019 (DR). This combination has the advantage of being fast (less than 1 second per image on CPU with a VGG16 feature extractor) and easy to adapt to any default DNN architecture (VGG19, ResNet, etc.).
A convolutional network is a great tool for feature extraction. When trained on a general dataset such as ImageNet, the network extracts a complete set of features that one finds in images at several scales (from very low-level in the first layers to very high-level in the last ones). We decide here to use a VGG16 architecture with its default training on the ImageNet dataset as a feature extractor, but any other architecture could be used as well. Our implementation is based on the Keras framework to extract the convolutional layers and the feature maps within those layers. We do not use (1) the pooling layers (as they are redundant with the previous convolutional layer) and (2) the final fully connected classification layers. An example for layer 1 is illustrated in Figure 1.
In VGG16, the convolutional layers are gathered into 5 groups separated by the pooling layers: 1) the first low-level features in layers 1 and 2, 2) the second set of low-level features in layers 4 and 5, 3) the first middle-level features in layers 7, 8 and 9, 4) the second middle-level features in layers 11, 12 and 13, and 5) the high-level features in layers 15, 16 and 17.
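The grouping above can be written down as a small sketch. The layer numbers are the ones given in the text and match the layer indices of the Keras VGG16 model without its classification head (index 0 being the input layer). The group names and both helper functions are hypothetical, chosen here for illustration only and not taken from the DeepRare2019 code.

```python
# Layer indices of the 13 VGG16 convolutional layers, grouped as in the text.
# Indices 3, 6, 10, 14 and 18 are the pooling layers and are skipped.
# The group names are hypothetical labels used only in this sketch.
CONV_LAYER_GROUPS = {
    "low_1": [1, 2],
    "low_2": [4, 5],
    "mid_1": [7, 8, 9],
    "mid_2": [11, 12, 13],
    "high": [15, 16, 17],
}


def conv_layer_indices():
    """Flatten the groups into one ordered list of layer indices."""
    return [i for group in CONV_LAYER_GROUPS.values() for i in group]


def build_feature_extractor(weights="imagenet"):
    """Return a Keras model that outputs the feature maps of all 13 conv layers.

    TensorFlow is imported lazily so that the grouping above can be inspected
    without it. Pass weights=None to skip downloading the ImageNet weights.
    """
    import tensorflow as tf

    # include_top=False drops the fully connected classification layers.
    vgg = tf.keras.applications.VGG16(weights=weights, include_top=False)
    outputs = [vgg.layers[i].output for i in conv_layer_indices()]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)
```

Calling the extractor on a preprocessed image batch would return a list of 13 tensors, one per convolutional layer, each holding that layer's feature maps along the last axis.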
On each feature map within the layers we compute the data rarity. For that we use the main idea from [rare2012] without the multi-resolution part, which is naturally achieved by the VGG16 architecture (and also by most other architectures). A very simple rarity function R, based on the histogram of each feature map sampled on a few bins (11 in the current implementation), is used as in equation 1.
In equation 1, p(i) is the occurrence probability of the pixels in bin i. Once the rarity histogram R is computed, the resulting rarity image is reconstructed by backprojection. This operation takes the histogram of a feature (here, the rarity of a feature) and projects each histogram value onto the corresponding pixels of the image. The resulting image highlights pixels in the feature map which are rare compared to the other pixels in that feature map. Based on [rare2012], rare pixels are the ones which might attract human attention. Rarity is computed on each feature map of each layer, as can be seen for the 64 feature maps of layer 1 in Figure 1.
The advantage of this approach is that it is very fast to compute and this is important as it needs to be applied to numerous feature maps.
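As a minimal sketch of the rarity step described above, the following NumPy function builds an 11-bin histogram of one feature map, turns bin probabilities into a rarity value, and backprojects those values onto the pixels. It assumes that the rarity of a bin is its self-information, R(i) = -log(p(i)), in the spirit of [rare2012]; this form, the equal-width binning and the function name are assumptions of the sketch, not details confirmed by the text.

```python
import numpy as np


def rarity_map(feature_map, n_bins=11, eps=1e-12):
    """Histogram-based rarity of one 2-D feature map, backprojected to pixels.

    Assumption: the rarity of bin i is the self-information -log(p(i)).
    """
    fmap = np.asarray(feature_map, dtype=np.float64)
    lo, hi = fmap.min(), fmap.max()
    if hi == lo:
        # A constant map occupies a single bin: nothing is rare.
        return np.zeros_like(fmap)

    # Assign every pixel to one of n_bins equal-width bins.
    bins = np.minimum(((fmap - lo) / (hi - lo) * n_bins).astype(int), n_bins - 1)

    # Occurrence probability p(i) of each bin.
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    p = counts / counts.sum()

    # Rarity per bin; eps only avoids log(0) for empty bins, which no pixel uses.
    rarity = -np.log(p + eps)

    # Backprojection: each pixel receives the rarity of the bin it falls in.
    return rarity[bins]
```

Because the work per map is one histogram and one table lookup, the cost stays low even when the function is applied to every feature map of every layer.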
Once the rarity of all feature maps is computed, the results need to be fused together. We use a classical map fusion from [itti2004] where the fusion weights depend on the squared difference between the max and the mean of each map. This is applied to all feature maps within each layer leading to 13 deep layer conspicuity maps (DLCM), one for each convolutional layer in VGG16 (see Figure 1 for first layer).
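The fusion rule can be sketched as follows, assuming that each map is rescaled to [0, 1] before its weight is computed and that the fused result is rescaled again. The text only states that weights depend on the squared difference between the max and the mean of each map, so the two normalization steps and the function names are assumptions of this sketch.

```python
import numpy as np


def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=np.float64)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)


def fuse_maps(maps):
    """Weighted sum of maps, each weighted by (max - mean)^2 of that map."""
    fused = np.zeros_like(np.asarray(maps[0], dtype=np.float64))
    for m in maps:
        m = normalize(m)
        # A map with one strong peak has max >> mean and gets a large weight;
        # a map that is active everywhere has max close to mean and is damped.
        weight = (m.max() - m.mean()) ** 2
        fused += weight * m
    return normalize(fused)
```

Applied to the rarity maps of one layer, `fuse_maps` would give that layer's conspicuity map; the same function could then be reused at the group level.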
In a second stage, the same fusion method is applied within each of the 5 layer groups, leading to 5 deep group conspicuity maps (DGCM). This fusion is designed to give more importance to higher-level layers.
Finally, the 5 DGCM are summed up. A top-down face map is added based on feature map #105 from layer 15, which is known to detect faces which are large enough [scafi].
We use 3 datasets, namely MIT1003 [Judd2009], P3, and O3 [Kotseruba2019], to validate our results. The MIT dataset has general-purpose real-life images. The P3 dataset evaluates the ability of saliency algorithms to find singleton targets that differ in color, orientation, or size (without center bias). The O3 dataset depicts scenes with multiple objects similar to each other in appearance (distractors) and a singleton (target) that differs in color, shape, or size (with center bias). We decided to use these 3 very different datasets to check how saliency models behave when facing images in different contexts.
Concerning metrics, we use measures from [Kotseruba2019]. The “number of fixations” (# fix.) is the length of the path formed by the saliency maximum followed by the other maxima of the saliency map before reaching the target. The global saliency index (GSI) measures how well the mean saliency of the target is distinguished from that of the distractors. The maximum saliency ratio (MSR) compares the maximum saliency of the target with that of the distractors (MSRt) and the maximum saliency of the background with that of the target (MSRb) [Wloka2016]. We also use standard eye-tracking evaluation metrics from the MIT benchmark [mit-saliency-benchmark] such as CC, KL, AUC Judd, AUC Borji, NSS, and SIM.
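The first two metrics can be illustrated with a short sketch. It assumes that fixations are generated by repeatedly taking the saliency maximum and suppressing a small square neighbourhood around it (inhibition of return), and that GSI takes the form (S_T - S_D) / (S_T + S_D), where S_T and S_D are the mean saliency inside the target and distractor masks. Both the suppression scheme with its `radius` parameter and the exact GSI form are assumptions of this sketch and should be checked against [Kotseruba2019].

```python
import numpy as np


def fixations_to_target(saliency, target_mask, radius=1, max_fix=100):
    """Count fixations until a saliency maximum lands on the target.

    Returns the 1-based fixation count, or None if the target is not reached
    within max_fix fixations. `radius` is a hypothetical parameter.
    """
    s = np.array(saliency, dtype=np.float64)  # copy: the map is modified below
    target_mask = np.asarray(target_mask, dtype=bool)
    for n in range(1, max_fix + 1):
        r, c = np.unravel_index(np.argmax(s), s.shape)
        if target_mask[r, c]:
            return n
        # Inhibition of return: suppress the neighbourhood just visited.
        s[max(r - radius, 0):r + radius + 1,
          max(c - radius, 0):c + radius + 1] = -np.inf
    return None


def global_saliency_index(saliency, target_mask, distractor_mask):
    """Assumed form: GSI = (S_T - S_D) / (S_T + S_D), using mean saliency."""
    s = np.asarray(saliency, dtype=np.float64)
    s_t = s[np.asarray(target_mask, dtype=bool)].mean()
    s_d = s[np.asarray(distractor_mask, dtype=bool)].mean()
    total = s_t + s_d
    return float((s_t - s_d) / total) if total != 0 else 0.0
```

With this form, GSI is positive when the target is on average more salient than the distractors and negative otherwise. The percentage of targets found after N fixations follows by calling `fixations_to_target` with `max_fix=N` over a dataset and counting the results that are not None.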
We compare our model to other models on the P3 and O3 datasets. [Kotseruba2019] observe that most classical models perform better on P3 than DNN-based models. In contrast, DNN-based models perform better on O3.
Figure 2 shows six samples from the P3 dataset which exhibit color, orientation and size differences of the target. While distractors are still visible on the DR saliency map, the targets are always correctly highlighted, compared to RARE2012, which works well mainly for colors, and two DNN-based models (MLNet and SALICON), which only work on one sample. Figure 3 shows images from the O3 dataset for different target categories (easy or difficult). Again, our model highlights the target better than the DNN-based models. DR seems equivalent on average to RARE. Figure 4 shows images from the MIT1003 dataset. DR always finds the GT focus regions (except for the right image, where one GT focus is just in the middle, probably due to the center bias), but it also shows details around those focus areas which might decrease its scores on MIT1003.
Overall, DeepRare2019 has the most stable behaviour, performing well on both datasets, while the other models might be good on some images but much worse on others.
We carry out a quantitative validation of our model on three datasets. First, we use the MIT1003 dataset, which shows general-purpose images where learning objects is very important. This dataset should favor DNN-based models, which focus on objects (faces, text, etc.) rather than on salient information. Second, we use the O3 dataset from [Kotseruba2019], which also provides real-life images but with odd-one-out regions. This dataset should be of similar difficulty for classical and DNN-based saliency models. Finally, we use the P3 dataset from [Kotseruba2019], which shows synthetic psycho-physical images with pop-out objects and should work better for classical saliency models.
We summarize in Table I the results of DeepRare2019 and also results coming from [Kotseruba2019] for MLNet and SALICON where MLNet was trained with SALICON, P3 and O3 datasets and SALICON was trained with OSIE, P3 and O3 datasets. For other models (DeepFeat, eDN, GBVS, RARE2012, BMS, AWS), the figures come from [deepFeat].
We observe that our model is worse than SALICON (and probably than newer models such as SAM-ResNet), but equivalent to MLNet and better than the other DNN-based models. It is also better than DeepFeat and all classical models.
The O3 dataset uses the MSR metric defined in [Kotseruba2019]. A higher MSRt is better, as it means that the target is well highlighted compared to the distractors. A lower MSRb is better, as it means that the maximum saliency of the target is higher than that of the background. The first measure ensures that the target is visible compared to the distractors, and the second that it is visible compared to the background.
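A minimal sketch of the two ratios follows, consistent with the reading above: MSRt divides the maximum target saliency by the maximum distractor saliency, and MSRb divides the maximum background saliency by the maximum target saliency. Treating the background as every pixel outside both masks, and assuming non-empty masks with non-zero maxima, are simplifications made here; the function name is hypothetical.

```python
import numpy as np


def max_saliency_ratios(saliency, target_mask, distractor_mask):
    """Return (MSRt, MSRb) for one saliency map.

    MSRt = max(target) / max(distractors)   -> higher is better
    MSRb = max(background) / max(target)    -> lower is better
    Assumption: background is every pixel outside both masks.
    """
    s = np.asarray(saliency, dtype=np.float64)
    t = np.asarray(target_mask, dtype=bool)
    d = np.asarray(distractor_mask, dtype=bool)
    background = ~(t | d)

    max_t = s[t].max()
    max_d = s[d].max()
    max_b = s[background].max()
    return max_t / max_d, max_b / max_t
```

An MSRb below 1 then means exactly that the most salient target pixel beats the most salient background pixel.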
Table II shows the MSR from [Kotseruba2019], to which we added DeepRare2019 at the end, with the dataset split between the images where color is a good discriminator (Color) and the others (Non-color). All models work better for targets where color is an important feature and worse for non-color targets.
For MSRt (higher is better) on Color images, our model performs worse, especially compared to DNN-based models. However, for Non-color images, where the models fail much more often, DeepRare2019 shows remarkable stability, being second and very close to the best one (SAM-ResNet).
If we take into account the MSRb (lower is better), our model clearly outperforms all the others, providing the best discrimination between the target and the background. DeepRare2019 is the only model with an MSRb smaller than 1, which means that on average the maximum of the target saliency is higher than the maximum of the background saliency.
The P3 dataset is the one which exhibits the least top-down information, and it does not even have any center bias. Naturally, for this dataset, the DNN-based models perform the worst. We check here how DeepRare2019 deals with this data.
First we use the average number of fixations and the percentage of targets found. Table IV shows the results on P3 for DeepRare2019 compared with the SALICON and MLNet models. Our model definitely outperforms the two DNN-based models and needs far fewer fixations to discover more of the targets, showing very good results here.
Table IV. Columns: Model, Avg. # fix., % found.
Figure 5 shows that, compared to state-of-the-art models (top graph), our model (bottom graph) ranges from 80 % of targets found after 15 fixations to 88 % of targets found after 100 fixations. Finding more than 80 % of the targets after only 15 fixations is much better than all tested DNN-based models and all classical models except IMSIG [imsig2012], which has equivalent results.
For the GSI score, Figures 6, 7 and 8 let us compare the three best classical models and the three best DNN-based models on the left with the DeepRare2019 results on the right. For color targets (Figure 6, right graph) we see that the maximum GSI score for DR is 0.56, which puts our model below BMS, RARE2012 and IMSIG but much above all the other models.
In addition, the shape of the GSI curve exhibited by DeepRare2019 is coherent from a biological point of view: if the difference between the target color and the distractor color is small, then the model detects the target less well (left side of the curve) than when the color of the target and background is very different (right side of the curve). Our model is the only one to provide a biologically plausible GSI curve.
For orientation targets (Figure 7, right graph) we see that the maximum GSI score is about 0.22. This makes DeepRare2019 better than any other model in terms of maximum.
Also, the shape of the GSI curve exhibited by DeepRare2019 is again coherent from a biological point of view: if the difference between the target orientation and the distractor orientation is small (left side of the curve), then the model detects the target less well than when the target orientation is very different from the distractors (right side of the curve).
For size targets (Figure 8, right graph) we see that the maximum GSI score is about 0.25, which makes it close to RARE2012 in terms of maximum GSI.
Finally, the shape of the GSI curve exhibited by our model is again coherent from a biological point of view: if the difference between the target size and the distractor size is small (center of the curve), then the model detects the target less well than when its size is very different (left and right sides of the curve). We can also see an asymmetry in the curve, showing that it is easier for DeepRare2019 to detect targets twice as large as the distractors than targets half their size, which is again biologically coherent.
We proposed a novel saliency model called DeepRare2019, which applies the simplified rarity idea of [rare2012] to the deep features extracted by a VGG16 network pretrained on the ImageNet dataset. This model exhibits several interesting features.
It needs no training and the default ImageNet training is enough.
The model is computationally efficient and is easy to run on CPU at less than one second per image.
Our approach is very modular, and it is very easy to adapt to any neural network architecture such as VGG19, ResNet50, or MobileNetV2 for deployment on mobile devices such as smartphones.
It is possible to check each layer's contribution and thus better understand the result, contrary to black-box DNN-based models.
DeepRare2019 is very generic and stable across all kinds of datasets, whereas other models are sometimes better, but only on one dataset and/or a specific metric, and much worse on the others.
We show that this model is the most stable and generic when testing it on 3 very different datasets. It was first tested on MIT1003, where it outperforms all the classical models and most of the DNN-based models. However, some DNN-based models, especially the latest ones, still provide better results. We then tested DeepRare2019 on the O3 dataset, where it outperforms all the models on target/background discrimination. On target/distractor discrimination, other models perform better for Color, but our model is second on Non-color, showing its stability again. Finally, on the P3 dataset, our model is tied for first place on target discrimination based on the number of fixations. When computing the average GSI metric, our model is the only one to be in the top three for all the features (color, orientation, size) and the only one to exhibit a GSI plot which is biologically plausible.
While one cannot expect an unsupervised model such as DeepRare2019 to be better on the MIT1003 dataset than DNN-based models which are trained and tuned on similar data, those DNN-based models perform poorly or even fail completely on the O3 and P3 datasets. Conversely, classical models are sometimes better than DeepRare2019 on the latter datasets, but they perform much worse than DeepRare2019 on the MIT1003 dataset. In addition, they outperform DeepRare2019 only on specific metrics and never on all the dataset subclasses.
To conclude, DeepRare2019 is always the best or among the top-3 or top-4 models in all the tests we carried out. No other model performs well on all datasets and their subclasses. DeepRare2019 is definitely the most stable and generic model among the tested saliency models.
All those advantages show that deep-features-engineered models might become a good choice in the visual attention field, especially when the images they are applied to are unusual and specific eye-tracking datasets are not available, or when explaining the result is of high importance.
Supported by ARES-CCD (program AI 2014-2019) under the funding of Belgian university cooperation.