Invariance Analysis of Saliency Models versus Human Gaze During Scene Free Viewing

by Zhaohui Che, et al.

Most current studies on human gaze and saliency modeling have used high-quality stimuli. In the real world, however, captured images undergo various types of distortions during the whole acquisition, transmission, and displaying chain. Typical distortions include motion blur, lighting variations, and rotation. Despite a few efforts, the influence of ubiquitous distortions on visual attention and saliency models has not been systematically investigated. In this paper, we first create a large-scale database including eye movements of 10 observers over 1900 images degraded by 19 types of distortions. Second, by analyzing eye movements and saliency models, we find that: a) observers look at different locations over distorted versus original images, and b) the performance of saliency models is drastically hindered over distorted images, with the largest performance drops occurring for Rotation and Shearing distortions. Finally, we investigate the effectiveness of different distortions when serving as data augmentation transformations. Experimental results verify that useful data augmentation transformations, which preserve the human gaze of reference images, can improve the robustness of deep saliency models against distortions, while invalid transformations, which severely change human gaze, degrade performance.



1 Introduction

Visual attention is an advanced internal mechanism for detecting informative and conspicuous regions from external visual stimuli. Over digital images or videos, human fixations, which give the coordinates of conspicuous regions, are a good proxy of visual attention. Visual attention is an efficient front-end operation for complex back-end computer vision tasks such as scene understanding, object recognition and detection, segmentation, and visual description [1, 2, 3].

A plethora of computational saliency models have been proposed in the past decades to predict human fixations automatically by simulating the human visual system [4, 5, 6]. Early saliency models extract hand-crafted features [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], while deep saliency models [19, 20, 21, 22, 23] learn relevant features automatically. Both types of models generate a scalar-valued saliency map which represents the location and importance of the salient regions.

To the best of our knowledge, most state-of-the-art saliency models and cognitive studies on visual attention have used high-quality, distortion-free stimuli. However, in most practical circumstances, external stimuli are corrupted by diverse distortions. In addition, some saliency-guided applications such as image/video quality assessment [24] and object detection and recognition [25] have to deal with distorted images. Some related works have investigated visual attention with the consideration of distortions. Kim et al. [26] investigated visual saliency over noisy images. They found that noise significantly degrades the accuracy of saliency models, and proposed a robust model for noise-corrupted images. Judd et al. [27] investigated visual fixations on low-resolution images in detail, and compared human gaze dispersions at different resolutions. Min et al. [28] investigated the influence of compression artifacts on visual attention by conducting an eye-tracking experiment on images with different compression levels. Zhang et al. [29] investigated the optimal strategy for integrating human attention cues into perceptual quality prediction, and pointed out that eye-tracking data on distorted images improves the performance of perceptual quality metrics.

The above works have considered specific types of distortions using a limited amount of data and a small set of saliency models. In this paper, we conduct a more comprehensive analysis to investigate the influence of distortions on both human gaze and saliency models. We first construct a large saliency database including 1900 images corrupted by 19 types of distortions. Eye-tracking experiments are conducted on this database. We analyze the human gaze discrepancy when viewing stimuli corrupted by different distortions. Then we conduct a comprehensive comparison of several state-of-the-art deep learning based as well as classic saliency models on this database. We measure the gap between saliency model predictions and human gaze when distortions are introduced. Further observations concerning visual attention and stimulus distortions are also made. Our results have important implications for applying saliency models in practical applications with distortions, and provide useful data augmentation strategies to improve deep learning based saliency models. To the best of our knowledge, this is the first systematic effort in this direction in the saliency field.

2 The Proposed Eye-tracking Database

2.1 Stimuli and distortion types

We selected 100 distortion-free reference images from the finer-grained CAT2000 saliency database [30] since it covers various scenes including indoor and outdoor scenes, natural and man-made scenes, synthetic patterns, fractal images, cartoon images, etc. Specifically, the reference set of the proposed database consists of 33 outdoor scenes, 22 indoor scenes, 15 cartoon images, 15 art images, and 15 fractal images. Because different reference images have different aspect ratios, the authors of CAT2000 padded each image by adding two gray bands to the left and right sides and adjusted the image scale so that all images have the same resolution ($1920 \times 1080$). This setting guarantees the consistency of the eye-tracking experiment, so that eye-movement data will not be affected by resolution.

| Distortion type | Generation method (using Matlab) | IO sAUC | IO CC | IO NSS |
|---|---|---|---|---|
| Reference | 100 distortion-free images from CAT2000, denoted by img | 0.7335 | 0.9538 | 3.4352 |
| MotionBlur1 | imfilter(img, fspecial('motion', 15, 0)) | 0.6637 | 0.9226 | 2.5720 |
| MotionBlur2 | imfilter(img, fspecial('motion', 35, 90)) | 0.6512 | 0.9203 | 2.5883 |
| Noise1 | imnoise(img, 'gaussian', 0, 0.1) | 0.7060 | 0.9394 | 3.0316 |
| Noise2 | imnoise(img, 'gaussian', 0, 0.2) | 0.6961 | 0.9392 | 3.0256 |
| JPEG1 | imwrite(img, saveroutine, 'Quality', 5) | 0.7030 | 0.9021 | 2.9193 |
| JPEG2 | imwrite(img, saveroutine, 'Quality', 0) | 0.7046 | 0.9034 | 2.8633 |
| Contrast1 | imadjust(img, [], [0.3,0.7]) | 0.7220 | 0.9306 | 3.0077 |
| Contrast2 | imadjust(img, [], [0.4,0.6]) | 0.7021 | 0.9311 | 3.4303 |
| Rotation1 | imrotate(img, -45, 'bilinear', 'loose') | 0.6804 | 0.8935 | 2.2865 |
| Rotation2 | imrotate(img, -135, 'bilinear', 'loose') | 0.6543 | 0.8923 | 2.0978 |
| Shearing1 | imwarp(img, affine2d([1 0 0; 0.5 1 0; 0 0 1])) | 0.7106 | 0.9435 | 3.0105 |
| Shearing2 | imwarp(img, affine2d([1 0.5 0; 0 1 0; 0 0 1])) | 0.6874 | 0.9273 | 2.5758 |
| Shearing3 | imwarp(img, affine2d([1 0.5 0; 0.5 1 0; 0 0 1])) | 0.6648 | 0.8882 | 2.1177 |
| Inversion | imrotate(img, -180, 'bilinear', 'loose') | 0.6947 | 0.9342 | 3.0621 |
| Mirroring | mirror symmetry version of reference images | 0.7256 | 0.9306 | 3.3598 |
| Boundary | edge(img, 'canny', 0.3, sqrt(2)) | 0.6696 | 0.8879 | 2.3119 |
| Cropping1 | cut a narrow band from the left side of img | 0.6972 | 0.9343 | 2.6299 |
| Cropping2 | cut a narrow band from the top side of img | 0.6923 | 0.9382 | 2.6412 |

Table 1: The details of the proposed eye-tracking database. The IO score [13] provides an upper bound on the prediction accuracy of saliency models.

To systematically assess the influence of ubiquitous distortions on human and model attention behavior, we choose 19 typical distortions covering the whole image acquisition, transmission, and displaying chain, explained below: 1) We select 2 levels of motion blur and 2 levels of Gaussian noise to simulate the distortions introduced in the acquisition stage. 2) We consider 2 levels of JPEG compression to simulate distortions introduced in transmission. 3) To simulate displaying distortions, we consider 2 levels of contrast change, 2 rotation degrees, and 3 shearing transformations. 4) In addition, we consider inversion, mirroring, line drawing (sketch/boundary maps), and 2 types of cropping distortions to explore the visual fixation variations under extremely abnormal conditions. This way, we derive 18 distorted images for each reference stimulus. Thus, a total of 1900 images ($18 \times 100$ distorted images + 100 reference images) are included in the proposed eye-tracking database (download link obscured for blind review). Details of distortion types and generation methods are shown in Table 1.

2.2 Eye tracking apparatus

Collecting eye-tracking data is expensive and time consuming. To overcome this challenge, some new large-scale data collection methodologies [31, 32, 33] have resorted to mouse movements or webcam gaze tracking. These methodologies are devised for reducing the semantic gap between human and deep models. However, it is still unclear whether they can replace eye movements for model training and whether they suffice to reach human level accuracy. Therefore, instead of relying on such methods, here we employ laboratory eye tracking which is more accurate.

Bylinskii et al. [34] pointed out that eye-tracking parameters, such as the distance of the subject to the eye tracker, calibration error, and image size, affect the quality of the collected data. Poor experimental settings can significantly affect performance evaluation and conclusions. Here we use a Tobii X120 eye tracker to record eye movements. We use the LG 47LA6600 CA monitor with a horizontal resolution of 1920 and a vertical resolution of 1080, so that the resolutions of the stimuli and the monitor screen are the same. The height and width of the monitor are 60cm and 106cm, respectively. The distance between the subject and the monitor is 180cm, and the distance between the subject and the eye tracker is 60cm. According to Bylinskii et al. [35], one degree of visual angle is used both as 1) an estimate of the size of the human fovea, e.g., how much of the image a participant has in focus during a fixation, and 2) to account for measurement error in the eye-tracking set-up. In our experiment, the width of the screen subtends $33.74^\circ$ of visual angle, and $1^\circ$ of horizontal visual angle contains 56.91 pixels. Accordingly, the height of the screen subtends $19.10^\circ$ of visual angle, and $1^\circ$ of vertical visual angle corresponds to 56.55 pixels. The obtained visual angle is necessary for computing the standard deviation of the Gaussian kernel used in the following step.
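The pixels-per-degree figures can be checked with a few lines of arithmetic. The sketch below assumes the small-angle approximation (angle in radians ≈ size / distance), which is an assumption on our part; it reproduces the reported values to within rounding. The function name `pixels_per_degree` is illustrative only.

```python
import math

def pixels_per_degree(size_px, size_cm, distance_cm):
    """Pixels covered by one degree of visual angle.

    Assumes the small-angle approximation: the screen extent in radians
    is taken as size / distance. This is our reading, not a stated method.
    """
    extent_deg = math.degrees(size_cm / distance_cm)
    return size_px / extent_deg

# Screen geometry as stated in the text: 1920 x 1080 pixels,
# 106 cm x 60 cm, viewed from 180 cm.
horizontal_ppd = pixels_per_degree(1920, 106, 180)
vertical_ppd = pixels_per_degree(1080, 60, 180)
print(round(horizontal_ppd, 2), round(vertical_ppd, 2))
```

The exact formula `2 * atan(size / (2 * distance))` would give a somewhat smaller visual angle for a screen this wide, so the choice of approximation matters when deriving the Gaussian kernel width.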

It is well known that there are two types of ground truth for measuring the performance of visual saliency models: 1) a discrete fixation map made up of discrete gaze points recorded directly by an eye tracker, and 2) a continuous fixation map representing the probability of human fixation. The former can be converted into the latter by a Gaussian smoothing filter with standard deviation $\sigma$ equal to one degree of visual angle [36]. Here, we choose $\sigma$ to match the pixels-per-degree values computed above.
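As a rough illustration of this conversion, the sketch below places unit impulses at fixation coordinates and applies a separable Gaussian blur, then normalizes the result to sum to one. It is a minimal pure-Python sketch on a toy grid; the helper names, the 3-sigma kernel truncation, the zero padding at the borders, and the toy `sigma=2.0` are all assumptions made for the example, not the authors' implementation.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian weights truncated at about 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(grid, kernel):
    """Convolve every row with the kernel (zero padding outside the grid)."""
    radius = len(kernel) // 2
    width = len(grid[0])
    blurred = []
    for row in grid:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < width:
                    acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def fixation_density(fixations, height, width, sigma):
    """Turn discrete (row, col) fixations into a continuous map summing to 1."""
    grid = [[0.0] * width for _ in range(height)]
    for (y, x) in fixations:
        grid[y][x] += 1.0
    kernel = gaussian_kernel(sigma)
    # Separable blur: rows first, then columns via transposition.
    blurred = transpose(blur_rows(transpose(blur_rows(grid, kernel)), kernel))
    total = sum(map(sum, blurred))
    return [[v / total for v in row] for row in blurred]

# Toy example: two nearby fixations and one isolated fixation.
density = fixation_density([(5, 5), (5, 6), (12, 14)], height=20, width=20, sigma=2.0)
```

On real stimuli the kernel width would be set in pixels from the one-degree visual angle rather than the toy value used here.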

2.3 Subjects and task

We recruited 40 subjects to participate in the eye-tracking experiment: 24 males and 16 females, aged 18 to 35. All participants were naive subjects and had not seen the stimulus set before. Subjects viewed the stimuli under a free-viewing condition.

Engelke et al. [37] investigated the impact of eye-tracking experimental settings on the quality of fixation maps, and found that the fixation map becomes more stable with longer viewing duration. Further, they pointed out that the convergence speed of the fixation map may help reduce experimental time and cost while only marginally sacrificing the accuracy of the final fixation map. Moreover, for duration times longer than , the fixation map is accurate enough to approximate human attention behavior. In the proposed database, the duration time for each stimulus is . We inserted a gray image with duration time between each two consecutive images to reset the visual fixation center and avoid carryover and memory effects [38]. Besides, the order of stimuli was randomized for each subject to mitigate the carryover effect.

2.4 “Between-subjects” protocol

A traditional subjective experiment protocol called “within-subjects” has been widely used in eye-tracking experiments [28, 39]. The “within-subjects” protocol asks the same group of subjects to view all stimuli, so that each subject has to view several distorted versions originating from the same reference image. This protocol may cause undesirable effects such as learning from experience, memory of salient objects, and the carryover effect [38]. Hence, we adopt the “between-subjects” experiment protocol proposed in [40] instead. In the “between-subjects” protocol, the subjects are divided into several non-overlapping groups, and each group is randomly assigned to view different distortion groups. This protocol reduces the number of repeated stimuli presented to each subject. Since the proposed database contains 1 reference group and 18 distortion groups, we divided the 40 subjects equally into 4 groups, and arranged for each subject to view only 4 or 5 distortion groups rather than all 19 groups. As a result, the carryover effect is mitigated. This way, for each stimulus, we collect eye-movement data from 10 subjects. Each subject takes 42 minutes to complete the eye-tracking experiment.
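The group arithmetic of this protocol can be sketched as follows: 19 stimulus groups are dealt out to 4 subject groups so that each subject group sees 4 or 5 of them, and with 10 subjects per subject group every stimulus is viewed by 10 people. The round-robin dealing and the function name `assign_groups` are assumptions made for illustration; the paper only states that the assignment is random.

```python
import random

def assign_groups(n_conditions=19, n_subject_groups=4, seed=0):
    """Randomly partition stimulus groups among subject groups.

    Conditions are shuffled, then dealt round-robin so that group sizes
    differ by at most one (here: three groups of 5 and one group of 4).
    """
    rng = random.Random(seed)
    conditions = list(range(n_conditions))
    rng.shuffle(conditions)
    return [sorted(conditions[g::n_subject_groups])
            for g in range(n_subject_groups)]

assignment = assign_groups()
sizes = sorted(len(group) for group in assignment)
subjects_per_group = 40 // 4  # 10 subjects view each stimulus group
print(sizes, subjects_per_group)
```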

3 Analysis of Human Gaze Discrepancy

3.1 Quantitative evaluation

In this section, we investigate the human visual fixation dispersions over 19 distortions, and compare the fixation maps of the reference and distorted images.

Figure 1: CC and SIM similarity matrices and KL dissimilarity matrix of human gaze on different distortions: (a) CC similarity matrix, (b) SIM similarity matrix, (c) KL dissimilarity matrix. The distortion types are ranked by their similarity/dissimilarity values when using the human gaze on Reference as ground-truth. Higher CC and SIM values indicate greater similarity, while a lower KL value indicates a closer match.

We quantify the discrepancies between human fixation maps of distorted and reference images using 3 similarity measures: Correlation Coefficient (CC), Similarity Measure (SIM), and Kullback-Leibler divergence (KL) [41]. The similarity matrices of different distortions are shown in Figure 1. The distortion types of each similarity matrix are ranked by their similarity to the Reference group, with similarity decreasing from left to right. Figure 1 indicates that different distortions do influence human attention, and that the extent of the impact depends strongly on the distortion type. Notably, for the Inversion, Mirroring, Rotation and Shearing distortions, which change the locations of image pixels, we map the human gaze maps through the inverse of the transformations in Table 1 to align them pixel-by-pixel with the Reference image for fair comparison, as shown in Figure 2. We find that:

1. Mirroring and Shearing1 have slight influences on human attention compared to other distortions, because they obtain the best similarity values in terms of CC, SIM and KL metrics.

2. Rotation2, Cropping2 and Shearing3 have significant influences on human gaze, because the discrepancies of human fixation between these distortions and Reference are significant, as shown by CC, SIM and KL metrics in Figure 1.

3. The human fixation maps of images degraded by the same type but different levels of distortion are quite close in these cases: Noise1 vs Noise2, JPEG1 vs JPEG2, MotionBlur1 vs MotionBlur2 (when compared to each other). Human gaze maps over these distortions at different levels achieve high similarity values under the CC, SIM and KL metrics. Besides, the higher the distortion level, the larger the human gaze discrepancy (when compared to Reference).
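For readers unfamiliar with the three measures, the following pure-Python sketch shows one common formulation of each, applied to two small fixation maps. The function names, the epsilon regularization in KL, and the toy maps are assumptions for illustration; the exact variants used in the paper follow [41] and may differ in detail.

```python
import math

def _flatten(grid):
    return [v for row in grid for v in row]

def _as_distribution(grid):
    values = _flatten(grid)
    total = sum(values)
    return [v / total for v in values]

def cc(map_a, map_b):
    """Pearson correlation coefficient between two maps (symmetric)."""
    a, b = _flatten(map_a), _flatten(map_b)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def sim(map_a, map_b):
    """Histogram intersection of the two maps as distributions (symmetric)."""
    p, q = _as_distribution(map_a), _as_distribution(map_b)
    return sum(min(x, y) for x, y in zip(p, q))

def kl(reference, other, eps=1e-12):
    """KL divergence of `other` from `reference` (not symmetric)."""
    p, q = _as_distribution(reference), _as_distribution(other)
    return sum(x * math.log(eps + x / (y + eps)) for x, y in zip(p, q))

reference = [[0.1, 0.2], [0.3, 0.4]]
flipped = [[0.4, 0.3], [0.2, 0.1]]
uniform = [[0.25, 0.25], [0.25, 0.25]]
print(cc(reference, flipped), sim(reference, flipped), kl(reference, uniform))
```

CC and SIM give the same value whichever map is treated as ground truth, whereas KL generally does not, which is one reason a KL matrix need not be symmetric.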

3.2 Finer-grained analyses

3.2.1 Characteristics of different metrics:

As shown in Figure 1, different similarity matrices have disparate characteristics, including symmetry properties and distortion rankings, due to the properties of the different metrics. A finer-grained analysis of the influence of the CC, SIM and KL metrics on each distortion type is provided in the supplementary material.

Figure 2: Human gaze discrepancy on (a) Rotation2 and (b) Shearing3 compared to Reference. The 3rd row represents the restored version of Rotation2/Shearing3 via inverse transformation. This way, the Restoration is aligned with Reference pixel-by-pixel for fair comparison. The 1st and 2nd rows represent the human gaze maps of Reference and Cropping1/Cropping2 respectively. The 3rd and 4th rows represent CC and SIM maps in which a higher value means a better approximation. The 5th row represents the KL map in which a higher value means a more severe discrepancy.

Figure 3: Human gaze discrepancy on (a) Mirroring and (b) Boundary compared to Reference.

Figure 4: Human gaze discrepancy on (a) Cropping1 and (b) Cropping2 compared to Reference.

3.2.2 Influences of different distortions:

In this section, we summarize the influence of different distortions on human gaze.

Mirroring: Human gaze maps are almost the same on the Mirroring and Reference groups, but there is still a small gap. Generally speaking, for most stimuli with multiple salient objects, the most conspicuous salient object is noticed in both the Reference and Mirrored images. However, for secondary salient objects, the human fixations may differ between the Reference and Mirrored images, as shown in Figure 3.(a).

Boundary: In general, the Boundary group retains most of the semantic information of Reference, because the human gaze discrepancy between the Boundary and Reference groups is not large, and is even smaller than that of Cropping1. We find that humans prefer to look at regions with dense edges when color and luminance features are lacking, as shown in Figure 3.(b).

Cropping: Cropping1 distracts human attention from the salient regions appearing on the left side of the stimulus, but the main part of the stimulus is not severely affected, as shown in Figure 4.(a). Cropping2 alters the human gaze severely, because salient objects containing semantic information are often framed in the center. As a result, the risk of damaging objects with semantic information is higher for Cropping2 than for Cropping1.

Rotation: Inversion is a special rotation with a $180^\circ$ rotation angle. Rotation distortions do influence human gaze. Rotation2, with a $135^\circ$ rotation angle, has a more severe influence on human attention than $45^\circ$ (Rotation1) and $180^\circ$ (Inversion). In particular, for stimuli with multiple salient objects in a complex background, humans prefer to concentrate on one of the salient objects, and the dominant salient object may be altered, as shown in the 1st, 2nd, and 4th columns of Figure 2.(a).

Shearing: As shown in Figure 2.(b), shearing distortions influence human gaze, and the strength of the influence depends strongly on the affine transformation matrix shown in Table 1. More severe deformation increases the discrepancy of human gaze when compared to Reference. Because geometric distortions (i.e., Rotation and Shearing) change the effective size of images (as shown in Figure 2), we take measures to mitigate the additional influence of effective image size on the eye-movement data; these measures are explained in detail in the supplementary material.

Contrast: The low-level Contrast1 has only a slight influence on human gaze, but the high-level Contrast2 attracts human gaze to the center region, i.e., there is a severe center bias. The qualitative results are provided in the supplementary material.

Noise, JPEG and MotionBlur: Experimental results show that human gaze is tolerant to Gaussian noise, perhaps because human eyes are able to detect salient regions even when the stimulus is corrupted by severe noise. However, for stimuli including one dominant salient object in a complex background, high-level noise distracts human attention from the dominant salient object. Similarly, low-level JPEG artifacts are ignored by human attention, but high-level JPEG artifacts alter the human gaze. MotionBlur artifacts have a more profound impact on human gaze than JPEG and Noise.

Figure 5: The performance of state-of-the-art saliency models on different distortions: (a) CC scores, (b) NSS scores, (c) sAUC scores, (d) sAUC scores when using the human gaze map on Reference as ground-truth. The horizontal axis represents different distortion types, which are ranked by average performance over 21 saliency models. The vertical axis represents different saliency models, which are ranked by average performance over 19 distortions. The bar graph on the right side represents the average performance of each model, and the bar graph at the bottom represents the average performance of each distortion. The error bars represent standard error of the mean (SEM). The standard deviation shown for each model is computed over its performance on the 19 distortions, with red and blue marking the highest and the lowest values respectively. Two further standard deviations are reported: that of the average performances of different models, and that of the average performances of different distortions. (d): Compared to (c), we calculate the sAUC score for each model once again, but adopt the human gaze map of the reference stimuli to compute the sAUC scores for the other 18 distorted stimuli, rather than using the human gaze map of the real distorted stimuli as (a)-(c) do.

4 Performance of Saliency Models

4.1 Quantitative evaluation

We test 15 early saliency models including IttiKoch [7], GBVS [8], Torralba [9], CovSal [10] (CovSal-1 uses the covariance feature and CovSal-2 uses both covariance and mean features), AIM [11], Hou [12] (Hou-Lab and Hou-RGB adopt the Lab and RGB color spaces respectively), LS [13], LGS [13], BMS [14], RC [15], Murray [16], AWS [17] and ContextAware [18], and 6 deep models including ML-Net [19], SalGAN [20], SALICON [21], SalNet [22], SAM-ResNet [23] and SAM-VGG [23] on the proposed database. The performances are shown in Figure 5.

We observe the following points:

1. Deep models outperform early models significantly on different distortions.

2. Rotation2 and Shearing3 are the most challenging distortions for models, because most models obtain poor performances on these distortions. Recall that Rotation2 and Shearing3 also have severe impacts on human gaze.

3. The discrepancy between different saliency models seems much larger than the discrepancy between different distortions. As shown in Figure 5, the standard deviation of the average performances of the 21 saliency models is higher than the standard deviation of the average performances of the 19 distortions under all of the sAUC, CC and NSS metrics.

4. The AWS, ContextAware and Torralba models are robust to different distortions, because the sAUC, CC and NSS scores of these models have small standard deviations over the 19 distortions. However, the SALICON and ML-Net models show unstable performance on the CC and NSS metrics, because they fail on Noise2 and Contrast2. The same observation holds for SalNet and SALICON when using the sAUC metric. We find that the early models using hand-crafted features are more robust than deep models, while deep models have higher average performance than the early models.
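The three standard deviations used in these observations can be computed from a model-by-distortion score table as sketched below. The scores and model names here are hypothetical placeholders, not values from Figure 5, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

# Hypothetical scores (model -> distortion -> metric value), for illustration only.
scores = {
    "model_A": {"Reference": 0.70, "Noise2": 0.68, "Rotation2": 0.66},
    "model_B": {"Reference": 0.60, "Noise2": 0.59, "Rotation2": 0.58},
    "model_C": {"Reference": 0.75, "Noise2": 0.55, "Rotation2": 0.62},
}
distortions = list(next(iter(scores.values())))

# Robustness of each model: spread of its own scores across distortions.
per_model_std = {m: pstdev(list(s.values())) for m, s in scores.items()}

# Spread of the per-model averages versus spread of the per-distortion averages.
model_means = [mean(list(s.values())) for s in scores.values()]
distortion_means = [mean([scores[m][d] for m in scores]) for d in distortions]
std_across_models = pstdev(model_means)
std_across_distortions = pstdev(distortion_means)

most_robust = min(per_model_std, key=per_model_std.get)
print(most_robust, std_across_models, std_across_distortions)
```

In this toy table the lowest-scoring model is also the most stable one, mirroring the trade-off between robustness and average performance described in point 4.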

4.2 Finer-grained analyses

In this section, we explain some outliers appearing in Figure 5, and explore the gap between saliency models and human gaze.

Analyses of Metrics: As shown in Figure 5, the ranks of saliency models and distortions depend strongly on the evaluation metric. Specifically, the Normalized Scanpath Saliency (NSS) is sensitive to false positives [34]. sAUC, also called the shuffled AUC, penalizes models that include center bias, and it ignores low-valued false positives compared to NSS [34]. As a result, the 1st and 5th columns in Figure 6.(a) have similar sAUC scores. The NSS of the 5th column, however, is significantly lower than that of the 1st, because severe false positives in the 5th column lower the normalized saliency value at each fixation location, thus reducing the overall NSS score. The same observation holds for the 3rd and 5th columns in Figure 6.(b).
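The sensitivity of NSS to false positives can be seen in a small numerical example. NSS standardizes the saliency map to zero mean and unit standard deviation and averages the standardized values at the fixated pixels; the sketch below follows that definition on toy 4 x 4 maps. The maps and the helper name `nss` are illustrative assumptions.

```python
from statistics import mean, pstdev

def nss(saliency, fixations):
    """Mean standardized saliency value at the fixated (row, col) locations."""
    values = [v for row in saliency for v in row]
    mu, sigma = mean(values), pstdev(values)
    return mean([(saliency[y][x] - mu) / sigma for (y, x) in fixations])

fixations = [(1, 1)]

# A prediction that is salient only at the fixated location.
clean = [[0.0] * 4 for _ in range(4)]
clean[1][1] = 1.0

# The same prediction with two additional false-positive peaks.
noisy = [row[:] for row in clean]
noisy[0][3] = 1.0
noisy[3][3] = 1.0

print(nss(clean, fixations), nss(noisy, fixations))
```

Both maps assign the same raw value to the fixated pixel, yet the false positives raise the map's mean and standard deviation, so the standardized value at the fixation, and hence NSS, drops.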

Figure 6: Example of failure cases of different models on (a) Boundary and (b) Rotation2. The 1st and 2nd rows of (b) are mapped by inverse transformation for better observation.

Outliers: There are some outliers appearing on Figure 5, explained below.

ML-Net fails on Noise2 and Contrast2. This is because ML-Net produces severe false positives in the upper-left region of stimuli corrupted by Noise2. Besides, ML-Net falsely produces two slender salient lines on the top and bottom sides of stimuli corrupted by Contrast2, as shown in Figure 7.

SALICON fails on Noise2 and Contrast2, because it produces severe false positives on the left and right sides of most stimuli corrupted by severe noise, as shown in Figure 7.(a). Further, SALICON detects the whole image as a salient region on Contrast2, as shown in Figure 7.(b).

LS, LGS and GBVS fail on Boundary. LS extracts features from both the RGB and Lab color spaces, but the $a$ and $b$ channels of the Lab color space are close to 0 because the Boundary stimuli are binary images. As a result, LS produces NaN values at a normalization step. Thus, we compute the LS saliency map using only the $L$ channel to avoid NaN values. LGS and GBVS produce severe false positives on the left and right sides on Boundary, as shown in Figure 6.(a).

CovSal-1 and CovSal-2 rank at the bottom in Figure 5.(c), because CovSal includes a severe center bias which is penalized by the sAUC metric, as shown in Figure 7.(b).

Upper-bound of models: We report the Human Inter-Observer (IO) scores [13] of different distortions in Table 1. IO score provides an upper-bound on prediction accuracy of saliency models, because different observers are often the best predictors of each other.

Figure 7: Example of failure cases of different models on (a) Noise2 and (b) Contrast2.

Figure 8: Example of failure cases of different models on (a) MotionBlur2 and (b) Shearing3. The 1st and 2nd rows of (b) are mapped by inverse transformation for better observation.

Immunity to Distortion: We define immunity as the ability of a saliency model to predict fixation locations for distorted stimuli as consistently as for distortion-free stimuli. Figure 5.(d) shows the average sAUC score of different models on distorted stimuli when using human gaze on Reference as ground-truth. A higher score here means that the model has better immunity to distortions, although it may not be the best model for predicting the real human gaze when viewing distorted stimuli. We also find that most models obtain higher sAUC scores when using human gaze on Reference as ground-truth than when using the real human gaze affected by distortions. Notably, the immunity mentioned here is different from the robustness mentioned in Section 4.1: robustness means that the performance of a saliency model is not severely degraded by different distortions, where performance is calculated using real human gaze on distorted stimuli as ground-truth.

5 Application for Data Augmentation

Data augmentation is widely used in deep-learning based computer vision tasks [42] to reduce overfitting and to improve generalization capacity of deep models. The most common data augmentation strategy is enlarging the dataset using some label-preserving transformations, such as Cropping, Inversion, ContrastChange, and Rotation. However, different from classical image classification and object detection problems, common data augmentation methods may produce label noise for the saliency prediction problem, because different transformations will change the ground truth at different levels. This paper carries important implications as to which of these kinds of transformations are valid and which are not as approximations of real human gaze behavior. We divide common transformations of the proposed dataset into two sets, i.e. valid and invalid augmented sets, and explore how fine-tuning on different sets of augmented data can improve or degrade the performance of deep models with respect to ground truth data.

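| Fine-tuning | Model (column index) | sAUC | CC | NSS | SIM | AUC-Borji | KL |
|---|---|---|---|---|---|---|---|
| original models | 1 | 0.6330 | 0.4795 | 1.5697 | 0.4639 | 0.7434 | 0.9975 |
| original models | 2 | 0.6568 | 0.5304 | 1.6716 | 0.4817 | 0.8132 | 0.8776 |
| original models | 3 | 0.6588 | 0.5887 | 1.9443 | 0.5146 | 0.7930 | 0.8834 |
| original models | 4 | 0.6548 | 0.4774 | 1.5969 | 0.4561 | 0.7653 | 0.9826 |
| using CAT2000 | 1 | 0.6386 | 0.7606 | 2.2692 | 0.6317 | 0.8681 | 1.0122 |
| using CAT2000 | 2 | 0.6546 | 0.7730 | 2.2922 | 0.6385 | 0.8724 | 1.1512 |
| using CAT2000 | 3 | 0.6439 | 0.5974 | 1.9558 | 0.5364 | 0.8107 | 0.8945 |
| using CAT2000 | 4 | 0.6588 | 0.5304 | 1.6223 | 0.4778 | 0.8205 | 0.8691 |
| using Valid Set | 1 | 0.6442 | 0.7753 | 2.3444 | 0.6584 | 0.8766 | 0.6673 |
| using Valid Set | 2 | 0.6667 | 0.7822 | 2.3567 | 0.6627 | 0.8817 | 0.6821 |
| using Valid Set | 3 | 0.6614 | 0.5984 | 1.9527 | 0.5386 | 0.8207 | 0.7702 |
| using Valid Set | 4 | 0.6598 | 0.5484 | 1.7108 | 0.5018 | 0.8469 | 0.8134 |

Table 2: The performance of deep models on valid augmented set. Metric scores reported: sAUC, CC, NSS, SIM, AUC-Borji, and KL.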
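| Fine-tuning | Model (column index) | sAUC | CC | NSS | SIM | AUC-Borji | KL |
|---|---|---|---|---|---|---|---|
| original models | 1 | 0.5989 | 0.5026 | 1.4154 | 0.4961 | 0.7226 | 1.0108 |
| original models | 2 | 0.5990 | 0.5026 | 1.4154 | 0.4961 | 0.7224 | 1.0168 |
| original models | 3 | 0.5938 | 0.5552 | 1.4762 | 0.5230 | 0.7450 | 0.9174 |
| original models | 4 | 0.5943 | 0.4179 | 1.1162 | 0.4535 | 0.7154 | 0.9833 |
| using CAT2000 | 1 | 0.5790 | 0.7551 | 1.9231 | 0.6507 | 0.8406 | 0.9949 |
| using CAT2000 | 2 | 0.5771 | 0.7584 | 1.8984 | 0.6517 | 0.8423 | 1.0745 |
| using CAT2000 | 3 | 0.5845 | 0.5727 | 1.4433 | 0.5442 | 0.7652 | 0.7949 |
| using CAT2000 | 4 | 0.6080 | 0.5379 | 1.3662 | 0.5194 | 0.7978 | 0.7793 |
| using Invalid Set | 1 | 0.5751 | 0.7268 | 1.8291 | 0.6316 | 0.8233 | 1.6512 |
| using Invalid Set | 2 | 0.5716 | 0.7508 | 1.8880 | 0.6490 | 0.8263 | 1.7643 |
| using Invalid Set | 3 | 0.5827 | 0.5546 | 1.3898 | 0.5402 | 0.7594 | 1.1768 |
| using Invalid Set | 4 | 0.6017 | 0.5341 | 1.3434 | 0.5065 | 0.7918 | 0.7919 |

Table 3: The performance of deep models on invalid augmented set.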

On the one hand, we select Reference, Mirroring, Inversion, Contrast1, Shearing1, JPEG1 and Noise1 to generate a valid augmented set, because these transformations have only slight effects on human gaze. On the other hand, Rotation1, Rotation2, Shearing2, Shearing3, Cropping1, Cropping2 and MotionBlur2 serve as an invalid augmented set, because these transformations are not able to preserve human gaze labels as approximations of Reference. Since the Reference images of the proposed dataset are selected from CAT2000, here we first fine-tune 4 state-of-the-art deep models using CAT2000. Then, we use the valid and invalid augmented sets to fine-tune these deep models separately.
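To make the idea of a label-preserving transformation concrete, the sketch below mirrors an image and its gaze map together, so that the label stays aligned with the transformed stimulus. This is only an illustration of the general technique under the assumption that the Reference gaze map is reused after the same geometric transformation; the paper's augmented sets may instead use the gaze recorded on the distorted images. The function names and toy arrays are hypothetical.

```python
def mirror(grid):
    """Flip a 2-D grid (list of rows) left to right."""
    return [list(reversed(row)) for row in grid]

def mirrored_training_pair(image, gaze_map):
    """Apply the same horizontal flip to the stimulus and to its gaze label."""
    return mirror(image), mirror(gaze_map)

# Toy 2 x 3 stimulus with a single fixation in the top-right corner.
image = [[1, 2, 3], [4, 5, 6]]
gaze_map = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
aug_image, aug_gaze = mirrored_training_pair(image, gaze_map)
print(aug_image, aug_gaze)
```

For a transformation in the invalid set, such a transformed label would be a poor approximation of where people actually look, which is the label noise the section warns about.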

For a fair comparison, we use the same training set size, optimizer parameters, and number of training epochs across the three fine-tuning strategies (CAT2000, valid set, and invalid set), as explained below. First, each of the valid set, the invalid set, and CAT2000 is divided into a training set (350 images), a validation set (175 images), and a test set (175 images). Second, we use the test set of the valid augmented set to calculate the performance in Table 2, and the test set of the invalid augmented set to compute the metric scores in Table 3. Third, for each deep model, the hyper-parameters for fine-tuning with CAT2000, the valid set, and the invalid set are set to the same values: 1) for the 4 deep models listed in Table 2, SGD (stochastic gradient descent) with momentum 0.9 and weight decay 0.0005 serves as the optimizer, and the batch size is 1; 2) for ML-Net, the learning rate is , and the number of epochs is 20; 3) for SALICON, the learning rate is , and the training time is set to 2000 seconds; and 4) for SAM-VGG and SAM-ResNet, the learning rates are set to , and the number of epochs is 10.
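The protocol above can be sketched in a few lines. This is only an illustration under stated assumptions: the split is done by a seeded random shuffle (the text does not say how images were assigned), the update rule follows one common SGD-with-momentum convention in which weight decay is added to the gradient, and the learning rate is left as a required argument because its value is not given here. The function names are hypothetical.

```python
import random


def split_dataset(items, n_train=350, n_val=175, n_test=175, seed=0):
    """Split items into train/validation/test subsets of fixed sizes.

    Assumption: a seeded random shuffle decides the assignment.
    """
    items = list(items)
    if len(items) < n_train + n_val + n_test:
        raise ValueError("not enough items for the requested split")
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


def sgd_momentum_step(weights, grads, velocity, lr,
                      momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    Convention used here (one of several in common use):
        g <- g + weight_decay * w
        v <- momentum * v + g
        w <- w - lr * v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w
        v = momentum * v + g
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


train, val, test = split_dataset(range(700))
# The learning rate below is a placeholder, not a value from the paper.
w, v = sgd_momentum_step([1.0], [0.5], [0.0], lr=0.1)
```

Using the same split sizes and optimizer settings for all three fine-tuning sets means that differences in the resulting scores can be attributed to the training data rather than to the training budget.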

(c) ML-Net
(d) SAM-ResNet
Figure 9: Qualitative comparison between ground truth, original models, and fine-tuned models. For (a)-(d), the 1st row represents ground truth of human gaze; the 2nd row represents saliency maps generated by original models; the 3rd row represents saliency maps of refined models fine-tuned by valid set.

Experimental results in Table 2 verify that fine-tuning using either CAT2000 or the valid set improves the deep models’ performance. Moreover, fine-tuning using the valid set yields a larger improvement than using CAT2000, which contains only normal Reference images. However, as shown in Table 3, fine-tuning using the invalid set degrades the deep models’ performance compared to using CAT2000. Qualitative results of the original models and of the refined models fine-tuned using the valid set, together with the ground truth, are shown in Figure 9. Overall, for the saliency prediction task, fine-tuning with label-preserving data augmentation transformations (those in the valid set) can improve the robustness of deep saliency models against distortions.
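Two of the metrics reported in the tables, CC and NSS, can be sketched as follows to make the comparison concrete. These are the commonly used definitions, written with nested lists in place of arrays; they are not confirmed to match the exact implementation used for the reported scores (for instance, this sketch uses the population standard deviation).

```python
import math


def _flatten(grid):
    """Turn a 2-D grid (a list of rows) into a flat list of floats."""
    return [float(x) for row in grid for x in row]


def _mean_std(values):
    """Return the mean and population standard deviation of values."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(var)


def cc(saliency, density):
    """Pearson correlation between a saliency map and a fixation density map."""
    s, d = _flatten(saliency), _flatten(density)
    mean_s, std_s = _mean_std(s)
    mean_d, std_d = _mean_std(d)
    if std_s == 0 or std_d == 0:
        return 0.0  # correlation is undefined for a constant map
    cov = sum((a - mean_s) * (b - mean_d) for a, b in zip(s, d)) / len(s)
    return cov / (std_s * std_d)


def nss(saliency, fixations):
    """Mean of the z-scored saliency values at fixated pixels."""
    s, f = _flatten(saliency), _flatten(fixations)
    mean_s, std_s = _mean_std(s)
    fixated = [value for value, hit in zip(s, f) if hit > 0]
    if std_s == 0 or not fixated:
        return 0.0
    return sum((value - mean_s) / std_s for value in fixated) / len(fixated)


saliency = [[0, 1],
            [2, 3]]
fixations = [[0, 0],
             [0, 1]]
cc_same = cc(saliency, saliency)
nss_peak = nss(saliency, fixations)
```

Higher CC and NSS indicate better agreement with human gaze, whereas KL is a divergence, so lower values are better when reading the tables.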

6 Discussion and Conclusion

We introduce a large-scale eye-tracking database covering 19 typical distortions, intended to help saliency models approach human-level accuracy on non-canonical stimuli. We also refine several state-of-the-art deep saliency models using a valid data augmentation strategy, achieving better prediction accuracy on distorted images.

The takeaway lessons from our study are as follows.

First, most distortions do affect human gaze, and the magnitude of the effect depends strongly on the distortion type. High-level rotation, shearing and cropping distortions significantly distract human gaze, while mirroring, inversion and slight shearing distortions have only slight impacts.

Second, different distortions distract human gaze in different ways. For example, extremely low contrast draws human gaze toward the center region, rotation alters the dominant salient object, and cropping distracts human gaze from the salient region appearing on the cut side.

Third, deep saliency models achieve better average performance across distortions than early non-deep models, but fail on some particular distortions, including severe noise, boundary and extremely low contrast. Early saliency models using hand-crafted features are more robust in these cases.

Finally, for the saliency prediction problem, the choice of data augmentation transformations affects the final performance of deep saliency models. Mirroring, Inversion, Contrast1, Shearing1, JPEG1 and Noise1 are suitable as data augmentation transformations and improve model performance, whereas Cropping, Rotation, Shearing2, Shearing3 and MotionBlur2 degrade model performance because they severely change the human gaze labels.

For state-of-the-art saliency models, there is still a gap between current prediction accuracy and its upper bound (the IO score) on distorted stimuli.

We will share our collected data and code with the community to promote research in improving the robustness of deep models over different distortions and to close the gap between saliency models and the human IO model.


  • [1] Oliva, A., Torralba, A., Castelhano, M.S., Henderson, J.M.: Top-down control of visual attention in object detection. In: ICIP. (2003)
  • [2] Frintrop, S.: A visual attention system for object detection and goal-directed search. Springer (2006)
  • [3] Mishra, A., Aloimonos, Y., Fah, C.L.: Active segmentation with fixation. In: ICCV. (2009)
  • [4] Borji, A., Itti, L.:

    State-of-the-art in visual attention modeling.

    IEEE T-PAMI (2013)
  • [5] Borji, A., Sihite, D.N., Itti, L.: Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE T-IP (2013)
  • [6] Borji, A., Tavakoli, H.R., Sihite, D.N., Itti, L.: Analysis of scores, datasets, and models in visual saliency prediction. In: ICCV. (2013)
  • [7] Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE T-PAMI (1998)
  • [8] Jonathan, H., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPs. (2007)
  • [9] Torralba, A., Oliva, A., Castelhano, M.S.: Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological review (2006)
  • [10] Erdem, E., Erdem, A.: Visual saliency estimation by nonlinearly integrating features using region covariances. JoV (2013)
  • [11] Bruce, N., Tsotsos, J.: Attention based on information maximization. JoV (2007)
  • [12] Hou, X., Harel, J., Koch, C.: Image signature: Highlighting sparse salient regions. IEEE T-PAMI (2012)
  • [13] Borji, A., Itti, L.: Exploiting local and global patch rarities for saliency detection. In: CVPR. (2012)
  • [14] Zhang, J., Sclaroff, S.: Saliency detection: A boolean map approach. In: ICCV. (2013)
  • [15] Cheng, M., Mitra, N.J., Huang, X.: Global contrast based salient region detection. IEEE T-PAMI (2015)
  • [16] Murray, N., Vanrell, M., Otazu, X.: Saliency estimation using a non-parametric low-level vision model. In: CVPR. (2011)
  • [17] Garcia-Diaz, A., Leboran, V., Fdez-Vidal, X.R.: On the relationship between optical variability, visual saliency, and eye fixations: A computational approach. JoV (2012)
  • [18] Goferman, S., Manor, L., Tal, A.: Context-aware saliency detection. IEEE T-PAMI (2012)
  • [19] Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: A deep multi-level network for saliency prediction. In: ICPR. (2016)
  • [20] Pan, J., Canton, C., McGuinness, K., et al: Salgan: Visual saliency prediction with generative adversarial networks. In: arXiv preprint cs.CV. (2017)
  • [21] Huang, X., Shen, C., Boix, X., Zhao, Q.:

    Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks.

    In: ICCV. (2015)
  • [22] Pan, J., McGuiness, K., Sayrol, E., Conner, N., et al: Shallow and deep convolutional networks for saliency prediction. In: CVPR. (2016)
  • [23] Cornia, M., Baraldi, L., Serra, G., et al: Predicting human eye fixations via an lstm-based saliency attentive model. In: arXiv preprint cs.CV. (2016)
  • [24] Zhang, W., Borji, A., Wang, Z., Patrick, Liu, H.: The application of visual saliency models in objective image quality assessment: A statistical evaluation. IEEE T-NNLS (2016)
  • [25] Gao, D., Han, S., Vasconcelos, N.: Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition. IEEE T-PAMI (2010)
  • [26] Kim, C., Milanfar, P.: Visual saliency in noisy images. JoV (2013)
  • [27] Tilke, J., Fredo, D., Antonio, T.: Fixations on low-resolution images. JoV (2011)
  • [28] Min, X., Zhai, G., Gao, Z., Hu, C.: Influence of compression artifacts on visual attention. In: ICME. (2014)
  • [29] Zhang, W., Liu, H.: Toward a reliable collection of eye-tracking data for image quality research: Challenges, solutions, and applications. IEEE T-IP (2017)
  • [30] Borji, A., Itti, L.: Cat2000: A large scale fixation dataset for boosting saliency research. In: arXiv preprint cs.CV. (2015)
  • [31] Jiang, M., Huang, S., Duan, J., Zhao, Q.: Salicon: Saliency in context. In: CVPR. (2015)
  • [32] Krafka, K., Khosla, A., Kellnhofer, P., et al: Eye tracking for everyone. In: CVPR. (2016)
  • [33] Papoutsaki, A., Sangkloy, P., Laskey, J., Daskalova, N., et al: Webgazer: Scalable webcam eye tracking using user interactions. In: IJCAI. (2016)
  • [34] Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? In: arXiv preprint cs.CV. (2017)
  • [35] Bylinskii, Z.: Code for computing visual angle. (2014)
  • [36] Olivier, L., Thierry, B.: Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods (2013)
  • [37] Engelke, U., Liu, H., Wang, J., Callet, P.L., Heynderickx, I., Zepernick, H.J., Maeder, A.: Comparative study of fixation density maps. IEEE T-IP (2013)
  • [38] Greenwald, A.G.: Within-subjects designs: To use or not to use? Psychological Bulletin (1976)
  • [39] Vu, C.T., Larson, E.C., Chandler, D.M.: Visual fixation patterns when judging image quality: Effects of distortion type, amount, and subject experience. SSIAI (2008)
  • [40] Zhang, W., Liu, H.: Learning picture quality from visual distraction: Psychophysical studies and computational models. NC (2017)
  • [41] Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: Mit saliency benchmark.
  • [42] Alex, K., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPs. (2015)