How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.READ FULL TEXT VIEW PDF
A deep feature based saliency model (DeepFeat) is developed to leverage ...
Creating a linguistic resource is often done by using a machine learning...
The computation and memory needed for Convolutional Neural Network (CNN)...
With the increased focus on visual attention (VA) in the last decade, a ...
Saliency detection has been widely studied because it plays an important...
Understanding the reasons behind the predictions made by deep neural net...
The research field concerned with the digital restoration of degraded wr...
Automatically predicting regions of high saliency in an image is useful for applications including content-aware image re-targeting, image compression and progressive transmission, object and motion detection, image retrieval and matching. Where human observers look in images is often used as a ground truth estimate of image saliency, and computational models producing a saliency value at each pixel of an image are referred to as saliency models111Although the term saliency was traditionally used to refer to bottom-up conspicuity, many modern saliency models include scene layout, object locations, and other contextual information..
Dozens of computational saliency models are available to choose from [8, 9, 13, 14, 40], but objectively determining which model offers the “best” approximation to human eye fixations remains a challenge. For example, for the input image in Fig. 1a, we include the output of 8 different saliency models (Fig. 1b). When compared to human ground truth the saliency models receive different scores according to different evaluation metrics (Fig. 1c). The inconsistency in how different metrics rank saliency models can often leave performance up to interpretability.
In this paper, we quantify metric behaviors. Through a series of systematic experiments and novel visualizations (Fig. 2
), we aim to understand how changes in the input saliency maps impact metric scores, and as a result why models are scored differently. Some metrics take a probabilistic approach to distribution comparison, yet others treat distributions as histograms or random variables (Sec.4). Some metrics are especially sensitive to false negatives in the input prediction, others to false positives, center bias, or spatial deviations (Sec. 5). Differences in how saliency and ground truth are represented and which attributes of saliency models should be rewarded/penalized leads to different choices of metrics for reporting performance [9, 14, 45, 46, 57, 68, 88]. We consider metric behaviors in isolation from any post-processing or regularization on the part of the models.
Building on the results of our analyses, we offer guidelines for designing saliency benchmarks (Sec. 6). For instance, for evaluating probabilistic saliency models we suggest the KL-divergence and Information Gain (IG) metrics. For benchmarks like the MIT Saliency Benchmark which do not expect saliency models to be probabilistic, but do expect models to capture viewing behavior including systematic biases, we recommend either Normalized Scanpath Saliency (NSS) or Pearson’s Correlation Coefficient (CC).
Our contributions include:
An analysis of 8 metrics commonly used in saliency evaluation. We discuss how these metrics are affected by different properties of the input, and the consequences for saliency evaluation.
Visualizations for all the metrics to add interpretability to metric scores and transparency to the evaluation of saliency models.
An accompanying manuscript to the MIT Saliency Benchmark to help interpret results.
Guidelines for designing new saliency benchmarks, including defining expected inputs and modeling assumptions, specifying a target task, and choosing how to handle dataset bias.
Advice for choosing saliency evaluation metrics based on design choices and target applications.
Similarity metrics operating on image features have been a subject of investigation and application to different computer vision domains[51, 75, 83, 94]. Images are often represented as histograms or distributions of features, including low-level features like edges (texture), shape and color, and higher-level features like objects, object parts, and bags of low-level features. Similarity metrics applied to these feature representations have been used for classification, image retrieval, and image matching tasks [70, 74, 75]. Properties of these metrics across different computer vision tasks also apply to the task of saliency modeling, and we provide a discussion of some applications in Sec. 6.3. The discussion and analysis of the metrics in this paper can correspondingly be generalized to other computer vision applications.
A number of papers in recent years have compared models across different metrics and datasets. Wilming et al.  discussed the choice of metrics for saliency model evaluation, deriving a set of qualitative and high-level desirable properties for metrics: “few parameters”, “intuitive scale”, “low data demand”, and “robustness”. Metrics were discussed from a theoretical standpoint without empirical experiments or quantification of metric behavior.
Le Meur and Baccino  reviewed many methods of comparing scanpaths and saliency maps. For evaluation, however, only 2 metrics were used to compare 4 saliency models. Sharma and Alsam  reported the performance of 11 models with 3 versions of the AUC metric on MIT1003 . Zhao and Koch  performed an analysis of saliency on 4 datasets using 3 metrics. Riche et al.  provided an evaluation 12 saliency models with 12 similarity metrics on Jian Li’s dataset . They compared how metrics rank saliency models and reported which metrics cluster together, but did not provide explanations.
|Metric||Denoted here||Evaluation papers appearing in|
|Area under ROC Curve||AUC||[9, 23, 24, 49, 57, 68, 88, 93]|
|Shuffled AUC||sAUC||[8, 9, 49, 68]|
|Normalized Scanpath Saliency||NSS||[8, 9, 23, 49, 57, 68, 88, 93]|
|Pearson’s Correlation Coefficient||CC||[8, 9, 23, 24, 49, 68, 88]|
|Earth Mover’s Distance||EMD||[49, 68, 93]|
|Similarity or histogram intersection||SIM||[49, 68]|
|Kullback-Leibler divergence||KL||[23, 49, 68, 88]|
|Information Gain||IG||[45, 46]|
Borji, Sihite et al.  compared 35 models on a number of image and video datasets using 3 metrics. Borji, Tavakoli et al.  compared 32 saliency models with 3 metrics for fixation prediction and additional metrics for scanpath prediction on 4 datasets. The effects of center bias and map smoothing on model evaluation were discussed. A synthetic experiment was run with a single set of random fixations while blur sigma, center bias, and border size were varied to determine how the 3 different metrics are affected by these transformations. Our analysis extends to 8 metrics tested on different variants of synthetic data to explore the space of metric behaviors.
Li et al.  used crowdsourced perceptual experiments to discover which metrics most closely correspond to visual comparison of spatial distributions. Participants were asked to select out of pairs of saliency maps the map perceived to be closest to the ground truth map. Human annotations were used to order saliency models, and this ranking was compared to rankings by 9 different metrics. However, human perception can naturally favor some saliency map properties over others (Sec. 2.3). Visual comparisons are affected by the range and scale of saliency values, and are driven by the most salient locations, while small values are not as perceptible and don’t enter into the visual calculations. This is in contrast to metrics that are particularly sensitive to zero values and regularization, which might nevertheless be more appropriate for certain applications, for instance when evaluating probabilistic saliency models (Sec. 6.4).
Emami and Hoberock  compared 9 evaluation metrics (3 novel, 6 previously-published) in terms of human consistency. They defined the best evaluation metric as the one which best discriminates between a human saliency map and a random saliency map, as compared to the ground truth map. Human fixations were split into 2 sets, to generate human saliency maps and ground truth maps for each image. This procedure was the only criterion by which metrics were evaluated, and the chosen evaluation metric was used to compare 10 saliency models.
In this paper, we analyze metrics commonly used in other evaluation efforts (Table I) and reported on the MIT Saliency Benchmark . We include Information Gain (IG), recently introduced by Kümmerer et al. [45, 46]. To visualize metric computations and highlight differences in metric behaviors, we used standard saliency models for which code is available online. These models, depicted in Fig. 1b, include Achanta , AIM , CovSal , IttiKoch [44, 84], Judd , SR , Torralba , and WMAP . Models were used for visualization purposes only, as the primary focus of this paper is comparing the metrics, not the models.
Rather than providing tables of performance values and literature reviews of metrics, this paper offers intuition about how metrics perform under various conditions and where they differ, using experiments with synthetic and natural data, and visualizations of metric computations. We examine the effects of false positives and negatives, blur, dataset biases, and spatial deviations on performance. This paper offers a more complete understanding of evaluation metrics and what they measure.
Most saliency papers include side-by-side comparisons of different saliency maps computed for the same images (as in Fig. 1b). Visualizations of saliency maps are often used to highlight improvements over previous models. A few anecdotal images might be used to showcase model strengths and weaknesses.
Bruce et al.  discussed the problems with visualizing saliency maps, in particular the strong effect that contrast has on the perception of saliency models. We propose supplementing saliency map examples with visualizations of metric computations (as in Fig. 2 and throughout the rest of this paper) to provide an additional means of comparison that is more tightly linked to the underlying model performance than the saliency maps themselves.
The choice of evaluation metrics should be considered in the context of the whole evaluation setup, which requires the following decisions to be made: (1) on which input images saliency models will be evaluated, (2) how the ground truth eye movements will be collected (e.g. at which distance and for how long human observers view each image), and (3) how the eye movements will be represented (e.g. as discrete points, sequences, or distributions). In this section we explain the design choices used for our data collection and evaluation.
We used the MIT Saliency Benchmark dataset (MIT300) of 300 natural images [14, 40]. Eye movements were collected by allowing participants to free-view each image for 2 seconds (more details in the appendix). Such a viewing duration typically elicits 4-6 fixations from each observer. This is sufficient to highlight a few points of interest per image, and offers a reasonable testing ground for saliency models. Different tasks (free viewing, visual search, etc.) also differently direct eye movements and may require alternative model assumptions . The free viewing task is most commonly used for saliency modeling as it requires fewest additional assumptions.
The eye tracking set-up, including participant distance to the eye tracker, calibration error, and image size affects the assumptions that can be made about the collected data. In the eye tracking set-up of the MIT300 dataset, one degree of visual angle is approximately 35 pixels. One degree of visual angle is typically used both (1) as an estimate of the size of the human fovea: e.g. how much of the image a participant has in focus during a fixation, and (2) to account for measurement error in the eye tracking set-up. The robustness of the data also depends on the number of eye fixations collected. In the MIT300 dataset, the eye fixations of 39 observers are available per image, more than in other datasets of similar size.
Once collected, the ground truth eye fixations can be processed and formatted in a number of ways for saliency evaluation. There is a fundamental ambiguity in the correct representation for the fixation data, and different representational choices rely on different assumptions. One format is to use the original fixation locations. Alternatively, the discrete fixations can be converted into a continuous distribution, a fixation map, by smoothing (Fig. 1a). We follow common practice222Some researchers choose to cross-validate the smoothing parameter instead of fixing it as a function of viewing angle [45, 46]. and blur each fixation location using a Gaussian with sigma equal to one degree of visual angle . In the following section, we denote the map of fixation locations as and the continuous fixation map (distribution) as .
Smoothing the fixation locations into a continuous map acts as regularization. It allows for uncertainty in the ground truth measurements to be incorporated: error in the eye-tracking as well as uncertainty of what an observer sees when looking at a particular location on the screen. Any splitting of observer fixations in two sets will never lead to perfect overlap (due to the discrete nature of the data), and smoothing provides additional robustness for evaluation. In the case of few observers, smoothing the fixation locations helps to extrapolate the existing data.
On the other hand, conversion of the fixation locations into a distribution requires parameter selection and post-processing the collected data. The smoothing parameter can significantly affect metric scores during evaluation (Table 12a), unless the model itself is properly regularized.
The fixation locations can be viewed as a discrete sample from some ground truth distribution that the fixation map attempts to approximate. Similarly, the fixation map can be viewed as an extrapolation of discrete fixation data to the case of infinite observers.
Metrics for the evaluation of sequences of fixations are also available . However, most saliency models and evaluations are tuned for location prediction, as sequences tend to be noisier and harder to evaluate. We only consider spatial, not temporal, fixation data.
In this paper, we study saliency metrics, that is, functions that take two inputs representing eye fixations (ground truth and predicted) and then output a number assessing the similarity or dissimilarity between them. Given a set of ground truth eye fixations, such metrics can be used to define scoring functions, which take a saliency map prediction as input and return a number assessing the accuracy of the prediction. The definition of a score can further involve post-processing (or regularizing) the prediction to conform it to known characteristics of the ground truth and ignore potentially distracting idiosyncratic errors. In this paper, we focus on the metric and not on the regularization of ground truth data.
We consider 8 popular saliency evaluation metrics in their most common variants. Some metrics have been designed specifically for saliency evaluation (shuffled AUC, Information Gain, and Normalized Scanpath Saliency), while others have been adapted from signal detection (variants of AUC), image matching and retrieval (Similarity, Earth Mover’s Distance), information theory (KL-divergence), and statistics (Pearson’s Correlation Coefficient). Because of their original intended applications, these metrics expect different input formats: KL-divergence and Information Gain expect valid probability distributions as input, Similarity and Earth Mover’s Distance can operate on unnormalized densities and histograms, while Pearson’s Correlation Coefficient (CC) treats its inputs as random variables.
One of the intentions of this paper is to serve as a guide to complement the MIT Saliency Benchmark, and to provide interpretation for metric scores. The MIT Saliency Benchmark accepts saliency maps as intensity maps, without restricting input to be in any particular form (probabilistic or otherwise). If a metric expects valid probability distributions, we normalize the input saliency maps accordingly, but otherwise make no additional modifications or optimizations.
In this paper we analyze these 8 metrics in isolation from the input format and with minimal underlying assumptions. The only distinction we make in terms of the input that these metrics operate on is whether the ground-truth is represented as discrete fixation locations or a continuous fixation map. Accordingly, we categorize metrics as location-based or distribution-based (following Riche et al. ). This organization is summarized in Table II. In this section, we discuss the particular advantages and disadvantages of each metric, and present visualizations of the metric computations. Additional variants and implementation details are provided in the appendix.
|Similarity||AUC, sAUC, NSS, IG||SIM, CC|
Given the goal of predicting the fixation locations on an image, a saliency map can be interpreted as a classifier of which pixels are fixated or not. This suggests a detection metric for measuring saliency map performance.
In signal detection theory, the Receiver Operating Characteristic (ROC) measures the tradeoff between true and false positives at various discrimination thresholds [32, 26].
The Area under the ROC curve, referred to as AUC, is the most widely used metric for evaluating saliency maps. The saliency map is treated as a binary classifier of fixations at various threshold values (level sets), and an ROC curve is swept out by measuring the true and false positive rates under each binary classifier (level set).
Different AUC implementations differ in how true and false positives are calculated.
Another way to think of AUC is as a measure of how well a model performs on a 2AFC task, where given 2 possible locations on the image, the model has to pick the location that corresponds to a fixation .
Computing true and false positives:
An AUC variant from Judd et al. , called AUC-Judd , is depicted in Fig. 3. For a given threshold, the true positive rate (TP rate) is the ratio of true positives to the total number of fixations, where true positives are saliency map values above threshold at fixated pixels. This is equivalent to the ratio of fixations falling within the level set to the total fixations.
The false positive rate (FP rate) is the ratio of false positives to the total number of saliency map pixels at a given threshold, where false positives are saliency map values above threshold at unfixated pixels. This is equivalent to the number of pixels in each level set, minus the pixels already accounted for by fixations.
Another variant of AUC by Borji et al. , called AUC-Borji , uses a uniform random sample of image pixels as negatives
and defines the saliency map values above threshold at these pixels as false positives. These AUC implementations are compared in Fig. 4. The first row depicts the TP rate calculation, equivalent across implementations. The second and third rows depict the FP rate calculations in AUC-Judd and AUC-Borji, respectively. The false positive calculation in AUC-Borji is a discrete approximation of the calculation in AUC-Judd.
Because of a few approximations in the AUC-Borji implementation that can lead to suboptimal behavior, we report AUC scores using AUC-Judd in the rest of the paper.
Additional discussion, implementation details, and other variants of AUC are discussed in the appendix.
Penalizing models for center bias:
The natural distribution of fixations on an image tends to include a higher density near the center of an image . As a result, a model that incorporates a center bias into its predictions will be able to account for at least part of the fixations on an image, independent of image content. In a center-biased dataset, a center prior baseline will achieve a high AUC score.
samples negatives from fixation locations from other images, instead of uniformly at random. This has the effect of sampling negatives predominantly from the image center because averaging fixations over many images results in the natural emergence of a central Gaussian distribution[78, 88]. In Fig. 4 the shuffled sampling strategy of sAUC is compared to the random sampling strategy of AUC-Borji.
A model that only predicts the center achieves an sAUC score of 0.5 because at all thresholds this model captures as many fixations on the target image as on other images (TP rate = FP rate). A model that incorporates a center bias into its predictions is putting density in the center at the expense of other image regions. Such a model will score worse according to sAUC compared to a model that makes off-center predictions, because sAUC will effectively discount the central predictions (Fig. 6). In other words, sAUC is not invariant to whether the center bias is modeled: it specifically penalizes models that include the center bias.
Invariance to monotonic transformations:
AUC metrics measure only the relative (i.e., ordered) saliency map values at ground truth fixation locations. In other words, the AUC metrics are ambivalent to monotonic transformations. AUC is computed by varying the threshold of the saliency map and computing a trade-off between true and false positives. Lower thresholds correspond to measuring the coverage similarity between distributions, while higher thresholds correspond to measuring the similarity between the peaks of the two maps . Due to how the ROC curve is computed, the AUC score for a saliency map is mostly driven by the higher thresholds: i.e., the number of ground truth fixations captured by the peaks of the saliency map (or the first few level sets as in Fig. 5). Models that place high-valued predictions at fixated locations receive high scores, while low-valued predictions at non-fixated locations are mostly ignored (Sec. 5.2).
The Normalized Scanpath Saliency, NSS was introduced to the saliency community as a simple correspondence measure between saliency maps and ground truth, computed as the average normalized saliency at fixated locations 
. Unlike in AUC, the absolute saliency values are part of the normalization calculation. NSS is sensitive to false positives, relative differences in saliency across the image, and general monotonic transformations. However, because the mean saliency value is subtracted during computation, NSS is invariant to linear transformations like contrast offsets. Given a saliency mapand a binary map of fixation locations :
where indexes the pixel, and
is the total number of fixated pixels. Chance is at 0, positive NSS indicates correspondence between maps above chance, and negative NSS indicates anti-correspondence. For instance, a unity score corresponds to fixations falling on portions of the saliency map with a saliency value one standard deviation above average.
Recall that a saliency model with high-valued predictions at fixated locations would receive a high AUC score even in the presence of many low-valued false positives (Fig. 7d). However, all false positives contribute to lowering the normalized saliency value at each fixation location, thus reducing the overall NSS score (Fig. 7c). The visualization for NSS consists of the normalized saliency value for each fixation location (i.e., where ).
Information Gain, IG, was recently introduced by Kümmerer et al. [45, 46] as an information theoretic metric that measures saliency model performance beyond systematic bias (e.g., a center prior baseline).
Given a binary map of fixations , a saliency map , and a baseline map , information gain is computed as:
where indexes the pixel, is the total number of fixated pixels, is for regularization, and information gain is measured in bits per fixation. This metric measures the average information gain of the saliency map over the center prior baseline at fixated locations (i.e., where ).
IG assumes that the input saliency maps are probabilistic, properly regularized and optimized to include a center prior [45, 46]. A score above zero indicates the saliency map predicts the fixated locations better than the center prior baseline. This score measures how much image-specific saliency is predicted beyond image-independent dataset biases, which in turn requires careful modeling of these biases.
We can also compute the information gain of one model over another to measure how much image-specific saliency is captured by one model beyond what is already captured by another model. The example in Fig. 8 contains a visualization of the information gain of the Judd model over the center prior baseline and over the bottom-up IttiKoch model. Visualized in red are image regions for which the Judd model underestimates saliency relative to each model, and in blue are image regions for which the Judd model achieves a gain in performance over each model at predicting the ground truth. The human under the parachute has a high saliency under the center prior model, while the Judd model underestimates the relative saliency of this area (red), but the parachute is where the Judd model has positive information gain over the center prior (blue). On the other hand, the bottom-up IttiKoch model captures the parachute but misses the person in the center of the image, so in this case the Judd model achieves gains on the central image pixels but not on the parachute. We refer the reader to  for a more detailed discussion and visualizations of the IG metric.
The (location-based) metrics described so far score saliency models at how accurately they predict discrete fixation locations. If the ground truth fixation locations are interpreted as a possible sample from some underlying probability distribution, then another approach is to predict the underlying distribution directly instead of the fixation locations. Although we can not directly observe the ground truth distribution, it is often approximated by Gaussian blurring the fixation locations into a fixation map (Sec. 3.2). In this next section we discuss a set of metrics that score saliency models at how accurately they approximate the continuous fixation map.
The similarity metric, SIM (also referred to as histogram intersection), measures the similarity between two distributions, viewed as histograms. First introduced as a metric for color-based and content-based image matching [71, 77], it has gained popularity in the saliency community as a simple comparison between pairs of saliency maps. SIM is computed as the sum of the minimum values at each pixel, after normalizing the input maps. Given a saliency map and a continuous fixation map :
iterating over discrete pixel locations .
A SIM of one indicates the distributions are the same, while a SIM of zero indicates no overlap.
Fig. 9c contains a visualization of this operation. At each pixel of the visualization, we plot .
Note that the model with the sparser saliency map has a lower histogram intersection with the ground truth map. SIM is very sensitive to missing values, and penalizes predictions that fail to account for all of the ground truth density (see Sec. 5.2 for a discussion).
Effect of blur on model performance:
The downside of a distribution metric like SIM is that the choice of the Gaussian sigma (or blur) in constructing the fixation and saliency maps affects model evaluation. For instance, as demonstrated in the synthetic experiment in Fig. 12a, even if the correct location is predicted, SIM will only reach its maximal value when the saliency map’s sigma exactly matches the ground truth sigma. The SIM score drops off drastically under different sigma values, more than the other metrics. Fine-tuning this blur value on a training set with similar parameters as the test set (eyetracking set-up, viewing angle) can help boost model performances [14, 40].
The SIM metric is good for evaluating partial matches, where a subset of the saliency map accounts for the ground truth fixation map. As a side-effect, false positives tend to be penalized less than false negatives. For other applications, a metric that treats false positives and false negatives symmetrically, such as CC or NSS, may be preferred.
The Pearson’s Correlation Coefficient, CC, also called linear correlation coefficient is a statistical method used generally in the sciences for measuring how correlated or dependent two variables are. CC can be used to interpret saliency and fixation maps, and , as random variables to measure the linear relationship between them :
where is the covariance of and . CC is symmetric and penalizes false positives and negatives equally. It is invariant to linear (but not arbitrary monotonic) transformations. High positive CC values occur at locations where both the saliency map and ground truth fixation map have values of similar magnitudes. Fig. 10 is an illustrative example comparing the behaviors of SIM and CC: where SIM penalizes false negatives significantly more than false positives, but CC treats both symmetrically. For visualizing CC in Fig. 10d, each pixel has value:
Due to its symmetric computation, CC can not distinguish whether differences between maps are due to false positives or false negatives. Other metrics may be preferable if this kind of analysis is of interest.
Kullback-Leibler (KL) is a general information theoretic measure of the difference between two probability distributions. In the saliency literature, depending on how the saliency predictions and ground truth fixations are interpreted as distributions, different KL computations are possible. We discuss a few alternative varieties in the appendix. To avoid future confusion about the KL implementation used, we can refer to this variant as KL-Judd similarly to how the AUC variant traditionally used on the MIT Benchmark is referred to as AUC-Judd. Analogous to our other distribution-based metrics, our KL metric takes as input a saliency map and a ground truth fixation map , and evaluates the loss of information when is used to approximate :
where is a regularization constant333The relative magnitude of will affect the regularization of the saliency maps and how much zero-valued predictions are penalized. The MIT Saliency Benchmark uses MATLAB’s built-in eps with value = 2.2204e-16.. KL-Judd is an asymmetric dissimilarity metric, with a lower score indicating a better approximation of the ground truth by the saliency map. We compute a per-pixel score to visualize the KL computation (Fig. 11d). For each pixel in the visualization, we plot . Wherever the ground truth value is non-zero but is close to or equal to zero, a large quantity is added to the KL score. Such regions are the brightest in the KL visualization. There are more bright regions in the rightmost map of Fig. 11d, corresponding to areas in the ground truth map that were left unaccounted for by the predicted saliency. Both models compared in Fig. 11 are image-agnostic: one is a chance model that assigns a uniform value to each pixel in the image, and the other is a permutation control model which uses a fixation map from another randomly-selected image. The permutation control model is more likely to capture viewing biases common across images. It scores above chance for many of the metrics in Table III. However, KL is so sensitive to zero-values that a sparse set of predictions is penalized very harshly, significantly worse than chance.
|Saliency model||Similarity metrics||Dissimilarity metrics|
All the metrics discussed so far have no notion of how spatially far away the prediction is from the ground truth. Accordingly, any map that has no pixel overlap with the ground truth will receive the same score of zero444Unless the model is properly regularized to compensate for uncertainty., regardless of how predictions are distributed (Fig. 12b). Incorporating a measure of spatial distance can broaden comparisons, and allow for graceful degradation when the ground truth measurements have position error.
The Earth Mover’s Distance, EMD, measures the spatial distance between two probability distributions over a region. It was introduced as a spatially robust metric for image matching [71, 62]. Computationally, it is the minimum cost of morphing one distribution into the other. This is visualized in Fig. 9d where in green are all the saliency map locations from which density needs to be moved, and in red are all the fixation map locations where density needs to be moved to. The total cost is the amount of density moved times the distance moved, and corresponds to brightness of the pixels in the visualization. It can be formulated as a transportation problem . We used the following linear time variant of EMD :
where each represents the amount of density transported (or the flow) from the th supply to the th demand and is the ground distance between bin and bin in the distribution. Equation 7 is therefore attempting to minimize the total amount of density movement such that the total density is preserved after the movement. Constraint (1) allows transporting density from to and not vice versa. Constraint (2) prevents more density to be moved from a location than is there. Constraint (3) prevents more density to be deposited to a location than is there. Constraint (4) is for feasibility: such that the amount of density moved does not exceed the total density found in either or . Solving this problem requires global optimization on the whole map, making this metric quite computationally intensive.
A larger EMD indicates a larger difference between two distributions while an EMD of zero indicates that two distributions are the same. Generally, saliency maps that spread density over a larger area have larger EMD values (i.e., worse scores) as all the extra density has to be moved to match the ground truth map (Fig. 9). EMD penalizes false positives proportionally to the spatial distance they are from the ground truth (Sec. 5.2).
This section contains a set of experiments to study the behavior of 8 different evaluation metrics, where we systematically varied properties of the input predictions to quantify the differential effects on metric scores. We focus on the metrics themselves, without assuming any optimization or regularization on the part of the inputs. This most closely reflects how evaluation is carried out on the MIT Saliency Benchmark, which does not place any restrictions on the format of the submitted saliency maps. As a result, our conclusions about the metrics should be informative for other applications, beyond saliency evaluation.
|Local computations, differentiable||✓||✓||✓||✓||✓|
|Invariant to monotonic transformations||✓||✓|
|Invariant to linear transformations (contrast)||✓||✓||✓||✓|
|Requires special treatment of center bias||✓||✓|
|Most affected by false negatives||✓||✓||✓|
|Scales with spatial distance||✓|
Comparing metrics on a set of baselines can be illustrative of metric behavior and be used to uncover the properties of saliency maps that drive this behavior. In Table III we include the scores of 4 baseline models and an upper bound for each metric. The center prior model is a symmetric Gaussian stretched to the aspect ratio of the image, so each pixel’s saliency value is a function of its distance from the center (higher saliency closer to center). Our chance model assigns a uniform value to each pixel in the image. An alternative chance model that also factors in the properties of a particular dataset is called a permutation control: it is computed by randomly selecting a fixation map from another image. It has the same image-independent properties as the ground truth fixation map for the image since it has been computed with the same blur and scale. The single observer model uses the fixation map from one observer to predict the fixations of the remaining observers (1 predicting ). We repeated this leave-one-out procedure and averaged the results across all observers.
To compute an upper bound for each metric we measured how well the fixations of observers predict the fixations of another group of observers, varying from 1 to 19 (half of the total 39 observers). Then we fit these prediction scores to a power function to obtain the limiting score of infinite observers. The details of this computation can be found in the appendix. This is useful to obtain dataset-specific bounds for metrics that are not otherwise bounded (i.e. NSS, EMD, KL, IG), and to provide realistic bounds that factor in dataset-specific human consistency for metrics where the theoretical bound may not be reachable (i.e. AUC, sAUC).
There is a divergent behavior in the way the metrics score a center prior model relative to a single observer model. The center prior captures dataset-specific, image-independent properties; while the single observer model captures image-specific properties but might be missing properties that emulate average viewing behavior. In particular, the single observer model is quite sparse and so achieves worse scores according to the KL, IG, and SIM metrics.
Similarly, we compare the chance and permutation control models. Both are image-independent. However, the chance model is also dataset-independent, while the permutation control model captures some dataset-specific properties. The CC, NSS, AUC, and EMD scores are significantly higher for the permutation control, pointing to the importance under these metrics, of capturing the properties of a particular dataset (including center bias, blur, and scale). On the other hand, KL and IG are sensitive to insufficient regularization. As a result, the permutation control model, which has more zero values, fares worse than the chance model.
One possible meta-measure for selecting metrics for evaluation is how much better one baseline is over another (e.g., [23, 56, 65]). However, the optimal ranking of baselines is likely to be different across applications: in some cases, it may be useful to accurately capture systematic viewing behaviors if nothing else is known, while in another setting, specific points of interest are more relevant than viewing behaviors.
Different metrics place different weights on the presence of false positives and negatives in the predicted saliency relative to the ground truth.
To directly compare the extent to which metrics penalize false negatives, we performed a series of systematic tests. Starting with the ground truth fixation map, we progressively removed different amounts of salient pixels: pixels with a saliency value above the mean map value were selected uniformly at random and set to 0. We then evaluated the similarity of the resulting map to the original ground truth map and measured the drop in score with 25%, 50%, and 75% false negatives.
To make comparison across metrics possible, we normalized this change in score by the score difference between the infinite observer limit and chance. We call this the chance-normalized score. For instance, for the AUC-Judd metric the upper limit is 0.92, chance is at 0.50, and the score with 75% false negatives is 0.67. The chance-normalized score is: . Values for the other metrics are available in Table V.
KL, IG, and SIM are most sensitive to false negatives:
If the prediction is close to zero where the ground truth has a non-zero value, the penalties can grow arbitrarily large under KL, IG, and SIM. These metrics penalize models with false negatives significantly more than false positives. In Table V, KL and IG scores drop below chance levels with only 25% false negatives. Another way to look at this is that these metrics’ sensitivity to regularization drives their evaluations of models. KL and IG scores will be low for sparse and poorly regularized models.
AUC ignores low-valued false positives:
AUC scores are a function of which level sets the false positives fall into - where false positives in the first few level sets are penalized most, but false positives in the last level set do not have a large impact on performance. Models with many low-valued false positives (e.g., Fig. 7) do not incur large penalties. Saliency maps that place different amounts of density but at the correct (fixated) locations will receive similar AUC scores (Fig. 12d).
NSS and CC are equally affected by false positives and negatives: During the normalization step of NSS, a few false positives will be washed out by the other saliency values and will not significantly affect the saliency values at fixated locations. However, as the number of false positives increases, they begin to have a larger influence on the normalization calculation, driving the overall NSS score down.
By construction, CC has a symmetric treatment of false positives and negatives. However, NSS is highly related to CC, and can be viewed as a discrete approximation (see appendix). NSS behavior will be very similar to CC, including the treatment of false positives and negatives.
EMD’s penalty depends on spatial distance: EMD is least sensitive to uniformly-occurring false negatives (e.g., Table V) because the EMD calculation can redistribute saliency values from nearby pixels to compensate. However, false negatives that are spatially far away from any predicted density are highly penalized. Similarly, EMD’s penalty for false positives depends on their spatial location relative to the ground truth, in that false positives close to ground truth locations can be redistributed to those locations at low cost, but distant false positives are highly penalized (Fig. 9).
Common to many images is a higher density of fixations in the center of the image compared to the periphery, a function of both photographer bias (i.e., centering the main subject) and observer viewing biases. The effect of center bias on model evaluation has received much attention [13, 22, 47, 60, 67, 78, 79, 93]. In this section we discuss center bias in the context of the metrics in this paper.
sAUC penalizes models that include center bias:
The sAUC metric samples negatives from other images, which in the limit of many images corresponds to sampling negatives from a central Gaussian. For an image with a strong central viewing bias, both positives and negatives would be sampled from the same image region, and a correct prediction would be at chance (Fig. 6).
The sAUC metric prefers models that do not explicitly incorporate center bias into their predictions. For a fair evaluation under sAUC, models need to operate under the same assumptions, or else their scores will be dominated by whether or not they incorporate center bias.
IG provides a direct comparison to center bias:
Information gain over a center prior baseline provides a more intuitive way to interpret model performance relative to center bias. If a model can not explain fixation patterns on an image beyond systematic viewing biases, such a model will have no gain over a center prior.
EMD spatially hedges its bets:
The EMD metric prefers models that hedge their bets if all the ground truth locations can not be accurately predicted (Fig. 12c). For instance, if an image is fixated in multiple locations, EMD will favor a prediction that falls spatially between the fixated locations instead of one that captures a subset of the fixated locations (contrary to the behavior of the other metrics).
A center prior is a good approximation of average viewing behavior on images under laboratory conditions, where an image is projected for a few seconds on a computer screen in front of an observer . A dataset-specific center prior emerges when averaging fixations over a large set of images. Knowing nothing else about image content, the center bias can act as a simple model prior. Overall if the goal is to predict natural viewing behavior on an image, center bias is part of the viewing behavior and discounting it entirely may be suboptimal. However, different metrics make different assumptions about the models: sAUC penalizes models that include center bias, while IG expects center bias to already be optimized. These differences in metric behaviors have lead to differences in whether models include or exclude center bias (e.g. [40, 14]). As a result, model rankings according to a particular metric can often be dominated by the differences in modeled center bias (Sec. 5.4).
As saliency metrics are often used to rank saliency models, we can measure how correlated the rankings are across metrics. This analysis will indicate whether metrics favor or penalize similar behaviors in models. We sort model performances according to each metric and compute the Spearman rank correlation between the model orderings of every pair of metrics to obtain the correlation matrix in Fig. 13. The pairwise correlations between NSS, CC, AUC, EMD, and SIM range from 0.76 to 0.98. Because of these high correlations, we call this the similarity cluster of metrics. CC and NSS are most highly correlated due to their analogous computations, as are KL and IG (see appendix).
Driven by extreme sensitivity to false negatives, KL, IG, and SIM rank saliency models differently than the similarity cluster. Viewed another way, these metrics are worse behaved if saliency models are not properly regularized. For these metrics, a zero-valued prediction is interpreted as an impossibility of fixations at that location, while for the other metrics, a zero-valued prediction is treated as less salient. These metrics have a natural probabilistic interpretation and are appropriate in cases where missing any ground truth fixation locations should be highly penalized, such as for detection applications (Sec. 6.3). Changing the regularization constant in the metric computations (Eq. 2,6) or regularizing the saliency models prior to evaluation (as in ) can reduce score differences between KL, IG, and the similarity cluster.
Although EMD is the only metric that takes into account spatial distance, it nevertheless ranks saliency models similarly as the other similarity cluster metrics. This is likely the case for two reasons: (i) like the similarity cluster metrics, EMD is also center biased (Table III, Sec. 5.3) and (ii) current model mistakes are often a cause of completely incorrect prediction rather than imprecise localization (note: as models continue to improve, this might change).
Shuffled AUC (sAUC) has low correlations with other metrics because it modifies how predictions at different spatial locations on the image are treated. A model with more central predictions will be ranked lower than a model with more peripheral predictions (Fig. 6). Shuffled AUC assumes center bias has not been modeled, and penalizes models where it has. For these reasons, sAUC has been disfavored by some evaluations [12, 49, 57]. An alternative is optimizing models to include a center bias [40, 41, 45, 46, 61, 93]. In this case, the metric can be ambivalent to any model or dataset biases.
Saliency metrics are much more correlated once models are optimized for center bias, blur, and scale [40, 45, 46]. As a result, the differences between the metrics in Fig. 13 are largely driven by how sensitive the metrics are to these model properties. It is therefore valuable to know if different models make similar modeling assumptions in order to interpret saliency rankings meaningfully across metrics.
Riche et al.  correlated metric scores on another saliency dataset and found that KL and sAUC are most different from the other metrics, including AUC, CC, NSS, and SIM, which formed a single cluster. We can explain this finding, since KL and sAUC make stronger assumptions about saliency models: KL assumes saliency models have sufficient regularization (otherwise false negatives are severely penalized) and sAUC assumes the model does not have a built-in center bias. Both Riche et al.  and our results show that these assumptions do not always hold for the commonly evaluated saliency models, leading to divergent rankings across metrics.
Emami and Hoberock  used human consistency to compare 9 metrics. In discriminating between human saliency maps and random saliency maps, they found that NSS and CC were the best, and KL the worst. This is similar to the analysis in Sec. 5.1.
Li et al.  used crowd-sourced experiments to measure which metric best corresponds to human perception. The authors noted that human perception was driven by the most salient locations, the compactness of salient locations (i.e., low false positives), and a similar number of salient regions as the ground truth. As a result, the perception-based ranking most closely matched that of NSS, CC, and SIM, and was furthest from KL and EMD. However, the properties that drive human perception could be different than the properties desired for other applications of saliency. For instance, for evaluating probabilistic saliency maps, proper regularization and the scale of the saliency values (including very small values) can significantly affect evaluation. For these cases, perception-based metrics might not be as appropriate.
We propose that the assumptions underlying different models and metrics be considered more carefully, and that the different metric behaviors and properties enter into the decision of which metrics to use for evaluation (Table IV).
Saliency models have evolved significantly since the seminal IttiKoch model [44, 84] and the original notions of saliency. Evaluation procedures, saliency datasets, and benchmarks have adapted accordingly. Given how many different metrics and models have emerged, it is becoming increasingly necessary to systematize definitions and evaluation procedures to make sense of the vast amount of new data and results . The MIT Saliency Benchmark is a product of this evolution of saliency modeling; an attempt to capture the latest developments in models and metrics. However, as saliency continues to develop as a research area, larger more specialized datasets may become more appropriate. Based on our experience with the MIT Saliency Benchmark, we provide some recommendations for future saliency benchmarks.
As observed in the previous section, some of the inconsistencies in how metrics rank models are due to differing assumptions that saliency models make. This problem has been emphasized by Kümmerer et al. [45, 46], who argued that if models were explicitly designed and submitted as probabilistic models, then some ambiguities in evaluation would disappear. For instance, a probability of zero in a probabilistic saliency map assumes that a fixation in a region is impossible; under alternative definitions, a value of zero might only mean that a fixation in a particular region is less likely. Metrics like KL, IG, and SIM are particularly sensitive to zero values, so models evaluated under these metrics would benefit from being regularized and optimized for scale. Similarly, knowing whether evaluation will be performed with a metric like sAUC should affect whether center bias is modeled, because this design decision would be penalized under this metric. A saliency benchmark should specify what definition of saliency is assumed, what kind of saliency map input is expected, and how models will be evaluated. The appendix includes additional considerations.
In saliency datasets, dataset bias occurs when there are systematic properties in the ground-truth data that are dataset-specific but image-independent. Most eye-tracking datasets have been shown to be center biased, containing a larger number of fixations near the image center, across different image types, videos, and even observer tasks [8, 9, 16, 18, 36, 39]. Center bias is a function of multiple factors, including photographer bias and observer bias, due to the viewing of fixed images in a laboratory setting [78, 90]. As a result, some models have a built-in center bias (e.g., Judd ), some metrics penalize center bias (e.g., sAUC), and some benchmarks optimize models with center bias prior to evaluation (e.g., LSUN ). These different approaches result from a disagreement in where systematic biases should be handled: at the level of the dataset, model, or evaluation. For transparency, saliency benchmarks should specify whether the submitted models are expected to incorporate center bias, or if dataset-specific center bias will be accounted for and subtracted during evaluation. In the former case, the benchmark can provide a training dataset on which to optimize center bias and other image-independent properties of the ground truth dataset (e.g., blur, scale, regularization), or else share these parameters directly.
The MIT Saliency Benchmark provides the MIT1003 dataset  as a training set to optimize center bias and blur parameters, and for histogram matching (scale regularization)555Associated code is provided at https://github.com/cvzoya/saliency/tree/master/code_forOptimization.. Both MIT300 and MIT1003 have been collected using the same eye tracker setup, so the ground truth fixation data should have similar distribution characteristics, and parameter choices should generalize across these datasets.
The first saliency models were not designed with these considerations in mind, so when compared to models that had incorporated center bias and other properties into saliency predictions, the original models were at a disadvantage. However, the availability of saliency datasets has increased, and many benchmarks provide training data from which systematic parameters can be learned [14, 33, 37]. Many modern saliency models are a result of this data-driven approach. Over the last few years, we have seen fewer differences across saliency models in terms of scale, blur, and center bias .
Saliency models are often designed to predict general task-free saliency, assigning a value of saliency or importance to each image pixel, largely independent of the end application. Saliency is often motivated as a useful representation for image processing applications such as image re-targeting, compression, and transmission, object and motion detection, and image retrieval and matching [6, 38]. However, if the end goal is one of these applications, then it might be easier to directly train a saliency model for the relevant task, rather than for task-free fixation prediction. Task-based, or application-specific, saliency prediction is not yet very common. Relevant datasets and benchmarks are yet to be designed. Evaluating saliency models on specific applications requires choosing metrics that are appropriate to the underlying task assumptions and expected input.
Consider a detection application of saliency such as object and motion detection, surveillance, localization and mapping, and segmentation [1, 17, 27, 28, 43, 50, 59, 91]. For such an application, a saliency model may be expected to produce a probability density of possible object locations, and be highly penalized if a target is missed. For this kind of probabilistic target detection, AUC, KL, and IG would be appropriate. EMD might be useful if some location invariance is permitted.
Applications including adaptive image and video compression and progressive transmission [30, 35, 54, 87], thumbnailing [55, 76], content-aware image re-targeting and cropping [3, 4, 69, 72, 85], rendering and visualization [42, 52], collage [31, 86] and artistic rendering [20, 41] require ranking (by importance or saliency) different image regions. For these applications, when it is valuable to know how much more salient a given image region is than another, an evaluation metric like AUC (that is ambivalent to monotonic transformations of the input map) is not appropriate. Instead, NSS or SIM would provide a more useful evaluation.
A goal of this paper has been to show how metrics behave under different conditions. This can help guide the selection of metrics for saliency benchmarks, depending on the assumptions that are made (e.g., whether the models are probabilistic, whether center bias is accounted for, etc.). A saliency benchmark should specify any assumptions that can be made along with the expected saliency map format.
The MIT Saliency Benchmark assumes that all fixation behavior is part of the saliency modeling: including any systematic dataset parameters (e.g., blur, scale, etc.). Capturing viewing biases is part of the modeling requirements. Metrics like shuffled AUC will penalize models that have a strong center bias. Saliency models submitted are not necessarily probabilistic, so they might be unfairly evaluated by the KL, IG, and SIM metrics that penalize zero values (false negatives), unless they are first regularized and pre-processed as in Kümmerer et al. . AUC has begun to saturate on the MIT Saliency Benchmark and is becoming less capable of discriminating between different saliency models . This is because AUC is ambivalent to monotonic transformations. However, for certain saliency applications it might be valuable to know exactly how much more salient a given image region is than another, and not just their relative saliency ranks. Of the remaining metrics, the Earth Mover’s Distance (EMD) is computationally expensive to compute and difficult to optimize for. Given all of this, for a benchmark operating under the same assumptions as the MIT Saliency Benchmark, we recommend reporting either CC or NSS. Both make limited assumptions about input format, and treat false positives and negatives symmetrically. For a benchmark intended to evaluate saliency maps as probability distributions, IG and KL would be good choices; IG specifically measures prediction performance beyond systematic dataset biases.
|Area under ROC Curve (AUC)||Historically the most commonly-used metric for saliency evaluation. Invariant to monotonic transformations. Driven by high-valued predictions and largely ambivalent of low-valued false positives. Currently saturating on standard saliency benchmarks [14, 15]. Good for detection applications.|
|Shuffled AUC (sAUC)||A version of AUC that compensates for dataset bias by scoring a center prior at chance. Most appropriate in evaluation settings where the saliency model is not expected to account for center bias. Otherwise, has similar properties to AUC.|
|Similarity (SIM)||An easy and fast similarity computation between histograms. Assumes the inputs are valid distributions. More sensitive to false negatives than false positives.|
|Pearson’s Correlation Coefficient (CC)||A linear correlation between the prediction and ground truth distributions. Treats false positives and false negatives symmetrically.|
|Normalized Scanpath Saliency (NSS)||A discrete approximation of CC that is additionally parameter-free (operates on raw fixation locations). Recommended for saliency evaluation.|
|Earth Mover’s Distance (EMD)||The only metric considered that scales with spatial distance. Can provide a finer-grained comparison between saliency maps. Most computationally intensive, non-local, hard to optimize.|
|Kullback-Leibler divergence (KL)||Has a natural interpretation where goal is to approximate a target distribution. Assumes input is a valid probability distribution with sufficient regularization. Mis-detections are highly penalized.|
|Information Gain (IG)||A new metric introduced by [45, 46]. Assumes input is a valid probability distribution with sufficient regularization. Measures the ability of a model to make predictions above a baseline model of center bias. Otherwise, has similar properties to KL.|
We provided an analysis of the behavior of 8 evaluation metrics to make sense of the differences in saliency model rankings according to different metrics. Properties of the inputs affect metrics differently: how the ground truth is represented; whether the prediction includes dataset bias; whether the inputs are probabilistic; whether spatial deviations exist between the prediction and ground truth. Knowing how these properties affect metrics, and which properties are most important for a given application can help with metric selection for saliency model evaluation. Other considerations for metric selection include whether the metric computations are expensive, local, and differentiable, which would influence whether a metric is appropriate for model optimization. Take-aways about the metrics are included in Table VI.
We considered saliency metrics from the perspective of the MIT Saliency Benchmark, which does not assume that saliency models are probabilistic as in [45, 46], but does assume that all systematic dataset biases (including center bias, blur, scale) are taken care of by the model. Under these assumptions we found that the Normalized Scanpath Saliency (NSS) and Pearson’s Correlation Coefficient (CC) metrics provide the fairest comparison. Being closely related mathematically, their rankings of saliency models are highly correlated, and reporting performance using one of them is sufficient. However, under alternative assumptions and definitions of saliency, another choice of metrics may be more appropriate. Specifically, if saliency models are evaluated as probabilistic models, then KL-divergence and Information Gain (IG) are recommended. Arguments for why it might be preferable to define and evaluate saliency models probabilistically can be found in [45, 46]. Specific tasks and applications may also call for a different choice of metrics. For instance, AUC, KL, and IG are appropriate for detection applications, as they penalize target detection failures. However, where it is important to evaluate the relative importance of different image regions, such as for image-retargeting, compression, and progressive transmission, metrics like NSS or SIM are a better fit.
In this paper we discussed the influence of different assumptions on the choice of appropriate metrics. We provided recommendations for new saliency benchmarks, such that if designed with explicit assumptions from the start, evaluation can be more transparent and reduce confusion in saliency evaluation. We also provide code for evaluating and visualizing the metric computations666http://saliency.mit.edu/downloads.html to add further transparency to model evaluation and to allow researchers a finer-grained look into metric computations, to debug saliency models and visualize the aspects of saliency models driving or hurting performance.
The authors would like to thank Matthias Kümmerer and other attendees of the saliency tutorial at ECCV 2016 for helpful discussions about saliency evaluation777http://saliency.mit.edu/ECCVTutorial/ECCV_saliency.htm. Thank you also to the anonymous reviewers for many detailed suggestions. ZB was supported by a postgraduate scholarship (PGS-D) from the Natural Sciences and Engineering Research Council of Canada. Support to AO, AT, and FD was provided by the Toyota Research Institute / MIT CSAIL Joint Research Center.
State-of-the-art in visual attention modeling.IEEE TPAMI, 35(1):185–207, 2013.
Large-scale scene understanding challenge: Saliency prediction.Technical report, 2016. Available at: lsun.cs.princeton.edu/challenge/2016/saliency/saliency.pdf.
Images for the MIT300 dataset were obtained from Flickr Creative Commons and personal photo collections. Eye movements were collected using a table-mounted, video-based ETL 400 ISCAN eye tracker which recorded observers’ gaze paths at 240Hz. The average calibration error was less than one degree of visual angle. Each image was presented for 2 seconds at a maximum dimension of 1024 pixels and the second dimension between 457-1024 pixels (mode: 768 pixels). The task instruction was: ”You will see a series of 300 images. Look closely at each image. After viewing the images you will have a memory test: you will be asked to identify whether or not you have seen particular images before”. This was used to motivate participants to pay attention, but no memory test was used. Images were separated by a 500 ms fixation cross. During pre-processing, the first fixation on each image was thrown out to reduce the center-biasing effects of the fixation cross. A list of alternative eyetracking datasets with different experimental setups, tasks, images, and exposure durations is available at http://saliency.mit.edu/datasets.html.
Location-based versus distribution-based metrics:
The particular implementations of the metrics we use can be categorized as either location-based or distribution-based, as presented in the paper. However, there are implementations of AUC and NSS that require the ground truth to be a distribution . In these implementations, the ground truth distribution is then pre-processed into a binary map by thresholding at a fixed, often arbitrary value. This requires an additional parameter for the metric computation. Our parameter-free, location-based implementations of AUC and NSS are more commonly used for saliency evaluation.
Sampling thresholds for the ROC curve:
The ROC curve is obtained by plotting the true positive rate against the false positive rate at various thresholds of the saliency map. Choosing how to sample thresholds to approximate the continuous ROC curve is an important implementation consideration. A saliency map is first normalized so all saliency values lie between 0 and 1. In the AUC-Judd implementation, each distinct saliency map value is used as a threshold, so this sampling strategy provides the most accurate approximation to the continuous curve. To ensure that enough threshold samples are taken, the saliency map is first jittered by adding a tiny random value to each pixel, thus preventing large uniform regions of one value in the saliency map.
In the AUC-Borji implementation the threshold is sampled at a fixed step size (from 0 to 1 by increments of 0.1), and thus provides a suboptimal approximation for saliency maps that are not histogram equalized. For this reason, and for the otherwise similar computation to AUC-Judd, we report AUC scores using the AUC-Judd implementation in the main paper.
Sampling negatives in AUC-Borji and sAUC:
The AUC-Borji score is calculated by repeatedly sampling a new set of negatives in 100 separate iterations and averaging these intermediate AUC computations together. On each iteration, as many negatives are chosen as fixations on the current image.
In the shuffled AUC (sAUC) variant, negatives are sampled at random from 10 other randomly-sampled images in the dataset (as many negatives are sampled as fixations on the current image), and the final score is also obtained by averaging over 100 trials.
A note about naming: Riche et al.  refer to shuffled AUC as AUC-Borji, but here we make a distinction between Borji’s implementation of AUC with randomly-sampled negatives, and sAUC with negatives sampled from other images .
Other AUC implementations:
Our AUC implementations are location-based (as in [10, 29, 34, 41, 79, 81]), but other distribution-based implementations of AUC have also been used in saliency evaluation, where both ground truth and saliency inputs are continuous maps . The thresholding for computing the ROC curve can be performed on the ground truth map, the saliency map, or both . In the first two cases, one of the maps is thresholded at different values, while the other map is thresholded at a single, fixed value (e.g. to keep 20% of the pixels [81, 57]).
AUC is non-symmetric, and depending on which map is taken as the reference, different scores will be produced. A symmetric variant can be obtained by averaging two non-symmetric AUC calculations by swapping the two maps being compared .
A recent AUC variant  attempts to quantify spatial bias more directly instead of attempting to remove it with metrics like shuffled AUC.
A nonlinear correlation coefficient (Spearman’s CC), has also be used for saliency evaluation [57, 68, 80]. Unlike Pearson’s CC which takes into account the absolute values of the two distribution maps, Spearman’s CC only compares the ranks of the values, making it robust under monotonic transformations.
Relationship between CC and NSS:
Recall that NSS is calculated as:
where is the normalized saliency map, is a binary map of fixations, and is the number of fixated pixels. If the fixations are sampled instead from a fixation distribution , then the probability that a particular fixation at pixel is chosen is just the density . By sampling from , we can construct the binary map (since ). Over sets of samples from :
Note that CC can be written as:
Where is the total number of pixels in the image, and both and are normalized. Recall that NSS and CC both normalize by variance. Thus, NSS can be viewed as a kind of discrete approximation to CC.
The standard implementation of KL that we use is non-symmetric by construction. A symmetric extension of KL as in [6, 49] is computed as: (also see Jeffrey divergence ). We use the asymmetric variant which allows the resulting KL score to be more easily interpreted, since it measures how good a saliency map prediction is at approximating the ground truth distribution. The symmetric variant is more appropriate for comparing saliency maps to each other or for computing inter-observer consistency, cases where it is not well defined what is the predicted versus ground truth distribution . Unlike our variant, the symmetric variant penalizes false negatives and false positives equally.
Other KL implementations:
In this paper, the variant of KL we adopt would be called the image-based KL-divergence according to  since we compute the KL divergence between the saliency map and fixation map directly. This is in contrast to the fixation-based KL divergence that is calculated by binning saliency values at fixated and nonfixated locations and computing the KL divergence of these histograms.
Both versions of KL have been used for saliency evaluation under the metric name KL, leading to some confusion. The Supporting Information of  includes a list of papers (Table S3) using each of these varieties.
There is also a shuffled implementation of KL available  to discount central predictions similar to shuffled AUC (sAUC).
Relationship between KL and IG:
Where iterates over all the pixels in the distribution (approximating an integral). Then:
which for very small approaches:
yielding the discrete approximation:
and within a constant factor (due to change of base from natural log to base 2), this is equal to . Information gain is measured in terms of bits/fixation.
Information gain is like KL but baseline-adjusted (recall also that KL is a dissimilarity metric, while IG is a similarity metric, explaining the change of places between and ). The additional distinction is that IG is more computationally similar to fixation-based KL, rather than image-based KL (which we use in the main paper).
For the IG visualizations in the paper, we compute a per-pixel value of: . This value is then modulated by the human fixation distribution . In red are all pixels where , and in blue are all pixels where . Note that the visualizations in this paper are different from the ones in [45, 46].
We use a fast implementation of EMD provided by Pele and Werman  888Code at http://www.cs.huji.ac.il/~ofirpele/FastEMD/ but without a threshold. For additional efficiency, we resize both maps to 1/32 of their size after they are first resized to the same dimensions. The maps are then normalized to sum to one. Despite these modifications, EMD is more computationally expensive to compute than any of the other metrics because it requires joint optimization across all the image pixels.
For visualization, at pixel we plot in green for all where , and at pixel , we plot in red for all where . Note that the set of pixels where is disjoint from the set of pixels where , so each pixel is either red or green or neither.
Metric computations often involve normalizing the input maps. This allows maps with different saliency value ranges to be compared. A saliency map can be normalized in a number of ways:
(a) Normalization by range:
(b) Normalization by variance999This is also often called standardization.:
(c) Normalization by sum:
Table VII lists the normalization strategies applied to saliency maps by the metrics in this paper. Another approach is normalization by histogram matching, with histogram equalization being a special case. Histogram matching is a monotonic transformation that remaps (re-bins) a saliency map’s values to a new set of values such that the number of saliency values per bin matches a target distribution.
Effect of normalization on metric behaviors:
Histogram matching does not affect AUC calculations101010Unless the thresholds for the ROC curve are not adjusted accordingly. For instance, in the ROC-Borji implementation with uniform threshold sampling, histogram matching changes the number of saliency map values in each bin (at each threshold)., but does affect all the other metrics. Histogram matching can make a saliency map more peaked or more uniform. This has different effects on metrics: for instance, EMD prefers sparser maps provided the predicted locations are near the target locations (the less density to move, the better). However, more distributed predictions are more likely to have non-zero values at the target locations and better scores on the other metrics. These are important considerations for preprocessing saliency maps.
|Metric||Normalized by range||Normalized by variance||Normalized by sum|
Different metrics use different normalization strategies for pre-processing saliency maps prior to scoring them. Normalization can change the extent to which the range of saliency values and outliers affect performance.
Different normalization schemes can also change how metric scores are impacted by very high and very low values in a saliency map. For instance, in the case of NSS, if a large outlier saliency value occurs at least at one of the fixation locations, then the resulting NSS score will be correspondingly high (since it is an outlier, it will not be significantly affected by normalization). Alternatively, if most saliency map values are large and positive except at fixation locations, then the normalized saliency map values at the fixation locations can be arbitrarily large negative numbers.
Normalization for conversion to a density:
It is a common approach during saliency evaluation to take a saliency map as input and normalize it by its sum to convert it into a probability distribution, prior to computation of the SIM, KL, and IG scores. However, if the initial map was not designed to be probabilistic, this transformation is not sufficient to qualify the map as a probabilistic map. For instance, a value of zero in a probabilistic map implies the map predicts that fixations are impossible in this image region. This is why regularization is an important factor for probabilistic maps. Adding a small epsilon to a map’s predictions can drastically improve its KL or IG score. Note additionally that if a saliency map is stored in a compressed format, small regularization values might not be preserved, so the format of the saliency map can either facilitate or hinder evaluation according to different assumptions (see Sec. A.5 below).
Empirical limits of metrics
One of the differences between location-based and distribution-based metrics is that the empirical limits of location-based metrics (AUC, sAUC, NSS, IG) on a given dataset do not reach the theoretical limits (Table VIII). The sets of fixated and non-fixated locations are not disjoint, and thus no classifier can reach its theoretical limit . In this regard, the distribution metrics are more robust. Although different sets of observers fixate similar but not identical locations, continuous fixation maps converge to the same underlying distribution as the number of observers increases. To make scores comparable across metrics and datasets, empirical metric limits can be computed. Empirical limits are specific to a dataset, dependent on the consistency between humans, and can be used as a realistic upper bound for model performance.
We measured human consistency using the fixations of one group of observers to predict the fixations of another group of observers. By increasing the number of observers, we extrapolated performance to infinite observers. After computing performance for to (half of the total 39 observers), we fit these points to the power function , constraining to be negative and to lie within the valid theoretical range of the metric (see Fig. 14). The results of the fitting function111111Matlab’s fit function, using non-linear least squares fitting. that we include in Table VIII are the empirical limit and the 95% confidence bounds. Once the empirical limit has been computed for a given metric on a given dataset, this limit can be used to normalize the scores for all computational models .
|Metric limits||Similarity metrics||Dissimilarity metrics|
|Theoretical range (best score in bold)||[0,1]||[0,1]||[,]||[0,1]||[-1,1]||[,]||[0,]||[0,]|
|(with 95% confidence bounds)||(0.91; 0.93)||(0.79; 0.83)||(3.08; 3.50)||(0.76; 1.24)||(0.82; 1.18)||(2.14; 2.86)||0||0|
Other researchers compute the limit of human consistency as the inter-observer (IO) model where all other observers are used to predict the fixations of the remaining observer [7, 57, 64, 88]. The resulting scores are usually averaged over all or a subset of observers. To avoid confusion, note that this IO model is different from our single observer model: in the IO model observers predict 1 observer; the single observer model, 1 observer predicts observers.
For our center prior model we use a Gaussian stretched to the aspect ratio of the image. This version of the center prior performs slightly better than an isotropic Gaussian because objects of interest tend to be spread along the longer axis. See Clarke and Tatler  for an analysis of different types of center models.
Our chance model assigns a uniform value to each pixel in the image. According to this model, there are few zero values in the resulting chance maps. An alternative interpretation of chance could be to create a random fixation map by randomly selecting a number of locations in the image to serve as fixation locations, and Gaussian blurring the result. This chance model is likely to perform differently according to our metrics because of its different properties. In particular, the greater sparsity in the map, if not regularized properly, would lead to low KL, IG, and SIM scores.
For our single observer model we use the fixation map from one observer to predict the fixations of the remaining observers. We compute the single observer fixation map by Gaussian blurring the fixations of an observer with blur sigma equal to 1 degree of viewing angle in the ground truth eye tracking data. A different blur sigma or regularization factor for this model may compensate for the sparse predictions this model makes and improve its performance according to the KL, IG, and SIM metrics.
Regarding histogram matching:
Prior to September 2014, the MIT Saliency Benchmark histogram matched saliency maps to a target distribution before evaluation . This was intended to reduce differences in saliency map ranges. However, this had significant effects on model performances, inflating or deflating scores, depending on the model. The decision was made to evaluate saliency maps as-is and to leave any preprocessing to the model submitters. This also makes reporting more transparent, as the scores posted on the website directly correspond to the maps submitted.
Saliency map input:
Kümmerer et al. argues that a probabilistic definition is most intuitive for saliency models because it makes the saliency value in an image region easily interpretable: as the probability that a fixation is expected to occur there; or if differently normalized, as the expected number of fixations to occur in that region from an observer or population of observers [45, 46]. It also makes the relative values in a saliency map meaningful; e.g., a region with twice the saliency value can be expected to have twice the fixations.
The precise format in which a saliency model is submitted for evaluation (i.e., to a saliency benchmark) also affects the resulting performance numbers. For instance, jpg-encoded images only save 8 bits per pixel, and the jpg artifacts can have a large impact in image regions with low saliency values. A better approach is requiring model entries in non-compressed formats. A map saved as a log probability map instead of just as a probability map is also better for representing a larger range of values, and for preserving small saliency values (e.g., for regularization). A given saliency benchmark should specify what kind of input is required, so that both the model submitters and evaluators operate under the same set of assumptions, and for the saliency map values to be handled correctly during evaluation.