What is clutter? While it seems easy to tell a cluttered desk from an uncluttered desk at a glance, it is hard to quantify clutter with a number. Is a cluttered desk one stacked with papers? Or is an uncluttered desk one that is more organized, irrespective of the number of items? An important goal in clutter research has been to develop an image-based computational model that outputs a quantitative measure that correlates with human perceptual behavior oliva2004identifying ; henderson2009influence ; rosenholtz2007measuring . Previous studies have created models that output global or regional metrics to measure clutter perception. Such measures aim to predict the influence of clutter on perception. However, one important aspect of human visual perception is that it is not space invariant: the fovea processes visual information with high spatial detail while regions away from the central fovea have access to lower spatial detail. Thus, the influence of clutter on perception can depend on the retinal location of the stimulus and such influences will likely interact with the information content in the stimulus.
The goal of the current paper is to develop a foveated clutter model that can successfully predict the interaction between retinal eccentricity and image content in modulating the influence of clutter on perceptual behavior. We introduce a foveated mechanism based on the peripheral architecture proposed by Freeman and Simoncelli freeman2011metamers and stack it onto a current clutter model (Feature Congestion rosenholtz2005feature ; rosenholtz2007measuring ) to generate a clutter map that arises from a calculation of information loss with retinal eccentricity but is multiplicatively modulated by the original unfoveated clutter score. The new measure is evaluated in a gaze-contingent psychophysical experiment measuring target detection in complex scenes as a function of target retinal eccentricity. We show that foveated clutter models that account for loss of information in the periphery correlate better with human target detection (hit rate) across retinal eccentricities than non-foveated models. Although the model is presented in the context of Feature Congestion, the framework can be extended to any previous or future clutter metric whose clutter score is computed from a global pixel-wise clutter map.
2 Previous Work
Previous studies have developed general measures of clutter computed for an entire image and do not consider the space-variant properties of the human visual system. Because our work seeks to model and assess the interaction between clutter and retinal location, experiments manipulating the eccentricity of a target while observers hold fixation (gaze contingent forced fixation) are most appropriate to evaluate the model. To our knowledge there has been no systematic evaluation of fixation dependent clutter models with forced fixation target detection in scenes. In this section, we will give an overview of state-of-the-art clutter models, metrics and evaluations.
2.1 Clutter Models
Feature Congestion: Feature Congestion, initially proposed by rosenholtz2005feature ; rosenholtz2007measuring , produces both a pixel-wise clutter score map and a global clutter score for any input image or Region of Interest (ROI). Each clutter map is computed by combining a color map in CIELab space, an orientation map landy1991texture , and a local contrast map at multiple scales through Gaussian pyramids burt1983laplacian . One of the main advantages of Feature Congestion is that each pixel-wise clutter score map (Fig. 1) and global score can be computed in less than a second. Furthermore, this is one of the few models that can output a specific clutter score for any pixel or ROI in an image. This will be crucial for developing a foveated model as explained in Section 4.
Edge Density: Edge Density computes a ratio after applying an Edge Detector on the input image oliva2004identifying . The final clutter score is the ratio of edges to total number of pixels present in the image. The intuition for this metric is straightforward: “the more edges, the more clutter” (due to objects for example).
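As an illustration of the edge-ratio idea, the following is a minimal sketch in plain Python. The central-difference gradient and the fixed `threshold` are stand-ins chosen for brevity and are assumptions of this sketch; the original metric relies on a proper edge detector, and the function name `edge_density` is hypothetical.

```python
import math

def edge_density(gray, threshold=0.2):
    """Fraction of pixels whose gradient magnitude exceeds `threshold`.

    `gray` is a 2D list of intensities in [0, 1]. The central-difference
    gradient and fixed threshold are assumed stand-ins for an edge detector.
    """
    h, w = len(gray), len(gray[0])
    edges = 0
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            if math.hypot(gx, gy) > threshold:
                edges += 1
    # Ratio of edge pixels to the total number of pixels.
    return edges / (h * w)

# A uniform image has no edges; a vertical step produces a band of edge pixels.
flat = [[0.5] * 8 for _ in range(8)]
step = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
flat_score = edge_density(flat)
step_score = edge_density(step)
```

Here the uniform image scores zero, while the step image marks the two pixel columns adjacent to the luminance step as edges.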
Subband Entropy: The Subband Entropy model begins by computing steerable pyramids simoncelli1995steerable at orientations across each channel of the input image in CIELab color space. Once each subband is collected for each channel, the entropy of each oriented pyramid is computed pixelwise and the entropies are averaged separately. Thus, Subband Entropy aims to measure the entropy of each spatial frequency and oriented filter response of an image.
Scale Invariance: The Scale Invariant Clutter Model proposed by Bravo and Farid bravo2008scale uses graph-based segmentation felzenszwalb2004efficient at multiple scales. A scale invariant clutter representation is given by the power law coefficient that matches the decay of the number of regions with the adjusted scale parameter.
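The power-law fit can be sketched as a least-squares line in log-log space. This is only an illustration under assumptions: the region counts below are synthetic rather than the output of a real segmentation, the function name `fit_power_law` is hypothetical, and the text does not specify which fitted quantity serves as the clutter measure, so both are returned.

```python
import math

def fit_power_law(scales, region_counts):
    """Fit count ~ prefactor * scale**exponent by least squares in log-log space."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(c) for c in region_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    exponent = sxy / sxx                      # slope of the log-log line
    prefactor = math.exp(my - exponent * mx)  # intercept mapped back from log space
    return exponent, prefactor

# Synthetic counts that decay exactly as a power law of the scale parameter.
scales = [1, 2, 4, 8, 16]
counts = [1000 * s ** -1.5 for s in scales]
exponent, prefactor = fit_power_law(scales, counts)
```

On these synthetic counts the fit recovers the exponent and prefactor used to generate them, up to floating-point error.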
ProtoObject Segmentation: ProtoObject Segmentation proposes an unsupervised metric for clutter scoring yu2013modeling ; yu2014modeling . The model begins by converting the image into HSV color space, and then proceeds to segment the image through superpixel segmentation liu2011entropy ; levinshtein2009turbopixels ; achanta2010slic . After segmentation, mean-shift fukunaga1975estimation is applied on all cluster (superpixel) medians to calculate the final amount of representative colors present in the image. Next, superpixels are merged with one another contingent on them being adjacent, and being assigned to the same mean-shift HSV cluster. The final score is a ratio between initial number of superpixels and final number of superpixels.
Crowding Model: The Crowding Model developed by van den Berg et al. van2009crowding is the only model to have used losses in the periphery due to crowding as a clutter metric. It decomposes the image into 3 different scales in CIELab color space. It then produces 6 different orientation maps for each scale from the luminance channel; a contrast map is also obtained by a difference of Gaussians on the same channel. All feature maps are then pooled with Gaussian kernels that grow linearly with eccentricity. The KL-divergence between the pre- and post-pooling feature maps is then computed to obtain information loss coefficients, and all coefficients are averaged together to produce a final clutter score. We will discuss the differences between this model and ours in the Discussion (Section 5).
Texture Tiling Model: The Texture Tiling Model (TTM) is a recent perceptual model that accounts for losses in the periphery rosenholtz2012summary ; keshvari2016pooling through psychophysical experiments modelling visual search eckstein2011visual : feature search, conjunction search, configuration search and asymmetric search. In essence, the Mongrels proposed by Rosenholtz et al. that simulate peripheral losses are very similar to the Metamers proposed by Freeman & Simoncelli freeman2011metamers . We do not include comparisons to the TTM model since it requires additional psychophysics on the Mongrel versions of the images.
2.2 Clutter Metrics
Global Clutter Score: The most basic clutter metric used in clutter research is the original clutter score that every model computes over the entire image. Edge Density and ProtoObject Segmentation output a ratio, while Subband Entropy and Feature Congestion output a score. However, Feature Congestion is the only model that outputs a dense pixelwise clutter map before computing a global score (Fig. 1). Thus, we use Feature Congestion clutter maps for our foveated clutter model.
Clutter ROI: The second most used clutter metric is ROI (Region of Interest)-based, as shown in the work of Asher et al. asher2013regional . This metric is of interest when an observer is engaged in target search, rather than making a clutter judgement (e.g., “rate the clutter of the following scenes”).
2.3 Clutter Evaluations
Human Clutter Judgements: Multiple studies of clutter correlate their metrics with rankings/ratings of clutter provided by human participants yu2014modeling ; oliva2004identifying ; van2009crowding . Ideally, if clutter model A is better than clutter model B, then the correlation of model scores and human rankings/ratings should be higher for model A than for model B.
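A comparison of this kind can be sketched with a rank correlation. The sketch below uses Spearman's coefficient as one reasonable choice (the cited studies may use other statistics), and the human ratings and model scores are made-up values for illustration only.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical human clutter rankings and two hypothetical model outputs.
human = [1, 2, 3, 4, 5]
model_a = [0.1, 0.3, 0.2, 0.8, 0.9]
model_b = [0.9, 0.1, 0.5, 0.2, 0.4]
rho_a = spearman(human, model_a)
rho_b = spearman(human, model_b)
```

With these made-up values, model A tracks the human ranking more closely than model B, which is the sense in which one clutter model is said to be better than another.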
Response Time: Highly cluttered images will require more time for target search, hence more time to arrive at a target present/absent decision. Under this assumption, a high correlation between response time and clutter score is a good sign for a clutter model rosenholtz2007measuring ; bravo2008scale ; van2009crowding ; asher2013regional ; henderson2009influence .
Target Detection (Hit Rate, False Alarms, Performance): In general, when engaging in target search for a fixed amount of time across all trial conditions, an observer will have a lower hit rate and higher false alarm rate for a highly cluttered image than an uncluttered image. rosenholtz2007measuring ; asher2013regional ; henderson2009influence
3 Methods & Experiments
3.1 Experiment 1: Forced Fixation Search
A total of 13 subjects participated in a Forced Fixation Search experiment where the goal was to detect a target in the subject’s periphery and report whether a target (person) was present or absent. Participants had variable amounts of time (100, 200, 400, 900, 1600 ms) to view each clip, which was presented in a random order at one of several eccentricities that the subjects were not aware of (, , , ). They were then prompted with a Target Detection rating scale on which they reported, by clicking on a number from 1-10, how confident they were of having detected the target. Participants had unlimited time to make their judgements, and they did not take more than 10 seconds per judgement. There was no response feedback after each trial. Trials were aborted when subjects broke fixation outside of a radius around the fixation cross.
Each subject did 12 sessions that consisted of 360 unique images. Every session also presented the images with aerial viewpoints from different vantage points (Example: session 1 had the target at 12 o’clock, while session 2 had the target at 3 o’clock). To control for any fixational biases, all subjects had a unique fixation point for every trial for the same eccentricity values. All images were rendered with variable levels of clutter. Each session took about an hour to complete. The target was of size , , , depending on zoom level.
For our analysis, we only used the low zoom and 100 ms time condition since there were fewer ceiling effects across all eccentricities.
Stimuli Creation: A total of 273 videos were created each with a total duration of 120 seconds, where a ‘birds eye’ point-of-view camera rotated slowly around the center. While the video was in rotating motion, there was no relative motion between any parts of the video. From the original videos, a total of different clips were created. Half of the clips were target present, while the other half were target absent. These short and slowly rotating clips were used instead of still images in our experiment, to simulate slow real movement from a pilot point of view. All clips were shown to participants in random order.
Apparatus: An EyeLink 1000 system (SR Research) was used to collect Eye Tracking data at a frequency of 1000Hz. Each participant was at a distance of 76 cm from a LCD screen on gamma display, so that each pixel subtended a visual angle of . All video clips were rendered at pixels and a frame rate of 24fps. Eye movements with velocity over and acceleration over were qualified as saccades. Every trial began with a fixation cross, where each subject had to fixate the cross with a tolerance of .
4 Foveated Feature Congestion
A regular Feature Congestion clutter score is computed by taking the mean of the Feature Congestion map of the image or of a target ROI henderson2009influence . We propose a Foveated Feature Congestion (FFC) model that outputs a score which takes into account two main terms: 1) a regular Feature Congestion (FC) score and 2) a Peripheral Integration Feature Congestion (PIFC) coefficient that accounts for the lower spatial resolution of the visual periphery, which is detrimental to target detection. The first term is independent of fixation, while the second term will act as a non-linear gain that will either reduce or amplify the clutter score depending on fixation distance from the target.
In this Section we will explain how to compute a PIFC, which will require creating a human-like peripheral architecture as explained in Section 4.1. We then present our Foveated Feature Congestion (FFC) clutter model in Section 4.2. Finally, we conclude with a quantitative evaluation of the FFC (Section 4.3) in its ability to predict variations of target detectability across images and retinal eccentricity of the target.
4.1 Creating a Peripheral Architecture
We used the Piranhas Toolkit Deza2016piranhas to create a Freeman and Simoncelli freeman2011metamers peripheral architecture. This biologically inspired model has been tested and used to model V1 and V2 responses in human and non-human primates with high precision for a variety of tasks portilla2000parametric ; freeman2013functional ; movshon2014representation ; akbas2014object . It is described by a set of pooling (linear) regions that increase in size with retinal eccentricity. Each pooling region is separable with respect to polar angle and log eccentricity , as described in Eq. 2 and Eq. 3 respectively. These functions are multiplied for every angle and eccentricity and are plotted in log polar coordinates to create the peripheral architecture as seen in Fig. 3.
The parameters we used match a V1 architecture with a scale of , a visual radius of , a fovea of , with (we remove regions with a radius smaller than the foveal radius, since there is no pooling in the fovea), and . The scale defines the number of eccentricities , as well as the number of polar pooling regions from .
Although observers saw the original stimuli at /pixel, with image size , for modelling purposes we rescaled all images to half their size so the peripheral architecture could fit all images under any fixation point. To preserve stimulus size in degrees after rescaling our images, our foveal model used an input value of /pixel (twice the value of the experimental settings). Resizing the image to half its size also reduces the CPU time and memory consumed by the peripheral architecture.
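The rescaling argument is simple arithmetic: halving the number of pixels while doubling the degrees per pixel leaves the stimulus size in degrees unchanged. The numbers below are hypothetical placeholders, not the experimental settings.

```python
def size_in_degrees(n_pixels, deg_per_pixel):
    """Visual angle spanned by `n_pixels` at a given resolution."""
    return n_pixels * deg_per_pixel

# Hypothetical values for illustration only (not the values used in the experiment).
width_px, deg_per_px = 1000, 0.02
full_size = size_in_degrees(width_px, deg_per_px)
half_size = size_in_degrees(width_px // 2, 2 * deg_per_px)
```

Both calls describe the same extent in degrees, which is why the model's degrees-per-pixel input is doubled after downsampling.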
4.2 Creating a Foveated Feature Congestion Model
Intuitively, a foveated clutter model that takes into account target search should score very low when the target is in the fovea (near zero), and very high when the target is in the periphery. Thus, an observer should find a target without difficulty, achieving a near perfect hit rate in the fovea, yet the observer should have a lower hit rate in the periphery given crowding effects. Note that in the periphery, not only should it be harder to detect a target, but the observer is also likely to confuse the target with another object or region similar in shape, size, texture and/or pixel value (false alarms). Under this assumption, we wish to modulate a clutter score (Feature Congestion) by a multiplicative factor, given the target and fixation location. We call this multiplicative term the PIFC coefficient, which is defined over a ROI around the location of target . The target itself was removed when processing the clutter maps since it indirectly contributes to the ROI clutter score asher2013regional . The PIFC aims at quantifying the information loss around the target region due to peripheral processing.
To compute the PIFC, we use the aforementioned ROI and calculate a mean difference between the foveated clutter map and the original non-foveated clutter map. If the target is foveated, there should be little to no difference between the foveated map and the original map, thus setting the PIFC coefficient value to near zero. However, as the target moves farther away from the fovea, the PIFC coefficient should be higher given pooling effects in the periphery. To create a foveated map, we use Feature Congestion and apply max pooling on each pooling region after the peripheral architecture has been stacked on top of the Feature Congestion map. Note that the FFC map values will depend on the fixation location as shown in Fig. 4. The PIFC map is the result of subtracting the foveated map from the unfoveated map in the ROI, and the score is a mean distance value between these two maps (we use the L1-norm, L2-norm or KL-divergence). Computational details can be seen in Algorithm 1. Thus, we can summarize our model in Eq. 4:
where is the Feature Congestion score rosenholtz2007measuring of image which is computed by the mean of the Feature Congestion map , and is the Foveated Feature Congestion score of the image , depending on the point of fixation and the location of the target .
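The pipeline described in this section can be sketched in plain Python under several loudly stated assumptions. First, the smooth, overlapping Freeman and Simoncelli windows are replaced by hard log-polar bins, which is a simplification. Second, the parameter values (`fovea_radius`, `n_angles`, `growth`) and all function names are hypothetical. Third, the combination rule `fc * (1 + pifc)` is only an assumed multiplicative form, since Eq. 4 is not reproduced in this text. Target removal from the clutter map is also omitted.

```python
import math

def region_label(x, y, fix, fovea_radius, n_angles, growth):
    """Hard log-polar bin (ring, wedge) for a pixel, or None inside the fovea."""
    dx, dy = x - fix[0], y - fix[1]
    r = math.hypot(dx, dy)
    if r < fovea_radius:
        return None  # no pooling in the fovea
    ring = int(math.log(r / fovea_radius) / math.log(growth))
    wedge = int((math.atan2(dy, dx) + math.pi) / (2 * math.pi) * n_angles) % n_angles
    return ring, wedge

def foveate(cmap, fix, fovea_radius=4.0, n_angles=8, growth=1.5):
    """Max-pool the clutter map within each peripheral pooling region."""
    h, w = len(cmap), len(cmap[0])
    labels = [[region_label(x, y, fix, fovea_radius, n_angles, growth)
               for x in range(w)] for y in range(h)]
    region_max = {}
    for y in range(h):
        for x in range(w):
            key = labels[y][x]
            if key is not None:
                region_max[key] = max(region_max.get(key, cmap[y][x]), cmap[y][x])
    return [[cmap[y][x] if labels[y][x] is None else region_max[labels[y][x]]
             for x in range(w)] for y in range(h)]

def pifc(cmap, fov, roi, norm="l1"):
    """Mean distance between foveated and original maps in roi=(x0, y0, x1, y1)."""
    x0, y0, x1, y1 = roi
    diffs = [fov[y][x] - cmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if norm == "l1":
        return sum(abs(d) for d in diffs) / len(diffs)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))  # L2 variant

def ffc(cmap, fix, roi, norm="l1"):
    """Foveated score for a given fixation and target ROI."""
    h, w = len(cmap), len(cmap[0])
    fc = sum(map(sum, cmap)) / (h * w)  # regular score: mean of the clutter map
    # ASSUMPTION: multiplicative form FC * (1 + PIFC); not taken from the source.
    return fc * (1.0 + pifc(cmap, foveate(cmap, fix), roi, norm))

# Synthetic clutter map with a target ROI at the image centre.
cmap = [[((7 * x + 13 * y) % 10) / 10 for x in range(32)] for y in range(32)]
roi = (14, 14, 18, 18)
fc = sum(map(sum, cmap)) / (32 * 32)
near = ffc(cmap, fix=(16, 16), roi=roi)  # target foveated
far = ffc(cmap, fix=(2, 16), roi=roi)    # target in the periphery
```

In this toy setting the score equals the regular mean when the ROI lies inside the fovea, and grows once the ROI falls into peripheral pooling regions, which mirrors the intended behaviour of the PIFC coefficient.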
4.3 Foveated Feature Congestion Evaluation
A visualization of each image and its respective Hit Rate vs Clutter Score across both foveated and unfoveated models is shown in Fig. 5. Qualitatively, it shows the importance of a PIFC weighting term to the total image clutter score when performing our forced fixation search experiment. Furthermore, a quantitative bootstrap correlation analysis comparing classic metrics (Image, Target, ROI) against foveated metrics (FFC, FFC and FFC) shows that correlations between hit rate and clutter score are stronger for the foveated models with a PIFC: Image: , Target: , ROI: , FFC (L1-norm): , FFC (L2-norm): , FFC (KL-divergence): .
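A bootstrap correlation analysis of this kind can be sketched as follows. The resampling unit (images), the number of resamples, the percentile interval and the synthetic data are all assumptions of this sketch; the paper's exact procedure is not described in this text.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def bootstrap_correlation(scores, hit_rates, n_boot=2000, seed=0):
    """Mean and 95% percentile interval of the correlation over resamples."""
    rng = random.Random(seed)
    n = len(scores)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample images with replacement
        xs = [scores[i] for i in idx]
        ys = [hit_rates[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:   # skip zero-variance resamples
            samples.append(pearson(xs, ys))
    samples.sort()
    mean = sum(samples) / len(samples)
    lo = samples[int(0.025 * len(samples))]
    hi = samples[min(int(0.975 * len(samples)), len(samples) - 1)]
    return mean, lo, hi

# Synthetic data: hit rate falls as the clutter score rises.
scores = [0.5 * i for i in range(20)]
hit_rates = [0.95 - 0.04 * i + 0.02 * ((7 * i) % 5 - 2) for i in range(20)]
mean_r, lo_r, hi_r = bootstrap_correlation(scores, hit_rates)
```

A more negative correlation between clutter score and hit rate indicates a metric that better tracks the drop in detection performance.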
Notice that the choice of L1-norm, L2-norm or KL-divergence distance makes no difference to each model's correlation with hit rate. Table 1 (Supp. Mat.) also shows the highest correlation with a ROI window across all metrics. Note that the same analysis cannot be applied to false alarms, since it is impossible to distinguish a false alarm at from (the target is not present, so there is no real eccentricity away from fixation). However, as mentioned in the Methods section, the fixation location for each target absent trial in the experiment was placed assuming a location from its matching target present image. It is important that target present and absent fixations have the same distributions for each eccentricity.
In general, images that have low Feature Congestion have less gain in PIFC coefficients as eccentricity increases, while images with high clutter have higher gain in PIFC coefficients. Consequently, the difference in FFC between different images increases nonlinearly with eccentricity, as observed in Fig. 6. This is our main contribution, as these differences in clutter score as a function of eccentricity do not exist for regular Feature Congestion, and these differences in scores should be able to correlate with human performance in target detection.
Our model also differs from the van den Berg et al. van2009crowding model in two ways: we use a biologically inspired peripheral architecture with log polar regions that provide anisotropic pooling levi2011visual rather than isotropic Gaussian pooling as a linear function of eccentricity van2009crowding ; and we use region-based max pooling for each final feature map instead of pixel-based mean pooling (Gaussians) at each scale, which allows for stronger differences. This final difference also makes our model computationally more efficient, running at 700 ms per image vs 180 s per image for the Crowding model ( speed up). A home-brewed Crowding Model applied to our forced fixation experiment resulted in a correlation of , equivalent to using a non-foveated metric such as regular Feature Congestion .
We finally extended our model to create foveated (FoV) versions of Edge Density (ED) oliva2004identifying , Subband Entropy (SE) simoncelli1995steerable ; rosenholtz2007measuring and ProtoObject Segmentation (PS) yu2014modeling , showing that correlations for all foveated versions are stronger than for non-foveated versions on the same task: , , , , , but . Note that the highest foveated correlation is FC: , despite under an L1-norm loss of the PIFC. Feature Congestion has a dense representation, is more bio-inspired than the other models, and outperforms them in the periphery. See Figure 7. An overview of creating dense and foveated versions of the previously mentioned models can be seen in the Supp. Material.
In this paper we have introduced a peripheral architecture that shows the detrimental effects of different eccentricities on target detection, and that helps us model clutter for forced fixation experiments. We introduced a forced fixation experimental design for clutter research; we defined a biologically inspired peripheral architecture that pools features in V1; and we stacked this peripheral architecture on top of a Feature Congestion map to create a Foveated Feature Congestion (FFC) model, a pipeline that we also extended to other clutter models. We showed that the FFC model better explains loss in target detection performance as a function of eccentricity through the introduction of the Peripheral Integration Feature Congestion (PIFC) coefficient, which varies nonlinearly.
We would like to thank Miguel Lago and Aditya Jonnalagadda for useful proof-reads and revisions, as well as Mordechai Juni, N.C. Puneeth, and Emre Akbas for useful suggestions. This work was supported by the Institute for Collaborative Biotechnologies through grant 2 W911NF-09-0001 from the U.S. Army Research Office.
- (1) R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. Slic superpixels. Technical report, 2010.
- (2) E. Akbas and M. P. Eckstein. Object detection through exploration with a foveated visual field. arXiv preprint arXiv:1408.0814, 2014.
- (3) M. F. Asher, D. J. Tolhurst, T. Troscianko, and I. D. Gilchrist. Regional effects of clutter on human target detection performance. Journal of vision, 13(5):25–25, 2013.
- (4) M. J. Bravo and H. Farid. A scale invariant measure of clutter. Journal of Vision, 8(1):23–23, 2008.
- (5) P. J. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. Communications, IEEE Transactions on, 31(4):532–540, 1983.
- (6) A. Deza, E. Akbas, and M. P. Eckstein. Piranhas toolkit: Peripheral architectures for natural, hybrid and artificial systems.
- (7) M. P. Eckstein. Visual search: A retrospective. Journal of Vision, 11(5):14–14, 2011.
- (8) P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
- (9) J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature neuroscience, 14(9):1195–1201, 2011.
- (10) J. Freeman, C. M. Ziemba, D. J. Heeger, E. P. Simoncelli, and J. A. Movshon. A functional and perceptual signature of the second visual area in primates. Nature neuroscience, 16(7):974–981, 2013.
- (11) K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32–40, 1975.
- (12) J. M. Henderson, M. Chanceaux, and T. J. Smith. The influence of clutter on real-world scene search: Evidence from search efficiency and eye movements. Journal of Vision, 9(1):32–32, 2009.
- (13) S. Keshvari and R. Rosenholtz. Pooling of continuous features provides a unifying account of crowding. Journal of Vision, 16(3):39, 2016.
- (14) M. S. Landy and J. R. Bergen. Texture segregation and orientation gradient. Vision research, 31(4):679–691, 1991.
- (15) D. M. Levi. Visual crowding. Current Biology, 21(18):R678–R679, 2011.
- (16) A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi. Turbopixels: Fast superpixels using geometric flows. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(12):2290–2297, 2009.
- (17) M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2097–2104. IEEE, 2011.
- (18) J. A. Movshon and E. P. Simoncelli. Representation of naturalistic image structure in the primate visual cortex. In Cold Spring Harbor symposia on quantitative biology, volume 79, pages 115–122. Cold Spring Harbor Laboratory Press, 2014.
- (19) A. Oliva, M. L. Mack, M. Shrestha, and A. Peeper. Identifying the perceptual dimensions of visual complexity of scenes.
- (20) J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.
- (21) R. Rosenholtz, J. Huang, A. Raj, B. J. Balas, and L. Ilie. A summary statistic representation in peripheral vision explains visual search. Journal of vision, 12(4):14–14, 2012.
- (22) R. Rosenholtz, Y. Li, J. Mansfield, and Z. Jin. Feature congestion: a measure of display clutter. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 761–770. ACM, 2005.
- (23) R. Rosenholtz, Y. Li, and L. Nakano. Measuring visual clutter. Journal of vision, 7(2):17–17, 2007.
- (24) E. P. Simoncelli and W. T. Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In icip, page 3444. IEEE, 1995.
- (25) R. van den Berg, F. W. Cornelissen, and J. B. Roerdink. A crowding model of visual clutter. Journal of Vision, 9(4):24–24, 2009.
- (26) C.-P. Yu, W.-Y. Hua, D. Samaras, and G. Zelinsky. Modeling clutter perception using parametric proto-object partitioning. In Advances in Neural Information Processing Systems, pages 118–126, 2013.
- (27) C.-P. Yu, D. Samaras, and G. J. Zelinsky. Modeling visual clutter perception using proto-object segmentation. Journal of vision, 14(7):4–4, 2014.
Beyond Foveated Feature Congestion
We extended other clutter models to their respective peripheral versions. Since the other models (Edge Density, Subband Entropy and ProtoObject Segmentation) have not been designed to produce an intermediate step with a dense pixel-wise clutter representation (unlike Feature Congestion), it is hard to find respective optimal dense clutter representations without losing the essence of each model. For Edge Density, we compute the magnitude of the image gradient after grayscale conversion. For Subband Entropy, we decided to keep all the respective subbands, as the model proposes, as well as the coefficients that are used to compute a weighted sum over the entropies. In other words, our dense version of Subband Entropy is more of a dense “Subband Energy” term, since computing entropy over a vector from a small vector space of scales and orientations left very little room for variation. Finally, dense ProtoObject Segmentation was computed by following the intuition of the final number of superpixels over the initial number of superpixels. Since this is not applicable at a pixel-wise level, we decided to compute multiple ProtoObject Segmentations with different regularizer and superpixel radius parameters, and averaged all superpixel segmentation ratios. Every map was dense at a superpixel level, and each superpixel score was the initial number of pixels over the final number of pixels that belong to that superpixel after the mean-shift merging stage in HSV color space.
We believe that future work can be tailored towards improving dense versions of each clutter model, as well as creating new dense clutter models, that can easily be stacked with a peripheral architecture.
[Figure panels: Hit Rate correlation for Foveated Feature Congestion, Foveated Edge Density, Foveated Subband Entropy and Foveated ProtoObject Segmentation.]