Line Drawings of Natural Scenes Guide Visual Attention

12/19/2019 · Kai-Fu Yang, et al.

Visual search is an important strategy of the human visual system for fast scene perception. The guided search theory suggests that the global layout and other top-down sources of a scene play a crucial role in guiding object search. To verify the specific roles of scene layout and regional cues in guiding visual attention, we conducted a psychophysical experiment to record human fixations on line drawings of natural scenes with an eye-tracking system. We collected the fixations of ten subjects viewing 498 natural images and of another ten subjects viewing the corresponding 996 human-marked line drawings of boundaries (two boundary maps per image) under a free-viewing condition. The experimental results show that, even in the absence of basic features such as color and luminance, the distribution of fixations on the line drawings is highly correlated with that on the natural images. Moreover, compared with the basic cues of regions, subjects pay more attention to the closed regions of line drawings, which are usually related to the dominant objects of the scenes. Finally, we built a computational model to demonstrate that the fixation information on the line drawings can be used to significantly improve the performance of classical bottom-up models for fixation prediction in natural scenes. These results suggest that Gestalt features and scene layout are important cues for guiding fast visual object search.







I Introduction

Attention via visual search is necessary for rapid scene perception and object search because information processing in the visual system is limited to one or a few targets or regions at a time [56]. From an engineering viewpoint, modelling visual search usually facilitates subsequent higher-level visual processing tasks such as image re-targeting [22], image compression [12], and object recognition [50]. However, how visual attention is guided in real visual search tasks remains an interesting and not fully solved question in visual neuroscience.

Studies in physiology and psychophysics have revealed that several sources of factors play important roles in guiding visual attention. The first is bottom-up guidance. The classical Feature Integration Theory (FIT) suggests that visual attention is driven by low-level visual cues such as local luminance, color, and orientation [53, 32]. In a bottom-up manner, these cues are processed in parallel channels and combined to guide visual attention. This stimulus-driven guidance causes local regions with feature differences to attract more attention than others. Moreover, studies of computational models based on the FIT (e.g., IT [27, 28]) also show that models employing local cues usually predict attention limited to regions with high local contrast (e.g., object boundaries).

On the other hand, visual attention is also guided by task-related prior knowledge beyond low-level visual cues [60]. Imagine that we are searching for an object in a complex scene: our attention is guided by the current task and directed to objects with the known features of the desired targets. Previous studies have revealed that the feedback pathways from higher-level cortical areas carry rich and varying information about the behavioural context, including visual attention, memory, expectation, etc. [21]. For example, the classical study by Yarbus [66] shows that human fixation distribution depends on the question asked while viewing a picture. In addition, user- or task-driven guidance is also widely used in visual processing models in computer vision [40, 30, 6]. Recent works [17, 58] show that human attention is guided by knowledge of the plausible size of target objects. This has been suggested to be a useful brain strategy for rapidly discounting distracters (e.g., objects with atypical size relative to their surroundings) during visual search, a strategy neglected in current machine learning systems.


Besides stimulus-driven and task-driven guidance, contextual guidance is another important source for guiding general visual search tasks [52]. According to the Guided Search Theory (GST) [57], scene context and global statistical information are important for guiding visual object search. This indicates that rapid global scene analysis facilitates the prediction of the locations of potential objects. The GST suggests that rapid visual search involves two parallel pathways [56]: (1) the non-selective pathway, which serves to extract spatial layout (or gist) information rapidly from the entire scene; and (2) the selective pathway, which extracts and binds low-level features under the guidance of the contextual scene information extracted by the non-selective pathway. This two-pathway strategy provides a unified framework that integrates scene context and local information for rapid visual search. Recently, we built a computational model based on GST to effectively execute the general task of salient structure detection in natural scenes [64], which further supports the efficiency of context guidance in rapid visual search in a task-independent manner. However, it has been revealed that such context guidance also employs prior knowledge or experience in our memory and guides low-level feature integration in a top-down manner, similar to task-driven guidance [18, 3].

According to the source of contextual information, we can broadly classify contextual guidance into two categories: object-based and scene-based guidance. Nuthmann et al. found a preferred viewing location close to the center of objects within natural scenes under different task instructions, which suggests that visual attentional selection in scenes is object-based [41]. A related biologically plausible model was proposed to attend to proto-objects in natural scenes [55]. In addition, object-based guidance also indicates that meaningful visual objects usually obey certain principles of shape. For example, the well-known Gestalt theory [33] summarizes universal principles of visual perception (e.g., closure and symmetry), which are crucial factors for facilitating visual object search. Based on this object-based prior, Kootstra et al. employed the feature of symmetry for fixation prediction and object segmentation in complex scenes [35, 34]. Yu et al. used Gestalt grouping cues to select salient regions from over-segmented scenes [67].

As for the scene-based guidance, previous studies have shown that structural scene cues, such as layout, horizontal line, openness, depth, etc., usually guide the visual searching [43, 49]. Beyond the geometry of the scenes, visual attention is also guided by the semantic information of the real-world scenes, such as the gist of the scenes [43], scene-object relations, and object-object relations [59, 4]. Therefore, clear understanding of the neural basis of scene perception, object perception, perceptual learning and memory will also facilitate the understanding of visual searching [45].

Moreover, scene structure could provide specific scene-related information for specific visual tasks. Recent studies have addressed the role of context in driving [14, 13]. By analysing the fixation data of 40 non-drivers and experienced drivers when viewing 100 traffic images, Deng et al. showed that drivers’ attention was mostly concentrated on the vanishing points of the roads [14, 13]. In addition, their models further support that vanishing point information is more important than other bottom-up cues in guiding drivers’ attention during driving. At the same time, Borji et al. also found that vanishing points can guide the attention in free-viewing and visual searching tasks [8].

On the other hand, physiological evidence has shown that the global contours delineating the outlines of visual objects are responded to quite early by neurons in higher visual areas, which can provide sufficient feedback modulation to enhance object-related responses at lower visual levels [48, 11]. Therefore, in this paper, we focus on the role of line drawings in visual guidance, which is closely related to important contextual guidance cues such as regional shapes and scene layout. Specifically, we explore the role of line drawings in visual search with a psychophysical experiment. We first collected human fixations on 498 natural images and the corresponding 996 human-marked line drawings (two line drawings for each image) in a free-viewing manner. The experimental results show that, in the absence of basic features such as color and luminance, the distribution of fixations on the line drawings is highly correlated with that on the natural images. Moreover, subjects pay more attention to the closed regions of line drawings, which are usually related to the dominant objects in the scenes. We also built a computational model to demonstrate that the information in line drawings can be used to significantly improve the performance of some classical bottom-up models for fixation prediction in natural scenes.

II Material and Methods

II-A Stimuli and Observers

We employed 498 natural images from the Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) [1] and collected the eye fixations of 10 subjects viewing the natural images in a free-viewing manner. This eye-tracking experiment was conducted in our previous work [63]. In this work, with the help of the human-marked segmentations and the contour maps (i.e., the line drawings) provided by BSDS500 [1], we further collected the fixations of another 10 subjects freely viewing the line drawings.

In this work, we chose two contour maps as visual stimuli from the 5-10 human-marked contour maps available for each natural scene in BSDS500: the one containing the most contour pixels (the detailed line drawing) and the one containing the least contour pixels (the sparse line drawing). This choice helps demonstrate the consistent role of contours in guiding visual search regardless of whether detailed parts are included in the contour maps. Moreover, to remain consistent with our previous experiment on the natural scenes [63], each image was resized to 1024×768 pixels by adding gray margins around it while maintaining the aspect ratio. The observers included 5 males and 5 females with a mean age of 23.8 (range 22-25) years. The selected observers had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and had not previously seen the stimuli. This study was carried out in accordance with the recommendations of the Guideline for Human Behaviour Studies of the Institutional Review Board of UESTC MRI Research Center for Brain Research. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Institutional Review Board of UESTC MRI Research Center for Brain Research.

Fig. 1:

The flowchart of the visual attention model guided by line drawings.

II-B Procedure

The procedure was the same as in our previous experiment on the corresponding natural scenes [63]. Ten subjects were asked to watch the images displayed on the screen in a free-viewing manner. Each image was presented for 5 seconds after a cross ("+") cue in the center of the gray screen. An Eyelink-2000 eye-tracking system was used to collect the subjects' fixations at a sampling rate of 1000 Hz, and eye-tracking re-calibration was performed every 30 images. A chin rest was used to stabilize the head.

The experiment was designed and programmed with the Psychtoolbox for MATLAB. The images were presented on the display in random order, and the contour maps of the same natural image were separated by at least 30 trials. The observers were instructed to simply watch and enjoy the pictures (i.e., a free-viewing task). Finally, we obtained the fixations of 10 subjects on the line drawings of 498 natural images from the BSDS500 dataset [1]. Note that for two images of BSDS500 we failed to collect reasonable fixation data because the objects in those two images are too small.

II-C Model-based analysis

In the field of computer vision, numerous saliency detection methods have been proposed, some of which show promising performance, especially when deep learning is used to predict human fixations [5]. Meanwhile, researchers in neuroscience continue to work toward understanding how humans execute efficient visual search, and various biologically inspired models have been proposed for fixation prediction [27, 28]. These methods usually have bottom-up architectures and respond to regions with high local contrast in various feature channels.

Our experimental results suggest that closed regions are more likely to form meaningful objects and may attract most of our visual fixations (see Results). Therefore, we build a simple model to extract the closed regions and generate a prior for the locations where objects are likely to be present. This prior is used to guide the generation of bottom-up saliency and improve fixation prediction. Figure 1 shows the flowchart of the guided visual attention model.

Firstly, we extract the edges from the input image with the Structured Forests method [16] and group the main edges into contours with the edge-linking method of [36]. Then, we estimate the possibility that each point in the image lies within a closed region, based on the contour map. Figure 2 shows two examples of closure computation at different points. For each point, we search for contour pixels along D equally sampled directions starting from that point, and count the number of directions in which a contour pixel is found. A point whose search rays hit contour pixels in all or most directions has a higher possibility of being located in a closed region than a point whose rays hit contour pixels in only a few directions. Computationally, the possibility of a point lying within a closed region (its Closure Degree) is obtained from this directional count.

Fig. 2: The examples of computing closure degree for each point.
Fig. 3: Examples of fixation points (green dots) on natural scenes and the corresponding line drawings. From top to bottom row: the fixations from 10 subjects when viewing the natural scenes, the detailed line drawings, and the sparse line drawings, respectively.

In addition, we further consider the radii of regions to refine the estimate of the closure degree. For a point, record the radii at which the search rays along all sampled directions first hit a contour pixel, and let μ and σ denote the mean and the standard deviation of these radii. The final closure degree of the point is then computed with Equation (2). With this definition, small regions (with low μ) have higher closure degrees than larger regions. Meanwhile, a low σ means the region is close to a circle, which attracts more attention.
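The directional search described above can be sketched as follows; the scoring formula combining the hit ratio with the radius statistics is a hypothetical stand-in for Equations (1)-(2), not the paper's exact form:

```python
import numpy as np

def closure_degree(contour, x, y, D=16):
    """Estimate the possibility that point (x, y) lies inside a closed region.

    contour: 2-D boolean array (True = contour pixel).
    D: number of equally spaced search directions.
    The combination below (hit ratio x radius statistics) is a hypothetical
    form; the paper's Equations (1)-(2) are not reproduced verbatim.
    """
    h, w = contour.shape
    radii = []
    hits = 0
    for k in range(D):
        theta = 2.0 * np.pi * k / D
        dx, dy = np.cos(theta), np.sin(theta)
        for r in range(1, max(h, w)):
            px, py = int(round(x + r * dx)), int(round(y + r * dy))
            if not (0 <= px < w and 0 <= py < h):
                break                      # ray left the image: open direction
            if contour[py, px]:
                hits += 1                  # this direction is "closed off"
                radii.append(r)
                break
    if hits == 0:
        return 0.0
    mu, sigma = float(np.mean(radii)), float(np.std(radii))
    # More closed-off directions, smaller regions (small mu) and rounder
    # regions (small sigma) all raise the score, as described in the text.
    return (hits / D) * np.exp(-sigma / (mu + 1e-6)) / (1.0 + mu / max(h, w))
```

A point enclosed by a contour scores higher than a point outside it, since more of its rays terminate on contour pixels with similar radii.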

To evaluate the contribution of contextual guidance to fixation prediction, we build a simple model that improves fixation prediction by integrating the spatial layout information into classical bottom-up methods. In detail, we employ a Gaussian mixture model (GMM) [47] to fit the closure degree map obtained with Equation (2) and keep its main components. The line-drawing-guided prior is obtained by combining the GMM components with weights, where each component is weighted by its component proportion, its ellipticity, and a distance weight. The component proportion comes from the GMM fitting. The ellipticity is computed from the long and short radii of the Gaussian component along its two main axes, so that rounder components receive greater weight. The distance weight depends on the distance from the center of the Gaussian component to the image center, so that components nearer the image center receive greater weight.

Finally, the combined saliency map is obtained by modulating a saliency map produced by a low-level method, such as IT [27], with the line-drawing-guided prior (Equation (5)). In this way we obtain a visual saliency map guided by the line drawing (closure cue).
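The prior construction and the guided combination can be sketched as follows, assuming the Gaussian components have already been fitted to the closure degree map; the exact weight and combination formulas are assumptions based on the description above:

```python
import numpy as np

def line_drawing_prior(shape, components):
    """Build the line-drawing-guided prior from fitted GMM components.

    shape: (height, width) of the image.
    components: list of dicts with 'mean' (x, y), 'cov' (2x2) and 'pi'
    (mixing proportion), e.g. from a GMM fitted to the closure degree map.
    The weights (proportion x roundness x centre proximity) follow the text,
    but their exact forms here are assumptions.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(float)   # (h, w, 2) as (x, y)
    centre = np.array([w / 2.0, h / 2.0])
    diag = float(np.hypot(w, h))
    prior = np.zeros(shape)
    for c in components:
        mean = np.asarray(c['mean'], dtype=float)
        cov = np.asarray(c['cov'], dtype=float)
        evals = np.linalg.eigvalsh(cov)                # ascending eigenvalues
        roundness = np.sqrt(evals[0] / evals[1])       # short/long axis ratio
        dist_w = 1.0 - np.linalg.norm(mean - centre) / diag
        d = grid - mean
        m = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)
        prior += c['pi'] * roundness * dist_w * np.exp(-0.5 * m)
    if prior.max() > 0:
        prior /= prior.max()
    return prior

def guided_saliency(bottom_up, prior, alpha=0.5):
    """Modulate a bottom-up saliency map with the prior (the exact
    combination rule of Equation (5) is assumed here)."""
    return (1.0 - alpha) * bottom_up + alpha * bottom_up * prior
```

In practice, a library routine such as scikit-learn's `GaussianMixture` could supply the fitted components.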

III Results

III-A Human fixation analysis

Compared with the natural scene, the line drawings of an image exclude basic low-level features such as color and luminance. However, what is the difference between the distribution of fixations on the line drawings and that on the natural image of the same scene? Figure 3 shows several examples of the fixation distribution on the natural images and the corresponding line drawings. In this figure we display all the fixation points from the 10 subjects on each visual stimulus. Note that we removed the first fixation point of each subject in order to eliminate the center bias, considering that the cross cue was presented at the center of the gray screen.

From Figure 3, we can clearly see that the fixations on the different stimuli (natural scenes and line drawings) of the same scenes are quite similar. Most of the fixation points (green dots) fall on the dominant objects in both the natural scenes and the line drawings. In particular, even without color and surface information, visual attention is attracted by the objects in the line drawings, which are defined by the shapes and the Gestalt features of local regions. Note that without the specific surface features inside the objects, the fixations on the line drawings are distributed more dispersedly than those on the natural scenes. This suggests that visual attention is attracted by the structure of scenes and the shape features of the potential objects outlined by the dominant contours in the line drawings. When we view the natural scenes, local feature contrasts further gather our attention to some salient regions of the visual objects. This also indicates that the layout and structure of scenes may guide our attention to the dominant objects for further visual processing.

For quantitative comparison, we also calculated the correlation coefficient between the distribution of fixations on the line drawings and that on the natural images. Considering that the fixation points are extremely sparse on the stimuli, Gaussian blurring at various scales was used to obtain the spatial distribution of fixations. The correlation coefficient (CC) between the blurred fixation distributions on the line drawings (F_L) and the natural scenes (F_N) is

CC = cov(F_L, F_N) / (σ_L σ_N),

where cov(F_L, F_N) is the covariance and σ_L and σ_N are the standard deviations of the two blurred fixation distributions. Figure 4 shows the correlation coefficients at varying blurring scales, reported separately for the detailed and the sparse line drawings. We can see that, compared with the line drawings containing fewer details, the line drawings containing more details provide more visual information, and the fixation distribution on the detailed line drawings is slightly closer to that on the natural images.
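This comparison can be sketched directly from the definition above (a minimal Python illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(points, shape):
    """Binary fixation map from (x, y) fixation coordinates."""
    m = np.zeros(shape)
    for x, y in points:
        m[int(y), int(x)] = 1.0
    return m

def fixation_cc(fix_a, fix_b, sigma):
    """Correlation coefficient between two Gaussian-blurred fixation maps:
    the covariance divided by the product of the standard deviations."""
    a = gaussian_filter(fix_a, sigma).ravel()
    b = gaussian_filter(fix_b, sigma).ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

Sweeping `sigma` reproduces the kind of scale-dependent CC curves reported in Figure 4.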

Figure 4 indicates a high correlation (correlation coefficients around 0.7) between the fixation distributions on the natural scenes and their line drawings. However, the fixation locations of different subjects are usually not exactly consistent with each other even when they are viewing the same objects. Therefore, we regard different subjects as attending to the same objects if their fixation points fall within different parts of the same objects. To evaluate the object-based consistency between visual attention on the natural scenes and on the line drawings, we first generated salient regions based on the human-marked segmentations provided by BSDS500 as follows. For each segment, we computed the density of human fixation points as its saliency level; thus, if a segment (or region) attracts more fixation points, it obtains a higher saliency score. For each scene, the two salient object maps generated from the two line drawings (detailed and sparse) and the corresponding fixations were linearly summed into a final map containing the hierarchical salient objects. Similarly, we obtained the hierarchical salient objects based on the human-marked segmentations and the fixations collected on the original images. Figure 5 shows several examples of hierarchical salient objects generated from the fixations on the natural images and on the line drawings.
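The per-segment scoring step can be sketched as follows (an illustrative implementation of "fixation density as saliency level"):

```python
import numpy as np

def region_saliency(segmentation, fixations):
    """Score each segment by its fixation density (fixations per pixel),
    as used to build the hierarchical salient objects.

    segmentation: 2-D integer array of region labels.
    fixations: iterable of (x, y) fixation coordinates.
    """
    sal = np.zeros(segmentation.shape, dtype=float)
    counts = {label: 0 for label in np.unique(segmentation)}
    for x, y in fixations:
        counts[segmentation[int(y), int(x)]] += 1
    for label, n in counts.items():
        mask = segmentation == label
        sal[mask] = n / mask.sum()         # density, not raw count
    return sal
```

Summing the maps produced for the two line drawings of a scene then yields the combined hierarchical salient-object map.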

Fig. 4: Correlation coefficients between the fixation distributions on the line drawings (detailed and sparse) and on the natural images, with varying scales (in pixels) of Gaussian blurring. The error bars are 95% confidence intervals.

Fig. 5: Allocating a saliency score to each region according to fixation density. The hierarchical salient objects generated from the human-marked segmentations and the fixations on the natural images are shown alongside those generated from the fixations on the contours (i.e., the line drawings). Brighter regions have higher saliency scores.

In this work, the object-based consistency of visual attention was evaluated on the hierarchical salient objects using two metrics: the correlation coefficient [7] and the Mean Absolute Error (MAE) [46, 64]. As indicated in Figure 6, with the salient objects generated from the fixations on the natural scenes as baseline, the salient objects generated from the fixations on the line drawings obtain clearly higher correlation coefficients and lower MAEs than some representative computational models (including MR [62], HS [61], and CGVS [64]). This is an expected outcome, considering that these hierarchical salient objects were marked based on fixations from human subjects. Nevertheless, the experiment indeed suggests a high consistency of visual attention when we view the natural images and the line drawings of the same scenes. Even without color and surface information, human subjects can execute efficient visual search by employing only the limited scene-structure information.

Fig. 6: Correlation coefficients and MAEs of the salient regions based on the hierarchical salient objects. The baseline is the set of salient objects generated from the fixations on the natural scenes; these are compared with the hierarchical salient objects generated from the human-marked segmentations and the fixations on the line drawings.

III-B Guidance cues

In the absence of color and luminance, how do line drawings guide our visual attention? Some studies have revealed the role of scene structure in visual guidance for specific visual scenes or tasks. For example, previous studies have shown that the vanishing point of a road plays an important role in traffic scenes, especially in the driving task [14]. However, most of the stimuli used in this experiment are task-irrelevant general images, each of which usually contains at least one dominant object. Therefore, we believe that the shape features of regions and the scene layout mainly contribute to predicting potential objects.

We tested ten common shape-related features and evaluated their correlation with the fixation distributions. These shape features, listed in Table I, were extracted from each segment of each image and reflect different characteristics of regions. For example, regions with a higher degree of Closure are more likely to correspond to meaningful objects. Perimeter Ratio captures the continuity of a region's boundary: a higher Perimeter Ratio indicates a more irregular region. It has been verified that some of these features are helpful for salient object detection [37]. In this work, we further analysed the contribution of the individual features to the specific task of visual attention.

Feature | Description | Correlations with fixations (two line-drawing sets)
Area Ratio | The ratio of the number of pixels in the region to the total pixels in the image | -0.459 / -0.483
Centralization | The mean distance of pixels in the region to the center of the image, normalized by the image size | -0.409 / -0.301
Perimeter Ratio | The ratio of the perimeter of the region to the perimeter of the image | -0.510 / -0.526
MinorAxis/MajorAxis | The length of the minor axis divided by the length of the major axis of the region | -0.072 / 0.072
Eccentricity | The eccentricity of the ellipse that has the same second moments as the region | -0.067 / -0.065
Orientation | The orientation of the major axis of the ellipse that has the same second moments as the region | 0.013 / -0.006
EquivDiameter | The ratio of the diameter of a circle with the same area as the region to the diagonal length of the image | -0.609 / -0.601
Solidity | The proportion of the pixels in the convex hull that are also in the region | -0.047 / -0.013
Extent | The ratio of pixels in the region to pixels in the total bounding box | -0.066 / -0.234
Closure | The closure score of a region as defined in Equation (7) | 0.632 / 0.736
TABLE I: Ten common shape-related features used in this study and their correlations with the fixation distributions on the two sets of line drawings.

Figure 7 plots the relations between the saliency values (defined by fixation density) and the corresponding shape cues of each segment on one set of line drawings. To show the distributions clearly, the saliency values are plotted in log scale in Figure 7. The relations between the saliency values and the shape cues on the other set of line drawings are similar. Table I also lists the correlation coefficients between the saliency values and each shape cue on the two sets of line drawings. We find that among all cues considered here, the feature of Closure contributes most to visual attention. In addition, the contributions of Closure, EquivDiameter, and Perimeter Ratio are higher than that of the well-known center bias (Centralization). In contrast, some features, such as Orientation, contribute little to object detection. This makes sense because most of the regions in the natural scenes tested here are distributed around the 0-degree and 90-degree orientations (the bottom-left panel in Figure 7), so Orientation cannot provide enough information to distinguish object from non-object regions.

Fig. 7: Correlations between various shape features and the saliency levels of each region. The number listed at the top of each panel is the linear correlation coefficient between the feature and the saliency values (in log scale).

It has been shown that closure, the most important of the considered features, plays a pivotal role in describing an object proposal [67]. Here we further evaluated the contribution of the closure feature to the specific task of visual search. Firstly, we define a score to measure the degree of closure of a region based on the human-marked segmentation. Let N_b denote the number of boundary pixels of a segment that intersect the border pixel set of the image, and N denote the total number of boundary pixels of the segment. The closure score of the contour pixels on the segment is then defined as 1 - N_b/N (Equation (7)), so that segments whose boundaries rarely touch the image border obtain scores close to 1.

We consider the set of fixation pixels and the set of contour pixels. In addition, we define a closed region as a segment with a closure score higher than 0.9, and consider the set of pixels lying on closed contours or inside the closed regions.
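A minimal sketch of this per-segment closure score, assuming it is the fraction of a segment's boundary pixels that do not lie on the image border:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def segment_closure_score(seg_mask):
    """Closure score of a segment: the fraction of its boundary pixels that
    do NOT lie on the image border (form assumed from the definition above;
    scores above 0.9 mark the segment as a closed region).
    """
    boundary = seg_mask & ~binary_erosion(seg_mask)   # one-pixel boundary ring
    border = np.zeros_like(seg_mask)
    border[0, :] = border[-1, :] = True
    border[:, 0] = border[:, -1] = True
    n_on_border = int((boundary & border).sum())
    n_total = max(int(boundary.sum()), 1)
    return 1.0 - n_on_border / n_total
```

An interior segment scores 1.0, while a segment cut off by the image frame scores much lower.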

Then, we define four metrics: (1) the percentage of fixations around the contours among all fixations; (2) the percentage of contour pixels around the fixations among all contour pixels; (3) the percentage of fixations around and inside the closed regions among the fixations around all contours; and (4) the percentage of the image area covered by the closed regions. These metrics are computed by applying a dilation operator of size d pixels to the corresponding binary maps (fixations, contours, or closed regions) and measuring the overlaps, normalized where appropriate by the image width and height in pixels.
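The first two metrics can be sketched with a square dilation (the shape of the structuring element is an assumption):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def near(mask, d):
    """Dilate a binary map with a (2d+1)x(2d+1) square structuring element."""
    return binary_dilation(mask, structure=np.ones((2 * d + 1, 2 * d + 1), bool))

def fixation_contour_metrics(fix, contour, d):
    """Metrics (1) and (2) above, under an assumed square dilation:
    (1) fraction of fixations within d pixels of a contour;
    (2) fraction of contour pixels within d pixels of a fixation."""
    frac_fix_near_contour = (fix & near(contour, d)).sum() / max(fix.sum(), 1)
    frac_contour_near_fix = (contour & near(fix, d)).sum() / max(contour.sum(), 1)
    return frac_fix_near_contour, frac_contour_near_fix
```

The closed-region metrics (3) and (4) follow the same dilation-and-overlap pattern with the closed-region map in place of the contour map.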

Figure 8 graphically illustrates the metrics with varying d on the detailed and sparse line drawings. From Figure 8 (left), the percentage of fixations around the contours reaches only about 0.5 even with a large d, which indicates that fewer than half of the fixation points are located near the contour lines and more than half lie in non-contour regions (without any local contrast). This suggests that there are higher-order visual cues beyond local contrast (contours) guiding visual attention. In addition, although contour lines attract around half of the visual fixations, the second metric shows that these fixations cluster around only about 30% of all contour pixels. This means that some parts of the contour lines play a more important role than others in guiding visual attention.

Fig. 8: The metrics under the dilation operation with varying operator size d (in pixels) on the detailed (top row) and sparse (bottom row) line drawings. Left: the contour-related metrics; Right: PoFC and PoCC. The error bars are 95% confidence intervals.

A simple supposition is that visual objects are commonly defined by the closed regions in line drawings and that visual attention usually focuses on the potential visual objects. The examples shown in Figure 4 and the results listed in Table I suggest that closed regions attract more visual fixations. To further verify this supposition, we analysed the closed-region metrics with varying d on the detailed and sparse line drawings. Figure 8 (right) indicates that around 80% of the fixations lie within or around the closed regions, although the closed regions cover only a small percentage of the area of the whole scene.

III-C Model evaluation

In this experiment, we employed several representative classical models, including the FIT-based model (IT) [27], AIM [9], SIG [26], and the graph-based model (GB) [23]. To evaluate the performance of multiple models on the task of fixation prediction, we employed the receiver operating characteristic (ROC) curve, a metric widely used in computer vision [7, 9]. We used the revised version of the ROC [29], which focuses mainly on the true positive rate, i.e., the proportion of actual fixations that are correctly identified as such, with the human fixations on the natural scenes regarded as ground truth. As shown in Figure 9, the fixation prediction performance of all the considered bottom-up models is significantly improved after integrating the guidance information from line drawings using Equation (5). For example, the IT model is a classical bottom-up model that predicts fixations on natural images by combining contrast features of color, luminance, and orientation [27]. In the absence of scene-structure features, the IT model mainly detects regions with high local contrast while missing meaningful object regions (e.g., the inner surfaces of objects). Line drawings, however, provide additional scene structure and regional shape information, which guide visual attention to the interiors of potential objects. Therefore, the closure prior from the line drawings, which represents important region information, can remarkably improve the performance of IT. Figure 10 shows several examples of saliency maps predicted by the original IT and the improved model (Guided IT).
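The evaluation can be sketched as follows; this is an illustrative reading of the fixation-based true positive rate, and the exact revised-ROC procedure of [29] may differ in detail:

```python
import numpy as np

def fixation_tpr_curve(saliency, fix_points, n_thresh=100):
    """At each saliency threshold, the true positive rate is the fraction
    of human fixations that fall on supra-threshold pixels. Names and the
    threshold scheme here are illustrative assumptions."""
    smin, smax = float(saliency.min()), float(saliency.max())
    s = (saliency - smin) / (smax - smin + 1e-12)      # normalize to [0, 1]
    fix_vals = np.array([s[int(y), int(x)] for x, y in fix_points])
    thresholds = np.linspace(0.0, 1.0, n_thresh)
    tpr = np.array([(fix_vals >= t).mean() for t in thresholds])
    return thresholds, tpr
```

Averaging the resulting curves (or their areas) over images gives a model-level comparison like Figure 9.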

Fig. 9: Performance comparison between the bottom-up models and the improved models when integrating the guided information from line drawings.
Fig. 10: Examples of saliency maps predicted by the original IT and the improved model (Guided IT).

IV Discussion

Classical theories of visual processing view the brain as a stimulus-driven device, in which visual information is extracted hierarchically along the visual pathway [53]. However, numerous recent neurophysiological and psychological studies support that a coarse-to-fine mechanism plays an important role in various visual tasks [39], such as stereoscopic depth perception [38], perception of temporal changes [24], and object recognition in context [3]. In addition, Oliva et al. showed that the use of scale information is flexible and provides whatever information the task constraints require [42]. On the other hand, the coarse-to-fine strategy has proven useful in various computer vision applications, e.g., contour detection [68] and image segmentation [65]. These studies further confirm the efficiency of the coarse-to-fine mechanism in visual information processing from the viewpoint of computational modeling.

It is widely accepted that visual processing is an active and highly selective process [18], which reflects the dynamic interaction between bottom-up feature extraction and top-down constraints [20, 56]. This means that visual information processing is not purely stimulus-evoked, but is constrained by top-down influences that strongly shape the intrinsic dynamics of cortical circuits and constantly create predictions about forthcoming information [20, 21, 25]. For example, in object recognition in context, context-based prediction makes the recognition task more efficient [3]. These studies support that contextual information, acting as coarse and global scene information, can be rapidly extracted and used to facilitate object recognition in complex scenes [3, 2].

As an important aspect of scene perception, visual attention is also considered a process that encompasses many aspects central to human visual and cognitive function [18]. Knudsen proposed an influential conceptual framework in which attention arises from the combined contribution of four distinct processes: working memory, competitive selection, top-down sensitivity control, and automatic filtering for salient stimuli [31]. Computational modeling of visual search has also demonstrated the contribution of information from different levels of perception [52].

According to the scale of features, visual information can be coarsely divided into three levels: low, middle, and high. First, at the low-level scale, classical bottom-up frameworks (e.g., those of Koch, Itti, et al. [32, 53, 27]) have shown that local contrast in various feature channels can attract visual attention. These models are usually based on local filtering and produce a stimulus-driven saliency map without taking the behavioral goals of search into consideration [28]. Second, at the region scale (mid-level), perceptual organization principles, which describe how basic visual features are organized into more coherent units [54] (e.g., Gestalt principles [33]), can guide visual search. For example, regions that match specific principles (e.g., closure, continuity, symmetry) usually indicate special visual objects that are of greater interest for human visual tasks [35, 15]. Finally, at the scene scale (high-level), scene layout and structure provide important guidance for predicting where interesting objects appear in the current scene [56, 19]. According to guided search theory, high-level scene semantics and gist provide important spatial cues for targets, which facilitate the binding of low-level features and speed up visual search [56, 43, 52, 14]. These principles are usually solidified in our memory as general knowledge learned from daily life.

In this paper, we focus on the role of line drawings in guiding visual attention. Generally, line drawings of natural scenes provide two types of structure-related information: the shape features of regions (mid-level) and the layout of the scene (high-level). Line drawings can therefore represent two types of contextual sources. On the one hand, line drawings segment an image into various perceptual regions (before the perception of specific objects) that may be used by the visual system to rapidly construct a rough sketch of the scene structure in space [44], and they also contribute to filling-in the surfaces of regions [69]. At the region level, visual object shapes usually obey some general principles that can be considered a form of general prior knowledge. For example, Gestalt principles describe some general principles of visual perception [33]. As for visual attention, region shapes that match certain Gestalt principles (e.g., closure, continuity) may attract more visual attention because they represent meaningful visual objects with higher probability. Moreover, geometrical properties of regions have been shown to contribute strongly to early global topological perception [10].

On the other hand, line drawings provide the coarse scene structure. The guided search theory proposed by Wolfe et al. [56] suggests that global scene information (including the scene's spatial layout) can be transferred rapidly to higher visual cortices via the so-called non-selective visual pathway. Line drawings can give the rough layout of surfaces in space and provide the basis for scene-based guidance of visual search. Physiological evidence shows that the global contours delineating the outlines of visual objects may be responded to quite early (perhaps via a special pathway) by neurons in higher cortical areas; although this produces only a crude signal about the position and shape of the objects, it can provide sufficient feedback modulation to enhance the contour-related responses at lower levels and suppress irrelevant background inputs [11]. In addition, coarse contours are low-frequency visual information [68], carrying coarse scene information that provides enough signal to infer the scene structure and layout [3, 2].

To summarize, visual search should be understood as a unified framework that integrates information from different sources in the brain [51]. As important guidance cues, line drawings of natural scenes provide a rapid representation of scene structure and regional shape. A coarse scene layout may not be sufficient for object identification tasks, but it can provide efficient cues for predicting where potential targets are, helping an active vision system achieve highly efficient visual search and scene perception [18].

V Conclusion

In this paper, we collected a dataset of human fixations on natural scenes and the corresponding line drawings, which is useful for analyzing the mechanisms of visual attention at the object or shape level. Our experiments reveal a high correlation between the distribution of fixations on the natural images and that on the line drawings of the same scenes. In addition, we systematically analyzed the effects of various shape-based visual cues in guiding visual attention. The results suggest that closed regions are more likely to form meaningful objects and may attract most of our visual fixations. Finally, our computational model further verifies that the information in line drawings can be used to significantly improve the fixation prediction performance of classical bottom-up methods on natural scenes.

In conclusion, we suggest that the cortical areas involved in visual attention or visual search should be decomposed not only into various parallel feature channels (as in the IT model [27]), but also into hierarchical levels comprising low-, mid-, and high-level information, with information from each level playing a different role in visual search. Our future work will therefore extend along the following directions: (1) building a computational model to predict scene layout, which will be a powerful complement to low-level visual features for visual object search; (2) further studying how scene structure information guides visual attention; and (3) integrating guidance information from various scales into a unified framework for guided visual search that combines low-, mid-, and high-level cues.


This work was supported by the Natural Science Foundation of China under Grant 61703075, the Sichuan Province Science and Technology Support Project under Grant 2017SZDZX0019, and the Guangdong Province Key Project under Grant 2018B030338001.


  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 898–916. Cited by: §II-A, §II-B.
  • [2] M. Bar, K. S. Kassam, A. S. Ghuman, J. Boshyan, A. M. Schmid, A. M. Dale, M. S. Hämäläinen, K. Marinkovic, D. L. Schacter, B. R. Rosen, et al. (2006) Top-down facilitation of visual recognition. Proceedings of the National Academy of Sciences 103 (2), pp. 449–454. Cited by: §IV, §IV.
  • [3] M. Bar (2004) Visual objects in context. Nature Reviews Neuroscience 5 (8), pp. 617–629. Cited by: §I, §IV, §IV, §IV.
  • [4] S. E. Boettcher, D. Draschkow, E. Dienhart, and M. L. Võ (2018) Anchoring visual search in scenes: assessing the role of anchor objects on eye movements during visual search. Journal of Vision 18 (13), pp. 1–13. Cited by: §I.
  • [5] A. Borji and L. Itti (2013) State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 185–207. Cited by: §II-C.
  • [6] A. Borji, D. N. Sihite, and L. Itti (2014) What/where to look next? modeling top-down visual attention in complex interactive environments. IEEE Transactions on Systems, Man, and Cybernetics: Systems 44 (5), pp. 523–538. Cited by: §I.
  • [7] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti (2013) Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 921–928. Cited by: §III-A, §III-C.
  • [8] A. Borji (2016) Vanishing point detection with convolutional neural networks. arXiv preprint arXiv:1609.00967. Cited by: §I.
  • [9] N. Bruce and J. Tsotsos (2006) Saliency based on information maximization. In Proceedings of the Advances in Neural Information Processing Systems, pp. 155–162. Cited by: §III-C.
  • [10] L. Chen (2005) The topological approach to perceptual organization. Visual Cognition 12 (4), pp. 553–637. Cited by: §IV.
  • [11] M. Chen, Y. Yan, X. Gong, C. D. Gilbert, H. Liang, and W. Li (2014) Incremental integration of global contours through interplay between visual cortical areas. Neuron 82 (3), pp. 682–694. Cited by: §I, §IV.
  • [12] C. Christopoulos, A. Skodras, and T. Ebrahimi (2000) The JPEG2000 still image coding system: an overview. IEEE Transactions on Consumer Electronics 46 (4), pp. 1103–1127. Cited by: §I.
  • [13] T. Deng, H. Yan, and Y. Li (2017) Learning to boost bottom-up fixation prediction in driving environments via random forest. IEEE Transactions on Intelligent Transportation Systems 19 (9), pp. 3059–3067. Cited by: §I.
  • [14] T. Deng, K. Yang, Y. Li, and H. Yan (2016) Where does the driver look? top-down-based saliency detection in a traffic driving environment. IEEE Transactions on Intelligent Transportation Systems 17 (7), pp. 2051–2062. Cited by: §I, §III-B, §IV.
  • [15] J. E. Dickinson, K. Haley, V. K. Bowden, and D. R. Badcock (2018) Visual search reveals a critical component to shape. Journal of Vision 18 (2), pp. 1–25. Cited by: §IV.
  • [16] P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In Proceedings of the IEEE international conference on computer vision, pp. 1841–1848. Cited by: §II-C.
  • [17] M. P. Eckstein, K. Koehler, L. E. Welbourne, and E. Akbas (2017) Humans, but not deep neural networks, often miss giant targets in scenes. Current Biology 27 (18), pp. 2827–2832. Cited by: §I.
  • [18] M. P. Eckstein (2011) Visual search: a retrospective. Journal of Vision 11 (5), pp. 1–36. Cited by: §I, §IV, §IV, §IV.
  • [19] M. P. Eckstein (2017) Probabilistic computations for attention, eye movements, and search. Annual Review of Vision Science 3, pp. 319–342. Cited by: §IV.
  • [20] A. K. Engel, P. Fries, and W. Singer (2001) Dynamic predictions: oscillations and synchrony in top–down processing. Nature Reviews Neuroscience 2 (10), pp. 704–716. Cited by: §IV.
  • [21] C. D. Gilbert and W. Li (2013) Top-down influences on visual processing. Nature Reviews Neuroscience 14 (5), pp. 350–363. Cited by: §I, §IV.
  • [22] S. Goferman, L. Zelnik-Manor, and A. Tal (2012) Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (10), pp. 1915–1926. Cited by: §I.
  • [23] J. Harel, C. Koch, and P. Perona (2006) Graph-based visual saliency. In Proceedings of the Advances in Neural Information Processing Systems, pp. 545–552. Cited by: §III-C.
  • [24] J. Hegdé (2008) Time course of visual perception: coarse-to-fine processing and beyond. Progress in Neurobiology 84 (4), pp. 405–439. Cited by: §IV.
  • [25] J. Hopf, S. J. Luck, M. Girelli, T. Hagner, G. R. Mangun, H. Scheich, and H. Heinze (2000) Neural sources of focused attention in visual search. Cerebral Cortex 10 (12), pp. 1233–1241. Cited by: §IV.
  • [26] X. Hou, J. Harel, and C. Koch (2012) Image signature: highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 194–201. Cited by: §III-C.
  • [27] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (11), pp. 1254–1259. Cited by: §I, §II-C, §II-C, §III-C, §IV, §V.
  • [28] L. Itti and C. Koch (2001) Computational modelling of visual attention. Nature Reviews Neuroscience 2 (3), pp. 194–203. Cited by: §I, §II-C, §IV.
  • [29] T. Judd, K. Ehinger, F. Durand, and A. Torralba (2009) Learning to predict where humans look. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2106–2113. Cited by: §III-C.
  • [30] C. Kanan, M. H. Tong, L. Zhang, and G. W. Cottrell (2009) SUN: top-down saliency using natural statistics. Visual Cognition 17 (6-7), pp. 979–1003. Cited by: §I.
  • [31] E. I. Knudsen (2007) Fundamental components of attention. Annual Reviews of Neuroscience 30, pp. 57–78. Cited by: §IV.
  • [32] C. Koch and S. Ullman (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of Intelligence, pp. 115–141. Cited by: §I, §IV.
  • [33] K. Koffka (1935) Principles of gestalt psychology. A Harbinger Book 20 (5), pp. 623–628. Cited by: §I, §IV, §IV.
  • [34] G. Kootstra, N. Bergstrom, and D. Kragic (2010) Using symmetry to select fixation points for segmentation. In Proceedings of International Conference on Pattern Recognition, pp. 3894–3897. Cited by: §I.
  • [35] G. Kootstra, A. Nederveen, and B. De Boer (2008) Paying attention to symmetry. In Proceedings of British Machine Vision Conference, pp. 1115–1125. Cited by: §I, §IV.
  • [36] P. Kovesi (2000) MATLAB and Octave functions for computer vision and image processing. Cited by: §II-C.
  • [37] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287. Cited by: §III-B.
  • [38] M. D. Menz and R. D. Freeman (2003) Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism. Nature Neuroscience 6 (1), pp. 59–65. Cited by: §IV.
  • [39] M. Mermillod, N. Guyader, and A. Chauvin (2005) The coarse-to-fine hypothesis revisited: evidence from neuro-computational modeling. Brain and Cognition 57 (2), pp. 151–157. Cited by: §IV.
  • [40] V. Navalpakkam and L. Itti (2005) Modeling the influence of task on attention. Vision Research 45 (2), pp. 205–231. Cited by: §I.
  • [41] A. Nuthmann and J. M. Henderson (2010) Object-based attentional selection in scene viewing. Journal of Vision 10 (8), pp. 1–19. Cited by: §I.
  • [42] A. Oliva and P. G. Schyns (1997) Coarse blobs or fine edges? evidence that information diagnosticity changes the perception of complex visual stimuli. Cognitive Psychology 34 (1), pp. 72–107. Cited by: §IV.
  • [43] A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42 (3), pp. 145–175. Cited by: §I, §IV.
  • [44] A. Oliva and A. Torralba (2006) Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research 155, pp. 23–36. Cited by: §IV.
  • [45] M. V. Peelen and S. Kastner (2014) Attention in the real world: toward understanding its neural basis. Trends in Cognitive Sciences 18 (5), pp. 242–250. Cited by: §I.
  • [46] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung (2012) Saliency filters: contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 733–740. Cited by: §III-A.
  • [47] H. Permuter, J. Francos, and I. H. Jermyn (2003) Gaussian mixture models of texture and colour for image database retrieval. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 3, pp. 569–572. Cited by: §II-C.
  • [48] J. Poort, F. Raudies, A. Wannig, V. A. Lamme, H. Neumann, and P. R. Roelfsema (2012) The role of attention in figure-ground segregation in areas V1 and V4 of the visual cortex. Neuron 75 (1), pp. 143–156. Cited by: §I.
  • [49] M. G. Ross and A. Oliva (2009) Estimating perception of scene layout properties from global image features. Journal of Vision 10 (1), pp. 1–25. Cited by: §I.
  • [50] U. Rutishauser, D. Walther, C. Koch, and P. Perona (2004) Is bottom-up attention useful for object recognition? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. II-37–II-44. Cited by: §I.
  • [51] H. H. Schütt, L. O. Rothkegel, H. A. Trukenbrod, R. Engbert, and F. A. Wichmann (2019) Disentangling bottom-up versus top-down and low-level versus high-level influences on eye movements over time. Journal of Vision 19 (3), pp. 1–23. Cited by: §IV.
  • [52] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson (2006) Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review 113 (4), pp. 766–786. Cited by: §I, §IV, §IV.
  • [53] A. M. Treisman and G. Gelade (1980) A feature-integration theory of attention. Cognitive Psychology 12 (1), pp. 97–136. Cited by: §I, §IV, §IV.
  • [54] J. Wagemans (2018) Perceptual organization. Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience 2, pp. 1–70. Cited by: §IV.
  • [55] D. Walther and C. Koch (2006) Modeling attention to salient proto-objects. Neural Networks 19 (9), pp. 1395–1407. Cited by: §I.
  • [56] J. M. Wolfe, M. L. Võ, K. K. Evans, and M. R. Greene (2011) Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences 15 (2), pp. 77–84. Cited by: §I, §I, §IV, §IV, §IV.
  • [57] J. M. Wolfe (1994) Guided search 2.0 a revised model of visual search. Psychonomic Bulletin & Review 1 (2), pp. 202–238. Cited by: §I.
  • [58] J. M. Wolfe (2017) Visual attention: size matters. Current Biology 27 (18), pp. R1002–R1003. Cited by: §I.
  • [59] C. Wu, F. A. Wick, and M. Pomplun (2014) Guidance of visual attention by semantic information in real-world scenes. Frontiers in Psychology 5 (54), pp. 1–13. Cited by: §I.
  • [60] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao (2014) Predicting human gaze beyond pixels. Journal of Vision 14 (1), pp. 1–20. Cited by: §I.
  • [61] Q. Yan, L. Xu, J. Shi, and J. Jia (2013) Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162. Cited by: §III-A.
  • [62] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 3166–3173. Cited by: §III-A.
  • [63] K. Yang, X. Gao, J. Zhao, and Y. Li (2015) Segmentation-based salient object detection. In Proceedings of Chinese Conference on Computer Vision, pp. 94–102. Cited by: §II-A, §II-A, §II-B.
  • [64] K. Yang, H. Li, C. Li, and Y. Li (2016) A unified framework for salient structure detection by contour-guided visual search. IEEE Transactions on Image Processing 25 (8), pp. 3475–3488. Cited by: §I, §III-A.
  • [65] J. Yao, M. Boben, S. Fidler, and R. Urtasun (2015) Real-time coarse-to-fine topologically preserving segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2947–2955. Cited by: §IV.
  • [66] A. L. Yarbus (1967) Eye movements during perception of complex objects. In Eye Movements and Vision, pp. 171–211. Cited by: §I.
  • [67] J. Yu, G. Xia, C. Gao, and A. Samal (2016) A computational model for object-based visual saliency: spreading attention along gestalt cues. IEEE Transactions on Multimedia 18 (2), pp. 273–286. Cited by: §I, §III-B.
  • [68] C. Zeng, Y. Li, and C. Li (2011) Center–surround interaction with adaptive inhibition: a computational model for contour detection. NeuroImage 55 (1), pp. 49–66. Cited by: §IV, §IV.
  • [69] S. Zweig, G. Zurawel, R. Shapley, and H. Slovin (2015) Representation of color surfaces in v1: edge enhancement and unfilled holes. Journal of Neuroscience 35 (35), pp. 12103–12115. Cited by: §IV.