Putting visual object recognition in context

11/17/2019 ∙ by Mengmi Zhang, et al. ∙ Harvard University

Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g. a cow in the ocean). To understand and model the role of contextual information in visual recognition, we systematically and quantitatively investigated ten critical properties of where, when, and how context modulates recognition including amount of context, context and object resolution, geometrical structure of context, context congruence, time required to incorporate contextual information, and temporal dynamics of contextual modulation. The tasks involve recognizing a target object surrounded with context in a natural image. As an essential benchmark, we first describe a series of psychophysics experiments, where we alter one aspect of context at a time, and quantify human recognition accuracy. To computationally assess performance on the same tasks, we propose a biologically inspired context aware object recognition model consisting of a two-stream architecture. The model processes visual information at the fovea and periphery in parallel, dynamically incorporates both object and contextual information, and sequentially reasons about the class label for the target object. Across a wide range of behavioral tasks, the model approximates human level performance without retraining for each task, captures the dependence of context enhancement on image properties, and provides initial steps towards integrating scene and object information for visual recognition.




1 Introduction

The tiny object on the table is probably a spoon, not an elephant. Objects do not appear in isolation. Instead, they co-vary with other objects and scene properties, their sizes and colors usually respect regularities relative to nearby elements, and objects tend to appear at stereotypical locations. Success in object recognition and detection tasks in natural images relies on the incorporation of contextual information. Deep convolutional neural networks jointly learn statistical associations between objects, image properties, and labels [12, 41, 17, 6]. Such algorithms can be tricked into mislabeling or missing an object by placing it in an unfamiliar context (Fig. 1).

Figure 1: Mis-classification of objects in unfamiliar contexts. State-of-the-art deep visual recognition networks, such as InceptionV3 [42], ResNet50 [53] and VGG16 [40], make mistakes when the context is incongruent. The top-5 labels and confidence levels by each model are shown on the right.

Here we systematically and quantitatively investigated the mechanisms by which contextual information is integrated into visual recognition. We focus on three fundamental aspects of context: [A] the interaction between object size and the amount of contextual information; [B] the geometry, resolution, and content of contextual information; [C] the temporal dynamics of contextual modulation and the interaction between bottom-up and recurrent computations during contextual modulation. By systematically measuring the effect of context in 10 human psychophysics experiments (Fig. 4, Fig. S9, S10, S11), we gain a quantitative understanding of where, when, and how context modulates recognition. Moreover, the human data provide a quantitative benchmark and constraint for testing (but not training) computational models.

Inspired by the neuroscience of human vision, we propose the Context-aware Two-stream Attention network (CATNet). The proposed model makes inferences about the target object by guiding attention towards regions with informative contextual cues and object parts via dynamic integration of foveal (object) and peripheral (context) vision, and by automatically learning contextual reasoning strategies. We test CATNet and state-of-the-art in-context object recognition models on exactly the same psychophysics tasks without re-training the models for each experiment. CATNet surpasses other computational models in these experiments and shares remarkable similarity with human recognition abilities.

Figure 4: Fundamental properties of context and task schematic. Example image with full context (a) and image modifications used in experiments (b-g; more examples in Fig. S8). The target location (red box) is always the same across conditions. The correct answer (“mouse”) is not shown in the actual experiment. (h) Subjects were presented with a fixation cross (500 ms), followed by a bounding box indicating the target object location (1000 ms). In most experiments (except for Exp C1-3), the image was shown for 200 ms. After image offset, subjects typed one word to identify the target object.

2 Related Works

2.1 Role of Context in Human Visual Recognition

Many behavioral studies [4, 20] have focused on comparing congruent versus incongruent context conditions: objects appearing in a familiar background can be detected more accurately and faster than objects in an unusual scene (Fig. 1). Several qualitative demonstrations showed that context can help visual processing [2, 7, 25, 1] during recognition tasks [2], detection tasks [7, 25], working memory [18, 1], and visual search [23]. Here we systematically tested the three fundamental properties of context to quantitatively model where, when, and how contextual information modulates recognition.

2.2 Role of Context in Computer Vision

Contextual reasoning about objects and relations is critical to machine vision. Some studies show that deep nets for object recognition trained on natural image datasets, e.g. ImageNet [28], indeed rely implicitly but strongly on context [19, 8]. These algorithms can fail when objects are placed in an incongruent context [6, 17, 12] (Fig. 1).

Many exciting successes of computer vision methods can be partly ascribed to capitalizing on the statistical correlations between contextual information and object labels. Here we briefly and non-exhaustively introduce context-aware computational models in various applications. Qualitative analyses based on the statistical summary of object relationships have provided an effective source of information for perceptual inference tasks, such as object detection [48, 34, 24, 46, 32], scene classification [21, 47, 52], semantic segmentation [52], and visual question answering [44].

Classical approaches, e.g. Conditional Random Field (CRF), reason jointly across multiple computer vision tasks in image labeling, scene classification [21, 52, 29, 10], object detection and semantic segmentation [33]. Several graph-based methods incorporating contextual information, combined with neural network architectures, have been successfully applied in object priming [48], place and object recognition [50, 47], object detection [11, 32], and visual question answering [44]. Recent interesting approaches have used deep graph neural networks for contextual inference [26, 13, 15, 5]. These works typically assume that full contextual information is always available. However, in our experiments, we include experimental conditions where partial contextual information is available, such as minimal context, blurred context and only low-level context texture (Figure 4). Breaking away from these previous works where graph optimization is performed globally, our proposed model selects important visual features using an attention mechanism and integrates partial information from both the target object and the context over multiple steps, and, importantly, generalizes to context variations (Section 5). Furthermore, we provide a direct comparison against human benchmark performance.

3 Human psychophysics experiments

We examined three fundamental properties of contextual modulation in visual recognition (Fig. 4) by conducting 10 experiments, schematically illustrated in Fig. 4h, on Amazon Mechanical Turk [49]. We recruited 80 subjects per experiment, yielding a total of trials.

3.1 Experiment setup

The stimuli consisted of 2,259 images spanning 55 object categories from the test set of MSCOCO Dataset [30]. We constrained the size of target objects to four bins (in pixels): Size 1 [16-32], Size 2 [56-72], Size 4 [112-144], and Size 8 [224-288]. Given the stimulus size of pixels and viewing distance of meters, these values correspond to about 1, 2, 4, and 8 degrees of visual angle; but this may vary in MTurk depending on viewing conditions. To avoid any biases and potential memory effects, we took the following precautions: (a) Only one target object was selected per image; (b) Target objects were uniformly distributed over the 4 sizes and 55 categories; (c) Subjects saw at most 2 target objects per category; (d) The trial order was randomized.

3.2 Detailed description of each experiment

Experiment A: Context quantity.

We investigated the interaction between the object size and the amount of context in two experiments.

Exp A1, Object size. We conjectured that the impact of contextual information would depend on the target object size. We considered 4 object sizes as above. For each size, we introduced either minimal context (tightest rectangular bounding box enclosing the object, Fig. 4b) or full context (the entire image, Fig. 4a).

Exp A2, Amount of context. For each object size, we systematically titrated the amount of contextual information (Fig. 4c). The context-object ratio (CO) is the total image area excluding the target object divided by the object size. We included CO=0 (no pixels surrounding the object), 2, 4, 8, 16, and 128. Some combinations of large sizes and large CO values may not be possible.
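To make the CO definition concrete, the following is a minimal sketch, not the procedure used in the experiments. It assumes that "object size" means bounding-box area and that the crop grows around the object while keeping the object's aspect ratio; both function names are hypothetical.

```python
import math

def context_object_ratio(crop_w, crop_h, obj_w, obj_h):
    """CO = (crop area excluding the object) / (object area)."""
    obj_area = obj_w * obj_h
    return (crop_w * crop_h - obj_area) / obj_area

def crop_size_for_ratio(obj_w, obj_h, co):
    """Crop width and height that give the requested CO.

    Assumption of this sketch: the crop keeps the object's aspect ratio,
    so crop_area = (co + 1) * obj_area and each side scales by sqrt(co + 1).
    """
    scale = math.sqrt(co + 1.0)
    return obj_w * scale, obj_h * scale
```

Under these assumptions, CO=0 returns the object box itself, and large CO values quickly exceed the image bounds for large objects, which matches the remark that some size and CO combinations are not possible.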

Experiment B: Context content.

We studied how context resolution, geometry, and congruency modulated recognition in 5 experiments. Unless stated otherwise, we focused on sizes 1, 2 and 4, minimal and full context.

Exp B1, Blurred context. Human vision shows strong eccentricity dependence (high resolution in the fovea and progressively lower resolution toward the periphery). To quantify the impact of context resolution on recognition, only the context was blurred (Fig. 4d) using a zero-mean Gaussian with standard deviation pixels (image size = pixels).

Exp B2, Blurred object. To compare the effect of blurring the context versus the target object, we applied the same Gaussian blurring only to the object itself.

Exp B3, Texture only. We constructed textures constrained by the image statistics [35], and pasted the intact object on them (Fig. 4e). The textures preserve low-level features, but distort high-level features and semantic information.

Exp B4, Jigsaw context. To investigate the impact of the geometrical properties of context, we divided the image into grids of "jigsaw" pieces at three granularities (Fig. 4f). The piece containing the target object remained in the same position as in the original image, and the other pieces were randomly scrambled. We discarded cases when the object occupied more than one piece. For size 8, it was not possible to have the 8x8 condition.
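The scrambling step can be sketched as a permutation of grid pieces in which the piece holding the target is pinned. This is an illustrative sketch only: the function names, the row-major indexing, and the use of a plain shuffle are assumptions, not details taken from the experiment code.

```python
import random

def piece_index(x, y, width, height, n):
    """Row-major index of the grid piece containing pixel (x, y)."""
    col = min(x * n // width, n - 1)
    row = min(y * n // height, n - 1)
    return row * n + col

def scramble_jigsaw(n, target_index, rng=None):
    """Return perm, where perm[k] is the source piece placed at position k.

    The piece containing the target stays in place; all other pieces
    are shuffled among the remaining positions.
    """
    rng = rng or random.Random()
    positions = [k for k in range(n * n) if k != target_index]
    sources = positions[:]
    rng.shuffle(sources)
    perm = list(range(n * n))
    for pos, src in zip(positions, sources):
        perm[pos] = src
    return perm
```

A caller would first check that all four corners of the target box map to the same `piece_index`, mirroring the rule that objects spanning more than one piece were discarded.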

Exp B5, (In)congruent context. To examine the importance of context consistency in recognition, we pasted objects in different backgrounds by considering congruent object-context pairs (object and context belong to the same class label), and incongruent object-context pairs (context taken from a different image class label) (Fig. 4g).

Figure 5: Architecture overview of Context-aware Two-stream Attention network (CATNet). The diagram depicts the iterative modular steps carried out by CATNet over multiple time steps in the context-aware object recognition task. CATNet consists of 3 main modules: feature extraction, attention, and recurrent memory. These three modular steps repeat for a pre-specified number of time steps. For illustrative purposes, only the first and second time steps in a trial are shown here (see Section 4 for the definition of variables and Fig. S6 and S7 for implementation details of the attention and LSTM modules). CATNet is only trained using full context natural images and then it is tested in different conditions specified by each experiment (Section 3.1).

Experiment C: Dynamics of contextual modulation.

We investigated the temporal dynamics of contextual effects in 3 experiments.

Exp C1, Exposure time. In experiments A and B, the image duration was 200 ms (Fig. 4h). Here we systematically varied the exposure time to be 50, 100, or 200 ms (Fig. S9).

Exp C2, Backward masking. Backward masking is a technique commonly used in neuroscience to interrupt visual processing [43]. The mask shown after stimulus offset is purported to block top-down and recurrent computations. We used Portilla masks [35] as in Exp B3 (Fig. S10). The stimulus exposure times followed those in Exp C1.

Exp C3, Asynchronous context presentation. In all experiments above, object and context information were presented synchronously. During natural vision, subjects move their eyes from location P1 to location P2. The information gathered while fixating at P1 acts as a prior temporal context of fixation at P2. To investigate the effect of such prior temporal context in recognition, while conceptually simplifying the problem, we split the image into context-only and object-only parts. First, the context-only part was presented for a duration of 25, 50, 100, or 200 ms. Next, the context was removed, and the object-only part was presented for a duration of 50, 100, or 200 ms (Fig. S11). The synchronous conditions were also included for comparison purposes.

3.3 Performance evaluation and statistics

Most recognition experiments enforced N-way categorization (e.g., [43]). Here we introduced a more unbiased and natural probing mechanism whereby there were no constraints on what words subjects could use to describe the target object (Fig. 4h). To evaluate human performance, we separately collected a distribution of ground truth answers for each target object (Mturk subjects not participating in the main experiments). Though computational models were evaluated using N-way categorization, we still find it instructive to plot computational results alongside human behavior for comparison purposes. Moreover, relative changes and trends in humans can be directly compared to computational results. For human-model, within-human and within-model comparisons, we used the Wilcoxon ranksum test [22], and one-way or two-way ANOVA tests [27] (Supp. Material).

4 Context-aware Two-stream Attention Net

We propose a Context-aware Two-stream Attention network (CATNet), extending previous work on image captioning [51]. CATNet is presented with the stimulus, a natural image where the target object is indicated by a white bounding box. Inspired by the eccentricity dependence of human vision, CATNet has one stream that processes only the target object (the object image: minimal context, Fig. 4b but without the gray background) and a second stream that processes the contextual information in the periphery (the context image: full context, Fig. 4a). The two streams are processed through weight-sharing convolutional neural networks in parallel. The object image is enlarged to be the same size as the context image, such that each convolutional kernel sees the object at finer-grained detail.

CATNet explicitly integrates the fovea and periphery via concatenation and makes a first attempt to predict a class label out of a pre-defined set of object classes. Since horizontal and top-down connections are pervasive throughout the cortex and are presumed to be important for recognition [43], we add a recurrent LSTM module in CATNet to iteratively reason about context. The LSTM module constantly modulates its internal representation of the scene via attention and outputs predicted class labels over multiple time steps. The attention-modulated feature maps of the object and context streams are therefore functions of the time step. For simplicity in naming conventions, we use a superscript to indicate whether a variable belongs to the object stream or the context stream, and a subscript to denote time-dependent variables.

4.1 Convolutional Feature Extraction

CATNet takes the object image and the context image as inputs and uses a feed-forward convolutional neural network to extract a feature map from each. We use the VGG16 network [40], pre-trained on ImageNet [14], and fine-tune it at the training stage. To focus on specific parts of the image and select features at those locations, we preserve the spatial organization of features; thus, CATNet uses the output feature maps at the last convolution layer of VGG16. The parameters of the feed-forward feature extractor networks on the two inputs are shared. Since the object image is the enlarged version of the target object region in the context image, this results in higher acuity and enhances sensitivity to details of the target object. We describe one stream next, but the same ideas apply to the other.

A feature vector of fixed dimension represents the part of the image at each spatial location of the feature map, with locations indexed over the width and height of the map.


4.2 Attentional Modulation

We use a “soft-attention” mechanism as introduced by [3] to compute “the context gist” on the context feature map and “the object gist” on the object feature map (Fig. S6). Each stream has its own attention map; the two attention modules have identical architectures but different weight parameters. We describe the context stream of attention but the same principles apply to the object attention map. For each location in the context feature map, the attention mechanism generates a positive scalar representing the relative importance of the feature vector at that location in capturing the context gist. This scalar depends on the feature vector at that location, combined with the hidden state at the previous step of the recurrent network described in Section 4.3.

The weight matrices applied to the feature vectors and to the hidden state are initialized randomly and learnt during training. Because not all attended regions might be useful for context reasoning, the soft attention module also predicts a gating vector from the previous hidden state through a further learnt weight matrix. Each element of the gating vector is a scalar that determines how much the current observation at one location contributes to the context vector. As also noted by [51], the gate helps put more emphasis on the salient objects in the images. Once the attention map and the gating vector are computed, the model applies the “soft-attention” mechanism to compute the context gist by summing over all the regions in the image.


We define the combined gist vector as the concatenation of the context gist and the object gist, which is used as input to the LSTM module described next. The attention module is smooth and differentiable, and CATNet can learn all the weight matrices in an end-to-end fashion via back-propagation.
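For orientation, one plausible way to write the attention module described in this section is given below. The notation is an assumption introduced here for illustration and is not taken from the paper: $a^{c}_{i}$ is the context feature vector at location $i$ (out of $L$ locations), $h_{t-1}$ is the previous hidden state, $W^{c}_{a}$, $W^{c}_{h}$, and $W^{c}_{\beta}$ are learnt weights, $\sigma$ is the logistic sigmoid, and $z_{t}$ is the combined gist. The exact form of the score function (for example, whether an extra nonlinearity is applied) is also an assumption.

```latex
\begin{align}
e^{c}_{t,i} &= W^{c}_{a}\, a^{c}_{i} + W^{c}_{h}\, h_{t-1}
  && \text{(unnormalized score at location } i\text{)} \\
\alpha^{c}_{t,i} &= \frac{\exp\!\left(e^{c}_{t,i}\right)}{\sum_{j=1}^{L} \exp\!\left(e^{c}_{t,j}\right)}
  && \text{(positive attention weight)} \\
\beta^{c}_{t} &= \sigma\!\left(W^{c}_{\beta}\, h_{t-1}\right) \in \mathbb{R}^{L}
  && \text{(gating vector, one scalar per location)} \\
z^{c}_{t} &= \sum_{i=1}^{L} \beta^{c}_{t,i}\, \alpha^{c}_{t,i}\, a^{c}_{i}
  && \text{(context gist)} \\
z_{t} &= \left[\, z^{c}_{t} \,;\, z^{o}_{t} \,\right]
  && \text{(concatenation with the object gist)}
\end{align}
```

The object stream would follow the same equations with its own weights, in line with the statement that the two attention modules share an architecture but not parameters.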

4.3 Recurrent Connections using LSTM

We use a long short-term memory (LSTM) network to output a predicted class label based on the previous hidden state and the gist vectors of the object and context streams. Our implementation of LSTM closely follows [54] (Fig. S7). The LSTM variables are the input, forget, memory, output, and hidden state.

To compare CATNet and human performance when exposure time changes (Exp. C), we set one time step in the LSTM to correspond to 25 ms and considered the predicted class labels of CATNet at the corresponding number of time steps as the answers.
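The mapping from exposure time to LSTM read-out step can be sketched as follows, using the 25 ms per step stated in Section 5.5. The function names and the convention that `step_predictions[t]` holds the label produced after step t + 1 are assumptions of this sketch.

```python
MS_PER_STEP = 25  # one LSTM time step is mapped to 25 ms (Section 5.5)

def steps_for_exposure(exposure_ms, ms_per_step=MS_PER_STEP):
    """Number of LSTM time steps that corresponds to an exposure time."""
    if exposure_ms <= 0 or exposure_ms % ms_per_step != 0:
        raise ValueError("exposure must be a positive multiple of the step")
    return exposure_ms // ms_per_step

def prediction_at_exposure(step_predictions, exposure_ms):
    """Label read out at the step matching the exposure time.

    step_predictions[t] is assumed to be the label after step t + 1.
    """
    return step_predictions[steps_for_exposure(exposure_ms) - 1]
```

With this mapping, the 50, 100, and 200 ms conditions of Exp C1 correspond to reading the prediction after 2, 4, and 8 steps.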

To predict the class label for the target object, the LSTM computes a classification vector in which each entry denotes a class probability given the current hidden state. The hidden state is mapped to this vector through a matrix of learnt parameters that is initialized randomly.

4.4 Training and Implementation Details

We trained CATNet end-to-end by minimizing the cross-entropy loss between the predicted label at each time step and the ground-truth label.
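As a rough illustration of a per-time-step cross-entropy objective, the sketch below applies a softmax to the class scores at every step and accumulates the negative log-probability of the correct class. Summing (rather than averaging) over time steps and the function names are assumptions; the actual training code may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(logits_per_step, target):
    """Sum over time steps of -log p_t(target).

    logits_per_step: one list of class scores per LSTM time step.
    target: index of the ground-truth class.
    """
    loss = 0.0
    for logits in logits_per_step:
        loss -= math.log(softmax(logits)[target])
    return loss
```

Because every time step contributes to the loss, the model is pushed to produce a correct label early as well as late in the sequence.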


We used all images from the MSCOCO training set for training and validating all models. On every training image, each object can be selected as the target object and it is always shown in full context. Only at the testing stage do we vary the context based on the conditions of each experiment, as described in Section 3.1. Importantly, none of the human behavioral experiments are used to train the model. The input image size (for both the object and context streams) was pixels. We set the total number of time steps for training CATNet. Further implementation details are provided in the Supp. Material.

Data and code availability: All source code, and the data from the psychophysics experiments will be released publicly upon publication.

4.5 Competitive baselines and ablated models

We compared the results of CATNet against several competitive baselines, such as DeepLab-CRF [9] in semantic segmentation and YOLO3 [36, 37] in object detection. These models were adapted to the context-aware object recognition task (Supp. Material).

To study the role of attention, the two-stream architecture, and recurrent connections, we also introduced a series of ablated versions of CATNet. Starting from the original VGG16 object recognition network [40] pre-trained on ImageNet [14] (VGG16 on cropped objects), we added one component at a time and evaluated the incremental change in performance. These models include VGG16 + binary mask, two-stream VGG16, VGG16 + attention, and VGG16 + attention + LSTM.
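For the VGG16 + binary mask variant, Section 5.7 states that the stimulus is concatenated with a binary mask marking the target location. A minimal sketch of that input construction is shown below, using nested lists in place of image tensors. The function names, the end-exclusive box convention, and channel-wise concatenation are assumptions of this sketch.

```python
def binary_mask(height, width, box):
    """Rows of 0/1 with ones inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(width)]
            for y in range(height)]

def add_mask_channel(image, box):
    """Append the mask as an extra channel to an H x W x C nested list."""
    mask = binary_mask(len(image), len(image[0]), box)
    return [[pixel + [mask[y][x]] for x, pixel in enumerate(row)]
            for y, row in enumerate(image)]
```

The extra channel tells the network where the target is without changing the image content, which is why this variant isolates the benefit of location information from that of learnt attention.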

5 Results

5.1 Object and context size matter

Figure 6: Contextual modulation is stronger for smaller target objects (Exp A1). Top-1 accuracy increases with object sizes (Fig. 4a-b) and contextual information increases accuracy, particularly for small target objects, for humans and CATNet.

For the minimal context condition (Fig. 4b), human performance improved monotonically as a function of object size (Exp A1, Fig. 6, one-way ANOVA). This effect was readily captured by the CATNet model (one-way ANOVA). Adding full contextual information (Fig. 4a) led to a large improvement in performance both for humans and CATNet. Contextual modulation strongly depends on object size: the performance ratio between the full context and minimal context conditions was 4.7 and 2.5 (humans and CATNet, respectively) for object size 1, whereas the ratio was 1.1 and 1.05 (humans and CATNet, respectively) for object size 8. Contextual information greatly facilitates recognition when the target objects are small and hard to recognize.

We further quantified how the amount of contextual information impacts recognition by titrating the context-object ratio (CO) from 0 to 128 (Exp A2, Fig. S1). The amount of context is important both for humans (one-way ANOVA) and CATNet (one-way ANOVA).

Across all the CO ratios, humans outperformed CATNet for small object sizes and CATNet outperformed humans for the largest object size. Of note, CATNet was never trained or fine-tuned with any human data. These experiments demonstrate that the context quantity has a strong impact on recognition.

5.2 Blurred context is sufficient for recognition

Figure 7: Contextual facilitation persists even after small amounts of blurring (Exp B1). A large amount of context blurring (Fig. 4d) is required to disrupt the recognition enhancement for humans and CATNet.

Due to strong eccentricity dependence of human vision, peripheral information has less resolution than the fovea. In fact, the resolution drops so sharply that humans are legally blind in the far periphery. We conjectured that low resolution context could be sufficient to facilitate recognition. To test this conjecture, we applied different amounts of blurring in the context (Exp B1, Fig. 4d).

Human recognition accuracy dropped with the amount of blurring, from levels indistinguishable from the full resolution condition at small standard deviations all the way to levels indistinguishable from the minimal context condition at large standard deviations (Fig. 7, one-way ANOVA). Interestingly, there was a wide range of blurring that led to robust context modulation, consistent with the notion that humans do not require full resolution context for recognition. The effects of blurring were also captured by CATNet, where contextual enhancement disappeared only when using large blur values (one-way ANOVA). Similar to the results in Exp A1 and Exp A2, humans outperformed CATNet on small objects.

We also compared the effects of blurring the object itself without blurring the context (Exp. B2, Fig. S2). Although the total number of pixels affected by blurring the target object is much smaller than when blurring the context (for a fixed blur level), modifying the object led to larger accuracy drops for object sizes 2 and 4, both for humans and CATNet.

5.3 Contextual effects rely on spatial configuration

Figure 8: Large geometrical context re-arrangements disrupt contextual enhancement (Exp B4). Scrambling context pieces (Fig. 4f) reduces the contextual enhancement only when many small context pieces are changed, both for humans and CATNet.

Another important aspect of context is the relative position of objects and features in the image; e.g., the sky is often at the top under natural viewing conditions. To evaluate how the spatial configuration of context impacts recognition accuracy, we scrambled the images into various numbers of jigsaw pieces while the piece containing the target object remained in the same position as in the original image (Exp B4, Fig. 4f). Both humans and CATNet relied on the spatial configuration of context over all object sizes (one-way ANOVA for both humans and CATNet). The inconsistent spatial configuration of contextual information in the two finer configurations led to a reduction in recognition accuracy. Interestingly, accuracy in the coarsest configuration was not significantly different from the unscrambled full context condition, probably because each large piece itself already contains sufficient contextual information or because the effect of context reasoning decreases with increasing distance to the target object [55].

CATNet was more robust to the distorted spatial configurations: recognition accuracy differed from the full-context condition in only one of the three configurations (two-tailed ranksum test).

5.4 Bad context is worse than no context

Given that the moderately blurred context still retained its effects on recognition (Fig. 7), we asked whether the contextual effects could still be elicited using low-level texture features from the images. We tested this possibility by pasting objects on Portilla textures constrained by the image statistics (Exp B3, Fig. 4e).

Low-level texture features did not facilitate object recognition for either humans or CATNet (Fig. S3). In fact, human performance was slightly impaired when objects were embedded within these textures compared to the minimal context condition (two-tailed ranksum test, all object sizes). For CATNet, low-level texture features improved recognition with respect to minimal context only for object size 1, but the effect was much smaller than when using full contextual information.

Given that low-level textures did not help (and could even hurt recognition), and inspired by Fig. 1, we next studied recognition when objects were removed from their original images and placed in the same location but in different images: congruent contexts (images with same class labels) or incongruent contexts (images with different class labels, Fig. 4g).

Congruent contexts enhanced recognition for smaller object sizes compared to the minimal context condition both for humans and CATNet (Fig. 9). The facilitation elicited by congruent context was lower than that in the original full context. Although congruent contexts typically share similar correlations between objects and scene properties, pasting the object in a congruent context did not lead to the same enhancement. This may be due to the erroneous relative size between objects, the unnatural boundaries created by pasting, or important contextual cues specific to each image. Interestingly, CATNet was relatively oblivious to these effects and performance in the congruent condition was closer to that in the original full context condition.

In stark contrast with these observations, incongruent contexts consistently degraded recognition performance below the minimal context condition. Across all object sizes, subjects showed higher accuracy for objects in congruent versus incongruent contexts (one-way ANOVA). Accuracy was lower for incongruent context than minimal context (two-tailed ranksum test). Similarly, CATNet recognition accuracy was higher with congruent context (one-way ANOVA) and was degraded by incongruent context (for all object sizes, two-tailed ranksum test).

Figure 9: Incongruent context impairs recognition. Pasting the target objects in different but congruent contexts facilitates recognition. Pasting the target objects in incongruent contexts (Fig. 4g) impairs recognition, both for humans and CATNet.

5.5 Temporal dynamics of contextual modulation

Figure 10: Stimulus exposure time has little effect on recognition (Exp C1). Exposure time was varied from 50 to 200 ms. Exposure time of 50 ms is sufficient to get the “gist” of context.

The dynamics of recognition place strong constraints on interpreting the flow of bottom-up and top-down visual processes [45, 43, 38]. We conducted 3 experiments to examine the dynamics of contextual effects on recognition.

First, we varied the exposure time (Fig. 4h) from 50 to 200 ms (Exp. C1). Interestingly, human performance was largely unaffected by the image duration (Fig. 10). To assess the role of exposure time in CATNet, each computational time step was mapped to 25 ms (Sec 4.3). Consistent with human behavior results, exposure time had no effect on object recognition for CATNet.

Exp C1 shows that context modulation occurs within a short stimulus presentation duration. Such rapid computations are typically thought of as involving largely bottom-up processing [39, 16]. Despite the short exposure, there could be additional computations that take place after stimulus offset. The next experiment sought to interrupt those computations using backward masking, where presentation of the stimulus is rapidly followed by a Portilla mask [35] (Exp C2, Fig. S10).

Accuracy in the minimal context condition was not changed by backward masking (Fig. S4). The recognition enhancement in the full context condition was impaired when the mask was introduced after 50-100 ms exposure to the image, but not with longer exposures, consistent with previous studies [43]. Overall, these results show that contextual modulation is fast and involves recurrent computations.

In natural vision, subjects interpret a scene by moving their eyes in ballistic saccades; thus, contextual information is often available before processing an object. When fixating on a given object, subjects already have prior contextual information from the previous fixations. To approximate this process and study contextual reasoning with semi-realistic temporal priors, we designed an experiment where the context and target object were shown asynchronously: context was presented for 25, 50, 100, or 200 ms before showing the minimal context image (Exp. C3, Fig. S11). Surprisingly, even 25 ms exposure to context was sufficient to trigger contextual modulation (Fig. S5). For small objects, contextual facilitation was larger with increased context exposure, reaching the levels of the synchronous condition for 200 ms. In sum, a previous saccade, which typically lasts 200 ms, provides contextual information that can be held in memory and enhance recognition of a minimal context object, and even shorter exposure to context already enhances recognition.

5.6 Comparison with other models

Thus far, we focused on presenting the results of the CATNet model introduced in Fig. 5. As discussed in Section 2, other computational models (such as [50, 47]) have been proposed to incorporate some form of contextual information. We compared CATNet versus two state-of-the-art models incorporating contextual information for semantic segmentation (DeepLab [10]) and object detection (YOLO3 [36, 37]). Details about the performance of these models are shown in Fig. S13 and S14. Although DeepLab and YOLO3 leverage global context information, CATNet outperformed both models, especially on small objects. For example, DeepLab performed almost as well as CATNet on large objects but it failed to recognize small objects and to demonstrate the strong contextual facilitation repeatedly observed in every experiment (Figs. 6, 7, 8, and 9). These observations also hold true for YOLO3. Even though YOLO3 has a dedicated object recognition module after region proposal, it failed to take contextual information into account when recognizing small objects. We also note again that all computational models, including CATNet, performed worse than humans on small objects in every experiment, which suggests that more intelligent ways of reasoning about context are needed in existing computer vision tasks.

5.7 Ablation reveals critical model components

We also compared CATNet against several other baselines, including modified versions of CATNet with ablated components. To gauge performance based on visual features in the whole image without focusing on the target object location, we evaluated pre-trained VGG16 [40] as a lower bound. As expected, the accuracy of VGG16 was essentially at chance, particularly for small objects (Fig. S15), confirming that in-context object recognition is not a trivial visual feature mapping task and requires focusing on the target object location. Next, we concatenated the natural stimulus with a binary mask indicating the target object location (VGG16+binarymask). Although this increased performance, accuracy was still well below that of CATNet (Fig. S16), suggesting that an attentional mechanism that weighs the different features plays an important role. To evaluate this, we implemented an attention module (Section 4, VGG16+attention). This led to a large performance boost, consistent with previous work showing the efficiency of attention in computer vision tasks [31]. In Fig. S12, we provide example visualizations of the predicted attention maps on context and on target objects, respectively. CATNet learns to focus on context regions that are informative for recognition. Consistent with previous work [31], attention on target objects is sparse and focuses on object edges or on the minimal context regions surrounding the target rather than on visual features of the targets themselves. We make further comparisons with a VGG16 version that includes an LSTM module and with a two-stream version of VGG16 in Fig. S18 and S19.
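As an illustration of the VGG16+binarymask input, the following is a minimal sketch of how an image and a target-location mask could be concatenated along the channel axis. The function name, the bounding-box convention, and the image size are assumptions made for this example; the text above does not specify how the mask was encoded or how the first layer of the pre-trained network was adapted to the extra channel.

```python
import numpy as np


def add_target_mask_channel(image, bbox):
    """Append a binary target-location mask to an RGB image as a fourth channel.

    image: float array of shape (H, W, 3).
    bbox:  (x_min, y_min, x_max, y_max) in pixels; the max edges are
           exclusive (an assumed convention for this sketch).
    """
    height, width, _ = image.shape
    x_min, y_min, x_max, y_max = bbox
    mask = np.zeros((height, width, 1), dtype=image.dtype)
    mask[y_min:y_max, x_min:x_max, 0] = 1.0  # 1 inside the target box, 0 elsewhere
    return np.concatenate([image, mask], axis=-1)


# Hypothetical 224 x 224 stimulus with a 40 x 60 pixel target box.
image = np.random.rand(224, 224, 3).astype(np.float32)
stacked = add_target_mask_channel(image, bbox=(50, 60, 90, 120))
print(stacked.shape)  # (224, 224, 4)
```

A network pre-trained on three-channel images cannot consume this four-channel array directly, so any such baseline would also need a modified or newly trained first convolutional layer. The mask tells the network where the target is, but it does not reweigh features, which is the role attributed to the attention module above.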

6 Discussion

Here we quantitatively studied the role of context in visual object recognition in human observers and computational models in a task that involved recognizing target objects in various contexts. We investigated three critical properties of context: quantity, quality, and dynamics. Contextual facilitatory effects were particularly pronounced for small objects and increased with the amount of peripheral information. Consistent with the eccentricity dependence of human vision, facilitation was not affected by small amounts of blurring or by geometrical rearrangements that left information near the target object intact. Congruent contextual information typically enhanced visual recognition, while incongruent context impaired it. Contextual effects could not be accounted for by low-level properties of the image. Interestingly, such contextual modulation happened fast and could even be elicited in an asynchronous fashion, where the context is shown before the target object, but it could be impaired by rapid interruption via backward masking.

To investigate how far we are from human-level in-context object recognition, we evaluated competitive methods in computer vision and introduced a recurrent neural network model (CATNet). CATNet combines a feed-forward visual stream module that extracts image features in a dynamic fashion with an attention module that prioritizes different image locations, and it integrates information over time to produce a label for the target object. Surprisingly, even though the model lacks the expertise that humans have in interacting with objects in their context, it demonstrated human-like behavioral characteristics under different context conditions and reached almost human-level performance in a series of in-context object recognition tasks. However, there are still significant gaps between models and humans, particularly when recognizing small objects within context and even large objects out of context. These results introduce benchmarks to integrate object recognition and scene understanding, and provide initial steps to understand human visual recognition and improve current intelligent computer vision systems.
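To make the combination of attention and temporal integration concrete, here is a minimal sketch of additive soft attention over spatial locations feeding a recurrent state. It is not the CATNet implementation: all weights are random, a plain tanh recurrent update stands in for an LSTM, and every name and dimension below is hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(features, hidden, w_feat, w_hid, v):
    """Additive soft attention: weigh each location given the current state.

    features: (n_loc, feat_dim) feature vectors, one per image location.
    hidden:   (hid_dim,) recurrent state from the previous time step.
    Returns the attention-weighted feature vector and the weights.
    """
    energy = np.tanh(features @ w_feat + hidden @ w_hid)  # (n_loc, att_dim)
    weights = softmax(energy @ v)                         # (n_loc,), sums to 1
    return weights @ features, weights


rng = np.random.default_rng(0)
n_loc, feat_dim, hid_dim, att_dim, n_steps = 49, 8, 16, 12, 4  # hypothetical sizes
features = rng.normal(size=(n_loc, feat_dim))
w_feat = rng.normal(size=(feat_dim, att_dim))
w_hid = rng.normal(size=(hid_dim, att_dim))
v = rng.normal(size=att_dim)
w_in = 0.1 * rng.normal(size=(feat_dim, hid_dim))
w_rec = 0.1 * rng.normal(size=(hid_dim, hid_dim))

hidden = np.zeros(hid_dim)
for step in range(n_steps):
    pooled, weights = attend(features, hidden, w_feat, w_hid, v)
    # Plain tanh recurrent update, used here in place of an LSTM cell.
    hidden = np.tanh(pooled @ w_in + hidden @ w_rec)

print(weights.shape, hidden.shape)
```

Because the attention weights depend on the recurrent state, the locations that are prioritized can change from one time step to the next. In a trained model, a classifier applied to the final state would produce the label for the target object.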


References

  • [1] E. Aminoff, N. Gronau, and M. Bar (2006) The parahippocampal cortex mediates spatial and nonspatial associations. Cerebral Cortex 17 (7), pp. 1493–1503. Cited by: §2.1.
  • [2] M. E. Auckland, K. R. Cave, and N. Donnelly (2007) Nontarget objects can influence perceptual processes during object recognition. Psychonomic bulletin & review 14 (2), pp. 332–337. Cited by: §2.1.
  • [3] J. Ba, V. Mnih, and K. Kavukcuoglu (2014) Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755. Cited by: §4.2.
  • [4] M. Bar and E. Aminoff (2003) Cortical analysis of visual context. Neuron 38 (2), pp. 347–358. Cited by: §2.1.
  • [5] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510. Cited by: §2.2.
  • [6] S. Beery, G. Van Horn, and P. Perona (2018) Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. Cited by: §1, §2.2.
  • [7] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz (1982) Scene perception: detecting and judging objects undergoing relational violations. Cognitive psychology 14 (2), pp. 143–177. Cited by: §2.1.
  • [8] W. Brendel and M. Bethge (2019) Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760. Cited by: §2.2.
  • [9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §4.5.
  • [10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.2, §5.6.
  • [11] X. Chen, L. Li, L. Fei-Fei, and A. Gupta (2018) Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7239–7248. Cited by: §2.2.
  • [12] M. J. Choi, A. Torralba, and A. S. Willsky (2012) Context models and out-of-context objects. Pattern Recognition Letters 33 (7), pp. 853–862. Cited by: §1, §2.2.
  • [13] W. Choi and S. Savarese (2012) A unified framework for multi-target tracking and collective activity recognition. In European Conference on Computer Vision, pp. 215–230. Cited by: §2.2.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1, §4.5.
  • [15] Z. Deng, A. Vahdat, H. Hu, and G. Mori (2016) Structure inference machines: recurrent neural networks for analyzing relations in group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4772–4781. Cited by: §2.2.
  • [16] J. J. DiCarlo, D. Zoccolan, and N. C. Rust (2012) How does the brain solve visual object recognition?. Neuron 73 (3), pp. 415–434. Cited by: §5.5.
  • [17] N. Dvornik, J. Mairal, and C. Schmid (2018) Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 364–380. Cited by: §1, §2.2.
  • [18] A. Friedman (1979) Framing pictures: the role of knowledge in automatized encoding and memory for gist.. Journal of experimental psychology: General 108 (3), pp. 316. Cited by: §2.1.
  • [19] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231. Cited by: §2.2.
  • [20] J. O. Goh, S. C. Siong, D. Park, A. Gutchess, A. Hebrank, and M. W. Chee (2004) Cortical areas involved in object, background, and object-background processing revealed with functional magnetic resonance adaptation. Journal of Neuroscience 24 (45), pp. 10223–10228. Cited by: §2.1.
  • [21] J. M. Gonfaus, X. Boix, J. Van de Weijer, A. D. Bagdanov, J. Serrat, and J. Gonzalez (2010) Harmony potentials for joint classification and segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3280–3287. Cited by: §2.2, §2.2.
  • [22] T. Harris and J. W. Hardin (2013) Exact wilcoxon signed-rank and wilcoxon mann–whitney ranksum tests. The Stata Journal 13 (2), pp. 337–343. Cited by: §3.3.
  • [23] J. M. Henderson, P. A. Weeks Jr, and A. Hollingworth (1999) The effects of semantic consistency on eye movements during complex scene viewing.. Journal of experimental psychology: Human perception and performance 25 (1), pp. 210. Cited by: §2.1.
  • [24] D. Hoiem, A. A. Efros, and M. Hebert (2005) Geometric context from a single image. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 1, pp. 654–661. Cited by: §2.2.
  • [25] A. Hollingworth (1998) Does consistent scene context facilitate object perception?. Journal of Experimental Psychology: General 127 (4), pp. 398. Cited by: §2.1.
  • [26] H. Hu, G. Zhou, Z. Deng, Z. Liao, and G. Mori (2016) Learning structured inference neural networks with label relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2960–2968. Cited by: §2.2.
  • [27] P. Ito (1980) 7 robustness of anova and manova test procedures. Handbook of statistics 1, pp. 199–236. Cited by: §3.3.
  • [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.2.
  • [29] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr (2010) Graph cut based inference with co-occurrence statistics. In European Conference on Computer Vision, pp. 239–253. Cited by: §2.2.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.1.
  • [31] D. Linsley, D. Shiebler, S. Eberhardt, and T. Serre (2018) Learning what and where to attend. Cited by: §5.7.
  • [32] Y. Liu, R. Wang, S. Shan, and X. Chen (2018) Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6985–6994. Cited by: §2.2, §2.2.
  • [33] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898. Cited by: §2.2.
  • [34] D. Park, D. Ramanan, and C. Fowlkes (2010) Multiresolution models for object detection. In European conference on computer vision, pp. 241–254. Cited by: §2.2.
  • [35] J. Portilla and E. P. Simoncelli (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision 40 (1), pp. 49–70. Cited by: §3.2, §3.2, §5.5.
  • [36] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §4.5.
  • [37] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv. Cited by: §4.5.
  • [38] M. Riesenhuber and T. Poggio (1999) Hierarchical models of object recognition in cortex. Nature neuroscience 2 (11), pp. 1019. Cited by: §5.5.
  • [39] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio (2007) A quantitative theory of immediate visual recognition. Progress in brain research 165, pp. 33–56. Cited by: §5.5.
  • [40] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Figure 1, §4.1, §4.5, §5.7.
  • [41] J. Sun and D. W. Jacobs (2017) Seeing what is not there: learning context to determine where objects are missing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5716–5724. Cited by: §1.
  • [42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Figure 1.
  • [43] H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, and G. Kreiman (2018) Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences 115 (35), pp. 8835–8840. Cited by: §3.2, §3.3, §4, §5.5, §5.5.
  • [44] D. Teney, L. Liu, and A. van den Hengel (2017) Graph-structured representations for visual question answering. arXiv preprint. Cited by: §2.2, §2.2.
  • [45] S. Thorpe, D. Fize, and C. Marlot (1996) Speed of processing in the human visual system. nature 381 (6582), pp. 520. Cited by: §5.5.
  • [46] A. Torralba, K. Murphy, and W. Freeman (2010) Using the forest to see the trees: object recognition in context. Comm. of the ACM. Cited by: §2.2.
  • [47] A. Torralba, K. P. Murphy, and W. T. Freeman (2005) Contextual models for object detection using boosted random fields. In Advances in neural information processing systems, pp. 1401–1408. Cited by: §2.2, §2.2, §5.6.
  • [48] A. Torralba (2003) Contextual priming for object detection. International journal of computer vision 53 (2), pp. 169–191. Cited by: §2.2, §2.2.
  • [49] A. M. Turk (2012) Amazon mechanical turk. Retrieved August 17, pp. 2012. Cited by: §3.
  • [50] K. Wu, E. Wu, and G. Kreiman (2018) Learning scene gist with convolutional neural networks to improve object recognition. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. Cited by: §2.2, §5.6.
  • [51] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §4.2, §4.
  • [52] J. Yao, S. Fidler, and R. Urtasun (2012) Describing the scene as a whole: joint object detection, scene classification and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 702–709. Cited by: §2.2, §2.2.
  • [53] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: Figure 1.
  • [54] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. Cited by: §4.3.
  • [55] M. Zhang, J. Feng, K. Montejo, J. Kwon, J. H. Lim, and G. Kreiman (2019) Lift-the-flap: context reasoning using object-centered graphs. arXiv preprint arXiv:1902.00163. Cited by: §5.3.