reanalysis of the ObjectNet paper and our annotations and code
Recently, Barbu et al. introduced a dataset called ObjectNet which includes objects in daily life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Due to the importance and implications of their results regarding the generalization ability of deep models, we take a second look at their findings. We highlight a major problem with their work, which is applying object recognizers to scenes containing multiple objects rather than to isolated objects. The latter results in a performance gain of around 20-30% in our experiments. Compared with the results reported in the ObjectNet paper, we observe that around 10-15% of the performance loss can be recovered, without any test time data augmentation. In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset. Thus, we believe that ObjectNet remains a challenging dataset for testing the generalization power of models beyond datasets on which they have been trained.
Deep learning has revolutionized not only computer vision but also several other areas including machine learning, natural language processing, and time series prediction. With the initial excitement gradually dampening, researchers have started to study the shortcomings of deep learning models and to question their generalization power. From prior research (e.g. Azulay and Weiss (2019); Recht et al. (2019); Goodfellow et al. (2014)), we already know that CNNs: a) often fail when applied to transformed versions of the same object. In other words, they are not invariant to transformations such as translation (CNNs are equivariant, not invariant, to translation), in-plane and in-depth rotation, scale, lighting, and occlusion, b) lack out-of-distribution generalization. Even after being exposed to many different instances of the same object category, they are not good at learning that concept. In stark contrast, humans are able to generalize from only a few examples, and c) are sensitive to small image perturbations (i.e. adversarial examples Goodfellow et al. (2014)).
Several datasets have been proposed in the past to train and test deep models and to study their generalization ability (e.g. ImageNet Russakovsky et al. (2015), CIFAR Krizhevsky et al. (2009), NORB LeCun et al. (2004), iLab20M Borji et al. (2016)). In a recent effort, Barbu et al. introduced a dataset called ObjectNet which has less bias than other datasets (this dataset, however, has its own biases: it consists of indoor objects that are available to many people, are mobile, and are not too large, too small, fragile, or dangerous) and is supposed to be used solely as a test set (unlike some benchmarks, e.g. ImageNet, that hide their test images, images in the ObjectNet dataset will be publicly available; see http://objectnet.dev). Objects in this dataset are pictured by Mechanical Turk workers using a mobile app in a variety of backgrounds, rotations, and imaging viewpoints. It contains 50,000 images from 313 categories, out of which 113 are in common with ImageNet, and comes with a license that prohibits researchers from fine-tuning models on it. Barbu et al. find that state-of-the-art object detectors (they mean object recognizers) perform drastically worse on ObjectNet than on the ImageNet dataset (a drop of about 40-45%). Here, we revisit Barbu et al.’s study and seek to answer how much the performance of models drops on this dataset compared with their performance on ImageNet. For this reason, we limit our analysis to the 113 overlapping categories.
Barbu et al.’s work is a great contribution to the field in answering how well object recognition models generalize to real-world situations. It suffers, however, from a major flaw: it makes no distinction between "object recognition" and "object detection". They use the term "object detector" to refer to object recognition models. This raises several concerns:
Instead of applying object recognition models to individual object bounding boxes, they apply them to cluttered scenes containing multiple objects. Alternatively, they could have run object detection models on scenes and measured detection performance (i.e. mean average precision; mAP).
Object detection and object recognition are two distinct, but related, tasks. Each has its own models, datasets, and evaluation measures. For example, as shown in Fig. 1, images in object recognition datasets (e.g. ImageNet Russakovsky et al. (2015)) often contain a single object, usually from a closeup view, whereas scenes in object detection datasets (e.g. MS COCO Lin et al. (2014)) have multiple objects. Due to this, object characteristics in the two types of datasets might be different. For example, objects are often smaller in detection datasets compared to recognition datasets Singh and Davis (2018).
It is true that ImageNet also includes a fair number of images that have more than one object, but it appears that ObjectNet images are more cluttered and have a higher number of objects, although we have not quantified this since this dataset does not come with object-level labels. This discussion is also related to the difference between "scene understanding" and "object recognition". To understand a complex scene, we look around, fixate on individual objects to recognize them, and accumulate information over fixations to perform more complex tasks such as answering a question or describing an event.
One might argue that the top-5 metric somewhat takes care of scenes with multiple objects. While this might be true to some degree, it certainly does not entail that it is better to train or test object recognition models on scenes with multiple objects. One problem is labeling such images. For example, a street scene with a car, a pedestrian, trees, and buildings is given only one label, which adds noise to the data. Another problem is invariance. Given that individual objects can undergo many transformations, feeding scenes with multiple objects to models exacerbates the problem and leads to combinatorially many more variations, which are harder to learn. A better approach would be to focus on individual objects. As humans, we also move our eyes around and place our fovea on one object at a time to recognize it, using information from our peripheral vision. The analogue of this task still does not exist in computer vision. This task is reminiscent of object detection, but there are subtle differences which will be elaborated on later in the discussion section.
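To make the top-5 argument concrete, the following is a minimal sketch of how a top-k accuracy can be computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from the ObjectNet evaluation code.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score for this image.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for three images over four hypothetical classes.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true class 1 is ranked second
    [0.5, 0.2, 0.2, 0.1],  # true class 3 is ranked last
]
toy_labels = [1, 1, 3]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

A larger k forgives a prediction that names another object present in the scene, but the single ground-truth label per image is still the only one that counts as correct.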
Here, we first annotate the objects in the ObjectNet scenes and then apply a number of deep object recognition models to only object bounding boxes.
We employ six deep models: AlexNet Krizhevsky et al. (2012), VGG-19 Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015), ResNet-152 He et al. (2016), Inception-v3 Szegedy et al. (2016), and MNASNet Tan et al. (2019). AlexNet, VGG-19, and GoogLeNet have also been employed in the ObjectNet paper Barbu et al. (2019). We use the PyTorch implementation of these models (https://pytorch.org/docs/stable/torchvision/models.html). Note that the code from the ObjectNet paper is not yet available. Because of possible inconsistencies between our code and the ObjectNet code, as well as different data processing methods, we run the models on the entire scenes in addition to the bounding boxes. This allows us to study whether and how much performance varies across these two conditions.
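The two evaluation conditions (whole scene versus bounding box) can be sketched as follows. This is a simplified illustration under stated assumptions: images are toy nested lists rather than real photographs, boxes are `(x_min, y_min, x_max, y_max)` with exclusive maxima, and `toy_classify` is a hypothetical stand-in for a pretrained network. It is not the pipeline used to produce the reported results.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), max exclusive


def crop_to_box(image: Image, box: Box) -> Image:
    """Return the image region inside the box, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("box does not overlap the image")
    return [row[x0:x1] for row in image[y0:y1]]


def accuracy_by_condition(samples: Sequence[Tuple[Image, Box, int]],
                          classify: Callable[[Image], int]) -> Dict[str, float]:
    """Top-1 accuracy when classifying whole scenes and box crops."""
    full_hits = box_hits = 0
    for image, box, label in samples:
        full_hits += int(classify(image) == label)
        box_hits += int(classify(crop_to_box(image, box)) == label)
    n = len(samples)
    return {"full": full_hits / n, "box": box_hits / n}


def toy_classify(image: Image) -> int:
    """Hypothetical classifier: class 1 if the image is mostly bright."""
    pixels = [p for row in image for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)


# A dark 4x4 scene with a bright 2x2 object, and an empty dark scene.
scene = [[0.0] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        scene[y][x] = 1.0
empty = [[0.0] * 4 for _ in range(4)]
samples = [(scene, (1, 1, 3, 3), 1), (empty, (0, 0, 2, 2), 0)]
results = accuracy_by_condition(samples, toy_classify)
```

In the toy data the bright object is missed on the full scene because the background dominates, but it is recognized once the crop isolates it, which mirrors the effect studied here.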
The 113 categories of the ObjectNet dataset that overlap with ImageNet contain 18,574 images in total. On this subset, the average number of images per category is 164.4 (min = 55, max = 284). Fig. 2 shows the distribution of the number of images per category on this dataset. We drew a bounding box around the object corresponding to the category label of each image (the annotation tools include https://github.com/aliborji/objectAnnotation and https://github.com/tzutalin/labelImg). If there were multiple objects from that category, we tried to include all of them in the bounding box (e.g. chairs around a table). Some example scenes and their corresponding bounding boxes are given in Fig. 1.
Fig. 3 shows an overlay of our results on the same figure from the ObjectNet paper. As can be seen, applying models to boxes instead of the entire scene improves performance by about 10%, but it is still much lower than the results on the ImageNet dataset. The gap, however, is narrower now.
Since the code of Barbu et al. (2019) is not available, we could not run their exact pipeline on the bounding boxes. They might have performed a different data normalization or test time data augmentation (such as rotation, scale, color jittering, cropping, etc.) to achieve better results. To account for this, we also applied models to the whole image to study how much performance varies. Results are shown in Fig. 4. We find that:
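As an illustration of the annotation rule for multiple instances, the sketch below merges several boxes into one enclosing box. The `(x_min, y_min, x_max, y_max)` box format, the function name `union_box`, and the example coordinates are assumptions made for this sketch and are not taken from the annotation tools.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def union_box(boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned box that encloses every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    x_mins, y_mins, x_maxs, y_maxs = zip(*boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))


# Two hypothetical chair boxes around a table, merged into one annotation.
chairs = [(40, 120, 110, 260), (300, 130, 380, 270)]
merged = union_box(chairs)

# Sanity check on the subset statistics quoted in the text:
# 18,574 images over 113 categories is about 164.4 images per category.
mean_images_per_category = 18574 / 113
```

A merged box of this kind necessarily includes whatever lies between the instances (here, the table), so such crops are less "isolated" than single-instance crops.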
Focusing on a small image region containing only a single object increases the performance significantly by around 20-30% across all tested models.
Our results on the whole scene are lower than Barbu et al.’s results (which are also on the whole scene). This suggests that applying their code to bounding boxes would likely improve performance even more than our results using boxes. Even assuming a 25% gain on top of their best results when using boxes, the gap in performance would still not close (see Fig. 3). We will discuss this further in the next section.
The breakdown of performance over the 113 categories for each of the 6 tested models is shown in Fig. 5 (over isolated objects) and Fig. 6 (over the full image). Interestingly, in both cases, almost all models (except GoogLeNet on isolated objects and AlexNet on the full image) perform best on the safety pin category. Inspecting the images from this class, we found that they contain a single safety pin, often held by a person. The same is true of the banana class, which is the second easiest category when using the bounding boxes. This object becomes much harder to recognize when using the full image (26.88% vs. 70.3% using boxes), which highlights the benefit of applying models to isolated objects rather than scenes.
Our investigation reveals that deep object recognition models perform significantly better when applied to isolated objects rather than scenes (around 20-30% increase in performance). The reason behind this is twofold. First, there is less variability in single objects compared to scenes containing those objects. Second, the deep models used here have been trained on ImageNet images, which are less cluttered than the ObjectNet images. We anticipate that training models from scratch on large-scale datasets that contain isolated objects will likely result in even higher accuracy.
Even assuming an increase of around 30% (at best) over Barbu et al.’s results when using bounding boxes, a large gap of at least 15% remains between ImageNet and ObjectNet performance, which means that ObjectNet is significantly harder. It covers a wider range of variations than ImageNet, including object instances, viewpoints, rotations, occlusions, etc., which pushes the limits of object recognition in both humans and machines. Hence, despite its limitations and biases, ObjectNet is indeed a great resource for testing models in realistic situations.
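A per-category breakdown of this kind can be computed as in the sketch below. The record format, the function names, and the toy outcomes are illustrative assumptions. The category names are borrowed from the text, but the values are made up and are not the reported results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def per_category_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to the fraction of its images classified correctly."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / total[category] for category in total}


def rank_categories(accuracy: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort categories from easiest (highest accuracy) to hardest."""
    return sorted(accuracy.items(), key=lambda item: item[1], reverse=True)


# Toy (category, was_prediction_correct) records; values are made up.
toy_records = [
    ("safety pin", True), ("safety pin", True),
    ("banana", True), ("banana", False),
    ("ruler", False), ("ruler", False),
]
ranking = rank_categories(per_category_accuracy(toy_records))
```

Running the same breakdown once on box crops and once on full images would expose categories, such as banana in the text, whose difficulty depends strongly on the condition.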
Throughout the annotation process of ObjectNet images, we came across the following observations:
Some objects look very different when they are in motion (e.g. the fan in Fig. 8; row 4)
Some objects appear different under the shadow of other objects (e.g. the hammer in Fig. 8; row 4)
Some objects can be recognized only by reading their labels (e.g. the pet food container in Fig. 7; row 2)
In many images, objects are occluded by hands holding them (e.g. the sock and the shovel in Fig. 7; row 4)
Some objects are hard to recognize in dim light (e.g. the printer in Fig. 7; row 2)
Some categories are often confused with other categories, for example: (bath towel, bed sheet, full sized towel, dishrag or hand towel), (sandal, dress shoe (men), running shoe), (t-shirt, dress, sweater, suit jacket, skirt), (ruler, spatula, pen, match), (padlock, combination lock), and (envelope, letter).
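Confusions like those listed above can be tallied from pairs of true and predicted labels. The following is a minimal sketch. The function name `most_confused_pairs` and the toy predictions are assumptions for illustration, and the counts are not measured values.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_confused_pairs(pairs: Iterable[Tuple[str, str]],
                        top_n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count misclassifications per unordered (true, predicted) category pair."""
    counts: Counter = Counter()
    for true_label, predicted in pairs:
        if true_label != predicted:
            # Sort so that A->B and B->A errors fall into the same bucket.
            counts[tuple(sorted((true_label, predicted)))] += 1
    return counts.most_common(top_n)


# Toy (true, predicted) pairs; made up for illustration only.
toy_predictions = [
    ("padlock", "combination lock"),
    ("combination lock", "padlock"),
    ("envelope", "letter"),
    ("envelope", "envelope"),
    ("padlock", "padlock"),
]
top_pair = most_confused_pairs(toy_predictions, top_n=1)
```

Treating the pair as unordered is a design choice. Keeping the direction of the error would instead show whether one category systematically absorbs the other.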
We foresee at least four directions for future work in this area:
First, it would be interesting to see how well state of the art object detectors (e.g. Faster RCNN Ren et al. (2015)) perform on this dataset (e.g. over classes overlapped with the MSCOCO dataset). We expect a big drop in detection performance since recognition is still the main bottleneck in object detection Borji and Iranmanesh (2019).
Second, measuring human performance on ObjectNet dataset will provide a baseline for gauging model performance. Barbu et al. report a human performance of around 95% on this dataset (via a pilot study) when subjects are asked to mention the objects that are present in the scene. This task, however, is different than recognizing isolated objects out of context similar to the regime that is considered here (i.e. similar to rapid scene categorization tasks). In addition, error patterns of models and humans (rather than just raw accuracy measures) will inform us about the mechanisms of object recognition in both humans and machines. It could be that models work in a completely different fashion than the human visual system. For instance, unlike CNNs, we are able to discern the foreground from the image background during recognition. This hints towards an interplay and feedback loop between recognition and segmentation that is currently missing in CNNs. Finally, we are invariant to image transformation only partially, therefore it may not make sense to desire models that are fully invariant (e.g. invariance to 360 degree in-plane rotation).
Third, and related to the second, is the role of context in object recognition. Context is a double-edged sword. On the one hand, using it may lead to relying on trivial correlations that may not always hold at test time. For example, relying on the fact that a keyboard always appears next to a monitor may lead to occasional failures when a model is tested on an isolated keyboard. A better example is adversarial patches Brown et al. (2017), where training a model on objects augmented with small random patches forces the model to rely on those features and hence to be fooled when tested on other objects with the same patch. On the other hand, completely discarding the context is also unwise, for two reasons. First, the great success of CNNs for object recognition and scene segmentation is attributed to their ability to exploit visual context. Second, as humans we also rely heavily on the surrounding visual context. Future research should further investigate the role of context in object recognition and how best to exploit it.
Fourth, to operationalize the third item above, we propose a new task which is recognizing objects in cluttered scenes containing multiple objects (See Borji and Iranmanesh (2019) for an example study). This task resembles object detection but it is actually different. First, here the ground truth object bounding boxes are given and the task is to correctly label them (i.e. no need to find objects). Second, the evaluation measure is accuracy which is easier to interpret than mean average precision. This task is also different than the current object recognition setup in which context is usually discarded. Existing object detection datasets, such as MS COCO Lin et al. (2014), can be utilized to evaluate models built for solving this task (i.e. instead of detecting objects the aim is to recognize isolated objects confined in bounding boxes).
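The proposed task can be sketched on COCO-style annotations, in which each box is stored as `[x, y, width, height]` together with a `category_id`. The classifier `toy_classify_box` and the annotation values below are hypothetical. In practice a recognition model would be run on the image crop.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Corners = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def xywh_to_corners(bbox: Sequence[float]) -> Corners:
    """Convert a COCO-style [x, y, width, height] box to corner coordinates."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def box_labeling_accuracy(annotations: List[Dict],
                          classify_box: Callable[[int, Corners], int]) -> float:
    """Accuracy when ground-truth boxes are given and only labels are predicted."""
    if not annotations:
        raise ValueError("need at least one annotated box")
    hits = 0
    for ann in annotations:
        corners = xywh_to_corners(ann["bbox"])
        hits += int(classify_box(ann["image_id"], corners) == ann["category_id"])
    return hits / len(annotations)


def toy_classify_box(image_id: int, corners: Corners) -> int:
    """Hypothetical classifier: class 1 for wide boxes, class 2 otherwise."""
    x0, y0, x1, y1 = corners
    return 1 if (x1 - x0) > (y1 - y0) else 2


# Made-up annotations in the COCO layout.
toy_annotations = [
    {"image_id": 1, "bbox": [10, 20, 50, 30], "category_id": 1},
    {"image_id": 1, "bbox": [0, 0, 20, 40], "category_id": 2},
    {"image_id": 2, "bbox": [5, 5, 10, 10], "category_id": 1},
]
accuracy = box_labeling_accuracy(toy_annotations, toy_classify_box)
```

Because the boxes are given, each annotation contributes exactly one correct-or-incorrect decision, which is why plain accuracy suffices here and no localization threshold is needed.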
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2221–2230.
Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811.
Here, we show the easiest and hardest objects for the ResNet-152 model over some categories.