ObjectNet Dataset: Reanalysis and Correction

04/04/2020 · Ali Borji, et al.

Recently, Barbu et al. introduced a dataset called ObjectNet which includes objects in daily-life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Given the importance and implications of their results for the generalization ability of deep models, we take a second look at their findings. We highlight a major problem with their work, namely that they apply object recognizers to scenes containing multiple objects rather than to isolated objects. Applying models to isolated objects instead yields a gain of around 20-30%. Compared to the results reported in the ObjectNet paper, we observe that around 10-15% of the performance loss can be recovered, without any test-time data augmentation. In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset. Thus, we believe that ObjectNet remains a challenging dataset for testing the generalization power of models beyond the datasets on which they have been trained.

Code repository: ObjectNetReanalysis (reanalysis of the ObjectNet paper, plus our annotations and code).

1 Introduction

Object recognition is arguably the most important problem at the heart of computer vision. The application of an already-known convolutional neural network (CNN) architecture, LeNet LeCun et al. (1998), albeit with new tips and tricks Krizhevsky et al. (2012), revolutionized not only computer vision but also several other areas including machine learning, natural language processing, and time series prediction. With the initial excitement gradually fading, researchers have started to study the shortcomings of deep learning models and to question their generalization power. From prior research (e.g. Azulay and Weiss (2019); Recht et al. (2019); Goodfellow et al. (2014)), we already know that CNNs: a) often fail when applied to transformed versions of the same object; in other words, they are not invariant to transformations such as translation (strictly speaking, CNNs are not invariant to translation but equivariant to it), in-plane and in-depth rotation, scale, lighting, and occlusion; b) lack out-of-distribution generalization: even after being exposed to many different instances of the same object category, they are not good at learning that concept, whereas humans are able to generalize from only a few examples; and c) are sensitive to small image perturbations (i.e. adversarial examples Goodfellow et al. (2014)).

Several datasets have been proposed in the past to train and test deep models and to study their generalization ability (e.g. ImageNet Russakovsky et al. (2015), CIFAR Krizhevsky et al. (2009), NORB LeCun et al. (2004), iLab-20M Borji et al. (2016)). In a recent effort, Barbu et al. introduced a dataset called ObjectNet which has less bias than other datasets (although it has its own biases: it consists of indoor objects that are available to many people, are mobile, and are not too large, too small, fragile, or dangerous) and is intended to be used solely as a test set (unlike some benchmarks, e.g. ImageNet, that hide their test images, ObjectNet images are publicly available; see http://objectnet.dev). Objects in this dataset are pictured by Mechanical Turk workers using a mobile app in a variety of backgrounds, rotations, and imaging viewpoints. It contains 50,000 images from 313 categories, 113 of which are in common with ImageNet, and it comes with a license that disallows researchers from fine-tuning models on it. Barbu et al. find that state-of-the-art object detectors (they mean object recognizers) perform drastically worse than on the ImageNet dataset (about a 40-45% drop). Here, we revisit Barbu et al.'s study and ask how much model performance actually drops on this dataset compared with ImageNet. Accordingly, we limit our analysis to the 113 overlapping categories.

Figure 1: Sample images from the ObjectNet dataset from the chairs, teapots, and t-shirts categories, along with their corresponding object bounding boxes. The left panel shows objects in the ImageNet dataset. As can be seen, ImageNet scenes often contain a single isolated object. There are also many images in the ImageNet dataset with multiple objects, but we speculate that ObjectNet scenes contain more objects per image on average Russakovsky et al. (2015). This figure is a modified version of Figure 2 in Barbu et al. (2019).

Barbu et al.'s work is a great contribution to the field, addressing how well object recognition models generalize to real-world situations. It suffers, however, from a major flaw: it makes no distinction between "object recognition" and "object detection". The authors use the term "object detector" to refer to object recognition models. This brings along several concerns:

  • Instead of applying object recognition models to individual object bounding boxes, they apply them to cluttered scenes containing multiple objects. Alternatively, they could have run object detection models on scenes and measured detection performance (i.e. mean average precision; mAP).

  • Object detection and object recognition are two distinct, but related, tasks. Each has its own models, datasets, and evaluation measures. For example, as shown in Fig. 1, images in object recognition datasets (e.g. ImageNet Russakovsky et al. (2015)) often contain a single object, usually from a closeup view, whereas scenes in object detection datasets (e.g. MS COCO Lin et al. (2014)) have multiple objects. Due to this, object characteristics in the two types of datasets might be different. For example, objects are often smaller in detection datasets compared to recognition datasets Singh and Davis (2018).

  • It is true that ImageNet also includes a fair number of images that contain more than one object, but ObjectNet images appear to be more cluttered and to contain a higher number of objects, although we have not quantified this since the dataset does not come with object-level labels. This discussion is also related to the difference between "scene understanding" and "object recognition". To understand a complex scene, we look around, fixate on individual objects to recognize them, and accumulate information over fixations to perform more complex tasks such as answering a question or describing an event.

  • One might argue that the top-5 metric somewhat takes care of scenes with multiple objects (a minimal sketch of the top-k computation appears after this list). While this may be true to some degree, it certainly does not entail that it is better to train or test object recognition models on scenes with multiple objects. One problem is labeling such images: for example, a street scene with a car, a pedestrian, trees, and buildings is given only one label, which adds noise to the data. Another problem is invariance: given that individual objects can undergo many transformations, feeding scenes with multiple objects to models exacerbates the problem and leads to combinatorially more variation, which is harder to learn. A better approach would be to focus on individual objects. As humans, we also move our eyes around and place our fovea on one object at a time to recognize it, using information from our peripheral vision. An analogue of this task does not yet exist in computer vision. The task is reminiscent of object detection, but there are subtle differences, which will be elaborated on in the discussion section.
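For concreteness, the following is a minimal sketch of how top-k accuracy is computed from a classifier's scores for a single image; the function and variable names are ours and purely illustrative.

```python
import torch

def topk_correct(logits: torch.Tensor, target: int, ks=(1, 5)):
    """Return, for each k, whether the ground-truth class is among the
    k highest-scoring predictions for a single image."""
    ranked = logits.argsort(descending=True)  # class indices, best first
    return {k: bool((ranked[:k] == target).any()) for k in ks}

# Toy example: a 1000-way classifier whose true class (index 42) scores highest.
scores = torch.randn(1000)
scores[42] += 10.0
print(topk_correct(scores, target=42))  # -> {1: True, 5: True} (almost surely)
```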

Here, we first annotate the objects in the ObjectNet scenes and then apply a number of deep object recognition models to the object bounding boxes only.

2 Experiments and Results

We employ six deep models: AlexNet Krizhevsky et al. (2012), VGG-19 Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015), ResNet-152 He et al. (2016), Inception-v3 Szegedy et al. (2016), and MNASNet Tan et al. (2019). AlexNet, VGG-19, and GoogLeNet were also employed in the ObjectNet paper Barbu et al. (2019). We use the PyTorch implementations of these models (https://pytorch.org/docs/stable/torchvision/models.html). Note that the code from the ObjectNet paper is still not available. Because of possible inconsistencies between our code and the ObjectNet code, as well as different data processing, we also run the models on the entire scenes in addition to the bounding boxes. This allows us to study whether and how much performance varies across these two conditions.
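For reference, the models can be loaded from torchvision along the following lines. This is a minimal sketch of our setup with the default pretrained weights; the exact preprocessing (and hence the numbers) may differ slightly from the ObjectNet paper's pipeline.

```python
import torchvision.models as models
import torchvision.transforms as T

# The six ImageNet-pretrained classifiers evaluated in this report
# (torchvision names; default torchvision weights).
classifiers = {
    "AlexNet":      models.alexnet(pretrained=True),
    "VGG-19":       models.vgg19(pretrained=True),
    "GoogLeNet":    models.googlenet(pretrained=True),
    "ResNet-152":   models.resnet152(pretrained=True),
    "Inception-v3": models.inception_v3(pretrained=True),  # expects 299x299 inputs
    "MNASNet":      models.mnasnet1_0(pretrained=True),
}
for net in classifiers.values():
    net.eval()  # inference mode: no dropout or batch-norm updates

# Standard ImageNet preprocessing (224x224 center crop; use 299 for Inception-v3).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```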

2.1 Bounding box annotation

The 113 ObjectNet categories that overlap with ImageNet contain 18,574 images in total. On this subset, the average number of images per category is 164.4 (min = 55, max = 284). Fig. 2 shows the distribution of the number of images per category. Using two annotation tools (https://github.com/aliborji/objectAnnotation and https://github.com/tzutalin/labelImg), we drew a bounding box around the object corresponding to the category label of each image. If there were multiple objects from that category, we tried to include all of them in the bounding box (e.g. chairs around a table). Some example scenes and their corresponding bounding boxes are shown in Fig. 1.
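The sketch below shows how such per-image box annotations can be read and used to crop out the labeled object. The CSV layout (image path, label, and corner coordinates) is a hypothetical stand-in, not necessarily the format of the files in our repository.

```python
import csv
from PIL import Image

def load_annotations(csv_path):
    """Read one bounding box per image from a CSV with columns:
    image_path, label, x_min, y_min, x_max, y_max (illustrative format)."""
    boxes = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            boxes.append({
                "path": row["image_path"],
                "label": row["label"],
                "box": tuple(int(row[k]) for k in ("x_min", "y_min", "x_max", "y_max")),
            })
    return boxes

def crop_object(record):
    """Return the PIL crop containing the annotated object(s)."""
    img = Image.open(record["path"]).convert("RGB")
    return img.crop(record["box"])  # (left, upper, right, lower)
```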

Figure 2: Frequency of images per category over the 113 ObjectNet categories that overlap with ImageNet.

2.2 Results

Figure 3: Performance of state-of-the-art object recognition models on the ObjectNet dataset. Our results from applying AlexNet Krizhevsky et al. (2012), VGG-19 Simonyan and Zisserman (2014), and ResNet-152 He et al. (2016) to bounding boxes are shown in blue. As can be seen, applying models to boxes instead of the entire scene improves performance by about 10%, which is still much lower than the results on the ImageNet dataset (overlapping categories). Note: this is not the best possible score on this dataset since we do not perform test-time data augmentation. Using the original code from the ObjectNet paper with bounding boxes will likely improve these results further. This figure is a modified version of Figure 1 in Barbu et al. (2019).

Fig. 3 shows an overlay of our results on the same figure from the ObjectNet paper. As can be seen, applying models to boxes instead of the entire scene improves performance by about 10%, which is still much lower than the results on the ImageNet dataset. The gap, however, is now narrower.

Since the code of Barbu et al. (2019) is not available, we could not run their exact pipeline on the bounding boxes. It is possible that they performed different data normalization or test-time data augmentation (such as rotation, scaling, color jittering, cropping, etc.) to achieve better results. To account for this, we also applied the models to the whole images to study how much performance varies. Results are shown in Fig. 4. We find that:

  1. Focusing on a small image region containing only a single object increases performance significantly, by around 20-30% across all tested models (a sketch of the box vs. full-image evaluation loop is given after this list).

  2. Our results on the whole scene are lower than Barbu et al.'s results (which are also on the whole scene). This implies that applying their code to bounding boxes would likely improve performance even beyond our box results. Even assuming a 25% gain on top of their best results when using boxes, the performance gap would still not be closed (see Fig. 3). We will discuss this further in the next section.
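The following is a sketch of the kind of evaluation loop behind these numbers. It reuses the classifiers, preprocess, load_annotations, and crop_object helpers sketched earlier, and assumes a class_map dictionary mapping each ObjectNet label to its corresponding ImageNet class indices is available (its construction is not shown).

```python
import torch
from PIL import Image

@torch.no_grad()
def evaluate(model, records, class_map, use_box=True):
    """Top-1/top-5 accuracy on the annotated images, on boxes or full images."""
    hits = {1: 0, 5: 0}
    for rec in records:
        img = crop_object(rec) if use_box else Image.open(rec["path"]).convert("RGB")
        logits = model(preprocess(img).unsqueeze(0)).squeeze(0)
        ranked = logits.argsort(descending=True)       # ImageNet classes, best first
        valid = class_map[rec["label"]]                # acceptable ImageNet indices
        for k in (1, 5):
            hits[k] += int(any(int(c) in valid for c in ranked[:k]))
    return {k: v / len(records) for k, v in hits.items()}

records  = load_annotations("objectnet_boxes.csv")     # hypothetical annotation file
box_acc  = evaluate(classifiers["ResNet-152"], records, class_map, use_box=True)
full_acc = evaluate(classifiers["ResNet-152"], records, class_map, use_box=False)
```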

Figure 4: Results of running six deep models on bounding boxes and whole images over the ObjectNet dataset (corresponding to the overlap case in Fig. 3). Models perform significantly better on boxes than on full images.

A breakdown of performance over the 113 categories for each of the six tested models is shown in Fig. 5 (isolated objects) and Fig. 6 (full images). Interestingly, in both cases almost all models (except GoogLeNet on isolated objects and AlexNet on full images) perform best on the safety pin category. Inspecting the images from this class, we found that they contain a single safety pin, often held by a person. The same is true of the banana class, which is the second easiest category using bounding boxes. This object becomes much harder to recognize from the full image (26.88% vs. 70.3% using boxes), which highlights the benefit of applying models to isolated objects rather than scenes.

Figure 5: Performance of models on object bounding boxes
Figure 6: Performance of models over the entire image (full image)
Figure 7: A selection of challenging objects that are hard for humans to recognize. Can you guess the category of the annotated objects in these images? Keys are as follows:
row 1: (skirt, skirt, desk lamp, safety pin, still camera, spatula, tray),
row 2: (vase, pillow, sleeping bag, printer, remote control, pet food container, detergent),
row 3: (vacuum cleaner, vase, vase, shovel, stuffed animal, sandal, sandal),
row 4: (sock, shovel, shovel, skirt, skirt, match, spatula),
row 5: (padlock, padlock, microwave, orange, printer, trash bin, tray)
Figure 8: A selection of challenging objects that are hard for humans to recognize (continued from above). Can you guess the category of the annotated objects in these images? Keys are as follows:
row 1: (remote control, ruler, full sized towel, ruler, remote control, remote control, remote control),
row 2: (remote control, calendar, butter, bookend, ruler, tray, desk lamp),
row 3: (envelope, envelope, drying rack for dishes, full sized towel, drying rack for dishes, drinking cup, desk lamp),
row 4: (desk lamp, desk lamp, dress, tennis racket, fan, fan, hammer),
row 5: (printer, toaster, printer, helmet, printer, printer, printer)

3 Discussion and Conclusion

Our investigation reveals that deep object recognition models perform significantly better when applied to isolated objects rather than scenes (around a 20-30% increase in performance). The reason is twofold. First, there is less variability in single objects than in scenes containing those objects. Second, the deep models used here have been trained on ImageNet images, which are less cluttered than ObjectNet images. We anticipate that training models from scratch on large-scale datasets that contain isolated objects would likely result in even higher accuracy.

Even assuming (at best) around a 30% increase over Barbu et al.'s results when using bounding boxes, a gap of at least 15% remains between ImageNet and ObjectNet performance, which means that ObjectNet is significantly harder. It covers a wider range of variation than ImageNet, including object instances, viewpoints, rotations, and occlusions, which pushes the limits of object recognition in both humans and machines. Hence, despite its limitations and biases, ObjectNet is indeed a valuable resource for testing models in realistic situations.
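As a rough back-of-the-envelope check, using the upper end of the 40-45% drop quoted in the introduction and the best-case recovery of roughly 30% from bounding boxes:

```latex
\underbrace{45\%}_{\text{reported drop}}
\;-\;
\underbrace{30\%}_{\text{best-case gain from boxes}}
\;\approx\;
\underbrace{15\%}_{\text{remaining gap}}
```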

Throughout the annotation process of ObjectNet images, we came across the following observations:

  • Some objects look very different when they are in motion (e.g. the fan in Fig. 8; row 4)

  • Some objects appear different under the shadow of other objects (e.g. the hammer in Fig. 8; row 4)

  • Some object instances look very different from the typical instances in the same class (e.g. the helmet in Fig. 8; row 5, the orange in Fig. 7; row 5)

  • Some objects can be recognized only by reading their labels (e.g. the pet food container in Fig. 7; row 2)

  • Some images have wrong labels (e.g. the pillow in Fig. 7; row 2, the skirt in Fig. 7; row 1, the tray in Fig. 8; row 2.)

  • Some objects are extremely difficult for humans to recognize (e.g. the tennis racket in Fig. 8; row 4, the shovel in Fig. 7; row 4, and the tray in Fig. 7; row 1)

  • In many images, objects are occluded by hands holding them (e.g. the sock and the shovel in Fig. 7; row 4)

  • Some objects are hard to recognize in dim light (e.g. the printer in Fig. 7; row 2)

  • Some categories are often confused with other categories, for example: (bath towel, bed sheet, full sized towel, dishrag or hand towel), (sandal, dress shoe (men), running shoe), (t-shirt, dress, sweater, suit jacket, skirt), (ruler, spatula, pen, match), (padlock, combination lock), and (envelope, letter).

We foresee at least four directions for future work in this area:

  1. First, it would be interesting to see how well state-of-the-art object detectors (e.g. Faster R-CNN Ren et al. (2015)) perform on this dataset (e.g. over the classes that overlap with the MS COCO dataset). We expect a big drop in detection performance, since recognition is still the main bottleneck in object detection Borji and Iranmanesh (2019).

  2. Second, measuring human performance on the ObjectNet dataset would provide a baseline for gauging model performance. Barbu et al. report a human performance of around 95% on this dataset (via a pilot study) when subjects are asked to name the objects present in the scene. That task, however, is different from recognizing isolated objects out of context, the regime considered here (similar to rapid scene categorization tasks). In addition, the error patterns of models and humans (rather than just raw accuracy) would inform us about the mechanisms of object recognition in both humans and machines. It could be that models work in a completely different fashion from the human visual system. For instance, unlike CNNs, we are able to discern the foreground from the image background during recognition, which hints at an interplay and feedback loop between recognition and segmentation that is currently missing in CNNs. Finally, humans are only partially invariant to image transformations, so it may not make sense to demand models that are fully invariant (e.g. invariant to 360-degree in-plane rotation).

  3. Third, and related to the second, is the role of context in object recognition. Context is a double-edged sword. On the one hand, using it may lead to relying on spurious correlations that do not always hold at test time. For example, relying on the fact that a keyboard usually appears next to a monitor may lead to occasional failures when a model is tested on an isolated keyboard. A sharper example is adversarial patches Brown et al. (2017), where training a model on objects augmented with small random patches forces the model to rely on those features, so it is fooled when tested on other objects carrying the same patch. On the other hand, completely discarding context is also not wise, for two reasons. First, the great success of CNNs in object recognition and scene segmentation is attributed to their ability to exploit visual context. Second, as humans we also rely heavily on surrounding visual context. Future research should further investigate the role of context in object recognition and how best to exploit it.

  4. Fourth, to operationalize the third item above, we propose a new task: recognizing objects in cluttered scenes containing multiple objects (see Borji and Iranmanesh (2019) for an example study). This task resembles object detection but is in fact different. First, the ground-truth object bounding boxes are given and the task is to correctly label them (i.e. there is no need to localize objects). Second, the evaluation measure is accuracy, which is easier to interpret than mean average precision. The task is also different from the current object recognition setup, in which context is usually discarded. Existing object detection datasets, such as MS COCO Lin et al. (2014), can be used to evaluate models built for this task (i.e. instead of detecting objects, the aim is to recognize objects confined within given bounding boxes); a sketch of this evaluation protocol is given below the list.
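As an illustration of the proposed protocol, the sketch below crops ground-truth boxes from COCO annotations and scores a classifier on them with plain accuracy. The model, preprocess, and label_of arguments are placeholders: label_of stands in for whatever mapping one chooses to decide whether the classifier's prediction agrees with a COCO category.

```python
import os
import torch
from PIL import Image
from pycocotools.coco import COCO

@torch.no_grad()
def classify_gt_boxes(model, preprocess, ann_file, img_dir, label_of):
    """Label ground-truth boxes from a COCO-style annotation file and
    report plain accuracy (no localization involved)."""
    coco = COCO(ann_file)
    correct, total = 0, 0
    for img_id in coco.getImgIds():
        info = coco.loadImgs(img_id)[0]
        image = Image.open(os.path.join(img_dir, info["file_name"])).convert("RGB")
        for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False)):
            x, y, w, h = ann["bbox"]                    # COCO boxes are [x, y, w, h]
            crop = image.crop((int(x), int(y), int(x + w), int(y + h)))
            logits = model(preprocess(crop).unsqueeze(0)).squeeze(0)
            correct += int(label_of(ann["category_id"], logits))
            total += 1
    return correct / max(total, 1)
```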

References

  • A. Azulay and Y. Weiss (2019) Why do deep convolutional networks generalize so poorly to small image transformations?. Journal of Machine Learning Research 20 (184), pp. 1–25. Cited by: §1.
  • A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz (2019) ObjectNet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, pp. 9448–9458. Cited by: Figure 1, Figure 3, §2.2, §2.
  • A. Borji and S. M. Iranmanesh (2019) Empirical upper-bound in object detection and more. arXiv preprint arXiv:1911.12451. Cited by: item 1, item 4.
  • A. Borji, S. Izadi, and L. Itti (2016) iLab-20M: a large-scale controlled object dataset to investigate deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2221–2230. Cited by: §1.
  • T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. arXiv preprint arXiv:1712.09665. Cited by: item 3.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Figure 3, §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, Figure 3, §2.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • Y. LeCun, F. J. Huang, and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., Vol. 2, pp. II–104. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: 2nd item, item 4.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2019) Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: item 1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: Figure 1, 2nd item, §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Figure 3, §2.
  • B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection snip. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3578–3587. Cited by: 2nd item.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §2.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.

4 Appendix

Here, we show the easiest and hardest objects for the ResNet-152 model over some categories.

(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 9: Correctly classified and misclassified examples from the Alarm clock class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 10: Correctly classified and misclassified examples from the Banana class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 11: Correctly classified and misclassified examples from the Band-Aid class by the ResNet model.
(a) Correctly classified; highest confidences. Only three benches were correctly classified.
(b) Misclassified; highest confidences
(c) Misclassified; lowest confidences
Figure 12: Correctly classified and misclassified examples from the Bench class by the ResNet model.
(a) Correctly classified; highest confidences. Only two plates were correctly classified.
(b) Misclassified; highest confidences
(c) Misclassified; lowest confidences
Figure 13: Correctly classified and misclassified examples from the Plate class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 14: Correctly classified and misclassified examples from the Broom class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 15: Correctly classified and misclassified examples from the Candle class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 16: Correctly classified and misclassified examples from the Fan class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 17: Correctly classified and misclassified examples from the Ruler class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 18: Correctly classified and misclassified examples from the Safety-pin class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 19: Correctly classified and misclassified examples from the Teapot class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 20: Correctly classified and misclassified examples from the TV class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 21: Correctly classified and misclassified examples from the Sock class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 22: Correctly classified and misclassified examples from the Sunglasses class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 23: Correctly classified and misclassified examples from the Tie class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 24: Correctly classified and misclassified examples from the Mug class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 25: Correctly classified and misclassified examples from the Hammer class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 26: Correctly classified and misclassified examples from the Bicycle class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 27: Correctly classified and misclassified examples from the Cellphone class by the ResNet model.
(a) Correctly classified; highest confidences
(b) Correctly classified; lowest confidences
(c) Misclassified; highest confidences
(d) Misclassified; lowest confidences
Figure 28: Correctly classified and misclassified examples from the Chair class by the ResNet model.