In recent years, the performance of visual recognition and detection algorithms has considerably advanced with the progression of data-driven approaches and computational capabilities [1, 2]. These advancements enabled state-of-the-art methods to achieve human-level performance in specific recognition tasks [3, 4]. Despite these significant achievements, it remains a challenge to utilize such technologies in real-world environments that diverge from training conditions. To identify the factors that can affect recognition performance, we need to perform controlled experiments as in [5, 6, 7, 8, 9, 10]. Even though these studies shed light on the vulnerability of existing recognition approaches, the investigated conditions are either limited or unrealistic. Recently, we introduced the CURE-OR dataset [11] and analyzed the recognition performance with respect to simulated challenging conditions. Hendrycks and Dietterich [12] also studied the effect of similar conditions by postprocessing the images in the ImageNet dataset. The aforementioned studies overlooked the acquisition conditions and investigated only the effect of simulated conditions. In contrast to the literature [5, 6, 7, 8, 9, 10, 12] and our previous work [11], the main focus of this study is to analyze the effect of real-world acquisition conditions including device type, orientation, and background. In Fig. 1, we show sample images obtained under different acquisition conditions.
If we consider ideal acquisition conditions as reference conditions that lead to the highest recognition rate, any variation would decrease the recognition performance and affect visual representations. Based on this assumption, we hypothesize that recognition performance variations can be estimated by variations in visual representations. Overall, the contributions of this manuscript are fivefold. First, we investigate the effect of background on object recognition by performing controlled experiments with different backgrounds. Second, we analyze the effect of acquisition devices by comparing the recognition accuracy of images captured with different devices. Third, we analyze the recognition performance with respect to different orientation configurations. Fourth, we introduce a framework to estimate the recognition performance variation under varying backgrounds and orientations. Fifth, we benchmark the performance of handcrafted and data-driven features obtained from deep neural networks in the proposed framework. The outline of this paper is as follows. In Section 2, we analyze the objective recognition performance with respect to acquisition conditions. In Section 3, we describe the recognition performance estimation framework and benchmark hand-crafted and data-driven methods. Finally, we conclude our work in Section 4.
2 Recognition Under Multifarious Conditions
Based on scalability, user-friendliness, computation time, service fees, and access to labels and confidence scores, we assessed off-the-shelf platforms and decided to utilize the Microsoft Azure Computer Vision (MSFT) and Amazon Rekognition (AMZN) platforms. As a test set, we use the recently introduced CURE-OR dataset [11] that includes one million images of 100 objects captured with different devices under various object orientations, backgrounds, and simulated challenging conditions. Objects are classified into 6 categories: toys, personal belongings, office supplies, household items, sport/entertainment items, and health/personal care items as described in [11]. We identified 4 objects per category for each platform for testing, but because Azure only identified 3 objects correctly in one category, we excluded the object with the lowest number of correctly identified images from Amazon for a fair comparison. Therefore, we used 23 objects per platform while assessing the robustness of the recognition platforms. Original (challenge-free) images in each category were processed to simulate realistic challenging conditions including underexposure, overexposure, blur, contrast, dirty lens, salt and pepper noise, and resizing. We calculated the top-5 accuracy for each challenge category to quantify recognition performance. Specifically, we calculated, for each object, the ratio of correct classifications in which the ground truth label was among the highest five predictions.
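The top-5 metric described above reduces to a simple membership count over ranked predictions. A minimal sketch (the function name and prediction format are our own illustration, not the platforms' APIs):

```python
def top5_accuracy(ranked_predictions, ground_truth):
    """Fraction of images whose ground-truth label appears among the
    five highest-confidence predictions.

    ranked_predictions: one list of labels per image, best guess first.
    ground_truth: the true label for each image.
    """
    hits = sum(gt in preds[:5]
               for preds, gt in zip(ranked_predictions, ground_truth))
    return hits / len(ground_truth)
```

In the experiments this ratio is computed separately for each object and challenge category, then averaged per condition.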
We report the recognition performance of online platforms with respect to varying acquisition conditions in Fig. 2. Each line represents a challenge type, except the purple line that shows the average of all challenge types. In terms of object backgrounds, the white background leads to the highest recognition accuracy in both platforms as shown in Fig. 2(a-b), followed by the 2D textured backgrounds of kitchen and living room, and then by the 3D backgrounds of office and living room. Objects are recognized more accurately in front of the white backdrop because there is no texture or color variation in the background that can resemble other objects. The most challenging scenarios correspond to the real-world office and living room because of their complex background structure. Recognition accuracy in front of 2D backdrops is higher than in the real-world setups because foreground objects are more distinct when the background is out of focus.
In terms of orientations, the front view (0 deg) leads to the highest recognition accuracy as shown in Fig. 2(c-d), which is expected because the objects in CURE-OR face forward with their most characteristic features. In contrast, these characteristic features are highly self-occluded in the overhead view, which leads to the lowest recognition performance. In case of the left, right, and back views, characteristic features are not as clear as in the front view, but self-occlusion is not as significant as in the overhead view. Therefore, these orientations lead to a medium recognition performance compared to the front and overhead views. Recognition performances with respect to acquisition devices are reported in Fig. 2(e-f), which shows that performance variation based on device type is less significant than that based on backgrounds and orientations. However, there is still a performance difference between images obtained from different devices. Overall, the Nikon D80 and Logitech C920 lead to the highest recognition performance in both platforms, which highlights the importance of image quality for recognition applications.
Table 1: Top-performing feature type, feature, and distance metric per condition for Amazon Rekognition (AMZN) and Microsoft Azure (MSFT).
3 Recognition Performance Estimation under Multifarious Conditions
Based on the experiments reported in Section 2, the reference configuration that leads to the highest recognition performance is front view, white background, and Nikon DSLR. We conducted two experiments to estimate the recognition performance with respect to changes in background and orientation. We utilized the 10 objects common to both platforms for direct comparison. In the background experiment, we grouped images captured with a particular device (out of 5) in front of a specific background (out of 5), which leads to 25 image groups containing front and side views of the objects. In the orientation experiment, we grouped images captured with a particular device (out of 5) from one of 3 orientations (front, top, and side views), which leads to 15 image groups containing images of the objects in front of white, living room, and kitchen backdrops. For each image group, we obtained an average recognition performance per recognition platform and an average feature distance between the images in the group and their reference image. Finally, we analyzed the relationship between recognition accuracy and feature distance with correlations and scatter plots. We extracted commonly used handcrafted features as well as data-driven features from object images, which are briefly described as follows:
Color: Histograms of color channels in RGB.
Daisy: Local image descriptor based on convolutions of gradients in specific directions with Gaussian filters [13].
Edge: Histogram of vertical, horizontal, diagonal, and non-directional edges.
Gabor: Frequency and orientation information of images extracted through Gabor filters.
HOG: Histogram of oriented gradients over local regions.
VGG: Features obtained from convolutional neural networks that are based on convolutional layers stacked on top of each other. The number next to VGG stands for the number of weighted layers, the last three of which correspond to fully connected layers.
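To make the descriptor list concrete, the two simplest entries above (Color and Edge) can be sketched with NumPy; the bin counts and normalization below are illustrative choices, not the exact settings used in our experiments:

```python
import numpy as np

def color_histogram(img, bins=16):
    """Concatenated per-channel histograms of an RGB image (H, W, 3)."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()  # normalize so differently sized images are comparable

def edge_histogram(gray, bins=4):
    """Gradient-magnitude-weighted histogram of edge orientations,
    quantized into horizontal, vertical, and diagonal directions."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # orientation in [0, 180)
    h, _ = np.histogram(ang, bins=np.linspace(0, 180, bins + 1), weights=mag)
    return h / (h.sum() + 1e-12)
```

The Daisy, Gabor, and HOG descriptors follow the same pattern of producing a fixed-length vector per image, so all features can be compared with the same distance metrics.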
We calculated the distance between features through commonly used metrics including the $\ell_1$ norm, the $\ell_2$ norm, sum of absolute differences (SAD), sum of squared absolute differences (SSAD), a weighted version of the $\ell_1$ norm (Canberra), the $\ell_\infty$ norm (Chebyshev), the Minkowski distance, the Bray-Curtis dissimilarity, and the cosine distance. We report the recognition accuracy estimation performance in Table 1 in terms of the Spearman correlation between top-5 recognition accuracy scores and feature distances. We highlight the top data-driven and hand-crafted methods with light blue for each recognition platform and experiment.
In the background experiment, color characteristics of different backgrounds are distinct from each other as observed in Fig. 1. In terms of low-level characteristic features including Daisy, Edge, and HOG, edges in the backgrounds can distinguish highly textured backgrounds from less textured backgrounds. However, edges would not be sufficient to distinguish weakly textured backgrounds from each other. Moreover, edges of the foreground objects can dominate the feature representations and mask the effect of changes in the backgrounds. To capture differences in backgrounds overlooked by edge characteristics, frequency and orientation characteristics can be considered with Gabor features. Data-driven methods including VGG utilize all three channels of images while extracting features, which can give them an inherent advantage over methods based solely on color or structure. Overall, the data-driven VGG features lead to the highest performance in the background experiment for both recognition platforms. In terms of hand-crafted features, color leads to the highest performance followed by Gabor representations, whereas edge-based methods result in inferior performance as expected.
Distinguishing changes in orientation is more challenging compared to backgrounds because the region of interest is limited to a smaller area. Therefore, overall recognition accuracy estimation performances are lower for orientations compared to backgrounds as reported in Table 1. Similar to the background experiment, VGG architectures lead to the highest estimation performance in the orientation experiment. However, hand-crafted methods are dominated by edge features instead of Gabor representations. We show the scatter plots of the top-performing data-driven and hand-crafted methods in Fig. 3, in which the x-axis corresponds to the average distance between image features and the y-axis corresponds to top-5 accuracy. Image groups corresponding to different configurations are more distinctly clustered in terms of background as observed in Fig. 3(a-b, e-f). In terms of orientation, VGG leads to a clear distinction of configurations for Amazon Rekognition as observed in Fig. 3(c), whereas image groups overlap in the other experiments as in Fig. 3(d, g-h). Clustering configurations is more challenging in the orientation experiment because orientation configurations cannot easily be separated even based on their recognition accuracy.
In this paper, we analyzed the robustness of recognition platforms and reported that object background can affect recognition performance as much as orientation, whereas the tested device types have minor influence on recognition. We also introduced a framework to estimate recognition performance variation and showed that color-based features capture background variations, edge-based features capture orientation variations, and data-driven features capture both background and orientation variations in a controlled setting. Overall, recognition performance can significantly change depending on the acquisition conditions, which highlights the need for more robust platforms that we can rely on in our daily lives. Estimating recognition performance with feature similarity-based metrics can be helpful to test the robustness of algorithms before deployment. However, the applicability of such estimation frameworks can drastically increase if we design no-reference approaches that can provide a recognition performance estimate without a reference image, similar to the no-reference algorithms in the image quality assessment field.
-  J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009, pp. 248–255.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., Cham, 2014, pp. 740–755, Springer International Publishing.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 2015, pp. 1026–1034, IEEE Computer Society.
-  R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, “Deep Image: Scaling up Image Recognition,” in arXiv:1501.02876, 2015.
-  S. Dodge and L. Karam, “Understanding how image quality affects deep neural networks,” in International Conference on Quality of Multimedia Experience (QoMEX), June 2016, pp. 1–6.
-  Y. Zhou, S. Song, and N. Cheung, “On classification of distorted images with deep convolutional neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 1213–1217.
-  H. Hosseini, B. Xiao, and R. Poovendran, “Google’s Cloud Vision API is not robust to noise,” in 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec 2017, pp. 101–105.
-  J. Lu, H. Sibai, E. Fabry, and D. Forsyth, “No need to worry about adversarial examples in object detection in autonomous vehicles,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2017.
-  N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, S. Li, L. Chen, M. E. Kounavis, and D. H. Chau, “SHIELD: Fast, practical defense and vaccination for deep learning using jpeg compression,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), New York, NY, USA, 2018, KDD ’18, pp. 196–204, ACM.
-  D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib, “CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition,” in Neural Information Processing Systems (NeurIPS), Machine Learning for Intelligent Transportation Systems Workshop, 2017.
-  D. Temel, J. Lee, and G. AlRegib, “CURE-OR: Challenging Unreal and Real Environments for Object Recognition,” in IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.
-  D. Hendrycks and T. G. Dietterich, “Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations,” in International Conference on Learning Representations (ICLR), 2019.
-  E. Tola, V. Lepetit, and P. Fua, “Daisy: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 5, pp. 815–830, May 2010.