There are many classic volumes that define the field of computer vision (e.g., Rosenfeld 1976, Marr 1982, Ballard & Brown 1982, Horn 1986, Koenderink 1990, Faugeras 1993, Szeliski 2010, and more). In them, the foundations of image and video processing, analysis, and perception are developed both theoretically and practically. In part, these represent what we call here theory-based computer vision models.
A geometrical and physical understanding of the image formation process, from illuminant to camera optics to image creation, as well as the material properties of the surfaces that interact with incident light and the motions of objects in a scene, was mathematically modeled in the hope that, when those equations were simulated by a computer, they would reveal the kinds of structures that human vision also perceives. An excellent example of this development process appears in , which details the importance of physical and mathematical constraints for the theory development process with many examples from early work on edges, textures and shape analysis. It is difficult to deny the theoretical validity of those approaches. From the earliest days of computer vision, the performance of these theory-based solutions always appeared promising, with a large literature supporting this (see the several articles in  on all aspects of computer vision for reviews of early work) as well as many commercial successes.
However, during most of the history of computer vision, the discipline suffered from two main problems (see  for a historical review). First, computational power and memory were too meagre to deal with the requirements of vision (theoretically shown in [42, 41]). Second, large sets of test data that could be shared and could permit replication of results were scarce. An appropriate empirical methodology and tradition to guide such testing and replication was also missing.
The first problem gradually improved as Moore’s Law played out. Especially important was the advent of GPUs in the late 1990s, with their general availability a few years later. Major progress was made on the second problem with the introduction of what might be the first large collection of images for computer vision testing, namely MNIST. During the preceding decade, advances in machine learning began to have an important impact on computer vision (e.g., the Turk & Pentland Eigenface system, based on learning low-dimensional representations for faces, following Sirovich & Kirby).
Whereas the scarcity of data precluded extensive use of learning methods in the early days, the emergence of large image sets encouraged exploration of how systems might learn regularities, say in object appearance, over large sets of exemplars. Earlier papers pointed to the utility of images of handwritten digits in testing recognition and learning methods (e.g., work by Hinton et al.), so the creation of the MNIST set was both timely and impactful. As a result, the community witnessed the emergence of data-driven computer vision models created by extracting statistical regularities from a large number of image samples. The MNIST set was soon joined by others. The PASCAL Visual Object Classes (VOC) Challenge began in 2005, ImageNet in 2010, and more. The contribution of these data sets and challenges towards the acceleration of developments in computer vision is undeniable.
It is widely accepted that the AlexNet paper was a turning point in the discipline. It outperformed theory-driven approaches by a wide margin and thus began the surrender of those methods to data-driven ones. But how did that success come about? Was it simply the fact that AlexNet was a great feat of engineering? Likely this is true; but here, we suggest that there may be a bit more to it. Empirical methodology matters. How it evolved within the community also played a role, perhaps a key role, so that the theory-driven computer vision paradigms never had a chance. In the rush to celebrate and exploit this success, no one noticed.
The next sections will play out our path to supporting this assertion. It is interesting to note that what led us to this was work towards understanding how active setting of camera parameters affects certain computer vision algorithms. We summarize the argument here.
2 Effect of Sensor Settings for Interest Point and Saliency Algorithms
Previous work explored the use of interest point/saliency algorithms in an active sensing context and investigated how they perform with varying camera parameters in order to develop a method for dynamically setting those parameters depending on task and context. The experiments revealed a very strong dependence on settings for a range of algorithms. The patterns seemed orderly, as if determined by some physical law; the data exhibited a strong and clear structure. A brief summary is provided here, with more detail in Andreopoulos & Tsotsos.
The authors evaluated the effects of camera shutter speed and voltage gain, under simultaneous changes in illumination, and demonstrated significant differences in the sensitivities of popular vision algorithms under those variations. They began by creating a dataset that reflected different cameras, camera settings, and illumination levels. They used one CCD sensor (Bumblebee stereo camera, PointGrey Research) and one CMOS sensor (FireflyMV camera, PointGrey Research) to obtain the datasets. The permissible shutter exposure time and gain were uniformly quantized with 8 samples in each dimension. There were an equal number of samples for each combination of sensor settings.
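The sampling scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' acquisition code: only the 8-by-8 uniform quantization comes from the text, while the numeric ranges for shutter time and gain (`SHUTTER_RANGE_MS`, `GAIN_RANGE_DB`) are placeholder assumptions, since the permissible ranges of the two cameras are not stated here.

```python
from itertools import product


def uniform_samples(low, high, n):
    """Return n evenly spaced values covering [low, high] inclusive."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


# Hypothetical permissible ranges (assumptions for illustration only).
SHUTTER_RANGE_MS = (1.0, 64.0)
GAIN_RANGE_DB = (0.0, 24.0)

shutters = uniform_samples(*SHUTTER_RANGE_MS, 8)
gains = uniform_samples(*GAIN_RANGE_DB, 8)

# Every (shutter, gain) pair is one sensor configuration: 8 x 8 = 64.
configurations = list(product(shutters, gains))
print(len(configurations))
```

Capturing the same number of images for every pair in `configurations` gives the balanced design the text describes, where no sensor setting is over-represented.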
Then, the resulting images were processed by several algorithms, popular before the time of writing: Harris-Affine and Hessian-Affine region detectors, Scale Saliency interest-point detector, Maximally Stable Extremal Regions algorithm (MSER), and Speeded-Up Robust Features extraction algorithm (SURF).
The results, summarized in Figure 1, show that common datasets used to evaluate vision algorithms typically suffer from a significant sensor-specific bias, which can make many of the experimental methodologies used to evaluate vision algorithms unable to provide results that generalize to less controlled environments. Simply put, if one wishes to use one of these specific algorithms for a particular application, then it is necessary to ensure that the images processed are acquired using the sensor setting ranges that yield good performance (in yellow in Figure 1). Such considerations are rarely observed.
3 Effect of Sensor Settings on Object Detection Algorithms
The same test for more recent recognition algorithms, both classic and deep learning methods, was performed by Wu & Tsotsos . The authors created a dataset containing 2240 images in total, by viewing 5 different objects (bicycle, bottle, chair, potted plant and tv monitor), at 7 levels of illumination and with 64 camera configurations (8 shutter speeds, 8 voltage gains). As in the previous experiment, there were an equal number of samples for each combination of sensor settings. To accurately measure the illumination of the scene, a Yoctopuce light sensor was used. Also, intensity-controllable light bulbs were used to achieve different light conditions, 50lx, 200lx, 400lx, 800lx, 1600lx and 3200lx. The digital camera was a Point Grey Flea3. The allowed ranges of shutter speed and voltage gain were uniformly sampled into 8 distinct values in each dimension.
The algorithms tested included the Regions with Convolutional Neural Networks (R-CNN) and the Spatial Pyramid Pooling in Deep Convolutional Networks (SPP-net).
The results are shown in Figure 2. This time, mean average precision (mAP) values were not thresholded and are plotted intact. The original work also presented these plots for each of the tested illumination levels, and these demonstrated that performance depends significantly on illumination level as well as sensor settings and does not easily generalize across these variables. As before, if one wishes to use one of these specific algorithms for a particular application, then it is necessary to ensure that the images processed are acquired using the sensor setting ranges that yield good performance (shown in yellow in Figure 2).
In general, it can be seen that there is less orderly structure when compared to the previous set of tests (thus making any characterization of ‘good performing’ sensor settings more difficult) and the authors wondered about the reason underlying this difference.
Could the difference be due to an uneven distribution of training samples along those dimensions? And if so, could overall performance be influenced by such bias?
4 Distributions of Sensor Parameters in Common Computer Vision Datasets
As mentioned, the two above studies caused us to be curious about the reasons behind the uneven and unexpected performance patterns across algorithms. After thorough verifications of the methods employed, we concluded that some imbalance in data distribution across sensing parameters might be the cause.
Surprisingly, among works on various biases in vision datasets, few acknowledge the existence of sensor bias (or capture bias) and none provide quantitative statistics. Hence, we set out to test this. We selected two common datasets, Common Objects in Context (COCO) and VOC 2007, the dataset used in the PASCAL Visual Object Classes Challenge in 2007.
Since both datasets consist of images gathered from Flickr, we used the Flickr API to recover EXIF data (a set of tags for camera settings provided by the camera vendor) for each image.
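A minimal sketch of such a lookup is given below, assuming the public `flickr.photos.getExif` REST method. It is not the authors' script. The tag names (`ExposureTime`, `FNumber`, `ISO`), the response layout and the hand-written `sample` payload are assumptions made for illustration, and the example parses a local string rather than calling the network.

```python
import json
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def exif_request_url(photo_id, api_key):
    """Build the REST URL for a flickr.photos.getExif call returning plain JSON."""
    query = urlencode({
        "method": "flickr.photos.getExif",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    })
    return f"{FLICKR_REST}?{query}"


def parse_exif(payload, wanted=("ExposureTime", "FNumber", "ISO")):
    """Extract the raw values of the wanted tags from a getExif JSON response.

    Returns None when the call failed or none of the wanted tags is present,
    which is how an image without usable metadata would be counted as missing.
    """
    data = json.loads(payload)
    if data.get("stat") != "ok":
        return None
    tags = {}
    for entry in data.get("photo", {}).get("exif", []):
        if entry.get("tag") in wanted:
            tags[entry["tag"]] = entry.get("raw", {}).get("_content")
    return tags or None


# Hand-written response in the assumed shape, so the sketch runs offline.
sample = json.dumps({"stat": "ok", "photo": {"exif": [
    {"tag": "ExposureTime", "raw": {"_content": "1/60"}},
    {"tag": "FNumber", "raw": {"_content": "4.0"}},
    {"tag": "ISO", "raw": {"_content": "200"}},
    {"tag": "Make", "raw": {"_content": "ExampleCam"}},
]}})
print(parse_exif(sample))
```

Counting how often `parse_exif` returns `None` over a dataset would give the kind of EXIF availability percentages reported in this section.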
In the COCO dataset, 59% and 58% of train and validation data respectively had EXIF data available. We use the trainval35K split commonly used for training object detection algorithms (e.g. SSD, YOLOv3, and RetinaNet).
In the VOC2007 dataset, 31% of train and test data had EXIF data available. However, as Figure 3 shows, the distributions of exposure times and ISO settings for photographs in the COCO and VOC2007 datasets are similar. Note that most images are taken with the automatic ISO settings such as 100, 200, 400 and 800. Most exposure times are short (below 0.1s) and spike around 1/60s. This agrees with the large-scale statistical analysis of millions of images found online in .
Using shutter speed, f-number and ISO we can compute exposure value (EV) using the following formula as in :

$$EV = \log_2\left(\frac{100\,N^2}{S\,t}\right)$$

where $EV$ is exposure value, $N$ is the f-number, $t$ is the exposure time and $S$ is the ISO used to take the photograph.
From EV we can derive the illumination level (assuming that the photograph is properly exposed) as in . We define low illumination as between -4 and 7 EV (up to 320lx), mid-level illumination as between 8 and 10 EV (640 to 2560lx) and high-level illumination as above 11 EV (more than 5120lx), which approximately matches the setup in . Figure 4 gives the distributions of exposure times (shutter speeds) in the COCO and VOC2007 datasets. Table 1 shows the image counts in each illumination level. Not surprisingly, nearly 90% of all images are acquired under high to medium illumination conditions.
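The mapping from camera settings to an illumination level can be sketched as follows. Two assumptions are made here. First, the exposure value is taken in its ISO-100-referenced form, log2(100 N^2 / (S t)), which is consistent with the variables named above. Second, the lux conversion 2.5 · 2^EV is inferred from the thresholds quoted in the text (it gives 320lx at EV 7 and 5120lx at EV 11) rather than taken from the cited source. Rounding fractional EVs to the nearest integer, and treating EV 11 itself as high, are also choices made for this illustration.

```python
import math


def exposure_value(f_number, exposure_time_s, iso):
    """ISO-100-referenced exposure value: log2(100 * N^2 / (S * t))."""
    return math.log2(100.0 * f_number ** 2 / (iso * exposure_time_s))


def approx_lux(ev):
    """Illuminance implied by the EV thresholds in the text (assumed 2.5 * 2^EV)."""
    return 2.5 * 2.0 ** ev


def illumination_level(ev):
    """Bucket an EV into the low / mid / high ranges defined in the text."""
    ev = round(ev)  # assumption: fractional EVs are rounded to the nearest integer
    if ev <= 7:
        return "low"   # EVs below -4 are also treated as low (assumption)
    if ev <= 10:
        return "mid"
    return "high"      # assumption: EV 11 itself counts as high


# Example: f/4 at 1/16 s and ISO 100.
ev = exposure_value(4.0, 1 / 16, 100)
print(ev, approx_lux(ev), illumination_level(ev))
```

Applied to every image with EXIF data, `illumination_level` would produce per-level counts of the kind summarized in Table 1.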
If our hypothesis that some imbalance in data distribution across sensing parameters might be the cause of the uneven performance is correct, then this should be revealed.
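Table 1: Image counts per illumination level in the COCO train and validation sets (percentages are of each row's total).
|Train|42,045 (61%)|18,456 (27%)|8,054 (12%)|
|Val|1,832 (62%)|771 (27%)|354 (12%)|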
5 Object Detection on Images With Different Sensor Parameters from COCO Dataset
We next investigated how different sensor parameters represented in those datasets affect the performance of object detection algorithms. We used several state-of-the-art object detection algorithms, specifically Faster R-CNN, Mask R-CNN, YOLOv3 and RetinaNet. All of them are trained on the COCO trainval35K set. Figures 2(a) and 2(b) show the percentages of images for a range of the shutter speed and ISO settings in the COCO train and validation sets. The bin edges of the heatmaps approximately match the ranges reported in  and . Since the shutter speed of the cameras used in the previous works was limited to 1s, in our setup all images with exposure time longer than 1s fall into the last bin, and exposure time values between 0 and 1 are split into 8 equal intervals. Both  and  report gain, which is not available on most consumer cameras; therefore we use ISO values as a proxy. The ISO bin edges [0, 100, 200, 400, 800, 1600, 3200, 6400, 10000] approximately correspond to the gain values used in the previous works.
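The binning described above can be sketched as follows. The ISO edges and the eight equal exposure intervals come from the text. The edge conventions are assumptions: ISO intervals are treated as right-closed (so ISO 100 falls in the first bin), exposure intervals as left-closed, and ISO values above 10000 are placed in the last bin. The `records` list is made-up example data.

```python
from bisect import bisect_left
from collections import Counter

ISO_EDGES = [0, 100, 200, 400, 800, 1600, 3200, 6400, 10000]
N_BINS = 8


def exposure_bin(t_seconds):
    """Index of the 1/8 s wide interval containing t; 1 s and longer go last."""
    return min(int(t_seconds * N_BINS), N_BINS - 1)


def iso_bin(iso):
    """Index of the right-closed ISO interval; large values land in the last bin."""
    return min(bisect_left(ISO_EDGES[1:-1], iso), N_BINS - 1)


def percentage_heatmap(records):
    """Return an 8x8 grid (rows: exposure bins, columns: ISO bins) in percent."""
    counts = Counter((exposure_bin(t), iso_bin(s)) for t, s in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to bin")
    return [[100.0 * counts[(r, c)] / total for c in range(N_BINS)]
            for r in range(N_BINS)]


# Made-up (exposure time in seconds, ISO) pairs for illustration.
records = [(1 / 60, 100), (1 / 60, 200), (1 / 125, 400), (2.0, 1600)]
heat = percentage_heatmap(records)
print(heat[0][:4], heat[7][4])
```

With real EXIF data, most of the mass would be expected in the first row of `heat`, since exposure times in these datasets are predominantly short.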
Figure 5 shows evaluation results in terms of mean average precision (mAP) for state-of-the-art object detection algorithms trained on the COCO dataset and evaluated on the portion of the COCO 5K minival set with available EXIF data, presented in the same style as the previous tests. However, it is difficult to compare our results directly with those of the previous works because of differences in the evaluation datasets, algorithms (interest point detection vs object detection) and camera parameters (gain vs ISO), the inability to precisely establish illumination level in common vision datasets, and possible inconsistencies in computing average precision in each case.
Note that nearly 90% of training and validation data in COCO is concentrated in the top row of the diagram (very short exposure times and ISO values of up to 800). Figure 5 reveals very similar results from all 4 algorithms that are trained on this dataset suggesting possible training bias. It is also apparent that the mAP values in the top row are consistent with the reported performance of the algorithms but fluctuate wildly in bins that contain less representative camera parameter ranges. It is hard to attribute this fluctuation entirely to the sensor bias, as other factors may be at play (e.g. types and number of objects, small number of images in the underrepresented bins). However, we believe that it is something that should be investigated further. It is never a useful property for an algorithm to display such significant sensitivity to small parameter changes. For imaging, one might expect a small shift in shutter speed, for example, to lead to only small changes in subsequent detection performance; this experiment shows this is not the case.
The above experiments revealed several key points.
First, theory-based algorithms seem to have an orderly pattern of performance with respect to the sensor settings we examined. This may be due to their analytic definitions; they were not designed to be parameterized for the full range of sensor settings. One can easily conclude that if good performance is sought from any of these algorithms, they should be employed with cameras set to the algorithm’s inherent optimal ranges. Of course, this limited test needs to be greatly expanded before finalizing this conclusion. However, in an engineering context this makes good sense: commercial products are commonly wrapped in detailed instructions about how to use and not use a product to ensure expected performance. In retrospect, knowledge of algorithm performance under varying sensor settings would have enhanced their design.
Second, the same test on modern algorithms reveals a haphazard performance pattern. It might be that some of the large variations are due to biases in the data and some to the particular objects in question. Others may be due to the properties of the network architectures. This also needs much deeper analysis.
Third, examination of two popular image sets, VOC2007 and COCO, shows that image metadata (sensor settings, camera pose, illumination, etc.) is often not available. This means that for any given ‘in the wild’ set of images, the performance of data-driven methods may be predicted by how well the distribution of images along dimensions of sensor setting and illumination parameters in a test set matches the distribution resulting from the training set. This requires further verification.
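One simple way to quantify how well a test distribution matches a training distribution is the total variation distance between the two normalized histograms. This measure is our choice for illustration and is not something the study above used. The counts below are the per-illumination-level image counts from Table 1.

```python
def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def total_variation(p_counts, q_counts):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    p, q = normalize(p_counts), normalize(q_counts)
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Image counts per illumination level, taken from Table 1.
coco_train = [42045, 18456, 8054]
coco_val = [1832, 771, 354]
print(total_variation(coco_train, coco_val))
```

The train and validation splits are nearly identical under this measure, which fits the observation that both were drawn from the same source. A set of images captured under different sensor settings would be expected to score much higher.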
Generalization of performance across image sets may not be a sensible expectation. In fact, the direct comparison of theory-driven vs data-driven methods was inappropriate because the theory-driven methods never had a chance. Any of the interest point algorithms in our first test could be applied to any image of either COCO or VOC2007 datasets. The comparison of high performing regions in Figure 1 with distribution of sensor settings of Figure 4 reveals only a tiny overlap: on those datasets, those algorithms would all perform terribly. This is not due to their bad design. It is due partly to their inappropriate use, i.e., use outside of their design specifications, in this case, empirically obtained camera sensor settings. More complete consideration of empirical methodology - where all aspects of the data are recorded and used to ensure reproducibility - might have led to different results in a head-to-head comparison.
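The overlap argument can be made concrete with a small sketch: given a histogram of images over sensor-setting bins and a mask marking the bins where an algorithm performs well, compute the share of images that fall inside the algorithm's operating range. The grid below is a toy example with invented numbers, not data from Figure 1 or Figure 4.

```python
def covered_share(histogram, good_mask):
    """Fraction of images lying in bins where the algorithm performs well."""
    total = sum(sum(row) for row in histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    covered = sum(
        count
        for hist_row, mask_row in zip(histogram, good_mask)
        for count, is_good in zip(hist_row, mask_row)
        if is_good
    )
    return covered / total


# Toy 2x2 example (invented numbers): most images sit in the one bin
# that lies outside the algorithm's good operating range.
image_counts = [[90, 5],
                [3, 2]]
good_bins = [[False, True],
             [True, True]]
print(covered_share(image_counts, good_bins))
```

A low share means that most of the dataset lies outside the settings for which the algorithm was found to work well, which is the situation described above.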
In some sense, the community has largely ignored sensor settings and their relevance as described here. This may be due to the belief that since humans can be shown images from any source, representing any sort of object or scene, familiar or unfamiliar, and provide reasonable responses as to their content, computer vision systems should also. This goal is “baked in” to the overall research enterprise and variations are considered either as a nuisance or as good tests of generalization ability. As an example of the latter, there are a number of studies investigating the effects of various image degradations on the performance of deep-learning-based algorithms. These include both artificially distorted images [31, 11, 33, 5] and more realistic transformations, for example, viewing the image through a cheap web camera.
However, the proposition itself is not well-formed. It is simply untrue that humans, when shown an image out of the vast range of possibilities, perform in the same manner with respect to time to respond, accuracy of response, manner of response, and internal processing strategy. The behavioral literature solidly proves this. Even the computational literature points to this. While humans have been shown to be more robust than state-of-the-art deep networks to nearly all image degradations examined, human performance still gradually drops off as the levels of distortions increase [6, 11].
As we mentioned earlier, most of the experiments investigating the robustness of deep networks to image distortions do not tie those to the changes resulting from camera settings. Perhaps changing ISO from 100 to 200, and the resulting increase in noise, would not be easily noticeable to the human eye, but it may have a measurable effect on the performance of CNNs, which are known to be affected by very small changes in the images (e.g. small affine transformations, various types of noise). As  shows, changing a single pixel in the right place is sometimes all it takes. It should be clear that the “baked in” belief needs much more nuance and a significantly deeper analysis, if not outright rejection.
Finally, as can be seen in Figures 3 and 5, the variability required to train is not even available in the large datasets we considered. The distributions of images across these parameters were very uneven, so training algorithms are impeded with respect to learning the variations. It might be good practice to require similar demonstrations of image distributions across the relevant parameters in order to ensure that not only training, but also comparative evaluations, are properly performed.
With all due respect to all the terrific advances made in computer vision by the application of deep learning methods - here termed data-driven models - we propose here, and provide some justification, that the empirical methodology that led to the turning point in the discipline was based on an oversight that none of us noticed at the time. Sensor settings matter and each algorithm, perhaps most especially the theory-driven ones, has ranges within which one might expect good performance and ranges where one should not expect it. Testing outside the ranges is not only unfair but completely inappropriate. If all you have is a 6-foot ladder you would never try to use it to reach a 12-foot window. You would get another ladder. You use your tools for the purpose they were intended. The 6-foot ladder retains its utility for the lower windows.
The evolution of our discipline’s empirical methodology seems to need a bit of a push in the right direction. If sensor settings (maybe also illumination levels) had played a role in the large scale testing of the classic theory-driven algorithms, then perhaps they would have performed at high levels. The community did not do this. In comparing the data-driven with theory-driven algorithms, the distribution of camera settings (unknown to everyone) favored the data-driven algorithms because they were trained on such a random distribution, while the theory-driven algorithms were tested on data for which they were not designed to operate. But no one realized this at the time. Thus the empirical strategy favored data-driven models.
If the empirical strategy specifically included a fuller specification of optimal operating ranges for all algorithms and each algorithm were tested accordingly, what would their performance rankings be? The theory-driven algorithms would have been tested only on images from camera settings for which they were designed and perhaps the performance gap would have been smaller. Such greater specificity in experimental design seems more common in other disciplines. An empirical method involves the use of objective, quantitative observation in a systematically controlled, replicable situation, in order to test or refine a theory. The key features of the experiment are control over variables (independent, dependent and extraneous), careful objective measurement and establishing cause and effect relationships. At the very least, a discussion on how to firm up empiricism in computer vision needs to take place. Remember the ladder analogy.
The impact of this is greater than simply challenging the competition results. Knowledge of the sensor setting ranges that lead to poor performance would allow one to ensure failure of a particular algorithm. This is as true for any head-to-head technical competition as it is for an adversary. Non-linear performance of any algorithm along one or more parameter settings is a highly undesirable characteristic. It may be that a blend of the approaches, data-driven and theory-driven, would permit better specification towards providing linear performance across the relevant parameters but within a framework where learning can occur using datasets that better represent the target population.
- (2011) On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 110–126.
- (2018) Why do deep convolutional networks generalize so poorly to small image transformations?. arXiv preprint arXiv:1805.12177.
- (1982) Computer vision. Prentice-Hall.
- (2006) SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 404–417.
- (2019) GazeGAN: A Generative Adversarial Saliency Model based on Invariance Analysis of Human Gaze During Scene Free Viewing. arXiv preprint arXiv:1905.06803.
- (2017) A study and comparison of human and deep learning recognition performance under visual distortions. In Proceedings of the 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7.
- The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
- (2005) The 2005 PASCAL Visual Object Classes Challenge. In Machine Learning Challenges Workshop, pp. 117–176.
- (1993) Three-dimensional computer vision: a geometric viewpoint. MIT Press.
- (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
- (2018) Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7538–7550.
- Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587.
- (1992) Computer and robot vision. Vol. 1, Addison-Wesley.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969.
- (2015) Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
- (1992) Adaptive elastic models for hand-printed character recognition. In Advances in Neural Information Processing Systems, pp. 512–519.
- (1986) Robot vision. MIT Press.
- (2001) Saliency, scale and image description. International Journal of Computer Vision 45 (2), pp. 83–105.
- (1990) Solid shape. MIT Press.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- The MNIST Database of Handwritten Digits. Note: http://yann.lecun.com/exdb/mnist/
- (2017) Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
- (2014) Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755.
- (2016) SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37.
- (1982) Vision: a computational investigation into the human representation and processing of visual information. MIT Press.
- (2004) Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. Image and Vision Computing 22 (10), pp. 761–767.
- (2004) Scale & affine invariant interest point detectors. International Journal of Computer Vision 60 (1), pp. 63–86.
- (2010) The GPU Computing Era. IEEE Micro 30 (2), pp. 56–69.
- (2018) YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2016) Fine-grained recognition in the noisy wild: sensitivity analysis of convolutional neural networks approaches. arXiv preprint arXiv:1610.06756.
- (1976) Digital picture processing. Academic Press.
- (2018) Effects of degradations on deep neural network architectures. arXiv preprint arXiv:1807.10108.
- (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- (1983) Introduction to modern information retrieval. McGraw-Hill College.
- (1987) Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A 4 (3), pp. 519–524.
- One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
- (2010) Computer Vision: Algorithms and Applications. Springer.
- (2017) A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pp. 37–55.
- (2016) A robustness analysis of deep Q networks. In Proceedings of the Australasian Conference on Robotics and Automation.
- (1987) A complexity level analysis of vision. International Journal of Computer Vision 1, pp. 303–320.
- The complexity of perceptual search tasks. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 89, pp. 1571–1577.
- (1991) Face Recognition Using Eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–591.
- (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
- (2017) Active control of camera parameters for object detection algorithms. arXiv preprint arXiv:1705.05685.
- (2008) Statistic analysis of millions of digital photos. In Digital Photography IV, Vol. 6817, pp. 68170L.
- (2018) Statistic analysis of millions of digital photos 2017. Electronic Imaging 2018 (5), pp. 1–4.
- (1992) Connectionism and the computational neurobiology of curve detection. Connectionism: Theory and practice. Vancouver Studies in Cognitive Science 3, pp. 277–296.