A Toolkit for creating Peripheral Architectures
We present a foveated object detector (FOD) as a biologically-inspired alternative to the sliding window (SW) approach which is the dominant method of search in computer vision object detection. Similar to the human visual system, the FOD has higher resolution at the fovea and lower resolution at the visual periphery. Consequently, more computational resources are allocated at the fovea and relatively fewer at the periphery. The FOD processes the entire scene, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image and integrates observations across multiple fixations. Our approach combines modern object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We assessed various eye movement strategies on the PASCAL VOC 2007 dataset and show that the FOD performs on par with the SW detector while bringing significant computational cost savings.
There has been substantial progress (e.g. [14, 53, 29, 37, 46, 22, 7] to name a few) in object detection research in recent years. However, humans are still unsurpassed in their ability to search for objects in visual scenes. The human brain relies on a variety of strategies 
including prior probabilities of object occurrence, global scene statistics [45, 33] and object co-occurrence [10, 28, 35] to successfully detect objects in cluttered scenes. Object detection approaches have increasingly incorporated some of these human strategies [1, 11, 2, 14, 41]. One remaining crucial difference between the human visual system and a modern object detector is that while humans process the visual field with decreasing resolution away from the fixation point [49, 26, 39, 43] and make saccades to collect information, typical object detectors scan all locations at the same resolution and repeat this at multiple scales. The goal of the present work is to investigate the impact of using a foveated visual field and saccadic exploration, rather than the dominant sliding window paradigm [14, 53, 29], on object detector performance. Such an endeavor is of interest for two reasons. First, from the computer vision perspective, using a visual field with varying resolution might reduce computational complexity and thus lead to more efficient object detection algorithms. Second, from a scientific perspective, if a foveated object detection model can achieve accuracy similar to a non-foveated sliding window approach, it might suggest a possible reason for the evolution of foveated systems in organisms: achieving successful object detection while minimizing computational and metabolic costs.
Contemporary object detection research can be roughly outlined by three important components of a modern object detector: the features, the detection model and the search model. The most popular choices for these three components are Histogram of Oriented Gradients (HOG) features, mixtures of linear templates, and the sliding window (SW) method, respectively. Although there are efforts to go beyond these standard choices (e.g. new features [37, 46]; alternative detection models [22, 46]; whether object parts should be modeled or not [54, 8]; and alternative search methods [13, 24, 14, 20, 46]), HOG, mixtures of linear templates and SW form the crux of modern object detection methods [14, 53, 29]. Here, we build upon the "HOG + mixture of linear templates" framework and propose a biologically inspired alternative search model to the sliding window method, in which the detector searches for the object by making saccades instead of processing all locations at fine spatial resolution (see Section 4 for a more detailed discussion of related work).
The human visual system is known to have a varying resolution visual field. The fovea has higher resolution, and this resolution decreases towards the periphery [49, 26, 39, 43]. As a consequence, the visual input at and around the fixation location has more detail relative to peripheral locations away from the fixation point. Humans and other mammals make saccades to align their high resolution fovea with regions of interest in the visual environment. There are many possible methods to implement such a foveated visual field in an object detection system. In this work, we opt to use a recent model which specifies how responses of elementary sensors are pooled at the V1 and V2 layers of the human visual cortex. The model specifies the shapes and sizes of the V1 and V2 regions which pool responses from the visual field. We use a simplified version of this model as the foveated visual field of our object detector (Figure 1). We call our detector "the foveated object detector (FOD)" due to its foveated visual field.
The sizes of pooling regions in the visual field increase as a function of eccentricity from the fixation location. As the pooling regions get larger towards the periphery, more information is lost at these locations. This might seem to be a disadvantage; however, exploration of the scene with the high resolution fovea through a guided search algorithm might mitigate the apparent loss of peripheral information. At the same time, fewer computational resources are allocated to process these low resolution areas, which in turn lowers the computational cost. In this paper, we investigate the impact of using a foveated visual field on detection performance and its computational cost savings.
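The eccentricity-dependent growth of pooling regions can be sketched as a linear scaling law. The slope and foveal size below are illustrative placeholders, not the values of the Freeman-Simoncelli model:

```python
def pooling_region_size(ecc_deg, foveal_size=0.5, slope=0.25):
    """Approximate diameter (in degrees) of a pooling region at eccentricity
    `ecc_deg`. Size grows linearly with eccentricity, as in the
    Freeman-Simoncelli model; the constants here are illustrative only.
    """
    return foveal_size + slope * ecc_deg

# Regions grow monotonically toward the periphery:
sizes = [pooling_region_size(e) for e in (0, 2, 4, 8)]
```

A linear law like this means the number of regions needed to tile the visual field grows only logarithmically with its radius, which is where the computational savings come from.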
The foveated object detector (FOD) mimics the process by which humans search for objects in scenes, utilizing eye movements to point the high resolution fovea at points of interest (Figure 2). The FOD is assigned an initial fixation point on the input image and collects information by extracting image features through its foveated visual field. The features extracted around the fixation point are at a fine spatial scale, while features extracted away from the fixation location are at a coarser scale. This fine-to-coarse transition is dictated by the pooling region sizes of the visual field. Then, based on the information collected, the FOD chooses the next fixation point and makes a saccade to that point. Finally, the FOD integrates information collected across multiple saccades and outputs object detection predictions.
Training such an object detector entails learning templates at all locations in the visual field. Because the visual field has varying resolution, the appearance of a target object varies depending on where it is located within the visual field. We use HOG as image features and a simplified version of the V1 model to compute pooled features within the visual field. A mixture of linear templates is trained at selected locations in the visual field using a latent-SVM-like framework [14, 18].
We present an object detector that has a foveated visual field based on physiological measurements in primate visual cortex and that models the appearance of target objects not only in the high resolution fovea but also in the periphery. Importantly, the model is developed in the context of a modern object detection algorithm and a standard dataset (PASCAL VOC), allowing for the first time a direct evaluation of the impact of a foveated visual system on an object detector.
We believe that object detection using a foveated visual field offers a novel and promising direction of research in the quest for an efficient alternative to the sliding window method, and also a possible explanation for why foveated visual systems might have evolved in organisms. We show that our method achieves greater computational savings than a state-of-the-art cascaded detection method. Another contribution of our work is the latent-LDA formulation (Section 2.4.2) where linear discriminant analysis is used within a latent-variable learning framework.
In the next section, we describe the FOD in detail and report experimental results in Section 3 which is followed by the related work section, conclusions and discussion.
The Freeman-Simoncelli (FS) model is a neuronal population model of the V1 and V2 layers of the visual cortex. The model specifies how responses are pooled (averaged together) hierarchically, beginning from the lateral geniculate nucleus to V1 and then to V2. V1 cells encode information about local orientation and spatial frequency, whereas cells in V2 pool V1 responses non-linearly to achieve selectivity for compound features such as corners and junctions. The model is based on findings and physiological measurements of the primate visual cortex and specifies the shapes and sizes of the receptive fields of the cells in V1 and V2. According to the model, the sizes of receptive fields increase linearly as a function of the distance from the fovea, and this rate of increase is larger in V2 than in V1, which means V2 pools larger areas of the visual field in the periphery. The reader is referred to the original publication for further details.
We simplify the FS model in two ways. First, the model uses a Gabor filter bank to compute image features; we replace these with HOG features [6, 14]. Second, we only use the V1 layer and leave the non-linear pooling at V2 as future work. We use this simplified FS model as the foveated visual field of our object detector, shown in Figure 1. The fovea subtends a radius of degrees. We also only simulate a visual field with a radius of degrees, which is sufficient to cover the test images presented at a typical viewing distance of cm. The square boxes with white borders (Figure 1) represent the pooling regions within the fovea. The surrounding colored regions are the peripheral pooling regions. While the foveal regions have equal sizes, the peripheral regions grow in size as a function – specified by the FS model – of their distance to the center of the fovea. The color represents the weights that are used in pooling, i.e. the weighted summation of, the underlying responses. A pooling region partly overlaps with its neighboring pooling regions (see the supplementary material of the FS model for details). Assuming a viewing distance of cm, the whole visual field covers about a x pixel area (a pixel subtends ). The foveal radius is pixels, subtending a visual angle of degrees.
Given an image and a fixation point, we first compute the gradient at each pixel and then, for each pooling region, the gradient magnitudes are pooled per orientation for the pixels that fall under the region. At the fovea, where the pooling regions are x pixels, we use HOG features at the same spatial scale as the original DPM model; in the periphery, each pooling region takes a weighted sum of the HOG features of the x regions that it covers.
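The peripheral pooling step can be sketched as a weighted sum over the HOG cells a region covers. The weight normalization below is an assumption made for the sketch, not a detail taken from the paper:

```python
def pool_region(hog_cells, weights):
    """Pool per-cell HOG vectors under one peripheral pooling region.

    hog_cells: list of feature vectors (one per covered HOG cell).
    weights:   one pooling weight per cell (e.g. the region's overlap
               profile); normalized here so the weights sum to 1.
    Returns a single pooled feature vector of the same length.
    """
    total = float(sum(weights))
    norm = [w / total for w in weights]
    n_feats = len(hog_cells[0])
    return [sum(w * cell[i] for w, cell in zip(norm, hog_cells))
            for i in range(n_feats)]
```

Because this pooling is linear, it can later be folded into the template weights themselves, a fact the paper exploits when counting template evaluation costs.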
The model consists of a mixture of components, where each component comprises a linear template and a location with respect to the center of the visual field. The location variable defines a unique bounding box within the visual field for the template. Specifically, it is a tuple whose variables respectively denote the width, height and coordinates of the template within the visual field.

The template is a matrix of weights on the features extracted from the pooling regions underlying its bounding box. The dimensionality of the template, i.e. the total number of weights, depends both on the width and height of its bounding box and on its location in the visual field. A component within the fovea covers a larger number of pooling regions than a peripheral component with the same width and height; hence the dimensionality of a foveal template is larger. Three example components are illustrated in Figure 3, where the foveal component (red) covers the largest number of pooling regions while the blue and green peripheral components cover progressively fewer regions. Since a fixed number of features is extracted from each pooling region regardless of its size (we use the feature extraction implementation of DPM rel5 [17, 14], which extracts a fixed-length feature vector per region), foveal components have higher-resolution templates associated with them.
Suppose that we are given a model that is already trained for a certain object class. The model is presented with an image and assigned an initial fixation location. We are interested in searching for an object instance in the image. Because the size of a searched object is not known a priori, the model has to analyze the input image at various scales. We use the same set of image scales given in and use to denote a scale from that set. When used as a subscript to an image, e.g. , it denotes the scaled version of that image, i.e. the width (and height) of is times the width (and height) of . The same convention applies to fixation locations and bounding boxes: if denotes a fixation location, then ; for a bounding box , .
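The scaling convention described above can be made concrete with two small helpers; these are a sketch of the convention, not the paper's code:

```python
def scale_point(point, s):
    """Map a fixation location (x, y) into the image rescaled by factor s."""
    x, y = point
    return (s * x, s * y)

def scale_box(box, s):
    """Map a bounding box (x, y, w, h) into the image rescaled by factor s."""
    x, y, w, h = box
    return (s * x, s * y, s * w, s * h)
```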
To check whether an arbitrary bounding box within contains an object instance, while the model is fixating at location f, we compute a detection score as
where is a feature extraction function which returns the features of for component (see Equation (1)) when the model is fixating at . The vector is the blockwise concatenation of the templates of all components. effectively chooses which component to use, that is . The fixation location ,, together with the component define a unique location, i.e. a bounding box, on . returns the set of all components whose templates have a predetermined overlap (intersection over union should be at least as in ) with when the model is fixating at . During both training and testing, and are latent variables for example .
Ideally, should hold for an appropriate when contains an object instance within . For an image that does not contain an object instance, should hold for any . For this to work, a subtlety in the definition of is needed: it returns all components of the model (Equation (1)). During training (Section 2.4), this enforces the responses of all components on a negative image to be suppressed.
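A minimal sketch of the scoring rule: the score of a candidate box at a fixation is the best response among the components whose visual-field bounding boxes sufficiently overlap it. The `overlaps` and `features_for` callables below are hypothetical stand-ins for the paper's overlap test and feature extraction function:

```python
def dot(u, v):
    """Dot product of two equal-length weight/feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def detection_score(box, fixation, components, features_for, overlaps):
    """Score `box` while fixating at `fixation`.

    components:   list of (template, location) pairs.
    overlaps:     predicate implementing the IoU-based overlap test.
    features_for: returns the pooled features a component sees for `box`.
    Returns None when no component covers the box at this fixation.
    """
    valid = [c for c in components if overlaps(c, box, fixation)]
    if not valid:
        return None
    return max(dot(tmpl, features_for(box, (tmpl, loc), fixation))
               for tmpl, loc in valid)
```

Maximizing over the valid components corresponds to treating the component choice as a latent variable, as the text describes.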
So far, we have looked at the situation where the model has made only one fixation. We describe in Section 2.3 how the model chooses the next fixation location. For now, suppose that the model has made fixations, , and we want to find out whether an arbitrary bounding box contains an object instance. This computation involves integrating observations across multiple fixations, which is a considerably more complicated problem than the single fixation case. The Bayesian decision on whether
contains an object instance is based on the comparison of posterior probabilities:
where denotes the event that there is an object instance at location . We use the posteriors' ratio as a detection score: the higher it is, the more likely the location contains an instance. Computing the probabilities in (3) requires training a classifier per combination of fixation locations for each different number of fixations, which is intractable. We approximate it using a conditional independence assumption (derivation given in Appendix A):
We model the probability using a classifier and use the sigmoid transfer function to convert raw classification scores to probabilities:
Taking the logarithm of the posterior ratios does not alter the ranking of detection scores across locations, because the logarithm is a monotonic function. In short, the detection score computed by the FOD for a certain location is the sum of the individual scores for that location computed at each fixation.
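With a plain (unparameterized) sigmoid, summing log posterior ratios across fixations reduces exactly to summing the raw classifier scores, since the logit undoes the sigmoid. A sketch; in practice the sigmoid would typically have fitted calibration parameters, which this sketch omits:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def integrated_score(raw_scores):
    """Sum the log posterior ratios over fixations for one location."""
    total = 0.0
    for s in raw_scores:
        p = sigmoid(s)                     # raw score -> posterior
        total += math.log(p / (1.0 - p))   # log posterior ratio (= s here)
    return total
```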
We use the maximum-a-posteriori (MAP) model as the basic eye movement strategy of the FOD. The MAP model has been shown to be consistent with human eye movements in a variety of visual search tasks [4, 47]. Studies have demonstrated that in some circumstances human saccade statistics better match an ideal searcher that moves the eyes to locations that maximize the accuracy of localizing targets; yet in many circumstances the MAP model approximates the ideal searcher [32, 51] while being computationally more tractable for objects in real scenes. The MAP model selects the location with the highest posterior probability of containing the target object as the next fixation location, that is, the center of the location where
Finding the maximum of the posterior above is equivalent to finding the maximum of the posterior ratios,
since for two arbitrary locations ; let and , then we have
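The MAP rule itself is a one-liner over the current detection scores; `score_fn` below stands in for the (monotone) posterior-ratio score of a location:

```python
def next_fixation_map(candidate_boxes, score_fn):
    """Pick the next fixation: the center of the box with the highest
    current score (equivalently, the highest posterior ratio)."""
    x, y, w, h = max(candidate_boxes, key=score_fn)
    return (x + w / 2.0, y + h / 2.0)
```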
A set of dimensions (widths and heights) is determined from the bounding box statistics of the examples in the training set, as done in the initialization of the DPM model . Then, for each width and height, new components with these dimensions are created to tile the entire visual field. However, the density of components in the visual field is not uniform. Locations, i.e. bounding boxes, that do not overlap well with the underlying pooling regions are discarded. To define goodness of overlap, a bounding box is said to intersect an underlying pooling region if more than one fifth of that region is covered by the bounding box. Overlap is the average coverage across the intersected regions. If the overlap is more than , a component for that location is created; otherwise the location is discarded (see Figure 4 for an example). In addition, no components are created for locations outside the visual field. The weights of the component templates are initialized to arbitrary values; training the model is essentially optimizing these weights on a given dataset.
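The overlap test used to prune component locations can be sketched as follows. The one-fifth coverage criterion comes from the text, while the final overlap threshold (elided in the text) is a placeholder:

```python
def keep_location(covered_areas, region_areas,
                  min_coverage=0.2, min_overlap=0.75):
    """Decide whether to create a component at a candidate location.

    covered_areas: per pooling region, the area covered by the candidate
                   bounding box; region_areas: total area of each region.
    A region intersects the box if more than `min_coverage` of it is
    covered; overlap is the mean coverage over intersected regions.
    `min_overlap` is an illustrative placeholder threshold.
    """
    coverages = [c / r for c, r in zip(covered_areas, region_areas)]
    hit = [c for c in coverages if c > min_coverage]
    return bool(hit) and sum(hit) / len(hit) >= min_overlap
```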
Consider a training set where is an image and a bounding box and is the total number of examples. If does not contain any positive examples, i.e. object instances, then . Following the DPM model , we train model templates using a latent-SVM formulation:
where if and , otherwise. The set denotes the set of all feasible fixation locations for example . For , a fixation location is considered feasible if there exists a model component whose bounding box overlaps with . For , all possible fixation locations on are considered feasible.
Optimizing the cost function in (10) is manageable for mixtures with few components; however, the FOD has a large number of components in its visual field (typically, for an object class in the PASCAL VOC 2007 dataset , there are around -) and optimizing this cost function becomes prohibitive in terms of computational cost. As an alternative, cheaper linear classifiers can be used. Recently, linear discriminant analysis (LDA) has been used in object detection (), producing surprisingly good results with much faster training. Training an LDA classifier amounts to computing $w = \Sigma^{-1}(\mu_{+} - \mu_{-})$, where $\mu_{+}$ is the mean of the feature vectors of the positive examples, $\mu_{-}$ is the same for the negative examples and $\Sigma$ is the covariance matrix of these features. Here, the most expensive computation is the estimation of $\Sigma$, which is required for each template with different dimensions. However, it is possible to estimate a global $\Sigma$ from which covariance matrices for templates of different dimensions can be obtained . For the FOD, we estimate the covariance matrices for the foveal templates and obtain the covariance matrices for peripheral templates by applying the feature pooling transformations to the foveal covariance matrices.
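The closed-form LDA weights are then just a linear solve; a minimal sketch with numpy:

```python
import numpy as np

def lda_template(mu_pos, mu_neg, cov):
    """LDA weight vector w = Sigma^{-1} (mu_+ - mu_-) for one template.

    mu_pos / mu_neg: mean feature vectors of positive / negative examples.
    cov: shared feature covariance matrix (estimated once, then reused).
    """
    diff = np.subtract(mu_pos, mu_neg)
    return np.linalg.solve(cov, diff)  # avoids forming an explicit inverse
```

Solving the linear system rather than inverting the covariance is the standard numerically stable choice.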
We propose to use LDA in a latent-SVM-like framework, as an alternative to the method in where positive examples are clustered first and then an LDA classifier is trained per cluster. Consider a single template. LDA gives us $w = \Sigma^{-1}(\mu_{+} - \mu_{-})$, where $\Sigma$ is the covariance matrix for that template, and $\mu_{+}$ and $\mu_{-}$ are the means of the positive and negative feature vectors, respectively, assigned to it. We propose to apply an affine transformation to the LDA classifier:
and modify the cost function as
where the first summation pushes the score of the mean of the negative examples below zero and the second summation, taken over positive examples only, pushes the scores above zero. and are the appropriate blockwise concatenations, and is the regularization constant. Overall, this optimization effectively calibrates the dynamic ranges of the different templates' responses in the model, so that the scores of positive examples and negative means are pushed away from each other while the norm of the weight vector is constrained to prevent overfitting. This formulation does not require the costly mining of hard negative examples of latent-SVM. We call this formulation (Equation (12)) latent-LDA.
To optimize (12), we use a classical coordinate-descent procedure. We start by initializing the templates by training on warped positive examples as in . Then, we alternate between choosing the best values for the latent variables while keeping the templates fixed, and optimizing the templates while keeping the latent variables of the positive examples fixed.
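The alternation can be sketched as a generic coordinate-descent loop; `assign` and `fit` are hypothetical stand-ins for the two steps (choosing latent values, re-fitting the calibrated LDA parameters), not the paper's actual code:

```python
def latent_lda_train(positives, assign, fit, n_iters=5):
    """Coordinate-descent sketch of latent-LDA training.

    positives: positive training examples.
    assign(example, params): best latent values under current params.
    fit(assignments): re-estimate params from (example, latent) pairs.
    """
    params = fit([(ex, None) for ex in positives])  # warped-positive init
    for _ in range(n_iters):
        latent = [(ex, assign(ex, params)) for ex in positives]
        params = fit(latent)
    return params
```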
We evaluated our method on the PASCAL VOC 2007 detection (comp3) challenge dataset and protocol (see  for details). All results are obtained by training on the train+val split and testing on the test split.
We first compared our SW implementation, which corresponds to using foveal templates only, to three state-of-the-art methods that are also SW based [14, 29, 18]. Table I gives the AP (average precision) results, i.e. the area under the precision-recall curve per class, and the mean AP (mAP) over all classes. The deformable parts model (DPM) originally uses object parts; however, to make a fair comparison with our model, we disabled its parts. The first row of Table I shows the latest version of the DPM system with the parts-learning code disabled. The second row shows results for another popular SVM-based system, known as the exemplar-SVM (E-SVM), which also models whole objects only, not their parts. Finally, the third row shows results from an LDA-based system, "discriminative decorrelation for classification" (DCC) . All three systems are based on HOG features and mixtures of linear templates. The results show that the SVM based systems perform better than the LDA based systems, which is not surprising since discriminative models are well known to outperform generative models in classification tasks. However, LDA's advantage against this performance loss is that it is very fast to train, which is exactly why we chose LDA instead of SVM. Once the background covariance matrices are estimated (which can be done once and for all ), training is as simple as averaging the feature vectors of the positive examples and doing a matrix multiplication. We estimated that training an SVM based system for our FOD would take about 300 hours (approximately 2 weeks) for a single object class, whereas the LDA based system can be trained in under an hour on the same machine, which has an Intel i7 processor.
Although our SW method achieves the same mean AP (mAP) score as the DCC method , the latter has a detection model with higher computational cost. We use 2 templates per class, while DCC trains more than 15 templates per class within an exemplar-SVM-like framework. DCC computes the dot product of the feature vector of the detection window with every exemplar within a cluster, which means that a detection window is effectively compared to all positive examples in the training set. In our case, the number of dot products per detection window equals the number of templates, which is 2 in this paper. This clearly demonstrates the advantage of our latent-LDA approach over DCC .
Next, we compared the performance of the FOD with our SW method. We experimented with two eye movement strategies, MAP (Section 2.3) and a random strategy, to demonstrate the importance of the guidance of eye movements.
(In Table II, the RAND-C and RAND-E rows for 1 fixation are identical to the corresponding MAP-C and MAP-E rows for 1 fixation, since the initial fixation is assigned rather than chosen by the strategy.)
Table II shows the AP scores for the FOD with different eye movement strategies and different numbers of fixations. We also include in this table the "Our SW" result from Table I for ease of reference. The MAP and random strategies are denoted MAP and RAND, respectively. Because the model accuracy results depend on the initial point of fixation, we ran the models with different initial points of fixation. The suffix on a model name refers to the location of the initial fixation: "-C" stands for the center of the input image, i.e. in normalized image coordinates where the top-left corner is taken as and the bottom-right corner as ; and "-E" for the two locations at the left and right edges of the image, of the image width away from the image border, that is and . MAP-E and RAND-E results are the average performance of two runs, one with the initial fixation close to the left edge of the image, the other close to the right edge. For the random eye movement strategy, we report the confidence interval for AP over different runs. We ran all systems for a total of fixations. Table II shows results after , and fixations. A condition with one fixation corresponds to a model that makes decisions based only on the initial fixation.
The results show that the FOD using the MAP rule with 5 fixations (MAP-C,5 for short) performs nearly as well as the SW (a difference of in mean AP).
Figure 5 shows the ratio of the mean AP of the FOD with the various eye movement strategies to that of the SW system (relative performance) as a function of the number of fixations. The relative performance of MAP-C to SW (AP of MAP-C divided by AP of SW) is for 5 fixations, for 3 fixations and for 1 fixation. The FOD with eye movement guidance towards the target (MAP-C,5) achieves or exceeds SW's performance with only 1 fixation in 4 classes, with 3 fixations in 7 classes, and with 5 fixations in 2 classes. For the remaining 7 classes, the FOD needs more than 5 fixations to match SW's performance.
MAP-C performs quite well ( relative performance) even with 1 fixation. The reason behind this result is that, on average, bounding boxes in the PASCAL dataset cover a large portion of the images (average bounding box area normalized by image area is ) and are located at and around the center . To reduce the effects of these biases in object placement on the results, we assessed the models with an initial fixation close to the edge of the image (MAP-E). When the initial fixation is closer to the edge of the image, performance is initially worse than when the initial fixation is at the center. The difference in performance diminishes with additional fixations, reaching similar performance with five fixations ( difference in mean AP). Figure 6 shows how the distribution of AP scores across object classes for MAP-E improves from 1 fixation to 5 fixations.
To assess the importance of guided saccades towards the target, we compared the performance of the MAP model against a FOD that guides eye movements with a random eye movement generator.
Figure 5 allows comparison of the relative performance of the MAP FOD and the versions with a random eye movement strategy. The performance gaps between the MAP-C/RAND-C pair and the MAP-E/RAND-E pair show that the MAP eye movement strategy is effective in improving the performance of the system.
The computational complexity of the SW method is easily expressed in terms of image size. However, this is not the case for our model. The computational complexity of the FOD is proportional to the number of fixations and the total number of components, hence templates, in the visual field. These numbers do not explicitly depend on the image size; in this sense, the complexity of the FOD is constant with respect to image size. Currently, the number of fixations is given as an input parameter, but if it were to be automated, e.g. to achieve a certain detection accuracy, it would implicitly depend on several factors such as the difficulty of the object class and the location and size distribution of positive examples. Targets that are small (relative to the image size) and located far away from the initial fixation location would require more fixations to reach a certain detection accuracy. The number of components depends both on the visual field parameters (the number of angle and eccentricity bins, which in our case are fixed based on the Freeman-Simoncelli model ) and on the bounding box statistics of the target object. These dependencies make it difficult to express the theoretical complexity in terms of input image size. For this reason, we compare the computational costs of the FOD and SW in a practical framework, expressed in terms of the total number of operations performed in template evaluations.
In both the SW based methods and the FOD, linear template evaluations, i.e. taking dot products, are the main time consuming operation. We define the computational cost of a method as the total number of template evaluation operations it executes (as also done in ). A model may have several templates with different sizes, so instead of counting each template evaluation as one operation, we take into account the dimensionalities of the templates. For example, the cost of evaluating a (6-region)x(8-region) HOG template is counted as 48 operations.
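Under this convention, counting operations is straightforward; a sketch that counts one unit per pooling region covered, matching the 6x8-region, 48-operation example in the text:

```python
def template_cost(width_regions, height_regions):
    """Cost of one evaluation of a (width x height)-region template."""
    return width_regions * height_regions

def total_cost(evaluations):
    """Total operations over a list of (width, height) evaluations."""
    return sum(template_cost(w, h) for w, h in evaluations)
```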
It is straightforward to compute the computational cost (as defined above) of the SW method. For the FOD, we run the model on a subset of the testing set and count the number of operations actually performed. Note that, in order to compute a detection score, the FOD first performs a feature pooling (based on the location of the component in the visual field) and then a linear template evaluation. Since these are both linear operations, we combine them into a single linear template. The last column of Table II gives the computational costs of the SW method and the FOD. For the FOD, the computational cost is reported as a function of the number of fixations. For ease of comparison, we normalized the costs so that the SW method performs 100 operations in total. The results show that the FOD is computationally more efficient than SW: the FOD achieves of SW's performance at of SW's computational cost. Note that this saving is not directly comparable to that of the cascaded detection method reported in , because the FOD's computational savings come from fewer root filter evaluations, whereas in a richer model (DPM, with root filters and part filters) is used and the savings are associated with fewer evaluations of the part filters (i.e., the model applies the root filters at all locations first and sequentially runs the other filters on the non-rejected locations).
To directly compare the computational savings of the FOD model to a cascade-type object detector, we used a richer and more expensive detection model at the fovea. This is analogous to the cascaded detection idea, where cheaper detectors are applied first and more expensive detectors are applied later on the locations not rejected by the cheaper detectors. To this end, we run our FOD and, after each fixation, evaluate the full DPM detector (root and part filters together) only at foveal locations that score above a threshold, which is determined on the training set to achieve a high recall rate (). We call this approach "FOD-DPM cascade", or FOD-DPM for short. Table III and Figure 7 give the performance results of this approach. FOD-DPM achieves an average performance similar to that of DPM ( relative performance, AP gap) using 9 fixations and exceeds DPM's performance starting from 11 fixations. On some classes (e.g. bus, car, horse), FOD-DPM exceeds DPM's performance, probably due to a smaller number of evaluations and thus fewer false positives; in other cases (e.g. bike, dog, tv), FOD-DPM underperforms, probably due to the low recall rate of the FOD detector for these classes. Figure 8 gives per-class AP scores of FOD-DPM and DPM to demonstrate the improvement from 1 to 9 fixations.
We compare the computational complexities of FOD-DPM and DPM by their total numbers of operations, as defined above. For a given object class, the DPM model has 3 root filters and 8 part filters of size 6x6. It is straightforward to calculate the number of operations performed by DPM, as it uses the SW method. For FOD-DPM, the total number of operations is the sum of: 1) the FOD's operations and 2) DPM's operations at each high-scoring foveal detection, namely one DPM root filter (the one whose shape best matches the detection) and 8 parts evaluated at all locations within the boundaries of that root filter. Note that we ignore the time for optimally placing the parts in both DPM and FOD-DPM. The cost of feature extraction is also not included, as the two methods use the same feature extraction code. We report the computational costs of FOD-DPM and DPM in the last column of Table III, with the costs normalized so that DPM's cost is 100 operations. The results show that FOD-DPM drastically reduces the cost for 9 fixations. Assuming both methods are implemented equally efficiently, this would translate to a substantial speed-up, exceeding the speed-up reported for a cascaded evaluation of DPM . These results demonstrate the effectiveness of our foveated object detector in guiding the visual search.
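The two counting rules can be made concrete with a small sketch; every numeric size here (filter dimensions, location counts, FOD operation budget) is an assumed placeholder, not a figure from the paper.

```python
# Illustrative operation counting for DPM-SW vs. FOD-DPM (all sizes assumed).
def dpm_sw_ops(n_locations, root_size=6 * 10, part_size=6 * 6,
               n_roots=3, n_parts=8):
    """Sliding-window DPM: every root and every part filter at every location."""
    return n_locations * (n_roots * root_size + n_parts * part_size)

def fod_dpm_ops(fod_ops, n_high_scoring, root_size=6 * 10, part_size=6 * 6,
                n_parts=8, part_search_locs=6 * 10):
    """FOD pass plus, per high-scoring foveal detection, one root filter and
    8 parts searched over the locations inside that root's extent
    (approximated here by `part_search_locs`)."""
    return fod_ops + n_high_scoring * (
        root_size + n_parts * part_size * part_search_locs)

sw = dpm_sw_ops(n_locations=10_000)
cascade = fod_dpm_ops(fod_ops=50_000, n_high_scoring=20)
normalized = 100 * cascade / sw   # costs normalized so DPM-SW = 100 operations
```

Under these toy numbers the cascade spends only a small fraction of the sliding-window budget, because the expensive root-plus-parts evaluation fires at a handful of foveal detections rather than at every location.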
Finally, in Figure 9 we show sample detections by the FOD system. We ran the trained bicycle, person and car models on an image outside of the PASCAL dataset. The models were assigned the same initial location and were run for a fixed number of fixations. The results show that each model fixates at different locations, and that these locations are attracted towards instances of the target object being searched for.
The sliding window (SW) method is the dominant model of search in object detection. The complexity of identifying object instances in a given image is O(LC), where L is the number of locations to be evaluated and C is the number of object classes to be searched for. Efficient alternatives to sliding windows can be categorized into two groups: (i) methods aimed at reducing L, and (ii) methods aimed at reducing C. Since typically L >> C, there is a larger number of efforts aimed at reducing L; however, reducing the contribution of the number of object classes has recently received increasing interest, as search over hundreds of thousands of object classes has started to be tackled. According to this categorization, our proposed FOD method falls into the first group, as it is designed to locate object instances by making a set of sequential fixations, where at each fixation only a sparse set of locations is evaluated.
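The O(LC) cost model and the two reduction levers can be written out explicitly; the window and class counts below are illustrative, not measurements.

```python
def sw_cost(num_locations, num_classes, ops_per_eval=1):
    """Sliding-window cost grows as O(L * C): each of L locations is
    evaluated once for each of C classes."""
    return num_locations * num_classes * ops_per_eval

# Group (i) methods shrink L (fixations, proposals, branch-and-bound);
# group (ii) methods shrink the contribution of C (hashing, shared parts).
# Illustrative numbers: ~10^5 candidate windows, 20 PASCAL classes.
baseline = sw_cost(100_000, 20)
fewer_locations = sw_cost(1_000, 20)   # group (i): 100x fewer locations
```

Because L is typically several orders of magnitude larger than C, shrinking L, as the FOD's sparse fixations do, attacks the dominant factor of the product.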
In efforts to reduce the number of locations to be evaluated, one line of research is branch-and-bound methods ([24, 20]), where an upper bound on the quality function of the detection model is used in a global branch-and-bound optimization scheme. Although the authors provide efficiently computable upper bounds for popular quality functions (e.g. linear template, bag-of-words, spatial pyramid), it may not be trivial to derive suitable upper bounds for a custom quality function. Our method, on the other hand, uses a binary classification detection model and is agnostic to the quality function used.
Another line of research is cascaded detection, where a series of cheap-to-expensive tests is applied to locate the object. Cascaded detection is similar to our method in the sense that simple, coarse and cheap evaluations are used together with complex, fine and expensive evaluations. However, our method differs in that cascaded detection is essentially a sliding window method with a coarse-to-fine heuristic used to reduce the total number of evaluations. Another coarse-to-fine search scheme is presented in , where a set of low- to high-resolution templates is used. The method starts by evaluating the lowest-resolution template – which is essentially a sliding window operation – and selecting the highest-responding locations for further processing with higher-resolution templates. Our method, too, uses a set of varying-resolution templates; however, these templates are evaluated at every fixation instead of serializing their evaluations by resolution.
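The coarse-to-fine template scheme can be sketched as follows; this is a generic illustration of the idea, not the cited method, and the dense correlation loop is deliberately naive for clarity.

```python
import numpy as np

def correlate_valid(img, t):
    """Dense valid cross-correlation: the cheap, exhaustive coarse pass."""
    H, W = img.shape
    h, w = t.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * t)
    return out

def coarse_to_fine(img, coarse_t, fine_t, keep_frac=0.1):
    """Evaluate the low-resolution template everywhere, keep the
    top-responding locations, and evaluate the high-resolution template
    only there. Both templates share the same top-left anchor (assumed)."""
    coarse = correlate_valid(img, coarse_t)
    k = max(1, int(keep_frac * coarse.size))
    flat_top = np.argsort(coarse.ravel())[-k:]   # survivors of the coarse pass
    fine_scores = {}
    h, w = fine_t.shape
    for flat in flat_top:
        i, j = np.unravel_index(flat, coarse.shape)
        if i + h <= img.shape[0] and j + w <= img.shape[1]:
            fine_scores[(i, j)] = np.sum(img[i:i + h, j:j + w] * fine_t)
    return fine_scores
```

The contrast with the FOD is visible in the structure: here the fine template waits for the coarse pass to finish over the whole image, whereas the FOD evaluates all of its resolutions at every fixation.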
In , a segmentation-based method is proposed to yield a small set of locations that are likely to correspond to objects, which are subsequently used to guide the search in a selective manner. The locations are identified in an object-class-independent way using an unsupervised multiscale segmentation approach; thus, the method evaluates the same set of locations regardless of which object class is being searched for. In contrast, in our method, the selection of locations to be foveated is guided by learned object class templates.
The method in , similar to ours, works like a fixational system: at a given time step, the location to be evaluated next is decided based on previous observations. However, there are important differences. In , only a single location is evaluated at each time step, whereas we evaluate all template locations within the visual field at each fixation. Their method returns only one box as the result, whereas ours can output many predictions.
There are also vector quantization based methods [21, 40, 19] aiming to reduce the time required to compute linear template evaluations. These methods, which reduce the cost of each individual template evaluation, are orthogonal to our foveated approach; thus, vector quantization approaches can be integrated with the proposed foveated object detection method.
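A minimal sketch of the vector-quantization idea follows (a generic illustration, not any of the cited methods [21, 40, 19]): each feature cell is replaced by its nearest codeword index, and template responses are recovered through a precomputed lookup table of codeword dot products.

```python
import numpy as np

def vq_score(cells, template, codebook):
    """Approximate a template score via vector quantization: quantize each
    feature cell to its nearest codeword, then sum precomputed per-cell
    codeword dot products instead of computing dense dot products."""
    # Quantize: nearest codeword per cell (done once per image).
    d = ((cells[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(1)                       # (n_cells,) codeword indices
    # Precompute: dot product of every template cell with every codeword
    # (done once per template, amortized over all image locations).
    table = template @ codebook.T             # (n_cells, n_codewords)
    # Evaluate: one table lookup per cell replaces a full dot product.
    return table[np.arange(len(codes)), codes].sum()
```

When a cell's features coincide with a codeword the lookup is exact; otherwise the quantization error trades a small loss in score fidelity for a large drop in per-location arithmetic.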
Works in this group aim to reduce the time complexity contributed by the number of object classes. The method proposed in  accelerates the search by replacing costly linear convolution with a locality-sensitive hashing scheme that operates on non-linearly coded features. Although all locations in a given image are evaluated, the approach scales constantly with the number of classes, which enables the evaluation of thousands of object classes in a very short amount of time.
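The constant-in-C behaviour can be illustrated with a generic random-hyperplane LSH sketch; the cited work's actual hashing scheme differs, and all names and sizes here are illustrative.

```python
import numpy as np

def lsh_hash(v, hyperplanes):
    """Random-hyperplane LSH: the sign pattern of projections, packed into
    a single bucket id (assumes at most 8 hyperplanes)."""
    bits = (hyperplanes @ v) > 0
    return int(np.packbits(bits)[0])

def build_index(class_templates, hyperplanes):
    """Offline step: bucket each class template by its hash, once."""
    index = {}
    for name, template in class_templates.items():
        index.setdefault(lsh_hash(template, hyperplanes), []).append(name)
    return index

def query(features, index, hyperplanes):
    """Per location: one hash plus one bucket lookup. The cost does not
    grow with the total number of indexed classes."""
    return index.get(lsh_hash(features, hyperplanes), [])
```

Every location is still visited, but instead of C dot products per location there is a single hash and a dictionary lookup, which is what makes scaling to thousands of classes feasible.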
Another method  uses a sparse representation of object part templates and then uses the basis of this representation to reconstruct template responses. When the number of object categories is large, the sparse representation serves as a shared dictionary of parts and accelerates the search.
Importantly, the way the methods in this group accelerate search is orthogonal to the savings proposed by using a foveated visual field. Therefore, these methods are complementary and can be integrated with our method to further accelerate search.
In the context of the references listed in this and the previous sections, our method of search through fixations using a non-uniform foveated visual field is novel.
There have been previous efforts (e.g. ) on biologically inspired object recognition. However, these models do not have a foveated visual field and thus do not execute eye movements. More recent work has implemented biologically inspired search methods. In , a fixed, pre-attentive, low-resolution wide-field camera is combined with a shiftable, attentive, high-resolution narrow-field camera, where the pre-attentive camera generates saccadic targets for the attentive camera. The fundamental difference from our method is that while their pre-attentive system has the same coarse resolution everywhere in the visual field, our method, which models the V1 layer of the visual cortex, has a resolution that varies with the radial distance from the center of the fovea. There have been previous efforts to create foveated search models with eye movements [31, 51, 38, 30]. Such models have mostly been applied to detecting simple signals in computer-generated noise [31, 51] and have been used as benchmarks against human eye movements and performance.
The most closely related approaches include TAM, Infomax, and artificial neural network based models [25, 2]. TAM is a foveated model that uses scale-invariant feature transform (SIFT) features for representation and utilizes a training set of images to learn the appearance of the target object. However, it does not model the variability in object appearance due to scale and viewpoint, and its evaluation is done by placing objects on a uniform background. Infomax, on the other hand, can use any previously trained object detector and works on natural images; successful results are reported on a face detection task. Both TAM and Infomax use the same template for all locations in the visual field, while our method uses different templates for different locations. The neural network based models were applied to image categorization and to object tracking in videos. Critically, none of these models have been tested on standard object detection datasets, nor have they been compared to a SW approach to evaluate the potential performance loss and computational savings of modeling a foveated visual field.
We presented an implementation of a foveated object detector using a recent neurobiologically plausible model of pooling in the visual periphery, and reported the first evaluation of a foveated object detection model on a standard computer vision data set (PASCAL VOC 2007). Our results show that the foveated method achieves nearly the same performance as the sliding window method at a fraction of its computational cost. Using a richer model (such as DPM ) to evaluate high-scoring locations, the FOD is able to outperform the DPM with greater computational savings than a state-of-the-art cascaded detection system . These results suggest that a foveated visual system offers promising potential for the development of more efficient object detectors.
Appendix: derivations for Equations (4) and (6), the latter using Equation (5).
Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1–9, 2010.