1 Introduction
A widely used approach to object category recognition has been the bag-of-words method [7] combined with the spatial pyramid match kernel [14]. This approach extracts visual features, quantizes them into a limited set of visual words, and performs classification, often with a support vector machine [12, 13]. In contrast to the bag-of-words method, Boiman et al. [3] introduced a feature-wise nearest neighbor algorithm called Naive Bayes Nearest Neighbor (NBNN). They do not quantize the visual descriptors, instead retaining all of the reference descriptors in their original form.
Boiman et al. [3] showed that quantizing descriptors, as in the bag-of-words model, greatly decreases the discriminative power of the data: the bag-of-words model usually reduces the high-dimensional feature space to just a few thousand visual words.
Despite NBNN’s independence assumption (independence of the descriptors in the query image), Boiman et al. demonstrated state-of-the-art performance on several object recognition datasets, improving upon the commonly used SVM classifier with a spatial pyramid match kernel.
NBNN is a simple algorithm. The task is to determine the most probable class $\hat{c}$ of a query image $Q$. Let $d_1, \dots, d_n$ be all the descriptors in the query image. The training data for a class $C$ is a collection of descriptors extracted from a set of labelled example images. These are stored in data structures that allow for efficient nearest neighbor searches (the nearest neighbor of descriptor $d_i$ in class $C$ is $\mathrm{NN}_C(d_i)$). The original NBNN is listed as Algorithm 1.
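As a rough illustration (our sketch, not the authors' implementation; the brute-force nearest-neighbor search stands in for the efficient per-class search structures), the original NBNN loop amounts to:

```python
import numpy as np

def nbnn_classify(query_descriptors, class_descriptors):
    """Original NBNN: for each query descriptor, add the squared distance
    to its nearest neighbor in *every* class, then pick the class with the
    minimum total. `class_descriptors` maps class label -> (m, D) array."""
    totals = {c: 0.0 for c in class_descriptors}
    for d in query_descriptors:
        for c, refs in class_descriptors.items():
            # nearest neighbor of d in class c (brute force stand-in for a KD-tree)
            dists = np.sum((refs - d) ** 2, axis=1)
            totals[c] += dists.min()
    return min(totals, key=totals.get)
```

Note that every class is searched for every query descriptor; this per-class search is precisely the cost that local NBNN removes.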
Our contribution is a modification to the original NBNN algorithm that increases classification accuracy and provides a significant speedup when scaling to large numbers of classes. We eliminate the need to search for a nearest neighbor in each of the classes. Instead, we merge the reference datasets together and use an alternative nearest neighbor search strategy in which we only adjust the scores for the classes nearest to any query descriptor. The question becomes, “What does this descriptor look like?”, instead of “Does this descriptor look like one from a car? a duck? a face? a plane? …” Figure 1 gives a conceptual visualization.
2 Relation to previous work
An obvious issue with the naive Bayes approach is that it makes the unrealistic assumption that image features provide independent evidence for an object category.
In defense of the naive Bayes assumption, Domingos and Pazzani [9] demonstrate the applicability of the naive Bayes classifier even in domains where the independence assumption is violated. They show that while the independence assumption does need to hold in order for the naive Bayes classifier to give optimal probability estimates, the classifier can perform well as regards misclassification rate even when the assumption does not hold. They perform extensive evaluations on many real-world datasets that violate the independence assumption and show classification performance on par with or better than other learning methods.

Behmo et al. [2] correct NBNN for the case of unbalanced training sets. They implemented and compared a variant of NBNN that used one-vs-all binary classifiers, highlighting the effect of unbalanced training data. In the experiments we present, the training sets are approximately balanced, and we compare our results to the original NBNN algorithm. Behmo et al. also point out that a major practical limitation of NBNN is the time needed to perform the nearest neighbor search, which is what our work addresses.
The most recent work on NBNN is by Tuytelaars et al. [19]. They use the NBNN response vector of a query image as the input features for a kernel SVM. This allows for discriminative training and combination with other complementary features by using multiple kernels. The kernel NBNN gives increased classification accuracy over the basic NBNN algorithm. Our work is complementary to this in that the responses resulting from our local NBNN could also be fed into their second layer of discriminative learning. Due to the poor scaling of the original NBNN algorithm, Tuytelaars et al. had to heavily subsample the query images in order to obtain timely results for their experiments, hampering their absolute performance values. In NBNN, what dominates is the time needed to search for nearest neighbors in each of the object category search structures. Even approximate methods can be slow here and scale linearly with the number of categories.
The method we will introduce is a local nearest neighbor modification to the original NBNN. Other methods taking advantage of local coding include locality-constrained linear coding [20] and early-cutoff soft assignment [15]. Both limit themselves to using only the local neighborhood of a descriptor during the coding step. By restricting the coding to use only the local dictionary elements, these methods achieve improvements over their non-local equivalents. The authors hypothesize this is due to the manifold structure of descriptor space, which causes Euclidean distances to give poor estimates of membership in codewords far from the descriptor being coded [15].
NBNN methods can be compared to the popular spatial pyramid methods [4, 14, 20], which achieve state-of-the-art results on image categorization problems. The original spatial pyramid method used hard codeword assignment and average pooling within each of the hierarchical histogram bins. Today, the best performing variants of the spatial pyramid use local coding methods combined with max pooling [4, 5, 15, 20]. State-of-the-art spatial pyramid methods achieve high accuracy on benchmark datasets, but there has been no head-to-head comparison of NBNN methods against spatial pyramid methods. Previous work has only compared against published figures, but these comparisons are based on different feature sets, which makes it difficult to isolate the contributions of the features from the contributions of the methods.
3 Naive Bayes Nearest Neighbor
To help motivate and justify our modifications to the NBNN algorithm, this section provides an overview of the original derivation [3]. Each image $Q$ is classified as belonging to class $\hat{c}$ according to

$$\hat{c} = \arg\max_c \; p(c \mid Q) \qquad (1)$$

Assuming a uniform prior over classes and applying Bayes' rule,

$$\hat{c} = \arg\max_c \; p(Q \mid c) \qquad (2)$$

The assumption of independence of the descriptors $d_1, \dots, d_n$ found in image $Q$ gives

$$p(Q \mid c) = \prod_{i=1}^{n} p(d_i \mid c) \qquad (3)$$

$$\hat{c} = \arg\max_c \sum_{i=1}^{n} \log p(d_i \mid c) \qquad (4)$$

Next, approximating $p(d_i \mid c)$ in Equation 4 by a Parzen window estimator, with kernel $K$, gives

$$\hat{p}(d_i \mid c) = \frac{1}{L} \sum_{j=1}^{L} K\left(d_i - d_j^{c}\right) \qquad (5)$$

where there are $L$ descriptors in the training set for class $c$ and $d_j^{c}$ is the $j$th nearest descriptor to $d_i$ in class $c$. This can be further approximated by using only the $r$ nearest neighbors as

$$\hat{p}(d_i \mid c) = \frac{1}{L} \sum_{j=1}^{r} K\left(d_i - d_j^{c}\right) \qquad (6)$$

and NBNN takes this to the extreme by using only the single nearest neighbor ($r = 1$):

$$\hat{p}(d_i \mid c) = \frac{1}{L} K\left(d_i - \mathrm{NN}_c(d_i)\right) \qquad (7)$$

Choosing a Gaussian kernel for $K$ and substituting Equation 7 (the single nearest neighbor approximation of $p(d_i \mid c)$) into Equation 4 (the sum of log probabilities) gives:

$$\hat{c} = \arg\max_c \sum_{i=1}^{n} \log \left[ \frac{1}{L} \exp\left( -\frac{\lVert d_i - \mathrm{NN}_c(d_i) \rVert^2}{2\sigma^2} \right) \right] \qquad (8)$$

$$\hat{c} = \arg\min_c \sum_{i=1}^{n} \lVert d_i - \mathrm{NN}_c(d_i) \rVert^2 \qquad (9)$$

Equation 9 is the NBNN classification rule: find the class that minimizes the total squared distance from the query image's descriptors to their nearest neighbors in that class.
4 Towards local NBNN
Before introducing local NBNN, we first present some results demonstrating that we can be selective with the updates that we choose to apply for each query descriptor. We start by recasting the NBNN updates as adjustments to the posterior log-odds of each class. In this section, we show that only the updates giving positive evidence for a class are necessary.

The effect of each descriptor in a query image can be expressed as a log-odds update. This formulation is useful because it allows us to restrict updates to only those classes for which the descriptor gives significant evidence. Let $C$ be some class and $\bar{C}$ be the set of all other classes.
The odds $O(C)$ for class $C$ are given by

$$O(C) = \frac{P(C \mid d_1, \dots, d_n)}{P(\bar{C} \mid d_1, \dots, d_n)} \qquad (10)$$

$$= \frac{P(d_1, \dots, d_n \mid C)\, P(C)}{P(d_1, \dots, d_n \mid \bar{C})\, P(\bar{C})} \qquad (11)$$

$$= \prod_{i=1}^{n} \frac{P(d_i \mid C)}{P(d_i \mid \bar{C})} \cdot \frac{P(C)}{P(\bar{C})} \qquad (12)$$

Taking the log and applying Bayes' rule again gives:

$$\log O(C) = \sum_{i=1}^{n} \log \frac{P(d_i \mid C)}{P(d_i \mid \bar{C})} + \log \frac{P(C)}{P(\bar{C})} \qquad (13)$$

$$= \sum_{i=1}^{n} \left[ \log \frac{P(C \mid d_i)}{P(\bar{C} \mid d_i)} - \log \frac{P(C)}{P(\bar{C})} \right] + \log \frac{P(C)}{P(\bar{C})} \qquad (14)$$

Equation 14 has an intuitive interpretation. The prior log-odds are $\log \frac{P(C)}{P(\bar{C})}$. Each descriptor $d_i$ then contributes a change to the log-odds of a given class, determined by how the posterior odds of $C$ given $d_i$, $\frac{P(C \mid d_i)}{P(\bar{C} \mid d_i)}$, differ from the prior odds, as seen in the increment term of Equation 14. If the posterior odds are equal to the prior odds, the increment is zero; if the posterior odds are greater than the prior odds, the increment is positive; and if the posterior odds are less than the prior odds, the increment is negative.

This allows an alternative classification rule expressed in terms of log-odds increments:

$$\hat{c} = \arg\max_C \sum_{i=1}^{n} \left[ \log \frac{P(C \mid d_i)}{P(\bar{C} \mid d_i)} - \log \frac{P(C)}{P(\bar{C})} \right] \qquad (15)$$

where the prior term $\log \frac{P(C)}{P(\bar{C})}$ can be dropped if we assume equal class priors. The increment term is simple to compute if we estimate the class-conditional densities $P(d_i \mid C)$ with the nearest-neighbor kernel estimates, as in the original.
The benefit that comes from this formulation is that we can be selective about which increments to actually use: we can use only the significant log-odds updates. For example, we can decide to adjust the class posteriors only for classes to which the descriptor gives a positive contribution to the sum in Equation 15. Table 1 shows that this selectivity does not affect classification accuracy.
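As a hedged sketch of this selectivity (our own construction: the Gaussian-kernel posterior estimate and the choice of $\sigma$ are illustrative assumptions, not the paper's exact procedure), the following compares summing all log-odds increments against summing only the positive ones:

```python
import numpy as np

def logodds_scores(nn_dists, sigma=1.0, positive_only=False):
    """nn_dists: (n_descriptors, n_classes) array of squared distances from
    each query descriptor to its nearest neighbor in each class.
    Returns per-class sums of log-odds increments (Equation 15 style)."""
    C = nn_dists.shape[1]
    # Gaussian-kernel posterior estimate P(c | d_i), normalized over classes
    post = np.exp(-nn_dists / (2 * sigma ** 2))
    post /= post.sum(axis=1, keepdims=True)
    prior = 1.0 / C  # uniform prior over classes
    eps = 1e-12
    # increment: log posterior odds minus log prior odds
    inc = (np.log(post + eps) - np.log(1 - post + eps)) \
        - (np.log(prior) - np.log(1 - prior))
    if positive_only:
        inc = np.maximum(inc, 0.0)  # keep only positive evidence
    return inc.sum(axis=0)
```

On well-separated data, summing only the positive increments picks the same class as summing all of them, mirroring the result in Table 1.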
Method | Avg. # increments | Accuracy (%)
Full NBNN | 101 | 55.2
Positive increments only | 55.0 | 55.6

(± shows one standard deviation.)
5 Local NBNN
The selectivity introduced in the previous section shows that we do not need to update each class's posterior for each descriptor. This section shows that by focusing on a much smaller, local neighborhood (rather than on a particular log-odds threshold), we can use an alternative search strategy to speed up the algorithm, and also achieve better classification performance by ignoring the distances to classes far from the query descriptor.

Instead of performing a search for a query descriptor's nearest neighbor in each of the classes' reference sets, we search for only the nearest few neighbors in a single, merged dataset comprising all the features from all labelled training data from all classes. Doing one approximate k-nearest-neighbor search in this large index is much faster than querying each of the classes' approximate-nearest-neighbor search structures. This is a result of the sublinear growth in computation time with respect to index size for approximate nearest neighbor search algorithms, as discussed in Section 5.1. This allows the algorithm to scale up to handle many more classes, avoiding a prohibitive increase in runtime.
This is an approximation to the original method. For each test descriptor $d_i$ in a query image, we do not find a nearest neighbor from every class, only a nearest neighbor from the classes that are represented among the $k$ nearest descriptors to that test descriptor. We call this local NBNN, visualized in Figure 2.

It is important to properly deal with the set of background classes that were not found among the $k$ nearest neighbors of $d_i$. To handle these classes, we conservatively estimate their distance to be the distance to the $(k{+}1)$st nearest neighbor (this can be thought of as an upper bound on the density of background features). In practice, instead of adjusting the distance totals for every class, it is more efficient to adjust the distances only for the relatively few classes that were found among the $k$ nearest neighbors, discounting those adjustments by the distance to the background classes (the $(k{+}1)$st nearest neighbor). This does not affect the minimum.
The local NBNN algorithm is as follows:
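A minimal Python sketch of these steps (our illustration, not the authors' code; SciPy's exact cKDTree stands in for FLANN's approximate search):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_merged_index(class_descriptors):
    """Merge all classes' reference descriptors into one index,
    remembering each descriptor's class label."""
    rows, labels = [], []
    for c, refs in class_descriptors.items():
        rows.append(refs)
        labels.extend([c] * len(refs))
    return cKDTree(np.vstack(rows)), np.array(labels)

def local_nbnn_classify(query_descriptors, tree, labels, classes, k=10):
    """Local NBNN: for each query descriptor, look at its k+1 nearest
    neighbors in the merged index. Classes found among the first k get a
    distance adjustment discounted by the (k+1)st distance (the background
    estimate); classes not found are implicitly charged that background
    distance, which does not change the argmin."""
    totals = {c: 0.0 for c in classes}
    for d in query_descriptors:
        dists, idx = tree.query(d, k=k + 1)
        background = dists[k] ** 2  # squared distance to the (k+1)st neighbor
        closest = {}
        for dist, j in zip(dists[:k], idx[:k]):
            c = labels[j]
            if c not in closest:  # distances are sorted, so first hit is closest
                closest[c] = dist ** 2
        for c, d2 in closest.items():
            totals[c] += d2 - background  # discounted update (<= 0)
    return min(totals, key=totals.get)
```

The discounted update `d2 - background` is how the conservative background estimate is folded in without touching every class's total.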
5.1 Approximate nearest neighbors and complexity
Our algorithm scales with the log of the number of classes rather than linearly in the number of classes. This analysis depends on the nearest neighbor search structure that we use.
For both our implementation of the original NBNN and local NBNN, we use FLANN [17] to store descriptors in efficient approximate nearest neighbor search structures. FLANN is a library for finding approximate nearest neighbor matches that is able to automatically select from among several algorithms and tune their parameters for a specified accuracy. It makes use of multiple, randomized KD-trees as described by Silpa-Anan and Hartley [18] and is faster than single KD-tree methods like ANN [1] (used by Boiman et al. in the original NBNN) or locality sensitive hashing methods. The computation required and the accuracy of the nearest neighbor search are controlled by the number of leaf nodes checked in the KD-trees.
Following the analysis by Boiman et al. [3], let $N$ be the number of training images per class, $C$ the number of classes, and $n$ the average number of descriptors per image. In the original, each KD-tree contains $Nn$ descriptors, and each of the $n$ query descriptors requires an approximate search in each of the $C$ KD-tree structures. The accuracy of the approximate search is controlled by the number of distance checks, $m$. The time complexity for processing one query image under the original algorithm is $O(nCm \log(Nn))$. In our method, there is a single search structure containing $CNn$ descriptors in which we search for $k$ nearest neighbors (using $\tilde{m}$ distance checks, where $\tilde{m} > m$). The time complexity for processing one query image under our method is $O(n\tilde{m} \log(CNn))$. The $C$ term has moved inside of the $\log$ term.
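The two search layouts can be compared in a small sketch (ours; exact SciPy KD-trees stand in for FLANN's approximate randomized KD-trees, and the clustered synthetic data is an assumption for illustration). The original issues $C$ separate one-nearest-neighbor queries per descriptor, while local NBNN issues a single $k$-nearest-neighbor query in a merged index whose depth grows only logarithmically with the total number of stored descriptors:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
C, N_refs, dim = 20, 50, 8          # classes, descriptors per class, dimension

# Per-class indices: one KD-tree per class (the original NBNN's layout).
per_class = [cKDTree(rng.normal(c, 0.1, size=(N_refs, dim))) for c in range(C)]

# Merged index: all descriptors in one tree, with class labels kept alongside.
all_refs = np.vstack([t.data for t in per_class])
labels = np.repeat(np.arange(C), N_refs)
merged = cKDTree(all_refs)

query = rng.normal(3, 0.1, size=dim)  # a descriptor near class 3's cluster

# Original NBNN: C separate 1-NN queries (cost grows linearly in C).
per_class_dists = [t.query(query, k=1)[0] for t in per_class]
nearest_class_original = int(np.argmin(per_class_dists))

# Local NBNN: one k-NN query in the merged tree (log in total index size).
_, idx = merged.query(query, k=10)
nearest_class_merged = int(labels[idx[0]])
```

Both layouts agree on the nearest class; the difference is that the merged layout answers with one traversal instead of $C$.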
6 Experiments and results
We show results on both the Caltech 101 and Caltech 256 datasets [10, 11]. Each image is resized so that its longest side is 300 pixels, preserving aspect ratio. We train using 15 and 30 images, common reference points from previously published results. SIFT descriptors [16] are extracted on a dense, multi-scale grid, and we discard descriptors from regions with low contrast. We have attempted to match as closely as possible the extraction parameters used by Boiman et al. [3].¹

¹ Our code and the feature sets used in our experiments will be made available for ease of comparison.
We measure performance by the average per-class classification accuracy (the average of the diagonal of the confusion matrix), as suggested by [11].

Boiman et al. [3] also introduced an optional parameter, $\alpha$, that controls the importance given to the location of a descriptor when matching. For all experiments, we fix $\alpha$ to a single value chosen by coarse tuning on a small subset of Caltech 101.
As discussed, we use FLANN [17] to store reference descriptors extracted from the labelled images in efficient approximate nearest neighbor search structures.
6.1 Tuning Local NBNN
Figure 3 shows the effect of varying the cutoff, $k$, that defines the local neighborhood of a descriptor. This experiment shows that using a relatively low value for $k$ improves performance. Using too low a value for $k$ hurts performance, and using a much higher value for $k$ reverts to the performance of the original NBNN.
We also demonstrate that this improved accuracy comes with significant time savings over the original. Instead of building 101 indices, local NBNN uses a single index comprising all the training data, storing only a small amount of additional data: the object class of each descriptor.
We vary the computation afforded to both NBNN and local NBNN, and track the associated classification accuracy. For local NBNN, we do a search for 10 nearest neighbors, which returns an example from approximately 7 of the classes on average. The selection of an appropriately low number of nearest neighbors is important (see Figure 3).
To control the computation for each method, we adjust a parameter of FLANN's approximate nearest neighbor search: the number of leaf nodes checked in the KD-trees. This also determines the accuracy of the approximation: the higher the number of checks, the more expensive the nearest neighbor searches, and the more accurate the nearest neighbors retrieved. While FLANN does allow auto-tuning its parameters to achieve a particular accuracy setting, we fix the number of randomized KD-trees used by FLANN to 4 so that we can control the computation more directly. This setting achieves good performance with minimal memory use.
Figure 4 shows the results of this experiment. There are significant improvements in both classification accuracy and computation time. Searching each of the 101 separate class indices for just a single nearest neighbor, while checking only one leaf node in each of those search structures, was still slower than localized search in the merged dataset. Even doing 1000, 2000, or 4000 leaf node checks in the merged dataset is faster.
6.2 Scaling to many classes
Figure 5 further shows how the computation for these two methods grows as a function of the number of classes in the dataset. As new classes are added in our method, the depth of the randomized KD-tree search structures increases at a logarithmic rate. As we increase the number of classes to 256, local NBNN using the merged dataset runs 100 times faster than the original. In the original method, an additional search structure is required for each class, causing its linear growth rate: each query requires a best-bin-first traversal of each KD-tree. In contrast, when we query a single search structure for 10-30 nearest neighbors, the best-bin-first traversal from root to leaf happens only once, with the remainder of the nearest neighbors and distance checks being found by backtracking. The preprocessing time to build the indices is almost identical between the two methods.
6.3 Comparisons with other methods
Until now, no comparison has been done between NBNN and spatial pyramid methods using the same base feature set. We show those results in Table 2. (Runtime for the original NBNN on Caltech 256 was prohibitive, so we do not report those results.)
We choose to compare against two spatial pyramid methods: first, the original model introduced by Lazebnik et al. [14]; second, a recent variant by Liu et al. [15] that takes advantage of local soft assignment in a manner similar to our local cutoff, and that uses max pooling [6] rather than average pooling within each spatial histogram bin. We trained a codebook of size 1024 for each of the training set regimes (Caltech 101 with 15 and 30 training images, Caltech 256 with 15 and 30 training images). Our spatial pyramid had 3 levels (1x1, 2x2, and 4x4 histogram arrangements). For classification, we trained one-vs-all SVMs using the histogram intersection kernel [14] and used a fixed regularization term for all training regimes.
We also compare against some previously published figures for NBNN. Notably, local NBNN gives the best performance of any NBNN method to date.
While local NBNN (and NBNN) performs better than the original spatial pyramid model, it does not perform better than the model of Liu et al. The soft assignment avoids some of the information loss caused by quantization, and the discriminative training step provides an additional benefit.

The recent kernel NBNN of Tuytelaars et al. is a complementary contribution, and we suspect that combining local NBNN with the kernel NBNN would lead to even better performance. We hypothesize that this combination would let NBNN match or improve upon the performance of state-of-the-art spatial pyramid methods.

There are other results using a single feature type that have higher published accuracy on these benchmarks. For example, Boureau et al. [5] report higher accuracy on Caltech 101 and on Caltech 256 with 30 training images, but they use a macro-feature built on top of SIFT as their base feature, so their results are not directly comparable with our feature set. Combining different feature types would also yield higher performance, as shown frequently in the literature [3, 19].
Method | Caltech 101 (15 train) | Caltech 101 (30 train) | Caltech 256 (15 train) | Caltech 256 (30 train)

Results from literature
NBNN [3] | 65.0±1.14 | 70.4 | 30.5¹ | 37
NBNN [19] | 62.7±0.5 | 65.5±1.0 | – | –
NBNN kernel [19] | 61.3±0.2 | 69.6±0.9 | – | –

Results using our feature set
SPM (hard assignment, avg. pooling)² | 62.5±0.9 | 66.3±2.6 | 27.3±0.5 | 33.1±0.5
SPM (local soft assignment, max pooling)³ | 68.6±0.7 | 76.0±0.9 | 33.2±0.8 | 39.5±0.4
NBNN (our implementation) | 63.2±0.9⁴ | 70.3±0.6 | – | –
Local NBNN | 66.1±1.1 | 71.9±0.6 | 33.5±0.9 | 40.1

¹ Boiman et al. did not run an experiment with 15 images on this dataset. The 30.5 is interpolated from their plot.
² The original spatial pyramid match by Lazebnik et al. [14] (our reimplementation).
³ A recent variant of the spatial pyramid match from Liu et al. [15] (our reimplementation).
⁴ Our NBNN experiment achieves 63.2±0.9 compared to 65.0±1.14 from [3]. The original implementation is not available, and we have had discussions with the authors to resolve these differences in performance. We attribute the disparity to unresolved differences in the parameters of our feature extraction.
7 Conclusion
We have demonstrated that local NBNN is a superior alternative to the original NBNN, giving improved classification performance and a greater ability to scale to large numbers of object classes. Classification performance is improved by making adjustments only to the classes found in the local neighborhood comprising the $k$ nearest neighbors. Additionally, it is much faster to search through a merged index for only the closest few neighbors than to search for a descriptor's nearest neighbor in each of the object classes.
Our comparison against spatial pyramid methods confirms previous results [3] claiming that NBNN outperforms the early spatial pyramid models. Further, while NBNN is competitive with the recent state-of-the-art variants of the spatial pyramid, additional discriminative training (as in the NBNN kernel of Tuytelaars et al. [19]) may be necessary to obtain similar performance.
As new recognition applications such as web search attempt to classify ever larger numbers of visual classes, we can expect scalability with respect to the number of classes to continue to grow in importance. For example, ImageNet [8] is working to obtain labelled training data for each visual concept in the English language. With very large numbers of visual categories, it becomes even more apparent that feature indexing should be used to identify only those categories that contain the most similar features, rather than separately considering the presence of a feature in every known category.

References
 [1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891–923, Nov. 1998.
 [2] R. Behmo, P. Marcombes, A. Dalalyan, and V. Prinet. Towards optimal naive bayes nearest neighbor. In ECCV, pages 171–184, 2010.
 [3] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, June 2008.
 [4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, pages 2559–2566. IEEE, 2010.
 [5] Y.-L. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. In ICCV, 2011.
 [6] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, 2010.
 [7] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, 2004.
 [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, June 2009.
 [9] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2):103–130, 1997.
 [10] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, Apr. 2004.
 [11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
 [12] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In CVPR, 2005.
 [13] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: a branch and bound framework for object localization. PAMI, 31(12):2129–2142, Dec. 2009.
 [14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
 [15] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV, 2011.
 [16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
 [17] M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISSAPP, 2009.
 [18] C. Silpa-Anan and R. Hartley. Optimised KD-trees for fast image descriptor matching. In CVPR. IEEE, 2008.
 [19] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell. The NBNN kernel. In ICCV, 2011.
 [20] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367. IEEE, 2010.