1 Introduction
Practical object recognition problems often have to be solved under severe computation and time constraints. Some examples of interest are natural user interfaces, automotive active safety, robotic vision or sensing for the Internet of Things (IoT). Often the problem is to obtain high accuracy in real time, on a low power platform, or in a background process that can only utilize a small fraction of the CPU. In other cases the classifier is part of a cascade, or a complex multiple-classifier system. The accuracy-speed tradeoff has thus been widely discussed in the literature, and various architectures have been suggested [14, 35, 7, 4, 24, 31]. Here we focus on the extreme end of this tradeoff and ask: how accurate can we get with classifiers working in CPU microseconds?
As a thought experiment, the fastest classifier possible would be the one concatenating all the pixel values into a single index, and then using this index to access a table listing the labels of all possible images. Of course, this is not feasible due to the exponential requirements of memory and training set size, but this limit case points us in the direction to follow. The actual architecture we pursue compromises this limit idea in two main ways. First, instead of encoding the whole image with a single index, the image is treated as a dense set of patches where each patch is encoded using the same parameters. This is analogous to a convolutional layer in a convolutional neural network (CNN) [22], where the same set of filters is applied at each image position. Second, instead of describing a patch using a single long index, it is encoded with a set of short indices, which are used to access a set of reasonable-size tables. Votes of all tables at all positions are combined linearly to yield the classifier outputs. Variants of this architecture have been used successfully mainly for classification of depth images [20, 31]. Here we explore this regime for visual recognition in general, under the term Convolutional Tables Ensemble (CTE).
The idea of applying the same feature extraction on a dense grid of locations is very old and influential in vision, and is a key tenet in CNNs, the state-of-the-art in object recognition. It provides a good structural prior in the form of translation invariance. Another advantage lies in enhanced sample size for learning local feature parameters, since these can be trained from (number of training images) × (number of image patches) instances. The architectures we consider here are not deep in the CNN sense, and correspond to a single convolutional layer, followed by spatial pooling.
The main vehicle we use for obtaining high classification speed is the utilization of table-based feature extractors, instead of heavier computations such as applying a large set of filters in a convolutional layer. In table-based feature extraction, the patch is characterized using a set of fast bit functions, such as a comparison between two pixels. $K$ bits are extracted and concatenated into a word. This word is then used as an index into a set of weight tables, one per class, and the weights extracted provide the classes' support from this word. Support weights are accumulated across many tables and all image positions, and the label is decided according to the highest scoring class.
The power of this architecture is in the combination of fast-but-rich features with a high capacity classifier. Using $K$ quick bit functions, the representation considers all their $2^K$ combinations as features. The representation is highly sparse, with only $1/2^K$ of the features active at each position. The classifier is linear, but it operates over numerous highly nonlinear features. For $M$ tables and $C$ classes, the number of weights to optimize is $M 2^K C$, which can be very high even for modest values of $K$. The architecture hence requires a large training set to be used, and it effectively trades training sample size for speed and accuracy.
Pushing the speed-accuracy envelope using this architecture requires making careful structural and algorithmic choices. First, bit functions and image preprocessing should be chosen. We start with the simple functions employed in [20, 31], which were suitable for depth images, and extend them using gradient and color based channels and features employed in [9]. Another type of bit function introduced are spatial bits stating the rough location of the patch, which enable us to combine global and local pooling. A second important choice is between conditional computation of bit functions, leading to tree structures like those used in [31], and unconditional computation as in fern structures [20]. While trees may enable higher accuracy, ferns are better suited for vector processing (such as SSE instructions) and thus provide significant speed advantages. We explore between these ends empirically using a 'long tree' structure, whose configuration enables testing intermediate structures.
Several works have addressed the challenges of learning a tables-based classifier [12, 5, 34, 23, 6, 20, 28]. These vary in optimization effort from extremely random forests [12] to global optimization of table weights and greedy forward choice of bit functions [28, 20]. Our approach builds on previous approaches, mostly [20], and extends them with new possibilities. We learn the table ensemble by adding one table at a time, using a framework similar to the 'anyboost' algorithm [26, 2]. Training iterates between minimizing a global convex loss, differentiating this loss w.r.t. examples, and using these gradients to guide construction of the next table. For the global optimization we used two main options: an SVM loss as used in [20] and a softmax loss as commonly used in CNN training. For the optimization of the bit function parameters in a new fern/tree we developed several options: forward bit selection, iterative bit replacement, and iterative local refinement. In some cases, such as the threshold parameters of certain bits, an algorithm providing the optimal solution is suggested. The algorithms considered are described in Section 2.
Since CTEs can be much faster than CNNs, while the latter excel at accuracy, one would naturally like to merge their advantages if possible. In several recent studies [17, 15, 29], the output of an accurate but computationally expensive classifier is used to train another classifier, with a different and often computationally cheaper architecture. We made a preliminary attempt to use this technique, termed distillation in [15], to train a CTE classifier with a CNN teacher, with encouraging results on the MNIST data.
In Section 3.2 we present experiments demonstrating the performance gains of our techniques by comparison with the DFE method of [20], ablation studies, fern-tree tradeoff experiments, and distillation results. We use several publicly available object recognition benchmarks: MNIST [22], CIFAR-10 [19], SVHN [27] and 3HANDPOSE [20]. CTE achieves substantial error reductions over [20] on all four datasets (Table 1), including 3HANDPOSE, the original data used in [20]. For MNIST, we were able to train a CTE with error and close to running time using the distillation technique. Even higher accuracy of error can be obtained with a tree-based CTE, with some cost in running time.
In Section 3.3 we systematically experimented with CTE configurations to obtain accuracy-speed tradeoff graphs for the datasets mentioned. These graphs are compared to similar graphs obtained for CNNs. For the latter we trained NIN networks [25], combining state-of-the-art accuracy with significant speed advantages, and further accelerated them by scaling parameters of breadth, NIN output dimension and convolution stride. Our results indicate that for a highly restricted CPU budget CTEs provide significantly better accuracy than CNNs, or conversely, CTEs can obtain the same error with a CPU budget lower by . Typically this is true for classifiers operating below microseconds on a single CPU thread. For the very low CPU bound domain CTE can still provide useful results, whereas CNNs completely break. This makes CTEs a natural architecture choice for that domain. Alternatives to CTE may be provided by the literature dealing with CNN acceleration [35, 36, 1]. However, obtaining the speed gains made possible by CTEs using such techniques is far from trivial.
In summary, our main contribution in this paper is twofold. First, we develop new algorithms in the CTE framework, improving upon related art and extending the framework to general object recognition. Second, we pose an alternative to CNNs which enables improved accuracy in the highly CPU constrained regime. Short concluding remarks are given in Section 4.
2 Convolutional Tables Ensemble
We present the classifier structure in Section 2.1 and derive the learning algorithm in 2.2. The details and variations of structure and training appear in 2.3 and 2.4 respectively.
2.1 Notation and classifier structure
A convolutional table ensemble is a classifier $f:\mathcal{I}\to\{1,\dots,C\}$, where $\mathcal{I}\subset\mathbb{R}^{S_x\times S_y\times D}$ is the space of images and $C$ is the number of classes. The image $I\in\mathcal{I}$ may undergo a preparation stage in which additional feature channels ('maps') are added to it, so it is transformed to $I^e\in\mathbb{R}^{S_x\times S_y\times D_e}$ with $D_e\ge D$. After preparation, the ensemble sums the votes of $M$ convolutional tables, each of which in turn sums the votes of a word calculator over all pixels in an aggregation area. We now explain this process bottom-up.
Word calculator: For an image location $p=(x,y)$, we denote its neighborhood by $N(p)$, and by $I_{N(p)}$ the patch centered at location $p$. A word calculator is a feature extractor applied to such a patch and returning a $K$-bit index, i.e. a function $B:\mathbb{R}^{l\times l\times D_e}\to\{0,1\}^K$, where $l$ is the neighborhood size. In a classifier we have $M$ calculators denoted by $B_m(\cdot\,;\Theta_m)$, where $m=1,\dots,M$ and $\Theta_m$ are the calculator parameters. The calculator computes its output by applying $K$ bit functions to the patch, each producing a single bit. Several types of bit functions are discussed in Section 2.3.
Convolutional table: Each word calculator is applied to all locations in an integration area, and each word extracted casts votes for the output classes. A convolutional table is hence a triplet $(B_m, A_m, W^m)$, where $A_m\subseteq\{1,\dots,S_x\}\times\{1,\dots,S_y\}$ is the integration area of word calculator $B_m$, and $W^m\in\mathbb{R}^{C\times 2^K}$ is its weight matrix. Denote by $b^m_p=B_m(I_{N(p)};\Theta_m)$ the word calculated at location $p$. We gather a histogram $H^m=(H^m_0,\dots,H^m_{2^K-1})$ counting word occurrences; i.e.,
$$H^m_b=\sum_{p\in A_m}\delta(b^m_p-b) \qquad (1)$$
with $\delta$ the discrete delta function. The class support of the convolutional table is the $C$-element vector $W^m H^m$.
Convolutional tables ensemble: The ensemble classification is done by accumulating the class support of all convolutional tables into a linear classifier with a bias term. Let $H=[H^1,\dots,H^M]\in\mathbb{R}^{M2^K}$ and $W=[W^1,\dots,W^M]\in\mathbb{R}^{C\times M2^K}$. The classifier's decision is given by
$$\hat{c}=\arg\max_{c\in\{1,\dots,C\}}\,(WH+T)_c \qquad (2)$$
where $T\in\mathbb{R}^C$ is a vector of class biases. Algorithm box 1 shows the classifier's test time flow. Note that the histograms are not accumulated in practice, and instead each word computed directly votes for all classes.
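The test-time flow described above can be sketched in a few lines. This is only a minimal illustration, not the authors' optimized C implementation: it assumes a single-channel image, ferns built solely from two-pixel comparisons, and aggregation areas chosen so that all pixel offsets stay inside the image. The names `compute_word` and `classify` and the parameter layout are hypothetical.

```python
def compute_word(image, x, y, bit_params):
    """Fern word at location (x, y).

    Each bit compares the difference of two pixels near (x, y) to a threshold.
    bit_params is a list of (dx1, dy1, dx2, dy2, t) tuples (hypothetical layout).
    """
    word = 0
    for dx1, dy1, dx2, dy2, t in bit_params:
        diff = image[y + dy1][x + dx1] - image[y + dy2][x + dx2]
        word = (word << 1) | (1 if diff > t else 0)
    return word


def classify(image, tables, biases):
    """Sum the votes of all tables at all locations and return (label, scores).

    tables is a list of (bit_params, area, weights), where area is a half-open
    rectangle (x0, y0, x1, y1) and weights[word][c] is the vote of `word`
    for class c. Histograms are never built: each word votes directly.
    """
    scores = list(biases)
    for bit_params, (x0, y0, x1, y1), weights in tables:
        for y in range(y0, y1):
            for x in range(x0, x1):
                votes = weights[compute_word(image, x, y, bit_params)]
                for c, v in enumerate(votes):
                    scores[c] += v
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores
```

Because every word votes directly into the score vector, the cost per location is one table lookup per fern plus one addition per class, which is the property the architecture relies on for speed.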
2.2 Training
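In [20, 28] instances of convolutional tables ensembles were discriminatively optimized for specific tasks and losses (hand pose recognition using SVM in [20], and face alignment using regression in [28]). The main idea behind these methods is to iterate between solving a convex problem for a fixed representation, and augmenting the representation based on gradient signals from the obtained solution. Here we adapt these ideas to linear classification with an arbitrary $L_2$-regularized convex loss function, using techniques from [26, 2].
Assume a labeled training sample $\{(I_i,y_i)\}_{i=1}^N$ with fixed representation $\{H_i\}_{i=1}^N$, where $H_i\in\mathbb{R}^{M2^K}$ and $y_i\in\{1,\dots,C\}$, and denote the $c$-th row of the weight matrix $W$ by $w_c$. We want to learn a linear classifier of the form $\hat{c}=\arg\max_c s_c$ with $s_c=w_c\cdot H+t_c$ by minimizing a sample loss function of the form
$$L(W)=\frac{\lambda}{2}\|W\|^2+\sum_{i=1}^N l(s_i,y_i) \qquad (3)$$
with $s_i=(s_{i,1},\dots,s_{i,C})$ the class scores of example $i$ and $l$ a convex function of $s_i$. $L$ is strictly convex with a single global minimum, hence solvable using known techniques. Once the problem has been solved for the fixed representation $H$, we want to extend the representation by incorporating a new table, effectively adding $2^K$ new features. In order to choose the new features wisely, we consider how the loss changes if a new feature is added to the representation with small class weights, regarded as a small perturbation of the existing model.
Denote by $f_i$ the value of a new feature candidate for example $i$. After incorporating the new feature, example $i$'s representation changes from $H_i$ to $[H_i,f_i]$ and the weight vectors are augmented to $[w_c,\omega_c]$ with $\omega_c\in\mathbb{R}$. Class scores are updated to $s_{i,c}+\omega_c f_i$. Finally, the loss is changed to $L^{new}=\frac{\lambda}{2}(\|W\|^2+\|\omega\|^2)+\sum_i l(s_i+\omega f_i,y_i)$. Denote by $\omega=(\omega_1,\dots,\omega_C)$ the new weights vector. We assume that the new feature is added with small weights; i.e., $|\omega_c|\le\epsilon$ for all $c$. $l(s_i+\omega f_i,y_i)$ can be Taylor approximated around $s_i$, with the gradient $g_{i,c}=\partial l(s_i,y_i)/\partial s_{i,c}$:
$$l(s_i+\omega f_i,y_i)\approx l(s_i,y_i)+f_i\sum_{c=1}^C\omega_c\,g_{i,c} \qquad (4)$$
Using the gradient in a Taylor approximation of $L^{new}$, and neglecting the regularization term of $\omega$, which is second order in $\epsilon$, gives
$$L^{new}\approx L+\sum_{c=1}^C\omega_c\sum_{i=1}^N g_{i,c}\,f_i \qquad (5)$$
Denote $f=(f_1,\dots,f_N)$. For loss minimization we want to minimize $\sum_c\omega_c\sum_i g_{i,c}f_i$ over $f$ and $\omega$. For fixed $f$, minimizing over $\omega$ is simple. Denoting $R_c=\sum_i g_{i,c}f_i$, we have to minimize $\sum_c\omega_c R_c$ under the constraint $|\omega_c|\le\epsilon$ for all $c$. We can minimize each term in the sum independently to get $\omega_c=-\epsilon\,\mathrm{sign}(R_c)$, and the value of the minimum is $-\epsilon\sum_c|R_c|$. Hence, for a single feature addition, we need to maximize the score $\sum_c|R_c|$.
To return to our scenario, we add $2^K$ features at once, generated by a new word calculator $B$. The derivation above can be done for each of them independently, so for the addition of the $2^K$ features we get
$$R(B)=\sum_{b\in\{0,1\}^K}\sum_{c=1}^C\Big|\sum_{i=1}^N g_{i,c}\,H^B_{i,b}\Big| \qquad (6)$$
where $H^B_{i,b}$ is the count of word $b$ produced by $B$ over the aggregation area of example $i$.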
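To make the selection criterion concrete, the sketch below scores a candidate word calculator from the per-example loss gradients and the word histograms it would produce. It assumes the score takes the form derived above, namely the sum over new words and classes of the absolute correlation between gradients and word counts. The function name and the list-of-lists data layout are illustrative assumptions.

```python
def word_calculator_score(gradients, histograms):
    """Score of adding the features of one candidate word calculator.

    gradients[i][c]  : derivative of the loss of example i w.r.t. the score of class c
    histograms[i][b] : number of times word b occurs in example i
    Returns sum over words b and classes c of |sum_i gradients[i][c] * histograms[i][b]|.
    """
    n_words = len(histograms[0])
    n_classes = len(gradients[0])
    score = 0.0
    for b in range(n_words):
        for c in range(n_classes):
            correlation = sum(g[c] * h[b] for g, h in zip(gradients, histograms))
            score += abs(correlation)
    return score
```

A calculator whose word counts differ between examples with opposite gradients gets a high score, while one that produces the same histogram for every example scores zero whenever the gradients of each class sum to zero.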
2.3 Structural variants
The word calculator concept described in Section 2.1 is very general. Here we describe the bit functions and word calculator types we have explored.
Bit functions and input preparation: Word calculators compute an index descriptor of a patch by applying bit functions, each producing a single bit. Each such function is composed of a simple comparison operation, with a few parameters stating its exact operation. Specifically we use the following bit function forms:

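One pixel: $F=Q\big(I(p+\Delta_1,c)-t\big)$

Two pixels: $F=Q\big(I(p+\Delta_1,c)-I(p+\Delta_2,c)-t\big)$

Get Bit $l$: $F=\big(I(p,c)\gg l\big)\ \&\ 1$

Integral channel bit: $F=Q\big(\sum_{q\in r}I(q,c)-t\big)$
where $Q$ is the Heaviside step function, $p$ is the patch location, $\Delta_1,\Delta_2$ are pixel offsets inside the patch, $r$ is a rectangle inside the patch, $c$ is a channel index and $t$ is a threshold. The first two bit function types can be applied to any input channel $c$, while the latter two are meaningful only for specific channels. The channels we consider are as follows: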

Original image channels: Gray scale and color channels, or depth and IR (multiplied by a depth-based mask) for depth images.

Gradient-based channels: Two kinds of gradient maps are computed from the original channels following [9]. A normalized gradient channel includes the norm of the gradient for each pixel location. In oriented gradient channels the gradient energy of a pixel is softly quantized into orientation maps.

Spatial channels: Two channels stating the horizontal and vertical location of a pixel in the image. These channels state the quantized location, using and bits respectively.
After preparation, the channels are optionally smoothed by a convolution with a triangle filter. Spatial channels enable the incorporation of the patch position in the word computed. They are used with a 'Get Bit $l$' bit function type, with $l$ referring to the higher bits. This effectively puts a spatial grid over the image, thus turning the global summation pooling into local summation using a pyramid-like structure [21]. For example, using two bit functions, one checking the highest horizontal bit and one checking the highest vertical bit, effectively puts a $2\times 2$ grid over the image, where words are summed independently and get different weights for each quarter. Similarly, using more spatial bits one gets a finer pyramid, etc. We found that enforcing a different number of spatial bits in each convolutional table improves feature diversity and consequently the accuracy.
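The effect of spatial bits can be illustrated with a short sketch. It assumes the location is quantized uniformly to `n_bits` per axis and that the `used_bits` highest bits of each axis enter the word; both parameter names, the default values and the quantization rule are assumptions made for illustration.

```python
def get_bit(value, l):
    """'Get Bit l': return bit number l of an integer channel value."""
    return (value >> l) & 1


def spatial_cell(x, y, width, height, n_bits=4, used_bits=1):
    """Grid cell selected by the highest `used_bits` spatial bits of each axis.

    The pixel location is first quantized to n_bits per axis, as in the
    spatial channels; reading the top bits then indexes a
    2**used_bits by 2**used_bits grid over the image.
    """
    qx = (x << n_bits) // width    # quantized horizontal location
    qy = (y << n_bits) // height   # quantized vertical location
    cell_x = cell_y = 0
    for l in range(n_bits - 1, n_bits - 1 - used_bits, -1):
        cell_x = (cell_x << 1) | get_bit(qx, l)
        cell_y = (cell_y << 1) | get_bit(qy, l)
    return cell_x, cell_y
```

With one bit per axis the image splits into four quarters, and since the spatial bits are part of the word, each quarter indexes its own rows of the weight table. That is how global summation turns into local pooling without any extra pooling step.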
Word calculator structure: The main design decision in this respect is the choice between ferns and trees. For a word of $K$ bits, ferns include only $K$ bit functions, so the number of parameters is relatively small and overfitting during local optimization is less likely. Trees are a much larger hypothesis family, with up to $2^K-1$ bit functions in a full tree. Thus they are likely to enable higher accuracy, but also to be more prone to overfitting. We explored this tradeoff using a 'long tree' structure enabling a gradual interplay between the fern and full tree extremes.
In a long tree the $K$ bits to compute are divided into $N_S$ stages, with $K_s$ bits computed at stage $s$, so $\sum_{s=1}^{N_S}K_s=K$. A tree of depth $N_S$ is built, where a node in stage $s$ contains $K_s$ bit functions computing a $K_s$-bit word. A node in stage $s$ has $q_s$ children, and it has a child-directing table of size $2^{K_s}$, with entries containing child indices in $\{1,\dots,q_s\}$. Computation starts at stage $1$ at a root node, and after computation of the bits in a node the produced word is used as an index to the child-directing table, whose output is the index of the child node to descend to. The tree structure is determined by the vectors $(K_1,\dots,K_{N_S})$ and $(q_1,\dots,q_{N_S-1})$ of stage sizes and stage split factors respectively.
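The descent through a long tree can be sketched as follows. The dictionary-based node layout and the choice to return the concatenation of all bits computed along the path are assumptions made for illustration; the bit functions in the example are toy comparisons on a list.

```python
def long_tree_word(patch, root):
    """Descend a long tree and return the word formed along the path.

    Each node is a dict with:
      "bits"     : list of bit functions (patch -> 0 or 1), a small fern
      "children" : list of child nodes (empty at the last stage)
      "redirect" : child-directing table mapping the node's word to a child index
    """
    word, node = 0, root
    while node is not None:
        local = 0
        for bit_fn in node["bits"]:                 # fern internal to the node
            local = (local << 1) | bit_fn(patch)
        word = (word << len(node["bits"])) | local  # append this stage's bits
        if node["children"]:
            node = node["children"][node["redirect"][local]]
        else:
            node = None
    return word


# Toy two-stage tree: one bit at the root, one bit in each of two children.
leaf_a = {"bits": [lambda p: int(p[1] > 0)], "children": [], "redirect": []}
leaf_b = {"bits": [lambda p: int(p[2] > 0)], "children": [], "redirect": []}
root = {"bits": [lambda p: int(p[0] > 0)],
        "children": [leaf_a, leaf_b], "redirect": [0, 1]}
```

A fern is the special case of a single stage holding all bits, and a full binary tree is the case of one bit per stage with two children per node, so the same routine covers both extremes.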
When speed is considered, the most important point is that ferns can be efficiently implemented using vector operations (like SSE), constructing the word in several locations at the same time. The efficiency arises because computing the same bit function for several contiguous patches involves access to contiguous pixels, which can be done without expensive gather operations. Conversely, for trees different bit functions are applied at contiguous patches, so the accessed pixels are not contiguous in memory. As will be seen in Section 3, trees can be more accurate, but ferns provide a considerably better accuracy-speed tradeoff.
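Table 1: Classification error (%) of DFE, of the CTE baseline, and of the CTE baseline with a single ingredient removed (columns marked with '\').

Dataset    DFE   CTE base  \Opt TH  \Ftr Norm  \WC opt  \Chnls  \Smooth  \Spatial  \Sp. Enforce
MNIST      0.77  0.45      0.48     0.58       0.48     0.7     0.66     0.51      0.48
CIFAR-10   31.3  20.3      21.3     21.9       21.8     22.0    22.2     21.5      21.0
SVHN       11.9  6.5       7.1      7.1        10.5     11.6    7.2      13.2      7.6
3HANDPOSE  3.2   2.3       2.1      2.5        3.5      4.4     2.7      2.2       2.2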
2.4 Training variants
As stated in algorithm 2, training iterates between gradient based word calculator optimization and global optimization of table weights. We now describe the methods we explored for these two components.
Word calculator optimization: We consider several mechanisms for the optimization of the word calculator parameters, including forward bit function selection, optimal threshold finding, and iterative bit function replacement/refinement.
In forward selection, we optimize by adding one bit after the other. For fern growing there are such stages. At stage , candidate bit functions are generated, with their type and parameters drawn from a prior distribution. For each , we augment the current word calculator to and choose the one with the highest score. However, we found that simple greedy computation of at each stage is not the best way to optimize , and an auxiliary score which additively normalizes the newlyintroduced features does a better job. Denote the patch features of a word calculator by , by the value of for pixel in image and by the score induced by a patch feature . The addition of a new bit effectively replaces the feature for with new features and . If the gradients in cell are not balanced; i.e., , as is often the case, a feature may get a good score of even if the new bit function is constant, or otherwise uninformative. To handle this, we score a normalized version of the new features, with an average value of 0, which more effectively measures the added information in the new features. The following lemma shows that this is a valid, as well as computationally effective strategy:
Lemma 2.1
Let for and . The following properties hold

Using , in a classifier is equivalent to using ,; i.e, for any weight choice there are such that


with
The proofs are rather simple and appear in appendix 5. Property 1 shows that we may score features instead of features. Since only is affected by the new candidate bit, we can score only those terms when selecting among candidates. Property 3 shows that we can normalize the gradient instead of the feature candidates, which is cheaper (as there are candidates but only a single gradient vector). In summary, we optimize the next bit selection by maximizing
(7) 
over the choice of . The calculation requires a single histogram aggregation sweep over all patches .
Most of the bit functions obtain their bit by comparing an underlying patch measurement to a threshold. For such functions, the optimal threshold parameter can be found with a small additional computational cost. This is done by sorting the underlying measurement values and computing the sum in Equation 7 by running over them in sorted order. This way, a running statistic of the score can be maintained, computing the score for all possible thresholds and keeping the best.
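The sorted sweep can be sketched as below. To stay short, the sketch makes simplifying assumptions: it handles a single cell (for instance the first bit of a fern), it uses the plain sum of absolute gradient sums as the score rather than the normalized score of Equation 7, and it attaches to each patch the gradient vector of the image containing it. The function name and data layout are hypothetical.

```python
def best_threshold(values, grads):
    """Find the threshold t maximizing the split score of bit = (value > t).

    values[j] : underlying measurement of patch j
    grads[j]  : per-class loss gradients of the image containing patch j
    Score of a threshold: sum over classes of |gradient sum of patches with
    bit 0| + |gradient sum of patches with bit 1|.
    Returns (best_score, best_t).
    """
    n_classes = len(grads[0])
    order = sorted(range(len(values)), key=lambda j: values[j])
    total = [sum(g[c] for g in grads) for c in range(n_classes)]
    below = [0.0] * n_classes                 # running sums for bit == 0
    best_score = sum(abs(s) for s in total)   # t below all values: every bit is 1
    best_t = float("-inf")
    for pos, j in enumerate(order):
        for c in range(n_classes):
            below[c] += grads[j][c]
        # A threshold can only separate distinct measurement values.
        if pos + 1 < len(order) and values[order[pos + 1]] == values[j]:
            continue
        score = sum(abs(below[c]) + abs(total[c] - below[c])
                    for c in range(n_classes))
        if score > best_score:
            best_score, best_t = score, values[j]
    return best_score, best_t
```

After the sort, every candidate threshold is evaluated in time proportional to the number of classes, so the whole search costs one sort plus one linear pass instead of a full rescoring per threshold.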
For a long tree a similar algorithm is employed, with ferns internal to nodes optimized as full ferns, but tree splits requiring special treatment. Assume we are splitting a node in stage , so the current word calculator already computes a bit word, among which were computed in the current node. We now choose the first bit functions of all the children, as well as the redirection table, to optimize . Since different prefixes of the current calculator are augmented by different bit functions we need to decompose the score. Denote by the index set of bits computed by the current node, and by b(a) the limitation of a binary word to indices . For a bit word , we define the component of contributed by words with by
(8) 
For the tree split we draw a large set of candidate bits, and choose the first bits of the children by optimizing
(9) 
with the set of chosen bits for the children and entry in the redirection table set to the index of the child containing . For this optimization we compute the score matrix with . Given a choice of , amounting to a choice of column subset in , the optimization over is trivial and the score is easy to compute. We optimize over by exhaustively trying all choices of for , and greedily adding columns to until it contains members.
In addition to forward bit selection, we implemented iterative bit replacement and refinement stages. The rationale for this is the observation that while the last bit functions in a fern are chosen to complement the previous ones, the bits chosen at the beginning are not optimized to be complementary and may be suboptimal in a long word calculator. The bit replacement algorithm operates after forward bit selection. It runs over all the bit functions several times and attempts to replace each function with several randomly drawn candidates. A replacement step is accepted if it improves the score. In a similar manner, a bit refinement algorithm attempts to replace a bit function by small perturbations of its parameters, thus effectively implementing a local search. For trees, bit replacement/refinement is done only for bits inside a node, and once a split is made the node parameters are fixed.


Global optimization: We considered two global loss functions in our classification experiments: an SVM-based loss, and a softmax loss as typically used in neural network optimization. Below, $s_{i,c}$ denotes the score of class $c$ for example $i$ and $W$ the weight matrix. In the SVM loss, we take the sum of $C$ SVM programs, each minimizing a one-versus-all error. Let $y_{i,c}\in\{-1,1\}$ be binary class labels. The loss is
$$L^{SVM}=\frac{\lambda}{2}\|W\|^2+\sum_{i=1}^N\sum_{c=1}^C\max\big(0,\,1-y_{i,c}\,s_{i,c}\big) \qquad (10)$$
The loss aims for class separation in $C$ independent classifiers. Its advantage lies in the availability of fast and scalable methods for solving large and sparse SVM programs [30, 16]. The loss gradients are $g_{i,c}=-y_{i,c}$ if example $i$ is a support vector of classifier $c$, and $0$ otherwise. In [3] a first order approximation for the minimal loss is derived for new feature addition, in which the example gradients are $-y_{i,c}\,\alpha_{i,c}$, with $\alpha_{i,c}$ the dual SVM variables at the optimum. The two expressions are similar and we did not find a noticeable difference between them empirically. The softmax loss is
$$L^{SM}=\frac{\lambda}{2}\|W\|^2-\sum_{i=1}^N\log\frac{e^{s_{i,y_i}}}{\sum_{c=1}^C e^{s_{i,c}}} \qquad (11)$$
This loss provides a direct minimization of the class error. The gradients are $g_{i,c}=\frac{e^{s_{i,c}}}{\sum_{c'}e^{s_{i,c'}}}-\delta(y_i-c)$. Conveniently, it can be extended to a distillation loss [15], which enables guidance of the classifier using an internal representation of a well-trained CNN classifier.
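The per-example gradients that drive the construction of the next table are cheap to compute for the softmax loss. The sketch below assumes the standard softmax cross-entropy, whose gradient with respect to the scores is the predicted probability minus the target. As a simplified stand-in for the distillation extension, it optionally mixes the one-hot label with teacher probabilities using a weight `alpha`; the temperature scaling used in [15] is omitted, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_gradient(scores, label, teacher_probs=None, alpha=1.0):
    """Gradient of the loss w.r.t. the class scores of one example.

    Without a teacher (alpha = 1) this is p - onehot(label). With a teacher,
    the target is alpha * onehot(label) + (1 - alpha) * teacher_probs, i.e. a
    convex combination of the hard label and the teacher's soft labels.
    """
    p = softmax(scores)
    grad = []
    for c, pc in enumerate(p):
        target = alpha * (1.0 if c == label else 0.0)
        if teacher_probs is not None:
            target += (1.0 - alpha) * teacher_probs[c]
        grad.append(pc - target)
    return grad
```

Since both the probabilities and the target sum to one, the gradient components of each example sum to zero. Because of this, a new feature that is constant over all examples gains nothing in the first-order analysis.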
Features in a word histogram have significant variance, as some words appear in large quantities in a single image. Without normalization such words may be arbitrarily preferred due to their lower regularization cost, since they can be used with lower weights. Denote the column of a feature across all examples by . We found that normalizing each feature's column by the expected count of active examples improved accuracy and convergence speed in many cases.
3 Empirical results
We discuss our experimental setup in Section 3.1. In Section 3.2 we compare to related art and evaluate the contribution of algorithmic components to the performance. Results of speed-accuracy tradeoffs are presented in Section 3.3.
3.1 Implementation and data details
The experiments were conducted on publicly available datasets: MNIST, CIFAR-10, SVHN and 3HANDPOSE. The first three are standard recognition benchmarks in grayscale (MNIST) or RGB (CIFAR-10, SVHN), with 10 classes each. 3HANDPOSE is a 4-class dataset, with 3 hand poses and a fourth class of 'other', and its images contain depth and IR channels. The image sizes are between (MNIST) and (3HANDPOSE). The training set size ranges from (CIFAR-10) to (SVHN).
CTE training code was written in Matlab, with some routines using code from the packages [8, 11, 37]. The test time classifier was implemented and optimized in C. For ferns we implemented algorithm 1 with SSE operations. Words are computed for neighboring pixels together, and voting is done for classes at once. For trees we implemented a program generating efficient code of the bit computation loop for a specific tree, so the tree parameters are part of the code. This obtained an acceleration factor of over standard C code. We also thread-parallelized the algorithm over the convolutional tables, with a good speedup of obtained from cores. However, we report and compare single thread performance to keep the methodology as simple as possible.
CNN models were trained using MatConvNet [37]. The implementation is efficient, reported to be comparable to Caffe [39] in [37], with the convolutional and global layers reduced to matrix multiplication done using an SSE-optimized BLAS package. When measuring execution time, we measured net run time of the convolutional, pooling and global layers alone, without Matlab overhead. Time measurements were made on a Lenovo Thinkpad W530 quad core laptop, with an i7-3720QM core running at 2.6 GHz.
[Figure: panels for MNIST, CIFAR-10, SVHN and 3HANDPOSE]
3.2 Comparison and variation
Comparison with DFE: The Discriminative Ferns Ensemble (DFE) was suggested in [20] for classification of 3HANDPOSE, and can be seen as a baseline for CTE, which enhances it in many aspects. The first two columns in Table 1 present errors of DFE and CTE on the datasets, using ferns for MNIST, SVHN, 3HANDPOSE and for CIFAR-10. MNIST was trained with softmax distillation loss (see below for details), and the others with SVM loss. The aggregation areas were chosen to be identical for all tables in a classifier, forming a centered square occupying most of the image. To enable the comparison, M-class error rates are extracted from DFE (in [20] such errors are not reported, and class average true positive rates are reported instead). It can be seen that CTE base provides significant relative error reductions over DFE: roughly 35% to 45% on MNIST, CIFAR-10 and SVHN, and about 28% on 3HANDPOSE, where DFE was originally applied. Note that the CTE base is not the best choice for 3HANDPOSE. With additional parameter tuning result of can be obtained with ferns, which is an improvement of over DFE.
Ablation experiments: The accuracy obtained by a CTE is influenced by many small incremental improvements related to structural and algorithmic variations. Columns 3-9 in Table 1 show the contribution of some ingredients by removing them from the baseline CTE. For MNIST, where the effects are small due to the low error, results were averaged over experiments varying in their random seed, with a seed-induced std of . It can be seen that these ingredients consistently contribute to accuracy for non-depth data.
Trees/Ferns tradeoff: The tradeoff between ferns and trees for MNIST and CIFAR-10 is presented in Figure 2 (Right). For MNIST, the results were averaged over experiments, with a seed-induced std of . It can be seen that trees provide better accuracy. However, the speed cost of using trees is significant, due to the inability to efficiently vectorize their implementation.
Distillation experiments: We experimented with knowledge distillation from a CNN to a CTE using the method suggested in [15]. In such experiments, soft labels are taken from our best CNN model, and a CTE is trained to optimize a convex combination of the standard softmax loss and the Kullback-Leibler distance from the CNN-induced probabilities. We attempted this for MNIST and CIFAR-10 using our best CNN models, providing and error respectively as distillation sources. For MNIST, this training methodology proved to be successful. Averaging over seeds, the accuracy of a fern CTE optimized for softmax was (the std was ) without distillation, and with distillation. For comparison, an SVM-optimized CTE with the same parameters obtained error. For CIFAR-10 distillation did not consistently improve the results.
3.3 Speed-Accuracy tradeoff
We are interested in the tradeoff or Pareto curves [10], showing the best accuracy obtainable for a specific speed constraint and vice versa. Since the design space for variations of and algorithms is huge, and the training time of the algorithms is considerable, we needed to sample it wisely to get a good curve approximation. Our sampling technique is based on two stages. In stage 1 we searched for the most accurate classifiers for CTE and CNN with loose speed constraints, so even slow classifiers were considered. We then used the few top accuracy variants of each architecture as baselines and accelerated them by systematically varying certain design parameters.
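Extracting the tradeoff curve from a set of trained classifiers amounts to keeping the configurations that no other configuration beats in both speed and error. A minimal sketch follows, assuming each classifier is summarized by a hypothetical `(time_us, error)` pair where lower is better for both.

```python
def pareto_front(points):
    """Return the non-dominated (time_us, error) points, sorted by time.

    A point is kept only if its error is strictly lower than the error of
    every point that is at least as fast.
    """
    front = []
    best_error = float("inf")
    for time_us, error in sorted(points):
        if error < best_error:
            front.append((time_us, error))
            best_error = error
    return front
```

Sorting by running time first means a single pass suffices: a classifier joins the curve only when it improves on the best error seen among all faster classifiers.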
Our CNN baseline architectures are variations of DeepCNiN(l,k) [13], with $l$ convolutional layers and width parameter $k$, implying usage of $nk$ maps at the $n$-th layer. It was shown in [13] that higher values provide better accuracy, but such architectures are much slower than a CPU millisecond and so they are outside our domain of interest. We experimented with dropout [33], parametric ReLU units [18], affine image transformations following [13], and HSV image transformations following [32]. Acceleration of the baseline architectures used three main parameters. The first was reducing the parameter $k$ controlling the network width. The second was reduction of the number of maps in the output of the NIN layers. This reduces the number of input maps for the next layer, and can dramatically save computation with relatively small loss of accuracy. The third was raising the convolution stride parameter from to . For CTEs, our exploration space was sketched in Section 2, and it includes both ferns and trees. The best performing configurations were then accelerated using a single parameter: the number of tables in the ensemble.
Tradeoff graphs for the datasets are shown in Figure 3. Classification speed in microseconds is displayed along the axis in log scale with base . For all datasets, there is a high speed regime where CTEs provide better accuracy than CNNs. Specifically CTEs are preferable for all datasets when less than microseconds are available for computation. Starting from microseconds and up CNNs are usually better, with CTEs still providing comparable accuracy for MNIST and 3HANDPOSE at the milliseconds regime. Viewed conversely, for a wide range of error rates, if the error rate is obtainable by a CTE, it is obtainable with significant speedups over CNNs. Some examples of this phenomenon are given in Figure 2 (Left). Note that while a working point of error for CIFAR-10 may seem high, the majority of the one-versus-one errors of such a classifier are lower than , which may be good enough for many purposes.
4 Conclusions and further work
We introduced improvements to the convolutional tables framework in terms of the bit functions used, word calculator structure, calculator optimization and global optimization. We have shown that for highly computationally constrained tasks CTEs may provide accuracy higher than CNNs. A natural direction for future research is to replace the flat structure of CTEs with a layered approach, in order to try to enjoy the accuracy of CNNs with the speed of CTEs.
References
 [1] L. Baoyuan, W. Min, F. Hassan, T. Marshall, and P. Marianna. Sparse convolutional neural networks. In CVPR, 2015.
 [2] A. Bar-Hillel, T. Hertz, and D. Weinshall. Object class recognition by boosting a part based model. In CVPR, 2005.
 [3] A. Bar-Hillel, D. Levi, E. Krupka, and C. Goldberg. Part-based feature synthesis for human detection. In ECCV, 2010.
 [4] R. Benenson, M. Mathias, R. Timofte, and L. J. V. Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
 [5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV, pages 1–8, 2007.
 [6] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Technical report, Microsoft Research, 2011.
 [7] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2013.
 [8] P. Dollár. Piotr’s Computer Vision Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
 [9] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
 [10] C. J. (Ed). Multicriteria Analysis. Springer-Verlag, 1997.
 [11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 [12] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Mach. Learn., 63(1):3–42, Apr. 2006.
 [13] B. Graham. Spatially-sparse convolutional neural networks. CoRR, abs/1409.6070, 2014.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729v2, 2014.
 [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
 [16] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In ICML, 2008.
 [17] B. Jimmy and C. Rich. Do deep nets really need to be deep? In NIPS, 2014.
 [18] H. Kaiming, Z. Xiangyu, R. Shaoqing, and S. Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
 [19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 [20] E. Krupka, A. Vinnikov, B. Klein, A. B. Hillel, D. Freedman, and S. Stachniak. Discriminative ferns ensemble for hand pose recognition. In CVPR, 2014.
 [21] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
 [22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
 [23] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. PAMI, 28:1465–1479, 2008.
 [24] D. Levi, S. Silberstein, and A. Bar-Hillel. Fast multiple-part based object detection using kd-ferns. In CVPR, 2013.
 [25] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2014.
 [26] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, pages 512–518, 2000.
 [27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [28] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In CVPR, June 2014.
 [29] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2015.
 [30] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, 2007.
 [31] J. Shotton, T. Sharp, A. Kipman, A. W. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 56(1):116–124, 2013.
 [32] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. A. Patwary, Prabhat, and R. P. Adams. Scalable bayesian optimization using deep neural networks. CoRR, abs/1502.05700, 2014.
 [33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
 [34] E. Tola, V. Lepetit, and P. Fua. A Fast Local Descriptor for Dense Matching. In CVPR, 2008.
 [35] L. Vadim, G. Yaroslav, R. Maksim, O. Ivan, and L. Victor. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. CoRR, abs/1412.6553, 2014.
 [36] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
 [37] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia, 2015.
 [38] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
 [39] J. Yangqing, S. Evan, D. Jeff, K. Sergey, L. Jonathan, G. Ross, G. Sergio, and D. Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, 2014.
5 Appendix


From the definition , so . Also, since , we have
Hence, for weights ,
So and fulfill the lemma’s statement.

Using we continue to

