Convolutional Tables Ensemble: classification in microseconds

02/14/2016 · Aharon Bar-Hillel et al. · Microsoft

We study classifiers operating under severe classification time constraints, corresponding to 1-1000 CPU microseconds, using the Convolutional Tables Ensemble (CTE), an inherently fast architecture for object category recognition. The architecture is based on convolutionally-applied sparse feature extraction, using trees or ferns, and a linear voting layer. Several structure and optimization variants are considered, including novel decision functions, a tree learning algorithm, and distillation from a CNN to a CTE architecture. Accuracy improvements of 24-45% over a comparable prior table-ensemble method are demonstrated on standard object recognition benchmarks. Using Pareto speed-accuracy curves, we show that CTE can provide better accuracy than Convolutional Neural Networks (CNN) for a certain range of classification time constraints, or alternatively provide similar error rates with a 5-200X speedup.


1 Introduction

Practical object recognition problems often have to be solved under severe computation and time constraints. Some examples of interest are natural user interfaces, automotive active safety, robotic vision or sensing for the Internet of Things (IoT). Often the problem is to obtain high accuracy in real time, on a low-power platform, or in a background process that can only utilize a small fraction of the CPU. In other cases the classifier is part of a cascade, or of a complex multiple-classifier system. The accuracy-speed trade-off has thus been widely discussed in the literature, and various architectures have been suggested [14, 35, 7, 4, 24, 31]. Here we focus on the extreme end of this trade-off and ask: how accurate can we get with classifiers working within CPU microseconds?

As a thought experiment, the fastest classifier possible would be one concatenating all the pixel values into a single index, and then using this index to access a table listing the labels of all possible images. Of course, this is not feasible due to the exponential requirements of memory and training set size, but this limit case points us in the direction to follow. The actual architecture we pursue compromises on this limit idea in two main ways: first, instead of encoding the whole image with a single index, the image is treated as a dense set of patches, where each patch is encoded using the same parameters. This is analogous to a convolutional layer in a convolutional neural network (CNN) [22], where the same set of filters is applied at each image position. Second, instead of describing a patch using a single long index, it is encoded with a set of short indices, which are used to access a set of reasonably-sized tables. Votes of all tables at all positions are combined linearly to yield the classifier outputs. Variants of this architecture have been used successfully mainly for classification of depth images [20, 31]. Here we explore this regime for visual recognition in general, under the term Convolutional Tables Ensemble (CTE).

The idea of applying the same feature extraction on a dense grid of locations is very old and influential in vision, and is a key tenet of CNNs, the state of the art in object recognition. It provides a good structural prior in the form of translation invariance. Another advantage lies in an enhanced sample size for learning the local feature parameters, since these can be trained from (number of training images) × (number of image patches) instances. The architectures we consider here are not deep in the CNN sense, and correspond to a single convolutional layer, followed by spatial pooling.

The main vessel we use for obtaining high classification speed is the utilization of table-based feature extractors, instead of heavier computations such as applying a large set of filters in a convolutional layer. In table-based feature extraction, a patch is characterized using a set of fast bit functions, such as comparisons between two pixels; K such bits are extracted and concatenated into a word. This word is then used as an index into a set of weight tables, one per class, and the weights extracted provide the classes' support from this word. Support weights are accumulated across many tables and all image positions, and the label is decided according to the highest scoring class.

The power of this architecture is in the combination of fast-but-rich features with a high-capacity classifier. Using K quick bit functions, the representation considers all of their 2^K combinations as features. The representation is highly sparse, with only a 2^{-K} fraction of the features active at each position. The classifier is linear, but it operates over numerous highly non-linear features. For M tables and C classes, the number of weights to optimize is M·2^K·C, which can be very high even for modest values of M and K. The architecture hence requires a large training set to be used, and it effectively trades training sample size for speed and accuracy.
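For a rough sense of scale (our own illustrative numbers, not reported settings from the paper): with M = 50 tables, K = 12 bits per word and C = 10 classes, the ensemble has M·2^K·C = 50·4096·10 ≈ 2·10^6 weights to learn, yet at test time only M·C = 500 of them are read per image location, one weight row per table.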

Pushing the speed-accuracy envelope using this architecture requires making careful structural and algorithmic choices. First, bit functions and image preprocessing should be chosen. We start with the simple functions employed in [20, 31], which were suitable for depth images, and extend them using gradient- and color-based channels and features employed in [9]. Another type of bit function introduced is the spatial bit, stating the rough location of the patch, which enables us to combine global and local pooling. A second important choice is between conditional computation of bit functions, leading to tree structures like those used in [31], and unconditional computation as in fern structures [20]. While trees may enable higher accuracy, ferns are better suited for vector processing (such as SSE instructions) and thus provide significant speed advantages. We explore the space between these ends empirically using a 'long tree' structure, whose configuration enables testing intermediate structures.

Several works have addressed the challenge of learning a tables-based classifier [12, 5, 34, 23, 6, 20, 28]. These vary in optimization effort from extremely randomized trees [12] to global optimization of table weights and greedy forward choice of bit functions [28, 20]. Our approach builds on previous approaches, mostly [20], and extends them with new possibilities. We learn the table ensemble by adding one table at a time, using a framework similar to the 'anyboost' algorithm [26, 2]. Training iterates between minimizing a global convex loss, differentiating this loss w.r.t. the examples' features, and using these gradients to guide construction of the next table. For the global optimization we used two main options: an SVM loss as used in [20] and a softmax loss as commonly used in CNN training. For the optimization of the bit function parameters in a new fern/tree we developed several options: forward bit selection, iterative bit replacement, and iterative local refinement. In some cases, such as the threshold parameters of certain bits, an algorithm providing the optimal solution is suggested. The algorithms considered are described in Section 2.

Since CTEs can be much faster than CNNs, while the latter excel at accuracy, one would naturally like to merge their advantages if possible. In several recent studies [17, 15, 29], the output of an accurate but computationally expensive classifier is used to train another classifier, with a different and often computationally cheaper architecture. We made a preliminary attempt to use this technique, termed distillation in [15], to train a CTE classifier with a CNN teacher, with encouraging results on the MNIST data.

In Section 3.2 we present experiments demonstrating the performance gains of our techniques by comparison with the DFE method of [20], ablation studies, fern-tree trade-off experiments, and distillation results. We use several publicly available object recognition benchmarks: MNIST [22], CIFAR-10 [19], SVHN [27] and 3-HANDPOSE [20]. CTE achieves error reductions of 24-45% over [20], with a 28% improvement on 3-HANDPOSE, the original data used in [20]. For MNIST, we were able to train a CTE with 0.45% error and roughly 100 microseconds running time using the distillation technique. Even higher accuracy, 0.39% error, can be obtained with a tree-based CTE, with some cost in running time.

In Section 3.3 we systematically experiment with CTE configurations to obtain accuracy-speed trade-off graphs for the datasets mentioned. These graphs are compared to similar graphs obtained for CNNs. For the latter we trained NIN networks [25], combining state-of-the-art accuracy with significant speed advantages, and further accelerated them by scaling parameters of breadth, NIN output dimension and convolution stride. Our results indicate that for a highly restricted CPU budget CTEs provide significantly better accuracy than CNNs, or alternatively, CTEs can obtain the same error rates with a CPU budget lower by 5-200X. Typically this is true for classifiers operating at the faster end of the microsecond range on a single CPU thread. In the very low CPU budget domain CTEs can still provide useful results, whereas CNNs break down completely. This makes CTEs a natural architecture choice for that domain. Alternatives to CTE may be provided by the literature dealing with CNN acceleration [35, 36, 1]. However, obtaining the speed gains made possible by CTEs using such techniques is far from trivial.

In summary, our main contribution in this paper is two-fold. First, we develop new algorithms in the CTE framework, improving upon closely related art and extending the framework to general object recognition. Second, we pose an alternative to CNNs which enables improved accuracy in the highly CPU-constrained regime. Short concluding remarks are given in Section 4.

2 Convolutional Tables Ensemble

Input: An image I of size S_x × S_y × D,
classifier parameters {(θ_m, A_m, W_m)}_{m=1}^M, b ∈ R^C
Output: A classifier decision in {1,…,C}
Initialization: For c = 1,…,C: s_c ← b_c
Prepare the extended image Î (add feature channels, optional smoothing)
For all tables m = 1,…,M
  For all pixels p ∈ A_m
    Compute the word k = F(Î_{N(p)}; θ_m)
    For c = 1,…,C: s_c ← s_c + W_m[k, c]
Return argmax_c s_c
Algorithm 1 Convolutional Tables Ensemble: Classification

We present the classifier structure in Section 2.1 and derive the learning algorithm in 2.2. The details and variations of structure and training appear in 2.3 and 2.4 respectively.

2.1 Notation and classifier structure

A convolutional tables ensemble is a classifier f : I → {1,…,C}, where I ∈ R^{S_x×S_y×D} is the input image and C is the number of classes. The image may undergo a preparation stage in which additional feature channels ('maps') are added to it, so it is transformed to Î ∈ R^{S_x×S_y×D'} with D' ≥ D. After preparation, the ensemble sums the votes of M convolutional tables, each of which in turn sums the votes of a word calculator over all pixels in an aggregation area. We now explain this process bottom-up.

Word calculator: For an image location p, we denote its neighborhood by N(p), and by Î_{N(p)} the patch centered at p. A word calculator is a feature extractor applied to such a patch and returning a K-bit index, i.e. a function F : R^{|N|·D'} → {0,…,2^K−1}, where |N| is the neighborhood size. In a classifier we have M calculators, denoted F^m = F(·; θ_m) for m = 1,…,M, where θ_m are the calculator parameters. The calculator computes its output by applying K bit functions to the patch, each producing a single bit. Several types of bit functions are discussed in Section 2.3.

Convolutional table: Each word calculator is applied to all locations in an integration area, and each word extracted casts votes for the output classes. Convolutional table m is hence a triplet (θ_m, A_m, W_m), where A_m ⊂ [1, S_x] × [1, S_y] is the integration area of word calculator F^m and W_m ∈ R^{2^K×C} is its weight matrix. Denote by F^m(p) the word calculated at location p. We gather a histogram H^m = (H^m_0,…,H^m_{2^K−1}) counting word occurrences; i.e.,

H^m_k = Σ_{p ∈ A_m} δ(F^m(p) − k)    (1)

with δ the discrete delta function. The class support of convolutional table m is the C-element vector W_m^T H^m.

Convolutional tables ensemble: The ensemble classification is done by accumulating the class support of all convolutional tables into a linear classifier with a bias term. Let H = [H^1; … ; H^M] ∈ R^{M·2^K} and W = [W_1; … ; W_M] ∈ R^{M·2^K×C}. The classifier's decision is given by

f(I) = argmax_{c ∈ {1,…,C}} (W^T H + b)_c    (2)

where b ∈ R^C is a vector of class biases. Algorithm box 1 shows the classifier's test-time flow. Note that the histograms are not accumulated in practice; instead, each word computed directly votes for all classes.
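As a concrete illustration of Algorithm 1, the following scalar C sketch (our own simplified code; the layout, type names and fixed sizes are assumptions, not the authors' implementation) accumulates class scores with each computed word voting directly for all classes:

/* Illustrative CTE classification loop (cf. Algorithm 1), scalar version.
   weights[m] is a 2^K x C table stored row-major; 'area' holds the
   aggregation rectangles (x0, y0, x1, y1). All names are ours. */
#include <string.h>

#define M 50            /* number of convolutional tables (example value) */
#define K 12            /* bits per word                  (example value) */
#define C 10            /* number of classes              (example value) */

typedef unsigned (*WordCalc)(const float *img, int width, int x, int y);

int cte_classify(const float *img, int width,
                 WordCalc calc[M], const int area[M][4],
                 const float weights[M][1 << K][C], const float bias[C])
{
    float score[C];
    memcpy(score, bias, sizeof(score));                 /* s_c = b_c */
    for (int m = 0; m < M; ++m)
        for (int y = area[m][1]; y < area[m][3]; ++y)
            for (int x = area[m][0]; x < area[m][2]; ++x) {
                unsigned k = calc[m](img, width, x, y); /* K-bit word */
                for (int c = 0; c < C; ++c)             /* vote for all classes */
                    score[c] += weights[m][k][c];
            }
    int best = 0;
    for (int c = 1; c < C; ++c)
        if (score[c] > score[best]) best = c;
    return best;                                        /* argmax_c s_c */
}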

Input: A labeled training set {(I^i, y^i)}_{i=1}^N
Parameters M, K, a convex loss l
Output: A classifier {(θ_m, A_m, W_m)}_{m=1}^M, b
Initialization: H^i ← empty representation for all i; g^i_c ← ∂l(s^i, y^i)/∂s^i_c evaluated at the zero classifier
For m = 1,…,M
  Table addition: choose (θ_m, A_m) to optimize R = Σ_k Σ_c |Σ_i g^i_c H^i_k| (Equation 6)
  Update representation: H^i ← [H^i, H^i(θ_m, A_m)] for all i
  Global optimization: train (W, b) by solving min_{W,b} L(W, b) (Equation 3)
  If m < M, get loss gradients: g^i_c ← ∂l(s^i, y^i)/∂s^i_c
Algorithm 2 Convolutional Tables Ensemble: Training

2.2 Training

In [20, 28], instances of convolutional tables ensembles were discriminatively optimized for specific tasks and losses (hand pose recognition using an SVM loss in [20], and face alignment using regression in [28]). The main idea behind these methods is to iterate between solving a convex problem for a fixed representation, and augmenting the representation based on gradient signals from the obtained solution. Here we adapt these ideas to linear multiclass classification with an arbitrary L2-regularized convex loss function, using techniques from [26, 2]. Assume a labeled training sample {(H^i, y^i)}_{i=1}^N with a fixed representation H^i ∈ R^D and labels y^i ∈ {1,…,C}, and denote the c-th row of the weight matrix W by w_c. We want to learn a linear classifier of the form f(H) = argmax_c (w_c · H + b_c) by minimizing a sample loss function of the form

L(W, b) = λ Σ_{c=1}^C ||w_c||² + Σ_{i=1}^N l(s^i, y^i),   s^i_c = w_c · H^i + b_c    (3)

with l a convex function of the scores. L is strictly convex with a single global minimum, hence solvable using known techniques. Once the problem has been solved for the fixed representation, we want to extend the representation by incorporating a new table, effectively adding new features. In order to choose the new features wisely, we consider how the loss changes if a new feature is added to the representation with small class weights, regarded as a small perturbation of the existing model.

Denote by f^i the value of a new feature candidate for example i. After incorporating the new feature, example i's representation changes from H^i to [H^i, f^i] and the weight vectors are augmented to [w_c, u_c] with u_c ∈ R. Class scores are updated to s^i_c + u_c f^i, and the loss changes to L(W, u, b), with u = (u_1,…,u_C) the new weights vector. We assume that the new feature is added with small weights; i.e., |u_c| ≤ ε for all c. L(W, u, b) can be Taylor approximated around u = 0, with the gradient

∂L/∂u_c |_{u=0} = Σ_{i=1}^N g^i_c f^i,   g^i_c ≐ ∂l(s^i, y^i)/∂s^i_c.    (4)

Using the gradient in a Taylor approximation of L gives

L(W, u, b) ≈ L(W, b) + Σ_{c=1}^C u_c Σ_{i=1}^N g^i_c f^i.    (5)

Denote r_c = Σ_{i=1}^N g^i_c f^i. For loss minimization we want to minimize over u and the choice of feature. For a fixed feature, minimizing over u is simple: we have to minimize Σ_c u_c r_c under the constraint |u_c| ≤ ε. We can minimize each term in the sum independently to get u_c = −ε·sign(r_c), and the value of the minimum is −ε Σ_c |r_c|. Hence, for a single feature addition, we need to maximize the score Σ_c |r_c| = Σ_c |Σ_i g^i_c f^i|.

To return to our scenario, we add 2^K features at once, generated by a new word calculator F over an aggregation area A. The derivation above can be done for each of them independently, so for the addition of the 2^K features we get

R(F, A) = Σ_{k=0}^{2^K−1} Σ_{c=1}^C | Σ_{i=1}^N g^i_c H^i_k |    (6)

where we used Equation 1 for H^i_k, the count of word k in the aggregation area of example i, and denoted g^i_c = ∂l(s^i, y^i)/∂s^i_c. The resulting training algorithm, summarized in algorithm box 2, iterates between global classifier optimization and greedy optimization of the next convolutional table by maximizing R.
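To make the selection criterion concrete, the following C sketch (our own illustration, not the authors' code) computes the score of Equation 6 for one candidate word calculator by a single sweep that accumulates per-word, per-class gradient sums:

/* Score R = sum_k sum_c | sum_i g[i][c] * H_k^i | for one candidate table.
   grad: N x C current loss gradients; words: N x P precomputed words, one per
   aggregation-area pixel of each example. Names and layout are illustrative. */
#include <stdlib.h>
#include <math.h>

float score_candidate(int N, int C, int K, int P,
                      const float *grad, const unsigned *words)
{
    int nwords = 1 << K;
    float *acc = calloc((size_t)nwords * C, sizeof(float)); /* sum_i g[i][c]*H_k^i */
    for (int i = 0; i < N; ++i)
        for (int p = 0; p < P; ++p) {
            unsigned k = words[(size_t)i * P + p];
            for (int c = 0; c < C; ++c)
                acc[k * C + c] += grad[i * C + c];
        }
    float R = 0.f;
    for (int j = 0; j < nwords * C; ++j)
        R += fabsf(acc[j]);
    free(acc);
    return R;
}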

2.3 Structural variants

The word calculator concept described in Section 2.1 is very general. Here we describe the bit functions and word calculator types we have explored.

Bit functions and input preparation: Word calculators compute an index descriptor of a patch by applying K bit functions, each producing a single bit. Each such function is composed of a simple comparison operation, with a few parameters stating its exact operation. Specifically we use the following bit function forms:

  • One pixel: B(P) = θ(P(x, y, c) − t)

  • Two pixels: B(P) = θ(P(x_1, y_1, c) − P(x_2, y_2, c) − t)

  • Get Bit b: B(P) = bit_b(P(x, y, c)), the b-th bit of the (integer-valued) channel entry

  • Integral channel bit: B(P) = θ(P(x_2, y_2, c) − P(x_1, y_2, c) − P(x_2, y_1, c) + P(x_1, y_1, c) − t), applied to an integral channel so that the compared quantity is an area sum

where θ is the Heaviside step function, (x, y) are pixel coordinates inside the patch, c is a channel index and t is a threshold parameter. The first two bit function types can be applied to any input channel, while the latter two are meaningful only for specific channels. The channels we consider are as follows:

  • Original image channels: Gray scale and color channels, or depth and IR (multiplied by a depth-based mask) for depth images.

  • Gradient-based channels: Two kinds of gradient maps are computed from the original channels following [9]. A normalized gradient channel includes the norm of the gradient for each pixel location. In oriented gradient channels the gradient energy of a pixel is softly quantized into orientation maps.

  • Integral channels: Integral images [38] of channels from the previous two forms, again following [9]. Applying integral channel bits to these channels allows fast calculation of channel area sums.

  • Spatial channels: Two channels stating the horizontal and vertical location of each pixel in the image. These channels state the quantized location, using a small number of bits for the horizontal and vertical coordinates respectively.

After preparation, the channels are optionally smoothed by convolution with a triangle filter. Spatial channels enable the incorporation of the patch position into the computed word. They are used with a 'Get Bit b' bit function, with b referring to the higher (most significant) bits of the quantized coordinate. This effectively puts a spatial grid over the image, thus turning the global summation pooling into local summation using a pyramid-like structure [21]. For example, using two such bit functions, checking the highest horizontal bit and the highest vertical bit, effectively puts a 2×2 grid over the image, where words are summed independently and get different weights for each quarter. Similarly, using more spatial bits yields a finer, pyramid-like partition. We found that enforcing a different number of spatial bits in each convolutional table improves feature diversity and consequently the accuracy.
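For concreteness, the bit function forms above can be coded as small predicates over a channel plane; the following C fragment is an illustrative sketch (our own parameter layout, not the paper's exact parameterization) of a one-pixel bit, a two-pixel bit and a spatial get-bit:

/* Illustrative bit functions over a single channel plane of width w.
   The '>' comparisons implement the Heaviside step used in the text. */
static inline float px(const float *chan, int w, int x, int y)
{ return chan[y * w + x]; }

/* One pixel: theta(P(x,y,c) - t) */
static inline int bit_one_pixel(const float *chan, int w, int x, int y, float t)
{ return px(chan, w, x, y) > t; }

/* Two pixels: theta(P(x1,y1,c) - P(x2,y2,c) - t) */
static inline int bit_two_pixel(const float *chan, int w,
                                int x1, int y1, int x2, int y2, float t)
{ return px(chan, w, x1, y1) - px(chan, w, x2, y2) > t; }

/* Spatial get-bit: bit b of a quantized patch coordinate (x or y). */
static inline int bit_spatial(int coord, int b)
{ return (coord >> b) & 1; }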

Word calculator structure: The main design decision in this respect is the choice between ferns and trees. Ferns include only K bit functions, so the number of parameters is relatively small and over-fitting during local optimization is less likely. Trees are a much larger hypothesis family, with up to 2^K − 1 bit functions in a full tree, so they are likely to enable higher accuracy, but also to be more prone to overfitting. We explored this trade-off using a 'long tree' structure enabling a gradual interplay between the fern and full-tree extremes.

In a long tree the K bits to compute are divided into S stages, with k_j bits computed at stage j, so Σ_j k_j = K. A tree of depth S is built, where a node in stage j contains k_j bit functions computing a k_j-bit word. A node in stage j has c_j children, and it has a child-directing table of size 2^{k_j}, with entries containing child indices in {1,…,c_j}. Computation starts at the root node in stage 1, and after computation of the k_j bits in a node the produced word is used as an index into the child-directing table, whose output is the index of the child node to descend to. The tree structure is determined by the vectors (k_1,…,k_S) and (c_1,…,c_S) of stage sizes and stage split factors respectively.

When speed is considered, the most important point is that ferns can be efficiently implemented using vector operations (like SSE), constructing the word in several locations at the same time. The efficiency arises because computing the same bit function for several contiguous patches involves access to contiguous pixels, which can be done without expensive gather operations. Conversely, for trees different bit functions are applied at contiguous patches so the accessed pixels are not contiguous in memory. As will be seen in Section 3, trees can be more accurate, but ferns provide considerably better accuracy-speed trade-off.
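To illustrate why ferns vectorize well, the following SSE2 sketch (a simplified illustration with 8-bit channels and the threshold fixed at zero; our own code, not the paper's implementation) evaluates one two-pixel bit for 16 horizontally contiguous patches using two contiguous loads:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* One bit of 16 contiguous words: bit_j = (P(x+j+dx1, y+dy1) > P(x+j+dx2, y+dy2)),
   j = 0..15, for an 8-bit channel of width w. Because the same bit function is
   applied at contiguous patches, both loads touch contiguous memory. */
static inline unsigned short fern_bit16(const unsigned char *chan, int w,
                                        int x, int y,
                                        int dx1, int dy1, int dx2, int dy2)
{
    __m128i a = _mm_loadu_si128((const __m128i *)(chan + (y + dy1) * w + x + dx1));
    __m128i b = _mm_loadu_si128((const __m128i *)(chan + (y + dy2) * w + x + dx2));
    /* unsigned a > b:  max(a,b) == a  and  a != b */
    __m128i gt = _mm_andnot_si128(_mm_cmpeq_epi8(a, b),
                                  _mm_cmpeq_epi8(_mm_max_epu8(a, b), a));
    return (unsigned short)_mm_movemask_epi8(gt);   /* 16 result bits */
}

The 16 bits returned for consecutive patches can then be shifted and OR-ed into 16 partial words; a tree cannot be vectorized this way because neighboring patches descend to different nodes and hence apply different bit functions at non-contiguous pixels.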

Dataset DFE CTE base \Opt TH \Ftr Norm \WC opt \Chnls \Smooth \Spatial \Sp. Enforce
MNIST 0.77 0.45 0.48 0.58 0.48 0.7 0.66 0.51 0.48
CIFAR-10 31.3 20.3 21.3 21.9 21.8 22.0 22.2 21.5 21.0
SVHN 11.9 6.5 7.1 7.1 10.5 11.6 7.2 13.2 7.6
3-HANDPOSE 3.2 2.3 2.1 2.5 3.5 4.4 2.7 2.2 2.2
Figure 1: Comparison and ablation: Columns one and two present the errors (in %) of a DFE [20] and a baseline CTE. The next columns show the results obtained when the CTE baseline is ablated in a single aspect. Opt TH: bit function thresholds are randomized instead of optimally chosen. Ftr Norm: histogram features are not multiplicatively normalized before training. WC opt: bit functions in the word calculator are randomly chosen, and not optimized. Chnls: appearance channels beyond the original ones are not used. Smooth: input channels are not smoothed during preprocessing. Spatial: spatial bits are not used. Sp. Enforce: spatial bits are not enforced (when enforced, a random number of spatial bits is enforced for each table).

2.4 Training variants

As stated in algorithm 2, training iterates between gradient based word calculator optimization and global optimization of table weights. We now describe the methods we explored for these two components.

Word calculator optimization: We consider several mechanisms for the optimization of a new word calculator, including forward bit function selection, optimal threshold finding, and iterative bit function replacement and refinement.

In forward selection, we optimize the word calculator by adding one bit function after the other; for fern growing there are K such stages. At each stage, a set of candidate bit functions is generated, with their types and parameters drawn from a prior distribution. For each candidate, we augment the current word calculator with it and keep the candidate with the highest score. However, we found that simple greedy maximization of the score of Equation 6 at each stage is not the best way to optimize the calculator, and an auxiliary score which additively normalizes the newly-introduced features does a better job. The patch features of the current word calculator are the per-word histogram counts of Equation 1. The addition of a new bit effectively replaces each such feature with two new features, counting the patches of the cell for which the new bit is 0 and 1 respectively. If the gradients in a cell are not balanced, i.e. their sum over the cell is far from zero, as is often the case, a new feature may get a good score even if the new bit function is constant, or otherwise uninformative. To handle this, we score a normalized version of the new features, with an average value of 0, which more effectively measures the added information in the new features. The following lemma shows that this is a valid, as well as computationally effective, strategy:

Lemma 2.1

Let h be one of the new features created by a candidate bit and let h̃ = h − h̄ be its additively normalized version, where h̄ is its average value over the training examples; likewise let ĝ^i_c = g^i_c − ḡ_c denote the correspondingly normalized gradients. The following properties hold:

  1. Using h̃ in a classifier is equivalent to using h; i.e., for any weight choice for one of them there is a choice of weights and biases for the other such that the two classifiers produce identical scores.

  2. Σ_i g^i_c h̃^i = Σ_i ĝ^i_c h^i; i.e., the score of the normalized feature can be computed using the normalized gradient and the original feature.

The proofs are rather simple and appear in Appendix 5. The first property shows that we may score the normalized features instead of the original ones. Since only the newly created features are affected by the new candidate bit, we can restrict the score to those terms when selecting among candidates. The second property shows that we can normalize the gradient instead of the feature candidates, which is cheaper (as there are many candidates but only a single gradient vector). In summary, we optimize the next bit selection by maximizing

R̃(b) = Σ_k Σ_{c=1}^C | Σ_{i=1}^N ĝ^i_c H^i_k(F ∪ {b}) |    (7)

over the choice of the candidate bit b, where F ∪ {b} denotes the current word calculator augmented with b and ĝ are the normalized gradients. The calculation requires a single histogram aggregation sweep over all patches in the integration area.

Most of the bit functions obtain their bit by comparing an underlying patch measurement to a threshold t. For such functions, the optimal threshold parameter can be found at a small additional computational cost. This is done by sorting the underlying measurement values of the patches and computing the sums in Equation 7 by running over the patches in sorted order. This way, a running statistic of the score can be maintained, computing the score for all possible thresholds and keeping the best.
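A minimal sketch of this sweep (our own simplified code: it scores a single split cell and attaches one gradient row per patch, which glosses over the per-example aggregation of Equation 7):

/* Choose the best threshold for a candidate measurement within one histogram
   cell. vals[j] is the measurement at patch j, g[j] points to its C per-class
   (normalized) gradient values. Bit = theta(val - t); we sweep all distinct
   thresholds in sorted order, maintaining the 'above threshold' gradient sums. */
#include <stdlib.h>
#include <math.h>

typedef struct { float val; const float *g; } Patch;

static int cmp_patch(const void *a, const void *b)
{
    float d = ((const Patch *)a)->val - ((const Patch *)b)->val;
    return (d > 0) - (d < 0);
}

float best_threshold(Patch *patch, int n, int C, float *t_out)
{
    qsort(patch, n, sizeof(Patch), cmp_patch);
    float *above = calloc(C, sizeof(float)), *total = calloc(C, sizeof(float));
    for (int j = 0; j < n; ++j)
        for (int c = 0; c < C; ++c) { above[c] += patch[j].g[c]; total[c] += patch[j].g[c]; }
    float best = -1.f;
    for (int j = 0; j < n; ++j) {                                /* threshold t = vals[j] */
        for (int c = 0; c < C; ++c) above[c] -= patch[j].g[c];   /* patch j drops below t */
        if (j + 1 < n && patch[j + 1].val == patch[j].val) continue;
        float s = 0.f;                                           /* score of the two new features */
        for (int c = 0; c < C; ++c) s += fabsf(above[c]) + fabsf(total[c] - above[c]);
        if (s > best) { best = s; *t_out = patch[j].val; }
    }
    free(above); free(total);
    return best;
}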

For a long tree a similar algorithm is employed, with the ferns internal to nodes optimized as full ferns, but with tree splits requiring special treatment. Assume we are splitting a node in stage j, so the current word calculator already computes a q-bit word, of which k_j bits were computed in the current node. We now choose the first bit functions of all the children, as well as the child-directing table, to optimize the score. Since different prefixes of the current calculator are augmented by different bit functions, we need to decompose the score. Denote by a the index set of the bits computed by the current node, and by b(a) the restriction of a binary word b to the indices in a. For a k_j-bit word v, we define the component of the score contributed by words b with b(a) = v by

R_v(β) = Σ_{c=1}^C Σ_{b: b(a)=v} | Σ_{i=1}^N ĝ^i_c H^i_b(F ∪ {β}) |    (8)

where β is a candidate bit function. For the tree split we draw a large set of candidate bits, and choose the first bits of the children by optimizing

max_{S, T} Σ_v R_v(β_{T(v)})    (9)

with S the set of chosen first bits for the children and T(v) the entry of the child-directing table for node word v, set to the index of the child containing the bit chosen for v. For this optimization we compute the score matrix M with M_{v,β} = R_v(β). Given a choice of S, amounting to a choice of a column subset of M, the optimization over T is trivial and the score is easy to compute. We optimize over S by exhaustively trying all choices when the number of children is small, and by greedily adding columns to S until it contains the required number of members otherwise.

In addition to forward bit selection, we implemented iterative bit replacement and refinement stages. The rationale for this is the observation that while the last bit functions in a fern are chosen to complement the previous ones, the bits chosen at the beginning are not optimized to be complementary and may be suboptimal in a long word calculator. The bit replacement algorithm operates after forward bit selection. It runs over all the bit functions several times and attempts to replace each function with several randomly drawn candidates. A replacement step is accepted if it improves the score. In a similar manner, a bit refinement algorithm attempts to replace a bit function by small perturbations of its parameters, thus effectively implementing a local search. For trees, bit replacement/refinement is done only for bits inside a node, and once a split is made the node parameters are fixed.
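A sketch of the replacement pass (illustrative only; BitFn, draw_random_bit and fern_score are placeholders we introduce for the paper's bit-function representation, candidate sampler and the score of Equation 7):

/* Iterative bit replacement: revisit each of the K bit functions of a fern a
   few times and keep a random replacement only if the ensemble score improves.
   A refinement pass is identical except that candidates are small parameter
   perturbations of the current bit rather than fresh random draws. */
typedef struct { int type; int params[6]; float thresh; } BitFn;  /* stand-in */
BitFn  draw_random_bit(void);                 /* hypothetical candidate sampler */
float  fern_score(const BitFn *fern, int K);  /* hypothetical score (Eq. 7)     */

void replace_bits(BitFn *fern, int K, int passes)
{
    float cur = fern_score(fern, K);
    for (int pass = 0; pass < passes; ++pass)
        for (int j = 0; j < K; ++j) {
            BitFn old = fern[j];
            fern[j] = draw_random_bit();      /* propose a replacement */
            float s = fern_score(fern, K);
            if (s > cur) cur = s;             /* accept improvement    */
            else         fern[j] = old;       /* otherwise revert      */
        }
}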

Dataset      Error   CTE (μs)   CNN (μs)   Speedup
MNIST        0.01    4.8        63.9       13.1
CIFAR-10     0.25    168.8      882        5.2
SVHN         0.15    18.4       88         4.7
3-HANDPOSE   0.035   6.3        1250       199.3

Dataset     Tree form                Error (%)   Speed (μs)
MNIST       Fern                     0.45        106
MNIST       Depth 3, 4-way splits    0.43        398
MNIST       Depth 6, 2-way splits    0.39        498
CIFAR-10    Fern                     21.0        800
CIFAR-10    Depth 3, 4-way splits    19.7        3111
CIFAR-10    Depth 4, 3-way splits    19.3        3544

Figure 2: Left: CTE-CNN speed differences: some examples of required accuracy points where a CTE can meet the accuracy while being considerably faster than a CNN. Right: Ferns/Trees trade-off: accuracy and speed for several fern/tree configurations on MNIST and CIFAR-10. The fern classifiers are the baseline classifiers whose results are reported in Figure 1. For trees the bits are split evenly between the stages (for MNIST the last stage gets one less bit), and all the split parameters are equal.

Global optimization: We considered two global loss functions in our classification experiments: an SVM-based loss, and a softmax loss as typically used in neural network optimization. In the SVM loss, we take the sum of C SVM programs, each minimizing a one-versus-all error. Let y^i_c = 1 if y^i = c and y^i_c = −1 otherwise be binary class labels. The loss is

L(W, b) = λ Σ_{c=1}^C ||w_c||² + Σ_{i=1}^N Σ_{c=1}^C max(0, 1 − y^i_c (w_c · H^i + b_c))    (10)

The loss aims for class separation in C independent classifiers. Its advantage lies in the availability of fast and scalable methods for solving large and sparse SVM programs [30, 16]. The loss gradients are g^i_c = −y^i_c if example i is a support vector for class c (its margin is smaller than 1), and 0 otherwise. In [3] a first-order approximation of the loss is derived for new feature addition, in which the example gradients are g^i_c = −α^i_c y^i_c, with α^i_c the dual SVM variables at the optimum. The two expressions are similar, and we did not find a noticeable difference between them empirically. The softmax loss is

L(W, b) = λ Σ_{c=1}^C ||w_c||² − Σ_{i=1}^N log p_{y^i}(H^i),   p_c(H) = e^{w_c·H + b_c} / Σ_{c'} e^{w_{c'}·H + b_{c'}}    (11)

This loss provides a direct minimization of the C-class error. The gradients are g^i_c = p_c(H^i) − δ(c = y^i). Conveniently, it can be extended to a distillation loss [15], which enables guidance of the classifier by the soft outputs of a well-trained CNN classifier.
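Written out under our notation (a sketch of the objective described in Section 3.2; q denotes the CNN-induced probabilities and α a mixing weight, both introduced here for illustration rather than reported settings), the distillation variant optimizes per example

ℓ_distill(s^i, y^i, q^i) = α · ℓ_softmax(s^i, y^i) + (1 − α) · KL(q^i ‖ p(s^i)),   p_c(s) = e^{s_c} / Σ_{c'} e^{s_{c'}},

whose per-class gradient is ∂ℓ_distill/∂s^i_c = α (p_c(s^i) − δ(c = y^i)) + (1 − α)(p_c(s^i) − q^i_c), so the training algorithm is unchanged except for the gradient definition.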

Features in a word histogram have significant variance, as some words appear in large quantities in a single image. Without normalization such words may be arbitrarily preferred due to their lower regularization cost: they can be used with lower weights. Denote the column of feature k across all examples by H_k = (H^1_k,…,H^N_k). We found that normalizing each feature column by the expected count of active examples improved accuracy and convergence speed in many cases.

3 Empirical results

We discuss our experimental setup in 3.1. In Section 3.2 we compare to related art and evaluate the contribution of algorithmic components to the performance. Results of speed-accuracy trade-offs are presented in 3.3.

3.1 Implementation and data details

The experiments were conducted on publicly available datasets: MNIST, CIFAR-10, SVHN and 3-HANDPOSE. The first three are standard recognition benchmarks in gray-scale (MNIST) or RGB (CIFAR-10, SVHN), with 10 classes each. 3-HANDPOSE is a 4-class dataset, with three hand poses and a fourth class of 'other', and its images contain depth and IR channels. The image sizes range between 28×28 (MNIST) and the larger 3-HANDPOSE images, and the training set sizes range from 50,000 images (CIFAR-10) up to the much larger SVHN training set.

CTE training code was written in Matlab, with some routines using code from the packages [8, 11, 37]. The test-time classifier was implemented and optimized in C. For ferns we implemented Algorithm 1 with SSE operations: words are computed for several neighboring pixels together, and voting is done for all classes at once. For trees we implemented a program generating efficient code of the bit-computation loop for a specific tree, so the tree parameters are part of the generated code; this obtained a significant acceleration over standard C code. We also thread-parallelized the algorithm over the convolutional tables, with a good speed-up obtained from multiple cores. However, we report and compare single-thread performance to keep the methodology as simple as possible.

CNN models were trained using MatConvNet [37]. The implementation is efficient, reported in [37] to be comparable to Caffe [39], with the convolutional and global layers reduced to matrix multiplications done using an SSE-optimized BLAS package. When measuring execution time, we measured the net run time of the convolutional, pooling and global layers alone, without Matlab overhead. Time measurements were made on a Lenovo ThinkPad W530 quad-core laptop, with an i7-3720QM core running at 2.6GHz.

Figure 3: Speed-accuracy trade-off curves for MNIST, CIFAR-10, SVHN and 3-HANDPOSE (one panel per dataset). The x-axis states classifier run time in microseconds, on a log scale. Each point states the (speed, accuracy) result of a single trained classifier. The lines are the lower envelopes of the classifiers from the CTE and CNN families. The number of classifiers trained per dataset is 172 (MNIST), 283 (CIFAR-10), 89 (SVHN) and 111 (3-HANDPOSE).

3.2 Comparison and variation

Comparison with DFE: The Discriminative Ferns Ensemble (DFE) was suggested in [20] for classification of 3-HANDPOSE, and can be seen as a baseline for CTE, which enhances it in many aspects. The first two columns in Table 1 present the errors of DFE and CTE on the four datasets, using fern-based CTEs; MNIST was trained with the softmax distillation loss (see below for details), and the others with the SVM loss. The aggregation areas were chosen to be identical for all tables in a classifier, forming a centered square occupying most of the image. To enable the comparison, multiclass error rates are extracted from the DFE (in [20] such errors are not reported, and per-class average true positive rates are reported instead). It can be seen that the CTE base provides significant error reductions of 24-45% over DFE, with 28% obtained for 3-HANDPOSE, where DFE was originally applied. Note that the CTE base is not the best choice for 3-HANDPOSE: with additional parameter tuning a better fern-based result can be obtained, further improving over DFE.

Ablation experiments: The accuracy obtained by a CTE is influenced by many small incremental improvements related to structural and algorithmic variations. Columns 3-9 in Table 1 show the contribution of some ingredients by removing them from the baseline CTE. For MNIST, where the effects are small due to the low error, results were averaged over several experiments varying in their random seed, with a small seed-induced standard deviation. It can be seen that these ingredients consistently contribute to accuracy for the non-depth data.

Trees/Ferns trade-off: The trade-off between ferns and trees for MNIST and CIFAR-10 is presented in Figure 2 (right). For MNIST, the results were averaged over several experiments, with a small seed-induced standard deviation. It can be seen that trees provide better accuracy. However, the speed cost of using trees is significant, due to the inability to efficiently vectorize their implementation.

Distillation experiments: We experimented with knowledge distillation from a CNN to a CTE using the method suggested in [15]. In such experiments, soft labels are taken from our best CNN model, and a CTE is trained to optimize a convex combination of the standard softmax loss and the Kullback-Leibler distance from the CNN-induced probabilities. We attempted this for MNIST and CIFAR-10, using our best CNN models as distillation sources. For MNIST, this training methodology proved to be successful: averaging over several seeds, a fern CTE optimized for the softmax loss was consistently more accurate with distillation than without it, reaching the 0.45% error reported in Figure 2; an SVM-optimized CTE with the same parameters was also evaluated for comparison. For CIFAR-10, distillation did not consistently improve the results.

3.3 Speed-Accuracy trade-off

We are interested in the trade-off, or Pareto, curves [10], showing the best accuracy obtainable for a specific speed constraint and vice versa. Since the design space of architecture variations and algorithms is huge, and the training time of the algorithms is considerable, we needed to sample it wisely to get a good curve approximation. Our sampling technique is based on two stages. In stage 1 we searched for the most accurate classifiers for CTE and CNN under loose speed constraints, so even slow classifiers were considered. We then used the few top-accuracy variants of each architecture as baselines and accelerated them by systematically varying certain design parameters.

Our CNN baseline architectures are variations of DeepCNiN(l,k) [13], with l convolutional layers and a width parameter k controlling the number of maps per layer. It was shown in [13] that higher l, k values provide better accuracy, but such architectures are much slower than a CPU millisecond and so are outside our domain of interest. We experimented with dropout [33], parametric ReLU units [18], affine image transformations following [13], and HSV image transformations following [32]. Acceleration of the baseline architectures used three main parameters. The first was reducing the parameter k controlling the network width. The second was reducing the number of maps in the output of the NIN layers; this reduces the number of input maps for the next layer, and can dramatically save computation with relatively small loss of accuracy. The third was raising the convolution stride parameter above 1. For CTEs, our exploration space was sketched in Section 2, and it includes both ferns and trees. The best performing configurations were then accelerated using a single parameter: the number of tables in the ensemble.

Trade-off graphs for the four datasets are shown in Figure 3. Classification speed in microseconds is displayed along the x-axis on a log scale. For all datasets, there is a high-speed regime where CTEs provide better accuracy than CNNs. Specifically, CTEs are preferable for all datasets when only a few tens of microseconds are available for computation. As the allowed budget grows, CNNs usually become the better choice, with CTEs still providing comparable accuracy for MNIST and 3-HANDPOSE in the millisecond regime. Viewed conversely, for a wide range of error rates, if an error rate is obtainable by a CTE, it is obtainable with significant speedups over CNNs. Some examples of this phenomenon are given in Figure 2 (left). Note that while a relatively high-error working point for CIFAR-10 may seem unattractive, the majority of the one-versus-one errors of such a classifier are much lower, which may be good enough for many purposes.

4 Conclusions and further work

We introduced improvements to the convolutional tables framework in terms of the bit functions used, the word calculator structure, calculator optimization and global optimization. We have shown that for highly computationally constrained tasks CTEs may provide higher accuracy than CNNs. A natural direction for future research is to replace the flat structure of CTEs with a layered approach, in order to try to enjoy the accuracy of CNNs with the speed of CTEs.

References

5 Appendix

Here we prove the statements of lemma 2.1 from Section 2.4.

    1. The normalized feature h̃ differs from the original feature h by a constant shift (its mean value). A constant feature contributes the same amount to every example's score, so its effect can be absorbed into the class biases: for any weights used with h there are weights and biases used with h̃ producing identical classifier scores, and vice versa, which fulfils the lemma's statement.

    2. The score depends on a feature only through the inner products Σ_i g^i_c h^i. Since Σ_i g^i_c (h^i − h̄) = Σ_i g^i_c h^i − h̄ Σ_i g^i_c = Σ_i (g^i_c − ḡ_c) h^i, scoring the normalized feature with the original gradient equals scoring the original feature with the normalized gradient, which is the lemma's second statement.