1 Introduction
In image classification, feature extraction is an important step, especially in domains where the training set has a high-dimensional space that requires more processing and memory resources. A recent trend in feature extraction for image classification is the construction of sparse features, which consist of the representation of the signal in an overcomplete dictionary. When the dictionary is learned specifically for the input dataset, the classification of sparse features can achieve results comparable to state-of-the-art classification algorithms (Mairal et al., 2012). However, this approach has a drawback at test time, as the sparse coding of the input test sample is computationally intense, making it impracticable for embedded applications that have scarce computational and power resources. A recent answer to this drawback is to learn a sparsifying transform from the target image dataset (Fawzi et al., 2014; Shekhar et al., 2014; Ravishankar and Bresler, 2013). At test time, this approach reduces the sparse coding of the input image to a simple matrix-vector multiplication followed by a soft-threshold, which can be efficiently realized in hardware due to its inherently parallel nature. Nevertheless, these matrix-vector multiplications require floating-point operations, which may have a high cost in hardware, especially in FPGAs, as floating-point operations require a much larger area and higher energy consumption.
Exploring some properties we derive from these classifiers, we propose a set of techniques to reduce their computational cost at test time, which we divide into four main groups: (i) use test images in their raw representation (integer) instead of their normalized version (floating-point) and thus replace the costly floating-point operations by integer operations, which are cheaper to implement in hardware and do not affect the classification accuracy; (ii) discretize both the transform dictionary and the classifier by approximating their elements by the nearest power of 2 and thus replace all multiplications by simple bit shifts, at the cost of a slight decrease in the classification accuracy; (iii) decrease the dynamic range of the test images by reducing the quantization level of the integer-valued test images; and (iv) decrease the dynamic range of the dictionary, first by penalizing the norm of its entries in the training phase and second by zeroing out the entries whose absolute values are smaller than a trained threshold. The last two techniques reduce the bit precision of the matrix-vector multiplication at the cost of a slight decrease in the classification accuracy.
As a case study for our techniques, we use a recent classification algorithm named Learning Algorithm for Soft-Thresholding classifier (LAST), which learns both the sparse representation of the signals and the hyperplane vector classifier at the same time. Our tests use the same datasets used in the paper that introduced LAST, and our results indicate that our techniques reduce the computational cost while not substantially degrading the classification accuracy. Moreover, on one particular dataset we tested, our techniques substantially increased the classification accuracy.
In this work, all simulations we ran to test our techniques were performed on image classification using LAST. Nevertheless, our proposed techniques are sufficiently general to be applied to different problems and different classification algorithms that use matrix-vector multiplications to extract features, such as the Extreme Learning Machine (ELM) (Huang et al., 2006) and Deep Neural Networks (DNN) (Schmidhuber, 2015). To the best of our knowledge, this paper presents the first generic approach to reduce the test-time computational cost of classifiers that are based on a learned transform. This has valuable applications in embedded systems where power consumption is critical and computational power is restricted. Furthermore, these techniques dismiss the need for DSPs for intense matrix-vector operations in FPGA architectures in the context of image classification, lowering the overall manufacturing cost of embedded systems.
2 Overview of Sparse Representation Classification
In this section, we briefly review both the synthesis and the analysis sparse representation of signals, along with the threshold operation used as a sparse coding approach (Section 2.1). We also review LAST (Section 2.2).
2.1 Sparse Representation of Signals
Let x ∈ ℝⁿ be a signal vector and D ∈ ℝⁿˣᴺ, with N > n, be an overcomplete dictionary. The sparse representation problem is to find the coefficient vector α such that ‖α‖₀ is minimum, i.e.,

(1) min_α ‖α‖₀ subject to x = Dα,

where ‖·‖₀ measures the number of nonzero coefficients. Therefore, the signal x can be synthesized as a linear combination of the vectors from the dictionary D selected by the nonzero coefficients, for which reason D is also called the synthesis operator. The solution of (1) requires testing all possible sparse vectors α, i.e., all combinations of columns of D taken a few at a time. This problem is NP-hard, but an approximate solution can be obtained by using the ℓ1 norm in place of the ℓ0 norm,

(2) min_α ‖α‖₁ subject to x = Dα,

where ‖·‖₁ is the ℓ1 norm. The problem of minimizing the ℓ1 norm of the coefficients among all decompositions is convex and can be solved efficiently. If the solution of (2) is sufficiently sparse, it will be equal to the solution of (1) (Donoho and Huo, 2001).
Sparse coding transforms (Ravishankar and Bresler, 2013) are another way of sparsifying a signal, where the dictionary is a linear transform that maps the signal to a sparse representation. For example, signals formed by the superposition of sinusoids have a dense representation in the time domain and a sparse representation in the frequency domain; for this type of signal, the Fourier transform is the sparse coding transform. Quite simply, the product of the transform with the signal x yields the sparse coefficient vector. In general, the transform can be a well-structured fixed basis, such as the DFT, or learned specifically for the target problem represented in the training dataset. A learned dictionary can be an overcomplete dictionary learned from the signal dataset, as in (Shekhar et al., 2014), a square invertible dictionary, as in (Ravishankar and Bresler, 2013), or even a dictionary without restrictions on the number of atoms, as in LAST (Fawzi et al., 2014). When a signal is corrupted by additive white Gaussian noise (AWGN), its transform will result in a coefficient vector that is not sparse. A common way of making it sparse is to apply a threshold operation to its entries right after the transform, where the entries lower than the threshold are set to zero. The soft-threshold is a threshold operator that, in addition to the threshold operation, subtracts the remaining values by the threshold, shrinking them toward zero (Donoho and Johnstone, 1994).
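As a concrete illustration, the classic soft-threshold operator can be sketched in a few lines of NumPy (the threshold value below is arbitrary):

```python
import numpy as np

def soft_threshold(z, t):
    """Zero entries with magnitude below t; shrink the rest toward zero by t."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

z = np.array([0.05, -0.3, 1.2, -0.02])
s = soft_threshold(z, 0.1)   # small entries vanish, survivors shrink by 0.1
```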
Let yᵢ be the coefficients of a sparse representation of a signal corrupted by AWGN, given by

(3) yᵢ = xᵢ + wᵢ,

where the wᵢ are independent and identically distributed as N(0, σ²), σ is the noise level, and the xᵢ are the coefficients of the sparse representation of the pure signal vector.
2.2 Learning Algorithm for Soft-Thresholding Classifier (LAST)
LAST (Fawzi et al., 2014) is an algorithm based on a learned transform followed by a soft-threshold, as described in Section 2.1. Differently from the original soft-threshold map presented in (4), LAST uses a soft-threshold version that also sets to zero all negative values, h_α(z) = max(0, z − α), where α is the threshold, also called the sparsity parameter. We chose LAST as our case study because of the simplicity of its learning process, as it jointly learns the sparsifying dictionary and the classifier hyperplane.
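The resulting test-time pipeline is a single matrix-vector product, the one-sided soft-threshold, and an inner product with the classifier. A minimal sketch, with random stand-ins for the learned dictionary and hyperplane (the dimensions and the sparsity parameter are illustrative):

```python
import numpy as np

def h(z, alpha):
    """LAST's one-sided soft-threshold: subtracts alpha and zeroes negatives."""
    return np.maximum(0.0, z - alpha)

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 50))   # stand-in for the learned dictionary
w = rng.standard_normal(50)         # stand-in for the learned hyperplane
alpha = 0.2                         # sparsity parameter
x = rng.standard_normal(64)         # test signal

features = h(D.T @ x, alpha)        # sparse feature extraction
label = np.sign(w @ features)       # predicted class
```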
For the m training cases xᵢ with labels yᵢ ∈ {−1, +1}, the sparsifying dictionary D, which contains N atoms, and the classifier hyperplane w are estimated using the supervised optimization

(5) min_{D,w} Σᵢ L(yᵢ wᵀ h_α(Dᵀ xᵢ)) + (ν/2) ‖w‖₂²,

where L(z) = max(0, 1 − z) is the hinge loss function and ν is the regularization parameter that prevents the overfitting of the classifier to the training set. At test time, the classification of each test case is performed by first extracting the sparse features h_α(Dᵀx) from the signal x, followed by the classification of these features, ŷ = sign(wᵀ h_α(Dᵀx)), where ŷ is the class returned by the classifier. We direct the reader to (Fawzi et al., 2014) for a deeper understanding of LAST.

3 Proposed Techniques
We present in this section our techniques to simplify the test-time computations of classifiers that are based on a learned transform and soft-threshold. We first present in Section 3.1 the theoretical and empirical findings that underlie our techniques and afterward present the techniques themselves in Section 3.2.
3.1 Theoretical Results on Computational Cost Reduction
For brevity, we coined the term powerize to concisely describe the operation of approximating a value by its closest power of 2.
Theorem 1.
The relative distance between any real scalar and its powerized version is upper bounded by 1/3.
Proof.
Let c ∈ [2ᵏ, 2ᵏ⁺¹], for some integer k, and let d be the distance between c and its powerized version. The distance is maximum when c is the middle point between the two closest powers of 2, which is c = (2ᵏ + 2ᵏ⁺¹)/2 = 3·2ᵏ⁻¹.
Therefore, the distance when c = 3·2ᵏ⁻¹ is d = 3·2ᵏ⁻¹ − 2ᵏ = 2ᵏ⁻¹, and so the maximum relative distance between c and its powerized version is d/c = 2ᵏ⁻¹/(3·2ᵏ⁻¹), which is equal to 1/3. ∎
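The powerize operation itself can be sketched as follows, rounding to the closest power of 2 in absolute distance as in the proof above (the tie-breaking rule at the midpoint is our choice):

```python
import numpy as np

def powerize(M):
    """Approximate each entry by its closest power of 2 in absolute distance.
    Signs are preserved, zeros stay zero, and ties at the midpoint round up."""
    M = np.asarray(M, dtype=float)
    out = np.zeros_like(M)
    nz = M != 0
    a = np.abs(M[nz])
    lo = np.exp2(np.floor(np.log2(a)))   # 2^k such that 2^k <= |m| < 2^(k+1)
    # The midpoint between 2^k and 2^(k+1) is 1.5 * 2^k (cf. Theorem 1).
    out[nz] = np.sign(M[nz]) * np.where(a < 1.5 * lo, lo, 2.0 * lo)
    return out
```

The relative error never exceeds 1/3, matching the bound of Theorem 1.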
We now show how the classification accuracy on the test set behaves when small variations are introduced in the entries of D and w. Using the datasets described in the beginning of this section, we trained 10 different pairs of D and w, with 50 atoms, and created 50 versions of each pair. Each of these versions was built by multiplying the elements of D and w by a random value chosen from a uniform distribution on an open interval around 1 whose width is controlled by a spread parameter. Next, we evaluated all of them on the test set. The results, shown in Figure 1, indicate a clear trade-off between the classification accuracy and how far the entries of the perturbed pairs are displaced from the corresponding entries of D and w, which is controlled by the spread parameter.

Hypothesis 1.
Both D and w can be powerized at the cost of a small decrease in the classification accuracy.
It is worth noting that Theorem 1 guarantees an upper bound of 1/3 for the relative distance between any real scalar and its powerized version. Therefore, it is reasonable to hypothesize that the classification accuracy using the powerized pair is no worse than that of the randomly perturbed pairs with a comparable spread, shown in Figure 1. To support this hypothesis, we ran another simulation using the datasets described in the beginning of this section. For this simulation, we trained 10 pairs of D and w with different training sets and evaluated them and their respective powerized versions on the test set. On both the bark versus woodgrain and the pigskin versus pressedcl datasets, the powerized models achieved accuracies close to those of the original models.
Theorem 2.
Let D and w be, respectively, the sparsifying dictionary and the linear classifier trained with the normalized training set (ℓ2 norm equal to 1). The classifications of the raw signals (integer values) and of the normalized signals are exactly the same when the sparsity parameter α is properly adjusted to the raw signals.
Proof.
Let x and x̄ = x/‖x‖₂ be, respectively, a raw vector from the test set and its normalized version, with ‖x‖₂ > 0, and let D and w be trained with normalized signals. The extracted features are Dᵀx̄ and the soft-thresholded features are h_α(Dᵀx̄) = max(0, Dᵀx̄ − α). Finally, the classification of x̄ is ŷ = sign(wᵀ h_α(Dᵀx̄)).
As the ℓ2 norm of any real vector different from the null vector is always greater than 0, we have max(0, Dᵀx/‖x‖₂ − α) = max(0, Dᵀx − α‖x‖₂)/‖x‖₂.
Therefore, as the sign function is invariant to scaling by the positive factor 1/‖x‖₂, the expressions sign(wᵀ max(0, Dᵀx̄ − α)), with threshold α, and sign(wᵀ max(0, Dᵀx − α‖x‖₂)), with threshold α‖x‖₂, are equivalent. ∎
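The equivalence in Theorem 2 is easy to check numerically; the dictionary, classifier, and signal below are random stand-ins, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 50))            # dictionary: 64-dim signals, 50 atoms
w = rng.standard_normal(50)                  # linear classifier
alpha = 0.1                                  # sparsity parameter
x = rng.integers(0, 256, 64).astype(float)   # raw, integer-valued test signal

soft = lambda z, t: np.maximum(0.0, z - t)   # LAST-style one-sided soft-threshold
n = np.linalg.norm(x)

y_norm = np.sign(w @ soft(D.T @ (x / n), alpha))  # classify the normalized signal
y_raw = np.sign(w @ soft(D.T @ x, alpha * n))     # classify raw signal, scaled threshold
# Both labels agree, as Theorem 2 states.
```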
Empirical evidence 1.
Increasing the sparsity of the dictionary D up to a certain level will decrease the minimum number of bits necessary to store it, and consequently also reduce the number of bits needed to compute the sparse representation, at the cost of a slight decrease in classification accuracy.
We hypothesized that forcing D to be sparse would decrease its dynamic range with no substantial decrease of its classification accuracy. To test our hypothesis, we performed another simulation with the datasets described in the beginning of this section. For each of the 14 hard-threshold values, linearly spaced over a fixed range, we averaged the results of 10 pairs of D and w trained on different training sets and evaluated on the test set. As shown in Figure 2(c), the first nonzero cut already reduces the number of bits needed to represent D by half while unexpectedly increasing the classification accuracy. Also, the third nonzero cut on D shown in Figure 2(d) maintains the classification accuracy while reducing its dynamic range to less than half of the original.
Empirical evidence 2.
Decreasing the quantization level of the integer-valued test images up to a certain level will decrease their dynamic range at the cost of a slight decrease in classification accuracy.
We also hypothesized that the original continuous signal may be unnecessarily finely quantized and that its quantization level may be decreased while not substantially affecting the classification accuracy. To test this hypothesis, we performed another simulation with the binary datasets described in the beginning of this section. In this simulation, we averaged the results of one thousand runs consisting of 10 pairs of D and w trained on different training sets and evaluated on the test set. Each training set was quantized with 1 to 15 quantization levels. The results are shown in Figure 3. It is worth noting in this figure that both datasets can be reduced to 2 bits (quantization levels equal to 2 and 3) with a limited decrease of the classification accuracy.
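The requantization step can be sketched as follows, assuming 8-bit inputs (the level count q and the input range are illustrative choices, not values from our simulations):

```python
import numpy as np

def requantize(x, q, in_levels=256):
    """Map integers in [0, in_levels) onto q evenly spaced levels (0 .. q-1)."""
    x = np.asarray(x)
    bins = np.floor(x * q / in_levels).astype(int)
    return np.clip(bins, 0, q - 1)

img = np.arange(256)          # all 8-bit gray values
coarse = requantize(img, 4)   # only 4 distinct values remain
```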
3.2 Proposed Techniques
Technique 1.
Use signals in their raw representation (integer) instead of their normalized version (floating-point).
Technique 2.
Powerize D and w.
Technique 3.
Decrease the dynamic range of the test set by reducing the quantization level of the integer-valued test images.
Technique 4.
Decrease the dynamic range of the entries of D by penalizing their norm during training, followed by hard-thresholding them using a trained threshold.
Our strategy to decrease the dynamic range of the dictionary involves the addition of a penalty on the norm of its entries during the minimization of the objective function of LAST, described in (5). The new objective function becomes

(6) min_{D,w} Σᵢ L(yᵢ wᵀ h_α(Dᵀ xᵢ)) + (ν/2) ‖w‖₂² + μ ‖D‖²_F,

where μ controls this new penalization. In Section 3.3, we show our proposed technique for including this penalization in general constrained optimization algorithms.
After training D and w using the modified objective function (6), we apply a hard threshold to the entries of D to zero out the values closest to zero. Our assumption is that these small values of D have little contribution to the final feature values and, thus, can be set to zero without much effect on the classification accuracy. As for the threshold value, we test all unique absolute values of D after it has been powerized using our Technique 2 and keep the best one. As the number of unique absolute values of D is substantially reduced by Technique 2, the computational burden of testing all possible values is small.
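The procedure above can be sketched as follows; powerize is a nearest-power-of-2 rounding helper, and the candidate threshold list is the set of unique absolute values it produces:

```python
import numpy as np

def powerize(M):
    """Round each entry to its closest power of 2 (sign kept, zeros kept)."""
    M = np.asarray(M, dtype=float)
    out = np.zeros_like(M)
    nz = M != 0
    a = np.abs(M[nz])
    lo = np.exp2(np.floor(np.log2(a)))
    out[nz] = np.sign(M[nz]) * np.where(a < 1.5 * lo, lo, 2.0 * lo)
    return out

def candidate_thresholds(D):
    """Hard-threshold candidates: the unique absolute values of powerize(D)."""
    P = np.abs(powerize(D))
    return np.unique(P[P > 0])

def hard_threshold(D, t):
    """Zero out the entries of D whose absolute value falls below t."""
    D = np.asarray(D, dtype=float)
    return np.where(np.abs(D) >= t, D, 0.0)
```

Sweeping the short candidate list and keeping the threshold with the best held-out accuracy completes the technique.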
3.3 Inclusion of an ℓ2-Norm Penalization Term in Dictionary Training Algorithms Based on Constrained Optimization
We show how to include a term in the objective function that penalizes potential dictionaries whose entries have larger energy values, as opposed to lower-energy dictionaries. By favoring vectors with lower energies, we may obtain dictionaries that span narrower ranges of values. We then show how to include this penalization in gradient descent (GD) methods, one of the most widely used families of optimization methods (Boyd and Vandenberghe, 2004).
Several dictionary and classifier training methods, such as (Fawzi et al., 2014; Ravishankar and Bresler, 2013), are based on constrained optimization programs of the form

(7) min_{d,c} f(d, c)
    subject to g(d, c) = 0,

where: (i) d is a vector containing the dictionary terms and c is a vector of classifier parameters; (ii) f is the cost function based on the training set; (iii) 0 is the null vector; (iv) and g is a function representing scalar equality constraints. Some methods also include inequality constraints.
In order to penalize the total energy associated with the dictionary entries, we can replace any problem of the form (7) by

(8) min_{d,c} f(d, c) + μ dᵀd
    subject to g(d, c) = 0,

where μ > 0 is a penalization weight.
Iterative methods are commonly used to solve constrained optimization problems such as (8) (Boyd and Vandenberghe, 2004). They start with an initial value for the variables, which is iterated to generate a (hopefully convergent) sequence satisfying

(9) θ⁽ᵏ⁺¹⁾ = θ⁽ᵏ⁾ + s⁽ᵏ⁾ Δθ⁽ᵏ⁾,

where θ collects the optimization variables, s⁽ᵏ⁾ is the step size, and Δθ⁽ᵏ⁾ is the step computed based on the particular iterative method.
We consider the GD method, where computing the step requires evaluating the gradient of a dual function associated with the objective function and the constraints (Boyd and Vandenberghe, 2004). The Lagrangian is the function used for this purpose: a point that satisfies the constraints and minimizes the objective corresponds to a stationary point of the Lagrangian. For problems (7) and (8), the Lagrangian functions are given respectively by

(10) Λ(d, c, λ) = f(d, c) + λᵀ g(d, c),
(11) Λ̄(d, c, λ) = f(d, c) + μ dᵀd + λᵀ g(d, c),

with λ the vector of Lagrange multipliers.
Our first objective regarding solving the modified problem (8) is to compute the gradient of Λ̄ in terms of the gradient of Λ, so as to show how a method that solves (7) can be modified in order to solve (8).
3.3.1 Including the Penalization Term in GD Methods
In GD optimization methods, the step depends directly upon the gradient of the dual function, as evaluated at the current iterate (Boyd and Vandenberghe, 2004). We now establish the relation between the gradients of Λ̄ and Λ, in order to determine the modification such methods need to include the penalization we propose.
By comparing (10) and (11), and by defining ∇ᵥ as the gradient of any function with respect to the vector v, note that Λ̄ = Λ + μ dᵀd. As ∇_d(μ dᵀd) = 2μd, it is easy to see that ∇_d Λ̄ = ∇_d Λ + 2μd.
In summary, the gradient of the modified Lagrangian can be computed from the original Lagrangian used in a given optimization problem by using the expressions

(12) ∇_d Λ̄ = ∇_d Λ + 2μd,
(13) ∇_c Λ̄ = ∇_c Λ,
(14) ∇_λ Λ̄ = ∇_λ Λ.
Equations (12), (13), and (14) show how we modify the estimated gradient in any GD method (such as LAST (Fawzi et al., 2014)) in order to penalize the range of the dictionary entries and thus favor a solution with a narrower range. Note that only the gradient with respect to the dictionary is altered.
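To see the modification in isolation, the toy unconstrained problem below uses ‖D − A‖²_F as a stand-in for the original objective (constraints ignored for simplicity) and adds the 2μD term from (12); the penalized minimizer is A/(1 + μ), so the range-shrinking effect is visible:

```python
import numpy as np

# Base objective: ||D - A||_F^2, whose gradient w.r.t. D is 2(D - A).
# The penalized gradient just adds the 2*mu*D term from (12).
A = np.array([[4.0, -2.0], [1.0, 3.0]])
mu = 1.0
D = np.zeros_like(A)
for _ in range(500):
    grad_base = 2.0 * (D - A)         # gradient of the base objective
    grad = grad_base + 2.0 * mu * D   # modification from (12)
    D -= 0.1 * grad
# With mu = 1, D converges to A / 2: the entry range of D is halved.
```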
3.4 Model Selection
Our simulations, described in Section 4, generated many different models due to the large number of parameter combinations of both Technique 3 and Technique 4. To select the best combination of parameters, we relied on the classification accuracy on a separate data set. The ranges of these parameters are defined in Section 4.1, as well as the parameter that controls the trade-off between the classification accuracy and the bit resolution of the final classifier. We used the following steps for the model selection: (i) First, we used part of the training set to train the models (D and w) and used the remaining part to estimate the best combination of the parameters. (ii) We gathered the set of models trained with all parameter combinations, their classification accuracies on the held-out data, and the best accuracy among them. (iii) From this set, we created the subset of models whose accuracy is within the trade-off tolerance of the best accuracy. (iv) From that subset, we created a new subset of the models that attain the lowest number of bits necessary for the computation of Dᵀx. (v) From this final subset, we chose the model that produces the sparsest feature representation.
The traditional rule of thumb for splitting a dataset into training and test portions is a safe way of estimating the true classification accuracy when the classification accuracy on the whole dataset is sufficiently high (Dobbin and Simon, 2011). As we are solely reserving part of the training set for the selection of the best parameter values, and not for the estimation of the true classification accuracy, we opted for a more conservative proportion to train our models. This has the advantage of lowering the chance of missing an underrepresented training set sample. Moreover, the last step in our model selection algorithm selects the model that produces the sparsest signal representation, as sparser representations tend to generalize better (Bengio et al., 2013).
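Steps (ii) to (v) amount to successive filters over the candidate models. A compact sketch, where each model is represented by a dict with illustrative fields and tol stands for the trade-off tolerance:

```python
# Each candidate model is summarized by illustrative fields: its held-out
# accuracy, the bits needed to compute D^T x, and its mean feature sparsity.
def select_model(models, tol=0.01):
    best_acc = max(m["acc"] for m in models)
    good = [m for m in models if m["acc"] >= best_acc - tol]    # step (iii)
    min_bits = min(m["bits"] for m in good)
    cheap = [m for m in good if m["bits"] == min_bits]          # step (iv)
    return max(cheap, key=lambda m: m["sparsity"])              # step (v)

models = [
    {"acc": 0.900, "bits": 16, "sparsity": 0.5},
    {"acc": 0.895, "bits": 8, "sparsity": 0.6},
    {"acc": 0.893, "bits": 8, "sparsity": 0.7},
    {"acc": 0.800, "bits": 4, "sparsity": 0.9},
]
best = select_model(models)   # the 8-bit model with the sparsest features
```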
4 Simulations
In this section, we evaluate how our techniques affect the accuracy of LAST on the same datasets used in (Fawzi et al., 2014). The datasets are described in Section 4.1, along with the parameter values we chose to evaluate our techniques, and the analysis of the results we obtained comes in Section 4.2.
4.1 Datasets and Choice of the Parameters
We used five out of the six datasets used in the paper that describes LAST (Fawzi et al., 2014), because we could not find the USPS dataset in integer representation. The simulations consist of training D and w with both the original version of LAST and the modified version built with the techniques presented in Section 3.2.
The first two datasets contain patches of textures extracted from the Brodatz dataset (Valkealahti and Oja, 1998). As in (Fawzi et al., 2014), the first task consisted in discriminating between the images bark versus woodgrain and the second task consisted in the discrimination of pigskin versus pressedcl. First, we separated both images into two disjoint pieces and took the training patches from one piece and the test patches from the other. As in (Fawzi et al., 2014), the training and test sets were built with 500 texture patches of fixed size. These patches were transformed into vectors and then normalized to unit ℓ2 norm.
The third binary dataset was built using a subset of the CIFAR-10 image dataset (Krizhevsky, 2009). This dataset contains 10 classes of 60 000 tiny RGB images, with 50 000 images in the training set and 10 000 in the test set. Each image has 3 color channels of 32×32 pixels and is stored in a vector of 3072 positions. The chosen images are those labeled as deer and horse.
The first multiclass dataset was the MNIST dataset (LeCun et al., 1998), which contains 70 000 images of handwritten digits of size 28×28, distributed as 60 000 images in the training set and 10 000 images in the test set. As in (Fawzi et al., 2014), all images have zero mean and unit ℓ2 norm.
The last task consisted in the classification of all 10 classes from the CIFAR10 image dataset.
For all datasets, we fixed the penalization parameter and let the hard threshold of Technique 4 assume all unique absolute values of the powerized version of D, i.e., after applying Technique 2. As the number of unique values of the powerized D is substantially lower than that of D, the computational burden necessary to test all valid thresholds is low. We also fixed the quantization parameter of Technique 3 and the trade-off parameter used in the model selection. The choice of these parameter values was empirically based on a previous run of all simulations. As for the parameters of LAST itself, we used the same values used in (Fawzi et al., 2014), and we direct the reader to that paper for further understanding of these parameters.
4.2 Results and Analyses
In this section, the original results are the ones from the classification of the test set using the model built with the original LAST algorithm. Conversely, the proposed results are the ones obtained from the classification of the test set using the best model built for each dataset. The best model is the one selected using the methodology presented in Section 3.4.
We show the results of our simulations on the binary tasks in Figure 4. As shown in Figures 4(d), 4(e), and 4(f), our techniques do not substantially decrease the original classification accuracy. At the same time, they considerably reduce the number of bits necessary to perform the multiplication Dᵀx, as shown in Figures 4(a), 4(b), and 4(c). This reduction allows the use of 32-bit single-precision floating-point arithmetic in GPUs instead of 64-bit double-precision floating-point arithmetic, which increases the computational throughput (Du et al., 2012).
One can note that the original results in Figures 4(d) and 4(e) are lower than the ones presented in (Fawzi et al., 2014). Differently from their work, we used disjoint training and test sets to allow a better estimation of the true classification accuracy.
Table 1 contains the results of the simulations on the MNIST and CIFAR-10 tasks. The original results we obtained for both large datasets have higher classification error than the ones reported in (Fawzi et al., 2014). We hypothesize that this is caused by the stochastic nature of LAST on larger datasets, where each GD step is computed on a small, randomly sampled portion of the data called a minibatch. Moreover, we trained D and w using only part of the training set used in (Fawzi et al., 2014), which may negatively affect the generalization power of the dictionary and classifier.
Note that our techniques resulted in a slight increase of the classification error on the MNIST task. Nevertheless, they reduced the number of bits necessary to run the classification at test time to less than half. Again, this dynamic range reduction is highly valuable for applications on both GPUs and FPGAs. As for the CIFAR-10 task, our techniques produced a model that has substantially lower error than the original model while using almost half of the number of bits at test time.
[Table 1: classification error and number of bits for Dᵀx on the MNIST and CIFAR-10 tasks]
The results we presented in this section indicate the feasibility of using integer operations in place of floating-point ones and bit shifts in place of multiplications, with a slight decrease of the classification accuracy. These substitutions reduce the computational cost of classification at test time in FPGAs, which is important in embedded applications, where power consumption is critical. Moreover, our techniques reduce by almost half the number of bits necessary to perform the most expensive operation in the classification, the matrix-vector multiplication Dᵀx. This is a result of applying both Technique 3 and Technique 4, and it enables the use of 32-bit single-precision floating-point operations in place of 64-bit double-precision ones in GPUs, which can almost double their computational throughput (Du et al., 2012).
Also, it is worth noting that our techniques were developed to reduce the computational cost of the classification with an expected accuracy reduction within acceptable limits. Nevertheless, the classification accuracies on the bark versus woodgrain dataset using our techniques substantially outperform the accuracies of the original model, as shown in Figure 4(a). These higher accuracies were unexpected. Regarding the original models, we noted that the classification accuracies on the training set were 100% when using dictionaries with at least 50 atoms. These models were probably overfitted to the training set, making them fail to generalize to new data. As our powerize technique introduces a perturbation to the elements of both D and w, we hypothesize that it reduced the overfitting of D and w to the training set and, consequently, increased their generalization power on unseen data (Pfahringer, 1995). However, this needs further investigation.

5 Conclusion
This paper presented a set of techniques for reducing the test-time computations of classifiers that are based on a learned transform and soft-threshold. In summary, the techniques are: adjust the threshold so the classifier can use signals represented in integer instead of their normalized floating-point version; reduce the multiplications to simple bit shifts by approximating the entries of both the dictionary and the classifier vector by the nearest power of 2; and increase the sparsity of the dictionary by applying a hard threshold to its entries. We ran simulations using the same datasets used in the original paper that introduced LAST, and our results indicate that our techniques substantially reduce the computational load at a small cost in classification accuracy. Moreover, on one of the datasets tested there was a substantial increase in the accuracy of the classifier. These optimization techniques are valuable in applications where power consumption is critical.
Acknowledgments
This work was partially supported by a scholarship from the Coordination of Improvement of Higher Education Personnel (Portuguese acronym CAPES). We thank the Dept. of ECE of the UTEP for allowing us access to the NSFsupported cluster (NSF CNS0709438) used in all the simulations here described and also Mr. N. Gumataotao for his assistance with it. We thank Mr. A. Fawzi for the source code of LAST and all the help with its details. We also thank Dr. G. von Borries for fruitful cooperation and discussions.
References
 Bengio et al. (2013) Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
 Boyd and Vandenberghe (2004) Boyd, S.P., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
 Dobbin and Simon (2011) Dobbin, K.K., Simon, R.M., 2011. Optimally splitting cases for training and testing high dimensional classifiers. BMC medical genomics 4, 31.
 Donoho and Huo (2001) Donoho, D.L., Huo, X., 2001. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47, 2845–2862.
 Donoho and Johnstone (1994) Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika Trust 81, 425–455.
 Du et al. (2012) Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J., 2012. From CUDA to OpenCL: towards a performance-portable solution for multi-platform GPU programming. Parallel Computing 38, 391–407.

 Fawzi et al. (2014) Fawzi, A., Davies, M., Frossard, P., 2014. Dictionary learning for fast classification based on soft-thresholding. International Journal of Computer Vision, 1–16.
 Huang et al. (2006) Huang, G.B., Zhu, Q.Y., Siew, C.K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70, 489–501.
 Krizhevsky (2009) Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324.
 Mairal et al. (2012) Mairal, J., Bach, F., Ponce, J., 2012. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 791–804.

 Pfahringer (1995) Pfahringer, B., 1995. Compression-based discretization of continuous attributes, in: Proc. 12th International Conference on Machine Learning, pp. 456–463.
 Ravishankar and Bresler (2013) Ravishankar, S., Bresler, Y., 2013. Learning Sparsifying Transforms. IEEE Transactions on Signal Processing 61, 1072–1086.
 Schmidhuber (2015) Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Networks 61, 85–117.
 Shekhar et al. (2014) Shekhar, S., Patel, V.M., Chellappa, R., 2014. Analysis sparse coding models for image-based classification, in: IEEE International Conference on Image Processing, Proceedings.
 Valkealahti and Oja (1998) Valkealahti, K., Oja, E., 1998. Reduced multidimensional cooccurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 90–94.