Computational Cost Reduction in Learned Transform Classifications

04/26/2015
by Emerson Lopes Machado, et al.

We present a theoretical analysis and empirical evaluation of a novel set of techniques for reducing the computational cost of classifiers that are based on a learned transform and soft-threshold. By modifying the optimization procedures for dictionary and classifier training, as well as the resulting dictionary entries, our techniques make it possible to reduce the bit precision and to replace each floating-point multiplication by a single integer bit shift. We also show how the optimization algorithms in some dictionary training methods can be modified to penalize higher-energy dictionaries. We applied our techniques to the Learning Algorithm for Soft-Thresholding (LAST) classifier, testing on the datasets used in its original paper. Our results indicate it is feasible to classify at test time using only integer sums and bit shifts, with a limited reduction of the classification accuracy. These low-power operations are a valuable trade-off in FPGA implementations, as they increase the classification throughput while decreasing both energy consumption and manufacturing cost.


1 Introduction

In image classification, feature extraction is an important step, especially in domains where the training set has a high-dimensional space that demands more processing and memory resources. A recent trend in feature extraction for image classification is the construction of sparse features, which consist of the representation of the signal in an overcomplete dictionary. When the dictionary is learned specifically for the input dataset, the classification of sparse features can achieve results comparable to state-of-the-art classification algorithms

(Mairal et al., 2012). However, this approach has a drawback at test time, as the sparse coding of the input test sample is computationally intensive, making it impractical for embedded applications with scarce computational and power resources.

A recent approach to this drawback is to learn a sparsifying transform from the target image dataset (Fawzi et al., 2014; Shekhar et al., 2014; Ravishankar and Bresler, 2013). At test time, this approach reduces the sparse coding of the input image to a simple matrix-vector multiplication followed by a soft-threshold, which can be efficiently realized in hardware due to its inherently parallel nature. Nevertheless, these matrix-vector multiplications require floating-point operations, which may have a high cost in hardware, especially in FPGAs, which require a much larger area and consume more energy when working with floating-point arithmetic.

Exploring some properties we derive from these classifiers, we propose a set of techniques to reduce their computational cost at test time, which we divide into four main groups: (i) use test images in their raw representation (integer) instead of their normalized version (floating-point), and thus replace the costly floating-point operations by integer operations, which are cheaper to implement in hardware and do not affect the classification accuracy; (ii) discretize both the transform dictionary and the classifier by approximating their elements to the nearest power of 2, and thus replace all multiplications by simple bit shifts, at the cost of a slight decrease in the classification accuracy; (iii) decrease the dynamic range of the test images by reducing the quantization level of the integer-valued test images; and (iv) decrease the dynamic range of the dictionary, first by penalizing the norm of its entries in the training phase and second by zeroing out entries whose absolute values are smaller than a trained threshold. The last two techniques reduce the bit precision of the matrix-vector multiplication at the cost of a slight decrease in the classification accuracy.

As a case study for our techniques, we use a recent classification algorithm named Learning Algorithm for Soft-Thresholding classifier (LAST), which learns the sparse representation of the signals and the hyperplane classifier at the same time. Our tests use the same datasets as the paper that introduced LAST, and our results indicate that our techniques reduce the computational cost without substantially degrading the classification accuracy. Moreover, on one particular dataset we tested, our techniques substantially increased the classification accuracy.

In this work, all simulations we ran to test our techniques were performed on image classification using LAST. Nevertheless, our proposed techniques are sufficiently general to be applied to different problems and different classification algorithms that use matrix-vector multiplications to extract features, such as the Extreme Learning Machine (ELM) (Huang et al., 2006) and Deep Neural Networks (DNN) (Schmidhuber, 2015).

To the best of our knowledge, this paper presents the first generic approach to reducing the computational cost at test time of classifiers based on a learned transform. This has a valuable application in embedded systems where power consumption is critical and computational power is restricted. Furthermore, these techniques remove the need for DSP blocks for intensive matrix-vector operations in FPGA architectures in the context of image classification, lowering the overall manufacturing cost of embedded systems.

2 Overview of Sparse Representation Classification

In this section, we briefly review the synthesis and analysis sparse representations of signals, along with the threshold operation used as a sparse coding approach (Section 2.1). We also review LAST (Section 2.2).

2.1 Sparse Representation of Signals

Let $x \in \mathbb{R}^n$ be a signal vector and $D \in \mathbb{R}^{n \times N}$, with $N > n$, be an overcomplete dictionary. The sparse representation problem is to find the coefficient vector $\alpha \in \mathbb{R}^N$ such that $\|\alpha\|_0$ is minimum, i.e.,

$\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad x = D\alpha,$    (1)

where $\|\alpha\|_0$ measures the number of nonzero coefficients. Therefore, the signal $x$ can be synthesized as a linear combination of the nonzero columns of the dictionary $D$, also called the synthesis operator. The solution of (1) requires testing all possible sparse vectors $\alpha$, which is a combination of $N$ elements taken $\|\alpha\|_0$ at a time. This problem is NP-hard, but an approximate solution can be obtained by using the $\ell_1$ norm in place of the $\ell_0$ norm,

$\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad x = D\alpha,$    (2)

where $\|\alpha\|_1 = \sum_{j=1}^{N} |\alpha_j|$ is the $\ell_1$ norm. Problem (2) minimizes the $\ell_1$ norm of the coefficients among all decompositions; it is convex and can be solved efficiently. If the solution of (2) is sufficiently sparse, it will be equal to the solution of (1) (Donoho and Huo, 2001).

The sparse coding transform (Ravishankar and Bresler, 2013) is another way of sparsifying a signal, where the dictionary is a linear transform that maps the signal to a sparse representation. For example, signals formed by the superposition of sinusoids have a dense representation in the time domain and a sparse representation in the frequency domain. For this type of signal, the Fourier transform is the sparse coding transform. Quite simply, $\alpha = Dx$ is the sparse transform of $x$, where $\alpha$ is the sparse coefficient vector. In general, the transform $D$ can be a well-structured fixed basis, such as the DFT, or learned specifically for the target problem represented in the training dataset. A learned dictionary can be an overcomplete dictionary learned from the signal dataset, as in (Shekhar et al., 2014), a square invertible dictionary, as in (Ravishankar and Bresler, 2013), or even a dictionary without restrictions on the number of atoms, as in LAST (Fawzi et al., 2014).

When a signal is corrupted by additive white Gaussian noise (AWGN), its transform will result in a coefficient vector that is not sparse. A common way of making it sparse is to apply a threshold operation to its entries right after the transform, where the entries whose absolute values are lower than the threshold are set to zero. The soft-threshold is a threshold operator that, in addition to the threshold operation, subtracts the threshold from the remaining values, shrinking them toward zero (Donoho and Johnstone, 1994).

Let $\tilde{\alpha}_i$ be the coefficients of a sparse representation of a signal corrupted by AWGN, given by

$\tilde{\alpha}_i = \alpha_i + \eta_i,$    (3)

where the $\eta_i$ are independent and identically distributed as $\mathcal{N}(0, \sigma^2)$, $\sigma$ is the noise level, and the $\alpha_i$ are the coefficients of the sparse representation of the pure signal vector.

Because the coefficients in (3) are sparse, there exists a threshold $\lambda$ that can separate most of the pure signal from the noise using the soft-thresholding operator

$h_\lambda(\tilde{\alpha}_i) = \operatorname{sign}(\tilde{\alpha}_i)\max(|\tilde{\alpha}_i| - \lambda, 0),$    (4)

where $\operatorname{sign}(\cdot)$ is the sign function. For classification tasks, the best estimate of $\lambda$ can be computed using the training set.
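
As a concrete illustration, below is a minimal NumPy sketch of the soft-threshold operator in (4); the function name is ours and not part of the original implementation.

```python
import numpy as np

def soft_threshold(alpha, lam):
    """Soft-threshold of Eq. (4): zero small coefficients, shrink the rest toward zero."""
    alpha = np.asarray(alpha, dtype=float)
    return np.sign(alpha) * np.maximum(np.abs(alpha) - lam, 0.0)

# Example: only coefficients with magnitude above lam survive (shrunk by lam).
print(soft_threshold([-2.0, -0.3, 0.1, 1.5], lam=0.5))  # [-1.5  0.   0.   1. ]
```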

2.2 Learning Algorithm for Soft-Thresholding Classifier (LAST)

LAST (Fawzi et al., 2014) is an algorithm based on a learned transform followed by a soft-threshold, as described in Section 2.1. Differently from the original soft-threshold map presented in (4), LAST uses a one-sided version that also sets to zero all negative values, $h_\lambda(z) = \max(0, z - \lambda)$, where $\lambda > 0$ is the threshold, also called the sparsity parameter. We chose LAST as our case study because of the simplicity of its learning process, as it jointly learns the sparsifying dictionary and the classifier hyperplane.

For the $m$ training cases $x_i \in \mathbb{R}^n$ with labels $y_i \in \{-1, +1\}$, the sparsifying dictionary $D \in \mathbb{R}^{n \times N}$, which contains $N$ atoms, and the classifier hyperplane $w \in \mathbb{R}^N$ are estimated using the supervised optimization

$\min_{D, w} \sum_{i=1}^{m} L\big(y_i\, w^\top \max(0, D^\top x_i - \lambda \mathbf{1})\big) + \frac{\nu}{2}\|w\|_2^2,$    (5)

where $L(t) = \max(0, 1 - t)$ is the hinge loss function and $\nu$ is the regularization parameter that prevents overfitting of the classifier to the training set. At test time, the classification of each test case is performed by first extracting the sparse features from the signal $x$, using $\alpha = \max(0, D^\top x - \lambda \mathbf{1})$, and then classifying these features using $\hat{y} = \operatorname{sign}(w^\top \alpha)$, where $\hat{y}$ is the class returned by the classifier. We direct the reader to (Fawzi et al., 2014) for a deeper understanding of LAST.
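
A minimal sketch of this test-time pipeline, assuming a trained dictionary $D$ ($n \times N$), hyperplane $w$, and sparsity parameter $\lambda$ are already available (the function and variable names are ours):

```python
import numpy as np

def last_classify(x, D, w, lam):
    """Classify a signal with a learned transform followed by a one-sided soft-threshold.

    x: (n,) test signal, D: (n, N) dictionary, w: (N,) classifier hyperplane, lam: sparsity parameter.
    """
    features = np.maximum(0.0, D.T @ x - lam)   # sparse feature extraction
    return 1 if w @ features >= 0.0 else -1     # binary decision from the sign of the score
```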

3 Proposed Techniques

We present in this section our techniques to simplify the test-time computations of classifiers that are based on a learned transform and soft-threshold. We first present, in Section 3.1, the theoretical results and empirical findings that underlie our techniques, and afterward the techniques themselves in Section 3.2.

3.1 Theoretical Results on Computational Cost Reduction

For brevity, we coined the term powerize to describe the operation of approximating a value to its closest power of 2.

Theorem 1.

The relative distance between any nonzero real scalar and its powerized version is upper bounded by $1/3$.

Proof.

Let $v$ be a real scalar with $|v| \in [2^k, 2^{k+1}]$ for some integer $k$, and let $d(v) = |v - P(v)|$ be the distance between $v$ and its powerized version $P(v)$. The distance is maximum when $|v|$ is the middle point between the two closest powers of 2, which is $|v| = 3 \cdot 2^{k-1}$.

Therefore, the distance at that point is $d = 2^{k-1}$, and so the maximum relative distance between $v$ and its powerized version is $2^{k-1} / (3 \cdot 2^{k-1})$, which is equal to $1/3$. ∎
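
For illustration, a minimal NumPy sketch of the powerize operation and of the resulting replacement of a multiplication by an integer bit shift (a sketch with our own naming, not the authors' code):

```python
import numpy as np

def powerize(v):
    """Approximate each nonzero entry by its closest power of 2, keeping its sign."""
    v = np.asarray(v, dtype=float)
    mag = np.abs(v)
    safe = np.where(mag > 0, mag, 1.0)            # avoid log2(0)
    lo = 2.0 ** np.floor(np.log2(safe))           # power of 2 just below |v|
    hi = 2.0 * lo                                 # power of 2 just above |v|
    nearest = np.where(safe - lo <= hi - safe, lo, hi)
    return np.where(mag > 0, np.sign(v) * nearest, 0.0)

# Theorem 1 in action: the relative distance never exceeds 1/3.
v = np.random.default_rng(0).standard_normal(10_000)
assert np.max(np.abs(v - powerize(v)) / np.abs(v)) <= 1.0 / 3.0 + 1e-12

# With a powerized weight 2**e and an integer sample, the product becomes a shift.
x, e = 37, 3
assert x * 2 ** e == x << e
```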

We now show how the classification accuracy on the test set behaves when small variations are introduced in the entries of $D$ and $w$. Using the datasets described in Section 4.1, we trained 10 different pairs of $D$ and $w$, with 50 atoms, and created 50 versions of each pair. Each of these versions $D_p$ and $w_p$, $p = 1, \dots, 50$, was built by multiplying the elements of $D$ and $w$ by random values chosen from a uniform distribution on an open interval around 1 whose width is controlled by a displacement parameter $\delta$. Next, we evaluated all of them on the test set. The results, shown in Figure 1, indicate a clear trade-off between the classification accuracy and how far the entries of $D_p$ and $w_p$ are displaced from the corresponding entries of $D$ and $w$, which is controlled by $\delta$.

(a) bark versus woodgrain
(b) pigskin versus pressedcl
Figure 1: Classification accuracy when the elements of $D$ and $w$ are displaced from their original values by a random amount, upper bounded by $\delta$. These results were built from the classification of the test set using 10 different pairs of $D$ and $w$, with 50 atoms. The datasets used in this simulation are described in Section 4.1.
Hypothesis 1.

Both $D$ and $w$ can be powerized at the cost of a small decrease in the classification accuracy.

It is worth noting that Theorem 1 guarantees an upper bound of $1/3$ on the relative distance between any real scalar and its powerized version. Therefore, it is reasonable to hypothesize that the classification accuracy using the powerized pair $P(D)$ and $P(w)$ is no worse than using the perturbed pairs with $\delta = 1/3$ shown in Figure 1. To support this hypothesis, we ran another simulation using the datasets described in Section 4.1. For this simulation, we trained 10 pairs of $D$ and $w$ with different training sets and evaluated them and their respective powerized versions on the test set. On both the bark versus woodgrain and the pigskin versus pressedcl datasets, the powerized models achieved accuracies close to those of the original models.

Theorem 2.

Let $D$ and $w$ be, respectively, the sparsifying dictionary and the linear classifier trained with the normalized training set ($\ell_2$ norm equal to 1). The classification of the raw signals (integer-valued) and of the normalized signals ($\ell_2$ norm equal to 1) is exactly the same when the sparsity parameter is properly adjusted to the raw signals.

Proof.

Let $\bar{x}$ and $x = \bar{x} / \|\bar{x}\|_2$ be, respectively, a raw vector from the test set and its normalized version, with $\|x\|_2 = 1$, and let $D$ and $w$ be trained with normalized signals. Therefore, the extracted features are $D^\top x = D^\top \bar{x} / \|\bar{x}\|_2$, the soft-thresholded features are $\alpha = \max(0, D^\top \bar{x} / \|\bar{x}\|_2 - \lambda \mathbf{1})$, and the classification of $x$ is $\hat{y} = \operatorname{sign}(w^\top \alpha)$.

As the $\ell_2$ norm of any real vector different from the null vector is always greater than 0, we have $\|\bar{x}\|_2 > 0$, and thus

$\max\big(0, D^\top \bar{x} / \|\bar{x}\|_2 - \lambda \mathbf{1}\big) = \frac{1}{\|\bar{x}\|_2} \max\big(0, D^\top \bar{x} - \lambda \|\bar{x}\|_2 \mathbf{1}\big).$

Therefore, as $\operatorname{sign}(c\, w^\top \alpha) = \operatorname{sign}(w^\top \alpha)$ for any $c > 0$, the expressions $\operatorname{sign}\big(w^\top \max(0, D^\top x - \lambda \mathbf{1})\big)$, with $\|x\|_2 = 1$, and $\operatorname{sign}\big(w^\top \max(0, D^\top \bar{x} - \bar{\lambda} \mathbf{1})\big)$, with $\bar{\lambda} = \lambda \|\bar{x}\|_2$, are equivalent. ∎
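
A small numerical check of Theorem 2, using stand-in values for $D$, $w$, and $\lambda$ (a sketch of the argument, not the authors' experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 64, 50
D = rng.standard_normal((n, N))      # stand-in dictionary trained on normalized data
w = rng.standard_normal(N)           # stand-in classifier hyperplane
lam = 0.1                            # sparsity parameter tuned for normalized signals

x_raw = rng.integers(0, 256, size=n).astype(float)   # raw integer-valued signal
x_norm = x_raw / np.linalg.norm(x_raw)               # its l2-normalized version

y_norm = np.sign(w @ np.maximum(0.0, D.T @ x_norm - lam))
y_raw = np.sign(w @ np.maximum(0.0, D.T @ x_raw - lam * np.linalg.norm(x_raw)))
assert y_norm == y_raw   # identical decisions once the threshold is rescaled
```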

Empirical evidence 1.

Increasing the sparsity of the dictionary $D$ up to a certain level will decrease the minimum number of bits necessary to store it and, consequently, also reduce the number of bits needed to compute the sparse representation $\max(0, D^\top x - \lambda \mathbf{1})$, at the cost of a slight decrease in classification accuracy.

We hypothesized that forcing $D$ to be sparse would decrease its dynamic range with no substantial decrease in classification accuracy. To test our hypothesis, we performed another simulation with the datasets described in Section 4.1. For each of 14 linearly spaced hard-threshold values, we averaged the results of 10 pairs of $D$ and $w$, trained on different training sets, evaluated on the test set. As shown in Figure 2(c), the first nonzero cut already halves the number of bits needed to represent $D$ while unexpectedly increasing the classification accuracy. Also, the third nonzero cut on $D$, shown in Figure 2(d), maintains the classification accuracy while reducing the dynamic range of $D$ to less than half of the original.

(a) bark versus woodgrain
(b) pigskin versus pressedcl
(c) bark versus woodgrain
(d) pigskin versus pressedcl
Figure 2: Classification accuracy for hard-threshold values applied to the dictionary $D$. These are the averages of the classification results on the test set evaluated with 10 pairs of $D$ and $w$, with 50 atoms, trained with different training sets. The original results correspond to a threshold of zero. The datasets are described in Section 4.1.
Empirical evidence 2.

Decreasing the quantization level of the integer-valued test images up to a certain level will decrease the dynamic range of the product $D^\top x$, at the cost of a slight decrease in classification accuracy.

We also hypothesized that the original continuous signal may be unnecessarily finely quantized and that its quantization level may be decreased without substantially affecting the classification accuracy. To test this hypothesis, we performed another simulation with the binary datasets described in Section 4.1. In this simulation, we averaged the results of one thousand runs, each consisting of 10 pairs of $D$ and $w$ trained on different training sets and evaluated on the test set. Each training set was quantized from 1 to 15 quantization levels. The results are shown in Figure 3. It is worth noting in this figure that both datasets can be reduced to 2 bits (quantization levels equal to 2 and 3) with a limited decrease of the classification accuracy.

(a) bark versus woodgrain
(b) pigskin versus pressedcl
Figure 3: Classification accuracy on quantized versions of the test set. These results are the average of the classification results on the test set evaluated with 10 pairs of $D$ and $w$, with 50 atoms, trained with different training sets. The original results are also marked on the quantization-level axis. The datasets are described in Section 4.1.
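
The kind of uniform requantization used in this test can be sketched as follows, assuming 8-bit input images (the helper name and the in_max default are ours):

```python
import numpy as np

def requantize(x, levels, in_max=255):
    """Map integer pixel values in [0, in_max] onto `levels` uniformly spaced levels."""
    x = np.asarray(x, dtype=np.int64)
    return np.round(x * (levels - 1) / in_max).astype(np.int64)  # indices 0 .. levels-1

# Example: 3 quantization levels need only 2 bits per pixel.
pixels = np.array([0, 60, 130, 255])
print(requantize(pixels, levels=3))  # [0 0 1 2]
```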

3.2 Proposed Techniques

Technique 1.

Use signals in their raw representation (integer) instead of their normalized version (floating-point), adjusting the sparsity parameter as in Theorem 2.

Technique 2.

Powerize $D$ and $w$.

Technique 3.

Decrease the dynamic range of the test set by reducing the quantization level of the integer-valued test images.

Technique 4.

Decrease the dynamic range of the entries of $D$ by penalizing their $\ell_2$ norm during training and then hard-thresholding them using a trained threshold.

Our strategy to decrease the dynamic range of the dictionary involves the addition of a penalty on the $\ell_2$ norm of its entries during the minimization of the objective function of LAST, described in (5). The new objective function becomes

$\min_{D, w} \sum_{i=1}^{m} L\big(y_i\, w^\top \max(0, D^\top x_i - \lambda \mathbf{1})\big) + \frac{\nu}{2}\|w\|_2^2 + \frac{\mu}{2}\|D\|_F^2,$    (6)

where $\mu$ controls this new penalization. In Section 3.3, we show our proposed technique for including this penalization into general constrained-optimization algorithms.

After training $D$ and $w$ using the modified objective function (6), we apply a hard threshold to the entries of $D$ to zero out the values closest to zero. Our assumption is that these small values of $D$ contribute little to the final feature values and thus can be set to zero without much effect on the classification accuracy. As for the threshold value, we test all unique absolute values of $D$ after it has been powerized using Technique 2 and keep the best one. As the number of unique absolute values of $D$ is substantially reduced after applying Technique 2, the computational burden of testing all possible values is greatly reduced.
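
A sketch of this hard-thresholding step, with the candidate thresholds taken from the powerized dictionary; powerize repeats the sketch from Section 3.1 and the selection of the best threshold by held-out accuracy is only indicated in a comment:

```python
import numpy as np

def powerize(v):
    """Closest power of 2 (in absolute distance) for each nonzero entry, keeping its sign."""
    v = np.asarray(v, dtype=float)
    mag = np.abs(v)
    safe = np.where(mag > 0, mag, 1.0)
    lo = 2.0 ** np.floor(np.log2(safe))
    nearest = np.where(safe - lo <= 2.0 * lo - safe, lo, 2.0 * lo)
    return np.where(mag > 0, np.sign(v) * nearest, 0.0)

def hard_threshold(D, tau):
    """Zero out dictionary entries whose absolute value is below tau."""
    Dh = D.copy()
    Dh[np.abs(Dh) < tau] = 0.0
    return Dh

D = np.random.default_rng(0).standard_normal((64, 50))   # stand-in trained dictionary
candidates = np.unique(np.abs(powerize(D)))               # few values: one per power of 2
pruned = [hard_threshold(powerize(D), tau) for tau in candidates]
# The best tau would then be the one whose pruned dictionary gives the highest
# classification accuracy on held-out training data (Section 3.4).
```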

3.3 Inclusion of an $\ell_2$ Norm Penalization Term in Dictionary Training Algorithms Based on Constrained Optimization

We show how to include a term in the objective function that penalizes potential dictionaries whose entries have larger energy values, as opposed to lower-energy dictionaries. By favoring vectors with lower energies, we may obtain dictionaries that span narrower ranges of values. We then show how to include this penalization in gradient descent (GD) methods, one of the most widely used families of optimization methods (Boyd and Vandenberghe, 2004).

Several dictionary and classifier training methods are based on constrained optimization programs (Fawzi et al., 2014; Ravishankar and Bresler, 2013) of the form

$\min_{z} f(z) \quad \text{subject to} \quad h(z) = \mathbf{0},$    (7)

where: (i) $z = [d^\top\ w^\top]^\top$, with $d$ a vector containing the dictionary terms and $w$ a vector of classifier parameters; (ii) $f(z)$ is the cost function based on the training set; (iii) $\mathbf{0}$ is the null vector; and (iv) $h(z)$ is a function representing the scalar equality constraints. Some methods also include inequality constraints.

In order to penalize the total energy associated with the dictionary entries, we can replace any problem of the form (7) by

$\min_{z} f(z) + \frac{\mu}{2}\|d\|_2^2 \quad \text{subject to} \quad h(z) = \mathbf{0},$    (8)

where $\mu > 0$ is a penalization weight.

Iterative methods are commonly used to solve constrained optimization problems such as (8) (Boyd and Vandenberghe, 2004). They start with an initial value for $z$, which is iterated to generate a sequence, expected to converge, satisfying

$z^{(k+1)} = z^{(k)} + t\, \Delta z^{(k)},$    (9)

where $t$ is the step size and $\Delta z^{(k)}$ is the step computed by the particular iterative method.

We consider GD methods, where computing $\Delta z^{(k)}$ requires evaluating the gradient of a dual function associated with the objective function and the constraints (Boyd and Vandenberghe, 2004). Specifically, the Lagrangian is an example of such a dual function, having a local maximum that is a minimum of the objective function at a point that satisfies the constraints. For problems (7) and (8), the Lagrangian functions are given respectively by

$\Lambda(z, \theta) = f(z) + \theta^\top h(z),$    (10)
$\tilde{\Lambda}(z, \theta) = f(z) + \frac{\mu}{2}\|d\|_2^2 + \theta^\top h(z),$    (11)

with $\theta$ the vector of Lagrange multipliers.

Our first objective regarding the modified problem (8) is to compute the gradient of $\tilde{\Lambda}$ in terms of the gradient of $\Lambda$, so as to show how a procedure that solves (7) can be modified in order to solve (8).

3.3.1 Including the Penalization Term in GD Methods

In GD optimization methods, the step $\Delta z^{(k)}$ depends directly on the gradient of the dual function $\Lambda$, evaluated at $z^{(k)}$ (Boyd and Vandenberghe, 2004). We now establish the relation between $\nabla \tilde{\Lambda}$ and $\nabla \Lambda$, in order to determine the modification needed to include the proposed penalization in such methods.

By comparing (10) and (11), and by defining $\nabla_{v}$ as the gradient of any function with respect to the vector $v$, note that

$\tilde{\Lambda}(z, \theta) = \Lambda(z, \theta) + \frac{\mu}{2}\|d\|_2^2.$

As $\nabla_d \big(\tfrac{\mu}{2}\|d\|_2^2\big) = \mu d$, it is easy to see that $\nabla_d \tilde{\Lambda} = \nabla_d \Lambda + \mu d$.

In summary, the gradient of the modified Lagrangian can be computed from that of the original Lagrangian used in a given optimization problem by using the expressions

$\nabla_d \tilde{\Lambda} = \nabla_d \Lambda + \mu d,$    (12)
$\nabla_w \tilde{\Lambda} = \nabla_w \Lambda,$    (13)
$\nabla_\theta \tilde{\Lambda} = \nabla_\theta \Lambda.$    (14)

Equations (12), (13), and (14) show how we modify the estimated gradient in any GD method (such as LAST (Fawzi et al., 2014)) in order to penalize the range of the dictionary entries and thus try to force a solution with a narrower range. Note that only the gradient with respect to the dictionary is altered.
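
A minimal sketch of how the penalization changes a single gradient step, assuming a function grad_lagrangian that returns the gradients of the original Lagrangian (10); all names and the primal-dual update convention are ours:

```python
import numpy as np

def penalized_step(d, w, theta, grad_lagrangian, mu, t):
    """One gradient step on the modified Lagrangian of Eq. (11).

    grad_lagrangian(d, w, theta) -> (g_d, g_w, g_theta), gradients of Eq. (10).
    Only the dictionary gradient changes, as in Eqs. (12)-(14).
    """
    g_d, g_w, g_theta = grad_lagrangian(d, w, theta)
    g_d = g_d + mu * d                 # Eq. (12): extra term penalizing dictionary energy
    d_new = d - t * g_d                # descent on the dictionary entries
    w_new = w - t * g_w                # descent on the classifier parameters
    theta_new = theta + t * g_theta    # ascent on the Lagrange multipliers
    return d_new, w_new, theta_new
```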

3.4 Model Selection

Our simulations, described in Section 4, generated many different models due to the large number of parameter combinations of Technique 3 and Technique 4. To select the best combination of the quantization level $Q$, the penalization weight $\mu$, and the hard threshold $\tau$, we relied on the classification accuracy on a separate data set. The ranges of these parameters are defined in Section 4.1, as is the parameter $\epsilon$ that controls the trade-off between the classification accuracy and the bit resolution of the final classifier. We used the following steps for the model selection: (i) First, we used part of the training set to train the models ($D$ and $w$) and used the remaining part to estimate the best combination of $Q$, $\mu$, and $\tau$. (ii) Let $\mathcal{M}$ be the set of models trained with all combinations of $Q$, $\mu$, and $\tau$; let $\mathcal{A}$ be the set of classification results on the held-out part of the training set using the models in $\mathcal{M}$; and let $a^{*}$ be the best accuracy in $\mathcal{A}$. (iii) From $\mathcal{M}$, we create the subset $\mathcal{M}_1$ containing the models whose accuracy is at least $a^{*} - \epsilon$. (iv) From $\mathcal{M}_1$, we create a new subset $\mathcal{M}_2$ with the models that attain the lowest number of bits necessary for the computation of $D^\top x$. (v) From $\mathcal{M}_2$, we finally choose the model whose features are sparsest.

The traditional rule of thumb for splitting the dataset into training and test portions is a safe way of estimating the true classification accuracy when the classification accuracy on the whole dataset is sufficiently high (Dobbin and Simon, 2011). As we are reserving part of the training set solely for the selection of the best parameter values, and not for the estimation of the true classification accuracy, we opted for a more conservative proportion to train our models. This has the advantage of lowering the chance of missing an underrepresented training set sample. Moreover, the last step in our model selection algorithm selects the model that produces the sparsest signal representation, as this leads to models that generalize better (Bengio et al., 2013).
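
A schematic sketch of this selection procedure, assuming each candidate model records its held-out accuracy, required bit width, and feature sparsity (the field names and the default tolerance are ours):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    accuracy: float   # held-out classification accuracy
    bits: int         # bits needed to compute the matrix-vector product
    sparsity: float   # fraction of zero features (higher means sparser)

def select_model(candidates: List[Candidate], eps: float = 0.01) -> Candidate:
    """Keep near-best accuracy, then fewest bits, then the sparsest representation."""
    best_acc = max(c.accuracy for c in candidates)
    near_best = [c for c in candidates if c.accuracy >= best_acc - eps]
    min_bits = min(c.bits for c in near_best)
    cheapest = [c for c in near_best if c.bits == min_bits]
    return max(cheapest, key=lambda c: c.sparsity)
```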

4 Simulations

In this section, we evaluate how our techniques affect the accuracy of LAST on the same datasets used in (Fawzi et al., 2014). The datasets are described in Section 4.1, along with the parameters we chose to evaluate our techniques; the analysis of the results follows in Section 4.2.

4.1 Datasets and Choice of the Parameters

We used five out of the six datasets used in the paper that describes LAST (Fawzi et al., 2014), because we could not find the USPS dataset in integer form. The simulations consist in training $D$ and $w$ with both the original version of LAST and the version modified with our techniques presented in Section 3.2.

The first two datasets contain patches of textures extracted from the Brodatz dataset (Valkealahti and Oja, 1998). As in (Fawzi et al., 2014), the first task consisted in discriminating between the images bark and woodgrain, and the second task in discriminating pigskin from pressedcl. First, we separated each image into two disjoint pieces and took the training patches from one piece and the test patches from the other. As in (Fawzi et al., 2014), the training and test sets were built with 500 patches of the textures. These patches were transformed into vectors and then normalized to have $\ell_2$ norm equal to 1.

The third binary dataset was built using a subset of the CIFAR-10 image dataset (Krizhevsky, 2009). This dataset contains 10 classes of 60 000 tiny RGB images, with 50 000 images in the training set and 10 000 in the test set. Each image has 3 color channels and is stored in a vector of 3 072 positions. The chosen images are those labeled as deer and horse.

The first multiclass dataset was the MNIST dataset (LeCun et al., 1998), which contains 70 000 images of handwritten digits of size 28 × 28 pixels, distributed as 60 000 images in the training set and 10 000 images in the test set. As in (Fawzi et al., 2014), all images have zero mean and $\ell_2$ norm equal to 1.

The last task consisted in the classification of all 10 classes from the CIFAR-10 image dataset.

For all datasets, we fixed the penalization parameter $\mu$ and let the threshold $\tau$ assume all unique absolute values of the powerized version of $D$, i.e., of $D$ after applying Technique 2. As the number of unique values of the powerized $D$ is substantially lower than that of $D$, the computational burden needed to test all valid thresholds is low. We also fixed the quantization parameter $Q$ and the trade-off parameter $\epsilon$. The choice of these parameter values was based on a previous run of all simulations. As for the parameters of LAST, we used the same values used in (Fawzi et al., 2014), and we direct the reader to that paper for further understanding of these parameters and their values.

4.2 Results and Analyses

In this section, the original results are the ones from the classification of the test set using the model built with the original LAST algorithm. Conversely, the proposed results are the ones obtained from the classification of the test set using the best model built for each dataset. The best model is the one selected using the methodology presented in Section 3.4.

We show the results of our simulations on the binary tasks in Figure 4. As shown in Figures 4(d), 4(e), and 4(f), our techniques do not substantially decrease the original classification accuracy. At the same time, they considerably reduce the number of bits necessary to perform the multiplication $D^\top x$, as shown in Figures 4(a), 4(b), and 4(c). This reduction allows the use of 32-bit single-precision floating-point in GPUs instead of 64-bit double-precision floating-point, which increases the computational throughput (Du et al., 2012).

One can note that the original results in Figures 4(d) and 4(e) are lower than the ones presented in (Fawzi et al., 2014). Differently from their work, we used disjoint training and test sets to allow a better estimation of the true classification accuracy.

(a) bark versus woodgrain
(b) pigskin versus pressedcl
(c) CIFAR-10 deer versus horse
(d) bark versus woodgrain
(e) pigskin versus pressedcl
(f) CIFAR-10 deer versus horse
Figure 4: Comparison of the results using the original LAST algorithm and our proposed techniques. Regarding the classification at test time, these figures show for each dataset the trade-off between the necessary number of bits (top) and the classification accuracy (bottom). Our approach reduces the necessary number of bits to almost half of the original formulation at the cost of a slight decrease of the classification accuracy. The datasets are described in Section 4.1.

Table 1 contains the results of the simulations on the MNIST and CIFAR-10 tasks. The original results we obtained for both large datasets have a higher classification error than the ones reported in (Fawzi et al., 2014). We hypothesize that this is caused by the random nature of LAST for larger datasets, where each GD step is optimized on a small portion of the data called a mini-batch, which is randomly sampled from the training set. Moreover, we trained $D$ and $w$ using only part of the training set used in (Fawzi et al., 2014), and this may negatively affect the generalization power of the dictionary and classifier.

Note that our techniques resulted in a slight increase of the classification error on the MNIST task. Nevertheless, they reduced the number of bits necessary to run the classification at test time to less than half. Again, this dynamic-range reduction is highly valuable for applications on both GPUs and FPGAs. As for the CIFAR-10 task, our techniques produced a model that has a substantially lower error than the original model while using almost half of the number of bits at test time.

Table 1: Comparison between the original and the proposed results regarding the classification error and number of bits necessary to compute the matrix-vector multiplication of the sparse representation.

The results presented in this section indicate the feasibility of using integer operations in place of floating-point ones and bit shifts in place of multiplications, with a slight decrease of the classification accuracy. These substitutions reduce the computational cost of classification at test time in FPGAs, which is important in embedded applications, where power consumption is critical. Moreover, our techniques reduce by almost half the number of bits necessary to perform the most expensive operation in the classification, the matrix-vector multiplication $D^\top x$. This is a result of applying both Technique 3 and Technique 4, and it enables the use of 32-bit single-precision floating-point operations in place of 64-bit double-precision operations in GPUs, which can almost double their computational throughput (Du et al., 2012).
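
To make the claim concrete, below is a schematic sketch of a shift-and-add classification step with a powerized dictionary stored as signs and exponents, an integer test signal, and the threshold rescaled as in Theorem 2 (all names and the storage format are ours, not a description of an actual FPGA design):

```python
import numpy as np

def shift_add_classify(x_int, signs, exps, w_signs, w_exps, lam_int):
    """Classify using only integer shifts, additions, and comparisons.

    Dictionary entries are signs[j, i] * 2**exps[j, i]; classifier entries are
    w_signs[i] * 2**w_exps[i]; x_int is a nonnegative integer signal; lam_int is
    the integer-rescaled sparsity threshold.
    """
    n, N = exps.shape
    score = 0
    for i in range(N):
        acc = 0
        for j in range(n):
            e = int(exps[j, i])
            shifted = (int(x_int[j]) << e) if e >= 0 else (int(x_int[j]) >> -e)
            acc += int(signs[j, i]) * shifted          # shift and add, no multiplication
        feat = acc - lam_int if acc > lam_int else 0   # one-sided soft-threshold
        ew = int(w_exps[i])
        score += int(w_signs[i]) * ((feat << ew) if ew >= 0 else (feat >> -ew))
    return 1 if score >= 0 else -1
```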

Also, it is worth noting that our techniques were developed to reduce the computational cost of the classification with an expected accuracy reduction within acceptable limits. Nevertheless, the classification accuracies on the bark versus woodgrain dataset using our techniques substantially outperform the accuracies of the original model, as shown in Figure 4(d). These higher accuracies were unexpected. Regarding the original models, we noted that the classification accuracies on the training set were 100% when using dictionaries with at least 50 atoms. These models were probably overfitted to the training set, making them fail to generalize to new data. As our powerize technique introduces a perturbation to the elements of both $D$ and $w$, we hypothesize that it reduced the overfitting of $D$ and $w$ to the training set and, consequently, increased their generalization power on unseen data (Pfahringer, 1995). However, this needs further investigation.

5 Conclusion

This paper presented a set of techniques for reducing the test-time computations of classifiers that are based on a learned transform and soft-threshold. Briefly, the techniques are: adjust the threshold so the classifier can use signals represented as integers instead of their normalized floating-point version; reduce the multiplications to simple bit shifts by approximating the entries of both the dictionary and the classifier vector to their nearest power of 2; and increase the sparsity of the dictionary by applying a hard threshold to its entries. We ran simulations using the same datasets used in the original paper that introduced LAST, and our results indicate that our techniques substantially reduce the computational load at a small cost in classification accuracy. Moreover, on one of the datasets tested there was a substantial increase in the accuracy of the classifier. These proposed optimization techniques are valuable in applications where power consumption is critical.

Acknowledgments

This work was partially supported by a scholarship from the Coordination of Improvement of Higher Education Personnel (Portuguese acronym CAPES). We thank the Dept. of ECE of the UTEP for allowing us access to the NSF-supported cluster (NSF CNS-0709438) used in all the simulations here described and also Mr. N. Gumataotao for his assistance with it. We thank Mr. A. Fawzi for the source code of LAST and all the help with its details. We also thank Dr. G. von Borries for fruitful cooperation and discussions.

References

  • Bengio et al. (2013) Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
  • Boyd and Vandenberghe (2004) Boyd, S.P., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
  • Dobbin and Simon (2011) Dobbin, K.K., Simon, R.M., 2011. Optimally splitting cases for training and testing high dimensional classifiers. BMC medical genomics 4, 31.
  • Donoho and Huo (2001) Donoho, D.L., Huo, X., 2001. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47, 2845–2862.
  • Donoho and Johnstone (1994) Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika Trust 81, 425–455.
  • Du et al. (2012) Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J., 2012. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing 38, 391–407.
  • Fawzi et al. (2014) Fawzi, A., Davies, M., Frossard, P., 2014. Dictionary Learning for Fast Classification Based on Soft-thresholding. International Journal of Computer Vision, 1–16.

  • Huang et al. (2006) Huang, G.B., Zhu, Q.Y., Siew, C.K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70, 489–501.
  • Krizhevsky (2009) Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE. Institute of Electrical and Electronics Engineers 86, 2278–2324.
  • Mairal et al. (2012) Mairal, J., Bach, F., Ponce, J., 2012. Task-Driven Dictionary Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 791–804.
  • Pfahringer (1995) Pfahringer, B., 1995. Compression-Based Discretization of Continuous Attributes, in: Proc. 12th International Conference on Machine Learning, pp. 456–463.

  • Ravishankar and Bresler (2013) Ravishankar, S., Bresler, Y., 2013. Learning Sparsifying Transforms. IEEE Transactions on Signal Processing 61, 1072–1086.
  • Schmidhuber (2015) Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural networks : the official journal of the International Neural Network Society 61, 85–117.
  • Shekhar et al. (2014) Shekhar, S., Patel, V.M., Chellappa, R., 2014. Analysis sparse coding models for image-based classification, in: IEEE International Conference on Image Processing. Proceedings.
  • Valkealahti and Oja (1998) Valkealahti, K., Oja, E., 1998. Reduced multidimensional co-occurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 90–94.