1 Introduction
The past decade has witnessed the emergence of huge volumes of high-dimensional information produced by all sorts of sensors. For instance, a massive number of high-resolution images are uploaded to the Internet every minute. In this context, one of the key challenges is to develop techniques that process these large amounts of data in a computationally efficient way. In this paper, we focus on the image classification problem, which is one of the most challenging tasks in image analysis and computer vision. Given training examples from multiple classes, the goal is to find a rule that predicts the class of test samples.
Linear classification is a computationally efficient way to categorize test samples. It consists in finding a linear separator between two classes. Linear classification has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood. However, many datasets cannot be separated linearly and require complex nonlinear classifiers. A popular nonlinear scheme, which leverages the efficiency and simplicity of linear classifiers, embeds the data into a high-dimensional feature space, where a linear classifier is eventually sought. The feature space mapping is chosen to be nonlinear in order to convert nonlinear relations into linear relations. This nonlinear classification framework is at the heart of the popular kernel-based methods (Shawe-Taylor and Cristianini, 2004), which make use of a computational shortcut to bypass the explicit computation of feature vectors. Despite the popularity of kernel-based classification, its computational complexity at test time strongly depends on the number of training samples (Burges, 1998), which limits its applicability in large-scale settings.
A more recent approach for nonlinear classification is based on sparse coding, which consists in finding a compact representation of the data in an overcomplete dictionary. Sparse coding is known to be beneficial in signal processing tasks such as denoising (Elad and Aharon, 2006), inpainting (Fadili et al, 2009) and coding (Figueras i Ventura et al, 2006), but it has also recently emerged in the context of classification, where it is viewed as a nonlinear feature extraction mapping. It is usually followed by a linear classifier (Raina et al, 2007), but can also be used in conjunction with other classifiers (Wright et al, 2009). Classification architectures based on sparse coding have been shown to work very well in practice and even achieve state-of-the-art results on particular tasks (Mairal et al, 2012; Yang et al, 2009). The crucial drawback of sparse coding classifiers is, however, the prohibitive cost of computing the sparse representation of a signal or image sample at test time. This limits the relevance of such techniques in large-scale vision problems or when computational power is scarce.
To remedy these large computational requirements, we adopt in the classification architecture a computationally efficient sparsifying transform, the soft-thresholding mapping $h_\alpha$, defined by:
$$h_\alpha(z) = \max(0, z - \alpha), \qquad (1)$$
for $\alpha \in \mathbb{R}_+$ and $z \in \mathbb{R}$. Note that, unlike the usual definition of soft-thresholding given by $\operatorname{sgn}(z)\max(0, |z| - \alpha)$, we consider here the one-sided version of the soft-thresholding map, where the function is equal to zero for negative values (see Fig. 3 (a) vs. Fig. 3 (b)). The map is naturally extended to vectors by applying the scalar map to each coordinate independently. Given a dictionary $D$, this map can be applied to a transformed signal $D^T x$ that represents the coefficients of features in a signal $x$. Its outcome, which only retains the most important features of $x$, is used for classification. In more detail, we consider in this paper the following simple two-step procedure for classification:

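Feature extraction: Let $D \in \mathbb{R}^{n \times N}$ and $\alpha \in \mathbb{R}_+$. Given a test point $x \in \mathbb{R}^n$, compute $h_\alpha(D^T x)$, where $h_\alpha$ is the soft-thresholding map of Eq. (1).

Linear classification: Let $w \in \mathbb{R}^N$. If $w^T h_\alpha(D^T x)$ is positive, assign $x$ to class $1$. Otherwise, assign $x$ to class $-1$.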
The architecture is illustrated in Fig. 1. The proposed classification scheme has the advantage of being simple, efficient and easy to implement, as it involves a single matrix-vector multiplication and a max operation.
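The two-step procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the one-sided soft-thresholding is max(0, z - alpha), that the dictionary is stored as a list of atoms, and that the two classes are labeled 1 and -1. The function names and the toy values of `D` and `w` are hypothetical.

```python
def soft_threshold(z, alpha):
    """One-sided soft-thresholding (assumed form): max(0, z - alpha)."""
    return max(0.0, z - alpha)


def extract_features(D, x, alpha):
    """Step 1: apply the map to each inner product between an atom and x.

    D is a list of atoms; each atom is a list with the same length as x.
    """
    return [
        soft_threshold(sum(d_k * x_k for d_k, x_k in zip(atom, x)), alpha)
        for atom in D
    ]


def classify(D, w, x, alpha=1.0):
    """Step 2: linear classification of the thresholded features."""
    features = extract_features(D, x, alpha)
    score = sum(w_j * f_j for w_j, f_j in zip(w, features))
    return 1 if score > 0 else -1


# Toy dictionary with two atoms and a hand-picked linear classifier.
D = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, -1.0]
label_a = classify(D, w, [3.0, 0.5])  # only the first atom is active
label_b = classify(D, w, [0.5, 3.0])  # only the second atom is active
```

At test time the cost is dominated by the inner products with the atoms, which is the single matrix-vector multiplication mentioned in the text.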
The soft-thresholding map has been successfully used in (Coates and Ng, 2011), as well as in a number of deep learning architectures (Kavukcuoglu et al, 2010b), which shows the relevance of this efficient feature extraction mapping. The remarkable results in Coates and Ng (2011) show that this simple encoder, when coupled with a standard learning algorithm, can often achieve results comparable to those of sparse coding, provided that the number of labeled samples and the dictionary size are large enough. However, when this is not the case, proper training of the classifier parameters becomes crucial for reaching good classification performance. This is the objective of this paper.
We propose a novel supervised dictionary learning algorithm, which we call LAST (Learning Algorithm for Soft-Thresholding classifier). It jointly learns the dictionary and the linear classifier tailored for the classification architecture based on soft-thresholding. We pose the learning problem as an optimization problem comprising a loss term that controls the classification accuracy and a regularizer that prevents overfitting. This problem is shown to be a difference-of-convex (DC) program, which is solved efficiently with an iterative DC solver. We then perform extensive experiments on texture, digit and natural image datasets, and show that the proposed classifier, coupled with our dictionary learning approach, exhibits remarkable performance with respect to numerous competitor methods. In particular, we show that our classifier provides comparable or better classification accuracy than sparse coding schemes.
The rest of this paper is organized as follows. In Section 2, we highlight the related work. In Section 3, we formulate the dictionary learning problem for classifiers based on soft-thresholding. Section 4 then presents our novel learning algorithm, LAST, based on DC optimization. In Section 5, we perform extensive experiments on texture, natural image and digit datasets. Finally, Section 6 gathers a number of important observations on the dictionary learning algorithm and the classification scheme.
2 Related work
We first highlight in this section the difference between the proposed approach and existing techniques from the sparse coding and dictionary learning literature. Then, we draw a connection between the considered approach and neural network models on the architecture and optimization aspects.
2.1 Sparse coding
The classification scheme adopted in this paper shares similarities with the now popular architectures that use sparse coding at the feature extraction stage. We recall that the sparse coding mapping, applied to a datapoint in a dictionary consists in solving the optimization problem
(2) 
It is now known that, when the parameters of the sparse coding classifier are trained in a discriminative way, excellent classification results are obtained in many vision tasks (Mairal et al, 2012, 2008; Ramirez et al, 2010). In particular, significant gains over the standard reconstructive dictionary learning approaches are obtained when the dictionary is optimized for classification. Several dictionary learning methods also impose an additional structure (e.g., low-rankness) on the dictionary, in order to incorporate task-specific prior knowledge (Zhang et al, 2013; Chen et al, 2012; Ma et al, 2012). This line of research is especially popular in face recognition applications, where a mixture-of-subspaces model is known to hold (Wright et al, 2009). To the best of our knowledge, all the discriminative dictionary learning methods optimize the dictionary with respect to the sparse coding map in Eq. (2), or a variant that still requires solving a nontrivial optimization problem. In our work, however, we introduce a discriminative dictionary learning method specific to the efficient soft-thresholding map. Interestingly, soft-thresholding can be viewed as a coarse approximation to nonnegative sparse coding, as we show in Appendix A. This further motivates the use of soft-thresholding for feature extraction, as the merits of sparse coding for classification are now well established.
Closer to our work, several approaches have been introduced to approximate sparse coding with a more efficient feed-forward predictor (Kavukcuoglu et al, 2010a; Gregor and LeCun, 2010), whose parameters are learned in order to minimize the approximation error with respect to sparse codes. These works are however different from ours in several aspects. First, our approach does not require the result of the soft-thresholding mapping to be close to that of sparse coding. Rather, we solely require good classification accuracy on the training samples. Moreover, our dictionary learning approach is purely supervised, unlike Kavukcuoglu et al (2010a, b). Finally, these methods often use nonlinear maps (e.g., hyperbolic tangent in Kavukcuoglu et al (2010a), multi-layer soft-thresholding in Gregor and LeCun (2010)) that are different from the one considered in this paper. The single soft-thresholding mapping considered here has the advantage of being simple, very efficient and easy to implement in practice. It is also strongly tied to sparse coding (see Appendix A).
2.2 Neural networks
The classification architecture considered in our work is also quite strongly related to artificial neural network models (Bishop, 1995). Neural network models are multilayer architectures, where each layer consists of a set of neurons. The neurons compute a linear combination of the activation values of the preceding layer, and an activation function is then used to convert each neuron’s weighted input to its activation value. Popular choices of activation functions are the logistic sigmoid and hyperbolic tangent nonlinearities. Our classification architecture can be seen as a neural network with one hidden layer, with the soft-thresholding map $h_\alpha$ of Eq. (1) as the hidden units’ activation function and zero bias (Fig. 2). Equivalently, the activation function can be set to the rectifier $\max(0, \cdot)$ with a constant bias $-\alpha$ across all hidden units. The dictionary $D$ defines the connections between the input and hidden layer, while the linear classifier $w$ represents the weights that connect the hidden layer to the output.
In an important recent contribution, Glorot et al (2011) showed that using the rectifier activation function results in better performance for deep networks than the more classical hyperbolic tangent function. On top of that, the rectifier nonlinearity is more biologically plausible, and leads to sparse networks, a property that is highly desirable in representation learning (Bengio et al, 2013). While the architecture considered in this paper is close to that of Glorot et al (2011), it differs in several important aspects. First, our architecture assumes that hidden units have a bias equal to $-\alpha$, shared across all the hidden units, while it is unclear whether any constraint on the bias is set in the existing rectifier networks. The parameter $\alpha$ is intimately related to the sparsity of the features. This can be justified by the fact that $h_\alpha$ is an approximant to the nonnegative sparse coding map with sparsity penalty $\alpha$ (see Appendix A). Without any restriction on the neurons’ bias (e.g., negativity) in rectifier networks, the representation might however not be sparse. This potentially explains the necessity to use an additional sparsifying regularizer on the activation values in Glorot et al (2011) to enforce the sparsity of the network, while sparsity is achieved implicitly in our scheme. Second, unlike the work of Glorot et al (2011), which employs a biological argument to introduce the rectifier function, we choose the soft-thresholding nonlinearity due to its strong relation to sparse coding. Our work therefore provides an independent motivation for considering the rectifier activation function, while the biological motivation in Glorot et al (2011) in turn gives us another motivation for considering soft-thresholding. Third, rectified linear units are very often used in the context of deep networks (Maas et al, 2013; Zeiler et al, 2013), and seldom used with only one hidden layer. In that sense, the classification scheme considered in this paper has a simpler description, and can be seen as a particular instance of the general neural network models.
From an optimization perspective, our learning algorithm leverages the simplicity of our classification architecture and is very different from the generic techniques used to train neural networks. In particular, while neural networks are generally trained with stochastic gradient descent, we adopt an optimization based on the DC framework that directly exploits the structure of the learning problem.
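The correspondence with rectifier networks described in this section can be checked numerically. The sketch below assumes that the soft-thresholding map is max(0, z - alpha) and that the equivalent hidden layer uses a rectifier with a bias of -alpha shared by all units. All names and values are hypothetical, and the `sparsity` helper simply reports the fraction of inactive hidden units.

```python
def dot(a, b):
    return sum(a_k * b_k for a_k, b_k in zip(a, b))


def relu(z):
    """Rectifier activation."""
    return max(0.0, z)


def soft_threshold_features(D, x, alpha):
    """Features max(0, d_j . x - alpha), one per atom d_j of D."""
    return [max(0.0, dot(atom, x) - alpha) for atom in D]


def hidden_layer(D, x, bias):
    """Rectifier hidden layer whose units all share the same bias."""
    return [relu(dot(atom, x) + bias) for atom in D]


def sparsity(activations):
    """Fraction of hidden units that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
x = [0.6, 0.3]

no_bias = hidden_layer(D, x, 0.0)          # plain rectifier, zero bias
shared_bias = hidden_layer(D, x, -0.5)     # rectifier with bias -alpha
thresholded = soft_threshold_features(D, x, 0.5)
```

With these toy values, the shared negative bias deactivates more units than the zero-bias rectifier, which illustrates how the threshold controls the sparsity of the features.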
3 Problem formulation
We present below the learning problem, which jointly estimates the dictionary and linear classifier in the fast classification scheme described in Section 1. We consider the binary classification task where and denote respectively the set of training points and their associated labels. We consider the following supervised learning formulation
(3) 
where denotes a convex loss function that penalizes incorrect classification of a training sample and is a regularization parameter that prevents overfitting. The soft-thresholding map has been defined in Eq. (1). Typical loss functions that can be used in Eq. (3) are the hinge loss ($\max(0, 1 - x)$), which we adopt in this paper, or its smooth approximation, the logistic loss ($\log(1 + e^{-x})$). The above optimization problem attempts to find a dictionary and a linear separator such that has the same sign as on the training set, which leads to correct classification. At the same time, it keeps small in order to prevent overfitting. Note that, to simplify the exposition, the bias term in the linear classifier is dropped. However, our study extends straightforwardly to include a nonzero bias.
The problem formulation in Eq. (3) is reminiscent of the popular support vector machine (SVM) training procedure, where only a linear classifier is learned. Instead, we embed the nonlinearity directly in the problem formulation, and jointly learn the dictionary and the linear classifier. This significantly broadens the applicability of the learned classifier to important nonlinear classification tasks. Note however that adding a nonlinear mapping raises an important optimization challenge, as the learning problem is no longer convex.
When we look closer at the optimization problem in Eq. (3), we note that, for any , the objective function is equal to:
where , and . Therefore, without loss of generality, we set the sparsity parameter to in the rest of this paper. This is in contrast with traditional dictionary learning approaches based on or minimization problems, where a sparsity parameter needs to be set manually beforehand. Fixing and leaving the norms of the dictionary atoms unconstrained essentially makes it possible to adapt the sparsity to the problem at hand. This represents an important advantage, as setting the sparsity parameter is in general a difficult task. A sample is then assigned to class ‘’ if , and class ‘’ otherwise.
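The claim that the sparsity parameter can be fixed without loss of generality follows from the positive homogeneity of the max operation. The sketch below checks this numerically under several assumptions that are not spelled out in the text above: the objective is taken to be a sum of hinge losses plus a squared-norm penalty weighted by nu/2, and the rescaling is taken to be D' = D/alpha, w' = alpha*w and nu' = nu/alpha^2. The symbol names and toy data are hypothetical.

```python
import math


def dot(a, b):
    return sum(a_k * b_k for a_k, b_k in zip(a, b))


def features(D, x, alpha):
    """Assumed one-sided soft-thresholding features max(0, d_j . x - alpha)."""
    return [max(0.0, dot(atom, x) - alpha) for atom in D]


def hinge(t):
    return max(0.0, 1.0 - t)


def objective(D, w, X, y, alpha, nu):
    """Assumed objective: hinge losses plus (nu / 2) * ||w||^2."""
    loss = sum(
        hinge(y_i * dot(w, features(D, x_i, alpha))) for x_i, y_i in zip(X, y)
    )
    return loss + 0.5 * nu * dot(w, w)


def rescale(D, w, alpha, nu):
    """Map (D, w, nu) at threshold alpha to an equivalent problem at threshold 1."""
    D1 = [[d_k / alpha for d_k in atom] for atom in D]
    w1 = [alpha * w_j for w_j in w]
    return D1, w1, nu / alpha ** 2


# Toy data: three points in the plane with labels in {1, -1}.
X = [[3.0, 1.0], [1.0, 3.0], [1.5, 1.0]]
y = [1, -1, 1]
D = [[2.0, 0.0], [0.0, 2.0]]
w = [0.25, -0.5]
alpha, nu = 2.0, 0.1

original = objective(D, w, X, y, alpha, nu)
D1, w1, nu1 = rescale(D, w, alpha, nu)
rescaled = objective(D1, w1, X, y, 1.0, nu1)
same_value = math.isclose(original, rescaled)
```

Under these assumptions the two objective values agree, so any solution for a threshold alpha maps to a solution with threshold 1 and atoms of different norms.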
Finally, we note that, even if our focus primarily goes to the binary classification problem, the extension to multiclass can be easily done through a onevsall strategy, for instance.
4 Learning algorithm
The problem in Eq. (3) is nonconvex and difficult to solve in general. In this section, we propose to relax the original optimization problem and cast it as a difference-of-convex (DC) program. Leveraging this property, we introduce LAST, an efficient algorithm for learning the dictionary and the classifier parameters in our classification scheme based on soft-thresholding.
4.1 Relaxed formulation
We now rewrite the learning problem in an appropriate form for optimization. We start with a simple but crucial change of variables. Specifically, we define , and . Using this change of variables, we have for any ,
Therefore, the problem in Eq. (3), with , can be rewritten in the following way:
(4)  
The equivalence between the two problem formulations in Eqs. (3) and (4) only holds when the components of the linear classifier are restricted to be all nonzero. This is however not a limiting assumption, as zero components in the normal vector of the optimal hyperplane of Eq. (3) can be removed, which is equivalent to using a dictionary of smaller size.
The variable , that is the sign of the components of , essentially encodes the “classes” of the different atoms. In other words, an atom for which (i.e., is positive) is most likely to be active for samples of class ‘’. Conversely, atoms with are most likely active for class ‘’ samples. We assume here that the vector is known a priori. In other words, this means that we have prior knowledge of the proportion of class and class atoms in the desired dictionary. For example, setting half of the entries of the vector to be equal to and the other half to encodes the prior knowledge that we are searching for a dictionary with a balanced number of class-specific atoms. Note that can be estimated from the distribution of the different classes in the training set, assuming that the proportion of class-specific atoms in the dictionary should approximately follow that of the training samples.
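The change of variables can be illustrated numerically. The sketch below assumes a particular form that is consistent with the description above but is not written out in the text: each atom is scaled by the magnitude of its classifier weight (u_j = |w_j| d_j), the magnitudes become per-atom thresholds (v_j = |w_j|), and the signs are kept separately (s_j = sign(w_j)). With a threshold of 1, the classifier score is then a signed sum of max(0, u_j . x - v_j). All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(a_k * b_k for a_k, b_k in zip(a, b))


def score_original(D, w, x):
    """Score w . max(0, D^T x - 1), with the threshold fixed to 1."""
    return sum(w_j * max(0.0, dot(atom, x) - 1.0) for atom, w_j in zip(D, w))


def change_of_variables(D, w):
    """Return (U, v, s) with u_j = |w_j| d_j, v_j = |w_j|, s_j = sign(w_j)."""
    if any(w_j == 0.0 for w_j in w):
        # The equivalence only holds for nonzero classifier components.
        raise ValueError("all components of w must be nonzero")
    U = [[abs(w_j) * d_k for d_k in atom] for atom, w_j in zip(D, w)]
    v = [abs(w_j) for w_j in w]
    s = [1 if w_j > 0 else -1 for w_j in w]
    return U, v, s


def score_transformed(U, v, s, x):
    """Score sum_j s_j * max(0, u_j . x - v_j)."""
    return sum(
        s_j * max(0.0, dot(u_j, x) - v_j) for u_j, v_j, s_j in zip(U, v, s)
    )


D = [[2.0, 0.5], [0.3, 1.5], [1.0, 1.0]]
w = [0.7, -1.2, 0.4]
x = [1.1, 0.9]

U, v, s = change_of_variables(D, w)
before = score_original(D, w, x)
after = score_transformed(U, v, s, x)
```

The sign vector is the only discrete quantity left after the transformation, which is why the text treats it as fixed prior knowledge rather than as an optimization variable.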
After the above change of variables, we now approximate the term in Eq. (4) with a smooth function where , and is a parameter that controls the accuracy of the approximation (Fig. 3 (b)). Specifically, as increases, the quality of the approximation becomes better. The function with is often referred to as “softplus” and plays an important role in the training objective of many classification schemes, such as the classification restricted Boltzmann machines (Larochelle et al, 2012). Note that this approximation is used only to make the optimization easier at the learning stage; at test time, the original soft-thresholding is applied for feature extraction.
Finally, we replace the strict inequality in Eq. (4) with , where is a small positive constant. The latter constraint is easier to handle in the optimization, yet both constraints are essentially equivalent in practice.
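The effect of the smoothing parameter can be seen with a short numerical sketch. It assumes the smooth approximation of max(0, x) is the scaled softplus (1/beta) * log(1 + exp(beta * x)), which reduces to the standard softplus for beta = 1. The name `beta` and the numerically stable rewriting used below are choices made for this illustration.

```python
import math


def smooth_max(x, beta):
    """Scaled softplus (1 / beta) * log(1 + exp(beta * x)).

    Written as max(0, x) plus a correction term so that exp never overflows.
    """
    return max(0.0, x) + math.log1p(math.exp(-beta * abs(x))) / beta


def max_gap(beta, grid):
    """Largest gap between the smooth function and max(0, x) on a grid."""
    return max(smooth_max(x, beta) - max(0.0, x) for x in grid)


grid = [k / 100.0 for k in range(-300, 301)]  # points in [-3, 3], includes 0
gaps = {beta: max_gap(beta, grid) for beta in (1.0, 10.0, 100.0)}
```

The gap is largest at x = 0, where it equals log(2) / beta, so increasing beta tightens the approximation at the price of a less smooth function near the kink.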
We end up with the following optimization problem:
that is a relaxed version of the learning problem in Eq. (4). Once the optimal variables are determined, and can be obtained using the above change of variables.
4.2 DC decomposition
The problem (P) is still a nonconvex optimization problem that can be hard to solve using traditional methods, such as gradient descent or Newton-type methods. However, we show in this section that problem (P) can be written as a difference-of-convex (DC) program (Horst, 2000), which leads to efficient solutions.
We first define DC functions. A realvalued function defined on a convex set is called DC on if, for all , can be expressed in the form
where and are convex functions on . A representation of the above form is said to be a DC decomposition of . Note that DC decompositions are clearly not unique, as provides other decompositions of , for any convex function . Optimization problems of the form , where and for are all DC functions, are called DC programs.
The following proposition now states that the problem (P) is DC:
Proposition 1
For any convex loss function and any convex function , the problem is DC.
While Proposition 1 states that the problem (P) is DC, it does not give an explicit decomposition of the objective function, which is crucial for optimization. The following proposition exhibits a decomposition when is the hinge loss.
Proposition 2
When , the objective function of problem (P) is equal to , where
4.3 Optimization
DC problems are well-studied optimization problems, and efficient optimization algorithms have been proposed in (Horst, 2000; Tao and An, 1998) with good performance in practice (see An and Tao (2005) and references therein, Sriperumbudur et al (2007)). While there exist a number of popular approaches that solve DC programs globally (e.g., cutting plane and branch-and-bound algorithms (Horst, 2000)), these techniques are often inefficient and limited to very small scale problems. A robust and efficient difference of convex algorithm (DCA) is proposed in Tao and An (1998), which is suited for solving general large-scale DC programs. DCA is an iterative algorithm that consists in solving, at each iteration, the convex optimization problem obtained by linearizing (i.e., the nonconvex part of ) around the current solution. The local convergence of DCA is proven in Theorem 3.7 of Tao and An (1998), and we refer to this paper for further theoretical guarantees on the stability and robustness of the algorithm. Although DCA is only guaranteed to reach a local minimum, the authors of Tao and An (1998) state that DCA often converges to a global optimum. When this is not the case, multiple restarts can be used to improve the solution. We note that DCA is very close to the concave-convex procedure (CCCP) introduced in (Yuille et al, 2002).
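The DCA iteration described above can be illustrated on a one-dimensional toy problem that is unrelated to the learning problem of this paper. The sketch below minimizes f(x) = x^4 - 2x^2, written as the difference of the convex functions g(x) = x^4 and h(x) = 2x^2. At each iteration h is replaced by its linearization at the current point, and the resulting convex problem has a closed-form solution. The example and its function names are hypothetical.

```python
import math


def f(x):
    """Toy DC objective f = g - h with g(x) = x**4 and h(x) = 2 * x**2."""
    return x ** 4 - 2.0 * x ** 2


def dca_step(x_k):
    """One DCA iteration: minimize g(x) - h'(x_k) * x over x.

    Setting the derivative 4 x^3 - h'(x_k) to zero gives a cube root.
    """
    slope = 4.0 * x_k  # h'(x_k)
    return math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)


def dca(x0, n_iter=60):
    """Run DCA from x0 and return the whole sequence of iterates."""
    iterates = [x0]
    for _ in range(n_iter):
        iterates.append(dca_step(iterates[-1]))
    return iterates


path_pos = dca(0.5)    # approaches the minimizer x = 1
path_neg = dca(-0.2)   # approaches the minimizer x = -1
path_zero = dca(0.0)   # stays at the stationary point x = 0
```

The objective value never increases along the iterates, and the limit depends on the starting point. The run started at 0 stays at a stationary point that is not a minimizer, which is one reason restarts from different initial points can help.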
At iteration of DCA, the linearized optimization problem is given by:
(5) 
where and are the solution estimates at iteration , and the functions and are defined in Proposition 2. Note that, due to the convexity of , the problem in Eq. (5) is convex and can be solved using any convex optimization algorithm (Boyd and Vandenberghe, 2004). The method we propose to use here is a projected first-order stochastic subgradient descent algorithm. Stochastic gradient descent is an efficient optimization algorithm that can handle large training sets (Akata et al, 2014). To make the exposition clearer, we first define the function:
The objective function of Eq. (5) that we wish to minimize can then be written as . We solve this optimization problem with the projected stochastic subgradient descent algorithm in Algorithm 1.
In more detail, at each iteration of Algorithm 1, a training sample is drawn. and are then updated by performing a step in the direction . Many different step-size rules can be used with stochastic gradient descent methods. In this paper, similarly to the strategy employed in Mairal et al (2012), we have chosen a step size that remains constant for the first iterations, and then takes the value . (The precise choice of the parameters and is discussed later in Section 5.1.) Moreover, to accelerate the convergence of the stochastic gradient descent algorithm, we consider a small variation of Algorithm 1, where a minibatch containing several training samples along with their labels is drawn at each iteration, instead of a single sample. This is a classical heuristic in stochastic gradient descent algorithms. Note that, when the size of the minibatch is equal to the number of training samples, this algorithm reduces to traditional batch gradient descent.
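The step-size schedule and the projection step can be sketched as follows. Both parts rest on assumptions, since the corresponding formulas are not legible in the text above: the schedule is taken to be min(rho, rho * t0 / t), which is constant for the first t0 iterations and then decays as 1/t, and the projection is taken to be an elementwise lower bound at a small constant eps. The names `rho`, `t0` and `eps` are hypothetical.

```python
import math


def step_size(t, rho, t0):
    """Assumed schedule: equal to rho while t <= t0, then rho * t0 / t."""
    return min(rho, rho * t0 / t)


def project(v, eps):
    """Project v onto the set of vectors whose entries are at least eps."""
    return [max(v_j, eps) for v_j in v]


def projected_step(v, subgradient, t, rho, t0, eps):
    """One projected subgradient update on the constrained variable v."""
    eta = step_size(t, rho, t0)
    moved = [v_j - eta * g_j for v_j, g_j in zip(v, subgradient)]
    return project(moved, eps)


# A single update with toy numbers: the second entry would become negative
# after the gradient step and is pushed back to the lower bound eps.
v_next = projected_step(
    v=[0.5, 0.05], subgradient=[1.0, 1.0], t=1, rho=0.1, t0=10, eps=1e-3
)
```

Only the constrained variable needs the projection. Under the same assumptions, the other variable would receive a plain subgradient step with the same step size.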
Finally, our complete LAST learning algorithm based on DCA is formally given in Algorithm 2. Starting from a feasible point and , LAST solves iteratively the constrained convex problem given in Eq. (5) with the solution proposed in Algorithm 1. Recall that this problem corresponds to the original DC program (P), except that the function has been replaced by its linear approximation around the current solution at iteration . Many criteria can be used to terminate the algorithm. We choose here to terminate when a maximum number of iterations has been reached, and terminate the algorithm earlier when the following condition is satisfied:
where the matrix is the row concatenation of and , and is a small positive number. This condition detects the convergence of the learning algorithm, and is verified whenever the change in and is very small. This termination criterion is used for example in Sriperumbudur et al (2007).
5 Experimental results
In this section, we evaluate the performance of our classification algorithm on texture, digit and natural image datasets, and compare it to different competitor schemes. We describe in Section 5.1 the choice of the parameters of the model and the algorithm. We then focus on the experimental assessment of our scheme. Following the methodology of Coates and Ng (2011), we break the feature extraction algorithms into (i) a learning algorithm (e.g., K-means) where a set of basis functions (or dictionary) is learned and (ii) an encoding function (e.g., sparse coding) that maps an input point to its feature vector. In a first step of our analysis (Section 5.2), we therefore fix the encoder to be the soft-thresholding mapping and compare LAST to existing supervised and unsupervised learning techniques. Then, in the following subsections, we compare our complete classification architecture (i.e., learning and encoding function) to several classifiers, in terms of accuracy and efficiency. In particular, we show that our proposed approach is able to compete with recent classifiers, despite its simplicity.
5.1 Parameter selection
We first discuss the choice of the model parameters for our method. Unless stated otherwise, we choose the vector according to the distribution of the different classes in the training set. We set the value of the regularization parameter to , as it was found empirically to be a good choice in our experiments. It is worth mentioning that setting by cross-validation might give better results, but it would also be computationally more expensive. Moreover, we set the parameter of the soft-thresholding mapping approximation to . Recall finally that the sparsity parameter is always equal to in our method, and therefore does not require any manual setting or cross-validation procedure.
In all experiments, we have moreover chosen to initialize LAST by setting equal to a random subsample of the training set, and is set to the vector whose entries are all equal to . We however noticed empirically that choosing a different initialization strategy does not significantly change the testing accuracy. Then, we fix the maximum number of iterations of LAST to . Moreover, setting properly the parameters and in Algorithm 1 is quite crucial in controlling the convergence of the algorithm. In all the experiments, we have set the parameter , where denotes the number of iterations. Furthermore, during the first iterations, several values of are tested , and the value that leads to the smallest objective function is chosen for the rest of the iterations. Finally, the minibatch size in Algorithm 1 depends on the size of the training data. In particular, when the size of the training data is relatively small (i.e., smaller than ), we used a batch gradient descent, as the computation of the (complete) gradient is tractable. In this case, we set the number of iterations to . Otherwise, we use a batch size of , and perform iterations of the stochastic gradient descent in Algorithm 1.
5.2 Analysis of the learning algorithm
In a first set of experiments, we focus on the comparison of our learning algorithm (LAST) to other learning techniques, and fix the encoder to be the soft-thresholding mapping for all the methods. We present a comparative study on texture and natural image classification tasks.
5.2.1 Experimental settings
We consider the following dictionary learning algorithms:

Supervised random samples: The atoms of are chosen randomly from the training set, in a supervised manner. That is, if denotes the desired proportion of class ‘’ atoms in the dictionary, the dictionary is built by randomly picking training samples from class ‘’ and samples from class ‘’, where is the number of atoms in the dictionary.

Supervised K-means: We build the dictionary by merging the subdictionaries obtained by applying the K-means algorithm successively to training samples of class ‘’ and ‘’, where the number of clusters is fixed respectively to and .

Dictionary learning for sparse coding: The dictionary is built by solving the classical dictionary learning problem for sparse coding:
(6)
To solve this optimization problem, we used the algorithm proposed by Mairal et al (2010) and implemented in the SPAMS package. The parameter is chosen by a cross-validation procedure in the set . Note that, while the previous two learning algorithms make use of the labels, this algorithm is unsupervised.

Stochastic Gradient Descent (SGD): The dictionary and classifier are obtained by optimizing the following objective function using minibatch stochastic gradient descent:
with . This corresponds to the original objective function in Eq. (3), where is replaced with its smooth approximant. (Note: we also tested SGD on the original, nonsmooth optimization problem. This resulted in slightly worse performance, so we only report results obtained on the smoothed objective function.) This smoothing procedure is similar to the one used in our relaxed formulation (Section 4.1). As in LAST, we set , , and use the same initialization strategy. This setting allows us to directly compare LAST and this generic stochastic gradient descent procedure, which is widely used for training neural networks. Following Glorot et al (2011), we use a minibatch size of , and use a constant step size chosen in . The step size is chosen through a cross-validation procedure, with a randomly chosen validation set made up of of the training data. The number of iterations of SGD is set to .
For the first three algorithms, the parameter in the soft-thresholding mapping is chosen by cross-validation in . The features are then computed by applying the soft-thresholding map , and a linear SVM classifier is trained in the feature space. For the random samples and K-means approaches, we set as we consider classification tasks with a roughly equal number of training samples from each class. Finally, for SGD and LAST, the dictionary and linear classifier are learned simultaneously. The encoder is used to compute the features.
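As an illustration of the simplest baseline in the list above, the supervised random samples dictionary can be built as in the sketch below. It assumes the desired proportion of atoms from the first class is given as a number `p` in [0, 1] and that the number of atoms per class is obtained by rounding. The function name, the argument names and the seed handling are hypothetical.

```python
import random


def supervised_random_dictionary(X_pos, X_neg, n_atoms, p, seed=0):
    """Pick atoms at random from the training samples of each class.

    X_pos and X_neg hold the training samples of the two classes, n_atoms is
    the dictionary size and p the assumed proportion of first-class atoms.
    Sampling is done without replacement within each class.
    """
    rng = random.Random(seed)
    n_pos = round(p * n_atoms)
    n_neg = n_atoms - n_pos
    atoms = rng.sample(X_pos, n_pos) + rng.sample(X_neg, n_neg)
    return [list(atom) for atom in atoms]  # copy so the data is not aliased


X_pos = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
X_neg = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
D_random = supervised_random_dictionary(X_pos, X_neg, n_atoms=4, p=0.5)
```

The resulting atoms are raw training samples, so this baseline involves no optimization at all; only the threshold and the linear classifier are then fitted.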
5.2.2 Experimental results
In our first experiment, we consider two binary texture classification tasks, where the textures are collected from the 32 Brodatz dataset (Valkealahti and Oja, 1998) and shown in Fig. 4. For each pair of textures under test, we build the training set by randomly selecting patches per texture, and the test data is constructed similarly by taking patches per texture. The test data does not contain any of the training patches. Moreover, all the patches are normalized to have unit norm. Fig. 5 shows the binary classification accuracy of the soft-thresholding based classifier as a function of the dictionary size, for dictionaries learned with the different algorithms.
For the first task (bark vs. woodgrain), one can see that LAST and SGD dictionary learning methods outperform the other methods for small dictionary sizes. For large dictionaries (i.e., ) however, all the learning algorithms yield approximately the same classification accuracy. This result is in agreement with the conclusions of Coates and Ng (2011), where the authors show empirically that the choice of the learning algorithm becomes less crucial when dictionaries are very large. In the second and more difficult classification task (pigskin vs. pressedcl), our algorithm yields the best classification accuracy for all tested dictionary sizes (). Interestingly, unlike the previous task, the design of the dictionary is crucial for all tested dictionary sizes. Using much larger dictionaries might result in performance that is close to the one obtained using our algorithm, but comes at the price of additional computational and memory costs.
Fig. 6 further illustrates the evolution of the objective function with respect to the elapsed training time for LAST and SGD, for a dictionary of size . One can see that LAST quickly converges to a solution with a small objective function. On the other hand, SGD reaches a solution with larger objective function than LAST.
We now conduct experiments on the popular CIFAR10 image database (Krizhevsky and Hinton, 2009). The dataset contains classes of RGB images. For simplicity and better comparison of the different learning algorithms, we restrict in a first stage the dataset to the two classes “deer” and “horse”. We extend our results to the multiclass scenario later in Section 5.5. Fig. 7 illustrates some training examples from the two classes. The classification results are reported in Fig. 8.
Once again, the soft-thresholding based classifier with a dictionary and linear classifier learned with LAST outperforms all other learning techniques. In particular, using the LAST dictionary learning strategy results in significantly higher performance than stochastic gradient descent for all dictionary sizes. We further note that with a very small dictionary (i.e., ), LAST reaches an accuracy of , whereas some learning algorithms (e.g., K-means) do not reach this accuracy even with a dictionary that contains as many as atoms. To further illustrate this point, we show in Fig. 9 the 2D testing features obtained with a dictionary of two atoms, when is learned respectively with the K-means method and LAST. Despite the very low dimensionality of the feature vectors, the two classes can be separated with a reasonable accuracy using our algorithm (Fig. 9 (b)), whereas features obtained with the K-means algorithm clearly cannot be discriminated (Fig. 9 (a)). We finally illustrate in Fig. 10 the dictionaries learned using K-means and LAST for atoms. It can be observed that, while the K-means dictionary consists of smoothed images that minimize the reconstruction error, our algorithm learns a discriminative dictionary whose goal is to underline the difference between the images of the two classes.
In summary, our supervised learning algorithm, specifically tailored for the soft-thresholding encoder, provides significant improvements over traditional dictionary learning schemes. Our classifier can reach high accuracy rates, even with very small dictionaries, which is not possible with other learning schemes.
5.3 Classification performance on binary datasets
In this section, we compare the proposed LAST classification method (by extension, we define the LAST classifier to be the soft-thresholding based classifier whose parameters are learned with LAST) to other classifiers. Before going through the experimental results, we first present the different methods under comparison:

Linear SVM: We use the efficient Liblinear (Fan et al, 2008) implementation for training the linear classifier. The regularization parameter is chosen using a cross-validation procedure.

RBF kernel SVM: We use LibSVM (Chang and Lin, 2011) for training. Similarly, the regularization and width parameters are set with cross-validation.

Sparse coding: Similarly to the previous section, we train the dictionary by solving Eq. (6). We use however the encoder that “matches naturally” with this training algorithm, that is:
where is the test sample, the previously learned dictionary and the resulting feature vector. A linear SVM is then trained on the resulting feature vectors. This classification architecture, denoted “sparse coding” below, is similar to that of Raina et al (2007).

Nearest neighbor classifier (NN): Our last comparative scheme is a nearest neighbor classifier where the dictionary is learned using the supervised K-means procedure described in Section 5.2.1. At test time, the sample is assigned the label of the dictionary atom (i.e., cluster) that is closest to it.
Note that we have dropped the supervised random samples learning algorithm used in the previous section, as it was shown to have worse classification accuracy than the K-means approach.
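The nearest neighbor baseline described above can be sketched as follows. This is only an illustration and not the authors' implementation: the function name `nearest_atom_label`, the list-based data layout and the choice of the squared Euclidean distance are assumptions made for the example.

```python
def nearest_atom_label(x, atoms, atom_labels):
    """Return the label of the dictionary atom closest to sample x.

    x           : list of floats (test sample)
    atoms       : list of atoms, each a list of floats with the same length as x
    atom_labels : class label attached to each atom (hypothetical layout)
    """
    best_label, best_dist = None, float("inf")
    for atom, label in zip(atoms, atom_labels):
        # Squared Euclidean distance between the sample and the atom (assumed metric).
        dist = sum((xi - ai) ** 2 for xi, ai in zip(x, atom))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy usage: two atoms per class in a 2-D space.
atoms = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
atom_labels = [+1, +1, -1, -1]
print(nearest_atom_label([0.8, 0.2], atoms, atom_labels))
```

The cost of this rule grows with the number of atoms rather than with the number of training samples, which is what makes it comparable to the other dictionary-based schemes.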
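Table 1: Classification accuracy [%] for the two binary texture classification tasks.
| Classifier | Task 1 [%] | Task 2 [%] |
| --- | --- | --- |
| Linear SVM | 49.5 | 49.1 |
| RBF kernel SVM | 98.5 | 90.1 |
| Sparse coding () | 97.5 | 85.5 |
| Sparse coding () | 98.1 | 90.9 |
| NN () | 94.3 | 84.1 |
| NN () | 97.8 | 86.6 |
| LAST () | 98.7 | 87.3 |
| LAST () | 98.6 | 93.5 |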
Table 1 first shows the accuracies of the different classifiers in the two binary texture classification tasks described in Section 5.2.2. In both experiments, the linear SVM classifier results in very poor performance, which is close to that of a random classifier. This suggests that the considered task is nonlinear, and has to be tackled with a nonlinear classifier. One can see that the RBF kernel SVM results in a significant increase in the classification accuracy. Similarly, the sparse coding nonlinear mapping also results in much better performance compared to the linear classifier, while the nearest neighbor approach performs a bit worse than sparse coding. We note that, for a fixed dictionary size, our classifier outperforms NN and sparse coding classifiers in both tasks. Moreover, it provides comparable or superior performance to the RBF kernel SVM in both tasks.
We now turn to the binary experiment “deer” vs. “horse” described in the previous subsection. We show the classification accuracies of the different classifiers in Table 2. LAST outperforms sparse coding and nearest neighbour classifiers for the tested dictionary sizes. RBF kernel SVM however slightly outperforms LAST with in this experiment. Note however that the RBF kernel SVM approach is much slower at test time, which makes it impractical for large-scale problems.
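Table 2: Classification accuracy [%] for the “deer” vs. “horse” task.
| Classifier | “deer” vs. “horse” [%] |
| --- | --- |
| Linear SVM | 72.6 |
| RBF kernel SVM | 83.5 |
| Sparse coding () | 70.6 |
| Sparse coding () | 76.2 |
| NN () | 67.7 |
| NN () | 70.9 |
| LAST () | 80.1 |
| LAST () | 82.8 |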
Overall, the proposed LAST classifier compares favorably to the different tested classifiers. In particular, LAST outperforms the sparse coding technique for a fixed dictionary size in our experiments. This result is notable, as sparse coding classifiers are known to provide very good classification performance in vision tasks. Note that, when used with another standard learning approach such as K-means, the soft-thresholding based classifier is outperformed by sparse coding, which shows the importance of the learning scheme in the success of this classifier.
5.4 Handwritten digits classification
We now consider a classification task on the MNIST (LeCun et al, 1998) and USPS (Hull, 1994) handwritten digits datasets. USPS contains images of size pixels, with images used for training and for testing. The larger MNIST database is composed of training images and test images, all of size pixels. We preprocess all the images to have zero mean and to be of unit Euclidean norm.
We address the multiclass classification task using a one-vs-all strategy, as is often done in classification problems. Specifically, we learn a separate dictionary and a binary linear classifier by solving the optimization problem for each one-vs-all problem. Classification is then done by predicting using each binary classifier, and choosing the prediction with the highest score. In LAST, for each one-vs-all task, we naturally set of the entries of to and the other entries to , assuming the distribution of features of the different classes in the dictionary should roughly be that of the images in the training set. In our proposed approach and SGD, we used dictionaries of size for USPS and for MNIST as the latter dataset contains many more training samples.
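The one-vs-all prediction rule described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes the one-sided soft-thresholding map max(0, z - alpha) suggested by the nonnegative sparse coding discussion in Appendix A, a bias-free linear classifier, and a hypothetical `models` container that stores one (dictionary, classifier) pair per class.

```python
def soft_threshold_features(x, dictionary, alpha):
    """Feature map max(0, <d_j, x> - alpha), computed for every atom d_j."""
    return [max(0.0, sum(dj * xj for dj, xj in zip(atom, x)) - alpha)
            for atom in dictionary]


def one_vs_all_predict(x, models, alpha):
    """Pick the class whose binary classifier gives the highest score.

    models maps a class label to a (dictionary, w) pair learned for the
    corresponding one-vs-all problem (hypothetical container).
    """
    scores = {}
    for label, (dictionary, w) in models.items():
        feats = soft_threshold_features(x, dictionary, alpha)
        scores[label] = sum(wj * fj for wj, fj in zip(w, feats))
    return max(scores, key=scores.get), scores


# Toy usage with two classes, two atoms per dictionary and a 2-D input.
models = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0]),
    1: ([[0.0, 1.0], [1.0, 0.0]], [1.0, -1.0]),
}
label, scores = one_vs_all_predict([0.9, 0.1], models, alpha=0.05)
print(label, scores)
```

Each binary problem has its own dictionary here, so the test cost grows linearly with the number of classes.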
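Table 3: Classification error [%] on the MNIST and USPS datasets.
| Classifier | MNIST | USPS |
| --- | --- | --- |
| Linear SVM | 8.19 | 9.07 |
| RBF kernel SVM | 1.4 | 4.2 |
| KNN | 5.0 | 5.2 |
| LAST | 1.32 | 4.53 |
| Sparse coding | 3.0 | 5.33 |
| Huang and Aviyente (2006) | - | 6.05 |
| SDL-G L (Mairal et al, 2008) | 3.56 | 6.67 |
| SDL-D L (Mairal et al, 2008) | 1.05 | 3.54 |
| Ramirez et al (2010) | 1.26 | 3.98 |
| SGD | 2.22 | 5.88 |
| 3 layers ReLU net (Glorot et al, 2011) | 1.43 | - |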
We compare LAST to baseline classification techniques described in the previous section, as well as to sparse coding based methods. In addition to building the dictionary in an unsupervised way, we consider the sparse coding classifiers in Mairal et al (2008); Huang and Aviyente (2006); Ramirez et al (2010), which construct the dictionary in a supervised fashion.
Classification results are shown in Table 3. One can see that LAST largely outperforms linear and nearest neighbour classifiers. Moreover, our method has a slightly better accuracy than RBF-SVM on MNIST, while being slightly worse on the USPS dataset. Our approach also outperforms the soft-thresholding based classifier optimized with stochastic gradient descent on both tasks, which highlights the benefits of our optimization technique compared to the standard algorithm used for training neural networks. We also report from Glorot et al (2011) the performance of a three-hidden-layer rectifier network optimized with stochastic gradient descent, without unsupervised pre-training. It can be seen that LAST, while having a much simpler architecture, slightly outperforms the deep rectifier network on the MNIST task. Furthermore, LAST outperforms the unsupervised sparse coding classifier on both datasets. Interestingly, the proposed scheme also competes with, and sometimes outperforms, the discriminative sparse coding techniques of Huang and Aviyente (2006), Mairal et al (2008) and Ramirez et al (2010), where the dictionary is tuned for classification. While providing comparable results, the LAST classifier is much faster at test time than sparse coding techniques and RBF-SVM classifiers. It is worth mentioning that the best discriminative dictionary learning results we are aware of on these datasets are achieved by Mairal et al (2012) with an error rate of on MNIST and on USPS. Note however that in that paper, the authors explicitly incorporate translation invariance in the problem by augmenting the training set with shifted versions of the digits. Our focus here is instead on methods that do not augment the training set with distorted or transformed samples.
5.5 CIFAR-10 classification
We now consider the multiclass classification problem on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). The dataset contains color images of size pixels, with images for training and for testing. The classifier input consists of vectors of raw pixel values of dimension . This setting, similar to that of Glorot et al (2011), takes no advantage of the fact that we are dealing with images and is sometimes referred to as “permutation invariant”, as columns in the data could be shuffled without affecting the result. We consider this scenario to focus on the comparison of the performance of the classifiers. Due to the relatively high dimensions of the problem (, ), we limit ourselves to classifiers with feedforward architectures. In fact, using RBF-SVM for this task would be prohibitively slow at the training and testing stages. For each one-vs-all task, we set the dictionary size of LAST and SGD methods to . Moreover, unlike the previous experiment, we set in LAST half of the entries of the sign vector to and the other half to . This is due to the high variability of intra-class images and the relatively small dictionary size: the number of atoms required to encode the positive class might not be sufficient if is set according to the distribution of images in the training set. The results are reported in Table 4.
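The two ways of fixing the sign vector mentioned in the experiments (a split that follows the class proportions, or an even split) can be sketched with a small helper. The helper name `make_sign_vector`, the parameter `positive_fraction` and the +1/-1 encoding of the signs are assumptions made for this illustration; the values used in the experiments are those stated in the text, not the ones below.

```python
def make_sign_vector(num_atoms, positive_fraction):
    """Build a sign vector with a given fraction of +1 entries.

    num_atoms         : dictionary size
    positive_fraction : share of atoms given a positive sign
                        (hypothetical parameter, e.g. 0.5 for an even split)
    """
    if not 0.0 <= positive_fraction <= 1.0:
        raise ValueError("positive_fraction must lie in [0, 1]")
    num_pos = int(round(num_atoms * positive_fraction))
    # Positive entries first, negative entries after; the order is arbitrary.
    return [1] * num_pos + [-1] * (num_atoms - num_pos)


# Even split, as in the CIFAR-10 setting described above.
print(make_sign_vector(10, 0.5))
# Split following a hypothetical class proportion of one in ten.
print(make_sign_vector(10, 0.1))
```

With a small dictionary, the proportional split can leave very few positive atoms, which is the motivation given in the text for the even split.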
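Table 4: Classification error [%] on the CIFAR-10 dataset.
| Classifier | CIFAR-10 |
| --- | --- |
| Linear SVM | 59.70 |
| LAST | 46.56 |
| SGD | 52.96 |
| 3 layers ReLU net | 50.86 |
| 3 layers ReLU net + sup. pre-train | 49.96 |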
Once again, this experiment confirms the superiority of our learning algorithm over linear SVM. Moreover, LAST significantly outperforms the generic SGD training algorithm (by more than ) in this challenging classification example. What is more surprising is that LAST significantly surpasses the rectifier neural network with hidden layers (Glorot et al, 2011) trained using a generic stochastic gradient descent algorithm (with or without pre-training). This shows that, despite the simplicity of our architecture (it can be seen as one hidden layer), the adequate training of the classification scheme can give better performance than complicated structures that are potentially difficult to train. We finally report the results of the sparse coding classifier with a dictionary trained using Eq. (6). If we use a dictionary with atoms, we get an error of . By using a much larger dictionary of atoms, the error reduces to . Computing the test features is however very expensive in that case.
6 Discussion
We first discuss in this section aspects related to the computational complexity of LAST. Then, we analyze the sparsity of the obtained solutions. We finally explain some of the differences between LAST and the generic stochastic gradient descent algorithm.
6.1 Computational complexity at test time
We compare the computational complexity and running times of the LAST classifier to those of different classification algorithms. Table 5 shows the computational complexity for classifying one test sample using various classifiers and the time needed to classify MNIST test images. We recall that , , and denote respectively the signal dimension, the number of training samples and the dictionary size. Clearly, linear classification is very efficient as it only requires the computation of one inner product between two vectors of dimension . Nonlinear SVMs however have a test complexity that is linear in the number of support vectors, which scales linearly with the training size (Burges, 1998). This solution is therefore not practical for relatively large training sets, like MNIST or CIFAR-10. Feature extraction with sparse coding involves solving an optimization problem, which roughly requires matrix-vector multiplications, where controls the precision (Beck and Teboulle, 2009). For a typical value of , the complexity becomes (neglecting other constants), that is orders of magnitude larger than the complexity of the proposed method. This can be seen clearly in the computation times, as our approach is slightly more expensive than linear SVM, but remains much faster than other methods. Note moreover that the soft-thresholding classification scheme is very simple to implement in practice at test time, as it is a direct map that only involves and linear operations.
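To make the comparison above concrete, the following sketch counts multiply-add operations for one test sample under a deliberately rough cost model. The function names, the assumption of two matrix-vector products per iteration of an iterative sparse coding solver, and all numerical values (dimension, dictionary size, number of iterations, number of support vectors) are hypothetical and chosen only for illustration; they are not the figures reported in Table 5.

```python
def linear_cost(n):
    """Multiply-adds for one inner product in dimension n."""
    return n


def soft_threshold_cost(n, num_atoms):
    """Multiply-adds for D^T x (num_atoms * n) plus the final inner product."""
    return num_atoms * n + num_atoms


def kernel_svm_cost(n, num_support_vectors):
    """One kernel evaluation (about n multiply-adds) per support vector."""
    return num_support_vectors * n


def iterative_sparse_coding_cost(n, num_atoms, num_iters):
    """Two matrix-vector products per iteration (assumed), then a linear classifier."""
    return num_iters * 2 * num_atoms * n + num_atoms


# Hypothetical problem sizes, for illustration only.
n, num_atoms, num_iters, num_sv = 1000, 100, 100, 10000
print("linear          :", linear_cost(n))
print("soft-threshold  :", soft_threshold_cost(n, num_atoms))
print("kernel SVM      :", kernel_svm_cost(n, num_sv))
print("sparse coding   :", iterative_sparse_coding_cost(n, num_atoms, num_iters))
```

Under this model the soft-thresholding map costs roughly one dictionary-vector product, while an iterative encoder multiplies that cost by the number of iterations.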
6.2 Sparsity
Sparsity is a highly beneficial property in representation learning, as it helps decompose the factors of variation in the data into high-level features (Bengio et al, 2013; Glorot et al, 2011). To assess the sparsity of the learned representation, we compute the average sparsity of our representation over all data points (training and testing combined) on the MNIST and CIFAR-10 datasets. We obtain an average of zeros in the MNIST case, and for CIFAR-10. In other words, our representations are very sparse, without adding an explicit sparsity penalization as in Glorot et al (2011). Interestingly, the reported average sparsity in Glorot et al (2011) is on MNIST and on CIFAR-10. Our one-layer representation therefore exhibits an interesting sparsity property, while providing good predictive performance.
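The sparsity measure discussed above can be computed along the following lines. This is a minimal sketch that assumes sparsity is measured as the fraction of exactly zero entries in each feature vector, averaged over samples; the function name `average_sparsity` and the optional tolerance `tol` are introduced only for the example.

```python
def average_sparsity(feature_vectors, tol=0.0):
    """Average, over samples, of the fraction of (near-)zero feature entries.

    feature_vectors : list of feature vectors (lists of floats)
    tol             : entries with |value| <= tol count as zeros (assumption)
    """
    fractions = []
    for feats in feature_vectors:
        zeros = sum(1 for v in feats if abs(v) <= tol)
        fractions.append(zeros / len(feats))
    return sum(fractions) / len(fractions)


# Toy usage on two 4-dimensional feature vectors.
features = [[0.0, 0.0, 1.2, 0.0], [0.0, 2.0, 0.0, 3.5]]
print(average_sparsity(features))
```

Because the soft-thresholding map outputs exact zeros below the threshold, no tolerance is needed in principle; `tol` is only useful for encoders that return small nonzero values.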
Table 5: Computational complexity for classifying one test sample and time needed to classify the MNIST test images.
| Classifier | Complexity | Time [s] |
| --- | --- | --- |
| Linear SVM | | 0.4 |
| RBF kernel SVM | | 154 |
| Sparse coding | (footnote 4) | 14 (footnote 5) |
| LAST classifier | | 1.0 |
Footnote 4: The complexity reported here is that of the FISTA algorithm (Beck and Teboulle, 2009), where denotes the required precision. Note that another popular method for solving sparse coding is the homotopy method, which is efficient in practice but has exponential theoretical complexity (Mairal and Yu, 2012).
Footnote 5: To provide a fair comparison with our method, we used dictionaries of the same size as for our proposed approach in this experiment.
6.3 LAST vs. stochastic gradient descent
As discussed earlier, the soft-thresholding classification scheme belongs to the more general family of neural network models. Neural networks are commonly optimized with stochastic gradient descent algorithms, as opposed to the DC method proposed in this paper. The proposed learning algorithm has several advantages compared to SGD:

Better local minimum: In all our experiments, LAST reached a better solution than SGD in terms of the testing accuracy. This confirms the observations of Tao and An (1998) whereby DCA converges to “good” local minima, and often to global minima in practice.

Descent method: Unlike stochastic gradient descent, LAST (and more generally DCA) is a descent method. Moreover, it is guaranteed to converge to a critical point (Tao and An, 1998).

No stepsize selection: Stochastic gradient descent (and more generally gradient descent based algorithms) is very sensitive to the difficult choice of the stepsize. Choosing a large stepsize in SGD can be beneficial as it helps escape local minima, but it can also lead to an oscillatory behaviour that prevents convergence. Interestingly, our optimization algorithm does not involve any stepsize selection, when given a convex optimization solver. In fact, our algorithm solves a sequence of convex problems, which can be solved with any off-the-shelf convex solver. Note that even if the intermediate convex optimization problems are solved with a gradient-descent based technique, the choice of the stepsize is less challenging as we have a better understanding of the theoretical properties of stepsize rules in convex optimization problems.
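The descent property and the absence of a stepsize can be illustrated on a toy problem that is unrelated to the learning problem of this paper. The sketch below applies the generic DCA iteration to the one-dimensional function f(x) = x^4 - x^2, written as g(x) - h(x) with g(x) = x^4 and h(x) = x^2 both convex. The function, its decomposition and the helper names are assumptions chosen because the convex subproblem has a closed-form solution.

```python
import math


def f(x):
    """Toy DC objective f = g - h, with g(x) = x**4 and h(x) = x**2."""
    return x ** 4 - x ** 2


def dca_step(x_k):
    """One DCA iteration: linearize h at x_k, then minimize the convex surrogate.

    The surrogate is x**4 - h'(x_k) * x with h'(x_k) = 2 * x_k. Setting its
    derivative to zero gives 4 * x**3 = 2 * x_k, solved here in closed form.
    """
    slope = 2.0 * x_k
    return math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)


def run_dca(x0, num_iters=50):
    """Return the list of DCA iterates starting from x0."""
    xs = [x0]
    for _ in range(num_iters):
        xs.append(dca_step(xs[-1]))
    return xs


xs = run_dca(0.1)
print(xs[-1], f(xs[-1]))
```

No stepsize appears anywhere: each iteration solves a convex problem exactly, and the objective values along the iterates do not increase. On this toy function the iterates started from a positive point approach 1/sqrt(2), one of the two minimizers.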
As we have previously mentioned, unlike SGD, our algorithm assumes the sign vector of the linear classifier to be known. A simple heuristic choice of this parameter was shown however to provide very good results in the experiments, compared to SGD. Of course, choosing this parameter with crossvalidation might lead to better results, but also implies a slower training procedure.
7 Conclusion
We have proposed a supervised learning algorithm tailored for the soft-thresholding based classifier. The learning problem, which jointly estimates a discriminative dictionary and a classifier hyperplane, is cast as a DC problem and solved efficiently with an iterative algorithm. The proposed algorithm (LAST), which leverages the DC structure, significantly outperforms stochastic gradient descent in all our experiments. Furthermore, the resulting classifier consistently leads to better results than the unsupervised sparse coding classifier. Our method moreover compares favorably to other standard techniques such as linear, RBF kernel or nearest neighbour classifiers. The proposed LAST classifier has also been shown to compete with recent discriminative sparse coding techniques in handwritten digits classification experiments. We should mention that, while the sparse coding encoder features some form of competition between the different atoms in the dictionary (often referred to as explaining-away (Gregor and LeCun, 2010)), our encoder acts on the different atoms independently. Despite its simple behavior, our scheme is competitive when the dictionary and classifier parameters are learned in a suitable manner.
The classification scheme adopted in this paper can be seen as a one-hidden-layer neural network with a soft-thresholding activation function. This activation function has recently gained significant attention in the deep learning community, as it is believed to make the training procedure easier and less prone to bad local minima. Our work reveals an interesting structure of the optimization problem for the one-hidden-layer version of that network, which makes it possible to reach good minima. An interesting question is whether it is possible to find a similar structure for networks with many hidden layers. This would help the training of deep networks, and offer insights on this challenging problem, which is usually tackled using stochastic gradient descent.
Appendix A Soft-thresholding as an approximation to nonnegative sparse coding
We show here that soft-thresholding can be viewed as a coarse approximation to the nonnegative sparse coding mapping (Denil and de Freitas, 2012). To see this, we consider the proximal gradient algorithm to solve the sparse coding problem with additional nonnegativity constraints on the coefficients. Specifically, we consider the following mapping
The proximal gradient algorithm proceeds by iterating the following recursive equation to convergence:
where prox is the proximal operator, is the chosen stepsize and is the indicator function, which is equal to if all the components of the vector are nonnegative, and otherwise. Using the definition of the proximal mapping, we have
Therefore, imposing the initial condition , and a stepsize , the first step of the proximal gradient algorithm can be written
which precisely corresponds to our soft-thresholding map. In this way, our soft-thresholding map corresponds to an approximation of sparse coding, where only one iteration of the proximal gradient algorithm is performed.
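The argument of this appendix can be checked numerically with a short sketch. It assumes the nonnegative sparse coding objective (1/2)||x - Dc||^2 + lam * ||c||_1 subject to c >= 0 (the factor 1/2 and the symbol names `D`, `c`, `lam`, `t` are assumptions of this sketch), for which one proximal gradient step reads c_next = max(0, c + t * D^T (x - D c) - lam * t). Starting from c = 0 with t = 1, the step reduces to the one-sided soft-thresholding map max(0, D^T x - lam).

```python
def prox_nonneg_l1(u, thresh):
    """Proximal operator of thresh * ||.||_1 plus the nonnegativity indicator."""
    return [max(0.0, ui - thresh) for ui in u]


def proximal_gradient_step(c, x, atoms, lam, t):
    """One proximal gradient step for nonnegative sparse coding.

    atoms is a list of dictionary atoms (the columns of D), each of length len(x).
    """
    n = len(x)
    # Reconstruction D c and residual x - D c.
    recon = [sum(atom[i] * cj for atom, cj in zip(atoms, c)) for i in range(n)]
    resid = [xi - ri for xi, ri in zip(x, recon)]
    # Gradient step on the smooth part: c + t * D^T (x - D c).
    u = [cj + t * sum(atom[i] * resid[i] for i in range(n))
         for atom, cj in zip(atoms, c)]
    return prox_nonneg_l1(u, lam * t)


def soft_threshold_map(x, atoms, alpha):
    """One-sided soft-thresholding features max(0, <d_j, x> - alpha)."""
    return [max(0.0, sum(a * xi for a, xi in zip(atom, x)) - alpha)
            for atom in atoms]


# Toy check: one step from c = 0 with t = 1 matches the soft-thresholding map.
atoms = [[0.6, 0.8], [1.0, 0.0], [-0.6, 0.8]]
x = [0.5, 0.7]
first_step = proximal_gradient_step([0.0, 0.0, 0.0], x, atoms, lam=0.2, t=1.0)
print(first_step)
print(soft_threshold_map(x, atoms, alpha=0.2))
```

Further iterations would let the atoms interact through the residual x - D c, which is the competition between atoms that the single-step map ignores.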
Appendix B Proofs
B.1 Proof of Proposition 1
Before going through the proof of Proposition 1, we need the following results in (Horst, 2000, Section ):
Proposition 3

Let be DC functions. Then, for any set of real numbers , is also DC.

Let be DC and be convex. Then, the composition is DC.
We recall that the objective function of (P) is given by:
The function is convex and therefore DC. We show that the first part of the objective function is also DC. We rewrite this part as follows:
Since is convex, is also convex (Boyd and Vandenberghe, 2004). As the loss function is convex, we finally conclude from Proposition 3 that the objective function is DC. Moreover, since the constraint is convex, we conclude that (P) is a DC optimization problem.
B.2 Proof of Proposition 2
We now suppose that , and derive the DC form of the objective function. We have:
The objective function of (P) can therefore be written as , with:
where and are convex functions.
Acknowledgments
The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments and references that helped to improve the quality of this paper.
References
Akata Z, Perronnin F, Harchaoui Z, Schmid C (2014) Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 507–520
An LTH, Tao PD (2005) The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research 133(1-4):23–46
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202
Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828
Bishop CM (1995) Neural Networks for Pattern Recognition. Oxford University Press, Inc.
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press
Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27
Chen CF, Wei CP, Wang YC (2012) Low-rank matrix recovery with structural incoherence for robust face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2618–2625
Coates A, Ng A (2011) The importance of encoding versus training with sparse coding and vector quantization. In: International Conference on Machine Learning (ICML), pp 921–928
Denil M, de Freitas N (2012) Recklessly approximate sparse coding. arXiv preprint arXiv:1208.0959
Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12):3736–3745
Fadili J, Starck JL, Murtagh F (2009) Inpainting and zooming using sparse representations. The Computer Journal 52(1):64–79
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
Figueras i Ventura R, Vandergheynst P, Frossard P (2006) Low-rate and flexible image coding with redundant representations. IEEE Transactions on Image Processing 15(3):726–739
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS), vol 15, pp 315–323
Gregor K, LeCun Y (2010) Learning fast approximations of sparse coding. In: International Conference on Machine Learning (ICML), pp 399–406
Horst R (2000) Introduction to global optimization. Springer
Huang K, Aviyente S (2006) Sparse representation for signal classification. In: Advances in Neural Information Processing Systems, pp 609–616
Hull JJ (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554
Kavukcuoglu K, Ranzato M, LeCun Y (2010a) Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467
Kavukcuoglu K, Sermanet P, Boureau YL, Gregor K, Mathieu M, LeCun Y (2010b) Learning convolutional feature hierarchies for visual recognition. In: Advances in Neural Information Processing Systems (NIPS), pp 1090–1098
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto
Larochelle H, Mandel M, Pascanu R, Bengio Y (2012) Learning algorithms for the classification restricted Boltzmann machine. The Journal of Machine Learning Research 13:643–669
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324
Ma L, Wang C, Xiao B, Zhou W (2012) Sparse representation for face recognition based on discriminative low-rank dictionary learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2586–2593
Maas A, Hannun A, Ng A (2013) Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (ICML)
Mairal J, Yu B (2012) Complexity analysis of the lasso regularization path. In: International Conference on Machine Learning (ICML), pp 353–360
Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1033–1040
Mairal J, Bach F, Ponce J, Sapiro G (2010) Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research 11:19–60
Mairal J, Bach F, Ponce J (2012) Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4):791–804
Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: International Conference on Machine Learning (ICML), pp 759–766
Ramirez I, Sprechmann P, Sapiro G (2010) Classification and clustering via dictionary learning with structured incoherence and shared features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3501–3508
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press
Sriperumbudur BK, Torres DA, Lanckriet GR (2007) Sparse eigen methods by DC programming. In: International Conference on Machine Learning (ICML), pp 831–838
Tao PD, An LTH (1998) A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization 8(2):476–505
Valkealahti K, Oja E (1998) Reduced multidimensional co-occurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1):90–94
Wright J, Yang A, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):210–227
Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1794–1801
Yuille A, Rangarajan A, Yuille A (2002) The concave-convex procedure (CCCP). In: Advances in Neural Information Processing Systems (NIPS), vol 2, pp 1033–1040
Zeiler M, Ranzato M, Monga R, Mao M, Yang K, Le Q, Nguyen P, Senior A, Vanhoucke V, Dean J, Hinton G (2013) On rectified linear units for speech processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3517–3521
Zhang Y, Jiang Z, Davis L (2013) Learning structured low-rank representations for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 676–683