1 Introduction
Sparse representation of signals is a powerful signal processing technique which has drawn massive interest in recent years, mainly due to its success in solving a wide variety of problems in different fields such as biomedical signal processing [1, 2, 3] and image analysis [4], including image denoising [5], color image restoration [6] and image classification [7]. Roughly speaking, the problem of sparse representation consists of obtaining approximations of the involved signals in terms of linear combinations of only a few prescribed, very simple characteristic signals taken from a large set [8, 9]. Besides providing a framework that is robust against distortions, missing data and noise, sparse representation of signals has many other advantages, such as super-resolution and dimensionality reduction
[10]. A sparse representation problem (SRP) is usually divided into two subproblems: an inference problem and a learning problem. The first one, often called “sparse coding”, consists of computing a representation vector satisfying a particular sparsity constraint, given a predefined dictionary. The second one, which involves solving a more complex problem, consists of finding an “optimal” dictionary, in a certain sense, for representing a given set of training signals. It is important to point out, however, that most formulations of SRPs only focus on minimizing a prescribed total representation error and do not take into account any a priori discriminative information, which could significantly improve the performance in the case of multi-object classification problems.
The first data-driven dictionary learning algorithms were originally developed almost two decades ago [8, 11, 12]. Some of them have their roots in probabilistic frameworks, considering the observed data as realizations of certain random variables [8, 11]. In [11], for example, the authors developed an algorithm for finding a redundant dictionary maximizing the likelihood function of the probability distribution of the data. In that work, an analytic expression for the likelihood function was derived by approximating the posterior distribution by Gaussian functions. On the other hand, an iterative approach for dictionary learning, known as the “Method of Optimal Directions” (MOD), was presented in
[12]. The sparse coding stage of this method makes use of a greedy algorithm called “Orthogonal Matching Pursuit” (OMP) [13], followed by a simple dictionary updating rule. A new iterative algorithm was proposed by Aharon et al. in [9]. This new approach, called “K-Singular Value Decomposition” (K-SVD), consists mainly of two stages: a sparse coding stage and a dictionary learning stage. The OMP algorithm is used in the sparse coding stage, which is followed by a dictionary updating step where the atoms are updated one at a time and the representation coefficients are allowed to change in order to minimize the total representation error.
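The K-SVD atom update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function and variable names are ours. Each atom is refit, together with the coefficients of the signals that use it, via the best rank-1 approximation of the residual computed without that atom.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, j):
    """One K-SVD update: refit atom j and its coefficients via a rank-1 SVD.

    Y: (n, m) signal matrix, D: (n, K) dictionary, X: (K, m) sparse codes.
    Only signals currently using atom j (nonzero entries of row j) are touched,
    so the sparsity pattern of X is preserved.
    """
    omega = np.flatnonzero(X[j, :])          # signals that use atom j
    if omega.size == 0:
        return D, X                          # atom unused; leave it as is
    X[j, omega] = 0.0                        # residual computed without atom j
    E = Y[:, omega] - D @ X[:, omega]
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                        # best rank-1 fit: new (unit-norm) atom
    X[j, omega] = s[0] * Vt[0, :]            # and its updated coefficients
    return D, X
```

Because the rank-1 SVD is the optimal rank-1 approximation of the residual, this step can never increase the total representation error.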
In the last decade, the interest in developing algorithms based on sparse representation of signals for pattern recognition purposes has notably increased
[7, 14, 15]. A large number of authors have proposed new supervised approaches for pattern recognition using sparse representations of signals. For instance, a discriminative version of the standard K-SVD method applied to face recognition was presented by Zhang Q. et al. [7]. In that work, the authors included a discriminative term in the objective function of the standard K-SVD algorithm. Results have shown that this modification constitutes an appropriate way to learn dictionaries satisfying both criteria: low reconstruction error and high recognition rates. Also, Pham D. et al. [14] proposed an iterative method that simultaneously optimizes a dictionary and a linear classifier. The authors successfully used the method in an image categorization problem. More recently, a novel approach for dictionary learning called “Label Consistent K-SVD” (LC-KSVD) was proposed in [15]. In that work, a discriminative sparse representation and a single predictive linear classifier were efficiently integrated into the objective function. However, besides supervised dictionary learning methods, many other alternative options have been presented [4, 16, 17]. These alternatives are mainly based on the pursuit of discriminability of sparse representations through the development of “structured” or, more precisely, category-specific dictionary methods. In [4], a method was proposed for learning multiple dictionaries that uses the reconstruction errors yielded by these dictionaries on image patches to derive a pixel-wise classification. This algorithm has proved to be robust, especially for local image classification tasks. A method for learning multiple non-redundant dictionaries for complex object categorization was proposed in [16]. This method was assessed on both visual object categorization and document classification image-related problems, yielding competitive performances. In [17], a method that simultaneously optimizes both a structured dictionary (category-specific visual words for each feature) and a classifier was introduced. This method yielded good recognition rates, showing a significant improvement over state-of-the-art object classification methods. A new method for structured dictionary learning was recently proposed by Sun et al. [18].
In that work, the learned dictionary was decomposed into class-specific subdictionaries, and classification is conducted by measuring the minimum reconstruction error among all the classes. The method was tested using both synthetic and real-world data, showing good performance.
In this work we propose a novel multi-class discriminative measure and a new dictionary learning method which yields structured dictionaries composed of category-specific subdictionaries specially constructed for multi-class classification purposes. Thus, the novelty of our approach is twofold. First, we introduce an innovative and effective multi-class discriminative measure whose main property is precisely its capability for quantifying the discriminative degree of each one of the atoms in a given dictionary. This measure takes into account not only whether a particular atom is used for representing a signal coming from a certain class and the magnitude of its corresponding representation coefficient, but also the effect that such an atom has on the total representation error. Secondly, this work presents a novel method for discriminative structured dictionary learning which yields a dictionary increasing the classifier recognition rate.
The organization of this article is as follows. A brief review of sparse representation of signals is presented in Section 2. In Section 3, we describe the database used in the experiments and we propose both a new discriminative measure and a structured dictionary learning method. Section 4 details all the experiments, while results and discussion are presented in Section 5. Finally, concluding comments and future work are presented in Section 6.
2 Sparse representation of signals
Sparse representation is a signal processing technique that seeks the sparsest representation of all the signals in a given set in terms of linear combinations of certain basic waveforms. The sparse representation problem can be separated into two subproblems, namely the so-called sparse coding problem and the dictionary learning problem. We shall now proceed to describe each one of these subproblems in detail. For that, let $y \in \mathbb{R}^n$ be a discrete signal and let $D \in \mathbb{R}^{n \times K}$ (generally with $K > n$) be a dictionary whose columns $d_j$ are atoms that we want to use for obtaining representations of $y$ of the form $y \approx Dx$. Here, and in the sequel, we shall refer to the vector $x \in \mathbb{R}^K$ as a “representation” of $y$. Sparsity consists essentially of obtaining a representation with as few nonzero elements as possible. A way of obtaining such representations consists of solving the following problem:

$$(P_0):\quad \min_{x} \|x\|_0 \quad \text{subject to} \quad y = Dx,$$
where $\|x\|_0$ denotes the $\ell_0$ pseudo-norm, defined as the number of nonzero elements of $x$. It turns out that imposing an exact representation of $y$ is too restrictive a constraint, which makes $(P_0)$ an NP-hard problem [19, §1.8], rendering the approach highly unsuitable for most practical applications.
Hence, the exact representation requirement is often relaxed by allowing small representation errors and imposing an upper bound on the $\ell_0$ pseudo-norm of the representations. Thus, a small-error tolerant version of $(P_0)$ is defined as follows:

$$(P_0^s):\quad \min_{x} \|y - Dx\|_2^2 \quad \text{subject to} \quad \|x\|_0 \le s,$$
where $s$ is a prescribed positive integer parameter. This formulation accounts for the presence of possible additive noise terms. In other words, it assumes that the signal can be represented in the form $y = Dx + w$, where $w$ is a small-energy noise term. Thus, this approach is more appropriate in a wide variety of real applications (such as biomedical signal and image processing) where the captured raw signals are always contaminated by noise. Several greedy strategies have been proposed for solving problem $(P_0^s)$ [13, 20]. Among them, the OMP algorithm is perhaps the most commonly used. This greedy algorithm ensures convergence to the projection of $y$ onto the span of $s$ atoms of a given dictionary in no more than $s$ iterations. It is important to note that the resulting representation vector $x$ has no more than $s$ nonzero entries. Figure 1 shows an example of the representation vectors obtained with this approach for two images of different classes coming from a widely used database, which we shall describe in detail in Section 3. Note that most coefficients are strictly equal to zero.
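A minimal OMP sketch, using standard numpy (our own naming, not the paper's code): at each step the atom most correlated with the current residual is added to the support, and the coefficients over the support are refit by least squares.

```python
import numpy as np

def omp(D, y, s):
    """Greedy OMP sketch: select at most s atoms, refitting by least squares."""
    n, K = D.shape
    x = np.zeros(K)
    residual = y.astype(float).copy()
    support = []
    for _ in range(s):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef          # re-orthogonalized residual
    x[support] = coef
    return x
```

For an orthonormal dictionary this recovers the $s$ largest-magnitude coefficients exactly; in general it is the greedy approximation of $(P_0^s)$ described above.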
Although preconstructed dictionaries, such as the well known wavelet packets [21], typically lead to fast sparse coding, they are almost always highly restricted to certain classes of signals. Hence, due to their lack of generalization, new approaches introducing data-driven dictionary learning techniques have emerged. A dictionary learning problem associated to the data $n$, $K$, $s$ and a collection of $m$ signals $\{y_i\}_{i=1}^m$ in $\mathbb{R}^n$ can be formally written as:

$$\min_{D,\,\{x_i\}} \sum_{i=1}^{m} \|y_i - Dx_i\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le s, \quad i = 1, \dots, m.$$
The solution of this problem yields, on one hand, a dictionary $D$ and, on the other hand, representations $x_i$ for all the signals in terms of that dictionary complying with the sparsity constraint for each one of the $m$ involved signals $y_i$. It is important to point out that in such a process the total representation error is minimized.
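The alternating scheme behind MOD-style dictionary learning can be sketched as follows (a toy illustration under our own naming, not a production implementation): sparse coding of every signal, followed by a closed-form least-squares dictionary update.

```python
import numpy as np

def learn_dictionary(Y, K, s, iters=10, seed=0):
    """Alternate greedy sparse coding with a MOD-style least-squares update."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    D = rng.normal(size=(n, K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(iters):
        # sparse coding stage (simple per-signal greedy selection)
        X = np.zeros((K, m))
        for i in range(m):
            r = Y[:, i].copy()
            sup = []
            for _ in range(s):
                sup.append(int(np.argmax(np.abs(D.T @ r))))
                sup = sorted(set(sup))
                c, *_ = np.linalg.lstsq(D[:, sup], Y[:, i], rcond=None)
                r = Y[:, i] - D[:, sup] @ c
            X[sup, i] = c
        # MOD dictionary update: D = Y X^T (X X^T)^+, then renormalize atoms
        D = Y @ X.T @ np.linalg.pinv(X @ X.T)
        norms = np.linalg.norm(D, axis=0)
        D /= np.where(norms > 0, norms, 1.0)
    return D, X
```

The K-SVD variant discussed earlier replaces the global least-squares update with per-atom rank-1 updates.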
Although data-driven dictionary learning algorithms produce sparse representations of signals which are robust against distortions and missing data, such representations quite often turn out to be unsatisfactory if the final objective is signal classification. This is mainly due to the fact that those algorithms do not take into account prior information concerning class membership. To overcome this flaw, several alternative approaches producing sparse representations in terms of a unique (and shallow) dictionary for signal classification were presented [7, 14, 15]. A different approach is the construction of structured dictionaries composed of subdictionaries whose atoms are discriminative, in a certain sense, for each one of the classes, i.e. each subdictionary has a group of atoms that are discriminative only for a particular class. The use of structured dictionaries can be useful for reducing the feature dimension, avoiding overfitting and optimizing the performance of a classifier, among other purposes. In recent years, there has been increasing interest in developing algorithms whose main purpose is to obtain “optimal” subdictionaries to be used for signal classification [1, 22, 23]. In [22], a method called “Clustering based Online Learning of Dictionaries” (COLD) was presented. This algorithm makes use of the mean shift clustering procedure [24] to identify modes in the distribution of the atoms and hence obtain a dictionary of minimal size. Recently, Chen et al. [23]
introduced a dictionary learning method for image and video editing tasks. In that work, the problem of seeking an optimal dictionary is solved by using a symmetric version of the “Kullback-Leibler Divergence” (KLD) [25]. This divergence has been successfully used for detecting redundant atoms in a given dictionary. Our proposal consists of defining and using a new discriminative measure for selecting the most discriminative atoms for each one of the classes and using them for building a new structured dictionary.

3 Materials and methods
In this section we briefly describe the database used in the experiments. Additionally, we describe in detail both the new proposed multi-class discriminative measure and the novel structured dictionary learning method.
3.1 Database
One of the most popular databases used to assess Computer Vision and Pattern Recognition methods is the “Modified NIST” (MNIST) database
[26]. This database has been widely used for assessing new methods including Deep Learning techniques
[27], Extreme Learning Machines [28] and many other types of neural networks [29], among others. The MNIST database contains a total of 70,000 normalized and centered grayscale images of handwritten digits ranging from 0 (zero) to 9 (nine), each one of size 28 × 28 pixels (leading to a feature vector of length 784). Also, the number of images per class varies from 5,421 to 6,742, corresponding to classes 5 and 1, respectively. Additionally, this database provides information about standard partitions used for training (60,000 images) and testing (10,000 images). Although each one of the original (raw) images coming from the MNIST database can be represented as a single column vector consisting of 784 elements (features), it becomes necessary to reduce its dimensionality for practical reasons. In this work, the image dimension reduction process is carried out by using the well known bicubic interpolation method [30], which is not only accurate, but also smooth and computationally efficient. This method was used for obtaining new (reduced) images, each one of size 16 × 16 pixels, leading to feature vectors of length 256.

3.2 A new discriminative measure
Discriminative dictionaries can be thought of as a collection of atoms specially learned for signal classification. These dictionaries not only produce accurate representations of the training signals (in terms of their waveforms) coming from different classes, but they also render their representations easy to distinguish by a suitable classifier. However, the problem of finding a discriminative dictionary is computationally very costly. A way to overcome the computational complexities entailed by such a problem consists of defining an appropriate discriminative value functional that independently evaluates each one of the atoms in a given dictionary. This simplification is based on the assumption that each atom in the dictionary is used to model specific characteristics that are not modeled by any one of the other atoms. Thus, the discriminative information provided by a particular atom is different from the information contributed by all the other atoms.
In a previous work [1], we presented a simple approach for quantifying the discriminative degree of the atoms of a given dictionary in the context of a binary classification problem. The approach essentially consists of counting the number of times that a particular atom is used, i.e. becomes “active”, for representing signals belonging to each one of the two classes. As a result of this counting process, an activation frequency for each atom, given the class, is obtained. To quantify the discriminative degree of the atom $d_j$ (the $j$-th column of $D$), the absolute difference between its activation frequencies for the two classes is computed. This value will be large if (and only if) the atom is much more frequently used for representing signals in one of the two classes and, in that case, it can be thought of as a quantifier of the capability of $d_j$ to supply important discriminative information regarding class membership. The use of this discriminative quantifier gave rise to a method called Most Discriminative Column Selection (MDCS) for discriminative subdictionary construction [1]. The MDCS method has proven robust for efficiently extracting meaningful features from segments of pulse oximetry signals for detecting apnea-hypopnea events.
In this work we propose an extension of the measure described above to multi-class classification problems. This extension consists of defining and using a new multi-objective function aimed at quantifying the discriminative degree of each one of the atoms in a given dictionary. This function will be defined as a convex combination of three discriminative terms, all based on the sparse representations of the data. In what follows, a detailed description of each one of these terms, as well as a formal definition of the function, is presented.
3.2.1 Activation frequency measure
Conditional activation frequencies provide a reasonable starting point for determining the discriminative degree of individual atoms in a given dictionary. For this reason, our approach begins by computing the activation frequency $f_{j,c}$ of atom $d_j$ given the class $c$, for $1 \le c \le C$. Moreover, the conditional activation probability of $d_j$ given (that a signal belongs to) class $c$ is denoted by $P(d_j \mid c)$. Given a set of $m_c$ signals belonging to class $c$, this conditional probability can be approximated by the quotient $f_{j,c}/m_c$. Note that if the problem is balanced, i.e. if the number of available signals belonging to each one of the classes is the same, say $\tilde{m}$, then $m_c = \tilde{m}$ for all $c$, $1 \le c \le C$. In this work, the problem of quantifying the discriminability of each atom is tackled by analyzing its individual contribution to the signal classification process. More specifically, a particular atom $d_j$ is considered as having important discriminative information for class-$c$ signals if $P(d_j \mid c)$ is significantly larger than $P(d_j \mid c')$, for all $c' \neq c$. Hence, if $d_j$ is discriminative for class $c$, the activation of its representation coefficient will be strongly associated to class membership. Since the performance of a classifier highly depends on the discriminability of its inputs, it is reasonable to think that using the representation coefficients as inputs of a classifier, for atoms selected using that criterion, could result in good recognition rates.
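The conditional activation probabilities above can be estimated directly from the sparse coefficient matrix. A minimal sketch (our own naming and conventions): an atom counts as “active” for a signal when its coefficient is nonzero.

```python
import numpy as np

def activation_probabilities(X, labels, n_classes):
    """Estimate P(atom j active | class c) as the fraction of class-c signals
    whose sparse code uses atom j (i.e. has a nonzero coefficient in row j).

    X: (K, m) sparse coefficient matrix; labels: (m,) integer class labels.
    Returns a (K, n_classes) matrix of estimated probabilities.
    """
    K, m = X.shape
    P = np.zeros((K, n_classes))
    active = X != 0                      # activation indicator per atom/signal
    for c in range(n_classes):
        cols = np.flatnonzero(labels == c)
        P[:, c] = active[:, cols].sum(axis=1) / max(len(cols), 1)
    return P
```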
For a given $j$, $1 \le j \le K$, we shall denote by $c_j^{*}$ the class that maximizes all conditional activation probabilities $P(d_j \mid c)$, for all $c$, $1 \le c \le C$, i.e. such that

$$c_j^{*} = \arg\max_{1 \le c \le C} P(d_j \mid c). \qquad (1)$$

In the (unlikely) case that there is more than one value of $c$ maximizing $P(d_j \mid c)$, $c_j^{*}$ is defined by arbitrarily choosing one of them, for instance the smallest one (note that the order of the classes is completely irrelevant). Similarly, for a fixed $j$, $1 \le j \le K$, $c_j^{**}$ is defined as the class leading to the second largest conditional activation probability, i.e. such that

$$c_j^{**} = \arg\max_{c \neq c_j^{*}} P(d_j \mid c). \qquad (2)$$

Here again, if there is more than one value of $c$ satisfying (2), then $c_j^{**}$ is arbitrarily chosen as any one of them.
Next we define the function $f_1 : \{1, \dots, K\} \to [0, 1]$ by

$$f_1(j) = P(d_j \mid c_j^{*}) - P(d_j \mid c_j^{**}). \qquad (3)$$

We shall refer to $f_1$ as the “activation frequency measure”.
Note that $0 \le f_1(j) \le 1$. The atom $d_j$ is said to be discriminative (for class $c_j^{*}$) if and only if $f_1(j) > 0$. Clearly, within this setting, if an atom is discriminative, it will be so only for the class $c_j^{*}$; otherwise it will be discriminative for none of them. Moreover, the value of $f_1(j)$ can be thought of as a “measure” of the degree of discriminability of the atom $d_j$ (for the corresponding class $c_j^{*}$), based solely on the activation frequency information.
Figure 2 shows graphic representations of two examples of conditional activation probabilities associated to two different atoms (top and bottom, respectively). The vertical bars represent the value of each conditional activation probability for each class $c$. For the top case, the largest conditional activation probability, attained at class 4, is clearly greater than the second largest one, and therefore the atom is considered to be discriminative (for class 4). For the bottom case, the two largest conditional activation probabilities are equal (although their classes could be interchanged); since $P(d_j \mid c_j^{**}) = P(d_j \mid c_j^{*})$, one has $f_1(j) = 0$, implying that the atom is not discriminative for class $c_j^{*}$, and therefore it is not discriminative for any one of the classes.
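The activation frequency measure can be computed from the probability matrix in a few lines. A sketch under our own naming, following the verbal definition above: the gap between the largest and second-largest conditional activation probabilities of each atom.

```python
import numpy as np

def activation_frequency_measure(P):
    """f1(j): gap between the largest and second-largest conditional
    activation probabilities of atom j; also returns the winning class.

    P: (K, C) matrix of conditional activation probabilities.
    """
    order = np.argsort(P, axis=1)            # classes sorted ascending per atom
    best = order[:, -1]                      # class attaining the maximum
    rows = np.arange(P.shape[0])
    f1 = P[rows, best] - P[rows, order[:, -2]]
    return f1, best
```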
3.2.2 Coefficient magnitude measure
On one hand, the sparse representation of signals provides valuable information regarding the activation of atoms and, on the other hand, it can highlight important characteristics or features contained in particular event-related waveforms of signals or images, such as brightness variations in images and slight changes in biomedical signals, to name but a few. With the above observation in mind, we proceed now to define a second measure that takes into account the magnitude of the representation coefficients. For that, given an atom $d_j$, let $c_j^{*}$ and $c_j^{**}$ be the classes as defined in (1) and (2), respectively, and let $X^{*}$ and $X^{**}$ be the matrices which provide the sparse representations, in terms of the dictionary $D$, of the signals of classes $c_j^{*}$ and $c_j^{**}$, respectively. Additionally, let $q_j$ denote the quotient between the magnitudes of the $j$-th rows of $X^{*}$ and $X^{**}$. The coefficient magnitude measure is the function $f_2 : \{1, \dots, K\} \to [0, 1]$ defined by
(4) 
Here again $0 \le f_2(j) \le 1$. Based on this measure, an atom $d_j$ is said to be discriminative (for the class $c_j^{*}$) if and only if $f_2(j) > 0$ and, in that case, the value of $f_2(j)$ quantifies the corresponding degree of discriminability of $d_j$ for the class $c_j^{*}$.
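The quotient $q_j$ underlying this measure can be sketched as follows. This is our own illustrative reading (including the $\ell_1$ choice of magnitude and the stabilizing epsilon), not the paper's exact formula for (4).

```python
import numpy as np

def magnitude_quotient(X_star, X_star2, j):
    """q_j: ratio of the l1 mass of row j of the codes of the best class
    to that of the runner-up class (an epsilon avoids division by zero)."""
    num = np.abs(X_star[j, :]).sum()
    den = np.abs(X_star2[j, :]).sum()
    return num / (den + 1e-12)
```

A value of $q_j$ well above one indicates that atom $j$ carries substantially more coefficient energy for its winning class than for the runner-up.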
3.2.3 Representation error measure
We now proceed to describe the third measure for quantifying the discriminative degree of each atom in a dictionary. This measure takes into account the contribution of each atom to the total representation error. Let $X_c$ be the matrix providing the sparse representation of the class-$c$ signals $Y_c$, as in the previous measure. Clearly, the contribution of the class $c$ to the total representation error can be written as [9]

$$\|Y_c - D X_c\|_F^2 = \Big\| \Big( Y_c - \sum_{i \neq j} d_i x_c^i \Big) - d_j x_c^j \Big\|_F^2, \qquad (5)$$

where $x_c^i$ denotes the $i$-th row of $X_c$ and $e_j(c) = \| Y_c - \sum_{i \neq j} d_i x_c^i \|_F^2$ denotes the total representation error for all class-$c$ signals when the atom $d_j$ is removed. Hence, a large value of $e_j(c)$ indicates that the contribution of $d_j$ to the representation of class-$c$ signals is large. We then define a “representation error measure” $f_3$ by
(6) 
where the values $e_j(c)$ are appropriately normalized, for $1 \le j \le K$ and $1 \le c \le C$.
Here again $0 \le f_3(j) \le 1$, and an atom $d_j$ is said to be discriminative (for class $c_j^{*}$) with respect to this measure if and only if $f_3(j) > 0$. In such a case, the value of $f_3(j)$ quantifies the corresponding degree of discriminability.
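The per-atom error contribution $e_j(c)$ can be computed by zeroing out the atom's coefficient row and measuring the residual, in the K-SVD style. A minimal sketch with our own naming:

```python
import numpy as np

def error_without_atom(Yc, D, Xc, j):
    """e_j(c): squared Frobenius representation error for class-c signals
    when atom j is removed from the reconstruction."""
    Xr = Xc.copy()
    Xr[j, :] = 0.0                      # drop atom j's contribution
    return float(np.linalg.norm(Yc - D @ Xr, 'fro') ** 2)
```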
3.2.4 A combined discriminative measure
Each one of the three previously defined measures takes into account different properties related to the discriminability of each one of the atoms in a given dictionary. It is then reasonable to think of a measure that appropriately combines all three of them. With that in mind, given two positive parameters $\lambda_1$ and $\lambda_2$, with $\lambda_1 + \lambda_2 \le 1$, we define the function $F : \{1, \dots, K\} \to [0, 1]$ as

$$F(j) = \lambda_1 f_1(j) + \lambda_2 f_2(j) + (1 - \lambda_1 - \lambda_2) f_3(j). \qquad (7)$$

We shall refer to $F$ as the “combined discriminative measure”. Clearly, as $\lambda_1$ and $\lambda_2$ vary between 0 and 1, (7) exhausts all possible convex combinations of the three single measures $f_1$, $f_2$ and $f_3$. A challenging problem, on which we shall shed some light in Section 4.3, consists precisely of finding the “optimal” pair of parameters $(\lambda_1, \lambda_2)$ leading to the best recognition rate for a given problem.
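The convex combination in (7) is a one-liner; a sketch with our own naming, guarding the constraint on the two parameters:

```python
import numpy as np

def combined_measure(f1, f2, f3, lam1, lam2):
    """Convex combination F = lam1*f1 + lam2*f2 + (1 - lam1 - lam2)*f3."""
    assert 0 <= lam1 and 0 <= lam2 and lam1 + lam2 <= 1
    return (lam1 * np.asarray(f1) + lam2 * np.asarray(f2)
            + (1.0 - lam1 - lam2) * np.asarray(f3))
```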
3.3 Dictionary learning algorithm
Supervised dictionary learning methods have attracted great interest in recent years. Implementations of these methods were originally focused on efficiently learning simple (unstructured) dictionaries that incorporate “discriminability” information (in terms of signal classification) in their optimization process. This information can be introduced into the learning model by considering different discriminative criteria [31, 32, 33]. The most commonly used criteria are the so-called “softmax” cost function [16], the Fisher criterion [34] and the linear predictive classification error [7, 14], to name just a few.
Although there exist several ways to simultaneously optimize both a dictionary, i.e. to solve a representation learning problem, and a classifier, i.e. to find a solution to a classification problem, a very often used strategy consists simply of dividing that problem into two subproblems [4, 16]. Hence, it is possible to use all existing traditional dictionary learning techniques, such as MOD and K-SVD, and then train a single classifier at a later stage. Our proposal is based precisely on this strategy, but introducing class information in the dictionary learning stage. For that, we propose a new method for multi-class structured dictionary learning called “Discriminant Atom Selection K-SVD” (DAS-KSVD), in which we use the proposed discriminative measure $F$ to efficiently select class-specific discriminant atoms from given “auxiliary” dictionaries and iteratively construct a structured one. The DAS-KSVD method aims at building a structured dictionary by stacking subdictionaries side-by-side, one for each one of the $C$ classes. It is important to point out that each subdictionary contains atoms that are discriminative, in terms of $F$, for the signals of the corresponding class.
We now proceed to describe the building steps of the proposed DAS-KSVD method in more detail (Algorithm 1). Here, and in the sequel, we shall consider the training signals as realizations of a particular $n$-dimensional random vector. Given a signal matrix $Y$ composed of $m$ samples, the required sparsity level $s$, the redundancy factor, the number of training signals per class, the number of iterations and the class label vector, the proposed algorithm begins by assigning an initial uniform probability distribution over the samples, so that each sample has probability $1/m$ of being selected (Alg. 1, line 2). This is the probability that a training signal is selected from $Y$ in order to construct a new sampled “learning” matrix that is used specifically for learning the initial dictionary. Additionally, if a certain training signal is used for learning the dictionary in a particular iteration, then it is desirable that such a signal be less likely to be selected than the other ones in the following iterations. By promoting diversity in this way, the final learned atoms are expected to be capable of highlighting different intrinsic properties of the training data.
The iterative process of this algorithm (Alg. 1, lines 3 to 10) begins by statistically sampling a reduced number of samples from each class signal matrix (for instance, ten times smaller than the number of available class signals). As a result of such a sampling process, a reduced learning matrix is built (Alg. 1, line 4). Also, to update the probability distribution, the probability of each sample that has been selected is multiplied by a nonnegative number smaller than one, while the probabilities of the remaining samples are left unchanged. It is important to point out that an appropriate normalization of these weights, forcing them to sum to one, is needed. Figure 3 shows graphic representations of five such probability distributions. It can be observed that, at the first iteration, all samples have the same probability of being selected. In addition, note that the probability value of most samples decreases as the iteration number increases.
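The sampling-and-reweighting step above can be sketched as follows. This is an illustration under our own assumptions (the decay factor `gamma` and its default value are ours): draw without replacement, shrink the weights of the drawn samples, renormalize.

```python
import numpy as np

def sample_and_decay(p, n_draw, gamma=0.5, rng=None):
    """Draw n_draw indices without replacement according to p, then shrink
    the weight of the drawn samples by gamma and renormalize to sum to one."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(p), size=n_draw, replace=False, p=p)
    p = p.copy()
    p[idx] *= gamma                     # selected samples become less likely
    return idx, p / p.sum()
```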
In order to increase robustness, all training signals used to learn the dictionary (Alg. 1, line 5) are also degraded by incorporating additive zero-mean Gaussian noise whose magnitude increases proportionally with the iteration number. The magnitude of the noise is updated at each iteration in terms of the variance of the training signals and a prescribed nonnegative constant. It is important to point out, however, that the first iteration of the proposed learning algorithm leaves the original images undegraded. On the other hand, the dictionary is learned by means of the traditional unsupervised K-SVD algorithm [9]. Then the sparse coefficient matrix is obtained by applying the previously mentioned OMP algorithm (Alg. 1, line 6). The reason for having chosen this pursuit algorithm is that it guarantees convergence to the projection of each one of the signals onto the span of the dictionary atoms in no more than $s$ iterations, leaving the rest of the coefficients equal to zero. As previously mentioned, at the beginning of each iteration, the standard unsupervised K-SVD algorithm is used to learn a dictionary. Note that this dictionary learning stage does not take into account any information concerning class membership. Additionally, the sampled subset of signals used to learn the dictionary was appropriately degraded by incorporating additive Gaussian noise with different magnitudes. The left and right sides of Figure 4 show examples of atoms coming from the dictionaries learned at iterations 1 and 20, respectively. It can be seen that, at the first iteration, the dictionary is learned by means of noise-free input signals. On the other hand, the dictionary learned at iteration 20 still preserves the structure of the handwritten digits on a blurred background.
The proposed discriminative approach consists of optimizing and using the new combined discriminative measure $F$ for selecting the most discriminative atom of the dictionary for each one of the classes (Alg. 1, lines 7 to 9). As explained in Section 3.2, the value of $F(j)$ corresponds to the degree of discriminability of the atom $d_j$ for one (and only one) class, namely $c_j^{*}$. Note that the process of selecting the most discriminative atoms poses a serious difficulty, since the problem of finding the optimal pair of parameters $(\lambda_1, \lambda_2)$ is very challenging. For more details about the tuning of that pair of parameters, we refer the reader to Section 4.3 and Appendix A. Also, the construction of each subdictionary (Alg. 1, line 8) basically consists of taking, one by one, the most discriminative atoms of the dictionary for each one of the classes and stacking them side-by-side. In the case that there is more than one class-related candidate complying with the proposed discriminative criterion, the selected atom is defined as the one that maximizes all possible values of $F$. Otherwise, in case the dictionary lacks discriminative atoms, the signal selection process (Alg. 1, line 4) is restarted.
3.4 Classifier
In this work, a Multilayer Perceptron (MLP) neural network is used in order to assess the proposed method. The MLP is one of the most popular classes of neural networks, whose architecture consists of a fully connected assembly of single artificial neurons. The MLP neural network is typically comprised of an input layer, one (or more) hidden layers and an output layer
[35]. The inputs (features) are processed layer-by-layer, moving forward through the network. Each artificial neuron receives one (or more) inputs from its preceding nodes, processes the information and produces an output that is transmitted to the next node. The output of each neuron is obtained by applying an activation (transfer) function (linear or not) to the weighted sum of the inputs plus a bias term. More precisely, the output of a neuron is defined as

$$a = \varphi\Big(\sum_{i} w_i x_i + b\Big), \qquad (8)$$

where the transfer function is denoted by $\varphi$, the weights that connect the inputs $x_i$ to the neuron for a given layer are represented by $w_i$, and $b$ is the bias term.
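The neuron output and the layer-by-layer forward pass described above can be sketched in a few lines (our own naming; `tanh` is just an example transfer function, not necessarily the one used in the experiments):

```python
import numpy as np

def neuron_output(x, w, b, phi=np.tanh):
    """Single-neuron output: activation applied to weighted inputs plus bias."""
    return phi(np.dot(w, x) + b)

def mlp_forward(x, layers, phi=np.tanh):
    """Forward pass through a list of (W, b) fully connected layers."""
    for W, b in layers:
        x = phi(W @ x + b)
    return x
```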
Since the MLP neural network training process is supervised, the desired outputs (labels) are required. The most popular method for training MLP neural networks is the backpropagation algorithm [36]. This algorithm iteratively adjusts the synaptic weights in the network by minimizing a given measure which quantifies the difference between the current output vector and the desired one.
4 Experiments
In this section we present a brief description of the experimental setup. Additionally, we briefly recall the evaluation metric used for assessing the proposed dictionary learning method. Finally, we comment on appropriate ways of tuning the parameters.
4.1 Experimental setup
As mentioned above, the performance of the new DAS-KSVD method is evaluated using the standard training and testing partitions of the MNIST database. Although it is not a requirement, our experiments were performed using a balanced set of training and validation samples. For that, subsets consisting of 4,000 and 1,000 images for each one of the classes, coming from the standard partition of the training dataset, were randomly chosen. Hence, new training and validation matrices comprising 40,000 and 10,000 samples, respectively, were built. It is important to point out, however, that the standard partition of the testing dataset, of size 10,000, was left unchanged.
It is worth mentioning that the training matrix was used both for dictionary learning and for training the MLP neural network, while the validation matrix was used for testing the MLP neural network as well as for parameter tuning. Furthermore, the testing matrix was only taken into account for performing the final test.
We shall now describe the parameter settings for the DAS-KSVD method that were used in the experiments. We evaluated the effect of the dictionary size on the final recognition rate. For that, we considered four structured dictionaries composed of 50, 100, 150 and 200 atoms, respectively. Hence, the DAS-KSVD algorithm was run for 20 iterations.
4.2 Evaluation metric
The overall accuracy rate is one of the most popular performance measures used to assess pattern recognition methods. The accuracy measure (Acc) is defined as the proportion of correctly predicted testing samples. Let N be the number of testing samples, y_i and \hat{y}_i the label and the prediction, respectively, for the i-th sample, and \delta(\cdot, \cdot) the well known delta function whose output is one if its arguments coincide and zero otherwise. The Acc measure is defined as:
Acc = \frac{1}{N} \sum_{i=1}^{N} \delta(y_i, \hat{y}_i).   (9)
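Eq. (9) amounts to averaging the delta function over the test set. A minimal sketch (the function name is ours):

```python
import numpy as np

def accuracy(labels, predictions):
    """Eq. (9): proportion of correctly predicted testing samples."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    # elementwise (labels == predictions) plays the role of the delta function
    return float(np.mean(labels == predictions))
```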
4.3 Parameters tuning
Although the pursuit of discriminative atoms is perhaps the most challenging issue addressed in this work, finding the optimal pair of parameters leading to the best recognition rate is also a very difficult task. Moreover, that optimal pair of parameters strongly depends on the application under study. For that reason, we propose applying the well known and widely used "grid search" method for parameter optimization. For more details regarding the grid search method, we refer the reader to Appendix A. In what follows, the final choice of the remaining parameters of the proposed algorithm is described.
At each iteration of the proposed DAS-KSVD method, one (and only one) discriminant atom for each one of the classes is selected. Hence, each iteration of this method generates one discriminative atom per class (10 for MNIST) and therefore, if the algorithm is configured to perform J iterations, then the final structured dictionary will be composed of 10J discriminant atoms. In order to explore the effect of the final structured dictionary size, the experiments were performed considering a total of 20 iterations. Thus, the final discriminative dictionary is composed of 200 atoms. On the other hand, the number of samples from each class used to learn the full dictionary was fixed beforehand.
As described in Section 3.3, two additional parameters need to be adjusted and fixed. Several trials were performed in order to obtain appropriate values for those parameters, and the values finally selected and used in our experiments were those presenting the best tradeoff between image degradation and iteration order.
The standard K-SVD algorithm starts by performing a random selection of 256 samples from the learning signal matrix. Note that the redundancy factor used for constructing the dictionary is equal to one. Also, the maximum number of K-SVD iterations was fixed to 50 in the code. It is well known that the K-SVD algorithm internally computes sparse codes representing each one of the involved signals. These codes were obtained by means of the OMP algorithm. To establish an appropriate sparsity level, a variety of sparse solutions were tested. It was found that a sparsity degree of 20% presents the best tradeoff between discriminability and representativity of the signals.
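The sparse coding step can be illustrated with scikit-learn's `orthogonal_mp` on a toy dictionary. The dimensions below (64 features and 64 atoms, mirroring a redundancy factor of one) and the unit-norm initialization are our assumptions for the sketch, not the paper's actual setup:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
n_features, n_atoms = 64, 64                 # redundancy factor 1: #atoms == #features
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)               # K-SVD works with unit-norm atoms
y = rng.standard_normal(n_features)          # one signal to be sparsely coded

k = int(0.20 * n_atoms)                      # 20% sparsity degree, as in the text
code = orthogonal_mp(D, y, n_nonzero_coefs=k)
```

The resulting `code` vector has at most k nonzero entries, and D @ code approximates the signal y.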
The MLP neural network training process was performed using the backpropagation method. The network was optimized by minimizing the Mean Squared Error (MSE) function through the Scaled Conjugate Gradient (SCG) method. Also, the output of each neuron was determined by applying a saturating linear transfer function. Additionally, the structure of the MLP neural network was configured so that the sizes of its hidden and input layers are equal.
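As a rough illustration of backpropagation minimizing the MSE for a one-hidden-layer MLP, the sketch below performs a single gradient step. It uses plain gradient descent and tanh hidden units as stand-ins for the SCG optimizer and the saturating linear transfer function used in the paper (shown here only for reference); constant factors in the gradients are absorbed into the learning rate:

```python
import numpy as np

def satlin(z):
    """Saturating linear transfer function: linear on [0, 1], clipped outside."""
    return np.clip(z, 0.0, 1.0)

def backprop_step(x, t, W1, b1, W2, b2, lr=0.05):
    """One gradient-descent step on the squared error between the
    network output and the desired output (label) t."""
    h = np.tanh(W1 @ x + b1)            # hidden layer activations
    y = W2 @ h + b2                     # linear output layer (simplification)
    e = y - t                           # output error
    # gradients propagated backwards through the network
    gW2, gb2 = np.outer(e, h), e
    dh = (W2.T @ e) * (1.0 - h ** 2)    # tanh derivative
    gW1, gb1 = np.outer(dh, x), dh
    return (W1 - lr * gW1, b1 - lr * gb1,
            W2 - lr * gW2, b2 - lr * gb2, float(np.mean(e ** 2)))
```

Iterating this step over the training samples drives the MSE down, which is the essence of the training procedure described above.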
5 Results and discussion
As already explained above, the sparse representation matrices of the training and validation sets were computed, by means of the OMP algorithm, in terms of each learned dictionary. The feature vectors comprising those matrices were used as inputs for training and testing, respectively, the MLP neural network. The final test was performed by taking into account the standard partition of the testing dataset and each one of the previously learned structured dictionaries. The sparse representation matrix of the testing set was also obtained by means of the OMP algorithm, its feature vectors were fed to the already trained MLP neural networks, and the outputs of these networks were evaluated to compute the final accuracy. Structured dictionaries composed of 50, 100, 150 and 200 discriminant atoms were evaluated. The mean and standard deviation of the classification results over 10 rounds were found to be 94.87% (±0.33%), 94.79% (±0.30%), 94.36% (±0.27%) and 91.25% (±0.63%) for feature vector sizes of 200, 150, 100 and 50, respectively. Table 1 presents a comparative summary of the best recognition rates yielded by MLP neural networks trained using as input the feature matrices obtained with each one of the evaluated structured dictionaries, together with the required number of weights of the MLP neural network in each case. It is important to point out that these results were obtained by considering a fixed hidden layer size coinciding with the input feature vector size. Maximal accuracy rates of 96.2%, 95.9%, 95.0% and 92.2% were obtained for feature vector sizes of 200, 150, 100 and 50, respectively. Hence, the results show that "discriminative" feature vectors of length 200 are the best option for handwritten digit recognition. The last column of Table 1 shows the total number of weights required to train each one of the MLP neural networks.

Dictionary size  Classifier       Acc (%)  Number of weights
50 atoms         MLP-50-50-10     92.23     3,060
100 atoms        MLP-100-100-10   95.03    11,110
150 atoms        MLP-150-150-10   95.90    24,160
200 atoms        MLP-200-200-10   96.20    42,210
Lecun et al. [29] tested several configurations of one-hidden-layer fully connected MLP neural networks trained for handwritten digit recognition. One of them consists of directly using the original (raw) data as input of the classifier, i.e. without taking into account any signal preprocessing or feature selection. Thus, vectors containing 784 features, corresponding to images of size 28 × 28 pixels, were used as inputs of the classifier. The first two rows of Table 2 show the maximal accuracy rates (Acc) yielded by MLP neural networks with 300 (MLP-784-300-10) and 1000 (MLP-784-1000-10) neurons in their hidden layers. The number of training weights for each one of the networks is also included in the last column. Accuracy rates on the standard test partition of 95.3% and 95.5% were yielded by MLP neural networks with 300 and 1000 hidden neurons, respectively. It can be observed that, as a result of increasing the number of hidden neurons (from 300 to 1000), only a slight improvement was achieved, while the number of weights of the network increased from 238,510 to 795,010, an increment of 333%.

Method  Classifier  Acc (%)  Number of weights
Raw data [29]    MLP-784-300-10   95.3  238,510
                 MLP-784-1000-10  95.5  795,010
DAS-KSVD         MLP-200-50-10    95.3   10,560
                 MLP-200-100-10   96.1   21,110
                 MLP-200-200-10   96.2   42,210
                 MLP-200-300-10   96.4   63,310
                 MLP-200-1000-10  96.7  211,010
K-SVD [9]        MLP-200-50-10    93.5   10,560
                 MLP-200-100-10   92.8   21,110
                 MLP-200-200-10   92.3   42,210
                 MLP-200-300-10   92.8   63,310
                 MLP-200-1000-10  92.7  211,010
LC-KSVD2 [15]    MLP-200-50-10    91.8   10,560
                 MLP-200-100-10   91.9   21,110
                 MLP-200-200-10   92.0   42,210
                 MLP-200-300-10   92.1   63,310
                 MLP-200-1000-10  92.3  211,010
Table 2 also shows a comparative summary of the results yielded by MLP neural networks with a reduced feature vector dimension. For that, the proposed DAS-KSVD method was used for obtaining feature vectors of length 200 since, as shown in Table 1, structured dictionaries composed of 200 discriminative atoms are the best option for handwritten digit recognition. Clearly, the use of low dimensional feature vectors produces a significant dimension reduction while retaining discriminative information, and therefore the computing time required for classification is reduced. Thus, the number of input units of the MLP neural network was reduced (from 784 to 200) by 74.49% with respect to the number required by the original raw data. The table shows the accuracy rates, averaged over 10 rounds, yielded by MLP neural networks with 200 input units while varying the number of hidden neurons from 50 to 1000. The last column of this table shows the required number of training weights. Accuracy rates on the standard testing dataset of 96.4% and 96.7% were achieved by MLP neural networks with 300 (MLP-200-300-10) and 1000 (MLP-200-1000-10) hidden neurons, respectively. Additionally, MLP neural networks with 50, 100 and 200 hidden neurons were tested without showing significant improvements in the results.
It is also important to point out that the classifier MLP-200-50-10 (DAS-KSVD method) achieved the same recognition rate (95.3%) as MLP-784-300-10 (raw data) using an MLP neural network with only 4.42% of the required weights. Moreover, the best option using the original raw data as input of the classifier (MLP-784-1000-10) has 795,010 training weights, while the DAS-KSVD method (MLP-200-1000-10) not only requires just 211,010 weights but also improves the performance of the classifier by 1.2%. Summing up, these results show that using the proposed DAS-KSVD method for dimension reduction enhances the recognition rate of MLP neural networks while significantly reducing their size.
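The weight counts in Tables 1 and 2 are consistent with counting, for a one-hidden-layer MLP, one weight per connection plus one bias per neuron. A quick check:

```python
def mlp_weights(n_in, n_hidden, n_out):
    """Trainable parameters of an MLP-n_in-n_hidden-n_out network:
    (inputs + bias) per hidden neuron, plus (hidden + bias) per output neuron."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# Reproduce some entries of Table 2:
assert mlp_weights(200, 50, 10) == 10_560     # MLP-200-50-10
assert mlp_weights(784, 300, 10) == 238_510   # MLP-784-300-10
assert mlp_weights(784, 1000, 10) == 795_010  # MLP-784-1000-10
```

The 4.42% figure quoted above is simply the ratio 10,560 / 238,510.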
We have also compared the performance of the new DAS-KSVD method with the standard K-SVD method as well as with the discrimination-based LC-KSVD2 method. It can be observed from Table 2 that, for the same dictionary size, the proposed DAS-KSVD method outperforms all the others in the recognition of handwritten digit images from the MNIST database, showing its robustness and effectiveness. The maximum recognition rate yielded by the DAS-KSVD method was 96.7%, which clearly outperforms those yielded by both the K-SVD (93.5%) and LC-KSVD2 (92.3%) methods.
We have also evaluated the statistical significance of the results presented in Table 2 by computing the probability that the DAS-KSVD method yields better recognition rates than all the other evaluated methods. In order to perform this test we assumed the statistical independence of the classification errors for each image and approximated the errors' Binomial distribution by a Gaussian distribution. This is justified by the sufficiently large number of testing samples (10,000).
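Under the Gaussian approximation described above, the test reduces to a standard two-proportion comparison. The following sketch illustrates the computation (the function name and the form of the standard error are our choices; the paper's exact test may differ in detail):

```python
import math

def prob_better(acc_a, acc_b, n):
    """Probability that method B's true accuracy exceeds method A's,
    approximating the Binomial error counts by Gaussians (valid for large n)
    and assuming independent classification errors."""
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    z = (acc_b - acc_a) / se
    # standard normal CDF evaluated at z
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Raw data (95.5%) vs. DAS-KSVD (96.7%) on the 10,000 MNIST test samples:
p = prob_better(0.955, 0.967, 10_000)
```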
In this way, for the recognition rates of 95.5% and 96.7%, corresponding to the "Raw data" approach that produced the best performance among all the other methods considered in the experiments and the newly proposed DAS-KSVD method, respectively, the computed probability confirms that the improvement is statistically significant.

6 Conclusions
In this work, both a new discriminative measure and a novel method for learning structured dictionaries for multiclass classification problems were introduced. The new measure is capable of efficiently quantifying the degree of discriminability of each one of the atoms in a particular dictionary. The use of such a measure gave rise to what we called the Discriminant Atom Selection K-SVD (DAS-KSVD) method for dictionary learning. The method was tested on a widely used database for handwritten digit recognition and compared with three state-of-the-art classification methods. Experimental results showed that DAS-KSVD significantly outperforms the other three methods, achieving good recognition rates while reducing the computational cost of the classifier.
Clearly, there is much room for further improvement. In particular, future research lines include the evaluation of our learning method on other well known databases, further analysis of the combined discriminative measure and its properties, and the exploration of new deep structures.
7 Acknowledgments
This work was supported in part by Consejo Nacional de Investigaciones Científicas y Técnicas, CONICET, through PIP 2014-2016 No. 11220130100216CO, by Agencia Nacional de Promoción Científica y Técnica, ANPCyT, under projects PICT 2014-2627, PICT 2015-0977 and PICT 2017-4596, and by Universidad Nacional del Litoral, UNL, through projects CAI+D 500 201501 00059 LI, CAI+D 500 201501 00082 LI and CAI+D 504 201501 00036 LI "Problemas Inversos y Aplicaciones a Procesamiento de Señales e Imágenes".
References
 [1] R. Rolón, L. Larrateguy, L. D. Persia, R. Spies, and H. Rufiner, “Discriminative methods based on sparse representations of pulse oximetry signals for sleep apnea–hypopnea detection,” Biomedical Signal Processing and Control, vol. 33, pp. 358–367, 2017.
 [2] V. Peterson, H. L. Rufiner, and R. D. Spies, “Generalized sparse discriminant analysis for event-related potential classification,” Biomedical Signal Processing and Control, vol. 35, pp. 70–78, 2017.
 [3] L. Li, S. Li, and Y. Fu, “Learning low-rank and discriminative dictionary for image classification,” Image and Vision Computing, vol. 32, no. 10, pp. 814–823, 2014.
 [4] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Discriminative learned dictionaries for local image analysis,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.
 [5] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
 [6] J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, 2008.
 [7] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2691–2698, June 2010.
 [8] M. S. Lewicki and B. A. Olshausen, “Probabilistic framework for the adaptation and comparison of image codes,” Journal of the Optical Society of America A, vol. 16, no. 7, p. 1587, 1999.
 [9] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, pp. 4311–4322, Nov. 2006.
 [10] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
 [11] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, vol. 12, no. 2, pp. 337–365, 2000.
 [12] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 2443–2446, 1999.
 [13] J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, pp. 4655–4666, Dec. 2007.
 [14] D. S. Pham and S. Venkatesh, “Joint learning and dictionary construction for pattern recognition,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.
 [15] Z. Jiang, Z. Lin, and L. Davis, “Label Consistent K-SVD: Learning a discriminative dictionary for recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 2651–2664, Nov. 2013.
 [16] W. Zhang, A. Surve, X. Fern, and T. Dietterich, “Learning non-redundant codebooks for classifying complex objects,” pp. 1–8, ACM Press, 2009.
 [17] L. Yang, R. Jin, R. Sukthankar, and F. Jurie, “Unifying discriminative visual codebook generation with classifier training for object category recognition,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.
 [18] Y. Sun, Y. Quan, and J. Fu, “Sparse coding and dictionary learning with class-specific group sparsity,” Neural Computing and Applications, vol. 30, pp. 1265–1275, Aug. 2018.
 [19] M. Elad, Sparse and redundant representations. Springer-Verlag New York, 2010.
 [20] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
 [21] R. R. Coifman, Y. Meyer, S. Quake, and M. V. Wickerhauser, “Signal processing and compression with wavelet packets,” in Wavelets and Their Applications, pp. 363–379, Springer, Dordrecht, 1994.
 [22] N. Rao and F. Porikli, “A clustering approach to optimize online dictionary learning,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1293–1296, 2012.
 [23] X. Chen, J. Li, D. Zou, and Q. Zhao, “Learn sparse dictionaries for edit propagation,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1688–1698, 2016.
 [24] D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.

 [25] H. Jeffreys, “An invariant form for the prior probability in estimation problems,” Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 186, no. 1007, pp. 453–461, 1946.
 [26] Y. Lecun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” in International Conference on Artificial Neural Networks, Paris, 1995.
 [27] S. Kim, Z. Yu, R. M. Kil, and M. Lee, “Deep learning of support vector machines with class probability output networks,” Neural Networks, vol. 64, pp. 19–28, 2015.
 [28] P. de Chazal, J. Tapson, and A. van Schaik, “A comparison of extreme learning machines and backpropagation trained feedforward networks processing the MNIST database,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2165–2168, 2015.
 [29] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [30] W. S. Russell, “Polynomial interpolation schemes for internal derivative distributions on structured grids,” Applied Numerical Mathematics, vol. 17, no. 2, pp. 129–171, 1995.
 [31] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems 19, pp. 801–808, MIT Press, 2007.
 [32] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010.
 [33] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3360–3367, June 2010.
 [34] K. Huang and S. Aviyente, “Sparse representation for signal classification,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, (Cambridge, MA, USA), pp. 609–616, MIT Press, 2006.
 [35] S. Haykin, Neural networks: A comprehensive foundation. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2nd ed., 1998.
 [36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
Appendix A Grid search
The grid search method starts by dividing the unit interval into segments of a given length and generating the different combinations of the two parameters whose sum does not exceed one. This constraint implies that the boundary of the search space is a right triangle whose vertices correspond to the extreme pairs of parameters. Figure 5 shows an example of the grid search method for three different segment lengths. It can be observed that small segment lengths entail evaluating a large number of combinations.
In order to reduce the computational cost, we have performed the grid search for the optimal pair of parameters in two stages. The first stage uses a coarse segment length in order to locate potential “regions” of the search space where recognition rates are maximized. The second stage takes those regions into account and performs a more refined search with a smaller segment length. Each new refined search region is established by considering all pairs of parameters lying inside a closed disc of a prescribed radius centered at a candidate pair found in the first stage (see Figure 6).
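The two-stage search can be sketched as follows. The parameter names a and b, the step sizes and the radius below are placeholders for illustration, not the values actually used in the paper:

```python
import numpy as np

def constrained_grid(step):
    """Stage 1: all pairs (a, b) on a grid of spacing `step` with a + b <= 1,
    i.e. the right triangle described above."""
    ticks = np.arange(0.0, 1.0 + 1e-9, step)
    return [(a, b) for a in ticks for b in ticks if a + b <= 1.0 + 1e-9]

def refined_region(center, step, radius):
    """Stage 2: keep only the finer-grid pairs inside a closed disc
    of the given radius centered at a stage-1 candidate."""
    ca, cb = center
    return [(a, b) for (a, b) in constrained_grid(step)
            if (a - ca) ** 2 + (b - cb) ** 2 <= radius ** 2 + 1e-12]
```

Each candidate pair is then evaluated by running the full pipeline (dictionary learning, sparse coding and MLP classification) and keeping the pair with the best validation accuracy.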
The most discriminative atoms according to the combined measure were selected and used for building the structured dictionaries. As mentioned above, the problem of finding the optimal pair of parameters was solved by applying the grid search method. This search was initially carried out by taking into account an interval length of 1/6, which leads to 28 different pairs of parameters. Figure 7 shows a summary of the results obtained by applying the grid search method for each one of the four evaluated dictionaries. In particular, we have found that when using structured dictionaries comprising more than 5 class-related discriminative atoms, the MLP neural networks achieved good recognition rates. This figure also shows, for each one of the evaluated dictionaries, two highlighted regions where the recognition rate is maximized. The locations of these regions suggest that simultaneous values of both parameters close to zero allow selecting the most discriminative atoms. In the case of a structured dictionary comprising 5 discriminative atoms for each one of the classes, the centers of the two search regions were found to differ from those obtained for the larger dictionaries.
We also analyzed the overall performance (taken over 10 rounds) of the classifier for each one of the evaluated dictionaries. As can be seen in Table 3, the 200-atom dictionary outperforms all the others, yielding the maximum (Max) recognition rate. It can also be seen that small structured dictionary sizes entail low classification rates. This may be due to the fact that low dimensional sparse vectors are not capable of capturing relevant information for signal classification. Conversely, if the dimension of such vectors increases (from 100 to 200), significant improvements are observed.
Dictionary size  Classifier       Acc (%)       Max (%)
50 atoms         MLP-50-50-10     91.25 (0.67)  92.23
100 atoms        MLP-100-100-10   94.36 (0.27)  95.03
150 atoms        MLP-150-150-10   94.74 (0.27)  95.90
200 atoms        MLP-200-200-10   94.87 (0.33)  96.20
The second-stage grid search was then applied to each one of the tested structured dictionaries. Results showed that, in this case, no improvements in the recognition rates were found; thus, the optimal pairs of parameters are the ones found in the first stage. Figure 8 shows the results obtained by applying the refined grid search to the two regions corresponding to the 200-atom structured dictionary. It can be clearly seen that the resulting parameter values suggest that the most discriminative atoms of a particular dictionary are not only those most frequently used for signal representation, but also the ones that minimize the total signal representation error. This implies that, by using only the third term of the proposed combined measure, we ensure finding the most discriminative atoms of a given dictionary.