In many areas of science and engineering, researchers deal with signals that are inherently sparse with respect to a certain dictionary (also called a basis or transform). The seminal paper by neuroscientists Olshausen and Field [olshausen1997sparse] points out that receptive fields in the human visual cortex use sparse coding to extract meaningful information from images. In the signal processing domain, the emerging field of Compressed Sensing (CS) [candes2006robust] relies on the key assumption that the signal is sparse under some orthogonal transformation, such as the Fourier transform.
Traditionally, dictionaries are designed for desired properties in the spatial domain, the frequency domain, or both. Recently, a different methodology that learns the dictionary from data has been explored, which can better capture data characteristics. There are two different directions for designing such a signal-dependent dictionary:
(i) Using data directly as the dictionary: Wright et al. [wright2009robust] proposed a sparse representation-based classifier (SRC) that concatenates the training data from different classes into a single dictionary and uses the class-specific residue for face recognition. Besides supervised tasks, a data dictionary is also used to cluster high-dimensional data by finding intrinsic low-dimensional structures with respect to the data itself [elhamifar2012sparse].
(ii) Training a dictionary using data: Aharon et al. [aharon2006img] proposed an algorithm called K-SVD that guarantees all training data to be sparsely represented by the learned dictionary and demonstrated its advantages in image processing tasks. Yu et al. [yu2009nonlinear] justified that encoding data with dictionary atoms in its neighborhood can guarantee a nonlinear function of the data to be well approximated by a linear function.
In contrast to the former approach, the latter removes redundant information during the learning process, so the size of the dictionary does not grow with the size of the data. In this paper, we focus on the latter approach. Moreover, we assume that the data has been properly aligned, although data alignment [shekhar2013generalized, qiu2013learning] is another active research area of growing interest.
I-A Dictionary Learning for Reconstruction
Dictionary learning (DL) was first attempted for the purpose of reconstruction. The learning process can be described by the following optimization problem:
Given training data (), the dictionary and corresponding sparse coefficients are both learned. Each column of and are denoted as () and (), respectively. The dictionary size is typically larger than signal dimension . The parameter balances the trade-off between data fidelity and the sparsity regularization via the -norm.
This non-convex optimization problem is usually solved by iterating between sparse coding and dictionary updating. In the sparse coding stage, the sparse coefficients are found with respect to a fixed dictionary. This can be carried out by greedy pursuit enforcing constraints on the $\ell_0$-norm [aharon2006img], convex optimization targeting the $\ell_1$-norm [mairal2009online, ramirez2012mdl], norm minimization with a locality constraint [yu2009nonlinear], optimizing structured sparsity [jenatton2010proximal, zelnik2012dictionary] or Bayesian methods [zhou2012nonparametric]. In the dictionary updating stage, each dictionary atom is updated using only the data whose sparse coefficients are non-zero at that atom's index. This sub-problem can be solved by either block coordinate descent [mairal2009online] or singular value decomposition [aharon2006img]. Desirable features, such as multi-resolution [mairal2007learning] and transformation invariance [kavukcuoglu2009learning], could also be integrated to further improve performance in specific applications. Note that all dictionary atoms should have unit $\ell_2$-norm to rule out the scenario in which the dictionary atoms have arbitrarily large norms while the sparse codes have correspondingly small values.
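The alternation between the two stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the algorithm of any cited work: the sparse coding stage uses plain ISTA on an $\ell_1$-regularized least-squares problem, the dictionary stage refits each atom from only the data that use it and renormalizes it to unit $\ell_2$-norm, and all names (`Y`, `D`, `X`, `lam`, `sparse_code_ista`, `update_dictionary`, `learn_dictionary`) are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(D, Y, lam, n_iter=100):
    """Approximately solve min_X 0.5*||Y - D X||_F^2 + lam*||X||_1 for fixed D."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ X - Y)
        X = soft_threshold(X - step * grad, step * lam)
    return X

def update_dictionary(D, Y, X):
    """Refit each atom for fixed X, using only the data with a non-zero code on it."""
    D = D.copy()
    for j in range(D.shape[1]):
        used = np.flatnonzero(X[j])
        if used.size == 0:
            continue  # atom j is unused in this round; leave it unchanged
        # Residual of the data that use atom j, with atom j's contribution added back.
        R = Y[:, used] - D @ X[:, used] + np.outer(D[:, j], X[j, used])
        u = R @ X[j, used]
        norm = np.linalg.norm(u)
        if norm > 0:
            D[:, j] = u / norm  # keep the atom at unit l2-norm
    return D

def learn_dictionary(Y, n_atoms, lam=0.1, n_outer=10, seed=0):
    """Alternate sparse coding and dictionary updating (illustrative only)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(Y.shape[1], size=n_atoms, replace=False)
    D = Y[:, idx] / np.linalg.norm(Y[:, idx], axis=0, keepdims=True)
    for _ in range(n_outer):
        X = sparse_code_ista(D, Y, lam)
        D = update_dictionary(D, Y, X)
    return D, sparse_code_ista(D, Y, lam)
```

Without the renormalization in `update_dictionary`, the objective could be lowered trivially by scaling the atoms up and the codes down, which is the degenerate scenario the unit-norm constraint rules out.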
I-B Dictionary Learning for Classification
Notice that sparse coefficients could also be interpreted as features, therefore it is natural to explore the benefits of using DL for classification. A general framework for this purpose is illustrated in Fig 1. The low dimensional signal is mapped to its high dimensional feature (sparse coefficient) using a learned dictionary , which could make the hidden patterns more prominent and easier to capture. A classifier
is then utilized to predict the label vector. The key here is to design and with discriminative properties by adding extra constraints and . Now the optimization problem becomes:
The function could be a logistic function [mairal2012task], a linear classifier [rodriguez2008sparse, zhang2010discriminative], a label consistency term [jiang2011learning, zhang2013online], a low rank constraint [zhang2013learning] or Fisher discrimination criterion [yang2011fisher]. An example of is to force the sub-dictionaries for different classes to be as incoherent as possible [ramirez2010classification]. The label can be assigned using class-specific residue [ramirez2010classification] or linear classification [jiang2011learning]. Most aforementioned methods embed the label information into the DL problem explicitly, which could complicate the optimization procedure [yang2011fisher].
I-C Our Contributions and Paper Structure
Most methods mentioned in Section I.B simply add extra classification constraints on top of the DL formulation for reconstruction. In contrast to these approaches, we focus on improving the intrinsic discriminative properties of the dictionary by introducing a structured dictionary learning framework (StructDL) that incorporates structured sparsity on different levels. Our specific contributions are listed below. (Footnote 1: A preliminary version of this work will be presented at the IEEE International Conference on Image Processing, 2014 [suo2014gddl].)
In contrast to approaches that add extra constraints [zhang2010discriminative, jiang2011learning], our formulation does not increase the size of the problem because the regularization is enforced implicitly. Unlike approaches that use group sparsity [Yu-Tseh2013], structured low rank [zhang2013learning] or hierarchical tree sparsity constraints [jenatton2010proximal] in DL, we propose to use hierarchical group sparsity, which extends naturally to its multi-task variation, the group structured dirty model, for regularization. More importantly, the latter can uniquely incorporate sparsity, group structure and locality in a single formulation, all of which are desired features for an ideal dictionary to be used in classification.
We show theoretically that our approach has the advantage of a perfect block structure for classification at the cost of a stricter condition. We also point out that the condition is more likely to be satisfied when the dictionary size is smaller, thus making our method more favorable than $\ell_1$-norm based DL.
We employ both synthetic and real-world datasets to illustrate the superior performance of the proposed StructDL framework. Meanwhile, we also point out scenarios where limitations still exist.
The paper is organized as follows. In Section II, we present the structured dictionary learning framework for classification (StructDL), including its single-task and multi-task versions. In Section III, we derive conditions that guarantee its classification performance using a noiseless model. In Section IV, extensive experiments are performed with synthetic and real datasets to compare StructDL with other state-of-the-art methods. We end the paper with a conclusion and a discussion of future work in Section V.
In this section, we introduce notations that will be used throughout the article. We use bold lower-case letters such as to represent vectors, bold upper-case letters such as to represent matrices, and bold lower-case letter with subscript such as to represent columns of a matrix. The dimensions of vectors and matrices are often clear from the context. For any vector , we use to denote its -norm . A group is a subset of indices in . A group structure denotes a pre-defined set of non-overlapping groups. We use , , and to denote spectral norm, trace, rank of the matrix and dimension of the subspace, respectively.
II Structured Dictionary Learning for Classification
II-A Motivation from a Coding Perspective
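The coding stage in the DL process typically adopts the $\ell_0$- or $\ell_1$-norm to encourage sparsity (the latter is also referred to as Lasso [tibshirani1996regression]). Its formulation is
$$\min_{\mathbf{x}} \; \frac{1}{2}\|\mathbf{y}-\mathbf{D}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1 .$$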
The corresponding prior distribution for Lasso is a multivariate Laplacian distribution with the independence assumption, thus the chosen support could fall anywhere.
Since sparsity alone could not regulate the support location, locality-constrained linear coding (LLC) [wang2010locality] is proposed to enforce locality instead of sparsity. The objective function of LLC is defined as:
where denotes the element-wise multiplication, and is a weight vector indicating the similarity between signal and dictionary atoms. By controlling the size of the neighborhood, locality constraint could lead to sparsity as well. Conceptually, LLC endorses the local structure in the dictionary but loses the global perspective. For instance, the data lying on the class boundary could be coded with dictionary atoms from either side or both sides, creating ambiguity for classification tasks.
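To promote both sparsity and group structure, Hierarchical Lasso (HiLasso) [sprechmann2011c] was proposed as:
$$\min_{\mathbf{x}} \; \frac{1}{2}\|\mathbf{y}-\mathbf{D}\mathbf{x}\|_2^2 + \lambda_1 \sum_{g \in \mathcal{G}} \|\mathbf{x}_g\|_2 + \lambda_2 \|\mathbf{x}\|_1 , \qquad \text{(II.3)}$$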
where is a predefined group structure, and is the sub-vector extracted from using the indices in group . The group structure of HiLasso naturally yields locality because it reflects the clustering of dictionary atoms. It is also relevant for classification tasks, since this grouping of dictionary atoms naturally reflects their labels. To be more specific, the dictionary is the concatenation of sub-dictionaries belonging to different classes, where is the total number of classes and has size . In contrast to LLC, HiLasso captures the global information embedded in the group structure.
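To make the effect of the group term concrete, the following sketch evaluates the HiLasso regularizer on two codes with the same $\ell_1$-norm. It assumes the standard HiLasso penalty (a sum of group $\ell_2$-norms plus an $\ell_1$ term); the function name `hilasso_penalty`, the weights `lam_group` and `lam_l1`, and the toy group layout are hypothetical choices for illustration.

```python
import numpy as np

def hilasso_penalty(x, groups, lam_group, lam_l1):
    """Assumed HiLasso regularizer: lam_group * sum_g ||x_g||_2 + lam_l1 * ||x||_1."""
    group_term = sum(np.linalg.norm(x[g]) for g in groups)
    return lam_group * group_term + lam_l1 * np.sum(np.abs(x))

# Two classes with three atoms each (non-overlapping groups).
groups = [np.arange(0, 3), np.arange(3, 6)]

# Same number of active atoms and same l1-norm, but different group usage.
x_one_class = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
x_two_classes = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])

p_one = hilasso_penalty(x_one_class, groups, lam_group=1.0, lam_l1=1.0)
p_two = hilasso_penalty(x_two_classes, groups, lam_group=1.0, lam_l1=1.0)
# p_one < p_two: the group term favors codes confined to a single sub-dictionary.
```

A plain $\ell_1$ penalty cannot distinguish these two codes, whereas the group term makes the single-class code cheaper. This preference is what ties the support to a class label.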
In the multi-task setup, different tasks could share same sets of dictionary atoms, which leads to a variant of HiLasso, called Collaborative HiLasso (C-HiLasso) [sprechmann2011c]. C-HiLasso captures the correlation on the group level, but it does not reveal explicitly if any dictionary atoms are shared by all tasks (within-class similarity) or uniquely utilized by individual task (within-class variation). The within-class variation generally makes the data clusters less compact and harder to classify, therefore it will be beneficial to separate it from the within-class similarity component to better capture the core essence of the data for discriminative applications. A mixture of coefficients model is proposed to carry out this decomposition, which is termed the Dirty Model [jalali2010dirty]:
where denotes the Frobenius norm, -norm encourages the block sparsity and -norm promotes sparsity. The Dirty model addresses the drawback of C-HiLasso because points out dictionary atoms that are shared across all tasks (similarity) and captures those that are uniquely utilized by individual task (difference). However, it assumes no label differences between dictionary atoms, thus it lacks the group information that indicates sub-dictionaries for different classes.
In summary, there are three key factors one could consider when designing DL methods for classification: sparsity, group structure and if possible, within-group similarity. Sparsity makes it easier to interpret the data and brings in the possibility of identifying the difference in a high-dimensional feature space. Group structure naturally coincides with the label information in the classification problem. It enforces the labels implicitly, thus will not increase the size of the problem. Within-group similarity can be used to further refine the group structure by finding a smaller set of dictionary atoms in each group that can resemble all the data in each class.
Inspired by this observation, we propose the structured dictionary learning framework, StructDL, with a single-task version, Hierarchical Dictionary Learning (HiDL), and a multi-task version, Group Structured Dirty Dictionary Learning (GDDL), as shown in Fig 2. Unlike sparsity- or locality-driven DL approaches, HiDL strictly enforces the group boundary between different classes and thus works better when the data is close to the class boundary. As an extension of HiDL to the multi-task scenario, GDDL combines the group structure with the Dirty Model so that we can find the shared atoms in each class. This could further strengthen the locality within each group, since the shared dictionary atoms will be more compact in a small neighborhood, as in Fig 2(d). Note that the constraint functions mentioned in Section I.B could also be merged into the StructDL framework. However, we adhere to a simple formulation to better understand the principles that matter in the following sections.
II-B Hierarchical Dictionary Learning (HiDL)
When training data has large within-class variability, it makes more sense to use sparse coding in a single-task setup than to leverage correlation in multi-task coding. A properly structured mapping enforced by HiLasso (II.3) in the DL process can guarantee that dictionary atoms are only updated by training data from the same class. This implicit label consistency between dictionary atoms and data cannot be enforced by either Lasso or LLC. Thus, we propose the single-task version of StructDL, Hierarchical Dictionary Learning (HiDL), whose objective function is
essentially incorporating HiLasso into DL process. Similar to other DL methods, HiDL iterates between sparse coding and dictionary update. For the sparse coding stage, we are solving HiLasso problem with a well-defined group structure. Convex optimization based approaches [sprechmann2011c, bach2012structured] or Bayesian approach using structured Spike and Slab prior [suo2013hierarchical] can be adopted for this purpose.
For the dictionary update stage, we adopt the method of block coordinate descent with a warm start to update one dictionary atom at a time [mairal2009online]. Furthermore, we will show in Section III that under certain conditions this approach forces the dictionary atoms to be updated in the same subspace. Using the facts that and trace is invariant under cyclic permutations, the objective function of the dictionary update step can be changed to:
Taking the derivative and setting it to zero, we obtain the dictionary update procedure as follows:
where is the value of at coordinate with and being the -th atom at -th and -th iterations, respectively. According to (II.10), dictionary atoms always have unit norm.
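A block coordinate descent update of this kind can be sketched as follows. The sketch follows the general pattern of the atom-wise update in [mairal2009online], built on the accumulated matrices `A = X X^T` and `B = Y X^T`, with one assumed difference: each atom is rescaled to exactly unit $\ell_2$-norm, in line with the statement that atoms always have unit norm, whereas the projection in [mairal2009online] divides by the larger of the atom's norm and one. The function and variable names are hypothetical.

```python
import numpy as np

def bcd_dictionary_update(D, Y, X, n_sweeps=1, eps=1e-12):
    """Update one atom at a time for fixed sparse codes X, warm-started from D."""
    A = X @ X.T          # (n_atoms x n_atoms) code correlations
    B = Y @ X.T          # (dim x n_atoms) data/code correlations
    D = D.copy()         # warm start: begin from the previous dictionary
    for _ in range(n_sweeps):
        for j in range(D.shape[1]):
            if A[j, j] < eps:
                continue  # atom j is not used by any training sample
            # Uses the current D, so atoms already updated in this sweep count.
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            norm = np.linalg.norm(u)
            if norm > eps:
                D[:, j] = u / norm  # assumed normalization to unit l2-norm
    return D
```

Because `A[j, j]` is non-zero only when some training sample has a non-zero code on atom `j`, an atom is moved only by the data that actually select it. Under a HiLasso-type coding step this restricts each atom to data from its own class.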
Putting together the sparse coding and dictionary update processes, we complete the algorithm for StructDL as presented in Algorithm 1. The dictionary is initialized with random sampling of training data and the motivation will be explained in Section III from a theoretical standpoint.
II-C Group Structured Dirty Dictionary Learning (GDDL)
HiDL assumes that different tasks select dictionary atoms independently, so the sparse coding step for each task is carried out separately. In some applications, training data in each class is tightly clustered, indicating a large within-class similarity. For instance, in face recognition, pictures of the same person taken under different illumination conditions can still be visually identified as belonging to the same class. Such correlation among training data with the same label is not properly captured by HiDL. Therefore, we propose a multi-task extension of HiDL, Group Structured Dirty Dictionary Learning (GDDL), as below:
where is all training data from -th class, while and are the sub-matrices in and consisting of columns for class , respectively. Furthermore, and are the sub-matrices by extracting rows with indices in group from and , respectively. The first three terms impose the Dirty Model with -norm and -norm for promoting row sparsity and sparsity, respectively. Since the dictionary contains sub-dictionaries from all classes, extra constraints are needed to guarantee the active rows from and active indices from fall into the same group, respectively. Inspired by C-HiLasso, we use the collaborative Group Lasso regularizers and to force the group boundary.
The underlying model of GDDL can be interpreted as a generalization of C-HiLasso and the Dirty Model. When different tasks do not have to share atoms, the sparse coding step of (II.11) turns into
which is exactly C-HiLasso enforcing both group sparsity and within-group sparsity. When there is no label difference between dictionary atoms (no group structure), the sparse coding step of (II.11) becomes
which is the Dirty Model with decomposition of row sparsity and sparsity terms.
Nevertheless, there are two key differences between GDDL and the Dirty Model. First, GDDL extends the Dirty Model by adding another layer of group sparsity, as illustrated in Fig 3. Unlike the Dirty Model, GDDL forces all active supports to stay within the same group, corresponding to the desired class. Within the group, the sparse codes are further decomposed into two parts, one with supports shared across tasks and one with unique supports associated with individual tasks. The shared dictionary atoms capture the similarity among tasks. Second, the Dirty Model is designed from a reconstruction perspective, while GDDL brings in the group structure for labeling purposes and is thus geared towards classification. In short, GDDL uniquely combines sparsity, group structure and within-group similarity (or locality) in a single formulation.
Optimization Approach: The sparse coding step of GDDL, the Group Structured Dirty Model problem, can be reformulated as follows:
with the re-scaled regularization parameters (which will not affect the results). We choose the alternating direction method of multipliers (ADMM) as the optimization approach because of its simplicity, efficiency and robustness [boyd2011distributed, yang2011alternating]. By introducing two auxiliary variables and , this problem can be reformulated as:
Therefore, the augmented Lagrangian function with respect to , , , and can be formed as:
where , , are the Lagrangian multipliers for equality constraints and is a penalty parameter. The augmented Lagrangian function (II-C) can be minimized over , , , and iteratively by fixing one variable at a time and updating the others. The entire algorithm is summarized in Algorithm 2, where we let , , . And , and are the submatrices with columns corresponding to -th class in , and , respectively.
The key steps in Algorithm 2 are Step 4 and 6. Because Group Structured Dirty Model could be regarded as an extension of C-HiLasso as pointed out by (II.12), in Step 6 can be solved using the same operator for C-HiLasso ((III.14), [sprechmann2011c]), which is derived using SpaRSA framework [wright2009sparse]. Although similar procedure can be carried out for Step 4 using the same framework, we follow a more straightforward approach to derive the corresponding operator.
As pointed out in [bach2011optimization], the proximal operator associated with a composite norm in hierarchical sparse coding can be obtained by composing the individual proximal operators, as long as the sparsity structures follow the right order. This order is termed a total order relationship or tree-structured set of groups (Definition 1, [jenatton2010proximal]), which requires that any two groups are either disjoint or one is included in the other. In our case, the Group Structured Dirty Model combines group sparsity with row sparsity for the shared component, and group sparsity with element-wise sparsity for the unique component. Both cases satisfy the total order relationship because each individual index or individual row is included in a group, as shown in Fig 3(b). Once the total order relationship is established, the proximal operator for the composite norm can be constructed by applying the proximal operators for the smaller groups first, followed by the ones for the larger groups. Therefore, the corresponding operators for Steps 4 and 6 in Algorithm 2 can be derived as below:
where and are the proximal operators for group sparsity, whereas and promotes the selection of only a few non-zero rows and elements, respectively. So for Step 4 can be readily computed by applying first the proximal operator associated with the -norm (row-wise soft-thresholding) and then the one associated with group sparsity . Similarly, the C-HiLasso operator for Step 6 is just applying the element-wise soft-thresholding and then the group thresholding, which is same as in [sprechmann2011c]. Here, we have .
Inside each group, the proximal operator that encourages row sparsity is:
where is defined as -th row of and . So it will zero out rows with -norms below the threshold . The proximal operator for component-wise sparsity is:
where is the value of at the coordinate . Finally, the proximal operator for group sparsity is:
Here the operator acts on the sub-matrix formed by the rows indexed by each group, and it has the effect of zeroing or keeping all coefficients in the same group together. Note that since GDDL separates the sparse code into a shared part and a unique part, we observe, although rarely, that the group that wins the selection in the shared part differs from the one selected in the unique part. To avoid this scenario, we enforce the same group selection by always using the group selected by the row-sparsity term, because it is a stronger constraint than element-wise sparsity.
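The composition of proximal operators described in this subsection can be sketched as follows: the finer structure (rows or single entries) is thresholded first, then the coarser group structure. The sketch assumes that row sparsity is measured by the $\ell_2$-norm of each row and group sparsity by the Frobenius norm of each row block; the thresholds `t_inner` and `t_group` and all function names are hypothetical.

```python
import numpy as np

def prox_elementwise(Z, t):
    """Element-wise soft-thresholding (promotes individual zeros)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def prox_rows(Z, t):
    """Row-wise soft-thresholding: zero out rows whose l2-norm is below t."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return Z * scale

def prox_groups(Z, groups, t):
    """Group thresholding: keep or zero all rows of a group together."""
    out = np.zeros_like(Z)
    for g in groups:
        norm = np.linalg.norm(Z[g])  # Frobenius norm of the row block
        if norm > t:
            out[g] = Z[g] * (1.0 - t / norm)
    return out

def prox_composite(Z, groups, t_inner, t_group, inner):
    """Apply the finer operator first, then the group operator."""
    return prox_groups(inner(Z, t_inner), groups, t_group)

groups = [[0, 1], [2, 3]]  # two classes, two atoms each
A = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 0.0], [6.0, 8.0]])
shared = prox_composite(A, groups, t_inner=1.0, t_group=5.0, inner=prox_rows)

B = np.array([[3.0, -0.5], [0.0, 0.0], [0.2, 0.0], [0.0, 0.0]])
unique = prox_composite(B, groups, t_inner=1.0, t_group=1.0, inner=prox_elementwise)
```

In this toy input the first group of `A` survives row thresholding but is removed as a whole by the group step, so only the second group remains active. In `B`, the weak entry in the second group is removed by element-wise thresholding, so only the first group survives.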
II-D Classification Approach
For classification, we choose a linear classifier for its simplicity and for fair comparison with the results of other techniques, although more advanced classification techniques (e.g., SRC) could potentially lead to better performance. The linear classifier is found by:
where is the learned sparse codes for training data from either HiDL or GDDL. The matrix provides the label information for training data. If training data belongs to the -th class, then is one and all other elements in the same column are zero. The parameter controls the trade-off between the classification accuracy and the smoothness of the classifier. If the sparse coefficient has block diagonal structure, so does the linear classifier . Thus, the non-zero sparse coefficients on undesired support could be zeroed out by the classifier. We will further explore the condition for to have the block diagonal structure in Section III. For each test data , we find its sparse code by solving HiLasso or Group Structured Dirty Model problem with the learned dictionary , then apply the classifier to get the label vector . The test data is then assigned to the class .
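The classification step can be sketched as a ridge regression from sparse codes to one-hot label vectors, followed by an arg-max over the predicted label vector. This assumes the classifier objective has the usual form $\|\mathbf{H}-\mathbf{W}\mathbf{X}\|_F^2+\lambda\|\mathbf{W}\|_F^2$, which admits the closed form $\mathbf{W}=\mathbf{H}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I})^{-1}$; the symbols and function names are hypothetical.

```python
import numpy as np

def train_linear_classifier(X, labels, n_classes, lam=1e-2):
    """Ridge-regression classifier on sparse codes X (n_atoms x n_samples)."""
    n_samples = X.shape[1]
    H = np.zeros((n_classes, n_samples))
    H[labels, np.arange(n_samples)] = 1.0       # one-hot label matrix
    G = X @ X.T + lam * np.eye(X.shape[0])      # symmetric positive definite
    # W = H X^T G^{-1}; solve the linear system instead of inverting G.
    return np.linalg.solve(G, X @ H.T).T

def predict(W, X):
    """Assign each column of X to the class with the largest label score."""
    return np.argmax(W @ X, axis=0)

# Toy block-structured codes: class 0 uses atoms 0-1, class 1 uses atoms 2-3.
X_train = np.array([[1.0, 0.5, 0.0, 0.0],
                    [0.5, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.2],
                    [0.0, 0.0, 0.3, 1.0]])
labels = np.array([0, 0, 1, 1])
W = train_linear_classifier(X_train, labels, n_classes=2)
pred = predict(W, X_train)
```

When the codes are block structured as in this toy example, `X @ X.T` is block diagonal and so is the resulting `W`. This matches the observation that a block-structured sparse code yields a block-structured classifier.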
For GDDL, we only use the shared sparse coefficients to train the classifier. This makes the sparse coefficients more discriminative because they are mapped to the dictionary atoms near the center of the cluster, which increases the between-class distance among the sparse codes of different classes. In the subsequent classification step, we likewise feed only the shared sparse code into the classifier.
III Theoretical Analysis
In this section, we focus on HiDL and present theoretical guarantees to justify the benefits and trade-offs of using structured sparsity in DL for classification. Currently, most theoretical analyses of DL focus on the properties of the learned dictionary from a reconstruction perspective. It has been shown that, given enough training data that is noiseless or contaminated by small Gaussian noise, using $\ell_0$- or $\ell_1$-norm regularization in DL leads to a dictionary that is a local minimum around the ground truth with high probability [spielman2012exact, jenatton2012local, schnass2013identificability]. However, little theoretical effort has been devoted to analyzing the discriminative power of the learned dictionary, which we explore in this section.
The DL problem is non-convex, making the direct analysis of its solution not trivial. Inspired by the connection between K-SVD and K-means, we interpret the sparse coding stage as analogous to sparse subspace clustering (SSC)[elhamifar2012sparse], and the dictionary learning step is essentially a way of learning the basis for different subspaces. However, there are two key differences between HiDL and SSC.
(i) HiDL is proposed for classification whereas SSC is developed for clustering, so the first difference is the availability of the group structure (label) information. In HiDL, different groups correspond to different subspaces (labels). This in turn leads to the enforcement of group-structured sparsity rather than the $\ell_1$-norm, which is later shown to make the condition for perfect sparse decomposition stricter. However, this price is paid to make the sparse code more discriminative by guaranteeing a perfect block structure that separates different classes;
(ii) To represent the subspaces, HiDL uses learned dictionary atoms while SSC uses the data directly. Therefore, the success of SSC depends only on successful recovery in the sparse coding step, since the subspace representation (data) is fixed. For HiDL, in contrast, dictionary atoms are updated in every iteration, so we also need to demonstrate that the dictionary update will not jeopardize the representation of the subspaces. This motivates us to take an inductive approach to the analysis.
In this section, we assume that the sparse decomposition is exact so all training data have a perfect decomposition . Scalings of and do not affect the optimal solution, so we replace them by a single parameter . Now the sparse coding step of HiDL could be re-written as:
Then, we borrow the concepts of independent and disjoint subspaces from SSC framework [elhamifar2012sparse] as below.
Definition 1: Given a collection of subspaces . If , then is independent where denotes the direct sum operator. If every pair of subspaces intersect only at the origin, then is disjoint.
The index of subspaces () is purposely chosen to be same as the class labels to emphasize the correspondence between sub-dictionary and subspace (class label). To characterize two disjoint subspaces, [elhamifar2012sparse] also defined an important notion: the smallest principal angle.
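Definition 2: The smallest principal angle $\theta_{ij}$ between two disjoint subspaces $\mathcal{S}_i$ and $\mathcal{S}_j$ is defined by:
$$\cos(\theta_{ij}) = \max_{\mathbf{u}\in\mathcal{S}_i,\; \mathbf{v}\in\mathcal{S}_j} \frac{\mathbf{u}^T\mathbf{v}}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2} .$$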
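Both notions can be checked numerically. The sketch below assumes each subspace is given by a basis matrix with full column rank. It tests independence by comparing the rank of the stacked bases with the sum of the individual dimensions, and it computes the smallest principal angle from the largest singular value of the product of two orthonormal bases, which equals the maximum cosine between unit vectors of the two subspaces. The function names are hypothetical.

```python
import numpy as np

def smallest_principal_angle(U, V):
    """Smallest principal angle (radians) between span(U) and span(V)."""
    Qu, _ = np.linalg.qr(U)  # orthonormal basis of span(U)
    Qv, _ = np.linalg.qr(V)  # orthonormal basis of span(V)
    s = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.arccos(np.clip(s[0], -1.0, 1.0))  # largest cosine -> smallest angle

def are_independent(bases):
    """True if the dimension of the sum equals the sum of the dimensions."""
    total_rank = np.linalg.matrix_rank(np.hstack(bases))
    return total_rank == sum(np.linalg.matrix_rank(B) for B in bases)

# Three lines through the origin in the plane: pairwise disjoint,
# but not independent because their dimensions sum to 3 > 2.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
diag = np.array([[1.0], [1.0]])

angle = smallest_principal_angle(e1, diag)
independent_pair = are_independent([e1, e2])
independent_triple = are_independent([e1, e2, diag])
```

The three-line example shows why the disjoint case is the weaker assumption: every pair of lines meets only at the origin, yet the collection is not independent. The smallest principal angle then measures how well separated the subspaces are.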
III-A Performance Analysis
With the aforementioned notations, we use an induction approach to show the following result.
Theorem 1: Given enough noiseless training data points spanning all subspaces of dimension . If we train the dictionary using HiDL, and both Lemma 1 (or Lemma 3) and Lemma 4 are satisfied, the noiseless test data from the same subspaces will have a perfect block sparse representation with respect to the trained dictionary.
To be more specific, we will show two properties that hold under certain conditions.
(i) Support recovery property: in the sparse coding stage, the sparse code for training data of -th class will have a perfect block structure such that and , where and indicate the sub-vectors corresponding to the subspace and all other subspaces except ;
(ii) Subspace consistency property: in the dictionary learning stage, the dictionary update procedures (II.7) - (II.10) guarantee the dictionary atoms to be updated in the same subspace.
Support recovery property: Similar to Theorem 1 in [elhamifar2012sparse], it is straightforward to see the support recovery property holds for the case of independent subspace.
Lemma 1: (Independent Subspace Case) Suppose the data are drawn from subspaces of dimension . Let denotes the sub-dictionary for subspace and denotes the sub-dictionary for all other subspaces except . Assume that every sub-dictionary is full column rank. If these subspaces are independent, then for every input , (III.1) recovers a perfect subspace-sparse structure, i.e., the resulting solutions have and .
For the disjoint subspace case, we define and as below:
The support recovery property also holds for the disjoint subspace case as long as the following lemma holds.
Lemma 2: (Disjoint Subspace Case) Given the same data and dictionary as in the independent subspace case above. If these subspaces are disjoint, then (III.1) recovers a perfect subspace sparse structure if and only if for all nonzero ,
Note that and are the sub-vectors of and defined by group .
Since the condition for the disjoint subspace case in Lemma 2 does not explicitly impose the requirements on either the dictionary or the data, we further relate it to the characteristics of the data to be more intuitive, which yields the following result.
Lemma 3: (Disjoint Subspace Case) Consider a collection of data points drawn from disjoint subspaces of dimension . If the condition
is satisfied, then for every nonzero input , (III.1) recovers a perfect subspace sparse structure, i.e., and .
Step 1: First, we will find the upper bound for the left side of the original condition in Lemma 2, . Since data and is full column rank, we have,
Since the subspace structure matches the group structure, we have
Applying the vector norm property yields
where is the size of sub-dictionary . Next, applying (III.3) and the matrix norm properties ( and ) , we have
where is the smallest singular value of . Thus, we have derived the upper bound for the left side of the condition.
Step 2: We will now show the lower bound for the right side of the condition . Notice that we have
where we have abused the notation to mean all the groups excluding the one corresponding to the class . Because
we can instead find the lower bound for the simplified condition . Based on the definition of , we have
Using the Holder’s inequalities ( and ) , we obtain
With the definition of the smallest principal angle and the vector norm inequality, we can write
where we use to denote the largest norm of the columns of , which is 1 because we restrict the dictionary atoms in a convex set to have unit norm. Therefore, the lower bound for the right side can be shown to be