Learning Discriminative Multilevel Structured Dictionaries for Supervised Image Classification

02/28/2018 ∙ by Jeremy Aghaei Mazaheri, et al. ∙ 0

Sparse representations using overcomplete dictionaries have proved to be a powerful tool in many signal processing applications such as denoising, super-resolution, inpainting, compression or classification. The sparsity of the representation very much depends on how well the dictionary is adapted to the data at hand. In this paper, we propose a method for learning structured multilevel dictionaries with discriminative constraints to make them well suited for the supervised pixelwise classification of images. A multilevel tree-structured discriminative dictionary is learnt for each class, with a learning objective concerning the reconstruction errors of the image patches around the pixels over each class-representative dictionary. After the initial assignment of the class labels to image pixels based on their sparse representations over the learnt dictionaries, the final classification is achieved by smoothing the label image with a graph cut method and an erosion method. Applied to a common set of texture images, our supervised classification method shows competitive results with the state of the art.




I Introduction

Sparse representations have become popular in several applications of signal, image and video processing, such as denoising [elad_image_2006, dong_sparsity-based_2011], super-resolution, inpainting, compression [figueras_i_ventura_low-rate_2006, sezer_sparse_2008, bryt_compression_2008, zepeda_image_2011] or classification. While it was common to analyze and reconstruct signals based on representations over predefined bases such as wavelets and DCT, research in recent years has shown that learning overcomplete dictionaries adapted to the structure of the treated signals can significantly improve the representation quality. Since learning redundant dictionaries from collections of data samples under sparsity priors leads to models that fit and approximate the characteristics of signals well [aharon_k-svd:_2006], [engan_method_1999], the learning of dictionaries in a supervised setting for the discrimination of different classes of signals has also become a popular research problem [jiang_label_2013]. In this work, we propose a method to learn multilevel structured dictionaries with high discrimination capability for the problem of pixelwise image classification.

We consider a supervised classification setting where the classes are known and exemplars are available for each class. In particular, we are interested in image classification problems with a large amount of variability between data samples of the same class, resulting from e.g., dominant presence of irregular high-frequency content in the image classes, or multiple subcategories within the same image class with little resemblance between them. Some example applications could be texture classification problems where the considered image texture classes are rich in high-frequency content with little correlation between several patterns belonging to the same class differing by shifts, orientation differences, etc.; or remote sensing satellite images with high variability within the same image class (e.g., the “city” class containing both smooth image regions corresponding to flat areas such as parks and rivers; and regions rich in texture corresponding to populated urban areas with buildings and streets).

In this setting, we consider the problem of learning a discriminative dictionary model for each class. In order to handle the large variability or the presence of multiple subcategories of patterns in each image class, we propose to use multilevel dictionaries having a tree-like structure. In the proposed setting, the overall class-representative dictionary consists of subdictionaries residing at multiple levels, such that each subdictionary in a level originates from a certain atom of a subdictionary in the preceding level. The representation of an image patch in a multilevel dictionary is simply computed by tracing down the branches, i.e., first choosing an atom in the first-level subdictionary, then selecting an atom from the second-level subdictionary corresponding to the first atom, and similarly descending until the desired sparsity level is attained. The patches of test images are classified with respect to their reconstruction errors over each class-representative multilevel dictionary.

Such a multilevel dictionary structure is particularly suitable for the considered image classification problem with high intra-class variability. In a setting with various patterns of little resemblance in the same class, the atoms in upper-level subdictionaries capture the main characteristics of the patterns such as orientation, so that dissimilar patterns are represented with different atoms in these early levels. The lower-level subdictionaries originating from different atoms in upper levels are then particularly adapted to the structures of the different types of patterns present in the class and learn the fine details of these patterns. The representation of signals with the proposed multilevel structured dictionaries is illustrated in Figure 1.

Fig. 1: Illustration of the proposed structured dictionaries for two levels. The atoms in the first level capture the main characteristics of different types of patterns in the same class. Each second-level dictionary originates from a different first-level atom. Second-level atoms learn the details of the patterns selecting the corresponding first-level atom.

Many methods in the literature use sparse representations and dictionary learning for the problem of supervised classification [jiang_label_2013], [mairal_discriminative_2008], [mairal_supervised_2008], [zhang_discriminative_2010]. Using the known labels of training data, these methods learn one or several dictionaries to allow the classification of test images based on their sparse representations in the learnt dictionaries. Although the traditional single-level flat dictionaries used typically in supervised dictionary learning can learn the main characteristics of different classes via sparse representations, the main philosophy of these methods is to tune the atoms to fit well the common features in the same class while pushing them away from the features of the other classes. While such methods give quite impressive results in applications such as face recognition with rather small variability within the same class, their performance may degrade in problems with large intra-class variability. On the other hand, the multilevel dictionary structures proposed in our method have a high learning capacity that explicitly and efficiently takes account of possible intra-class variations.

We propose to learn the class-representative multilevel dictionaries in a sequential way, by optimizing each atom of a subdictionary with respect to a discriminative learning objective. Our objective function seeks to update each atom in order to fit the residuals of the signals from the same class using that atom, while increasing the reconstruction error of the signals from other classes when represented with that atom. We first train the dictionaries with image patches from a set of known classes. Then, for the image patches centered around each pixel of a test picture, we compute the reconstruction errors over the learnt class-representative dictionaries. Finally, the label image is obtained by applying a combination of two smoothing methods: a label expansion algorithm based on a graph cut, and an erosion algorithm, which both use the information of the reconstruction errors of the patches over the learnt dictionaries. We evaluate our method with experiments on several texture classification problems. The experimental results show that our method gives competitive results with the state of the art.

In Section II, we give an overview of the related work. Section III presents the proposed method for learning supervised dictionaries with a multilevel adaptive structure, together with a description of the classification algorithm based on these learnt structured dictionaries. In Section IV, we describe the smoothing steps applied for improving the label estimates. We present the experimental results on texture images in Section V, and Section VI concludes the paper.

II Related work

We now briefly overview some related works on sparse representations and dictionary learning.

II-A Sparse representations and unsupervised dictionary learning

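Sparse representations consist in representing a signal $y \in \mathbb{R}^n$ as a linear combination of only a few columns, known as atoms, from a dictionary $D \in \mathbb{R}^{n \times K}$, under a sparsity constraint as

$$\min_{x} \| y - D x \|_2^2 \quad \text{subject to} \quad \| x \|_0 \le L, \tag{1}$$

where $x \in \mathbb{R}^K$ is the coefficient vector corresponding to the sparse representation of $y$ over $D$, and $L$ is the sparsity constraint, i.e., the maximum number of non-zero coefficients in $x$. The $\ell_0$-norm of $x$ is equal to the number of non-zero coefficients in $x$. The dictionary $D$ is composed of $K$ atoms $d_k$, $k = 1, \dots, K$, that are supposed to be normalized to have unit $\ell_2$-norm as $\| d_k \|_2 = 1$.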

The computation of the sparse approximation of a signal in (1) is an NP-hard problem, so greedy algorithms have been developed to find an approximate solution, such as Matching Pursuit (MP) [mallat_matching_1993] and Orthogonal Matching Pursuit (OMP) [pati_orthogonal_1993], which search in each iteration for the atom of the dictionary that is the most correlated with the current residual vector. Several other methods such as the Basis Pursuit algorithm [chen_atomic_1998] propose to relax the optimization problem by replacing the $\ell_0$-norm of the coefficient vector $x$ with its $\ell_1$-norm.
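To make the greedy selection concrete, the following is a minimal sketch of Matching Pursuit, assuming a dictionary stored as a NumPy array with unit-norm columns. The function name `matching_pursuit` and the toy data are illustrative and do not come from the paper.

```python
import numpy as np

def matching_pursuit(y, D, L):
    """Greedy MP: run L iterations, each picking the atom (column of D)
    most correlated with the current residual. Atoms are assumed unit-norm."""
    residual = y.astype(float)          # astype returns a copy
    x = np.zeros(D.shape[1])
    for _ in range(L):
        correlations = D.T @ residual   # inner product with every atom
        k = int(np.argmax(np.abs(correlations)))
        x[k] += correlations[k]         # MP may select the same atom again
        residual = residual - correlations[k] * D[:, k]
    return x, residual

# Toy example with a trivial (identity) dictionary.
D = np.eye(3)
y = np.array([3.0, 0.0, 1.0])
x, r = matching_pursuit(y, D, L=1)
```

With a sparsity of one, as used for every dictionary of the multilevel structures discussed later, a single iteration of this loop is all that is needed. OMP differs by re-fitting all selected coefficients with a least-squares step after each selection.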

Many dictionary learning methods have been proposed to learn a dictionary $D$ from a set of training vectors $Y = [y_1, \dots, y_N]$ under sparsity constraints, in order to better adapt the dictionary to the data. Unsupervised dictionary methods typically solve the problem

$$\min_{D, X} \| Y - D X \|_F^2 \quad \text{subject to} \quad \| x_i \|_0 \le L, \quad i = 1, \dots, N, \tag{2}$$

where $L$ is the sparsity constraint applied to each column $x_i$ of $X$, i.e., the maximum number of non-zero coefficients in $x_i$, and $\| \cdot \|_F$ is the Frobenius norm.

Many dictionary learning algorithms such as the Method of Optimal Directions (MOD) [engan_method_1999, engan_frame_1999-1] and K-SVD [aharon_k-svd:_2006] apply an iterative optimization procedure with two major steps. The first step consists of the sparse coding of the training vectors over the fixed dictionary to compute the coefficient matrix $X$, which can be solved with pursuit algorithms; and the second step is the update of the dictionary based on the decompositions computed in the previous step. Some dictionary learning algorithms impose constraints on the dictionary, such as the Sparse K-SVD method [rubinstein_double_2010], which aims to learn a sparse dictionary, or the Non-Negative K-SVD method [aharon_k-svd_2005], which learns a non-negative dictionary. An online dictionary learning method based on stochastic approximations is proposed in [mairal_online_2010]. Finally, structured multilevel dictionaries are learnt in [zepeda_iteration-tuned_2010, zepeda_image_2011] based on the idea of adapting each dictionary to one iteration of the pursuit algorithm, so that atoms are sequentially selected from dictionaries at different levels by going down the branches in sparse coding. While the multilevel dictionary structure used in our method is based on the principle developed in these previous works, we focus here on the supervised learning problem for classification applications, unlike these works.
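The two-step alternation can be sketched as follows. This is only an illustration in the spirit of MOD with a sparsity of one atom, assuming the least-squares dictionary update (the training matrix times the pseudo-inverse of the coefficient matrix) followed by renormalization; the function names, the random initialization and the iteration count are assumptions, and K-SVD would instead update each atom with a rank-one SVD.

```python
import numpy as np

def sparse_code_one_atom(Y, D):
    """Sparse coding step: each column of Y keeps only its best-correlated atom."""
    C = D.T @ Y                                  # K x N correlations
    idx = np.argmax(np.abs(C), axis=0)           # best atom per sample
    cols = np.arange(Y.shape[1])
    X = np.zeros_like(C)
    X[idx, cols] = C[idx, cols]
    return X

def mod_update(Y, X):
    """Dictionary update step (MOD-style least squares), then renormalize atoms."""
    D = Y @ np.linalg.pinv(X)
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0                      # unused atoms stay at zero
    return D / norms

def learn_dictionary(Y, K, n_iter=10, seed=0):
    """Alternate the two steps, starting from K randomly chosen training vectors."""
    rng = np.random.default_rng(seed)
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = sparse_code_one_atom(Y, D)
        D = mod_update(Y, X)
    return D, sparse_code_one_atom(Y, D)

# Toy run on random data (illustrative only).
Y_toy = np.random.default_rng(1).standard_normal((4, 30))
D_toy, X_toy = learn_dictionary(Y_toy, K=3, n_iter=5)
```

In this simplified version an atom that no sample selects collapses to zero; practical implementations usually replace such atoms with poorly represented training vectors.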

II-B Supervised dictionary learning

Supervised dictionary methods aim to learn dictionaries such that sparse representations of signals over the learnt dictionaries allow an accurate estimation of their class labels. Some dictionary learning methods learn one global dictionary to represent all classes. The study in [rodriguez_sparse_2008] shows the advantage of learnt dictionaries over predefined dictionaries in classification, where a dictionary is learnt with a discrimination term applied on the coefficients. A discriminative formulation with a linear and bilinear classifier applied to the sparse coefficients is employed in [mairal_supervised_2008]. A discriminative version of K-SVD is presented in [zhang_discriminative_2010]. A classifier is jointly learnt with the dictionary and then applied to the coefficients of a test picture to classify it. Applied to face recognition, it offers better results than the K-SVD dictionary. The problem of [zhang_discriminative_2010] is extended in the Label Consistent K-SVD method [jiang_learning_2011], [jiang_label_2013]. The dictionary is learnt along with a linear classifier using the sparse coefficients in order to increase the discrimination capability of the coefficients, while another term in the objective directly imposes the similarity of the sparse coefficients among the samples from the same class. The methods in [zhou_bilevel_2017] and [yankelevsky_structure_2017] are based on similar formulations while they also include a graph-regularization term on the sparse coefficients. The authors of [zhou_bilevel_2017] further propose to remove the sparse reconstruction term from the objective function and include it only in the constraints of the optimization problem. In the sparse decomposition of a training sample, the coefficients corresponding to other classes are suppressed with a differentiable norm-based term in [wang_crosslabel_2017], while a graph-regularization term is also included in the objective. A Fisher criterion is applied on the sparse coefficients in the learning in [yang_fisher_2011]. The dictionary learning problem is formulated in a Bayesian setting in [akhtar_discriminative_2016], such that sparsity is imposed via class-dependent Bernoulli random vectors, and a classifier is trained on sparse codes. A couple of other methods consider the semi-supervised dictionary learning problem. A linear classifier on sparse codes is learnt in [wang_adaptively_2015] while the unlabeled samples are also incorporated in the discriminative term of the learning objective, proportionally to the confidence of their label estimates. The authors of [jian_semi_2016] emphasize that there may be overlapping features between different classes and propose to learn a global dictionary along with the corresponding soft label vectors in a graph-regularized semi-supervised learning scheme.

Some other methods learn one dictionary per class and classify test data based on the reconstruction error over each dictionary. In [mairal_discriminative_2008], dictionaries that are both reconstructive and discriminative are learnt for each class by optimizing a sparse reconstruction error term and a discriminative term. The discriminative term in the objective function involves the reconstruction errors of samples over the dictionaries. Test samples are then classified by searching for the dictionary giving the minimum reconstruction error. A smoothing graph cut step is finally applied to refine the label image. A dictionary is learnt for each class in [ramirez_classification_2010] with an incoherence criterion imposed on the dictionaries to make them independent. This incoherence term is also used in [kong_dictionary_2012] where an additional dictionary is also learnt in order to capture the patterns common to different classes.

Finally, there are also discriminative dictionary learning methods relying on a categorical or relational organization of the image classes. A dictionary learning method for multilabel image annotation is proposed in [cao_sled_2015], where the image labels are first organized into exclusive groups such that two labels that simultaneously occur in the same training image are in different groups. A discriminative dictionary is then learnt with a Fisher criterion for each label group. Test images are finally classified according to their sparse representations by imposing group sparsity in their sparse coding. The method in [shen_multilevel_2015] learns discriminative dictionaries with a multilevel structure. Their method addresses the particular application of large scale classification with a high number of classes and relies strictly on the availability of a category hierarchy organization of the given classes in the form of a tree model. A global tree-structured dictionary is then learnt where the multilevel tree structure is directly inherited from the given category hierarchy tree model, and the dictionary in each node of the tree is specialized for a group of classes residing under the same subcategory. A similar tree structure is used for emotion classification in [chen_sparse_2015], where each node is associated with a dictionary and a classifier. While the dictionaries are learnt in an unsupervised manner, the classifiers are trained so as to discriminate between the confused classes branching from that node based on sparse codes. Although the methods in [shen_multilevel_2015] and [chen_sparse_2015] learn tree-structured multilevel dictionaries, these methods differ significantly from ours in that their multilevel dictionary structures are formed quite differently for different usages and purposes.

III Learning discriminative structured dictionaries for classification

Our classification method is based on the learning of discriminative structured dictionaries. Multilevel structured dictionaries, composed of many small dictionaries organized on several levels, have the ability to better specialize and thus more efficiently capture the high variability within a class.

This concept of structured dictionaries was first introduced in [zepeda_iteration-tuned_2010], and developed in [zepeda_image_2011], under the name of Iteration-Tuned Dictionaries (ITD). The structure is based on the idea of learning a different dictionary for each iteration of the pursuit algorithm. Thus, each atom added in the decomposition of a signal is selected in a new dictionary by going down the multilevel dictionary structure. Several structures, organized in several levels that contain one or several dictionaries, have been developed within this concept, like the Basic ITD (BITD) composed of one dictionary per level, or the Tree-Structured ITD (TSITD) structured as a tree of dictionaries. In these structures, each dictionary at a level is learnt based on a subset of residuals computed at the previous level.

Fig. 2: The Adaptive Structure. Each atom at a given level leads to the generation of a specialized dictionary at the next level, learnt from only the data samples selecting that atom. At each level $l$, all branches without sufficiently many data samples to continue learning a dictionary at the next level are merged together to learn a new dictionary at the next level $l+1$.

Another tree structure, called Tree K-SVD [aghaei_mazaheri_learning_2013], has been derived from the TSITD structure. Each dictionary it contains is learnt with the K-SVD algorithm [aharon_k-svd:_2006] with a sparsity of one atom. Starting with one dictionary at the first level, the principle of these tree structures is to learn for each atom at a level one child dictionary at the next level. They are thus quickly composed of too many dictionaries when the number of levels increases, and many can be incomplete or even empty, which can be problematic.

Motivated by these observations, we propose here the discriminative Adaptive Structure by building on our previous work [aghaei_mazaheri_learning_2013-1], which focused on image compression by learning reconstructive Adaptive Structures. The Adaptive Structure is a new dictionary structure whose topology is adaptively determined during the learning in order to not contain any incomplete dictionary. The branches in the structure are progressively pruned, according to their usage rate, and merged into a unique and more general branch whenever there is not enough data to learn new dictionaries down the branches. This adaptive structure enables the learning of more levels than the tree structure while keeping the total number of atoms reasonable. In the sequel, we first describe the Adaptive Structure in Section III-A. Then, in Section III-B we present the proposed discriminative dictionary learning method based on Adaptive Structures in a supervised learning setting. Finally, in Section III-C, we present our supervised classification algorithm.

III-A The Adaptive Structure

The Adaptive Structure demonstrated in Fig. 2 is learnt with a top-down approach, level after level. Each dictionary in the structure consists of $K$ atoms and is learnt with K-SVD [aharon_k-svd:_2006], [aharon_k-svd_????], for a sparsity of one atom. Let $Y = \{ y_i \}_{i=1}^{N}$ denote the set of training samples. The single dictionary $D_1$ at the first level, consisting of $K$ unit-norm atoms $d_k$, is learnt using all training data $Y$, by setting the residual set of the first level simply as $R_1 = Y$. Each training data sample $y_i$, $i = 1, \dots, N$, is then approximated by one atom $d_{k_i}$ of the first dictionary as

$$y_i \approx x_i \, d_{k_i},$$

and the residual vectors

$$r_i = y_i - x_i \, d_{k_i}$$

are computed to form the residual set $R_2$ for the next level. The residuals in $R_2$ are split into $K$ groups $R_{2,k}$ such that each group consists of the residuals of the training samples selecting the atom $d_k$ in the first level:

$$R_{2,k} = \{ r_i : k_i = k \}.$$

For each set $R_{2,k}$ with $k = 1, \dots, K$, if it contains sufficiently many residuals to satisfy

$$| R_{2,k} | \ge K,$$

a dictionary $D_{2,k}$ at the second level is learnt from $R_{2,k}$, where $| \cdot |$ denotes the cardinality of a set. Otherwise, in order not to create an incomplete dictionary, the dictionary is not learnt and the set of residuals $R_{2,k}$ is saved. At the end of the learning of the second level, all the saved residual sets at this level are merged in $R_{2,m}$ as

$$R_{2,m} = \bigcup_{k \, : \, | R_{2,k} | < K} R_{2,k}.$$

The merged residual set $R_{2,m}$ is then used to learn a new dictionary $D_{2,m}$, the dictionary of the “merged branches” at the second level of the structure.

The same procedure is then applied to the dictionaries of the second level to learn the dictionaries of the third level. The residual sets of insufficient cardinality at the third level are merged, together with the residuals from $D_{2,m}$ at the previous level, to form $R_{3,m}$. The residual set $R_{3,m}$ is then used to learn the corresponding dictionary $D_{3,m}$ at the third level.

This procedure is continued to learn the multilevel Adaptive Structure until a desired number of levels is reached. With this method, the branches with a high usage rate, i.e., the branches selected by many training samples, will be further developed to result in new dictionaries down the tree. On the other hand, the branches with a low usage rate will be quickly pruned and the corresponding residuals will be merged to learn rather general dictionaries (in contrast to the more specialized ones residing at non-merged branches). Thus, during this learning process, the structure adapts itself according to the training vectors in order not to contain any incomplete or empty dictionaries.
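The split-and-merge step performed at each level can be sketched as follows. This is a simplified illustration under stated assumptions: residuals are stored as columns of a NumPy array, `atom_idx` holds the index of the atom each sample selected at the previous level, and `min_samples` is a hypothetical name for the minimum number of residuals required before a child dictionary is learnt.

```python
import numpy as np

def split_and_merge(R, atom_idx, K, min_samples):
    """Group residual columns by the atom they selected at the previous level.
    Groups with enough residuals become separate branches; the remaining ones
    are pooled into a single 'merged branch' residual set."""
    branches, pooled = {}, []
    for k in range(K):
        group = R[:, atom_idx == k]
        if group.shape[1] >= min_samples:
            branches[k] = group          # enough data: learn a child dictionary
        elif group.shape[1] > 0:
            pooled.append(group)         # too little data: merge with others
    merged = np.hstack(pooled) if pooled else np.empty((R.shape[0], 0))
    return branches, merged

# Toy example: 6 residuals of dimension 2, 3 atoms at the previous level.
R = np.arange(12, dtype=float).reshape(2, 6)
atom_idx = np.array([0, 0, 0, 1, 2, 2])
branches, merged = split_and_merge(R, atom_idx, K=3, min_samples=3)
```

A dictionary would then be learnt on each entry of `branches` and one more on `merged`, and the same step would be repeated on the residuals of the next level.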

Once the Adaptive Structure is learnt, the sparse decomposition of a test sample is computed by selecting one atom per level, beginning with the first level and descending down the multilevel structure. Given a test sample $y$, it is approximated at the first level by an atom $d^{(1)}$ of the first-level dictionary $D_1$ selected with the MP algorithm [mallat_matching_1993] with a sparsity of one,

$$y \approx x^{(1)} d^{(1)},$$

where $x^{(1)}$ is the sparse coefficient obtained as

$$x^{(1)} = (d^{(1)})^T y.$$

The residual vector $r^{(1)} = y - x^{(1)} d^{(1)}$ is then computed and approximated with the same procedure by another atom from a dictionary at the second level, the child dictionary of the atom chosen at the first level. The residual computation and atom selection procedure is continued by descending down the multilevel structure along a branch until a given sparsity $L$ is reached. The dictionary to use at each level is thus determined by the atom chosen at the previous level in the approximation of $y$. When the end of a branch is reached, the atom at the next level is selected within the dictionary of the “merged branch” of the structure, and the decomposition continues after that along this branch. For a structured multilevel dictionary $\mathcal{D}$, the reconstruction error of the test sample $y$ for a sparsity of $L$ atoms is thus obtained as

$$E(y, \mathcal{D}, L) = \Big\| y - \sum_{l=1}^{L} x^{(l)} d^{(l)} \Big\|_2^2, \tag{3}$$

where $d^{(1)}, \dots, d^{(L)}$ are the atoms chosen at the levels $1$ to $L$ from $\mathcal{D}$ and $x^{(1)}, \dots, x^{(L)}$ are the corresponding coefficients.
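The descent through the structure can be sketched as below. The `Node` class, the `merged_by_level` mapping (merged-branch node for each depth) and the squared-error convention are assumptions made for this illustration, not the authors' data structures.

```python
import numpy as np

class Node:
    """One dictionary of the structure plus the child reached from each atom."""
    def __init__(self, D, children=None):
        self.D = D                       # n x K array with unit-norm columns
        self.children = children or {}   # atom index -> child Node

def decompose(y, root, merged_by_level, L):
    """Select one atom per level, following the branch of each chosen atom.
    When a branch ends, fall back to the merged-branch node of the next level."""
    residual = y.astype(float)
    node, path = root, []
    for level in range(L):
        if node is None:                 # structure is shallower than L levels
            break
        c = node.D.T @ residual
        k = int(np.argmax(np.abs(c)))    # one MP step within this dictionary
        residual = residual - c[k] * node.D[:, k]
        path.append((level, k, float(c[k])))
        node = node.children.get(k, merged_by_level.get(level + 1))
    return path, float(residual @ residual)

# Toy two-level structure: the first atom of the root has a child dictionary.
root = Node(np.eye(2), {0: Node(np.eye(2))})
path, err = decompose(np.array([2.0, 1.0]), root, merged_by_level={}, L=2)
```

The returned error is the squared norm of the final residual, which is the quantity compared across class dictionaries at classification time.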

III-B Discriminative Learning with Adaptive Structures

We now describe our proposed method where discriminative Adaptive Structures are learnt for supervised image classification. We propose to learn one multilevel dictionary with the Adaptive Structure for each class. We have observed that in order to achieve satisfactory performance, it suffices to apply the discriminative learning procedure described below at the first level of the structures where there is only one dictionary. We learn the dictionaries at the other levels with the K-SVD algorithm with a sparsity of 1 atom, by following the Adaptive Structure as described in Section III-A. Since the dictionary structure is learnt with a top-down approach, applying a discrimination-based learning at the top level impacts the other levels as well and has an effect on the whole multilevel dictionary structure.

Let $D_c$ denote the dictionary at the first level of the Adaptive Structure to be learnt for the class $c$, for $c = 1, \dots, C$. We aim to learn a dictionary $D_c$ that is both reconstructive and discriminative, which efficiently represents the data from its own class but yields a large reconstruction error for the data from other classes. Hence, the dictionaries are learnt considering the data from both their own class and the other classes. In this way, the reconstruction errors of test samples on the learnt class-representative dictionaries can be used to classify test data.

In the following, we first introduce our discriminative dictionary learning objective and then discuss its minimization. Next, we explain how the data samples included in the objective function are chosen and finally present the overall discriminative dictionary learning algorithm.

III-B1 Discrimination model

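We propose to update the dictionaries sequentially (atom by atom), by minimizing the following objective function for updating an atom $d_k$ of the dictionary $D_c$ of class $c$:

$$\min_{d_k} \; \big\| Y_c^k - d_k d_k^T Y_c^k \big\|_F^2 - \lambda \big\| \tilde{Y}_c^k - d_k d_k^T \tilde{Y}_c^k \big\|_F^2 \quad \text{subject to} \quad \| d_k \|_2 = 1. \tag{4}$$

The first term in the above cost function is a reconstructive term aiming to adapt the atom $d_k$ to the training data from its own class $c$, where $Y_c^k$ denotes the restricted subset of data samples from class $c$ that use the atom $d_k$ in their decomposition. The second term is a discriminative term, whose goal is to push the atom $d_k$ away from the training samples $\tilde{Y}_c^k$ of the other classes. Thus, we search for the atom minimizing the reconstruction error of the data from its own class and maximizing the reconstruction error of the data from the other classes. The positive weight parameter $\lambda$ balances the two terms according to the ratio between the number of samples in $Y_c^k$ and $\tilde{Y}_c^k$ as

$$\lambda = \gamma \, \frac{| Y_c^k |}{| \tilde{Y}_c^k |}, \tag{5}$$

where $| \cdot |$ denotes the number of columns in a matrix (i.e., the number of data samples) with a slight abuse of notation. The positive constant $\gamma$ adjusts the compromise between reconstruction and discrimination. The exact choice of the samples $\tilde{Y}_c^k$ for each class and atom will be explained later in Section III-B3.

III-B2 Minimization of the objective function

The cost function to minimize can be rewritten as

$$\big\| (I - d_k d_k^T) Y_c^k \big\|_F^2 - \lambda \big\| (I - d_k d_k^T) \tilde{Y}_c^k \big\|_F^2. \tag{6}$$

This is equivalent to

$$\mathrm{Tr}\big( (Y_c^k)^T (I - d_k d_k^T)^T (I - d_k d_k^T) Y_c^k \big) - \lambda \, \mathrm{Tr}\big( (\tilde{Y}_c^k)^T (I - d_k d_k^T)^T (I - d_k d_k^T) \tilde{Y}_c^k \big). \tag{7}$$

With the constraint $d_k^T d_k = 1$, we have $(I - d_k d_k^T)^T (I - d_k d_k^T) = I - d_k d_k^T$. Hence, we can simplify the cost function to

$$\mathrm{Tr}\big( (Y_c^k)^T (I - d_k d_k^T) Y_c^k \big) - \lambda \, \mathrm{Tr}\big( (\tilde{Y}_c^k)^T (I - d_k d_k^T) \tilde{Y}_c^k \big). \tag{8}$$

In order to solve this minimization problem under the constraint $d_k^T d_k = 1$, we then apply the Lagrange multipliers method and minimize the function

$$\mathcal{L}(d_k, \mu) = \mathrm{Tr}\big( (Y_c^k)^T (I - d_k d_k^T) Y_c^k \big) - \lambda \, \mathrm{Tr}\big( (\tilde{Y}_c^k)^T (I - d_k d_k^T) \tilde{Y}_c^k \big) + \mu \, ( d_k^T d_k - 1 ). \tag{9}$$

Setting the derivative of $\mathcal{L}$ with respect to $\mu$ to $0$ gives

$$d_k^T d_k = 1. \tag{10}$$

We then evaluate the derivative with respect to $d_k$ and equate it to $0$ as

$$-2 \, d_k^T Y_c^k (Y_c^k)^T + 2 \lambda \, d_k^T \tilde{Y}_c^k (\tilde{Y}_c^k)^T + 2 \mu \, d_k^T = 0, \tag{11}$$

which gives

$$d_k^T \big( Y_c^k (Y_c^k)^T - \lambda \, \tilde{Y}_c^k (\tilde{Y}_c^k)^T \big) = \mu \, d_k^T. \tag{12}$$

Taking the transpose of both sides, we get

$$\big( Y_c^k (Y_c^k)^T - \lambda \, \tilde{Y}_c^k (\tilde{Y}_c^k)^T \big) \, d_k = \mu \, d_k. \tag{13}$$

This equation is of the form

$$A \, d_k = \mu \, d_k, \tag{14}$$

with

$$A = Y_c^k (Y_c^k)^T - \lambda \, \tilde{Y}_c^k (\tilde{Y}_c^k)^T. \tag{15}$$

The atom $d_k$ is thus an eigenvector of $A$ with a unit $\ell_2$-norm. Since our objective in (4) imposes the atom to fit the samples $Y_c^k$ while repulsing it from $\tilde{Y}_c^k$, the sought atom is the eigenvector of $A$ corresponding to its maximum eigenvalue.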
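As an illustration of this eigenvector update, here is a minimal sketch assuming that the matrix in question is the difference between the scatter of the own-class samples and a weighted scatter of the other-class samples, and that the weight equals a constant times the ratio of the two sample counts. The names `update_atom`, `Y_own`, `Y_other` and `gamma` are hypothetical.

```python
import numpy as np

def update_atom(Y_own, Y_other, gamma=1.0):
    """Return the unit-norm eigenvector of
    A = Y_own Y_own^T - lam * Y_other Y_other^T with the largest eigenvalue.
    Columns of Y_own: own-class samples using the atom;
    columns of Y_other: samples chosen from the other classes."""
    A = Y_own @ Y_own.T
    if Y_other.shape[1] > 0:
        lam = gamma * Y_own.shape[1] / Y_other.shape[1]   # assumed weight rule
        A = A - lam * (Y_other @ Y_other.T)
    # A is symmetric, so eigh applies; eigenvalues come back in ascending order.
    _, eigvecs = np.linalg.eigh(A)
    return eigvecs[:, -1]

# Toy example: own-class samples lie along the first axis,
# the other-class sample lies along the second axis.
Y_own = np.array([[2.0, 3.0], [0.0, 0.0]])
Y_other = np.array([[0.0], [1.0]])
d = update_atom(Y_own, Y_other)
```

Because the matrix can be indefinite, the largest algebraic eigenvalue is needed rather than the largest in magnitude, which is why a symmetric eigensolver is used here instead of a plain power iteration. The sign of the returned eigenvector is arbitrary.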

III-B3 Choice of the sample set

In order to adapt the discrimination term to each class and even to each atom to update, we follow a particular strategy when forming the matrix $\tilde{Y}_c^k$ that contains the data from the classes other than the current class $c$. Rather than choosing $\tilde{Y}_c^k$ to contain all data samples from the other classes, we wish to particularly discriminate the updated atom from the classes most similar to its class. For this purpose, we compute an affinity matrix $W$ that represents the similarity between each pair of classes.

In order to compute the class affinity matrix, for each class $c$ we first compute a representative vector $v_c$ that best fits the data samples $Y_c$ of that class. In order to avoid computing an almost constant vector as the representative vector, we first subtract the mean value of each data sample in $Y_c$. We then choose the representative vector $v_c$ as the one that maximizes the energy of the data samples when projected onto it as

$$v_c = \arg\max_{\| v \|_2 = 1} \sum_{n} \big( v^T \bar{y}_n \big)^2,$$

where $\bar{y}_n$ is the mean-removed version of the training sample $y_n$ of class $c$. The solution of the above problem gives $v_c$ as the unit-norm eigenvector of $\bar{Y}_c \bar{Y}_c^T$ associated with its maximum eigenvalue, where $\bar{Y}_c$ is the matrix containing the mean-removed samples $\bar{y}_n$. We then obtain the class affinity matrix $W$ such that the affinity between the $i$-th and $j$-th classes is given by the similarity between their class representative vectors as

$$W_{ij} = \big| v_i^T v_j \big|.$$

Hence, $W$ is a symmetric matrix with $1$’s on the diagonal and affinity values varying between $0$ and $1$ on its off-diagonal entries.
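The affinity computation can be sketched as follows, assuming that each class is given as a matrix with one sample per column and that the similarity between two representative vectors is the absolute value of their inner product (an assumption consistent with values between 0 and 1). The function names are illustrative.

```python
import numpy as np

def representative_vector(Y):
    """Top eigenvector of the scatter matrix of the mean-removed samples.
    The mean of each sample (column) is subtracted from that sample."""
    Y_bar = Y - Y.mean(axis=0, keepdims=True)
    _, eigvecs = np.linalg.eigh(Y_bar @ Y_bar.T)   # ascending eigenvalues
    return eigvecs[:, -1]

def class_affinity(class_data):
    """Affinity matrix: absolute inner products of the class representative vectors."""
    V = np.column_stack([representative_vector(Y) for Y in class_data])
    return np.abs(V.T @ V)

# Toy example with three random classes of 4-dimensional samples.
rng = np.random.default_rng(0)
classes = [rng.standard_normal((4, 10)) for _ in range(3)]
W = class_affinity(classes)
```

Taking the absolute value removes the arbitrary sign of each eigenvector, so that two classes with the same dominant direction get an affinity of 1 regardless of the solver's sign convention.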

With this affinity matrix, we then determine by selecting from each class a variable number of vectors according to its affinity with the current class . If the number of training samples in each class is the same and equal to , we set the number of samples to be selected from class as

where the function rounds the values to the nearest integer. Note that this strategy can be easily adapted to the case where the number of training samples is different for each class, by choosing such that

where contains the samples in from class , with . The samples are chosen as the samples from class that have the highest correlation with the atom to update, i.e., the samples from class that are the most susceptible to choose for their sparse decomposition in the dictionary of class .

With this strategy, each dictionary becomes more discriminative towards the classes closest to its own class, instead of equally treating all the other classes. Indeed, two classes with high dissimilarity do not necessarily need an extra discrimination criterion to be distinguished.

III-B4 Overall discriminative dictionary learning algorithm

Let us now describe the overall algorithm to learn a multilevel discriminative dictionary for each class $c$.

The dictionary $D_c$ that composes the first level of the Adaptive Structure for class $c$ is computed as follows. The dictionary is first initialized by training vectors from its own class, randomly selected and normalized to be of unit $\ell_2$-norm. The algorithm iterates between a sparse decomposition step and a dictionary update step, as is commonly done in dictionary learning.

In the sparse decomposition step, the decompositions of the data samples $Y_c$ from class $c$ are computed with the MP algorithm [mallat_matching_1993] for a sparsity of one atom. Thus, for each vector in $Y_c$, we search for the atom in $D_c$ that is the most correlated with it. This step allows us to compute for each atom $d_k$ the matrix $Y_c^k$ composed of the training vectors from the class $c$ choosing this atom at this decomposition step.

The dictionary is then updated sequentially, atom by atom. For each atom of , the matrix composed of training vectors from the other classes is formed with respect to the class affinities as described in Section III-B3 and the matrix is computed. The matrix in (15) can then be computed and the atom is updated as the unit-norm eigenvector of corresponding to the maximum eigenvalue.

Once the discriminative dictionary is computed by alternatingly updating the sparse codes and the atoms, the residual set for the next level is computed from , and the reconstructive Adaptive Structure learning described in Section III-A continues until the desired number of levels. This procedure is repeated for each class to obtain a class-representative multilevel structured dictionary for each class.

III-C Classification of test images based on learnt structured dictionaries

Test samples are classified with respect to their reconstruction errors over the learnt multilevel dictionaries as follows. A given test sample is first decomposed over each one of the class-representative dictionaries for a given sparsity , where is the number of classes. Then the reconstruction error of is computed over each dictionary of class , , as described in Section III-A


Here is the atom selected at level , chosen in the dictionary at level that corresponds to the atom selected at the previous level ; and is the coefficient of in the decomposition of .
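As an illustration of the level-by-level decomposition described here, the following sketch selects one atom per level and follows the branch attached to the selected atom. The `Node` class, the strict tree layout, and the use of the squared Euclidean norm of the residual as the error are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

class Node:
    """Hypothetical container: one dictionary plus the child dictionaries
    attached to its atoms (atom index -> Node)."""
    def __init__(self, D, children=None):
        self.D = D                      # atoms as unit-norm columns
        self.children = children or {}

def tree_residual_error(y, root, sparsity):
    """Decompose y with one atom per level and return the squared norm
    of the final residual (ASSUMPTION: squared l2 error)."""
    r = np.asarray(y, dtype=float).copy()
    node = root
    for _ in range(sparsity):
        if node is None:                # no deeper dictionary on this branch
            break
        corr = node.D.T @ r
        k = int(np.argmax(np.abs(corr)))   # most correlated atom at this level
        r -= corr[k] * node.D[:, k]        # remove its contribution
        node = node.children.get(k)        # descend to the child of atom k
    return float(r @ r)
```

Calling this function once per class-representative structure gives one error per class for a given test sample.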

In this paper, we focus on pixelwise classification of a test image. In this case, the training and test samples are image patches. The test samples are obtained by taking a square patch around each pixel of the given test image so as to assign a class label to each pixel. In such a setting, it is useful to normalize the reconstruction residuals by the norm of the test patch , in order to prevent the patches of high norm from dominating the overall label estimation during the smoothing steps discussed in Section IV. We thus consider the normalized error


A simple classification strategy would be to search for the class minimizing the error for each patch


However, this classification rule generally leads to a fragmented segmentation of the test picture, with many small and disconnected label support regions. In order to improve the label image and obtain more uniform and smooth label supports, we apply two smoothing steps, discussed in Section IV.
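The normalization and the simple minimum-error rule can be sketched as follows. The array layout (one row per patch, one column per class) and the small constant `eps` guarding against zero-norm patches are assumptions of this sketch. It also assumes a division by the norm itself rather than by its square.

```python
import numpy as np

def classify_patches(errors, patch_norms, eps=1e-12):
    """errors: (n_patches, n_classes) reconstruction errors.
    patch_norms: (n_patches,) norms of the test patches.
    Returns the normalized errors and the minimum-error label per patch."""
    normalized = errors / (patch_norms[:, None] + eps)
    labels = np.argmin(normalized, axis=1)
    return normalized, labels
```

Dividing each row by a positive constant does not change the minimizing class of that row. The normalization therefore only matters once the errors of different patches are combined, as in the smoothing steps.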

IV Smoothing steps

In order to improve the estimate of the label image obtained via the reconstruction errors over the class-representative dictionaries as described in (18), we apply a label smoothing procedure that comprises two steps: a label expansion step with a graph cut, followed by an erosion step to erode the remaining small undesirable label support regions.

IV-A Label expansion via graph cuts

The first smoothing step uses an α-expansion algorithm that minimizes an energy function with a graph cut [boykov_fast_2001, kolmogorov_what_2004, boykov_experimental_2004]. The algorithm estimates the label of each pixel by minimizing the following energy function based on a Potts model:


The first term is the data cost and corresponds to the sum on all the pixels of the cost of assigning a label to the pixel . Rather than applying a cost of to the class offering the lowest reconstruction error, and to the other classes as done in [mairal_discriminative_2008], we set the data cost as

where is the patch centered around the pixel and is the reconstruction error of over the dictionary of the class as defined in (17). Such a choice of the data cost provides a ranking of all the classes for each pixel.

The second term is the smoothing cost summed over all neighboring pixels and (denoted as ), which is defined as

Hence, if two neighboring pixels and share the same label, then the associated cost is 0. Otherwise, this cost is constant and equal to the parameter .

This model encourages a labeling with several large regions whose pixels share the same label. Adapting the parameter makes the label image smoother or less smooth. By setting it to 0, only the data cost, i.e. the reconstruction errors, is considered and the resulting label image is composed of many small and disconnected regions as label supports. Meanwhile, a value that is too large fuses the label supports excessively, and the estimated label image then contains fewer label support regions than desired. The parameter can possibly be chosen as a constant or depending on the class labels and .
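To make the roles of the two terms concrete, the sketch below evaluates a Potts-type energy for a given label image. It assumes a 4-connected neighborhood in which each neighboring pair is counted once, a constant smoothing parameter named `beta` here (a hypothetical name), and a data cost array indexed as `data_cost[row, col, label]`. It evaluates the energy only and does not perform the graph cut minimization.

```python
import numpy as np

def potts_energy(labels, data_cost, beta):
    """labels: (H, W) integer label image.
    data_cost: (H, W, n_classes) cost of assigning each label to each pixel.
    beta: constant penalty for neighboring pixels with different labels."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    data_term = data_cost[rows, cols, labels].sum()
    # Count label changes between horizontal and vertical neighbors.
    horizontal = np.count_nonzero(labels[:, 1:] != labels[:, :-1])
    vertical = np.count_nonzero(labels[1:, :] != labels[:-1, :])
    return float(data_term + beta * (horizontal + vertical))
```

With `beta = 0` the energy reduces to the data term, which is consistent with the fragmented label images described above for that setting.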

An α-expansion method [boykov_fast_2001] is applied to minimize the energy function. At each iteration, this method processes the labels one after the other, searching for the optimal expansion of each label in order to decrease the energy function . An expansion may modify numerous pixels simultaneously by assigning the current label, called label α, to these pixels. We have used the Matlab wrapper [bagon_matlab_2006] for the experiments.

Fig. 3: Texture image dataset: test images. Panels (a) to (l) show test images 1 to 12.

IV-B Erosion

After a first smoothing step realized with a label expansion algorithm, some small undesirable label support regions can remain in the label image. In order to remove them, we add an erosion step [skretting_energy_2014] applied directly on the label image obtained after the first smoothing step and based on the same data cost .

The α-erosion algorithm [skretting_energy_2014] (available online [skretting_page]) works with a close variant of the energy function (19) and seeks to erode the small segments first in order to decrease the energy function (a segment corresponds to a group of connected pixels of the same label). Segments that are too small are always eroded, whereas segments that are too big are never eroded. The segments between these limits are treated one after the other, beginning with the smaller ones. For each segment, its pixels are relabeled one by one, the segment being eroded by the surrounding segments. If the new labeling of the segment decreases the energy function, then the erosion of the segment is accepted; otherwise it is canceled.

A slight erosion step is also performed on the label support edges.
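The size-based triage of segments can be sketched as follows. This is only the bookkeeping part, not the α-erosion algorithm of [skretting_energy_2014]. The function names, the 4-connectivity, and the threshold arguments `min_size` and `max_size` are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def find_segments(labels):
    """Return a list of (label, pixels) for every 4-connected group
    of pixels sharing the same label."""
    h, w = labels.shape
    seen = np.zeros((h, w), dtype=bool)
    segments = []
    for i in range(h):
        for j in range(w):
            if seen[i, j]:
                continue
            lab = labels[i, j]
            seen[i, j] = True
            queue, pixels = deque([(i, j)]), []
            while queue:                      # breadth-first flood fill
                a, b = queue.popleft()
                pixels.append((a, b))
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    na, nb = a + da, b + db
                    if (0 <= na < h and 0 <= nb < w
                            and not seen[na, nb] and labels[na, nb] == lab):
                        seen[na, nb] = True
                        queue.append((na, nb))
            segments.append((int(lab), pixels))
    return segments

def triage_segments(labels, min_size, max_size):
    """Split segments into those always eroded (fewer than min_size pixels)
    and candidates (between the limits), sorted from smallest to largest.
    Segments with more than max_size pixels are never eroded and are dropped."""
    always, candidates = [], []
    for seg in find_segments(labels):
        n = len(seg[1])
        if n < min_size:
            always.append(seg)
        elif n <= max_size:
            candidates.append(seg)
    candidates.sort(key=lambda s: len(s[1]))
    return always, candidates
```

Each candidate would then be tentatively relabeled and the change kept only if the energy decreases, as described in the text.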

V Experimental results

The proposed method is tested on a set of texture images, commonly used for supervised classification and texture segmentation. We compare our method to several state-of-the-art dictionary learning and texture classification algorithms.

V-A Pre-processing of the data

In order to improve the classification, some pre-processing operations, already used in [mairal_discriminative_2008], are performed on the training and test patches. A Gaussian mask of standard deviation is first applied to the patch by element-wise multiplication, in order to give more weight to the center of the patch, as the peripheral pixels of a patch may belong to a different class if the patch lies on an edge between two classes. The weight is thus at the central pixel(s) of the patch and decreases with the distance to the center of the patch. Each patch is then sharpened with a Laplacian filter (of size ), each Laplacian filtered patch being subtracted from the patch to get the sharpened patch. Note that when computing the affinity matrix as described in Section III-B3, we use the non-processed (original) versions of the image patches, as we have observed that pre-processing may impair the estimation of the affinities between classes.
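The two pre-processing operations can be sketched in NumPy as follows. The unnormalized Gaussian (weight 1 exactly at the center for odd patch sizes), the 3x3 four-neighbor Laplacian kernel, the edge-replicating border handling, and the square patch shape are all assumptions of this sketch. The mask width and filter size used in the experiments are not reproduced here.

```python
import numpy as np

# ASSUMPTION: a common 3x3 discrete Laplacian kernel.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def gaussian_mask(size, sigma):
    """Unnormalized Gaussian weights centered on a size x size patch."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))

def filter3x3(patch, kernel):
    """Apply a 3x3 kernel with replicated borders (the kernel is symmetric,
    so correlation and convolution coincide)."""
    h, w = patch.shape
    padded = np.pad(patch, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def preprocess(patch, sigma):
    """Weight a square patch with the Gaussian mask, then sharpen it by
    subtracting its Laplacian-filtered version."""
    weighted = patch * gaussian_mask(patch.shape[0], sigma)
    return weighted - filter3x3(weighted, LAPLACIAN)
```

The operations are applied in the order given in the text: masking first, sharpening second.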

Besides, in order to be able to classify the pixels on the borders of the test picture, we generate some additional pixels along the borders by taking the mirror image of the pixels close to the borders. This allows the classification of the border pixels by using the square patches artificially generated around them.
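The border handling can be illustrated with `numpy.pad` and `numpy.lib.stride_tricks.sliding_window_view` (available from NumPy 1.20). The choice of the `"reflect"` mode, which mirrors the image without repeating the border pixel, is an assumption, as the text does not specify the exact mirroring. The function name is hypothetical.

```python
import numpy as np

def patches_around_pixels(image, patch_size):
    """Return an array of shape (H, W, patch_size, patch_size) holding one
    square patch per pixel of a 2-D image, using mirrored borders."""
    before = patch_size // 2
    after = patch_size - 1 - before
    padded = np.pad(image, ((before, after), (before, after)), mode="reflect")
    return np.lib.stride_tricks.sliding_window_view(
        padded, (patch_size, patch_size))
```

For an even `patch_size` the patch cannot be exactly centered on the pixel. In this sketch the pixel then sits at index `patch_size // 2` in both directions.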

V-B Texture classification

V-B1 Texture image dataset

The dataset of texture images used in our experiments was first introduced in [randen_filtering_1999], and has since been used in several articles dealing with classification and segmentation. It was created with pictures from the Brodatz album [brodatz_textures:_1966], from the Vision Texture database of MIT, and from the texture image database MeasTex. The pictures have thus been captured with different equipment under different conditions. Each one of the 12 test images in Fig. 3 corresponds to a different supervised classification problem with different texture classes. The number of classes in each problem varies between and . The training images corresponding to each one of these 12 supervised classification problems are also available. The training and test images have been taken from different portions of each texture. The dataset is available online [randen_trygve_????].

V-B2 Parameters

For each one of the 12 texture classification problems, an Adaptive Structure is learnt per texture class from the corresponding training picture. Overlapping blocks are extracted from these pictures to learn each dictionary structure on training vectors. The structures are composed of complete dictionaries of atoms. We limit the size of each dictionary in the structures so that they capture the characteristics of their own class and do not become too efficient for the representation of other classes. The first level of the structures, made discriminant, is learnt in iterations, whereas the next levels are learnt in iterations. The dictionaries in the structures are initialized with randomly chosen training vectors. The parameter balancing the reconstruction and discrimination at the first level is empirically set to .

The test patches, also of size pixels, are decomposed over each class-representative dictionary structure of the corresponding texture classification problem, for a sparsity of atoms. Since in a classification problem we do not look for the best approximation of a patch but rather would like to classify it based on its reconstruction errors over the different dictionary structures, it is better to avoid high sparsity values. In practice, we have observed that the sparsity of atoms gives good results in general.

For the first smoothing step with the graph cut, the smoothing parameter is experimentally set to the constant value for all different label pairs , with , and to for , which has been observed to yield good results. Finally, for the second smoothing step of erosion, the parameter is set to as in [skretting_energy_2014]. (This parameter is used in the energy function in [skretting_energy_2014] and is different from the parameter we use to balance reconstruction and discrimination in our objective function to learn each dictionary at the first level of the structures.) Areas of less than pixels are always eroded whereas areas of more than pixels never are. Between these limits, the erosion depends on the minimization of the cost function. A slight erosion of pixels is also performed on the edges.

V-B3 Results

We first present in Fig. 4 some example atoms from the multilevel dictionaries learnt with the proposed algorithm for the classification problem of experiment 6 (Fig. 3(f)). Sample regions from the training images of the texture classes 1 and 12 of this experiment are shown in Figures 4(a) and 4(b). Figures 4(c) and 4(d) show some of the multilevel dictionaries learnt for these two texture classes. For both texture classes, the first-level dictionary is displayed, together with the second-level dictionaries originating from two different atoms of the first-level dictionary.

It can be observed that the atoms in the first-level dictionaries capture well the main characteristics of each class. The first-level dictionary of class 1 consists of atoms containing rather smooth and curvy features, whereas the first-level dictionary of class 12 contains atoms capturing straight edges and corners. We can observe that, due to the large intra-class variation in these texture classes rich in content, the atoms in the first-level dictionary of the same class can be quite different from each other. The proposed multi-level dictionary structure is then seen to be well-adapted to this setting as it allows the specialization of the atoms at later levels based on the structure of the atoms at earlier levels they originate from. Indeed, it can be seen in Fig. 4 that the second-level dictionaries derived from two different first-level atoms of the same class capture finer details but tend to have different characteristics. In class 1, the second-level dictionary derived from atom 2 of the first level inherits the round-shaped circular structure of atom 2, while the second-level dictionary derived from atom 45 is tuned to represent more straight and diagonally-oriented texture features. Similarly, in class 12, the second-level dictionary originating from atom 60 contains mainly horizontally oriented atoms as the dominant orientation of the atom 60 is horizontal, while the second-level dictionary originating from atom 22 captures both vertically and horizontally oriented corner-like features just like the atom 22. This confirms that the dictionaries learnt at different levels are successfully specialized to adapt to different fine and coarse texture features present in image classes of large intra-class variation.

Fig. 4: Some example multilevel dictionaries learnt for classes 1 and 12 of experiment 6. Panels: (a) class 1; (b) class 12; (c) multilevel dictionary for class 1; (d) multilevel dictionary for class 12.

Next, we demonstrate the effect of the different stages of the proposed method. In Fig. 5 the label images obtained after each step of our classification method are shown for the test image 4. The benefits of the smoothing steps are clear: a straightforward estimation of the class labels based on the minimum reconstruction error leads to a noisy segmentation (Fig. 5(b)). The label expansion algorithm via graph cut is crucial to create larger label support regions and suppress the majority of the small isolated label supports (Fig. 5(c)). Note that the graph cut algorithm does not take the label image in Fig. 5(b) as an input parameter but the data cost matrix consisting of the reconstruction errors computed for each pixel and each class. The final erosion step erodes the last remaining small label supports and slightly erodes the edges in order to obtain a clean label image (Fig. 5(d)), close to the ground truth (Fig. 5(e)). The erosion algorithm takes the label image obtained after the graph cut algorithm (Fig. 5(c)) as an initial segmentation, and uses the same data cost matrix.

Fig. 5: Test image 4 at the successive steps of the classification algorithm. Panels: (a) test image 4; (b) min error; (c) after graph cut; (d) after erosion; (e) ground truth.

Finally, we compare in Table I the classification error rates of our method with several methods from the literature. In [randen_filtering_1999], which introduced the dataset, numerous filtering methods are compared and the best result obtained for each test picture is presented. The authors of [maenpaa_texture_2000] have improved the previous results using the Local Binary Pattern operator on texture patches and by computing histograms of the values to characterize a texture (we report the corrected results from an erratum posted by the authors at http://www.cse.oulu.fi/wsgi/CMV/SupervisedTextureSegmentation). A multi-scale version, considering several patch sizes, has also been studied. The authors of [di_lillo_texture_2007] have then proposed to extract texture discriminative features in the frequency domain by applying a Fourier transform in polar coordinates, followed by dimensionality reduction via PCA (Principal Component Analysis) or the computation of Fisher coefficients. Centroids are then computed for each class with a vector quantization method. The results presented in [mairal_discriminative_2008] are also included in our comparison, using reconstructive (R) or discriminative (D) dictionaries, and a graph cut based smoothing method. Finally, the results obtained with the α-erosion method in [skretting_energy_2014] are added. A dictionary is learnt per class with the RLS-DLA algorithm [skretting_recursive_2010] and class labels are estimated based on the approximation errors computed for each pixel and each class in an energy minimization step. In this step, a Gaussian filter is applied before applying the α-erosion algorithm, followed by further erosion of the edges of the label support regions in order to smooth their borders.

In the smoothing steps of our method, the label expansion algorithm uses a random ordering of the labels to be expanded in each iteration. The classification results can thus change from one trial to another, despite the use of the same dictionaries and parameters. We thus perform the smoothing steps times and report the average error over these random trials. The difference between different trials remains small in general for the same image.

It is seen in Table I that, over the 12 different texture classification experiments, our method gives the best results in three experiments and is among the best two methods in nine experiments. Except for two problematic images (test images 5 and 6) the classification error of our method does not exceed that of the state of the art by more than . Our average classification error over the 12 experiments is 2.86%, which is the smallest among the compared methods.

| Im. | [randen_filtering_1999] | [maenpaa_texture_2000] | [di_lillo_texture_2007] | [mairal_discriminative_2008] (R) | [mairal_discriminative_2008] (D) | [skretting_energy_2014] | Our meth. |
|---|---|---|---|---|---|---|---|
| 1 | 7.2 | 7.5 | 3.37 | 1.69 | **1.61** | 2.00 | **1.25** |
| 2 | 18.9 | 15.5 | 16.05 | 36.5 | 16.42 | **3.24** | **3.42** |
| 3 | 20.6 | 10.9 | 13.03 | 5.49 | 4.15 | **4.01** | **3.05** |
| 4 | 16.8 | 8.4 | 6.62 | 4.60 | 3.67 | **2.55** | **2.59** |
| 5 | 17.2 | 7.9 | 8.15 | **4.32** | 4.58 | **1.26** | 6.60 |
| 6 | 34.7 | 16.1 | 18.66 | 15.50 | 9.04 | **6.72** | **8.20** |
| 7 | 41.7 | 20.3 | 21.67 | 21.89 | 8.80 | **4.14** | **2.36** |
| 8 | 32.3 | 16.2 | 21.96 | 11.80 | **2.24** | 4.80 | **3.13** |
| 9 | 27.8 | 20.2 | 9.61 | 21.88 | **2.04** | 3.90 | **2.06** |
| 10 | 0.7 | 0.3 | 0.36 | **0.17** | **0.17** | 0.42 | 0.23 |
| 11 | **0.2** | 0.9 | 1.33 | 0.73 | 0.60 | 0.61 | **0.43** |
| 12 | 2.5 | 5.0 | 1.14 | **0.37** | 0.78 | **0.70** | 0.94 |
| Av. | 18.4 | 10.8 | 10.16 | 10.41 | 4.50 | 2.87 | 2.86 |

TABLE I: Classification error rates (in %) of our method for the test images in comparison with several methods from the state of the art. The best two results for each image are in bold.

Some example classification results are presented for several test images in Figures 6, 7, and 8. We observe that the only zones of misclassification are concentrated within a thin band over the edges between label supports, and the classification performance of our method is quite satisfactory in these experiments.

Fig. 6: Test image 7 and its label image compared to the ground truth. Panels: (a) test image 7; (b) label image; (c) ground truth.
Fig. 7: Test image 9 and its label image compared to the ground truth. Panels: (a) test image 9; (b) label image; (c) ground truth.
Fig. 8: Test image 11 and its label image compared to the ground truth. Panels: (a) test image 11; (b) label image; (c) ground truth.

Meanwhile, the texture classification experiments corresponding to the test images 5 and 6 are particularly challenging, and these are the only settings where our method gives a classification error rate higher than (Table I). Our results on these two test images are presented in Figures 9 and 10. For the test image in Fig. 9, the main factor increasing the classification error is the misclassification of the region in the bottom-left corner of the picture, where the label of the bottom texture spreads too far over the leftmost texture. Indeed, the border between these two textures in the test image is difficult to see even for the human eye, and the two texture classes have very similar characteristics. Hence, the erroneous label can easily diffuse over the leftmost texture, and it is not surprising to observe a relatively high misclassification rate in this experiment.

For the test image 6 given in Fig. 10, the problem is different. When we observe the final label image (Fig. 10(d)), the major regions of misclassification are over the textures on the left at the bottom, where the label spreads too much, and on the right at the bottom, where the whole texture has a wrong label. In the label image obtained after the graph cut based smoothing step (Fig. 10(c)), we notice that the misclassification regions for these two textures lie respectively on the left part of the first one and at the top of the second one. When we look at these specific areas in the original image (Fig. 10(a)), we can see that they seem over-exposed and thus brighter in comparison to the rest of the textures. Meanwhile, this over-exposure is not present in the training images, which can disturb the classification algorithm as this kind of variation has not been learnt, and lead to confusion with the other classes.

We also notice that textures with regular and small patterns are easily classified, even without the smoothing steps (Fig. 10(b)), as the learning is easier.

Fig. 9: Test image 5 and the successive steps of the classification algorithm. Panels: (a) test image 5; (b) min error; (c) after graph cut; (d) after erosion; (e) ground truth.
Fig. 10: Test image 6 and the successive steps of the classification algorithm. Panels: (a) test image 6; (b) min error; (c) after graph cut; (d) after erosion; (e) ground truth.

V-C Enriching the training dataset

In order to deal with the over-exposure problems, particularly present in the test image 6, we propose to enrich the training dataset by adding some over-exposed versions of the training images. An over-exposed image, called , is created from the training image with the following equations


where is the exposure offset added to the image , is the maximum value in after the exposure offset has been added to the image , and is the minimum value in .

In this way, the training set corresponding to the test image 6 is augmented by generating different over-exposed versions of the training images with over-exposure levels , and . We balance the number of original and over-exposed training samples by generating a total of over-exposed samples ( samples for each value) added to original training samples. Dictionaries are then learnt from this new training dataset for the classification problem 6.

Some classification results obtained for the test image 6 are presented in Fig. 11. It can be observed that augmenting the training data set with over-exposed samples has the potential to improve the classification performance. However, we have also observed that in the smoothing step, the expansion of the labels with a random ordering in the graph cut method may produce a more erroneous label image in some random realizations of the experiment. We have obtained an average classification error of over 20 different random repetitions of the same experiment, whereas the error rate was 8.20% before enriching the learning dataset for the test image 6. If we take into account this new error rate for the test image 6, the mean classification error rate computed over the 12 experiments is reduced to from its previous value in Table I.

Fig. 11: Label images obtained for the test image 6 after enriching the training dataset with over-exposed patches. Panels: (a) after graph cut; (b) after erosion.

Enriching the training dataset has thus improved the results. The new over-exposed training data have been helpful for learning dictionary structures that contain more information and better account for this possible intra-class exposure variation. This solution could also be applied to other test images that suffer from the same problem.

VI Conclusion

In this paper, we have proposed a method for learning discriminative multilevel structured dictionaries for supervised image classification. We have presented a classification algorithm that learns one dictionary per class, where test images are classified with respect to their reconstruction errors on these dictionaries. For the construction of the dictionaries, we have adopted the Adaptive Structure derived from a tree structure, which we made discriminant with a novel objective function to learn multilevel dictionaries that are both reconstructive and discriminative. The proposed dictionaries thus have a high learning capacity due to their multilevel topology and are well-adapted to the classification of images with high intra-class variation. An affinity matrix has been incorporated in the objective function to adjust the discrimination of a class from the others depending on their pairwise affinities. A combination of two smoothing methods has been used to obtain a clean segmentation and classification of the textures in the test image. Experiments conducted on a common dataset of texture images have shown competitive results with the state of the art. We have finally proposed to enrich the training dataset to deal with over-exposure problems.

Enriching the dataset seems promising and future efforts may focus on more complex and realistic over-exposure models. Applying discrimination to all the dictionaries in the multilevel structure may also potentially be of interest, but might increase the complexity of the learning. Finally, a last future direction is to explore other affinity measures in the construction of the affinity matrix, in order to better characterize the pairwise similarities of classes and thus enhance the discrimination capability of the learnt dictionaries.


This work was supported by Airbus Defence & Space.