Reconciliation of Statistical and Spatial Sparsity For Robust Image and Image-Set Classification

by   Hao Cheng, et al.
Nanyang Technological University

Recent image classification algorithms, by learning deep features from large-scale datasets, have achieved significantly better results compared to classic feature-based approaches. However, various challenges remain in practice, such as classifying noisy image or image-set queries and training deep image classification models on limited-scale datasets. Instead of applying generic deep features, model-based approaches can be more effective and data-efficient for robust image and image-set classification tasks, as various image priors are exploited to model the inter- and intra-set data variations while preventing over-fitting. In this work, we propose a novel Joint Statistical and Spatial Sparse representation, dubbed J3S, to model image or image-set data for classification by reconciling both their local patch structures and their global Gaussian distribution mapped onto a Riemannian manifold. To the best of our knowledge, no work to date has jointly utilized both global statistics and local patch structures via joint sparse representation. We propose to solve the joint sparse coding problem based on the J3S model by coupling the local and global image representations using joint sparsity. The learned J3S models are used for robust image and image-set classification. Experiments show that the proposed J3S-based image classification scheme outperforms the popular and state-of-the-art competing methods on the FMD, UIUC, ETH-80, and YTC databases.





I Introduction

Image classification is a fundamental problem in image processing and computer vision. Compared to classic algorithms based on pre-defined features, recent image classification schemes apply machine learning techniques to optimize feature representations directly from the data. More recently, deep learning approaches for image classification have achieved state-of-the-art results on many benchmark datasets, such as the popular ImageNet [5]. Despite the promising performance achieved under simple and ideal problem setups, various challenges remain when (i) classification is based on queries that contain a set of object variations (i.e., image-set classification), or (ii) the image data is limited or of relatively low quality (i.e., weakly supervised classification).

To be specific, while conventional classification tasks process a single image per query, image-set classification [16, 20, 8, 41] has recently gained more attention; each query set contains multiple, strongly correlated images (e.g., a query object under multiple views, poses, or illuminations). Such algorithms are widely applied in tasks such as video-based face classification [16], multi-spectral image classification, etc. Compared to single-image classification algorithms, effective image-set methods must additionally exploit the hidden structure among image sets, e.g., the inter- and intra-set data variations. Furthermore, popular deep features tend to be generic and incorporate very little prior knowledge, as they are learned from large-scale, high-quality, and fully annotated training datasets [5]. Such approaches are ideal for fully supervised learning, but less data-efficient and less robust when training sets are small-scale or corrupted (e.g., noisy).

Recent works on sparse signal modeling have demonstrated its effectiveness in image representation for various tasks [48, 7, 6, 21, 4, 34, 14]. Compared to deep features, sparse representation is model-based and thus much more robust to practical challenges such as noise or over-fitting [43]. While many existing works focus on exploiting image patch-based sparsity, global statistical properties are typically ignored or not incorporated jointly in a principled approach. Recent works show that high-order statistics of image features are critical in classification tasks [18, 22, 23, 27], leading to better results than many first-order methods.

In this work, we propose a novel Joint Statistical and Spatial Sparse representation (J3S), i.e., learning coupled dictionaries for both the local patch features and the global Gaussian distribution of the data mapped onto a Riemannian manifold. The dictionary-domain sparse coefficients of the two models are reconciled by solving a sparse coding problem with joint sparsity, for which we propose an efficient yet effective alternating minimization algorithm. To the best of our knowledge, no work to date has jointly utilized both global statistics and local patch structures via sparse representation for image classification. Furthermore, we apply the learned J3S model to robust image-set and single-image classification. Extensive experimental results on material classification, object recognition, and video-based face recognition tasks demonstrate that the proposed J3S-based image classification scheme outperforms the popular and state-of-the-art competing methods.

In short, the contributions of this paper include:

  • Learning global statistical and local patch dictionaries for visual classification task by coupling them with joint sparsity;

  • Utilizing principal component analysis (PCA) to reduce the J3S model complexity while maintaining the effectiveness;

  • Investigating the robustness of the proposed model under various conditions, e.g., noisy conditions and the few-shot setting;

  • Achieving the state-of-the-art results on both noisy image and image-set classification tasks.

The remainder of this article is organized as follows. Section II summarizes the related work on image and image-set classification problems, including manifold learning, deep learning, and sparse representation. Section III introduces the two kinds of dictionary learning methods, based on Gaussian statistical information and patch-based spatial information respectively, together with the proposed J3S model and the classification module. Section IV describes the solution of the proposed J3S model based on alternating minimization and analyzes its time and space complexity, along with a simple strategy for effective model acceleration. Section V demonstrates the performance of the proposed J3S model on image and image-set tasks over several standard databases under different conditions, such as noise and few-shot settings. Section VI concludes with proposals for future work. The preliminary work has appeared in [3].¹

¹Significant changes have been made compared to our previous work in [3]. First, we improve the J3S method by reducing the model complexity with a simple and efficient approach. Second, we add more description and analysis of the dictionary learning and classification. Third, we include new ablation experiments investigating the model convergence and parameter selection. Furthermore, we conduct an extra experiment on an object recognition task based on the ETH-80 database to evaluate the generalizability of the proposed J3S model in different scenarios. Finally, we evaluate additional popular settings, e.g., noisy conditions and the few-shot setting, to validate the performance and robustness of the proposed J3S model.

II Related Work

Image-set classification aims to identify the common class of a multi-image query. The inherent properties of each query set can be modeled effectively by popular methods such as manifold learning, deep learning, sparse coding, etc.

Manifold Learning: The classic Discriminant Canonical Correlations (DCC) method [17] classifies image sets by maximizing the canonical correlations of within-class sets and minimizing the canonical correlations of between-class sets. Later, more subspace methods [2] were proposed to simplify geometric structure learning for image sets. However, these approaches are limited because most image sets lie on a Riemannian manifold rather than in a Euclidean subspace [39, 13]; e.g., the symmetric positive definite (SPD) manifold is widely used to represent image sets. To ease the computation, the Log-Euclidean Riemannian Metric (LERM) framework [1] maps data from the SPD manifold to its tangent Euclidean space. Besides, Log-Euclidean Manifold Learning (LEML) [13] projects the original SPD manifold to a lower-dimensional discriminative SPD manifold while preserving its original geometry. More recently, Riemannian Manifold Metric Learning (RMML) [47] proposed a more general metric learning method that can be applied to multiple manifolds. From a statistical perspective, when modeling image sets or multi-channel features via Gaussian distributions, the covariance matrices of a collection of Gaussians form a Riemannian manifold of SPD matrices [39, 37, 36]. Covariance Discriminative Learning (CDL) [39] derived a Riemannian kernel function to map covariance matrices from the manifold to a Hilbert space, where kernelized linear methods can be used for learning.

Deep Learning: Recently, more works on deep learning have shown its capability for image-set classification [12, 25, 32]. The Deep Reconstruction Model (DRM) [12] learns a template deep reconstruction model using neural networks and then uses the minimal reconstruction residual to classify a query set. Multi-manifold deep learning (MMDML) [25] maps multiple image sets into a shared feature subspace to leverage nonlinear information. More recently, Deep Match Kernels (DMK) [32] were proposed for image-set classification without specific assumptions on image distributions or geometric structures, building local match kernels to leverage generic deep features.

TABLE I: Comparison of the key attributes between the proposed J3S method and other image-set classification algorithms: DCC [17], LEML [13], RMML [47], CDL [39], RSR [10], KGDL [11], DRM [12], MMDML [25], and DMK [32].

Sparse Representation: Sparse-coding-based classification represents a query sample on a dictionary composed of the training samples of all classes, and then classifies it by the reconstruction error of each class [45, 15, 42, 35]. Alternatively, the sparse coefficients can be used as extracted features for classification, e.g., in linear spatial pyramid matching [46]. Most existing works focus on sparse coding and dictionary learning with zero-order information, i.e., in the original feature space, whereas first-order and second-order statistics contain global information and take the correlation of the data into account. The latter can be more robust to variations in image and video applications, e.g., variations of pose, illumination, and occlusion.

Table I summarizes the aforementioned related methods, as well as the proposed J3S method for image-set classification. Furthermore, some recent works have also proposed sparse coding and dictionary learning models on the Riemannian manifold of SPD matrices and the Grassmann manifold: sparse coding on a Riemannian manifold can be converted to a kernel sparse coding problem by deriving valid kernels for the SPD manifold [10, 4] or the Grassmann manifold [11]. However, none of the existing works combine statistical and spatial priors in the sparse representation. Besides, the robustness of image-set classification has rarely been investigated.

III Dictionary Construction and Joint Sparse Representation

In this section, we present the J3S model for classification tasks, including the dictionary construction of the statistical and spatial models and the joint sparse coding. The proposed J3S model handles different types of input data, such as a single image or an image set.

To obtain a unified feature representation for classifying both a single image and an image set, we apply the corresponding data preprocessing. Specifically, for an image set with a feature vector $x_i$ for each image, we stack the vectors to construct the image-set representation directly; for a single image, we employ its deep feature maps, extracted by a pre-trained CNN, as local features. Thus, both an image and an image set can be represented in the same form $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, where $d$ is the feature dimension of each image and $n$ is the number of images in the image set or the number of channels of a single image.
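As an illustration, this preprocessing can be sketched in numpy (a minimal sketch with assumed array shapes; the deep feature extraction by the pre-trained CNN is replaced by raw arrays):

```python
import numpy as np

# Sketch (assumed shapes): build the unified representation X in R^{d x n}
# from either an image set (a list of per-image feature vectors) or a single
# image's CNN feature map of shape (h, w, d), treated as h*w local features.
def unified_representation(sample):
    if isinstance(sample, list):                  # image set: n feature vectors
        return np.stack(sample, axis=1)           # (d, n)
    h, w, d = sample.shape                        # single image: (h, w, d) map
    return sample.reshape(h * w, d).T             # (d, h*w) local features

# Usage: a 3-image set with 4-dim features, and a 2x2x4 feature map.
X_set = unified_representation([np.ones(4), np.zeros(4), np.ones(4)])
X_img = unified_representation(np.random.rand(2, 2, 4))
```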

III-A Statistical Dictionary Construction

Based on the Gaussian statistical model, we need to compute the mean vector $\mu$ and covariance matrix $\Sigma$ in a Reproducing Kernel Hilbert Space (RKHS) for the corresponding Gaussian descriptor $\mathcal{N}(\mu, \Sigma)$. Mapping $X$ into an RKHS by the mapping function $\phi(\cdot)$ with Hellinger's kernel, the mean vector and covariance matrix can be computed as

$$\mu = \frac{1}{n}\,\phi(X)\,\mathbf{1}_n, \qquad \Sigma = \frac{1}{n}\,\phi(X)\,J\,\phi(X)^{\top}. \qquad (1)$$

Here $\mathbf{1}_n$ is the all-ones vector, and $J = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^{\top}$ is the centering matrix. However, when the dimension of the original features (i.e., $d$) is very high and the number of samples (i.e., $n$) is small, such a Gaussian descriptor cannot work well. To solve this problem, following [37], we estimate the robust covariance matrix $\hat{\Sigma}$ by solving a regularized maximum likelihood estimation problem as

$$\hat{\Sigma} = \arg\max_{\Sigma \succ 0}\; \log p(X \mid \Sigma) - \gamma\, D_{vN}(I, \Sigma), \qquad (2)$$

where $D_{vN}(A, B) = \mathrm{tr}(A\log A - A\log B - A + B)$ is the von Neumann matrix divergence [19] of two matrices and $\gamma$ is a regularizing parameter. The optimal solution of problem (2) can be computed as

$$\hat{\Sigma} = U\,\hat{\Lambda}\,U^{\top}, \qquad (3)$$

where $\Lambda$ is the diagonal matrix of the singular values in decreasing order, $U$ is the orthogonal matrix consisting of the eigenvectors corresponding to the singular values, $U$ and $\Lambda$ are computed by the singular value decomposition (SVD) of the covariance matrix $\Sigma$ as $\Sigma = U\Lambda U^{\top}$, and each diagonal entry of $\hat{\Lambda}$ is obtained from the corresponding singular value and the regularizing parameter $\gamma$ (see [37] for the closed form).

By using the mean vector $\mu$ and the robust covariance matrix $\hat{\Sigma}$, we can define the embedding symmetric positive definite matrix as

$$S = \begin{bmatrix} \hat{\Sigma} + \rho^{2}\mu\mu^{\top} & \rho\,\mu \\ \rho\,\mu^{\top} & 1 \end{bmatrix}, \qquad (4)$$

where $\rho$ is a parameter to balance the orders of magnitude between the mean and the covariance.
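As an illustration, the Gaussian descriptor and its SPD embedding can be sketched as follows (a minimal numpy sketch: the RKHS mapping with Hellinger's kernel and the robust vN-MLE estimate are omitted, a small ridge term stands in for the robust covariance, and `rho` plays the role of the balancing parameter):

```python
import numpy as np

# Sketch of the (d+1)x(d+1) SPD Gaussian embedding, under the assumptions
# stated above; columns of X are the per-image feature vectors.
def gaussian_embedding(X, rho=1.0, eps=1e-6):
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)               # mean vector (d, 1)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    Sigma = X @ J @ X.T / n + eps * np.eye(d)        # covariance (+ small ridge)
    return np.block([[Sigma + rho**2 * (mu @ mu.T), rho * mu],
                     [rho * mu.T, np.ones((1, 1))]]) # SPD embedding matrix

S = gaussian_embedding(np.random.rand(5, 20))
```

The block structure guarantees that the result is symmetric positive definite whenever the (ridged) covariance is.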

Fig. 1: The framework of the Joint Statistical and Spatial Sparse representation for image and image-set classification.

III-B Spatial Dictionary Construction

Similarly, given a sample $X$, we can exploit spatial information by learning a patch-based unitary dictionary from the feature maps of a single image or the gray-scale features of an image set. For a single image, we choose the original image or its deep features as the input for learning the patch-based unitary dictionary; for an image set, we combine the features of each image to exploit the within-class structure. For image and image-set classification, the objective is to learn a unitary dictionary $D$ based on 2D image patches extracted from the sample $X$, by solving the following problem with the synthesis model:

$$\min_{D,\,\{a_i\}}\; \sum_{i=1}^{N} \|R_i x - D a_i\|_2^2 + \lambda^2 \sum_{i=1}^{N} \|a_i\|_0, \quad \text{s.t.}\; D^{\top}D = I, \qquad (5)$$

where $R_i$ is the operator extracting the $i$-th patch from $X$, $x$ is the vectorized form of $X$, $N$ is the number of total patches, and $I$ is the identity matrix.

Sparse coding problems under the synthesis model are NP-hard in general, and even approximate algorithms are typically expensive [29]. However, since problem (5) learns a unitary dictionary, it is equivalent to the unitary transform learning problem [44], i.e., a signal $u$ is approximately sparsifiable by a learned unitary transform $W$ as $Wu = a + e$, where $a$ is sparse and $e$ is a small residual in the transform domain. The corresponding transform learning problem is formulated as

$$\min_{W,\,\{a_i\}}\; \sum_{i=1}^{N} \|W R_i x - a_i\|_2^2 + \lambda^2 \sum_{i=1}^{N} \|a_i\|_0, \quad \text{s.t.}\; W^{\top}W = I. \qquad (6)$$

Based on [44], the two sparsity models can be unified under the unitary dictionary assumption, i.e., $W = D^{\top}$ and $WD = I$.

Proposition 1. Under the unitary dictionary assumption, problems (5) and (6) are equivalent.

Proof 1. Based on the unitary dictionary assumption, we have $W = D^{\top}$ and $\|R_i x - D a_i\|_2^2 = \|D^{\top}R_i x - a_i\|_2^2 = \|W R_i x - a_i\|_2^2$. Thus, the objective function in problem (5) is identical to that in problem (6). Therefore, problems (5) and (6) are equivalent, and $\hat{D} = \hat{W}^{\top}$.

Thus, we can obtain the optimal dictionary $\hat{D}$ in (5) by solving its equivalent problem (6), which has an exact and closed-form solution [30], i.e., $\hat{W} = V U^{\top}$, where $U$ and $V$ are computed by the SVD of $Y A^{\top}$ as $Y A^{\top} = U \Sigma_0 V^{\top}$, with $Y$ denoting the matrix of vectorized patches and $A$ the matrix of sparse codes.
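This closed-form update is an orthogonal Procrustes problem; a minimal numpy sketch (names assumed, with hard thresholding standing in for the $\ell_0$ sparse-code update):

```python
import numpy as np

# Sketch: closed-form unitary transform update, minimizing ||W Y - A||_F
# subject to W^T W = I, where Y stacks vectorized patches column-wise and
# A holds the current sparse codes.
def update_unitary_transform(Y, A):
    U, _, Vt = np.linalg.svd(Y @ A.T)    # SVD: Y A^T = U S V^T
    return Vt.T @ U.T                    # optimal W = V U^T (Procrustes)

def hard_threshold(Z, lam):
    return Z * (np.abs(Z) >= lam)        # sparse-code update for the l0 penalty

Y = np.random.rand(16, 200)              # e.g. 200 vectorized 4x4 patches
A = hard_threshold(Y, 0.5)               # codes under the initial transform W = I
W = update_unitary_transform(Y, A)
```

Because the identity is feasible, the updated transform never increases the fitting residual relative to `W = I`.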

Fig. 1 illustrates the framework of J3S, in which a sample (either an image set or a single image) is modeled by a statistical model to obtain the embedding SPD matrix $S$, and simultaneously by a spatial patch-based model to generate a unitary transform dictionary $W$; both are then used for joint sparse coding.

Fig. 2: The framework of the classification module of the Joint Statistical and Spatial Sparse representation model.

III-C J3S Sparse Coding

To reconcile the two types of dictionaries generated from statistical Gaussian modeling and spatial patch-wise unitary dictionary learning, we propose the joint statistical and spatial sparse representation (J3S) model: for any query sample, we impose joint sparsity on the statistical and spatial dictionary-domain coefficients to maintain the consistency and dependency between the two sparse representations. For simplicity, we denote the two dictionaries by $\Phi = [\Phi_1, \dots, \Phi_C]$ and $\Psi = [\Psi_1, \dots, \Psi_C]$, where $\Phi_c$ and $\Psi_c$ are the mapped Gaussian and patch-based unitary representations of the training samples belonging to the $c$-th class, respectively.

For the statistical Gaussian-based dictionary, each embedding matrix $S$ is an SPD matrix, which can be viewed as a point on the corresponding SPD manifold based on Eq. (4). Direct vectorization of the SPD matrix to generate a dictionary would destroy its intrinsic structure, which may cause information loss. To avoid such loss while measuring the similarity between two matrices on the SPD manifold, we use the general LERM framework [1] to map each matrix to its tangent space through the matrix logarithm $\log(\cdot)$. With this embedding, we directly measure similarity on the tangent space using the Euclidean distance, and obtain the vectorized form of each matrix as the statistical Gaussian-based feature. To simplify the calculation for the SPD matrix $S$, we only extract the upper-triangular elements to construct the dictionary, so the mapping function can be written as

$$\varphi(S) = \mathrm{vec}_{\triangle}\!\big(\log(S)\big), \qquad (7)$$

where $\mathrm{vec}_{\triangle}(\cdot)$ stacks the upper-triangular entries of a symmetric matrix into a vector.
For the patch-based unitary model, $\psi(\cdot)$ denotes the corresponding vectorized mapping function of the unitary dictionary.
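As an illustration, the statistical mapping (matrix logarithm followed by upper-triangular vectorization) can be sketched as follows (names assumed; some implementations additionally scale the off-diagonal entries by √2 to preserve the Frobenius norm, which is omitted here):

```python
import numpy as np
from scipy.linalg import logm

# Sketch of the mapping in Eq. (7): matrix logarithm onto the tangent space
# (the LERM embedding), then vectorization of the upper-triangular part,
# including the diagonal.
def log_euclidean_vec(S):
    L = logm(S).real                       # tangent-space symmetric matrix log(S)
    return L[np.triu_indices(S.shape[0])]  # upper-triangular entries as a vector

A = np.diag([1.0, np.e])                   # SPD example: log(A) = diag(0, 1)
v = log_euclidean_vec(A)                   # -> [0.0, 0.0, 1.0]
```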

Given the joint statistical and spatial dictionaries, we compute the coefficient vectors $\alpha_j$ and $\beta_j$ of the $j$-th query sample, with statistical and spatial feature representations $s_j$ and $p_j$, by solving the following problem:

$$\min_{\alpha_j,\,\beta_j}\; \lambda\,\|\varphi(s_j) - \Phi\,\alpha_j\|_2^2 + \|\psi(p_j) - \Psi\,\beta_j\|_2^2 + \lambda_1\,\big\|[\alpha_j, \beta_j]\big\|_{2,1}, \qquad (8)$$

where $\lambda$ is the weighting parameter defined to balance the scales of the statistical model and the patch-based model, and $\varphi(\cdot)$ and $\psi(\cdot)$ are used to map the two kinds of representations, respectively. $\|\cdot\|_{2,1}$ is the $\ell_{2,1}$ norm enforcing row sparsity.

By solving the optimization problem in (8), we obtain the representation coefficient vectors $\alpha_j$ and $\beta_j$ corresponding to the Gaussian and patch-based dictionary models, respectively. With these two coefficient vectors, we can compute the reconstruction loss of the $j$-th query sample using only the sub-dictionaries and corresponding coefficient sub-vectors of the $c$-th class as

$$E_c(j) = \lambda\,\|\varphi(s_j) - \Phi_c\,\alpha_j^c\|_2^2 + \|\psi(p_j) - \Psi_c\,\beta_j^c\|_2^2 + \lambda_1\,\big\|[\alpha_j^c, \beta_j^c]\big\|_{2,1}. \qquad (9)$$

Here $\alpha_j^c$ and $\beta_j^c$ are the coefficient sub-vectors of the $c$-th class for the $j$-th query sample, containing the sparse codes corresponding to the training samples from the $c$-th class; $\Phi_c\,\alpha_j^c$ and $\Psi_c\,\beta_j^c$ are the reconstructed statistical and spatial representations of the $j$-th query sample, respectively; and $\Phi_c$ and $\Psi_c$ are the two sub-dictionaries of training samples from the $c$-th class. Problems (8) and (9) share similar regularization terms to constrain the overall sparsity for sparse coding. For classification, however, we keep only the reconstruction loss terms, ignoring the influence of the regularization term, so the reconstruction loss can be rewritten as

$$E_c(j) = \lambda\,\|\varphi(s_j) - \Phi_c\,\alpha_j^c\|_2^2 + \|\psi(p_j) - \Psi_c\,\beta_j^c\|_2^2. \qquad (10)$$
For a visual classification task, the most commonly used algorithm is Nearest Neighbor (NN), which finds the labeled sample closest to the query sample according to a pre-defined metric and assigns the query to the category of that sample. Inspired by NN, we assume that features from the same class should be easier to reconstruct, since their feature representations contain similar embeddings, while features from different classes are harder to reconstruct and produce larger reconstruction errors. For the $j$-th query sample, we use the reconstruction loss defined in (10) as the classification measurement, measuring similarity in terms of the overall representation of each category:

$$\hat{c}_j = \arg\min_{c}\; E_c(j). \qquad (11)$$

Here $E_c(j)$ is the reconstruction error of the $c$-th class computed by (10), and $\hat{c}_j$ is the predicted label of the $j$-th query sample.

Fig. 2 shows how the classification module works. Specifically, to classify the query sample, for each class we use only the labeled samples of the corresponding category in the joint sparse representation to reconstruct the query sample. The query can then be classified according to the weighted reconstruction error of each class.
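As an illustration, the residual-based classification rule can be sketched as follows (a minimal sketch; the names `D_stat`, `D_pat`, and `lam` are illustrative and not the paper's notation):

```python
import numpy as np

# Sketch: for each class c, reconstruct the query's two representations with
# that class's sub-dictionaries and coefficient sub-vectors, then pick the
# class with the smallest weighted reconstruction error.
def classify_by_residual(y_stat, y_pat, D_stat, D_pat, a, b, class_ids, lam=1.0):
    classes = np.unique(class_ids)
    errors = []
    for c in classes:
        idx = class_ids == c
        e_stat = np.linalg.norm(y_stat - D_stat[:, idx] @ a[idx]) ** 2
        e_pat = np.linalg.norm(y_pat - D_pat[:, idx] @ b[idx]) ** 2
        errors.append(lam * e_stat + e_pat)
    return classes[int(np.argmin(errors))]

# Toy check: a query equal to a class-1 atom is assigned to class 1.
D = np.eye(4)
ids = np.array([0, 0, 1, 1])
codes = np.array([0.0, 0.0, 1.0, 0.0])
pred = classify_by_residual(D[:, 2], D[:, 2], D, D, codes, codes, ids)
```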

IV Algorithm

We now describe the algorithm for the proposed joint sparse representation model for image and image-set classification tasks. Each sub-problem of (8) is convex, so we use alternating minimization to solve the optimization problem.

Update $\alpha_j$: Setting the partial derivative of the objective function with respect to $\alpha_j$ to 0 gives

$$\lambda\,\Phi^{\top}\big(\Phi\,\alpha_j - \varphi(s_j)\big) + \lambda_1\, M\,\alpha_j = 0, \qquad (13)$$

where $M$ is a diagonal matrix with the $i$-th diagonal element $\frac{1}{2\,\|[\alpha_{j,i},\,\beta_{j,i}]\|_2}$.

Thus we can get the iteration of $\alpha_j$ as:

$$\alpha_j \leftarrow \big(\lambda\,\Phi^{\top}\Phi + \lambda_1 M\big)^{-1}\,\lambda\,\Phi^{\top}\varphi(s_j). \qquad (14)$$

Update $\beta_j$: $\beta_j$ can be updated analogously as:

$$\beta_j \leftarrow \big(\Psi^{\top}\Psi + \lambda_1 M\big)^{-1}\,\Psi^{\top}\psi(p_j). \qquad (15)$$

Update $M$: The diagonal of $M$ is updated as

$$M_{ii} \leftarrow \frac{1}{2\,\|[\alpha_{j,i},\,\beta_{j,i}]\|_2 + \varepsilon}, \qquad (16)$$

where $\varepsilon$ is an offset to prevent Eq. (16) from becoming undefined.
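As an illustration, the alternating updates can be sketched as an iteratively reweighted least-squares loop (a sketch assuming a joint $\ell_{2,1}$ objective of the form commented below; `Phi`, `Psi`, `lam`, `lam1`, and `eps` are assumed names):

```python
import numpy as np

# IRLS sketch for  min_{a,b}  lam*||y1 - Phi a||^2 + ||y2 - Psi b||^2
#                             + lam1 * sum_i ||(a_i, b_i)||_2,
# assuming this form of the joint objective; eps is the offset that keeps
# the 1/||.|| row weights well defined.
def j3s_sparse_coding(y1, y2, Phi, Psi, lam=1.0, lam1=0.1, eps=1e-8, iters=50):
    a = np.linalg.pinv(Phi) @ y1                     # least-squares init
    b = np.linalg.pinv(Psi) @ y2
    for _ in range(iters):
        M = np.diag(1.0 / (2.0 * np.sqrt(a**2 + b**2) + eps))  # row weights
        a = np.linalg.solve(lam * Phi.T @ Phi + lam1 * M, lam * Phi.T @ y1)
        b = np.linalg.solve(Psi.T @ Psi + lam1 * M, Psi.T @ y2)
    return a, b

# Toy run: identity dictionaries; the joint penalty drives the rows with no
# signal support in either view toward zero.
a, b = j3s_sparse_coding(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                         np.eye(3), np.eye(3))
```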

IV-A Complexity Analysis

We now discuss the time and space complexity of the proposed J3S model. Compared with the dictionary construction and sparse coding steps, which require constructing the dictionaries and learning the sparse codes jointly, the classification step is computed directly from the results obtained in (8), so we do not consider its impact on the complexity of the algorithm. As the sub-problems of (8) are all convex and the objective functions are lower-bounded, the optimization algorithm converges to a local minimum [28]. The time complexity consists of the updates of the sparse codes and the weighting matrix, dominated by the matrix products and inversions in Eqs. (14) and (15); the overall cost therefore grows with the iteration number, the number of training samples, the dimensions of the two dictionaries, the number of channels or samples, and the patch size.

For space complexity, the proposed J3S model needs to store the two dictionaries for all samples, a pair of sparse vectors, and the corresponding weighting matrix for each query sample. The dimensions of the statistical dictionary and the unitary dictionary are determined by the size of the vectorized log-SPD embedding and the patch size, respectively.

IV-B A Simple and Effective Strategy for Model Acceleration

The time and space complexity of the J3S model depends on the dimensions of the two dictionaries. Referring to the dictionary construction process introduced earlier, we store the triangular form of the matrix $S$ for dimensionality reduction. However, the dimension of the statistical dictionary is still very high when using deep features, while the number of training samples is only in the hundreds; the resulting time complexity is prohibitive, which is not conducive to the practical application of the proposed algorithm.

To reduce the time and space cost, we use the widely adopted principal component analysis (PCA) method [26] to perform dimensionality reduction on the two dictionaries and eliminate redundant information across their different scales. For simplicity, we use $\Phi$ and $\Psi$ to represent the dictionaries of the statistical Gaussian model and the patch-based model generated from the whole dataset, respectively. Specifically, we learn a principal component transformation to map the data from the original space to a new low-dimensional space, obtaining new low-dimensional dictionaries of the statistical and spatial models. After the PCA operation, we store these two matrices for sparse representation learning. In each iteration, we select the columns corresponding to the training samples of each class to form the sub-dictionaries used to optimize (8). With PCA, we can reduce the dimensions of the two dictionaries to the same level as the number of training samples, which reduces both the time and space cost. The overall optimization procedure is formulated as Algorithm 1.

Algorithm 1 Joint statistical and spatial sparse representation.
Input: Training data; a query sample.
Output: The predicted label of the query sample.
1:  Construct the two dictionaries of the whole dataset;
2:  Apply PCA to obtain the low-dimensional representations of the two dictionaries;
3:  Initialize $M$ as an identity matrix;
4:  repeat
5:     update $\alpha$ according to Eq. (14);
6:     update $\beta$ according to Eq. (15);
7:     update $M$ according to Eq. (16);
8:  until convergence criterion satisfied.
9:  Classify the query sample by (11).
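As an illustration, the PCA reduction step (line 2 of Algorithm 1) can be sketched as follows (a sketch; the target dimension `k` and the per-dictionary projection are assumptions):

```python
import numpy as np

# Sketch: project a high-dimensional dictionary (columns = training samples)
# onto its top-k principal directions, as in Section IV-B.
def pca_reduce(D, k):
    Dc = D - D.mean(axis=1, keepdims=True)     # center the columns
    U, _, _ = np.linalg.svd(Dc, full_matrices=False)
    return U[:, :k].T @ Dc                     # (k, n) low-dimensional dictionary

D_stat = np.random.rand(512, 100)              # e.g. vectorized log-SPD features
D_low = pca_reduce(D_stat, 50)
```

Since the centered dictionary has rank at most the number of samples, reducing to that level preserves the information used by the sparse coding step while shrinking the linear systems in Eqs. (14) and (15).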

V Experiments

We present experimental results on video-based face recognition, material classification, and object recognition tasks to demonstrate the effectiveness of the proposed J3S classification algorithm.²

²The reproducible implementations of the J3S algorithms will be made publicly available upon paper acceptance.

We conduct experiments on four databases: the Flickr Material Database (FMD) [31], the UIUC Material Database [24], ETH-80 [20], and YouTube Celebrities (YTC) [16]. The FMD and UIUC databases are used for image-based classification, while the ETH-80 and YTC databases are used for image-set-based classification. These databases contain samples with different materials, views, illuminations, and even different modalities. Fig. 3 shows sample images of different categories from each database.

Fig. 3: Examples of four databases. (a) Flickr Material Database; (b) YouTube Celebrities Database; (c) UIUC Material Database; (d) ETH-80 Database.

The Flickr Material Database (FMD) has 10 material categories with 1000 images in the wild [31]. Each image is collected with variations of illumination, rotation, and scale. We process the images with the VGG-VD16 model pre-trained on the ImageNet database and employ the output of the last convolutional layer as local features. Following [37], we randomly choose 50 images in each category for the gallery and the other 50 as probes, and repeat this experiment ten times.

The UIUC Material Database contains 216 images of 18 material categories in the wild [24]. We obtain the deep features of the UIUC material images by taking the same measures as for FMD. We randomly choose half of the images in each category for the gallery and the other half as probes.

The ETH-80 database contains 80 image sets of 8 object categories [20]. Each category has 10 sub-objects, each with 41 images from different views. Following [39], we randomly choose 5 objects as the gallery and the other 5 as probes in each category. Each image is resized to a common size, and the intensity feature is used; thus, each image set can be expressed as a matrix whose columns are the vectorized images.

The YouTube Celebrities (YTC) database contains 1910 video clips of 47 subjects [16], with different numbers of frames in each video. Following [39, 13], we use histogram equalization to eliminate lighting effects in the pre-processing step and randomly select 3 videos per subject for the gallery and 6 videos as probes. Each frame is then resized to a common size with the intensity feature; thus, each video can be expressed as a matrix whose number of columns equals the number of frames in the video.

Method | ETH-80 | FMD | UIUC | YTC
AHISD (linear) [2] | 72.50 | 46.72 | 55.37 | 64.65
AHISD (non-linear) [2] | 72.00 | 46.72 | 55.37 | 66.58
CHISD (linear) [2] | 79.75 | 47.52 | 65.09 | 67.24
CHISD (non-linear) [2] | 72.50 | 63.90 | 65.65 | 68.09
MMD [40] | 85.75 | 60.60 | 62.78 | 69.60
MDA [38] | 87.75 | 62.50 | 67.13 | 64.72
SPDML-AIRM [9] | 90.75 | 63.42 | 74.72 | 67.50
SPDML-Stein [9] | 90.75 | 63.80 | 68.24 | 68.10
LEML [13] | 93.50 | 66.60 | 69.17 | 69.85
RMML-SPD [47] | 95.00 | 68.88 | 70.09 | 78.05
RMML-GM [47] | 93.00 | 69.62 | 76.48 | 69.15
CDL-LDA [39] | 94.00 | 76.92 | 78.89 | 70.21
CDL-PLS [39] | 94.00 | 75.36 | 76.39 | 69.94
RSR [10] | 91.50 | 74.92 | 72.59 | 72.77
KGDL [11] | 93.00 | 77.40 | 76.32 | 73.91
DRM [12] | 98.12 | N/A | N/A | 72.55
MMDML [25] | 94.50 | N/A | N/A | 78.50
J3S w/o Spatial Dict. | 95.25 | 81.40 | 83.43 | 82.87
J3S | 96.00 | 82.58 | 84.07 | 83.09
TABLE II: Classification accuracy (in %) over the four selected databases: AHISD and CHISD are affine-subspace-based methods; MMD to RMML are nonlinear-manifold-based methods; CDL is a Gaussian-distribution-based method; RSR and KGDL are based on sparse coding; DRM and MMDML are deep methods.

V-A Competing Methods

To illustrate the effectiveness of the proposed model, we compare our method with the following representatives of subspace-based, nonlinear-manifold-based, statistical, sparse-representation-based, and deep-learning-based methods.

  • Affine subspace based methods: AHISD and CHISD [2].

  • Nonlinear manifold based methods: MMD [40], MDA [38], SPDML [9], LEML [13], and RMML [47].

  • Gaussian distribution based methods: CDL [39].

  • Sparse representation based methods: RSR [10], and KGDL [11].

  • Deep learning based methods: DRM [12] and MMDML [25].

V-B Parameter Setting

We apply the implementations of the competing methods provided by the authors, with the default settings suggested by the corresponding papers. For MMD, the PCA energy percentage is set as in the original paper. For MDA, we set the number of local models, the number of between-class NN local models, and the subspace dimension the same as in [38]. For SPDML, we implement both the SPDML-AIRM and SPDML-Stein versions; in both versions, following [9], the neighborhood parameter is set to the minimum number of samples in one class, and the dimension of the low-dimensional manifold is tuned by 5-fold cross-validation. We compare our method with both the linear and non-linear versions of AHISD and CHISD [2], where the PCA energy retained in non-linear AHISD and the error penalty value in CHISD are set the same as in [2]. For LEML, its two parameters are tuned over the ranges suggested by the authors, and likewise for RMML. For CDL, the distance metric is learned with linear discriminant analysis (LDA) and partial least squares (PLS) in Hilbert space; the reduced feature dimension for LDA is set according to the number of classes. For RSR and KGDL, we use SPAMS as the sparse solver and set the other parameters as suggested in the papers. The dimension of the subspace of the Grassmann manifold in KGDL is set to 10.

Our proposed J3S method has four parameters: the weighting parameter and three regularization parameters. The weighting parameter is defined to balance the two sparse representation models and is adjusted to the scale of each database. For some databases, such as the UIUC database, which contains only a few labeled samples per class, the statistical dictionary may struggle to represent reliable and complete class information; thus, we set a small weight to mitigate the impact of the first term in Eq. (8), while we use a larger weight for the other databases. The three regularization parameters are all set to the same value. Moreover, we use a common VGG-VD16 backbone pre-trained on the ImageNet database as the feature extractor in this paper. The maximum number of iterations is set to 50, and we stop early when the difference between the losses of two consecutive iterations falls below a small threshold.

V-C Image and Image-Set Classification

Table II compares the image classification results of the proposed method and all selected competing methods. Furthermore, we include two deep learning methods, DRM [12] and MMDML [25], by quoting their results reported on the YTC database. Note that the classic methods randomly choose nine image sets for each class (three for training and six for testing) and report the average accuracy over ten runs. In contrast, the selected deep models divide the whole database into five folds with nine image sets per class and train until convergence, while the network input is still a single image. Our proposed J3S approach clearly outperforms all competing methods on the FMD, UIUC, and YTC databases. On the ETH-80 database, our method outperforms all competitors except DRM, which might be due to its way of processing data. Unlike the J3S model, DRM first computes LBP features of the training data and randomly generates subsets from the training samples, which enhances the capability of the deep network. Moreover, during testing, the learned DRM model reconstructs each image of a test image set, and a voting strategy is adopted for classification. In contrast, our J3S model treats all samples in each image set as a single classification object. Table II also shows that jointly using the two dictionaries helps integrate multiple sources of information to facilitate classification.

V-D Noisy image classification

We add i.i.d. Gaussian noise with increasing standard deviation to all training and testing data on the UIUC and FMD databases to generate noisy images for classification. Tables III and IV report the classification accuracy on the two databases under different noise levels. The results show that the proposed J3S method outperforms the competing methods under noise corruption. As the noise level increases, both tables show a clear downward trend in classification accuracy, yet our J3S model still performs best at every noise level. Moreover, for the methods with supervised dimension reduction, i.e., SPDML-Stein and CDL-LDA, accuracy on the UIUC database at relatively high noise levels can exceed that at lower noise levels. This is partly because these models can discard the less informative noisy components during dimensionality reduction.
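A minimal sketch of the corruption protocol, as we read it: add i.i.d. Gaussian noise of a chosen standard deviation to every image at several increasing noise levels. The pixel range, image size, and the example sigma values are assumptions for illustration, not the paper's settings.

```python
import numpy as np

# Add i.i.d. Gaussian noise of standard deviation `sigma` to an image.
# Pixel range [0, 255] is assumed; values are clipped back into range.
def add_gaussian_noise(image, sigma, rng):
    noise = rng.normal(0.0, sigma, size=image.shape)
    return np.clip(image.astype(np.float64) + noise, 0, 255)

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
# One corrupted copy per noise level (example levels, not the paper's).
noisy_sets = {sigma: add_gaussian_noise(clean, sigma, rng)
              for sigma in (10, 20, 30, 40)}
```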

SPDML-Stein 66.57 67.96 64.81 65.37
SPDML-AIRM 74.26 73.06 71.02 70.37
LEML 69.17 69.81 67.22 66.02
CDL-LDA 79.63 77.96 76.85 76.94
CDL-PLS 76.48 74.91 72.41 70.74
J3S 83.61 82.41 81.39 80.46
TABLE III: Classification accuracy (in %) on noisy data at different noise levels on the UIUC database.
SPDML-Stein 62.86 58.54 54.94 52.12
SPDML-AIRM 66.60 62.52 58.68 54.46
LEML 66.52 63.82 59.80 56.76
CDL-LDA 76.60 74.62 71.90 70.98
CDL-PLS 74.24 71.68 69.02 66.44
J3S 82.04 80.10 76.05 74.46
TABLE IV: Classification accuracy (in %) on noisy data at different noise levels on the FMD database.

V-E Ablation Study

V-E1 Weight Analysis

As stated in Eq. (8), the weighting parameter balances the two dictionary models. We conduct an experiment to investigate how the weighting parameter setting affects classification accuracy. Table V reports the image classification accuracy on the ETH-80, FMD, UIUC, and YTC databases under different values of the weighting parameter. For the ETH-80, FMD, and YTC databases, as the weight on the statistical model increases, the classification accuracy rises significantly, owing to the introduced higher-order Gaussian information. Compared with the spatial model, the statistical Gaussian model is more discriminative, but it still needs the spatial model to capture local information. Thus, once the weighting parameter grows beyond a certain level (e.g., from 0.5 to 0.7), further increases only cause the accuracy to fluctuate within a small range. In contrast, the classification accuracy on the UIUC database worsens as the weighting parameter increases. A potential explanation is that the statistical dictionary may be unreliable for representing the entire information of a class when only a few labeled samples are available, as in the UIUC database. As the weighting parameter increases, the statistical term dominates the loss function, so the classification accuracy decreases noticeably.


ETH-80 94.00 95.00 96.00 96.00 95.00
FMD 80.58 81.86 82.50 82.36 82.50
UIUC 84.07 83.06 83.24 83.43 83.33
YTC 76.70 80.92 82.70 83.09 83.01
TABLE V: Classification accuracy (in %) vs. the weighting parameter.
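The decision rule implied by the weighted objective in Eq. (8) can be sketched as follows: each class contributes a statistical residual and a spatial residual, and the query is assigned to the class with the smallest weighted sum. The function name and the residual values below are illustrative, not taken from the paper.

```python
import numpy as np

# Hedged sketch of the weighted decision rule: combine per-class
# reconstruction errors from the statistical and spatial sparse models
# with a weighting parameter, then pick the class of minimum residual.
def classify(stat_residuals, spat_residuals, weight):
    """stat_residuals, spat_residuals: per-class reconstruction errors
    from the statistical and spatial models; weight balances the two
    terms as in Eq. (8)."""
    combined = weight * np.asarray(stat_residuals) \
               + (1.0 - weight) * np.asarray(spat_residuals)
    return int(np.argmin(combined))

# Toy example with three classes and made-up residuals.
label = classify([0.9, 0.2, 0.5], [0.4, 0.6, 0.3], weight=0.7)
```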

Moreover, Table VI shows the classification results of the J3S model with and without PCA under different values of the weighting parameter. Our proposed J3S model achieves its best performance at the same weighting parameter under both settings. Meanwhile, after PCA dimensionality reduction, the highest accuracy improves slightly while the algorithm complexity decreases, which demonstrates the effectiveness of the J3S model with the PCA strategy.
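The PCA step can be sketched in a few lines, assuming standard PCA on feature vectors (the paper does not specify its exact implementation); the feature and target dimensions below are arbitrary examples.

```python
import numpy as np

# Minimal PCA sketch: project features onto the top-k principal
# directions to reduce dimensionality (and hence cost) before sparse coding.
def pca_reduce(X, k):
    """X: (n_samples, dim) feature matrix; returns (n_samples, k)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data gives the principal directions in Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 512))
reduced = pca_reduce(feats, k=64)   # 512-d -> 64-d
```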

J3S w/o PCA 83.98 83.33 82.50 82.36 82.50
J3S 84.07 83.06 83.24 83.43 83.33
TABLE VI: J3S model w/ or w/o PCA under different values of the weighting parameter on the UIUC database.
Fig. 4: Visualization of classification results based on the sparse representation of the joint dictionaries or each single dictionary, respectively. Two query samples from different categories are shown on the left, and the corresponding classification results are shown on the right.

V-E2 Feature Selection for Dictionary Construction

To determine a proper feature for constructing the dictionary that extracts local information, we test unitary dictionaries built from deep feature maps, original gray images, and RGB images, respectively. Table VII shows the classification accuracy of our proposed J3S model with different feature choices on clean and noisy data of the UIUC database. Without a unitary dictionary, the model performs worse than the other settings under noise, which partly explains the contribution of the spatial module to the robustness of the model. Meanwhile, the accuracy of the methods using a gray- or RGB-image-based unitary dictionary drops less under noise than that of the deep-feature-based unitary dictionary. Since the noisy images are fed into a deep CNN pre-trained on clean data, the noise component is harder to separate in deep features than in shallow image features.
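Building the patch matrix consumed by a patch-based (spatial) model can be sketched as below: slide a window over the image and stack the vectorized patches as columns. The patch size and stride are illustrative choices, not the paper's values.

```python
import numpy as np

# Sketch: extract overlapping patches from a gray image and stack them as
# columns of a matrix, the usual input for patch-based sparse coding.
def extract_patches(image, patch=8, stride=4):
    h, w = image.shape
    cols = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            cols.append(image[i:i + patch, j:j + patch].ravel())
    return np.stack(cols, axis=1)   # (patch*patch, n_patches)

img = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
P = extract_patches(img)            # gray-image patches for a unitary dict.
```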

Settings Acc (Clean) Acc (Noise)
w/o unitary dict. 83.43 80.00
w/ Deep feature based unitary dict. 84.07 80.46
w/ Gray image based unitary dict. 83.15 80.12
w/ RGB image based unitary dict. 80.37 80.18
TABLE VII: Classification accuracy (in %) on clean and noisy data of the UIUC database.

V-E3 Convergence Analysis

According to Eq. (8), each part of the objective function is easily shown to be convex. With alternating minimization, the optimization problem splits into two convex subproblems that can each be solved efficiently. Fig. 5 shows that the J3S model converges within a few iterations under different regularization parameters; after only one iteration, the loss is already close to its converged value. Moreover, comparing the two settings in Fig. 5, larger regularization parameters make the convergence more stable but require more iterations to converge fully. A potential explanation is that the regularization parameters of the two sparse models control the sparsity of the coefficient vectors for each query sample, and the influence of the regularization terms on the J3S objective grows with the scale of the corresponding parameters.
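To make the "two convex subproblems" concrete, one can look at a single sparse coding subproblem. Assuming, for illustration, an l1-penalized form min_a 0.5*||x - D a||^2 + lam*||a||_1, each ISTA iteration is a gradient step on the data term followed by soft-thresholding. This is a generic solver sketch, not necessarily the paper's exact update (which uses OMP-style sparse coding, cf. [29]).

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the l1 norm: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    # Iterative shrinkage-thresholding for min_a 0.5||x - D a||^2 + lam||a||_1.
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10)); D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(10); a_true[[2, 7]] = (1.5, -2.0)
a_hat = ista(D, D @ a_true, lam=0.05)      # recovers a sparse code
```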

Fig. 5: Convergence curves under two settings, (a) and (b), of the three regularization parameters on the UIUC database. We show the loss for 30 iterations, which is sufficient to ensure convergence of the model.
Method K=1 K=2 K=3 K=4 K=5 K=6
SPDML-Stein 42.41 53.70 60.56 64.81 66.57 68.24
SPDML-AIRM 48.06 62.13 68.24 70.83 72.69 74.72
LEML N/A 53.43 61.76 65.09 67.59 69.17
CDL-LDA 26.48 38.89 47.22 51.56 65.19 78.89
CDL-PLS 55.19 67.96 72.64 74.56 75.56 76.39
CNN+Mean+SVM 59.60 70.44 75.56 77.78 78.70 81.67
CNN+Gau+SVM 61.61 72.22 77.40 79.58 81.90 84.01
J3S 61.94 75.56 78.89 80.83 82.41 84.07
TABLE VIII: Classification accuracy (in %) with K labeled training samples per class on the UIUC database.

V-F Few-shot Classification

We consider a popular and challenging setting, i.e., the few-shot setting for image classification, to investigate the robustness of the proposed J3S model with only limited supervised information. For few-shot learning, the whole dataset is divided into two non-overlapping label sets, i.e., a training set and a testing set. Following the meta-learning strategy [33], most existing few-shot methods construct N-way K-shot tasks on the testing set to evaluate the generalization of models trained on the training set. Here N and K denote the number of classes and the number of labeled samples per class, respectively, and both are typically small.

Typical few-shot methods learn the model only from the training set and test classification accuracy on the testing set. Unlike this strategy, the proposed J3S model, based on representation learning, solves the classification task via the reconstruction loss computed from the two coefficient vectors learned from both the labeled training samples and each query sample. Slightly differently from the typical few-shot setting, we set N to the total number of categories instead of a commonly used fixed value, while K is still set to a small number, identical to the few-shot setting.
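The episode construction just described can be sketched as follows: every class is used (N equals the total number of categories) and only K labeled samples per class form the support set, with the remainder as queries. The dataset layout and sizes here are hypothetical.

```python
import numpy as np

# Sketch of the evaluation protocol described above: N-way with N = all
# classes, K-shot support per class, remaining samples as queries.
def make_episode(labels, k, rng):
    """labels: (n,) array of class labels; returns support/query indices."""
    support, query = [], []
    for c in np.unique(labels):            # every class participates
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k])            # K-shot support samples
        query.extend(idx[k:])
    return np.array(support), np.array(query)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 12)       # toy data: 5 classes, 12 each
sup, qry = make_episode(labels, k=3, rng=rng)
```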

Table VIII shows the classification accuracy on the UIUC database with different numbers of labeled samples per class used for dictionary learning. As mentioned before, the UIUC database has only a few training samples per class in total, so the general classification setting with all training samples already constitutes a kind of few-shot setting. From Table VIII, we can observe that our proposed J3S model outperforms the other methods in all few-shot settings, i.e., K from 1 to 6. Note that almost all Gaussian-based models, i.e., CDL-LDA, CDL-PLS, the two Gaussian-based CNN models, and our proposed J3S model, perform well even with only a few training samples per class. This demonstrates that the global statistical model captures rich class information, enhancing classification in few-shot tasks. Meanwhile, our J3S model and CNN+Gau+SVM perform better than the other two Gaussian-based methods, CDL-LDA and CDL-PLS. A potential explanation is that CDL-LDA and CDL-PLS are learned based only on the covariance matrix, while the J3S model and CNN+Gau+SVM jointly exploit first- and second-order information, which leads to better accuracy. Moreover, CDL-LDA performs poorly in the 1-shot setting, because the feature dimension is then much larger than the number of training samples. After feature projection (i.e., dimensionality reduction), the LDA-based model cannot maintain the difference among neighbors while keeping the within-class variance minimal for nearest-neighbor classification. Compared with LDA, PLS proves helpful in this situation as it is not limited by the low discrimination dimensions.

Additionally, as the supervised information decreases, i.e., as K shrinks, the gap between our proposed J3S model and CNN+Gau+SVM widens. This might be due to the contribution of spatial information when the statistical model cannot provide sufficient information for classification. However, in the 1-shot setting, the differences among all methods become minor, because only one support sample per category is available for learning, which yields insufficient information for classification and leaves the model vulnerable to bias.

VI Conclusion

In this paper, we proposed a novel J3S model for robust image and image-set classification. A Gaussian distribution is used to retain high-order image statistics, while patch-based sparse representation captures local image structure. A simple yet effective PCA-based dimensionality reduction is used to lower the algorithm complexity. We conducted experiments on four popular databases for clean and noisy image classification tasks. Moreover, we performed parameter sensitivity analysis and tested the robustness of the algorithm under the popular few-shot setting. Results show that our proposed method achieves superior performance compared with a variety of algorithms under several settings.


  • [1] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM journal on matrix analysis and applications 29 (1), pp. 328–347. Cited by: §II, §III-C.
  • [2] H. Cevikalp and B. Triggs (2010) Face recognition based on image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2567–2573. Cited by: TABLE I, §II, 1st item, §V-B, TABLE II.
  • [3] H. Cheng and B. Wen (2020) Joint statistical and spatial sparse representation for robust image and image-set classification. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 2411–2415. Cited by: §I, footnote 1.
  • [4] A. Cherian and S. Sra (2016) Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE transactions on neural networks and learning systems 28 (12), pp. 2859–2871. Cited by: §I, §II.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I, §I.
  • [6] Q. Feng and Y. Zhou (2016) Kernel combined sparse representation for disease recognition. IEEE Transactions on Multimedia 18 (10), pp. 1956–1968. Cited by: §I.
  • [7] S. Gao, L. Chia, I. W. Tsang, and Z. Ren (2014) Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding. IEEE Transactions on multimedia 16 (3), pp. 762–771. Cited by: §I.
  • [8] S. Gao, Z. Zeng, K. Jia, T. Chan, and J. Tang (2015) Patch-set-based representation for alignment-free image set classification. IEEE Transactions on Circuits and Systems for Video Technology 26 (9), pp. 1646–1658. Cited by: §I.
  • [9] M. Harandi, M. Salzmann, and R. Hartley (2014) From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. In European Conference on Computer Vision, pp. 17–32. Cited by: 2nd item, §V-B, TABLE II.
  • [10] M. Harandi, C. Sanderson, R. Hartley, and B.C. Lovell (2012) Sparse coding and dictionary learning for symmetric positive definite matrices: a kernel approach. In European Conference on Computer Vision, pp. 216–229. Cited by: TABLE I, §II, 4th item, TABLE II.
  • [11] M. Harandi, C. Sanderson, C. Shen, and B.C. Lovell (2013) Dictionary learning and sparse coding on Grassmann manifolds: an extrinsic solution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3120–3127. Cited by: TABLE I, §II, 4th item, TABLE II.
  • [12] M. Hayat, M. Bennamoun, and S. An (2014) Deep reconstruction models for image set classification. IEEE transactions on pattern analysis and machine intelligence 37 (4), pp. 713–727. Cited by: TABLE I, §II, 5th item, §V-C, TABLE II.
  • [13] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen (2015) Log-euclidean metric learning on symmetric positive definite manifold with application to image set classification.. In International Conference on Machine Learning, pp. 720–729. Cited by: TABLE I, §II, 2nd item, TABLE II, §V.
  • [14] P. Jing, Y. Shang, L. Nie, Y. Su, J. Liu, and M. Wang (2020) Learning low-rank sparse representations with robust relationship inference for image memorability prediction. IEEE Transactions on Multimedia. Cited by: §I.
  • [15] L. Kang, C. Hsu, H. Chen, C. Lu, C. Lin, and S. Pei (2011) Feature-based sparse representation for image similarity assessment. IEEE Transactions on Multimedia 13 (5), pp. 1019–1030. Cited by: §II.
  • [16] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley (2008) Face tracking and recognition with visual constraints in real-world videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §I, §V, §V.
  • [17] T. Kim, J. Kittler, and R. Cipolla (2007) Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6). Cited by: TABLE I, §II.
  • [18] T. Kobayashi (2014) Dirichlet-based histogram feature transform for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3278–3285. Cited by: §I.
  • [19] B. Kulis, M. A. Sustik, and I. S. Dhillon (2009) Low-rank kernel learning with bregman matrix divergences.. Journal of Machine Learning Research 10 (2). Cited by: §III-A.
  • [20] B. Leibe and B. Schiele (2003) Analyzing appearance and contour based methods for object categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 402–409. Cited by: §I, §V, §V.
  • [21] L. Li, D. Wu, J. Wu, H. Li, W. Lin, and A. C. Kot (2016) Image sharpness assessment by sparse representation. IEEE Transactions on Multimedia 18 (6), pp. 1085–1097. Cited by: §I.
  • [22] P. Li, X. Lu, and Q. Wang (2015) From dictionary of visual words to subspaces: locality-constrained affine subspace coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2348–2357. Cited by: §I.
  • [23] P. Li, J. Xie, Q. Wang, and W. Zuo (2017) Is second-order information helpful for large-scale visual recognition?. In Proceedings of the IEEE international conference on computer vision, pp. 2070–2078. Cited by: §I.
  • [24] Z. Liao, J. Rock, Y. Wang, and D. Forsyth (2013-06) Non-parametric filtering for geometric detail extraction and material representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §V, §V.
  • [25] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou (2015-06) Multi-manifold deep metric learning for image set classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: TABLE I, §II, 5th item, §V-C, TABLE II.
  • [26] A. M. Martinez and A. C. Kak (2001) PCA versus LDA. IEEE transactions on pattern analysis and machine intelligence 23 (2), pp. 228–233. Cited by: §IV-B.
  • [27] T. T. Nguyen, T. P. Nguyen, and F. Bouchara (2020) Prominent local representation for dynamic textures based on high-order gaussian-gradients. IEEE Transactions on Multimedia. Cited by: §I.
  • [28] U. Niesen, D. Shah, and G.W. Wornell (2009) Adaptive alternating minimization algorithms. IEEE Transactions on Information Theory 55 (3), pp. 1423–1429. Cited by: §IV-A.
  • [29] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar conference on signals, systems and computers, pp. 40–44. Cited by: §III-B.
  • [30] S. Ravishankar and Y. Bresler (2015) Sparsifying transform learning with efficient optimal updates and convergence guarantees. IEEE Transactions on Signal Processing 63 (9), pp. 2389–2404. Cited by: §III-B.
  • [31] L. Sharan, R. Rosenholtz, and E. H. Adelson (2009) Material perception: what can you see in a brief glance?. Journal of Vision 9 (8), pp. 784–784. Cited by: §V, §V.
  • [32] H. Sun, X. Zhen, Y. Zheng, G. Yang, Y. Yin, and S. Li (2017) Learning deep match kernels for image-set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3307–3316. Cited by: TABLE I, §II.
  • [33] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §V-F.
  • [34] B. Wang, Y. Hu, J. Gao, Y. Sun, F. Ju, and B. Yin (2020) Learning adaptive neighborhood graph on grassmann manifolds for video/image-set subspace clustering. IEEE Transactions on Multimedia 23, pp. 216–227. Cited by: §I.
  • [35] L. Wang, S. WANG, D. Kong, B. Yin, et al. (2020) Hardness-aware dictionary learning: boosting dictionary for recognition. IEEE Transactions on Multimedia. Cited by: §II.
  • [36] Q. Wang, P. Li, and L. Zhang (2017) G2DeNet: global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2730–2739. Cited by: §II.
  • [37] Q. Wang, P. Li, W. Zuo, and L. Zhang (2016) RAID-G: robust estimation of approximate infinite dimensional Gaussian with application to material recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4433–4441. Cited by: §II, §III-A, §V.
  • [38] R. Wang and X. Chen (2009) Manifold discriminant analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 429–436. Cited by: 2nd item, §V-B, TABLE II.
  • [39] R. Wang, H. Guo, L. S. Davis, and Q. Dai (2012) Covariance discriminative learning: a natural and efficient approach to image set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2496–2503. Cited by: TABLE I, §II, 3rd item, TABLE II, §V, §V.
  • [40] R. Wang, S. Shan, X. Chen, and W. Gao (2008) Manifold-manifold distance with application to face recognition based on image set. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: 2nd item, TABLE II.
  • [41] R. Wang, X. Wu, and J. Kittler (2020) Graph embedding multi-kernel metric learning for image set classification with grassmann manifold-valued features. IEEE Transactions on Multimedia. Cited by: §I.
  • [42] B. Wen, S. Ravishankar, and Y. Bresler (2015) Structured overcomplete sparsifying transform learning with convergence guarantees and applications. International Journal of Computer Vision 114 (2-3), pp. 137–167. Cited by: §II.
  • [43] B. Wen, S. Ravishankar, and Y. Bresler (2017) FRIST—flipping and rotation invariant sparsifying transform learning and applications. Inverse Problems 33 (7), pp. 074007. Cited by: §I.
  • [44] B. Wen, Y. Li, Y. Li, and Y. Bresler (2020) A set-theoretic study of the relationships of image models and priors for restoration problems. arXiv preprint arXiv:2003.12985. Cited by: §III-B.
  • [45] J. Wright, A.Y. Yang, A. Ganesh, S. Sastry, and Y. Ma (2009) Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2), pp. 210–227. Cited by: §II.
  • [46] J. Yang, K. Yu, Y. Gong, and T. Huang (2009) Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1794–1801. Cited by: §II.
  • [47] P. Zhu, H. Cheng, Q. Hu, Q. Wang, and C. Zhang (2018) Towards generalized and efficient metric learning on riemannian manifold.. In IJCAI, pp. 3235–3241. Cited by: TABLE I, §II, 2nd item, TABLE II.
  • [48] Z. Zhu, F. Guo, H. Yu, and C. Chen (2014) Fast single image super-resolution via self-example learning and sparse representation. IEEE Transactions on Multimedia 16 (8), pp. 2178–2190. Cited by: §I.