I Introduction
Image classification is a fundamental problem in image processing and computer vision. Compared to classic algorithms based on predefined features, recent image classification schemes apply machine learning techniques to optimize feature representations directly from the data themselves. More recently, deep learning approaches for image classification have achieved state-of-the-art results on many benchmark datasets, such as the popular ImageNet
[5]. Despite the promising performance achieved under simple and ideal problem setups, various challenges remain when (i) classification is based on queries that contain a set of object variations (i.e., image-set classification), or (ii) the image data are limited or of relatively low quality (i.e., weakly supervised classification). To be specific, while conventional classification tasks process a single image in each query, image-set classification [16, 20, 8, 41] has recently gained attention; there, each query set contains multiple strongly correlated images (e.g., a query object under multiple views, poses, or illuminations). Such algorithms are widely used in applications such as video-based face classification [16], multispectral image classification, etc. Compared to single-image classification algorithms, effective image-set methods must additionally exploit the hidden structure among image sets, e.g., the inter- and intra-set data variations. Furthermore, popular deep features tend to be generic and incorporate very little prior knowledge, being learned from large-scale, high-quality, and fully annotated training datasets [5]. Such approaches are ideal for fully supervised learning, but less data-efficient and less robust when training sets are small-scale or corrupted (e.g., noisy).
Recent works on sparse signal modeling have demonstrated its effectiveness in image representation for various tasks [48, 7, 6, 21, 4, 34, 14]. Compared to deep features, sparse representation is model-based and thus much more robust to practical challenges such as noise or overfitting [43]. While many existing works focus on exploiting image patch-based sparsity, global statistical properties are typically ignored or not incorporated jointly in a principled way. Recent works show that high-order statistics of image features are critical in classification tasks [18, 22, 23, 27], leading to better results compared to many first-order methods.
In this work, we propose a novel Joint Statistical and Spatial Sparse representation (J3S), i.e., learning coupled dictionaries for both the local patch features and the global data Gaussian distribution mapped onto a Riemannian manifold. Their dictionary-domain sparse coefficients are reconciled by solving a sparse coding problem with joint sparsity, for which we propose an efficient yet effective alternating minimization algorithm. To the best of our knowledge, no work to date has utilized both global statistics and local patch structures jointly via sparse representation for image classification. Furthermore, we apply the learned J3S model to robust image-set and single-image classification. Extensive experimental results on material classification, object recognition, and video-based face recognition tasks demonstrate that the proposed J3S-based classification scheme outperforms popular and state-of-the-art competing methods.
In short, the contributions of this paper include:

Learning global statistical and local patch dictionaries for visual classification tasks by coupling them with joint sparsity;

Utilizing principal component analysis (PCA) to reduce the J3S model complexity while maintaining its effectiveness;

Investigating the robustness of the proposed model under various conditions, i.e., noisy and few-shot settings;

Achieving state-of-the-art results on both noisy image and image-set classification tasks.
The remainder of this article is organized as follows. Section II summarizes related work on image and image-set classification, including manifold learning, deep learning, and sparse representation. Section III introduces the two kinds of dictionary learning methods, based on Gaussian statistical information and patch-based spatial information respectively, along with the proposed J3S model and the classification module. Section IV describes the solution of the proposed J3S model based on alternating minimization, analyzes its time and space complexity, and presents a simple strategy for effective model acceleration. Section V evaluates the proposed J3S model on image and image-set tasks over several standard databases under different conditions, such as noise and few-shot settings. Section VI concludes with proposals for future work. A preliminary version of this work appeared in [3].¹
¹Significant changes have been made compared to our previous work in [3]. First, we improve the J3S method by reducing the model complexity with a simple and efficient approach. Second, we add more description and analysis of the dictionary learning and classification. Third, we include new ablation experiments to investigate model convergence and parameter selection. Furthermore, we conduct an extra experiment on an object recognition task based on the ETH80 database to evaluate the generalizability of the proposed J3S model in different scenarios. Finally, additional popular settings, e.g., the noisy condition and the few-shot setting, are used to validate the performance and robustness of the proposed J3S model.
II Related Work
Image-set classification aims to identify the common class of a multi-image query. The inherent properties of each query set can be modeled effectively by popular approaches such as manifold learning, deep learning, and sparse coding.
Manifold Learning: The classic Discriminant Canonical Correlations (DCC) method [17] classifies image sets by maximizing the canonical correlations of within-class sets and minimizing those of between-class sets. Later, more subspace methods [2] were proposed to simplify geometric structure learning for image sets. However, these approaches are limited because most image sets lie on a Riemannian manifold rather than in Euclidean subspaces [39, 13]; e.g., the symmetric positive definite (SPD) manifold is widely used to represent image sets. To ease computation, the Log-Euclidean Riemannian Metric (LERM) framework [1] maps data from the SPD manifold to its tangent Euclidean space. Besides, Log-Euclidean Metric Learning (LEML) [13] projects the original SPD manifold to a lower-dimensional discriminative SPD manifold while preserving its original geometry. More recently, Riemannian Manifold Metric Learning (RMML) [47] proposed a more general metric learning method applicable to multiple manifolds. From a statistical perspective, when modeling image sets or multi-channel features via Gaussian distributions, the covariance matrices of a collection of Gaussians form a Riemannian manifold of SPD matrices [39, 37, 36]. Covariance Discriminative Learning (CDL) [39] derived a Riemannian kernel function to map covariance matrices from the manifold to a Hilbert space, where kernelized linear methods can be used for learning.
Deep Learning: Recently, more works on deep learning have shown its capability for image-set classification [12, 25, 32]. The Deep Reconstruction Model (DRM) [12] learns a template deep reconstruction model using neural networks and then uses the minimal reconstruction residual to classify a query set. Multi-manifold deep metric learning (MMDML) [25] maps multiple sets of images into a shared feature subspace to leverage nonlinear information. More recently, Deep Match Kernels (DMK) [32] were proposed for image-set classification without specific assumptions on image distribution or geometric structure, building local match kernels on generic deep features.

Methods
DCC [17]  ✓  ✓
AHISD/CHISD [2]  ✓  ✓  ✓
LEML [13]  ✓  ✓  ✓
RMML [47]  ✓  ✓  ✓
CDL [39]  ✓  ✓  ✓
RSR [10]  ✓  ✓  ✓
KGDL [11]  ✓  ✓  ✓
DRM [12]  ✓  ✓
MMDML [25]  ✓  ✓  ✓
DMK [32]  ✓  ✓
Proposed J3S  ✓  ✓  ✓  ✓
Sparse Representation: Sparse-coding-based classification represents a query sample over a dictionary composed of the training samples of all classes, and then classifies it by the reconstruction error of each class [45, 15, 42, 35]. Alternatively, the sparse coefficients can be used as extracted features for classification, e.g., in linear spatial pyramid matching [46]. Most existing works focus on sparse coding and dictionary learning with zero-order information, i.e., the original feature space, whereas first-order and second-order statistics contain global information and take data correlations into account. The latter can be more robust to variations in image and video applications, e.g., variations in pose, illumination, and occlusion.
Table I summarizes the aforementioned related methods as well as the proposed J3S method for image-set classification. Furthermore, some recent works also proposed sparse coding and dictionary learning models on the Riemannian manifold of SPD matrices and on the Grassmann manifold: sparse coding on a Riemannian manifold can be converted into a kernel sparse coding problem by deriving valid kernels for the SPD manifold [10, 4] or the Grassmann manifold [11]. However, none of the existing works combines statistical and spatial priors in the sparse representation, and the robustness of image-set classification has rarely been investigated.
III Dictionary Construction and Joint Sparse Representation
In this section, we present the J3S model for classification tasks, including the dictionary construction of the statistical and spatial models and the joint sparse coding. The proposed J3S model handles different types of input data, such as a single image or an image set.
To obtain unified feature representations for classifying both single images and image sets, we apply the corresponding data preprocessing. Specifically, for an image set with a feature vector x_i for each image, we stack the vectors to directly construct the image-set representation X = [x_1, …, x_n]; for a single image, we employ its deep feature representation from a pretrained CNN extractor as local features to construct X. Thus, both an image and an image set can be represented in the same form X ∈ R^{d×n}, where d is the feature dimension and n is the number of images in the image set or the number of channels for a single image.
III-A Statistical Dictionary Construction
Based on the Gaussian statistical model, we need to compute the mean vector μ and covariance matrix Σ in a Reproducing Kernel Hilbert Space (RKHS) for the Gaussian descriptor of X. We map X into an RKHS by the mapping function φ(·) associated with Hellinger's kernel, i.e., φ(x) = √x element-wise. The mean vector and covariance matrix are computed as:

μ = (1/n) Φ 1_n,   Σ = (1/n) Φ J Φᵀ.   (1)

Here Φ = [φ(x_1), …, φ(x_n)], 1_n is the all-ones vector, and J = I_n − (1/n) 1_n 1_nᵀ is the centering matrix. However, when the dimension of the original features (i.e., d) is very high and the number of samples (i.e., n) is small, such a Gaussian descriptor does not work well. To solve this problem, following [37], we estimate a robust covariance matrix Σ̂ by solving a regularized maximum likelihood estimation problem:

Σ̂ = argmin_{S ≻ 0}  log det(S) + tr(S⁻¹ Σ) + η D_vN(I, S),   (2)

where D_vN(A, B) = tr(A log A − A log B − A + B) is the von Neumann matrix divergence [19] of two matrices and η > 0 is a regularizing parameter. The optimal solution of problem (2) can be computed in closed form as:

Σ̂ = U diag(λ̂_1, …, λ̂_d) Uᵀ,   λ̂_i = [ −(1−η) + √((1−η)² + 4η λ_i) ] / (2η).   (3)

Here Λ = diag(λ_1, …, λ_d) is the diagonal matrix of the singular values in decreasing order, and U is the orthogonal matrix consisting of the eigenvectors corresponding to the singular values; Λ and U are computed by the singular value decomposition (SVD) of the covariance matrix as Σ = U Λ Uᵀ. Using the mean vector μ and the robust covariance matrix Σ̂, we define the embedding symmetric positive definite matrix G as:

G = [ Σ̂ + ρ² μ μᵀ    ρ μ
      ρ μᵀ           1  ],   (4)

where ρ > 0 is a parameter to balance the orders of magnitude between the two statistics.
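The statistical descriptor above can be sketched in a few lines of NumPy. This is a minimal illustration under the reconstruction given here (Hellinger map as an element-wise square root, the eigenvalue correction of Eq. (3), and the block embedding of Eq. (4)); the function name and default parameter values are illustrative, not the authors' implementation.

```python
import numpy as np

def gaussian_embedding(X, eta=0.5, rho=1.0):
    """Build the SPD Gaussian embedding G of Eq. (4) for features X (d x n).

    Sketch under the assumptions stated in the text: Hellinger feature map,
    centered covariance of Eq. (1), robust eigenvalues via Eq. (3).
    """
    d, n = X.shape
    Phi = np.sqrt(np.maximum(X, 0.0))          # Hellinger map, element-wise sqrt
    mu = Phi.mean(axis=1, keepdims=True)       # mean vector (d x 1)
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Sigma = Phi @ J @ Phi.T / n                # covariance, Eq. (1)

    lam, U = np.linalg.eigh(Sigma)             # eigen-decomposition of SPD Sigma
    lam = np.maximum(lam, 0.0)
    # closed-form eigenvalue correction of Eq. (3)
    lam_hat = (-(1 - eta) + np.sqrt((1 - eta) ** 2 + 4 * eta * lam)) / (2 * eta)
    Sigma_hat = U @ np.diag(lam_hat) @ U.T

    # block SPD embedding of Eq. (4)
    G = np.block([[Sigma_hat + rho**2 * (mu @ mu.T), rho * mu],
                  [rho * mu.T, np.ones((1, 1))]])
    return G
```

Note that the result is a (d+1)×(d+1) SPD matrix: its Schur complement with respect to the lower-right entry is exactly Σ̂, so positive definiteness follows whenever Σ̂ is positive definite.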
III-B Spatial Dictionary Construction
Similarly, given a sample X, we can exploit spatial information by learning a patch-based unitary dictionary D from the feature map of a single image or the gray-level frames of an image set. For a single image, we choose the original image or its deep feature map as the input for learning the patch-based unitary dictionary. In contrast, for an image set, we stack the individual image features to learn a patch-based unitary dictionary that exploits the within-class structure. For both image and image-set classification, the objective is to learn a unitary dictionary D based on 2D image patches extracted from the sample X by solving the following synthesis-model problem:

min_{D, {α_i}} Σ_{i=1}^{N} ‖R_i x − D α_i‖²₂ + θ² ‖α_i‖₀   s.t. Dᵀ D = I,   (5)

where R_i extracts the i-th patch from x, x is the vectorized form of X, N is the number of total patches, θ controls the sparsity level, and I is the identity matrix.

Sparse coding problems under the synthesis model are NP-hard in general, and even approximate algorithms are typically expensive [29]. However, since problem (5) learns a unitary dictionary, it is equivalent to the unitary transform learning problem [44]: a signal R_i x is approximately sparsifiable by a learned unitary transform W, as W R_i x = α_i + e_i, where α_i is sparse and e_i is a small residual in the transform domain. The corresponding transform learning problem is formulated as

min_{W, {α_i}} Σ_{i=1}^{N} ‖W R_i x − α_i‖²₂ + θ² ‖α_i‖₀   s.t. Wᵀ W = I.   (6)

Based on [44], the two sparsity models can be unified under the unitary dictionary assumption, i.e., D = Wᵀ, with identical sparse codes {α_i}.
Thus, we can obtain the optimal dictionary in (5) by solving its equivalent problem (6), which admits an exact and closed-form update [30]: with the patch matrix P = [R_1 x, …, R_N x] and code matrix A = [α_1, …, α_N], the optimal transform is W = V Uᵀ, where U and V are computed by the SVD of P Aᵀ as P Aᵀ = U S Vᵀ.
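The alternation between sparse coding (exact hard thresholding in the transform domain) and the closed-form unitary update can be sketched as follows. This is an assumed minimal implementation of the scheme described above, not the authors' code; the function name, initialization, and iteration count are illustrative.

```python
import numpy as np

def learn_unitary_transform(P, theta=0.1, iters=20):
    """Alternating minimization for the unitary transform model of Eq. (6).

    P is the p^2 x N matrix of vectorized patches. Sparse codes come from
    hard thresholding at theta; the transform update is the orthogonal
    Procrustes solution via SVD, matching the closed form in the text.
    """
    m = P.shape[0]
    W = np.eye(m)                               # initialize with the identity
    for _ in range(iters):
        A = W @ P
        A[np.abs(A) < theta] = 0.0              # exact l0 sparse coding step
        U, _, Vt = np.linalg.svd(P @ A.T)       # SVD of P A^T = U S V^T
        W = Vt.T @ U.T                          # Procrustes update: W = V U^T
    D = W.T                                     # equivalent unitary dictionary
    return W, D
```

Because both subproblems are solved exactly, the objective of Eq. (6) is monotonically non-increasing over the iterations.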
Fig. 1 illustrates the framework of J3S, in which a sample (either an image set or a single image) is modelled by the statistical model to obtain the embedding SPD matrix G, and simultaneously by the spatial patch-based model to generate a unitary transform dictionary W; both are then used for joint sparse coding.
III-C J3S Sparse Coding
To reconcile the two types of dictionaries generated from statistical Gaussian modeling and spatial patch-wise unitary dictionary learning, we propose the joint statistical and spatial sparse representation (J3S) model: for any query sample, we impose joint sparsity on its statistical and spatial dictionary-domain coefficients to maintain the consistency and dependency of the two individual sparse representations. For simplicity, we write D_G^c and D_P^c for the statistical (Gaussian) and patch-based unitary dictionaries built from the training samples belonging to the c-th class, respectively.
For the statistical Gaussian-based dictionary, each embedding matrix G is an SPD matrix, which can be viewed as a point on the corresponding SPD manifold based on Eq. (4). Direct vectorization of the SPD matrix to generate a dictionary would destroy its intrinsic structure, which may cause information loss. To avoid such loss while measuring the similarity between two matrices on the SPD manifold, we use the general LERM framework [1] to map each matrix to its tangent space through the matrix logarithm log(·). With this embedding, we can measure similarity directly on the tangent space using the Euclidean distance, and the vectorized form of each mapped matrix serves as the statistical Gaussian-based feature. To simplify the computation on the symmetric matrix log(G), we only extract its upper-triangular elements to construct the dictionary, so the statistical mapping function can be written as:

f_G(X) = uvec(log(G)),   (7)

where uvec(·) stacks the upper-triangular elements of a symmetric matrix into a vector.
For the patch-based unitary model, f_P(X) denotes the vectorized form of the unitary dictionary learned from X.
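The tangent-space mapping of Eq. (7) reduces to an eigen-decomposition because G is SPD. A small sketch, with the function name chosen here for illustration:

```python
import numpy as np

def log_uvec(G):
    """Statistical mapping f_G of Eq. (7): upper-triangular part of log(G).

    Since G is SPD, the matrix logarithm is computed through its
    eigen-decomposition; uvec stacks the upper triangle row by row.
    """
    lam, U = np.linalg.eigh(G)                  # G = U diag(lam) U^T, lam > 0
    L = U @ np.diag(np.log(lam)) @ U.T          # matrix logarithm of an SPD matrix
    iu = np.triu_indices(L.shape[0])            # upper-triangular indices
    return L[iu]                                # vectorized tangent-space feature
```

For a (d+1)×(d+1) embedding matrix this yields a feature vector of length (d+1)(d+2)/2, which sets the row dimension of the statistical dictionary.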
Given the joint statistical and spatial dictionaries, we compute the coefficient vectors a_k and b_k of the k-th query sample Y_k, with feature representations f_G(Y_k) and f_P(Y_k), by solving the following problem:

min_{a_k, b_k} ‖f_G(Y_k) − D_G a_k‖²₂ + λ ‖f_P(Y_k) − D_P b_k‖²₂ + λ₁‖a_k‖₁ + λ₂‖b_k‖₁ + λ₃‖[a_k, b_k]‖_{2,1},   (8)

where λ is a weighting parameter that balances the scales of the statistical and patch-based models, f_G(·) and f_P(·) map the two kinds of representations, and ‖·‖_{2,1} is the ℓ_{2,1} norm enforcing row sparsity on the matrix [a_k, b_k] formed by stacking the two coefficient vectors as columns.
By solving the optimization problem in (8), we obtain the coefficient vectors a_k and b_k corresponding to the Gaussian and patch-based dictionary models, respectively. With these two vectors, we compute the reconstruction loss of the k-th query sample using only the sub-dictionaries and coefficient sub-vectors of the c-th class as:

e_c(Y_k) = ‖f_G(Y_k) − D_G^c a_k^c‖²₂ + λ ‖f_P(Y_k) − D_P^c b_k^c‖²₂ + λ₁‖a_k^c‖₁ + λ₂‖b_k^c‖₁ + λ₃‖[a_k^c, b_k^c]‖_{2,1}.   (9)

Here a_k^c and b_k^c are the coefficient sub-vectors of the c-th class for the k-th query sample; D_G^c a_k^c and D_P^c b_k^c are the reconstructed statistical and spatial representations of the k-th query sample; and D_G^c and D_P^c are the sub-dictionaries formed by the training samples of the c-th class. Moreover, (8) and (9) share the same regularization terms constraining the overall sparsity of the sparse coding. For classification, we keep only the reconstruction loss terms and ignore the influence of the regularization terms, so the reconstruction loss can be rewritten as:

e_c(Y_k) = ‖f_G(Y_k) − D_G^c a_k^c‖²₂ + λ ‖f_P(Y_k) − D_P^c b_k^c‖²₂.   (10)
For a visual classification task, the most commonly used algorithm is Nearest Neighbor (NN), which finds the labeled sample closest to the query under a predefined metric and assigns the query to the category of that sample. Inspired by NN, we assume that features from the same class are easier to reconstruct, since their representations share similar embeddings, whereas features from different classes produce larger reconstruction errors. For the k-th query sample, we use the reconstruction loss defined in (10) as the classification measure, comparing the query against the overall representation of each whole category:

label(Y_k) = argmin_c e_c(Y_k),   (11)

where e_c(Y_k) is the reconstruction error of the c-th class computed by (10) and label(Y_k) is the predicted label of the k-th query sample.
Fig. 2 shows how the classification module works. Specifically, to classify the query sample, for each class , we only use labeled samples of the corresponding category for joint sparse representation to reconstruct the query sample. The query data can then be classified according to the weighted reconstruction error of each class.
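The per-class residual rule of Eqs. (10)–(11) can be sketched as follows. This is a hypothetical illustration: the argument names (query features, class labels per dictionary atom, joint codes) are assumptions for the sketch, not the authors' interface.

```python
import numpy as np

def classify_by_residual(y_g, y_p, D_g, D_p, a, b, labels, lam=1.0):
    """Classification rule of Eqs. (10)-(11): minimal per-class residual.

    y_g, y_p are the query's statistical and spatial features; a, b the
    joint sparse codes; labels[i] gives the class of the i-th atom.
    """
    classes = np.unique(labels)
    errors = []
    for c in classes:
        idx = labels == c                       # keep atoms of class c only
        e_g = np.sum((y_g - D_g[:, idx] @ a[idx]) ** 2)
        e_p = np.sum((y_p - D_p[:, idx] @ b[idx]) ** 2)
        errors.append(e_g + lam * e_p)          # reconstruction loss, Eq. (10)
    return classes[int(np.argmin(errors))]     # decision rule, Eq. (11)
```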
IV Algorithm
We have proposed a joint sparse representation model for image and image-set classification tasks. Each subproblem of (8) is convex, so we use alternating minimization to solve the optimization problem.
Update a_k, b_k: Setting the partial derivatives of the objective function with respect to a_k and b_k to zero gives:

2 D_Gᵀ (D_G a_k − f_G(Y_k)) + λ₁ Λ₁ a_k + λ₃ Λ a_k = 0,   (12)
2 λ D_Pᵀ (D_P b_k − f_P(Y_k)) + λ₂ Λ₂ b_k + λ₃ Λ b_k = 0,   (13)

where Λ₁ and Λ₂ are diagonal matrices with j-th diagonal elements 1/|a_k(j)| and 1/|b_k(j)|, and Λ is a diagonal matrix with the j-th diagonal element 1/√(a_k(j)² + b_k(j)²).
Thus we obtain the iterations for a_k and b_k:

a_k = (2 D_Gᵀ D_G + λ₁ Λ₁ + λ₃ Λ)⁻¹ · 2 D_Gᵀ f_G(Y_k),   (14)
b_k = (2 λ D_Pᵀ D_P + λ₂ Λ₂ + λ₃ Λ)⁻¹ · 2 λ D_Pᵀ f_P(Y_k).   (15)

Update Λ: Λ can be updated as:

Λ(j, j) = 1 / √(a_k(j)² + b_k(j)² + ε),   (16)

where ε is a small offset that keeps Eq. (16) well defined when both coefficients vanish.
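The alternating updates (12)–(16) amount to iteratively reweighted least squares. A minimal sketch under the reconstruction above, with illustrative default parameter values (the ε-smoothing is applied to the ℓ₁ weights as well, a common implementation choice not stated in the text):

```python
import numpy as np

def j3s_sparse_coding(y_g, y_p, D_g, D_p, lam=1.0, l1=1e-3, l2=1e-3, l3=1e-3,
                      iters=50, eps=1e-8):
    """Alternating minimization for problem (8) via the updates (14)-(16).

    y_g, y_p: query features; D_g, D_p: statistical and spatial dictionaries
    with the same number of atoms m. Returns the joint codes (a, b).
    """
    m = D_g.shape[1]
    a = np.zeros(m)
    b = np.zeros(m)
    GtG, Gty = D_g.T @ D_g, D_g.T @ y_g        # precompute Gram terms
    PtP, Pty = D_p.T @ D_p, D_p.T @ y_p
    L = np.eye(m)                               # joint-sparsity weights, Eq. (16)
    for _ in range(iters):
        L1 = np.diag(1.0 / (np.abs(a) + eps))   # l1 reweighting for a
        L2 = np.diag(1.0 / (np.abs(b) + eps))   # l1 reweighting for b
        a = np.linalg.solve(2 * GtG + l1 * L1 + l3 * L, 2 * Gty)              # Eq. (14)
        b = np.linalg.solve(2 * lam * PtP + l2 * L2 + l3 * L, 2 * lam * Pty)  # Eq. (15)
        L = np.diag(1.0 / np.sqrt(a**2 + b**2 + eps))                         # Eq. (16)
    return a, b
```

With orthonormal toy dictionaries the system is diagonal, so atoms uncorrelated with the query receive exactly zero weight while matching atoms converge toward the least-squares fit.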
IV-A Complexity Analysis
We now discuss the time and space complexity of the proposed J3S model. Compared with the dictionary construction and sparse coding steps, which jointly construct the dictionaries and learn the sparse codes, the classification step only reuses the results obtained from (8), so we do not consider its impact on the complexity of the algorithm. As the subproblems of (8) are convex and the objective function is lower-bounded, the optimization algorithm converges to a local minimum [28]. The time complexity consists of the updates of a_k, b_k, and Λ. Updating a_k requires forming and solving an m×m linear system, costing O(d_G m² + m³), and updating b_k similarly costs O(d_P m² + m³). Hence, the main time complexity of the proposed algorithm is O(T (d_G + d_P) m² + T m³), where T is the iteration number, m is the number of training samples, and d_G and d_P are the dimensions of the two dictionaries.
For space complexity, the proposed J3S model needs to store the two dictionaries D_G and D_P for all samples, a pair of sparse vectors a_k, b_k, and the corresponding diagonal matrix Λ for each query sample. Following the constructions above, the dimension of the statistical dictionary is d_G = (d+1)(d+2)/2 (the upper triangle of the (d+1)×(d+1) matrix log(G)), and the dimension of the unitary dictionary d_P is determined by the size of the vectorized patch dictionary.
IV-B A Simple and Effective Strategy for Model Acceleration
The time and space complexity of the J3S model depend on the dimensions of the two dictionaries. As described in the dictionary construction above, we already store only the triangular part of log(G) for dimensionality reduction. However, the dimension d_G of the statistical dictionary is still very high when deep features are used, since d_G grows quadratically with the feature dimension d, while the number of training samples is only in the hundreds. In this case d_G ≫ m, the time complexity is dominated by the d_G m² term, and this hinders the practical application of the proposed algorithm.
To reduce the time and space cost, we use the widely used principal component analysis (PCA) [26] to perform dimensionality reduction on the two dictionaries and eliminate redundant information across their different sizes. For simplicity, let D_G and D_P denote the dictionaries of the statistical Gaussian model and the patch-based model generated from the whole dataset, respectively.
Specifically, we learn principal component transformations that map the data from the original spaces to new low-dimensional spaces. Using the transformations P_G and P_P, we obtain the new low-dimensional dictionaries of the statistical and spatial models as:

D̃_G = P_Gᵀ D_G,   D̃_P = P_Pᵀ D_P.   (17)

After the PCA operation, we store these two matrices for sparse representation learning. In each iteration, we select the columns corresponding to the training samples of the c-th class to form the class sub-dictionaries used to optimize (8). With PCA, the dimensions of the two dictionaries are reduced to the same level as the number of samples, which reduces the time and space cost, i.e., the time complexity is no longer dominated by d_G but governed by the reduced dimension r ≪ d_G. The overall optimization procedure is summarized in Algorithm 1.
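The PCA projection of Eq. (17) can be computed directly from an SVD of the centered dictionary. A short sketch (the centering step and function name are implementation choices assumed here):

```python
import numpy as np

def pca_reduce(D, r):
    """PCA dimensionality reduction of a dictionary, as in Eq. (17).

    The top-r left singular vectors of the centered atom matrix give the
    projection P; the reduced dictionary is P^T (D - mean).
    """
    mean = D.mean(axis=1, keepdims=True)        # center the atoms
    U, S, _ = np.linalg.svd(D - mean, full_matrices=False)
    P = U[:, :r]                                # top-r principal directions
    return P.T @ (D - mean), P                  # reduced dictionary, projection
```

Since the rank of the centered dictionary is at most the number of atoms, choosing r near that rank preserves the reconstruction geometry while shrinking the row dimension from d_G to r.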
V Experiments
We present experimental results on video-based face recognition, material classification, and object recognition tasks to demonstrate the effectiveness of the proposed J3S classification algorithm.² We conduct experiments on four databases: the Flickr Material Database (FMD) [31], the UIUC Material Database [24], ETH80 [20], and YouTube Celebrities (YTC) [16]. FMD and UIUC are used for image-based classification, while ETH80 and YTC are used for image-set-based classification. These databases contain samples with different materials, views, illuminations, and even different modalities. Fig. 3 shows sample images of different categories from each database.
²The reproducible implementations of the J3S algorithms will be made publicly available upon paper acceptance.
The Flickr Material Database (FMD) contains 10 material categories with 1000 images in the wild [31]. Each image is collected from Flickr.com with variations in illumination, rotation, and scale. We process the images with a VGG-VD16 model pretrained on the ImageNet database and employ the output of the last convolution layer as local features. Following [37], we randomly choose 50 images in each category for the gallery and the other 50 as probes, and repeat this experiment ten times.
The UIUC Material Database contains 216 images of 18 material categories in the wild [24]. We obtain the deep features of the UIUC materials in the same way as for FMD. We randomly choose half of the images in each category for the gallery and the other half as probes.
The ETH80 database contains 80 image sets of 8 object categories [20]. Each category has 10 object instances, each with 41 images from different views. Following [39], we randomly choose 5 objects as the gallery and the other 5 as probes in each category. Each image is resized to a common resolution and the intensity feature is used, so each image set can be expressed as a matrix whose columns are the 41 vectorized images.
The YouTube Celebrities (YTC) database contains 1910 video clips of 47 subjects [16], with a different number of frames in each video. Following [39, 13], we use histogram equalization to remove lighting effects in the preprocessing step, and randomly select 3 videos per subject for the gallery and 6 videos as probes. Each frame is resized to a common resolution with the intensity feature, so each video can be expressed as a matrix whose columns are its n vectorized frames, where n is the number of frames in the video.
Methods  ETH80  FMD  UIUC  YTC 

AHISD(linear) [2]  72.50  46.72  55.37  64.65 
AHISD(nonlinear) [2]  72.00  46.72  55.37  66.58 
CHISD(linear) [2]  79.75  47.52  65.09  67.24 
CHISD(nonlinear) [2]  72.50  63.90  65.65  68.09 
MMD [40]  85.75  60.60  62.78  69.60 
MDA [38]  87.75  62.50  67.13  64.72 
SPDMLAIRM [9]  90.75  63.42  74.72  67.50 
SPDMLStein [9]  90.75  63.80  68.24  68.10 
LEML [13]  93.50  66.60  69.17  69.85 
RMMLSPD [47]  95.00  68.88  70.09  78.05 
RMMLGM [47]  93.00  69.62  76.48  69.15 
CDLLDA [39]  94.00  76.92  78.89  70.21 
CDLPLS [39]  94.00  75.36  76.39  69.94 
RSR [10]  91.50  74.92  72.59  72.77 
KGDL [11]  93.00  77.40  76.32  73.91 
DRM [12]  98.12  N/A  N/A  72.55 
MMDML [25]  94.50  N/A  N/A  78.50 
J3S w/o Spatial Dict.  95.25  81.40  83.43  82.87 
J3S  96.00  82.58  84.07  83.09 
V-A Competing Methods
To illustrate the effectiveness of the proposed model, we compare our method with representative subspace-based, nonlinear manifold, statistical, sparse representation, and deep learning-based methods.
V-B Parameter Setting
We apply the implementations of the competing methods provided by the authors, with the default settings suggested by the corresponding papers. For MMD, the retained PCA energy follows the default setting. For MDA, we set the number of local models, between-class NN local models, and the subspace dimension as in [38]. For SPDML, we implement both the SPDML-AIRM and SPDML-Stein versions; in both, following [9], the neighborhood parameter is set to the minimum number of samples in one class, and the dimension of the low-dimensional manifold is tuned by 5-fold cross-validation. We compare with both the linear and nonlinear versions of AHISD and CHISD [2], where the retained PCA energy in nonlinear AHISD and the error penalty in CHISD are set as in [2]. For LEML, the two parameters are tuned over the ranges used in the original paper. For RMML, the parameters are set and tuned following [47]. For CDL, the distance metric is learned with linear discriminant analysis (LDA) and partial least squares (PLS) in Hilbert space; the reduced feature dimension for LDA is set to C − 1, where C is the number of classes. For RSR and KGDL, we use SPAMS as the sparse solver and set the other parameters as suggested in the papers. The dimension of the subspace of the Grassmann manifold in KGDL is set to 10.
The proposed J3S method has four parameters: the weighting parameter λ and the three regularization parameters λ₁, λ₂, and λ₃. The weighting parameter λ balances the two sparse representation models and is adjusted for the scale of each database. For databases such as UIUC, which contain only a few labeled samples per class, the statistical dictionary may struggle to represent reliable and complete class information; we therefore set λ to a small value to mitigate the impact of the first term in Eq. (8), while we use a larger value for the other databases. The regularization parameters λ₁, λ₂, and λ₃ are all set to the same value. Moreover, we use a common VGG-VD16 backbone pretrained on the ImageNet database as the feature extractor in this paper. The maximum number of iterations is set to 50, and we stop early when the difference between the losses of two consecutive iterations falls below a small threshold.
V-C Image and Image-Set Classification
Table II compares the classification results of the proposed method and all selected competing methods. We also include two deep learning methods, DRM [12] and MMDML [25], by quoting the results reported on the YTC database. Note that the classic methods randomly choose nine image sets for each class, three for training and the remaining six for testing, and report the average accuracy over ten runs. In contrast, the selected deep models divide the whole database into five folds with nine image sets per class and train until convergence, while the network input is still a single image. Our proposed J3S approach clearly outperforms all competing methods on the FMD, UIUC, and YTC databases. On the ETH80 database, our method outperforms all competing methods except DRM, which might be due to the way the data are processed: unlike the J3S model, DRM first computes LBP features of the training data and randomly generates subsets from the training samples, which strengthens the deep network. Moreover, during testing, the learned DRM model reconstructs each image of a test image set and adopts a voting strategy for classification, whereas our J3S model treats all samples in each image set as a single classification object. Table II also shows that jointly using the two dictionaries helps integrate complementary information to facilitate classification.
V-D Noisy Image Classification
We add i.i.d. Gaussian noise of increasing standard deviation to all training and testing data of the UIUC and FMD databases to generate noisy images for classification. Tables III and IV show the classification accuracy on the two databases under different noise levels. The results show that the proposed J3S method outperforms the competing methods under noise corruption. We also observe a clear downward trend in classification accuracy in both tables as the noise level increases; nevertheless, our proposed J3S model still performs better than all other models at every noise level. Moreover, we find that for methods with supervised dimensionality reduction, i.e., SPDML-Stein and CDL-LDA, the performance at a relatively high noise level can exceed that at a lower noise level on the UIUC database. This is partially because such models can discard the less critical noise components during the dimensionality reduction process.
Methods 

SPDMLStein  66.57  67.96  64.81  65.37 
SPDMLAIRM  74.26  73.06  71.02  70.37 
LEML  69.17  69.81  67.22  66.02 
CDLLDA  79.63  77.96  76.85  76.94 
CDLPLS  76.48  74.91  72.41  70.74 
J3S  83.61  82.41  81.39  80.46 
Methods  

SPDMLStein  62.86  58.54  54.94  52.12 
SPDMLAIRM  66.60  62.52  58.68  54.46 
LEML  66.52  63.82  59.80  56.76 
CDLLDA  76.60  74.62  71.90  70.98 
CDLPLS  74.24  71.68  69.02  66.44 
J3S  82.04  80.10  76.05  74.46 
V-E Ablation Study
V-E1 Weight Analysis
As stated in Eq. (8), the weighting parameter λ balances the two dictionary models. We conduct an experiment to investigate the effect of the weighting parameter setting on classification accuracy. Table V shows the classification accuracy on the ETH80, FMD, UIUC, and YTC databases for different values of λ. For the ETH80, FMD, and YTC databases, the classification accuracy increases significantly as the weight of the statistical model grows from zero, owing to the introduction of higher-order Gaussian information. Compared with the spatial model, the statistical Gaussian model is more discriminative, but it still needs the spatial model to capture local information. Thus, once the weighting parameter reaches a certain level (e.g., from 0.5 to 0.7), increasing it further only makes the accuracy fluctuate within a small range. In contrast, the classification accuracy on the UIUC database becomes worse as the weighting parameter increases. A potential explanation is that the statistical dictionary may be unreliable for representing the entire information of a class when only a few labeled samples are available, as in the UIUC database; as the weighting parameter increases, the statistical term dominates the loss function, so the classification accuracy decreases.
Databases 

ETH80  94.00  95.00  96.00  96.00  95.00 
FMD  80.58  81.86  82.50  82.36  82.50 
UIUC  84.07  83.06  83.24  83.43  83.33 
YTC  76.70  80.92  82.70  83.09  83.01 
Moreover, Table VI shows the classification results of the J3S model with and without PCA under different values of the weighting parameter. We observe that the proposed J3S model achieves its best performance at the same weighting parameter value under both settings. Meanwhile, after PCA dimensionality reduction, the highest accuracy improves slightly (from 83.98% to 84.07%) while the algorithm complexity decreases, which demonstrates the effectiveness of the J3S model with the PCA strategy.
Settings  

J3S w/o PCA  83.98  83.33  82.50  82.36  82.50 
J3S  84.07  83.06  83.24  83.43  83.33 
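The PCA step compared above can be sketched as follows. This is a generic PCA projection (function name and interface are illustrative), not necessarily the paper's exact pipeline:

```python
import numpy as np

def pca_reduce(X, k):
    """Project feature vectors (rows of X) onto their top-k principal components.

    Generic sketch: center the data, take the leading right singular
    vectors of the centered matrix, and project onto them.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k], mu
```

Reducing the feature dimension this way shrinks the dictionaries and sparse codes, which is where the complexity saving reported above comes from.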
V-E2 Feature Selection for Dictionary Construction
To identify a proper feature for constructing the dictionary that extracts local information, we try different unitary dictionaries built from deep feature maps, original gray images, and RGB images, respectively. Table VII shows the classification accuracy of the proposed J3S model with different feature selections on clean and noisy image data of the UIUC database. We observe that without a unitary dictionary, the model performs worse than all other settings under the noisy condition, which partly illustrates the contribution of the spatial module to the robustness of the model. Meanwhile, the accuracy of the methods based on a unitary dictionary of gray or RGB images drops less under noise than that based on a deep feature map. Since the noisy images are fed into a deep CNN pretrained on clean data, the resulting deep features are harder to separate from the noise than shallow image features.
Settings  Acc (Clean)  Acc (Noise)

w/o unitary dict.  83.43  80.00 
w/ Deep feature based unitary dict.  84.07  80.46 
w/ Gray image based unitary dict.  83.15  80.12 
w/ RGB image based unitary dict.  80.37  80.18 
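A standard way to obtain a unitary dictionary for a fixed set of sparse codes is the orthogonal-Procrustes update used in transform and dictionary learning. The sketch below assumes this closed-form update, which may differ from the paper's exact construction:

```python
import numpy as np

def unitary_dictionary_update(Y, X):
    """Closed-form unitary dictionary for min ||Y - D X||_F^2 s.t. D^T D = I.

    Orthogonal-Procrustes solution: with U S V^T = SVD(Y X^T), the optimal
    unitary D is U V^T. Y holds patches (or features) as columns, X the
    corresponding sparse codes. A generic sketch, not the paper's exact step.
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt
```

Whether Y holds deep-feature, gray, or RGB patches is exactly the choice compared in Table VII.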
V-E3 Convergence Analysis
According to Eq. (8), each part of the objective function is easily shown to be convex. With alternating minimization, the optimization problem can be divided into two convex subproblems and solved easily. Fig. 5 shows that the J3S model converges within a few iterations under different regularization parameters. Meanwhile, after only one iteration, the J3S model already reduces the loss to near its converged value. Moreover, comparing the two subplots of Fig. 5, we find that increasing the regularization parameters of the J3S model makes the convergence more stable but requires more iterations to converge fully, i.e., smaller regularization parameters need fewer iterations to converge. A potential explanation is that the regularization parameters of the two sparse models control the sparsity of the coefficient vectors for each query sample, and the influence of the regularization terms on the J3S objective is proportional to the scale of the corresponding parameters.
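The alternating scheme described above can be illustrated with a simplified two-block example. The sketch below substitutes plain l1-regularized least squares solved by ISTA steps for each block (an assumption, not the paper's exact subproblems) and records the weighted loss per iteration, which is monotonically non-increasing:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def alternate_minimize(y1, D1, y2, D2, w, lam1, lam2, iters=10):
    """Alternately update two sparse codes; each subproblem is convex.

    One ISTA step per block per iteration, with step size 1/L where L is the
    squared spectral norm of the corresponding dictionary.
    """
    a = np.zeros(D1.shape[1]); b = np.zeros(D2.shape[1])
    L1 = np.linalg.norm(D1, 2) ** 2 + 1e-12
    L2 = np.linalg.norm(D2, 2) ** 2 + 1e-12
    losses = []
    for _ in range(iters):
        a = soft_threshold(a - (D1.T @ (D1 @ a - y1)) / L1, lam1 / L1)
        b = soft_threshold(b - (D2.T @ (D2 @ b - y2)) / L2, lam2 / L2)
        loss = (1 - w) * (0.5 * np.sum((y1 - D1 @ a) ** 2) + lam1 * np.abs(a).sum()) \
             + w * (0.5 * np.sum((y2 - D2 @ b) ** 2) + lam2 * np.abs(b).sum())
        losses.append(loss)
    return a, b, losses
```

Larger lam1/lam2 tighten the sparsity constraint, mirroring the stability-versus-speed trade-off observed in Fig. 5.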
Methods  1  2  3  4  5  6 (number of labeled samples per class)

SPDMLStein  42.41  53.70  60.56  64.81  66.57  68.24 
SPDMLAIRM  48.06  62.13  68.24  70.83  72.69  74.72 
LEML  N/A  53.43  61.76  65.09  67.59  69.17 
CDLLDA  26.48  38.89  47.22  51.56  65.19  78.89 
CDLPLS  55.19  67.96  72.64  74.56  75.56  76.39 
CNN+Mean+SVM  59.60  70.44  75.56  77.78  78.70  81.67 
CNN+Gau+SVM  61.61  72.22  77.40  79.58  81.90  84.01 
J3S  61.94  75.56  78.89  80.83  82.41  84.07 
V-F Few-shot Classification
We consider a popular and challenging setting, i.e., the few-shot setting for image classification, to investigate the robustness of the proposed J3S model given only limited supervision. For few-shot learning, the whole dataset is divided into two non-overlapping label sets, i.e., a training set and a testing set. Following the meta-learning strategy [33], most existing few-shot methods construct N-way K-shot tasks on the testing set to evaluate the generalization performance of models trained on the training set. Here N and K are the numbers of classes and labeled samples per class, respectively, and both are often small.
Typical few-shot methods follow a methodology that learns the models only on the training set and tests classification accuracy on the testing set. Unlike this learning strategy, the proposed J3S model, which is based on representation learning, solves the classification task via the reconstruction loss computed from two coefficient vectors learned from both the labeled training samples and each query sample. Slightly differing from the typical few-shot setting, we set N to the total number of categories rather than a commonly used small fixed value, while K is still set to a small number, identical to the few-shot setting.
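The episode construction described above (N fixed to all categories, K small) can be sketched as follows; `sample_episode` and its interface are hypothetical:

```python
import random
from collections import defaultdict

def sample_episode(labels, k_shot, seed=None):
    """Sample a K-shot support set over ALL classes; the rest form the query set.

    `labels` maps sample index -> class id. Every class contributes k_shot
    support samples, matching the setting described above where N equals the
    total number of categories. A generic sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)
    support, query = [], []
    for c, idxs in by_class.items():
        idxs = idxs[:]          # copy before shuffling
        rng.shuffle(idxs)
        support += idxs[:k_shot]
        query += idxs[k_shot:]
    return support, query
```

The support samples play the role of the labeled dictionary-learning samples in Table VIII.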
Table VIII shows the classification accuracy on the UIUC database with different numbers of labeled samples per class used for dictionary learning. As mentioned before, the UIUC database has only six training samples per class in total, so the general classification setting with all training samples is itself a kind of few-shot setting. From Table VIII, we observe that our proposed J3S model outperforms the other methods in all few-shot settings, i.e., from 1 to 6 labeled samples per class. Note that almost all Gaussian-based models, i.e., CDL-LDA, CDL-PLS, the two Gaussian-based CNN models, and our proposed J3S model, perform well even with only a few training samples per class. This demonstrates that the global statistical model can encode rich information about a class, enhancing the classification capability in few-shot tasks. Meanwhile, we observe that our J3S model and the CNN+Gau+SVM method perform better than the other two Gaussian-based methods, CDL-LDA and CDL-PLS. A potential explanation is that CDL-LDA and CDL-PLS learn only from the covariance matrix, while the J3S model and CNN+Gau+SVM jointly exploit first-order and second-order information, which leads to better accuracy. Moreover, CDL-LDA performs poorly in the 1-shot setting because the feature dimension is much larger than the number of training samples. After feature projection (so-called dimensionality reduction), the LDA-based model cannot maintain the difference among neighbors while keeping the within-class variance minimal for nearest-neighbor classification. Compared to LDA, PLS has proven helpful in this situation, as it is not limited by the low discrimination dimensions.
Additionally, as the supervised information decreases, i.e., as the number of labeled samples per class drops from 6 to 2, the gap between our proposed J3S model and the CNN+Gau+SVM method becomes larger. This might be due to the effectiveness of spatial information when the statistical model cannot provide sufficient information for classification. However, in the 1-shot setting, the differences among all methods become minor, because only one support sample per category can be utilized for learning, which provides insufficient information for classification and makes the model vulnerable to bias.
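The reconstruction-loss decision rule used throughout this section can be illustrated with an SRC-style residual comparison. This is a generic sketch (the per-class dictionaries and precomputed codes are assumptions for illustration, not the paper's exact joint J3S loss):

```python
import numpy as np

def classify_by_residual(y, class_dicts, codes):
    """Assign query y to the class whose dictionary reconstructs it best.

    `class_dicts[c]` is the dictionary of class c and `codes[c]` the sparse
    code already solved for y on that class; the label with the smallest
    reconstruction residual wins.
    """
    residuals = {c: np.linalg.norm(y - class_dicts[c] @ codes[c])
                 for c in class_dicts}
    return min(residuals, key=residuals.get)
```

Because the decision depends only on per-class reconstruction quality, the rule needs no separate classifier training, which is why it extends naturally to the few-shot regime above.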
VI Conclusion
In this paper, we proposed a novel J3S model for robust image and image-set classification. A Gaussian distribution is used to preserve high-order image statistics, while patch-based sparse representation captures local image structure. A simple and effective PCA dimensionality reduction step is employed to lower the algorithm complexity. We conducted experiments on four popular databases for clean and noisy image classification tasks. Moreover, we conducted a parameter sensitivity analysis and tested the robustness of the algorithm under the popular few-shot setting. The results show that our proposed method achieves superior performance compared to a variety of algorithms under several settings.
References
 [1] (2007) Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications 29 (1), pp. 328–347. Cited by: §II, §IIIC.
 [2] (2010) Face recognition based on image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2567–2573. Cited by: TABLE I, §II, 1st item, §VB, TABLE II.
 [3] (2020) Joint statistical and spatial sparse representation for robust image and image-set classification. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 2411–2415. Cited by: §I, footnote 1.
 [4] (2016) Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE transactions on neural networks and learning systems 28 (12), pp. 2859–2871. Cited by: §I, §II.
 [5] (2009) Imagenet: a largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I, §I.
 [6] (2016) Kernel combined sparse representation for disease recognition. IEEE Transactions on Multimedia 18 (10), pp. 1956–1968. Cited by: §I.
 [7] (2014) Concurrent singlelabel image classification and annotation via efficient multilayer group sparse coding. IEEE Transactions on multimedia 16 (3), pp. 762–771. Cited by: §I.
 [8] (2015) Patchsetbased representation for alignmentfree image set classification. IEEE Transactions on Circuits and Systems for Video Technology 26 (9), pp. 1646–1658. Cited by: §I.
 [9] (2014) From manifold to manifold: geometryaware dimensionality reduction for SPD matrices. In European Conference on Computer Vision, pp. 17–32. Cited by: 2nd item, §VB, TABLE II.
 [10] (2012) Sparse coding and dictionary learning for symmetric positive definite matrices: a kernel approach. In European Conference on Computer Vision, pp. 216–229. Cited by: TABLE I, §II, 4th item, TABLE II.
 [11] (2013) Dictionary learning and sparse coding on Grassmann manifolds: an extrinsic solution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3120–3127. Cited by: TABLE I, §II, 4th item, TABLE II.
 [12] (2014) Deep reconstruction models for image set classification. IEEE transactions on pattern analysis and machine intelligence 37 (4), pp. 713–727. Cited by: TABLE I, §II, 5th item, §VC, TABLE II.
 [13] (2015) Logeuclidean metric learning on symmetric positive definite manifold with application to image set classification.. In International Conference on Machine Learning, pp. 720–729. Cited by: TABLE I, §II, 2nd item, TABLE II, §V.
 [14] (2020) Learning lowrank sparse representations with robust relationship inference for image memorability prediction. IEEE Transactions on Multimedia. Cited by: §I.
 [15] (2011) Feature-based sparse representation for image similarity assessment. IEEE Transactions on Multimedia 13 (5), pp. 1019–1030. Cited by: §II.
 [16] (2008) Face tracking and recognition with visual constraints in real-world videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §I, §V, §V.
 [17] (2007) Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6). Cited by: TABLE I, §II.
 [18] (2014) Dirichletbased histogram feature transform for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3278–3285. Cited by: §I.
 [19] (2009) Lowrank kernel learning with bregman matrix divergences.. Journal of Machine Learning Research 10 (2). Cited by: §IIIA.
 [20] (2003) Analyzing appearance and contour based methods for object categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 402–409. Cited by: §I, §V, §V.
 [21] (2016) Image sharpness assessment by sparse representation. IEEE Transactions on Multimedia 18 (6), pp. 1085–1097. Cited by: §I.
 [22] (2015) From dictionary of visual words to subspaces: localityconstrained affine subspace coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2348–2357. Cited by: §I.
 [23] (2017) Is secondorder information helpful for largescale visual recognition?. In Proceedings of the IEEE international conference on computer vision, pp. 2070–2078. Cited by: §I.
 [24] (2013) Nonparametric filtering for geometric detail extraction and material representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §V, §V.
 [25] (2015) Multi-manifold deep metric learning for image set classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: TABLE I, §II, 5th item, §VC, TABLE II.
 [26] (2001) Pca versus lda. IEEE transactions on pattern analysis and machine intelligence 23 (2), pp. 228–233. Cited by: §IVB.
 [27] (2020) Prominent local representation for dynamic textures based on highorder gaussiangradients. IEEE Transactions on Multimedia. Cited by: §I.
 [28] (2009) Adaptive alternating minimization algorithms. IEEE Transactions on Information Theory 55 (3), pp. 1423–1429. Cited by: §IVA.
 [29] (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar conference on signals, systems and computers, pp. 40–44. Cited by: §IIIB.
 [30] (2015) Sparsifying transform learning with efficient optimal updates and convergence guarantees. IEEE Transactions on Signal Processing 63 (9), pp. 2389–2404. Cited by: §IIIB.
 [31] (2009) Material perception: what can you see in a brief glance?. Journal of Vision 9 (8), pp. 784–784. Cited by: §V, §V.
 [32] (2017) Learning deep match kernels for imageset classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3307–3316. Cited by: TABLE I, §II.
 [33] (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §VF.
 [34] (2020) Learning adaptive neighborhood graph on grassmann manifolds for video/imageset subspace clustering. IEEE Transactions on Multimedia 23, pp. 216–227. Cited by: §I.
 [35] (2020) Hardnessaware dictionary learning: boosting dictionary for recognition. IEEE Transactions on Multimedia. Cited by: §II.
 [36] (2017) G2DeNet: global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2730–2739. Cited by: §II.
 [37] (2016) RAIDG: robust estimation of approximate infinite dimensional Gaussian with application to material recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4433–4441. Cited by: §II, §IIIA, §V.
 [38] (2009) Manifold discriminant analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 429–436. Cited by: 2nd item, §VB, TABLE II.
 [39] (2012) Covariance discriminative learning: a natural and efficient approach to image set classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2496–2503. Cited by: TABLE I, §II, 3rd item, TABLE II, §V, §V.
 [40] (2008) Manifoldmanifold distance with application to face recognition based on image set. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: 2nd item, TABLE II.
 [41] (2020) Graph embedding multikernel metric learning for image set classification with grassmann manifoldvalued features. IEEE Transactions on Multimedia. Cited by: §I.
 [42] (2015) Structured overcomplete sparsifying transform learning with convergence guarantees and applications. International Journal of Computer Vision 114 (23), pp. 137–167. Cited by: §II.
 [43] (2017) FRIST—flipping and rotation invariant sparsifying transform learning and applications. Inverse Problems 33 (7), pp. 074007. Cited by: §I.
 [44] (2020) A settheoretic study of the relationships of image models and priors for restoration problems. arXiv preprint arXiv:2003.12985. Cited by: §IIIB.
 [45] (2009) Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2), pp. 210–227. Cited by: §II.
 [46] (2009) Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1794–1801. Cited by: §II.
 [47] (2018) Towards generalized and efficient metric learning on riemannian manifold.. In IJCAI, pp. 3235–3241. Cited by: TABLE I, §II, 2nd item, TABLE II.
 [48] (2014) Fast single image super-resolution via self-example learning and sparse representation. IEEE Transactions on Multimedia 16 (8), pp. 2178–2190. Cited by: §I.