An Interpretable Compression and Classification System: Theory and Applications

07/21/2019 ∙ by Tzu-Wei Tseng, et al. ∙ University of Southern California National Chiao Tung University 11

This study proposes a low-complexity interpretable classification system. The proposed system contains three main modules including feature extraction, feature reduction, and classification. All of them are linear. Thanks to the linear property, the extracted and reduced features can be inversed to original data, like a linear transform such as Fourier transform, so that one can quantify and visualize the contribution of individual features towards the original data. Also, the reduced features and reversibility naturally endure the proposed system ability of data compression. This system can significantly compress data with a small percent deviation between the compressed and the original data. At the same time, when the compressed data is used for classification, it still achieves high testing accuracy. Furthermore, we observe that the extracted features of the proposed system can be approximated to uncorrelated Gaussian random variables. Hence, classical theory in estimation and detection can be applied for classification. This motivates us to propose using a MAP (maximum a posteriori) based classification method. As a result, the extracted features and the corresponding performance have statistical meaning and mathematically interpretable. Simulation results show that the proposed classification system not only enjoys significant reduced training and testing time but also high testing accuracy compared to the conventional schemes.



There are no comments yet.


page 1

page 6

page 7

page 8

page 10

page 11

Code Repositories



view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Classification for multimedia has been studied and applied to a variety of applications such as security, entertainment, and forensics for years. The development of intelligent algorithms for classification results in efficiency improvements, innovations, and cost savings in several areas. However, classification based on visual content is a challenging task because there is usually a large amount of intra-class variability, caused by different lighting conditions, misalignment, blur, and occlusion. To overcome the difficulty, numerous feature extraction modules have been studied to extract the most significant features and the developed models can achieve higher accuracy. Selecting a suitable classification module to handle the extracted features according to their properties is also important. A good combination of feature extraction and classification modules is the key to attain high accuracy in image classification problems.

In general, classification models can be divided into two categories: 1) Convolutional neural networks (CNN) structure and 2) non-CNN based structure. The CNN structures are in the mainstream [1]-[6]

, and they are usually stacked up with several convolution layers, max-pooling layers, and ReLU layers. Hence the depth of the structure can be deep such as ResNet


. Due to the deep structure, CNN models can extract the features well and achieve high accuracies. However, CNN models usually share several disadvantages, including mathematically intractable, irreversible, and time-consuming. Most of the CNN models use handcrafted-based feature extraction or backpropagation to train the convolution layers which makes the models mathematically inexplicable and time-consuming. In addition, the max pooling and ReLU layers are nonlinear, hence the structures become irreversible. The above disadvantages increase the difficulty in designing and adjusting suitable structures for various types of data sources. Therefore, research has been conducted to design a model which can be easily and fully interpretable mathematically like the second category introduced below.

Classification models using non-CNN based structures are mainly built with two parts, feature extraction modules, and classification modules. To design a model which is mathematically interpretable, several machine learning techniques for feature extraction or classification have been developed including various versions of principal component analysis (PCA)

[7]-[17], linear discriminant analysis (LDA) [18]-[25]

, support vector machine (SVM), nearest neighbor (NN), etc. For the feature extraction modules, in

[7]-[17] the authors used PCA; while in [18]-[25], the authors used LDA to extract features from images. For the classification modules, [8], [9], [13], [17] and [21]

applied the nearest neighbor classifier. On the other hand,

[7] and [20] utilized the SVM classifier. The combinations of feature extraction and classification modules are generally nonlinear. Thus to inverse or recover the extracted features to original data is difficult.

Inversing the features to original data reveals important information for classification. For instance, if certain significant features are extracted and they result in high classification accuracy, one would be interested in visualizing or quantifying these features in the original multimedia if this is feasible. The reversibility is like Fourier series. One can inverse individual Fourier series, quantify how they contribute to the original signals and understand the importance of individual series. Due to the difficulty of data reversibility in most existing solutions, methods to compress data efficiently dedicated for classification purposes have not been well addressed yet.

Data compression for classification purposes lead to not only reduced storage size but also decreased computational complexity. For instance, in a classification problem, if the extracted features can be reduced while the reduced features can achieve satisfactory testing accuracy and the inversed image from the reduced features still keep important characteristics of the original image as well, one may store the reduced features instead of all features. Also, the reduced features can be used for classification to decrease computational complexity. Furthermore, such data compression is interpretable because it keeps the most significant features for classifications.

In this study, we propose an interpretable classification system that can achieve high testing accuracy with low complexity. The proposed system is linear and inversible. Hence data compression via the system is possible. The proposed system can be divided into three parts including feature extraction, feature reduction and the classification for the reduced features. Thanks to the linear property of the proposed system, the extracted features can be inversed to original data to see insight into individual features. Thus, it is like a linear transform, such as Fourier transform, where both forward and backward directions are feasible. The proposed system can also significantly reduce the extracted features while it still maintains high classification accuracy. As a result, data compression for classification purposes is realizable via the proposed system. Moreover, we find that the extracted features in the proposed system can be approximated to uncorrelated Gaussian random variables. It is worth pointing out that the Gaussian approximation result was also observed in [7]. The Gaussian approximation result endues the proposed system and the extracted features statistical meaning. Hence classical estimation and detection theory can be applied to the proposed system for classifying the data [26]

. This motivates us to use the concept of maximum a posteriori (MAP) to detect the class of input images. Therefore for every input image, there is a probability that this image belongs to each class. The detected class is the one that has the maximum probability. Since every class has a probability for the input data, one can also determine top candidate classes for the data and develop more sophisticated algorithms to refine the classification results. Consequently the proposal is an interpretable compression and classification system.

To verify the classification ability of the proposed system, we conduct experiments in face recognition “who is this person?”. We use the dataset in

[27], Labeled Faces in the Wild (LFW). LFW is widely used by face recognition models like [1]-[4] and [8]. Experiment results show that the proposed scheme outperforms conventional systems in terms of both testing accuracy and computational complexity. Moreover, the training and testing time of the proposed system is much faster than conventional schemes. In a standard PC platform, two datasets with 2804 and 6592 images only take 11 and 16 seconds for training respectively. Also, less than 1 msec to recognize the class of one image. The accuracy reaches 97.61% for a 19-person dataset and 84.71% for a 158-person dataset. Furthermore, thanks to the linear property, the proposed system can inverse the reduced features to the original image and the compression ratio is significant. In our experiments, the proposed system can reduce the number of features from 12288 to 270, a compression ratio up to 45.51:1; at the same time, the average deviation between the original and compressed images is only 9.14%, and more importantly the testing accuracy is 97.61% for the 19-person dataset and 84.71% for the 158-person dataset.

The outline of the remaining parts is organized as follow: In Section II, we present the linear recognition model and corresponding algorithms. In Sections III-V, individual modules of the proposed system are explained in details, where Section III explains the linear feature extraction module, Section IV describes the linear feature reduction module, and Section V introduces the linear classification module. In Section VI, we provide experimental results to show the advantages of the proposed system. Conclusions and future works are given in Section VII.

Ii Proposed Recognition Model and Methods

Fig. 1: The proposed linear classification system consisting of the preprocessing, feature extraction, feature reduction, and classification modules.

A block diagram of the proposed recognition model is shown in Fig. 1, which consists of preprocessing, feature extraction, feature reduction, and classification modules. The preprocessing module handles the original defects of the dataset such as noise, contrast, and brightness of the images. The preprocessing module contains operations including object detection, image processing, and data augmentation. It is an adjustable module designed according to the dataset. The feature extraction module extracts the significant features out from the dataset and has multiple stages. Each of the stages is a linear transformation. Hence the transformation is invertible, and the images can be reconstructed from the extracted features. This point will become more clear later. After the feature extraction module, features are reduced for data compression purpose and avoiding overfitting. Finally, the reduced features are passed through the classification module, where we propose to use linear discriminant classifier because we find that a Gaussian statistical model can be assumed after the proposed linear transformation. Because the proposed system is linear and based on statistics, many results can be explained and improved mathematically. Let us explain individual modules more detailed in the following sections.

Iii Feature extraction with variable cuboid size

In this section, we introduce the feature extraction module, which is linear and hence both forward direction (data to extracted features) and inverse direction (features to original data) are feasible. The extracted features can be assumed to be uncorrelated Gaussian random variables. Those points are explained in the following subsections.

Iii-a Forward Direction

The proposed feature extraction module is a fast linear data-driven feedforward transformation with low space complexity. Referring to the block diagram in Fig. 1, let be the th preprocessed image of a size of , where . Hence is the total number of the input images. In the testing phase, the value of can be one; while in the training phase, it has to be more than one to find the transform kernels. Then, we feed all of the preprocessed images into the multi-stage proposed scheme and let the initial global cuboid be , where the superscript is the stage index and , and .

Assume that there are stages. At each of stage, we reshape the global cuboid into multiple non-overlapping local cuboids, perform principal component analysis on individual cuboids, collect the results, and reshape them into another global cuboid for the next stage. According to the size of input images, one can adjust the side lengths of the local cuboids in the vertical and horizontal direction at each stage. Let the side lengths of the local cuboids in stage be , where , are adjustable. The values of , and satisfy


When the initial input and the side lengths , are given, the input can be processed by the multi-stage scheme. The dataflow and the dimension conversion at Stage are shown in Fig. 2 and also explained in Steps 1-3 below: The stage index is with increasing order, i.e., .

Fig. 2: The dataflow, reshape operations and dimensions of cuboids between Steps and .

Step 1. Global cuboid to several local cuboids. Let the global cuboid of the th image at Stage be with size . The global cuboid is cut into several local cuboids with size at stage . With the constraints of the local cuboid side lengths, one can perfectly cut the global cuboid into non-overlapping local cuboids below


where and . For instance, letting the side lengths be 4. If the image at Stage is with size , the size becomes at Stage .

Step 2. Principal component analysis to local cuboids.

The principal components are used as the transform kernels. Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated observations into linearly uncorrelated sets. When data vectors are projected onto the principal components, the PCA coefficients have larger variances which can help in separating data into various classes.

First, we reshape the local cuboids to vectors , where is equal to the volume of the local cuboid defined as


and the reshape operation is defined as


Second, the principal components of the vectors

are calculated. The principal components are the eigenvectors of the covariance matrix of the data vectors. Let the mean for all data vectors be

given by


The covariance matrix is then calculated using


Then we find the eigenvectors of , where each is the th eigenvector of , and . Note that the notation is the th column of matrix . Hence are the principal components of . Third, we project the vectors onto the transform kernels (principal components) and obtain the PCA coefficients defined as


Step 3. Reshape PCA coefficients and form another global cuboid . We then reshape all the PCA coefficients obtained in Step 2 to place the coefficient vectors in the spectral direction using the following procedure:


Combining all , where and , global cuboid for next stage can be formed given by


where . The dimension conversion of global cuboids from Stage to Stage can be written as


Stop criterion. The whole process is stopped when the last stage, Stage , is approached, and the global cuboids is obtained.

Fig. 3: The dimension conversion of the global cuboid through stages. and are the forward and inverse of the proposed scheme between stage and , respectively.

At the final stage, one can obtain the final PCA coefficients given by


where is with the constraints of , and that . As a result, there is no growth in space complexity and the space complexity is throughout the whole process. Note that the space complexity doubles at each stage for the system in [7]. Then the global cuboid at the final stage can be reshaped to the PCA coefficient vector at the final stage:


The PCA coefficients at the final stage are also called features that are used to determine the class of the input images.

Iii-B Gaussian Approximation of Features

As mentioned in Step 2 of the previous subsection, the PCA is a procedure that can convert a set of possibly correlated coefficients into uncorrelated ones. Hence the coefficients at the final stage should be almost uncorrelated. To see this, the element of the correlation coefficient matrix of the PCA coefficients at the final stage is defined as


where is the mean of defined as


For example, let be the feature vector that will be detailed in Experiment 1 in Sec. VI, where the dimension is 90. Then, the obtained correlation coefficient matrix is shown in Fig. 5

. From the figure, we see that the PCA coefficients at the final stage are almost uncorrelated. In addition to the uncorrelated relationship among features, we also find that the statistics of individual features can be approximated by Gaussian distribution. The histograms of some randomly picked samples are shown in Fig. 

4, in which Gaussian approximation well matches the histograms.

Fig. 4: Gaussian approximation of the extracted features in the proposed system. (a) Feature No. 12, (b) Feature No. 55, (c) Feature No. 72, (d) Feature No. 90 from inputs which belong to class 5.

In addition to this dataset, we have also verified various datasets and observed similar Gaussian approximation results of the proposed system.

Knowing that the features are nearly uncorrelated Gaussian distribution, we can treat the features as the Gaussian mixture model (GMM). The GMM phenomenon was also reported in

[7]. Approximating the features using the GMM is useful and motivates us to use the linear discriminant classifier, a MAP-based detector, that will be introduced in Sec. .

Fig. 5: The correlation matrix of the extracted features, which is nearly diagonal with diagonal elements being 1 and the off-diagonal elements being smaller than 1.0e-05.

Iii-C Inverse Direction

Since the proposed scheme is a linear transformation, both forward and inverse transformations can be conducted. The inverse transformation is to reconstruct the PCA coefficients at the final stage to the preprocessed image . We elaborate on the inverse transformation in the following steps: Now the stage index is with decreasing order, i.e., .

Step 1. Reshape the PCA coefficients back to vectors . This is a simple inverse procedure of (10) given by


Step 2. Project the the vectors onto the inverse transform kernel. Referring to (9), the inverse kernel of is simply its inverse matrix. Hence the backward result of this step can be easily obtained by


where is the inverse matrix of .

Step 3. Reshape back to the local cuboids. Referring to (6), this step is given by


Step 4. Form global cuboid and take PCA coefficients. The method to collect the local cuboids to form global cuboid is the inverse of (4), and the method for taking the PCA coefficients for the previous stage is the inverse of (11).

Conduct Steps 1-4 for all stages, the preprocessed images can be recovered losslessly.

Iv Feature Reduction

The invertibility is an important property of the proposed system. One can reduce the features and thus keep significant elements of . By inversing with reduced features, the original image can still be reconstructed with important features reserved. One may treat this as a lossy data compression dedicated for classification purposes or feature filtering. Next, let us introduce how to reduce the features and how to recover data from the reduced features.

Iv-a Proposed Methods for Feature Reduction

To reserve the features that have the largest discriminations, the PCA can again be applied for this purpose. The property is that when data projects onto its significant principal components, the principal coefficients have larger variances. As a result, the principal coefficients with larger variances can be considered more significant and have larger discriminant power than those with smaller ones. Hence the problem reduces to keeping the features that have the largest variances. The variance vector of the features are defined as


where is the mean defined in (18). When is obtained, one can find the set of indices which corresponds to the largest variances defined as


Once the set is found, the features can be reduced accordingly. The feature vector after the feature reduction can be obtained by


The feature vector has a size reduction from to . Also, contains the top few significant features of the th image and represents this image in training or testing phase.

Determining a suitable value of is important. This value can be determined in the training phase. More specifically, given a set of training data, one can set so that the classier has the best performance and this will be shown in the simulation results later.

Iv-B Recover Images from Reduced Features

As mentioned in Sec. (III-C), the features before feature reduction can be used to reconstruct the preprocessed images losslessly. This is also true for the features after feature reduction with lossy image reconstruction. The inverse transform is different only at Stage , which is elaborated below. For Stage :

Step 1. Form transform kernel at Stage with feature reduction. The transform kernel can be obtained by extracting the columns indexed by set given by


Step 2. Project features after feature reduction. The feature vector is projected onto the Moore-Penrose pseudo-inverse matrix of transform kernel, i.e., and yield


where is the pseudo-inverse matrix of in (25).

Step 3. Reshape to local cuboids. The vector is shaped to the local cuboids,


After Steps 1-3 for Stage , the same procedure for Steps 1-4 introduced in Sec.III-C shall be conducted for other stages from to . After -stage of inverse transformation, the initial global cuboid is reconstructed. Due to the energy loss from the eliminated features, image processing is needed for reducing the reconstruction loss. First, we compensate the energy loss using a brightness gap defined as


Second, we apply histogram equalization to enhance the contrast and fix the range span as well. The histogram equalization was widely used, see e.g., [28] and [29]. Here, we use the function provided by the Matlab. Finally, the recovered image can be obtained with a slight loss defined as


In Fig. 6, we provide an example using an image from the testing dataset that will be introduced in Sec. VI later. The original features of the image in the red layer is 4096, and we recover the image using only 90 features. In row (a), the red layer of the original image and its histogram are shown. In row (b), we show the recovered image without any image processing and we can see the image is darker than the original one and the contrast is poor, though one can still recognize the image. In row (c), the brightness has been compensated by adding defined in (28). Hence the mean of the histogram shifts. In row (d), we show the recovered image using (29). One can see that no matter from the image or its histogram, they are the closest to those in row (a). Thus the image processing in (28) and (29) really helps when recovering images from reduced few features.

Fig. 6: (a): Original image and its histogram. (b): Recovered image without any image processing. (c): Recovered image after adding brightness. (d): Recovered image after adding brightness and histogram equalization.

When , reconstructing images from the reduced features can be regarded as a lossy data decompression. Such data compression and decompression are meaningful for data classification. More specifically, if only most significant features are needed to achieve a target classification performance, one can simply store the data set with features instead of features because both have comparable classification performance. Namely, we only store the top significant features , instead of the original data . Such a concept may be used to reduce the data size and computational complexity in classification algorithms. The compression ratio with feature reduction can then be written as


The examples in Fig. 6 (a) and (d) thus have a compression ratio of .

V Linear Discriminant Classifier

After the feature reduction, the reduced features can be fed into the classifier, which is introduced here. As discussed in Sec. III-B, the features can be approximated to be uncorrelated Gaussian random variables. Hence classical theory in estimation and detection can be used and the resulting performances have theoretical explanations. Now we have the reduced feature vector

and would like to determine which class it belongs to. To treat such a problem, classical maximum a posteriori (MAP) estimator provides a good reference. In the area of machine learning and pattern recognition, this concept is usually called linear discriminant analysis (LDA), which finds a linear combination of features that can separate two or more classes. Here we combine the two concepts, LDA and MAP estimator to design the linear discriminant classifier introduced as follows:

First, let be the output of the classifier, which is the class that belongs to, and be Class . The MAP estimator is given by



is the number of classes. The posterior probability is defined as


where is written as


Since is a constant for all classes, one can ignore the denominator in (32). As for the numerator in (32), since we consider the input features are multivariate Gaussian distributed, the priori probability can be written as


where is the number of training images which are with class , and is the number of all training images. The mean of is defined as


and the covariance matrix is defined as


where , , are the feature vectors belong to class .

In the linear discriminant classifier, we need a common covariance matrix so as to keep the likelihood function linear. In this work, we propose to use the pooled covariance matrix as the common matrix. The pooled covariance matrix is defined as


The likelihood function for multivariate Gaussian random variables is then given by


where recall that is the number of reduced features.

To simplify the calculation, referring to (32), we take the natural logarithm, and ignore the denominator of the posterior probability function , i.e.,


The term can be rewritten as


From (37), since the covariance is common, and can be ignored. Then the remaining term can be written in a linear form consisting of a weight and a bias term for an individual class given by


where the term is the weight and the term is the bias. Finally, the classifier can be written in a linear equation form as follows:


We summarize the proposed scheme for training and testing including the linear transformation of feature extraction, feature reduction and linear discriminant classifier in respectively in Algorithms 1 and 2.

0:   preprocessed training images . Local cuboid spatial sizes in stage for .
0:  Transform kernels . Weight and bias . Remaining indices .
1:  Initialization: . Let global cuboid size be , where , and . - - - - - - - - - - - Feature Extraction - - - - - - - - - - -
2:  for  do
3:     Form local cuboids using (4) and reshape them to vectors using (6).
4:      and
5:     for  do
6:        for  do
7:           Find the transform kernel (PCs) using eigenvector-based method.
8:           Project onto to obtain PCA coefficients using (9).
9:           Reshape to to form global cuboid using (11).
10:        end for
11:     end for
12:  end for
13:  Obtain final PCA coefficients at the final stage and reshape them to vector . - - - - - - - - - - - Feature Reduction - - - - - - - - - - -
14:  Reduce to obtain the feature vector using Sec. , and store the indices of remaining features.
15:  For colored images, repeat 1-15 for each primary color layer, and cascade all features to a vector . - - - - - - - Linear Discriminant Classifier - - - - - - -
16:  Feed and labels into the Linear Discriminant Classifier, and obtain weight and bias .
Algorithm 1 Training Process
0:  The preprocessed testing image(s) . Local cuboid spatial sizes same as Algo. 1. Transform kernels . Weight and bias . Remaining indices .
0:  The classification accuracy.
1:  Initialization: . Let global cuboid size be , where , and . - - - - - - - - - - - Feature Extraction - - - - - - - - - - -
2:  for  do
3:     Repeat Steps 3-4 in Algo. 1.
4:     for  do
5:        for  do
6:           Read the transform kernel .
7:           Repeat Steps 8-9 in Algo. 1.
8:        end for
9:     end for
10:  end for
11:  Repeat Step 13 in Algo. 1. - - - - - - - - - - - Feature Reduction - - - - - - - - - - -
12:  Reduce features according to remaining indices to obtain feature vector .
13:  Repeat Step 15 in Algo. 1. - - - - - - - Linear Discriminant Classifier - - - - - - -
14:  Use the classifier in (43) to estimate which class the th image belongs to.
15:  Use label(s) to check the classification accuracy.
Algorithm 2 Testing Process

Vi Experimental Results

In this section, the experiment results are provided to show the performance of the proposed classification system.

Dataset and Hardware. Labeled Faces in the Wild (LFW) is a well known academic test set for face verification [1]. There are two datasets used in the experiments. For the first data set, 158 classes are selected from the whole dataset which has more than 6 images for training and 2 images to testing per class. For the second data set, 19 classes are selected out of the 158 classes in the first data set, which has more than 30 images for training and 10 images for testing. Table I details the size of the two datasets. Intel(R) Core(TM) i7-8700CPU and 16GB RAM is used to conduct the experiments.

Training data Testing data Total Total
classes per class per class training data testing data
TABLE I: Dataset Details

Image Preprocess. Firstly, the face is detected and the background of the image is blacked via an open-source python file [30]. Then all the images are resized to . An overview of the images after the face detection operation and resizing is shown in Fig. 7. Secondly, the training images are augmented. The reason is that in the first dataset, there are some classes that have only 6 images for training, which is insufficient to train a good model. Also, some faces do not face the same directions as shown in Fig. 7. Therefore, we augment the training images by flipping the images horizontally and make the number of images doubles. Thirdly, we separate the three primary colors of images into R, G, B 3 layers and then equalize the 3 layers of histogram individually. This step enhances the contrast of images and enable the separation of the three primary color layers.

Fig. 7: An overview of images after face detection and resizing.

After preprocessing, the training data are passed through the proposed system, and Algorithm 1 is used to process the data.

Experiment 1: Testing accuracy as functions of various numbers of features. As discussed in Sec. IV, the number of features id reduced from to to avoid overfitting in training classifier, where for one primary color layer in the experiments. For designing a face recognition model, the main goal is to achieve high testing accuracy, and thus is decided mainly according to the testing accuracy. Let the local cuboid sizes be set as , , , which is a 3-stage scheme. Table II shows the testing accuracy corresponding to 8 different numbers of features. Several observations are summarized: First, more features do not necessarily lead to better testing accuracy. This is because more features may result in overfitting problem and hence feature reduction is needed. Second, for both datasets, the best testing accuracy occurs when for one primary color layer and a total of 270 features for three layers. Consequently, the compression ratio is , which is a significant reduction of data size.

features classes features classes
19 158 19 158
12288 12.61% 0.10% 300 97.17% 83.53%
9000 5.43% 0.10% 270
6000 7.39% 23.08% 240 97.17% 84.12%
3000 6.09% 61.24% 90 93.91% 78.60%
900 96.09% 81.66% 45 87.61% 63.51%
TABLE II: Testing accuracy as functions of various numbers of features

Experiment 2: Testing accuracy and computations as functions of various local cuboid sizes. In this experiment, three sets of variable local cuboid sizes are used to figure out efficient local cuboid sizes. Efficiency considers the computational speed and testing accuracy. The computational speed depends on the number of stages. When the local cuboid sizes are large, the proposed multi-stage scheme has fewer stages and hence fewer computations, but the feature extraction might lose the precision of extracting the most significant features, and it leads to bad accuracy in the end. Thus a good local cuboid size should be determined for achieving high testing accuracy with low computations. We consider the following three settings of local cuboid sizes: Setting 1: There is one stage with . Setting 2: There are 3 stages with , , . Setting 3: There are 6 stages with , , , , , . Here, the number of extracted features is fixed to 270, which leads the best testing accuracy according to Experiment 1. The results are shown in Table III. Observed from the table that Settings 2 and 3 both achieve the best accuracy. However, Setting 2 has a lower computational cost than that in Setting 3. Hence Setting 2 is an efficient setting of local cuboid sizes for this experiment.

19-classes 158-classes
Setting Time(Train) Time(Test) Accuracy Time(Train) Time(Test) Accuracy
1 s s s s
2 s s s s
3 s s s s
TABLE III: Testing accuracy and computations as functions of various local cuboid sizes.

Experiment 3: Performance with multiple candidates. Following Experiments 1 and 2, we show all the images that are incorrectly recognized from the 19-classes testing dataset in Fig. 8. We see that there are some problematic images which are marked with red and blue rectangles. In the images with red rectangles, the faces are covered by hands; while in the image with a blue rectangle, the error actually cames from image preprocessing (face detection operation). When we eliminate those problematic images or select a better face detection operation, the testing accuracy can be improved from 97.6 to 98.5. Similarly, in Fig. 9, we displayed the incorrectly recognized problematic images from the 158-classes testing dataset. Some of them are seriously affected by sunglasses, hands, and even other’s shoulder. When we exclude these errors, the accuracy improves from 84.71 to 86.3.

Fig. 8: All the 11 incorrectly recognized images from the 19-classes dataset. There are some problematic images boxed in red and blue rectangles. If they are eliminated, the accuracy can reach .
Fig. 9: All the incorrectly recognized problematic images from the 158-classes dataset. If these unavoidable errors are eliminated, the accuracy can reach .

Additionally, because the proposed scheme classifies the images using a MAP-like estimator, for each image the classifier outputs the individual probabilities for individual classes that this image belongs to. Hence it is possible to determine several candidates with the highest probabilities that one image belongs to. Refined and improved algorithms can be developed to determine the best decision from the candidates with the highest probabilities. It is worth mentioning that some classifiers cut space into regions and do not have the cluster center in advance, such as SVM. Such schemes need to pay more effort if more candidates are to be selected. Here we show the testing accuracy of top-3 and top-5 candidates. The accuracy for 158-class dataset reaches 93.1 when having top-3 guesses, and 95.7 when having top-5 guesses. Moreover, the accuracy for 19-class dataset reaches 99.1 when having top-3 guesses, and 99.6 when having top-5 guesses.

The CDFs of top guesses are shown in Fig. 10.

Fig. 10: The CDF of top-X guesses for both testing datasets. The accuracies of both datasets reach above when having top-3 guesses, for the 19-class dataset and for the 158-class dataset.

Experiment 4: Image reconstruction using various numbers of features. In this experiment, we reconstruct images from various numbers of features. The reasons are two folded. The first one is to see how many features are sufficient to reconstruct the original images from a point of view of data compression. The second reason is to see what features and why they are eliminated in the original images. In Experiments 1 and 2, we see using full features may lead to overfitting problem. By reversing the images from the reduced features, the corresponding results can also be used to explain what may lead to overfitting problem and what are the important features of the original images that can be used to mostly distinguish images from different classes. Here, we recover images from 12288 (full features), 9000, 6000, 3000, 270 features which respectively have 4096, 3000, 2000, 1000, 90 features for recovering one primary color layer. To compare the recovery result, we use the percent deviation to calculate the loss defined as


Table IV shows the corresponding percent deviations and compression ratio. It is observed that using more features does not necessarily lead to a small percent deviation due to overfitting issue. In addition, reconstructing images from 270 features has a very high compression ratio and a satisfactory low percent deviation. Note that the deviation for 12288 is very small but nonzero due to the accumulated computational error of multiple stages in the platform.

Features Percent deviation Compression ratios
12288 1.7e-04% 1:1
9000 5.48% 1.37:1
6000 12.41% 2.05:1
3000 15.51% 4.10:1
270 9.14% 45.51:1
240 9.31% 51.2:1
90 11.45% 136.5:1
TABLE IV: Compression ratios

Let us see how the recovered images look like. Fig. 11, shows some reconstructed sample images. Row (a) are the samples recovered from full 12288 features and the average percent deviation is 1.7226e-04%, which proves that we can recover the image losslessly from full features. Row (b) are the samples recovered from 9000 features and the average percent deviation is 5.48%. Rows (c) and (d) are the samples recovered from 6000 and 3000 features and the average percent deviations are 12.41% and 15.51% respectively. Observing that as the reduced features increase in this level, the deviations grow higher. Also in rows (c) and row (d), the reconstructed images contain certain insignificant details which look like noises. As a result, the reconstruction quality and the testing accuracy are poor due to overfitting as we can see in the previous example in Table II. When the number of features reduces to 270, however, in row (e), we see that the insignificant details disappear and the effect just like passing the images through a smoothing process. Consequently, the deviation is only 9.14% which is even lower than row (c) and row (d). This result also reflects that 270 features are suitable for training the classifier since they grasp most of the significant information of an image and avoid overfitting occurred as using 6000 and 3000 features.

Fig. 11: Some image recovery samples. (a) are the samples of recovery images recovered from 12288 features, the percent deviation is . (b) to (e) are recovered from reduced features, and the numbers of features are 9000, 6000, 3000, 270, the percent deviations are .

Moreover from the top and bottom samples of the first column in Fig. 11, the glasses disappear in the reconstructed image from the reduced 270 features. This implies that the proposed scheme can remove redundancy that is irrelevant or unimportant to classification. To see this point more clearly, Fig. 12 shows more reconstructed images from the 270 features. Observed from the figure that glasses and figures disappear. That is, significant features extracted by the proposed scheme would be the information about face structures, not those disturbing objects like glasses and fingers. Therefore, when we reconstruct the images from a suitable amount of features, the disturbing items should disappear. From a viewpoint of data compression, the compressed data is dedicated to achieving better classification performance. This effect may be treated as “feature filtering” of the proposed system in classification.

Fig. 12: The image recovery method eliminates disturbing objects. The top image originally has fingers in front of the man’s chin and disappears after recovered. The middle and the bottom images both have glasses at first but disappear after recovered.

Experiment 5: Comparisons between conventional and proposed schemes. In this experiment, we compare the proposed system with the AlexNet and the Saak transform in [7]. The same preprocessed training and testing images are used. We apply the AlexNet provided by Matlab. The CNN model is trained using exactly the same layers like those in AlexNet. Moreover, considering the hardware, we allow that the CNN provided by Matlab uses GPU to speed up, and the proposed scheme only uses CPU. The GPU that CNN uses is NVIDIA GeForce GTX 1050 Ti. For the Saak transform, 600 features are used, which can achieve its best accuracy. The result is shown in Table V. We see that the proposed model achieves a better accuracy while the whole computational time including training and testing is only one twenty-fifth of that with CNN.

AlexNet Saak Transform (600 features) Proposed Scheme (270 features)
of Time Time Accuracy Time Time Accuracy Time Time Accuracy
classes (Train) (Test) (Train) (Test) (Train) (Test)
s s s s s s
s s s s s s
TABLE V: Alexnet vs. Saak transform vs. Proposed recognition model

Vii Conclusion and future work

We have proposed a linear classification system that can inverse the extracted and reduced features to original data, and achieve data compression for classification purposes as well. Experimental results show that the proposed system outperforms the conventional classification schemes in terms of not only computational complexity but also testing accuracy. From the viewpoint of data compression, the proposed system compresses the data in a way beneficial for classification purposes. That is, when the images are recovered from the reduced features via the proposed system, several unimportant feature redundancies for classifications such as glasses and covered hand on the faces are naturally filtered out. We call this effect feature filtering. When the compressed data is used for classification, the testing accuracy is high while the recovered images can still achieve a small percent deviation as well. The nice properties including linearity, reversibility, achieving data compression and feature filtering have made the proposed system worth further investigation. The solutions to develop sophisticated algorithms to refine the detection results among top guesses, and to efficiently utilize the advantages of data compression for classification are still open.


  • [1] F. Schroff, D. Kalenichenko and J. Philbin, “FaceNet: a unified embedding for face recognition and clustering,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 815–823, Jun. 2015.
  • [2] H. Wang, J. Hu and W. Deng, “Compressing fisher vector for robust face recognition,” IEEE Access, vol. 5, pp. 23157–23165, Sep. 2017.
  • [3] J. Lu, V. E. Liong, X. Zhou, J. Zhou, “Learning compact binary face descriptor for face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, pp. 2041–2056, Mar. 2015.
  • [4] C. Ding and D. Tao, “Robust face recognition via multimodal deep face representation,” IEEE Trans. Multimedia, vol. 17, pp. 2049–2058, Nov. 2015.
  • [5] B.-C. Chen, C.-S. Chen and W. H. Hsu, “Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset,” IEEE Trans. Multimedia, vol. 17, pp. 804–815, Apr. 2015.
  • [6] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” Computer Vision and Pattern Recognition (CVPR), pp. 770–778, Dec. 2016.
  • [7] C.-C. J. Kuo and Y. Chen, “On data-driven saak transform,” arXiv preprint arXiv:1710.04176, Oct. 2017.
  • [8]

    T. H. Chan, K. Jia, S. Gao, et al.: “Pcanet: a simple deep learning baseline for image classification?,”

    IEEE Trans. Image Process., vol. 24, pp. 5017–5032, Dec. 2015.
  • [9] L. Zhao and Z. Guo, “Face recognition method based on adaptively weighted block-two dimensional principal component analysis,” Proc. of IEEE Int. Conf. on Computational Intelligence, Communication Systems and Networks, pp. 22-25, Jul. 2011.
  • [10] I. Dagher and R. Nachar, “Face recognition using IPCA-ICA algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 996–1000, Jun. 2006.
  • [11] S. Rajendran, A. Kaul, R. Nath, A. Arora and S. Chauhan, “Comparison of pca and 2d-pca on indian faces,” Signal Propagation and Computer Technology (ICSPCT), 2014 International Conference on. IEEE., pp. 561–566, Jul. 2014.
  • [12] R. He, B.-G. Hu, W.-S. Zheng and X.-W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Trans. Image Process., vol. 20, pp. 1485–1494, Jun. 2011.
  • [13] H. M. Ebied, “Feature extraction using pca and kernel-pca for face recognition,” 2012 8th International Conference on Informatics and Systems (INFOS)., pp. MM–72–MM–77, May 2012.
  • [14] X. Xiao and Y. Zhou, “Two-dimensional quaternion PCA and sparse PCA,” IEEE Trans. on Neural Networks and Learning Systems., vol. 30, pp. 1–15, Jul. 2018.
  • [15] Y. Pei, “Linear principal component discriminate analysis,” 2015 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC2015)., vol. 28, pp. 2108–2113, Feb. 2015.
  • [16]

    J. Xu, M. Li, X. Zhao and Z. Chang, “Self-learning super-resolution using convolutional principal component analysis and random matching,”

    IEEE Trans. Multimedia., vol. 21, pp. 1108-1121, Sep. 2018.
  • [17] Y. Choi, T. Tokumoto, M. Lee and S. Ozawa, “Incremental two-dimensional two-directional principal component analysis (l(2DPCA) for face recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing., pp. 1493-1496, May 2011.
  • [18] X. Jiang, “Linear subspace learning-based dimensionality reduction,”IEEE Signal Process. Mag., vol. 28, pp. 16-26, Mar. 2011.
  • [19] C. Liu and H. Wechsler, “Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition,”IEEE Trans. Image Process., vol. 11, pp. 467–476, Apr. 2002.
  • [20] F. Ye, Z. Shi and Z. Shi, “A comparative study of PCA, LDA and kernel LDA for image classification,” International Symposium on ubiquitous virtual reality, IEEE., pp. 51-54, Jul. 2009.
  • [21] Z. Zeng and P. Huang, “Palmprint recognition using Gabor feature-based two-directional two-dimensional linear discriminant analysis,” IEEE International Conference on EMEIT., vol. 4, pp. 1917-1921, Aug. 2011.
  • [22] C. Xiang and D. Huang, “Feature extraction using recursive cluster-based linear discriminant with application to face recognition,” IEEE Transactions on Image Processing., vol. 15, pp. 3824–3832, Nov. 2006.
  • [23] T.-K. Kim and J. Kittler, “Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 318–327, Mar. 2005.
  • [24] S. W. Ji and J. P. Ye, “Generalized linear discriminant analysis: a unified framework and efficient model selection,” IEEE Transactions on Neural Networks., vol. 19, pp. 1768-1782, Sep. 2008.
  • [25] M. H. Siddiqi, R. Ali, A. M. Khan, Y. T. Park, S. Lee, et al.: “Human facial expression recognition using stepwise linear discriminant analysis and hidden conditional random fields,” IEEE Trans. Image Process., vol. 24, pp. 1386–1398, Feb. 2015.
  • [26] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, 1993.
  • [27] University of Massachusetts Amherst, Labeled Faces in the Wild., [Online]. Available:
  • [28] N. Senthilkumaran and J. Thimmiaraja, “Histogram equalization for image enhancement using MRI brain images,” in 2014 World Congress on Computing and Communication Technologies., pp. 80–83, Mar. 2014.
  • [29] J. H. Han, S. Yang and B. U. Lee, “A novel 3-D color histogram equalization method with uniform 1-D gray scale histogram,” IEEE Trans.on Image Processing., vol. 20, pp. 506-512, Feb. 2011.
  • [30] Dlib, shape predictor 68 face landmarks., [Online]. Available: