1 Introduction
Supervised subspace learning is useful for embedding high-dimensional data, such as images, in a low-dimensional subspace for better separation of classes. Fisher Discriminant Analysis (FDA) [15, 11], first proposed in [9], was one of the first supervised subspace learning methods; it is based on the scatters and variances of the data. The FDA subspace tries to separate the classes from one another while the data instances within every class collapse to a small region [16]. Recently, deep FDA [8, 7] was proposed, which uses a least squares approach [32, 34]. On the other hand, it was recently shown empirically that compression can improve classification accuracy. In [22], a pretrained deep neural network was fed with test images compressed with JPEG (uniform quantization in the DCT domain). Note that the network works with data in the pixel domain and not the frequency domain. The authors empirically showed that this compression can improve classification accuracy, opening a question for further theoretical investigation.
In this paper, we propose Quantized FDA (QFDA), which makes a bridge between machine learning [11] and information theory [6]. In QFDA, we want an optimal subspace which separates the classes favorably for the quantized data while minimizing the distortion and entropy (rate) as much as possible, without sacrificing classification performance. Our aim is to find a subspace for quantized data which is at least as good as the subspace for non-quantized data in terms of separation of classes. Therefore, one can throw away the large-volume non-compressed data and merely store the quantized data and the learned directions spanning the subspace. In other words, we compress the data while the separation of classes remains acceptable.
In this paper, we use the following notations:
$n$: training sample size
$n_k$: training sample size in the $k$-th class
$d$: dimensionality of data in the input space
$d'$: dimensionality of data after zero-padding
$c$: number of classes
$x_i$: $i$-th training data instance, whose dimensionality becomes $d'$ after zero-padding
$x_i^{(k)}$: $i$-th training data instance in the $k$-th class, whose dimensionality becomes $d'$ after zero-padding
$x_t$: $t$-th test (out-of-sample) data instance, whose dimensionality becomes $d'$ after zero-padding
Other notations will be introduced at their first appearance in the paper.
This paper is organized as follows. Section 2 reviews FDA. Uniform quantization in the frequency domain is reviewed in Section 3. The proposed QFDA is explained in Section 4. The connection of QFDA to rate-distortion optimization in the field of compression is discussed in Section 5. Section 6 reports the experiments, including the proposed quantized Fisherfaces for face analysis in QFDA. Finally, Section 7 concludes the paper and discusses possible future directions.
2 Fisher Discriminant Analysis
Let $U \in \mathbb{R}^{d' \times d'}$ be the projection matrix whose columns are the projection directions spanning the subspace, so the subspace is the column space of $U$. If we truncate the projection matrix to have $U \in \mathbb{R}^{d' \times p}$, the subspace is spanned by $p$ projection directions and it will be a $p$-dimensional subspace where $p \leq d'$.
The projection into the subspace and reconstruction of training and out-of-sample data are [28]:
(1) $\tilde{X} := U^\top X,$
(2) $\hat{X} := U \tilde{X} = U U^\top X,$
(3) $\tilde{x}_t := U^\top x_t,$
(4) $\hat{x}_t := U \tilde{x}_t = U U^\top x_t,$
where $\tilde{X}$, $\hat{X}$, $\tilde{x}_t$, and $\hat{x}_t$ are the projection and reconstruction of the training data and the projection and reconstruction of an out-of-sample data instance, respectively.
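As a concrete illustration, Eqs. (1)-(4) amount to a couple of matrix products. The following NumPy sketch uses arbitrary toy sizes and random orthonormal directions (in practice $U$ comes from the FDA solution of Section 2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p = 6, 10, 2                       # ambient dim, sample size, subspace dim

X = rng.standard_normal((d, n))          # training data, instances as columns
U = np.linalg.qr(rng.standard_normal((d, p)))[0]  # orthonormal directions

X_tilde = U.T @ X                        # projection of training data, Eq. (1)
X_hat = U @ X_tilde                      # reconstruction, Eq. (2)

x_t = rng.standard_normal(d)             # an out-of-sample instance
x_t_tilde = U.T @ x_t                    # projection, Eq. (3)
x_t_hat = U @ x_t_tilde                  # reconstruction, Eq. (4)

# reconstruction is an orthogonal projection onto the column space of U,
# so projecting the reconstruction again changes nothing
assert np.allclose(U.T @ X_hat, X_tilde)
```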
FDA was first proposed in [9]. See recent reviews and analyses in [15, 11]. It maximizes the Fisher criterion [31, 12]:
(5) $\max_U \; J(U) := \frac{\mathbf{tr}(U^\top S_B U)}{\mathbf{tr}(U^\top S_W U)},$
where $\mathbf{tr}(\cdot)$ denotes the trace of a matrix. The Fisher criterion is a generalized Rayleigh-Ritz quotient [23]. Hence, the optimization in Eq. (5) is equivalent to:
(6) $\max_U \; \mathbf{tr}(U^\top S_B U)$
subject to $U^\top S_W U = I,$
where $S_B$ and $S_W$ are the between and within scatters, respectively, defined as:
(7) $S_B := \sum_{k=1}^{c} n_k (\mu_k - \mu)(\mu_k - \mu)^\top,$
(8) $S_W := \sum_{k=1}^{c} \sum_{i=1}^{n_k} \big(x_i^{(k)} - \mu_k\big)\big(x_i^{(k)} - \mu_k\big)^\top,$
where $\mu$ is the total mean of the data and the mean of the $k$-th class is:
(9) $\mu_k := \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}.$
The total scatter can be considered as the summation of the between and within scatters [32, 30]:
(10) $S_T := S_B + S_W.$
Therefore, the Fisher criterion of Eq. (5) can be written as:
(11) $J(U) = \frac{\mathbf{tr}(U^\top S_T U)}{\mathbf{tr}(U^\top S_W U)} - 1.$
The $-1$ is a constant and can be dropped in the optimization problem; therefore, the optimization in FDA can be:
(12) $\max_U \; \mathbf{tr}(U^\top S_T U)$
subject to $U^\top S_W U = I.$
The Lagrangian of the problem is [4]:
$\mathcal{L} = \mathbf{tr}(U^\top S_T U) - \mathbf{tr}\big(\Lambda^\top (U^\top S_W U - I)\big),$
where $\Lambda$ is a diagonal matrix including the Lagrange multipliers. Setting the derivative of the Lagrangian to zero gives:
(13) $\frac{\partial \mathcal{L}}{\partial U} = 2\, S_T U - 2\, S_W U \Lambda \overset{\text{set}}{=} 0 \implies S_T U = S_W U \Lambda,$
which is the generalized eigenvalue problem $(S_T, S_W)$ [14]. Hence, the FDA directions are the eigenvectors of this generalized eigenvalue problem. Comparing Eq. (12) with the optimization of PCA [18, 13] shows that PCA captures the orthonormal directions with the maximum variance of the data; FDA has the same goal, but its directions are normalized with respect to $S_W$ rather than being orthonormal. One of the ways to solve the generalized eigenvalue problem is [14]:
(14) $U = \mathbf{eig}(S_W^{-1} S_T),$
where $\mathbf{eig}(\cdot)$ stacks the eigenvectors column-wise, sorted by decreasing eigenvalue. In order to guarantee that $S_W$ is not singular, we strengthen its diagonal:
(15) $S_W := S_W + \varepsilon I,$
where $\varepsilon$ is a very small positive number, large enough to make $S_W$ full rank. In the literature, this approach is known as regularized discriminant analysis [10].
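The regularized solution of Eqs. (14)-(15) can be sketched with SciPy's symmetric generalized eigensolver; the two-class toy data, the regularization value, and the function name are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def fda_directions(X, y, eps=1e-5):
    """Fisher directions via the generalized eigenproblem S_T u = lam * S_W u.
    X: (d, n) data with instances as columns; y: (n,) class labels."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_T = (X - mu) @ (X - mu).T                    # total scatter
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_W += (Xc - mc) @ (Xc - mc).T             # within scatter
    S_W += eps * np.eye(d)                         # regularization, Eq. (15);
                                                   # eigh needs S_W positive definite
    lam, U = eigh(S_T, S_W)                        # generalized eigenvectors, ascending
    return U[:, ::-1], lam[::-1]                   # sort by decreasing eigenvalue

# two well-separated Gaussian classes in 2D
rng = np.random.default_rng(0)
X = np.hstack([rng.normal([0, 0], 0.3, (50, 2)).T,
               rng.normal([3, 3], 0.3, (50, 2)).T])
y = np.array([0] * 50 + [1] * 50)
U, lam = fda_directions(X, y)   # leading column separates the two classes
```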
We can write the total and within scatters in matrix form:
(16) $S_T = X H X^\top,$
(17) $S_W = \sum_{k=1}^{c} X_k H_k X_k^\top,$
respectively, where $H := I - \frac{1}{n} \mathbf{1} \mathbf{1}^\top$ and $H_k := I - \frac{1}{n_k} \mathbf{1} \mathbf{1}^\top$ are centering matrices, $I$ is the identity matrix, and $\mathbf{1}$ is the vector of ones with appropriate dimensionality. Note that the centering matrix is symmetric and idempotent.

3 Uniform Quantization in DCT Domain
In this section, we describe a simple version of uniform quantization in the Discrete Cosine Transform (DCT) domain [1, 24]. This quantization method defines the well-known JPEG (Joint Photographic Experts Group) compression [20, 27].
We first center the data, i.e., remove their mean:
(18) $x_i := x_i - \mu, \quad \forall i \in \{1, \dots, n\}.$
Then, we reshape every image (a column of $X$) to its image-array form. The image is then zero-padded in height and width to contain an integer number of ($8 \times 8$)-pixel blocks [27]. Let $d'$ denote the new dimensionality of the image after zero-padding. Afterwards, the image is divided into its $8 \times 8$ blocks. Let $B \in \mathbb{R}^{8 \times 8}$ denote an image block. A two-dimensional DCT transform [29] is applied to every block [27]:
(19) $C(u, v) = \alpha(u)\, \alpha(v) \sum_{x=0}^{7} \sum_{y=0}^{7} B(x, y) \cos\!\Big(\frac{(2x+1)\, u\pi}{16}\Big) \cos\!\Big(\frac{(2y+1)\, v\pi}{16}\Big),$
where $C \in \mathbb{R}^{8 \times 8}$ is the signal in the DCT domain with $u, v \in \{0, \dots, 7\}$. Therefore, for every block, we have $64$ frequencies. The $\alpha(u)$ and $\alpha(v)$ are:
(20) $\alpha(u) := \begin{cases} \sqrt{1/8} & u = 0, \\ \sqrt{2/8} & u \neq 0. \end{cases}$
We denote the $C$ reshaped to a vector by $c \in \mathbb{R}^{64}$. Moreover, let $f_i \in \mathbb{R}^{d'}$ denote the $i$-th image transformed to the DCT (frequency) domain.
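The zero-padding and block-wise DCT described above can be sketched as follows; the image content is illustrative, and `scipy.fft.dctn` computes the orthonormal 2D DCT of each 8-by-8 block:

```python
import numpy as np
from scipy.fft import dctn, idctn

def blockwise_dct(img, b=8):
    """Zero-pad a grayscale image to multiples of b and apply the orthonormal
    2D DCT to every b-by-b block, as in JPEG."""
    h, w = img.shape
    H, W = -(-h // b) * b, -(-w // b) * b        # round dims up to multiples of b
    padded = np.zeros((H, W))
    padded[:h, :w] = img                          # zero-padding
    out = np.empty_like(padded)
    for i in range(0, H, b):
        for j in range(0, W, b):
            out[i:i+b, j:j+b] = dctn(padded[i:i+b, j:j+b], norm="ortho")
    return out

img = np.arange(20 * 20, dtype=float).reshape(20, 20)
F = blockwise_dct(img)                            # shape (24, 24): 9 blocks, 64 frequencies each
block0 = idctn(F[:8, :8], norm="ortho")           # the DCT is invertible per block
assert np.allclose(block0, img[:8, :8])
```

With the orthonormal normalization, the DC coefficient of a block is 8 times the block mean, which is why lower frequencies carry most of the energy of smooth images.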
Let $q(\cdot)$ denote the uniform quantization function [6]:
(21) $\mathbb{R}^{64} \ni c \mapsto q(c) \in \mathbb{R}^{64},$
applied block-wise and parameterized by $k = [k_1, \dots, k_{64}]^\top$, the vector whose $j$-th element, $k_j$, is the number of quantization levels for the $j$-th frequency. More specifically, $k_j \in \{1, \dots, b_j\}$ where $b_j$ is an upper bound on the number of quantization levels. In order to calculate $b_j$, we first bootstrap a sample of images from the training dataset. This bootstrap is an estimate of the whole dataset in the sense of Monte Carlo approximation [25]. We calculate:
(22) $b_j := \Big\lceil \max_{i, \ell} \big| c_j^{(i, \ell)} \big| \Big\rceil,$
where $\lceil \cdot \rceil$ denotes the ceiling function, $|\cdot|$ denotes the absolute value, $c$ is the DCT block reshaped to a vector, and $c_j^{(i, \ell)}$ is the $j$-th frequency in the $\ell$-th block of the $i$-th training image.
The Eq. (21) quantizes every frequency in every block of the image, where the quantization levels for a given frequency are shared across blocks. The mapping has the following steps:
(23) $\Delta_j := \frac{M_j - m_j}{k_j},$
(24) $\iota := \Big\lfloor \frac{c_j - m_j}{\Delta_j} \Big\rfloor,$
(25) $q(c_j) := m_j + \Big(\iota + \frac{1}{2}\Big) \Delta_j,$
where $\lfloor \cdot \rfloor$ denotes the floor function (rounding to the largest integer less than or equal to the input value), and:
(26) $M_j := \max_{i, \ell} \; c_j^{(i, \ell)},$
(27) $m_j := \min_{i, \ell} \; c_j^{(i, \ell)}.$
Two examples of quantization using the above formulae are provided in Fig. 1. We denote the quantized signal of the image $f_i$ by $q(f_i)$.
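As a sketch of the mapping above (assuming a uniform quantizer that maps each value to the mid-point of its interval, which may differ in minor details from the paper's exact formulas):

```python
import numpy as np

def uniform_quantize(x, k, lo, hi):
    """Quantize values in [lo, hi] to k uniform levels, returning the
    mid-point of the interval each value falls into."""
    delta = (hi - lo) / k                  # interval length
    idx = np.floor((x - lo) / delta)       # interval index of each value
    idx = np.clip(idx, 0, k - 1)           # the maximum value belongs to the top level
    return lo + (idx + 0.5) * delta        # map to interval mid-points

c = np.array([-1.0, -0.2, 0.1, 0.7, 1.0])
q = uniform_quantize(c, k=4, lo=-1.0, hi=1.0)
# q is [-0.75, -0.25, 0.25, 0.75, 0.75]
```

The quantization error of each value is bounded by half the interval length, so more levels per frequency mean less distortion but a higher rate.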
4 Quantized Fisher Discriminant Analysis
4.1 Quantized Fisher Criterion
In QFDA, the total and within scatters are defined as:
(28) $S_T := \sum_{i=1}^{n} \big(q(f_i) - \mu_q\big)\big(q(f_i) - \mu_q\big)^\top + \beta \sum_{i=1}^{n} \big(q(f_i) - f_i\big)\big(q(f_i) - f_i\big)^\top,$
(29) $S_W := \sum_{k=1}^{c} \sum_{i=1}^{n_k} \big(q(f_i^{(k)}) - \mu_q^{(k)}\big)\big(q(f_i^{(k)}) - \mu_q^{(k)}\big)^\top + \beta \sum_{i=1}^{n} \big(q(f_i) - f_i\big)\big(q(f_i) - f_i\big)^\top,$
respectively, where $\mu_q$ and $\mu_q^{(k)}$ are the means of the quantized data and of the quantized $k$-th class, and $\beta$ is a hyperparameter controlling the relative importance of the first and second terms. The parameters $\beta$ in $S_T$ and $S_W$ can be different, but we use the same parameter for the sake of simplicity. The first aim of QFDA is to solve the following problem:
(30) $\max_U \; \mathbf{tr}(U^\top S_T U)$
subject to $U^\top S_W U = I.$
We name the objective function in Eq. (30) the quantized Fisher criterion and denote it by $J_q$.
4.2 Minimization of Rate
At the same time, QFDA desires to minimize the rate of the quantized data for better compression. The rate of quantization of the $j$-th frequency is defined as [6]:
(31) $R_j := \sum_{\iota=1}^{k_j} p_\iota \, \ell_\iota,$
where $\ell_\iota$ is the codeword length assigned to the $\iota$-th quantization level and $p_\iota$ is the area under the Probability Density Function (PDF) of the DCT signal, $\phi_j$, in the interval of the $\iota$-th quantization level:
(32) $p_\iota := \int_{m_j + (\iota - 1) \Delta_j}^{m_j + \iota \Delta_j} \phi_j(x) \, dx,$
where $\phi_j$ is the PDF of the DCT-transformed signal for the $j$-th frequency. In order to estimate this PDF, we take the same bootstrapped images which we have from Section 3. In other words, we use the $j$-th frequency in all the blocks of the sampled images. Then, for every frequency, we use kernel density estimation [26] with a Gaussian kernel, because the Gaussian is the most common distribution in real-world signals, as also supported by the central limit theorem [17]. Note that, again, the estimation of the PDF with a bootstrapped sample from the data follows the Monte Carlo approximation [25]. The rate can be approximated by the entropy because if we use $\ell_\iota = -\log_2 p_\iota$ in Eq. (31), we have:
(33) $R_j \approx -\sum_{\iota=1}^{k_j} p_\iota \log_2 p_\iota,$
which is the summation of entropies over the quantization intervals. In other words, if a frequency carries a lot of information because of significant changes amongst the blocks/images, it should have a large number of quantization levels, and we expect its rate/entropy to be large. We use the approximation in Eq. (33) for the rate in the QFDA optimization. We calculate an overall rate for the dataset by averaging over the rates of the frequencies:
(34) $R := \frac{1}{64} \sum_{j=1}^{64} R_j.$
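The rate estimation can be sketched as follows: fit a Gaussian KDE to bootstrapped coefficients, integrate it over the quantization intervals, and sum the entropies as in Eq. (33). The grid-based integration, function name, and sample data are illustrative:

```python
import numpy as np
from scipy.stats import gaussian_kde

def rate_of_frequency(samples, k, lo, hi, grid=4000):
    """Approximate the rate of a k-level uniform quantizer via Eq. (33):
    fit a Gaussian KDE to the bootstrapped coefficients, integrate it over
    every quantization interval to get p_i, and return -sum p_i log2 p_i."""
    xs = np.linspace(lo, hi, grid)
    dx = xs[1] - xs[0]
    pdf = gaussian_kde(samples)(xs)               # kernel density estimate on the grid
    edges = np.linspace(lo, hi, k + 1)
    p = np.empty(k)
    for i in range(k):
        mask = (xs >= edges[i]) & (xs < edges[i + 1])
        p[i] = pdf[mask].sum() * dx               # area under the PDF, Eq. (32)
    p = np.clip(p, 1e-12, None)
    p /= p.sum()                                  # normalize to a proper distribution
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 500)               # stand-in for bootstrapped DCT coefficients
R8 = rate_of_frequency(samples, k=8, lo=-4.0, hi=4.0)
R2 = rate_of_frequency(samples, k=2, lo=-4.0, hi=4.0)
```

As expected, more quantization levels yield a higher entropy, bounded above by $\log_2 k_j$ bits.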
4.3 Optimization for QFDA
The complete cost function in QFDA is:
(35) $\max_{k} \;\; J_q(k) - \lambda R(k)$
subject to $k_j \in \{1, \dots, b_j\}, \;\; \forall j \in \{1, \dots, 64\},$
which minimizes the rate while maximizing the quantized Fisher criterion; the optimization variable is the vector $k$ containing the numbers of quantization levels for the frequencies. The $\lambda$ is the regularization parameter in this optimization.
The columns of $U$ are the quantized Fisher directions. We can truncate this matrix to have a $p$-dimensional quantized Fisher subspace where $p \leq d'$. In other words, the column space of $U$ is the quantized Fisher subspace. We calculate the quantized Fisher directions by solving Eq. (30) for a given $k$. The solution is similar to the solution of Eq. (12), i.e., Eq. (14) with the regularization of Eq. (15), where we use Eqs. (28) and (29) for $S_T$ and $S_W$, respectively.
We solve Eq. (35) using Particle Swarm Optimization (PSO) [19], a powerful metaheuristic optimization method [5]. We resort to a heuristic method because the cost function is not mathematically well-behaved and the search space is discrete, which makes the problem harder. The cost of PSO is the cost in Eq. (35), where, before feeding a solution (particle) to it, we first project the input to the valid set of values. We denote the projection for the $j$-th frequency by $\pi_j(\cdot)$ and define it as:
(36) $\pi_j(k_j) := \min\big(\max(\lfloor k_j \rceil, 1), \, b_j\big),$
where $\lfloor \cdot \rceil$ rounds to the nearest integer. Note that the quantized Fisher criterion in the cost is calculated as explained before for a given $k$ (particle). The rate in the cost is calculated using Eq. (33).
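A minimal PSO over the integer vector of quantization levels can be sketched as follows; the inertia and acceleration constants and the toy cost are illustrative stand-ins for the full QFDA cost of Eq. (35):

```python
import numpy as np

def pso_minimize(cost, b, n_particles=5, n_iter=10, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO over an integer vector k with 1 <= k_j <= b_j, mirroring the
    paper's setting (five particles, ten iterations). `cost` maps a level
    vector to a scalar; constants here are illustrative."""
    rng = np.random.default_rng(seed)
    dim = len(b)
    pos = rng.uniform(1, b, (n_particles, dim))            # continuous positions
    vel = np.zeros((n_particles, dim))
    project = lambda p: np.clip(np.rint(p), 1, b).astype(int)  # round and clip, as in Eq. (36)
    pbest = pos.copy()
    pbest_cost = np.array([cost(project(p)) for p in pos])
    g = pbest[pbest_cost.argmin()].copy()                  # global best
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = pos + vel
        costs = np.array([cost(project(p)) for p in pos])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
        g = pbest[pbest_cost.argmin()].copy()
    return project(g), cost(project(g))

# toy cost preferring 4 levels per frequency, with upper bounds b_j = 16
b = np.full(8, 16)
k_opt, c_opt = pso_minimize(lambda k: float(np.sum((k - 4.0) ** 2)), b)
```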
5 Connection to RateDistortion Optimization
Quantization and compression usually deal with the rate-distortion optimization [21, 6], where the rate and distortion are in a trade-off, meaning that lower distortion usually comes with a higher rate and vice versa. In the rate-distortion optimization, the rate is minimized for better compression while the distortion is also minimized to better preserve quality.
The optimization in Eq. (35) can be interpreted as a rate-distortion optimization, for the following reason. The criterion in Eq. (30) is a generalized Rayleigh-Ritz quotient [23]. Hence, the optimization (30), for a given $k$, is equivalent to:
(37) $\min_U \; -\mathbf{tr}(U^\top S_T U)$
subject to $U^\top S_W U = I,$
where maximization has been changed to minimization by negating the objective function. The Lagrangian relaxation [4] of the problem is:
(38) $\mathcal{L} = -\mathbf{tr}(U^\top S_T U) + \mathbf{tr}\big(\Lambda^\top (U^\top S_W U - I)\big),$
where the diagonal of $\Lambda$ contains the Lagrange multipliers. Now, consider Eq. (35), whose Lagrangian relaxation is similarly:
(39) $\mathcal{L} = -\mathbf{tr}(U^\top S_T U) + \mathbf{tr}(\Lambda^\top U^\top S_W U) + \lambda R,$
where we have dropped the terms concerning the constraints in Eq. (35) for simpler analysis. This equation shows that we are minimizing the within scatter of the projected data. According to Eq. (29), both the within-class scatter of the quantized data (the first term) and the distortion term (the second term) are minimized. Minimizing the former shrinks the cloud of every quantized class in the QFDA subspace; it thus plays the role of minimizing the entropy (or information), which helps minimize the rate. On the other hand, minimizing the distortion term amounts to minimizing the distortion itself: this term measures the variance (dissimilarity) of the quantized (i.e., distorted) data from the non-distorted data.
Although similar terms also exist in $S_T$, which is maximized in Eq. (39), notice that there they represent the total variance of the data, which guarantees better separation of the classes in the subspace. Therefore, it makes more sense to read the rate-distortion interpretation from $S_W$ and not from $S_T$. Moreover, the rate $R$ appears in Eq. (39) and is minimized as expected. Therefore, in that equation, the rate is minimized (through $S_W$ and $R$) and the distortion is also minimized (through $S_W$), as in rate-distortion optimization. It is also noteworthy that, according to Eqs. (29) and (39), the trade-off between minimizing the rate and the distortion is controlled by the two hyperparameters $\beta$ and $\lambda$.
Table 1. Average validation error rates (mean ± standard deviation) on the AT&T face dataset for different hyperparameter values (rows: $\lambda$; columns: $\beta$).

λ \ β | 0.1           | 0.5           | 1             | 1.5           | 2
0.01  | 0.122 ± 0.062 | 0.129 ± 0.049 | 0.127 ± 0.052 | 0.182 ± 0.069 | 0.146 ± 0.058
0.1   | 0.117 ± 0.074 | 0.135 ± 0.053 | 0.165 ± 0.074 | 0.157 ± 0.064 | 0.121 ± 0.063
1     | 0.125 ± 0.064 | 0.107 ± 0.066 | 0.169 ± 0.068 | 0.132 ± 0.063 | 0.153 ± 0.063
10    | 0.123 ± 0.054 | 0.117 ± 0.056 | 0.117 ± 0.049 | 0.156 ± 0.060 | 0.151 ± 0.079
Table 2. Average validation error rates (mean ± standard deviation) on the Fashion MNIST dataset for different hyperparameter values (rows: $\lambda$; columns: $\beta$).

λ \ β | 0.1           | 0.5           | 1             | 1.5           | 2
0.01  | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122
0.1   | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122
1     | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122
10    | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122
6 Experiments
6.1 Quantized Fisherfaces vs. Fisherfaces
We used the AT&T face dataset [2], which includes facial images of 40 different subjects. The images have different expressions and poses, making this dataset hard enough. The size of the images in this dataset is 92 × 112 pixels. For computational reasons, we downsampled the images by half, because the calculation of eigenvectors takes time and every iteration of the optimization includes a calculation of eigenvectors. We divided the images into two classes: having and not having eyeglasses. We split the data into training, test, and validation sets.
We used the validation set for finding the best values of $\beta$ and $\lambda$. For several permutations of values of these two hyperparameters, we ran the PSO optimization. For every PSO optimization, we used five particles, and we found ten iterations to be sufficient. A 10-Nearest Neighbor (10-NN) classifier was used for evaluating QFDA on the training, test, and validation sets. Nearest neighbor classification is useful for evaluating the structure of the embedded data in the subspace. Note that the training set was used as the reference set for classifying the test and validation sets.
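The 10-NN evaluation in the subspace can be sketched as follows; the toy embeddings are illustrative, with the training embeddings serving as the reference set as described above:

```python
import numpy as np
from collections import Counter

def knn_error(Z_train, y_train, Z_eval, y_eval, k=10):
    """Error rate of a k-nearest-neighbor classifier in the embedded subspace;
    columns of Z_* are embedded instances, and the training embeddings serve
    as the reference set for classifying evaluation points."""
    errors = 0
    for z, label in zip(Z_eval.T, y_eval):
        d2 = np.sum((Z_train - z[:, None]) ** 2, axis=0)   # squared distances
        nearest = y_train[np.argsort(d2)[:k]]              # labels of k nearest
        pred = Counter(nearest).most_common(1)[0][0]       # majority vote
        errors += int(pred != label)
    return errors / len(y_eval)

# two well-separated toy clusters standing in for embedded classes
rng = np.random.default_rng(1)
Z_train = np.hstack([rng.normal(0, 0.1, (2, 20)), rng.normal(5, 0.1, (2, 20))])
y_train = np.array([0] * 20 + [1] * 20)
Z_eval = np.hstack([rng.normal(0, 0.1, (2, 10)), rng.normal(5, 0.1, (2, 10))])
y_eval = np.array([0] * 10 + [1] * 10)
err = knn_error(Z_train, y_train, Z_eval, y_eval, k=10)    # → 0.0 on this toy data
```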
For evaluations, we performed the classification considering from the first dimension up to the leading dimensions of the subspace. We report the results as the average (and standard deviation) over the resulting error rates. Table 1 reports the validation errors, where $\lambda = 1$ and $\beta = 0.5$ were found to be the best. In the following, we report the training, test, and validation errors for these optimum hyperparameters:
training: 0.126 ± 0.032
test: 0.208 ± 0.041
validation: 0.107 ± 0.066
Note that the above results are the QFDA results for embedding the quantized DCT-transformed images. The average error rates for embedding the original (non-quantized) DCT-transformed images using FDA are:
training: 0.112 ± 0.017
test: 0.185 ± 0.021
validation: 0.179 ± 0.025
Except for the validation set, the results of QFDA are slightly worse than those of FDA; however, the compression was remarkable, from 20.2 kilobytes per image down to 10 kilobytes, while the drop in classification accuracy was not significant. This comparison shows that we can throw away the original data and its FDA subspace and keep only the compressed images with the QFDA subspace to classify the quantized classes.
The found optimum numbers of quantization levels for the frequencies are depicted as a bar plot in Fig. 2(a). The result is convincing because the DC component (the zero frequency) usually has the smallest variance in the pixel domain and thus the largest variance in the frequency domain; the opposite analysis holds for the highest frequency. Hence, we expect more quantization levels to be assigned to the lower frequencies.
The inverse DCT transforms of the leading eigenvectors (directions) of the QFDA and FDA subspaces are shown in Fig. 3. The facial eigenvectors, or ghost faces, of FDA are referred to as Fisherfaces in the literature [3]. Similarly, we name the ghost faces in QFDA quantized Fisherfaces. Comparing the Fisherfaces and quantized Fisherfaces shows that the quantized Fisherfaces take care of the JPEG blocking resulting from the quantization. Both Fisherfaces and quantized Fisherfaces capture features around the eye regions of the faces, which is expected because the two classes differ in terms of having or not having eyeglasses. Note that we have more than one eigenvector here for both FDA and QFDA because we have regularized $S_W$ with $\varepsilon I$ in Eq. (15).
Recall that in Eq. (18) we centered the data. Several zero-padded images, centered data, and quantized images are shown in Fig. 4. As mentioned before, the size of every image has been halved without any significant drop in classification accuracy. As shown in this figure, the eyes and glasses have been quantized with higher quality, as expected, because the eyes are important for classification between having and not having eyeglasses. On the other hand, some unimportant facial regions are quantized with less quality, which is expected for minimization of the rate.
6.2 Experiments on Fashion MNIST Dataset
In order to perform experiments with more than two classes, we used the Fashion MNIST dataset [33], which includes 10 classes of different clothing items. The size of the images in this dataset is 28 × 28 pixels. The settings of PSO and cross-validation were the same as in the previous experiment. Table 2 shows the validation results for this dataset. The table shows that, for this dataset, QFDA is almost robust to changes of the hyperparameters. The reason might be that the images in this dataset are small, so the rate is usually high (and the distortion small) because there are not many dummy (zero-padded) pixels. The quantized images, shown in Fig. 5, also show that the distortion is not significant in this dataset, although some distortion is observed if one looks at the images carefully.
Table 2 shows that $\beta = 2$ can be a good choice for the hyperparameters, with the results being insensitive to $\lambda$. In the following, we report the training, test, and validation errors for these optimum hyperparameters:

training: 0.236 ± 0.102
test: 0.283 ± 0.133
validation: 0.263 ± 0.122
The above results are the QFDA results for embedding the quantized DCT-transformed images. The average error rates for embedding the original (non-quantized) DCT-transformed images using FDA are:
training: 0.273 ± 0.086
test: 0.315 ± 0.114
validation: 0.317 ± 0.121
This shows that, on this dataset, QFDA has performed much better than FDA. The sizes of an image before and after quantization are about 7.91 and 7.89 kilobytes, respectively, which means that the compression was not significant, as explained before.
The found optimum numbers of quantization levels for the frequencies are depicted as a bar plot in Fig. 2(b). The same interpretation as before explains why the DC component should have a larger number of quantization levels.
The inverse DCT transforms of the eighteen leading eigenvectors of QFDA and FDA are illustrated in Fig. 6. This figure shows that the eigenvectors of QFDA have captured more features than those of FDA. The eigenvectors of FDA have captured partially scanned features rather than complete features. This explains the better results of QFDA.
7 Conclusion and Future Direction
This paper proposed Quantized Fisher Discriminant Analysis (QFDA), which makes a bridge between machine learning, manifold learning, and information theory. The literature combining machine learning and information theory is scarce, and this paper attempted to help fill this gap. The method optimizes a proposed cost function using the PSO algorithm; this cost function can be interpreted as a rate-distortion cost of the kind used for compression. The quantized Fisherfaces method was also proposed for facial analysis with QFDA. The experiments, reporting validation errors, optimum numbers of quantization levels, visualization of QFDA eigenvectors, and displays of quantized images, showed the merit of this new subspace learning method.
In this paper, we worked on uniform quantization in the DCT domain, which is what JPEG compression does. A possible future work is to consider general non-uniform quantization. In that case, the variables of the quantization optimization will be the integer numbers of levels, float values for the starts of the quantization intervals, and float values for the mapping values in quantization. Therefore, the solution vector (particle in PSO) grows accordingly, with some of its entries being zero when fewer than the maximum number of levels are used.
Acknowledgment
The authors thank Sepideh Shaterian Bidgoli for her very helpful discussions in developing the idea of this paper.
References
 [1] (1974) Discrete cosine transform. IEEE Transactions on Computers 100 (1), pp. 90–93.
 [2] (1994) AT&T face dataset. https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, accessed: July 2019.
 [3] (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7), pp. 711–720.
 [4] (2004) Convex optimization. Cambridge University Press.
 [5] (2005) Search methodologies. Springer.
 [6] (2012) Elements of information theory. John Wiley & Sons.
 [7] (2019) Deep least squares Fisher discriminant analysis. IEEE Transactions on Neural Networks and Learning Systems.
 [8] (2017) Deep Fisher discriminant analysis. In International Work-Conference on Artificial Neural Networks, pp. 501–512.
 [9] (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
 [10] (1989) Regularized discriminant analysis. Journal of the American Statistical Association 84 (405), pp. 165–175.
 [11] (2001) The elements of statistical learning. Springer Series in Statistics, New York.
 [12] (2013) Introduction to statistical pattern recognition. Elsevier.
 [13] (2019) Unsupervised and supervised principal component analysis: tutorial. arXiv preprint arXiv:1906.03148.
 [14] (2019) Eigenvalue and generalized eigenvalue problems: tutorial. arXiv preprint arXiv:1903.11240.
 [15] (2019) Fisher and kernel Fisher discriminant analysis: tutorial. arXiv preprint arXiv:1906.09436.
 [16] (2019) Feature selection and feature extraction in pattern analysis: a literature review. arXiv preprint arXiv:1905.02845.
 [17] (2001) Central limit theorem. Encyclopedia of Mathematics, Springer.
 [18] (2011) Principal component analysis. Springer.
 [19] (2010) Particle swarm optimization. Encyclopedia of Machine Learning, pp. 760–766.
 [20] (1992) Digital compression and coding of continuous-tone still images: requirements and guidelines. ITU-T Recommendation T.81.
 [21] (1998) Rate-distortion methods for image and video compression. IEEE Signal Processing Magazine 15 (6), pp. 23–50.
 [22] (2019) Compression improves image classification accuracy. In Canadian Conference on Artificial Intelligence, pp. 525–530.
 [23] (1998) The symmetric eigenvalue problem. Classics in Applied Mathematics 20.
 [24] (2014) Discrete cosine transform: algorithms, advantages, applications. Academic Press.
 [25] (2013) Monte Carlo statistical methods. Springer Science & Business Media.
 [26] (2015) Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
 [27] (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv.
 [28] (2012) Geometric structure of high-dimensional data and dimensionality reduction. Springer.
 [29] (1994) Image compression using the discrete cosine transform. Mathematica Journal 4 (1), pp. 81.
 [30] (2005) Fisher linear discriminant analysis. Technical report, Department of Computer Science, University of Toronto.
 [31] (2006) Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, pp. 1671–1676.
 [32] (2007) Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pp. 1087–1093.
 [33] (2017) Fashion MNIST dataset. https://www.kaggle.com/zalandoresearch/fashionmnist, accessed: July 2019.
 [34] (2010) Regularized discriminant analysis, ridge regression and beyond. Journal of Machine Learning Research 11 (Aug), pp. 2199–2228.