Supervised subspace learning is useful for embedding high-dimensional data, such as images, in a low-dimensional subspace for better separation of classes. Fisher Discriminant Analysis (FDA) [15, 11], first proposed in [9], was one of the first supervised subspace learning methods and is based on the scatters and variances of data. The FDA subspace tries to separate the classes from one another while the data instances within every class collapse to a small region. Recently, deep FDA [8, 7] was proposed, which uses a least squares approach [32, 34].
On the other hand, it was recently shown empirically that compression can improve classification accuracy. In [22], a pre-trained deep neural network was fed with test images compressed with JPEG (uniform quantization in the DCT domain). Note that they work with data in the pixel domain and not in the frequency domain. They showed empirically that this compression can improve the classification accuracy, which opens a question for further theoretical investigation.
In this paper, we propose Quantized FDA (QFDA), which builds a bridge between machine learning and information theory. In QFDA, we want an optimal subspace which separates the classes favorably for the quantized data while minimizing the distortion and entropy (rate) as much as possible, without sacrificing classification performance. Our aim is to find a subspace for quantized data which is at least as good as the subspace for non-quantized data in terms of separation of classes. Therefore, one can discard the large non-compressed data and store only the quantized data and the learned directions spanning the subspace. In other words, we compress the data while the separation of classes remains acceptable.
In this paper, we use the following notations:
$n$: training sample size
$n_j$: training sample size in the $j$-th class
$d$: dimensionality of data in the input space
$d'$: dimensionality of data after zero-padding
$c$: number of classes
$x_i \in \mathbb{R}^d$: $i$-th training data instance, whose dimensionality becomes $d'$ after zero-padding
$x_i^{(j)} \in \mathbb{R}^d$: $i$-th training data instance in the $j$-th class, whose dimensionality becomes $d'$ after zero-padding
$x_t \in \mathbb{R}^d$: $t$-th test (out-of-sample) data instance, whose dimensionality becomes $d'$ after zero-padding
Other notations will be introduced at their first appearance in the paper.
This paper is organized as follows. Section 2 reviews FDA. Uniform quantization in the frequency domain is reviewed in Section 3. The proposed QFDA is explained in Section 4. The connection of QFDA and rate-distortion optimization in the field of compression is discussed in Section 5. Section 6 reports the experiments including the proposed quantized Fisherfaces for face analysis in QFDA. Finally, Section 7 concludes the paper and discusses the possible future direction.
2 Fisher Discriminant Analysis
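Let $U \in \mathbb{R}^{d \times d}$ be the projection matrix whose columns are the projection directions spanning the subspace, so the subspace is the column space of $U$. If we truncate the projection matrix to $U \in \mathbb{R}^{d \times p}$, the subspace is spanned by $p$ projection directions and is a $p$-dimensional subspace, where $p \leq d$.
The projection into the subspace and the reconstruction of training and out-of-sample data are:
$$\tilde{x}_i := U^\top x_i, \qquad \hat{x}_i := U U^\top x_i = U \tilde{x}_i,$$
$$\tilde{x}_t := U^\top x_t, \qquad \hat{x}_t := U U^\top x_t = U \tilde{x}_t,$$
where $\tilde{x}_i$, $\hat{x}_i$, $\tilde{x}_t$, and $\hat{x}_t$ are the projection and reconstruction of training data and the projection and reconstruction of out-of-sample data, respectively.
FDA maximizes the Fisher criterion:
$$f(U) := \frac{\mathrm{tr}(U^\top S_B\, U)}{\mathrm{tr}(U^\top S_W\, U)}, \tag{5}$$
where $S_B$ and $S_W$ are the between and within scatters, respectively, defined as:
$$S_B := \sum_{j=1}^{c} n_j\, (\mu_j - \mu)(\mu_j - \mu)^\top,$$
$$S_W := \sum_{j=1}^{c} \sum_{i=1}^{n_j} (x_i^{(j)} - \mu_j)(x_i^{(j)} - \mu_j)^\top,$$
where $\mu$ is the mean of all training data and the mean of the $j$-th class is:
$$\mu_j := \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}.$$
The total scatter, $S_T := \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$, is the sum of the between and within scatters, i.e., $S_T = S_B + S_W$.
Therefore, the Fisher criterion, Eq. (5), can be written as:
$$f(U) = \frac{\mathrm{tr}\big(U^\top (S_T - S_W)\, U\big)}{\mathrm{tr}(U^\top S_W\, U)} = \frac{\mathrm{tr}(U^\top S_T\, U)}{\mathrm{tr}(U^\top S_W\, U)} - 1.$$
The $-1$ is a constant and can be dropped in the optimization problem; therefore, the optimization in FDA can be:
$$\underset{U}{\text{maximize}} \;\; \mathrm{tr}(U^\top S_T\, U) \quad \text{subject to} \quad U^\top S_W\, U = I. \tag{12}$$
The Lagrangian of the problem is [4]:
$$\mathcal{L} = \mathrm{tr}(U^\top S_T\, U) - \mathrm{tr}\big(\Lambda^\top (U^\top S_W\, U - I)\big),$$
where $\Lambda$ is a diagonal matrix including the Lagrange multipliers. Setting the derivative of the Lagrangian to zero gives:
$$\frac{\partial \mathcal{L}}{\partial U} = 2\, S_T\, U - 2\, S_W\, U \Lambda = 0 \implies S_T\, U = S_W\, U \Lambda,$$
which is the generalized eigenvalue problem $(S_T, S_W)$ [14]. Hence, the FDA directions are the eigenvectors in this generalized eigenvalue problem. Comparing Eq. (12) with the optimization of PCA [18, 13] shows that PCA captures the orthonormal directions with the maximum variance of data; FDA has the same goal, but its directions are orthonormal with respect to $S_W$.
One of the ways to solve the generalized eigenvalue problem is [14]:
$$U = \mathrm{eig}(S_W^{-1} S_T),$$
where $\mathrm{eig}(\cdot)$ stacks the eigenvectors column-wise. In order to guarantee that $S_W$ is not singular, we strengthen its diagonal:
$$U = \mathrm{eig}\big((S_W + \varepsilon I)^{-1} S_T\big),$$
where $\varepsilon$ is a very small positive number, large enough to make $S_W$ full rank. In the literature, this approach is known as regularized discriminant analysis [10].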
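The eigen-solution described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `fda_directions`, the column-wise data layout, and the default value of `eps` are assumptions made here. It relies on `scipy.linalg.eigh`, which solves the symmetric generalized eigenvalue problem directly and normalizes the eigenvectors with respect to the second matrix.

```python
import numpy as np
from scipy.linalg import eigh


def fda_directions(X, y, eps=1e-5):
    """Sketch of FDA directions from total and within scatters.

    X: (d, n) array with one data instance per column (assumed layout).
    y: (n,) array of class labels.
    eps: small positive number added to the diagonal of the within scatter.
    """
    d, _ = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    S_T = Xc @ Xc.T  # total scatter
    S_W = np.zeros((d, d))
    for label in np.unique(y):
        Xj = X[:, y == label]
        Xj = Xj - Xj.mean(axis=1, keepdims=True)
        S_W += Xj @ Xj.T  # within scatter of this class
    S_W += eps * np.eye(d)  # strengthen the diagonal so S_W is full rank
    # Generalized eigenvalue problem S_T u = lambda S_W u.
    eigvals, U = eigh(S_T, S_W)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    return U[:, order], eigvals[order]
```

Truncating the returned matrix to its first few columns gives a lower-dimensional subspace; projecting a data matrix is then `U[:, :p].T @ X`.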
3 Uniform Quantization in DCT Domain
In this section, we describe a simple version of uniform quantization in the Discrete Cosine Transform (DCT) domain [1, 24]. This quantization method defines the well-known JPEG (Joint Photographic Experts Group) compression [20, 27].
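We first center the data, i.e., remove their mean:
$$x_i \leftarrow x_i - \frac{1}{n} \sum_{k=1}^{n} x_k. \tag{18}$$
Then, we reshape every image (a column of the data matrix) to its image-array form. The image is then zero-padded so that its height and width contain an integer number of $(b \times b)$-pixel blocks. Let $d'$ denote the new dimensionality of the image after zero-padding. Afterwards, the image is divided into its blocks. Let $B \in \mathbb{R}^{b \times b}$, with entries $B(r, s)$, denote an image block. A two-dimensional DCT is applied to every block:
$$F(u, v) = \alpha_u\, \alpha_v \sum_{r=0}^{b-1} \sum_{s=0}^{b-1} B(r, s)\, \cos\Big(\frac{(2r+1)\, u\, \pi}{2b}\Big) \cos\Big(\frac{(2s+1)\, v\, \pi}{2b}\Big),$$
where $F \in \mathbb{R}^{b \times b}$ is the signal in the DCT domain with $u, v \in \{0, \dots, b-1\}$. Therefore, for every block, we have $b^2$ frequencies. The $\alpha_u$ and $\alpha_v$ are:
$$\alpha_u := \begin{cases} \sqrt{1/b} & \text{if } u = 0, \\ \sqrt{2/b} & \text{otherwise}, \end{cases} \qquad \alpha_v := \begin{cases} \sqrt{1/b} & \text{if } v = 0, \\ \sqrt{2/b} & \text{otherwise}. \end{cases}$$
We denote $F$ reshaped to a vector by $f \in \mathbb{R}^{b^2}$. Moreover, the image transformed to the DCT (frequency) domain is formed from the DCT coefficients of all its blocks.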
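The zero-padding and block-wise transform described above can be illustrated with the following sketch. It is not the authors' code: the function names are hypothetical, the default block size of 8 is an assumption borrowed from standard JPEG, and the orthonormal DCT-II from `scipy.fft.dctn` is assumed to be the intended normalization.

```python
import numpy as np
from scipy.fft import dctn, idctn


def blockwise_dct(image, block=8):
    """Zero-pad a 2-D image to whole blocks and apply a 2-D DCT per block.

    Returns an array of shape (rows_of_blocks, cols_of_blocks, block, block).
    """
    h, w = image.shape
    H = -(-h // block) * block  # round height up to a multiple of `block`
    W = -(-w // block) * block  # round width up to a multiple of `block`
    padded = np.zeros((H, W), dtype=float)
    padded[:h, :w] = image
    # Split the padded image into non-overlapping blocks.
    blocks = padded.reshape(H // block, block, W // block, block).swapaxes(1, 2)
    # Orthonormal DCT-II over the last two axes, i.e., within every block.
    return dctn(blocks, type=2, norm="ortho", axes=(-2, -1))


def blockwise_idct(coeffs):
    """Invert blockwise_dct and reassemble the zero-padded image."""
    blocks = idctn(coeffs, type=2, norm="ortho", axes=(-2, -1))
    n_rows, n_cols, block, _ = blocks.shape
    return blocks.swapaxes(1, 2).reshape(n_rows * block, n_cols * block)
```

With this normalization, the entry at index `[0, 0]` of every block is the DC component, and the remaining entries are the higher frequencies of that block.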
Let $Q$ denote the uniform quantization function:
where $q$ is the vector whose $k$-th element, $q_k$, is the number of quantization levels for the $k$-th frequency. More specifically, every $q_k$ is a positive integer that does not exceed an upper bound on the number of quantization levels. To set up the quantization, we first bootstrap a sample of images from the training dataset. This bootstrap sample is an estimate of the whole dataset according to the Monte-Carlo approximation [25]. We calculate:
where $|\cdot|$ denotes the absolute value, every DCT block is reshaped to a vector, and the entries are indexed as the $k$-th frequency in the $m$-th block of the $i$-th training image.
Eq. (21) quantizes every frequency in every block of the image, where the quantization levels for a frequency are the same in different blocks. The mapping has the following steps:
where $\lfloor \cdot \rfloor$ denotes the floor function (rounding to the largest integer less than or equal to the input value), and:
Two examples of quantization using the above formulae are provided in Fig. 1. We denote the quantized signal of image by . We take the notations and .
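To make the idea of uniform quantization with a given number of levels concrete, the following sketch implements a generic floor-based uniform quantizer on a symmetric range. It is an assumption-laden illustration and not necessarily the exact mapping of Eq. (21): the function name, the symmetric range `[-max_abs, max_abs]`, and the choice of interval midpoints as reconstruction values are all choices made here.

```python
import numpy as np


def uniform_quantize(x, n_levels, max_abs):
    """Generic uniform quantizer on [-max_abs, max_abs] (illustrative only).

    x: array of DCT coefficients for one frequency.
    n_levels: number of quantization levels (positive integer).
    max_abs: largest absolute value expected for this frequency.
    """
    x = np.asarray(x, dtype=float)
    if max_abs <= 0:
        return np.zeros_like(x)
    step = 2.0 * max_abs / n_levels  # length of every quantization interval
    # Index of the interval each value falls into, using the floor function.
    idx = np.floor((x + max_abs) / step)
    idx = np.clip(idx, 0, n_levels - 1)  # keep boundary values in range
    # Map every value to the midpoint of its interval.
    return -max_abs + (idx + 0.5) * step
```

In such a scheme, a frequency with a single level is mapped to zero everywhere, while more levels preserve more detail of that frequency.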
4 Quantized Fisher Discriminant Analysis
4.1 Quantized Fisher Criterion
In QFDA, the total and within scatters are defined as:
The scatters involve a hyperparameter controlling the relative importance of their first and second terms. The hyperparameters in the total and within scatters can be different, but we use the same one for the sake of simplicity.
The first aim of QFDA is to solve the following problem:
We name the objective function in Eq. (30) the quantized Fisher criterion and denote it by .
4.2 Minimization of Rate
At the same time, QFDA aims to minimize the rate of the quantized data for better compression. The rate of quantization of the $k$-th frequency is defined as [6]:
where $\Delta_\ell$ is the length of the $\ell$-th quantization level and $P_\ell$ is the area under the Probability Density Function (PDF) of the DCT signal in the interval of the $\ell$-th quantization level:
where the integrand is the PDF of the DCT transformed signal for the $k$-th frequency. In order to estimate this PDF, we take the same bootstrapped images which we have from Section 3. In other words, we use the $k$-th frequency in all the blocks of the sampled images. Then, for every frequency, we use kernel density estimation [26] with a Gaussian kernel, because the Gaussian is a very common distribution for real-world signals, which is also supported by the central limit theorem [17]. Note that, again, the estimation of the PDF with a bootstrapped sample from the data follows the Monte-Carlo approximation [25].
The rate can be approximated by the entropy because if we use in Eq. (31), we have:
which is the summation of entropies in the quantization intervals. In other words, if the frequency has a lot of information because of significant changes amongst the blocks/images, it should have a large number of quantization levels and we expect its rate/entropy to be large. We use the approximation in Eq. (33) for the rate in QFDA optimization. We calculate an overall rate for the dataset by averaging over the rates of frequencies:
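The entropy-based approximation of the rate can be illustrated as follows. This is a sketch under assumptions rather than the paper's exact Eq. (33): the function name is hypothetical, the quantization intervals are assumed to partition a symmetric range around zero uniformly, and the interval probabilities are renormalized to sum to one because a Gaussian kernel density estimate places a little mass outside the sampled range.

```python
import numpy as np
from scipy.stats import gaussian_kde


def entropy_rate(samples, n_levels):
    """Entropy (in bits) of one frequency quantized into n_levels intervals.

    samples: 1-D array with the values of this frequency over all blocks of
             the bootstrapped images.
    """
    samples = np.asarray(samples, dtype=float)
    max_abs = np.max(np.abs(samples))
    kde = gaussian_kde(samples)  # Gaussian kernel density estimate of the PDF
    edges = np.linspace(-max_abs, max_abs, n_levels + 1)
    # Area under the estimated PDF inside every quantization interval.
    p = np.array([kde.integrate_box_1d(lo, hi)
                  for lo, hi in zip(edges[:-1], edges[1:])])
    p = p / p.sum()  # renormalize: the KDE leaks mass outside the range
    p = p[p > 0]     # skip empty intervals, treating 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())
```

An overall rate in the spirit of the text would then be the mean of `entropy_rate` over all frequencies, each evaluated with its own number of levels.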
4.3 Optimization for QFDA
The complete cost function in QFDA is:
which minimizes the rate but maximizes the quantized Fisher criterion, where the optimization variable is the vector containing the number of quantization levels for the frequencies. The rate term is weighted by the regularization parameter of this optimization.
The columns of the projection matrix are the quantized Fisher directions. We can truncate this matrix to obtain a $p$-dimensional quantized Fisher subspace, where $p$ is no larger than the number of columns. In other words, the column space of the projection matrix is the quantized Fisher subspace. We calculate the quantized Fisher directions by solving Eq. (30) for a given vector of quantization levels. The solution is similar to the solution of Eq. (12), which is Eq. (15), where we use Eqs. (28) and (29) for the total and within scatters, respectively.
We solve Eq. (35) using Particle Swarm Optimization (PSO) [19], which is a metaheuristic optimization method [5]. We use a heuristic method because the cost does not lend itself to an analytical solution and because the search space is discrete, which makes the problem harder. The cost of PSO is the cost in Eq. (35) where, before feeding a solution (particle) to it, we first project the input to the valid set of values. We define the projection for the $k$-th frequency as:
Note that the quantized Fisher criterion in the cost is calculated as explained before for a given vector of quantization levels (particle). The rate in the cost is also calculated using Eq. (33).
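A minimal PSO loop with a projection step could look like the sketch below. It is illustrative only: the projection used here (rounding to the nearest integer and clipping to `[1, q_max]`) is an assumption standing in for the paper's projection, the inertia and acceleration coefficients are conventional values chosen here, and `cost` is a placeholder for the QFDA cost evaluated for a given vector of quantization levels.

```python
import numpy as np


def project(q, q_max):
    """Assumed projection: round to integers and clip to [1, q_max]."""
    return np.clip(np.rint(q), 1, q_max).astype(int)


def pso_minimize(cost, dim, q_max, n_particles=5, n_iters=10, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize cost(q) over integer vectors q using a basic PSO."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(1, q_max, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    # Every particle is projected to the valid set before it is evaluated.
    pbest = pos.copy()
    pbest_cost = np.array([cost(project(p, q_max)) for p in pos])
    gbest = pbest[np.argmin(pbest_cost)].copy()
    gbest_cost = pbest_cost.min()
    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = (inertia * vel
               + c_personal * r1 * (pbest - pos)
               + c_global * r2 * (gbest - pos))
        pos = pos + vel
        costs = np.array([cost(project(p, q_max)) for p in pos])
        improved = costs < pbest_cost
        pbest[improved] = pos[improved]
        pbest_cost[improved] = costs[improved]
        if pbest_cost.min() < gbest_cost:
            gbest_cost = pbest_cost.min()
            gbest = pbest[np.argmin(pbest_cost)].copy()
    return project(gbest, q_max), gbest_cost
```

Because every evaluation of the true cost involves an eigenvalue problem, the small swarm size and iteration count used as defaults here keep the number of cost evaluations low.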
5 Connection to Rate-Distortion Optimization
Quantization and compression usually deal with rate-distortion optimization [21, 6], where the rate and distortion are in a trade-off, meaning that lower distortion usually comes with a higher rate and vice versa. In rate-distortion optimization, the rate is minimized for better compression while the distortion is also minimized to better preserve quality.
The optimization in Eq. (35) can be interpreted as the rate-distortion optimization. We explain the reason in the following. The criterion in Eq. (30) is a generalized Rayleigh-Ritz Quotient . Hence, the optimization (30), for a given , is equivalent to:
where maximization has been changed to minimization by negating the objective function. The Lagrangian relaxation  of the problem is:
where the diagonal of are the Lagrange multipliers. Now, consider the Eq. (35) whose Lagrange relaxation is similarly as the following:
where we have dropped the terms concerning the constraints in Eq. (35) for simpler analysis. This equation shows that we are minimizing the within scatter of the projected data. According to Eq. (29), the terms and are minimized. Minimization of the former minimizes the cloud of -th quantized class in the QFDA subspace. Therefore, it plays the role of minimization of the entropy (or information) which helps minimization of the rate. On the other hand, minimization of is like minimization of the distortion. The reason is that this term shows the variance (dissimilarity) of the quantized (i.e., distorted) data from non-distorted data. Hence, its minimization plays the minimization of distortion.
Although the terms and also exist in which is maximized in Eq. (39), but notice that they represent the total variance of data which guarantees better separation of classes in the subspace. Therefore, it makes more sense to consider for rate-distortion optimization and not . Moreover, the rate exists in Eq. (39) which is minimized as expected. Therefore, in that equation, rate is minimized (in and ) and distortion is also minimized (in ) as we have in rate-distortion optimization. It is also noteworthy that, according to Eqs. (29) and (39), the trade-off of minimization of rate and distortion is controlled by the two hyperparameters and .
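6 Experiments
Table 1: Validation error rates (average ± standard deviation) of QFDA on the AT&T face dataset for different hyperparameter values.
| 0.01 | 0.122 ± 0.062 | 0.129 ± 0.049 | 0.127 ± 0.052 | 0.182 ± 0.069 | 0.146 ± 0.058 |
| 0.1 | 0.117 ± 0.074 | 0.135 ± 0.053 | 0.165 ± 0.074 | 0.157 ± 0.064 | 0.121 ± 0.063 |
| 1 | 0.125 ± 0.064 | 0.107 ± 0.066 | 0.169 ± 0.068 | 0.132 ± 0.063 | 0.153 ± 0.063 |
| 10 | 0.123 ± 0.054 | 0.117 ± 0.056 | 0.117 ± 0.049 | 0.156 ± 0.060 | 0.151 ± 0.079 |
Table 2: Validation error rates (average ± standard deviation) of QFDA on the Fashion MNIST dataset for different hyperparameter values.
| 0.01 | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122 |
| 0.1 | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122 |
| 1 | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122 |
| 10 | 0.266 ± 0.125 | 0.266 ± 0.126 | 0.256 ± 0.130 | 0.265 ± 0.130 | 0.263 ± 0.122 |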
6.1 Quantized Fisherfaces vs. Fisherfaces
We used the AT&T face dataset  which includes facial images of different people. The images have different expressions and poses making this dataset hard enough. The size of images, in this dataset, are pixels. For computational reasons, we resampled the images to pixels because calculation of eigenvectors takes time and every iteration of optimization includes calculation of eigenvectors. We divided the images into two classes of having and not having eye glasses. We split data to training, test, and validation sets with proportions , , and , respectively.
We used the validation set for finding the best values for and . For several permutations of values of these two hyperparameters, we ran the PSO optimization. For every PSO optimization, we used five particles and we found ten iterations to be sufficient. A 10-Nearest Neighbor (10-NN) classification was used for evaluating the QFDA on all training, test, and validation sets. Nearest neighbor classification is useful to evaluate the structure of the embedded data in the subspace. Note that the training set was used for classification of test and validation sets.
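The nearest-neighbor evaluation of an embedding can be sketched as below. This is an illustration and not the authors' evaluation script: the function names, the row-wise layout of the embedded data, the use of Euclidean distance, and the non-negative integer labels are assumptions made here. The second helper averages the error over an increasing number of leading subspace dimensions.

```python
import numpy as np


def knn_error(train_Z, train_y, test_Z, test_y, k=10):
    """k-NN error rate; rows of train_Z and test_Z are embedded instances.

    Labels are assumed to be non-negative integers.
    """
    # Squared Euclidean distances between every test and training instance.
    diff = test_Z[:, None, :] - train_Z[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = train_y[nearest]
    # Majority vote among the k nearest training instances.
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float(np.mean(pred != test_y))


def error_over_dims(train_Z, train_y, test_Z, test_y, max_dim, k=10):
    """Mean and standard deviation of the error over the first 1..max_dim dims."""
    errors = [knn_error(train_Z[:, :p], train_y, test_Z[:, :p], test_y, k)
              for p in range(1, max_dim + 1)]
    return float(np.mean(errors)), float(np.std(errors))
```

Passing the training set as both arguments gives the training error, while passing the test or validation embedding as `test_Z` classifies those sets against the training set.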
For evaluation, we performed the classification using an increasing number of leading dimensions of the subspace, starting from the first one. We report the results as the average (and standard deviation) over the resulting error rates. Table 1 reports the validation errors, from which the best hyperparameter values were selected. In the following, we report the training, test, and validation errors for these optimum hyperparameters:
training: 0.126 ± 0.032
test: 0.208 ± 0.041
validation: 0.107 ± 0.066
Note that the above results are the QFDA results for embedding the quantized DCT transformed images. The average error rates for embedding the original (non-quantized) DCT transformed images using the FDA:
training: 0.112 ± 0.017
test: 0.185 ± 0.021
validation: 0.179 ± 0.025
Except on the validation set, the error rates of QFDA are slightly higher than those of FDA; however, the compression is substantial, from 20.2 kilobytes per image to 10 kilobytes, while the drop in classification accuracy is small. This comparison suggests that we can discard the original data and its FDA subspace and keep the compressed images with the QFDA subspace to classify the quantized data.
The found optimum number of quantization levels for the frequencies are depicted as a bar plot in Fig. 2-(a). The result is convincing because the DC component (frequency ) usually has the smallest variance in the pixel domain; thus it has largest variance in the frequency domain. The opposite analysis exists for the highest frequency. Hence, we expect that more quantization levels should be assigned to the lower frequencies.
The inverse DCT transforms of the leading eigenvectors (directions) of the QFDA and FDA subspaces are shown in Fig. 3. The facial eigenvectors, or ghost faces, of FDA are referred to as Fisherfaces in the literature [3]. Similarly, we name the ghost faces in QFDA quantized Fisherfaces. Comparing the Fisherfaces and quantized Fisherfaces shows that the quantized Fisherfaces account for the JPEG blocking that results from the quantization. Both Fisherfaces and quantized Fisherfaces capture features of the eye regions of faces, which is expected because the two classes differ in having or not having eye glasses. Note that we have more than one eigenvector here for both FDA and QFDA because we have regularized the within scatter in Eq. (15).
Recall that in Eq. (18), we centered the data. Several zero-padded images, centered data, and quantized images are shown in Fig. 4. As mentioned before, the size of every image has been halved without any significant drop in classification accuracy. As shown in this figure, the eyes and glasses have been quantized with higher quality, as expected, because the eyes are important for classification between having and not having eye glasses. On the other hand, some less important facial regions are quantized with lower quality, which is expected for minimization of the rate.
6.2 Experiments on Fashion MNIST Dataset
In order to do experiments for more than two classes, we used the Fashion MNIST dataset  which includes classes of different clothing items. The size of images, in this dataset, is pixels. The settings of PSO and cross validation were the same as the previous experiment. Table 2 shows the validation results for this dataset. This table shows that for this dataset, QFDA is almost robust to the changes of . The reason might be because the size of images is small in this dataset and thus the rate is usually high (distortion is small) because we do not have huge number of dummy pixels. The quantized images, shown in Fig. 5, also show that distortion is not significant in this dataset, although some level of distortion is observed if one looks at the images carefully.
The table 2 shows that can be a good choice for the parameters. In the following, we report the training, test, and validation errors for these optimum hyperparameters:
training: 0.236 ± 0.102
test: 0.283 ± 0.133
validation: 0.263 ± 0.122
The above results are the QFDA results for embedding the quantized DCT transformed images. The average error rates for embedding the original (non-quantized) DCT transformed images using the FDA:
training: 0.273 ± 0.086
test: 0.315 ± 0.114
validation: 0.317 ± 0.121
This shows that, on this dataset, QFDA achieved lower average error rates than FDA on the training, test, and validation sets. The image size before and after quantization is approximately 7.91 and 7.89 kilobytes, respectively, which means that the compression was not significant, as explained before.
The found optimum number of quantization levels for the frequencies are depicted as a bar plot in Fig. 2-(b). The same interpretation as before exists for why DC component should have larger number of quantization levels.
The inverse DCT transform of eighteen leading eigenvectors of QFDA and FDA are illustrated in Fig. 6. This figure shows that the eigenvectors of QFDA have captured more features in comparison to the eigenvectors of FDA. The eigenvectors of FDA have captured partially scanned features rather than complete features. This explains the better results of QFDA.
7 Conclusion and Future Direction
This paper proposed Quantized Fisher Discriminant Analysis (QFDA), which builds a bridge between machine learning, manifold learning, and information theory. The literature combining machine learning and information theory is limited, and this paper tried to address this gap. The method optimizes a proposed cost function using the PSO algorithm. This cost function can be interpreted as a rate-distortion cost, as used in compression. The quantized Fisherfaces method was also proposed for facial analysis in QFDA. The experiments, which report validation errors, the optimum number of quantization levels, visualizations of the QFDA eigenvectors, and the quantized images, showed the merit of this new subspace learning method.
In this paper, we worked on uniform quantization in the DCT domain which is what JPEG compression does. A possible future work is to consider a general non-uniform quantization. In that case, the variables for quantization optimization will be an integer number of levels, float values for start of quantization intervals, and float values for the mapping values in quantization. Therefore, if the upperbound on is , the vector of solution (particle in PSO) should have or just dimensions where some of its entries will be zero if .
The authors thank Sepideh Shaterian Bidgoli for her very helpful discussions in developing the idea of this paper.
-  (1974) Discrete cosine transform. IEEE transactions on Computers 100 (1), pp. 90–93. Cited by: §3.
-  (1994) AT&T face dataset. Note: https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html. Online, accessed: July 2019. Cited by: §6.1.
-  (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7), pp. 711–720. Cited by: §6.1.
-  (2004) Convex optimization. Cambridge university press. Cited by: §2, §5.
-  (2005) Search methodologies. Springer. Cited by: §4.3.
-  (2012) Elements of information theory. John Wiley & Sons. Cited by: §1, §3, §4.2, §5.
-  (2019) Deep least squares Fisher discriminant analysis. IEEE transactions on neural networks and learning systems. Cited by: §1.
-  (2017) Deep Fisher discriminant analysis. In International Work-Conference on Artificial Neural Networks, pp. 501–512. Cited by: §1.
-  (1936) The use of multiple measurements in taxonomic problems. Annals of eugenics 7 (2), pp. 179–188. Cited by: §1, §2.
-  (1989) Regularized discriminant analysis. Journal of the American statistical association 84 (405), pp. 165–175. Cited by: §2.
-  (2001) The elements of statistical learning. Springer series in statistics New York. Cited by: §1, §1, §2.
-  Introduction to statistical pattern recognition. Elsevier. Cited by: §2.
-  Unsupervised and supervised principal component analysis: tutorial. arXiv preprint arXiv:1906.03148. Cited by: §2.
-  (2019) Eigenvalue and generalized eigenvalue problems: tutorial. arXiv preprint arXiv:1903.11240. Cited by: §2, §2.
-  (2019) Fisher and kernel fisher discriminant analysis: tutorial. arXiv preprint arXiv:1906.09436. Cited by: §1, §2.
-  (2019) Feature selection and feature extraction in pattern analysis: a literature review. arXiv preprint arXiv:1905.02845. Cited by: §1.
-  (2001) Central limit theorem. Encyclopedia of Mathematics, Springer. Cited by: §4.2.
-  (2011) Principal component analysis. Springer. Cited by: §2.
-  (2010) Particle swarm optimization. Encyclopedia of machine learning, pp. 760–766. Cited by: §4.3.
-  (1992) Digital compression and coding of continuous-tone still images: requirements and guidelines. ITU-T Recommendation T 81. Cited by: §3.
-  (1998) Rate-distortion methods for image and video compression. IEEE Signal processing magazine 15 (6), pp. 23–50. Cited by: §5.
-  Compression improves image classification accuracy. Canadian Conference on Artificial Intelligence, pp. 525–530. Cited by: §1.
-  (1998) The symmetric eigenvalue problem. Classics in Applied Mathematics 20. Cited by: §2, §5.
-  (2014) Discrete cosine transform: algorithms, advantages, applications. Academic press. Cited by: §3.
-  (2013) Monte carlo statistical methods. Springer Science & Business Media. Cited by: §3, §4.2.
-  (2015) Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons. Cited by: §4.2.
-  (1992) The JPEG still picture compression standard. IEEE transactions on consumer electronics 38 (1), pp. xviii–xxxiv. Cited by: §3, §3.
-  (2012) Geometric structure of high-dimensional data and dimensionality reduction. Springer. Cited by: §2.
-  (1994) Image compression using the discrete cosine transform. Mathematica journal 4 (1), pp. 81. Cited by: §3.
-  (2005) Fisher linear discriminant analysis. Technical report Department of Computer Science, University of Toronto. Cited by: §2.
-  (2006) Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, pp. 1671–1676. Cited by: §2.
-  (2007) Least squares linear discriminant analysis. In Proceedings of the 24th international conference on machine learning, pp. 1087–1093. Cited by: §1, §2.
-  (2017) Fashion MNIST dataset. Note: https://www.kaggle.com/zalando-research/fashionmnist. Online, accessed: July 2019. Cited by: §6.2.
-  Regularized discriminant analysis, ridge regression and beyond. Journal of Machine Learning Research 11 (Aug), pp. 2199–2228. Cited by: §1.