
Principal Component Analysis Based on Tℓ_1-norm Maximization

Classical principal component analysis (PCA) may suffer from sensitivity to outliers and noise, so PCA variants based on the ℓ_1-norm and the ℓ_p-norm (0 < p < 1) have been studied. Among them, the ℓ_p-norm-based ones seem the most interesting from the robustness point of view; however, their numerical performance is not satisfactory. Note that, although the Tℓ_1-norm is similar to the ℓ_p-norm (0 < p < 1) in some sense, it has a stronger suppression effect on outliers and better continuity. So a PCA based on the Tℓ_1-norm is proposed in this paper. Our numerical experiments show that its performance is clearly superior to that of PCA-ℓ_p and ℓ_pSPCA, as well as PCA and PCA-ℓ_1.



I Introduction

Principal component analysis (PCA) [1, 2], a popular toolkit for data processing and pattern recognition, has been widely investigated during the last decades. It is often utilized for dimensionality reduction. PCA tries to find a set of projection vectors, consisting of linear combinations of the given data, that either maximizes the dispersion of the projected data or minimizes the reconstruction error. These projection vectors construct a low-dimensional subspace that can capture the intrinsic structure of the original data.

However, classical PCA has a fatal drawback: it is sensitive to outliers because it uses the ℓ_2-norm metric. To overcome this problem, the ℓ_2-norm is substituted by the ℓ_1-norm. Baccini et al. [3] proposed a PCA based on the ℓ_1-norm (ℓ_1-PCA) by minimizing the reconstruction error, and correspondingly, a heuristic algorithm based on maximum likelihood estimation was presented. Subsequently, the weighted median algorithm and the quadratic programming algorithm were proposed in [4], where the robustness of the ℓ_1-norm was also addressed. Noticing that the above ℓ_1-PCA methods are not rotationally invariant, Ding et al. [5] proposed a rotationally invariant ℓ_1-norm PCA (R_1-PCA) which combines the merits of ℓ_2- and ℓ_1-norm PCA. However, Kwak [6] pointed out that R_1-PCA depends highly on the dimension of the subspace to be found, and in [6] Kwak also proposed a PCA based on ℓ_1-norm maximization of the dispersion (PCA-ℓ_1). PCA-ℓ_1 is a greedy method with easy implementation; Nie et al. then proposed its non-greedy version with better experimental results in [7]. Unlike the aforementioned methods, where only a local solution can be obtained, another ℓ_1-norm-based PCA proposed in [8] can find a global solution. In addition, other ℓ_1-norm-based PCA methods are concerned with sparseness, regularization, the kernel trick and the two-dimensional (2D) problem [9, 10, 11, 12, 13, 14, 15].

To further improve the robustness, some researchers noticed the ℓ_p-norm. Liang et al. [16] proposed a generalized PCA based on the ℓ_p-norm (ℓ_p-norm GPCA), where the ℓ_p-norm was employed in the constraint instead of the objective function. In [17], Kwak extended PCA-ℓ_1 to PCA-ℓ_p for an arbitrary p > 0 and proposed both greedy and non-greedy algorithms; the proposed algorithms are convergent under the condition p ≥ 1. Other ℓ_p-norm-based PCA methods are concerned with the low-rank technique, sparseness and the 2D problem [18, 19, 20]. It is naturally believed that the ℓ_p-norm is more robust than the ℓ_1-norm when 0 < p < 1, but it does not satisfy Lipschitz continuity, which is important for robustness [21, 22]. Moreover, most ℓ_p-norm PCA methods have been shown to be non-monotonic when 0 < p < 1. All of these restrict the applications of ℓ_p-norm PCAs.

In this paper, to give a more robust PCA with a Lipschitz-continuous measurement, the Tℓ_1-norm is studied. Indeed, the Tℓ_1-norm is similar to the ℓ_p-norm (0 < p < 1) in some sense, but it has a stronger suppression effect on outliers and better continuity. Using this norm, we propose a PCA based on the Tℓ_1-norm (Tℓ_1PCA) by maximizing the Tℓ_1-norm-based dispersion in the projection space. Correspondingly, to solve the optimization problem, a modified gradient ascent method on the sphere is constructed. The results of preliminary experiments show that Tℓ_1PCA is more robust than some current PCAs based on the ℓ_1-norm and the ℓ_p-norm.

The rest of this paper is organized as follows. In Section II, we introduce and analyze the Tℓ_1-norm. In Section III, the optimization problem of our PCA based on the Tℓ_1-norm is formulated. To solve the optimization problem, an ascent method is constructed and investigated in Section IV. In Section V, Tℓ_1PCA is applied to several artificial and real datasets and its performance is compared with some other current PCA methods. Finally, the conclusion follows in Section VI.

II Tℓ_1-norm

In this section, based on the transformed ℓ_1 (Tℓ_1) penalty function [23, 24, 26, 27, 28, 25], the Tℓ_1-norm is introduced: for a vector x = (x_1, …, x_n)^T ∈ R^n, we define the Tℓ_1-norm as

‖x‖_{Tℓ_1} = Σ_{i=1}^n ρ_a(x_i),   (1)

where ρ_a(·) is the component operator

ρ_a(t) = (a+1)|t| / (a+|t|),   (2)

and a > 0 is a positive shape parameter. Generally speaking, a norm ‖·‖ should satisfy the following three properties: i) positive definiteness: ‖x‖ ≥ 0 for all x, and ‖x‖ = 0 iff x = 0; ii) triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y; iii) absolute homogeneity: ‖αx‖ = |α| ‖x‖ for all x and all scalars α. Obviously, the Tℓ_1-norm satisfies the first two properties but not the third one. So, strictly speaking, the Tℓ_1-norm is not a norm; in this paper, we still call it the Tℓ_1-norm for convenience.
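These properties are easy to verify numerically. The sketch below (in Python; `tl1_norm` is our own helper name, not from the paper) computes the Tℓ_1-norm from (1)–(2) and checks that positive definiteness and the triangle inequality hold while absolute homogeneity fails:

```python
import numpy as np

def tl1_norm(x, a=1.0):
    """Tl1-'norm': sum of rho_a(x_i), where rho_a(t) = (a+1)|t| / (a+|t|)."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.sum((a + 1.0) * x / (a + x))

# i) Positive definiteness: zero only at the zero vector.
assert tl1_norm([0.0, 0.0]) == 0.0 and tl1_norm([1.0, 0.0]) > 0.0

# ii) Triangle inequality holds (rho_a is concave with rho_a(0)=0,
#     hence subadditive) ...
rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
assert tl1_norm(u + v) <= tl1_norm(u) + tl1_norm(v) + 1e-12

# iii) ... but absolute homogeneity fails, so Tl1 is not a true norm:
assert not np.isclose(tl1_norm(2.0 * u), 2.0 * tl1_norm(u))
```

The failed homogeneity check makes the "not a norm" remark concrete: scaling a vector by 2 does not double its Tℓ_1 value, because ρ_a saturates.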

Further, we discuss the properties of the Tℓ_1-norm and compare them with those of the ℓ_p-norm (0 < p < 1). The ℓ_p-norm of a vector x ∈ R^n is given, up to the p-th power, by

‖x‖_p^p = Σ_{i=1}^n φ_p(x_i),   (3)

where φ_p(·) is the component operator

φ_p(t) = |t|^p.   (4)
It is known (see, e.g., [25]) that the Tℓ_1-norm is related to the ℓ_p-norm in the following way: for any vector x, as the parameter a varies, the Tℓ_1-norm interpolates between the ℓ_0-norm and the ℓ_1-norm:

lim_{a→0^+} ‖x‖_{Tℓ_1} = ‖x‖_0,   lim_{a→+∞} ‖x‖_{Tℓ_1} = ‖x‖_1.   (5)
To show the similarity between the Tℓ_1-norm and the ℓ_p-norm, their contours for several values of a and p are plotted in [25]. From that set of figures, it is concluded that the Tℓ_1-norm with suitable values of a indeed approximates the ℓ_p-norm over the whole range 0 ≤ p ≤ 1. This seems to imply that a one-to-one relationship exists between a and p making the Tℓ_1-norm approximate the ℓ_p-norm. However, investigating the definitions of the Tℓ_1-norm and the ℓ_p-norm carefully, we do find a severe difference. In fact, we need only compare their component operators, as shown in the following property.

Property 1.

For any fixed a > 0 and 0 < p < 1, comparing ρ_a(t) with φ_p(t), there exist the following conclusions: the function ρ_a(t) is Lipschitz-continuous with Lipschitz constant (a+1)/a; when |t| increases from 0 to infinity, the function value increases from 0 to the finite value a+1. However, φ_p(t) is not Lipschitz-continuous; when |t| increases from 0 to infinity, the function value increases from 0 to infinity.

Fig. 1: The differences between ρ_a(t) and φ_p(t). (a) For small values of t. (b) For large values of t.

Note that the above property points out the difference between the Tℓ_1-norm with any a > 0 and the ℓ_p-norm with any 0 < p < 1. For one representative pair of a and p, both component operators ρ_a(t) and φ_p(t) are shown in Fig. 1, where Fig. 1(a) and Fig. 1(b) indicate their difference when t is small or large, respectively. Corresponding to Fig. 1(a), near the origin ρ_a(t) grows no faster than its Lipschitz constant allows, while φ_p(t) has an unbounded slope. Corresponding to Fig. 1(b), for large t, ρ_a(t) stays below its finite bound a+1, while φ_p(t) grows without limit. So, there is a marked difference between the Tℓ_1-norm and the ℓ_p-norm whether t is small or large.

Fig. 2: The comparisons of t², |t|, |t|^p and ρ_a(t). The functions t², |t| and |t|^p increase from 0 to infinity. The function ρ_a(t) has the asymptote a+1.

The above discussion implies the advantages of applying the Tℓ_1-norm to robust problems. In fact, recall the development course of the norm in PCA: from the ℓ_2-norm to the ℓ_1-norm and then to the ℓ_p-norm, with corresponding component operators from t² to |t| and then to |t|^p. Fig. 2 shows the graphs of these three operators. Obviously, for large t, as t increases, the growth slows down gradually from t² to |t| and then to |t|^p. Fig. 2 also shows the graph of the component operator ρ_a(t) of the Tℓ_1-norm; ρ_a(t) with fixed a is bounded, and from |t|^p to ρ_a(t), the growth slows further. This means that the Tℓ_1-norm has a better suppression effect on outliers. In addition, as discussed above, ρ_a(t) has better continuity than φ_p(t), especially its Lipschitz continuity, which is good for robustness. Therefore, using the Tℓ_1-norm in PCA can be expected to give better robustness than the ℓ_p-norm.
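The boundedness and Lipschitz claims of Property 1 can be checked numerically; the sketch below uses our own illustrative choices a = 1 and p = 0.5 (not parameters fixed by the paper):

```python
import numpy as np

a, p = 1.0, 0.5                                            # illustrative choices
rho = lambda t: (a + 1.0) * np.abs(t) / (a + np.abs(t))    # Tl1 component operator
phi = lambda t: np.abs(t) ** p                             # lp component operator

t = np.linspace(0.0, 1e6, 200_001)

# rho_a stays below its asymptote a + 1, while |t|^p grows without bound.
assert rho(t).max() < a + 1.0
assert phi(t).max() > 100.0

# Difference quotients near 0: rho_a respects its Lipschitz constant (a+1)/a,
# whereas the slope of |t|^p blows up like p * t**(p-1) as t -> 0.
dt = 1e-8
assert (rho(dt) - rho(0.0)) / dt <= (a + 1.0) / a + 1e-6
assert (phi(dt) - phi(0.0)) / dt > 1e3
```

Both assertions on the slopes make the "suppression effect" tangible: an outlier contributes at most a + 1 to the Tℓ_1 objective, however large it is, while its ℓ_p contribution keeps growing.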

III Problem formulation

Let X = [x_1, …, x_n] ∈ R^{d×n} be a given data matrix, where d and n denote the dimension of the original space and the number of samples, respectively. Without loss of generality, suppose the data has been centralized, i.e., Σ_{i=1}^n x_i = 0.

Firstly, we consider the following general PCA maximization problem

max_W Σ_{i=1}^n ‖W^T x_i‖  s.t. W^T W = I,   (6)

where W = [w_1, …, w_r] ∈ R^{d×r} is the projection matrix consisting of r projection vectors. When ‖·‖ is instantiated by the ℓ_2-, ℓ_1- and ℓ_p-norms, (6) is identical to ℓ_2-PCA, PCA-ℓ_1 and PCA-ℓ_p, respectively. However, it is difficult to solve equation (6) directly for some norms. To address this problem, it is simplified into a series of single-vector problems, and (6) becomes the following optimization problem

max_w Σ_{i=1}^n ‖w^T x_i‖  s.t. w^T w = 1.   (7)

When r > 1, a greedy method can be utilized to solve it.

In this paper, we employ the Tℓ_1-norm in equation (6) and construct the PCA based on the Tℓ_1-norm as follows:

max_W Σ_{i=1}^n ‖W^T x_i‖_{Tℓ_1}  s.t. W^T W = I.   (8)

When r > 1, it is also difficult to find an optimal solution of (8). We again simplify the problem into a series of optimization problems by using a greedy method; therefore, we first solve the following optimization problem

max_w Σ_{i=1}^n ‖w^T x_i‖_{Tℓ_1}  s.t. w^T w = 1,   (9)

which is equivalent to

max_w Σ_{i=1}^n (a+1)|w^T x_i| / (a + |w^T x_i|)  s.t. w^T w = 1.   (10)
IV Algorithm

Since problem (8) is non-convex and non-smooth, traditional convex optimization techniques cannot be used directly. Therefore, we first consider solving problem (10), which is a relatively simple special case of problem (8). Even so, (10) is still difficult to solve since it contains a quotient of absolute value functions. Although the alternating direction method of multipliers (ADMM) [29, 30] is a popular method for non-convex and non-smooth problems, it does not apply to our problem because of the constraint w^T w = 1. Motivated by the methods in [31] and [32], we design a modified gradient ascent method on a sphere to solve (10), and the method guarantees that the constraint is satisfied.

Here, we need to compute the gradient of the objective of (10) with respect to w as follows:

g(w) = Σ_{i=1}^n a(a+1) sign(w^T x_i) x_i / (a + |w^T x_i|)²,   (11)

and, when g(w) and w are collinear, a random positive vector ε is added to g(w) so that g(w) + ε and w are no longer collinear. Then we project g(w) onto the tangent plane of w on the unit sphere as p = g(w) − ⟨g(w), w⟩w and normalize it as p ← p/‖p‖, where ⟨·, ·⟩ denotes the inner product of vectors and the unit sphere is determined by the constraint w^T w = 1. For the k-th step, with w_k and the normalized tangent direction p_k, we have the following update rule

w_{k+1} = (w_k + η p_k) / √(1 + η²),

where η controls the step size.

Algorithm 1: Solving problem (10)

Input: the data matrix X and the parameter a of the Tℓ_1-norm.
Output: the projection vector w.
Initialization: choose an initial unit vector w_0; set k = 0 and give an initial step size η.
Repeat
 Compute the gradient g(w_k) of the objective at w_k by (11);
 If g(w_k) and w_k are collinear
  g(w_k) ← g(w_k) + ε, where ε is a perturbation satisfying that g(w_k) + ε and w_k are not collinear;
 End if;
 Project g(w_k) onto the tangent plane of w_k, i.e., p_k = g(w_k) − ⟨g(w_k), w_k⟩ w_k, then normalize p_k ← p_k / ‖p_k‖;
 Repeat
  Update w_{k+1} = (w_k + η p_k) / √(1 + η²); if the objective decreases, set η ← η/2;
 Until the objective does not decrease;
 Update η ← 2η; k ← k + 1;
Until convergence

The above update rule guarantees that w remains of unit length. However, when g(w) and w are collinear, it is not applicable. Inspired by the noisy gradient descent (NGD) algorithm [33], we add a perturbation ε to escape this degenerate case. In addition, to accelerate the convergence, η is chosen as an adaptive step size [31]. The details are described in Algorithm 1, for which we have the following proposition.
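The overall procedure can be sketched in Python as follows. The function names, tolerances, backtracking thresholds and the random initialization are our own choices for illustration; the paper's exact initialization and stopping rule are not reproduced here:

```python
import numpy as np

def tl1_objective(w, X, a):
    """Objective of problem (10): sum over samples of rho_a(|w^T x_i|)."""
    s = np.abs(X.T @ w)
    return np.sum((a + 1.0) * s / (a + s))

def tl1_pca_first_pc(X, a=1.0, eta=1.0, max_iter=500, tol=1e-8, seed=0):
    """Monotone gradient ascent on the unit sphere for the first
    Tl1-PCA direction. X is d x n and assumed centered."""
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    f = tl1_objective(w, X, a)
    for _ in range(max_iter):
        s = X.T @ w
        # gradient of the objective, cf. (11)
        g = X @ (a * (a + 1.0) * np.sign(s) / (a + np.abs(s)) ** 2)
        # perturb if g is (numerically) collinear with w
        if np.linalg.norm(g - (g @ w) * w) < 1e-12:
            g = g + 1e-6 * rng.normal(size=d)
        # project onto the tangent plane at w and normalize
        p = g - (g @ w) * w
        p /= np.linalg.norm(p)
        # halve the step until the objective does not decrease
        while True:
            w_new = w + eta * p
            w_new /= np.linalg.norm(w_new)
            f_new = tl1_objective(w_new, X, a)
            if f_new >= f or eta < 1e-12:
                break
            eta *= 0.5
        if f_new < f:          # no ascent step found: stop
            break
        if f_new - f < tol:    # converged
            w, f = w_new, f_new
            break
        w, f = w_new, f_new
        eta *= 2.0             # adaptive step for the next iteration
    return w, f
```

Because each accepted step never decreases the objective and the objective is bounded above by n(a+1), the iteration converges, mirroring Proposition 1 below.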

Proposition 1.

Algorithm 1 monotonically increases the objective of problem (10) in each iteration.


As we know, the gradient g(w) is the fastest ascent direction. When g(w) and w are collinear, we set

g(w) ← g(w) + ε;

it is clear that this is still an ascent direction after adding the perturbation ε, because ε is chosen so that ⟨g(w) + ε, g(w)⟩ > 0.

Then, by projecting g(w) onto the tangent plane of w, we obtain p and normalize it as p ← p/‖p‖. Since ⟨p, g(w)⟩ = ‖g(w)‖ cos θ > 0, where θ is the angle between p and g(w), the direction p is also an ascent direction. Then we have the update rule

w_{k+1} = (w_k + η p_k) / √(1 + η²),

where η is the step size, and we set η ← η/2 until the objective does not decrease. Since w_k and p_k are orthogonal, the update rule keeps w_{k+1} a unit vector. To accelerate the convergence, we set η ← 2η for the next iteration. ∎

As the objective of problem (10) has an upper bound (each summand is at most a+1, so the objective is at most n(a+1)), Proposition 1 indicates that Algorithm 1 is convergent.

Now we can obtain the first projection vector by calling Algorithm 1. To obtain more than one projection vector, we use a general orthogonalization method to compute the remaining vectors. Firstly, we give the details of our orthogonalization procedure in Algorithm 2. Then, using induction, Proposition 2 shows that the projection vectors solved by Algorithm 2 are strictly orthogonal. Its proof also describes the details of Algorithm 2.

Algorithm 2: Solving problem (8)

Input: the data matrix X, the parameter a of the Tℓ_1-norm, and the number of projection vectors r.
Output: the projection matrix W = [w_1, …, w_r].
Initialization: W_0 ← [ ], B_0 ← I.
For k = 1, …, r do
 Solve problem (12) by Algorithm 1 and get its solution v_k; compute the k-th projection vector w_k = B_{k−1} v_k;
 Update W_k ← [W_{k−1}, w_k];
 Compute B_k by solving the linear equations W_k^T B = 0 and following the Gram–Schmidt procedure;
End for
Proposition 2.

The projection vectors obtained by Algorithm 2 are orthonormal.


According to the inductive assumption, we first assume that the vectors w_1, …, w_k are orthonormal in a k-dimensional subspace. Thus W_k = [w_1, …, w_k] is a column-orthonormal matrix, and we denote V_k = Span(w_1, …, w_k). Then its orthogonal complement V_k^⊥ is a (d−k)-dimensional subspace. Recall that the primary goal is to search for a vector satisfying problem (10). Once the subspace V_k has been obtained, we need to solve the next direction through the following optimization problem, which can be solved by Algorithm 1:

max_v Σ_{i=1}^n ‖v^T (B_k^T x_i)‖_{Tℓ_1}  s.t. v^T v = 1,   (12)

where the columns of B_k form an orthonormal basis of V_k^⊥, the null space of W_k^T, and dim(V_k^⊥) = d − k. Then update W_{k+1} = [W_k, w_{k+1}], where w_{k+1} = B_k v. It is obvious that w_{k+1} is orthogonal to w_1, …, w_k. Therefore, W_{k+1} is also a column-orthonormal matrix.

In nature, the data is projected onto the subspace V_k^⊥ to implement Algorithm 1. At last, to perform the next iteration, we need to find a basis of V_{k+1}^⊥. To obtain B_{k+1}, we need only solve the linear equation W_{k+1}^T B = 0 and make the resulting basis orthonormal by following the Gram–Schmidt orthogonalization. ∎
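The deflation step of Algorithm 2 can be sketched as follows. Here `complement_basis` plays the role of the Gram–Schmidt step (implemented via SVD for brevity, which yields the same orthonormal complement), and `solve_first_pc` is a placeholder for Algorithm 1 applied to the projected data; both names are our own:

```python
import numpy as np

def complement_basis(W):
    """Orthonormal basis of the orthogonal complement of span(W):
    solves W^T B = 0, here via SVD instead of explicit Gram-Schmidt."""
    d, k = W.shape
    U = np.linalg.svd(W, full_matrices=True)[0]
    return U[:, k:]          # last d-k left singular vectors

def greedy_components(X, r, solve_first_pc):
    """Greedy orthogonalization: solve one direction at a time inside
    the complement of the directions found so far, then lift it back."""
    d = X.shape[0]
    W = np.zeros((d, 0))
    for _ in range(r):
        B = complement_basis(W) if W.shape[1] else np.eye(d)
        v = solve_first_pc(B.T @ X)   # reduced problem, cf. (12)
        u = B @ v                     # lift back to the original space
        W = np.hstack([W, u[:, None]])
    return W
```

Since each lifted direction u = B v lies in the complement of the previous ones and B has orthonormal columns, the returned W is column-orthonormal for any unit-norm single-direction solver.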

V Experiments

In this section, we evaluate the performance of Tℓ_1PCA on an artificial dataset and two human face databases, Yale [34] and Jaffe [35]. To demonstrate the robustness of our method, we add outliers to the artificial dataset and random block noise to the face databases. For comparison, classical PCA [2], PCA-ℓ_1 [7], PCA-ℓ_p [17], and ℓ_pSPCA [20] have also been utilized. We use the nearest neighbor classifier (1-NN) for classification, which assigns a test sample to the class of its nearest neighbor among the training samples. The implementation environment is MATLAB R2017a.
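The 1-NN rule used here is simple enough to state as code (a generic Python sketch, not the authors' MATLAB implementation):

```python
import numpy as np

def nn_classify(train_X, train_y, test_X):
    """1-NN: each test sample gets the label of its nearest
    training sample under the Euclidean distance."""
    # pairwise squared distances, shape (n_test, n_train)
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[np.argmin(d2, axis=1)]
```

In the experiments below, the rows of `train_X` and `test_X` would be the PCA-projected feature vectors, so the classifier quality directly reflects how well each projection preserves class structure.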

V-A A Toy Example

Firstly, we evaluate the robustness of Tℓ_1PCA on a two-dimensional artificial dataset containing 30 data points and 4 outliers. The 30 data points are generated by picking the x-coordinates from −3 to 3 with equal spacing and drawing the y-coordinates from a Gaussian distribution whose values sum to zero; they are depicted by navy blue markers. The 4 outliers have coordinates [−4, 4.8], [−3.7, 5.1], [−3.3, 6] and [−2.4, 5.5], and are depicted by red markers. The dataset is shown in Fig. 3.

Obviously, when the outliers are discarded, the included angle between the ideal projection direction of the dataset and the x-axis is essentially zero; the ideal projection direction is depicted by the black solid line. The first principal components of Tℓ_1PCA, PCA, PCA-ℓ_1, PCA-ℓ_p and ℓ_pSPCA are obtained by applying them to the artificial dataset with outliers under different parameters, where the parameter a in Tℓ_1PCA and the parameter p in PCA-ℓ_p and ℓ_pSPCA are chosen from candidate sets. These principal components and their included angles with the ideal projection direction are also plotted in Fig. 3.

From Fig. 3, we see that the principal components learned by PCA, PCA-ℓ_1, PCA-ℓ_p and ℓ_pSPCA deviate severely from the ideal projection direction, with ℓ_pSPCA showing the largest included angle for some parameter choices. However, the principal components learned by Tℓ_1PCA deviate only slightly from the ideal projection direction, especially when the parameter a is small. Its principal components are much closer to the ideal projection direction, which indicates that Tℓ_1PCA is more robust to outliers than the other PCAs.
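The pull that a few outliers exert on a dispersion-maximizing direction is easy to reproduce. The sketch below rebuilds the toy data (only the outlier coordinates come from the text; the Gaussian noise scale and seed are our own assumptions) and measures how far classical ℓ_2 PCA is deflected from the x-axis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = rng.normal(0.0, 0.3, size=30)            # assumed noise scale
inliers = np.stack([x, y])                   # 2 x 30
outliers = np.array([[-4.0, -3.7, -3.3, -2.4],
                     [ 4.8,  5.1,  6.0,  5.5]])  # coordinates from the text
X = np.hstack([inliers, outliers])
X = X - X.mean(axis=1, keepdims=True)

def first_pc(Z):
    # leading eigenvector of the scatter matrix = classical first PC
    return np.linalg.eigh(Z @ Z.T)[1][:, -1]

def angle_to_x_axis(v):
    return np.degrees(np.arctan2(abs(v[1]), abs(v[0])))

# Without outliers the first PC is essentially the x-axis;
# with them, classical PCA is pulled markedly toward the outliers.
clean_angle = angle_to_x_axis(first_pc(inliers - inliers.mean(axis=1, keepdims=True)))
noisy_angle = angle_to_x_axis(first_pc(X))
assert clean_angle < 10.0
assert noisy_angle > clean_angle + 10.0
```

Replacing `first_pc` with a Tℓ_1-based solver is exactly the comparison Fig. 3 makes: the bounded component operator ρ_a caps each outlier's influence, so the learned direction stays near the x-axis.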

(a) PCA and PCA-ℓ_1
(b) PCA-ℓ_p
(c) ℓ_pSPCA
(d) Tℓ_1PCA
Fig. 3: Experimental results for PCA, PCA-ℓ_1, PCA-ℓ_p, ℓ_pSPCA and Tℓ_1PCA on an artificial dataset.

V-B Real-world Datasets

The performance of Tℓ_1PCA, PCA-ℓ_p and ℓ_pSPCA depends on the parameter a or p. For each of the three methods, we search for the optimal parameter from a ∈ {100, 50, 10, 1, 0.5, 0.1, 0.05, 0.01, 0.001} or p ∈ {1, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01, 0.001} on all real-world datasets.

V-B1 Yale

The Yale face database contains 165 grayscale images of 15 individuals under different lighting conditions and facial expressions; the facial expressions include happy, normal, sad, sleepy, surprised and wink. Each individual has 11 images, and each image in the Yale database is cropped to the same pixel size. Nine images of each person are randomly selected for training, and block noise of two different sizes is added to them. Original and noisy sample images of one individual are shown in the corresponding figure. Then we employ PCA, PCA-ℓ_1, PCA-ℓ_p, ℓ_pSPCA and Tℓ_1PCA to extract features, respectively. For each given parameter a or p, we compute the average classification accuracies of 15 random splits on the original data and the noisy data.

Fig. 4: The accuracies on the Yale database under the optimal parameter. (a) The accuracy of each method on the original data. (b) The accuracy of each method on data with the smaller block noise. (c) The accuracy of each method on data with the larger block noise.

Original data: 65.77  64.22  63.77  63.33  62.22
With smaller block noise: 59.33  56.00  57.78  56.44  57.33
With larger block noise: 55.33  52.66  52.00  51.11  50.66
TABLE I: The average classification accuracies (%) of each method on the Yale database under the optimal dimension.

For each method, Fig. 4 plots the average classification accuracy vs. the dimension of the reduced space under the optimal parameter. Table I lists the classification accuracy of each method under the optimal dimension. The above results show that Tℓ_1PCA outperforms the other methods in all conditions, and its accuracy is around 2.5% higher than that of the other PCAs. From Fig. 4, the accuracy of Tℓ_1PCA has an upward tendency as the dimension grows; compared with the data without noise, its performance advantage is strengthened on noisy data. The reason is that we use the Tℓ_1-norm, which has a stronger suppression effect on noise. When the number of dimensions reaches around 30, the accuracy tends to be stable.

V-B2 Jaffe

The Jaffe database contains 213 images of 7 facial expressions posed by 10 Japanese female individuals; each image is resized to the same pixel size. We randomly choose 70% of each individual’s images for training, adding the same noise as for the Yale database, and use the remainder for testing. Some samples of the Jaffe database are shown in the corresponding figure. Then PCA, PCA-ℓ_1, PCA-ℓ_p, ℓ_pSPCA and Tℓ_1PCA are applied to extract features. For each given parameter a or p, the average classification accuracies on the original data and the two kinds of noisy data over 15 random splits are considered.

Fig. 5: The accuracies on the Jaffe database under the optimal parameter. (a) The accuracy of each method on the original data. (b) The accuracy of each method on data with the smaller block noise. (c) The accuracy of each method on data with the larger block noise.

Original data: 99.37  98.75  99.06  98.95  99.06
With smaller block noise: 98.12  96.97  97.81  97.08  96.97
With larger block noise: 94.47  93.43  94.16  92.81  94.06
TABLE II: The average classification accuracies (%) of each method on the Jaffe database under the optimal dimension.

Fig. 5 plots the average classification accuracy vs. the dimension of the reduced space for each method under the optimal parameter on the Jaffe database. The classification accuracy of each method under the optimal parameter is listed in Table II. It can be seen that Tℓ_1PCA is superior to the other methods, and the trend of accuracy and the behaviour of the parameter on this database are similar to those on the Yale database. When the number of dimensions reaches about 20, the accuracy tends to be stable.

V-C Convergence Experiments

We finally investigate the performance of Tℓ_1PCA in terms of convergence. Fig. 6 plots the convergence curves on the artificial data with and without outliers and on the above two databases with and without noise. The results illustrate that Tℓ_1PCA converges quickly, generally within about 10 steps, which is consistent with the conclusion of Proposition 1.

(a) Artificial dataset
(b) Yale
(c) Jaffe
Fig. 6: Variation of the objective function value with the number of iterations for Tℓ_1PCA on the artificial dataset, the Yale database and the Jaffe database.

VI Conclusion

In this paper, we have introduced a new Tℓ_1-norm and shown its properties, which indicate that the Tℓ_1-norm is more robust than the ℓ_p-norm (0 < p < 1). Then we proposed a novel dimensionality reduction method called Tℓ_1PCA. It employs the Tℓ_1-norm as the distance metric to maximize the dispersion of the projected data. Tℓ_1PCA was more robust to noise and outliers than ℓ_p-norm-based PCA methods, with higher classification accuracy, and the convergence experiments showed that Tℓ_1PCA converges quickly. The Tℓ_1-norm could be applied not only to unsupervised PCA but also to supervised dimensionality reduction methods, and even to other methods in machine learning. These will be our future work.


  • [1] H. Hotelling, “Analysis of a complex of statistical variables into principal components”, Journal of Educational Psychology, vol. 24, no. 6, pp. 417-441, 1933.
  • [2] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
  • [3] A. Baccini, P. Besse and A. D. Falguerolles, “An ℓ_1-norm PCA and a heuristic approach”, in Ordinal and Symbolic Data Analysis, E. Diday, Y. Lechevalier and O. Opitz, Eds., Springer, pp. 359-368, 1996.
  • [4] Q. Ke and T. Kanade, “Robust subspace computation using the ℓ_1 norm”, Technical Report CMU-CS-03-172, Carnegie Mellon University, Aug. 2003.
  • [5] C. Ding, D. Zhou, X.-F. He and H.-Y. Zha, “R_1-PCA: rotational invariant ℓ_1-norm principal component analysis for robust subspace factorization”, in Proceedings of the 23rd International Conference on Machine Learning, pp. 281-288, 2006.
  • [6] N. Kwak, “Principal component analysis based on ℓ_1-norm maximization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1672-1680, Sep. 2008.
  • [7] F. Nie, H. Huang, C. Ding, D. Luo and H. Wang, “Robust principal component analysis with non-greedy ℓ_1-norm maximization”, in Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1433-1438, 2011.
  • [8] J. P. Brooks, J. H. Dula and E. L. Boone, “A pure ℓ_1-norm principal component analysis”, Computational Statistics & Data Analysis, vol. 61, pp. 83-98, May 2013.
  • [9] D. Y. Meng, Q. Zhao and Z. B. Xu, “Improve robustness of sparse PCA by ℓ_1-norm maximization”, Pattern Recognition, vol. 45, no. 1, pp. 487-497, Jan. 2012.
  • [10] Z. H. Lai, Y. Xu, Q. C. Chen, J. Yang and D. Zhang, “Multilinear sparse principal component analysis”, IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1942-1950, Oct. 2014.
  • [11] H. X. Wang and J. Wang, “2DPCA with ℓ_1-norm for simultaneously robust and sparse modeling”, Neural Networks, vol. 46, pp. 190-198, Oct. 2013.
  • [12] G. F. Lu, J. Zou, Y. Wang and Z. Q. Wang, “ℓ_1-norm-based principal component analysis with adaptive regularization”, Pattern Recognition, vol. 60, pp. 901-907, Dec. 2016.
  • [13] C. Kim and D. Klabjan, “A simple and fast algorithm for ℓ_1-norm kernel PCA”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Mar. 2019.
  • [14] J. C. Fan and T. W. S. Chow, “Exactly robust kernel principal component analysis”, IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 749-761, Mar. 2020.
  • [15] S. Liwicki, G. Tzimiropoulos, S. Zafeiriou and M. Pantic, “Euler principal component analysis”, International Journal of Computer Vision, vol. 101, no. 3, pp. 498-518, Feb. 2013.
  • [16] Z. Z. Liang, S. X. Xia, Y. Zhou, L. Zhang and Y. F. Li, “Feature extraction based on ℓ_p-norm generalized principal component analysis”, Pattern Recognition Letters, vol. 34, no. 9, pp. 1037-1045, Jul. 2013.
  • [17] N. Kwak, “Principal component analysis by ℓ_p-norm maximization”, IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 594-609, May 2014.
  • [18] J. Wang, “Generalized 2-D principal component analysis by ℓ_p-norm for image analysis”, IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 792-803, Mar. 2016.
  • [19] K. G. Quach, C. N. Duong, K. Luu and T. D. Bui, “Non-convex online robust PCA: Enhance sparsity via ℓ_p-norm minimization”, Computer Vision and Image Understanding, vol. 158, pp. 126-140, May 2017.
  • [20] C. N. Li, W. J. Chen and Y. H. Shao, “Robust sparse ℓ_p-norm principal component analysis”, Acta Automatica Sinica, vol. 43, no. 1, pp. 142-151, Jan. 2017.
  • [21] T. W. Weng, H. Zhang, P. Y. Chen, J. Yi, D. Su, Y. Gao, C. J. Hsieh and L. Daniel, “Evaluating the robustness of neural networks: An extreme value theory approach”, ICLR 2018 Conference, 2018.
  • [22] Z. Cranko, S. Kornblith, Z. Shi and R. Nock, “Lipschitz networks and distributional robustness”, 2018.
  • [23] M. Nikolova, “Local strong homogeneity of a regularized estimator”, SIAM Journal on Applied Mathematics, vol. 61, no. 2, pp. 633-658, Aug. 2000.
  • [24] J. C. Lv and Y. Y. Fan, “A unified approach to model selection and sparse recovery using regularized least squares”, Annals of Statistics, vol. 37, no. 6A, pp. 3498-3528, Dec. 2009.
  • [25] R. R. Ma, J. Y. Miao, L. F. Niu and P. Zhang, “Transformed ℓ_1 regularization for learning sparse deep neural networks”, Neural Networks, vol. 119, pp. 286-298, Nov. 2019.
  • [26] S. Zhang and J. Xin, “Minimization of transformed ℓ_1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing”, Mathematical Programming, vol. 169, no. 1, pp. 307-336, May 2018.
  • [27] S. Zhang and J. Xin, “Minimization of transformed ℓ_1 penalty: closed form representation and iterative thresholding algorithms”, Communications in Mathematical Sciences, vol. 15, no. 2, pp. 511-537, 2017.
  • [28] S. Zhang, P. H. Yin and J. Xin, “Transformed Schatten-1 iterative thresholding algorithms for low rank matrix completion”, Communications in Mathematical Sciences, vol. 15, no. 3, pp. 839-862, 2017.
  • [29] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers”, Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
  • [30] C. N. Li, Y. H. Shao, W. T. Yin and M. Z. Liu, “Robust and sparse linear discriminant analysis via an alternating direction method of multipliers”, IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 915-926, Mar. 2020.
  • [31] M. Yu, L. Shao, X. Zhen and X. He, “Local feature discriminant projection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1908-1914, Sep. 2015.
  • [32] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
  • [33] P. Jain and P. Kar, “Non-convex optimization for machine learning”, Foundations and Trends in Machine Learning, vol. 10, no. 3-4, pp. 142-363, Dec. 2017.
  • [34] P. N. Belhumeur, J. P. Hespanha and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: recognition using class specific linear projection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, Jul. 1997.
  • [35] M. Lyons, S. Akamatsu, M. Kamachi and J. Gyoba, “Coding facial expressions with Gabor wavelets”, in Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Apr. 1998.