I Introduction
Principal component analysis (PCA) [1, 2], a popular tool for data processing and pattern recognition, has been widely investigated during the last decades. It is often utilized for dimensionality reduction. PCA tries to find a set of projection vectors, consisting of linear combinations of the given data, that either maximizes the dispersion of the projected data or minimizes the reconstruction error. These projection vectors construct a low-dimensional subspace that can capture the intrinsic structure of the original data.
However, classical PCA has a fatal drawback: it is sensitive to outliers because it uses the squared ℓ2-norm as its metric. To overcome this problem, the ℓ2-norm is substituted by the ℓ1-norm. Baccini et al. [3] proposed a PCA based on the ℓ1-norm (ℓ1-PCA) by minimizing the reconstruction error, and correspondingly, a heuristic algorithm based on maximum likelihood estimation was presented. Subsequently, the weighted median algorithm and the quadratic programming algorithm were proposed in [4], where the robustness of the ℓ1-norm was also addressed. Noticing that the above ℓ1-norm PCA methods are not rotationally invariant, Ding et al. [5] proposed a rotationally invariant ℓ1-norm PCA (R1-PCA) which combines the merits of ℓ1- and ℓ2-norm PCA. However, Kwak [6] pointed out that R1-PCA depends highly on the dimension of the subspace to be found, and in [6] he also proposed a PCA based on the ℓ1-norm (PCA-L1) by maximizing the dispersion. PCA-L1 is a greedy method with easy implementation. Nie et al. then proposed its non-greedy version with better experimental results in [7]. Unlike the aforementioned methods, where only a local solution can be obtained, another ℓ1-norm-based PCA proposed in [8] can find a global solution. In addition, other ℓ1-norm-based PCA methods are concerned with sparseness, regularization, the kernel trick and the two-dimensional (2D) problem [9, 10, 11, 12, 13, 14, 15]. To further improve robustness, some researchers turned to the ℓp-norm. Liang et al. [16] proposed a generalized PCA based on the ℓp-norm (ℓp-norm GPCA), where the ℓp-norm is employed in the constraint rather than in the objective function. In [17], Kwak extended PCA-L1 to PCA-Lp for an arbitrary p > 0 and proposed both greedy and non-greedy algorithms; the proposed algorithms are convergent for suitable values of p. Other ℓp-norm-based PCA methods are concerned with the low-rank technique, sparseness and the 2D problem [18, 19, 20]. It is naturally believed that the ℓp-norm is more robust than the ℓ2-norm when p < 2, but for p < 1 it does not satisfy Lipschitz continuity, which is important for robustness [21, 22]. Moreover, most ℓp-norm PCA methods have been shown to be non-monotonic when p < 1. All of this restricts the applications of ℓp-norm PCAs.
In this paper, to give a more robust PCA with a Lipschitz-continuous measurement, the Tℓ1-norm is studied. Indeed, the Tℓ1-norm is similar to the ℓp-norm (0 < p < 1) in some sense, but it has a stronger suppression effect on outliers and better continuity. Using this norm, we propose a PCA based on the Tℓ1-norm (Tℓ1-PCA) by maximizing the Tℓ1-norm-based dispersion in the projection space. Correspondingly, to solve the optimization problem, a modified gradient ascent method on the unit sphere is constructed. The results of preliminary experiments show that Tℓ1-PCA is more robust than some current PCAs based on the ℓ1-norm and the ℓp-norm.
The rest of this paper is organized as follows. In Section II, we introduce and analyze the Tℓ1-norm. In Section III, the optimization problem of our PCA based on the Tℓ1-norm is formulated. To solve the optimization problem, an ascent method is constructed and investigated in Section IV. In Section V, Tℓ1-PCA is applied to several artificial and real datasets and its performance is compared with some other current PCA methods. Finally, the conclusion follows in Section VI.
II Tℓ1-norm
In this section, based on the transformed ℓ1 (Tℓ1) penalty function [23, 24, 25, 26, 27, 28], the Tℓ1-norm is introduced: for a vector x = (x_1, …, x_n)^T, we define the Tℓ1-norm as
‖x‖_Tℓ1 = Σ_{i=1}^{n} ρ_a(x_i),   (1)
where ρ_a(·) is the operator applied to each component,
ρ_a(t) = ((a + 1)|t|) / (a + |t|),   (2)
and a > 0 is a positive shape parameter. Generally speaking, a norm ‖·‖ should satisfy the following three properties: i) positive definiteness: ‖x‖ ≥ 0 for all x, and ‖x‖ = 0 iff x = 0; ii) triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y; iii) absolute homogeneity: ‖αx‖ = |α| ‖x‖ for all x and scalars α. Obviously, the Tℓ1-norm satisfies the first two properties but not the third one. So, strictly speaking, the Tℓ1-norm is not a norm. But in this paper, we still call it the Tℓ1-norm for convenience.
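A minimal numeric sketch of these definitions may be helpful. It assumes the component operator ρ_a(t) = (a + 1)|t| / (a + |t|) of the transformed ℓ1 penalty [26, 27]; the test vectors are illustrative. The checks mirror the three norm properties discussed above:

```python
# Sketch of the Tl1-norm, assuming rho_a(t) = (a + 1)|t| / (a + |t|).
import numpy as np

def rho(t, a):
    """Component operator of the Tl1-norm with shape parameter a > 0."""
    t = np.abs(t)
    return (a + 1) * t / (a + t)

def tl1_norm(x, a):
    """Tl1-'norm' of a vector: the sum of rho_a over its components."""
    return rho(np.asarray(x, dtype=float), a).sum()

x, y = np.array([1.0, -2.0, 0.0]), np.array([0.5, 0.5, 1.0])
a = 1.0
# i) positive definiteness holds:
assert tl1_norm(x, a) > 0 and tl1_norm(np.zeros(3), a) == 0
# ii) the triangle inequality holds (rho_a is concave, increasing, rho_a(0)=0):
assert tl1_norm(x + y, a) <= tl1_norm(x, a) + tl1_norm(y, a)
# iii) absolute homogeneity FAILS, so Tl1 is not a true norm:
assert not np.isclose(tl1_norm(2 * x, a), 2 * tl1_norm(x, a))
```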
Further, we discuss the properties of the Tℓ1-norm and compare them with those of the ℓp-norm (0 < p ≤ 1). The ℓp-norm of a vector x = (x_1, …, x_n)^T is denoted as
‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p},   (3)

where the operator applied to each component is

|t|^p.   (4)
It is known (see, e.g., [25]) that the Tℓ1-norm is related to the ℓp-norm in the following way: for any vector x, with the change of the parameter a, the Tℓ1-norm interpolates between the ℓ0-norm and the ℓ1-norm as

lim_{a→0+} ‖x‖_Tℓ1 = ‖x‖_0,   lim_{a→+∞} ‖x‖_Tℓ1 = ‖x‖_1.   (5)
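The interpolation between the ℓ0- and ℓ1-norms can be checked numerically, again assuming the component operator ρ_a(t) = (a + 1)|t| / (a + |t|) from the transformed ℓ1 literature; the test vector and tolerances below are illustrative:

```python
# Numeric check of the interpolation property: as a -> 0+ the Tl1-norm
# approaches the l0 "norm" (number of nonzeros), and as a -> infinity
# it approaches the l1 norm.
import numpy as np

def tl1_norm(x, a):
    t = np.abs(np.asarray(x, dtype=float))
    return ((a + 1) * t / (a + t)).sum()   # a > 0, so no division by zero

x = np.array([3.0, -0.5, 0.0, 1.2])
l0, l1 = np.count_nonzero(x), np.abs(x).sum()   # l0 = 3, l1 = 4.7

print(tl1_norm(x, 1e-6))   # close to the l0 value 3
print(tl1_norm(x, 1e6))    # close to the l1 value 4.7
```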
To show the similarity between the Tℓ1-norm and the ℓp-norm, their contours for several parameter values are plotted in [25]. From that set of figures, it is concluded that the Tℓ1-norm with small, moderate and large a indeed approximates the ℓ0-norm, a fractional ℓp-norm and the ℓ1-norm, respectively. This seems to imply that a one-to-one relationship exists between a and p, making the Tℓ1-norm approximate the corresponding ℓp-norm. However, investigating the definitions of the Tℓ1-norm and the ℓp-norm carefully, we do find a severe difference. In fact, we need only compare their component operators, as shown in the following property.
Property 1.
For any fixed a > 0 and p ∈ (0, 1), comparing ρ_a(t) with |t|^p, there exist the following conclusions: the function ρ_a(t) is Lipschitz continuous with Lipschitz constant (a + 1)/a, and when |t| increases from 0 to infinity, the function value increases from 0 to the finite value a + 1. However, |t|^p is not Lipschitz continuous, and when |t| increases from 0 to infinity, the function value increases from 0 to infinity.
Note that the above property points out the difference between the Tℓ1-norm with any a > 0 and the ℓp-norm with any p ∈ (0, 1]. For a representative pair of parameters, both component operators ρ_a(t) and |t|^p are shown in Fig. 1, where Fig. 1 (a) and Fig. 1 (b) indicate their difference when |t| is small and when |t| is large, respectively. So, there is a marked difference between the Tℓ1-norm and the ℓp-norm whether |t| is small or large.
The above discussion implies the advantages of applying the Tℓ1-norm to robust problems. In fact, recall the development course of the norms used in PCA: from the ℓ2-norm to the ℓ1-norm and then to the ℓp-norm, with corresponding component operators t², |t| and |t|^p. Fig. 2 shows these three operators. Obviously, for large |t|, the growth slows down progressively from t² to |t| and then to |t|^p. Fig. 2 also shows the component operator ρ_a(t) of the Tℓ1-norm; with fixed a, this function is bounded, and from |t|^p to ρ_a(t) the growth slows further. This means that the Tℓ1-norm has a better suppression effect on outliers. In addition, as discussed above, ρ_a(t) has better continuity than |t|^p, especially Lipschitz continuity, which is good for robustness. Therefore, PCA based on the Tℓ1-norm can be expected to be more robust than PCA based on the ℓp-norm.
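The suppression effect can be seen directly from the component operators. The following sketch (with illustrative choices a = 1 and p = 0.5, and the assumed form of ρ_a) shows that t² and |t|^p grow without bound while ρ_a(t) stays below a + 1, and checks the Lipschitz bound numerically:

```python
import numpy as np

def rho(t, a):
    """Tl1 component operator (assumed form): (a + 1)|t| / (a + |t|)."""
    return (a + 1) * np.abs(t) / (a + np.abs(t))

a, p = 1.0, 0.5                      # illustrative parameter choices
t = np.array([1.0, 10.0, 1e3, 1e6])

print(t ** 2)          # l2 operator: grows quadratically
print(np.abs(t) ** p)  # lp operator: unbounded, but grows more slowly
print(rho(t, a))       # Tl1 operator: bounded above by a + 1 = 2

# numeric check of the Lipschitz bound (a + 1)/a on the slope of rho_a
s = np.linspace(0.0, 100.0, 100001)
slopes = np.diff(rho(s, a)) / np.diff(s)
print(slopes.max())    # stays below (a + 1)/a = 2
```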
III Problem Formulation
Let X = [x_1, x_2, …, x_n] ∈ R^{d×n} be a given data matrix, where d and n denote the dimension of the original space and the number of samples, respectively. Without loss of generality, suppose the data have been centralized, i.e., Σ_{i=1}^{n} x_i = 0.
Firstly, we consider the following general PCA maximization problem
max_{W^T W = I} Σ_{i=1}^{n} ‖W^T x_i‖,   (6)
where W = [w_1, w_2, …, w_m] ∈ R^{d×m} is the projection matrix consisting of m projection vectors. When ‖·‖ is substituted by the ℓ2-, ℓ1- and ℓp-norms, (6) is identical to PCA, PCA-L1 and PCA-Lp, respectively. However, it is difficult to solve (6) directly for some norms. To address this problem, (6) is simplified into a series of single-vector problems and becomes the following optimization problem
max_{w^T w = 1} Σ_{i=1}^{n} ‖w^T x_i‖.   (7)
When m > 1, a greedy method can be utilized to solve (6) through (7).
In this paper, we employ the Tℓ1-norm in (6) and construct the PCA based on the Tℓ1-norm as follows
max_{W^T W = I} Σ_{i=1}^{n} ‖W^T x_i‖_Tℓ1.   (8)
When m > 1, it is also difficult to find an optimal solution of (8). We again simplify the problem into a series of single-vector optimization problems by using a greedy method; therefore, we will first solve the following optimization problem
max_{w^T w = 1} Σ_{i=1}^{n} ρ_a(w^T x_i) = Σ_{i=1}^{n} ((a + 1)|w^T x_i|) / (a + |w^T x_i|),   (9)
which is equivalent to
max_{w^T w = 1} f(w) = Σ_{i=1}^{n} |w^T x_i| / (a + |w^T x_i|),   (10)

since the positive constant factor (a + 1) can be dropped without changing the maximizer.
IV Algorithm
Since problem (8) is non-convex and non-smooth, traditional convex optimization techniques cannot be used directly. Therefore, we first consider solving problem (10), which is a relatively simple special case of problem (8). Even so, (10) is still difficult to solve since it contains quotients of absolute value functions. Although the alternating direction method of multipliers (ADMM) [29, 30] is a popular method for non-convex and non-smooth problems, it does not apply to our problem because of the constraint w^T w = 1. Motivated by the methods in [31] and [32], we design a modified gradient ascent method on the unit sphere to solve (10); the method guarantees that the constraint holds throughout.
Here, we need to compute the gradient g(w) of the objective in (10) with respect to w as follows
g(w) = Σ_{i=1}^{n} (a · sign(w^T x_i)) / (a + |w^T x_i|)^2 · x_i,   (11)
where sign(·) is the sign function, and a small random perturbation is added to w^T x_i when w^T x_i = 0 so that sign(w^T x_i) ≠ 0. Then we project g onto the tangent plane of w on the unit sphere as g_t = g − ⟨g, w⟩w and normalize it as d = g_t / ‖g_t‖_2, where ⟨·, ·⟩ denotes the inner product of vectors and the unit sphere is determined by the constraint w^T w = 1. For the k-th step, with w_k and d_k computed in this way, we have the following update rule

w_{k+1} = (w_k + η d_k) / ‖w_k + η d_k‖_2,

where η > 0 controls the step size.
Algorithm 1: Modified gradient ascent on the unit sphere for problem (10)

Input: the data matrix X and the shape parameter a of the Tℓ1-norm.
Output: the projection vector w.
Initialization: find w_0 = x_j / ‖x_j‖_2, where j = arg max_i ‖x_i‖_2; set k = 0 and give an initial step size η > 0 randomly.
Repeat:
  Compute the gradient g_k of the objective f at w_k by (11);
  If g_k and w_k are collinear, set g_k ← g_k + ε, where ε is a small random perturbation satisfying that g_k + ε remains an ascent direction;
  Project g_k onto the tangent plane of w_k, i.e., g_t = g_k − ⟨g_k, w_k⟩w_k, then normalize it as d_k = g_t / ‖g_t‖_2;
  Repeat: update w_{k+1} = (w_k + η d_k) / ‖w_k + η d_k‖_2 and set η ← η/2; Until f(w_{k+1}) > f(w_k);
  Update η ← 2η, k ← k + 1;
Until convergence
The above update rule guarantees that w remains of unit length. However, when g and w are collinear, it is not applicable. Inspired by the noisy gradient descent (NGD) algorithm [33], we add a perturbation to g to escape this situation. In addition, to accelerate convergence, η is chosen as an adaptive step size [31]. The details are described in Algorithm 1, for which we have the following proposition.
Proposition 1.
Algorithm 1 monotonically increases the objective of problem (10) in each iteration.
Proof.
As we know, the gradient g is the fastest ascent direction. When g and w are collinear, we set

g ← g + ε;

it is clear that g is still an ascent direction after adding the perturbation ε, because ε satisfies ⟨g, g + ε⟩ > 0.
Then, by projecting g onto the tangent plane of w, we obtain g_t = g − ⟨g, w⟩w and normalize it as d = g_t / ‖g_t‖_2. Since ⟨g, d⟩ = ‖g‖_2 cos θ > 0, where θ is the angle between g and d, the direction d is also an ascent direction. Then we have the update rule

w_{k+1} = (w_k + η d_k) / ‖w_k + η d_k‖_2,

where η is the step size, and we set η ← η/2 until f(w_{k+1}) > f(w_k). Since w_k and d_k are orthogonal, the update rule keeps w_{k+1} a unit vector. To accelerate the convergence, we set η ← 2η for the next iteration. ∎
As the objective of problem (10) has an upper bound (each of its n terms is less than 1), Proposition 1 indicates that Algorithm 1 is convergent.
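Under the reconstruction used above, Algorithm 1 can be sketched in a few lines. The objective f, the gradient, the initialization heuristic and the tolerances below are assumptions of this sketch rather than the paper's exact choices:

```python
# Sketch of Algorithm 1, assuming objective (10) is
# f(w) = sum_i |w.x_i| / (a + |w.x_i|) and its gradient (11) is
# g(w) = sum_i a * sign(w.x_i) * x_i / (a + |w.x_i|)^2.
import numpy as np

def tl1_pca_first(X, a=1.0, eta=1.0, tol=1e-8, max_iter=500, seed=0):
    """First Tl1-PCA projection vector of a centered d x n data matrix X."""
    rng = np.random.default_rng(seed)
    d, n = X.shape

    def f(w):
        s = np.abs(w @ X)
        return (s / (a + s)).sum()

    def grad(w):
        s = w @ X
        return (a * np.sign(s) / (a + np.abs(s)) ** 2 * X).sum(axis=1)

    # initialize with the sample of largest norm (a heuristic of this sketch)
    w = X[:, np.argmax((X ** 2).sum(axis=0))].astype(float)
    w /= np.linalg.norm(w)

    for _ in range(max_iter):
        g = grad(w)
        gt = g - (g @ w) * w                    # project onto tangent plane at w
        if np.linalg.norm(gt) < 1e-12:          # g and w (near-)collinear:
            gt = rng.standard_normal(d)         # random perturbation to escape
            gt -= (gt @ w) * w
        dvec = gt / np.linalg.norm(gt)

        f0 = f(w)
        w_new = w + eta * dvec
        w_new /= np.linalg.norm(w_new)
        while f(w_new) <= f0 and eta > 1e-12:   # halve the step until ascent
            eta /= 2
            w_new = w + eta * dvec
            w_new /= np.linalg.norm(w_new)
        if f(w_new) - f0 < tol:                 # no sufficient increase: stop
            return w_new if f(w_new) > f0 else w
        w = w_new
        eta *= 2                                # double the step for next round
    return w
```

On data spread along one direction, the returned unit vector should align with that direction, mirroring the behavior the toy example in Section V-A reports for Tℓ1-PCA.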
Now we can obtain the first projection vector by calling Algorithm 1. To obtain more than one projection vector, we use a general orthogonalization method to compute the remaining vectors. Firstly, we give the details of our orthogonalization procedure in Algorithm 2. Then, using induction, Proposition 2 shows that the projection vectors produced by Algorithm 2 are strictly orthogonal; its proof also explains the details of Algorithm 2.
Algorithm 2: Tℓ1-PCA

Input: the data matrix X, the shape parameter a of the Tℓ1-norm, and the number of projection vectors m.
Output: the projection matrix W = [w_1, w_2, …, w_m].
Initialization: set j = 0, let W_0 be an empty matrix and let B_0 = I_d be an orthonormal basis of R^d.
Repeat:
  j ← j + 1;
  Solve problem (12) by Algorithm 1 and get its solution v_j; compute the j-th projection vector w_j = B_{j−1} v_j;
  Update W_j = [W_{j−1}, w_j];
  Compute an orthonormal basis B_j of the null space of W_j^T by solving the linear equations W_j^T B_j = 0 and following the Gram-Schmidt procedure;
Until j = m
Proposition 2.
The projection vectors obtained by Algorithm 2 are orthonormal.
Proof.
According to the inductive assumption, we first assume that the vectors w_1, …, w_{j−1} are orthonormal. Thus W_{j−1} = [w_1, …, w_{j−1}] is a column-orthonormal matrix, and we denote its column space by Span(W_{j−1}); the orthogonal complement U of Span(W_{j−1}) is then a (d − j + 1)-dimensional subspace. Recall that the primary goal is to search for a vector satisfying problem (10). Once the subspace U has been obtained, we solve the next projection vector through the following optimization problem, which can be solved by Algorithm 1:

max_{w ∈ U, w^T w = 1} Σ_{i=1}^{n} |w^T x_i| / (a + |w^T x_i|),   (12)

where U is the null space of W_{j−1}^T and dim(U) = d − j + 1. Then update W_j = [W_{j−1}, w_j], where w_j is the obtained solution. It is obvious that w_j is orthogonal to w_1, …, w_{j−1}; therefore, W_j is also a column-orthonormal matrix.
In essence, the data are projected onto the subspace U to implement Algorithm 1. At last, to perform the next iteration, we need to find an orthonormal basis of the new null space; to obtain it, we need only solve the linear equation W_j^T B = 0 and make this basis orthonormal by following the Gram-Schmidt procedure. ∎
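The multi-vector scheme above can be sketched in a simplified deflation-style form (an assumed reading, without the explicit null-space basis bookkeeping of Algorithm 2): after each projection vector is found, the data are projected onto the orthogonal complement of the vectors found so far, and each new vector is re-orthogonalized by Gram-Schmidt. Any single-vector solver with the signature `solver(X) -> w` can stand in for Algorithm 1:

```python
import numpy as np

def greedy_orthogonal_pca(X, k, first_vector_solver):
    """Return a d x k matrix W with orthonormal columns, computed greedily."""
    d = X.shape[0]
    W = np.zeros((d, k))
    Xr = X.copy()
    for j in range(k):
        w = first_vector_solver(Xr)
        w = w - W[:, :j] @ (W[:, :j].T @ w)   # Gram-Schmidt re-orthogonalization
        w /= np.linalg.norm(w)
        W[:, j] = w
        Xr = Xr - np.outer(w, w @ Xr)         # deflate: project data onto null(w^T)
    return W
```

By construction, every new column is orthogonal to the previous ones, which mirrors the claim of Proposition 2.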
V Experiments
In this section, we evaluate the performance of Tℓ1-PCA on an artificial dataset and two human face databases, Yale [34] and Jaffe [35]. To demonstrate the robustness of our method, we add outliers to the artificial dataset and random block noise to the face databases. For comparison, the classical PCA [2], PCA-L1 [7], PCA-Lp [17] and ℓp-SPCA [20] are also utilized. We use the nearest neighbor classifier (1-NN) for classification, which assigns a test sample to the class of its nearest neighbor among the training samples. The implementation environment is MATLAB R2017a.
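The 1-NN rule used throughout the experiments can be sketched as follows (the data here are illustrative only):

```python
import numpy as np

def nn1_predict(train_X, train_y, test_X):
    """1-NN: label each row of test_X with the label of its nearest
    (Euclidean) row in train_X."""
    # pairwise squared Euclidean distances, shape (m_test, n_train)
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

train_X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.2, 0.5], [9.0, 10.0]])
print(nn1_predict(train_X, train_y, test_X))   # -> [0 1]
```

In the experiments, `train_X` and `test_X` would be the data after projection by the learned matrix W.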
V-A A Toy Example
Firstly, we evaluate the robustness of Tℓ1-PCA on a two-dimensional artificial dataset containing 30 data points and 4 outliers. The 30 data points are generated by picking the first coordinate from −3 to 3 with equal intervals and drawing the second coordinate from a Gaussian distribution, with the constraint that the second coordinates sum to zero; they are depicted by navy blue markers. The 4 outliers are at coordinates [4, 4.8], [3.7, 5.1], [3.3, 6] and [2.4, 5.5], depicted by red markers. The dataset is shown in Fig. 3. Obviously, when the outliers are discarded, the ideal projection direction of the dataset lies along the first coordinate axis; this direction is depicted by a black solid line. The first principal components of Tℓ1-PCA, PCA, PCA-L1, PCA-Lp and ℓp-SPCA are obtained by applying them to the artificial dataset with outliers under different parameters; the parameter a in Tℓ1-PCA and the parameter p in PCA-Lp and ℓp-SPCA are chosen from the candidate sets given in Section V-B. These principal components and their included angles with the ideal projection direction are also plotted in Fig. 3.
From Fig. 3, we see that the principal components learned by PCA, PCA-L1, PCA-Lp and ℓp-SPCA deviate severely from the ideal projection direction, with ℓp-SPCA showing the largest included angle. However, the principal components learned by Tℓ1-PCA deviate only slightly from the ideal projection direction, especially when the parameter a is small. Its principal components are much closer to the ideal projection direction, which indicates that Tℓ1-PCA is more robust to outliers than the other PCAs.
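The sensitivity of classical ℓ2-norm PCA to such outliers can be reproduced in a few lines. This sketch is illustrative (it uses the four outlier coordinates listed above, a plain covariance eigendecomposition, and its own clean-data generation), and simply measures how far the leading eigenvector is pulled:

```python
# Illustration: the leading eigenvector of the sample covariance
# (classical l2-norm PCA) is dragged toward a few outliers.
import numpy as np

def pca_first_component(X):
    """Leading principal component of a d x n column-sample matrix X."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the samples
    C = Xc @ Xc.T / Xc.shape[1]                 # sample covariance matrix
    vals, vecs = np.linalg.eigh(C)              # eigh: ascending eigenvalues
    return vecs[:, -1]                          # eigenvector of largest one

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
X = np.vstack([x, 0.1 * rng.standard_normal(30)])   # points along the axis
v_clean = pca_first_component(X)

outliers = np.array([[4.0, 3.7, 3.3, 2.4],
                     [4.8, 5.1, 6.0, 5.5]])
v_noisy = pca_first_component(np.hstack([X, outliers]))

# angle between the two learned directions, in degrees
angle = np.degrees(np.arccos(min(1.0, abs(v_clean @ v_noisy))))
print(f"deviation caused by 4 outliers: {angle:.1f} degrees")
```

A robust method should keep this deviation small; the figure referenced above shows Tℓ1-PCA doing exactly that.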
V-B Real-world Datasets
The performance of Tℓ1-PCA, PCA-Lp and ℓp-SPCA depends on the parameter a or p. For each of the three methods, we search for the optimal parameter from a ∈ {100, 50, 10, 1, 0.5, 0.1, 0.05, 0.01, 0.001} or p ∈ {1, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01, 0.001} on all real-world datasets.
V-B1 Yale
The Yale face database contains 165 grayscale images of 15 individuals under different lighting conditions and facial expressions; the facial expressions include happy, normal, sad, sleepy, surprised and wink. Each individual has 11 images, and each image in the Yale database is cropped to the same size in pixels. 9 images of each person are randomly selected for training, and block noise of two different sizes is added to them. Original and noisy sample images of one individual are shown in Fig. Then we employ PCA, PCA-L1, PCA-Lp, ℓp-SPCA and Tℓ1-PCA to extract features, respectively. For each given parameter a or p, we compute the average classification accuracies of 15 random splits on the original data and on the noisy data.
Table I: Classification accuracy (%) on the Yale database

Method             Tℓ1-PCA   PCA-Lp   ℓp-SPCA   PCA-L1   PCA
Original data      65.77     64.22    63.77     63.33    62.22
With block noise   59.33     56.00    57.78     56.44    57.33
With block noise   55.33     52.66    52.00     51.11    50.66
For each method, Fig. 4 plots the average classification accuracy versus the dimension of the reduced space under the optimal parameter. Table I lists the classification accuracy of each method under the optimal dimension. The above results show that Tℓ1-PCA outperforms the other methods in all conditions, and the accuracy of Tℓ1-PCA is around 2.5% higher than that of the other PCAs. From Fig. 4, the accuracy of Tℓ1-PCA shows an upward tendency as the number of dimensions grows, and compared with the data without noise, its advantage in performance is strengthened on the noisy data. The reason is that we use the Tℓ1-norm, which has a stronger suppression effect on noise. When the number of dimensions reaches around 30, the accuracy tends to be stable.
V-B2 Jaffe
The Jaffe database contains 213 images of 7 facial expressions posed by 10 Japanese female individuals. Each image is resized to the same size in pixels. We randomly choose 70% of each individual's images for training, adding the same noise as for the Yale database, and use the remainder for testing. Some samples in the Jaffe database are shown in Fig. Then PCA, PCA-L1, PCA-Lp, ℓp-SPCA and Tℓ1-PCA are applied to extract features. For each given parameter a or p, the average classification accuracies on the original data and on the noisy data over 15 random splits are considered.
Table II: Classification accuracy (%) on the Jaffe database

Method             Tℓ1-PCA   PCA-Lp   ℓp-SPCA   PCA-L1   PCA
Original data      99.37     98.75    99.06     98.95    99.06
With block noise   98.12     96.97    97.81     97.08    96.97
With block noise   94.47     93.43    94.16     92.81    94.06
For each method, the average classification accuracy versus the dimension of the reduced space under the optimal parameter on the Jaffe database is plotted in Fig. The classification accuracy of each method under the optimal parameter is listed in Table II. It can be seen that Tℓ1-PCA is superior to the other methods, and the trend of the accuracy and the behaviour of the parameter on this database are similar to those on the Yale database. When the number of dimensions reaches about 20, the accuracy tends to be stable.
V-C Convergence Experiments
We finally investigate the performance of Tℓ1-PCA in terms of convergence. Fig. 6 plots the convergence curves on the artificial data with/without outliers and on the above two databases with/without noise. The results illustrate that Tℓ1-PCA converges quickly, generally within about 10 steps, which is consistent with the conclusion of Proposition 1.
VI Conclusion
In this paper, we have introduced a new Tℓ1-norm and shown its properties, which indicate that the Tℓ1-norm is more robust than the ℓp-norm (0 < p < 1). Then we proposed a novel dimensionality reduction method called Tℓ1-PCA, which employs the Tℓ1-norm as the distance metric to maximize the dispersion of the projected data. Tℓ1-PCA is more robust to noise and outliers than ℓp-norm-based PCA methods and achieves higher classification accuracy, and the convergence experiments showed that Tℓ1-PCA converges quickly. The Tℓ1-norm could be applied not only to unsupervised PCA but also to supervised dimensionality reduction methods, and even to other methods in machine learning. These will be our future work.
References
 [1] H. Hotelling, "Analysis of a complex of statistical variables into principal components", Journal of Educational Psychology, vol. 24, no. 6, pp. 417-441, 1933.
 [2] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
 [3] A. Baccini, P. Besse and A. D. Falguerolles, "An ℓ1-norm PCA and a heuristic approach", in Ordinal and Symbolic Data Analysis, E. Diday, Y. Lechevalier and O. Opitz, Eds., Springer, pp. 359-368, 1996.
 [4] Q. Ke and T. Kanade, "Robust subspace computation using ℓ1 norm", Technical Report CMU-CS-03-172, Carnegie Mellon University, Aug. 2003, http://citeseer.ist.psu.edu/ke03robust.html.
 [5] C. Ding, D. Zhou, X. F. He and H. Y. Zha, "R1-PCA: rotational invariant ℓ1-norm principal component analysis for robust subspace factorization", in Proceedings of the 23rd International Conference on Machine Learning, pp. 281-288, 2006.
 [6] N. Kwak, "Principal component analysis based on ℓ1-norm maximization", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1672-1680, Sep. 2008.
 [7] F. Nie, H. Huang, C. Ding, D. Luo and H. Wang, "Robust principal component analysis with non-greedy ℓ1-norm maximization", in Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1433-1438, 2011.
 [8] J. P. Brooks, J. H. Dula and E. L. Boone, "A pure ℓ1-norm principal component analysis", Computational Statistics & Data Analysis, vol. 61, pp. 83-98, May 2013.
 [9] D. Y. Meng, Q. Zhao and Z. B. Xu, "Improve robustness of sparse PCA by ℓ1-norm maximization", Pattern Recognition, vol. 45, no. 1, pp. 487-497, Jan. 2012.
 [10] Z. H. Lai, Y. Xu, Q. C. Chen, J. Yang and D. Zhang, "Multilinear sparse principal component analysis", IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1942-1950, Oct. 2014.
 [11] H. X. Wang and J. Wang, "2DPCA with ℓ1-norm for simultaneously robust and sparse modelling", Neural Networks, vol. 46, pp. 190-198, Oct. 2013.
 [12] G. F. Lu, J. Zou, Y. Wang and Z. Q. Wang, "ℓ1-norm-based principal component analysis with adaptive regularization", Pattern Recognition, vol. 60, pp. 901-907, Dec. 2016.
 [13] C. Kim and D. Klabjan, "A simple and fast algorithm for ℓ1-norm kernel PCA", IEEE Transactions on Pattern Analysis and Machine Intelligence, Mar. 2019.
 [14] J. C. Fan and T. W. S. Chow, "Exactly robust kernel principal component analysis", IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 749-761, Mar. 2020.
 [15] S. Liwicki, G. Tzimiropoulos, S. Zafeiriou and M. Pantic, "Euler principal component analysis", International Journal of Computer Vision, vol. 101, no. 3, pp. 498-518, Feb. 2013.
 [16] Z. Z. Liang, S. X. Xia, Y. Zhou, L. Zhang and Y. F. Li, "Feature extraction based on ℓp-norm generalized principal component analysis", Pattern Recognition Letters, vol. 34, no. 9, pp. 1037-1045, Jul. 2013.
 [17] N. Kwak, "Principal component analysis by ℓp-norm maximization", IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 594-609, May 2014.
 [18] J. Wang, "Generalized 2-D principal component analysis by ℓp-norm for image analysis", IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 792-803, Mar. 2016.
 [19] K. G. Quach, C. N. Duong, K. Luu and T. D. Bui, "Non-convex online robust PCA: enhance sparsity via ℓp-norm minimization", Computer Vision and Image Understanding, vol. 158, pp. 126-140, May 2017.
 [20] C. N. Li, W. J. Chen and Y. H. Shao, "Robust sparse ℓp-norm principal component analysis", Acta Automatica Sinica, vol. 43, no. 1, pp. 142-151, Jan. 2017.
 [21] T. W. Weng, H. Zhang, P. Y. Chen, J. Yi, D. Su, Y. Gao, C. J. Hsieh and L. Daniel, "Evaluating the robustness of neural networks: an extreme value theory approach", in ICLR 2018 Conference, 2018.
 [22] Z. Cranko, S. Kornblith, Z. Shi and R. Nock, "Lipschitz networks and distributional robustness", https://arxiv.org/abs/1809.01129, 2018.
 [23] M. Nikolova, "Local strong homogeneity of a regularized estimator", SIAM Journal on Applied Mathematics, vol. 61, no. 2, pp. 633-658, Aug. 2000.
 [24] J. C. Lv and Y. Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares", Annals of Statistics, vol. 37, no. 6A, pp. 3498-3528, Dec. 2009.
 [25] R. R. Ma, J. Y. Miao, L. F. Niu and P. Zhang, "Transformed ℓ1 regularization for learning sparse deep neural networks", Neural Networks, vol. 119, pp. 286-298, Nov. 2019.
 [26] S. Zhang and J. Xin, "Minimization of transformed ℓ1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing", Mathematical Programming, vol. 169, no. 1, pp. 307-336, May 2018.
 [27] S. Zhang and J. Xin, "Minimization of transformed ℓ1 penalty: closed form representation and iterative thresholding algorithms", Communications in Mathematical Sciences, vol. 15, no. 2, pp. 511-537, 2017.
 [28] S. Zhang, P. H. Yin and J. Xin, "Transformed Schatten-1 iterative thresholding algorithms for low rank matrix completion", Communications in Mathematical Sciences, vol. 15, no. 3, pp. 839-862, 2017.
 [29] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers", Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
 [30] C. N. Li, Y. H. Shao, W. T. Yin and M. Z. Liu, "Robust and sparse linear discriminant analysis via an alternating direction method of multipliers", IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 3, pp. 915-926, Mar. 2020.
 [31] M. Yu, L. Shao, X. Zhen and X. He, "Local feature discriminant projection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1908-1914, Sep. 2015.
 [32] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
 [33] P. Jain and P. Kar, "Non-convex optimization for machine learning", Foundations and Trends in Machine Learning, vol. 10, no. 3-4, pp. 142-363, Dec. 2017.
 [34] P. N. Belhumeur, J. P. Hespanha and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, Jul. 1997.
 [35] M. Lyons, S. Akamatsu, M. Kamachi and J. Gyoba, "Coding facial expressions with Gabor wavelets", in Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Apr. 1998.