
# All Sparse PCA Models Are Wrong, But Some Are Useful. Part I: Computation of Scores, Residuals and Explained Variance

Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and variance explained that is used in the classical PCA cannot directly be applied to sPCA models. This also affects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-of-the-art sPCA algorithms, and provide proper computation of the different elements mentioned. We show that sPCA approaches present disparate and limited performance when modeling noise-free, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this problem.

07/01/2020


## 1 Introduction

In modern data analysis there is an increasing interest in sparse (component) methods that yield models which are simpler and easier to interpret. One such method is the popular sparse Principal Component Analysis (sPCA) Jolliffe2003 ; Zou2006 ; PMD ; GPCA . While the properties of sPCA have already been discussed PMD ; Mackey2008 ; Hastie:2015:SLS:2834535 , several aspects that are relevant for the interpretation of the models have not been treated in detail.

sPCA is based on Principal Component Analysis (PCA) Jolliffe02 ; Jackson03 , which factorizes a data matrix using the criterion of maximizing variance. The PCA model follows the expression:

 X = \hat{T}\hat{P}^T + E \qquad (1)

where X is the data matrix, \hat{T} the score matrix, \hat{P} the loading matrix and E the matrix of residuals. An interesting property of PCA is that the loadings and the scores are orthogonal and, as a result, they can be computed either using a simultaneous or a sequential algorithm.
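As a quick numerical illustration (random data; the PCA computation via the SVD is standard, but the specific matrices here are arbitrary), both orthogonality properties can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X -= X.mean(axis=0)                      # mean-center the data

# PCA via the SVD: X = U S V^T, scores T = U S, loadings P = V
U, S, Vt = np.linalg.svd(X, full_matrices=False)
A = 3                                    # number of components retained
T = U[:, :A] * S[:A]                     # scores
P = Vt[:A].T                             # loadings

# Loadings are orthonormal and scores are orthogonal (T^T T is diagonal)
assert np.allclose(P.T @ P, np.eye(A))
assert np.allclose(T.T @ T, np.diag(np.diag(T.T @ T)))
```

Because of this orthogonality, deflating X one component at a time or extracting all A components at once yields the same model.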

When moving from PCA to sPCA, there are at least two interrelated consequences that need to be considered: i) the appearance of correlation among scores and/or loadings in different components and ii) the departure of components from the row-space defined by the original data. The first point makes the traditional way of computing scores, residuals and captured variance used in PCA inapplicable to sPCA. Moreover, the captured variance is often used as a criterion for model quality and for selection/comparison of model variants. Therefore, its accurate computation is necessary. The second point affects interpretability, especially for multi-component models. Both effects are especially relevant in those domains in which the interpretation of the scores/residuals is necessary, like in chemometrics.

Zou et al. Zou2006 proposed a method based on the QR decomposition to compute the variance captured by sPCA when scores are correlated; we will show that this method is generally incorrect. Witten et al. PMD related the correlation of the loadings/scores to the departure from the row/column-space of the data and proposed an orthogonalized version of their sPCA method, in which scores are made orthogonal; we will show that this orthogonalization reduces the captured variance and hinders interpretation, but improves the estimation of sparse loadings. Mackey Mackey2008 discussed different deflation methods in the context of sPCA, some of which compute residuals incorrectly, leading to the double counting of variance, i.e., a portion of the variance in the data being captured by more than one component. He also contributed a new deflation method which, as we will show, still requires the corrections proposed and discussed in this paper.

In this study we present a comparison among several widely used sPCA algorithms and reflect on how scores, residuals and captured variance should be computed for sparse components, something central in chemometrics. The problem of the departure of components from the row-space is treated in a follow-up paper. The rest of the paper is organized as follows. In Section 2, we review the set of sPCA algorithms considered. In Section 3, we discuss the mathematical properties of sPCA, using PCA as a baseline. In Section 4, we present two simulated examples that mimic spectral data, and a Monte Carlo experiment, which will illustrate interpretation problems and solutions in the following sections. Section 5 discusses the implications of correlated scores and loadings in sPCA. Section 6 presents the correct computation of scores, residuals and captured variance in sPCA models. Section 7 offers some final considerations and conclusions.

## 2 Sparse PCA algorithms

Most sPCA algorithms modify classical PCA by including sparsity-inducing constraints or penalties based on the \ell_0 or \ell_1 norms Richt2012 . Jolliffe et al. Jolliffe2003 developed the SCoTLASS algorithm (Simplified Component Technique-LASSO), which incorporates the least absolute shrinkage and selection operator (lasso) (Tibshirani1994, ), where the \ell_1 norm (sum of absolute values) of the loadings is constrained. The SCoTLASS criterion follows:

 \hat{p}^{SC} = \arg\max_{p} \; p^T X^T X p \quad \text{s.t.} \quad \|p\|_2^2 = 1, \; \|p\|_1 \le c \qquad (2)

where \hat{p}^{SC} is the resulting sparse loading, with superscript SC referring to SCoTLASS. To obtain successive components, the SCoTLASS optimization constrains the 2nd and further sparse loadings to be orthogonal to the previous ones. The SCoTLASS criterion is computationally very demanding Hastie:2015:SLS:2834535 , which makes its use impractical for data exploration. In addition, the number of nonzero elements in \hat{p}^{SC} is upper-bounded by the number of observations in the data, which is a structural limitation of the lasso constraint.

### 2.1 Sparse PCA by Zou, Hastie and Tibshirani

Zou et al. Zou2006 introduced an alternative formulation to generate sparse components, motivated by the structural limitation of the lasso and the computational cost of SCoTLASS. This new model was also named Sparse PCA (SPCA; we will use a capital 'S' to differentiate this approach from the general sPCA technique, although at that time the term was already used to designate the whole group of sparse variants of PCA). SPCA redefines PCA as a regularized regression problem with the ridge penalty, where the responses are the PCA scores. In this formulation any loading vector of PCA equals \hat{p}^R normalized to unit length, where

 \hat{p}^R = \arg\min_{p} \|\hat{t} - Xp\|_2^2 + \lambda_2 \|p\|_2^2 \qquad (3)

with superscript R referring to regression, and \hat{t} is the corresponding score vector. This equivalence holds in general for any non-negative \lambda_2. Equivalently, the loadings of PCA can be obtained from the following regression problem followed by a normalization step:

 \{\hat{P}^R, \hat{Q}^R\} = \arg\min_{P,Q} \|X - XPQ^T\|_F^2 + \lambda_2 \sum_{a=1}^{A} \|p_a\|_2^2 \quad \text{s.t.} \quad Q^TQ = I \qquad (4)

where p_a represents the a-th column vector in P, with A the number of components, and we distinguish between weights P and loadings Q. In similar modeling methods, like Partial Least Squares, weights are used for the derivation of scores from the data, and loadings for the identification of the captured variance in X, e.g. for deflation. The advantage of Eq. (4) over Eq. (3) is that the former does not require the pre-computation of the PCA scores. \hat{P}^R is column-wise proportional to the PCA loadings for any non-negative \lambda_2. Therefore, we can understand Eq. (4) as a flexible method to compute PCA, where constraints and penalties can be introduced either in the weights or the loadings.
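The equivalence stated for Eq. (3) can be checked with a small sketch (random illustrative data; the closed-form ridge solution is used instead of an iterative solver):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))
X -= X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                # first PCA loading
t = X @ v                # corresponding score vector

lam2 = 5.0               # any non-negative ridge penalty
# Closed-form solution of the ridge problem in Eq. (3)
p_r = np.linalg.solve(X.T @ X + lam2 * np.eye(X.shape[1]), X.T @ t)
p_hat = p_r / np.linalg.norm(p_r)        # normalize to unit length

# The normalized ridge solution recovers the PCA loading (up to sign)
assert np.isclose(abs(p_hat @ v), 1.0)
```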

Using this flexible redefinition of PCA, the lasso can be implemented using a criterion close to the (naive) elastic net (Zou2005, ), which combines the lasso and the ridge penalties:

 \{\hat{P}^S, \hat{Q}^S\} = \arg\min_{P,Q} \|X - XPQ^T\|_F^2 + \lambda_2 \sum_{a=1}^{A} \|p_a\|_2^2 + \sum_{a=1}^{A} \lambda_{1,a} \|p_a\|_1 \quad \text{s.t.} \quad Q^TQ = I \qquad (5)

with superscript S referring to SPCA. In the following, we will refer to \hat{P}^S as sparse loadings and to \hat{Q}^S as auxiliary loadings, to simplify the discussion in connection with other sPCA techniques. Note that the roles of P and Q are performed by a unique set of loadings in PCA and most sPCA variants. The SPCA algorithm is simultaneous, in the sense that all components are identified in the same run. Note that the expression in Eq. (5) is not optimal to approximate the data, because Q is constrained to be orthonormal while at the same time the norms of the vectors in P are penalized, leaving no flexibility to approximate X and resulting in very large residuals. Besides, the sparse loadings are not normalized. The authors suggest normalizing them after the optimization, in the last step of their algorithm, but they overlook the need to update \hat{Q}^S accordingly. This reflects the fact that the authors were only interested in \hat{P}^S as the output of the algorithm; the model in Eq. (5) is only a means to arrive at that output.

They also realized that this SPCA algorithm produces highly correlated scores, and concluded that the standard method for computing the captured variance as used in PCA is too optimistic. For this reason, they suggested using the QR decomposition of the scores, \hat{T} = QR, with Q orthogonal and R upper triangular, and computing the captured variance of the model as the sum of squares of the elements in the diagonal of R.
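A sketch of this QR-based adjusted variance follows (shown here for reference; later sections argue it is generally incorrect for sparse models):

```python
import numpy as np

def adjusted_variance(T):
    """Zou et al.'s QR-based adjusted variance for possibly
    correlated score vectors (columns of T): the sum of squares
    of the diagonal of R, where T = QR."""
    R = np.linalg.qr(T)[1]
    return np.sum(np.diag(R) ** 2)

# For orthogonal PCA scores the adjustment changes nothing:
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 8))
X -= X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :3] * S[:3]
assert np.isclose(adjusted_variance(T), np.sum(T ** 2))
```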

An alternative SPCA algorithm has been published as part of the SPASM Toolbox Sjostrand2012 : a sequential variant of SPCA in which the loadings are computed one at a time. This algorithm is faster than the simultaneous version, at the cost of a small reduction in captured variance.

### 2.2 The Penalized Matrix Decomposition by Witten, Tibshirani and Hastie

Witten et al. PMD introduced an iterative algorithm that alternates cardinality-constrained rank-one variance maximization and matrix deflation. For the first task, they proposed the Penalized Matrix Decomposition (PMD), a biconvex problem solved by alternating between the identification of u (left pseudo-eigenvector) and p (right pseudo-eigenvector). PMD equals the rank-1 Singular Value Decomposition (SVD) when no constraints are considered. The sPCA formulation in the PMD framework is:

 \{\hat{p}^P, \hat{u}^P\} = \arg\max_{p,u} \; u^T X p \quad \text{s.t.} \quad \|p\|_1 \le c_2, \; \|p\|_2^2 \le 1, \; \|u\|_2^2 \le 1 \qquad (6)

where \hat{p}^P is obtained using a soft-thresholding operator, with superscript P referring to PMD. The corresponding pseudo-singular value is obtained as:

 \hat{d}^P = (\hat{u}^P)^T X \hat{p}^P \qquad (7)

After each component is obtained, deflation is performed as:

 X = X - \hat{d}^P \hat{u}^P (\hat{p}^P)^T \qquad (8)

This solution is related to SCoTLASS and SPCA PMD , but according to Hastie:2015:SLS:2834535 this method is more efficient, something we have also observed experimentally. Regarding the deflation, they state that while projection deflation works well for true eigenvectors, when using constrained versions of PCA (i.e. sPCA) \hat{p}^P and/or \hat{u}^P are not in the row and column spaces of X. As an alternative to projection deflation, they propose to compute \hat{u}^P as a factor orthogonalized with respect to all the previous left pseudo-eigenvectors. By incorporating this deflation, orthogonal scores are obtained. However, the authors themselves note that it is not clear whether orthogonality is a desirable property PMD ; Hastie:2015:SLS:2834535 . We will call this the orthogonalized deflation approach.
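A minimal sketch of one PMD component may help fix ideas. The function name and the fixed threshold `delta` are ours: the actual algorithm searches for the soft-threshold that meets the \ell_1 constraint \|p\|_1 \le c_2, whereas here a threshold is simply supplied:

```python
import numpy as np

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def pmd_rank_one(X, delta, n_iter=200):
    """Alternate between the left vector u and the soft-thresholded,
    normalized right vector p (Witten et al.'s PMD, sketched with a
    fixed threshold instead of the l1-constraint search)."""
    p = np.linalg.svd(X, full_matrices=False)[2][0]   # SVD warm start
    for _ in range(n_iter):
        u = X @ p
        u /= np.linalg.norm(u)
        p = soft_threshold(X.T @ u, delta)
        p /= np.linalg.norm(p)
    d = u @ X @ p                                     # Eq. (7)
    return u, p, d

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 10))
u, p, d = pmd_rank_one(X, delta=0.2)
X_deflated = X - d * np.outer(u, p)                   # Eq. (8)
```

With `delta = 0` the iteration reduces to the power method for the rank-1 SVD, so `d` equals the largest singular value of X.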

### 2.3 Group-wise Principal Component Analysis

The Group-wise Principal Component Analysis (GPCA) GPCA is a sparse approach to PCA proposed by some of the authors. The main difference between GPCA and the sPCA variants presented in the previous sections is that sparsity is not achieved through a loss function with a lasso constraint; instead, it is defined in terms of groups of correlated variables and imposed using a set of nested PCA loops. Every component contains non-zero loadings for a single group of correlated variables, and is calibrated using only this subset of variables.

GPCA starts with the identification of a set of K (possibly overlapping) groups of correlated variables obtained from a map M, with elements m_{ij} containing the strength of the relationship between variables i and j. An example of such a map is the correlation matrix of X. In the original formulation of GPCA, the MEDA approach (Missing-data for Exploratory Data Analysis) Camacho2011missing was implemented to define M. MEDA uses a missing-data strategy to estimate the correlation between any two variables, an approach which has been found to be effective in filtering out noise when estimating correlations Camacho2011missing .

Once the groups have been defined, the GPCA algorithm computes K candidate loading vectors, where the k-th vector has non-zero elements associated with the variables in the k-th group. From these, only the loading that captures the most variance is retained, and it is used to deflate the data matrix X. The algorithm iterates until a set of A pseudo-components is extracted.

GPCA uses the deflation proposed by Mackey Mackey2008 , who showed that the common approach of single-component deflation in PCA is in general not valid for sPCA. This is the projection deflation, equivalent to Eq. (8):

 X_A = X_{A-1}\left(I - \hat{p}_A(\hat{p}_A^T \hat{p}_A)^{-1} \hat{p}_A^T\right) \qquad (9)

where X_A denotes the residuals after A PCs have been extracted. After normalizing \hat{p}_A to unit length, Eq. (9) simplifies to:

 X_A = X_{A-1}\left(I - \hat{p}_A \hat{p}_A^T\right) \qquad (10)

Setting C_A to be the covariance matrix of the data after A PCs have been extracted, the projection deflation can be written as

 C_A = \left(I - \hat{p}_A \hat{p}_A^T\right) C_{A-1} \left(I - \hat{p}_A \hat{p}_A^T\right) \qquad (11)

For a pseudo-eigenvector \hat{p}_A of C_{A-1}, Mackey Mackey2008 shows that projection deflation annihilates the corresponding pseudo-eigenvalue while maintaining the others in the data. However, when deflating with respect to a series of non-orthogonal pseudo-eigenvectors, projection deflation may re-introduce previously deflated components and lead to the "double counting" of variance, which can produce misleading results for interpretation. Mackey also shows that this problem can be solved by annihilating the variance captured by the part of the pseudo-eigenvector that is orthogonal to the space spanned by the previously extracted pseudo-eigenvectors. For that, he introduces a combination of projection deflation and orthogonalized deflation, named orthogonalized projection deflation. This, similarly to Zou2006 , adds to the sparse loadings a set of auxiliary orthogonal right pseudo-eigenvectors that are used to perform the deflation.

One way to do this deflation follows a sequential Gram-Schmidt decomposition:

 \hat{q}_A = B_{A-1} \hat{p}_A \qquad (12)
 X_A = X_{A-1}\left(I - \hat{q}_A \hat{q}_A^T\right) \qquad (13)
 B_A = B_{A-1}\left(I - \hat{q}_A \hat{q}_A^T\right) \qquad (14)

where \hat{q}_A is normalized to unit length before the deflation and B_0 is initialized to the identity matrix. We will call this approach Mackey's generalized deflation.
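Eqs. (12)-(14) can be sketched as follows (the function name is ours; each \hat{q}_A is normalized to unit length, and we assume each new loading is not fully contained in the span of the previous ones):

```python
import numpy as np

def mackey_generalized_deflation(X, loadings):
    """Orthogonalized projection deflation, Eqs. (12)-(14): each
    loading is projected by B onto the orthogonal complement of the
    previously used directions before the projection deflation."""
    B = np.eye(X.shape[1])
    for p in loadings:                     # unit-norm sparse loadings
        q = B @ p                          # Eq. (12)
        q /= np.linalg.norm(q)
        D = np.eye(len(q)) - np.outer(q, q)
        X = X @ D                          # Eq. (13)
        B = B @ D                          # Eq. (14)
    return X

# Two overlapping (hence correlated) sparse loadings:
X = np.random.default_rng(4).standard_normal((15, 6))
p1 = np.array([0.6, 0.8, 0.0, 0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.6, 0.8, 0.0, 0.0, 0.0])
Xd = mackey_generalized_deflation(X, [p1, p2])

# The residual carries no variance along either sparse direction,
# so previously deflated components cannot be re-introduced:
assert np.allclose(Xd @ p1, 0.0) and np.allclose(Xd @ p2, 0.0)
```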

## 3 From principal components to sparse principal components

When moving from a classical principal component model to any of its sparse variants, there are a number of implications that we think are pivotal to assess, especially in the context of data interpretation:

1. Correlation of scores
The scores resulting from a standard PCA are uncorrelated as a consequence of maximizing variance in each component, while the scores of an sPCA model are typically correlated. This correlation complicates proper visualization, but also allows for more flexible modeling. As already discussed, Witten et al. PMD provide a solution to correct for this correlation, but it is not clear whether it should be used. More importantly, care should be taken when computing the captured variance with correlated scores.

2. Correlation of loadings
Correlation of loadings is a relevant problem, in particular when scatter plots of scores are used for interpretation. The visualization in scatter plots assumes orthogonal axes in the original variable space. This holds for the standard PCA components, but it does not necessarily hold for sparse components. Again, care should be taken when computing the captured variance with correlated loadings.

3. Loss of captured variance per principal component.
The PCs are optimal in terms of captured variance. This is a property that makes them especially useful for data interpretation, since a large portion of variance/information is explained by a reduced set of components. This in turn simplifies the visualization of the data, since a minimum number of PCs needs to be visualized to account for a certain level of variance. However, part of the captured variance might not be of interest and could even complicate interpretation. As a relevant example, all PCs include a certain amount of embedded noise malinowski2002factor . When we move to the sparse context, we simplify the interpretation of loadings, but also need more components to capture the same amount of variance.

4. Moving outside the row-space
Sparse loadings can lie outside the data row-space as a result of the constraints/penalties applied. As a matter of fact, in the presence of noise there is no sparse solution inside the row-space. If the constraints/penalties are reasonable, this departure can improve model quality. A notable example is non-negativity constraints, useful for modeling non-negative data. While the departure of a sparse component can be advantageous, it also complicates the generation of multi-component models.

We will discuss these implications in detail and quantify them using simulations in the next sections, except for the departure from the row-space, which is treated in the follow-up paper.

## 4 Simulated examples and Monte Carlo study

We will use a number of examples with simulated data to support our discussion; we opt not to introduce noise so that we can easily illustrate the behavior of the sPCA algorithms. First, we simulate two different examples for visual illustration, representing two common situations in exploratory data analysis that are of interest for this study: when orthogonality is consistent with the underlying data-generation process and when it is not. The motivation is that a subset of sPCA methods, as well as PCA, impose orthogonality in some or all of the factors, and we want to assess to what extent this is a useful constraint. As a second step, we generalize the results through a Monte Carlo simulation. We consider several variants and implementations of sPCA, some of which are evaluated and compared in this paper for the first time:

1. The sPCA algorithm by Zou et al. Zou2006 (SPCA) and the sequential (iterative) implementation in the SPASM toolbox Sjostrand2012 (SPCA-Sq).

2. The PMD algorithm by Witten et al. PMD with projection deflation (PMD-PD) and orthogonalized deflation (PMD-O), discussed in Section 2.2, and a modified version using Mackey's Mackey2008 generalized deflation (PMD-M), discussed in Section 2.3.

3. The GPCA algorithm GPCA using projection deflation (GPCA-PD), Mackey's Mackey2008 generalized deflation (GPCA-M) and the orthogonalized deflation of PMD (GPCA-O).

### 4.1 Orthogonal spectra

The first example simulates spectra from a mixture of two compounds that result in two orthogonal components. The simulation has the following generating rules:

• The first component has non-zero values for variables {1-10} and the second for variables {11-20}. For each component, the loadings follow a spectral shape: {0.1 0.3 0.5 0.7 0.9 0.9 0.7 0.5 0.3 0.1}. Note the loadings are intentionally non-negative, orthogonal and sparse (they contain 0's).

• The scores for five individuals are generated following the pattern:

 T_{(5 \times 2)} = \begin{bmatrix} 0.5 & 0 & 0.5 & 0 & 0.5 \\ 0 & 0.25 & 0 & 0.25 & 0 \end{bmatrix}^T

Note that the scores are intentionally non-negative and orthogonal.

• The data set X, of dimension 5×20, is generated as X = TP^T. There is no noise in the data, which means that the data is purely rank 2 and the true profiles are the ones shown above. The low dimension in observations and variables and the absence of noise are not realistic, but are useful to illustrate the performance of the models in a simple scenario.

Figure 1: Orthogonal spectra example: true data and matrix factorization methods.
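The generating rules above can be reproduced with a short sketch (the score matrix follows our reading of the pattern above):

```python
import numpy as np

# Spectral shape shared by both loading vectors
shape = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.9, 0.7, 0.5, 0.3, 0.1])
P = np.zeros((20, 2))
P[:10, 0] = shape        # component 1: variables 1-10
P[10:, 1] = shape        # component 2: variables 11-20

# Non-negative, orthogonal scores for five individuals
T = np.array([[0.5, 0.0],
              [0.0, 0.25],
              [0.5, 0.0],
              [0.0, 0.25],
              [0.5, 0.0]])

X = T @ P.T              # noise-free 5 x 20 data, purely rank 2
assert np.linalg.matrix_rank(X) == 2
```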

The data set is decomposed using PCA and the different variants of sPCA, and the results are shown in Figure 1. Data is not mean-centered and components are orthogonal. Since all methods perfectly capture the data structure and are equally interpretable, we conclude that they perform equally well when applied to noise-free, orthogonal data, without the need to impose orthogonality.

### 4.2 Non-orthogonal spectra

In the second example, we generate data from a set of sparse components that simulate non-orthogonal spectra with the following generating rules:

• Three components (A = 3) are simulated: the first has non-zero values for variables {1-10}, the second for variables {6-15} and the third for variables {11-20}. For each component, the loadings follow a spectral shape like in the example in the previous section. Note the loadings are intentionally non-negative, overlapping (and therefore non-orthogonal), and sparse.

• The scores for five individuals are generated following the pattern:

 T_{(5 \times 3)} = \begin{bmatrix} 0.5 & 0.5 & 0.5 & 0.5 & 0 \\ 0.25 & 0 & 0.25 & 0 & 0.25 \\ 0 & 0.125 & 0 & 0.125 & 0 \end{bmatrix}^T

Note that the scores for the first component are not orthogonal to the other two, but they are all non-negative.

• The data set X, of dimension 5×20, is generated as X = TP^T. No noise is introduced in the data, which means that the data is purely rank 3.

Figure 2: Non-orthogonal spectra example: true data and matrix factorization methods.

Figure 2 shows the result of the simulation and the approximation of the data by the sparse factorization methods considered, along with PCA. Scores were intentionally computed following the standard approach as \hat{T} = X\hat{P}, with \hat{P} the loadings estimated by the algorithms. PCA shows two limitations and fails to reproduce the structure of the data: first, components are not sparse and are constrained to be orthogonal, which induces distortion in the reconstructed spectra; second, scores must also be orthogonal, and to satisfy this condition some score values have to be negative to compensate for the positive ones. Concerning the sparse models, they show disparate performance. The sequential version of SPCA visually outperforms the original SPCA and most other methods in the approximation. GPCA-PD and GPCA-M are also very accurate, but have difficulties modeling the second and third loading vectors. PMD is also less accurate in determining the loadings. Regarding the scores, the orthogonalized methods distort the simulated (non-orthogonal) scores, as expected, and give the same score patterns as PCA.

### 4.3 Monte Carlo simulation

Based on the previous examples, we designed a Monte Carlo study in which datasets are generated where the components can have a random number of non-zero loadings and the values of the scores and loadings are also randomly selected. With respect to the previous illustrative examples, here the data sets are simulated 1) with more realistic dimensions and structure (50 observations, 200 variables and 5 components), as encountered in practical applications, and 2) without the non-negativity constraint. Under this simulation scheme, the scores and the loadings are likely correlated, but to a lesser extent than in the non-orthogonal spectra of the previous section. The generating rules are:

1. Five components (A = 5) are simulated. For each simulated data set the loading matrix P has size 200×5, for a total of 1000 loadings. An auxiliary matrix of the same size is generated, and for elements of this matrix below 1 the corresponding elements in P are set to 0. This procedure generates, on average, 160 non-zero loadings for every 1000 loadings simulated. The resulting loading vectors have a different degree of sparsity, contain both positive and negative values and are usually correlated because the non-zero elements can overlap.

2. The scores for 50 observations are randomly generated and multiplied by a constant so that their variance follows an exponential decay, similar to the previous examples. The scores can be non-orthogonal and contain both positive and negative values.

3. The data set X, of dimension 50×200, is generated as X = TP^T, and no noise is added, as in the previous examples. This means that the data is purely rank 5.

The simulation is repeated 100 times. All sparse algorithms considered require the setting of one or more metaparameters. We optimistically selected the metaparameters of each method to match, on average, the known level of sparseness of the generated data. In this way, all methods can be compared fairly.

Results from the non-orthogonal spectra and Monte Carlo simulation will be used to support the discussion in the following sections.

## 5 Implications of correlated scores and loadings

As previously discussed, scores and loadings in sPCA can be correlated. To quantify to what extent this happens, we defined and computed the following statistics:

• MACS: Mean absolute correlation coefficient of the scores, measured as the mean of the absolute values of the elements below the diagonal of the matrix of correlation coefficients between score vectors (the cross-products \hat{T}^T\hat{T} divided element-wise, i.e. Hadamard-wise, by the products of the corresponding column norms). This statistic ranges between 0 (absence of correlation) and 1 (total correlation).

• MACL: Mean absolute correlation coefficient of the loadings, measured in the same way from the normalized loadings. This statistic ranges between 0 (absence of correlation) and 1 (total correlation).
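Both statistics can be computed with the same helper (the function is ours), which takes the below-diagonal elements of the matrix of normalized cross-products between columns:

```python
import numpy as np

def mean_abs_corr(M):
    """Mean absolute correlation between the columns of M: the mean
    of the absolute values of the elements below the diagonal of the
    normalized cross-product matrix (MACS for scores, MACL for
    normalized loadings)."""
    Mn = M / np.linalg.norm(M, axis=0)      # unit-norm columns
    C = Mn.T @ Mn
    return np.mean(np.abs(C[np.tril_indices_from(C, k=-1)]))

# Orthogonal columns give 0; proportional columns give 1
assert np.isclose(mean_abs_corr(np.eye(4)[:, :2]), 0.0)
assert np.isclose(mean_abs_corr(np.column_stack((np.ones(3), 2.0 * np.ones(3)))), 1.0)
```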

Results are shown in Table 1 for the non-orthogonal spectra and in Figures 3(a) and 3(b) for the Monte Carlo simulation. The actual correlation of the simulated data is shown as a baseline. It can be seen that all but the orthogonalized versions produce correlated scores, and that all methods produce correlated loadings. Non-orthogonal methods seem to largely overestimate the true correlation in the scores, while orthogonal methods obviously underestimate it (they yield exactly 0 correlation in all cases). The sequential approaches tend to underestimate the correlation in the loadings.

Figure 3: Box plots of MACS (a), MACL (b) and MACS with corrected scores (c) for the Monte Carlo simulation. Note that the y-axes have different scales.

From these results, we can infer that correlation in scores and loadings is commonplace. While this can be beneficial for the interpretation of complex data that can be, in general, non-orthogonal, it also poses several problems and limitations that should be taken into account when analyzing a sparse model. Correlated loadings distort distances in the scatter plots Kiers ; Geladi , while correlated scores and/or loadings may lead to the incorrect estimation of explained and residual variance Mackey2008 . More implications resulting from the correlation of loadings and scores will be described in the next sections and in the follow-up paper.

## 6 Computation of Scores, Residuals and Variance

The non-orthogonality of the loadings has critical implications for the computation of the scores, the residuals and the explained variance. While in standard PCA the scores are calculated as

 \hat{T} = X\hat{P}, \qquad (15)

in the sparse PCA setting the scores should be computed using the correction formula

 \hat{T} = X\hat{P}\left(\hat{P}^T\hat{P}\right)^{+} \qquad (16)

since Eq. (15) cannot be applied when the loadings are, in general, not orthogonal. The superscript '+' indicates the Moore-Penrose inverse. For orthonormal loadings, Eq. (16) reduces to Eq. (15). A noteworthy characteristic of Eq. (16) is that the scores cannot be accurately computed in a sequential, one-component-at-a-time fashion, since the scores of each component depend on those of the other components.
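Eq. (16) is a least-squares computation and is straightforward to implement; a sketch (with stand-in random loadings for illustration):

```python
import numpy as np

def corrected_scores(X, P):
    """Least-squares scores for possibly non-orthogonal loadings P,
    Eq. (16): T = X P (P^T P)^+."""
    return X @ P @ np.linalg.pinv(P.T @ P)

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 6))
P = rng.standard_normal((6, 3))          # stand-in non-orthogonal loadings

T = corrected_scores(X, P)
# The corrected scores never increase the residual sum-of-squares
# relative to the naive computation T = X P:
r_corrected = np.linalg.norm(X - T @ P.T)
r_naive = np.linalg.norm(X - (X @ P) @ P.T)
assert r_corrected <= r_naive + 1e-12

# For orthonormal loadings, Eq. (16) reduces to Eq. (15):
Q = np.linalg.qr(P)[0]
assert np.allclose(corrected_scores(X, Q), X @ Q)
```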

The level of inaccuracy in the computation of the residuals can be assessed using the Monte Carlo simulation. Figures 3(a) and 3(c) compare the correlation observed in the scores computed one-at-a-time (\hat{T} = X\hat{P}) with that observed in the scores corrected using Eq. (16). It can be seen that, for all methods considered, the correlation between scores is affected by the correction and is reduced in most cases. The correction also induces correlation in the orthogonalized methods. The way the correction affects the residuals can be quantified using the following RSS statistic:

• RSS: The residual sum-of-squares normalized by the total sum-of-squares of the data, \|E\|_F^2 / \|X\|_F^2, with E = X - \hat{T}\hat{P}^T. This quantity reflects how well the scores capture the variance in the data, and should be as close to 0 as possible.

Figure 4 compares residuals before and after correction: the correction reduces the sum-of-squares of the residuals. This reduction holds by construction, given that Eq. (16) is a least-squares correction. Note that uncorrected residuals are used within the deflation-based sPCA algorithms to compute new scores, so this computation may be inaccurate.

Figure 5 shows the results of correcting the scores using Eq. (16) in the non-orthogonal spectra example when simultaneous SPCA is used. Comparing these results with those shown in Figure 2, it can be seen that the correction is relevant and greatly improves interpretability. However, for sequential approaches, the correction is negligible.

The correct computation of the scores also affects the estimation of the variance explained by each component. Consider the approximation \hat{X} = \hat{T}\hat{P}^T. Its variance can be computed as \|\hat{X}\|_F^2 = \|\hat{T}\hat{P}^T\|_F^2. For orthonormal loadings, \|\hat{T}\hat{P}^T\|_F^2 = \|\hat{T}\|_F^2. However, for non-orthogonal loadings this last equality does not hold. In general, the captured/explained variance in a component model with one sparse mode should be measured as \|\hat{T}\hat{P}^T\|_F^2. A more detailed discussion can be found in Appendix A.

To support the previous discussion, we defined and calculated several statistics for (1) the scores computed as \hat{T} = X\hat{P} and (2) the scores corrected with Eq. (16). These statistics are based on different ways of calculating the captured variance. In all of them, the captured variance is added to the variance in the residuals and normalized by the total sum-of-squares of the data. Therefore, all statistics are expected to be equal to 1. The statistics considered are defined as:

• TotQR: The fraction of total variance computed as (QSS + \|E\|_F^2)/\|X\|_F^2, where QSS is the sum-of-squares of the diagonal of R from the QR decomposition of \hat{T}.

• TotT: The fraction of total variance computed as (\|\hat{T}\|_F^2 + \|E\|_F^2)/\|X\|_F^2, where the first addend in the numerator is the sum-of-squares of the scores.

• TotPT: The fraction of total variance computed as (\|\hat{T}\hat{P}^T\|_F^2 + \|E\|_F^2)/\|X\|_F^2, where the first addend in the numerator is the sum-of-squares of the reconstructed data.
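The three statistics can be sketched in a single helper (the function is ours); the usage example applies the corrected scores of Eq. (16) to stand-in random loadings:

```python
import numpy as np

def variance_statistics(X, T, P):
    """TotQR, TotT and TotPT: each adds a captured-variance estimate
    to the residual variance and normalizes by ||X||_F^2, so a correct
    computation should return 1."""
    E = X - T @ P.T
    ssx, sse = np.sum(X ** 2), np.sum(E ** 2)
    qss = np.sum(np.diag(np.linalg.qr(T)[1]) ** 2)
    return {"TotQR": (qss + sse) / ssx,
            "TotT": (np.sum(T ** 2) + sse) / ssx,
            "TotPT": (np.sum((T @ P.T) ** 2) + sse) / ssx}

rng = np.random.default_rng(6)
X = rng.standard_normal((12, 7))
P = rng.standard_normal((7, 3))               # stand-in sparse loadings
T = X @ P @ np.linalg.pinv(P.T @ P)           # corrected scores, Eq. (16)

stats = variance_statistics(X, T, P)
assert np.isclose(stats["TotPT"], 1.0)        # holds by construction
```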

Numerical results for the non-orthogonal spectra are presented in Table 2. The corrected computations lead to the expected value TotPT = 1. However, this is not the case for non-corrected scores or for the other approaches to computing the variance. As discussed previously, the QR decomposition is, in general, an incorrect way to compute the variance, since TotQR differs from 1 on most occasions. This also holds for TotT. The amount of variance captured by the orthogonalized methods is the same regardless of the statistic used to measure it.

Figure 6 shows the effect of the score correction in the Monte Carlo simulation. Without the correction, all methods except the orthogonalized versions show variance artifacts in TotPT. This especially affects the simultaneous SPCA. After correction, TotPT is systematically equal to 1, while TotQR and TotT are not, indicating that these statistics are not appropriate for describing the variance captured by the sparse model.

## 7 Conclusion

In this paper we reviewed a number of Sparse Principal Component Analysis (sPCA) algorithms and compared them through simulation. We showed that sPCA approaches present disparate and limited performance when modeling sparse data, even in the optimistic case of noise-free data. In a follow-up paper, we discuss the theoretical properties that lead to this problem. The comparison in this and the follow-up paper can be useful for making a proper choice of sPCA algorithm, depending on the practical goal of the analysis.

In this paper we focused on the inherent properties of sPCA that make it substantially different from PCA; because of this difference, PCA and sparse PCA models must be handled in different ways. In the light of this, we provided corrected equations for quantities such as captured variances, scores and residuals. These corrections are relevant given that scores and residuals are central in chemometrics and related disciplines for data visualization and interpretation. Furthermore, explained variance is useful for comparison among model variants, and also to compare sPCA with other modeling approaches.

## Appendix A Calculation of the explained variance

For a standard PCA model it holds

 $X = TP^T + E = \hat{X} + E$ (17)

from which it follows that

 $\|X\|_F^2 = \|TP^T + E\|_F^2.$ (18)

Using the general equation $\|A\|_F^2 = \mathrm{tr}(A^TA)$ it follows that

 $\|X\|_F^2 = \mathrm{tr}(X^TX) = \mathrm{tr}(PT^TTP^T) + \mathrm{tr}(PT^TE) + \mathrm{tr}(E^TTP^T) + \mathrm{tr}(E^TE)$ (19)

by simply multiplying out the second part of Equation (18). If deflation is performed in the score space then the columns of $T$ are the fixed regressors and a property of the deflation (i.e., regression) is that $T^TE = 0$, and thus

 $\|X\|_F^2 = \mathrm{tr}(X^TX) = \mathrm{tr}(PT^TTP^T) + \mathrm{tr}(E^TE) = \|\hat{X}\|_F^2 + \|E\|_F^2$ (20)

which is a simple split-up of explained variation. If the deflation is performed in the loading space then $EP = 0$ and Equation (20) holds since $\mathrm{tr}(PT^TE) = \mathrm{tr}(T^TEP) = 0$.

Remarks regarding calculating the explained variances:

1. The split-up of variation depends on whether Equation (17) holds and, additionally, either $T^TE = 0$ or $EP = 0$.

2. There is no requirement regarding the orthogonality of $T$ and/or $P$.

3. There is also no requirement of the columns of $T$ being in the column space of $X$, nor the columns of $P$ being in the row space of $X$.

These equations become simpler if there is orthogonality. If $T^TT = I$ then

 $\|X\|_F^2 = \mathrm{tr}(X^TX) = \mathrm{tr}(PT^TTP^T) + \mathrm{tr}(E^TE) = \mathrm{tr}(PP^T) + \mathrm{tr}(E^TE) = \mathrm{tr}(P^TP) + \mathrm{tr}(E^TE).$ (21)

Similarly, if $P^TP = I$ then

 $\|X\|_F^2 = \mathrm{tr}(X^TX) = \mathrm{tr}(PT^TTP^T) + \mathrm{tr}(E^TE) = \mathrm{tr}(P^TPT^TT) + \mathrm{tr}(E^TE) = \mathrm{tr}(T^TT) + \mathrm{tr}(E^TE)$ (22)

which explains why calculating explained variances using only scores or only loadings may sometimes work.
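The split-up in Eq. (20) and the simplification in Eq. (22) can be checked numerically on a standard PCA model. A truncated SVD yields orthonormal loadings ($P^TP = I$) and residuals orthogonal to the scores ($T^TE = 0$), so both identities hold to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))

# Truncated SVD gives a PCA model with orthonormal loadings.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
T = U[:, :k] * s[:k]          # scores
P = Vt[:k].T                  # loadings, P^T P = I
E = X - T @ P.T               # residuals, T^T E = 0

# Eq. (20): ||X||_F^2 = ||X_hat||_F^2 + ||E||_F^2
lhs = np.sum(X**2)
rhs = np.sum((T @ P.T)**2) + np.sum(E**2)

# Eq. (22): with P^T P = I, the model part reduces to tr(T^T T)
rhs_scores = np.trace(T.T @ T) + np.sum(E**2)

print(np.isclose(lhs, rhs), np.isclose(lhs, rhs_scores))
```

For sPCA models neither orthogonality condition is guaranteed, which is precisely why the score-only and QR-based variance computations fail there.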

## Acknowledgement

This work is partly supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds through project TIN2017-83494-R and the “Plan Propio de la Universidad de Granada”.
