1 Introduction
In modern data analysis there is an increased interest in sparse (component) methods to obtain models that are simpler and easier to interpret. One of such methods is the popular sparse Principal Component Analysis (sPCA) Jolliffe2003 ; Zou2006 ; PMD ; GPCA . While the properties of sPCA have already been discussed PMD ; Mackey2008 ; Hastie:2015:SLS:2834535 , there are several aspects that have not been treated in detail and that are relevant for the interpretation of the models.
sPCA is based on Principal Component Analysis (PCA) Jolliffe02 ; Jackson03 , which factorizes a data matrix using the criterion of maximizing variance. The PCA model follows the expression:
(1) 
where is a data matrix, the score matrix, the loading matrix and the matrix of residuals. An interesting property of PCA is that the loadings and the scores are orthogonal and, as a result, they can be computed either using a simultaneous or sequential algorithm.
When moving from PCA to sPCA, there are at least two interrelated consequences that need to be considered: i) the appearance of correlation among scores and/or loadings in different components and ii) the departure of components from the rowspace defined by the original data. The first point makes the traditional way of computing scores, residuals and captured variance used in PCA not applicable in sPCA. Moreover, the captured variance is often used as a criterion for model quality and for selection/comparison of model variants. Therefore, its accurate computation is necessary. The second point affects interpretability, specially for multicomponent models. Both effects are especially relevant in those domains in which the interpretation of the scores/residuals is necessary, like in chemometrics.
Zou et al. Zou2006
proposed a method based on the QR decomposition to compute the variance captured by sPCA when scores are correlated; we will show that this method is generally incorrect. Witten
et al. PMDrelated the correlation of the loadings/scores to the departure from the row/columnspace of the data and proposed an orthogonalized version of their sPCA method, where scores are made orthogonal; we will show that this orthogonalization reduces the captured variance and dwarfs interpretation, but improves the estimation of sparse loadings. Mackey
Mackey2008 discussed different deflation methods in the context of sPCA, some of which compute residuals incorrectly, leading to the doublecounting of variance, i.e., a portion of variance in the data captured by more than one component. He also contributed a new deflation method which, we will show, still requires the corrections that will be proposed and discussed in this paper.In this study we present a comparison among several widely used sPCA algorithms and reflect on how scores, residuals and captured variance should be computed for sparse components, something central in chemometrics. The problem of the departure of components from the rowspace is treated in a followup paper. The rest of the paper is organized as follows. In Section 2, we review a set of sPCA algorithms considered. In Section 3, we discuss the mathematical properties of sPCA, using PCA as a baseline. In Section 4, we propose two simulated examples that mimic spectral data, and a Monte Carlo experiment, that will illustrate interpretation problems and solutions in the following sections. Section 5 discusses the implications of correlated scores and loadings in sPCA. Section 6 presents the correct computation of scores, residuals and captured variance in sPCA models. Section 7 offers some final considerations and conclusions.
2 Sparse PCA algorithms
Most sPCA algorithms modify the classical PCA by including sparseinducing constraints or penalties with the or norms Richt2012 . Jolliffe et al. Jolliffe2003 developed the SCoTLASS algorithm (Simplified Component TechniqueLASSO), which incorporates the least absolute shrinkage and selection operator (lasso) (Tibshirani1994, ), where the norm (absolute values) of the loadings is penalized. The SCoTLASS criterion follows:
(2) 
where is the resulting sparse loading, with superscript referring to SCoTLASS. To obtain successive components, the SCoTLASS optimization constraints the 2nd and further sparse loadings to be orthogonal to the rest. The SCoTLASS criterion is very computational demanding Hastie:2015:SLS:2834535 , and this makes its use unpractical for data exploration. In addition, the number of nonzero elements in is upperbounded by the number of observations in the data, which is a structural limitation of the lasso constraint.
2.1 Sparse PCA by Zou, Hastie and Tibshirani
Zou et al. Zou2006
introduced an alternative formulation to generate sparse components, motivated by the structural limitation of the lasso and the computation cost of SCoTLASS. This new model was also named Sparse PCA (SPCA, we will use a capital ’S’ to differentiate this approach from the general sPCA technique, although at that time the term was already used to designate the group of sparse variants of PCA. SPCA redefines PCA as a regularized regression problem with the ridge penalty, where the responses are the PCA scores. In this formulation any loading vector of PCA,
, equals , where(3) 
with superscript referring to regression, and is the corresponding score vector. Therefore, , and is obtained by normalizing to unit length. This equivalence holds in general for any nonnegative . Equivalently, the loadings of PCA can be obtained from the following regression problem followed by a normalization step:
(4) 
where represents the th column vector in , with the number of components, and we distinguish between weights and loadings . In similar modeling methods, like Partial Least Squares, weights are used for the derivation of scores from the data and loadings for the identification of the captured variance in X, e.g. for deflation. The advantage of Eq. (4) over Eq. (3) is that the former does not require the precomputation of the PCA scores. is proportional to the PCA loadings for any nonnegative . Therefore, we can understand Eq. (4) as a flexible method to compute PCA, where constraints and penalties can be introduced either in the weights or the loadings.
Using this flexible redefinition of PCA the lasso can be implemented using a criterion close to the (naive) elastic net (Zou2005, ), which is a combination of both the lasso and the ridge constraint:
(5) 
with superscript referring to SPCA. In the following, we will refer to as sparse loadings and to as auxiliary loadings to simplify the discussion in connection with other sPCA techniques. Note that the roles of and are performed by a unique set of loadings in PCA and most sPCA variants. The SPCA algorithm is simultaneous, in the sense that all components are identified in the same run. Note the expression in Eq. (5) is not optimal to approximate data, because is constrained to be orthonormal, while at the same time the norms of the vectors in are penalized, leaving no flexibility to approximate and resulting in very large residuals. Besides, the sparse loadings are not normalized. The authors suggest to normalize them after the optimization, in the last step of their algorithm, but they overlook the need to update accordingly. This reflects the fact that authors were only interested in as the output of the algorithm, and the model in Eq. (5) is only a means to arrive to such output.
They also realized that this SPCA algorithm produces highly correlated scores, and concluded that the current method for computing the captured variance as used in PCA is too optimistic. For this reason, they suggested to use the QR decomposition of , which is for orthogonal and upper triangular, and to compute the captured variance of the model as the sum of squares of the elements in the diagonal of .
An alternative SPCA algorithm has been published as part of the SPASM Toolbox Sjostrand2012 , which is a sequential variant of SPCA, in which the loadings are computed one at a time. This algorithm is faster than the simultaneous versions, with a small reduction of captured variance.
2.2 The Penalized Matrix Decomposition by Witten, Tibshirani and Hastie
Witten et al. PMD introduced a new iterative algorithm based on alternating of cardinalityconstrained rankone variance maximization and matrix deflation. For the first task, they proposed the Penalized Matrix Decomposition (PMD), a biconvex problem solved by alternating between the identification of
(left pseudoeigenvector) and
(right pseudoeigenvector). PMD equals the 1rank Singular Value Decomposition (SVD) when no constrains are considered. The sPCA formulation in the PMD framework is:
(6) 
where is obtained using a softthresholding operator, with superscript referring to PMD. The corresponding pseudosingular value is obtained as:
(7) 
After each component is obtained, deflation is performed as:
(8) 
This solution is related to SCoTLASS and sPCA PMD , but according to Hastie:2015:SLS:2834535 this method is more efficient, something we have also seen experimentally. Regarding the deflation, they state that while projection deflation works well for true eigenvectors, when using constrained versions of PCA (i.e. sPCA) and/or are not in the column and row spaces of . As an alternative to projection deflation, they propose to compute using an orthogonalized factor to all the previous left pseudoeigenvectors. By incorporating this deflation, orthogonal scores are obtained. However, the authors themselves claim that it is not clear whether orthogonality is a desirable property PMD ; Hastie:2015:SLS:2834535 . We will call this the orthogonalized deflation approach.
2.3 Groupwise Principal Component Analysis
The Groupwise Principal Component Analysis (GPCA) GPCA
is a sparse approach to PCA proposed by some of the authors. The main difference between GPCA and the sPCA variants presented in the previous sections is that sparsity is not achieved by using a loss function with the LASSO constraint but it is defined in terms of groups of correlated variables and it is imposed using a set of nested PCA loops. Every component contains nonzero loadings for a single group of correlated variables, and this component is calibrated using only this subset of variables.
GPCA starts with the identification of a set of (possibly overlapping) groups of correlated variables obtained from a map , with elements containing the strength of the relationship between variables and . An example of this map is the correlation matrix of . In the original formulation of GPCA, the MEDA approach (Missingdata for Exploratory Data analysis) Camacho2011missing was implemented to define , which uses a missing data strategy to estimate the correlation between any two variables, an approach which has been found to be effective in filtering out noise when estimating correlations Camacho2011missing .
Once the groups have been defined, the GPCA algorithm computes candidate loading vectors, where the th vector has non zero elements associated to the variables in the th group. From these, only the loading with the largest variance is retained and it is used to deflate data matrix . The algorithm iterates until a set of pseudocomponents is extracted.
In GPCA, the deflation by Mackey Mackey2008 is used who showed that the common approach of singlecomponent deflation in PCA is in general not valid for sPCA. This is the projection deflation, equivalent to Eq. (8):
(9) 
where denote the residuals after PCs have been extracted. After normalizing to unit length, Eq. (9) simplifies to:
(10) 
Setting to be the covariance matrix of a data sample after PCs have been extracted, the projection deflation can be written as
(11) 
For a pseudoeigenvector of , Mackey Mackey2008
shows that projection deflation annihilates the corresponding pseudoeigenvalue while maintaining the others in the data. However, when deflating with respect to a series of nonorthogonal pseudoeigenvectors, projection deflation may lead to reintroduce previously deflated components and to the ”double counting” of variance, which can cause misleading results for interpretation. Mackey also shows that this problem can be solved by annihilating the variance captured by the pseudoeigenvector that is orthogonal to the space spanned by the previous extracted pseudoeigenvectors. For that, he introduces a combination of projection deflation and orthogonalized deflation, named orthogonalized projection deflation. This, similarly to
Zou2006 , adds to the sparse loadings a set of auxiliary orthogonal right pseudoeigenvectors to perform the deflation.One way to do this deflation follows a sequential GramSchmidt decomposition:
(12) 
(13) 
(14) 
where
is initialized to the identity matrix. We will call this approach the Mackey’s generalized deflation.
3 From principal components to sparse principal components
When moving from a classical principal component model to any of it sparse variants there are a number of implications that we think is pivotal to assess, especially in the context of data interpretation:

Correlation of scores
The scores resulting from a standard PCA are uncorrelated as a consequence of maximizing variance in each component, while the scores of sPCA model are typically correlated. This correlation complicates proper visualization but also allows for a more flexible modeling. As already discussed, Witten et al. PMD provide a solution to correct for this correlation, but it is not clear whether it should be used. More importantly, care should be taken when computing the captured variance with correlated scores. 
Correlation of loadings
Correlation of loadings is a relevant problem, in particular when scatter plots of scores are used for interpretation. The visualization in scatter plots assumes orthogonal axes in the original variable space. This holds for the standard PCA components, but it does not necessarily hold for sparse components. Again, care should be taken when computing the captured variance with correlated loadings. 
Loss of captured variance per principal component.
The PCs are optimal in terms of captured variance. This is a property that makes them especially useful for data interpretation, since a large portion of variance/information is explained by a reduced set of components. This in turn simplifies the visualization of the data, since a minimum number of PCs needs to be visualized to account for a certain level of variance. However, part of the captured variance might not be of interest and could even complicate interpretation. As a relevant example, all PCs include a certain amount of embedded noise malinowski2002factor . When we move to the sparse context, we simplify the interpretation of loadings, but also need more components to capture the same amount of variance. 
Moving outside the rowspace
Sparse loadings can be outside the data rowspace as a result of constraints/penalties applied. As a matter of fact, there is no sparse solution inside the rowspace in the presence of noise. If the constraints/penalties are reasonable, we can improve the model quality with this departure. A notable example of this are the nonnegative constraints, useful to model nonnegative data. While the departure of a sparse component can be advantageous, it also complicates the generation of multicomponent models.
We will discuss these implications in detail and quantify them using simulation in the next Sections, except for the departure from the rowspace, which will be treated in the followup paper.
4 Simulated examples and Monte Carlo study
We will use a number of examples with simulated data to support our discussion, where we will opt not to introduce noise so that we can easily illustrate the behavior of sPCA algorithms. First, we will simulate two different examples for visual illustration, representing two common situations in exploratory data analysis that are of interest for this study: when orthogonality is consistent with the underlying data generation process and when it is not. The motivation is that a subset of sPCA methods and PCA impose orthogonality in some or all the factors, and we want to assess to which extent this is a useful constraint. As a second step we will generalize the results through a Monte Carlo simulation. We will consider several variants and implementations of sPCA, some of which are evaluated and compared in this paper for the first time:

The sPCA algorithm by Zou et al. Zou2006 (SPCA) and the sequential (iterative) implementation in the SPASM toolbox Sjostrand2012 (SPCASq).

The PMD algorithm by PMD with projection (PMDPD) and orthogonalized deflation (PMDO) discussed in section 2.2, and a modified version using Mackey’s Mackey2008 generalized deflation (PMDM) discussed in section 2.3.

The GPCA algorithm by GPCA using projection deflation (GPCAPD), Mackey ’s Mackey2008 generalized deflation (GPCAM) and orthogonalized deflation by PMD (GPCAO).
4.1 Orthogonal spectra
The first example simulates spectra from a mixture of two compounds that result in two orthogonal components. The simulation has the following generating rules:

The first component has nonzero values for variables {110} and the second for variables {1120}. For each component, loadings follow a shape of spectra: {0.1 0.3 0.5 0.7 0.9 0.9 0.7 0.5 0.3 0.1}. Note loadings are intentionally nonnegative, orthogonal and sparse (they contain 0’s).

The scores for five individuals are generated following the pattern:
Note that the scores are intentionally nonnegative and orthogonal.

The data set is generated with dimension following . There is no noise in the data, which means that the data is purely rank 2 and the true profiles are the ones shown above. The low dimension in observations and variables and the absence of noise are not realistic, but are useful to illustrate the performance of the models in a simple scenario.
The data set is decomposed using PCA and the different variants of sPCA, and results are shown in Figure 1. Data is not meancentered and components are orthogonal. Since all methods perfectly capture the data structure, and are equally interpretable, we shall conclude that they have equal performance when applied to noisefree, orthogonal data, without the requirement to impose orthogonality.
4.2 Nonorthogonal spectra
In the second example, we generate data from a set of sparse components that simulate nonorthogonal spectra with the following generating rules:

Three components () are simulated: the first has nonzero values for variables {110}, the second for variables {615} and the third for {1120}. For each component, the loadings follow a spectral shape like in the example in the previous section. Note loadings are intentionally nonnegative and overlapping, and therefore are nonorthogonal, and sparse.

The scores for five individuals are generated following the pattern:
Note that the scores for the first component are not orthogonal to the other two, but they are all nonnegative.

The data set is generated with dimension following . No noise is introduced in the data, which means that the data is purely rank 3.
Figure 2 shows the result of the simulation and the approximation of data by the sparse factorization methods considered along with PCA. Scores were intentionally computed following the standard approach as , with the loadings estimated in the algorithms. PCA shows two limitations and fails to reproduce the structure of the data: first, components are not sparse and are constrained to be orthogonal and this induces distortion in the reconstructed spectra; second, scores must also be orthogonal, and to satisfy this condition some score values have to be negative to compensate for the positive ones. Concerning sparse models, they show a disparate performance. The sequential version of SPCA visually outperforms the original SPCA and most other methods in the approximation. GPCAPD and GPCAM are also very accurate, but have difficulties to model the second and third loading vector. PMD is also less accurate in determining the loadings. Regarding the scores, orthogonalized methods distort the simulated (nonorthogonal) scores, as expected, and give the same score patterns as PCA.
4.3 Monte Carlo simulation
Based on the previous examples, we designed a Monte Carlo study in which datasets are generated where the components can have a random number of nonzero loadings and the values of the scores and loadings are also randomly selected. With respect to the previous illustrative examples, here the data sets are simulated 1) with more realistic dimensions and structure ( and 5 components) as encountered in practical applications, and 2) are not constrained to be nonnegative. Under this simulation scheme, the scores and the loadings are likely correlated, but at a lesser extent than in the case of nonorthogonal spectra in previous section. The generating rules are:

Five components () are simulated. For each simulated data the loading matrix has size , for a total of 1000 loadings. An auxiliary matrix is generated, so that for elements of below 1 the corresponding elements in are set to 0 (). This procedure generates, on average, 160 nonzero loadings for every 1000 loadings simulated. The resulting loading vectors have a different degree of sparsity, contain both positive and negative values and are usually correlated because the non zero elements can overlap.

The scores for 50 observations are randomly generated from and are multiplied by a constant so that their variance follows an exponential decay, similar to previous examples. The scores can be nonorthogonal and contain either positive or negative values.

The data set is generated with dimension but no noise is added, like in the previous examples. This means that the data is purely rank 5.
The simulation is repeated 100 times. All sparse algorithms considered require the setting of one or more metaparameters. We optimistically selected the metaparameters for each method to meet on average the known level of sparseness of the data generated. In this way, all methods can be fairly compared.
Results from the nonorthogonal spectra and Monte Carlo simulation will be used to support the discussion in the following sections.
5 Correlation of Scores and Loadings
As previously discussed, scores and loadings in sPCA can be correlated. To quantify to which extent this happens, we defined and computed the following statistics

MACS: Mean absolute correlation coefficient of the scores, measured as the mean of the absolute value of the elements below the diagonal of , where is the Hadamard (elementwise) division. This parameter ranges between 0 (absence of correlation) and 1 (total correlation).

MACL: Mean absolute correlation coefficient of the loadings, measured as the mean of the absolute value of the elements below the diagonal of , for normalized loadings. This parameter ranges between 0 (absence of correlation) and 1 (total correlation).
Results are shown in Table 1 for the nonorthogonal spectra and in Figures 3(a) and 3(b) for the Monte Carlo simulation. The actual correlation of the simulated data is shown as a baseline. It can be seen that all but the orthogonalized versions provide correlated scores, and that all methods produce correlated loadings. Nonorthogonal methods seem to largely overestimate the true correlation in the scores, while orthogonal methods obviously underestimate it (they yield exactly 0 correlation in all cases). The sequential approaches tend to underestimate the correlation in loadings.
Simulated  sPCA  sPCASq  PMDPD  PMDO  PMDM  GPCAPD  GPCAM  GPCAO  
MACS  0.43  0.84  0.73  0.66  0  0.58  0.52  0.52  0 
MACL  0.17  0.21  0.05  0.09  0.05  0.07  0.03  0.03  0.03 
From these results, we can infer that correlation in scores and loadings is commonplace. While this can be beneficial for the interpretation of complex data that can be, in general, nonorthogonal, it also poses several problems and limitations that should be taken into account when analyzing a sparse model. Correlated loadings distort distances in the scatter plots Kiers ; Geladi , while correlated scores and/or loadings may lead to the incorrect estimation of explained and residual variance Mackey2008 . More implications resulting from the correlation of loadings and scores will be described in the next sections and in the followup paper.
6 Computation of Scores, Residuals and Variance
The nonorthogonality of the loadings has critical implications for the computation of the scores, the residuals and explained variance. While in standard PCA the scores are calculated as
(15) 
in the sparse PCA setting the loadings should be computed using the correction formula
(16) 
since Eq. (15) cannot be applied because loadings are, in general, not orthogonal. The superscript ’+’ indicates the MoorePenrose inverse. For orthonormal loadings, Eq. (16) reduces to Eq. (15). A noteworthy characteristics of Eq. (16) is that the scores cannot be accurately computed in a sequential, onecomponentatatime fashion, since the scores depend on the scores of the other components.
The level of inaccuracy in the computation of the residual can be assessed using a Monte Carlo simulation. Figures 3(a) and 3(c) show a comparison of the correlation observed in the scores computed oneatatime () and the correlation observed in the scores corrected using Eq. (16). It can be seen that for all methods considered, the correlation between scores is affected by the correction and is reduced in most cases. The correction also induces correlation in orthogonalized methods.The way correction affects the residuals can be quantified using the following RSS statistic:

RSS: The residual sumofsquares normalized by the total sumofsquares of the data, /, with . This quantity reflects the quality of the scores capturing the variance in the data, and should be as close to 0 as possible.
Figure 4 compares residuals before and after correction: the correction reduces the sumofsquares of the residuals. This reduction comes by definition, given that Eq. (16) is a least squares correction. Note that uncorrected residuals are used within the sPCA deflationbased algorithms to compute new scores, so this computation may be inaccurate.
Figure 5 shows the results of correcting the scores using Eq. (16) in the nonorthogonal spectra example when simultaneous SPCA is used. Comparing these results with those shown in Figure 2, it can be seen that the correction is relevant and greatly improves interpretability. However, for sequential approaches, the correction is negligible.
The correct computation of scores also affects the estimation of the variance explained by each component. Consider a full rank approximation . The variance of can be computed as: . For orthogonal loadings, . However, for nonorthogonal loadings the last equality does not hold. In general, the captured/explained variance in a component model with one sparse mode should be measured as . A more detailed discussion can be found in the A.
To support the previous discussion, we defined and calculated several statistics for (1) the scores computed as and (2) for the scores corrected with Eq. (16). These statistics are based on different ways of calculating the captured variance. In all of them, the captured variance is added to the variance in the residuals and normalized by the total sumofsquares of the data. Therefore, all statistics are expected to be equal to 1. The considered statistics are defined as:

TotQR: The fraction of total variance computed as: (QSS+)/, where QSS is the sumofsquares of the diagonal of from the QR decomposition of .

TotT: The fraction of total variance computed as: (+)/, where the first addend in the numerator is the sumofsquares of the scores.

TotPT: The fraction of total variance computed as: (+)/, where the first addend in the numerator is the sumofsquares of the reconstructed data.
Numerical results for the nonorthogonal spectra are presented in Table 2. The corrected computations lead to the expected value TotPT = 1. However, this is not the case for noncorrected scores and for other approaches to compute the variance. As discussed previously, the QR decomposition is, in general, an incorrect way to compute variance, since TotQR is different from 1 in most occasions. This also holds for TotT. The amount of variance captured by the orthogonalized methods is the same independently of the statistic used to measured it.
sPCA  sPCASq  PMDPD  PMDO  PMDM  GPCAPD  GPCAM  GPCAO  
TotQR  1.2730  0.9616  0.9106  1.0000  0.9223  0.9345  0.9345  1.0000 
TotT  1.7747  1.0722  1.0000  1.0000  1.0008  1.0000  1.0000  1.0000 
TotPT  2.5494  1.1444  1.0689  1.0000  1.0457  1.0435  1.0435  1.0000 
TotQR*  0.8241  0.9097  0.8541  1.0000  0.8857  0.8947  0.8947  1.0000 
TotT*  0.8606  0.9603  0.9362  1.0000  0.9579  0.9584  0.9584  1.0000 
TotPT*  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000 
Figure 6 shows the effect of the correction in the scores in the Monte Carlo simulation. Without the correction, all methods, except for orthogonalized versions, show variance artifacts in TotPT. This especially affects the simultaneous SPCA. After correction, TotPT is systematically equal to 1, while TotQR and TotT are not, indicating that these statistics are not appropriate to describe the variance captured by the sparse model.


7 Conclusion
In this paper we reviewed a number of Sparse Principal Component Analysis (sPCA) algorithms and compared them through simulation. We showed that sPCA approaches present disparate and limited performance when modeling sparse data, even in the optimistic case of noisefree data. In a followup paper, we discuss the theoretical properties that lead to this problem. The comparative in this and the follow up paper can be useful to make a proper choice of the sPCA algorithm, depending on the practical goal of the analysis.
In this paper we focused on the inherent properties os SPCA, that make it substantially different to PCA and, for this difference, PCA and sparse PCA models must be handled in different ways. In the light of this, we provided corrected equations for quantities such as captured variances, scores and residuals. These corrections are relevant, taking into account that scores and residuals are central in chemometrics and related disciplines for data visualization and interpretation. Furthermore, explained variance is useful for comparison among model variants, and also to compare sPCA with other modeling approaches.
Appendix A Calculation of the explained variance
For a standard PCA model it holds
(17) 
from which it follows that
(18) 
Using the general equation it follows that
(19) 
by simply multiplying out the second part of Equation (18). If deflation is performed in the score space then are the fixed regressors and a property of the deflation (i.e. regression) is that and thus
(20) 
which is a simple splitup of explained variation. If the deflation is performed in the loading space then and Equation (20) holds since .
Remarks regarding calculating the explained variances:

The splitup of variation depends on whether Equation (17) holds and, additionally, either or .

There is no requirement regarding the orthogonality of and/or .

There is also no requirement of the columns of being in the column space of nor the columns of in the row space of .
These equations become simpler if there is orthogonality. If then
(21) 
Similarly, if then
(22) 
which explains why sometimes calculating explained variances using only scores or loadings may work.
Acknowledgement
This work is partly supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds through project TIN201783494R and the ”Plan Propio de la Universidad de Granada”.
References
 (1) Jolliffe I.T., Trendafilov N.T., Uddin M. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics. 2003.
 (2) Zou Hui, Hastie Trevor, Tibshirani Robert. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics. 2006;15:265–286.
 (3) Witten Daniela M., Tibshirani Robert, Hastie Trevor. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis Biostatistics. 2009;10:515–534.
 (4) Camacho José, RodríguezGómez Rafael A., Saccenti Edoardo. Groupwise Principal Component Analysis for Exploratory Data Analysis Journal of Computational and Graphical Statistics. 2017;26:501512.
 (5) Mackey Lester. Deflation Methods for Sparse PCA. Nips. 2008:1–8.
 (6) Hastie Trevor, Tibshirani Robert, Wainwright Martin. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC 2015.
 (7) Jolliffe I.T.. Principal component analysis. EEUU: Springer Verlag Inc. 2002.
 (8) Jackson J.E.. A User’s Guide to Principal Components. England: WileyInterscience 2003.
 (9) Richt Peter, Tak Martin. Alternating Maximization : Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes 2012 IEEE International Conference on Big Data. 2012:1–20.
 (10) Tibshirani Robert. Regression Selection and Shrinkage via the Lasso Journal of the Royal Statistical Society, Series B. 1994;58:267–288.
 (11) Zou H, Hastie T. Regularization and variable selection via the elasticnet Journal of the Royal Statistical Society. 2005;67:301–320.
 (12) Sjöstrand Karl, Clemmensen Line, Larsen Rasmus, Einarsson Gudmundur, Ersbøll Bjarne. SpaSM: A MATLAB Toolbox for Sparse Statistical Modeling Journal of Statistical Software, Articles. 2018;84:1–37.
 (13) Camacho J.. Missingdata theory in the context of exploratory data analysis Chemometrics and Intelligent Laboratory Systems. 2010;103:8–18.
 (14) Malinowski E.R.. Factor Analysis in Chemistry. Wiley 2002.
 (15) Kiers Henk A. L.. Some procedures for displaying results from threeway methods Journal of Chemometrics. 2000;14:151170.
 (16) Geladi Paul, Manley Marena, Lestander Torbjörn. Scatter plotting in multivariate data analysis Journal of Chemometrics. 2003;17:503511.