1 Introduction
Multivariate Analysis (MVA) techniques have been widely used during the last century, ever since Principal Component Analysis (PCA) [1] was proposed as a simple and efficient way to reduce data dimensionality by projecting the data onto the directions of maximum variance. Since then, many variants have emerged that incorporate supervised information into this dimensionality reduction process; this is the case of algorithms such as Canonical Correlation Analysis (CCA) [2], Partial Least Squares (PLS) approaches [3, 4], or Orthonormalized PLS (OPLS) [5]. In fact, these methods have been successfully applied in many real applications: in biomedical engineering [6, 7], remote sensing [8, 9], or chemometrics [10], among many others.

Some recent significant contributions in the field have focused on gaining interpretability by including sparsity-inducing norms (such as the $\ell_1$ norm), or even group-lasso penalties, in the MVA formulations. This is the case of extensions such as sparse PCA [11], sparse OPLS [6], group-lasso penalized OPLS (or SRRR) [12], and regularized CCA (or L21SDA) [13]
. All these approaches are based on an iterative process which combines two optimization problems. The first step consists of a regularized least-squares problem to obtain the vectors for the extraction of input features; the second step involves a constrained minimization problem which is typically solved as an orthogonal Procrustes problem.
Although these regularized approaches have been recurrently applied in feature extraction and dimensionality reduction scenarios [6, 13], all of them ignore an intrinsic and important property of most MVA approaches: the uncorrelation of the extracted features in the new subspace. When this property holds, the feature extraction process provides additional advantages: (1) Subsequent learning tasks (working on the new space) are simplified; for instance, least-squares problems (ridge regression, LASSO, …) can work independently on each dimension, and the effects of variations of the input data are isolated in the different directions. (2) The selection of optimal feature subsets becomes straightforward, since once a set of features has been computed, an optimal reduced subset is obtained by simply selecting the features with the largest associated eigenvalues. Consequently, the adjustment of the optimal number of extracted features is simplified.
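Advantage (2) can be illustrated with a small numerical sketch (ours, not from the paper): after plain PCA, the covariance of the extracted features is diagonal, so the best $k$-feature subset is simply the first $k$ columns, with no re-fitting needed.

```python
# Sketch: PCA features are uncorrelated, so feature-subset selection
# reduces to keeping the largest-eigenvalue columns.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))  # correlated inputs
Xc = X - X.mean(axis=0)

# Principal directions: eigenvectors of the (scaled) covariance matrix.
evals, W = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals)[::-1]
W, evals = W[:, order], evals[order]

Z = Xc @ W                       # extracted features
C = Z.T @ Z                      # feature covariance (scaling neglected)

# Off-diagonal entries are numerically negligible -> uncorrelated features.
off = C - np.diag(np.diag(C))
print(np.max(np.abs(off)) < 1e-6)        # True

# The optimal 3-feature subset is simply the 3 largest-eigenvalue columns.
Z3 = Z[:, :3]
```

Because the features are uncorrelated, truncating to the first columns loses exactly the variance of the discarded directions and nothing else.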
In this paper, we analyze in detail the above-mentioned MVA formulations, showing from both a theoretical and an experimental point of view some drawbacks overlooked until now in the literature. Concretely, we will demonstrate that these MVA approaches (1) do not obtain uncorrelated features in general; (2) do not converge to their associated non-regularized MVA solutions; and (3) suffer from issues that depend on the algorithm initialization, e.g., for certain initializations the methods can fail to progress at all.
As a solution to these problems, this paper proposes an alternative to orthogonal Procrustes. To do so, we rely on a common framework that allows us to deal simultaneously with the most common MVA methods (PCA, CCA, and OPLS), and we extend it to favor interpretable solutions by including a regularization term. Similarly to existing methods, the proposed solution to this generalized formulation is based on an iterative process, but it does not suffer from the above problems.
The paper is organized as follows: Firstly, Section 2 introduces the generalized MVA framework. Then, Section 3 presents the iterative process required to solve its regularized extension and describes both the Procrustes solution and our proposal based on a standard eigenvalue problem. Section 4 explains the limitations of the Procrustes solution in greater detail and provides theoretical proofs of the most important problems of this approach. Section 5 illustrates and compares the suitability of the proposed solution with that of Procrustes on some real problems that support the theoretical findings. Finally, Section 6 concludes the paper.
2 Framework for MVA with uncorrelated features
This section reviews some well-known MVA methods under a unifying framework, so that subsequent sections can deal with these methods in a unified manner. Before that, the notation used throughout the paper is presented.
Let us assume a supervised learning scenario where the goal is to learn relevant features from a data set of $n$ training pairs $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ and $\mathbf{y}_i \in \mathbb{R}^{m}$ are the input and output vectors, respectively; therefore, $d$ and $m$ denote the dimensions of the input and output spaces. For notational convenience, we define the input and output data matrices $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y} \in \mathbb{R}^{n \times m}$, whose rows contain the training samples, so that sample estimates of the input and output covariance matrices, as well as of their cross-covariance matrix, can be calculated as $\mathbf{C}_{xx} = \mathbf{X}^\top \mathbf{X}$, $\mathbf{C}_{yy} = \mathbf{Y}^\top \mathbf{Y}$, and $\mathbf{C}_{xy} = \mathbf{X}^\top \mathbf{Y}$, where we have neglected the scaling factor $1/n$, and superscript $\top$ denotes vector or matrix transposition. The goal of linear MVA methods is to find relevant features by linearly combining the original variables, i.e., $\mathbf{X}' = \mathbf{X}\mathbf{W}$, where the $k$th column of $\mathbf{W} \in \mathbb{R}^{d \times p}$ is a vector containing the coefficients associated to the $k$th extracted feature. The results in this paper apply, at least, to PCA, CCA, and OPLS; all these methods have in common that the extracted features are uncorrelated, i.e., $\mathbf{W}^\top \mathbf{C}_{xx} \mathbf{W}$ is a diagonal matrix. MVA methods that do not enforce this uncorrelation, most notably PLS, are therefore left outside the scope of this paper.
A common framework for many regularized MVA methods can be found in [14]. According to it, these methods pursue the minimization of the following objective function:

$$ L(\mathbf{W}, \mathbf{V}) = \| \mathbf{Y}\mathbf{\Omega} - \mathbf{X}\mathbf{W}\mathbf{V}^\top \|_F^2 + \lambda R(\mathbf{W}) \qquad (1) $$

where $R(\mathbf{W})$ is a regularization term, such as the ridge regularization ($R(\mathbf{W}) = \|\mathbf{W}\|_F^2$), the $\ell_1$ norm, or the $\ell_{2,1}$ penalty for variable selection. Parameter $\lambda$ trades off the importance of the regularization term in (1), $\mathbf{V}$ can be considered a matrix for the extraction of output features, and different particularizations of matrix $\mathbf{\Omega}$ give rise to the considered MVA methods, in particular $\mathbf{\Omega} = \mathbf{C}_{yy}^{-1/2}$ for CCA, $\mathbf{\Omega} = \mathbf{I}$ for OPLS, and $\mathbf{\Omega} = \mathbf{I}$ with $\mathbf{Y} = \mathbf{X}$ for PCA [14, 15].
In order to extract uncorrelated features, the loss function (1) is formally minimized subject to $\mathbf{W}^\top \mathbf{C}_{xx} \mathbf{W} = \mathbf{I}$. However, it is proved in [15] that the same solution is obtained by constraining the minimization to $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$. For the case in which the derivative of $R(\mathbf{W})$ can be obtained, it is possible to derive a closed-form solution for $\mathbf{W}$ as a function of $\mathbf{V}$. Introducing this solution back into (1), the problem can be rewritten in terms of $\mathbf{V}$ only. For instance, when $R(\mathbf{W}) = \|\mathbf{W}\|_F^2$, the solution for $\mathbf{W}$ can be found by taking derivatives of (1) with respect to $\mathbf{W}$. After setting the result equal to zero, we obtain

$$ \mathbf{W} = (\mathbf{C}_{xx} + \lambda \mathbf{I})^{-1} \mathbf{C}_{x\tilde{y}} \mathbf{V} \qquad (2) $$

where $\mathbf{C}_{x\tilde{y}} = \mathbf{C}_{xy} \mathbf{\Omega}$. Now, replacing (2) into (1), the loss function can be written as a function of $\mathbf{V}$ only.
Minimizing this functional with respect to $\mathbf{V}$, subject to $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$, the solution is given in terms of the following generalized eigenvalue problem,

$$ \mathbf{C}_{x\tilde{y}} \mathbf{C}_{\tilde{y}x} \mathbf{W} = (\mathbf{C}_{xx} + \lambda \mathbf{I}) \mathbf{W} \mathbf{\Lambda}, $$

which can be rewritten as a standard eigenvalue problem:

$$ \mathbf{K} \mathbf{V} = \mathbf{V} \mathbf{\Lambda} \qquad (3) $$

where we have defined $\mathbf{K} = \mathbf{C}_{\tilde{y}x} (\mathbf{C}_{xx} + \lambda \mathbf{I})^{-1} \mathbf{C}_{x\tilde{y}}$, with $\mathbf{C}_{\tilde{y}x} = \mathbf{C}_{x\tilde{y}}^\top$. Thus, $\mathbf{W}$ can also be obtained as (see (2))

$$ \mathbf{W} = (\mathbf{C}_{xx} + \lambda \mathbf{I})^{-1} \mathbf{C}_{x\tilde{y}} \mathbf{V}. \qquad (4) $$
Table 1 provides the above expressions particularized for the CCA, OPLS, and PCA methods. For each method, we show the corresponding eigenvalue problem that defines the solution for $\mathbf{V}$, the associated $\mathbf{C}_{x\tilde{y}}$, and the solution for $\mathbf{W}$ provided by (4).
Method | Eigenvalue problem (3) | $\mathbf{C}_{x\tilde{y}}$ | Solution for $\mathbf{W}$ (4)
CCA | $\mathbf{C}_{yy}^{-1/2}\mathbf{C}_{yx}(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1/2}\,\mathbf{V}=\mathbf{V}\mathbf{\Lambda}$ | $\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1/2}$ | $\mathbf{W}=(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xy}\mathbf{C}_{yy}^{-1/2}\mathbf{V}$
OPLS | $\mathbf{C}_{yx}(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xy}\,\mathbf{V}=\mathbf{V}\mathbf{\Lambda}$ | $\mathbf{C}_{xy}$ | $\mathbf{W}=(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xy}\mathbf{V}$
PCA | $\mathbf{C}_{xx}(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xx}\,\mathbf{V}=\mathbf{V}\mathbf{\Lambda}$ | $\mathbf{C}_{xx}$ | $\mathbf{W}=(\mathbf{C}_{xx}+\lambda\mathbf{I})^{-1}\mathbf{C}_{xx}\mathbf{V}$
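As an illustration of the closed-form solution above, the following sketch (ours, not the authors' code) implements the ridge-regularized OPLS particularization ($\mathbf{\Omega} = \mathbf{I}$); the function name `ridge_opls` is an assumption, and the neglected $1/n$ scaling follows the convention of this section.

```python
# Hedged sketch: closed-form ridge-regularized OPLS via the standard
# eigenvalue problem (3) and the closed-form W of (4).
import numpy as np

def ridge_opls(X, Y, n_feats, lam=1e-6):
    d = X.shape[1]
    Minv = np.linalg.inv(X.T @ X + lam * np.eye(d))   # (Cxx + lam I)^{-1}
    Cxy = X.T @ Y                                     # cross-covariance
    K = Cxy.T @ Minv @ Cxy                            # symmetric m x m, cf. (3)
    evals, V = np.linalg.eigh(K)
    idx = np.argsort(evals)[::-1][:n_feats]           # keep dominant eigenvectors
    V = V[:, idx]
    W = Minv @ Cxy @ V                                # cf. (4)
    return W, V, evals[idx]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Y = X @ rng.standard_normal((10, 4)) + 0.1 * rng.standard_normal((200, 4))
W, V, lams = ridge_opls(X, Y, n_feats=2)

# For lam -> 0 the extracted features are uncorrelated (cf. Section 2.1):
C = (X @ W).T @ (X @ W)
off = np.abs(C - np.diag(np.diag(C))).max()
print(off < 1e-4 * np.abs(C).max())   # True
```

For CCA one would replace `Y` by `Y @ Cyy_inv_sqrt`, and for PCA by `X`, following Table 1.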
2.1 Uncorrelation of the extracted features
It is important to remark that, in the absence of regularization, the above approach still produces uncorrelated features, even though this is not enforced explicitly. To prove this, we set $\lambda = 0$ and multiply both sides of (4) from the left by $\mathbf{W}^\top \mathbf{C}_{xx}$, arriving at:

$$ \mathbf{W}^\top \mathbf{C}_{xx} \mathbf{W} = \mathbf{W}^\top \mathbf{C}_{x\tilde{y}} \mathbf{V} \qquad (5) $$

Next, substituting (4) into (3), and premultiplying both sides by $\mathbf{V}^\top$, we obtain

$$ \mathbf{V}^\top \mathbf{C}_{\tilde{y}x} \mathbf{W} = \mathbf{\Lambda} \qquad (6) $$

Therefore, by jointly considering (5) and (6) we have

$$ \mathbf{W}^\top \mathbf{C}_{xx} \mathbf{W} = \mathbf{\Lambda} \qquad (7) $$

which proves the uncorrelation of the extracted features, since $\mathbf{\Lambda}$ is diagonal.
3 Iterative solutions for regularized MVA methods
In the case of non-differentiable regularizations, the minimization of (1) subject to $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$ has no closed-form solution. This problem is found, for instance, when using the LASSO regularization, or the very useful $\ell_{2,1}$ norm, which performs variable selection. In order to solve these regularized MVA methods, many authors have resorted in the literature to the following iterative coupled procedure:

W-step. For fixed $\mathbf{V}$, find the matrix $\mathbf{W}$ that minimizes (1); this is a regularized least-squares problem.

V-step. For fixed $\mathbf{W}$, find the matrix $\mathbf{V}$ that minimizes (1) subject to $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$ or, rewriting this step in terms of $\mathbf{M}$, solve

$$ \min_{\mathbf{V}} \; \|\mathbf{M} - \mathbf{V}\|_F^2 \quad \text{s.t.} \quad \mathbf{V}^\top \mathbf{V} = \mathbf{I} \qquad (9) $$

where we have defined $\mathbf{M} = \mathbf{C}_{\tilde{y}x} \mathbf{W}$.
The V-step above is typically solved in the literature by using the orthogonal Procrustes approach. As we will see later, this solution neglects the uncorrelation among the extracted features; despite this, since it was initially proposed in [11] for the sparse PCA algorithm, it has been wrongly extended to supervised approaches such as sparse OPLS [6], group-lasso penalized OPLS (or SRRR) [12], and regularized CCA (or L21SDA) [13]. Note that this Procrustes approach can still be considered mainstream, as can be checked in the very recent works [19, 20]. Other Procrustes-based solutions can be found not only in theoretical proposals [21, 22, 23, 24, 25, 26], but also in real-world applications such as medical imaging [27], optical emission spectroscopy [28], or decoding intracranial data [6].
Therefore, the main purpose of this paper is twofold: (1) to alert the machine learning community to the limitations of Procrustes when used as part of the above iterative method, as the next section theoretically analyzes; and (2) to propose an alternative method for the V-step that pursues feature uncorrelation, which is presented in Section 3.2.

3.1 Generalized solution: V-step with orthogonal Procrustes
Problem (9) is known as the orthogonal Procrustes problem, whose optimal solution is given by $\mathbf{V} = \mathbf{P}\mathbf{Q}^\top$, given the singular value decomposition $\mathbf{M} = \mathbf{P}\mathbf{\Sigma}\mathbf{Q}^\top$ [29].

3.2 Proposed solution: V-step as an eigenvalue problem
Using Lagrange multipliers, we reformulate (9) as the following maximization problem:

$$ \max_{\mathbf{V}} \; \operatorname{tr}(\mathbf{M}^\top \mathbf{V}) - \operatorname{tr}\big(\mathbf{\Gamma}^\top (\mathbf{V}^\top \mathbf{V} - \mathbf{I})\big) $$

where $\mathbf{\Gamma}$ is a matrix containing the Lagrange multipliers. Taking derivatives of this functional with respect to $\mathbf{V}$, and setting the result to zero, we arrive at the following expression:

$$ \mathbf{M} = \mathbf{V} \mathbf{\Gamma} \qquad (10) $$

Now, since (6) needs to hold to guarantee uncorrelation of the extracted features, matrix $\mathbf{\Gamma}$ should also be diagonal, which is not necessarily satisfied by the solution of (9). In other words, when using the iterative procedure described above, it is not sufficient to impose $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$ during the V-step; we additionally need to impose (6) to get uncorrelated features.

Assuming that $\mathbf{\Gamma}$ is a diagonal matrix, we can now multiply both terms of (10) by their transposes from the right. Multiplying further by $\mathbf{V}$ from the right, and using the fact that $\mathbf{V}^\top \mathbf{V} = \mathbf{I}$, we arrive at the following eigenvalue problem, which is the basis of our method:

$$ \mathbf{M} \mathbf{M}^\top \mathbf{V} = \mathbf{V} \mathbf{\Gamma}^2 \qquad (11) $$
Table 2 includes a summary of the W- and V-steps for the particular cases of regularized CCA, OPLS, and PCA. Remember that $\mathbf{W}$ can be straightforwardly computed from $\mathbf{V}$ using the relations indicated in the last column of Table 1.
Method | W-step (reg. LS) | V-step (eigenvalue problem)
reg. CCA | $\min_{\mathbf{W}}\|\mathbf{Y}\mathbf{C}_{yy}^{-1/2}-\mathbf{X}\mathbf{W}\mathbf{V}^\top\|_F^2+\lambda R(\mathbf{W})$ | $\mathbf{M}\mathbf{M}^\top\mathbf{V}=\mathbf{V}\mathbf{\Gamma}^2$, with $\mathbf{M}=\mathbf{C}_{yy}^{-1/2}\mathbf{C}_{yx}\mathbf{W}$
reg. OPLS | $\min_{\mathbf{W}}\|\mathbf{Y}-\mathbf{X}\mathbf{W}\mathbf{V}^\top\|_F^2+\lambda R(\mathbf{W})$ | $\mathbf{M}\mathbf{M}^\top\mathbf{V}=\mathbf{V}\mathbf{\Gamma}^2$, with $\mathbf{M}=\mathbf{C}_{yx}\mathbf{W}$
reg. PCA | $\min_{\mathbf{W}}\|\mathbf{X}-\mathbf{X}\mathbf{W}\mathbf{V}^\top\|_F^2+\lambda R(\mathbf{W})$ | $\mathbf{M}\mathbf{M}^\top\mathbf{V}=\mathbf{V}\mathbf{\Gamma}^2$, with $\mathbf{M}=\mathbf{C}_{xx}\mathbf{W}$
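The overall coupled iteration with the proposed eigenvalue-based V-step can be sketched as follows. This is our illustrative reconstruction, not the authors' code: the W-step is shown for the differentiable ridge regularizer (for non-differentiable penalties such as $\ell_{2,1}$ it would be replaced by the corresponding sparse least-squares solver), and the V-step takes the left singular vectors of $\mathbf{M}$, i.e., the eigenvectors of $\mathbf{M}\mathbf{M}^\top$.

```python
# Hedged sketch of the coupled W-step / V-step iteration with the
# proposed eigenvalue V-step (V = left singular vectors of M).
import numpy as np

def fit_mva(X, Ytilde, n_feats, lam=1e-6, n_iter=500, seed=0):
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    # Random orthonormal start for V (any full-rank init works here).
    V = np.linalg.qr(rng.standard_normal((Ytilde.shape[1], n_feats)))[0]
    Minv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    Cxy = X.T @ Ytilde
    for _ in range(n_iter):
        W = Minv @ Cxy @ V                     # W-step: regularized LS, cf. (2)
        M = Cxy.T @ W                          # M = Ytilde^T X W
        P, _, _ = np.linalg.svd(M, full_matrices=False)
        V = P                                  # proposed V-step: eigvecs of M M^T
    return W, V

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 10))
Ytilde = X @ rng.standard_normal((10, 4)) + 0.1 * rng.standard_normal((200, 4))
W, V = fit_mva(X, Ytilde, n_feats=2)

C = (X @ W).T @ (X @ W)
off = np.abs(C - np.diag(np.diag(C))).max()
print(off < 1e-3 * np.abs(C).max())            # True: (near-)uncorrelated features
```

Replacing the V-step by the Procrustes factor $\mathbf{P}\mathbf{Q}^\top$ reproduces the mainstream scheme criticized in Section 4.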
3.3 Relationship between both solutions
In this section we demonstrate that, in the absence of regularization, the solution to the eigenvalue problem (11) is given by $\mathbf{V} = \mathbf{P}$, where the columns of $\mathbf{P}$ are the left singular vectors of matrix $\mathbf{M}$. This implies that the solution of our method is just a rotation of the solution obtained with Procrustes, $\mathbf{V}_{\mathrm{Proc}} = \mathbf{P}\mathbf{Q}^\top$. This rotation plays a crucial role in uncorrelating the extracted features. Indeed, in the experiments section we will see not only that less correlated features can be obtained, but also that the extracted features are more effective at minimizing the overall objective function (1).
We start by rewriting the singular value decomposition of $\mathbf{M}$ as

$$ \mathbf{M} = \mathbf{P}\mathbf{\Sigma}\mathbf{Q}^\top \qquad (12) $$

Now, multiplying both terms of (12) by their transposes from the right, we have

$$ \mathbf{M}\mathbf{M}^\top = \mathbf{P}\mathbf{\Sigma}^2\mathbf{P}^\top \qquad (13) $$

Further multiplying both terms by $\mathbf{P}$ from the right, and comparing the result with (11), we can see that the solution to the eigenvalue problem (11) is precisely $\mathbf{V} = \mathbf{P}$ and $\mathbf{\Gamma} = \mathbf{\Sigma}$.
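The rotation relationship just derived is easy to verify numerically. In this sketch (ours), a random matrix stands in for $\mathbf{M} = \mathbf{C}_{\tilde{y}x}\mathbf{W}$:

```python
# Hedged numerical check: the proposed V (left singular vectors P of M)
# and the Procrustes solution P Q^T differ only by the rotation Q^T,
# but only the former diagonalizes M M^T.
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 3))            # stand-in for C_yx~ W
P, s, Qt = np.linalg.svd(M, full_matrices=False)

V_proposed = P                             # solution of (11)
V_procrustes = P @ Qt                      # solution of (9)

# Same subspace, related by the orthogonal rotation Q^T:
print(np.allclose(V_procrustes, V_proposed @ Qt))      # True
# V_proposed diagonalizes M M^T; V_procrustes generally does not:
D1 = V_proposed.T @ M @ M.T @ V_proposed
D2 = V_procrustes.T @ M @ M.T @ V_procrustes
print(np.allclose(D1, np.diag(np.diag(D1))))           # True
print(np.allclose(D2, np.diag(np.diag(D2))))           # False for generic M
```

The off-diagonal terms left in `D2` are exactly the cross-correlations that carry over to the extracted features in Section 4.1.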
4 Undesired properties of orthogonal Procrustes in regularized MVA
In this section, we provide theoretical arguments about the unsuitability of using orthogonal Procrustes as the solution to the V-step, showing that the obtained solution lacks some desired properties of MVA methods. To do so, we build on a generalization of the property declared in [11], which states that a good regularized MVA method should reduce to the original (unregularized) MVA solution when the regularization term is suppressed. We will show that this is not the case when using the solution based on Procrustes. In particular, we study the following two issues that occur when setting $\lambda = 0$:

The extracted features are not uncorrelated in general. This issue by itself undermines the correctness of all MVA methods based on the Procrustes solution.

Initialization of the iterative process becomes critical, and in some cases the algorithm may not progress at all (for $\lambda = 0$).
We demonstrate the above statements next, and discuss their implications further.
4.1 Uncorrelation of the extracted features using Procrustes
Denoting the solution of the V-step as $\mathbf{V}_{\mathrm{Proc}} = \mathbf{P}\mathbf{Q}^\top$ and the solution after the next W-step as $\mathbf{W}_{\mathrm{Proc}}$, we can use (2) to write

$$ \mathbf{W}_{\mathrm{Proc}} = \mathbf{C}_{xx}^{-1} \mathbf{C}_{x\tilde{y}} \mathbf{P}\mathbf{Q}^\top \qquad (14) $$

since this is the closed-form optimal solution of the W-step when regularization is removed. Now, it is easy to see that the autocorrelation matrix of the extracted features can be rewritten as

$$ \mathbf{W}_{\mathrm{Proc}}^\top \mathbf{C}_{xx} \mathbf{W}_{\mathrm{Proc}} = \mathbf{Q}\mathbf{P}^\top \mathbf{C}_{\tilde{y}x} \mathbf{C}_{xx}^{-1} \mathbf{C}_{x\tilde{y}} \mathbf{P}\mathbf{Q}^\top \qquad (15) $$

Recalling that the columns of $\mathbf{P}$ are eigenvectors of $\mathbf{C}_{\tilde{y}x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\tilde{y}}$ with associated eigenvalues $\mathbf{\Lambda}$, and that $\mathbf{Q}$ is orthogonal, (15) can finally be expressed as:

$$ \mathbf{W}_{\mathrm{Proc}}^\top \mathbf{C}_{xx} \mathbf{W}_{\mathrm{Proc}} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top \qquad (16) $$

which is not diagonal in the general case and, thus, there is no guarantee that the extracted features are uncorrelated. In fact, since $\mathbf{\Lambda}$ is a diagonal matrix and $\mathbf{Q}$ is an orthogonal matrix ($\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$), only permutation matrices $\mathbf{Q}$ will result in uncorrelated features (a diagonal $\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$); in that case, both solutions extract the same features, although not necessarily in the same order. The experimental section will demonstrate that methods based on Procrustes do not necessarily enjoy the desired uncorrelation property: even when the regularization is canceled ($\lambda = 0$), the correlation among the features implies that part of the variance of the original data described by one feature also affects other features. Furthermore, the experiments will show that, since this method does not explicitly pursue such uncorrelation, the obtained solution for $\mathbf{W}$ results in higher correlation among the features than that of the proposed method.
4.2 Proof of the initialization dependency of the orthogonal Procrustes approach
In the experiments section, we will illustrate that the solution achieved when using Procrustes shows a significant variance when the initialization conditions are changed, even when the regularization term is removed. In this subsection, we pay attention to a particular issue associated with the initialization of the algorithm. In particular, we show that when $\mathbf{W}$ is initialized with an orthogonal matrix (quite a common case in the literature), the algorithm does not progress at all.
Let $\mathbf{W}_0$ denote the orthogonal matrix used for the algorithm initialization, and let $\mathbf{V}_1$ and $\mathbf{W}_1$ denote the solutions to the subsequent V- and W-steps obtained from $\mathbf{W}_0$. In order to express $\mathbf{W}_2$ in terms of $\mathbf{W}_1$, we use the expressions for both steps to arrive at

(17)

where an auxiliary matrix has been defined for compactness. Now, multiplying both sides of (17) by their transposes from the right and from the left, we obtain the following expressions (note that this auxiliary matrix is symmetric):
(18)  
(19) 
From these, the following equalities, which will be helpful for this demonstration, are obtained:
(20)  
(21) 
Finally, multiplying (20) by the transpose of (21), we can express $\mathbf{W}_2$ as a function of $\mathbf{W}_1$ and simplify the resulting expression to conclude that $\mathbf{W}_2 = \mathbf{W}_1$, where we have made use of (17) and the fact that $\mathbf{W}_0$ is orthogonal (i.e., $\mathbf{W}_0^\top\mathbf{W}_0 = \mathbf{I}$).
Therefore, we have proved that the Procrustes-based MVA iterative process becomes paralyzed when the regularization term is canceled and $\mathbf{W}$ is initialized with an orthogonal matrix. This is the case of the method proposed in [6], where the algorithm is initialized with the eigenvectors of a covariance matrix. Note also that, since an orthogonality constraint is imposed on the solution, an orthogonal matrix is a reasonable choice for the initialization, the identity matrix being a classic choice in these cases.
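The stalling behavior is easy to reproduce numerically. The sketch below (ours) takes the PCA particularization with $\lambda = 0$ and a full set of features, for which the W-step reduces to $\mathbf{W} = \mathbf{V}$; `polar` is a hypothetical helper name for the orthogonal Procrustes factor:

```python
# Hedged illustration: with an orthogonal initialization, the
# Procrustes-based iteration (PCA case, lambda = 0, all features)
# never leaves its starting point.
import numpy as np

def polar(A):
    # Orthogonal Procrustes factor of A: P Q^T from the SVD A = P S Q^T.
    P, _, Qt = np.linalg.svd(A, full_matrices=False)
    return P @ Qt

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 5))
Cxx = X.T @ X

W = np.linalg.qr(rng.standard_normal((5, 5)))[0]   # orthogonal initialization
W0 = W.copy()
for _ in range(20):
    V = polar(Cxx @ W)             # V-step solved as orthogonal Procrustes
    W = V                          # W-step with lambda = 0 (PCA: W = V)
print(np.allclose(W, W0))          # True: the iteration is stalled at W0
```

The stall follows from `polar(Cxx @ W) = W` whenever `W` is square orthogonal and `Cxx` is symmetric positive definite, so each sweep returns the initialization regardless of the data.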
5 Experiments
The previous section theoretically demonstrated the problems of Procrustes-based MVA methods, as well as the validity of our proposal. In this section, we show empirically the differences between both approaches on a real problem. To that end, we compare three implementations: the iterative MVA solutions using the Procrustes approach (referred to as "Procrustes"), the proposed solution (denoted "Proposal"), and, whenever possible, the original algorithm implementations ("Original"). For all implementations, we consider the well-known MVA methods derived from the generalized framework: PCA, CCA, and OPLS.
For this study, the segment problem [30] will be used throughout this section. This dataset consists of 18 input variables, 7 output dimensions, and 2390 samples. In order to be able to analyze the initialization dependencies of the iterative approaches, all results have been averaged over 50 random initializations.
To start this evaluation, we consider the case where no regularization is applied ($\lambda = 0$) and analyze the following algorithm behaviors as the number of extracted features varies:

Convergence to the minimum of the objective function: by evaluating the achieved value of the cost function (1) (in CCA, we consider its formulation as the maximization of a trace), we will be able to study whether the compared solutions are able to achieve the same performance as the original MVA solutions (see Figure 1).

Informativeness of the extracted variables: we can measure whether the extracted features are correlated or redundant by means of the Total Explained Variance (TEV) concept [11], since its value decreases when there are relationships among the features; thus, higher values of this measure indicate that the extracted features are more informative (see Figure 2). The explained variance of a single feature is given by computing the QR decomposition of the extracted-feature matrix and taking the absolute value of the corresponding diagonal element of $\mathbf{R}$; the TEV aggregates these contributions over all extracted features.
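A possible implementation of this TEV measure, following the adjusted-variance idea of [11] (the exact normalization used in the experiments is an assumption on our part, and `total_explained_variance` is our own name):

```python
# Hedged sketch of Total Explained Variance via the QR decomposition:
# with X W = Q R, |r_jj|^2 is the variance explained by feature j after
# discounting what earlier features already explain.
import numpy as np

def total_explained_variance(X, W):
    Q, R = np.linalg.qr(X @ W)
    return np.sum(np.diag(R) ** 2)

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 4))
X = X - X.mean(axis=0)
_, W = np.linalg.eigh(X.T @ X)               # full PCA loading matrix
tev = total_explained_variance(X, W)
print(np.isclose(tev, np.trace(X.T @ X)))    # True: all variance recovered
```

For uncorrelated features the TEV equals the plain sum of feature variances; any correlation among the features shrinks it, which is why it serves as an informativeness measure here.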
[Figure 1: value of the objective function for different numbers of extracted features; panels: a) PCA, b) CCA, c) OPLS.]
[Figure 2: TEV for different numbers of extracted features; panels: a) PCA, b) CCA, c) OPLS.]
In the light of these results, we confirm the problems of the Procrustes-based approaches and verify that the proposed MVA implementation can overcome them. In particular:

From Figure 1, we can conclude that, when more than one extracted feature is considered, Procrustes-based approaches are not able to converge to the standard MVA solution. The proposed versions, however, converge to exactly the same solution as the original methods.

The proposed MVA approach is able to extract more informative features. This is demonstrated by TEV values larger (for any number of extracted features) than those achieved with the Procrustes method, and it is a direct consequence of the uncorrelation among the extracted features.

Last but not least, the standard deviation of all Procrustes-based solutions reveals a serious initialization dependency. Note that the proposed solutions, as well as the original MVA methods, converge to the same solution for all initializations.
For the sake of completeness, we also analyze the uncorrelation among the extracted features when a sparsity-inducing penalty is used. For this purpose, we directly measure the Correlation of the Extracted Features (CEF) by calculating the Frobenius norm of the off-diagonal part of the covariance matrix of the extracted features; thus, CEF values different from zero reveal correlations among the extracted components. In particular, Figure 3 includes the CEF values for different Sparsity Rates (SR), from zero to 80% (exploring higher sparsity rates would lead all approaches to set all coefficients to zero, and the CEF would no longer be meaningful). For this study, three different initializations for the Procrustes-based methods are considered: (1) 'Proc-random', which, like our proposed implementations, uses uniformly random values in the range from 0 to 1; (2) 'Proc-orthog', which uses an orthogonal matrix given by the eigenvectors of a covariance matrix (note that in the SPCA case this initialization is equal to the standard PCA solution), as proposed in [11] and [6]; and (3) 'Proc-ideal', which directly starts the iterative process from the ideal solution obtained when no sparsity regularization is used.
When the regularization term is included, no approach (including the proposed one) is able to obtain completely uncorrelated features; even so, the CEF values reveal that our proposal achieves lower correlation among the features than the Procrustes approaches (independently of their initialization). Besides, when the regularization parameter is close to zero (SR = 0%), the Procrustes versions are only able to obtain uncorrelated features if they are initialized with the ideal solution (Proc-ideal).
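A plausible reconstruction of the CEF measure (the paper's exact normalization is not preserved in the text, so this sketch and the name `cef` are assumptions): the Frobenius norm of the off-diagonal part of the extracted-feature covariance.

```python
# Hedged sketch of the Correlation of the Extracted Features (CEF):
# zero iff the feature covariance is diagonal, i.e., features uncorrelated.
import numpy as np

def cef(X, W):
    Z = X @ W
    C = Z.T @ Z                          # feature covariance (scaling neglected)
    return np.linalg.norm(C - np.diag(np.diag(C)))

rng = np.random.default_rng(6)
X = rng.standard_normal((150, 5))
X = X - X.mean(axis=0)
_, W_pca = np.linalg.eigh(X.T @ X)       # PCA loadings: uncorrelated features
W_rand = rng.standard_normal((5, 3))     # generic loadings: correlated features
print(cef(X, W_pca) < 1e-6)              # True
print(cef(X, W_rand) > 1e-3)             # True
```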
[Figure 3: CEF values for different sparsity rates; panels a), b), c).]
6 Conclusions
Solutions for regularized MVA approaches are based on an iterative scheme consisting of two coupled steps. Whereas the first step eases the inclusion of regularization terms, the second results in a constrained minimization problem which is generally solved as an orthogonal Procrustes problem. Despite the generalized use of this scheme, it fails to obtain a new subspace of uncorrelated features, this being a desired property of MVA solutions.
In this paper we have analyzed the drawbacks of these schemes and proposed an alternative algorithm that enforces the uncorrelation property. The advantages of the proposed technique over the Procrustes-based methods have been discussed theoretically and further confirmed via simulations.
Acknowledgments
This work has been partly supported by MINECO project TEC2014-52289-R.
References
 [1] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 6, pp. 559–572, 1901.
 [2] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
 [3] H. Wold, “Estimation of principal components and related models by iterative least squares,” in Multivariate Analysis. Academic Press, 1966, pp. 391–420.
 [4] ——, “Nonlinear estimation by iterative least squares procedures,” in Research Papers in Statistics. Wiley, 1966, pp. 411–444.
 [5] K. Worsley, J. Poline, K. Friston, and A. Evans, "Characterizing the response of PET and fMRI data using multivariate linear models (MLM)," Neuroimage, vol. 6, pp. 305–319, 1998.
 [6] M. A. J. van Gerven, Z. C. Chao, and T. Heskes, “On the decoding of intracranial data using sparse orthonormalized partial least squares,” Journal of Neural Engineering, vol. 9, no. 2, pp. 26 017–26 027, 2012.
 [7] L. K. Hansen, “Multivariate strategies in functional magnetic resonance imaging,” Brain and Language, vol. 102, no. 2, pp. 186–191, 2007.
 [8] J. Arenas-García and G. Camps-Valls, "Efficient kernel orthonormalized PLS for remote sensing applications," IEEE Trans. Geosci. Remote Sens., vol. 44, pp. 2872–2881, 2008.
 [9] J. Arenas-García and K. B. Petersen, "Kernel multivariate analysis in remote sensing feature extraction," in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds. Wiley, 2009.
 [10] M. Barker and W. Rayens, “Partial least squares for discrimination,” Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
 [11] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
 [12] L. Chen and J. Z. Huang, “Sparse reducedrank regression for simultaneous dimension reduction and variable selection,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1533–1545, 2012.

 [13] X. Shi, Y. Yang, Z. Guo, and Z. Lai, "Face recognition by sparse discriminant analysis via joint $\ell_{2,1}$-norm minimization," Pattern Recognition, vol. 47, no. 7, pp. 2447–2453, 2014.
 [14] G. C. Reinsel and R. P. Velu, Multivariate reduced-rank regression: theory and applications. Springer, New York, 1998.
 [15] S. Muñoz-Romero, J. Arenas-García, and V. Gómez-Verdejo, "Sparse and kernel OPLS feature extraction based on eigenvalue problem solving," Pattern Recognition, vol. 48, no. 5, pp. 1797–1811, 2015.

 [16] F. Nie, H. Huang, X. Cai, and C. Ding, "Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization," in Advances in Neural Information Processing Systems 23. The MIT Press, 2010, pp. 1813–1821.
 [17] M. Grant, S. Boyd, and Y. Ye, "CVX: Matlab software for disciplined convex programming," 2008.
 [18] J. Kim and H. Park, “Toward faster nonnegative matrix factorization: A new algorithm and comparisons,” in Proc. 8th IEEE Intl. Conf. on Data Mining (ICDM’08). Pisa, Italy: IEEE, December 2008, pp. 353–362.

 [19] Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, "Approximate orthogonal sparse embedding for dimensionality reduction," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 723–735, April 2016.
 [20] Z. Hu, G. Pan, Y. Wang, and Z. Wu, "Sparse principal component analysis via rotation and truncation," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 875–890, April 2016.
 [21] Z. Qiao, L. Zhou, and J. Z. Huang, "Effective linear discriminant analysis for high dimensional, low sample size data," in Proceedings of the World Congress on Engineering, vol. 2. Citeseer, 2008, pp. 1070–1075.
 [22] ——, “Sparse linear discriminant analysis with applications to high dimensional low sample size data,” IAENG International Journal of Applied Mathematics, vol. 39, no. 1, pp. 48–60, 2009.
 [23] W. Dou, G. Dai, C. Xu, and Z. Zhang, “Sparse unsupervised dimensionality reduction algorithms,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, vol. 6321, pp. 361–376.
 [24] J. Guo, G. James, E. Levina, G. Michailidis, and J. Zhu, "Principal component analysis with sparse fused loadings," Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 947–962, 2010.

 [25] Y. Han, F. Wu, J. Jia, Y. Zhuang, and B. Yu, "Multi-task sparse discriminant analysis (MtSDA) with overlapping categories," in Proc. 24th AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, July 2010, pp. 469–474.
 [26] Y. Liu, F. Shang, W. Fan, J. Cheng, and H. Cheng, "Generalized higher-order orthogonal iteration for tensor decomposition and completion," in Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014, pp. 1763–1771.
 [27] K. Sjöstrand, M. B. Stegmann, and R. Larsen, "Sparse principal component analysis in medical shape modeling," in Proc. SPIE Medical Imaging: Image Processing, vol. 6144. San Diego, CA, USA: International Society for Optics and Photonics, March 2006, pp. 61444X-1–61444X-12.
 [28] B. Ma, S. McLoone, J. Ringwood, and N. Macgearailt, “Selecting signature optical emission spectroscopy variables using sparse principal component analysis,” in Proc. IEEE 11th Intl. Conf. on Computer and Information Technology (ICCIT). IEEE, December 2008, pp. 14–19.
 [29] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
 [30] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml