Multivariate Analysis (MVA) techniques have been widely used for over a century, since Principal Component Analysis (PCA)
was proposed as a simple and efficient way to reduce data dimensionality by projecting the data onto the directions of maximum variance. Since then, many variants have emerged that try to include supervised information in this dimensionality reduction process; this is the case of algorithms such as Canonical Correlation Analysis (CCA), Partial Least Squares (PLS) approaches [3, 4], or Orthonormalized PLS (OPLS). In fact, these methods have been successfully applied in many real applications: in biomedical engineering [6, 7], remote sensing [8, 9], or chemometrics, among many others.
Some recent significant contributions in the field have focused on gaining interpretability by including $\ell_1$ and $\ell_2$ norms, or even group-lasso penalties, in the MVA formulations. This is the case of extensions such as sparse PCA, sparse OPLS, group-lasso penalized OPLS (or SRRR), and $\ell_{2,1}$-regularized CCA (or L21SDA). All these approaches are based on an iterative process that combines two optimization problems. The first step consists of a regularized least-squares problem to obtain the vectors for the extraction of input features; the second step involves a constrained minimization problem which is typically solved as an orthogonal Procrustes problem.
Although these regularized approaches have been recurrently applied in feature extraction and dimensionality reduction scenarios, all of them ignore an intrinsic and important property of most MVA approaches: the uncorrelation of the extracted features in the new subspace. When this property holds, the feature extraction process provides additional advantages: (1) the subsequent learning tasks (working over this new space) are eased; for instance, least-squares problems (Ridge Regression, LASSO, …) can work independently over each dimension, and the effects of variations of the input data are isolated in the different directions. (2) The selection of optimal feature subsets becomes straightforward: once a set of features is computed, obtaining an optimal reduced subset consists of selecting the features with the highest associated eigenvalues. Consequently, the adjustment of the optimum number of extracted features is simplified.
In this paper, we analyze the aforementioned MVA formulations in detail, showing, from both a theoretical and an experimental point of view, some drawbacks overlooked in the literature until now. Specifically, we will demonstrate that these MVA approaches (1) do not obtain uncorrelated features in general; (2) do not converge to their associated non-regularized MVA solutions; and (3) suffer from issues that depend on the algorithm initialization; e.g., for certain initializations the methods can fail to progress at all.
As a solution to these problems, this paper proposes an alternative to orthogonal Procrustes. To do so, we rely on a common framework that allows us to deal simultaneously with the most common MVA methods (PCA, CCA and OPLS), and extend it to favor interpretable solutions by including a regularization term. Similarly to existing methods, we propose a solution to this generalized formulation that is based on an iterative process, but which does not suffer from the above problems.
The paper is organized as follows: Section 2 introduces the generalized MVA framework. Section 3 presents the iterative process required to solve its regularized extension and describes both the Procrustes solution and our proposal based on a standard eigenvalue problem. Section 4 explains the limitations of the Procrustes solution in greater detail and provides theoretical proofs of the most important problems of this approach. Section 5 compares the suitability of the newly proposed solution with that of Procrustes on some real problems that corroborate the theoretical findings. Finally, Section 6 concludes the paper.
2 Framework for MVA with uncorrelated features
This section reviews some well-known MVA methods under a unifying framework, so that subsequent sections can deal with these methods in a unified manner. Before that, the notation used throughout the paper is presented.
Let us assume a supervised learning scenario, where the goal is to learn relevant features from a training data set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{n}$ and $\mathbf{y}_i \in \mathbb{R}^{m}$ are considered as the input and output vectors, respectively. Therefore, $n$ and $m$ denote the dimensions of the input and output spaces. For notational convenience, we define the input and output data matrices $\mathbf{X} \in \mathbb{R}^{n \times N}$ and $\mathbf{Y} \in \mathbb{R}^{m \times N}$, whose columns are the input and output vectors, so that sample estimates of the input and output data covariance matrices, as well as of their cross-covariance matrix, can be calculated as $\mathbf{C}_{XX} = \mathbf{X}\mathbf{X}^\top$, $\mathbf{C}_{YY} = \mathbf{Y}\mathbf{Y}^\top$, and $\mathbf{C}_{XY} = \mathbf{X}\mathbf{Y}^\top$, where we have neglected the scaling factor $1/N$, and superscript $\top$ denotes vector or matrix transposition. The goal of linear MVA methods is to find relevant features $\mathbf{X}' = \mathbf{U}^\top\mathbf{X}$ by combining the original variables, where the $i$th column of $\mathbf{U}$ is a vector containing the coefficients associated to the $i$th extracted feature.
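For illustration purposes, the notation above translates directly into a few lines of NumPy. The dimensions, the number of extracted features, and the random data below are placeholders standing in for a real training set; only the matrix operations follow the conventions of this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N, k = 5, 3, 200, 2            # input dim, output dim, samples, extracted features

X = rng.standard_normal((n, N))      # input data matrix (one sample per column)
Y = rng.standard_normal((m, N))      # output data matrix

# Sample (cross-)covariance estimates; the 1/N scaling factor is neglected,
# as in the text.
Cxx = X @ X.T
Cyy = Y @ Y.T
Cxy = X @ Y.T

# Linear feature extraction: X' = U^T X, with one column of U per feature.
U = rng.standard_normal((n, k))
X_prime = U.T @ X

print(Cxx.shape, Cxy.shape, X_prime.shape)
```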
The results in this paper apply, at least, to PCA, CCA, and OPLS; all these methods have in common that the extracted features are uncorrelated, i.e., $\mathbf{X}'\mathbf{X}'^\top = \mathbf{U}^\top\mathbf{C}_{XX}\mathbf{U} = \boldsymbol{\Lambda}$, with $\boldsymbol{\Lambda}$ a diagonal matrix. MVA methods that do not enforce this uncorrelation, most notably PLS, are therefore left outside the scope of this paper.
A common framework for many regularized MVA methods can be found in . According to it, these methods pursue the minimization of the following objective function:
where $R(\mathbf{U})$ is a regularization term, such as ridge regularization ($\|\mathbf{U}\|_F^2$), the $\ell_1$ norm ($\|\mathbf{U}\|_1$), or the $\ell_{2,1}$ penalty for variable selection ($\|\mathbf{U}\|_{2,1}$). Parameter $\lambda$ trades off the importance of the regularization term in (2), $\mathbf{V}$ can be considered a matrix for the extraction of output features, and different particularizations of matrix $\boldsymbol{\Omega}$ give rise to the considered MVA methods, in particular $\boldsymbol{\Omega} = \mathbf{C}_{YY}^{-1/2}$ for CCA, $\boldsymbol{\Omega} = \mathbf{I}$ for OPLS, and $\boldsymbol{\Omega} = \mathbf{I}$ with $\mathbf{Y} = \mathbf{X}$ for PCA [14, 15].
In order to extract uncorrelated features, the loss function (2) is formally minimized subject to $\mathbf{U}^\top\mathbf{C}_{XX}\mathbf{U} = \mathbf{I}$. However, it is proved in  that the same solution is obtained by constraining the minimization to $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$. For the case in which the derivative of the regularization term can be computed, it is possible to obtain a closed-form solution for $\mathbf{V}$ as a function of $\mathbf{U}$; introducing this solution back into (2), the problem can be rewritten in terms of $\mathbf{U}$ only. For instance, the solution for $\mathbf{V}$ can be found by taking derivatives of (2) with respect to $\mathbf{V}$. After setting the result equal to zero, we obtain
Minimizing this functional with respect to $\mathbf{U}$, subject to $\mathbf{U}^\top\mathbf{C}_{XX}\mathbf{U} = \mathbf{I}$, the solution is given in terms of the following generalized eigenvalue problem,
which can be rewritten as a standard eigenvalue problem:
where we have defined . Thus, can also be obtained as (see (2))
Table 1 provides the above expression particularized for the CCA, OPLS and PCA methods. For each method, we show the corresponding eigenvalue problem that defines the solution for $\mathbf{U}$, the associated matrix $\boldsymbol{\Omega}$, and the solution for $\mathbf{V}$ provided by (2).
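The unregularized eigenvalue problem can be solved with standard linear algebra routines. The sketch below assumes the OPLS particularization ($\boldsymbol{\Omega} = \mathbf{I}$), so that the problem reduces to $\mathbf{C}_{XY}\mathbf{C}_{YX}\mathbf{U} = \mathbf{C}_{XX}\mathbf{U}\boldsymbol{\Lambda}$ with $\mathbf{U}^\top\mathbf{C}_{XX}\mathbf{U} = \mathbf{I}$; whitening the input with a Cholesky factor of $\mathbf{C}_{XX}$ turns it into a symmetric standard eigenvalue problem. The data are synthetic and the variable names illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 6, 3, 500
X = rng.standard_normal((n, N))
Y = rng.standard_normal((m, N)) + 0.5 * X[:3]   # correlate output with input

Cxx = X @ X.T
Cxy = X @ Y.T

# Whitening: with L the Cholesky factor of Cxx, the substitution W = L^T U
# converts the generalized problem into a symmetric standard one.
L = np.linalg.cholesky(Cxx)
Linv = np.linalg.inv(L)
M = Linv @ Cxy @ Cxy.T @ Linv.T          # symmetric PSD matrix
vals, W = np.linalg.eigh(M)              # ascending eigenvalues
order = np.argsort(vals)[::-1]           # largest eigenvalue first
U = Linv.T @ W[:, order]

# The extracted features are uncorrelated: U^T Cxx U is (numerically) the identity.
G = U.T @ Cxx @ U
print(np.max(np.abs(G - np.eye(n))))
```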
2.1 Uncorrelation of the extracted features
It is important to remark that, in the absence of regularization, the above approach still produces uncorrelated features, even though this is not enforced explicitly. To prove this, we set $\lambda = 0$ and multiply both sides of (4) from the left by $\mathbf{U}^\top$, arriving at:
which proves the uncorrelation of the extracted features, since the resulting matrix is diagonal.
3 Iterative solutions for regularized MVA methods
In the case of non-differentiable regularization terms, the minimization of (2) s.t. $\mathbf{U}^\top\mathbf{C}_{XX}\mathbf{U} = \mathbf{I}$ does not have a closed-form solution. This problem is found, for instance, when using LASSO regularization, or the very useful $\ell_{2,1}$ norm, which performs variable selection. In order to solve these regularized MVA methods, many authors have resorted in the literature to the following iterative coupled procedure:
Step-$V$. For fixed $\mathbf{U}$, find the matrix $\mathbf{V}$ that minimizes (2) subject to $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$; equivalently, rewriting this step in terms of an auxiliary matrix defined for compactness, solve the corresponding constrained minimization problem.
The Step-$V$ above is typically solved in the literature using the orthogonal Procrustes approach. As we will see later, this solution neglects the uncorrelation among the extracted features; despite that, since it was initially proposed in  for the sparse PCA algorithm, it has been wrongly extended to supervised approaches such as sparse OPLS, group-lasso penalized OPLS (or SRRR), and $\ell_{2,1}$-regularized CCA (or L21SDA). Note that this Procrustes approach can still be considered mainstream, as can be checked in the very recent works [19, 20]. Other Procrustes-based solutions can be found not only in theoretical proposals [21, 22, 23, 24, 25, 26], but also in real-world applications such as medical imaging, optical emission spectroscopy, or decoding of intracranial data.
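The orthogonal Procrustes problem referenced here has the classical closed form due to Schönemann: the matrix with orthonormal columns minimizing $\|\mathbf{A} - \mathbf{V}\mathbf{B}\|_F$ is obtained from the SVD of $\mathbf{A}\mathbf{B}^\top$. The sketch below shows this generic solution; the matrices `A` and `B` are placeholders standing in for the data-dependent terms of the Step-$V$, not the exact quantities of any particular method.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the matrix V with orthonormal columns minimizing ||A - V @ B||_F.

    Classical closed form (Schonemann, 1966): compute the SVD of A @ B^T
    and multiply its singular-vector factors.
    """
    P, _, Qt = np.linalg.svd(A @ B.T, full_matrices=False)
    return P @ Qt

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 50))
V_true, _ = np.linalg.qr(rng.standard_normal((6, 4)))   # random orthonormal matrix
A = V_true @ B                                          # noiseless case: exact recovery

V = orthogonal_procrustes(A, B)
print(np.allclose(V, V_true))
```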
Therefore, the main purpose of this paper is two-fold: (1) to alert the machine learning community about the limitations of Procrustes when used as part of the above iterative method, which the next section analyzes theoretically; (2) to propose an alternative method for the $V$-step that pursues feature uncorrelation, which is presented in Section 3.2.
3.1 Generalized solution: $V$-step with Orthogonal Procrustes
3.2 Proposed solution: $V$-step as an eigenvalue problem
Using Lagrange multipliers, we reformulate (9) as the following maximization problem
where the multiplier matrix contains the Lagrange multipliers. Taking derivatives of the Lagrangian with respect to $\mathbf{V}$, and setting the result to zero, we arrive at the following expression
Now, since (6) needs to hold to guarantee uncorrelation of the extracted features, the matrix of Lagrange multipliers should also be diagonal, which is not necessarily satisfied by the solution of (9). In other words, when using the iterative procedure described above, it is not sufficient to impose $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$ during the $V$-step; we additionally need to impose (6) to obtain uncorrelated features.
Assuming that the multiplier matrix is diagonal, we can now premultiply both terms of (10) by their transposes. Multiplying further from the left, and using the fact that $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$, we arrive at the following eigenvalue problem, which is the basis of our method:
Table 2 includes a summary of the $U$- and $V$-steps for the particular cases of regularized CCA, OPLS and PCA. Remember that $\mathbf{V}$ can be straightforwardly computed from $\mathbf{U}$ using the relations indicated in the last column of Table 1.
|$U$-step (reg. LS)||$V$-step (eigenvalue problem)|
3.3 Relationship between both solutions
In this section we demonstrate that, in the absence of regularization, the solution to the eigenvalue problem (11) is given by the matrix whose columns are the left singular vectors of the matrix involved in the Procrustes step. This implies that the solution of our method is just a rotation of the solution obtained with Procrustes. This rotation plays a crucial role in uncorrelating the extracted features. Indeed, in the experiments section we will see not only that more uncorrelated features can be obtained, but also that the extracted features are more effective at minimizing the overall objective function (2).
We start by rewriting the singular value decomposition of as
now, multiplying both terms of (12) by their transposes from the right, we have
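This relationship can be checked numerically: writing the SVD $\mathbf{M} = \mathbf{P}\boldsymbol{\Sigma}\mathbf{Q}^\top$ and multiplying by the transpose from the right yields $\mathbf{M}\mathbf{M}^\top = \mathbf{P}\boldsymbol{\Sigma}^2\mathbf{P}^\top$, so the left singular vectors diagonalize $\mathbf{M}\mathbf{M}^\top$, while the Procrustes solution $\mathbf{P}\mathbf{Q}^\top$ differs from $\mathbf{P}$ only by the rotation $\mathbf{Q}^\top$. The matrix `M` below is random, standing in for the data-dependent matrix of the text.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 3))

P, s, Qt = np.linalg.svd(M, full_matrices=False)

# Multiplying the SVD by its transpose from the right removes Q:
#   M M^T = P diag(s^2) P^T,
# so the left singular vectors P diagonalize M M^T.
MMt = M @ M.T
print(np.allclose(MMt, P @ np.diag(s**2) @ P.T))

# The Procrustes solution P Q^T and the eigenvector solution P differ
# only by the rotation Q^T.
V_procrustes = P @ Qt
V_eig = P
R = Qt                      # rotation relating the two solutions
print(np.allclose(V_eig @ R, V_procrustes))
```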
4 Undesired properties of orthogonal Procrustes in regularized MVA
In this section, we provide theoretical arguments about the unsuitability of using orthogonal Procrustes as the solution to the $V$-step, showing that the obtained solution lacks some desired properties of MVA methods. To do so, we work on a generalization of the property stated in , which says that a good regularized MVA method should reduce to the original (unregularized) MVA solution when the regularization term is suppressed. We will show that this is not the case when the Procrustes-based solution is used. In particular, we study the following two issues that occur when setting $\lambda = 0$:
The extracted features are not uncorrelated in general. This issue by itself undermines the correctness of all MVA methods based on the Procrustes solution.
Initialization of the iterative process becomes critical, and in some cases the algorithm may not progress at all (for $\lambda = 0$).
We demonstrate next the above statements, and discuss further on their implications.
4.1 Uncorrelation of the extracted features using Procrustes
Denoting the solution of the -step as and the solution after the next -step as , we can use (2) to write
since this is the closed-form optimal solution of the -step when regularization is removed. Now, it is easy to see that the autocorrelation matrix of the extracted features can be rewritten as
Recalling that , and that , (15) can be finally expressed as:
which is not diagonal in the general case; thus, there is no guarantee that the extracted features are uncorrelated. In fact, since the eigenvalue matrix is diagonal and the remaining factor is an orthogonal matrix, only permutation matrices $\mathbf{P}$ will result in uncorrelated features; in this case, both solutions extract the same features, but not necessarily in the same order.
The experimental section will demonstrate that methods based on Procrustes do not necessarily enjoy the desired uncorrelation property; even when the regularization is cancelled ($\lambda = 0$), the correlation among the features implies that part of the variance of the original data described by one feature also affects other features. Furthermore, the experiments will show that, since this method does not explicitly pursue such uncorrelation, the obtained solution results in higher correlation among the features than that of the proposed method.
4.2 Proof of the initialization dependency of the orthogonal Procrustes approach
In the experiments section, we will illustrate that the solution achieved when using Procrustes shows significant variance when the initialization conditions are changed, even when the regularization term is removed. In this subsection, we pay attention to a particular issue associated with the initialization of the algorithm. In particular, we show that when the algorithm is initialized with an orthogonal matrix (which is quite common in the literature), it does not progress at all.
Let an orthogonal matrix be chosen for the algorithm initialization. Subsequently, the solutions to the $U$- and $V$-steps can be obtained from it as
In order to express the updated solution in terms of the initialization, we use the expressions for steps 1 and 2 to arrive at
where we have defined an auxiliary matrix for compactness. Now, multiplying both sides of (17) by their transposes from the right and from the left, we obtain the following expressions (note that this auxiliary matrix is symmetric)
From these, the following equalities that will be helpful for this demonstration are obtained:
where we made use of (17) and the fact that the initialization matrix is orthogonal.
Therefore, we have proved that the Procrustes-based MVA iterative process becomes paralyzed when the regularization term is canceled and the algorithm is initialized with an orthogonal matrix. This is the case of the method proposed in , where the algorithm is initialized with a matrix of eigenvectors. Note also that, since orthogonality of the solution is imposed, an orthogonal matrix is a reasonable choice for initialization, the identity matrix being a classic choice in these cases.
5 Experiments
The previous section theoretically demonstrated the problems of Procrustes-based MVA methods, as well as the validity of our proposal. In this section, we show empirically the differences between both approaches on a real problem. To that end, we compare three implementations: the iterative MVA solutions using the Procrustes approach (referred to as “Procrustes”), the proposed solution (denoted “Proposal”) and, whenever possible, the original algorithm implementations (“Original”). For all implementations, we consider the well-known MVA methods derived from the generalized framework: PCA, CCA and OPLS.
For this study, the segment dataset  will be used throughout this section. This dataset consists of 18 input variables, 7 output dimensions and 2390 samples. To analyze the initialization dependencies of the iterative approaches, all results have been averaged over 50 random initializations.
To start this evaluation, we consider that no regularization is applied ($\lambda = 0$) and analyze the following aspects of algorithm behavior when different numbers of extracted features are used:
Convergence to the minimum of the objective function: by evaluating the achieved value of the cost function (2) (in CCA, we will consider its formulation as the maximization of a trace problem), we will be able to study whether the compared solutions achieve the same performance as the original MVA solutions (see Figure 1).
Information of the extracted variables: we can measure whether the extracted features are correlated or redundant by means of the Total Explained Variance (TEV) concept , since its value decreases when there are relationships among the features; thus, higher values of this measure indicate that the extracted features are more informative (see Figure 2). The explained variance of a single variable is given by computing the QR decomposition and taking the absolute value of the diagonal elements of $\mathbf{R}$.
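Following the description above, a possible implementation of this measure takes the QR decomposition of the extracted-feature matrix and accumulates the (squared) absolute diagonal of $\mathbf{R}$; whether the cited TEV definition uses the absolute value or its square is an assumption here, as is the omission of the $1/N$ normalization, consistently with the rest of the text.

```python
import numpy as np

def total_explained_variance(X, U):
    """Total Explained Variance of the features X' = U^T X.

    As described in the text, each feature's explained variance is read
    from the absolute value of the diagonal of R in a QR decomposition,
    which discounts variance already explained by previous features.
    Squaring the diagonal entries is an assumption of this sketch.
    """
    Xp = U.T @ X                      # extracted features, one row per feature
    _, R = np.linalg.qr(Xp.T)         # QR over (samples x features)
    return np.sum(np.abs(np.diag(R)) ** 2)

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 300))
U = rng.standard_normal((5, 2))
tev = total_explained_variance(X, U)
print(tev)
```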
[Figure 1 panels: a) PCA, b) CCA, c) OPLS]
[Figure 2 panels: a) PCA, b) CCA, c) OPLS]
In the light of these results, we confirm the problems of the Procrustes-based approaches, and verify that the proposed MVA implementation can overcome them. In particular:
From Figure 1, we can conclude that, when several extracted features are considered, Procrustes-based approaches are not able to converge to the standard MVA solution. The proposed versions, however, converge to exactly the same solution as the original methods.
The proposed MVA approach is able to extract more informative features, as demonstrated by larger TEV values (for any number of extracted features) than those achieved with the Procrustes method. This is a direct consequence of the uncorrelation among the extracted features.
Last but not least, the standard deviation of all Procrustes-based solutions reveals a serious initialization dependency. Note that the proposed solutions, as well as the original MVA methods, converge to the same solution for all initializations.
For the sake of completeness, we also analyze the uncorrelation among the extracted features when an $\ell_{2,1}$ penalty is used. For this purpose, we directly measure the Correlation of the Extracted Features (CEF) by calculating the Frobenius norm
thus, CEF values different from zero reveal correlations among the extracted components. In particular, Figure 3 includes the values of the CEF measure for different Sparsity Rates (SR), from zero to 80% (exploring higher sparsity rates would lead all approaches to set all coefficients to zero, and the CEF measure would no longer make sense). For this study, three different initializations for the Procrustes-based methods are considered: (1) ‘Proc-random’, which, like our proposed implementations, uses uniformly random values in the range from 0 to 1; (2) ‘Proc-orthog’, which uses an orthogonal matrix given by the eigenvectors of the covariance matrix (note that in SPCA this initialization equals the standard PCA solution), as proposed in  and ; and (3) ‘Proc-ideal’, which directly starts the iterative process from the ideal solution obtained when no sparsity regularization is used.
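A possible implementation of the CEF measure is sketched below; since the displayed formula is not reproduced above, the exact form assumed here (the Frobenius norm of the off-diagonal part of the autocorrelation matrix of the extracted features) is an assumption, chosen so that CEF is zero exactly when the features are uncorrelated.

```python
import numpy as np

def cef(X, U):
    """Correlation of the Extracted Features (assumed form).

    Frobenius norm of the off-diagonal entries of the autocorrelation
    matrix U^T Cxx U of the extracted features, so that CEF = 0 if and
    only if the features are perfectly uncorrelated.
    """
    Cxx = X @ X.T
    G = U.T @ Cxx @ U
    return np.linalg.norm(G - np.diag(np.diag(G)), 'fro')

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 100))

# PCA directions give uncorrelated features, hence a CEF close to zero ...
_, U_pca = np.linalg.eigh(X @ X.T)
print(cef(X, U_pca))

# ... while a random projection generally does not.
U_rand = rng.standard_normal((4, 2))
print(cef(X, U_rand))
```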
When the regularization term is included, no approach (including the proposed one) is able to obtain complete uncorrelation among the extracted features; even so, the CEF values reveal that our proposal achieves a higher degree of uncorrelation than the Procrustes approaches (independently of their initialization). Besides, when the regularization parameter is close to zero (SR = 0%), the Procrustes versions are only able to obtain uncorrelated features if they are initialized with the ideal solution (Proc-ideal).
6 Conclusions
Solutions for regularized MVA approaches are based on an iterative scheme consisting of two coupled steps. Whereas the first step eases the inclusion of regularization terms, the second results in a constrained minimization problem which is generally solved as an orthogonal Procrustes problem. Despite the generalized use of this scheme, it fails to obtain a new subspace of uncorrelated features, this being a desired property of MVA solutions.
In this paper we have analyzed the drawbacks of these schemes, proposing an alternative algorithm that enforces the uncorrelation property. The advantages of the proposed technique over the methods based on Procrustes have been discussed theoretically, and further confirmed via simulations.
This work has been partly supported by MINECO project TEC2014-52289-R.
-  K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 6, pp. 559–572, 1901.
-  H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
-  H. Wold, “Estimation of principal components and related models by iterative least squares,” in Multivariate Analysis. Academic Press, 1966, pp. 391–420.
-  ——, “Non-linear estimation by iterative least squares procedures,” in Research Papers in Statistics. Wiley, 1966, pp. 411–444.
-  K. Worsley, J. Poline, K. Friston, and A. Evans, “Characterizing the response of PET and fMRI data using multivariate linear models (MLM),” Neuroimage, vol. 6, pp. 305–319, 1998.
-  M. A. J. van Gerven, Z. C. Chao, and T. Heskes, “On the decoding of intracranial data using sparse orthonormalized partial least squares,” Journal of Neural Engineering, vol. 9, no. 2, pp. 26017–26027, 2012.
-  L. K. Hansen, “Multivariate strategies in functional magnetic resonance imaging,” Brain and Language, vol. 102, no. 2, pp. 186–191, 2007.
-  J. Arenas-García and G. Camps-Valls, “Efficient kernel orthonormalized PLS for remote sensing applications,” IEEE Trans. Geosci. Remote Sens., vol. 44, pp. 2872–2881, 2008.
-  J. Arenas-García and K. B. Petersen, “Kernel multivariate analysis in remote sensing feature extraction,” in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds. Wiley, 2009.
-  M. Barker and W. Rayens, “Partial least squares for discrimination,” Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
-  H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
-  L. Chen and J. Z. Huang, “Sparse reduced-rank regression for simultaneous dimension reduction and variable selection,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1533–1545, 2012.
-  X. Shi, Y. Yang, Z. Guo, and Z. Lai, “Face recognition by sparse discriminant analysis via joint $\ell_{2,1}$-norm minimization,” Pattern Recognition, vol. 47, no. 7, pp. 2447–2453, 2014.
-  G. C. Reinsel and R. P. Velu, Multivariate reduced-rank regression: theory and applications. Springer New York, 1998.
-  S. Muñoz-Romero, J. Arenas-García, and V. Gómez-Verdejo, “Sparse and kernel OPLS feature extraction based on eigenvalue problem solving,” Pattern Recognition, vol. 48, no. 5, pp. 1797 – 1811, 2015.
-  F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization,” in Advances in Neural Information Processing Systems 23. The MIT Press, 2010, pp. 1813–1821.
-  M. Grant, S. Boyd, and Y. Ye, “CVX: Matlab software for disciplined convex programming,” 2008.
-  J. Kim and H. Park, “Toward faster nonnegative matrix factorization: A new algorithm and comparisons,” in Proc. 8th IEEE Intl. Conf. on Data Mining (ICDM’08). Pisa, Italy: IEEE, December 2008, pp. 353–362.
-  Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate orthogonal sparse embedding for dimensionality reduction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 723–735, April 2016.
-  Z. Hu, G. Pan, Y. Wang, and Z. Wu, “Sparse principal component analysis via rotation and truncation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 875–890, April 2016.
-  Z. Qiao, L. Zhou, and J. Z. Huang, “Effective linear discriminant analysis for high dimensional, low sample size data,” in Proceedings of the World Congress on Engineering, vol. 2. Citeseer, 2008, pp. 1070–1075.
-  ——, “Sparse linear discriminant analysis with applications to high dimensional low sample size data,” IAENG International Journal of Applied Mathematics, vol. 39, no. 1, pp. 48–60, 2009.
-  W. Dou, G. Dai, C. Xu, and Z. Zhang, “Sparse unsupervised dimensionality reduction algorithms,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010, vol. 6321, pp. 361–376.
-  J. Guo, G. James, E. Levina, G. Michailidis, and J. Zhu, “Principal component analysis with sparse fused loadings,” Journal of Computational and Graphical Statistics, vol. 19, no. 4, pp. 947–962, 2010.
-  Y. Han, F. Wu, J. Jia, Y. Zhuang, and B. Yu, “Multi-task sparse discriminant analysis (MtSDA) with overlapping categories,” in Proc. 24th AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, July 2010, pp. 469–474.
-  Y. Liu, F. Shang, W. Fan, J. Cheng, and H. Cheng, “Generalized higher-order orthogonal iteration for tensor decomposition and completion,” in Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 2014, pp. 1763–1771.
-  K. Sjöstrand, M. B. Stegmann, and R. Larsen, “Sparse principal component analysis in medical shape modeling,” in Proc. SPIE Medical Imaging: Image Processing, vol. 6144. San Diego, CA, USA: International Society for Optics and Photonics, March 2006, p. 61444X.
-  B. Ma, S. McLoone, J. Ringwood, and N. Macgearailt, “Selecting signature optical emission spectroscopy variables using sparse principal component analysis,” in Proc. IEEE 11th Intl. Conf. on Computer and Information Technology (ICCIT). IEEE, December 2008, pp. 14–19.
-  P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
-  A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml