Regularized Multivariate Analysis Framework for Interpretable High-Dimensional Variable Selection

Multivariate Analysis (MVA) comprises a family of well-known methods for feature extraction that exploit correlations among the input variables representing the data. One important property enjoyed by most such methods is that the extracted features are uncorrelated. Recently, regularized versions of MVA methods have appeared in the literature, mainly with the goal of gaining interpretability of the solution. In these cases, the solutions can no longer be obtained in closed form, and more complex optimization methods that rely on iterating two coupled steps are frequently used. This paper resorts to an alternative approach to solve this iterative problem efficiently. The main novelty of this approach lies in preserving several properties of the original methods, most notably the uncorrelation of the extracted features. Within this framework, we propose a novel method that takes advantage of the ℓ2,1 norm to perform variable selection during the feature extraction process. Experimental results on different problems corroborate the advantages of the proposed formulation in comparison with state-of-the-art formulations.


I Introduction

Multivariate Analysis (MVA) comprises a collection of tools that play a fundamental role in statistical data analysis. These techniques have become increasingly popular since the proposal of Principal Component Analysis (PCA) in 1901 [1]. PCA was proposed as a simple and efficient way to reduce the data dimension by projecting the data onto the directions of largest variance. As illustrated in Fig. 1, PCA learns from a given dataset a set of projection vectors, so that the data can be represented in a low-dimensional space that preserves the directions of the input space along which the data show the largest variance. A typical example used to illustrate PCA is face recognition, where the projection vectors are known as eigenfaces [2]. Nevertheless, PCA has been used in many other applications, and can indeed be considered one of the most widely used tools for feature extraction.

Fig. 1: Feature extraction with Multivariate Analysis. During the training phase, MVA methods learn the most relevant directions for a particular dataset (in face recognition these vectors are known as eigenfaces when the PCA method is applied). Feature extraction is then carried out by multiplying any vector in the input representation space by these eigenvectors. Subsequent learning tools can then be applied in this new subspace of reduced dimensionality.

Other MVA algorithms have emerged that are especially suited to supervised learning tasks (e.g., regression and classification). In these problems, the goal is not just to represent the input data as efficiently as possible; it becomes of major importance to keep the directions of the input space that are most highly correlated with the label information. This is the case of algorithms such as Canonical Correlation Analysis (CCA) [3], Partial Least Squares (PLS) approaches [4, 5], and Orthonormalized PLS (OPLS) [6]. Consider, for instance, the toy classification problem in Fig. 2. In this problem, the direction of maximum variance extracted by PCA (left subplot) results in overlapping distributions of the two classes along this direction, while a supervised method like OPLS (right subplot) successfully identifies the most discriminative information. Although this toy example is based on a classification task, the same advantages of supervised MVA over standard PCA are encountered in regression tasks; see [7] for a detailed theoretical and experimental review of these methods.

Fig. 2: Projected data over the first eigenvector of (a) PCA and (b) OPLS in a binary classification task.

The simplicity of these methods, as well as the availability of highly-optimized libraries for solving the linear algebra problems they involve, justifies the extensive use of MVA in many application fields, such as biomedical engineering [8, 9], remote sensing [10, 11], or chemometrics [12], among many others (see also [7] for a more detailed review of application-oriented research in the field).

An important property of PCA, OPLS, and CCA is that they lead to uncorrelated extracted features, so that the feature extraction process provides additional advantages:

  • The relevance of each extracted feature is directly given by the magnitude of its associated eigenvalue, which simplifies the selection of a reduced subset of features, if necessary.

  • Subsequent learning tasks are simplified, most notably when the inversion of the covariance matrix is required. This is the case for least-squares based problems, such as ridge regression or the lasso (least absolute shrinkage and selection operator) [13].

Standard versions of MVA methods implement just a feature extraction process, in the sense that all original variables are used to build the new features. However, over the last few years there have been many significant contributions to this field that focus on gaining interpretability of the extracted features by incorporating sparsity-inducing norms, such as the ℓ1 norm or the ℓ2,1 norm [14], as a penalty term in the minimization problem. When these regularization terms are included, the projection vectors are encouraged to include zeros in some of their components, making it easier to understand how the new features are built and thus gaining interpretability. In fact, the ℓ2,1 norm rewards solutions that perform a true variable selection process, in the sense that some of the original variables are excluded from all projection vectors at once. In other words, only a subset of the original variables is used to build the new features.

Some of the most significant contributions in this direction are sparse PCA [15], sparse OPLS [8], group-lasso penalized OPLS (also known as Sparse Reduced-Rank Regression, SRRR) [16], and ℓ2,1-regularized CCA (or L21SDA) [17]. All these approaches are based on an iterative process that combines the optimization of two coupled least-squares problems, one of them subject to an orthonormality constraint. Since the inspiring work [15], this constrained least-squares minimization has typically been treated as an orthogonal Procrustes problem [18], an approach that can still be considered mainstream (see, e.g., the very recent works [19, 20]).

A first objective of this paper is to highlight and make the computational intelligence community aware of some limitations derived from the use of orthogonal Procrustes in the context of regularized MVA methods. As explained in [21], these methods

  1. do not converge to their associated non-regularized MVA solutions when the penalty term is removed,

  2. are highly dependent on initialization, and may even fail to progress towards a solution,

  3. do not in general obtain uncorrelated features.

As a solution to these problems, [21] proposes an alternative optimization procedure that avoids the use of the Procrustes solution. In this paper, we briefly review the framework presented in [21] to derive regularized MVA versions, and illustrate the approach and its associated advantages by introducing a novel MVA method that uses the ℓ2,1 norm [14] as regularization term. Apart from the already discussed advantage of performing variable selection over the original variables, this norm enjoys the property of rotational invariance, a fact that we will exploit to significantly reduce the computational cost of the training phase. Although some authors have already adapted the robust variable selection method of [14] to the MVA scenario (see, e.g., the group-lasso penalized OPLS method [16] or the ℓ2,1-regularized CCA [17]), these adaptations are based on orthogonal Procrustes and do not exploit the rotational invariance property of the ℓ2,1 norm, incurring unnecessary extra computational burden.

In short, the main contributions of this paper can be summarized as:

  • Review a framework for regularized MVA methods, and explain an alternative to the most commonly used Procrustes solution to overcome the limitations of this approach.

  • Obtain novel MVA algorithms based on ℓ2,1 regularization.

  • Illustrate the effectiveness of these algorithms to carry out feature extraction and, at the same time, provide insight into the relevance of the original input variables.

The rest of the paper is organized as follows. Section II reviews the common framework for regularized multivariate analysis and explains an advantageous alternative to the use of orthogonal Procrustes in this context. Then, Section III particularizes the MVA framework by including an ℓ2,1 norm penalty, explaining in detail how to derive a computationally efficient solution and pointing out the differences between our proposal and other existing solutions. Section IV is devoted to experiments, and Section V draws the main conclusions of our work.

II Regularized MVA framework: enforcing feature uncorrelation

Let us assume a supervised learning scenario, where the goal is to carry out feature extraction in the input space, learning the projection vectors from a training dataset of input-output pairs, where each pair consists of an input vector and an output vector whose dimensions define the input and output spaces. For notational convenience, we define the input and output data matrices X and Y, with the patterns arranged column-wise. It will be assumed throughout the paper that these matrices are centered [22], so that sample estimates of the input and output covariance matrices, as well as of their cross-covariance matrix, can be computed as Cxx = X X^T, Cyy = Y Y^T, and Cxy = X Y^T, where we have neglected the 1/n scaling factor and the superscript T denotes vector or matrix transposition. The goal of linear MVA methods is to find relevant features by combining the original variables, i.e., X' = U^T X, where the k-th column of the projection matrix U contains the coefficients associated with the k-th extracted feature. Note that we refer to the components of the input vectors as variables, whereas the components of the projected vectors are referred to as features. Consequently, feature extraction means obtaining X' from X, whereas variable selection is the process of selecting a subset of the original variables in X. Moreover, the feature extraction process also implies variable selection when the projection matrix U has some of its rows equal to zero.
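
To make the notation concrete, the following small NumPy sketch (purely illustrative: the shapes, variable names, and random data are our own) builds centered data matrices with column-wise arranged patterns, computes the sample covariance and cross-covariance matrices without the 1/n scaling factor, and extracts features by projecting the inputs with a matrix U:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_outputs, n_samples, n_feat = 10, 3, 100, 2
X = rng.standard_normal((n_vars, n_samples))      # inputs, columns = patterns
Y = rng.standard_normal((n_outputs, n_samples))   # outputs, columns = patterns

# Center both matrices, as assumed throughout the paper.
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

# Sample (cross-)covariance matrices, neglecting the 1/n factor as in the text.
Cxx = X @ X.T          # input covariance,  (n_vars, n_vars)
Cyy = Y @ Y.T          # output covariance, (n_outputs, n_outputs)
Cxy = X @ Y.T          # cross-covariance,  (n_vars, n_outputs)

# Feature extraction: each column of U holds the coefficients of one feature.
U = rng.standard_normal((n_vars, n_feat))          # placeholder projection matrix
X_projected = U.T @ X                              # (n_feat, n_samples)
```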

In this paper, we deal with MVA methods which force the extracted features to be uncorrelated; this applies, at least, to PCA, CCA, and OPLS. MVA methods that do not enforce feature uncorrelation, most notably PLS, are therefore left outside the scope of this paper. A common framework for these regularized MVA methods can be obtained by adding an uncorrelation constraint, U^T Cxx U = I, to the formulation of [23]:

(1)

where W is the matrix of regression coefficients, parameter γ trades off the importance of the regularization term R(U), ‖·‖F denotes the Frobenius norm, and Tr{·} is the trace operator. Finally, different selections of a method-dependent output matrix give rise to the considered MVA methods, with particular choices recovering CCA, OPLS, and PCA (the latter obtained by using the input data themselves as outputs) [23, 24].

The objective function in (1) is composed of two terms. The first term minimizes the reconstruction error incurred when the output matrix is estimated from the projected data X' = U^T X through the regression matrix W. Note that this differs from standard least squares, since the introduction of the projection matrix imposes a representation bottleneck [25]: the outputs have to be approximated from a representation with fewer features than the original matrix X. The regularization term R(U) is usually a particular matrix norm that endows the solution with a desired property. Three common regularization terms are:

  • R(U) = Σ_{i,j} u_{ij}^2, where u_{ij} is the element in the i-th row and j-th column of U. This term is known as Tikhonov, ridge, or ℓ2 regularization, and it is used to improve the conditioning of the solution.

  • R(U) = Σ_{i,j} |u_{ij}|, where |u_{ij}| is the absolute value of u_{ij}. This term is known as lasso or ℓ1 regularization [13], and it is frequently used to induce sparsity in the solution matrix (i.e., to nullify some elements of U).

  • R(U) = Σ_i ‖u^i‖_2, u^i being the i-th row of U. This is known as ℓ2,1 regularization; it penalizes all the coefficients associated with a single input variable as a whole, making them drop to zero jointly and thus favoring variable selection (a small numerical illustration of the three penalties follows this list).
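
As a quick numerical illustration of the three penalties above (the toy matrix below is our own; only the formulas come from the text), note that only the ℓ2,1 penalty is expressed in terms of whole rows of U, which is what ties all the coefficients of an input variable together:

```python
import numpy as np

U = np.array([[ 0.9,  0.0],
              [ 0.0,  0.0],     # a completely zero row: this variable is discarded
              [-0.4,  0.7]])    # rows = input variables, columns = features

l2_penalty  = np.sum(U ** 2)                      # Tikhonov / ridge (squared l2)
l1_penalty  = np.sum(np.abs(U))                   # lasso
l21_penalty = np.sum(np.linalg.norm(U, axis=1))   # sum of row-wise l2 norms

print(l2_penalty, l1_penalty, l21_penalty)
```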

Fig. 3: Examples of the resulting projection matrices for a toy case, considering the ℓ2, ℓ1, and ℓ2,1 regularizations. Coefficients that take a zero value are represented in white.

Fig. 3 depicts the solution matrix U for the above regularization terms on a toy problem. As can be seen, the ℓ1 and ℓ2,1 penalties result in many elements of U dropping to zero. Furthermore, the ℓ2,1 norm provides the sparsity in a structured way, i.e., the coefficients of U are annulled by rows. Since each row of U is associated with a different input variable, this structured sparsity implies that many input variables are completely ignored during feature extraction, so that variable selection is also achieved in this case.

When using non-differentiable penalties, such as the ℓ1 or ℓ2,1 norms, the solution of the minimization problem in (1) cannot be obtained in closed form. However, as shown in [24], an equivalent formulation can be obtained by replacing the uncorrelation constraint in (1) with an orthonormality constraint on the regression matrix, W^T W = I, leading to

(2)

As demonstrated in [24], (1) and (2) provide the same solution, but using (2) is normally more efficient in the most common settings. Furthermore, placing the constraint on matrix W allows the problem to be solved with the two-step iterative method described in Algorithm 1 (please refer to [24] for further details). For simplicity, the solution of the W-step has been written in terms of an auxiliary matrix. As can be seen, the minimization problems involved in the U- and W-steps are coupled, so it becomes necessary to iterate both steps until some convergence criterion is met.

Algorithm 1: Two-step iterative solution of (2), alternating the regularized least-squares U-step and the constrained W-step.

The U-step is a regularized least-squares problem that can take advantage of a great variety of existing efficient solvers [14, 26, 27]. With respect to the W-step, it is important to point out that uncorrelation of the extracted features has to be explicitly enforced. When this is not the case, the above steps can provide infinitely many coupled pairs of solutions which are rotated versions of the desired ones, and the uncorrelation property of the extracted features is lost.

In fact, in the literature the W-step is typically solved using the orthogonal Procrustes approach. As proved in [21], this solution neglects the uncorrelation among the extracted features. In spite of this limitation, since its initial proposal in [15] for the sparse PCA algorithm, it has been extended to supervised approaches such as sparse OPLS [8], group-lasso penalized OPLS [16], and ℓ2,1-regularized CCA (L21SDA) [17]. In this respect, this paper aims (1) to highlight the limitations of Procrustes when used as part of the above iterative method, and (2) to encourage the adoption of an alternative method for the W-step that pursues feature uncorrelation.

II-A Solving the W-step with Orthogonal Procrustes

The minimization problem in the W-step is known as orthogonal Procrustes, and its optimal solution is given by the product of the left and right singular-vector matrices of the singular value decomposition of the auxiliary matrix defined in Algorithm 1 [18].
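
For reference, the classical orthogonal Procrustes solution can be computed with a few lines of NumPy, as sketched below; the auxiliary matrix it is applied to in the actual W-step is the one defined in Algorithm 1, which is not reproduced in this text, so an arbitrary stand-in matrix is used here:

```python
import numpy as np

def nearest_orthonormal(M):
    # Orthogonal Procrustes / polar decomposition [18]: the matrix with
    # orthonormal columns closest to M in Frobenius norm is L @ R, where
    # M = L @ diag(s) @ R is the thin SVD of M.
    L, s, R = np.linalg.svd(M, full_matrices=False)
    return L @ R

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 2))          # stand-in for the W-step auxiliary matrix
W = nearest_orthonormal(M)
print(np.allclose(W.T @ W, np.eye(2)))   # True: W has orthonormal columns
```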

However, when the Procrustes solution is adopted, the uncorrelation of the extracted features is not explicitly imposed during this step. In the simplest case, when regularization is not used, this causes the extracted features to differ from those of the corresponding standard MVA formulation (see [21] for further details and experimental results). In the general regularized case, the Procrustes solution results in higher correlation among the features than the alternative method described in the next subsection, as will also be illustrated in Section IV.

Apart from the correlation among the extracted features, the use of Procrustes also makes the algorithm highly dependent on initialization. For instance, it can be shown that when the regularization is removed and the algorithm is initialized with an orthogonal matrix, it fails to progress at all.

II-B Solving the W-step as an Eigenvalue Problem

In [21], it is proven that the uncorrelation of the extracted features can be obtained if the W-step is solved by means of the following eigenvalue problem:

(3)

where the diagonal matrix in (3) contains the largest eigenvalues arranged in decreasing order.

The desired uncorrelation is obtained because this eigenvalue problem forces the diagonalization of the covariance matrix of the extracted features, which is a necessary condition for the extracted features to be uncorrelated.
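
Computationally, the W-step then amounts to extracting the leading eigenvectors of a symmetric matrix, sorted by decreasing eigenvalue. The sketch below shows this generic operation in NumPy; the specific matrix to be diagonalized is the one prescribed by (3) and Table I, which is not reproduced in this text, so a symmetric stand-in is used:

```python
import numpy as np

def leading_eigenvectors(M, n_feat):
    # Eigenvectors of the symmetric matrix M associated with its n_feat
    # largest eigenvalues, ordered by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(M)        # eigh returns ascending order
    idx = np.argsort(eigvals)[::-1][:n_feat]    # indices of the largest ones
    return eigvecs[:, idx], eigvals[idx]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = A @ A.T                                     # symmetric stand-in matrix
W, lam = leading_eigenvectors(M, n_feat=3)
print(np.allclose(W.T @ W, np.eye(3)))          # eigenvectors are orthonormal
```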

Table I includes a summary of the U- and W-steps for the particular cases of regularized CCA, OPLS, and PCA, when formulating the W-step as an eigenvalue problem. Remember that, once one of the two matrices is available, the other can be straightforwardly computed from it.

TABLE I: Proposed solution for the two coupled steps of the most popular regularized MVA methods. For each of regularized CCA, regularized OPLS, and regularized PCA, the U-step is solved as a regularized least-squares problem and the W-step as an eigenvalue problem.

III MVA methods with ℓ2,1 penalty

In this section, we particularize the presented MVA framework for the ℓ2,1 regularization norm. In this way, we can take advantage of the variable selection property of this norm and obtain an algorithm that can simultaneously perform dimensionality reduction and variable selection.

For this purpose, let us substitute the ℓ2,1 norm of U for the regularization term in (2). Rewriting the minimization problem in terms of the new output matrix, we arrive at

(4)

where the new output matrix depends on the particular MVA method being considered.

Considering the iterative solution detailed in Subsection II-B, the solution of (4) can be obtained by an iterative procedure consisting of two coupled steps:

  1. U-step. For fixed W, find the matrix U that minimizes the following regularized least-squares problem,

    (5)

    In the next subsection, we will further analyze this minimization problem to obtain an efficient solution that exploits the properties of ℓ2,1 regularization.

  2. W-step. For fixed U, matrix W is obtained by solving the eigenvalue problem (3). As already discussed, existing algorithms solve this step using orthogonal Procrustes, with the undesired consequences described in the previous sections.

III-A An Efficient Implementation for the U-step

To solve the U-step, we start from the iterative solution proposed in [14], in which the ℓ2,1 norm is handled through a diagonal reweighting matrix D and U is obtained as:

(6)

where D is a diagonal matrix whose size equals the number of input variables (i.e., the number of rows of U) and whose i-th diagonal element is determined by the ℓ2 norm of the i-th row of U. The straightforward application of this solution would result in a U-step involving two coupled iterative processes: one between U and D, and another between U and W (note that both processes involve U, which couples them).

However, these processes can be decoupled by taking advantage of the fact that W is the solution of an eigenvalue problem (and therefore has orthonormal columns) and rewriting each diagonal term of D as a function that no longer depends on W:

(7)

In this way, the solution of the U-step is independent of matrix W. This result, which follows from the rotational invariance property (over rows) of the ℓ2,1 norm [14], allows us to follow this simplified procedure:

  • Find the optimum U by iterating expressions (7) and (6) until a stopping criterion is met. (Although other criteria could be considered, we stop the method when the difference between the solutions obtained in two consecutive iterations falls below a small constant, or when a maximum number of iterations has been completed.)

  • Compute W in a single step by solving the eigenvalue problem that results from (2) once the optimum U has been fixed.

In this way, we can obtain important computational savings (as we will analyze in the experimental section). Algorithm 2 summarizes this procedure. This approach lets us formulate ℓ2,1-based methods such as ℓ2,1-OPLS and ℓ2,1-PCA, where the new output matrix is Y (ℓ2,1-OPLS) or X itself (ℓ2,1-PCA), as well as ℓ2,1-CCA, which additionally requires a transformation of the output data.

Algorithm 2: ℓ2,1-regularized MVA, iterating the reweighted U-step and computing W in a single final step.
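
Since the pseudocode and the displayed equations (5)-(7) are not reproduced in this text, the sketch below only illustrates the general structure described above, under explicit assumptions: the U-step uses the standard iteratively-reweighted update for ℓ2,1-regularized least squares from [14], and the single final W-step is implemented as an eigen-decomposition chosen so that the extracted features come out uncorrelated. Function and variable names are ours, and the exact updates may differ from those of Algorithm 2:

```python
import numpy as np

def l21_mva(X, Y_tilde, n_feat, gamma, n_iter=50, tol=1e-4, eps=1e-8):
    """Sketch of an l2,1-regularized MVA solver in the spirit of Section III.

    X       : (n_vars, n_samples) centered inputs (columns = patterns)
    Y_tilde : (n_out, n_samples) centered "new output" matrix (Y for OPLS,
              X itself for PCA, a transformed Y for CCA)
    Returns U (n_vars, n_feat) and W (n_out, n_feat) with W.T @ W = I.
    """
    n_vars = X.shape[0]
    Cxx = X @ X.T
    XY = X @ Y_tilde.T                                        # (n_vars, n_out)

    # U-step: iteratively reweighted l2,1 least squares (after Nie et al. [14]).
    A = np.linalg.solve(Cxx + gamma * np.eye(n_vars), XY)     # ridge-like init
    for _ in range(n_iter):
        row_norms = np.linalg.norm(A, axis=1)
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps))) # reweighting matrix
        A_new = np.linalg.solve(Cxx + gamma * D, XY)          # reweighted LS update
        if np.linalg.norm(A_new - A) < tol:
            A = A_new
            break
        A = A_new

    # Single W-step (assumption): diagonalize A.T @ Cxx @ A so that
    # U.T @ Cxx @ U becomes diagonal, i.e., the extracted features are
    # empirically uncorrelated.
    eigvals, eigvecs = np.linalg.eigh(A.T @ Cxx @ A)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_feat]]
    U = A @ W
    return U, W

# Toy usage (OPLS-like choice: Y_tilde = Y). Rows of U with near-zero norm
# correspond to input variables discarded by the l2,1 penalty.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200)); X -= X.mean(axis=1, keepdims=True)
Y = rng.standard_normal((4, 200));  Y -= Y.mean(axis=1, keepdims=True)
U, W = l21_mva(X, Y, n_feat=2, gamma=1.0)
print(np.linalg.norm(U, axis=1))
```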

III-B Differences with State-of-the-Art Approaches

In principle, the previously proposed L21SDA [17] and SRRR [16] algorithms attempt to solve the same problems as the ℓ2,1-CCA and ℓ2,1-OPLS algorithms presented in this paper. However, due to the procedure they follow, these state-of-the-art algorithms suffer from the following important drawbacks:

  • First, they present the aforementioned drawbacks of all Procrustes-based MVA methods.

  • Second, they do not exploit the rotational invariance property, resulting in a considerably larger computational burden in comparison with our proposal. Whereas our solution completely decouples the two iterative procedures and reduces them to a single iterative process, with W computed only once at the end, the L21SDA method has to obtain the value of W inside the iterative procedure. The case of the SRRR algorithm is even worse, since it does not merge the two iterative processes, leading to a much more expensive solution. The following section analyzes these issues on some real problems.

IV Experiments

This section analyzes the advantages of the proposed ℓ2,1-MVA framework from different points of view. To this end, it is split into three subsections, each focusing on a different advantage of our proposal. The first subsection shows the benefits of including ℓ2,1 regularization to provide MVA methods with variable selection capabilities. Then, we analyze the ability of MVA approaches and, in particular, of the ℓ2,1-MVA methods to deal with data that exhibit high multicollinearity among the input variables; this is a difficult situation for MVA methods, since linear dependency among the input variables can cause large fluctuations in the solution. Finally, we show the importance of avoiding the Procrustes solution by comparing our proposals with the state-of-the-art L21SDA and SRRR methods.

IV-A Variable Selection by means of ℓ2,1 Regularization

In this section, we deal with a hyperspectral image segmentation and classification problem. The data come from hyperspectral sensors mounted on satellite or airborne platforms, which acquire the energy reflected by the Earth with high spatial detail and in several wavelengths. In particular, we have selected the standard Airborne Visible/Infrared Imaging Spectrometer image taken over Northwest Indiana's Indian Pines site in June 1992 [28]. This dataset consists of 220 spectral bands, of which 20 noisy bands cover the water absorption region. Discriminating among the major crop classes in the area can be very difficult (in particular, given the moderate spatial resolution of 20 m), making the scene a challenging benchmark for validating the classification accuracy of hyperspectral imaging algorithms. Moreover, the large number of narrow spectral bands induces high collinearity among the variables, making MVA approaches a powerful tool for this application.

The selected hyperspectral image contains a number of quite unbalanced classes (ranging from 20 to 10,776 pixels). Among the available 21,025 labeled pixels, a subset was used for training the feature extractors and classifiers, and the remaining pixels were set apart for testing purposes. The discriminative power of all extracted features was tested using a linear SVM classifier.

To analyze the advantages of the ℓ2,1 penalty as a variable selection tool, we evaluate the performance of the proposed regularized MVA framework when different regularizations are used: the ridge (ℓ2) penalty, the lasso (ℓ1) regularization, and the ℓ2,1 penalty. For this first study, only OPLS methods have been considered. Fig. 4 shows the reconstructed image, or classification map, for these three regularized versions of OPLS, including their overall accuracy (OA) over the test data and the percentage of selected bands. The regularization parameters, as well as the number of selected variables of the ℓ2,1-OPLS method, were adjusted using five-fold cross-validation on the training set. In particular, we explored a rectangular grid of values for γ and for the SVM regularization parameter, and we checked that these intervals were sufficiently large to ensure that the grid limits were never selected by the CV. After this validation process, all methods show similar accuracy: the ℓ2- and ℓ1-OPLS versions use almost all bands, whereas the ℓ2,1-OPLS method performs slightly better while using only a small fraction of the spectral bands.

Fig. 4: Natural-color RGB composite of the hyperspectral image (using channels 50, 27, and 17) and its reconstructed images (classification maps) obtained with the ℓ2-OPLS, ℓ1-OPLS, and ℓ2,1-OPLS algorithms.

In order to go deeper into the structured sparsity obtained by ℓ2,1-OPLS and its variable selection capabilities, Fig. 5 depicts, for the three regularized OPLS versions, the importance of each variable over the first three eigenvectors. The total importance of each variable, given as its average importance over all the eigenvectors, is included in the last row of the plot for each algorithm and denoted as “TOT”. In this case, to better analyze the sparsity properties of the different regularization terms, the regularization parameter γ has been fixed in such a way that both the ℓ1 and ℓ2,1 norms provide a similar number of zeros over all the projection vectors. Furthermore, for comparison purposes, all components with a relevance value lower than a small threshold have been dropped to zero. As expected, ℓ1-OPLS is able to nullify some of the eigenvector components and, in a few cases, this results in variable selection (some bands have zeros in all their associated components); ℓ2,1-OPLS, in contrast, presents this sparsity in a structured way (entire columns are zero), so that a large fraction of the input bands is not used by any of the eigenvectors. Regarding ℓ2-OPLS, it is important to remark that it also seems to provide sparsity in its solutions, but this is mainly due to the threshold applied to the relevance values; even so, it removes only a small fraction of the input bands. Furthermore, the ℓ2- and ℓ1-OPLS versions select some of the noisy water absorption bands (marked with three shaded rectangles in the figure), whereas ℓ2,1-OPLS is able to remove all of them.

Fig. 5: Analysis of the band selection capabilities of the ℓ2-OPLS, ℓ1-OPLS, and ℓ2,1-OPLS algorithms. For each algorithm, the first three rows plot the relevance of the components of the first three eigenvectors, and the last row shows the average relevance over all eigenvectors (TOT). Water absorption bands, which should be discarded for the classification task, are highlighted in grey.

Finally, to analyze the influence of the regularization term on the number of selected bands, Fig. 6 displays the total importance of each variable (averaged over all eigenvectors) for ℓ2,1-OPLS when the parameter γ is varied over a wide range. When γ = 0 (no regularization), the method recovers the standard OPLS solution, which is quite similar to the ℓ2-regularized version in terms of sparsity. However, as larger γ values are considered, fewer bands are used; in particular, moderately large values of γ are enough to remove all the water absorption bands.

Fig. 6: Analysis of the band selection capabilities of the ℓ2,1-OPLS method as a function of the penalty parameter γ.

IV-B Dealing with Multicollinearity of the Input Variables

The aim of this subsection is to illustrate the advantages of combining the variable selection and feature extraction processes when there exists multicollinearity among the input variables. This is a difficult situation for MVA methods, since highly correlated variables can cause large fluctuations in the solution in response to small changes in the model or data.

For this purpose, we compare the proposed methods against the state-of-the-art Robust Feature Selection (RFS) algorithm [14], which is an efficient and outlier-robust implementation of the least-squares problem with an ℓ2,1 penalization term. We start this study with a toy regression problem and then carry out a similar evaluation on two classification problems with high dimensionality and multicollinearity among their variables. The main characteristics of these problems are summarized in Table II.

Dataset      Train / Test    Input vars    Output vars (classes)
Carcinomas   139 / 35        9182          11
Yale         120 / 45        1024          15

TABLE II: Main properties of the datasets: number of training and test samples, number of input and output variables (classes), and, for the Yale face dataset, the number of training images per person.

Regression Toy Problem with High Multicollinearity

This toy problem consists of a simple artificial regression problem with three types of input variables: relevant, redundant, and noisy. The relevant variables are generated following a Gaussian distribution with zero mean and variance randomly selected between 0 and 4; the redundant variables are obtained as linear combinations of the relevant ones; and the noisy variables are generated as independent Gaussian variables with zero mean and unit variance. The output vector is obtained from the observation vector through a linear regression model with additive zero-mean Gaussian noise, in which the regression coefficient matrix contains a block of random coefficients (drawn from a uniform distribution) acting on the relevant variables and zero blocks elsewhere, so that the output depends only on the relevant input variables.

Following the above model, we build a set of samples and apply a 70/30 partitioning to obtain the training and test sets, respectively. Then, we normalize both sets to zero mean and unit standard deviation. This process is repeated over 10 random executions, obtaining independent datasets, and the final results are averaged over these runs.

Variable selection is carried out by taking the best variables after sorting them by relevance, measured by the ℓ2 norm of the corresponding rows of the solution matrix, both for RFS and for the ℓ2,1-MVA methods. Once the variables have been selected, an optimal Least Squares (LS) regression model is fitted using as inputs either the selected original variables (in the case of RFS) or the features extracted from the variables selected by the ℓ2,1-MVA algorithms. The iterative process of the ℓ2,1-MVA methods is stopped when a maximum of 50 iterations is reached or when the Frobenius norm of the difference between the solutions obtained in two consecutive iterations falls below a tolerance value.
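
A minimal sketch of this selection step, with hypothetical names and toy data: variables are ranked by the ℓ2 norm of their rows in the projection matrix, the top-ranked ones are kept, and an ordinary least-squares model is refitted on the reduced inputs:

```python
import numpy as np

def rank_variables(U):
    # Relevance of each input variable = l2 norm of its row in the projection
    # matrix (rows with zero norm were discarded by the l2,1 penalty).
    return np.argsort(np.linalg.norm(U, axis=1))[::-1]

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 100))      # (n_vars, n_samples)
y = rng.standard_normal(100)            # scalar target for this toy example
U = rng.standard_normal((10, 2))        # placeholder projection matrix

idx = rank_variables(U)[:3]             # keep the 3 most relevant variables
coef, *_ = np.linalg.lstsq(X[idx, :].T, y, rcond=None)   # LS refit on the subset
```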

Fig. 7: Comparative curves in terms of MSE as a function of the number of selected variables, for two values of the penalty parameter: (a) the value selected by cross-validation, and (b) a value chosen to illustrate robustness to this parameter.

In Fig. 7, the MSE obtained by the proposed ℓ2,1-CCA and ℓ2,1-OPLS algorithms (using all extracted features) and by the reference RFS algorithm is shown as a function of the number of selected variables for two values of the penalty parameter γ: (a) the value selected by cross-validation, and (b) a value chosen to illustrate the robustness of ℓ2,1-OPLS with respect to the selection of this parameter. As can be seen, multicollinearity can cause serious overfitting problems; this is indeed the case for the RFS method which, although robust in the presence of outliers, suffers from severe overfitting caused by the redundant variables of the problem. On the contrary, the MVA methods successfully deal with such problems: they improve on the RFS performance once enough variables are selected, and remain stable after that point with no significant degradation. It can be shown that, with a sufficient number of selected variables, our methods successfully identify the relevant variables in all cases. In addition, the features extracted by the ℓ2,1-MVA methods remain mostly orthogonal, as a consequence of the proposed optimization method.

Real World Classification Problems

This subsection analyzes the advantages of the proposed ℓ2,1-MVA methods, in comparison with the RFS method, on two real classification problems with high dimensionality and multicollinearity among their variables: Carcinomas and Yale.

To make a fair comparison between the methods under study, the free parameter of these models (γ) is selected through 10-fold cross-validation. For this study, the ℓ2,1-MVA methods use all the extracted features computed from the corresponding selected variables. The extracted features (both from the ℓ2,1-MVA methods and from RFS) are then fed to a linear SVM, whose accuracy is used to evaluate the performance of each method. We explored the same ranges of values for γ and for the SVM regularization parameter C as in Subsection IV-A.
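
A hypothetical sketch of this evaluation pipeline with scikit-learn (toy data and grids of our own; the actual ranges used in the paper are not reproduced here): the data are projected with the learned U and the SVM parameter C is validated by 10-fold cross-validation:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 200))        # (n_vars, n_samples)
y_train = rng.integers(0, 3, size=200)          # toy class labels
U = rng.standard_normal((50, 5))                # placeholder projection matrix

features = (U.T @ X_train).T                    # samples as rows for scikit-learn
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=10)
grid.fit(features, y_train)
print(grid.best_params_, grid.best_score_)
```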

Fig. 8 displays the overall accuracy (OA) as a function of the number of selected variables for ℓ2,1-OPLS, ℓ2,1-CCA, and RFS. These results corroborate the conclusions derived from the toy problem: multicollinearity among variables causes RFS to suffer from overfitting, which is more evident in the Carcinomas dataset, whereas the ℓ2,1-MVA approaches overcome this drawback. It is also interesting to see that ℓ2,1-OPLS clearly outperforms the other methods. According to the ℓ2,1-OPLS and ℓ2,1-CCA curves, one might conclude that all the relevant information of the Carcinomas dataset lies within 2% of the variables, which is where these algorithms reach their maximum performance.

Fig. 8: Comparative curves in terms of overall accuracy as a function of the number of selected variables for the Carcinomas and Yale datasets.

IV-C Comparison with L21SDA and SRRR

Whereas the previous subsection compared our approaches with a pure variable selection method, in this subsection we carry out a comparison against the state-of-the-art approaches based on the Procrustes solution. Remember that SRRR [16] can be seen as an OPLS with an ℓ2,1 penalty, whereas L21SDA [17] is a CCA version including the same penalty term. To the best of our knowledge, no PCA method with an ℓ2,1 penalty has appeared in the literature to date, but its derivation following a Procrustes formulation is straightforward, and it is considered here for the sake of completeness.

Here, the experimental procedure is the same as in the previous subsection, but the curves shown below are plotted as a function of the number of extracted features instead of the number of selected variables. Fig. 9 shows the OA obtained according to the number of extracted features. When a low number of features is extracted, ℓ2,1-CCA and ℓ2,1-OPLS clearly outperform the L21SDA and SRRR methods. This advantage is due to the ability of the proposed framework to extract a set of mostly uncorrelated features, which makes the training of the subsequent classifier easier and the selection of an optimum reduced subset of features straightforward.

To analyze this issue in detail, Figs. 10 and 11 show the correlation matrices of the projected data and the discrimination capabilities of these methods for the Carcinomas dataset. As expected, the proposed ℓ2,1-MVA approaches, unlike the Procrustes-based schemes, are able to obtain almost complete uncorrelation among the projected data (note that the off-diagonal terms are almost null for our proposed methods). This directly improves the discrimination capability of the projected features, as Fig. 11 reveals. In this plot, we can see that the ℓ2,1-CCA algorithm is able to project the data onto a two-dimensional space with hardly any overlap among classes, making the subsequent classification task easier (in this case, the OA is close to 60%), whereas L21SDA projects most of the classes onto the same region and, therefore, the classifier accuracy is reduced by half (the OA is around 30%).

Note that, when all the extracted features are used, the results are the same, since the final classifier (SVM) uses all the projected information, which is just a reconstruction from the original space.

Fig. 9: Comparative curves in terms of OA as a function of the number of extracted features for ℓ2,1-CCA, ℓ2,1-OPLS, and the reference methods L21SDA and SRRR.
Fig. 10: Comparison of the feature correlation matrices obtained by our proposed methods and by the Procrustes-based methods proposed in the literature, for the Carcinomas dataset.
Fig. 11: Comparison of the discriminative power of ℓ2,1-CCA and L21SDA when using only the first two extracted features of the Carcinomas dataset.

Finally, a comparative study of the computational burden is shown in Fig. 12. As expected, the proposed methods are computationally more efficient, as a direct consequence of exploiting the rotational invariance of the ℓ2,1 norm, as explained in Subsection III-A.

Fig. 12: Training time (in seconds) required by the ℓ2,1-CCA, ℓ2,1-OPLS, L21SDA, and SRRR algorithms.

V Conclusions

Regularized MVA approaches are typically solved with an iterative procedure consisting of two coupled steps. Whereas the first step eases the inclusion of regularization terms, the second results in a constrained minimization problem which has typically been solved as an orthogonal Procrustes problem. Despite the widespread use of this scheme, it fails to obtain a subspace of uncorrelated features, a desired property of MVA solutions. In this paper, we have analyzed the drawbacks of such schemes and resorted to an alternative to the Procrustes solution for the second step that enforces uncorrelation among the extracted features, thus overcoming the drawbacks of previous schemes.

To show the practical advantages of our regularized MVA solution, this paper particularizes the framework to derive MVA methods implementing ℓ2,1 regularization. The proposed ℓ2,1-MVA methods provide an efficient selection of the relevant variables of the problem by exploiting the rotational invariance of the ℓ2,1 norm. At the same time, they can deal with multicollinearity problems through feature extraction, providing mostly uncorrelated features.

Finally, experimental results on high-dimensional problems show that the methods included in this MVA framework are not only computationally more efficient than previous state-of-the-art solutions, but can also improve on their performance.

Acknowledgment

This work has been partly supported by MINECO projects TEC2013-48439-C4-1-R, TEC2014-52289-R and TEC2016-75161-C2-2-R, and Comunidad de Madrid projects PRICAM P2013/ICE-2933 and S2013/ICE-2933.

References

  • [1] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 6, pp. 559–572, 1901.
  • [2] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  • [3] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
  • [4] H. Wold, “Estimation of principal components and related models by iterative least squares,” in Multivariate Analysis.   Academic Press, 1966, pp. 391–420.
  • [5] ——, “Non-linear estimation by iterative least squares procedures,” in Research Papers in Statistics.   Wiley, 1966, pp. 411–444.
  • [6] K. Worsley, J. Poline, K. Friston, and A. Evans., “Characterizing the response of PET and fMRI data using multivariate linear models (MLM),” Neuroimage, vol. 6, pp. 305–319, 1998.
  • [7] J. Arenas-García, K. B. Petersen, G. Camps-Valls, and L. K. Hansen, “Kernel multivariate analysis framework for supervised subspace learning: A tutorial on linear and kernel multivariate methods,” IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 16–29, July 2013.
  • [8] M. A. J. van Gerven, Z. C. Chao, and T. Heskes, “On the decoding of intracranial data using sparse orthonormalized partial least squares,” Journal of Neural Engineering, vol. 9, no. 2, pp. 26 017–26 027, 2012.
  • [9] L. K. Hansen, “Multivariate strategies in functional magnetic resonance imaging,” Brain and Language, vol. 102, no. 2, pp. 186–191, 2007.
  • [10] J. Arenas-García and G. Camps-Valls, “Efficient kernel orthonormalized PLS for remote sensing applications,” IEEE Trans. Geosci. Remote Sens., vol. 44, pp. 2872–2881, 2008.
  • [11] J. Arenas-García and K. B. Petersen, “Kernel multivariate analysis in remote sensing feature extraction,” in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds.   Wiley, 2009.
  • [12] M. Barker and W. Rayens, “Partial least squares for discrimination,” Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
  • [13] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [14] F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Advances in Neural Information Processing Systems 23.   The MIT Press, 2010, pp. 1813–1821.
  • [15] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
  • [16] L. Chen and J. Z. Huang, “Sparse reduced-rank regression for simultaneous dimension reduction and variable selection,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1533–1545, 2012.
  • [17] X. Shi, Y. Yang, Z. Guo, and Z. Lai, “Face recognition by sparse discriminant analysis via joint L2,1-norm minimization,” Pattern Recognition, vol. 47, no. 7, pp. 2447–2453, 2014.
  • [18] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
  • [19] Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate orthogonal sparse embedding for dimensionality reduction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 723–735, April 2016.
  • [20] Z. Hu, G. Pan, Y. Wang, and Z. Wu, “Sparse principal component analysis via rotation and truncation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 875–890, April 2016.
  • [21] S. Muñoz-Romero, V. Gómez-Verdejo, and J. Arenas-García, “Why (and how) avoid orthogonal procrustes in regularized multivariate analysis,” arXiv preprint, arXiv:submit/1555588, 2016.
  • [22] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis.   Cambridge University Press, 2004.
  • [23] G. C. Reinsel and R. P. Velu, Multivariate Reduced-Rank Regression: Theory and Applications.   Springer New York, 1998.
  • [24] S. Muñoz-Romero, J. Arenas-García, and V. Gómez-Verdejo, “Sparse and kernel OPLS feature extraction based on eigenvalue problem solving,” Pattern Recognition, vol. 48, no. 5, pp. 1797 – 1811, 2015.
  • [25] S. Roweis and C. Brody, “Linear heteroencoders,” Gatsby Computational Neuroscience Unit, Tech. Rep. 1999-002, 1999.
  • [26] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.
  • [27] J. Kim and H. Park, “Toward faster nonnegative matrix factorization: A new algorithm and comparisons,” in Proc. 8th IEEE Intl. Conf. on Data Mining (ICDM’08).   Pisa, Italy: IEEE, December 2008, pp. 353–362.
  • [28] (2013) Purdue Univ. Web site. College of Engineering. [Online]. Available: http://dynamo.ecn.purdue.edu/~biehl/MultiSpec