Sparse Principal Component Analysis (SPCA) using Variable Projection
Sparse principal component analysis (SPCA) has emerged as a powerful technique for modern data analysis. We discuss a robust and scalable algorithm for computing sparse principal component analysis. Specifically, we model SPCA as a matrix factorization problem with orthogonality constraints, and develop specialized optimization algorithms that partially minimize a subset of the variables (variable projection). The framework incorporates a wide variety of sparsity-inducing regularizers for SPCA. We also extend the variable projection approach to robust SPCA, for any robust loss that can be expressed as the Moreau envelope of a simple function, with the canonical example of the Huber loss. Finally, randomized methods for linear algebra are used to extend the approach to the large-scale (big data) setting. The proposed algorithms are demonstrated using both synthetic and real world data.READ FULL TEXT VIEW PDF
Sparse Principal Component Analysis (SPCA) using Variable Projection
A wide range of phenomena in the physical, engineering, biological, and social sciences feature rich dynamics that give rise to multiscale structures in both space and time, including fluid dynamics, atmospheric-ocean interactions, climate modeling, epidemiology, and neuroscience. Remarkably, the underlying dynamics of such systems are typically inherently low-rank in nature, generating data sets where dimensionality reduction techniques, such as principal component analysis (PCA), can be used as a critically enabling diagnostic tool for interpretable characterizations of the dynamics. PCA decompositions express time-varying patterns as a linear combination of the dominant correlated spatial activity of the state of a system as it evolves in time. Although commonly used, the PCA approach generates global modes that often mix or blend various spatio-temporal scales, and cannot identify underlying governing dynamics that act at separate scales. Moreover, classic PCA also tends to overfit data where the number of observations is smaller than the number of variables .
Constrained or regularized matrix decompositions provide a more flexible approach for modeling dynamic patterns. Specifically, prior information can be introduced through sparsity promoting regularizers to obtain a more parsimonious approximation of the data which typically provides improved interpretability. Among others, sparse principal component analysis (SPCA) has emerged as a popular and powerful technique for modern data analysis. SPCA promotes sparsity in the modes, i.e., the sparse modes have only a few active coefficients, while the majority of coefficients are constrained to be zero. The resulting sparse modes are often highly localized and more interpretable than the global PCA modes obtained from traditional PCA. As a consequence, sparse regularization of PCA allows for a decomposition strategy that can specifically identify localized spatial structures in the data and disambiguate between distinct time scales, both of which are ubiquitous in measurement data of complex systems. As an example, one only needs to consider the physical phenomenon of El Nino which is a mode characterized by a localized warm temperature profile which traverses the southern Pacific ocean. This is a highly localized mode that, as will be shown, is well characterized by SPCA, while standard PCA gives a global mode with nonzero values across the entire globe.
While the idea of sparsifying the weight vectors is not new, simple ad-hoc techniques such as naive thresholding can lead to misleading results. A formal approach to SPCA, usingregularization, was first proposed by Jolliffe et al. . This pioneering work lead to a variety of sparsity promoting algorithms [59, 17, 19, 49, 48, 56, 32]
. The success of sparse PCA in obtaining interpretable modes motivates the general approach developed in this paper. Specifically, our method offers three immediate improvements over previously proposed SPCA algorithms: (1) a faster and scalable algorithm, (2) robustness to outliers, and (3) straightforward extension to nonconvex regularizers, including thenorm. Scalability is essential for many applications — for example, dynamical systems generate very large-scale datasets, such as the sea surface temperature data analyzed in this paper. Robust formulations allow SPCA to be deployed in a broader setting, where data contamination could otherwise hide sparse modes. Nonconvex regularizers are not currently available in SPCA software — we show that the modes we get with these approaches are better in synthetic examples, and more interpretable for real data.
In this work, we develop a scalable and robust approach for SPCA. A key feature of the approach is the use of variable projection to partially minimize over the orthogonally constrained variables. This idea was used in the original alternating approach of , and we innovate on this idea by recasting the problem as a value-function optimization. This viewpoint allows for significantly faster algorithms, scalability, and broader applicability. We also allow nonconvex regularization on the loadings, which further improves interpretability and sparsity. Not only does the method scale well, but it is further accelerated using randomized methods for linear algebra . Further, the proposed approach extends to robust SPCA formulations, which can obtain meaningful principal components even with grossly corrupted input data. The outliers are modeled as perturbations to the data, as in the robust PCA model [12, 7, 1]. These innovations provide a flexible and highly-efficient algorithm for modern data analysis and diagnostics that will enable a wide range of critical applications at a scale not previously possible with other leading algorithms.
provides a detailed problem formulation and discusses the variable projection viewpoint which is advocated in this paper. Further, different loss functions and regularizes are discussed. We present the details of the proposed algorithms in Section4. First, the standard case, using the least squares loss function, is discussed. Then, a randomized accelerated and a robust algorithm is presented. The method is applied to several examples in Section 6, where SPCA correctly identifies dynamics occurring at different timescales in multiscale data. We draw conclusions about the method and discuss its outlook in Section 7.
Scalars are denoted by lower case letters , and vectors in are denoted as bold lower case letters . Matrices are denoted by bold capital letters . The transpose of a real matrix is denoted as . The spectral or operator norm of a matrix is denoted as and the Frobenius norm is denoted as .
. The aim of PCA is to find a set of new uncorrelated variables, called principal components (PCs), such that the first PC accounts for the greatest amount of variance in the data, the second PC for the second greatest variance, and so on. More concretely, letbe a real data matrix of dimension , with column-wise zero empirical mean. The rows represent observations and the columns correspond to measurements of variables. The principal components are formed as a linear weighted combination of the variables
where is a vector of weights. This can be expressed more concisely as
with and . The orthonormal matrix rotates the data into a new space, where the principal components sequentially capture the maximum variability in the input data. The columns of are also often denoted as modes, basis functions, principal direction or loadings.
Mathematically, a variance maximization problem can be formulated to find the weight vectors . Alternatively, the problem can be formulated as a least-squares problem, i.e., minimizing the sum of squared residual errors between the input and the projected data
where PCA imposes orthogonality constraints on the weight matrix
. Given the singular value decomposition (SVD) of the centered (standardized) input matrix
the minimizer of (3) is given by the right singular vectors , i.e., we can set . Further, the principal components are the scaled left singular vectors , where the entries of the diagonal matrix are the singular values. In most applications, we are only interested in the first dominant PCs which account for most of the variability in the input data. Thus, PCA allows one to reduce the dimensionality from to by simply truncating the SVD. The dominant PCs can be used to visualize the data in low-dimensional space, and as features for clustering, classification and regression.
Consider any objective of the form
A classic example is the nonlinear least squares problem
The term ‘variable projection’  originally arose from the fact that the least squares projection of onto the range of has a closed form solution, which is used explicitly in iterative methods to optimize for . More generally, the word ‘projection’ is now associated with epigraphical projection , or partial minimization. We can rewrite (4) as a value function optimization problem:
In many cases, the function has an explicit form. In the classic problem (5), we have
where is a projector on the range of . Explicit expressions are not necessary as long as we have an efficient routine to compute
For many problems, we can find first and second derivatives of . For example, when is smooth and is unique, we have
Formulas for second derivatives are collected in 
. Variable projection was recently used to solve a range of large-scale structured problems in PDE-constrained optimization, nuisance parameter estimation, exponential fitting, and optimized dynamic mode decomposition[3, 2, 26, 4].
Sparse PCA aims to find a set of sparse weight vectors, i.e., weight vectors with only a few ‘active’ (nonzero) values. In this manuscript, we build up on the seminal work by Zou, Hastie and Tibshirani , who treat SPCA as an regularized regression-type problem. More concretely, their formulation directly incorporates sparsity inducing regularizers into the optimization problem:
where is a sparse weight matrix and is an orthonormal matrix. denotes a sparsity inducing regularizer such as the LASSO ( norm) or the elastic net (a combination of the and squared norm). The optimization problem is minimized using an alternating algorithm:
Update . With fixed, we solve the optimization problem
The problem splits across the columns of , yielding a regularized regression problem in each case:
The principal components are then formed as a sparsely weighted linear combination of the observed variables . The data can be approximately rotated back as .
Coordinate descent or least angle regression (LARS) are used to solve each of the subproblems . The update relies on solving a strongly convex problem, and in particular the update is unique, and the algorithm as described converges to as stationary point by the analysis of . Replacing with a nonconvex regularizer, such as , makes it difficult to guarantee anything about the update. However, as we show, using the value function (6) from the variable projection viewpoint yields an efficient implementation and a straightforward convergence analysis.
The update in the method of  is in closed form, while the update requires an iterative method. To exploit the efficiency of the update, we think of projecting out and introduce the sparse PCA value function
viewing the original SPCA problem (7) as
We show that is differentiable with a Lipschitz continuous gradient, and derive its explicit form. This viewpoint lets us use any proximal algorithm we want to minimize (8), including proximal gradient (see e.g. ) and FISTA , with the caveat that an update is computed every time is evaluated. For the original SPCA problem, this approach rebalances the work between the and updates, using a single operator to update instead of an iterative routine. When combined with randomized techniques for computing the update, we get an order of magnitude acceleration compared to current SPCA software.
The variable projection viewpoint (8) also allows a robust SPCA approach with the Huber loss function. Simply replacing the quadratic penalty in (7) with a different loss would destroy the efficient structure of the update, requiring an iterative routine to solve for it. Instead, we use a special characterization of the Huber function to obtain a formulation with three rather than two variables, preserving the efficiency of each update. We extend our analysis to this case, so the robust formulation can also be used with any prox-friendly regularizer, including the nonconvex example discussed above.
In summary, the value function viewpoint also makes it easy to extend to a broader problem setting, and we consider the following objective:
where is a separable loss, while is a separable regularizer for .
The SPCA framework incorporates a range of sparsity-inducing regularizers . Sparsity is achieved by introducing additional information into the model to find the most meaningful ‘active’ (non-zero) entries in , while most of the loadings are constrained to be zero. Sparse approaches work well when many variables are redundant, i.e., not required to capture the underlying coherent model structure. Regularization also prevents overfitting, and provides a path to solve ill-posed problems, frequently encountered in the analysis of high-dimensional datasets.
The ‘norm’, denoted or , counts the number of non-zero elements in a vector . When used as a regularizer , it encourages models with small cardinality, i.e., a small number of active loadings. Although is non-smooth and non-convex, its proximal operator is simply hard thresholding (see Table 1).
In many applications, the norm is used to approximate . In the context of least squares problems, using is known as LASSO (least absolute shrinkage and selection operator). The proximal operator of the scaled norm is the soft-thresholding operator, see Table 1.
One drawback of the norm is that it tends to activate only one coefficient from any set of highly correlated variables. The elastic net, introduced by Zou and Hastie , overcomes this drawback, using a linear combination of the and quadratic penalty:
The elastic net has an implicit grouping effect that is particularly useful for the analysis of high-dimensional multiscale physical systems, where we want to find all the associated variables which correspond to an underlying mode, rather than selecting only one variable from each underlying mode. The proximal operator of combines scaling and soft thresholding, see Table 1. Following the same idea, we can also combine and the quadratic penalty:
The regularizer detects correlated sets of very sparse predictors, and its proximal operator of combines scaling and hard thresholding, see Table 1. Figure 1 illustrates these regularizers. Many other examples of proximal operators are collected in .
A large number of separable structured regularizers can be used in the proposed SPCA framework. Separability ensures that the prox-operator can be computed either in closed form or using a routine for both convex and nonconvex regularizers. Here we highlight two examples.
In some applications, selection occurs between groups of variables known a priori. The group lasso regularizer  enforces that all the variables corresponding to these predefined group are either activated or set to Its prox operator can be written as
An extension is the sparse group lasso , which adds an additional penalty for each group. Another useful regularizer is the fused lasso , which gives a way to incorporate information about spatial or temporal structure in the data.
As a standard problem, we discuss the variable projection algorithm for (9) using the least squares loss function. We partially minimize in to obtain the value function
Evaluating this value function given reduces to solving the orthogonal Procrustes problem , with closed form solution
where and are the left and right singular vectors of see Appendix A. Variable projection takes advantage of this closed form solution. Partially minimizing in via the SVD has additional advantages over using an iterative algorithm when is ill-conditioned. This is an important consideration for robust penalties, where a closed form solution for is not immediately available, see Section 4.3.
The SPCA problem (9) is nonconvex, and so is the value function . To better understand , we consider the following simple 2D example
where , . We write explicitly as
Figure 2 shows the level sets of this function, which are clearly nonconvex. We also see that is smooth except at .
We apply proximal gradient methods (see e.g. ) to find a stationary point of the value function (10). It is easy to both evaluate and to compute the gradient. We obtain using (11) and then use the formula
Assume , then the optimality criterion satisfies
where is the stationarity of the objective (defined in Appendix B) and is the Lipchitz constant for .
See Appendix B.3 for the proof.
Randomization allows efficient computation of low-rank approximations such as the SVD and PCA [25, 36, 18, 22]. We form a low-dimensional sketch (representation) of the data, which aims to capture the essential information of the original data. Using this idea, we can reformulate (9) as
where denotes the sketch of . Here, the dimension is chosen slightly larger than the target-rank . We proceed by forming a sample matrix :
where is a randomly generated test matrix 
. Next, an orthonormal basis matrix is obtained by computing the QR-decomposition of the samples matrix. Finally, the sketch is formed by projecting the input matrix to the range of , which is low-dimensional:
This approach is suitable for input matrices with low-rank structure. The computational advantage becomes significant when the intrinsic rank of the data is relatively small compared to the dimension of the ambient measurement space. The quality of the sketch can be improved by computing additional power iterations [25, 22], especially if the singular value spectrum of is only slowly decaying. We suggest computing at least two power iterations by default.
Classically, SPCA is formulated as a least-squares problem, however, it is well-known that the squared loss is sensitive to outliers. In many real world situations we face the challenge that data are grossly corrupted due to measurement errors or other effects. This motivates the need of robust methods which can more effectively account for corrupt or missing data. Indeed, several authors have proposed a robust formulation of SPCA, using the norm as a robust loss function, to deal with grossly corrupted data [39, 15, 29].
For a robust formulation of SPCA, we use a closely related idea of separating a data matrix into a low-rank model and a sparse model. The architecture is depicted in Figure 3.
This form of additive decomposition is well-known as robust principal component analysis (RPCA), and its remarkable ability to separate high-dimensional matrices into low-rank and sparse component makes RPCA an invaluable tool for data science[12, 7, 1]. Specifically, we suggest to use the Huber loss function rather than the norm as the data misfit. The Huber norm overcomes some of the shortcomings for the Frobenius norm and can be used as a more robust measure of fit [28, 38]. We define the Huber loss function as
Figure 4 illustrates the least squares and the Huber loss functions. The Huber loss functions grows at a linear rate for residuals outside the thresholding parameter , rather than quadratically. Hence, the influence of large deviations on the parameters is reduced. This is consistent with using a heavy tail distribution to model measurement errors.
The Huber penalty can be characterized as the (scaled) Moreau envelope of the norm, see Section B.1.4:
This characterization explicitly brings out outliers as sparse perturbations to the data. It also makes it possible to develop efficient algorithms for the robust case. In general, our approach applies to any robust norm that can be characterized as the Moreau envelope of a separable penalty.
A naive approach loses the closed form of (11). To preserve the advantages of partial minimization, we must place the Huber loss on the Procrustean bed of the orthogonal Procrustes problem. We use the Moreau characterization (15) to explicitly model sparse outliers using the variable , and rewrite Eq. (9) as follows:
where is given by with
Problem (17) has the same structure as (10) in the variables , and we can easily modify the algorithm to account for the additional block, as detailed in Algorithm 2. The partial minimization of is a prox evaluation of the 1-norm, which is the soft thresholding operator, see Table 1.
Sparse decompositions are becoming increasingly relevant for data-driven spatiotemporal analysis of physical systems. The recent proliferation of machine learning and manifold learning methods seek interpretable models using physically meaningful constraints[42, 47, 33, 53, 35]. However, standard orthogonal decompositions such as SVD or proper orthogonal decomposition (POD) may suffer from overfitting and the resulting spatial modes are spatially dense. By promoting sparsity in the modes, SPCA is able to yield modes that may be more interpretable.
The goal of spatiotemporal modal analysis is a system decomposition that is separable in space and time,
where is a mode evaluated at a grid of spatial locations. Classical data-driven analysis seeks a low-rank approximation given a data matrix of snapshots in time
The proper orthogonal decomposition is a canonical data-driven decomposition in the analysis of high-dimensional flows, which seeks the optimal rank- orthogonal projection of the data that approximates the covariance of . The optimal low-rank projection is given by the dominant scaled principal components , obtained from the singular value decomposition . We define the modes to be , resulting in the separable decomposition
where . While POD modes numerically approximate the data, they may not be physically meaningful. POD modes do not generally correspond to coherent structures that persist in time. Sparse PCA, on the other hand, imposes sparsity in the spatial modes while maintaining time independence. In our framework, spatial modes are given by
which represents a sparse linear combination of the snapshots and sparse modes . Recall, the columns of are the sparse weight vectors. As we shall demonstrate, SPCA modes display greater correspondence to coherent structures in various flows.
We now apply our SPCA framework to a number of example systems of interest, ordered by increasing complexity. These examples capture many challenges that motivate the new algorithms. The first example is an artificial dataset with high-dimensional measurements and low-dimensional structures across multiple scales. In this example, there is a ground truth, providing a straightforward benchmark for SPCA and robust SPCA. The second example applies SPCA to a highly structured fluid flow, characterized by laminar vortex shedding behind a circular cylinder at low Reynolds number, which is a benchmark problem in fluid dynamics . Fluid flows are ideal for developing interpretable models of multiscale physics and deploying sparse sensors for estimation and control. This is because they are high-dimensional systems that often exhibit low-dimensional coherent patterns that are spatially localized [9, 51]. The third example involves high-dimensional satellite data of the ocean surface temperature, a complex multiscale system that is intimately related to global circulation and climate. In all of these examples, the data is dynamic, high-dimensional, exhibits low-dimensional patterns at multiple scales, and has fewer snapshots in time than measurements in space. The proposed SPCA framework allows efficient computations on large systems, yields robust estimates from noisy data, and gives interpretable modes that can be used for the downstream tasks in dynamical systems modeling and control.
First, we consider a case where spatiotemporal dynamics are generated from three spatial modes oscillating at different frequencies in overlapping time intervals:
The multiscale time dynamics switch on and off irregularly, i.e., the modes effectively appear mixed in time as is common in other real-world phenomena such as weather, climate, etc. Consequently, within a single frame, the three modes occasionally mix, rendering the disambiguation task more challenging. However, we see that SPCA is able to recover the three modes in an unsupervised manner. Specifically, the data is generated on a spatial grid for 150 seconds with timestep s. We flatten the spatial dimensions to obtain a data matrix with measurements for each of the snapshots (observations in time).
The results of PCA and SPCA on the raw frame data are compared and contrasted in Fig. 6. Here the spatial coherent structures extracted by SPCA recover the generating spatiotemporal modes, while PCA is unable to do so. By seeking a parsimonious representation, SPCA is able to accurately correlate spatial structures with their individual time histories. Because PCA has no such constraint, the different spatial structures remain mixed.
In many applications data exhibit grossly corrupted entries that typically arise from process or measurement noise. The least-squares loss function is sensitive to outliers. Thus, SPCA tends to be biased and the results can be misleading. To overcome this, our proposed robust SPCA algorithm can be used. The Huber loss function allows one to separate the input data into a low-rank component plus a sparse component. This is demonstrated in Fig. 7. The robust implementation clearly separates the polluted data into a low-rank component, while capturing the additive salt and pepper noise. However, the robust implementation is computationally more demanding than the standard SPCA algorithm.
PCA has been extensively used in fluid dynamics for decades, where it is known as proper orthogonal decomposition, providing a data-driven generalization of the Fourier transform. Here we apply SPCA to the flow behind a cylinder, a canonical example in fluid dynamics . The data consists of a time series of the vorticity field behind a solid cylinder at Reynolds number 100, which induces laminar vortex shedding downstream. The flow is simulated using an immersed boundary projection method  on a spatial grid for 3 dimensionless time units with timestep . Again, we flatten the spatial dimensions and yield a data matrix with measurements for each of the snapshots (observations in time). The resulting principal components or spatial modes of the flow are widely used for reduced-order modeling, prediction, and control.
The SPCA and PCA eigenmodes are compared in Fig. 8. Both decompositions successfully identify the dominant mode pairs that occur at characteristic harmonic frequencies. However, the mode structures extracted by SPCA are well-bounded and more interpretable, resulting in visible weakening downstream and stronger influence upstream. This is typical of the vortex shedding regime as vortices dissipate while advecting downstream and is not observed in the PCA modes.
Standard PCA has beneficial orthonormality properties that are crucial for projection based reduced-order modeling of high-dimensional systems. However, as experiments and models simulate increasingly complex flows, the field is rapidly moving towards more interpretable decompositions for learning and control. Recent directions in network analysis of turbulence and mixing require robust tracking of sparse spatial structures and vortices. The ability of SPCA to delineate boundaries of vortex dynamics are critical for the scalable decomposition of such high-resolution flow data. Furthermore, SPCA is purely data-driven and works equally well for modal decomposition of high-fidelity computational fluid dynamics (CFD) simulation, as well as robust denoising of experimental data generated by particle image velocimetry and other high-resolution imaging techniques.
We now apply SPCA to satellite ocean temperature data from 1990-2017 , and compare SPCA results from PCA.111The data are provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, and are accessible via their Web site at https://www.esrl.noaa.gov/psd/. The data consists of temporal snapshots which measure the weekly temperature means at spatial grid points. Since we omit data over continents and land, the ambient dimension reduces to observations in our analysis. Our objective is the accurate identification of the intermittent El Niño and La Niña warming events, which are famously implicated in global weather patterns and climate change. The El Niño Southern Oscillation (ENSO) is defined as any sustained temperature anomaly above running mean temperature with a duration of 9 to 24 months. In climate sciences, principal components are also known as empirical orthogonal functions or EOFs; however, traditional PCA struggles to find a low-rank representation of this complex, high-dimensional system.
The canonical El Niño is associated with a narrow band of warm water off coastal Peru that is commonly referred as NIÑO 1+2, 3, 3.4, or 4 to differentiate the types of bands. Traditional PCA is unable to isolate this band, instead combining it with broader spatial signatures across the Pacific and Atlantic in mode 4 (Fig. 8(a)). Nevertheless, this mode is often used to compute the canonical Oceanic Niño Index (ONI). On the other hand, SPCA obtains a dramatic and clean separation of NIÑO 1-4 within the 4th mode (Fig. 8(b)). This is contextualized by the associated temporal mode, which yields sharper peaks during the 1997-1999 and 2014-2016 major El Niño events compared to PCA. The 12-month moving average of the temporal modes for both PCA and SPCA is shown in Fig. 10 and confirms that SPCA differentiates major and minor ENSO events with greater clarity than PCA.
Previous study of this dataset has required a multiresolution time-frequency separation of the data matrix in order to clearly identify the ENSO mode in an unsupervised manner .Without the sparsity constraint, SVD-based methods struggle to obtain a low-rank representation of these complex systems with nonlinear dynamics, coupled interactions and multiple timescales of motion. Based on our findings, SPCA has the potential to yield sparse modal representations of complex systems and coherent structures that may alter our understanding of oceanic and atmospheric phenomena.
Cumulative variance plots reveal that sparse PCA behaves differently on the latter two examples. This can be attributed to the level of stochasticity in each system. The cylinder data has high temporal resolution and is therefore sufficiently well-resolved for sparse PCA to capture nearly all the variance within the low-rank component (Fig. 10(a)). In this case the decomposition is similar to PCA, although spatially more localized. On the other hand, the ocean data has coarse weekly temporal resolution. Therefore, faster dynamics which are not sufficiently resolved appear stochastic. Hence, slower timescales (annual, ENSO) are reflected in the low-rank component of SPCA as indicated by cumulative variance (Fig. 10(b)). PCA, however, overfits with ‘noisy’ components which are not physically meaningful.
To demonstrate the computational performance of the proposed SPCA algorithms we compute the leading components for two data matrices. First, we consider the cases of small data. Figure 12 shows the time until the objective function converges within a tolerance level of . For comparison we show the performance of the SPCA algorithm using least angle regression (LARS) and coordinate descent (CD) as proposed by . Our proposed algorithm based on variable projection outperforms both the LARS and CD algorithm.
Further, the randomized accelerated SPCA algorithm outperforms the deterministic variable projection algorithms. The desired accuracy is achieved about
times faster compared to the deterministic algorithm. This is despite the fact that the randomized algorithms require more iterations than the deterministic algorithm to converge. The computational advantage is even greater for the high-dimensional data setting (i.e., big) as shown in Figure 13 The computational advantage of the randomized algorithm becomes pronounced with increasing dimensions of the input matrix. Hence, the randomized algorithm allows exploring a large space of tuning parameters and is well suited for performing cross-validation.
We have presented a robust and scalable architecture for computing sparse principal component analysis (SPCA). Specifically, we have modeled SPCA as a matrix factorization problem with orthogonality constraints, and developed specialized optimization algorithms that partially minimize a subset of the variables (variable projection). Our SPCA algorithm is scalable and robust, greatly improving computational efficiency over current state-of-the-art methods while retaining comparable performance. More precisely, we have demonstrated that: (i) The value function view approach provides an efficient and flexible framework for SPCA; (ii) Robust SPCA can be formulated using the Huber loss; (iii) A wide variety of sparsity-inducing regularizers can be incorporated into the framework; (iv) The proposed algorithms are computationally efficient for high-dimensional data, i.e, large ; (v) Randomized methods for linear algebra substantially eases the computational demands, while obtaining a near-optimal approximation for low-rank data.
SPCA is a useful diagnostic tool for data featuring rich dynamics that give rise to multiscale structures in both space and time. Given that such phenomena are ubiquitous in the physical, engineering, biological, and social sciences, this work provides a valuable tool for improved interpretability, especially in the diagnostics of localized structures and disambiguation of distinct time scale physical processes. The work also opens a number of avenues for future development:
This scalable approach for identifying spatially localized spatial structures in high-dimensional and multiscale data may be directly applied to 1) tensor decompositions[14, 8, 21], which represent data in a multi-dimensional array structure, 2) parsimonious dynamical systems models , which identify the fewest nonlinear interactions required to capture the underlying physical mechanisms, and 3) in situ sensing and control, where sensors and actuators are generally required to be spatially localized .
The methods developed here will be broadly applicable to dynamical systems that are high-dimensional, multiscale, and where there is a need for interpretable and parsimonious models for prediction, estimation, and control. Specific applications where SPCA has already been applied include genomics, and biological systems in general, and atmospheric chemistry. In addition, there is tremendous opportunity for advances in diverse fields, such as improving climate science, detecting and controlling structures in the brain, and closed loop control of turbulent fluid systems .
The authors would like to thank Daniela Witten for helpful discussions in the early stage of writing this manuscript. Research of A. Aravkin was partially supported by the Washington Research Foundation Data Science Professorship. S. Brunton gratefully acknowledges funding support from the Army Research Office (ARO W911NF-17-1-0306). J. N. Kutz acknoledges support from the Air Force Office of Scientific Research (AFOSR) grant FA9550-17-1-0329.
We seek an orthonormal matrix so that
Indeed, a closed form solution is provided by the SVD. First,we expand the above objective function as
This problem is equivalent to find a orthonormal matrix which maximizes . We proceed by substituting the SVD of so that we yield
Note that is a diagonal matrix with non-negative entries and is an orthonormal matrix for any matrix . Because of this, the trace norm in Eq. (25) is maximized by which turns
into an identity matrix, so that we yield . Hence, an optimal solution is provided by , i.e., the left and right singular vectors of .
In the following we give a brief overview of notation and concepts used to develop and analyze the algorithms in this paper. Further, we review briefly the elements of variational analysis for the theoretical analysis of the algorithm [40, 46].
We consider the collection of all matrices with the same dimension (where could be shorthand for ) as a Hilbert space equipped with the inner product. More concretely, the inner product is defined by the trace and the norm induced by this inner product is the Frobenius norm
For any map , we set,
We say that is -Lipschitz continuous, for some , if the inequality holds.
Constraints, such as those in (7), can be represented using functions from a matrix space to the extended real line defined by . The domain and the epigraph of any function are the defined sets
For any set , we define the distance, projection and indicator function for by
For in (7), and given , we have
Characterizing stationarity (a necessary condition for optimality) is a key step in analyzing the behavior of an algorithm and deriving practical termination criteria. Problem (7) is nonsmooth, so gradients do not exist. Instead, we can use more general concepts of subgradients, which exist for nonsmooth, nonconvex functions.
Consider an arbitrary function and a point with finite. When is convex, the subgradient of at is defined as the collection of tangent affine minorants:
If is differentiable at , then contains only one element, and it is a gradient. When is not differentiable, the subdifferential can contain multiple elements (see Figure 14). From (27), it is clear that implies that for all , i.e. is a global minimum.
When is nonconvex, (27) may not hold globally for any , and we need a localized definition. The Fréchet subdifferential of at , denoted , is the set of all matrices that satisfy
as . The inclusion holds precisely when the affine function underestimates up to first-order near . The limit of Fréchet subgradients along a sequence may not be a Fréchet subgradient at the limiting point . The limiting subdifferential is the set of all matrices for which there exist sequences and that satisfy and . In the nonconvex case, the stationarity condition no longer implies global (or local) optimality. However, it is still a necessary condition, and one that can be checked. We characterize stationarity of (7) by the distance of to the limiting subdifferential .
For any function and real , the Moreau envelope and the proximal mapping are defined by
Let be a proper closed convex function. Then is convex and -smooth with
See Theorem 2.26 of .
The objective function in Equation (9) is non-convex. Hence, we rely on iterative schemes that can find the stationary points of the objective.
Assume that is smooth, we call a pair a stationary point when it satisfies
where is the limiting subdifferential defined in Section B.1. We also introduce a measure of non-stationarity :
In the main text, we show some examples of prox operator corresponding to , but for it might be hard to understand.
Here we give a simple instance of when , the space of orthonormal matrices. When consider orthogonal matrices in two dimension, we could characterize them by a single angle variable ,
For every described as above, we define the tangent direction in :
We now have
In particular, for every element in is a linear combination of the matrices
By definition, the iterates of Algorithm 1 satisfy
From the definition of the objective, we have,