Randomized Matrix Decompositions using R

08/06/2016
by N. Benjamin Erichson, et al.
University of Washington

Matrix decompositions are fundamental tools in the area of applied mathematics, statistical computing, and machine learning. In particular, low-rank matrix decompositions are vital, and widely used for data analysis, dimensionality reduction, and data compression. Massive datasets, however, pose a computational challenge for traditional algorithms, placing significant constraints on both memory and processing power. Recently, the powerful concept of randomness has been introduced as a strategy to ease the computational load. The essential idea of probabilistic algorithms is to employ a degree of randomness in order to derive a smaller matrix from a high-dimensional data matrix. The smaller matrix is then used to compute the desired low-rank approximation. Such algorithms are shown to be computationally efficient for approximating matrices with low-rank structure. We present the R package rsvd, and provide a tutorial introduction to randomized matrix decompositions. Specifically, randomized routines for the singular value decomposition, (robust) principal component analysis, interpolative decomposition, and CUR decomposition are discussed. Several examples demonstrate the routines, and show the computational advantage over other methods implemented in R.


1 Introduction

In the era of ‘big data’, vast amounts of data are being collected and curated in the form of arrays across the social, physical, engineering, biological, and ecological sciences. Analysis of the data relies on a variety of matrix decomposition methods which seek to exploit low-rank features exhibited by the high-dimensional data. Indeed, matrix decompositions are often the workhorse algorithms for scientific computing applications in the areas of applied mathematics, statistical computing, and machine learning. Despite our ever-increasing computational power, the emergence of large-scale datasets has severely challenged our ability to analyze data using traditional matrix algorithms. Moreover, the growth of data collection is far outstripping computational performance gains. The computationally expensive singular value decomposition (SVD) is the most ubiquitous method for dimensionality reduction, data processing and compression. The concept of randomness has recently been demonstrated as an effective strategy for easing the computational demands of low-rank approximations from matrix decompositions such as the SVD, thus allowing for a scalable architecture for modern ‘big data’ applications. Throughout this paper, we make the following assumption: the data matrix to be approximated has low-rank structure, i.e., the rank is smaller than the ambient dimension of the measurement space.

1.1 Randomness as a computational strategy

Randomness is a fascinating and powerful concept in science and nature. Probabilistic concepts can be used as an effective strategy for designing better algorithms. By the deliberate introduction of randomness into computations (Motwani and Raghavan, 1995), randomized algorithms have not only been shown to outperform some of the best deterministic methods, but they have also enabled the computation of previously infeasible problems. The Monte Carlo method, invented by Stan Ulam, Nick Metropolis and John von Neumann, is among the most prominent randomized methods in computational statistics as well as one of the ‘best’ algorithms of the 20th century (Cipra, 2000).

Over the past two decades, probabilistic algorithms have been established to compute matrix approximations, forming the field of randomized numerical linear algebra (Drineas and Mahoney, 2016). While randomness is quite controversial, and is often seen as an obstacle and a nuisance, modern randomized matrix algorithms are reliable and numerically stable. The basic idea of probabilistic matrix algorithms is to employ a degree of randomness in order to derive a smaller matrix from a high-dimensional matrix, which captures the essential information. Thus, none of the ‘randomness’ should obscure the dominant spectral information of the data as long as the input matrix features some low-rank structure. Then, a deterministic matrix factorization algorithm is applied to the smaller matrix to compute a near-optimal low-rank approximation. The principal concept is sketched in Figure 1.

Figure 1: First, randomness is used as a computational strategy to derive a smaller matrix from the ‘big’ high-dimensional data matrix. Then, the low-dimensional matrix is used to compute an approximate matrix decomposition via a deterministic algorithm. Finally, the near-optimal (high-dimensional) factors may be reconstructed.

Several probabilistic strategies have been proposed to find a ‘good’ smaller matrix, and we refer the reader to the surveys by Mahoney (2011), Liberty (2013), and Halko et al. (2011b) for an in-depth discussion and theoretical results. In addition to computing the singular value decomposition (Sarlos, 2006; Martinsson et al., 2011) and principal component analysis (Rokhlin et al., 2009; Halko et al., 2011a), it has been demonstrated that this probabilistic framework can also be used to compute the pivoted QR decomposition (Duersch and Gu, 2017), the pivoted LU decomposition (Shabat et al., 2016), the dynamic mode decomposition (Erichson and Donovan, 2016; Erichson et al., 2017b) and nonnegative matrix factorizations (Erichson et al., 2018).

1.2 The rsvd package: Motivation and contributions

The computational costs of applying deterministic matrix algorithms to massive data matrices can render the problem intractable. Randomized matrix algorithms are becoming increasingly popular as an alternative, and implementations are available in a variety of programming languages and machine learning libraries. For instance, Voronin and Martinsson (2015) provide high-performance, multi-core and GPU-accelerated randomized routines in C.

The rsvd package aims to fill the gap in R, providing the following randomized routines:

  • Randomized singular value decomposition: rsvd().

  • Randomized principal component analysis: rpca().

  • Randomized robust principal component analysis: rrpca().

  • Randomized interpolative decomposition: rid().

  • Randomized CUR decomposition: rcur().

The routines are, in particular, efficient for matrices with rapidly decaying singular values. Figure 2 compares the computational performance of the rsvd() function to other existing SVD routines in R, which are discussed in more detail in Section 3.5. Specifically, for computing the $k$ dominant singular values and vectors, the randomized singular value decomposition function rsvd() results in significant speedups over other existing SVD routines in R. See Section 3.8 for a more detailed performance evaluation.

Figure 2: Runtime speedups (relative performance) of fast SVD algorithms compared to the base svd() routine in R. Here, the $k$ dominant singular values and vectors are computed for random low-rank matrices of varying dimension.

The computational benefits of the randomized SVD translate directly to principal component analysis (PCA), since both methods are closely related. Further, the randomized SVD can be used to accelerate the computation of robust principal component analysis (RPCA). More generally, the concept of randomness also allows one to efficiently compute modern matrix decompositions such as the interpolative decomposition (ID) and CUR decomposition. While the performance of the randomized algorithms depends on the actual shape of the matrix, we can state (as a rule of thumb) that significant computational speedups are achieved if the target rank is several times smaller than the smallest dimension of the matrix. The speedup for tall and thin matrices is in general less impressive than for ‘big’ fat matrices.

The rsvd package is available via the Comprehensive R Archive Network (CRAN):

R> install.packages("rsvd")
R> library("rsvd")

Alternatively, the package can be obtained via Git: https://github.com/erichson/rSVD.

1.3 Organization

The remainder of this paper is organized as follows. First, Section 2 outlines the advocated probabilistic framework for low-rank matrix approximations. Section 3 briefly reviews the singular value decomposition and the randomized SVD algorithm. Then, the rsvd() function and its computational performance are demonstrated. Section 4 first describes the principal component analysis. Then, the randomized PCA algorithm is outlined, followed by the demonstration of the corresponding rpca() function. Section 5 outlines robust principal component analysis, and describes the randomized robust PCA algorithm as well as the rrpca() function. Section 6 gives a high-level overview of the interpolative and CUR decomposition. Finally, concluding remarks and a roadmap for future developments are presented in Section 7.

1.4 Notation

In the following we give a brief overview of some notation used throughout this manuscript.

Scalars are denoted by lower case letters $x$, and vectors in $\mathbb{R}^n$ and $\mathbb{R}^m$ are denoted as bold lower case letters $\mathbf{x}$. Matrices are denoted by bold capital letters $\mathbf{A}$, and the entry at row $i$ and column $j$ is denoted as $\mathbf{A}(i,j)$. This notation is convenient for matrix slicing: for instance, $\mathbf{A}(1{:}i,\,:)$ extracts the first $i$ rows, and $\mathbf{A}(:,\,1{:}j)$ extracts the first $j$ columns. The transpose of a real matrix is denoted as $\mathbf{A}^\top$, and without loss of generality, we restrict most of the following discussion to real matrices. Further, the column space (range) of $\mathbf{A}$ is denoted as $\mathrm{col}(\mathbf{A})$, and the row space as $\mathrm{row}(\mathbf{A})$.

The spectral or operator norm of a matrix is defined as the largest singular value $\sigma_1$ of $\mathbf{A}$, i.e., the square root of the largest eigenvalue $\lambda_1$ of the positive-semidefinite matrix $\mathbf{A}^\top\mathbf{A}$:
$$\|\mathbf{A}\|_2 = \sqrt{\lambda_1(\mathbf{A}^\top\mathbf{A})} = \sigma_1(\mathbf{A}).$$
The Frobenius norm is defined as the square root of the sum of the absolute squares of the matrix elements, which is equal to the square root of the trace of $\mathbf{A}^\top\mathbf{A}$:
$$\|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |\mathbf{A}(i,j)|^2} = \sqrt{\mathrm{trace}(\mathbf{A}^\top\mathbf{A})}.$$

2 Probabilistic framework for low-rank approximations

Assume that a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ has rank $r$, where $r \leq \min\{m,n\}$. Then, in general, the objective of a low-rank matrix approximation is to find two smaller matrices $\mathbf{E} \in \mathbb{R}^{m \times r}$ and $\mathbf{F} \in \mathbb{R}^{r \times n}$ such that:

$$\mathbf{A} \approx \mathbf{E}\mathbf{F}, \qquad (1)$$

where the columns of the matrix $\mathbf{E}$ span the column space of $\mathbf{A}$, and the rows of the matrix $\mathbf{F}$ span the row space of $\mathbf{A}$. The factors $\mathbf{E}$ and $\mathbf{F}$ can then be used to summarize or to reveal some interesting structure in the data. Further, the factors can be used to efficiently store the large data matrix $\mathbf{A}$. Specifically, while $\mathbf{A}$ requires $mn$ words of storage, $\mathbf{E}$ and $\mathbf{F}$ require only $mr + rn$ words of storage.

In practice, most data matrices do not feature a precise rank $r$. Rather, we are commonly interested in finding a rank-$k$ matrix $\mathbf{A}_k$, which is as close as possible to an arbitrary input matrix $\mathbf{A}$ in the least-squares sense. We refer to $k$ as the target rank in the following.

In particular, modern data analysis and scientific computing largely rely on low-rank approximations, since low-rank matrices are ubiquitous throughout the sciences. However, in the era of ‘big data’, the emergence of massive data poses a significant computational challenge for traditional deterministic algorithms.

In the following, we advocate the probabilistic framework, formulated by Halko et al. (2011b), to compute a near-optimal low-rank approximation. Conceptually, this framework splits the computational task into two logical stages:

  • Stage A: Construct a low-dimensional subspace that approximates the column space of $\mathbf{A}$. This means, the aim is to find a matrix $\mathbf{Q} \in \mathbb{R}^{m \times k}$ with orthonormal columns such that $\mathbf{A} \approx \mathbf{Q}\mathbf{Q}^\top\mathbf{A}$ is satisfied.

  • Stage B: Form a smaller matrix $\mathbf{B} := \mathbf{Q}^\top\mathbf{A} \in \mathbb{R}^{k \times n}$, i.e., restrict the high-dimensional input matrix to the low-dimensional space spanned by the near-optimal basis $\mathbf{Q}$. The smaller matrix $\mathbf{B}$ can then be used to compute a desired low-rank approximation.

The first computational stage is where randomness comes into the play, while the second stage is purely deterministic. In the following, the two stages are described in detail.

2.1 The generic randomized algorithm

2.1.1 Stage A: Computing the near-optimal basis

First, we aim to find a near-optimal basis $\mathbf{Q} \in \mathbb{R}^{m \times k}$ for the matrix $\mathbf{A}$ such that

$$\mathbf{A} \approx \mathbf{Q}\mathbf{Q}^\top\mathbf{A} \qquad (2)$$

is satisfied. The desired target rank $k$ is assumed to be $k \ll \min\{m,n\}$. Specifically, $\mathbf{P} := \mathbf{Q}\mathbf{Q}^\top$ is a linear orthogonal projector. A projection operator corresponds to a linear subspace, and transforms any vector to its orthogonal projection on the subspace. This is illustrated in Figure 3, where a vector $\mathbf{x}$ is confined to the column space $\mathrm{col}(\mathbf{A})$.

Figure 3: Geometric illustration of the orthogonal projection operator $\mathbf{P} = \mathbf{Q}\mathbf{Q}^\top$. A vector $\mathbf{x}$ is restricted to the column space of $\mathbf{A}$.

The concept of random projections can be used to sample the range (column space) of the input matrix $\mathbf{A}$ in order to efficiently construct such an orthogonal projector. Random projections are data agnostic, and are constructed by first drawing a set of $k$ random vectors $\{\boldsymbol{\omega}_i\}_{i=1}^{k}$, for instance, from the standard normal distribution. Probability theory guarantees that random vectors are linearly independent with high probability. Then, a set of random projections $\{\mathbf{y}_i\}_{i=1}^{k}$ is computed by mapping $\mathbf{A}$ to low-dimensional space:

$$\mathbf{y}_i = \mathbf{A}\boldsymbol{\omega}_i, \qquad i = 1, \dots, k. \qquad (3)$$

In other words, this process forms a set of independent randomly weighted linear combinations of the columns of $\mathbf{A}$, and reduces the number of columns from $n$ to $k$. While the input matrix is compressed, the Euclidean distances between the original data points are approximately preserved. Random projections are also well known as the Johnson-Lindenstrauss (JL) transform (Johnson and Lindenstrauss, 1984), and we refer to Ahfock et al. (2017) for a recent statistical perspective.

Equation (3) can be efficiently executed in parallel. Therefore, let us define the random test matrix $\boldsymbol{\Omega} \in \mathbb{R}^{n \times k}$, which is again drawn from the standard normal distribution, and the columns of which are given by the vectors $\boldsymbol{\omega}_i$. The sample matrix $\mathbf{Y} \in \mathbb{R}^{m \times k}$, also denoted as the sketch, is then obtained by post-multiplying the input matrix by the random test matrix:

$$\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}. \qquad (4)$$

Once $\mathbf{Y}$ is obtained, it only remains to orthonormalize its columns in order to form a natural basis $\mathbf{Q} \in \mathbb{R}^{m \times k}$. This can be efficiently achieved using the QR decomposition $\mathbf{Y} = \mathbf{Q}\mathbf{R}$, and it follows that Equation (2) is satisfied.
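Stage A can be sketched in a few lines of base R. This is only an illustration, assuming a dense numeric matrix A is already in the workspace and using hypothetical choices for the target rank and oversampling:

R> k <- 10; p <- 10                                             # hypothetical target rank and oversampling
R> Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol(A), k + p)    # random test matrix
R> Y <- A %*% Omega                                             # sketch of the column space
R> Q <- qr.Q(qr(Y))                                             # orthonormal basis via the QR decomposition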

2.1.2 Stage B: Compute the smaller matrix

Now, given the near-optimal basis $\mathbf{Q}$, we aim to find a smaller matrix $\mathbf{B}$. Therefore, we project the high-dimensional input matrix to low-dimensional space:

$$\mathbf{B} = \mathbf{Q}^\top\mathbf{A}. \qquad (5)$$

Geometrically, this is a projection (i.e., a linear transformation) which takes points in a high-dimensional space into corresponding points in a low-dimensional space, as illustrated in Figure 4.

Figure 4: Points in a high-dimensional space are projected into low-dimensional space, while the geometric structure is preserved in a Euclidean sense.

This process preserves the geometric structure of the data in a Euclidean sense, i.e., the lengths of the projected vectors as well as the angles between the projected vectors are preserved. This is due to the invariance of inner products (Trefethen and Bau, 1997). Substituting Equation (5) into (2) then yields the following low-rank approximation:

$$\mathbf{A} \approx \mathbf{Q}\mathbf{B}. \qquad (6)$$

This decomposition is referred to as the QB decomposition. Subsequently, the smaller matrix $\mathbf{B}$ can be used to compute a matrix decomposition using a traditional algorithm.
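Continuing the sketch from Stage A (reusing the objects A and Q generated there), Stage B and a quick quality check of the resulting QB approximation could look as follows:

R> B <- t(Q) %*% A                              # Stage B: project to low-dimensional space
R> norm(A - Q %*% B, "F") / norm(A, "F")        # relative error of the QB approximation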

2.2 Improved randomized algorithm

The basis matrix $\mathbf{Q}$ often fails to provide a good approximation for the column space of the input matrix. This is because most real-world data matrices do not feature a precise rank $k$, and instead exhibit a gradually decaying singular value spectrum. The performance can be considerably improved using the concepts of oversampling and the power iteration scheme.

2.2.1 Oversampling

Most data matrices do not feature an exact rank, which means that the trailing singular values $\sigma_{k+1}, \sigma_{k+2}, \dots$ of the input matrix are non-zero. As a consequence, the sketch $\mathbf{Y}$ does not exactly span the column space of the input matrix. Oversampling can be used to overcome this issue by using $l = k + p$ random projections to form the sketch, instead of just $k$. Here, $p$ denotes the number of additional projections, and a small oversampling parameter is often sufficient to obtain a good basis that is comparable to the best possible basis (Martinsson, 2016).

The intuition behind the oversampling scheme is the following. The sketch $\mathbf{Y}$ is a random variable, as it depends on the drawing of a random test matrix $\boldsymbol{\Omega}$. Increasing the number of additional random projections allows one to decrease the variation in the singular value spectrum of the random test matrix, which subsequently improves the quality of the sketch.

2.2.2 Power iteration scheme

The second method for improving the quality of the basis involves the concept of power sampling iterations (Rokhlin et al., 2009; Halko et al., 2011b; Gu, 2015). Instead of obtaining the sketch $\mathbf{Y}$ directly, the data matrix is first preprocessed as

$$\mathbf{A}^{(q)} = (\mathbf{A}\mathbf{A}^\top)^q\mathbf{A}, \qquad (7)$$

where $q$ is an integer specifying the number of power iterations. This process enforces a more rapid decay of the singular values. Thus, we enable the algorithm to sample the relevant information related to the dominant singular values, while unwanted information is suppressed.

Let $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ be the singular value decomposition. It is simple to show that $\mathbf{A}^{(q)} = \mathbf{U}\boldsymbol{\Sigma}^{2q+1}\mathbf{V}^\top$. Here, $\mathbf{U}$ and $\mathbf{V}$ are orthonormal matrices whose columns are the left and right singular vectors of $\mathbf{A}$, and $\boldsymbol{\Sigma}$ is a diagonal matrix containing the singular values in descending order. Hence, for $q > 0$, the modified matrix $\mathbf{A}^{(q)}$ has a relatively fast decay of singular values even when the decay in $\mathbf{A}$ is modest. This is illustrated in Figure 5, showing the singular values of a low-rank matrix before (red) and after computing power iterations. Thus, substituting Equation (7) into (4) yields an improved sketch:

$$\mathbf{Y} = (\mathbf{A}\mathbf{A}^\top)^q\mathbf{A}\boldsymbol{\Omega}. \qquad (8)$$

Figure 5: Singular value spectrum of a low-rank matrix before and after preprocessing. The computation of power iterations enforces a more rapid decay of singular values.

When the singular values of the data matrix decay slowly, even a small number of power iterations can considerably improve the accuracy of the approximation. The drawback of the power scheme is that additional passes over the input matrix are required.

Algorithm 1 shows a direct implementation of the power iteration scheme. Due to potential round-off errors, however, this algorithm is not recommended in practice.

Algorithm 1 (Direct implementation of the power scheme).
Input: Input matrix A, the sketch Y, and parameter q.
(1) for j = 1, ..., q          perform q power iterations
(2)   Y = A (A^T Y)
(3) Return: Y

Algorithm 2 (Subspace iterations).
Input: Input matrix A, the sketch Y, and parameter q.
(1) for j = 1, ..., q          perform q iterations
(2)   [Q, ~] = qr(Y)           economic QR
(3)   [Z, ~] = qr(A^T Q)       economic QR
(4)   Y = A Z
Return: Y

Algorithm 3 (Normalized power iterations).
Input: Input matrix A, the sketch Y, and parameter q.
(1) for j = 1, ..., q          perform q iterations
(2)   [L, ~] = lu(Y)           pivoted LU
(3)   [Z, ~] = lu(A^T L)       pivoted LU
(4)   Y = A Z
Return: Y

The numerical stability can be improved by orthogonalizing the sketch between each iteration. This scheme is shown in Algorithm 2, and denoted as subspace iteration (Halko et al., 2011b; Gu, 2015). The pivoted LU decomposition can be used as an intermediate step instead of the QR decomposition as proposed by Rokhlin et al. (2009). Algorithm 3 is computationally more efficient, while slightly less accurate.
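A possible R translation of the subspace iteration scheme (Algorithm 2) is sketched below; it assumes a dense input matrix and simply orthonormalizes the sketch between iterations:

R> subspace_iterations <- function(A, Y, q = 2) {
+    for (j in seq_len(q)) {
+      Q <- qr.Q(qr(Y))                  # economic QR of the sketch
+      Z <- qr.Q(qr(crossprod(A, Q)))    # economic QR of t(A) %*% Q
+      Y <- A %*% Z                      # updated sketch
+    }
+    Y
+  }

The orthonormalized sketch returned here can then be passed on to the QR step that forms the basis Q.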

2.3 Random test matrices

The probabilistic framework above essentially depends on the random test matrix used for constructing the sketch . Specifically, we seek a matrix with independent identically distributed (i.i.d.) entries from some distribution, which ensures that its columns are linearly independent with high probability. Some popular choices for constructing the random test matrix are:

  • Gaussian. The default choice to construct a random test matrix is to draw entries from the standard normal distribution, $\mathcal{N}(0,1)$. The normal distribution is known to have excellent performance for sketching in practice. Further, the theoretical properties of the normal distribution enable the derivation of accurate error bounds (Halko et al., 2011b).

  • Uniform. A simple alternative is to draw entries from the uniform distribution. While the behavior is similar in practice, the generation of uniform random samples is computationally more efficient.

  • Rademacher. Yet another approach to construct the random test matrix is to draw independent Rademacher entries. The Rademacher distribution is a discrete probability distribution, where the random variates take the values $+1$ and $-1$ with equal probability. Rademacher entries are simple to generate, and they are cheaper to store than Gaussian and uniform random test matrices (Tropp et al., 2016).

Currently, the rsvd package only supports standard dense random test matrices; however, dense matrix operations can become very expensive for large-scale applications. This is because it takes $O(mnk)$ time to apply an $n \times k$ dense random test matrix to an $m \times n$ dense input matrix. Structured random test matrices provide a computationally more efficient alternative (Woolfe et al., 2008), reducing the costs to $O(mn\log(k))$. Very sparse random test matrices are even simpler to construct (Li et al., 2006; Erichson et al., 2017a), yet slightly less accurate. These can be applied to any dense matrix using fast sparse matrix multiplication routines, at a cost proportional to the number of non-zero (nnz) entries in the sparse random test matrix.
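For illustration, the three test-matrix options can be generated with a few lines of base R. This is a sketch only; the dimensions are hypothetical, the uniform range shown here is an assumption, and the sdist argument of rsvd() selects between such options internally:

R> n <- 1000; l <- 20                                                      # hypothetical dimensions
R> O_normal     <- matrix(rnorm(n * l), n, l)                              # Gaussian entries
R> O_unif       <- matrix(runif(n * l, min = -1, max = 1), n, l)           # uniform entries (range assumed)
R> O_rademacher <- matrix(sample(c(-1, 1), n * l, replace = TRUE), n, l)   # Rademacher entries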

3 Randomized singular value decompositions

The SVD provides a numerically stable matrix decomposition that can be used to obtain low-rank approximations, to compute the pseudo-inverses of non-square matrices, and to find the least-squares and minimum norm solutions of a linear model. Further, the SVD is the workhorse algorithm behind many machine learning concepts, for instance, matrix completion, sparse coding, dictionary learning, PCA and robust PCA. For a comprehensive technical overview of the SVD we refer to Golub and Van Loan (1996), and Trefethen and Bau (1997).

3.1 Brief historical overview

While the origins of the SVD can be traced back to the late 19th century, the field of randomized matrix algorithms is relatively young. Figure 6 shows a short time-line of some major developments of the singular value decomposition. Stewart (1993) gives an excellent historical review of the five mathematicians who developed the fundamentals of the SVD, namely Eugenio Beltrami (1835-1899), Camille Jordan (1838-1921), James Joseph Sylvester (1814-1897), Erhard Schmidt (1876-1959) and Hermann Weyl (1885-1955). The development and fundamentals of modern high-performance algorithms to compute the SVD are related to the seminal work of Golub and Kahan (1965) and Golub and Reinsch (1970), forming the basis for the EISPACK, and LAPACK SVD routines.

Modern partial singular value decomposition algorithms are largely based on Krylov methods, such as the Lanczos algorithm  (Calvetti et al., 1994; Larsen, 1998; Lehoucq et al., 1998; Baglama and Reichel, 2005). These methods are accurate and are particularly powerful for approximating structured, and sparse matrices.

Randomized matrix algorithms for computing low-rank matrix approximations have gained prominence over the past two decades. Frieze et al. (2004) introduced the ‘Monte Carlo’ SVD, a rigorous approach to efficiently compute the approximate low-rank SVD based on non-uniform row and column sampling. Sarlos (2006), Liberty et al. (2007), and Martinsson et al. (2011) introduced a more robust approach based on random projections. Specifically, the properties of random vectors are exploited to efficiently build a subspace that captures the column space of a matrix. Woolfe et al. (2008) further improved the computational performance by leveraging the properties of highly structured matrices, which enable fast matrix multiplications. Eventually, the seminal work by Halko et al. (2011b) unified and expanded previous work on the randomized singular value decomposition and introduced state-of-the-art prototype algorithms to compute the near-optimal low-rank singular value decomposition.

Figure 6: A timeline of major singular value decomposition developments.

3.2 Conceptual overview

Given a real matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \geq n$, the singular value decomposition takes the form

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top. \qquad (9)$$

The matrices $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthonormal so that $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$. The left singular vectors in $\mathbf{U}$ provide a basis for the range (column space), and the right singular vectors in $\mathbf{V}$ provide a basis for the domain (row space) of the matrix $\mathbf{A}$. The rectangular diagonal matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ contains the corresponding non-negative singular values $\sigma_1 \geq \cdots \geq \sigma_n \geq 0$, describing the spectrum of the data.

The so-called ‘economy’ or ‘thin’ SVD computes only the left singular vectors and singular values corresponding to the number (i.e., $n$) of right singular vectors:

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top, \quad \mathbf{U} \in \mathbb{R}^{m \times n},\ \boldsymbol{\Sigma} \in \mathbb{R}^{n \times n},\ \mathbf{V} \in \mathbb{R}^{n \times n}. \qquad (10)$$

If the number of right singular vectors is small (i.e., $n \ll m$), this is a more compact factorization than the full SVD. The ‘economy’ SVD is the default form of the base svd() function in R.

Low-rank matrices feature a rank $r$ that is smaller than the dimension of the ambient measurement space, i.e., $r$ is smaller than the number of columns and rows. Hence, the trailing singular values $\sigma_{r+1}, \dots, \sigma_n$ are zero, and the corresponding singular vectors span the left and right null spaces. The concept of the ‘economy’ SVD of a low-rank matrix is illustrated in Figure 7.

Figure 7: Schematic of the ‘economy’ SVD for a rank-$r$ matrix.

In practical applications matrices are often contaminated by errors, and the effective rank of a matrix can be smaller than its exact rank $r$. In this case, the matrix can be well approximated by including only those singular vectors which correspond to singular values of a significant magnitude. Hence, it is often desirable to compute a reduced version of the SVD:

$$\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top, \qquad (11)$$

where $k$ denotes the desired target rank of the approximation. In other words, this reduced form of the SVD allows one to express $\mathbf{A}$ approximately by the sum of $k$ rank-one matrices:

$$\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. \qquad (12)$$

Choosing an optimal target rank $k$ is highly dependent on the task. One can either be interested in a highly accurate reconstruction of the original data, or in a very low-dimensional representation of dominant features in the data. In the former case $k$ should be chosen close to the effective rank, while in the latter case $k$ might be chosen to be much smaller.

Truncating small singular values in the deterministic SVD gives an optimal approximation of the corresponding target rank $k$. Specifically, the Eckart-Young theorem (Eckart and Young, 1936) states that the low-rank SVD provides the optimal rank-$k$ reconstruction of a matrix in the least-squares sense:

$$\mathbf{A}_k = \underset{\mathrm{rank}(\mathbf{A}') = k}{\operatorname{argmin}} \ \|\mathbf{A} - \mathbf{A}'\|. \qquad (13)$$

The reconstruction error in the spectral and Frobenius norms is given by

$$\|\mathbf{A} - \mathbf{A}_k\|_2 = \sigma_{k+1}, \qquad \|\mathbf{A} - \mathbf{A}_k\|_F = \Big(\sum_{j=k+1}^{n} \sigma_j^2\Big)^{1/2}. \qquad (14)$$

For massive datasets, however, the truncated SVD is costly to compute. The cost to compute the full SVD of an $m \times n$ matrix is of the order $O(mn^2)$, from which the first $k$ components can then be extracted to form $\mathbf{A}_k$.

3.3 Randomized algorithm

Randomized algorithms have been recently popularized, in large part due to their ‘surprising’ reliability and computational efficiency (Gu, 2015). These techniques can be used to obtain an approximate rank-$k$ singular value decomposition at a cost of $O(mnk)$. When the dimensions of $\mathbf{A}$ are large, this is substantially more efficient than truncating the full SVD.

We present details of the randomized low-rank SVD algorithm, which comes with favorable error bounds relative to the optimal truncated SVD, as presented in the seminal paper by Halko et al. (2011b), and further analyzed and implemented in Voronin and Martinsson (2015).

Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a low-rank matrix, and without loss of generality $m \geq n$. In the following, we seek the near-optimal low-rank approximation of the form

$$\mathbf{A} \approx \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top, \qquad (15)$$

where $k$ denotes the target rank. Instead of computing the singular value decomposition directly, we embed the SVD into the probabilistic framework presented in Section 2. The principal concept is sketched in Figure 8.

Figure 8: Conceptual architecture of the randomized singular value decomposition (rSVD). First, a natural basis $\mathbf{Q}$ is computed in order to derive the smaller matrix $\mathbf{B}$. Then, the SVD is efficiently computed using this smaller matrix. Finally, the left singular vectors $\mathbf{U}$ may be reconstructed from the approximate left singular vectors $\tilde{\mathbf{U}}$ by the expression in Eq. (17).

Specifically, we first compute the near-optimal basis $\mathbf{Q}$ using the randomized scheme as outlined in detail above. Note that we allow for both oversampling ($p$) and additional power iterations ($q$) in order to obtain the near-optimal basis matrix. The matrix $\mathbf{B}$ is relatively small if $k \ll n$, and it is obtained by projecting the input matrix to low-dimensional space, i.e., $\mathbf{B} = \mathbf{Q}^\top\mathbf{A}$. The full SVD of $\mathbf{B}$ is then computed using a deterministic algorithm:

$$\mathbf{B} = \tilde{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^\top. \qquad (16)$$

Thus, we efficiently obtain the first $k$ right singular vectors $\mathbf{V}$ as well as the corresponding singular values $\boldsymbol{\Sigma}$. It remains to recover the left singular vectors $\mathbf{U}$ from the approximate left singular vectors $\tilde{\mathbf{U}}$ by pre-multiplying by $\mathbf{Q}$:

$$\mathbf{U} = \mathbf{Q}\tilde{\mathbf{U}}. \qquad (17)$$

The justification for the randomized SVD can be sketched as follows:

$$\mathbf{A} \approx \mathbf{Q}\mathbf{Q}^\top\mathbf{A} = \mathbf{Q}\mathbf{B} = \mathbf{Q}\tilde{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top. \qquad (18)$$

Algorithm 5 presents an implementation using the randomized QB decomposition in Algorithm 4. The approximation quality can be controlled via oversampling and additional subspace iterations. Note that if an oversampling parameter $p > 0$ has been specified, the desired rank-$k$ approximation is simply obtained by truncating the left and right singular vectors and the singular values.

The randomized singular value decomposition has several practical advantages:

  • Lower communication costs: The randomized algorithm presented here requires few (at least two) passes over the input matrix. By passes we refer to the number of sequential reads of the entire input matrix. This aspect is crucial in the area of ‘big data’ where communication costs play a significant role. For instance, the time to transfer the data from the hard-drive into fast memory can be substantially more expensive than the theoretical costs of the algorithm would suggest. Recently, Tropp et al. (2016) have introduced an interesting set of new single pass algorithms, reducing the communication costs even further.

  • Highly parallelizable: Randomized algorithms are highly scalable by design. This is because the computationally expensive steps involve matrix-matrix operations, which are very efficient on parallel architectures. Hence, modern computational architectures such as multithreading and distributed computing, can be fully exploited.

  • General applicability: Randomized matrix algorithms work for matrices with arbitrary rates of singular value decay. The approximation for matrices with rapid singular value decay approaches that of the optimal truncated SVD, with high probability (Martinsson, 2016; Gu, 2015).

Input: Input matrix A with dimensions m × n, and target rank k. Optional: Parameters p and q to control oversampling, and the power scheme.
function rqb(A, k, p, q)
(1) l = k + p                      slight oversampling
(2) Omega = rnorm(n, l)            generate random test matrix
(3) Y = A Omega                    compute sketch
(4) Y = power_scheme(A, Y, q)      optional: compute power scheme via Algorithm 2
(5) [Q, ~] = qr(Y)                 form orthonormal basis
(6) B = Q^T A                      project to low-dimensional space
Return: Q, B

Algorithm 4 A randomized QB decomposition algorithm.

Input: Input matrix A with dimensions m × n, and target rank k. Optional: Parameters p and q to control oversampling, and the power scheme.
function rsvd(A, k, p, q)
(1) [Q, B] = rqb(A, k, p, q)       randomized QB decomposition via Algorithm 4
(2) [U~, Sigma, V] = svd(B)        compute economic SVD
(3) U = Q U~                       recover left singular vectors
Return: U, Sigma, and V

Remark 1.

In general, we achieve good computational performance if the target rank $k$ is much smaller than the ambient dimensions of the input matrix, i.e., $k \ll \min\{m, n\}$.

Remark 2.

As default values for the oversampling and the power iteration scheme we recommend the parameters p = 10 and q = 2, respectively.

Algorithm 5 A randomized SVD algorithm.
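For illustration, Algorithms 4 and 5 can be condensed into a few lines of base R. This is only a conceptual sketch (no input checking, no handling of wide matrices or of the nu/nv arguments, and the function name rsvd_sketch is hypothetical); the rsvd() function should be preferred in practice:

R> rsvd_sketch <- function(A, k, p = 10, q = 2) {
+    l <- k + p                                       # slight oversampling
+    Omega <- matrix(rnorm(ncol(A) * l), ncol(A), l)  # random test matrix
+    Y <- A %*% Omega                                 # sketch
+    for (j in seq_len(q)) {                          # subspace iterations (Algorithm 2)
+      Q <- qr.Q(qr(Y))
+      Z <- qr.Q(qr(crossprod(A, Q)))
+      Y <- A %*% Z
+    }
+    Q <- qr.Q(qr(Y))                                 # orthonormal basis
+    B <- crossprod(Q, A)                             # B = t(Q) %*% A
+    s <- svd(B)                                      # economic SVD of the small matrix
+    list(d = s$d[1:k],                               # truncate to the target rank k
+         u = (Q %*% s$u)[, 1:k, drop = FALSE],
+         v = s$v[, 1:k, drop = FALSE])
+  }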

3.4 Theoretical performance

Let us consider the low-rank matrix approximation $\mathbf{A} \approx \mathbf{Q}\mathbf{B}$, where $\mathbf{B} = \mathbf{Q}^\top\mathbf{A}$. From the Eckart-Young theorem (Eckart and Young, 1936) it follows that the smallest possible error achievable with the best possible basis matrix $\mathbf{Q}$ is

$$\|\mathbf{A} - \mathbf{Q}\mathbf{Q}^\top\mathbf{A}\|_2 \geq \sigma_{k+1}(\mathbf{A}),$$

where $\sigma_{k+1}(\mathbf{A})$ denotes the largest singular value of the residual matrix $\mathbf{A} - \mathbf{A}_k$.

In Algorithm 5 we compute the full SVD of $\mathbf{B}$, so it follows that $\|\mathbf{A} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\| = \|\mathbf{A} - \mathbf{Q}\mathbf{B}\|$, where $\mathbf{U} = \mathbf{Q}\tilde{\mathbf{U}}$. Following Martinsson (2016), the randomized algorithm for computing the low-rank matrix approximation has the following expected error:

$$\mathbb{E}\,\|\mathbf{A} - \mathbf{Q}\mathbf{B}\| \leq \left[1 + \sqrt{\frac{k}{p-1}} + \frac{e\sqrt{k+p}}{p}\sqrt{\min\{m,n\} - k}\right]^{\frac{1}{2q+1}} \sigma_{k+1}(\mathbf{A}). \qquad (19)$$

Here, the operator $\mathbb{E}$ denotes the expectation with respect to a Gaussian test matrix $\boldsymbol{\Omega}$, and Euler's number is denoted as $e$. Further, it is assumed that the oversampling parameter $p$ is greater than or equal to two.

From this error bound it follows that both the oversampling (parameter $p$) and the power iteration scheme (parameter $q$) can be used to control the approximation error. With increasing $p$, the second and third terms on the right-hand side tend towards zero, i.e., the bound approaches the theoretically optimal value $\sigma_{k+1}$. The parameter $q$ accelerates the rate of decay of the singular values of the sampled matrix, while maintaining the same eigenvectors. This yields better performance for matrices with otherwise modest decay. Equation (19) is a simplified version of one of the key theorems presented by Halko et al. (2011b), who provide a detailed error analysis of the outlined probabilistic framework. Further, Witten and Candes (2015) provide sharp error bounds and interesting theoretical insights.
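The behavior of the bound can be examined empirically. The following sketch compares the achieved spectral-norm error of a plain QB approximation (no power iterations) with the theoretically optimal value $\sigma_{k+1}$; the matrix, its size, and the choices of k and p are illustrative only, and the exact numbers are random:

R> set.seed(42)
R> A <- matrix(rnorm(500 * 100), 500, 100)                      # random test problem
R> k <- 10; p <- 10
R> Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol(A), k + p)
R> Q <- qr.Q(qr(A %*% Omega))
R> B <- crossprod(Q, A)
R> norm(A - Q %*% B, "2")                                       # achieved spectral-norm error
R> svd(A)$d[k + 1]                                              # theoretically optimal value sigma_{k+1}

Because this test matrix has a slowly decaying spectrum, the gap between the two values is noticeable; additional power iterations (q > 0) shrink it, in line with the exponent 1/(2q+1) in the bound.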

3.5 Existing functionality for SVD in R

The svd() function is the default option to compute the SVD in R. This function provides an interface to the underlying LAPACK SVD routines (Anderson et al., 1999). These routines are known to be numerically stable and highly accurate, i.e., they attain full double precision.

In many applications the full SVD is not necessary; only the truncated factorization is required. The truncated SVD for an $m \times n$ matrix can be obtained by first computing the full SVD, and then extracting the $k$ dominant components to form $\mathbf{A}_k$. However, the computational time required to approximate large-scale data is tremendous using this approach.

Partial algorithms, largely based on Krylov subspace methods, are an efficient class of approximation methods to compute the dominant singular vectors and singular values. These algorithms are particularly powerful for approximating structured or sparse matrices. This is because Krylov subspace methods only require certain operations defined on the input matrix such as matrix-vector multiplication. These basic operations can be computed very efficiently if the input matrix features some structure like sparsity. The Lanczos algorithm and its variants are the most popular choice to compute the approximate SVD. Specifically, they first find the $k$ dominant eigenvalues and eigenvectors of the symmetric matrix $\mathbf{A}^\top\mathbf{A}$ as

$$\mathbf{A}^\top\mathbf{A}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad i = 1, \dots, k, \qquad (20)$$

where the $\mathbf{v}_i$ are the dominant right singular vectors, and $\lambda_i = \sigma_i^2$ are the corresponding squared singular values. Depending on the dimensions of the input matrix, this operation can also be performed on the outer product $\mathbf{A}\mathbf{A}^\top$. See, for instance, Demmel (1997) and Martinsson (2016) for details on how the Lanczos algorithm builds the Krylov subspace, and subsequently approximates the eigenvalues and eigenvectors.

The relationship between the singular value decomposition and the eigendecomposition can then be used to approximate the left singular vectors as $\mathbf{u}_i = \sigma_i^{-1}\mathbf{A}\mathbf{v}_i$ (see also Section 4.2). However, computing the eigenvalues of the inner product $\mathbf{A}^\top\mathbf{A}$ is not generally a good idea, because this process squares the condition number of $\mathbf{A}$. Further, the computational performance of Krylov methods depends on factors such as the initial guess for the starting vector, and additional steps used to stabilize the algorithm (Gu, 2015). While partial SVD algorithms have the same theoretical costs as randomized methods, i.e., they require $O(mnk)$ floating point operations, they have higher communication costs. This is because the matrix-vector operations do not permit data reuse between iterations.

The most competitive partial SVD routines in R are provided by the svd (Korobeynikov and Larsen, 2016), RSpectra (Qiu et al., 2016), and irlba (Baglama and Reichel, 2005) packages. The svd package provides a wrapper for the PROPACK SVD algorithm (Larsen, 1998). The RSpectra package is inspired by the software package ARPACK (Lehoucq et al., 1998) and provides fast partial SVD and eigendecomposition algorithms. The irlba package implements implicitly restarted Lanczos bidiagonalization methods for computing the dominant singular values and vectors (Baglama and Reichel, 2005). The advantage of this algorithm is that it avoids the implicit computation of the inner product $\mathbf{A}^\top\mathbf{A}$ or outer product $\mathbf{A}\mathbf{A}^\top$. Thus, this algorithm is more numerically stable if $\mathbf{A}$ is ill-conditioned.
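For reference, the partial SVD routines mentioned above are called roughly as follows (a sketch, assuming a matrix A and a target rank k in the workspace; the argument names mirror those used in Table 1, and tuning options such as tolerances are omitted):

R> k <- 100
R> s1 <- irlba::irlba(A, nv = k)         # implicitly restarted Lanczos bidiagonalization
R> s2 <- RSpectra::svds(A, k = k)        # ARPACK-inspired partial SVD
R> s3 <- svd::propack.svd(A, neig = k)   # PROPACK Lanczos routine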

3.6 The rsvd() function

The rsvd package provides an efficient routine to compute the low-rank SVD using Algorithm 5. The interface of the rsvd() function is similar to the base svd() function

obj <- rsvd(A, k, nu = NULL, nv = NULL, p = 10, q = 2, sdist = "normal")

The first mandatory argument A passes the input data matrix. The second mandatory argument k defines the target rank, which is assumed to be chosen smaller than the ambient dimensions of the input matrix. The rsvd() function achieves significant speedups for target ranks chosen to be $k \ll \min\{m, n\}$. Similar to the svd() function, the arguments nu and nv can be used to specify the number of left and right singular vectors to be returned.

The accuracy of the approximation can be controlled via the two tuning parameters p and q. The former parameter is used to oversample the basis, and is set by default to p=10. This setting guarantees a good basis with high probability in general. The parameter q can be used to compute additional power iterations (subspace iterations). By default this parameter is set to q=2, which yields a good performance in our numerical experiments, i.e., the default values show an optimal trade-off between speed and accuracy in standard situations. If the singular value spectrum of the input matrix decays slowly, more power iterations are desirable.

Further, the rsvd() routine allows one to choose between standard normal, uniform, and Rademacher random test matrices. The different options can be selected via the argument sdist = c("normal", "unif", "rademacher").

The resulting model object, obj, is itself a list. It contains the following components:

  • d: $k$-dimensional vector containing the singular values.

  • u: $m \times k$ matrix containing the left singular vectors.

  • v: $n \times k$ matrix containing the right singular vectors. Note that v is not returned in its transposed form, as it often is in other programming languages.

More details are provided in the corresponding documentation, see ?rsvd.

3.7 SVD example: Image compression

The singular value decomposition can be used to obtain a low-rank approximation of high-dimensional data. Image compression is a simple, yet illustrative example. The underlying structure of natural images can often be represented by a very sparse model. This means that images can be faithfully recovered from a relatively small set of basis functions. For demonstration, we use the following grayscale image:

R> data("tiger")
R> image(tiger, col = gray(0:255 / 255))

A grayscale image may be thought of as a real-valued matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, where $m$ and $n$ are the number of pixels in the vertical and horizontal directions, respectively. To compress the image we first need to decompose the matrix $\mathbf{A}$. The singular vectors and values provide a hierarchical representation of the image in terms of a new coordinate system defined by dominant correlations within rows and columns of the image. Thus, the number of singular vectors used for the approximation poses a trade-off between the compression rate (i.e., the number of singular vectors to be stored) and the reconstruction fidelity. In the following, we use the arbitrary choice $k = 100$ as the target rank. First, the R base svd() function is used to compute the truncated singular value decomposition:

R> k <- 100
R> tiger.svd <- svd(tiger, nu = k, nv = k)

The svd() function returns three objects: u, v and d. The first two objects are $m \times k$ and $n \times k$ arrays, namely the truncated left and right singular vectors. The vector d is comprised of the singular values in descending order. Now, the $k$ dominant singular values and vectors are retained to approximate (reconstruct) the original image:

R> tiger.re <- tiger.svd$u %*% diag(tiger.svd$d[1:k]) %*% t(tiger.svd$v)
R> image(tiger.re, col = gray(0:255 / 255))

The normalized root mean squared error (nrmse) is a common measure for the reconstruction quality of images, computed as:

R> nrmse <- sqrt(sum((tiger - tiger.re) ** 2 ) / sum(tiger ** 2))

Using only $k = 100$ singular values/vectors, a reconstruction error of about 0.121 is achieved (see Table 1). This illustrates the general fact that natural images feature a very compact representation. Note that the singular value decomposition is also a numerically reliable tool for extracting a desired signal from noisy data. The central idea is that the small singular values mainly represent the noise, while the dominant singular values represent the desired signal.

If the data matrix exhibits low-rank structure, the provided rsvd() function can be used as a plug-in function for the base svd() function, in order to compute the near-optimal low-rank singular value decomposition:

R> tiger.rsvd <- rsvd(tiger, k = k)

Similar to the base SVD function, the rsvd() function returns three objects: u, v and d. Again, u and v are $m \times k$ and $n \times k$ arrays containing the approximate left and right singular vectors, and the vector d is comprised of the singular values in descending order. Optionally, the approximation accuracy of the randomized SVD algorithm can be controlled by the two parameters p and q, as described in the previous section. Again, the approximated image and the reconstruction error can be computed as:

R> tiger.re <- tiger.rsvd$u %*% diag(tiger.rsvd$d) %*% t(tiger.rsvd$v)
R> nrmse <- sqrt(sum((tiger - tiger.re) ** 2) / sum(tiger ** 2))

The reconstruction error is about 0.122, i.e., close to that of the optimal truncated SVD. Figure 9 presents the visual results using both the deterministic and randomized SVD algorithms.

(a) Original image.
(b) SVD.
(c) rSVD without power iterations (q = 0).
(d) rSVD with power iterations.
Figure 9: Subplot (a) shows the original image, and subplots (b), (c) and (d) show the reconstructed images using the dominant $k = 100$ components. The reconstruction quality of the randomized SVD with power iterations in (d) is nearly as good as that of the deterministic SVD; the corresponding reconstruction errors are reported in Table 1.

By visual inspection, no significant differences can be seen between (b) and (d). However, the quality suffers by omitting subspace iterations in (c).

Table 1 shows the performance of different SVD algorithms in R. The rsvd() function achieves speedups of roughly 4-12 over the svd() function, depending on the number of power iterations. The svds() and irlba() functions achieve speedups of about 1.5. The computational gain of the randomized algorithm becomes more pronounced with increasing matrix dimension, e.g., images with higher resolution. The trade-off between accuracy and speed of the partial SVD algorithms depends on the precision parameter tol, and we set the tolerance parameter for all algorithms to tol = 1e-5.

Package Function Parameters Time (s) Speedup Error
R base svd() nu = nv = 100 0.37 * 0.121
svd propack.svd() neig = 100 0.55 0.67 0.121
RSpectra svds() k = 100 0.25 1.48 0.121
irlba irlba() nv = 100 0.24 1.54 0.121
rsvd rsvd() k = 100, q = 0 0.03 12.3 0.165
rsvd rsvd() k = 100, q = 1 0.052 7.11 0.125
rsvd rsvd() k = 100, q = 2 0.075 4.9 0.122
rsvd rsvd() k = 100, q = 3 0.097 3.8 0.121
Table 1: Summary of algorithm runtimes (averaged over several runs) and reconstruction errors. The randomized routines achieve substantial speedups, while attaining similar reconstruction errors for q >= 2.

3.8 Computational performance

In the following, we evaluate the performance of the randomized SVD routine and compare it to other SVD routines available in R. To fully exploit the power of randomized algorithms we use the enhanced R distribution Microsoft R Open 3.4.3. This R distribution is linked with multi-threaded LAPACK libraries, which use all available cores and processors. Compared to the standard CRAN R distribution, which uses only a single thread (processor), the enhanced R distribution shows significant speedups for matrix operations. For benchmark results, see: https://mran.microsoft.com/documents/rro/multithread/. However, we also see significant speedups when using the standard CRAN R distribution.

A machine with Intel Core i7-7700K CPU Quad-Core 4.20GHz, 64GB fast memory, and operating-system Ubuntu 17.04 is used for all computations. The microbenchmark package is used for accurate timing (Mersmann et al., 2015).

To compare the computational performance of the SVD algorithms, we consider low-rank matrices with varying dimensions m and n, and intrinsic rank r, generated as:

R> A <- matrix(rnorm(m * r), m, r) %*% matrix(rnorm(r * n), r, n)

Figure 10 shows the runtimes for computing low-rank approximations of fixed target rank for varying matrix dimensions m, where the second dimension n is scaled accordingly. While the routines of the RSpectra and irlba packages perform best for small dimensions, the computational advantage of the randomized SVD becomes pronounced with increasing dimensions.

Figure 10: Runtimes for computing rank-$k$ approximations for varying matrix dimensions.

We now investigate the performance of the routines in more detail. Figures 11, 12, and 13 show the computational performance for varying matrix dimensions and target ranks. The elapsed time is computed as the median over several runs. The speedups show the relative performance compared to the base svd(), i.e., the average runtime of the svd() function is divided by the runtime of the other SVD algorithms. The relative reconstruction error is computed as

$$\frac{\|\mathbf{A} - \mathbf{A}_k\|_F}{\|\mathbf{A}\|_F},$$

where $\mathbf{A}_k$ is the rank-$k$ matrix approximation.

The rsvd() function achieves substantial speedups over the other SVD routines. Here, the oversampling parameter is fixed to p = 10, but it can be seen that additional power iterations improve the approximation accuracy. This allows the user to control the trade-off between computational time and accuracy, depending on the application. Note that we have set the precision parameter of the RSpectra, irlba and propack routines to tol = 1e-5.
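A small-scale version of such a timing comparison can be set up along the following lines. This is only a sketch: the matrix sizes are illustrative, the matrix is generated as above, and the reported times will of course vary by machine:

R> library("microbenchmark")
R> m <- 2000; n <- 1000; r <- 20; k <- 20                        # illustrative sizes
R> A <- matrix(rnorm(m * r), m, r) %*% matrix(rnorm(r * n), r, n)
R> microbenchmark(
+    base_svd = svd(A, nu = k, nv = k),
+    rsvd     = rsvd(A, k = k),
+    irlba    = irlba::irlba(A, nv = k),
+    times = 10)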

Figure 14 shows the computational performance for sparse matrices with a small fraction of non-zero elements. The RSpectra and propack routines are specifically designed for sparse and structured matrices, and show considerably better computational performance.

(a) Runtime.
(b) Speedups.
(c) Relative errors.
Figure 11: Computational performance for a dense low-rank matrix.
(a) Runtime.
(b) Speedups.
(c) Relative errors.
Figure 12: Computational performance for a dense low-rank matrix.
(a) Runtime.
(b) Speedups.
(c) Relative errors.
Figure 13: Computational performance for a dense low-rank matrix.

Note that the random sparse matrices do not feature low-rank structure; hence, the large relative error. Still, the randomized SVD shows a good trade-off between speedup and accuracy.

(a) Runtime.
(b) Speedups.
(c) Relative errors.
Figure 14: Computational performance for a sparse matrix.

4 Randomized principal component analysis

Dimensionality reduction is a fundamental concept in modern data analysis. The idea is to exploit relationships among points in high-dimensional space in order to construct some low-dimensional summaries. This process aims to eliminate redundancies, while preserving interesting characteristics of the data (Burges, 2010). Dimensionality reduction is used to improve the computational tractability, to extract interesting features, and to visualize data which are comprised of many interrelated variables. The most important linear dimension reduction technique is principal component analysis (PCA), originally formulated by Pearson (1901) and Hotelling (1933). PCA plays an important role, in particular, due to its simple geometric interpretation. Jolliffe (2002) provides a comprehensive introduction to PCA.

4.1 Conceptual overview

Principal component analysis aims to find a new set of uncorrelated variables. The so-called principal components (PCs) are constructed such that the first PC explains most of the variation in the data, the second PC most of the remaining variation, and so on. This property ensures that the PCs sequentially capture most of the total variation (information) present in the data. In practice, we often aim to retain only a small number of PCs which capture a ‘good’ amount of the variation, where ‘good’ depends on the application. The idea is depicted for two correlated variables in Figure 15. Figure 15(a) illustrates the two principal directions of the data, which span a new coordinate system. The first principal direction is the vector pointing in the direction which accounts for most of the variability in the data. The second principal direction is orthogonal (perpendicular) to the first one and captures the remaining variation in the data. Figure 15(b) shows the original data using the principal directions as a new coordinate system. Compared to the original data, the histograms indicate that most of the variation is now captured by the first principal component, while less is captured by the second component.

(a) Original (standardized) data.
(b) Rotated data.
Figure 15: PCA seeks to find a new set of uncorrelated variables. Plot (a) shows some two-dimensional data and its two principal directions. Plot (b) shows the new principal components. Geometrically, the PCs are simply a rotation and reflection of the original data, so that the first component now accounts for most of the variation in the data.

To be more formal, assume a data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with $n$ observations and $p$ variables (column-wise mean-centered). Then, the principal components can be expressed as a weighted linear combination of the original variables:

$$\mathbf{z}_i = \mathbf{X}\mathbf{w}_i, \qquad (21)$$

where $\mathbf{z}_i$ denotes the $i$-th principal component. The vector $\mathbf{w}_i$ is the $i$-th principal direction, and the elements of $\mathbf{w}_i$ are the principal component coefficients.

The problem is now to find a suitable vector $\mathbf{w}_1$ such that the first principal component $\mathbf{z}_1$ captures most of the variation in the data. Mathematically, this problem can be formulated either as a least-squares problem or as a variance maximization problem (Cunningham and Ghahramani, 2015). The two views are illustrated in Figure 16.

Figure 16: Principal component analysis can be formulated either as a variance maximization ('maximize variance') or as a least-squares minimization ('minimize residuals') problem. Both views are equivalent.

This is because the total variation equals the sum of the explained and unexplained variation (Jolliffe, 2002), as illustrated in Figure 17.

Figure 17: The Pythagorean theorem provides a geometrical explanation for the relationship between the two views: the PCs can be obtained by either maximizing the variance or by minimizing the unexplained variation (squared residuals) of the data.

We follow the latter view, and maximize the variance of the first principal component $\mathbf{z}_1 = \mathbf{X}\mathbf{w}_1$ subject to the normalization constraint:

$$\max_{\mathbf{w}_1} \ \mathrm{Var}(\mathbf{X}\mathbf{w}_1) \quad \text{subject to} \quad \mathbf{w}_1^\top\mathbf{w}_1 = 1, \qquad (22)$$

where $\mathrm{Var}(\cdot)$ denotes the variance operator. We can rewrite Equation (22) as

$$\max_{\mathbf{w}_1} \ \frac{1}{n-1}\,\mathbf{w}_1^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}_1 \quad \text{subject to} \quad \mathbf{w}_1^\top\mathbf{w}_1 = 1. \qquad (23)$$

We note that the scaled inner product forms the sample covariance matrix

$$\mathbf{C} = \frac{1}{n-1}\,\mathbf{X}^\top\mathbf{X}. \qquad (24)$$

$\mathbf{C}$ corresponds to the sample correlation matrix if the columns of $\mathbf{X}$ are both centered and scaled. We substitute $\mathbf{C}$ into Equation (23):

$$\max_{\mathbf{w}_1} \ \mathbf{w}_1^\top\mathbf{C}\mathbf{w}_1 \quad \text{subject to} \quad \mathbf{w}_1^\top\mathbf{w}_1 = 1. \qquad (25)$$

Next, the method of Lagrange multipliers is used to solve the problem. First, we formulate the Lagrange function

$$\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^\top\mathbf{C}\mathbf{w}_1 - \lambda_1(\mathbf{w}_1^\top\mathbf{w}_1 - 1). \qquad (26)$$

Then, we maximize the Lagrange function by differentiating with respect to $\mathbf{w}_1$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_1} = 2\mathbf{C}\mathbf{w}_1 - 2\lambda_1\mathbf{w}_1 = 0, \qquad (27)$$

which leads to the well-known eigenvalue problem $\mathbf{C}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$. Thus, the first principal direction for the mean-centered matrix $\mathbf{X}$ is given by the dominant eigenvector $\mathbf{w}_1$ of the covariance matrix $\mathbf{C}$. The amount of variation explained by the first principal component is expressed by the corresponding eigenvalue $\lambda_1$. More generally, the subsequent principal component directions can be obtained by computing the eigendecomposition of the covariance or correlation matrix

$$\mathbf{C} = \mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^\top. \qquad (28)$$

The columns of $\mathbf{W}$ are the eigenvectors (principal directions), which are orthonormal, i.e., $\mathbf{W}^\top\mathbf{W} = \mathbf{I}$. The diagonal elements of $\boldsymbol{\Lambda}$ are the corresponding eigenvalues. The matrix $\mathbf{W}$ can also be interpreted as a projection matrix that maps the original observations to new coordinates in eigenspace. Hence, the principal components can be more concisely expressed as

$$\mathbf{Z} = \mathbf{X}\mathbf{W}. \qquad (29)$$

Since the eigenvectors have unit norm, the projection should be purely rotational without any scaling; thus, $\mathbf{W}$ is also denoted as the rotation matrix.
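The equivalence between this eigendecomposition route and standard PCA can be checked directly in R. The following sketch uses simulated data (the signs of individual directions may differ between the two computations):

R> set.seed(1)
R> X <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)  # mean-centered data
R> C <- crossprod(X) / (nrow(X) - 1)        # sample covariance matrix, Eq. (24)
R> e <- eigen(C)                            # eigendecomposition, Eq. (28)
R> W <- e$vectors                           # principal directions (rotation matrix)
R> Z <- X %*% W                             # principal components, Eq. (29)
R> all.equal(abs(Z), abs(prcomp(X, center = FALSE)$x), check.attributes = FALSE)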

4.1.1 PCA whitening

In some situations, the scaled eigenvectors

$$\mathbf{P} := \mathbf{W}\boldsymbol{\Lambda}^{1/2} \qquad (30)$$

provide a more insightful interpretation of the data. $\mathbf{P}$ is denoted as the loading matrix and provides a factorization of the covariance (correlation) matrix:

$$\mathbf{C} = \mathbf{P}\mathbf{P}^\top. \qquad (31)$$

Thus, the loadings have the following two interesting properties:

  • The squared column sums equal the eigenvalues.

  • The squared row sums equal the variable’s variance.

Further, the loading matrix can be used to compute the whitened principal components

$$\tilde{\mathbf{Z}} = \mathbf{X}\mathbf{W}\boldsymbol{\Lambda}^{-1/2}. \qquad (32)$$

Essentially, whitening rescales the principal components so that they have unit variance. This process is also called sphering (Kessy et al., 2018). In other words, whitening scales the $i$-th principal component by the corresponding eigenvalue as

$$\tilde{\mathbf{z}}_i = \mathbf{z}_i\,\lambda_i^{-1/2}. \qquad (33)$$

This is best illustrated by revisiting the above example shown in Figure 15. In Figure 18 we show both the rotated and the whitened version of the data.

(a) Rotated data.
(b) Rotated and whitened data.
Figure 18: Plot (a) shows the principal components and plot (b) shows the whitened principal components. The whitened components are uncorrelated and have unit variance.
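Whitening can be carried out directly from the eigendecomposition of the earlier sketch (reusing the objects X, W and e; in practice, near-zero eigenvalues would require some care):

R> Z_white <- X %*% W %*% diag(1 / sqrt(e$values))   # whitened PCs, Eq. (32)
R> round(cov(Z_white), 10)                           # approximately the identity matrix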

4.1.2 Dimensionality reduction

In practice, we often seek a useful low-dimensional representation to reveal the coherent structure of the data. Choosing a ‘good’ target rank $k$, i.e., the number of PCs to retain, is a subtle issue and often domain specific. Little is gained by retaining too many components. Conversely, a bad approximation is produced if the number of retained components is too small. Fortunately, the eigenvalues tell us the amount of variance captured by keeping only $k$ components, given by the cumulative proportion of explained variance

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j}.$$

Thus, PCs corresponding to eigenvalues of small magnitude account only for a small amount of information in the data.

Many different heuristics, like the scree plot and the Kaiser criterion, have been proposed to identify the optimal number of components (Jolliffe, 2002). A computationally intensive approach to determine the optimal number of components is via cross-validation approximations (Josse and Husson, 2012), while a mathematically refined approach is the optimal hard threshold method for singular values, formulated by Gavish and Donoho (2014). An interesting Bayesian approach to estimate the intrinsic dimensionality of a high-dimensional dataset was recently proposed by Bouveyron et al. (2017).
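In R, the proportion of variance retained by the first k components can be inspected with a couple of lines, reusing the eigenvalues from the sketch above (summary(prcomp(X)) reports the same quantities):

R> prop_var <- e$values / sum(e$values)   # proportion of variance per component
R> cumsum(prop_var)                       # cumulative proportion for k = 1, 2, ...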

4.2 Randomized algorithm

The singular value decomposition provides a computationally efficient and numerically stable approach for computing the principal components. Specifically, the eigenvalue decompositions of the inner and outer dot products of $\mathbf{X}$ can be related to the SVD as

$$\mathbf{X}^\top\mathbf{X} = \mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top, \qquad (34a)$$
$$\mathbf{X}\mathbf{X}^\top = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^\top\mathbf{U}^\top = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top. \qquad (34b)$$

It follows that the eigenvalues are equal to the squared singular values. Thus, we recover the eigenvalues of the sample covariance matrix as $\lambda_i = \sigma_i^2/(n-1)$.

The left singular vectors $\mathbf{U}$ correspond to the eigenvectors of the outer product $\mathbf{X}\mathbf{X}^\top$, and the right singular vectors $\mathbf{V}$ correspond to the eigenvectors of the inner product $\mathbf{X}^\top\mathbf{X}$. This allows us to define the projection (rotation) matrix $\mathbf{W} := \mathbf{V}$.

Having established the connection between the singular value decomposition and principal component analysis, it is straightforward to show that the principal components can be computed as

$$\mathbf{Z} = \mathbf{X}\mathbf{W} = \mathbf{X}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}. \qquad (35)$$

The randomized singular value decomposition can then be used to efficiently approximate the dominant principal components. This approach is denoted as randomized principal component analysis, first introduced by Rokhlin et al. (2009), and later by Halko et al. (2011a). Szlam et al. (2014) provide some additional interesting implementation details for large-scale applications. Algorithm 6 presents an implementation of the randomized PCA. The approximation accuracy can be controlled via oversampling and additional power iterations, as described in Section 2.2.

Input: Centered/scaled input matrix $\mathbf{X}$ with dimensions $m \times n$, and target rank $k$.
Optional: Parameters $p$ and $q$ to control oversampling, and the power scheme.

(1) $[\mathbf{U}, \boldsymbol{\Sigma}, \mathbf{V}] \leftarrow \mathrm{rsvd}(\mathbf{X}, k, p, q)$   randomized SVD (Algorithm 5)
(2) $\lambda_i = \sigma_i^2 / (m - 1)$   recover eigenvalues
(3) optional: $\mathbf{Z} = \mathbf{U} \boldsymbol{\Sigma}$   compute principal components

Return: $\mathbf{W} := \mathbf{V}$ (eigenvectors), $\boldsymbol{\lambda}$ (eigenvalues), and $\mathbf{Z}$ (principal components)

Algorithm 6 A randomized PCA algorithm.
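The steps of Algorithm 6 can be sketched directly with the rsvd() routine; the random test matrix and the chosen target rank below are assumptions made only for illustration, and this sketch mirrors, but is not identical to, the internal implementation of rpca():

library(rsvd)
set.seed(1)
X <- matrix(rnorm(200 * 50), 200, 50)        # stand-in data matrix (200 observations, 50 variables)
X <- scale(X, center = TRUE, scale = FALSE)  # column-wise centering
out <- rsvd(X, k = 5, p = 10, q = 2)         # step (1): randomized SVD (Algorithm 5)
eigvals <- out$d^2 / (nrow(X) - 1)           # step (2): recover eigenvalues
Z <- out$u %*% diag(out$d)                   # step (3): principal components
W <- out$v                                   # rotation matrix (eigenvectors)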

4.3 Existing functionality for PCA in R

The prcomp() and princomp() functions are the default options for performing PCA in R. The prcomp() routine uses the singular value decomposition and the princomp() function uses the eigenvalue decomposition to compute the principal components (Venables and Ripley, 2002). Other options in R are the PCA routines of the ade4 (Dray and Dufour, 2007) and FactoMineR (Lê et al., 2008) packages, which provide extended plot and summary capabilities. All these routines, however, are based on computationally demanding algorithms.

In many applications, only the dominant principal components are required. In this case, partial algorithms are an efficient alternative to constructing low-rank approximations, as discussed in Section 3. For instance, the irlba package provides a computationally efficient routine for computing the dominant principal components using the implicitly restarted Lanczos method (Baglama et al., 2017).

Another class of methods are incremental PCA algorithms, also denoted as online PCA. These techniques are of interest if the data matrix is not entirely available to start with, i.e., the algorithms allow one to update the decomposition with each newly arriving observation in time. Cardot and Degras (2015) give an overview of online PCA algorithms, and the corresponding routines are provided via the onlinePCA package (Degras and Cardot, 2016). Similarly, the idm package provides an incremental PCA algorithm.

4.4 The rpca() function

The rpca() function provides an efficient routine for computing the dominant principal components using Algorithm 6. This routine is particularly relevant if the information in a large-scale data matrix can be approximated by the first few principal components.

The interface of the rpca() function is similar to the prcomp() function:

obj <- rpca(A, k, center = TRUE, scale = TRUE, retx = TRUE, p = 10, q = 2)

The first mandatory argument A passes the input data matrix. Note that the analysis can be affected if the variables have different units of measurement; in this case, scaling is required to ensure a meaningful interpretation of the components. The rpca() function centers and scales the input matrix by default, i.e., the analysis is based on the implicit correlation matrix. However, if all of the variables have the same units of measurement, there is the option to work with either the covariance or the correlation matrix. In this case, the choice largely depends on the data and the aim of the analysis. The default options can be changed via the arguments center and scale. The second mandatory argument k sets the target rank, and it is assumed that k is smaller than the ambient dimensions of the input matrix. The principal components are returned by default; otherwise the argument retx can be set to FALSE to not return the PCs. The parameters p and q are described in Section 3.6.

The resulting model object, obj, is a list and contains the following components:

  • rotation: $n \times k$ matrix containing the eigenvectors.

  • eigvals: $k$-dimensional vector containing the eigenvalues.

  • sdev: $k$-dimensional vector containing the standard deviations of the principal components, i.e., the square roots of the eigenvalues.

  • x: $m \times k$ matrix containing the principal components (rotated variables).

  • center, scale: the numeric centering and scalings used (if any).

4.4.1 Utility functions

The rpca() routine comes with generic functions that can be used to summarize and display the model information, similar to the prcomp() function. The summary() function provides information about the standard deviations, the proportion of explained variance, and the cumulative proportion of explained variance of the computed principal components. The print() function can be used to print the eigenvectors (principal directions). The plot() function can be used to visualize the results, using the ggplot2 package (Wickham, 2009).
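For instance, using the built-in USArrests data (an arbitrary choice for this sketch; the small oversampling parameter is used only because the data has just four variables):

library(rsvd)
obj <- rpca(as.matrix(USArrests), k = 2, p = 2)   # correlation-based PCA with two components
summary(obj)      # standard deviations, proportion and cumulative proportion of variance
print(obj)        # eigenvectors (principal directions)
ggscreeplot(obj)  # scree plot, drawn with ggplot2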

4.5 PCA example: Handwritten digits

Handwritten digit recognition is a widely studied problem (Lecun et al., 1998). In the following, we use a downsampled version of the MNIST (Modified National Institute of Standards and Technology) database of handwritten digits. The data are obtained from http://yann.lecun.com/exdb/mnist and can be loaded from the command line:

R> data("digits")
R> label <- as.factor(digits[,1])
R> digits <- digits[,2:785]

The data matrix has one row per image and 785 columns. Each row corresponds to a single handwritten digit: the first column comprises the class label, while the following 784 columns record the pixel intensities of the flattened 28 × 28 image patch. Figure 21(a) shows some of the digits. In R the first digit can be displayed as:

R> digit <- matrix(digits[1,], nrow = 28, ncol = 28)
R> image(digit[,28:1], col = gray(255:0 / 255))

The aim of principal component analysis is to find a low-dimensional representation which captures most of the variation in the data. PCA helps to understand the sources of variability in the data as well as the correlations between variables. The principal components can be used for visualization, or as features to train a classifier. A common choice is to retain the dominant 40 principal components for classifying digits using the k-nearest neighbor algorithm (Lecun et al., 1998). Those can be efficiently approximated using the rpca() function:

R> digits.rpca <- rpca(digits, k = 40, center = TRUE, scale = FALSE)

The target rank is defined via the argument k. By default, the data are mean centered and standardized, i.e., the correlation matrix is implicitly computed. Here, we set scale = FALSE, since the variables have the same units of measurement, namely pixel intensities. The analysis can be summarized using the summary() function. The ggscreeplot() function can be used to visualize the cumulative proportion of the variance captured by the principal components:

R> ggscreeplot(digits.rpca, type =  "cum")

Figure 19 shows the corresponding plot. Next, the PCs can be plotted in order to visualize the data in low-dimensional space:

R> ggindplot(digits.rpca, groups=label, ellipse = TRUE, ind_labels = FALSE)
Figure 19: Cumulative proportion of the variance explained by the principal components. The first PCs explain about of the total variation in the data.
(a) Individuals factor map.
(b) Variables factor map.
Figure 20: Plotting functionality to visualize the PCs: (a) shows the individuals factor map, overlaid with ellipses for each class; (b) shows the variables factor map.

The so-called individuals factor map, using the first and second principal components, is shown in Figure 20(a). The plot helps to reveal some interesting patterns in the data. For instance, some digit classes are clearly distinct from each other, while the 3s share commonalities with all the other classes. Further, the correlation between the original variables and the PCs can be visualized:

R> ggcorplot(digits.rpca, alpha = 0.3, top.n = 10)

The correlation plot, also denoted as the variables factor map, is shown in Figure 20(b). It shows the projected variables in eigenspace. This representation of the data gives some insight into the structural relationship (correlation) between the variables and the principal components.

In order to quantify the quality of the dimensionality reduction, we can compute the relative error between the low-rank approximation and the original data. Recall that the dominant principal components were defined as $\mathbf{Z} = \mathbf{X} \mathbf{W}$. Hence, we can approximate the input matrix as $\mathbf{X} \approx \mathbf{Z} \mathbf{W}^\top$:

R> digits.re <- digits.rpca$x %*% t(digits.rpca$rotation)

Since the procedure has centered the data, we need to add the mean pixel values back:

R> digits.re <- sweep(digits.re, 2, digits.rpca$center, FUN = "+")

The relative error can then be computed:

R> norm(digits - digits.re, 'F') / norm(digits, 'F')

The relative error is approximately 0.33 (cf. Table 2). Figure 21(b) and 21(c) show samples of the reconstructed digits using the prcomp() and rpca() functions, respectively. By visual inspection, there is virtually no noticeable difference between the deterministic and the randomized approximation.

Runtimes and relative errors for different PCA functions in R are listed in Table 2. The randomized algorithm is much faster than the prcomp() function, while attaining near-optimal results. Both the dudi.pca() and PCA() functions are slower than the base prcomp() function. This is because we are using the MKL (math kernel library) accelerated R distribution Microsoft R Open 3.4.1. The timings can vary compared to using the standard R distribution.

Package Function Parameters Time (s) Speedup Error
R base prcomp() rank. = 40 0.56 * 0.327
FactoMineR PCA() ncp = 40 0.97 0.57 0.327
ade4 dudi.pca() nf = 40 0.91 0.61 0.327
irlba prcomp_irlba() n = 40 - - -
rsvd rpca() k = 40 0.37 1.5 0.328
Table 2: Summary of the computational performance of different PCA functions. The prcomp_irlba() function returns inconsistent results, and is excluded here.
(a) Handwritten digits.
(b) Deterministic.
(c) Randomized.
Figure 21: Handwritten digits, and their low-rank approximations using 40 components.

4.5.1 Handwritten digit recognition

The principal component scores can be used as features to efficiently train a classifier. This is because PCA assumes that the interesting information in the data is reflected by the dominant principal components. This assumption is not always valid, i.e., in some applications the variance may correspond to noise rather than to the underlying signal. The question is, how well suited are the randomized principal components for this task? In the following, we use a simple k-nearest neighbor (kNN) algorithm to classify handwritten digits in order to compare the performance. The idea of kNN is to find the closest point (or set of points) to a given target point (Hastie et al., 2009). There are two reasons to use PCA for dimensionality reduction: (a) kNN is known to perform poorly in high-dimensional space, due to the ‘curse of dimensionality’ (Donoho, 2000); (b) kNN is computationally expensive when high-dimensional data points are used for training.

First, we split the dataset into a training and a test set using the caret package (Kuhn, 2008). We aim to create a balanced split of the dataset, using about 80% of the data for training:

R> library("caret")
R> trainIndex <- createDataPartition(label, p = 0.8, list = FALSE)

We then compute the dominant randomized principal components of the training set:

R> train.rpca <- rpca(digits[trainIndex,], k = 40, scale = FALSE)

We can use the predict() function to rotate the test set into low-dimensional space:

R> test.x <- predict(train.rpca, digits[-trainIndex,])

The class package, which ships with the standard R distribution, provides a knn() function, which we use for classification:

R> library("class")
R> knn.1 <- knn(train.rpca$x, test.x, label[trainIndex], k = 1)

The test images are simply assigned to the class of the single nearest neighbor. The performance can be quantified by computing the accuracy, i.e., the number of correctly classified digits divided by the total number of predictions made:

R> 100 * sum(label[-trainIndex] == knn.1) / length(label[-trainIndex])

For comparison, the above steps can be repeated with the additional argument rand = FALSE, or using the prcomp() function, as sketched below. Both the randomized and the deterministic PCA achieve essentially the same accuracy. Figure 22 shows the performance.
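A deterministic baseline, reusing the train/test split from above, might look as follows (a sketch; prcomp() with rank. = 40 is just one of several equivalent choices):

train.pca <- prcomp(digits[trainIndex, ], rank. = 40, center = TRUE, scale. = FALSE)
test.x <- predict(train.pca, digits[-trainIndex, ])            # rotate test set into PC space
knn.det <- knn(train.pca$x, test.x, label[trainIndex], k = 1)  # 1-nearest-neighbor classification
100 * sum(label[-trainIndex] == knn.det) / length(label[-trainIndex])   # accuracy in percent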

Figure 22: Performance of digits classification over random splits. There is no significant difference in terms of the accuracy between using the randomized and deterministic PCs.

5 Randomized robust principal component analysis

Thus far, we have viewed matrix approximations as a factorization of a given matrix into a product of smaller (low-rank) matrices, as formulated in Equation 1. However, there is another interesting class of matrix decompositions, which aim to separate a given matrix into low-rank, sparse, and noise components. Such decompositions are motivated by the need for robust methods which can more effectively account for corrupt or missing data. Indeed, outlier rejection is critical in many applications, as data are rarely free of corrupt elements. Robustification methods decompose data matrices as follows:

$\mathbf{A} = \mathbf{L} + \mathbf{S} + \mathbf{E}$   (36)

where $\mathbf{L}$ denotes the low-rank matrix, $\mathbf{S}$ the sparse matrix, and $\mathbf{E}$ the noise matrix. Note that the sparse matrix represents the corrupted entries (outliers) of the data matrix $\mathbf{A}$. In the following we consider only the special case $\mathbf{A} = \mathbf{L} + \mathbf{S}$, i.e., the decomposition of a matrix into its sparse and low-rank components. This form of additive decomposition is also denoted as robust principal component analysis (RPCA), and its remarkable ability to separate high-dimensional matrices into low-rank and sparse components makes RPCA an invaluable tool for data science. The additive decomposition is, however, different from classical robust PCA methods known in the statistical literature. These techniques are concerned with computing robust estimators for the empirical covariance matrix; for instance, see the seminal work by Hubert et al. (2005) and Croux and Ruiz-Gazen (2005).

While traditional principal component analysis minimizes the spectral norm of the reconstruction error, robust PCA aims to recover the underlying low-rank matrix of a heavily corrupted input matrix. Candès et al. (2011) proved that it is possible to exactly separate such a data matrix into both its low-rank and sparse components, under rather broad assumptions. This is achieved by solving a convenient convex optimization problem called principal component pursuit (PCP). The objective is to minimize a weighted combination of the nuclear norm and the $\ell_1$ norm as

$\min_{\mathbf{L}, \mathbf{S}} \; \|\mathbf{L}\|_* + \lambda \|\mathbf{S}\|_1 \quad \text{subject to} \quad \mathbf{A} = \mathbf{L} + \mathbf{S}$   (37)

where $\lambda$ is an arbitrary balance parameter which puts some weight on the sparse error term in the cost function. Typically, $\lambda$ is chosen to be $1 / \sqrt{\max(m, n)}$. The PCP concept is mathematically sound, and has been applied successfully in many applications like video surveillance and face recognition (Wright et al., 2009). Robust PCA is particularly relevant if the underlying model of the data naturally features a low-rank subspace that is polluted with sparse components. The concept of matrix recovery can also be extended to the important problem of matrix completion.

The biggest challenge for robust PCA is computational efficiency, especially given the iterative nature of the optimization required. Bouwmans et al. (2016) have identified a large number of algorithms related to the original PCP approach, aiming to overcome its computational complexity and to generalize the original algorithm.

5.1 The inexact augmented Lagrange multiplier method

A popular choice to compute RPCA, due to its favorable computational properties, is the inexact augmented Lagrange multiplier (IALM) method (Lin et al., 2011). This method formulates the following Lagrangian function

$\mathcal{L}(\mathbf{L}, \mathbf{S}, \mathbf{Z}, \mu) = \|\mathbf{L}\|_* + \lambda \|\mathbf{S}\|_1 + \langle \mathbf{Z}, \mathbf{A} - \mathbf{L} - \mathbf{S} \rangle + \frac{\mu}{2} \|\mathbf{A} - \mathbf{L} - \mathbf{S}\|_F^2$   (38)

where $\lambda$ and $\mu$ are positive scalars, and $\mathbf{Z}$ the Lagrange multiplier. Further, $\langle \cdot, \cdot \rangle$ is defined as $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{trace}(\mathbf{X}^\top \mathbf{Y})$. The method of augmented Lagrange multipliers can be used to solve the optimization problem (Bertsekas, 1999). Lin et al. (2011) have proposed both an exact and an inexact algorithm to solve Equation 38. Here, we advocate the latter approach. Specifically, the inexact algorithm avoids solving the joint problem

$(\mathbf{L}_{j+1}, \mathbf{S}_{j+1}) = \arg\min_{\mathbf{L}, \mathbf{S}} \; \mathcal{L}(\mathbf{L}, \mathbf{S}, \mathbf{Z}_j, \mu_j)$   (39)

by alternately solving the following two sub-problems at step $j$:

$\mathbf{L}_{j+1} = \arg\min_{\mathbf{L}} \; \mathcal{L}(\mathbf{L}, \mathbf{S}_j, \mathbf{Z}_j, \mu_j)$   (40)

and

$\mathbf{S}_{j+1} = \arg\min_{\mathbf{S}} \; \mathcal{L}(\mathbf{L}_{j+1}, \mathbf{S}, \mathbf{Z}_j, \mu_j)$   (41)

For details, we refer the reader to Lin et al. (2011).
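Both sub-problems have closed-form solutions (Lin et al., 2011): the low-rank update is obtained by soft-thresholding the singular values (singular value thresholding), and the sparse update by entrywise soft-thresholding. A minimal sketch of these two operators in R (illustrative helper functions, not the package's internal code):

# Entrywise soft-thresholding operator: S_tau(x) = sign(x) * max(|x| - tau, 0)
soft_threshold <- function(X, tau) sign(X) * pmax(abs(X) - tau, 0)

# Singular value thresholding: apply the soft-threshold to the singular values of X
svt <- function(X, tau) {
  s <- svd(X)
  d <- pmax(s$d - tau, 0)
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)
}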

5.2 Randomized Algorithm

The singular value decomposition is the workhorse algorithm behind the IALM method. Thus, the computational costs can become intractable for ‘big’ datasets. However, the randomized SVD can be used to substantially ease the computational burden of the IALM method. Algorithm 7 outlines the randomized implementation.

Input: Input matrix $\mathbf{A}$ with dimensions $m \times n$, and sparsity weight $\lambda$ for the sparse error term.
Optional: Parameters $p$ and $q$ to control oversampling, and the power scheme.

(1) initialize target rank
(2) initialize $\mu$
(3) initialize Lagrange multiplier $\mathbf{Z}$
(4) initialize sparse matrix $\mathbf{S}$
(5) repeat
(6)   randomized SVD using Algorithm 5
(7)   predict rank, and update target rank
(8)   soft-threshold the top singular values
(9)   update low-rank matrix $\mathbf{L}$
(10)  update sparse matrix $\mathbf{S}$ via soft thresholding
(11)  update Lagrange multiplier $\mathbf{Z}$
(12)  update $\mu$
(13) until some convergence criterion is reached

Return: low-rank matrix $\mathbf{L}$, and sparse matrix $\mathbf{S}$

Remark 3.

Lin et al. (2011) provide details on how to predict the rank in step (6).

Remark 4.

Our randomized RPCA algorithm automatically switches from the randomized SVD to the deterministic SVD if the predicted target rank becomes too large relative to the ambient dimensions of the input matrix.

Algorithm 7 A randomized robust PCA algorithm.

5.3 Existing functionality for Robust PCA in R

Only a few R packages provide robust PCA routines. For comparison, we consider the rpca package (Sykulski, 2015). The provided rpca() function implements the algorithm described by Candès et al. (2011). This algorithm is highly accurate; however, a large number of iterations is required for convergence.

5.4 The rrpca() function

The rrpca() function implements the inexact augmented Lagrange multiplier method. The interface of the rrpca() function takes the form of:

obj <- rrpca(A, lambda = NULL, maxiter = 50, tol = 1.0e-5,
             p = 10, q = 2, trace = FALSE, rand = TRUE)

The first mandatory argument A passes the input data matrix. The second argument lambda is used to put some weight on the sparse error term in the cost function. By default, $\lambda$ is set to $1 / \sqrt{\max(m, n)}$. The next two parameters, maxiter and tol, are used to control the stopping criterion of the algorithm: the routine stops either when a specified maximal number of iterations or a certain tolerance level is reached. The parameters p and q are described in Section 2. The argument rand can be used to switch between the deterministic and randomized algorithms. By default the randomized algorithm is selected (i.e., the randomized SVD is used); setting rand = FALSE selects the deterministic algorithm (i.e., the deterministic SVD is used). To print out progress information, the argument trace can be set to TRUE.

The resulting model object, obj, is a list and contains the following components:

  • L: $m \times n$ matrix containing the low-rank component.

  • S: $m \times n$ matrix containing the sparse component.

5.5 Robust PCA example: Grossly corrupted handwritten digits

To demonstrate the randomized robust PCA algorithm, we consider a grossly corrupted subset of the handwritten digits dataset. We first extract a subset comprising only twos:

data("digits")
two <- subset(digits[,2:785], digits[,1] == 2)

Then, we corrupt the data using salt and pepper noise, i.e., we draw i.i.d. uniform entries in the interval [0, 255] and sparsify the matrix so that only about 10% nonzero elements are retained:

m <- nrow(two); n <- ncol(two)
S <- matrix(runif(m*n, 0, 255), nrow = m, ncol = n)
S <- S * matrix(rbinom(m*n, size = 1, prob = 0.1), nrow = m, ncol = n)

The digits are then corrupted as follows:

two_noisy <- two + S
two_noisy <- ifelse(two_noisy > 255, 255, two_noisy)

Note that the last line ensures that the pixel intensities remain in the interval [0, 255]. Samples of the corrupted digits are shown in Figure 23(a). Robust PCA is now used for matrix recovery (denoising) by separating the data into a low-rank and a sparse component:

two.rrpca <- rrpca(two_noisy, trace = TRUE, rand = TRUE)

Figure 23(c) and 23(d) show samples of the low-rank component $\mathbf{L}$ and the sparse component $\mathbf{S}$. For comparison, Figure 23(b) shows samples of the low-rank component computed using the deterministic routine (rand = FALSE).

(a) Noisy digits.
(b) Deterministic .
(c) Randomized .
(d) Randomized .
Figure 23: Separation of noisy handwritten digits into a low-rank and a sparse component.

Table 3 summarizes the computational results. The randomized routine does not show a computational advantage in this case. The reasons are twofold. First, the dataset requires a relatively large target rank to approximate the data accurately, i.e., the rank predicted by the IALM algorithm at the final iteration is relatively large. Second, the data matrix is tall and thin (i.e., the ratio of rows to columns is large). For both of these reasons, the performance of the deterministic algorithm remains competitive. The advantage of the randomized algorithm becomes pronounced for higher-dimensional problems which feature low-rank structure.

It appears that the rpca() routine of the rpca package is more accurate, while computationally less attractive (the algorithm converges slowly). However, using the relative error for comparing the algorithms might be misleading, since we do not know the ground truth. The relative error is computed using the original data as the baseline, which contains some perturbations itself. Clearly, the advocated IALM algorithm removes not only the salt and pepper noise, but also shadows, specularities, and saturations from the digits. This seems to be favorable, yet it leads to a larger relative error.
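A relative error of this kind can be computed along the following lines (a sketch; whether Table 3 uses exactly this baseline and error definition is an assumption here):

two.det <- rrpca(two_noisy, rand = FALSE)        # deterministic IALM for comparison
norm(two - two.rrpca$L, "F") / norm(two, "F")    # randomized: error w.r.t. the uncorrupted digits
norm(two - two.det$L, "F") / norm(two, "F")      # deterministic: error w.r.t. the uncorrupted digits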

Package Function Parameters Time (s) Speedup Error Iterations
rpca rpca() max.iter=50 11.88 * 0.265 50
rsvd rrpca() rand=FALSE 6.33 1.8 0.328 27
rsvd rrpca() rand=TRUE 5.56 2.1 0.329 27
Table 3: Summary of the computational performance of different RPCA functions.

Thus, to better compare the algorithms we perform a small simulation study using synthetic data. By superimposing a low-rank matrix with a sparse component, the ground truth is known. First, the low-rank component is generated:

m <- 300; n <- 300; k <- 5
L1 <- matrix(rnorm(m*k), nrow = m, ncol = k)
L2 <- matrix(rnorm(n*k), nrow = k, ncol = n)
L  <- L1 %*% L2

The sparse component, with about 20% nonzero i.i.d. uniform entries in the interval [-500, 500], is generated from:

S <- matrix(runif(m*n, -500, 500), nrow = m, ncol = n)
S <- S * matrix(rbinom(m*n, size = 1, prob = 0.2), nrow = m, ncol = n)

The data matrix is then constructed by superimposing the low-rank and sparse components:

A <- L + S
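With the ground truth known, the recovery error of each routine can be measured directly against the true low-rank component; a minimal sketch:

library(rsvd)
out.rand <- rrpca(A, rand = TRUE)             # randomized IALM
out.det  <- rrpca(A, rand = FALSE)            # deterministic IALM
norm(L - out.rand$L, "F") / norm(L, "F")      # recovery error of the randomized routine
norm(L - out.det$L, "F") / norm(L, "F")       # recovery error of the deterministic routine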

Figure 24 shows the performance of the RPCA algorithms over repeated runs. Both the randomized and the deterministic routines provided by the rsvd package show better performance than the rpca package when the maximum number of iterations is set to 50. In addition, we show the performance of the rpca package's algorithm when it is allowed a larger number of iterations.

Figure 24: Matrix recovery performance of different RPCA algorithms.

6 Additional functionality

Principal component analysis seeks a set of new components which are formed as weighted linear combinations of the input data. This approach allows one to efficiently approximate and summarize the data; however, interpretation of the resulting components can be difficult. For instance, in a high-dimensional setting it is cumbersome to interpret the large number of weights (loadings) required to form the components. While in many applications the eigenvectors have distinct meanings, the orthogonality constraints may not be physically meaningful in other problems. Thus, it is plausible to look for alternative factorizations which may not provide an optimal low-rank approximation, but which may preserve useful properties of the input matrix, such as sparsity and non-negativity, as well as allowing for easier interpretation of their components. Such properties may be found in the CUR and interpolative decompositions (ID), which are both tools for computing low-rank approximations.

6.1 Randomized CUR decomposition

Mahoney and Drineas (2009) introduced the CUR matrix decomposition as an interesting alternative to traditional approximation techniques such as the SVD and PCA. The CUR decomposition admits a factorization of the form

$\mathbf{A} \approx \mathbf{C} \mathbf{U} \mathbf{R}$   (42)

where the matrices $\mathbf{C}$ and $\mathbf{R}$ are formed by small subsets of actual columns and rows of $\mathbf{A}$, respectively. The matrix $\mathbf{U}$ is formed so that the residual $\|\mathbf{A} - \mathbf{C}\mathbf{U}\mathbf{R}\|$ is small. The CUR factorization is illustrated in Figure 25.

Figure 25: Schematic of the rank-$k$ CUR decomposition of an $m \times n$ matrix. The components $\mathbf{C}$ and $\mathbf{R}$ are formed by small subsets of actual columns and rows of the input matrix.

The low-rank factor matrices $\mathbf{C}$ and $\mathbf{R}$ are interpretable, since their components maintain the original structure of the data. This allows one to fully leverage any field-specific knowledge about the data, i.e., experts often have a clear understanding of the actual meaning of certain columns and rows. However, the CUR decomposition is not unique, and different computational strategies lead to different subsets of columns and rows; for instance, see Mahoney and Drineas (2009) and Boutsidis and Woodruff (2017). Thus, the practical meaning of the selected rows and columns should always be carefully examined depending on the problem and its objective.

Note that the rank-$k$ SVD of a general matrix $\mathbf{A}$ yields an optimal approximation $\mathbf{A}_k$ of rank $k$ to $\mathbf{A}$, in the sense that $\|\mathbf{A} - \mathbf{A}_k\| \leq \|\mathbf{A} - \mathbf{B}\|$ for any rank-$k$ matrix $\mathbf{B}$, both in the operator (spectral) and Frobenius norms. However, if $\mathbf{A}$ is a sparse matrix, the left and right factors $\mathbf{U}_k$ and $\mathbf{V}_k$ are typically dense. Even though the low-rank SVD is optimal for a given rank $k$, the choice of rank may be limited to relatively low values with respect to $\min(m, n)$ for sparse matrices, in order to achieve any useful compression ratio. Of course, the usefulness of the SVD is not limited to compression; but the utility of a low-rank approximation is greatly reduced once the storage size of the factors exceeds that of the original matrix. The CUR decomposition provides an interesting alternative for compression, since its components preserve sparsity.

The rCUR package provides an implementation in R (Bodor et al., 2012). The rsvd package implements both the deterministic and randomized CUR decompositions, following the work by Voronin and Martinsson (2017). Specifically, the interpolative decomposition is used as an algorithmic tool to form the factor matrices $\mathbf{C}$ and $\mathbf{R}$. Algorithm 8 outlines the computational steps of the rcur() routine as implemented in the rsvd package.

6.2 The rcur() function

The rcur() function provides the option to compute both the deterministic and the randomized CUR decomposition via Algorithm 8. The interface of the rcur() function is as follows:

obj <- rcur(A, k, p = 10, q = 0, idx_only = FALSE, rand = TRUE)

The first mandatory argument A passes the input data matrix. The second mandatory argument k sets the target rank, which is required to be smaller than the ambient dimensions of the input matrix. The parameters p and q are described in Section 2. The argument rand can be used to switch between the deterministic and the randomized algorithm; the latter is used by default, and is more efficient for large-scale matrices. The argument idx_only can be set to TRUE in order to return only the column and row index sets, which is more memory efficient than returning the matrices $\mathbf{C}$ and $\mathbf{R}$. A short usage sketch follows the component list below.

The resulting model object, obj, is a list containing the following components:

  • C: $m \times k$ matrix containing the column skeleton.

  • R: $k \times n$ matrix containing the row skeleton.

  • U: $k \times k$ matrix.

  • C.idx: $k$-dimensional vector containing the column index set.

  • R.idx: $k$-dimensional vector containing the row index set.
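A short usage sketch (the rank-10 test matrix below is constructed only for illustration):

library(rsvd)
set.seed(1)
A <- matrix(rnorm(100 * 10), 100, 10) %*% matrix(rnorm(10 * 80), 10, 80)   # exactly rank-10 matrix
out <- rcur(A, k = 10)                          # randomized CUR decomposition
A.approx <- out$C %*% out$U %*% out$R           # reconstruct the input from the skeletons
norm(A - A.approx, "F") / norm(A, "F")          # relative error (near zero for this rank-10 matrix)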

Input: Input matrix $\mathbf{A}$ with dimensions $m \times n$, and target rank $k$.
Optional: Parameters $p$ and $q$ to control oversampling, and the power scheme.

(1) randomized column ID (Algorithm 10)
(2) pivoted QR decomposition
(3) extract the top $k$ row indices
(4) extract the corresponding rows from the input matrix to form $\mathbf{R}$
(5) compute the pseudoinverse
(6) compute the well-conditioned matrix $\mathbf{U}$

Return: $\mathbf{C}$, $\mathbf{U}$, and $\mathbf{R}$

Remark 5.

The deterministic rank-$k$ CUR decomposition is computed by replacing the rid() function with the deterministic id() function, described in Algorithm 9.

Algorithm 8 A randomized CUR decomposition algorithm.

6.3 Randomized interpolative decomposition

The interpolative decomposition yields a low-rank factorization of the form

$\mathbf{A} \approx \mathbf{C} \mathbf{Z}$   (43)

where the factor matrix $\mathbf{C}$ is formed by a small number of actual columns of $\mathbf{A}$, while $\mathbf{Z}$ is a well-conditioned matrix containing a $k \times k$ identity block. $\mathbf{C}$ is also denoted as the skeleton matrix, and $\mathbf{Z}$ as the interpolation matrix.

The question is how to choose ‘interesting’ columns of