1 Introduction
Matrix decomposition is a fundamental operation widely used across numerous reallife applications, including data compression, optimization of machine learning models and dimensionality reduction. The latter application remains especially challenging, given the increasing amount of realworld highlydimensional data, such as multilingual textual corpora or drug discovery databases with complex chemical compounds. Although the original dimensionality of those datasets is high, in practice the data lays on lowerdimensional subspace with a smaller number of parameters. Dimensionaly reduction methods, such as singular value decomposition (SVD)
[wall2003singular], are typically used to find this lowerdimensional subspace in reallife problems ranging from data analysis [ding2010application, harkat2006improved], data compression [dony2001karhunen, gastpar2006distributed], statistics and machine learning [li2000recursive, hoffmann2007kernel] as well as clustering [STRUSKI2018161]. Despite the ubiquity of SVD, its computational complexity poses a significant problem, especially when dealing with largescale datasets.One of the approaches proposed to address this limitation of SVD is to approximate matrices of low rank using randomization techniques [feng2018faster, oropeza2010randomized]. These techniques are well known to give a significant speedup, yet their practical implementations are often cumbersome and do not leverage recent advancements in the architectural design of contemporary computational hardware, such as Graphical Processing Units (GPUs). For instance, the baseline methods used to solve this task, such as GESVD or Krylov subspace methods [krylov1931cislennom], do not utilize computational linear algebra to parallelize the operations, and effectively reduce the computational cost of SVD.
In this work, we present a new approach towards solving randomized SVD problem efficiently by exploiting fast matrix multiplications and random number generators, implemented on contemporary GPUs. More specifically, we rely on the computational linear algebra to decompose randomized SVD into a set of lowlevel matrix multiplications known as Basic Linear Algebra Subprograms^{2}^{2}2BLAS functionality is categorized into three sets of routines called "levels": BLAS1 operations typically take linear time , Bias2 operations quadratic time, and BLAS3 operations cubic time. (BLAS) [lawson1979basic]. These subprograms contain BLAS1 and BLAS2 operations, which are limited by system bandwidth, and BLAS3 operations bounded by the arithmetic operation throughput of a machine. The key to achieving high performance matrices operations is the effective translation of the computations into a set of BLAS3 operations, since those operations can be easily run in parallel on modern hardware architectures such as GPUs. This paper shows that such a solution can be efficiently implemented on the contermporary GPUs and allows for faster processing of largescale dataset.
To confirm the efficacy of our method, we extensively evaluate our approach against the stateoftheart competitors. Figure 1 shows the results of this evaluation which confirms the effectiveness of our implementation and a significant acceleration it obtains relative to stateoftheart methods. In the presented experiment, we consider two types of methods that calculate the whole spectrum of eigenvalues such as GESVD GPU and LAPACK [anderson1999lapack] dgesvd, as well as the methods that calculate only
largest eigenvalues such as LAPACK dsyevr, randomized singular value decomposition (RSVD)
[rsvd] and SVDS from package [RSpectra]. As a dataset, we take CelebA [liu2015faceattributes] where we resize the images to different sizes and our approach shows consistent 10x speedup, cf. Section 4.To summarize, our contributions can be defined as follows:

We introduce a novel approach for speeding up randomized SVD computations based on fast GPU operations: BLAS2 matrix multiplications and random number generations.

We extensively evaluate our method against stateoftheart approaches and empirically prove its improved efficacy.

We make our findings available to the public by releasing our code as part of the official CUDA implementation.
2 Related works
Principal component analysis (PCA) is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lowerdimensional data while preserving as much of the data’s variation as possible. It can be shown that the principal components are eigenvectors of the data’s covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition (SVD) of the data matrix.
If the matrix is small, we can compute eigenvalues using the characteristic polynomial. In practice, eigenvalues of large matrices are not computed using the characteristic polynomial, and algorithms to find them are iterative. There are Iterative numerical algorithms for approximating roots of polynomials, such as Newton’s method, but, in general, it is impractical to compute the characteristic polynomial and then apply these methods. One reason is that small roundoff errors in the coefficients of the characteristic polynomial can lead to large errors in the eigenvalues and eigenvectors [trefethen1997numerical].
One of the simplest and most accurate iterative algorithms is the power method [mises1929praktische]
(also known as the Von Mises iteration). In this method, a given vector is repeatedly applied to a matrix and properly normalized. Consequently, it will lie in the direction of the eigenvector associated with the eigenvalues. The power iteration is simple, but it may converge slowly. Moreover, the method approximates only one eigenvalue of a matrix. The Power approach is usually used as a part of more efficient techniques, e.g. Krylov methods
[krylov1931cislennom], inverse iteration, QRmethod. The basis of the QR method for calculating the eigenvalues of is the fact that real matrix can be factorized as where is orthogonal and is upper triangular. The method is efficient for the calculation of all eigenvalues of a matrix.In the case of a large data set, we do not usually require all the eigenvalues, but only a few first ones. Partial singular value decomposition algorithms are usually based on the Lanczos algorithm [calvetti1994implicitly, larsen1998lanczos, lehoucq1998arpack, baglama2005augmented]
. Randomized matrix algorithms nowadays are most popular in practical applications. In
[frieze2004fast] authors proposed Monte Carlo approach to SVD decomposition. The method for approximation of lowrank SVD decomposition is based on nonuniform row and column sampling. In [sarlos2006improved, liberty2007randomized, martinsson2011randomized] authors introduced a more robust approach based on random projections. The method constructs a subspace that captures the column space of a matrix. The modification of the method, which uses fast matrix multiplications was introduced in [woolfe2008fast]. In [halko2011finding] authors unified and expanded previous work on the randomized singular value decomposition and introduced stateoftheart algorithms to compute the approximation of lowrank singular value decomposition.The basic idea of probabilistic matrix algorithms is to apply a degree of randomness to derive a smaller matrix from a highdimensional matrix, which captures the essential information. Then, a deterministic matrix factorization algorithm is applied to the smaller matrix to compute a nearoptimal lowrank approximation. Recent has been demonstrated that this probabilistic framework can be effectively used to compute the pivoted QR decomposition
[duersch2017randomized], the pivoted LU decomposition [shabat2018randomized].3 Method description
In this section, we define the problem and introduce our method that leverages GPU computations of probabilistic matrix operations to speed up the processing time.
First, let be an arbitrary matrix for any , and be the target number of singular values. Recall that making singular value decomposition (SVD) of we obtain
where , are orthogonal matrices for . The above decomposition of is called as the compact SVD while it is a simple transformation of a classical SVD decomposition
where , are unitary and , have orthonormal columns.
In our approach, we use the randomized SVD algorithm as it is a straightforward realization of the proposed randomized SVD code from [halko2011finding], which computes the SVD of up to Frobenius norm relative error. Let be a good sketch of , the column space of should roughly contain the columns of by the lowrank approximation property i.e.
where is Frobenius norm and . If is a Gaussian projection matrix [johnson1984extensions] or count sketch [charikar2002finding] and ^{3}^{3}3The notation defines the asymptotic behavior., then the lowrank approximation property
holds in expectation.
From the above property we can describe randomized SVD pseudoalgorithm in the several steps that are described in Algorithm 1.
Note that so if you want only th largest eigenvalues of matrix , we used only the first five points of the Algorithm 1 procedure, i.e. we needed only the matrix . In Section 4 we describe and show results of experiments, where we calculate only th largest eigenvalues.
GPU computations to speed up processing
The above formulation let us decompose the problem of SVD into a set of matrix operations (predominantly multiplications) with a significant number of random numbers induced by the probabilistic approach. Exactly those two features of our formulation lead to a significant speed up observed when implementing Algorithm 1 on modern GPU architectures.
Thanks to the parallel nature of GPU processing, matrixmatrix operations can be effectively computed using BLAS3 operations. Throughputoriented processors, such as GPUs, are especially well suited for this kind of operation. BLAS3 kernels break down linear algebra operations in terms of GEMMlike operations, which attain very high performance on graphic processors.
Moreover, random number generators effectively optimized for GPU architecture can offer up to threefold speedup in processing time thanks to the CuRAND library^{5}^{5}5https://developer.nvidia.com/curand. Combined with the decomposition of randomized SVD into a set of matrixmatrix multiplications our approach leads to a significant computation speedups, as we present in the following sections.
4 Experiments
In this section, we evaluate the performance of our approach and compare it to other methods available either in CPU or in GPU. We document three different experiments: in the first one, we consider matrix A that is not square, then we test application of randomized SVD to principal component analysis of a random matrix, and finally we use our method to practical method of subspace clustering [STRUSKI2018161].
In the first two experiments, we run every method 10 times on the same datasets and calculate for average time denoted by and standard deviation denotes by . In the Figures we use the solid lines to draw the ratio , the dashed black line has a value of 1 (a reference to our method), and the shaded region of error defined by interval:
where denotes another method.
We consider two types of methods that calculate the whole spectrum such as GESVD GPU and LAPACK [anderson1999lapack] dgesvd and methods that calculate only largest eigenvalues such as LAPACK dsyevr, randomized singular value decomposition (RSVD) [rsvd] and SVDS from package [RSpectra].
Performance comparison
In this experiment, we make three different test cases for which we construct the matrix for with random orthogonal matrices and a certain spectrum , where for and

[label=()]

— fast decay,

— sharp decay (around breakout point ),

— slow decay.
We take 1%, 3%, 5%, 10% of eigenvalues on matrix sizes i.e. for 5% we calculate the largest eigenvalues. Due to the randomized singular value decomposition in our calculation, we kept the relative error on the limit of at most against the baseline method, which is GESVD GPU. Figures 2, 3, 4 show the evaluation results. The dashed black line has a value of 1 (a reference to our method).
Application to Principal Component Analysis
Principal Component Analysis (PCA) is a wellestablished technique for dimensionality reduction and multivariate analysis. Examples of its many applications include data compression, image processing, visualization, exploratory data analysis, pattern recognition, and time series prediction.
PCA is related to eigenvectors and eigenvalues because with their help we can reduce the dimensionality of the data at a certain level of information preservation in the data. The eigenvectors of the covariance matrix of data are actually the directions of the axes where there is the most variance (most information) and that we call Principal Components. On the other hence, the eigenvalues are simply the coefficients attached to eigenvectors, which give the amount of variance carried in each Principal Component.
In this example, we compare a few stateoftheart numerical methods to solve the eigenproblem. We consider not only the methods that can calculate the whole spectrum and associate with their eigenvalues. In particular, we focus on ones that can calculate only several eigenvalues, because our interestonly largest eigenvalues, i.e. we want to calculate only Principal Components. Figure 1 shows the time of operation of other methods in relation to ours (speedup). As a dataset, we take CelebA [liu2015faceattributes] where we resize the images to sizes , , , . Each RGB image about size , we flatten to the vector with size , where , is height, and the width of the image, respectively.
Application for subspace clustering
Subspace clustering is one of the applications of the singular value decomposition of the matrices. In this part of the paper we present the results of comparison of the method SuMC that is presented in [STRUSKI2018161] and application of this code with the randomized SVD algorithm that requires minor changes in the code.
Dataset  Solver type  Elapsed time (sec)  Solver calls  ARI score 
first  CPU  73653.0  262209  1.0 
GPU  2584.5  189291  1.0  
25[4pt/3pt]  
second  CPU       
GPU  23967.0  1944144  1.0 
Table 1 shows the results of method of SuMC which used solver of eigenvalues implemented on CPU and GPU in synthetic datasets, which we randomly generate on the spaces , where dim denotes dimension of a space. In these generated data we known upfront what are the real subspaces and we can use the theoretical compression rate for the data set. We generate 2 random synthetic datasets: the first contained points from dimensional subspace, points from dimensional subspace and points from dimensional subspace, where each points were presented in dimensional space; the second data contained points from dimensional subspace, points from dimensional subspace and points from dimensional subspace (each points were described in dimensional space), and for them, we calculated the mean of Rand index adjusted. In each experiment we started with the same initialization of points to clusters.
Note that the number of the solver call is the case of GPU in smaller than CPU which may indicate better generalization and this in turn result in faster convergence of the SuMC algorithm.
5 Conclusion
The experiments confirmed practical usability and high performance of the GPU implementation of a randomized SVD algorithm for computing small part of the spectrum. We have tested the method in three different tasks, i.e. just the performance analysis to compute 110% of the spectrum, application for PCA, and application for subspace clustering. Obtained results were satisfactory across the board, with the speedup reaching up to 100x for some chosen cases. Be believe that applying this method for large scale PCA, as part of the machine learning pipeline might be instrumental.
Acknowledgments
The authors would like to thank Lung Sheng Chien for discussion and comments on the paper. This research was supported by: grant no POIR.04.04.000014DE/1800 carried out within the TeamNet program of the Foundation for Polish Science cofinanced by the European Union under the European Regional Development Fund. This research was funded in part by National Science Centre, Poland grant no. 2020/39/D/ST6/01332 and grant no. 2020/39/B/ST6/01511. For the purpose of Open Access, the authors have applied a CCBY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission.