Modern data science and numerical mathematics applications operate on larger and larger data, often with strict requirements on execution time and power consumption. For many of these applications, hardware accelerators such as GPUs and FPGAs are a highly effective solution, especially when mixed-precision and reduced-precision arithmetic come into play [clark2010solving, gupta2015deep, clark2018pushing, sze2020efficient, parravicini2020reduced, sgherzi2021solving]. As spectral methods become ubiquitous in large-scale graph pipelines such as spectral clustering [ng2002spectral], information retrieval [langville2005survey], and ranking [dawson2021opinionrank], these techniques require algorithms that compute only a subset of the most relevant eigenvalues (i.e., the largest in modulus) and their associated eigenvectors, while taking advantage of the sparsity of real-world graphs. Graph analytics pipelines usually operate on graphs with millions or even billions of edges, rendering impractical traditional methods that compute all eigenvalues, as their space and time costs scale at least quadratically with the number of vertices. As the size of real-world graphs exceeds the device memory of modern GPUs, an eigensolver must be able to handle out-of-core matrices and distribute the computation across multiple GPUs. Moreover, it needs to support different numerical data types for storage and computation to meet storage and accuracy requirements.
In this work, we introduce a novel Top-K GPU eigensolver for real-valued sparse matrices, capable of handling matrices with billions of non-zero entries, partitioning the computation across multiple GPUs, and leveraging mixed-precision arithmetic to optimize accuracy and execution time. We compare our eigensolver against state-of-the-art CPU and FPGA implementations and investigate how mixed precision enables intermediate operations with higher precision while results are stored in space-efficient representations.
In summary, we present the following contributions:
A multi-GPU, mixed-precision, Top-K eigensolver that can process out-of-core sparse matrices with billions of non-zeros. To the best of our knowledge, this is the largest size reported in the literature (Section III).
A performance evaluation of our GPU eigensolver against state-of-the-art Top-K eigensolvers on multiple architectures. We are on average 67× faster than a multi-core CPU implementation and 1.9× faster than an FPGA hardware design, with low average reconstruction error (Section IV-B).
A characterization of mixed-precision arithmetic in terms of accuracy versus execution time. We show that mixed precision is 50% faster than double-precision floating-point arithmetic and 12× more accurate than single precision (Section IV-D).
II Related Work
Although solving Top-K sparse eigenproblems is computationally demanding and has strong practical applications, little prior research has optimized them with hardware accelerators. On GPUs, most Top-K eigensolvers are domain-specific, do not support large-scale inputs or multiple devices, or are outright unsupported on modern GPU architectures [aurentz2015gpu, evstigneev2017implementation, dubois2011accelerating]. The nvGRAPH library [nvgraph] by Nvidia internally uses the Lanczos algorithm, whose implementation is, however, not user-accessible. Mixed-precision arithmetic on GPUs has been evaluated on multiple numerical algorithms from the QUDA library; however, the effectiveness of mixed and reduced precision on the numerically unstable Lanczos algorithm is still unknown [babich2010parallelizing, clark2018pushing].
Custom hardware designs for Top-K sparse eigensolvers have recently been investigated by Sgherzi et al. [sgherzi2021solving], who prototyped their work on high-end FPGAs equipped with HBM. They investigate the role of mixed-precision arithmetic for Top-K sparse eigensolvers, but the proposed design does not scale to multiple devices or large out-of-core matrices. Moreover, limitations of the FPGA's HBM controller force unnecessary data replication and allow achieving only a fraction of the maximum HBM bandwidth. To the best of our knowledge, no other work optimizes large Top-K sparse eigencomputations using FPGAs or domain-specific architectures. There are numerous implementations of large-scale Top-K sparse eigenproblem solvers for CPUs [aktulga2012topology, hernandez2009survey, lee2018solution]. However, none is as well known as ARPACK [lehoucq1998arpack], a multi-core Fortran library that implements IRAM, the Implicitly Restarted Arnoldi Method, a variation of the Lanczos algorithm with support for non-Hermitian matrices.
Our sparse eigensolver employs a two-phase algorithm, as in Figure 1. The first phase is based on the Lanczos algorithm [lanczos1950iteration], which takes as input the original matrix A, the number of desired eigencomponents k, and an L2-normalized random starting vector. The algorithm proceeds by incrementally building a Krylov subspace of A (the Lanczos vectors V) through vector projections (Algorithm 1, line 9) and orthogonalizations (lines 11–21). A tridiagonal matrix T stores the residuals of the previous operations (lines 6, 10) and reduces the problem from size n × n to one of size k × k (with k ≪ n).
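As a concrete reference for the first phase, the Lanczos iteration with full reorthogonalization can be sketched in NumPy (an illustrative stand-in for our CUDA kernels; the function name and variable names are ours, not the solver's API):

```python
import numpy as np

def lanczos(A, k, seed=0):
    """Build a k-step Krylov subspace of the symmetric matrix A and the
    associated k-by-k tridiagonal matrix T (full reorthogonalization)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    V = np.zeros((k, n))           # Lanczos vectors, one per row
    alpha = np.zeros(k)            # diagonal of T (scalar products)
    beta = np.zeros(k)             # off-diagonal of T (L2 norms)
    v = rng.standard_normal(n)
    V[0] = v / np.linalg.norm(v)   # L2-normalized random start vector
    for i in range(k):
        w = A @ V[i]               # projection (the SpMV in the real solver)
        alpha[i] = w @ V[i]
        w -= alpha[i] * V[i]
        if i > 0:
            w -= beta[i - 1] * V[i - 1]
        # full reorthogonalization against all previous Lanczos vectors
        w -= V[:i + 1].T @ (V[:i + 1] @ w)
        if i + 1 < k:
            beta[i] = np.linalg.norm(w)
            V[i + 1] = w / beta[i]
    T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    return T, V
```

The eigenvalues of T (the Ritz values) approximate the most relevant eigenvalues of A, and combining the eigenvectors of T with the Lanczos vectors recovers the eigenvectors of A.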
The second phase employs the Jacobi algorithm [Rutishauser1966jacobi] to solve the eigenproblem on the much smaller matrix T. The Jacobi algorithm stores the eigenvalues of T in the main diagonal of the rotated matrix, and the eigenvectors of T in the accumulated rotation matrix Q. The eigenvectors of the original matrix A are then given by the product of the Lanczos vectors V with Q.
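The second phase can likewise be sketched with classical cyclic Jacobi rotations (again a NumPy illustration under our own naming, not the solver's code): each rotation zeroes one off-diagonal entry, and the product of rotations accumulates the eigenvectors.

```python
import numpy as np

def jacobi_eigen(T, sweeps=30, tol=1e-12):
    """Diagonalize the small symmetric matrix T with cyclic Jacobi rotations.
    Returns the eigenvalues (diagonal of the rotated matrix) and the
    accumulated rotation matrix Q, whose columns are the eigenvectors of T."""
    D = np.array(T, dtype=float)
    k = D.shape[0]
    Q = np.eye(k)
    for _ in range(sweeps):
        if np.linalg.norm(D - np.diag(np.diag(D))) < tol:
            break  # off-diagonal mass is negligible: D is diagonal
        for p in range(k - 1):
            for q in range(p + 1, k):
                if abs(D[p, q]) < tol:
                    continue
                # rotation angle that zeroes the (p, q) entry
                theta = 0.5 * np.arctan2(2.0 * D[p, q], D[q, q] - D[p, p])
                c, s = np.cos(theta), np.sin(theta)
                G = np.eye(k)
                G[p, p] = G[q, q] = c
                G[p, q], G[q, p] = s, -s
                D = G.T @ D @ G    # apply the Givens rotation on both sides
                Q = Q @ G          # accumulate eigenvectors
    return np.diag(D), Q
```

Given Lanczos vectors stored as rows of V, the eigenvectors of the original matrix would then be obtained as V.T @ Q.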
III-A Optimizing sparse eigensolvers for GPUs
We want our GPU sparse eigensolver to scale to real-world matrices with billions of non-zero values, as often encountered in graph analytics. To do so, we devise a workload partitioning scheme that distributes the computation across multiple GPUs while minimizing unnecessary data movement and synchronization events.
The Lanczos algorithm has two synchronization points (Algorithm 1, lines 6 and 10), corresponding to the computation of β and α (Figure 1 A B). Another optional synchronization point occurs if reorthogonalization of the Lanczos vectors is needed (lines 15–18, Figure 1 C). All other operations operate linearly on the input arrays and can be computed across multiple GPUs independently. The input matrix is partitioned by balancing the number of non-zero elements in each partition. All vectors, except for the SpMV input vector, are partitioned according to the same scheme as the input matrix. As the SpMV performs indirect accesses to its input vector, we replicate that vector on every GPU instead of partitioning it. This implies an additional synchronization at each iteration, when the previous Lanczos vector becomes the input of the SpMV. We prevent this synchronization by having each GPU copy, in a round-robin fashion, a single partition to a single replica (Figure 1 C). When all GPUs have completed a cycle, the vector has been fully copied, and the computation can proceed to a new iteration.
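The non-zero-balanced matrix partitioning can be sketched as follows (a NumPy sketch over a CSR row-pointer array; the function name and the contiguous-row-range policy are illustrative assumptions, not necessarily the exact scheme used in the solver):

```python
import numpy as np

def balance_partitions(row_ptr, num_gpus):
    """Split the rows of a CSR matrix into num_gpus contiguous ranges
    holding roughly the same number of non-zero elements."""
    nnz = int(row_ptr[-1])
    bounds = [0]
    for g in range(1, num_gpus):
        # first row whose non-zero prefix sum reaches the g-th target
        bounds.append(int(np.searchsorted(row_ptr, nnz * g // num_gpus)))
    bounds.append(len(row_ptr) - 1)
    return [(bounds[g], bounds[g + 1]) for g in range(num_gpus)]
```

Each GPU would then own a row range [start, end) of the matrix and the matching slices of all partitioned vectors, while the SpMV input vector is replicated everywhere.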
The orthogonality of the Lanczos vectors and the quality of the final eigencomponents depend crucially on the output of the scalar product (α, line 10) and the L2-norm (β, line 6). For this reason, our eigensolver can perform the intermediate operations of each kernel in double-precision floating-point arithmetic to ensure maximum accuracy. However, vectors can still be stored in single precision, to consume less device memory and better use the available memory bandwidth. In our experiments, other data types (half-precision FP16, BFLOAT16) resulted in numerical instability and have been omitted from Section IV.
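The pattern of single-precision storage with double-precision intermediate accumulation can be illustrated as follows (a NumPy sketch of the idea, not our CUDA kernels; the function names are ours):

```python
import numpy as np

def dot_mixed(x, y):
    """Scalar product of FP32-stored vectors, accumulated in FP64."""
    return float(np.dot(x.astype(np.float64), y.astype(np.float64)))

def norm_mixed(x):
    """L2 norm of an FP32-stored vector, accumulated in FP64."""
    return float(np.sqrt(dot_mixed(x, x)))
```

The vectors occupy half the memory of a double-precision layout, yet the reductions that feed α and β do not accumulate single-precision rounding error.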
III-B Implementation details
We implemented our eigensolver using the GrCUDA API [parravicini2021dag] and GraalVM [wurthinger2013one] to automatically support several high-level programming languages, while we wrote the core GPU computational kernels in CUDA. Since GrCUDA internally leverages CUDA unified memory, our eigensolver can scale to out-of-core computations on sparse matrices that would not otherwise fit in the GPU memory. We also modified the internal GrCUDA runtime to schedule GPU kernels across multiple devices, using a round-robin device selection policy for kernels operating on disjoint data. Through the partition swapping presented in Section III-A, we minimize unnecessary memory transfers between devices and out-of-core memory pages, guaranteeing scalability in what would otherwise be a memory- and transfer-bound computation (Section IV-C). The small tridiagonal matrices that the Lanczos algorithm outputs (of size k × k) cannot saturate the streaming multiprocessors of a modern GPU [cosnau2014gpueigen]. Instead, we achieve better execution time by performing this step on the CPU (Figure 1 D).
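The round-robin device selection policy for kernels operating on disjoint data can be sketched in a few lines (an illustration of the policy only; GrCUDA's actual scheduler also tracks data dependencies between kernels, and the names below are ours):

```python
from itertools import cycle

def round_robin_schedule(kernels, num_gpus):
    """Assign independent kernels to GPU devices in round-robin order."""
    devices = cycle(range(num_gpus))
    return [(kernel, next(devices)) for kernel in kernels]
```

For example, eight SpMV partitions scheduled on four devices are spread evenly, two per GPU, without any global coordination.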
IV Experimental Evaluation
We evaluate our sparse eigensolver in terms of execution time and quality of results, and provide a performance characterization against state-of-the-art sparse eigensolvers running on different hardware architectures. We provide an in-depth evaluation on a single GPU and validate the scalability of our algorithm on multiple GPUs (up to 8). Most importantly, we assess the impact of mixed-precision arithmetic and show that reduced precision results in faster execution with no meaningful detriment to accuracy.
IV-A Experimental Setup
We measure results for our Top-K sparse eigensolver using up to 8 Nvidia Tesla V100 GPUs, each equipped with HBM2 device memory. As baseline, we employ the multi-threaded ARPACK library [lehoucq1998arpack], a Top-K sparse eigensolver that uses the IRAM algorithm, running with single-precision floating-point arithmetic on two Intel Xeon Platinum 8167M CPUs (104 threads in total) with DDR4 memory. We also compare against the recent FPGA implementation by Sgherzi et al. [sgherzi2021solving], running on a Xilinx Alveo U280 accelerator card equipped with HBM2 memory. We repeat each measurement 20 times, using random initialization for the Lanczos starting vector.
To provide a fair comparison, we use the same collection of sparse matrices as Sgherzi et al. [sgherzi2021solving], enriched with two extremely large matrices (billions of non-zero entries) that do not fit in the FPGA's and GPU's device memory and allow us to test the out-of-core performance of our GPU implementation. All matrices come from the SuiteSparse collection [davis2011university] and represent graph topologies, although the eigensolvers in our analysis can be applied to other domains as well [tung2010enabling].
IV-B Execution Time Comparison
We first compare the speed of our eigensolver, when running on a single GPU, against the CPU and FPGA baselines on matrices of increasing size (Figure 2). Results are aggregated over an increasing number of eigenvectors k, from 8 to 24, as the execution time scales linearly with k. For the FPGA implementation, we use the values reported by the authors. Results on the two largest matrices (KRON and URAND) are omitted for the FPGA hardware design, as it does not support out-of-core computations.
The CPU and GPU implementations use single-precision floating-point arithmetic, while the FPGA hardware design uses 32-bit signed fixed-point arithmetic with one bit of integer part (S1.1.30) for Lanczos, and half-precision floating-point arithmetic for Jacobi.
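For reference, a 32-bit fixed-point format with one sign bit, one integer bit, and 30 fractional bits can be sketched as a simple quantizer (our illustrative reconstruction of such a format, not the FPGA design's code; names and saturation policy are assumptions):

```python
import numpy as np

FRAC_BITS = 30  # 1 sign bit + 1 integer bit + 30 fractional bits = 32 bits

def to_fixed(x):
    """Quantize floating-point values to the 32-bit fixed-point format,
    saturating at the representable extremes."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << FRAC_BITS))
    return np.clip(scaled, -(1 << 31), (1 << 31) - 1).astype(np.int64)

def from_fixed(q):
    """Convert fixed-point values back to floating point."""
    return np.asarray(q, dtype=np.float64) / (1 << FRAC_BITS)
```

The representable range is roughly (−2, 2), which suits the Lanczos phase since the vectors involved are L2-normalized, and the quantization step is 2^-30.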
Our GPU eigensolver is always faster than both the CPU and FPGA baselines (on the RC matrix the difference is not statistically significant). On average, the GPU eigensolver is 67× faster than the CPU implementation and 1.9× faster than the FPGA hardware design. The FPGA hardware design remains competitive in terms of Performance/Watt, as it consumes 38 W [sgherzi2021solving] versus the 300 W of the GPU [v100].
As our partitioning minimizes inter-GPU data transfers (Section III-A), we are 180× faster than the CPU on very large out-of-core matrices, despite storing only a small fraction of the input data on the GPUs at any given time.
Figure 3: (A) Relative execution time for an increasing number of GPUs. Two outlier matrices (plotted separately) perform worse with more GPUs due to the larger inter-GPU communication. (B) Accuracy of our eigensolver, in terms of orthogonality of the eigenvectors and L2 error, for increasing k.
IV-C Multi-GPU Performance
Scaling the computation of the Top-K eigenvectors of sparse matrices to multiple GPUs is far from trivial, as explained in Section III-A. From Figure 3, we observe that our partitioning scheme improves the execution time when using multiple GPUs, with somewhat diminishing returns. On average, two GPUs provide a 50% speedup, while eight GPUs come close to a 100% speedup. Only on two very small matrices do we observe a loss of performance on systems with 4 or 8 GPUs. This phenomenon is explained by the heterogeneous NVLink interconnection found in V100-based systems like ours [li2019evaluating]: some GPU pairs are not directly connected with NVLink, and their data transfers have to go through the CPU and PCIe, which has 10× lower bandwidth than NVLink.
IV-D Impact of Reorthogonalization and Mixed-precision
To measure the quality of our eigensolver, we compute the average angle between every pair of eigenvectors. Eigenvectors are by definition pairwise orthogonal, i.e. their angle is 90 degrees and their dot product is 0. Figure 3 provides, for increasing k, the average orthogonality and the L2 norm of the reconstruction error computed using the definition of eigenvalues. Both results are aggregated over all matrices due to space limitations. We observe that reorthogonalization improves the quality of the results, with about 2 degrees of difference compared to the implementation without reorthogonalization. Whether reorthogonalization is suitable depends on the application: spectral methods in machine learning often do not demand the same numerical accuracy as engineering applications, and reorthogonalization significantly increases the algorithmic complexity.
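Both quality metrics can be computed directly from the eigenpairs, as in this NumPy sketch (eigenvectors stored as columns of V; all names are illustrative):

```python
import numpy as np

def avg_pairwise_angle_deg(V):
    """Average angle, in degrees, between every pair of eigenvector columns
    (90 degrees means perfectly orthogonal eigenvectors)."""
    U = V / np.linalg.norm(V, axis=0)
    cosines = np.clip(np.abs(U.T @ U), 0.0, 1.0)
    iu = np.triu_indices(V.shape[1], k=1)   # each pair counted once
    return float(np.degrees(np.arccos(cosines[iu])).mean())

def reconstruction_error(A, V, lam):
    """Mean L2 norm of A v - lambda v over the computed eigenpairs,
    i.e. the residual of the eigenvalue definition."""
    residual = A @ V - V * lam
    return float(np.mean(np.linalg.norm(residual, axis=0)))
```

For exact eigenpairs both metrics are at their ideal values (90 degrees and 0), so deviations directly quantify the loss introduced by skipping reorthogonalization or reducing precision.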
Employing mixed-precision arithmetic in numerical algorithms is usually a matter of trade-offs, with better precision translating to higher execution time. We visualize this behavior in Figure 4, showing for each matrix the L2 reconstruction error and the relative execution time, together with a linear regression to capture the general trend. In all cases, increasing precision reduces the error and increases the execution time. The float-double-float (FDF) configuration (Section III-A) is 50% faster than a pure double-precision implementation (DDD). Its error is only 40% higher than DDD's, and 12× lower than the single-precision implementation (FFF), showing how mixed-precision arithmetic is an excellent compromise for Top-K sparse eigensolvers.
V Conclusion
As graph analytics and spectral methods deal with larger and larger sparse matrices, it is critical to have high-performance Top-K sparse eigensolvers to extract low-dimensional representations of sparse datasets. We provide a novel GPU Top-K sparse eigensolver that scales to out-of-core matrices with billions of non-zero entries, partitions the computation over multiple GPUs, and leverages mixed-precision floating-point arithmetic. We are on average 67× faster than the multi-core ARPACK CPU library and 1.9× faster than a state-of-the-art FPGA hardware design. As future work, we will extend our implementation to fixed-point arithmetic and validate whether novel interconnection technologies such as NVSwitch can further improve multi-GPU scaling.