I Introduction
Modern data science and numerical mathematics applications operate on ever-larger data, often under strict requirements on execution time and power consumption. For many of these applications, hardware accelerators such as GPUs and FPGAs are a highly effective solution, especially when mixed-precision and reduced-precision arithmetic come into play [clark2010solving, gupta2015deep, clark2018pushing, sze2020efficient, parravicini2020reduced, sgherzi2021solving]. As spectral methods become ubiquitous in large-scale graph pipelines for spectral clustering [ng2002spectral], information retrieval [langville2005survey], and ranking [dawson2021opinionrank], such techniques require algorithms that compute only a subset of the most relevant eigenvalues (i.e., the largest in modulus) and their associated eigenvectors, while taking advantage of the sparsity of real-world graphs. Graph analytics pipelines usually operate on graphs with millions or even billions of edges, rendering impractical traditional methods that compute all eigenvalues, as their space and time cost scales at least quadratically with the number of vertices. As the size of real-world graphs exceeds the device memory of modern GPUs, an eigensolver must be able to handle out-of-core matrices and to distribute the computation across multiple GPUs. Moreover, it needs to support different numerical data types for storage and computation to meet storage and accuracy requirements.

In this work, we introduce a novel Top-K GPU eigensolver for real-valued sparse matrices, capable of handling matrices with billions of nonzero entries, of partitioning the computation across multiple GPUs, and of leveraging mixed-precision arithmetic to optimize accuracy and execution time. We compare our eigensolver against state-of-the-art CPU and FPGA implementations and investigate how mixed-precision arithmetic enables intermediate operations in higher precision while results are stored in space-efficient representations.
In summary, we present the following contributions:

A multi-GPU, mixed-precision, Top-K eigensolver that can process out-of-core sparse matrices with billions of nonzeros. To the best of our knowledge, this is the largest size reported in the literature (Section III).

A performance evaluation of our GPU eigensolver against state-of-the-art Top-K eigensolvers on multiple architectures. We are on average 67× faster than a multi-core CPU implementation and 1.9× faster than an FPGA hardware design, with low average reconstruction error (Section IV-B).

A characterization of mixed-precision arithmetic in terms of accuracy versus execution time. We show that mixed precision is 50% faster than double-precision floating-point arithmetic and 12× more accurate than single precision (Section IV-D).
II Related Work
Although solving Top-K sparse eigenproblems is computationally demanding and has strong practical applications, little prior research has optimized them with hardware accelerators. On GPUs, most Top-K eigensolvers are domain-specific, do not support large-scale inputs or multiple devices, or are outright unsupported on modern GPU architectures [aurentz2015gpu, evstigneev2017implementation, dubois2011accelerating]. The nvGRAPH library [nvgraph] by Nvidia internally uses the Lanczos algorithm, but its implementation is not user-accessible. Mixed-precision arithmetic for numerical algorithms on GPUs has been evaluated on multiple algorithms from the QUDA library; however, the effectiveness of mixed and reduced precision on the numerically unstable Lanczos algorithm is still unknown [babich2010parallelizing, clark2018pushing].
Custom hardware designs for Top-K sparse eigensolvers have recently been investigated by Sgherzi et al. [sgherzi2021solving], who prototyped their work on a high-end FPGA equipped with HBM. They investigate the role of mixed-precision arithmetic for Top-K sparse eigensolvers, but the proposed design does not scale to multiple devices or to large out-of-core matrices. Moreover, limitations of the FPGA's HBM controller force unnecessary data replication and allow achieving only a fraction of the maximum HBM bandwidth. To the best of our knowledge, no other work optimizes large Top-K sparse eigencomputations using FPGAs or domain-specific accelerators. There are numerous implementations of large-scale Top-K sparse eigenproblem solvers for CPUs [aktulga2012topology, hernandez2009survey, lee2018solution]. However, none is as well-known as ARPACK [lehoucq1998arpack], a multi-core Fortran library that implements the Implicitly Restarted Arnoldi Method (IRAM), a variation of the Lanczos algorithm with support for non-Hermitian matrices.
III Implementation
Our sparse eigensolver employs a two-phase algorithm, as shown in Figure 1. The first phase is based on the Lanczos algorithm [lanczos1950iteration], which takes as input the original matrix A, the number K of desired eigencomponents, and an L2-normalized random vector. The algorithm proceeds by incrementally building a Krylov subspace of A (the Lanczos vectors V) through vector projections (Algorithm 1, line 9) and orthogonalizations (lines 11–21). A tridiagonal matrix T stores the residuals of the previous operations (lines 6, 10) and reduces the problem from size n×n to one of size K×K (K ≪ n).
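To make the structure of this first phase concrete, the following is a minimal pure-Python sketch of Lanczos tridiagonalization under the standard three-term recurrence; the names (`lanczos`, `matvec`) are illustrative and not part of our implementation, which runs these steps as CUDA kernels.

```python
import math
import random

def lanczos(matvec, n, k, seed=0):
    """Sketch of Lanczos tridiagonalization (illustrative, not our API).

    matvec: function computing y = A @ v for the sparse matrix A
    n: matrix dimension; k: number of Lanczos vectors (k << n)
    Returns (alpha, beta, V): the diagonal and off-diagonal of the
    k-by-k tridiagonal matrix T, and the list of Lanczos vectors.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nrm = math.sqrt(sum(x * x for x in v))
    v = [x / nrm for x in v]                 # L2-normalized start vector
    V, alpha, beta = [v], [], []
    w_prev, b_prev = [0.0] * n, 0.0
    for j in range(k):
        w = matvec(V[j])                     # projection (the SpMV step)
        a = sum(wi * vi for wi, vi in zip(w, V[j]))      # alpha_j = <w, v_j>
        w = [wi - a * vi - b_prev * pi                   # orthogonalize against
             for wi, vi, pi in zip(w, V[j], w_prev)]     # v_j and v_{j-1}
        b = math.sqrt(sum(x * x for x in w))             # beta_j = ||w||_2
        alpha.append(a)
        beta.append(b)
        if j + 1 < k:
            if b == 0.0:
                break                        # invariant subspace found
            V.append([x / b for x in w])
            w_prev, b_prev = V[j], b
    return alpha, beta, V
```

In exact arithmetic the three-term recurrence keeps the Lanczos vectors orthonormal; in floating point this degrades over many iterations, which is why reorthogonalization (discussed below) may be needed.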
The second phase employs the Jacobi algorithm [Rutishauser1966jacobi] to solve the eigenproblem on the much smaller matrix T. The Jacobi algorithm iteratively stores the eigenvalues of T in the main diagonal of T, and accumulates the eigenvectors of T in a matrix Q. The eigenvectors of A are then given by V·Q.
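A compact sketch of this second phase, assuming a cyclic Jacobi sweep over a small dense symmetric matrix (the function name and loop structure are ours, for illustration only):

```python
import math

def jacobi_eigs(T, sweeps=30, tol=1e-12):
    """Cyclic Jacobi eigensolver sketch for a small symmetric matrix T
    (e.g. the K-by-K tridiagonal output of Lanczos). Returns
    (eigenvalues, Q), where the columns of Q are the eigenvectors."""
    n = len(T)
    A = [row[:] for row in T]
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = math.sqrt(sum(A[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break                              # off-diagonal mass is gone
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < tol:
                    continue
                # rotation angle that zeroes A[p][q]
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):             # rotate rows p and q
                    Apk, Aqk = A[p][k], A[q][k]
                    A[p][k] = c * Apk - s * Aqk
                    A[q][k] = s * Apk + c * Aqk
                for k in range(n):             # rotate columns p and q
                    Akp, Akq = A[k][p], A[k][q]
                    A[k][p] = c * Akp - s * Akq
                    A[k][q] = s * Akp + c * Akq
                for k in range(n):             # accumulate eigenvectors
                    Qkp, Qkq = Q[k][p], Q[k][q]
                    Q[k][p] = c * Qkp - s * Qkq
                    Q[k][q] = s * Qkp + c * Qkq
    return [A[i][i] for i in range(n)], Q
```

Each rotation zeroes one off-diagonal entry of T; sweeping over all pairs drives the matrix toward diagonal form, leaving the eigenvalues on the diagonal and the accumulated rotations in Q.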
III-A Optimizing sparse eigensolvers for GPUs
We want our GPU sparse eigensolver to scale to real-world matrices with billions of nonzero values, as often encountered in graph analytics. To do so, we devise a workload partitioning scheme that distributes the computation across multiple GPUs while minimizing unnecessary data movement and synchronization events.
The Lanczos algorithm has two synchronization points (Algorithm 1, lines 6 and 10), corresponding to the computation of the coefficients β and α (Figure 1 A, B). Another optional synchronization point occurs if reorthogonalization of the Lanczos vectors is needed (lines 15–18, Figure 1 C). All other operations act element-wise on the input arrays and can be computed independently across multiple GPUs. The input matrix is partitioned by balancing the number of nonzero elements in each partition. All vectors, except for v, are partitioned according to the same scheme as the input matrix. As the SpMV performs indirect accesses to the vector v, we replicate it on every GPU instead of partitioning it. This introduces an additional synchronization at each iteration, when the previous Lanczos vector becomes the input of the SpMV. We prevent this synchronization by having each GPU copy, in a round-robin fashion, a single partition to a single replica of v (Figure 1 C). When all GPUs have completed a cycle, v has been fully copied, and the computation can proceed to a new iteration.
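The nnz-balanced matrix partitioning described above can be sketched as follows; the function name and the CSR row-pointer interface are our illustrative assumptions, not the actual runtime API:

```python
def partition_rows_by_nnz(row_ptr, num_parts):
    """Split the rows of a CSR matrix into num_parts contiguous slices
    holding roughly equal numbers of nonzeros.

    row_ptr: CSR row-pointer array of length n+1 (row_ptr[-1] == nnz).
    Returns a list of (row_start, row_end) half-open ranges, one per GPU.
    """
    n = len(row_ptr) - 1
    total = row_ptr[-1]
    parts, start = [], 0
    for p in range(1, num_parts):
        end = start
        # grow the slice until it reaches the p-th cumulative nnz quota
        while end < n and row_ptr[end] < p * total / num_parts:
            end += 1
        parts.append((start, end))
        start = end
    parts.append((start, n))
    return parts
```

Balancing on nonzeros rather than rows matters for power-law graphs, where a few rows may hold a large fraction of all edges.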
The orthogonality of the Lanczos vectors, and the quality of the final eigencomponents, crucially depend on the output of the scalar product (α, line 10) and of the L2-norm (β, line 6). For this reason, our eigensolver can perform the intermediate operations of each kernel in double-precision floating-point arithmetic to ensure maximum accuracy. However, vectors can still be stored in single precision, to consume less device memory and make better use of the available memory bandwidth. In our experiments, other data types (half-precision FP16, BFLOAT16) resulted in numerical instability and have been omitted from Section IV.
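The idea of FP32 storage with FP64 intermediate arithmetic can be illustrated on a dot product. The sketch below emulates FP32 rounding with the standard-library `struct` module (the helper names are ours); the mixed version accumulates in double precision and rounds only the stored result:

```python
import struct

def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def dot_fdf(xs, ys):
    """Float-double-float scheme: FP32 operands and result,
    FP64 accumulation for the intermediate operations."""
    acc = 0.0                                          # FP64 accumulator
    for x, y in zip(xs, ys):
        acc += to_f32(x) * to_f32(y)                   # products in FP64
    return to_f32(acc)                                 # store back as FP32

def dot_fff(xs, ys):
    """Pure FP32 accumulation, for comparison: every intermediate
    result is rounded to single precision."""
    acc = to_f32(0.0)
    for x, y in zip(xs, ys):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc
```

On long reductions, such as the scalar products over million-element Lanczos vectors, the FP32-only accumulator loses low-order bits at every addition, while the FP64 accumulator pays only a single final rounding.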
III-B Implementation details
We implemented our eigensolver using the GrCUDA API [parravicini2021dag] and GraalVM [wurthinger2013one] to automatically support several high-level programming languages, while we wrote the core GPU computational kernels in CUDA. Since GrCUDA internally leverages CUDA unified memory, our eigensolver can scale to out-of-core computations on sparse matrices that would not otherwise fit in GPU memory. We also modified the internal GrCUDA runtime to schedule GPU kernels across multiple devices, using a round-robin device-selection policy for kernels operating on disjoint data. Through the partition swapping presented in Section III-A, we minimize unnecessary memory transfers between devices and out-of-core memory pages, guaranteeing scalability in what would otherwise be a memory- and transfer-bound computation (Section IV-C). The small K×K tridiagonal matrices output by the Lanczos algorithm cannot saturate the stream processors of a modern GPU [cosnau2014gpueigen]. Instead, we achieve better execution time by performing this step on the CPU (Figure 1 D).
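The per-device share of the SpMV step, where each GPU processes its row slice against a full local replica of v, can be sketched as follows (plain Python standing in for the CUDA kernel; CSR layout and the function name are illustrative assumptions):

```python
def spmv_slice(row_ptr, col_idx, vals, v, row_start, row_end):
    """One device's share of y = A @ v: rows [row_start, row_end) of a
    CSR matrix, multiplied against a full replica of v. The replica is
    needed because col_idx accesses v at arbitrary positions."""
    out = []
    for r in range(row_start, row_end):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * v[col_idx[k]]       # indirect access into v
        out.append(acc)
    return out
```

Concatenating the slices produced by all devices yields the full product y = A·v, which is why the only cross-device traffic per iteration is the round-robin refresh of the v replicas.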







ID     Matrix            Rows (M)   Nonzeros (M)
WBTA   wiki-Talk             2.39           5.02
WBGO   web-Google            0.91           5.11
WBBE   web-BerkStan          0.69           7.60
FL     Flickr                0.82           9.84
IT     italy_osm             6.69          14.02
PA     patents               3.77          14.97
VL3    venturiLevel3         4.02          16.10
DE     germany_osm          11.54          24.73
ASIA   asia_osm             11.95          25.42
RC     road_central         14.08          33.87
WK     Wikipedia             3.56          45.00
HT     hugetrace-00020      16.00          47.80
WB     wb-edu                9.84          57.15
KRON   GAP-kron            134.21        4223.26
URAND  GAP-urand           134.21        4294.96
IV Experimental Evaluation
We evaluate our sparse eigensolver in terms of execution time and quality of results, and provide a performance characterization against state-of-the-art sparse eigensolvers running on different hardware architectures. We provide an in-depth evaluation on a single GPU and validate the scalability of our algorithm on multiple GPUs (up to 8). Most importantly, we assess the impact of mixed-precision arithmetic and show how reduced precision results in faster execution with no meaningful loss of accuracy.
IV-A Experimental Setup
We measure results for our Top-K sparse eigensolver using up to 8 Nvidia Tesla V100 GPUs, each equipped with HBM2 device memory. As a baseline, we employ the multi-threaded ARPACK library [lehoucq1998arpack], a Top-K sparse eigensolver based on the IRAM algorithm, running with single-precision floating-point arithmetic on two Intel Xeon Platinum 8167M CPUs (104 threads in total) with DDR4 memory. We also compare against the recent FPGA implementation by Sgherzi et al. [sgherzi2021solving], running on a Xilinx Alveo U280 accelerator card equipped with HBM2 memory. We repeat each measurement 20 times, using random initialization for the Lanczos vectors.
To provide a fair comparison, we use the same collection of sparse matrices as Sgherzi et al. [sgherzi2021solving], enriched with two extremely large matrices (billions of nonzero entries) that do not fit in the FPGA's or GPU's device memory and allow us to test the out-of-core performance of our GPU implementation. All matrices come from the SuiteSparse collection [davis2011university] and represent graph topologies, although the eigensolvers in our analysis can be applied to other domains as well [tung2010enabling].
IV-B Execution Time Comparison
We first compare the speed of our eigensolver, running on a single GPU, against the CPU and FPGA baselines on matrices of increasing size (Figure 2). Results are aggregated over an increasing number of eigenvectors K, from 8 to 24, as the execution time scales linearly with K. For the FPGA implementation, we use the values reported by the authors. Results on the two largest matrices (KRON and URAND) are omitted for the FPGA hardware design, as it does not support out-of-core computation.
The CPU and GPU use single-precision floating-point arithmetic, while the FPGA hardware design uses 32-bit signed fixed-point arithmetic with one bit of integer part (S1.1.30) for the Lanczos phase, and half-precision floating-point arithmetic for the Jacobi phase.
Our GPU eigensolver is always faster than both the CPU and FPGA baselines (on the RC matrix, the difference is not statistically significant). On average, the GPU eigensolver is 67× faster than the CPU implementation and 1.9× faster than the FPGA hardware design. The FPGA design remains competitive in terms of performance per watt, as it consumes 38 W [sgherzi2021solving] versus the 300 W of the GPU [v100].
As our partitioning minimizes inter-GPU data transfer (Section III-A), we are 180× faster than the CPU on very large out-of-core matrices, despite storing only a small fraction of the input data on the GPU at any given time.
Figure 3: (A) Relative execution time for an increasing number of GPUs. Two outlier matrices (plotted separately) perform worse with more GPUs due to higher inter-GPU communication. (B) Accuracy of our eigensolver, in terms of orthogonality of the eigenvectors and L2 error, for increasing K.

IV-C Multi-GPU Performance
Scaling the computation of the Top-K eigenvectors of sparse matrices to multiple GPUs is far from trivial, as explained in Section III-A. From Figure 3, we observe that our partitioning scheme improves the execution time when using multiple GPUs, although with diminishing returns. On average, two GPUs provide a 50% speedup, while eight GPUs approach a 100% speedup. Only on two very small matrices do we observe a loss of performance on systems with 4 or 8 GPUs. This phenomenon is explained by the heterogeneous NVLink interconnect found in V100-based systems like ours [li2019evaluating]: some GPU pairs are not directly connected with NVLink, so data transfers must go through the CPU and PCIe, which has 10× lower bandwidth than NVLink.
IV-D Impact of Reorthogonalization and Mixed-precision
To measure the quality of our eigensolver, we measure the average angle between every pair of computed eigenvectors. Eigenvectors are by definition pairwise orthogonal, i.e., their angle is 90° and their dot product is 0. Figure 3 reports, for increasing K, the average orthogonality and the L2 norm of the reconstruction error, computed using the definition of eigenvalues. Both results are aggregated over all matrices due to space limitations. We observe that reorthogonalization improves the quality of the results, with about 2 degrees of difference compared to the implementation without reorthogonalization. Whether reorthogonalization is worthwhile depends on the application: spectral methods in machine learning often do not demand the same numerical accuracy as engineering applications, and reorthogonalization increases the algorithmic complexity.

Employing mixed-precision arithmetic in numerical algorithms is usually a matter of trade-offs, with better precision translating to higher execution time. We visualize this behavior in Figure 4, which shows, for each matrix, the L2 reconstruction error and the relative execution time, with a linear regression to capture the general trend. In all cases, increasing precision reduces the error and increases the execution time. The float-double-float (FDF) configuration (Section III-A) is 50% faster than a pure double-precision implementation (DDD). Its error is only 40% higher than DDD's, and 12× lower than that of the single-precision implementation (FFF), showing how mixed-precision arithmetic is an excellent compromise for Top-K sparse eigensolvers.

V Conclusion
As graph analytics and spectral methods deal with ever-larger sparse matrices, it is critical to have high-performance Top-K sparse eigensolvers to extract low-dimensional representations of sparse datasets. We provide a novel GPU Top-K sparse eigensolver that scales to out-of-core matrices with billions of nonzero entries, partitions the computation over multiple GPUs, and leverages mixed-precision floating-point arithmetic. We are on average 67× faster than the multi-core ARPACK CPU library implementation and 1.9× faster than a state-of-the-art FPGA hardware design. As future work, we will extend our implementation to fixed-point arithmetic and validate whether novel interconnect technologies such as NVSwitch can further improve multi-GPU scaling.