I. Introduction
Research in information retrieval and recommender systems has sparked new interest in spectral methods [zhou2019SpectralRecommender], a class of Machine Learning algorithms able to detect communities in large social and e-commerce graphs and to compute the similarity of graph elements such as users or products [tang2020spectral]. At the core of many spectral methods lies the Top-K eigenproblem for large-scale sparse matrices, i.e. the computation of the eigenvectors associated with the largest eigenvalues (in absolute value) of a matrix that stores only non-zero entries (Figure 1). For example, the well-known Spectral Clustering algorithm boils down to computing the largest eigenvalues of a sparse matrix representing the graph topology [ng2002spectral].
Despite the rise of theoretical interest in spectral methods, little research has focused on improving the performance and scalability of Top-K sparse eigenproblem solvers to make them applicable to large-scale graphs. Most existing high-performance implementations of eigenproblem algorithms operate on dense matrices and are unable to process matrices with millions of rows and columns (each encoding, for example, a user's friends in a social network graph) [Myllykoski2020GPULIBQR]. Even the highly optimized multi-core implementation of LAPACK requires more than 3 minutes to solve the full eigenproblem on a small graph on a Xeon 6248, as the eigenproblem complexity scales at least quadratically with the number of vertices in the graph. Many implementations that support sparse matrices, on the other hand, are either forced to compute all the eigenvalues or require an expensive matrix inversion before solving the eigenproblem [knyazev2001toward].
The need for high-performance Top-K sparse eigenproblem algorithms goes hand in hand with custom hardware designs that can outperform traditional architectures in raw performance and power efficiency, given that applications built on top of the Top-K eigenproblem are mostly encountered in data centers. In this work, we tackle both problems by presenting a new Top-K sparse eigensolver whose building blocks are specifically optimized for high-performance hardware designs.
We introduce a novel algorithm to address the Top-K sparse eigenproblem and prototype it through a custom hardware design on an FPGA accelerator card; to the best of our knowledge, this is the first FPGA-based Top-K sparse eigensolver. Our algorithm is a two-step procedure that combines the Lanczos algorithm (to reduce the problem size) [lanczos1950iteration] with the Jacobi algorithm (to compute the final eigencomponents) [Rutishauser1966jacobi], as shown in Figure 2. The Lanczos algorithm, often encountered in Top-K sparse eigensolvers [nmeth2020scipy], has never been combined with the Jacobi algorithm. Part of the reason lies in their different computational bottlenecks: the Lanczos algorithm demands large memory bandwidth, while the Jacobi algorithm is strongly compute-bound. Our approach exploits the strengths of FPGA accelerator cards and overcomes the limitations of traditional architectures in this class of algorithms.
First, the Lanczos algorithm presents a sparse matrix-vector multiplication (SpMV) as its main bottleneck, an extremely memory-intensive computation with indirect and fully random memory accesses (Figure 2B). Optimizing SpMV requires high peak memory bandwidth and fine-grained control over memory accesses, without passing through the traditional caching policies of general-purpose architectures. Our hardware design features an iterative dataflow SpMV with multiple compute units, leveraging every HBM channel through a custom memory subsystem that efficiently handles indirect memory accesses.
Then, we introduce a systolic array design for the Jacobi eigenvalue algorithm, a computationally intensive operation that operates on reduced-size inputs (Figure 2D). The Jacobi algorithm maps naturally to a systolic array that ensures convergence, while traditional architectures cannot ensure the same degree of performance. CPUs cannot guarantee that all the data are kept in L1 cache and are unlikely to have enough floating-point arithmetic units to parallelize the computation, resulting in execution times more than 50 times higher than our FPGA design (Section V). GPUs, instead, cannot fill all their Stream Processors, as the input size is much smaller than what is required to fully utilize the GPU's parallelism [cosnau2014gpueigen].
Moreover, our FPGA-based hardware design employs highly optimized mixed-precision arithmetic, partially replacing traditional floating-point computations with faster fixed-point arithmetic. While high numerical accuracy is usually demanded of eigenproblem algorithms, we employ fixed-point arithmetic in parts of the design that are not critical to the overall accuracy and resort to floating-point arithmetic when required to guarantee precise results.
In summary, we present the following contributions:

- A novel algorithm for the approximate resolution of large-scale Top-K sparse eigenproblems (Section III), optimized for custom hardware designs.

- A modular mixed-precision FPGA design of our algorithm that efficiently exploits the available programmable logic and the bandwidth of DDR and HBM memory (Section IV).

- A performance evaluation of our Top-K eigendecomposition algorithm against the multi-core ARPACK CPU library, showing a speedup of 6.22x and a power efficiency gain of 49x, with a minimal reconstruction error due to mixed-precision arithmetic (Section V).
II. Related Work
To the best of our knowledge, no prior work optimizes the Top-K sparse eigenproblem with custom FPGA hardware designs.
The best-known large-scale Top-K sparse eigenproblem solver on CPU is the ARPACK library [lehoucq1998arpack], a multi-core Fortran library that is also available in SciPy and MATLAB through thin software wrappers. ARPACK implements the Implicitly Restarted Arnoldi Method (IRAM), a variation of the Lanczos algorithm that supports non-Hermitian matrices. Other sparse eigensolvers provide techniques optimized for specific domains or matrix types, although none is as common as ARPACK [aktulga2012topology, hernandez2009survey, maschhoff1996p_arpack, lee2018solution].
On GPUs, the cuSOLVER library [cusolver] by Nvidia provides a simple eigensolver based on the shift-invert method that retrieves only the largest eigenvalue and its eigenvector (i.e. K = 1), which is significantly more limited than the general Top-K eigenproblem. The nvGRAPH library [nvgraph], also developed by Nvidia, provides an implementation of spectral clustering at whose core lies the Lanczos algorithm; however, the implementation of the inner Lanczos algorithm is not publicly available. To the best of our knowledge, there is no publicly available GPU implementation of the Lanczos algorithm that can solve the large-scale sparse eigenproblems required by spectral methods. The MAGMA library [magma] solves the Top-K sparse eigenproblem through the alternative LOBPCG algorithm [knyazev2001toward], which, unlike the Lanczos algorithm, requires multiple iterations (each containing at least one SpMV) to compute even a single eigenvector. Other GPU-based Top-K eigensolvers are domain-specific, do not support large-scale inputs, or do not leverage features of modern GPUs such as HBM memory or mixed-precision arithmetic [evstigneev2017implementation, dubois2011accelerating]. Eigensolvers for dense matrices are more common on GPUs, as they easily exploit the enormous memory bandwidth of these architectures: Myllykoski et al. [Myllykoski2020GPULIBQR] focus on accelerating the case of dense high-dimensional matrices, while Cosnuau [cosnau2014gpueigen] operates on multiple small input matrices. Clearly, none of the techniques that operate on dense matrices can easily scale to matrices with millions of rows, as simply storing them requires terabytes of memory.
Specialized hardware designs for eigensolvers are limited to resolving the full eigenproblem on small dense matrices, through the QR-Householder decomposition and the Jacobi eigenvalue algorithm. Most formulations of the Jacobi algorithm [gupta2019eigen, Burger2020Embedded] leverage systolic arrays, a major building block of high-performance domain-specific architectures from their inception [kung1982systolic] to more recent results [jouppi2017datacenter, langhammer2020high, asgari2020proposing]. However, hardware designs of the Jacobi algorithm based on systolic arrays cannot scale to large matrices, as resource utilization scales linearly with the size of the matrix. Implementations of the QR-Householder algorithm face similar problems [vazquez2020fpga, aslan2012qr], as they also leverage systolic architectures, although research on resource-efficient designs does exist [Desai2017QRHLS].
III. Solving the Top-K Sparse Eigenproblem
Algorithms like Spectral Clustering contain at their core a Top-K sparse eigenproblem, i.e. finding eigenvalues and eigenvectors of sparse matrices representing, for instance, graph topologies with millions of vertices and edges.
Given a sparse square matrix M and an integer K, the goal of the Top-K sparse eigenproblem is to find the K eigenvalues with the highest magnitude and their associated eigenvectors. This is equivalent to computing the approximate decomposition M ≈ U Λ U^T, where the columns of U are the K eigenvectors and Λ is a diagonal matrix containing the K eigenvalues. Indeed, computing all the eigenvalues of the matrix is intractable for large matrices and redundant for the many applications that require only a handful of eigencomponents. For example, Spectral Clustering and many of its variations rely only on the Top-K eigenvectors, with small values of K.
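As a concrete point of reference, the decomposition above can be sketched with SciPy's ARPACK-backed `eigsh` (the same solver family used as the CPU baseline in Section V); the matrix size, density, and K below are arbitrary illustration values, not values from the paper:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Small random symmetric sparse matrix, standing in for a graph matrix.
A = sp.random(200, 200, density=0.05, random_state=0)
M = ((A + A.T) / 2).tocsr()

K = 6
# Top-K eigenpairs by largest magnitude ('LM').
lam, U = eigsh(M, k=K, which='LM')

# The K pairs satisfy M u ≈ lam * u, i.e. M @ U ≈ U @ diag(lam).
residual = np.linalg.norm(M @ U - U @ np.diag(lam))
```

The residual is tiny for converged eigenpairs, which is exactly the reconstruction-error metric evaluated in Section V-C.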
In this work, we propose a novel algorithm to solve the Top-K sparse eigenproblem, combining the Lanczos algorithm and the Jacobi eigenvalue algorithm. Our technique is particularly suited for highly optimized and modular hardware designs. The first phase leverages the Lanczos algorithm, taking as input the original matrix M, the number K of desired eigencomponents, and an L2-normalized random vector. The Lanczos algorithm outputs a symmetric tridiagonal matrix T (Figure 3) and a set of orthogonal Lanczos vectors, collected as the columns of a matrix V. As a second step, we apply the Jacobi eigenvalue algorithm to T. This algorithm transforms T into a diagonal matrix containing its eigenvalues and returns a matrix Q with the eigenvectors of T. Each eigenvalue of T is also an eigenvalue of the original matrix M. Moreover, if q is the eigenvector of T associated to an eigenvalue λ, then Vq is the eigenvector of M associated to λ. The full eigenvector matrix can thus be obtained as U = VQ, although many applications in spectral analysis only require the Top-K eigenvalues and the eigenvectors of T instead of retrieving U.
III-A. The Lanczos Algorithm
The Lanczos algorithm retrieves the Top-K eigencomponents of a matrix and is often employed as a building block of large-scale eigenproblem algorithms [lehoucq1998arpack, golub1977block, calvetti1994implicitly]. The output tridiagonal matrix T is significantly smaller than the input (K × K instead of n × n) and also simpler in structure, as all elements outside the band enclosing the main diagonal and the diagonals immediately above and below it are zero. Pseudocode of the algorithm is provided in Algorithm 1. In each of the K iterations, the algorithm computes a Lanczos vector by normalizing the vector obtained at the previous iteration (line 6 and Figure 2A). From it, the next vector is obtained by projecting the matrix M onto it (line 7 and Figure 2B), followed by an orthogonalization step (lines 8 to 10 and Figure 2C). The algorithm is highly efficient, as each vector is computed in a single iteration and K ≪ n.
The Lanczos algorithm is particularly efficient on sparse matrices, as its most expensive operation is an iterative SpMV, bounding its computational complexity to O(K · nnz), with nnz being the number of non-zero elements of M. In our hardware design, we optimize the memory-intensive SpMV computation through multiple independent compute units, so that we can take advantage of all 32 HBM channels of a Xilinx Alveo U280 FPGA accelerator card (Section IV-B).
This algorithm is prone to numerical instability, as the Lanczos vectors can quickly lose pairwise orthogonality as the number of iterations grows. To prevent instability, we normalize the input matrix by its Frobenius norm, as eigencomponents are invariant to constant scaling: the values of the matrix then lie in the range [−1, 1], which implies that eigenvalues and eigenvector entries also lie in the range [−1, 1]. This property enables the use of fixed-point arithmetic to improve performance and reduce resource usage (Section V-C). We further improve numerical stability by adopting a version of the algorithm that reorders operations [paige1972computational] and re-orthogonalizes the Lanczos vectors in each iteration [parlett98]. Reorthogonalization (Algorithm 1, line 10) requires additional operations, increasing the overall complexity to O(K · nnz + K²n). We also introduce the option of performing reorthogonalization every 2 iterations, roughly halving this overhead with negligible accuracy loss (Section V-C). In practice, execution time is usually dominated by the SpMV, making reorthogonalization a viable option.
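The iteration structure can be sketched in a few lines of NumPy. This is a textbook-style formulation with optional full reorthogonalization, not the paper's reordered (Paige-style) Algorithm 1; function and variable names are ours:

```python
import numpy as np

def lanczos(M, k, reorth=True):
    # Sketch of the Lanczos iteration: one SpMV per iteration plus a few
    # vector updates, producing a K x K symmetric tridiagonal matrix T
    # and the orthonormal Lanczos vectors.
    n = M.shape[0]
    V = np.zeros((k, n))               # Lanczos vectors, one per row
    alpha = np.zeros(k)                # diagonal of T
    beta = np.zeros(k - 1)             # off-diagonal of T
    v = np.full(n, 1.0 / np.sqrt(n))   # L2-normalized start vector
    v_prev = np.zeros(n)
    b = 0.0
    for j in range(k):
        V[j] = v
        w = M @ v                      # SpMV: the memory-bound bottleneck
        alpha[j] = v @ w
        w = w - alpha[j] * v - b * v_prev
        if reorth:                     # restore pairwise orthogonality
            w -= V[:j + 1].T @ (V[:j + 1] @ w)
        if j < k - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            v_prev, v = v, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return T, V
```

Without the `reorth` correction, the vectors in `V` gradually lose orthogonality in floating-point arithmetic, which is the instability the paper mitigates.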
III-B. The Jacobi Eigenvalue Algorithm
The Jacobi eigenvalue algorithm computes the eigenvalues and eigenvectors of a dense symmetric real matrix. It is an iterative procedure that performs rotations on square submatrices, and each iteration is highly compute-intensive as it contains trigonometric operations. However, the algorithm is particularly well suited to solving eigenproblems on small tridiagonal matrices: as many matrix values are zero and cannot introduce data dependencies between rotations, it is possible to parallelize the entire computation at the hardware level.
The Jacobi eigenvalue algorithm has seen many formulations that improve either its parallelism or its resource utilization. The best-known formulation was proposed by Brent and Luk [brent1984solution] and has been the standard for implementing the algorithm on FPGAs to this day [gupta2019eigen, Burger2020Embedded]. Our design improves this formulation with a more resource-efficient procedure for interchanging rows and columns; its structure is shown in Figure 5.
We employ a systolic array design that maps the input matrix as 2×2 submatrices to adjacent processors, or compute units (Figure 5, A). The systolic architecture propagates the rotation angles (B) and the values stored in each processor (E).
Starting from T, the algorithm sets off-diagonal entries to zero at each iteration by using rotations. Diagonal processors annihilate their off-diagonal components (Algorithm 2, line 7) with a rotation of angle θ. This angle is propagated (line 8) to the off-diagonal processors (line 9) and to the eigenvector processors, which apply the same rotation to the identity matrix (line 14).
To ensure convergence, the diagonal compute units are fed non-zero off-diagonal elements at each iteration. New non-zero elements are provided to the diagonal compute units by swapping rows and columns, since eigencomponents are invariant to such interchanges. We improve the swap procedure of Brent and Luk [brent1984solution] by swapping vectors in reverse order, obtaining the same results with fewer resources (Section IV-C2).
The systolic array formulation allows each iteration of the algorithm to be performed in constant time, making the overall complexity equal to the number of iterations instead of being dominated by the cost of the matrix multiplications [brent1984solution].
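For reference, a sequential sketch of the cyclic Jacobi eigenvalue algorithm is shown below. The systolic-array design performs many such 2×2 rotations concurrently; here they are applied one at a time for clarity, and the function name and sweep count are ours:

```python
import numpy as np

def jacobi_eig(T, sweeps=30):
    # Cyclic Jacobi: repeatedly annihilate each off-diagonal pair
    # (p, q) with a plane rotation; A converges to a diagonal matrix
    # and V accumulates the eigenvectors.
    A = np.array(T, dtype=float)
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                # Angle that annihilates A[p, q] (and A[q, p] by symmetry).
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J            # similarity transform
                V = V @ J
    return np.diag(A), V                   # eigenvalues, eigenvectors
```

Each rotation reduces the off-diagonal Frobenius norm by twice the square of the annihilated entry, which is why the iteration converges; the hardware design exploits the fact that non-overlapping (p, q) pairs can be rotated in parallel.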
IV. The Proposed Hardware Design
This section presents our custom FPGA-based hardware design for the Top-K sparse eigenproblem algorithm introduced above. The logical division between the Lanczos and Jacobi algorithms is preserved in the hardware implementation: our design is composed of two macro-areas mapped to separate reconfigurable Super Logic Regions (SLRs) of the FPGA, providing more efficient resource utilization and higher flexibility in terms of clock frequency, memory interfaces, and reconfigurability. Figure 6 shows a high-level view of our FPGA design. We prototyped our hardware design on an Alveo U280 accelerator card with HBM2 and DDR4 memory. The Lanczos algorithm, being a memory-intensive computation, is mapped to SLR0, which provides direct access to all the HBM2 memory interfaces on the accelerator card. SLR1 and SLR2 host different replicas of the IP core implementing the Jacobi algorithm, optimized for different numbers of eigenvectors K.
IV-A. Lanczos Hardware Design
The left part of Figure 6 highlights the components of the Lanczos algorithm hardware design. Partitions of the sparse input matrix are read from HBM (A) and distributed to 5 parallel SpMV compute units (B) (Algorithm 1, line 7). Partial results from every partition are merged (C) into a single vector to be used by the remaining linear operations (D) (lines 5, 6, 8, 9). These operations are repeated K times to produce the values of the tridiagonal matrix T and the Lanczos vectors, which are stored in DDR memory.
IV-B. SpMV Hardware Design
The biggest bottleneck in the Lanczos algorithm is the iterative SpMV computation (Algorithm 1, line 7). While the other computations in the Lanczos algorithm are relatively straightforward to optimize and parallelize, SpMV is well known for being a complex, memory-intensive computation with indirect and random memory accesses [nguyen2020FPGAPotential]. Although significant research has gone into developing high-performance SpMV implementations on FPGA [10.1145/3352460.3358330, grigoras2015accelerating, umuroglu2015vector, parravicini2021scaling, jain2020domain], the Lanczos algorithm introduces circumstances that prevent us from using an out-of-the-box FPGA SpMV implementation. Our SpMV design must perform multiple iterations without communication from device to host, as data transfers and synchronizations would hinder performance. It must also be easy to partition and replicate, to provide flexibility over the hardware resources. Finally, we require access to multiple HBM channels to maximize the overall memory bandwidth achieved in the computation.
Our final SpMV design extends and improves the one recently proposed by Parravicini et al. [parravicini2020reduced] in the context of graph ranking algorithms, which are also variations of the power iteration method, as is the Lanczos algorithm. Below, we describe how we leverage HBM memory in our SpMV design to provide better scalability and performance.
IV-B1. SpMV Dataflow Architecture
As SpMV is an extremely memory-intensive computation, a good SpMV implementation should make efficient use of the memory bandwidth made available by the underlying hardware. Figure 7 shows the structure of one of our SpMV compute units. We employ a streaming dataflow SpMV design that reads the input sparse matrix stored in the COO (coordinate) format. In the COO format, each non-zero entry of the matrix is stored using three 32-bit values: the row index, the column index, and the value itself. Compared to other sparse matrix data layouts, such as CSR, the COO format does not present indirect data accesses that can severely reduce the opportunities for a pipelined design. The Matrix Fetch Unit in each compute unit is connected to a single HBM channel and reads, in each clock cycle, a packet of 512 bits containing 5 non-zero matrix entries (A). Memory transactions happen in continuous bursts of maximum AXI4 length (256 beats): each compute unit reads the matrix at the maximum bandwidth offered by the HBM channel (14.37 GB/s, for a total of 71.87 GB/s using 5 compute units). For each of the 5 non-zero values in each COO packet, the Dense Vector Fetch Unit performs a random access to the SpMV dense vector (B). This step is critical to the overall SpMV performance: compared to [parravicini2020reduced], we leverage HBM instead of URAM, achieving better scalability and performance. We detail our Dense Vector Memory Subsystem below and in Figure 8. The Aggregation Unit sums results within a single data packet that refer to the same matrix column (C). A Write-Back Finite-State Machine stores the results of each compute unit to HBM (D). Each write transaction is a 512-bit data packet containing up to 15 values, each referring to a single matrix row. Compared to [parravicini2020reduced], we reduce the number of write transactions by a factor of 3 times the average number of non-zeros per row. As such, we can store results through the same HBM channels as the dense vector with no detriment to performance.
Compared to the original SpMV design in [parravicini2020reduced], we support multiple SpMV compute units that operate on partitions of the COO input matrix, created by assigning an equal number of rows to each compute unit. We employ up to 5 SpMV compute units (Figure 6). While in principle it is possible to place more compute units, our current design is limited by the hardened AXI switch in the Alveo U280, which prevents the use of more than 32 AXI master channels to HBM, all of which we employ [alveohbmcontroller]. Each SpMV compute unit computes a portion of the output vector: partial results are aggregated by the Merge Unit (Figure 6, C) and replicated across HBM channels so they can be used in the following iteration.
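The dataflow can be modeled in software as a stream of COO triples, which makes the access pattern explicit: the matrix is read sequentially, while the dense vector is the only randomly accessed structure. This is an illustrative model, not the hardware implementation; the function name is ours:

```python
import numpy as np

def spmv_coo(rows, cols, vals, x, n_rows):
    # Software model of the streaming COO SpMV: (row, col, value)
    # triples are consumed in order, as the Matrix Fetch Unit streams
    # them from HBM; the only random access is the gather x[c], the
    # step served by the replicated Dense Vector Fetch Unit.
    y = np.zeros(n_rows)
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]   # indirect access: the address depends on data
    return y
```

The hardware pipelines 5 of these triples per clock cycle, which is why each compute unit needs 5 concurrent random reads into the dense vector, motivating the replication scheme described next.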
IV-B2. SpMV Dense Vector Memory Subsystem
Each SpMV compute unit processes 5 non-zero matrix entries per clock cycle, and for each non-zero entry it must perform a random access on a dense vector (in our case, the Lanczos vector of the current iteration). As each AXI master channel can handle only one read transaction per cycle, we need to replicate the dense vector 5 times, similarly to [kestur2012towards]. The hardened AXI switch in the Alveo U280 makes it highly inefficient to attach multiple AXI master channels to the same HBM bank: only 32 AXI master channels are available, and small memory transactions (32 bits) have the same performance as larger transactions, preventing sustained bandwidth sharing [choi2020hls, lu2021demystifying, wang2020shuhai]. We solve the issue by leveraging the abundant HBM memory on the Alveo U280 and replicating the dense vector 5 times for each compute unit, as in Figure 8. A more flexible AXI switch would enable multiple 32-bit read transactions on the same HBM channel in a single clock cycle, reducing the demand for data replication. Compared to [parravicini2020reduced], our HBM-based memory subsystem marks a significant improvement, as we avoid using URAM to store the intermediate dense vector and results. Instead of being limited by the FPGA's 90 MB of URAM, we store the dense vector in individual HBM banks with 250 MB of capacity, allowing computations on matrices with up to 62.4 million rows. Moreover, high URAM consumption significantly limits the maximum attainable frequency, while we do not incur this limitation (Table I).
IV-C. Jacobi Systolic Array Design
The Jacobi eigenvalue algorithm is very computationally intensive. Although it processes a small K × K input, unoptimized implementations still require a significant amount of time due to the large number of dense matrix multiplications, and the convergence rate is implementation-dependent. By adopting a systolic-array-based design, we overcome both issues: parallelizing the computation through a systolic array formulation and performing multiple rotations concurrently decreases the number of iterations needed for convergence, while rotations, equivalent to multiplications on 2×2 submatrices, are unrolled and performed in constant time.
Our design for the Jacobi algorithm is optimized to compute up to K eigenvalues. While it can compute a lower number of eigenvalues without reconfiguration, we place in the same FPGA bitstream multiple Jacobi cores optimized for specific values of K (4, 8, 16, etc.). We can configure both SLR1 and SLR2 with Jacobi cores to fully utilize the FPGA resources, opening the door to independent optimization for specific values of K by reconfiguring individual SLRs. The Lanczos core on SLR0 transfers only the values of T to the Jacobi cores on SLR1 and SLR2. We prevent inefficiencies related to inter-SLR communication by moving data through PLRAM, while also avoiding the long read-write latency of DDR and HBM.
In practice, the systolic formulation cannot scale beyond very small matrices due to the large number of resources required for trigonometric operations in each compute unit. While resource utilization has prevented widespread adoption of the Jacobi algorithm for general eigenproblem resolution, it is not a limitation for our use case, as we apply the Jacobi eigenvalue algorithm to small inputs by design.
On CPUs, approaches such as QR factorization are more common [buttari2008parallel], because efficient systolic array formulations of the Jacobi algorithm require full control over cache eviction policies. Moreover, even modern CPUs lack enough floating-point arithmetic units to perform all the operations of an iteration at once: even for a small K, the Jacobi algorithm computes 16 trigonometric operations and hundreds of floating-point multiplications per iteration.
Instead, we leverage the abundant hardware resources of our FPGA platform to perform all these operations concurrently, making it the optimal choice for our Jacobi systolic array design.
IV-C1. Diagonal and Off-Diagonal Compute Units
Diagonal compute units (Algorithm 2, line 4) annihilate the elements immediately outside the diagonal via a matrix rotation. Although the rotation angle is arbitrary, the fastest convergence is achieved by choosing the angle θ that eliminates both off-diagonal components of the 2×2 submatrix (Figure 3(a)). We efficiently compute the components of the rotation matrix via Taylor series expansion: even an order-3 approximation provides excellent accuracy, using significantly fewer DSPs and BRAMs than a CORDIC core. Rotations on the diagonal (Figure 5, A) are performed by parallel cores, propagating rotation values (B) in constant time to the off-diagonal compute units (Algorithm 2, line 9). As each compute unit holds only 4 elements, matrix multiplications are fully unrolled and performed in constant time. Eigenvectors (Algorithm 2, line 14) are computed (D) in parallel with the rotation of the off-diagonal compute units (C), as they only require the rotation values.
IV-C2. Row/Column Interchange
Each compute unit has 8 connections to propagate its input and output values to adjacent processors, in addition to communicating the rotation angle θ. As shown in Figure 5 (E), each inner processor propagates part of its values to the slots of its left neighbor and part to the slots of its right neighbor, while processors on the boundary keep some of their values in their own slots; operations for the column interchange are symmetrical. Since more swaps are performed towards lower indices than towards higher indices, a naive implementation requires temporary vectors to store rows that would be overwritten by the swaps. To avoid wasting resources on these temporary vectors, we execute the swap operations in reverse index order. As row/column swaps do not introduce additional data dependencies, we perform them in a single clock cycle using flip-flops.
V. Experimental Evaluation
Table I: Resource utilization and clock frequency of our hardware design (column labels inferred from the Alveo U280 resource counts).

Component  SLR  LUT  FF  BRAM  URAM  DSP  Clock (MHz)
Lanczos  SLR0  42%  13%  15%  0%  16%  225
Jacobi  SLR1  40%  42%  0%  0%  68%  225
Jacobi  SLR2  15%  17%  0%  0%  34%  225
Available  -  1097419  2180971  1812  960  9020  -
Table II: Sparse matrices (graphs) used in the evaluation, from the SuiteSparse collection (column labels inferred; values in millions).

ID  Matrix  Rows (M)  Non-zeros (M)
WBTA  wiki-Talk  2.39  5.02
WBGO  web-Google  0.91  5.11
WBBE  web-BerkStan  0.69  7.60
FL  Flickr  0.82  9.84
IT  italy_osm  6.69  14.02
PA  patents  3.77  14.97
VL3  venturiLevel3  4.02  16.10
DE  germany_osm  11.54  24.73
ASIA  asia_osm  11.95  25.42
RC  road_central  14.08  33.87
WK  Wikipedia  3.56  45.00
HT  hugetrace-00020  16.00  47.80
WB  wb-edu  9.84  57.15
To prove that our custom FPGA design is suitable for solving large-scale Top-K sparse eigenproblems, we compare it against the popular ARPACK library, measuring execution time, power efficiency, and accuracy. The multi-threaded ARPACK library [lehoucq1998arpack], a Top-K sparse eigensolver based on the Implicitly Restarted Arnoldi Method, runs on two Intel Xeon Gold 6248 CPUs (80 threads in total) with 384 GB of DRAM, using single-precision floating-point arithmetic. Our eigensolver is prototyped on a Xilinx Alveo U280 accelerator card equipped with 8 GB of HBM2 memory, 32 GB of DDR4 memory, and an xcu280-fsvh2892-2L-e FPGA, whose resources are reported in Table I. Results are averaged over 20 runs.
Tests are carried out on a collection of large sparse matrices representing graph topologies, each containing millions of rows and non-zero entries (Table II). All test matrices come from the SuiteSparse collection [davis2011university]. While our evaluation focuses on sparse matrices representing graphs, our Top-K sparse eigenproblem FPGA design is applicable to other domains such as image analysis [perona1998factorization, shi2000normalized, tung2010enabling].
Resource utilization and clock frequency of our design are reported in Table I. The Lanczos and Jacobi algorithms have similar utilization, with around 20% LUT utilization each (50% of the available LUTs in each SLR). Although the systolic array architecture of the Jacobi algorithm processes small inputs, it requires many trigonometric operations and multiplications (16 and 800, respectively) in each iteration. Resource utilization of the Jacobi algorithm scales quadratically with the number of eigenvalues K, while the Lanczos algorithm is unaffected.
V-A. Execution Time
We measure the execution time speedup of the FPGA-based hardware design implementing our Top-K sparse eigenproblem solver against the CPU baseline and report the results in Figure 9. We are always faster than the baseline, with a geometric mean speedup of 6.22x, up to 64x for specific graphs. The speedup is mostly unaffected by K, showing how our design can efficiently compute many eigenvalues at once. Figure 10A shows how the time required by our FPGA design to process a single matrix value is unaffected by the overall graph size, while the CPU behavior is drastically more unpredictable.
We estimate that the Lanczos phase dominates the overall execution time due to the SpMV computations, taking more than 99% of the execution time. However, optimizing the Jacobi algorithm with a systolic array design is still worth the effort compared to running this step on the CPU. Figure 10B shows the speedup of our Jacobi systolic array design compared to an optimized C++ CPU implementation: the execution time on the CPU grows quadratically due to repeated matrix multiplications, becoming a non-negligible part of the execution time for large K.
Our hardware design is synthesized at 225 MHz on the Alveo U280 accelerator card. A clock frequency beyond 225 MHz does not significantly improve performance, as SpMV represents the main computational bottleneck and its performance is bound by HBM bandwidth [lu2021demystifying]. Each SpMV compute unit processes data at the maximum bandwidth offered by the HBM channel from which it reads the matrix (14.37 GB/s, for a total of 71.87 GB/s using 5 compute units).
V-B. Power Efficiency
We measured via an external power meter that our FPGA design consumes about 38W during execution, plus 40W for the host server. The CPU implementation consumes around 300W during execution. Our FPGA design provides 49x higher Performance/Watt ratio (24x if accounting for the FPGA host machine): we provide higher performance without sacrificing power efficiency, making our design suitable for repeated computations typical of data center applications.
V-C. Accuracy Analysis of the Approximate Eigencomputation
The Lanczos algorithm is known to suffer from numerical instability [paige1972computational]. To limit this phenomenon, we reorganize the algorithm's operations as in [paige1972computational] and apply reorthogonalization as in [parlett98]. To assess the stability of our design, we measure the eigenvectors' pairwise orthogonality and the eigenvector error norm. Eigenvectors must form an orthonormal basis and be pairwise orthogonal, i.e. their dot product is 0, or equivalently their angle is 90°. For each pair of eigenvectors, we measure the average angle between them. Then, if λ is an eigenvalue of M and v is its associated eigenvector, it must hold that Mv = λv: by measuring the L2 norm of Mv − λv for all eigenpairs, we evaluate how precise the eigenvector computation is. Results are reported, for increasing K, in Figure 11, aggregated over all graphs due to space constraints. Accuracy is excellent if reorthogonalization is applied every two iterations, but even without this procedure the results are satisfactory. Despite using fixed-point arithmetic in the Lanczos algorithm, the average reconstruction error is minimal and the average angle is close to orthogonal when applying reorthogonalization every two iterations. Orthogonality is not affected by K, while the average reconstruction error improves as K increases. Spectral methods in machine learning applications use eigenvectors to capture the most important features of their input and do not usually require the same degree of precision as other engineering applications. Reorthogonalization adds an overhead to the algorithm compared to Figure 9; on large graphs this overhead is negligible compared to the SpMV, making it a viable option in applications where maximum accuracy is necessary. Still, our hardware design provides excellent accuracy while being significantly faster than a highly optimized CPU implementation.
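The two metrics above can be computed directly from the eigenpairs; the sketch below (function name ours) takes the eigenvalues `lam` and the eigenvectors as columns of `V`:

```python
import numpy as np

def eigen_accuracy(M, lam, V):
    # Accuracy metrics from the evaluation: average pairwise angle
    # between computed eigenvectors (ideally 90 degrees) and the
    # residual norm ||M v - lam v|| for every eigenpair.
    K = len(lam)
    G = V.T @ V                                    # pairwise dot products
    cosines = np.clip(np.abs(G[np.triu_indices(K, 1)]), 0.0, 1.0)
    avg_angle = np.degrees(np.mean(np.arccos(cosines)))
    residuals = np.linalg.norm(M @ V - V * lam, axis=0)
    return avg_angle, residuals
```

For an exact eigendecomposition the average angle is 90 degrees and every residual is zero; mixed-precision arithmetic moves both metrics away from these ideals, which is what Figure 11 quantifies.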
VI. Conclusion
The computation of the Top-K eigenvalues and eigenvectors of large graphs represented as sparse matrices is critical in spectral methods, a class of powerful Machine Learning algorithms that can extract useful features from graphs. We solve the Top-K sparse eigenproblem with a new algorithm optimized for reconfigurable hardware designs: in the first part of the computation, we exploit the enormous bandwidth of HBM through the Lanczos algorithm, while in the second part, we introduce a systolic array architecture that efficiently parallelizes the compute-intensive Jacobi eigenvalue algorithm. Compared to the popular ARPACK CPU library, we achieve a geometric mean speedup of 6.22x on 13 graphs with millions of vertices, raising the bar for high-performance Top-K sparse eigensolvers at large scale.
As future work, we will extend our hardware design to support non-Hermitian matrices through the Implicitly Restarted Arnoldi Method. We will also investigate heterogeneous implementations that combine the abundant memory bandwidth of GPUs, for high-performance SpMV, with our systolic array FPGA design for the Jacobi eigenvalue algorithm.