Solving Large Top-K Graph Eigenproblems with a Memory and Compute-optimized FPGA Design

03/18/2021
by Francesco Sgherzi, et al. (Politecnico di Milano)

Large-scale eigenvalue computations on sparse matrices are a key component of graph analytics techniques based on spectral methods. In such applications, an exhaustive computation of all eigenvalues and eigenvectors is impractical and unnecessary, as spectral methods can retrieve the relevant properties of enormous graphs using just the eigenvectors associated with the Top-K largest eigenvalues. In this work, we propose a hardware-optimized algorithm to approximate a solution to the Top-K eigenproblem on sparse matrices representing large graph topologies. We prototype our algorithm through a custom FPGA hardware design that exploits HBM, Systolic Architectures, and mixed-precision arithmetic. We achieve a speedup of 6.22x compared to the highly optimized ARPACK library running on an 80-thread CPU, while keeping high accuracy and 49x better power efficiency.


I Introduction

Research in information retrieval and recommender systems has sparked renewed interest in spectral methods [zhou2019SpectralRecommender], a class of Machine Learning algorithms able to detect communities in large social and e-commerce graphs and to compute the similarity of graph elements such as users or products [tang2020spectral]. At the core of many spectral methods lies the Top-K eigenproblem for large-scale sparse matrices, i.e. the computation of the eigenvectors associated with the K largest eigenvalues (in modulus) of a matrix that stores only non-zero entries (Figure 1). For example, the well-known Spectral Clustering algorithm boils down to computing the largest eigenvalues of a sparse matrix representing the graph topology [ng2002spectral]. Despite the rise of theoretical interest in spectral methods, little research has focused on improving the performance and scalability of Top-K sparse eigenproblem solvers to make them applicable to large-scale graphs.

Most existing high-performance implementations of eigenproblem algorithms operate on dense matrices and are completely unable to process matrices with millions of rows and columns (each encoding, for example, a user's friends in a social network graph) [Myllykoski2020GPULIBQR]. Even the highly optimized multi-core implementation of LAPACK requires more than 3 minutes on a Xeon 6248 to solve the full eigenproblem on a small graph, as the eigenproblem complexity scales at least quadratically with the number of vertices in the graph. Many implementations that support sparse matrices, on the other hand, are either forced to compute all the eigenvalues or require an expensive matrix inversion before solving the eigenproblem [knyazev2001toward].

The need for high-performance Top-K sparse eigenproblem algorithms goes hand in hand with custom hardware designs that can outperform traditional architectures in raw performance and power efficiency, given that applications built on top of the Top-K eigenproblem are mostly encountered in data centers. In this work, we tackle both problems by presenting a new Top-K sparse eigensolver whose building blocks are specifically optimized for high-performance hardware designs.


Fig. 1: Top-K eigencomputation of a graph represented as a sparse matrix. While the graph can have millions of vertices, we often need just the Top-K eigenvectors.

We introduce a novel algorithm to address the Top-K sparse eigenproblem and prototype it through a custom hardware design on an FPGA accelerator card; to the best of our knowledge, this is the first FPGA-based Top-K sparse eigensolver. Our algorithm is a two-step procedure that combines the Lanczos algorithm (to reduce the problem size) [lanczos1950iteration] with the Jacobi algorithm (to compute the final eigencomponents) [Rutishauser1966jacobi], as shown in Figure 2. The Lanczos algorithm, often encountered in Top-K sparse eigensolvers [nmeth2020scipy], has never been combined with the Jacobi algorithm. Part of the reason lies in their different computational bottlenecks: the Lanczos algorithm demands large memory bandwidth, while the Jacobi algorithm is strongly compute-bound. Our approach exploits the strengths of FPGA accelerator cards and overcomes the limitations of traditional architectures in this class of algorithms.

First, the Lanczos algorithm has a sparse matrix-vector multiplication (SpMV) as its main bottleneck, an extremely memory-intensive computation with indirect and fully random memory accesses (Figure 2B). Optimizing SpMV requires high peak memory bandwidth and fine-grained control over memory accesses, without passing through the traditional caching policies of general-purpose architectures. Our hardware design features an iterative dataflow SpMV with multiple compute units (CUs), leveraging all the available HBM channels through a custom memory subsystem that efficiently handles indirect memory accesses.

Then, we introduce a Systolic Array (SA) design for the Jacobi eigenvalue algorithm, a computationally intensive operation that works on reduced-size K×K inputs (Figure 2D). The Jacobi algorithm maps naturally to a Systolic Array that ensures fast convergence, while traditional architectures cannot provide the same degree of performance. CPUs cannot guarantee that all the data are kept in the L1 cache and are unlikely to have enough floating-point arithmetic units to fully parallelize the computation, resulting in higher computational complexity and execution times more than 50 times higher than on FPGA (Section V). GPUs, instead, cannot fill all their Stream Processors, as the input size is much smaller than what is required to fully utilize the GPU parallelism [cosnau2014gpueigen].

Moreover, our FPGA-based hardware design employs highly optimized mixed-precision arithmetic, partially replacing traditional floating-point computations with faster fixed-point arithmetic. While high numerical accuracy is usually demanded by eigenproblem algorithms, we employ fixed-point arithmetic in parts of the design that are not critical to the overall accuracy and resort to floating-point arithmetic when required to guarantee precise results.

In summary, we present the following contributions:

  • A novel algorithm for the approximate solution of large-scale Top-K sparse eigenproblems (Section III), optimized for custom hardware designs.

  • A modular mixed-precision FPGA design of our algorithm that efficiently exploits the available programmable logic and the bandwidth of DDR and HBM memory (Section IV).

  • A performance evaluation of our Top-K eigendecomposition algorithm against the multi-core ARPACK CPU library, showing a speedup of 6.22x and a power efficiency gain of 49x, while the reconstruction error introduced by mixed-precision arithmetic remains minimal (Section V).

II Related Work

To the best of our knowledge, no prior work optimizes the Top-K sparse eigenproblem with custom FPGA hardware designs.

The most well-known large-scale Top-K sparse eigenproblem solver on CPU is the ARPACK library [lehoucq1998arpack], a multi-core Fortran library that is also available in SciPy and MATLAB through thin software wrappers. ARPACK implements the Implicitly Restarted Arnoldi Method (IRAM), a variation of the Lanczos algorithm that supports non-Hermitian matrices. Other sparse eigensolvers provide techniques optimized for specific domains or matrix types, although none is as common as ARPACK [aktulga2012topology, hernandez2009survey, maschhoff1996p_arpack, lee2018solution].

On GPUs, the cuSOLVER library [cusolver] by Nvidia provides a simple eigensolver based on the shift-inverse method that retrieves only the largest eigenvalue and its eigenvector (i.e. K = 1), which is significantly more limited than the general Top-K eigenproblem. The nvGRAPH library [nvgraph], also developed by Nvidia, provides an implementation of spectral clustering at whose core lies the Lanczos algorithm; however, the implementation of the inner Lanczos algorithm is not publicly available. To the best of our knowledge, there is no publicly available GPU implementation of the Lanczos algorithm that can solve the large-scale sparse eigenproblems required by spectral methods. The MAGMA library [magma] solves the Top-K sparse eigenproblem through the alternative LOBPCG algorithm [knyazev2001toward], which, contrary to the Lanczos algorithm, requires multiple iterations (each containing at least one SpMV) to compute even a single eigenvector. Other GPU-based Top-K eigensolvers are domain-specific, do not support large-scale inputs, or do not leverage features of modern GPUs such as HBM memory or mixed-precision arithmetic [evstigneev2017implementation, dubois2011accelerating]. Eigensolvers for dense matrices are more common on GPUs, as they easily exploit the enormous memory bandwidth of these architectures: Myllykoski et al. [Myllykoski2020GPULIBQR] focus on accelerating the case of dense high-dimensional matrices, while Cosnuau [cosnau2014gpueigen] operates on multiple small input matrices. Clearly, none of the techniques that operate on dense matrices can easily scale to matrices with millions of rows, as simply storing them requires terabytes of memory.

Specialized hardware designs for eigensolvers are limited to resolving the full eigenproblem on small dense matrices, through the QR-Householder decomposition and the Jacobi eigenvalue algorithm. Most formulations of the Jacobi algorithm [gupta2019eigen, Burger2020Embedded] leverage Systolic Arrays, a major building block of high-performance domain-specific architectures from their inception [kung1982systolic] to more recent results [jouppi2017datacenter, langhammer2020high, asgari2020proposing]. However, hardware designs of the Jacobi algorithm based on Systolic Arrays cannot scale to large matrices, as resource utilization scales linearly with the size of the matrix. Implementations of the QR-Householder algorithm face similar problems [vazquez2020fpga, aslan2012qr], as they also leverage systolic architectures, although research on resource-efficient designs does exist [Desai2017QRHLS].

III Solving the Top-K Sparse Eigenproblem


Fig. 2: Steps of our novel Top-K sparse eigenproblem solver, which combines the Lanczos algorithm with a Systolic Array formulation for the Jacobi eigenvalue algorithm.

Algorithms like Spectral Clustering contain as their core step a Top-K sparse eigenproblem, i.e. finding the K largest eigenvalues and the corresponding eigenvectors of sparse matrices representing, for instance, graph topologies with millions of vertices and edges.

1:  function Lanczos(M, K, x_1)
2:      β_1 ← 0, x_0 ← 0                               ▷ Initialization
3:      for i in [1, K] do                             ▷ Main Lanczos loop
4:          if i > 1 then
5:              β_i ← ||w_{i-1}||_2
6:              x_i ← w_{i-1} / β_i                    ▷ Compute new Lanczos vector
7:          v_i ← M · x_i                              ▷ Sparse matrix-vector multiplication
8:          α_i ← x_i · v_i
9:          w_i ← v_i − α_i x_i − β_i x_{i-1}
10:         Orthogonalize w_i with respect to {x_1, …, x_i}
11:     T ← tridiag(α, β), V ← [x_1, …, x_K]           ▷ Tridiagonal matrix and Lanczos vectors
12:     Output T
13:     Output V
Algorithm 1 Lanczos algorithm for the Top-K eigenvectors

Fig. 3: Example of a K×K tridiagonal matrix T, obtained as output of the Lanczos algorithm.

Given a sparse square matrix M ∈ R^{n×n} and an integer K, the goal of the Top-K sparse eigenproblem is to find the K eigenvalues with the highest magnitude and their associated eigenvectors. This is equivalent to computing the approximate decomposition M ≈ V Λ V^T, with V ∈ R^{n×K} and Λ ∈ R^{K×K}. The columns of V contain the Top-K eigenvectors, while Λ is a diagonal matrix containing the Top-K eigenvalues. Indeed, computing all the eigenvalues of the matrix is intractable for large matrices and redundant for the many applications that require only a handful of eigencomponents. For example, Spectral Clustering and many of its variations rely only on the Top-K eigenvectors, with K much smaller than n.
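For reference, the same problem statement can be written with SciPy's ARPACK-based sparse eigensolver (the CPU baseline of Section V is the same underlying library); the matrix size, density, and K below are illustrative, not taken from the evaluation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Small random symmetric sparse matrix as a stand-in for a graph adjacency matrix.
A = sp.random(10_000, 10_000, density=1e-3, format="csr", random_state=42)
M = (A + A.T) * 0.5                      # symmetrize

K = 8
# Top-K eigenpairs by largest magnitude ('LM'), without computing the full spectrum.
eigenvalues, eigenvectors = eigsh(M, k=K, which="LM")
print(eigenvalues.shape, eigenvectors.shape)   # (8,), (10000, 8)
```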

In this work, we propose a novel algorithm to solve the Top-K sparse eigenproblem by combining the Lanczos algorithm and the Jacobi eigenvalue algorithm. Our technique is particularly suited for highly optimized and modular hardware designs. The first phase leverages the Lanczos algorithm, taking as input the original matrix M, the number of desired eigencomponents K, and an L2-normalized random vector x_1 whose entries are all equal to 1/√n. The Lanczos algorithm outputs a symmetric tridiagonal matrix T (Figure 3) and a set of K orthogonal Lanczos vectors, stored as the columns of V. As a second step, we apply the Jacobi eigenvalue algorithm to T. This algorithm transforms T into a diagonal matrix containing its eigenvalues and returns a matrix E containing the eigenvectors of T. Each eigenvalue of T is also an eigenvalue of the original matrix M. Moreover, if e is the eigenvector of T associated with eigenvalue λ, then V·e is the eigenvector of M associated with λ. The Top-K eigenvectors of M can therefore be obtained as V·E, although many applications in spectral analysis only require the Top-K eigenvalues and the eigenvectors of T instead of retrieving those of M.
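The relationship between the two steps can be checked numerically. The sketch below uses illustrative sizes, a random tridiagonal T, a random orthonormal V, and NumPy's dense `eigh` as a stand-in for the Jacobi step; it verifies that an eigenpair (λ, e) of T yields the eigenpair (λ, V·e) of a matrix of the form V T Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 8

# Random orthonormal "Lanczos vectors" V (n x K) and symmetric tridiagonal T (K x K).
V, _ = np.linalg.qr(rng.standard_normal((n, K)))
alpha, beta = rng.standard_normal(K), rng.standard_normal(K - 1)
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

M = V @ T @ V.T                      # M shares the eigenvalues of T (plus zeros)
lam, E = np.linalg.eigh(T)           # stand-in for the Jacobi eigenvalue step
U = V @ E                            # approximate eigenvectors of M

for k in range(K):
    assert np.allclose(M @ U[:, k], lam[k] * U[:, k], atol=1e-8)
```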

III-A The Lanczos Algorithm

The Lanczos algorithm retrieves the Top-K eigencomponents of a matrix and is often employed as a building block of large-scale eigenproblem algorithms [lehoucq1998arpack, golub1977block, calvetti1994implicitly]. The output tridiagonal matrix T is significantly smaller than the input (K×K instead of n×n) and also simpler in structure, as all elements outside of the main diagonal and the two diagonals immediately above and below it are zero. Pseudo-code of the algorithm is provided in Algorithm 1. In each of the K iterations, it computes a Lanczos vector x_i by normalizing w_{i-1}, obtained at the previous iteration (line 6 and Figure 2A). From x_i, we obtain w_i by projecting the matrix M onto x_i (line 7 and Figure 2B), followed by an orthogonalization (lines 8 to 10 and Figure 2C). The algorithm is highly efficient, as each Lanczos vector is computed in a single iteration.
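A minimal NumPy sketch of Algorithm 1 is shown below, assuming a dense in-memory matrix and full reorthogonalization at every iteration; in the hardware design the product M·x_i is the streaming SpMV of Section IV-B, so this is only a software reference model with illustrative sizes.

```python
import numpy as np

def lanczos(M, K, x1):
    """Dense reference model of Algorithm 1: returns the K x K tridiagonal T
    and the n x K matrix V of Lanczos vectors."""
    n = M.shape[0]
    V = np.zeros((n, K))
    alpha, beta = np.zeros(K), np.zeros(K)
    V[:, 0] = x1 / np.linalg.norm(x1)
    w = None
    for i in range(K):
        if i > 0:
            beta[i] = np.linalg.norm(w)
            V[:, i] = w / beta[i]                     # new Lanczos vector
        v = M @ V[:, i]                               # SpMV in the hardware design
        alpha[i] = V[:, i] @ v
        w = v - alpha[i] * V[:, i]
        if i > 0:
            w -= beta[i] * V[:, i - 1]
        # Reorthogonalize against all previous Lanczos vectors (Algorithm 1, line 10).
        w -= V[:, : i + 1] @ (V[:, : i + 1].T @ w)
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    return T, V

# Usage on a random symmetric matrix (sizes are illustrative).
rng = np.random.default_rng(0)
n, K = 1_000, 8
M = rng.standard_normal((n, n)); M = (M + M.T) / 2
T, V = lanczos(M, K, np.full(n, 1.0 / np.sqrt(n)))
```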

1:  function Jacobi(T, K)
2:      E ← I_K                                        ▷ Identity matrix of size K
3:      repeat
4:          for i in [1, K/2] do                       ▷ Diagonal CUs
5:              read the local 2×2 block of T
6:              compute the rotation angle θ_i
7:              Rotate                                 ▷ Full equation in Figure 4a
8:              Propagate θ_i to row i and column i
9:          for (i, j) in [1, K/2]², i ≠ j do          ▷ Offdiagonal CUs
10:             read the local 2×2 block of T
11:             Receive θ_i, θ_j from the diagonal CUs
12:             build the local rotation
13:             Rotate                                 ▷ Full equation in Figure 4b
14:         for i in [1, K/2] do                       ▷ Eigenvector CUs
15:             for j in [1, K/2] do
16:                 read the local 2×2 block of E
17:                 Receive θ_j from the diagonal CUs
18:                 Rotate                             ▷ Full equation in Figure 4c
19:         Permute rows and columns of T and E        ▷ Figure 5
20:     until T becomes diagonal
21:     Output diag(T)                                 ▷ Eigenvalues of the input
22:     Output E                                       ▷ Eigenvectors of the input
Algorithm 2 Jacobi eigenvalue algorithm with Systolic Arrays

(a) Operations for the Diagonal Processor (Figure 5A).

(b) Operations for the Offdiagonal Processor (Figure 5C).

(c) Operations for the Eigenvector Processor (Figure 5D).
Fig. 4: Operations performed by the different processors in the Jacobi eigenvalue Systolic Array architecture.

The Lanczos algorithm is particularly efficient on sparse matrices, as its most expensive operation is an iterative SpMV, bounding its computational complexity to O(K·nnz), with nnz being the number of non-zero elements of M. In our hardware design, we optimize the memory-intensive SpMV computation through multiple independent CUs, so that we can take advantage of all 32 available HBM channels of a Xilinx Alveo U280 FPGA accelerator card (Section IV-B).

This algorithm is prone to numerical instability, as the Lanczos vectors can quickly lose pairwise orthogonality if n is very large. To prevent instability, we normalize the input matrix by its Frobenius norm, as eigenvectors are invariant to constant scaling: all values of the normalized matrix lie in the range [−1, 1], which implies that eigenvalues and eigenvectors also lie in [−1, 1]. This property enables the use of fixed-point arithmetic to improve performance and reduce resource usage (Section V-C). We further improve numerical stability by adopting a version of the algorithm that reorders operations [paige1972computational] and reorthogonalizes the Lanczos vectors in each iteration [parlett98]. Reorthogonalization (Algorithm 1, line 10) adds O(n·K) operations per iteration, increasing the overall complexity to O(K·nnz + n·K²). We also introduce the option of performing reorthogonalization every 2 iterations, halving this overhead with negligible accuracy loss (Section V-C). In practice, execution time is usually dominated by the SpMV, making reorthogonalization a viable option.
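A sketch of the two stabilization measures, applied on top of the reference model above; the normalization constant and the period of two iterations follow the text, while the matrix below and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((500, 500))
M = (M + M.T) / 2

# Frobenius normalization: eigenvectors are unchanged and eigenvalues are only
# rescaled by a known constant, while all matrix entries (and the eigenvalues of
# the normalized matrix) fall into [-1, 1], a fixed-point friendly range.
M_norm = M / np.linalg.norm(M, "fro")

# Inside the Lanczos loop of the earlier sketch, reorthogonalization can be
# applied only on every other iteration to halve its overhead:
#
#     if i % 2 == 1:
#         w -= V[:, : i + 1] @ (V[:, : i + 1].T @ w)
```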

III-B The Jacobi Eigenvalue Algorithm

The Jacobi eigenvalue algorithm computes the eigenvalues and eigenvectors of a dense symmetric real matrix. It is an iterative procedure that performs rotations on square submatrices. Each iteration is highly compute-intensive, as it contains trigonometric operations. However, this algorithm is particularly well suited to solving eigenproblems on small tridiagonal matrices: since many matrix values are zero and cannot introduce data dependencies in the rotations, it is possible to parallelize the entire computation at the hardware level.
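For reference, the following is a sequential NumPy sketch of the classical cyclic Jacobi iteration that the Systolic Array parallelizes; it is a software model of the underlying numerics, not the Brent-Luk hardware formulation with its 2×2 processor blocks, and the function name, sweep count, and tolerance are illustrative.

```python
import numpy as np

def jacobi_eigen(T, sweeps=20, tol=1e-12):
    """Cyclic Jacobi eigenvalue iteration on a small symmetric matrix T.
    Returns (eigenvalues, eigenvectors)."""
    A = T.astype(float).copy()
    n = A.shape[0]
    E = np.eye(n)                                     # accumulates the eigenvectors
    for _ in range(sweeps):
        off = np.linalg.norm(A - np.diag(np.diag(A))) # remaining off-diagonal mass
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < tol:
                    continue
                # Rotation angle that annihilates A[p, q] (and A[q, p]).
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                R = np.eye(n)
                R[p, p] = R[q, q] = c
                R[p, q], R[q, p] = s, -s
                A = R.T @ A @ R                       # similarity transform
                E = E @ R                             # same rotation on the eigenvectors
    return np.diag(A), E

# Usage on a small symmetric tridiagonal matrix, as produced by the Lanczos step.
rng = np.random.default_rng(0)
K = 8
T = np.diag(rng.standard_normal(K)) + np.diag(rng.standard_normal(K - 1), 1)
T = T + np.triu(T, 1).T
vals, vecs = jacobi_eigen(T)
```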


Fig. 5: Steps of the Jacobi eigenvalue computation using Systolic Arrays. Each Processing Element (PE) holds 4 adjacent values of the matrix.

The Jacobi eigenvalue algorithm has seen many formulations aimed at improving either its parallelism or its resource utilization. The best-known formulation was proposed by Brent and Luk [brent1984solution] and has been the standard for implementing the algorithm on FPGAs to this day [gupta2019eigen, Burger2020Embedded]. Our design improves this formulation with a more resource-efficient procedure for interchanging rows and columns; its structure is shown in Figure 5.

We employ a Systolic Array design that maps 2×2 submatrices of the input matrix to adjacent processors, or CUs (Figure 5A). The systolic architecture propagates the rotation angles (B) and the values stored in each processor (E).

Starting from T, the algorithm sets off-diagonal entries to zero at each iteration by using rotations. Diagonal processors annihilate the off-diagonal components of their 2×2 block (Algorithm 2, line 7) with a rotation of angle θ. This angle is propagated (line 8) to the off-diagonal processors (line 9) and to the eigenvector processors, which apply the same rotation to the identity matrix (line 14).
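As a concrete check of the annihilation step, the small sketch below zeroes the off-diagonal pair of a 2×2 symmetric block; the values and the atan2-based angle formula follow one standard convention and are illustrative, not necessarily the exact formulation used in the CUs.

```python
import numpy as np

# 2x2 symmetric block handled by one diagonal processor (illustrative values).
a, b, d = 0.8, 0.35, -0.2
A = np.array([[a, b],
              [b, d]])

theta = 0.5 * np.arctan2(2 * b, d - a)     # angle that annihilates the off-diagonal pair
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, s],
              [-s, c]])

A_rot = R.T @ A @ R
assert abs(A_rot[0, 1]) < 1e-12 and abs(A_rot[1, 0]) < 1e-12
print(np.round(A_rot, 6))                  # diagonal: the two local eigenvalues
```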

To ensure convergence, the diagonal CUs must be fed non-zero elements at each iteration in the off-diagonal positions of their blocks. New non-zero elements are provided to the diagonal CUs by swapping rows and columns, since eigencomponents are invariant under simultaneous row and column permutations. We improve the swap procedure of Brent and Luk [brent1984solution] by swapping vectors in reverse order, obtaining the same result with fewer resources (Section IV-C2).

The Systolic Array formulation allows performing each iteration of the algorithm in constant time, so that the complexity equals the number of iterations required for convergence, instead of the higher cost incurred by explicit matrix multiplications [brent1984solution].

IV The Proposed Hardware Design

This section presents our custom FPGA-based hardware design for the Top-K sparse eigenproblem algorithm introduced above. The logical division between the Lanczos and Jacobi algorithms is also present in the hardware implementation. Our hardware design is composed of two macro-areas mapped to separate reconfigurable Super Logic Regions (SLRs) of the FPGA, providing more efficient resource utilization and higher flexibility in terms of clock frequency, memory interfaces, and reconfigurability. Figure 6 shows a high-level view of our FPGA design. We prototyped our hardware design on an Alveo U280 accelerator card with HBM2 and DDR4 memory. The Lanczos algorithm, being a memory-intensive computation, is mapped to SLR0, which provides direct access to all the HBM2 memory interfaces of the accelerator card. SLR1 and SLR2 host different replicas of the IP core implementing the Jacobi algorithm, optimized for different numbers of eigenvectors K.


Fig. 6: High-level architecture of our Top-K Sparse Eigencomputation FPGA design. We highlight interconnections between FPGA computational units and the FPGA board memory.

IV-A Lanczos Hardware Design

The left part of Figure 6 highlights the components of the Lanczos algorithm hardware design. Partitions of the sparse input matrix are read from HBM (A) and distributed to 5 parallel SpMV CUs (B) (Algorithm 1, line 7). Partial results from every partition are merged (C) into a single vector used by the remaining linear operations (D) (lines 5, 6, 8, 9). These operations are repeated K times to produce the values of the tridiagonal matrix T and the Lanczos vectors in V, stored in DDR memory.

IV-B SpMV Hardware Design

The biggest bottleneck in the Lanczos algorithm is the iterative SpMV computation (Algorithm 1, line 7). While other computations in the Lanczos algorithm are relatively straightforward to optimize and parallelize, SpMV is well known for being a complex, memory-intensive computation with indirect and random memory accesses [nguyen2020FPGAPotential]. Although significant research has gone into developing high-performance SpMV implementations on FPGA [10.1145/3352460.3358330, grigoras2015accelerating, umuroglu2015vector, parravicini2021scaling, jain2020domain], the Lanczos algorithm introduces requirements that prevent us from using an out-of-the-box FPGA SpMV implementation. Our SpMV design must perform multiple iterations without communication from device to host, as data transfers and synchronizations would hinder performance. Moreover, the SpMV must be easy to partition and replicate to provide flexibility over the hardware resources. Finally, we require access to multiple HBM channels to maximize the overall memory bandwidth achieved in the computation.

Our final SpMV design extends and improves the one recently proposed by Parravicini et al. [parravicini2020reduced] in the context of graph ranking algorithms, which are also variations of the power iteration method, as is the Lanczos algorithm. Below, we describe how we leverage HBM memory in our SpMV design to provide better scalability and performance.


Fig. 7: Architecture of one iterative SpMV CU. Each CU processes a portion of the input matrix through a 4-stage dataflow design, and results are replicated on the dense vector memory subsystem after each iteration.

IV-B1 SpMV Dataflow Architecture

As SpMV is an extremely memory-intensive computation, a good SpMV implementation should make efficient use of the memory bandwidth made available by the underlying hardware. Figure 7 shows the structure of one of our SpMV CUs. We employ a streaming dataflow SpMV design that reads the input sparse matrix stored in the COO format. In the COO format, each non-zero entry of the matrix is stored using three 32-bit values: the row index, the column index, and the value itself. Compared to other sparse matrix data layouts, such as CSR, the COO format does not present indirect data accesses that would severely reduce the opportunities for a pipelined design. The Matrix Fetch Unit in each CU is connected to a single HBM channel and reads, in each clock cycle, a 512-bit packet containing 5 non-zero matrix entries (A). Memory transactions happen in continuous bursts of maximum AXI4 length (256 beats): each CU reads the matrix at the maximum bandwidth offered by its HBM channel (14.37 GB/s, for a total of 71.87 GB/s using 5 CUs). For each of the 5 non-zero values in a COO packet, the Dense Vector Fetch Unit performs a random access into the SpMV dense vector (B). This step is critical to the overall SpMV performance: compared to [parravicini2020reduced], we leverage HBM instead of URAM, achieving better scalability and performance. We detail our Dense Vector Memory Subsystem below and in Figure 8. The Aggregation Unit sums the results within a single data packet that refer to the same matrix column (C). A Write-Back Finite-State Machine stores the results of each CU to HBM (D). Each write transaction is a 512-bit data packet containing up to 15 values, each referring to a single matrix row. Compared to [parravicini2020reduced], we reduce the number of write transactions by a factor equal to 3 times the average number of non-zeros per row. As such, we can store results through the same HBM channels as the dense vector with no detriment to performance.
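The following is a behavioral Python model of one CU's dataflow; the 5-entry packet width mirrors the 512-bit HBM word described above, while everything else (the function name, the simplification of aggregation to row-wise accumulation) is illustrative rather than a description of the HLS code.

```python
import numpy as np

def coo_spmv_packets(rows, cols, vals, x, packet_size=5):
    """Behavioral model of one SpMV compute unit: the COO stream is consumed
    in packets of `packet_size` non-zeros (5 entries of 3 x 32 bits fit in a
    512-bit HBM word), with one random access into the dense vector per entry."""
    y = np.zeros(x.shape[0], dtype=x.dtype)
    nnz = len(vals)
    for start in range(0, nnz, packet_size):           # one packet per "cycle"
        end = min(start + packet_size, nnz)
        for k in range(start, end):
            # Dense Vector Fetch Unit: random access into x, one per non-zero.
            prod = vals[k] * x[cols[k]]
            # Aggregation + write-back: accumulate into the output entry.
            y[rows[k]] += prod
    return y

# Usage with a tiny random COO matrix (sizes illustrative).
rng = np.random.default_rng(0)
n, nnz = 1_000, 5_000
rows = rng.integers(0, n, nnz)
cols = rng.integers(0, n, nnz)
vals = rng.standard_normal(nnz)
x = rng.standard_normal(n)
y = coo_spmv_packets(rows, cols, vals, x)
```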

Compared to the original SpMV design in [parravicini2020reduced], we support multiple SpMV CUs that operate on partitions of the COO input matrix, created by assigning an equal number of rows to each CU. We employ up to 5 SpMV CUs (Figure 6). While in principle it is possible to place more CUs, our current design is limited by the hardened AXI switch of the Alveo U280, which prevents the use of more than 32 AXI master channels to HBM, all of which we already employ [alveohbmcontroller]. Each SpMV CU computes a portion of the output vector: partial results are aggregated by the Merge Unit (Figure 6 C) and replicated across HBM channels so they can be used in the following iteration.
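A sketch of the row-based partitioning and of the merge of partial results is shown below; the 5-way split mirrors the number of CUs in the design, and the helper name is hypothetical.

```python
import numpy as np

def partition_coo_by_rows(rows, cols, vals, n_rows, n_cu=5):
    """Split a COO matrix into n_cu partitions with (roughly) equal row ranges,
    mirroring how each compute unit receives a contiguous block of rows."""
    bounds = np.linspace(0, n_rows, n_cu + 1, dtype=int)
    parts = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mask = (rows >= lo) & (rows < hi)
        parts.append((rows[mask], cols[mask], vals[mask]))
    return parts

# Each partition produces part of the output vector; the Merge Unit then
# combines the partial results into a single dense vector.
rng = np.random.default_rng(0)
n, nnz = 1_000, 5_000
rows, cols = rng.integers(0, n, nnz), rng.integers(0, n, nnz)
vals, x = rng.standard_normal(nnz), rng.standard_normal(n)

y = np.zeros(n)
for r, c, v in partition_coo_by_rows(rows, cols, vals, n):
    np.add.at(y, r, v * x[c])          # per-CU partial SpMV, merged into y
```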


Fig. 8: Dense vector memory subsystem of our SpMV FPGA design. Each index accesses its own replica of the dense vector, guaranteeing a pipelined design with 5 random vector accesses per clock cycle.

IV-B2 SpMV Dense Vector Memory Subsystem

Each SpMV CU processes 5 non-zero matrix entries per clock cycle, and for each non-zero entry it must perform a random access into a dense vector of size n (in our case, the Lanczos vector of the current iteration). As each AXI master channel can handle only one read transaction per cycle, we need to replicate the dense vector 5 times, similarly to [kestur2012towards]. The hardened AXI switch in the Alveo U280 makes it highly inefficient to attach multiple AXI master channels to the same HBM bank: only 32 AXI master channels are available, and small memory transactions (32 bits) have the same performance as larger transactions, preventing sustained bandwidth sharing [choi2020hls, lu2021demystifying, wang2020shuhai]. We solve the issue by leveraging the abundant HBM memory on the Alveo U280 and replicating the dense vector 5 times for each CU, as in Figure 8. A more flexible AXI switch would enable multiple 32-bit read transactions on the same HBM channel in a single clock cycle, reducing the need for data replication. Compared to [parravicini2020reduced], our HBM-based memory subsystem marks a significant improvement, as we avoid using URAM to store the intermediate dense vector and the results. Instead of being limited by the FPGA's 90 MB of URAM, we store the dense vector in individual HBM banks with 250 MB of capacity, allowing computations on matrices with up to 62.4 million rows. Moreover, high URAM consumption significantly limits the maximum attainable frequency, a limitation we do not incur (Table I).

IV-C Jacobi Systolic Array Design

The Jacobi eigenvalue algorithm is very computationally intensive. Although it processes a small K×K input, unoptimized implementations still require a significant amount of time due to the large number of dense matrix multiplications. Moreover, its number of iterations to convergence is implementation-dependent and can be high. By adopting a Systolic Array based design, we overcome both issues: by parallelizing the computation through the Systolic Array formulation and performing multiple rotations concurrently, we decrease the number of iterations required for convergence. Rotations, equivalent to multiplications on 2×2 submatrices, are unrolled and performed in constant time.

Our design for the Jacobi algorithm is optimized to compute up to K eigenvalues. While it can compute a smaller number of eigenvalues without reconfiguration, we place in the same FPGA bitstream multiple Jacobi cores optimized for specific values of K (4, 8, 16, etc.). We can configure both SLR1 and SLR2 with Jacobi cores to fully utilize the FPGA resources, opening the door to independent optimization for specific values of K by reconfiguring individual SLRs. The Lanczos core on SLR0 transfers only the values of the tridiagonal matrix T to the Jacobi cores on SLR1 and SLR2. We prevent inefficiencies related to inter-SLR communication by moving data through PLRAM, while also avoiding the long read-write latency of DDR and HBM.

In practice, the systolic formulation cannot scale beyond very small matrices due to the large number of resources required for the trigonometric operations in each CU. While resource utilization has prevented the widespread adoption of the Jacobi algorithm for general eigenproblem resolution, it is not a limitation in our use case, as we apply the Jacobi eigenvalue algorithm to small inputs by design.

On CPUs, approaches such as QR factorization are more common [buttari2008parallel], because efficient systolic array formulations of the Jacobi algorithm would require full control over cache eviction policies. Moreover, even modern CPUs lack enough floating-point arithmetic units to perform all the operations required by an iteration at once: even for a small K, the Jacobi algorithm computes 16 trigonometric operations and several hundred floating-point multiplications per iteration.

We instead leverage the abundant hardware resources of our FPGA platform to perform all these operations concurrently, making it the optimal choice for our Jacobi Systolic Array design.

IV-C1 Diagonal and Offdiagonal CUs

Diagonal CUs (Algorithm 2, line 4) annihilate the elements immediately outside the diagonal via a matrix rotation. Although the rotation angle is in principle arbitrary, the fastest convergence is achieved by choosing the angle that eliminates the off-diagonal components of the local block (Figure 4a). We efficiently compute the components of the rotation matrix via Taylor series expansion: even an order-3 approximation provides excellent accuracy while using significantly fewer DSPs and BRAMs than a CORDIC core. Rotations on the diagonal (Figure 5A) are performed by parallel cores, which propagate the rotation values (B) in constant time to the Offdiagonal CUs (Algorithm 2, line 9). As each CU holds only 4 elements, matrix multiplications are fully unrolled and performed in constant time. Eigenvectors (Algorithm 2, line 14) (D) are computed in parallel with the rotation of the Offdiagonal CUs (C), as they only require the rotation values.
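The sketch below illustrates how order-3 Taylor expansions can replace the trigonometric cores for the small angles involved; the exact expansion implemented in the CUs is not reported here, so the function, its inputs, and the accuracy check are assumptions for illustration.

```python
import math

def rotation_components_taylor(t_pp, t_pq, t_qq):
    """Approximate cos(theta) and sin(theta) for the Jacobi rotation angle
    theta = 0.5 * atan(2*t_pq / (t_qq - t_pp)) with low-order Taylor expansions.
    Illustrative only: assumes |2*t_pq| < |t_qq - t_pp| so the atan series is
    accurate; the exact expansion in the FPGA CUs may differ."""
    x = 2.0 * t_pq / (t_qq - t_pp)
    theta = 0.5 * (x - x**3 / 3.0)       # atan(x) ~ x - x^3/3
    cos_t = 1.0 - theta**2 / 2.0         # cos(t) ~ 1 - t^2/2
    sin_t = theta - theta**3 / 6.0       # sin(t) ~ t - t^3/6
    return cos_t, sin_t

# Compare against exact trigonometry for a typical small off-diagonal value.
c_a, s_a = rotation_components_taylor(0.9, 0.05, -0.3)
theta = 0.5 * math.atan(2 * 0.05 / (-0.3 - 0.9))
print(abs(c_a - math.cos(theta)), abs(s_a - math.sin(theta)))   # both below 1e-6
```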

IV-C2 Row/Column Interchange

Each CU has 8 connections to propagate its input and output values to adjacent processors, in addition to communicating the rotation value. As shown in Figure 5E, each processor propagates part of its values to the corresponding slots of its neighbors along rows, with processors on the borders following special rules; operations for the column interchange are symmetrical. Since some values are never propagated, more swaps are performed towards lower indices than towards higher indices. These additional swaps would require temporary vectors to store rows that would otherwise be overwritten. To avoid wasting resources on these temporary vectors, we execute the swap operations in reverse order. As row/column swaps do not introduce additional data dependencies, we perform them in a single clock cycle using FFs.

V Experimental Evaluation

Algorithm    SLR    LUT       FF        BRAM    URAM    DSP     Clock (MHz)
Lanczos      SLR0   42%       13%       15%     0%      16%     225
Jacobi       SLR1   40%       42%       0%      0%      68%     225
Jacobi       SLR2   15%       17%       0%      0%      34%     225
Available           1097419   2180971   1812    960     9020
TABLE I: Resource usage and clock frequency in our FPGA hardware design, divided by algorithm.
ID      Name              Rows (M)   Non-zeros (M)
WB-TA   wiki-Talk         2.39       5.02
WB-GO   web-Google        0.91       5.11
WB-BE   web-Berkstan      0.69       7.60
FL      Flickr            0.82       9.84
IT      italy_osm         6.69       14.02
PA      patents           3.77       14.97
VL3     venturiLevel3     4.02       16.10
DE      germany_osm       11.54      24.73
ASIA    asia_osm          11.95      25.42
RC      road_central      14.08      33.87
WK      Wikipedia         3.56       45.00
HT      hugetrace-00020   16.00      47.80
WB      wb-edu            9.84       57.15
TABLE II: Matrices/graphs used in the evaluation, sorted by number of edges/non-zero entries (in millions).

To prove that our custom FPGA design is suitable for solving large-scale Top-K sparse eigenproblems, we compare it against the popular ARPACK library in terms of execution time, power efficiency, and accuracy. The multi-threaded ARPACK library [lehoucq1998arpack], a Top-K sparse eigensolver based on IRAM, runs on two Intel Xeon Gold 6248 CPUs (80 threads in total) with 384 GB of DRAM, using single-precision floating-point arithmetic. Our eigensolver is prototyped on a Xilinx Alveo U280 accelerator card equipped with 8 GB of HBM2 memory, 32 GB of DDR4 memory, and an xcu280-fsvh2892-2L-e FPGA, whose resources are reported in Table I. Results are averaged over 20 runs.

Tests are carried out using a collection of large sparse matrices representing graph topologies, each containing millions of rows and non-zero entries (Table II). All test matrices come from the SuiteSparse collection [davis2011university]. While our evaluation is focused on sparse matrices representing graphs, our Top-K sparse eigenproblem FPGA design is applicable to other domains such as image analysis [perona1998factorization, shi2000normalized, tung2010enabling].

Resource utilization and clock frequency of our design are reported in Table I. The Lanczos and Jacobi algorithms have similar utilization, with around 20% LUT utilization each (about 50% of the LUTs available in their respective SLRs). Although the Systolic Array architecture of the Jacobi algorithm processes small inputs, it requires many trigonometric operations and multiplications in each iteration (up to 16 and about 800, respectively). Resource utilization of the Jacobi algorithm scales quadratically with the number of eigenvalues K, while the Lanczos algorithm is not affected.


Fig. 9: Speedup (higher is better) of our Top-K sparse eigensolver vs. the ARPACK multi-core CPU library, for an increasing number of eigenvalues K. The geometric mean excludes the outlier graph HT, where the speedup of our FPGA design exceeds 400x.

Fig. 10: (a) Relation between the number of matrix non-zero values and the time to process a single value. (b) Speedup vs. CPU of our Systolic Array architecture for the Jacobi algorithm.

Fig. 11: Accuracy of our Top-K sparse eigensolver, in terms of orthogonality and reconstruction error, for increasing K.

V-A Execution Time

We measure the execution time speedup of the FPGA-based hardware design implementing our Top-K sparse eigenproblem solver against the CPU baseline and report the results in Figure 9. We are always faster than the baseline, with a geometric mean speedup of 6.22x, up to 64x for specific graphs. The speedup is mostly unaffected by the number of eigenvalues K, showing how our design can efficiently compute many eigenvalues at once. Figure 10A shows that the time required by our FPGA design to process a single matrix value is unaffected by the overall graph size, while the CPU behavior is drastically less predictable.

We estimate that the Lanczos phase dominates the overall execution time due to the SpMV computations, taking more than 99% of the execution time. However, optimizing the Jacobi algorithm with a Systolic Array design is still worth the effort compared to running this step on the CPU.

Figure 10B shows the speedup of our Jacobi Systolic Array design compared to an optimized C++ CPU implementation: the execution time on CPU grows quadratically with K due to repeated matrix multiplications, becoming a non-negligible part of the execution time for large K.

Our hardware design is synthesized at 225 MHz on the Alveo U280 accelerator card. A clock frequency beyond 225 MHz does not significantly improve performance, as the SpMV represents the main computational bottleneck and its performance is bound by HBM bandwidth [lu2021demystifying]. Each SpMV CU processes data at the maximum bandwidth offered by the HBM channel from which it reads the matrix (14.37 GB/s, for a total of 71.87 GB/s using 5 CUs).

V-B Power Efficiency

We measured via an external power meter that our FPGA design consumes about 38W during execution, plus 40W for the host server. The CPU implementation consumes around 300W during execution. Our FPGA design provides 49x higher Performance/Watt ratio (24x if accounting for the FPGA host machine): we provide higher performance without sacrificing power efficiency, making our design suitable for repeated computations typical of data center applications.

V-C Accuracy Analysis of the Approximate Eigencomputation

The Lanczos algorithm is known to suffer from numerical instability [paige1972computational]. To limit this phenomenon, we reorganize the algorithm's operations as in [paige1972computational] and apply reorthogonalization as in [parlett98]. To assess the stability of our design, we measure the eigenvectors' pairwise orthogonality and the eigenvector error norm. Eigenvectors must form an orthonormal basis and be pairwise orthogonal, i.e. their dot product must be 0, or equivalently their angle must be 90°. For each pair of eigenvectors, we measure the average angle between them. Then, if λ is an eigenvalue of M and v is its associated eigenvector, it must hold that Mv = λv. By measuring the L2 norm of Mv − λv for all computed eigenpairs, we evaluate how precise the eigenvector computation is. Results, aggregated over all graphs due to space constraints, are reported for increasing K in Figure 11. Accuracy is excellent when reorthogonalization is applied every two iterations, and even without this procedure the results are satisfactory. Despite using fixed-point arithmetic in the Lanczos algorithm, the average reconstruction error remains small and the average pairwise angle stays close to 90 degrees when applying reorthogonalization every two iterations. Orthogonality is not affected by K, while the average reconstruction error improves as K increases. Spectral methods in machine learning applications use eigenvectors to capture the most important features of their input and do not usually require the same degree of precision as other engineering applications. Reorthogonalization adds overhead to the algorithm compared to Figure 9; on large graphs this overhead is negligible compared to the SpMV, making it a viable option in applications where maximum accuracy is necessary. Still, our hardware design provides excellent accuracy while being significantly faster than a highly optimized CPU implementation.
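Both metrics can be expressed in a few lines of NumPy; the sketch below uses NumPy's exact eigendecomposition of a random symmetric matrix purely as illustrative input, and the function name is hypothetical.

```python
import numpy as np

def accuracy_metrics(M, eigenvalues, eigenvectors):
    """Average pairwise angle between eigenvectors (ideally 90 degrees) and
    average L2 norm of M v - lambda v (ideally 0)."""
    V = eigenvectors
    K = V.shape[1]
    # Pairwise angles from normalized dot products.
    norms = np.linalg.norm(V, axis=0)
    G = (V.T @ V) / np.outer(norms, norms)
    angles = np.degrees(np.arccos(np.clip(G[np.triu_indices(K, 1)], -1.0, 1.0)))
    # Reconstruction error per eigenpair.
    errors = [np.linalg.norm(M @ V[:, k] - eigenvalues[k] * V[:, k]) for k in range(K)]
    return angles.mean(), float(np.mean(errors))

# Illustrative usage with an exact eigendecomposition as "perfect" input.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300)); A = (A + A.T) / 2
lam, vec = np.linalg.eigh(A)
topk = np.argsort(np.abs(lam))[-8:]
print(accuracy_metrics(A, lam[topk], vec[:, topk]))   # ~ (90.0, ~1e-13)
```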

VI Conclusion

The computation of the Top-K eigenvalues and eigenvectors on large graphs represented as sparse matrices is critical in spectral methods, a class of powerful Machine Learning algorithms that can extract useful features from graphs. We solve the Top-K sparse eigenproblem with a new algorithm that is optimized for reconfigurable hardware designs: in the first part of the computation, we exploit the enormous bandwidth of HBM through the Lanczos algorithm, while in the second part, we introduce a systolic array architecture that efficiently parallelizes the compute-intensive Jacobi eigenvalue algorithm. Compared to the popular ARPACK CPU library, we achieve a geomean speedup of 6.22x on 13 graphs with millions of vertices, raising the bar for high-performance Top-K sparse eigensolvers at a large scale.

As future work, we will extend our hardware design to support non-Hermitian matrices through the Implicitly Restarted Arnoldi Method. We will also investigate heterogeneous implementations that combine the abundant memory bandwidth of GPUs for high-performance SpMV with our Systolic Array FPGA design for the Jacobi eigenvalue algorithm.

References