
LL-GNN: Low Latency Graph Neural Networks on FPGAs for Particle Detectors

09/28/2022
by   Zhiqiang Que, et al.

This work proposes a novel reconfigurable architecture for low-latency Graph Neural Network (GNN) designs for particle detectors. Adopting FPGA-based GNNs for particle detectors is challenging since it requires sub-microsecond latency to deploy the networks for online event selection in the Level-1 triggers of the CERN Large Hadron Collider experiments. This paper proposes a custom code transformation with strength reduction for the matrix multiplication operations in interaction-network based GNNs with fully connected graphs, which avoids the costly multiplications. It exploits sparsity patterns as well as binary adjacency matrices, and avoids irregular memory access, leading to a reduction in latency and an improvement in hardware efficiency. In addition, we introduce an outer-product based matrix multiplication approach, which is enhanced by the strength reduction, for low latency design. Also, a fusion step is introduced to further reduce the design latency. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with much better latency but also finds a high-accuracy design under a given latency constraint. Finally, a customizable template for this low latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 24 times faster and consumes up to 45 times less power than a GPU implementation. Compared to our previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, allowing it to benefit from improved accuracy.


1 Introduction

Real-time data processing of high-energy proton collisions at the CERN Large Hadron Collider (LHC) is challenging since the particle detectors around the LHC ring produce hundreds of terabytes of data per second [coelho2021automatic, duarte2018fast] from collisions that occur every 25 ns. In the next phase of the LHC, the upgraded High-Luminosity LHC (HL-LHC) experiments expect a further explosion of data due to cutting-edge detectors with improved resolution and increased area and volume. The large data volumes produced by the detectors are reduced by a real-time processing system, known as the trigger, which keeps interesting collision events while discarding the others. The Level-1 Trigger (L1T), implemented entirely on FPGAs, requires algorithms to execute with latencies on the order of a microsecond [CERN2020L1T]. If an algorithm exceeds this limit, data and potentially interesting events are lost.

Previous trigger strategies [cheatham2016atlas] have utilized the jet energy alone to make a decision. Recently, machine learning based strategies have been adopted for their superior accuracy. However, because of the low latency requirement, only simple Multi-Layer Perceptron (MLP) networks [coelho2021automatic, duarte2018fast] have been proposed on FPGAs, with an accuracy of around 75%. High accuracy in the trigger is crucial to keep only the most interesting events while keeping the output bandwidth low [coelho2021automatic]. [moreno2020jedi] presents JEDI-net, a Graph Neural Network (GNN) based algorithm which achieves state-of-the-art accuracy for particle selection, making it highly attractive for the trigger systems.

However, GNNs require large amounts of computation and suffer from irregular memory accesses, resulting in large inference latency which makes them difficult to deploy in real time in the Level-1 trigger system of an LHC experiment [moreno2020jedi]. GNNs are computationally demanding, and currently there are no real-time GNNs for particle identification in the Level-1 trigger. Hence, accelerating GNN inference using reconfigurable accelerators such as FPGAs is essential for the LHC, since it would enable sophisticated processing to run in real time on the data stream from the detectors with superior accuracy. Many existing GNN accelerators on FPGAs are designed using a single-engine architecture to process layers or sub-layers (blocks) repeatedly, like GPUs, and the networks are processed in a recurrent fashion [yan2020hygcn, liang2020engn, zhang2021boostgcn, lin2021gcn, lin2022hp, kang2022grow]. However, they are not efficient for GNN execution when targeting small graphs with requirements of ultra-low latency and high throughput for scientific applications, such as particle identification. Also, none of them targets a scenario with a hard latency constraint of a microsecond or less.

This work proposes a custom Low Latency (LL)-GNN hardware architecture based on a layer-wise, tailor-made pipeline to accelerate GNNs for particle detectors, using the GNN-based JEDI-net algorithm as an end-to-end application. The layer-wise architecture has been used to speed up CNNs [blott2018finn, shen2017maximizing, zhang2020dnnexplorer] and RNNs [que2021accelerating], but few studies focus on accelerating GNNs. This paper addresses this gap. First, we propose custom strength reduction for matrix operations based on the characteristics of interaction-network based GNNs with a fully connected graph as input, which avoids the expensive matrix multiplications of the adjacency matrices with the input feature matrix. It reduces the computational complexity and also avoids irregular memory access to the sparse adjacency matrices since there is no need to fetch them. Second, this work adopts column-major order instead of the conventional row-major order for the data layout of intermediate results traversed along the hardware datapath, which avoids irregular memory access to intermediate results in GNNs. Third, we introduce an outer-product based matrix-matrix multiplication (MMM) approach for computing the GNN aggregation function, which is further enhanced by the custom strength reduction.

Moreover, instead of running GNNs in a coarse-grained pipeline as in our previous work [que2022aicas, que2022optimizing], this work fuses sub-layers as much as possible, converting several coarse-grained pipeline stages into a single stage with a fine-grained pipeline inside. This removes extra handshake acknowledgments and dual buffers between coarse-grained pipeline stages, resulting in low end-to-end latency. It is the first time that the GNNs of JEDI-net can run in less than a microsecond on FPGAs. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented to optimize the algorithm and hardware simultaneously, and to explore the optimal configuration to improve the overall performance. Our FPGA implementation achieves sub-microsecond latency, which makes the algorithm compatible with the HL-LHC conditions, where there is a strict latency constraint, while enabling improved accuracy.

To the best of our knowledge, this is the first FPGA-based design of GNNs with microsecond-level latency for particle identification for the detectors at the CERN HL-LHC experiments. This work will help to improve the next-generation trigger system, enabling powerful algorithms to process the experimental data accurately and efficiently.

We make the following contributions in this paper:

  • A low latency layer-wise hardware architecture for GNNs with several novel design optimizations, including custom strength reduction, column-major order, outer-product based matrix multiplication and sub-layer fusion as well as a GNN-specific algorithmic and hardware co-optimization approach, resulting in sub-microsecond latency with high efficiency.

  • The first FPGA-based design of GNNs with microsecond-level latency for particle identification for the detectors at the CERN HL-LHC experiments.

  • A scalable, efficient open-source template (https://github.com/walkieq/GNN-JEDInet-FPGA) for GNN-based JEDI-net which enables the generation of low-latency FPGA designs with efficient resource utilization leveraging HLS tools.

  • A comprehensive evaluation of the proposed method and hardware architecture.

While we use GNN-based JEDI-net for illustrating our hardware architecture and approach, the proposed optimizations could be applied to other GNN networks with small graphs for applications beyond particle identification.

Relationship to Prior Publications

This paper expands on our two conference papers [que2022aicas, que2022optimizing], which focus on hardware optimizations and target a low initiation interval, but still suffer from high latency. This work addresses a limitation of our previous work, which does not cover the co-design of algorithmic and hardware optimizations. This limitation results in sub-optimal designs with large latency, which hinders the deployment of GNNs in the CERN LHC. This work presents a GNN-specific co-design approach to optimize the algorithm and hardware simultaneously. Our approach can explore the design performance trade-off under both user-defined algorithmic and hardware constraints, e.g., the highest accuracy with a latency requirement of less than 1 μs on a given FPGA. In addition, a new low latency hardware architecture with sub-layer fusion is proposed, which targets end-to-end latency reduction instead of initiation interval. It fuses several coarse-grained pipeline stages into a single stage with a fine-grained pipeline inside using an FSM-based code transformation, resulting in low latency designs. The new optimizations allow us to obtain a significant reduction in end-to-end latency over our previous work [que2022aicas, que2022optimizing]. These novel optimizations, combined with the ones in our previous work, lead to sub-microsecond design latency, which makes the algorithm compatible with the CERN HL-LHC, where there is a strict microsecond-level latency constraint.

2 Background

2.1 Graph Neural Network and Interaction Network

GNNs have shown remarkable success in a wide range of applications with graph-structured data, such as recommender systems [ying2018graph, fan2019graph], molecule property prediction [fout2017protein] and particle physics [ju2020graph, ju2021performance, moreno2020jedi]. GNNs adapt their graph-based structure to an input graph through an iterative process of information aggregation across nodes to learn the complex dependencies of a system.

JEDI-net is a fully connected graph neural network based on the interaction network architecture [battaglia2016interaction]. The interaction network is a powerful graph-based framework for reasoning about objects and relations in complex and dynamic systems. The input to an interaction network is a graph of objects and the relations between them. It learns to capture complex interactions that can be used to predict future states and abstract physical properties. The acceleration of interaction-network based GNNs has also been studied for charged particle tracking at the CERN LHC on FPGAs [elabd2021graph].

2.2 JEDI-net for particle identification

The GNN of JEDI-net can be represented as a graph, with the nodes corresponding to physics particles and the edges to the relations between them; the graph is fully connected. The node input is defined as a P × N_O matrix I, whose columns are the P-length feature vectors of the N_O particles in a jet. The relations are described by a pair of binary N_O × N_E matrices, R_R and R_S, which index the receiver and sender nodes, respectively. Each column of R_R and R_S is a one-hot vector: a column of R_R indicates the index of the receiver node of the corresponding edge, and a column of R_S indicates the sender analogously. The number of edges is N_E = N_O(N_O − 1), since the input graph is fully connected with directional edges.

Fig. 1 shows the dataflow of JEDI-net, and Fig. 2 shows R_R and R_S for an example. To illustrate the idea, the number of particles (nodes) is 4 in this example, but note that a real case will have more particles. The input matrix I is multiplied by the R_R and R_S matrices and the results are then concatenated to form the matrix B of dimension 2P × N_E. Each column of the B matrix represents an edge, i.e. a particle-to-particle interaction; the elements of each column are the features of the sending and receiving nodes for that edge. A trainable deep neural network (DNN) function f_R is then applied to each column of B and produces a matrix E of dimension D_E × N_E. Then Ē = E × R_R^T is computed in MMM3 (see Fig. 1) to gather the cumulative effects of the interactions received by a given node: the cumulative effect at a node is obtained by summing the hidden features over its incoming edges. I and Ē are then concatenated to form the C matrix, which also acts as a shortcut connection. Each column of the C matrix represents a constituent in the jet, expressed as a (P + D_E)-dimensional feature vector containing the P input features and the D_E hidden features; it captures the combined effect of all the interactions between particles. Another trainable function f_O builds a post-interaction representation of each jet constituent: it is applied to each column of C to produce the O matrix of dimension D_O × N_O. A final trainable function φ_O returns the probability for the jet to belong to each of the five categories. f_R, f_O and φ_O are implemented as Multi-Layer Perceptrons (MLPs).
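Summarizing the shapes involved (a sketch in standard interaction-network notation; D_E and D_O denote the output sizes of f_R and f_O, which are model hyperparameters):

    I ∈ R^{P×N_O},   R_R, R_S ∈ {0,1}^{N_O×N_E},   N_E = N_O(N_O − 1),
    B = [I·R_R ; I·R_S] ∈ R^{2P×N_E},   E = f_R(B) ∈ R^{D_E×N_E},
    Ē = E·R_R^T ∈ R^{D_E×N_O},   C = [I ; Ē] ∈ R^{(P+D_E)×N_O},   O = f_O(C) ∈ R^{D_O×N_O}.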

Fig. 1: Overview of the JEDI-net architecture.

3 Design and Optimization

This section introduces several optimizations to accelerate interaction-network based fully connected GNNs. We follow "divide, conquer and fuse", an optimization strategy designed to achieve low latency for GNN inference. In the "divide" and "conquer" steps, we split the GNN into multiple sub-layers and perform dedicated optimizations, such as strength reduction based on structured sparsity. Then, in the "fuse" step, multiple sub-layers are fused to remove boundaries and buffers between different pipeline stages, resulting in low latency and high hardware efficiency. Finally, a co-design approach is presented.

3.1 Code transformation with strength reduction

The design of the GNN in [moreno2020jedi] adopts dense matrix-matrix multiplication (MMM) operations to calculate the products of the input feature matrix with the adjacency matrices, i.e., I × R_R and I × R_S, which is costly and time-consuming. However, we observe that most of the computational operations in these MMMs of interaction-network based fully connected GNNs are unnecessary if the sparse structures and the binary values of the adjacency matrices are exploited. With an interaction network architecture and a fully connected graph, each column of the receiving and sending matrices is one-hot. Besides, both matrices are binary and have a fixed pattern, as shown in Fig. 2. The element R_R(i, j) is set to 1 when node i receives edge j and is 0 otherwise. Similarly, the element R_S(i, j) is set to 1 when node i sends edge j and is 0 otherwise. Because of the fixed patterns and the binary values, full MMMs are unnecessary to calculate I × R_R and I × R_S. First, the multiplication operations are unnecessary because the R_R and R_S matrices only have binary values. Second, accumulation (addition) operations can be avoided because each column of R_R and R_S is one-hot, reducing the iteration factor from N_O to 1. Hence, only load and store operations are needed to calculate the MMM of I × R_R as well as I × R_S. The detailed pseudocode of the code transformation with strength reduction for MMM1/2 is illustrated in Algorithm 1.

Fig. 2: An example of an interaction-network based fully connected graph with 4 nodes and the corresponding 12 uni-directional edges (left), with its receiving matrix R_R and sending matrix R_S (right).

One of the big challenges for GNNs is irregular memory access to the sparse adjacency matrices. Our approach avoids such memory access since their structured sparsity patterns can be statically fused into the loop index, which not only saves memory bandwidth but also avoids irregular memory access to these adjacency matrices. The access to the input feature matrix I for MMM1 and MMM2 is sequential in our design since we adopt a column-major data layout, which is discussed in the next subsection. Our proposed approach not only eliminates the expensive MMM operations, increasing computational efficiency, but also avoids irregular access to the adjacency matrices, which largely reduces the design latency.

Although this work targets a fully connected GNN based on an interaction network architecture, the proposed technique of custom code transformation with strength reduction, exploiting sparse structures and binary values, can be adapted to optimize other GNNs with hardware-friendly structured sparsity for low latency designs. A structured sparsity pattern could also be user-defined when designing the architecture of a neural network at the very beginning of a project to achieve high hardware efficiency, e.g., a butterfly sparsity pattern [fan2022adaptable].

Function MMM_1_2(I, B):
    for e = 0 to N_E − 1 do                      // one edge (one column of B) per iteration
        for p = 0 to P − 1 do                    // node features
            // Inner reduction over the N_O nodes is reduced to a single access,
            // since each column of R_R / R_S is one-hot with a fixed pattern
            B[p][e] = I[p][recv(e)];             // MMM1: column of I × R_R (receiver features)
            B[p + P][e] = I[p][send(e)];         // MMM2: column of I × R_S (sender features)
            // Multiplications are avoided
            // Access to the R_R and R_S matrices is avoided
        end for
    end for
End Function
Algorithm 1 The pseudocode of the custom MMMs with strength reduction. recv(e) and send(e) denote the statically known receiver and sender node indices of edge e, fused into the loop index.
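For concreteness, a minimal Vivado HLS C++ sketch of this transformation is given below. The data type, the sizes and the edge-ordering helpers are illustrative assumptions rather than the exact code of our design (the open-source template linked in Section 1 contains the actual implementation); the arrays are stored column-major (first index selects a column), as discussed in the next subsection.

    typedef float dtype;                      // the real design uses a fixed-point type (Q12.12)

    static const int N_O = 30;                // particles (nodes), JEDI-net-30p
    static const int P   = 16;                // features per particle
    static const int N_E = N_O * (N_O - 1);   // directed edges of the fully connected graph

    // Receiver/sender node of edge e, assuming edges are grouped by receiver
    // and the self-edge is skipped (this encodes the fixed pattern of Fig. 2).
    inline int recv_node(int e) { return e / (N_O - 1); }
    inline int send_node(int e) {
        int r = e / (N_O - 1), k = e % (N_O - 1);
        return (k < r) ? k : k + 1;           // skip s == r
    }

    // MMM1/MMM2 fused with the concatenation that forms B: pure gathers, no multiplies.
    void mmm12_concat(const dtype I_cm[N_O][P], dtype B_cm[N_E][2 * P]) {
        for (int e = 0; e < N_E; ++e) {
            #pragma HLS PIPELINE II=1
            const int r = recv_node(e), s = send_node(e);
            for (int p = 0; p < P; ++p) {
                B_cm[e][p]     = I_cm[r][p];  // receiver features: column of I x R_R
                B_cm[e][p + P] = I_cm[s][p];  // sender features:   column of I x R_S
            }
        }
    }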

3.2 Column-major order

The intermediate results in the layer-wise GNN hardware architecture are captured using two-dimensional (2D) arrays representing matrices, as shown in Fig. 1. Row-major and column-major order (Fig. 3) are two data layout methods which are critical for correctly passing arrays between hardware units. More importantly, the layout is also critical for hardware performance, which is the focus of this work. The difference between the two orders lies in which elements of an array are contiguous in memory, and an appropriate data layout has a significant impact on hardware performance. When mapping a 2D array onto a one-dimensional (1D) structure (i.e. memory) using a high-level synthesis tool (e.g., Xilinx Vivado/Vitis HLS), the default data layout is usually row-major order, following the C language legacy.

However, row-major order for GNN-based JEDI-net leads to poor spatial locality and hinders parallelism, since the functions f_R and f_O are applied to each column of their input matrices, as shown in Fig. 1. The concatenation of two input matrices into one output matrix also works on columns, such as concatenating I and Ē to form the C matrix. With a row-major data layout, the input data of these functions does not sit contiguously in memory, so it is very time-consuming to fetch all the elements of a column. However, if the data is represented in column-major order, in which consecutive elements of a column reside next to each other, iterating over columns becomes easy because the data is accessed sequentially. Thus, this work adopts column-major order to increase data spatial locality for accelerating layer-wise GNNs efficiently, leading to good hardware performance.
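As a minimal illustration (the array and function names, and the hidden size D_E, are our assumptions), with column-major storage one column of B can be streamed into f_R with unit-stride reads and then processed as a wide vector:

    static const int D_E = 8;                 // assumed output size of f_R

    // Assumed one-column-in, one-column-out signature for the f_R MLP.
    void f_R(const dtype in[2 * P], dtype out[D_E]);

    void apply_f_R_columns(const dtype B_cm[N_E][2 * P], dtype E_cm[N_E][D_E]) {
        for (int e = 0; e < N_E; ++e) {
            #pragma HLS PIPELINE II=1
            dtype col[2 * P];
            #pragma HLS ARRAY_PARTITION variable=col complete
            for (int p = 0; p < 2 * P; ++p)
                col[p] = B_cm[e][p];          // unit-stride reads: column e is contiguous
            f_R(col, E_cm[e]);                // one column of E produced per trip
        }
    }

With row-major storage B_rm[p][e], the same column read would stride by N_E elements per access, which prevents grouping the column into a single wide vector.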

3.3 Outer-product based matrix multiplication

To multiply one matrix by another we usually compute the inner products of rows from the first matrix and columns from the second matrix. For example, to compute Ē = E × R_R^T in the MMM3 unit, each entry of Ē requires a whole row of the E matrix and a whole column of R_R^T. However, in GNN-based JEDI-net, the input matrix E of the MMM3 unit comes from the output of f_R, which produces its results column by column, as shown in Fig. 1. With an inner-product based MMM, this unit would need to wait a long time until a whole row of the E matrix is ready, resulting in long latency. To solve this issue, this work proposes an outer-product based matrix multiplication for MMM3 to compute E × R_R^T. Instead of using a whole row of the E matrix, a whole column of E is multiplied by one element of R_R^T to generate a partial result of one column of the result matrix, as shown in Fig. 4. The partial results are then accumulated to form that column of the result matrix. Since the E matrix is generated column by column, MMM3 can start as soon as the first column of E is ready. To efficiently support the outer-product based MMM, the column-major data layout is used for representing the intermediate results (i.e., 2D matrix arrays), as discussed in Section 3.2. Thus, the input elements can be grouped as a vector (i.e., a whole column) and processed efficiently with high parallelism because the data can be fetched sequentially. This largely reduces the waiting time of the MMM3 unit and reduces the design latency.

Fig. 3: Row-major (left) and column-major (right) orders.

(a) The calculation of the first vector of the resultant matrix. (b) The calculation of the second vector.

Fig. 4: Outer-product based matrix multiplication with strength reduction and column-major order.
1  Function MMM3(E, Ē):
2      for n = 0 to N_O − 1 do                  // one receiver node (one column of Ē) at a time
3          for j = 0 to N_O − 2 do              // the N_O − 1 incoming edges of node n
4              e = n·(N_O − 1) + j;             // statically known incoming edge (edges grouped by receiver, Fig. 2)
5              for d = 0 to D_E − 1 do
6                  // Multiplication by the binary R_R^T is avoided
7                  acc[d] = (j == 0 ? 0 : acc[d]) + E[d][e];
8              end for
9          end for
10         for d = 0 to D_E − 1 do
11             Ē[d][n] = acc[d];                // write one column of the result
12         end for
13     end for
14 End Function
Algorithm 2 The pseudocode of the outer-product based MMM with strength reduction for JEDI-net.

In addition, the code transformation with strength reduction described in Section 3.1 can be adopted to enhance the proposed outer-product based MMM, exploiting the sparsity pattern and the binary values of the adjacency matrix R_R. It avoids costly multiplications and involves only load and store operations with a small number of additions for this MMM unit. The detailed pseudocode of the strength-reduction enhanced outer-product based MMM3 is illustrated in Algorithm 2. The adjacency matrix R_R is binary and each of its columns is one-hot, as introduced in Subsection 2.2. Thus, multiplication operations are unnecessary since R_R^T is binary, and only 1/N_O of the additions of a dense MMM are required. Besides, since the adjacency matrix has a structured pattern, it can be fused into the loop index to avoid irregular memory access.

One limitation of the outer-product based MMM is that it creates a full-size resultant matrix holding only partial results until the end, leading to a large memory write bandwidth requirement. AWB-GCN [geng2020awb] presents a column-wise product architecture, a variant of outer-product MMM, which involves one column of the first matrix but only one element of the second matrix, so that the output only touches one resultant vector. In our design, each row of R_R^T is one-hot, which means that only one vector of the resultant matrix is valid at a time. Moreover, we further exploit the structured pattern of R_R so that the partial results update the corresponding vectors of the resultant matrix sequentially, which avoids irregular memory writes and achieves low latency. Furthermore, because of the structured pattern of R_R, the proposed approach only needs to read each element of the E matrix once, which also reduces the memory access bandwidth. Fig. 4 shows the design reading the second group of N_O − 1 columns of the E matrix to generate the second vector of the resultant matrix.

The latency could be further reduced by processing multiple columns of the input matrix E at once, but this requires multiple preceding hardware units to generate the corresponding number of columns in each cycle, resulting in more hardware resources. Our proposed methods not only eliminate the expensive matrix multiplication operations and reduce the iterations, but also avoid reading the adjacency matrices, improving memory access efficiency. This reduces the design latency and increases the throughput as well as the hardware efficiency.

3.4 Balancing initiation interval (II)

There are many components in a GNN model. To achieve high throughput and low latency, the design could be unrolled as much as possible using many hardware resources. However, a naive implementation often results in an unbalanced II across the coarse-grained pipeline, which causes hardware inefficiency and low system throughput. The design II is determined by the largest II among all the units on the datapath. Generally, it is unnecessary to unroll every unit to achieve the lowest II. Instead, hardware resources can be saved from the units that do not require full unrolling and re-allocated to the bottleneck unit that dominates the whole design, reducing the design II as well as the latency. Partitioning FPGA resources to enhance throughput and latency in a layer-wise architecture has been studied for CNNs [shen2017maximizing, zhang2020dnnexplorer, zhang2018dnnbuilder, gong2018maloc] and RNNs [que2021accelerating], but there is little work focusing on GNNs. This work balances the IIs of the sub-layer units in GNNs by partitioning the FPGA resources properly. A design space exploration is also performed to find appropriate parallelism parameters that achieve an optimal trade-off between II and hardware resources.
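A minimal way to express this balancing (notation ours, not the paper's): if unit u must execute n_u iterations of its body per inference and is given P_u parallel instances, then roughly

    II_u ≈ ⌈ n_u / P_u ⌉   and   II_design = max_u II_u,

so resources are shifted from units whose II_u is already below the maximum to the bottleneck unit until the IIs are balanced.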

3.5 Divide, conquer and fuse

In the previous subsections, we "divide" the GNN model into several sub-layers and "conquer" these sub-layers with tailor-made optimizations. Our previous designs [que2022aicas, que2022optimizing] run these sub-layers in a coarse-grained pipeline, achieving a low initiation interval. Nevertheless, the end-to-end latency is still higher than the microsecond-level budget required for the CERN HL-LHC. This work devises a subsequent step named "fuse", which combines several optimized sub-layers into a single unit and runs it in a fine-grained pipeline to reduce the overall design latency.

For GNNs, many operations are performed repeatedly on graph nodes or edges. This means that many of the processing units (loops) may share the same loop bound, i.e., the number of nodes (or edges) in a graph, which makes many sequential units (loops) mergeable. In particular, in our design the matrix multiplication units for MMM1/2/3 are largely simplified after strength reduction. Running them as individual stages in a coarse-grained pipeline not only wastes hardware resources but also incurs large latency because of the extra acknowledgment mechanism, including the ping-pong buffers that exist between stages in the coarse-grained pipeline. Since several sequential loops share the same loop bound, we can simply merge these loops into one. For example, the MMM1/2, Concat1 and DNN1 units all have a loop bound of N_E, the number of edges, since all of them perform calculations on the edges of the graph. They can be merged into one big loop which can then be pipelined to reduce the overall latency, since merging loops allows the logic within the loops to be optimized together. Note that the fused compute units remain data independent within the merged loop. One drawback is that fusion could slightly increase the overall initiation interval (II), which is the largest II among all the stages in the coarse-grained pipeline, since merging some of these stages creates one bigger stage. Nevertheless, it is worthwhile since there are fewer stages and fewer boundaries after fusion, resulting in lower latency.

1  for n = 0 to N_O − 1 do                        // Old: imperfect loop
2      #pragma HLS PIPELINE
3      for j = 0 to N_O − 2 do
4          #pragma HLS PIPELINE
5          code_body_A;                            // edge-level work, N_E iterations in total
6      end for
7      code_body_B;                                // node-level work, N_O iterations in total
8  end for
9  for i = 0 to N_O × II_tgt − 1 do                // New: perfect loop
10     #pragma HLS PIPELINE
11     // The loop II will be 1 but the equivalent II is the target II (II_tgt)
12     switch (curr_state) do
13         case 0 do
14             for j = 0 to Ceiling((N_O − 1)/II_tgt) − 1 do
15                 code_body_A;
16             end for
17             curr_state++;
18             break;
19         end case
20         case 1 do
21             ......
22         end case
23         ......
24         case (II_tgt − 1) do
25             for j = 0 to Ceiling((N_O − 1)/II_tgt) − 1 do
26                 code_body_A;
27             end for
28             code_body_B;                         // Used only in this case statement
29             curr_state = 0;
30             break;
31         end case
32         Default: break;
33     end switch
34 end for
Algorithm 3 Code transformation with an FSM to transform an imperfect loop into a perfect loop.

One issue of sub-layer fusion is that it may create an imperfect loop, especially for GNNs in which some units are applied to edges, with a loop bound based on the number of edges, while other units are applied to nodes, with a smaller bound based on the number of nodes. Algorithm 3 shows an example loop at lines 1–8 from the GNNs of JEDI-net. The code_body_A (e.g., the Concat1 and DNN1 units) is iterated with a bound of the number of edges, N_E = N_O(N_O − 1), while the code_body_B (e.g., the part of MMM3 at lines 10 to 12 in Algorithm 2 plus DNN2) is only iterated N_O times. After the fusion, it becomes an imperfect loop with the code_body_B outside of the inner loop. To pipeline the code_body_B, one could set the "PIPELINE" #pragma at line 2 in the outer loop, but this automatically unrolls all loops in the hierarchy below. In this case, it completely unrolls the inner loop at line 3, resulting in N_O − 1 copies of code_body_A and consuming a large amount of hardware resources. If the required hardware resources exceed the given budget, one has to restrict the number of copies of the inner loop body. For example, if the total budget can only support N_A instances of code_body_A, the loop can only be unrolled by a factor of N_A. However, using an unrolling #pragma with a factor of N_A on the inner loop will not reduce the number of copies, since the "PIPELINE" at line 2 has priority and forces the inner loop to be fully unrolled. Thus, one would have to move the "PIPELINE" #pragma to line 4 to pipeline only the inner loop without pipelining the code_body_B, leading to a poor design and large latency.

To solve this issue, this work proposes another code transformation which transforms an imperfect loop into a perfect one using a finite-state machine (FSM) based structure with a target II (II_tgt), as shown at lines 9 to 34 in Algorithm 3. If the number of instances of code_body_A that can be deployed under the given resource budget is N_A, then II_tgt equals ⌈(N_O − 1)/N_A⌉. The loop can now run with an II of one, while the equivalent II is the target II. With this code transformation, we can deploy as many instances as the hardware budget allows to improve the design performance after fusion and reduce the overall design latency.
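A condensed Vivado HLS C++ rendering of this transformation is sketched below; N_A (the per-state unroll factor), II_TGT and the two body functions are placeholders, not the exact code of the design.

    void code_body_A(int e);                               // placeholder: edge-level work (Concat1, DNN1, ...)
    void code_body_B(int n);                               // placeholder: node-level work (MMM3 tail, DNN2)

    static const int N_A    = 10;                          // code_body_A instances the budget allows
    static const int II_TGT = (N_O - 1 + N_A - 1) / N_A;   // ceil((N_O - 1) / N_A)

    void fused_stage() {
        int curr_state = 0;
        for (int it = 0; it < N_O * II_TGT; ++it) {        // perfect loop, loop II = 1
            #pragma HLS PIPELINE II=1
            const int node = it / II_TGT;
            const int base = node * (N_O - 1) + curr_state * N_A;
            for (int j = 0; j < N_A; ++j) {                // N_A parallel copies of code_body_A
                #pragma HLS UNROLL
                if (base + j < (node + 1) * (N_O - 1))
                    code_body_A(base + j);
            }
            if (curr_state == II_TGT - 1) {                // last FSM state for this node
                code_body_B(node);
                curr_state = 0;
            } else {
                ++curr_state;
            }
        }
    }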

4 Implementation

4.1 Two-level parallelism

The trade-off between latency, throughput and FPGA resource usage is determined by the parallelization of the design. This work exploits a two-level parallelism scheme. First, we adopt the reuse factor [duarte2018fast] to fine-tune the parallelism of the MLPs used in the GNN. The reuse factor sets the number of times a multiplier is used in the computation of a module. Since the code transformation with strength reduction removes all multiplications from the matrix multiplications, only the three MLPs (f_R, f_O, φ_O) consume multipliers in the JEDI-net design, and each of them is given its own reuse factor. This work always tries to achieve extremely low latency by using as many hardware resources as possible, such as unrolling all the layers in the MLPs by adopting a reuse factor of 1.

Second, this work deploys multiple copies of the f_R unit to further increase the design parallelism, and the reuse factor of f_R is always set to 1 for low latency. The f_R function is applied to each column of the B matrix, as mentioned in Section 2.2, resulting in a significant number of iterations since there are N_E columns in the B matrix. Fully unrolling all the iterations would require N_E hardware copies of f_R, leading to a hardware resource consumption that easily exceeds a given FPGA. Hence, this work partially unrolls it with a factor N_{f_R}, resulting in N_{f_R} copies of the f_R hardware unit, each processing one column (vector) of the B matrix.
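As a sketch (the name N_FR and the array layout are ours), the edge loop can be partially unrolled so that N_FR copies of the f_R unit each process one column of B per loop iteration:

    static const int N_FR = 10;                       // number of f_R copies (design parameter)

    void apply_f_R_parallel(const dtype B_cm[N_E][2 * P], dtype E_cm[N_E][D_E]) {
        for (int e0 = 0; e0 < N_E; e0 += N_FR) {
            #pragma HLS PIPELINE II=1
            for (int k = 0; k < N_FR; ++k) {
                #pragma HLS UNROLL                    // N_FR parallel f_R instances
                if (e0 + k < N_E)
                    f_R(B_cm[e0 + k], E_cm[e0 + k]);  // one column of B in, one column of E out
            }
        }
    }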

4.2 Resource model

To perform design space exploration to find the best trade-off between hardware resources and design throughput, a simple resource model for JEDI-net is developed. Since the DSPs are the most critical resources for FPGA-based neural networks, our resource model focuses on DSPs. The DSP resource model is shown in equation (1):

(1)

where the per-layer term is the number of DSPs used for a single fully-connected (FC) layer with a given input size and output size. JEDI-net has three MLPs, f_R, f_O and φ_O, which consist only of FC layers, as shown in Fig. 1. For simplicity, this work utilizes a unified bitwidth (Q12.12: 24 total bits with 12 fractional bits) for most of the design datapath. However, the values of the MLP weights lie in the range [0, 1), which results in around 13 effective bits. Hence, one of the two inputs of each multiplier only has around 13 bits, with the other bits removed by the Vivado HLS tool, and one multiplier can be fully implemented by a single Xilinx DSP. In addition, the accumulator uses Q16.16, and the input features can go down to Q0.8 with batch normalization. With our proposed approach, there are no multiplications in the MMM1/2/3 units; thus only the MLP units require multipliers, which are implemented using DSPs. The total number of DSPs used in JEDI-net is given by equation (1) and should be smaller than the total number of DSPs on the targeted FPGA.
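A plausible form of the DSP model, consistent with the description above (notation ours and not necessarily the exact equation (1); R_m is the reuse factor of MLP m, N_{f_R} the number of f_R copies, and the sums run over the FC layers of each MLP with n_in inputs and n_out outputs), is

    DSP_fc(n_in, n_out, R) = ⌈ n_in · n_out / R ⌉,
    N_DSP ≈ N_{f_R} · Σ_{l ∈ f_R} DSP_fc(l) + Σ_{l ∈ f_O} DSP_fc(l) + Σ_{l ∈ φ_O} DSP_fc(l) ≤ DSP_available.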

4.3 Latency model

We have managed to fuse most of the sub-layers, resulting in a lower latency architecture compared with the one in our previous work [que2022optimizing]. The latency model of JEDI-net based on this low latency hardware architecture is illustrated in equation (2).

(2)

where the first term is the II of the multipliers, which is one cycle in this work. The II of the fused loop depends on the maximum number of f_R copies that can be deployed on a given FPGA with limited hardware resources, and on the reuse factors of f_O and φ_O. Please note that the reuse factor of f_R is always set to 1 since f_R is the bottleneck of the design. The remaining terms are the pipeline depth of the fused stage and the depth of the logic outside the major fused loop; for simplicity, both are treated as constants determined by the design architecture.
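A sketch of a latency model consistent with the cycle counts reported in Table II (notation ours and not necessarily the exact equation (2); for design J3, ⌈870/10⌉ = 87 roughly matches its reported 90-cycle II) is

    II_design ≈ ⌈ N_E / N_{f_R} ⌉ · II_mult,
    Latency ≈ II_design + D_fused + D_rest,

where II_mult = 1 cycle, D_fused is the pipeline depth of the fused stage and D_rest is the depth of the remaining logic.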

4.4 Algorithmic-hardware co-design

The material in [moreno2020jedi] focuses only on algorithmic optimizations for neural networks. This often leads to a deployment issue where neural network models optimized for high accuracy cannot be deployed on FPGAs due to the large model size and limited hardware resources. As another line of research, our previous studies [que2022aicas, que2022optimizing] focus on hardware-aware optimizations. However, without optimizing the algorithm, our designs still could not achieve an end-to-end latency of less than 1 μs. Therefore, existing GNN designs for particle identification primarily focus on either the algorithm or the hardware, leading to sub-optimal designs when deploying GNNs on FPGAs. To address this issue, this work presents a GNN-specific co-design approach to optimize the algorithm and hardware simultaneously. Our approach can explore the trade-off under both user-defined algorithmic and hardware constraints, e.g., the highest accuracy with a latency requirement of less than 1 μs on a given FPGA.

Generally in GNNs, since the number of edges is usually larger than the number of nodes in a graph, the operations involving graph edges usually need to run more iterations than the ones involving graph nodes. In our design, DNN1 (f_R) is required to run N_E times, which is much larger than DNN2 (f_O), which runs only N_O times, and DNN3 (φ_O), which runs only once per inference. Taking this into consideration, this work sets a small size for the hidden layers in f_R while keeping or increasing the sizes of the other two MLPs, leading to a reduction in latency while maintaining the model accuracy. [moreno2020jedi] conducted a wide algorithmic search over various model parameters, such as the number of output neurons (D_E and D_O), the sizes of the hidden layers in the 3-layer f_R, f_O and φ_O MLPs, the activation function of each layer in these MLPs, and the training optimizer. However, it sets the same size for all three MLPs, which is not latency friendly as discussed above. This work reuses the model parameters searched in [moreno2020jedi] but re-balances the sizes of the different MLPs used in the graph network to explore the trade-off between algorithmic and hardware performance.

Since this work targets low latency on a resource-constrained hardware device with a fixed latency requirement, we can further optimize our design search space and exploration flow. Generally, the training of a GNN model takes a long time, but since we have a hard latency requirement, it is unnecessary to train a model if we know its estimated latency is too large for the requirement. We therefore define a latency threshold factor and drop the training of a model if its estimated latency exceeds the threshold factor times the latency constraint, saving a great deal of model training time. We set the threshold factor larger than 1 to loosen the constraint in case the exploration misses some interesting cases; it could also be set to a very large number to obtain full coverage. This work estimates the latency and resource usage of each design candidate, with various hidden sizes and numbers of MLP layers, using equations (1) and (2). Training the designs that violate the latency limit is unnecessary because their latency prevents deployment in the L1T system; this leads to a large reduction in GPU/CPU training hours to find the optimal design. Although this work prioritizes minimizing the latency on a resource-constrained device, the optimization mode of our approach can easily be switched to other user-defined metrics with other constraints.
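A schematic of this pruning step is sketched below; the structure fields, the estimator functions and the constraint handling are illustrative placeholders wired to the models of Sections 4.2 and 4.3, not the actual exploration code.

    #include <vector>

    struct Candidate {
        int nl_fr, size_fr;        // number of layers and hidden size of f_R
        int size_fo_first;         // first-layer size of f_O / phi_O
        int n_fr_copies;           // N_fR, copies of the f_R unit
    };

    double estimate_latency_us(const Candidate& c);   // from the latency model (Sec. 4.3)
    int    estimate_dsps(const Candidate& c);         // from the resource model (Sec. 4.2)

    // Keep only candidates whose estimated latency and DSP usage fit the budget;
    // only the survivors are actually trained and evaluated for accuracy.
    std::vector<Candidate> prune(const std::vector<Candidate>& space,
                                 double latency_constraint_us,
                                 double threshold_factor,      // > 1 loosens the latency bound
                                 int dsp_budget) {
        std::vector<Candidate> kept;
        for (const auto& c : space) {
            if (estimate_latency_us(c) <= threshold_factor * latency_constraint_us &&
                estimate_dsps(c) <= dsp_budget)
                kept.push_back(c);
        }
        return kept;
    }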

Once the optimal design is found based on the user-defined metric, the model parameters as well as the weights and biases are generated. A low latency FPGA design is then generated using our HLS-based templates, which improves design productivity.

Fig. 5: System Overview. (a) The CMS Level-1 Trigger as well as data acquisition system; (b) The FPGA system of JEDI-net for particle detection.

4.5 Implementation of the hardware accelerator

The CMS experiment at CERN has two levels of real-time triggering. Fig. 5(a) shows the data acquisition system, including the Level-1 trigger. A subset of the data from the detector is first sent to the L1T for processing. If an accept decision is made, the remaining data is read out for further processing; otherwise the data is dropped. The L1T has a total latency budget of 12.5 μs in the upgraded High-Luminosity Large Hadron Collider (HL-LHC) [CERN2020L1T]. It consists of many FPGA-based sub-systems without CPUs or PCIe in the datapath, and particle identification is one of them. The particle identification system accepts streamed data which arrives on a number of parallel optical fibres running at 25 Gb/s, as shown in Fig. 5(b). This work splits the whole GNN into several sub-layers and adopts a layer-wise hardware architecture [blott2018finn, duarte2018fast, zhang2020dnnexplorer] to map all the sub-layers on-chip, which is flexible and takes full advantage of the customizability of FPGAs. In our previous work, different sub-layers run in a coarse-grained pipeline to further increase the design throughput. In contrast, this work fuses as many sub-layers as possible to convert several coarse-grained stages into a single stage with a fine-grained pipeline inside, resulting in low end-to-end latency.

5 Evaluation and Analysis

This section presents the evaluation results of the GNN-based JEDI-net on FPGAs demonstrating the scalability of the proposed optimization for GNNs.

5.1 Experimental setup

This study focuses on JEDI-net-30p models [moreno2020jedi] targeting a dataset of 30 particles [jet_dataset_30p] and JEDI-net-50p models targeting a 50-particle dataset [jet_dataset_50p]. To study the performance and limitations of the proposed optimizations and hardware architecture, the design is implemented using Xilinx Vivado HLS 19.2 on a Xilinx Alveo U250 board for the evaluation and comparison with other implementations. It runs at 200 MHz, so each cycle is 5 ns. FPGA power consumption is reported by the Xilinx Vivado tool after place and route. The datapath uses Q12.12: one sign bit, 11 integer bits and 12 fractional bits, while the accumulator uses Q16.16 to maintain accuracy. The design achieves the same accuracy as the floating-point model.
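In Vivado HLS these formats would be declared along the following lines (a sketch; ap_fixed<W, I> has W total bits, of which I are integer bits including the sign):

    #include "ap_fixed.h"

    typedef ap_fixed<24, 12> dtype;   // Q12.12 datapath: 1 sign + 11 integer + 12 fractional bits
    typedef ap_fixed<32, 16> acc_t;   // Q16.16 accumulator, wider to preserve accuracy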

Fig. 6: The accuracy of JEDI-net 50p model for a range of bit widths.

5.2 Model quantization and accuracy

To find a fixed-point representation that causes no reduction in the physics performance of the algorithm, we scan the fixed-point precision with total bit widths from 16 to 26 bits and integer bit widths from 6 to 13, including the sign bit, as shown in Fig. 6. With 24 total bits and 12 integer bits, the fixed-point model effectively achieves the same accuracy as the FP32 floating-point counterpart. In addition, JEDI-net achieves much higher accuracy than the previous DNN-based work [duarte2018fast, coelho2021automatic], which reaches an accuracy of around 75%. We also evaluate the Receiver Operating Characteristic (ROC) curves with the area under the curve (AUC) for the five jet classes, gluon, light quark, W boson, Z boson and top quark, as shown in Fig. 7. The AUC of the light-quark tagger (blue lines) using the 24-bit fixed-point representation appears to differ from the floating-point one, but note the logarithmic scale on the x-axis of Fig. 7; the AUC loss of the q tagger is less than 1%.

Fig. 7: The AUCs of five jet taggers.

5.3 Resource utilization

Task                                                 LUT    FF     BRAM   DSP
Available                                            1728k  3456k  5376   12288
JEDI-net-30p (J2 [que2022optimizing])  Used          1158k  246k   1392   11504
                                       Utiliz. [%]   67     7.1    25     93
JEDI-net-30p (J3, w/ fusion)           Used          734k   137k   16     9013
                                       Utiliz. [%]   42     4.0    0.3    73
JEDI-net-30p (J4, Opt-Latn)            Used          865k   138k   37     7267
                                       Utiliz. [%]   50     4.0    0.7    59
JEDI-net-30p (J5, Opt-Acc)             Used          911k   153k   37     9833
                                       Utiliz. [%]   53     4.4    0.7    80
JEDI-net-50p (U4, Opt-Latn)            Used          855k   201k   25     8945
                                       Utiliz. [%]   49     5.8    0.5    73
JEDI-net-50p (U5, Opt-Acc)             Used          815k   189k   37     8986
                                       Utiliz. [%]   47     5.5    0.7    73
TABLE I: Resource utilization.

Table I shows the resource utilization of our designs on the U250 FPGA with different parallelism parameters. For the JEDI-net-30p model, the number of input particles N_O is 30 with a feature size P of 16, as defined by the dataset. For JEDI-net-50p, N_O is 50 with the same P. The number of edges, N_E = N_O(N_O − 1), increases dramatically with N_O: it equals 870 when N_O = 30 and 2450 when N_O = 50. The two models also use different MLP sizes. Our new low latency design with fusion consumes fewer hardware resources than the previous design for the same-sized JEDI-net-30p model, as shown in Table I; both J2 and J3 target the same-sized JEDI-net model, and the details of J2 and J3 are given in Table II. The utilization of both BRAM and LUTs is reduced in the new design based on the proposed low latency hardware architecture, resulting in a more hardware-efficient design.

5.4 Performance and analysis

5.4.1 Optimization of strength reduction

Code transformation using strength reduction is conducted to optimize matrix multiplications, transforming multiplications to only load and store operations, sometimes with a small number of additions. Fig. 8(a) shows that all the multiplications and additions for MMM1/2 units are removed and Fig. 8(b) shows that only 6,960 (3.3%) additions of the original implementation [moreno2020jedi] are required for the MMM3 unit in JEDI-net-30p models. Besides, there is a 96.7% reduction in the number of iterations for all the MMM units, which largely reduces the design latency. Fig. 8(c) and (d) show the operations reduction in JEDI-net-50p models. Moreover, the run time of C synthesis using Vivado HLS is also reduced by over 3 times when compared with the HLS design without strength reduction.

Fig. 8: The reduction in the number of multiplications, additions and iterations for (a) MMM1/2 as well as (b) MMM3 of JEDI-net-30p models, and (c) MMM1/2 as well as (d) MMM3 of JEDI-net-50p models.

To achieve low latency and high throughput, each of the layers in the f_R, f_O and φ_O units is first fully unrolled. Besides, the proposed strength-reduction enhanced MMMs are applied with a column-major data representation. The latency of our designs is reduced from a few milliseconds to a few microseconds for both the JEDI-net-30p and 50p models, as shown in Fig. 9. Designs J1 and U1 use the initial parallelism parameters for JEDI-net-30p and JEDI-net-50p respectively, as shown in Table II.

5.4.2 Parallelization and resource balance

To further improve the latency and II, we increase the design parallelism by deploying multiple DNN1 (f_R) hardware units, since this unit is iterated N_E times. The latency of design J2 is reduced from 12.56 μs to 1.91 μs when the number of deployed DNN1 units increases from 1 to 13 for the JEDI-net-30p model. The latency improvement is achieved by trading hardware resources for performance (low latency). However, naively increasing the number of DNN1 copies does not lead to an optimal design. The designs U1 and U2 for the larger model, JEDI-net-50p, show that the latency decreases from 32.60 μs to 12.47 μs when the number of copies increases from 1 to 3, as shown in Table II; however, the required number of DSPs then exceeds the total number of DSPs on this FPGA. To solve this issue, we re-allocate some DSP blocks from DNN2 (f_O) and DNN3 (φ_O) to DNN1 (f_R) by increasing the reuse factors of DNN2 and DNN3 while deploying multiple DNN1 units to balance the whole design II, leading to a reduced II and latency. A design space exploration is conducted to find the appropriate values of the parallelism parameters, (4, 4), resulting in design U3, which has both a better II and a better latency than design U2.

Design                    J1      J2      J3      J4       J5       U1      U2      U3      U4       U5
Source                    prev    prev    new     new      new      prev    prev    prev    new      new
NN model                  JEDI-net-30p (J1–J5)                      JEDI-net-50p (U1–U5)
f_R (NL, size)            (3,20)  (3,20)  (3,20)  (1,8)    (2,32)   (3,50)  (3,50)  (3,50)  (2,8)    (2,8)
f_O / φ_O (NL, size)      (3,20)  (3,20)  (3,20)  (3,48)   (3,48)   (3,50)  (3,50)  (3,50)  (3,32)   (3,48)
Accuracy                  78.74%  78.74%  78.74%  78.41%   79.85%   80.42%  80.42%  80.42%  80.90%   81.18%
Reuse factor (f_O, φ_O)   1       1       1       1        1        1       1       4       1        1
No. of f_R copies         1       13      10      29       6        1       3       4       25       17
DSP used                  1831    11504   9013    8776     9833     7342    12894   12284   8945     8986
DSP utilization [%]       14      93      73      71       80       59      104     94      73       73
II (cycles)               880     80      90      30       150      2462    854     650     100      150
II (μs)                   4.40    0.40    0.45    0.15     0.75     12.31   4.27    3.25    0.50     0.75
Latency (cycles)          2511    382     124     58       181      6519    2493    2131    130      181
Latency (μs)              12.56   1.91    0.62    0.29     0.91     32.60   12.47   10.66   0.65     0.91
Note                      -       -       -       Opt-Latn Opt-Acc  -       -       -       Opt-Latn Opt-Acc
(prev = [que2022optimizing]; new = this work)
TABLE II: Performance comparison of FPGA-based implementations of JEDI-net with various parameters.
Fig. 9: The latency reduction of JEDI-net models using custom MMMs with strength reduction.
Fig. 10: The latency and initiation interval (II) for various designs of both JEDI-net-30p and JEDI-net-50p models.

5.4.3 Fusion

The latency can be further reduced by the newly proposed sub-layer fusion for GNNs using a layer-wise hardware architecture. The fusion removes the boundaries between the hardware units and enables a fine-grained pipeline instead of a coarse-grained one. A code transformation with an FSM structure is introduced to mitigate the pipelining issue of an imperfect loop. For example, design J3 targets the same NN model as J2, but with fusion the latency is reduced from 1.91 μs to 0.62 μs. This is the first time that the JEDI-net-30p model can run in under 1 μs. Besides, setting the number of f_R copies to any value between 10 and 14 gives the same II and latency with the new low latency hardware architecture, because the partition is performed on the inner loop, which has a bound of 29 for JEDI-net-30p; thus 10 is chosen since it consumes the least hardware resources. Compared to our previous work [que2022optimizing], the new design J3 is 3.1 times faster, as shown in Fig. 10(a), while consuming 22% fewer DSP blocks. The cost is a slightly higher II, from 0.40 μs to 0.45 μs, since after fusion the pipeline depth of the fused stage increases.

5.4.4 Algorithmic and hardware co-optimization

Moreover, we introduce a co-design approach to optimize the GNNs for both model accuracy and hardware performance. It is the first time that JEDI-net-50p models can run in under 1 μs without sacrificing model accuracy. In [moreno2020jedi] and our previous work [que2022aicas, que2022optimizing], the sizes of all three MLPs are the same. However, from the analysis of JEDI-net, f_R is iterated N_O − 1 times more often than f_O and N_E times more often than φ_O. Thus, if the size of f_R is kept small, the design can deploy more copies of the f_R unit on a given FPGA, resulting in a faster design. We consider JEDI-net-30p models with the number of layers (NL) of f_R searched in (1, 2, 3, 4) and the layer size in (8, 16, 24, 32); the latency threshold factor is set to 2 for JEDI-net-30p. JEDI-net-50p is much larger, and the size of f_R is searched in (8, 16, 32, 48) with NL in (1, 2, 3, 4); the threshold factor is set to 4 for JEDI-net-50p. For simplicity, we keep the layer number and other configurations of f_O and φ_O the same as in [moreno2020jedi], but set the size of their first layer to one of (16, 32, 48, 64, 96) for both JEDI-net-30p and 50p.

The results obtained on a U250 FPGA platform are shown in Fig. 11 and Fig. 12; each blue dot is an explored design. Fig. 11 shows the designs with a latency of less than 2 μs. Since JEDI-net-50p is much larger than JEDI-net-30p, Fig. 12 shows the designs with a latency of less than 4 μs. J4 and U4 are selected since they have the lowest latency (Opt-Latn) among all the candidate designs and the highest model accuracy among the designs with the same latency. With an accuracy loss of 0.33%, the latency of JEDI-net-30p can be further reduced to 0.29 μs, which is 2.1 times faster than J3; the II is also reduced, to 0.15 μs. For JEDI-net-50p, the latency of U4 is reduced from 10.66 μs to 0.65 μs, which is 16.7 times faster than U3 [que2022optimizing], as shown in Fig. 10(b). In addition, the accuracy is higher than that of U3.

Fig. 11: The latency and accuracy for JEDI-net-30p models.

5.4.5 Analysis and discussion

Although we target low latency GNNs, for a real-world application we also consider the model accuracy. J5 and U5 are selected since they have the highest model accuracy (Opt-Acc) among all the designs with a latency of less than 1 μs. Compared to our previous work [que2022optimizing] (J2), the new design J5 achieves 1.11% better model accuracy and a 2.1 times reduction in latency. For JEDI-net-50p, the newly searched design U5 achieves 0.76% better model accuracy and an 11.8 times reduction in latency compared with the previous work (U3). For JEDI-net-50p, design U6 is an interesting case which achieves the highest accuracy among all the searched designs, but its latency of 2.71 μs exceeds the microsecond-level latency constraint for the CERN HL-LHC. However, with a larger FPGA chip in the future, it could be reduced to an acceptable latency based on our optimizations. In addition, based on the latency model in Section 4.3, the estimated latencies of designs J4, J5, U4 and U5 are 0.30 μs, 0.91 μs, 0.66 μs and 0.915 μs respectively, corresponding to prediction errors of less than 5%.

We demonstrate that our approach can not only find a design with a much better latency (Opt-Latn) but can also find a more balanced design, such as one achieving high accuracy with a slightly higher but still acceptable latency (under 1 μs), such as U5. We can loosen the latency constraint and trade it for model accuracy. Besides, as the scientific requirements evolve, scientists will design more sensitive detectors and require even lower latency. With our proposed approach and optimizations, we can not only maximize the hardware performance on a given FPGA; given a new performance requirement in the future, we can also determine what kind of FPGA would be needed to meet such a system requirement.

Fig. 12: The latency and accuracy for JEDI-net-50p models.

When particle identification is only part of the whole processing in the trigger, our approach can still provide an appropriate set of parameters to obtain the optimal II and latency within a given hardware budget. Besides, in a realistic use case, the cardinality of the input dataset might be much smaller; in that case, one would be able to speed up the algorithm even more than shown in this work, as well as reduce the resource utilization.

5.5 Comparison with GPUs and CPUs

To compare the performance of the proposed FPGA design with other platforms, we run the JEDI-net models implemented in [moreno2020jedi] on an Intel Xeon Gold 6154 CPU and an NVIDIA GeForce RTX 2080 Ti GPU (CUDA 10.2) using the PyTorch (1.8.1) framework. The cuDNN libraries are used to optimize the hardware performance on the GPU. Each batch has 1000 graph events (samples), following [moreno2020jedi], and we set the same batch size for all the hardware platforms for a fair comparison. CPU power consumption is measured by the pcm-power utility [han2017ese], excluding the DRAM power consumption. GPU power consumption is measured using the nvidia-smi utility. We adopt KGPS (kilo graphs per second), the number of graph event inferences that run per second, as the throughput metric. This work uses the same CPU and GPU implementations as [moreno2020jedi]. Compared with the JEDI-net implementation on the GPU, our FPGA design is 5.1 to 24 times faster and consumes 9.7 to 10.2 times less power; in terms of power efficiency, denoted as KGPS per Watt, our design is around 52 to 215 times higher than the GPU implementation (Table III). When compared to the CPU implementation, our FPGA implementation is 76 to 791 times faster and achieves 326 to 2575 times higher power efficiency. We believe the proposed custom strength reduction could also be applied to CPU and GPU implementations, but latency profiling shows that the three MMMs cost less than 15% of the total latency there; we leave this for future work since it has a limited impact on the conclusions of this paper. The FPGA implementation is faster and more efficient because the design is unrolled on-chip with a fine-grained pipeline and benefits from tailor-made optimizations for JEDI-net based on the proposed approach.

Platform               CPU (Xeon Gold 6154)   GPU (GeForce RTX 2080 Ti)   FPGA (Alveo U250)
Frequency              3.00 GHz               1.63 GHz                    200 MHz
Technology             14 nm                  12 nm                       28 nm
Precision              FP32                   FP32                        24-bit fixed
NN model (JEDI-net)    50p*      30p*         50p*      30p*              50p       30p
Power (W)              103       106          250       245               25.9      24.0
Batch size             1000      1000         1000      1000              1000      1000
Average latency (μs)   593.1     56.9         16.8      3.8               0.75      0.75
Throughput (KGPS)      1.69      17.6         59.52     263.2             1333      1333
Power effic. (KGPS/W)  0.02      0.17         0.24      1.07              51.5      55.5
  • * Same JEDI-net architecture as in [moreno2020jedi, que2022optimizing].

TABLE III: Comparison of the FPGA, CPU and GPU designs.

6 Related work

There have been studies exploring GNNs for particle physics applications, such as jet tagging (identification) [moreno2020jedi], charged particle tracking [ju2021performance], and calorimeter energy measurements [qasim2019learning]; more can be found in the survey [thais2022graph]. To achieve low latency, FPGAs have been employed. [elabd2021graph] extends the hls4ml [duarte2018fast] tool to translate GNNs into FPGA firmware automatically for charged particle tracking. GarNet [iiyama2021distance], a GNN-based algorithm, is proposed for calorimeter energy regression.

There are also many studies on general GNN acceleration [zhang2021boostgcn, lin2021gcn, geng2020awb, geng2021gcn, zhou2022model, tian2022g, garg2022understanding]. AWB-GCN [geng2020awb] is based on a column-wise-product architecture with runtime rebalancing for GCN acceleration. Its upgraded version, I-GCN [geng2021gcn], presents islandization, a new runtime graph restructuring algorithm, to improve data locality. BoostGCN [zhang2021boostgcn] presents a novel hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme for pipelined GCNs. [lin2021gcn] introduces GCN acceleration using HLS and hardware-friendly optimizations. G-NMP [tian2022g] presents a Near-Memory Processing (NMP) solution for accelerating GNNs to handle irregular memory access. [garg2022understanding] explores various dataflow choices for sparse and dense GNNs on spatial accelerators. [besta2022parallel] designs a taxonomy of parallelism in GNNs. [sohrabizadeh2022streamgcn] presents StreamGCN for accelerating GCNs, specialized for streaming processing of small graphs. [chen2022regraph] proposes a heterogeneous pipeline architecture for GNNs on high bandwidth memory (HBM) enabled FPGAs. [abi2022gengnn] proposes the GenGNN framework to deliver ultra-fast GNN inference and support a diverse set of GNN models; results show their designs achieve millisecond-level latency. They also propose FlowGNN [sarkar2022flowgnn], which can flexibly support the majority of message-passing GNNs. [kang2022grow] proposes a GCN accelerator named GROW, which uses Gustavson's algorithm to architect a sparse-dense GEMM accelerator with row-wise product. [sun2022multi] proposes MultiGCN, which balances network latency and network bandwidth for GCNs in multi-node systems. [yang2022drgn] presents a dynamically reconfigurable accelerator for GNNs named DRGN. EGCN [han2022egcn] uses tiled matrix multiplication to reduce off-chip memory access.

There are also previous studies on algorithm-hardware co-design for GNNs [zhou2022model, you2022gcod, zhang2021g]. [zhang2021g] presents a framework that automatically co-searches the GNN and its accelerator to maximize both task accuracy and acceleration efficiency. [zhou2022model] proposes a model-architecture co-design with a lightweight algorithm for temporal GNN inference on FPGAs. [you2022gcod] proposes the GCoD framework, which involves a two-pronged accelerator. Some previous studies focus on accelerating GNN training [zeng2020graphact, su2021graph, lin2022hp, ogbogu2022accelerating]. GraphACT [zeng2020graphact] introduces an FPGA-based accelerator with a subgraph-based algorithm for training Graph Convolutional Networks (GCNs). [su2021graph] presents an efficient graph sampling accelerator on HBM-enabled FPGAs for training GNNs. [lin2022hp] proposes HP-GNN, which maps GNN training onto CPU-FPGA platforms automatically. DietGNN [ogbogu2022accelerating], a crossbar-aware pruning technique, is proposed to accelerate the training of large-scale GNNs. All of these studies utilize the single-engine architecture. This work focuses on the layer-wise architecture and proposes several novel optimizations for high throughput and microsecond-level latency. These previous studies are orthogonal to our proposed approach and hardware architecture; their techniques could be complementary to our approach, which could be extended in the future to achieve even lower latency.

7 Conclusions and Future Work

This paper presents a novel approach for minimizing the latency of the GNN-based JEDI-net on an FPGA. It involves optimizing the matrix operations and the hardware pipeline to support next-generation low-latency collider trigger systems, key to many fundamental physics experiments including particle identification. Results show up to 45 times reduction in latency over the existing GPU-based JEDI-net implementation and up to 16.7 times reduction over our previous work. Our future work includes exploring the use of new FPGA resources such as the AI Engines [xilinx_white] and the AI Tensor Blocks [langhammer2021stratix], and incorporating the proposed techniques into the design and implementation of the data processing architecture for next-generation collider trigger systems.

References