GNNJEDInetFPGA
A template for GNN-based JEDI-net using Vivado HLS
This work proposes a novel reconfigurable architecture for low-latency Graph Neural Network (GNN) designs, specifically for particle detectors. Adopting FPGA-based GNNs for particle detectors is challenging since it requires sub-microsecond latency to deploy the networks for online event selection in the Level-1 triggers of the CERN Large Hadron Collider experiments. This paper proposes a custom code transformation with strength reduction for the matrix multiplication operations in interaction-network based GNNs with fully connected graphs, which avoids the costly multiplications. It exploits sparsity patterns as well as binary adjacency matrices, and avoids irregular memory accesses, leading to reduced latency and improved hardware efficiency. In addition, we introduce an outer-product based matrix multiplication approach, enhanced by the strength reduction, for low-latency design. Also, a fusion step is introduced to further reduce the design latency. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented which not only finds a design with much better latency but also finds a high-accuracy design under a given latency constraint. Finally, a customizable template for this low-latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is up to 24 times faster and consumes up to 45 times less power than a GPU implementation. Compared to our previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy.
Real-time data processing from high-energy proton collisions at the CERN Large Hadron Collider (LHC) is challenging, since the particle detectors around the LHC ring produce hundreds of terabytes of data per second [coelho2021automatic, duarte2018fast] from collisions that occur every 25 ns. In the next phase of the LHC, the upgraded High-Luminosity LHC (HL-LHC) experiments expect a further explosion of data due to cutting-edge detectors with improved resolution and increased area as well as volume. The large data volumes produced by the detectors are reduced by a real-time processing system, known as the trigger, which keeps interesting collision events while discarding the others. The Level-1 Trigger (L1T), using only FPGAs, requires application processing latencies of O(1) μs in the LHC [CERN2020L1T]. If algorithm latency exceeds the limit, data or interesting events are lost.
Previous trigger strategies [cheatham2016atlas] have utilized the jet energy alone to make a decision. Recently, machine learning based strategies have been adopted for their superior accuracy. However, because of the low latency requirement, only simple Multi-Layer Perceptron (MLP) networks [coelho2021automatic, duarte2018fast] on FPGAs have been proposed, with an accuracy of around 75%. High accuracy in the trigger is crucial to keep only the most interesting events while keeping the output bandwidth low [coelho2021automatic]. [moreno2020jedi] presents JEDI-net, a Graph Neural Network (GNN) based algorithm which achieves state-of-the-art accuracy for particle selection, and it is in high demand in the trigger systems. However, GNNs require large amounts of computation and suffer from irregular memory accesses, resulting in large inference latency, which makes them difficult to deploy in real time in the Level-1 trigger system of an LHC experiment [moreno2020jedi]. GNNs are computationally demanding, and currently there are no real-time GNNs for particle identification in the Level-1 trigger. Hence, accelerating GNN inference using reconfigurable accelerators such as FPGAs is essential in the LHC, since it would enable sophisticated processing with superior accuracy to run in real time on the data stream from the detectors. Many existing GNN accelerators on FPGAs are designed using a single-engine architecture to process layers or sub-layers (blocks) repeatedly, like GPUs, and the networks are processed in a recurrent fashion [yan2020hygcn, liang2020engn, zhang2021boostgcn, lin2021gcn, lin2022hp, kang2022grow]. However, they are not efficient for GNN execution when targeting small graphs with requirements of ultra-low latency and high throughput for scientific applications, such as particle identification. Moreover, none of them targets a scenario with a hard latency constraint of O(1) μs.
This work proposes a custom Low Latency GNN (LL-GNN) hardware architecture based on a layer-wise, tailor-made pipeline to accelerate GNNs for particle detectors, using the GNN-based JEDI-net algorithm as an end-to-end application. Layer-wise architectures have been used to speed up CNNs [blott2018finn, shen2017maximizing, zhang2020dnnexplorer] and RNNs [que2021accelerating], but few studies focus on accelerating GNNs; this paper addresses the gap. First, we propose custom strength reduction for matrix operations based on the characteristics of interaction-network based GNNs with a fully connected graph as input, which avoids the expensive matrix multiplications of the adjacency matrices with the input feature matrix. It reduces the computational complexity and also avoids the irregular memory accesses of the sparse adjacency matrices, since there is no need to fetch them. Second, this work adopts column-major order instead of the conventional row-major order for the data layout of intermediate results traversed along the hardware datapath, which avoids irregular memory accesses for intermediate results in GNNs. Third, we introduce an outer-product based matrix-matrix multiplication (MMM) approach for computing the GNN aggregation function, which is further enhanced by the custom strength reduction.
Moreover, instead of running GNNs in a coarse-grained pipeline as in our previous work [que2022aicas, que2022optimizing], this work fuses sub-layers as much as possible, converting several coarse-grained pipeline stages into a single stage with a fine-grained pipeline inside. This removes extra handshake acknowledgments and dual buffers between coarse-grained pipeline stages, resulting in low end-to-end latency. It is the first time that the GNNs of JEDI-net can run in less than 1 μs on FPGAs. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented to optimize the algorithm and hardware simultaneously, and to explore the optimal configuration to improve the overall performance. Our FPGA implementation achieves a sub-microsecond latency, which makes the algorithm compatible with the HL-LHC conditions, where there is a strict latency constraint, while enabling improved accuracy.
To the best of our knowledge, this is the first FPGA-based design of GNNs with a latency of O(1) μs for particle identification for the detectors at the CERN HL-LHC experiments. This work will help to improve the next-generation trigger system, enabling powerful algorithms to process the experimental data accurately and efficiently.
We make the following contributions in this paper:
A low-latency, layer-wise hardware architecture for GNNs with several novel design optimizations, including custom strength reduction, column-major order, outer-product based matrix multiplication and sub-layer fusion, as well as a GNN-specific algorithmic and hardware co-optimization approach, resulting in sub-microsecond latency with high efficiency.
The first FPGA-based design of GNNs with a latency of O(1) μs for particle identification for the detectors at the CERN HL-LHC experiments.
A scalable, efficient open-source template (https://github.com/walkieq/GNNJEDInetFPGA) for GNN-based JEDI-net which enables the generation of low-latency FPGA designs with efficient resource utilization leveraging HLS tools.
A comprehensive evaluation of the proposed method and hardware architecture.
While we use GNN-based JEDI-net to illustrate our hardware architecture and approach, the proposed optimizations can be applied to other GNNs with small graphs for applications beyond particle identification.
This paper expands on our two conference papers [que2022aicas, que2022optimizing], which focus on hardware optimizations and target a low initiation interval, but still suffer from high latency. This work addresses a limitation of our previous work, which does not cover the co-design of algorithmic and hardware optimizations; that limitation results in sub-optimal designs with large latency, hindering the deployment of GNNs in the CERN LHC. This work presents a GNN-specific co-design approach to optimize the algorithm and hardware simultaneously. Our approach can explore the design performance trade-off under both user-defined algorithmic and hardware constraints, e.g., the highest accuracy with a latency requirement of less than 1 μs on a given FPGA. In addition, a new low-latency hardware architecture with sub-layer fusion is proposed, which targets end-to-end latency reduction instead of initiation interval. It fuses several coarse-grained pipeline stages into a single stage with a fine-grained pipeline inside using an FSM-based code transformation, resulting in low-latency designs. The new optimizations allow us to obtain a significant reduction in end-to-end latency over our previous work [que2022aicas, que2022optimizing]. These novel optimizations, combined with the ones in our previous work, lead to sub-microsecond design latency, which makes the algorithm compatible with the CERN HL-LHC, where there is a strict latency constraint of O(1) μs.
GNNs have shown remarkable success in a wide range of applications with graph-structured data, such as recommender systems [ying2018graph, fan2019graph], molecule property prediction [fout2017protein] and particle physics [ju2020graph, ju2021performance, moreno2020jedi]. GNNs adapt their graph-based structure to an input graph with an iterative process of information aggregation across nodes to learn the complex dependencies of a system.
JEDI-net is a fully connected graph neural network based on the interaction network [battaglia2016interaction] architecture. The interaction network is a powerful graph-based framework for reasoning about objects and relations in complex and dynamic systems. The input to an interaction network is a graph of objects and the relations between them. It learns to capture complex interactions that can be used to predict future states and abstract physical properties. The acceleration of interaction-network based GNNs on FPGAs has also been studied for charged particle tracking at the CERN LHC [elabd2021graph].
The GNN of JEDI-net can be represented as a graph, with the nodes corresponding to physics particles and the edges to the relations between them. It is a fully connected GNN. The input node matrix I has dimension P × N_O: each of its columns is a P-length feature vector of one node, and N_O is the number of particles in a jet. The relations form a triplet in the interaction-network formalism; its receiving and sending components, R_R and R_S, are binary matrices which index the receiver and sender nodes, respectively. Each column of R_R and R_S is a one-hot vector: a column of R_R indicates the receiver node's index of the corresponding edge, and R_S indicates the sender similarly. The number of edges is N_E = N_O(N_O − 1), since the input graph is fully connected with directional edges. Fig. 1 shows the dataflow of JEDI-net, and Fig. 2 shows R_R and R_S for an example. To illustrate the idea, the number of particles (nodes) is 4 in this example, but note that a real case has more particles. The input matrix I is multiplied by the R_R and R_S matrices, and the results are then concatenated to form a matrix B of dimension 2P × N_E. Each column of B represents an edge, i.e. a particle-to-particle interaction, and its elements are the features of the sending and receiving nodes for that edge. A trainable deep neural network (DNN) function f_R is then applied to each column of B and produces a matrix E of dimension D_E × N_E. Then E·R_R^T is computed in MMM3 (see Fig. 1) to gather the cumulative effects of the interactions received by a given node. Thus, the cumulative effects of the interactions at a given node are obtained by summing the hidden features over the incoming edges, which is implemented by computing Ē = E·R_R^T in the MMM3 unit. I and Ē are then concatenated to form the matrix C, acting as a shortcut connection. Each column of C represents a constituent of the jet, expressed as a (P + D_E)-dimensional feature vector containing the P input features and D_E hidden features; it represents the combined effect of all the interactions between particles. A second trainable function f_O builds a post-interaction representation of each jet constituent: it is applied to each column of C to produce the matrix O, of dimension D_O × N_O. A final trainable function f_C returns the probability for that jet to belong to each of the five categories.
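As a concrete illustration, the dataflow above can be modelled end-to-end in NumPy with toy dimensions. This is only a sketch: the sizes, the placeholder column-wise functions standing in for f_R, f_O and f_C, and the final column-sum aggregation feeding f_C are assumptions for illustration, not the trained model.

```python
import numpy as np

P, N_O, D_E, D_O = 3, 4, 5, 6             # toy dimensions, not the paper's
edges = [(r, s) for r in range(N_O) for s in range(N_O) if r != s]
N_E = len(edges)                          # N_O * (N_O - 1) directed edges

RR = np.zeros((N_O, N_E)); RS = np.zeros((N_O, N_E))
for k, (r, s) in enumerate(edges):
    RR[r, k] = 1.0                        # one-hot receiver column
    RS[s, k] = 1.0                        # one-hot sender column

I = np.random.rand(P, N_O)                # input features, one column per node
f_R = lambda col: np.tanh(np.ones(D_E) * col.sum())    # placeholder "MLPs"
f_O = lambda col: np.tanh(np.ones(D_O) * col.mean())
f_C = lambda vec: np.full(5, 0.2)         # five jet-category probabilities

B = np.vstack([I @ RR, I @ RS])           # 2P x N_E, one column per edge
E = np.stack([f_R(B[:, k]) for k in range(N_E)], axis=1)  # D_E x N_E
Ebar = E @ RR.T                           # MMM3: sum hidden features per node
C = np.vstack([I, Ebar])                  # (P + D_E) x N_O shortcut concat
O = np.stack([f_O(C[:, n]) for n in range(N_O)], axis=1)  # D_O x N_O
probs = f_C(O.sum(axis=1))                # simplified aggregation, classify

assert B.shape == (2 * P, N_E) and C.shape == (P + D_E, N_O)
```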
f_R, f_O and f_C are implemented as Multi-Layer Perceptrons (MLPs).

This section introduces several optimizations to accelerate interaction-network based fully connected GNNs. We follow "divide, conquer and fusion", an optimization strategy designed to achieve low latency for GNN inference. In the "divide" and "conquer" steps, we split GNNs into multiple sub-layers and perform dedicated optimizations, such as strength reduction based on structured sparsity. Then, in the "fusion" step, multiple sub-layers are fused to remove boundaries and buffers between different pipeline stages, resulting in low latency and high hardware efficiency. Finally, a co-design approach is presented.
The design of the GNN in [moreno2020jedi] adopts dense matrix-matrix multiplication (MMM) operations to calculate the products of the input feature matrix and the adjacency matrices, i.e., I·R_R and I·R_S, which is costly and time-consuming. However, we observe that most of the computational operations in these MMMs of interaction-network based fully connected GNNs are unnecessary if the sparse structure and binary nature of the adjacency matrices are exploited. With an interaction network architecture and a fully connected graph, each column of the receiving and sending matrices is one-hot. Besides, both matrices are binary and have a fixed pattern, as shown in Fig. 2: an element of R_R is set to 1 when the node receives the edge and is 0 otherwise, and similarly an element of R_S is set to 1 when the node sends the edge and is 0 otherwise. Because of the fixed patterns and the binary values, full MMMs are unnecessary to calculate I·R_R and I·R_S. First, the multiplication operations are unnecessary because the R_R and R_S matrices only have binary values. Second, the accumulation (addition) operations can be avoided because each column of R_R and R_S is one-hot, reducing the inner reduction loop from N_O iterations to a single access. Hence, only load and store operations are needed to calculate the MMM of I·R_R as well as I·R_S. The detailed pseudocode of the code transformation with strength reduction for MMM1/2 is illustrated in Algorithm 1.
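The transformation can be checked numerically. The NumPy sketch below (toy sizes; the edge ordering is an assumption for illustration, the actual pattern is the one in Fig. 2) verifies that the strength-reduced gather produces exactly the same result as the dense MMMs:

```python
import numpy as np

N_O, P = 4, 3                      # 4 particles, 3 features (toy sizes)
# Fully connected directed graph: one edge per ordered pair (r, s), r != s
edges = [(r, s) for r in range(N_O) for s in range(N_O) if r != s]
N_E = len(edges)                   # N_O * (N_O - 1) = 12

I = np.random.rand(P, N_O)         # input feature matrix

# Dense formulation: build the binary one-hot adjacency matrices explicitly
RR = np.zeros((N_O, N_E)); RS = np.zeros((N_O, N_E))
for k, (r, s) in enumerate(edges):
    RR[r, k] = 1.0                 # column k is one-hot at the receiver index
    RS[s, k] = 1.0                 # column k is one-hot at the sender index
dense_IRR = I @ RR                 # costly MMMs the transformation removes
dense_IRS = I @ RS

# Strength-reduced formulation: the sparsity pattern is fused into the loop
# index, so MMM1/2 reduce to pure load/store (gather) operations
gather_IRR = np.empty((P, N_E)); gather_IRS = np.empty((P, N_E))
for k, (r, s) in enumerate(edges):
    gather_IRR[:, k] = I[:, r]     # copy the receiver node's feature column
    gather_IRS[:, k] = I[:, s]     # copy the sender node's feature column

assert np.allclose(dense_IRR, gather_IRR)
assert np.allclose(dense_IRS, gather_IRS)
```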
One of the big challenges for GNNs is the irregular memory access of the sparse adjacency matrices. Our approach avoids such accesses, since the structured sparsity patterns can be statically fused into the loop index, which not only saves memory bandwidth but also avoids the irregular memory accesses of these adjacency matrices. The access of the input feature matrix I for I·R_R and I·R_S is sequential in our design, since we adopt a column-major data layout, which will be discussed in the next subsection. Our proposed approach not only eliminates the expensive MMM operations, increasing computational efficiency, but also avoids the irregular accesses of the adjacency matrices, which largely reduces the design latency.
Although this work targets a fully connected GNN based on an interaction network architecture, the proposed technique of custom code transformation with strength reduction, exploiting the sparse structures and binary features, can be adapted to optimize other GNNs with hardware-friendly structured sparsity for low-latency designs. A structured sparsity pattern can also be user-defined when designing the architecture of a neural network at the very beginning of a project to achieve high hardware efficiency, e.g., a butterfly sparsity pattern [fan2022adaptable].
The intermediate results in the layer-wise GNN hardware architecture are captured using two-dimensional (2D) arrays representing matrices, as shown in Fig. 1. Row-major and column-major order (Fig. 3) are two data layout methods which are critical for correctly passing arrays between hardware units. More importantly, the layout is also critical for hardware performance, which is the focus of this work. The difference between the two orders lies in which elements of an array are contiguous in memory, and an appropriate data layout has a significant impact on hardware performance. When mapping 2D data onto a one-dimensional (1D) structure (i.e., memory) using a high-level synthesis tool (e.g., Xilinx Vivado/Vitis HLS), the default data layout is usually row-major order, following the C language legacy.
However, row-major order for GNN-based JEDI-net leads to poor spatial locality and hinders parallelism, since the functions f_R and f_O are applied to each column of their input matrices, as shown in Fig. 1. The concatenation of two input matrices into one output matrix also works on columns, such as concatenating I and Ē into the matrix C. With a row-major data layout, the input data of these functions does not sit contiguously in memory, so it is very time-consuming to fetch all the elements of a column. However, if the data is represented in column-major order, in which consecutive elements of a column reside next to each other, iterating over columns becomes easy because the data is accessed sequentially. Thus, this work proposes column-major order to increase data spatial locality, accelerating the layer-wise GNNs efficiently and leading to good hardware performance.
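A small NumPy sketch makes the difference concrete: under row-major order a column read is a strided access, while under column-major order it is one contiguous burst.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)            # a 2x3 matrix with values 0..5
row_major = A.flatten(order='C')          # [0 1 2 3 4 5]: rows contiguous
col_major = A.flatten(order='F')          # [0 3 1 4 2 5]: columns contiguous

j = 1                                     # fetch column j under each layout
col_from_row_major = row_major[j::3]      # strided access (stride = n_cols)
col_from_col_major = col_major[2*j:2*j+2] # one contiguous burst of 2 elements

assert (col_from_row_major == A[:, j]).all()
assert (col_from_col_major == A[:, j]).all()
```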
To multiply a matrix by another matrix, we usually take the inner product of rows from the first matrix and columns from the second matrix. For example, to compute Ē = E·R_R^T in the MMM3 unit, each entry of Ē requires a whole row of the matrix E and a whole column of R_R^T. However, in GNN-based JEDI-net, the input matrix E of the MMM3 unit comes from the output of f_R, which produces its results column by column, as shown in Fig. 1. With an inner-product based MMM, this unit must wait a long time until a whole row of E is ready, resulting in long latency. To solve this issue, this work proposes an outer-product based matrix multiplication for MMM3 to process E·R_R^T. Instead of using a whole row of E, a whole column of E is multiplied by one element of R_R^T to generate a partial result of one column of the result matrix, as shown in Fig. 4. The partial results are then accumulated to form that column of the result matrix. Since E is generated column by column, MMM3 can start as soon as the first column of E is ready. To efficiently support the outer-product based MMM, the column-major data layout is used for the intermediate results (i.e., the 2D matrix arrays), as discussed in Section 3.2. Thus, the input elements can be grouped as a vector (i.e., a whole column) and processed efficiently with high parallelism, because the data can be fetched sequentially. This largely reduces the waiting time of the MMM3 unit and reduces the design latency.
In addition, the code transformation with strength reduction described in Section 3.1 can be adopted to enhance the proposed outer-product based MMM, exploiting the sparsity pattern of the adjacency matrix R_R as well as its binary nature. It avoids costly multiplications and involves only load and store operations with a small number of additions for this MMM unit. The detailed pseudocode of the strength-reduction enhanced outer-product based MMM3 is illustrated in Algorithm 2. The matrix R_R is binary and each of its columns is one-hot, as introduced in Subsection 2.2. Thus, the multiplication operations are unnecessary since R_R is binary, and only a small fraction of the additions of a dense MMM is required. Besides, since the adjacency matrix has a structured pattern, it can be fused into the loop index to avoid irregular memory access.
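The strength-reduction enhanced outer-product MMM3 can be modelled as follows; a NumPy sketch (toy sizes, illustrative edge ordering) checks that consuming E one column at a time and accumulating into the receiver's column matches the dense E·R_R^T:

```python
import numpy as np

N_O, D_E = 4, 3
edges = [(r, s) for r in range(N_O) for s in range(N_O) if r != s]
N_E = len(edges)

E = np.random.rand(D_E, N_E)       # output of f_R, produced column by column
RR = np.zeros((N_O, N_E))
for k, (r, _) in enumerate(edges):
    RR[r, k] = 1.0                 # one-hot receiver column

dense = E @ RR.T                   # inner-product MMM3: needs whole rows of E

# Outer-product with strength reduction: row k of RR.T is one-hot, so the
# k-th column of E simply accumulates into column r(k) of the result.
Ebar = np.zeros((D_E, N_O))
for k, (r, _) in enumerate(edges): # can start as soon as E[:, k] is ready
    Ebar[:, r] += E[:, k]          # one vector addition, no multiplications

assert np.allclose(dense, Ebar)
```

Note that with a structured edge ordering, the edges of a given receiver are consecutive, so the accumulations write the resultant columns sequentially, as discussed next.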
One limitation of outer-product based MMM is that it creates a full-size resultant matrix holding only partial results until the end, leading to a large memory write bandwidth requirement. AWB-GCN [geng2020awb] presents a column-wise product architecture, a variant of outer-product MMM, which involves one column of the first matrix but only one element of the second matrix, so that the output only touches one resultant vector. In our design, each row of R_R^T is one-hot, which means that only one vector of the resultant matrix is valid at a time. Moreover, we further exploit the structured pattern of R_R so that the partial results update the corresponding vectors of the resultant matrix sequentially, which avoids irregular memory writes and achieves low latency. Furthermore, because of the structured pattern of R_R, the proposed approach only needs to read each element of E once, which also reduces the memory access bandwidth. Fig. 4 shows the design reading a contiguous group of columns of E to generate the second vector of the resultant matrix.
The latency could be further reduced by involving multiple columns of the input matrix, but this requires the preceding hardware units to generate the corresponding number of columns in each cycle, costing more hardware resources. Our proposed methods not only eliminate the expensive matrix multiplication operations and reduce the iterations, but also remove the input of the adjacency matrices to improve memory access efficiency, which reduces the design latency and increases the throughput as well as the hardware efficiency.
There are many components in a GNN model. To achieve high throughput and low latency, the design could be unrolled as much as possible, using many hardware resources. However, a naive implementation often results in an unbalanced initiation interval (II) across the coarse-grained pipeline, which causes hardware inefficiency and low system throughput. The design II is determined by the largest II among all the units on the datapath. Generally, it is unnecessary to fully unroll every unit in order to achieve the lowest II: hardware resources can be saved from the units that do not require full unrolling and reallocated to the bottleneck unit that dominates the whole design, reducing the design II as well as the latency. Partitioning FPGA resources to enhance throughput and latency in a layer-wise architecture has been studied for CNNs [shen2017maximizing, zhang2020dnnexplorer, zhang2018dnnbuilder, gong2018maloc] and RNNs [que2021accelerating], but there is little work focusing on GNNs. This work balances the IIs of the sub-layer units in GNNs by partitioning the FPGA resources properly. A design space exploration is also performed to find appropriate parallelism parameters that achieve an optimal trade-off between II and hardware resources.
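The balancing argument can be made concrete with a toy sketch (the II numbers are hypothetical): the design II follows the worst stage, so relaxing over-provisioned stages and spending the freed resources on the bottleneck lowers it.

```python
# The design II of a coarse-grained pipeline is the largest stage II
def design_ii(stage_iis):
    return max(stage_iis)

# Naive allocation: cheap stages fully unrolled (II 1), bottleneck at II 12
assert design_ii([1, 1, 12, 1]) == 12

# Rebalanced: relax the cheap stages to II 4 to free resources and spend
# them on the bottleneck, bringing it down to II 4 as well
assert design_ii([4, 4, 4, 4]) == 4
```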
In the previous subsections, we "divide" the GNN model into several sub-layers and "conquer" these sub-layers with tailor-made optimizations. Our previous designs [que2022aicas, que2022optimizing] run these sub-layers in a coarse-grained pipeline, achieving a low initiation interval. Nevertheless, the end-to-end latency is still higher than the 1 μs required for the CERN HL-LHC. This work devises a succeeding step, named "fusion", which combines several optimized sub-layers into a single unit and runs it in a fine-grained pipeline to reduce the overall design latency.
For GNNs, many operations are performed repeatedly on graph nodes or edges. This means that many of the processing units (loops) may share the same loop bound, i.e., the number of nodes (or edges) in a graph, which makes many sequential units (loops) mergeable. In particular, in our design, the matrix multiplication units for MMM1/2/3 are largely simplified after strength reduction. Running them as individual stages in a coarse-grained pipeline not only wastes hardware resources but also incurs large latency because of the extra acknowledgment mechanism, including the ping-pong buffers between stages of the coarse-grained pipeline. Since several sequential loops share the same loop bound, we can simply merge these loops into one. For example, the MMM1/2, Concat1 and DNN1 units all have a loop bound of N_E, the number of edges, since all of them perform calculations on the edges of the graph. They can be merged into one big loop, which can then be pipelined to reduce the overall latency, since merging loops allows the logic within the loops to be optimized together. Note that fusing compute units into a single loop is also data-independent across iterations. One drawback is that fusion could slightly increase the overall initiation interval (II), which is the largest II among all the stages in a coarse-grained pipeline, since we merge some of these stages into one big stage. Nevertheless, it is worthwhile, since having fewer stages and fewer boundaries after fusion results in lower latency.
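The fusion step can be sketched functionally (the stage names in the comments are illustrative): three separate loops over the same edge bound, each with an intermediate buffer, collapse into one loop with no inter-stage buffers.

```python
# Before fusion: three coarse-grained stages over the N_E edge columns,
# each with a full intermediate buffer (and handshake) between stages
def staged(cols, f, g, h):
    a = [f(c) for c in cols]          # e.g. the simplified MMM1/2 gather
    b = [g(c) for c in a]             # e.g. Concat1
    return [h(c) for c in b]          # e.g. DNN1 (f_R)

# After fusion: one loop with the same bound, one fine-grained pipeline,
# no inter-stage ping-pong buffers
def fused(cols, f, g, h):
    return [h(g(f(c))) for c in cols]

cols = list(range(12))                # 12 toy edge columns
f, g, h = (lambda x: x + 1), (lambda x: 2 * x), (lambda x: x - 3)
assert staged(cols, f, g, h) == fused(cols, f, g, h)
```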
One issue with sub-layer fusion is that it may create an imperfect loop, especially for GNNs, in which some units are applied to edges, with a loop bound based on the number of edges, while other units are applied to nodes, with a smaller bound based on the number of nodes. Algorithm 3 shows an example loop at lines 1-8 from the GNNs of JEDI-net. The code_body_A (e.g., the Concat1 and DNN1 units) is iterated with a bound equal to the number of edges, N_E = N_O(N_O − 1), while the code_body_B (e.g., the part of MMM3 from lines 10 to 12 of Algorithm 2, plus DNN2) is only iterated N_O times. After the fusion, it becomes an imperfect loop with the code_body_B outside the inner loop. To pipeline the code_body_B, one could set the "PIPELINE" #pragma at line 2 in the outer loop, but this automatically unrolls all loops in the hierarchy below. In this case, the inner loop at line 3 is completely unrolled, resulting in many copies of code_body_A and consuming many hardware resources. If the required hardware resources exceed the given budget, one has to restrict the number of copies of the inner loop body. For example, if the total budget can only support n instances of code_body_A, this loop can only be unrolled by a factor of n. However, using an unroll #pragma with a factor of n on the inner loop will not reduce the number of copies, since the "PIPELINE" at line 2 has priority and forces the inner loop to be fully unrolled. Thus, one has to move the "PIPELINE" #pragma to line 4 to pipeline only the inner loop without pipelining the code_body_B, leading to a poor design and large latency.
To solve this issue, this work proposes another code transformation, which transforms an imperfect loop into a perfect one using a finite-state machine (FSM) based structure with a target II (II_target), as shown in lines 9 to 34 of Algorithm 3. If the number of instances of code_body_A that can be deployed under the given resource budget is n, then II_target equals ⌈(N_O − 1)/n⌉. The loop can now run with an II of one, while the equivalent II is the target II. With this code transformation, we can deploy as many instances as possible within the hardware budget, improving the design performance after fusion and reducing the overall design latency.
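The effect of this FSM-based transformation can be modelled in software (a behavioral sketch, not Algorithm 3 itself): the flattened perfect loop with an explicit state counter executes the bodies in exactly the same order as the original imperfect loop.

```python
# Imperfect loop (before): per node, iterate its incident edges (body A),
# then run body B once for that node
def imperfect(N_O, body_A, body_B):
    for o in range(N_O):
        for e in range(N_O - 1):
            body_A(o, e)
        body_B(o)

# FSM-based perfect loop (after): one flattened loop; n copies of body A
# run per iteration, and body B fires when a node's edges are finished
def fsm_perfect(N_O, body_A, body_B, n):
    II_target = -(-(N_O - 1) // n)          # ceil((N_O - 1) / n) per node
    o, step = 0, 0                          # FSM state
    for _ in range(N_O * II_target):        # perfect loop: II of 1 in HLS
        for i in range(n):                  # n unrolled copies of body A
            e = step * n + i
            if e < N_O - 1:
                body_A(o, e)
        step += 1
        if step == II_target:               # node finished: fire body B
            body_B(o)
            o, step = o + 1, 0

trace1, trace2 = [], []
imperfect(4, lambda o, e: trace1.append(('A', o, e)),
             lambda o: trace1.append(('B', o)))
fsm_perfect(4, lambda o, e: trace2.append(('A', o, e)),
               lambda o: trace2.append(('B', o)), n=2)
assert trace1 == trace2                     # identical execution order
```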
The trade-off between latency, throughput and FPGA resource usage is determined by the parallelization of the design. This work exploits a two-level parallelism scheme. First, we adopt the reuse factor [duarte2018fast] to fine-tune the parallelism of the MLPs used in the GNNs. The reuse factor is the number of times a multiplier is used in the computation of a module. After the manual code transformation with strength reduction, the matrix multiplications require no multipliers; hence, only the three MLPs (f_R, f_O, f_C) consume multipliers in the JEDI-net design. We apply the reuse factors R_1, R_2 and R_3 to these three MLPs, respectively. This work always tries to achieve extremely low latency by using as many hardware resources as possible, such as unrolling all the layers in the MLPs by adopting a reuse factor of 1.
Second, this work deploys multiple copies of the f_R unit to further increase the design parallelism, and R_1 is always set to 1 for low latency. The f_R function is applied to each column of the B matrix, as mentioned in Section 2.2, resulting in a significant number of iterations, since there are N_E columns in B. Fully unrolling all the iterations requires thousands of hardware copies of f_R, leading to a hardware resource consumption that easily exceeds a given FPGA. Hence, this work partially unrolls it by a factor F_pa, resulting in F_pa copies of the f_R hardware unit, each processing one column of the B matrix.
To perform design space exploration and find the best trade-off between hardware resources and design throughput, a simple resource model for JEDI-net is developed. Since DSPs are the most critical resource for FPGA-based neural networks, our resource model focuses on DSPs. The DSP resource model is shown in equation (1):

DSP_FC(S_in, S_out, R) = ⌈(S_in × S_out) / R⌉
N_DSP = F_pa × Σ_{l∈f_R} DSP_FC(S_in^l, S_out^l, R_1) + Σ_{l∈f_O} DSP_FC(S_in^l, S_out^l, R_2) + Σ_{l∈f_C} DSP_FC(S_in^l, S_out^l, R_3)    (1)

where DSP_FC is the number of DSPs used for a single fully connected (FC) layer with input size S_in and output size S_out under reuse factor R, and l indexes the FC layers of the three MLPs, f_R, f_O and f_C, which consist of only FC layers, as shown in Fig. 1. For simplicity, this work utilizes a unified bitwidth (Q12.12, i.e., 24 total bits with 12 fractional bits) for most of the design datapath. However, the values of the MLP weights lie within the range [0, 1), which results in around 13 effective bits. Hence, one of the two inputs of each multiplier only has around 13 bits, with the other bits removed by the Vivado HLS tool, so one multiplier can be fully implemented by one Xilinx DSP. In addition, the accumulators use Q16.16, and the input features of the design can go down to Q0.8 with batch normalization. With our proposed approach, there are no multiplications in the MMM1/2/3 units; thus, only the MLP units require multipliers implemented using DSPs. The total number of DSPs used in JEDI-net is given by equation (1) and should be smaller than the total number of DSPs on the targeted FPGA. We have managed to fuse most of the sub-layers, resulting in a lower-latency architecture compared with the one in our previous work [que2022optimizing]. The latency model of JEDI-net based on this low-latency hardware architecture is illustrated in equation (2).
Latency = N_O × II_LF × II_mult + D_pipe + D_extra    (2)

where II_mult is the II of the multiplier, which is one cycle in this work, and II_LF is the II of the fused loop, which depends on the maximum number of copies of the f_R unit that can be deployed on a given FPGA with limited hardware resources, as well as on the reuse factors R_2 and R_3 for f_O and f_C. Note that R_1 is always set to 1, since f_R is the bottleneck of the design. D_pipe is the pipeline depth of the model, while D_extra is the depth of the logic outside the major fused loop; for simplicity, both are constants determined by the design architecture.
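The two models can be sketched together as follows. All layer sizes, depths, the DSP budget and the clock period below are hypothetical placeholders, not the searched configuration of the paper:

```python
from math import ceil

def dsps_fc(s_in, s_out, reuse):
    # DSPs of one FC layer: s_in*s_out multipliers, each shared 'reuse' times
    return ceil(s_in * s_out / reuse)

def dsps_mlp(sizes, reuse):
    # sizes = [in, hidden..., out]; sum over consecutive FC layers
    return sum(dsps_fc(a, b, reuse) for a, b in zip(sizes, sizes[1:]))

def latency_cycles(N_O, F_pa, D_pipe, D_extra):
    II_LF = ceil((N_O - 1) / F_pa)       # fused-loop II with F_pa f_R copies
    return N_O * II_LF + D_pipe + D_extra

F_pa = 10
n_dsp = (F_pa * dsps_mlp([8, 16, 16, 4], reuse=1)   # f_R copies, R1 = 1
         + dsps_mlp([12, 16, 6], reuse=2)           # f_O
         + dsps_mlp([6, 16, 5], reuse=2))           # f_C
cycles = latency_cycles(N_O=30, F_pa=F_pa, D_pipe=40, D_extra=10)

assert n_dsp <= 6840                     # fits a large-FPGA DSP budget
assert cycles * 5 < 1000                 # sub-microsecond at 200 MHz (5 ns)
```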
The work in [moreno2020jedi] focuses only on algorithmic optimizations for neural networks. This often leads to a deployment issue where models optimized for high accuracy cannot be deployed on FPGAs due to the large model size and limited hardware resources. As another line of research, our previous studies [que2022aicas, que2022optimizing] focus on hardware-aware optimizations. However, without optimizing the algorithm, our designs still cannot achieve an end-to-end latency of less than 1 μs. Therefore, existing GNN designs for particle identification primarily focus on either algorithm or hardware, leading to sub-optimal designs when deploying GNNs on FPGAs. To address this issue, this work presents a GNN-specific co-design approach to optimize the algorithm and hardware simultaneously. Our approach can explore the trade-off under both user-defined algorithmic and hardware constraints, e.g., the highest accuracy with a latency requirement of less than 1 μs on a given FPGA.
Generally in GNNs, since the number of edges is usually larger than the number of nodes in a graph, the operations involving graph edges need to run more iterations than those involving graph nodes. In our design, the DNN1 unit is required to run N_E times, which is much larger than DNN2, which runs only N_O times, and DNN3, which runs only once per inference. Taking this into account, this work sets a small size for the hidden layers of DNN1 while keeping or increasing the sizes of the other two MLPs, reducing latency while maintaining model accuracy. [moreno2020jedi]
has conducted a wide algorithmic search over various model parameters, such as the number of output neurons, the sizes of the hidden layers in the 3-layer MLPs, the activation functions of each layer in these MLPs, and the optimizer used for training. However, it sets the same size for all three MLPs, which is not latency friendly, as discussed above. This work reuses the model parameters searched in
[moreno2020jedi] but rebalances the sizes of the different MLPs used in the graph network to explore the trade-off between algorithmic and hardware performance. Since this work targets low latency on a resource-constrained hardware device with a hard latency requirement, we can further optimize our design search space and exploration flow. Generally, training a GNN model takes a long time, but since we have a hard latency requirement, it is unnecessary to train a model whose latency is known to be too large for the requirement. We therefore define a latency drop factor and skip the training of a model if its estimated latency exceeds the factor times the latency requirement, which saves a large amount of model training time. We set the factor larger than 1 to loosen the main constraint in case the exploration misses some interesting cases.
The factor could also be set to a very large number for full coverage. This work estimates the latency of each design candidate, across GNN configurations with various hidden sizes and numbers of MLP layers, using equations (1) and (2). Training the dropped designs is unnecessary because their latency rules out deployment in the L1T system. This leads to a large reduction in GPU/CPU training hours to find the optimal design. Although this work prioritizes minimizing latency on a resource-constrained device, the optimization mode of our approach can easily be switched to other user-defined metrics with other constraints. Once we find the optimal design based on the user-defined metric, the model parameters as well as the weights and biases are generated, and then a low-latency FPGA design is generated using our HLS-based templates, which improves design productivity.
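A minimal sketch of this pruning step, with illustrative names (Candidate, worth_training, alpha for the latency drop factor) that are ours rather than from the released code:

```cpp
#include <vector>

// Skip training any candidate whose analytically estimated latency
// already exceeds alpha times the latency requirement.
struct Candidate {
    int nl;                 // number of hidden layers in DNN1
    int hidden;             // hidden layer size in DNN1
    double est_latency_us;  // estimate from equations (1) and (2)
};

std::vector<Candidate> worth_training(const std::vector<Candidate>& all,
                                      double latency_req_us, double alpha) {
    std::vector<Candidate> keep;
    for (const Candidate& c : all)
        if (c.est_latency_us <= alpha * latency_req_us)
            keep.push_back(c);   // only these proceed to (slow) training
    return keep;
}
```

With alpha = 2 and a 1 µs requirement, a candidate estimated at 1.5 µs is still trained, while one at 2.5 µs is dropped without spending any GPU hours on it.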
The CMS experiment at CERN has two levels of real-time triggering. Fig. 5(a) shows the data acquisition system including the Level-1 trigger (L1T). A subset of the data from the detector is first sent to the L1T for processing. If an accept decision is made, the remaining data is read out for further processing; otherwise the data is dropped. The L1T has a total latency budget of 12.5 µs in the upgraded High-Luminosity Large Hadron Collider (HL-LHC) [CERN2020L1T]. It consists of many FPGA-based subsystems without CPUs or PCIe in the datapath, and particle identification is one of them. The particle identification system accepts streamed data arriving on a number of parallel optical fibres running at 25 Gb/s, as shown in Fig. 5(b). This work splits the whole GNN into several sub-layers and adopts a layer-wise hardware architecture [blott2018finn, duarte2018fast, zhang2020dnnexplorer] to map all the sub-layers on-chip, which is flexible and takes full advantage of the customizability of FPGAs. In our previous work, different sub-layers run as a coarse-grained pipeline to increase design throughput. This work instead fuses as many sub-layers as possible to convert several coarse-grained stages into a single stage with a fine-grained pipeline inside, resulting in low end-to-end latency.
This section presents the evaluation results of the GNN-based JEDI-net on FPGAs, demonstrating the scalability of the proposed optimizations for GNNs.
This study focuses on JEDI-net 30p [moreno2020jedi] models targeting a dataset of 30 particles [jet_dataset_30p] and JEDI-net 50p models targeting a 50-particle dataset [jet_dataset_50p]. To study the performance and limitations of the proposed optimizations and hardware architecture, the design is implemented using Xilinx Vivado HLS 19.2 on a Xilinx Alveo U250 board for evaluation and comparison with other implementations. It runs at 200 MHz, so each cycle is 5 ns. FPGA power consumption is reported by the Xilinx Vivado tool after place and route. The datapath uses Q12.12 (one sign bit, 11 integer bits and 12 fractional bits), while the accumulator uses Q16.16 to preserve accuracy. The design achieves the same accuracy as the floating-point model.
To find a fixed-point representation that causes no reduction in the physics performance of the algorithm, we scan fixed-point precisions with total bit widths from 16 to 26 and integer bit widths from 6 to 13, including the sign bit, as shown in Fig. 6. With 24 total bits and 12 integer bits, the fixed-point model effectively achieves the same accuracy as the FP32 floating-point counterpart. In addition, JEDI-net achieves much higher accuracy than the previous DNN-based work [duarte2018fast, coelho2021automatic], which reaches an accuracy of around 75%. We also evaluate the Receiver Operating Characteristic (ROC) curves and the area under the curve (AUC) for the 5 jet classes, including gluon, light quark, W boson, Z boson and top quark, as shown in Fig.
7. The AUC of the light-quark tagger (blue lines) using the 24-bit fixed-point representation appears different from the floating-point one, but note the logarithmic scale on the x-axis of Fig. 7; the AUC loss of the q tagger is less than 1%.

Table I: Resource utilization on the Xilinx Alveo U250 FPGA.

Task                                   Metric        LUT      FF       BRAM    DSP
Available                                            1728k    3456k    5376    12288
JEDI-net 30P (J2 [que2022optimizing])  Used          1158k    246k     1392    11504
                                       Utiliz. [%]   67       7.1      25      93
JEDI-net 30P (J3, w/ fusion)           Used          734k     137k     16      9013
                                       Utiliz. [%]   42       4.0      0.3     73
JEDI-net 30P (J4, OptLatn)             Used          865k     138k     37      7267
                                       Utiliz. [%]   50       4.0      0.7     59
JEDI-net 30P (J5, OptAcc)              Used          911k     153k     37      9833
                                       Utiliz. [%]   53       4.4      0.7     80
JEDI-net 50P (U4, OptLatn)             Used          855k     201k     25      8945
                                       Utiliz. [%]   49       5.8      0.5     73
JEDI-net 50P (U5, OptAcc)              Used          815k     189k     37      8986
                                       Utiliz. [%]   47       5.5      0.7     73
Table I shows the resource utilization of our designs on the U250 FPGA with different parallelism parameters. For the JEDI-net 30p model, the number of input particles N_O is 30 with a feature size of 16, as defined in the dataset. For JEDI-net 50p, N_O is 50 with the same feature size. The number of edges N_E increases dramatically with N_O: it is N_O(N_O − 1), which equals 870 when N_O = 30 and 2450 when N_O = 50. The two models also have different MLP sizes. Our new low latency design with fusion consumes fewer hardware resources than the previous design on the same-sized JEDI-net 30p model, as shown in Table I. Both J2 and J3 target the same-sized JEDI-net model; their details are in Table II. The utilization of both BRAM and LUTs has been reduced in the new design based on the proposed low latency hardware architecture, resulting in a more hardware-efficient design.
Code transformation using strength reduction is applied to optimize the matrix multiplications, transforming multiplications into load and store operations only, sometimes with a small number of additions. Fig. 8(a) shows that all the multiplications and additions in the MMM1/2 units are removed, and Fig. 8(b) shows that only 6,960 (3.3%) of the additions of the original implementation [moreno2020jedi] are required for the MMM3 unit in JEDI-net 30p models. Besides, there is a 96.7% reduction in the number of iterations across all the MMM units, which largely reduces the design latency. Fig. 8(c) and (d) show the operation reduction in JEDI-net 50p models. Moreover, the run time of C synthesis using Vivado HLS is reduced by over 3 times compared with the HLS design without strength reduction.
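The essence of the transformation can be sketched in plain C++ (the edge enumeration order and function names are our assumptions): for a fully connected graph, each column of the binary receiving matrix R_R is one-hot, so the aggregation MMM3 degenerates into indexed accumulation, and the marshalling MMM1/2 into pure gathers.

```cpp
#include <vector>
using Mat = std::vector<std::vector<double>>;

// Receiver node of edge k when the N_O(N_O - 1) ordered edges of a
// fully connected graph are enumerated row-major (assumed ordering).
int receiver(int k, int n_o) { return k / (n_o - 1); }

// Strength-reduced MMM3: agg = E * R_R^T with no multiplications,
// since R_R[r][k] is 1 exactly when r == receiver(k).
Mat aggregate(const Mat& E, int n_o) {
    int d_e = (int)E.size();
    int n_e = n_o * (n_o - 1);
    Mat agg(d_e, std::vector<double>(n_o, 0.0));
    for (int k = 0; k < n_e; ++k)
        for (int d = 0; d < d_e; ++d)
            agg[d][receiver(k, n_o)] += E[d][k];  // load, add, store only
    return agg;
}
```

This costs N_E · D_E additions, i.e. 6,960 for N_E = 870 with an edge feature size of 8, consistent with Fig. 8(b), instead of the full dense matrix-matrix product.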
To achieve low latency and high throughput, each of the layers in the DNN1, DNN2 and DNN3 units is first fully unrolled. Besides, the proposed strength-reduction-enhanced MMMs are applied with a column-major data layout. The latency of our designs drops from a few milliseconds to a few microseconds for both the JEDI-net 30p and 50p models, as shown in Fig. 9. Designs J1 and U1 use the initial parallelism parameters for JEDI-net 30p and 50p respectively, as shown in Table II.
To further improve the latency and II, we increase the design parallelism by deploying multiple DNN1 hardware units, since this unit must be iterated N_E times. The latency of design J2 is reduced from 12.56 µs to 1.91 µs when the number of deployed DNN1 units, N_P, increases from 1 to 13 for the JEDI-net 30p model. The latency improvement is achieved by trading hardware resources for performance (low latency). But naively increasing the number of DNN1 copies does not lead to an optimal design. Designs U1 and U2 for the larger model, JEDI-net 50p, show that the latency decreases from 32.60 µs to 12.47 µs when N_P increases from 1 to 3, as shown in Table II. However, the required number of DSPs then exceeds the total DSPs on this FPGA. To solve this issue, we reallocate some DSP blocks from DNN2 and DNN3 to DNN1 by increasing the reuse factors of DNN2 and DNN3 while deploying multiple DNN1 units to balance the II of the whole design, leading to a reduced II and latency. A design space exploration finds the appropriate values of the parallelism parameters to be (4, 4), resulting in design U3, which has both a better II and a better latency than design U2.
Table II: Configurations and performance of the explored designs. (J1, J2 and U1-U3 are from [que2022optimizing]; J3-J5, U4 and U5 are from this work.)

Design                 J1      J2      J3      J4      J5      U1      U2      U3      U4      U5
NN Model               ---------- JEDI-net 30p ----------      ---------- JEDI-net 50p ----------
DNN1 (NL, Size)        (3,20)  (3,20)  (3,20)  (1,8)   (2,32)  (3,50)  (3,50)  (3,50)  (2,8)   (2,8)
DNN2/DNN3 (NL, Size)   (3,20)  (3,20)  (3,20)  (3,48)  (3,48)  (3,50)  (3,50)  (3,50)  (3,32)  (3,48)
Accuracy               78.74%  78.74%  78.74%  78.41%  79.85%  80.42%  80.42%  80.42%  80.90%  81.18%
Reuse factor (DNN2/3)  1       1       1       1       1       1       1       4       1       1
DNN1 units (N_P)       1       13      10      29      6       1       3       4       25      17
II (cycles)            880     80      90      30      150     2462    854     650     100     150
II (µs)                4.40    0.40    0.45    0.15    0.75    12.31   4.27    3.25    0.50    0.75
Latency (cycles)       2511    382     124     58      181     6519    2493    2131    130     181
Latency (µs)           12.56   1.91    0.62    0.29    0.91    32.60   12.47   10.66   0.65    0.91
Note                                           OptLatn OptAcc                          OptLatn OptAcc
The latency can be further reduced by the newly proposed sub-layer fusion for GNNs that use a layer-wise hardware architecture. The fusion removes the boundaries between the hardware units and enables a fine-grained pipeline instead of a coarse-grained one. A code transformation with an FSM structure is introduced to mitigate the pipelining issue of an imperfect loop. For example, design J3 targets the same NN model as J2, but with fusion its latency is reduced from 1.91 µs to 0.62 µs. This is the first time that the JEDI-net 30p model can run within 1 µs. Besides, setting N_P to any number between 10 and 14 yields the same II and latency in the new low latency hardware architecture, because the partition is performed on the inner loop, which has a bound of 29 for JEDI-net 30p; thus 10 is chosen since it consumes the least hardware resources. Compared to our previous work [que2022optimizing], the new design J3 is 3.1 times faster, as shown in Fig. 10(a), while consuming 22% fewer DSP blocks. The cost is a slightly higher II, from 0.40 µs to 0.45 µs, since after fusion the pipeline depth of the fused stage increases.
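The loop restructuring behind the fusion can be illustrated with a toy reduction (names and the work done per stage are ours, not the template's): the imperfect nest, a per-edge body plus a per-node epilogue, is flattened into a single loop whose small FSM-style counter triggers the node-stage work at edge-group boundaries, so HLS can pipeline the whole body at a fine-grained II instead of running two coarse-grained stages.

```cpp
#include <vector>

// Imperfect nest "for each node { for each incoming edge {...} write }"
// flattened into one pipelined loop with an FSM-style counter.
std::vector<double> fused_sum(const std::vector<double>& edge_vals, int n_o) {
    std::vector<double> node_out;
    node_out.reserve(n_o);
    double acc = 0.0;
    int cnt = 0;
    for (int k = 0; k < n_o * (n_o - 1); ++k) {  // single flat loop, II = 1 target
        acc += edge_vals[k];                     // edge-stage work
        if (++cnt == n_o - 1) {                  // FSM: node boundary reached
            node_out.push_back(acc);             // fused node-stage work
            acc = 0.0;
            cnt = 0;
        }
    }
    return node_out;
}
```

In the HLS template the same idea lets a single pipeline pragma cover the fused body, rather than two coarse-grained stages separated by a unit boundary.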
Moreover, we introduce a co-design approach to optimize the GNNs for both model accuracy and hardware performance. This is the first time that JEDI-net 50p models can run within 1 µs without sacrificing model accuracy. In [moreno2020jedi] and our previous work [que2022aicas, que2022optimizing], the sizes of all three MLPs are the same. However, as analysed above, DNN1 is iterated N_O − 1 times more often than DNN2 and N_E times more often than DNN3. Thus, if the size of DNN1 is kept small, the design can deploy more copies of the DNN1 unit on a given FPGA, resulting in a faster design. We consider JEDI-net 30p models with the number of layers (NL) of DNN1 searched over (1, 2, 3, 4) and the layer size over (8, 16, 24, 32); the latency drop factor is set to 2 for JEDI-net 30p. JEDI-net 50p is much larger, so the size of DNN1 is searched over (8, 16, 32, 48) with NL over (1, 2, 3, 4), and the latency drop factor is set to 4. For simplicity, we keep the layer numbers and other configurations of DNN2 and DNN3 the same as in [moreno2020jedi] and only set the size of their first layer to one of (16, 32, 48, 64, 96) for both JEDI-net 30p and 50p.
The results are shown in Fig. 11 and Fig. 12 using a U250 FPGA platform. Each blue dot is an explored design. Fig. 11 shows the designs with a latency of less than 2 µs. Since JEDI-net 50p is much larger than JEDI-net 30p, Fig. 12 shows the designs with a latency of less than 4 µs. J4 and U4 are selected since they have the lowest latency (OptLatn) of all the candidate designs, and the highest model accuracy among the designs with the same latency. With an accuracy loss of 0.33%, the latency of JEDI-net 30p can be further reduced to 0.29 µs, which is 2.1 times faster than J3; the II is also reduced, to 0.15 µs. For JEDI-net 50p, the latency of U4 is reduced from 10.66 µs to 0.65 µs, 16.7 times faster than U3 [que2022optimizing], as shown in Fig. 10(b). In addition, its accuracy is higher than U3's.
Although we target low latency GNNs, for a real-world application we also consider model accuracy. J5 and U5 are selected since they have the highest model accuracy (OptAcc) of all the designs with a latency of less than 1 µs. Compared to our previous work [que2022optimizing] (J2), the new design J5 achieves 1.11% better model accuracy and a 2.1 times reduction in latency. For JEDI-net 50p, the newly searched design U5 achieves 0.76% better model accuracy and an 11.8 times reduction in latency over the previous work (U3). For JEDI-net 50p, design U6 is an interesting case: it achieves the highest accuracy of all the searched designs, but its latency of 2.71 µs exceeds the latency constraint (1 µs) for the CERN HL-LHC. However, with a larger FPGA chip in the future, the latency could be reduced to an acceptable value based on our optimizations. In addition, based on the latency model in Section 4.3, the estimated latencies of designs J4, J5, U4 and U5 are 0.30 µs, 0.91 µs, 0.66 µs and 0.915 µs respectively, i.e., prediction errors below 5%.
We demonstrate that our approach can not only find a design with much better latency (OptLatn) but also find a more balanced design, such as one achieving good accuracy with a slightly higher but still acceptable latency (under 1 µs), such as U5. We can loosen the latency constraint and trade it for model accuracy. Besides, as the scientific domain advances, scientists will design more sensitive detectors and require even lower latency. With our proposed approach and optimizations, we can not only maximize the hardware performance on a given FPGA, but also, given a new performance requirement in the future, determine what future FPGA would be needed to meet such a system requirement.
When particle identification is only part of the whole processing in the trigger, our approach can still find an appropriate set of parameters to obtain the optimal II and latency within a given hardware budget. Besides, in a realistic use case the cardinality of the input dataset might be much smaller; in that case one could speed up the algorithm even more than shown in this work, as well as reduce the resource utilization.
To compare the performance of the proposed FPGA design with other platforms, we run the JEDI-net models implemented in [moreno2020jedi] on an Intel Xeon Gold 6154 CPU and an NVIDIA GeForce RTX 2080 Ti GPU (CUDA 10.2) using the PyTorch (1.8.1) framework. The cuDNN libraries are used to optimize hardware performance on the GPU. Each batch has 1000 graph events (samples), following [moreno2020jedi], and we set the same batch size for all hardware platforms for a fair comparison. CPU power consumption is measured by the pcm-power utility [han2017ese], excluding DRAM power consumption. GPU power consumption is measured using the nvidia-smi utility. We adopt KGPS (kilo graphs per second), the number of graph-event inferences per second, as the indicator of throughput. This work uses the same CPU and GPU implementations as [moreno2020jedi]. Compared with the JEDI-net implementation on the GPU, our FPGA design is 5.1 to 24 times faster and consumes 9.7 to 10.2 times less power. In terms of power efficiency, denoted as KGPS per Watt, our design is 5.1 to 45 times higher than the GPU implementation. Compared to the CPU implementation, our FPGA implementation is 76 to 791 times faster and achieves 326 to 2575 times higher power efficiency. We believe the proposed custom strength reduction could also be applied to CPU and GPU implementations, but latency profiling shows that the three MMMs cost less than 15% of the total latency there; we leave this for future work since it has limited impact on the conclusions of this paper. The FPGA implementation is faster and more efficient because it is fully unrolled on-chip with a fine-grained pipeline and benefits from tailor-made optimizations for JEDI-net based on the proposed approach.

Table III: Comparison with CPU and GPU implementations.

Platform                CPU (Xeon Gold 6154)   GPU (RTX 2080 Ti)    FPGA U250
Frequency               3.00 GHz               1.63 GHz             200 MHz
Technology              14 nm                  12 nm                28 nm
Precision               F32                    F32                  24-bit Fixed
Model                   50p^1      30p^1       50p^1      30p^1     50p      30p
Power (W)               103        106         250        245       25.9     24.0
Batch Size              1000       1000        1000       1000      1000     1000
Latency/graph (µs)      593.1      56.9        16.8       3.8       0.75     0.75
Throughput (KGPS)       1.69       17.6        59.52      263.2     1333     1333
Power Eff. (KGPS/W)     0.02       0.17        0.24       1.07      51.5     55.5

^1 Same JEDI-net architecture as in [moreno2020jedi, que2022optimizing].
There have been studies exploring GNNs for particle physics applications, such as jet tagging (identification) [moreno2020jedi], charged particle tracking [ju2021performance], and calorimeter energy measurements [qasim2019learning]; more can be found in the survey [thais2022graph]. To achieve low latency, FPGAs have been adopted: [elabd2021graph] extends the hls4ml [duarte2018fast] tool to translate GNNs into FPGA firmware automatically for charged particle tracking, and GarNet [iiyama2021distance], a GNN-based algorithm, is proposed for calorimeter energy regression.
There are also many studies on general GNN acceleration [zhang2021boostgcn, lin2021gcn, geng2020awb, geng2021gcn, zhou2022model, tian2022g, garg2022understanding]. AWB-GCN [geng2020awb] is based on a column-wise-product architecture with runtime rebalancing for GCN acceleration. Its upgraded version, I-GCN [geng2021gcn], presents Islandization, a new runtime graph restructuring algorithm, to improve data locality. BoostGCN [zhang2021boostgcn] presents a novel hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme for pipelined GCNs. [lin2021gcn] introduces GCN acceleration using HLS and hardware-friendly optimizations. G-NMP [tian2022g] presents a Near-Memory Processing (NMP) solution for accelerating GNNs that handles their irregular memory accesses. [garg2022understanding] explores various dataflow choices for sparse and dense GNNs on spatial accelerators. [besta2022parallel] designs a taxonomy of parallelism in GNNs. [sohrabizadeh2022streamgcn] presents StreamGCN for accelerating GCNs, specialized for streaming processing of small graphs. [chen2022regraph] proposes a heterogeneous pipeline architecture for GNNs on high bandwidth memory (HBM) enabled FPGAs. [abi2022gengnn] proposes the GenGNN framework to deliver ultra-fast GNN inference and support a diverse set of GNN models; results show their designs achieve millisecond-level latency. They also propose FlowGNN [sarkar2022flowgnn], which can flexibly support the majority of message-passing GNNs. [kang2022grow] proposes a GCN accelerator named GROW based on Gustavson's algorithm, architecting a sparse-dense GEMM accelerator with row-wise product. [sun2022multi] proposes MultiGCN, which balances network latency and bandwidth for GCNs in multi-node systems. [yang2022drgn] presents a dynamically reconfigurable accelerator for GNNs named DRGN. EGCN [han2022egcn] uses tiled matrix multiplication to reduce off-chip memory accesses.
There are also previous studies on algorithm and hardware co-design for GNNs [zhou2022model, you2022gcod, zhang2021g]. [zhang2021g] presents a framework that automatically co-searches GNNs and accelerators to maximize both task accuracy and acceleration efficiency. [zhou2022model] proposes a model-architecture co-design with a lightweight algorithm for temporal GNN inference on FPGAs. [you2022gcod] proposes the GCoD framework, involving a two-pronged accelerator. Some previous studies focus on accelerating GNN training [zeng2020graphact, su2021graph, lin2022hp, ogbogu2022accelerating]. GraphACT [zeng2020graphact] introduces an FPGA-based accelerator with a subgraph-based algorithm for training Graph Convolutional Networks (GCNs). [su2021graph] presents an efficient graph sampling accelerator on HBM-enabled FPGAs for training GNNs. [lin2022hp] proposes HP-GNN, which maps GNN training onto CPU-FPGA platforms automatically. DietGNN [ogbogu2022accelerating], a crossbar-aware pruning technique, is proposed to accelerate the training of large-scale GNNs. All of these studies utilize a single-engine architecture. This work focuses on a layer-wise architecture and proposes several novel optimizations for high throughput and microsecond-level latency. These previous studies are orthogonal to our proposed approach and hardware architecture; their techniques could complement our approach and be incorporated in the future to achieve even lower latency.
This paper presents a novel approach for minimizing the latency of the GNN-based JEDI-net on an FPGA. It involves optimizing the matrix operations and the hardware pipeline to support next-generation low-latency collider trigger systems, which are key to many fundamental physics experiments including particle identification. Results show up to 24 times reduction in latency over the existing GPU-based JEDI-net implementation and up to 16.7 times reduction over our previous work. Our future work includes exploring the use of new FPGA resources such as the AI Engines [xilinx_white]
and the AI Tensor Blocks
[langhammer2021stratix], and incorporating the proposed techniques into the design and implementation of the data processing architecture for nextgeneration collider trigger systems.