Modern Field Programmable Gate Arrays (FPGAs) are prefabricated integrated circuits intended to be configured by the end user. FPGAs have grown significantly in terms of device features and gate counts. Figure 1 illustrates an island-style heterogeneous FPGA which consists of a two-dimensional array of Configurable Logic Blocks (CLBs) surrounded by peripheral I/O blocks. The CLBs internally contain Look-Up Tables (LUTs) and Flip-Flops (FFs) to implement logic. Heterogeneous blocks such as Random Access Memories (RAMs) and digital signal processors (DSPs) are present. Other hardened IP blocks (e.g., processors, SERDES, etc.) can also be included in the device.
Not obvious in Figure 1 are the FPGA routing resources which consist of wire segments and programmable switches. The programmable switches are used to connect wire segments to each other and to other resources such as input and output pins on different blocks. This information (i.e., the wire segments, programmable switches, pins on blocks, etc.) must be maintained during detailed routing via the Routing Resource Graph (RRG) data structure. The RRG can consume a significant amount of memory which can negatively impact the router and the overall CAD flow. In situations where insufficient memory is available, swap memory will need to be used. A large memory footprint will also have negative consequences for cache performance.
This observation is made in [3, 2], where the RRG memory footprint is reduced by relying on FPGA tiling. Tiling means that “copies” of the same routing resources are repeated throughout the device. Different tiles may exist; e.g., the routing resources surrounding CLBs might be different than those surrounding RAMs, DSPs or I/O blocks. Nevertheless, when viewed as a graph, the RRG consists of many identical sub-graphs, so only a single copy of each sub-graph (i.e., tile) is required in memory. Tiles and resources can be expanded as needed during FPGA routing. This observation resulted in a X to X reduction in the memory requirements for the RRG in [3, 2]. However, this approach comes with a penalty. Specifically, in [3, 2] the required modifications to the routing algorithm resulted in a X runtime penalty for the time taken to perform detailed routing.
We propose an alternative view of the problem to compress the RRG. Our approach benefits from FPGA tiling, but neither requires nor relies upon it (we do not explicitly identify tiles, and we also demonstrate compression when we ignore tiling). Specifically, we focus on the adjacency information in the RRG and compress only this portion of the RRG in two ways. First, we apply delta encoding and v-byte compression to reduce the size of the data needed to store adjacency lists. Second, we apply a sliding window compression to avoid storing duplicate adjacency lists (delta encoding and v-byte compression do not require tiling, whereas the sliding window compression does benefit from the repetition of resources introduced by tiling). Depending on the device, we can achieve as much as a X reduction in the memory needed to store the adjacency information which translates into as much as a X reduction in the size of the RRG. While this does not appear as significant as that reported in [3, 2], we find that our changes only slow down the router by an average of compared to the more than X slowdown reported in [3, 2]. Our proposed scheme requires very few lines of code to implement and requires only insignificant and unintrusive changes to the FPGA router.
II Routing resource graphs
The RRG is one piece of data which is maintained during routing and represents all of the physical resources inside the FPGA required to facilitate routing of signal nets. This information includes wires and pins as well as additional source and sink pins to model logical equivalences. A simple example of an RRG construction taken from [3, 2] is illustrated in Figure 2.
Resources are represented as nodes in the RRG. Programmable connections (switches) between different resources are modeled as directed edges. There can be a variety of different switches available in the architecture; e.g., pass transistors or buffered switches with different resistances and capacitances. Potential connections and switches form edges in the RRG and potentially consume significant memory. An example of how this information might be stored for a particular resource is illustrated in Figure 3, in which adjacent resources and the switch types used to make a connection are stored as two vectors.
Additional information is stored in each node, including the physical location, span, occupancy, capacity, resistance and capacitance of the resource, but we do not consider it. We are only interested in the adjacencies, since there appears to be significant potential to reduce the memory requirements by focusing only on this information. As illustrated in Figure 3, each edge in the RRG would require one int and one short, consuming a minimum of 6 bytes per edge. Our aim is to either compress or eliminate as much of this information as possible without losing any RRG details.
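The two-vector storage described above can be sketched in C++ as follows. This is only an illustration under our own assumptions: the names `RRNode` and `adjacency_bytes` are hypothetical and are not taken from VPR or VTR, and fixed-width types stand in for `int` and `short`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified RRG node: only the adjacency-related fields
// are shown (location, capacity, R/C, etc. are omitted).
struct RRNode {
    std::vector<std::int32_t> edges;     // ids of adjacent RRG nodes (one int per edge)
    std::vector<std::int16_t> switches;  // switch type used by each edge (one short per edge)
};

// Payload bytes needed for the adjacency information of one node,
// ignoring the bookkeeping overhead of the vectors themselves.
std::size_t adjacency_bytes(const RRNode& n) {
    return n.edges.size() * sizeof(std::int32_t) +
           n.switches.size() * sizeof(std::int16_t);
}
```

Under these assumptions, a node with 7 outgoing edges needs 7 × (4 + 2) = 42 bytes of adjacency payload, of which 28 bytes are the adjacent node ids.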
III Our algorithm
III-A Delta encoding and v-byte compression
Nodes in the RRG are typically created in a regular localized manner; e.g., given an (x, y) coordinate in the FPGA, all routing resources around that coordinate are created at the same time. Further, these routing resources are typically connected locally. This implies that two RRG nodes near the same coordinate will have similar integer identifiers, and this fact can be exploited.
Figure 4 shows an adjacency list for an RRG node with 7 adjacencies which requires a minimum of 28 bytes (we do not consider the switches in our explanation).
Figure 4 also shows the same information, but with delta encoding. With delta encoding, the deltas between consecutive sorted integers are always smaller positive integers. It is not necessary to use 4 bytes to store each adjacency, and this is even more true with delta encoding. We can apply v-byte encoding to compress the adjacencies into even fewer bytes and only use as many bytes as required. The compressed v-byte encoded adjacency list is also shown in Figure 4. The integer ids of adjacent resources can be reduced from 7 integers (28 bytes) to only 9 bytes. It is also possible to compress the switch information into bytes effectively.
The complexity of the compression is O(n log n), where n is the length of the adjacency list, due to the need to sort the adjacency lists. Delta encoding and compression are then achieved with a single pass over the adjacency list. Compression pseudocode is given in Figure 5. Adjacency lists are compressed as they are created to avoid using large amounts of memory to represent the RRG. Numerical results demonstrate the effectiveness of delta encoding and v-byte compression.
Figure 5: Compression pseudocode. Input: vector of integer adjacencies. Output: vector of compressed byte adjacencies.
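A minimal C++ sketch of sorting, delta encoding and v-byte compression of one adjacency list is given here. It is not a transcription of Figure 5. The function names are hypothetical, and two details are our assumptions: the first gap is measured from 0, and each byte carries 7 payload bits (low-order group first) with the high bit set on every byte except the last byte of a value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Append one unsigned value in v-byte form: 7 payload bits per byte,
// low-order group first; the high bit marks "more bytes follow".
// (This bit convention is an assumption, not taken from the paper.)
void vbyte_append(std::uint32_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80u) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7Fu) | 0x80u));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Sort the adjacency list (the O(n log n) step), replace each id by its
// gap from the previous id, and v-byte encode the gaps in a single pass.
std::vector<std::uint8_t> compress_adjacencies(std::vector<std::uint32_t> adj) {
    std::sort(adj.begin(), adj.end());
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;  // assumption: first gap is measured from 0
    for (std::uint32_t id : adj) {
        vbyte_append(id - prev, out);
        prev = id;
    }
    return out;
}
```

For example, the list {300, 5, 7} sorts to {5, 7, 300}, giving gaps {5, 2, 293}; the first two gaps fit in one byte each and 293 needs two, so 12 bytes of ids shrink to 4 bytes under these assumptions.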
III-B Sliding window compression
Although we don’t specifically require tiling, it is very likely that there are repeated patterns in the routing graph due to tiling. Figure 6 shows the adjacencies for two resources in different physical regions of the FPGA after delta encoding.
Figure 6: Node #8564 (referenced): 8525, -373
The repeated patterns of connections are made clear by the use of delta encoding. Consequently, the storage of both adjacency lists is not required. During RRG creation, adjacency patterns (the deltas) are hashed, and if RRG nodes with identical deltas are encountered, their adjacency lists are not stored explicitly (only a reference to the other RRG node is recorded).
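The hashing of delta patterns can be sketched in C++ as below. This is an illustrative simplification rather than our implementation: `AdjacencyStore`, `DeltaHash` and the member names are hypothetical, a global hash table is used instead of a bounded window, and the delta lists are assumed to have been computed already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using DeltaList = std::vector<std::int32_t>;

// Simple FNV-1a-style mixing over 32-bit words (an illustrative choice).
struct DeltaHash {
    std::size_t operator()(const DeltaList& v) const {
        std::uint64_t h = 1469598103934665603ull;
        for (std::int32_t d : v) {
            h ^= static_cast<std::uint32_t>(d);
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
};

// Each distinct delta pattern is stored once; every other node with the
// same pattern records only a reference to the node that owns the copy.
struct AdjacencyStore {
    std::unordered_map<DeltaList, std::uint32_t, DeltaHash> owner_of;
    std::vector<std::uint32_t> ref;  // ref[node] == node means "stored here"
    std::vector<DeltaList> lists;    // left empty for referencing nodes

    // Nodes are assumed to be added in increasing id order, starting at 0.
    void add_node(const DeltaList& deltas) {
        const std::uint32_t id = static_cast<std::uint32_t>(ref.size());
        auto result = owner_of.emplace(deltas, id);
        ref.push_back(result.first->second);
        if (result.second) {
            lists.push_back(deltas);  // first time this pattern is seen
        } else {
            lists.emplace_back();     // duplicate: keep only the reference
        }
    }

    const DeltaList& deltas_of(std::uint32_t node) const {
        return lists[ref[node]];
    }
};
```

Looking up a node's adjacencies then costs one extra indirection through `ref` before the usual decoding step.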
III-C Node renumbering
We also consider renumbering RRG nodes. Since adjacencies can be viewed as a matrix, we consider the use of matrix reordering techniques. It is typical that RRGs are created such that physically close resources are created with similar identifiers and connected. For example, Figure 7(a) shows the RRG for a homogeneous architecture when viewed as an adjacency matrix.
In Figure 7(a), the clustering of adjacencies around the diagonal indicates connected nodes have similar identifiers which benefits the use of delta encoded lists.
The situation is slightly different for a heterogeneous device whose RRG is depicted in Figure 7(b). Entries are not always clustered around the diagonal, although blocks appear off the diagonal. The use of delta encoding could be impaired. Figure 7(c) shows the same RRG as Figure 7(b), but with RRG nodes renumbered through RCM matrix reordering [4]. The bandwidth reduction obtained through matrix reordering can potentially result in smaller deltas which will benefit compression. However, we note that renumbering RRG nodes can potentially “hide” any tiling within the device and negatively impact the previously described windowing strategy.
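To make the effect of renumbering concrete, the following C++ sketch computes a reverse Cuthill-McKee (RCM) style ordering and the resulting matrix bandwidth. It is a textbook-style illustration, not the code used in our experiments: the names `Graph`, `bandwidth`, `rcm_renumber` and `relabel` are hypothetical, and the graph is assumed to be stored as symmetric (undirected) adjacency lists, whereas RRG edges are directed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // symmetric adjacency lists

// Largest |u - v| over all edges: the bandwidth of the adjacency matrix.
int bandwidth(const Graph& g) {
    int bw = 0;
    for (int u = 0; u < static_cast<int>(g.size()); ++u)
        for (int v : g[u]) bw = std::max(bw, std::abs(u - v));
    return bw;
}

// Returns new_id[old_id]: BFS from low-degree nodes, visiting neighbors
// in order of increasing degree, then reversing the visit order.
std::vector<int> rcm_renumber(const Graph& g) {
    const int n = static_cast<int>(g.size());
    auto by_degree = [&](int a, int b) { return g[a].size() < g[b].size(); };
    std::vector<int> starts(n), order;
    for (int i = 0; i < n; ++i) starts[i] = i;
    std::stable_sort(starts.begin(), starts.end(), by_degree);
    std::vector<bool> seen(n, false);
    for (int s : starts) {  // one BFS per connected component
        if (seen[s]) continue;
        std::queue<int> q;
        q.push(s);
        seen[s] = true;
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            order.push_back(u);
            std::vector<int> next;
            for (int v : g[u])
                if (!seen[v]) { seen[v] = true; next.push_back(v); }
            std::stable_sort(next.begin(), next.end(), by_degree);
            for (int v : next) q.push(v);
        }
    }
    std::reverse(order.begin(), order.end());  // the "reverse" in RCM
    std::vector<int> new_id(n);
    for (int i = 0; i < n; ++i) new_id[order[i]] = i;
    return new_id;
}

// Apply a renumbering to obtain the relabelled graph.
Graph relabel(const Graph& g, const std::vector<int>& new_id) {
    Graph out(g.size());
    for (std::size_t u = 0; u < g.size(); ++u)
        for (int v : g[u]) out[new_id[u]].push_back(new_id[v]);
    return out;
}
```

On a 5-node path whose nodes are labelled 0, 3, 1, 4, 2 along the path, the bandwidth drops from 3 to 1 after renumbering, which in turn shrinks the largest delta in any adjacency list.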
IV Router modifications
Typical FPGA routers are based on path-finding algorithms [6, 1] which use graph search algorithms (e.g., BFS or A*). Each signal net is first routed ignoring other nets, and then a loop is entered that continues as long as there are overused RRG resources due to multiple nets requesting the same RRG resources. In each iteration, each net is ripped up and re-routed, and an effort is made to avoid reusing overused resources. Each net is routed using some sort of graph search algorithm, and a salient feature of these algorithms is the need for “neighborhood expansion”, which requires looping over the adjacencies of an RRG node.
In our case, these adjacencies are either encoded and compressed or found via a reference to another node (in which the information is also encoded and compressed). Prior to expanding neighbors, we must dereference, decompress and decode the adjacency lists. This is the only modification needed to the router. The decompression and decoding is linear in the length of the compressed list and is done as shown in Figure 8.
Figure 8: Decompression pseudocode. Input: vector of compressed byte adjacencies. Output: vector of integer adjacencies.
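A minimal C++ sketch of the linear-time decompression and decoding step follows. It is not a transcription of Figure 8. The function name is hypothetical, and the byte layout is an assumption: each byte carries 7 payload bits (low-order group first), the high bit signals that more bytes of the same value follow, and the first gap is measured from 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decode a v-byte stream of gaps back into absolute node ids.
// One pass over the bytes; the output is in ascending id order.
std::vector<std::uint32_t> decompress_adjacencies(
        const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> adj;
    std::uint32_t prev = 0;   // running sum that undoes the delta encoding
    std::uint32_t value = 0;  // gap currently being assembled
    unsigned shift = 0;
    for (std::uint8_t b : bytes) {
        value |= static_cast<std::uint32_t>(b & 0x7Fu) << shift;
        if (b & 0x80u) {
            shift += 7;       // continuation bit set: more bytes follow
        } else {
            prev += value;    // gap complete: recover the absolute id
            adj.push_back(prev);
            value = 0;
            shift = 0;
        }
    }
    return adj;
}
```

Under these assumptions the bytes 0x05, 0x02, 0xA5, 0x02 decode to the gaps 5, 2, 293 and hence to the ids 5, 7, 300. A router would call such a routine once per neighborhood expansion, after following any reference to another node.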
V Numerical Results
We modified VPR4.3 [1] and VTR [7] to examine compressed RRG storage requirements and router runtimes. We consider VPR4.3 and homogeneous FPGAs for some comparison to [3, 2], and we used the latest version of VTR from GitHub to consider heterogeneous FPGAs.
V-A Memory reductions
Figure 9(a) shows the RRG compression for different FPGA grid sizes using VPR4.3 in which each CLB contained 10 LUT/FF pairs with 22 inputs and 10 outputs. The channel width was fixed at 150 wires of length 4. We found that adjacency information alone consumed of the RRG memory indicating this is a “valid target” for compression. With all compression options, we see a reduction of X X in the RRG size (a X X reduction in the storage requirements if only the adjacency information is considered). With only delta encoding and v-byte compression, Figure 9(a) shows compressions of X X. While not as substantial, this compression is readily available without relying on the FPGA tiling.
Figure 9(b) shows the same results for heterogeneous architectures created using VTR (the heterogeneous FPGAs included DSP and RAM blocks and were created using the k6_frac_N10_mem32K_40nm.xml architecture file). Here, the adjacency information consumed of the total RRG memory. Figure 9(b) shows a savings of X X using delta encoding and v-byte compression. With all options, we see only a modest improvement of X X. Interestingly, with these heterogeneous architectures, windowing does not seem to significantly improve the RRG compression.
We do not include compression results obtained using RRG node renumbering. Our investigation thus far into this technique has not been useful for achieving improved compression results. Specifically, we found that node renumbering was effective at reducing the magnitudes of the delta values, but only led to minor improvement in the compression. At the same time, finding identical patterns of delta values through windowing became less effective. The net result was little to no change in the overall compression ratios.
V-B Router impact
Routing results are deterministic so we only consider the impact on the routing runtime due to the need to constantly decompress RRG adjacencies. Figure 10(a) shows the results on a set of designs run through VPR4.3. Each CLB consisted of a single LUT/FF pair and all wire segments are length 1. Designs were mapped to the smallest FPGA into which they would fit. Figure 10(a) shows the router runtime is impacted by only on average with a maximum penalty of which compares very favorably to the runtime impact mentioned in [3, 2]. Figure 10(b) shows the same results for a set of heterogeneous designs run through VTR and mapped onto a heterogeneous FPGA. Here we see a runtime impact of on average with a maximum penalty of .
Figure 10 also shows the RRG compressions achieved for these additional FPGA devices to demonstrate memory savings.
We have presented several simple ideas to compress the RRGs used by FPGA detailed routers. Our ideas are extremely easy to implement and focus on viewing the RRG as an adjacency graph. The compression achieved is reasonable and appears to not significantly impact detailed router runtimes.
Our compression and decompression implementations were straightforward and the impact on router runtime could possibly be reduced by using more efficient decompression [5]. More investigation of compression on heterogeneous architectures seems worthwhile.
[1] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.
[2] S. Y. L. Chin and S. J. E. Wilton. Static and dynamic memory footprint reduction for FPGA routing algorithms. TRETS, 1(4):1–20.
[3] S. Y. L. Chin and S. J. E. Wilton. Memory footprint reduction for FPGA routing algorithms. In Proc. FPT, pages 1–8, 2007.
[4] A. George and J. Liu. Computer Solution of Large Sparse Positive Definite Matrices. Prentice Hall, 1981.
[5] D. Lemire, N. Kurz, and C. Rupp. Stream VByte: Faster byte-oriented integer compression. Information Processing Letters, 130:1–6, February 2018.
[6] L. McMurchie and C. Ebeling. PathFinder: A negotiation-based performance-driven router for FPGAs. In Proc. FPGA, pages 111–117, 1995.
[7] J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent, P. Jamieson, and J. Anderson. The VTR project: Architecture and CAD for FPGAs from Verilog to routing. In Proc. FPGA, pages 77–86, 2012.