Amorphous Dynamic Partial Reconfiguration with Flexible Boundaries to Remove Fragmentation

10/23/2017 ∙ by Marie Nguyen, et al. ∙ Carnegie Mellon University

Dynamic partial reconfiguration (DPR) allows one region of a field-programmable gate array (FPGA) fabric to be reconfigured without affecting the operations on the rest of the fabric. To use an FPGA as a dynamically shared compute resource, one could partition and manage an FPGA fabric as multiple DPR partitions that can be independently reconfigured at runtime with different application function units (AFUs). Unfortunately, dividing a fabric into DPR partitions with fixed boundaries causes the available fabric resources to become fragmented. An AFU of a given size cannot be loaded unless a sufficiently large DPR partition was floorplanned at build time. To overcome this inefficiency, we devised an "amorphous" DPR technique that is compatible with current device and tool support but does not require the DPR partition boundaries to be fixed a priori. A collection of AFU bitstreams can be simultaneously loaded on the fabric if their footprints (the actual area used by an AFU) in the fabric do not overlap. We verified the feasibility of amorphous DPR on Xilinx Zynq System-on-Chip (SoC) FPGAs using Vivado. We evaluated the benefits of amorphous DPR in the context of a dynamically reconfigurable vision processing pipeline framework.

I Introduction

Motivation. Dynamic partial reconfiguration (DPR) allows a region of the field programmable gate array (FPGA) fabric to be reconfigured while the remainder of the fabric can continue to operate unaffected [1]. This allows the portion of the FPGA fabric with real-time functionalities, such as external I/O interfacing, to remain online while the functionality realized by a DPR region is updated. The dynamism and flexibility made possible by DPR are especially important when using FPGAs for computing.

Use-Case. Consider a use-case where the available FPGA fabric is divided into multiple DPR partitions with fixed boundaries at build time. Each DPR partition is provided with a standard interface connection (e.g., AXI4). The DPR partitions are enclosed by an infrastructural static partition that provides datapath to connect the DPR partitions, through the standard interface, with each other and with off-fabric resources (e.g., on-chip embedded processor, off-chip DRAM and I/O). At runtime, the DPR partitions can be dynamically reconfigured for use by independent or loosely-coupled application function units (AFUs) to flexibly share the fabric spatially and temporally. Example systems of this kind of dynamically managed multi-AFU fabric use-case include [2, 3, 4].

Fig. 1: A vision processing pipeline framework that uses DPR to reconfigure the functions of the pipeline stages at runtime.

We have developed a working example of this use-case in the form of a runtime framework to map vision processing pipelines onto a Xilinx Zynq System-on-Chip (SoC) FPGA for real-time interactive applications (Fig. 1). Each DPR partition can be loaded with an AFU that is a vision processing stage (e.g., blob/color detection/tracking, edge/corner detection, morphological transforms, etc.). The infrastructural static partition provides the AFUs with AXI4 connections to off-chip DRAM, camera and video-out. The AFU in one DPR partition can also stream video frames directly to another DPR partition. The connectivity provided by the static partition is general so the assignment of processing stage to partition is flexible. This framework allows a Zynq SoC FPGA fabric to be dynamically reconfigured for different vision processing pipelines by loading and composing appropriate AFUs at runtime. The fabric can be shared by multiple vision processing pipelines running at the same time. The fabric can also be temporally shared when there are not enough DPR partitions to host all the pipelines simultaneously.

Problem: Fragmentation. Dividing a fabric into DPR partitions with fixed boundaries causes the available fabric resources to become fragmented. We risk creating external fragmentation if we divide the fabric into many small DPR partitions: an over-sized AFU cannot be loaded onto the fabric unless a sufficiently large DPR partition was allocated when the DPR partitions were floorplanned. We risk creating internal fragmentation if we instead make the DPR partitions large enough for such AFUs: this reduces the number of AFUs that can run concurrently, and the large DPR partitions would frequently be wasted on under-sized AFUs. In our vision processing use-case, the effect is especially pernicious for SRAM and DSP resources that are in very high demand.

Solution: “Amorphous” DPR. We devised a DPR technique that does away with the need to commit upfront to a layout of fixed DPR partition boundaries. This technique relies on the assumption that DPR partitions only physically connect with the static partition and never directly with each other. Only the boundary of the static partition and the locations of the AXI4 interface nets have to be fixed at build time. Instead of mapping an AFU to fit in a DPR partition’s fixed boundary, we map an AFU to a custom floorplan that only encloses the minimum consumed fabric region around an interface. In fact, for each AFU, we compile multiple bitstream versions corresponding to differently shaped footprints; each footprint option is chosen to minimize the potential for conflict with other AFUs’ footprints. At runtime, a desired combination of AFUs can occupy the fabric at the same time if a non-overlapping packing of footprints can be found from the available versions.

Contributions. We verified the feasibility of amorphous DPR on Xilinx Zynq SoC FPGAs using Vivado. We further integrated amorphous DPR into our vision processing pipeline framework. Doing away with the impositions of fixed DPR partition boundaries removes resource fragmentation and thus greatly expands the set of AFU combinations that can co-exist on the fabric simultaneously.

We evaluated the improvement in placement rate (fraction of a given set of AFU combinations that can be placed successfully) when using amorphous DPR vs. standard DPR in our vision processing pipeline framework. We also evaluated the savings in DPR overhead in terms of time because amorphous DPR reconfigures only the footprint area actually used for an AFU (instead of a complete DPR partition regardless of the degree of utilization within when using standard DPR). The results show that amorphous DPR offers significant improvement in flexibility and efficiency over standard DPR in our vision processing pipeline framework.

Paper Outline. Following this introduction, Section II provides a review of DPR as currently supported by Xilinx Vivado and Xilinx Zynq SoC FPGA. Section III presents the amorphous DPR technique. Section IV introduces the design of our evaluation. Section V presents the evaluation outcomes. Section VI offers a survey of related work. Lastly, Section VII offers our conclusion.

II Dynamic Partial Reconfiguration (DPR)

Although DPR has not seen widespread use over the decade since its commercial introduction, the technology today is flexible and well supported. The discussion of DPR in this section is based on Vivado [1] and 7-series Xilinx Zynq SoC FPGA [5], the environment we used for our study.

Fig. 2: An FPGA fabric organized into a top-level static partition enclosing an uncommitted reconfiguration region subdivided as two DPR partitions. The termination LUTs in (a) have been arbitrarily placed; the termination LUTs in (b) have been placed deliberately.

Static and DPR Partitions. The cartoon in Fig. 2(a) depicts an FPGA fabric organized into a top-level static partition enclosing an uncommitted reconfiguration region subdivided as two DPR partitions. For logical design entry, a DPR partition appears in the top-level design as a “black-box” submodule with a known input/output port list but opaque internals. In Fig. 2(a), the two DPR partitions are shown to have the same port list, simply A and B in this toy example.

A DPR partition can have an arbitrary rectilinear outline and can cross clock regions. On 7-series Xilinx FPGAs, the minimum unit to allocate to a DPR partition is a column (whether LUT, BRAM or DSP blocks) spanning the full-height of a clock region if the “reset after reconfiguration” attribute is set. Otherwise, LUT, BRAM and DSP blocks can be allocated in the granularity of individual units. Please refer to [1] for more detailed rules and restrictions.

Build Flow. At the time the static partition design is built, the physical boundaries of the static partition and DPR partitions are set by floorplanning. The net for a port (whether input or output) terminates at a reserved LUT location (a.k.a. proxy logic; other resource types can also be used for termination) within the DPR partition. In Fig. 2(a), the termination LUTs A and B are shown as having been placed arbitrarily by the tool. The figure also shows the placed-and-routed nets that connect the termination LUTs out to the static partition. Fig. 2(b) shows another version where the termination LUTs have been deliberately placed during floorplanning.

Bitstream Versions. Separately from building the static partition, any AFU design with a matching port list (i.e., A and B in our example) can be synthesized for the DPR partitions, subject to the restrictions of (1) the DPR partition’s fixed boundary and (2) the reserved resources for the input/output termination LUTs and nets. The same logical AFU design could be synthesized for use in either or both DPR partitions (same I/O port list). However, the AFU design would have to be separately placed-and-routed for the two different DPR partitions, resulting in two non-interchangeable, partition-specific bitstreams for that one AFU design.

Reconfiguring at Runtime. At runtime, the reconfiguration of a DPR partition can be initiated from outside the FPGA, by logic on the fabric, or by the embedded ARM core. To reconfigure a DPR partition, interactions with the out-going AFU are paused; the incoming bitstream is loaded (from BRAM, DRAM or FLASH); and finally, the new AFU is started. In the system we built, DPR is managed by software running on the ARM core and bitstreams are held in FLASH initially, and loaded into DRAM for use.

While one DPR partition is undergoing reconfiguration, the logic on the rest of the fabric is not affected except the portions that interact directly with the DPR partition’s input/output ports. The disruption during DPR must be accounted for explicitly by the enclosing design with the help of auxiliary DPR status signals (that indicate the readiness of the DPR submodule). The minimum time to reconfigure a DPR partition is on the order of milliseconds. The total time is a function of the size of the loaded bitstream. For standard DPR, the bitstream size is a function of the DPR partition size regardless of the actual degree of resource utilization within.

Implications on Use-Case. In the introduction, we motivated a dynamically managed multi-AFU fabric use-case where the FPGA fabric is divided into DPR partitions to support dynamic spatial and temporal mixing of AFUs. We can extrapolate from the simplified cartoon in Fig. 2 to a more realistic implementation of the use-case by increasing the number of DPR partitions, and by replacing the toy input/output port list by AXI4 interfaces (including the AXI4-Lite slave interface for the embedded ARM core to control the AFU by memory-mapped I/O).

As pointed out in the introduction, dividing the uncommitted reconfiguration region into a layout of fixed DPR partition boundaries results in resource fragmentation. Please note that it is not necessary that all the DPR partitions be the same size. For example, if the AFU workload mix is known ahead of time, one could improve mappability by creating asymmetrically resourced DPR partitions tuned to the AFU workload at build time. For example, one would want to allocate DPR partitions to be large enough for the largest required AFU or combination of AFUs (e.g., to form a particular vision pipeline in our vision processing use-case). Please also keep in mind that the distribution of resources (LUTs and hard blocks) on the FPGA fabric is actually not uniform from region to region, at either coarse or fine scale. It is generally not possible to form truly equally resourced DPR partitions.

III Amorphous DPR

For the dynamically managed multi-AFU fabric use-case, allocating too-large DPR partitions creates internal fragmentation; allocating too-small DPR partitions creates external fragmentation. Either way, the effect is that some un-utilized resources become off-limits—due to some boundary line—to an AFU that needs them. This inefficiency and inflexibility is a significant obstacle to the dynamically managed multi-AFU fabric use-case.

Fig. 3: The elements actually locked down as the result of building the floorplan in Fig. 2(a). The dashed outlines indicate examples of valid and invalid footprint options for building AFUs to attach to the left interface.

Flexible Boundaries. We realized we could avoid fragmentation by doing away completely with the requirement of fixing the DPR partition boundaries at build time. This is because in our use-case, the AFUs to be configured at runtime only connect physically with the static partition and never directly with each other. At build time, we only have to fix (1) the boundary of the static partition and (2) the resources reserved for the AXI4 interface nets and the termination LUTs. Fig. 3 depicts the elements actually locked down as the result of building the floorplan in Fig. 2(a).

Instead of confining an AFU to a predetermined DPR partition boundary, we could build an AFU to attach to the left interface using any of the several possible valid footprints (examples shown in dashed lines in Fig. 3). For a given AFU design, the footprint only needs to be large enough to contain the required fabric resources. The same flexibility is available when building AFUs for the right interface. Please note that all resources (including routing) needed by a given AFU must be entirely contained within its footprint.

Fig. 4: A valid packing of two non-overlapping footprints of AFUs for the left and right interfaces. The left AFU would have been too large to fit within the fixed boundaries of the left DPR partition in Fig. 2(a).

At runtime, two AFU bitstreams—one for the left and one for the right interface—can be simultaneously loaded provided their footprints do not overlap. Fig. 4 shows the example where the large footprint of a resource-demanding AFU, attached to the left interface, co-exists with the small footprint of a less resource-demanding AFU, attached to the right interface. Some resources are left over, not needed by either footprint. This combination of AFUs would have been prevented by resource fragmentation had we followed the fixed, equally resourced DPR partitions in Fig. 2(a) or Fig. 2(b).
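The non-overlap condition can be illustrated with a minimal sketch. It assumes, purely for illustration, that a footprint is modeled as a set of (column, row) tile coordinates; the `rect` helper, the 8x4 region and all coordinates are hypothetical and are not taken from the actual device or tool flow.

```python
from itertools import product

def rect(col0, row0, col1, row1):
    """Return the set of (col, row) tiles in an inclusive rectangle."""
    return set(product(range(col0, col1 + 1), range(row0, row1 + 1)))

def can_coexist(footprint_a, footprint_b):
    """Two footprints can be loaded together only if they share no tile."""
    return footprint_a.isdisjoint(footprint_b)

# Hypothetical 8x4 reconfiguration region: a large rectilinear (L-shaped)
# footprint for the left interface and two candidate right footprints.
left_large = rect(0, 0, 4, 3) | rect(5, 2, 6, 3)
right_small = rect(6, 0, 7, 1)
right_wide = rect(4, 0, 7, 1)

print(can_coexist(left_large, right_small))  # True
print(can_coexist(left_large, right_wide))   # False
```

Modeling footprints as tile sets handles arbitrary rectilinear outlines without special cases, at the cost of ignoring the column and clock-region granularity rules that apply on a real device.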

Interface Placement. Using amorphous DPR, we no longer have to make hard decisions on how to divide up the uncommitted reconfiguration region upfront. The decision is reduced to how many AXI4 interfaces to support and the placement of the AXI4 interfaces’ termination LUTs. The placement of termination LUTs should not be arbitrary as they can interfere with the packing of AFU footprints. For example, the largest footprint in Fig. 3 is not valid for attaching an AFU to the left interface because it also encloses the termination LUTs for the right interface. Thus, we can see that the deliberate placement of termination LUTs in Fig. 2(b) is preferable to Fig. 2(a) because the deliberate placement is less restrictive. Please note that the resources withheld for the interface nets do not pose similar restrictions.
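The restriction that termination LUTs place on footprints can be sketched as a simple validity test. The sketch below assumes a hypothetical model in which footprints and termination LUT locations are sets of (column, row) tiles; the function name and coordinates are illustrative only.

```python
def is_valid_footprint(footprint, own_luts, other_luts):
    """A footprint must enclose all termination LUTs of its own interface
    and none of the termination LUTs reserved for other interfaces."""
    return own_luts <= footprint and footprint.isdisjoint(other_luts)

def block(col0, col1, rows=4):
    """All tiles in columns col0..col1 (inclusive) of a region `rows` tall."""
    return {(c, r) for c in range(col0, col1 + 1) for r in range(rows)}

left_luts = {(0, 1), (0, 2)}
right_luts_arbitrary = {(5, 1), (5, 2)}   # ends up mid-region
right_luts_deliberate = {(7, 1), (7, 2)}  # pushed to the periphery

large = block(0, 5)  # a left-interface footprint growing toward the interior

print(is_valid_footprint(large, left_luts, right_luts_arbitrary))   # False
print(is_valid_footprint(large, left_luts, right_luts_deliberate))  # True
```

In this toy setting the same large footprint is rejected under the arbitrary placement and accepted under the deliberate one, which is the effect the comparison of Fig. 2(a) and Fig. 2(b) describes.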

When extrapolating to a realistic implementation supporting many more interfaces, the placement of the termination LUTs becomes of strategic importance. The goal is to allow one AFU’s footprint—which must include its own interface’s termination LUTs—to grow, as necessary, unimpeded by other interfaces’ termination LUTs. For the sizes of contemporary available FPGAs, we follow the heuristic of placing the interfaces evenly along the periphery of the uncommitted reconfiguration region. This heuristic allows interfaces to access more freely the resources in the uncommitted reconfiguration region, by allowing the AFU footprints to grow toward the interior of the region.

To place the large number of signals associated with the AXI4 and AXI4-Lite interfaces, we use the floorplanner to tightly constrain the outline of a placeholder DPR partition so the interface signals will be automatically placed into an intended area. Later, when building an AFU to attach to a particular interface, we use the floorplanner to expand the associated placeholder DPR partition’s original boundary to the desired rectilinear footprint. This final footprint outline, as well as any of the termination LUTs and nets reserved within, is then used to constrain the place-and-route to produce a footprint- and interface-specific version of the DPR bitstream.

Footprint/Bitstream Management. In Section II, we noted that an AFU needs to have different bitstream versions to be instantiated in different DPR partitions under standard DPR. Under amorphous DPR, one AFU can have still more versions of DPR bitstreams, each corresponding to a particular interface attachment and a particular footprint. This extra degree of freedom in footprint choice expands the set of valid combinations of AFUs that can be loaded on the fabric simultaneously. The downsides to this degree of freedom are (1) increased storage for additional bitstream versions and (2) algorithmic complexity in optimizing the compile-time decisions of footprint choices, and the runtime decisions of bitstream version selection.
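One possible way to make the runtime decision of bitstream version selection is an exhaustive backtracking search, sketched below. This is not the paper's algorithm; it assumes a hypothetical database that maps each AFU name to a list of (interface id, footprint tile set) versions, and it simply looks for one version per demanded AFU such that interfaces are distinct and footprints are pairwise disjoint.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Tile = Tuple[int, int]
Version = Tuple[int, FrozenSet[Tile]]  # (interface id, footprint tiles)

def select_versions(demand: List[str],
                    database: Dict[str, List[Version]]) -> Optional[List[Version]]:
    """Pick one version per demanded AFU; return None if no packing exists."""
    chosen: List[Version] = []

    def search(i: int, used_ifaces: frozenset, used_tiles: frozenset) -> bool:
        if i == len(demand):
            return True
        for iface, tiles in database[demand[i]]:
            # Skip versions whose interface is taken or whose footprint overlaps.
            if iface in used_ifaces or not tiles.isdisjoint(used_tiles):
                continue
            chosen.append((iface, tiles))
            if search(i + 1, used_ifaces | {iface}, used_tiles | tiles):
                return True
            chosen.pop()  # backtrack and try the next version
        return False

    return chosen if search(0, frozenset(), frozenset()) else None

def strip(c0: int, c1: int) -> FrozenSet[Tile]:
    """Hypothetical one-row footprint covering columns c0..c1 inclusive."""
    return frozenset((c, 0) for c in range(c0, c1 + 1))

db = {
    "big":   [(0, strip(0, 5)), (1, strip(2, 7))],
    "small": [(0, strip(0, 1)), (1, strip(6, 7))],
}
result = select_versions(["big", "small"], db)
print([iface for iface, _ in result])        # [0, 1]
print(select_versions(["big", "big"], db))   # None
```

The search is exponential in the worst case, which is tolerable for a handful of interfaces but hints at the algorithmic complexity noted above for larger systems.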

IV Evaluation Methodology

In this section, we explain the metrics and methodology used to evaluate the effectiveness of amorphous DPR over standard DPR in our vision processing pipeline framework. We use synthetic benchmarks to focus the evaluation on the fragmentation of BRAM and DSP blocks, which have been the resource bottleneck in our usage. We consider three synthetic AFU workloads (the BRAM workload, the DSP workload and the mixed workload) that focus on BRAM-only, DSP-only, and mixed BRAM/DSP, respectively.

IV-A Metrics

Placement Rate. The primary metric we present in this paper is the placement rate. For this measurement, we assume there exists a library of AFUs where each AFU has a number of bitstream versions available corresponding to different interface attachments and, in the case of amorphous DPR, also different footprint shapes. A user can demand a combination of AFUs to be in use at a time, up to the number of AXI4 interfaces available. Some combinations may not be feasible due to FPGA resource bounds. In standard DPR, a combination is not feasible when some of the demanded AFUs cannot fit into the fixed DPR partitions available. In amorphous DPR, a demanded combination is not feasible when footprint conflicts arise, that is, when a valid non-overlapping packing of the available footprints cannot be found. Placement rate is the fraction of feasible combinations for a given set of demanded combinations.
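As a rough illustration of the metric, the sketch below computes a placement rate for standard DPR under a deliberately simplified model: each AFU and each partition is described by a single resource count, and a combination is feasible if every AFU can be assigned its own partition of sufficient size. The function names and the example numbers are hypothetical; the real feasibility test also depends on which partition-specific bitstreams exist.

```python
def fits_standard(afu_sizes, partition_sizes):
    """Single-resource fit test: an assignment of AFUs to distinct partitions
    exists iff the k-th largest AFU fits the k-th largest partition."""
    afus = sorted(afu_sizes, reverse=True)
    parts = sorted(partition_sizes, reverse=True)
    if len(afus) > len(parts):
        return False
    return all(a <= p for a, p in zip(afus, parts))

def placement_rate(combinations, is_feasible):
    """Fraction of demanded combinations that are feasible."""
    feasible = sum(1 for combo in combinations if is_feasible(combo))
    return feasible / len(combinations)

# Hypothetical fixed layout of six partitions (resource units each).
partitions = [20, 15, 15, 10, 10, 10]
demanded = [
    [20, 10, 5],       # fits
    [20, 20, 5],       # second 20-unit AFU has no large enough partition
    [15, 15, 15, 5],   # fits
    [10] * 6,          # fits
]
rate = placement_rate(demanded, lambda c: fits_standard(c, partitions))
print(rate)  # 0.75
```

For amorphous DPR the same `placement_rate` function would apply with a different feasibility predicate, namely whether a non-overlapping packing of available footprints can be found.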

DPR Overhead. During DPR, the affected fabric region is not contributing to computation for a time, resulting in a loss of performance. Amorphous DPR can be faster than standard DPR because amorphous DPR reconfigures only a required footprint size. Standard DPR reconfigures the entire DPR partition regardless of the actual resource utilization within.

To quantify the difference in reconfiguration overhead, we consider an interactive scenario where the user demands a sequence of AFU combinations. Consecutive AFU combinations in the sequence differ by a fixed number of AFUs; this number is an experimental parameter that specifies how many AFUs change between consecutive combinations in a sequence. We measure DPR overhead as the total time lost to DPR over the demanded sequence. The reconfiguration process is handled through the processor configuration access port (PCAP), with an empirically observed bandwidth of 128 MByte/sec.
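A back-of-the-envelope model of this overhead is sketched below. It assumes that reconfiguration time is dominated by the bitstream transfer at the observed 128 MByte/sec, that MByte means 10^6 bytes, and that the bitstream sizes shown are hypothetical placeholders rather than measured values.

```python
PCAP_BYTES_PER_SEC = 128e6  # 128 MByte/sec from the text; 10^6 bytes/MByte assumed

def reconfig_time_ms(bitstream_bytes):
    """Transfer-only estimate of the time to load one bitstream, in ms."""
    return 1000.0 * bitstream_bytes / PCAP_BYTES_PER_SEC

def total_overhead_ms(sequence):
    """Sum the load time of every bitstream loaded across all transitions.
    `sequence` is a list of transitions; each transition is a list of the
    sizes (in bytes) of the bitstreams loaded in that transition."""
    return sum(reconfig_time_ms(size)
               for transition in sequence
               for size in transition)

# Hypothetical sequence: one AFU changes, then two AFUs change.
sequence = [[640_000], [1_280_000, 640_000]]
print(reconfig_time_ms(640_000))    # 5.0
print(total_overhead_ms(sequence))  # 20.0
```

Under this model a smaller footprint translates directly into a shorter load time, since the bitstream size scales with the reconfigured area rather than with the logic actually used.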

Keep in mind that this is a direct measurement of overhead. In practice, the overhead’s significance must be weighed against the execution interval between DPR events. Also, in measuring overheads, we assume the execution interval is synchronized such that AFUs are only changed together in between intervals. In general, the lifetimes of different AFUs need not be coupled.

IV-B Evaluation Platform

FPGA and tool. We used the Xilinx ZC702 development board with an XC7Z020 SoC FPGA for our evaluation. The XC7Z020 SoC FPGA has 53200 logic cells, 140 BRAMs and 220 DSP blocks. We used Xilinx Vivado version 2014.4 for all the builds. All designs are placed-and-routed at 100 MHz.

Static Partition. We built three instances of our parameterized vision processing pipeline framework (Fig. 1) to support the three workloads. All three static partition instances support six AFUs, but the AXI4 interfaces provided are specialized to the workload. The static partition for the BRAM workload provides AXI4 interfaces to DMA; the one for the DSP workload provides AXI4-Stream interfaces; and the one for the mixed workload provides both. When building the static partition, we manually positioned the AXI4 interfaces’ termination LUTs.

On the small XC7Z020 SoC FPGA, the static partition can consume as much as 45% of the available logic cells and 25% of the available BRAMs. Although the static partition does not make use of DSP blocks, it can still prevent some of the DSP blocks from being used by AFUs loaded into the uncommitted reconfiguration region. Table I summarizes the resources available in the uncommitted reconfiguration regions for the three workloads.

Workload   Logic Cells   BRAMs   DSP Blocks   AXI4 Interfaces
BRAM       27816         80      90           memory
DSP        23968         38      120          streaming
mixed      22712         40      80           memory+streaming
TABLE I: Resources in Uncommitted Reconfiguration Region by Workload

The real deployment of our vision processing pipeline framework is on a custom embedded board with an XC7Z045 SoC FPGA. There, the static partition supports up to 12 AXI4 interfaces for AFUs, consuming about 5% of the available logic cells and 1% of the available BRAMs. The uncommitted reconfiguration region has over 200000 logic cells, 500 BRAM blocks and 900 DSP blocks to be flexibly shared by the 12 AFUs. In our experience, we can reliably use up to around 70% of the available resources in the uncommitted reconfiguration region before the tool experiences difficulty in routing and timing-closure.

A sample screenshot of the static partition floorplan on the XC7Z045 SoC FPGA is shown in Fig. 5. The screenshot gives an indication of the relative sizes of the static partition and the uncommitted reconfiguration region. Within the uncommitted reconfiguration region, the areas enclosing the individual AXI4 interfaces are highlighted as well.

Fig. 5: A sample screenshot of the static partition floorplan on the XC7Z045 SoC FPGA. This instance of the vision processing pipeline framework (Fig. 1) supports 12 AXI4 interfaces.

DPR Partitions and Amorphous Footprints. When evaluating standard DPR, the “naive” baseline case divides the uncommitted reconfiguration region into six roughly equally resourced DPR partitions. In addition, for each experiment conducted, we tested 1000 randomized layouts of six DPR partitions that enclose different fractions of the total resources (nonsensical layouts are pruned from consideration). For each experiment, the best result from among the 1000 layouts is reported as “best-effort”. This is to approximate the results of a tuned layout when the workload mix is known ahead of time.

In order to conduct the best-effort study, a large number of bitstream versions has to be generated for each AFU, corresponding to different interface attachments and differently shaped DPR partitions. We directly adopted this collection of bitstreams as the bitstream database for amorphous DPR. As such, in our evaluations, amorphous DPR can always match the results of best-effort standard DPR by using the corresponding selection of AFU bitstreams. Amorphous DPR can exceed best-effort standard DPR because it can also combine bitstreams arising from different layouts, whereas standard DPR is limited to one fixed layout at a time.

In a real scenario, instead of generating a large number of random footprint bitstream versions, one would strategically maintain a much smaller number of well-chosen footprints following heuristics such as to pack tightly around the reserved interface region and to obey handedness when consuming a fraction of a column (i.e., consume from the bottom if reaching from the right and vice versa).

IV-C Synthetic Workloads

Below we describe the three synthetic AFU workloads (the BRAM workload, the DSP workload and the mixed workload) that focus on BRAM-only, DSP-only, and mixed BRAM/DSP, respectively. Each workload has three variants of different degrees of difficulty.

BRAM workload. We used Vivado HLS to develop a simple AFU design to read a large number of values from DRAM into BRAM and to compute the sum of those values. The AFU design is parameterizable to use different numbers of BRAMs. We constructed a library comprising “different” AFU instances utilizing between 0 and 40 BRAMs in increments of 5 BRAMs. (The uncommitted reconfiguration region in the static partition for this workload has 80 available BRAMs total. AFUs with more than 40 BRAMs almost always result in failed synthesis even for the largest DPR partition/footprint considered.) From this library, we randomly select AFUs to form the demanded AFU combinations to measure placement rate and overhead. Selecting a 0-BRAM AFU corresponds to a combination where fewer than six AFUs are demanded. As in our real usage experience, the AFUs use relatively few logic cell resources so their fragmentation and conflicts are not considered in this evaluation; this applies to all three workloads studied.

The advantage of amorphous DPR over standard DPR depends on resource utilization pressure. Therefore, for each workload, we considered three variants with increasing degrees of difficulty. For the least difficult variant, we restricted AFU selection to come from AFUs utilizing 0 up to 20 BRAMs. The selected AFUs on average utilize fewer BRAMs than the average number of BRAMs available to each interface. For the two more difficult variants, we raise the BRAM ceiling to 30 and 40, respectively.
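The comparison between average demand and average availability can be checked with a short calculation. It assumes, as a labeled assumption not stated in the text, that AFUs are drawn uniformly from the library sizes 0, 5, 10, 15 and 20 BRAMs; the 80 available BRAMs and six interfaces are the figures given for this workload.

```python
# Library sizes allowed in the least difficult BRAM variant (0..20, step 5).
sizes = list(range(0, 21, 5))

# Assumption: uniform random selection, so the mean demand is the plain average.
mean_demand = sum(sizes) / len(sizes)

# 80 BRAMs in the uncommitted region shared by six interfaces.
per_interface = 80 / 6

print(mean_demand)               # 10.0
print(round(per_interface, 1))   # 13.3
print(mean_demand < per_interface)  # True
```

Under the same uniform assumption, raising the ceiling to 30 and 40 BRAMs raises the mean demand to 15 and 20 BRAMs respectively, both above the per-interface average, which is consistent with these variants being harder.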

DSP workload. We used the FFT IP with AXI4-Stream interface from Vivado’s IP Library. The FFT IP is parameterizable to use different numbers of DSP blocks. Similar to the BRAM workload, we constructed a library comprising “different” AFU instances utilizing between 0 and 50 DSP blocks in increments of 5 DSP blocks. For the three variants of increasing difficulty, we restricted AFU selections to come from AFUs utilizing a maximum of 30, 40, and 50 DSP blocks, respectively. The uncommitted reconfiguration region in the static partition for this workload has 120 available DSP blocks total.

Mixed workload. This last workload mixes AFUs from the two previous workloads. For the three variants of increasing difficulty, we restrict AFU selection to come from AFUs utilizing either a maximum of 20, 30 or 40 BRAMs, or a maximum of 20, 30, or 40 DSP blocks, respectively. The uncommitted reconfiguration region in the static partition for this workload has 40 available BRAMs and 80 available DSP blocks total.

V Results

This section presents the outcomes of the evaluations outlined in the last section.

V-A Placement Rates

Following the procedures described in the last section, for each placement rate measurement, we generated 1000 AFU combinations, each with up to six AFUs randomly selected according to workload and degree of difficulty.

Fig. 6: Comparing the placement rates achieved by naive standard DPR vs. best-effort standard DPR vs. amorphous DPR.

Fig. 6 reports the placement rates (y-axis) for naive standard DPR vs. best-effort standard DPR vs. amorphous DPR in experiments corresponding to different workloads (separated by plots) and degrees of difficulty (x-axis). The placement rate for naive standard DPR is poor even for the least difficult variant of the workloads. The least difficult workload variants are set up such that the AFU average resource requirement is just less than the resource available in a naive standard DPR partition. However, a combination fails if any of the six AFUs is above average. By “tuning” the DPR partition sizes according to workload at build time, best-effort standard DPR does well on the least difficult variant of the workloads (up to 80% placement rate) but is unable to cope with the utilization pressure as the degree of difficulty increases.

Amorphous DPR achieves over 80% placement rate on all workload variants, except for two variants of one workload (see Fig. 6), which are at over 70%. More telling than the absolute values are the improvements from standard DPR to amorphous DPR. As expected, we observe that amorphous DPR yields greater improvement as the degree of difficulty increases. The very significant differences on the more difficult variants translate tangibly to a much greater effective usable capacity in a dynamically managed multi-AFU fabric use-case like our vision processing pipeline framework.

V-B Reconfiguration Overhead

To evaluate reconfiguration overhead, we randomly constructed 1000-combination long sequences. The sequences include only combinations that are valid in both best-effort standard DPR and amorphous DPR. Fig. 7 reports, for one workload, the average reconfiguration time (y-axis), in milliseconds, spent in transitioning between consecutive combinations. We report results when using best-effort standard DPR vs. amorphous DPR in experiments corresponding to different degrees of difficulty (x-axis). We do not report results for naive standard DPR because it accepts too few combinations to be included for comparison. The separate plots in Fig. 7 correspond to results using sequences in which 1, 2, 3, and 4 AFUs change between consecutive combinations, respectively. Fig. 8 and Fig. 9 report the results for the other two workloads. Plots for some values of this parameter are missing because not enough combinations are acceptable under standard DPR to make meaningful comparisons.

Fig. 7: Comparing the average reconfiguration times per transition, for one of the three workloads, when using best-effort standard DPR vs. amorphous DPR.
Fig. 8: Comparing the average reconfiguration times per transition, for one of the three workloads, when using best-effort standard DPR vs. amorphous DPR.
Fig. 9: Comparing the average reconfiguration times per transition, for one of the three workloads, when using best-effort standard DPR vs. amorphous DPR.

The average reconfiguration time spent in transitioning between consecutive combinations correlates most strongly with the bitstream size of loaded AFUs. We observe that the average reconfiguration time increases directly with the number of AFUs changed. The average reconfiguration time is also sensitive to the degree of difficulty, which affects the range of AFU sizes involved. The ratios of average reconfiguration time of best-effort standard DPR over amorphous DPR are between 1.1x and 1.5x. This ratio corresponds well with the ratios of their respective bitstream sizes. Though not reported, naive standard DPR would do much worse than both best-effort standard DPR and amorphous DPR because its six equally resourced DPR partitions would quite often be larger than necessary for the AFUs, due to variations in AFU resource requirements.

We also measured the energy overhead due to DPR using the Texas Instruments digital power controllers on the Xilinx ZC702 development board. The energy overhead of DPR is also mainly a function of the size of the bitstreams loaded. Therefore, the comparisons of energy overhead mirror those of the time overhead reported above.

VI Related Work

Working DPR Systems. While DPR has not seen ubiquitous use, the technology has been shown to be effective in a number of projects. DPR has been used to dynamically reuse, adapt, or customize the datapath over the same fabric resources to improve performance without incurring additional cost in fabric area (e.g., [6, 7, 8]). Along the lines of our motivating use-case, past systems that divided the fabric and managed its use as DPR partitions include [2, 3, 4]. These examples operate with fixed DPR partition layouts and experience the fragmentation inefficiencies that this paper addresses with amorphous DPR.
DPR Scheduling and Defragmentation. There is an accumulated body of past work (e.g., [9, 10, 11, 12, 13]) addressing resource scheduling and defragmentation in contexts similar to our motivating use-case of dynamically managed multi-AFU operations. Examples like [14, 15, 16] proactively defragment the fabric by relocating AFUs at runtime. These past works have predominantly focused on algorithmic solutions to a formalization of the problem where AFUs are dealt with as geometrical shapes to be packed into a two-dimensional area that represents the fabric. There is comparatively much less work on addressing DPR resource scheduling and fragmentation under working technology and implementation assumptions (e.g., [17]). On the other hand, amorphous DPR is presented in this paper as a mechanism; it needs the support of further study in algorithms to optimize footprint generation at build time and footprint scheduling/selection at runtime.
Dessouky et al. developed an interesting orthogonal approach to efficiently share BRAM without fragmentation [18]. Their hardware runtime system manages BRAMs centrally as a pooled resource and supports AFUs with managed virtualized access.
Architecture and Tools. Support for DPR has steadily improved with each new generation of commercial devices and tools (e.g., from 7-series to Ultrascale [19]). In research, Compton et al. proposed a new FPGA architecture with direct support for module relocation and defragmentation [20]. Koch et al. proposed the Go Ahead tool for Xilinx devices to facilitate the development of efficient DPR-enabled designs [21].

VII Conclusion

Even as FPGAs are increasingly used in computing, they are nevertheless deployed much more like ASICs than processors. The programmability of FPGAs, especially DPR, is still a very much under-tapped capability in casting FPGAs as a much more dynamic and flexible shared resource in computing use. This paper is motivated by such a dynamic usage context where an FPGA's fabric is spatially and temporally shared by multiple AFUs using DPR. This type of system is possible using today's commercial devices and tools. However, standard DPR discipline requires dividing the fabric resources into fixed DPR partitions, creating fragmentation. We presented an amorphous DPR technique that avoids the need to make an upfront commitment to fixed DPR partition boundaries, and hence avoids resource fragmentation. Our evaluation shows amorphous DPR can greatly improve the effective usable capacity in a dynamically managed multi-AFU fabric use-case.

References

  • [1] Xilinx, “Vivado Design Suite User Guide: Partial Reconfiguration (UG909),” 2016.
  • [2] D. Goehringer, L. Meder, M. Hubner, and J. Becker, “Adaptive Multi-client Network-on-Chip Memory,” in 2011 International Conference on Reconfigurable Computing and FPGAs, pp. 7–12, Nov 2011.
  • [3] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow, “FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack,” in 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 109–116, May 2014.
  • [4] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda, “The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-based Computer,” J. VLSI Signal Process. Syst., vol. 47, pp. 15–31, Apr. 2007.
  • [5] Xilinx, “Zynq-7000 All Programmable SoC Technical Reference Manual,” 2016.
  • [6] J. Arram, W. Luk, and P. Jiang, “Ramethy: Reconfigurable Acceleration of Bisulfite Sequence Alignment,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, (New York, NY, USA), pp. 250–259, ACM, 2015.
  • [7] D. Koch and J. Torresen, “FPGASort: A High Performance Sorting Architecture Exploiting Run-time Reconfiguration on FPGAs for Large Problem Sorting,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, (New York, NY, USA), pp. 45–54, ACM, 2011.
  • [8] X. Niu, W. Luk, and Y. Wang, “EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, (New York, NY, USA), pp. 74–83, ACM, 2015.
  • [9] K. Sigdel, C. Galuzzi, K. Bertels, M. Thompson, and A. D. Pimentel, “Runtime Task Mapping Based on Hardware Configuration Reuse,” in 2010 International Conference on Reconfigurable Computing and FPGAs, pp. 25–30, Dec 2010.
  • [10] H. Walder and M. Platzner, “Online scheduling for block-partitioned reconfigurable devices,” in 2003 Design, Automation and Test in Europe Conference and Exhibition, pp. 290–295, 2003.
  • [11] C. Steiger, H. Walder, and M. Platzner, “Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks,” IEEE Transactions on Computers, vol. 53, pp. 1393–1407, Nov 2004.
  • [12] N. B. Grigore and D. Koch, “Placing partially reconfigurable stream processing applications on FPGAs,” in 2015 25th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4, Sept 2015.
  • [13] P.-A. Hsiung, C.-H. Huang, and Y.-H. Chen, “Hardware Task Scheduling and Placement in Operating Systems for Dynamically Reconfigurable SoC,” J. Embedded Comput., vol. 3, pp. 53–62, Jan. 2009.
  • [14] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt, “Dynamic scheduling of tasks on partially reconfigurable FPGAs,” IEE Proceedings - Computers and Digital Techniques, vol. 147, pp. 181–188, May 2000.
  • [15] M. Koester, H. Kalte, M. Porrmann, and U. Rückert, Defragmentation Algorithms for Partially Reconfigurable Hardware, pp. 41–53. Boston, MA: Springer US, 2007.
  • [16] S. P. Fekete, T. Kamphans, N. Schweer, C. Tessars, J. C. van der Veen, J. Angermeier, D. Koch, and J. Teich, “Dynamic Defragmentation of Reconfigurable Devices,” ACM Trans. Reconfigurable Technol. Syst., vol. 5, pp. 8:1–8:20, June 2012.
  • [17] D. Koch and C. Beckhoff, “Hierarchical reconfiguration of FPGAs,” in 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8, Sept 2014.
  • [18] G. Dessouky, M. J. Klaiber, D. G. Bailey, and S. Simon, “Adaptive Dynamic On-chip Memory Management for FPGA-based reconfigurable architectures,” in 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8, Sept 2014.
  • [19] Xilinx, “UltraScale Architecture and Product Data Sheet: Overview (DS890),” 2017.
  • [20] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration relocation and defragmentation for run-time reconfigurable computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, pp. 209–220, June 2002.
  • [21] C. Beckhoff, D. Koch, and J. Torresen, “Go Ahead: A Partial Reconfiguration Framework,” in 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pp. 37–44, April 2012.

VIII Acknowledgements

Funding for this work was provided by NSF CNS-1446601. We thank the members of the SmartHeadlight project and members of CALCM for their comments and feedback. We thank Xilinx for their FPGA and tool donations.