Partial Reconfiguration for Design Optimization

07/31/2020 ∙ by Marie Nguyen, et al. ∙ 0

FPGA designers have traditionally shared a similar design methodology with ASIC designers. Most notably, at design time, FPGA designers commit to a fixed allocation of logic resources to modules in a design. At runtime, some of the occupied resources could be left idle or under-utilized due to hard-to-avoid sources of inefficiencies (e.g., operation dependencies). With partial reconfiguration (PR), FPGA resources can be re-allocated over time. Therefore, using PR, a designer can attempt to reduce idleness and under-utilization with better area-time scheduling. In this paper, we explain when, how, and why PR-style designs can improve over the performance-area Pareto front of ASIC-style designs (without PR). We first introduce the concept of area-time volume to explain why PR-style designs can improve upon ASIC-style designs. We identify resource under-utilization as an opportunity that can be exploited by PR-style designs. We then present a first-order analytical model to help a designer decide if a PR-style design can be beneficial. When it is the case, the model points to the most suitable PR execution strategy and provides an estimate of the improvement. The model is validated in three case studies.


I Introduction

Motivations. Today, with growing emphasis on deploying Field Programmable Gate Arrays (FPGAs) for computing, we are starting to see FPGAs’ reprogrammability being recognized as a deciding feature in selecting FPGAs over ASICs [10]. Yet, partial reconfiguration (PR), which allows parts of an FPGA to be reconfigured at millisecond timescales, remains an under-appreciated capability.

This paper explores the questions of when, how, and why FPGA designers should consider using PR. The discussions in this paper focus on the use of PR in challenging design scenarios that have to deliver required performance under strict area, cost, power, and energy constraints (e.g., [50, 28]). This work is particularly relevant to AI-driven applications at the Edge (e.g., [35, 1, 50, 28]) that (1) are deployed on low-end FPGAs due to cost, power, and size concerns, and (2) need to accelerate many compute intensive tasks with stringent latency or throughput requirements ([50], [12, 51]).

Shortcomings of ASIC-Style Designs. To accelerate these constrained applications on the FPGA, designers typically commit, at design time, to a fixed allocation of logic resources to modules. We refer to this design as an ASIC-style design. At runtime, some of the occupied resources could be left idle or under-utilized due to hard-to-avoid sources of inefficiencies (e.g., operation dependencies) which may occur even in a highly-optimized design. Under-utilization may result in (1) the design not running at the desired performance given an area budget, or (2) the design running at the desired performance but being too big to fit in the given area.

Fig. 1: In an ASIC-style design, logic resources that are inactive still occupy the fabric. In a PR-style design, under-utilization can be reduced with better area-time scheduling.

PR-Style Designs to Reduce Under-Utilization. Using PR, a designer can attempt to reduce under-utilization by changing the allocation of resources over time. In this paper, we identify under-utilization of resources as an opportunity that can be exploited by PR-style designs to improve upon ASIC-style designs. We refer to a PR-style design as a design in which logic resources are allocated to different modules of one design over time. In return, a PR-style design may be faster and/or smaller than an ASIC-style design (illustration in Figure 1).

This work: when, how and why PR.

To address the questions of when, how, and why PR, this paper develops a set of PR execution strategies (allocation and scheduling) applicable to a range of non-trivial applications. An application consists of a set of tasks, and each task is accelerated by a hardware module. Modules can be dependent, execute concurrently, and have multiple implementation variants with different performance-area trade-offs. Dependent modules share data either through (1) external memory or (2) on-chip memory. The paper proposes a first-order analytical model to help a designer (1) determine a suitable PR execution strategy and (2) analyze the throughput and latency of ASIC-style and PR-style designs. The model enables quick exploration of the design space to help decide if a PR-style design can be beneficial for a given problem. The effectiveness of this model is examined in three compute-bound case studies involving computer vision and machine learning tasks.

The contributions of this paper are:

  • developing a set of PR execution strategies for practical design scenarios

  • developing a first-order performance model to estimate ASIC-style and PR-style designs’ performance

  • demonstrating the effectiveness of the performance model with three case studies of implemented designs.

II Background and Related Work

Partial Reconfiguration. When using PR, the FPGA fabric is divided into a non-reconfigurable region (containing the I/O infrastructure) and PR regions that can be reprogrammed individually at runtime with partial bitstreams built for each region at design time. When loading bitstreams from on-board DRAM, the time to load a PR region (PR time) is a function of the bitstream size; e.g., bitstreams load at approximately 453 MB/sec on an UltraScale+ device through the processor configuration access port (PCAP).
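As a rough illustration of how PR time follows from bitstream size, the sketch below divides a bitstream size by the quoted load rate. The function name and the assumption of a constant 453 MB/sec rate with no fixed overhead are ours; the 5.5 MB and 2.8 MB sizes are the partial bitstream sizes reported in Section V.

```python
def pr_time_ms(bitstream_mb: float, load_rate_mb_per_s: float = 453.0) -> float:
    """Estimate the time (in ms) to load a partial bitstream.

    Assumes a constant load rate and no fixed overhead (a simplification).
    """
    return bitstream_mb / load_rate_mb_per_s * 1000.0


if __name__ == "__main__":
    for size_mb in (5.5, 2.8):
        print(f"{size_mb} MB bitstream: about {pr_time_ms(size_mb):.1f} ms")
```

Under these assumptions the estimates land close to the 12 ms and 6 ms PR times reported for the two PR region sizes in Section V.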

Applications of PR Today. Many academic projects have explored the potential of using PR (e.g., [34, 32, 15, 9, 18, 25, 6, 16, 39, 21, 22, 46, 45]). Commercially, PR has been mainly used in a “role-and-shell” approach ([10, 3]). A static shell design provides I/O and isolation while independent designs with different functionalities, or roles, can be loaded as required (e.g., [10, 3]). The different role designs reuse the same logic resources over time. However, each role is still an ASIC-style design. This paper does not focus on using PR in a role-and-shell approach.

PR-Style Benefits. In a PR-style design, the designer decides how the FPGA is divided into PR regions and when/which reconfigurations are needed during the design’s execution. For instance, in [37], to accelerate a vision processing pipeline, a PR region is reconfigured every few milliseconds with different pipeline stages.

For domain-specific applications, other prior works have exploited under-utilization in ASIC-style designs, and have shown that using PR can provide area [13], performance [30, 4, 43], power/energy [38, 36], and compilation time reduction [33] benefits. For instance, in adaptive [29] or cloud computing applications ([17, 11, 8]), multiple modes or implementation variants exist for a module but only one is needed at a time depending on the context. Instead of mapping all variants of a module in an ASIC-style design, only one variant is reprogrammed on the fabric at a time.

Scheduling for PR-Style. A vast body of work on FPGA OSes ([27, 40, 2, 42, 19]) and on FPGA virtualization ([47, 8, 17]) has focused on the theory of spatial and temporal sharing, mechanisms for task preemption, or hardware and software task scheduling [5]. Mostly, these works share the common goal of maximizing resource utilization to improve throughput, and often assume that PR time is negligible compared to compute, and/or that tasks are independent.

Building on top of prior work, this paper introduces the concept of area-time volume to make clear why PR-style designs can be beneficial. We also give practical examples of when it is the case considering both throughput, which is the metric to optimize in many applications (e.g., video analytics [35, 50], batch jobs [1, 7]), and latency, the metric of interest for an emerging class of Edge applications that have tolerance for 100 ms-response time, and that could benefit from FPGA acceleration ([12, 51]). We also account for cases where PR time can be equal to or greater than compute time.

III When and how can PR help?

In this section, we use an idealized and simplified example to develop the intuitions behind when and how PR-style designs can be faster or smaller than ASIC-style designs. The next section continues with a more complete examination.

III-A Simplified Execution Model

We consider an application with two dependent tasks; the second task can start only after the first is finished. Each task runs once per execution of the application. The latency of the application is the sum of the two dependent tasks' latencies. Multiple implementation module variants exist for each task, and each task is characterized by a latency function that gives the latency achieved by the module variant for that task using a given amount of logic resources. For a given task, larger variants have lower latency.

III-B ASIC-Style Design

Consider two common design objectives: (1) minimize latency given an area budget, or (2) minimize area given a latency upper bound. For simplicity, assume that the two tasks have the same latency function, i.e., the same latency for any given amount of resources. In that case, to achieve optimality in either objective, the total logic resources must be equally divided between the two tasks' modules (each receives 0.5 of the total). The latency of the application is then twice the latency of a module using half of the total resources. Solving either optimization scenario repeatedly for different latency or area targets will produce a set of ASIC-style implementations that trade off latency against logic resources. Starting from this, we ask the question: can a PR-style design improve over the Pareto front of an ASIC-style design?

III-C PR-Style Design

The above scenario for the ASIC-style design is shown in Figure 1.a. In this area-time volume representation of the FPGA, the fabric area is 100% occupied by the modules for the two tasks. However, due to the dependency between the two modules, only one of the two modules is active at a time. In other words, the ASIC-style design has under-utilization since some resources available to the design are not active all the time.

In contrast to an ASIC-style design where resource allocation cannot change over time, it is possible to reduce under-utilization with better area-time scheduling in a PR-style design. Therefore, a PR-style design may be able to achieve a smaller area-time volume by being faster, by using fewer resources, or both. For instance, to minimize latency given the same area budget, we can allocate the entire area budget to a module for the first task and then to a module for the second task (Figure 1.b). By doing so, the PR-style design's latency is reduced as both modules now run faster using all of the resources available. On the other hand, a PR-style design can maintain the same latency using half the resources by allocating half of the area budget to a module for the first task and then to a module for the second task (Figure 1.c). With under-utilization reduced, both PR-style designs fit into smaller area-time volumes than the ASIC-style design. Notice that in Figure 1.b and Figure 1.c, a small amount of under-utilization appears when switching between modules to reflect the non-zero delay to perform PR.
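The comparison above can be made concrete with a small sketch. It assumes a hypothetical latency model in which a module's latency is inversely proportional to its area (one of many shapes the latency function could take) and ignores PR time; the names `latency`, `area_time_volume`, `A`, and `WORK` are illustrative and not from the paper.

```python
def latency(work: float, area: float) -> float:
    """Hypothetical latency model: latency scales inversely with area."""
    return work / area


def area_time_volume(area: float, time: float) -> float:
    """Area occupied multiplied by the time it stays occupied."""
    return area * time


A = 1.0      # total area budget (normalized)
WORK = 1.0   # work per task (normalized, identical for both tasks)

# ASIC-style: both modules resident, each on half the area, run back to back.
asic_time = 2 * latency(WORK, 0.5 * A)
# PR-style, same area: each module gets the whole budget in turn.
pr_fast_time = 2 * latency(WORK, A)
# PR-style, half area: each module gets half the budget in turn.
pr_small_time = 2 * latency(WORK, 0.5 * A)

volumes = {
    "asic": area_time_volume(A, asic_time),
    "pr_fast": area_time_volume(A, pr_fast_time),
    "pr_small": area_time_volume(0.5 * A, pr_small_time),
}

if __name__ == "__main__":
    for name, volume in volumes.items():
        print(f"{name}: area-time volume = {volume}")
```

Under this idealized model, both PR-style options halve the area-time volume: one by halving time at equal area, the other by halving area at equal time.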

III-D Opportunities for Improvement by PR

In ASIC-style designs, resource under-utilization stemming from data dependencies cannot be eliminated without changing the initial algorithm or implementation. In practice, under-utilization can arise in other forms. In our simplified example, we assume that module variants exist for any amount of resources. However, module variants for a task only exist at certain performance/resource combinations in practice. The modules selected to fit an area budget in an ASIC-style design may not sum up perfectly to use all resources. Further, when the modules of the two tasks are executed in a pipelined fashion to improve the throughput of many independent executions, it may not be possible to find variants with equal throughput for the two tasks; in the resulting unbalanced pipeline, a too-fast stage has to stop or slow down to wait for the other stage. A more subtle example exists when implementing a generic engine capable of accelerating different algorithms or neural networks. This generalized engine consists of a superset of features to accommodate all possibilities, but only a subset of features is needed at a time (e.g., NPU [20], DPU [53]). A PR-style design could potentially remove this type of inefficiency.

IV Analytical Model

In this section, we present our model and discuss the additional memory requirements of a PR-style design and the impact of limited memory bandwidth on a design's performance.

Fig. 2: Example timeline of an application with three dependent tasks accelerated by three modules.

IV-A Overview

Optimization Goals. To derive our performance model, we consider two forms of the problem of maximizing an application's performance given an area budget:

  • minimize the application's latency given an area budget A. We label this problem as min L given A.

  • maximize the application's throughput given an area budget A. We label this problem as max T given A.

Execution Model. In this section, we consider an application with dependent tasks; each task is accelerated by a module. Dependent modules share data either through external or on-chip memory depending on data size. Though our discussion focuses on applications with dependent tasks, our model also applies if tasks are independent. Tasks in the application are identified by subscripts. A single start-to-finish execution of a module is referred to as a run. If an application requires multiple independent runs, modules can execute concurrently. Figure 2 illustrates this execution model. The example application consists of three dependent tasks accelerated by three modules. In this application, each module needs to complete three runs. Modules execute concurrently to complete the runs as quickly as possible, subject to the dependency constraints.

We consider two performance metrics, latency and throughput. Latency is defined as the start-to-finish time required for all modules accelerating an application to complete one run (including I/O time for data read and write and compute time). Throughput is defined as the number of runs completed per unit time in steady-state.

Performance-Area Trade-offs. For each module, a finite set of implementation variants exists. A variant accelerating a task is characterized by its area, and by its latency and throughput as functions of area. We assume that performance increases monotonically with area but make no further assumption on its shape, e.g., performance can scale sub-linearly or linearly with area.

PR-Style Design Considerations. We define PR time as the time to reconfigure a PR region of a given size, and assume that PR time is proportional to the PR region size.

Fig. 3: In an ASIC-style design, dependent modules share data through either external (blue) or on-chip memory (orange) depending on data size.

IV-B ASIC-Style

We first derive the equations for the ASIC-style design that are applicable whether dependent modules share data through external or on-chip memory (Figure 3). In both cases, the number of buffers required to hold intermediate data is .

Min L Given A. Let be the latency of the ASIC-style design given resources.

(1)

Max T Given A. Let be the throughput of the ASIC-style design given resources.

(2)
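A minimal sketch of one reading of Equations 1 and 2, based on the surrounding prose: the latency of dependent modules is the sum of the module latencies, and the steady-state throughput of concurrently executing modules is limited by the slowest module. The function names are ours; the example values are the module latencies of Table V and the module throughputs of Table II.

```python
from typing import Sequence


def asic_latency_ms(module_latencies_ms: Sequence[float]) -> float:
    """Dependent modules run one after another: latencies add up."""
    return sum(module_latencies_ms)


def asic_throughput_fps(module_throughputs_fps: Sequence[float]) -> float:
    """Concurrent (pipelined) modules: the slowest stage sets the pace."""
    return min(module_throughputs_fps)


if __name__ == "__main__":
    # hog, stereo, flow frame latencies (ms) from Table V
    print(f"latency: {asic_latency_ms([17.8, 16.7, 22.2]):.1f} ms")
    # hog, cnn, lstm throughputs (fps) from Table II
    print(f"throughput: {asic_throughput_fps([30, 16, 271])} fps")
```

Both results agree with the totals reported for the ASIC-style designs in Tables V and II.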

IV-C Ignoring PR Time: PR-Style Performance Bounds

Ignoring PR time, we first derive the lower and upper bounds on the latency and throughput, respectively, achievable by any PR-style design presented in the next subsections. The simplest and most efficient execution strategy is to schedule tasks serially on one PR region. Each module runs once before the PR region is reconfigured with the next module. In the best-case scenario, the PR region spans the entire area budget and the highest-performance variant using all of these resources exists for all modules.

Min L Given A. Let be the lower bound on latency for the PR-style design with one PR region.

(3)

Max T Given A. Let be the upper bound on throughput for the PR-style design with one PR region.

(4)
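One possible reading of the bounds in Equations 3 and 4, sketched under the assumption that modules run serially, one run each, on a single PR region with zero PR time. The function names are hypothetical; the example values are the hog, cnn, and lstm variants of Table III.

```python
from typing import Sequence


def pr_latency_lower_bound_ms(latencies_ms: Sequence[float]) -> float:
    """Serial execution with zero PR time: latencies of the variants add up."""
    return sum(latencies_ms)


def pr_throughput_upper_bound_fps(throughputs_fps: Sequence[float]) -> float:
    """One run of every module takes sum(1 / throughput) seconds."""
    return 1.0 / sum(1.0 / t for t in throughputs_fps)


if __name__ == "__main__":
    # hog, cnn, lstm frame latencies (ms) and throughputs (fps) from Table III
    print(f"latency bound: {pr_latency_lower_bound_ms([8.6, 31.2, 0.48]):.2f} ms")
    print(f"throughput bound: {pr_throughput_upper_bound_fps([116, 32, 2100]):.1f} fps")
```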
Fig. 4: Example of serialized execution in a PR-style design with one PR region when batching.

IV-D Including PR Time: Serialized Execution on One PR Region

When accounting for PR time and scheduling tasks serially on one PR region, each module runs once before the PR region is reconfigured with the next module. Given tasks, the PR region is reconfigured times. Compute and reconfigurations are serialized.

Min L Given A. Let be the latency of the PR-style design with one PR region.

(5)

Scheduling tasks serially on one PR region of the largest size may not result in the design’s minimum latency. Though using larger variants leads to a decrease in compute time, it also has the effect of increasing PR time, which may offset the speedup benefit of larger variants. In the next subsection, we discuss a scheduling alternative where compute and reconfigurations are overlapped.
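The trade-off described above can be explored with a small sketch of serialized execution. It assumes one reconfiguration per task and strictly serialized compute and reconfiguration, which is our reading of the text rather than a restatement of Equation 5. The function name is hypothetical; the example values are the variant latencies of Tables III and IV and the PR times of Table I.

```python
from typing import Sequence


def serialized_latency_ms(latencies_ms: Sequence[float], pr_time_ms: float) -> float:
    """Each task pays one reconfiguration followed by one run (assumption)."""
    return sum(pr_time_ms + latency for latency in latencies_ms)


if __name__ == "__main__":
    # Large PR region: faster variants (Table III) but 12 ms per reconfiguration.
    large = serialized_latency_ms([8.6, 31.2, 0.48], pr_time_ms=12.0)
    # Smaller PR region: slower variants (Table IV) but 6 ms per reconfiguration.
    small = serialized_latency_ms([17.9, 62.5, 0.87], pr_time_ms=6.0)
    print(f"large region: {large:.2f} ms, smaller region: {small:.2f} ms")
```

Evaluating both region sizes this way shows how the compute speedup of larger variants is weighed against their longer PR time.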

Max T Given A: Batching to Amortize PR Time. Let be the steady-state throughput of the PR-style design with one PR region.

(6)

If PR time is non-trivial compared to compute time, we can amortize PR time by executing each module multiple times (i.e., batching runs) before reconfiguring the PR region (Figure 4). Equation 7 gives the steady-state throughput of the PR-style design with one PR region when batching runs.

(7)

Batching allows us to reduce the ratio of total PR time to total compute time at a greater resource cost to buffer intermediate results. Given enough buffering capacity, PR time can be almost totally amortized for a large enough batch size.
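A sketch of how batching amortizes PR time, under the assumption that each module processes a whole batch and then the region is reconfigured once for the next module. This is one reading of Equation 7; the function name and the batch sizes tried are ours, while the throughputs come from Table III and the 12 ms PR time from Table I.

```python
from typing import Sequence


def batched_throughput_fps(
    throughputs_fps: Sequence[float], pr_time_s: float, batch_size: int
) -> float:
    """Runs completed per second when every module processes a batch
    of `batch_size` runs between reconfigurations (assumed schedule)."""
    time_per_batch_s = sum(batch_size / t + pr_time_s for t in throughputs_fps)
    return batch_size / time_per_batch_s


if __name__ == "__main__":
    variants_fps = [116, 32, 2100]  # hog, cnn, lstm (Table III)
    for n in (1, 10, 100, 1000):
        fps = batched_throughput_fps(variants_fps, pr_time_s=0.012, batch_size=n)
        print(f"batch size {n}: {fps:.1f} fps")
```

As the batch size grows, the estimate approaches the zero-PR-time bound, with diminishing returns once compute time dominates each batch.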

IV-E Including PR Time: Special Cases

Min L Given A: Interleaved Execution on Two PR Regions. When optimizing for latency, interleaving task execution on multiple PR regions allows us to overlap reconfigurations and compute to hide PR time, which may result in better latency than serializing task execution on one PR region. Figure 5 shows an example of interleaved execution in which, by overlapping compute and reconfigurations, PR time is completely hidden. Having more than two PR regions may be beneficial provided that multiple PR regions can be reconfigured simultaneously. However, simultaneous reconfiguration of multiple PR regions is not supported from a user standpoint using current FPGA tools and PR flow. In this paper, we therefore only consider the case of two PR regions, whose latency is given by Equation 8.

(8)
Fig. 5: Interleaved execution on two PR regions. PR time can be hidden by overlapping compute and reconfiguration.
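The interleaved schedule can be sketched as a small timeline simulation. The scheduling rules below are assumptions made for illustration, not a restatement of Equation 8: modules alternate between two regions, only one reconfiguration can be in flight at a time, a region can be reconfigured only after its previous module has finished, and the first reconfiguration starts at time zero.

```python
from typing import Sequence


def interleaved_latency_ms(latencies_ms: Sequence[float], pr_time_ms: float) -> float:
    """Simulate dependent modules alternating between two PR regions."""
    port_free = 0.0             # time at which the configuration port is free
    region_free = [0.0, 0.0]    # time at which each region's last module finished
    prev_done = 0.0             # finish time of the previous (upstream) module
    for i, latency in enumerate(latencies_ms):
        region = i % 2
        # Reconfigure as soon as both the port and the target region are free.
        cfg_done = max(port_free, region_free[region]) + pr_time_ms
        port_free = cfg_done
        # Compute needs the region configured and the upstream module done.
        start = max(cfg_done, prev_done)
        prev_done = start + latency
        region_free[region] = prev_done
    return prev_done


if __name__ == "__main__":
    # hog, cnn, lstm variants of Table IV with the 6 ms PR time of Table I
    print(f"{interleaved_latency_ms([17.9, 62.5, 0.87], 6.0):.2f} ms")
```

With these inputs only the first reconfiguration is exposed; later reconfigurations overlap with the compute of the other region. When PR time exceeds compute time, the same simulation shows reconfiguration becoming the critical path.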

Max T Given A: Serialized Execution on Multiple PR Regions. When optimizing for throughput, it is generally preferable to choose the smallest number of PR regions to reduce a design's complexity in terms of buffering management, since each PR region requires its own intermediate buffer. A solution with multiple PR regions should be considered when appropriately large module variants are not available for all modules in a single PR region solution.

When multiple PR regions execute in parallel (similar to multi-way SIMD), task execution can be serialized on each PR region. On each PR region, each module runs once or multiple times before the PR region is reconfigured. Assuming that reconfigurations can occur simultaneously, Equation 9 expresses the steady-state throughput of the PR-style design with multiple PR regions in terms of the steady-state throughput of a single PR region.

(9)

As explained previously, only one reconfiguration can happen at a time using current tools. The above throughput can still be achieved by offsetting the start of compute on each PR region by a sufficient number of PR times to ensure that two PR regions are not reconfigured simultaneously.
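A minimal sketch of this idea, assuming identical PR regions whose throughputs simply add and a stagger of one PR time per region. Both are illustrative assumptions; the text only requires the offset to be a sufficient number of PR times. All names are hypothetical.

```python
from typing import List


def multi_region_throughput_fps(num_regions: int, single_region_fps: float) -> float:
    """Identical regions working in parallel: throughputs add up (assumption)."""
    return num_regions * single_region_fps


def staggered_start_offsets_ms(num_regions: int, pr_time_ms: float) -> List[float]:
    """Offset each region's schedule so that reconfigurations never coincide.

    A stagger of one PR time per region is assumed here for illustration.
    """
    return [i * pr_time_ms for i in range(num_regions)]


if __name__ == "__main__":
    print(multi_region_throughput_fps(2, 10.0))
    print(staggered_start_offsets_ms(2, 6.0))
```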

IV-F Memory Requirements in PR-Style Designs

In this section, we discuss the buffering and memory bandwidth requirements of a PR-style design. Compared to an ASIC-style design, a PR-style design requires additional buffering capacity for batching and additional external memory bandwidth when faster module variants are used. A module variant is faster if it uses more resources and/or operates at a higher clock frequency. For the Max T Given A problem, we also model the impact of limited memory bandwidth on throughput.

Buffering Requirement. In a PR-style design, each PR region requires two intermediate buffers to hold its intermediate input and output data. The intermediate buffers can be stored in on-chip or off-chip memory depending on the data size. The on-chip buffering option is preferred to minimize the latency and power/energy for data movement. In practice, when batching to amortize reconfiguration time, the buffering capacity required by a PR-style design exceeds the amount of on-chip memory available on current FPGAs (a few MBs on large FPGAs). The amount of data to buffer can range from tens to hundreds of MBs depending on the use case.

If the intermediate buffers are stored in on-chip memory, additional architecture support is needed so that the output of the upstream module stored on chip is used as the input to the next module. One possible solution is to design an intermediate on-chip memory controller to connect the PR region to the intermediate buffers instead of having static, direct connections between the PR region and the buffers. The on-chip memory controller fetches the data from the appropriate intermediate buffer to send to the PR region, and writes the output from the PR region to the appropriate buffer.

Max T Given A: Memory Bandwidth Requirement. When maximizing throughput given an area budget, the best strategy is to serialize module execution on one PR region. An upper bound on the memory bandwidth required by the PR-style design can be determined by considering the read and write bandwidth required by the fastest variant in the design i.e. the variant with the highest throughput.

When the memory bandwidth required by the variant is greater than the total memory bandwidth available in the system, the variant's throughput is degraded by a factor proportional to the memory bandwidth required. We introduce a scaling factor F to model the impact of limited memory bandwidth on a variant's throughput. F is equal to the ratio of the memory bandwidth required by the variant to the memory bandwidth available in the system if the bandwidth required by the variant is greater than the bandwidth available. Otherwise, F is equal to 1. Equation 10 is expressed in terms of the peak throughput of the module variant that accelerates a task, the bandwidth requirement of the variant, and the total bandwidth available in the system.

(10)
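The scaling factor F described above can be sketched as follows. The definition of F follows the text; dividing the peak throughput by F is our assumption about how the factor is applied in Equation 10. The function names and the bandwidth figures in the example are hypothetical.

```python
def bandwidth_scaling_factor(required_mb_s: float, available_mb_s: float) -> float:
    """F = required / available when bandwidth is insufficient, else 1."""
    if required_mb_s > available_mb_s:
        return required_mb_s / available_mb_s
    return 1.0


def effective_throughput_fps(
    peak_fps: float, required_mb_s: float, available_mb_s: float
) -> float:
    """Peak throughput degraded by F (assumed form of the degradation)."""
    return peak_fps / bandwidth_scaling_factor(required_mb_s, available_mb_s)


if __name__ == "__main__":
    # Illustrative numbers only: a variant needing twice the available bandwidth.
    print(effective_throughput_fps(peak_fps=116, required_mb_s=200, available_mb_s=100))
```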

V Experimental Setup

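Resource     | 1 PR region: I/O infrastructure | 1 PR region: PR region | 1 PR region: Total | 2 PR regions: I/O infrastructure | 2 PR regions: PR region 0 | 2 PR regions: PR region 1 | 2 PR regions: Total
-------------|---------------------------------|------------------------|--------------------|----------------------------------|---------------------------|---------------------------|--------------------
LUT          | 3366 (4.8%)                     | 61,920 (87.8%)         | 65,286 (92.5%)     | 5231 (7.4%)                      | 28,800 (40.8%)            | 30,240 (42.9%)            | 64,271 (91%)
BRAM36Kb     | 0                               | 198 (91.7%)            | 198 (91.7%)        | 0                                | 108 (50%)                 | 108 (50%)                 | 216 (100%)
DSP          | 0                               | 288 (80%)              | 288 (80%)          | 0                                | 144 (40%)                 | 216 (60%)                 | 360 (100%)
PR time (ms) | N/A                             | 12                     | N/A                | N/A                              | 6                         | 6                         | N/A

TABLE I: Resource utilization of the two PR-style designs (one PR region and two PR regions) post place & route on the Ultra96 v2 board at 150 MHz. In both designs, most resources are spent for compute. In the design with two PR regions, the PR regions are almost equally-sized.

Resource                | Module variant: hog | Module variant: cnn | Module variant: lstm | ASIC-style: I/O infrastructure | ASIC-style: Modules | ASIC-style: Total
------------------------|---------------------|---------------------|----------------------|--------------------------------|---------------------|------------------
LUT                     | 15,495 (22%)        | 14,614 (20.7%)      | 7715 (10.9%)         | 6082 (8.6%)                    | 37,824 (53.6%)      | 43,906 (62.2%)
BRAM36Kb                | 34 (15.7%)          | 92 (42.6%)          | 80.5 (37.3%)         | 0                              | 206.5 (95.6%)       | 206.5 (95.6%)
DSP                     | 64 (17.8%)          | 10 (2.8%)           | 7 (1.9%)             | 0                              | 81 (23%)            | 81 (23%)
Memory bandwidth (MB/s) | 23.6                | 42.7                | 3.3                  | N/A                            | N/A                 | 64.6
Throughput (fps)        | 30                  | 16                  | 271                  | N/A                            | N/A                 | 16

TABLE II: Resource utilization, average memory bandwidth, and throughput of the ASIC-style design and the module variants used post place & route on the Ultra96 v2 board at 150 MHz for the activity recognition study.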

We develop three compute-bound applications representative of real-world applications with cost constraints [24, 44, 31]. For all studies, we use a low-end FPGA board (Ultra96 v2) with an XCZU3EG Zynq UltraScale+ part that has 70,560 LUTs, 216 BRAMs and 360 DSPs. These studies serve as concrete examples of ASIC-style designs with under-utilization (due to module dependencies or modules having mismatched throughput). Each application consists of three dependent tasks, with some tasks being more compute intensive than others, which perform common vision processing such as detection or classification. Dependent modules share data through external memory since the amount of on-chip memory on the Ultra96 is not sufficient to hold the inter-module buffers. Note that having more tasks per application would favor PR-style designs, since the length of the dependency chain would increase. In other words, we choose to focus on more challenging design scenarios (shorter pipelines).

Design Scenario. In the studies, we solve the max T given A and min L given A problems from Section IV, and also consider the problem of minimizing area given a latency upper bound, which we refer to as given L min A. Using our model, we search the design space to find the best-achievable ASIC-style and PR-style designs for a given problem. The best-achievable design consists of the set of module variants resulting in the design’s maximum throughput, minimum latency or minimum area possible given the module variants available. We use Vivado 2019.1 to build our designs [49].

PR-Style Designs. We consider three possible PR-style designs: (1) a single large PR region on which tasks are scheduled sequentially, (2) a single smaller PR region (one PR region of design 3) on which tasks are scheduled sequentially, and (3) two almost equally-sized PR regions on which tasks are executed in an interleaved fashion. Table I reports the resource utilization of designs 1 and 3 (the PR region of design 2 has the same size as PR region 1 of design 3) on the Ultra96 v2 board at 150 MHz. In both designs, most resources on the Ultra96 v2 are used for compute. The time to reconfigure a PR region through the processor configuration access port (PCAP) when partial bitstreams are stored in on-board external DDR is 12 ms (partial bitstreams of 5.5 MB for design 1) and 6 ms (partial bitstreams of 2.8 MB for design 3). We use one ARM core to manage the operation of the fabric at runtime (i.e., reconfiguration of the PR regions and module execution).

When optimizing for latency, we report the latency of the three PR-style designs whenever possible. We refer to latency (or frame latency) as the time to process one input frame by the application, i.e., the time it takes for each module to run once. When optimizing for throughput, we report the throughput of the single-PR-region design for different batch sizes. In the context of our studies, the input to an application is a frame. With batching, each module processes a batch of frames before the PR region is reconfigured.

Performance Density. In addition to latency and throughput, we also compare the performance density of ASIC-style and PR-style designs. Performance density is defined as the number of frames processed per unit time per unit area. This metric quantifies how efficiently a design utilizes available resources. The higher the performance density, the more area-efficient the design is (less under-utilization in the area-time volume). Since there is no simple definition for area on an FPGA, we use the utilization of the bottleneck resource as a proxy for area. For instance, if BRAM is the bottleneck, as is the case in our studies, performance density is computed as the number of frames processed per unit time per BRAM. For latency, we divide 1/latency by the number of BRAMs used in the design. For throughput, we simply divide throughput by the number of BRAMs used in the design.
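The two performance density computations described above can be sketched directly. The function names are ours; the example inputs are the ASIC-style totals from Table II (16 fps, 206.5 BRAMs) and Table V (56.7 ms, 202.5 BRAMs).

```python
def throughput_density(throughput_fps: float, brams_used: float) -> float:
    """Frames per second per BRAM."""
    return throughput_fps / brams_used


def latency_density(latency_ms: float, brams_used: float) -> float:
    """(1 / latency) per BRAM, with latency converted to seconds."""
    frames_per_second = 1000.0 / latency_ms
    return frames_per_second / brams_used


if __name__ == "__main__":
    print(f"{throughput_density(16, 206.5):.4f} fps per BRAM")
    print(f"{latency_density(56.7, 202.5):.4f} fps per BRAM")
```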

Module Characterization. In the studies, we use six modules: hog [26], cnn [23], lstm [41], viola [52], flow [14], and stereo (developed in-house). Each module has up to three implementation variants generated with Vivado HLS 2019.1 [48]. The variants are provided by the module developer or obtained by changing parameters in the HLS source code, such as the number of compute engines, the data precision, and the on-chip buffering size. The modules’ interfaces are modified to conform to our PR region interfaces. In our studies, all PR regions have the same interfaces, namely, one AXI memory-mapped, one AXI-lite, a clock, a reset, and an interrupt. All data transfers, including data sharing between modules in the ASIC-style design, happen through external DRAM.

Modules operate on 256×256 frames, except for the lstm module which operates on 32×32 frames. Modules process one frame at a time. Therefore, frame latency is the inverse of throughput, and includes both compute and data movement time. Data movement accounts for no more than 15% of the end-to-end latency. For all variants, module throughput scales mostly linearly with its resources. The bottleneck resource for all modules is either LUTs or BRAM on the Ultra96 v2.

V-A Model Validation: Case Study Results

In this section, we illustrate how to use our model and validate its effectiveness in three case studies. We show that (1) our first-order model accurately estimates a design's throughput and latency; (2) our analysis helps determine the most suitable PR execution strategy for a problem (notably, when optimizing for latency, it is important to evaluate both PR execution strategies, serialized execution on one PR region and interleaved execution on multiple PR regions, to find the best one for a given problem); (3) PR-style designs improve performance and performance density over ASIC-style designs with under-utilization; and (4) given an area budget, if the ASIC-style design is too big to fit, using PR can help make the design fit and run at useful performance.


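Resource           | hog    | cnn    | lstm   | stereo | flow   | viola
-------------------|--------|--------|--------|--------|--------|-------
LUT                | 55,635 | 27,573 | 47,745 | 51,477 | 40,509 | 42,283
BRAM36Kb           | 109    | 180    | 144    | 96.5   | 195    | 91.5
DSP                | 114    | 11     | 13     | 0      | 49     | 101
Throughput (fps)   | 116    | 32     | 2.1k   | 240    | 180    | 41.3
Frame latency (ms) | 8.6    | 31.2   | 0.48   | 4.2    | 5.6    | 24.2

TABLE III: Resource utilization, throughput and frame latency of the variants used in the PR-style design with a single large PR region.

Resource           | hog    | cnn    | lstm | stereo | flow
-------------------|--------|--------|------|--------|-------
LUT                | 27,879 | 15,009 | 7461 | 23,551 | 20,106
BRAM36Kb           | 53.5   | 92     | 80.5 | 96.5   | 95.5
DSP                | 114    | 11     | 13   | 0      | 48
Frame latency (ms) | 17.9   | 62.5   | 0.87 | 8.3    | 11.1

TABLE IV: Resource utilization and frame latency of the variants used in the PR-style designs with smaller PR regions.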

Study 1: Activity Recognition. The first case study performs activity recognition and is based on [24]. Three dependent tasks are accelerated by hog, cnn, and lstm modules. This study explores the max T given A and min L given A problems. In this study, we explain how to use our model for quick design space exploration. The same methodology is used for the two other studies.

Max T Given A. Table II shows the resource utilization and the throughput of the ASIC-style design and the module variants used. The ASIC-style design's throughput is equal to 16 fps and is limited by the throughput of the slowest module (cnn). The hog and lstm variants are roughly 2× and one order of magnitude faster than the cnn variant, respectively. The amount of computation per frame for the lstm variant is much less than for the two other modules. Therefore, the ASIC-style design has under-utilization, and there is opportunity for PR to improve.

Based on our analysis and on module variants available, batched execution on a single PR region solution () should provide best performance. Figure 6 shows the estimated and measured throughput, and the intermediate buffering capacity required for vs. batch size . We use equation 7, measured throughput variants (Table III) and PR time (Table I) to compute these estimations. We observe that (1) as predicted by the model, when increases, PR time gets amortized, but with diminishing return when . (2) For all , the estimated and measured throughput match within 2.35%. (3) At , the throughput of the PR-style design is , which represents a improvement over the ASIC-style design. (4) Intermediate buffering capacity linearly increases with , and is equal to 50.3 MB for . The intermediate buffers are stored in on-board external memory (on the Ultra96, 2 GB of external DDR is available). The peak external memory bandwidth (read and write) requirement for is 91.2 MB/s due to the hog module. This represents a 41.2% increase over the ASIC-style design which needs on average 64.6 MB/s (Table II).

The ASIC-style design uses BRAMs (95.6% of BRAM resources) and has a performance density of fps per BRAM. uses BRAMs (91.7% of BRAM resources available) and has a performance density of fps per BRAM, which represents a improvement over the ASIC-style design.

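| Resource | hog (module variant) | stereo (module variant) | flow (module variant) | ASIC-style: I/O infrastructure | ASIC-style: modules | ASIC-style: total |
|---|---|---|---|---|---|---|
| LUT | 27,244 (38.6%) | 13,767 (19.5%) | 10,943 (15.5%) | 3,366 (4.8%) | 51,924 (73.6%) | 55,320 (78.4%) |
| BRAM36Kb | 52.5 (24.3%) | 79.5 (36.8%) | 70.5 (32.6%) | 0 | 202.5 (93.8%) | 202.5 (93.8%) |
| DSP | 114 (31.7%) | 0 | 44 (12.2%) | 0 | 158 (43.9%) | 158 (43.9%) |
| Frame latency (ms) | 17.8 | 16.7 | 22.2 | N/A | N/A | 56.7 |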
TABLE V: Resource utilization and latency of the ASIC-style design and module variants used post place & route on the Ultra96 v2 board at 150 MHz for the depth and motion estimation study.
Fig. 6: Throughput of vs. for the first case study.

Min L Given A. When optimizing for latency, the ASIC-style design has under-utilization since modules are dependent (one frame processed at a time), and therefore, we expect PR to be beneficial. Figure 7.activity shows the frame latency of the latency-optimized ASIC-style design (), and the three PR-style designs (, , and ). We estimate the latency of using equation 1 and measured module latencies (Table II). The ASIC-style design has an estimated latency of , which exactly matches our measurement.

We estimate the latencies of the PR-style designs using equations 5 and 8, measured latencies from Tables III and IV, and PR time from Table I. The estimated latencies for , , and are , , and , respectively. The measured latencies for , , and are , , and , respectively. We observe that (1) estimated and measured latencies match within , and (2) among the three PR-style designs, has the smallest latency, as predicted by the model (22.8% improvement over the ASIC-style design). Note that PR time accounts for a non-negligible fraction of the frame latency of (46.9%). However, still outperforms , illustrating that the ratio of PR time to compute time should not be considered alone when optimizing for latency.
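The last observation can be illustrated with a small sketch. It is not equations 5 and 8; it assumes the simplest serialized case, where a frame's latency is the sum of the variant latencies plus one PR time per reconfiguration. All numbers and names below are hypothetical.

```python
def pr_frame_latency_ms(variant_latency_ms, pr_time_ms, reconfigs_per_frame):
    """Return (frame latency, fraction of that latency spent reconfiguring)."""
    compute_ms = sum(variant_latency_ms)
    overhead_ms = reconfigs_per_frame * pr_time_ms
    total_ms = compute_ms + overhead_ms
    return total_ms, overhead_ms / total_ms


# Hypothetical numbers: small variants that fit side by side (ASIC-style)
# versus larger, faster variants that each get the PR region to themselves.
ASIC_SMALL_VARIANTS_MS = [30.0, 50.0, 20.0]
PR_LARGE_VARIANTS_MS = [10.0, 20.0, 10.0]
PR_TIME_MS = 12.0

if __name__ == "__main__":
    asic_ms = sum(ASIC_SMALL_VARIANTS_MS)
    pr_ms, pr_fraction = pr_frame_latency_ms(PR_LARGE_VARIANTS_MS, PR_TIME_MS, 3)
    print(f"ASIC-style: {asic_ms:.1f} ms")
    print(f"PR-style:   {pr_ms:.1f} ms ({pr_fraction:.1%} spent in PR)")
```

With these placeholder values, almost half of the PR-style frame latency is reconfiguration, yet the design is still faster end to end because the larger variants shorten the compute time by more than the PR overhead adds.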

Considering performance density, uses BRAMs and has a performance density of per-seconds per BRAM. uses BRAM and has a performance density of per-seconds per BRAM ( improvement over ASIC-style).

Fig. 7: Frame latency of the ASIC-style design () and the PR-style designs , and for the three studies.

Study 2: Depth and Motion Estimation. The second case study performs depth and motion estimation, and is based on [44]. Three dependent tasks are accelerated by a hog, a stereo, and a flow module, respectively. This study explores the min L given A problem.

Figure 7.depth shows the frame latency of the latency-optimized ASIC-style design (), and the three PR-style designs (, , and ). We estimate the latency of using equation 1 and module latencies from Table V. The estimated latency of is (matches the measured latency). Using the same procedure described in the first case study, we obtain latency estimations for , , and of , , and , respectively. The measured latencies for , , , and , are , , , and , respectively. We observe that (1) estimated and measured latencies match within 0.18%, and (2) among all PR-style designs, has the lowest latency, as predicted by the model ( improvement over the ASIC-style design), reinforcing the fact that using the largest variants available may not achieve minimum latency.
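As a quick check of the ASIC-style estimate, the sketch below assumes that, for dependent tasks processing one frame at a time, equation 1 reduces to the sum of the module latencies. This assumption is consistent with the total reported in Table V. The dictionary and function names are ours.

```python
# Module latencies (ms) as reported in Table V.
TABLE_V_LATENCY_MS = {"hog": 17.8, "stereo": 16.7, "flow": 22.2}


def asic_frame_latency_ms(module_latency_ms):
    """Frame latency when dependent modules run back to back on one frame."""
    return sum(module_latency_ms.values())


if __name__ == "__main__":
    print(f"{asic_frame_latency_ms(TABLE_V_LATENCY_MS):.1f} ms")  # 56.7 ms
```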

Considering performance density, uses BRAMs (93.8% of BRAM resources available) and has a performance density of per-seconds per BRAM. uses BRAMs, and has a performance density of per-seconds per BRAM ( improvement over the ASIC-style design). Note that uses only BRAMs while achieving a latency improvement compared to . uses 2 more BRAM but only improves latency by 1.8% compared to . has a performance density of fps per BRAM ( improvement over the ASIC-style design). In a design scenario where area is to be minimized given a latency upper bound of 60 ms, would be the best design choice.
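The selection step in the last sentence can be sketched as a filter followed by a minimum. The sketch assumes that performance density for latency-optimized designs is the inverse of the frame latency (in seconds) divided by the BRAM count. The ASIC-style entry uses the Table V values (202.5 BRAMs, 56.7 ms); the three PR-style entries and all names are hypothetical placeholders.

```python
from typing import NamedTuple, Optional, Sequence


class Design(NamedTuple):
    name: str
    brams: float
    latency_ms: float

    @property
    def density(self) -> float:
        """Performance density in per-seconds per BRAM (assumed definition)."""
        return (1000.0 / self.latency_ms) / self.brams


def smallest_design_within(designs: Sequence[Design],
                           latency_bound_ms: float) -> Optional[Design]:
    """Pick the design with the fewest BRAMs that still meets the latency bound."""
    feasible = [d for d in designs if d.latency_ms <= latency_bound_ms]
    return min(feasible, key=lambda d: d.brams) if feasible else None


CANDIDATES = [
    Design("asic_style", 202.5, 56.7),       # values from Table V
    Design("pr_hypothetical_a", 120.0, 50.0),  # placeholder
    Design("pr_hypothetical_b", 80.0, 58.0),   # placeholder
    Design("pr_hypothetical_c", 60.0, 70.0),   # placeholder, misses a 60 ms bound
]

if __name__ == "__main__":
    best = smallest_design_within(CANDIDATES, latency_bound_ms=60.0)
    print(best.name if best else "no feasible design")
    for d in CANDIDATES:
        print(f"{d.name}: {d.density:.4f} per-seconds per BRAM")
```

Note that the smallest feasible design is not necessarily the fastest one: in this sketch the design that barely meets the bound wins because area, not latency, is the quantity being minimized.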

Study 3: Facial Emotion Recognition. The final study performs facial emotion recognition, and is based on [31]. Three dependent tasks are accelerated by a viola, a cnn and an lstm module, respectively. This study explores the min L given A and given L min A problems.

Min L Given A. The BRAM resources on the Ultra96 v2 are insufficient to map , , and . Figure 7.facial shows the frame latency of . Using the same procedure as in the first case study, we estimate the frame latency of to be . The measured latency is ( error). uses BRAMs and has a performance density of per-seconds per BRAM. In summary, when the ASIC-style design is too big to fit, PR can make the design fit and achieve useful performance (less than 100 ms).

Given L Min A. Given a latency upper bound of , we want to estimate the minimum area needed by an ASIC-style design to achieve this requirement. On a larger FPGA board (Ultrascale+ 102), the ASIC-style design consisting of the smallest module variants available uses LUTs, BRAMs, and DSPs, and achieves a latency of post place & route at 150 MHz. The performance density of the ASIC-style design is per-seconds per BRAM. Considering the PR-style design from min L given A, improves latency by and performance density by compared to the ASIC-style design.

VI Conclusion

This paper investigates the question of when, how and why FPGA designers should consider using PR. To address this question, we identify reducing under-utilization in ASIC-style designs as one of the main means for improvement available to PR-style designs. We then present a set of PR execution strategies to build efficient PR-style designs that can (1) be faster given an area budget or (2) smaller given a performance bound than ASIC-style designs with under-utilization. We discuss our first-order model to quickly and accurately estimate the relative merits of ASIC-style and PR-style designs in the early stage of design development. We validate our first-order model in three study applications that serve as practical examples of ASIC-style designs with under-utilization. Though limited, this choice of execution model and performance metrics allows us to cover a non-trivial range of design scenarios and applications (e.g., video analytics/image processing pipelines, feed-forward neural networks).

The model relies on the existence of a module library consisting of Pareto-optimal module variants used to build the ASIC-style and PR-style designs. The accuracy of the model depends on (1) how well the library has been characterized in terms of area, latency, throughput, and memory bandwidth requirement and (2) the ability to place and route modules at the required clock frequency, which can be challenging depending on the problem. The model could be improved to account for this clock frequency uncertainty, for instance, by defining different levels of confidence based on the design’s complexity.

VII Acknowledgments

This work was supported in part by the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. We thank Intel and Xilinx for their FPGA and tool donations.

References

  • [1] Megh Computing (2019).
  • [2] A. Agne, M. Happe, A. Keller, E. Lübbers, B. Plattner, M. Platzner, and C. Plessl (2014-01) ReconOS: an operating system approach for reconfigurable computing. IEEE Micro 34 (1), pp. 60–71. ISSN 0272-1732.
  • [3] Amazon. Amazon EC2 F1 Instances.
  • [4] J. Arram, W. Luk, and P. Jiang (2015) Ramethy: Reconfigurable Acceleration of Bisulfite Sequence Alignment. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, New York, NY, USA, pp. 250–259. ISBN 978-1-4503-3315-3.
  • [5] S. Banerjee, E. Bozorgzadeh, and N. Dutt (2005-06) Physically-aware hw-sw partitioning for reconfigurable architectures with partial dynamic reconfiguration. In Proceedings. 42nd Design Automation Conference, 2005, pp. 335–340. ISSN 0738-100X.
  • [6] S. Bhandari, S. Subbaraman, S. Pujari, F. Cancare, F. Bruschi, M. D. Santambrogio, and P. R. Grassi (2012-09) High speed dynamic partial reconfiguration for real time multimedia signal processing. In 2012 15th Euromicro Conference on Digital System Design, pp. 319–326.
  • [7] S. Biookaghazadeh, M. Zhao, and F. Ren (2018-07) Are fpgas suitable for edge computing? In USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), Boston, MA.
  • [8] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow (2014-05) FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 109–116.
  • [9] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon (2000) Stream computations organized for reconfigurable execution (score). In Proceedings of The Roadmap to Reconfigurable Computing, 10th International Workshop on Field-Programmable Logic and Applications, FPL ’00, Berlin, Heidelberg, pp. 605–614. ISBN 3540678999.
  • [10] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger (2016-10) A cloud-scale acceleration architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13.
  • [11] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, and K. Wang (2014) Enabling fpgas in the cloud. In Proceedings of the 11th ACM Conference on Computing Frontiers, CF ’14, New York, NY, USA, pp. 3:1–3:10. ISBN 978-1-4503-2870-8.
  • [12] Z. Chen, W. Hu, J. Wang, S. Zhao, B. Amos, G. Wu, K. Ha, K. Elgazzar, P. Pillai, R. Klatzky, D. Siewiorek, and M. Satyanarayanan (2017) An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, SEC ’17, New York, NY, USA, pp. 14:1–14:14. ISBN 978-1-4503-5087-7.
  • [13] C. Claus, W. Stechele, and A. Herkersdorf (2007-05) Autovision – a run-time reconfigurable mpsoc architecture for future driver assistance systems (autovision – eine zur laufzeit rekonfigurierbare mpsoc architektur für zukünftige fahrerassistenzsysteme). 49, pp. 181–.
  • [14] P. K. Daniele Bagni and S. Neuendorffer (2017) Demystifying the lucas-kanade optical flow algorithm with vivado hls.
  • [15] C. Dennl, D. Ziener, and J. Teich (2012-04) On-the-fly composition of fpga-based sql query accelerators using a partially reconfigurable module library. In 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pp. 45–52.
  • [16] M. Dyer, C. Plessl, and M. Platzner (2002) Partially reconfigurable cores for xilinx virtex. In Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, M. Glesner, P. Zipf, and M. Renovell (Eds.), Berlin, Heidelberg, pp. 292–301. ISBN 978-3-540-46117-3.
  • [17] S. A. Fahmy, K. Vipin, and S. Shreejith (2015-11) Virtualized fpga accelerators for efficient cloud computing. In 2015 IEEE 7th International Conference on Cloud Computing Technology and Science (CloudCom), pp. 430–435.
  • [18] B. A. Farisi, K. Heyse, and D. Stroobandt (2014-12) Reducing the overhead of dynamic partial reconfiguration for multi-mode circuits. In 2014 International Conference on Field-Programmable Technology (FPT), pp. 282–283.
  • [19] K. Fleming, H. Yang, M. Adler, and J. Emer (2014-09) The leap fpga operating system. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8. ISSN 1946-147X.
  • [20] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger (2018-06) A configurable cloud-scale dnn processor for real-time ai. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14. ISSN 2575-713X.
  • [21] J. Goeders, T. Gaskin, and B. Hutchings (2018-04) Demand driven assembly of fpga configurations using partial reconfiguration, ubuntu linux, and pynq. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 149–156. ISSN 2576-2621.
  • [22] L. Gong and O. Diessel (2011-12) ReSim: a reusable library for rtl simulation of dynamic partial reconfiguration. In 2011 International Conference on Field-Programmable Technology, pp. 1–8.
  • [23] D. Gschwend (2016) ZynqNet: an fpga-accelerated embedded convolutional neural network.
  • [24] M. Harvey (2017) Five video classification methods.
  • [25] C. Huriaux, O. Sentieys, and R. Tessier (2014-09) FPGA architecture support for heterogeneous, relocatable partial bitstreams. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–6. ISSN 1946-1488.
  • [26] N. Katsaros and N. Patsiatzis (2017) A real time histogram of oriented gradients implementation on fpga.
  • [27] A. Khawaja, J. Landgraf, R. Prakash, M. Wei, E. Schkufza, and C. J. Rossbach (2018-10) Sharing, protection, and compatibility for reconfigurable fabric with amorphos. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 107–127. ISBN 978-1-939133-08-3.
  • [28] N. S. Kim and P. Mehra (2019) Practical near-data processing to evolve memory and storage devices into mainstream heterogeneous computing systems. In Proceedings of the 56th Annual Design Automation Conference 2019, DAC ’19, New York, NY, USA, pp. 22:1–22:4. ISBN 978-1-4503-6725-7.
  • [29] V. Kizheppatt and S. Fahmy (2011-12) Efficient region allocation for adaptive partial reconfiguration. pp. 1–6.
  • [30] D. Koch and J. Torresen (2011) FPGASort: A High Performance Sorting Architecture Exploiting Run-time Reconfiguration on Fpgas for Large Problem Sorting. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, New York, NY, USA, pp. 45–54. ISBN 978-1-4503-0554-9.
  • [31] S. Li and W. Deng (2018) Deep facial expression recognition: a survey. arXiv 1804.08348.
  • [32] X. Li, X. Wang, F. Liu, and H. Xu (2018-07) DHL: enabling flexible software network functions with fpga acceleration. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1–11. ISSN 2575-8411.
  • [33] S. Ma, Z. Aklah, and D. Andrews (2016) Just in time assembly of accelerators. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, New York, NY, USA, pp. 173–178. ISBN 9781450338561.
  • [34] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda (2007-04) The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-based Computer. J. VLSI Signal Process. Syst. 47 (1), pp. 15–31. ISSN 0922-5773.
  • [35] Microsoft (2012) Live video analytics.
  • [36] M. Nguyen, R. Tamburo, S. Narasimhan, and J. C. Hoe (2019-09) Quantifying the benefits of dynamic partial reconfiguration for embedded vision applications. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 129–135. ISSN 1946-147X.
  • [37] M. Nguyen and J. C. Hoe (2018) Time-shared execution of realtime computer vision pipelines by dynamic partial reconfiguration. In 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018, pp. 230–234.
  • [38] J. Noguera and I. O. Kennedy (2007-08) Power reduction in network equipment through adaptive partial reconfiguration. In 2007 International Conference on Field Programmable Logic and Applications, pp. 240–245. ISSN 1946-1488.
  • [39] H. Omidian and G. G.F. Lemieux (2019) Software-based dynamic overlays require fast, fine-grained partial reconfiguration. In Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, HEART 2019, New York, NY, USA. ISBN 9781450372558.
  • [40] W. Peck, E. Anderson, J. Agron, J. Stevens, F. Baijot, and D. Andrews (2006-08) Hthreads: a computational model for reconfigurable devices. In 2006 International Conference on Field Programmable Logic and Applications, pp. 1–4. ISSN 1946-1488.
  • [41] V. Rybalkin, A. Pappalardo, M. M. Ghaffar, G. Gambardella, N. Wehn, and M. Blott (2018) FINN-l: library extensions and design trade-off analysis for variable precision lstm networks on fpgas. 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 89–897.
  • [42] H. K. So and R. Brodersen (2008-01) A unified hardware/software runtime environment for fpga-based reconfigurable computers using borph. ACM Trans. Embed. Comput. Syst. 7 (2), pp. 14:1–14:28. ISSN 1539-9087.
  • [43] A. Sudarsanam, R. Barnes, J. Carver, R. Kallam, and A. Dasu (2010-03) Dynamically reconfigurable systolic array accelerators: a case study with extended kalman filter and discrete wavelet transform algorithms. IET Computers Digital Techniques 4 (2), pp. 126–142. ISSN 1751-8601.
  • [44] T. Taniai, S. Sinha, and Y. Sato (2017-07) Fast multi-frame stereo scene flow with motion segmentation.
  • [45] N. Thomas, A. Felder, and C. Bobda (2015-12) Adaptive controller using runtime partial hardware reconfiguration for unmanned aerial vehicles (uavs). In 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–7.
  • [46] M. Ullmann, M. Huebner, B. Grimm, and J. Becker (2004-04) An fpga run-time system for dynamical on-demand reconfiguration. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., pp. 135–.
  • [47] A. Vaishnav, K. D. Pham, D. Koch, and J. Garside (2018-08) Resource elastic virtualization for fpgas using opencl. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 111–1117. ISSN 1946-147X.
  • [48] (2019) Vivado design suite user guide: high-level synthesis (ug902). Xilinx.
  • [49] (2019) Vivado design suite user guide: using the vivado ide (ug893). Xilinx.
  • [50] S. Wang, C. Zhang, Y. Shu, and Y. Liu (2019-10) Live video analytics with fpga-based smart cameras. In Workshop on Hot Topics in Video Analytics and Intelligent Edges (HotEdgeVideo).
  • [51] W. Zhang, S. Li, L. Liu, Z. Jia, Y. Zhang, and D. Raychaudhuri (2019-04) Hetero-edge: orchestration of real-time vision applications on heterogeneous edge clouds. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pp. 1270–1278. ISSN 0743-166X.
  • [52] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y. Lai, G. Liu, G. A. Velasquez, W. Wang, and Z. Zhang (2018-02) Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software-Programmable FPGAs. Int’l Symp. on Field-Programmable Gate Arrays (FPGA).
  • [53] (2019) Zynq dpu v3.1. Xilinx.