Time-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration

05/26/2018 ∙ by Marie Nguyen, et al.

This paper presents an FPGA runtime framework that demonstrates the feasibility of using dynamic partial reconfiguration (DPR) for time-sharing an FPGA by multiple realtime computer vision pipelines. The presented time-sharing runtime framework manages an FPGA fabric that can be round-robin time-shared by different pipelines at the time scale of individual frames. In this new use-case, the challenge is to achieve useful performance despite high reconfiguration time. The paper describes the basic runtime support as well as four optimizations necessary to achieve realtime performance given the limitations of DPR on today's FPGAs. The paper provides a characterization of a working runtime framework prototype on a Xilinx ZC706 development board. The paper also reports the performance of realtime computer vision pipelines when time-shared.


I Introduction

Motivation. FPGAs have increasingly been deployed in compute settings. However, past examples have not fully exploited the dynamic programmability of FPGAs. Typically, once a design is loaded, the FPGA acts as an ASIC with a fixed set of functionalities that are statically mapped on the FPGA for the duration of the deployment.

Modern computer vision applications have become interactive, requiring systems to adapt dynamically in functionality and/or in performance to user and environment inputs. For instance, advanced driver-assistance systems (ADAS) need to change the behavior of the car in realtime based on the driver and on sensor inputs. These interactive applications present an opportunity to leverage the dynamic programmability of FPGAs. The main challenge when implementing interactive realtime systems is that the sequence and combination of applications requested at runtime are not known at design time. This dynamic adaptation requirement leads to a very large number of potential FPGA states. Mapping all possible application combinations on an FPGA using a traditional static design flow is inflexible, expensive and may be impossible given the allotted area or power budget. Also, statically mapping all possible application combinations is wasteful since only a subset of applications needs to be active at a time.

Dynamic partial reconfiguration (DPR) has been used successfully to provide on-the-fly adaptability by allowing an FPGA to be repurposed with new functionalities with minimal operational disruption [1]. In the context of repurposing for interactivity or adaptability, the interval between reconfigurations is in the minutes-to-hours range, with tolerance for missed frames during reconfiguration. For these uses, the FPGA still acts like an ASIC for extensive periods in between reconfigurations.

The work in this paper aims to apply DPR to time-share the FPGA fabric by multiple realtime computer vision pipelines at the time scale of individual frames. With time-sharing, the FPGA can support more concurrent realtime functionalities than what could statically fit on the FPGA. In this use-case, every frame must be processed by every pipeline.

Fig. 1: Conceptual sketch of a repurposable DPR framework for computer vision pipelines. (a): the fabric is organized into a static region and three reconfigurable partitions. (b) & (c): Different computer vision pipelines can be executed over time by reloading the reconfigurable partitions with modules from a library.

Prior Work: DPR for Repurposing. DPR allows a region of the FPGA fabric to be reconfigured without disrupting the operation of the remainder of the fabric [2]. As an illustrative example, Figure 1.a depicts an FPGA fabric organized into a static region and three reconfigurable partitions (RPs). The static region provides infrastructure logic to connect the camera to the first RP and the last RP to the display. The infrastructure logic further connects the three RPs by streaming connections in a linear topology. The RPs can be reconfigured with pre-compiled modules from a library to serve as the stages of a computer vision pipeline. For example, Figure 1.b shows the RPs configured as one three-stage pipeline from camera to display. Alternatively, Figure 1.c shows the RPs repurposed as a different three-stage pipeline from camera to display.

In practice, a modern large FPGA could have more than three RPs, and a framework could support more elaborate streaming connection topologies between RPs. We have created a repurposable DPR framework as sketched above in prior work. Others have also shown that this kind of repurposable DPR framework is readily realizable using standard DPR support in commercial tools and FPGAs [3, 4].

Fig. 2: The measured reconfiguration time as a function of the RP size (represented by the size of its bitstream). Sizes of common computer vision modules are marked as reference points.

This Paper: DPR for Realtime Time-Sharing. In this work, we want to apply DPR multiple times in the time scale of a single camera frame to support time-shared round-robin execution of multiple realtime computer vision pipelines. For example, to time-share the two pipelines in Figure 1.b and 1.c, we divide the time quantum of one camera frame into two timeslices, one for each pipeline. The time scale for time-sharing in computer vision applications, just 16.7 milliseconds per frame at 60 frames-per-second (fps), is barely within reach of today's DPR support. To maintain realtime processing, both the reconfiguration time and the time for processing one camera frame have to fit within a pipeline timeslice. Each camera frame is buffered for processing by the two pipelines during their respective timeslices. The output frames from the two pipelines could be merged for display (e.g., split-screen) or buffered separately in DRAM for downstream consumption. By synchronizing time-sharing with frame boundaries, there is no context (frames) to save/restore when switching between pipelines.

The challenge when time-sharing comes from the restrictive DPR speed in today's FPGAs. On the Xilinx XCZ7045 FPGA, the time to reconfigure one RP is proportional to its size, typically a few to 10s of milliseconds (see Figure 2). Moreover, only one RP can be reconfigured at a time, so the multiple RPs of a pipeline are reconfigured one after the other. Consequently, the reconfiguration time alone can easily overwhelm the allotted timeslice at any non-trivial frame rate.
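As a back-of-the-envelope check (assuming, for illustration only, an effective PCAP throughput on the order of 130 MB/s; this figure is an assumption, not a number reported in this paper), reconfiguration time scales linearly with bitstream size:

  $T_{config} \approx \frac{\text{bitstream size}}{\text{PCAP throughput}}$,  e.g., $\frac{1.1\,MB}{130\,MB/s} \approx 8.5\,ms$  and  $\frac{300\,KB}{130\,MB/s} \approx 2.3\,ms$

which is consistent with the few-to-10s-of-milliseconds range shown in Figure 2.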

Contributions. In this work, we show that it is feasible to use DPR for time-sharing realtime computer vision pipelines with performance requirements in the tens of milliseconds range. We develop four optimizations that hide, amortize or eliminate reconfiguration time and that are necessary to achieve usable time-sharing performance for computer vision pipelines. We have created a runtime framework for streaming computer vision processing that implements these optimizations:

  • overlapping stage reconfiguration and processing within a pipeline timeslice to hide reconfiguration time

  • round-robin scheduling at an enlarged granularity of multi-frame bundles to amortize reconfiguration cost over the processing time of multiple frames

  • a flexibly configurable streaming interconnect infrastructure to reduce the number of partitions that must be reconfigured when pipelines share common stages

  • downsampling the video stream from the camera (lowering the effective frame rate) in a fashion transparent to the computer vision modules instantiated in the RPs

Our final results show that we can time-share an FPGA between streaming computer vision pipelines, and achieve useful frame rates (30+ fps) for each time-shared pipeline.

Paper Outline. Following this introduction, Section II provides background and a survey of related work. Section III presents the basic design and operation of our time-sharing runtime framework. Section IV next presents the techniques to reduce the impact of reconfiguration time in time-sharing. Section V describes a working realization of the presented runtime framework on a Xilinx ZC706 development board. Section VI presents the performance evaluation when time-sharing streaming computer vision pipelines. Lastly, Section VII offers our conclusions.

II Background

Dynamic Partial Reconfiguration (DPR). Section I briefly introduced the notions of a static region and reconfigurable partitions (RPs) in a DPR example (Figure 1). In Xilinx's environment, at design time, the RPs appear as black-box submodules with declared input/output ports but unspecified internals. At implementation time, the RPs' bounding boxes and port locations are fixed by floorplanning. Separately, different modules that (1) have matching interface ports and (2) can fit within the logic resources of an RP can be placed-and-routed as variants to be loaded into the RP at runtime. In our work, we control reconfiguration from the embedded ARM core. The bitstreams of the modules are stored in DRAM and loaded at runtime to reconfigure the RPs through Xilinx's PCAP interface [2].

DPR technology has been supported for over a decade and has been used in many prior works (e.g., [5, 6, 7]). In these works, DPR has been exploited for saving area by time-multiplexing application phases [8], for customizing data paths to improve performance [9], or for virtualizing FPGA resources in the cloud [10]. DPR has been used for streaming vision processing [3, 4] at a coarse time scale, where the time between reconfigurations is within the seconds-to-minutes range, to repurpose the FPGA for different functionalities. In [11], the authors identified the issues of asynchronous module execution and frame skipping when applying DPR for repurposing in a vision streaming context.

Streaming Vision Pipeline. We use the simple streaming vision pipelines depicted in Figure 1 to explain the operation of standard streaming vision pipelines and the operation of time-sharing. We will discuss the operation of more complex pipelines later in the paper (Figure 8). We assume the streaming vision pipeline is driven by a camera and outputs to a display, and that pixels are continuously streamed into the pipeline. The camera streams pixels into the first stage of the pipeline at a steady rate. $T_{frame}$ is the time between the first pixel and the last pixel of a frame produced by the camera; the frame rate is $1/T_{frame}$. In a simple pipeline, all pipeline stages consume and produce pixels at the same steady rate as the camera, logically computing an output frame from each input frame. The stages may need to buffer multiple lines of the frame but never a complete frame. Due to buffering, there is a delay between when the first pixel of a frame enters a stage (or a pipeline) and when the first pixel of the same frame exits that stage (or pipeline); this time is $T_{delay}$. After the first pixel exits, the last pixel exits $T_{frame}$ later. In steady-state with continuous streaming inputs, a complete frame exits every $T_{frame}$.

For example, consider a full-HD camera that outputs frames of 1920-by-1080 pixels at 60 frames-per-second (fps); $T_{frame}$ = 16.7 milliseconds. When using a 16-bit wide streaming connection, the pipeline needs to operate at a minimum frequency of 148.5 MHz. If a stage requires buffering of 10 lines of the frame, $T_{delay}$ of that stage will be at least 0.13 milliseconds (10 lines × 1920 pixels / 148.5 MHz). (The size of a full-HD frame in YUYV422 format (2 bytes/pixel) is 4 MB.)
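The 148.5 MHz figure follows from the standard 1080p60 pixel clock, which covers the blanking intervals in addition to the 1920-by-1080 visible pixels, and the delay bound follows from the line-buffering requirement:

  $2200 \times 1125 \times 60\,fps = 148.5\,MHz$,   $T_{delay} \ge \frac{10 \times 1920}{148.5\,MHz} \approx 0.13\,ms$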

Under basic operation, any given stage just needs to keep up with the pixel rate from the camera. However, a stage running by itself could be clocked faster, resulting in a shorter $T_{delay}$ and $T_{frame}$. For example, this is applicable when the streaming input and output of a stage are sourced from and sunk into DRAM instead of the camera and display.

III Basic Round-Robin Time-Sharing

This section describes the operation of a basic time-sharing system and its performance model.

Fig. 3: Time-sharing by two three-stage pipelines. A pipeline starts processing only after all stages have been configured.

We want to time-share the FPGA fabric by round-robin execution of multiple realtime vision pipelines. Since every input frame needs to be processed by every pipeline, we initially take $T_{frame}$ to be the basic scheduling quantum $Q$ for one round of round-robin execution. Each pipeline $i$ is assigned a timeslice $T_{slice,i}$. During a pipeline's timeslice, the partitions needed by the pipeline are configured first, and then one camera frame is fully processed.

To present the same input frame to each pipeline during its timeslice, the input frame from the camera needs to be double-buffered (1) to synchronize module execution after a partition reconfiguration with the start of every frame (i.e., no frame skipping), and (2) so that every pipeline processes every frame. We double-buffer the input frames from the camera in DRAM since the amount of data to buffer, which ranges from a few KB to several MB, may exceed the available on-chip RAM (BRAM) on the FPGA we use (2.4 MB). During each timeslice, the runtime framework drives the active pipeline with a pixel stream from DRAM at the maximum rate the pipeline can handle or up to the DRAM bandwidth. The output of the pipeline is also double-buffered in DRAM so the runtime framework can produce an evenly timed output stream to display. The multiple output video streams can be merged for display by a function (e.g., XOR) or rendered simultaneously as split-screen.

Performance Model. If the total time to configure pipeline $i$ is $T_{config,i}$, its timeslice must cover configuration followed by the processing of one frame:

  $T_{slice,i} = T_{config,i} + T_{delay,i} + T_{frame,i}$

Note in the above, $T_{delay,i}$ and $T_{frame,i}$ are for when the pipelines are operating against DRAM. For a valid realtime schedule,

  $\sum_i T_{slice,i} \le Q = T_{frame}$
Figure 3 illustrates an execution timeline when two three-stage pipelines are time-shared as described above. This straightforward approach is not sufficient for achieving useful frame rates given the reconfiguration speed on today’s FPGAs. The next section presents additional techniques needed to achieve usable performance.

IV Reconfiguration Time Optimizations

As noted in the introduction, the time to reconfigure an RP is a few to 10s of milliseconds (Figure 2), and multiple RPs can only be configured one after the other. Therefore, the time to reconfigure the RPs for a pipeline is often comparable with the processing time and is also significant relative to $T_{frame}$. If we use time-sharing as described in the previous section, the time to configure a pipeline alone will exceed $T_{frame}$ in most non-trivial scenarios. This section introduces techniques to hide, amortize or eliminate the reconfiguration time when possible.

Fig. 4: Time-sharing by two three-stage pipelines. A stage of a pipeline starts processing as early as possible, that is, when the stage is configured AND its upstream stage is producing output.

IV-A Overlapping Reconfiguration and Processing

In the last section, we waited until all of the RPs of a pipeline had been configured before starting processing. With all of the stages ready, streaming processing can progress synchronously throughout the pipeline. However, given that the reconfiguration time of a partition is significant relative to processing time, we are motivated to overlap processing and reconfiguration by (1) reconfiguring RPs in order from first to last; and (2) streaming input into the earlier stages as soon as they are ready. Figure 4 illustrates the execution timeline for the same two pipelines used in Figure 3 but now starting a stage as soon as possible, in other words, when the stage is configured and its upstream stage is producing output.

In this staggered-start execution, it is possible for an upstream stage to start producing output before its downstream stage is ready. Thus, it becomes necessary to introduce buffering into the streaming connection between an upstream stage and a downstream stage that is still being reconfigured. The buffering capacity must be sufficient to capture all of the output of the upstream stage until the downstream stage is ready. Data is buffered in DRAM since the amount of data may exceed BRAM capacity. Hence, to buffer and delay the data stream until the downstream stage is ready, we use a decoupling DMA engine between each downstream stage being reconfigured and its upstream stage. In the worst case, we have found it necessary to support the option for streaming connections to be physically realized as a circular buffer in DRAM. This need for DRAM streaming connections motivates a more generalized streaming interconnect infrastructure to connect RPs.
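As a minimal sketch of the pointer discipline such a circular-buffer DRAM connection implies (the DMA engines themselves are hardware; the names and layout here are our own illustration, not the framework's implementation):

  #include <stdint.h>
  #include <stdbool.h>

  /* Circular buffer in DRAM decoupling an upstream producer stage from a
   * downstream stage that is still being reconfigured. head and tail are
   * free-running byte counters, so used = head - tail is wraparound-safe. */
  typedef struct {
      uint64_t base;   /* DRAM base address of the buffer */
      uint32_t size;   /* capacity in bytes               */
      uint32_t head;   /* producer (write-DMA) offset     */
      uint32_t tail;   /* consumer (read-DMA) offset      */
  } dram_fifo_t;

  static uint32_t fifo_used(const dram_fifo_t *f) { return f->head - f->tail; }

  /* Producer side: capacity must cover everything produced before the
   * downstream stage finishes reconfiguring, or data would be lost.   */
  static bool fifo_push(dram_fifo_t *f, uint32_t nbytes)
  {
      if (fifo_used(f) + nbytes > f->size)
          return false;              /* overrun: buffer sized too small */
      f->head += nbytes;
      return true;
  }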

With staggered start, $T_{slice,i}$ of pipeline $i$ is upper bounded by

  $T_{slice,i} \le T_{config,i} + T_{delay,i} + T_{frame,i}$

In the case when all the stages have comparable processing time, $T_{slice,i}$ is lower bounded by

  $T_{slice,i} \ge T_{config,i} + t_{delay}^{last} + t_{frame}^{last}$

where $t_{delay}^{last}$ and $t_{frame}^{last}$ are $T_{delay}$ and $T_{frame}$ of the last pipeline stage only. When some stages have much longer configuration or processing time, $T_{slice,i}$ is more tightly lower bounded by

  $\max_{s} \Big( \sum_{k \le s} t_{config,k} + t_{delay,s} + t_{frame,s} \Big)$  over all stages $s$
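The tighter bound can be computed directly, as in the following C helper (our own illustration of the formula above; the type name is hypothetical):

  /* Per-stage timing (milliseconds). */
  typedef struct {
      double t_config;  /* reconfiguration time of this stage's RP        */
      double t_delay;   /* stage delay, first pixel in to first pixel out */
      double t_frame;   /* stage time to stream one frame                 */
  } stage_timing_t;

  /* Tighter lower bound on T_slice under staggered start: the maximum,
   * over stages s, of the config time of stages 1..s (RPs reconfigure
   * serially) plus t_delay,s plus t_frame,s.                           */
  static double staggered_slice_lower_bound(const stage_timing_t *s, int n)
  {
      double config_so_far = 0.0, bound = 0.0;
      for (int i = 0; i < n; i++) {
          config_so_far += s[i].t_config;
          double b = config_so_far + s[i].t_delay + s[i].t_frame;
          if (b > bound)
              bound = b;
      }
      return bound;
  }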

IV-B Amortization and Downsampling

In the basic scheme presented in the previous section, a round of round-robin execution is completed for each quantum $Q = T_{frame}$. This is not necessary. We can increase $Q$ to be a multiple $N \cdot T_{frame}$. In this case, we would double-buffer $N$ frames at a time from the camera into DRAM. During each pipeline's timeslice, the runtime framework drives the active pipeline with the $N$ consecutive buffered frames from the DRAM double-buffer. Thus the cost of reconfiguration is amortized over a longer processing time. This option can be used with or without staggered start. In both cases, $T_{slice,i}$ of pipeline $i$ is now upper bounded by

  $T_{slice,i} \le T_{config,i} + T_{delay,i} + N \cdot T_{frame,i}$

For a valid realtime schedule,

  $\sum_i T_{slice,i} \le Q = N \cdot T_{frame}$

The runtime framework can still produce a smooth video output when $N > 1$ because the output is also double-buffered. Besides the added storage cost, a major downside of increasing $Q$ is the very large increase in end-to-end latency through the runtime framework (which now includes the time to buffer multiple frames of the input and output).

Also note, increasing $Q$ improves scheduling only by amortizing the reconfiguration cost. Therefore, it cannot help when the sum of the pipelines' processing times already exceeds $Q$. In this case, the only option is to downsample the video stream going from the camera into the pipelines. If the runtime framework selectively passes only every $D$-th frame of the camera input to the pipelines, the pipeline timeslices only need to fit within the new scheduling quantum of

  $Q = D \cdot N \cdot T_{frame}$
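As a worked example (with illustrative numbers, not measurements from this paper): at 60 fps, bundling $N = 2$ frames and downsampling by $D = 2$ enlarges the quantum to

  $Q = D \cdot N \cdot T_{frame} = 2 \cdot 2 \cdot 16.7\,ms \approx 66.8\,ms$

so a round of reconfigurations costing, say, 10 ms consumes 15% of the quantum instead of 60%, while each pipeline runs at an effective 30 fps and processes two consecutive frames per round.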

Fig. 5: An illustrative example set of transitions between pipelines where reconfiguration of RPs can be avoided by retaining and reusing already configured stages.

IV-C Configurable Streaming Interconnect

In vision processing, even when pipelines have different functionalities, they may share common stages (see for example Figure 8). The high cost of reconfiguring an RP can be avoided when an already configured processing stage can be retained and reused across pipelines.

Figure 5 identifies different ways for multiple pipelines to reuse common stage configurations. The simplest scenario is switching from pipeline (a) to pipeline (b), where the two pipelines have the same topology and differ only in one stage. To switch from (a) to (b) (and vice versa), only the middle RP has to be reconfigured. When switching from (b) to (c), no RP reconfiguration is needed if there is a way to skip over the middle stage of (b). Furthermore, by retaining that stage even though it is not used by (c), a switch from (c) back to (b) can also be done without RP reconfiguration. On the other hand, with the same stage still in place, a switch from (c) to (d) also requires no reconfiguration, but it requires a different streaming connectivity than going back to (b). In fact, one can switch between (b), (c), and (d) arbitrarily without RP reconfigurations, realizing the transitions as combinations of stage deletion, insertion, or reordering by changing the connectivity between already configured RPs.

To support these different scenarios, we provide a flexibly configurable streaming interconnect between RPs and other infrastructural elements. A configurable crossbar connects all elements in the system. This crossbar is not reconfigured by a DPR bitstream but by control registers that can be quickly written by the controlling software between pipeline reconfigurations to establish the desired static streaming topology for the next timeslice. Configuring the interconnect by software is three orders of magnitude faster than reconfiguring an RP. The statically decided streaming connectivity topology never has two streaming sources going to the same destination, so the crossbar can be simple and efficient with no need for flow control or buffering. (To support forks and joins in the pipeline, some RPs have multiple input or multiple output interfaces while others have exactly one input and one output interface.) For simple streaming connections, the upstream and downstream stages are connected with a single-cycle buffered path. When a DRAM streaming connection is used to allow for buffering and stage decoupling (Section IV-A), the source and sink RPs' streaming interfaces are redirected to/from the DMA engines of the infrastructure instead.
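The following sketch shows how software might retarget such a crossbar between timeslices; the register map, base address, and port names are hypothetical, since the paper only states that the crossbar is programmed through control registers:

  #include <stdint.h>

  /* Hypothetical memory-mapped crossbar: one select register per
   * destination port chooses which single source port feeds it.  */
  #define XBAR_BASE      0x43C00000u                /* assumed AXI-Lite base */
  #define XBAR_SEL(dst)  ((volatile uint32_t *)(XBAR_BASE + 4u * (dst)))

  enum { SRC_CAMERA = 0, SRC_RP0, SRC_RP1, SRC_RP2, SRC_DMA0_RD };
  enum { DST_RP0 = 0, DST_RP1, DST_RP2, DST_DISPLAY, DST_DMA0_WR };

  /* Establish the linear topology camera -> RP0 -> RP1 -> RP2 -> display.
   * Each destination has exactly one source, so no flow control is needed. */
  static void xbar_set_linear_pipeline(void)
  {
      *XBAR_SEL(DST_RP0)     = SRC_CAMERA;
      *XBAR_SEL(DST_RP1)     = SRC_RP0;
      *XBAR_SEL(DST_RP2)     = SRC_RP1;
      *XBAR_SEL(DST_DISPLAY) = SRC_RP2;
  }

  /* Skipping the middle stage (as in Figure 5.c) is a register rewrite,
   * three orders of magnitude faster than reconfiguring an RP. */
  static void xbar_skip_middle_stage(void)
  {
      *XBAR_SEL(DST_RP2)     = SRC_RP0;   /* bypass RP1 */
      *XBAR_SEL(DST_DISPLAY) = SRC_RP2;
  }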

This flexible interconnect infrastructure turned out to be a critical mechanism. The expense of this interconnect infrastructure is well justified by the reconfiguration time savings. As we will see in the evaluation section, given the currently high cost of reconfiguration, practical time-sharing is only feasible if the number of reconfigured partitions between pipeline switches is kept to a minimum.

V Prototype System

FPGA Xilinx XCZ7045
Hard CPU Cores 2 x ARM A9
LUT 218,600
BRAM (36 Kb) 545
DSP 900
DRAM Bandwidth 12.8 GB/s (Fabric only)
TABLE I: Xilinx ZC706 board specification

We have implemented a working prototype of the time-sharing runtime framework. The prototype system is built on a Xilinx ZC706 development board with a Xilinx XCZ7045 Zynq SoC FPGA. (Table I gives the specifications of this board.) Camera input comes from a VITA 2000 sensor that supports up to 1920-by-1080 resolution at 60 fps (1080p@60fps). The prototype's HDMI video output can drive standard monitors. This section gives an overview of the prototype.

V-A System Overview

Fig. 6: The high-level organization of the static runtime framework infrastructure.

Static Region Infrastructure. The organization of the runtime framework infrastructure implemented in the static region is shown in Figure 6. The backbone of the runtime framework is the configurable interconnect infrastructure discussed in Section IV-C. This interconnect infrastructure provides streaming connections between ten RPs for vision processing stages, the camera controller input, the HDMI controller output, as well as five DMA engines for streaming to and from DRAM buffers. The interconnect is based on a custom crossbar implementation but the interface follows the AXI4-Stream standard. Once configured, the interconnect infrastructure is capable of streaming frames at 1080p@60fps between a fixed pair of source and sink RPs. Two RPs can also be connected by a DRAM streaming connection that incorporates a circular-buffer FIFO in DRAM. Except for the camera and display controllers, the entire system (static region and reconfigurable partitions) is by default clocked at 200 MHz. The camera and display controllers are clocked at 148.5 MHz.

Management Software. At runtime, a runtime manager running on the embedded ARM processor core manages the creation, execution and time-sharing of vision pipelines. The specification of each pipeline (such as number of stages, module running in each stage and connectivity between stages) is registered with the runtime manager. To switch execution to a new pipeline, the runtime manager assigns a stage to an RP if the RP already has the required module. The RPs for the remaining stages are reconfigured through the PCAP interface with bitstreams loaded from DRAM. Once the partitions are reconfigured, the runtime manager configures the modules, DMA engines, and interconnect to effect the required connectivity before starting pipeline execution. The built-in camera and display controllers are initialized once when the FPGA is first started.

For time-sharing, the runtime manager will cycle through all of the registered pipelines once for every $N$ frames of video. The runtime manager will poll the active pipeline for completion before initiating a switch to the next pipeline. This runtime manager does not do scheduling or enforce a maximum time quantum. If the total time to cycle through all of the pipelines exceeds the time quantum $Q$, the processing falls out of sync and produces glitching output.
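The time-sharing loop amounts to the following C-level sketch (all function and type names are our own for illustration; reconfigure_rp() stands in for the PCAP reconfiguration path):

  #include <stdbool.h>

  #define MAX_STAGES 10

  typedef struct {
      int n_stages;
      int stage_rp[MAX_STAGES];      /* which RP hosts each stage       */
      int stage_module[MAX_STAGES];  /* which module bitstream it needs */
  } pipeline_t;

  /* Framework services (hypothetical prototypes for this sketch). */
  bool rp_has_module(int rp, int module);
  void reconfigure_rp(int rp, int module);          /* PCAP, blocking  */
  void configure_modules_and_dma(pipeline_t *p);
  void configure_interconnect(pipeline_t *p);       /* crossbar regs   */
  void start_pipeline(pipeline_t *p, int n_frames);
  void poll_until_done(pipeline_t *p);
  void wait_for_bundle_buffered(int n_frames);

  /* Round-robin time-sharing: cycle through the registered pipelines once
   * per bundle of N frames; only RPs whose loaded module differs from what
   * the next pipeline needs are reconfigured (Section IV-C).             */
  static void run_time_sharing(pipeline_t *pipes, int n_pipes, int bundle_n)
  {
      for (;;) {
          wait_for_bundle_buffered(bundle_n);       /* N frames in DRAM */
          for (int i = 0; i < n_pipes; i++) {
              pipeline_t *p = &pipes[i];
              for (int s = 0; s < p->n_stages; s++)
                  if (!rp_has_module(p->stage_rp[s], p->stage_module[s]))
                      reconfigure_rp(p->stage_rp[s], p->stage_module[s]);
              configure_modules_and_dma(p);
              configure_interconnect(p);
              start_pipeline(p, bundle_n);
              poll_until_done(p);                   /* no preemption    */
          }
      }
  }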

Vision Modules. We use Xilinx Vivado HLS to develop custom modules. We also make use of the HLS video library that offers a subset of HLS-synthesizable OpenCV functions. These HLS-based modules can be incorporated into our runtime framework since our interconnect supports the AXI4-Stream interface.

V-B Runtime Framework Characteristics

                            Static region                          Reconfigurable
              Crossbar      DMA engines    Misc
LUT           4940 (2%)     10725 (5%)     30578 (14%)     122400 (56%)
BRAM (36 Kb)  0             15 (3%)        23.5 (4%)       360 (66%)
DSP           0             0              0               300 (33%)
TABLE II: Logic resources used by the static region and reconfigurable partitions on the Xilinx XCZ7045.

Logic Resource Utilization. Table II breaks down the fabric resource utilization between the static region and the reconfigurable partitions. The infrastructure logic requires non-trivial resources. The interconnect crossbar is only a small fraction of the total infrastructure. On the other hand, the DMA engines that stream data through DRAM are quite expensive.

On a large FPGA like the Xilinx XCZ7045, ample resources remain to be divided into ten independent RPs. We aimed for a total fabric utilization of roughly 70% to ease the placement and routing process.

DRAM Bandwidth. The DRAM bandwidth on Xilinx ZC706 development board is 12.8 GB/s. This bandwidth is shared by all of the DRAM streaming connections through AXI HP ports. To support 1080p@60fps, each DRAM streaming connection requires a total of 497 MB/sec of memory bandwidth (read and write). DRAM streaming connections include the double-buffers for the camera input and display output, and the decoupling buffers needed to support staggered start (Section IV-A).
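The 497 MB/s figure follows directly from the frame geometry:

  $1920 \times 1080 \times 2\,B/pixel \times 60\,fps \approx 249\,MB/s$ per direction, so read plus write $\approx 497\,MB/s$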

We created a microbenchmark to measure the total DRAM bandwidth actually utilized for an increasing number of active thru-DRAM streaming connections. The results are shown in Figure 7. On the Xilinx ZC706 development board, only up to five concurrent thru-DRAM streaming connections can be supported for 1080p@60fps. Since two thru-DRAM streaming connections are taken up by the camera and display for double-buffering, we are left with only three usable thru-DRAM streaming connections for decoupling the staggered start of RPs (Section IV). This restricts the applicability of the staggered-start optimization in the prototype.

Fig. 7: Measured DRAM bandwidth utilized vs the number of DRAM streaming connections on the ZC706 board. Up to five DRAM streaming connections can concurrently sustain 1080p@60fps.

VI Performance Evaluation

This section presents an application-level evaluation of the time-sharing runtime framework prototype. This evaluation aims to show that useful realtime performance (30+ fps) can be achieved when time-sharing multiple streaming vision pipelines. We first quantify the opportunities of using DPR for repurposing and realtime time-sharing. We then present our results when measuring the achieved performance of time-shared pipelines in frames-per-second (fps) under different operating conditions with the camera running at 720p@60 fps and 1080p@60 fps.

Fig. 8: Logical view of three pipeline examples: (a) color-based object tracking where objects of up to three different colors are tracked, (b) background subtraction, (c) corner and edge detection.
Edge detection: computes a binary mask of vertical and horizontal edges using a Sobel filter [12]
Color-based object tracking: tracks objects based on their color
Template tracking: tracks a given template by computing sum-of-absolute differences and thresholding
Corner detection: computes a binary mask of corners using a Harris corner detector [12]
Blob detection: detects blobs using morphological operations and thresholding
Gaussian blur: blurs an image using a Gaussian filter
Background subtraction: removes the frame background by thresholding
TABLE III: Vision modules used in our evaluation.
              edge+    edge+      blob+   edge+   corner+    background+  background+
              corner   template   color   color   template   corner       edge
LUT           13147    13098      14142   13601   14085      14797        13810
FF            12455    11635      11423   11234   12146      13222        12711
BRAM (36 Kb)  5        5          3.5     3.5     5          3.5          3.5
TABLE IV: Logic resources used by seven 3-branch pipelines. Each pipeline is mapped individually in a static FPGA design (without the runtime framework), and runs at 250 MHz.

VI-A Opportunities for Using DPR

Interactive Realtime Vision Applications. The dynamic adaptation requirement of interactive realtime vision systems leads to a large number of potential pipelines to execute at runtime. The realtime vision system we built (introduced in Section I) requires the flexible, dynamic creation and execution of a large number of pipelines. These pipelines can have a variable topology and number of stages. Figure 8 shows the logical view of three pipeline examples that have different numbers of stages and topologies. (A non-linear pipeline topology is standard in vision and allows different masks, computed on each branch, to be overlaid on the original camera frame.) Each pipeline branch can execute one or a combination of the vision modules listed in Table III, leading to hundreds of potentially different pipelines.

Static Design Limitation. Traditionally, one way to implement such a system is to map all pipelines simultaneously on the FPGA. Table IV presents the logic resources used by seven of the most resource-expensive 3-branch pipelines, when each pipeline is mapped individually and directly on the FPGA (without DPR). These numbers give an idea of the potential cost of mapping a large number of parallel pipelines statically on an FPGA. If all possible linear and non-linear pipelines were to be mapped statically and simultaneously, they would not fit on the FPGA. Mapping those pipelines individually and directly to the FPGA results in the best possible performance; when mapped this way, each of these pipelines can meet timing at 250 MHz. We expect performance to degrade with an increasing number of parallel pipelines to map statically.

DPR Performance Results. DPR presents a viable alternative that overcomes the inflexibility and the resource limitation of a static FPGA design. When using DPR, we expect the performance to degrade compared to a static design due to the RP I/O port placement constraints, which can add wire delay. For repurposing and realtime time-sharing, the performance needs to be sufficient for correct realtime pipeline operation on 720p@60fps and 1080p@60fps input video. Also, there should be enough performance slack to interleave pipelines at the time scale of a camera frame.

To assess the performance of a DPR system, we use the system described in Section V. The ten RPs are sized differently to support repurposing and time-sharing. The four largest RPs (bitstream size of 1.1 MB) are reconfigured when repurposing, while the six smallest RPs (bitstream size of 300 KB) are used for time-sharing. We generate partial bitstreams for the seven modules such that every module can be hosted in any RP. We are able to generate partial bitstreams at 200 MHz within our runtime framework. Despite the expected performance degradation, pipelines can operate correctly on 720p@60fps and 1080p@60fps input video (when run by themselves in the runtime framework without time-sharing). An operating speed of 200 MHz also leaves enough slack to time-share pipelines at the time scale of a camera frame.

VI-B Reconfiguration Overhead

Before presenting the performance of time-shared pipelines, we need to simplify the set of pipelines that we use for the time-sharing evaluation. To do so, we perform a first set of experiments to verify that the cost of switching from one pipeline to another is dominated by the cost of RP reconfiguration, i.e., that the costs of configuring the interconnect, the DMA engines, and the modules, and the cost of starting the pipeline, are negligible. In these experiments, the pipelines occupy up to ten stages and have up to three branches.

We randomly generate tens of pipeline pairs (with different topologies, different stages, different numbers of stages, etc.) and measure the time to interleave the two pipelines in a pair. We find that the sole overhead that matters is the time spent in RP reconfiguration. RP reconfiguration dominates the cost of a pipeline switch by three orders of magnitude. (The cost of reconfiguring the interconnect for a topology change, and other configuration and startup costs, are within the range of 50s to 100s of microseconds depending on the number of RPs and interconnect links to reconfigure.) Switching between pipelines with different topologies does not impact time-shared performance. If switching from one pipeline to another does not change the state of any RP, the switch is almost free. (This is the case, for instance, when the set of stages used by one pipeline is a subset of the stages used by the other pipeline.)

VI-C Performance of Time-Shared Pipelines

For this evaluation, we only consider linear pipelines since pipeline topology does not impact performance as established previously. For these experiments, pipelines occupy up to six stages, and two interleaved pipelines differ by one, two, three, four, five and six RPs. (We only reconfigure the six smallest RPs while the four largest RPs remain unchanged. The time spent in reconfiguring RPs is proportional to the number of RPs to reconfigure since the six RPs have the same size.) When time-sharing, we execute pipelines (1) two at a time, and (2) three at a time. The runtime framework logic produces a simultaneous split-screen video output of the time-shared pipelines.

Figure 9 first summarizes the achieved performance in frames-per-second when the runtime framework is driven with a 720p@60fps video stream and when we execute (a) two pipelines at a time and (b) three pipelines at a time by time-sharing. In Figure 9.a and Figure 9.b, there are six sets of bars corresponding to cases where we reconfigure between one and six RPs to switch from one pipeline to another. For each case, bars for different $N$ are shown. Figure 9.a shows that the processing time required by two pipelines can fit into the $N=1$ (every frame is processed) scheduling quantum of 16.7 milliseconds (=1/60 second) when one RP is reconfigured per pipeline transition. Factoring in reconfiguration time for more than one RP, the time-shared execution of the two pipelines can only keep up when the input is downsampled by $D=2$, i.e., each pipeline runs at 30 fps (video output at 30 fps is still visually smooth), by $D=3$, i.e., each pipeline runs at 20 fps, or by $D=4$, i.e., each pipeline runs at 15 fps. Running the runtime framework at $N=2$ (two consecutive frames are processed) can restore the frame rate of each pipeline to 60 fps with up to three RP reconfigurations. Factoring in reconfiguration time for more than three RPs, the time-shared execution of the two pipelines can only keep up when the input is downsampled by $D=2$, i.e., each pipeline runs at 30 fps.

When time-sharing by three pipelines (Figure 9.b), the interleaved execution only keeps up when the input is downsampled by $D=2$ (one RP reconfiguration), i.e., 30 fps, or by $D=3$ (more than one RP reconfiguration), i.e., 20 fps. Increasing $N$ to 3 in this case allows $D$ to be reduced to 2 (except for the last case, when six RPs are reconfigured).

Figure 10 similarly summarizes the achieved performance measured in frames-per-second when the runtime framework is driven with a 1080p@60fps video stream and when we execute (a) two pipelines at a time and (b) three pipelines at a time by time-sharing. For 1080p processing, the higher processing time required by two pipelines, even without considering reconfiguration time, already would not fit into the scheduling quantum of 16.7 milliseconds. In this case, increasing $N$ cannot improve scheduling slack. Thus, time-shared execution of two pipelines (Figure 10.a) requires downsampling by $D=2$, i.e., 30 fps per pipeline. Time-shared execution of three pipelines (Figure 10.b) requires further downsampling to $D=3$, i.e., 20 fps per pipeline.

Fig. 9: The frames-per-second (fps) for each time-shared pipeline for 720p@60fps input video when we execute (a) two pipelines (b) three pipelines at a time. We reconfigure between one and six RPs per pipeline switch. Each time-shared pipeline processes $N$ consecutive frames before reconfiguration.
Fig. 10: The frames-per-second (fps) for each time-shared pipeline for 1080p@60fps input video when we execute (a) two pipelines (b) three pipelines at a time. We reconfigure between one and six RPs per pipeline switch. Each time-shared pipeline processes $N$ consecutive frames before reconfiguration.

VII Discussion and Conclusion

This paper has discussed and demonstrated the feasibility of using DPR for time-sharing despite the restrictive reconfiguration time. We developed four techniques that form an essential set for overcoming the high reconfiguration time when possible. We demonstrated through a working runtime framework the practical feasibility of time-sharing by vision pipelines at useful frame rates: 30 fps for 1080p and 60 fps for 720p on the Xilinx ZC706 board.

In this paper, we examined the opportunity of using DPR for time-sharing in vision systems, and showed that time-sharing is a promising approach in cases where commonality can be exploited (stage sharing across pipelines), and where performance can be traded for flexibility and cost (in terms of device and power cost). If maximum performance is required for a fixed set of functionalities known ahead of time, mapping all pipelines simultaneously and statically is preferred.

In addition to our proposed techniques, orthogonal approaches discussed in the literature, such as [13], could be used in our system to further reduce reconfiguration time. A disadvantage of using PCAP is that the CPU blocks until a partition reconfiguration is done. The time wasted in blocking is not an important limitation in our case since the CPU is not a compute stage; it would otherwise only use this time to reconfigure the interconnect (tens of microseconds).

In our work, the scheduling and mapping of stages to RPs rely on manual decisions. A scheduler that can manage resources and time at a finer resolution is part of future work.

Faster and concurrent DPR support in the future would improve the effectiveness of time-sharing. DRAM bandwidth is another limitation when we need to buffer streams in DRAM. These improvements are critical to enabling time-sharing applications at time quanta below tens of milliseconds.

VIII Acknowledgments

This work was supported in part by the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

References

  • [1] M. Ullmann, M. Huebner, B. Grimm, and J. Becker, “An FPGA run-time system for dynamical on-demand reconfiguration,” in 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., pp. 135–, April 2004.
  • [2] Xilinx, “Vivado Design Suite User Guide: Partial Reconfiguration (UG909),” 2016.
  • [3] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda, “The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-based Computer,” J. VLSI Signal Process. Syst., vol. 47, pp. 15–31, Apr. 2007.
  • [4] C. Claus, W. Stechele, and A. Herkersdorf, “AutoVision – a run-time reconfigurable MPSoC architecture for future driver assistance systems,” vol. 49, pp. 181–, May 2007.
  • [5] D. Koch and J. Torresen, “FPGASort: A High Performance Sorting Architecture Exploiting Run-time Reconfiguration on FPGAs for Large Problem Sorting,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, (New York, NY, USA), pp. 45–54, ACM, 2011.
  • [6] D. Goehringer, L. Meder, M. Hubner, and J. Becker, “Adaptive Multi-client Network-on-Chip Memory,” in 2011 International Conference on Reconfigurable Computing and FPGAs, pp. 7–12, Nov 2011.
  • [7] C. H. Hoo and A. Kumar, “An area-efficient partially reconfigurable crossbar switch with low reconfiguration delay,” in 22nd International Conference on Field Programmable Logic and Applications (FPL), pp. 400–406, Aug 2012.
  • [8] J. Arram, W. Luk, and P. Jiang, “Ramethy: Reconfigurable Acceleration of Bisulfite Sequence Alignment,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, (New York, NY, USA), pp. 250–259, ACM, 2015.
  • [9] X. Niu, W. Luk, and Y. Wang, “EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, (New York, NY, USA), pp. 74–83, ACM, 2015.
  • [10] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow, “FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack,” in 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 109–116, May 2014.
  • [11] C. Claus, W. Stechele, M. Kovatsch, J. Angermeier, and J. Teich, “A comparison of embedded reconfigurable video-processing architectures,” in 2008 International Conference on Field Programmable Logic and Applications, pp. 587–590, Sept 2008.
  • [12] Itseez, The OpenCV Reference Manual, 2.4.9.0 ed., April 2014.
  • [13] K. Vipin and S. A. Fahmy, “ZyCAP: Efficient partial reconfiguration management on the Xilinx Zynq,” IEEE Embedded Systems Letters, vol. 6, pp. 41–44, Sept 2014.