Due to power and energy constraints, conventional general-purpose processors are no longer able to sustain performance and energy improvements in commercial datacenters. To overcome the inefficiency of homogeneous multicore systems, heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm. In particular, field-programmable gate arrays (FPGAs), which offer the potential of orders-of-magnitude performance/watt gains for a broad class of applications while retaining reconfigurability, are attracting increasing attention as a mainstream acceleration technology. For example, both Microsoft and Baidu have incorporated FPGA-based accelerators in their datacenters to accelerate large-scale production workloads such as search engines [29, 10] and neural networks [24, 25]. Amazon also introduced the F1 instance, a compute instance equipped with FPGA boards, in its Elastic Compute Cloud (EC2). Moreover, following its $16.7 billion acquisition of Altera, Intel announced the Heterogeneous Architecture Research Platform (HARP), which provides an FPGA and a Xeon processor in a single semiconductor package. It has been predicted that as much as 30% of datacenter servers will have FPGAs by 2020. This suggests that FPGAs could become a common component in future servers and could play an important role as primary computing resources.
On the other hand, a major challenge in FPGA-based acceleration is programmability. FPGA programming is generally recognized as an RTL (register-transfer level) design practice, which requires considerable hardware expertise in designing accelerator microarchitectures such as controls, data paths, and finite state machines. This makes the effort of FPGA programming prohibitive for most datacenter application developers. It is even more challenging when the mainstream algorithm in an application domain is constantly evolving; i.e., an algorithm may already be obsolete by the time its hardware accelerator has been developed.
Decades of research have focused on improving FPGA programmability. High-level synthesis (HLS), which allows hardware designs to be described in high-level programming languages like C/C++ (such C/C++ programs for hardware designs are generally called hardware behavioral descriptions), is recognized as an encouraging approach. In fact, a C program can even be compiled by state-of-the-art HLS tools like Xilinx SDAccel into a working FPGA circuit without any modification of the program itself. However, a high-quality software program is generally far from a high-quality hardware behavioral description because it does not take the underlying FPGA architecture into account. Our experiments show that a software program, if naively treated as a hardware behavioral description, almost always leads to an FPGA accelerator that performs orders of magnitude worse than running the program on a modern CPU. This is because HLS still leaves programmers to face the challenge of identifying the optimal design configuration among a tremendous number of choices, which in turn requires intimate knowledge of hardware intricacies to efficiently reduce the design space and obtain a high-quality solution in a reasonable time. Consequently, HLS still presents programmers with a significant gap between a software program and a high-quality hardware behavioral description, which prevents FPGA programmability from being further improved.
This paper presents a comprehensive approach to pave the path from a software program to a high-quality hardware behavioral description that 1) is functionally equivalent to the software program, and 2) leads to a high-performance FPGA accelerator. The approach consists of three main stages. The first stage, design space reduction, aims to reduce the tremendous design space. Specifically, we introduce the composable, parallel and pipeline (CPP) microarchitecture, a template of accelerator designs, as a specification of the program-to-behavioral-description transformation. Such a carefully designed template fits a variety of computation kernels and guarantees the quality of accelerator designs. Also, with the CPP microarchitecture as the transformation specification, the design space is restricted to only configurations of that specific microarchitecture. The second stage, automatic design space exploration, realizes a near-optimal CPP microarchitecture configuration automatically with an analytical model and a machine-learning-based search engine. With this near-optimal configuration, the third stage, automatic accelerator generation, organizes a collection of code transformation primitives to transform the software program to the behavioral description of the desired CPP microarchitecture. We develop the AutoAccel framework to implement the proposed approach and make the entire accelerator generation process automated. In summary, this paper makes the following contributions:
The CPP microarchitecture. By introducing this broadly applicable accelerator design template as the specification of program-to-behavioral-description transformation, we achieve the objective of drastically reducing the design space while preserving accelerator design quality.
The analytical model. This proposed model captures the performance and resource trade-offs among all design configurations of the CPP microarchitecture, laying the foundation for fast, automated design space exploration.
The AutoAccel framework. AutoAccel automates the entire accelerator generation process, provides datacenter application developers with a nearly push-button experience of FPGA programming, and thus substantially improves the FPGA programmability.
Detailed evaluation. We evaluate AutoAccel on the MachSuite benchmark suite and propose a metric to measure whether the quality of AutoAccel-generated accelerators reaches optimality. We also evaluate the accuracy of the proposed analytical model against Xilinx SDAccel and on-board execution.
Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for the MachSuite computation kernels.
A field-programmable gate array (FPGA) is an integrated circuit that contains an array of reprogrammable logic and memory blocks: lookup tables (LUTs), flip-flops (FFs), digital signal processing slices (DSPs) and block RAMs (BRAMs). Connected through a hierarchy of reconfigurable interconnects, these blocks can be customized into different circuits to solve various computation problems. Such hardware customizability allows FPGA circuits to avoid the significant overhead of the general-purpose microprocessors, resulting in orders-of-magnitude performance/watt gains for a broad class of workloads.
However, the FPGA programmability issue is a serious impediment to its adoption by datacenter application developers. Section 2.1 briefly describes state-of-the-art commercial HLS tools that represent the latest effort in improving FPGA programmability through HLS. The fact that such tools leave programmers to take full responsibility for performance optimization motivates our work. In Section 2.2 we then introduce the Merlin compiler [2, 13, 14], a compilation framework that attempts to alleviate the burden of manual code optimization by providing a library of automated code transformation primitives. While the Merlin compiler still relies on programmers to determine the optimal combination and parameters of the transformation operations, and thus does not substantially relieve the burden, its transformation library serves as a good starting point for us to quickly implement automatic generation of the CPP microarchitecture.
2.1 Commercial HLS Tools
Commercial HLS tools such as Xilinx SDAccel and Intel FPGA SDK for OpenCL have been widely used to rapidly prototype user-defined functionalities expressed in high-level languages (e.g., C/C++ and OpenCL) on FPGAs without involving register-transfer level (RTL) descriptions. The example design flow used by common commercial HLS tools is shown in Fig. 1.
Commercial HLS tools usually provide a set of language extensions, such as C pragmas, through which users give guidance on memory organization and task scheduling to supply the information that static analysis cannot recover while optimizing the design. The language extensions are specified by the user at the source code level, but the core HLS code transformation and optimization happens at the intermediate representation (IR) level, so the effectiveness of user guidance depends heavily on the IR structure and the front-end compiler. This implies that two programs with the same functionality but different coding styles (leading to different IR structures) might result in a significant performance difference. In fact, this difference can be up to several orders of magnitude in our experience. As a consequence, programmers have to pay attention to every detail that may affect the generated IR structure, which often requires a profound understanding of the FPGA architecture and circuit design.
2.2 Merlin Compiler
The Merlin compiler [2, 13, 14] is a source-to-source transformation tool for FPGA acceleration based on the CMOST compilation flow. It provides a transformation library and a set of pragmas with prefix “#pragma Accel” for developers to perform design optimization at the source-code level. Each pragma corresponds to a code transformation primitive, as listed in Table 1.
|Transformation|Object|Option|Description|Example|
|Data Tiling||||#pragma Accel data_tiling tilesize=16|
|Memory Coalescing|Buffer|bitwidth=|Pack DRAM buffer to the specified number of bits.|#pragma Accel bitwidth variable=buf factor=512|
|Pipeline||||#pragma Accel pipeline|
|Parallel||||#pragma Accel parallel factor=4|
Based on the transformation library, Fig. 2 presents the Merlin compiler execution flow. It leverages the ROSE compiler infrastructure and polyhedral framework for abstract syntax tree (AST) analysis and transformation. The front-end stage analyzes the user program and separates host and computation kernel. The kernel code transformation stage then applies multiple code transformations according to user-specified pragmas. Note that the Merlin compiler will perform all necessary code reconstructions to make a transformation effective. For example, when performing loop unrolling, the Merlin compiler not only unrolls a loop but also conducts memory partitioning to avoid bank conflicts. Finally, the back-end stage takes the transformed kernel and uses the HLS tool to generate the FPGA bitstream.
Compared to the pure HLS solution, the Merlin compiler further improves the FPGA programmability by making design optimization “semiautomatic”: instead of manually reconstructing the code to make one optimization operation effective, programmers now can simply place a pragma and let the Merlin compiler do the necessary changes. However, programmers still have to identify the best combination and parameters among these operations, i.e., manually searching in an exponential design space.
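To give a feel for why this manual search scales poorly, the sketch below counts pragma configurations for a made-up kernel. The number of loops and the candidate values are illustrative assumptions, not figures from the Merlin compiler or from our benchmarks.

```python
def design_space_size(options_per_pragma):
    """Number of distinct configurations: the product of the option counts."""
    size = 1
    for options in options_per_pragma:
        size *= len(options)
    return size


# Hypothetical kernel: four loops, each with a parallel factor and a
# pipeline on/off switch (values chosen only for illustration).
parallel_factors = [1, 2, 4, 8, 16]
pipeline_choices = ["off", "on"]
options = [parallel_factors, pipeline_choices] * 4

print(design_space_size(options))
```

Each additional loop multiplies the count by another factor of ten in this toy setting, which is the exponential growth referred to above.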
3 Accelerator Design Template
This section presents the details of the design space reduction stage of the proposed approach. In general, our solution is to introduce an accelerator design template as the specification of the transformation from software programs to hardware behavioral descriptions. A software program will only be transformed to a hardware behavioral description of this introduced template, so the design space is restricted to only configurations of the template. As a result, the design space is drastically reduced (see Section 4 for the design space definition). Meanwhile, this template ought to be applicable to a variety of computation kernels, and to guarantee the accelerator design quality once a kernel fits into the template. Sections 3.1 and 3.2 present our proposed accelerator design template, the composable, parallel and pipeline (CPP) microarchitecture, and show how the CPP microarchitecture is derived. Section 3.3 discusses the applicability of the CPP microarchitecture to various computation kernels.
3.1 Obstacles Towards Efficient Behavioral Description
We derive the CPP microarchitecture by analyzing the major obstacles on the way from a software program to an efficient hardware behavioral description. Specifically, we start from a collection of computation kernels, straightforwardly treat their software implementations as behavioral descriptions (both the kernels and their software implementations are from the MachSuite benchmark suite; see Section 6.1), feed such naive behavioral descriptions into Xilinx SDAccel, and identify the microarchitectural inefficiencies of the generated FPGA accelerators. Such inefficiencies represent the obstacles towards efficient behavioral descriptions.
We use the NW (Needleman-Wunsch algorithm) benchmark (see Section 6.1) as an example for demonstration and discussion. The NW benchmark processes a series of genome sequence alignment jobs, each with a pair of 128-entry sequences as input and a pair of 256-entry sequences as output. The alignment engine applies the Needleman-Wunsch algorithm, a dynamic programming algorithm with quadratic time complexity, to the input sequences, and generates the optimal post-aligned sequences given a predefined scoring system.
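For readers unfamiliar with the algorithm, the following is a minimal Python sketch of Needleman-Wunsch global alignment. It is not the MachSuite source; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    # M[i][j] holds the best score for aligning a[:i] with b[:j].
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + step,   # align a[i-1] with b[j-1]
                          M[i - 1][j] + gap,        # gap in b
                          M[i][j - 1] + gap)        # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), M[len(a)][len(b)]


aligned_a, aligned_b, score = needleman_wunsch("GATTACA", "GCATGCU")
print(aligned_a, aligned_b, score)
```

The two nested loops that fill the score matrix are the quadratic part of the computation. An aligned sequence is never longer than the two inputs combined, which is consistent with 128-entry inputs producing 256-entry outputs.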
Fig. 3 presents the NW code snippet and the microarchitecture of the FPGA accelerator generated by naively feeding the NW code into Xilinx SDAccel. Our experiments show that this accelerator performs 92x slower than a single CPU core. We dig into the implementation inefficiencies of the NW benchmark that cause such poor performance as follows.
Inefficiency #1: Inefficient off-chip transaction. The kernel function is the top-level function of the NW benchmark and defines the entire accelerator. Its arguments (seqAs, seqBs, alignedAs and alignedBs, which correspond to the original sequence pairs and the aligned sequence pairs) define the input and output buffers that reside in the off-chip DRAM of the FPGA board. The FPGA accelerator connects to these off-chip buffers through AXI channels. The data width of each AXI channel is eight bits, inferred from the data type of the corresponding argument (the 8-bit char type in the NW case). As a result, the off-chip data transaction throughput is only one byte/cycle for each channel, or four bytes/cycle in aggregate, while state-of-the-art CPU-FPGA platforms typically support 64 bytes/cycle of off-chip communication throughput.
Inefficiency #2: No data caching. No data caching module is present in the microarchitecture, with the result that every data access goes through the off-chip DRAM.
Inefficiency #3: Sequential loop scheduling. The kernel function body is a loop statement that iteratively traverses every sequence pair through the engine function that defines the hardware engine module. In the presented microarchitecture, the engine module accepts and processes only one sequence pair at a time, despite the fact that these sequence pairs are independent of each other and thus can be processed in parallel or in a pipeline. Worse still, all loops in the NW kernel are scheduled to be processed sequentially, regardless of whether a loop could be mapped to a parallel or pipeline circuit. (The latest Xilinx flow has started to perform loop pipelining automatically, but only for simple loop statements.)
Inefficiency #4: Inefficient on-chip memory utilization. The major computation of the NW algorithm is to generate a two-dimensional score matrix. The engine function therefore includes a local two-dimensional array, M, to store the matrix, and some loop statements to calculate the values of the matrix elements. In the presented microarchitecture, the array M is mapped to an on-chip BRAM buffer that has only one write port, implying that even if the algorithm has the potential to generate multiple matrix element values per cycle, the BRAM buffer is not able to fulfill this potential because only one value can be written into the buffer in each cycle.
These inefficiencies, though demonstrated only in the NW example, are present in all MachSuite benchmarks and represent the major obstacles from software programs to high-quality hardware behavioral descriptions. The CPP microarchitecture is thus derived to resolve these inefficiencies.
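The bandwidth gap in inefficiency #1 can be made concrete with a little arithmetic. The sketch below uses the job sizes and throughputs quoted above; the helper name and the simplification that the four channels are evenly loaded are assumptions.

```python
import math


def transfer_cycles(num_bytes, bytes_per_cycle):
    """Cycles needed to move num_bytes at the given throughput."""
    return math.ceil(num_bytes / bytes_per_cycle)


# One NW job: two 128-byte input sequences and two 256-byte output sequences.
job_bytes = 2 * 128 + 2 * 256

narrow = transfer_cycles(job_bytes, 4)   # four 8-bit channels, 4 bytes/cycle in aggregate
wide = transfer_cycles(job_bytes, 64)    # 64 bytes/cycle platform throughput

print(narrow, wide, narrow // wide)
```

Under these assumptions the naive interface spends sixteen times as many cycles on off-chip transfers as the platform would allow.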
3.2 CPP Microarchitecture
The composable, parallel and pipeline (CPP) microarchitecture is proposed as a template of accelerator designs and a specification of the program-to-behavioral-description transformation. It includes a series of features to address the inefficiencies in the previous section. In the following text we continue to use the NW benchmark as an example to demonstrate the CPP microarchitecture along with its key features, as shown in Fig. 4.
Feature #1: Coarse-grained pipeline with data caching. Fig. 4 illustrates the NW accelerator design under the CPP microarchitecture. The overall CPP microarchitecture is a coarse-grained pipeline that consists of three stages: load, compute and store. The kernel function in the NW source code only corresponds to the compute module instead of defining the entire accelerator. The input sequence pairs are processed tile by tile, i.e., iteratively loading a certain number of sequence pairs into on-chip buffers (Stage load), aligning these pairs (Stage compute), and storing the post-aligned pairs back to DRAM (Stage store). Different tiles are processed in pipeline since they are independent from each other. This feature addresses inefficiency #2 because off-chip data movement only happens in the load and store stages, leaving the data accesses of computation completely on chip.
The load and store modules connect to two input and output DRAM buffers, respectively, through AXI channels. The data widths of the AXI channels are decoupled from the type sizes of the top-level function arguments. Hence, the off-chip bandwidth can potentially reach the highest physical bandwidth of the CPU-FPGA platform. Also, the load-compute-store pipeline improves the effective bandwidth of the accelerator by overlapping communication with computation. Consequently, inefficiency #1 is addressed as well.
Feature #2: Loop scheduling. The CPP microarchitecture tries to map every loop statement presented in the computation kernel function to either 1) a circuit that processes different loop iterations in parallel, 2) a pipeline where the loop body corresponds to the pipeline stages, or 3) a combination of both. As for the NW example, the loop statement in the kernel function is mapped to a set of engine modules to process the sequence pairs in parallel. Moreover, the loop statements in the engine function are mapped to parallel and pipeline circuits as well. This resolves inefficiency #3.
Feature #3: On-chip buffer reorganization. In the CPP microarchitecture, all the on-chip BRAM buffers are partitioned to meet the port requirement of parallel circuits, where the number of partitions of each buffer is determined by the duplication factor of the parallel circuit that connects to the buffer. This feature is used for resolving inefficiency #4. In the NW example, the on-chip buffers that cache the input and output sequence pairs are partitioned into multiple segments, each segment feeding one engine module. The local buffer M that stores the score matrix is also partitioned to allow parallel read and write transactions.
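The effect of buffer partitioning on parallel access can be sketched as follows. This example assumes cyclic partitioning (element i goes to bank i mod P), which is one common scheme and not necessarily the one used for every buffer; all names are illustrative.

```python
def cyclic_partition(index, factor):
    """Map a flat buffer index to (bank, offset) under cyclic partitioning."""
    return index % factor, index // factor


def banks_touched(base, unroll, factor):
    """Banks accessed in one cycle by `unroll` lanes reading consecutive elements."""
    return [cyclic_partition(base + lane, factor)[0] for lane in range(unroll)]


def has_bank_conflict(base, unroll, factor):
    """True if two lanes hit the same bank (assuming one port per bank)."""
    banks = banks_touched(base, unroll, factor)
    return len(set(banks)) < len(banks)


print(banks_touched(0, 4, 4), has_bank_conflict(0, 4, 4))
print(banks_touched(0, 4, 2), has_bank_conflict(0, 4, 2))
```

When the partition factor matches the duplication factor of the parallel circuit, each lane gets its own bank and no access has to wait for a port.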
In summary, the CPP microarchitecture guarantees the quality of accelerator designs by providing corresponding features to address the inefficiencies. However, it is not applicable to all kinds of computation kernels with various data processing patterns. The following section discusses the applicability of the CPP microarchitecture for various computation kernels.
3.3 Applicability for Computation Kernels
The CPP microarchitecture features a load-compute-store coarse-grained pipeline, which requires the computation kernel to process input data block by block. Meanwhile, the size of each block is required to be less than a few megabytes in order to be entirely cached on chip. As a consequence, the CPP microarchitecture favors computation kernels with regular data-level parallelism, like streaming or batch processing programs with the MapReduce pattern. In contrast, it does not fit well with computation kernels that feature extensive random accesses on a large memory footprint, such as PageRank and the breadth-first search (BFS) algorithm.
4 Analytical Model
Another advantage of using the CPP microarchitecture is that it yields a clearly defined design space. This section presents our CPP microarchitecture analytical model, which estimates the execution cycles and resource consumption of the configurations in that design space; this lays the foundation for the automatic design space exploration stage of the proposed approach.
Unlike most existing models [18, 20, 28, 32, 36] that analyze the source program directly, many parameters of our proposed model are obtained from the HLS synthesis reports of a few design points. This feature enables our model to capture most scheduling optimizations performed by the HLS tool. As we will show in Section 6, the proposed model has less than a error rate compared to the HLS report.
4.1 Performance Modeling
The performance model estimates an accelerator’s overall execution cycle count ($C$) through Eq. 1:
$$C = \max\left(C_{load} + C_{store},\; C_{compute}\right) \tag{1}$$
where $C_{load}$, $C_{compute}$ and $C_{store}$ denote the cycles of the load, compute and store modules, respectively. Since the load and store modules share the off-chip bandwidth and are together overlapped with the compute module in our experimental platform, we take the maximum of the combined cycles of the load and store modules and the cycles of the compute module.
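The overlap described here can be illustrated with a small schedule simulation. This is a simplified sketch under stated assumptions: every tile has the same load, compute and store cost, the load and store of neighboring tiles are serialized on the shared off-chip channel, and both run concurrently with the compute of the current tile. The function names are illustrative.

```python
def sequential_cycles(n_tiles, load, compute, store):
    """No overlap: every tile is loaded, computed and stored back to back."""
    return n_tiles * (load + compute + store)


def pipelined_cycles(n_tiles, load, compute, store):
    """Load-compute-store pipeline over n_tiles >= 1 tiles."""
    total = load                       # prologue: load tile 0
    for k in range(1, n_tiles + 1):    # step k computes tile k-1
        memory = 0
        if k >= 2:
            memory += store            # store tile k-2 during this step
        if k < n_tiles:
            memory += load             # prefetch tile k during this step
        total += max(compute, memory)
    return total + store               # epilogue: store the last tile


print(sequential_cycles(4, 10, 30, 10), pipelined_cycles(4, 10, 30, 10))
```

In steady state each step costs the larger of the compute time and the combined load and store time, which is the maximum operation used by the model; the prologue and epilogue become negligible as the number of tiles grows.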
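The execution cycles of the load, compute and store modules, as well as of all their submodules, can be quantified as the total cycles of all the loops ($C_l$), submodules ($C_s$) and standalone logic ($C^{logic}_m$) of a module $m$, as shown in Eq. 2:
$$C_m = \sum_{l \in L_m} C_l + \sum_{s \in S_m} C_s + C^{logic}_m \tag{2}$$
where $L_m$ and $S_m$ are the loops and submodules of module $m$, and $C^{logic}_m$ is obtained from the HLS report.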
Then we model the loop execution. Although a loop statement can be scheduled in pipeline, in parallel or as a combination of both, the first two schedules can be treated as special cases of the last one, and all of them can be modeled together as Eq. 3:
$$C_l = IL_l + II_l \times \left(\left\lceil \frac{TC_l}{UF_l} \right\rceil - 1\right) \tag{3}$$
where $IL_l$, $II_l$, $TC_l$ and $UF_l$ denote the iteration latency, initiation interval, trip count and unroll factor, respectively. $II_l$ and $TC_l$ are obtained from the HLS report; $UF_l$ is a design parameter that needs to be explored.
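A minimal sketch of a loop-cycle estimate in this spirit is given below. It assumes the common pipelined-loop form, in which the first group of iterations pays the full iteration latency and each remaining group adds one initiation interval; the function and argument names are illustrative.

```python
import math


def loop_cycles(iteration_latency, initiation_interval, trip_count, unroll_factor):
    """Estimated cycles of a loop that is unrolled and/or pipelined."""
    groups = math.ceil(trip_count / unroll_factor)  # iterations issued together
    return iteration_latency + initiation_interval * (groups - 1)


# Pipeline only: unroll factor 1, a new iteration starts every cycle.
print(loop_cycles(10, 1, 128, 1))
# Parallel only: no pipelining, so the initiation interval equals the latency.
print(loop_cycles(10, 10, 128, 4))
# Combination of both.
print(loop_cycles(10, 1, 128, 4))
```

Setting the unroll factor to one gives the pure pipeline case, and setting the initiation interval equal to the iteration latency gives the pure parallel case, which is how the two simpler schedules become special cases of the combined one.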
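Subsequently, we break down and model the loop iteration in Eq. 4, where the iteration latency of a loop $l$ ($IL_l$) is composed of the total cycles of all its sub-loops ($C_{l'}$), submodules ($C_s$) and standalone logic ($C^{logic}_l$):
$$IL_l = \sum_{l' \in L_l} C_{l'} + \sum_{s \in S_l} C_s + C^{logic}_l \tag{4}$$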
Eq. 2 and Eq. 4 reflect the architecture hierarchy with nested modules and loops. The proposed model recursively traverses all the loops and modules until a loop or module does not contain any sub-structures. In addition, we can find that Eq. 2 and Eq. 4 are almost identical. This is because the loop iteration can be treated as a special “module” and modeled in the same way for both performance and resource. Hence, we omit the loop iteration breakdowns in the following resource models.
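The recursive traversal can be sketched as two mutually recursive functions over a nested description of the design. The dictionary layout and key names below are hypothetical, and the loop formula is the common pipelined-loop form assumed for illustration.

```python
import math


def module_cycles(module):
    """Cycles of a module (or loop body): its loops, submodules and standalone logic."""
    total = module.get("logic", 0)
    total += sum(nested_loop_cycles(loop) for loop in module.get("loops", []))
    total += sum(module_cycles(sub) for sub in module.get("modules", []))
    return total


def nested_loop_cycles(loop):
    """Cycles of a loop whose body is modeled as a special 'module'."""
    iteration_latency = module_cycles(loop["body"])
    groups = math.ceil(loop["trip_count"] / loop["unroll"])
    return iteration_latency + loop["ii"] * (groups - 1)


inner = {"ii": 1, "trip_count": 16, "unroll": 1, "body": {"logic": 5}}
outer = {"ii": 22, "trip_count": 8, "unroll": 2,
         "body": {"logic": 2, "loops": [inner]}}
kernel = {"logic": 3, "loops": [outer]}

print(module_cycles(kernel))
```

The recursion stops at bodies that contain only standalone logic, mirroring the traversal until no sub-structures remain.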
4.2 Resource Modeling
The resource models estimate the consumptions of the four FPGA on-chip resources: BRAMs, LUTs, DSPs and FFs. As the DSP model is relatively straightforward and the FF model is similar to the LUT model, we only demonstrate the BRAM and LUT models in this section.
BRAM modeling: The BRAM consumption of a hardware module $m$ consists of the BRAM blocks used by all its local buffers ($B_b$) and those used by all its submodules ($B_s$), as shown in Eq. 5:
$$B_m = \sum_{b \in \mathit{buf}(m)} B_b + \sum_{s \in S_m} DF_s \times B_s \tag{5}$$
where $DF_s$ is the duplication factor of submodule $s$, which is equivalent to the unroll factor of the loop that includes this submodule. We use “duplication factor” instead of “unroll factor” since the former is a better fit for depicting hardware modules and the latter is more suitable for describing loop statements.
Then we model the BRAM consumption of on-chip buffers. A buffer’s BRAM consumption is determined by three factors: 1) the partition factors on all dimensions, 2) the size of each partition, and 3) the bit-width of the buffer, as shown in Eq. 6:
where the size of a BRAM block is a platform-dependent constant, and the bit-width term is a function that calculates the minimum number of BRAM blocks needed to compose a BRAM buffer with the given bit-width. Eq. 8 shows its expression, which involves another platform-dependent constant: the largest supported bit-width of a BRAM building block.
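The sketch below shows one way such a BRAM estimate could be computed. It is an assumption-laden illustration rather than the model's exact expression: it takes the per-partition cost as the larger of a capacity bound and a width bound, and uses an 18 Kb block with a 36-bit maximum width as example platform constants.

```python
import math

BRAM_BLOCK_BITS = 18 * 1024   # assumed capacity of one BRAM block
BRAM_MAX_WIDTH = 36           # assumed largest bit-width of one BRAM block


def buffer_brams(partition_factors, words_per_partition, bit_width):
    """Estimated BRAM blocks of a buffer partitioned along each dimension."""
    partitions = math.prod(partition_factors)
    by_capacity = math.ceil(words_per_partition * bit_width / BRAM_BLOCK_BITS)
    by_width = math.ceil(bit_width / BRAM_MAX_WIDTH)
    return partitions * max(by_capacity, by_width)


def module_brams(buffer_counts, submodules):
    """Local buffers plus each submodule's usage times its duplication factor.

    submodules is a list of (duplication_factor, brams_per_copy) pairs.
    """
    return sum(buffer_counts) + sum(df * brams for df, brams in submodules)


small = buffer_brams([4], 32, 8)     # 8-bit buffer split into four partitions
wide = buffer_brams([1], 16, 512)    # shallow but 512-bit wide buffer
print(small, wide, module_brams([small, wide], [(8, 2)]))
```

The wide buffer shows why bit-width matters independently of capacity: it holds little data but still needs many blocks side by side to provide 512 bits per access.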
LUT modeling: The LUT consumption of a hardware module is composed of the number of LUTs used by all loops, submodules, BRAM buffers (for control logic) and the standalone logic:
where the loop term is the LUT consumption of the loop iteration, which is, again, treated and modeled as a special “module.” The LUT consumption of the standalone logic is obtained from two HLS reports.
We then model the LUT consumption of on-chip buffers. It can be decoupled into two parts: 1) the control and data signals of each BRAM partition, and 2) the multiplexer that selects the desired data from all the partitions, as shown in Eq. 10:
where the LUT costs of the control and data signals are obtained from the HLS report, and the cost of the multiplexer can be calculated via Eq. 11. We can also see that the LUT consumption of a buffer depends on its BRAM usage.
Based on the proposed model, the design space of the CPP microarchitecture is composed of 1) the capacity and bit-width of every on-chip buffer, and 2) the unroll factor of every loop, as indicated in Table 2. Unfortunately, the proposed model is neither linear nor convex, and therefore not able to be mathematically solved in polynomial time. Hence, we implement automatic design space exploration by leveraging a machine-learning-based search engine that is able to greatly reduce the number of search iterations needed to reach a near-optimal solution. This, together with the AutoAccel framework, will be presented in the following section.
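For intuition about what the search engine has to do, the sketch below performs an exhaustive search over a tiny design space. This brute-force loop is only a baseline for illustration and is not the machine-learning-based engine; the toy cost functions and the resource limit are made-up assumptions.

```python
import math
from itertools import product


def explore(candidates_per_loop, cycles_fn, resource_fn, resource_limit):
    """Return (config, cycles) of the fastest configuration that fits, or None."""
    best = None
    for config in product(*candidates_per_loop):
        if resource_fn(config) > resource_limit:
            continue                      # does not fit on the device
        cycles = cycles_fn(config)
        if best is None or cycles < best[1]:
            best = (config, cycles)
    return best


TRIP_COUNTS = (128, 64)                   # two hypothetical loops


def toy_cycles(config):
    return sum(math.ceil(tc / uf) for tc, uf in zip(TRIP_COUNTS, config))


def toy_resource(config):
    return 10 * sum(config)               # pretend each parallel lane costs 10 units


unrolls = [1, 2, 4, 8, 16]
print(explore([unrolls, unrolls], toy_cycles, toy_resource, resource_limit=200))
```

Exhaustive enumeration is only feasible here because the space has 25 points; with many loops and buffers the number of configurations grows multiplicatively, which is why a guided search is needed.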
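|Design parameter|Applies to|
|Buffer capacity|Every on-chip buffer|
|Buffer bit-width|Every on-chip buffer|
|Loop unroll factor|Every loop|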
5 AutoAccel Framework
In this section we present the AutoAccel framework, which takes a nested loop in C as input and performs a series of transformations to produce a high-quality FPGA accelerator with the CPP microarchitecture. (Computation kernels with multiple nested loops can be decoupled into multiple sub-kernels, each corresponding to a CPP microarchitecture. Existing work has extensively studied how to connect multiple accelerators through FIFO channels with efficient inter-accelerator communication.) AutoAccel is built on top of the Merlin compiler and uses its transformation library to construct the CPP microarchitecture.
Fig. 5 illustrates the overall flow of the AutoAccel framework. The input program is first evaluated by the legalization checking to determine whether it fits into the CPP microarchitecture. Next, we implement a CPP microarchitecture constructor to refactor the input program to a hardware behavioral description of the CPP microarchitecture. Subsequently, a design space builder is developed to identify the design space via static code analysis. After the design space has been built, we introduce a design space explorer with our proposed analytical model to realize the best design specification in minutes. Finally, we refactor the behavioral description code again by applying the best design specification to generate the desired accelerator design. This design can be directly fed into Xilinx SDAccel to derive a high-quality accelerator bitstream. In the remainder of this section we present the detailed implementation of each component.
5.1 Legalization Checking
Since AutoAccel does not require any user modification of the input computation kernel code, the goal of legalization checking is to evaluate whether the input kernel is able to be mapped to the CPP microarchitecture. We briefly describe the evaluation points of the AutoAccel built-in legalization checking algorithm as follows:
Kernel size. The resource requirements of generated designs cannot exceed the capacity of a single FPGA fabric. This can be evaluated by running HLS with the basic configuration.
Task-dependent data chunk length. A task-dependent array is an array that is traversed by the PE-loop so that every PE will use a different chunk of data. To achieve the most efficient parallelism and pipeline scheduling in the CPP microarchitecture, the on-chip scratchpad memory is partitioned for every PE to avoid writing conflicts. For example, the string length of NW kernel in Fig. 3 is always 128, so it can be processed by AutoAccel. However, if the size of the task data chunk is determined dynamically, AutoAccel cannot statically allocate a certain memory size to each PE; this results in the failure of legalization checking.
Task-independent data size. A task-independent array, on the other hand, is an array that is accessed by all PEs. For example, in the breadth-first search (BFS) implementation of the MachSuite benchmark, the array that stores the tree is task-independent, because every PE might access any part of the array, so it cannot be partitioned regularly. As a result, it is better to duplicate task-independent arrays in on-chip memory for each PE to guarantee efficiency. If the array is too large to be stored in on-chip memory, the kernel fails to pass the legalization checking.
We perform legalization checking by traversing an abstract syntax tree (AST). We analyze the iteration domain with polyhedral analysis to reason about the size of the data accessed by the kernel.
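The three evaluation points above can be summarized as a simple check. The sketch assumes that a prior analysis has already produced the listed facts; the class, its fields and the messages are hypothetical and do not correspond to AutoAccel's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KernelSummary:
    """Facts a static analysis might report about a kernel (hypothetical fields)."""
    baseline_utilization: float   # fraction of FPGA resources used by the basic configuration
    chunk_bytes: Optional[int]    # per-task chunk size, or None if only known at run time
    shared_bytes: int             # total size of task-independent arrays


def legalization_issues(kernel: KernelSummary, num_pes: int, onchip_bytes: int) -> List[str]:
    """Return the reasons a kernel fails the check; an empty list means it passes."""
    issues = []
    if kernel.baseline_utilization > 1.0:
        issues.append("kernel does not fit on a single FPGA")
    if kernel.chunk_bytes is None:
        issues.append("task-dependent chunk size is not static")
    if kernel.shared_bytes * num_pes > onchip_bytes:
        issues.append("task-independent arrays too large to duplicate on chip")
    return issues


nw_like = KernelSummary(baseline_utilization=0.2, chunk_bytes=128, shared_bytes=0)
bfs_like = KernelSummary(baseline_utilization=0.2, chunk_bytes=None, shared_bytes=1 << 20)
print(legalization_issues(nw_like, num_pes=8, onchip_bytes=4 << 20))
print(legalization_issues(bfs_like, num_pes=8, onchip_bytes=4 << 20))
```

Returning the list of failed checks rather than a single boolean is a design choice made here so that a user can see why a kernel was rejected.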
5.2 CPP Microarchitecture Construction and Design Space Establishment
AutoAccel makes use of the transformation library of the Merlin compiler to preprocess user input code to fit the CPP microarchitecture. To constrain the design space when constructing the CPP microarchitecture, we use static analysis and a polyhedral model to collect the necessary information (e.g., loop trip count, maximal buffer size, bit-width, etc.). Instead of specifying an integer number in Merlin pragmas for a certain configuration (e.g., data tiling size), we define an expression “auto(min, max, inc)” to represent a set of design points. In the expression, min and max indicate the range while inc specifies the incremental operator from the minimum value to the maximum value. We currently support two incremental operators: 1) seq, which represents the “+1” increment, and 2) pow2, which represents the “×2” increment. This expression will be replaced with a specific integer of the best configuration after the design space exploration (DSE).
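How such an expression expands into candidate design points can be sketched as follows. The helper is illustrative only; it assumes that seq steps by one and pow2 doubles the value, as the operator names suggest.

```python
def auto(minimum, maximum, inc):
    """Expand an auto(min, max, inc) expression into its list of design points."""
    if minimum < 1 or maximum < minimum:
        raise ValueError("expected 1 <= min <= max")
    if inc == "seq":
        return list(range(minimum, maximum + 1))
    if inc == "pow2":
        points = []
        value = minimum
        while value <= maximum:
            points.append(value)
            value *= 2
        return points
    raise ValueError(f"unknown incremental operator: {inc}")


print(auto(1, 4, "seq"))
print(auto(8, 512, "pow2"))
```

The second call lists the bit-width candidates for an 8-bit data type on a platform whose widest interface is 512 bits.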
We now introduce the transformation operations used to construct the CPP microarchitecture. Again, the NW benchmark is used as an example to demonstrate the transformation flow, as shown in Code 1. The first three transformations are data tiling, coarse-grained pipeline and processing element duplication.
1. Data tiling: The transformation first tiles a sub-loop in the nested loop and creates a set of on-chip buffers for data caching. Then it instruments the code for establishing efficient off-chip data communication by enabling memory burst. The transformed code corresponds to lines 33-51 in Code 1.
Since the CPP microarchitecture decouples the off-chip memory communication from computation, the analytical model does not cover the design points with different data tiling granularity. To solve this problem, we find the best design point of all possible data tiling granularities that are reported by the legalization checking algorithm in parallel. We plan to include data tiling granularity into the design space in the future.
2. Coarse-grained pipeline: After the data tiling, we apply the coarse-grained pipeline transformation that encapsulates load, compute, store into three functions to draw the boundaries between pipeline stages (lines 41-51). Subsequently, the transformation duplicates on-chip buffers created by step 1 and interleaves all of them by enabling double buffering.
3. Processing element duplication: The next step is to enable parallel computing. We apply the parallelism transformation to the compute stage in the tiled nested loop (lines 20-24). This creates multiple homogeneous processing elements (PEs) to process the loop iterations in parallel.
Until now, we have constructed a microarchitecture with a coarse-grained pipeline and a PE array that covers feature #1 and part of feature #2 of the CPP microarchitecture. Subsequently, we focus on loop scheduling inside PEs.
4. Small loop flatten: Based on our experience, it is usually better to flatten the in-PE loops with fixed, small trip counts. The reasons are that 1) flattening loops with small trip counts gives HLS more opportunities to generate an efficient schedule, and 2) flattening such loops does not affect the overall resource utilization considerably. As a result, we adopt an ad hoc strategy that fully unrolls in-PE loops with a trip count less than 16.
5. Fine-grained parallel/pipeline: If an in-PE loop cannot be fully unrolled by step 4, it must satisfy one of the following conditions: 1) its trip count is either unknown or larger than 16, 2) it has a loop-carried dependency, or 3) it contains one or more sub-loops that cannot be fully unrolled. Under the first condition, we apply fine-grained parallelism and explore the best partial-unroll factor (lines 6, 10, and 15). Under the other two conditions, we apply a fine-grained pipeline to improve the throughput and resource efficiency.
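The decision logic of steps 4 and 5 can be sketched as follows. This is an illustration only: the function name and its arguments are hypothetical, and the sketch assumes that a loop-carried dependency or a non-unrollable sub-loop takes precedence over the trip-count condition when several conditions hold at once, which the text does not specify.

```python
from typing import Optional

UNROLL_TRIP_COUNT_LIMIT = 16  # threshold stated in step 4


def choose_in_pe_strategy(trip_count: Optional[int],
                          has_loop_carried_dep: bool,
                          has_non_unrollable_subloop: bool) -> str:
    """Pick a scheduling strategy for one in-PE loop.

    trip_count is None when the trip count is unknown at compile time.
    Returns "flatten", "pipeline", or "parallel".
    """
    small = trip_count is not None and trip_count < UNROLL_TRIP_COUNT_LIMIT
    if small and not has_loop_carried_dep and not has_non_unrollable_subloop:
        return "flatten"   # step 4: fully unroll the loop
    if has_loop_carried_dep or has_non_unrollable_subloop:
        return "pipeline"  # step 5, conditions 2 and 3 (assumed precedence)
    return "parallel"      # step 5, condition 1: explore partial-unroll factors
```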
The above two transformations cover the remaining part of feature #2. Finally, we apply step 6 to cover feature #3.
6. On-chip buffer reorganization: We finally apply memory coalescing to reorganize the on-chip buffer (lines 28-31). We analyze the data type to determine the minimal bit-width, and always set the maximal bit-width to 512 bits since this is the maximum supported by the experimental platform. In addition, we only consider power-of-two bit-widths as DSE candidates, because HLS tools round BRAM sizes up to a power of two. As a result, this reduced design space can still cover the optimal solution in the original design space.
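For illustration, the bit-width candidates of step 6 could be enumerated as in the sketch below. The helper name is hypothetical, and rounding a non-power-of-two element width up to the next power of two is an assumption made here for completeness.

```python
MAX_BIT_WIDTH = 512  # maximal bit-width of the experimental platform


def bitwidth_candidates(element_bits, max_bits=MAX_BIT_WIDTH):
    """List power-of-two buffer bit-widths from the element width up to max_bits.

    Assumption: an element width that is not a power of two is rounded up.
    """
    if element_bits < 1 or element_bits > max_bits:
        raise ValueError("element width must be within [1, max_bits]")
    width = 1
    while width < element_bits:
        width *= 2
    candidates = []
    while width <= max_bits:
        candidates.append(width)
        width *= 2
    return candidates
```

For a 32-bit data type this gives five candidates (32, 64, 128, 256, and 512 bits) instead of the 481 integer widths between 32 and 512.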
By applying the above code transformations, we are able to generate a transformed kernel code with the CPP microarchitecture and a design space. As can be seen in Code 1, the design space of the NW example has roughly design points. Therefore, an efficient DSE component is essential for the AutoAccel framework.
5.3 Design Space Exploration
The DSE flow of AutoAccel, as shown in Fig. 6, is implemented using OpenTuner, an open-source framework for building domain-specific program tuners. The OpenTuner runtime has a search technique library that contains a collection of machine learning algorithms to cover as many customized tuning problems as possible. To combine all search techniques, OpenTuner adopts a multi-armed bandit algorithm as a meta technique that judges the effectiveness of each search technique and allocates design points according to the judgment. Specifically, a search technique that efficiently finds high-quality design points is rewarded and allocated more design points. In contrast, a technique that performs poorly on high-quality design point discovery is allocated fewer points and eventually disabled. By harnessing OpenTuner, our DSE flow is able to identify the best design point efficiently.
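To make the idea of bandit-based allocation concrete, the following sketch distributes a budget of design-point evaluations among search techniques using the textbook UCB1 rule. It is not OpenTuner's meta technique, whose details differ; the function name, the reward convention (a value in [0, 1] per evaluation), and the absence of a disabling mechanism are all simplifying assumptions.

```python
import math


def ucb1_allocate(techniques, evaluate, budget):
    """Allocate `budget` evaluations among techniques with UCB1.

    techniques: list of technique names.
    evaluate:   callable(name) -> reward in [0, 1] for one proposed design point.
    Returns a dict mapping each technique to the number of points it received.
    """
    pulls = {t: 0 for t in techniques}
    total_reward = {t: 0.0 for t in techniques}
    for step in range(1, budget + 1):
        untried = [t for t in techniques if pulls[t] == 0]
        if untried:
            choice = untried[0]  # try every technique once first
        else:
            # mean reward plus an exploration bonus that shrinks with use
            choice = max(
                techniques,
                key=lambda t: total_reward[t] / pulls[t]
                + math.sqrt(2.0 * math.log(step) / pulls[t]),
            )
        total_reward[choice] += evaluate(choice)
        pulls[choice] += 1
    return pulls
```

A technique that keeps producing good design points accumulates a high mean reward and is therefore selected more often, which mirrors the rewarding behavior described above.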
In Figure 6, the model initialization stage first parses the HLS reports of a few design points and generates the values of the design constants. While most values are obtained by running HLS once, the LUT consumption of the standalone logic of a loop iteration (the corresponding term in Eq. 9) is calculated via two HLS reports. In detail, we run HLS twice with two consecutive unroll factors of a loop to calculate the increment of the LUT consumption. This increment is the LUT consumption of the loop’s standalone logic. Next, we analyze the kernel source code to 1) establish the architecture hierarchy, and 2) fetch the design parameters and their value ranges from the auto pragmas inserted during design space establishment (Section 5.2). After the model is initialized, we simply feed the parameter sets of the remaining design points to the model and collect the performance and resource estimations.
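The two-report calculation described above amounts to a simple difference, sketched below. The function name and the (unroll factor, LUT count) tuple format are hypothetical; the LUT counts would come from parsing the HLS reports, which is not shown.

```python
def standalone_lut_per_iteration(report_a, report_b):
    """Estimate the LUTs used by one loop iteration's standalone logic.

    Each report is an (unroll_factor, lut_count) pair taken from an HLS run.
    The two unroll factors must be consecutive, as described in the text.
    """
    factor_a, lut_a = report_a
    factor_b, lut_b = report_b
    if factor_b != factor_a + 1:
        raise ValueError("unroll factors must be consecutive")
    return lut_b - lut_a
```

For example, if the loop consumes 1,500 LUTs at unroll factor 2 and 1,900 LUTs at unroll factor 3, the standalone logic of one iteration is estimated at 400 LUTs.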
6 Experimental Evaluation
In this section we first describe our experimental setup, including hardware platform, software environments, and benchmarks. Then we evaluate AutoAccel by analyzing the results of design space exploration (DSE), analytical model, and overall performance and energy efficiency.
6.1 Experimental Setup
The evaluation of AutoAccel is performed on a mainstream PCIe-based CPU-FPGA platform with the Xilinx SDAccel design flow. Table 3 lists the detailed hardware and software configuration. A Xeon CPU is connected with a Xilinx Virtex-7 FPGA board through the PCIe interface. For a fair comparison, both the CPU and the FPGA fabric were launched in 2012. On top of the platform hardware, we use Xilinx SDAccel to provide a hardware-software co-design environment.
Table 4 lists the benchmarks used in our experiment. We use MachSuite, a benchmark suite that contains a broad class of computational kernels programmed as C functions for accelerator study, to evaluate the AutoAccel framework. For each kernel, MachSuite provides at least one implementation that is based on a commonly used algorithm in software programming, which makes it a natural fit for demonstrating AutoAccel.
Table 3: Hardware and software configuration.

| Component | Configuration |
|---|---|
| Host CPU Model | Intel Xeon E5-2420 @ 1.9GHz (released in 2012) |
| Host Memory | 64GB DDR3-1600 |
| FPGA Fabric | Xilinx Virtex-7 @ 200MHz (released in 2012) |
| Device Memory | 8GB DDR3-1600 (Max Band.: 12.8GB/s) |
| CPU-FPGA Interface | PCIe Gen3 x8 (Max Band.: 8GB/s) |
| Transformation Flow | Merlin compiler 2017.1 |
| Synthesis Flow | SDAccel 2016.3 |
Table 4: Benchmarks used in the experiments.

| Kernel | Description and Input Information |
|---|---|
| AES | Advanced encryption standard. Input: 256-bit key; 64MB data. |
| VITERBI | Viterbi algorithm. Input: 1M 128-element chains. |
| STENCIL | Stencil computation. Input: a 4096 × 4096 image. |
6.2 Design Space Exploration Evaluation
Fig. 7 illustrates the process of finding the optimal design point using the learning-based DSE approach with the analytical model to evaluate the performance and resource consumption. Thanks to the multi-armed bandit algorithm, the DSE process is able to find the right direction to the optimal solution efficiently, so the DSE time limit is set to only 180 seconds after the model initialization. As can be seen in Fig. 7, the execution cycles drop significantly in the first 20 seconds except for KMP. We analyze the process log of KMP in detail and find that the DSE spends some iterations attempting to improve the performance of the compute stage, because KMP has a relatively large design space inside the compute module. However, the performance of KMP is heavily bounded by memory bandwidth, so reducing the compute latency does not improve the overall performance. Despite this, the DSE process for KMP still converges within the time limit.
Based on the best configuration identified by the DSE, Table 5 presents the performance and resource utilization for each benchmark. (We set 80% as the resource constraint based on the resources available for users.) Note that the C2C metric in the second column is calculated by the following equation:

$$\text{C2C} = \frac{\text{computation cycles}}{\text{communication cycles}}$$
The concept of C2C shares the merits of the CTC ratio used in . We use C2C to analyze whether a design has achieved optimality (). The design is identified as computational bound if C2C is larger than 1; otherwise it is communication bound.
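The classification rule can be written down in a few lines. The sketch assumes that C2C is the ratio of computation cycles to communication cycles, which is consistent with the threshold of 1 stated above; the function names are hypothetical.

```python
def c2c_ratio(compute_cycles, communication_cycles):
    """Ratio of computation cycles to communication cycles (assumed definition)."""
    if communication_cycles <= 0:
        raise ValueError("communication cycles must be positive")
    return compute_cycles / communication_cycles


def classify_bound(c2c):
    """Computation bound if C2C is larger than 1; otherwise communication bound."""
    return "computation" if c2c > 1 else "communication"
```

For instance, a design with 200 computation cycles per 100 communication cycles has a C2C of 2 and is classified as computation bound.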
According to Table 5, the overall performance of AES, SPMV, KMP, and STENCIL is bounded by the off-chip bandwidth, because those four designs need to input or output a large amount of data, so the memory transaction time cannot be hidden by the computation time even if the AutoAccel DSE has successfully found the design point with the largest bit-width. In fact, the memory-bounded design may potentially be further optimized by introducing data reuse analysis. For example,  leverages polyhedral analysis to realize and optimize the data access pattern, and this results in a much lower external memory transaction volume for stencil computation. However, the impact of this kind of transformation cannot be estimated by our analytical model and is beyond the scope of this paper. Future work would extend the model to cover those transformations.
For the other four designs, VITERBI and NW are bounded by LUTs. We can see that the C2C of both designs is higher than 2. This means that the PE in both designs consumes many LUTs, so even though the overall design could still be further optimized by duplicating more PEs, there are no more available LUTs to use.
On the other hand, FFT and GEMM are bounded by BRAM. Since their PE logics are relatively simple, BRAM becomes the major resource bottleneck. In this case, the DSE balances the computation and communication cycles by adjusting the PE number and buffer bit-width. As a result, the BRAM-bounded design has a C2C value larger than but close to 1.
6.3 Analytical Model Evaluation
We conduct two experiments to evaluate the accuracy of the analytical model. The first experiment aims to evaluate whether the model-generated results are consistent with those collected from HLS reports. In detail, we randomly select 20 design points for each benchmark, and compare the performance and resource usage for each design point between the model estimation and HLS report. Table 6 presents the average absolute difference rates for all cases.
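A sketch of the metric used in the first experiment is given below. It assumes that each absolute difference is normalized by the HLS-reported value before averaging; the function name is hypothetical and the inputs would be the per-design-point estimates of one quantity (e.g., cycles or LUTs).

```python
def average_absolute_difference_rate(model_values, hls_values):
    """Mean of |model - hls| / hls over the sampled design points.

    Assumption: the HLS report serves as the reference value.
    """
    if not model_values or len(model_values) != len(hls_values):
        raise ValueError("need two non-empty sequences of equal length")
    rates = [abs(m - h) / h for m, h in zip(model_values, hls_values)]
    return sum(rates) / len(rates)
```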
We can see that the proposed model aligns with the HLS report accurately on performance and BRAM/DSP usage, and results in only moderate differences on LUT/FF usage. The differences are caused by the fact that the HLS tool adopts some resource-efficient implementations for its building blocks when a design requires a large proportion of on-board resources. For example, VITERBI includes a loop statement with initiation interval (II) equal to 40. The hardware circuit for this loop has some 25-to-1 multiplexers to select one floating-point number from 25 numbers. We observe that when the number of PEs in the VITERBI design grows, the HLS tool automatically replaces a fully pipelined multiplexer implementation that consumes over 500 LUTs with an implementation that consumes only 32 LUTs to 1) meet the II=40 restriction, and 2) save on-board resources. Since such dynamic optimization strategies are hard to capture with a static analytical model, a difference of a few percent on LUT/FF usage is inevitable.
The second experiment evaluates the performance difference between the HLS report and the actual on-board result. Table 7 presents the absolute performance difference rate of the optimal design point identified by AutoAccel. We can see that the average difference among all the benchmarks is only 6.2%, which indicates that the cycle estimation from the HLS tool matches the actual on-board execution time for the proposed microarchitecture. Note that the actual frequency of the generated designs does not vary dramatically for the following two reasons. First, Xilinx SDAccel 2016.3 optimizes the timing prior to optimizing other factors, so it might sacrifice resource efficiency (e.g., enlarge II) to preserve the frequency. Second, all of our designs reserve sufficient resources for the tool to avoid strict timing constraints. As a result, the impact of frequency on performance difference is moderate.
In addition, we further analyze the benchmarks with over 10% performance difference, i.e., AES and KMP. We find that such a relatively large difference arises mainly because the accelerator designs for these benchmarks have a very small execution time (10 ms). For these time frames, the start-up and end overheads skew the measured time significantly. In contrast, we observe that the error rate of the model relative to on-board execution is always less than 5% when a design has an execution time of over 100 milliseconds. Hence, the proposed model is able to accurately predict the on-board execution time of a design given that its execution time is tens of milliseconds or larger.
6.4 Performance and Energy Evaluation
We finally evaluate the performance speed-up and energy efficiency improvement of the generated FPGA accelerator designs. Figure 8 compares the performance of the naive implementations of MachSuite, manual HLS designs, and AutoAccel-generated accelerator designs, all of which are normalized to the performance of the corresponding software implementations. We can clearly see that AutoAccel-generated accelerators outperform the naive implementations by 27,000x, indicating that AutoAccel has substantially narrowed the gap from software programs towards high-quality hardware behavioral descriptions. Meanwhile, the AutoAccel-generated accelerators also outperform the software implementations by 72x, indicating that our approach does lead to high-quality accelerator designs.
We can also see from the experimental results that the manual designs only outperform the AutoAccel-generated designs by an average of 2.5x, even after we spent several days to weeks applying more behavioral-level transformations to achieve the optimal performance. In detail, for the benchmarks with C2C<1 (AES, SPMV, KMP, and STENCIL), the generated designs have achieved the same optimal performance as manual designs on the experimental platform, because these benchmarks are all of linear time complexity, and their PEs run faster than the off-chip communication. On the other hand, the performance of the benchmarks with super-linear time complexity (FFT, NW, VITERBI, and GEMM) is bounded by FPGA on-chip resources. As a result, the performance can potentially be further improved by using application-specific accelerator circuits to improve resource efficiency. For example, we use the systolic array microarchitecture to improve the GEMM accelerator design and achieve the optimal performance with all on-chip DSPs. Although such specialized architectures cannot be covered by AutoAccel, AutoAccel still preserves high accelerator quality while substantially improving the FPGA programmability.
Finally, we analyze the energy efficiency gain of AutoAccel-generated designs. We estimate the energy efficiency (performance per watt) of our experiments by considering execution time and thermal design power (TDP). The TDPs of the Intel Xeon CPU and the Xilinx FPGA used in this comparison are 80W and 25W, respectively. Accordingly, AutoAccel-generated designs can achieve up to 1677.9x energy efficiency improvement and 260.4x on average.
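A TDP-based estimate of this kind can be sketched as follows. The sketch assumes that energy is approximated as execution time multiplied by TDP for each device and that only the FPGA's TDP is charged to the accelerated run; the function name is hypothetical, and the sketch does not attempt to reproduce the reported averages.

```python
CPU_TDP_W = 80.0   # Intel Xeon CPU TDP used in the comparison
FPGA_TDP_W = 25.0  # Xilinx FPGA TDP used in the comparison


def energy_efficiency_gain(cpu_time_s, fpga_time_s,
                           cpu_tdp_w=CPU_TDP_W, fpga_tdp_w=FPGA_TDP_W):
    """Performance-per-watt gain of the FPGA design over the CPU baseline.

    Assumption: energy is approximated as execution time times TDP.
    """
    if cpu_time_s <= 0 or fpga_time_s <= 0:
        raise ValueError("execution times must be positive")
    cpu_energy_j = cpu_time_s * cpu_tdp_w
    fpga_energy_j = fpga_time_s * fpga_tdp_w
    return cpu_energy_j / fpga_energy_j
```

Under this approximation the gain equals the speed-up multiplied by 80/25 = 3.2, so a 10x speed-up corresponds to a 32x energy efficiency gain.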
7 Related Work
In this section we discuss related work in the analytical models and the automated frameworks for FPGA design optimization.
Analytical Modeling: Fast performance estimation on FPGAs has become popular in recent years. In general, performance analysis is mainly performed at either IR level [32, 36, 28, 20, 18] or source code level . Since most of the existing work performs analysis without explicitly considering back-end design flow [32, 36, 18, 28, 20], their analysis cannot reflect the optimization done by the commercial tool. On the other hand, similar to this paper,  builds the performance model with the help of the commercial tool, but  provides neither the resource model nor automated code transformation, so users still need to manually change the kernel code while considering the FPGA resource limitation.
Automated Framework: Some projects aim to provide an automated framework to perform code generation and design space exploration [28, 20, 31]. The framework presented by [28, 20] accepts parallel patterns (e.g., map, groupBy, filter, reduce, etc.) and performs FPGA accelerator generation with analytical DSE. Different from this paper, which automatically applies the CPP microarchitecture, the FPGA architecture generated by [28, 20] is composed of predefined hardware components (i.e., memory, controller, and primitive operations) to guarantee efficiency. However, the selection of these components highly depends on the semantic information of user-specified parallel patterns. Furthermore, the performance model for DSE in [28, 20] is built only for the predefined hardware components.
In addition, Melia  is a MapReduce framework that supports automated code generation from user-written C code to OpenCL. Melia asks users to provide the best configuration by leveraging the model from  to generate the OpenCL code. Consequently, Melia only generates the FPGA accelerator design under a MapReduce programming model, and misses automatic design space exploration.
Finally, some frameworks also focus on general-purpose programming languages such as C/C++ [21, 35, 18]. SOAP3  is a framework that analyzes a kernel at the metasemantic intermediate representation (MIR) graph level and transforms it according to the result of design space exploration. However, SOAP3 adopts regression models for resource estimation, so the model is not general enough to cover nonlinear resource consumption. A framework in  uses an analytical model based on HLS results (like this paper) for maximizing throughput given resource constraints. However, they consider only loop pipelining and ignore the design space of coarse- and fine-grained parallelism. Lin-analyzer  is a framework to identify the performance bottleneck for C/C++ programs, but it does not involve code transformation and only focuses on fine-grained parallelism.
8 Conclusion
While FPGA-based heterogeneous architectures are becoming a promising paradigm to provide continued performance and energy improvement in modern datacenters, accelerator programming arises as a serious challenge to application developers. In this paper we propose the AutoAccel framework to provide a nearly push-button experience on mapping C functions into high-quality FPGA accelerator designs. Featuring the CPP microarchitecture, a fast analytical model-based design space exploration, and automatic code transformation, AutoAccel achieves 72x speed-up and 260.4x energy efficiency improvement for a broad class of computation kernels.
Furthermore, we believe that the design principles of AutoAccel can be further generalized to stimulate more research on the adoption of FPGAs in datacenters. For example, the CPP microarchitecture serves as a proof-of-concept that using an accelerator design template as a specification of the program-to-behavioral-description transformation drastically reduces the design space while preserving the accelerator quality. Therefore, more microarchitectures might be added to AutoAccel to improve the coverage of computation kernels. Also, more sophisticated, higher-level code transformations (e.g., loop permutation) could be supported in the future, along with polyhedral analysis, to form a larger design space and create more optimization opportunities.
References
-  Rose Compiler Infrastructure, 2000. http://rosecompiler.org/.
-  Merlin Compiler, 2015. http://www.falcon-computing.com/index.php/solutions/merlin-compiler.
-  Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf, 2015.
-  Amazon ec2 f1 instance, 2016. https://aws.amazon.com/ec2/instance-types/f1/.
-  Intel SDK for OpenCL Applications, 2016. https://software.intel.com/en-us/intel-opencl.
-  Intel to Start Shipping Xeons With FPGAs in Early 2016. http://www.eweek.com/servers/intel-to-start-shipping-xeons-with-fpgas-in-early-2016.html, 2016.
-  SDAccel Development Environment. http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html, 2016.
-  Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. Opentuner: An extensible framework for program autotuning. In PACT, 2014.
-  Pierre Bricaud. Reuse methodology manual: for system-on-a-chip designs. Springer Science & Business Media, 2012.
-  Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In MICRO-49, 2016.
-  J. Cong, P. Li, B. Xiao, and P. Zhang. An optimal microarchitecture for stencil computation acceleration based on nonuniform partitioning of data reuse buffers. TCAD, 2016.
-  J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2011.
-  Jason Cong, Muhuan Huang, Peichen Pan, Yuxin Wang, and Peng Zhang. Source-to-source optimization for HLS. In FPGAs for Software Programmers. Springer International Publishing, 2016.
-  Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. Software infrastructure for enabling fpga-based accelerations in data centers: Invited paper. In ISLPED, 2016.
-  Jason Cong, Wei Jiang, Bin Liu, and Yi Zou. Automatic memory partitioning and scheduling for throughput and power optimization. TODAES, 2011.
-  Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 2008.
-  Álvaro Fialho, Luis Da Costa, Marc Schoenauer, and Michèle Sebag. Analyzing bandit-based adaptive operator selection mechanisms. Annals of Mathematics and Artificial Intelligence, 2010.
-  Xitong Gao, John Wickerson, and George A. Constantinides. Automatically optimizing the latency, area, and accuracy of c programs for high-level synthesis. In FPGA, 2016.
-  Muhuan Huang, Kevin Lim, and Jason Cong. A scalable, high-performance customized priority queue. In FPL, 2014.
-  D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In ISCA-43, 2016.
-  Peng Li, Peng Zhang, Louis-Noel Pouchet, and Jason Cong. Resource-aware throughput optimization for high-level synthesis. In FPGA, 2015.
-  Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 1970.
-  E. Nurvitadhi, Jaewoong Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr. Accelerating recurrent neural networks in analytics servers: Comparison of fpga, cpu, gpu, and asic. In FPL, 2016.
-  Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. Sda: Software-defined accelerator for largescale dnn systems. In Hot Chips, 2014.
-  Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S Chung. Toward accelerating deep learning at scale using specialized hardware in the datacenter. In Hot Chips, 2015.
-  Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
-  Louis-Noel Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-based data reuse optimization for configurable computing. In FPGA, 2013.
-  Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating configurable hardware from parallel patterns. In ASPLOS-XXI, 2016.
-  Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In ISCA-41, 2014.
-  Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. Machsuite: Benchmarks for accelerator design and customized architectures. In IISWC, 2014.
-  Z. Wang, S. Zhang, B. He, and W. Zhang. Melia: A mapreduce framework on opencl-based fpgas. TPDS, 2016.
-  Zeke Wang, Bingsheng He, Wei Zhang, and Shunning Jiang. A performance analysis framework for optimizing opencl applications on fpgas. In HPCA-22, 2016.
-  Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In FPGA, 2015.
-  Peng Zhang, Muhuan Huang, Bingjun Xiao, Hui Huang, and Jason Cong. CMOST: A system-level fpga compilation framework. In DAC-52, 2015.
-  G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar. Lin-analyzer: A high-level performance analysis tool for fpga-based accelerators. In DAC-53, 2016.
-  Guanwen Zhong, Alok Prakash, Siqi Wang, Yun Liang, Tulika Mitra, and Smail Niar. Design space exploration of fpga-based accelerators with multi-level parallelism. In DATE, 2017.
-  Hamid Reza Zohouri, Naoya Maruyama, Aaron Smith, Motohiko Matsuda, and Satoshi Matsuoka. Evaluating and optimizing opencl kernels for high performance computing with fpgas. In SC, 2016.
-  Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason Cong. Improving polyhedral code generation for high-level synthesis. In CODES+ISSS, 2013.