High-Level Synthesis (HLS) fostered a revolution in hardware design. HLS frameworks allow the specification of hardware components in languages such as C, C++, or SystemC. As opposed to traditional Register Transfer Level (RTL) approaches, HLS flows do not require detailed descriptions of the logic gates, memory elements and interconnects comprising hardware implementations. Instead, these are automatically generated, based on the high-level specifications and on a set of directive values specifying optimizations such as the unrolling factor of loops and the inlining of functions. By decoupling specification from implementation, HLS allows unprecedented productivity, leading to considerable reductions in non-recurring engineering costs.
Nonetheless, while HLS makes it easy to define vast design spaces for a given hardware specification, determining the performance (latency) and resource requirements (area, power) of each implementation still requires time-consuming syntheses. The number of possible implementations of a design grows exponentially with the number of applied directives, while, in general, only a few of them are Pareto-optimal from a performance/resources perspective. Exhaustive explorations are therefore wasteful (since only Pareto implementations are of interest) and impractical beyond very simple cases.
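To make the notion of Pareto-optimality concrete, the following minimal sketch extracts the non-dominated implementations from a set of (latency, area) pairs, assuming both objectives are to be minimized. The function name `pareto_front` and the sample values are illustrative assumptions and are not taken from DB4HLS.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency, area), both to be minimized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing latency."""
    front: List[Point] = []
    best_area = float("inf")
    # After sorting by (latency, area), a point is non-dominated only if
    # its area is strictly smaller than every area seen so far.
    for latency, area in sorted(set(points)):
        if area < best_area:
            front.append((latency, area))
            best_area = area
    return front


# Illustrative values only (not synthesis results).
points = [(10, 500), (20, 300), (20, 400), (30, 300), (40, 100), (15, 600)]
front = pareto_front(points)
```

Here three of the six points survive; the others are dominated because another implementation is at least as good in both latency and area.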
Various strategies, which we summarize in Section II, have been proposed to identify (or approximate) the set of Pareto-optimal implementations while minimising the number of synthesis runs [LiuJun13] [FerrettiJan18] [ferretti2018lattice]. This problem is named HLS-driven Design Space Exploration (DSE). The proposed DSE strategies are typically validated against exhaustive explorations, which the authors performed ad hoc. Moreover, works such as [wang20Mar] [Ferretti20Oct] rely on prior knowledge to steer the HLS exploration process. Performing the huge number of syntheses required for validation, or for generating a high-quality knowledge base, entails a very high effort, which at present must be repeated from scratch when investigating the performance of a novel DSE methodology.
Against this backdrop, we introduce DB4HLS, a database of high-level synthesis design space explorations. The database comprises more than 100000 design points, reporting the synthesis outcomes of exhaustive explorations performed on 39 designs from the MachSuite [reagen2014machsuite] benchmark suite. In addition, we provide a simple domain-specific language to define design spaces, resulting in an open infrastructure that can be enriched by further contributions from the research community.
We believe that, by providing standardized synthesis data sets, our effort will allow easier comparisons among DSE strategies, enabling fairer evaluations of the strengths and weaknesses of each approach. It will also facilitate the development and assessment of future design exploration frameworks, spurring research in this challenging field.
II Related Works
State-of-the-art DSE frameworks for HLS follow three main approaches. Black-box methodologies aim, after an initial phase, at iteratively refining explorations by smartly selecting additional design points. To this end, they employ unsupervised learning strategies such as clustering [FerrettiJan18] [LiuJun13], lattice traversing [ferretti2018lattice] and response surface models [XydisOct14]. Model-based strategies, on the other hand, estimate performance and resource requirements of implementations by developing an analytical formulation of the effect of directives when applied to a design. Typically, they can approximate the Pareto set of best-performing implementations well with few syntheses, but are restricted in the type of targeted optimizations (e.g., loop unrolling and dataflow in [ZhongDec14]). The authors of all these works adopt as figure of merit either the Hypervolume or the Average Distance from Reference Set (ADRS) for validation, and both require the computation of true Pareto frontiers from exhaustive explorations. Recently, a promising research avenue has focused, instead, on exploiting prior knowledge in order to perform Design Space Exploration in hardware design. These works [wang20Mar] [Ferretti20Oct] leverage the availability of a comprehensive knowledge base, such as the one we describe in our paper, to achieve exploration results close to those of model-based strategies while being much more flexible in the number and type of supported directives.
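As an illustration of why ADRS depends on a true Pareto frontier, the sketch below implements one formulation of the metric that is common in the HLS DSE literature: for each reference point, the smallest relative worsening (in latency or area) over the approximate set is taken, and these values are averaged. The exact variant used by each cited work is not stated here, so this formulation, the function name `adrs` and the sample values should all be read as assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency, area), both to be minimized


def adrs(reference: List[Point], approximate: List[Point]) -> float:
    """Average distance of an approximate front from the reference front.

    Assumed formulation: for each reference point, take the minimum over
    the approximate points of max(0, relative latency gap, relative area gap).
    """
    total = 0.0
    for ref_latency, ref_area in reference:
        total += min(
            max(
                0.0,
                (latency - ref_latency) / ref_latency,
                (area - ref_area) / ref_area,
            )
            for latency, area in approximate
        )
    return total / len(reference)


# Illustrative values only (not synthesis results).
reference_front = [(10, 500), (20, 300), (40, 100)]
approximate_front = [(10, 500), (40, 100)]
score = adrs(reference_front, approximate_front)
```

An ADRS of zero means that every reference point was found. In the example, the missed point (20, 300) is best approximated by (10, 500), which is about 67% larger in area, so the average over the three reference points is about 0.22.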
While benchmark suites dedicated to hardware design are available, such as CHStone [hara2008chstone], MachSuite [reagen2014machsuite], Rosetta [zhou2018rosetta] and S2CBench [schafer2014s2cbench], they only provide specifications (in the form of C/C++ code) as benchmarks. Conversely, our DB4HLS suite offers rich and well-defined design spaces and related synthesis outcomes, greatly easing the burden of performing comparative evaluations of exploration methodologies. To the best of our knowledge, this is the first database of HLS implementations made publicly available with the intent of standardizing the evaluation process and providing a source of knowledge for ML strategies.
III Available design space explorations
We provide a rich set of DSEs by targeting the benchmarks of the MachSuite collection of designs [reagen2014machsuite]. We have performed DSEs for 39 out of 50 functions in the benchmark suite, discarding those having a variable latency due to input-dependent control flows, and those having very small design spaces. The considered functions present on average 40 lines of code, with the biggest having 308 lines of code.
We performed an exhaustive exploration of each design, according to the configuration space defined by the user, running more than 100000 syntheses. Table I lists all the designs explored and their configuration space size.
We used Vivado HLS [VivadoHLS] version 2018.2 to perform the syntheses, and we targeted a ZynqMP Ultrascale+ (xczu9eg) FPGA chip, with a target clock of .
To restrain the sizes of the design spaces, we have constrained the values of directives with a numerical range (e.g., the unrolling factor) to powers of two or integer divisors of the maximum admissible value (e.g., the number of loop iterations). Moreover, for some designs, different optimizations are forced to have the same values when such a choice intuitively leads to better cost/performance trade-offs (e.g., binding the loop unrolling factor to the array partitioning one).
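The two value restrictions described above can be sketched as follows. The helper names `powers_of_two` and `integer_divisors` are hypothetical, and the 12-iteration loop is an illustrative example rather than one of the explored designs.

```python
from typing import List


def powers_of_two(max_value: int) -> List[int]:
    """All powers of two from 1 up to max_value (inclusive)."""
    values = []
    factor = 1
    while factor <= max_value:
        values.append(factor)
        factor *= 2
    return values


def integer_divisors(max_value: int) -> List[int]:
    """All positive integers that divide max_value exactly."""
    return [d for d in range(1, max_value + 1) if max_value % d == 0]


# Candidate unrolling factors for a hypothetical loop with 12 iterations.
unroll_pow2 = powers_of_two(12)
unroll_divisors = integer_divisors(12)
```

Either restriction shrinks the twelve admissible unrolling factors of this loop to a handful of candidates (four powers of two, or six divisors), which is what keeps the Cartesian product over all directives tractable.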
Even when considering these constraints, the data collection required more than 4 years of single-core machine time. To speed up this process, GNU Parallel was adopted to collect synthesis results from 60 parallel Vivado HLS instances, allowing us to populate the database in approximately 25 days of wall-clock time.
IV DB4HLS infrastructure
In addition to the DSE data, the DB4HLS framework offers a) a database infrastructure hosting DSEs in a structured and easy-to-access way, b) a domain-specific language used to describe a configuration space for a target design, and c) an interface to generate new explorations and further enrich the database. The remainder of this section describes these further contributions in detail.
IV-A A database for DSEs
The database structure, implemented in MySQL, comprises a description of the design targeted for exploration (top part of Figure 1), and that of the explored HLS optimizations applied to each design (middle part of Figure 1). Finally, it reports the resource and performance results obtained by synthesis (bottom part of Figure 1). Each of these components is described in more detail in the following.
Similarly to the taxonomy adopted in MachSuite [reagen2014machsuite], applications are identified by the benchmark they belong to (e.g., aes), by the algorithm they realize (e.g., aes256_encrypt) and by the design implementing that algorithm. As an example, two variants are provided by MachSuite for the aes256_encrypt algorithm (one using lookup tables to store encryption keys and one generating the values online), each corresponding to a separate design specified as C++ code.
Descriptors of the HLS optimizations considered for the DSEs are stored as entries in the configuration space table. Multiple explorations (hence, rows in the configuration space table) for the same design are possible, corresponding to different choices of optimizations, or explorations targeting different tools/FPGAs, or even contributions from different researchers. An entry in the configuration space table is linked to many entries of the configuration table, where each entry indicates a specific element of the design space.
A line in the configuration table (that indicates the set of HLS optimizations defining a design space element) is linked to an entry in the implementation table. Furthermore, the synthesis information table provides additional information on each performed synthesis: the synthesis timestamp, the contributor that originated the data, the employed synthesis tool and version, and the targeted FPGA. Finally, each implementation links to one or more entries in the resources and performance tables, which report the synthesis outcomes. Resources are expressed as employed Flip-Flops, Look-Up Tables, Block RAMs (BRAM) and DSP blocks, while performances are reported in terms of effective latency.
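The relationships among these tables can be sketched with a small in-memory example. This is not the DB4HLS schema: the table and column names below are hypothetical, SQLite (via Python's standard `sqlite3` module) stands in for MySQL so that the sketch is self-contained, the synthesis information table is omitted, and all inserted values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema mirroring the links described in the text.
conn.executescript("""
CREATE TABLE design (
    id INTEGER PRIMARY KEY, benchmark TEXT, algorithm TEXT, name TEXT);
CREATE TABLE configuration_space (
    id INTEGER PRIMARY KEY, design_id INTEGER REFERENCES design(id),
    descriptor TEXT);
CREATE TABLE configuration (
    id INTEGER PRIMARY KEY,
    space_id INTEGER REFERENCES configuration_space(id), directives TEXT);
CREATE TABLE implementation (
    id INTEGER PRIMARY KEY,
    configuration_id INTEGER REFERENCES configuration(id));
CREATE TABLE resource_result (
    implementation_id INTEGER REFERENCES implementation(id),
    ff INTEGER, lut INTEGER, bram INTEGER, dsp INTEGER);
CREATE TABLE performance_result (
    implementation_id INTEGER REFERENCES implementation(id),
    latency INTEGER);
""")

# Made-up rows: one design, one configuration space, two configurations.
conn.execute(
    "INSERT INTO design VALUES (1, 'aes', 'aes256_encrypt', 'example_design')")
conn.execute("INSERT INTO configuration_space VALUES (1, 1, 'example descriptor')")
conn.executemany("INSERT INTO configuration VALUES (?, ?, ?)",
                 [(1, 1, "unroll=1"), (2, 1, "unroll=2")])
conn.executemany("INSERT INTO implementation VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany("INSERT INTO resource_result VALUES (?, ?, ?, ?, ?)",
                 [(1, 1000, 2000, 4, 0), (2, 1500, 2600, 4, 0)])
conn.executemany("INSERT INTO performance_result VALUES (?, ?)",
                 [(1, 800), (2, 450)])

# Walk the chain design -> space -> configuration -> implementation -> results.
rows = conn.execute("""
    SELECT c.directives, r.lut, p.latency
    FROM design d
    JOIN configuration_space s ON s.design_id = d.id
    JOIN configuration c ON c.space_id = s.id
    JOIN implementation i ON i.configuration_id = c.id
    JOIN resource_result r ON r.implementation_id = i.id
    JOIN performance_result p ON p.implementation_id = i.id
    WHERE d.name = ?
    ORDER BY p.latency
""", ("example_design",)).fetchall()
```

The query returns one (directives, LUT count, latency) row per implementation of the selected design, which is the kind of data a DSE strategy would consume when being validated against an exhaustive exploration.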
IV-B A domain-specific language for DSEs
Generating the different configurations associated with a DSE is a tedious and error-prone process when performed by hand. We therefore developed a Domain-Specific Language (DSL) to automatically and concisely define configuration spaces by employing Configuration Space Descriptors (CSDs).
Each line of a descriptor encodes a knob, which comprises a directive type, a label corresponding to its location in the design C/C++ code, and one or multiple sets of values. The number of sets is equal to the number of parameters required by the directive type. Values can be numerical when expressing optimizations such as loop unrolling or array partitioning factors, or categorical when determining the type of employed FPGA resources such as BRAM types. A shorthand is provided for expressing regular value series (e.g., a succession of power-of-two values). Finally, we provide a @bind decorator, which constrains the values associated with different directives.
Figure 2 shows, for the last_step_scan function (Snippet 1), an example of a descriptor written in the DSL to define its configuration space (Snippet 2). The DSL descriptor defines seven different knobs. Lines 1 and 2 of Snippet 2 show two knobs associating a dual-port BRAM to the input arrays bucket and sum, respectively. Lines 3 and 4 define knobs specifying the array_partitioning directive. These directives are created as combinations of partitioning strategies and partitioning factors. Both lines 3 and 4 combine two partitioning strategies (cyclic and block) with the associated set of directive values for the partitioning factors: all the powers of two from 1 up to 512 for knob 3, and all the powers of two from 1 up to 128 for knob 4. Lines 5 and 6 then define, for loop_1 and loop_2, the associated sets of unrolling factors to consider during the exploration: all the powers of two from 1 up to 128 and up to 16, respectively. Lines 4 and 5 both carry a binding decorator (@bind_a), which specifies that the array partitioning directive and the unrolling one must have the same partitioning and unrolling factor in all the configurations described by the CSD. Finally, line 7 defines the target clock.
The configuration space resulting from a DSL descriptor having $N$ different knobs is the Cartesian product of all knob values: $CS = K_1 \times K_2 \times \dots \times K_N$, where $K_i$ is the directive values set related to knob $i$, taking into account the restrictions imposed by the bind decorator. In case of directives requiring multiple parameters, the knob is itself the Cartesian product of the sets of values associated with the knob. Lastly, the total number of configurations, i.e., the configuration space size, is given by its cardinality ($|CS|$).
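The construction above can be sketched in Python for the seven knobs described for last_step_scan. This is not the DSL itself, whose syntax is given in Snippet 2: the dictionary keys, the helper `pow2_up_to` and the clock placeholder are hypothetical, and the binding is modelled under the assumption that the two bound knobs simply share a single factor.

```python
from itertools import product


def pow2_up_to(upper: int) -> list:
    """Powers of two from 1 up to upper (inclusive when upper is a power of two)."""
    return [1 << k for k in range(upper.bit_length())]


strategies = ["cyclic", "block"]

# Knob 3: array partitioning, strategy x factor (a two-parameter knob).
knob3 = list(product(strategies, pow2_up_to(512)))
# Knobs 4 and 5 are bound: the partitioning factor of knob 4 and the
# unrolling factor of loop_1 always take the same value.
bound_factors = pow2_up_to(128)
# Knob 6: unrolling factors for loop_2.
knob6 = pow2_up_to(16)

configurations = [
    {
        "knob1_resource": "dual-port BRAM",      # single-valued knob
        "knob2_resource": "dual-port BRAM",      # single-valued knob
        "knob3_partition": (strategy3, factor3),
        "knob4_partition": (strategy4, shared),  # bound factor
        "knob5_unroll_loop_1": shared,           # same bound factor
        "knob6_unroll_loop_2": unroll2,
        "knob7_clock": "<target clock>",         # placeholder, single-valued
    }
    for (strategy3, factor3), strategy4, shared, unroll2 in product(
        knob3, strategies, bound_factors, knob6
    )
]

space_size = len(configurations)
unbound_size = len(knob3) * len(strategies) * len(bound_factors) ** 2 * len(knob6)
```

Under these assumptions the sketch enumerates 20 x 2 x 8 x 5 = 1600 configurations, against 12800 if knobs 4 and 5 were left unbound. These counts follow from the value sets listed in the text and are not taken from the database.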
IV-C A framework for parallelizing HLS runs
Figure 3 gives a high-level view of the infrastructure, realized through Bash and Python scripts, which we provide to automate DSEs and commit their outcomes to DB4HLS. Starting from a user-provided design and Configuration Space Descriptor (CSD), configuration files are automatically generated and stored in the database. Then, using GNU Parallel [gnuparallel], a tunable number of instances of the employed HLS tool (we use Vivado HLS for the data collection described in Section III) are concurrently and independently executed, one for each configuration. As synthesis runs terminate, the retrieved performance and resource information is also stored in DB4HLS, and new HLS processes are launched until all configurations have been explored.
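The scheduling pattern described above (a bounded pool of concurrent runs, with results committed as each run terminates) can be sketched as follows. The sketch deliberately swaps GNU Parallel for Python's standard `concurrent.futures` thread pool, and `run_synthesis` is a hypothetical placeholder: a real flow would launch the HLS tool for the given configuration and parse its reports.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def run_synthesis(config_id: int) -> Dict:
    """Hypothetical placeholder for one HLS run on one configuration."""
    # A real implementation would invoke the HLS tool here and extract
    # latency and resource figures from its reports.
    return {"config_id": config_id, "latency": None, "resources": None}


def explore(config_ids: Iterable[int], max_parallel: int,
            store: Callable[[Dict], None]) -> None:
    """Run all configurations with at most max_parallel concurrent runs."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_synthesis, cid) for cid in config_ids]
        # Commit each outcome as soon as its run terminates; the pool
        # starts a queued run whenever a worker becomes free.
        for future in as_completed(futures):
            store(future.result())


# Usage: collect outcomes in a list instead of a database.
outcomes = []
explore(range(10), max_parallel=4, store=outcomes.append)
```

Because runs are independent, the order in which outcomes are stored is not deterministic; only the set of stored configurations is.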
MySQL statements can then be used to retrieve data from the tables in the database and to access the design’s implementations and the associated performance and resources results.
V Case Study
Herein, we showcase two possible uses of DB4HLS. We use the database both to compare the results of two DSE methodologies, and as a source of knowledge for one of them. We employed a lattice-based strategy (LB) from [ferretti2018lattice], and one leveraging prior knowledge (PK) [Ferretti20Oct], to perform DSEs for the local_scan design space available in DB4HLS. Figure 4 reports the results. Grey dots represent the area and latency of the 704 implementations belonging to the local_scan design space provided by DB4HLS. The figure also reports the approximated Pareto fronts retrieved by LB and by PK.
In this scenario, DB4HLS is employed to comparatively evaluate the two strategies, without having to re-run from scratch a large number of time-consuming syntheses. Besides, PK mandates the availability of a set of source design spaces from which to extract prior knowledge. DB4HLS can be effectively employed in this case, or in similar ML-based methods [wang20Mar], to provide the required knowledge base.
DB4HLS offers an extensive set of DSEs targeting functions from MachSuite [reagen2014machsuite]. The data collection is made publicly available and will be updated, increasing the number of design explorations and targeted benchmarks. In addition, further design spaces can be effectively defined through a novel domain-specific language and a framework to efficiently contribute novel explorations to DB4HLS. Both the DB4HLS database and the framework for DSE generation are publicly available at https://www.db4hls.inf.usi.ch/.